Learning in the Absence of Training Data (ISBN 3031310101, 9783031310102)

This book introduces the concept of “bespoke learning”, a new mechanistic approach that makes it possible to generate va


English · 240 [241] pages · 2023


Table of contents:
Foreword
Preface
Acknowledgements
Contents
1 Bespoke Learning to Generate Originally-Absent Training Data
1.1 Introduction
1.1.1 Some Definitions
1.2 Prediction Notwithstanding Unavailability Of Training: Real-world Examples
1.2.1 Gravitational Mass Density in Galaxies
1.2.2 Composition of Rocks in Petrophysics
1.2.3 Temporally-Evolving Systems
1.3 Relevant Ambition
1.4 Bespoke Learning
1.4.1 Training Data Is Absent: Not that Training Data Points Are Missing
1.5 A Wide-Angled View: Bespoke Learning in the Brain?
1.6 Summary
References
2 Learning the Temporally-Evolving Evolution-Driving Function of a Dynamical System, to Forecast Future States: Forecasting New COVID19 Infection Numbers
2.1 Introduction
2.1.1 Time Series Modelling
2.1.1.1 ARIMA, etc.
2.1.1.2 Empirical Dynamical Models
2.1.1.3 Our New Approach and EDM
2.1.2 Why Seek Potential?
2.1.3 Hidden Markov Models and Our Approach
2.1.4 Markov Decision Processes; Reinforcement Learning and Our Approach
2.1.5 A New Way to Forecast: Learn the Evolution-Driving Function
2.1.6 Evolution-Driver, Aka Potential Function
2.2 Learning Scheme: Outline and 3 Underlining Steps
2.2.1 Can the Potential be Learnt Directly Using Observed Phase Space Variables?
2.3 Robustness of Prediction: Extra Information from the 2nd Law
2.4 3-Staged Algorithm
2.4.1 Outline of Bespoke Learning in Step I
2.4.2 Embedding Potential into Support of Phase Space pdf: Part of Step I
2.4.3 Learning Potential as Modelled with a Gaussian Process: Part of Step II
2.4.4 Rate and Location Variable Computation: Part of Step III
2.5 Collating All Notation
2.6 Details of the Potential Learning
2.6.1 Likelihood
2.6.2 Likelihood Given Vectorised Phase Space pdf
2.6.3 Prior, Posterior and MCMC-Based Inference
2.7 Predicting Potential at Test Time: Motivation
2.7.1 Learning the Generative Process Underlying Temporal Variation of Potential, Following Bespoke Potential Learning
2.7.1.1 Closed-Form Prediction at a New Time Window
2.7.1.2 Errors of Forecasting
2.7.1.3 Advantage of Our Learning Strategy: Reviewed
2.7.2 Improved Modelling of Σcompo?
2.8 Illustration: Forecasting New Rate and Number of COVID19 Infections
2.8.1 Data
2.8.2 What Is ``Potential'' in This Application?
2.9 Negative Forecast Potential, etc.
2.9.1 Implementing the 3-Step Learning+Forecasting in this Application
2.9.2 Few Technical Issues to Note in This Empirical Illustration
2.9.3 Collating the Background
2.9.4 Bespoke Learning of Potential: Results from Steps I and III
2.9.5 Forecasting: Results from Steps II and III
2.9.6 Quality of Forecasting
2.9.6.1 How Far from the Mean?
2.9.6.2 Information Redundancy and Forecasting at the 8-th Time Window
2.9.6.3 Permutation Entropy
2.9.6.4 Why Is the Forecast Bad Around the 2020–2021 Transition?
2.9.7 Work to Be Done
2.10 Summary
References
3 Potential to Density via Poisson Equation: Application to Bespoke Learning of Gravitational Mass Density in Real Galaxy
3.1 Introduction
3.1.1 Motivating Bespoke Learning
3.1.2 A Particularly Difficult Data Set for Bespoke Learning
3.2 Methodology
3.2.1 Potential and Phase Space Probability Density Function
3.2.2 Phase Flow and Boltzmann Equation
3.2.3 So Far this Is What We Know
3.2.4 Why Consider System Potential to be Velocity Independent?
3.2.5 Relevant Abstraction
3.2.6 Centrality of the Potential
3.2.7 Half the Phase Space Coordinates Cannot be Observed
3.2.8 Why Include Only Energy to Recast Support of fW(·)?
3.2.9 Probability of Data
3.2.10 Isotropy
3.2.11 Likelihood, Including Acknowledgement of Measurement Uncertainties
3.2.12 Ingredients of Inference, Following Likelihood Identification
3.2.13 In the Absence of Training Data
3.2.14 ρ and f Vectors
3.2.15 Computing Potential at a Given R, Given Vectorised Gravitational Mass Density
3.2.16 Model in Light of Vectorised Potential and Phase Space Pdf
3.2.17 Convolving with Error Density
3.2.18 Priors
3.2.19 Inference on f and ρ
3.2.20 How Many Energy Partitions? How Many Radial Partitions?
3.2.20.1 Binning the Radial Range
3.2.20.2 Binning the Energy Range
3.2.21 Inference Using MCMC
3.2.22 Wrapping Up Methodology
3.3 Empirical Illustration on Real Galaxy NGC4649
3.3.1 Learning the ρ(R) Function and pdf f(ε)
3.3.1.1 Predicting from the Learnt ρ(·)
3.3.1.2 Gravitational Mass Enclosed Within a Radius
3.3.1.3 Details of Learning and Prediction
3.3.1.4 Predicting Upon Learning the Phase Space pdf
3.4 Conclusion
3.4.1 Testing for Isotropy in the Data
3.4.2 Working with a Multivariate Phase Space pdf
3.4.3 Summary
References
4 Bespoke Learning in Static Systems: Application to Learning Sub-surface Material Density Function
4.1 Introduction
4.2 Bespoke Learning, Followed by 2-Staged Supervised Learning
4.2.1 Bayesian Implementation of Bespoke Learning
4.2.1.1 Learning Y at Given Values of X, Using Data on W
4.2.1.2 How Can W Inform on the Sought Y?
4.2.1.3 Motivating a Model for the Likelihood
4.2.1.4 Posterior and Inference from It
4.2.2 What if a Different Likelihood?
4.2.3 Dual-Staged Supervised Learning of g(X)(=Y)
4.3 Application to Materials Science
4.3.1 Generating the Originally-Absent Training Data
4.3.2 Underlying Stochastic Process
4.3.3 Details of Existing Bespoke Learning
4.4 Learning Correlation of Underlying GP and Predicting Sub-surface Material Density
4.4.1 Non-parametric Kernel Parametrisation
4.4.2 Predictions
4.4.3 Forecasting
4.5 Conclusions
References
5 Bespoke Learning of Disease Progression Using Inter-Network Distance: Application to Haematology-Oncology: Joint Work with Dr. Kangrui Wang, Dr. Akash Bhojgaria and Dr. Joydeep Chakrabartty
5.1 Introduction
5.2 Learning Graphical Models and Computing Inter-Graph Distance
5.2.1 Learning the Inter-Variable Correlation Structure of the Data
5.2.2 Learning the SRGG
5.2.3 Inference on the SRGG
5.2.4 Distance Between Graphical Models
5.3 Learning Relative Score Parameters Using Learnt Graphical Models
5.3.1 Details of Score Computation
5.4 Application to Haematology-Oncology
5.4.1 Learning of the Relation Between VOD-Score and Pre-transplant Variables and Prediction
5.4.2 Which Pre-transplant Factors Affect SOS/VOD Progression Most?
5.5 Summary
References
A Bayesian Inference by Posterior Sampling Using MCMC
A.1 Bayesian Inference by Sampling
A.1.1 How to Sample?
A.1.2 MCMC
A.1.3 Metropolis-Hastings
A.1.4 Gibbs Sampling
A.1.5 Metropolis-Within-Gibbs
Index

Citation preview

Dalia Chakrabarty

Learning in the Absence of Training Data


Dalia Chakrabarty Department of Mathematics Brunel University London Uxbridge, Middlesex, UK

ISBN 978-3-031-31010-2 · ISBN 978-3-031-31011-9 (eBook)
https://doi.org/10.1007/978-3-031-31011-9

© Springer Nature Switzerland AG 2023

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG. The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland.

“yadihāsti tadanyatra”? –Ved Vyas inspired Ma aar Baba.

Foreword

Predictive modelling is the goal of much of data science and machine learning. However, it relies on having representative data that links the inputs of interest to the target outputs. Even when we are fortunate enough to have such data, success is still not guaranteed. A model trained on the available data may not be genuinely applicable to the test scenarios we care most about, owing to drift in the generative process over time, or to plain qualitative differences between the training and test environments. It is this transfer learning challenge that fuels much of the recent talk on robust and reliable artificial intelligence. It is also manifest in the need for explainable models that would allow us to better trust a predictive system pushed out of the comfort zone of the historical data on which it was trained. Ultimately, when deploying a predictive system “in the wild”, we are betting that not much extrapolation is taking place. On what grounds, however, can we claim that? It would be naive to assume that information about the specificities of the problem at hand is unnecessary, or that a black-box, purely data-driven approach is guaranteed to succeed. In this volume, Dr Dalia Chakrabarty takes this challenge one step further: perhaps we should be more transparent in admitting that, for some predictive problems, the data is simply not there. Our input variables of interest may be hard to measure, owing to technological or economic constraints, or because of disparities between training and test regimes. This links to more classical problems, such as regression with measurement error, inverse problems, latent variable modelling and causal inference.
However, the point of view introduced here cuts more directly to the central issue: that, as data scientists, we should tap into the structural properties of the problem at hand and design how information can flow between the missing and target variables. There are links, for instance, to the ideas of instrumental variables in causal learning and measurement-error problems, and to the use of temporal information to disentangle latent variables in non-linear independent component analysis. There is a need to cast a more unifying view of such ideas.


From this, the concept of bespoke learning emerges. Unlike supervised learning, we are not simply handed explicit data linking input and output signals, along with the convenient assumption that training and test environments are exchangeable. Unlike unsupervised learning, there is here far more structure to be exploited and a more targeted question, particularly as we drive inference based on design choices. Unlike reinforcement learning, we are interested in more than a black-box function that maximises rewards: we are interested in learning the structure of the mapping between signals. The framing also puts the science of the domain front and centre: tapping into her experience as an astrophysicist, Dr Chakrabarty understands that structural constraints on the data-generating process are necessary to warrant extrapolations over time or across environmental changes. Numerous and timely examples of this concept are presented. I invite the reader to explore the roadmap of bespoke learning that this manuscript introduces, and to rethink the way their own analyses are carried out.

London, UK
February 2022

Ricardo Silva

Preface

The method of Science is often manifest in the pursuit of the relation between a pair of associated random structures, say X and Y, and such an endeavour is sustained by a training set that comprises pairs of values of these variables. This book is motivated by the difficulty that arises when such a training set is not accessible, owing to ignorance of the values that Y attains at a chosen value of X. How, then, would we learn the functional relationship between X and Y? Indeed, the absence of such empirical information appears to disqualify any ambition for supervised learning of the functional relation between such a pair of random structures, which we know to be mutually associated. However, such conceivable disappointment is naturally mitigated if we can manage to generate values of one of the two relevant variables at a chosen or designed value of the other. Such an aim constitutes the thematic focus of this book. Said “generation” of the (originally-absent) values of Y, at each of N design values of X, is referred to generically as the bespoke learning of Y. This problem of an unavailable Y at any X value is encountered in multiple real-world situations in which learning of the relationship between X and Y is the ambition.
I discuss this problem of absent training data, and solutions to the same, to enable:

– Forecasting of new infection numbers of a pandemic (such as COVID-19)
– Learning of the gravitational mass density of a real galaxy, to thereby permit astronomers to quantify the fraction of dark matter in this system
– Non-destructive learning of the material density function of a lab-grown material sample, to then permit prediction of this density at any sub-surface location within the bulk of this material sample
– Reliable learning of a parametrisation of the progress of a terminal disease, given variably long time series of condition parameters of different recipients of bone marrow transplants, to then permit supervised learning of the relation between this disease progression parameter and predisposing, procedural, and treatment parameters


In fact, there are many other practical applications that suffer from the ramifications of the absence of training data, including those in Neuroscience, Finance, Petrophysics, Healthcare, Econometrics, composite index formulation, etc. The motivation behind coining the term “bespoke learning” is primarily to distinguish such an endeavour from the three other forms of learning that are commonly undertaken, namely supervised, unsupervised and reinforcement learning. None of these appears to correctly describe the exercise that is attempted here. The distinction is clarified in Chap. 1. At the same time, a personal bias has contributed to the semantics: the learning here is referred to as bespoke since my original perception was that the generation of values of one of the variables, at a chosen value of the other, is likely to be specific to the system and/or the considered problem. However, as I have journeyed towards finishing this book, that understanding has been somewhat revised, as the possibility emerges of identifying clusters of possible solution methodologies to this general problem of absent training data.
These broadly pertain to three techniques that fall under the rubric of such bespoke learning:

– for a dynamical system, bespoke learning is pursued by tapping into the dynamics intrinsic to the system,
– or by computing the distance between a pair of graphical models/networks learnt for the respective time series data sets that represent the temporal variation of the system,
– while for static systems, designing likelihoods towards parameter learning is shown to be relevant.

Thus, we will soon discuss:

– Bespoke learning of the time and state dependence of the function that causes the evolution of a generic dynamical system—a function that we identify as the potential function of this system—using which we forecast future states within a generalised Newtonian treatment
– Bespoke learning of the structural density function in gravitational/electrical systems, by invoking the underlying system Physics that permits linking this density to the learnt potential function
– Bespoke learning of values of a property of a static system, at design points, by invoking details of the considered system to help construct a likelihood function that abides by all available constraints
– Bespoke learning of a scalar parametrisation of a multivariate time series spanning a variable temporal range, by computing a distance between a learnt pair of graphical models of a pair of such time series data sets—to thereby permit (non-parametric) regression of system parameters on such a bespoke-learnt scalar parameter

While additional bespoke learning techniques surely abound, the above broad classes of methods are forwarded here, in the hope that this is a good start. The general methodology is first discussed in each of the four chapters that follow the introductory one. Thereafter, an empirical illustration of the discussed method is presented within the respective chapter, given real data.

Solutions are put forward for the bespoke learning of the evolution-driving function, aka the “potential function”, at design time points, in generic dynamical systems, via exploitation of the temporal evolution of the phase space pdf. Such learning then enables populating the originally absent training data comprising information on time-potential pairs, and this generated training data ultimately allows for the supervised learning of the temporal variation of the potential function. Indeed, this supervised learning is useful, as it then allows for the forecasting of states that the system attains at a future time point, within a Newtonian paradigm that links system states to the potential. A fully generalised treatment of the potential and system state variable is adopted. This permits an application to the forecasting of the daily new COVID-19 infection numbers in Chap. 2, following the learning of the temporally evolving potential function.

There could exist real-world problems in which the unknown structural build of a dynamical system is sought, and in which this build is directly informed upon by such a potential function. This is in fact relevant for those systems in which the underlying system Physics allows for structure to be deterministically linked to the potential, as in self-gravitating dynamical systems, such as galaxies. Then a construct similar to that introduced in Chap. 2, for bespoke learning of the potential, is invoked again, and the training data is generated to subsequently undertake the supervised learning of the structural function of the system (as well as the pdf of the phase space variables of the system). This is undertaken in Chap. 3.
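The second stage of the forecasting pipeline sketched above, supervised learning of the temporal variation of the potential once bespoke learning has generated time-potential pairs, admits closed-form Gaussian-process prediction. Below is a minimal, purely illustrative Python sketch: the squared-exponential kernel, the hyperparameter values, and the synthetic stand-in "potentials" are assumptions of this example, not the book's actual model.

```python
import numpy as np

def sq_exp_kernel(t1, t2, amp=1.0, ell=2.0):
    # Squared-exponential covariance between two sets of time points
    d = t1[:, None] - t2[None, :]
    return amp**2 * np.exp(-0.5 * (d / ell) ** 2)

def gp_forecast(t_train, y_train, t_test, noise=1e-3, amp=1.0, ell=2.0):
    # Closed-form GP predictive mean and variance at the test times
    K = sq_exp_kernel(t_train, t_train, amp, ell) + noise * np.eye(len(t_train))
    Ks = sq_exp_kernel(t_test, t_train, amp, ell)
    Kss = sq_exp_kernel(t_test, t_test, amp, ell)
    mean = Ks @ np.linalg.solve(K, y_train)
    cov = Kss - Ks @ np.linalg.solve(K, Ks.T)
    return mean, np.diag(cov)

# Hypothetical bespoke-learnt potential values at 8 past time windows
t_train = np.arange(1.0, 9.0)
y_train = np.sin(0.5 * t_train)  # stand-in for the learnt potentials
mean, var = gp_forecast(t_train, y_train, np.array([9.0]))
```

The returned predictive variance quantifies forecast uncertainty at the new time window, echoing the kind of closed-form prediction and forecast errors discussed in Sect. 2.7.1.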
In static systems as well, the problem of absent training data compounds the difficulty of supervised learning of a sought inter-variable functional relationship, given inhomogeneities in the correlation structure of the available data (and high-dimensionality of the sought function, if pertinent). In such situations, the originally absent training data can be generated by recalling details that allow for the learning of the sought structural property of the system at selected inputs, aided possibly by experiments designed to collect new information. All this renders possible a probabilistic solution to a hard, multiple and sequential inverse problem in Materials Science, discussed in Chap. 4. Such bespoke learning of the system property at design inputs is then employed to capacitate the supervised learning of the system property as a function of relevant inputs, as the required training data is then rendered available.

Lastly, we consider a problem that sometimes plagues supervised learning, namely, that the available training data comprises pairs of design points and corresponding values of the output, though different realisations of the output are of disparate “lengths”. In particular, we consider an output that is itself a multivariate time series, such that its distinct realisations are time series that span different temporal ranges. Then the supervised learning of the functional relation between this output and the input variables is difficult; the ultimate aim is learning of the relationship between the input (comprising system parameters) and such an output, followed by identification of the most influential input variables. A solution is advanced in the form of bespoke learning of a scalar-valued score that carries the information contained in any realisation of this output variable, irrespective of its length. In fact, the graphical model of any such realisation is learnt, and the distance between (the posterior probabilities of) a pair of learnt graphical models is advanced as the difference between the scores of the two realisations observed at the two given design inputs. The method is illustrated on an example within Haematology-Oncology, in Chap. 5.

In all such discussion, the aim has been to present the proposed solutions in a way that can be appreciated by researchers involved in pan-disciplinary applications. This was the motivation behind toning down mathematical notation in the presentation, with preference given to running text; the requisite mathematics, though, is present. One last point that I would like to make is that the methodology development discussed in this book, as well as all considered applications, is invariably undertaken within the Bayesian framework, with inference performed using Markov Chain Monte Carlo (MCMC) techniques. This motivates the appendix on MCMC techniques.

Oxfordshire, UK
December 2022

Dalia Chakrabarty
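Since, as the preface notes, all inference in the book is performed by MCMC-based posterior sampling (Appendix A covers Metropolis-Hastings, Gibbs, and Metropolis-within-Gibbs), a minimal random-walk Metropolis-Hastings sketch may help orient the reader. The target density, step size, and chain length below are illustrative assumptions, not any sampler used in the book.

```python
import numpy as np

def metropolis_hastings(log_target, x0, n_samples=5000, step=0.5, seed=0):
    # Random-walk Metropolis sampler for a scalar log-density
    rng = np.random.default_rng(seed)
    x = x0
    lp = log_target(x)
    samples = np.empty(n_samples)
    for i in range(n_samples):
        prop = x + step * rng.standard_normal()     # symmetric proposal
        lp_prop = log_target(prop)
        # Accept with probability min(1, target(prop) / target(x))
        if np.log(rng.uniform()) < lp_prop - lp:
            x, lp = prop, lp_prop
        samples[i] = x                              # repeat x on rejection
    return samples

# Illustrative target: standard normal log-density, up to a constant
draws = metropolis_hastings(lambda x: -0.5 * x**2, x0=0.0, n_samples=20000)
```

Because the accept/reject step compares log-densities, the target need only be known up to a normalising constant, which is what makes such samplers practical for posterior inference.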

Acknowledgements

I am hugely indebted to my collaborators, who have made the different applications and analyses possible. The empirical illustration discussed in Chap. 5 arises from collaborative work undertaken with Dr. K. Wang (then in the Information Engineering Group, Department of Engineering, University of Cambridge), and Dr. Joydeep Chakrabartty and Dr. Akash Bhojgaria (Department of Haematology, Health Care Global EKO Cancer Hospital, Kolkata, India). The material sample that was used for the illustration of the learning of the sub-surface material density function was grown in the laboratory—and imaged with Scanning Electron Microscopes—by Prof. S. Paul (De Montfort University) and members of his group. The data used in the astronomical illustration discussed in Chap. 3 was provided by Dr. Kristine Woodley, when she was affiliated to the University of California Observatories. The ARIMA analysis presented in Chap. 2 was contributed to by Miss N. C. Paul (Exeter College, University of Oxford). Lastly, I would like to take this opportunity to thank my husband and our daughter for all their help and love.



107 107 108 109 110 110 112 113 114 114 116 117 119 120 121 123 124 124 125 128 129 132 136 148 148 149 149 150 153 153 155 157 159 160 166 166 169

xviii

Contents

4.3.3 Details of Existing Bespoke Learning . . . . . . . . . . . . . . . . . . . . . . Learning Correlation of Underlying GP and Predicting Sub-surface Material Density . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.4.1 Non-parametric Kernel Parametrisation . . . . . . . . . . . . . . . . . . . . 4.4.2 Predictions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.4.3 Forecasting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.5 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

170

4.4

5 Bespoke Learning of Disease Progression Using Inter-Network Distance: Application to Haematology-Oncology: Joint Work with Dr. Kangrui Wang, Dr. Akash Bhojgaria and Dr. Joydeep Chakrabartty . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5.2 Learning Graphical Models and Computing Inter-Graph Distance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5.2.1 Learning the Inter-Variable Correlation Structure of the Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5.2.2 Learning the SRGG . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5.2.3 Inference on the SRGG . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5.2.4 Distance Between Graphical Models . . . . . . . . . . . . . . . . . . . . . . . 5.3 Learning Relative Score Parameters Using Learnt Graphical Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5.3.1 Details of Score Computation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5.4 Application to Haematology-Oncology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5.4.1 Learning of the Relation Between VOD-Score and Pre-transplant Variables and Prediction . . . . . . . . . . . . . . . . 5.4.2 Which Pre-transplant Factors Affect SOS/VOD Progression Most? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A Bayesian Inference by Posterior Sampling Using MCMC . . . . . . . . . . . . . . A.1 Bayesian Inference by Sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A.1.1 How to Sample? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A.1.2 MCMC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A.1.3 Metropolis-Hastings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A.1.4 Gibbs Sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A.1.5 Metropolis-Within-Gibbs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

173 174 179 183 186 187

189 189 194 194 197 200 201 203 204 206 211 213 214 216 219 219 220 220 221 223 223

Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 225

Chapter 1

Bespoke Learning to Generate Originally-Absent Training Data

Abstract This chapter motivates the need for attending to the class of problems that pertain to a mismatch between the proposed ambition of supervised learning of the relationship between a pair of variables, and the debilitating information deficiency that sometimes plagues such ambition—in the form of the lack of training data that is a requisite for the sought supervised learning. Multiple examples of such problems are presented, from across disciplines. The only solution that is then relevant entails the learning of values of one variable—out of the relevant pair of variables—at design values of the other. Such learning is introduced as “bespoke learning”, and it is distinguished from supervised, unsupervised and reinforcement learning. Descriptions of the methodologies that accomplish this learning are presented: in generic dynamical systems, to forecast future states by learning the evolution-driving function of the system, or to learn a parametrisation of the output variable that is a variably-long multivariate time series; in static systems, by formulating the likelihood of the unknown system parameters given data, using accessible information. Summaries of the forthcoming empirical illustrations of such methodologies are included.

1.1 Introduction

While at school, we have all experienced fitting curves to a sample of points, {(x1, y1), (x2, y2), . . . , (xn, yn)}, that comprises observations made during an experiment. The aim of such fitting is to identify some parameter(s) of a chosen (parametric) model that is represented by the curve being fit to the observed sample of data points. Here, this chosen model offers a functional link between the variables, X and Y, measured values of which constitute the observed sample of data points. Having identified such originally-unknown parameter(s) of this model, said model is rendered fully specified, such that (s.t.) we can employ this model to predict the value of X (or Y) at a newly-observed value of Y (or X).

A demand of real-world implementation of the curve-fitting exercise is the justification of the relevance of a chosen parametric function as a representation of the functional link between variables X and Y. Such a parametric choice is


© Springer Nature Switzerland AG 2023 D. Chakrabarty, Learning in the Absence of Training Data, https://doi.org/10.1007/978-3-031-31011-9_1


typically questioned, especially when at least one of the variables is high-dimensional, and/or its observed values are distributed s.t. correlations between pairs of values of Y over one interval in 𝒳 are widely different from correlations elsewhere in 𝒳, where X ∈ 𝒳. When such realities of the task at hand show up, learning parameters of a chosen parametric function is abandoned, in favour of seeking knowledge of the shape of the functional link between X and Y, given the available data, i.e. seeking the correlation structure of the curve that we wish to fit. This is addressed by learning this inter-variable functional link as generated by an adequate stochastic process, which conveys to this function information about the correlation structure of the data. In addition, there are multiple other shortcomings that potentially plague the exercise of fitting a chosen parametric functional form to the observed set of data points. These include the addressing of measurement errors in either or both of X and Y.
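When a parametric form is suspect, the functional link can instead be learnt as a realisation of a stochastic process. A minimal Gaussian-process regression sketch follows; the squared-exponential kernel, its length-scale, and the sine-shaped toy data are all our illustrative assumptions, not choices made in this book.

```python
import numpy as np

# Minimal GP regression: the correlation structure of the data, encoded in
# a kernel, shapes the learnt function; no parametric form for Y(X) is imposed.
def rbf_kernel(a, b, length_scale=0.2, variance=1.0):
    d2 = (a[:, None] - b[None, :]) ** 2
    return variance * np.exp(-0.5 * d2 / length_scale**2)

def gp_predict(x_train, y_train, x_test, noise_var=1e-6):
    K = rbf_kernel(x_train, x_train) + noise_var * np.eye(x_train.size)
    K_s = rbf_kernel(x_train, x_test)
    alpha = np.linalg.solve(K, y_train)
    mean = K_s.T @ alpha                                   # posterior mean
    cov = rbf_kernel(x_test, x_test) - K_s.T @ np.linalg.solve(K, K_s)
    return mean, cov                                       # posterior covariance

x_obs = np.linspace(0.0, 1.0, 8)
y_obs = np.sin(2.0 * np.pi * x_obs)                        # toy observations
mean, cov = gp_predict(x_obs, y_obs, np.array([0.5]))
```

The kernel here plays exactly the role described above: it lets correlations in the observed sample, rather than a fixed parametric family, determine the fitted curve.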

In this book however, the primary focus is going to be on addressing a particularly intransigent information deficiency, one that appears to fundamentally incapacitate prediction of the values attained by either variable at a newly-observed value of the other. Providing solutions that allow for prediction—even in such an information-sparse paradigm—is the concern of the following chapters of the book. This information deficiency constitutes the lack of the sample of points {(x1, y1), (x2, y2), . . . , (xn, yn)}, for n ∈ N; if such an observed sample is unavailable, then what do we fit the chosen parametric functional form to, or for that matter, how do we then undertake the learning of the functional relationship between X and Y using a stochastic process? So the possibility of prediction at a newly-observed value of one of the relevant variables appears challenged.

If we possess information on the model of the probability distribution of Y, given X (as in parametric regression), then using such a model (with the regression coefficients estimated), we can in principle predict values of Y that are realised at a given value of X. On the other hand, in this book, interest lies in permitting prediction of either variable when there is no readily available motivation for adopting a parametric model for the probability distribution of one variable, given the other; such situations can sometimes be compounded by the high-dimensional nature of one of the variables. Thus, instead of attempting to model the probability distribution of either variable, the focus in the following couple of chapters is on predicting X (or Y) values that are realised at measured/observed values of Y (or X), by formulating the likelihood of the unknowns given the data—sometimes using information on the (temporal) evolution of the probability density function (pdf) of the observed variable, such that the unknowns are embedded within the definition of the support of this pdf. Here, by the support of the density of a variable is implied the smallest


closed set of values of this variable, s.t. the probability for the variable to take values in this set is 1. Such an approach is presented in the next couple of chapters, as relevant for dynamical systems. Indeed, it is possible that the pdf of the observable, and its temporal evolution, vary with time—and such signatures of non-stationarity are acknowledged and addressed in the forwarded method. On the other hand, for static systems, we address the problem of unavailability of the sample of points {(x1, y1), (x2, y2), . . . , (xn, yn)} by motivating the likelihood, as guided by the (sparse) information that may be available on trends in the comparison between observations and the model. Such is achieved by invoking the bespoke Physics of the system. In such systems, prediction of values of either variable that are realised at any value of the other variable is made possible only if the functional relationship between the two variables is learnt; such functional learning demands generation of that originally-unavailable sample. An illustration of such an exercise is presented in the fourth chapter. On other occasions still, other methods are relevant to the addressing of the system-specific demands of the problem. One such approach is detailed in the final chapter, and illustrated in the application discussed therein. There, a generic methodology is presented to motivate circumventing concerns that may arise with supervised learning, when the output is a multivariate time series, different realisations of which occur over varying temporal intervals. A model-independent, scalar-valued parametrisation of such time series data is learnt—relative to that for another such data set—by computing the distance between the pair of graphical models (or networks) that are learnt given the two data sets.
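The definition of support stated at the top of this passage can be written compactly; the notation supp(f), and the symbol 𝒲 for the range of the variable W with density f, are ours:

```latex
% Support of the density f of a variable W: the smallest closed set
% carrying all of the probability mass.
\operatorname{supp}(f) \;=\; \bigcap \left\{\, S \subseteq \mathcal{W} \;:\; S \text{ is closed},\ \Pr(W \in S) = 1 \,\right\}
```

Embedding the unknowns in the definition of this set is what links the learning problem to the observed pdf.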
Thus, in this book we discuss methodologies that facilitate the learning of Y at a chosen X, when the learning of the functional link between X and Y is unachievable, given the availability status of the data {(xi, yi)}, i = 1, . . . , n. Here, on the basis of the overall discussions, we propose three ways of accomplishing such hitherto unachievable learning of Y:
– by accessing system dynamics, to attain information towards the learning of the function that is causing evolution in the dynamical system;
– by structuring likelihoods in static systems;
– and by learning a model-free parameter that bears the information contained in a time series data—relative to that contained in another, differently-long data—via computation of the distance between graphical models learnt given the data pair.
A major fallout of such learning is then the generation of the very set of pairs of values of X and Y—where such a set was absent prior to our newly undertaken learning of Y at given X values. Once this set is composed, it allows for the supervised learning of the functional relationship between the variables, without us having to impose a parametric choice on this function, which—as we have mentioned above—becomes particularly circumspect when at least one of the variables is high-dimensional, and/or when the data on Y bears inhomogeneities in correlation as we move from one sub-interval in the space of X to another. Indeed, we then look forward to the unravelling of the mechanism that permits this special type of learning of Y.


An important question that arises at this point is about the relevance of such an information-sparse paradigm that is invoked here. Do such situations actually occur in real-world systems? To answer this question, multiple examples are provided below, from across disciplines. However, before proceeding to expound on such examples, notation relevant to such discussion is first presented.

1.1.1 Some Definitions

The variable X parametrises the behaviour or structure of a system, and Y is an observed variable that is associated with such an X. In some applications in this book, X will be treated as a vector of model parameters; typically modelled as a d-dimensional vector, it will be denoted X ∈ 𝒳 ⊆ R^d. In some of the empirical illustrations discussed in later chapters, the variable Y will be modelled as an m-th order tensor. Then we will denote this observable as Y ∈ 𝒴 ⊆ R^{k1 × k2 × . . . × km}, for kj ∈ N; j = 1, . . . , m. The sample {(x1, y1), (x2, y2), . . . , (xn, yn)}, of n observed pairs of values of variables X and Y, is the training data set. In this training set, the general interpretation is that the value of X is chosen—or designed—by the practitioner, s.t. xi is the i-th “design point”, at which the realised value of Y is yi. Here i = 1, 2, . . . , n. We aim to learn the relation between X and Y, in order to predict/learn the value of one variable at a “newly-observed” value of the other. By “newly-observed” values of a variable is implied values of the variable that are not included in the training data set. Such a newly-observed value of Y is referred to as a test datum on Y, and is denoted y^(test). We will want to predict values of Y realised at a test value x^(test) of X. Alternatively, we might be interested in learning the values of X at which such test data on Y are realised.
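The notation just defined can be made concrete with a toy instantiation; the sizes n = 5, d = 3, and the choice of a 2nd-order tensor for Y are hypothetical, chosen purely for illustration.

```python
import numpy as np

# Toy instantiation of the definitions above (all sizes hypothetical):
# n design points x_i in d dimensions, each paired with a realised y_i,
# here a 2nd-order tensor (a k1 x k2 matrix).
n, d, k1, k2 = 5, 3, 4, 2
rng = np.random.default_rng(1)

x_design = rng.normal(size=(n, d))          # x_1, ..., x_n: chosen by the practitioner
y_realised = rng.normal(size=(n, k1, k2))   # y_i: value of Y realised at x_i

training_set = list(zip(x_design, y_realised))

# A test datum lies outside the training set: a new x at which we would
# want to predict Y, or a new y whose generating x we would want to learn.
x_test = rng.normal(size=d)
```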

1.2 Prediction Notwithstanding Unavailability Of Training: Real-world Examples

Above, we got a glimpse of the formalism of prediction, subsequent to the supervised learning of the functional relationship between a system variable X and an associated variable Y. However, this formalism is at odds with multiple pan-disciplinary real-world problems that are distinguished by the absence of training data, and ignorance of the probability distribution of either variable. The sole data available in these systems is test data on either X or Y. Indeed, if the aim is to predict the sales figures for a new book by an identified author, then reliable prediction of such sales figures is hard to achieve via the modelling of sales as affected by factors such as: author name/reputation; demographics at the location of the book store; past sales of books of the considered genre; etc.


The smaller the body of past information available on one of the relevant factors—for example, on the reputation of the author—the worse is the prediction. Additionally, the more difficult it is to parametrise an attribute, the greater is the possibility of erring or losing information when quantifying its effect on the prediction of sales. Importantly, sales figures, and values of the parametrised factors relevant to one particular book launch at a given book store, are not in general representative of sales figures relevant to a different book that is launched at a different geographical location (that is marked by different demographic parameters). Thus, if the difficult endeavour of objective assignment of values to all relevant factors is successfully undertaken in n distinct book launches, then the measured values of the sales variable Y, along with the assigned values of the system parameters—that are collated into a vector X—could be used to form a training dataset {(xi, yi)}, i = 1, . . . , n, using which one could, in principle, undertake prediction of sales figures for the book in question. However, even when the above endeavour is “successfully” undertaken, relevance of such a training set to a new book launch is circumspect. In other words, preferences and attitudes towards one book are specific to that book in question. (To some extent, such attitudes are driven by demographics and socio-economic realities that are relevant to the geographical location of a considered book store, s.t. information about preferences/attitudes will—weakly—percolate into the model via the store location factor.) Thus, while a subset of this training data will indicate one form of the input-output functional link, implying a given prediction for the sales figures of the test book, another subset might indicate a different prediction.
Basically, the dependence of Y on X is rendered sample-dependent, and therefore, prediction of sales figures for a new book, given the considered X, will not be reliable in general. In such circumstances, we summarise that there is no reliable information on Y at a given X, and therefore, there is no training data available for X–Y learning that will produce a meaningful prediction of the sales figures for the new book. Then, if new information on attitudes and preferences is invoked (via expert elicitation, say), reliability of prediction of Y may or may not improve; we appreciate that the more novel the appeal of the new book is to reader preferences, the weaker is the contribution of elicited information towards the strengthening of the sales prediction. Indeed, this is why the success or failure of a new book sometimes comes as a surprise. While the above scenario sheds light on limitations of the available training set, our interest surpasses challenges faced by expert elicitation. Here, we discuss situations in which acceptably-reliable expert elicitation is not possible, as can arise, for example, when an upcoming book is highly novel in content/style. Similarly, in the context of an example problem in Astronomy, the expert of the field cannot elicit the value of the mass of all gravitating matter enclosed within a given region inside a newly observed galaxy. Indeed, from their experience with galaxies, they could tell us that this mass lies in an interval: [some very small value, some very large value]. However, such a broad range is not informative enough to be of any use to further the study of this galaxy, and this is what we refer to as not being “acceptably-reliable” above. In this way, in multiple other real-world problems, it is precisely because the domain expert does not know what the


output values are at each given input, and that they would like to solve this problem, that we propose such sought solutions. We discuss such applications below. At the same time, the expert may have information—however weak/unreliable that might be—to contribute towards the values that the output variable attains at a design input. Indeed, the methodologies that exemplify bespoke learning—which we discuss in this book, for learning values of the output at each of a set of chosen input values—permit acknowledgement of such uncertain information that might be available at some/all of the relevant inputs; but crucially, such expert-elicited prior information is not a requisite in these methods. The point of this book is to advance bespoke learning methodologies that excavate information by exploiting the dynamics/correlation-structure/Physics that is intrinsic to the system; this allows for reliable learning of the output at a design point, irrespective of the strength of information content in priors that domain experts might be able to offer at some/all of the design inputs. That these methods bear the capacity for taking on board such additional information owes to the robustly Bayesian nature of these methods. Bespoke learning presented in this book is fully Bayesian, with inference on the values of the variables made by sampling from the posterior probability of the variables, given the data. Such posterior sampling is performed in the discussed applications using Markov Chain Monte Carlo techniques. Returning to the book sales example discussed above, an argument may be advanced that we only have to seek a relevant regression model to achieve “an idea” about what the “average” sales would be, and achieve “some quantification of uncertainty” about this predicted average. In this book, the aim is to avoid such modelled ideas of an average, or of uncertainties about it.
Rather, the target here is to achieve automated and robust prediction of the quantity of interest given all available information—measured or otherwise—which implies that objective uncertainties are provided with such prediction, as permitted within the Bayesian approach undertaken here. So this example tells us that, in general, it might not be possible to have a training set available when a new book is being launched; and even if a training data set were available, its relevance to a new book launch is not established. Difficulties in prediction are then not the result of difficulties of parametrising attributes, but stem from the unavailability of training data sets, or in fact, from the uselessness of the available training data for the prediction exercise at hand. In fact, such a problem transcends difficult-to-model behavioural or social settings, to plague even the physical sciences, including laboratory-based disciplines. We made a passing reference to one such example above, and will encounter more such examples soon. Ultimately, the hope is to attain a form of—or a probability distribution of—the functional relationship that links the variable that embodies everything that we can observe, and the system property (that we want to predict at a test value of the observable). Thus, if this functional relationship is known—perhaps probabilistically—we will be able to undertake probabilistic prediction of either variable, at any newly-observed value of the other.
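The posterior sampling via Markov Chain Monte Carlo mentioned above admits a minimal sketch; the standard-normal target and the unit proposal scale below are illustrative assumptions, not posteriors from this book.

```python
import numpy as np

# Minimal random-walk Metropolis sketch: draw samples from a posterior
# known only up to a normalising constant (here an assumed N(0,1) target).
def log_post(theta):
    return -0.5 * theta**2        # log-density up to a constant

rng = np.random.default_rng(2)
theta, samples = 0.0, []
for _ in range(20000):
    prop = theta + rng.normal(0.0, 1.0)        # symmetric proposal
    # accept with probability min(1, post(prop)/post(theta))
    if np.log(rng.uniform()) < log_post(prop) - log_post(theta):
        theta = prop
    samples.append(theta)
samples = np.array(samples[5000:])             # discard burn-in
```

The retained samples approximate the posterior, so means, spreads, and credible intervals follow directly; the uncertainties quoted with Bayesian predictions come from exactly such sample summaries.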


that is difficult, or impossible, to include within the model of the functional relation between X and Y. This latent parameter has the effect that it renders the available data sample-specific. This owes to behavioural and/or structural differences generated in one sample of the system, compared to another, triggered by the different ways in which this latent variable manifests across samples. The sensitive sample-specificity implies that generative processes that underlie values of the system parameters or the associated observable in one sample, do not necessarily concur with processes that underlie measurements obtained from another sample. While we anticipate a difference between these underlying generative processes relevant to the two cases, we cannot identify which subset of an available training data set is sampled from which latent process, where any such latent process itself remains unlearnt, given the available data. But over and above the presence of such latency in the generation of different parts of the training set, the very existence of the training set is challenged in different systems. Existence of a training data set implies that it is possible to measure or observe a value of Y at a design point, i.e. at a chosen value of X. However, in some cases, the observable Y that bears influence on/from the system remains unknown at a designed x. For yet other systems, the system variable X may not be an observable, though an observed signature of the unknown x is manifest in the available measurements on Y. We delineate examples of these situations separately below.

Material Density at Sub-surface Locations

There is no paucity of practical examples to demonstrate that prediction of either variable is often enjoined, notwithstanding the unattainable inter-variable or input-output functional link.
For example, for a newly grown sample of material—that is destined to be used in the manufacture of electronic devices—the material density at an arbitrary sub-surface location inside the material sample is unknown in general. Material density as a function of the location vector variable cannot be achieved by inverting image data of the material sample (taken with electron microscopes), since these images are intricate projections of the convolution of such a sought material density function and an unknown microscopy-specific blurring function. Here, the “intricacy” of the image formation alludes to the multiple projections of the aforementioned convolution, over an identified volume within the bulk of the material sample. So images do not readily inform on density as a function of the location vector, but definitely carry information on the density structure. Importantly, we cannot reliably simulate such a material density function of a new material sample, since each sample bears a distinct density structure depending on the latent growth conditions, where reliable information on how such conditions affect the density structure is unavailable. Destroying the material sample in pursuit of the density in its bulk is firstly undesirable, and more to the point, learning the material density at micro- to nano-metre length scales, even within such a destructive scheme, requires imaging, which does not readily yield material density as a function of location, as noted above.


It follows that we cannot form the ordered pair {(i-th subsurface location, material density at this location)} for any i = 1, . . . , n, implying unavailability of a training set comprising such ordered pairs. So we cannot learn the functional link between material density and the subsurface location at which this density is realised. Material density of real-world samples is (typically) highly discontinuously distributed across the bulk of the sample; hence approximating such a distribution with a parametric form leads to erroneous predictions. Thus, prediction of the subsurface density of a given sample is rendered difficult in general. We will revisit this example in Chap. 4.
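The image-formation picture above, in which an image is a projection of the blurred density, can be caricatured as follows; the toy box-shaped sample, the separable three-point blur kernel, and the choice of projection axis are all illustrative assumptions, not the microscopy model of Chap. 4.

```python
import numpy as np

# Toy forward model: an image is a projection of the convolution of the
# unknown 3-D material density with a microscopy-specific blur.
rng = np.random.default_rng(3)
density = rng.random((32, 32, 32))      # unknown sub-surface density (toy)

# Crude separable blur: convolve a small kernel along each axis in turn.
kernel = np.array([0.25, 0.5, 0.25])
blurred = density.copy()
for ax in range(3):
    blurred = np.apply_along_axis(
        lambda v: np.convolve(v, kernel, mode="same"), ax, blurred)

# Project the blurred density along one direction to form the image; the
# map density -> image is many-to-one, so it cannot be inverted directly.
image = blurred.sum(axis=2)
```

Many distinct densities yield (nearly) the same image, which is why the image cannot simply be "inverted" for the density function.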

1.2.1 Gravitational Mass Density in Galaxies

Consider galactic astronomy, in which we hope to learn the density function of “all” gravitating matter in a distant galaxy, given noisy observations of the motion of a sample of galactic particles. Here, by “all” gravitating matter is implied matter that exerts a gravitational force on any other gravitational mass, and could either emanate light of its own, or reflect light (from external luminous sources) that is incident upon it, or could be dark matter—which neither emits light, nor reflects light off itself. The gravitational mass density function—an unknown function of the observed particle position variables—is the system property that is affected by the observed particle motion, though the functional relation between the gravitational mass density function and particle motion is unknown. This functional relation is in fact underpinned by the probability distribution of the phase space vectors of galactic particles, i.e. the pdf of the particle position and velocity variables. The distribution of the particle phase space vector is relevant, since matter that lies internal to the particle orbit has an attractive effect on the particle, owing to the gravitational field of the former—the more tightly the particle is gravitationally bound to such matter, the less free it is to execute motion. However, we are unaware of the exact functional form of the dependence of particle motion on the distribution of such gravitational mass. In other words, motion parameters of particles cannot be generated at designed gravitational mass density functions, i.e. training data cannot be generated, and we cannot adopt the conventional supervised learning approach. We appreciate that this unknown functional relation is subject to the probability distribution of the variables that carry information on the state of the galaxy.
We would ultimately want to predict the gravitational mass density function of a distant galaxy, in which a sample of particles has been tracked for their motion and position vectors; (indeed, astronomical systems allow only some


of the components of these two vectors to be observed). In the absence of training data, this problem then reduces to the prediction of the gravitational mass density function (along with the learning of the system state space probability density function), given observations of the observable components of the location and velocity vectors of galactic particles.
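The pull exerted by matter interior to a particle's orbit, invoked above, can be illustrated with the standard Newtonian relation for a spherical system; the constant-density toy profile and all the names below are our illustrative assumptions, not the model of Chap. 3.

```python
import numpy as np

# For a spherical system, the gravitating mass enclosed within radius r is
#   M(<r) = integral_0^r 4*pi*s^2 * rho(s) ds,
# and it is this enclosed mass that binds the motion of a particle at r.
def enclosed_mass(r_grid, rho):
    integrand = 4.0 * np.pi * r_grid**2 * rho
    # cumulative trapezoidal integration over the radial grid
    dm = 0.5 * (integrand[1:] + integrand[:-1]) * np.diff(r_grid)
    return np.concatenate(([0.0], np.cumsum(dm)))

r = np.linspace(1e-3, 10.0, 400)
rho_const = np.ones_like(r)          # toy constant-density "galaxy"
m = enclosed_mass(r, rho_const)
# For constant rho = 1, M(<r) approaches (4/3) * pi * r^3.
```

The inverse direction, from observed particle motions back to rho, is the hard problem: it is mediated by the unknown phase space pdf, which is why training pairs of (density function, motion data) cannot be generated.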

1.2.2 Composition of Rocks in Petrophysics

In Petrophysics, interest lies in identifying the petrologically-relevant compositional makeup of a rock that has been dug up from a well, to indicate the “producibility” of the well—which constitutes a reservoir. Nuclear Magnetic Resonance (NMR) data obtained from a rock bear information on such a sought compositional signature, though no reliable method exists for predicting the compositional parameters at which a recorded NMR dataset is realised. Methods currently prescribed within Petrophysics suffer from multiple shortcomings, including severe concerns over the lack of identifiability of the advanced solutions; lack of objectivity in accounting for observational noise—especially non-Gaussian noise; and the ad hoc nature of the details of the regularisation that is employed in the undertaken implementation. It is of course possible to record the NMR data from a rock, but the compositional parameters remain unknown, prohibiting the establishment of a {(composition of i-th rock, NMR data recorded for the i-th rock)} pair, i.e. of a training dataset. In principle, destructive compositional analysis of individual rock samples is possible, though such an exercise is achievable in only a rare few facilities worldwide, and is prohibitively expensive, limiting the {(composition, NMR data)} information to a small sample of rocks—if available at all. However, petrologically-relevant composition varies widely across rocks, sensitively conditional on the latent and unquantifiable geological evolution and ambience experienced by a sampled rock. Thus the small-sample training set that is possible on rare occasions is rendered unrepresentative of test rocks that are subject to other extraneous conditions. A useful training set therefore remains elusive, nullifying the possibility of supervised learning of the inter-variable function, and triggering the need for an alternative way of predicting composition, given the NMR data observed in individual rocks.

1.2.3 Temporally-Evolving Systems

In certain applications, the presence of so-called training data is deceptive. For instance, reliable forecasting of the pan-national/regional vector of the day’s

¹ The galaxies that we discuss here are the “elliptical” galaxies [3], the images of which are elliptical, i.e. those that can be ascribed a global triaxial geometry.

infection numbers of a hitherto-unknown disease is not possible using the currently available data on the infection numbers in the past. In other words, the hitherto available data on infection numbers recorded at past time points is not training data that would permit prediction of the sought value of the infection numbers at a future time point. The dynamics of the spread of this new disease is not reliably captured by a model, given the unforeseeable non-linearity of its spread, and because of the influence of multiple—not always identifiable—factors (or covariates) on the spread. Thus, for example, an unprecedented surge in the value of the infection numbers, or an isolated intervention undertaken to affect the evolution of the disease, will invalidate model-based forecasting that relies on the reproducibility of past trends in the spreading of the disease. Thus, the available data comprising the set of ordered pairs {(i-th time point, infection rate vector at the i-th time point)}_{i=1}^{n} does not constitute a useful training data set, where such “useful”ness refers to the objective of forecasting future infection numbers. One way we could still undertake this forecasting is by learning the function that drives or causes the evolution of this vector. Drawing on background from Potential Theory, this evolution-driving function (of relevant system variables, as well as of time) is referred to as the potential function. While details are discussed below, we clarify at the outset that such an evolution-driving function is not considered in the methodology to be a “latent variable”. The bespoke learning approach is not designed on state space models that invoke latent variables, though it is possible to interpret the offered methodology in the light of state space models.
If such an interpretation is sought, what may appear to be latent variables as per existing state space methodologies will not, however, warrant the same treatment (as latent variables) within our method. This is because—to state the obvious—our method is not the same as existing state space approaches. As we will see in the second chapter, this evolution-driving (or potential) function, and the pdf of the state space variables, are functions that we seek to learn; to do so, we bespoke-learn values of each function at relevant inputs. In still other applications, this evolution-driving potential function of the system may itself be what we pursue, since its learning—given observations of the phase space coordinates—offers quantification of the system structure, in certain systems in which structure and potential are deterministically linked. Thus, for such a dynamical system, if viewable as having equilibrated, the ambition translates to the learning of the latent probability density function of the state space variable, along with the parameter that represents the time-independent system structure function. The estimation of both the state and a static model parameter has been attempted in the past [16, 27], within the general paradigm of non-linear state-space modelling [6]. Particle-based filtering techniques are increasingly employed in this context [22].
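The particle-based filtering just mentioned can be sketched for a toy non-linear state-space model; the transition and observation equations, noise scales, and particle count below are illustrative assumptions, not the models used in this book.

```python
import numpy as np

rng = np.random.default_rng(4)

# Toy bootstrap particle filter for a non-linear state-space model:
#   state:       x_t = 0.9 x_{t-1} + 0.1 sin(x_{t-1}) + N(0, 0.3^2)
#   observation: y_t = x_t^2 / 2 + N(0, 0.2^2)
T, N = 50, 500                        # time points, particles

x_true, y = np.zeros(T), np.zeros(T)
for t in range(1, T):
    x_true[t] = 0.9 * x_true[t-1] + 0.1 * np.sin(x_true[t-1]) + 0.3 * rng.normal()
    y[t] = x_true[t]**2 / 2 + 0.2 * rng.normal()

particles = rng.normal(0.0, 1.0, N)
est = np.zeros(T)
for t in range(1, T):
    # propagate each particle through the state equation
    particles = 0.9 * particles + 0.1 * np.sin(particles) + 0.3 * rng.normal(size=N)
    # weight particles by the observation likelihood (log-scale for stability)
    logw = -0.5 * ((y[t] - particles**2 / 2) / 0.2) ** 2
    w = np.exp(logw - logw.max())
    w /= w.sum()
    # multinomial resampling, then summarise the filtered state
    particles = rng.choice(particles, size=N, p=w)
    est[t] = particles.mean()
```

Estimating a static model parameter alongside the state, as in [16, 27], would amount to augmenting each particle with a parameter value.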


1.3 Relevant Ambition

Unlike these past works, the focus of this book is on a new method applied to a dynamical system in which the evolution-driving function is itself varying with time, and the system manifests non-linear dynamical behaviour. The learning of this unknown evolution-driving potential function, along with the learning of the pdf of the phase space vector, is discussed. The initial interest is in such learning; thereafter, we move on to discuss prediction of the phase space variable at some future time. Also, in this work, such learning is accomplished given the only relevant data, namely the test data comprising measurements of some or all of the components of the phase space vector. Examples of problems marked by absent training data, as sometimes discussed in GeoStatistics or spatial Statistics, are fundamentally different from the situations we discuss here. The latter type of problem is comparatively more information-sparse, in two ways. Firstly, in spatial Statistics, the spatially-varying observable is regressed—typically linearly, or according to an identified parametric model—against the latent signal, which is also a spatially-varying function [5, 10]. However, in the kind of applications that we consider here, the observable can be a high-dimensional structure, s.t. the functional relationship between the observable and the latent signal function is high-dimensional. So a parametric regression model is not relevant to capturing the correlation amongst the multiple components of this high-dimensional (i.e. tensor-valued, in general) functional relationship, where certain component functions might indeed be discontinuous. This technical concern regarding dimensionalities is superseded by the fact that in our considered examples, the pdf of neither the observable nor the system property is known in general, unlike in spatial applications [17]. Indeed, learning the pdf of the observable is one of our aims.
So to formalise the last paragraph: the observable Y that we want to model as a function of the latent system property X is given as Y = ξ(X), where the function ξ(·) is in general non-linear and unknown, and X is unknown. Were we following spatial Statistics, we would write Y(s_i) = X(s_i) + error, at the i-th value of the location variable S, with the latent signal function X(s_i) modelled as X(s_i) = z(s_i)^T β + B(s_i), where z(s_i)^T is the i-th row of a design matrix Z. Thus, this model of the signal function represents a linear relationship between the spatial field and the values of Y. Here β is the vector of unknown coefficients, and {B(s)} is a zero-mean GP. However, for us, ξ(X(S)) is typically an unknown, tensor-valued, occasionally discontinuous function that cannot be cast into the form Ξβ, with a matrix Ξ representing a linear—or for that matter, a parametric—relationship between the field of the latent X(S) and Y, in general. The pdf of X(S) or ξ(X(S)) is unknown in general, and cannot be approximated by closed-form densities. Basically, unavailability of training data—or, when pertinent, the inapplicability of the existing training data to prediction at test data—implies that any approach that is a regression in one guise or another, is rendered irrelevant for the learning of the sought system parameter X, at measured values of an associated observable


Y. If and when the originally-absent training set is generated, we can of course undertake regression. Limited by access only to the available test data, we then appear deprived of the means to learn the inter-variable relationship; and while this function remains unlearnt, how could we possibly predict the values of X at which a value of Y is realised, or predict Y at a given X? It therefore follows that we either need an alternative that will allow for the sought prediction, or we need a way of generating the originally-remote training data set, to enable the supervised learning of the input-output function and thereby enable prediction. The latter strategy is what this book offers. In this book, we offer two broad kinds of solutions, for the two broad types of systems in which the need arises for prediction notwithstanding the absence of training data at given test data. Method 1(a)

Systems that evolve with time, but for which the training data {(i-th time point, state space coordinates at the i-th time point)}_{i=1}^{n}, available up to the n-th time point, is irrelevant for prediction of the state space variable at a future time point. Then
• the (generally) time-dependent evolution-driving—or potential—function of this dynamical system is learnt over narrow time windows, over which stationarity of evolution is assumed, by embedding the potential into the support of the pdf of the state space variable. This is rendered possible after invoking information on the temporal evolution of this pdf. The form of the potential function changes with time, as we move from one time window to another. The potential function learnt in each of n such time windows generates a time-potential training data set, which permits learning of the potential, by modelling the potential function as a random realisation from an adequately chosen stochastic process. This process is specified, and the potential function at any time is predicted. Prediction/forecast of the state space variable is then possible using the equivalent of Newton’s equations of motion, for which the potential is the input.
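The final stage of Method 1(a)—regression on the time-potential training set, followed by prediction of the potential at a future time—might be sketched as follows, with a Gaussian Process as the “adequately chosen stochastic process”; the squared-exponential kernel, its hyperparameters, and the synthetic “learnt” potential values are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Suppose the potential has been bespoke-learnt in each of n = 10 narrow time
# windows; the (window mid-point, learnt potential) pairs below are synthetic
# stand-ins for that time-potential training set.
t_train = np.arange(10, dtype=float)
rho_train = np.sin(0.4 * t_train) + 0.05 * rng.normal(size=10)

# Squared-exponential kernel for the GP model of the potential function
# (the kernel form and hyperparameters are illustrative choices).
def kern(a, b, ell=2.0, amp=1.0):
    return amp**2 * np.exp(-0.5 * ((a[:, None] - b[None, :]) / ell)**2)

sigma_n = 0.05                                   # assumed noise level
K = kern(t_train, t_train) + sigma_n**2 * np.eye(t_train.size)

t_star = np.array([10.5])                        # a future time point
k_star = kern(t_star, t_train)

# GP predictive mean and variance of the potential at t_star; the predicted
# potential would then feed the equations of motion to forecast the state.
mean = k_star @ np.linalg.solve(K, rho_train)
var = kern(t_star, t_star) - k_star @ np.linalg.solve(K, k_star.T)
```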

Method 1(b)

Dynamically-evolving systems, in which the potential is known to be a deterministic proxy of the system structure. This potential is then sought, along with the pdf of the state space variable, and the only available information comprises offline data on an observable that bears the influence of/on the system parameters. Ignorance of the nature of the probability distributions of either the system variable or the associated observable compounds the information sparseness. Then
• the desired potential function is embedded within the support of the state space variable—i.e. the support of the pdf of this relevant state space variable—subsequent to recalling information on the temporal evolution of this pdf. The result offers the training


data sets—originally absent—that allow for Gaussian Process-based learning of both the potential function and the state space pdf. Method 1(c)

Time-varying systems in which a multivariate time series is the output, and a vector of inputs is modelled as associated with this output, where the measurements of the output may cover different temporal horizons, and/or the interest is in learning a progression score over the respective observational horizon (as is relevant when recording patient charts of sufferers of a disease, such as a potentially-terminal ramification of a bone marrow transplant in patients with certain blood cancers). The functional dependence between such a progression score and the high-dimensional input variable is then sought, to identify input factors that influence the score more than others, and to predict the score for a test patient with a given input value. Then
• the distance between the (posterior probability of the) graphical model learnt given an observed time series, and that of a reference subject in the sample/cohort, is used to compute the score of the subject, relative to the score of the reference subject. Here a graphical model of a given time series dataset is constructed using a sample of realisations of a random graph variable that is inferred upon Bayesianly, given the (time series) data. This is followed by the learning of the with-uncertainty ranking, by strength of association between the score and each of the (many) input variables that affect this output score.
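The graph-distance idea in Method 1(c) can be caricatured as below; the thresholded-correlation graphs and the normalised Frobenius distance are crude stand-ins for the Bayesian random-graph inference and the posterior-based distance described in the text.

```python
import numpy as np

rng = np.random.default_rng(2)

def graph_of(ts, thresh=0.3):
    # Crude stand-in for the Bayesian random-graph learning of the text:
    # threshold the absolute inter-variable correlation matrix of a
    # multivariate time series into a binary adjacency matrix.
    C = np.corrcoef(ts, rowvar=False)
    A = (np.abs(C) > thresh).astype(float)
    np.fill_diagonal(A, 0.0)
    return A

def graph_distance(A1, A2):
    # Normalised Frobenius distance between two graphs (an illustrative
    # metric, not the posterior-based distance of the text).
    return np.linalg.norm(A1 - A2) / A1.shape[0]

# Reference subject and a test subject: 5-variable series of unequal length.
ref = rng.normal(size=(120, 5))
subj = rng.normal(size=(80, 5))
subj[:, 1] += 0.8 * subj[:, 0]       # induce an extra association in the subject

# Score of the subject, relative to the reference subject.
score = graph_distance(graph_of(ref), graph_of(subj))
```

Note that comparing graphs—rather than the raw series—sidesteps the unequal observational horizons of the two subjects.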

Method 2

Static systems in which some system function or vector is unknown, and the training data set that is requisite for the learning of this unknown is unavailable. We will undertake the learning of the unknown using bespoke system Physics, given the available test data. Such learning then populates the originally-absent training data set. Typically
• a model of the likelihood is motivated to abide by information on trends that are identified upon comparison between the model and the relevant observations. Even though multiple forms of the likelihood are possible under these limiting constraints, Bayesian inference performed with such likelihoods offers sampled values of the unknowns, the sample means of which concur, though the sample variances differ. So the uncertainties thus learnt on the unknowns are parametrised, and this parameter is learnt. Once the originally-absent training set is populated, the functional relation between the system parameters and the observable is learnt as a realisation from a Gaussian Process. Prediction at any test value of either variable is then possible.
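The first bullet of Method 2—sampling the unknown under a motivated likelihood, so that the sample summaries can populate a training pair—can be sketched with a random-walk Metropolis chain; the forward model, noise scale, and proposal width below are hypothetical stand-ins for the system Physics and inferential choices of the text.

```python
import numpy as np

rng = np.random.default_rng(3)

# One test observation y_obs, assumed realised as y = f(x) + noise for an
# unknown system value x; f below is a hypothetical physics-based forward
# model standing in for the "bespoke system Physics" of the text.
def forward(x):
    return x**2

y_obs, sigma = 4.1, 0.2               # observation and assumed noise scale

def log_lik(x):
    return -0.5 * ((y_obs - forward(x)) / sigma)**2

# Random-walk Metropolis sampling of the unknown x, given y_obs.
x, chain = 1.0, []
for _ in range(5000):
    prop = x + 0.3 * rng.normal()     # proposal width: an illustrative choice
    if np.log(rng.uniform()) < log_lik(prop) - log_lik(x):
        x = prop
    chain.append(x)

samples = np.array(chain[1000:])      # discard burn-in
x_hat, x_sd = samples.mean(), samples.std()
# The pair (x_hat, y_obs)—with uncertainty x_sd—then populates one entry
# of the originally-absent training set.
```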


As stated above, solutions are offered in the book within a Bayesian framework, and inference is undertaken using a chosen flavour of Markov Chain Monte Carlo (MCMC) techniques. Appendix A discusses the preliminary idea of MCMC. In the following chapters, Method 1(a) enumerated above will be illustrated on the forecasting of (a simple transformation of) the daily number of new infections of COVID-19, in eight countries. The methodology can be applied to address cross-disciplinary interest in the prediction of a state space variable relevant to the dynamical system under consideration, at a future time point. Method 1(b) will be empirically illustrated in an astronomical context, using data on the observable components of the location and velocity vectors of a sample of galactic particles that reside in a distant galaxy. The gravitational potential function that determines the evolution of the dynamical system, namely the observed galaxy, will be learnt. Method 1(c) will find an application in identifying the most strongly associated pre-disposal variables (amongst a large set of inputs) that lie in a state of relationship with an output that represents the progression score of a potentially-terminal disease that affects recipients of bone marrow transplants in certain blood cancer patients. Here, this disease progression score will be learnt given the time series data on the recorded phenotypic parameters of each patient in a cohort, during a stipulated pre-transplant to post-transplant time range. However, some patients expire from this disease, or other diseases, before others, s.t. the time series that is the patient chart is differently long for the different patients. An application of Method 2 will be undertaken to address the problem of learning the sub-surface material density function of a material sample that has been imaged in a novel imaging experiment, with a Scanning Electron Microscope.
We discuss further real-world applications later in the chapter (Sect. 1.5).

1.4 Bespoke Learning

So we appreciate that the kind of learning sought in the contexts motivated above—of values of either an intrinsic system variable or an associated observable, given test data alone on the other—is not supervised learning [1, 23]. After all, in supervised learning, a sample of known pairs of values of the input and output variables is employed to learn the relationship between the variables, in order to predict the value of one variable at test data on the other, or to classify test data into different classes. In the problems we have enumerated above, the lack of training data is the main impediment to the learning of the inter-variable relationship. Hence this is not supervised learning. We motivate a methodology that allows for the bespoke learning of one variable at each datum contained in the test data available on the other variable; this then generates the originally-absent training data, using which, supervised learning of the relationship between the two variables is permitted.
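Once the originally-absent training pairs have been bespoke-learnt, the subsequent supervised step is routine. A minimal sketch follows, with synthetic stand-ins for the bespoke-learnt values, and a simple least-squares fit in place of the Gaussian-Process model used in the book:

```python
import numpy as np

rng = np.random.default_rng(5)

# Synthetic stand-ins for a bespoke-learnt training set: output values y_i
# "learnt" at each test input x_i (the quadratic form and noise are invented).
x = np.linspace(0.0, 1.0, 20)
y = 1.5 * x**2 + 0.3 + 0.02 * rng.normal(size=20)

# With the training set populated, ordinary supervised learning applies;
# here a quadratic least-squares fit of the inter-variable relationship.
coeffs = np.polyfit(x, y, deg=2)
predict = np.poly1d(coeffs)

y_star = predict(0.5)                 # prediction at a test input
```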


Also, what we undertake here is not reinforcement learning [13, 24]. Reinforcement learning refers to the type of learning that is characterised by a sequence of actions that maximise a cumulative reward. While inference on the unknowns could involve the imposition of a penalty, in all work relevant to the subsequent four chapters in this book, such inference is MCMC-based. Our bespoke learning of the values of the unknowns occurs in a Markov sequence that is relevant to our choice of the inference, but such bespoke learning could in principle have been undertaken within another form of inference as well—and in fact, within a non-Bayesian framework too (though the acuity of such a choice is questionable). However, what is relevant here is that the learning of values of Y (or X), at data points that comprise only observed values of X (or Y), is not in general undertaken to optimise a reward, implying that the undertaken bespoke learning is not equivalent to reinforcement learning in any sense. The type of learning that permits the learning of one variable at a test datum available on an associated variable is definitely not unsupervised learning conducted on the available test data. Rather, unsupervised learning refers to the discovery of patterns in the data available on one variable, to enable the learning of the intra-data correlation structure—such as the clustering distribution of the given data [11, 14]. Therefore, the need arises for distinguishing the type of learning that we undertake here from the unsupervised type.

The need to have a distinguishing name for such learning is motivated. Such learning will be hereafter referred to as bespoke learning; to reiterate, bespoke learning entails learning the values of a random variable that is motivated to lie in a state of relationship with another variable, though no information is available on the nature of this relationship, or on the probability distributions of either variable, with test data available only on the latter variable. So what is absent, is a training data set comprising pairs of values of the two relevant random variables—with the output of the pair known to be realised at a design input value. Only in the presence of such a training set, can the relationship between the two variables be learnt.

1.4.1 Training Data Is Absent: Not that Training Data Points Are Missing It follows from the last paragraph, that to try and learn the functional relationship between two random structures, this originally-absent training set needs to be generated; this is accomplished via bespoke learning of the originally-absent values of one of these variables—say the variable that is the output of this relationship—at


any chosen value of the other, which in this consideration, is rendered the input to this function.

It is not as if values of the output are available at certain design inputs, and unavailable at others—at no input do we have the means to know the corresponding output. Thus the problem is not one of missing training data points. The problem is one of absent training data, given that there is no reliable information on what values the output might attain at an arbitrarily chosen input.

Thus, the material scientist does not have information on what the material density is at an arbitrarily chosen location inside the bulk of the material sample that they have grown. Under certain growth conditions used in the laboratory to grow this material sample, expert elicitation may in rare cases permit a rough idea about how a selected part of the sub-surface volume of this material sample—for example, the part just below the surface—can be richer in a component material than other parts, and therefore relatively denser. However, it would be impossible to assign the boundaries of such “parts” of the sub-surface volume. Neither would it be possible to suggest by what factor the density in one location is higher than that in another. At most, weak priors on the material density at a chosen sub-surface location may be attained via expert elicitation, and in the Bayesian implementation of bespoke learning that we discuss here, any such prior on the sought parameter can be acknowledged. However, expert elicitation alone cannot populate the originally-absent training data on pairs of values of sub-surface location and the material density at that location. Again, elicitation from the literature may or may not be useful towards the formulation of priors on the gravitational mass density at a given location inside a galaxy, depending on whether reliable density values are available in the literature for a given galaxy or not. It is in fact owing to the lack of consensus on the gravitational mass density values at arbitrarily-chosen locations inside a galaxy, that learning of the same is such a topical problem. Similarly, the producibility signatures in a rock that is sampled from a new well—and NMR observations of which are undertaken—are not reliably known to a petrophysicist in general.
The number of infections of a pandemic on a given day is not known a priori; any (approximate) suggestion that an expert might forecast for that day on the basis of their model will indeed be limited by the approximate nature of their forecast, and that is precisely why a reliable forecasting technique is awaited—one which might avail of expert-elicited suggestions as inputs towards priors on the sought forecasting. Indeed, the strength of such priors that practitioners may choose to use will depend on their decision regarding the level of approximation anticipated by them in the expert-suggested forecasting.


Then, to repeat what we stated earlier in Sect. 1.2, the objective of bespoke learning is to resort to one of the espoused techniques (see the enumerated techniques in Sect. 1.1), to produce reliable results in each such aforementioned application—and other real-world applications—independently of approximate suggestions that may or may not be available on the values attained by the output variable at a chosen input, and without needing to take decisions about the level of approximation relevant to the suggestions made. It is as if observations of the variable Y were not available at a given value of the associated variable X, and then new technology allows for such observations to become possible; indeed, pairs of values of X and Y then become accessible. Here, bespoke learning provides such a technology, allowing for the construction of that originally-absent set of pairs of values of X and Y. If, in the (bespoke) learning of the values of Y, priors are available on the sought values of Y at a particular value of X, then such priors will be acknowledged.

Indeed, offering bespoke learning solutions is analogous to the situation when freshly updated technology enables previously impossible observations, where one could not earlier observe the relevant output variable owing to technological shortcomings in the undertaking of the desired observations.

We need to be clear that the originally-absent values of the output variable that are generated via bespoke learning, at a chosen input, are not artificial constructs, but actual values of the output variable in the real world, so that pairs of {(chosen input value, bespoke-learnt output value at this input)} provide the originally-absent training data—using which, we can then undertake the learning of that originally-elusive functional relationship between the input and output variables. Prediction of either variable, at a test value of the other, can then be undertaken. In this book, “learning” refers to any generic endeavour that is accompanied by comprehensive and objective uncertainties that improve as the quantity of information input into the inference scheme increases, and/or as the quality of such input information improves (such as when noise in the data decreases).

1.5 A Wide-Angled View: Bespoke Learning in the Brain?

Methodologies similar to what is described in the following chapters will be of relevance to applications that require forecasting over short to long time-scales, within a non-linearly evolving dynamical system, the evolution of which is itself temporally-varying—as in Finance [20]; Econometrics [7]; Healthcare [21], etc. That the forecasting discussed in Chap. 2 is fast, as well as reliable, renders applications to these areas possible. Again, in systems in which the structural


density function can be deterministically related to the evolution-driving function, the approach as clarified and implemented in Chap. 3 is relevant. This is germane to astronomical applications—from the learning of the gravitational mass density of individual (globular) clusters of stars [8], to galaxy clusters [18]. The illustration of bespoke learning on a static system, as indicated in Chap. 4, entails solving a difficult inverse problem, to learn values of the output at a chosen value of the input, given the observations made on a transformation of this output. Such applications abound: [2, 28] to cite a couple, and we have discussed above another such possibility within the realm of Petrophysics [9]. Multiple applications of the method discussed in Chap. 5 are possible, in addition to the kind of healthcare-related application that is included in that chapter. Bespoke learning of the scalar parametrisation of an output that is a multivariate time series over a variable temporal range is relevant to credit scoring, especially when performed under a deficiency in (prior) credit history [4]. Learning a composite developmental score, given development-related data that spans differential time ranges, is also sought within Healthcare [26], Econometrics [19], etc. Additionally, bespoke learning appears to be at the heart of the functioning of the brain; we discuss this next. An interesting observation on the proposed bespoke learning relates to the functioning of our brains. Recently, suggestions have been made that our brain does not “store” information, and neither does it “retrieve” information from a “memory register” as and when required [12]. The author states that our brains are not congenitally equipped with “design elements” that render our brains’ functioning comparable to that of digital computers—in contradiction to the suggestion made by [25]—and neither do the said elements develop in the brain.
Under such a hypothesis, the relevance of supervised learning is suspect, irrespective of any reward scheme that might be germane to such learning. If information is neither stored nor retrieved in the brain, then it might be considered that for a problem/task that has been encountered before, the previously-obtained solutions are not being revisited and implemented. Then it follows that the problem/task could be solved anew every time we encounter it, or it is possible that there exists a mechanism that causes the hardware of the brain to be updated once the problem/task is solved, s.t. after the i-th successful encounter with this problem/task, the solution is likely to be (at least) no more “difficult” than it was after the (i−1)-th solution of this problem/task; i = 2, 3, …. Such a simplistic sequential ordering of the “difficulty” is likely to be modulated, given the interactions preceding and current, at the i-th instance of learning, as per the suggestion made by [15]. In any case, it appears that the solution is attempted anew every time we face this problem/task, and a pattern that is conducive to the solution is not learnt and thereafter stored in the brain—to be retrieved for subsequent encounters with this same problem. So there is no supervised learning of a pattern based on encountering the problem multiple times in the past, where such a pattern is invoked in the following encounter, to perform the prediction or decision task as required. Indeed, if the hardware of the brain does get updated after each past encounter with a given problem/task, then that is suggestive of the (renewed) learning at the i-th


instance being different from all learning undertaken in previous instances. An updated—over past instances of addressing this problem—neural infrastructure can be viewed as offering a comparatively more information-rich environment, within which the problem/task is currently being faced. Effectively then, it can be hypothesised that the learning of the solution to a given problem/task is performed by the brain anew, with a varying (perhaps sequentially non-decreasing) amount of information available at the i-th instance from all previously undertaken learning instances—except that the net information content at this instance is modulated by previous-to-current interactions with the “environment”. So the learning in the brain is not supervised—unlike in a computer—but bespoke, within a sequentially varying information background. However, no simple ordering of the quality/quantity of the relevant information background can be motivated. To summarise, in this section we enquired how the brain attempts to perform a new task. Does the brain invoke a model of the relationship between a generic stimulus—referred to as a situation-defining parameter X—and a reaction-defining parameter Y? Such a model of the said relationship could have been learnt using a set of situation and reaction parameters, and aided by prior experiences that include previous encounters with this task, and interactions with the environment—with the learning perhaps affected by the imposition of penalties and/or rewards. Or does the brain attempt to identify certain reactions to given realisations of the situation-defining parameter X, where, to facilitate the bespoke “prediction” of reactions to the currently-relevant X, the aforementioned prior experiences could be invoked? At the same time, it is acknowledged that triggers for these forwarded reactions to the given stimuli remain unidentified in general.
Whatever prompts the choice of one subset of realisations of Y over others is specific to the task at hand and the acknowledged stimuli, as well as to the individual’s bespoke interactions with their environment—current and past. Such bespoke prediction does not, however, result from supervised learning, in the sense that a readied model of the relationship between X and Y is not being used to read off the Y values relevant to the currently-relevant X. Also, such learning may not necessarily be undertaken to maximise an identifiable reward; importantly, the learning does not attain its converged form only upon the maximisation of any such reward. While it is possible that the efficiency of the learning improves from one iteration of solving the given problem/task to the next, the learning of this solution itself is not conditional on such iteratively-improving efficiency. The solution learnt even at the first attempt (or iteration) is very much a solution to the problem at hand. Additionally, it is possible that interactions with the environment—both external to the individual and internal—render the efficiency of the solution worse at a certain implementational instance than on previous occasions. In this sense, such learning is not reinforcement learning either. Indeed, in the paradigm that challenges the metaphor of the digital device for the brain, it is a bespoke type of learning that appears germane.


1 Bespoke Learning

1.6 Summary

Before ending this chapter, we recapitulate its underlying motif. Here we introduced a new kind of learning that offers values of a variable, given a data set that comprises values of another (associated) variable. In the absence of the sought values of this variable, it was not possible to compose the training data that comprises pairs of values of the design input and the output realised at such an input. With the output bespoke-learnt at a given design point, we are provided with the training set that comprises ordered pairs of values of the two variables. So it is only subsequent to the bespoke learning that the training set is generated—to eventually permit supervised learning of the functional relationship between this pair of variables. This new type of learning is referred to as bespoke learning, firstly to distinguish it from supervised, unsupervised and reinforcement learning. Secondly, the semantics here are motivated by the occasion-specificity of the methodology, which is in turn motivated by the nature of the system. In the chapter we included a brief discussion of the possibility of bespoke learning being germane to the functioning of the brain. If the theory holds that the brain—unlike a digital device—does not store and retrieve information, then a corollary is that undertaking a task—such as solving a problem—is performed anew, though with every implementation the relevant cerebral infrastructure might be updated. Conditional on such updating of this internal environment, the task may be rendered less “difficult”—where “difficulty” requires objective qualification—while the external environment also interacts at every implementation in a bespoke way.
If forecasting future states of a dynamical system is the task at hand, then it is the temporally-varying evolution-driver of this system that is learnt—by embedding it into the support of the phase space variable of the system; in certain systems, the system structure is deterministically known once the evolution-driver is learnt. On the other hand, if the considered task is the learning of a structural/behavioural property of a static system, then it is the comparison of the observed and model outputs that is invoked to motivate the likelihood. Again, if observations from a dynamical system are available in the form of a multivariate time series covering a variable temporal range, then relating such an observable to static system inputs is accomplished by first learning a scalar-valued, lossless parametrisation of this observable, which is thereafter related to the static input. Under the circumstances, such a parametrisation is developed as the distance between the learnt graphical models of pairs of such diversely long realisations of the time series data. These various implementations of bespoke learning will be made to different areas, including epidemic infection number prediction, Astronomy, Material Science, and Haematology-Oncology. The subsequent chapters of the book do not just present the bespoke learning of the relevant variables—thereafter, the supervised learning of the inter-variable functional relationship, sanctioned by the generation of the originally-absent training data, is expounded upon.

References


1. What is supervised learning? https://www.ibm.com/cloud/learn/supervised-learning, 2020.
2. M. P. Anderson and W. W. Woessner. Applied Groundwater Modeling: Simulation of Flow and Advective Transport. Elsevier Science, 1992.
3. J. Binney and M. Merrifield. Galactic Astronomy. Princeton Series in Astrophysics. Princeton University Press, Princeton, NJ, 1998.
4. Antonio Blanco, Rafael Pino-Mejías, Juan Lara, and Salvador Rayo. Credit scoring models for the microfinance industry using neural networks: Evidence from Peru. Expert Systems with Applications, 40(1):356–364, 2013.
5. Calum Brown, Janine B. Illian, and David F. R. P. Burslem. Success of spatial statistics in determining underlying process in simulated plant communities. Journal of Ecology, 104(1):160–172, 2016.
6. Bradley P. Carlin, Nicholas G. Polson, and David S. Stoffer. A Monte Carlo approach to nonnormal and nonlinear state-space modeling. Journal of the American Statistical Association, 87:493–500, 1992.
7. J. Castle, D. F. Hendry, and M. P. Clements. Forecasting. Yale University Press, 2019.
8. Dalia Chakrabarty. An inverse look at the center of M15. The Astronomical Journal, 131, 2006.
9. G. R. Coates, L. Xiao, and M. G. Prammer. NMR Logging: Principles and Applications. Halliburton Energy Services Publication H02308, Houston, 1999.
10. Peter Diggle and Paulo Justiniano Ribeiro. Model-based Geostatistics. Springer Series in Statistics. Springer, New York, NY, 2006.
11. R. O. Duda and P. E. Hart. Pattern Classification and Scene Analysis. Wiley, 1973.
12. Robert Epstein. The empty brain, 2016. https://aeon.co/essays/your-brain-does-not-process-information-and-it-is-not-a-computer
13. Vincent François-Lavet, Peter Henderson, Riashat Islam, Marc G. Bellemare, and Joelle Pineau. An introduction to deep reinforcement learning. Foundations and Trends in Machine Learning, 11(3–4):219–354, 2018.
14. Trevor Hastie, Robert Tibshirani, and Jerome Friedman. Unsupervised Learning, pages 485–585. Springer, New York, NY, 2009.
15. A. Jasanoff. The Biological Mind: How Brain, Body, and Environment Collaborate to Make Us Who We Are. Basic Books, 2018.
16. J. Liu and M. West. Combined Parameter and State Estimation in Simulation-Based Filtering, pages 197–223. Springer, New York, NY, 2001.
17. Marco Minozzo and Luca Bagnato. A unified skew-normal geostatistical factor model. Environmetrics, 32(4):e2672, 2021.
18. Glenn Roberts. How to weigh a galaxy cluster, 2014. https://www.symmetrymagazine.org/article/july-2014/how-to-weigh-a-galaxy-cluster
19. Max Roser. Human Development Index (HDI), 2014. https://ourworldindata.org/human-development-index
20. M. Samonas. Financial Forecasting, Analysis, and Modelling. John Wiley and Sons, Ltd, 2015.
21. I. N. Soyiri and D. D. Reidpath. An overview of health forecasting. Environmental Health and Preventive Medicine, 18(1):1–9, 2013.
22. G. Storvik. Particle filters for state-space models with the presence of unknown static parameters. IEEE Transactions on Signal Processing, 50(2):281–289, 2002.
23. M. R. M. Talabis, R. McPherson, I. Miyamoto, J. L. Martin, and D. Kaye. Chapter 1 - Analytics defined. In Information Security Analytics, pages 1–12. Syngress, Boston, 2015.
24. Martijn van Otterlo and Marco Wiering. Reinforcement Learning and Markov Decision Processes, pages 3–42. Springer, Berlin, Heidelberg, 2012.


25. J. Von Neumann, P. M. Churchland, and P. S. Churchland. The Computer and the Brain. The Silliman Memorial Lectures Series. Yale University Press, 2000.
26. Ann M. Weber, Marta Rubio-Codina, Susan P. Walker, Stef van Buuren, Iris Eekhout, Sally M. Grantham-McGregor, Maria Caridad Araujo, Susan M. Chang, Lia C. H. Fernald, Jena Derakhshani Hamadani, Charlotte Hanlon, Simone M. Karam, Betsy Lozoff, Lisy Ratsifandrihamanana, Linda Richter, and Maureen M. Black. The D-score: a metric for interpreting the early development of infants and toddlers across global settings. BMJ Global Health, 4(6), 2019.
27. Yong Zeng and Shu Wu. State-Space Models: Applications in Economics and Finance. Springer, New York, 2013.
28. Hongbo Zhao and Bingrui Chen. Inverse analysis for rock mechanics based on a high dimensional model representation. Inverse Problems in Science and Engineering, 29(11):1565–1585, 2021.

Chapter 2

Learning the Temporally-Evolving Evolution-Driving Function of a Dynamical System, to Forecast Future States: Forecasting New COVID19 Infection Numbers

Abstract A new method of forecasting future states of a generic dynamical system is provided in this chapter. We advocate a method for the supervised learning of the temporally-varying function that causes, or drives, the evolution of the considered dynamical system, in order to permit the forecasting of states. However, to enable such supervised learning, we need to generate the unavailable training data that comprises pairs of values of the input variables (time and state), and the evolution-driving function realised at a given value of this input. In fact, the evolution-driver is bespoke-learnt at a design input, by invoking the temporal variation of the probability density function of the phase space variables of this system. Having thus generated the originally-absent training data, the evolution-driving function is thereafter learnt, by modelling it with a Gaussian Process of relevant dimensionality. Subsequently, the evolution-driving function is forecast at a time point in the future. Then the forecast evolution-driver is placed in a generalised Newtonian paradigm, to compute the phase space coordinates at that time. After all, the evolution-driver is appreciated as the potential function that drives the dynamics of the system, where Newton’s Second Law allows for the connection between the phase space variables and the potential. An empirical illustration on the forecasting of new infection numbers of the COVID19 pandemic is discussed.

2.1 Introduction

In Chap. 1, we had a glimpse of the motivation behind seeking a methodology that allows for reliable forecasting of the values that an identified system parameter will attain at a future time point, where the system at hand is evolving with time. Such a temporally-evolving—or dynamical—system displays non-linear dynamics in general. Thus, a small variation effected on the current phase space coordinates may typically lead to widely different phase space coordinates at a future time. The discontinuities displayed in the distribution of the phase space variable with time, in a generic real-world dynamical system, may not necessarily be captured with the help of past trends in the evolution of this phase space variable. In light of such a possibility, a case is made here for the pursuit of the cause of the evolution of a deterministic system—from one state at a given time, to another state at another time.

© Springer Nature Switzerland AG 2023
D. Chakrabarty, Learning in the Absence of Training Data, https://doi.org/10.1007/978-3-031-31011-9_2

2.1.1 Time Series Modelling

Existing approaches within time series analysis forecast values that a phase space variable will attain at a future time point, reliant upon the assumption that the evolution of the system leading up to that future time point can be tracked, given the values that the phase space parameter has taken till the current time point [20, 25, 26, 36, 38]. Such an assumption may hold in certain systems, especially when forecasting into the very near future is sought. But if forecasting into the long-term future is sought in non-linear dynamical systems, reliance on data obtained so far may be a shortcoming [6, 9, 10, 32]. Basically, the evolution of a dynamical system can be such that (s.t.) the future evades capture by parametric models that attempt prediction of future states based on states observed so far, i.e. up to the current time. Supervised learning might not be relevant, since training data that comprises hitherto recorded trends in the evolution of the state space parameter may not be representative of the future, given that said trends may not be replicated in the future. An alternative approach is then sought, to permit reliable prediction into the future in non-linear dynamical systems, the evolution of which is itself evolving. Below, we dwell briefly on the background relevant to existing approaches, before considering the germane question of what this has to do with learning in the absence of training data. To offer a glimpse of the answer, it suffices to say the following for now. Here we advance a new technique for forecasting phase space parameter values of generic non-linear dynamical systems, by invoking the learning of the function that causes, or drives, the evolution of the system; having learnt this function, the evolution-driver is predicted at a test time.
Then, by inputting this predicted value into a generalised version of Newton’s Second Law, we will compute the rate of change of the phase space parameters, and subsequently the phase space parameters themselves, at this test time. However, there is no training data readily available for the proposed learning of the evolution-driver of the system—as a function of time and the phase space variable. (The model that suggests such inputs to the evolution-driving function will be detailed below.) Therefore, we will need to come up with a new bespoke learning strategy that generates this unavailable training data, to thereby permit the supervised learning of the temporally-varying evolution-driving function. Generation of the originally-absent and sought training set via such bespoke learning is Step I of this approach, to be followed by the supervised learning of the evolution-driving function in Step II, which will permit prediction of the phase space variable at a test time, as will be discussed within Step III. But first we discuss some of the prevalent approaches to forecasting, and place our proposed approach in the context of such background.

2.1.1.1 ARIMA, etc.

A pattern that existed in the past could replicate itself, and if so, it could in principle be exploited within the scope of an employed model, to estimate the phase space variable at the current time of consideration. Under the assumption that system evolution is governed by the superposition of a high-frequency change upon another, low-frequency change in phase space, a parametric pattern can be advanced to model the evolution of the phase space variable, where a rapidly changing trend is added to a slowly-evolving trend. Accuracy of forecasting made on the basis of such an approach is subject to the validity of the above assumption, in addition of course to the adequacy of the values assigned to variables that parametrise details—for example, the chosen parametric functions that describe system dynamics—of the fast and slow changes invoked in the model. In models of this type, when the same parameter values are suggested to fit the data at all times, universality of the dynamical behaviour is assumed within the used model. In particular, in a linear dynamical system, the evolution of the phase space variable X ∈ R^n with time T is given as dx(t)/dt = Hx, where H is a constant matrix. If the temporal variation occurs discretely in time, then the value of the phase space variable in the t-th time step is x_t = Hx_{t-1}. Attempting exploitation of an identified pattern in such a time series within the scope of linear models then implies using the local order structure, superimposed over an average, as in the autoregressive integrated moving average (ARIMA) and vector autoregressive (VAR) models (Chapters 9 and 12 respectively in [20]). By definition, success of such a linear approach is ensured by the absence of non-linearities in the time series of the phase space variable.
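The discrete-time linear evolution x_t = Hx_{t-1} can be sketched as a minimal simulation; the 2x2 matrix H and the initial state below are illustrative choices, not taken from the text.

```python
# Minimal sketch of discrete-time linear evolution x_t = H x_{t-1};
# the 2x2 matrix H and the initial state are illustrative choices.

def mat_vec(H, x):
    """Multiply a 2x2 matrix H by a 2-vector x."""
    return [H[0][0] * x[0] + H[0][1] * x[1],
            H[1][0] * x[0] + H[1][1] * x[1]]

def evolve(H, x0, n_steps):
    """Iterate x_t = H x_{t-1} for n_steps, returning the trajectory."""
    traj = [x0]
    for _ in range(n_steps):
        traj.append(mat_vec(H, traj[-1]))
    return traj

H = [[0.9, 0.1],
     [-0.1, 0.9]]          # a mildly rotating, contracting map
trajectory = evolve(H, [1.0, 0.0], 5)
print(trajectory[1])       # one step of the recursion -> [0.9, -0.1]
```

Because H is fixed, the whole future is determined by the initial state; the point of the surrounding discussion is precisely that real systems do not obey such a time-invariant map.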
On the other hand, real-world dynamical systems consistently display non-linearities that belie the justification for modelling dynamics “in terms of relationships that are constant or fixed in time” [16]. In addition, nonlinear time series models have been considered by Kantz and Schreiber [21] and Franses and van Dijk [12]. There are multiple prescriptions for extending these models to accommodate non-linearities. However, the success of the above forecasting approaches is essentially challenged at any future time at which the phase space variable attains values that defy the patterns of the past. One can suggest non-linear models of varying degrees of sophistication—and/or assess the predictability itself at a test time point, by advancing techniques to evaluate information replication at this time point—but if trends in the existing time series are not duplicated at a future time point, forecasting based on learning patterns in the existing observations is likely to be inaccurate. Indeed, the evolution of a dynamical system can be s.t. the future differs from hitherto observed trends, and it then follows that the system state at a future time evades capture by forecasting tools that work by recognising patterns in past data. This underlines the fundamental shortcoming of the methods mentioned above. Again, there is a huge amount of work done with State Space Models, which address the evolution of a hidden state via a prescribed model, while the hidden state at any time connects to the observations at that time via another model [2, 7, 11, 19, 22].


The state space approach can acknowledge non-linearity, and is applicable to multivariate time series, as well as to both continuous time and discrete time steps. However, relaxation of the Markov assumption on the state may need to be imposed differently at different times. Additionally, a parametrised, matrix representation of the transition operator in the state equation may require different revisions at different times; similarly, a matrix representation of the operator that links state and observation is a misrepresentation in typical non-linear dynamical systems, the phase spaces of which are marked by complexity. At different times, such a dynamical system will typically demand varying corrections to such matrix representations and/or the parametrisation invoked in these state space models. It then follows that such strategies will forecast correctly only as long as applications are made to systems in which evolution falls within the remit of such pre-chosen models; in systems that evolve differently, forecasting at a test time point will be incorrect. Needless to say, the error of a forecast will become apparent only after the test time point is past; at all times till then, there would be no information available on whether the forecast had been correct or not. This is a difficulty that can have consequences for policymaking, especially in relation to long-term forecasting. It appears that a more robust forecasting approach is called for, in order to mitigate the limitations of the above strategies.
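The linear-Gaussian special case of the state space models cited above can be sketched with the standard Kalman filter recursions. The scalar model, its parameters, and the observations below are all illustrative assumptions, not drawn from the text; the sketch only makes concrete the "transition operator" and "state-to-observation" operators discussed.

```python
# Hedged sketch of a scalar linear-Gaussian state space model:
# hidden state s_t = a * s_{t-1} + process noise (variance q),
# observation y_t = s_t + observation noise (variance r),
# filtered with the standard Kalman recursions. All values illustrative.

def kalman_filter(ys, a=0.95, q=0.1, r=0.5, m0=0.0, p0=1.0):
    """Return the filtered state means for the observation sequence ys."""
    m, p = m0, p0
    means = []
    for y in ys:
        # predict step: propagate mean and variance through the transition
        m_pred = a * m
        p_pred = a * a * p + q
        # update step: fold in the current observation
        k = p_pred / (p_pred + r)        # Kalman gain
        m = m_pred + k * (y - m_pred)
        p = (1.0 - k) * p_pred
        means.append(m)
    return means

means = kalman_filter([1.0, 1.2, 0.9, 1.1])
print(means)
```

The fixed `a`, `q`, `r` are exactly the "pre-chosen model" ingredients the text criticises: if the true transition changes over time, these recursions keep filtering with the wrong operator.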

2.1.1.2 Empirical Dynamical Models

In this context, the concept of an Empirical Dynamical Model (or EDM) has been advanced; in this approach, the focus is on the attractor-generated dynamics of the time series of phase space variables [8, 27, 37]. Indeed, as the authors of [16] suggest, description of the dynamics of real-world systems requires invoking the temporally-varying relationships between relevant variables, including the phase space variable—thus motivating EDMs, which abide by such a requirement. Forecasting is undertaken by an EDM, while acknowledging non-linearities in the time series of the phase space variable, by identifying the attractor that drives the system dynamics, given the observed time series of the relevant variables. Thus, a basic assumption of these models is that the system dynamics is deterministically identifiable; indeed, a “purely stochastic” dynamical system cannot be modelled by an EDM with the aim of performing forecasting [31]. The attempt in the EDM approach is to undertake selection of the model (of the time series) that optimises the embedding dimension and a “tuning parameter”, given the sample at hand. This tuning parameter modulates the correlation, allowed in the model, between the time series variables and the state of the dynamical system. The founding idea of the EDM approach is laudable, as it encourages the connection between the underlying dynamics of the system and the values of relevant parameters that are manifest. However, the implementation relevant to the identification of the attractor, or the undertaking of the cross-validation in the different implementations of EDMs [16], could be cumbersome. More importantly, the results of the forecasting appear to be sensitive to the parameters of the


model, which is a concern. For example, robustness of forecasting is likely to be subject to the accuracy of identification of the optimal temporal variation in the embedding dimension and tuning parameters; such variation would be manifest in non-linearly evolving dynamical systems. (Objective uncertainty quantification of such parameters is of accompanying interest, especially in such situations.) In other words, when the information from the past is not necessarily interpretable at the present as deterministic, attempts to parametrise dynamics-driven forecasting could be compromised. Lastly, attractors are only one kind of descriptor of the non-linear dynamics that is manifest in the evolution of a generic dynamical system. Thus, attractors are necessary, but not sufficient, embodiments of the cause behind the effect that is the evolution of the system. We need to build the toolbox that generates the future state of the system on a more comprehensive descriptor of system non-linear dynamics than attractors alone.
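The attractor reconstruction that EDMs rely on typically proceeds by time-delay embedding of the observed series, with the embedding dimension and the delay as the very tuning choices discussed above. A minimal sketch, with illustrative values of both:

```python
# Minimal sketch of time-delay embedding, the construction underlying
# EDM-style attractor reconstruction: each point pairs the series value
# with its lagged copies. The embedding dimension E and delay tau are
# the tuning choices referred to in the text; values here are illustrative.

def delay_embed(series, E=3, tau=1):
    """Return the list of E-dimensional delay vectors of the series."""
    n = len(series)
    start = (E - 1) * tau
    return [[series[t - j * tau] for j in range(E)] for t in range(start, n)]

series = [0.1, 0.4, 0.9, 0.2, 0.7, 0.5]
points = delay_embed(series, E=3, tau=1)
print(points[0])   # -> [0.9, 0.4, 0.1]
```

The sensitivity the text worries about is visible here: changing `E` or `tau` changes the reconstructed point cloud, and hence any forecast built on nearest neighbours within it.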

2.1.1.3 Our New Approach and EDM

The idea of using system dynamics towards the forecasting of future states—as included to motivate EDMs—is an excellent one. This chapter introduces a new way of attempting the same, within a streamlined approach, which is however fundamentally different from any other forecasting technique discussed in the literature so far. In this approach, the state is computed at a future time point in a generic dynamical system, for which the evolution-driving function is learnt. How we perform such learning is something we will discuss soon. But for now, we return to the previously posited argument that the state at a test time point should be “forecast-able” in a deterministic system; earlier we had wondered how existing information can be invoked to accomplish this. Now, treating this system as deterministic, the Newtonian paradigm will yield a computable state at any time, as long as we possess knowledge of the evolution-driver that determines the evolution of the system from the current state to that at the test time point. Such knowledge is not readily apparent within “existing information”, and therefore needs to be learnt, given the information available cumulatively till the current time. The essence of the method is that Newton’s Second Law describes the evolution of the phase space variable of the system in terms of this evolution-driver. To be precise, the temporal variation of the rate of change of the location variable is driven by the gradient of the learnt evolution-driving function, where the “gradient” is computed with respect to the location variable.
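Numerically, this relation can be sketched with a finite-difference gradient. The harmonic potential below is an illustrative stand-in for the learnt evolution-driver, and the sign convention (rate of change of the rate variable as the negative gradient of the potential) is the usual Newtonian one.

```python
# Sketch of the Newtonian relation invoked above: the rate of change of
# the rate variable ("acceleration") follows from the gradient of the
# potential with respect to the location variable. The quadratic
# potential is an illustrative stand-in for a learnt evolution-driver.

def potential(x):
    return 0.5 * x ** 2          # illustrative: harmonic potential

def acceleration(x, h=1e-6):
    """-dPhi/dx by a central finite difference."""
    return -(potential(x + h) - potential(x - h)) / (2.0 * h)

print(acceleration(2.0))   # close to -2.0 for Phi = x^2 / 2
```

In the actual method, `potential` would be the bespoke-learnt function rather than a hand-written one; the point is only that once the potential is in hand, the dynamics follows by differentiation.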

So at this point, it is clear to those familiar with Dynamics that the sought evolution-driver is the potential of the system [17]. Then, concerns about the methodology used for learning this potential at a given time are prefaced by


what the “potential” of a generic dynamical system is, in the context of our approach. Ultimately, we will answer the question of the methodology relevant for undertaking the computation of the phase space variable, given such a potential learnt at a given time point.

2.1.2 Why Seek Potential?

Thus, in our new approach, forecasting is in fact performed on the evolution-driving function that causes, and therefore informs directly on, the system dynamics; the state is computed thereby.

For the deterministic system—irrespective of its signature non-linear dynamics—this evolution-driver is what causes the system to transit to a state at a given time point, given the state that it was in at a past time. This evolution-driving function is the system potential that links to the phase space variables deterministically, via Newton’s 2nd Law, in a classical or deterministic system. So learning the potential offers the fundamental tool that permits informed state forecasting. Hence our pursuit of this potential function. Here we have included the qualification “informed” to emphasise acknowledgement of the inputs from system dynamics towards said state forecasting.

Indeed, though the system is essentially deterministic, our learning of the potential occurs within an information-deficient paradigm. This renders our learning of the potential probabilistic; therefore our state forecasting also carries uncertainty. The form of this sought evolution-driver is, in general, not as sensitive to temporal variations as the phase space variable itself is, in a non-linear and non-equilibrium situation. Therefore a more robust forecasting is likely of the potential than of the state. (Later, in our empirical illustration, we will note that the form of the learnt potential varies much less over the period during which the potential is being bespoke-learnt, compared to the location variable, values of which nearly double during this period—see the middle panel of Fig. 3.2 and the lower left panel of Fig. 3.3.) The phase space variable is computed subsequent to the achievement of the forecast/predicted value of this dynamics-informing potential function. In fact, the potential is a function of the phase space variable and time, s.t. it informs on the evolution of the system phase space variable as time varies.


Potential and Thereafter

The argument that motivates the search for such an evolution-driver is based in the very basics of Newtonian dynamics (as well as in non-Newtonian dynamical treatments). If we treat a state space variable of a system as the generalised “location” variable, and the rate of change of this state space variable as the generalised rate (or “velocity”) variable, then learning the evolution-driver, or the “potential”—as a function of time and the state space variable—will in principle allow us to forecast the evolution-driver at any other time point, as long as the model assumptions hold at all relevant time points. Once the evolution-driver itself is known, the state space variable and its rate can be computed within a generalised approach to Newton’s Second Law.
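Given a potential, the computation of the location and rate variables can be sketched by straightforward numerical integration of Newton's Second Law. This is a plain (not "generalised") Newtonian toy: the potential, initial conditions, and integrator choice are illustrative assumptions, not the book's actual scheme.

```python
# Minimal sketch of "potential and thereafter": once Phi(x) is in hand,
# the generalised location and rate variables are advanced under
# v' = -grad Phi(x), x' = v, here with simple symplectic (Euler-Cromer)
# steps. The potential and initial conditions are illustrative.

def grad_phi(x):
    return x                      # illustrative: Phi(x) = x^2 / 2

def integrate(x0, v0, dt=0.01, n_steps=100):
    """Advance (location, rate) from (x0, v0) over n_steps of size dt."""
    x, v = x0, v0
    for _ in range(n_steps):
        v -= grad_phi(x) * dt     # rate update from the potential's gradient
        x += v * dt               # location update from the rate
    return x, v

x1, v1 = integrate(1.0, 0.0, dt=0.01, n_steps=100)   # ~one time unit
print(x1, v1)   # close to (cos 1, -sin 1) for this harmonic potential
```

For this toy potential the exact answer is known, which makes the sketch checkable; in the method proper, `grad_phi` would come from the forecast potential at the test time.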

Connection to Bespoke Learning

The cornerstone of this method is then the knowledge of the evolution-driver at the test time, when the state is sought. We will resort to forecasting this function at the test time, subsequent to the supervised learning of the temporal variation of the evolution-driver. Now, such supervised learning necessitates a training dataset comprising pairs of values of time and the evolution-driver realised at the design time. However, we do not appear to know what the evolution-driving function (i.e. the potential) of a given dynamical system is, at a chosen time point. So we will need to motivate a methodology for undertaking the learning of this sought evolution-driver at a chosen time—indeed, such learning then concurs with what was identified as bespoke learning in the previous chapter. In fact, a novel method for bespoke learning of the evolution-driver (i.e. the potential function) is provided in this chapter: the potential is embedded within the support of the probability density function of the phase space variable relevant to the given dynamical system, and this helps define the likelihood of the sought parameters, given the data. Such a likelihood is then invoked in Bayes rule, and Bayesian inference on all unknowns is undertaken using Markov Chain Monte Carlo, or MCMC, techniques; see Appendix A. Having thus (bespoke) learnt the evolution-driver at each design time, we populate the originally-absent training dataset that comprises time-potential pairs. Generation of this training set then renders possible the supervised learning of the temporal evolution of the potential, allowing for its forecast at a test time. Using such a forecast potential in the (generalised) Newtonian paradigm, values of the rate and location variables at that test time are computed. Notwithstanding all this implementational novelty, the very interpretation of a potential in a generic dynamical system remains unexplained so far, and this will be discussed below.
In fact, the “generalisation” of the Newtonian paradigm—under which the system state is computed at a test time, given the predicted potential at this time—is another dynamical generalisation that will be discussed clearly, to elaborate on the detailed formulation of this forecasting strategy.
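The "embed the potential in the phase space pdf, then do MCMC" step can be caricatured in a heavily simplified form. Suppose, purely for illustration, that the pdf of the location variable is proportional to exp(-Phi(x)) with Phi(x) = k x^2 / 2, so that x is Gaussian with variance 1/k; a Metropolis sampler then recovers k from phase-space samples. The parametric form of Phi, the flat prior, and all constants are assumptions of this toy, not the book's actual likelihood construction.

```python
import math
import random

# Toy caricature of bespoke learning of a potential parameter: assume
# pdf(x) proportional to exp(-Phi(x)) with Phi(x) = k x^2 / 2 (so x is
# Gaussian with variance 1/k), and infer k via Metropolis sampling.
# All modelling choices here are illustrative assumptions.

random.seed(0)
k_true = 4.0
xs = [random.gauss(0.0, 1.0 / math.sqrt(k_true)) for _ in range(500)]

def log_lik(k, xs):
    """Log-likelihood of k under the assumed Gaussian pdf (flat prior)."""
    if k <= 0.0:
        return float("-inf")
    return 0.5 * len(xs) * math.log(k) - 0.5 * k * sum(x * x for x in xs)

def metropolis(xs, n_iter=5000, step=0.2):
    k = 1.0
    ll = log_lik(k, xs)
    samples = []
    for _ in range(n_iter):
        k_prop = k + random.gauss(0.0, step)   # random-walk proposal
        ll_prop = log_lik(k_prop, xs)
        if math.log(random.random()) < ll_prop - ll:
            k, ll = k_prop, ll_prop            # accept
        samples.append(k)
    return samples

samples = metropolis(xs)
k_hat = sum(samples[2000:]) / len(samples[2000:])   # post burn-in mean
print(k_hat)   # posterior mean, close to k_true = 4.0
```

The actual method embeds the potential in the support of the full phase space pdf and samples many more unknowns, but the mechanics of "likelihood from a pdf with the potential inside it, then MCMC" are as above.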


Supervised Learning of Potential Function Using Generated Training Set

To return to the supervised learning of the temporal dependence of the evolution-driving function: forecasting in generic real-world dynamical systems is permitted by our modelling of this sought evolution-driver—or potential function—as generically as possible, i.e. by not imposing any parametric structure on it. Instead, we treat it as a random realisation from an appropriate non-stationary stochastic process (such as a Gaussian Process). The parameters of this stochastic process will be learnt using the data, s.t. we can predict the mean potential function that is sampled from this process, along with the uncertainty on this sought function (as the predicted variance on the potential function realised from the identified process).
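As a concrete caricature of this supervised-learning stage, the sketch below fits a zero-mean Gaussian Process to hypothetical time-potential training pairs and predicts the mean and variance at a test time in closed form. The design times, potential values, noise level, and kernel (a stationary squared-exponential, whereas the text advocates a non-stationary process) are all illustrative assumptions.

```python
import math

# GP regression toy for Step II: fit time-potential training pairs and
# predict mean and variance at a test time. Kernel, hyperparameters and
# data are illustrative; the book's process is non-stationary, this is not.

def kern(t1, t2, amp=1.0, ell=1.0):
    return amp * math.exp(-0.5 * ((t1 - t2) / ell) ** 2)

def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination (small systems only)."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        M[col] = [v / M[col][col] for v in M[col]]
        for r in range(n):
            if r != col:
                M[r] = [vr - M[r][col] * vc for vr, vc in zip(M[r], M[col])]
    return [M[i][n] for i in range(n)]

def gp_predict(ts, ys, t_star, noise=1e-6):
    """Closed-form GP predictive mean and variance at t_star."""
    K = [[kern(a, b) + (noise if i == j else 0.0)
          for j, b in enumerate(ts)] for i, a in enumerate(ts)]
    alpha = solve(K, ys)
    k_star = [kern(t, t_star) for t in ts]
    mean = sum(a * k for a, k in zip(alpha, k_star))
    v = solve(K, k_star)
    var = kern(t_star, t_star) - sum(k * w for k, w in zip(k_star, v))
    return mean, var

# design times and bespoke-learnt potential values (illustrative)
ts, ys = [0.0, 1.0, 2.0], [0.5, 0.8, 0.6]
mean, var = gp_predict(ts, ys, 1.5)
print(mean, var)
```

The predicted variance is exactly the uncertainty on the realised potential that the text refers to, and would propagate into the uncertainty on the computed rate and location variables.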

Prediction Following Supervised Learning

At the last stage of this forecasting strategy, we predict the evolution-driver (or potential) at a test time, and compute the state of the system at this time, within a generalised Newtonian paradigm. In fact, we will perform closed-form prediction of the mean potential and the variance on it, within a paradigm that emulates Newton’s 2nd Law, to compute the rate variable—with uncertainties on the same stemming from the uncertainties on the learnt potential—and the location variable. Thus this strategy is a three-staged one:
1. Step I: bespoke learning of the potential at each of a set of design times, to give rise to a training dataset comprising time-potential pairs;
2. Step II: supervised learning of the evolution-driver (or potential function), by treating it as a sample function of a stochastic process, the parameters of which we learn given the training dataset generated in the previous step;
3. Step III: prediction of the mean and variance of the potential at a test time, and computation of the phase space variables therefrom.

Why Model Potential Function as Realised from a Stochastic Process?

Thus, though the underlying system dynamics is deterministic, in our approach the system dynamics—driven by the potential function—is viewed as stochastic, given our treatment of the potential as a random realisation from a stochastic process. This approach serves only to compensate for the essential lack of information about the phase space pdf and the evolution-driver (or potential function) of this system, at any time point. Thus this new approach, discussed below, is fundamentally different from the EDM setup. A crucial ingredient of this difference lies in our pursuit of the potential function (and the phase space pdf), as distinguished from the attractor, as within EDMs. Even as we model the generative stochastic process of the unknown potential function as non-stationary, it still merits mention that the temporal sensitivity of a phase


space variable is higher in the presence of non-linearities in system dynamics than is that of the form of the (unknown) potential function. The seeking of the potential is unique to this work. That it is sought, while acknowledging anticipated inhomogeneities in the correlation between its realisations as it varies temporally, and with the phase space variable, is the mainstay of the method. Other than the fact that system dynamics is invoked to permit state forecasting both in EDMs and in this new search for the system potential, nothing else is in fact common to the two frameworks. Our approach is Bayesian, and also allows for the computation of objective uncertainties on all learnt unknowns, and thereby on the computed location variable and the rate, at any time point of interest.

2.1.3 Hidden Markov Models and Our Approach

It may be perceived that our method is a Hidden Markov Modelling (HMM) based forecasting approach [4], with the phase space pdf that we seek in our work considered the "hidden" state. The method introduced here approaches an HMM description most closely if the unknown phase space pdf at a given time is treated as the hidden state, and the potential at a given time and state is treated as the "observed" state at that time. Firstly, such an equivalence is misplaced, since the potential is not an observed/observable state, but a construct that we will need to learn, using the observables that are the phase space variables that describe system dynamics. Indeed, had we treated the potential as the "hidden" state and the phase space variables as the "observed" state, in our effort to find an analogy between this new method and HMMs, that strategy would have failed, since the connection between the hidden and observed states in that interpretation is not probabilistic, but known—namely, Newton's 2nd Law. This is what had prompted us to interpret the phase space pdf as hidden, and the potential as the observed state.

However, even if the alternative interpretation were to hold, in this new method we find no need to impose Markovianity between potential functions that are realised at neighbouring time points. (Indeed, there exists no need to impose Markovianity between realisations of the phase space pdf at adjacent time steps.) Additionally, as we will see below, we do not seek the conditional probability of the potential at a time, given the phase space pdf at other times. Instead, the temporal variation of the pdf directly feeds into—and at the same time, is fed by—system dynamical information; such a dynamics-fed temporal variation of the pdf permits the embedding of the potential function in the support of this pdf. In other words, interpretation of the pdf as exactly what it is, namely a probability density function, is of relevance here.
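To make the contrast concrete, here is a minimal HMM forward filter: the link between hidden and observed states is probabilistic (an emission matrix), which is precisely the property that fails to hold in the interpretation discussed above, where the link would be the deterministic Newton's 2nd Law. All matrices below are invented for illustration.

```python
import numpy as np

A = np.array([[0.9, 0.1],
              [0.2, 0.8]])   # hidden-state transition probabilities
B = np.array([[0.7, 0.3],
              [0.1, 0.9]])   # emission probabilities P(observed | hidden)
pi = np.array([0.5, 0.5])    # initial hidden-state distribution

def forward_filter(obs):
    """Return P(hidden state | observations so far) after the last step."""
    alpha = pi * B[:, obs[0]]
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]   # standard forward recursion
    return alpha / alpha.sum()

posterior = forward_filter([0, 0, 1])
print(posterior)
```

No such emission matrix exists in the present method: the "observed" construct (the potential) is tied to the dynamics deterministically, which is why the HMM analogy breaks down.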
It is the embedding of a sought unknown function into the support of the sought unknown pdf that allows for the bespoke learning of the potential at chosen (or designed) time points, using which we ultimately learn the parameters of the stochastic process that generates such a potential function. Subsequent prediction of the potential is possible then—and only after the potential is predicted at a test time point can we compute the location variable and its rate at this test time. We can then see that such potential prediction is essentially non-linear, i.e. forecasting future states of a non-linear dynamical system is very much within the scope of our work, as is forecasting given continuous-valued states.

There is no demand on adherence to a prescribed parametrisation of nonlinearity, or of observation-dependence in our work—unlike in a non-linear extension of a State Space Model.

The continuous-valued phase space pdf and the continuous-valued phase space variables can, in principle, be computed at continuously varying times. Furthermore, all considerations of non-Gaussian/Gaussian noise in the observations can be addressed within the Bayesian potential learning indicated above, by convolving the error density of the observations, with the likelihood of all model parameters given the data, in every iteration of the Metropolis-within-Gibbs based inference that we undertake to learn the potential.
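The noise-handling remark above can be made concrete with a hedged sketch of a single-parameter Metropolis sampler in which Gaussian observational error is folded into the likelihood. For two Gaussians the convolution is again Gaussian, with variances added; a non-Gaussian error density would require numerical convolution instead. All numbers below are invented.

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented noisy observations of a single unknown level mu:
y = np.array([0.9, 1.1, 1.05])
sigma_model, sigma_obs = 0.1, 0.2

def log_likelihood(mu):
    # Convolving two zero-mean Gaussian densities (model scatter and
    # observational error) yields a Gaussian with added variances:
    var = sigma_model**2 + sigma_obs**2
    return -0.5 * np.sum((y - mu)**2) / var

def metropolis_step(mu_current):
    """One Metropolis update with a Gaussian random-walk proposal."""
    mu_prop = mu_current + rng.normal(scale=0.3)
    log_accept = log_likelihood(mu_prop) - log_likelihood(mu_current)
    return mu_prop if np.log(rng.uniform()) < log_accept else mu_current

mu, samples = 0.0, []
for _ in range(3000):
    mu = metropolis_step(mu)
    samples.append(mu)
print(np.mean(samples[1500:]))   # should settle near the data mean (~1.02)
```

In the book's inference this update sits inside a Metropolis-within-Gibbs scheme over many unknowns; the sketch only isolates how the error density enters the acceptance ratio through the convolved variance.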

Form of Phase Space pdf Is Unknown Lastly, there is no reason to believe that the phase space pdf is (truncated) Normal or skew-Normal—or even unimodal—as we possess no information on the form of the pdf in general. In fact, we are very keen to ensure that in this information-sparse environment, we do not inadvertently impose a chosen correlation structure on the phase space pdf that is learnt at a design time point—lest the results become prior-driven, and are then misconstrued as acceptable. Indeed, results of our empirical illustration obtained within such a mind-set indicate that such scepticism is well-founded.

2.1.4 Markov Decision Processes; Reinforcement Learning and Our Approach

How does the solution that we are proposing here, for forecasting in a generic dynamical system, compare with a Markov Decision Process, and more generally, with reinforcement learning? The gist of these techniques—discussed here in the context of discrete times—is to take the action A to move to a state X_i in the i-th time step, with probability P_A(X_{i−1}, X_i), from the state X_{i−1} that is currently attained by the system at this (i−1)-th time step, s.t. a reward R_A(X_{i−1}, X_i) is obtained, (where the reward depends only on the action and the current state in a Markov Decision Process). In reinforcement learning, one of the goals is the cumulative reward given a state and/or action, where a predicted reward helps in the identification of the action in the next time step. The other goal links to the desired identification of the action at the next step. To be precise, at the current step, the probability distribution of the action variables is generated using N "simulated" states that are considered candidates for X_i, given the state at the current time. Then, using the output from both or either of these goals, a model is trained to access the state at the next time step. Concerns with these techniques pertain to:
1. the set of simulated states that are considered accessible as the next state after the current. However, non-linear dynamics manifests in the system's evolution from one state to another, so that at the next time point, the state can indeed be very far from the current. Acknowledgement of such a possibility, while generating the set of (simulated) candidates for the next state, appears difficult in the absence of information on what causes the evolution of the system. Definitely, the past selection of states is not an indication of such sought information, if the non-linearity has not shown up yet, or has shown up differently in the past, compared to the unknown future ahead.
2. the reward function that will guide the choice of the future state, from within the set of (simulated) possible states, is not informed by the system dynamics, but is instead fixed by choice. Typically, the reward function is ascribed a (parametric) form that depends on the action and the current state, (and perhaps the future state). Ideally, the reward—the optimisation of which will inform on what the next state should be—can be constrained to yield the next state with the correct probability, only if that next state is known. As this is not the case, a universal protocol for designing a reward function is not achievable for a generic dynamical system, and approximations might need to be relied upon.
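The discrete-time setup described above can be made concrete with a toy Markov Decision Process in the section's notation, with transition probabilities P_A(x_prev, x) and rewards R_A(x_prev, x). All matrices are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Two states (0, 1) and two actions; P[a][i, j] is the probability of
# moving from state i to state j under action a, and R[a][i, j] the
# associated reward. All numbers are invented.
P = {
    "stay": np.array([[0.9, 0.1],
                      [0.1, 0.9]]),
    "jump": np.array([[0.2, 0.8],
                      [0.8, 0.2]]),
}
R = {
    "stay": np.array([[1.0, 0.0],
                      [0.0, 1.0]]),
    "jump": np.array([[0.0, 2.0],
                      [2.0, 0.0]]),
}

def step(x_prev, action):
    """Sample the next state, and collect the reward R_A(x_prev, x)."""
    x = rng.choice(2, p=P[action][x_prev])
    return x, R[action][x_prev, x]

x, total = 0, 0.0
for _ in range(100):
    x, r = step(x, "jump")
    total += r
print(x, total)
```

Note that the reachable next states here are fixed in advance by the chosen transition matrices, which is exactly the restriction criticised in the text: a non-linear system may attain a state outside any such pre-specified candidate set.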

Instead, here we are trying to learn the causation of the evolution, by learning the evolution-driving function, aka, the potential function—under the assumption that there exists a time-scale over which the system evolution does not vary with time. This relieves us of the need to select a future state from a constructed set of possible future states, via a constructed design protocol, which is bound to invite heuristics that are not informed by system dynamics. We will see in our empirical illustration that our state forecasting works even at a time point, at which departures from the past trends in the rate variable are manifest.

As said above, the nature of the collection of future states that is simulated in Markov Decision Processes, and in general in reinforcement learning, is arbitrary in the sense that, irrespective of how small or big N (the size of this collection) is, the state that is attained at the next time point may be very far from the current state. Indeed, the state at the next time point may be arbitrarily far from the current state in a chaotic system, which is nonetheless deterministic, and therefore state forecasting is fundamentally possible via Newton's 2nd Law, if the potential is known. Even otherwise, when the future state is not necessarily a replicate of a past state, it is probable that the simulated collection of future states precludes the state that is actually attained at the following time point. This is valid in a generic non-linear dynamical system that is evolving s.t. its evolution-driving function is itself time-dependent.

What such a future state will be is not answered by constructing a set of possible states—perhaps guided by past experience—but instead, by imposing causality on what the state at the next, i.e. the i-th, time point should be. In other words, it is the dynamics that is intrinsic to the way the system is evolving that needs to be invoked, in order to compute the probability for the state at the next (i.e. the i-th) time point to be X_i, conditional on the current, and perhaps some or all of the past states till the 1st time point, namely, X_{i−1}, X_{i−2}, …, X_1. Such dynamics is not specified by the past states that the system has attained from the 1st to a finite (i−1) number of time points—such a finite collection of past states is only a manifestation of said system dynamics. In fact, any finite collection of states is a manifestation of the system dynamics, but no such collection can specify the system dynamics. Rather, system dynamics is informed upon in its entirety by the function that determines the evolution of a generic dynamical system, namely the time- and state-dependent evolution-driving function of this system, aka the potential function. So, the simulated set of states on which the reinforcement learner is trained, in order to perform a one-step-ahead forecast, cannot in principle be a correct representation of system dynamics at all time points. One has to be particularly lucky, or one needs to work with a dynamical system that is evolving particularly simply—such as linearly—in order for a chosen simulated collection to successfully capture the full range of states that the system could be at, in the next step. Adapting the simulated collection by experience can be disastrous, in systems in which the considered future is not a replicate of the past. The objective here is to provide an automated protocol for probabilistic state forecasting, in a generic dynamical system.

The Markov dependence of the state in the i-th time step, (which is inherent in Markov Decision Processes), may need to be relaxed in the presence of non-linear dynamics, but the required relaxation may span different past states, for different i. In practice, computational costs typically inhibit using a comprehensively large set of simulated states, encouraging errors. Again, the reward function is an existing difficulty in reinforcement learning [1, 13, 35]. To recap what was said above, the reward attained at any step is what permits the choice of action A = a over A = a′. In a Markov Decision Process [5], ideally, any criterion for selecting the reward is designed to ensure that the reward is maximal when the state that is attained at the following time point is the state that is truly attained by the system then. But at the (i−1)-th time step—when the reward is being decided upon—since there is no information on what this "truly attained" state will be at the i-th step, any criterion for converging on the best reward function at the (i−1)-th step will be insufficiently informed upon, and therefore the reward may be incorrectly formulated.

Our proposed approach is not affected by such worries. As we will see from a more detailed discussion of our approach, learning the evolution-driver of the system will enable identification of the state in the following time step. We do not need to decide which of some "possible" states are accessible in the next time step. Questions about a deterministic identification of that sought state at a future time point can be addressed by what we indicate above, namely: in deterministic systems in which the evolution-driver is fully known, prediction of the future state is indeed deterministic. Of course, given that our learning of this evolution-driver is probabilistic, even for deterministic dynamical systems, our prediction of attainable states in the following time step is probabilistic, i.e. is accompanied by uncertainty. As long as the system is a classical dynamical system, it should in principle be tractable within one of the centuries-old dynamical treatments, in which the causation of the state transitions is linked to the evolution-driver, or the potential function. Our approach merely taps into this existing body of knowledge, seeded in Physics and Applied Mathematics, since times past.

2.1.5 A New Way to Forecast: Learn the Evolution-Driving Function

We attempt prediction of the value of the system phase space parameters, at a given future time. Treated as a dynamical system, we will treat the system phase space to host the location random variable (r.v.) and the rate r.v., where this rate is defined as the time derivative of the location r.v. It is noted that all through this chapter, we equate this rate variable with the momentum variable—in our consideration of the dynamics, the momentum r.v. is the scaled rate r.v. Thus, we will on occasion consider the system to be Hamiltonian [30], and yet, unlike in Hamiltonian dynamics—in which one expresses the dynamics in terms of location and momentum—we will work with the location and "rate" variables at a given time, as the phase space variables defining the system state. The underlying motivation is to learn the evolution-driver of this dynamical system. Subsequent computation of the rate, (and thereby the location r.v.), at the learnt evolution-driver, follows an approach borrowed from the fundamentals of Newtonian Dynamics. In a classical dynamical system, if we know the evolution-driving function—which will in general be a function of the phase space parameter vector, and of time—we will be enabled to predict the phase space coordinates at a future time, given the data on these coordinates that have been obtained till the current time. Such forecasting of phase space coordinates is after all what we undertake in Dynamics.

Thus, the crux of the argument is that the evolution-driving function will need to be learnt. As we refer to this evolution-driver as the potential function, following standard nomenclature, we say that the potential function will be learnt—as dependent on the components of the phase space r.v., and time—with the potential treated as a random realisation of an adequately chosen stochastic process that will in general be non-stationary. Learning the covariance structure of this non-stationary stochastic process is then what the challenge reduces to. The training data that is required for the learning of the potential function is, however, unavailable to begin with. At the outset, we do not have information on what the value of the potential function is, at a given value of the phase space variable and the time r.v. T. In other words, at design points defined by the phase space variable and time, we cannot generate a value of the potential function. Therefore any training set that we might have envisaged populating—with the ambition of learning the potential function—remains unachievable. So it may appear that we cannot undertake the supervised learning of the potential function, since the training data required for such a learning exercise is not accessible. In light of this, the aforementioned dynamics-driven approach to forecasting then demands that such originally-absent training data be generated, to render the potential function learnable at a given phase space variable and given T. The generation of such originally-absent training data falls under the purview of bespoke learning, which was motivated in the last chapter, and this is undertaken using the novel embedding of the potential inside the support of the probability density function (pdf) of the phase space variables.
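The Newtonian premise above can be illustrated directly: if the potential were known, state forecasting would reduce to integrating Newton's 2nd Law. The quadratic potential below is an invented stand-in, chosen because its exact solution (simple harmonic motion) provides a check on the integration.

```python
import numpy as np

# Invented quadratic potential phi(x) = 0.5 * k * x**2; the exact solution
# with x(0)=1, v(0)=0 is x(t) = cos(sqrt(k) * t), which provides a check.
k, dt = 1.0, 1e-3

def grad_phi(x):
    return k * x

x, v = 1.0, 0.0
for _ in range(int(np.pi / dt)):   # integrate for half a period
    v -= grad_phi(x) * dt          # dv/dt = -d(phi)/dx (2nd Law, unit mass)
    x += v * dt                    # dx/dt = v (symplectic Euler update)
print(x)                           # close to -1, i.e. cos(pi)
```

The whole difficulty addressed in this chapter is that, unlike in this sketch, grad_phi is not available in advance: the potential itself must first be learnt.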
This novel methodology for the bespoke potential learning in fact only offers values of the potential at temporal locations within designed time windows, during which the system is modelled as an autonomous Hamiltonian system, i.e. a dynamical system that abides by Hamilton's equations, with the Hamiltonian bearing no explicit time-dependence [18, 23].¹ Then the potential bears no explicit time dependence. In the system that is autonomous during a time window, the phase space pdf is also bereft of an explicit dependence on time during that time window. However, across time windows, the form of the potential function—and of the phase space pdf—differs from one time window to another. Thus, such bespoke learning promises the originally-absent training data for the supervised learning of the potential, where the promised training data is a set of pairs of: the design time corresponding to the centroid of a time window, and the "bespokely-learnt" potential in this time window. Indeed, in another time window, the "bespokely-learnt" potential has a different value, leading to distinct time-potential pairs that comprise the training set. Such bespoke learning of the potential within any such time window, (accompanied by the learning of the phase space pdf at this time window), comprises Step I of our overall learning+forecasting strategy.

¹ Defining the dynamical system as autonomous with a 2-dimensional phase space within any time window, is equivalent to considering the system within any time window to be a 3-dimensional non-autonomous dynamical system, with the 3rd dimension being time.

Any such time window is by design s.t. system dynamics within it is assumed Hamiltonian, i.e. abides by Hamilton's equations of motion. The system is also assumed to be autonomous during any such time window, i.e. the potential and phase space pdf do not bear an explicit time dependence, though across time windows, system dynamics can definitely be non-autonomous, s.t. the potential is modelled as varying from one time window to another—as is the phase space pdf. Another global feature of this potential model is that this function is in general rate-independent, s.t. in our model, the system potential is assumed dependent on the location variable X(T) and on time T, s.t. the random potential function is Φ(X(T), T).

In fact, given the expectation that across time, the system is not an autonomous dynamical system, we are prompted in Step II of our learning to employ such a newly-populated training dataset to learn the temporal variation of the potential function, by modelling it as a random realisation from an adequately chosen stochastic process—to be precise, a Gaussian Process or GP [33]. Supervised learning of the covariance structure of this generative stochastic process then permits closed-form, and uncertainty-included, prediction/forecasting of the potential at a test time point.
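Step II's closed-form prediction can be sketched with standard GP regression on an invented set of (design time, bespokely-learnt potential) pairs. A stationary squared-exponential kernel is used here purely for brevity; the text argues for a non-stationary covariance in general, and all hyperparameter values below are assumptions.

```python
import numpy as np

# Invented training pairs: design times and bespokely-learnt potentials.
t_train = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
phi_train = np.array([-0.9, -0.7, -0.55, -0.45, -0.4])
ell, sig2, noise = 1.5, 0.5, 1e-6   # assumed kernel hyperparameters

def kern(a, b):
    """Squared-exponential covariance between two sets of times."""
    return sig2 * np.exp(-0.5 * (a[:, None] - b[None, :])**2 / ell**2)

K = kern(t_train, t_train) + noise * np.eye(len(t_train))
t_star = np.array([2.5])            # test time
k_star = kern(t_star, t_train)

# Closed-form GP posterior mean and variance at the test time:
mean = k_star @ np.linalg.solve(K, phi_train)
var = kern(t_star, t_star) - k_star @ np.linalg.solve(K, k_star.T)
print(mean[0], var[0, 0])
```

The same algebra gives both the predicted potential and the uncertainty on it, which is what Step III then propagates into the phase space variables.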

Indeed, learning the potential is not enough. The ultimate aim is the prediction of the location variable and its rate, at any given time point. Knowledge of the potential at a test time point is employed within Step III of our learning+forecasting work, in the (equivalent) framework of Newton's Second Law [17], to inform on the rate of change of the location variable, and thereby on the location variable itself. Thus, the time derivative of the rate of change of the location variable is formulated as proportional to the negative spatial gradient of the potential function, where by "spatial" is implied: with respect to the location X(T). Now, when we know the potential function deterministically, phase space coordinates are predicted deterministically at a given time, within the context of Newtonian Dynamics. However, as we propose to learn the potential function probabilistically, forecasting of the phase space variables is also probabilistic—accompanied by comprehensive and objective uncertainties.
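The probabilistic nature of the forecast can be illustrated by Monte Carlo: draw many plausible potentials, push each through the 2nd Law, and summarise the spread of the forecast location. The quadratic potential with an uncertain stiffness below is an invented stand-in for draws of the potential from its learnt generative process.

```python
import numpy as np

rng = np.random.default_rng(2)

dt, T = 1e-2, 1.0

def forecast_location(k):
    """Integrate Newton's 2nd Law for the invented potential 0.5*k*x**2."""
    x, v = 1.0, 0.0
    for _ in range(int(T / dt)):
        v -= k * x * dt   # dv/dt = -d(phi)/dx
        x += v * dt
    return x

# Uncertain stiffness: a stand-in for uncertainty in the learnt potential.
ks = rng.normal(loc=1.0, scale=0.1, size=500)
xs = np.array([forecast_location(k) for k in ks])
print(xs.mean(), xs.std())   # forecast location, with its uncertainty
```

In the book's framework this propagation is handled within the Bayesian machinery rather than by brute-force sampling, but the message is the same: uncertainty on the learnt potential induces uncertainty on the forecast state.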


2.1.6 Evolution-Driver, Aka Potential Function

Students of Dynamics have internalised the scalar-valued gravitational (or electrostatic) potential. For others, an intuition of the potential function might be useful. For example, the potential at a location (where the gravitational field due to a mass condensation is experienced), relative to another location—one that is more distant from the considered mass condensation than the original location—is the energy that is expended in moving an independent unit mass from this location to the other location. The gravitational field is weaker at the new location. Energy is expended, i.e. work is done, in relocating the test mass, since gravitation is an attractive force; one had to expend energy to wring the body with the unit mass away from the attractive field, to a location where the body experiences weaker attraction. This expended energy is then added to the energy of the relocated body, which would have been more relaxed at the original location, with lower potential energy. The more the system is disturbed from the state at which it is most "relaxed", the higher is its potential energy. In contrast to this example, a unit positive test charge is more relaxed when it is relocated to a far-away location where the repulsive field set up by a positive charge condensation is weaker, than at a nearer location where the strength of this field is stronger.

Following on from this elementary notion of potential, in this chapter the potential is modelled in general as a function of the location variable and of time, s.t. the potential at a given time and state parametrises the "effort" that is required to be input into the system at this time, for it to be transferred from a state where the system is undisturbed, to the current state. So the more disturbed the system is in its current state—away from the state at which it was undisturbed—the higher is its potential.
When the system is in the maximally-disturbed state, the system potential is higher than in any other state at any time. This maximally-disturbed state is one s.t. any further disturbance of the given system is impossible, or, if further disturbance is imposed, the system will disintegrate into one that is no longer recognisable as the one for which the forecasting is considered. Thus the potential function attains its highest value at any time, if the system at this time is at its maximally-disturbed state. Again, the system potential takes its minimal value when the system is at the undisturbed—or minimally-disturbed—state.

We begin by considering a system for which the state is marked, at time T ∈ T ⊆ R≥0, by the time-dependent location variable X(T) ∈ X ⊆ R. Let the potential Φ : X × T → R be s.t. when the considered system is most disturbed, the potential function attains its highest value φ_max. Then, by the motivation developed in the last paragraph, at any time the potential of the maximally-disturbed state—denoted φ_max—is higher than that of any other state at any time, including the minimally-disturbed state, at which the potential is φ_min, i.e. φ_min < φ_max. We will assign the value 0 to φ_max, since there is no information impeding such a choice; importantly, the prediction of state using Newton's 2nd Law is not affected by an arbitrary additive constant in the definition of the potential, since it is the gradient of the potential (with respect to location) that is of relevance in the expression of this Law.


Definition 2.1 We linearly transform the potential function without loss of generality, s.t. the original potential function Φ(X(T), T) is linearly transformed to the effective potential:

    φ(X(T), T) = (Φ(X(T), T) − φ_max) / (−φ_min),    (2.1)

with φ_max set to 0. Then it follows that φ(·, ·) takes values in [−1, 0].

The aforementioned motivation for the potential is simply to place the effective potential φ(X(T), T) on a scale of our choice, namely [−1, 0]. So while it is true that this effective potential is placed on a scale the 0-point of which is a seemingly ad hoc choice, that does not affect the learning of the system evolution. Indeed, as stated above, the system evolution is invariant to a shift in the potential by a constant, since it is the gradient of the potential—i.e. the derivative of the potential with respect to the location parameter—that determines the rate of change of the rate parameter. In other words, evolution is determined solely by the shape of the potential function, and this shape is what we will learn; any scale will be subsumed into another global scale in the definition of the potential, that we will identify from the data. It is important to clarify at this stage that, as in physical systems, in this model too, potential energy is proportional to the potential, i.e. it is a constant times the potential, where such a constant—in concert with the global scale of the potential—is subsumed within the definition of the support of the phase space pdf. This scale does not affect the learning of the shape of the potential. Thus, from an implementational point of view in this model, potential and potential energy are equivalent.

Here, we set the effective potential—which we refer to simply as the "potential" hereon—to be negative at all values of the location space r.v., at all times. The disturbance that affects the evolution of X(T) is s.t. it causes the system to deviate from its most relaxed state, and such deviation is marked by an increase in the potential, which is still non-positive though, i.e. φ_min < φ(X(t), t) ≤ 0, ∀t > 0.
The continual effect of this disturbance is to render the system one to which any state—marked by values of the location variable X(T) and its rate of change V(T)—can be ascribed at any time. The maximally-disturbed state is either unattainable by the given system, or s.t. the system finds itself in this state just as it is disintegrating into a structure that is no longer identifiable as a system to which any future state can be assigned. Such a maximally-disturbed state is parametrised by the value x_death of the location parameter (attained at a time t_death, say). So to summarise: by the definition developed above, the potential of the given identifiable system is φ(X(T), T) ∈ [−1, 0), ∀t ∈ T ⊆ R≥0 \ t_death.

We find a parallel between this generic motivation for the potential, and that in a gravitationally bound system. Inside a gravitationally bound system, any particle that is bound to the system can possess potential energy less than the maximum potential energy of 0; the potential of any such bound particle is negative, indicating that if a particle were to be brought into a location within the system from an infinitely distant location, work will be done by this system—thereby depreciating its potential energy—as the system attracts the particle into its midst. (This situation is distinguished from the case when work is required to be done on the system, to bring a particle from infinity into a system that repels such a newly introduced particle.) Thus, the potential at any location within the gravitationally-bound system is negative. Indeed, the system is more relaxed when a particle is in a more strongly attraction-bound state, than in a less strongly attraction-bound one. Thus, at locations where the gravitational attraction is higher inside a self-gravitating system, the potential is more negative. So if the location inside this gravitationally-bound system at time T is the vector X(T), then Φ(X(T), T) ≤ 0, with the upper bound on this potential being φ_max = 0, and the most negative value of its potential φ_min = Φ(0, t). Then by the above definition (in Eq. 2.1), the effective potential for this system is φ(X(T), T) ∈ [−1, 0).
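The transform of Definition 2.1 can be checked numerically: with φ_max = 0, an (invented) raw potential ranging over [φ_min, 0] is mapped onto [−1, 0] by Eq. (2.1).

```python
import numpy as np

phi_min, phi_max = -7.3, 0.0             # invented raw-potential bounds
Phi = np.linspace(phi_min, phi_max, 11)  # raw potential values

# Eq. (2.1) with phi_max = 0:
phi_eff = (Phi - phi_max) / (-phi_min)
print(phi_eff.min(), phi_eff.max())      # -1.0 0.0
```

Any constant shift or positive rescaling of Phi leaves the shape, and hence the gradient driving the dynamics, unchanged, which is why the choice of the [−1, 0] scale is harmless.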

2.2 Learning Scheme: Outline and 3 Underlining Steps

Detailed discussion of the modus operandi for prediction/forecasting of a phase space variable value is presented in the next section, while in this section an overview of the methodology is provided. The basic aim is to learn the temporal evolution of the potential function, s.t. we can predict it at some future (test) time point. Once it is forecast, we plan to compute the rate and the value of the location variable at that test time point. However, the learning of the functional relationship between the potential function and time cannot materialise unless there is a training data set that comprises measured/simulated/computed/known values of the potential function at chosen or designed inputs, i.e. design state and time. However, we cannot simulate or compute this potential function at a given test time, since there exists no knowledge of this construct referred to as the potential, in the generic dynamical system under consideration. Indeed, generation of information on the potential as a function of time is the objective of our endeavour.

2.2.1 Can the Potential be Learnt Directly Using Observed Phase Space Variables?

Let us now recall that information on the form of the potential function cannot be obtained from Newton's Second Law, even if information on the measured values of the rate and location variables were available to be input into Newton's Second Law. After all, the Second Law posits a linear connection between the gradient of the potential (with respect to the location variable) and the rate of change of the rate variable. So once a functional form is ascribed to the system potential, the Second Law permits computation of the value of the potential function at a chosen/design time point T = t, given the measured values of the rate variable (v(t)) and the location variable (x(t)) at this time point. But given a set of values of the location and rate variables, the Second Law does not provide a route to learn the potential function—or does it?

It is in principle possible to model the time rate of change of the rate variable V(T) with a GP, after learning the rate V(T) with a GP, given a training set comprising pairs of values of the time at which a rate value is measured, and the rate measured at that time—if such measurements of rate were available. What we really require as a direct input into the Second Law is the rate of change of the rate variable V(T). This sought learning of the dV(T)/dT function will need to be preceded by the learning of the GP that generates V(T), using the aforementioned training data consisting of time-rate pairs. We will invoke this GP in the next paragraph.

While the rate of change of V(T) informs on the gradient (with respect to X(T)) of the unknown potential via the Second Law, the proposed supervised learning of the potential φ(X(T), T) requires integration of the GP—that is identified to generate V(T)—with respect to X(T). Such an integration is likely to be difficult, given that the mean and variance structures of that GP are not identified as X(T)-dependent. Instead, as indicated in the previous paragraph, such a GP provides a distribution over the space of the V(T) function. So it appears difficult to see how one can undertake the proposed supervised learning of the potential function, using measurements of location and rate at different times.
This difficulty notwithstanding, such direct stochastic implementation of the Second Law remains an avenue for further investigation.

In effect, the connection between the potential function and the phase space variables X(T) and V(T)—manifest in the Second Law—is what can be exploited, to put forward state forecasting that is superior to forecasting of X(T) (and V(T)) using patterns in past values of X(T) (and V(T)). Endeavours of the latter type are fundamentally unreliable, irrespective of the sophistication of the estimation/learning techniques that are employed in capturing such patterns—the future does not necessarily replicate the past. The solution to this difficulty is to impose constraints on the forecast state, such that the forecasting methodology adapts to temporally-local deviations from hitherto observed states.


2 Forecasting by Learning Evolution-Driver

2.3 Robustness of Prediction: Extra Information from the 2nd Law

Our pursuit of the potential as a random function that is a realisation from a Gaussian Process is helped by the information obtained from the Second Law on the phase space variables. Here, the potential that is forecast at a given time is input into the Second Law, to compute the rate and location variables at that time. These phase space variable values, along with the updated rate of change of the rate variable, then augment similar information from the past, to affect the updating of the correlation structure of the GP that is invoked to underlie the potential function. It is from such an updated GP, with correlation structure updated at the time point at which the latest forecasting has occurred, that the mean (and variance) of the potential function at the following time point is forecast. Thus, we try to motivate a method in which there is a clear feedback from the state identified at a time point (computed by using the current potential in the Second Law) towards the generative process that yields the potential in the subsequent time point.

Newton's 2nd Law suggests that $dV(t)/dt \propto \nabla_{X(t)}\, \phi(x(t), t)$. So we appreciate that the following happen.

• Knowledge of the potential function till the $i$-th time window implies that the variation in this function with location, at this time point, is learnable, s.t. the rate of change of the rate variable at $T = t_i$ is known. In other words, there exists a constraint on the rate of change of the rate, as a function of time, at $T = t_i$.
• Additionally, while in the $(i-1)$-th time window, the value of $V(T)$ in the $i$-th time window was forecast using the potential function that was forecast for this $i$-th time window.

Then it follows that $dV(t)/dt$ is constrained, or informed upon via the 2nd Law, at $T = t_i$, while $V(T = t_i)$ is known (as forecast). This implies that the forecast value of the rate at $T = t_{i+1}$ can avail of a constraint.

Such information renders the rate forecasting in the $(i+1)$-th time window better than it would be if we relied solely on information from the past of the rate time series to perform the rate forecasting. Also, that $V(t)$ is better constrained in the $(i+1)$-th time window via the 2nd Law, than without, implies in turn that

– in the $(i+1)$-th time window, the variation in rate as a function of time is better constrained,
– i.e. the variation of the potential with location is better constrained during the $(i+1)$-th time window,
– i.e. the correlation of the sought potential function at the $(i+1)$-th time window has the luxury of extra information, compared to the case where we relied only on information redundancy in the past of the time series on potential, to forecast the value of this function at this time window.

In this way, information obtained using the Second Law guides the correlation function of the process that generates the potential function. To emphasise, we realise that in


absence of the constraints input by the state computed using Newton's Second Law at a given time, the form of the correlation of this process at this time would have remained solely dependent on patterns in past values of the potential. To clarify, the forecasting of the potential function at this $i$-th time window happens as a result of the closed-form prediction of its mean and variance values while at the $(i-1)$-th time window, and such prediction is permitted within the GP-based learning of the potential function that we undertake. This prediction does not follow from using the 2nd Law. So the extra constraint/information that we enjoy in this work owes to the connection that exists between the potential and the phase space variables, as embodied in the 2nd Law. Additionally, there is an advantage to be gained from forecasting the potential, over the state. Indeed, the state that the system is in may undergo a very large change between the current time point and the future, even when the potential has not changed in form. On the other hand, the potential is a manifestation of the inherent Physics of the system, and the magnitude of its fractional change is typically less than that of a state variable, over any given time period.

2.4 3-Staged Algorithm

The primary novelty of the methodology introduced in this chapter is in motivating the connection between the state and the evolution-driver of the system, while constructing a pathway that allows for the implementation of this connection within a 3-staged algorithmic protocol that we discuss in detail. That we do not possess information on the value of the potential function at a design time point implies that we cannot populate the training data set that is fundamentally required to undertake supervised learning of the time dependence of the potential; this absent training set will need to first be generated, to render such learning permissible. While it may be possible, under certain model assumptions, to invert the measured values of $X(T)$ and $V(T)$ to impose constraints on the potential function, here we undertake generation of the originally-absent training data in a generic system, via a novel form of bespoke learning of the potential, given the data that exists on the phase space variables $X(T)$ and $V(T)$ at time points within the time period that we refer to as the learning period. The undertaken bespoke learning is, broadly speaking, not constrained by assumptions about a particular form of the temporal variation in the potential function, or about the smoothness of the temporal variation in $X(T)$ or $V(T)$ during the learning period, or about noise in the measured values of $X(T)$ and $V(T)$, in generic applications.


The only assumption that the undertaken bespoke learning rests upon is that the non-stationary evolution of this generic non-linear dynamical system occurs s.t. there exists a time interval over which the system is an autonomous Hamiltonian system. The temporal range over which we conduct learning and forecasting of the potential is partitioned into time windows, the width of each of which is this time interval. There is variation in the potential function as time elapses from the temporal location of one time window to another.

In fact, in the methodology that is developed below for such bespoke learning of the potential function, an important parameter of the model is the largest time interval $\tau > 0$ that defines the width of a time window over which the autonomous nature of the (assumed) Hamiltonian system holds. The full temporal range of the learning period is then partitioned into time windows of width equal to this identified maximum tenure $\tau$; the centroid of each such time window is a design time point in the sought training data set, s.t. at each design time point, our novel bespoke learning offers the corresponding values of the potential. Thus the generation of the originally-absent training data set, using the novel bespoke learning methodology that we expound upon below, comprises our first step in the learning, i.e. Step I. Once Step I is undertaken, the next step is to perform the supervised learning of the time variation of the potential function. This is undertaken by modelling the potential as a random function that is a sample function of an adequately chosen stochastic process. The training data generated during Step I helps us learn the covariance structure of this stochastic process, allowing for closed-form prediction of the mean and variance of the sampled potential at a test time point. In fact, we forecast the potential at time points that lie beyond the learning period. We perform the forecasting sequentially, gathering information from the recently-predicted potential to augment the training data set, to undertake forecasting at subsequent times. Thus the learning of the temporal variation of the potential is updated as we perform predictions at test times within the time period that we refer to as the prediction period. Such learning and forecasting comprise the second step in our strategy, i.e. Step II.
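The closed-form GP prediction at the heart of Step II can be sketched in scalar form. The squared-exponential kernel and its hyperparameters below are illustrative placeholders, not the covariance structure that is actually learnt from the Step I training data:

```python
import numpy as np

def gp_predict(t_train, y_train, t_test, ell=1.0, sig2=1.0, noise=1e-6):
    """Closed-form GP posterior mean and variance at test times, under a
    squared-exponential kernel (an illustrative choice of covariance)."""
    k = lambda a, b: sig2 * np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ell ** 2)
    K = k(t_train, t_train) + noise * np.eye(len(t_train))   # train covariance
    Ks = k(t_test, t_train)                                  # test-train covariance
    alpha = np.linalg.solve(K, y_train)
    mean = Ks @ alpha                                        # posterior mean
    cov = k(t_test, t_test) - Ks @ np.linalg.solve(K, Ks.T)  # posterior covariance
    return mean, np.diag(cov)
```

At a training input the posterior mean collapses onto the training value and the posterior variance onto the (tiny) noise level, which is the sense in which the prediction is "closed-form": no iterative optimisation is needed once the covariance is fixed.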
Once the mean value of the random potential is forecast at a time $T = t$ during the prediction period, as embedded within the uncertainty interval on the potential (namely the predicted standard deviation of the potential at this time point), in Step III we compute the rate $V(t)$ and the location variable $X(t)$ at this time. Such a computation uses the equivalent of Newton's Second Law. Thus, Steps I, II and III define our 3-stepped approach to the forecasting of future states of the generic non-stationary dynamical system.


To summarise, the forecasting of the location variable $X(t)$ and its rate (the variable $V(t)$) is undertaken at any test time point $T = t$, by following a three-stepped sequential protocol.

• Step I: model the evolution of the considered non-stationary dynamical system to be locally autonomous and Hamiltonian, and undertake a new bespoke learning methodology to learn the potential at the centroid of a time window that embodies such localisation in time. This helps generate the originally-absent training data that comprises pairs of values of time and the system potential that is realised at this time. All such time windows are confined to a temporal range referred to as the learning period.
• Step II: model the potential as a random realisation drawn from an adequately chosen stochastic process. To undertake potential prediction at the test time point given by the centroid of the first time window subsequent to the end of the learning period (i.e. at the centroid of the first time window within the prediction period), the covariance function of the underlying stochastic process is learnt using training data that is generated during the learning period, using the methodology of Step I. Thereafter, closed-form prediction of the mean and variance of this random potential is made at this test time point. Potential forecasting is performed at the centroid of the second time window within the prediction period by augmenting the aforementioned training data with the potential predicted at the centroid of the first time window inside the prediction period, and so on. Thus, one-step-ahead forecasting of the potential is undertaken during the prediction period.
• Step III: at the test time point at which values of the potential are forecast, compute the uncertainty-included location variable and its rate, by inputting the predicted potential into a generalised version of Newton's Second Law.
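The windowing of Step I and the augment-and-forecast loop of Step II can be outlined in code. The function names here are hypothetical stand-ins; in particular, the `predict_fn` argument abstracts the GP predictor of Step II:

```python
import numpy as np

def window_centroids(t_start, t_end, tau):
    """Step I windowing: partition the learning period [t_start, t_end] into
    windows of width tau; each window centroid is a design time point."""
    n_windows = int(np.floor((t_end - t_start) / tau))
    return t_start + tau * (np.arange(n_windows) + 0.5)

def sequential_forecast(train_times, train_phi, forecast_times, predict_fn):
    """Step II in outline: one-step-ahead forecasting, augmenting the training
    set with each newly predicted potential before forecasting the next window."""
    times, phis = list(train_times), list(train_phi)
    forecasts = []
    for t_new in forecast_times:
        phi_new = predict_fn(np.array(times), np.array(phis), t_new)
        forecasts.append(phi_new)
        times.append(t_new)      # augment the training data ...
        phis.append(phi_new)     # ... with the newly forecast potential
    return np.array(forecasts)
```

For instance, `sequential_forecast([0., 1.], [-0.5, -0.4], [2., 3.], lambda t, p, tn: p[-1])` runs the loop with a trivial persistence predictor standing in for the GP.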

2.4.1 Outline of Bespoke Learning in Step I

We commence Step I with the learning of the potential in each of the $N_\tau$ time windows during which the system is assumed to behave as an autonomous Hamiltonian system. Then, under a model of the evolution of the pdf of the system phase space r.v.s, we are permitted to embed the potential in the definition of the support of this pdf. However, the pdf is not known to us either. Thus, in this step, the sought functions are both the potential and the pdf.


However, we lack information on the following pair: (value of the domain variable of either sought function, value of the sought function). Then, with training data not present to facilitate supervised learning of either sought function, we vectorise each sought function, i.e. partition the relevant subset of its domain and hold the functional output over each such partition an unknown constant. It is then the potential vector and the pdf vector that we learn, within each time window.
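Vectorising a sought function amounts to a piecewise-constant representation over a partition of its domain. A minimal sketch, in which the bin edges and the example values are illustrative:

```python
import numpy as np

def vectorise_domain(lo, hi, n_bins):
    """Partition [lo, hi) into n_bins equal-width bins; the (unknown, constant)
    function value over each bin is one component of the sought vector."""
    return np.linspace(lo, hi, n_bins + 1)

def eval_vectorised(x, edges, values):
    """Evaluate the piecewise-constant (vectorised) function at x."""
    i = np.clip(np.searchsorted(edges, x, side="right") - 1, 0, len(values) - 1)
    return values[i]
```

In the learning itself, the `values` vector is of course the unknown; the point of the representation is that a function-valued unknown is replaced by finitely many scalar unknowns.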

Embedding the potential in the support of the pdf of the phase space variables is the key trick that allows for the sought potential to be introduced into the likelihood of the unknowns (which are components of the sought vectors that are the discretised, or vectorised, versions of the unknown potential function and the pdf), given the data on the location r.v. during a time window. This likelihood, in concert with chosen priors on the vectorised potential and pdf, provides the joint posterior probability density of all unknowns given the data. We sample from this posterior using Markov Chain Monte Carlo-based techniques [34] (Metropolis-within-Gibbs, to be precise), to compute the marginal posterior of each unknown given the data. Then values of each unknown are identified within 95% Highest Probability Density credible regions (HPDs), at each of the $N_\tau$ time windows during any of which this dynamical system is assumed to be an autonomous Hamiltonian system.
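A generic Metropolis-within-Gibbs updater can be sketched as follows; this is a minimal textbook version with a Gaussian random-walk proposal per component of the unknown vector, not the book's actual sampler over the potential and pdf vectors:

```python
import numpy as np

def metropolis_within_gibbs(log_post, theta0, n_iter, step=0.1, rng=None):
    """Update one component of theta at a time with a random-walk Metropolis
    step; each component's proposal is accepted or rejected separately."""
    rng = np.random.default_rng(0) if rng is None else rng
    theta = np.array(theta0, dtype=float)
    chain = np.empty((n_iter, theta.size))
    lp = log_post(theta)
    for it in range(n_iter):
        for i in range(theta.size):
            prop = theta.copy()
            prop[i] += step * rng.standard_normal()   # perturb one component
            lp_prop = log_post(prop)
            if np.log(rng.uniform()) < lp_prop - lp:  # Metropolis acceptance
                theta, lp = prop, lp_prop
        chain[it] = theta
    return chain
```

Running it on, say, `log_post = lambda th: -0.5 * np.sum(th ** 2)` (a standard bivariate normal target) recovers samples with mean near zero and unit spread, after discarding burn-in.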

A primer on MCMC techniques is included in Appendix A.

2.4.2 Embedding Potential into Support of Phase Space pdf: Part of Step I

By definition, any of the $N_\tau$ considered time windows is s.t. the system is assumed to be an autonomous Hamiltonian system when within any such time window. Across any pair of time windows, the evolution of the system is non-autonomous in general. In other words, there is an explicit time dependence in the potential function on time points delineated by centroids of distinct time windows, but inside a time window, the autonomous nature of the system implies that the potential does not bear an explicit time dependence.

Definition 2.2 Thus, when the time variable $T$ takes a value $t$ inside the $j$-th time window, the system potential is $\phi^{(j)}(X(t))$, while if $t$ is within the temporal bounds that define the $j'$-th time window, the system potential is $\phi^{(j')}(X(t))$, where $\phi^{(j)}(\cdot) \neq \phi^{(j')}(\cdot)$ in general. Here $j \neq j'$; $j, j' \in \{1, \ldots, N_\tau\}$.

Again, the Hamiltonian nature of the system within the j -th time window implies that the evolution of this dynamical system within the temporal bounds of this time window, is governed by Liouville’s Theorem [15, 28].

Liouville will hold, since infinitely many states exist in the system phase space $\mathcal{W}$ at any time. So system evolution is governed at all times confined within the bounds of the $j$-th time window, by Liouville, via the system phase space pdf $f^{(j)}_{X,V}(x, v)$. Indeed, such is true $\forall j \in \{1, \ldots, N_\tau\}$. However, there is nothing in our assumption of the temporally-localised autonomous+Hamiltonian nature of this system that warrants the same form of the phase space pdf across distinct time windows.

Definition 2.3 Thus, in the $j'$-th time window, the phase space pdf is $f^{(j')}_{X,V}(x, v)$, with $f^{(j)}_{X,V}(\cdot, \cdot) \neq f^{(j')}_{X,V}(\cdot, \cdot)$ in general, for $j \neq j'$; $j, j' \in \{1, \ldots, N_\tau\}$.

We remark that it is the modelling of the system as Hamiltonian within any time window that allows for Liouville to be invoked, to determine the total temporal variation of the phase space pdf. Modelling the system to be autonomous during any of the time windows implies that the potential does not bear an explicit time-dependence during a time window.

Definition 2.4 The Hamiltonian nature of the system in the $j$-th time window implies that Liouville holds, i.e. the total temporal variation of the pdf of the phase space variables during the $j$-th time window is 0. In other words, from Liouville it follows that, during the $j$-th time window, the density $f^{(j)}_{X,V}(x, v)$ of phase does not change with time, i.e. during the $j$-th time window, the following holds:

$$\frac{d f^{(j)}_{X,V}(x(t), v(t))}{dt} = 0.$$

This holds $\forall j \in \{1, \ldots, N_\tau\}$. Such temporal evolution of the phase space pdf during any time window over which the system is an autonomous Hamiltonian system implies that, during any such time interval, a solution of the above differential equation (which pins the total time derivative of the phase space pdf down to 0) is a function of integrals of motion. Integrals of motion are such functions of the phase space variables that do not vary with time along trajectories in the phase space.
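That the energy is an integral of motion along phase-space trajectories of an autonomous system can be checked numerically. The quadratic potential and the integrator step size below are hypothetical choices, with the conventional sign convention $dv/dt = -d\phi/dx$:

```python
import numpy as np

def leapfrog(x, v, grad_phi, dt, n_steps):
    """Evolve dx/dt = v, dv/dt = -grad_phi(x) with the leapfrog integrator,
    a standard symplectic scheme that nearly conserves energy."""
    v = v - 0.5 * dt * grad_phi(x)          # initial half kick
    for _ in range(n_steps - 1):
        x = x + dt * v                      # drift
        v = v - dt * grad_phi(x)            # full kick
    x = x + dt * v
    v = v - 0.5 * dt * grad_phi(x)          # final half kick
    return x, v

phi = lambda x: 0.5 * x ** 2                # hypothetical quadratic potential
grad_phi = lambda x: x
x0, v0 = 1.0, 0.0
x1, v1 = leapfrog(x0, v0, grad_phi, 0.01, 1000)
e0 = phi(x0) + 0.5 * v0 ** 2                # energy before ...
e1 = phi(x1) + 0.5 * v1 ** 2                # ... and after: an integral of motion
```

Along the trajectory, `e1` agrees with `e0` to within the integrator's (small) discretisation error, illustrating why the phase space pdf can be written as a function of the energy alone.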


Definition 2.5 One such integral of motion of the (autonomous) Hamiltonian system during any time window is the system energy variable $\varepsilon$, which is really a function of the generic phase space variables $X(T)$ and $V(T)$, as per the definition:

$$\varepsilon(X(T), V(T)) := \phi(X(T)) + (V(T))^2/2. \qquad (2.2)$$

Thus, the system energy is motivated as the sum of the potential and "kinetic" energies, following standard dynamics. Modelling a phase space pdf as dependent on $K \in \mathbb{N}$ integrals of motion (where each such integral is a function of the phase space coordinates) is equivalent to imposing $K$ constraints on the system phase space. Then, to ensure evolution in an $N$-dimensional phase space, we demand $K \leq N - 1$, s.t. the system is not rendered stationary, but is allowed at least 1 degree of freedom. Given that in this problem the generic location variable $X(T)$ is scalar-valued, the dimensionality of the phase space is two. Then, to ensure that there is at least one degree of freedom allowed in this phase space, we model the phase space pdf with only one integral of motion, namely $\varepsilon(X(T), V(T))$, which is given partly by the potential energy and partly by the "kinetic" energy, where the latter is defined as half the squared rate r.v. $V$. Thus, during the $j$-th time window, the phase space pdf is alternately expressed as $f^{(j)}_\varepsilon(\varepsilon(X(T), V(T)))$. Then, the temporal evolution of the phase space pdf during the $j$-th time window (which is governed by Liouville's Theorem) is recast as the following equation:

$$\frac{d f^{(j)}_\varepsilon(\varepsilon(X(t), V(t)))}{dt} = 0.$$

As the effective potential $\phi(X(T), T) \in [-1, 0)$, the system energy $\varepsilon(X(T), V(T)) = \phi(X(T), T) + (V(T))^2/2$ can be non-negative or negative. If non-negative, the location variable $X(T)$ can approach infinity at a non-negative rate; if, on the other hand, $\varepsilon(X(T), V(T)) < 0$, the maximal value that the location variable can attain is truncated to a finite value. In our empirical illustration of the method for forecasting (as we will soon see), $X(T)$ approaching infinity is not possible, as there is an upper limit to the value that the state space variable can attain. Essentially, it is not possible for the system energy to be non-negative, if $X$ cannot exceed an upper limit, or if non-negative system energy implies that the system disintegrates into a form that is no longer identifiable as the system for which this forecasting is undertaken. This motivates us to develop the method here to exclude the possibility of non-negative values of the energy r.v. Then the most negative that $\varepsilon(X(T), V(T))$ can be, at any value of $T$, is when the contribution from the $V^2/2$ term is nil, i.e. the most negative value of the energy is the most negative value of the effective potential. We normalise $\varepsilon(X(T), V(T))$ by the negative of the minimum value ($-\phi_{min}$) of the potential $\phi(X(T), T)$, to ensure that the minimum value of the energy r.v. is $-1$. Then $\varepsilon(X(T), V(T))$ takes values in $[-1, 0)$.


A different motivation for such a bound on the values available to the energy variable allows for the interpretation of the "kinetic energy" of this generic dynamical system. We recall that the potential is interpreted to parametrise the effort that needs to be input into the system, to bring it from the minimally-disturbed state to its current state. On the other hand, the "kinetic" energy of this system at a given time quantifies the effort that the system itself exerts to move out of its current state (owing to the value of the rate variable at this time), to achieve a state of higher disturbance. Then it appears possible in general for the system to be disturbed even beyond what is allowed by maximising the potential, i.e. by letting the potential achieve the value $\phi_{max}$. However, as we have motivated, the system is such that it cannot be disturbed any further once the potential attains the value $\phi_{max}$; or, if disturbed further than when the potential is maximal, the system disintegrates into a form that is unrecognisable as the one that we undertake forecasting for. Then it follows that the kinetic energy has to be truncated to attain a maximal value that is tuned to the value attained by the potential at the given state, for all times. In fact, the total energy, which is the sum of such potential and kinetic energies, is parametrised as the net effort that is available to be input to the system, to move it from its current state to the maximally disturbed state. Thus, if the total energy is positive, it implies that effort is available to transfer the system even beyond the maximally disturbed state, which is either unattainable for the given system, or is a state at which the system is no longer recognisable as one to which this forecasting methodology could be applied. Thus, we desire confining our attention to only those states that render the total energy non-positive, i.e. states that retain the system as "bound".

Given that the kinetic energy is modelled as $(V(T))^2/2$, it is non-negative, s.t. the total energy is rendered positive only if the kinetic energy exceeds the absolute value of the potential. So a "bound" system implies kinetic energy $\leq |\phi(X(T), T)|$ at all $T$. Thus, the total energy $\varepsilon(X(T), V(T))$ takes values in $[-1, \phi_{max}]$, with $\phi_{max}$ ($= 0$) attained at $T = t_{death}$. So for all times $T \neq t_{death}$, the energy takes values in $[-1, 0)$.
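The normalisation of the energy and the "bound" condition above can be expressed directly; here `phi_min` denotes the most negative value of the effective potential:

```python
import numpy as np

def normalised_energy(phi, v, phi_min):
    """Energy eps = phi + v^2/2, normalised by |phi_min| so that the minimum
    attainable energy is -1; a 'bound' state then has eps in [-1, 0)."""
    return (phi + 0.5 * v ** 2) / abs(phi_min)

def is_bound(phi, v):
    """Bound state: kinetic energy must not exceed |phi|, i.e. the total
    energy is non-positive."""
    return 0.5 * v ** 2 <= abs(phi)
```

With `v = 0` and `phi = phi_min`, the normalised energy attains its floor of $-1$, matching the stated range of the energy r.v.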

2.4.3 Learning Potential as Modelled with a Gaussian Process: Part of Step II

Once the vectorised potential is learnt at the centroid of each of the $N_\tau$ time windows that live within the learning period, we proceed to Step II of our learning strategy, towards forecasting at a future time point within the prediction period. In the 2nd step, we model the potential vector random variable as a random realisation from a vector-variate Gaussian Process. Then, by definition, the joint probability density of the $N_\tau$ realisations of the random potential vector is a matrix Normal density, with

• mean matrix $\mu$;
• inter-row (or between-components-of-potential-vector) covariance matrix $\Sigma_{compo}$, and
• inter-column (or between-times-at-centroids-of-distinct-time-windows) covariance matrix $\Sigma_{time}$.

Thus, the likelihood of $\mu$, $\Sigma_{compo}$, $\Sigma_{time}$ is matrix Normal, given the data, which in this case comprise the learnt values of the potential vectors. The covariance matrices are parametrised using covariance kernels, where possible. As before, once the likelihood is formulated, we choose priors and write the joint posterior of all unknowns given the data. Posterior sampling is undertaken with an MCMC algorithm (see Appendix A). Subsequent to the learning of the unknown parameters of the generative GP, we undertake closed-form prediction of a realisation from it at any time $t^{(test)}$, where a realisation from this GP is the effective potential vector at this test time $t^{(test)}$. Thus we undertake closed-form forward prediction of the expectation of the potential vector at a test time point, embedded within an uncertainty interval that is identified as the predicted standard deviation of the sampled potential, with this interval placed symmetrically about the predicted mean [33]. This, in a nutshell, constitutes our prediction/forecasting of the potential at this given test time, using the potential vectors learnt at each of the $N_\tau$ time windows. But how do we compute the rate variable, and thereby predict the value of the location variable, at this test time? This is undertaken in Step III of our learning, when we invoke Newtonian Dynamics to underlie the evolution of the system location and rate variables.
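The matrix Normal likelihood of the learnt potential vectors has a standard closed-form log-density. A sketch, in which `U` plays the role of $\Sigma_{compo}$ (inter-row) and `V` that of $\Sigma_{time}$ (inter-column):

```python
import numpy as np

def matrix_normal_logpdf(X, M, U, V):
    """Log-density of a matrix Normal: M is the mean matrix, U the inter-row
    covariance, V the inter-column covariance."""
    n, p = X.shape
    D = X - M
    # quadratic form tr( V^{-1} D^T U^{-1} D )
    quad = np.trace(np.linalg.solve(V, D.T) @ np.linalg.solve(U, D))
    _, logdet_U = np.linalg.slogdet(U)
    _, logdet_V = np.linalg.slogdet(V)
    return -0.5 * (n * p * np.log(2 * np.pi) + p * logdet_U
                   + n * logdet_V + quad)
```

With `U = V = I`, this reduces to the product of `n*p` independent standard normal densities, a quick sanity check on the parametrisation.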

2.4.4 Rate and Location Variable Computation: Part of Step III

Newton's 2nd Law states that the rate of change of the momentum of a system is proportional to the force acting on it. It can then be generalised to suggest that the time derivative of the rate of change $\dot X \equiv V$ of the location variable $X(T)$ of our considered dynamical system is proportional to the gradient of the potential function, where the imposed force is expressed as the gradient of the $X(T)$- and $T$-dependent effective potential, with the gradient defined as the derivative with respect to the location variable $X(T)$, for a scalar-valued $X(T)$. Thus, in our model, within any time window (inside the prediction period) during which the system is an autonomous Hamiltonian one,

$$\frac{d^2 X(t)}{dt^2} \equiv V(X)\frac{dV(X)}{dX} = \alpha_0' \,\frac{\partial \phi(X, t)}{\partial X}, \quad \text{where } \alpha_0' \in \mathbb{R} \text{ is a constant.}$$

Indeed, the location r.v. $X(T)$ is itself time-dependent, but that dependence was dropped from the notation in the equation above, for brevity's sake. So at $T = t$, we end up with the solution $V(X)^2/2 = \alpha_0' \phi(X, t) + \alpha_1$, where $\alpha_1 \in \mathbb{R}$ is a constant with respect to the location r.v. $X$. We rewrite $\alpha_0'$ as $-\alpha_0$, so that $V(X)^2/2 = \alpha_0(-\phi(X, t)) + \alpha_1$, where $-\phi(X, t)$ is positive.
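The energy integral $V^2/2 = \alpha_0(-\phi) + \alpha_1$ translates directly into a rate computation. A sketch treating the potential as scalar, with illustrative values of the constants:

```python
import numpy as np

def rate_from_potential(phi, alpha0, alpha1):
    """Rate magnitude from the integral V^2/2 = alpha0*(-phi) + alpha1;
    alpha0 and alpha1 are the constants fit during the prediction period."""
    v_sq = 2.0 * (alpha0 * (-phi) + alpha1)
    if np.any(v_sq < 0):
        raise ValueError("negative V^2: alpha0/alpha1 inconsistent with phi")
    return np.sqrt(v_sq)
```

For example, with `alpha0 = 1`, `alpha1 = 0` and `phi = -0.5`, the recovered rate magnitude is `1.0`.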


However, in our model, for a given time window, the relevant range of values of the location variable is partitioned into $N_X$ partitions, and the potential function is held constant over each such X-partition. Thus, over the $j$-th time window, the location variable is rendered the vector $X^{(j)} \in \mathbb{R}^{N_X}$, a component of which is one of these X-partitions. In other words,

$$X^{(j)} := (X_1^{(j)}, \ldots, X_{N_X}^{(j)})^T, \quad j = 1, \ldots, N_\tau, \ldots,$$

where the vector $X^{(j)} \in \mathbb{R}^{N_X}$ is written in the basis $\{e_1^{(j)}, \ldots, e_{N_X}^{(j)}\}$, with unit Euclidean norm ascribed to $e_i^{(j)}$, $\forall i = 1, \ldots, N_X$. Thus, the inner product $X^{(j)} \cdot e_i^{(j)} = X_i^{(j)}$, $\forall i = 1, \ldots, N_X$. The rate of change of the location vector $X^{(j)}$ in the $j$-th time window is then the rate vector $V^{(j)} \in \mathbb{R}^{N_X}$ in this time window, with $V^{(j)} := (V_1^{(j)}, \ldots, V_{N_X}^{(j)})^T$.

Then the effective potential over the $i$-th X-partition, at the $j$-th time window, is parametrised as $\phi_i^{(j)}$, $\forall i = 1, \ldots, N_X$; $j = 1, \ldots, N_\tau, \ldots$. So the vectorised potential function that we aim to learn in this $j$-th time window is the vector-valued r.v.:

$$\phi^{(j)} := (\phi_1^{(j)}, \ldots, \phi_{N_X}^{(j)})^T.$$

See Fig. 2.1 for a cartoon depicting this vectorised potential (and the vectorised phase space pdf that we discuss below). We treat the r.v. $\phi_i^{(j)}$ as independent of the r.v. $\phi_{i'}^{(j)}$, $\forall i \neq i'$; $i, i' \in \{1, \ldots, N_X\}$, and $\forall j \in \{1, \ldots, N_\tau, \ldots\}$. Also, the r.v. $\phi_i^{(j)}$ is dependent on the variable $X_i^{(j)}$, but is not affected by any X-partition other than the $i$-th X-partition, at this $j$-th time window, i.e. if $X_{i'}^{(j)}$ changes, $\phi_i^{(j)}$ does not necessarily change, $\forall i \neq i'$. Then another representation of $\phi_i^{(j)}$ is $\phi(X_i^{(j)})$. We learn these mutually independent $\phi_i^{(j)}$ parameters at the $j$-th time window, $\forall i = 1, \ldots, N_X$; $\forall j = 1, \ldots, N_\tau, \ldots$.

Fig. 2.1 Schematic depiction of the vectorised versions of the two functions that we seek to learn in our work, namely the potential function and the phase space pdf. The vectorised potential is shown on the left, where in the $j$-th time window, over the $i$-th X-bin, the potential is $\phi_i^{(j)}$, for $i = 1, \ldots, N_X$, s.t. the potential vector in this time window is $\phi^{(j)} = (\phi_1^{(j)}, \ldots, \phi_{N_X}^{(j)})^T$. On the right, the $k$-th component of the vectorised phase space pdf $f^{(j)}$ in this time window is depicted as the height of the bar over the $k$-th $\varepsilon$-partition, $k = 1, \ldots, N_\varepsilon$. Each X-bin and $\varepsilon$-partition has the same width in our work, though this can be generalised

This vectorised version of the potential function is then built of piecewise constant-valued functions $\phi_i^{(j)}(X_i^{(j)})$ that we condense to the notation $\phi_i^{(j)}$. In the pictorial representation of this vectorised version of the potential function, the potential is a horizontal line over any X-bin, and how high or low this line is over the $i$-th X-bin is independent of changes to any other X-bin. However, if the value attained by the random location variable $X_i^{(j)}$ changes, then the position of this horizontal line changes. The above interpretation holds for all $i = 1, \ldots, N_X$. Then the grad of this "vectorised" potential function, with respect to the random location vector $X^{(j)}$, is the sum of the vectors $(d\phi_1^{(j)}/dX_1^{(j)})\, e_1^{(j)}, \ldots, (d\phi_{N_X}^{(j)}/dX_{N_X}^{(j)})\, e_{N_X}^{(j)}$. Then, by Newton's 2nd Law, this sum is proportional to the time derivative of the rate vector $V^{(j)} = (dX_1^{(j)}/dt, \ldots, dX_{N_X}^{(j)}/dt)^T$, in this time window.

Now, $X_i^{(j)}$ and $X_{i'}^{(j)}$ are independent $\forall i \neq i'$; $i, i' \in \{1, \ldots, N_X\}$, which implies that:

• $\dot X_i^{(j)}$ and $\dot X_{i'}^{(j)}$ are independent, i.e. $V_i^{(j)}$ and $V_{i'}^{(j)}$ are independent, and
• $V_i^{(j)} \equiv \dot X_i^{(j)}$ and $X_{i'}^{(j)}$ are independent, s.t. $\partial V_i^{(j)}/\partial X_{i'}^{(j)} = 0$, $\forall i \neq i'$; $i, i' = 1, \ldots, N_X$, with $\partial V_i^{(j)}/\partial X_i^{(j)}$ not necessarily equal to zero.

Such holds $\forall j = 1, \ldots, N_\tau, \ldots$. Then the grad of $(V_i^{(j)})^2$, with respect to the location vector $X^{(j)}$, is

$$\nabla^{(j)} \left( V_i^{(j)} \right)^2 = 2\, V_i^{(j)}\, \frac{dV_i^{(j)}}{dX_i^{(j)}}\, e_i^{(j)} = 2\, \frac{dX_i^{(j)}}{dt}\, \frac{dV_i^{(j)}}{dX_i^{(j)}}\, e_i^{(j)} = 2\, \frac{dV_i^{(j)}}{dt}\, e_i^{(j)}.$$

Then, adding over all $i = 1, \ldots, N_X$ in the $j$-th time window, the time derivative of the full rate vector is proportional to the vector sum of the grad of the square of each component of the rate vector. But Newton's 2nd Law states that the time derivative of the rate is proportional to the grad of the potential. So, in the vectorised paradigm we use, the 2nd Law generalises to

$$\nabla^{(j)} \left[ (V_1^{(j)})^2 + \ldots + (V_{N_X}^{(j)})^2 \right] / 2 = \alpha_0\, \nabla^{(j)} \left( -\phi_1^{(j)} - \ldots - \phi_{N_X}^{(j)} \right),$$

where $-\alpha_0$ is a constant and

$$\nabla^{(j)} := e_1^{(j)} \frac{d}{dX_1^{(j)}} + \ldots + e_{N_X}^{(j)} \frac{d}{dX_{N_X}^{(j)}},$$

with the vector $X^{(j)} \in \mathbb{R}^{N_X}$ written in the basis $\{e_1^{(j)}, \ldots, e_{N_X}^{(j)}\}$, as stated above.


This implies that in our $j$-th time window,

$$\left[ (V_1^{(j)})^2 + \ldots + (V_{N_X}^{(j)})^2 \right] / 2 = \| V^{(j)} \|^2 / 2 = \alpha_0 \left( -\phi_1(X_1^{(j)}) - \ldots - \phi_{N_X}(X_{N_X}^{(j)}) \right) + \alpha_1, \qquad (2.3)$$

where $\alpha_0, \alpha_1 \in \mathbb{R}$ are constants. Thus, the norm of the rate of change of the location variable vector $X^{(j)}$ is $\| V^{(j)} \|$, which we compute as $\sqrt{2\left[\alpha_0\left(-\phi_1(X_1^{(j)}) - \ldots - \phi_{N_X}(X_{N_X}^{(j)})\right) + \alpha_1\right]}$.

So at a test time point $t^{(test)}$ that lies outside the $N_\tau$ time windows of the learning period, the mean effective potential vector $\phi^{(test)} = (\phi_1^{(test)}, \ldots, \phi_{N_X}^{(test)})^T$ is first predicted, as a random sample from the learnt stochastic process that underlies it. Then, using this predicted potential vector, the rate of change vector $V^{(test)}$ of the state space variable is computed to be s.t. the magnitude of this rate of change is

$$\| V^{(test)} \| = \sqrt{2\left[ \alpha_0 \left( -\phi_1^{(test)} - \ldots - \phi_{N_X}^{(test)} \right) + \alpha_1 \right]}. \qquad (2.4)$$

The closed-form forward prediction of the mean potential vector at a test time is allowed via our learning of the GP that we invoke to generate the potential function; an accompanying closed-form prediction of the variance of the potential is also allowed. Thus, an uncertainty-included prediction of the potential at the test time is undertaken. Since we know empirical values of the location parameter vector at the $j$-th time window, $\forall j \in \{N_\tau + 1, N_\tau + 2, \ldots\}$, we can compute the values of $\alpha_0$ and $\alpha_1$ by fitting the forecast results for location, obtained from two time windows, to these empirical values. These very fit values of $\alpha_0$ and $\alpha_1$ are then used at all other time windows within the prediction period. (The first sentence in this paragraph speaks only of observations of location, and not of the rate, since observations of the rate variable are typically unavailable in generic, real-world systems, with location the only observable. We can then only obtain an approximation of the empirical rate, as the difference between location parameter values observed 1 time unit apart.)
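Fitting $\alpha_0$ and $\alpha_1$ from two time windows amounts to solving a $2 \times 2$ linear system given by the relation $V^2/2 = \alpha_0(-\phi) + \alpha_1$. A sketch with a scalar potential and empirical rate approximations (all input values hypothetical):

```python
import numpy as np

def fit_alphas(phi_a, phi_b, v_a, v_b):
    """Solve for (alpha0, alpha1) from V^2/2 = alpha0*(-phi) + alpha1 at two
    time windows; v_a, v_b are empirical rate approximations, e.g. differences
    of observed locations one time unit apart."""
    A = np.array([[-phi_a, 1.0],
                  [-phi_b, 1.0]])
    b = np.array([0.5 * v_a ** 2, 0.5 * v_b ** 2])
    alpha0, alpha1 = np.linalg.solve(A, b)
    return alpha0, alpha1
```

The system is well-posed as long as the two windows carry distinct potential values; once fit, the same $(\alpha_0, \alpha_1)$ pair is reused at all subsequent prediction-period windows, as described above.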

2.5 Collating All Notation

This section is dedicated to the discussion of the technical details of the forecasting of values of the phase space variables at a test time $T = t^{(test)}$. We begin with a note on the notation that is used in this exercise; some of the notation that is mentioned

below is recalled from above, and the rest is motivated to enable discussions on the methodology that follows later.

• The location variable is referred to as $X \in \mathcal{X} \subseteq \mathbb{R}^d$ in general, though we will lay the foundations of the methodology using $d = 1$.
• The rate of change of the location variable at a given time, referred to as the rate variable, is $V(T) \equiv \dot{X}(T)$. This will typically be abbreviated below to $V \equiv \dot{X}$.
• The phase space is $\mathcal{W} \subseteq \mathbb{R}^2$, for a scalar-valued $X$. In other words, $(X, V) \in \mathcal{W}$.
• The time variable is $T \in \mathcal{T} \subseteq \mathbb{R}$.
• The effective potential function is $\phi(X(T), T)$, which, as motivated in Sect. 2.1.6, takes values in $[-1, 0)$, i.e. $\phi : \mathcal{X} \times \mathcal{T} \longrightarrow [-1, 0)$. We will typically abbreviate the notation for the effective potential to $\phi(X, T)$, and refer to it as the "potential".
• Forecasting at test time $t^{(test)}$ will be enabled subsequent to our learning of the potential during the learning period, which covers the time interval that starts at time $T = 0$ and ends at $T = t_L$. We partition this interval $[0, t_L]$ into $N_\tau$ time windows, s.t. in the $j$-th time window, the system behaves as an autonomous Hamiltonian system, $\forall j = 1, \ldots, N_\tau$.
• We will see below that, under the model of a Hamiltonian system during a time window, evolution of the system phase space pdf is governed by the Liouville Theorem, s.t. the total time derivative of this pdf vanishes, and the phase space pdf can be modelled as the pdf of the system energy variable, where energy is defined in our work as the sum of the potential and kinetic energies: $\varepsilon(x, v) := \phi(x, t) + v^2/2$. Thus the phase space pdf is $f_\varepsilon(\varepsilon(x, v))$.
• During the $j$-th time window, the pdf of the energy variable is denoted by $f_\varepsilon^{(j)}(\varepsilon)$, and the potential function of this autonomous Hamiltonian system is denoted by $\phi^{(j)}(X)$; $j = 1, \ldots, N_\tau$. The forms of the phase space pdf and potential vary as we move from one local time window to another.
• As the value of $\phi^{(j)}(X)$ is not accessible at a design value of $X$ during any time window, training data (for the supervised learning of the temporal variation of the potential function) is unavailable during any time window. We therefore adopt a new method to learn the discretised, or vectorised, version of the effective potential, i.e. learn the iid parameters $\phi_1^{(j)}, \ldots, \phi_{N_X}^{(j)}$, where the interval $[x_1^{(j)}, x_{N_X}^{(j)})$ is partitioned into $N_X$ equally-wide partitions of width $\delta_j$, s.t. the potential function over the $i$-th such partition is held an unknown constant, denoted by $\phi_i^{(j)}$. This construct holds $\forall j = 1, \ldots, N_\tau$. Thus, the potential function is replaced by the potential vector $\phi^{(j)} := (\phi_1^{(j)}, \ldots, \phi_{N_X}^{(j)})^T$ in our model.
• It is to be noted that the range of values that the location variable $X$ attains during the $j$-th time window is different from that attained during the $j'$-th time window. Here $j \neq j'$; $j, j' \in \{1, \ldots, N_\tau\}$.
• The only data that is available during the $j$-th time window is $D_j := (x_1^{(j)}, \ldots, x_{N_j}^{(j)})^T$, $\forall j = 1, \ldots, N_\tau$.
• The phase space pdf is similarly vectorised. During the $j$-th time window, we do not know the phase space pdf, and neither do we have access to its realisations at a design value of the energy r.v. $\varepsilon(X, V)$. Hence we vectorise the pdf that we seek to learn.

[Flowchart: in the $j$-th time window, the potential $\phi^{(j)}(X(t))$, once input into the 2nd Law, would yield $X(t), V(t)$; but $\phi^{(j)}(\cdot)$ is unknown, and the training set $\{(j, \phi^{(j)}(X))\}_{j=1}^{N_\tau}$ is unavailable. So $\phi^{(j)}(\cdot)$ is vectorised to the $\phi^{(j)}$ vector, and the phase space pdf to the $f^{(j)}$ vector. The vector $\phi^{(j)}$ is learnt inside the $j$-th time window by embedding the potential in the support of the phase space pdf, integrating the unobserved rate out of the phase space pdf to yield the marginal pdf of the location variable, and defining the likelihood using location marginals computed at each location datum, all in the $j$-th time window. Priors are invoked on each component of $\phi^{(j)}$ and $f^{(j)}$; the posterior pdf of all such components is computed and inference performed using MCMC, so that $\phi^{(j)}$ is learnt in the $j$-th time window and the training data $\{(j, \phi^{(j)})\}_{j=1}^{N_\tau}$ is ready for the $N_\tau$ time windows of the learning period. Using this training data, the vector-valued random potential function of time is learnt, by modelling it as a sample function from a vector-variate Gaussian Process. The potential is forecast at a future time point that is the centroid of a time window within the prediction period; this forecast potential is input into the 2nd Law to compute the rate at that time point, and the location is computed by integrating over the rate till that future time point.]

Fig. 2.2 Flowchart depicting the state forecasting scheme

To accomplish this vectorisation, the range of values of the system energy (i.e. $\varepsilon(x, v) := \phi(x, t) + v^2/2$) is identified as $[-1, 0]$, where we recall that the system energy variable has been normalised by $-\phi_{min}$ (see Sect. 2.1.6). During the $j$-th time window, the energy range $[-1, 0]$ is partitioned into $N_\varepsilon$ partitions of equal width, and the phase space pdf is held constant over each such partition, s.t. over the $k$-th $\varepsilon$-partition, $f_\varepsilon^{(j)}(\varepsilon)$ takes the value $f_k^{(j)}$, where $k \in \{1, \ldots, N_\varepsilon\}$. We define the vector $f^{(j)} := (f_1^{(j)}, \ldots, f_{N_\varepsilon}^{(j)})^T$. It is this vector $f^{(j)}$ that we will learn $\forall j \in \{1, \ldots, N_\tau\}$. In Fig. 2.1, the vectorised potential and phase space pdf are depicted schematically. The flowchart of this whole forecasting scheme is shown in Fig. 2.2.
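The two discretisations above (of the $X$-interval into $N_X$ bins holding the constants $\phi_i^{(j)}$, and of the energy range $[-1, 0]$ into $N_\varepsilon$ partitions holding the constants $f_k^{(j)}$) can be sketched concretely. Sizes and ranges below are hypothetical, and the helper names are illustrative rather than from the text.

```python
import numpy as np

# Hypothetical sizes: within one time window the potential over
# [x_1, x_{N_X}) is replaced by N_X unknown constants, and the phase-space
# pdf over the energy range [-1, 0] by N_eps unknown constants.
N_X, N_eps = 5, 4
x_lo, x_hi = 0.0, 10.0                         # range of X seen in this window

x_edges = np.linspace(x_lo, x_hi, N_X + 1)     # N_X equal-width X-bins
eps_edges = np.linspace(-1.0, 0.0, N_eps + 1)  # width delta_eps = 1/N_eps

def x_bin(x):
    """Index i of the X-partition holding the unknown constant phi_i."""
    return min(int((x - x_lo) / (x_hi - x_lo) * N_X), N_X - 1)

def eps_bin(eps):
    """Index k of the energy partition holding the constant f_k."""
    return min(int((eps + 1.0) * N_eps), N_eps - 1)

print(x_edges[1] - x_edges[0])   # X-bin width delta_j = 2.0 here
print(eps_bin(-0.3))             # -0.3 falls in the partition with index 2
```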

2.6 Details of the Potential Learning

The potential is learnt anew at each time window (during which the system is considered to be autonomous Hamiltonian), using the method motivated above, by embedding the potential into the support of the phase space pdf. One aspect of the potential learning that we need to be aware of is the choice of the width of the time window. If the rate of evolution picks up during the learning period, s.t. the system remains autonomous only over a shorter time than the originally-chosen width of the time window, then the potential learning will be rendered wrong. A small time width is therefore recommended to start with. If this is found inadequate midway through the learning period, then we will have to contract the time window width and restart the training. However, if the rate of evolution picks up during the prediction period, then that will not amount to an error in the potential learning. Once the potential has been learnt and

we are only updating the correlation structure of the GP that is invoked to model the potential function, the difference between two time points at which potential values are sought/forecast only inputs towards the parametrisation of the covariance structure of the GP (using covariance kernels). So, as long as the updated time width is used (in the kernel), our forecasting will not be rendered inaccurate.

2.6.1 Likelihood

During the $j$-th time window, the unknown parameters are the vectorised versions of the phase space pdf and the potential function, i.e. $f^{(j)}$ and $\phi^{(j)}$ respectively. Here $j \in \{1, \ldots, N_\tau\}$. At this time window, the only available data $D_j$ comprises observations of the location variable $X$. The likelihood of the unknown parameters given the available data is then

$$\ell\left(f_1^{(j)}, \ldots, f_{N_\varepsilon}^{(j)}, \phi_1^{(j)}, \ldots, \phi_{N_X}^{(j)} \,\Big|\, x_1^{(j)}, \ldots, x_{N_j}^{(j)}\right).$$

Had we not vectorised the sought functions, we could have expressed the likelihood in the $j$-th time window (assuming the $N_j$ observations of $X$ in this $j$-th time window to be iid) as

$$\prod_{i=1}^{N_j} \int_{v \in \mathcal{V}^{(j)}} f_\varepsilon^{(j)}(\varepsilon)\, dv \;=\; \prod_{i=1}^{N_j} \int_{v \in \mathcal{V}^{(j)}} f_\varepsilon^{(j)}\!\left(\phi^{(j)}(x_i) + v^2/2\right) dv,$$

where the marginal pdf of the location variable is attained by integrating over all possible values that the unobserved rate variable $V$ attains in this $j$-th time window. Such values are denoted to live in the space $\mathcal{V}^{(j)} \subseteq \mathbb{R}$. So we would need to identify the range of values of $V$ that we will need to integrate over. However, the likelihood envisaged in the last equation is not of relevance, as we do vectorise the sought pdf and the effective potential function.

2.6.2 Likelihood Given Vectorised Phase Space pdf

In light of the vectorised form of the location pdf, the likelihood of the unknown parameters given the observations available during the $j$-th time window is identified as

$$\ell\left(f_1^{(j)}, \ldots, f_{N_\varepsilon}^{(j)}, \phi_1^{(j)}, \ldots, \phi_{N_X}^{(j)} \,\Big|\, x_1^{(j)}, \ldots, x_{N_j}^{(j)}\right) = \prod_{i=1}^{N_j} \sum_{k=1}^{N_\varepsilon} f_k^{(j)} \int_{v \in \mathcal{V}_i^{(j)}(k)} dv, \tag{2.5}$$

where $\mathcal{V}_i^{(j)}(k)$ is the volume occupied by the rate variable $V$ in the $k$-th $\varepsilon$-partition during the $j$-th time window; this is a function of the potential at $X = x_i^{(j)}$. Basically, the integral of the phase space pdf with respect to the rate variable (see the last equation of the last subsection) is replaced, in the vectorised treatment of the pdf, by the sum of contributions to this integral from each such $\varepsilon$-partition. Over each such partition, the pdf is itself a constant, and so can be brought outside the integral with respect to $V$. Then, to compute the contribution to the integral from any such partition, we have to compute the limits on the rate variable relevant to the $\varepsilon$-partition in question, within the currently-considered time window. Details of these limits are what we discuss next.

The range of values that $V$ attains in the $k$-th normalised-energy partition is identifiable from the definition of the normalised energy (recall Eq. 2.2):

$$\varepsilon(x, v) := \phi(x, t) + v^2/2 \;\Rightarrow\; v = \pm\sqrt{2\varepsilon(x, v) - 2\phi(x, t)}.$$

In the vectorised paradigm, the value of the potential at a given datum $x_q^{(j)}$ in the $j$-th time window is $\phi_i^{(j)}$, if $x_q^{(j)}$ falls in the $i$-th $X$-bin. Thus, in any time window, the index of the $X$-bin used below, which determines the component of the potential vector that is invoked, is driven by the datum at which the result is sought. In our notation below, we do not, however, mark this index of the $X$-bin as a function of the datum; this helps simplify the notation. Any reference to the $X$-bin index in the results below should be read with this index being a function of the datum under consideration.

When in the $k$-th $\varepsilon$-partition, the minimal value of $\varepsilon(x, v)$ is $-1 + \delta_\varepsilon(k-1)$, and its maximal value is $-1 + \delta_\varepsilon k$. This is because $\varepsilon(x, v)$ takes values in $[-1, 0]$; here the width of each $\varepsilon$-partition is $\delta_\varepsilon = 1/N_\varepsilon$. Inserting these values of the energy r.v. in its definition as the sum of the effective potential and half the squared rate, we get:

• that in the $j$-th time window, the smallest positive value of $V$ in the $k$-th $\varepsilon$-partition, at the $i$-th $X$-partition, is

$$v_{min}^{(k,i,j)} = \sqrt{2\left[-1 + \delta_\varepsilon(k-1) - \phi_i^{(j)}\right]};$$


• and the largest positive value of $V$ in this $k$-th $\varepsilon$-partition, at the $i$-th $X$-partition, is

$$v_{max}^{(k,i,j)} = \sqrt{2\left[-1 + \delta_\varepsilon k - \phi_i^{(j)}\right]},$$

during the $j$-th time window, where $j \in \{1, \ldots, N_\tau\}$, $k \in \{1, \ldots, N_\varepsilon\}$, $i \in \{1, \ldots, N_X\}$.

Then it implies that the volume occupied by $V$ in the $k$-th $\varepsilon$-partition, at the $i$-th $X$-bin, is

$$\int_{v \in \mathcal{V}_i^{(j)}(k)} dv \equiv 2\int_{v_{min}^{(k,i,j)}}^{v_{max}^{(k,i,j)}} dv = 2\left[\sqrt{2\left(-1 + \delta_\varepsilon k - \phi_i^{(j)}\right)} - \sqrt{2\left(-1 + \delta_\varepsilon(k-1) - \phi_i^{(j)}\right)}\right].$$

Inserting this into the definition of likelihood presented above in Eq. 2.5, we get

$$\ell\left(f_1^{(j)}, \ldots, f_{N_\varepsilon}^{(j)}, \phi_1^{(j)}, \ldots, \phi_{N_X}^{(j)} \,\Big|\, x_1^{(j)}, \ldots, x_{N_j}^{(j)}\right) = \prod_{i=1}^{N_j} \sum_{k=1}^{N_\varepsilon} 2 f_k^{(j)} \left[\sqrt{2\left(-1 + \delta_\varepsilon k - \phi_i^{(j)}\right)} - \sqrt{2\left(-1 + \delta_\varepsilon(k-1) - \phi_i^{(j)}\right)}\right]. \tag{2.6}$$

This is the likelihood of the unknowns, namely the $\{f_k^{(j)}\}_{k=1}^{N_\varepsilon}$ and the $\{\phi_i^{(j)}\}_{i=1}^{N_X}$ parameters in the $j$-th time window, given the data $D_j$ that is available during this time window.

2.6.3 Prior, Posterior and MCMC-Based Inference

We choose priors on each unknown parameter. We choose weak priors, in the absence of available information on the value of the potential function over a selected interval of values of the location variable $X$ (or rather, on components of the potential vector), and even more so on the form of the phase space pdf over a selected interval of values of the system energy (or rather, on components of the pdf vector). If an application includes information on the potential and/or phase space pdf, such would be welcome contributions to the prior selection. In the absence of such information we choose weak priors; in fact, we work with a prior probability density $\pi_0(\theta) = \mathcal{N}(\mu_\theta, \sigma_\theta^2)$ for an unknown parameter $\theta \in \{f_k^{(j)}\}_{k=1}^{N_\varepsilon} \cup \{\phi_i^{(j)}\}_{i=1}^{N_X}$, the support of the probability distribution of which is $\mathbb{R}$. [If the unknown $\theta$ is s.t. its distribution has support over a subset of $\mathbb{R}$, then the prior can be changed from a Normal (with mean $\mu_\theta$ and variance $\sigma_\theta^2$) to a prior that bears the identified support of the distribution of $\theta$.]

The likelihood and the prior together allow for the formulation of the joint posterior of all the $N_X + N_\varepsilon$ unknowns relevant to a given time window, given the values of the location variable $X$ that is observed during this time window. Samples are generated from this joint posterior of the unknowns given the available data on $X$, using Metropolis-within-Gibbs. The potential parameters ($\{\phi_i^{(j)}\}_{i=1}^{N_X}$) are updated in the first block of each iteration of the implementation of this MCMC algorithm, and the pdf parameters ($\{f_k^{(j)}\}_{k=1}^{N_\varepsilon}$) are updated in the second block of any iteration, at the updated potential parameters. Given the demand of non-positivity on the potential parameters, we propose candidates for any potential parameter from a truncated Normal that is right-truncated at 0, with mean that is the current value of the considered potential parameter, and variance that is experimentally chosen. Again, given the non-negative nature of the pdf parameters, we propose candidates for any pdf parameter from a truncated Normal that is left-truncated at 0, with mean that is the current value of the considered pdf parameter, and variance that is experimentally chosen. Upon convergence of the MCMC chain, the marginal posterior probability density of any parameter is computed (given the available data in the considered time window), and the 95% HPDs are computed therefrom. Thus, the Metropolis-within-Gibbs algorithm is implemented to undertake Bayesian inference on the values of each of the parameters during the $j$-th time window, when the empirical illustration of this potential-learning methodology is undertaken.
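The two-block Metropolis-within-Gibbs update described above can be sketched as follows. This is a simplified, hypothetical illustration: `toy_log_post` stands in for the actual log of likelihood (2.6) plus log-priors, the proposal scales are arbitrary, and the Hastings correction for the (asymmetric) truncated-Normal random walk is the log-ratio of truncation masses.

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(1)

def norm_cdf(z):
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def propose_block(cur, sd, side):
    """Componentwise truncated-Normal random-walk proposal: side='neg'
    right-truncates at 0 (potential parameters), side='pos' left-truncates
    at 0 (pdf parameters). Returns the candidate and the log Hastings
    correction log Z(cur) - log Z(cand), Z being the truncation mass."""
    cand = np.empty_like(cur)
    log_hast = 0.0
    for i, c in enumerate(cur):
        while True:                      # rejection sampling of the proposal
            x = rng.normal(c, sd)
            if (x < 0.0) == (side == 'neg'):
                break
        cand[i] = x
        if side == 'neg':
            log_hast += np.log(norm_cdf(-c / sd)) - np.log(norm_cdf(-x / sd))
        else:
            log_hast += np.log(norm_cdf(c / sd)) - np.log(norm_cdf(x / sd))
    return cand, log_hast

def mwg_step(phi, f, log_post, sd=0.2):
    """One Metropolis-within-Gibbs iteration: first block updates the
    potential parameters; second block updates the pdf parameters at the
    (possibly) updated potential."""
    cand, lh = propose_block(phi, sd, 'neg')
    if np.log(rng.uniform()) < log_post(cand, f) - log_post(phi, f) + lh:
        phi = cand
    cand, lh = propose_block(f, sd, 'pos')
    if np.log(rng.uniform()) < log_post(phi, cand) - log_post(phi, f) + lh:
        f = cand
    return phi, f

def toy_log_post(phi, f):
    """Hypothetical stand-in for log-likelihood (2.6) plus weak log-priors."""
    return -0.5 * np.sum((phi + 0.5) ** 2) - 0.5 * np.sum((f - 0.3) ** 2)

phi, f = np.full(3, -0.4), np.full(4, 0.25)
for _ in range(1000):
    phi, f = mwg_step(phi, f, toy_log_post)
print(phi, f)   # phi stays negative, f stays positive by construction
```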

2.7 Predicting Potential at Test Time: Motivation

This MCMC-based bespoke learning of the potential is undertaken independently for each of the $N_\tau$ time windows, to result in the uncertainty-included learning of the potential vector $\phi^{(j)}$ and the pdf vector $f^{(j)}$ for the $j$-th time window that is a part of the learning period. The ultimate aim behind the learning of the potential vector is to enable the supervised learning of the temporal dependence of the potential vector. This reduces to the pursuit of the covariance structure of the generative stochastic process that underlies the vector-valued potential r.v. Such covariance learning will involve

• learning the covariance between pairs of the $N_\tau$-dimensional vectors, where one such vector is defined for each of the $N_X$ components of the vector-valued potential r.v. that is learnt during the learning period. We refer to this $N_X \times N_X$-dimensional covariance matrix as the inter-component covariance matrix $\Sigma_{compo}$.


• learning the covariance between each pair of (the $N_\tau$ number of) potential vectors that are learnt as realised during any two time windows considered in the modelling; we refer to this as the inter-time covariance. As there are $N_\tau$ time windows during which a new potential vector is learnt, such inter-time covariance is informed upon by the $N_\tau \times N_\tau$-dimensional covariance matrix $\Sigma_{time}$.

The motivation for learning the covariance matrices of this underlying stochastic process is to be allowed to sample from it at any given (test) time $t^{(test)}$. In other words, once the covariance matrices are learnt, we can sample a realisation of the random potential vector from this process, at the known time $t^{(test)}$. In fact, learning the covariance function of this mother process enables prediction of the mean (and variance) of the samples that can be drawn from this process at the given test time. In principle, $t^{(test)}$ could lie outside the time interval $[0, t_L]$, to the left or the right of this interval, or indeed inside this interval, where we recall that $[0, t_L]$ is the interval of values of time within which the potential learning is undertaken, i.e. this is the interval that traverses the temporal extent of the $N_\tau$ time windows. We have identified this interval as the learning period. Thus our method allows for prediction as well as forecasting.

Such forecasting/prediction indeed follows the supervised learning of the functional relationship between time and the (vectorised) potential function. The strength of this method is that, at a given time, it permits the learning of the potential vector by noting how its components correlate at this time, instead of attempting to learn the temporal variation of the location parameter itself. As we have said before, the latter approach cannot work when a (limited) training set does not include the patterns that are manifest in the temporal variation in $X$ at test times, and this problem increases in severity with increasing discontinuity in the distribution of values of $X$ across time. To clarify, the worry is that when forecasting at a future time point $t^{(test)}$, the training set on values of $T$ and $X$ might be s.t. the temporal variation of $X$ as $T \longrightarrow t^{(test)-}$ may not suggest the correct value of $X$ at $T = t^{(test)}$. In other words, learning based on such a training set is then responsible for the erroneous forecast made at $t^{(test)}$.

Our attempt to learn the potential of a dynamical system, locally in time and location, is suggestive of the search for a causation for $X$ to attain certain values at respective times, using which we can reliably predict/forecast $X$ (and/or its rate of change $V$). Such a desired causation that underpins the dynamics, dictating movement from one point in the system phase space to another, is what this method attempts to identify. It does so by learning the potential as a function of time and the location variable $X$, with the potential embedded inside the support of the phase space pdf. The training required


for the learning of the aforementioned function is provided by the potential vectors learnt during different times (namely, centroids of the time windows), where components of such a potential vector are learnt at different values of $X$.

When we predict/forecast the potential vector at a test time, we then use the dynamical ideas discussed in Sect. 2.4.4 (see Eq. 2.4) to compute the rate of change of the location variable, and thereby compute the value of $X$. Undertaking these computations is essentially the solution of differential equations, and therefore we require boundary conditions to perform them. The same are appreciated in the context of the considered empirical illustration of the method.

2.7.1 Learning the Generative Process Underlying Temporal Variation of Potential, Following Bespoke Potential Learning

By achieving bespoke learning of the potential vectors at each of the time windows within which the system is assumed to be Hamiltonian and autonomous (though system evolution is marked as globally distinct from such characteristics), we have populated the originally-absent training data set

$$D_{train-potential} := \{(j, \phi^{(j)})\}_{j=1}^{N_\tau}.$$

Now that this training data is at last available, we will learn the relationship between time and the vector-valued random function that is the effective potential function. This vector-valued potential variable has $N_X$ components, and at the end of the learning period, our generated training data includes $N_\tau$ realisations of this $N_X$-dimensional vector-valued variable. We denote this random vector-valued variable as $\phi(T)$, where its $i$-th component is denoted $\phi_i(T)$; to reconcile already-used notation with this, $\phi_i(T = t \in [(j-1)\tau, j\tau)) \equiv \phi_i^{(j)}$. Indeed, this random potential function depends on the location variable $X$, which is itself in fact a function of time $T$, and so it may be argued that the $X$-dependence should be explicit in the notation for this vector-valued random potential. Such a dependence, though apparently missing, is borne by the notation $\phi_i(T)$ via the dependence on the "$i$", as this indicates the value of the location parameter within the $i$-th partition of the $X$ vector, inside the time window that includes $T$.
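The assembly of the generated training set, and the indexing convention just described, can be sketched as follows; the learnt potential values are synthetic stand-ins for the output of the bespoke learning stage, and the helper name `phi_i` is illustrative.

```python
import numpy as np

# Hypothetical output of the bespoke learning stage: one learnt potential
# vector (N_X components, all negative) per time window j = 1..N_tau.
N_X, N_tau, tau = 3, 5, 1.0
rng = np.random.default_rng(7)
D_train = {j: -rng.uniform(0.1, 1.0, size=N_X) for j in range(1, N_tau + 1)}

# Arrange D_train = {(j, phi^{(j)})} as the N_X x N_tau matrix whose j-th
# column is phi^{(j)}.
Phi = np.column_stack([D_train[j] for j in range(1, N_tau + 1)])

def phi_i(i, t):
    """Component phi_i(T) at T = t: read off row i, with the window index j
    determined by the time window that includes t."""
    j = int(t // tau) + 1
    return Phi[i - 1, j - 1]

print(Phi.shape, phi_i(1, 0.5) == Phi[0, 0])
```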


We model the vector-valued, random potential function as a random realisation from a vector-variate GP:

$$\phi(\cdot) \sim \mathcal{GP}(\mu(\cdot), K(\cdot, \cdot)).$$

Here the mean function $\mu(\cdot)$ is $N_X$-dimensional. $K(\cdot, \cdot)$ is the matrix-valued covariance function of this GP, s.t. its $(i, i')$ element informs on the covariance between the $i$-th and $i'$-th components of the sample (potential) function at a given time $t$, i.e. between $\phi_i(t)$ and $\phi_{i'}(t)$. Then, by definition, the joint probability distribution of $N_\tau$ realisations of this random vector-valued potential function is the matrix Normal density, which takes as its parameters:

• the $N_X \times N_\tau$-dimensional mean matrix $\mu$;
• the inter-time covariance matrix $\Sigma_{time} \in \mathbb{R}^{N_\tau \times N_\tau}$;
• and the inter-component covariance matrix $\Sigma_{compo} \in \mathbb{R}^{N_X \times N_X}$.

We appreciate that correlation between a pair of components, $\phi_i(t)$ and $\phi_{i'}(t)$, of the realised potential vector at a time $T = t$ during the $j$-th time window has no explicit time dependence. Correlation between $\phi_i(t)$ and $\phi_i(t')$ can bear an explicit time dependence, where time $t$ occurs during the $j$-th time window and $t'$ during the $j'$-th time window; $j \neq j'$; $j, j' = 1, \ldots, N_\tau$. In other words, the joint probability of the $N_\tau$ realisations of the random $N_X$-dimensional potential function $\phi(T) := (\phi_1(T), \ldots, \phi_{N_X}(T))^T$, at respective design (time) points that live in the training set $D_{train-potential}$, is

$$[\phi^{(1)}, \ldots, \phi^{(N_\tau)}] = \mathcal{MN}(\mu, \Sigma_{time}, \Sigma_{compo}),$$

where $\phi^{(j)}$ denotes the $j$-th realisation of the random function $\phi(T)$, and $\mathcal{MN}(\cdot, \cdot, \cdot)$ denotes a matrix Normal density with the mean matrix and two covariance matrices. Here $j = 1, \ldots, N_\tau$. This is equivalent to stating that the likelihood of the mean and covariance parameters $(\mu, \Sigma_{time}, \Sigma_{compo})$, given the data $D_{train-potential}$, is matrix Normal with mean matrix $\mu$ and covariance matrices $\Sigma_{time}, \Sigma_{compo}$. We rephrase the last statement, while recollecting the full form of the matrix Normal density:

$$\ell(\mu, \Sigma_{time}, \Sigma_{compo}) = \frac{1}{(2\pi)^{N_X N_\tau/2}\, |\Sigma_{time}|^{N_X/2}\, |\Sigma_{compo}|^{N_\tau/2}} \times \exp\left[-\frac{\mathrm{trace}\left[\Sigma_{time}^{-1} (\boldsymbol{\phi} - \mu)^T \Sigma_{compo}^{-1} (\boldsymbol{\phi} - \mu)\right]}{2}\right], \tag{2.7}$$


where the $N_X \times N_\tau$-dimensional matrix $\boldsymbol{\phi}$ is defined as

$$\boldsymbol{\phi} = \left[\tilde{\phi}_i^{(j)}\right]_{i=1;\, j=1}^{N_X;\, N_\tau}.$$

• Here, the $ij$-th element $\tilde{\phi}_i^{(j)}$ of matrix $\boldsymbol{\phi}$ is the standardised version of the $i$-th component $\phi_i^{(j)}$ of the potential vector $\phi^{(j)}$ that is learnt in the $j$-th time window.
• Here, the standardisation of $\phi_i^{(j)}$ is undertaken by first subtracting the sample mean $\bar{\phi}_i$ from $\phi_i^{(j)}$, $\forall j = 1, \ldots, N_\tau$, where

$$\bar{\phi}_i = \frac{\sum_{j=1}^{N_\tau} \phi_i^{(j)}}{N_\tau}.$$

Thereafter, we normalise the difference $\phi_i^{(j)} - \bar{\phi}_i$ with the sample standard deviation $s_i$, $\forall i = 1, \ldots, N_X$; $j = 1, \ldots, N_\tau$, where

$$s_i^2 = \frac{\sum_{j=1}^{N_\tau} \left(\phi_i^{(j)} - \bar{\phi}_i\right)^2}{N_\tau - 1}.$$

Thus,

$$\tilde{\phi}_i^{(j)} = \frac{\phi_i^{(j)} - \bar{\phi}_i}{s_i},$$

$\forall i = 1, \ldots, N_X$; $j = 1, \ldots, N_\tau$.
• Zero-mean GP: The Gaussian Process that we invoke to underlie the random vector-valued potential function is considered to be a zero-mean GP, i.e. $\mu$ is modelled as the null matrix. This stands justified, as the GP is but a prior on the vector-valued potential function, so our choice of its mean parametrisation is not misplaced, especially in light of our standardisation of the components of the collection of r.v.s, the joint of which is matrix Normal.
• Covariance amongst potential at different $X$-partitions: The covariance matrix $\Sigma_{compo}$ that informs on the inter-row covariance of the matrix $\boldsymbol{\phi}$ is populated using the empirical plug-in estimate of such inter-row covariance. Thus, if the standardised $i$-th components of the $N_\tau$ learnt realisations of the potential are in $\{\tilde{\phi}_i^{(1)}, \ldots, \tilde{\phi}_i^{(N_\tau)}\}$, and the standardised $i'$-th


components are in $\{\tilde{\phi}_{i'}^{(1)}, \ldots, \tilde{\phi}_{i'}^{(N_\tau)}\}$, then the plug-in (unbiased) estimate of the inter-row covariance matrix is

$$\hat{\Sigma}_{compo} = [\hat{\sigma}_{i,i'}], \quad \text{where} \quad \hat{\sigma}_{i,i'} = \frac{\sum_{j=1}^{N_\tau} \tilde{\phi}_i^{(j)} \tilde{\phi}_{i'}^{(j)}}{N_\tau - 1} - \frac{N_\tau}{N_\tau - 1} \frac{\sum_{j=1}^{N_\tau} \tilde{\phi}_i^{(j)}}{N_\tau} \frac{\sum_{j=1}^{N_\tau} \tilde{\phi}_{i'}^{(j)}}{N_\tau}. \tag{2.8}$$

We recall that the standardisation of $\phi_i^{(j)}$ into $\tilde{\phi}_i^{(j)}$ occurs via subtraction, from the former, of the mean of the $i$-th component computed over all $j$ values. This implies that $\sum_{j=1}^{N_\tau} \tilde{\phi}_i^{(j)} = 0$ in our sample of the standardised components of the learnt values of the potential random function.
• Improving on the plug-in estimate of $\Sigma_{compo}$? Such a plug-in estimate of the $\Sigma_{compo}$ matrix is undertaken for the empirical illustration of the methodology, as discussed in the next section. However, this can definitely be improved upon, where the need for such improvement would become all the more germane if there is only a moderately small number of components of the vector $(\tilde{\phi}_i^{(1)}, \ldots, \tilde{\phi}_i^{(N_\tau)})^T$, i.e. if $N_\tau$ is moderately low. Kernel-parametrisation of the $\Sigma_{compo}$ matrix is possible, with the $(i, i')$-th element ($\hat{\sigma}_{i,i'}$) modelled as a declining function of the distance between $i$ and $i'$. Hyperparameters of such a covariance kernel will need to be learnt; such would be a more robust model of the kernel than if the hyperparameters are ascribed chosen global values. In fact, Chakrabarty & Wang (in preparation) suggest that the hyperparameters of the covariance kernels can each be modelled as a function of the sample path drawn from the higher-dimensional (i.e. vector-variate) GP, where this sample path is a realisation of the potential random function. Any such function of the sample path is an unknown, and can itself be modelled as a random realisation from another (scalar-variate) GP that is proved to be a stationary GP, s.t. its hyperparameters are unknown constants that we could learn given the data.
• Recalling the correlation structure of the GP underlying the vector-valued potential function: It is noteworthy that the $(i, i')$-th element $\hat{\sigma}_{i,i'}$ of the inter-row (or inter-component) covariance matrix $\Sigma_{compo}$ informs on the correlation between the values of the effective potential learnt during the $i$-th $X$-partition and the $i'$-th $X$-partition, across all the time windows over which the potential is learnt. This is true for all relevant pairs of $i$ and $i'$.
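The plug-in estimate of Eq. (2.8) can be sketched as follows; since each standardised row sums to zero, the second (mean-product) term of Eq. (2.8) vanishes, leaving the cross-product term. Sizes and values below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)
N_X, N_tau = 3, 6
Phi = -rng.uniform(0.1, 1.0, size=(N_X, N_tau))   # learnt potential matrix

# Standardise each row: component phi_i across the N_tau time windows.
Phi_std = (Phi - Phi.mean(axis=1, keepdims=True)) \
          / Phi.std(axis=1, ddof=1, keepdims=True)

def sigma_compo_hat(P):
    """Unbiased plug-in estimate of the inter-component covariance,
    Eq. (2.8); for standardised rows the second term is zero, so the
    estimate reduces to sum_j p_i^(j) p_i'^(j) / (N_tau - 1)."""
    n = P.shape[1]
    first = P @ P.T / (n - 1)
    means = P.mean(axis=1)
    second = (n / (n - 1)) * np.outer(means, means)
    return first - second

S = sigma_compo_hat(Phi_std)
print(np.allclose(np.diag(S), 1.0))   # standardised rows have unit variance
```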
The value that the location parameter $X$ attains in the $i$-th $X$-partition, during a given time window, itself varies as we move from one time window to another, and it is an unknown, as far as potential learning is concerned. At the same time, the inter-time covariance matrix tells us about the correlation between the potential function learnt over all $N_X$ partitions during the $j$-th time window and that during the $j'$-th; $j, j' = 1, \ldots, N_\tau$.


Between themselves, the inter-component and inter-time covariance matrices cover the full evolution of the potential function, across all times within the learning period, as well as across all $X$-values that are attained by the system during this period. Indeed, the evolution across time windows is non-stationary in general. We use the unbiased empirical estimate of the inter-component correlation to inform on the evolution with levels of values of $X$, while a kernel-based parametrisation of the inter-time-window correlation is undertaken.
• Kernel parametrisation of the inter-time covariance matrix $\Sigma_{time}$: As evident in the last paragraph, the inter-time covariance matrix $\Sigma_{time}$ is s.t. its $(j, j')$ element informs on the correlation between the potential function realised in the $j$-th time window ($\tilde{\phi}^{(j)}$) and the $j'$-th ($\tilde{\phi}^{(j')}$). Here $j, j' = 1, \ldots, N_\tau$. This covariance matrix is kernel-parametrised, where in the empirical illustration discussed below, this kernel is chosen to be a simple kernel with local hyperparameters that we learn from the data. Thus, the kernel is Square Exponential (or SQE) in form, with hyperparameters that vary with the times input in the computation of the kernel. If we indeed decide to populate each distinct cell of the symmetric inter-time correlation matrix using a distinct hyperparameter, that would definitely be a fully local model. However, such a model will require the computation of $N_\tau \times (N_\tau - 1)/2$ hyperparameters, which can be computationally daunting. So in the application discussed below in Sect. 2.8, we suggest the compromise that the same hyperparameter be used to populate a given row of the upper triangle of the covariance matrix $\Sigma_{time}$, but a different value of the hyperparameter be used to compute another row of this upper triangle.
This may introduce possible errors in inference; after all, the correlation between the potential vectors learnt at the 1st and $N_\tau$-th time windows should not be parametrised with the same length scale as the one that parametrises the correlation between potentials learnt at the 1st and 2nd time windows, while that invoked to parametrise potential vectors learnt at the 2nd and 3rd time windows is different. Indeed, this is not the best of model choices, but suggesting that the same length scale underlie correlations between potential vectors learnt at the 1st and $N_\tau$-th, the 1st and 2nd, and the 2nd and 3rd time windows is arguably worse. Thus, the kernel parametrisation that we use is an improvement over usage of a global length scale, i.e. a stationary covariance kernel. We define the covariance kernel of our usage in the undertaken empirical illustration (described below) as

$$\Sigma_{time} = \left[\mathrm{Cov}\left(\tilde{\phi}^{(j)}, \tilde{\phi}^{(j')}\right)\right], \quad \text{where} \quad \mathrm{Cov}\left(\tilde{\phi}^{(j)}, \tilde{\phi}^{(j')}\right) = \exp\left[-\frac{(t_j - t_{j'})^2}{(\ell_j)^2}\right] \quad \forall j' > j;\; j, j' \in \{1, \ldots, N_\tau\}.$$

Here $\ell_\cdot$ is a length scale hyperparameter of this covariance kernel. The essential motivation behind kernel parametrisation, in general, is to model the difference in the outputs realised at a pair of given design inputs as a declining function


of (a norm, in general) of the difference between the two inputs. Thus, this covariance kernel bears a locally-specified form, and is non-stationary. Lastly, we remind ourselves that by "local" here we imply temporally local, i.e. pertaining to a given time window. Indeed, multiple other forms for this covariance kernel could have been suggested; in particular, the 2-layered kernel suggested by Chakrabarty & Wang (under preparation) would have been a judicious choice. As stated above, such a kernel would have modelled each hyperparameter as a function of the sample path from the vector-variate GP, and equivalently as a realisation from—what is proved to be—a scalar-variate stationary GP. This nesting of multiple scalar-variate stationary GPs in the inner layer, within the outer layer formed by a non-stationary, higher-dimensional GP, is a more reliable method of learning. While recommending the same, in the empirical illustration presented below we opt for a simpler learning strategy, in which we learn the length scale hyperparameters ℓ_1, …, ℓ_{N_τ} of the aforementioned parametrically modelled kernel that parametrises the covariance matrix Σ_time. Concerns about the attainment of convergence of the MCMC chains run with this model are addressed by careful monitoring of the trendlessness of the traces of parameters learnt in such chains, and by learning all length scales anew every time a forecast of the potential vector at a new time window within the prediction period is undertaken.
• The length scale hyperparameter ℓ is an amplitude-modulated one
The covariance kernel described in the last paragraph could be considered to concur with the full Squared Exponential kernel form, with a non-unit amplitude. Then, in the kernel that expresses the correlation between the j-th bespoke-learnt potential vector and the j'-th one, the amplitude-included kernel would be

Cov(φ̃^(j), φ̃^(j')) = a_j exp(−(t_j − t_{j'})² / ℓ̃_j²),   ∀ j' > j; j, j' ∈ {1, …, N_τ}.

Then

ℓ_j² = ℓ̃_j² [1 − ln(a_j) / (t_j − t_{j'})²],

i.e. the length scale parameter that we use in the model for the kernel is modulated by the amplitude inherent in this chosen kernel shape, in a functional form that is specific to the inputs that we are computing the correlation for—j and j' in the equation above. Had we employed a distinct length scale for each (j, j') pair, we could have claimed that locally-relevant, amplitude-modulated length scale parameters are used in our model for the covariance kernel that parametrises the inter-time covariance. For the currently used model, the length scale parameters are nonetheless amplitude-modulated.
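As a check on the amplitude-modulated reading above, the following sketch (a hypothetical illustration in Python with numpy; all values are invented) evaluates the amplitude-included kernel and backs out an input-pair-specific effective length scale by equating exp(−(t_j − t_{j'})²/ℓ_j²) to the amplitude-included kernel value, a rearrangement of the modulation relation quoted above.

```python
import numpy as np

def se_kernel_amp(t_j, t_jp, a_j, ell_tilde_j):
    """Amplitude-included squared-exponential kernel between the j-th
    and j'-th (standardised) bespoke-learnt potential vectors."""
    d2 = (t_j - t_jp) ** 2
    return a_j * np.exp(-d2 / ell_tilde_j ** 2)

def effective_length_scale(t_j, t_jp, a_j, ell_tilde_j):
    """Input-pair-specific length scale ell_j obtained by equating
    exp(-d2/ell_j^2) to the amplitude-included kernel value;
    requires 0 < a_j < 1 so that the kernel value stays below 1."""
    d2 = (t_j - t_jp) ** 2
    return np.sqrt(d2 / (d2 / ell_tilde_j ** 2 - np.log(a_j)))

# the unit-amplitude kernel, evaluated at the modulated length scale,
# reproduces the amplitude-included kernel value
t_j, t_jp, a_j, ell_tilde = 3.0, 5.0, 0.8, 1.5
ell_eff = effective_length_scale(t_j, t_jp, a_j, ell_tilde)
k_amp = se_kernel_amp(t_j, t_jp, a_j, ell_tilde)
k_mod = np.exp(-(t_j - t_jp) ** 2 / ell_eff ** 2)
```

This equating gives ℓ_j² = (t_j − t_{j'})² / [(t_j − t_{j'})² / ℓ̃_j² − ln(a_j)], an algebraically explicit version of the amplitude modulation of the length scale.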

2.7 Predicting Potential at Test Time: Motivation


• Learning parameters of the matrix Normal likelihood using the generated training set
Thus, we use the generated training data to learn the parameters of the matrix Normal likelihood—that follows from the modelling of the random vector-valued potential function as a realisation from a vector-variate, zero-mean GP. This training set includes the N_τ pairs of design indices for the time windows (namely 1, …, N_τ) and the (learnt) standardised values of this random vector-valued potential at the respective time windows (namely φ̃^(1), …, φ̃^(N_τ), respectively at design time indices 1, …, N_τ). We make Bayesian inference on the parameters of this matrix Normal likelihood, i.e. learn each element of the inter-time covariance matrix of this density, by learning each length scale hyperparameter with 95% Highest Probability Density credible regions. The inter-component covariance matrix of this density is estimated directly given the data.
• Closed-form forward prediction
Our primary interest lies in prediction of the vector-valued potential function at some test time window, indexed as N_test. We undertake such prediction of φ^(N_test) by using the closed-form prediction of the mean of the GP-generated function φ(·) at the input value of T = t_{N_test}. Then this desired prediction at the N_test-th time window follows the learning performed over each of the N_τ time windows that were included within the training data D_train−potential, and the subsequent predictions at the (N_τ + 1)-th, (N_τ + 2)-th, …, (N_τ + [N_test − N_τ − 1])-th time windows.

• Prediction period
The temporal range post the learning period is what we have so far referred to as the prediction period; from the last paragraph it is clear that while no prediction may be undertaken within the learning period, supervised learning of the potential function—modelled with a GP—carries on post the passage of the N_τ-th time window.
• Chronologically sequential forecasting
Every time forecasting is undertaken at the (j + N_τ)-th time window (j = 1, …, N_test − N_τ − 1), it is undertaken post the supervised learning of the vector-valued potential function across the temporal range covered by the first N_τ time windows, i.e. the learning period. Any such forecasting made at the (j + N_τ)-th time window avails of results of the learning undertaken at the previous time windows, by treating the potential learnt at these past times to populate the training data, based upon which the forecast at this (j + N_τ)-th time window is performed. So the forecasting that we perform is chronologically sequential, with learning allowed from results accumulated till before the test time window at which this forecasting is undertaken.


• Using past learning
When the prediction of the random vector-valued potential function is sought during the first time window within the prediction period (i.e. during the (1 + N_τ)-th time window), the length scale hyperparameters ℓ_1, …, ℓ_{N_τ}, ℓ_{N_τ+1} are all learnt. Indeed, having learnt ℓ_{N_τ+1}, and predicted φ^(N_τ+1) in the first of the time windows within this prediction period, we then predict the potential in the next time window within this period, and so on. In fact, when predicting in the (N_τ + j)-th time window, we employ all learnt+predicted information on the realisations of the random potential vector function till the (N_τ + j − 1)-th time window. To be precise, this entails using the potential vectors learnt/predicted from the 1st to the (N_τ + j − 1)-th time windows, to compute the plug-in estimates of the elements of the inter-component covariance matrix Σ_compo. On the other hand, at this time window, we learn the length scale hyperparameters ℓ_1, …, ℓ_{N_τ+j−1} anew, along with ℓ_{N_τ+j}. This reiterated message is emphasised below.

When predicting the mean and variance of the potential at the generic test time point t^(test) that defines the location of the (N_τ + j)-th time window, we undertake the learning of the length scales relevant to all previous time windows, i.e. we learn anew the length scales ℓ_1, …, ℓ_{N_τ+j−1}.

• Added demands with prediction at a new time window
Thus, with every new prediction undertaken at a new time window within the prediction period, the dimensionality of the inter-time covariance matrix Σ_time goes up by one, and the number of length scale hyperparameters of this covariance matrix increases by one. Our strategy of learning the length scale hyperparameters anew implies that the learning of one further parameter is undertaken every time prediction is sought at a new time window within the prediction period. The dimensionality of the Σ_compo matrix is N_X × N_X; this remains the same even when prediction at an additional time window is sought. However, increasing the number of time windows within the prediction period increases the sample size used to compute the plug-in estimate of the covariance between the i-th and i'-th components of the vector-valued potential function, ∀ i, i' = 1, …, N_X. Then, assuming that the variance in this estimation of the elements of the Σ_compo covariance matrix decreases with increasing sample size, it can be surmised that the usage of the empirical or plug-in estimate for the covariance matrix Σ_compo is an increasingly better approximation of the inter-component covariance structure in the vector-valued potential function, as we forecast further into the future.

2.7.1.1 Closed-Form Prediction at a New Time Window

Let us consider details of the closed-form prediction of the mean and variance of the potential random variable, at the centroid of the (j + 1 + N_τ)-th time window (i.e. the (j + 1)-th time window inside the prediction period, j = 0, 1, …). Since the potential is the output of the temporal dependence that is modelled with a GP, the mean and variance of the realisation from this GP can be predicted in a closed-form way. Let the central temporal location of the current window be referred to by the generic name of a test data point t^(test), where t^(test) ≡ t_{N_τ+j+1} here. At this test input, along with the mean of the sample functions from the vector-variate GP invoked to model the vector-valued potential random function, the variance of the same is also available [24]. This owes to the fact that the predictive distribution at a new (or test) input is Normal:

π(φ^(test) | φ(·), t^(test), ℓ_1, …, ℓ_{N_τ+j−1}, Σ_compo) = N(φ̄^(test), K(t^(test), t^(test))),

where the predicted vector-valued mean realisation of the potential, at the (N_τ + j + 1)-th time window, is

φ̄^(test) = (Σ_compo ⊗ Σ_test−time)^T (Σ_compo ⊗ Σ_time + Σ_noise)^{−1} vec(φ^(1), …, φ^(N_τ), φ^(N_τ+1), …, φ^(N_τ+j)),   (2.9)

and the variance-covariance matrix of the prediction at this (N_τ + j + 1)-th time window is

K(t^(test), t^(test)) = (Σ_compo ⊗ Σ_test−time)^(test) − (Σ_compo ⊗ Σ_test−time)^T (Σ_compo ⊗ Σ_time + Σ_noise)^{−1} (Σ_compo ⊗ Σ_test−time),   (2.10)

where
— in our model, the Kronecker product of the two covariance matrices Σ_time and Σ_compo is denoted Σ_compo ⊗ Σ_time, which is an N_X(N_τ + j) × N_X(N_τ + j)-dimensional block matrix, comprising N_X × N_X-dimensional blocks. There are N_τ + j such blocks stacked sequentially above each other, in each of N_τ + j layers. The (n, n')-th block—i.e. the n-th block in the n'-th layer—is the matrix [exp(−(n − n')² / ℓ_n²) σ̂_{m,m'}]_{m,m'=1}^{N_X}, for n, n' = 1, …, N_τ + j. Here Σ_compo is populated using the plug-in estimates of the inter-component covariance, i.e. Σ_compo = [σ̂_{m,m'}]_{m,m'=1}^{N_X}.
— vec(φ^(1), …, φ^(N_τ), φ^(N_τ+1), …, φ^(N_τ+j)) is the vector created by concatenating the N_τ + j learnt/predicted realisations of the random, N_X-dimensional, vector-valued potential function. Thus, this vector is N_X(N_τ + j)-dimensional.
— The Kronecker product Σ_compo ⊗ Σ_test−time is N_X(N_τ + j) × N_X-dimensional. It is a block matrix comprising a single block per layer—the block being an N_X × N_X-dimensional matrix—stacked on top of each other, in each of the N_τ + j layers. The block in the n-th layer is the matrix [exp(−(n − (N_τ + j))² / ℓ_n²) σ̂_{m,m'}]_{m,m'=1}^{N_X}, for n = 1, …, N_τ + j.
— The noise in the i-th component of the random vector-valued potential function is assumed Gaussian, and is parametrised by the variance (σ_noise^(i))², for i = 1, …, N_X. Such variance parameters can also, in general, be learnt in this method. Then, with Σ̃_noise denoting the N_X × N_X-dimensional diagonal matrix with diagonal elements (σ_noise^(1))², …, (σ_noise^(N_X))², the matrix Σ_noise = Σ̃_noise ⊗ I_{N_τ+j}.
— (Σ_compo ⊗ Σ_test−time)^(test) is the N_X × N_X-dimensional covariance matrix [exp(−((N_τ + j) − (N_τ + j))² / ℓ_{N_τ+j}²) σ̂_{m,m'}]_{m,m'=1}^{N_X}.
In this way, we predict the mean and variance of the vector-valued random potential function at any new time window, given the hyperparameters of the underlying vector-variate GP that we learn and estimate afresh, every time such prediction is sought. The latter learning, carried out afresh at each new time window at which the predictions are sought, is undertaken using the likelihood and prior discussed in Sect. 2.6.1. At a new time window, we undertake the closed-form prediction of the mean potential and the variance in this potential, using the GP that is learnt given the information accumulated from realisations of the potential function, till the previous time window.

2.7.1.2 Errors of Forecasting

This chronologically sequential nature of the prediction has the obvious shortcoming that an erroneous prediction made at the j-th time window within the prediction period triggers errors in predictions at the j'-th time window, ∀ j' > j, in general. Here an "erroneous" prediction at the j-th time window implies that the N_X components of the observed realisations of the vector-valued potential variable φ^(j) are s.t. the "observed" norm of the potential in the j-th time window does not lie within the interval on (Euclidean) norms of the potential vector that is predicted in this time window, i.e. the interval

[ √(Σ_{i=1}^{N_X} (−2.5 s_i^(j) + φ̄_i^(j))²), √(Σ_{i=1}^{N_X} (2.5 s_i^(j) + φ̄_i^(j))²) ].

Here s_i^(j) is the predicted standard deviation, and φ̄_i^(j) the mean, of the i-th component of the vector-valued potential that is predicted at the j-th time window. By "observed" norm of the potential in the j-th time window is implied the scaled and shifted −v_j²/2—which we will elaborate upon soon—where the observed rate V at this time window is v_j. It is of course true that the decision on whether a prediction is erroneous at the j-th time window, or not, can be made only after the observation at this time window becomes available. In other words, only at time T > t^(j) can we become aware of the prediction at the j-th time window being erroneous.
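The error criterion just described can be written down directly. In this sketch (Python with numpy; the helper and all numbers are hypothetical), the interval endpoints follow the displayed expression, sorted before the membership check since negative predicted means can reverse them; the scaling and shifting of −v_j²/2 into the "observed" norm is assumed to have been done upstream.

```python
import numpy as np

def forecast_is_erroneous(observed_norm, phi_mean, phi_sd):
    """True if the 'observed' norm of the potential at this time window
    falls outside the predicted interval of Euclidean norms
    [sqrt(sum_i (-2.5 s_i + phibar_i)^2), sqrt(sum_i (2.5 s_i + phibar_i)^2)].
    Endpoints are sorted, since negative mean components can reverse them."""
    phi_mean = np.asarray(phi_mean, dtype=float)
    phi_sd = np.asarray(phi_sd, dtype=float)
    a = np.sqrt(np.sum((-2.5 * phi_sd + phi_mean) ** 2))
    b = np.sqrt(np.sum((2.5 * phi_sd + phi_mean) ** 2))
    lo, hi = min(a, b), max(a, b)
    return not (lo <= observed_norm <= hi)

phi_mean = np.array([-0.5, -0.5])   # hypothetical predicted component means
phi_sd = np.array([0.1, 0.1])       # hypothetical predicted standard deviations
```

With these numbers, the predicted interval of norms is roughly [0.35, 1.06], so an observed norm of 0.7 is acceptable while 2.0 flags the forecast as erroneous.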

2.7.1.3 Advantage of Our Learning Strategy: Reviewed

We wrap up this section by focusing on the important issue of what lends the relatively higher accuracy to forecasts made using this approach, over existing ones. This discussion borrows from the material of Sect. 2.3, where it was noted that relatively smaller fluctuations in the potential function can give rise to more abrupt changes in the state, and that the 2nd Law offers constraints on values of the rate variable given the potential. We contextualise the same below, in light of the clearer understanding of the GP correlation structure that we have achieved since Sect. 2.3. Our aim to predict values of the location variable and the rate variable at test time points is served by the advanced methodology, by identifying the causal driver of the evolution of the system from a given state to that at the test time, i.e. the potential function, in a generic dynamical system. Thus, it is the fundamental tool for state prediction/forecasting that we offer, and we perform the prediction/forecasting using such a tool. This approach offers robustness towards state forecasting at a test time, against future unexpected variability in system dynamics. If there is any perception that our forecasting methodology merely shifts the burden of forecasting from state variables to the potential, then that is a naive reading of the work. At the t_i-th time window, the potential function for the following time window is forecast. In order for such forecasting to be undertaken, we need to update the inter-component and inter-time covariance matrices, namely Σ_compo and Σ_time respectively, with the potential that has been predicted at this time window. In reality, at any time window, it is the correlation between the potential realised at different pairs of each of its 2 inputs (time T and location X(T)) that should populate each of the two correlation matrices.
Indeed, as has been motivated above, correlation between pairs of realisations of the random potential function in two X-bins, across all relevant times, populates an element of Σ_compo. All length scale hyperparameters of the covariance kernel that parametrises Σ_time are learnt anew when making the potential forecast while at the t_i-th time window, and this learning allows for percolation of information from the past history of values of the potential vector, down into the updated version of the Σ_time matrix at this time point. A new value of the random potential vector (for the (t_i + 1)-th time step) is fed by the correlation structure—updated, via the updated kernel hyperparameters—of the GP that generates the random potential vector. Such updating of the hyperparameters materialises in our inference via the feeding in of the observed location values at this t_i-th time step, as well as via the trends in the variation of the rate. After all, the deterministic Newtonian link between the (location) gradient of the potential


and the time rate of change of the rate, provides the rate of change of the rate at this time step, as the potential at this time point is learnt. Then, knowing the current rate of change of the rate, and given that the rate is known at this time point from state computation (using the potential at this time in the Second Law), the rate at the next time point is informed upon, or constrained (as suggested in Sect. 2.3). The latter information is conjectured to render potential forecasting at the (t_i + 1)-th time step robust. The forecast potential in turn informs on the learning of the new hyperparameter and the updating of the other hyperparameters, and the cycle continues. It is conjectured that this helps acknowledge temporally-local deviations from past trends in the potential time series, rendering the forecast accurate.

2.7.2 Improved Modelling of Σ_compo?

One point that merits mention is that by resorting to learning the potential over each X-bin, we miss out on the possibility of kernel parametrisation of Σ_compo with any kernel defined with location values as inputs; this was motivated under the perception that the index of an X-bin cannot serve as a meaningful input variable. So then we were left to learn each (distinct) element of the (symmetric) covariance matrix Σ_compo directly from MCMC, or to populate any element of this covariance matrix using the sample estimate of the covariance. We choose to undertake the latter in the application delineated below. However, the index of an X-bin would definitely have a meaningful interpretation if the corresponding value of X were identified. This is difficult; it is in fact the location that is the unknown, and we cannot easily learn the unknown by using the unknown as an element of the inferential scheme. What if we retained the index of the X-bin itself, in the design of a kernel-parametrised Σ_compo? Is this index really not a meaningful input variable that the covariance kernel could be designed with? After all, the i-th index implies that the location values that live in this i-th X-bin comprise the i-ranked location interval, out of the N_X partitions of the relevant range of locations. So the distance between this interval of location values and those that live in the i'-th X-bin is a meaningful distance, and a kernel parametrisation of Σ_compo using such a distance as the input to the kernel is conceivable. The same is suggested as a future exercise. So we agree that the correlation between the different components of the random potential vector needs to be better captured than by the plug-in estimate that we currently use.
The concern with such modelling of the inter-component covariance matrix grows less acute as the number of time windows that the potential is learnt and predicted during increases, since the unbiased plug-in estimate of an inter-component correlation improves as an approximation to the correlation between any pair of components, with increasing sample size that the estimate is computed using. It is possible to undertake the direct learning as well, as long as the dimensionality of Σ_compo does not exceed ∼10 × 10.
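A minimal sketch of the plug-in route currently adopted (Python with numpy; the "true" inter-component covariance and the stand-in draws are hypothetical), illustrating that the unbiased sample estimate of Σ_compo tightens as the number of contributing time windows grows:

```python
import numpy as np

def plugin_sigma_compo(phi_history):
    """Unbiased sample estimate of the inter-component covariance, from
    the potential vectors learnt/predicted over the time windows so far;
    rows index time windows, columns index X-bins."""
    return np.cov(np.asarray(phi_history), rowvar=False, ddof=1)

rng = np.random.default_rng(1)
true_cov = np.array([[1.0, 0.6],
                     [0.6, 2.0]])        # hypothetical inter-component covariance
L = np.linalg.cholesky(true_cov)
draws = (L @ rng.standard_normal((2, 2000))).T   # stand-in potential history

err_few = np.linalg.norm(plugin_sigma_compo(draws[:5]) - true_cov)
err_many = np.linalg.norm(plugin_sigma_compo(draws) - true_cov)
```

The Frobenius error of the estimate computed from 5 windows is, with overwhelming probability, larger than that computed from the full history, mirroring the argument above.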


2.8 Illustration: Forecasting New Rate and Number of COVID19 Infections

The forecasting method discussed above is implemented to forecast the daily new infection rate, and the number of such infections, by learning the (potential) function that triggers the evolution in the location variable X ∈ R≥0. Here, this location variable X bears information on the logarithm of the daily new infection numbers in eight countries, added in quadrature, where these 8 countries are: UK, France, Germany, Italy, Spain, South Korea, USA and Brazil, and where the variables representing the daily new number of reported COVID19 infections are respectively referred to as X_UK, X_France, X_Germany, X_Italy, X_Spain, X_SouthKorea, X_USA and X_Brazil. So, the location variable X is defined as the logarithm of the daily new number of infections in these eight countries, added in quadrature, i.e.

X(T) := 0.5 log(X_UK²(T) + X_France²(T) + X_Germany²(T) + X_Italy²(T) + X_Spain²(T) + X_SouthKorea²(T) + X_USA²(T) + X_Brazil²(T)).
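The definition of X(T) translates directly into code. A small sketch (Python with numpy; the helper name and the counts are hypothetical):

```python
import numpy as np

def location_variable(daily_counts):
    """X(T): log of the quadrature sum of the eight countries' daily new
    infection counts, i.e. 0.5 * log(sum of squared counts)."""
    c = np.asarray(daily_counts, dtype=float)
    return 0.5 * np.log(np.sum(c ** 2))

counts = [100.0] * 8   # hypothetical: 100 new infections reported in each country
x_t = location_variable(counts)
```

Equivalently, X(T) is the log of the Euclidean norm of the vector of daily counts.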

So we forecast X(T) at any given time T = t^(test), after forecasting the rate variable V(T) := dX(T)/dT at this test time t^(test). We undertake the forecasting by learning the effective potential φ(X(T), T) that dictates the evolution of this X(T), as delineated within the discussion of Step II of the implementation of this methodology, in Sect. 2.6. We would emphasise that there was no particular reason behind the choice of these countries—the data for which we started using from around February of 2020. We could as well have used the data from any single country, but felt that the added inhomogeneities in the intra-data correlation structure would pose a better challenge, and therefore selected some countries that were relevant, and others that were considered to potentially contribute significantly to the data, at the time when we began our data selection. The aim is to demonstrate that our forecast method works. To undertake forecasting of the daily new infection numbers in any country, the same methodology that is used here can be invoked, with data for that country.

2.8.1 Data

Data for the learning is inclusive of the observed values of X_m over this learning period, with the string-valued index m = UK, France, Germany, Italy, Spain, USA, South Korea, Brazil. Such data is taken from the WHO Dashboard between 26th February, 2020 and 21st December, 2020, and thereafter from the Worldometer sites for each relevant country, for days between 22nd December, 2020 and 31st December, 2020. Due to the lack of regular updating of the data on the WHO dashboard around the last 10 days of 2020, the source of data was switched to Worldometer for this period. Thereafter, we reverted to the WHO dashboard for data on each country of interest. The data is available from the following site: https://covid19.who.int/WHO-COVID-19-global-data.csv
• In this application to the forecasting of the rate and number of daily new infections of COVID19, we use N_X = 20, N_τ = 13.
• The learning period lasts from the 26th of February, 2020 to the 15th of September, 2020 in this application.
• We report forecasting every 15th day (as our time windows are each 15 days long, at the centroid temporal location of which we make our forecast).
• Such fortnightly forecasting is undertaken between the 16th of September, 2020 and the 17th of February, 2021, i.e. over 11 time windows. (Forecasting into the middle of 2021 with this method is reported by Chakrabarty, under review.)
• Thus, forecasting is undertaken at 11 time windows, following the end of the learning period.
• However, it merits mention here that our method of augmentation of data, using information available till the previous time window, implies forecasting at the (13 + j)-th time window using uncertainty-included values of the potential that are learnt/predicted till the (13 + j − 1)-th time window; j = 1, 2, ….

2.8.2 What Is "Potential" in This Application?

Following on from the interpretation of the generalised effective potential that is motivated above in Sect. 2.1.6, here, by "potential" of the system that is in its current state, is implied the linearly transformed effort that it takes to raise the value of X that marks the current state of the system. The population is least disturbed, i.e. is at its most relaxed state, when it is devoid of COVID19 infections, for a zero change in X. As X increases upward of zero, the population gets increasingly more disturbed, s.t. its potential is no longer the minimum. This "most relaxed" state is considered to have taken place on the 25th of February 2020, which is the day before our learning period commences, i.e. the day previous to the temporal location of the left edge of the first time window. As we enter the learning period, with every passing day, the system is diversely disturbed, though compared to the 25th February it is not any less disturbed, i.e. the potential on any day from the beginning of the learning period (26th February to 15th September, 2020) is higher than the minimum potential φ_min that is attained on the 25th February, 2020. At the same time, when such "disturbance" is maximal, there is no uninfected population left in the considered countries. Then attainment of a state that bears further "disturbance" is not possible, where further "disturbance" implies an increase in the value of X(T) over the value attained when the whole population in all 8 countries, at the considered time, is infected. Thus the system is maximally-disturbed at a given time, when the whole population in each of the considered 8 countries, at that time, is infected. With our interest here in forecasting the rate and (log of) new number of infections (X), we clarify that this variable X is sought on the existing uninfected population, at a test time point (in the future). Thus, the maximal disturbance that just strips away the base uninfected population is not one that we assume attained. That maximal potential is denoted φ_max, which is set to 0. Then the effective potential that we define is shifted by φ_max and scaled by −φ_min, as motivated earlier, in Sect. 2.1.6. This implies that for us, such an effective potential variable takes values in [−1, 0). We typically drop the "effective" adjective from the description of the sought potential.
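One reading of the shift-and-scale just described, sketched as a hypothetical helper (Python; assumes φ_max = 0 > φ_min, as in this application):

```python
def effective_potential(phi_raw, phi_min, phi_max=0.0):
    """Shift the raw potential by phi_max and scale by -phi_min, so that
    raw values in [phi_min, phi_max) land in [-1, 0); a hypothetical
    reading of the linear transformation described in the text."""
    return (phi_raw - phi_max) / (-phi_min)

# at the most-relaxed state (25th February 2020), phi_raw = phi_min,
# and the effective potential attains its minimum of -1
```

Under this reading, the minimum raw potential maps to −1, and the (unattained) maximal-disturbance potential φ_max = 0 is the excluded upper endpoint.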

2.9 Negative Forecast Potential, etc.

Referring back to the discussion of the forecast rate and potential (Sect. 2.1.6), we recall that the value v_test of the rate forecast at a test time window that is located at time t^(test), is defined in terms of the potential vector φ^(test) = (φ_1^(test), …, φ_{N_X}^(test))^T forecast at this test time, according to the relation:

v_test²/2 = α_0(−φ_1^(test) − … − φ_{N_X}^(test)) + α_1.

We refer to the RHS of this last equation as the "negative forecast potential", that is forecast at the test time t^(test). In other words, the "negative forecast potential" is

α_0(−φ_1^(test) − … − φ_{N_X}^(test)) + α_1.

One way to check the accuracy of the forecast made at any test time would be to compare this "negative forecast potential" to the empirically observed value of half the squared rate, at a given test time point.

The empirically-observed rate can be compared to the square root of twice the "negative forecast potential", at a given test time window, i.e. the empirically observed V = v_test can be compared to the "forecast rate":

√( 2[α_0(−φ_1^(test) − … − φ_{N_X}^(test)) + α_1] ).
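The "forecast rate" above is a one-liner. The sketch below (Python with numpy; the values of α_0, α_1 and the forecast potential components are hypothetical) also exposes the "negative forecast potential" so that it can be compared with the observed v²/2:

```python
import numpy as np

def negative_forecast_potential(phi_test, alpha0, alpha1):
    """alpha0 * (-phi_1 - ... - phi_NX) + alpha1, for the forecast
    potential vector phi_test at a test time window."""
    return alpha0 * (-np.sum(phi_test)) + alpha1

def forecast_rate(phi_test, alpha0, alpha1):
    """Square root of twice the negative forecast potential; comparable
    with the empirically observed rate v_test."""
    return np.sqrt(2.0 * negative_forecast_potential(phi_test, alpha0, alpha1))

phi_test = np.array([-0.5, -0.5, -0.5, -0.5])   # hypothetical, components in [-1, 0)
rate = forecast_rate(phi_test, alpha0=1.0, alpha1=0.0)
```

Since the (effective) potential components are negative, the negative forecast potential is non-negative for α_0 > 0, α_1 ≥ 0, so the square root is well defined.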


The location variable X at this test time point t^(test) can also be forecast, as the integral of the "forecast rate" with respect to time. Once we learn the arbitrarily shifted and scaled potential in any time window over which the system evolution is autonomous Hamiltonian, we then linearly transform such a learnt potential. The choice of the scale and shift in this transformation is such as to ensure that half the squared value of the empirically-observed value v_{N_τ+1} of the rate variable V, in the first time window within the prediction period, matches α_0(−φ_1^(N_τ+1) − … − φ_{N_X}^(N_τ+1)) + α_1, and that the location X matches the integral of the rate from t = 0 to the time point that marks the first time window within the prediction period. Here N_τ = 13; N_X = 20 in our application, and the norm of the rate vector at any test time t^(test) is given by √(2[α_0(−φ_1^(test) − … − φ_{N_X}^(test)) + α_1]), as mentioned in Eq. 2.4 within Sect. 2.4.4, where we discuss Step III. Thus, by matching v_{13+1}²/2 to α_0(−φ_1^(13+1) − … − φ_20^(13+1)) + α_1, and x_{13+1} to the time-accumulated rate till the current time, we are ensuring that the α_0 and α_1 values are identified s.t. the shifted+scaled potential forecast at this time window matches the empirically "observed potential". Then at all other test time points, the negative forecast potential—informed by these identified α_0 and α_1—is compared against the "observed potential" (v^(test))²/2.

2.9.1 Implementing the 3-Step Learning+Forecasting in this Application

As discussed above, we model the random potential function in the j-th time window as a vector-valued random variable φ^(j) with N_X components, where j = 1, …, N_τ, …, and where the first N_τ = 13 time windows comprise the learning period. We learn the components φ^(1), …, φ^(13) of φ in Step I, by embedding the potential in the definition of the support of the pdf of the phase space variables X, V. Learning of the potential at each time window during the learning period is described in Sect. 2.4.2 above. The MCMC-based inference is undertaken using a Metropolis-within-Gibbs algorithm, to update the potential first in any iteration, and then to update the phase space pdf at that iteration, at the updated potential. A second chain is then run, with stronger priors on the potential parameters, where said priors are centred at the modal potential values that are identified in the previous chain; this chain performs improved inference on the pdf. Below, we discuss the motivation for undertaking these two chains. By learning φ^(1), …, φ^(13), we generate the originally-absent training data set that enables the supervised learning of the relationship of the potential with time, and with the location variable X. In other words, we have now generated the training set D_train−potential = {(t_1, φ^(1)), …, (t_13, φ^(13))}, to allow for the learning of the temporal dependence of the potential vector in Step II. This in principle permits the prediction of the potential at a test time t^(test) (Sect. 2.7). Learning of


the potential, by modelling it as a random realisation from a (zero-mean) GP, is discussed above in Sect. 2.4.3. We perform the learning and estimation of the two GP covariance matrices Σ_time and Σ_compo, by first standardising every component of φ^(j) at the design time point t_j (∀ j = 1, …, N_τ). We use the likelihood as in Eq. 2.7, in which we estimate each element of the covariance matrix Σ_compo using the unbiased estimator of the covariance between vectors comprising observations of any two components of the potential vectors learnt at a pair of design time points. This follows the method discussed in Sect. 2.7.1. We learn the covariance matrix Σ_time as kernel-parametrised, with the hyperparameters of this covariance kernel identified as the unknown length scales ℓ_1, …, ℓ_{N_τ} that we intend to learn, given the training data D_train−potential that we have now generated. We employ Normal priors on each of our unknowns, with chosen means and variances for each prior density that we employ. Then, by inserting the likelihood (as in Eq. 2.7) and the chosen priors into Bayes rule, we write down the joint posterior probability density of the unknowns given the data D_train−potential. We perform Bayesian inference from this joint posterior, using Metropolis-within-Gibbs. Sampling using such an MCMC technique then allows for the identification of the marginal posterior of each unknown given the data, allowing for the learning of the 95% Highest Probability Density credible regions (HPDs) on each unknown. Having learnt the underlying GP that generates the vector-valued potential random variable, using the training data accumulated till the N_τ-th time window, we predict the mean potential vector at the test time window that occurs at time t^(test), and also predict the variance in the realisations of the potential at this time window.
The interval of potential vectors ranging from −2.5 times this predicted standard deviation to +2.5 times the same, about the predicted mean, is offered as the uncertainty on the predicted mean potential. In fact, we first make the predictions at the (N_τ + 1)-th time window, and augment our training data thereafter by this predicted mean potential, embedded within the predicted uncertainty. Thus, the training data for the forecasting at the (N_τ + 2)-th time window includes each of the N_τ pairs of design time points and the corresponding learnt potential, followed by the time point t_{N_τ+1} and the predicted potential φ^(N_τ+1). In this way, the augmented training data

{(t_1, φ^(1)), …, (t_{N_τ}, φ^(N_τ)), (t_{N_τ+1}, φ^(N_τ+1)), …, (t_{N_τ+j−1}, φ^(N_τ+j−1))}

is employed to allow for the learning—in the j-th time window within the prediction period—of the length scale hyperparameters ℓ_1, …, ℓ_{N_τ}, ℓ_{N_τ+1}, …, ℓ_{N_τ+j} of the kernel that parametrises Σ_time. This permits the forecast of the potential variable at the j-th time window within the prediction period. Indeed, it is such learning of Σ_time at the (N_τ + j)-th time window, along with the estimation of Σ_compo at that time window, that allows for the closed-form prediction of the mean and variance of each of the N_X components of the random potential vector at this time window. Upon the prediction of the mean and the variance of the potential vector at any time window, the equivalent of Newton's Second Law is invoked within Step III
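The augment-then-forecast cycle can be sketched with a deliberately simplified scalar GP standing in for the vector-variate machinery of Eqs. (2.9) and (2.10) (Python with numpy; the length scale, noise level and the synthetic "learnt" potential series are all hypothetical):

```python
import numpy as np

def gp_mean_predict(t_train, y_train, t_new, ell=2.0, noise=1e-4):
    """Closed-form GP posterior mean at t_new, under an SE kernel; a
    scalar stand-in for the vector-variate prediction of Eq. (2.9)."""
    K = np.exp(-(t_train[:, None] - t_train[None, :]) ** 2 / ell ** 2)
    K += noise * np.eye(len(t_train))
    k = np.exp(-(t_train - t_new) ** 2 / ell ** 2)
    return k @ np.linalg.solve(K, y_train)

N_tau, N_forecast = 13, 4
t = np.arange(1, N_tau + 1, dtype=float)
y = np.sin(0.3 * t)                  # synthetic stand-in for the learnt potential series

for j in range(1, N_forecast + 1):   # chronologically sequential forecasting
    t_new = float(N_tau + j)
    y_new = gp_mean_predict(t, y, t_new)
    t = np.append(t, t_new)          # augment the training data with the
    y = np.append(y, y_new)          # forecast just made, then move on
```

Each pass through the loop mirrors one step of the sequential scheme: the forecast made at a window joins the training set used for the next window.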

78

2 Forecasting by Learning Evolution-Driver

to compute the rate variable V and the location variable X, at this time. The methodology for this is discussed in Sect. 2.4.4. The potential learning within the learning period is undertaken using a numerical code that is written in CWeB. The first stage, which employs the Metropolis-within-Gibbs algorithm to learn both the potential and the phase space pdf, takes about 210 minutes for a converged chain comprising 1.8 million iterations to be achieved. When learning the pdf alone, at a fixed potential, convergence takes about 90 minutes, where the converged chain again includes 1.8 million iterations.
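The augment-then-forecast cycle described above—predict at the next window from the GP conditional, append the predicted mean to the training set, repeat—can be sketched as follows for a scalar (one-component) potential; the kernel, jitter and data are toy stand-ins for the book's vector-variate setup.

```python
import math

# Sketch of the augment-then-forecast cycle for a scalar potential:
# predict at the next window from the GP conditional, append the
# predicted mean to the training set, repeat.  Toy values throughout.

def solve(A, b):
    """Gaussian elimination with partial pivoting, solving A x = b."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k]
                              for k in range(r + 1, n))) / M[r][r]
    return x

def kern(s, t, ell=1.0):
    return math.exp(-((s - t) ** 2) / (2.0 * ell ** 2))

def gp_predict(ts, ys, t_new, jitter=1e-9):
    """Closed-form GP conditional mean and variance at t_new."""
    n = len(ts)
    K = [[kern(u, w) + (jitter if i == j else 0.0)
          for j, w in enumerate(ts)] for i, u in enumerate(ts)]
    kv = [kern(u, t_new) for u in ts]
    alpha = solve(K, ys)                       # K^{-1} y
    wts = solve(K, kv)                         # K^{-1} k_*
    mean = sum(kv[i] * alpha[i] for i in range(n))
    var = kern(t_new, t_new) - sum(kv[i] * wts[i] for i in range(n))
    return mean, max(var, 0.0)

ts, ys = [0.0, 1.0, 2.0], [-0.9, -0.8, -0.7]   # invented design data
for step in range(1, 4):
    m, v = gp_predict(ts, ys, 2.0 + step)
    ts.append(2.0 + step)                      # augment the training set
    ys.append(m)                               # with the predicted mean
```

Each pass through the loop plays the role of one forecasting window: the newly predicted mean becomes part of the design data for the next window, exactly as in the augmentation scheme described in the text.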

2.9.2 Few Technical Issues to Note in This Empirical Illustration

In this section, we draw attention to a few disjoint—but important—technical points that are relevant to this application.

How Long Does the Potential Forecasting Take?
Forecasting at the (13+j)-th time window implies that the Σ_time covariance matrix has a dimension of (13+j) × (13+j), and this involves the learning of the length scale hyperparameters ℓ_1, ℓ_2, …, ℓ_{13+j}. As relevant to this forecasting exercise, estimation of any element of the Σ_compo covariance matrix uses a pair of samples, each of size 13+j. The learning and estimation of the updated covariance matrices, and the subsequent prediction of the mean and variance of each component of the 20-dimensional potential vector, take about 2 minutes on an Intel(R) Core(TM) i5-5300U CPU at 2.30 GHz laptop, using the numerical code written in C++. We employ the Random Walk Metropolis-Hastings algorithm to generate posterior samples from the joint posterior of the N_τ+j length scale hyperparameters of the covariance kernel that parametrises the Σ_time covariance matrix. This posterior is formulated using the matrix Normal likelihood—which results from modelling the temporal dependence of the potential vector as a random realisation from a vector-variate GP—and Normal priors on each sought length scale hyperparameter. Each such hyperparameter is proposed from a Normal density with an experimentally chosen variance. Above, j = 1, 2, …. In a typical chain, 0.3–0.5 million samples are considered for inference on the unknowns, after we run the chain for about 1 million iterations.
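A minimal sketch of such Random Walk Metropolis-Hastings updating of a length scale hyperparameter is given below; the toy target (a Normal posterior truncated to positive length scales) merely stands in for the matrix-Normal-likelihood times Normal-prior posterior of the text, and all numbers are invented.

```python
import math, random

# Minimal Random Walk Metropolis-Hastings sketch for one length scale
# hyperparameter; the target below is a toy stand-in for the actual
# posterior described in the text.

random.seed(1)

def log_post(ell):
    if ell <= 0.0:
        return -math.inf                  # length scales must be positive
    return -0.5 * ((ell - 1.5) / 0.2) ** 2   # toy truncated-Normal target

def rwmh(n_iter=20000, step=0.3, start=1.0):
    chain, cur, lp = [], start, log_post(start)
    for _ in range(n_iter):
        prop = cur + random.gauss(0.0, step)   # symmetric Normal proposal
        lp_prop = log_post(prop)
        if math.log(random.random()) < lp_prop - lp:
            cur, lp = prop, lp_prop            # accept
        chain.append(cur)
    return chain

chain = rwmh()
post_mean = sum(chain[5000:]) / len(chain[5000:])   # discard burn-in
```

Because the proposal is symmetric, no Hastings correction is needed here; proposals at non-positive length scales are rejected automatically through the −∞ log-posterior.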
Erroneous Prediction at the j-th Time Window Suggests a Way to Improve Forecasting, Which We Discard
It is anticipated that forecasting at a time window can be incorrect—in the sense that the empirically observed rate variable V lies outside the forecast uncertainty interval [ṽ − 2.5σ_V, ṽ + 2.5σ_V] that is symmetrically placed about the expected prediction ṽ on the value of V at this time window, where σ_V is the predicted standard deviation of the rate variable V. If at the (N_τ+j)-th time window such an incorrect forecasting is noted—in particular, an over-prediction is performed at this (N_τ+j)-th time

2.9 Negative Forecast Potential, etc.


window—then at all subsequent time windows, forecasting could be envisaged to be performed with the lower limit of the predicted value of the rate, i.e. using the value ṽ − 2.5σ_V in place of ṽ. On the other hand, if we under-predict at the (N_τ+j)-th time window, then at all subsequent time windows, forecasting could be performed with the higher limit of the predicted value of the rate, i.e. using the value ṽ + 2.5σ_V instead of ṽ. However, the above suggestions for improving the accuracy of forecasting are concluded to be ad hoc. There is no quantification of the excess/deficiency in the value of the potential prediction made at the i-th X-partition during the last time window, ∀i = 1, …, N_X. There is absolutely no guarantee that all N_X components of the inaccurately predicted potential vector can be adjusted by the same additive term. More to the point, such reliance on gathering information at every past time point—including at the last time window—stands to limit the scope of our forecast methodology. So we have decided not to try to correct past inaccuracies of forecast. We provide the forecast potential, rate and location variable values at every time window, along with the corresponding uncertainties on each.

Uncertainties in Learning the Potential Parameters Affect the Learning of the pdf
Forecasting proceeds at a given time window by using data on the potential that is accumulated till the previous time window, and forecasting occurs during the prediction period that follows the end of the learning period. During the learning period, the random potential function is discretised into the vector-valued potential random variable, using N_X = 20 components of this random vector. Such potential learning is undertaken by embedding the sought potential in the definition of the support of the pdf of the phase space variables, X and V.
The pdf is itself an unknown function of the system energy, and is modelled as a vector-valued random variable f, ascribed N_ε = 9 components: f_1, …, f_9. Thus, learning the potential at the j-th time window reduces to the exercise of learning the parameters φ_1^(j), …, φ_20^(j), while learning the pdf reduces to learning the parameters f_1^(j), …, f_9^(j). Here, the energy is the sum of the potential and V²/2. Now, the learning of the pdf needs to follow the learning of the potential, since only once the potential is learnt can the support of the pdf be informed upon. Thus, uncertainties in the learning of the potential directly trigger uncertainties in the values that the energy variable attains. In other words, uncertainty-included learning of the (output) pdf is undertaken at uncertainty-included values of the corresponding (input, or) domain variable. It then follows that when Bayesian inference is undertaken to learn values of the parameters f_1, …, f_9 (the components of the pdf vector f), iterative realisations of these will not converge. On the other hand, inference on the φ_1, …, φ_20 parameters is much more stable. This situation motivates that once we learn the potential in a chain, we use the modal learnt value of a potential component variable as the mean of the prior density on that variable in the following chain, and


use stronger priors on the potential components than on any component of the f vector in this chain. In all chains, we perform the learning of the components of the potential vector and those of the pdf vector using the Metropolis-within-Gibbs algorithm for the inference. We propose the non-positive φ_1, …, φ_20 parameters from a Truncated Normal density that is right-truncated at 0, while we propose the non-negative f_1, …, f_9 parameters from a Truncated Normal density that is left-truncated at 0. We update the φ_1, …, φ_20 parameters in the first block of any iteration of a chain run with this MCMC algorithm, and at the updated values of these parameters, we update the f_1, …, f_9 parameters in the same iteration of the chain. We obtain the marginals on each unknown parameter directly from our MCMC implementation, and use these to learn the 95% HPDs on each parameter. Traces of each component of the vectorised potential and pdf variables are monitored for trendlessness, and the inferred values are used to construct histograms of each learnt parameter. Trendlessness in the traces translates to unimodal, bell-shaped histograms. Any such histogram is an approximation of the marginal posterior probability density of that parameter, given the data used in the learning.

Choosing the Binning of X and ε
In this application we use N_X = 20 X-bins and N_ε = 9 ε-bins in partitioning the relevant ranges of values of the location variable X and the energy variable ε, respectively. The reasons behind choosing fixed values of the parameters N_X and N_ε—instead of learning these values from the data—are now discussed. In fact, we exploit the available data to lead us to the values of N_X and N_ε that we input into the learning at Step I. We do this while appreciating that retaining N_X and N_ε as unknowns in the model would imply the need for inference on spaces of varying dimensionality.
Indeed, undertaking MCMC-based inference when the number of unknown parameters varies from one iteration to the next is notoriously cumbersome. After all, N_X determines the dimensionality of the random potential vector, and N_ε is the dimensionality of the vectorised pdf. So instead of invoking Reversible Jump MCMC, we opt for holding these parameters as constants, the values of which are informed upon by the data at hand. At the outset, we remind ourselves that one constraint that these choices have to abide by is that every X-bin must be populated by at least one datum, and similarly, no ε-bin should be empty. Also, it is clear that the higher the value of N_X, the better the approximation—at a time window within the learning period—of the random potential function by the vectorised potential variable. Similarly, the higher N_ε is, the better the approximation of the pdf by the vector-valued random variable f. At the same time, we appreciate that working with an arbitrarily high value of these parameters will slow down the computation of our MCMC-based inference on all the components of these unknown vector-valued variables. With this information in the background, we wished for N_X to be at least 15 in our work, as we wished for the potential not to be the same across the distinct days that comprise any time window. After all, each time window in our work consists of 15 days, and we are aware that


the empirically observed value of X changes at time scales no shorter than a day. Thus, there are (in general) 15 distinct values of X within any given time window. So using the same X value across two or more days of a time window might mean that the daily observational variation in X is being smoothed over, or ignored. We wished to avoid any such condensation (or loss) of available empirical information. Indeed, we could have proceeded with 15 X-bins in our model. An attempt to introduce greater resolution—in order to improve the approximation of the potential function by the discretised potential vector—led us to the rounded value of 20 for N_X, which was simultaneously noted not to decelerate computing speed to convergence. The choice of N_ε was motivated by the need to keep every ε-bin occupied. Thus, for any arbitrary choice of N_ε, the population of each realised ε-bin is checked at each time window. To accomplish this, the seed φ vector is used to define the energy ε in each time window (in terms of half the squared rate variable and the φ vector, as per Eq. 2.2). At the choice of N_X = 20, a choice of N_ε = 9 ensures population of every ε-bin. Upon the undertaking of the MCMC chains for each time window within the learning period, the achieved φ vectors were employed to check again for the population of every ε-bin, in every time window; N_ε = 9 does the job minimally. It is true that we could have worked with a larger number of bins, and that the examination of the phase space pdf needs greater work; such an ambition will need the support of a better-resolved energy domain.

Choosing N_τ and the Temporal Width of a Time Window
Naturally, the smaller the period of time during which we assume the system dynamics to be constrained—as autonomous Hamiltonian—the better is the modelling that we adopt, from the point of view of the generality of our treatment.
After all, the assumption that the system is autonomous Hamiltonian within a time window implies that the temporal variation of the phase space pdf obeys Liouville's Theorem as time changes within that window, s.t. the total time derivative of this pdf is 0 within any time window. However, there can be a non-zero change with time between the phase space pdf in one time window and that in another. So we will retain our modelling strategy as more general, and thereby commit less error in our application, if we choose relatively shorter time windows. From this point of view, a 15-day-long time window is perhaps not the best choice for this application. Multiple news reports in the media, and those advanced by the WHO, PHE and other national and international healthcare organisations, suggest a 7-day averaging of the daily new number of infections. That 7-day average is as rough a choice of the width of the time window as is the 15-day average. Ultimately, we want to set this width to the maximum time period over which deviation from an explicit time-independence of the potential has been absent during the learning period; such a time period is at the same time desired to be one during which any deviation from the system being Hamiltonian is avoided. Indeed, this application did not undertake an in-depth analysis of this period, and uses a chosen width to undertake the 3-staged learning + forecasting. This is one of the aspects of this application that will be improved upon in future versions. Of course, it is appreciated that the shorter the width of any time window, the larger is the number of design points in the [0, t_L] period. So using a width of 15 days was a compromise within the current application.
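The ε-bin occupancy check described in the preceding subsection—recompute the energies ε = φ + V²/2 from a φ vector and the observed rates, then verify that no bin is empty for a candidate N_ε—can be sketched as follows; all values here are invented.

```python
# Sketch of the bin-occupancy check: form energies eps = phi + v^2/2
# from (invented) potential values and rates, and verify that no
# eps-bin is left empty for a candidate number of bins.

def every_bin_occupied(energies, n_bins):
    lo, hi = min(energies), max(energies)
    width = (hi - lo) / n_bins or 1.0        # guard the degenerate case
    occupied = {min(int((e - lo) / width), n_bins - 1) for e in energies}
    return len(occupied) == n_bins

phis = [-0.9, -0.8, -0.5, -0.3, -0.95, -0.6]   # invented potential values
vs = [0.2, 0.5, 1.0, 1.3, 0.1, 0.8]            # invented rates
energies = [p + v * v / 2.0 for p, v in zip(phis, vs)]
```

In the application itself, the largest N_ε that passes this check for every time window is retained, mirroring the statement that N_ε = 9 "does the job minimally".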


We chose the learning period to range from the end of February 2020 to the beginning of September 2020, driven by national statistics on the daily new infection numbers in the UK. Such numbers had started attaining non-zero values around the end of February 2020, and peaked by the second quarter of 2020; indeed, the numbers (prematurely) appeared to be on the wane by September 2020. Hence the choice of this learning period.

Errors in the Data?
In the bespoke learning of the potential variable at the design time points (corresponding to the centroids of the chosen time windows), it is possible within the purview of this method to take noise in the data into consideration. In our application, such data includes the values of the state space observable X, which by definition is the logarithm of the daily new numbers of infection of COVID19, reported in each of 8 different countries, added in quadrature. Noise in the data was not relevant in this application: errors in the reporting of the daily new number of infections in any of these 8 countries were retrospectively adjusted in the reported number, and/or the same were accounted for in future daily infection numbers. Moreover, when we utilised the available data on daily new numbers of infection for each of the 8 countries—either from the WHO dashboard, or from the Worldometer site—these data sets were presented, and used in the work, bereft of reported errors. However, in other applications it will be very much possible to include errors or noise in the data that is used to learn the potential during any time window. This is done by identifying the probability density of such an error variable, and convolving it, at every iteration of the Metropolis-within-Gibbs algorithm, with the likelihood of the model parameters given the data, subsequent to the updating of the relevant parameters in that iteration.
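The two-block Metropolis-within-Gibbs updating described earlier in this section—non-positive potential components proposed from right-truncated Normals, non-negative pdf components from left-truncated Normals—can be sketched as below. The Hastings correction for the asymmetric truncated proposals is included; the log-posteriors are toy stand-ins, not the book's likelihood, and all numbers are invented.

```python
import math, random

# Sketch of two-block Metropolis-within-Gibbs with truncated-Normal
# proposals, in the spirit of the text.  Toy targets throughout.

random.seed(7)
Phi = lambda z: 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def draw_trunc(c, s, non_positive):
    # rejection-sample N(c, s^2), truncated to x <= 0 or to x >= 0
    while True:
        x = random.gauss(c, s)
        if (x <= 0.0) if non_positive else (x >= 0.0):
            return x

def mh_step(cur, s, non_positive, log_post):
    prop = draw_trunc(cur, s, non_positive)
    # Hastings correction for the asymmetric truncated proposal: the
    # normaliser is Phi(-c/s) on (-inf,0], and Phi(c/s) on [0,inf)
    if non_positive:
        log_q = math.log(Phi(-cur / s)) - math.log(Phi(-prop / s))
    else:
        log_q = math.log(Phi(cur / s)) - math.log(Phi(prop / s))
    if math.log(random.random()) < log_post(prop) - log_post(cur) + log_q:
        return prop
    return cur

# toy targets: phi ~ N(-0.9, 0.1^2) restricted to phi <= 0,
#              f   ~ N(2.0, 1.0)    restricted to f >= 0
phi, f, phis, fs = -0.5, 1.0, [], []
for it in range(6000):
    phi = mh_step(phi, 0.1, True, lambda x: -0.5 * ((x + 0.9) / 0.1) ** 2)
    f = mh_step(f, 0.5, False, lambda x: -0.5 * (x - 2.0) ** 2)
    if it >= 1000:
        phis.append(phi)
        fs.append(f)
```

The first block updates the potential-like parameter and the second block updates the pdf-like parameter at the already-updated value, as in the scheme described in the text.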
Phase Space Probability Density Function Is Not Likely to Be Truncated Normal or Skew Normal
In our application we learnt the vectorised phase space pdf as the f vector, but did not proceed thereafter to learn the full functional form of the pdf. This could always have been done—and will be considered in a future extension of this application—by modelling the sought f(ε) function as a random realisation from a GP, which we would have learnt using the training set {(ε_j, f_j)}, j = 1, …, N_ε. Secondly, it needs to be appreciated that we had no prior information on the shape of the phase space pdf, i.e. on how the density varies within a given time window with the system energy ε (which is given by the sum of the potential and half the squared rate variable). So we used only non-informative priors on each component of the f vector, and consciously eschewed levying any chosen correlation on the interplay between the value of the pdf at one value of the system energy variable and the density at another value of energy, within a given time window. Indeed, results on the vectorised pdf learnt at the different time windows within the learning period indicate the possibility that the pdf may not concur with a truncated Normal or a skew Normal, but could manifest secondary modes, in addition to a bigger mode.


Lastly, we mention that the updating of the components f_2, …, f_{N_ε} of the vectorised pdf f is undertaken in the 2nd block of the Metropolis-within-Gibbs algorithm, with f_1 computed s.t. the (trapezium-rule representation of the Riemann sum of the) integral of the pdf over all energies results in unity, at every update.
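The normalisation step just described—solving for f_1 so that the trapezium-rule integral of the vectorised pdf equals unity at every update—can be sketched as follows on a uniform energy grid; the grid spacing and the proposed f_2, …, f_5 values are invented.

```python
# Sketch of the f_1 normalisation: on a uniform energy grid of
# spacing h, the trapezium rule gives
#   h * (f_1/2 + f_2 + ... + f_{N-1} + f_N/2) = 1,
# which is solved for f_1 at every update.  Values are invented.

def solve_f1(f_rest, h):
    """f_rest holds the proposed f_2..f_N; return the f_1 that makes
    the trapezium-rule integral of (f_1, *f_rest) equal to 1."""
    interior = sum(f_rest[:-1])          # f_2..f_{N-1} carry weight h
    return 2.0 * (1.0 / h - interior - f_rest[-1] / 2.0)

h = 0.1
f_rest = [1.0, 1.2, 1.1, 0.9]            # invented proposals for f_2..f_5
f1 = solve_f1(f_rest, h)
grid = [f1] + f_rest
integral = h * (grid[0] / 2 + sum(grid[1:-1]) + grid[-1] / 2)
```

In an actual chain one would additionally reject any update for which the implied f_1 comes out negative, since the pdf components are constrained to be non-negative.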


2.9.3 Collating the Background

In this application,
• learning of the potential vector occurs between 26/02/2020 and 15/09/2020, s.t. if 26/02/2020 is day 1, then the width of the learning period is t_L = 203 (in days);
• in the results presented in this book, forecasting of the value of X(T) is undertaken over the time period starting on 16/09/2020 and ending on 17/02/2021;
• the number of time windows that we partition the interval [1, t_L] into is N_τ = 13, where the total variation of the pdf of the phase space variables during any of these 13 time windows is modelled as zero, but the pdf can vary as we move from one time window to another;
• the width of any time window is 15 days;
• the number of X-partitions that the domain of the potential function is partitioned into is N_X = 20, s.t. in the j-th time window, the effective potential vector φ^(j) is a 20-dimensional vector, ∀j = 1, …, 13;
• the number (N_ε) of ε-partitions that the normalised energy variable is split into is chosen to be 9.

Motivations for these choices are discussed in Sect. 2.9.2.

2.9.4 Bespoke Learning of Potential: Results from Steps I and III

Histogram approximations of the marginal posterior probability density of 12 (of the N_X = 20) components of the vector-valued potential variable φ^(13) are presented in Fig. 2.3. The figure includes similar approximations for 4 (of the N_ε = 9) components of the vector-valued f variable in the 13-th time window within the learning period. The N_X = 20 components of the vector-valued potential random variable that is learnt at each of the N_τ = 13 time windows within the learning period are plotted against values of the location variable X in the top right panel of Fig. 2.4. The potential learnt during the i-th time window (i = 1, …, N_τ = 13) is marked with the colour and symbol type as per the key that is tabulated in Table 2.1.


Fig. 2.3 Lower 3 rows: histograms of learnt values of the components of the potential vector φ^(13), learnt during the last time window of the learning period. Top row: histograms of learnt values of components of the f vector, learnt during this same time window

The same colour and symbol types are employed to depict the empirically observed values of the location variable X, against the day on which each such observation is made, in the bottom right panel of Fig. 2.4. The learnt potential is plotted against the normalised location variable in the top left panel of Fig. 2.4. Again, the components of the pdf vectors are plotted against the value of the normalised system energy variable (ε) in the bottom left panel of this figure. The same colour-and-symbol coding—declared in Table 2.1, and employed in the other plots included in this figure—is used in the depiction of the pdf parameters relevant to the time window under consideration.

Fig. 2.4 Left: using the colour and symbol type key tabulated in Table 2.1, the empirically observed value of exp(X) is plotted against each day of a time window. Broken lines separate the distinct time windows that reside within the learning period. Middle: components of the potential vector learnt at a time window during the learning period, plotted against values of the daily new number of infections in 8 countries, added in quadrature (exp(X)). The learnt components are presented without the learnt uncertainties for ease of visualisation, and for the same reason, the individual components of a learnt potential vector are connected with a broken line. The colour and symbol type of the presented results correspond to the colour and symbol type designated for the time window under consideration, as per the key delineated in Table 2.1. Right: components of each vectorised phase space pdf that is learnt at a time window during the learning period, plotted against the energy variable

Table 2.1 Key for the colour and symbol type used in Fig. 2.4, corresponding to each of the 13 time windows that span the learning period, along with the dates that each such time window covers

Index | Covers dates (in 2020)     | Colour  | Symbol and line type
1     | 26th February–11th March   | Cyan    | Unfilled square, solid
2     | 12th March–26th March      | Green   | Unfilled square, solid
3     | 27th March–10th April      | Blue    | Unfilled square, solid
4     | 11th April–25th April      | Black   | Unfilled square, solid
5     | 26th April–10th May        | Red     | Unfilled square, solid
6     | 11th May–26th May          | Magenta | Filled circle, solid
7     | 27th May–10th June         | Cyan    | Filled circle, broken
8     | 11th June–25th June        | Yellow  | Filled circle, broken
9     | 26th June–10th July        | Red     | Filled circle, broken
10    | 11th July–26th July        | Black   | Filled circle, broken
11    | 27th July–10th August      | Green   | Filled circle, broken
12    | 11th August–26th August    | Blue    | Filled circle, broken
13    | 27th August–9th September  | Magenta | Filled circle, broken

2.9.5 Forecasting: Results from Steps II and III

The potential predicted at the j-th time window after the learning period finishes (for time window index 13+1, 13+2, …) is shifted by α_1 ≈ −13.79, and scaled by the factor α_0 ≈ 0.05294. These values of α_0 and α_1 were chosen to shift and scale all forecast potential vectors (φ^(13+j), j = 1, 2, …),


according to the method suggested in Sect. 2.9. In other words, α_1 ≈ −13.79 and α_0 ≈ 0.05294 ensure that at j = 1 + 13, the empirically-observed rate v_14 matches √{2[α_0(−φ_1^(14) − … − φ_20^(14)) + α_1]}, and the observed location matches the forecast location. In the lower plot in the top right panel of Fig. 2.5, we find displayed the "negative forecast potential" at the j-th time window inside the prediction period; this "negative forecast potential" is:

α_0(−φ_1^(13+j) − … − φ_20^(13+j)) + α_1.

It is plotted against j, at the identified α_0 and α_1, with half the squared value of the empirically observed rate (= (v_{13+j})²/2) overplotted. At the j-th time window within the prediction period, the "uncertainty in the forecast rate", or σ_{v_{N_τ+j}}, is defined as

σ_{v_{N_τ+j}} := √{2[α_0(−σ_{φ_1}^(13+j) − … − σ_{φ_20}^(13+j)) + α_1]},   (2.11)

where σ_{φ_i}^(13+j) is the standard deviation predicted on the i-th component of the potential vector φ^(13+j) that is forecast at the j-th time window inside the prediction period. Here we recall that N_τ = 13 in our application. Then at this time window, uncertainty on the "negative forecast potential" is defined by the range that we refer to as I_j, where

I_j := [(ṽ_j)²/2 − 2.5(σ_{v_{N_τ+j}})², (ṽ_j)²/2 + 2.5(σ_{v_{N_τ+j}})²].   (2.12)
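The bookkeeping around these quantities can be sketched as follows; α_0 and α_1 take the values quoted in the text, while the 20 forecast potential components and the rate s.d. below are invented.

```python
import math

# Sketch of the quantities around Eqs. 2.11-2.12: the "negative
# forecast potential", the forecast rate, and the interval I_j.
# alpha0, alpha1 are the values quoted in the text; the potential
# components and the rate s.d. are invented.

alpha0, alpha1 = 0.05294, -13.79

def neg_forecast_potential(phi):
    return alpha0 * (-sum(phi)) + alpha1

def forecast_rate(phi):
    q = 2.0 * neg_forecast_potential(phi)
    # after the 7th prediction window the text notes this can go
    # negative, which is what motivates the alpha2 shift
    return math.sqrt(q) if q >= 0.0 else float("nan")

def interval_I(v_tilde, sigma_v):
    half = v_tilde ** 2 / 2.0
    return (half - 2.5 * sigma_v ** 2, half + 2.5 * sigma_v ** 2)

phi = [-15.0] * 20                       # invented forecast components
v_tilde = forecast_rate(phi)
lo, hi = interval_I(v_tilde, 0.1)        # invented rate s.d.
```

By construction, (ṽ_j)²/2 recovers the "negative forecast potential", which is the identity underlying the comparison plots described below.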

Here the "forecast rate" at this j-th time window is

ṽ_j := √{2[α_0(−φ_1^(13+j) − … − φ_20^(13+j)) + α_1]}.

This representation of uncertainty in our forecast of the potential at any test time point is useful. However, we also wish to compare the empirically observed and forecast values of the rate variable, at the different time locations representing time windows within the prediction period. So the natural choice would be to overplot the empirically observed rate in the j-th time window within the prediction period (i.e. v_{13+j}), on the "forecast rate" at this time window, i.e. on ṽ_j. However, after the 7-th time window within the prediction period, α_0(−φ_1^(13+j) − … − φ_20^(13+j)) + α_1 is rendered a negative number. This would rule out the taking of the square root of twice this number. Thus, to allow for a direct comparison of the empirically-observed and "forecast" rates, we add the (minimally-required) positive number α_2 to render the lower limit


Fig. 2.5 Top right: "negative forecast potential" at the j-th time window within the prediction period, plotted in black against the time point (in days) of this j-th time window. Error bars on the values of the "negative forecast potential" at the j-th time window represent 5 times the predicted standard deviation at this time window. The negative empirical potential is plotted in red. Forecasting of the negative forecast potential is performed with the ARIMA model (ARIMA(3,0,1)) that is judged to yield the best forecasting over the first 7 time windows within the prediction period; results obtained using it are plotted in magenta. The forecast values obtained with this ARIMA model had to be scaled by a factor of about 0.3, to yield the best compatibility with the data. Forecasting performed by another ARIMA model (ARIMA(4,0,2)) is in green. The forecast made with this ARIMA model has been scaled by a factor of about 0.4 for maximal compatibility with the data. Top left: the forecast rate at the j-th time window is plotted against j in black, with superimposed error bars. The empirical rate is overplotted in red. ARIMA models are fit to the empirical rates over the learning period, and out of these, the ARIMA model that performs the least-bad one-step-ahead rate prediction over the first 7 time windows inside the prediction period (ARIMA(3,1,2)) is used to forecast over all considered time windows. Results from this model are plotted in magenta. Forecasting done with another ARIMA model (ARIMA(2,0,2)) is plotted in green. There is no additional artificial scaling of the forecast rates, unlike for the forecast negative forecast potential values. The ARIMA-based predictions were contributed by N. C. Paul (Exeter College, University of Oxford). Bottom left: X is plotted in black, along time in days. The dotted line demarcates the learning period comprising 13 time windows (of 15-day width each) from the prediction period used in our work. The empirically observed 15-day average of X is overplotted in red, during the prediction period. Bottom right: forecast values of X are plotted in black against time in days, where X is the logarithm of the daily new number of infections in 8 countries, added in quadrature. The 15-day average of the empirically observed value of X is plotted in red, and the daily value of X is plotted in green


of α_0(−φ_1^(13+j) − … − φ_20^(13+j)) + α_1 + α_2 non-negative, for all j under consideration—and overplot the equally-shifted empirical rate, i.e. v_{13+j} + α_2, on this "α_2-shifted forecast rate". This is the upper plot in the top right panel of Fig. 2.5. A value of α_2 ≈ 0.007742 minimally satisfies the requirement that the "forecast rate", once adjusted by it, is rendered positive. Comparison of the α_2-unshifted "negative forecast potential" and the empirically observed value of V²/2 is possible as well; this comparison is noted in the lower plots in the top right panel of this figure. The logarithm of the "α_2-shifted forecast rate" in the j-th time window within the prediction period is compared to the log of the α_2-shifted empirically observed rate, and this comparison is displayed in the top left panel of Fig. 2.5. In other words, the top left panel of Fig. 2.5 displays the logarithm of the positive-valued √{2[α_0(−φ_1^(13+j) − … − φ_20^(13+j)) + α_1 + α_2]}, compared to the logarithm of the empirically observed α_2-shifted rate (i.e. v_{13+j} + α_2). For the purposes of forecasting the value of the location parameter X at a test time, it is the α_2-unshifted "forecast rate" (ṽ_test) that is numerically integrated over time, from the beginning of the prediction period to the test time point at which X is sought. X at the beginning of the prediction period is known, as the value of X at the end of the N_τ-th time window, which is the last time window of the learning period. Here j = 1, 2, …, and for this application, N_τ = 13. This integration is undertaken numerically, using the Trapezium Rule. The value of this integral on the t-th day is plotted in the bottom right panel of Fig. 2.5, against t, for t = 15N_τ + 1, 15N_τ + 2, …, where we recall that any of the time windows in our learning/prediction comprises 15 days. The value of this integral on such a t-th day is the forecast value of the location variable X on this day.
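The integration step just described—accumulating the trapezium-rule integral of the daily forecast rate, starting from the known X at the end of the learning period—can be sketched as follows; the starting value and the daily rates are invented.

```python
# Sketch of the final step: the alpha2-unshifted forecast rate is
# trapezium-integrated over time, starting from the known X at the
# end of the learning period.  Starting value and rates are invented.

def forecast_X(x_start, daily_rates, dt=1.0):
    """Cumulative trapezium-rule integral of the forecast rate."""
    xs, x = [x_start], x_start
    for a, b in zip(daily_rates, daily_rates[1:]):
        x += dt * (a + b) / 2.0
        xs.append(x)
    return xs

xs = forecast_X(11.0, [0.10, 0.12, 0.08, 0.05])
```

Because each forecast X is a cumulative sum over all earlier rate values, any error in a forecast rate propagates into every subsequent X value, which is the amplification effect noted in the text.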
The empirically observed values of X during the prediction period are plotted against t in green, with the 15-day average of X overplotted in red. The value attained on the t-th day from the beginning of our learning period by the location variable X—computed using the predicted mean potential—is plotted against t as a fine black solid line in the bottom left panel of Fig. 2.5. This black line is straddled by a band of X values corresponding to an interval of 5 times the predicted standard deviation of the potential. The empirical observations on X at the centroid of each time window within the prediction period are depicted with a thick line, and daily observed X values are overplotted in green. Errors in the forecasting of the potential directly affect the accuracy of the predicted values of the rate variable, and the latter inaccuracy renders the forecasting of X values inaccurate. As the value of X predicted at time t is cumulative over the rate values obtained at all times t′ < t, errors in the forecasting of the rate get amplified as time progresses. Histograms of some of the hyperparameters of the covariance kernel within the 11th time window during the prediction period, learnt using Metropolis-Hastings, are plotted in Fig. 2.7. These histograms approximate the marginal posterior probability densities of the respective hyperparameters, given the data used to learn them.


2.9.6 Quality of Forecasting

The equivalent of forecasting error that is relevant in our Bayesian approach is the predicted interval of values within which the expected forecast value of φ_i^(13+j) lies, where φ_i^(13+j) is the random variable representing the i-th component of the vector-valued random potential variable φ^(13+j) that we forecast at the j-th time window within the prediction period. The width of this interval is 5 standard deviations (s.d.s) of φ_i^(13+j), i.e. 5σ_{φ_i}^(13+j), where closed-form prediction of the variance (and mean) of φ_i^(13+j) is achieved upon learning the covariance structure of the underlying vector-variate GP. This uncertainty on the "negative forecast potential" at the j-th time window within the prediction period is identified above as I_j (in Eq. 2.12). As the predictive of the forecast variable is Normal, the values included within the interval I_j occur at probabilities between 0.025 and 0.975. It may appear that the forecasting error committed at the j-th interval could be conceived of as parametrised by the width of I_j.

Our one-step-ahead forecasting is compared to forecasting done with an Autoregressive Integrated Moving Average (ARIMA) model that is fit to the training data comprising the empirical rate recorded at each of the 13 windows within the learning period. (The best-fit ARIMA model is sought for this training data using the "arima" facility in MATLAB, where "arima(p, D, q)" is stated to create the ARIMA(p, D, q) model, comprising p autoregressive terms; D nonseasonal differences, required to achieve stationarity; and q lagged errors used in the prediction.) Using the empirical rates recorded within the learning period, the ARIMA model that offers the best forecasting over the first 7 time windows within the prediction period is noted in MATLAB to be characterised by p = 3, D = 1, q = 2 for the rate forecasting, and p = 3, D = 0, q = 1 for the forecasting of the negative forecast potential. The forecasting mentioned in the last sentence is a one-step-ahead forecasting of rates; the mean rate (shifted by α_2) forecast by this model, at each of the considered time windows within the prediction period, is over-plotted in magenta in the top left panel of Fig. 2.5. The errorbar on each forecast mean rate is overplotted on each forecast rate value. The forecasts from the ARIMA models are of very poor accuracy when compared to the empirically known rates at these time points (plotted in red in this panel of the figure).
Additionally, the negative forecast potential that is forecast with the “best fit” ARIMA model has to be scaled appropriately, in order for the results to appear on the same scale as the data—we scaled the negative forecast potential by a factor of about 0.3. This is an added artefact of the search for the “best” ARIMA model over a given range of time windows, using which subsequent forecasting is undertaken. Another set of ARIMA models was used to undertake forecasting of the potential; the search yielded distinct “best fit” ARIMA models, and the negative forecast potential then needs to be scaled by a different factor (about 0.4). This inferiority of the ARIMA, and the sensitive dependence of the results on the parameters of the ARIMA model used, render forecasting with such a parametric time-series approach inaccurate in the short and short-to-medium term. The problem is that we do not know which ARIMA model is a good choice, and as is clearly evident in our results, forecasts are not robust to the model choice—choosing one model over another gives hugely different results. The ACF and PACF plots using different models offer signatures that are not informative enough to help prefer one ARIMA model over another. Furthermore, the best ARIMA model for rate forecasting does not yield the best results for (negative forecast) potential forecasting, defying the deterministic relationship that allows for the computation of the rate given a negative forecast potential value, or vice versa. That a conventional parametric time-series approach performs poorly in comparison to our forecasting method is not surprising, since the time series of the rates is not a linear time series, and neither is it likely to be stationary in the long run, with lack of homoscedasticity2 evident even within the short timescale that we report results on in this chapter. The comparison between the empirical rate and forecasts obtained with ARIMA models that are assigned other parameters speaks even less encouragingly for the accuracy capability of ARIMA. We acknowledge the help from Ms. N.C. Paul (Exeter College, University of Oxford), in providing the MATLAB-based results on the ARIMA models, that permitted the comparison undertaken between ARIMA-based forecasting and our work.
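The MATLAB `arima` fits referred to above are not reproduced here; as a minimal illustration of the kind of model involved, the sketch below is a simplified ARIMA(p, 1, 0)-style one-step-ahead forecast (an AR(p) fit by least squares to first differences, with no MA terms—so a stand-in, not the arima(3, 1, 2) model of the text):

```python
import numpy as np

def arima_p10_one_step(y, p=3):
    """One-step-ahead forecast in the style of ARIMA(p, 1, 0):
    fit an AR(p) by least squares to the first differences of y,
    predict the next difference, then undifference."""
    d = np.diff(y)
    # Design matrix: row i holds the p lagged differences preceding d[i].
    X = np.column_stack([d[p - 1 - k: len(d) - 1 - k] for k in range(p)])
    t = d[p:]
    coef, *_ = np.linalg.lstsq(X, t, rcond=None)
    next_diff = coef @ d[-1:-p - 1:-1]  # lag-1 first, then lag-2, ...
    return y[-1] + next_diff
```

On a linearly growing series, the differences are constant and the forecast continues the line exactly; the non-linear, non-stationary rate series of this chapter is precisely where such a fit degrades.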

Our method of forecasting the potential function, and computing phase space variables from the forecast potential, has high accuracy—at least over short time scales. In contrast, a parametric time series based approach such as an ARIMA-based forecasting, which imposes multiple constraints on the time series of the location variable, achieves only low accuracy. It is appreciated that computing the location variable using these ARIMA-forecast rates will result in very poor comparison between such a computed location and the empirical location, since the location computation at a given time accumulates the errors made in forecasting the rate at all times till this test time point. This is of particular concern in applications such as the forecasting of the temporal variation of the COVID-19 infection numbers, or of the rate of change of such numbers. An additional comment is that whenever forecasting on the phase space variables is directly the object of the exercise, accuracy is bound to be compromised, since training that precludes an abrupt change—such as a jump—in X (or .Ẋ), is fundamentally disadvantaged from predicting such an abrupt change at a test point. In contrast, temporal variation in the underlying cause of the evolution of the location variable X—i.e. the potential—is less abrupt than the temporal variation in X itself.
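The accumulation of rate errors into the location variable can be seen in a toy sketch (the numbers below are illustrative, not the chapter's data):

```python
def location_from_rates(x0, rates):
    """Cumulatively sum the rates to obtain the location variable X;
    X at time t depends on the rates at all earlier times t' < t,
    so rate errors accumulate in X."""
    xs, x = [], x0
    for v in rates:
        x += v
        xs.append(x)
    return xs

# A small systematic error in the forecast rate grows linearly in X:
true_X = location_from_rates(0.0, [1.0] * 10)
biased_X = location_from_rates(0.0, [1.1] * 10)  # rate over-forecast by 0.1/day
```

After 10 days, the constant per-day rate error of 0.1 has become a location error of 1.0.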

2 Homoscedasticity refers to constancy in the variance of the output, at all values of the input.



Thus, seeking the potential function at future (test) time points, and computing the X value therefrom, makes for more reliable/accurate prediction. Discontinuity in the potential function that determines the dynamics of a dynamical system—and thereby, non-stationarity of the stochastic process that generates such a potential function—is relatively more suppressed than it is in the distribution of the location variable itself. To enhance the quality of the forecasting of the potential function, we will need to acknowledge the non-stationary nature of the GP that underlies it, and this aspect of the forecasting will be improved upon by modelling each hyperparameter of the invoked covariance kernel as a realisation from a new GP. In that paradigm, the covariance kernel is (effectively) non-parametrically learnt, and in that context, arguments about the relevance of one parametric form of the covariance kernel over another are rendered redundant.
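The closed-form GP prediction invoked above can be sketched generically; the squared-exponential kernel and the hyperparameter values below are illustrative assumptions, not the learnt kernel of this chapter:

```python
import numpy as np

def gp_predict(t_train, y_train, t_test, ell, sig2, noise=1e-6):
    """Closed-form GP posterior mean and variance at test inputs, under
    a squared-exponential kernel with length scale ell and amplitude
    sig2; a generic sketch of GP-based prediction."""
    def k(a, b):
        return sig2 * np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ell ** 2)
    K = k(t_train, t_train) + noise * np.eye(len(t_train))
    Ks = k(t_test, t_train)
    Kss = k(t_test, t_test)
    alpha = np.linalg.solve(K, y_train)
    mean = Ks @ alpha                                  # posterior mean
    cov = Kss - Ks @ np.linalg.solve(K, Ks.T)          # posterior covariance
    return mean, np.diag(cov)
```

The posterior variance returned here is what supplies the forecast s.d. from which an interval such as .I_j can be constructed.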

2.9.6.1 How Far from the Mean?

The left and right edges of the interval .I_j (defined in Eq. 2.12) are respectively offered in columns 2 and 4 of Table 2.2, for .j = 1, . . . , 11. (The mean “negative forecast potential” is in the 3rd column of this table.) Then we check for the existence of half the squared empirically observed rate—i.e. the “empirical negative potential”—as observed within the j-th time window during the prediction period, within the interval .I_j. It is in fact the location of this “empirical negative potential” within the interval .I_j that is centred at the predicted mean “negative forecast potential”, that could contribute to the error of forecasting. For example, one parametrisation of such an error could be the normed difference between the forecast mean and empirical values, normalised by the forecast s.d. Such a distance, expressed in units of the respective forecast s.d., at the respective j (.= 1, . . . , 11), is 0.6614, 0.8892, 1.973, 0.3233, 0.5464, 3.419, 1.062, 6.397, 4.147, 0.03539, 1.512. Thus, if this distance is .≤ 2.5 in units of the s.d. relevant at a given j, then the interval of potential values forecast at that time window within the prediction period includes the empirically observed equivalent of the potential. This distance between the mean “negative forecast potential” at any j and half the squared observed rate—measured in units of the forecast s.d. at this j—is plotted against the centroid of the time window, in the middle panel of Fig. 2.6. Again, half the squared observed rate noted at the 13 time windows that exist within the learning period is plotted against the index of such a time window, in green—as displayed in the left panel of Fig. 2.6. The comparison of the “negative forecast potential” and half the squared observed rate at time windows within the prediction period is shown in black and red respectively in this panel as well.
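The distances listed above can be recovered directly from the tabulated means and interval edges; a minimal check (using columns C3, C4 and C5 of Table 2.2, with one forecast s.d. obtained as (C4 − C3)/2.5, since the interval half-width is 2.5 s.d.s):

```python
# Columns of Table 2.2: C3 = mean "negative forecast potential",
# C4 = right edge of I_j, C5 = empirical negative potential
# (half the squared empirically observed rate).
C3 = [0.01857, 0.006670, 0.02207, 0.08534, 0.1128, 0.01159,
      0.03112, 0.01252, 0.01262, 0.06552, 0.08067]
C4 = [0.02458, 0.008901, 0.02973, 0.1229, 0.1567, 0.01566,
      0.04247, 0.01700, 0.01718, 0.09518, 0.1194]
C5 = [0.02016, 0.005877, 0.02812, 0.0902, 0.1032, 0.006027,
      0.02630, 0.001052, 0.005059, 0.06510, 0.1041]

def distance_in_sd(mean, right_edge, empirical):
    """|empirical - forecast mean| in units of the forecast s.d.,
    where one s.d. is (right_edge - mean) / 2.5."""
    sd = (right_edge - mean) / 2.5
    return abs(empirical - mean) / sd

distances = [distance_in_sd(m, r, e) for m, r, e in zip(C3, C4, C5)]
covered = [d <= 2.5 for d in distances]  # does I_j contain the empirical value?
```

Window j = 8 stands out: its distance of about 6.4 s.d.s places the empirical negative potential well outside .I_8.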


2 Forecasting by Learning Evolution-Driver

Table 2.2 Information on the quality of forecasting performed at the temporal location of the centroid of any time window (in column C1), where one representation of such quality is informed on by comparing half the squared empirically observed rate in a given time window (in column C5), to the interval of values that the “negative forecast potential” lies in (with left and right edges of such an interval displayed in columns C2 and C4 respectively). The mean of the “negative forecast potential” in this time window is displayed in column C3. The “.α_2-shifted forecast rate” is in column C6, and the .α_2-shifted empirically-observed rate is in column C7. The normalised permutation entropy is in column C8.

C1      C2        C3        C4        C5        C6      C7      C8
202.5   0.01256   0.01857   0.02458   0.02016   0.2294  0.2362  0.5881
217.5   0.004441  0.006670  0.008901  0.005877  0.1698  0.1650  0.6045
232.5   0.01440   0.02207   0.02973   0.02812   0.2442  0.2678  0.6383
247.5   0.04774   0.08534   0.1229    0.0902    0.4315  0.4426  0.6807
262.5   0.06885   0.1128    0.1567    0.1032    0.4909  0.4710  0.7163
277.5   0.007525  0.01159   0.01566   0.006027  0.1966  0.1659  0.7233
292.5   0.01978   0.03112   0.04247   0.02630   0.2788  0.2609  0.7408
307.5   0.008036  0.01252   0.01700   0.001052  0.2013  0.1326  0.7311
322.5   0.008063  0.01262   0.01718   0.005059  0.2018  0.1600  0.7194
337.5   0.03585   0.06552   0.09518   0.06510   0.3828  0.3817  0.7504
352.5   0.04191   0.08067   0.1194    0.1041    0.4205  0.4730  0.7726

[Fig. 2.6 panels: left, “Potential” vs. “Index of time window”; middle, “Difference (s.d.s)” vs. “Time (days)”; right, “Permutation entropy” vs. “Time (days)”.]

Fig. 2.6 Here, by “potential” as an axis label in the panels, is implied the “negative forecast potential”. Left: a figure to display redundancy in information previous to a given time window. The “negative forecast potential” at the j-th time window within the prediction period is plotted against the index of the j-th time window, in black. The indices are counted s.t. j = 1 for the 1st time window inside the learning period; thus the first time window at which forecasting happens has .j = 14. Half the squared empirically-observed rate .v_j is overplotted in red against this index, for each such time window at which forecasting is made, and is overplotted in green for .j = 1, . . . , 13, i.e. for time windows within the learning period. Middle: a figure displaying the temporal variation of the distance between the forecast mean “negative forecast potential” and half the squared observed rate, expressed in units of the forecast s.d. This is plotted against the temporal location of the centroid of each time window at which forecasting is undertaken. Right: permutation entropy plotted against the centroid of each time window inside the prediction period. All recorded points are joined by straight lines, as offered within the facility of the used plotting programme; however, these lines are only a visual aid, and not indicative of any functional relationship between the dependent and independent variables relevant to any of these plots


2.9.6.2 Information Redundancy and Forecasting at the 8-th Time Window

Information redundancy refers to the availability of information at a given time point, from the part of the time series that lies in the past. Redundancy in information available at the j-th time window within the prediction period—based on which the forecast at this time window is undertaken—is indicated in the left panel of Fig. 2.6. We note from these plots that it is at the 8-th time window within the prediction period that redundancy in inherited information from past values of the “negative forecast potential” is minimal. Indeed, there are no time windows in the past at which the “negative forecast potential” is as small as it is in the 8-th time window within the prediction period. In light of this, we are not surprised to note the accommodation of half the squared observed rate only near the .−2.5 s.d. mark from the forecast mean, at this time window.

2.9.6.3 Permutation Entropy

A parametrisation of such redundancy in inherited information, at any given j, is undertaken in the literature; it refers to the intrinsic predictability at this j in a time series [14, 31]. Predictability—or rather the lack of it—can be parametrised by permutation entropy [3], which has been advanced as a measure of the rate at which information is generated within the time series, as one moves from one time point to the next. Thus, as we progress in time—while undertaking one step ahead prediction—the rate of information generation is expected to increase, and saturation is expected, unless local deviations in the time series behaviour are manifest. This parametrisation is however affected by non-stationarity of the time series in question, i.e. weak stationarity of the time series is required for permutation entropy to be robust [29]. Thus, in our application, in which non-stationarity is expected on time scales exceeding .τ—where .τ is the chosen width of a time window—the rate of generation of new information is not necessarily to be considered as approximated by the permutation entropy. In other words, intrinsic predictability is not to be robustly treated as decreasing (or increasing) if an increasing (or decreasing) trend in the permutation entropy is noted in our work. We undertake interpretation of the indicators of quality of the forecast, while keeping these caveats in mind. The 8th column of Table 2.2 includes values of the permutation entropy that is computed at each of the time windows within the prediction period at which forecasting has been undertaken in our work. Permutation entropy suggests redundancy of information in the time series, which in turn could affect predictability. The trend in the values of this parameter is overall of an increasing nature, as time passes (in the right panel of Fig. 2.6).
This overall trend is understood if we recall that redundancy in information is accruing as time progresses; at the same time, we expect that as such information redundancy saturates, the increasing nature of the permutation entropy parameter will saturate as well, though local deviations from a smooth trend are expected. Such a local drop in the value of this parameter is noted in the 6th time window within the prediction period, while a bigger drop is noted at the 8th and 9th such time windows. So to summarise, increasing information redundancy generally stems from the increasing rate of generation of new “useful” information as we move across the time series from one time window to the next. Here, the adjective “useful” serves to imply information that aids forecasting. Generation of such “useful” information is impeded at the 8th time window within the prediction period, during which the system is at a state that is only approximately resembled in the past, in terms of the rate value and the trend in the rate of change of this variable as the 8th window is approached from the left. In other words, there is a deficiency of information from the past to facilitate forecasting at this time window. The permutation entropy then drops between the 8th and 9th windows. Again, the forecast rate at the 7th window—as well as the trend in the rate of change of this variable as the 7th window is approached from the left—resembles the situation at the 3rd time window. Now, in the 3rd window, the forecast value of the rate was s.t. a sharp rise in the rate occurred in the following, namely the 4th, time window. Given this trend relevant at the 3rd window, the similarity of the situation with the 7th window suggested that the forecast made in this 7th window would indicate a rise in the rate value in the 8th window, compared to that in the 7th. Thus, there is guidance coming from the past, as far as the forecast made at the 7th window is concerned, to forecast a rise in the rate at the 8th window. However, in spite of this information, our methodology is sufficiently robust in recognising evolutionary trends in the system, to forecast a lower rate at the 8th time window—though our forecast is not as low as the truth. We take home two points from this observation.

Firstly, our forecasting methodology follows the evolution of the system, as suggested by the system dynamics, and it can defy spurious suggestions that pre-exist in hitherto observed trends. Secondly, there is a deficiency in useful information available between the 7th and 8th windows; hence a drop in permutation entropy as we move from the 7th to 8th, and then again to the 9th, the reasons for which are explained above.

So the permutation entropy could be a useful parametrisation of where the forecasting is going wrong, except that it provides biased results for non-stationary series; so we remain wary of over-interpreting permutation entropy-based results. The permutation entropy computed here is for the time series comprising the “forecast rate”, from the 1st time window within the learning period to the j-th time window within the prediction period; so at the j-th time window inside the prediction period, it is referred to as .h_j. It merits mention that in our work, .h_j is computed using the permutation entropy facility in RStudio, and it is the

normalised Shannon's entropy, i.e. .(−∑_{i=1}^{d!} p_i log_2(p_i)), normalised by the maximal Shannon entropy (namely .log_2(d!)) that is achieved when all permutations of d-tuples (or d-lettered words) of the considered time series have the same probability. Here .p_i is the relative frequency of the i-th permutation of the amplitude ordering of such tuples.
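A minimal implementation of this normalised permutation entropy may be sketched as follows (the word length d = 3 below is illustrative; the chapter itself uses the facility in RStudio):

```python
from collections import Counter
from math import factorial, log2

def permutation_entropy(series, d=3):
    """Normalised permutation entropy: the Shannon entropy of the
    ordinal patterns of consecutive d-tuples of the series, divided by
    its maximum log2(d!), attained when all patterns are equiprobable."""
    # Ordinal pattern of a window = indices sorted by amplitude.
    patterns = [tuple(sorted(range(d), key=lambda k: series[i + k]))
                for i in range(len(series) - d + 1)]
    n = len(patterns)
    counts = Counter(patterns)
    h = -sum((c / n) * log2(c / n) for c in counts.values())
    return h / log2(factorial(d))
```

A monotone series yields a single ordinal pattern and hence zero entropy; an erratic series that visits many patterns yields values approaching 1.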

2.9.6.4 Why Is the Forecast Bad Around the 2020–2021 Transition?

One feature of our forecasting is that in the forecasting of the potential—and importantly of the rate—there is an over-prediction at the centroid of the 8th time window within the prediction period. The reason for this over-prediction is a decrease in the intrinsic predictability, caused by misleading information from that part of the past that resembles the situation at the 7th window. Still, our method attempts to defy this spurious information, and instead of forecasting a higher negative forecast potential and rate (than at the 7th window), offers a lower value of these variables, though the forecast values are not as low as the data. In addition, it is possible that the data on the daily new infection numbers was under-reported during the 8th time window. We put forward a conjecture in this regard. This temporal interval is the one that coincided with the year-end break, and data was not recorded as systematically as at previous and subsequent times, in multiple countries. In particular, there were conspicuous gaps in the reporting of the data on daily new numbers of infections in Spain. The data that remained unrecorded over the holiday period could have been redistributed over the ensuing working periods in the new year. If this guess is true, then it is possible that the empirically-observed values of the location variable X—and perhaps also of the rate variable—were under-reported. The 6th and 7th columns of Table 2.2 also provide the mean “forecast rate” and the empirically observed rate, at the j-th window; indeed, it is the forecasting of the rate variable V (and the location variable X) that interests us most. The comparison of the forecast and observed rates is noted in Fig. 2.5; it appears from such presented results that the forecasting of rates undertaken by this method is quite good.

2.9.7 Work to Be Done

As stated above, there are multiple strands of the above application that will be addressed in a future contribution. Chief amongst these is the pursuit of the phase space pdf, by treating this function as a realisation from a GP. Such a generative GP will be learnt using a training data set that consists of pairs of values of the energy variable .ε (corresponding to the j-th .ε-partition; .j = 1, . . . , N_ε), and the value of this pdf realised at it. This exercise will be undertaken at each time window within the learning period. In fact, this is undertaken in the next chapter, in which we will


[Fig. 2.7 panels: sixteen histograms, with counts (0 to 3×10⁴) on the vertical axes, over hyperparameter values spanning 0 to about 0.1 on the horizontal axes.]

Fig. 2.7 Histograms of the learnt values of some of the length scale hyperparameters, from the Metropolis-Hastings chain that was run at the 11th time window inside the prediction period. The aforementioned hyperparameters of the covariance kernel parametrise the inter-time covariance matrix of the GP that is used to model the potential with. Data used in this chain comprises information on the potential vectors learnt at the .N_τ distinct time windows within the learning period, as well as the potential forecast in the first 10 time windows during the prediction period. The panels correspond to the hyperparameters indexed 1, 3, 5, 7, 9, 11, 12, 13, 15, 16, 17, 19, 20, 21, 22, 23

discuss a similar methodology as relevant to the bespoke learning of Step I discussed above. Results of the GP-based learning that is used to forecast the potential in a time window are depicted in Fig. 2.7. Again, more extensive experimentation with parameters such as the width of the time window, and the widths of the X-partition and the .ε-partition, is due; the bin widths being of relevance to the application considered in the next chapter, we shall review these choices within the upcoming discussion. The width of a time window informs on the time scale over which the system can be approximated as autonomous Hamiltonian—so the development of a protocol to identify such a width for a given system will need to be attempted. Also, from a technical viewpoint, the learning of the GP that we invoke to model the potential could be improved by modelling each hyper-parameter of this GP using a new GP—as distinguished from the hyperparameters being treated as constants that we learn. Thus, the generative GP of each such hyper-parameter would be compounded with the mother GP that the potential function is invoked to be sampled from, in this learning strategy. This would give rise to an even better forecasting methodology, enabled to defy spurious trends suggested by past behaviour. Ultimately, we will be working to transform the numerical codes for the forecasting exercise into a black box, with relevant pliable inputs that the user can manipulate to suit their needs. In the meantime, the CWeB and C++-based codes are available for any potential user, upon expression of interest. It is possible that for an .N_X ∼ 10, the (distinct) elements of .Σ_compo can be learnt directly from MCMC. Even for the currently considered application, with the time length of each time window curtailed to 7 days, an .N_X value of 7 will permit the learning of the inter-component covariance matrix directly from MCMC, within the same chain as we learn the hyper-parameters of the covariance kernel that parametrises .Σ_time. For higher-dimensional inter-component covariance matrices, kernel parametrisation will also be attempted.

2.10 Summary

In this chapter we have introduced a new way of forecasting states of a non-stationary and non-linear, generic dynamical system, by actually forecasting the function that can be invoked to cause, or drive, the system evolution; the rate and state space variables are then computed using the forecast potential, at any test time. States of a dynamical system can be predicted if we had information on such an evolution-driver, which we rephrase as the potential function within a generalisation of the basic Newtonian dynamical paradigm. Obtaining information on this sought potential is then the objective of the endeavour discussed in this chapter. We anticipate that upon knowing this potential, the state at any time point can be computed using Newton's Second Law—as relevant to this generalised paradigm. In fact, in the absence of deterministic knowledge of the potential, the states will be known with uncertainty. Indeed, such anticipated probabilistic knowledge of the potential function of a deterministic system is all that we will be enabled to avail of, using a novel implementation of bespoke learning. After all, such a potential is not known to us, and so we strive to learn it at design time and state values, i.e. the inputs to the potential function are time and state. Such potential learning at the design inputs is accomplished by approximating the dynamics of the system as locally autonomous and Hamiltonian, such that during identified time intervals, the total time derivative of the phase space pdf is considered zero. The phase space pdf is then rephrased from being a function of the location variable and rate—to being a function of these phase space variables, s.t. the latter function remains conserved in time as the system moves from one point in phase space to another, along a trajectory. Such a time-invariant function of the phase space variables is energy—part of which is



contributed to, by the potential. Thus, by acknowledging the temporal evolution of the phase space pdf, we have admitted the sought potential into the support of the phase space pdf. This is the key to our bespoke learning of the potential (as a function of time and the state space variable), which comprises Step I of our learning+forecasting strategy. However, there exist no training data that can allow supervised learning of the potential inside a considered stationary interval—or a time window—as a function of the state. Additionally, we consider the potential to depend only on the location and not the rate. So we recall that during a time window, there is no information available on the value of the potential function at a given location. So, within a time window, in lieu of the relevant training data, we cannot learn the correlation of the potential function's value at one value of the location variable, to the function's value at another point in its domain. Instead, we learn the vectorised form of the potential function inside each of the considered time windows, s.t. the function's value over each considered partition of its domain is a component of the potential vector. The other unknown function—namely, the phase space pdf—is also vectorised. We learn the components of each vector; such bespoke learning of the potential and pdf vectors then essentially offers the training sets that are required to perform supervised learning of the respective function—by modelling the sought function with a GP that is learnt using the generated training data. Of course, there are multiple time windows over each of which a potential vector is learnt, i.e. training data is found available, subsequent to Step I, to admit the learning of the vector-valued potential random function of the time and location variables, modelled with a vector-variate GP. The likelihood in this modelling exercise is then matrix Normal.
Thus two covariance matrices—inter-time and inter-component—are parameters of this matrix Normal likelihood. We learn the inter-time covariance matrix by parametrising it using a covariance kernel, in which every pair of potential vectors realised at any distinct pair of time points is correlated, as driven by a distinct correlation length scale hyperparameter. The inter-component covariance matrix is computed using the empirical covariances. This GP-based learning comprises Step II of our strategy. Forecasting at a given time window is conducted using information on all potential vectors that are learnt and forecast during all time windows, till the one that is just prior to the current one. Thus, forecasting in our work is sequential, or one time window at a time. Our basic aim is short-term and accurate forecasting. Once the expectation of the potential vector is forecast at a test time window, along with the variance on it, we input the same into the generalised Newtonian paradigm, to forecast the uncertainty-included state of the system at this time. This is Step III of our learning+forecasting strategy. We realise that had our interest been limited to the potential—and not the state that the system is in at any time point—the 3-step exercise could have been truncated to the first two steps. Indeed, such is the situation in the problem discussed in the next chapter, where the system under consideration is also more limited in variety, in its potential being independent of time, as distinguished from the premise of the discussion undertaken herein—namely, any dynamical system.
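The empirical estimation of the inter-component covariance matrix mentioned above can be sketched as follows (the toy matrix of potential vectors below is illustrative, not data from the chapter):

```python
import numpy as np

def empirical_intercomponent_cov(Phi):
    """Phi: one row per time window (an empirical sample of the
    potential vector), one column per component of that vector.
    Returns the inter-component covariance matrix estimated from
    these empirical samples."""
    return np.cov(Phi, rowvar=False)

# Toy example: 3 time windows, 2 components.
Phi = np.array([[0.0, 0.0],
                [1.0, 2.0],
                [2.0, 4.0]])
Sigma_compo = empirical_intercomponent_cov(Phi)
```

Here the second component is exactly twice the first, so the off-diagonal entry equals twice the first component's variance.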

References


1. P. Abbeel and A. Y. Ng. Apprenticeship learning via inverse reinforcement learning. 2004.
2. M. Aoki. State-Space Modelling for Time Series. Universitext. Springer-Verlag Berlin Heidelberg, 1990.
3. C. Bandt and B. Pompe. Permutation entropy: a natural complexity measure for time series. Physical Review Letters, 88:174102, 2002.
4. Leonard E. Baum and Ted Petrie. Statistical inference for probabilistic functions of finite state Markov chains. The Annals of Mathematical Statistics, 37(6):1554–1563, 1966.
5. Richard Bellman. A Markovian decision process. Indiana Univ. Math. J., 6:679–684, 1957.
6. Brahmadeep and S. Thomassey. Intelligent demand forecasting systems for fast fashion. In Tsan-Ming Choi, editor, Information Systems for the Fashion and Apparel Industry, Woodhead Publishing Series in Textiles, pages 145–161. Woodhead Publishing, 2016.
7. J. Casals, A. Garcia-Hiernaux, M. Jerez, S. Sotoca, and A. A. Trindade. State-Space Methods for Time Series Analysis: Theory, Applications and Software. Monographs on Statistics and Applied Probability. Taylor & Francis Group, 2016.
8. C. W. Chang, M. Ushio, and Ch. Hsieh. Empirical dynamic modeling for beginners. Ecological Research, 32:785–796, 2017.
9. B. Chateau and B. Lapillonne. The MEDEE approach: analysis and long-term forecasting of final energy demand of a country. In Energy Modelling Studies and Conservation, pages 57–67. Pergamon, 1982.
10. Philippe Chatigny, Jean-Marc Patenaude, and Shengrui Wang. Spatiotemporal adaptive neural network for long-term forecasting of financial time series. International Journal of Approximate Reasoning, 132:70–85, 2021.
11. J. Durbin and S. J. Koopman. Time Series Analysis by State Space Methods: Second Edition. Oxford Statistical Science Series. OUP Oxford, 2012.
12. Philip Hans Franses and Dick van Dijk. Non-Linear Time Series Models in Empirical Finance. Cambridge University Press, Cambridge, 2000.
13. K. J. Friston, J. Daunizeau, and S. J. Kiebel. Reinforcement learning or active inference? PLoS ONE, 4(7):e6421, 2009.
14. Joshua Garland, Ryan James, and Elizabeth Bradley. Model-free quantification of time-series predictability. Phys. Rev. E, 90:052910, 2014.
15. J. W. Gibbs. On the fundamental formula of statistical mechanics, with applications to astronomy and thermodynamics. The Scientific Papers of J. Willard Gibbs, reproduced from Proceedings of the American Association for the Advancement of Science, 33, 57–58 (1884), II:16, 1906.
16. A. Giron-Nava, S. B. Munch, A. F. Johnson, E. Deyle, C. C. James, E. Saberski, G. M. Pao, O. Aburto-Oropeza, and G. Sugihara. Circularity in fisheries data weakens real world prediction. Scientific Reports, 10:6977, 2020.
17. H. Goldstein, C. P. Poole, and J. L. Safko. Classical Mechanics. Addison-Wesley Longman, Incorporated, 2002.
18. Jarmo Hietarinta. Direct methods for the search of the second invariant. Physics Reports, 147(2):87–154, 1987.
19. R. Hyndman, A. B. Koehler, J. K. Ord, and R. D. Snyder. Forecasting with Exponential Smoothing: The State Space Approach. Springer Series in Statistics. Springer Berlin Heidelberg, 2008.
20. R. J. Hyndman and G. Athanasopoulos. Forecasting: Principles and Practice. OTexts, 2018.
21. H. Kantz and T. Schreiber. Nonlinear Time Series Analysis. Cambridge Nonlinear Science Series. Cambridge University Press, 2004.
22. C.-J. Kim and C. R. Nelson. State-Space Models with Regime Switching: Classical and Gibbs-Sampling Approaches with Applications. The MIT Press, 1999.
23. Peter E. Kloeden and Meihua Yang. An Introduction to Nonautonomous Dynamical Systems and their Attractors. World Scientific, 2020.
24. Mauricio A. Álvarez, Lorenzo Rosasco, and Neil D. Lawrence. Kernels for vector-valued functions: a review. Foundations and Trends in Machine Learning, 4(3):195–266, 2012.
25. D. C. Montgomery, C. L. Jennings, and M. Kulahci. Introduction to Time Series Analysis and Forecasting. Wiley Series in Probability and Statistics. Wiley, 2011.
26. D. C. Montgomery, L. A. Johnson, and J. S. Gardiner. Forecasting and Time Series Analysis. Industrial Engineering Series. McGraw-Hill, 1990.
27. S. B. Munch, A. Brias, G. Sugihara, and T. L. Rogers. Frequently asked questions about nonlinear dynamics and empirical dynamic modelling. ICES Journal of Marine Science, 77(4):1463–1479, 2019.
28. Oliver Nash. Liouville's theorem for pedants. 2015.
29. A. Osmane, A. P. Dimmock, and T. I. Pulkkinen. Jensen-Shannon complexity and permutation entropy analysis of geomagnetic auroral currents. Journal of Geophysical Research, 124:2541–2551, 2019.
30. E. Ott. Chaos in Dynamical Systems. Cambridge University Press, 2002.
31. F. Pennekamp, A. C. Iles, J. Garland, G. Brennan, U. Brose, U. Gaedke, U. Jacob, P. Kratina, B. Matthews, S. Munch, M. Novak, G. M. Palamara, B. C. Rall, B. Rosenbaum, A. Tabi, C. Ward, R. Williams, H. Ye, and O. L. Petchey. The intrinsic predictability of ecological time series and its potential to guide forecasting. Ecological Monographs, 89(2):e01359, 2019.
32. L. Ramírez and J. M. Vindel. Forecasting and nowcasting of DNI for concentrating solar thermal systems. In Manuel J. Blanco and Lourdes Ramirez Santigosa, editors, Advances in Concentrating Solar Thermal Research and Technology, Woodhead Publishing Series in Energy, pages 293–310. Woodhead Publishing, 2017.
33. Carl Edward Rasmussen and Christopher K. I. Williams. Gaussian Processes for Machine Learning. The MIT Press, 2006. http://www.gaussianprocess.org/gpml/.
34. C. P. Robert and G. Casella. Monte Carlo Statistical Methods. Springer-Verlag, New York, 2004.
35. R. Salakhutdinov and G. Hinton.
An efficient learning procedure for deep boltzmann machines. Neural Comput, 24(8):1967–2006, 2012. 36. R.H. Shumway and D.S. Stoffer. Time Series Analysis and Its Applications: With R Examples. Springer Texts in Statistics. Springer International Publishing, 2017. 37. G. Sugihara. Nonlinear forecasting for the classification of natural time series. Philos Trans R Soc A, 348:477–495, 1994. 38. R.S. Tsay. Analysis of Financial Time Series. CourseSmart. Wiley, 2010.

Chapter 3

Potential to Density via Poisson Equation: Application to Bespoke Learning of Gravitational Mass Density in Real Galaxy

Abstract In multiple real-world dynamical systems, structural properties can be deterministically linked to the evolution-driving function. For example, in self-gravitating systems, or in systems in which charge/current distributions dictate the dynamics, the evolution-driver, or the system potential, is related in a known way to the density function that underlies system structure. In this chapter, the focus is on learning that density function, given observations that are possible of only some, instead of all, of the phase space coordinates, in a dynamical system that we motivate to be in dynamic equilibrium. Then the embedding of the evolution-driver in the support of the probability density function (pdf) of the phase space variables—by exploiting the temporal evolution of the pdf—is equivalent to the embedding of the sought structural density function in the support of the pdf. Such learning demands generation of training data, and this is illustrated via the bespoke learning of the value of the density of all gravitating mass in the real galaxy NGC 4649, at chosen locations inside the galaxy; values of the phase space pdf at chosen points in its support are also bespoke learnt. For such learning, data on two types of galactic particles are employed. Supervised learning of the gravitational mass density function and the pdf are undertaken thereafter, along with predictions at test points. Such prediction suggests a gravitational mass of about 10–100 billion solar masses inside the inner 0.001 kpc in this real galaxy.

© Springer Nature Switzerland AG 2023 D. Chakrabarty, Learning in the Absence of Training Data, https://doi.org/10.1007/978-3-031-31011-9_3

3.1 Introduction

In the previous chapter, we borrowed the concept of potential, to model the evolution-driver of a considered dynamical system. In that chapter, bespoke learning of the potential, followed by its forecasting, allowed for the computation of the phase space coordinates of this system at any future time point. To summarise, we invoked the “potential” of a generic dynamical system to perform forecasting of the state of the system. In this chapter we discuss a problem in which interest lies in the selfsame pursuit of such a potential, which—in concert with the probability density function of the phase space variables—informs on the future state of the considered mechanical system, which is an autonomous dynamical system¹ [6, 22]. In this chapter we focus on such mechanical systems that admit a natural description of the potential function, which, as we shall soon see, relates deterministically to the system structure, through the well-known Poisson Equation [1, 8, 14, 17]. This is typical of gravitationally-bound systems, and of systems that are driven by electromagnetic potentials. The learning or estimation of the potential crucially reveals all that we need to know, in order to predict system dynamics, given a starting location and velocity of system particles. In other words, knowledge of the potential permits forecasting of states at any future time point. Here, the pursuit of potential is synonymous with the hunt for the distribution of the gravitational mass within the system, or that of the charge condensation and/or current patterns that are relevant to system behaviour. Conversely, the gravitational mass distribution in gravitational systems—or the charge and/or current distribution in electromagnetic systems—allows for the learning of the potential, using which, system behaviour can be deterministically forecast. However, the laboratory-based observation of this system structure is not always possible—such as for gravitationally-bound celestial systems. In fact, the observational constraints on such systems can be severe, especially in the more distant ones, i.e. in faraway external galaxies. What exacerbates the problem in such systems is that the major fraction of the sought structural inputs (on the density of the gravitational mass) is essentially unobservable. This pertains to dark matter, which constitutes the majority of the gravitational mass of such distant galaxies and other celestial systems [9, 21, 24, 25, 33]. Dark matter is by nature undetectable photometrically, since such matter neither emits light of its own, nor reflects light incident upon it.
Then in such systems, a direct observational estimation of the total gravitational mass—i.e. luminous and dark—is not possible. In lieu of this, we observe an effect of the total gravitational mass density in the system, and hope to learn this density, using data that comprises observations of such an effect. Then, as long as we know the relationship between this effect—parametrised as deemed suitable by the scientist—and the gravitational mass density, we can predict the mass density, within an Inverse Problem approach. However, if this relationship is itself unknown, then we are in difficult territory. The ready suggestion would be to then model the unknown relation—between the observable effect, and the unknown mass density—possibly with a Gaussian Process (GP), with the intention of learning this relation. Athwart such a suggestion, supervised learning of this unknown functional relation is not possible, since training data that adumbrates such a learning endeavour is absent in these circumstances. To clarify, values of the (parametrised) effect of the galactic gravitational field in the given galaxy are not available at a design value of the galactic gravitational mass density. So, the effect can be observed—under diverse galactic parameters—but no such observation can be linked to the unknown galactic gravitational mass density. We understand that not knowing the relationship between the observable effect (of the galactic gravitational field) and the gravitational mass density, is equivalent to saying that a reliable parametric model of this relationship—that is applicable to a generic galaxy (of the considered type)—does not exist. This fundamentally negates the usage of astronomical simulations of the behaviour and structure of a given galaxy, that can appear to yield the value of the effect of the galactic gravitational field, at the designed gravitational mass density value—such a value of the effect will however not be computable at a given mass density, since a model that links the two is not available. Thus, if said model is suspect, the synthetic observation of the effect will be erroneously computed at a designed mass density, and therefore we will remain deprived of a reliable training set comprising pairs of mass density and effect values, for such simulated galaxies. Then it follows that the difficult task of supervised learning of the relationship between an observable effect of the galactic gravitational field, and the gravitational mass density, cannot be undertaken, annulling prospects of prediction of the mass density at which the effect is recorded for a test galaxy. So in summary, we can state that in a given galaxy, we have no ready reliable information to suggest what the gravitational mass density value is (at a given intra-galactic location), given which the recorded observation of the effect of the galactic field is realised.

¹ Here by autonomous, we imply that the velocity function is not explicitly time-dependent in this system. This is discussed in Sect. 3.1.1.

3.1.1 Motivating Bespoke Learning

In other words, there is no reliable training data that is available to render supervised learning of the functional relation between the observable effect of gravitation, and the gravitational mass density of the galaxy. All we can access is a set of observations of such an effect—parametrised perhaps, as an effect parameter. Using this data, we will need to perform bespoke learning of the galactic gravitational mass density values (at chosen locations inside the galaxy). Once such bespoke learning is undertaken, it will offer the training data required to perform supervised learning of the galactic gravitational mass density function. Such a gravitational mass density will allow for the gravitational potential function of the galaxy to be computed, since potential and gravitational mass density are deterministically connected in a gravitational system—by one of the fundamental equations of Potential Theory, namely the Poisson equation. While the potential dictates the evolution of the phase space variables, the system’s location in phase space, in which such evolution occurs, is informed upon by the pdf of the phase space variables [10, 19, 27].


In this chapter we discuss an example of a real-world situation that is in need of bespoke learning, aimed at the generation of the originally absent training data. Once such training data is available, it can be employed to learn the functional relationship between the sought system parameters and the observable. In this application, the system parameter is the gravitational mass density, which is a proxy for the (potential) function that informs on how evolution of the system state is driven. We call the density a “proxy” for the potential, as the two functions are related to each other in a known way. Indeed, the potential yet again—as in the previous chapter—is what we seek to learn, since the desired density function of all gravitating matter in the system emanates from this potential. Thus, the potential is in a sense the ambition of this chapter, while it was more a means to an end in the previous chapter (to be precise, it was the tool used for the identification of the system state, at a test time point). In multiple other real-world problems, if the aim is to identify the structure of the system—parametrised in terms of the density of mass/charge, etc.—it will be possible to do so, via generalisation of the methodology described in this chapter.

The methodology motivated herein presupposes that the considered system is an autonomous dynamical system. The justification behind this assumption is motivated by the lack of inter-particle interaction and collisions in the galaxy, and by our assumption that the potential has equilibrated, and is not varying with time. We discuss these in detail below. The evolution of the pdf of the phase space variables is then motivated to be such that it allows for the sought potential to be embedded into the support of this pdf, thus allowing for the likelihood to carry information on the unknown potential (since the likelihood is formulated in terms of the pdf of the observed phase space variables, conditional on model parameters). Once the likelihood is formulated, the posterior probability density of all unknowns is formulated, given the data on the observables, and posterior sampling is undertaken with MCMC-based inference [28].

3.1.2 A Particularly Difficult Data Set for Bespoke Learning

The remit of the application discussed currently is limited to the Universe sciences, because we wanted not only to discuss the particularly interesting idea of robust learning of the dark matter fraction in a galaxy (an ambition that could be broadened to include higher mass scales), but also to draw the reader’s attention to a problem that is beset by severe observational constraints. Unlike in a laboratory situation, in these systems, observations are typically obtained during a single window in time, which by the evolutionary time scale of the considered system, is a snapshot in


time over the system’s existence. Indeed, this snapshot view of the system that we can obtain allows for a single viewing angle to the system. In addition, particles inside the system—such as stars, or identified clusters of stars that reside inside a galaxy, etc.—can be tracked only for the speed with which they are approaching (or receding from) the observer. No other component of their velocity vector is an observable. Then again, only the projection (onto the image plane of the system) of the location vector to such a particle is an observable. In other words, the observables are sampled from a projection of the phase space pdf of the system. As we will soon see, half the phase space coordinates are observable, while the remaining half cannot be observed in these systems. Given such partially observed phase space coordinates of a sample of particles in a real galaxy, we will advance a method to learn the gravitational potential of the system, as well as its phase space pdf.

Since the data for such learning is not sampled from the phase space pdf—but from a projection of this pdf onto the space of the observables—we cannot hope to learn the phase space pdf using such data, unless we supplement such data with extra information about the structure of the system phase space, to accomplish the intended learning. Indeed, it is possible to assume the geometry of the phase space to be such that incomplete sampling from the phase space is compensated for.
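A minimal sketch of this projection, using hypothetical standard-Normal phase-space draws as placeholders for samples from the actual pdf (all variable names here are illustrative, not the author's):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical 6-D phase-space sample; columns: x1, x2, x3, v1, v2, v3.
w = rng.standard_normal((1000, 6))

# Only half the coordinates survive the projection onto the observables:
r_p = np.hypot(w[:, 0], w[:, 1])      # projected (image-plane) radius
v_3 = w[:, 5]                         # line-of-sight velocity component
data = np.column_stack([r_p, v_3])    # x3, v1, v2 remain unobserved
```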

3.2 Methodology

Here our system is a galaxy, the image of which is ascribed an elliptical shape, globally. Such “elliptical” galaxies are more prevalent than galaxies that manifest a disk-shaped luminous component [3], and are older systems than disk-shaped galaxies. The learning of the total gravitational mass content in these systems is also harder than in disky systems, in which matter is astronomically modelled to circulate about the galactic centre (where random motion is superimposed upon such circulation). The speed of such circulatory motion at a chosen distance from the galactic centre can be modelled to reflect upon the total (i.e. dark+luminous) mass that lies within that galactocentric distance [9, 29]. In elliptical galaxies, on the other hand, no circulatory motion is thought to be the predominant contributor to the velocity of a general sample of particles that are members of the galaxy—motion is said to be supported by randomness, rather than streaming along any direction inside the galaxy [4].


3.2.1 Potential and Phase Space Probability Density Function

We express the (galactocentric) location vector X = (X1, X2, X3)^T of a galactic particle in the basis {e1, e2, e3}, where e3 is along the line-of-sight that connects this particle to the observer, with X3 positive towards the observer. Any plane orthogonal to e3 is Span{e1, e2}, chosen s.t. e1 ∧ e2 = e3. Then V = Ẋ = (V1, V2, V3)^T is the velocity vector of the particle. As locations and velocities are 3-dimensional vectors in classical Physics, let X ∈ X ⊆ R³, and V ∈ V ⊆ R³. Let the phase space vector be W := (X^T, V^T)^T, with W ∈ W ⊆ R⁶, i.e. W is the galactic phase space that hosts all location and velocity vectors. Let the pdf of W be f_W(X1, X2, X3, V1, V2, V3), i.e. the phase space density f_W : W −→ R≥0. Let the potential of this system be Φ(X1, X2, X3), where we choose to model the system as a Hamiltonian system, i.e. the potential is independent of particle velocity in this (assumed) dissipationless system. The justification for this dissipationless assumption lies in the timescale for inter-particle (gravitational) interaction in the system exceeding the age of the Universe [4]. We discuss such a picture of a galaxy below. That the potential is not explicitly dependent on time is a reflection of the assumption that the temporal evolution of the system—which is driven by the potential—is in a state of dynamic equilibrium. We address this further in Sect. 3.2.2. The potential relevant to the systems that we deal with in this chapter is the gravitational potential. Then for a particle to be bound to the system, its potential energy at location X = x is negative, i.e. a test particle with unit gravitational mass bears a negative potential energy at any location inside the system. A particle with positive potential energy is not bound to the system, i.e. its motion is not dictated by the potential of the considered system.
We do not deal with such particles. So potential Φ : X −→ R<0.

The standard result for the potential at a radial distance r from the centre of the galaxy, in terms of the gravitational mass density, is:

$$\Phi(r) \;=\; -\frac{4\pi G}{r}\int_0^{r} dr'\, r'^{2}\,\rho(r') \;-\; 4\pi G\int_r^{\infty} dr'\, r'\,\rho(r'). \qquad (3.9)$$


In light of the vectorised gravitational mass density, this equation is discretised to the following:

$$\text{For } r < r_0: \quad \Phi(r) \;=\; -4\pi G \sum_{q=1}^{N_X} \rho_q\, \frac{(r_0+q\delta_R)^2 - (r_0+(q-1)\delta_R)^2}{2};$$

$$\text{For } r_0 \le r \le r_{max}: \quad \Phi(r) \;=\; -4\pi G \Bigg[ \sum_{q=1}^{m-1} \frac{\rho_q}{r}\, \frac{(r_0+q\delta_R)^3 - (r_0+(q-1)\delta_R)^3}{3} \;+\; \rho_m \left( -\frac{r^2}{6} - \frac{(r_0+(m-1)\delta_R)^3}{3r} + \frac{(r_0+m\delta_R)^2}{2} \right) \;+\; \sum_{q=m+1}^{N_X} \rho_q\, \frac{(r_0+q\delta_R)^2 - (r_0+(q-1)\delta_R)^2}{2} \Bigg],$$

where r is identified to lie inside the m-th R-bin;

$$\text{For } r > r_{max}: \quad \Phi(r) \;=\; -\frac{4\pi G}{r} \sum_{q=1}^{N_X} \rho_q\, \frac{(r_0+q\delta_R)^3 - (r_0+(q-1)\delta_R)^3}{3}. \qquad (3.10)$$
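This discretisation can be sketched in code; the function name, bin-edge construction, and uniform-density check below are our own conventions, not the author's:

```python
import numpy as np

def potential(r, rho, r0, dR, G=1.0):
    """Discretised spherical potential of Eq. 3.10, for a density vector rho
    that is constant over each R-bin with edges r0 + q*dR, q = 0, ..., N_X."""
    NX = len(rho)
    edges = r0 + dR * np.arange(NX + 1)
    rmax = edges[-1]
    sq = 0.5 * (edges[1:]**2 - edges[:-1]**2)    # per-bin integrals of r' dr'
    cb = (edges[1:]**3 - edges[:-1]**3) / 3.0    # per-bin integrals of r'^2 dr'
    if r < r0:                                   # all binned mass lies outside r
        return -4.0 * np.pi * G * np.sum(rho * sq)
    if r > rmax:                                 # all binned mass lies inside r
        return -4.0 * np.pi * G * np.sum(rho * cb) / r
    m = min(int((r - r0) // dR), NX - 1)         # index of the R-bin holding r
    inner = np.sum(rho[:m] * cb[:m]) / r                      # full interior shells
    part = rho[m] * (-r**2 / 6.0                              # partially interior /
                     - edges[m]**3 / (3.0 * r)                # exterior m-th bin
                     + edges[m + 1]**2 / 2.0)
    outer = np.sum(rho[m + 1:] * sq[m + 1:])                  # full exterior shells
    return -4.0 * np.pi * G * (inner + part + outer)
```

For a constant density with r0 = 0, this reproduces the uniform-sphere result Φ(r) = −4πGρ(r_max²/2 − r²/6), which serves as a quick consistency check on the bin bookkeeping.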

3.2.16 Model in Light of Vectorised Potential and Phase Space Pdf

To use the vectorised phase space pdf in Eq. 3.4, we perform the integral over the unobservables V1, V2, X3 in individual energy partitions, i.e. over individual ε-bins, and then sum over such integrals corresponding to all the energy partitions. Thus, the projected phase space pdf is

$$\nu(r_p^{(k)}, v_3^{(k)}) \;=\; \sum_{j=1}^{N_\varepsilon} f_j \left[\; \int_{x_3 = x_3^{(low,k,\rho)}}^{x_3^{(up,k,\rho)}} \int_{v_\mu = v_\mu^{(low,k,\rho)}}^{v_\mu^{(up,k,\rho)}} v_\mu \; dv_\mu \, dx_3 \right]^{(j)}, \qquad (3.11)$$

∀ k = 1, . . . , Ndata. The limits of these integrals will in general include a dependence on the gravitational mass density vector ρ (as well as on the index k of the datum at which the projected phase space density is computed). So the integrals on the RHS are denoted to include ρ and k. The domain of the phase space pdf is partitioned into the ε-bins, and the integrals on the RHS of Eq. 3.4, undertaken for each such partition, then find the phase space pdf that is constant over this partition, to be brought outside the integral. Thus the integral over the unobservables in a given ε-bin corresponds to the volume that these unobservables occupy in this ε-bin. This allows for identification of the limits of these integrals. In this endeavour, the most used tool will be the relationship between the energy, the kinetic energy and the gravitational potential (which is linked to the gravitational mass density using the Poisson Equation—the explicit form of which we will discuss soon). We recall that the energy

$$\varepsilon \;=\; \Phi(R) + V^2/2 \;\equiv\; \Phi\!\left(\sqrt{R_p^2 + X_3^2}\right) + (V_\mu^2 + V_3^2)/2.$$

So at a given x3, and observed Rp (= rp^{(k)}, say) and V3 (= v3^{(k)}, say),

$$v_\mu^2 \;=\; 2\varepsilon - 2\Phi\!\left(\sqrt{(r_p^{(k)})^2 + x_3^2}\right) - (v_3^{(k)})^2.$$

We wish to compute the highest and lowest values that vμ will attain in the j-th ε-bin, ∀j = 1, . . . , Nε. In this j-th energy partition, at the k-th data point, the highest value of energy is Φ(0) + jδε, s.t. the highest value of vμ is:

$$v_\mu^{(up,k,\rho)} \;=\; \sqrt{\,2(\Phi(0)+j\delta_\varepsilon) - 2\Phi\!\left(\sqrt{(r_p^{(k)})^2 + x_3^2}\right) - (v_3^{(k)})^2\,}, \qquad (3.12)$$

and the lowest value of vμ in any energy partition, ∀k = 1, . . . , Ndata, is v_μ^{(low,k,ρ)} = 0. Now the innermost integral on the RHS of Eq. 3.11 (at the k-th datum and j-th energy partition and a given x3) is the integral of the integrand vμ, with respect to vμ, between the limits v_μ^{(low,k,ρ)} and v_μ^{(up,k,ρ)}. Then this integral offers

$$\left(v_\mu^{(up,k,\rho)}\right)^2\!/2 \;-\; 0 \;=\; (\Phi(0)+j\delta_\varepsilon) - \Phi\!\left(\sqrt{(r_p^{(k)})^2 + x_3^2}\right) - (v_3^{(k)})^2/2,$$

using Eq. 3.12. Once the integral over vμ is completed at the k-th datum and j-th energy partition, we proceed to undertake the integral over x3, at the same energy partition and datum. To undertake this integral, we appreciate that the integrand is symmetric about x3 = 0. So the integral is doubled, with us setting

$$x_3^{(low,k,\rho)} = 0, \quad \forall k = 1, \ldots, N_{data}; \; j = 1, \ldots, N_\varepsilon.$$

Also, the maximal value x_3^{(up,k,ρ)} of X3 at this j and k is

$$x_3^{(up,k,\rho)} \;=\; \sqrt{r_\star^2 - (r_p^{(k)})^2},$$


where r⋆ is the root of the r-dependent equation:

$$(\Phi(0) + j\delta_\varepsilon) - \Phi(r) - (v_3^{(k)})^2/2 \;=\; 0,$$

where Φ(r) is computed from ρ(r) using the Poisson Equation in general, and in the case of the vectorised gravitational mass density, using Eq. 3.10. Then inserting these limits into the integrals on the RHS of Eq. 3.11, we compute the projected phase space pdf.
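The computation of these limits can be sketched as follows, with a simple bisection stand-in for a root finder; `Phi` is any callable potential, and all names here are illustrative rather than the author's:

```python
import numpy as np

def bisect(h, a, b, n=200):
    """Locate a root of h in [a, b], assuming h(a) and h(b) differ in sign."""
    fa = h(a)
    for _ in range(n):
        m = 0.5 * (a + b)
        if h(m) * fa > 0.0:
            a = m
        else:
            b = m
    return 0.5 * (a + b)

def integration_limits(j, d_eps, rp_k, v3_k, Phi, r_hi):
    """x3 and v_mu limits for the j-th energy bin at the k-th datum
    (Eq. 3.12, plus the root condition on r for the x3 limit)."""
    e_top = Phi(0.0) + j * d_eps              # highest energy in the j-th bin
    h = lambda r: e_top - Phi(r) - 0.5 * v3_k**2
    r_star = bisect(h, rp_k, r_hi)            # turning radius; assumes h(rp_k) > 0 > h(r_hi)
    x3_up = np.sqrt(max(r_star**2 - rp_k**2, 0.0))
    def v_mu_up(x3):                          # Eq. 3.12, clipped at zero
        arg = 2.0 * e_top - 2.0 * Phi(np.sqrt(rp_k**2 + x3**2)) - v3_k**2
        return np.sqrt(max(arg, 0.0))
    return x3_up, v_mu_up
```

With the purely illustrative potential Φ(r) = −2 + r², jδε = 1, rp = 0.3 and v3 = 0.4, the turning radius is r⋆ = √0.92, giving x3^(up) = √0.83.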

3.2.17 Convolving with Error Density

In this section, we clarify the convolution of the projected phase space pdf with the density g(v3; 0, δv3²) of the measurement error in V3, where the parameters of this error density—which we recall is Normal—are the zero mean and the variance δv3² of this measurement error. (As stated above, measurement error in Rp is considered negligible, compared to that in V3.) We model the error density in V3 to be a Normal:

$$g(v_3; 0, \delta v_3^2) \;=\; \frac{1}{\sqrt{2\pi\,\delta v_3^2}}\, \exp\!\left(\frac{-v_3^2}{2\,\delta v_3^2}\right).$$

The convolution of the projected phase space pdf and this error density is

$$\nu(r_p^{(k)}, v_3^{(k)}) * g(v_3^{(k)}; 0, \delta v_3^2) \;=\; \int_{z = v_3^{(k)} - \sqrt{-2\Phi(0)}}^{v_3^{(k)} + \sqrt{-2\Phi(0)}} \nu(r_p^{(k)}, z)\; g(v_3^{(k)} - z;\, 0, \delta v_3^2)\, dz,$$

and computing this for each of the Ndata data points in D, at each iteration of the MCMC chain, is time-consuming. We approximate this integral by a closed-form expression that results when we replace ν(rp^{(k)}, z) by its value computed at the centroid of the interval [v3^{(k)} − √(−2Φ(0)), v3^{(k)} + √(−2Φ(0))] that z takes values in, as relevant to the integral in the last equation. Then the convolution of the projected phase space density and the error density is approximately

$$(\nu * g)(r_p^{(k)}, v_3^{(k)}, \delta v_3) \;\approx\; \nu(r_p^{(k)}, v_3^{(k)}) \int_{z = v_3^{(k)} - \sqrt{-2\Phi(0)}}^{v_3^{(k)} + \sqrt{-2\Phi(0)}} g(v_3^{(k)} - z;\, 0, \delta v_3^2)\, dz$$
$$=\; \frac{\nu(r_p^{(k)}, v_3^{(k)})}{2} \left[\, \mathrm{erfc}\!\left(\frac{v_3^{(k)} - \sqrt{-2\Phi(0)}}{\sqrt{2}\,\delta v_3}\right) - \mathrm{erfc}\!\left(\frac{v_3^{(k)} + \sqrt{-2\Phi(0)}}{\sqrt{2}\,\delta v_3}\right) \right]. \qquad (3.13)$$

Here “erfc” is the complementary error function.
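A sketch of the closed form of Eq. 3.13 (the function name and scalar interface are our assumptions); the erfc bracket equals the Normal mass that the error density assigns to the integration interval, which can be checked by quadrature:

```python
import numpy as np
from math import erfc, sqrt

def convolved_density(nu_k, v3_k, dv3, Phi0):
    """Closed-form approximation (Eq. 3.13) to the error-convolved,
    projected phase-space pdf at the k-th datum; requires Phi0 = Phi(0) < 0."""
    c = sqrt(-2.0 * Phi0)            # speed scale set by the central potential
    s = sqrt(2.0) * dv3
    return 0.5 * nu_k * (erfc((v3_k - c) / s) - erfc((v3_k + c) / s))
```

For constant ν the centroid approximation is exact, so the value above can be compared directly against a numerical quadrature of the Normal error density.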

3.2.18 Priors

As we have motivated in Sect. 3.2.13, we do not have strong priors on the f1, . . . , fNε parameters, since there is no reliable information available for the particular galaxy and particular particle type that we consider here. So we allow the influence of the data on their inference to be relatively dominant. In other words, parametric priors are likely to be misrepresentative, and this is dictated by the Physics of the situation. So this motivates usage of weak priors on the components of the f vector. We employ a Normal prior on fj, with the mean of this Normal set to coincide with the initial or seed value of fj with which the chain is initiated. The variance of this Normal is retained as ≥ 5 times the proposal variance for fj, as used within our MCMC implementation. Here j = 1, . . . , Nε. In lieu of information on the phase space pdf, we use a seed value that is the same for all components of the vectorised phase space pdf, i.e. we initiate the chain with a uniform phase space density. Again, there is no prior information on the gravitational mass density in the galaxy under consideration, though multiple parametric galactic gravitational mass density functions are available in the literature. We use one such generic model—the NFW model [26]—to provide the seed value of ρi, and the prior on ρi is then a Normal with this seed value as its mean, and a variance given as ≥ 5 times the variance of the proposal of ρi, i = 1, . . . , NX. Of course the parameters of the NFW model may not be known for a given galaxy, s.t. prior mean and seed values for a component of the vectorised gravitational mass density function are not always known. We then use arbitrarily chosen values for these parameters.
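These priors can be sketched as below; the NFW parameter values `rho_s` and `r_s` are the arbitrary placeholders mentioned above, and `log_normal_prior` encodes the "variance ≥ 5 × proposal variance" rule (all names are ours):

```python
import numpy as np

def nfw_density(r, rho_s=1.0, r_s=10.0):
    """NFW profile, used only to seed the rho chain and centre its priors;
    rho_s and r_s are arbitrary when the galaxy's NFW fit is unknown."""
    x = r / r_s
    return rho_s / (x * (1.0 + x)**2)

def log_normal_prior(theta, seed, proposal_var, inflation=5.0):
    """Weak Normal log-prior: mean at the chain's seed value, and
    variance kept at >= 5 times the MCMC proposal variance."""
    var = inflation * proposal_var
    return -0.5 * (theta - seed)**2 / var - 0.5 * np.log(2.0 * np.pi * var)
```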

3.2.19 Inference on f and ρ

The support of the sought phase space pdf is energy, and therefore partly the potential function, which is our other unknown—except that we learn the gravitational mass density function and retrieve the potential from it at every iteration, using the Poisson Equation. In this sense, the learning of the phase space pdf is nested within the learning of the potential. In light of this, we perform MCMC-based inference on the sought parameters—which are the components of the ρ and f vectors in our vectorised paradigm—with the updating of the ρ1, . . . , ρNX parameters done within the first block of any iteration, followed by the updating of the f1, . . . , fNε parameters at the updated {ρi}, i = 1, . . . , NX, in the second block of that iteration. In other words, we will be undertaking inference using Metropolis-within-Gibbs, in which the state space vector (ρ1, . . . , ρNX, f1, . . . , fNε)^T is partitioned, and each partition is updated sequentially within any iteration, as within Gibbs sampling, while updating of individual parameters within a partition is undertaken using Metropolis-Hastings (see Appendix A.1.3). In the vectorised paradigm that we undertake, this nested nature of the learning strategy implies that as we update ρ, uncertainty in its learning translates to uncertainty in the support of the phase space pdf. This is manifest in uncertainty in identification of the ε-bin that fj belongs to. We realise that such error in the allocation of a value to the j-th component of f—for any j ∈ {1, . . . , Nε}—will persist, as long as there is variation in ρ1, . . . , ρNX. Such variation is of course present all through the length of the MCMC chain, even in the post-burnin stage. So we conduct our inference such that at the end of the first chain that is run, convergence is diagnosed by examination of the trace of each of the ρ-parameters; traces of the components of f do not necessarily exhibit convergence. Subsequently, the second chain is run, in which the ρ parameters that are the output from the first chain are fed in as the seeds of the corresponding parameters. The proposal variance for ρi could be reduced in this second chain, s.t. the error committed in identifying the bounds of any of the Nε energy partitions is reduced, compared to that in the first chain. Traces of f1, . . . , fNε are then examined for convergence (as are the traces of the ρ parameters).
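The two-block scheme can be sketched as follows, with a stand-in `log_post`; in the actual application this would evaluate the likelihood built from the error-convolved projected pdf, together with the priors:

```python
import numpy as np

rng = np.random.default_rng(0)

def mwg(log_post, rho0, f0, n_iter=2000, prop_rho=0.3, prop_f=0.3):
    """Metropolis-within-Gibbs (schematic): block 1 updates the rho-parameters
    component-wise; block 2 updates the f-parameters at the freshly updated rho."""
    rho, f = rho0.copy(), f0.copy()
    chain = np.empty((n_iter, len(rho0) + len(f0)))
    lp = log_post(rho, f)
    for t in range(n_iter):
        for i in range(len(rho)):            # block 1: rho-parameters
            prop = rho.copy()
            prop[i] += prop_rho * rng.standard_normal()
            lp_new = log_post(prop, f)
            if np.log(rng.uniform()) < lp_new - lp:
                rho, lp = prop, lp_new
        for j in range(len(f)):              # block 2: f-parameters at updated rho
            prop = f.copy()
            prop[j] += prop_f * rng.standard_normal()
            lp_new = log_post(rho, prop)
            if np.log(rng.uniform()) < lp_new - lp:
                f, lp = prop, lp_new
        chain[t] = np.concatenate([rho, f])
    return chain
```

Run on a simple separable Gaussian target, this mixes quickly; in the application, the monotonicity constraint on ρ would additionally be enforced at the proposal stage.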

3.2.20 How Many Energy Partitions? How Many Radial Partitions?

The learning strategy that we describe here may appear to be sensitive to the binning-related details of the relevant ranges of values of: the (L2) norm R of the position vector X; and the energy variable ε. Driven by such a hunch, the suggestion might be to learn the number NX of the R-bins and the number Nε of the ε-bins, i.e. that NX and Nε be retained as (integer-valued) variables within our inference scheme, and be learnt from the data. However, such would imply that the numbers of components in the sought vectors ρ and f vary across iterations, in the MCMC-based inferential scheme that we adopt. In other words, the dimensionality of the state space over which the MCMC chain is designed is then going to vary across iterations. Inference is then possible—using algorithms such as Reversible Jump MCMC [11]; however, inference is then rendered rather cumbersome compared to MCMC that is run with the number of unknown parameters the same across all iterations. So we approach the data to see if details of the binning strategy can be gleaned from the data or not. This very much appears possible in the given problem.

3.2.20.1 Binning the Radial Range

We use the empirical sample of values of Rp to construct a histogram. We wish to construct the histogram under the criterion that no bin be left bereft of a data point, s.t.
1. every bin in this histogram bears non-zero empirical information, and
2. at the same time, we wish to approximate the gravitational mass density function ρ(R) with its vectorised form as best as possible, i.e. we wish to maximise the number of bins that the relevant range of Rp is partitioned into.

The lowest and highest values (r0 and rmax respectively) of Rp that are used to compute this frequency distribution are obtained directly from the data D. Then the number of bins that the observed sample of Rp can be partitioned into, is obtained by (uniformly) partitioning the [r0, rmax) interval into the maximum number of bins that allow for the 1st criterion above to be satisfied. In general, it is not going to be possible to achieve this without relaxing the right hand edge of the interval of sampled values of Rp. These are the positions at which fewer observations are made than at the more central positions in the galaxy. Thus, it would be preferred to omit the very few particles that exist near the outer edge of the observed values of Rp, in favour of a finer radial binning. The consequence is that inference on the gravitational mass density is made over a slightly less extended radial range than is maximally possible, given a data set. For example, in a data set that is provided for the real galaxy that we work with, the outermost observation is recorded at about 35 kpc, while we make inference on particles that live in the interval [2.2, 33] kpc. This permits a bin width of 1.1 kpc that ensures that every bin is inclusive of at least one data point. As a result, we work with 29 R-bins. It is to be noted that the binning in R is motivated as the binning of the empirically observed Rp, where R = r ≥ rp (from R² = Rp² + X3²).
That does not mean that we are conditioning our ρ-parameters to be over-estimated (recall that the gravitational mass density is non-increasing in R, so ρ(√(r_p² + x_3²)) ≤ ρ(r_p)). All that we have here is a data-driven means to construct the R-bins over which the vectorised version of this density is erected. Radial positioning of empirical inputs, in terms of values of R_p and V_3, is not relevant to our learning of the ρ-parameters. Hence the choice of binning described above does not bias the learning, as long as the model assumptions (of phase space isotropy and a central potential) hold in the data D. We simply choose to learn the gravitational mass density at a radial value of R = r_p, instead of at R = r, where r and r_p might not be equal. Also, the only correlation between the gravitational mass density function over one R-bin and the next exists to inform on the monotonicity of this function, i.e. ρ_i is related to ρ_{i−1} only via the constraint that the variable ρ_i ∈ [0, ρ_{i−1}], ∀i = 2, ..., N_X, and ρ_1 ≥ 0. Within this stipulated interval, the random variable ρ_i can attain any value, as governed by the inference scheme we use.

The values of ρ_1,...,ρ_{N_X} that are bespoke learnt at the respective R-bins in this part of the learning will subsequently be employed as training data, input to the learning of the stochastic process that generates the gravitational mass density function.

So the construct that gives rise to the R-bins can be considered misrepresentative if the correlation structure of the underlying stochastic process is found to be distinct when the binning is changed.

3.2.20.2 Binning the Energy Range

We can take a cue from the histogram-based construct, to suggest an ansatz for the way the empirical data can be invoked to bin the relevant range of values of the energy random variable ε, on which the vectorised phase space pdf is erected. Had we been in possession of an empirical (proxy for the) energy variable, we could have used its frequency distribution to gauge the applicability of a binning choice for the energy variable. Such a proxy for ε can indeed be extracted from the empirically-observed sample of R_p and V_3 values. We use the histogram constructed of the sampled R_p, discussed above, as a scaled proxy of the gravitational mass density, to obtain the gravitational potential at each observed R_p; we refer to this as an "empirical potential". Then the sum of this gravitational potential at an R_p value, and the v_3²/2 computed using the V_3 value measured at this same R_p, provides an empirical proxy for ε at each observed R_p. We choose N_ε s.t. there is at least one datum in each ε-bin. It is however important to remember that the potential that enters the definition of this proxy for ε is not going to concur with that obtained at the end of a given number of iterations of the MCMC chain. So the energy computed with a freshly updated potential (i.e. with a freshly updated gravitational mass density vector) will be different from that computed using another potential. At the same time, it needs to be appreciated that the proxy for ε that we compute using the "empirical potential" is computed using a kinetic energy that is given by the observed V_3²/2, not V²/2. So we do expect the energy distribution computed in any iteration of the MCMC chain to be different from that computed for the proxy for ε. A scaled version of the latter distribution is depicted in black in the bottom right panel of Fig. 3.2, while the scaled phase space pdf learnt from an MCMC chain is overlaid in red.
Thus it appears impossible to ensure, via a choice of N_ε made prior to the initiation of the chain, that every ε-bin remains populated by at least one datum. However, this can be checked within every iteration of the MCMC chain,


after the chain is initiated with a chosen number of ε-bins, where this chosen N_ε is driven by the histogram of ε that we compute using the "empirical potential" and half the squared observed V_3. This is indeed what we undertake. Though ideally we would like to proceed with as large a value of N_ε (and N_X) as possible, it is also to be acknowledged that the algorithm takes longer to run as N_ε increases. In light of this, we suggest a compromise value for N_ε.
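Under the assumption of spherical symmetry, the "empirical potential" entering this energy proxy can be computed shell-by-shell from a binned density. The sketch below is ours (function names and bin-handling conventions are illustrative); it uses the standard Newtonian result Φ(r) = −G M(<r)/r − 4πG ∫_r^{r_max} ρ(s) s ds for a spherical density.

```python
import numpy as np

G = 4.301e-6  # Newton's constant in kpc (km/s)^2 per solar mass

def potential_on_grid(r_edges, rho):
    """Potential at the bin centroids of a spherical density histogram:
    Phi(r) = -G*M(<r)/r - 4*pi*G * int_r^{rmax} rho(s) s ds,
    with both integrals evaluated shell-by-shell over the bins."""
    r_lo, r_hi = r_edges[:-1], r_edges[1:]
    shell_mass = 4 * np.pi / 3 * rho * (r_hi**3 - r_lo**3)
    r_c = 0.5 * (r_lo + r_hi)
    phi = np.empty_like(r_c)
    for k, r in enumerate(r_c):
        # mass enclosed within r: full shells below, plus part of shell k
        m_in = shell_mass[:k].sum() + 4 * np.pi / 3 * rho[k] * (r**3 - r_lo[k]**3)
        # int_r^{rmax} rho(s) s ds over the remainder of shell k and outer shells
        outer = 4 * np.pi * rho[k] * 0.5 * (r_hi[k]**2 - r**2) \
              + np.sum(4 * np.pi * rho[k + 1:] * 0.5 * (r_hi[k + 1:]**2 - r_lo[k + 1:]**2))
        phi[k] = -G * m_in / r - G * outer
    return r_c, phi

def energy_proxy(rp, v3, r_edges, rho):
    """Empirical proxy for the particle energy: Phi at the observed R_p
    (taken at the centroid of the bin containing it) plus v3^2/2."""
    r_c, phi = potential_on_grid(r_edges, rho)
    idx = np.clip(np.searchsorted(r_edges, rp) - 1, 0, len(r_c) - 1)
    return phi[idx] + 0.5 * v3**2

# sanity check against the analytic uniform-sphere potential,
# Phi(r) = -2*pi*G*rho*(R^2 - r^2/3)
r_edges = np.linspace(0.0, 10.0, 6)
r_c, phi = potential_on_grid(r_edges, np.full(5, 2.0))
```

For a piecewise-constant density the shell sums are exact, so the uniform-sphere case reproduces the textbook formula to machine precision; a histogram of `energy_proxy` over the data then suggests a workable N_ε.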

3.2.21 Inference Using MCMC

We employ Metropolis-within-Gibbs (Appendix A.1.5) to conduct inference on the ρ_1,...,ρ_{N_X} and f_1,...,f_{N_ε} parameters. In each iteration, we first update ρ_1,...,ρ_{N_X} using Metropolis-Hastings, starting with ρ_{N_X}. Then ρ_i is proposed in the t-th iteration from a Truncated Normal density that is:

• left truncated at the proposed value ρ_{i+1}^{(*,t)} of ρ_{i+1}, for i = N_X − 1, ..., 1, and left truncated at 0 if i = N_X,
• the mean of which is the current value ρ_i^{(t−1)} of ρ_i, for i = 1,...,N_X, and
• the variance of which is

σ²_{ρ_i} = s_ρ² × (ρ_i^{(0)})²,

where ρ_i^{(0)} is the seed or initial value of the ρ_i parameter, with which the chain initiates. s_ρ is a scale parameter with which ρ_i^{(0)} is scaled, ∀i = 1,...,N_X, s.t. the seed value of ρ_i, scaled by s_ρ, gives the standard deviation of the proposal density of this ρ-parameter. Thus, the proposed value of ρ_i in the t-th iteration is

ρ_i^{(*,t)} ~ TN(ρ_i^{(t−1)}, σ²_{ρ_i}, ρ_{i+1}^{(*,t)}), i = N_X − 1, ..., 1;
ρ_{N_X}^{(*,t)} ~ TN(ρ_{N_X}^{(t−1)}, σ²_{ρ_{N_X}}, 0).

The f vector parameter at which this updating of ρ is conducted, in the first block of the t-th iteration, is the f vector that was updated in the (second block of the) (t−1)-th iteration, i.e. f^{(t−1)}. Here the index for iteration is t, which takes values t = 0, 1, ..., N_iter.

3.2 Methodology


Once ρ_i is updated in the t-th iteration, ∀i = 1,...,N_X, we update f_1,...,f_{N_ε} at this updated ρ^{(t)}, in the second block of updating within the t-th iteration. Thus the MCMC algorithm that we employ is Metropolis-within-Gibbs.

Now, the proposal density for f_j in this iteration is a truncated Normal that is truncated to the left by 0, the mean of which is f_j^{(t−1)}, and the variance of which is

σ²_{f_j} = s_f f_j^{(0)},

where f_j^{(0)} is the seed for f_j that we initiate this chain with. Again, s_f is a scale that is chosen experimentally, for all f-parameters. Thus, the proposed value of f_j in the t-th iteration is

f_j^{(*,t)} ~ TN(f_j^{(t−1)}, σ²_{f_j}, 0).

Here j = 2, ..., N_ε, as described following Eq. 3.8, with f_1 ensuring that the phase space pdf integrates over all energies to 1. As we do not learn the global scale Ψ on f_j, j = 1,...,N_ε, this of course means that we do not know the value of f_1 in any iteration. So we display results of the MCMC-learnt values of f_2,...,f_{N_ε} only. Truncated Normal priors are employed on ρ_i and f_j, ∀i = 1,...,N_X; j = 2,...,N_ε, where such a prior density is left truncated at 0, has a mean that is the seed value of the respective parameter, and a prior variance that is implemented as between 5 and 50 times the proposal variance used for the parameter in question.

3.2.22 Wrapping Up Methodology

Before we move on to discuss the results of an implementation of this method, we want to summarise the material delineated above. We appreciate that this is a difficult problem, where the difficulty is primarily triggered by the multiple strands of information deficiency that characterise it. We aim to learn the content and distribution of gravitational mass in a real galaxy. Firstly, we realise that in lieu of direct measurement of such a distribution (embodied within the potential function of this system), we need to invoke an effect of the gravitational field of the system. This is identified as the motion of particles that live inside the galaxy. It is indeed a feat of modern astronomical instrumentation and the relevant technologies that we are able to track individual galactic particles for their velocities and locations, though only partially: the system is such that half the phase space coordinates are fundamentally unobservable. This is one important kind of information deficiency in this problem.


3 Bespoke learning of density

We want to learn the gravitational mass density and the pdf of the location (X) and velocity (V) vectors, given data D = {(X_1, X_2, V_3, δV_3)}_{i=1}^{N_data} ≡ {(R_p, V_3, δV_3)}_{i=1}^{N_data}, where R_p = √(X_1² + X_2²) and the observed error in V_3 is δV_3, which is modelled as the standard deviation of the error distribution of V_3. Thus X_3, V_1, V_2 are unobservables.

We assume centrality of the potential, i.e. model the potential as Φ(R), R := √(X_1² + X_2² + X_3²). Potential and gravitational mass density are deterministically linked via the Poisson equation.

The total time derivative of the phase space pdf is zero, so the phase space pdf is a function of integrals of motion. We assume the phase space pdf to be an isotropic function of X and V, implying that the phase space pdf is a function of the energy ε (of a galactic particle).

Energy is the sum of potential and kinetic energy ((V_1² + V_2² + V_3²)/2). So Φ(R) is now embedded in the support of the phase space pdf. This pdf, marginalised over the unobservables, is the marginal ν(R_p, V_3) of the observables R_p, V_3. ν(·,·) is convolved with the error density in V_3; the product of this convolution, computed at each data point, gives the likelihood. So the mass density is ρ(R) and the phase space pdf is f_ε(ε). Training data on either function is absent, so we vectorise the functions into vectors ρ and f, and learn the vectors given data D. Likelihood computation is revisited in the vectorised paradigm.

Invoke priors on each component of ρ and f. The joint posterior pdf of all components is computed given D, with normalisation of the posterior computed at each iteration. Posterior samples are generated using Metropolis-within-Gibbs, leading to the marginal posterior of each unknown, and 95% HPDs are computed. This results in {(r_i, ρ_i)}_{i=1}^{N_X} and {(ε_j, f_j)}_{j=2}^{N_ε}.

Using these 2 training data sets, learn the ρ(·) and f(·) functions. Predict the gravitational mass density at any radial location in the galaxy, and the phase space pdf at any energy.

Compute the gravitational mass enclosed within a given radius.

Fig. 3.1 Flow chart to clarify the learning scheme used in the work

However, there exists no training data to permit supervised learning of the sought potential function; this owes to the lack of information on the gravitational mass density at any given location within the given galaxy. Generic simulated galactic models do not inform on the gravitational mass density in the real system that we consider here; had such models existed, we would not in fact have needed to learn the gravitational mass density in the considered galaxy. So we need to construct a new methodology that allows for the bespoke learning of the gravitational mass density ρ(R), given the data on the partially-observed phase space vector. A succinct description of the methodology is available in Fig. 3.1. Centrality of the potential, and therefore the R-dependence of the gravitational mass density function via the Poisson equation, is motivated in light of the system being gravitationally bound. (Here R is the Euclidean norm of the position vector.) Lack of inter-stellar interactions is elicited from the galactic dynamical literature, and this helps found the dissipationless model for the potential, thus advancing a velocity-independent potential. So, to compensate for the lack of information, we resort to excavating for further information about the system, and this emerges in the form of the temporal evolution of the pdf of the phase space vector. In general, such information is embodied in the relevant kinetic equation [12]. In the considered system, the total time derivative of the phase space pdf is informed upon by the Boltzmann kinetic equation, as would be expected, and it is the Collisionless Boltzmann Equation that is motivated, in light of the lack of inter-particle collisions in the galaxy. This implies the recasting of the support of the phase space pdf in terms of integrals of motion, of which we choose to exclude all but the energy. This is to avoid the model being rendered intractable, while reassurance is provided by testing the assumption of phase space isotropy—which implies, and


is implied by, a phase space pdf that is solely energy-dependent. Learning using models that violate isotropy is typically infeasible (owing to the ensuing intractability of the model), but that of course does not by itself endorse the usage of isotropy (i.e. sole energy-dependence) in our model of the phase space pdf. We resort to isotropy since we still want to undertake learning; therefore, testing for an isotropic model in the data is required. As the potential (and thereby ρ(R)) is part of the energy ε, this construct introduces the unknown potential (or equivalently, the unknown ρ(R)) into the support of the phase space pdf f_W(ε). Since only half the components of the phase space vector are observable, we project f_W(ε) onto the space of the observables, to achieve the projected phase space pdf, i.e. the marginal pdf of the observables, computed at each value of the observable phase space coordinates. Then the likelihood is defined using this marginal of the observables; except, f_W(ε) is itself an unknown. So we need to learn ρ(R), as well as f_W(ε). However, training data sets comprising pairs of values of (R, ρ(R)), and of (ε, f_W(ε)), are missing, and this compels us to seek the vectorised forms of the sought functions. Thus we learn each component of the vector ρ, which is the vectorised version of ρ(R), and each component of the vector f, which is the vectorised version of f_W(ε). The constraints of positivity on each such component, and the constraint of monotonicity on the components of the ρ vector, are imposed via the MCMC inference that we employ in this work. In light of these discretised versions of the sought functions, we rewrite the projection integral that marginalises f_W(ε) over the unobservable phase space variables. The crux of computing this integral is the identification of the volume that any energy partition occupies in the space of the unobservable phase space coordinates. We undertake this integral; convolution of the marginal pdf of the observables with the density of measurement errors then provides the likelihood. We employ uninformative priors on the components of f, and parametric models elicited from the literature to provide the means of the Normal priors used on the components of ρ. The likelihood is then used along with such priors in Bayes rule, to attain the posterior; except, normalisation of the posterior needs to be addressed in this work, as it is model-dependent, i.e. dependent on the forms of the potential function and the phase space pdf. Fortunately, this normalisation is not intractable within the assumption of phase space isotropy, though it needs to be computed at every update of the ρ and f vector-valued random variables. Thus the posterior is defined, and sampled from using Metropolis-within-Gibbs, to yield the marginal posterior distribution of each component of each of ρ and f. The marginal posterior on any such component then allows for identification of the 95% HPDs on this learnt parameter, given the data at hand. Once each component of ρ is learnt, it implies that the originally-absent training data of (r_i, ρ_i) pairs is now available; i = 1,...,N_X. So we can then undertake supervised learning of the ρ(R) function. Similarly, upon the learning of each component of f, we can learn the phase space pdf f_W(ε). The first component f_1 of the f vector is not updated, but it is bound by the constraint that the phase space pdf has to abide by in every update, namely that,


being a pdf, the phase space pdf f_W(ε) must integrate over all possible energy values to unity, i.e.

∫_{ε=Φ(0)}^{0} f_W(ε) dε = 1,

for any given Φ(·). Given that we are working with the vectorised version of the phase space pdf, the Riemann-sum (trapezoidal) representation of this integral yields:

[(−Φ(0))/N_ε] [f_1/2 + f_{N_ε}/2 + (f_2 + ... + f_{N_ε−1})] Ψ = 1.

Then, with the components f_2,...,f_{N_ε} updated, f_1 is computed as non-negative in every iteration.
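Solving the trapezoidal constraint above for f_1, given the freshly updated f_2,...,f_{N_ε}, amounts to the following sketch (the function name and the clipping at 0 are ours):

```python
import numpy as np

def f1_from_normalisation(f_rest, phi0, n_eps, psi=1.0):
    """Solve the trapezoidal normalisation constraint
        (-phi0)/n_eps * psi * (f1/2 + f_Neps/2 + sum(f_2..f_{Neps-1})) = 1
    for f1, given f_rest = [f_2, ..., f_Neps]; clip at 0 so the pdf value
    stays non-negative. `phi0` is Phi(0) (negative), `psi` the global scale."""
    width = (-phi0) / n_eps                    # uniform epsilon-bin width
    partial = f_rest[-1] / 2.0 + np.sum(f_rest[:-1])
    f1 = 2.0 * (1.0 / (width * psi) - partial)
    return max(f1, 0.0)

# e.g. Phi(0) = -4, N_eps = 4 (so unit bin width), f_2 = f_3 = f_4 = 0.2
f1 = f1_from_normalisation(np.array([0.2, 0.2, 0.2]), phi0=-4.0, n_eps=4)
```

Substituting the returned f_1 back into the trapezoidal sum recovers unity, which is the check one would run within each MCMC iteration.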

3.3 Empirical Illustration on Real Galaxy NGC4649

NGC4649 is an elliptical galaxy in the Virgo cluster of galaxies, and is also referred to as M60 [30]. It is about 60 million light years away. The methodology discussed above will be illustrated on two independent data sets, comprising kinematic information on two distinct types of particles that reside in this galaxy. One of these particle types is the "planetary nebula" or PN, which is in fact the end state of certain stars [31]. PNe emit radiation of a particular frequency that is captured within their spectral output; the motion of a PN along the line-of-sight to the observer (V_3) is manifest in the position of the line representing this frequency in the spectrum. The other kind of particle that is tracked for its R_p and V_3 values is the Globular Cluster (GC) [31]. A GC is a compact cluster of stars, and can be tracked for its line-of-sight speed. The data comprising measured values of R_p, V_3 and δV_3 of PNe is referred to as D_PNe; there are 296 data points in the data set that we use in this work. This data is taken from observations performed by Teodorescu et al. [32], and was kindly shared and prepared in its used form by Dr. Kristin Woodley. On the other hand, the data that comprises measurements of attributes of GCs is referred to as D_GC, and there are 115 data points in the data set that is used in this work. This data includes observations undertaken by Bridges et al. [5] and Lee et al. [20], which were collated by Hwang et al. [16]. Again, it was Dr. Kristin Woodley who cleaned and prepared the data for implementation in the learning discussed here. Here δV_3 is used as 20 km s⁻¹ for the V_3 values of the PNe considered, but the error in the V_3 measurement of GCs is distinct for each observed GC. As stated above, the error in the R_p measurement is negligible. The top panels of Fig. 3.2 display the data points of D_GC and D_PNe, via the plot of v_3 against the observed r_p, with the error in v_3 superimposed on the data point as an error bar. The lower left panel of this figure depicts the histograms of the



Fig. 3.2 Top right: Values of V_3 of the 115 GCs in the observed data D_GC, plotted against values of the observed R_p. Error bars represent the observed errors δV_3 in V_3. Top left: Values of V_3 plotted against values of R_p observed of the 296 PNe that comprise data D_PNe, with the error in V_3 superimposed on each data point. Bottom left: Histogram of the observed R_p values of the GCs in data D_GC in solid line, while the histogram of the observed R_p of the PNe is in broken lines. Bottom right: Histogram (in black) of the values of the energy variable computed using the observed values of V_3²/2 of the tracked GCs, and the potential that is computed by inserting the (scaled) histogram of R_p of the observed GCs, as a proxy for the gravitational mass density, into the Poisson equation. In red is shown a similar histogram, except that its computation employs the gravitational mass density learnt in an MCMC chain run with N_ε = 9. The f_2,...,f_9 parameters learnt in this chain using D_GC are plotted in green; the mean values of the f-parameters are joined by a broken green line

measured values of R_p in these data sets. Choices of bin width and radial range relevant to the two constructed histograms are motivated by considerations discussed in Sect. 3.2.20.1. The bottom right panel includes histograms of the energy variable of the observed GCs, with the potential-energy part of it computed in different ways.


• Firstly, there is the histogram, referred to as Histogram 1, that is constructed using the potential computed from the discretised version of the gravitational mass density function, represented by the (scaled) histogram of the R_p values of the GCs (shown in broken lines in the lower left panel of this figure); this scaled histogram of the observed R_p values is input as the approximation for the gravitational mass density in the Poisson equation, to retrieve the potential. Then energy is defined as the sum of this potential at the measured R_p of the k-th GC, and (v_3^(k))²/2, where v_3^(k) is the measured value of V_3 of the k-th observed GC; k = 1,...,115. The histogram of this energy was constructed with N_ε = 9. However, the ad hoc scaling of the gravitational mass density is not a desirable feature of this calculation.
• So we ran an MCMC chain with the data on the 115 GCs, and with N_ε = 9, s.t. upon the convergence of this chain, the learnt vectorised version of the gravitational mass density is input into the Poisson equation, to obtain the potential that enters the definition of the energy value of the k-th GC. The histogram, referred to as Histogram 2, of the energy values of all 115 GCs in the data set D_GC is plotted in the lower right panel of Fig. 3.2. While these histograms represent the frequency distributions of the energy variable of the observed GCs, the vectorised version of the pdf of the phase space variable that is learnt in this chain tells of the galactic phase space pdf learnt using data D_GC. This learnt pdf is overplotted on Histogram 1 and Histogram 2, in the lower right panel of Fig. 3.2. It is to be noted that, the energy variable being different in the construction of the two histograms, we do not expect Histogram 1 and Histogram 2 to concur in the details of the binning of the relevant energy ranges.
Again, the support of the pdf is likely to be different from the definitions of energy relevant to the histograms: for example, Histogram 1 uses values of R_p in place of R to define the potential, while both histograms use V_3 values to compute the kinetic-energy part of the energy variable. The support of the pdf, on the other hand, is definitely the full energy. So the histogram construction of the values of R_p of the observed PNe and GCs suggests that we can work with N_X = 29 R-bins given data D_GC, with r_0 = 3.44 kpc and r_max = 44.04 kpc. With the PNe sample, N_X = 28, with an r_max of 33 kpc and an r_0 of 2.2 kpc. We do expect the GCs in a galaxy to be present at higher radii than PNe, and the data sets that we have access to here reflect this. Again, we use N_ε = 9 given either data set. In the MCMC chains that we run with data D_PNe and D_GC, the proposal density of each parameter is as suggested in the text above. To recap, ρ_i is proposed from a truncated Normal density, the mean of which is the current value; the variance of which is experimentally fixed; and which is left truncated at the proposed value of ρ_{i+1}, for i = N_X − 1, ..., 1. However, ρ_{N_X} is proposed from a truncated Normal density with a left truncation at 0. Again, f_j is proposed from a truncated Normal density that has its mean at the current value of the parameter, an experimentally fixed variance, and a left truncation at 0, ∀j = 2,...,N_ε. The ρ-parameters are updated in the first block of our Metropolis-within-Gibbs based inference. We follow this within


the 2nd block in the same iteration, with the updating of the f-parameters at the recently updated ρ-parameters. We run one initial chain with each of f_2,...,f_{N_ε} set to 0.5, and the initial ρ_i set to the arbitrarily chosen seed density 10¹²/(10 + r_i²)^{3/2}, at the centroid R = r_i of the i-th R-bin; i = 1,...,N_X. During this chain, the variance of the prior density on each parameter is retained as 50 times the proposal variance of that parameter. Upon convergence, the mean of each parameter is used as the seed for that parameter in the subsequent chain, in which the prior variances on the ρ-parameters are tightened to 5 times the proposal variance on that ρ-parameter, while the prior variances on the f-parameters are retained as in the previous chain. Uncertainties in the ρ-parameters imply uncertainties in the support of the pdf of the phase space variables, s.t. there is an uncertainty in the identification of the updated f_j parameter as the value of the pdf in the j-th ε-bin. So, tightening the priors on the ρ-parameters will reduce these uncertainties in the placing of the updated f-parameters across the ε-bins. The ρ-parameters converge more easily given the data than the f-parameters do, i.e. the mean of ρ_i is robust to changes in the prior variances. This way of running the chain allows for convergence of the f-parameters. A third chain is then ultimately run with the mean outputs of the ρ- and f-parameters as seeds, and prior variances set as five times the proposal variance on each of the ρ- and f-parameters. Results comprising learnt values of ρ_1,...,ρ_{N_X} and f_2,...,f_{N_ε} given data D_PNe are displayed respectively in the right and left panels of Fig. 3.3 in red; results given data D_GC are shown in black. Each parameter is learnt with its 95%


Fig. 3.3 Right: Logarithm of ρ_1,...,ρ_{N_X} learnt using D_GC, plotted (in black) against the log of the centroids of the N_X R-bins. The logarithm of the ρ-parameters learnt using D_PNe is plotted against the logarithm of the radial locations of the R-bins relevant to the PNe data (in red). The seed density used in the MCMC chains run with either data set is shown in green. Left: Plot of the log of f_2,...,f_{N_ε} learnt using D_GC, against the centroid energy values of the ε-bins, in black. The log of the f-parameters learnt using D_PNe is plotted in red. The log of a Gaussian in the energy variable, fit to the f-parameters learnt using D_GC, is plotted against energy in broken red lines. The seed values of the f-parameters used in either MCMC chain are in green. Energy is normalised by −Φ(0)


Highest Probability Density credible region (HPD). The gravitational mass density parameters, i.e. the ρ-parameters, learnt using the two different data sets are seen to concur within the respective 95% HPDs. The HPDs are wider for the ρ-parameters learnt using D_GC than for the results obtained using D_PNe. This is to be expected, given that there are fewer GCs in the observed data set than there are PNe in the D_PNe data set. To summarise, the galactic gravitational mass density values, as learnt using each data set at the respective R-bins, concur within the learnt uncertainties, irrespective of the implemented data set.

However, the f_2,...,f_{N_ε} parameters learnt using D_PNe do not concur with those learnt using D_GC. It is to be noted that it is the logarithm of the f_j parameter that we plot against energy, given the orders of magnitude spanned by the values of the learnt f-parameters. While it is possible, within the 95% HPDs, to fit (the logarithm of) a Gaussian to the results obtained using D_GC (except at about a normalised energy of −0.2), such a fit is defied by the f-parameters learnt using D_PNe at the less negative values of energy. In fact, the pdf of the phase space variables that we anticipate learning using D_PNe is not (truncated) Normal by nature, given the incompatible tail behaviour. The pdf learnt using D_GC, on the other hand, appears compatible with a Gaussian, within the 95% HPDs. If the galactic phase space pdfs learnt using the data on the GCs and the data on the PNe are not the same, then that implies very interesting descriptions of the galaxy. But we have not yet learnt the pdf of the phase space variables using either data set.

Before discussing this learning, we will now ensure that the MCMC diagnostics of the chains run with the two different data sets correctly indicate convergence. Figure 3.4 presents the traces of some of the ρ-parameters learnt with D_GC, while the histograms of the f-parameters are presented in Fig. 3.5. Again, histograms of some of the ρ-parameters learnt using D_PNe are presented in Fig. 3.6, while traces of the learnt f-parameters are in Fig. 3.7. The presented traces manifest trendlessness, while the histograms are well approximated as unimodal, bell-shaped curves. Such results motivate confidence in our learning.
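Trendlessness of a trace can be checked with a crude two-window comparison in the spirit of Geweke's diagnostic; the window fractions and the z-threshold below are arbitrary illustrative choices, not the book's.

```python
import numpy as np

rng = np.random.default_rng(3)

def trendless(trace, z_tol=2.0):
    """Compare the mean of the first 10% of the trace with that of the
    last 50%, scaled by the standard error of the difference; a small
    z-score is consistent with a trendless (converged-looking) trace."""
    a = trace[: len(trace) // 10]
    b = trace[-(len(trace) // 2):]
    se = np.sqrt(a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b))
    return abs(a.mean() - b.mean()) / se < z_tol

stationary = rng.normal(size=4000)                   # looks like a converged trace
drifting = stationary + np.linspace(0.0, 5.0, 4000)  # an obviously trending trace
ok_flat = trendless(stationary, z_tol=5.0)           # lenient threshold for the demo
ok_drift = trendless(drifting)
```

A drifting trace fails the check by a wide margin, whereas white noise passes; in practice one would run such a check per parameter, per chain.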

3.3.1 Learning the ρ(R) Function and pdf f(ε)

In the subsection above we discussed the learning of the vectorised version of the gravitational mass density function, i.e. bespoke-learnt values of this function at selected radii inside the observed galaxy NGC4649. A similarly vectorised pdf of the phase space variable vector was learnt, at distinct values of the energy variable ε, where energy embodies that combination of the phase space variables that

Fig. 3.4 Traces of the gravitational mass density parameters ρ_i from the MCMC chain run with data D_GC. Here i ∈ {1, 2, 3, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28}

correctly represents the model assumption of an isotropic phase space. That we have undertaken the bespoke learning of ρ_i at the centroid r_i of the i-th R-bin, and learnt f_j at the centroid of the j-th ε-bin, equivalently implies that we have successfully generated the training data sets: D_ρ := {(r_i, ρ_i)}_{i=1}^{N_X} and D_f := {(ε_j, f_j)}_{j=2}^{N_ε}. In this discussion, these ρ-parameters and f-parameters are learnt given the PNe data. In this section, we employ these generated training sets to perform supervised learning of the gravitational mass density function ρ(R) and the phase space pdf f(ε), by treating each such function as a random realisation from a respective Gaussian Process. Thus, we now model:

ρ(·) ~ GP(μ_ρ(·), cov_ρ(·,·)),
f(·) ~ GP(μ_ε(·), cov_ε(·,·)),

where μ.(·) and cov.(·,·) are the mean and covariance functions of the relevant GP, respectively.


Fig. 3.5 Histograms representing the marginal posterior probability density of f_j, given data D_GC, computed using results of an MCMC chain run with this data. Here j = 2,...,9

3.3.1.1 Predicting from the Learnt ρ(·)

Then the joint probability distribution of the N_X outputs, each realised at a design radius that populates the training data, is a multivariate Normal, with mean vector μ_ρ and variance-covariance matrix Σ_ρ = [σ_ρ^{(ij)}], with σ_ρ^{(ij)} = K(r_i, r_j), where K(·,·) is the kernel used to parametrise the covariance structure. We standardise the ρ-parameter values that live in the data set D_ρ by the sample mean and sample standard deviation, and subsequently set μ_ρ = 0. Given the trends suggested by the learnt ρ-parameters across values of R, we opt for a simple kernel, namely a SQuared Exponential or SQE kernel: K(r_i, r_j) = a_ρ exp(−(r_i − r_j)²/(2ℓ_ρ)), with the amplitude parameter a_ρ and length-scale parameter ℓ_ρ unknowns that we will learn from the data. In fact, we will simplify the learning further, by recognising that the global amplitude can be subsumed within the global length scale, s.t. in our model, the only parameter that requires learning is the "amplitude-modulated length scale", which we will still refer to using the notation ℓ_ρ. Indeed, it may be considered simplistic to suggest that a unique length scale will be relevant for all r_i, r_j ∈ {r_1,...,r_{N_X}}; however, the "smooth" trends in the learnt ρ-parameters inspire confidence in such a parametrisation of the covariance kernel. Chakrabarty & Wang (under preparation) have advanced a learning technique in which each parameter of a covariance kernel is modelled as a function of the sample path realised from the invoked GP, which in general could be a tensor-variate GP, and each such function is subsequently modelled as a realisation from a scalar-variate GP, each of which can be proved to be stationary. Then the parameters of the covariance structure of each such scalar-variate, stationary GP need not be modelled further as functions of a sample path of the underlying Process, but can be learnt from the data as global unknowns. So here, we are essentially reducing the complexity of our learning exercise by assuming that the

3.3 Empirical Illustration on Real Galaxy NGC4649

Fig. 3.6 Marginal posterior probability of ρi given DPNe, approximated by the histogram of values of ρi learnt in the MCMC chain run with this data. Here i ∈ {1, 2, 3, 4, 6, 8, 10, 12, 14, 16, 18, 20}

underlying scalar-variate GP that ρ(R) is sampled from, is a stationary GP, and this assumption may not necessarily be the correct one to make. So, to check whether this assumption holds in the data, we learn the ℓρ parameter of the GP that generates ρ(R), using 21 of the NX = 29 data points that exist in the data set Dρ, and then predict the ρ-values at the 8 other values of R. In other words, the original data Dρ is split into a training set of 21 data points and a test set of 8 points. We predict at each of the 8 radii that populate the test set. The 21 data points of this training set (learnt using the PNe data) are plotted in black in the top right panel of Fig. 3.8, with the mean value of each learnt ρ-parameter marked by a filled circle, and the uncertainty (95% HPD credible region) in the MCMC-based inference on the corresponding ρ-parameter overplotted as an error bar. Here the gravitational mass density is in units of solar masses per kiloparsec cubed, where

3 Bespoke learning of density

Fig. 3.7 Traces of fj from the MCMC chain run with data DPNe; j = 2, . . . , 9

the mass of the Sun is a standard unit of gravitational mass in galactic astronomy, denoted M⊙. The predicted ρ-parameter at each of the 8 data points that live in the test set is marked in red in this plot; the mean is marked by a filled circle, with the predicted uncertainty marked by the error bar. It is to be expected that such uncertainty of prediction exceeds the "measurement error", where the latter in this exercise reduces to the credible region learnt during the MCMC-based inference on this ρ-parameter. The "true" value of the ρ-parameter at each radius in the test set, learnt using the PNe data, is plotted in green. Concurrence of the predicted values (in red) and the "true" values (in green) lends confidence to our learning of the parameter ℓρ. The trace of this length scale parameter is included in the top left panel of this figure. In addition to prediction, we also undertake "forecasting", upon the learning of the parameter ℓρ that defines the underlying GP that generates the ρ(R) function. Such forecasting is performed at R = 0.1 kpc; R = 0.3 kpc; R = 1.1 kpc; and R = 2.2 kpc. These values of the radius R are not part of the data set Dρ, and are therefore absent from the training set, as well as from the test set, described in the last paragraph. We first perform the forecasting at R = 2.2 kpc. To undertake this, we employ the 21 data points of the training set, along with their respective uncertainties (credible regions learnt in MCMC-based inference), in concert with the 8 predicted ρ-parameters and the respective uncertainty in each prediction. The result of this forecast is shown in the top right panel of Fig. 3.8, with the mean forecast value depicted as a red open circle, upon which uncertainty in the forecast value of ρ(2.2) is overplotted as an error bar.
On completion of this forecasting, we forecast the ρ-parameter at R = 1.1 kpc, by using the training+predicted data (along with respective uncertainties) as before, but this time also including the forecast performed at R = 2.2 kpc, along with its uncertainties. The mean forecast value of the gravitational mass density function at R = 1.1 kpc is also shown as a red open circle. Then we proceed similarly, augmenting our training set with the recently forecast ρ-parameter at 1.1 kpc, to forecast at R = 0.3 kpc, and thereafter perform

[Fig. 3.8 panels: top left, trace of the learnt correlation length scale ℓρ against iteration; top right, log(ρ in M⊙ kpc⁻³) against log(radius in kpc); bottom left, trace of the learnt correlation length scale ℓf against iteration; bottom right, log(pdf) against log(−energy).]
Fig. 3.8 Top right: 21 of the 29 ρ-parameters, learnt using the PNe data, that are included in the training set Dtrain are shown in black. The remaining 8 ρ-parameters, which form the test set, are in green. The mean of the prediction performed at each test value of R is depicted as a red filled circle. Again, the means of the one-step-ahead forecasting performed at r = 0.1 kpc, r = 0.3 kpc, r = 1.1 kpc, and r = 2.2 kpc are shown as red open circles. Uncertainties on predicted/forecast values are marked by red error bars. Bottom right: 8 of the f-parameters learnt using the PNe data are shown in black, while the 7 predictions performed at chosen test energy values are in red; the mean of each prediction is depicted as a red filled circle. In addition, forecasting is performed at two energies that are outside the range of energies at which the original learning (using the PNe data) was undertaken; the mean forecast f-values are depicted as red open circles. Left panels: traces of the learnt ℓρ (top) and ℓf (bottom left) parameters

one-step-ahead forecasting of the ρ-parameter at R = 0.1 kpc. It is to be noted that the mean of the learnt ℓρ value is employed in the prediction and one-step-ahead forecasting.
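The sequential one-step-ahead forecasting described above, where each new forecast (with its predictive variance) augments the training set before the next forecast is made, can be sketched as follows. The kernel, the data, and the radii here are illustrative stand-ins, not the book's actual values:

```python
import numpy as np

def sqe(a, b, ell):
    # SQE kernel with amplitude-modulated length scale ell (as in the text)
    return np.exp(-np.subtract.outer(a, b) ** 2 / (2.0 * ell))

def gp_forecast(r_train, y_train, y_var, r_new, ell):
    """Closed-form GP predictive mean and variance at one new radius,
    conditioning on noisy training outputs with noise variances y_var."""
    K = sqe(r_train, r_train, ell) + np.diag(y_var)
    k = sqe(np.atleast_1d(r_new), r_train, ell)          # shape (1, n)
    mean = float(k @ np.linalg.solve(K, y_train))
    var = float(1.0 - k @ np.linalg.solve(K, k.T))
    return mean, var

# Hypothetical (standardised) training data, standing in for the
# learnt/predicted rho-parameters of the text:
rng = np.random.default_rng(0)
r_tr = list(np.linspace(0.5, 3.0, 10))
y_tr = list(np.sin(np.array(r_tr)) + 0.05 * rng.standard_normal(10))
v_tr = [0.05 ** 2] * 10

# One-step-ahead forecasting, moving inward: each forecast (mean and
# variance) augments the training set before the next forecast is made.
for r_new in [0.4, 0.3, 0.2, 0.1]:
    m, v = gp_forecast(np.array(r_tr), np.array(y_tr), np.array(v_tr),
                       r_new, ell=0.5)
    r_tr.append(r_new)
    y_tr.append(m)
    v_tr.append(max(v, 1e-6))   # small floor for numerical stability
```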

3.3.1.2

Gravitational Mass Enclosed Within a Radius

One outcome of this forecasting is that we are afforded the means to learn the vectorised version of the enclosed mass function M(R), which informs on the total


gravitational mass that is enclosed in the interval [0, R). So

$$M(r) = \int_{s=0}^{r} 4\pi s^2 \rho(s)\, ds.$$

We can learn the vectorised version of M(R) by discretising this integral, using a Riemann sum representation, or rather a trapezoidal rule to be precise, with the learnt/forecast/predicted value of the ρ-parameter in each partition that the domain of M(·) is split into. This partitioning is done into the NX R-bins that the domain of the gravitational mass density function is also partitioned into, except that this time there are 4 additional R-bins: R = r ∈ [0, 0.1) kpc, r ∈ [0.1, 0.3) kpc, r ∈ [0.3, 1.1) kpc and r ∈ [1.1, 2.2) kpc. The ρ-parameter values over each of these additional R-bins are forecast, with uncertainties, as depicted in the top right panel of Fig. 3.8. However, such a Riemann sum representation appears to suggest high values of the cumulative gravitational mass up to a radius under consideration, and these computed (enclosed within a radius) mass values are not robust to changes in binning details, or to the choice of the Riemann sum method used. So we have resorted to parametric fitting of the upper and lower limits of the learnt 95% HPDs on the (bespoke-learnt and predicted) gravitational mass density values, and integrate from r = 0 to the current value of the radius variable R, as per the above integral, to compute the mass enclosed within the current radius. We fit an exponential function of the form A1 exp(log(r)/A2) + A3 to the data comprising logarithm (to the base 10) values of the learnt/predicted ρ-parameters at radii r ≤ rin, and a form that is linear in log(r) at radii in excess of rin; in this fitting exercise, rin is 5.5 kpc. Results of such computation of M(r), using the ρ-parameters that are learnt and predicted given the PNe data, are included in Fig. 3.9. The enclosed mass values computed using the GC data are foreseen to concur with these results within the learnt HPDs, since the ρ-parameters learnt with the PNe and GC data have been seen to concur.
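A minimal sketch of the trapezoidal discretisation of the enclosed-mass integral, with a toy constant-density check; the grid and density here are our own illustrations, whereas the text's final computation instead integrates a parametric fit to the HPD limits of the learnt density:

```python
import numpy as np

def enclosed_mass(r, rho):
    """M(r) = integral_0^r 4*pi*s^2 rho(s) ds, discretised with the
    trapezoidal rule over the grid r (rho given at the grid points)."""
    integrand = 4.0 * np.pi * r ** 2 * rho
    dM = 0.5 * (integrand[1:] + integrand[:-1]) * np.diff(r)
    return np.concatenate(([0.0], np.cumsum(dM)))   # cumulative mass

# Toy check against a constant density rho0, where M(r) = (4/3)*pi*rho0*r^3:
r = np.linspace(0.0, 2.0, 2001)
M = enclosed_mass(r, np.full_like(r, 3.0))
assert abs(M[-1] - 4.0 * np.pi * 3.0 * 2.0 ** 3 / 3.0) < 1e-3
```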
The results from the PNe data indicate that M(r < 0.001 kpc) ∈ [7.7380, 137.38] × 10⁹ M⊙, i.e. the mass within the inner 0.001 kpc of this galaxy is learnt to lie between about 10 billion and about 100 billion solar masses. There is nothing in our model that allows us to interpret this enclosed mass as a supermassive black hole or any other form of central gravitational mass condensation in the galaxy. The gravitational mass enclosed within about 21 kpc, computed using the PNe data, lies in the interval [0.5384, 3.9788] × 10¹² M⊙. The Riemann sum representation of the cumulative gravitational mass at any radius, which we reject owing to the numerical instabilities manifest in that computational technique, suggested a value of about 10¹³ M⊙ within 34 kpc.

Mass Estimate from X-Ray Emission
The estimate from X-ray observations of the hot gas that emanates from this example galaxy, is higher than the total (dark+luminous) mass that astronomical


models have suggested to exist within a radius of value rmax inside this galaxy. Such astronomical models have relied on the implementation of X-ray emission from hot gas within the galaxy, with the gravitational mass enclosed within a radius of value r modelled as:

$$M(r) = C\, r\, \theta(r)\left[\frac{d\log \nu(r)}{d\log r} + \frac{d\log \theta(r)}{d\log r}\right],$$

where θ(r) is the temperature of the hot X-ray emitting gas at radius R = r, and ν(r) is the density function of this hot gas at this radius, with both the temperature and gas distribution modelled as spherically symmetric (i.e. rotation-invariant), s.t. these variables are modelled as univariate functions of R. Here C is a known negative constant. Other than the simplistic, and typically unrealistic, physical assumptions of rotation-invariance of the spatial distributions relevant to the X-ray emission, and of the existence of hydrostatic equilibrium, the possibility of attaining a robust spatial derivative of the radial dependence of temperature is next to nil. Indeed, testing for misspecification of the R-only dependence needs to be undertaken. To recap the estimation of the θ(R) function: highly uncertain observed values of the temperature at assorted radii are fit by a chosen parametric form, and the "best fit" among such fitted forms is identified as θ(R). The precariousness of this approach is evident. While a "best fit" form can always be identified from within a set of trial forms, the quality of the identified "best fit" needs to be assured, in terms of: robustness in the face of adding another temperature observation; increasing/decreasing measurement uncertainties on the temperature value, as well as on the radius at which such a thermal observation is undertaken; and clear and transparent error analysis of the undertaken fitting and testing, especially in light of such sparse and error-ridden measurements. Estimation of the galactic temperature profile from so few noisy observations is another worry.

To top these worries, the sensitivity of the spatial derivative of θ(R) to uncertainties underlying the estimation of this function will render the computation of M(R) highly noisy. The estimation of the density of the gas, though again plagued by concerns relevant to estimation via parametric fitting, is less uncertain than the estimation of θ(R). Nonetheless, computing the spatial derivative of such a fit gas density function will affect the robustness of the computed M(R). Uncertainty-free M(R) fits resulting from such an approach are essentially meaningless. Another problem that typifies such approaches is the equating of the estimated θ(Rp) with θ(R) (or ν(Rp) with ν(R)). Here R is the radius, while Rp is the projection of R onto the plane of the sky. Then the eventual M(·) computation is compromised.
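The noise amplification caused by the spatial log-derivatives in this estimator can be illustrated numerically; the power-law profiles, the value of the constant C, and the noise level below are all hypothetical, chosen only to show the effect:

```python
import numpy as np

rng = np.random.default_rng(1)
r = np.logspace(-1, 1.5, 40)            # toy radii in kpc
theta = (r / 5.0) ** -0.2               # hypothetical temperature profile
nu = (r / 5.0) ** -2.0                  # hypothetical hot-gas density profile
C = -1.0                                # stands in for the known negative constant

def dlog(y, r):
    # spatial log-derivative d log y / d log r via finite differences
    return np.gradient(np.log(y), np.log(r))

M_clean = C * r * theta * (dlog(nu, r) + dlog(theta, r))

# 3% noise on the temperature measurements corrupts the log-derivative,
# and hence the inferred mass profile, far beyond 3%:
theta_noisy = theta * (1.0 + 0.03 * rng.standard_normal(r.size))
M_noisy = C * r * theta_noisy * (dlog(nu, r) + dlog(theta_noisy, r))
scatter = np.std(M_noisy / M_clean - 1.0)   # fractional scatter in M(r)
```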

Mass Estimate from Kinematic Data
These generic problems carry over to other methods that invoke data on V3 values of individual galactic particles, in which the gravitational mass enclosed within radius R is modelled as:

$$M(r) = C\, r\, \sigma_R^2(r)\left[\frac{d\log \nu(r)}{d\log r} + \frac{d\log \sigma_R^2(r)}{d\log r} + \beta(r)\right],$$

where C is a known constant (different from the above); ν(R) is the number density function of a type of galactic particle, assumed radially-dependent; σ_R²(R) is the variance of the
particle radial velocity variable, modelled again as an R-dependent function; β(R) is the function that expresses the radial variation in the anisotropy parameter, with anisotropy parametrised as the complement of the ratio of the variance σ_T² of the particle velocity orthogonal to the radial direction, to σ_R², i.e. β(R) := 1 − σ_T(R)²/σ_R(R)². The above model that provides M(r) is a heavily parametrised one, but the implementational problems are severe. Firstly, data on the V3 values of typically fewer than 100, to at most about 400, particles are used to estimate the sample variance of the noisy V3 values recorded inside each of the Rp-bins that the observed range of Rp values is partitioned into. These empirical variance values in these radial bins are then used to estimate the σ_R²(R) function. Connecting observations at different Rp values to the R-dependent function is obviously a problem that leads to biases in the results. In the computation of M(R) using the equation provided in the last paragraph, the pre-factor of r on the RHS of the equation will dominate the comparative trends between using the correct value r for R, over rp for R. Given that rp ≤ r, this pre-factor implies that mistaking projected radial profiles as functions of r will lead to an under-estimation of M(r) in general. Even if the logarithms of the variance and number density profiles fall with log radius, these effects will be subdued by the r term that multiplies the expression inside the outermost parentheses on the RHS. Again, results obtained with an extremely small sample of observed V3 values, compared to the very large number of such galactic particles that abound in the system (∼10⁹ stars, for example), cannot be directly interpreted as results for the galaxy, within the frequentist frameworks that are typically considered.
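The under-estimation caused by substituting rp for r can be seen in a toy version of the estimator with constant log-slopes; all numbers below are hypothetical:

```python
import numpy as np

# Toy version of the kinematic estimator with power-law profiles, so that
# the log-slopes are constant; all numbers here are hypothetical.
def M_est(R, C=-1.0, a_nu=-3.0, a_s2=-0.5, beta=0.0):
    sigma2 = R ** a_s2                      # sigma_R^2(R) as a power law
    return C * R * sigma2 * (a_nu + a_s2 + beta)

r, rp = 10.0, 7.0                           # projected radius r_p <= r
assert M_est(rp) < M_est(r)                 # using r_p under-estimates M
```

Even though σ_R²(R) falls with radius here, the prefactor of R dominates, so the estimate evaluated at the smaller projected radius comes out lower.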
Importantly, some Rp-bins are more populated by particle V3 measurements than others; in fact, some bins are very sparsely populated. Empirical variances computed using differently noisy, and differently (small) sized, samples will inevitably trigger non-robust parametric "best fits". Instead of learning the functional link between R and such an empirical variance of V3, parametric fitting techniques are employed (as in the procedure that works with X-ray emission data) to identify a "best fit" function for the velocity dispersion, typically misrepresented as that of the galaxy. The arbitrariness of the undertaken smoothing and fitting, and crucially the sensitivity of solutions to binning details (relevant to the radial partitions), are indicators of the shortcomings of these approaches. Making the output M(R) depend on the spatial derivative of such noisy and non-robust fits is very far from the desired. We need to do better. Similar problems affect the ν(R) estimation, and this time there are additional problems beyond the model-heavy (if not arbitrary) nature of the scaling of inputs that enter the model, such as the mass-to-light ratio. As for the β(R) function, given that σ_T²(R) is not measurable, the Occam's Razor solution would be to set the anisotropy parameter to 0. Such would correspond to our assumption of an isotropic phase space, though checking the same in the data would be possible under our scheme, as is discussed in Sect. 3.2.8. It is an unconvincing argument that the above methods are the only resort, given the difficulty of the problem, set within a particularly challenging data environment. This uniquely deficient data paradigm notwithstanding, our method indicates that less erroneous, and more robust, methods exist. What our method offers is a data-driven protocol that allows for automated learning of the total gravitational mass density, in addition to the pdf of the phase space vector, under the assumption that the phase space is isotropic. We discuss relaxation of the same below, but here we emphasise that this method does not suffer from the multiple strands of theoretical misinterpretation, and the non-robust, as well as incorrect, implementation, that plague the aforementioned approaches. Our technique offers objective, Bayesian uncertainties on the learnt parameters, as well as on the learnt functions. Also, our method is designed to acknowledge and work within the deficiencies of the real data, which are noisy, discrete, and partially sampled from the pdf of the phase space variable vector. Values of the gravitational mass density that we learn are comparable to, though expectedly higher than, the (unprojected) density of luminous matter alone, as depicted in Figure 7 of [16]. For example, we learn that the logarithm of ρ(0.1), where ρ(·) is in units of M⊙ pc⁻³, lies between 2.5280 and 3.3364 approximately, while [16] reports the log of the gravitational mass density at about this R value to be about 2. N.B. in this work we express our learnt gravitational mass density in units of M⊙ kpc⁻³, but in our comparison with the work by Hwang et al. [16], we scaled the learnt density values to the unit used in [16], namely M⊙ pc⁻³. Again, at about 34 kpc, we learn the log of the gravitational mass density in M⊙ pc⁻³ to be between −2.17 and −2.04, while the log of the density due to luminous matter alone is slightly higher than −3 at approximately this radius, as per Fig. 7 of [16]. The highly non-robust estimation of gravitational mass using measurements related to X-ray emission is suggested in Fig. 4 of [30] to indicate a gravitational mass of about 2 × 10¹¹ M⊙
, within a radius of about 34 kpc, which is noted in this figure to concur with the gravitational mass due to the luminous matter alone at this radius. However, the total, i.e. the gravitational mass of the dark and luminous matter together, is expected to exceed that of the luminous matter alone. Given this, the above result would indicate either that one (or both) of the estimates is erroneous, or that the dark matter is non-uniformly distributed in this galaxy, s.t. at about 34 kpc the galaxy is almost bereft of dark matter. Since our understanding of dark matter and its spatial distribution in a test galaxy is limited to almost negligible, the latter proposal is not necessarily to be ignored. However, given the lack of reliability of results obtained using X-ray measurements, there is a possibility of this result being wrong. Our work, however, offers no such concerns, with a gravitational mass of the order of 10¹² M⊙ learnt to exist within a radius of about 21 kpc; see Fig. 3.9. It is to be noted that, on the basis of such forecasting, the mass enclosed within the inner 0.1 kpc lies within the interval of 2 × 10¹¹ M⊙ to 2 × 10¹² M⊙.

3.3.1.3

Details of Learning and Prediction

It is perhaps an anomaly in the ordering of the material in this subsection, as we have discussed results of our GP-based learning of the .ρ(R) function, before describing details of how such results were generated. We clarify the same now. As mentioned above, modelling .ρ(·) with a scalar-variate GP implies that the

Fig. 3.9 Gravitational mass enclosed inside the R-bin at which the ρ-parameters are learnt given the PNe data, plotted against the location of each R-bin (enclosed mass M(r) in solar masses, against radius in kpc). The interval that such a mass value lies within, at a given radius, is marked by the error bar drawn at that radius. The plot also includes values of gravitational mass enclosed within radial bins that have right edges at radial locations where the gravitational mass density was predicted and forecast

joint of all the (learnt with uncertainties) ρ-parameters that populate the training set is multivariate Normal, with a null mean vector, and covariance matrix Σρ parametrised by an SQE-shaped kernel marked by parameter ℓρ. The last statement is synonymous with the assertion that the likelihood of ℓρ, given the design points and corresponding output values in the training data (Dtrain), is the aforementioned multivariate Normal, i.e.

$$L(\ell_\rho \mid \mathbf{D}_{\text{train}}) = \frac{1}{\sqrt{(2\pi)^q\,|\Sigma_\rho|}}\, \exp\left(-\frac{1}{2}\,\boldsymbol{r}^T \Sigma_\rho^{-1}\, \boldsymbol{r}\right),$$

where the vector r is comprised of the design points included in the training set Dtrain used in the learning, with Σρ defined as above, in terms of the SQE-shaped kernel with amplitude-modulated length scale parameter ℓρ. We learn this parameter within a Random Walk Metropolis-Hastings-based scheme. We impose a Normal prior on ℓρ, and sample from the ensuing posterior, to learn ℓρ with 95% HPD credible regions.

Along with this learning, we perform the closed-form forward prediction of the mean and variance of ρ(r_i^(test)), where prediction of ρ(·) is undertaken at the test radius r_i^(test). Such p test radii are elements of the vector r^(test) at which we predict values of the gravitational mass density vector ρ^(test) ≡ (ρ(r_1^(test)), . . . , ρ(r_p^(test)))^T. In fact, the predictive distribution of this vector of predicted outputs, i.e. ρ(r^(test)), is a multivariate Normal:

$$\boldsymbol{\rho}(\boldsymbol{r}^{(\text{test})}) \sim \mathcal{N}\!\left(\bar{\boldsymbol{\rho}}^{(\text{test})}, \mathbf{K}\right).$$


Here, the mean of the vector of predicted outputs is:

$$\bar{\boldsymbol{\rho}}^{(\text{test})} = K_{r^{(\text{test})}, r^{(\text{tr})}} \left(K_{r^{(\text{tr})}, r^{(\text{tr})}} + \sigma^2 I\right)^{-1} \boldsymbol{\rho}^{(\text{tr})},$$

where r^(tr) = (r_1^(tr), . . . , r_q^(tr))^T is the vector comprising the q distinct values of R that are design points in the training set Dtrain, and the vector comprising values of ρ(·) at each of these design points is denoted ρ^(tr). Here the q × q-dimensional matrix

$$K_{r^{(\text{tr})}, r^{(\text{tr})}} := \left[\exp\!\left(-\left(r_i^{(\text{tr})} - r_j^{(\text{tr})}\right)^2 \!/ 2\ell_\rho\right)\right]_{i,j=1}^{q,q},$$

while the p × q-dimensional matrix

$$K_{r^{(\text{test})}, r^{(\text{tr})}} := \left[\exp\!\left(-\left(r_i^{(\text{test})} - r_j^{(\text{tr})}\right)^2 \!/ 2\ell_\rho\right)\right]_{i,j=1}^{p,q}.$$

Lastly, the q × q-dimensional diagonal matrix σ²I = [δ_ij σ_i²]_{i,j=1}^q, with σ_i² the variance of the (assumed Gaussian) distribution of the error in the value of ρ_i^(tr) that is (learnt as) realised at the i-th design point in the training data set; i = 1, . . . , q. Here I is the q × q-dimensional identity matrix; the delta function δ_ij = 1 if i = j and δ_ij = 0 if i ≠ j. Again, the variance-covariance matrix of the vector of predicted outputs is:

$$\mathbf{K} = K_{r^{(\text{test})}, r^{(\text{test})}} - K_{r^{(\text{test})}, r^{(\text{tr})}} \left(K_{r^{(\text{tr})}, r^{(\text{tr})}} + \sigma^2 I\right)^{-1} K_{r^{(\text{tr})}, r^{(\text{test})}}.$$

So we forward the aforementioned closed-form mean of the vector of outputs predicted at the i-th test radius, embedded within the interval that ranges from 2.5 times the predicted standard deviation below this mean, to 2.5 times the predicted standard deviation above it, at r_i^(test), ∀i = 1, . . . , p. The factor of 2.5 is used to imitate 95% of probability in the offered uncertainty range.
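The closed-form prediction equations above can be sketched as follows; the training data, noise variances, and test radii here are illustrative assumptions, not the book's learnt values:

```python
import numpy as np

def gp_predict(r_test, r_train, y_train, noise_var, ell):
    """Closed-form GP predictive mean and covariance with the SQE kernel
    exp(-(r_i - r_j)**2 / (2*ell)), following the equations above."""
    def k(a, b):
        return np.exp(-np.subtract.outer(a, b) ** 2 / (2.0 * ell))
    Ktt = k(r_train, r_train) + np.diag(noise_var)   # (K + sigma^2 I)
    Kst = k(r_test, r_train)
    mean = Kst @ np.linalg.solve(Ktt, y_train)
    cov = k(r_test, r_test) - Kst @ np.linalg.solve(Ktt, Kst.T)
    return mean, cov

# Hypothetical standardised training data; report mean +/- 2.5 predictive
# standard deviations, as in the text's uncertainty intervals:
r_tr = np.linspace(0.0, 3.0, 21)
y_tr = np.sin(r_tr)
r_te = np.array([0.5, 1.5, 2.5])
mean, cov = gp_predict(r_te, r_tr, y_tr, np.full(21, 1e-4), ell=0.5)
sd = np.sqrt(np.clip(np.diag(cov), 0.0, None))
bands = np.stack([mean - 2.5 * sd, mean + 2.5 * sd])
```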

3.3.1.4

Predicting Upon Learning the Phase Space pdf

In the same way as above, we learn the length scale parameter ℓf that describes the correlation structure of the GP, assumed stationary, that we invoke to model the pdf of the phase space vector variable. Convergence of the Random Walk Metropolis-Hastings chain that was run to perform this learning is borne out by the trace of ℓf shown in the lower left panel of Fig. 3.8. Upon learning the phase space pdf, we then perform prediction at 7 distinct test values of energy, where all the f-parameters (f2, . . . , f9) learnt given the PNe data are included in the training set. The test energy values at which prediction is performed are chosen to lie on a uniform grid, initiated at an arbitrarily chosen value of energy. In addition, we perform "forecasting" of the pdf at two energies that are even more negative than the energy at which f9 is learnt, i.e. the centroid of the 9-th ε-bin. The learnt mean values of f2, . . . , f9 are shown in the lower right panel of Fig. 3.8 as black filled circles, with the learnt credible regions superimposed. The predicted mean values are in red filled circles and the forecast values in red open circles, with errors of prediction/forecasting superimposed as red error bars. The logarithm of each


of these learnt/predicted/forecast f-parameters is plotted against the logarithm of the negative of the energy in this figure. As we can see, the learnt ℓf is s.t. the prediction of the pdf made at the least negative energy is the least compatible with the two "true" values of the pdf, namely f2 and f3, that straddle this prediction. This may suggest that the learning of ℓf is not as good as it needs to be, given the training data we employ, or, more likely, that the underlying GP that generates the pdf given the PNe data is s.t. a unique (global) scalar length scale is not sufficient to parametrise it correctly.
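The Random Walk Metropolis-Hastings learning of a length scale such as ℓρ or ℓf, with a Normal prior and the multivariate-Normal likelihood described above, can be sketched as follows; the data, the prior hyperparameters, and the tuning constants are illustrative assumptions, not the book's:

```python
import numpy as np

def log_post(ell, y, design, noise_var, prior_mean=1.0, prior_sd=2.0):
    """Log posterior of the length scale ell: multivariate-Normal likelihood
    with SQE covariance over the design points, plus a Normal prior."""
    d2 = (design[:, None] - design[None, :]) ** 2
    S = np.exp(-d2 / (2.0 * ell)) + np.diag(noise_var)
    sign, logdet = np.linalg.slogdet(S)
    loglike = -0.5 * (logdet + y @ np.linalg.solve(S, y))
    return loglike - 0.5 * ((ell - prior_mean) / prior_sd) ** 2

def rwmh(y, design, noise_var, n_iter=2000, step=0.1, ell0=1.0, seed=0):
    """Random Walk Metropolis-Hastings over ell > 0."""
    rng = np.random.default_rng(seed)
    ell = ell0
    lp = log_post(ell, y, design, noise_var)
    chain = []
    for _ in range(n_iter):
        prop = ell + step * rng.standard_normal()
        if prop > 0.0:                       # reject non-positive proposals
            lp_prop = log_post(prop, y, design, noise_var)
            if np.log(rng.uniform()) < lp_prop - lp:
                ell, lp = prop, lp_prop
        chain.append(ell)
    return np.array(chain)

# Hypothetical stand-ins for 21 standardised training outputs:
rng = np.random.default_rng(42)
design = np.sort(rng.uniform(0.0, 3.0, 21))
y = rng.standard_normal(21)
chain = rwmh(y, design, np.full(21, 0.01))
```

The trace of `chain` plays the role of the length-scale traces shown in the left panels of Fig. 3.8, and its histogram approximates the marginal posterior from which the 95% HPD credible region is read off.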

3.4 Conclusion

We assume that the pdf of the phase space vector variable W is s.t. its dependence on X and V betrays rotational invariance, where W = (X^T, V^T)^T. Then we model the pdf as a function of the norms of X and V.

3.4.1 Testing for Isotropy in the Data

After undertaking learning under this assumption of an isotropic phase space, it is possible to test for this assumption. Chakrabarty [7] advanced a new Bayesian test of hypothesis that permits testing of the null that the data implemented to achieve the learnt model is sampled from a pdf that is an isotropic function of X and V, against the model that maximises the likelihood under the null. Here, by a "learnt" model is implied the learnt model vector (ρ^T, f^T)^T, along with the 95% HPDs on each component of the model vector. In fact, a version of this test allows for comparing the adherence to isotropy of one model learnt given data D1, and another learnt given a differently sized data set D2, where the dimensionalities of the learnt model vectors differ between the two models. Such testing of the data sets DPNe and DGC would be instructive in the context of the galaxy NGC 4649, since a significant difference is noted between the phase space pdf parameters learnt using the two data sets, which comprise measured phase space coordinates of two different kinds of galactic particles in this galaxy. The tail behaviour of the pdf learnt using the PNe data is not that of a Normal density, though the pdf learnt using the GC data is closer to a Normal shape. In spite of this difference between the learnt pdfs, the galactic gravitational mass density parameters learnt given these two data sets are concurrent within the learnt 95% HPDs. We notice that the phase space pdfs learnt using data on the two particle types are not significantly different over the more negative energies, i.e. for the relatively


more strongly bound particles. So it appears that there is a difference between the phase space distributions of the comparatively higher-energy PNe and GCs.⁵

3.4.2 Working with a Multivariate Phase Space pdf

In this chapter we have learnt the evolution-driving function, i.e. the potential, while committing to an isotropic phase space. This model assumption has resulted in a univariate pdf of the phase space vector variable: even though it is the pdf of a vector, the domain variable of this pdf is rendered a scalar, namely the energy variable, which contains information about the (six) different phase space components of the phase space vector variable. This model assumption has a direct bearing on the tractability of the procedure by which the unobserved phase space coordinates are integrated out in this problem. This procedure of projecting the phase space pdf onto the space of observations is rendered intractable for a multivariate phase space pdf, while it is possible, though a very tedious and difficult computation, for a prescribed 2-integral version of the pdf, i.e. for a selected bivariate model. Therefore, when the bespoke learning endeavour is not mired in the problem caused by observations comprising sampling from a sub-volume of the phase space, it will be possible to proceed with a multivariate phase space pdf.

3.4.3 Summary

So to summarise, we remind ourselves that the methodology offered in this chapter is a data-driven tool for learning the evolution-driver, or potential function, of a real-world stationary dynamical system, namely a distant (elliptical) galaxy. As structure drives behaviour in such a dynamical system, this evolution-driver then informs, deterministically in fact, on the structure of the system, i.e. the gravitational mass density function of the system. The method can be morphed into a black box, to permit the automated learning of the gravitational mass content inside a given radial location in the galaxy. The pdf of the phase space variables of this mechanical system is another output of this method. Thus, the method directly allows access to the astronomically relevant number density and the variances of the different components of the velocity vector variable, while also offering the

5 It is conjectured that the higher energy PNe and/or GCs manifest chaotic behaviour on an attractor, thus impeding mixing between the phase space sub-volume comprised of orbits of such high energy PNe and that of the high energy GCs. However, at the lower—or more negative— energies, there is more efficient mixing. Given that the tail of the pdf occurs at the higher energies, the overall effect is that the gravitational potential—and hence the gravitational mass density— learnt using the two data sets, concur within learnt uncertainties.


chance (of constructing Poincaré sections, to allow) for an inspection of the orbital population in the galaxy. A method such as this can be used to learn the structure and behaviour of assorted dynamical systems, by learning their evolution-driving potential functions, irrespective of whether the system is mechanical or not; with evolution underlined by a stationary or a non-stationary process; and with a potential that is not necessarily central. While each such application will be a challenge in itself, treating the information on the evolution of the phase space pdf is the key to the unravelling of the dynamics of the system, as informed by the potential.

References

1. S. Axler, P. Bourdon, and W. Ramey. Harmonic Function Theory (2nd edition). Springer-Verlag, 2001.
2. J. Binney. Dynamics of elliptical galaxies and other spheroidal components. Annual Review of Astronomy and Astrophysics, 20:399–429, 1982.
3. J. Binney and M. Merrifield. Galactic Astronomy. Princeton Series in Astrophysics. Princeton University Press, Princeton, NJ, 1998.
4. J. Binney and S. Tremaine. Galactic Dynamics. Princeton University Press, Princeton, 1987.
5. Terry Bridges, Karl Gebhardt, Ray Sharples, Favio Raul Faifer, Juan C. Forte, Michael A. Beasley, Stephen E. Zepf, Duncan A. Forbes, David A. Hanes, and Michael Pierce. The globular cluster kinematics and galaxy dark matter content of NGC 4649 (M60). Monthly Notices of the Royal Astronomical Society, 373(1):157–166, 2006.
6. T. Caraballo and X. Han. Autonomous Dynamical Systems. In: Applied Nonautonomous and Random Dynamical Systems. SpringerBriefs in Mathematics. Springer, Cham, 2016.
7. Dalia Chakrabarty. A new Bayesian test to test for the intractability-countering hypothesis. Journal of the American Statistical Association, 112:561–577, 2017.
8. J. L. Doob. Classical Potential Theory and Its Probabilistic Counterpart. Springer-Verlag, Berlin, Heidelberg, New York, 2001.
9. K. Freeman. Dark matter in galaxies. In Encyclopedia of Astronomy and Astrophysics. Nature Publishing Group, 2001.
10. H. Goldstein, C. P. Poole, and J. L. Safko. Classical Mechanics. Addison-Wesley Longman, Incorporated, 2002.
11. P. J. Green. Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika, 82(4):711–732, 1995.
12. P. T. Gressman and R. M. Strain. Global classical solutions of the Boltzmann equation with long-range interactions. Proceedings of the National Academy of Sciences, 107(13):5744–5749, 2010.
13. Stewart Harris. An Introduction to the Theory of the Boltzmann Equation. Dover Publications, Mineola, New York, 2011.
14. L. L. Helms. Introduction to Potential Theory. R. E. Krieger, 1975.
15. R. A. Howland. Integrals of Motion. In: Intermediate Dynamics: A Linear Algebraic Approach. Mechanical Engineering Series. Springer, Boston, MA, 2006.
16. Ho Seong Hwang, Myung Gyoon Lee, Hong Soo Park, Sang Chul Kim, Jang-Hyun Park, Young-Jong Sohn, Sang-Gak Lee, Soo-Chang Rey, Young-Wook Lee, and Ho-Il Kim. The globular cluster system of M60 (NGC 4649). II. Kinematics of the globular cluster system. Astrophysical Journal, 674(2):869–885, 2008.
17. O. D. Kellogg. Foundations of Potential Theory. Dover Publications, 1969.
18. John Korsgaard. On the representation of two-dimensional isotropic functions. International Journal of Engineering Science, 28(7):653–662, 1990.
19. P. S. Landa. Nonlinear Oscillations and Waves in Dynamical Systems. Mathematics and Its Applications, Volume 360. Springer, Dordrecht, 1996.
20. Myung Gyoon Lee, Ho Seong Hwang, Hong Soo Park, Jang-Hyun Park, Sang Chul Kim, Young-Jong Sohn, Sang-Gak Lee, Soo-Chang Rey, Young-Wook Lee, and Ho-Il Kim. The globular cluster system of M60 (NGC 4649). I. Canada-France-Hawaii Telescope MOS spectroscopy and database. Astrophysical Journal, 674(2):857–868, 2008.
21. A. Loeb. Maybe dark matter is more than one thing, May 30, 2021.
22. N. Mukherjee and S. Poria. Preliminary concepts of dynamical systems. International Journal of Applied Mathematical Research, 1(4):751–770, 2012.
23. I. Murray, Z. Ghahramani, and D. J. C. MacKay. MCMC for doubly-intractable distributions, 2007.
24. N. Napolitano et al. The Planetary Nebula Spectrograph elliptical galaxy survey: the dark matter in NGC 4494. Monthly Notices of the Royal Astronomical Society, 393:329–353, 2009.
25. NASA, ESA, and STScI. Mystery of galaxy's missing dark matter deepens, June 1, 2021.
26. J. F. Navarro, C. S. Frenk, and S. D. M. White. The structure of cold dark matter halos. The Astrophysical Journal, 462:563–575, 1996.
27. D. D. Nolte. Introduction to Modern Dynamics: Chaos, Networks, Space and Time. Oxford University Press, 2015.
28. C. P. Robert and G. Casella. Monte Carlo Statistical Methods. Springer-Verlag, New York, 2004.
29. Vera Rubin. Dark matter in the universe, 1998.
30. Juntai Shen and Karl Gebhardt. The supermassive black hole and dark matter halo of NGC 4649 (M60). The Astrophysical Journal, 711(1):484–494, 2010.
31. F. Shu and E. Kranakis. The Physical Universe: An Introduction to Astronomy. Series of Books in Astronomy. University Science Books, 1982.
32. A. M. Teodorescu, R. H. Méndez, F. Bernardi, J. Thomas, P. Das, and O. Gerhard.

References

151

18. John Korsgaard. On the representation of two-dimensional isotropic functions. International Journal of Engineering Science, 28(7):653–662, 1990. 19. P.S. Landa. Nonlinear Oscillations and Waves in Dynamical Systems. Mathematics and Its Applications. Volume 360. Springer, Dordrecht, 1996. 20. Myung Gyoon Lee, Ho Seong Hwang, Hong Soo Park, Jang-Hyun Park, Sang Chul Kim, Young-Jong Sohn, Sang-Gak Lee, Soo-Chang Rey, Young-Wook Lee, and Ho-Il Kim. The globular cluster system of m60 (ngc 4649). i. canada-france-hawaii telescope mos spectroscope and database. Astrophysical Journal, 674(2):857–868, 2008. 21. A. Loeb. Maybe dark matter is more than one thing, May 30, 2021. 22. N. Mukherjee and S. Poria. Preliminary concepts of dynamical systems. International Journal of Applied Mathematical Research, 1(4):751–770, 2012. 23. I. Murray, Z. Ghahramani, and D. J. C. MacKay. Mcmc for doubly-intractable distributions, 2007. 24. N. Napolitano et al. The Planetary Nebula Spectrograph elliptical galaxy survey: the dark matter in NGC 4494. Monthly Notices of the Royal Astronomical Society, 393:329–353, 2009. 25. NASA, ESA, and STScI. Mystery of galaxy’s missing dark matter deepens, June 1, 2021. 26. J. F. Navarro, C. S. Frenk, and S. D. M. White. The structure of cold dark matter halos. The Astrophysical Journal, 462:563–575, 1996. 27. D. D. Nolte. Introduction to Modern Dynamics: Chaos, Networks, Space and Time. Oxford University Press, 2015. 28. C. P. Robert and G. Casella. Monte Carlo Statistical Methods. Springer-Verlag, New York, 2004. 29. Vera Rubin. Dark matter in the universe, 1998. 30. Juntai Shen and Karl Gebhardt. The supermassive black hole and dark matter halo of “ngc” 4649 (“m”60). The Astrophysical Journal, 711(1):484–494, 2010. 31. F. Shu and E. Kranakis. The Physical Universe: An Introduction to Astronomy. Series of books in astronomy. University Science Books, 1982. 32. A. M. Teodorescu, R. H. Méndez, F. Bernardi, J. Thomas, P. Das, and O. Gerhard. 
Planetary nebulae in the elliptical galaxy ngc 4649 (m 60): Kinematics and distance determination. Astrophysical Journal, 736(1):65, 2011. 33. P. van Dokkum, S. Danieli, Y. Cohen, A. Merritt, and et. al. Statistical challenges of highdimensional data. Nature, 555:629–632, 2018. 34. C. C. Wang. On representations for isotropic functions. Archive for Rational Mechanics and Analysis, 33:249–267, 1969.

Chapter 4

Bespoke Learning in Static Systems: Application to Learning Sub-surface Material Density Function

Abstract In this chapter we discuss bespoke learning of a system property in a static system, i.e. bespoke learning of the relevant system property parameter at design values of an input. Such learning will then capacitate the supervised learning of the system property as a function of the input variable. This bespoke learning typically requires anticipating trends in the values of the likelihood, as it changes with the difference between observations and (a model of) the system. Modelling the likelihood in terms of such a difference, or distance, is possible subsequent to mapping from the space of system parameters onto the space of the observable. Here, we discuss an application of this scheme to the non-destructive learning of the material density, as a function of sub-surface location, using existing work that provides the bespoke learnt values of this density at designed sub-surface locations. The training data that is thus generated bears strong inhomogeneities in correlation, rendering supervised learning of the material density function difficult. This is undertaken to enable prediction of the density at test locations.

4.1 Introduction

In the previous two chapters we discussed a generic method that invokes bespoke learning, to capacitate the learning of the temporally-varying function that drives the evolution of a dynamical system. This allows for the prediction of the state of the evolving system at any time, when such prediction/forecasting is of interest to us, as it was in Chap. 2. On other occasions, it is the evolution-driving function, i.e. the potential function itself, that is the ultimate objective, as it was in Chap. 3, when we were intent on learning the structure of a gravitationally-bound system that could be approximated as a stationary system. To begin with, we have no access to any training data that comprises pairs of values of the sought potential realised at design values of the variable that defines the domain of this function. It is such original absence of training data that inhibits the supervised learning of the potential function at the very outset, and we compensate for the same using bespoke learning, conducted by embedding the potential inside the support of the phase space pdf.

© Springer Nature Switzerland AG 2023 D. Chakrabarty, Learning in the Absence of Training Data, https://doi.org/10.1007/978-3-031-31011-9_4


In this chapter, we address the problem of learning an intrinsic structural property of a static system, where an original lack of the relevant training data has inhibited supervised learning of this sought system property. So we will need to invoke the system physics to undertake bespoke learning of this system property, treated as an unknown function of those input parameters which, if changed, yield generally distinct values of this property. As the physics of the given system is directly responsible for the generation, via bespoke learning, of the value of the property at an appointed design point, the methodology that underlies such bespoke learning varies from one system to another, where such systems are distinguished by their "physics". We could, for example, be interested in the non-destructive learning of the material density at different locations within the bulk of a slab of material that has been imaged with an Electron Microscope [9, 11, 18, 25, 27, 30, 34], using an appropriate imaging technique [5, 12, 22]. Then again, we could be interested in learning the petrological composition of a rock that has been newly sampled from a well, where Nuclear Magnetic Resonance data has been collected from this rock [4, 19, 32]. The ulterior interest in these two examples would be, respectively, the prediction of the material density at chosen/test sub-surface locations, and the prediction of the petrological composition of new rocks sampled from the reservoir that includes this well. Indeed, many such problems, from diverse areas, would be of interest to the respective domain expert. The exact nature of the bespoke learning in each such learning problem is specific to the system under consideration. So the description of the methodology that underpins the bespoke learning required in a particular application, is specific to this considered application.
Therefore, it will not be possible to motivate a detailed, generic discussion of a bespoke learning methodology that permeates all types of static systems. Instead, we proceed in this chapter by beginning our discussion with an outlined method, and then follow this with a specific application of bespoke learning, to appreciate the implementational details. This application entails the learning of the value of the material density at chosen sub-surface locations inside a material slab. It is to be noted that upon accomplishing the population of the originally-absent training data, we will find ourselves in a position to undertake supervised learning. Given this, we follow up the bespoke learning relevant to the Material Science application mentioned above, with the supervised learning of the material density function in the bulk of this material sample. Such supervised learning will be a challenge, since the sought function (i.e. the sub-surface material density function) is highly discontinuous.


4.2 Bespoke Learning, Followed by 2-Staged Supervised Learning

Let a system property be parametrised as a variable denoted by Y, which in general is a k-th order tensor, s.t. Y ∈ 𝒴 ⊆ R^{m_1 × m_2 × … × m_k}. Let Y be a function of other system parameters X_1, …, X_d, with vector X := (X_1, …, X_d)ᵀ, where X ∈ 𝒳 ⊆ R^d. Then we can represent the association of X with Y as Y = g(X), where the tensor-valued function g : 𝒳 → 𝒴 is unknown in general. The equation Y = g(X) represents the relation between the random variable called "Y" and another called "X", with g(·) itself treated as a random function [10, 13, 31]. (We will soon see that this random function is modelled as a random realisation from a stochastic process.) In the Bayesian paradigm that we adopt here, values of Y predicted at given values of X are obtained by performing Bayesian inference from the posterior probability density of Y given the test data on X (i.e. given values of X), as well as the model for g(·) that we would have learnt using a training data set. Uncertainties in the predicted values of Y, realised at this test data on X, will be learnt in this inference, and such uncertainties will include: uncertainties in the learning of this model of g(·); noise in the training data; and noise in the test data on X. Such a (forward) prediction of the values of Y, given test data x^(test) on X, is closed-form, as long as we model g(·) with a Gaussian Process [20, 24, 26, 29, 33]. Uncertainty in prediction, modelled as the variance of Y at X = x^(test), is itself exactly known in such a situation, and such closed-form mean and variance prediction of the unknown is performed while folding in noise in the training data.
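As a sketch of such closed-form forward prediction, consider the scalar special case of Y and X, with a squared-exponential kernel; the generating function, kernel hyperparameters and noise level below are hypothetical choices for illustration, not those of the book's application.

```python
import numpy as np

def rbf(a, b, ls=1.0, amp=1.0):
    # squared-exponential (RBF) covariance kernel on scalar inputs
    return amp * np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ls**2)

rng = np.random.default_rng(0)
x_train = np.linspace(0.0, 2.0 * np.pi, 15)
noise_sd = 0.05
y_train = np.sin(x_train) + noise_sd * rng.normal(size=x_train.size)

x_test = np.array([1.0, 2.5])
# fold the training-data noise into the covariance of the training outputs
K = rbf(x_train, x_train) + noise_sd**2 * np.eye(x_train.size)
K_s = rbf(x_test, x_train)
K_inv = np.linalg.inv(K)

mean = K_s @ K_inv @ y_train                      # closed-form predictive mean
cov = rbf(x_test, x_test) - K_s @ K_inv @ K_s.T   # closed-form predictive covariance
```

The predictive mean and covariance are thus available without any sampling, which is the sense in which GP-based forward prediction is "closed-form".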

Basically, the functional link between the output Y and the input X is not given by a linear regression model, such that (s.t.) an additive error term should be included in the model equation. Neither is this functional link given by a Generalised Linear Model, which demands that the linear predictor be given as a transformation of the output Y (to be precise, the link function of Y), s.t. the expectation of Y, conditional on the components of the X vector, is the linear predictor. The functional relationship between the generic high-dimensional output Y and the vector X of input variables is a general one, s.t. uncertainties in a predicted value of either variable, at noisy test data on the other, are Bayesianly learnt, given the uncertainty-included learnt model of the functional relation between X and Y, conditional on the noise-included training data.

For the sake of completeness, we also discuss the inverse prediction of the values of the system parameters X at test data on Y. Such (inverse) prediction of the intrinsic system parameters, at which a recorded value of the easily-observable Y is realised, is an endeavour that we often encounter in daily life [16, 17]. Inverse prediction


could be invoked whenever supervised learning of the functional relationship between Y and X is motivated s.t. it is easier if Y, as opposed to X, is cast as the output of this function. Then by inverse prediction we imply the prediction of the values of X at which test data y^(test) on Y is realised. Such inverse prediction is not closed-form in general, and in the Bayesian paradigm, the values of X at which y^(test) is realised are given by sampling from the posterior probability of X, given Y = y^(test) and the learnt model for g(·) (which is learnt given the training data set D that comprises pairs of design values of X and the values of Y realised at these design points). In other words, g(·) is learnt using the training set D := {(x_i, y_i)}_{i=1}^{N}, s.t. the sought values of X are provided via posterior sampling from π(x | y^(test), g(·)). The lack of a closed-form nature of this posterior probability of X, conditional upon the test data on Y and the model for the inter-variable relationship, implies that sampling from this posterior needs to be undertaken via techniques such as Markov Chain Monte Carlo, or MCMC [3, 8, 28]. In view of the additional difficulty of inverse prediction over forward prediction, an argument can be advanced to suggest that we should always cast the original model equation s.t. the unknown variable is retained as the output in that model equation. This is however not always judicious, given that the difficulty in learning a function is driven by the dimensionality of its input variable (over that of its output), if kernel parametrisation of the covariance structure of the underlying stochastic process is to be undertaken.
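Such posterior sampling for inverse prediction can be sketched with a random-walk Metropolis chain; the forward model g(x) = x² + 1, the bounded flat prior and the noise scale below are illustrative assumptions, with g treated as already known for the sake of the sketch.

```python
import numpy as np

rng = np.random.default_rng(1)
g = lambda x: x**2 + 1.0          # stand-in for the learnt forward model g(.)
y_test, noise_sd = 5.0, 0.1       # observed value of Y and its noise scale

def log_post(x):
    # Gaussian likelihood of y_test given g(x), flat prior on (0, 10)
    if not (0.0 < x < 10.0):
        return -np.inf
    return -0.5 * ((y_test - g(x)) / noise_sd) ** 2

x, samples = 1.0, []
for _ in range(20000):
    prop = x + 0.3 * rng.normal()                      # random-walk proposal
    if np.log(rng.uniform()) < log_post(prop) - log_post(x):
        x = prop                                       # Metropolis accept
    samples.append(x)
post = np.array(samples[5000:])                        # discard burn-in
```

Since g(x) = 5 at x = 2 within the prior range, the chain concentrates around x = 2, with spread set by the noise scale and the local slope of g.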

Indeed, a covariance kernel that is relatively higher-dimensional is harder to work with, in that a larger number of parameters need to be learnt.

Hence it is a better choice to leave the vector-valued X as the input, and the (generally) tensor-variate Y as the output, rendering the sought g(·) a tensor-valued function, the covariance structure of whose generative process is parametrised using kernels defined on a vector space [7, 21]. We do not perform inverse prediction in this chapter. Thus, the equation Y = g(X) is one that represents the relationship between the high-dimensional (tensor-valued) variable Y and the vector-valued X, where the forms of the different components of the tensor-valued function g(·) are unknown. In our native Bayesian paradigm, values of Y (or X), realised at test data on X (or Y), are learnt using Bayesian inference on the posterior of the unknown, given the knowns/already-learnt parameters, where such learning is undertaken given the available training data set. So, now that we have clarified the model equation Y = g(X), we proceed to discuss the general outline of the Bayesian implementation of bespoke learning of Y at chosen values of X, with the aim of generating the training data set D that we will invoke next, to undertake the supervised learning of the functional relationship between Y and X, so that we can then predict Y at any x.


4.2.1 Bayesian Implementation of Bespoke Learning

4.2.1.1 Learning Y at Given Values of X, Using Data on W

Let there exist means of advancing a trial value y_i of the variable Y_i, at the i-th fixed value (x_i) of the input X, using the data set D_bespoke that comprises measurements of a different variable denoted W_i, when X = x_i. In fact, D_bespoke := {w̃_1, …, w̃_{N_data}}, where the value w̃_i is recorded of variable W_i when X = x_i, ∀i = 1, …, N_data. However, deficiency in the required information, and noise in the measurement of W_i, leave us to characterise such learning of Y_i at x_i, given D_bespoke, as probabilistic (∀i = 1, …, N_data). In other words, we cannot know y_i deterministically, but we will undertake Bayesian inference to learn Y_i at the fixed value x_i of the input X, ∀i = 1, …, N_data, using data D_bespoke.

4.2.1.2 How Can W Inform on the Sought Y?

We have not yet declared what kind of a random variable (r.v.) W_i is, nor declared the relation between W_i and Y_i (via X). Indeed, such a relationship is specific to the application at hand. In the application we discuss below, W_i represents the image data collected by targeting the information available at a point that is identified by the location vector x_i in the imaged object. On the other hand, Y_i represents the material density of the object at the chosen location inside it, i.e. at x_i. The generic representation of W_i is that it is a tensor-valued r.v., though not necessarily of the same order or dimensionality as Y_i. Then let us motivate the notation for W_i as: W_i ∈ 𝒲 ⊆ R^{n_1 × … × n_w}, ∀i = 1, …, N_data.

So W_i could be a non-linear transformation of Y_i, where the assumption is that the form of this transformation is known at a given X, though all parameters of this transformation may not be known. Let us represent this transformation as τ_{x_i} at the i-th fixed value of X, s.t. the value of W_i at X = x_i is w_i = τ_{x_i}(y_i), i = 1, …, N_data, where y_i is the value of the variable Y_i. N.B. w̃_i is the observed value of W_i at X = x_i, while w_i denotes the non-linearly transformed value of Y_i, realised when X = x_i.

We note that we do not have any data on what y_i is, for any i ∈ {1, …, N_data}; our only data is on W_i. Priors on Y_i are known, and these may be of varying strengths, depending on the application at hand. However, we do not, of course, wish to render our learning of Y_i at X = x_i prior-driven. To render our inference on Y_i data-led, we need to acknowledge the data that exists as D_bespoke. This is relevant to the inference on Y_i, since the data on the non-linearly transformed value of Y_i, over i = 1, …, N_data, comprise the data set D_bespoke.

4.2.1.3 Motivating a Model for the Likelihood

At X = x_i, the probability density of the observed value of W_i, conditional on y_i, is maximum if the observation w̃_i on W_i equals the value attained by W_i in the model (which is the non-linearly transformed y_i), i.e. the pdf is maximised if w̃_i equals τ_i(y_i) = w_i. Using this, we motivate a likelihood by recalling the following.

• From the last argument, at X = x_i, any (smoothly) decreasing function of a scaled distance between w̃_i and τ_i(y_i) = w_i is maximised if w̃_i equals τ_i(y_i).
• Again, at X = x_i, any (smoothly) decreasing function of the normalised distance between w̃_i and τ_i(y_i) = w_i is minimised if w̃_i is maximally away from τ_i(y_i), i.e. if w̃_i and τ_i(y_i) are infinitely far from each other.
• So at X = x_i, let us propose the probability density function (pdf) of W_i, conditional on y_i, as proportional to such a (smoothly) decreasing function of the normalised distance between w̃_i and τ_i(y_i) = w_i. Then such a pdf is maximised if the measurement of W_i equals the modelled value of W_i given y_i.
• Again, when the observed value of W_i is maximally distant from the value this variable attains in the model for a given value of Y_i, the pdf of W_i has to be 0. Then the (smoothly) decreasing function of the distance between the observed and modelled W_i values, that models the pdf of W_i, needs to be assigned a form that permits the function to go to 0 when the observed and modelled values of the variable are infinitely distant.
• Additionally, we need to ensure that the pdf thus defined is scaled s.t. it integrates to unity, over all values of W_i that are attainable at X = x_i.

Thus, one possible model of the pdf of the observable, conditional on the value of the variable Y_i, is

f_{W_i}(w̃_i | y_i, α_1, …, α_{N_τ}) = (1/√(2πσ²)) exp( −(w̃_i − τ_i(y_i))² / (2σ²) ),

where σ is a scale that normalises the distance between the observed and modelled W_i at X = x_i, and α_1, …, α_{N_τ} are the N_τ number of unknown parameters that parametrise the identified non-linear transformation τ_i : 𝒴 → 𝒲. Thus, the pdf of W_i conditional on y_i is modelled as a Gaussian in the observed value of W_i at X = x_i, with a constant variance, and mean given by the value of W_i at this value of Y_i (i.e. τ_i(y_i)). This choice for the pdf satisfies the limiting conditions stated above.
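As a sketch of this likelihood model, the snippet below evaluates the Gaussian pdf of the observable for a hypothetical transformation τ(y) = αy² and an assumed scale σ; both are illustrative stand-ins, not the transformation of the imaging application.

```python
import math

def tau(y, alpha):
    # hypothetical non-linear transformation tau(y) = alpha * y**2
    return alpha * y * y

def pdf_w(w_obs, y, alpha, sigma=0.5):
    # Gaussian pdf of the observable, centred on the modelled value tau(y)
    d = w_obs - tau(y, alpha)
    return math.exp(-d * d / (2 * sigma**2)) / math.sqrt(2 * math.pi * sigma**2)

# pdf is maximal when the observation equals the modelled value tau(2) = 4,
# and decays smoothly to 0 as the observed-modelled distance grows
p_match = pdf_w(w_obs=4.0, y=2.0, alpha=1.0)
p_off = pdf_w(w_obs=6.0, y=2.0, alpha=1.0)
```

The two limiting conditions of the bullet list are visible directly: the density peaks at zero distance and underflows to 0 at very large distance.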

The above motivates such a defined pdf to be a good model for the likelihood of Y_i given data w̃_i. Then, assuming that the observations recorded at each of the N_data number of fixed values of X are iid, we get the likelihood of Y_1, …, Y_{N_data}, and of the unknown parameters in the description of the identified shape of τ_i ∀i = 1, …, N_data, to be

L(Y_1, …, Y_{N_data}, α_1, …, α_{N_τ} | D_bespoke) = ∏_{i=1}^{N_data} f_{W_i}(w̃_i | y_i),

where our model choice for f_{W_i}(w̃_i | y_i) is a Gaussian in w̃_i, with mean τ_i(y_i) and variance σ². Thus, we have simplified our model considerations to suggest the same parameter-dependence of τ_i ∀i = 1, …, N_data; such a simplification could of course be relaxed in an application.

4.2.1.4 Posterior and Inference from It

Then, after invoking judicious priors π_0(Y_1, …, Y_{N_data}, α_1, …, α_{N_τ}) on the unknowns, we use this likelihood to define the posterior probability density of all unknowns, given the data D_bespoke. This posterior is

π(Y_1, …, Y_{N_data}, α_1, …, α_{N_τ} | D_bespoke) = C ∏_{i=1}^{N_data} f_{W_i}(w̃_i | y_i) π_0(Y_1, …, Y_{N_data}, α_1, …, α_{N_τ}),

where C is an unknown positive constant. We undertake Bayesian inference on the unknowns by generating posterior samples using MCMC techniques. On the other hand, readers may prefer to estimate the unknowns by maximising the posterior. In this chapter, as elsewhere in the book, inference using MCMC is advocated. Given the ready distinguishability between the Y_· and α_· parameters, Metropolis-within-Gibbs sampling suggests itself readily for such inference.
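The Metropolis-within-Gibbs strategy just mentioned can be sketched for a toy version of this problem; the transformation τ(y) = αy², the data values, the prior choices and the proposal scales are all illustrative assumptions, not the book's application. Gaussian priors on the Y_i (centred on assumed prior means) break the scaling degeneracy between α and the y_i that flat priors would leave.

```python
import numpy as np

rng = np.random.default_rng(2)
tau = lambda y, a: a * y**2                      # hypothetical transformation
true_y, true_a, sigma = np.array([1.0, 1.5, 2.0]), 0.8, 0.05
w_obs = tau(true_y, true_a) + sigma * rng.normal(size=3)
y_prior = np.array([1.0, 1.5, 2.0])              # assumed prior means on Y_i

def log_post(y, a):
    # Gaussian priors on Y_i + flat prior on a over (0, 5) + iid Gaussian likelihood
    if not (0.0 < a < 5.0):
        return -np.inf
    log_prior = -0.5 * np.sum((y - y_prior) ** 2) / 0.1**2
    return log_prior - 0.5 * np.sum((w_obs - tau(y, a)) ** 2) / sigma**2

y, a, chain_a = y_prior.copy(), 1.0, []
for _ in range(20000):
    for j in range(3):                           # block 1: update each Y_i in turn
        prop = y.copy()
        prop[j] += 0.03 * rng.normal()
        if np.log(rng.uniform()) < log_post(prop, a) - log_post(y, a):
            y = prop
    prop_a = a + 0.03 * rng.normal()             # block 2: update alpha
    if np.log(rng.uniform()) < log_post(y, prop_a) - log_post(y, a):
        a = prop_a
    chain_a.append(a)
```

Each sweep updates the Y_· block componentwise and then the α block, which is the "within-Gibbs" structure that the distinguishability of the two parameter groups invites.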

4.2.2 What if a Different Likelihood?

It may be argued at this point that the results of such Bayesian inference are subject to our choice of the model for the likelihood: had we chosen a different likelihood, such MCMC-led inference would have provided different results. However, this is not relevant as far as the expectation of the learnt values of each unknown is concerned, since sampling from a posterior that is proportional to any "limiting-situations abiding" likelihood will lead to "similar" modes of the marginal of an unknown given the data in this system, as long as the MCMC chain has converged, and the target density does not imply multimodal inference (on any unknown). By "similar" in the last sentence is implied modes that are distinct only up to numerical fluctuations, and "limiting-situations abiding" refers to a model of the likelihood that goes to 0 when the distance between the modelled and observed W_i is infinite, while the likelihood is maximal when this distance is 0. Thus, the double-exponential function could again be chosen as another model for the likelihood. However, variation in the choice of models for the likelihood is likely to give rise to differences in the width of the 95% Highest Probability Density credible region (95% HPD) that we learn on any unknown, given the available data. Indeed, such a width can be tuned by the σ parameter, irrespective of the shape of the likelihood. Nonetheless, the sensitivity of the uncertainties on any unknown, to changes in the data, would potentially be higher when one likelihood shape is chosen over another, though such sensitivity would be modulated by the included priors and the noise in the data. It is at the same time possible to include σ as an unknown, and to learn its value using MCMC, given the data at hand. A qualitative summary of the situation is then that the learnt values of any unknown are likely to concur within the learnt 95% HPDs on this unknown, irrespective of the choice of a "limiting-situations abiding" likelihood.
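The two likelihood shapes discussed here can be compared directly: both the Gaussian and the double-exponential below are maximal at zero observed-modelled distance and vanish as the distance grows without bound, while the double-exponential carries heavier tails. The scales s and b are arbitrary illustrative values.

```python
import math

def gauss(d, s=1.0):
    # Gaussian likelihood as a function of the observed-modelled distance d
    return math.exp(-d * d / (2 * s * s)) / math.sqrt(2 * math.pi * s * s)

def dexp(d, b=1.0):
    # double-exponential (Laplace) alternative: same limiting behaviour
    return math.exp(-abs(d) / b) / (2 * b)

# both peak at d = 0 and decay towards 0; the double-exponential decays
# more slowly at large d, which is what widens/narrows the learnt HPDs
vals_gauss = [gauss(d) for d in (0.0, 1.0, 5.0)]
vals_dexp = [dexp(d) for d in (0.0, 1.0, 5.0)]
```

The point of the section is visible numerically: the mode (at d = 0) is shared, while the tail masses, and hence the widths of credible regions, differ between the two choices.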

4.2.3 Dual-Staged Supervised Learning of g(X) (= Y)

Once Y_i is learnt, with uncertainties (95% HPDs), at a fixed value x_i of X, we are in a position to populate the training data set D := {(x_i, y_i)}_{i=1}^{N} that was originally absent. Using this data we can in principle learn the relationship g(·) between Y and X, where Y = g(X). As stated above, Y being a k-th ordered tensor of dimensionality m_1 × … × m_k, and X a d-dimensional vector, the function g(·) is a k-th ordered tensor-valued function that has as many components as the tensor Y does, and each of these component functions takes the d-dimensional vector X as its input. Given that there is an uncertainty in the value y_i that is realised at the i-th design point x_i, the prediction of the value of Y at x^(test) (or the inverse prediction of the value of X at which y^(test) is realised) will be undertaken while acknowledging such noise in the "data" on y_i, ∀i = 1, …, N_data. Fitting a parametric function to the training data D will be inadequate in this example, since parametric fitting techniques cannot capture inter-component correlations. On the other hand, we need each pair of the m_1, m_2, …, m_k components of the sought function g(·) to be correlated, where the pairwise correlation amongst the components of the function needs to follow the correlation between the corresponding pairs of components of the tensor Y. Stated differently, a shortcoming of parametric fitting techniques is that the smoothness of the fit functions is not learnt directly from the data; we definitely wish to learn the smoothness as led by the available data. An objective means of incorporating uncertainty into the learning of the sought function, and into the prediction of parameters thereafter, is also on our wishlist. Given the above, we discard attempts at parametric fitting and resort


to the learning of the sought high-dimensional function g(·), by modelling it with a tensor-variate Gaussian Process (GP):

g(·) ∼ GP(μ(·), K(·, ·)),

where μ(·) is the mean function of this tensor-variate GP, s.t. μ(·) is a k-th ordered tensor-valued function, and K(·, ·) is the covariance function of this GP. Then, by definition of a GP, the joint probability density of N realisations of this k-th order tensor-valued function is a (k+1)-th ordered tensor Normal density. We consider the N realisations at each of the N design points that live inside the training set D. By putting together all the N "slices" of g(·) realised at each of the N design points, we attain a (k+1)-th ordered tensor. Thus, modelling g(·) with a tensor-variate GP implies:

f_{g(X_1), …, g(X_N)}(g(x_1), …, g(x_N)) = TN_{k+1}(μ, C_1, …, C_{k+1}),

where this tensor Normal density [1, 35] is parametrised by a mean tensor μ that is (k+1)-th ordered, and is of dimensionality m_1 × … × m_k × N. We may standardise the data using the sample mean and sample standard deviation, and then model g(·) using a zero-mean GP. This density bears k+1 covariance matrices,¹ C_1, …, C_{k+1}.

(¹ Covariance matrix C_p is s.t. the ij-th element of this matrix informs on the covariance between the i-th and j-th k-th ordered tensor-valued slices (that are m_1 × … × m_{p−1} × m_{p+1} × … × m_k × N-dimensional), ∀p = 1, …, k+1.)

Then, recalling the form of the tensor Normal density, the joint expressed on the left hand side of the last equation is given as:

f_{g(X_1), …, g(X_N)}(g(x_1), …, g(x_N)) = [1 / ((2π)^{m/2} ∏_{i=1}^{k+1} |A_i|^{m/m_i})] exp( −|(D_Y − μ) ×_1 A_1^{−1} … ×_{k+1} A_{k+1}^{−1}|² / 2 ),

where the (k+1)-th order tensor D_Y := (Y_1 ⋮ … ⋮ Y_N) collates the learnt values; m = ∏_{i=1}^{k+1} m_i, with m_{k+1} = N; and A_p is the unique square root of the p-th (positive definite) covariance matrix, i.e. C_p = A_p A_pᵀ, p = 1, …, k+1. Here "×_p" represents the p-mode product between the m_1 × … × m_{p−1} × m_p × m_{p+1} × … × m_k × N-dimensional tensor D_Y − μ and a q × m_p-dimensional matrix V, which produces an m_1 × … × m_{p−1} × q × m_{p+1} × … × m_k × N-dimensional tensor. We will standardise the data as suggested above, and invoke a zero-mean GP, i.e. μ is a null tensor. The above tensor Normal density in D_Y, with parameters C_1, …, C_{k+1}, is equivalently the pdf of the observations conditional on the parameters C_1, …, C_{k+1}. In light of this realisation, we reinterpret the last equation
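The p-mode product used in the density above can be sketched in a few lines of numpy; the tensor shape and matrix here are arbitrary toy values, with axes 0-indexed as is usual in code.

```python
import numpy as np

def mode_product(T, V, p):
    # p-mode product T x_p V: contracts the p-th axis of tensor T (of length
    # m_p) with the columns of the q x m_p matrix V; the result has the same
    # shape as T, except that axis p now has length q
    out = np.tensordot(V, T, axes=(1, p))   # contracted axis lands in position 0
    return np.moveaxis(out, 0, p)           # move it back to position p

T = np.arange(24.0).reshape(2, 3, 4)        # a small 3rd-order tensor
V = np.ones((5, 3))                         # q = 5, m_p = 3 for p = 1
R = mode_product(T, V, 1)                   # shape (2, 5, 4)
```

With V a matrix of ones, each output entry is simply the sum over the contracted axis, which makes the contraction easy to verify by hand.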

as the likelihood of the unknown model parameters (the covariance matrices), given the data D_Y. Thus, the likelihood is

L(C_1, …, C_{k+1} | D_Y) = [1 / ((2π)^{m/2} ∏_{i=1}^{k+1} |A_i|^{m/m_i})] exp( −|(D_Y − μ) ×_1 A_1^{−1} … ×_{k+1} A_{k+1}^{−1}|² / 2 ).

In the presence of noise in the values of Y_1, …, Y_N, the error density of the observations is convolved with this pdf of Y_1, …, Y_N, conditional on the covariance matrices C_1, …, C_{k+1}. If this distribution of the noise in the data is Gaussian, then such convolution leads to the addition of the variance of the Normal error density in each of Y_1, …, Y_N to the variance (i.e. diagonal) terms of C_{k+1}. When such variances are unknown, it is indeed possible to learn the variances of the error densities of the observables, by including these variances as unknowns in the model. Then, after invoking relevant priors on the covariance matrices, we input the above likelihood and priors into Bayes rule, to express the joint posterior probability of the unknown covariance matrices (and, when relevant, the unknown error variances), given the data D. Values of the unknowns are learnt by sampling from this posterior using a chosen MCMC algorithm. MCMC-based Bayesian inference eases the job of computing the marginal posterior pdf of any unknown parameter; so, having computed the marginal of each unknown, the 95% HPDs on the respective unknowns are learnt.
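For the k = 1 special case (matrix-variate data), the above log-likelihood can be evaluated as follows. This is a minimal sketch, with the Cholesky factor standing in for the square root A_p; the density value is insensitive to which square root of C_p is used, since both the determinant and the quadratic form depend only on C_p itself.

```python
import numpy as np

def matrix_normal_loglik(D, C1, C2):
    # log-density of an m1 x N matrix D under a zero-mean matrix Normal
    # (the k = 1 case of the tensor Normal above), with row covariance
    # C1 (m1 x m1) and column covariance C2 (N x N)
    m1, N = D.shape
    m = m1 * N
    A1, A2 = np.linalg.cholesky(C1), np.linalg.cholesky(C2)
    Z = np.linalg.solve(A1, D)        # D x_1 A1^{-1}
    Z = np.linalg.solve(A2, Z.T).T    # ... x_2 A2^{-1}
    # log of |A1|^{m/m1} |A2|^{m/N}, via the Cholesky diagonals
    logdet = (m / m1) * np.sum(np.log(np.diag(A1))) \
             + (m / N) * np.sum(np.log(np.diag(A2)))
    return -0.5 * m * np.log(2 * np.pi) - logdet - 0.5 * np.sum(Z**2)
```

At identity covariances the density reduces to a product of independent standard Normals, which gives an easy sanity check on the implementation.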

Learning Covariance Matrices: 3 Ways

Now, one issue that requires further discussion is what learning the covariance matrices actually means. We are aware of MCMC-based learning of individual parameters, given available data; however, the learning of a matrix needs qualification.

1. One way a (symmetric) covariance matrix can be learnt is, of course, by learning each of its distinct elements as an unknown parameter, directly from MCMC. This is firstly cost-intensive; moreover, learning in excess of 200 correlated parameters using MCMC is difficult. Conservatively, learning any larger than a 15×15-dimensional covariance matrix directly from MCMC is difficult.

2. Alternatively, a covariance matrix could be computed using the empirical or sample estimate of the covariance between each relevant pair of tensor-valued "slices" within the data D_Y that results from collating the values of Y_1, …, Y_N. Computation of any such plug-in estimate of the correlation between a pair of high-dimensional slices of data is possible, subsequent to the collapsing of any such high-dimensional slice into a vector. Such enforced collapsing of a high-dimensional structure into a vector results in loss of correlation information amongst components. Additionally, sample estimates of correlations tend to improve as the sample size increases; so, if faced with small samples, the plug-in estimate will generally be erroneous.

3. A third method is to parametrise the ij-th element of the p × p-dimensional covariance matrix, by modelling the covariance between the i-th and j-th slices of D_Y, that are realised respectively at input variable U = u_i and U = u_j, as a chosen parametric function of the distance between u_i and u_j, where i, j ∈ {1, …, p}. Here U is any random variable that can be invoked to influence the generation of one of these slices of the data D_Y. In other words, the variable U is s.t. the covariance between a pair of these slices can be parametrised by the difference between those values of U at which each slice is realised. So for p = k+1, when U ≡ X, the ij-th element of the covariance matrix C_{k+1} informs on the covariance between the i-th k-th ordered tensor-valued slice of data generated at X = x_i, and that generated at X = x_j. Then, this ij-th element can be parametrised by a chosen decreasing function of a norm of x_i − x_j. Such a chosen function is referred to as a chosen covariance kernel. So our third method of learning a covariance matrix is to perform kernel parametrisation of the matrix.
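The third method can be sketched by building a covariance matrix from a squared-exponential kernel over hypothetical design points; the kernel form, amplitude and lengthscale below are illustrative choices only.

```python
import numpy as np

def se_kernel(u, amplitude=1.0, lengthscale=0.5):
    # squared-exponential covariance kernel over inputs u_1, ..., u_p:
    # C_ij = amplitude * exp(-||u_i - u_j||^2 / (2 * lengthscale^2))
    d2 = np.sum((u[:, None, :] - u[None, :, :]) ** 2, axis=-1)
    return amplitude * np.exp(-0.5 * d2 / lengthscale**2)

u = np.random.default_rng(3).uniform(size=(20, 2))   # 20 design points in R^2
C = se_kernel(u)                                     # 20 x 20 covariance matrix

# two hyperparameters (amplitude, lengthscale) stand in for the
# 20 * 21 / 2 = 210 distinct elements that direct learning would require
```

The resulting matrix is symmetric and positive definite by construction, which is the saving that kernel parametrisation delivers over learning elements one by one.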

Kernel Parametrisation of Covariance Matrices

Instead of learning the p(p + 1)/2 distinct elements of a p × p-dimensional covariance matrix, we only have to learn the hyperparameters of the chosen covariance kernel, which is a huge saving in the resources called for by the considered learning. Kernel parametrisation also helps in casting a higher-dimensional problem (namely, learning the covariance between pairs of tensor-valued slices of data) into a lower-dimensional one (namely, setting the covariance between the tensor-valued slices as a scalar-valued function of the norm of a vector-valued input variable). This does, however, raise concerns regarding the loss or distortion of covariance information, when inter-slice covariance is equated to a modelled function of the distance between the input values at which the considered slices are realised. In particular, the applicability of globally assigned values of the hyperparameters of the covariance kernel is questionable, since the covariance between a pair of data slices realised at two given design points does not necessarily vary with distance in input space in the same way as the covariance between slices realised at two other points in input space. In fact, when observations in Y are distributed "differently" across different sub-volumes of the input space, concerns are justifiably raised if even a unique parametric form of the covariance kernel is maintained at all x, let alone if global values are assigned to the hyperparameters relevant to the chosen parametric form of the kernel. In the face of a non-homogeneous distribution of the covariance between data slices realised at two distinct volumes within input space, the safest kernel parametrisation of covariance matrices is to opt for a form of the kernel that adapts to the variation of the inter-slice covariance with x. With this in mind, Chakrabarty & Wang (under preparation) have forwarded a new way of
kernel parametrisation, in which the kernel that parametrises any element of a covariance matrix is such that each of its hyperparameters is an unknown function of the sample path of the underlying tensor-variate GP.

Dual-Layered Learning, to Address Inhomogeneities in Correlation Distribution

In this new way of covariance parametrisation using kernels, the functional form of the kernel is no longer a chosen, parametric one. Instead, at any iteration, each hyperparameter of the kernel is assigned a (generally) distinct functional dependence on the sample function that is sampled at that iteration from the (outer) tensor-variate GP. Said functional dependence also varies from one iteration (of the MCMC chain that is being employed in the learning) to another, since it depends on the sample function from an inner, scalar-valued GP, as will be clarified below. Here, as many scalar-valued GPs are invoked in the inner layer as there are hyperparameters. The nesting of the functions effectively results in the overall non-parametric nature of the kernel, i.e. different forms of the kernel can then be said to hold at different points in the input space. Hence the kernel is locally receptive to the form of the sample function f(·), modelled as a random function in our work, sampled from the tensor-variate GP. This causes less distortion of information in the presence of inhomogeneities in the correlation structure of the data than if a globally imposed kernel shape is worked with.

In our work, the embedding of the sample path-dependent hyperparameters inside a chosen shape of the kernel implies that even if we choose to work with a “simple” shape to begin with, nesting of this shape with the sample path dependencies of each hyperparameter, renders the resulting kernel very different from that “simple” kernel shape chosen at the outset. Importantly, the nature of the sample path dependencies of each hyperparameter varies from one instance of sampling from the tensor-variate GP, to another. Within our MCMC-based learning, a fresh sample is drawn from the tensor-variate GP, at a new iteration of the MCMC chain. So the overall shape of the kernel varies from one iteration to the next. The kernel is then truly non-parametric.

This is now clarified. Let a hyperparameter of the kernel be expressed as an unknown function h_1(·) of the sample path that is randomly generated from the tensor-variate GP. Then, given that a sample path is generated per iteration, the sample path is an unknown random function h_2(·) of the iteration index T. Using this, we realise that in iteration T, the hyperparameter is expressed as h_1(h_2(T)), which equivalently is h(T), where the function h(·) is a composition of the unknown functions h_1(·) and h_2(·), i.e. h ≡ h_1 ◦ h_2. It is this unknown function h(·) of the iteration index T that we model as generated
from a scalar-variate GP. To achieve the kernel parametrisation of the considered covariance matrix, we then need to learn the covariance structure of each of the scalar-variate GPs that are invoked to generate the respective hyperparameters.

One question that can then be asked is about parametrising the covariance matrices of such a scalar-variate GP, advanced as the process that generates the function h(·) (which outputs the hyperparameter). Given the concerns voiced above about inhomogeneities in the correlation structure of observations in the output of a function, details of the kernel parametrisation of the scalar-variate GP that generates h(·) would indeed require scrutiny. Such concerns may appear to dissuade us from working with one global form of a kernel invoked to parametrise the covariance matrix of the (multivariate Normal) likelihood that results from modelling h(·) using such a scalar-variate GP. Fortunately, such worries are misplaced in this situation, since any function such as h(·) can be proved to be continuous, rendering its generative scalar-variate GP stationary. The reason for this continuity is that any function such as h(·) is a mapping from the set of values of the iteration index variable T, to the space where such hyperparameters take their values, i.e. h : N −→ R; as all students of Analysis 101 can prove, a mapping from N to R is continuous. Then it can be proven that the scalar-variate GP invoked to generate h(·) is stationary. This allows for the usage of a chosen (parametric) form of the kernel to parametrise the covariance function of such a GP, including the usage of (unknown) global hyperparameters that parametrise such a chosen kernel.

Summary

Thus, in this learning scheme, we use MCMC to make inference on the hyperparameter values of each kernel that parametrises the covariance matrix of each scalar-variate GP; each such GP generates one hyperparameter of the kernel that parametrises a covariance matrix of the tensor Normal likelihood, given the data D_Y on the tensor-valued output variable. In our implementation of MCMC-based inference, within any iteration we first update the parameters relevant to the scalar-variate GPs, and then update the parameters of the non-kernel-parametrised covariance matrices of the tensor Normal likelihood. Thus, a Metropolis-within-Gibbs sampling-led inference is most suitable for the implementation of this method.
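The update order described in this summary can be sketched as a Metropolis-within-Gibbs loop. The quadratic log-posterior below is a toy stand-in for the actual posterior over the two parameter blocks (inner scalar-GP hyperparameters first, then the remaining covariance parameters); nothing here is the book's actual target density:

```python
import math
import random

random.seed(1)

def log_post(theta_inner, theta_outer):
    # Toy stand-in for the log posterior over both parameter blocks.
    return -0.5 * (theta_inner ** 2 + (theta_outer - 1.0) ** 2)

theta_in, theta_out, samples = 0.0, 0.0, []
for t in range(3000):
    # Block 1: update the parameters relevant to the inner, scalar-variate GPs.
    prop = theta_in + random.gauss(0.0, 0.5)
    if math.log(random.random()) < log_post(prop, theta_out) - log_post(theta_in, theta_out):
        theta_in = prop
    # Block 2: update the non-kernel-parametrised covariance parameters
    # of the tensor Normal likelihood.
    prop = theta_out + random.gauss(0.0, 0.5)
    if math.log(random.random()) < log_post(theta_in, prop) - log_post(theta_in, theta_out):
        theta_out = prop
    if t >= 500:                      # discard burn-in
        samples.append(theta_out)

post_mean = sum(samples) / len(samples)   # sits near 1.0 for this toy target
```

Each block is updated conditional on the current value of the other, which is exactly the structure that lets the inner-GP updates precede the remaining covariance updates within every iteration.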


4.3 Application to Materials Science

We illustrate the methodology discussed above by learning the material density function at any location under the surface of a cuboidally-shaped block of material, where the only available data is the image data comprising images of the material sample taken with a Scanning Electron Microscope. The first stage of this learning is of the bespoke kind, in which we generate the missing set of pairs of design values of sub-surface location and the value of the material density realised at each such design point. Such training data does not pre-exist [2]. We cannot simulate the value of the material density at a given point in the bulk of the material sample because, even for a material sample that is grown under known laboratory conditions, knowledge of how the material is distributed across sub-surface locations is not available to us. Of course, if the scientific pursuit of the practitioner is s.t. it is satisfied with the ad hoc description of this density that is provided by a numerical model, then in the context of such pursuits, the current discussion may appear to be an overkill. If a reliable, with-uncertainty learning of this material density function is sought, so that prediction of the density at any point in the material block is possible, then one realises that the absence of training data needs to be addressed, in order to empower supervised learning of the material density function. At the second stage, we discuss the results of the dual-layered learning of this function, using the training set that is made available at the first stage.

4.3.1 Generating the Originally-Absent Training Data

To generate a training set that will help in the supervised learning of the material density function, we need to know values of the density at respective design points inside the material sample. Let a point in the bulk of the block of material sample, grown in the laboratory, be S ∈ S ⊆ R³. We set S = (X, Y, Z)^T in the basis {e_1, e_2, e_3}, with Z = z ∈ [0, z_max], X = x ∈ [−x_max, x_max], and Y = y ∈ [−y_max, y_max], for non-negative z_max, x_max, y_max. The Z = 0 plane is the surface of the material sample. The design locations, at each of which we will bespoke learn the value of the material density, are chosen s.t. each design value of S is the location of a vertex of a 3-dimensional grid that we place the cuboidally-shaped material block in. The mutually orthogonal edges of this material block, or sample, are s.t. each edge is parallel to a basis vector. This 3-dimensional grid is partitioned into "voxels", with details of this partitioning motivated by details relevant to the imaging of the material sample with a Scanning Electron Microscope (SEM). We discuss such details next, and return later to the discussion of the binning of the 3-D grid into voxels.

The SEM imaging works by letting a beam of electrons impinge upon the material sample at a given point (x, y, 0). Some of the beam electrons then enter the
material sample and interact atomistically with the material of the sample. As they move through the material bulk, the electrons lose energy while undergoing such atomistic interactions with the atoms of the material block. Thus, after travelling a distance r_interaction from the point (x, y, 0), the electrons can be modelled to have lost their energy, where [14] have offered the model

r_interaction = β(ε) := 0.0276 A ε^1.67 / (Z^0.89 d),

for a material distinguished by mass density d measured in gm/cm³, atomic number Z, and atomic mass A; here β(·) is a non-linear function of the energy ε of the electrons in the input electron beam. Thus, if the input energy ε of the beam electrons increases from 10 energy units to 11 energy units, the increase in the stopping length r_interaction is smaller than if the beam energy were increased from 11 to 12 energy units, since β(·) grows faster than linearly in ε. (The energy unit of keV is discussed below.) So, as the beam energy is increased in equi-distant steps, the increments in r_interaction grow monotonically. Indeed, the distribution of the energy of the beam electrons with s, as the electrons travel within the material bulk, is itself conditional on the sub-surface material density function, where the aim here is to learn that unknown density. Thus, we anticipate that summaries such as the expected stopping length will vary from one value of S to another, as the material density varies across sub-surface locations. However, we shall adopt a simple model that suggests an isotropic location distribution of the beam electrons within the material sample, after they have penetrated the surface at (x, y, 0) and interacted with the atoms of the material sample. In other words, the spatial distribution of the electrons inside the material sample, after impingement at (x, y, 0), is modelled as contained within a hemisphere centred at (x, y, 0), of radius r_interaction.
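The stopping-length model is straightforward to evaluate. The silver parameters below (A = 107.87, Z = 47, d = 10.49 g/cm³) are illustrative choices for a high-atomic-number material, not values taken from [2]:

```python
def stopping_length_um(eps_keV, A=107.87, Z=47, d=10.49):
    """beta(eps) = 0.0276 A eps^1.67 / (Z^0.89 d), in microns,
    for beam energy eps in keV and mass density d in g/cm^3."""
    return 0.0276 * A * eps_keV ** 1.67 / (Z ** 0.89 * d)

# Stopping lengths over the 10..20 keV range used below.
r = [stopping_length_um(e) for e in range(10, 21)]
# r grows monotonically with beam energy, and, because the exponent 1.67
# exceeds 1, consecutive increments r[j+1] - r[j] themselves grow.
```

For these illustrative parameters the stopping length at 10 keV is roughly 0.4 microns, growing to above a micron at 20 keV.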
We refer to this hemisphere as the "Interaction Volume", abbreviated as IV. Now, as the electrons interact with the atoms of the material sample, diverse types of radiation are generated; of these, the Back Scattered Electrons, or BSE [6, 34], are captured by the relevant imaging tool within the SEM, as the imaged or output datum received from the hemispherical IV created by the electron beam of energy ε that impinges on the material sample at (x, y, 0). As the electron beam sweeps across the surface of the material sample, targeting one location on the surface of the material block after another, the IV created following impingement at a given location is distinguished by this location. Thus, as with the IV, this captured image datum is also distinguished by the target coordinates and energy of the electron beam. In this isotropic model of the shape of the IV, this datum is considered to be informed by the material density inside the IV; in particular, we can state that this image datum carries information from the sub-surface depth of the energy-dependent stopping length that is given by β(ε). Indeed, it was to obtain information on the material density from different sub-surface depths that [2] designed a unique imaging experiment, in which beams of sequentially increasing energies were input at a given location on the surface, allowing image data from increasingly more voluminous IVs, and equivalently from increasing sub-surface depths, to be captured. Thus, when the electron beam is targeted at the i-th location on the surface of the material block, and
the beam has energy ε = ε_j, then the ij-th IV is created, and the ij-th BSE datum is captured; here i = 1, . . . , N_S; j = 1, . . . , N_ε. By scanning the surface of the material block at each of the N_S locations at ε = ε_j, the j-th image (in BSE) of the material sample is created, ∀ j = 1, . . . , N_ε. Then the j-th image carries information about the material density function from within a distance that we denote r^(j)_interaction = β(ε_j), from each of the N_S points on the z = 0 plane at which the electron beam of energy ε_j is input. Here j = 1, . . . , N_ε. This j-th image consists of N_S image data points, with each datum occupying a pixel in the image, where the image plane is partitioned into N_S squares, each of side δ. We refer to the j-th image, taken with a beam of energy ε_j, as the matrix W_j = [w^(j)_{mn}], where w^(j)_{mn} ∈ R≥0 is the mn-th element of W_j, i.e. the image datum in the mn-th pixel of the j-th image. Here m, n ∈ {1, . . . , √N_S}. Thus the 3-D grid that the material sample is placed in is uniformly gridded along e_1 and e_2, with a width of δ along either direction. However, the gridding along e_3 is non-uniform, to reflect the fact that the difference between consecutive stopping lengths grows as the beam energy increases, i.e. as information about the material density arrives into the captured image datum from a greater sub-surface depth. In this 3-D grid, there are N_ε partitions along e_3, with the j-th "layer" defined at Z = r^(j)_interaction. Any 3-dimensional (cuboidal) grid cell of this grid is referred to as a voxel, and a voxel is identified by the values of x and y of the bottom left-hand corner of its top surface, and the index of the "layer" that its bottom surface lies on. Thus, the voxel at X = x, Y = y, and in the j-th "layer", is the x, y, j-th voxel. Inside the IV created by impingement of a beam of energy ε_j at (x, y, 0), multiple voxels are typically included.
Some voxels are fully included within the IV, while varying fractions of other voxels fall inside this IV. No such voxel will be in a "layer" with index > j when the electron beam energy is ε_j. Also, each such voxel will be at a distance ≤ r^(j)_interaction from the point (x, y, 0). The hemispherical IV with radius r^(j)_interaction, centred at (x_m, y_n, 0), is denoted IV^(j)_{m,n}.

We denote the material density function at the sub-surface point (x, y, z) as ρ(x, y, z), where ρ(x, y, z) ≥ 0. In fact, the material density function at any point (x, y, z) inside IV^(j)_{m,n} convolves with a blurring function η(x, y, z) [9, 23, 25], s.t. three sequential, orthogonal and contractive projections of this convolution onto (x_m, y_n, 0) create the image datum w^(j)_{mn} [15]. This blurring function is the parametrisation of different kinds of physical processes, inclusive of those caused by the imaging itself, that distort the projection of the material density to be captured as the image datum w^(j)_{mn}. Indeed, it is typically referred to in the literature as the "microscopy correction function". Then we express the image datum obtained from the convolution of ρ(·, ·, ·) and η(·, ·, ·) as:

w^(j)_{mn} = P^(j)_{mn} (ρ ∗ η)(x, y, z),

where

P^(j)_{mn} = P_3^(mn) ◦ P_2^(j) ◦ P_1^(j),

with "◦" representing composition; P_1^(j) representing projection onto the plane z = 0; P_2^(j) representing projection onto a diameter of the circle of radius r^(j)_interaction; and P_3^(mn) the projection onto the point (x_m, y_n, 0). Then it is by undertaking 3 sequential inverse projections of the image data {W_1, . . . , W_{N_ε}} that we learn (ρ ∗ η)(x, y, z), ∀ x ∈ [−x_max, x_max], y ∈ [−y_max, y_max], z ∈ [0, r^(N_ε)_interaction], where ρ(·, ·, ·) and η(·, ·, ·) are unknown functions. P^(j)_{mn} is most easily identified in the spherical polar coordinate system, as a triple integral over the radial coordinate (with limits from 0 to r^(j)_interaction), the azimuthal coordinate (from 0 to 2π), and the polar coordinate (from 0 to π/2). However, a closed-form inversion of P^(j)_{mn} is not available.
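The triple-integral form of the projection can be approximated numerically. The sketch below uses a midpoint rule over a hemispherical IV in spherical polars (the function name, resolution and test integrand are all illustrative), and checks itself on an integrand whose integral is known:

```python
import math

def project_hemisphere(f, r_max, n=40):
    """Midpoint-rule approximation of the triple integral of f over a
    hemisphere of radius r_max: radial 0..r_max, azimuthal 0..2*pi,
    polar 0..pi/2, with the spherical Jacobian r^2 sin(theta)."""
    dr = r_max / n
    dphi = 2.0 * math.pi / n
    dth = (math.pi / 2.0) / n
    total = 0.0
    for i in range(n):
        rad = (i + 0.5) * dr
        for j in range(n):
            phi = (j + 0.5) * dphi
            for k in range(n):
                th = (k + 0.5) * dth
                x = rad * math.sin(th) * math.cos(phi)
                y = rad * math.sin(th) * math.sin(phi)
                z = rad * math.cos(th)          # depth below the surface
                total += f(x, y, z) * rad * rad * math.sin(th) * dr * dphi * dth
    return total

# Sanity check: integrating f = 1 recovers the hemisphere volume (2/3) pi r^3.
vol = project_hemisphere(lambda x, y, z: 1.0, 1.0)
```

The forward projection is thus cheap to compute; it is the closed-form inversion of the projection operator that is unavailable, which is why the inference proceeds by comparing forward projections against the image data.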

4.3.2 Underlying Stochastic Process

The functions ρ(·, ·, ·) and η(·, ·, ·) are sought, with the full correlation structure of these functions to be learnt within an inverse problem approach. It may be perceived that a way to do this reliably is to treat each sought function as random, and generate its value at various chosen values of the location variable S inside IV^(j)_{m,n}. Here the realisations of each function at the design locations are anticipated as modelled by a stochastic process in each iteration, and could in principle be identified while minimising the difference between the projection of the convolution of the currently sampled functions onto the space of observables, and the image datum w^(j)_{mn}. The immediate question that such an approach motivates concerns the underlying stochastic process that is invoked to sample a sought function, given the only data that appears available, namely the image data. After all, it is the correlation structure of such a process that will generate the correlation structure of a sought function. So it is imperative that the process parameters are learnt, and not input by hand in the absence of data. On the other hand, learning the parameters of the process demands training data comprising pairs of design locations and the functional values realised at those chosen locations; however, we have no data on the material density function at any chosen sub-surface location, and none on the microscopy correction function at any such location either. So learning the process parameters is not possible. But if the generating processes are not specified by the available data in any way, functions sampled from such processes will also not be informed by the data.

One suggestion is then to iterate over the parametrisation of the correlation structure of each of the two invoked processes, while sequentially (thrice) projecting the convolution of a pair of sampled functions (namely, the material density and microscopy correction functions) from each of the currently updated processes, to ultimately seek the divergence between the projection of the sample functions and the image datum. Thus, it is this divergence/distance, for a given beam impingement location and energy, that we envisage will inform on the invoked processes. However, such a strategy is not a possibility, since prediction of the mean outputs of a sample function at chosen sub-surface locations will require known values of the density (and microscopy) functions at other given locations, i.e. a training set will be required, which is absent. So we need to construct the absent training set first. Instead, we welcome a model in which one scalar-variate stochastic process models the tri-variate material density function, where we learn the correlation structure of such a process. Figure 4.1 presents a schematic diagram of the partitioning of the material sample into 3-dimensional voxels.

Fig. 4.1 Schematic diagram of the 3-dimensional grid that the material sample is placed in. Any point in this grid is S = (X, Y, Z)^T, in the basis {e_1, e_2, e_3}. Gridding along the directions of e_1 and e_2 is uniform, while that along e_3 is s.t. the difference in the value of Z between consecutive layers increases with Z. In fact, the stopping length, i.e. the Z-location of the deepest "layer" relevant to a given beam energy, is proportional to ε^1.67
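The non-uniform gridding along e_3 sketched in Fig. 4.1 can be written down directly from a set of stopping lengths. The numbers below are illustrative values in microns for 11 increasing beam energies, not learnt quantities:

```python
# Illustrative stopping lengths (microns) for 11 increasing beam energies;
# the j-th "layer" of the 3-D grid sits at Z = r[j].
r = [0.43, 0.51, 0.59, 0.67, 0.76, 0.85, 0.95, 1.05, 1.15, 1.26, 1.37]
delta = 0.05     # uniform gridding width along e1 and e2 (microns)

# Voxel (x, y, j) spans [x, x + delta] along e1, [y, y + delta] along e2,
# and the slab between consecutive "layers" along e3.
layer_thickness = [r[0]] + [r[j] - r[j - 1] for j in range(1, len(r))]
```

The layer thicknesses sum to the deepest stopping length, so the stack of slabs tiles the imaged depth exactly.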

4.3.3 Details of Existing Bespoke Learning

The primary modelling strategy that [2] adopted, to bespoke learn the material density function of this material sample, is to consider the material density inside any voxel to be an unknown constant, independent of the density in any other voxel. [2] learnt the value of the unknown density in any voxel, given the data {W_1, . . . , W_{N_ε}}. We denote the constant material density inside the voxel at x = −x_max + mδ, y = −y_max + nδ, z = r^(j)_interaction as ρ_{m,n,j} ≥ 0,
which is the discretised version of the material density function ρ(x, y, z) at sub-surface point (x, y, z). The microscopy correction function was modelled as dependent only on Z, and discretised again in the treatment of [2] to enable its bespoke learning. Thus, the microscopy correction function in voxels at X = x, Y = y and the j -th “layer” is denoted ηj , ∀x, y.

Then, with the sought functions discretised, the projection of their convolution over all locations that lie within a given IV is computed, and the result compared to the image datum obtained from this IV. Chakrabarty et al. [2] provide 3 distinct models of such a projection, distinguished by the stopping length at the highest beam energy used in the imaging, relative to the population of the IV by voxels. These are enumerated below.
• When the atomic number of the material under consideration is low, the stopping length is typically small; if 2r^(N_ε)_interaction ≤ δ, then any one IV is encased within a single voxel. The projection from such an IV is then easiest to compute.
• When the atomic number is low to moderately valued, there exists a threshold index j* s.t. 2r^(j)_interaction ≤ δ for j ≤ j*, but 2r^(j)_interaction > δ for j > j*. In such a case, at beam energies with index ≤ j*, computation of the projection is easy, as only one voxel contributes to the projection, while for beam energies higher than this threshold energy, the projection involves acknowledging the material density from multiple voxels.
• Lastly, for high atomic number elements (such as transition metals), each IV, even at the lowest beam energy used, typically includes multiple voxels. So the computation of the projection is the most difficult in this case.
In this application, we discuss the third case. Chakrabarty et al. [2] report the bespoke learning of the values of the sub-surface material density function at vertices of a chosen subset of voxels, via three sequential inversions of image data of a material sample prepared as a cuboidally-shaped nano-brick comprising Aluminium and Silver particles. The imaging is performed with an SEM, at 11 different values of the beam energy: ε = 10, 11, . . . , 20 keV. Here "keV" stands for kilo electron Volts, which is a unit of energy (1 keV = 1.6 × 10⁻¹⁶ Joules). In Fig. 4.2, 2 of the images of this material sample, taken at 12 keV and 19 keV, are shown. The other 9 images are not shown here.
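Which of the three enumerated models applies can be decided by comparing the IV diameters with the voxel width δ. With the illustrative silver-like stopping lengths used earlier (hypothetical values, not those of [2]), the third, multi-voxel case obtains:

```python
delta = 0.05    # voxel width, microns (set by the imaging resolution)
# Illustrative stopping lengths (microns) at the 11 beam energies:
r = [0.43, 0.51, 0.59, 0.67, 0.76, 0.85, 0.95, 1.05, 1.15, 1.26, 1.37]

if 2 * r[-1] <= delta:
    case = 1    # every IV fits inside a single voxel
elif 2 * r[0] <= delta:
    case = 2    # single-voxel IVs up to a threshold energy, multi-voxel beyond
else:
    case = 3    # every IV spans multiple voxels: the case treated here
```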
Two rectangularly-shaped areas inside each of these 11 images are focused upon, and these comprise the 2 image data sets:

W^(1) := {W_1^(1), . . . , W_11^(1)};  W^(2) := {W_1^(2), . . . , W_11^(2)}.


Fig. 4.2 SEM images of the nano-brick of Nickel and Silver, prepared in the laboratory, imaged at 11 distinct beam energies. Displayed figures are the images taken at beam energies of 12 keV (left) and 19 keV (right)

Image data W^(1) is produced when the beam impinges on N_1 locations on the surface of a sub-part of the material sample, referred to as sub-part P_1. The sub-part P_1 of the material sample is then placed within a 3-dimensional grid of size N_1 × N_1 × 11; this grid is uniform in its gridding along e_1 and e_2, but gridded non-uniformly along e_3 into 11 "layers". Again, image data W^(2) is produced by beam impingement on the surface of sub-part P_2 of the whole material sample. Sub-part P_2 is placed in an N_2 × N_2 × 11-sized 3-D grid.
1. Image data set W^(1) comprises 101 × 101 = 10201 pixels, obtained at a resolution, which fixes the value of δ, of 50 nm = 0.05 μm.
2. On the other hand, W^(2) comprises 41 × 41 = 1681 pixels, with δ = 0.05 μm.
3. Figure 9 from [2] shows the mean of the bespoke learnt values of the material density function, learnt by inverting W^(1), at the vertices of the 3-dimensional grid that sub-part P_1 is placed within. This grid extends from X = −2.5 to 2.5 μm along e_1, at intervals of δ = 0.05 μm; from Y = −2.5 to 2.5 μm, at intervals of δ = 0.05 μm along e_2; and from Z = 0 to 1.44 μm. Thus, there are N_1² = 10201 pixels in the image data W^(1).
4. Again, the same figure displays the bespoke learnt mean material density values at vertices of the 3-dimensional grid that sub-part P_2 is encased within, where image data W^(2) has been used. This grid extends from X = −1 to 1 μm at intervals of δ = 0.05 μm; from Y = −1 to 1 μm at intervals of δ = 0.05 μm; and from Z = 0 to ≈1.44 μm. There are N_2² = 1681 pixels in the image data W^(2).
5. Image data set W^(1) includes image data generated by beam impingement at surface locations denoted {(x_p, y_q, 0)}_{p=1;q=1}^{N_1,N_1}, with x_1 = −x_max, x_{N_1} = x_max, and x_max = 2.5 μm; y_1 = −y_max, y_{N_1} = y_max, and y_max = 2.5 μm.
6. Image data set W^(2) includes image data generated by beam impingement at surface locations denoted {(x_p, y_q, 0)}_{p=1;q=1}^{N_2,N_2}, with x_1 = −x_max, x_{N_2} = x_max, and x_max = 1 μm; y_1 = −y_max, y_{N_2} = y_max, and y_max = 1 μm.


Chakrabarty et al. [2] undertook the learning of the values of the material density function given these two image data sets W^(1) and W^(2). Within an iterative scheme, the currently updated material density parameters and microscopy correction parameters were convolved, and projected onto the image space, all within the discretised paradigm that [2] worked within. The likelihood of the unknown parameters, given the image data, was defined as a Gaussian in the image data obtained from a given IV, with mean given by the projection of the computed convolution obtained from that IV, and variance given by the noise in the image data due to details of the imaging technique. Then, together with the invoked priors, this likelihood paved the way for the formulation of the posterior probability density of the unknown parameters, given the image data. Strong priors on the microscopy correction function were elicited from the microscopy literature. Inference using Metropolis-within-Gibbs then provided samples from this joint posterior of all unknown parameters, given the image data, individually for each sub-part. Having computed the marginal posterior probability of each unknown given the data, the 95% Highest Probability Density credible region on each unknown is learnt.
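Given MCMC samples of any one unknown, a 95% HPD credible interval can be approximated as the shortest interval containing 95% of the samples. The standard Normal draws below merely stand in for posterior samples of an unknown:

```python
import numpy as np

def hpd_interval(samples, mass=0.95):
    """Shortest interval containing `mass` of the samples: an empirical
    approximation to a Highest Probability Density credible region
    (valid for unimodal marginals)."""
    s = np.sort(np.asarray(samples))
    n = len(s)
    k = int(np.ceil(mass * n))
    widths = s[k - 1:] - s[: n - k + 1]     # widths of all k-sample windows
    i = int(np.argmin(widths))              # index of the shortest window
    return s[i], s[i + k - 1]

rng = np.random.default_rng(2)
lo, hi = hpd_interval(rng.normal(0.0, 1.0, 20000))
# For a standard Normal, the 95% HPD interval is close to (-1.96, 1.96).
```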

This yields the set of bespoke learnt values of the material density function at selected sub-surface locations inside each of the two sub-parts of the material sample. The set of pairs of values of each such design sub-surface location and the material density function realised at this location is then the training data that was originally absent. Now that such a training set has been rendered available, via the bespoke learning that [2] undertook by invoking the relevant system Physics of this static system, we are in a position to undertake the learning of the material density function, to enable forward prediction of the density at any location, not just the design locations as chosen by Chakrabarty et al. [2]. To assess the strength of the methodology, we check how well such forward predictions at a subset of the design locations compare to the bespoke learnt density values at these locations.

4.4 Learning Correlation of Underlying GP and Predicting Sub-surface Material Density

As seen in the last sub-section, the bespoke learnt value of the material density at each design sub-surface location ∈ {(x_p, y_q, z_t)}_{p=1;q=1;t=1}^{N_1,N_1,11} contributes to the generated training data D_1 = {(x_p, y_q, z_t, ρ_{p,q,t})}_{p=1;q=1;t=1}^{N_1,N_1,11}, where each such design point is the location of the identifying vertex of a voxel inside the considered first sub-part P_1 of the material sample.


Similarly, the bespoke learnt values of the sub-surface material density function at the design locations {(x_p, y_q, z_t)}_{p=1;q=1;t=1}^{N_2,N_2,11} build the training data D_2 = {(x_p, y_q, z_t, ρ_{p,q,t})}_{p=1;q=1;t=1}^{N_2,N_2,11}, where each design point in D_2 marks the location of a vertex of the 3-D grid that the second sub-part P_2 of the material sample is placed in.

We discuss the learning of the material density function ρ(X, Y, Z) at a point S = (X, Y, Z)^T, where S is in the basis {e_1, e_2, e_3}. Here, ρ : R³ −→ R≥0. We will model ρ(·, ·, ·) as a random realisation from a GP, and will learn the correlation structure of this GP with training data that comprises the bespoke learnt values of the sub-surface material density at each of N_corr distinct locations s_1, . . . , s_{N_corr}. These N_corr design points are chosen as random values of S corresponding to the identifying vertices of N_corr different voxels included inside sub-part P_1 of the material sample prepared in the laboratory, the imaging of which was reported by Chakrabarty et al. [2]. Thus, the training data for the correlation learning is D_corr := {(x_{m_k}, y_{n_k}, z_{j_k}, ρ_{m_k,n_k,j_k})}_{k=1}^{N_corr}, where (x_{m_1}, y_{n_1}, z_{j_1}), . . . , (x_{m_{N_corr}}, y_{n_{N_corr}}, z_{j_{N_corr}}) are the randomly chosen locations of N_corr voxels within the 3-dimensional grid that the material sample is placed in. The material density function ρ : R³ −→ R≥0, being scalar-valued, is modelled using a scalar-variate Gaussian Process. Then the joint probability distribution of the N_corr realisations of the material density function is multivariate Normal, with a mean vector μ ∈ R^{N_corr} and an N_corr × N_corr-dimensional covariance matrix Σ = [Cov(ρ(s_k), ρ(s_{k′}))]. We standardise the training data D_corr using the sample mean and standard deviation, and invoke a zero-mean GP.
Then Σ reduces to the correlation matrix of the multivariate Normal density, and we learn this correlation matrix by parametrising it with a kernel. Following the dual-staged learning strategy that we motivate above, we parametrise the correlation function using an SQE-looking kernel, each of the length scale hyperparameters of which is a (generally distinct) function of the sample path:

    Corr(ρ(s_k), ρ(s_{k'})) = exp( −(s_k − s_{k'})^T L^{−1} (s_k − s_{k'}) ),

where the 3 × 3-dimensional diagonal matrix L is s.t. its three diagonal elements are ℓ_X, ℓ_Y, ℓ_Z, which are the three parameters that we learn from the data.
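As a concrete illustration, the correlation above is directly computable; the following minimal Python sketch (function names are ours, not from the text) evaluates this kernel, with the diagonal length-scale matrix L = diag(ℓ_X, ℓ_Y, ℓ_Z) represented by the vector `ell`:

```python
import numpy as np

def sqe_correlation(s1, s2, ell):
    """SQE-looking correlation between two 3-D locations s1 and s2,
    with diagonal length-scale matrix L = diag(ell):
    exp(-(s1 - s2)^T L^{-1} (s1 - s2))."""
    d = np.asarray(s1, float) - np.asarray(s2, float)
    return np.exp(-np.sum(d * d / np.asarray(ell, float)))

def correlation_matrix(S, ell):
    """N x N correlation matrix over the rows of S (one location per row)."""
    S = np.asarray(S, float)
    n = S.shape[0]
    K = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            K[i, j] = sqe_correlation(S[i], S[j], ell)
    return K
```

By construction the matrix has unit diagonal and is symmetric, as a correlation matrix must be.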

4.4.1 Non-parametric Kernel Parametrisation

Subsequent to the passing of t_0 iterations (time steps) of the MCMC chain that we use to undertake inference on the unknowns, we start modelling each length scale hyperparameter as a function of the sample path generated from the GP that models ρ(·,·,·). As a new sample path is generated from this GP at a new time step (iteration), by the dual-staged learning method discussed above, we set

    ℓ_D = g_D(T),   D = X, Y, Z,

where the time-step T takes values in {0, 1, 2, ...}. Here g_D : Z_{≥0} → R is an unknown function, and since any mapping from Z_{≥0} to R is trivially continuous, g_D(·) is continuous. Then, as delineated in the learning methodology above, g_D(·) is modelled with a stationary GP, the correlation structure of which is distinct for D = X, Y, Z. In fact,

    g_D(·) ∼ GP(μ_D(·), Ψ_D(·,·)).

We collate the values of ℓ_D that were sampled in the last t_0 iterations, counting from the current t-th iteration, to form the lookback data

    D^{(t)}_{lookback,D} := {(t − t_0, g^{(t−t_0)}_D), ..., (t − 1, g^{(t−1)}_D)}.

Here g^{(t)}_D is the sampled value of g_D(·) in the t-th iteration of the undertaken MCMC chain. In fact we standardise the t_0 samples of this function, each of which is the output obtained at a design time point, s.t. D^{(t)}_{lookback,D} includes standardised values of the sampled g_D(·) over the last t_0 iterations. Then we invoke a zero mean GP to model the function g_D(·). Since g_D(·) is continuous, a stationary GP suffices to model it; we will use an SQE-shaped kernel to parametrise the covariance function of this GP, where this SQE kernel is marked by the length scale hyperparameter δ_D and an amplitude parameter A_D. Here D = X, Y, Z. To clarify, the likelihood of the unknowns at iteration t, given the lookback data D^{(t)}_{lookback,D} at the current time point, is the joint density of the t_0 sampled values of g_D(·) that are included inside the training data D^{(t)}_{lookback,D}:

    L(δ_X, A_X, δ_Y, A_Y, δ_Z, A_Z | D^{(t)}_{lookback,X}, D^{(t)}_{lookback,Y}, D^{(t)}_{lookback,Z}) = ∏_{D=X,Y,Z} f( g_D(t − t_0), ..., g_D(t − 1) ),

with

    (g_D(t − t_0), ..., g_D(t − 1))^T ∼ MN(0, S_D),   ∀D = X, Y, Z,

where S_D = [s^{(D)}_{ij}], i, j = 1, ..., t_0, and

    s^{(D)}_{ij} = A_D exp( −(t_i − t_j)^2 / δ_D^2 ),   ∀D = X, Y, Z.
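The lookback covariance S_D and the resulting zero-mean Gaussian log-likelihood can be sketched as follows in Python (a minimal sketch under our own naming; the small jitter term is a standard numerical-stability device, an assumption of ours rather than part of the text):

```python
import numpy as np

def lookback_cov(times, delta, amp):
    """Covariance S_D over the t_0 lookback time points, with SQE kernel:
    s_ij = amp * exp(-(t_i - t_j)^2 / delta^2)."""
    t = np.asarray(times, float)
    dt = t[:, None] - t[None, :]
    return amp * np.exp(-(dt ** 2) / delta ** 2)

def log_likelihood(g_lookback, times, hypers):
    """Joint log-likelihood of the standardised lookback samples g_D,
    summed over D = X, Y, Z; hypers[D] = (delta_D, A_D)."""
    total = 0.0
    for D, g in g_lookback.items():
        delta, amp = hypers[D]
        S = lookback_cov(times, delta, amp)
        S = S + 1e-9 * np.eye(len(S))          # jitter (our assumption)
        sign, logdet = np.linalg.slogdet(S)
        alpha = np.linalg.solve(S, g)
        total += -0.5 * (g @ alpha + logdet + len(g) * np.log(2.0 * np.pi))
    return total
```

Evaluating this likelihood at proposed (δ_D, A_D) values is the inner step of the posterior sampling described next.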

So we compute the posterior of these six unknown hyperparameters by inputting this likelihood and suitable priors into Bayes' rule. Inference on these six unknown parameters is undertaken via posterior sampling, which we carry out using Metropolis-Hastings. The data that this posterior is conditional on comprises the past few samples generated from the posterior, so the data employed in the learning varies with iterations, until the MCMC chain achieves convergence. The amplitude parameters are proposed from truncated Normal densities, the mean of which is the current value of the parameter, and the variance of which is experimentally decided upon. The length scale parameters are proposed from Normal proposal densities. Non-peaky Gaussian priors, centred at the seeds used in the run, are invoked.
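One such update can be sketched as below, for a single (δ_D, A_D) pair. This is a hypothetical minimal implementation, not the authors' code: the step sizes are placeholders, `log_post` stands for the log of the unnormalised posterior, and the Hastings correction for the positive-truncated amplitude proposal is made explicit:

```python
import math
import random

def norm_cdf(x):
    """Standard Normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def mh_step(theta, log_post, step_len=0.05, step_amp=0.05):
    """One Metropolis-Hastings update of (delta, amp): delta proposed from a
    Normal, amp from a Normal truncated to (0, inf), with the Hastings
    correction for the asymmetric amplitude proposal."""
    delta, amp = theta
    delta_new = random.gauss(delta, step_len)
    # truncated-Normal proposal for the amplitude: resample until positive
    amp_new = random.gauss(amp, step_amp)
    while amp_new <= 0.0:
        amp_new = random.gauss(amp, step_amp)
    # log q(amp | amp_new) - log q(amp_new | amp) for the truncated proposal
    log_hastings = (math.log(norm_cdf(amp / step_amp))
                    - math.log(norm_cdf(amp_new / step_amp)))
    log_alpha = log_post(delta_new, amp_new) - log_post(delta, amp) + log_hastings
    if random.random() < math.exp(min(0.0, log_alpha)):
        return (delta_new, amp_new), True
    return (delta, amp), False
```

In a full run, three such pairs (one per D = X, Y, Z) would be updated at each iteration against the joint likelihood above.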


This MCMC-based inference on these six unknown parameters results in their current values, which are employed to predict the mean and variance of ℓ_X, ℓ_Y, ℓ_Z at the current time t, i.e. in the t-th iteration. Such closed-form mean and variance predictions are possible because we had invoked GPs to underlie g_X(·), g_Y(·), g_Z(·) respectively, where these functions are respectively equal to ℓ_X, ℓ_Y, ℓ_Z. However, it is difficult to interpret uncertainties in ℓ_D if a new value of the variance is generated at each iteration. In light of this, in the application included here, we choose to compute the sample means of the δ_D, A_D parameters over the post-burnin iterations, and employ these to inform the correlation structure of the GP that underlies g_D(·), to then predict the mean and variance of ℓ_D, ∀D = X, Y, Z.

The joint distribution of the sampled function values that reside within the training data, and the unknown values of the function realised at test time points t^{(test)}_1, ..., t^{(test)}_N, is multivariate Normal, with a mean that is the null vector, and a variance-covariance matrix that is a block matrix comprising 4 matrices:
• A t_0 × t_0-dimensional matrix K(train, train), the ij-th element of which is the correlation between the function g_D(·) sampled at time t_i and that at t_j, for t_i, t_j = t − t_0, ..., t − 1.
• A t_0 × N-dimensional matrix K(train, test), the ij-th element of which is the correlation between g_D(·) sampled at the i-th training time and that at the j-th test time, for i = 1, ..., t_0; j = 1, ..., N.
• An N × t_0-dimensional matrix K(test, train), the ij-th element of which is the correlation between g_D(·) sampled at the i-th test time and that at the j-th training time, for i = 1, ..., N; j = 1, ..., t_0.
• An N × N-dimensional matrix K(test, test), the ij-th element of which is the correlation between g_D(·) sampled at the i-th and the j-th test times, for i, j = 1, ..., N.

Following on from this general result, the predictive distribution computed at a test time given by the current time t, is Normal. Thus, in our application, N = 1 and t^{(test)}_N = t. The mean of this Normal predictive density is the sought predicted mean of ℓ_D in the t-th iteration, given as

    ℓ̄^{(t)}_D = K(test, train)[K(train, train)]^{−1} (g_D(t − t_0), ..., g_D(t − 1))^T.   (4.1)

The predicted variance is given as

    var^{(t)}_D = K(test, test) − K(test, train)[K(train, train)]^{−1} K(train, test).   (4.2)

Here D = X, Y, Z. Thus, a mean value and a variance value of ℓ_D are generated at each iteration.

Then, following convergence of the MCMC chain, we compute the sample mean of the ℓ̄_D parameter, and the sample mean s̄²_D of the variance of ℓ_D, over the post-burnin iterations, for D = X, Y, Z. Then we predict the mean and variance of the sub-surface material density function ρ(·,·,·) at N_test test locations, by computing the correlation between a pair of sampled values of ρ(·,·,·) at two given input locations, using the computed mean values of ℓ_X, ℓ_Y, ℓ_Z, with noise in the used correlation structure of the GP that generates ρ(·,·,·) given by the mean variance in these length scale parameters. Such noise affects the predicted mean density at a test location (as per the generic result expressed in Eq. 4.1) as the addition of the inter-training correlation matrix to a 3 × 3-dimensional diagonal matrix with diagonal elements s̄²_X, s̄²_Y, s̄²_Z.

We could also have predicted ρ(·,·,·) at each of the N_test test locations, at every iteration of the MCMC chain, following the prediction of the ℓ_D parameters. However, this would have required N_test additional predictions (and therefore matrix inversions) per iteration, over all the post-burnin iterations, which slows down the prediction algorithm. More importantly, we would ideally wish to maintain the fast prediction of the material density at N_test sub-surface locations as a modular endeavour, not enmeshed within the learning of the correlation structure, s.t. when predictions at a different set of test locations are sought, we need not learn the correlation structure of the underlying GP again. In fact, the learnt correlation is advanced as one that underlies the material density function over the whole tri-variate domain, i.e. the space that hosts any sub-surface location inside the material sample for which we aim to learn the density function. Indeed we will be predicting, and forecasting, density values at multiple values of the location variable S = (X, Y, Z)^T within this space. So to ensure fast prediction, we need one algorithm for learning the correlation, and another to perform predictions/forecasting.
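Equations 4.1 and 4.2 are the standard GP predictive equations, and can be sketched directly in numpy (our own function names, not from the text; the kernel blocks are passed in pre-computed):

```python
import numpy as np

def gp_predict(K_test_train, K_train_train, K_test_test, y_train):
    """Closed-form GP predictive mean and covariance, per Eqs. 4.1-4.2:
    mean = K(test,train) [K(train,train)]^{-1} y_train
    cov  = K(test,test) - K(test,train) [K(train,train)]^{-1} K(train,test).
    Linear solves are used in place of an explicit matrix inverse."""
    alpha = np.linalg.solve(K_train_train, y_train)
    mean = K_test_train @ alpha
    V = np.linalg.solve(K_train_train, K_test_train.T)
    cov = K_test_test - K_test_train @ V
    return mean, cov
```

For a test point that coincides with a training point of a noise-free kernel, the predictive mean reproduces the training value and the predictive variance collapses to (numerically) zero, which is a convenient sanity check.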

• We model the sub-surface material density function of the whole material sample, as a random function generated by a generic Gaussian Process, and learn the correlation structure of this underlying GP, by employing a subset of the training set .D1 . This is discussed in Sect. 4.2.1. In all our predictions—both made within sub-part .P1 and sub-part .P2 —the correlation structure of the underlying GP that generates the material density function, is learnt using 300 randomly selected design locations inside sub-part .P1 .

• Subsequently, we input this learnt correlation structure of this GP to predict the mean material density at test sub-surface locations that are some of the remaining design points of D_1. Comparison of the predicted density at such test locations with the bespoke learnt density values at these locations will indicate the strength of the method. This is discussed in Sect. 4.3.1.
• We will also employ this learnt GP correlation parametrisation to predict the mean material density on the z = r^{(j)}_{interaction} plane inside sub-part P_2, ∀j = 2, ..., 11, where we assume knowledge of the material density function in the z = 0 plane, i.e. the top surface of this sub-part. It is possible, by certain independent imaging techniques (such as imaging with Secondary Electrons) or topographical probing techniques, to have information on the material density at the surface of the material sample. We illustrate in Sect. 4.4.2 that our learning of the correlation structure of the generative GP permits successful prediction of material density at sub-surface "layers". We will present a comparison of the predicted material density function and the bespoke learnt values of this function at the X and Y values of the vertices of grid-cells in each of the 10 sub-surface "layers" of sub-part P_2.

Figure 4.3 displays plots, against iteration index, of the 3 length scale parameters ℓ_X, ℓ_Y, ℓ_Z of the covariance kernel that parametrises the covariance matrix Σ of the GP that is invoked to model ρ(·,·,·). However, in this dual-staged treatment of the correlation structure of the GP, ℓ_X, ℓ_Y, ℓ_Z are not constants, but (unknown) random functions of the iteration index, with each such function learnt as modelled by a stationary GP, the hyperparameters of which we also learn. Then, as the hyperparameters of the 3 invoked stationary GPs attain convergence, the stationary GPs that are nested within the mother GP are learnt, within uncertainties driven by the inference undertaken with MCMC. The uncertainty in the realisation of ℓ_D is then due to the variance in the sampling from the respective stationary GP, for D = X, Y, Z. Thus, the plot of ℓ_D against iteration index can indicate trendlessness across iterations, within fluctuations that owe to the sampling variance. This trendlessness is noted in these plots of ℓ_D with iteration, once convergence in the GP hyperparameters is attained.

However, if we try to learn ℓ_D modelled as an unknown constant, using MCMC, then such inference (given the correlation inhomogeneities borne by the relevant training data) is highly sensitive to the seed values of the parameters used to start the chain, and mixing is acutely difficult. The chains get stuck quickly, and at wrong answers, where the error of these inferred-upon values of ℓ_D is evident on examining predictions of the material density function made at test locations. It is clear that the correlation structure of the data in this application is inhomogeneous enough to invalidate the treatment of the length scales ℓ_X, ℓ_Y, ℓ_Z of the mother GP as unknown constants.

Histogram representations of the length scale hyperparameters δ_X, δ_Y and δ_Z of each of the 3 stationary GPs, each of which is invoked to model one of ℓ_X, ℓ_Y, ℓ_Z, are included in Fig. 4.3.

Fig. 4.3 Top: plots of ℓ_X, ℓ_Y, ℓ_Z against iteration index, from the model in which 3 stationary GPs (each of which generates the function that takes the iteration index as its input, to output ℓ_D) are nested inside a GP that is invoked to generate the material density function, where the covariance function of said GP is parametrised by a non-parametric kernel; here D = X, Y, Z. Middle: plots of ℓ_X, ℓ_Y, ℓ_Z against iteration index, from the model in which ℓ_D is the unknown global constant length scale parameter of the GP that is modelled to generate the material density function. Inference is carried out using Metropolis-Hastings, and the training data used in this exercise comprises bespoke learnt density values at 300 randomly chosen voxel locations from inside sub-part P_1. The same training set is relevant for the results displayed in the top row. Bottom: histogram approximations of the marginals of the length scale hyperparameters δ_X, δ_Y, δ_Z of the 3 nested stationary GPs that model the respective length scale hyperparameters ℓ_D, D = X, Y, Z

4.4.2 Predictions

We employ the bespoke learnt values of material density in N_train randomly located voxels that are marked by X = x_{p_k} ∈ {x^{(bl)}_1, ..., x^{(bl)}_m}, where x^{(bl)}_k ∈ {−x_max, −x_max + δ, ..., x_max}, ∀k = 1, ..., m. Let these N_train randomly chosen locations be denoted {(x_{p_k}, y_{q_k}, z_{j_k})}_{k=1}^{N_train}; let the bespoke learnt material density at location (x_{p_k}, y_{q_k}, z_{j_k}) be ρ_{p_k,q_k,j_k}.


Fig. 4.4 Schematic diagram to indicate 2 of the 3 prediction exercises that are undertaken as empirical illustration of the supervised learning of the sub-surface material density function, following bespoke learning of the values of this function at design points (that are the vertices of the grid cells of the 3-dimensional grid that the material sample is partitioned into). The mean bespoke learnt values of the material density function at N_train different random locations are used to perform the prediction of the material density at N_test locations, following learning of the correlation structure of the GP that generates the discontinuous material density function. The N_train bespoke learnt density values are at locations that lie on m distinct planes, outlined in red, that are generically given as X = x^{(bl)}_1, ..., x^{(bl)}_m, where x^{(bl)}_k ∈ V_X, with V_X := {−x_max, −x_max + δ, ..., x_max}. Bespoke learnt values of density at X values in {x : x ∈ V_X, x ≠ x^{(bl)}_k ∀k = 1, ..., m}, shown as unfilled rectangles outlined in blue, are not used to inform the training set, using which predictions at N_test random locations at X = x_test are undertaken. The X = x_test plane is shown as a filled rectangle

Then, using the training data D_train = {(x_{p_k}, y_{q_k}, z_{j_k}, ρ_{p_k,q_k,j_k})}_{k=1}^{N_train}, we predict the mean and variance of the material density function at N_test random test locations marked by X = x_test, given as {(x_test, y_{q_k}, z_{j_k})}_{k=1}^{N_test}. Thus, all the test locations share the same value of X, and are represented on the X = x_test plane inside the 3-dimensional grid that the material sample is placed in, where the location vector is in the declared basis {e_1, e_2, e_3}. Since the material density function ρ(X, Y, Z) is underlined by a GP, prediction of the mean and variance of the functional value at a test location is closed-form. It is clarified that the test locations are not part of the training set. In fact, there exist multiple planes given by X = x, where x ∉ {x^{(bl)}_1, ..., x^{(bl)}_m}, s.t. this x is not an X-coordinate value of the locations included within the training set. Needless to say, one such X-coordinate value that is precluded as a design location is x_test. The choice of the X values that are included as design locations in the training set, and of those that are test locations, is schematically presented in Fig. 4.4.
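The partitioning of the bespoke learnt records into such a training set (which precludes chosen X planes) and a test set on the X = x_test plane can be sketched as below. This is a hypothetical helper of our own devising, not from the text; `locations` is an n × 3 array of (x, y, z) design points:

```python
import numpy as np

def split_by_plane(locations, densities, x_test, excluded_x=()):
    """Split bespoke learnt (location, density) records into a training set
    that precludes the X = x_test plane (and any further excluded X planes),
    and a test set of locations lying on X = x_test."""
    X = locations[:, 0]
    test_mask = np.isclose(X, x_test)
    train_mask = ~test_mask
    for x in excluded_x:
        train_mask &= ~np.isclose(X, x)
    return (locations[train_mask], densities[train_mask],
            locations[test_mask], densities[test_mask])
```

Keeping the bespoke learnt densities of the test set alongside its locations is what permits the comparison of predicted against known density values.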

• In all our predictions (both made within sub-part P_1 and sub-part P_2), the correlation structure of the underlying GP that generates the material density function is learnt using 300 randomly selected design locations inside sub-part P_1.
• To predict the density at random locations on the X = x_test plane inside sub-part P_1, we use N_test = 451. These predictions are performed using random locations on the X = x_k planes that preclude the X = x_test plane, and additionally preclude locations with X = x_{k'} as well; here x_k ≠ x_{k'}; x_{k'}, x_k ∈ {−x_max, −x_max + δ, ..., x_max}.
• Similarly, to predict inside sub-part P_2, the training set is ∼3500 data points in size, and precludes locations marked by diverse values of X, in addition to locations on X = x_test. To put this in context, the total number of voxels in this sub-part is 101 × 101 × 11. The number of locations at X = x_test at which density predictions are made in this sub-part is the number of voxels that exist in this grid at X = x_test, namely N_test = 1111.

Figures 4.5 and 4.6 present results of predictions of mean density, made at test locations that lie on a plane X = x, for two given x, where such x values lie within sub-part P_1. The predictions are presented at varying values of Y and Z on the given X = x plane. We also know the bespoke learnt values of the material density at these locations. Such values are smoothed over by the plotting package (GNUPlot), and the resulting image is not expected to exactly match the predicted material density image on the X = x plane. The image of the bespoke learnt densities is included as an indicator of the density structure at the various locations in the X = x plane. The results of our prediction of the mean value of density are plotted (at various locations in the X = x plane) in the left panels of these figures. Figures 4.7 and 4.8 present results of predictions made on the basis of learning using training data collated from sub-part P_2. These predictions are made at test locations that lie on the X = x plane, for distinct choices of x. Distinguished from these results on prediction, Fig. 4.9 presents results on forecasting the density at various locations that lie within the P_2 sub-part.

Fig. 4.5 Left: mean predicted values of the learnt material density function ρ(·,·,·) at 451 test locations on the X = 0.5 plane in sub-part P_1, i.e. at 451 sub-surface locations (0.5, y, z) in P_1, where the values of the Y and Z variables coincide with the Y and Z coordinates of those voxels (comprising the 3-dimensional grid that the material sample is placed within) that exist at X = 0.5. The surface plot of the predicted density values at the X = 0.5 plane is supplemented by the contour plot of the same, drawn on the base. In this sub-part, values of Y extend from −1 to 1. Values of Z extend from 0 to about 1.44 in this material sample. A training data set that includes 1748 random locations, precluding locations at X = 0.5 and at X ∈ [0.4, 0.5) ∪ [0.55, 0.6], is the one employed to make these predictions. Right: bespoke learnt values of the material density function at the 451 identifying vertices of voxels that exist at locations marked by X = 0.5, within the 3-dimensional grid that encases sub-part P_1. A direct comparison of the predictions arising out of the learnt density function at these locations, and the bespoke learnt density values at the same locations, i.e. the known density, is undertaken, while maintaining the same plotting parameters for the surface (and contour) plot given the set of discrete values of the predicted material density at the selected locations, as well as for the plots given the set of bespoke learnt values of the density at the same locations

Fig. 4.6 Plots of the mean predicted material density function at 451 test locations on the X = −0.35 plane in sub-part P_1 of the material sample. A training data set that includes 1525 random locations, precluding locations at X = −0.35 and at X ∈ [0.6, 1], is the one employed to make these predictions. Right: bespoke learnt values of the material density function at the 451 identifying vertices of voxels that exist at locations marked by X = −0.35, within the 3-dimensional grid that encases sub-part P_1. In every other respect, this figure is the same as Fig. 4.5

Fig. 4.7 Plots of the mean predicted material density function at 1111 test locations on the X = −1.2 plane in sub-part P_2 of the material sample. A training data set that includes 3967 random locations, precluding locations at X = −1.2, is the one employed to make these predictions. Right: bespoke learnt values of the material density function at the 1111 identifying vertices of voxels that exist at locations marked by X = −1.2, within the 3-dimensional grid that encases sub-part P_2. In every other respect, this figure is the same as Fig. 4.5

4.4.3 Forecasting

In addition to predicting the material density function at locations that lie surrounded by other locations at which the density value is known, we also perform forecasting of the density at locations that are bordered only on one side by locations at which the density is known. In particular, we perform forecasting of the full depth structure of the material density function in sub-part P_2, using the correlation structure of the GP, learnt using the training data D_corr, which comprises locations in P_1. We consider the material density in the z = 0 "layer" of sub-part P_2 known. As stated above, there are topographic imaging techniques, or Secondary Electron imaging, available that can inform on the density at the z = 0 "layer". We assume here that the bespoke learnt densities at the 10201 voxels in the topmost "layer" are rendered available via one such technique that images the top layer of the material sample. Then, knowing the discrete values of the bespoke learnt material density at randomly chosen voxels at z = 0, we forecast the density at N_test = 3378 randomly selected locations within the 2nd "layer" from the top. The training data used for this forecasting is {((x_i, y_j, 0), ρ_{i,j,1})}_{i,j=1}^{N_1}. Subsequent to this, the material density in the N_1 voxels within the 1st "layer", and the forecast density at the N_test locations within the 2nd "layer", are used to forecast the density value at N_test randomly chosen locations in the 3rd "layer". In this way, one-step-ahead forecasting is undertaken, at N_test randomly chosen sub-surface locations, for all relevant layers, up to the 11-th layer. We collate the values of ρ(·,·,·) forecast at these N_test × 10 locations, augmented by the known density at the N_1 (voxel) locations in the top "layer". From the set of such locations, we randomly select a subset of 2000 locations s_1, ..., s_2000, along with the density values forecast/known at each such selected location, to build a training set. Then this training set is employed to predict density at 2000 randomly chosen voxel locations inside P_2. We ensure that no location that is included in the composed training data is replicated in the (voxel) locations at which we perform such prediction. Results of this prediction are compared to the bespoke learnt values of the material density function at each such voxel location. The same correlation structure that was employed for the forecasting is used to predict the density at these 2000 random locations inside sub-part P_2.
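The layer-by-layer, one-step-ahead loop can be sketched as follows. This is a minimal sketch under assumed names, not the authors' code: we use a plain SQE kernel with fixed length scales `ell` in place of the learnt correlation structure, augment the training set with each newly forecast layer as described in the text, and the jitter and seeded random generator are our own assumptions:

```python
import numpy as np

def sqe(a, b, ell):
    """SQE correlation between the rows of a and the rows of b."""
    d = a[:, None, :] - b[None, :, :]
    return np.exp(-np.sum(d * d / ell, axis=-1))

def forecast_layers(xy, z_levels, rho_top, ell, n_test=None, rng=None):
    """One-step-ahead forecasting of density, layer by layer: the known top
    layer (z = z_levels[0]) informs the 2nd layer, whose forecasts then join
    the training set to inform the 3rd layer, and so on to the last layer."""
    rng = rng or np.random.default_rng(0)
    n_test = n_test or len(xy)
    known_loc = np.column_stack([xy, np.full(len(xy), z_levels[0])])
    known_rho = np.asarray(rho_top, float)
    forecasts = []
    for z in z_levels[1:]:
        idx = rng.choice(len(xy), size=min(n_test, len(xy)), replace=False)
        test_loc = np.column_stack([xy[idx], np.full(len(idx), z)])
        K = sqe(known_loc, known_loc, ell) + 1e-8 * np.eye(len(known_loc))
        Ks = sqe(test_loc, known_loc, ell)
        mean = Ks @ np.linalg.solve(K, known_rho)   # GP predictive mean
        forecasts.append((z, test_loc, mean))
        # augment the training set with the newly forecast layer
        known_loc = np.vstack([known_loc, test_loc])
        known_rho = np.concatenate([known_rho, mean])
    return forecasts
```

Note that the training set, and hence the kernel matrix to be solved, grows with depth; in the application above this growth is bounded by forecasting at only N_test of the voxels per layer.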

Fig. 4.9 This figure is the same as Fig. 4.7, except that on the right, we display the surface plot of the bespoke learnt values of the material density function at 2000 randomly chosen vertices of voxels that exist within the 3-dimensional grid that encases sub-part P_2. The left panel includes plots of the mean predicted material density function at these 2000 randomly chosen locations, which serve as the test locations. The training data set that is used for these predictions is one that is built via one-step-ahead forecasting of the values of the material density function at 3378 randomly chosen (out of 10201) voxel locations in the 2nd "layer" of Z values, sequentially, up to the last 3378 randomly chosen locations in the 11-th "layer", having started from the known (i.e. bespoke learnt) values of the density at the locations that identify the 10201 voxels that exist in the top-most "layer". Thus, the comparison of the left and right plots is reflective of the strength of the method in forecasting and predicting, while the strength of the predictive power of this dual learning strategy is brought home via the previous four figures as well. It is to be emphasised that the correlation structure employed in this sub-part, namely P_2, was learnt using the training data of known (i.e. bespoke learnt) density values at 300 randomly sampled voxel locations in sub-part P_1

Figure 4.9 is strongly indicative of the efficacy of this learning strategy. Treating the bespoke learnt density function values as the data, we see that this tri-variate function is sought conditional on data that bears a highly discontinuous correlation structure: the inter-location correlation varies throughout the sub-surface extent of the material sample. Learning this correlation function is therefore a challenge, and it is clear that the stochastic process that underlies the sought scalar-valued density function is required to be non-stationary. By using the 3 scalar-variate stationary GPs, nested within a (scalar-variate, in this application) GP, we successfully learn the hyperparameters of the non-parametric kernel that parametrises the correlation function of this latter GP. In an upcoming contribution (Roy & Chakrabarty), the equivalence of this non-parametric kernel and a non-stationary correlation will be established. Convergence is attained by the MCMC chain that is run while sampling from this set of nested GPs. These hyperparameters, when input into the GP correlation structure, offer the inter-location correlation anywhere inside the material sample.


4.5 Conclusions

In this chapter we discussed a novel usage of bespoke learning to generate the originally-absent training data, in order for such a training set to then be employed to enable the learning of a discontinuous function. This sought function is modelled as a random function that is a realisation from a GP, the correlation structure of which is parametrised by a non-parametric kernel. This is rendered possible within a dual-layered learning strategy, in which each hyperparameter of this GP is itself treated as an unknown function of the GP sample path, where a new sample path is generated at each iteration of the MCMC chain that is employed to undertake inference on the unknowns. Then each hyperparameter is an unknown function of the iteration number, or time step index, where each such unknown function is modelled as realised from a respective GP, each of which can be proved to be stationary. The inhomogeneity-bearing correlation structure of the data is captured by compounding multiple stationary GPs in the inner layer with one mother GP in the outer layer. We presented an empirical illustration of this dual-staged learning on the learning of the material density function within the bulk of a material sample that has been imaged with a Scanning Electron Microscope. Indeed, it would have been possible to compare the results displayed here against results obtained using other kernels, and such an in-depth comparative study is retained for a future contribution. The focus of this chapter, as in the two previous chapters, has been an illustration of bespoke learning (here in a static system), followed by supervised learning that is shown to avail of the (originally-absent) training set that is generated via such bespoke learning. A detailed scrutiny of the supervised learning is not within the scope of these chapters.
The non-destructive learning of material density at sub-surface locations is difficult, and the novel aspects of the bespoke learning discussed in this chapter are included in the existing literature [2], referenced above. The bespoke learnt material density at chosen sub-surface locations inside the considered material sample is obtained via the methodology advanced by Chakrabarty et al. [2]. This method works by the multiple and sequential inversion of the image data that is accrued by imaging the material sample at multiple beam energies, even when the electron beam (relevant to imaging with an SEM) impinges on the same point on the surface of the material sample. In this chapter we discussed the implementation of this bespoke learnt training set in the dual-staged supervised learning of the material density function at sub-surface locations inside the material sample. Predictions are made at selected test locations, using training data that are a subset of all locations at which density is bespoke learnt. The test locations are s.t. bespoke learnt density values are available at these locations, and this ensures that we can undertake a comparison of the known density values at such test locations with the predicted density there. If the predicted density values tally well with the known (i.e. the bespoke learnt) values, confidence in the learning methodology is encouraged. We also perform one-step-ahead forecasting of the density at discrete depths under the sample surface, to learn the full depth profile of the material density function, starting with information only on the density in the topmost layer that lies just under the surface.

We noted early in this chapter that the relevance of bespoke learning to static systems is more system-specific than in dynamical systems. It is such system-specific Physics that is invoked to formulate a model of the system property, a transformation of which, at a given design point, is observable. Then a likelihood is motivated in terms of the distance between such a transformation of this system property and the available observation. Within the paradigm of Bayesian inference using MCMC techniques, robustness to variation in the parametric form of the likelihood is appreciated. Indeed, once the likelihood is formulated, the posterior probability density of the unknown system property is expressed, having invoked priors relevant to the existing information. The path to inference on the unknown system property, via MCMC-based posterior sampling, is then laid. Once the system property is bespoke learnt at chosen design points, a training data set is effectively generated. Using such training data, supervised learning of the system property as a function of the input variables is possible, and a generic methodology for undertaking such supervised learning was presented in this chapter.

References

1. Peter J. Basser and Sinisa Pajevic. A normal distribution for tensor-valued random variables: applications to diffusion tensor MRI. IEEE Transactions on Medical Imaging, 22(7):785–794, 2003.
2. Dalia Chakrabarty, F. Rigat, N. Gabrielyan, R. Beanland, and Shashi Paul. Bayesian density estimation via multiple sequential inversions of 2-D images with application in electron microscopy. Technometrics, 57(2):217–233, 2014.
3. S. Chib and E. Greenberg. Understanding the Metropolis–Hastings algorithm. The American Statistician, 49(4):327, 1995.
4. G. R. Coates, L. Xhao, and M. G. Prammer. NMR Logging: Principles & Applications. Halliburton Energy Services Publication H02308, Houston, 1999.
5. G. R. Davis, S. E. P. Dowker, J. C. Elliott, P. Anderson, H. S. Wassif, A. Boyde, A. E. Goodship, S. R. Stock, and K. Ignatiev. Non-destructive 3D structural studies by X-ray microtomography. Advances in X-ray Analysis, 45:485, 2002.
6. S. L. Erlandsen, P. T. Macechko, and C. Frethem. High resolution backscatter electron (BSE) imaging of immunogold with in-lens and below-the-lens field emission scanning electron microscopes. Scanning Microscopy, 13(1):43, 1999.
7. T. Evgeniou, C. A. Micchelli, and M. Pontil. Learning multiple tasks with kernel methods. Journal of Machine Learning Research, 6:615–637, 2005.
8. W. R. Gilks and G. O. Roberts. Strategies for improving MCMC. In W. Gilks, S. Richardson, and D. Spiegelhalter, editors, Markov Chain Monte Carlo in Practice, Interdisciplinary Statistics, pages 89–114. Chapman and Hall, London, 1996.
9. Joseph Goldstein, Dale E. Newbury, David C. Joy, Charles E. Lyman, Patrick Echlin, Eric Lifshin, Linda Sawyer, and J. R. Michael. Scanning Electron Microscopy and X-ray Microanalysis. Springer Science+Business Media, New York, 2003.
10. T. Hastie and R. Tibshirani. Generalized Additive Models. Chapman and Hall, London, 1990.
11. K. F. J. Heinrich and D. E. Newbury. Electron Probe Quantitation. Springer, New York, 1991.
12. C. Hellier. Handbook of Nondestructive Evaluation. McGraw-Hill, New York, 2001.


13. I. M. Johnstone and D. M. Titterington. Statistical challenges of high-dimensional data. Philosophical Transactions of the Royal Society London A, 367(1906):4237–4253, 2009.
14. K. Kanaya and S. Okamaya. Journal of Physics D: Applied Physics, 5:43, 1972.
15. K. Lindberg. Contractive projections in continuous function spaces. Proceedings of the American Mathematical Society, 36(1):97–103, 1972.
16. Neil D. Lawrence. What is machine learning?, 2019. http://inverseprobability.com/talks/notes/what-is-machine-learning-ashesi.html.
17. J. W. Lee, W. B. Park, B. Do Lee, et al. Dirty engineering data-driven inverse prediction machine learning model. Scientific Reports, 10:20443, 2020.
18. R. E. Lee. Scanning Electron Microscopy and X-ray Microanalysis. Prentice-Hall, New Jersey, USA, 1993.
19. H. K. Liaw, R. Kulkarni, S. Chen, and A. T. Watson. Characterization of fluid distributions in porous media by NMR techniques. Journal of Materials, Interfaces, and Electrochemical Phenomena, 42:538–546, 1996.
20. Mauricio A. Álvarez and Neil D. Lawrence. Computationally efficient convolved multiple output Gaussian processes. Journal of Machine Learning Research, 12(41):1459–1500, 2011.
21. Mauricio A. Álvarez, Lorenzo Rosasco, and Neil D. Lawrence. Kernels for vector-valued functions: a review. Foundations and Trends in Machine Learning, 4(3):195–266, 2012.
22. K. Mayer, P. K. Chinta, K. Langenberg, and M. Krause. Ultrasonic imaging of defects in known anisotropic and inhomogeneous structures with fast synthetic aperture methods. In Proceedings of the 18th World Conference on Non-Destructive Testing, Durban, South Africa, 2012. Available online at http://www.ndt.net/article/wcndt2012/toc.htm.
23. C. Merlet. An accurate computer correction program for quantitative electron probe microanalysis. Mikrochimica Acta, 114/115:363, 1994.
24. R. Neal. Regression and classification using Gaussian process priors. In J. M. Bernardo, J. O. Berger, A. P. Dawid, and A. F. M. Smith, editors, Bayesian Statistics 6, pages 475–501. Oxford University Press, 1998.
25. J. L. Pouchou and F. Pichoir. PAP (ρZ) procedure for improved quantitative microanalysis. In J. T. Armstrong, editor, Microbeam Analysis. San Francisco Press, San Francisco, California, 1984.
26. Carl Edward Rasmussen and Christopher K. I. Williams. Gaussian Processes for Machine Learning. The MIT Press, 2006. http://www.gaussianprocess.org/gpml/.
27. S. J. B. Reed. Electron Microprobe Analysis and Scanning Electron Microscopy in Geology. Cambridge University Press, Cambridge, 2005.
28. C. P. Robert and G. Casella. Monte Carlo Statistical Methods. Springer-Verlag, New York, 2004.
29. Tilmann Gneiting, William Kleiber, and Martin Schlather. Matérn cross-covariance functions for multivariate random fields. Journal of the American Statistical Association, 105(491):1167–1177, 2010.
30. W. E. Vanderlinde and J. N. Caron. Blind deconvolution of SEM images. In ISTFA 2007: Conference Proceedings of the 33rd International Symposium for Testing and Failure Analysis, pages 97–102. ASM International, 2007.
31. V. Vapnik. The Nature of Statistical Learning Theory. Information Science and Statistics. Springer, New York, 1999.
32. P. Wang, V. Jain, and L. Ventakaraman. Sparse Bayesian T1-T2 inversion from borehole NMR measurements. In Proceedings of the SPWLA 57th Annual Logging Symposium, Reykjavik, 25–29 June 2016.
33. W. Bruinsma, E. Perim, W. Tebbutt, S. Hosking, A. Solin, and R. Turner. Scalable exact inference in multi-output Gaussian processes. In International Conference on Machine Learning, Proceedings of Machine Learning Research, pages 1190–1201, 2020.
34. F. S. L. Wong and J. C. Elliott. Theoretical explanation of the relationship between backscattered electron and X-ray linear attenuation coefficients in calcified tissues. Scanning, 19:541, 1997.
35. Qibin Zhao, Guoxu Zhou, Liqing Zhang, and Andrzej Cichocki. Tensor-variate Gaussian processes regression and its application to video surveillance. In 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1265–1269. IEEE, 2014.

Chapter 5

Bespoke Learning of Disease Progression Using Inter-Network Distance: Application to Haematology-Oncology

Joint Work with Dr. Kangrui Wang, Dr. Akash Bhojgaria and Dr. Joydeep Chakrabartty

Abstract In certain real-world problems, supervised learning is sought of the functional relation between a multivariate time series and associated system properties, which serve respectively as the output and input variables. Such learning requires training data comprising pairs of the designed value of the input and the corresponding realisation of the output—except that realisations of this output at different design points may vary in temporal span. If truncation to the minimum of such time points is not possible, or not preferred, we require a parametrisation of the considered output, where such a parameter is unaffected by the length of the output value. Such a lossless embodiment of the time series output is bespoke learnt, using the scaled Hellinger distance between the graphical models that are learnt for a pair of time series data sets. An application of such bespoke learning of a viable scalar parameter that stands in for such a variably-long output is presented: to learn the risk score for developing the potentially terminal disease SOS/VOD, which affects some recipients of bone marrow transplants. Following the learning of this score, its functional relation with the vector of pre-transplant variables is sought, by learning this function as modelled with a Gaussian Process. Subsequently, we undertake prediction of the SOS/VOD score at the pre-transplant stage, for new patients whose pre-transplant variables are known.

© Springer Nature Switzerland AG 2023
D. Chakrabarty, Learning in the Absence of Training Data, https://doi.org/10.1007/978-3-031-31011-9_5

5.1 Introduction

In the last chapter we discussed the supervised learning of the generally non-linear and tensor-valued functional relationship between two mutually associated random variables, given data that displays inhomogeneities in its correlation structure. We recall from Chap. 4 that such learning entails modelling the sought function as a random realisation from a stochastic process, and is aided if the higher-dimensional of these two variables is treated as the output of the function, and the lower-dimensional one as the input. The vector of system parameters X is treated as this input, while the (in general) tensor-valued observable Y that is associated with this input is the output. The data on the observable


comprises multiple observations of Y. General tensor-valued variables are discussed in multiple applications [1, 3, 5, 6, 8, 10, 11, 14, 18, 21, 22, 28, 31, 32]. However, in the real world, situations can arise in which observations on the high-dimensional Y are not necessarily equally "long". For example, Y could be a multivariate time series variable, such that (s.t.) different realisations of Y cover different temporal ranges. Let the variable Y in this example be the matrix of observations of p parameters for an individual in a sample, over T time points, where T ∈ N is a random variable. Then the data accumulated by collating values of Y relevant to members of the sample is not shaped as a cuboid, since the different values of Y in this data set are matrices with varying numbers of rows. In such cases, to enable the building of a training set that contains pairs of design values of X and the corresponding y, we can truncate all values of Y in a sample of m individuals to t_min, where t_min = min{t_1, ..., t_m}, with the k-th individual's observations recorded in the t_k × p-dimensional matrix y_k; k = 1, ..., m. Such truncation is naturally not optimal, as the learning is then deprived of some of the available information. In fact, in certain applications, the row-wise observations represent information obtained at a sequence of time points, and enforced omission of the last few rows from the ensuing time series would imply ignoring information about the evolution of the system. This can give rise to incorrect results. Then again, inflating the number of rows of y_k to max{t_1, ..., t_m} needs to be avoided as well, since such inflation is likely to be ad hoc. Hereon, we proceed with our discussions under the assumption that the output Y is indeed a multivariate time series, with a variable time coverage.
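To make the data shapes concrete, here is a small sketch (with invented dimensions) of such a ragged collection of t_k × p matrices, and of what truncation to t_min discards:

```python
import numpy as np

rng = np.random.default_rng(1)

p = 3                                   # number of monitored parameters
t = [5, 8, 6]                           # t_k: temporal span for the k-th subject
# y_k is a t_k x p matrix; the collated data is ragged, not a cuboid.
y = [rng.normal(size=(tk, p)) for tk in t]

t_min = min(t)
# Truncation to t_min forces the data into a cuboid...
y_trunc = np.stack([yk[:t_min, :] for yk in y])
print(y_trunc.shape)                    # (3, 5, 3)
# ...but discards observed rows, i.e. information on the later evolution:
rows_lost = sum(t) - len(t) * t_min
print(rows_lost)                        # 4
```

The discarded rows are exactly the late-time observations, which is why the text warns that truncation ignores information about the evolution of the system.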
In this chapter we forward a method that allows for the computation of a lossless parametrisation (S) of this variably-long time series variable .Y , s.t. this sought parameter is unaffected by the variability of the temporal cover that characterises the different realisations of .Y . Such a property of the sought S parameter will then permit the learning of S as a function of the input vector variable .X. Following such learning, we can undertake prediction of the value .X at a given S—which is to say, at a given realisation of the variably-long multivariate time series .Y . More relevant might be the learning of values of S at test data on .X.

Evidently, it is a transformation of Y that results in the sought S, though we are not aware of the form of the transformation that yields S given a Y—after all, we do not have a training set comprising pairs of values of the variably-long time series variable Y and the parameter S realised at a chosen value of such Y. Thus, supervised learning of the form of this transformation is irrelevant to the discussion at hand. In its place, we undertake bespoke learning of the parameter S, given a y. We choose to simplify our bespoke learning by demanding that S be a scalar. Let the scalar parameter S that includes all information on the t_k values of the p parameters that comprise y_k, be called s_k, ∀k = 1, ..., m.


The parameter S ∈ S ⊆ R carries information about any realisation of the multivariate time series over a variable temporal range (i.e. Y), and as there is no information lost in S representing Y, S can potentially stand in for the response variable in the supervised learning of the functional relationship between X and Y—i.e., this time, the relation between the input X and this output parameter S is sought. Indeed, it is the bespoke learning of S, given data on Y, that provides the training data on pairs of values of the input X and the viable output variable S. Implementation of the corresponding training data is then made, to undertake supervised learning of the relationship between these input and output variables. Later we will note that, subsequent to the bespoke learning of S, we will find it easier to model S as the input and X as the output of the functional relation between these two variables, given that learning of an inter-variable functional relation is easier if the lower-dimensional of the two variables is cast as the input (as recalled from earlier chapters, in the first paragraph of this section). In that case, we will need to learn the value of S at which a test X is realised. Such a need may arise when we wish to parametrise the risk score S of developing a malady, given a set of trigger parameters (X), where data exists on a set of subjects who have been observed in the past for their behaviour/symptoms over varying time intervals, around the time when they were suffering from said malady. Then our sequential actions will entail: performing bespoke learning of the risk score for each such subject, using the time series (Y) of their behaviour/symptoms observed in the past; establishing pairs of (bespoke learnt score, trigger parameters) for each such subject; and learning the score-trigger variable relationship, to then learn the score at which a new (or test) subject's observed trigger parameters are realised.
Another compelling reason behind the proposed learning of the relationship between X and Y is the identification of those components of X that are comparatively more strongly associated with the response Y than other components. Since S is constructed s.t. it carries the information in Y without being affected by differences in the length of the matrix-valued observation of Y, learning how the said S is associated with X permits identification of the sought ranking of the components of X, by association with S. This equivalently implies the grading of components of X by potency of association with Y. It is also possible that such a scalar parametrisation of the information contained in the observation of a generic high-dimensional output Y is the very aim—even in a situation in which t_k is the constant n, ∀k = 1, ..., m. For example, in applications in which the aim is the formulation of a composite index, t_k may or may not be a constant. For example, Y can be the matrix-valued variable that consists of:
– p socio-economic factors relevant to the k-th country, observed over t_k years, where the aim is to learn a scalar parametrisation—or composite index—of each such monitored country's socio-economic status, for k = 1, ..., N;
– p social+health+well-being variables of the population in a given geographical region, measured over n time points, where p is a known constant relevant to all


regions that are tracked, and n may or may not be a constant for all such regions. Our interest is in parametrising the progression of an epidemic in the k-th region in a county, where such a progression score needs to be bespoke learnt for each such region, given the time series data obtained for that region.
– p phenotypic random variables of a patient in a cohort, measured over n time points, where p and n are known constants relevant to all patients in this cohort. We might be interested in parametrising the progression of a disease in the k-th patient. Such progression is parametrised by a "disease progression score" in each patient.

The method that we present in this chapter permits the learning of a score parameter that quantifies the progression of an illness, given the observed time series of selected phenotypic manifestations of this illness in each patient in a cohort—albeit with observations possible over different time intervals for different patients, owing to the relatively early demise of some patients from the considered disease, or from other underlying diseases. Subsequent to the learning of such a "disease progression score", we might be interested in undertaking the supervised learning of the functional relation between this disease progression score variable and other predisposing and/or procedural and/or treatment parameters that comprise X, which can be invoked as being potentially associated with this score. Thus, it is the bespoke learning of the progression score—which is the scalar-valued output variable—that allows for the supervised learning of the relation between this output and the input X.

The method that we discuss here uses the distance computed between the posterior probabilities of the pair of graphical models learnt given a pair of values, y_1 and y_2, of Y (Wang and Chakrabarty, communicated). Said distance is then implemented to learn the relative values s_1 and s_2 of the score parameter S that embodies the information contained in the observations y_1 and y_2 respectively. We refer to the learnt values of S as relative, since for one arbitrarily chosen observed value of Y—say when Y = y_ref—the corresponding value of S is set as 1, and for Y ≠ y_ref, S will be assigned values relative to this arbitrary assignment at Y = y_ref, depending on how distant the learnt graphical model at Y = y_i is from the graphical model learnt at Y = y_ref; y_i ≠ y_ref. In this work, the graphical model that is learnt given the data Y = y results from realisations of the random graph variable that is Bayesianly inferred upon, given this data, where such a random graph is a Soft Random Geometric Graph or SRGG [9, 23, 24]. The random graph is in fact a Random Geometric


Graph (RGG) that is drawn in a probabilistic metric space [17, 27], rendering it an SRGG; we clarify this soon. The posterior probability of such a random graph variable is formulated given the estimated inter-component correlation structure of the data, and sampling from this posterior is undertaken using MCMC techniques (Appendix A). This inference results in a set of graphs that conveys comprehensive and objective uncertainties of the learning, namely the set of inferred graphs that comprise the (1−τ) Highest Probability Density credible region (HPD). This set of graphs is referred to as the graphical model of the data y.
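As a toy illustration of scoring by inter-model distance: assuming, purely for illustration, that the posterior probabilities of two learnt graphical models are supported on a common finite set of candidate graphs, the Hellinger distance to a reference model can be computed and used to offset a reference score of 1. The offset convention below is only a guess at such a scheme, not necessarily the scaled-Hellinger convention of Wang and Chakrabarty:

```python
import numpy as np

def hellinger(p, q):
    # Hellinger distance between two discrete probability vectors, standing
    # in for the posterior probabilities of two learnt graphical models.
    return np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))

# Toy posterior probabilities over a common set of three candidate graphs
# (invented numbers; in the text these posteriors are learnt by MCMC).
post_ref = np.array([0.6, 0.3, 0.1])   # reference subject: score set to 1
post_i   = np.array([0.2, 0.3, 0.5])   # another subject

# Illustrative convention: offset the reference score by the distance.
s_ref = 1.0
s_i = s_ref + hellinger(post_ref, post_i)
print(round(s_i, 3))
```

The further a subject's graphical model lies from the reference model, the further its score lies from the reference value of 1, which is the qualitative behaviour described in the text.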

In general, the inter-column correlation matrix of the data y is also learnt along with the graphical model of the data, and this concurrent updating of the two random structures (namely, the inter-column correlation matrix and the SRGG) is undertaken within a Metropolis-with-a-2-block-update scheme that we discuss below. Such Bayesian learning of the correlation matrix then demands that its posterior probability density, given data y, is formulated. Below, we advance the closed-form likelihood of this correlation matrix, given this data. We will learn the graph drawn in any iteration of the MCMC chain as an SRGG. This implies that while there is randomness in the placement of the nodes of the graph, there is additional uncertainty in the existence of the edge between a pair of nodes—to be precise, an edge exists with the probability that is the output of the so-called connection function, the input to which is the inter-nodal distance. Thus, in the graph construct that we advance, drawing of a graph in a probabilistic metric space implies that the inter-nodal "distance" is a probability. This is anticipated, since the metric of a probabilistic metric space is a cumulative distribution function defined on a positive support. In fact, it can be proved that one such distance function is complementary to the marginal posterior probability of the random edge variable that connects the relevant nodal pair, given the correlation between the r.v.s that sit at these two nodes. We can show that the affinity between the r.v.s that sit at these two nodes is given by this marginal posterior of the inter-nodal edge variable, where this edge marginal is the connection function of the SRGG variable. We will return to this point later (in Sect. 5.2.2). Equivalently, we can state that we include the edge between these two nodes if and only if the marginal posterior of this edge (given the correlation between the relevant r.v.s) exceeds a pre-chosen threshold probability.
This design is reminiscent of a Random Geometric Graph (RGG) [23], which is defined to include an inter-nodal edge as long as this edge length falls below a threshold distance. So (as indicated above), the graph that we propose to draw is then an RGG that is drawn in a probabilistic metric space, s.t. the edge connecting a nodal pair exists only if the inter-nodal distance falls below—i.e. the inter-nodal affinity, which is given by the edge marginal, exceeds—a given threshold probability. In other words, any edge variable in the graph that we draw exists with a probability—


as in a Soft Random Geometric Graph or SRGG. This is why we refer to our constructed graph as an SRGG: it is the construction of our graph as an RGG in a probabilistic metric space that renders our graph an SRGG. Even if some of the different columns of Y contain observations on variables that pertain to distinct domains, the constructed graphical model of this data offers an integrated representation of the inter-variable correlation structure of the data. Also, maintaining the freedom to choose a probability threshold that decides on the existence of any edge, given the learnt inter-column correlation matrix of the data, is useful, since it offers a means to control the sparsity of big networks.
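A minimal sketch of the edge-inclusion rule just described: given a matrix of edge marginals (affinities), an edge is retained iff its marginal exceeds the threshold; all numbers are invented:

```python
import numpy as np

# Invented edge marginals: the affinity (marginal posterior probability of
# the edge) for each nodal pair, i.e. the SRGG connection function output.
affinity = np.array([[1.0, 0.8, 0.2],
                     [0.8, 1.0, 0.6],
                     [0.2, 0.6, 1.0]])

tau = 0.5   # pre-chosen threshold probability; raising it sparsifies the graph

# Include an edge iff the affinity exceeds tau (equivalently, iff the
# probabilistic inter-nodal "distance" falls below a threshold).
adjacency = (affinity > tau).astype(int)
np.fill_diagonal(adjacency, 0)          # no self edges in the SRGG
print(adjacency)
```

Varying tau trades off sparsity against sensitivity, which is the control over big networks that the text refers to.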

5.2 Learning Graphical Models and Computing Inter-Graph Distance

In this section, we discuss the learning of a graphical model, and thereafter we will discuss the computation of the distance between a pair of such learnt graphical models. In the following section we elaborate on the bespoke learning of a scalar-valued representation S of a matrix-valued variable Y. Such a learnt scalar is then a score parameter, that we will model as associated with the system parameter vector X. The supervised learning of the relationship between S and X is enabled by the generation of the bespoke learnt value of S, given the realisations of Y, where any such realisation of Y is generated at a design value of X.

5.2.1 Learning the Inter-Variable Correlation Structure of the Data

Let the data that we propose to find the correlation structure of, be the matrix Y that comprises T measurements of the p variables Y_1, ..., Y_p. Here, each measurement of the p-dimensional vector (Y_1, ..., Y_p)^T occurs at a time point, s.t. the variable Y is a p-variate time series observed at T times. Here T is a random variable, s.t. one realisation of Y comprises measurements made at t time points while another realisation of Y may comprise observations made at t′ time points, with t ≠ t′ in general. We standardise Y_j using its sample mean and variance, into Ỹ_j (j = 1, ..., p), to result in the standardised matrix

Ỹ = (ỹ_1, ..., ỹ_p),

where the T-dimensional vector ỹ_j comprises the T observations of Ỹ_j; j = 1, ..., p. Thus, we define the random variable Ỹ_j ∈ R^T that takes the value


ỹ_j, j = 1, ..., p. We wish to learn the inter-variable covariance of the observed standardised matrix Ỹ, which is the data set for our current purposes. Then the inter-variable, or inter-column, covariance matrix Σ_C of the data Ỹ is

Σ_C = [σ_ij],

where σ_ij = Cov(Ỹ_i, Ỹ_j), ∀i, j = 1, ..., p. Given that Ỹ_j is standardised, Cov(Ỹ_i, Ỹ_j) reduces to the correlation between Ỹ_i and Ỹ_j, s.t. the matrix Σ_C is the inter-column correlation matrix of the data Ỹ. The inter-row correlation matrix of this data is also defined, as Σ_R = [ψ_mq], where m, q = 1, ..., T, with ψ_mq the correlation between the m-th and q-th rows of the data. However, our interest lies in the learning of the inter-column correlation matrix Σ_C, which determines how any pair of the (standardised) variables Ỹ_j and Ỹ_i are correlated with each other, to enable computation of the probability of existence of the edge between these variables as nodes of our sought SRGG. To accomplish the learning of Σ_C alone, we would ideally want to write the posterior probability density of Σ_C given data Ỹ. This is attained via a closed-form marginalisation of the joint posterior probability density of Σ_C and Σ_R given data Ỹ, over Σ_R.
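The standardisation, and the empirical inter-column correlation matrix to which the covariance reduces for standardised columns, can be sketched on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(2)
T, p = 50, 4
Y = rng.normal(size=(T, p))     # one realisation: a p-variate series over T times

# Standardise each column by its sample mean and standard deviation.
Y_tilde = (Y - Y.mean(axis=0)) / Y.std(axis=0)

# For standardised columns, the empirical inter-column covariance matrix
# coincides with the empirical correlation matrix Sigma_C.
Sigma_C = (Y_tilde.T @ Y_tilde) / T
print(np.allclose(np.diag(Sigma_C), 1.0))   # True: unit diagonal
```

This empirical estimate is also the quantity that, as discussed later in the chapter, can replace the MCMC-based learning of Σ_C when p is very large.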

To attain such closed-form marginalisation, we model the set of vector-valued random variables {Ỹ_j}_{j∈P⊆N} with a vector-variate Gaussian Process. Then for P = {1, 2, ..., p}, by definition, the joint distribution of (Ỹ_1, Ỹ_2, ..., Ỹ_p) is matrix Normal, i.e.

[Ỹ_1, Ỹ_2, ..., Ỹ_p] ∼ MN(0, Σ_C, Σ_R),

where a zero-mean GP is motivated as the underlying stochastic process, since we are using the standardised data Ỹ. Thus, the density of Ỹ given Σ_C and Σ_R is the matrix Normal density in Ỹ, with a zero mean matrix and covariance matrices Σ_C and Σ_R. In other words, the likelihood of the unknown covariance matrices, given the data, is matrix Normal.

Definition 5.1 Using the matrix Normal density as the likelihood, Jeffrey's prior on Σ_R, and a Uniform prior on Σ_C, the joint posterior probability density of these two covariance matrices given data Ỹ can be marginalised in a closed-form way over all values of the Σ_R variable. This yields the posterior probability density of Σ_C given data Ỹ as:

π(Σ_C | Ỹ) ∝ |Σ_C|^{−p/2} |Ỹ (Σ_C)^{−1} Ỹ^T|^{−(n+1)/2} / K(Σ_C),    (5.1)

where K(Σ_C) is a normalisation of this posterior, estimated as

K̂(Σ_C) := (1/Q) Σ_{i=1}^{Q} |Ỹ_i (Σ_C)^{−1} (Ỹ_i)^T|^{−(n′+1)/2}.

Here, the sample of Q number of n′ × p-dimensional data sets is {Ỹ_1, ..., Ỹ_Q}, where Ỹ_i abides by the inter-column correlation Σ_C ∀i = 1, ..., Q, s.t. (Ỹ_i)(Σ_C)^{−1}(Ỹ_i)^T is positive definite. In our work, we use n′ = n_0, with a chosen n_0.

This posterior of Σ_C given data Ỹ is updated at every iteration of the MCMC chain that we undertake, but the graph given this updated correlation is updated in the second block of any such iteration, using the partial correlation matrix R, where an updated R is computed at each iteration using the updated Σ_C.

Definition 5.2 The p × p-dimensional partial correlation matrix is R = [r_ij], with

r_ij = −q_ij / √(q_ii q_jj)  for i ≠ j, and r_ii = 1 for i = j,    (5.2)

for i, j ∈ {1, ..., p}, and the precision matrix is

Q = [q_ij] := (Σ_C)^{−1}.

As long as p ≲ 500, it is feasible to undertake inversion of the p × p-dimensional covariance matrix Σ_C within every iteration of the MCMC chain. Such inversion is relevant to the computation of the posterior of Σ_C, as well as to the computation of R given Σ_C. Indeed, this inversion gets increasingly more resource-intensive as p increases, and for p ≳ 500, such inversion renders the chain unfeasibly slow. Then in such situations in which p is large—which corresponds to a large number of nodes in the sought graph—we need to compromise on the computational probity of the inference, by:
– bypassing the MCMC-based inference, in favour of populating the ij-th element of Σ_C with the empirical estimate of the covariance between the observed y_i and y_j for p ≳ 500; i, j = 1, ..., p. Indeed, the standardisation of y_i implies that this estimated covariance is the estimated correlation. The estimated Σ_C is then inverted to compute the partial correlation matrix, conditional on which the graphical model of the data Ỹ is learnt.
– drawing the graph as an SRGG, using the empirically estimated correlation matrix Σ_C, when p is so large that a robust inversion of this correlation matrix is not attainable, s.t. the partial correlation matrix is not attainable. Then such a


large p leads us to draw the graphical model of the data as a correlation graph; this is what we refer to as a network. We will learn the network as an SRGG, given this estimated correlation.
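Definition 5.2 translates directly into code; the sketch below computes the partial correlation matrix R from an invented Σ_C via the precision matrix:

```python
import numpy as np

def partial_correlation(Sigma_C):
    # Definition 5.2: Q = [q_ij] := inverse of Sigma_C (the precision matrix);
    # r_ij = -q_ij / sqrt(q_ii * q_jj) for i != j, and r_ii = 1 for i = j.
    Q = np.linalg.inv(Sigma_C)
    d = np.sqrt(np.diag(Q))
    R = -Q / np.outer(d, d)
    np.fill_diagonal(R, 1.0)
    return R

# Invented inter-column correlation matrix for p = 3 variables:
Sigma_C = np.array([[1.0, 0.5, 0.2],
                    [0.5, 1.0, 0.3],
                    [0.2, 0.3, 1.0]])
R = partial_correlation(Sigma_C)
print(np.round(R, 3))
```

For moderate p this inversion sits inside every MCMC iteration, which is why the text flags its cost as p grows.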

5.2.2 Learning the SRGG

Once we update the inter-column correlation matrix of the given data, we will update the graph of this data as a Soft Random Geometric Graph or SRGG, where we recall that the graph is rendered an SRGG by our drawing of this graph as an RGG in a probabilistic metric space, as earlier stated. Then the distance between the i-th and j-th nodes in this graph is a probability—offset by a known constant—that is conditional on the absolute partial correlation |r_ij| between the variables Ỹ_i and Ỹ_j that sit at these nodes. As we have motivated above, for large p, this probability is conditional on the (absolute) correlation |σ_ij| between these two variables. The edge connecting the i-th and j-th nodes exists in the graph if this inter-nodal distance falls below a cutoff or threshold probability. We denote this threshold probability τ. Here i, j ∈ {1, ..., p}, with i ≠ j, and V the vertex set {Ỹ_1, ..., Ỹ_p} that the SRGG is defined on. It is to be noted that no self edges are allowed in this SRGG. Also, distinct nodal pairs are joined by their respective edge variables, independently of each other.

In a model of the affinity between Ỹ_i and Ỹ_j—where the affinity is, by definition, complementary to the distance between Ỹ_i and Ỹ_j—the affinity is the marginal posterior probability of the mutual edge variable G_ij, conditional on |r_ij| (offset by a known constant). This relation is achieved by marginalising the joint posterior probability of all unknowns relevant to the ij-th nodal pair, over those variables that are not G_ij.

So we now motivate a model for this joint posterior probability of all unknowns, given the learnt value of |R_ij|. To this effect, we recall a few relevant trends that we can identify, prior to discussing the parametric form of the likelihood invoked to learn the edge parameter G_ij, for any i, j ∈ {1, ..., p}, i ≠ j. This edge parameter either exists or not—i.e. g_ij = 1 or 0—with a probability that is conditional on the absolute partial correlation variable |R_ij|.
• When the edge G_ij between the i-th and j-th nodes exists in the graph, i.e. when g_ij = 1, the probability for the absolute partial correlation between Ỹ_i and Ỹ_j to be 1 is maximal. Again, when the edge G_ij connecting the i-th and j-th nodes is


absent, i.e. when g_ij = 0, the probability for |R_ij| to be 0 is maximal. This intuition holds ∀i, j ∈ {1, ..., p}, i ≠ j.
• At the same time, when g_ij = 1, the probability for |R_ij| to be 0 is minimal. Similarly, g_ij = 0 ⇒ the probability for |R_ij| to be 1 is minimal.
• For g_ij = 1, the probability of the absolute mutual partial correlation decreases as |R_ij| takes values downward of 1. Again, for g_ij = 0, the probability of the absolute mutual partial correlation decreases as |R_ij| takes values upward of 0. Thus, as |R_ij| changes s.t. the absolute difference between g_ij and |r_ij| increases at a given G_ij, the attained value of the partial correlation variable is rendered less probable.

In light of the above, we can advance a form of the probability density function (pdf) of |R_ij| given G_ij = g_ij as:

f_{|R_ij|}(r_ij | g_ij, v_ij) = (1/√(2π v_ij)) exp(−(|r_ij| − g_ij)² / (2 v_ij)),

i.e. |R_ij| is chosen as Normally distributed with mean g_ij and variance v_ij. Indeed then, with increasing absolute difference between the value attained by the variable |R_ij| and the given g_ij, the probability for |R_ij| = |r_ij| decreases, for a given v_ij. In particular, for G_ij set at 1 (or 0), the pdf for |R_ij| to be 0 (or 1) is then minimal, for a given choice of v_ij. Also, this pdf is a (smoothly) decreasing function of the absolute difference between g_ij and |r_ij|. Thus, the trends that we have noted previously—which constitute the only constraints that we can impose on the choice of this conditional density—are all adhered to by this Normal form of the pdf of |R_ij| given G_ij = g_ij. We also note that, centred at a given g_ij, R_ij can take values in [−1, 1]. Then the maximal dispersion of R_ij about its mean is 1, where such deviation away from the mean is typically not greater than 3 times the standard deviation of the density. We address this sufficiently by suggesting that v_ij takes values in (0, 1].
Thus, we are saying that density of .|Rij | in this graph learning exercise can be Normal in shape. Indeed, many other forms for the density would have satisfied this limited amount of information that we have on the expected trends—one example of this would be the double exponential density centred at .gij . Basically, any density that falls symmetrically about a modal value of .gij would suffice as far as our available information is concerned. Bayesian inference undertaken on the unknowns, with any such likelihood is expected to yield concurrent mean values, but different uncertainties on the unknowns; to address this, we learn the variance parameter .vij along with the edge parameter. Thus, using the Normal form the density, the joint posterior probability density of .Gij and .vij is formulated, given the learnt .|rij |, after invoking suitable priors on .Gij and .vij . We use .Bernoulli(0.5) priors on .Gij and .U nif orm(0, 1] priors on .vij , where motivation for the interval .(0, 1] is provided above. This is perhaps a wider support for the prior probability density for .vij than is necessary. In the methodology development stage, we use weak priors, but if an application permits better information on the edge and variance parameters, we can acknowledge that in our work.
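The trends demanded of this conditional density can be checked numerically; a minimal sketch (the function and variable names are our own) that evaluates the Normal form above:

```python
import math

def pdf_abs_partial_corr(r_abs, g, v):
    """Density of |R_ij| at r_abs, given edge indicator g (0 or 1) and variance v."""
    return math.exp(-(r_abs - g) ** 2 / (2.0 * v)) / math.sqrt(2.0 * math.pi * v)

v = 0.1
# when g = 1, the density is maximal at |r| = 1 and falls off towards |r| = 0;
# when g = 0, the density is maximal at |r| = 0 and falls off towards |r| = 1
```

For any fixed $v_{ij}$, the density is monotonically decreasing in the absolute difference between $|r_{ij}|$ and $g_{ij}$, which is the only constraint the stated trends impose.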


Within the Metropolis-Hastings based inference, we use a truncated Normal proposal density for the positive $v_{ij}$, where this proposal density is left-truncated at 0, its mean is the current value of $v_{ij}$, and its variance is experimentally chosen. $G_{ij}$ is proposed from a $Bernoulli(|r_{ij}|)$ probability distribution. Thus, the joint posterior of all edge and variance parameters is

$$ \prod_{i<j}^{p} \pi(G_{ij}, v_{ij} \mid |r_{ij}|). $$

Extension of the learning to a sample larger than the original one is also possible within this scheme, without us having to redo the computation for the original sample of $N$ subjects.

5.4 Application to Haematology-Oncology

In this section we discuss an empirical illustration of the methodology discussed above, to undertake bespoke learning of the progression score of a disease referred to as Sinusoidal Obstruction Syndrome/Veno-Occlusive Disease, or SOS/VOD [15, 16, 19, 25, 30, 33]. This disease is a potentially terminal complication that can follow bone marrow transplants (Hematopoietic Stem-Cell Transplantation, or HSCT, to be specific) in patients afflicted with certain types of blood cancers, who have undergone such transplantation. SOS/VOD occurs in 0–60% of transplant recipients, with mortality rates of about 80%, as noted in NHS England's documentation on Defibrotide-based treatment (https://www.england.nhs.uk/wp-content/uploads/2020/11/Defibrotide-in-severe-veno-occlusive-disease-following-stem-cell-transplant.pdf). Early diagnosis and treatment enhance survival, but efficient clinical criteria are required to improve the current state


of diagnosis, and treatment protocols are sparse [7, 19], with dose specifications typically unavailable, though Defibrotide is currently considered an efficacious treatment for the severe stages [26, 29]. Prevention via the identification of risk factors, as well as pharmacologic steps, are sparsely discussed in the literature; a useful recent finding suggests abandoning heparin as SOS/VOD prophylaxis [12]. Reliable prediction of SOS/VOD progression, given pre-transplant manifestations, is sorely missing, as is the automated identification of the most potent influences, and of optimal patient-specific treatment regimens.

In this application, we discuss the bespoke learning of a continuous, real-valued, data-driven progression score of the disease SOS/VOD (unlike in [20]), in $N_1 = 5$ patients in the first cohort (Cohort I) and $N_2 = 8$ patients in the second cohort (Cohort II), relative to the score of 1 that is assigned to an arbitrarily chosen reference patient. Said learning is undertaken subsequent to the learning of the graphical model of the phenotypic information observed for each patient of these two cohorts. Such information comprises observations of 11 phenotypic parameters of each patient of each cohort, undertaken at different time points that range from about 6 days prior to the transplant, to about 18 days after. In the score learning that we discuss here, we have employed 10 of these phenotypic parameters. It is an unfortunate reality of the virulence of SOS/VOD that not all patients who are monitored within these 2 cohorts are alive till the end of this observational period. Thus, the recorded observations of the different patients whom we include in the study comprise matrices with different numbers of rows, where the maximal number of rows corresponds to the last observational time, on the 18th post-transplant day. Each column of these matrices contains information about the same phenotypic parameter for all patients.
Thus, the observed phenotypic information for the $i$-th patient in the $c$-th cohort is contained in an $n_i^{(c)} \times 10$-dimensional matrix $Y_i^{(c)}$; here $c = $ I, II, with $i = 1, \ldots, 5$ for Cohort I and $i = 1, \ldots, 8$ for Cohort II. The data for Cohort II arrived after the analysis with the data for Cohort I was undertaken. We undertook the learning of the scores for Cohort I by arbitrarily calling the 5th patient in Cohort I the reference patient for this cohort. In Cohort II, we flag the 1st patient as the reference patient. The phenotypic graphical models $G_1^{(I)}, \ldots, G_5^{(I)}$ are learnt given the data $Y_1^{(I)}, \ldots, Y_5^{(I)}$. Figure 5.1 displays the variation, with the time point of observation, of the 10 different phenotypic attributes used, for the 8th patient in Cohort II and the 3rd patient in this cohort. (The 8th patient was found in our work to have a higher risk score of SOS/VOD than the 3rd patient in this cohort). Graphical models are learnt with such phenotypic observations. Graphical models learnt with the data sets $Y_1^{(I)}, Y_2^{(I)}, Y_3^{(I)}$ are displayed in Fig. 5.2. MCMC-based learning of the score parameters is undertaken using the distance that is computed between the graphical model of the reference patient in a cohort and that of each of the other patients in this cohort. Histogram representations of the marginal posteriors of the learnt score parameters $s_2^{(II)}, \ldots, s_8^{(II)}$ of the non-reference patients (i.e. the 2nd, 3rd, ..., 8th patients), given the data $Y_2^{(II)}, \ldots, Y_8^{(II)}$ respectively, are presented in Fig. 5.3.


Fig. 5.1 Figure presents plots depicting temporal variation of phenotypic parameters of two different patients in Cohort II, from a few days prior to the bone marrow transplant, to a few days after. The 2 left-most columns refer to attributes of the 8th patient, while the 2 right-most columns depict parameters of the 3rd patient in this cohort. These attributes are as follows: systolic blood pressure, plotted against observational time points in the panels at positions (1,1) and (3,1), respectively for the 8th and 3rd patients in Cohort II; diastolic pressure, plotted against observational times in panels at the (2,1) and (4,1) coordinates, for the 8th and 3rd patients respectively; pulse rate for these patients in panels at (1,2) and (3,2) positions respectively; respiratory rate in panels at positions (2,2) and (4,2); body temperature in panels at positional coordinates (1,3) and (3,3); capillary saturation in panels at positions (2,3) and (4,3); body weight in panels at (1,4) and (3,4); fluid balance in panels at (2,4) and (4,4); total bilirubin in panels at positions (1,5) and (3,5); creatinine in panels at positional coordinates (2,5) and (4,5) respectively, for the 8th and 3rd patients in Cohort II


Fig. 5.2 Graphical models learnt given data sets that comprise observed information on phenotypic attributes of 3 patients in Cohort I, observed from a pre-transplant time point till about the 18th day following the bone marrow transplant. However, the 2nd patient did not survive this time horizon. Names of these attributes are presented in the caption of Fig. 5.1; from the bottom-most node, going in an anti-clockwise direction, the nodal names are: "Fluid balance"; "Total Bilirubin"; "Creatinine"; "Blood pressure (high)"; "Blood pressure (low)"; "Pulse rate"; "Respiratory rate"; "Temperature"; "CS" or Capillary saturation; "Body weight". For the data set $Y_i^{(I)}$ of the $i$-th patient in Cohort I, the graphical model is denoted $G_i^{(I)}$; $i = 1, \ldots, 5$. From left to right are presented the graphical models $G_1^{(I)}, G_2^{(I)}, G_3^{(I)}$. A probability cutoff of $\tau = 0.6$ is used to construct the (soft geometric) random graphs that build the presented graphical models. In $G_i^{(I)}$, the relative frequency of an edge, in the set of random graphs that were sampled using MCMC at the learnt correlation, is marked at that edge


Fig. 5.3 Figure presents histogram approximations of the marginal posterior probability density of the score parameters $s_8^{(II)}, s_7^{(II)}, s_6^{(II)}$, from left to right, of 3 of the patients in Cohort II, relative to the score of the reference patient, i.e. the 1st patient in this cohort, who is assigned a score of 1

Score parameters are also learnt for each of the non-reference patients in Cohort I, relative to the reference patient for this cohort, i.e. relative to the score of 1 of this cohort's designated reference patient (the 5th patient). These score parameters are learnt within an MCMC-based inferential scheme, using the distances between all pairs of graphical models of the phenotypic observations of patients in this cohort, recorded from a pre-transplant to a post-transplant time point, though not all patients in this cohort survived the full observational window. To be precise, the 2nd and 4th patients died following their HSCT, at different time points before the time horizon of observations was attained.


However, the scores learnt for all patients in Cohort I are updated, so that these scores are rendered relative to the score of 1 that is assigned to the reference patient of Cohort II, whom we choose as the reference patient common to both cohorts. Once this is accomplished, the scores learnt for all patients in Cohort I are relative to this common reference patient, i.e. the 1st patient in Cohort II.

To accomplish this updating of the scores learnt for patients in Cohort I, we use the distance computed between the graphical models of the originally designated reference patients of the two cohorts, i.e. between the graphical models learnt for the 1st patient in Cohort II and the 5th patient in Cohort I. The latter was earmarked as the reference patient in Cohort I, with respect to whom all other patients of Cohort I were assigned their relative VOD-scores; now, however, the reference is to be changed from the 5th patient in Cohort I to the 1st patient in Cohort II. The distance between $G_5^{(I)}$ and $G_1^{(II)}$ is about 1.17. On the basis of the temporal progression of phenotypic parameters in these 2 patients, physicians offer the information that SOS/VOD was more progressed in the 1st patient in Cohort II than in the 5th patient in Cohort I. This assumption permits the transformation of the score of the Cohort-I reference patient, relative to the score of 1 assigned to the Cohort-II reference patient, to $1 - 1.17 \approx -0.17$. In other words, the erstwhile reference patient of Cohort I, who was assigned a score of 1 without uncertainties, now again has an uncertainty-free scaled (relative) score of about $-0.17$, on the scale in which the score of the newly identified reference patient for the two cohorts (namely, the 1st patient in Cohort II) is 1. The scores of the other patients in Cohort I are transformed accordingly (shifted down in value by 1.17).
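This re-referencing amounts to a constant shift; a minimal sketch, using the inter-reference distance of 1.17 from the text (the function name and the illustrative input list are our own):

```python
def rereference(scores, inter_reference_distance):
    """Shift scores, learnt relative to the erstwhile reference (who had score 1),
    onto the scale on which the new, more-progressed common reference has score 1."""
    return [s - inter_reference_distance for s in scores]

# the erstwhile Cohort-I reference (score 1) lands at 1 - 1.17 = -0.17
old_reference_score = 1.0
new_score = rereference([old_reference_score], 1.17)[0]   # about -0.17
```

The sign of the shift follows from the physicians' judgement quoted above: the new common reference is the more progressed of the two erstwhile reference patients.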

The learnt scores of all patients in these 2 cohorts are tabulated in Table 5.1.

The results of Table 5.1 are calibrated against the status of SOS/VOD onset in these retrospective cohorts. It is noted that all patients for whom we learnt a mean score of $s > 0.11$ had developed the disease in the retrospective cohorts, and those with mean scores $\leq 0.11$ did not.


Table 5.1 Table displaying the SOS/VOD progression scores of patients in Cohort I and Cohort II, relative to the risk score of the disease in the reference patient of the 2 cohorts, namely the 1st patient in Cohort II

Patient index | Cohort index | Mean relative score | 95% HPDs on score
1             | 1            | 0.11                | [0.089, 0.31]
2             | 1            | 0.44                | [0.29, 0.67]
3             | 1            | -0.069              | [-0.24, 0.11]
4             | 1            | -0.96               | [-1.54, -0.37]
5             | 1            | -0.17               | [-0.17, -0.17]
1             | 2            | 1                   | N.A.
2             | 2            | 1.37                | [1.34, 1.40]
3             | 2            | 0.48                | [0.43, 0.52]
4             | 2            | 1.51                | [1.46, 1.55]
5             | 2            | 0.44                | [0.38, 0.48]
6             | 2            | 0.54                | [0.49, 0.57]
7             | 2            | 0.50                | [0.46, 0.54]
8             | 2            | 1.55                | [1.51, 1.60]

5.4.1 Learning of the Relation Between VOD-Score and Pre-transplant Variables, and Prediction

For each patient in the retrospective cohort that we used in our discussions above, the pre-transplant variables vector is known. (It is to be noted that the retrospective Cohorts I and II are merged, and treated hereon as one retrospective cohort). Thus, for the $i$-th patient, for whom the score is $s_i$, the pre-transplant variables vector is $x_i$. Here $i = 1, \ldots, 13$, i.e. we have merged all 13 patients whom we discuss above into one retrospective cohort, so that the $i$-th of these patients has a VOD-score that is bespoke-learnt as $s_i$. So the bespoke learning of the VOD-scores now offers the training set $\{(s_i, x_i)\}_{i=1}^{13}$, using which we can learn the functional relationship between the pre-transplant vector $X$ and the VOD-score $S$. As suggested above, we model this function $f(\cdot)$ to output the higher-dimensional of the two variables, i.e. we use $X$ as its output and take $S$ as its input. In our application, $X$ is 30-dimensional, so $f(\cdot)$ is also a vector-valued function of the same dimensionality. We model $f(\cdot)$ as a sample function of a vector-variate GP, implying that the joint probability density of the $N = 13$ outputs, realised respectively at each design input (i.e. each bespoke-learnt score), is matrix Normal, with a mean matrix and 2 covariance matrices. One of these covariance matrices (the inter-component covariance matrix) is such that its $ij$-th element informs on the covariance between the $i$-th and $j$-th components of $X$ ($i, j = 1, \ldots, 30 = D$). The other covariance matrix is the inter-patient covariance matrix, the $ij$-th element of which informs on the covariance between the $i$-th and $j$-th VOD-scores ($i, j = 1, \ldots, 13 = N$). In other words, in the $N \times 30$-dimensional data matrix $D = (x_1, \ldots, x_N)^T$, the covariance between rows is the inter-patient covariance matrix $\Sigma_P$, and the covariance between columns is the inter-component covariance matrix $\Sigma_C$. Then we perform a sample estimate of each element of $\Sigma_C$, and we kernel-parametrise $\Sigma_P$ as $\Sigma_P = [K(s_i, s_j)]$. In our work we choose a simple SQE kernel, the globally-applicable hyperparameters $a$ and $\ell$ of which we learn from the data, i.e. $K(s_i, s_j) = a \exp(-(s_i - s_j)^2 / 2\ell^2)$. Then, as motivated in the previous chapter, the likelihood of the kernel hyperparameters, given data $D$, is

$$ (2\pi)^{-DN/2} \, |\Sigma_C|^{-N/2} \, |\Sigma_P|^{-D/2} \exp\left( -\mathrm{Tr}\left[ \Sigma_C^{-1} (D - \mu)^T \Sigma_P^{-1} (D - \mu) \right] / 2 \right). $$

The mean matrix $\mu$ is estimated such that each of its columns is a replicate of the sample mean of the values of the component of $X$ included in that column (as observed across the $N$ patients). We invoke adequate priors on the kernel hyperparameters, and multiply these priors with this likelihood, to obtain the joint posterior probability density of these parameters given the data $D$ and the given design points $s_1, \ldots, s_N$. We then generate posterior samples using MCMC, allowing for the computation of the marginal posterior probability of each parameter given the data. The marginals then allow for the learning of the 95% HPDs on each parameter, given the data.

However, our aim is to learn the value $s^{(test)}$ of the VOD-score of a new (or test, or prospective) patient, who is at the pre-transplant stage, such that their pre-transplant parameter vector is recorded as $x^{(test)}$, but their time series of phenotypic observations is not yet available. Thus, the augmented data $D_{aug}$ on the output now includes $x_1, \ldots, x_N$ from the retrospective patients, as well as $x^{(test)}$: that is, $D_{aug} = (x_1, \ldots, x_N, x^{(test)})^T$. Let the data variable that attains the value $D_{aug}$ be $\mathcal{D}_{aug}$. Conditioning the density of $\mathcal{D}_{aug}, S_1, \ldots, S_N$ on the model parameters (i.e. the hyperparameters of the covariance kernel used to parametrise the inter-patient covariance matrix) and on $S^{(test)}$, reduces to the ratio of:
– the joint density of $\mathcal{D}_{aug}$ and $S_1, \ldots, S_N, S^{(test)}$, conditional on parameters $\ell, a$, and
– the density of $S^{(test)}$ conditional on $\ell, a$.

Setting the latter conditional density as Uniform over a chosen interval in the VOD-score, we write the logarithm of the density of the data variable and $S_1, \ldots, S_N$, given the unknowns, as

$$ \log f(D_{aug}, s_1, \ldots, s_N \mid \ell, a, s^{(test)}) = -\mathrm{Tr}\left[ \left(\Sigma_C^{aug}\right)^{-1} (D_{aug} - \mu_{aug})^T \left(\Sigma_P^{aug}\right)^{-1} (D_{aug} - \mu_{aug}) \right] / 2, $$

added to an unknown constant. Here $\Sigma_P^{aug}$ is the $(N+1) \times (N+1)$-dimensional inter-patient correlation matrix of the $(N+1) \times D$-dimensional augmented data $D_{aug}$, and $\Sigma_C^{aug}$ is the $D \times D$-dimensional inter-component correlation matrix of this augmented data, elements of which differ from those of $\Sigma_C$, owing to the row added to the augmented data over the original data $D$. The mean matrix $\mu_{aug}$ of the augmented data matrix is also changed from that of $D$, and is $(N+1) \times D$-dimensional. Our MCMC-based inference permits us to ignore the unknown constant that is added to the log-likelihood. Along with this likelihood, we invoke priors on the hyperparameters $\ell$ and $a$, and incorporate priors elicited


by the Haematologist-Oncologists on $s^{(test)}$, if available (else we use weak priors), to formulate the joint posterior probability density $\pi(\ell, a, s^{(test)} \mid D_{aug}, s_1, \ldots, s_N)$. We learn the marginals of each learnt parameter using MCMC, allowing for the learning of the 95% HPDs on each learnt parameter. Currently, we have expanded our training set from 13 to 25 patients, and have undertaken VOD-score learning for 7 test or prospective patients, including the patient (indexed as) $P_1$. So the learning of $P_1$'s score was undertaken with a bigger training set than what has been discussed above. When we undertook the analysis of prospective patient $P_1$, as they were a new or test patient, their time series of phenotypic observations was not available to us. At that stage, the patient's VOD-score was learnt as lying in the range $[0.949, 1.033]$, with a mean of 0.967. The results of our MCMC-based inference of this score are depicted in Fig. 5.4, along with the traces of the kernel hyperparameters $\ell$ and $a$.
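The core computation underlying this scheme, evaluating the (unnormalised) matrix-Normal log-density of the augmented data at a candidate test score, can be sketched as follows. This is our own minimal illustration with synthetic numbers, not the book's implementation; `sqe_kernel`, `log_matrix_normal_core`, `augmented_loglike` and the jitter terms are names and choices we introduce:

```python
import numpy as np

def sqe_kernel(s, a, ell):
    """SQE kernel matrix, K(s_i, s_j) = a * exp(-(s_i - s_j)^2 / (2 ell^2))."""
    s = np.asarray(s, dtype=float)
    d2 = (s[:, None] - s[None, :]) ** 2
    return a * np.exp(-d2 / (2.0 * ell ** 2))

def log_matrix_normal_core(D, mu, sigma_P, sigma_C):
    """Unnormalised matrix-Normal log-density:
    -Tr[ sigma_C^{-1} (D - mu)^T sigma_P^{-1} (D - mu) ] / 2."""
    R = D - mu
    return -0.5 * np.trace(np.linalg.solve(sigma_C, R.T @ np.linalg.solve(sigma_P, R)))

rng = np.random.default_rng(0)
scores = np.array([0.1, 0.4, 0.9, 1.4])              # bespoke-learnt scores (synthetic)
X = rng.normal(size=(4, 3))                          # 4 patients, 3 pre-transplant components
sigma_C = np.cov(X, rowvar=False) + 0.1 * np.eye(3)  # regularised sample estimate

def augmented_loglike(s_test, x_test, a=1.0, ell=0.5):
    """Log-density of the augmented data, at a candidate test score s_test."""
    s_aug = np.append(scores, s_test)
    D_aug = np.vstack([X, x_test])
    mu_aug = np.tile(D_aug.mean(axis=0), (5, 1))     # column-wise sample-mean mean matrix
    sigma_P = sqe_kernel(s_aug, a, ell) + 1e-6 * np.eye(5)  # jitter for stability
    return float(log_matrix_normal_core(D_aug, mu_aug, sigma_P, sigma_C))
```

In an MCMC scheme, $s^{(test)}$ (along with $\ell$ and $a$) would be proposed iteratively, and this log-density combined with the invoked priors.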

5.4.2 Which Pre-transplant Factors Affect SOS/VOD Progression Most?

It is possible to identify the ranking, in order of association with the onset prospects and progression of SOS/VOD, of the components of the 30-dimensional pre-transplant vector $X$, where SOS/VOD onset and progression is parametrised by the VOD-score $S$. As $X$ is the output of the GP-generated function (which takes $S$ as the input), we monitor the probability of the data on the variable $X_{-k}$, given design inputs $\{s_1, \ldots, s_M\}$ and parameters of the GP that we invoke to model the functional relation between $X_{-k}$ and $S$, where $X_{-k}$ is the vector of 29 pre-transplant parameters, barring the $k$-th component of $X$. We do this for each $k = 1, \ldots, 30$. In other words, we leave out the $k$-th of the 30 pre-transplant parameters and check the likelihood of the model parameters given the data on $X_{-k} = (X_1, \ldots, X_{k-1}, X_{k+1}, \ldots, X_{30})^T$, $\forall k = 1, \ldots, 30$. Above, we refer to the expanded training set comprising information on $M = 25$ patients, for each of whom the VOD-score is bespoke-learnt. The $k$ for which the likelihood given data on all 30 pre-transplant parameters is maximally different (by being higher) from that given data on $X_{-k}$, is the $k$ such that $X_k$ is the most influential pre-transplant variable. Using this technique we rank all 30 pre-transplant variables. The 3 most influential pre-transplant variables are noted in our work to be: the underlying condition of sufferance from Acute Lymphoblastic Leukaemia; the comorbidity of Hepatic Osteodystrophy liver disease; and the state of relapse into the underlying cancer, in that order.
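The bookkeeping of this leave-one-component-out ranking can be sketched as below. Here `column_loglik` is a deliberately simple stand-in (an independent per-component Gaussian likelihood under a linear fit on the scores) for the matrix-Normal GP likelihood of Sect. 5.4.1, and the data are synthetic, so this illustrates the ranking mechanics only:

```python
import numpy as np

def column_loglik(x, s):
    """Gaussian log-likelihood of one pre-transplant component x, under a
    linear fit on the bespoke-learnt scores s (a surrogate likelihood)."""
    A = np.vstack([s, np.ones_like(s)]).T
    coef, *_ = np.linalg.lstsq(A, x, rcond=None)
    resid = x - A @ coef
    var = resid.var() + 1e-9
    return float(-0.5 * np.sum(resid ** 2 / var + np.log(2.0 * np.pi * var)))

def rank_influence(X, s):
    """Rank components by the drop in log-likelihood incurred when the k-th
    component is left out: the larger the drop, the more influential X_k."""
    full = sum(column_loglik(X[:, j], s) for j in range(X.shape[1]))
    drops = []
    for k in range(X.shape[1]):
        loo = sum(column_loglik(X[:, j], s) for j in range(X.shape[1]) if j != k)
        drops.append(full - loo)
    return list(np.argsort(drops)[::-1])  # most influential first

rng = np.random.default_rng(1)
s = rng.uniform(0.0, 2.0, size=25)               # bespoke-learnt scores (synthetic)
X = rng.normal(size=(25, 5))                     # 5 pre-transplant components
X[:, 2] = 2.0 * s + 0.01 * rng.normal(size=25)   # component 2 tracks the score closely
```

Under this surrogate, the component that is most strongly associated with the score is recovered as the top-ranked one, mirroring the criterion stated above.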



Fig. 5.4 Top right: trace of the length scale hyperparameter $\ell$, learnt in an MCMC chain run to learn the VOD-score of test patient $P_1$. Top left: trace of the amplitude hyperparameter $a$, learnt in this chain. Bottom left: trace of $P_1$'s VOD-score. Bottom right: histogram of the post-burnin values of the score variable (referred to as the "DPS", for Disease Progression Score), sampled during the MCMC chain run with the training data of 25 retrospective patients, augmented by the pre-transplant variable vector observed for this patient

5.5 Summary

In this chapter we have discussed a methodology that streamlines the learning of the functional relation between a vector-valued input and an output that is a multivariate time series covering a variable temporal range. One way to circumvent the variable lengths of the different realisations of this output is to learn a scalar parametrisation that bears information about the output, irrespective of the temporal coverage that is idiosyncratic to a given realisation of this output. This might be possible by learning the form of the transformation which, when operated

5.5 Summary

215

on this variably-long time series variable, results in the sought scalar parametrisation of the same. Learning the form of this transformation will then permit prediction of the scalar parameter at a given realisation of the aforementioned output variable. However, we do not know what the value of such a scalar parameter is, for a given time series. That is to say, we do not have any training data comprising pairs of values of this time series and the corresponding scalar parameter. Therefore, the need arises for the bespoke learning of this sought, lossless, scalar (score) parametrisation of a given realisation of the multivariate time series variable, covering a realised temporal range. Such bespoke learning is motivated to annul the effect of the variability of the temporal coverage of any realisation of the multivariate time series variable, in the learning of the sought scalar score that parametrises this time series variable. With this as a guide, we opt for the sought scalar parameter of a given realisation of the time series variable to be learnt as a relative score between this realisation and a reference one, where the reference realisation is arbitrarily chosen. Such a relative score is then defined as the distance between the graphical models that are learnt of the given realisation of the time series variable and of the reference realisation. It is, in fact, the distance between the posterior probabilities of the relevant graphical models that is transformed linearly to offer the relative score. Any such graphical model is defined upon the vertex set comprising nodes that represent the different attributes, on each of which the time series of observations are available, to build the considered realisation of the multivariate time series variable.

The edge set of this graphical model contains those edges that are given by a probability in excess of a cutoff or threshold probability, where the estimate of the probability that marks an edge of the graphical model is the relative frequency of existence of this edge in the sequence of random graphs that are generated during the post-burnin part of the MCMC chain in which this graph variable is updated. Such updating occurs within the MCMC chain, at the updated inter-attribute correlation matrix of the considered data, i.e. the considered realisation of the multivariate time series variable. To be precise, any such random graph is a Random Geometric Graph that is drawn in a probabilistic metric space, resulting effectively in each random graph being a Soft Random Geometric Graph, or an SRGG. Then the posterior probability of this SRGG variable in each iteration of the MCMC chain is noted, and it is the (scaled) Hellinger distance between the posterior probabilities of the graphical models, learnt given the considered and the reference data, that defines the inter-graph distance. We solve the system of linear equations that formulate the distance between each pair of the graphs, learnt given each pair of considered data sets (not just those inclusive of the reference data), using MCMC, to learn the unknown score for each data set, i.e. for each realisation of the multivariate time series variable, on a scale in which the score assigned to the reference realisation is 1. Once the score is learnt for each realisation of the output (that is, this variably-long time series), its relation with all identifiable system input parameters can be learnt, to enable prediction of the values of some of the input variables at which a given score can be attained. Again, learning of the score for a new (or test) time series, at given input parameter values, is also undertaken. The functional learning between the system parameters and the score parameter, values of which are bespoke-learnt, also allows for the identification of the input variables that appear to influence the score parameter the most. In this chapter, this form of bespoke learning is undertaken in a real-world application in Haematology-Oncology. There are multiple other applications envisaged for such a learning technique, which offers a lossless scalar parametrisation of a variably-long multivariate time series. Thus, this form of bespoke learning could be a relevant undertaking in the learning of a lossless composite index of (generally speaking, variably-long) multivariate data, within fields such as Econometrics, Finance, Healthcare, etc.
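The Hellinger distance invoked in the summary above can be illustrated on discrete (normalised) probability vectors; a minimal sketch, with synthetic vectors standing in for the recorded graph posteriors:

```python
import math

def hellinger(p, q):
    """Hellinger distance between two discrete probability distributions,
    H(p, q) = sqrt( (1/2) * sum_i (sqrt(p_i) - sqrt(q_i))^2 ), taking values in [0, 1]."""
    assert abs(sum(p) - 1.0) < 1e-9 and abs(sum(q) - 1.0) < 1e-9
    return math.sqrt(0.5 * sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q)))

# identical distributions are at distance 0; disjointly-supported ones at the maximum, 1
```

Its boundedness in $[0, 1]$ is one reason it is convenient to scale linearly into a relative score.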

References

1. E. M. Airoldi. Getting started in probabilistic graphical models. PLoS Computational Biology, 3(12):e252, 2007.
2. R. Aler, J. M. Valls, and H. Boström. Study of Hellinger distance as a splitting metric for random forests in balanced and imbalanced classification datasets. Expert Systems with Applications, 149(1):113264, 2020.
3. D. Bandyopadhyay and A. Canale. Sparse multi-dimensional graphical models: A unified Bayesian framework. Journal of the Royal Statistical Society Series C, 65(4):619–640, 2016.
4. S. Banerjee, A. Basu, S. Bhattacharya, S. Bose, D. Chakrabarty, and S. Mukherjee. Minimum distance estimation of Milky Way model parameters and related inference. SIAM/ASA Journal on Uncertainty Quantification, 3(1):91–115, 2015.
5. P. Benner, R. Findeisen, D. Flockerzi, U. Reichl, and K. Sundmacher. Large-Scale Networks in Engineering and Life Sciences. Modeling and Simulation in Science, Engineering and Technology. Springer, Switzerland, 2014.
6. C. M. Carvalho and M. West. Dynamic matrix-variate graphical models. Bayesian Analysis, 2(1):69–97, 2007.
7. S. Corbacioglu, E. Carreras, M. Ansari, A. Balduzzi, S. Cesaro, J. Dalle, et al. Diagnosis and severity criteria for sinusoidal obstruction syndrome/veno-occlusive disease in pediatric patients: a new classification from the European Society for Blood and Marrow Transplantation. Bone Marrow Transplant, 53:138–145, 2018.
8. S. Ding and R. Dennis Cook. Matrix variate regressions and envelope models. Journal of Multivariate Analysis, 80(2):387–408, 2018.
9. A. P. Giles, O. Georgiou, and C. P. Dettmann. Connectivity of soft random geometric graphs. Journal of Statistical Physics, 162(4):1068–1083, 2016.
10. R. Guhaniyogi and D. Spencer. Bayesian tensor response regression with an application to brain activation studies. Bayesian Analysis, TBA:1–29, 2021.
11. S. Gurden, J. Westerhuis, R. Bro, and A. Smilde. A comparison of multiway regression and scaling methods. Chemometrics and Intelligent Laboratory Systems, 59:121–136, 2001.
12. H. Imran, et al. Use of prophylactic anticoagulation and the risk of hepatic VOD in patients undergoing HSCT: a systematic review and meta-analysis. Bone Marrow Transplant, 37:677–686, 2006.
13. B. G. Lindsay. Efficiency versus robustness: The case for minimum Hellinger distance and related methods. Annals of Statistics, 22(2):1081–1114, 1994.
14. E. F. Lock. Tensor-on-tensor regression. Journal of Computational and Graphical Statistics, 27(3):638–647, 2018.
15. M. Mohty, et al. Prophylactic, preemptive, and curative treatment for SOS/VOD disease in adult patients: a position statement from an international expert group. Bone Marrow Transplant, 55:485–495, 2020.
16. G. B. McDonald, M. S. Hinds, L. D. Fisher, H. G. Schoch, J. L. Wolford, M. Banaji, et al. Veno-occlusive disease of the liver and multiorgan failure after bone marrow transplantation: a cohort study of 355 patients. Annals of Internal Medicine, 118:255–267, 1993.
17. K. Menger. Statistical metrics. Proc. Nat. Acad. Sci. USA, 28(12):535–537, 1942.
18. M. Mitchell, M. Genton, and M. Gumpertz. Testing for separability of space-time covariances. Environmetrics, 16:819–831, 2005.
19. M. Mohty, F. Malard, M. Abecassis, E. Aerts, A. Alaskar, M. Aljurf, et al. Sinusoidal obstruction syndrome/veno-occlusive disease: current situation and perspectives; a position statement from the European Society for Blood and Marrow Transplantation (EBMT). Bone Marrow Transplant, 50:781–789, 2015.
20. I. N. Muhsen and S. K. Hashmi. Utilizing machine learning in predictive modeling: what's next? Bone Marrow Transplant, 57:699–700, 2022.
21. D. N. Naik and S. S. Rao. Analysis of multivariate repeated measures data with a Kronecker product structured covariance matrix. Journal of Applied Statistics, 28:91–105, 2001.
22. T. Nummi and J. Möttönen. On the analysis of multivariate growth curves. Metrika, 52:77–89, 2000.
23. M. Penrose. Random Geometric Graphs. Oxford Studies in Probability, OUP, Oxford, 2003.
24. M. D. Penrose. Connectivity of soft random geometric graphs. Annals of Applied Probability, 26:986–1028, 2016.
25. P. Richardson, S. Aggarwal, O. Topaloglu, et al. Systematic review of defibrotide studies in the treatment of veno-occlusive disease/sinusoidal obstruction syndrome (VOD/SOS). Bone Marrow Transplant, 54:1951–1962, 2019.
26. P. Richardson, et al. Phase 3 trial of defibrotide for the treatment of severe SOS/VOD and multi-organ failure. Blood, 127:1656–1665, 2016.
27. B. Schweizer and A. Sklar. Probabilistic Metric Spaces. North-Holland, 1983.
28. A. Smilde, J. Westerhuis, and R. Boqué. Multiway multiblock component and covariates regression models. Journal of Chemometrics, 14:301–331, 2000.
29. C. Strouse, P. Richardson, G. Prentice, S. Korman, R. Hume, B. Nejadnik, et al. Defibrotide for treatment of severe veno-occlusive disease in pediatrics and adults: an exploratory analysis using data from the Center for International Blood and Marrow Transplant Research. Biology of Blood and Marrow Transplantation, 22:1951–1962, 2016.
30. P. D. Tsirigotis, I. B. Resnick, B. Avni, S. Grisariu, P. Stepensky, R. Or, and M. Y. Shapira. Incidence and risk factors for moderate-to-severe veno-occlusive disease of the liver after allogeneic stem cell transplantation using a reduced intensity conditioning regimen. Bone Marrow Transplant, 49:1389–1392, 2014.
31. C. Viroli. On matrix-variate regression analysis. Journal of Multivariate Analysis, 111:296–309, 2012.
32. J. Whittaker. Graphical Models in Applied Multivariate Statistics. Wiley, 2008.
33. K. Yakushijin, Y. Atsuta, N. Doki, A. Yokota, H. Kanamori, T. Miyamoto, et al. Sinusoidal obstruction syndrome after allogeneic hematopoietic stem cell transplantation: Incidence, risk factors and outcomes. Bone Marrow Transplant, 51:403–409, 2016.

Appendix A

Bayesian Inference by Posterior Sampling Using MCMC

A.1 Bayesian Inference by Sampling

Methodologies presented in this book are all Bayesian in nature, and the inference undertaken is Bayesian too; to be precise, all inference discussed here is performed using Markov Chain Monte Carlo (MCMC) techniques. This appendix is dedicated to a discussion of such techniques.

As we know, the Bayesian paradigm treats every unknown as a random structure that is underlined by its probability distribution. However, as practitioners of the relevant discipline, we are interested in the values that the unknowns attain, given the data at hand. Bayesian inference refers to such learning of the values of the variables, given the (marginal posterior) probability distribution of each unknown, conditional on the data at hand. One way of undertaking Bayesian inference on a variable $X$ is then to invert the marginal posterior probability density of $X$, which would offer the sought values of $X$, given the data at hand. A concern that pre-empts any anticipated difficulty with this inversion is the generation of the marginal posterior itself: it is the joint posterior probability density of multiple variables that is formulated via Bayes rule, given the data. Marginalising this joint over all such variables except $X$ is not closed-form in general, and we seek a robust and fast tool that easily (and organically) offers the marginal posterior of each unknown, given the data. In addition, the inversion of the marginal posterior of $X$, however it might have been obtained, is not possible in closed form for most generic posteriors, so the suggested inversion is not viable. We are also motivated to use an inferential tool that allows for the acknowledgement of noise in the data in undertaking said inference, and that is at the same time capable of producing clearly-interpretable and objective uncertainties on the sought values of each of the unknowns that the inference is undertaken on, given the data at hand.
The above items on our wish-list are concurrently satisfied by MCMC-based inference, performed by sampling from the joint posterior of all unknowns given the data at hand.

© Springer Nature Switzerland AG 2023 D. Chakrabarty, Learning in the Absence of Training Data, https://doi.org/10.1007/978-3-031-31011-9


A.1.1 How to Sample?

The relevant question then is how to sample from the posterior of the unknowns, given the data. Before addressing that question, we take up the simpler question of how to generate a sample of values of a random variable $X$ from its pdf $f_X(x)$. Let $X \in \mathcal{X} \subseteq \mathbb{R}$, where the density $f_X(\cdot)$ does not necessarily ascribe to any parametrised form, and the distribution function $F_X(\cdot)$ is not invertible in closed form. The trick is then to sample from an analytically invertible density $g(x)$, but to accept into the sought set of samples only those values $x'$, sampled from $g(\cdot)$, that satisfy a cleverly designed acceptance prescription. This prescription is designed to confer on the thus-accepted $x'$ the distribution $F_X(\cdot)$. Thus, the target density that we would like to sample from is $f_X(\cdot)$, but instead we propose a candidate $x'$ as a value of $X$ that can be included in this sought sample, with $x'$ sampled from an invertible density $g(\cdot)$; $g(\cdot)$ is the proposal density. We choose $M \in \mathbb{R}_{>0}$ s.t. $f_X(x)/Mg(x) \leq 1, \; \forall x \in \mathcal{X}$, and $x'$ is accepted into the sought sample if
$$u \leq \frac{f_X(x')}{M g(x')}, \quad \text{where } u \sim \text{Uniform}[0, 1].$$
In other words, if the proposed $x'$ abides by this acceptance criterion, then we set the next value of $X$ sampled from $f_X(\cdot)$ to be $x'$. In this way, we can populate the sought sample. This acceptance prescription works because it can be proved that the probability that $x' \sim g(\cdot)$ lies in any arbitrarily defined interval $[a, b] \subseteq \mathcal{X}$, conditional on it being accepted via adherence to the above acceptance criterion, is $\int_a^b f_X(u)\,du$. Such sampling is referred to as Rejection Sampling.
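As a minimal runnable sketch of this prescription: below, the target $f_X(\cdot)$ is taken to be the Beta(2, 2) density on $[0, 1]$, the proposal $g(\cdot)$ is Uniform$[0, 1]$, and $M = 1.5$; all three choices are illustrative assumptions of this sketch, not quantities from the book.

```python
import random

def target_f(x):
    # Illustrative target: Beta(2, 2) density on [0, 1], f(x) = 6 x (1 - x);
    # its maximum is 1.5, attained at x = 0.5.
    return 6.0 * x * (1.0 - x)

def proposal_g(x):
    # Trivially invertible proposal: Uniform[0, 1], so g(x) = 1 on the support.
    return 1.0

M = 1.5  # chosen so that f(x) / (M g(x)) <= 1 for all x in [0, 1]

def rejection_sample(n, seed=0):
    rng = random.Random(seed)
    samples = []
    while len(samples) < n:
        x_cand = rng.random()   # candidate x' ~ g
        u = rng.random()        # u ~ Uniform[0, 1]
        if u <= target_f(x_cand) / (M * proposal_g(x_cand)):
            samples.append(x_cand)  # x' accepted into the sample from f
    return samples

samples = rejection_sample(10_000)
mean = sum(samples) / len(samples)  # Beta(2, 2) has mean 0.5
```

Since the overall acceptance probability is $1/M$, a proposal that hugs the target more closely (permitting a smaller $M$) wastes fewer candidates.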

A.1.2 MCMC

In an iteration of inference with Markov Chain Monte Carlo (MCMC), the posterior probability density of the unknowns that we wish to make inference on is computed at the current values of these unknowns, and this posterior contributes towards the acceptance of the proposed values of the unknowns. If accepted, these values are rendered the updated "current" values of the unknowns, which in turn contribute towards the updating of the posterior probability density relevant at the next iteration. This creates a sequence of posterior probability densities of the unknowns, given the data, where every density is affected by the previous element of this sequence. Random variables generated by such sequentially varying densities then form a sequence of mutually correlated random variables, or to be precise, a Markov Chain. The sequential updating of the generative posterior density triggers convergence of this sequence of random variables, in distribution, to a variable underlined by the steady-state distribution of this Markov Chain. Thus, upon convergence, sampling occurs from the target density, known up to a constant. It is the acceptance criterion that, by design, allows values (of the unknowns) that occur at higher posterior probability densities to occur proportionately more frequently in the generated sample. Different implementations of such MCMC-based posterior sampling techniques vary in the design of the proposal density, and/or in the treatment of the inference required on multiple unknowns. In the next section, we discuss the characteristics that distinguish some of the MCMC techniques that have been used in this book.

A.1.3 Metropolis-Hastings

In the Metropolis-Hastings implementation of MCMC-based sampling, the acceptance criterion in the $k$-th iteration is the following.

If
$$\frac{\pi(x_1^{(k)}, \ldots, x_p^{(k)} \mid D)}{\pi(x_1^{(k-1)}, \ldots, x_p^{(k-1)} \mid D)} \times \frac{q(x_1^{(k-1)}, \ldots, x_p^{(k-1)} \mid x_1^{(k)}, \ldots, x_p^{(k)})}{q(x_1^{(k)}, \ldots, x_p^{(k)} \mid x_1^{(k-1)}, \ldots, x_p^{(k-1)})} \geq u, \tag{A.1}$$
then the proposed value is retained, i.e. $x_j^{(k)}$ stands as the value of $X_j$ in the $k$-th iteration; else $x_j^{(k)} = x_j^{(k-1)}$; $j = 1, \ldots, p$,

where:
– $u$ is a realisation of $U \sim \text{Uniform}[0, 1]$;
– the vector of unknowns is $\boldsymbol{X} = (X_1, \ldots, X_p)^T$;
– the current value of $X_j$ before acceptance within the $k$-th iteration is the value of $X_j$ current in the $(k-1)$-th iteration, denoted $x_j^{(k-1)}$; $j = 1, \ldots, p$;
– the proposed value of $X_j$ in the $k$-th iteration is $x_j^{(k)}$;
– the proposed candidate for $\boldsymbol{X}$ in the $k$-th iteration is sampled from the proposal density $q(\cdot \mid \cdot)$, centred at the current value of $\boldsymbol{X}$, till acceptance/rejection of the proposed value has been completed in this iteration;
– $D$ is the data that the posterior probability density has been computed as conditional on;
– $k$ is the index of the iteration, s.t. for an $N_{iter}$-long MCMC chain, $k = 1, \ldots, N_{iter}$.

This proposal density is chosen to be a parametric density for which closed-form inversion is possible, or for which an approximation to such an inversion is available. Typically, the proposal density computed at the proposed values of the candidates for $X_1, \ldots, X_p$ in the $k$-th iteration is centred at the current values of these variables. The proposal density is also computed at the current values, conditional on the proposed; this computed value of the density is required within the acceptance criterion. Typically, $X_i$ and $X_j$ are proposed independently, and not necessarily from the same parametric form of proposal density; $i, j = 1, \ldots, p$. Thus, $q(x_1^{(k)}, \ldots, x_p^{(k)} \mid x_1^{(k-1)}, \ldots, x_p^{(k-1)})$ is replaced by $q_1(x_1^{(k)} \mid x_1^{(k-1)}) \, q_2(x_2^{(k)} \mid x_2^{(k-1)}) \cdots q_p(x_p^{(k)} \mid x_p^{(k-1)})$; similarly for the proposal density computed at the current values of $X_1, \ldots, X_p$, given the proposed. Here $k = 1, \ldots, N_{iter}$. Thus, we notice that in generic (or vanilla) Metropolis-Hastings, all unknowns are updated simultaneously.

In the Random Walk implementation of Metropolis-Hastings, the proposal density is Normal, and therefore symmetric: its value computed at the proposed values, given the current, equals its value computed at the current values, given the proposed. In this implementation, the ratio of proposal densities therefore reduces to unity. In the Independent Sampler implementation, on the other hand, the proposal density is assigned all fixed parameters. These are the three types of Metropolis-Hastings implementation that have been employed in the inference undertaken in this book. Other implementations exist, but these have not been invoked here.

Any global (or unknown constant) scale of the posterior probability density is not captured in the updating, since such a scale cancels in the computation of the acceptance ratio. However, there are some real-world problems in which the normalisation of the posterior pdf, which ensures that this pdf integrates to unity over all possible values of the relevant variables, depends on the unknowns and is not a constant. One such problem is discussed in Chap. 3.
In such cases, the normalisation computed at the proposed values, as well as that computed at the current values of the unknowns, enter the computation of the acceptance ratio. Since the proposed and current values of the unknowns are not equal in general, the normalisations then do not cancel in the definition of the acceptance ratio.

As each unknown is updated in every iteration, a sample of accepted values of each of the $p$ unknowns is built up. Let the chain be considered converged from the $N_{burnin}$-th iteration onward. Then the sample of accepted values of the unknown $X_j$ offers a ready means of constructing the histogram of the values of $X_j$ that are sampled from the target density, i.e. samples accepted post-convergence, i.e. samples accepted from the $(N_{burnin}+1)$-th iteration onward. This histogram approximates the scaled marginal posterior probability density of $X_j$ given data $D$; $j = 1, \ldots, p$. We recall that an easy and organic achievement of the marginals of each of the unknowns was on our wish-list.

In this histogram, we proceed from the smallest value of $X_j$ towards higher values, and from the highest value of $X_j$ towards lower values, while monitoring the area under the histogram curve bounded by the pair of $X_j$ values that we are currently at. Once this area is $100(1-\alpha)\%$ of the total area under the histogram, the current (low and high) values of $X_j$ define the $100(1-\alpha)\%$ Highest Probability Density credible region (or HPD). Typically, we use $\alpha = 0.05$, s.t. we learn the uncertainty in our learning of $X_j$ given data $D$ as the 95% HPD. Thus, our keenness to learn the uncertainties on the learnt values of every unknown is addressed within the MCMC paradigm.
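The Random Walk Metropolis-Hastings recipe, together with a shortest-interval version of the 95% HPD read-off described above, can be sketched for a single unknown as follows; the standard Normal stand-in for $\pi(x \mid D)$, the step size, and the burn-in length are all illustrative assumptions of this sketch:

```python
import math
import random

def log_post(x):
    # Unnormalised log posterior pi(x | D); a standard Normal stands in here.
    return -0.5 * x * x

def rw_metropolis(n_iter, step=1.0, x0=0.0, seed=0):
    # Random Walk Metropolis-Hastings: symmetric Normal proposal, so the
    # ratio of proposal densities in the acceptance criterion is unity.
    rng = random.Random(seed)
    chain, x = [], x0
    for _ in range(n_iter):
        x_prop = x + rng.gauss(0.0, step)
        log_ratio = log_post(x_prop) - log_post(x)
        if log_ratio >= 0 or rng.random() < math.exp(log_ratio):
            x = x_prop          # proposed value accepted
        chain.append(x)         # on rejection, the current value is repeated
    return chain

def hpd(samples, alpha=0.05):
    # Shortest interval containing 100(1 - alpha)% of the samples; for a
    # unimodal marginal this matches the histogram-based HPD construction.
    s = sorted(samples)
    k = math.ceil((1.0 - alpha) * len(s))
    i = min(range(len(s) - k + 1), key=lambda j: s[j + k - 1] - s[j])
    return s[i], s[i + k - 1]

chain = rw_metropolis(50_000)
post_burnin = chain[5_000:]     # discard the burn-in iterations
lo, hi = hpd(post_burnin)       # roughly (-1.96, 1.96) for a standard Normal
```

Note that `log_post` need only be known up to an additive constant: any global scale of the posterior cancels in `log_ratio`, exactly as described above.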

A.1.4 Gibbs Sampling

Another implementation of MCMC-based sampling is Gibbs sampling, which is a case of Metropolis-Hastings with unit acceptance ratio. Gibbs sampling allows the inference to be made on the vector-valued $\boldsymbol{X} = (X_1, \ldots, X_p)^T$ to be reduced into $p$ inference problems, made respectively on $X_1$, then on $X_2$, ..., up to $X_p$. The updating of $X_j$ in the $k$-th iteration is done by proposing a value of $X_j$ from the proposal density that is the marginal posterior probability density $m(\cdot \mid \cdot)$ of $X_j$, given the data $D$ and the values of all other unknowns current in the $k$-th iteration, till the updating of $X_j$ has been undertaken. Thus,
$$x_j^{(k)} \sim m(x_j \mid D, x_1^{(k)}, \ldots, x_{j-1}^{(k)}, x_{j+1}^{(k-1)}, \ldots, x_p^{(k-1)}).$$
It can be shown that this proposal density, once input into the acceptance ratio of Metropolis-Hastings, guarantees acceptance. Then
$$x_1^{(k)} \sim m(x_1 \mid D, x_2^{(k-1)}, \ldots, x_p^{(k-1)}),$$
$$x_2^{(k)} \sim m(x_2 \mid D, x_1^{(k)}, x_3^{(k-1)}, \ldots, x_p^{(k-1)}),$$
$$\vdots$$
$$x_j^{(k)} \sim m(x_j \mid D, x_1^{(k)}, \ldots, x_{j-1}^{(k)}, x_{j+1}^{(k-1)}, \ldots, x_p^{(k-1)}),$$
$$\vdots$$
$$x_p^{(k)} \sim m(x_p \mid D, x_1^{(k)}, \ldots, x_{p-1}^{(k)}).$$
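The update sequence above can be sketched on a toy target: a bivariate Normal with correlation $\rho$, standing in for the posterior, whose full conditionals are Normal and available in closed form. The target, the value $\rho = 0.8$, and the burn-in length are all assumptions of this sketch:

```python
import math
import random

RHO = 0.8  # correlation of an illustrative bivariate Normal "posterior"

def gibbs(n_iter, seed=0):
    # For this target, each conditional m(. | D, other unknown) is
    # Normal(RHO * other, 1 - RHO^2), so every draw is accepted outright,
    # consistent with the unit acceptance ratio of Gibbs sampling.
    rng = random.Random(seed)
    sd = math.sqrt(1.0 - RHO * RHO)
    x1, x2 = 0.0, 0.0
    chain = []
    for _ in range(n_iter):
        x1 = rng.gauss(RHO * x2, sd)   # update X1 given the current x2
        x2 = rng.gauss(RHO * x1, sd)   # update X2 given the just-updated x1
        chain.append((x1, x2))
    return chain

chain = gibbs(50_000)[5_000:]          # discard burn-in
est_corr = sum(a * b for a, b in chain) / len(chain)  # approximates RHO
```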
A.1.5 Metropolis-Within-Gibbs

Instead of updating each variable one at a time, it is also possible to update a subset of the $p$ variables in the first block within an iteration; then a few more within the next block, updated given the already-updated variables and the current values of those not yet considered for updating; and so on, till the last remaining subset of the unknowns is updated. Within any block, the updating of the chosen few variables happens according to a chosen implementation of Metropolis-Hastings.

In multiple applications discussed in this book, such a Metropolis-within-Gibbs implementation of MCMC is undertaken. In Chap. 5, we also undertake Metropolis with a 2-block update. In this inferential scheme, one set of unknown variables is updated in the first block within any iteration, given the data alone; this updating is not informed by the current values of the remaining unknowns. Thereafter, in the second block, the remaining variables are updated, given the updated first subset of variables. Again, within a block, the updating of all relevant variables happens as in Metropolis-Hastings.
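A minimal two-block sketch, in which each block holds one unknown and is updated by Random Walk Metropolis-Hastings given the other block's latest value; the bivariate Normal target and all tuning choices are illustrative assumptions, not a model from the book:

```python
import math
import random

RHO = 0.8  # correlation of an illustrative bivariate Normal target

def log_joint(x1, x2):
    # Unnormalised log density of the assumed bivariate Normal posterior.
    return -(x1 * x1 - 2.0 * RHO * x1 * x2 + x2 * x2) / (2.0 * (1.0 - RHO * RHO))

def metropolis_within_gibbs(n_iter, step=1.0, seed=0):
    rng = random.Random(seed)
    x1, x2 = 0.0, 0.0
    chain = []
    for _ in range(n_iter):
        # Block 1: Random Walk MH update of x1, at the current value of x2.
        prop = x1 + rng.gauss(0.0, step)
        log_ratio = log_joint(prop, x2) - log_joint(x1, x2)
        if log_ratio >= 0 or rng.random() < math.exp(log_ratio):
            x1 = prop
        # Block 2: Random Walk MH update of x2, given the already-updated x1.
        prop = x2 + rng.gauss(0.0, step)
        log_ratio = log_joint(x1, prop) - log_joint(x1, x2)
        if log_ratio >= 0 or rng.random() < math.exp(log_ratio):
            x2 = prop
        chain.append((x1, x2))
    return chain

chain = metropolis_within_gibbs(50_000)[5_000:]   # discard burn-in
mean_x1 = sum(a for a, _ in chain) / len(chain)   # target mean is 0
```

Unlike pure Gibbs, this scheme does not require the conditionals in closed form; each block only needs the joint density up to a constant.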

Index

Symbols
α2-shifted forecast rate, 88

A
Absent training data, 2, 4, 103, 131, 154
ARIMA, 25, 83
Attractor, 27
Autonomous dynamical system, 36, 102

B
Back Scattered Electrons (BSE), 167
Bayesian inference, 29
Bespoke learning, 14
  in the brain, 19
  example problems, 18
  implementation, 157
  material density function, 171
  potential, 43, 83
  using system Physics, 154
Binning the energy range, 128
Binning the radial range, 127
Block matrix, 176
Bone marrow transplant, 13

C
Central potential, 110
Closed-form forward prediction, 67
Closed-form prediction, 53
Collisionless Boltzmann Equation (CBE), 109
Composite index, 192
Convolving with error density, 115, 124
Covariance kernel, 50, 138
Covariance
  learning using MCMC, 162
  parametrisation using covariance kernels, 163
  sample estimate, 163
COVID 19, 14

D
Dark matter, 9, 102
Design (sub-surface) locations, 178
Design point, 4
Deterministic system, 35
Differently-long time series, 3, 190
Disease progression score, 192
Distance between graphical models, 192, 202
Distance function of probability metric space, 199

E
Electron microscope, 7, 14, 166
Elliptical galaxies, 105
Empirical dynamical models (EDM), 26
Empirically-observed rate, 76
Empirical negative potential, 91
Empty brain, 18
Enclosed gravitational mass, 142
Energy, 48
Error function, 124
Evolution-driving function, 10
Evolution of pdf of phase space variables, 108
Evolution of phase space pdf, 3


F
Forecasting, 141
  material density function, 184
Forecasting state results, 86
Forecast rate, 88
Forward and inverse prediction, 156
Forward prediction, 147, 148

G
Gaussian Process (GP), 14, 30, 137, 162
  non-stationary, 37
Gibbs sampling, 223
Globular Clusters, 132
Gravitational mass density of galaxies, 103

H
Hamiltonian system, 36
Hellinger distance, 202
Hellinger distance between posterior of graphs, 202
Hidden Markov Models, 31
Highest Probability Density credible region (HPD), 67, 77, 117, 193
Hyperparameters of covariance kernel, 165

I
Imaging at a sequence of beam energies, 168
Inference on graphical model using MCMC, 193
Information redundancy, 93
Integrals of motion, 47, 107
Interaction Volume, 167
Inter-column correlation matrix of data, 194
Inter-component covariance, 60
Inter-nodal affinity, 197
Inter-time covariance, 60, 66
Inverse transform sampling, 219
Irrelevance of available training data, 7
Isotropic phase space pdf, 114

K
Kernel hyperparameter - unknown function of sample path, 165
Kernel parametrisation, 66

L
Lack of information on probability distribution of variables, 2

Learning a network, 197
Learning period, 44
Length scale parameters, 66
Likelihood, 56, 104, 112, 115, 158
Liouville’s Theorem, 47
Lookback data, 175
Lossless scalar parametrisation, 190

M
M60, 132
Marginal posterior - histogram visualisation, 88
Marginal posterior of edge variable, 197
Markov Chain Monte Carlo (MCMC), 14, 29, 46, 104, 159, 175, 221
Markov Decision Process, 33
Material density function, 166
Matrix Normal density, 50, 62, 196
MCMC-based inference on score, 205
MCMC diagnostics, 136
Metropolis-Hastings, 175, 222
  Independent Sampler, 222
  Random Walk, 222
  with 2-block update, 200
Metropolis-within-Gibbs, 59, 125, 128, 135, 159, 165, 224
Microscopy correction function - convolution with density, 169
Multivariate Normal, 146

N
Negative forecast potential, 76, 91
New infection numbers of COVID 19, 74
Newton’s Second Law, 24, 27, 41, 50, 72, 109
  feedback from state to generative process, 43
NFW model as prior, 124
NGC4649, 132
Nondestructive learning, 154
Non-linear dynamics, 25, 118
Normalisation of posterior, 113
Not Generalised Linear Model, 155
Not linear regression, 155
Nuclear Magnetic Resonance (NMR), 9, 154

P
Parametric fitting - inadequate, 161
Partial correlation, 196
Partially observed phase space vector, 105, 111
Partitioning details, 80
Permutation entropy, 93
Petrophysics, 9
Phase space
  pdf, 24
  variables, 24
Planetary nebulae, 132
Poincaré sections, 150
Poisson equation, 103, 110
Positive, monotonicity imposed via MCMC, 120
Posterior of inter-column correlation matrix given data, 195
Potential, 10, 28, 35, 37, 38
  causal driver of evolution, 71
  from density, 121
  effective, 39
  generalised treatment, 74
  time dependence, 47
Prediction, 140
  given matrix Normal likelihood, 70
  period, 44, 67
Probabilistic metric space, 193
Probability of edge variable, 194
Problems with parametric fitting, 2
Progression score from inter-graph distance, 204

R
Random Geometric Graph (RGG), 194
Random Walk Metropolis Hastings, 78, 146
Ranking covariates by strength of association, 191
Reinforcement learning, 15, 34
Rejection sampling, 220
Reward function, 35
Riemann sum, 119, 132

S
Sample specificity, 7
Secondary Electrons, 178
Sequence of random graphs, 193
Sequential inverse projections, 169
Skew-Normal, 32
Soft Random Geometric Graph (SRGG), 193
SOS/VOD, 207
Spatial statistics, 11
Speed of forecasting, 78
Speeds of learning, 78
State space models, 10, 26
Stationary GPs nested inside non-stationary GP, 165
Stem Cell Transplantation, 207
Sub-surface material density function, 7
Supermassive blackhole, 118
Supervised learning, 3, 36

T
Tensor Normal density, 162
Tensor-valued function, 155
Tensor-valued variable, 4
Test data, 4, 140
Test time, 59
Time series as response, 190
Time window, 44
Training data, 4, 29, 36
Truncated Normal, 32, 129, 135

U
Unsupervised learning, 15
Useful information, 94

V
Vectorisation, 119
Vector-variate Gaussian Process, 50
Velocity-independent potential, 106
Voxel, 168

W
WHO dashboard, 74
Width of time window, 81
With-uncertainty graphical model, 201

X
X-ray emission, 143