Hume's Problem Solved: The Optimality of Meta-Induction (The MIT Press) 0262039729, 9780262039727

A new approach to Hume's problem of induction that justifies the optimality of induction at the level of meta-induction.


English · 400 pages · 2019


Table of contents:
Cover
Contents
Preface
1: The Problem of Induction
2: On Failed Attempts to Solve the Problem of Induction
3: The Significance of Hume’s Problem for Contemporary Epistemology
4: Are Probabilistic Justifications of Induction Possible?
5: A New Start: Meta-Induction, Optimality Justifications, and Prediction Games
6: Kinds of Meta-Inductive Strategies and Their Performance
7: Generalizations and Extensions
8: Philosophical Conclusions and Refinements
9: Defense against Objections
10: Interdisciplinary Applications
11: Conclusion and Outlook: Optimality Justifications as a Philosophical Program
12: Appendix: Proof of Formal Results
Formal Symbols and Abbreviations
Memos, Definitions, Propositions, Theorems, Figures, and Tables
References
Subject Index
Author Index


Hume's Problem Solved

The Optimality of Meta-Induction

Gerhard Schurz

The MIT Press

Cambridge, Massachusetts

London, England

© 2019 Massachusetts Institute of Technology

All rights reserved. No part of this book may be reproduced in any form by any electronic or mechanical means (including photocopying, recording, or information storage and retrieval) without permission in writing from the publisher. This book was set in Stone Serif by Westchester Publishing Services. Printed and bound in the United States of America.

Library of Congress Cataloging-in-Publication Data
Names: Schurz, Gerhard, 1956- author.
Title: Hume's problem solved : the optimality of meta-induction / Gerhard Schurz.
Description: Cambridge, MA : MIT Press, 2019. | Includes bibliographical references and index.
Identifiers: LCCN 2018032149 | ISBN 9780262039727 (hardcover : alk. paper)
Subjects: LCSH: Hume, David, 1711-1776. | Induction (Logic)
Classification: LCC B1499.I6 S38 2019 | DDC 161--dc23
LC record available at https://lccn.loc.gov/2018032149

10 9 8 7 6 5 4 3 2 1

Contents

Preface

1 The Problem of Induction
1.1 The Notion of Induction: Conceptual Clarifications
1.2 David Hume and the Problem of Justifying Induction
1.3 Plan of the Book

2 On Failed Attempts to Solve the Problem of Induction
2.1 Can Induction Be Avoided?
2.2 Is Induction Rational "by Definition"? Rationality and Cognitive Success
2.3 Can Induction Be Justified by Assumptions of Uniformity?
2.4 Can Circular Justifications of Induction Have Epistemic Value?
2.5 Can Induction Be Justified by Abduction or Inference to the Best Explanation?
2.6 The Role of Induction and Abduction for Instrumentalism and Realism

3 The Significance of Hume's Problem for Contemporary Epistemology
3.1 The Aims of Epistemology
3.2 Foundation-Oriented Epistemology and Its Main Problems
3.3 Coherentism and Its Shortcomings
3.4 Externalism and Its Shortcomings
3.5 The Necessity of Reliability Indicators for the Social Spread of Knowledge
3.6 Conclusion: A Plea for Foundation-Oriented Epistemology

4 Are Probabilistic Justifications of Induction Possible?
4.1 Why Genuine Confirmation Needs Induction Axioms
4.2 Digression: Goodman's Paradox and the Problem of Language Relativity
4.3 Statistical Principal Principle and Narrowest Reference Classes
4.4 Statistical Principal Principle and Exchangeability as Weak Induction Axioms
4.5 Indifference Principle as an Induction Axiom
4.6 Inductive Probabilities without the Principle of Indifference?
4.7 Is Skepticism Unavoidable?

5 A New Start: Meta-Induction, Optimality Justifications, and Prediction Games
5.1 Reichenbach's Best Alternative Approach
5.2 Reliability Justifications versus Optimality Justifications
5.3 Shortcomings of Reichenbach's Best Alternative Approach
5.4 Object-Induction versus Meta-Induction
5.5 Prediction Games
5.6 Classification of Prediction Methods and Game-Theoretic Reflections
5.7 Definitions of Optimality, Access-Optimality, and (Access-)Dominance
5.8 Three Related Approaches: Formal Learning Theory, Computational Learning Theory, and Ecological Rationality Research
5.9 Simple and Refined (Conditionalized) Inductive Methods

6 Kinds of Meta-Inductive Strategies and Their Performance
6.1 Imitate the Best (ITB): Achievements and Failures
6.2 Epsilon-Cautious Imitate the Best (eITB)
6.3 Systematic Deception: Fundamental Limitations of One-Favorite Meta-Induction
6.3.1 General Facts about Nonconverging Frequencies
6.3.2 Nonconvergent Success Oscillations and Systematic Deceivers
6.3.3 Limitations of One-Favorite Meta-Induction
6.4 Deception Detection and Avoidance Meta-Induction (ITBN)
6.5 Further Variations of One-Favorite Meta-Induction
6.6 Attractivity-Weighted Meta-Induction (AW) for Real-Valued Predictions
6.6.1 Simple AW
6.6.2 Exponential AW
6.6.3 Access-Superoptimality
6.7 Attractivity-Weighted Meta-Induction for Discrete Predictions
6.7.1 Randomized AW Meta-Induction
6.7.2 Collective AW Meta-Induction
6.8 Further Variants of Weighted Meta-Induction
6.8.1 Success-Based Weighting
6.8.2 Worst-Case Regrets and Division of Epistemic Labor

7 Generalizations and Extensions
7.1 Bayesian Predictors and Meta-Inductive Probability Aggregation
7.2 Intermittent Prediction Games
7.2.1 Take the Best (TTB)
7.2.2 Intermittent AW
7.3 Unboundedly Growing Numbers of Players
7.3.1 New Players with Self-Completed Success Evaluation
7.3.2 Meta-Induction over Player Sequences
7.4 Prediction of Test Sets
7.5 Generalization to Action Games
7.6 Adding Cognitive Costs
7.7 Meta-Induction in Games with Restricted Information

8 Philosophical Conclusions and Refinements
8.1 A Noncircular Solution to Hume's Problem
8.1.1 Epistemological Explication of the Optimality Argument
8.1.2 Radical Openness and Universal Learning Ability
8.1.3 Meta-Induction and Fundamental Disagreement
8.1.4 Fundamentalistic Strategies and the Freedom to Learn
8.1.5 A Posteriori Justification of Object-Induction
8.1.6 Bayesian Interpretation of the Optimality Argument
8.1.7 From Optimal Predictions to Rational (Degrees of) Belief
8.2 Conditionalized Meta-Induction
8.3 From Optimality to Dominance
8.3.1 Restricted Dominance Results
8.3.2 Discriminating between Inductive and Noninductive Prediction Methods
8.3.3 Bayesian Interpretation of Dominance

9 Defense against Objections
9.1 Meta-Induction and the No Free Lunch Theorem
9.1.1 The Long-Run Perspective
9.1.2 The Short-Run Perspective
9.2 The Problem of Infinitely Many Prediction Methods
9.2.1 Infinitely Many Methods and Failure of Access-Optimality
9.2.2 Restricted Optimality Results for Infinitely Many Methods
9.2.3 Defense of the Cognitive Finiteness Assumption
9.2.4 The Problem of Selecting the Candidate Set
9.2.5 Goodman's Problem at the Level of Prediction Methods

10 Interdisciplinary Applications
10.1 Meta-Induction and Ecological Rationality: Application to Cognitive Science
10.2 Meta-Induction and Spread of Knowledge: Application to Social Epistemology
10.2.1 Prediction Games in Epistemic Networks
10.2.2 Local Meta-Induction and Spread of Reliable Information
10.2.3 Imitation without Success Information: Consensus Formation without Spread of Knowledge
10.2.4 Conclusion
10.3 Meta-Induction, Cooperation, and Game Theory: Application to Cultural Evolution

11 Conclusion and Outlook: Optimality Justifications as a Philosophical Program
11.1 Optimality Justifications as a Means of Stopping the Justificational Regress
11.2 Generalizing Optimality Justifications
11.2.1 The Problem of the Basis: Introspective Beliefs
11.2.2 The Choice of the Logic
11.2.3 The Choice of a Conceptual System
11.2.4 The Choice of a Theory
11.2.5 The Justification of Abductive Inference
11.3 New Foundations for Foundation-Oriented Epistemology

12 Appendix: Proof of Formal Results

Formal Symbols and Abbreviations
Memos, Definitions, Propositions, Theorems, Figures, and Tables
References
Subject Index
Author Index

Preface

This book presents the results of my research on a new approach to "Hume's problem"—the problem of justifying induction. My approach is characterized by three features.

1. It concedes the force of Hume's skeptical arguments against the possibility of a noncircular justification of the reliability of induction. What it demonstrates is that one can nevertheless give a noncircular justification of the optimality of induction.

2. Reichenbach's "best alternative account" failed because an optimality justification cannot be given at the level of object-induction (induction applied at the level of events). However, it can be given at the level of meta-induction: the application of induction to competing prediction methods. Based on discoveries in computational learning theory it is shown that a strategy called attractivity-based meta-induction is predictively optimal in all possible worlds among all prediction methods that are accessible to the epistemic agent. I consider this to be the major achievement of this book. It provides us with a noncircular a priori justification of meta-induction.

3. The a priori justification of meta-induction generates a noncircular a posteriori justification of object-induction, because in our world inductive prediction methods have been observed as being more successful in the past than noninductive methods, whence it is meta-inductively justified to favor object-inductive strategies in the future.

Besides its importance for epistemology, meta-inductive learning has many applications in neighboring disciplines, including forecasting sciences, cognitive science, social epistemology, and generalized evolution theory. Thus, a distinctive feature of this book is its interdisciplinary nature. In the last chapter, the method of optimality-based justification is generalized into a new epistemological strategy that can resolve skeptical doubts against foundation-oriented epistemologies.

Some of this book's results are based on previously published articles. Chapter 3, on epistemology, uses materials from papers that appeared in Acta Analytica (Schurz 2008c), Grazer Philosophische Studien (Schurz 2009a), and Frontiers in Psychology (Schurz 2014b). Chapter 4 includes English translations of recent results published in my German book on probability theory (Schurz 2015b). Some portions of chapters 5–9, which constitute the core of this book, are based on two papers published in Philosophy of Science (Schurz 2008a and 2017a). Section 10.1, on applications in cognitive science, presents some of the recent findings published in Minds and Machines (Schurz and Thorn 2016). Section 10.2, on applications in social epistemology, draws on two papers that appeared in Episteme (Schurz 2009b and 2012). Section 11.2 draws on papers that appeared in Synthese (Schurz 2008a) and in the British Journal for Philosophy of Science (Schurz 2009d).

Numberings follow the pattern "item chapter.number-within-chapter." For example, "figure 2.2" is the second figure in chapter 2. Likewise for definitions, theorems, and propositions; for instance, definition 4.3 is the third definition in chapter 4. Further important results (memos) are numbered using round brackets, so "(3.4)" is the fourth memo of chapter 3. A list of all formal symbols, memos, definitions, propositions, theorems, figures, and tables is given at the end of the book. The proofs of all propositions and theorems (except very simple ones) are compiled in the appendix.

Most of the computer simulations presented in this book were programmed by Paul Thorn; some of them by Eckhart Arnold. For valuable help concerning matters of content I am indebted to Eckhart Arnold, Peter Brössel, Igor Douven, Christian Feldbacher, Alvin Goldman, Peter Grünwald, Ralph Hertwig, Franz Huber, Simon Huttegger, Marc Jekel, Konstantinos Katsikopoulos, Kevin Kelly, Gernot Kleiter, Hannes Leitgeb, Laura Martignon, Alan Musgrave, Erik Olsson, Ronald Ortner, Arthur P. Pedersen, Jeanne Peijnenburg, Stathis Psillos, Nicholas Rescher, Jan-Willem Romeijn, Brian Skyrms, Wolfgang Spohn, Tom Shogenji, Tom Sterkenburg, Paul Thorn, Markus Werning, Ioannis Votsis, Greg Wheeler, and Jon Williamson.

I hope that this book entertains its readers and lets them profit intellectually.

Gerhard Schurz
Düsseldorf, May 2018

1 The Problem of Induction

1.1 The Notion of Induction: Conceptual Clarifications

We begin this book with a clarification of the notion of an "inductive inference." This notion is used in the literature with two different meanings. Some authors use the notion of induction in a wide sense: they equate inductive inferences with any kind of nondeductive inference (e.g., Bird 1998, 13; Earman 1992; Pollock 1986, 42). Hence, they also classify abductive inferences—or inferences to the best explanation (IBE)—as inductive. In this wide understanding of "induction," the crucial difference between deductive and inductive inferences lies in the fact that only deductive inferences preserve the truth from the premises to the conclusion with certainty—that is, in all possible "worlds" (or circumstances). By contrast, all inductive or abductive inferences are uncertain: they transfer the truth from the premises to the conclusion not in all worlds, but only in those worlds that are sufficiently regular or uniform. For example, the deductive inference "All swans are white / Therefore all swans in this lake are white" is certain and preserves truth in all possible worlds, whereas the inductive inference "All swans observed so far are white // Therefore all swans are white" is uncertain and preserves truth only in sufficiently uniform worlds (the double slash // indicates this uncertainty). Because it is possible that the conclusions of inductive inferences in the wide sense are false even if their premises are true, these inferences are said to be content expanding (or ampliative), in contrast to deductive inferences, which are content preserving.

This wide notion of inductive inference is vague in several respects (as we will see later), but there is also the notion of an inductive inference in the narrow sense. This notion refers (primarily) to inferences in which a regularity that has been observed so far—for example, "All observed Fs have been Gs"—is transferred either to a new future instance (inductive prediction) or to the entire future or the entire domain of individuals in space-time (inductive generalization). This narrow notion of induction is also called the "Humean" sense of induction, as David Hume's skeptical doubts (see the next section) were primarily concerned with induction in the narrow sense. The two simplest forms of these inferences are the following.

1. Inductive prediction: r% of all so far observed Fs have been Gs. Therefore, with a (subjective) probability of approximately r%, the next F will be a G—and thus will be predicted to be a G, provided r is greater than 1/2 and F is the total evidence regarding the next observed individual.

2. Inductive generalization: r% of all so far observed Fs have been Gs. Therefore, with high (subjective) probability, approximately r% of all Fs are Gs.1

Both inferences are formulated probabilistically; the special case in which r equals 100% gives us their strict version. Depending on the interpretation of the phrases "approximately" and "high," one obtains different, more or less refined versions of these two inferences, which are discussed in chapter 4. The formulation of 1 after the dash in terms of a prediction rule is also called the maximum rule (see section 5.9; a toy sketch follows the footnote below). Note that an inductive "prediction" inference may but need not be understood in the temporally forward-directed sense. In temporal retrodictions, such as in the historical sciences, one inductively infers from observations to unobserved instances that lie in the past.

From now on, by "induction" we will always mean induction in the narrow sense. The notion of "induction in the wide sense" includes too many heterogeneous kinds of inferences to yield a reasonable notion of uncertain inference, for which the justification problem can be precisely stated. Therefore, we will avoid speaking about induction in the wide sense and instead speak of specific forms of abductive inferences, also known as inferences to the best explanation (IBEs), when we address them. According to Schurz (2008b, 2017b), the notion of abduction comprises different kinds of inferences that have quite different epistemological properties. They nevertheless have the following schema in common, going back to C. S. Peirce (1903, §189).

1. A third kind of inductive inference is inductive specialization (IS): r% of all Fs are Gs; therefore (with high subjective probability), approximately r% of a given sample of Fs are Gs. An IS (also called a "direct inference" by Levi 1977) is only reliable if the sample is representative of the domain. In contrast to the other two inductive inferences, the strict version of an IS is deductively valid.
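To make the maximum rule concrete, here is a minimal sketch in Python. This is our own illustration, not the book's formal definition (given in section 5.9); the 0/1 encoding of observations and the output format are invented for the example.

```python
def maximum_rule(observations):
    """Inductive prediction by the maximum rule: given past observations of
    F-events coded as 1 (the F was a G) or 0 (it was not), predict that the
    next F will be a G iff the observed frequency r exceeds 1/2."""
    r = sum(observations) / len(observations)  # observed relative frequency
    return ("G", r) if r > 0.5 else ("not-G", 1 - r)

# Example: 8 of 10 observed Fs were Gs, so the next F is predicted to be a G
# with a subjective probability of approximately 0.8.
print(maximum_rule([1, 1, 1, 0, 1, 1, 0, 1, 1, 1]))
```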


(1.1) General schema of abduction (or inference to the best explanation, IBE)

Premise 1: A (singular or general) fact E, in need of explanation.
Premise 2: Background knowledge K, which implies for some hypothesis H that H is a sufficiently plausible explanation for E.
Abductive conjecture: H is true.

The trustworthiness of an abductively acquired hypothesis is highly tentative: the abducted hypothesis needs further empirical testing to acquire the character of a probable hypothesis (Peirce 1903, §171). More importantly, the given background knowledge usually suggests several possible hypotheses to potentially explain the given evidence E, and the abductive inference selects the most plausible of them. For this reason Harman (1965) transformed Peirce's general schema of abduction into the schema of the IBE (Lipton 1991).

The focus of this book is the problem of justifying induction (in the narrow sense), which is also called Hume's problem. Abductive inferences are treated only marginally—in section 2.5 as an attempt to justify induction, and in section 11.2.5 where we generalize the method of optimality justifications as the key for solving the problems of a foundation-oriented epistemology. Thus, we suggest a separation of the epistemological problems of justifying induction from those of justifying abduction. The major reason for this division of labor is spelled out in sections 2.5 and 2.6: although the skeptical objections against the possibility of justifying induction apply equally to the possibility of justifying abduction, the latter problem involves additional difficulties that make its solution even more troublesome than Hume's problem. For example, inductive inferences can never introduce new concepts into the conclusion: every predicate of their conclusion occurs also in their premises; only the time points—or, more generally, the individual constants of the conclusion—are new compared to those in the premises. Therefore, if the premise of an inductive inference expresses an empirical fact (i.e., is formulated in terms of observation predicates), then its conclusion also expresses an empirical fact. By contrast, certain abductive inferences infer a theoretical explanatory model from a given empirical regularity: these inferences contain in their conclusion new theoretical (nonempirical) concepts and a corresponding existence assumption about unobserved or even unobservable (hidden) parameters—such as, for example, "gravitational force" in physics. The justification of these kinds of abductive inferences involves problems that go beyond Hume's problem, in particular the problem of inferring the approximate truth of a theoretical model from its empirical adequacy. On the other hand, we shall see in section 2.5 that a satisfactory justification of abductive inferences presupposes that one can have a noncircular justification of inductive generalizations. So the problem of justifying induction is epistemologically more fundamental than that of justifying abduction.

We end this section with a list of major conventions concerning formal symbols.

• Nonlogical symbols of first-order logic: F, G, …, R, Q, … for predicates; a, b, … for individual constants; x, y, … for individual variables; f, g, … for function symbols; A, B, … for arbitrary sentences.

• Logical symbols: ¬ (negation), ∧ (conjunction), ∨ (disjunction), → (material implication), ↔ (material equivalence), ∃ (existential quantifier), ∀ (universal quantifier), = (identity), and ⊨ (logical inference). As usual, "iff" abbreviates "if and only if."

• Set-theoretic symbols: ∈ (element), ∩ (intersection), ∪ (union), − (complement); f: X → Y, a function from X to Y; and Pow(−), powerset.

• Mathematical symbols: ∑, consecutive sum; ∏, consecutive product; X, a "variable" in the mathematical sense—that is, a function X: D → Val(X) from a domain D into a value space Val(X) with values x ∈ Val(X).

Further symbols will be explained in the text. In several chapters we will make use of probabilities. Throughout the book, the lowercase symbol "p(Fx)" stands for the statistical probability of a type of (repeatable) event or state of affairs that is linguistically expressed by the open formula Fx (e.g., that it is a rainy day in Munich, or that a given coin lands on heads after being tossed; in both cases "x" ranges over time points). The statistical probability of Fx events is understood as their limiting relative frequency in a random sequence of outcomes of a repeated random experiment (or physical process). In contrast, the uppercase symbol P(Fa) denotes the subjective or epistemic probability of a particular event or state of affairs, linguistically expressed by a closed formula Fa (e.g., that today is a rainy day in Munich, or that a given coin lands heads in this throw). The subjective probability P(Fa) is the rational degree of belief of a given subject (or class of subjects) in the occurrence of the event Fa. Conditional probabilities are understood as usual: P(A|B) = P(A∧B)/P(B), provided P(B) > 0 (likewise for p instead of P). More details on probabilities follow in chapter 4.
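Both notions admit a small computational illustration. The following sketch is our own (the fair-coin model and the numbers 0.2 and 0.5 are invented): the statistical probability p is approximated by relative frequencies over longer and longer initial segments of a random sequence, and a conditional probability is computed by the ratio formula just given.

```python
import random

random.seed(0)
# A random 0/1 sequence modeling repeated tosses of a fair coin;
# p(heads) is the limiting relative frequency of 1s.
tosses = [random.randint(0, 1) for _ in range(100_000)]
for n in (10, 1_000, 100_000):
    print(n, sum(tosses[:n]) / n)  # finite frequencies approaching the limit 1/2

def conditional(p_joint, p_b):
    """P(A|B) = P(A∧B)/P(B), defined only if P(B) > 0."""
    if p_b <= 0:
        raise ValueError("P(B) must be positive")
    return p_joint / p_b

print(conditional(0.2, 0.5))  # e.g., P(A∧B) = 0.2 and P(B) = 0.5 give P(A|B) = 0.4
```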


1.2 David Hume and the Problem of Justifying Induction

Philosophers in the era of enlightenment intended to liberate rational thought from religious authority and dogmatic prejudice of any sort. These philosophers sought to build their systems of rational belief solely on the human capacity of reason—on invincible evidence and rational argument (see chapter 3). David Hume (1711–1776) is famous for his fundamental skeptical challenges toward this respectable philosophical enterprise. In (1748, chapters 4 and 6), as well as in his earlier work (1739), Hume demonstrated that all the standard methods of justification seem to fail when confronted with the task of justifying the reliability of induction. His starting question was, What entitles us to think that there is a lawlike or even necessary connection between two kinds of events, F and G, such as that lightning is followed by thunder or a billiard ball colliding with another ball is followed by the second ball's movement?

Before Hume, the standard philosophical answer to this question was coined in terms of causality: We know that F events are regularly followed by G events because we know that the former are causing the latter. The first part of Hume's skeptical challenge consisted of his rejection of the justifiability of induction in terms of causality. He argued that although events are ordered by their spatial and temporal location, there is nothing in human sense data that would correspond to those events being related as cause and effect ([1748] 2006, §4, part 2, 7). When we watch a billiard game and think to "have seen" that the first billiard ball's movement has caused the movement of the second, all that we really have observed is that a certain movement of the first billiard ball has been followed by a movement of the second. That this is all that we can observe already follows from the fact that when we are watching a film of the two moving balls, our sense impressions are exactly identical with our impressions from observing the balls directly, although nobody would say that the projected image of the first ball's movement is the cause of the projected image of the second ball's movement.

Nor does it hold a priori, by logical reason, that events are related as cause and effects and that every event has been caused by a set of preceding events (Hume [1748] 2006, §4, part 2). When one billiard ball approaches a second ball and comes to rest when it touches the second ball, which in turn moves away in a straight line, nothing in this series of observations implies that the first ball caused the second ball's movement. For example, the billiard ball could also be moved by magnets hidden under the table.


Moreover, no law of logic excludes the possibility that billiard balls are intentional agents that move on their own free will.2 In conclusion, it is not possible to justify induction by the assumption of causality because this assumption is itself prima facie unjustified. If there is any hope of justifying the existence of causal connections at all, then it is by way of an inductive inference. Thus, we need a justification of induction that is independent of considerations of causality.

2. Historically, Hume was not the first to challenge the principle of causality. Before Hume, the occasionalists (such as Nicolas Malebranche) argued that regular successions of events cannot be explained in terms of their causal connection but rather as the result of God's universal spirit holding everything together. Moreover, theologians from all ages argued against deterministic causality because it would imply a limitation of God's omnipotence and his capacity to perform miracles. However, none of these philosophers doubted the principle of sufficient reason, according to which everything must have a sufficient reason; they located the ultimate reasons not in the physical realm but in the Godly spirit. Presumably Hume was the first philosopher who fundamentally challenged the rational justifiability of the principle of sufficient reason as the basis of inductive inference.


This realization led Hume to the second and major part of his skeptical challenge: his argument against the rational justifiability of inductive inferences from observed regularities to unobserved or future instances. Hume's four main skeptical arguments ([1748] 2006, chapters 4 and 6) can be summarized as follows:

1. Obviously, inductive inferences cannot be directly justified by observation, because the conclusions of inductive inferences are propositions about unobserved events.

2. Likewise, it is obvious that inductive inferences cannot be justified by deductive logic, for it is logically possible that starting tomorrow our world will behave completely differently than it has so far. Therefore, no inference from propositions about observed events to propositions about unobserved ones can be logically or analytically valid.

3. Induction cannot be justified by the standard method of empirical science—by induction from observation. This is the most important point in Hume's skeptical reasoning. To argue that the inductive method will be successful in the future because it has been successful in past applications would mean justifying induction by induction—which is a circularity. Because circular arguments already presuppose what they purport to justify, they are without any justificatory value (see section 2.4).

4. Of course, inductive inferences are not strict entailments. They do not always lead from true premises to true conclusions. But their justification should be able to show that they are reliable in the sense that they are truth-preserving with high probability, or in a high majority of cases. However, Hume already argued ([1748] 2006, §6) that this probabilistic reformulation—contrary to what some philosophers have proposed—does not help. For in order to justify that an inductive inference of the form "Most observed Fs have been Gs, and therefore the next F will be a G" is truth-preserving in most future cases, we must presuppose that the relative frequencies of past events can be transferred to future events. This, however, is nothing but a probabilistic version of the inductive generalization rule.

These are the reasons that led Hume to the skeptical conclusion that there is no possible rational epistemic justification of induction and that it is merely the result of psychological habit ([1748] 2006, part 5).

Let us be clear about how harsh Hume's skeptical challenge really is. Hume did not only say that we cannot prove that induction is successful or reliable; he argued that induction is not capable of any rational justification whatsoever. All sorts of human prejudice and superstition, from rain dancing to burning witches, are based on "psychological habit." However, if there is no substantial difference between the irrational practices of men of the Stone Age and the inductive methods of modern science, then the search for intersubjective standards of epistemic rationality as a guide toward objective truth—the enterprise of enlightenment rationality—fails completely. Along these lines, Russell (1946, 699) once remarked that if Hume's problem cannot be solved, "there is no intellectual difference between sanity and insanity."

Keep in mind that Hume did not say that there is no justification of induction in terms of nonepistemic goals—for example, because our belief in the reliability of induction makes us feel better about the future. Hume argued that there is no epistemic justification, meaning a system of arguments showing that inductive methods are useful or the right means for the purpose of acquiring true and avoiding false beliefs. When speaking of the "justification" of induction in this book we will always mean epistemic justification.

There have been several attempts in the analytic philosophy of the twentieth century to find possible or improved ways of justifying induction. None of these attempts have been successful. A variety of such proposals and the reasons for their failure will be presented in chapters 2 and 4. However, these chapters will not be entirely negative, as they will provide important insights about which directions are dead ends and which directions might yield a solution to the problem.


A final note on Goodman's paradox: One often reads that besides Hume's "first riddle" of induction, Goodman (1946) raised a "second riddle" of induction that is at least as hard to solve as Hume's riddle. Goodman has shown that if one applies inductive rules to certain defined predicates, one may end up in contradictions. Goodman's riddle will be treated in section 4.2, where we will see that it has two parts. One part has to do with the problem of induction; this part turns out to be just a variant of Hume's problem. The other part of Goodman's riddle has to do with language dependence: a solution to this problem requires preference criteria for the choice of the primitive predicates of one's language. To avoid Goodman's paradox we will always assume a given set of primitive predicates that are supposed to designate qualitative properties (or relations). Inductive inferences may only be applied to logical combinations of these primitive predicates that are free from individual constants. Under these assumptions, Goodman's problem can provably not arise (see section 4.2).
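To see concretely how a defined predicate can turn one body of evidence into conflicting predictions, consider the following toy formalization of Goodman's "grue." It is our own construction for illustration; the predicate name, the color values, and the switch time T0 are invented, and the full treatment is in section 4.2.

```python
T0 = 100  # the switch time built into the defined predicate (hypothetical)

def grue(color, t):
    """An emerald observed at time t counts as grue iff it is green and
    t < T0, or blue and t >= T0."""
    return (color == "green" and t < T0) or (color == "blue" and t >= T0)

# All emeralds observed before T0 are green—and hence also grue.
evidence = [("green", t) for t in range(T0)]
assert all(color == "green" for color, t in evidence)
assert all(grue(color, t) for color, t in evidence)

# Inductive generalization over "green" predicts that the emerald observed
# at time T0 is green; the same rule applied to "grue" predicts that it is
# grue, i.e., blue—two incompatible predictions from the same evidence.
```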


1.3 Plan of the Book

In chapters 1 through 5 our account is embedded into the state of the art of the philosophical debate on the problem of induction. All major philosophical attempts toward a solution of Hume's problem and their shortcomings are thoroughly discussed in chapter 2 (on informal accounts) and chapter 4 (on probabilistic accounts). Chapter 3 explains the significance of the problem of induction in foundation-oriented accounts of epistemology as well as in alternative accounts that attempt to circumvent the problem, such as externalism and coherentism. Based on the insights achieved in chapters 1 through 4, chapter 5 concludes that the only approach to Hume's problem that has a chance of success must be one in line with Reichenbach's "best alternative account." This account does not attempt to show that induction is guaranteed to be reliable, but rather that it is guaranteed to be optimal—that is, it is the best that can be done with regard to the goal of making successful predictions. However, results in formal learning theory show that Reichenbach's account fails for ordinary object-induction (induction applied at the level of events). The research program of meta-induction starts from the idea that the best alternative account has to be applied at the meta-level of competing prediction methods. The results developed in this book are new in three respects.

1. The strategy of meta-induction applies the principle of induction to the success rates of all accessible prediction methods and predicts a combination of the predictions of those methods that were most successful in the past (a toy sketch of such a strategy is given at the end of this section). After introducing the formal framework of prediction games and the most important variants of meta-induction it is demonstrated that there is a meta-inductive prediction strategy—called attractivity-based meta-induction—whose predictive success is provably long-run optimal among all accessible prediction methods in strictly all possible worlds, including worlds without converging frequencies or worlds hosting adversarial demons. This provides an a priori justification of meta-induction that is noncircular—that is, it does not rely on inductive assumptions. Moreover, the justification of meta-induction generates an a posteriori justification of object-induction as follows. We know by experience that in our world inductive prediction methods have been more successful in the past than noninductive methods, whence it is meta-inductively justified to favor object-inductive strategies in the future. This argument is no longer circular because a noncircular justification of meta-induction has been independently established. Although the problem of induction is a time-honored problem of philosophy, this book presents a variety of new philosophical insights on this problem. To a significant extent, these insights arose from new discoveries in the fields of mathematical learning theory and machine learning, including fundamental theorems on regret-based learning and on the relation between probability and algorithmic complexity. The content and consequences of these discoveries are still largely unknown in contemporary philosophy and cognitive science. The present book introduces these discoveries to the community of philosophers and cognitive scientists, develops them further, combines them with insights from other disciplines, and explains their far-reaching consequences, both in informal words and on the level of formal results. Chapters 5 through 7 present a variety of theorems (some known and many new), accompanied by a description of computer simulations illustrating the content of these theorems; all proofs are placed in a mathematical appendix. Starting from basic results, increasingly powerful strategies of meta-induction are presented, including its application to Bayesian predictors, to arbitrary action games, and to games with unboundedly growing sets of methods.


2. In the remaining chapters (8 to 10) core insights regarding the optimality of meta-induction are defended, refined, and applied. The noncircular structure and philosophical content of the a priori argument for meta-induction and the a posteriori argument for object-induction are explicated in chapter 8. Moreover, several important results about the dominance of meta-induction over noninductive methods are established in that chapter. Besides its fundamental importance for epistemology, meta-inductive learning has important applications in neighboring disciplines: section 9.1 explains how the optimality and dominance of meta-induction can provide a solution to the famous no free lunch theorem; section 9.2 elucidates how meta-induction is applicable to infinite sets of methods and to Goodman's infamous problem; section 10.1 describes the impact of meta-induction for research on ecological rationality in cognitive science and presents the results of an empirical study; section 10.2 applies meta-induction to social epistemology, including computer simulations of the meta-inductive spread of knowledge in social networks; and finally in section 10.3 meta-inductive learning is investigated from the perspective of evolutionary game theory. Because of this wealth of applications, a distinctive feature of the present book is its interdisciplinary nature: while dealing with core themes in epistemology and philosophy of science, the contents of this book are of equal significance for forecasting sciences, mathematical learning theory and machine learning, cognitive psychology, and sciences of generalized evolution.

3. In the last chapter, the method of optimality-based justification is detached from the problem of induction and generalized to a new strategy of justification in epistemology. It is argued that optimality justifications can avoid the problems of justificatory circularity and regress, given that they are applied at the level of meta-methods that can learn from other methods. The generalization of optimality justifications to other domains is briefly sketched by reference to two further epistemological problems: the justification of a system of logic (section 11.2.2) and the justification of abductive inference (section 11.2.5). It is concluded that optimality justifications provide new foundations for foundation-oriented epistemology and philosophy of science that can resolve skeptical doubts without resorting to positions of resignation.
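As a foretaste of chapters 5 and 6, here is the toy sketch of a weighted meta-inductive forecaster promised in item 1 above. The exponential weighting shown is the standard regret-based update from the machine-learning literature the book draws on; the toy methods, the squared-error loss, and the parameter eta are our own illustrative assumptions, not the book's formal definitions of attractivity-based meta-induction.

```python
import math

def aw_meta_induction(method_predictions, outcomes, eta=0.5):
    """Predict a weighted average of the accessible methods' predictions,
    with weights that grow exponentially in each method's past success
    (equivalently, shrink in its accumulated loss)."""
    n = len(method_predictions)        # number of accessible methods
    losses = [0.0] * n                 # accumulated loss per method
    meta_predictions = []
    for t, outcome in enumerate(outcomes):
        weights = [math.exp(-eta * loss) for loss in losses]
        total = sum(weights)
        preds = [m[t] for m in method_predictions]
        meta_predictions.append(sum(w * p for w, p in zip(weights, preds)) / total)
        # Update each method's accumulated loss by its squared prediction error.
        for i, p in enumerate(preds):
            losses[i] += (p - outcome) ** 2
    return meta_predictions

# Toy prediction game: a constant-0.9 world, one good and one bad method.
outcomes = [0.9] * 20
good = [0.9] * 20
bad = [0.1] * 20
print(aw_meta_induction([good, bad], outcomes)[-1])  # converges toward 0.9
```

The design point the sketch illustrates: the meta-inductivist never commits to a single favorite but shifts weight toward whichever methods have been successful, which is what yields the worst-case (regret) guarantees discussed in chapter 6.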

2 On Failed Attempts to Solve the Problem of Induction

2.1 Can Induction Be Avoided?

One possible strategy to circumvent Hume's problem of induction is to argue that rational thought does not really need inductive arguments. That was the strategy of the founder of critical rationalism, Karl Popper. He advocated the thesis that Hume's problem of induction is, on the one hand, unsolvable, but that on the other hand empirical science can proceed without inductive inferences altogether ([1935] 2002, section I). However, one can distinguish different notions of "induction" in the philosophy of Karl Popper. In this subsection we want to point out that although Popper's objections against certain understandings of induction brought to light important insights, his more radical claim that empirical science could go without induction is not tenable.

Popper criticized the view of methodological induction (see Schurz 2013, section 2.7.2). This view understands induction as a method of discovering or generating general laws and theories from particular observations by means of inductive generalization procedures. Popper argued that the belief that science needs such a discovery method rests on a confusion of the context of discovery and the context of justification ([1935] 2002, section I.1–2; 1983, 118). How scientific hypotheses are generated, be it through induction, intuition, or trial and error, is completely irrelevant to the context of justification. What is important is the justification of hypotheses. However, according to Popper the justification of a hypothesis proceeds in a purely deductive way: to test a theoretical hypothesis one has to derive empirically testable consequences from it by means of deductive logic. One then compares these derived observation statements with one's actual observations. If there is a contradiction between them, the tested hypothesis (law or theory) is falsified. If, on the other hand, they are in agreement, the prediction was successful and the hypothesis is corroborated.


Several aspects of Popper's account of theory testing can be criticized. First, only for strict but not for statistical hypotheses is it possible to derive observable consequences by means of deductive logic.1 Second, Popper is not right that all methods of discovering hypotheses are irrelevant to questions of justification. One counterexample is the inductive generalization inference explained in section 1.1: it gives us a method of discovering a statistical hypothesis that at the same time inductively justifies this hypothesis. On the other side, Popper's criticism of methodological induction is correct for the reasons pointed out in section 1.1: inductive methods for extracting general hypotheses from observations exist only for empirical but not for theoretical hypotheses, because the latter contain nonobservable (theoretical) concepts.

All these aspects are minor compared to the following fundamental challenge. Even if we grant Popper that discovery procedures are separated from measures of justification and that the derivation of observable consequences proceeds in a deductive way, his claim that the scientific test procedure can proceed entirely without inductive inferences is untenable. The idea of Popper's notion of "corroboration" is, of course, that we should base our future predictions and actions on those theories that up to now have been most successful—that is, theories that so far have been corroborated best (Popper 1983, 65; 1979, section I.9). It follows from this that Popper's "deductivistic" program of corroboration, too, contains in its core a fundamental inductive step, which Musgrave (2002) called epistemic induction. The principle of epistemic induction says the following: if theory T1 has been more successful (explanatorily and prognostically) than theory T2 so far, then it is reasonable to assume, relative to the given state of evidence, that T1 will also be more successful than T2 in the future. In other words, the success preferences established so far are projected inductively into the future.

Epistemic induction is a special case of what we call "meta-induction" because it does not inductively infer object-level hypotheses about ordinary events but meta-hypotheses about the confirmational success of object-level hypotheses. Meta-induction (in the sense of this book) applies this principle at the level of arbitrary methods of prediction or action.

1. To overcome this problem, Popperians have suggested regarding extremely improbable observations as a "falsification" of the respective statistical theory (Gillies 2000, 148ff.; Popper [1935] 2002, chap. II.68). Howson and Urbach (1996, 174) argue convincingly that this view is untenable.


The epistemic induction principle is indispensable for all empirical disciplines. Were this principle not to be accepted, our successes to date would simply be irrelevant to our future actions. Although, for example, the theory that heavy bodies on Earth are disposed to fall to the ground and not to float freely has been better corroborated so far than the opposite theory, this would not be a reason to make this theory the basis of our future actions, because logically all bodies might just as well start floating as of tomorrow. In other words, there would not be any point in Popper's method of testing if inductive inference were not accepted at least at the meta-level. Although Popper was averse to all forms of induction throughout his life (on pain of inconsistency), critical rationalists such as Watkins (1984, 340ff.) and Musgrave (2002) have accepted the epistemic induction principle. More radical critical rationalists such as Miller (1994, chap. 5) dismiss even the epistemic induction principle and reduce the justified core of critical rationalism to the rejection of falsified theories. But we should note that even the maxim that a falsified temporally universal generalization should not be applied to future instances contains an inductive principle—why else would the theory's falsity in the past diminish our trust in its future performance? To reject also the future-referring part of a temporally universal theory because its past-referring part has been falsified makes sense only if one believes that the future resembles the past.

In conclusion, the method of induction is indispensable, at least at the meta-level, in commonsense reasoning as well as in science. What we need is a positive solution to Hume's problem.

2.2 Is Induction Rational "by Definition"? Rationality and Cognitive Success

Ayer, Strawson, and other proponents of the so-called analytic account have argued that induction is simply part of the meaning of the word "rational" (Ayer 1956, 74ff.; Pollock 1974, 204; Strawson [1952] 1963, 257). In other words, they claim that induction is rational "by definition" or "semantic stipulation." Related are commonsense "solutions" that claim that common sense simply "knows" (without further reasons) that induction is rational (Edwards 1974, 29ff.; Reid [1764] 1997). However, if we were to believe everything that common sense regards as rational we should probably better stop philosophizing. On the other hand, if we want to have philosophical arguments to demarcate the rational from the irrational beliefs of common sense, we must come up with an independent argument for the rationality of induction and, thus, are back to Hume's problem. Under the pressure of this objection, commonsense philosophers often use Strawson's analytic account as a defense of their position (see Pollock 1986, 15).

Before explaining our major objections against the analytic account of induction, we have to make clear what we mean by a (proper or positive) justification of induction. Generally speaking, such a justification must establish that induction is instrumental for the goal of truth. The opposite position is the intuition-based conception of rationality, for which coherence with one's subjective intuitions is sufficient for justification. A contemporary example of the latter position is Cohen (1981). For Cohen, rules of logical reasoning such as modus ponens or modus tollens are based on intuitions about correct reasoning, which have to form a "reflective equilibrium" in the sense of Goodman (1955) and Rawls (1971). From his position Cohen inferred that it is impossible to demonstrate the irrationality of human reasoning by empirical means. Were psychologists to find out that ordinary people's reasoning deviates from the rules of logic, it would merely show that ordinary people's intuitions about correct reasoning differ from those of the logicians. However, psychologists have repeatedly demonstrated how error-prone human intuitions can be (Kahneman, Slovic, and Tversky 1982; Piattelli-Palmarini 1996). Moreover, people's intuitions are strongly subjective. Thus the intuition-based conception of rationality unavoidably leads into a form of cognitive relativism, which has been elaborated particularly by Stich (1990). For example, religious people would consider different rules of reasoning as intuitively rational than nonreligious people would.

For this reason, we do not ground our epistemological account on an intuition-based conception of rationality. Instead, we base the notion of rationality on the objectively testable goal of cognitive success, which we characterize as the goal of finding many (possibly relevant) truths and avoiding errors with reasonable cognitive effort (Schurz 2011b, 2014b). What the goal of cognitive success adds to the account of rationality as truth-conduciveness or reliability is the dimension of cognitive cost. In epistemology, the success-based account of epistemic rationality is also called goal externalism, because truth and reality are external to the mind (see section 3.4). The opposite intuition-based account is called goal internalism. Let me emphasize, however, that in order to take the most general perspective, we understand the notion of truth in a correspondence-theoretic but metaphysically neutral sense (see Kirkham 1992, section 6.5). This means that the truth of a statement is understood in terms of its correspondence with a certain "reality" or domain of facts, but this reality need not necessarily be absolutely external to the subject. It could also be the reality of subjective experiences, consisting of introspective reports (see section 3.2).


Following Dancy (1985, 136ff.), major epistemological positions can be classified according to the domain of entities to which true statements correspond. This domain may consist of ­either • ​one’s • ​the

own subjective experiences (subjective idealism or solipsism),

experiences of all subjects (intersubjective idealism), or

• ​a subject-­independent external real­ity that ­causes the experiences (realism).

The impor­tant point to observe ­here is that the correspondence-­theoretic concept of truth is coherent with all ­these positions. As long as the existence of an external real­ity is epistemologically accepted, the notion of truth refers to the domain of external facts, and the success-­based account of rationality coincides with goal externalism in the proper sense. However, for certain epistemological purposes, a nonrealist “relaxation” of the notion of truth is needed. An impor­tant example is the epistemological justification of the abductive inference from the regularities within one’s introspective experiences to the existence of an external real­ity as their cause (see sections 3.2 and 11.2.5); to avoid circularity, the premises of this justification must not presuppose an externalist notion of truth. ­After ­these preliminaries we turn to our major critique of the analytic account of induction. Obviously it is impossible to guarantee the cognitive success of a prediction or inference method by a mere definition of rationality. What we need is a system of arguments that establish that the predictions inferred by the inductive method are truth conducive, at least in some weak sense of favoring the truth of t­hese predictions in comparison to alternative predictions delivered by noninductive methods. In order to count as a justification, this system of arguments must be noncircular—it must not presuppose induction in its premises. Moreover, it must not make any philosophically doubtful assumptions; rather, it should rest only on directly evident beliefs given by deductive logic or by sense experience. In section 3.2 we call this position a foundation-­oriented (instead of foundationalist) epistemology. Strawson ([1952] 1963, 249) develops a more delicate argument to justify his claim that inductions should be regarded as “rational by definition.” He compares inductive with deductive inferences and argues that the situation between the two is quite similar. If we want to justify a deductive inference such as modus ponens, then a justification circle is unavoidable b ­ ecause the very possibility of rational reasoning rests on the use of logical rules such as modus ponens. Or, put differently, we cannot envisage a pos­si­ble situation in which modus ponens fails. A brief examination of Strawson’s argument for the domain of deductive logic is given in section 11.2.2. It ­will be argued ­there that certain but not

16

Chapter 2

all attempts to justify classical logic by using classical logic at the meta-­level involve a circle or infinite regress. Be that as it may, Strawson’s argument fails simply ­because for inductive inferences the situation is entirely dif­fer­ent than for deductive inferences. We can hardly imagine pos­si­ble worlds in which deductive logic fails, but pos­si­ble worlds in which inductive inferences fail or are unreliable are easily conceivable (see Salmon 1974a)—­for example, worlds in which familiar “laws of nature” change, ­things that ­were previously falling down start to fly, and so on, just as it happens in Alice’s ­ ill be full Adventures in Wonderland. In fact, the l­ater chapters of this book w of constructions of “demonic worlds” in which inductive inference fails (and yet meta-­induction is optimal). A related argument in ­favor of a priori reasons for inductive assumptions asserts that even observation reports about the past articulated in terms of qualitative properties involve inductive assumptions. Norton (2003, e.g., 668n9) has claimed that even the singular observation sentence “This ball is red” involves the universal proposition “This ball has the same color as all balls in an infinite class of balls.” However, this argument is not convincing. Rather, the observation statement “This object is red” means that this object is perceived to have a certain memorized color quality that is similar to the color of vari­ous other objects observed in the past. This is merely a report about a pres­ent experience and its relation to past experiences, but not at all an inductive generalization. Fi­nally, we emphasize that it is not only theoretically conceivable that inductive prediction methods fail or are worse than noninductive prediction methods. Millions of ­people do in fact believe in superior noninductive prediction methods, be it based on God-­guided intuition, clairvoyance, or other purported paranormal abilities. Frequently t­hese beliefs are embedded into systems of religious or spiritualistic belief in super­natural agents or powers. Therefore, we think that a satisfying justification of the method of induction would not only be of fundamental epistemological importance but also of fundamental cultural importance as part of the enterprise of explaining and promoting scientific rationality and demonstrating its superiority over nonscientific forms of reasoning. 2.3  Can Induction Be Justified by Assumptions of Uniformity? John Stuart Mill (1865, III.3.1) argued that the reliability of induction should be justified by the metaphysical assumption of the uniformity of nature. A similar argument was given by Russell ([1912] 1959). First of all, the

On Failed Attempts to Solve the Prob­lem of Induction 17

uniformity argument has to deal with the prob­lem that it is difficult if not impossible to give a generally satisfying definition of “uniformity” (as we discuss ­later). However, given a par­tic­u­lar method of inductive inference, one can very well say which kinds of uniformity are sufficient for the reliability of this method. For example, a sufficient condition for the success of the inductive prediction (or generalization) inference (section 1.1) applied to a primitive property F in an infinite sequence of events is the following: the limiting frequency of Fs is far away from 1/2, and the finite F frequencies converge sufficiently fast ­toward this limit (see section 5.9). Even if we restrict Mill’s argument this way, the main prob­lem that the argument leads into a circle remains. ­Every attempt to justify the assumption that nature is uniform (in this or other re­spects) must involve an inference from past observations to f­ uture expectations—­and, thus, must involve an inductive inference. So the circle involved in Mill’s proposed justification is this: we justify A (the reliability of induction) with B (the uniformity of nature), and we justify B with A. So far we took it for granted that circular arguments are without epistemic value. But we should rigorously prove it. This w ­ ill be undertaken in sections 2.4 and 3.3. A “localized” version of the uniformity account has been proposed by Norton (2003). According to Norton, inductive reasoning is not governed by formal and general rules, such as “if all observed As are Bs, then prob­ably all As are Bs,” but by local material inferences such as “if some samples of bismuth melt at 271°C, then all samples of bismuth do.” Local inductions are in turn justified by local uniformity assumptions such as “Samples of the same ele­ment agree in their physical properties.” The fundamental prob­lem of Norton’s localized account is precisely the same as that of Mill’s account: domain-­specific uniformity assumptions are generalizations and must be justified by inductive inferences. Thus, Norton’s justification ends up in a circle or in an infinite regress. Norton argues that this circle or infinite regress is only harmful to a formal but not to a “material” account of induction. But this argument is not convincing and has been successfully criticized by Kelly (2010, 762) and Worrall (2010). Also, Norton’s claim that all scientific inductions are not general but local is not tenable. A closer look at Norton’s example shows that the uniformity assumptions that justify inductive inferences become more and more general. For example, the inference (I1) “Some samples of bismuth melt at 271°C; so all samples of bismuth do so” is justified by the uniformity assumption (U1) “Samples of the same ele­ment agree in their physical



properties." The inductive inference that justifies U1 has the form (I2): "So far all observed samples of the same element agreed in their physical properties; so this is generally so." Now the inductive uniformity that justifies the reliability of (I2) is already maximally general and asserts the following principle of spatiotemporal invariance (U2): "Physically identical entities differing only in their location in space and time behave in an identical way."

Let us finally turn to the problem mentioned at the beginning of this section, the problem of finding a definition of uniformity that applies to all kinds of uniformity. That this problem is presumably unsolvable was pointed out by Skyrms ([1975] 2000, 34ff.). For example, "uniformity" could mean that a given stream of events is nonrandom in the sense that there are correlations between past and future events of the sequence. However, if a sequence of events is truly random and correlations of this sort are excluded, as in the case of a sequence of results of tossing a regular coin, then this random sequence is still uniform in the weaker sense that its relative frequencies converge to a limiting frequency of one-half. Reichenbach (1949, 474), for example, understood the notion of "inductive uniformity" in this weak sense. But as we shall see in section 6.3.1, even endlessly oscillating event frequencies exhibit a certain regularity, on purely mathematical grounds: their frequencies oscillate between an inferior and a superior limit, and their oscillation period grows at least exponentially in time. Thus, it seems that a strictly general definition of "uniformity" is impossible. Consequently, our account of meta-induction will not depend on any assumed definition of uniformity.

2.4  Can Circular Justifications of Induction Have Epistemic Value?

Hume's most important insight was that the justification of induction by reference to its past success is epistemically worthless because it involves a circle. Several philosophers have suggested, contrary to Hume, that the circular "justification" of induction is not vicious but virtuous. A standard distinction is that between premise circularity and rule circularity (Ladyman and Ross 2007, 75). An argument is premise circular if one of its premises is either identical with the conclusion or can only be justified if the conclusion's truth is presupposed.² In contrast, an argument is rule circular if the

2. This definition presupposes that the premises have been split up into smallest conjunctive elements (where "A" is a conjunctive element if and only if A is not logically equivalent to a conjunction of statements each of which is shorter than A). See Schurz and Weingartner 2010, def. 4.1b.


truth of its conclusion is presupposed by the underlying inference rule—in our case, the rule of induction. The "inductive justification of induction" is rule circular. That a rule-circular argument may have epistemic value has been claimed, for example, by Braithwaite (1974), Black (1974), van Cleve (1984), Papineau (1993, section 5), Goldman (1999, 85), and Psillos (1999, 82). However, the following counterexamples demonstrate that the hopes of these philosophers are in vain.

Argument (2.1) below, which goes back to Salmon (1957, 46), is enlightening here; it shows that the same type of rule-circular argument that "justifies" the reliability of induction can also be used to "justify" the rule of anti-induction (or counterinduction). The latter rule predicts, roughly speaking, the opposite of what has been observed in the past, so it predicts the opposite of what is predicted by the rule of induction.

(2.1)  Rule-circular justification of induction:

Premise: Past inductions have been successful.
Therefore, by the rule of induction: Inductions will be successful in the future.

Rule-circular justification of anti-induction:

Premise: Past anti-inductions have not been successful.
Therefore, by the rule of anti-induction: Anti-inductions will be successful in the future.

Both circular "justification" patterns have precisely the same structure; the premises of both arguments are true, yet they have opposite conclusions. This proves that rule-circular argument patterns are pseudo justifications that cannot have any epistemic value, as they can be used to pseudo justify opposite and even mutually contradictory conclusions. The conclusions of the two arguments are contradictory because they entail that opposite predictions will be successful. If the sequence of events (e1, e2, … ) is binary (i.e., ei ∈ {0,1}), "being successful" means that the success probability is at least greater than one-half. This leads to the mutually contradictory probabilistic expectations P(en+1 = 1 | e1 = 1, … , en = 1) > 0.5 (inductive) and P(en+1 = 0 | e1 = 1, … , en = 1) > 0.5 (anti-inductive).

Some philosophers have doubted that an anti-inductive prediction rule can be successful (Blackburne 2016; White 2015; van Cleve 1984, 561). However, this can easily be the case: most oscillatory sequences are friendly to counterinduction (see section 9.1.2). Consider, for example, the simplest inductive rule for binary random sequences (called OI for "object induction"). It predicts en+1 = 1 if freqn(1) ≥ 1/2 (or n = 0) and en+1 = 0 otherwise, where freqn(1) is the frequency of 1s in the first n members of the sequence. The corresponding anti-inductive rule (called OAI for "object anti-induction") predicts en+1 = 0 if freqn(1) ≥ 1/2 (or n = 0) and en+1 = 1 otherwise. We shall see in section 5.9 that in application to random sequences with a limiting frequency of r, OI has a success rate of max({r, 1−r}) (where "max(S)" denotes the maximum in S), while the anti-inductive rule has a success rate of 1 − max({r, 1−r}). Now consider the (nonrandom) oscillating event sequence (0,1,0,1, … ). Its finite frequencies freqn(1) are 0, 1/2, 1/3, 1/2, 2/5, 1/2, … , or generally, 0.5 − 0.5/n for odd n and 0.5 for even n. For this sequence, the success rate of the anti-inductive rule OAI is 1, whereas that of the inductive rule is 0.³
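These success rates can be checked by direct simulation. The following minimal Python sketch is our illustration, not part of the original text (the helper names oi, oai, and success_rate are invented for this purpose). It implements the two rules and numerically confirms both claims stated above: on the oscillating sequence OAI scores 1 and OI scores 0, and on a random binary sequence with limiting frequency r = 0.7 the success rates approach max({r, 1−r}) = 0.7 and 0.3, respectively.

```python
import random

def oi(history):
    """Object induction OI: predict 1 iff the frequency of 1s so far is >= 1/2 (or n = 0)."""
    n = len(history)
    return 1 if n == 0 or sum(history) / n >= 0.5 else 0

def oai(history):
    """Object anti-induction OAI: always predict the opposite of OI."""
    return 1 - oi(history)

def success_rate(rule, sequence):
    """Fraction of rounds in which the rule's prediction matches the next event."""
    hits = sum(rule(sequence[:n]) == event for n, event in enumerate(sequence))
    return hits / len(sequence)

oscillating = [n % 2 for n in range(2000)]             # the sequence 0, 1, 0, 1, ...
print(success_rate(oi, oscillating))                   # 0.0
print(success_rate(oai, oscillating))                  # 1.0

random.seed(1)
r = 0.7                                                # limiting frequency of 1s
bernoulli = [1 if random.random() < r else 0 for _ in range(2000)]
print(success_rate(oi, bernoulli))                     # close to max(r, 1 - r) = 0.7
print(success_rate(oai, bernoulli))                    # close to 1 - max(r, 1 - r) = 0.3
```

The simulation is only an illustration of the claims in the text; the formal results are proved in chapter 5 and section 9.1.2.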



Skyrms ([1975] 2000, 35ff.) pointed out that the above circular justification of induction can also be reconstructed in the form of an infinite regress, by distinguishing between inductive arguments of the first level, which are applied to events, inductive arguments of the second level, which are applied to applications of inductive arguments of the first level, and so on. The rule circularity then turns into an infinite regress, in which one has to justify inductive arguments of the nth level by inductive arguments of the (n+1)th level. Of course, as Skyrms would presumably agree, an infinite justificational regress is no less vicious than a rule-circular justification, so we gain nothing by this move.

The parallelism between the rule-circular justifications of induction and anti-induction shows that rule-circular justifications are epistemically worthless, as they can serve to justify opposite conclusions. Several further examples demonstrate that with the help of rule-circular justifications one can pseudo justify various other nonsensical rules. Achinstein (1974) gives the example of the following obviously invalid rule, RA:

Rule RA:
No Fs are Gs.
Some Gs are Hs.
Therefore: All Fs are Hs.

3. If we replace the greater-than-or-equal-to symbol (≥) in the above formulation of the two rules by the greater-than symbol (>), the same result is obtained for the inverted sequence (1,0,1,0, … ).


Achinstein points out that rule RA can be justified in a rule-circular way by the following argument, which is an instance of RA (the instantiations of the predicates F, G, and H are, respectively, "RA-instantiating argument," "argument with universally quantified premises," and "valid argument"):

(2.2)  Achinstein's rule-circular justification

No RA-instantiating argument is an argument with universally quantified premises.
Some arguments with universally quantified premises are valid.
Therefore: all RA-instantiating arguments are valid.

If rule-circular justifications were epistemically valuable, then with the preceding justification one could justify nonsensical instances of RA, such as "No plant is human; some humans are astronauts; therefore: every plant is an astronaut." A rule-circular justification may also be constructed for practically dangerous rules, such as the blind-trust-in-authority rule:

(2.3)  Rule BTA (for "blind trust in authority")

If my accepted authority tells me "p," then I infer that p is true.

Rule-circular justification of rule BTA: My accepted authority tells me that the rule BTA is reliable, from which I infer by BTA that BTA is reliable.

Rule BTA is a mark of dogmatic worldviews. It is also prevalent in fundamentalistic religions (for a defense of this rule in analytic theology, see Crisp 2009). Since different religions make opposite claims, we again end up in the situation that mutually contradicting assertions can be rule-circularly justified by the rule BTA. We call the authority rule "blind" because it trusts the authority independently of its epistemic track record. Of course, there may be rationally justified trust in epistemic authorities based on inductive inference from their track record—though, of course, only to the extent that a justification of induction is possible. Note that even when the superior reliability of an authority is granted, it does not follow that less reliable persons should always replace their beliefs by the authority's beliefs (as suggested by Zagzebski 2012, 197); often an aggregation of the judgments of less and more reliable agents is more successful (see section 6.6.3, note 7). In any case, the



rule-circular "justification" of the trust-in-authority rule is without epistemic value, and this applies a fortiori to the inductive justification of induction.

Some proponents of the inductive justification of induction defend their view by arguing that rule-circular justifications are only acceptable if they are applied to the "right kind" of rule—namely, to the rule of induction—because it is only in this case that the premise of the rule application asserts that the rule was successful in the past. However, this position is unconvincing. The premise on which the self-justification of a rule is based does, of course, depend on the rule. The authority rule bases its self-justification on the premise that its reliability was asserted by the authority. Thus, defenders of authority rules can argue analogously that the self-justification of the authority rule is only legitimate if it is based on the right kind of authority, but obviously this move does not bring any epistemic gain. Thus, the restriction of rule-circular justifications to the right kind of rule seems to be just a concealed way of admitting that rule-circular justifications do not have intrinsic epistemic value and should therefore only be applied to rules that one considers independently justified.

2.5  Can Induction Be Justified by Abduction or Inference to the Best Explanation?

Several authors (Armstrong 1983, section 6; Harman 1965; Lipton 1991, 69) have argued that inductive inferences are justified because they are instances of abductive inferences, or inferences to the best explanation (IBEs). As noted in section 1.1, an IBE infers from an observed phenomenon that the premises of its best (available) explanation are presumably true. An inductive generalization—so the argument goes—is an instance of an IBE because the best (available) explanation of a hitherto observed regularity (such as "So far all observed ravens were black") is the corresponding lawlike generalization ("It is a law that all ravens are black").

This argument sounds prima facie plausible. Indeed, whenever we are willing to project or generalize our observations inductively, we assume that the observed regularity is backed up by a nonaccidental (i.e., lawlike) regularity over the entire domain or time span. If we believed that the observed regularity was a matter of accidental coincidence, it would be unreasonable to inductively project it beyond our observations. For example, if I see a banana on the sidewalk and at the same time see my friend Susan, then I am sure this coincidence was accidental, and I do not expect Susan to show up every time I see a banana on the sidewalk (for a similar argument, see Spohn 2005).


What this shows, however, is merely that inductive assumptions are semantically more or less equivalent to making, at least implicitly, lawlikeness assumptions. It seems doubtful to me whether this connection can be called a genuine explanation, as asserted by Armstrong (1983, part I).⁴ But even if it is called a weak form of explanation, nothing is won by it. For the question is, why are we justified in believing that this lawlikeness explanation of observed regularities is (probably) true? The answer given by IBE theorists must be because (1) it is the best explanation, and (2) IBEs are justified. This in turn brings up two questions: (i) why is the lawlikeness explanation the best explanation, and (ii) what justifies IBEs at all? Attempts to answer these two questions run into the following two problems.

Problem 1, regarding question (i): The justification of induction by means of an IBE seems to presuppose that we already believe in the reliability of induction or in the uniformity of nature. Otherwise, it is unclear why one should regard the statement "Because it's a law that all Fs are Gs" as the best explanation of the fact that all observed Fs so far have been Gs. If one assumes instead that from time to time the laws of nature undergo radical changes, the explanation "Because so far it was lawlike that Fs are Gs" seems to be the better explanation of the corresponding observations. This argument can be generalized. Without inductive uniformity assumptions, the two law-hypotheses "All Fs before t and after t are Gs" and "All Fs before t are Gs and after t are ¬Gs" seem to be equally good explanations of "All Fs observed before t were Gs." Should someone object that the first explanation is syntactically simpler than the second one, then Goodman's objection comes into play, according to which this "simplicity advantage" can be inverted by introducing suitably defined predicates (see section 4.2). In conclusion, the IBE-justification of induction is indirectly circular, as it can only work if inductive uniformity assumptions are presupposed.

Problem 2, concerning question (ii): Justification strategies analogous to those we have discussed regarding induction have been proposed for IBEs. As a consequence, they are beset with analogous problems (see Bird 1998, 171). For example, Armstrong (1983, 53) has argued that IBE is rational "by definition." But, as we pointed out earlier, analytic conventions cannot demonstrate that a certain rule is cognitively successful. Lipton (1991, 167ff.), Papineau (1993, section 5), and Psillos (1999, 82) have suggested justifying IBE in the rule-circular way shown in argument (2.4).

4. Armstrong explains regularities by contingent necessitation relations between universals.



(2.4)  Rule-circular "justification" of IBE

The assumption that IBEs are reliable is the best (available) explanation of the fact that so far most hypotheses introduced by IBEs have turned out to be successful.
Therefore, by the IBE rule: IBEs are reliable—that is, most hypotheses introduced by IBEs are true or are at least close to the truth.

However, we know from section 2.4 that the justification in argument (2.4) is as epistemically worthless as all other rule-circular justifications. As Douven (2011, section 3) has pointed out, by a rule-circular argument with precisely the same structure we can "justify" the rule of inference to the worst explanation (IWE), which infers, from the premise that a hypothesis P is the worst explanation of a fact E, that P is true, or at least close to the truth.

(2.5)  Rule-circular "justification" of IWE

The assumption that IWEs are reliable is the worst (available) explanation of the fact that so far most hypotheses introduced by IWEs have turned out to be unsuccessful.
Therefore, by the IWE rule: IWEs are reliable—that is, most hypotheses introduced by IWEs are true or are at least close to the truth.

In conclusion, the problem of justifying IBEs is harder than that of justifying induction. While all problems besetting the justification of induction apply also to the justification of IBEs, the latter problem involves additional difficulties, such as the justification of the introduction of theoretical concepts, as explained in section 1.1.

2.6  The Role of Induction and Abduction for Instrumentalism and Realism

In the domain of scientific theory testing, theories are evaluated based on the extent to which their empirical consequences can be verified in observations and experiments. The more of these empirical consequences can be observed, the more the theory counts as confirmed or successful. Conversely, the more of these empirical consequences are in disagreement with observations, the more the theory counts as disconfirmed. The way (meta-)inductive and abductive inferences come together in the enterprise of theory testing is frequently described as follows.


Figure 2.1  Interaction of epistemic induction and abduction:

(1) Evidence: Tk is, among the alternative theories T1, … , Tn, so far the most empirically successful.
    ↓ Meta-inductive inference to empirical adequacy
(2) Instrumentalist conclusion: Tk is among T1, … , Tn the most empirically adequate (therefore also the most empirically successful in the future).
    ↓ Abductive inference to (approximate) truth
(3) Realistic conclusion: Tk is among T1, … , Tn the closest to the truth.

By means of meta-inductive inferences, we transfer into the future the empirical successes of competing theories that we have observed so far. From these inductive inferences we obtain judgments about the empirical adequacy of a theory—that is, the extent to which its empirical content matches the observable facts (not restricted to the past but including future observations). The claim that a theory is empirically adequate implies that its empirical consequences are (approximately) true, but it does not entail that its theoretical content is also (approximately) true—that is, that the hidden entities and properties posited by the theory (such as quarks or gravitational fields) have real existence. The inference from the empirical adequacy of a theory to its (probable and approximate) realistic truth—that is, to the truth of its theoretical superstructure—is not an inductive but an abductive inference. A modest version of this inference has been suggested by Kuipers (2000, section 7.5.3), who argues (like Popper 1983, 19ff.) that we cannot assess the "absolute" degree of truthlikeness of a theory; we can only make comparative success evaluations. This interplay of the (meta-)inductive inference and the (meta-)abductive inference to the best theory is summarized in figure 2.1.

Figure 2.1 makes it clear how these two kinds of inference are connected with the positions of instrumentalism and realism in the philosophy of science. Instrumentalists (such as van Fraassen) accept the inference from 1 to 2 but reject the abductive inference from 2 to 3 because, according to them, theories may only justifiably be said to be more or less empirically



adequate, but they may not justifiably be said to be true or false in the realistic sense (van Fraassen 1980, 11ff., 20; 1989, 142ff.). Realists in the philosophy of science (such as Psillos 1999) believe in the likely truth or truthlikeness of well-confirmed theories; for them, the inference from 2 to 3 is essential to science.

As figure 2.1 shows, the meta-inductive inference to the empirical adequacy of a scientific theory is a precondition for the inference to its realistic truth. Moreover, the inference to empirical adequacy produces at least some kind of justification for believing in the theory. Thus, a convincing justification of meta-induction (which constitutes the focus of this book) would offer at least a partial solution to the problem of justifying scientific theories. How this partial justification might lead to a full (abductive) justification of scientific theories will be briefly discussed in section 11.2.5.

So now it is time to take stock. In this chapter we have seen that all nonformal philosophical attempts to solve Hume's problem of induction have failed. Is skepticism unavoidable? As we shall see, a variety of contemporary epistemologists seem to draw this conclusion. However, so far we have investigated only the nonformal, not the formal-probabilistic, accounts of the problem of induction. The latter are technically more demanding, and their discussion helps prepare the ground for the development of our own approach, so their investigation will take place in chapter 4. Chapter 3 focuses on the significance of Hume's problem in the light of contemporary epistemology.

3  The Significance of Hume's Problem for Contemporary Epistemology

3.1  The Aims of Epistemology

Since the beginning of the philosophical Enlightenment in the sixteenth century, the concept of justification has played a central role in epistemological debates. According to the traditional conception, knowledge is justified true belief. Thus, it is justification that distinguishes knowledge from accidentally true belief resulting from lucky guesses. In contemporary epistemology the conception of knowledge as justified true belief is still widely held; however, the traditional foundationalist and internalist understanding of justification has been challenged (see Fumerton 2010, section 1.2).

The leading idea of justification in the Enlightenment was foundation-oriented and internalist: in order to acquire knowledge, our system of beliefs should be justified not by religious or other authority but by reason—by means of a system of arguments by which all our beliefs can be soundly derived from a small class of basic beliefs and principles that are considered directly evident to everybody. This idea of justification was shared by both the rationalistic wing (e.g., Descartes, Leibniz, Kant) and the empiricist wing (e.g., Locke, Hume, Mill) of Enlightenment epistemology.

During the development that followed the Enlightenment era, which led to the philosophical situation of "(post-)modernity," the foundation-oriented program of epistemology came increasingly under attack. The main criticism was not that it is misguided but that it is pretentious—its noble claim of presumption-free and universally acceptable standards of justification is illusory, too good to be true. The major challenge for the foundation-oriented program is the problem of justificational regress: the apparent necessity of basing each justification on premises that are themselves in need of justification.

There are two dimensions in which this regress problem arises. First, it arises at the horizontal level of beliefs (or statements) that are traced back to



more and more basic beliefs—in modern terminology, this is the problem of first-order justification. In regard to this dimension, the Enlightenment epistemologists ultimately arrived at the minimalist class of introspective and analytic beliefs, which they considered to be the only beliefs that can be regarded as directly evident. Second, the regress problem arises at the level of arguments, whose reliability must be demonstrated by means of certain meta-arguments—in modern terminology, this is the problem of higher order justification. Hume's skeptical arguments against the possibility of a noncircular justification of induction made it clear that a sustainable solution to this problem is extremely difficult, if not impossible. For this reason, three centuries after Hume the problem of induction has lost none of its importance for epistemology.

In this chapter we defend this claim against some recent (in particular externalist) developments in epistemology that attempt to circumvent Hume's problem by redefining the notion of justification in such a way that we can be justified in an inductively inferred belief even without possessing a justification for induction. These programs take a resigned stance toward the problem of induction (as traditionally understood); they seem reasonable only if a noncircular solution to Hume's problem is regarded as impossible. Along this line, Greco wrote, "We can hope to avoid Hume's skeptical arguments only by adopting externalism" (2005, 265). In section 3.4, however, we will see that externalism is incapable of offering any reason in favor of inductive as opposed to noninductive methods. In contrast to these positions we defend a modernized account of a foundationalistic position that we call "foundation-oriented epistemology," which can offer a positive solution to Hume's problem by means of the method of meta-induction.

Our conception of foundation-oriented epistemology departs from traditional foundationalist accounts in three respects:

1. A careful distinction is made between foundation-oriented and foundationalistic approaches. Classical foundationalistic epistemologies demand that the basic beliefs be epistemologically certain or necessary (see Dancy 1985, chap. 4.1). Most contemporary epistemologists reject the infallibility requirement as too strong. Foundation-oriented approaches allow that even basic beliefs may be revisable. Still, basic beliefs enjoy an epistemological priority insofar as they (1) are more entrenched than nonbasic beliefs and (2) figure as informational inputs in the dynamical network of beliefs.

2. Foundation-oriented epistemology is committed to the idea of meliorative epistemology (see Goldman 1999, chap. 4; Schurz 2008c, 2009a;


Shogenji 2007, section 1)—the idea that epistemology should help improve the epistemic practice of people, which was an important part of the Enlightenment program. Bishop and Trout (2005) have criticized standard analytic epistemology because of its inability to serve meliorative purposes. Instead of hiding skeptical doubts behind clever redefinitions of the concept of knowledge, meliorative epistemology should be concerned with the question of which inference strategies are cognitively more successful than others. In particular, meliorative epistemology should be helpful in solving disagreements between competing worldviews by means of rational argumentation. In contrast, prominent defenders of contemporary analytic epistemology are skeptical toward the possibility of solving fundamental disagreement by rational argument (Sosa 2010).

3. The apparently insurmountable problem of traditional foundationalism, the problems of circularity and regress, is handled by the novel method of optimality justifications, paradigmatically exemplified by the optimality-based justification of meta-induction developed in this book.

3.2  Foundation-Oriented Epistemology and Its Main Problems

At the core of foundation-oriented epistemology is the notion of foundation-oriented justification. Based on the previous considerations, this notion is explicated as follows:

(Definition 3.1)  Foundation-oriented justification

A system of justifications is foundation-oriented in the internalist sense iff it satisfies the following requirements:

(R1) It attempts to justify all beliefs by means of chains of arguments whose ultimate premises consist of basic beliefs that are directly evident.

(R2) It thereby avoids complete justification circles, because they are epistemically worthless.

(R3) Its justifications intend to be complete in the sense of providing higher order justifications for the reliability of the argument patterns employed in R1, or at least for their optimality in regard to the goal of reliability.

Concerning R1: There are two variants of foundation-oriented epistemologies, internalist and externalist (Fumerton 1995, chap. 3). We understand the notion of a "foundation-oriented justification" in the traditional internalist sense, as a system of arguments terminating in premises expressing



directly—that is, unconditionally—evident beliefs. For externalists (e.g., Goldman 1986) the system of justification consists in (unconditionally or conditionally) reliable cognitive processes that need not be accessible to the epistemic subject. Our preference for the internalist account is based on a simple reason: the justifications of our belief system must be cognitively accessible, because inaccessible justifications are epistemically useless (see section 3.4). R1's requirement that every justification chain must terminate in directly evident beliefs excludes infinite regresses. The class of "directly evident" beliefs is understood in a minimalistic sense that we will characterize later; this minimalistic understanding distinguishes the foundation-oriented program from dogmatic accounts.

Concerning R2: This requirement excludes justification circles and discriminates foundation-oriented accounts from coherentist accounts, which allow circular justifications. In sections 2.4 and 3.3 we criticize coherentist accounts by arguing that complete justification circles are epistemically worthless because they can be used to pseudo justify mutually inconsistent beliefs.

Concerning R3: This condition requires a justification of the reliability, or the (reliabilistic) optimality, of the employed argument patterns. An argument pattern is called reliable iff the objective probability of its conclusion, given its premises, is sufficiently high—that is, greater than a given "acceptability threshold" t > 0.5 (the explication of optimality will be given in the later chapters of this book). Requirement R3 is maintained by all traditional and many contemporary internalist accounts but is rejected by all externalist and even some internalist accounts (see Conee and Feldman 2001). We will argue later that condition R3 is particularly important for internalist accounts with meliorative purposes. Recent defenses of this condition can be found, for example, in Fumerton's account of inferential justification (1995, 36, 85), according to which being justified in believing p on the basis of believing evidence e entails (1) being justified in believing e and (2) being justified in believing that e makes p probable (i.e., that the argument from e to p is reliable). Another variant of R3 is White's reliability principle (2015, 219), according to which a rational person can only be justified in believing a proposition p if she is justified in believing that the methods that led her to believe p are reliable.

Following from R1 and R2, higher order justifications must themselves be noncircular and free from invoking an infinite regress. For this reason we prefer to speak of higher order instead of second-order justifications. The latter notion (introduced by Alston 1976) invites the question why


Figure 3.1  Major components and problems of internalist foundation-oriented epistemology.

Basic beliefs → Derived (nonbasic) beliefs

First-order justification. Unconditional: find basic beliefs. Conditional: find basic reasons/premises and inference patterns for nonbasic beliefs. Solution: deduction (D), induction (I), and abduction (A).

Problems of higher order justification: Why are basic beliefs immediately (or prima facie) evident? Why are the inference patterns (D, I, A) cognitively successful?

one should not require third- or fourth-order justifications (and so on); but obviously the regress of meta-levels has to stop at some level. This book argues that this is possible by means of optimality justifications. The turn to optimality justifications is reflected in our formulation of R3; it constitutes the crucial novelty of this book.

Figure 3.1 illustrates the major components of an internalist foundation-oriented epistemology and at the same time reveals its major problems. Every foundation-oriented model of justification must, first, specify a class of basic beliefs that are taken as immediately (or at least as prima facie) evident and are not in need of further justification. Second, it must specify argument patterns by which derived or nonbasic beliefs can be traced back to basic beliefs. The three major types of argument patterns that have been established in analytic epistemology are deduction, induction, and abduction, or inference to the best explanation.

Specifying basic beliefs, and specifying deductive, inductive, or abductive reasons for one's derived beliefs, is the task of first-order justification—unconditional for basic beliefs, conditional for derived beliefs. This part of justification is required not only according to philosophical but also according to commonsense standards of rationality. Take, for example, the first-order justification of my belief "You will get a fever." It is based on my basic belief "I saw you being in close proximity to a person who had the flu" together with the first-order inductive argument "Based on statistical evidence, flu is infectious."



For a complete justification, higher order justifications are required. The higher order justification of basic beliefs must explain why a particular class of beliefs that is regarded as basic can legitimately be considered immediately (or at least prima facie) evident; this is usually called the "problem of basic beliefs." The higher order justification of inference patterns has to explain why the patterns of deduction, induction, and abduction can legitimately be regarded as cognitively successful, in the sense of being reliable or at least optimal in regard to reliability. This is what is usually meant by the "problem of higher order justification" in the narrower sense—that is, applied to inferences.

For deductive argument patterns the task of higher order justification is prima facie unproblematic, because we know, by the definition of logical validity, that deductive arguments preserve truth in strictly all cases. This kind of justification is not enough if we want to defend classical logic against alternative logics (more on this in section 11.2.2), but for the time being we will take classical deductive logic for granted. By contrast, the task of finding a higher order justification for induction and abduction that is neither circular nor exposed to an infinite regress is extremely difficult, if not impossible, as we have seen in chapter 2.

Before we turn to the significance of Hume's problem in the context of figure 3.1, we explain the classical solution to the problem of basic beliefs, which was already suggested by Augustinus and was taken over by early Enlightenment philosophers such as Descartes, Locke, Berkeley, and Hume. This solution was minimalistic, as it considered only two kinds of beliefs as truly basic (i.e., free of doubt and not in need of further justification):

1. Introspective beliefs, which express facts about one's conscious experiences, without any implications about the existence and constitution of a subject-independent external world. Our beliefs about an external reality may be in error. For example, the tree in front of me that I see right now may be a hallucination, a mere dream, or whatever. But what I know for sure is that I have this perceptual experience now—in my beliefs about my own conscious (and hence introspectively accessible) experiences I cannot be in error. Thus, introspective statements are formulated in a "self-language": they have the canonical form "I now have the experience such-and-such."

2. Analytic beliefs, by which we understand believed propositions that are true either because of the laws of (classical) logic or because of accepted semantic definitions or meaning postulates for extralogical concepts. Thus, analytically true sentences are true for purely logical or semantic reasons, independent of the factual constitution of the world. In


contrast, synthetic beliefs express something about the factual constitution of some part of "the world"—the ordinary external world for realistic beliefs and the world of "my" experiences for introspective beliefs. Summarizing, we divide all beliefs (or propositions) into analytic and synthetic beliefs, and synthetic beliefs into introspective and realistic beliefs. Realistic propositions express external, subject-independent facts—which may be singular or general, empirical or theoretical. In contrast, introspective propositions express one's subjective experiences—which may be perceptual experiences or inner experiences—without ascribing to them realistic content. Basic beliefs consist of introspective and analytic beliefs.

Philosophers have raised objections to the classical solution of the problem of basic beliefs. It has been argued that even introspective beliefs are prone to error. For example, introspective beliefs about one's past experiences rely on memory, and memory is fallible. Moreover, when introspective beliefs are formulated in a public language, one may be in error about the semantics of that language (see Lehrer 1990, 51–54, 64ff.). We agree with these objections. However, we suggest that introspective beliefs should be restricted to one's private language and one's present experiences. Thus restricted, even memory beliefs—such as "I remember having seen a table"—are basic beliefs about one's present memories, whether or not the content of these memories—in the example, that I saw a table—is in error (see note 1 in section 11.2.5 on the problem of justifying the reliability of one's memories). If introspective beliefs are restricted in this way, they seem to constitute at least optimal candidates for directly evident basic beliefs. We do not need to assume that they are infallible, as foundationalistic positions do. It is sufficient to recognize that in almost all cases we can rely on our introspective reports; exceptions arise only when a mind or brain acts in a completely schizophrenic way (see Fumerton 1995, 71).

Likewise, analytically true beliefs are obvious candidates for basic beliefs. The truth of logically true beliefs follows from the semantic laws characterizing logical concepts (e.g., the truth tables of propositional logic). Similarly, the truth of extralogical meaning postulates (such as "bachelors are unmarried men") follows from accepted semantic conventions of the given linguistic community. While the classical empiricists (from Locke to Hume) limited the class of a priori knowable propositions to analytic ones, the classical rationalists (e.g., Descartes and Leibniz) thought that even some synthetic truths (e.g., certain laws of physics) can be justified on a priori grounds. The same was claimed by Kant, though for radically different philosophical



reasons. However, most of the principles that the rationalists and Kant considered a priori truths were refuted on empirical grounds by modern physics. Thus, it seems more reasonable to prefer the minimalistic solution in accordance with the empiricist tradition, according to which the only beliefs whose truth is doubtlessly a priori are analytic statements.

The main problem of the minimalist solution to the problem of basic beliefs lies neither in justifying analytic beliefs nor in the possibility of erroneous introspective beliefs. Rather, it lies in the difficulty of inferring from the small class of basic beliefs anything nontrivial about the part of the world that lies outside of our consciousness, including our internal future as well as the external world. Deductive inferences are clearly insufficient for this task, because by means of deductive logic we cannot infer conclusions that relevantly contain predicates not contained in the premises.¹ Thus, what we can (relevantly) infer from introspective beliefs by means of deductive logic are merely other introspective beliefs. In order to pass from beliefs regarding actual experiences to beliefs regarding the future or general laws, one needs induction. Moreover, passing from introspective experiences to beliefs concerning an external reality that causes these experiences requires abduction, or inference to the best explanation (Vogel 2005; section 11.2.5).

In conclusion, the real problem of the minimalist foundation-oriented solution to the problem of basic beliefs is that it shifts the burden of justification to the inference patterns of induction and abduction. Thus the task of providing higher order justifications for these two inferences becomes enormously important, induction's justification being the more fundamental, as it is presupposed in the justification of abduction (recall section 2.5). However, we have seen that up to now no generally satisfying higher order justification of induction and abduction has been found in the literature. If we cannot solve the problem of higher order justification of inductive inferences, all the well-known skeptical arguments will descend upon us, and nothing transcending our own consciousness could be said to be knowable.

Faced with this desperate situation, nonfoundationalistic epistemologists have searched for ways to circumvent the problem of higher order

1. If the conclusion of a deductive inference contains a predicate that does not occur in the premises, that predicate is completely irrelevant, in the sense of being replaceable by any other predicate salva validitate of the inference. This follows from the theorem of uniform substitution for predicates (Schurz 1991). Examples (the replaceable predicates are G in the first inference and H in the second): Fa ⊨ Fa ∨ Ga, and ∀x(Fx → Gx) ⊨ ∀x(Fx ∧ Hx → Gx).


justification. Following figure 3.1, there are two possible strategies for escaping this problem, which can be formulated in the form of a dilemma.

Dilemma—horn 1 (extend the basis): Here, one tries to avoid the problem of a higher order justification of induction and abduction by extending the basis—in particular, by including beliefs regarding the uniformity of nature or an external reality. However, doing so means that basic beliefs are no longer directly evident, and the problem of basic beliefs becomes insuperable. A historically well-known reaction to this problem is dogmatism, the attempt to defend "extended basic" beliefs by mere authority, as in fundamentalistic worldviews. Given that dogmatic attitudes are rejected, the only remaining way of including nonevident beliefs in the basis is to allow basic beliefs to be criticized or justified by means of nonbasic beliefs. In consequence, the noncircularity requirement R2 of definition 3.1 is given up. Historically, this route has been taken by coherentism.

Dilemma—horn 2 (minimize the basis): The classical solution to the problem of basic beliefs, supported in this book, is minimalistic—it includes only introspective beliefs and analytic beliefs. However, minimizing the basis worsens the problem of higher order justification, which, as many epistemologists think, becomes insuperable. Proponents of this position will be inclined to give up the demand for higher order justification, requirement R3 of definition 3.1. This route has been taken by externalism, at least implicitly.

In the next two sections we illustrate the implications of coherentism and externalism for the problem of induction and discuss their major shortcomings.

3.3  Coherentism and Its Shortcomings

According to coherentistic accounts (e.g., BonJour 1985), the beliefs of an epistemic agent are the more confirmed the more they mutually support each other. Thus, coherentistic accounts accept circular justifications. Our major objection against circular justifications is that they are without epistemic value, because with their help one can pseudo justify mutually contradicting propositions.

Recall the distinction between premise circularity and rule circularity introduced in section 2.4. In that section we made our point for rule-circular justifications, such as the inductive justification of induction. We demonstrated that according to coherentistic standards not only the inductive justification of induction but also the anti-inductive justification of anti-induction has to be considered an acceptable argument, because



both rule-circular arguments are structurally identical. Because both arguments lead to opposite conclusions, they cannot have any justificatory value. Moreover, we have seen that rule-circular arguments can pseudo justify even obviously nonsensical rules such as "No plant is human; some humans are astronauts; therefore, every plant is an astronaut."

A similar objection applies to premise-circular justifications—circular justification relations between propositions. An example in the context of our problem is the justification of induction by assumptions of uniformity (section 2.3). Here the reliability of induction (RI) is justified by the uniformity of nature (UN), which is in turn justified by the reliability of induction: RI ⇄ UN. The following argument shows that the mere existence of circular justification relations between two or more propositions cannot increase their probability of being true. It is impossible for two contradicting sets of beliefs both to be true. However, there exist many different belief systems that are equally coherent but in mutual contradiction, though only one of them can be true. It follows that mere coherence among propositions cannot be a reliable indicator of their truth.

We illustrate this point with a simple example. Assume my belief system contains these three propositions: (1) p (God is angry), (2) q (It is raining), and (3) p ↔ q (God is angry exactly if it is raining). This belief system is perfectly coherent, because 3 is inductively supported by 1 and 2, 1 is deductively supported by 2 and 3, and 2 is deductively supported by 1 and 3. But obviously, these circular justification relations hold for every system of propositions that has this formal structure; for example, they also hold for the opposite propositions (1*) ¬p and (2*) ¬q, together with (3*) ¬p ↔ ¬q, where 3* = 3, because ¬p ↔ ¬q is logically equivalent to p ↔ q. In other words, by circular reasoning based on p ↔ q we can "justify" both p and q, and the opposite propositions ¬p and ¬q. For the justification circle between the reliability of induction and uniformity (RI ⇄ UN), this means we could, with equal right, "justify" the unreliability of induction by nature's nonuniformity and vice versa (¬RI ⇄ ¬UN).

This argument can be generalized. Let S = {B1, B2, … } be the set of propositions believed by an agent. Consider first the degree of deductive coherence of S, which is determined by the amount of deductive justification relations between propositions in S. Now replace in all propositions in S every propositional variable pi uniformly by its negation ¬pi, and call the resulting belief system S* = {B*1, B*2, … }. By the logical theorem of uniform substitution, the deductive justification relations between the propositions in S* remain exactly the same: Bi1, … , Bin ⊨ Bin+1 iff B*i1, … , B*in ⊨ B*in+1. Thus, the deductive coherence of S* is the same as that of S. However, for every elementary proposition (every propositional variable or its negation) we believe in S* the exact opposite of what we believe in S.
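This symmetry can be made concrete with a brute-force check. The following Python sketch is our illustration, not the author's (the helper name entails is invented): it verifies by truth tables that in the belief system S = {p, q, p ↔ q} each member is entailed by the other two, and that the same six entailment relations hold in the negated mirror system S* = {¬p, ¬q, ¬p ↔ ¬q}.

```python
from itertools import product

def entails(premises, conclusion):
    """Semantic entailment over the two atoms p, q, checked by truth tables."""
    return all(conclusion(p, q)
               for p, q in product([True, False], repeat=2)
               if all(prem(p, q) for prem in premises))

# Belief system S = {p, q, p <-> q} and its mirror S*, with every atom negated.
S      = [lambda p, q: p,     lambda p, q: q,     lambda p, q: p == q]
S_star = [lambda p, q: not p, lambda p, q: not q, lambda p, q: (not p) == (not q)]

# In both systems, each belief is entailed by the other two (six checks in all).
for name, system in [("S", S), ("S*", S_star)]:
    for i, conclusion in enumerate(system):
        premises = system[:i] + system[i + 1:]
        print(name, i + 1, entails(premises, conclusion))  # prints True six times
```

Since all six checks succeed for S and for S*, the two mutually contradictory systems are indistinguishable by the deductive-coherence standard, exactly as the argument in the text asserts.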


A similar consideration applies to the degree of probabilistic coherence of S, as determined by the amount of probabilistic confirmation relations between propositions in S (Douven and Meijs 2007). In this case, we must vary the underlying probability function P of the epistemic subject. Let P* be the function that assigns to each elementary proposition the same probability that P assigns to its opposite: P(pi) = P*(¬pi) and (thus) P(¬pi) = P*(pi). Then the probabilistic coherence of S relative to P is precisely the same as the probabilistic coherence of S* relative to P*. However, for subjective Bayesians there exists no objective reason to prefer P over P*, because both functions are prior distributions (see the next section). For objective Bayesians who assume a uniform prior, the situation is even worse, because now—following from P(pi) = P(¬pi)—P and P* are identical distributions. The uniform substitution operation "pi/¬pi" may be performed for every subset of propositional variables. Thus, for every belief system S containing n elementary propositions there exist 2ⁿ equally coherent but mutually contradictory belief systems.

If this argument is right, it also refutes the following variant of coherentism suggested by Elgin (2005): beliefs are justified by their coherence if their truth is the best explanation of why they cohere. However, because the same degree of coherence is obtained by replacing all elementary propositions involved in these beliefs by their negations, there is an equally good explanation of this coherence according to which at least some of the coherent beliefs are false.

Finally, we clarify a possible misunderstanding. Our objection applies to radical forms of coherentism that consider complete justificatory circles as epistemically valuable, but not to moderate forms of coherentism that merely argue for the epistemic value of partial circles. A circular chain of justifications between n propositions, A1 → … → An → A1, is called a complete justificatory circle if not a single one of these propositions is justified by independent evidence (figure 3.2a). If some of the propositions Ai are justified by independent evidence, we speak of a partial circle (figure 3.2b).



Figure 3.2  (a) Complete and (b) partial justificatory circle (→ = justification relation). In (a), A1 and A2 justify each other with no outside input; in (b), A1 is additionally supported by independent evidence E1 and A2 by independent evidence E2.

Assuming the justification arrows in figure 3.2 express high conditional probabilities, then according to the preceding argument the complete cycle in figure 3.2a does not entail anything about the truth chances of A1 and A2. In figure 3.2b, however, the situation is different. Here, A1 is justified independently by evidence E1, and A2 by evidence E2, where the prior probabilities of E1 and E2 are assumed to be high (close to 1) and the priors of A1 and A2 are greater than zero. Now the fact that A1 and A2 are circularly related may produce a justificatory surplus, provided evidence E1 produces a probability increase

for proposition A1 independently of the truth value of A2, in which case this probability increase is propagated to A2. It can be proved that this yields an increase of A1's and A2's probability values as compared with the situation where A1 is supported only by E1 and A2 only by E2; thus P(A1|E1∧E2) > P(A1|E1) and P(A2|E1∧E2) > P(A2|E2). Note that this surplus justification is constrained by the condition that A1's probability increase produced by E1 is independent of A2's truth value. Among other things, this constraint excludes the possibility that rule-circular arguments can produce a justificational surplus. For example, the fact that something was asserted by an authority (= E1) cannot justify its truth (= A1) without presupposing the authority's reliability (= A2); thus, in the absence of an independent justification of A2, no probability increase can be propagated from E1 to A1 and from there to A2. In conclusion, partial circles into which independent evidence is entered can have a justificatory surplus value.
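The surplus inequality P(A1|E1∧E2) > P(A1|E1) can be illustrated with a toy joint distribution. In the following Python sketch (our numerical example, not the author's; all probability values are freely chosen, and only the qualitative pattern matters), E1 depends only on A1, E2 depends only on A2, and A1 and A2 are positively correlated:

```python
from itertools import product

# Hypothetical toy numbers: A1 and A2 are positively correlated hypotheses,
# E1 is independent evidence for A1, and E2 is independent evidence for A2.
P_A1 = 0.3
P_A2_given_A1 = {True: 0.9, False: 0.2}
P_E1_given_A1 = {True: 0.8, False: 0.2}
P_E2_given_A2 = {True: 0.8, False: 0.2}

def joint(a1, a2, e1, e2):
    """Joint probability P(A1, A2, E1, E2) under the factorization above."""
    p = P_A1 if a1 else 1 - P_A1
    p *= P_A2_given_A1[a1] if a2 else 1 - P_A2_given_A1[a1]
    p *= P_E1_given_A1[a1] if e1 else 1 - P_E1_given_A1[a1]
    p *= P_E2_given_A2[a2] if e2 else 1 - P_E2_given_A2[a2]
    return p

def prob(query, evidence):
    """P(query | evidence), computed by summing the joint over all worlds."""
    num = den = 0.0
    for world in product([True, False], repeat=4):
        if evidence(*world):
            p = joint(*world)
            den += p
            if query(*world):
                num += p
    return num / den

a1_true = lambda a1, a2, e1, e2: a1
e1_only = lambda a1, a2, e1, e2: e1
e1_and_e2 = lambda a1, a2, e1, e2: e1 and e2
print(prob(a1_true, e1_only))    # P(A1|E1)    ~ 0.63
print(prob(a1_true, e1_and_e2))  # P(A1|E1,E2) ~ 0.80, a genuine surplus
```

With these numbers the analogous computation also yields P(A2|E1∧E2) ≈ 0.88 > P(A2|E2) ≈ 0.74, so each belief gains support from the other's independent evidence, just as the partial-circle argument predicts.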


Many arguments in favor of moderate forms of coherentism that can be found in the literature involve merely partial circles. For example, in the philosophy of science it has been investigated to what extent the mutual coherence between evidence propositions E1, … , En can lead to a confirmational surplus for the confirmation of a hypothesis H (Gähde and Hartmann 2005; Olsson 2007). Here it is assumed that these evidence propositions are intrinsically justified on independent grounds, which is unproblematic within our account of foundation-oriented epistemology.

3.4  Externalism and Its Shortcomings

We confine our discussion of the contemporary externalism–internalism debate to those aspects that are relevant to the problem of induction. Let us start with two important distinctions: that between internal and external facts (or propositions), and that between goal externalism and justification externalism.

A proposition is called internal for an assumed subject iff it designates a cognitive state (or content) of the subject that is cognitively accessible to this subject. This does not imply that all the subject's internal states are actually conscious, but it implies that all of them can be brought to consciousness—for example, by memory retrieval.² On the other hand, facts or states of affairs are called external if they belong to the external reality outside of the self-accessible part of the subject. Note that we understand the notion of an internal statement in a broader sense than that of an introspective statement. Introspective statements refer to the present moment; internal statements express introspective facts that may refer not only to present but also to past or even future times. In contrast to introspective statements, these internal statements are not basic. The inclusion of future internal states in the notion of "internal" is needed to formulate inductive prediction methods on the purely internal level (e.g., "I will have the experience of waking up in my bed tomorrow"), without the presupposition of an external reality. In contrast, all realistic statements (in the sense explained earlier) are external statements.

Next we turn to the distinction between goal externalism and justification externalism. Goal externalism is a weak form of externalism. It assumes that the goal of knowledge is truth, which is an external state.³ Externalists have rightly argued that the task of justification does not consist in the satisfaction of intuitions or intuitively given epistemic rules, as assumed by deontological internalism or virtue internalism (see Alston 1989, 85ff.; Greco 2004). Rather, it consists in the acquisition of true beliefs and the avoidance of false beliefs. Our epistemological account is coherent with goal externalism, as long as the notion of "truth" is understood in the metaphysically neutral sense explained in section 2.2.

The stronger form of externalism, which we are inclined to reject, is justification externalism. In the following discussion, "externalism" always means "justification externalism," because this is the standard meaning of the term. Pretty much every classical account of justification in philosophy until the 1960s considered justification to be an internal concept.

2. This definition of internalism (which we prefer) is also called accessibility internalism, as opposed to state internalism (Fumerton 1995, 60–66).

3. Goal externalism corresponds to Goldman's externalism concerning the "standards of rightness" (2009, 336).



The obvious reason for this understanding is that possessing a justification for one's belief means referring to a cognitive state or disposition of the subject, something that is possessed by the subject rather than being external to it. The development of externalist epistemologies took place in the last five decades. It was triggered by two major motivations: first, the attempt to overcome the well-known Gettier problems of knowledge (which cannot be discussed here; see Gettier 1963; Olsson 2015), and second, the search for ways to avoid the challenge of skepticism. Some authors tried to avoid Gettier counterexamples by strengthening the internal condition of justification, and others simply replaced the internalist notion of justification with an external concept of justification.

The most general externalist concept of justification has been proposed by Goldman (1986). He characterizes a belief as externally justified iff this belief was formed by a cognitive process that is reliable in our world, which means that under the "relevant" circumstances this process leads to true beliefs with high objective probability. Goldman's account is also called reliability externalism.⁴

The crucial difference between internalism and externalism in regard to the problem of induction is this: if the skeptical philosophers are right that a justification of induction cannot be given, not even by "epistemic experts," then according to justification internalism it follows that we cannot attribute inductive knowledge to any epistemic subject, and the unavoidable consequence is skepticism. In contrast, according to justification externalism we may still "have" a sort of justification of induction—namely, exactly if our world is inductively uniform over time—although nobody could know whether we do have such a justification. As a consequence, the necessity of providing higher order justifications disappears in externalistic accounts. The burden of this task is shifted from the epistemic subject to the external world. It is not really the subject who "has" the justification; rather, a possibly inaccessible part of the external world takes care of it.

The problem of purely externalist justifications is their lack of cognitive accessibility, which deprives them of their meliorative function. We illustrate this point by considering the externalist treatment of the rule-circular "justification" of induction and anti-induction. The stance toward

4. The "relevant" circumstances can either mean the "actual" circumstances (as in Goldman 1979) or the "normal" circumstances (as in Goldman 1986). This ambiguity is irrelevant for our considerations.


rule-circular justifications highlights the characteristic differences between coherentism, foundation "orientism," and externalism:

• For coherentist positions, both arguments are acceptable justifications, which was criticized previously as unacceptable.

• For foundation-oriented internalism, both arguments are pseudojustifications.

• Justification externalism leads to a third view: at most one of these two arguments can be acceptable as an externalist justification, but which one is acceptable depends on (possibly) inaccessible external facts.

An example in point is van Cleve's externalist version of the inductive justification of induction. Van Cleve (1984, 562) makes the externalist "correctness" of this argument dependent on the truth of a general fact—the reliability of induction—that is not stated as a premise and may be epistemically inaccessible. He wrote, "The antecedent on which this [justification] depends—that induction is reliable … is an external antecedent. It makes knowledge possible not by being known, but by being true." As a consequence of this position, van Cleve has to accept not only the left rule-circular argument in favor of induction, but also the right rule-circular argument in favor of anti-induction:

Inductivist:
Past inductions have been successful.
Therefore, by rule of induction:
Inductions will be successful in the future.

Anti-inductivist:
Past anti-inductions have not been successful.
Therefore, by rule of anti-induction:
Anti-inductions will be successful in the future.

The internalist concludes from the symmetry that both "justifications" are epistemically worthless. In contrast, for the externalist both justifications are correct in the following sense:

The circular justification of induction is correct in worlds where inductive inferences are reliable.

The circular justification of anti-induction is correct in worlds where anti-inductive inferences are reliable.

However, without possessing a cognitively accessible higher order justification of induction one cannot possibly know that the left argument is externalistically correct and the right one incorrect. Note that this consequence



does not rebut the externalist position. Externalists admit that externalist justifications do not satisfy two intuitive principles that are satisfied by internalist accounts: the KK principle (knowing p entails knowing that one knows that p) and the JJ principle (justifiedly believing p entails possessing a justification that one has a justification for p). However, it is precisely the failure of these two principles that deprives externalist justifications of their meliorative function. Our criticism of justification externalism differs from the standard criticisms saying that justification externalism violates philosophical intuitions (BonJour 2003, 27; Feldman 2005). Intuitions are controversial, but our point about lack of epistemic usefulness is independent of intuitions.

As a further illustration, consider the following example in which an empirical scientist is confronted with a religious fundamentalist, with an analysis from both the internalist and externalist viewpoints.

The empirical scientist says: Life is the result of evolution; I conclude this from the empirical evidence by induction or abduction.

The religious fundamentalist says: Life has been created by an omnipotent God; I conclude this from the way in which God speaks to me.

The externalist analysis of the scientist's claim: This is knowledge if it was caused by evidence via a reliable cognitive mechanism—though I do not know whether this is the case. Of the fundamentalist's claim: This is knowledge if it was caused by this God in a reliable way—though I do not know whether this is the case.

The internalist analysis: Scientific induction/abduction can be (higher-order) justified as being reliable, or at least as being more reliable than blind faith in God.

Pure externalism has to remain neutral regarding the knowledge claims of the opposing camps, but the internalist can point out that only the first but not the second knowledge claim can be justified as being reliable—provided there exist tenable higher order justifications of inductive and abductive arguments.

To avoid misunderstandings, two remarks are in order. (1) Presumably most externalists would agree that scientifically justified beliefs are more reliable than beliefs in God. But they cannot justifiably assert this qua externalists, as there is nothing in the externalist conception of knowledge that


would justify or explain why beliefs based on science are more reliable than beliefs based on religious worldviews. (2) When we criticize fundamentalist religions we do not intend to denounce all kinds of religions. As we will argue in section 8.1.2, there exists no a priori reason to regard religious worldviews as irrational so long as they allow their assertions to be critically tested by experience and meta-induction. However, most religious worldviews reject critical testing and consider doubt to be a sin (more on this in section 8.1.4). Throughout this book we use the rejection of meta-inductive self-tests as the defining criterion of fundamentalist systems of thought.

Whether internalist foundation-oriented approaches can do better than externalist approaches depends on whether higher order justifications of inductive and abductive scientific confirmation procedures are possible. If this is not the case, then the internalist account is forced into skeptical conclusions and would thereby lose its meliorative function as well. In conclusion, if the skeptical problems involving Cartesian or Humean demons can be solved, internalism is the preferable position. If not, externalism gives us an ersatz conception of knowledge, but at the price of being useless under precisely those conditions under which the internalist is defeated by skeptical objections.

3.5  The Necessity of Reliability Indicators for the Social Spread of Knowledge

In this final section we sketch the significance of the human capability of providing justifications of our beliefs for the cultural evolution of human belief systems. In cultural evolution, acquired beliefs or behaviors (as opposed to genetically determined traits) are transmitted from generation to generation (Boyd and Richerson 1985; Mesoudi, Whiten, and Laland 2006). For the spread of true as opposed to erroneous beliefs, it is of utmost importance that reliably produced beliefs (i.e., beliefs with high truth chances) reproduce faster than unreliably produced beliefs (e.g., beliefs based on blind faith). For this purpose, reliably produced beliefs have to be recognizable to the members of the population by means of reliability indicators—good reasons and corresponding arguments for the truth of the communicated beliefs (Craig 1990).

Assume a premodern population with a subpopulation of purported information providers (medicine men, priests, etc.), out of which only 10 percent are truly reliable informants who base their information on empirical induction instead of on superstition or religious faith. The situation is illustrated in figure 3.3.



Figure 3.3
The problem of social spread of reliable information (nested circles: the reliable informants form a small subset of the purported informants, who in turn form a subset of the whole population of epistemic subjects).

As long as the members of the population cannot discriminate between reliable informants (dark circle in figure 3.3) and alleged ones (light circle), the reliable information provided by the true experts in the dark circle cannot spread through society. In section 10.2 this fact will be proved by means of a computer simulation of a social network in which individuals imitate the beliefs (or predictions) of other individuals based on indicators of the reliability of their beliefs. In the simplest case, these indicators are records of past successes evaluated by the method of meta-induction. It will be shown that if and only if the imitation criteria are probabilistically correlated with the future success rates of the imitated believers (predictors), the knowledge of a small percentage of experts can spread to the entire society. However, the ability to discriminate between reliable and unreliable informants requires justifications in the sense of cognitively accessible reliability indicators. In other words, this consideration gives us a strong argument for justification internalism that is not based on philosophical intuition (as in standard defenses of internalism) but on the conditions for the social spread of knowledge in cultural evolution.

3.6  Conclusion: A Plea for Foundation-Oriented Epistemology

Let us summarize. In sections 3.1 through 3.2 we argued that the most natural attempt at laying down universal rational standards for knowledge is the framework of foundation-oriented epistemology, whose development began in the Enlightenment era of philosophy, with important forerunners in antiquity. For this program, the solution to the problem of induction is essential, as the justification of nonbasic beliefs by basic beliefs must be based on inductive (and abductive) inferences. Therefore, the objections of


skeptical philosophers, who intend to demonstrate that this task is unsolvable, are a major challenge for the foundation-oriented program. In sections 3.3 through 3.4 we discussed two contemporary alternative positions that try to circumvent the skeptical challenges by weakening the standards of justification: coherentism, which accepts circular justifications, and externalism, which abolishes the demand for higher order justifications by making justification depend on possibly inaccessible external facts. We have argued that both accounts are unacceptable.

If one's philosophy is oriented toward the goals of meliorative epistemology, one should instead try to improve the foundation-oriented epistemological program. This is what we intend to do in this book: develop a new strategy of constructing higher order justifications for inductive argument patterns. These higher order justifications do not attempt to prove directly that induction is reliable—something that cannot be done, as Hume's arguments clearly established—but rather that induction is optimal (and even dominant) in regard to the goal of reliability: it is the best among all accessible methods. We call this approach meta-induction, and we will begin its development in chapter 5. Before we get to meta-induction, we will inspect the large variety of probabilistic approaches to Hume's problem in chapter 4. This will give us important insights and foundations for the development of our own approach.

4  Are Probabilistic Justifications of Induction Possible?

We recommend recapitulating the basic concepts and principles of statistical and subjective probability theory before reading this chapter. Recall from section 1.1 that "p(Fx)" (or "p(F)" for short) stands for the statistical probability of a type of (repeatable) event or state of affairs Fx, while "P(Fa)" denotes the subjective (or epistemic) probability of a particular event or state of affairs Fa. While P is defined over a given space of possible worlds (W) with propositions being understood as sets of possible worlds, p is defined over a given domain of individuals (D), each individual representing a possible outcome of an underlying random experiment. More precisely, for each n ∈ ℕ, p: AL(D^n) → [0,1] is defined over the algebra AL(D^n) of those subsets of the n-fold Cartesian product of the domain D that are expressible by open formulas of the given language. In contrast, P: AL(W) → [0,1] is defined over the algebra AL(W) of (linguistically expressible) propositions over the set of epistemically possible worlds W. If D^n or W, respectively, are continuous and given by a real-valued interval [a,b], then AL(D^n) or AL(W) respectively is the Borel algebra over [a,b], Bo([a,b]), which is the closure of all subintervals of [a,b] under complement, countable union, and intersection.

4.1  Why Genuine Confirmation Needs Induction Axioms

Probabilistic approaches to Hume's problem have in common their attempt to demonstrate that inductive inferences lead from true premises to true conclusions with high probability. First of all, we have to carve out a basic ambiguity of this claim: its content crucially depends on whether probability is interpreted in the objective-statistical or the subjective-epistemic sense. If probability is interpreted in the first sense, then the probability distribution is defined, for each possible world, over the domain of its objects. The claim then asserts that in all possible worlds inductive inferences are truth preserving with a high frequency or frequency limit. This claim immediately



falls prey to Hume's objection according to which that is not the case in radically nonuniform worlds. On the other hand, if probability is interpreted subjectively, then the probability distribution is defined over the space of all possible worlds, and the claim asserts that according to our subjective expectation it is probable that our world lies in the subset of those possible worlds in which inductive projections are successful. Here the question is why this subjective expectation is justified.

Whether probabilities are interpreted as objective or as subjective, we cannot establish that induction is "probably successful" without making inductive assumptions. We will first demonstrate this point for objective and then for subjective probability measures. Concerning objective probabilities, recall Hume's argument stating that inductions can only be successful in the statistical sense if the so far observed frequencies can be inductively projected into the future, which is itself an inductive assumption. To make Hume's point formally precise, recall that the objective probabilities of given types of events (F) manifest themselves in the fact that their relative frequencies freqn(F) in event sequences (e1, … ,en) of increasing length n converge to certain frequency limits, which are identified with their statistical probabilities: p(F) =def limn→∞ freqn(F). Recall that the convergence condition p(F) =def limn→∞ freqn(F) means by definition that for every arbitrarily small but positive ε there exists an m such that for all n ≥ m the F-frequency at position n deviates from p(F) by not more than ε (|freqn(F) − p(F)| ≤ ε). What is important is that convergence of finite frequencies to limits is not at all logically or metaphysically necessary, but rather a weak inductive uniformity assumption (examples of nonconverging sequences are given in section 6.3).

Moreover, the event sequences are usually assumed to be random sequences in the sense of von Mises (1964), which means that their limiting frequencies are insensitive to outcome-independent place selections. This implies the law of statistical independence: if "x1" and "x2" denote the outcome of two repetitions of the same random experiment, then p(Fx1 ∧ Gx2) = p(Fx1) • p(Gx2) (Schurz 2013, section 3.13.3). Distributions obeying this principle are called IIDs, for "identical independent distributions." If, on the other hand, the consecutive outcomes of a random experiment are probabilistically dependent on each other, the sequence is called a Markov chain. A well-known consequence of the principle of statistical independence is the binomial formula:

p(freqn(F) = k/n) = (n choose k) • p^k • (1 − p)^(n−k), where p = p(F).

In other words, the probability of n-element random samples with k Fs in them is n-choose-k times F's probability to the power of k times non-F's probability to the power of n-minus-k. Here, (n choose k) =def n • (n − 1) • … • (n − k + 1) / k!


is the number of possibilities of choosing k from n individuals (and k! = 1 • 2 • … • (k − 1) • k). For example, the probability to obtain three sixes in 10 dice rolls is ((10 • 9 • 8)/(1 • 2 • 3)) • (1/6)^3 • (5/6)^7 = 0.155. For n → ∞, p(freqn(F)) approximates zero for freqn(F) ≠ p(F) and has an infinitely steep peak for freqn(F) = p(F) (see section 4.4). Derivable from this are the laws of large numbers:1

1. The weak law of large numbers says that for every arbitrarily small ε > 0, the probability that freqn(F) deviates from p(F) by less than ε goes to one for n → ∞ (∀ε > 0: limn→∞ p(|freqn(F) − p(F)| < ε) = 1). […]

[…] P(H*1) and P(H*2|E) […] (t ≤ tk ∧ Gxt) ∨ (t > tk ∧ Bxt))—that is, an emerald that is grue2 at all times changes its color from green to blue at time tk.

Carnap (1947, 146; [1928] 1967, 211) argued that inductively projectible predicates must be qualitative, which means that they are either primitive or (if they are defined) their definition does not contain logically essential individual constants referring to particular individuals or space-time points. Predicates whose definitions contain such a reference are called positional predicates. The reason why we should inductively project only qualitative and not positional properties is simple: induction consists in transferring the same properties from the observed to the unobserved. To formulate rules of induction we must know what has remained invariant in our observations and what has changed. This depends on the qualitative predicates, as we assume that different instances of the same qualitative predicate refer ontologically to the same property. Positional properties are pseudo-properties, because when we project "grue" from instances (observed) before tk to instances (observed) after tk, we performed not an inductive but an anti-inductive inference with respect to the ontologically "real" properties "green" and "blue." To unify Carnap's rejection of positional predicates ("reason 2") and the rejection of the observation-predicates ("reason 1") in one requirement, we have definition 4.1.



(Definition 4.1)  Qualitative predicates
A predicate Fx is qualitative (in the extended sense) iff for every individual constant a, Fa does not analytically imply anything about (i) the individual identity or the spatiotemporal location of a, or (ii) the question of whether a has been observed or not.2

Definition 4.1(i) is the standard definition of qualitative predicates. It is easy to prove that for qualitative predicates in the (extended) sense of definition 4.1, Goodman's paradox cannot arise. To prove this we assume all defined predicates are replaced by their definition, so analytic consequence is reduced to logical consequence. Concerning definition 4.1(i), let us assume that C(a) ∧ DF(a) is the conjunction of some qualitative condition C (e.g., "x is an emerald and is green") and the definition DF of a complex but qualitative predicate F, applied to an observed individual a. By assumption C(a) ∧ DF(a) is consistent. Because F is qualitative in the sense of definition 4.1(i), C(a) ∧ DF(a) does not contain any individual constant besides a. This implies that for any (unobserved) individual constant b, the formula C(b) ∧ DF(b) must be logically consistent, too.3 Concerning definition 4.1(ii), assume that a but not b has been observed. Then (with "O(x)" for "x is observed") both C(a) ∧ DF(a) ∧ O(a) and C(b) ∧ DF(b) ∧ ¬O(b) must be logically consistent, because by definition 4.1(ii) C(x) ∧ DF(x) neither entails O(x) nor ¬O(x).

Definition 4.1 is relative to a given linguistic framework, with a given set of primitive predicates that are assumed to represent qualitative properties. Under this assumption, the restriction of prediction and generalization rules to qualitative predicates works fine. Goodman, however, objected to the existence of a language-independent criterion for demarcating qualitative

2. More precisely: If "Def" is the set of accepted definitions, then (i) means that Def ∪ {Fa} does not logically imply a sentence of the form "a = b1 ∨ … ∨ a = bn" (the bi being distinct from a) or of the form "(¬)t ≤ r" (t being the temporal location of a, and r a numerical constant) (Schurz 2013, def. 6.6-1). Moreover, (ii) demands that Def ∪ {Fa} does not imply "(¬)Oa."
3. Proof: C(a) ∧ DF(a) = (C(x) ∧ DF(x))[a/x], where by the assumption of qualitative predicates, "C(x) ∧ DF(x)" contains no further occurrence of a (for Goodman-predicates such as "(x = a ∧ Fx) ∨ (¬x = a ∧ ¬Fx)" this condition is violated). We show that also C(b) ∧ DF(b) is consistent. If not, then by the rule of ∀-generalization, ⊨ ¬(C(b) ∧ DF(b)) implies ⊨ ∀x¬(C(x) ∧ DF(x)), which implies ⊨ ¬(C(a) ∧ DF(a)), contradicting the assumption that C(a) ∧ DF(a) is consistent.


and positional (i.e., nonqualitative) properties. Thus, Goodman's new riddle of induction splits up into two distinct parts:

1. The first part of Goodman's riddle consists in the question of why we should prefer inductive to anti-inductive projections of the (observed) qualitative properties of our linguistic framework. This is nothing but a variant of Hume's problem, which is the focus of this book.

2. The second part of Goodman's riddle consists in the problem of the apparent language dependence of qualitative properties. This problem can be traced to the fact that one can construct back-and-forth translations shifting from our ordinary language (L) to a Goodmanian language (L*), which has the same expressive power as L but uses "grue" (G*) and "bleen" (B*) as primitive predicates, as shown in (4.3).

(4.3)  Language-dependence of inductive confirmation (Goodman's paradox)
G is for "green," B for "blue," G* for "grue," B* for "bleen," "F" for "emerald," and Tx for "x is temporally located before (future) time tk":

Languages:
  L (T, F, G, B)  |  L* (T, F, G*, B*)

Analytic definitions:
  In L:  G*x ↔def (Tx ∧ Gx) ∨ (¬Tx ∧ Bx)  |  In L*:  Gx ↔def (Tx ∧ G*x) ∨ (¬Tx ∧ B*x)
  In L:  B*x ↔def (Tx ∧ Bx) ∨ (¬Tx ∧ Gx)  |  In L*:  Bx ↔def (Tx ∧ B*x) ∨ (¬Tx ∧ G*x)

Analytically equivalent translations between L and L*:
  … of the evidence:
    E = {Fai ∧ Gai ∧ Tai : 1 ≤ i ≤ n}  |  E* = {Fai ∧ G*ai ∧ Tai : 1 ≤ i ≤ n}
  … of hypotheses:
    H1 = ∀x(Fx → Gx)  |  H1* = ∀x(Fx → ((Tx ∧ G*x) ∨ (¬Tx ∧ B*x)))
    H2 = ∀x(Fx → ((Tx ∧ Gx) ∨ (¬Tx ∧ Bx)))  |  H2* = ∀x(Fx → G*x)

Inverted results on inductive confirmation:
  E confirms H1 and disconfirms H2  |  E* disconfirms H1* and confirms H2*
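To make the inverted projections concrete, here is a minimal Python sketch (my own illustration, not from the book) of the two-color world of (4.3): every emerald observed before tk is both green and grue, yet projecting "green" and projecting "grue" yield opposite predictions about emeralds after tk.

```python
# Hypothetical illustration of (4.3): "grue" and "bleen" defined from
# "green"/"blue" relative to the subversion time tk.

def is_grue(color: str, before_tk: bool) -> bool:
    # G*x <->def (Tx AND Gx) OR (not-Tx AND Bx)
    return (before_tk and color == "green") or (not before_tk and color == "blue")

def is_bleen(color: str, before_tk: bool) -> bool:
    # B*x <->def (Tx AND Bx) OR (not-Tx AND Gx)
    return (before_tk and color == "blue") or (not before_tk and color == "green")

# Evidence E: all emeralds observed so far (before tk) are green,
# and hence -- by the definitions -- all of them are also grue:
observed_colors = ["green"] * 10
assert all(is_grue(c, before_tk=True) for c in observed_colors)

# Projecting "green" (H1) predicts a green emerald after tk; projecting
# "grue" (H2*) predicts a grue emerald after tk, which by definition is blue:
for color in ("green", "blue"):
    print(color, "emerald after tk is grue:", is_grue(color, before_tk=False))
# -> green emerald after tk is grue: False
#    blue emerald after tk is grue: True
```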

If we treat both languages as "equally justified," then inductive confirmation turns out to be language dependent. A similar problem of language dependence arises in other areas in philosophy of science, such as truthlikeness or causality. In the rest of this section, we briefly explain a possible



solution to this problem, as described in previous works (Schurz 2013, section 5.11.3; see also 2015a). We start with Carnap's proposal that inductive confirmation relations should be restricted to qualitative predicates in the sense of definition 4.1. Logically speaking, the "qualitativity" of a predicate is language dependent: in the language system L, G and B are qualitative and G* and B* positional predicates, and in L* it is the other way around. However, for observation predicates there exists a language-independent criterion for the qualitativity of a primitive predicate, based on the inspection of the process of ostensive learning of the corresponding property.

An ostensive learning experiment consists of a training phase and a test phase. In the training phase, the concept is introduced to the test person with an artificial word "X" unknown to them. "X" is illustrated to the test person by means of a set of positive and negative instances of the concept (in the form of photos, movies, or real objects), and the information "this is an X" or "this is not an X" is given. Whether the test person has successfully learned the concept is investigated in the following test phase, in which a series of new positive and negative instances of the concept is presented, and the question "Is this an X?" is asked. In Schurz (2013, section 5.11.3; 2015a) it is argued that, in contrast to theoretical concepts, typical observation concepts can be learned from a small set of learning instances by more or less all persons, independently of their background culture and language.

The ostensive learning process of an observation predicate may also be used to differentiate between qualitative and positional properties: To learn a qualitative predicate (e.g., a color) from a couple of positive and negative instances does not require any positional information concerning these instances. In contrast, a positional property such as Goodman's "grue" can only be learned by ostension (if it can be learned at all) under two conditions: (1) the training instances (pictures of grue emeralds) must contain information about the time the instance was recorded (e.g., by means of a calendar), and (2) they have to include instances before and after the future "subversion" time tk. We thus arrive at the empirically testable criterion.

(4.4)  Language-independent criterion for qualitative observation predicates
A (primitive) predicate is a qualitative observation predicate iff it can be learned by pure ostension without the training phase's exemplars containing any information about the positional location of the exemplar.


Admittedly, this demarcation criterion can only be applied to observational but not to theoretical properties. But once we have established a criterion of qualitativity for the empirical sublanguage, we can set up preference criteria for our choice of primitive theoretical predicates via the ability of the corresponding theories to unify observational facts formulated in terms of qualitative predicates. Whether this solution is accepted or not, we emphasize that the problem of language dependence is independent of Hume's problem. In the remainder of this book we will set aside Goodman's problem by assuming that the considered methods of prediction are always applied to a given system of qualitative predicates, and Hume's problem consists of the question of whether and why inductive rather than other methods of prediction should be preferred.

4.3  Statistical Principal Principle and Narrowest Reference Classes

In section 4.1 we saw that the assumption that statistical probabilities exist is already a weak inductive principle. Therefore, we should not be amazed by the fact that certain ways of establishing bridge principles between statistical and subjective probabilities are equivalent to induction principles for subjective probabilities that go beyond the basic probability axioms. This will be shown in the next subsection, whereas this subsection is devoted to the explication of these bridge principles.

The most important bridge principle between epistemic and objective probabilities has been called the principal principle (PP) by David Lewis (1986). Lewis, however, formulated this principle to hold between subjective and objective "single-case" probabilities, which he calls "chances." Lewis's PP is technically easier to formulate than the statistical PP because single-case chances are, like subjective probabilities, defined over token events or states of affairs expressed by closed formulas, as distinct from statistical probabilities, which are defined over event types expressed by open formulas. However, it seems to me that strictly singular chances, like the probability of this throw of this coin to land on heads (as opposed to other throws), are epistemically inaccessible metaphysical constructions. Therefore, we are inclined to reject the notion of "objective single-case propensities." The only way to make them epistemically accessible is to connect them with the frequencies of corresponding repeatable event types. This is done by the concept of objective-statistical probability. In what follows we formulate an analogue to Lewis's PP, which we call the statistical principal principle, or StPP (Schurz 2013, section 3.13.5). The StPP goes (indirectly) back to de Finetti and has been defended, for example, by von Kutschera



(1972, 82), Howson and Urbach (1996, 345), Williamson (2013), and Strevens (2004), who calls it the "probability coordination principle."

(Definition 4.2)  Statistical principal principle (StPP)
Let H be a statistical hypothesis that implies p(Gx|Fx) = r (given the basic probability axioms). Let P be an epistemic prior probability distribution that is independent of any evidence involving the individual a, where a ∉ {b1, … ,bn} ("admissibility condition"). Then:

(a) P(Ga | Fa ∧ H ∧ E(b1, … ,bn)) = r.
In words: Ga's epistemic probability conditional on Fa and statistical assumptions entailing p(G|F) = r (plus assumptions about individuals different from a) is r.

(b) Special case: P(Ga | p(Gx) = r) = r (obtained when H = "p(Gx|Fx) = r" and Fx and E(x1, … ,xn) are tautologous predicates).

(c) StPP for random samples:
P(freq(G|F : {a1, … ,an}) = k/n | p(Gx|Fx) = r) = (n choose k) • r^k • (1 − r)^(n−k),
where "freq(G|F : {a1, … ,an})" is the relative frequency of Gs in the particular sample of F-individuals a1, … ,an. In other words, the epistemic probability that there are k Gs in this sample of n Fs is given by the binomial formula.
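As a quick numerical check, the following sketch (mine, not the book's) evaluates the binomial probability of definition 4.2(c) and reproduces the dice example of section 4.1 (three sixes in ten rolls):

```python
# Minimal sketch: the binomial probability of definition 4.2(c),
# P(k Gs among n Fs | p(G|F) = r) = (n choose k) * r^k * (1 - r)^(n - k).
from math import comb

def binomial_prob(n: int, k: int, r: float) -> float:
    return comb(n, k) * r**k * (1 - r)**(n - k)

# The dice example from section 4.1: three sixes in ten rolls.
print(round(binomial_prob(10, 3, 1/6), 3))  # -> 0.155
```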

Definition 4.2a explicates the StPP for (possibly complex) monadic predicates, involving only one individual term as their argument.4 "Gai" may also have the form "Gf(ai)," where f(ai) designates an individual at a time point different from that of ai. The application to random samples in definition 4.2c is an easy consequence of 4.2a obtained by applying the binomial formula (recall section 4.1). The StPP is of particular importance for Bayesian statistics. Here, Bayes's theorem is used to compute the subjective probability P(H|E) of a (statistical) hypothesis H given a piece of evidence E as follows.

4. A generalization of definition 4.2(a) to relational predicates is found in Schurz (2015b, section 7.1, 7-1). Note that the round brackets "E(b1, … ,bn)" express that b1, … ,bn are all individual constants contained in the formula E. To express that a1, … ,an are some of the individual constants contained in E, one uses square brackets: "E[a1, … ,an]."


(4.5)  Bayesian statistics
P(H|E) = P(E|H) • P(H) / P(E)  (according to Bayes's rule)
            = pH(E*) • P(H) / P(E)  (by the StPP),

where E* is "E" with individual constants bijectively replaced by variables and pH(E*) is the statistical probability of E* entailed by H. The inverse probability P(E|H) is the so-called likelihood (of H given E). For Bayesians it is essential that likelihoods are objective—are equated with the corresponding statistical probabilities pH(E*)—because only then can the subjective probabilities of hypotheses, when the evidence accumulates, converge to intersubjective (or objective) probabilities (Hawthorne 2005, 286; Strevens 2004; and later in this chapter).

An important precondition of the StPP is the following: the epistemic probability measure P must be "prior" in regard to the individual(s) a(i) to which the StPP is applied, in the sense that this measure must not implicitly rely on further experiences about a(i) on which the statistics are not conditionalized. For example, if we have observed that the coin has fallen on heads (= Fa), then our actual degree of belief in Fa is one (or almost one), P(Fa) = 1 (or P(Fa) = 0.99), even if we know that p(Fx) = 1/2. So P(Fa | p(Fx) = 1/2) = 1 (or = 0.99), which contradicts the StPP. More generally, we can only reasonably identify the epistemic degree of belief in Fa given Ga with Fx's statistical probability given Gx if this epistemic probability is not already constrained by particular evidence concerning a. For the same reason, the StPP may only be applied conditional on some "remainder" evidence E(b1, … ,bn) if it satisfies the admissibility condition mentioned in definition 4.2: E(b1, … ,bn) must not contain any of the individuals a(i) to which the StPP is applied. Schurz (2015b, section 7.1, prop. 7.1) proves that under these conditions the StPP is coherent: for every statistical probability function p there exists a corresponding subjective probability function P that satisfies the StPP with regard to p.

The StPP determines only the subjective prior probability of singular sentences—that is, sentences containing individual constants that can be replaced by individual variables—but not that of quantified sentences without individual constants, such as "all Fs are Gs" or "90 percent of all Fs are Gs." Within the Bayesian framework, the prior probability of quantified sentences or hypotheses (Hi) is assumed to be "somehow given." The posterior probability of a hypothesis Hi, given empirical evidence, is computed from these prior probabilities with the help of the Bayesian formula:

P(Hi|E) = P(E|Hi) • P(Hi) / ∑1≤i≤n P(E|Hi) • P(Hi),



where {H1, … ,Hn} is a partition of possible hypotheses—that is, a logically exhaustive set of mutually exclusive hypotheses H1, … ,Hn; the "likelihood" P(E|Hi) is determined by the StPP for random samples (definition 4.2c). The partition may also be continuous, such as {Hr : r ∈ [0,1]} with Hr =def "p(Fx) = r," in which case the sum is replaced by an integral (∫r r • dP(r)).

Moreover, the StPP determines the subjective probability only for those singular sentences whose corresponding statistical probability is known. However, if one assumes a prior distribution P over a partition of possible statistical hypotheses, then by the StPP the prior probability of all other singular sentences is determined as the subjective expectation value of corresponding statistical probabilities as shown in (4.6) (where "b1−n" abbreviates "b1, … ,bn").

(4.6)  Subjective probabilities as expectations of statistical probabilities
P(Fa|E(b1−n)) = ∑j∈J P(Fa|Hj ∧ E(b1−n)) • P(Hj|E(b1−n)) = ∑j∈J rj • P(Hj|E(b1−n)) (by StPP),
where a ∉ {b1−n}, Hj = "p(Fx) = rj," and {Hj : j ∈ J} is the partition of all possible statistical hypotheses of this form whose disjunction has subjective probability 1.
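A minimal numerical sketch of (4.6) (the partition and the priors are invented for illustration): with a discrete partition of statistical hypotheses, the subjective probability of Fa is just the prior-weighted average of the possible statistical probabilities.

```python
# Hypothetical discrete partition: statistical probability r_j -> prior P(H_j).
hypotheses = {0.2: 0.3, 0.5: 0.4, 0.8: 0.3}   # priors sum to 1

# Equation (4.6): P(Fa) = sum_j r_j * P(H_j), the expectation of the r_j.
p_Fa = sum(r * prior for r, prior in hypotheses.items())
print(p_Fa)  # -> 0.5 (here: 0.2*0.3 + 0.5*0.4 + 0.8*0.3)
```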

In the final part of this section we briefly explain the relation between subjective prior probabilities and actual probabilities. We designate the prior probability with "P" or "P0" ("0" being an initial time point) and the actual degree of belief function at a given time t by Pt. The relation between Pt and P0 is standardly described by the principle of conditionalization (Carnap 1971b, 18; Howson and Urbach 1996, 102ff.).

(Definition 4.3)  Rule of strict conditionalization
Let P0 be the prior probability (of the given subject) at time t0, let Pt be the actual probability at time t, and let K0-t be the total (singular and statistical) knowledge acquired between t0 and t. Then for every proposition S: Pt(S) = P0(S|K0-t).5

The thus formulated rule of strict conditionalization is a generalization of Reichenbach’s principle of the narrowest reference class, which is the most basic bridge principle relating actual subjective probabilities with statistical probabilities (Reichenbach 1949, §72). Let S be the singular sentence Fa. Then the total knowledge K0-t in definition 4.3 can be restricted to the conjunction of all known facts about the individual a—this is Reichenbach’s

5. A generalization of strict conditionalization is Jeffrey conditionalization (Howson and Urbach 1996, 105).


"narrowest reference class" designated by Rax—plus the statistical hypothesis about the probability p(Fx|Rax). Hence, definition 4.3 simplifies to (4.7).

(4.7)  Reichenbach's principle of the narrowest reference class
The subjective probability of event Fa is equated with the (estimated) statistical probability of the corresponding event type Fx in the narrowest (nomological) reference class Rax within which we know that a lies. In formulas: Pt(Fa) = p(Fx|Rax).
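The rule lends itself to a simple procedural reading. The following toy sketch (with invented reference classes and frequencies) conditions a weather prediction on the narrowest reference class the day in question is known to fall under:

```python
# Toy sketch of Reichenbach's principle: among the reference classes that
# tomorrow is known to fall under, use the statistical probability of rain
# in the narrowest one. Extensions and frequencies are hypothetical.
known_classes = {
    # reference class: (size of extension, p(rain | class))
    "day":                         (100000, 0.30),
    "autumn day":                  (25000,  0.45),
    "day preceded by a rainy day": (8000,   0.75),
}

narrowest = min(known_classes, key=lambda c: known_classes[c][0])
p_rain = known_classes[narrowest][1]
print(narrowest, "->", p_rain)  # -> day preceded by a rainy day -> 0.75
```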

The principle of the narrowest class of reference is widely used in everyday life and in science. For example, the weather forecast "the probability that it is going to rain tomorrow is 75 percent" has, according to Reichenbach's principle, the following interpretation: the statistical probability that it is going to rain on a day that is preceded by a day with similar weather patterns as today is 75 percent. The rationality of Reichenbach's principle in terms of maximizing predictive success has been proved by Good (1983, 178ff.): He showed that by conditionalizing one's probability-based predictions or actions on narrowest reference classes one can only improve but never diminish one's predictive success. In section 8.2 we will generalize Good's proof to prediction games.

4.4  Statistical Principal Principle and Exchangeability as Weak Induction Axioms

Condition (4.6) defines the prior probability of a singular proposition Fa as a mixture or expectation value of statistical probabilities. The thus-defined prior probabilities have a fundamental property that was first formulated by de Finetti: they satisfy the axiom of exchangeability (or the equivalent axiom of symmetry of Carnap 1971a, 117ff.). We assume that P is defined over the propositions of a first-order language with an infinite sequence I = (ci : i ∈ ℕ) of standard names for individuals: that is, ci = ck only if i = k ("ℕ" for the set of natural numbers). Then P is called exchangeable iff P is invariant with respect to arbitrary permutations of the constants in I—that is, P(A(a1, … ,an)) = P(A(aπ(1), … ,aπ(n))) holds for every permutation function π: ℕ → ℕ over the natural numbers (see Earman 1992, 89; Gillies 2000, 71). Exchangeability is an obvious consequence of equation (4.6), which implies P(Fai) = P(Faj) for arbitrary i,j ∈ ℕ, because the definiens ∑j∈J rj • P(Hj) does not depend on any individual constants ai. De Finetti ([1937] 1964) has proved a famous representation theorem, according to which every exchangeable subjective probability function is identical with a probabilistic expectation value of (independent) statistical probability distributions in the sense



of equation (4.6). Moreover, equation (4.6) is equivalent to the StPP plus the assumption that with subjective probability 1 the property Fx possesses a frequency limit in the sequence I. We summarize these facts in proposition 4.2.

(Proposition 4.2)  Exchangeable probability functions
Let L be a first-order language with an infinite sequence of standard names I, and Sent(L) and Form(L) be the sets of sentences or open formulas, respectively, expressible in L. Then the following conditions are equivalent:
(1) P over Sent(L) is exchangeable.
(2) For every formula Fai, P is representable as P-expectation value of statistical probability functions p over Form(L), as in (4.6).
(3) (i) With P = 1 every property (event type) expressed by a formula Fx in Form(L) has a frequency limit p(Fx) in I, and (ii) P and p satisfy the StPP (the constants in I playing the role of the "ai's" in definition 4.2).6
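One direction of proposition 4.2 is easy to check numerically. The sketch below (my own, with an invented two-hypothesis prior) computes sequence probabilities as a de Finetti mixture of Bernoulli IIDs, as in (4.6), and verifies that they are invariant under permutation, that is, exchangeable:

```python
# A de Finetti mixture of Bernoulli IIDs is exchangeable: the probability of
# a 0/1 sequence depends only on the number of 1s, not on their order.
from itertools import permutations

priors = {0.3: 0.5, 0.9: 0.5}   # hypothetical r_j -> P(H_j)

def seq_prob(seq):
    k, n = sum(seq), len(seq)
    return sum(p * r**k * (1 - r)**(n - k) for r, p in priors.items())

seq = (1, 1, 0, 1)
assert all(abs(seq_prob(seq) - seq_prob(s)) < 1e-12 for s in permutations(seq))
print(seq_prob(seq))
```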

Proof: Appendix 12.1.

The most important fact about these equivalent conditions of proposition 4.2 is their nature as weak probabilistic assumptions of induction. As explained in section 4.1, condition (3)(i)—the assumption that frequencies converge to a limit—is an inductive assumption. Moreover, the StPP transfers frequency tendencies to particular individuals or samples as their subjective expectation values so long as we lack particular experience regarding these individuals. That is only reasonable under the inductive assumption that these individuals or samples are representative for the whole domain or world—that our world is uniform. The condition of exchangeability (proposition 4.2(1)) rests on the assumption that independent of their particular properties the probabilistic tendencies of all individuals are the same. Thinking of "propertyless" individuals as positions in space-time, this amounts to an inductive uniformity assumption for space and time.

The inductive character of the condition of exchangeability is further manifested in the fact that together with the condition of nondogmaticity, exchangeability makes inductive learning possible. Nondogmaticity is

6. (1) ⇐ (2) ⇔ (3) holds also if I is finite; (1) ⇒ (2) holds only approximately in this case (Diaconis and Freedman 1980). To grant that in (2) and (3) the functions "p" can be chosen so that they satisfy statistical independence, one has to assume that P is σ-additive (Spielman 1976).


a further kind of inductive probabilistic assumption: Its leading idea is that prior probabilities of 0 or 1 are "dogmatic" because they make learning from experience impossible, since if P(H) = 0 (or P(H) = 1), then P(H|E) = 0 (or P(H|E) = 1) for every possible experience E. More precisely:

(Definition 4.4)  Nondogmatic (subjective) probability functions
(a) A (subjective) probability function P over a countable space W of possible worlds (or finest propositions) is called nondogmatic iff for all w ∈ W, P(w) ≠ 0, 1.
(b) A (subjective) probability function P over an uncountable possibility space represented by a nonempty interval [a,b] is called nondogmatic iff for every set X in the Borel algebra over [a,b] it holds: P(X) is positive iff X has a positive Borel measure.

Concerning definition 4.4(b), the Borel measure (mBo) is, generally speaking, a measure of geometric content; for an interval it is its length, mBo([a,b]) = (b − a). Thus, a nondogmatic distribution over [a,b] assigns a positive measure to a set of points X ∈ Bo([a,b]) exactly if this set contains an interval with positive length. Sets with Borel-measure zero are, for example, singleton sets {r} (points) and all countable unions of points. Finally, a distribution P is called continuous iff the cumulative distribution function Pcum is differentiable and thus definable as the integral of a corresponding density function D(r): Pcum(r) = ∫_0^r D(x) dx.

(Proposition 4.3)  Inductive learning for predictions
Assume P is exchangeable, nondogmatic, and for (b) in addition continuous. Then:
(a) The degree of inductive support increases continuously with the number of confirming instances ("instantial relevance"). Formally (where "freqn(F)" = the frequency of Fs in a sample of size n):
P(Fan+1 | freqn(F) = (k+1)/n) > P(Fan+1 | freqn(F) = k/n).
(b) Fa's subjective probability converges against Fx's observed frequency when the sample size goes to infinity. Formally:
limn→∞ P(Fan+1 | freqn(F) = [r•n]/n) = r
(where [r•n] is the integer rounding of the real number r•n).

Proof: Appendix 12.2.
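Both parts of proposition 4.3 can be checked numerically with a discretized nondogmatic prior. The sketch below (mine; the grid resolution is an arbitrary choice) computes the predictive probability by Bayes plus the StPP and exhibits instantial relevance as well as convergence to the observed frequency:

```python
# Discretized check of proposition 4.3 with a flat (hence nondogmatic) prior
# over possible frequency limits r.
N = 1000
rs = [(i + 0.5) / N for i in range(N)]
prior = [1.0 / N] * N

def predictive(n, k):
    # P(Fa_{n+1} | freq_n(F) = k/n): posterior expectation of r.
    weights = [p * r**k * (1 - r)**(n - k) for r, p in zip(rs, prior)]
    return sum(r * w for r, w in zip(rs, weights)) / sum(weights)

print(predictive(10, 8) > predictive(10, 7))   # (a) instantial relevance: True
print(round(predictive(10, 8), 3))             # -> 0.75
print(round(predictive(1000, 800), 3))         # (b) approaching the frequency 0.8
```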



The inductive nature of exchangeability has been stressed, for example, by Earman (1992, 108) and Gillies (2000, 73). The fact is important as some authors have argued that exchangeability is an a priori or logical property (e.g., van Fraassen 1989, chap. 7; Carnap 1971). However, it is easy to come up with situations in which it is reasonable to give up exchangeability: for example, when we recognize that after some time t, the frequency tendencies of a random experiment have significantly changed; t is thus a nonuniform "subversion" point and event-indices before and after time t are no longer exchangeable (for further counterexamples see Gillies 2000, 69−83). That the exchangeability condition is not reasonable "a priori" can also be seen from the fact that it can only be applied reasonably to qualitative predicates in the sense of section 4.2, but not to Goodman-type positional predicates; otherwise, it would produce contradictions. To see this, assume a1, … ,an are emeralds observed before the future time tk and b is an emerald not observed before tk (and we assume these facts are known—that is, that their probability is 1). Then assuming the exchangeability and nondogmaticity of P for both predicates G and G* would imply by proposition 4.3(a): (i) P(Gb|Ga1 ∧ … ∧ Gan) > P(Gb), and (ii) P(G*b|G*a1 ∧ … ∧ G*an) > P(G*b). But G*b is analytically equivalent with ¬Gb, and G*ai with Gai (for 1 ≤ i ≤ n). So (ii) is analytically equivalent with P(¬Gb|Ga1 ∧ … ∧ Gan) > P(¬Gb), which contradicts (i). This result has been worked out by von Kutschera (1972, 144), who draws the conclusion that an inductive logic in the sense of Carnap (1950) is impossible.

Proposition 4.3 is an example of a convergence result for the inductive probability of predictions that holds independently of one's (subjective) prior probabilities of hypotheses. Similar convergence results are possible for the inductive probability of general hypotheses. Assume a given prior probability density distribution D(Hr) over the possible statistical probabilities r ∈ [0,1] of a binary property. Then the posterior density distribution D(Hr|E) conditional on a particular sample evidence E is computed by the Bayesian formula as follows (where "pHr" is the statistical probability function as determined by Hr = "p(Fx) = r"):

(4.8)  Posterior density of hypotheses
D(Hr|E) = pHr(E) • D(Hr) / ∫_0^1 pHx(E) • D(Hx) dx   (r ∈ [0,1]).
Discrete case: P(Hr|E) = pHr(E) • P(Hr) / ∑_{i=1}^n pHi(E) • P(Hi)   (1 ≤ r ≤ n).
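The discrete case of (4.8) is directly computable. In the sketch below (my own; grid and sample sizes are arbitrary choices) the posterior over a grid of hypotheses Hr peaks at the reported sample frequency and steepens as n grows, anticipating the limiting behavior stated in proposition 4.4:

```python
# Discretized posterior of (4.8) under a flat prior: D(H_r | k Fs in n).
from math import comb

N = 200
rs = [(i + 0.5) / N for i in range(N)]
prior = [1.0 / N] * N

def posterior(n, k):
    post = [comb(n, k) * r**k * (1 - r)**(n - k) * p for r, p in zip(rs, prior)]
    z = sum(post)
    return [w / z for w in post]

for n in (10, 100, 1000):
    post = posterior(n, int(0.7 * n))
    mode = rs[post.index(max(post))]
    print(n, round(mode, 2), round(max(post) * N, 1))  # mode near 0.7; peak grows
```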


One can prove that as long as the prior distribution over the possible frequency limits is continuous and nondogmatic, the conditionalization of this distribution on evidence reporting a particular sample frequency rE produces a shift of this distribution toward rE. With increasing sample size the resulting posterior distribution successively steepens, with an increasing amount of the probability mass located over the reported sample frequency (see de Finetti 1974, section XI.4.5; Howson and Urbach 1996, 354ff.). An easy way to state this fact formally is as shown in proposition 4.4.

(Proposition 4.4)  Continuous convergence for inductive generalizations
Assume P is exchangeable, nondogmatic, and continuous over the possible frequency limits (where "Hr" stands for "p(Fx) = r," and "[r•n]" is the integer rounding of r•n). Then:
limn→∞ ( D(Hr+ε | freqn(Fx) = [r•n]/n) / D(Hr | freqn(Fx) = [r•n]/n) ) = 0
for every |ε| > 0.
In words: The ratio between the posterior probability density of a false hypothesis and that of the true hypothesis (modulo approximation by rational numbers) converges to zero with increasing evidence.

Proof: Appendix 12.3.

Convergence results are also called the "washing out of priors" (Earman 1992, 141ff.). They are central to subjective Bayesianism because they establish a form of intersubjectivity independently of the particular choice of prior distributions under the mentioned assumptions. We know already that the StPP is an inductive principle. We will see later (and more explicitly in section 9.1) that the assumption of a continuous nondogmatic prior distribution over the possible frequency limits corresponds to an inductive principle, too. As long as we do not know how to solve Hume's problem of induction, we do not know how to justify these principles.

Even if these inductive principles are accepted, they are too weak to ensure the convergence of opinions in human practice. They establish convergence when the number of observations or points in time increases to infinity, but they do not entail any probabilistic convergence at finite stages of evidence. Proposition 4.5 is easily provable.



(Proposition 4.5)  Biased priors
For every true hypothesis H and arbitrarily long conjunction E of evidence statements that confirms H to an arbitrarily high degree but that is consistent with ¬H, there exists a nondogmatic but sufficiently biased prior probability distribution P such that P(¬H|E) > P(H|E).
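A two-hypothesis sketch (the numbers are my own invention) makes proposition 4.5 vivid: even evidence whose likelihood ratio favors H by roughly sixteen orders of magnitude is swamped by a sufficiently biased, yet nonzero, prior.

```python
# Hypothetical partition: H says p(F) = 0.9, ~H says p(F) = 0.5.
# Evidence E: 90 Fs among 100 observations -- strongly confirming H.
like_H    = 0.9**90 * 0.1**10
like_notH = 0.5**100
prior_H   = 1e-20                 # biased but nondogmatic (nonzero) prior

post_H = like_H * prior_H / (like_H * prior_H + like_notH * (1 - prior_H))
print(post_H < 0.5)   # -> True: P(~H|E) > P(H|E) despite the evidence
```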

Proof: Appendix 12.4.

In other words, Bayesian convergence results are unable to prevent irrational beliefs for arbitrarily long but finite time spans. To give a concrete example, if a subjective Bayesian believes strongly enough in the existence of a benevolent God, no finite set of counterexamples, however large, will convince him to give up his faith.

Proposition 4.3(a) is an example of a continuous inductive convergence result, because with every new confirming observation the degree of belief in the prediction increases by some (possibly small) amount. An even weaker kind of convergence result is simple convergence, which may start at an arbitrarily late point in time and says something only about what happens in the limit. For the proof of simple convergence exchangeability is not needed; an even weaker inductive principle is sufficient, namely σ-additivity. Recall that P is called σ-additive (or countably additive) iff the probability of the union of a (countably) infinite number of pairwise disjoint events equals the infinite sum of their individual probabilities. Kelly (1996, 321) worked out that σ-additivity implies the following simple convergence property:

(4.9)  Simple convergence for strict generalizations
If P is σ-additive, then: limn→∞ P(∀xFx|Fa1 ∧ … ∧ Fan) = 1, provided P(∀xFx) > 0.

For a Humean skeptic concerning induction, this consequence is unacceptable. She would object that for any number n of individuals having been observed to be F, there are still infinitely many unobserved individuals left; hence, there is no reason why our degree of belief that ∀xFx will be falsified in the future should sink with increasing n. It is well known that σ-additive distributions over countably infinite possibility spaces cannot be uniform (Howson and Urbach 1996, 326). If P is a σ-additive distribution over the set of natural numbers ℕ, then almost all of the probability mass is concentrated over a finite initial segment of natural numbers. More precisely, for every small ε > 0 there exists an n such that P({1, … ,n}) ≥ 1 − ε (see figure 4.1). De Finetti (1974, section III.11.6) has


concluded from this fact that one should not consider σ-additivity to be a general probability axiom (Gillies 2000, 77). In reply to these problems, Schurz and Leitgeb (2008) have developed a frequentistic probability theory that includes non-σ-additive probability functions.

The simple convergence result (4.9) following from σ-additivity can be generalized. This is stated in the next proposition 4.6. The result of proposition 4.6a was proved by Gaifman and Snir (1982). Hereby H(L) consists of all hypotheses expressible in a language L that is able to express statistical frequency limits, and the sequence (±wAi : i ∈ ℕ) consists of all unnegated or negated atomic statements that are true in the possible world w; it is assumed that this set contains all information about w.7

Figure 4.1
σ-additive probability distributions over ℕ (almost all of the probability mass, between P = 0 and P = 1, lies over a finite initial segment of ℕ).

(Proposition 4.6)  Simple convergence results
Assume P is σ-additive. Then:
(a) Gaifman/Snir convergence: For all H ∈ H(L), the set of possible worlds (or L models) w for which limn→∞ P(H|±wA1 ∧ … ∧ ±wAn) equals H's truth value in w has probability P = 1.
(b) Special case: limn→∞ P(p(Fx) = r | freqn(F) = [r•n]/n) = 1.
For strict generalizations and predictions, provided P(∀xFx) > 0:
(c) limn→∞ P(∀xFx|Fa1 ∧ … ∧ Fan) = 1.
(d) Even without σ-additivity: limn→∞ P(Fan+1|Fa1 ∧ … ∧ Fan) = 1.

Proof: Appendix 12.5.

Gaifman and Snir's result does not require the nondogmaticity of P, for the following reason: if P(H) = 0, then limn→∞ P(H|±wA1 ∧ … ∧ ±wAn) will not equal

7. This is a reformulation of Gaifman and Snir's requirement that the sentences {±wAi : i ∈ ℕ} must "separate" the set of L models (D,v).



H's truth-value in worlds in which H is true. But because the set of these worlds has probability zero, these worlds do not undermine result 4.6a, which only says something about the set of all worlds having positive probability. Also, note that Gaifman and Snir's result (4.6a) does not hold when the data stream does not determine H's truth value in all possible worlds. This is the case when H contains theoretical concepts (latent variables) that are not part of the data.

The convergence results (4.6c and 4.6d) apply to strict inductions and require the assumption that a universal generalization has a positive prior probability, P(∀xFx) > 0. Fact 4.6d tells us that inductive convergence for strict predictions can be proved by the assumption P(∀xFx) > 0 even without σ-additivity; that shows that this assumption is a further kind of inductive principle. In fact, P(∀xFx) > 0 is a rather strong inductive assumption. For example, P(∀xFx) > 0 is invalid in Carnap's systems of inductive probability. Although Carnap (1950, 571ff.) circumvented this problem by means of his method of "instance confirmation," Popper ([1935] 2002, new appendix vii*) inferred (wrongly) from it that inductive probabilities satisfying P(∀xFx) > 0 are impossible. Earman (1992, 91ff.) showed that P(∀xFx) > 0 is only possible if the conditional probabilities P(Fan+1|Fa1 ∧ … ∧ Fan) "rapidly"8 converge to 1 when n grows to infinity.

4.5  Indifference Principle as an Induction Axiom

The two conditions of exchangeability plus nondogmaticity are not strong enough to determine unique posterior probabilities conditional on finite evidence because they do not fix the prior probability distribution over a given partition of possible hypotheses. Therefore, proponents of intersubjective probabilities—the "objective Bayesians"—have suggested stronger conditions that fix this prior distribution P(Hi). The most important suggestion of this kind is the indifference principle, which asserts that in the absence of experience the same prior probability should be assigned to every possible hypothesis.9 Consider again all possible hypotheses about the statistical probability of property F: Hr =def "p(F) = r" (for r ∈ [0,1]). The indifference principle requires that (in the absence of information) the prior density D(Hr) is a uniform (or flat) distribution, which implies (by the normalization

8. There must exist k ∈ ℕ and c ∈ (0,1) such that ∀n ≥ k: xn+1/xn ≤ c, where xn =def P(¬Fan+1|Fa1 ∧ … ∧ Fan).
9. Williamson (2010, 16, 28ff.) calls it the principle of "equivocation."


requirement ∫_0^1 D(Hr) dr = 1) that D(Hr) = 1 for all r ∈ [0,1]. From this and (4.8) we infer immediately that the posterior density of Hr has its maximum at that r-value which maximizes the likelihood pHr(E). Moreover, numerical values of (posterior) probabilities can now be computed as in proposition 4.7.10

(

)

k = 1 . a) P freq n (F) = n n +1

In words: the prior probability that k out of n individuals are Fs is 1/(n + 1).

(

)

k = k +1 b) P Fa n +1 | freq n (F) = n n + 2 (Laplace’s rule of succession).

In words: the probability that the next individual w ­ ill be an F, given k out of n observed individuals have been Fs, is (k+1)/(n+2).

Proof: Appendix 12.6. Proposition 4.7a entails that the prior distribution over pos­si­ble sample frequencies is flat, and 4.7b is Laplace’s famous rule, which is a strong form of an induction princi­ple that is valid in Carnap’s preferred c*-­system of inductive “logic” (1950, 568), ­under the assumption that the “logical width” of property F is 1/2. Unfortunately, the indifference princi­ple stands on weak legs. It is beset by two major objections. Objection 1: Equiprobability is language dependent. ​Uniform probabilities are not preserved ­under partition refining. For example, equiprobability over the color partition {red, ¬red} gives us P(red) = 1/2, but over the color partition {red, blue, some-­other-­color} gives us color(red) = 1/3. What is worse, uniform distributions are not even preserved by fineness-­preserving language transformations. For example, assume a uniform prior probability distribution for the unknown frequency (μ) of a par­tic­u­lar kind of radiation (e.g., the light emitted by sodium). The wavelength (λ) of a radiation is given by the light velocity c divided by its frequency, λ = c/μ. Transforming the uniform distribution for μ (bijectively) into a distribution for λ, one obtains a nonuniform, exponentially decreasing distribution (see figure 4.2) (Gillies 2000, 37–48). ple have already been Similar “paradoxes” of the indifference princi­ reported by Keynes (1921, chap. 4). Keynes proposed an improvement that,

10. ​Jeffrey (1971, 219); Gillies (2000, 72ff.); Howson and Urbach (1996, 55ff.); Billingsley (1995, 279).


[Figure 4.2: Uniform density D(μ) for μ (frequency) turns into a nonuniform, decreasing density D(c/μ) for λ (= c/μ, wavelength).]
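The μ-to-λ transformation behind figure 4.2 is easy to reproduce by simulation. A sketch (my own; the light velocity is the standard constant, but the frequency bounds are arbitrary illustrative choices):

```python
import random

random.seed(0)
c = 3e8                       # light velocity in m/s
a, b = 4e14, 8e14             # hypothetical frequency interval for mu
mus = [random.uniform(a, b) for _ in range(100_000)]   # uniform in mu
lams = [c / mu for mu in mus]                          # lambda = c/mu
lo, hi = c / b, c / a                                  # resulting lambda range
share_low = sum(l < (lo + hi) / 2 for l in lams) / len(lams)
print(share_low)  # ~ 0.67, not 0.5: uniformity in mu is lost for lambda
```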

Keynes proposed an improvement that, however, does not work for the previous counterexample (Gillies 2000, 43ff.; Howson and Urbach 1996, 61).

Objection 2: Equiprobability of all possible worlds prohibits induction. A description that fixes the truth value of each atomic sentence of the language is called a state description by Carnap (1950, 71); it describes the total state of a possible world. If we restrict the language to one predicate F versus ¬F (describing, for example, the results of a repeated coin-tossing experiment), then each possible world corresponds to an infinite binary sequence (e1,e2, … ) with ei ∈ {0,1} (a "1" at position i codes "Fai" and a "0" codes "¬Fai"). It is well known that every binary sequence corresponds exactly to one real number between zero and one in binary representation, "0.e1e2 … ". The following result holds for priors that are uniform across all possible worlds.

(Proposition 4.8)  Induction-hostile uniformity
Assume the probability distribution P is state-uniform, that is, uniform over all possible state descriptions or event sequences (instead of over all statistical hypotheses). Then:
(a) Inductive learning from experience is impossible, and one obtains the result that for any given two-valued predicate "F": P(Fan+1|sn(F)) = 1/2, for every possible observation of n individuals sn(F) =def ±Fa1 ∧ … ∧ ±Fan ("±" for "unnegated" or "negated"). Thus P satisfies the properties of an IID random sequence over {0,1}.


(Proposition 4.8)  (continued)
(b) The prior density corresponding to this P is σ-additive and uniform over the interval [0,1], when every real r ∈ [0,1] is taken in its binary expansion, representing an infinite binary sequence of ±F-events (as explained previously).
(c) Yet if the same density is distributed over the space of possible statistical hypotheses Hr =def p(Fx) = r (r ∈ [0,1]), it becomes maximally dogmatic: it is concentrated on the point r = 0.5, P(p(Fx) = 0.5) = 1; hence D(Hq) = 0 for q ≠ 0.5 and D(Hq) = ∞ for q = 0.5.
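Proposition 4.8(a) can be illustrated by brute force: drawing a world state-uniformly amounts to tossing a fair coin for each event, so the conditional frequency of the next event given any observed prefix stays at 1/2. A minimal simulation (my own sketch, assuming eight prior confirmations of F):

```python
import random

random.seed(1)
worlds, n = 200_000, 8
prefix = [1] * n                 # "all observed individuals were F"
matches = next_is_F = 0
for _ in range(worlds):
    w = [random.randint(0, 1) for _ in range(n + 1)]  # state-uniform draw
    if w[:n] == prefix:
        matches += 1
        next_is_F += w[n]
print(matches / worlds, 2**-n)   # the prefix has probability ~ 1/256
print(next_is_F / matches)       # ~ 0.5: eight confirmations teach nothing
```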

Proof: Appendix 12.7. Proposition 4.8(a) is the strongest induction-skeptical result that we have met so far. It has been reported, among others, by Carnap (1950, 564–566)11 and Howson and Urbach (1996, 64–66). We shall see in section 9.1 that Wolpert's (1996) no free lunch theorem, which is famous in computer science, is a generalization of this result. According to proposition 4.8, a state-uniform P distribution makes the long-run convergence results that we discussed in section 4.4 entirely impossible because these results require a continuous and nondogmatic distribution over the possible frequency limits (or statistical hypotheses). As a consequence of this result one cannot say, as Bayesians frequently do, that independent of the prior distribution the posterior distribution over the possible values of a statistical parameter will approximate the true value with probability 1 (Wolpert 1996, 1342). This holds only if the prior distribution is continuous over the possible values of the statistical parameter (e.g., the frequency limit), but not if it is uniform over all possible state descriptions and hence noncontinuous over the values of the statistical parameter. Deeper aspects of this result will be investigated in section 9.1. As the two objections make clear, no prior probability distribution can be said to be unbiased, or void of information. In conclusion, the Laplace-Carnapean idea that "logic" can determine probability values seems to be untenable. In particular, any Bayesian attempt to establish at least a partial solution to Hume's problem of induction must give noncircular arguments as to why one should not assume a prior distribution that is uniform over

11. The state-uniform probability distribution yields Carnap's confirmation function c†.


state descriptions instead of statistical descriptions. However, no such arguments are in sight. Alternatively, one must develop an account that is independent of assumptions about prior probabilities. The latter way is chosen by the approach to Hume's problem developed in this book, although in later chapters we will show how to embed our results into the Bayesian framework (sections 6.1, 7.1, 7.2, 8.1, 8.3, and 9.1).

4.6  Inductive Probabilities without the Principle of Indifference?

There have been attempts to justify inductive probabilities of hypotheses by weaker assumptions or even without inductive assumptions. One attempt of this sort is the "symmetry argument" that goes back to Williams (1947), Stove (1986), and others (Henderson 2018, §3.3.3) and has more recently been defended by Williamson (2013). The statistical probability that the finite frequency (freq) of a property is close to its statistical probability is high according to the law of large numbers, and this is entailed by the basic axioms of probability alone. That is the starting point of the symmetry argument: if freq is close to p, then p is close to freq, and thus—the conclusion goes—the statistical probability is close to the observed frequency in a particular sample. Several authors have demonstrated that this argument contains two major mistakes. In what follows freq(F:xn) denotes the frequency of a binary property F in a varying n-element sample designated by xn, and freq(F:sn) denotes the frequency of F in a given particular n-element sample sn. With the help of well-known statistical methods we can compute an interval ±an such that the probability of drawing a random sample whose F frequency, freq(F:xn), lies in the closed interval [p(F) ± an] is approximately 95 percent (where "[x ± a]" is short for {x: x − a ≤ x ≤ x + a}). For a two-valued property F, the interval ±an is given as ±2·√(p(F)·(1 − p(F))/n) (which becomes small when n gets large). Thus the structure of the argument is as follows:

Step 1: Probability(freq(F:xn) ∈ [p(F) ± an]) = 0.95.

The Williams-Stove argument in its original form infers from this by symmetry:

Step 2: Probability(p(F) ∈ [freq(F:xn) ± an]) = 0.95,

from which by "inserting" the observation freq(F:sn) = q it is inferred that

Step 3: Probability(p(F) ∈ [q ± an]) = 0.95.

In words: the probability that F's true statistical probability deviates by not more than an from the observed sample frequency q is 95 percent.
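Step 1 is indeed a purely statistical fact about random samples, as a quick simulation confirms (a sketch of my own; the values of p and n are arbitrary choices):

```python
import random
from math import sqrt

random.seed(2)
p, n, trials = 0.3, 400, 20_000
a_n = 2 * sqrt(p * (1 - p) / n)
covered = 0
for _ in range(trials):
    freq = sum(random.random() < p for _ in range(n)) / n
    covered += (p - a_n <= freq <= p + a_n)
print(covered / trials)  # ~ 0.95: the sample frequency falls in [p +/- a_n]
```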


Early proponents of the Williams-Stove argument have even argued that this "derivation" follows from the basic axioms of probability alone. That is obviously wrong. The first mistake lies in the fact that the argument does not distinguish between statistical and subjective probability. The probability in step 1 and step 2 is a statistical one, and its variable xn varies over arbitrary random samples of size n, drawn from the given infinite population (or "source"). Hence, we have to write more precisely:

Step 1*: p(freq(F:xn) ∈ [p(F) ± an]) = 0.95, and

Step 2*: p(p(F) ∈ [freq(F:xn) ± an]) = 0.95.

Only steps 1* and 2* are entailed by the basic axioms. These two steps say that in a random sequence of n-sized samples the frequency of samples that deviate from the limiting frequency by at most an is 95 percent. This is different from what Williams and Stove asserted. Along the same line, Winkler and Hays (1970, 328) and Howson and Urbach (1996, 239ff.) have objected that—in spite of the mathematically valid symmetric inversion—the probability in step 2* is still a statistical probability of samples, but not of a hypothesis, because probabilities of hypotheses are not statistical but epistemic in nature. The crucial step that is overlooked in the passage from step 2* to step 3 in the Williams-Stove argument consists in the transfer of the statistical probability to the single case—that is, to the observed particular sample sn, according to the StPP for sample frequencies (definition 4.2c). Thus we need to insert the following intermediate step:

Step 2a*: P(p(F) ∈ [freq(F:sn) ± an]) = 0.95 (by means of the StPP).

This demonstrates that the original intention of Williams and Stove to establish a noncircular "deductive" justification of induction does not work. As we know, the StPP expresses an inductive assumption that asserts that (in the absence of further knowledge) the actually observed sample frequency is representative of the frequency limit in the entire population. That makes sense only if we believe that our world is uniform when passing from observed to nonobserved. In his reconstruction Williamson (2013, 304–308) avoids this mistake. He makes it clear that the argument relies on the StPP. Is the thus adjusted symmetry argument correct? Unfortunately, it is not. The application of the StPP in step 2a* is indeed correct, as in this step it is not assumed that the agent knows anything about the individual sample sn, except that it is randomly drawn from the population with limiting frequency p(F). So the admissibility condition of definition 4.2 is satisfied in regard to the particular sample sn, whose individuals correspond to the ai's in definition 4.2c. This is different, however, when the agent has already acquired experience regarding the


sample sn, in particular, when she has already observed its F frequency. Here the StPP is conditionalized on an informative condition concerning the individual sample sn, which violates the admissibility condition. To obtain the full reconstruction of the argument we must add the following steps:

Step 3a*: P(p(F) ∈ [freq(F:sn) ± an] | freq(F:sn) = q) = 0.95 (conditionalization of 2a*).

Step 3b*: P(freq(F:sn) = q) = 1 (assumption).

Step 3*: Steps 3a* + 3b* imply by probability logic: P(p(F) ∈ [q ± an]) = 0.95.

In this way, the incomplete derivation "steps 1, 2, 3" has been transformed into the complete derivation "steps 1*, 2*, 2a*, 3a*, 3b*, 3*." The problematic step in Williamson's version of the symmetry argument is step 3a*, which conditionalizes the StPP on the evidence "freq(F:sn) = q" and is not generally admissible. It can be argued that step 3a* is only admissible if the prior probability over the possible hypotheses is approximately uniform—that is, if it obeys the indifference principle. Along this line, Maher (1996, 426) argued that the move from step 2a* to 3a* is inadmissible if the prior probabilities are biased in a way that makes the sample result freq(sn) = q seem like an outlier. In this case the information freq(F:sn) = q can lower the epistemic probability of p ∈ [q ± an] far below the value of 0.95. As an example, take a series of 100 coin tosses. It can be computed that with probability 95 percent the frequency of heads in 100 throws does not deviate by more than 8 percent from the true statistical probability of heads. Now assume we observe a number of 30 heads in 100 throws of the coin. According to Williamson's argument we should now believe that the coin has a biased heads-probability of 30 ± 8 percent. That is only reasonable if our prior expectation concerning the coin's true probability is uniform, which means that our prior expectation that the coin is approximately fair is very low. If we are confident that the coin is fair (i.e., our prior peaks about p = 1/2), it seems more reasonable to believe that the given series was an unrepresentative accident. This informal argument may even be turned into a formal theorem.

(Proposition 4.9)  Deriving indifference
Assume that for all q ∈ [0,1], the probability distribution P satisfies the equation of step 3a*—or in informal words, there is a 95 percent degree of belief that the statistical probability of F lies in a ±an-interval around the F frequency in a sample, given that this frequency equals q. Then the prior distribution DP(p = x) converges to the uniform distribution with increasing n.

Proof: Appendix 12.8.
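Maher's point can be reproduced with a small grid computation (a sketch of my own; the discretization and the artificial "peaked" prior are illustrative assumptions, not from the text):

```python
# Posterior probability that p lies in [0.22, 0.38] after 30 heads in 100 tosses
k, n = 30, 100
grid = [i / 1000 for i in range(1, 1000)]

def posterior_mass(prior):
    weights = [prior(r) * r**k * (1 - r)**(n - k) for r in grid]
    total = sum(weights)
    return sum(w for r, w in zip(grid, weights) if 0.22 <= r <= 0.38) / total

flat = lambda r: 1.0                                      # indifference prior
peaked = lambda r: 1.0 if abs(r - 0.5) < 0.01 else 1e-6   # "the coin is fair"
print(posterior_mass(flat))    # ~ 0.92, close to the 0.95 of step 3a*
print(posterior_mass(peaked))  # ~ 0.01: the run is judged an accident
```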


In conclusion, a justification of inductive posterior probabilities conditional on finite evidence solely by the StPP, without assuming a particular prior distribution, is impossible. A similar conclusion is stated by Henderson (2018, §3.3.3).

4.7  Is Skepticism Unavoidable?

In chapter 2 we have seen that all nonformal philosophical attempts to solve Hume's problem of induction failed. In chapter 4 we investigated more recent probabilistic attempts to justify induction. This needed more technical effort, but the result was similar: without building inductive assumptions into the probability function, inductive consequences cannot be proved. Examples of such inductive assumptions were, with increasing strength, σ-additivity, nondogmaticity and continuity of priors, exchangeability and the StPP, the principle of indifference, and a positive probability of strict generalizations. Some philosophers argue that all or at least some of these principles are "justified" because they are intuitive. However, recall our characterization of justification in section 2.2: a justification has to establish that induction is truth conducive or cognitively successful. Considerations of intuitiveness cannot achieve that. Moreover, we presented various objections that turn initially promising intuitions into counterintuitions. For example, we saw that the principle of indifference makes induction entirely impossible if it is applied to the set of possible worlds (state descriptions) instead of to the set of possible frequency hypotheses. Does this mean that all philosophical attempts to solve the problem of induction are hopeless? Many epistemologists and philosophers of science have drawn this conclusion (Stegmüller 1977, vol. II, chap. 4). Yet we think that there is no reason to give up. So far all justification proposals have attempted to demonstrate the reliability of inductive inferences. There exists a fundamentally different approach that does not attempt to demonstrate the reliability but the optimality of inductive inferences among all accessible methods of prediction. The rest of this book consists of a development of this latter and (in our view) more promising approach to Hume's problem, which we call meta-induction.

5  A New Start: Meta-Induction, Optimality Justifications, and Prediction Games

5.1  Reichenbach's Best Alternative Approach

In the previous chapters we have shown that not one of the proposed methods of justifying induction works. Observe, however, that all of the justification attempts discussed thus far1 have tried to demonstrate that inductive inferences are or will be successful, at least in the probabilistic sense of being reliable—that is, leading from true premises to true conclusions in a high percentage of cases (which must be at least higher than the random success of blind guessing). According to Hume's basic objection, it is impossible to demonstrate that inductions will be successful without making inductive assumptions about the world, which means reasoning in a viciously circular way. Thus Hume's skeptical argument can be summarized by saying that it is impossible to demonstrate a priori that induction is reliable. The results of the previous chapters have confirmed Hume's argument. However, there exists a different approach to Hume's problem for which we can at least uphold the hope that it can succeed if it is adequately developed. Put in a nutshell, this account does not attempt to provide a reliability justification, but an optimality justification of induction. Historically this approach goes back to Reichenbach's best alternative approach (Reichenbach 1935, §80; 1949, §91; Salmon 1974b). The best alternative approach does not try to show that induction must be (probabilistically) successful, which is impossible given Hume's insight. Rather, it argues that the expected success of induction will be maximal among all (available) competing methods. Or, in simpler words, if any method of prediction works, then the inductive method does. Reichenbach draws the picture of a fisherman

1. With the exception of analytic accounts, which do not provide a success-based justification at all.


sailing to the sea, not knowing whether he will find fish (= find regularities). Yet it is clearly more reasonable for him to carry his fishing net (= induction) with him than not, because by doing so he can only win (catch fish) but not lose anything. There exists a systematic argument for why the best alternative approach seems to be the only promising one. Given the following two premises:

Premise 1: It is impossible to demonstrate that induction must be successful (Hume's lesson), and

Premise 2: There exist various alternative noninductive prediction methods (such as intuition, trust in God, paranormal abilities such as clairvoyance, or even anti-inductive methods),

then the following conclusion seems unavoidable:

Conclusion: The only remaining possibility of an epistemic justification of induction is to show that induction is the (or at least a) "best alternative" among all available competing methods of prediction.

Besides inductive prediction inferences, there are also inductive generalization inferences (recall section 1.1), but the inductive reliability of the latter can only be evaluated by their predictive success. Hence we can safely assume that the ultimate goal and evaluation criterion of inductive inferences is success in predictions.

5.2  Reliability Justifications versus Optimality Justifications

In the terminology of contemporary game theory, the goal of Reichenbach's account can be understood as an epistemic dominance argument or at least as an optimality argument. In the standard decision-theoretic framework, an action A is called optimal in a class of available actions 𝒜 (A ∈ 𝒜) iff in every world w of a given class of possible worlds W, the utility of A is greater than or equal to the utility of every other action in 𝒜. Action A is called dominant in 𝒜 iff A but no other action B ∈ 𝒜 is optimal in 𝒜, or equivalently, iff A is optimal in 𝒜 and for every alternative action B ∈ 𝒜 there is at least one world in which A's utility is strictly greater than that of B (Weibull 1995, 13). The difficult part of Reichenbach's enterprise is to show that induction is optimal, that it is never worse than any noninductive method of prediction. If the demonstration of the optimality of induction were to succeed, then it would seem that this result could easily be extended to demonstrate dominance because the predictions of every noninductive method are


suboptimal in a "sufficiently uniform" world. Independent of this argument (which will be discussed in section 8.3), a demonstration of the optimality of induction alone would already constitute a great justificatory success because it would establish that no method of prediction can do better than the inductive method. We emphasize that the possibility of optimality justifications is compatible with Hume's insight that no noncircular argument can establish the reliability of induction. Even in epistemically "demonic" worlds in which no method of prediction is reliable, inductive methods can still be optimal in the sense of being "the best of a bad lot"—not having less predictive success than any other method accessible in this world (though proposition 8.1 guarantees a meta-inductive success rate of at least 0.5). Thus, in an important sense, optimality justifications are epistemically weaker than reliability justifications.2 Another objection to optimality justifications of induction could be to point out that even if they were successful they would not give us a true solution to Hume's problem. We think that this diagnosis is incorrect. Hume did not just argue that induction cannot be demonstrated to be reliable; he argued for the much stronger claim that no epistemically rational justification of induction is possible. However, the goal underlying the optimality argument is the maximization of true predictions, and that is clearly an epistemic goal. Therefore, we regard optimality arguments as a genuine (though weak) solution to Hume's problem. Moreover, optimality and dominance can be translated into two well-known properties of the probabilistic expectation value of success (however "success" is defined). A method is optimal iff for every prior probability distribution over possible worlds its average success is maximal among all competing methods (see section 8.1.6). For dominance the relation is more complicated (see section 8.3.3). Optimality justifications have also been called "pragmatic"; Feigl (1950) spoke of the "pragmatic vindication of induction." There is nothing to say against this designation as long as "pragmatic" means an epistemic pragmatics that justifies inferences as a means for the epistemic

2. Logically speaking, they are not weaker. It is possible that a method is very reliable but not optimal because another method is even more reliable. However, a reliability justification grants at least some threshold of (more than accidental) predictive success, whereas an optimality justification cannot do this. In this sense, optimality justifications are epistemically weaker.


goal of predictive success, not for extraepistemic goals such as wealth or happiness. There is a more specific reason why decision-theoretic justifications of induction have been called "pragmatic"; namely, because without further assumptions, they do not establish that the conclusion of an inductive inference with true premises is probably true—that is, true with a probability greater than one half in the binary case or, more generally, greater than the truth-probability of blind guessing. Although an optimality justification of induction, if it exists, establishes that the expected rate of truth-success of induction is greater than or equal to that of any competing prediction method, it does not establish that the success rate is always greater than blind guessing; whether this is the case or not depends on the chosen prior probability distribution (more on this in section 9.1). However, given the practical necessity of choosing actions and hence of making predictions, the best that we can do is to choose the prediction method with the best expected success and believe that its prediction is probably true (in the binary case, more probable than one half). One may object that the argument is not a proper justification but merely an as-if justification, as we merely act as if we believed in the conclusions of our inductive inferences. To that we can reply that if it is optimal to act as though we believed in induction, then it is optimal to believe in induction. In section 8.1.7 we will elaborate this principle and call it the "optimality principle." We will also show there how meta-induction can be used to justify (1) rational degrees of belief in propositions (predictions) without presupposing particular prior distributions, and (2) qualitative (plain) beliefs in the context of a given acceptance threshold. Finally, we emphasize that in demonstrating optimality one must of course allow all possible worlds, including not only radically nonuniform worlds but also paranormal worlds in which there exist perfectly successful future-tellers or anti-inductivistic demons. Restricting the set of worlds to uniform or "naturalistically normal" worlds would completely destroy the enterprise of justifying induction. For then we would have to justify inductively that our real world is one of these "normal" worlds, ending up in precisely that kind of circle in which, according to the Humean skeptic, all attempts at justifying induction must end up. In other words, the justification of induction takes place at the very bottom level of epistemology, so it must not presuppose anything except analytical truths and certain introspectively justified beliefs (see section 8.1.1). In particular, the justification of induction must not already presuppose a naturalistic worldview excluding the possibility of clairvoyance or spiritual connection with an


omniscient God. All assumptions of this sort would make our enterprise circular because the justification of the naturalistic worldview presupposes our belief in the reliability of induction. In other words, the justification of induction must be "radically open-minded" (section 8.1.2).

5.3  Shortcomings of Reichenbach's Best Alternative Approach

Unfortunately, the hope for a general optimality argument for induction is too high to be satisfiable. There are severe objections against Reichenbach's best alternative account that show that one cannot demonstrate the optimality of induction (or some other method) among all possible methods of prediction or uncertain inference. The most popular version of Reichenbach's argument was given by Salmon (1974b, 83) and rests on the utility matrix in figure 5.1. According to this matrix induction is optimal: by applying induction you can only gain but you cannot lose. One problem with this reconstruction of Reichenbach's argument is that its key terms, uniform world state and success, are vague. The really devastating problem, however, is that independent of how one delineates uniform from nonuniform worlds, the argument fails as soon as "success" is understood as it should be—in terms of predictive success.3 It is easy to conceive worlds in which the success of inductive methods is not better than that of blind guessing, but where there nevertheless exists a clairvoyant who successfully predicts the future. For example, a perfect future-teller may have 100 percent success in predicting random tossings of a coin while the scientific inductivist can only have a predictive success of 0.5 in this case (Kading 1960). That is what the question mark in figure 5.1 indicates.

Method applied:   World-state: Uniform    World-state: Nonuniform
Inductive         Success                 Failure
Other             Success or Failure      Failure [?]

Figure 5.1 Salmon's reconstruction of Reichenbach's argument. The "[?]" indicates its weak spot.
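The weak spot is easy to simulate: in a world of random coin tosses the object-inductivist hovers around a success rate of 0.5, while a perfect future-teller, modeled here simply as an oracle that reads the next event, scores 1.0. A minimal sketch (my own illustration, not from the text):

```python
import random

random.seed(7)
events = [random.randint(0, 1) for _ in range(10_000)]  # random coin world
oi_correct = clair_correct = ones = 0
for n, e in enumerate(events):
    oi_pred = 1 if 2 * ones > n else 0    # OI: predict the majority event so far
    clair_pred = e                        # oracle: "sees" the next event
    oi_correct += (oi_pred == e)
    clair_correct += (clair_pred == e)
    ones += e
print(oi_correct / len(events))     # ~ 0.5: induction no better than guessing
print(clair_correct / len(events))  # = 1.0: the perfect future-teller
```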

3. Or some measure correlated with predicted success, such as cognitive success in the sense of section 2.2.


To escape these difficulties, Reichenbach offers another version of his argument. In this alternative version, the goal of induction is nothing but to approximate the true frequency limit of a given property in a given infinite sequence of events, based on a successively increasing number of observations. Moreover, a uniform world state is now defined as one in which the events to be predicted possess a frequency limit in the given sequence of events, while a nonuniform world state is understood as a sequence whose events do not possess such limits. With respect to this interpretation of the decision matrix, Reichenbach's argument becomes true, but also more or less trivial. If the event has a frequency limit p, then the simplest inductive generalization rule—the so-called straight rule, which transfers the relative frequency observed thus far to the conjectured frequency limit—must approximate this limit in the long run because by definition the existence of a limit p means that with n → ∞ the finite frequencies freqn of the given event converge to p, limn→∞ freqn = p. On the other hand, if the event frequencies freqn do not converge to a frequency limit, then for trivial reasons no method can find a limit (Reichenbach 1949, 474ff.). Lenz (1974) and Rescher (1980) have provided devastating arguments against this version of the pragmatic justification. First, the method of "finding" provided by the straight rule is not really a method of finding because after any finite number of observations, however large, our observed frequency could still be maximally distant from the true limit. Thus, we never know when we have approximated the limit within nontrivial approximation bounds. Second, and more importantly, conjecturing an approximate frequency limit is practically insignificant and certainly not the primary goal of inductive inferences. Rather, the primary goal is the prediction of future events. In this respect, Reichenbach's account fails completely. For as we have just explained, even if the inductivist performs equally well as a clairvoyant in conjecturing the frequency limit of an event, the clairvoyant could still be overwhelmingly more successful in regard to single event predictions.

5.4  Object-Induction versus Meta-Induction

Reichenbach was well aware of the problem explained in the last section: the possibility of a superior forecaster in a scenario in which even the best scientific induction methods can only have random success. He remarked that if a successful future-teller existed, then that would already be "some uniformity," which the inductivist could recognize by applying induction to the success of prediction methods (1949, 476ff.; 1938, 353ff.). But Reichenbach neither showed nor even attempted to show that by this


meta-inductivistic observation the inductivist could have an equally high predictive success as the future-teller. This point has been highlighted by Skyrms ([1975] 2000, chap. III.4). Skyrms distinguishes between the object-level and the meta-level of methods and also allows for meta-meta-levels (etc.). He argues that Reichenbach has indeed shown that if there exists a successful prediction method—say, a clairvoyance method—at the object level, then the meta-inductivist can find that out and can construct an inductive argument at the meta level about why this clairvoyance method is successful. But what Reichenbach did not show, Skyrms asserts, is that the meta-inductivist could produce equally successful inductive predictions at the object level. Our approach—which we turn to now—is to try to show what, according to Skyrms, Reichenbach failed to show. The basic idea focuses on the so-called meta-inductivist, who applies the inductive method at the level of competing prediction methods. Let us make this formally precise. By object-induction (abbreviated as OI) we mean all methods of induction that are applied at the level of events—the object level. In contrast, by meta-induction (abbreviated as MI) we understand "meta-level" methods that apply induction at the level of competing prediction methods. More precisely, the meta-inductivist bases her predictions on the predictions and the observed success rates of the other (non-MI) methods and tries to derive therefrom an "optimal" prediction. The simplest type of MI simply predicts what the presently best prediction method predicts, but as we shall see there exist various more refined kinds of meta-inductive strategies for prediction. Generally speaking, the problem of Reichenbach's account lies in the fact that it is not only impossible to demonstrate that object-induction is always a reliable prediction method, but likewise impossible to demonstrate that it is always an optimal (or approximately optimal) prediction method. This is a lesson of formal learning theory precisely explained in section 5.7. Roughly speaking, for every method M one can construct a demonic world (event sequence) w in which this method systematically fails; and for every world w one can construct a method Mw which is maximally reliable in w. Thus, M is outperformed by Mw in w. However, the preceding remark only holds true if the method M that fails in world w is not a meta-inductive method that observes Mw's success in w and starts to imitate Mw. That is the anchor point of meta-induction, and because of this fact one should expect that the chances of demonstrating optimality are much better for meta-induction than for object-induction. Thus, the crucial question of the next chapters will be, Is it possible to


design a version of meta-induction that can be proved to be an optimal prediction method? One may object that meta-induction presupposes that the predictions of the other methods are accessible to the meta-inductivist. This is indeed true: it is possible that an "esoteric" forecaster does not tell his predictions to the meta-inductivist but keeps them secret or shares them only with his followers who blindly trust his prophecies. Therefore, the meta-inductive notion of optimality has to be restricted to prediction methods that are accessible to the meta-inductivist, in the sense that the meta-inductivist has information about their predictions and success rates and can learn from them (details in section 5.7). In what follows we call this notion of optimality access-optimality. What we intend to show is that meta-induction is an access-optimal meta-level prediction strategy—among all prediction methods whose output is accessible to the meta-inductivist, the meta-inductivistic strategy is always the optimal choice. The restriction of meta-inductive optimality to accessible strategies is, on the one hand, the crucial step that makes it possible to achieve a result that does not fall prey to the standard skeptical scenarios. On the other hand, we argue that this restriction is not really a drawback. Methods whose output is not accessible to the given forecaster are not among her possible actions, so they are without relevance for her deliberations concerning the choice of an optimal strategy. If the predictive success rates of all accessible competing prediction methods (however the success rate is measured) are known in advance and do not change in time (in a sense to be made precise), then the demonstration of the access-optimality of MI is trivial: the meta-inductivist just has to select the (or, in case of ties, one) method with the best success rate in order to be optimal. But in realistic scenarios these conditions are of course not satisfied. The meta-inductivist can only observe the performance of the available methods in the past, and has to infer from these observations which method or combination of methods she should choose for the next prediction to be made. But after the next event has been revealed, the success ordering of the competing prediction methods may already have changed. The so-far leading method whose prediction was imitated may have delivered a wrong prediction of the next event, so its success may fall behind a new leading method. Thus, the meta-inductivist has to accept an unavoidable delay in information. As we shall see, this makes her susceptible to various sorts of deception—against which, however, she may develop different kinds of defense strategies. We do not restrict changes of success rates to improbable random fluctuations, as they occur in IID (independent identically distributed) random


sequences. Rather, we admit that the success rates of the considered methods may systematically change in time, for example, because of systematic changes in the environment. Systematic deviations between past and future are the most general and at the same time hardest situations for the induction problem. We will compare our setting with more induction-friendly learning situations such as random sampling in later sections. Even if the access-optimality of meta-induction could be demonstrated, it is not yet clear which epistemological significance this result would have for a possible solution to Hume's problem of induction, the problem of whether we are justified in object-induction—induction applied to events. The significance for Hume's problem is the following: if the access-optimality of meta-induction could be demonstrated, then at least meta-induction would have a rational and noncircular justification based on a mathematical-analytic argument. But this a priori justification of meta-induction would at the same time yield an a posteriori justification of object-induction in the real world. We know by experience that in our world, object-inductive prediction methods have been more successful than noninductive methods so far, whence it is meta-inductively justified to favor object-inductivistic strategies in the future (more on this in section 8.1.5). In other words, the commonsense argument in favor of object-induction based on its past success record would no longer be circular if we had an independent, noncircular justification of meta-induction.

5.5  Prediction Games

The central notion by means of which we shall study the performance of different (object- or meta-level) methods of prediction is that of a prediction game. A prediction game is formally a pair G = ((e), Π) that consists of a stream of events (e) and a set of prediction methods or "players" Π. If G = ((e), Π), we also write Π = ΠG and (e) = (e)G.

1. Events: The stream of events is represented as an infinite sequence (e) =def (e1, e2, … ), consisting of events that are numerically coded (or measured) by elements of the unit interval [0,1]. Naturally (but not necessarily) the ordering of the events in the sequence is understood as an ordering in time, so at each discrete time n = 1, 2, … the event en ∈ [0,1] obtains. For example, (e) may be a sequence of daily weather conditions, stock values, or coin tossings. An event sequence (e) is the simplest way of modeling the natural part of the world; it will be extended for specific purposes in section 7.4.


Generally speaking, our representation covers all continuous or discrete events that are representable by real-valued numbers within finite bounds [a,b]; subtraction of a and division by (b − a) gives us normalized events in [0,1]. "Val" denotes the set of possible event values and is a subset of the closed interval [0,1]; thus en ∈ Val ⊆ [0,1]. In the context of mathematical theorems we typically assume Val = [0,1], but in practice the possible events in Val do not cover all real numbers in [0,1] but are rounded up to a finite number of decimal places behind the comma. In certain contexts we will assume a set of possible events that is more restricted than demanded by finite accuracy; for example, Val = {0.1, 0.3, 0.5, 0.7, 0.9}. Formally, the event variable e, or more explicitly Xe, is a mathematical (random) variable Xe: T → Val, with T ("times") = ℕ, Val ⊆ [0,1], and ei =def Xe(i) denoting the true value of Xe at time i.

2. Predictions: The players in Π have the task of predicting, at any time n, which event en+1 will occur at the next time n+1 (thus the prediction for time 1 is delivered at time 0). The value space of possible predictions is denoted by Valpred and may extend the space of possible events Val; that is, Val ⊆ Valpred ⊆ [0,1]. Because in real-valued prediction games it is allowed to predict linear combinations of events, Valpred coincides mathematically with [0,1] even if Val is finitely restricted. The only restriction of Valpred consists in the finite accuracy of predictions. A particular case are binary events, represented by "1" for "event E obtains" and "0" for "E does not obtain"; thus for binary events Val = {0,1}. Even if events are binary, real-valued predictions are allowed. If they are interpreted as the predicted probabilities of the event 1, these games describe Bayesian predictors, which are treated in section 7.1. A generalization of binary events are discrete events, for which Val = {v1, … ,vq} consists of a finite set of discrete values (e.g., Val = {blue, red, yellow, green}) that possess no gradual structure (even if they are, for convenience, represented as real numbers in [0,1]). Real-valued predictions of discrete events predict a probability distribution over Val (section 7.1). The prediction of proper weighted averages of events is forbidden in so-called discrete prediction games. Here Val is discrete, and the prediction space coincides with the event space: Valpred = Val. For binary prediction games, this means that events as well as predictions must take only one of the two values 0 and 1; predictions of real numbers between 0 and 1 are not allowed, or equivalently are assigned a score of zero.

3. Players: The set of players Π = {P1, P2, … , xMI} contains a meta-inductivist xMI of a given type "x" whose performance is being investigated. (Later


parts of the book will assume several meta-inductivists xMI1, … ,xMIk of the same type, for reasons to be explained there.) Besides the meta-inductivist xMI, the player set Π contains various other players P1, P2, … who are called "non-MI players," or more precisely "non-xMI players." The subset of non-MI players of Π is also called xMI's candidate set; it is abbreviated as Π¬MI and may contain, for example:

(3.1) One or several object-inductivists. Each object-inductivist bases his predictions on a computable method (though for certain purposes we will allow object-inductivists to randomize their predictions). Methods of object-level induction may be more or less refined, forming an infinite hierarchy of complexity (which includes the arithmetical hierarchy; see Kelly 1996). The simplest object-inductivist will be denoted as OI.



(3.2) A subset of "alternative" players, such as persons who rely on their instinct, clairvoyants or "God-guided" future-tellers, and blind guessers. The prediction method of alternative players is typically not given as an algorithm but by the de facto predictions of a real agent; in computer science this corresponds to what is called an oracle. In paranormal worlds, alternative predictors may have any success you wish. Note that we do not define "paranormal worlds," as that would be impossible; all that is important for a noncircular justification of meta-induction is to admit possible worlds that host future-tellers.

Our investigation avoids players whose predictions are identical. We can therefore identify each prediction method with one player. Later, in the context of discrete prediction games, we will introduce meta-inductive methods that depend on their being played by a collective of meta-inductivists. However, in these contexts all meta-inductivists earn the same success rates, so the identification of a "method" with one idealized player who represents the entire collective can still be upheld in the context of optimality and dominance results (see sections 8.1 and 8.3). For the investigation and evaluation of prediction games we introduce the following technical notions that will be used throughout the book:

• predn denotes a prediction for time n that is issued at time n−1, and predn(P) stands for the prediction for time n issued by a given player P of the prediction game.

• lossn(P) =def loss(predn(P),en) is the loss that player P incurs for her prediction predn(P). The loss function "loss(predn,en)" measures the deviation of the prediction predn from the event en (at time n). Loss functions are normalized: loss(pred,e) ∈ [0,1].


The natural (or absolute) loss function is assumed to be the absolute difference between prediction and event, loss(predn,en) =def |predn − en|. Although we prefer the natural loss function, the theorems contained in this book do not depend on this assumption. Some theorems will hold for arbitrary loss functions, and others will hold for arbitrary convex loss functions, which is still a wide class of loss functions, including all loss functions that are normalized polynomial or exponential functions of |predn − en| with positive coefficients. A minimum requirement for numerical loss functions is their monotonicity: loss(pred1,e1) ≤ loss(pred2,e2) iff |pred1 − e1| ≤ |pred2 − e2|. In the case of discrete (non-numeric) events, this condition makes no sense, and the so-called zero-one loss is the preferred loss function: loss0-1(pred,e) = 0 if pred = e and loss0-1(pred,e) = 1 if pred ≠ e. In the special case of binary predictions, loss0-1 coincides with the natural loss function. The natural loss function is a special case of linear loss functions (i.e., polynomial of degree one). With linear loss functions one may extend the normalized scoring interval [0,1] to any cost-gain interval [−c,g] by multiplying by (c + g) and subtracting c: that is, score′ = −c + (c + g)·score. For the transformed scoring interval [−c,g] the upper bounds for the short-run regrets of different meta-inductive strategies to be established in chapters 6 and 7 have to be modified by multiplying them with the breadth (g + c) of the scoring interval (Cesa-Bianchi and Lugosi 2006, 24ff.).4 Based on the loss function, we define the following evaluation measures (a computational sketch follows after the list):

• scoren(P) =def 1 − lossn(P) is the score which player P earns for predn(P).

• absn(P) =def ∑1≤i≤n scorei(P) is the absolute success achieved by player P at (or until) time n, defined as the sum of P's scores for predictions delivered until time n.

• sucn(P) =def absn(P)/n is the success rate of player P at time n.

• limsuc(P) =def limn→∞ sucn(P) denotes player P's limit success, provided that P's success rate converges to such a limit (which means, recall section 4.1, that ∀ε > 0 ∃m ∈ ℕ ∀n ≥ m: |sucn(P) − limsuc(P)| ≤ ε holds).

• maxabsn is the maximum of the absolute successes of all non-MI players at time n; that is, maxabsn = max({absn(P): P ∈ Π¬MI}).

4. The additive term "−c" cancels in regrets. A further variation is directed loss functions depending on the direction (sign) of the difference (predn − en); an example is the so-called Wilhelm Tell function (Trommershäuser, Maloney, and Landy 2003). For directed loss functions the conditions of monotonicity and convexity must be applied separately to positive and negative differences.


• maxsucn is, likewise, the maximum of the success rates of all non-MI players at time n,5 and maxlimsuc is their maximal limit success (if it exists).

• ēn =def (∑1≤i≤n ei)/n denotes the event's mean value at time n.

• ên denotes the event's median value at time n, which is defined as the greatest real number r ∈ Val such that at least 50 percent of all observed events so far have a value of ≥ r. Median and mean values coincide for symmetric distributions but deviate for asymmetric ones. For example, if the observed events so far are {0.1, 0.3, 0.2, 0.1, 0.8, 0.9}, their median value is 0.3 and their mean value is 0.4.

• ē =def limn→∞ ēn is the event's limiting mean value, Xe's statistical expectation value, provided the sequence of mean values converges to a limit, and

• ê =def limn→∞ ên is the event's limiting median value, under the same proviso.
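To fix ideas, here is a minimal sketch (my own, anticipating chapter 6) that implements these success measures for a binary game, together with the simplest meta-inductive strategy of section 5.4: imitating the currently best accessible player, with the unavoidable one-round delay:

```python
import random

def run_game(events, methods):
    # methods: dict name -> function(past_events) -> prediction in [0,1]
    abs_success = {m: 0.0 for m in methods}       # abs_n per non-MI player
    mi_abs, past = 0.0, []
    favorite = next(iter(methods))                # xMI's initial favorite
    for e in events:
        preds = {m: f(past) for m, f in methods.items()}
        mi_abs += 1 - abs(preds[favorite] - e)    # xMI imitates its favorite
        for m in methods:
            abs_success[m] += 1 - abs(preds[m] - e)   # score_n = 1 - loss_n
        past.append(e)
        favorite = max(abs_success, key=abs_success.get)  # best suc_n so far
    n = len(events)
    rates = {m: a / n for m, a in abs_success.items()}    # suc_n(P)
    rates["xMI"] = mi_abs / n
    return rates

random.seed(3)
# Binary events whose frequency drifts from 0.8 to 0.2 halfway through
events = [int(random.random() < (0.8 if i < 500 else 0.2)) for i in range(1000)]
methods = {
    "OI": lambda past: int(2 * sum(past) > len(past)),   # majority rule so far
    "always1": lambda past: 1,
    "always0": lambda past: 0,
}
print(run_game(events, methods))  # xMI tracks the best player's success rate
```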

Observe that in the case of a binary prediction game, (1) the absolute success absn(P) equals the number of P's correct predictions until time n, (2) the success rate sucn(P) equals the relative frequency of P's correct predictions among all predictions of P until time n, (3) limsuc(P) equals P's limiting frequency of correct predictions, (4) ēn equals the relative frequency freqn(E) of event E at time n, and (5) ē equals E's limiting frequency—that is, its statistical probability, p(E) =def limn→∞ freqn(E). Concerning the median value in binary games, (6) ên equals 1 if freqn(E) is ≥ 0.5, and otherwise ên = 0; likewise, (7) ê = 1 if p(E) ≥ 0.5, and otherwise ê = 0. In section 5.9 we shall see that predicting the median value of binary events corresponds to what is called the "maximum rule" of prediction. Philosophically a prediction game can be identified with a possible world or, in cognitive science terminology, with a possible environment. The event sequence constitutes the natural part and the player set constitutes the social part of the world/environment. Apart from the definition of a prediction game, we make no assumptions about these possible worlds. In particular, our approach does not depend on any (problematic) distinction between uniform and nonuniform worlds (such a distinction is notoriously difficult to justify; see Skyrms [1975] 2000, 34ff.). The stream of events (e) can be

5. If Π is countably infinite there may exist no maximum (for instance, the success rates may have the form 1 − 0.5^k for Pk, k ∈ ℕ). In this case, maxsucn is identified with the supremum (the smallest upper bound) of the success rates of the non-MI players. The same is the case for maxabsn.


arbitrary; it can be a random sequence, a Markov chain, a "deterministic" sequence that is strictly determined by some algorithm, or an arbitrarily "chaotic" nonrandom sequence whose finite frequencies do not converge to limits at all. Nothing that concerns the behavior of xMI hangs on this question. We also do not assume a fixed list of players—the list of players may vary from world to world, except that the game always contains xMI, and some fallback strategy of xMI for rounds or situations in which there are no other accessible players (e.g., OI or blind guessing). In a large part of this investigation I will make the realistic assumption that the players of a prediction game are real beings with finitely bounded computational means. Thus we will restrict our investigation to prediction games with finitely many players (or methods) because finite beings can only compare the success rates of finitely many players. Throughout the following we will use the letter m for the finite number of non-MI players, P1, … ,Pm. Some central results of this book on meta-induction will depend on this finiteness assumption. In section 9.2 the finiteness assumption will be discussed; in sections 7.3 and 9.2.2 prediction games with unboundedly growing and with infinitely many methods will be investigated.

5.6  Classification of Prediction Methods and Game-Theoretic Reflections

For a first orientation we develop a brief classification of prediction methods, in short methods. A method is independent or object-level if its predictions depend only on the events (and event frequencies) but not on the predictions (and success rates) of the other players of the prediction game. Nonindependent methods are called dependent methods or meta-level methods; their predictions depend on the predictions of other players. Dependent methods are also called strategies. These strategies can be further classified into success-dependent and success-independent strategies. Examples of the latter are authority-based "blind-favorite" strategies, which always favor one method and never doubt its reliability; another example are majority-based methods, which predict what the majority predicts (independent of their success rates; see section 10.2). Success-dependent strategies are methods of social learning; an important subclass are meta-inductive strategies. A method is called normal (in a given world) if its predictions depend only on past events or successes. A method is paranormal or clairvoyant (in a given world) if it has "privileged" access to future events. The admission of clairvoyant methods is needed only in the philosophical context of the problem of induction, not in the naturalistic setting of cognitive science.


A method is object-inductive or meta-inductive, respectively, if it uses some kind of inductive method to infer future from past events or future from past successes, respectively. It is important to distinguish between a player playing a strategy S and a player merely favoring a strategy S—in the latter case, the player does not play S but plays a meta-strategy that favors S. Formally, a prediction method corresponds to a certain prediction function, denoted by π, that maps an appropriate type of input into a prediction. This input contains the given time n at which the prediction is made, but apart from n it is different for different kinds of players or methods. The output of the prediction function πP of player P, when applied to the given inputs at time n, is P's prediction for time n+1. Or formally, πP: {Inputn(P): n ∈ ℕ} → [0,1], with πP(Inputn(P)) = predn+1(P). The input of every (simple) normal method is the history of all past events; thus, Inputn(P) = (e)↑n =def (e1, … ,en). Later we will also consider the case in which P is a refined independent method (section 5.9): in this case, the input consists of "extended" events that may include other events of the environment that are correlated with the events to be predicted. If P is a dependent method, the input consists of the past events plus the past and present predictions of the players accessible to P; if the set of those is denoted by Πacc(P), then Inputn(P) = {ei: 1 ≤ i ≤ n} ∪ {(predi(Q): Q ∈ Πacc(P)): 1 ≤ i ≤ n+1}. A normal method is computable or recursive if its corresponding π-function is recursive. The input of an alternative method depends on the world it is in. In a normal world, its input is like that of a normal method, restricted to past events. In paranormal worlds, however, an alternative method P may have clairvoyance powers, and its input includes the information {en+1} ∪ {(predn+1(Q): Q ∈ Πacc(P))}; that is, P can see the next event and the other players' predictions for the next time. An alternative method can only be given "externally" by the actions of a real agent; its prediction function is not computable. A normal method may either be given by a computable function or externally as a real agent. The difference between normal and paranormal worlds is that the former exclude clairvoyance and the latter do not. If event and success frequencies converge to limits, a simple way of defining normality (nonclairvoyance) is to say that conditional on the entire past, the predictions of a normal method are probabilistically independent of future events. However, as explained, our general framework does not depend on this or any other demarcation between normal and paranormal worlds. For good reasons we will allow that the given meta-inductive strategy xMI that is evaluated in a prediction game has access not only to independent


(object-level) strategies but also to other possibly competing dependent (meta-level) strategies. This is possible so long as the predictions of the latter are accessible to the former. That xMI may favor other meta-strategies is philosophically important because in this way an infinite regress of meta-levels is blocked. If one meta-level method M1 tries to base its predictions on those of another meta-level method M2, there may arise circular situations in which M1 waits for M2's prediction and M2 waits for M1's prediction, until the deadline for making a prediction is over. There are two possible solutions to this problem.

1. The first solution is to simply assume that circularly related methods are not accessible to each other. We will make that assumption in the following sections. Because it is required that xMI has access to all non-MI players of the game, the assumption implies that dependent non-MI players do not base their predictions on those of xMI, although they may base them on other players. Thus, in regard to these meta-level methods xMI has an access privilege; this assumption is relaxed in prediction tournaments, as we will see. More generally, to avoid circularity the accessibility relations between dependent non-MI players must be hierarchically ordered.

2. A second solution is to assume circular networks of mutually dependent forecasters who revise their predictions in the form of so-called update cycles. This method will be employed in section 10.2 in applications to social epistemology.

The nth round of a prediction game is denoted by rn. Formally each round can be defined as a pair rn = (en, {predn+1(P): P ∈ Π}), consisting of the event and the predictions of the players. The steps taking place in each round are temporally ordered as follows:

a. First (a1) the event en is revealed; next (a2) the independent players update their records of event frequencies, and the dependent players update their records of success frequencies and determine their favored non-MI players (or their weights) for the next round.

b. Then the players deliver their predictions for time n+1 in the following order: first (b1) the independent non-MI players, next (b2) the dependent non-MI players, and finally (b3) the meta-inductivist xMI, who has access to all other players.

In later chapters we will consider the following extensions of prediction games:


In later chapters we will consider the following extensions of prediction games:

1. So far we have assumed that every player of a prediction game delivers a prediction each round. We speak here of persistent players and games. In section 7.2 we admit intermittent players and games. Intermittent players do not deliver a prediction each time. That might be the case because a player refrains from predicting in a given round, or because she makes a prediction that is inaccessible to the meta-inductivist, or because the prediction task is based on a comparison of two cue values that are not "discriminatory," as in the psychological studies explained in sections 7.2 and 10.1.

2. So far we have assumed that what is to be predicted is always the next event. In section 7.4 we will generalize our prediction games to the prediction of future time spans, in the form of the next k events. Finally, in section 7.5 we generalize our results from prediction games to arbitrary action games.

3. A prediction game G with a distinguished meta-inductivist who has simultaneous access to all players in ΠG = {P1, … ,Pm, xMI} is called a prediction game in the narrow sense. Our theorems in chapter 6 will all be about prediction games in the narrow sense. As explained, if a game contains another dependent strategy S, then xMI has the privilege of access to S, but S cannot access xMI, on pain of circularity. To allow a fair comparison between several meta-inductive or (more generally) meta-level strategies—without giving an access privilege to any one of them—we will also investigate prediction tournaments. These are prediction games in a wide sense in which several meta-level strategies M1, … ,Mk are mutually inaccessible and can access the same candidate set of prediction methods, C = {P1, … ,Pm}. Formally we denote prediction tournaments as triples ((e),{P1, … ,Pm},{M1, … ,Mk}).

We conclude this section with a brief reflection on prediction games from the viewpoint of game theory. Prediction games are not one-shot games but iterated games. They are interactive in a wider sense because the actions of the meta-inductivist depend on the actions of the non-MI players, and vice versa the actions of certain alternative non-MI players (e.g., the systematic deceivers introduced in section 6.3) depend on xMI's choice of favorites. Are prediction games also interactive in the narrow sense, meaning that the utilities (success rates) of a player's actions depend on the utilities of the actions of other players? This depends on what one means by "the players' actions." If these actions are the players' predictions, then prediction games are prima facie not interactive in the narrow sense because the score achieved by player P in a given round is solely determined by "nature" (the stream of events) and by P's own prediction.


The only exceptions are the collective meta-inductivists introduced in section 6.7.2: they share their joint success, from which it follows that here we have interactive effects in the narrow sense already at the level of predictions. However, if the players are considered agents who choose which type of method they play—in particular, who choose whether they play independent or dependent methods—then we obtain typical interactive effects in the narrow sense as a rule. These game-theoretic matters will be investigated in section 10.3, where we shall observe the remarkable effect that dependent (or conformist) players are the "egoists" and independent (or nonconformist) players are the altruists of the prediction game.

5.7 Definitions of Optimality, Access-Optimality, and (Access-)Dominance

Let us now come to the notions of optimality and dominance for prediction games. We define these in the manner that is usual in decision and game theory (Weibull 1995, 13). These notions are independent of any assumptions about probability distributions, but they are relativized to a class G of prediction games (or worlds, environments). We distinguish between optimality in the long run and in the short run. In the short run, only approximate optimality is possible. A further complication compared to standard game theory arises because the set of prediction methods (or actions) can vary from world to world. Hence we can define a method to be optimal in a class of prediction games only if it occurs in all of these games; otherwise, we would get the strange result that a method could become optimal by hiding itself from almost all games. Recall that we use the notions "method" and "player" interchangeably.

Our definition of long-run optimality does not presuppose that the success rates of the competing methods converge to limits. A weaker condition is that the differences between the success rates converge to limits; that is possible even when the success rates are endlessly oscillating. Under this assumption the long-run optimality of a method M* requires that the difference between M*'s success rate and that of any other method M converges to a non-negative value—that is, limn→∞(sucn(M*) − sucn(M)) ≥ 0. To obtain a most general definition of "long-run optimality," however, we even admit the case in which the success differences do not converge. If a sequence (x) of bounded real numbers does not converge to a limit (lim), then its elements oscillate endlessly between (ε-neighborhoods of) a limit inferior (liminf) and a limit superior (limsup). Liminfn→∞(xn) [limsupn→∞(xn)] is defined as the smallest [greatest] value r such that xn ≤ r [xn ≥ r] holds for infinitely many times n (see section 6.3.1).


Obviously, liminf ≤ limsup; and if the sequence converges to a limit, liminf = limsup = lim. Should the success difference between two methods M* and M oscillate endlessly, we consider M* at least as good as another method M in the long run if the liminf of their success difference, liminfn→∞(sucn(M*) − sucn(M)), is non-negative (by limsup ≥ liminf this implies that limsupn→∞(sucn(M*) − sucn(M)) ≥ 0 must hold as well).

(Definition 5.1) Optimality
A method M* is called optimal in a class G of prediction games iff for all ((e),Π) in G: M* ∈ Π and for every other method M ∈ Π (i.e., M ≠ M*), M* is at least as good as M in ((e),Π). That "M* is at least as good as M" can be understood in two senses:
(1) In the long run, meaning that liminfn→∞(sucn(M*) − sucn(M)) ≥ 0. We call this "strict long-run optimality."
(2) In the short run, meaning that for all n, sucn(M*) ≥ sucn(M) − r(n), where "r(n)" is a sufficiently small upper bound of the short-run regret of M* compared to the best method, which converges sufficiently fast to zero for n → ∞. We call this "near short-run optimality."

Generally speaking, a player's regret is the loss of her success in comparison to the success of the best player in a considered set of alternative players. "Near short-run optimality" is, of course, a vague notion. Only "strict short-run optimality" (defined by r(n) = 0 for all n) is a sharp notion, but it is too good to be achievable by meta-inductive methods, except in the rare situation in which the short-run success rates of all accessible methods are known in advance. The notion of "near short-run optimality" will be made precise in particular contexts (theorems). Besides the notion of strict optimality we will also introduce the notion of ε-approximate optimality in section 6.2 (where ε is a small number). An ε-approximately optimal method deviates from a strictly optimal method by an amount of at most ε. Thus the definition of ε-approximate optimality is obtained from definition 5.1 by replacing "at least as good" by "almost at least as good," replacing "≥ 0" by "≥ −ε" in (1), and replacing "−r(n)" by "−r(n) − ε" in (2). In conclusion, we have four senses of optimality: long run, ε-approximate long run, near short run, and ε-approximate near short run.

Observe that the optimality of a method M* in a class of prediction games G does not exclude that there exists some other method M′ that is "equally optimal" with respect to G. The latter situation is excluded by the stronger notion of "dominance."


Because players/methods can vary from world to world, we restrict our notion of dominance to prediction games in which dominated players do indeed occur (otherwise dominance would be too easy to achieve). Given a class of prediction games G and a class of players ∆, G↑∆ =def {G ∈ G: ∆ ⊆ ΠG} denotes the subclass of games in G in which all players in ∆ occur. Note that by definition 5.1, the optimality of M in G implies G = G↑{M}. Moreover, we define M(G) =def {M ∈ ΠG: G ∈ G} as the set of players occurring in some game in G.

(Definition 5.2) Dominance
A method M* is called dominant in a class of prediction games G iff M* is optimal in G but no other method M is optimal in G↑{M} (in one of the explained senses; see definition 5.1). Note: This implies that for every other method M ∈ M(G) (with M ≠ M*) there exists a prediction game ((e),Π) ∈ G↑{M} such that M* is better than M in ((e),Π).

If the definition of M*'s dominance does not hold in regard to all methods in M(G) but only in regard to all methods in a more restricted class of methods M ⊆ M(G), we say that M* is dominant in G with respect to the class M. That notion will only become important in section 8.3. Likewise for optimality: if M* is optimal only in regard to a subclass of methods M ⊆ M(G), we say that M* is optimal in G with respect to M. The note in definition 5.2 is expressed more explicitly in proposition 5.1. Its proof rests on two facts: (1) concerning the long run, that ¬(liminfn→∞(sucn(M) − sucn(M*)) ≥ 0) is equivalent to limsupn→∞(sucn(M*) − sucn(M)) > 0, and (2) concerning the short run, that ¬(∀n: sucn(M) ≥ sucn(M*) − r(n)) is equivalent to ∃n: sucn(M*) > sucn(M) + r(n).

(Proposition 5.1) Dominance in prediction games
Assume a method M* is dominant in a class of prediction games G according to definition 5.2. Then for every M ∈ M(G) there exists a prediction game ((e),Π) ∈ G↑{M} containing M* in which M* is better than M in the following sense:
(1) Long run: liminfn→∞(sucn(M*) − sucn(M)) ≥ 0 and limsupn→∞(sucn(M*) − sucn(M)) > 0. (Note: If the success rates converge, this reduces to limn→∞(sucn(M*) − sucn(M)) > 0.)
(2) Short run: ∃n: sucn(M*) > sucn(M) + r(n).
Corollary: Proposition 5.1 holds also for the ε-approximate sense of "dominance."

Proof: Appendix 12.9.


We mentioned in section 5.4 that no inductive and, more generally, no normal method can be absolutely optimal, in the sense of being optimal in regard to the class of all possible prediction games.⁶ In the framework of prediction games, this result is easily proved. For every normal prediction method P one can define a πP-demonic event sequence (e*), which produces a worst-score event for each time n, defined as en* = 0 if predn(P) > 0.5, and else en* = 1 (sketched in code after definition 5.3). This guarantees that P's success rate in predicting (e*) cannot climb above 0.5 and is zero in binary games. Moreover, for the thus constructed (e*), one can define a method P* that perfectly predicts the event sequence (e*), by setting predn+1(P*) = e*n+1. Hence P is not absolutely optimal. Observe that if P is recursive, then the definitions of (e*) and P* are recursive, too. A first result of this sort was proved in a seminal paper by Putnam (1965); it was a founding result of formal learning theory (Kelly 1996, 263; Friend, Goethe, and Harizanov 2007, ix).

Thus, the worst-case result for prediction tasks in regard to reliability is zero success, and in regard to absolute optimality it is a maximal regret of one. For reliability this holds because of the possibility of demonic events, and for absolute optimality because of the possibility of perfectly clairvoyant methods. But observe that if P is a meta-method, the preceding proof only goes through if P does not have access to P*'s predictions. If P can imitate P*'s predictions, it is no longer generally possible to construct an event sequence (e*) that deceives P and at the same time rewards P*. This simple but crucial fact underlies all the results concerning our notion of access-optimality. To define this notion, we first have to explicate the notion of "access."

(Definition 5.3) Access
(1) A method or player M is accessible to a dependent method or player M* iff, at any time of the given prediction game, M* can either (a) observe M's present predictions (external access) or (b) simulate M's present predictions (internal access); moreover (c), based on a or b, M* has the ability to maintain a track record of M's success rates.
(2) A class M of prediction methods is accessible to a method M* iff (a) all methods in M are accessible to M* in the sense of part 1, and (b) M* can simultaneously keep a record of their present predictions and past success rates and apply the meta-inductive algorithm to these data.
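As announced above, the πP-demonic construction admits a direct computational sketch in Python; the example method (a straight-rule frequency predictor) is merely an illustrative assumption:

# Construct the demonic sequence (e*) against a given normal method P.
# 'method' maps the list of past events to a prediction in [0,1].
def demonic_sequence(method, length):
    events = []
    for n in range(length):
        pred = method(events)
        events.append(0 if pred > 0.5 else 1)  # worst-score event for P at time n
    return events

# Illustrative normal method: the straight rule applied to binary events.
def straight_rule(past):
    return sum(past) / len(past) if past else 0.5

e_star = demonic_sequence(straight_rule, 100)
# The method P* defined by pred_{n+1}(P*) = e*_{n+1} predicts e_star perfectly,
# so the straight rule (like any fixed normal method) is not absolutely optimal.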

6. No clairvoyant method can be absolutely optimal either, because these methods are only successful in paranormal worlds and fail in normal worlds.


In case of definition 5.3(1a) we speak of (merely) external access or "output access." In this case, M* can only imitate M's predictions; M* does not understand the algorithm or cannot internalize the process on which M's method is based. In case of definition 5.3(1b) we speak of internal access because here M* understands the algorithms or cognitive mechanisms that underlie M, so that M* can simulate M. For intermittent prediction games (section 7.2), the requirement "at any time of the given prediction game" will be dropped.

A scientific meta-inductivist will not be satisfied with having merely external access to successful prediction strategies; she will try to understand these strategies and gain internal access to them, because if she can simulate a prediction strategy, she no longer depends on the presence of other players playing that strategy. Recall that we have assumed that the given meta-inductivist xMI always has internal access to at least one independent method that she permanently simulates and applies when it is the best or only available method. More generally, a prediction game can also be run by just one person simulating the predictions of all methods in the associated candidate set and selecting her true predictions according to the meta-inductive method.

Definition 5.3(1) explains what it means that a person has access to a single method Mi. More cognitive effort is involved in applying a meta-strategy to a candidate set M = {M1, M2, … }. For this purpose one must access the methods in M simultaneously in the sense explained in definition 5.3(2b). Keeping a track record of all methods in M simultaneously requires significantly more computational effort than having access to all methods in M individually. It may be that all methods in a class of methods M are accessible to a given agent X while M itself is not (simultaneously) accessible to X. In particular, this will always be the case if M is infinite. For this reason we focus our investigation on games with finite sets of players and shift the discussion of infinite sets of players to section 9.2. With this notion of access at hand, our definition of access-optimality and access-dominance is as follows:

(Definition 5.4) Access-optimality and access-dominance
(1) A dependent method M* is access-optimal in a class of prediction games G [in one of the explained senses] iff M* is optimal [in the respective sense] in the class of all prediction games in G whose player set Π is accessible to M*.


(2) A dependent method M* is universally access-optimal iff M* is access-optimal in the class of all prediction games whose player set contains M*.
(3) A dependent method M* is access-dominant in G iff M* is access-optimal in G, but no other dependent [or independent] method M that is accessible to M* (in some G ∈ G) is access-optimal [respectively optimal] in G.

Definition 5.4(1) relativizes the notion of access-optimality to a given class of prediction games G. What we would like to have, for a fully noncircular justification of meta-induction, is a version of meta-induction that is universally access-optimal in the sense of definition 5.4(2). Can we design a version of meta-induction that is indeed universally access-optimal, at least in the long run, with tolerably small regrets in the short run? This constitutes the major research question of chapters 6 and 7.

5.8 Three Related Approaches: Formal Learning Theory, Computational Learning Theory, and Ecological Rationality Research

Prediction games are a rather new epistemological tool in the philosophical literature. There are, however, three related approaches in neighboring fields. In formal learning theory (see Kelly 1996), prediction methods are investigated from an object-level perspective. Only one player, the object-inductive scientist, plays against a stream of events, and it is investigated which cognitive tasks can reliably be achieved under which conditions imposed on the stream of events. Concerning inductive tasks, the results are mainly negative in nature (recall section 5.7). In contrast, our prediction games consist of several prediction methods playing against each other, and the investigation focuses not on the reliability but on the optimality of methods. Even if for every meta-inductive prediction method there exist suitably chosen demonic streams of events for which its predictive success is zero, such a method may still be optimal, provided one can prove that in all demonic cases all other accessible methods equally have zero success.

Are there any positive findings in formal learning theory? Yes, but they are rather weak. Concerning inductive predictions, Kelly's major result is that an infinite sequence of events is correctly predictable by some computable normal method after some finite time iff this infinite sequence belongs to a given recursively enumerable set of data sequences S (1996, 260ff.).


The proof is as follows. Define a prediction method that predicts at any time n the datum e′n+1 of the first data sequence in S whose first n data coincide with the n events that have been observed. If the newly observed event en+1 does not match the prediction e′n+1, the method switches to the next data sequence in S that fits. Because the true event sequence will appear at some finite position in S, the method is guaranteed to predict correctly after some finite time point. Note, however, that a uniform distribution assigns a probability of zero to the set S, because S is countable but there are uncountably many possible data sequences.

In computational learning theory (which is a subdiscipline of theoretical computer science that provides foundations for machine learning), a general label for learning to choose an optimal method or combination of methods for different tasks and environments is meta-learning (Lemke, Budka, and Gabrys 2013). Thus, meta-induction can also be regarded as a program for meta-learning. There is a specific branch of computational learning theory that comes really close to our approach, although it has not been related to the problem of induction. This branch is called online learning under expert advice (Cesa-Bianchi and Lugosi 2006). Here a forecaster predicts an arbitrary event sequence based on the predictions of a set of experts, and the question is whether and how the forecaster can predict optimally—that is, minimize his worst-case success regret in regard to the successes of the experts. Assuming that the forecaster corresponds to the meta-inductivist and the experts to the non-MI players, this setting accommodates a meta-inductive perspective. It is called online learning because the forecaster has to simultaneously learn from past events while making new predictions. Some key results of this book will make use of mathematical theorems from this field.

Neither formal learning theory nor online learning makes any probabilistic (and hence implicitly inductive) assumptions about the event sequences to be predicted. This makes these two approaches strictly more general than probabilistic accounts (Auer et al. 1995, 2; Bubeck and Cesa-Bianchi 2012, 6; Merhav and Feder 1998, 6). For this reason the study of dynamic online learning in prediction games is an important complement to Bayesian accounts of learning, whose results depend on assumptions about prior probabilities (see section 9.1). This does not exclude that our results can be extended by Bayesian assumptions (see, e.g., sections 6.1, 6.7.1, 7.1, 8.1.6–7, 8.3). Some Bayesians argue that prior probabilities should be "adapted" to our local environment.


However, prior to the utilization of inductive inferences from experience, we have no clue which prior distribution fits with our (future) environment. Hence, this argument is circular: prior probabilities over future events are not empirically grounded but subjective in nature (recall chapter 4).

A third related field is the investigation of the efficiency of prediction methods in cognitive psychology, in particular within the cognitive research on ecological and adaptive rationality (Gigerenzer et al. 1999; Todd and Gigerenzer 2012). Although prima facie this research investigates prediction methods from an object-level perspective, it can easily be related to the meta-level perspective because it investigates prediction methods that are based on so-called cues from the environment. These cues are themselves predictive indicators for a given predictive target or "criterion variable," as it is called in psychology. A prediction method that has frequently been investigated in this field is take the best (TTB). This method bases its predictions on the best cue among those cues that deliver a predictive hint in the given round. If we translate this framework to the meta-level setting by interpreting "cues" as accessible prediction methods, then the TTB method turns out to be a certain refinement of the simplest meta-inductive method, imitate the best (ITB). In section 7.2.1 it will be shown how our results can be generalized from ITB to TTB.

The major difference between this third approach and the prediction games approach concerns the method of inductive learning of success rates. In almost all studies the cues' success probabilities are assumed to be constant, given as finite frequencies in finite populations. They are learned by random sampling—by induction from randomly chosen samples that are called training sets.⁷ Random sampling makes inductive assumptions: all individuals in the population (or events in the event sequence) have the same probability of appearing in the sample. Thus the sample distribution is IID (independent and identically distributed). The parameters found in the training set are inductively projected to a randomly chosen test set, which is a subset of the remainder population. The training set's success frequencies will deviate from the test set's success frequencies merely by a symmetrically distributed error variance that can be calculated from the laws of IIDs. In contrast, the inductive learning method in prediction games is dynamic online learning. This method makes no inductive assumptions but allows for worst-case scenarios.

7. ​Exceptions are Dieckmann and Todd (2012) and Rieskamp and Otto (2006), who study prediction tasks in an online learning setting.


In most real-life situations, dynamic online learning is the more realistic learning situation. It differs from random sampling in three respects. First, in dynamic online learning past observations are inductively projected into the future: because one can only observe past but not future events, not all members of the event sequence have the same chance of entering into the observed sample. Second, there is no separation between a training and a test phase. One may artificially draw this separation each round, by considering the observations made so far as a training set and the predicted event as a test set. Third—and this makes the crucial difference—the success rates of the prediction methods (cues) may systematically change in time: the future may be systematically different from the past. Because the success rates may have changed after the training phase has passed, online learning must be dynamic in the sense that it requires constant updating of the inductively projected success rates.

In probabilistic terms, the sampling procedure in dynamic online learning not only generates an error variance, as in random sampling, but may also generate a systematic bias (Brighton and Gigerenzer 2012, 46ff.). This systematic bias manifests itself in the form of a correlation between the (event or success) frequencies in the training phase and those in the test phase. A correlation between past and future events means formally that the underlying event sequence is not IID but a Markov chain. This possibility is admitted in the framework of dynamic online learning; it is even admitted that the event or success frequencies do not converge to a limit but oscillate forever. The worst case is given when the correlations between predictions and events (conditional on information about the past) are negative, which means that the environment is adversarial, either in regard to the events (the events' probabilities are negatively correlated with having been predicted by xMI) or in regard to the non-MI players (their success rates are negatively correlated with being imitated by xMI). Predicting in a dynamically changing and possibly adversarial environment is like playing an unknown game against a possibly adversarial player; one of the first studies of this problem is Banos (1968).

5.9 Simple and Refined (Conditionalized) Inductive Methods

In the concluding section of this chapter we first introduce some basic object-level prediction methods. Then we explain the important distinction between simple and refined (conditionalized) inductive methods, for object- and meta-level methods. The most simple versions of an object-inductive prediction method are OI for linear and OI2 for quadratic loss functions.


The predictions of these two methods depend on the scale-level of the events and predictions. There are three possibilities.

1. OI for discrete events and discrete predictions. In this case, the method OI can be regarded as a combination of the inductive straight rule and the maximum rule of prediction. The straight rule projects the observed frequency to the conjectured frequency limit. This is appropriate as long as it is reasonable to suppose that the event sequence is a random (IID) sequence and events and predictions are statistically independent. Salmon (1963, 1974b, 89–95) and Rescher (1980, chap. 6.3) have shown that for IID sequences the straight rule is most effective in approximating the true frequency limit among a large class of likewise correct competitor rules.⁸ For the purpose of binary or discrete predictions, one applies the so-called maximum rule, which requires the prediction of an event with maximal conjectured probability. For binary events (e ∈ {E,¬E}) this generates the prediction rule: predn(OI) = 1 if freqn(E) ≥ 50 percent, and else = 0. In general, OI's prediction rule can also be expressed by saying that OI always predicts the median value of the observed events, which in the binary case is either zero or one. Given that events and predictions are probabilistically independent, it is easy to prove that this prediction rule maximizes the predictive success frequency among all competing prediction rules that have the form "predict E in r% and ¬E in 1−r% of cases" (Greeno 1971, 95; Reichenbach 1938, 310ff.). The proof is simple: the probability of true predictions for each such rule is given as the probability that one predicts E and E obtains, plus the probability that one predicts ¬E and ¬E obtains, which is r • p + (1−r) • (1−p), where p is the event probability. If p ≥ 0.5, this expression attains its maximum, p, if r = 1; and if p ≤ 0.5, then it attains its maximum, (1−p), if r = 0.⁹ Thus, for binary IID sequences OI's success rate converges to the maximum of p and (1−p). (A short computational illustration follows below.)

8. These are the so-called vanishing-compromise rules that project "(1 − wn) • freqn(E) + wn • P0(E)" to the limit. Here P0(E) is an "a priori probability" of E that depends on the logical strength of E. The weight wn converges to zero for n → ∞; this guarantees that the conjectures of these rules will converge to the true frequency limit. The case wn = 0 gives the straight rule.
9. A similar proof shows that for arbitrary discrete event value spaces {x1, … ,xk}, the success rate is maximized if one always predicts an event value with a maximal probability, instead of predicting different event values with different probabilities.
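As announced, here is a minimal computational illustration of item 1's combination of straight rule and maximum rule, together with a numerical check of the expected-success formula r • p + (1−r) • (1−p); the event probability p = 0.7 is an arbitrary illustrative choice:

# OI for binary events: straight rule (project the observed frequency)
# plus maximum rule (predict the more frequent event value).
def predict_OI(past):
    freq = sum(past) / len(past) if past else 0.5
    return 1 if freq >= 0.5 else 0

# Expected success of a rule that predicts E in a fraction r of all cases,
# for an IID event probability p.
def expected_success(p, r):
    return r * p + (1 - r) * (1 - p)

p = 0.7
print(expected_success(p, 1.0))  # 0.7: the maximum rule attains p for p >= 0.5
print(expected_success(p, 0.7))  # 0.58: "probability matching" does worse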


2. OI for discrete events and real-valued predictions. If events are discrete but real-valued predictions are allowed, these predictions can be interpreted as the predicted probabilities of the events and are described in section 7.1 on Bayesian predictors. However, if the loss function is linear and the events are binary, then the maximum rule is still an optimal prediction rule for IID sequences among all prediction rules that predict a constant value r ∈ [0,1]. The proof is the same as before: the expected success is p • r + (1−p) • (1−r), which is maximized if p ≥ 0.5 and r = 1, or p ≤ 0.5 and r = 0. […] (a > b ∈ [0,1]). Then for i = 1, 2:

limsuc(Pi) = (a + b)/2 (for i = 1, 2), and limsuc(ITB) = limsuc(Pi) − 2 • (a − b)/p.

The worst case is given for a = 1, b = 0, p = 4; then limsuc(Pi) = 1/2 and limsuc(ITB) = 0.

Proof: Appendix 12.12.
A frequently suggested idea to improve ITB is to equip ITB with a certain capability of higher-order induction—that is, induction over more complex patterns. For example, a refined ITB, call it ITB+, may detect that the alternative players P1 and P2 oscillate in their success rates. As soon as ITB+ detects this, it switches from P1 to P2 whenever P1's decreasing phase begins (and vice versa). As a result, ITB+'s success rate would then be greater than that of both alternative players. As an illustration, take the two deceiving players A and B from the worst-case scenario of 6.1 and assume that ITB+ implements the advanced switching mode after having observed the oscillation pattern of A and B two times. The result would be this:


(6.2) Refined meta-induction—ITB+
A scores:    1 |0 0 1 1| 0 0 1 1| 0 0 1 1| 0 0 1 1| …     limsuc(A) = 1/2
B scores:    0 |1 1 0 0| 1 1 0 0| 1 1 0 0| 1 1 0 0| …     limsuc(B) = 1/2
ITB scores:  0 |0 0 0 0| 0 0 0 0| 0 0 0 0| 0 0 0 0| …     limsuc(ITB) = 0
ITB+ scores: 0 |0 0 0 0| 0 0 0 0||1 1 1 1| 1 1 1 1| …     limsuc(ITB+) = 1   (iterated after "||")

Higher-order meta-induction is certainly important in realistic scenarios without deceivers. For the epistemological purpose of defending meta-induction against deceivers, however, this idea leads us nowhere. The refined method ITB+ may easily be deceived by two more refined deceiving players A+ and B+. We just have to assume that A+ and B+ exchange their oscillation mode as soon as ITB+ has detected it, in a way that is as bad as possible for ITB+'s advanced switching mode. Let us further assume that ITB+ falls back to its old switching mode after observing two periods in which A+'s and B+'s oscillation pattern is violated, and that A+ and B+ fall back to their previous oscillation mode as soon as ITB+ does. The result is displayed in (6.3).

(6.3) Refined meta-induction against refined deceivers
A+ scores:   1 |0 0 1 1| 0 0 1 1| 1 1 0 0| 1 1 0 0| …
B+ scores:   0 |1 1 0 0| 1 1 0 0| 0 0 1 1| 0 0 1 1| …
ITB+ scores: 0 |0 0 0 0| 0 0 0 0| 0 0 0 0| 0 0 0 0| …     (all 4 periods iterated)
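The deception pattern behind (6.2) and (6.3) can be reproduced schematically. The following Python sketch is not the book's simulation code; it lets two deceivers score against a simple ITB with an assumed tie-breaking rule (keep the current favorite unless the other player is strictly better):

# ITB against two deceivers A and B: the current favorite scores 0,
# the nonfavorite scores 1 (binary game, schematic).
def simulate(rounds):
    total = {"A": 0.0, "B": 0.0, "ITB": 0.0}
    favorite = "A"
    for _ in range(rounds):
        other = "B" if favorite == "A" else "A"
        total[favorite] += 0.0          # the favorite delivers a worst prediction
        total[other] += 1.0             # the nonfavorite predicts correctly
        total["ITB"] += 0.0             # ITB imitates its favorite's prediction
        if total[other] > total[favorite]:
            favorite = other            # switch to the now-better player
    return {p: s / rounds for p, s in total.items()}

print(simulate(1000))  # A and B approach 1/2; ITB's success rate stays at 0

The resulting success-rate limits (1/2, 1/2, 0) agree with the first three rows of (6.2).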

Of course a further refinement of ITB+ might be suggested—call it ITB++. ITB++ detects that the "oscillation-oscillations" of A+ and B+ recur every "fourth four-period," and having observed this higher-order pattern two times, ITB++ adapts to it. But we can also invent two further refined deceivers, A++ and B++, who react to this and change their oscillation-oscillation pattern in a way that once again reduces ITB++'s predictive success to zero. In conclusion, the refinement approach leads to a potentially infinite hierarchy (in evolutionary terms, a coevolution) of refined methods of induction and corresponding refined methods of deception. Fortunately there is no need to keep growing this hierarchy. In section 6.3 we will collapse the entire hierarchy into one level by our definition of a systematic deceiver.

We conclude this section by presenting some more "optimistic" results concerning ITB that follow from inductive probabilistic assumptions. We assume that the event sequence and the sequences of scores are distributed according to an IID.


That is the case, for example, if the event sequence is generated by random sampling with replacement from a finite population with stationary frequencies, and the non-MI players base their methods on cues that are correlated with the predictive target in this population. Under these conditions, the success probabilities of the non-MI players converge and are uncorrelated with their role of being or not being ITB's favorite, which allows us to prove a straightforward long-run optimality theorem for ITB. For this purpose we define the favorite-conditional success rate of non-MI players and of ITB or other one-favorite meta-inductivists as follows (where ∑s denotes the sum of numbers in a sequence of numbers s).

Terminological conventions. For every kind of one-favorite meta-inductivist xMI:
• absn(xMI|Pi) =def ∑(scorej(xMI): 1 ≤ j ≤ n, favj(xMI) = Pi) is the absolute success of xMI conditional on times until time n for which Pi was xMI's favorite. Likewise,
• absn(Pi|xMI) =def ∑(scorej(Pi): 1 ≤ j ≤ n, favj(xMI) = Pi) is the absolute success of non-MI player Pi conditional on times (≤ n) for which Pi was xMI's favorite.
• numn(fav(xMI) = Pi) is the number of times (≤ n) for which Pi was xMI's favorite.
• freqn(fav(xMI) = Pi) =def [numn(fav(xMI) = Pi)]/n is the (relative) frequency of times (≤ n) for which Pi was xMI's favorite.
• sucn(xMI|Pi) =def absn(xMI|Pi)/numn(fav(xMI) = Pi) is the success rate of xMI conditional on times (≤ n) for which Pi was xMI's favorite. So long as numn(fav(xMI) = Pi) is zero, we set by convention sucn(Pi|xMI) =def sucn(Pi).
• sucn(Pi|xMI) =def absn(Pi|xMI)/numn(fav(xMI) = Pi) is the success rate of Pi conditional on times (≤ n) for which Pi was xMI's favorite.
• limsuc(xMI|Pi) = limn→∞ sucn(xMI|Pi), and likewise for limsuc(Pi|xMI).
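These conventions amount to straightforward bookkeeping over the play history; here is a minimal Python sketch, in which the data layout of 'history' is an illustrative assumption:

# history: one dict per round, e.g. {"favorite": "P1",
#   "scores": {"P1": 1.0, "P2": 0.0, "xMI": 1.0}}
def conditional_stats(history, player):
    fav_rounds = [r for r in history if r["favorite"] == player]
    num = len(fav_rounds)                                  # num_n(fav(xMI)=P_i)
    freq = num / len(history)                              # freq_n(fav(xMI)=P_i)
    abs_xmi = sum(r["scores"]["xMI"] for r in fav_rounds)  # abs_n(xMI|P_i)
    abs_p = sum(r["scores"][player] for r in fav_rounds)   # abs_n(P_i|xMI)
    # the book's convention falls back to the unconditional success rate when
    # num = 0; for brevity this sketch returns None in that case
    suc_xmi = abs_xmi / num if num else None               # suc_n(xMI|P_i)
    suc_p = abs_p / num if num else None                   # suc_n(P_i|xMI)
    return {"num": num, "freq": freq, "suc_xMI|P": suc_xmi, "suc_P|xMI": suc_p}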

Based on the fact that xMI's and Pi's predictions are identical for all times for which Pi was xMI's favorite, the following facts are easily proved.

(6.4) Fact about one-favorite meta-induction
For all one-favorite meta-inductivists xMI and times n ∈ ℕ:
(1) sucn(xMI|Pi) = sucn(Pi|xMI) for all i ∈ {1, … ,m}.
(2) sucn(xMI) = ∑1≤i≤m freqn(fav(xMI) = Pi) • sucn(Pi|xMI); that is, xMI's success rate is a weighted average of the non-MI players' favorite-conditional success rates, weighted by their frequency of being xMI's favorite.

Proof: Appendix 12.13.


(Proposition 6.3) Prediction games with favorite-independent success probabilities
Assume that in a prediction game ((e),{P1, … ,Pm, ITB}) the success rates of all non-MI players converge to success probabilities that are favorite-independent in the sense that limsuc(Pi|ITB) = limsuc(Pi). Then ITB's success rate converges to the maximal success rate in the long run: limsuc(ITB) = maxlimsuc.

Proof: Appendix 12.14.
Note that proposition 6.3 does not presuppose the existence of a unique best player. It is much more difficult to compute good upper bounds for the probability of ITB's short-run regret. For simplicity, we assume a prediction game with only two non-MI players, ((e),{P1,P2,ITB}), whose success probabilities are IID with limsuc(P2) = p and limsuc(P1) = p + δ (δ > 0). It is a well-known fact that the standard deviation of the n-membered sample frequencies of a binary IID variable with probability p, also called the standard error, is given as en ≈ √(p • (1−p)/n) (via approximation of the binomial by a Gaussian distribution for n ≥ 30; "≈" for "approximately equal"). With a probability³ of ≈95.5 percent, the sample mean after n rounds lies within the ±2 • en interval around p. The idea of proposition 6.4 is to set ε = 2 • en and to compute a time k […] 0.5 and limsuc(P1) = p + δ (0 […] favn+1(εITB) = the first player P with sucn(P) > sucn(favn(εITB)) + ε if such a player exists; otherwise, favn+1(εITB) = favn(εITB). The performance of εITB is illustrated by the computer simulation in figure 6.3, in which εITB plays against the two alternative players of the convergent oscillation scenario of figure 6.2 (this time εITB's first prediction is wrong). As soon as the success difference between the alternative players becomes smaller than ε, εITB stops switching and sticks to its current favorite forever, with the result that εITB's success rate recovers and ε-approximates the maximum success of the two alternative players. The general performance of εITB is spelled out in theorems 6.3 and 6.4, with theorem 6.3 being logically more general and theorem 6.4 philosophically more transparent.

[Figure: success rate vs. round, 0–200; curves labeled "convergent oscillation as in fig. 6.2" and "εITB for ε = 0.05."]

Figure 6.3 Epsilon-cautious imitate the best (εITB) in the convergent oscillation scenario of figure 6.2. As soon as the oscillation amplitude sinks below ε = 0.05, εITB stops switching favorites.
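The ε-cautious switching behavior visible in figure 6.3 rests on a one-line rule; here is a minimal sketch, assuming a dict suc of current success rates is maintained as in the conventions of this chapter:

# epsilon-cautious favorite update: switch only if some player's success
# rate exceeds the current favorite's by more than epsilon.
def update_favorite_eITB(suc, favorite, epsilon):
    best = max(suc, key=suc.get)
    if suc[best] > suc[favorite] + epsilon:
        return best       # new favorite for time n+1
    return favorite       # otherwise stick with the old favorite

Once the success differences of the leading players oscillate with an amplitude below ε, the switching condition is never again triggered, which is why εITB in figure 6.3 stops switching.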

We say that a prediction game contains a subset BP ⊆ {P1, … ,Pm} of ε-best non-MI players with winning time w iff for all times n ≥ w, (1) each player in BP is more successful than each non-MI player outside BP (i.e., ∀P ∈ BP ∀Q ∈ (Π¬MI − BP): sucn(P) > sucn(Q)), and (2) the successes of all BP-players are ε-close to each other (i.e., ∀P ≠ P′ ∈ BP: |sucn(P) − sucn(P′)| ≤ ε). Note that, thus defined, the subset BP is not generally unique, and it need not be unique for the purpose of theorem 6.3. We can make BP unique by selecting the BP that has a minimal winning time; this improves the upper bound of the short-run regret of εITB. The proof of theorem 6.3(2) establishes that εITB stops switching favorites, and thus ε-approximates the maximal success rate, if there exists a subset of ε-best non-MI players. The proof rests on the observation that although not every player whose success is ε-close to the maximal success at some time after w needs to be an ε-best player (for her success may decrease afterward), every player with maximal success at some time after w must be an ε-best player. Thus, εITB will change her favorite at most once after time w.

(Theorem 6.3) εITB in games with a set of ε-best players
For every prediction game ((en), {P1, … ,Pm, εITB}) whose set of non-MI players contains a subset BP of ε-best players with winning time w, the following holds:


(1) Short run:
(i) For all times n ≥ w: sucn(εITB) ≥ maxsucn − 2 • ε − (w • maxsucw + 1)/n.
(ii) For all times n ≥ s, where s is the maximum of w and εITB's last switch time: sucn(εITB) ≥ maxsucn − ε • (n + s − 1)/n − (w • maxsucw + 1)/n.
(2) Long run: εITB's success ε-approximates the maximal success of the non-MI players; that is, limn→∞(maxsucn − sucn(εITB)) ≤ ε.

Proof: Appendix 12.16.
The short-run result of theorem 6.3(1i) is not especially good because the winning time w may occur arbitrarily late. However, w depends only on the speed of the ε-approximate convergence of the best non-MI players' success rates. If their success differences stop oscillating early, then εITB's short-run regret will also be small. The worst-case short-run regret of theorem 6.3(1i) is independent of εITB's last switch time s, which may occur arbitrarily late. The price of this advantage is that this result assigns εITB a time-independent regret of 2 • ε instead of ε. The upper regret bound stated in theorem 6.3(1ii) involves the dependence on s; this bound is worse for the short but better for the long run and gives the right limit behavior of εITB for n → ∞, which 6.3(1i) does not. The results of theorem 6.3 can be summarized as follows: εITB predicts approximately optimally in all worlds whose non-MI player set contains a subset of ε-best players whose winning time does not occur too late. As in the case of ITB, one can obtain improved probabilistic short-run results for εITB by assuming that the sequence of scores is IID-generated; we dispense with the details.

The condition of theorem 6.3 is implied by the (stronger) condition that the success rates of the non-MI players converge toward a limit. That is the content of theorem 6.4. We define the joint α-convergence point, nα, of a given (finite) set of non-MI players Π¬MI as the earliest time after which the success rate of every player in Π¬MI deviates from its limit success by at most α.

(Theorem 6.4) εITB in success-convergent games
For every prediction game ((en), {P1, … ,Pm, εITB}) whose non-MI players' success rates converge to limits, the following holds: The results of theorem 6.3 hold when BP ⊆ {P1, … ,Pm} is identified with the subset of non-MI players whose limit success is at most ε/2 below maxlimsuc,


and the winning time w is identified with the non-MI players' joint convergence point nα for α = min({δ/2, ε/4}), where δ is the difference between the worst limit success of the non-MI players in BP and the best limit success of the non-MI players outside of BP.

Proof: Appendix 12.17.
The ε-cautious version of ITB is long-run optimal in a broader class of possible worlds than the simple ITB, at the cost that its optimality is not strict but ε-approximate. Is this good enough to count as a justification? We think so. By way of comparison, approximate optimality relates to strict optimality just as approximate truth relates to strict truth. For almost all practical purposes there exists a choice of ε that is small enough to count as practically insignificant, and theorems 6.3 and 6.4 hold for all choices of ε. So concerning long-run performance, the approximate version of optimality is "almost as good" as its strict counterpart. However, there is a trade-off in respect to short-run performance: a small ε implies a large w in theorem 6.3(1) and thus a large short-run regret. The freedom to make ε small is limited by the interest in keeping the short-run regret small.

Theorems 6.1 through 6.4 are partial successes for the meta-inductive approach, as they establish (ε-approximate) access-optimality for large classes of worlds, progressively including more kinds of strange worlds. In particular, theorem 6.4 establishes ε-approximate optimality for the class of all possible worlds in which the success rates of non-MI methods are reasonably calculable in the sense of converging to a limit. This class covers a huge subclass of paranormal worlds—worlds containing oracles, or God-guided future-tellers, or whatever, provided their success frequencies converge. Nevertheless we are not yet satisfied with these results, for two reasons. First, it is unclear why methods whose success rates do not converge in the long run should be excluded a priori. Indeed, we can never know with certainty whether the observed frequencies will converge to limits, because all that we can observe is the short run and we need induction to infer from the short to the long run. Second, the short-run results for ITB and εITB are not particularly good, and we are interested in possible means to their improvement.

In conclusion, the prospects for the meta-inductive approach to Hume's problem are encouraging. Nevertheless the genuine worst cases still lie ahead of us. They lie in the class of prediction games in which the success rates of the non-MI players oscillate in a nonconvergent fashion. That is the topic of the next section.


6.3 Systematic Deception: Fundamental Limitations of One-Favorite Meta-Induction

6.3.1 General Facts about Nonconverging Frequencies

If the frequencies freqn(E) of a (binary) event type E in an underlying sequence (e) do not converge to a limit, then they oscillate endlessly between upper and lower values that converge to their limit superior (limsup) and their limit inferior (liminf), respectively. These two limits are possessed by all (even by nonconverging) sequences. The limsup [or liminf] of a sequence of real numbers (xn) is defined as the greatest [or smallest] value r such that xn ≥ r [or xn ≤ r] holds for infinitely many times n. Because there need not be a greatest [or smallest] such value r, one takes the sup(remum) [or inf(imum)] of these values. Thus:

liminfn→∞ freqn(E) =def inf({r ∈ ℝ: freqn(E) ≤ r for infinitely many n ∈ ℕ}).
limsupn→∞ freqn(E) =def sup({r ∈ ℝ: freqn(E) ≥ r for infinitely many n ∈ ℕ}).

If the sequence of frequencies (freqn(E): n ∈ ℕ) converges, then liminfn→∞ freqn(E) = limsupn→∞ freqn(E) = limn→∞ freqn(E). If freqn(E) does not converge, then limsupn→∞ freqn(E) > liminfn→∞ freqn(E), and for every small ε > 0 there exists an m such that for all n ≥ m, freqn(E) oscillates with amplitudes ranging from a value ε-close to liminfn→∞ freqn(E) to a value ε-close to limsupn→∞ freqn(E), possibly with long fluctuations in between these amplitudes.

It is useful to calculate the minimal period p of endless frequency oscillations around a given value r. The calculation shows that these minimal periods grow exponentially with their number. We define q(n) as the length of a half-period starting with time n. A declining half-period q−(n) of minimal length whose amplitude shrinks from r + γ to r − γ consists solely of e = 0 events. Analogously, an inclining half-period q+(n) of minimal length consists solely of e = 1 events. Their lengths are calculated by the following equations (see figure 6.4):

Calculation of q−(n): [n • (r + γ) + 0]/[n + q−(n)] = r − γ.
Calculation of q+(n): [n • (r − γ) + q+(n)]/[n + q+(n)] = r + γ.

By algebraic transformations we obtain q−(n) = n • (2 • γ/(r − γ)) and q+(n) = n • (2 • γ/(1 − r − γ)).
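For illustration (a direct instantiation of the two formulas): with r = 0.5 and γ = 0.1, q−(n) = n • (0.2/0.4) = n/2 and likewise q+(n) = n/2, so each minimal half-period is half as long as the entire history preceding it—which is why successive oscillations must become ever longer.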


[Figure: relative frequency vs. time, oscillating around r with amplitude γ; a minimal declining half-period q−(n) starting at time n and a minimal inclining half-period q+(n′) starting at time n′.]

Figure 6.4 Minimal declining (q−(n)) and inclining (q+(n)) half-period of frequency oscillations with amplitude 2 • γ.

Note that if r is greater [equal, smaller] than 0.5, then q−(n) is smaller [equal, greater] than q+(n). Assuming that the oscillation starts with the maximum amplitude r + γ at time n, the minimal length of k oscillation periods, π(k), is calculated as follows.⁴

(6.5) Exponential time-dependence of frequency oscillation periods
π(k) = ((r + γ)/(r − γ))^k • ((1 − r + γ)/(1 − r − γ))^k.

Thus, the length of k oscillation periods grows exponentially with k. Because q−(n) and q+(n) are linear functions of n = π(k), the minimal length of the oscillation periods also grows exponentially with k.

4. Proof: Abbreviating π(k) = π, we calculate: π(k + 1) = π + q−(π) + q+(π + q−(π)) = π + π • (2 • γ/(r − γ)) + (π + π • (2 • γ/(r − γ))) • (2 • γ/(1 − r − γ)) = (…) = π(k) • ((r + γ)/(r − γ)) • ((1 − r + γ)/(1 − r − γ)). Starting with π(0) = 1 and iterating k times gives equation (6.5).
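Numerically (again with the illustrative choice r = 0.5 and γ = 0.1): each full period multiplies the elapsed time by (0.6/0.4) • (0.6/0.4) = 2.25, so π(k) = 2.25^k; after ten periods the game must already have lasted roughly 2.25^10 ≈ 3,300 rounds.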

[Figure: success rate vs. round, 0–400; curves labeled "four systematic deceivers oscillating around each other with constant amplitude ε" and "εITB (ε = 0.05)."]

Figure 6.5 Epsilon-cautious imitate the best (εITB) against four systematic deceivers in nonconvergent success oscillations (ε = 0.05, binary prediction game, random event sequence with limfreq = 2/3).

6.3.2 Nonconvergent Success Oscillations and Systematic Deceivers

The ε-cautious ITB meta-inductivist is deceived by two or more leading alternative non-MI players if their success rates oscillate around each other in a nonconvergent manner with a nondiminishing amplitude δ that is greater than the chosen switching threshold ε (δ > ε). Following from the preceding considerations, the minimal periods of nonconvergent oscillations grow exponentially in time. The worst case is given when δ is as close as possible to ε—that is, when a deceiving player predicts with minimal scores as soon as εITB chooses this player as its favorite, and predicts with maximal scores as soon as εITB switches to another deceiving player as favorite. The computer simulation in figure 6.5 shows a binary prediction game in which four alternative players deceive εITB. As long as a deceiver D1 is εITB's favorite, D1 predicts the wrong result until his success is more than ε below some deceiver D2. At this time εITB changes its favorite from D1 to D2, D1 starts to predict correctly and D2 starts to predict incorrectly, until the next switch of εITB occurs. In this way, εITB's success rate is driven to zero, while the mean success of the four deceivers per oscillation is 3/4. Observe that the fact that the inclining half-period is longer than the declining half-period (section 6.3.1) produces small superoscillations of the oscillation minima and maxima. One might think that there is an easy defense strategy against nonconvergent deceivers for εITB: it just needs to lower its significance threshold ε so that it becomes smaller than the oscillation amplitude of the deceivers. However, this idea is illusory. Determined deceivers may react and adapt their oscillation amplitude to the chosen ε-threshold of the meta-inductivist. The worst case involves systematic deceivers, who are assumed to know whether the meta-inductivist chooses them as her favorite for the prediction of the next event.


They use this information to deceive the meta-inductivist by delivering a worst prediction whenever the meta-inductivist imitates them. The worst prediction for a given time n is 0 if en ≥ 0.5, and 1 otherwise. Thus, the score of a worst prediction is 0 in the binary prediction game, and a value between 0 and 0.5 (depending on en) in a real-valued prediction game. The formal definition of systematic deceivers applies to all one-favorite meta-inductivists.

(Definition 6.1) Systematic deceiver
A non-MI player P is a systematic deceiver of a one-favorite meta-inductivist xMI iff for all times n > 1: if P is xMI's favorite for time n, then P delivers a worst prediction for time n; otherwise, P predicts the right result for time n.

How does a systematic deceiver know when xMI has chosen her as favorite? We might assume that she knows this by clairvoyance, in which case we would assume that we are in a paranormal world. But in principle this knowledge need not involve clairvoyant powers; it could also be based on causally normal information acquisition because, as we may recall from section 6.1, xMI determines its favorites based on the records of the previous round, before the non-MI players deliver their predictions.

6.3.3 Limitations of One-Favorite Meta-Induction

The negative result of figure 6.5 holds for all one-favorite meta-inductivists who are playing against systematic deceivers, independently of their choice of a switching threshold ε ≥ 0. Even if the one-favorite meta-inductivist changes this threshold during the game, the result will stay the same, except for a change of oscillation periods. Of course, a systematic deception strategy comes partially at the cost of the deceiver's own success. But if a one-favorite xMI is playing against m deceivers, then at each time there will be m−1 deceivers who predict correctly, because they are not xMI's favorite. Thus, their average success rate will converge to the limit (m−1)/m, while xMI's success rate converges to zero as in figure 6.5. The result just mentioned is the content of theorem 6.5(1) below. Theorem 6.5(1i) holds for all times, but theorem 6.5(1ii) refers to the limit success rates. In theorem 6.5(1i) we cannot determine the limit successes of individual deceivers because they depend on the type of the one-favorite meta-inductivist xMI (recall that one-favorite meta-inductivists switch to the better but not necessarily to the best players). If we know that xMI is ITB or εITB, then we can prove the stronger result 6.5(1ii).


One can prove a more general result that holds not only for systematic but also for "mild" sorts of deceivers who simply lower their success rate as soon as they are the meta-inductivist's favorite. Deceivers in this broad sense are non-MI players whose success rate is negatively correlated with their role of being xMI's favorite. We explicate the negative correlation condition for a given player P as follows.

(Definition 6.2) Negatively favorite-correlated success rates
The success rate of a non-MI player P is negatively correlated with its position as xMI's favorite after time k by an amount of δ > 0 iff for all times n ≥ k, sucn(P) − sucn(P|xMI) ≥ δ holds (where, recall, sucn(P|xMI) is the success rate of P at time n conditional on those times for which P was xMI's favorite).

The finite time point k is introduced because accidental negative correlations can frequently occur due to random success fluctuations in the initial phases of a prediction game. Theorem 6.5(2) expresses our general result for prediction games with deceivers in the broad sense. Recall that xMI's internal fallback strategy is included in the set of non-MI players as a virtual player. Thus, in theorem 6.5 we assume that even xMI's fallback strategy—such as OI or blind guessing—is a deceiving strategy. That is possible by assuming a paranormal world with a "demonic" event sequence whose events are negatively correlated with the predictions of the fallback strategy.

(Theorem 6.5) Limitations of one-favorite meta-induction
For every prediction game ((e),{P1, … ,Pm, xMI}) in which xMI is a one-favorite meta-inductivist:
(1) If the game is binary and every non-MI player P is a systematic deceiver, then:
(i) For all times n: (a) sucn(xMI) = 0; (b) the mean success rate of the non-MI players equals (m−1)/m; and (c) maxsucn ≥ (m−1)/m.
(ii) (a) If xMI is ITB, then limsuc(Pi) = (m−1)/m (for all i ∈ {1, … ,m}).
(b) If xMI is εITB, then Pi's success rates oscillate with limits inferior and superior satisfying (m−1)/m − ε ≤ liminfn→∞(sucn(Pi)) ≤ (m−1)/m and (m−1)/m ≤ limsupn→∞(sucn(Pi)) ≤ (m−1)/m + ε.


(2) If for every Pi ∈ Π¬MI, Pi's success rate is negatively correlated with Pi's role of being xMI's favorite after some time ki by an amount of δ > 0, then xMI is not δ-approximately long-run optimal; that is, limn→∞(maxsucn − sucn(xMI)) ≥ δ.

Proof: Appendix 12.18.
Recall that if the sequences of events and scores obey the laws of IIDs, non-MI players with negatively favorite-correlated success rates, as in theorem 6.5(2), are impossible (recall proposition 6.3). In theorem 6.5(1) we assumed that all non-MI players are systematic deceivers, for the reason that the type of one-favorite meta-inductivist was left unspecified. If we assume that xMI is ITB or εITB, we can prove theorem 6.5(1i) under the weaker condition that there exists a subset D of "ε-superior" deceivers, in the sense that for all times n, sucn(P) > sucn(P′) + ε holds for all non-MI players P in D and P′ outside of D. In conclusion, the limitations of one-favorite meta-induction that result from the presence of deceivers are fundamental and insuperable.⁵ The only possibility to establish optimality results for one-favorite meta-inductivists is to exclude deceivers from their candidate set, which leads us to the topic of the next section.

6.4 Deception Detection and Avoidance Meta-Induction (ITBN)

The meta-inductivist's natural reaction to systematic deception is deception detection. To explicate this notion we define the notion of a (non)deceiver at a given time n in relation to a given one-favorite meta-inductivist xMI. Recall the definition of the favorite-conditional success rate, sucn(P|xMI), of a non-MI player P. We say that a non-MI player P is a deceiver at a time n if the difference between P's (unconditional) success rate and P's favorite-conditional success rate exceeds a practically given significance threshold, εd, which is called the deception threshold. In order to avoid unwanted classifications of players as deceivers in the initial phase of the game, when accidental success fluctuations are high, we assume that the meta-inductivist starts the recording of the deception behavior of a non-MI player P only when P has been favorite for at least k1 rounds and nonfavorite for at least k2 rounds.

5. One possibility to protect ITB against deceivers is to add a "random perturbation" to the non-MI players' relative success that diminishes with increasing n (Cesa-Bianchi and Lugosi 2006, section 4.3). However, by this move xMI is transformed into a variant of randomized weighted meta-induction (section 6.7.1).


fluctuations are high, we assume that the meta-inductivist starts the recording of the deception behavior of a non-MI player P only when P has been favorite for at least k1 rounds and nonfavorite for at least k2 rounds.

(Definition 6.3)  Deceivers
A non-MI player P is said to be a deceiver of a one-favorite meta-inductivist xMI at time n iff
(a) sucn(P) − sucn(P|xMI) > εd, where
(b) numn(fav(xMI) = P) ≥ k1 and (c) numn(fav(xMI) ≠ P) ≥ k2.
For any time n, Dn ⊆ {P1, … ,Pm} is the set of deceivers at time n, NDn = {P1, … ,Pm} − Dn is the set of nondeceivers at time n, and maxsucn(ND) is the maximal success rate of the nondeceivers at time n.

We call k1 and k2 the start parameters, and conditions (b) and (c) the start conditions for the deception recording of player P. In our computer simulations we have chosen k1 = k2 = 5. Definition 6.3 covers not only systematic deceivers, whose success rate when being favorite is zero, but also more harmless kinds of deceivers, whose favorite-conditional success rate drops below their unconditional success rate by more than εd. Note that our definition of a deceiver is time relative: a deceiver at time n can become a nondeceiver at later times, for example, because he has lowered his success rate while being a nonfavorite. In particular, by definition 6.3 a player counts as a nondeceiver so long as he has never been a favorite.

The basic idea of avoidance meta-induction is to avoid the imitation of non-MI players who have been detected as deceivers. The method ITBN—short for imitate the best nondeceiver—proceeds as follows: at any time n, ITBN chooses his favorite from among those non-MI players who are not deceivers at time n. ITBN switches to a new (nondeceiving and leading) favorite N for time n+1 if either ITBN's old favorite O has become a deceiver at time n, or O's success rate at time n has dropped by more than ε below the maximal success rate of the nondeceivers at time n. Formally we can express this as follows:

(6.6)  Favorite of ITBN
favn+1(ITBN)
= favn(ITBN), if favn(ITBN) ∈ NDn and sucn(favn(ITBN)) ≥ maxsucn(ND) − ε,
= the first player in NDn with sucn(P) = maxsucn(ND), otherwise.
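As an illustration of how the two definitions interlock, the following Python sketch implements the deceiver test of definition 6.3 and the favorite rule (6.6); the Stats container and the example numbers are hypothetical bookkeeping assumed to be maintained elsewhere by a simulation.

# Sketch of definition 6.3 and rule (6.6); Stats is a hypothetical container
# for the statistics an ITBN simulation would track per non-MI player.
from dataclasses import dataclass

@dataclass
class Stats:
    suc: float          # suc_n(P): unconditional success rate
    suc_fav: float      # suc_n(P|ITBN): success rate while being favorite
    rounds_fav: int     # num_n(fav(ITBN) = P)
    rounds_nonfav: int  # num_n(fav(ITBN) != P)

def is_deceiver(s, eps_d, k1=5, k2=5):
    # Definition 6.3: deceiver iff (a) suc_n(P) - suc_n(P|xMI) > eps_d,
    # given the start conditions (b) and (c).
    return (s.rounds_fav >= k1 and s.rounds_nonfav >= k2
            and s.suc - s.suc_fav > eps_d)

def next_favorite(players, stats, old_fav, eps, eps_d):
    # Rule (6.6): keep the old favorite if it is a nondeceiver within eps of
    # the nondeceivers' maximal success; otherwise switch to the first
    # nondeceiver with maximal success.
    nd = [p for p in players if not is_deceiver(stats[p], eps_d)]
    maxsuc_nd = max(stats[p].suc for p in nd)  # nd is nonempty: never-favorites count as nondeceivers
    if old_fav in nd and stats[old_fav].suc >= maxsuc_nd - eps:
        return old_fav
    return next(p for p in nd if stats[p].suc == maxsuc_nd)

stats = {"P1": Stats(0.8, 0.1, 10, 30), "P2": Stats(0.6, 0.6, 8, 32)}
print(next_favorite(["P1", "P2"], stats, "P1", eps=0.05, eps_d=0.2))  # -> P2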



The method ITBN works with two significance thresholds: ε for switching and εd for deception. For the sake of the proof of the next theorem, we have to require that εd < …

… for every q > 0, PAWq, whose weights are defined as wn(P) = max((atn(P))q, 0). Note that AW = PAW1. PAWq coincides with Fp+1, introduced in appendix 12.23, for which Cesa-Bianchi and Lugosi (2006, 12) prove the general regret bound stated in proposition 12.1.

6.6.2  Exponential AW

We do not explain the details of PAWq because there is a better improvement: exponential attractivity-weighted meta-induction, abbreviated as EAWh and EAW. The weights of EAWh are defined under the restrictive condition that the duration of the prediction game—the so-called prediction horizon h—is known in advance. Different from AW, the weights of EAWh are defined with the help not of the relative but of the absolute attractivities, Atn(P) = n · atn(P) (Cesa-Bianchi and Lugosi 2006, 16ff.).

(6.7)  Weights of the method EAWh (exponential attractivity-weighted meta-induction)
With known prediction horizon h, wn(P) =def e^(η · Atn(P)), with η = √(8 · ln(m)/h).
Note: w0(Pi) = 1 (for all i); thus, pred1(EAWh) = ∑1≤i≤m pred1(Pi)/m.

EAWh does not disregard negative attractivities. However, because exponential weights with negative attractivities decrease exponentially with increasing n, their influence vanishes in the long run. Thus, EAWh "gradually forgets" unsuccessful non-MI players. One can equivalently define the exponential weights by means of absolute successes instead of absolute attractivities, as in 6.8(i), because the constant factor 1/e^(η · absn(EAW)) cancels by normalization (Cesa-Bianchi and Lugosi 2006, 14). Another frequently used equivalent formulation of EAWh's weights is in terms of the non-MI players' (negative) absolute losses, Lossn(P) = ∑1≤t≤n losst(P) = n − absn(P):

(6.8)  Equivalent weights of EAWh
(i) wn(P) = e^(η · absn(P))  [because e^(η · Atn(P)) = e^(η · (absn(P) − absn(EAW))) = e^(η · absn(P))/e^(η · absn(EAW))].
(ii) wn(P) = e^(−η · Lossn(P))  [because e^(−η · Lossn(P)) = e^(−η · (n − absn(P))) = e^(η · absn(P))/e^(η · n)].
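A quick numeric check (my illustration with made-up success records) confirms the equivalences in (6.8): after normalization, the weights computed from Atn, from absn, and from −Lossn coincide.

# Check (6.8): exponential weights from At_n, abs_n, and -Loss_n agree after
# normalization (made-up numbers; abs_EAW stands for EAW's absolute success).
import math

n, eta, abs_EAW = 10, 0.3, 6.0
abs_P = [7.0, 5.0, 4.0]  # abs_n(P) for three players

def normalize(ws):
    total = sum(ws)
    return [w / total for w in ws]

w_at   = normalize([math.exp(eta * (a - abs_EAW)) for a in abs_P])  # via At_n(P)
w_abs  = normalize([math.exp(eta * a) for a in abs_P])              # via abs_n(P)
w_loss = normalize([math.exp(-eta * (n - a)) for a in abs_P])       # via -Loss_n(P)
assert all(abs(x - y) < 1e-12 and abs(x - z) < 1e-12
           for x, y, z in zip(w_at, w_abs, w_loss))
print(w_abs)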

As theorem 6.9(i) informs us, EAWh's worst-case regret bound is significantly better than that of AW: instead of a dependency of this bound on the


square root of m, we now have merely a dependency on the square root of the logarithm of m. Thus, the maximal short-run regret of EAWh can be small even if m is larger than n, provided that log m is smaller than n. The disadvantage of EAWh is that it is only defined for prediction games with a fixed prediction horizon h that is known in advance and at which the game is evaluated. EAWh's performance can be made independent of the fixation of a prediction horizon by making the proportionality factor η in (6.7) dependent on the variable time n; we call the resulting method EAW.

(6.9)  Weights of the method EAW for arbitrary prediction horizon
wn(P) =def e^(−η · Lossn(P)), with η = √(8 · ln(m)/(n+1)).
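To make (6.7) and (6.9) concrete, the following minimal Python sketch runs EAW with the time-varying η of (6.9) on a made-up real-valued game with two constant predictors and absolute-distance loss; all numbers are assumptions for illustration.

# Sketch of EAW with the time-varying eta of (6.9), for a real-valued game
# with absolute-distance loss (illustration only; players and events made up).
import math

players = [lambda n: 0.9, lambda n: 0.6]  # two constant predictors
event = lambda n: 0.8                     # true event value each round
m, rounds = len(players), 200
loss_sum = [0.0] * m                      # Loss_n(P_i)

for n in range(rounds):
    eta = math.sqrt(8 * math.log(m) / (n + 1))
    weights = [math.exp(-eta * L) for L in loss_sum]
    total = sum(weights)
    preds = [p(n) for p in players]
    pred_eaw = sum(w / total * x for w, x in zip(weights, preds))
    e = event(n)
    for i, x in enumerate(preds):
        loss_sum[i] += abs(x - e)

print(round(pred_eaw, 3))  # moves toward the best player's prediction, 0.9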

With this choice one obtains the upper bound of EAW's regret stated in theorem 6.9(ii). EAW's regret bound is worse than the bound for the fixed prediction horizon h in theorem 6.9(i), but EAW is the best attractivity-weighted meta-inductive strategy with a variable time horizon discovered so far.

(Theorem 6.9)  Worst-case regrets of EAWh and EAW (with convex loss function)
(i) For every prediction game ((e),{P1, … ,Pm, EAWh}) with prediction horizon h: maxsuch − such(EAWh) ≤ √(ln(m)/(2 · h)).
(ii) For every prediction game ((e),{P1, … ,Pm, EAW}) and n ≥ 1: maxsucn − sucn(EAW) ≤ √(2 · ln(m)/n) + √(ln(m)/(8 · n²)) ≤ 1.42 · √(ln(m)/n) + (0.36/n) · √(ln(m)).
(iii) EAW is long-run access-optimal: limsupn→∞(maxsucn − sucn(EAW)) ≤ 0.

Proof: Appendix 12.24 (based on theorems 2.2 and 2.3 in Cesa-Bianchi and Lugosi 2006; see also theorem 21.11 in Shalev-Shwartz and Ben-David 2014, 253f.).

6.6.3  Access-Superoptimality

Besides its improved upper regret bound, a further advantage of EAW over AW is that it does not completely ignore players with negative attractivities. EAW's weights are positive even when EAW's success exceeds that of all non-MI players; thus, the denominator in definition 6.4 never becomes zero, and EAW's success may surpass that of all non-MI players even when


EAW’s fallback method is included in Π¬MI. For this reason, theorem 6.9(iii) is formulated in terms of the limit superior. ­Because of this property, EAW can also ­handle wisdom-­of-­crowd situations (see section 10.2). H ­ ere the meta-­ inductivist plays against many laypeople whose individual predictions are not very good but are randomly distributed around the true value; so their errors compensate for each other, and the success rate of their average prediction is much better than the success rate of their best individual prediction. ­Because the success rates of the laypeople are close together (apart from initial fluctuations), EAW ­will always predict a value close to the average prediction of the laypeople and thus w ­ ill enjoy the wisdom of the crowd effect, even if no player or method is accessible that actually “plays” the wise crowd (i.e., predicts the laypeople’s average prediction). Thus, in wisdom-­of-­crowd scenarios EAW’s success rate w ­ ill climb above the laypeople’s maximal success rate. This ­will not be true for the AW player, who as soon as her success exceeds that of all laypeople converts to her fallback method, which is typically worse than the wise crowd’s prediction (cf. Thorn and Schurz 2012, Feldbacher 2012). Of course, this does not constitute an epistemological argument against AW’s access-­optimality ­because as soon as the wise crowd’s prediction is accessible by AW (i.e., played by some non-­MI player), AW’s long run success ­will be as good as that of the wise crowd. However, EAW’s success w ­ ill be as good as the wise crowd even if its prediction is not accessible to EAW. Thus, EAW performs better than needed for the epistemological optimality argument. run optiThis consideration can be generalized. Can EAW be long-­ mal with re­spect to the superset of all convex combinations—­that is, the weighted averages of the given players Π¬MI = {P1, … ,Pm} to which it has access, with time-­constant weights c1, … ,cm, even if the success rates of ­these combined players are not tracked by EAW? We call a meta-­inductive strategy with this property access-­superoptimal and denote a combined player as cmΠ¬MI, with cm being a vector of m normalized weights c1, … ,cm with c1+ … +cm = 1. Theoretically ­there are uncountably many combined players over Π¬MI. Even if the real-­valued interval [0,1] is coarse-­grained, their number is exponential in m; so it would be good to have an access-­superoptimal meta-­inductive strategy. Unfortunately, access-­superoptimality is too good to be generally achievable. It can happen that ­there is a best non-­MI player P1 to which EAW converges, but a certain combination of non-­MI players predicts better than P1. For example, assume Π¬MI = {P1,P2}, P1 and P2 predict constantly 0.9 and 0.6, respectively, and the true event value is constantly 0.8. Then the natu­ral losses are 0.1 for P1 and 0.2 for P2; thus EAW’s weights are wn(P1) = e−η • n • 0.1 and wn(P2) = e−η • n • 0.2; with increasing n, EAW’s


predictions and success rates converge to those of P1, but the combination (2/3) · P1 + (1/3) · P2 would constantly deliver the perfect prediction of 0.8.7

Under certain conditions, however, access-superoptimality can be achieved. First and foremost, all the randomized and collective versions of (E)AW meta-induction introduced in section 6.7 are access-superoptimal, which follows from the linearity of probabilistic expectations. Moreover, under restricted conditions, variants of EAW meta-induction for real-valued event variables can be access-superoptimal, too. One example is ensemble forecasting with continuous ranked probability scores, which are generalizations of the Brier score (see Thorey, Mallet, and Baudin 2016). Here the event variable is assumed to be generated by a cumulative probability distribution, and the meta-inductive method is exponential gradient forecasting—in short, EGAW—predicting a weighted average of the predictions of the non-MI players ("ensemble members"). EGAW uses the loss gradients as the exponential coefficients of the weights wn(Pi)—the partial derivative of EGAW's loss function at time n with respect to the weights wi,n (Thorey, Mallet, and Baudin 2016, 524ff.). The loss gradients depend not only on the distance of the ensemble members' predictions from the true value but also on their average distance from each other. If the time horizon is fixed to h, then EGAW's maximal absolute regret with respect to the set of all possible convex combinations cmΠ¬MI of the non-MI players' predictions is upper bounded as follows: max({Lossn(EGAW) − Lossn(P) : P ∈ cmΠ¬MI}) ≤ √n · (ln(m) + 0.5). The relative regret vanishes for n → ∞, so EGAW is access-superoptimal.8

6.7  Attractivity-Weighted Meta-Induction for Discrete Predictions

In theorem 6.8, the events and the predictions of the non-MI players may be binary. Yet this theorem does not apply to binary prediction games, because the weighted average of several predictions of zeros and ones is

7. The example shows that Zagzebski's thesis (2012, 197) concerning epistemic authorities (persons with superior reliability) is not always recommendable. According to her thesis, one should not weigh one's beliefs with the authorities' beliefs but should replace them by the latter ones.
8. The bound is obtained from equation 25 in Thorey, Mallet, and Baudin (2016) by setting η = 1/√n and a = 1 (a is the loss bound); their T is our n and their M our m. With a variable time horizon, EGAW is not access-superoptimal. Another example of access-superoptimality is time-series prediction based on ARMA models (autoregressive moving average; see Anava et al. 2013): the ARMA online gradient descent algorithm is access-superoptimal with respect to the predictions of all ARMA models.


(typically) a real value between 0 and 1. For example, if in a binary game with two non-MI players, player P1 has a (normalized) attractivity of 0.6 and predicts 0, and P2 has an attractivity of 0.4 and predicts 1, then AW predicts 0.4. However, that is not allowed in binary prediction games: every player, including the meta-inductivist, must predict either 1 or 0. Of course, we could introduce a loss function for binary prediction games that rounds AW's real-valued predictions—that is, interprets predn ≥ 0.5 as predicting "1" and predn < …

(Theorem 6.10)  … δ > 0, with probability P = 1 − δ:
maxsucn − sucn(RX) ≤ bX/√n + √(ln(1/δ)/(2 · n)).
Thus if P is σ-additive, then with probability P = 1: limsupn→∞(maxsucn − sucn(RX)) ≤ 0.

Proof: Appendix 12.25 (based on Cesa-Bianchi and Lugosi 2006, sections 4.1–2).

The proof of theorem 6.10 rests on the observation that even if the loss function of a discrete prediction game is arbitrary, the expected loss of RAW's prediction probabilities, loss(P(predn), en), is linear and thus convex in the argument P(predn). Therefore, the proof of theorem 6.8 (6.9) can be transferred from A(E)W to R(E)AW.

The generality of theorem 6.10 is remarkable. It applies to discrete prediction games with arbitrary loss functions loss(pred,e) over Val × Val, as the values of Val need not have any numerical structure. Therefore, theorem 6.10 applies also to arbitrary action games whose values v ∈ Val represent possible actions and "score(v,e)" represents the payoff of action v when the environment is in state e (see section 7.5). Moreover, theorem 6.10 applies to all real-valued prediction games with nonconvex loss functions as follows: we coarsen the interval [0,1] into a finite partition of subintervals {∆1, … ,∆q} (i.e., real numbers rounded up to a finite number of certain decimal places). Then we treat predictions predn ∈ ∆i as discrete, identify their weighted averages with probabilistic mixtures of pure predictions, and apply theorem 6.10. Because of the linearity of the expected loss function, the convexity inequality turns into an equality: loss(predn(RAW)) = ∑1≤r≤q P(predn(RAW) = vr) · loss(vr, en). An analogous equality holds for the expected loss of every possible "convex" combination of the non-MI players: loss(predn(cmΠ¬MI)) = ∑1≤i≤m ci · loss(predn(Pi), en) (with the ci's as probabilities). The expected loss of every such combination of non-MI players is greater than or equal to the minimal loss of the non-MI players. Thus, RAW is access-superoptimal in regard to all possible probabilistic combinations of non-MI players. The same is true for REAW, but with better regret bounds.

The only restriction of theorem 6.10 is the probabilistic independence assumption (equation 6.10). Though it may seem technically harmless, this restriction is philosophically substantial, because it prevents the application of theorem 6.10 to demonic worlds conspiring against RAW by producing events that minimize RAW's score. In the next section we present a collective meta-inductive strategy that is secured against Humean demons of this sort.
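The randomized move can be sketched as follows (my illustration): RAW samples a discrete prediction in proportion to the attractivity weights, so its expected loss is the corresponding probabilistic mixture of the players' losses; the players, weights, and 0/1 loss below are made up.

# Sketch of one randomized-AW round for a discrete game (illustration only).
import random

def raw_predict(preds, weights, rng=random):
    # Sample one non-MI player's prediction with probability proportional
    # to its (nonnegative) attractivity weight.
    total = sum(weights)
    r, acc = rng.random() * total, 0.0
    for pred, w in zip(preds, weights):
        acc += w
        if r <= acc:
            return pred
    return preds[-1]

def expected_loss(preds, weights, event, loss):
    # Linearity: E[loss] = sum_i P(i) * loss(pred_i, event).
    total = sum(weights)
    return sum(w / total * loss(p, event) for p, w in zip(preds, weights))

preds, weights = ["a", "b", "c"], [0.5, 0.3, 0.0]
zero_one = lambda p, e: 0.0 if p == e else 1.0
print(raw_predict(preds, weights), expected_loss(preds, weights, "a", zero_one))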


6.7.2  Collective AW Meta-Induction

A fully general kind of access-optimality for discrete prediction games is possible by assuming a collective of k AW meta-inductivists, abbreviated as CAW1, … ,CAWk ("CAWi" for "collective attractivity-weighted meta-inductivist number i"), and by considering their average success rate. The CAWs approximate the probabilities assigned to the possible predictions of RAW by means of their finite frequencies as closely as possible. To define this idea, let [x] be the integer rounding of a real number x; if x's remainder after the decimal point is 0.5, we round downward; for instance, [3.47] = [3.5] = 3, and [3.51] = [3.7] = 4. Moreover, let Pn(vr) =def P(predn(RAW) = vr) abbreviate RAW's probability of predicting event vr for time n.

In the case of binary games the approximation method is simple. If there are k CAWs, then [k · Pn(1)] CAWs predict 1 and the other CAWs predict 0. For example, if there are 87 CAWs and P(1) = 0.76, then [0.76 · 87] = [66.12] = 66 CAWs predict 1 and 21 predict 0. The corresponding CAW frequencies (approximating 0.76 and 0.24) are

66/87 = 0.758 and 21/87 = 0.242.
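A minimal sketch of this allocation (my illustration): brackets() implements the half-down integer rounding [x] described above, and allocate_binary() distributes k CAWs according to Pn(1).

# Sketch of the CAW allocation for a binary game (illustration).
import math

def brackets(x):
    # [x]: round to the nearest integer, rounding halves downward.
    return math.ceil(x - 0.5)

def allocate_binary(k, p_one):
    # [k * P_n(1)] CAWs predict 1; the remaining CAWs predict 0.
    ones = brackets(k * p_one)
    return ones, k - ones

assert brackets(3.47) == brackets(3.5) == 3
assert brackets(3.51) == brackets(3.7) == 4
print(allocate_binary(87, 0.76))  # (66, 21), frequencies 66/87 and 21/87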


Figure 6.9 displays a computer simulation of a binary prediction game with a collective of 10 CAW meta-inductivists playing against the four CAW-adversarial non-MI players of figure 6.7. Because the event sequence is random, the CAWs' mean success rate (bold) is almost identical with AW's ideal success rate and thus approximates the maximal success of the non-MI players in an almost strict sense.

The approximation method of the CAWs with q possible event values v1, … ,vq is more complicated. Let non(vr) be the number of CAWs predicting

Figure 6.9  Binary prediction game with 10 collective attractivity-weighted (CAW) meta-inductivists (thin grey) playing against four CAW adversaries (thick grey); the CAWs' mean success (bold) approximates the maximal success. (Event sequence is binary; x-axis: round, logarithmic scale; y-axis: success rate.)


event vr for time n. The quantity rn(vr) =def non(vr) − k · Pn(vr) is the rounding error of this number. Note that rn(vr) may be positive or negative. If we set non(vr) = [k · Pn(vr)] for all r < …

… maxlimsuc = limsuc(ITB).

Proof: Appendix 12.28.

According to theorem 7.1(2), TTB is only guaranteed to improve ITB's performance in the long run. In the short run, both ITB and TTB may suffer from non-negligible regrets. However, the values of these two regrets are independent. It may be that the ecological validities of two non-MI players oscillate around each other for a very long time, while their success rates quickly converge to a stable ordering, because the prediction frequency of one non-MI player is much higher than that of the other. Situations in which ITB's short-run regret is large but that of TTB is small are also possible. If the sequences of events and scores are obtained from random sampling, then the regrets of ITB and TTB will be approximately equal, and already in early phases of the game sucn(TTB) > sucn(ITB) will hold.

If the stabilization time of the validity ordering occurs late, TTB's short-run performance will be bad. As in the case of ITB in figure 6.2, TTB's success will break down in prediction games whose players (or cues) are deceivers, in which case their validities are negatively correlated with being imitated by TTB and their validity ordering never stabilizes. Theorem 6.5 applies to this case, because TTB is a one-favorite meta-induction method and theorem 6.5 holds for all one-favorite methods. A situation of systematic deception of TTB is programmed in figure 7.2. Here, TTB plays against four systematically deceiving intermittent cues: they deliver a prediction 60 percent of the time and predict incorrectly whenever TTB favors them. As a result, TTB's success rate converges to zero, while the mean success rate of the cues converges to 0.6 · 0.75 + 0.4 · 0.5 = 0.65.

The preceding result reveals a restriction of TTB that seems to have gone unnoticed in psychological research on TTB: TTB performs well only

Figure 7.2  Breakdown of take the best (TTB) in a binary game against four systematically deceiving cues. (x-axis: round, logarithmic scale; y-axis: success rate.)

if the cues' validities converge sufficiently fast toward a unique ordering. This assumption is granted by the methods of random sampling (recall section 5.8). However, in scenarios of online learning with oscillating event frequencies—for example, in predictions of the stock market (keyword "bubble economy")—the "inductive safety" of random sampling cannot be assumed. In these situations it would be a bad recommendation to put all of one's money on the presently most successful stock instead of distributing it over several stocks in the form of a stock portfolio—which corresponds to the method of weighted-average meta-induction.

As in the case of ITB, we can modify TTB to form the variant εTTB with a switching threshold. The method εTTB can defend itself against convergent success oscillation, but its success breaks down in games with nonconvergent success oscillations, similarly as in figure 6.3 for εITB (I will refrain from presenting the details here).

7.2.2  Intermittent AW

The intermittent version of AW, iAW, assigns attractivities to non-MI players according to their validities. At each time n, iAW ignores all non-MI players who do not deliver a prediction and constructs his prediction as an attractivity-weighted average of the predictions of those non-MI players who are active and sufficiently attractive. There is but one problem. The attractivity of a non-MI player P depends on the relation between P's


validity and iAW's performance, but iAW is a persistent forecaster who is evaluated according to his success rate. It would be distorting to compare P's validity with iAW's success rate, and theorem 6.8 could not be generalized so as to apply to iAW if we did this.

The only possibility for a correct comparison between a non-MI player's validity and iAW's success rate is to conditionalize the two on all possible subsets of non-MI players who may be active at the given time. For m non-MI players, there are exactly μ =def 2^(m−1) such subsets, which we denote by U1, … ,Uμ (μ = 2^(m−1) because there are only m − 1 intermittent non-MI players, since iAW's fallback method is persistent and is included in all Ui's). We conditionalize all validities and success rates on these μ subsets. For this purpose we define:

• valn(P|Ui), provided P ∈ Ui, denotes P's ecological validity at time n conditional on all times for which player set Ui was active (i.e., was the set of active players). Note that valn(P|Ui) = sucn(P|Ui).
• Likewise, sucn(iAW|Ui) is iAW's success rate at time n conditional on all times for which player set Ui was active.
• freqn(Ui) is the relative frequency of times until time n for which player set Ui was active.
• atn(P|Ui) =def sucn(P|Ui) − sucn(iAW|Ui) is the attractivity of player P at time n conditional on player set Ui.

With these notions, iAW's weighting method is defined as follows:

(7.8)  Weighting method of iAW
For every player P ∈ {P1, … ,Pm} and time n, where U(n+1) =def the set of all non-MI players that are active for time n+1:
wn(P) = atn(P|U(n+1)), if P ∈ U(n+1) and atn(P|U(n+1)) > 0,
= 0 otherwise (i.e., if either P ∉ U(n+1) or atn(P|U(n+1)) ≤ 0).
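The conditionalized bookkeeping behind (7.8) can be sketched as follows (my illustration; the data structures are hypothetical): statistics are stored per active-player set Ui, and the weights are computed for the set U(n+1) announced for the next round. The uniform fallback for a set without any record yet is my assumption, not part of (7.8).

# Sketch of iAW's conditionalized statistics and weighting rule (7.8).
# Bookkeeping is per active-player set U (a frozenset), as in the text.
from collections import defaultdict

rounds = defaultdict(int)           # number of rounds in which U was active
score_sum = defaultdict(float)      # per (U, player): summed scores
iaw_score_sum = defaultdict(float)  # per U: iAW's summed scores

def record_round(U, scores, iaw_score):
    rounds[U] += 1
    for p, s in scores.items():
        score_sum[(U, p)] += s
    iaw_score_sum[U] += iaw_score

def weights(U_next):
    # (7.8): w_n(P) = at_n(P|U(n+1)) if P is active and its conditional
    # attractivity is positive, else 0.
    if rounds[U_next] == 0:
        return {p: 1.0 for p in U_next}  # no record yet: uniform (assumption)
    suc_iaw = iaw_score_sum[U_next] / rounds[U_next]
    w = {}
    for p in U_next:
        suc_p = score_sum[(U_next, p)] / rounds[U_next]
        w[p] = max(suc_p - suc_iaw, 0.0)
    return w

U = frozenset({"P1", "P2"})
record_round(U, {"P1": 1.0, "P2": 0.0}, iaw_score=0.5)
print(sorted(weights(U).items()))  # [('P1', 0.5), ('P2', 0.0)]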

These conditionalized attractivities are used by iAW as weights for predictions. Convention (7.8) partitions each prediction game with iAW into μ = 2^(m−1) persistent subgames, which we denote as G|Ui (1 ≤ i ≤ μ). Each subgame consists of the corresponding subsequence of events and subset of active players, G|Ui = ((ek : U(k) = Ui, k ∈ ℕ), Ui ∪ {iAW}). Each subgame G|Ui satisfies the definition of an ordinary persistent prediction game, over which iAW plays like AW. Thus, we can apply theorem 6.8 and get the result that in each of these subgames G|Ui, iAW approximates the maximal success rate in the long run. Using this fact we can eventually prove what we want: that iAW improves on AW


if all non-MI players predict better than random guesses, and that iAW approximates TTB in prediction games with a stable ecological validity ordering. In what follows, maxsucn|Ui denotes the maximal success rate of the Ui-players in the subgame G|Ui at time n, and |Ui| is the cardinality of Ui.

(Theorem 7.2)  Universal access-optimality of iAW in intermittent games
For every prediction game ((e),{P1, … ,Pm, iAW}):
(1) For every n > 1 and i ∈ {1, … ,μ =def 2^(m−1)}:
(a) Short run: sucn(iAW) ≥ [∑1≤i≤μ freqn(Ui) · maxsucn|Ui] − ∑1≤i≤μ √(|Ui| · freqn(Ui)/n).
(b) Long run: sucn(iAW) converges to ∑1≤i≤μ freqn(Ui) · maxsucn|Ui for n → ∞.
(2) Assume there exists a time r such that for all times n ≥ r, k ∈ {1, … ,m} and i ∈ {1, … ,μ}: freqn(Ui) is nonzero and sucn(Pk|Ui) is greater than sucn(ran|Ui). Then,
(a) Short run: ∀n ≥ r: ∑1≤i≤μ freqn(Ui) · maxsucn|Ui > maxsucn.
(b) Long run: liminfn→∞(sucn(iAW) − maxsucn) > 0. Thus in the limit iAW's success rate is greater than the non-MI players' maximal success rate.
(3) Under the assumption of conditionally independent validities converging to a unique ordering, sucn(iAW) converges to sucn(TTB) for n → ∞.

Proof: Appendix 12.29.

Observe that the regret bound of iAW in theorem 7.2(1a) is much smaller than the sum of the regret bounds for each subgame, which is ∑1≤i≤μ √(|Ui|/(n · freqn(Ui))) by theorem 6.8.

The strategy iAW is a theoretically clean generalization of AW to intermittent success evaluation, which improves on AW in the long run under the mild assumption that after some finite time the validities of all non-MI players are better than random guesses. However, iAW has two disadvantages compared with AW.

• Disadvantage 1. Since μ = 2^(m−1), iAW's computational complexity grows exponentially with the number of non-MI players. If there are too many non-MI players, iAW becomes computationally intractable. Therefore, we tested a simplified version of iAW in our computer simulations in section 10.1 (figure 10.2a), abbreviated as "iAWsimple." It predicts according to the following simplified attractivities: simple-atn(P) = valn(P) − sucn(iAW)


so long as there are some active players with positive simple attractivities; otherwise, iAWsimple predicts according to the strategy TTB. In our computer simulations, iAWsimple was as successful as TTB; however, we have no general mathematical result for the long-run success of iAWsimple.

• Disadvantage 2. The worst-case regret of iAW is ∑1≤i≤μ √(|Ui| · freqn(Ui)/n). This is (usually) greater than the worst-case regret of AW (because it is not the terms |Ui| · freqn(Ui) but their square roots that are summed up). For example, if n = 100, m = 4, and the frequencies of the 2³ = 8 subsets Ui are equal, then AW's worst-case regret is computed as 0.17, while iAW's worst-case regret is 0.28.
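Disadvantage 1 can be made concrete with a small enumeration sketch (my illustration): since the persistent fallback method belongs to every Ui, only the m − 1 intermittent players vary, giving 2^(m−1) conditioning sets.

# Number of conditioning sets U_i for m non-MI players: the fallback method
# is in every set, so only the m-1 intermittent players vary (illustration).
from itertools import combinations

def conditioning_sets(intermittent, fallback):
    sets = []
    for r in range(len(intermittent) + 1):
        for combo in combinations(intermittent, r):
            sets.append(frozenset(combo) | {fallback})
    return sets

m = 4
sets = conditioning_sets(["P1", "P2", "P3"], "FB")
print(len(sets), 2 ** (m - 1))  # 8 8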

It is also possible to define an intermittent version of exponential AW meta-induction. Theorem 7.2 and its proof apply to EAW in the same way, with the improved worst-case regret of 1.42 · ∑1≤i≤μ √(ln|Ui| · freqn(Ui)/n) + (0.36/n) · ∑1≤i≤μ √(ln|Ui|) (recall theorem 6.9(ii)). One may also define intermittent variants of other meta-inductive methods, such as intermittent randomized AW (iRAW), intermittent randomized exponential AW with variable time horizon (iREAW), and intermittent success-weighted meta-induction (iSW). As with SW in comparison to (E)AW, iSW can only outpace i(E)AW locally and is not universally access-optimal.

7.3  Unboundedly Growing Numbers of Players

Prediction games with unboundedly growing numbers of players have the form ((e),P(n) ∪ {xMI}), where P(n) =def {P1, … ,Pm(n)} is a set of non-MI players increasing with time and m(n) is the number of non-MI players at time n. The growth function m(n) satisfies n > n′ → m(n) ≥ m(n′) and ∀n∃n′ > n: m(n′) > m(n). Thus, m(n) never decreases and increases unboundedly with n → ∞, though it need not increase at every time step. We assume that each non-MI player P predicts persistently after the first time at which P was added to P(n), and we denote this time as t(P).

We will restrict our investigation to exponential meta-induction, because it has the best regret bound. We cannot directly apply the results of theorems 6.9 and 6.10, because these theorems presuppose that the success of all non-MI players—including those added at later times of the game—is evaluated right from the start. All pertinent methods of embedding new players into the meta-inductive framework known to me propose some way of attributing a default success to the new players for the time before they were participating in the game; that is, the default values abst(P)(P) and suct(P)(P) = abst(P)(P)/t(P) for the given event sequence e1, … ,et(P). We call this retrospective success evaluation.


Before we start, let us ask: Is it possible to establish access-optimality for new players who are evaluated only from their entrance time t(P) onward, without a default assumption about their hypothetical past success? The answer is no. Assume the present time is 1,100 and at time n = 1,000 a couple of new players were added that are evaluated from time t = 1,000 onward, while EAW and the other players are evaluated from the start. It may well be that for all players predictive success was much harder to attain in the first 1,000 rounds than in the last 100 rounds. In that case, the "lucky" new players have the advantage of only being evaluated in the easy part of the game, and EAW's success as evaluated from t = 0 will be much lower than the success of the lucky new players. As time goes on, the lucky advantage of the new players will diminish, but with unboundedly growing player sets this situation may recur at any time when a new player enters. Thus, meta-induction cannot be universally access-optimal in regard to growing player sets without incorporating some method of retrospective success evaluation.

7.3.1  New Players with Self-Completed Success Evaluation

Let absn|t(P) =def ∑t< …

… n) at time n depends only on sf's initial subsequence until time n, FEAW's regret bound also holds for all player sequences of length greater than n, provided the maximal number of their switches is still equal to k. This gives us the following theorem for the access-optimality of FEAWgr in regard to player sequences with growing numbers of base players and player switches.

(Theorem 7.4)  Access-optimality of FEAWgr in regard to sequences of methods
For every prediction game ((e),{P1, … ,Pm(n), FEAWgr}), where Sk(n)≥n denotes the set of all admissible player sequences of length ≥ n with at most k(n) switches:
(1) maxsucn(Sk(n)≥n) − sucn(FEAW) ≤ the regret bound in (7.10), with "k" replaced by "k(n)" and "m" by "m(n)."
(2) FEAW is access-optimal if k(n) · ln m(n) grows slower than linearly with n.7
(3) Results 1 and 2 apply to REAW and CEAW (with "suc̄n" replacing "sucn" and adding the regret term mls for CEAW).

Proof: Parts 1 and 3 follow from theorem 5.1 and corollary 5.1 of Cesa-Bianchi and Lugosi (2006, 105), by using our bound (7.10) for EAW (see note 6) and dividing the absolute regret by n. Result 2 is an immediate consequence.

A similar result is obtained by Shalizi et al. (2011). More nuanced regret bounds for expert sequences are obtained by Mourtada and Maillard (2017, 5): for sequences in which only switches to new experts are allowed, and using an exp-concave loss function for η, they obtain the upper regret bound of (η/n) · (k(n) + 1) · [log(max({mt : 1 ≤ t ≤ n})) + log(n)], where mt is the number of new base experts added at time t; again this bound vanishes for n → ∞ if k(n) grows slightly less than linearly with n.

7.4  Prediction of Test Sets

The major steps in the proofs of our theorems refer only to xMI's prediction method and to the sequences of scores of the non-MI players. They do not depend on the kind of prediction task for which these scores are earned. What is important is only that these scores are based on feedback from the

7. More precisely, slower than n/(log n) • log(log n) (Cesa-Bianchi and Lugosi 2006, 106).


environment, which is possibly given with some delay but before the task in the next round begins, because this feedback is part of the input for the next round's task. For this reason, all the preceding results can be generalized to a multitude of variations and extensions of prediction games.

In this subsection we vary the prediction task. It need not be the prediction of the value of the next event; it could also be the prediction of the values of the next k events. The only thing that is crucial is the normalization of the score per round. If the k next events (en+1, … ,en+k) are predicted in each round, this can be achieved by dividing the sum of all scores per round by the number k. Moreover, the number k may change each round—as long as we normalize the score, this changes nothing. Even the concrete prediction task may change each round. This sounds strange, but it is possible, for example, by announcing the prediction task in each round; "multitasking" scenarios of this sort are often encountered in practical situations. In such a situation, conditionalized meta-induction (see section 8.2) is the meta-level method to be preferred over simple meta-induction: the type of prediction task then figures as a success-relevant property to which one should conditionalize the non-MI players' success rates.

Generally speaking, the event sequence of an extended prediction game consists of an infinite sequence of finite sequences si of ki events: (si : i ∈ ℕ), with si = (ei,1, … ,ei,ki). Each sn+1 ∈ Val^(kn+1) is a value of the predictive target variable Xn+1 of round n. Following standard terminology, we call sn+1 the test set (or "test sequence") of round n and the set of events observed until time n the training set of round n. The prediction of the test set sn+1 consists of a sequence of predictions spredn+1 = (predn+1,1, … ,predn+1,kn+1) with predn+1,j ∈ Valpred(Xn+1,j). To obtain a normalized score, we average the scores of the predictions for all events in the test set si.

(7.12)  Scoring of the prediction of test sets
score(spredi, si) = 1 − loss(spredi, si), with loss(spredi, si) =def [∑1≤j≤ki loss(predi,j, ei,j)]/ki,
where "loss" is the given loss function.

By treating the so-defined scores obtained in each round of an extended game in the same way as the scores of a simple prediction game, we can transfer all our results about meta-induction in simple prediction games to extended games.
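A direct transcription of (7.12) into Python (my sketch; the absolute-distance loss is a placeholder, and the example numbers are made up):

# Sketch of the normalized test-set score (7.12).
def test_set_score(spred, s, loss):
    # score(spred_i, s_i) = 1 - average per-event loss over the test set.
    assert len(spred) == len(s) and s
    avg_loss = sum(loss(p, e) for p, e in zip(spred, s)) / len(s)
    return 1.0 - avg_loss

natural_loss = lambda p, e: abs(p - e)  # placeholder loss function
print(test_set_score([0.7, 0.4, 0.9], [0.8, 0.4, 0.5], natural_loss))  # 0.8333...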


Extended prediction games resemble the training-test-set method that is frequently used in research on prediction tournaments. Here, a finite set of items is randomly divided into a training and a test set; the training set is used for estimating the methods' parameters (e.g., the success probabilities), and the items of the test set are to be predicted. The procedure is repeated several times, and, in accordance with (7.12), the average score for all predicted items of the test set is computed.

Extended prediction games can be regarded as iterated procedures of training-test-set selections. There is but one difference: the parameters of a prediction game are historically cumulative. After the items of the test set—which were predicted in the previous round—have been observed, they are added to the pool of all observations making up the training set of the next round. In this way, the training sets of prediction games are continuously increasing. In contrast, the training sets of a repeated training-test-set experiment have a fixed size. In the language of prediction games, this corresponds to prediction methods (mentioned at the end of section 6.5) that have a restricted past time window. These methods compute the success rates from the last, say, n (e.g., 50) events, instead of considering the entire past. Methods with a restricted past time window have local advantages in two kinds of environments:

1. Periodically changing environments to which different object-level methods are differently adapted.
2. Randomly selected event sequences. Because the standard error of the sample frequencies decreases with the sample size n by a factor of 1/√n, there are diminishing returns for large sample sizes. Instead of considering the entire past, a small training set is usually considered sufficient.

However, these advantages are merely local. Past-time-restricted strategies are not access-optimal, and our remark on the "division of labor" from section 6.8 applies. If one conjectures that a given past-time-restricted method could be advantageous, one should not use it as one's meta-level method, but rather put it into the toolbox of candidate methods of an access-optimal meta-inductive strategy.

7.5  Generalization to Action Games

In this section, we generalize our account of meta-induction to arbitrary methods of action. Choosing an action from a given class of actions A is a decision problem. As in the case of predictions, there exist different


object-level methods of decision making. In Bayesian decision theory one assumes that the utilities of actions depend on the circumstances of the environment, which obtain according to a given probability distribution P. Assuming these circumstances are characterized by a partition E = {c1, … ,cq}, a rational decision-maker with utility function u: A × E → ℝ chooses an action a ∈ A whose expected utility E(a) = ∑1≤i≤q P(ci) · u(a,ci) is maximal. Thus the Bayesian decision-maker has a twofold induction problem. She has to conjecture (1) the probabilities of the circumstances from their so-far observed frequencies and (2) the utilities of the actions from their so-far observed payoffs. We know from section 5.9 that for IID sequences the inductive projection of the so-far observed frequency and average utility is an optimal object-level method. However, we do not assume IID sequences but allow for arbitrary sequences whose frequencies or average utilities need not converge. Because the utilities of the actions in A may change in irregular ways, refined induction techniques may outperform simple inductive decision methods.

We define a (persistent) action game as a triple A = ((r),Π,p) consisting of

1. An infinite sequence (r) = (r1, r2, … ) of rounds.
2. A finite set Π of accessible players P whose task is to perform an action an(P) in each round n. Each player corresponds to a "method of action." The players' actions are taken from a fixed but arbitrarily large set of possible actions A. Examples are healing actions, planting actions, hunting actions, or weather-forecasting actions (thus, predictions are a special case of actions).
3. A payoff function pn: A → [0,1] that reflects the "reaction of the environment" to the action in round n, in the form of this action's payoff.8

The payoff pn(P) of the action of a player P in round n—we write pn(P) as short for pn(an(P))—is measured by a score within the normalized interval [0,1]. The payoffs of actions are unknown in advance and may change over time; they are revealed after the action has been performed.

In mathematical learning theory, action games of this sort have been studied under the heading of multiarmed bandit problems (MABPs) (Auer et al. 1995; Freund and Schapire 1995; Schlag 1998). One distinguishes

8. We could also represent this payoff function by a time-invariant two-place function p(a,c), assigning to each action a ∈ A and circumstance c ∈ E the payoff of a in c, together with an environment function e: ℕ → E assigning to each time the circumstance that obtains; in this setting we would define pn(ai) = p(ai, e(n)).


between stochastic MABPs, in which the payoff of each arm is determined by a fixed probability distribution, and nonstochastic or "adversarial" MABPs, where the arms' payoffs are dynamically changing and may depend on the decisions of the meta-inductivist (Bubek and Cesa-Bianchi 2012, 6); the latter setting is also assumed in our notion of an action game. Moreover, one can distinguish between simple MABPs, in which each non-MI player always chooses the same action, and complex MABPs, in which different non-MI players may choose different actions in different rounds, as in our notion of an action game (Auer et al. 1995, 7). However, in the nonstochastic case the complex MABPs are, from a mathematical viewpoint, as "simple" as the simple MABPs, because the different arms may equally be considered as experts who choose actions from a set A and are imitated by the meta-inductivist (see Cesa-Bianchi and Lugosi 2006, 68, remark 4.1). Finally, note that in multiarmed bandit problems one usually assumes restricted information: the randomized meta-inductivist (RAW) can observe only the score of the action of her favored non-MI player. This modification is discussed in the next section; here, we assume that the meta-inductivist can observe the payoffs of all non-MI players.

We now explain how our results on prediction games transfer to action games. One-favorite meta-inductive strategies imitate the action performed by the actual favorite in the given round (where this favorite has been determined in the round before). Thus, all results concerning one-favorite meta-induction apply immediately to action games. The actions of weighted-average meta-inductivists are "weighted averages" of actions. There are two different ways in which the actions of different players can be combined into a mixed action: probabilistically (in discrete action games) and literally (in real-valued games).

Discrete action games. Here, only probabilistic or frequentistic combinations of actions are possible. Thus, the weight of a non-MI player corresponds either to the probability with which a single RAW meta-inductivist imitates this player (as in theorem 6.10) or to the relative frequency with which collective meta-inductivists CAW imitate this player (as in theorem 6.11). The transfer of our results about randomized AW meta-induction (RAW and REAW) from discrete prediction to action games is straightforward. We identify the set of actions A with the set of predictions (which the non-MI players deliver in the previous round) and the scoring function with the payoff function; this is possible because the loss function of discrete prediction games is arbitrary. Thus we transform a given action game into a prediction game by defining Valpred =def A, and for P ∈ Π¬MI, scoren(P, en) = pn(an(P)). The event space Val contains the "possible circumstances" in the sense of note 8.


Its cardinality must be at least as great as the maximal number of different payoffs that an action in A has. Thus |Val| ≥ max({|{pi(a) : i ∈ ℕ}| : a ∈ A}). Note that now Val and Valpred do not share elements, but this does not affect the proof of theorem 6.10 for the access-optimality of R(E)AW.

Real-valued action games. Here, we assume that the possible actions are numerically graded—for example, driving with graded velocity—so the meta-inductivist can form literal weighted averages of them, as in the case of real-valued prediction games. Our results about real-valued prediction games with convex loss functions can be transferred to real-valued action games by the following trick: let maxn ∈ [0,1] denote the maximal payoff of all possible actions in round n. We let "maxn" play the role of the event en. The predictions corresponding to the actions of players are generated by assuming that if P chose action an(P), P predicted in the previous round that this action will receive maximal payoff, and P earns loss(pn(P), maxn) for this prediction. Thus we identify P's "virtual" prediction, predn(P), with the payoff pn(P).9 This consideration allows us to transfer all mathematical theorems for (E)AW from prediction to action games.

Summarizing, in all action games ((r),{P1, … ,Pm, xAW},p), xAW's actions are long-run optimal, with worst-case regrets specified by theorems 6.8 through 6.11. Note that for the sake of proof transfer we consider action games as a special kind of prediction game, although from a natural viewpoint prediction games are a special kind of action game. This is not contradictory, because the relation of translation is different in the two cases.

7.6  Adding Cognitive Costs

A further possible extension of prediction games takes account of cognitive costs. Here we do not mean the costs of error, for they are already included in the loss function, but rather the cost of producing the prediction—for example, in terms of computational complexity, or simply in terms of the financial costs of acquiring data. The error costs and the cognitive effort costs are then aggregated into a total cost, in the simplest case by adding them up and renormalizing the cost sum. Of course, from an idealized epistemic viewpoint, only error costs should play a role, but from a practical viewpoint, cognitive costs are of obvious importance.

9. This "virtual" prediction is unknown to P before the payoffs are revealed; virtual predictions are only needed for transferring the proof from prediction to action games.


We assume that different prediction methods have different cognitive costs, but each method's cost per round remains constant over the prediction game. Under this assumption we can incorporate cognitive costs into our results by adding to the loss of each method P in a given round an additional cost term, denoted as "cost(P)," ranging within an interval [0,c]. Thereby c (a positive real number) is the maximal cost of a prediction task; if c < 1, predictive success counts more than cognitive simplicity, and if c > 1, cognitive simplicity counts more than predictive success. The total loss in round n is designated as "t-lossn" and is given as the sum of the predictive loss and cost, t-lossn(P) = lossn(P) + cost(P). The total score, t-scoren(P), is obtained by subtracting the total loss from 1 + c, which is the maximum possible total loss per round; thus, t-loss and t-score range over values in [0, 1 + c]. In conclusion, the total losses, scores, and success rates t-sucn are computed as follows.

(7.13)  Total loss, score, and success (at time n)
t-lossn(P) = lossn(P) + cost(P)
t-scoren(P) = (1 + c) − t-lossn(P) = 1 + c − lossn(P) − cost(P)
t-sucn(P) = (1/n) · ∑1≤i≤n [1 + c − lossi(P) − cost(P)]
= (1/n) · [∑1≤i≤n scorei(P)] + (1/n) · [n · c − n · cost(P)]
= sucn(P) + c − cost(P)
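In code, (7.13) reduces to a one-line adjustment of the predictive success rate (my sketch; the example values are made up):

# Sketch of (7.13): total success rate with a constant per-round cost.
def total_success(suc_n, cost, c):
    # t-suc_n(P) = suc_n(P) + c - cost(P), with cost(P) in [0, c].
    assert 0.0 <= cost <= c
    return suc_n + c - cost

# Example: equal predictive success, different cognitive costs (made up).
print(total_success(0.8, cost=0.05, c=0.2))  # 0.95
print(total_success(0.8, cost=0.15, c=0.2))  # 0.85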

In prediction games with cognitive costs, meta-inductivists should still imitate the non-MI players with maximal predictive success and not those with maximal total success, because xMI imitates their predictions but not their cognitive costs. The costs that xMI must pay for imitating other (combinations of) players are fixed for xMI, independently of the costs that the non-MI players must pay for their predictions. Thus, it is only by imitating the non-MI players with maximal predictive success that xMI can maximize his total success. Let Maxn denote the first-best non-MI player with maximal total success at time n. Then we obtain the following for the meta-inductivist's total regret at time n, t-regn(xMI).

(7.14)  Total regret
t-regn(xMI) = t-maxsucn − t-sucn(xMI)
= (by 7.13) [sucn(Maxn) + c − cost(Maxn)] − [sucn(xMI) + c − cost(xMI)]
= [sucn(Maxn) − sucn(xMI)] − [cost(Maxn) − cost(xMI)]
≤ regn(xMI) + [cost(xMI) − cost(Maxn)]  (because sucn(Maxn) ≤ maxsucn).


Calling "cost(xMI) − cost(Maxn)" xMI's cost regret, we can express (7.14) by saying that xMI's total regret is upper-bounded by the sum of xMI's predictive regret and xMI's cost regret. In the long run, the first-best player with maximal total success may change endlessly; thus cost(Maxn) need not converge. To overcome this difficulty, we define cost(Max) as the minimum of the cognitive costs of all non-MI methods whose total success rate is infinitely often maximal. This gives us proposition 7.2 (whose proof follows from 7.13 and 7.14).

(Proposition 7.2)  Prediction games with method-specific cognitive costs
Let ((e),{P1, … ,Pm, xMI}) be a prediction game with player-specific but unchanging cognitive costs, cost(X) ∈ [0,c] for X ∈ Π. Assume xMI has an upper predictive regret bound given by a function f(n,m) and xMI is predictively access-optimal: limn→∞ f(n,m) = 0. Then xMI's total regret satisfies:
(1) Short run: t-regn(xMI) ≤ f(n,m) + cost(xMI) − cost(Maxn), and
(2) Long run: limn→∞ t-regn(xMI) ≤ cost(xMI) − cost(Max).

Imitating other forecasters typically produces costs for the meta-inductivist, because usually the "experts" do not give away their expert knowledge for free and require something for it in return. Of course, the experts too have to pay their own costs. Thus, prima facie, the cost regret of meta-induction, cost(xMI) − cost(Maxn)—or cost(xMI) − cost(Max) in the long run—can be positive or negative. However, the philosophically important point involved in the consideration of cognitive costs is the following: in the typical case, the cost regret of meta-induction is negative; that is, it is a gain. It is an uncontroversial fact of human evolution that the costs of individual learning are usually much higher than those of social learning. Many unsuccessful trial-and-error steps are involved in individual learning that can be avoided by just being informed about the results of these steps. A drastic example could be primitive men who test whether certain mushrooms are edible or poisonous: the meta-inductivist will eat only those mushrooms that she has observed being eaten by an individual learner who did not become sick or die afterward. Even if imitators have to pay for expert information, the expert will be imitated by many laypeople, whose individual share of the costs that they have to pay to the expert will be small. Thus, meta-induction enjoys an additional advantage as soon as cognitive costs are included. No wonder social learning has evolved so successfully in Homo sapiens, the "third


ape species," which for the first time in biological evolution had developed brain structures rich enough for the linguistic skills that are involved in differentiated social learning. In section 10.3 we will investigate these matters from the viewpoint of evolutionary game theory, with interesting consequences concerning cooperative behavior.

On the other hand, assuming that the meta-inductivist has to pay each expert for obtaining his advice, the meta-inductivist's cost depends on m, the number of different "experts" or candidate methods. If m is large, cost(xMI) may become too large, and it becomes wise for xMI to get access not to all but only to a certain share of the expert advice and success records. We speak here of meta-inductive strategies with restricted information and will discuss them in the next section.

7.7  Meta-Induction in Games with Restricted Information

Several results about action (or prediction) games with restricted information have been proved in regret-based learning theory (Auer et al. 1995, 3). A frequent assumption is that the meta-inductivist can observe all events but only some of the actions of the other players; we call this the access-restricted meta-inductivist. In nonstochastic multiarmed bandit problems (MABPs) one assumes the extreme case that the meta-inductivist can observe only the payoff of the action that she chooses in the given round (a generalization would be "prediction games with partial monitoring"; Cesa-Bianchi and Lugosi 2006, section 6.5).

In this section we briefly present the major result on standard MABPs—action games with observations restricted to the action that was chosen in the given round. It turns out that if the payoff information received by the meta-inductivist is independently randomized, so that every action-payoff pair has the same independent chance of being revealed to the meta-inductivist, then access-optimality still holds, though with a worsened upper bound of the regret. In what follows, we present this result for the information-restricted randomized EAW (IREAW) in application to a set of possible actions (or predictions) A = {a1, a2, … } with payoff function p, as explained in section 7.5.

(7.15)  Prediction method of IREAW (information-restricted EAW)—for prediction or action games with restricted information and parameters β, γ ∈ [0,1] and η ∈ ℝ+ ("multiarmed bandit problem"; Bubek and Cesa-Bianchi 2012, 30, Exp3.P)


(7.15)  (continued)
1. In each round n, IREAW imitates the action of a non-MI player Pi who is selected according to the probability distribution Probn(i), for 1 ≤ i ≤ m (thus, Probn(i) is written more explicitly as Pn(an(IREAW) = an(Pi))). Initially the distribution is uniform. IREAW observes only the payoff of the selected non-MI player in the given round.
2. The estimated score of a non-MI player Pi in round n is abbreviated as sn(Pi) and given as sn(Pi) = (scoren(Pi) + β)/Probn(i) if the action of Pi was chosen in round n, else sn(Pi) = β/Probn(i).
3. The new absolute successes, weights, and probability distributions are updated by the following recursive equations:
(i) absn(Pi) = absn−1(Pi) + sn(Pi) (with abs0(Pi) = 0),
(ii) wn(Pi) = e^(η · absn(Pi)), and
(iii) Probn+1(i) = (1 − γ) · [wn(Pi)/∑1≤j≤m wn(Pj)] + (γ/m).
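The cycle of (7.15) can be sketched as follows (my illustration of an Exp3.P-style update; the mean payoffs and the horizon are made-up assumptions, and only the chosen player's score is revealed each round):

# Sketch of an IREAW (Exp3.P-style) simulation following (7.15).
import math, random

m, n_rounds, delta = 3, 500, 0.1
beta = math.sqrt(math.log(m / delta) / (n_rounds * m))
gamma = 1.05 * math.sqrt(m * math.log(m) / n_rounds)
eta = 0.95 * math.sqrt(math.log(m) / (n_rounds * m))
abs_est = [0.0] * m               # estimated absolute successes abs_n(P_i)

true_scores = [0.9, 0.5, 0.2]     # hypothetical mean payoffs per player

for _ in range(n_rounds):
    w = [math.exp(eta * a) for a in abs_est]
    total = sum(w)
    prob = [(1 - gamma) * wi / total + gamma / m for wi in w]  # (7.15),3(iii)
    i = random.choices(range(m), weights=prob)[0]
    score = 1.0 if random.random() < true_scores[i] else 0.0   # only this is observed
    for j in range(m):                                         # (7.15),2
        abs_est[j] += (score + beta) / prob[j] if j == i else beta / prob[j]

print(max(range(m), key=lambda j: abs_est[j]))  # usually 0, the best player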

In condition 2 of (7.15), a non-MI player earns the payoff β even if he has not been chosen in the given round, and in condition 3(iii) every non-MI player has a chance of γ/m of being chosen independently of his achieved success. This grants that in the long run the payoffs observed by IREAW are a representative sample of all payoffs, so the following result can be proved about IREAW's regret (Bubek and Cesa-Bianchi 2012, 31–32):

(Proposition 7.3)  Regret bound of IREAW in games with restricted information
In all prediction or action games ((r),{P1, … ,Pm, IREAW},p), if IREAW's parameters are set to β = √(ln(m/δ)/(n · m)) for δ ∈ [0,1], γ = 1.05 · √(m · ln(m)/n), and η = 0.95 · √(ln(m)/(n · m)), then with probability ≥ 1 − δ:
maxsucn − sucn(IREAW) ≤ 5.15 · √(m · ln(m/δ)/n).

In (7.15), 3(iii), the ratio γ/(1−γ) reflects the balance between exploration (imitating a possibly unknown action independently of its observed payoff) and exploitation (imitating actions in proportion to their observed payoff). As proposition 7.3 shows, the optimal ratio is given when γ = 1.05 · √(m · ln(m)/n). We will say more on the exploration-exploitation balance in section 10.3, where we discuss the role of meta-induction in the context of cultural evolution.

8  Philosophical Conclusions and Refinements

8.1  A Noncircular Solution to Hume's Problem

8.1.1  Epistemological Explication of the Optimality Argument

The presented results about the universal access-optimality of attractivity-weighted (AW) meta-induction and its variants show that a noncircular epistemic justification of (meta-)induction is possible. This justification does not show that meta-induction must be successful in a strict or probabilistic sense, but it favors the meta-inductivist strategy over all other accessible competitor strategies. This is sufficient for an epistemic justification, without being in dissent with any of Hume's skeptical arguments. The given justification of meta-induction is mathematically analytic (or a priori), as it does not make any assumptions about the nature of the considered worlds, aside from certain practically evident assumptions that are listed below. However, as we explained in section 5.4, the analytic justification of meta-induction also implies a contingent (or a posteriori) justification of object-induction in our real world, insofar as (and to the extent that) object-induction has turned out to be, so far, the most successful prediction strategy. This argument is no longer circular, given that we have a noncircular justification of meta-induction. In conclusion, the insights about meta-induction provide us with a solution to Hume's problem of induction. In which sense and to which extent the solution is merely "partial" is discussed in this and the next chapter from various angles.

As a first step in this enterprise, we summarize the (minimal) epistemological assumptions upon which the optimality justification of meta-induction is based. Recall that my epistemological claim that a certain meta-inductive method xMI is access-optimal (possibly under conditions C) is elliptic for the claim that for every given person X who participates in prediction games (under conditions C), it is optimal to choose xMI as her meta-level strategy and apply it to the class of prediction methods that are (simultaneously) accessible to X in the given world. The same holds for action games. We call X the (epistemic


or practical) decision-maker or agent. Four minimal assumptions concerning the decision-maker are involved in this justification of meta-induction:

1. that maximization of predictive success (or of payoff in action games) is an accepted goal of the decision-maker,
2. that the decision-maker has certain ("normal") cognitive capabilities,
3. that past observations/experiences can be reliably recorded, and
4. that the decision-maker's decisions (predictions or actions) are free.

In what follows I explain the content of these four assumptions.

Assumption 1. The importance of the goal of predictive success for humans is obvious. Whenever a person acts, she makes choices; and whenever a (utility-maximizing) decision-maker makes a choice, she makes an implicit prediction concerning which action is most useful to her. Prominent accounts in cognitive science consider the human brain as a powerful prediction machine (Clark 2013). Our brain is engaged in hundreds of unconscious predictions almost every minute, such as when a person walks up a stairway or grasps an object with her hand. Also on a larger time scale we are constantly engaged in making predictions (weather forecasting, mating choice, economic forecasts, etc.). For action games, the maximization of utility is an even more obvious goal.

A possible objection could be that the assumption of fixed goals or utilities involves a circularity, as the payoffs of our actions are received in the future, which presupposes that our preferences do not change during the time span between action and outcome—or, if they do change, that we can predict this change. In other words, the utility maximizer has to predict his future utilities, and this involves induction. It is not true, however, that temporally changing preferences imply a circularity. They only imply that the full predictive task that our decision-maker must perform is more complex than it seems: she must predict not only the results of her possible actions but also her future preferences. Apart from that, nothing changes: we should still choose the method that is most successful in this (complex) task, and we know that the optimal strategy for method selection is attractivity-based meta-induction. To give a realistic example, taking painkillers to soothe a toothache loses its effectiveness over time; in this case, the meta-inductivist will favor a method that predicts this effect, rather than recommending the consumption of painkillers as a long-term solution.

Assumption 2. The decision-maker is cognitively able to apply the strategy of meta-induction. This presupposes, for example, a language in which past observations or experiences that are relevant to the predictive target can be expressed, as well as the small amount of logic and mathematics


that is needed for computing, comparing, and evaluating success rates. We consider this assumption as harmless. Moreover, it is noncircular, because the justification of logic does not presuppose induction (see section 11.2.2).

Assumption 3. Past instantiations of the predictive target variable can be observed in a reliable way. If this assumption is epistemologically interpreted in a realistic way, then it is not epistemologically harmless. However, recall from sections 2.2 and 3.1 that the justification of induction does not presuppose a realistic epistemological framework. It also makes good sense within a subjective-idealistic framework. Let me explain these two possible perspectives in more detail.

Introspective (subjective-idealistic) perspective: Here one assumes merely the existence of introspective experiences, without adding to them realistic content (for a precise description of the architecture of introspective experiences, see Carnap [1928] 1967). It is not excluded in this perspective that the agent is a "brain-in-the-vat" in the sense of Putnam (1981, chap. 1). Induction at the introspective level consists in predicting one's future experiences based on one's present memories of one's past experiences. These introspective experiences also include the experience of other subjects (without necessarily assuming their realistic existence), in particular the experience of their predictions, which are taken as "predictive cues." Meta-induction at the introspective level consists of applying a meta-inductive algorithm to the predictions and success rates of all predictive methods or cues that are introspectively accessible.

It is epistemologically important that the justification of induction already applies at a purely introspective level. The justification of abductive inference to an external reality presupposes that there exist certain regularities between our introspective experiences—regularities that have their best explanation in the assumption of an external reality that causes these experiences (see section 11.2.5). Thus the foundation-theoretic justification of human knowledge already requires the justification of induction at the level of "methodological solipsism" before the assumption of an external reality can get justified. However, induction is needed at all epistemological levels, in particular when one has accepted the assumption of realism and makes predictions of external events.

Realistic perspective: Here one assumes that one's introspective sense experiences express properties or states of an external (subject-independent) reality in a more-or-less truthful way. If one's actual and memorized sense experiences satisfy this assumption, they are called (present or past) observations and are assumed to be true or at least reliable (i.e., true in most cases), which in turn implies that different subjects agree in their observational beliefs (at least in most cases). Indeed, meta-induction can only be applied


to the solution of fundamental epistemic disagreements (see section 8.1.3) if there is agreement about the observed success records of the competing cognitive methods. The assumption of intersubjective agreement concerning success records is not epistemologically harmless. But it is noncircular, which is of primary importance. The cognitive ability of recording the past requires that past observations can be categorized according to similarity types and ordered according to their relation in time, but it does not require induction of any sort (as some philosophers have wrongly argued; recall section 2.2). When we predict that the sun will rise tomorrow, we neither presuppose that there will be "a sky" tomorrow nor that there will be "a tomorrow" at all; all this is part of our prediction and is verified if the predicted event is observed, while if a different event is observed, the prediction and certain conjunctive parts of it are falsified.

Assumption 4. The decision-maker is free in her decisions regarding which action or prediction she chooses in each round. We need not assume freedom in an "absolute" sense that entails a "freedom of will." All that is needed is that the decision-maker is able to perform the action that is recommended by the (meta-)method that she applies in a given round. This ability is standardly explicated by means of the following counterfactual conditional.

(8.1)  Freedom of the decision-maker
A decision-maker X is free in regard to a class A of predictions (or actions) in a possible world w iff the following counterfactual conditional holds for any a ∈ A:
(*) If X's goals were confined1 to the constitutive goal of the underlying prediction (or action) game and X were to prefer action a, then X would perform action a.

According to Lewis's (1973) possible-world semantics, condition (*) is explicated as follows: in every minimal change wa of w differing from w only in X's volitive preferences and satisfying the if-clause of the conditional (*), the then-clause of (*) is satisfied.

1. This restriction is necessary because the agent may have nonepistemic preferences—such as not insulting somebody—that prevent him from making the meta-inductively recommended prediction.


The freedom assumption seems to be harmless, but regarding the application of meta-induction to the rational discourse between different worldviews (see section 8.1.3) it becomes fundamental. Assume, for example, that a representative of a fundamentalistic religion, X, says that he is willing to enter a "rational" discourse with an adherent of enlightenment rationality, Y, in order to find out which worldview is better for mankind (which requires meta-inductive considerations), but only under the condition that Y never doubts the existence of his God, because this is a deadly sin and insults X. Then the free application of meta-inductive arguments is no longer possible, as this would require the consideration of, among other things, the expected consequences of an atheistic worldview under all relevant circumstances, in particular under the circumstance that God does not exist.

To summarize, the optimality justification of (meta-)induction rests on four epistemological assumptions, two of which are harmless (concerning the proper epistemic goal and human cognitive abilities) and two of which are not so harmless, though clearly noncircular (concerning the observation of past events and freedom of decision). For attractivity-weighted meta-induction (AW), two further technical assumptions are needed: (A) in real-valued games, the convexity of the loss function; and (B) in discrete games (or in games without convex loss function), the existence of a collective of AW meta-inductivists who share their success (or alternatively (B′), AW's possession of an independent random choice method). We regard assumptions A and B as unproblematic. In particular, the embeddedness of individual actions in a social collective is a conditio humana (condition B′ is less harmless but unnecessary).2

For these reasons, we are inclined to regard the proposed optimality justification of meta-induction (and the contingent justification of object-induction that is licensed by it) as a viable solution to Hume's problem of induction—at least, as a partial solution. Objections that the proposed solution is merely "partial" can be raised on three grounds.

1. The no free lunch objection. According to the no free lunch theorem, every nonclairvoyant prediction method has the same expected success, relative to a uniform prior over the space of event sequences. This seems to suggest that there cannot be any general (i.e., world-independent) relations of dominance between prediction strategies. In section 8.3 we will see that this is wrong and that there are plenty of dominance relations; however, they cannot be found at the object level of independent methods but at the level of dependent meta-strategies.

2. In contrast, our theorems on one-favorite meta-induction have to restrict optimality claims to nondeceiving methods.


The tension between the access-dominance of meta-induction and the no free lunch theorem will be analyzed and resolved in section 9.1.

2. The finiteness objection. The justification of meta-induction, as developed so far, is restricted to prediction games with finitely many competing candidate methods, although in section 7.3 the number of methods was allowed to grow unboundedly, provided it does not grow too fast. We consider the finiteness objection in section 9.2 and will see that it does not concern the justification of meta-induction but rather the question of choosing an appropriate candidate set of non-MI methods to which meta-induction is applied. Moreover, we will present certain optimality results that hold even for infinitely many methods.

3. The "best of a bad lot" objection. The justification can only establish that meta-induction is guaranteed to be access-optimal, but not that its success is greater than a certain threshold, at least greater than one-half, in order to count as "success" in a minimal sense. In other words, the optimal prediction strategy can be the best of a bad lot. However, as explained in section 5.2, this kind of justification is nevertheless genuinely epistemic, because it guarantees that the best we can do in order to reach our epistemic goal is to apply meta-induction. Apart from this consideration, AW meta-induction even has the means to set a certain limit to the best-of-a-bad-lot argument. This distinguishes AW meta-induction from one-favorite meta-induction, whose worst-case success rate is zero because of the possibility of deceivers.

The defense of AW against the best-of-a-bad-lot argument consists of including the strategy "always predict 1/2"—abbreviated as "Av" for "averaging"—in AW's candidate set of prediction methods. Obviously, the success rate of Av is never smaller than one-half—that is, sucn(Av) ≥ 0.5. Thus, AW is guaranteed to have a long-run success rate of at least one-half (given a natural loss function). The same trick can be applied in binary games by applying the collective method (CAW): if Av has highest attractivity, then approximately half of the k CAWs predict zero and the other half predict one, and by doing so the CAWs can guarantee themselves a minimal average success of approximately one-half. We summarize this fact as follows:

(Proposition 8.1)  Lower success bound of AW and CAW (with natural loss function)
(1) In every real-valued prediction game ((e), {P1, … ,Pm, AW}) whose set of non-MI methods includes Av ("always predict 0.5"), the limit inferior of AW's success rate is greater than or equal to 0.5.


(Proposition 8.1)  (continued)
(2) In every binary prediction game ((e), {P1, … ,Pm, CAW1, … ,CAWk}) whose set of non-MI methods includes Av, the limit inferior of CAW's average success rate is greater than or equal to 0.5 − 1/(2k).
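The effect of including Av can be checked by simulation. The sketch below implements a simple form of real-valued AW, with attractivities computed as positive success-rate surpluses over AW; the uniform-weight fallback for rounds in which no player is attractive, as well as the random rival method, are our own illustrative conventions.

```python
import random

def aw_success(events, candidates):
    """Attractivity-weighted meta-induction under the natural loss
    (score = 1 - |prediction - event|); a sketch, not the book's full AW."""
    m = len(candidates)
    abs_suc = [0.0] * m    # absolute successes of the non-MI players
    aw_abs = 0.0           # absolute success of AW
    for n, e in enumerate(events, start=1):
        preds = [c(n) for c in candidates]
        if n == 1:
            at = [1.0] * m                                 # no track record yet
        else:
            at = [max((a - aw_abs) / (n - 1), 0.0) for a in abs_suc]
            if sum(at) == 0:                               # fallback convention
                at = [1.0] * m
        pred = sum(w * p for w, p in zip(at, preds)) / sum(at)
        aw_abs += 1 - abs(pred - e)
        for i in range(m):
            abs_suc[i] += 1 - abs(preds[i] - e)
    return aw_abs / len(events)

av = lambda n: 0.5                      # the averaging method Av
rival = lambda n: random.random()       # a hypothetical uninformative method
events = [random.choice([0, 1]) for _ in range(10000)]
print(aw_success(events, [av, rival]))  # approaches at least 0.5, per proposition 8.1
```

Because Av scores exactly 0.5 in every round of a binary game under the natural loss, its presence in the candidate set pins AW's long-run success rate at one-half or above.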

However, no lower bound of (C)AW's success rate can be given that is greater than one-half. This means for binary prediction tasks that we cannot show that AW's predictions are more often true than false—that is, reliable in the minimal sense of this word. Of course, the situation changes when we make certain assumptions about the prior probability distribution over possible worlds (see section 9.1).

Related to the best-of-a-bad-lot objection is the following worry: even in normal worlds (where successful independent methods exist) the success of meta-induction is always parasitic on the existence of other attractive players or cues that the meta-inductivist can imitate. This is indeed true, and we will analyze the interesting consequences of this fact in section 10.3. However, to consider this insight as an objection to meta-induction is mistaken. Obviously, the success of the meta-level strategy always depends on two things: (1) the availability of good independent candidate methods and (2) the choice of the right meta-level selection strategy. We do not claim that the application of AW meta-induction in isolation is the right epistemic strategy, only that it is optimal ceteris paribus, that is, conditional on a fixed candidate set. Besides this, it is always good in addition to try to improve one's candidate set of independent (object-level) methods. In conclusion, the recommended procedure is that of a scientist who constantly tries to improve her own independent methods while at the same time not restricting her focus to her own preferred paradigm but rather applying meta-induction to all the methods available to her.

8.1.2  Radical Openness and Universal Learning Ability

The major advantage of the meta-inductivistic approach is its openness toward all kinds of possibilities. In our view, this openness is a sign of all good foundation-oriented (instead of "foundationalistic") programs in epistemology. The meta-inductivistic program does not exclude any nonscientific, esoteric prediction methods or worldviews from the start. You need not be a convinced naturalist or scientist in order to become convinced of the superiority of the (meta-)inductive method. You may start with any beliefs you want, including beliefs in a superior God who gives you knowledge about the future. This distinguishes the meta-inductive program, for example, from Nicholas Rescher's account of the problem of induction.


Rescher distinguishes between an initial justification and a pragmatic retro-justification (Rescher 1980, 91ff.). In his account, the initial justification of induction proceeds from the presumption that prima facie there are no better candidates for predicting the future than inductive methods. So, according to Rescher, prior to the accumulation of empirical data there is no reason to take oracles or other esoteric methods seriously (1980, 82ff.). In our view, such a position amounts to a dogmatic assumption that has no (noncircular) justification. After all, it is possible that supernatural powers exist. Moreover, an a priori exclusion of fundamentally different worldviews is certainly not the right means to initiate a rational dialogue between adherents of different worldviews, such as a scientific philosopher and an adherent of a fundamentalist religion. Thus we suggest that the strategy of meta-induction is a superior means of initiating rational dialogues between members of conflicting paradigms and worldviews, and we will elaborate this idea in the next subsection.

Some readers may still be inclined to think that the attempt to establish a universal optimality argument for meta-induction is too ambitious. The skeptical challenge against every sort of universal approach is this: How can it ever be possible to prove that a given strategy is better than, or at least as good as, every other accessible strategy in every possible world—without assuming anything about the nature of accessible strategies and possible worlds? I give the following answer to this skeptical challenge: it is possible to prove this for strategies that are able to learn from other strategies. An epistemic strategy is called a universal learner if, whenever the strategy is confronted with a so far better strategy, it can at least imitate the success of this strategy (output accessibility) or may even learn to understand and reproduce this strategy (internal accessibility). The epistemic success of a universally learning strategy will be access-optimal—optimal in regard to all kinds of accessible competitors and in every kind of environment, modulo certain short-run losses caused by temporal delays. What my account tries to show is, in other words, that meta-induction is a universal learning strategy.

8.1.3  Meta-Induction and Fundamental Disagreement

Because of its radical openness, meta-induction offers a solution to the problem of fundamental disagreement. Epistemologists have discussed the question of whether a cognitive disagreement between persons can ever be "reasonable" in the sense that the disagreement does not constitute a reason for the parties to lower their degree of belief, in cases where the parties share the same body of evidence. However, the problem has only been discussed for the situation of disagreement between epistemic peers


(Feldman and Warfield, eds., 2010; Christensen and Lackey, eds., 2013). Epistemic peers share their most general methods and standards of evaluation, which together make up what Goldman (2010) has called their epistemic system. Obviously, meta-induction is also highly useful for solving disputes between epistemic peers. A much more difficult challenge for epistemology, however, is the problem of fundamental disagreement. By this we mean disagreement between parties who argue from the background of radically different epistemic systems, such as a fundamentalistic religion versus a scientific worldview.

Fundamental disagreement can only be called "reasonable" if one takes a strong relativistic stance, according to which there do not exist any objective standards for better or correct epistemic systems (Boghossian 2006). In this book, we reject this relativistic epistemological position. In domains where one finds intersubjectively shared goals and standards of evaluation (such as predictive success and/or practical well-being), fundamental disagreement seems to be a sign of unreasonableness and an indicator of error. Disagreement in such cases does not lead people to shrug their shoulders, supposing that the relevant attitudes are matters of taste. Rather, such disagreements produce ongoing controversies and even fights, and their proper resolution consists in finding reasonable agreement.

But how can fundamental disagreement be overcome? The answer we suggest is: by cognitive methods that are universal in the sense of being reasonable in every cognitive system. In other words, by methods that have a justification that is worldview neutral. We think that meta-induction is precisely such a method, more precisely a meta-method, whose justification is universal, worldview neutral, and thus acceptable to every human person. Whenever two persons disagree about which of two fundamentally different belief systems, methods of reasoning, or ways of acting is better in regard to some shared goals, meta-induction constitutes the preferred if not the only way to resolve the disagreement.

As an example, consider the esoteric healing method of "high-dosed" homeopathic drugs. These drugs are diluted in water to such a low concentration that from a physical standpoint the homeopathic solutions consist of pure water. For the background system of natural science, therefore, it seems impossible that homeopathic drugs can be physically effective. In contrast, the esoteric worldview assumes that, as a result of the dissolution process, certain force fields emerge within the water, unmeasurable for the scientist and responsible for the healing effect. Endless disputes arise and are transmitted over public media. For the meta-inductivist, however, the most reasonable stance in this situation is to forget for a moment one's accepted paradigm


and apply meta-induction, by comparing the success rates of standard medical therapies with homeopathic therapies in a variety of double-blind experiments. Only after this has been done does it make sense to ask which of the competing paradigms can better explain the outcomes of these experiments.

8.1.4  Fundamentalistic Strategies and the Freedom to Learn

In explicating our notion of an access-optimal strategy (definitions 5.3 and 5.4) we assumed that the candidate prediction methods are accessible to the epistemic agent independently of which prediction meta-strategy she uses. What if a candidate method M makes its accessibility dependent on whether the decision-maker uses a particular learning strategy? A particularly devilish situation for meta-induction is given when a method of prediction or action praises itself as "superior" but makes itself accessible only to those decision-makers who reject meta-inductive learning and submit themselves to the method's authority without committing the "sin of doubt," which consists in critical reflection and comparison of the method's utility with that of alternative methods. We speak here of a fundamentalistic strategy and define it as follows.

(Definition 8.1)  Fundamentalistic method
A method M is fundamentalistic iff it makes its accessibility for a person X dependent on whether X's meta-strategy consists in the "blind" (i.e., success-independent) imitation of M.

Of course, the notion of a "fundamentalistic" worldview can be understood in different ways, but I think definition 8.1 captures a common feature of these different understandings of "fundamentalistic": a fundamentalistic method forbids freedom of opinion and critical testing. The promise of eternal salvation upon an unconditional submission to the religious creed is a mechanism of most fundamentalistic religions. "Thou shalt not put the Lord thy God to the test" appears in the New Testament (Jesus in the desert), and the Catholic Catechism condemns doubt as one of the major transgressions of Catholic faith; similar quotes can be found in authoritative texts of Islamic religion. More "profane" fundamentalistic mechanisms can be found in nonreligious totalitarian ideologies.

We would of course reject such a fundamentalistic strategy as irrational, but it is surprisingly difficult to give a noncircular argument for its irrationality, because to avoid circularity we have to admit the possibility of paranormal


worlds in which superior God-guided methods exist. Assume that we are in such a paranormal world in which a fundamentalistic method S does have a superior success rate. Then a person who selects meta-induction as her social strategy would no longer be optimal in regard to methods with fundamentalistically restricted accessibility; had she decided for the blind-favorite strategy "submit to S forever," her success rate would be higher. Of course, in all other worlds in which S fails to be optimal, a meta-inductivist who submitted to S would be lost forever: not only would her success be lower, but she could nevermore correct this mistake. But this does not change the fact that meta-induction cannot be universally optimal in application to methods with fundamentalistically restricted accessibility.

The decisive critique of submitting oneself uncritically to a fundamentalistic strategy consists in the observation that by doing so one spoils one's freedom to learn. Note that even if the act of submission was performed freely by a person, this does not mean that the future actions of the person are also free—on the contrary, the submission has made the person unfree from the moment at which it is in force. A person who has freely submitted herself to a religious creed is only free at later times if she has the freedom to retract her submission whenever she wants—but this is not what is meant by unconditional submission. In conclusion, submitting oneself to fundamentalistic strategies violates the freedom condition that was introduced in section 8.1.1; in particular, it undermines a person's freedom to learn.

Clever meta-inductivists can develop strategies by which they can even learn from a fundamentalistic strategy S, provided S is not "maximally fundamentalistic." First, a single meta-inductivist xMI may try to get access to S by favoring S unconditionally for a while; after that period is over, xMI proceeds meta-inductively and considers S as attractive to the extent that S has been observed to be successful. However, a strengthened fundamentalistic strategy S will suss out xMI's tactic and suspend its accessibility to xMI, because xMI does not honestly submit herself to S but "follows" S to determine S's success rate, which is a case of betrayal from the viewpoint of a fundamentalistic God. Second, even this strengthened version of fundamentalism is not sufficient to prevent meta-inductivists from gaining access to it, provided they act as a cooperative collective whose members share information and success. We may assume, for example, that the collective of meta-inductivists assigns a so-called pilot meta-inductivist, abbreviated as pMI(S), to each fundamentalistic strategy S. Each pMI(S) submits himself to S and follows S forever, thereby earning the same success as S. At the same time, each pMI(S)


informs all other meta-inductivists about the success rate of S and keeps them updated in every round, so that all nonpilot meta-inductivists can apply their standard meta-inductive method to all fundamentalistic strategies. Provided there are enough meta-inductivists, the worst-case regret of their average success produced by the fraction of pilot meta-inductivists will be negligibly small.

However, an intelligent fundamentalistic strategy S will try to punish such behavior. S will consider all such followers as traitors who support the infidels, the nonfollowers and adversaries of S. Rather, S will give access to his salutary forecasts only to those loyal followers who do not convey any relevant information to nonfollowers of S. Let us call a strategy that behaves in this way maximally fundamentalistic. A maximally fundamentalistic strategy behaves like a conspirative religious sect: only if you believe in our God forever and do not cooperate with anybody who does not belong to our sect do you belong to us, and only then will our God save you. Thus we have reached the result that collectively organized meta-induction fails only in regard to strategies that are maximally fundamentalistic.

To avoid misunderstandings, we do not claim that maximally fundamentalistic religions (or worldviews) are frequent. Most fundamentalistic religions try to disseminate their doctrines instead of keeping them secret, but there are also real examples of maximally fundamentalistic religious sects. The adoption of a maximally fundamentalistic strategy spoils one's (individual and social) learning ability completely and forever. From a rational viewpoint, this is maximally irrational, as every kind of rational behavior in a contingent world presupposes the ability to learn. Of course, one can give more specific reasons against fundamentalistic strategies that rest on certain presuppositions, but the ultimate argument against them is that they destroy the basis of all forms of rationality: the ability to learn.

8.1.5  A Posteriori Justification of Object-Induction

The a priori justification of meta-induction bestows on us a noncircular a posteriori justification of object-induction, insofar as object-induction—the empirically inductive methods of science—has been the most successful prediction strategy in the past. Theologians or philosophers opposed to "scientific dominance" are inclined to dispute this thesis. In this subsection we will briefly elaborate it—though the principal focus of this book is not this thesis but rather the a priori justification of meta-induction.

First and foremost, recall from section 5.9 that object-induction is not just one method but an unboundedly large family of—simple or increasingly


refined—inductive methods applied at the level of observed events. Many scientific debates concern the question of which inductive method (e.g., take-the-best heuristics, multilinear regression, simple or full Bayes estimation) is most appropriate for which domain. Moreover, there are different corpora of scientific evidence. Two cases may occur. Case 1: The same inductive method may lead to opposite predictions when it is applied to different evidence, E1 versus E2. In such cases, conditionalization on the total evidence E1 ∧ E2 is the recommended strategy (recall section 4.3), though this is not always possible because of missing statistical information. Case 2: Two different inductive methods based on the same evidence may produce opposite predictions because they conditionalize on different patterns in that evidence (see section 8.3.2). In this case, the application of meta-induction is the recommended choice, which bestows on us an optimal combination of the methods. Of course, sometimes the result of combining the evidence or the methods may lead to the conclusion that the odds are equal and one should remain agnostic.

Besides cases of competing inductive methods, there are several fields in which object-inductive prediction methods are not more successful than random guessing, because of the chaotic dynamics or the chance-driven nature of the events. Thus, the thesis that object-induction has so far been predictively more successful than noninductive methods should be explicated as follows.

(8.2)  A posteriori justification of object-induction
Until the present time and according to the presently available evidence, object-inductive methods dominated noninductive methods in the following sense: in many fields some object-inductive method was significantly more successful than every noninductive method, though in no field was a noninductive method significantly more successful than all object-inductive methods.

Note that according to thesis (8.2) the justification of object-induction is always relative to the present time and available evidence. Is (8.2) true? For us, a positive answer is evident. The assertions of the holy scriptures concerning the creation of the world are empirically refuted on all counts. Or, to take another example, the methods of scientific medicine are almost always more successful than religious healing practices. At this point proponents of religion may point out that religious beliefs have positive effects on one's psychological and even physical well-being and health. This may be true, and in other writings I have called these effects "generalized Placebo


effects" (Schurz 2011a, section 17.4–6). However, these effects can be predicted and explained by empirical psychology without assuming that the respective religious beliefs are true. Moreover, the methods underlying religious claims are diverse. Sometimes they are "fundamentalistic" in the sense of blind trust in the religious authority, but often they are based on ordinary object-induction from religious or other "alternative testimonies." Typically these alternative testimonies report "miracles": for example, testimonies cited in the Bible report that a person (Jesus) walked on water, and spiritual healing practitioners report having healed cancer by the laying-on of hands. If we could trust these alternative testimonies, this situation would be an instance of case 1: there is "alternative evidence" E′ that would confirm, by ordinary object-inductive reasoning, a "miraculous" fact F contradicting scientific theories T based on scientific evidence E. In such cases the crucial question is not about object-induction versus noninductive methods, but rather whether we should trust these testimonial reports. The view that all human testimonies are prima facie trustworthy is untenable; they may be based on confused perception, wishful thinking, or mere sensation seeking. In section 10.2 we show that human testimonies are trustworthy only to the extent that their reliability is meta-inductively supported by success indicators. There are well-known scientific methods of safeguarding against errors in the evidence, for example, (1) reproduction of the miraculous event and (2) presence of many mutually independent witnesses. None of the religious testimonies known to me satisfies these criteria; but if they did, open-minded scientists would have to accept them.

8.1.6  Bayesian Interpretation of the Optimality Argument

The optimality justification favors the meta-inductivistic strategy over all other accessible prediction strategies in terms of their predictive success. Although this argument seems to be unrelated to probabilistic (or Bayesian) accounts, there is a well-known way to explicate the content of an optimality argument in probabilistic terms. The Bayesian framework is as follows.

Framework of Bayesian decision theory
• A set of possible actions A.

• A set of possible circumstances or worlds W together with a (canonically defined) algebra AL(W) over W. If W is countable, then AL(W) is the power set over W. If W is uncountable (continuous), then W is represented by a (nonempty) interval I of real numbers, and AL(W) is the Borel-Lebesgue algebra over W.

• A utility function u: A × W → ℝ; thus "u(a,w)" is the utility of action a in world w.

• The set Prob of all (countably additive) probability measures P: AL(W) → [0,1]; "P(w)" is the probability of world w ∈ W.

For a countable set of worlds W, the expected utility ExpP(a) of an action a ∈ A is defined as ∑w∈W P(w) • u(a,w) (assuming, for simplicity, that the probabilities of worlds are independent of the choice of action). For a continuous set of worlds represented by an interval [s,t] of real numbers, the expected utility of action a is given by the Lebesgue integral ExpP(a) = ∫[s,t] u(a,x) dP(x).3 Proposition 8.2 presents the central fact underlying the Bayesian interpretation of optimality.

(Proposition 8.2)  Bayesian interpretation of optimality
Action a ∈ A is optimal in A in a class of worlds W with (canonical) algebra AL(W) iff for every probability function P over AL(W), the expected utility of a is maximal—that is, greater than or equal to the expected utility of any other possible action b ∈ A: ExpP(a) ≥ ExpP(b).
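As a toy illustration of proposition 8.2 for a finite set of worlds: the utility table below is invented, but it makes action a optimal in {a, b}, since u(a,w) ≥ u(b,w) holds in every world; accordingly, a's expected utility is maximal under every prior, spot-checked here against randomly sampled probability functions.

```python
import random

worlds = ["w1", "w2", "w3"]
u = {("a", "w1"): 1.0, ("a", "w2"): 0.6, ("a", "w3"): 0.8,   # invented utilities;
     ("b", "w1"): 0.9, ("b", "w2"): 0.6, ("b", "w3"): 0.2}   # a is weakly better everywhere

def expected_utility(action, P):
    """Exp_P(action) = sum over worlds w of P(w) * u(action, w)."""
    return sum(P[w] * u[(action, w)] for w in worlds)

for _ in range(10000):
    raw = [random.random() for _ in worlds]
    P = {w: x / sum(raw) for w, x in zip(worlds, raw)}       # a random prior over W
    assert expected_utility("a", P) >= expected_utility("b", P)
```

The right-to-left direction is where quantifying over every P does its work: if a were worse than b in even one world, a prior concentrated on that world would make ExpP(b) the larger value.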

Proof: Appendix 12.31.

The beautiful fact about proposition 8.2 is that the iff condition does not impose any restriction on the probability distributions P. This is a strong result, because usually Bayesian findings depend on constraints on prior distributions. For example, proposition 8.2 does not require that P is nondogmatic (this is only required in the Bayesian interpretation of dominance; see section 8.3.3). The reference to every possible P is a precondition of the right-to-left direction of proposition 8.2.

In order to apply proposition 8.2 to prediction or action games, we have to address two complications. Complication 1: The worlds of prediction games contain sequences of time points, and the utility (predictive success) may vary in time. Maximization of expected success in the short run depends on the particular shape of the prior probability distribution over event sequences. The only way to establish simple connections between optimality and expectation value is in the long run.

3. We need not require that the cumulative distribution function Pcum(x) =def P([s,x]) (x ≤ t) is continuous. If this is the case, P can be generated by a Riemann-integrable probability density function D: [s,t] → ℝ+,0 by the definition P([s,x]) = ∫[s,x] D(r)dr.


In order to make our definition applicable to worlds with permanently oscillating success rates, we formulate the expectation value for the inferior limits of the regrets. Complication 2: Worlds are now identified with prediction games (W = G) containing sequences of events and actions. We allow that possible worlds contain different sets of possible actions (methods). In order to compare the expected success of a method M* applied in all worlds in G with the expected success of another method M that is applied only in a proper subset G↑{M} of worlds in G (where G↑{M} is the set of games in G in which M occurs), we ignore worlds outside of G↑{M} in the comparison of expected successes, by setting liminfn→∞(sucn(M*) − sucn(M)) to zero for all worlds in which M is not applied.

(Proposition 8.3)  Bayesian interpretation of optimality in prediction games
A (meta-inductive) strategy xMI is access-optimal in the long run in a class of worlds (games) G
(1) iff for every probability function P over AL(G), the expected limit inferior of the difference between the success rate of xMI and the maximal success rate of the non-MI players is greater than or equal to zero; and
(2) iff for every probability function P over AL(G) and for every method M ∈ M(G) (M ≠ xMI), the expected limit inferior of the difference between the success rates of xMI and M is greater than or equal to zero.

Proof: Appendix 12.32.

By inserting the results of our respective theorems and propositions into proposition 8.3, one obtains the respective Bayesian interpretations of the access-optimality of imitate the best (ITB), epsilon-cautious imitate the best (εITB), imitate the best nondeceiver (ITBN), and all versions of xAW. In conclusion, universal access-optimality is equivalent to maximal expected success (among all accessible methods) for all possible prior distributions. In this sense, optimality justification by means of regret analysis offers an alternative to Bayesian accounts that is superior insofar as standard Bayesian results depend on assumptions about (induction-friendly) prior distributions (recall chap. 4).

8.1.7  From Optimal Predictions to Rational (Degrees of) Belief

Recall from section 8.1.1 that the optimality justification gives us a reason to adopt the epistemic practice of meta-induction, but it does not give us a (presupposition-free) reason to believe that the meta-inductivist's predictions are reliable in the minimal sense, which for binary predictions means that


the prediction's probability is greater than one-half. In this subsection we ask, What follows from the fact that an access-optimal method predicts p for our belief regarding p? For reasons of simplicity we focus on binary prediction tasks. The epistemic optimality of meta-induction entails that in all contexts in which the agent's choice of action depends on the truth of p versus not-p and an access-optimal meta-method predicts p, the agent should act as if she considered p as "more probable" than not-p. Thus we argue that if a person has to predict and knows that she maximizes her truth-chances by predicting p rather than not-p—because her prediction method is access-optimal—then it is rational for this person to doxastically prefer p over not-p. In what follows we call this principle the optimality principle. Before we explicate it, we should formulate two careful provisos to this principle.

• Proviso 1. From the fact that an access-optimal method predicts p, one can at most infer that it is more reasonable to believe p than not-p, but not that it is reasonable to believe p in the sense of qualitative belief or acceptance. A widely accepted connection principle between an agent's degrees of belief and her qualitative beliefs is "Locke's thesis" (Foley 1992, 111). It states that a proposition should be rationally believed iff its epistemic probability (conditional on the total evidence) is greater than a given threshold t: P(p|evidence) > t. The particular choice of t depends on the context of involved risks and gains, but t has to be at least one-half in order to avoid inconsistency (Schurz 2018). For the same reason, one cannot infer from the fact that an access-optimal method predicts p that one should act on the assumption that p is true. This would be unwise if the costs of falsely assuming p are much higher than the costs of falsely assuming not-p. For answering the question whether one should qualitatively believe p or act on p, a rational estimation of p's probability is needed; we return to this question later.

• Proviso 2. To delimit the reach of the best-of-a-bad-lot objections, the candidate set of methods to which meta-induction is applied must satisfy certain minimal rationality conditions: only if the candidate set of an access-optimal prediction method M contains certain basic object-level methods such as Av (averaging) is it guaranteed that the truth-chances of M's predictions are at least 0.5 (recall proposition 8.1). Moreover, the candidate set should contain at least the basic object-inductive methods OI and OI2 (together with other accessible methods), and the object-level methods should be applied to the total evidence available (which in the case of extended prediction games may include information about


events external to the sequence; recall section 5.9). For the application to binary games, not only the version of OI that predicts one in the case of ties (i.e., observed event frequencies of 0.5) but also the version that predicts zero in this case has to be included in the candidate set; otherwise, it would not be guaranteed that the meta-inductivist predicts 0.5 on average when applied to a genuine random sequence. Finally, there are not one but several access-optimal meta-inductive methods (including conditionalized meta-induction, introduced in the next section), and there may be event sequences for which their predictions (based on the same candidate set) disagree. In these cases neither of the two opposite beliefs should be doxastically preferred. Therefore, we have to refer in our optimality principle to all meta-inductive methods that are demonstrably access-optimal; only if they agree is a doxastic preference for their prediction warranted.

Based on these considerations we can explicate the optimality principle.

(8.3)  Optimality principle
If all justifiably access-optimal meta-inductive methods, applied to a candidate set satisfying minimal rationality conditions, predict (in a given round n) a binary event e, or predict event e with a higher value than 0.5, then
• nonprobabilistic version: it is rational to doxastically prefer e over not-e.
• probabilistic version: e should be considered as more probable than not-e.

If the antecedent of principle (8.3) is satisfied, we speak of an "uncontested meta-inductive recommendation of e." In the nonprobabilistic version of the optimality principle the epistemic agent infers from such a recommendation that e is doxastically preferred over not-e—that it is more reasonable to believe e than not-e—without drawing conclusions for her (subjective) probability function. According to this version it is possible to stick to the induction-hostile state-uniform distribution explained in section 4.5 (proposition 4.8) and nevertheless doxastically prefer e over not-e. A proposal of this kind was made by Musgrave (2002), who argued that it may be rational to prefer e over not-e even if one has no reason to believe that e is more probable than not-e. In contrast, the probabilistic version of the optimality principle entails a certain constraint on the agent's probability distribution; in particular, her prior distribution must not be state-uniform but has to be one of the prior distributions that have the capability to learn (recall propositions 4.3 and 4.4), or a meta-inductively weighted


average of these distributions with the state-uniform distribution (recall section 7.1).

Condition (8.3) explicates the optimality principle for the prediction of binary events. For predictions over an event space with q possible (event) values v1, … ,vq the principle is generalized as follows: in this case, we infer from an uncontested meta-inductive recommendation of the prediction vr that value vr is doxastically preferable over all other possible values, or in the probabilistic version that vr's probability is higher than that of all other values, which entails that P(vr) is higher than 1/q.

To decide whether we should qualitatively believe an event or proposition e that is uncontestedly recommended by meta-induction, or whether we should act on its basis, we have to rationally estimate e's probability (given our evidence). If e's probability passes the given acceptability threshold, we should qualitatively believe e (in accordance with Locke's thesis4). Moreover, if the action with maximal expected utility does not change if we conditionalize our probability distribution on e, then it is safe to "act on the basis of our belief in e" (Weatherson 2005). How meta-induction can help us in the task of rational probability estimation was explained in section 7.1: we have to apply meta-inductive probability aggregation to a prediction game with Bayesian forecasters. As argued there, meta-inductive probability aggregation offers a new way of dynamic probability estimation that has certain advantages compared to standard Bayesian methods.

8.2  Conditionalized Meta-Induction

The purpose of conditionalized meta-induction is to improve the performance of meta-induction in changing environments whose changes are recognizable. It is a basic fact of evolution that different methods of action are adapted to different environments and thus have different success rates in different environments. Therefore, it would be profitable to always favor the method that performs best in the given environment instead of favoring a method that is best on average. For this purpose it is necessary to conditionalize the success rates of the accessible methods on those properties of the environment that are recognizable and relevant to success.

4. Locke's thesis may conflict with the conjunction principle for qualitative beliefs (Schurz 2018). Possible solutions of this problem are independent of the optimality principle.


In section 5.9 we called this the method of conditionalized meta-induction (con-xMI). The method of conditionalizing meta-induction on subsequences with certain properties has also been applied in multiarmed bandit problems (see Bubeck and Cesa-Bianchi 2012, 44ff., theorem 4.1). In this section we will elaborate strategies of conditionalized meta-induction.

We first present a simulation. Figure 8.1 illustrates the performance of conditionalized imitate the best (con-ITB) in a binary prediction game that is played within five different environments, E1, … ,E5. Every 10 rounds a new environment is chosen randomly from {E1, … ,E5}, according to a uniform distribution. There are five non-MI methods; each of these methods is adapted to one of the five environments, in which it predicts with a success rate of 95 percent, while in the other four environments the method has a much lower success rate. Because con-ITB can recognize the environment and favors in each environment exactly the strategy that performs best in it, the success rate of con-ITB climbs high above the average success rates of the five methods. For the sake of comparison, figure 8.1 includes the success rate of ordinary (unconditional) ITB, which selects its favorites according to their unconditional success rate. We see that con-ITB predicts significantly better than unconditional ITB—where this holds, of course, under the assumption that con-ITB's predictions are not accessible to ITB; otherwise, ITB would predict as well as con-ITB, apart from a vanishing short-run regret.
[Figure 8.1 plots success rate (y-axis, 0.1–0.9) against round on a logarithmic scale (10 to 1,000): con-ITB (black), ITB (thin black; con-ITB not accessible), and the five environment-dependent methods (grey).]

Figure 8.1
Conditionalized imitate the best (con-ITB) players and five forecasters (grey) who are adapted to five different environments in which they predict with success rate 0.95. In the other environments their success rates are 0.4, 0.45, 0.5, 0.55, and 0.6, respectively. Environments change randomly after 10 rounds. For comparison, results of unconditional ITB are included.
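The simulation behind figure 8.1 is easy to reproduce in outline. In the following sketch the success probabilities are those stated in the caption, while the tie-breaking and initialization conventions are our own simplifications.

```python
import random

OFF = [0.4, 0.45, 0.5, 0.55, 0.6]   # off-environment success probabilities
M, ROUNDS = 5, 5000

hits = [0] * M                      # unconditional success counts per method
cond = [[0] * M for _ in range(M)]  # cond[e][i]: successes of method i in env e
seen = [0] * M                      # rounds spent in each environment
suc_itb = suc_con = 0
env = random.randrange(M)

for n in range(1, ROUNDS + 1):
    if n % 10 == 1:
        env = random.randrange(M)   # a new random environment every 10 rounds
    # method i predicts correctly with probability 0.95 in "its" environment
    scores = [int(random.random() < (0.95 if i == env else OFF[i])) for i in range(M)]
    best = max(range(M), key=lambda i: hits[i])                             # ITB's favorite
    best_c = max(range(M), key=lambda i: cond[env][i] / max(seen[env], 1))  # con-ITB's
    suc_itb += scores[best]
    suc_con += scores[best_c]
    for i in range(M):
        hits[i] += scores[i]
        cond[env][i] += scores[i]
    seen[env] += 1

print("ITB:", suc_itb / ROUNDS, "con-ITB:", suc_con / ROUNDS)
```

In runs of this sketch, con-ITB's success rate climbs toward 0.95, while ITB's stays near the best unconditional rate (roughly 0.67 in this setup), matching the qualitative picture of figure 8.1.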

Philosophical Conclusions and Refinements 217

ITB; other­wise, ITB would predict as well as con-­ITB, apart from a vanishing short-­run regret. ­ oing to establish the mathematical theorems that generalize We are now g the result of figure 8.1. Recall from section 5.9 that to obtain a sufficiently general notion of conditionalized meta-­ induction we assume that the observed events of the event sequence are multidimensional: they not only include the event to be predicted but also cognitively accessible information about other external events conjectured to be success relevant. Formally, our events ωn are now ele­ments of an r-­dimensional normalized real-­valued space, ω n = (e1n ,…, e rn ) ∈[0,1]r , where the first ele­ment is the event variable to be predicted and the remainder ele­ments are the external event variables on which the success rates are conditionalized. Let R = {R1, … ,Rq} be a reference partition of the external events so that at each time n the meta-­ inductivist has reliable information concerning which cell of the partition is realized at that time. We allow that the success rates are not only conditionalized on information about external variables at time n but also on information about more distant past events up to a certain historical depth d, (ωn−d+1, … ,ωn). This information is the content of the descriptions of the reference cells Ri (1 ≤ i ≤ q). Let R: → {R1, … ,Rq} be the function that assigns to each time n that cell R(n) of partition R that was realized up to time n, between the times n − d + 1 and n.5 Then cond-­xMI conditionalizes the success rates of the non-­MI players on R(n). We do not assume that cond-­xMI can foresee which cell of the partition R(n+1) ­will be realized at time n+1 of the predicted event. All that we assume is that con-­xMI has observed the cell R(n) that is realized up to time n. R(n) can be success relevant for predicting en+1 only if ­there is a correlation between R(n) and en+1. In other words, the conditionalization method can only be profitable if the success-­ relevant environmental properties possess a certain uniformity. If they change unforeseeably from round to round, nothing can be won by environment conditionalization; we ­shall also see, however, that nothing can be lost, at least not in the long run. To make the conditionalization method explicit, we introduce the following terminology: • abs (P |R ) n i j

is the absolute success of player Pi at time n conditional on reference cell Rj—­that is, the sum of scores that ­were achieved by Pi ­until time n for events ei that ­were preceded by reference cell Rj; R(i−1) = R j.

5. ​For times n  u(A|Q). By applying the respective theorem concerning xMI to the two subgames, con-­xMI’s success converges with that of A in subgame G|R and with that of B in subgame G|Q, which gives us: (iii) u(con-­MI)  = p(R) • u(A|R) + p(Q) • u(B|Q) > u(A) = u(xMI), by u(B|Q) > u(A|Q) and equation ii. So in this case conditionalization increases the long-­run success (Q.E.D.). Because Good’s proof shows that conditionalizing on narrower refer­ ence classes can only improve one’s success, it gives us a justification of the princi­ple of narrowest reference class, which was explained in section 4.3. This princi­ple becomes impor­tant for meta-­induction when t­here is more than one relevant reference partition of the environment, say R1 = {F,¬F} and R2 = {G,¬G}, and conditionalizing the success probabilities on the cells of ­these partitions produces ambiguous success orderings, such as u(A|F) > u(B|F) but u(A|G)  sucn(Q|Ri) for all non­MI players Q ≠ Bi. Theorems 8.1(1) and 8.1(2) assert that cond-­ITB’s success


rate is greater than or equal to the maximal unconditional success rate minus a worst-­case regret of ∑1≤i≤q(wi′/n) • maxsucwi|Ri that vanishes in the limit. If at least one conditionally best player Bk outperforms the unconditionally best player ­after his winning time wk, cond-­ITB strictly improves the maximal success rate in the long run and, hence, strictly improves ITB (theorem 8.1(3)). (Theorem 8.1)  Con-­ ITB in conditionalized games with unique best subgame players Let G = ((e), {P1, … ,Pm, con-­ITB}, {R1, … ,Rq}) be a conditionalized prediction game whose set of non-­MI players contains for each subgame G|Ri (1 ≤ i ≤ q) a unique best player Bi with winning time wi. Let w =def max({w1, … ,wq}) and w′i = wi • freqwi(Ri). Then, (1) Short run: For all n ≥ w: sucn(con-­ITB) ≥ maxsucn − ∑1≤i≤q (w′i/n) • maxsucwi|Ri. (2) Long run: limn→∞ (sucn(con-­ITB)  − maxsucn) ≥ 0. (3) Let B(n) be the unconditionally first best player at time n and suppose some conditionally best player Bk with nonvanishing winning frequency (liminf(freqn(Rk)) > 0) outperforms B(n) a ­ fter its winning time—­that is, for all n ≥ wk, sucn(Bk|Rk) = sucn(B(n)|Rk) + δ holds for some δ > 0. Then 8.1(1) and 8.1(2) hold when ≥ is replaced by >.

Proof: Appendix 12.33. ­ hether con-­ITB’s short-­run regret is smaller or greater than the short-­ W run regret of the unconditional ITB depends on the relation of the regret term of ITB and the regret terms within each subgame weighted by the ­ nder probabilistic IID assumptions, the winfrequencies of the subgames. U ning times wi in each subgame are approximately equal to the winning time wB in the full game, which implies that the regret bound of con-­ITB is approximately q times higher than that of ITB, which is a dramatic increase. ­Under nonrandom conditions, however, con-­ITB’s short-­run regret may be smaller than that of ITB. It may even happen that the environment-­conditional success rates converge rapidly, while the unconditional success rates oscillate endlessly in a deceptive way. Vice versa it may happen that the non-­MI players’ unconditional success rates converge rapidly, but their environment-­conditional success rates oscillate endlessly in a deceptive way. In that case, ITB’s regret converges with zero, but con-­ITB’s regret stays maximal forever. Scenarios of this sort can easily be constructed. We abstain from presenting them and summarize their conclusion as follows: that con-­ITB can only improve but


not diminish ITB's long-run success is only true if the conditions of theorem 8.1 are met.

Similar theorems are possible for the conditionalized versions of ε-cautious ITB and ITBN. We refrain from elaborating these generalizations and turn finally to the conditionalized version of attractivity-weighted meta-induction (con-AW). This method bases its predictions on the environment-conditionalized attractivities atn(Pi|R(n)), which are defined for all n ≥ 1 as follows.

(8.4)  Conditionalized attractivities of con-AW
atn(Pi|R(n)) = sucn(Pi|R(n)) − sucn(con-AW|R(n)) if this difference is positive; else = 0.
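To make definition (8.4) concrete, here is a minimal sketch of one prediction round of con-AW for binary events with the natural loss function. The helper names, the two constant toy players, and the reference-cell scheme (conditionalizing on the preceding event) are illustrative assumptions, not part of the formal apparatus; con-AW predicts the attractivity-weighted average of the accessible predictions within the current cell, which is what theorem 8.2 below evaluates.

```python
# Minimal sketch of conditionalized attractivity-weighted meta-induction (con-AW).

def score(pred, event):
    # Natural loss |pred - event|; score = 1 - loss.
    return 1.0 - abs(pred - event)

def con_aw_predict(preds, cell, state):
    """One con-AW prediction, given the players' predictions and the
    current reference cell R(n), using the attractivities of (8.4)."""
    n = state["rounds"].get(cell, 0)
    player_suc = [s / n if n else 0.0
                  for s in state["player_scores"].get(cell, [0.0] * len(preds))]
    aw_suc = state["aw_score"].get(cell, 0.0) / n if n else 0.0
    # at_n(P_i|R(n)) = suc_n(P_i|R(n)) - suc_n(con-AW|R(n)) if positive, else 0.
    at = [max(0.0, s - aw_suc) for s in player_suc]
    if sum(at) == 0:                       # no attractive player: plain average
        return sum(preds) / len(preds)
    return sum(a * p for a, p in zip(at, preds)) / sum(at)

def con_aw_update(preds, own_pred, event, cell, state):
    """Record the cell-conditional scores once the event is observed."""
    scores = state["player_scores"].setdefault(cell, [0.0] * len(preds))
    for i, p in enumerate(preds):
        scores[i] += score(p, event)
    state["aw_score"][cell] = state["aw_score"].get(cell, 0.0) + score(own_pred, event)
    state["rounds"][cell] = state["rounds"].get(cell, 0) + 1

# Toy run: reference cells are given by the preceding event (a Markov partition).
state = {"player_scores": {}, "aw_score": {}, "rounds": {}}
cell = "start"
for e in [0, 1, 0, 1, 0, 1, 0, 1]:
    preds = [1.0, 0.0]                     # hypothetical players "always 1", "always 0"
    own = con_aw_predict(preds, cell, state)
    con_aw_update(preds, own, e, cell, state)
    cell = e                               # conditionalize the next round on this event
```

In this toy environment each player is conditionally best in one of the two cells, so con-AW's cell-conditional success rates track the conditionally best player within each cell.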

By applying theorem 6.8 to each subgame G|Ri we obtain theorem 8.2.

(Theorem 8.2)  Con-AW in conditionalized games
Let G = ((e), {P1, … ,Pm, con-AW}, {R1, … ,Rq}) be a conditionalized prediction game whose loss function loss(predn,en) is convex in the argument predn. Then,
(1) Short run: For all n ≥ 1: sucn(con-AW) ≥ maxsucn − √(m/n) • ∑1≤i≤q √(freqn(Ri)).
(2) Long run: sucn(con-AW) is greater than or equal to the non-MI players' maximal success in the limit; liminfn→∞ (sucn(con-AW) − maxsucn) ≥ 0.
(3) If for some Ri with nonvanishing frequency, maxsucn|Ri ≥ maxsucn + δ holds for some δ > 0 and all times n after some time wi, then the ≥ in 8.2(1) (restricted to times n ≥ wi) and 8.2(2) can be strengthened to >.

Proof: Appendix 12.34.

Theorem 8.2 holds for strictly all conditionalized prediction games, so con-AW is universally access-optimal. Moreover, con-AW can only improve but not diminish AW's long-run success; thus con-AW dominates AW, given that it is not accessible to AW. Unfortunately, this improved long-run behavior comes at the cost of the short-run performance: con-AW's worst-case regret is √(m/n) • ∑1≤i≤q √(freqn(Ri)), which is greater than AW's worst-case regret of √(m/n), since ∑1≤i≤q √(freqn(Ri)) is greater than one. Of course, a similar conditionalization is possible for the other versions of xAW.

In conclusion, the strategy con-AW bestows on us the rare case in which a universally access-optimal meta-inductive method (con-AW) generally, and not just locally, improves the long-run success of another


access-optimal method (AW) if applied to the same toolbox of candidate methods, although at the cost of an increased short-run regret. We conjecture that it is also in this case preferable to follow the division-of-labor strategy recommended in section 6.8 and put con-AW into AW's candidate set. On average this will lead to a smaller increase of the short-run regret than when one uses con-AW as one's top-level meta-inductive strategy.

8.3  From Optimality to Dominance

That an action A is optimal (in a class of actions A) does not exclude that some other actions A′ (in A) are optimal, too. This is only excluded if the action is dominant—that is, no other action in A is optimal. Recall the definition of access-dominance from section 5.7: a method M* is (access) dominant in a class of prediction games G iff M* is (access) optimal in G but no other method M is (access) optimal in G↑{M} (= the set of games in G in which M occurs). This implies by proposition 5.1 that there is a prediction game in G↑{M} in which M* is better than M.

So far we were only able to prove (access) optimality results for meta-induction, but not dominance results. Reichenbach (1949, 476) gave a clever argument according to which the optimality of a method M may already be sufficient to provide rational grounds for preferring M, namely under the condition that M is the only method for which we can prove that it is optimal. If this condition is satisfied we even have a reason for supposing that M is dominant; however, this reason is not conclusive. Thus, the provable optimality of a method M gives us a weak reason to choose M, but it does not yield conclusive reasons for preferring M over another method M′ so long as we do not have conclusive reasons to suppose that M′ is not itself optimal.

In this section we ask, Can we also prove dominance results? As explained, AW is not the only meta-inductive method that is provably long-run access-optimal; there are further versions of xAW that are likewise long-run optimal. It follows that no meta-inductive method xMI can be (access) dominant in regard to all alternative methods that occur in the respective class G of prediction games in which xMI is (access) optimal, as G includes other meta-inductive methods (xMI*) that are access-optimal, too.

8.3.1  Restricted Dominance Results

The only chance that is left for establishing dominance results for meta-induction is to look for restricted dominance claims, namely dominance in regard to classes of methods that are not meta-inductive. For this purpose we introduced in section 5.7 the notion of the dominance of a method M*


in a class of games G in regard to a subclass of methods M ⊂ M(G). To keep things simple, we focus on dominance in the long run. The thesis that we intend to defend in this subsection is as follows: if a method xMI is (long-run) access-optimal with respect to G, it is (long-run) access-dominant with respect to three subclasses of methods in M(G): independent methods, not access-optimal methods, and noninductive dependent methods.

We first consider independent (object-level) methods. Recall from section 5.5 that for every independent method M, whether inductive or noninductive, there is an M-demonic event sequence (e)M in which M's success is zero. Moreover there is a method M# that predicts (e)M perfectly. Thus in the prediction game G = ((e)M,{M,M#}), M# dominates M. This argument establishes that no general relation of dominance (holding for all event sequences) can exist between independent methods. The same argument, however, helps to establish a general dominance claim at the meta-level: the dominance of every access-optimal meta-inductive strategy over all independent methods M. For if xMI is access-optimal in a class of prediction games G that includes ((e)M,{M,M#,xMI}), then xMI will share M#'s long-run result and thus will dominate M in G. This is the content of our first result on dominance, whose proof rests on the fact that for all meta-inductive methods xMI studied so far, G contains a game of the sort ((e)M,{M,M#,xMI}).

(Proposition 8.4)  Dominance of meta-induction over independent methods
Assume xMI is [ε-approximately] access-optimal (in the long run) in a class of prediction games G. Then xMI is [ε-approximately] access-dominant in G with respect to the subclass Mindep(G) of all independent methods in M(G).

Proof: Appendix 12.35.

The construction of demonic event sequences can also be applied to dependent methods, but it cannot be used to undermine their access-optimality. Note that a dependent method S, when applied to a fixed candidate set C = {P1, … ,Pm}, is itself a (uniquely defined) prediction method. Thus for every dependent method S being applied to C there exists a demonic event sequence (e)S:C and a method M# that predicts (e)S:C perfectly. However, this is only possible if M# is inaccessible to S, so this argument cannot undermine the access-optimality of a dependent method.

Is there a possibility to establish the access-dominance of meta-strategies over other kinds of dependent strategies? Yes—over all dependent strategies that are themselves not access-optimal. This is the content of proposition 8.5(1). Its proof rests on the idea that whenever a dependent strategy S is


not (long-­run) access-­optimal, ­there exists a game G in which liminfn→∞ (sucn(S)−maxsucn) is negative; by adding an access-­optimal strategy S* to ΠG, G can be extended to a game in which S* is better than S. Proposition 8.5(2) informs us that some well-­known meta-­inductive strategies are not access-­optimal, and thus are dominated by access-­optimal strategies. (Proposition 8.5)  Dominance meta-­methods

of

meta-­ induction

over

non-­ access-­ optimal

(1) Assume xMI is [ε-­approximately] access-­optimal (in the long run) in a class of prediction games G. Then xMI is [ε-­approximately] access-­dominant in G with re­spect to the subclass M¬opt(G) of meta-­methods S in M(G) that are not access-­optimal in G↑{S}. (2) The following meta-­ methods are not universally access-­ optimal: (i) all one-­ favorite methods, (ii) success-­ weighted meta-­ induction (SW), and (iii) linear regression for linear loss functions. Thus, if xMI is universally access-­ optimal, then it is universally access-­ dominant with re­ spect to ­these methods.

Proof: Appendix 12.36.

By the same proof as for proposition 8.5 we can also show that an access-optimal strategy xMI dominates every non-access-optimal strategy S in a prediction tournament in which xMI does not have access to S. We dispense with stating this result formally.

8.3.2  Discriminating between Inductive and Noninductive Prediction Methods

We finally return to the question of dominance of induction over anti-induction (or, more generally, noninduction). We have seen that no general relation of dominance can exist at the level of independent methods. Is it at least possible to establish such a relation at the level of dependent strategies? The problem of this enterprise is to formulate a sufficiently general discrimination between "inductive" and "anti-inductive" (or "noninductive") prediction methods, whether for independent or dependent methods. This is a highly nontrivial problem that we first tackle at the level of independent methods.

Every inductive method searches for one or several properties of the past events, projects them to the future, and derives from this projection


categorical or probabilistic predictions concerning future events. Refined conditionalized inductive methods may search for very complex properties, such as for Markov properties of the form "if the past events have property P, then the probability of the next event is such and such." Based on this consideration, we propose a preliminary explication (8.5) of an inductive method. As a minimal condition for "inductive" we require that the predicted event value (or the probability of its prediction) increases monotonically, though not necessarily strictly monotonically, with the so-far observed event average (or frequency). We distinguish between simple and conditionalized inductive methods.

(8.5)  Preliminary explication of an "inductive prediction method"
[For conditionalized methods, add the underlined insertions.]
(1) M is a simple/conditionalized independent inductive prediction method for event variable Xe iff M bases its predictions of the next event en+1 on inductive projections of the observed averages of the values of Xe in the past [conditional on the values of another (possibly complex) variable Xr applying to one or several preceding events], and the value v that M predicts for the next event increases monotonically with an increasing average of Xe in the past [conditional on the observed value of Xr].
(2) M is a simple/conditionalized dependent inductive method iff the values of the event variable Xe [conditionalized on Xr] in (1) consist of the unconditional/conditional success rates of the accessible prediction methods P1, … ,Pm and M predicts a (properly or improperly) weighted average of the predictions of P1, … ,Pm, with weights that are monotonically increasing functions of the so-far observed success rates.7

Explication (8.5) can be applied to predictions of discrete events by replacing "the observed averages" by "the observed frequencies" and "the value v that M predicts" by "the probability with which M predicts v," which subsumes deterministic methods by letting these probabilities be either 0 or 1. Memo (8.6) gives us a corresponding explication of an anti-inductive (or noninductive) method for free.

7. Recall that one-favorite methods are improper weighting methods that assign weights of 1 and 0. "Monotonic increase" means for them that weight 1 has to be assigned to a method with maximal success; in the case of εITB, ignoring losses smaller than ε.


(8.6)  Anti-inductive (noninductive) prediction method
M is a (simple or conditionalized, independent or dependent) anti-inductive [or noninductive] method iff M satisfies the respective condition in explication (8.5) when "monotonic increase" is replaced by "monotonic decrease" [or by "not monotonic increase," respectively].

We illustrate (8.5) and (8.6) using some examples of binary predictions. The simplest object-inductive method is OI, which predicts predn+1 = 1 if freqn(1) ≥ 0.5 or n = 0; else predn+1 = 0. Here the value of the predicted event increases monotonically but not strictly monotonically with an increasing frequency freqn(1); it switches from 0 to 1 when freqn(1) passes the threshold 0.5. The next simple object-inductive method is average induction Av-I (also called "OI2" in section 5.9), which projects the observed frequency directly to the next event (predn+1 = freqn(e)). The corresponding anti-inductive methods are object anti-induction OAI (predn+1 = 0 if freqn(1) ≥ 0.5 or n = 0; else predn+1 = 1) and average anti-induction Av-AI (predn+1 = 1 − freqn(e)). Moreover, every probabilistic method whose probability P(predn+1 = 1|freqn(1)) increases [decreases] monotonically with freqn(1) is simply inductive [anti-inductive]. Examples of conditionalized inductive methods are those looking for Markov dependencies. For example, the method IMark1 conditionalizes the observed event frequencies on the preceding event; thus when Xe = Xen+1, then Xr = Xen, and the probability of predn+1 = v given en = v′ increases monotonically with the conditional event frequency freqn(ex+1 = v|ex = v′).

The fact that conditionalized inductive methods may search for complex properties has an important consequence. In section 3.4 we saw that the simple anti-inductive rule OAI successfully predicts the alternating sequence 010101; in section 9.1.2 we will see that anti-inductive methods are often successful in predicting sequences with oscillating event patterns. The mentioned fact implies, however, that whenever a (simple) anti-inductive method predicts a particular event sequence successfully, there exists a refined inductive method that predicts the same sequence with equal success—for the reason that the anti-inductive method's success is based on the existence of a regular oscillation pattern that is detected and projected by the refined inductive method. We illustrate this with the sequence (0,1,0,1, …): while object-induction OI predicts this sequence with a success rate of zero, object anti-induction OAI predicts this sequence with a success rate of one. The reason for OAI's success is the regular oscillation pattern "01-forever." This pattern is detected by the conditionalized inductive method IMark1, which looks for Markov dependencies of the first order, discovers this regularity, and predicts according to the prediction rule "pred1 = 0, and predn+1 = 1/0 iff en = 0/1" (IMark1 simplifies to this rule for this sequence). Thus, for this particular sequence the predictions of the simple anti-inductive method OAI and the conditionalized inductive method IMark1 coincide. This does not mean that the two methods coincide for all sequences. In fact, this is impossible according to our discrimination between "inductive" and "anti-inductive" (or "noninductive") methods in (8.5) and (8.6). However, for a particular sequence the success of an anti-inductive method (or of any noninductive method whose predictions depend on past observations) can always be reproduced by a conditionalized inductive method that projects that pattern on which the anti-inductive method's success relies.

This fact is important for the transition from the a priori justification of meta-induction to the a posteriori justification of object induction (recall sections 5.4 and 8.1.5). It explains why in our past experience inductive methods have appeared to be by far more successful than anti-inductive (or noninductive) methods: because we understand "induction" as a family concept covering all sorts of refined induction strategies.

Can we find here a fundamental asymmetry between induction and anti-induction (regarded as two families of methods)? An asymmetry in the sense that while refined inductive methods can achieve the same success as anti-inductive methods wherever the latter are successful, refined anti-inductive methods cannot do the same for induction? It is easy to see that if such an asymmetry can be demonstrated at all, it is only if Goodman-type predicates are excluded, because—as we have seen in section 4.2—by employing positional predicates anti-inductive methods can easily be turned into apparently inductive ones, and vice versa. Thus, presumably, the project of demonstrating such an asymmetry between induction and anti-induction is too ambitious, and we leave it here as an open problem. In any case, the question is not in the center of this book because it concerns the a posteriori justification of object induction, while our study focuses on the a priori justification of meta-induction.

In conclusion, explication (8.5) is preliminary, and the claims of this book do not hang on it. Moreover, the explication in (8.5) is not helpful for our goal of finding a dominance relation of inductive over noninductive methods because it does not reveal whether a reference partition can be found such that the method is inductive when conditionalized on this partition. Furthermore, one can design various mixed methods that predict inductively in certain worlds and noninductively in others. Therefore, it is not possible to obtain a general dominance theorem for meta-induction in relation to all sorts of (dependent) strategies that are partially noninductive in the sense of predicting noninductively in some worlds. However, it is possible to obtain a dominance argument in regard to dependent methods whose predictions are noninductive in application to IID random sequences. In IID sequences all internal dependencies vanish in the long run, so successful conditionalized inductions are impossible. Given that there are no ties between the limiting success-frequencies, one can prove that if a dependent method chooses its weights noninductively in regard to IID sequences, it cannot be access-optimal: so it is dominated by access-optimal strategies by proposition 8.5. To prove this proposition, we introduce a further notion.

(Definition 8.2)  Simply noninductive dependent methods
A dependent prediction method S is simply noninductive in a class of prediction games G iff there exists a binary game G = ((e),Π = {P1, … ,Pm, S}) in G whose corresponding sequence of scores ((scorei(Pj): 1 ≤ j ≤ m): i ∈ ℕ) is IID and for all times after some finite time k, S attaches less weight to the prediction of a player with [ε-approximately] maximal limit success than to a player with a non-maximal limit success.
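Before stating proposition 8.6, the OAI-versus-IMark1 comparison discussed above can be replayed mechanically. The following minimal sketch is only an illustrative reading of the definitions (in particular, the fallback rule for IMark1's first rounds, before any transition has been observed, is an assumption); it runs OI, OAI, and IMark1 on the alternating sequence.

```python
def run(method, events):
    """Return the success rate of a prediction method over a binary sequence."""
    hits, past = 0, []
    for e in events:
        hits += int(method(past) == e)
        past.append(e)
    return hits / len(events)

def OI(past):
    # Object-induction: predict 1 iff n = 0 or the frequency of 1s is >= 0.5.
    return 1 if not past or sum(past) / len(past) >= 0.5 else 0

def OAI(past):
    # Object anti-induction: the mirror image of OI.
    return 1 - OI(past)

def IMark1(past):
    """Conditionalized induction on first-order Markov dependencies:
    predict the value that most often followed the current last event."""
    if not past:
        return 0
    last = past[-1]
    follow = [b for a, b in zip(past, past[1:]) if a == last]
    if not follow:
        return last        # learning phase: no successor of this value seen yet
    return 1 if sum(follow) / len(follow) >= 0.5 else 0

seq = [0, 1] * 50          # the alternating sequence 0101...
print(run(OI, seq), run(OAI, seq), run(IMark1, seq))  # 0.0 1.0 0.98
```

OI fails on every round, OAI succeeds on every round, and IMark1 matches OAI except for the two rounds of its initial learning phase—so its success rate converges to one, as the text describes.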

(Proposition 8.6)  Dominance of meta-induction over simply noninductive dependent methods
(1) If S is a simply noninductive dependent method in G, then S is not [ε-approximately] access-optimal in G (in the long run).
(2) If xMI is [ε-approximately] access-optimal in a class of prediction games G, then xMI is [ε-approximately] access-dominant in G with respect to the subclass of all simply noninductive methods in M(G).

Proof: Appendix 12.37.

Table 8.1 summarizes our major results on the access-optimality and access-dominance of meta-inductive strategies.

8.3.3  Bayesian Interpretation of Dominance

As for optimality, we can also explicate our results on dominance within a Bayesian framework. Recall the decision-theoretic framework from section 8.1.6: A is the set of possible actions; W is the set of possible circumstances ("worlds"); the algebra AL(W) is W's power set if W is countable,


Table 8.1
Major results on access-optimality and -dominance of meta-induction.

Meta-inductive strategy | Kind of optimality/dominance | In class of prediction games satisfying | Access-optimal with respect to subclass of methods M
ITB  | Strict    | ∃ best B ∈ Π¬MI    | No restriction
εITB | ε-Approx. | ∃ ε-best BP ⊆ Π¬MI | No restriction
ITBN | ε-Approx. | Universal          | Nondeceivers
AW   | Strict    | Universal          | No restriction

All four strategies are access-dominant with respect to three subclasses of methods Mi ⊂ M: (1) independent methods, (2) methods that are not access-optimal, and (3) dependent and simply noninductive methods.

Note: ITB, imitate the best; εITB, epsilon-cautious imitate the best; ITBN, imitate the best nondeceiver; AW, attractivity-weighted meta-induction.

and the Borel-Lebesgue algebra over W if W =def [s,t] is an uncountable real-valued interval; Prob is the set of all probability measures P:AL(W) → [0,1]; u:A × W → ℝ is a Lebesgue-integrable utility function, with u(a,w) designating a's utility in world w; and, finally, ExpP(a) is the expected utility of action a ∈ A (with respect to P) as defined in section 8.1.6. We first explain the relation between dominance and expected utility in this framework and then turn to prediction games.

What is now important is that the worlds in which the dominant method is strictly better than some other method must have a nonzero probability in order to produce an increased expectation value of the dominant method's success. For this purpose we need the notion of a nondogmatic probability function. Recall definition 4.4: while a nondogmatic probability function over a countable space assigns a positive probability to every nonempty subspace, this is different for continuous spaces; here even nondogmatic probabilities assign zero probabilities to single points or countable unions of them, and only nonempty intervals and their Boolean combinations have a nonzero probability.

The major insight of the next proposition is this: it is only for countable possibility spaces that there exists a simple bidirectional correspondence between being dominant and having a strictly greater expected success rate. For continuous possibility spaces there is no such simple connection. Only a strengthened version of dominance over nonempty intervals entails a strictly greater expected success. In the other direction, having a strictly greater expected success does not guarantee dominance, but involves it only "with probability 1."


(Proposition 8.7)  Bayesian interpretation of dominance
(1) An action a is dominant with respect to A in a countable class of worlds W iff for every nondogmatic probability function P over AL(W) and every action b ∈ A, the expected utility of a is strictly greater than that of b (ExpP(a) > ExpP(b)).
(2) Assume a continuous interval of "worlds" (real numbers) [s,t] ⊆ ℝ.
(i) If for every nondogmatic probability function P over AL([s,t]) and every action b ∈ A the expected utility of action a ∈ A is strictly greater than that of b (ExpP(a) > ExpP(b)), then for every nondogmatic probability function P, a is dominant in [s,t] with probability P = 1 (i.e., in a subset S ⊆ [s,t] with P(S) = 1).
(ii) If action a is optimal in A and for every action b ∈ A, u(a,r) > u(b,r) holds for every real number r in some nonempty subinterval I ⊆ [s,t], then ExpP(a) > ExpP(b) holds for every nondogmatic probability function P.

Proof: Appendix 12.38.

In order to apply proposition 8.7 to prediction games we have to take care of the two complications explained in section 8.1.6. Moreover, dominance has to be relativized to a subclass M of methods. Note that when speaking about uncountable but measurable sets of prediction games G we assume a measurable injective function f:G → [s,t] mapping games into real numbers in [s,t]. For example, if G contains all binary prediction games with a fixed set of methods Π, then we represent G by elements of the form G = (r ∈ [0,1],Π) such that each real number in [0,1] represents an infinite binary sequence (as explained in section 4.5). We define the algebra over G as the f-inverse image of the Borel algebra over [s,t], AL(G) = f−1(Bo([s,t])), and set P(X) =def P(f[X]) for all X ∈ AL(G). Recall that we understand "access-dominance" in the long run.

(Proposition 8.8)  Bayesian interpretation of dominance in prediction games
(1) A (meta-inductive) strategy xMI is access-dominant in a countable class of prediction games G with respect to a class of methods M ⊆ M(G) iff condition (*) holds:
  (*) For every method M ∈ M (≠ xMI) and nondogmatic probability function P over AL(G), the expectation value of liminfn→∞(sucn(xMI) − sucn(M)) is greater than zero.


(Proposition 8.8)  (continued)
(2) Assume an uncountable set of prediction games G represented by an interval of real numbers [s,t] with AL(G) defined as explained above. Then:
(i) If condition (*) holds for M ⊆ M(G), then for every nondogmatic probability function P over AL(G), xMI is access-dominant in G with respect to M with probability P = 1.
(ii) If xMI is access-optimal in G with respect to M ⊆ M(G), and for every method M in M (M ≠ xMI) there exists a nonempty subset G′ ⊆ G corresponding to a subinterval I ⊆ [s,t] (i.e., f[G′] = I) such that for all G ∈ G′, liminfn→∞(sucn(xMI) − sucn(M)) is greater than zero, then condition (*) holds for M ⊆ M(G).

Proof: A straightforward combination of the proofs of propositions 8.3 and 8.7.

9  Defense against Objections

9.1  Meta-Induction and the No Free Lunch Theorem

We have demonstrated that attractivity-weighted meta-induction (AW) is not only universally access-optimal in the long run but also dominant in the long run compared with large classes of prediction methods (including independent and non-access-optimal dependent methods). This result seems to contradict a famous theorem in computer science, the no free lunch (NFL) theorem. The apparent conflict is analyzed and dissolved in this section.

A number of variants of the NFL theorem have been formulated.1 The most general formulation is given by Wolpert (1996). Wolpert's NFL theorem expresses, roughly speaking, a deepening of Hume's inductive skepticism for theoretical computer scientists. It comes in a weak and a strong version. In its weak version (mentioned in Wolpert 1996, 1354) it says that the probabilistically expected success of any nonclairvoyant prediction method—or learning algorithm, as Wolpert calls it—is equal to the expected success of random guessing or any other prediction method, if one assumes (1) a state-uniform prior distribution, that is, a prior distribution that is uniform over all possible event sequences or states of the world, and (2) a weakly homogeneous loss function. The strong version of the NFL theorem even claims that for each pair of nonclairvoyant prediction methods, the number (or probability) of possible worlds in which the first method outperforms the second is precisely equal to the number (or probability) of worlds in which the second outperforms the first (Wolpert 1996, 1343). This version of the theorem rests on the rather strong condition of a homogeneous loss function.

1. See, for example, Giraud-Carrier and Provost (2005); Rao, Gordon, and Spears (1995); Schaffer (1994); Wolpert (1992, 1996); Wolpert and MacReady (1995).


Given ordinary intuitions, the content of the NFL theorem is rather surprising. For example, contemporary epistemologists have argued that there are by far more induction-friendly than induction-hostile possible event sequences (White 2015). The NFL theorems tell us that this intuition is wrong; plenty of illustrations of this fact will be given in this section. Before we turn to the general NFL theorem we will explain the special case of this theorem concerning the "deterministic" (nonprobabilistic) prediction of binary sequences. However a computable prediction function f is defined, there are as many sequences (of length ≥ n+1) that verify f's prediction as there are sequences that falsify it. This apparently "innocent" observation immediately implies the following fact:

(9.1)  Strong no free lunch theorem for deterministic prediction methods
For every binary deterministic prediction method f (i.e., predn+1 = f(e1, … ,en) ∈ {0,1}), there are C(n,k) sequences of length n for which method f has a success rate of k/n.

The proof of fact (9.1) is easy and can be explained informally. There are C(n,k) possible sequences of scores of f's predictions with k ones in them; because f is deterministic, each score sequence (s1,s2, …) can be realized by exactly one event sequence (e1,e2, …), which is recursively constructed as follows: en+1 = f(e1, … ,en) if sn+1 = 1, else en+1 = 1 − f(e1, … ,en). Fact (9.1) is the content of the strong NFL theorem for binary sequences with deterministic prediction functions. By attaching an equal probability to every possible sequence, the expected score of each learning algorithm must be 1/2, which is the content of the weak NFL theorem.

The weak NFL theorem is also a consequence of proposition 4.8, which tells us that a state-uniform prior distribution over the set of all possible event sequences leads to a probability function P that is induction hostile in the sense that any sequence is expected to behave like a random sequence—(i) P(1) = P(0) = 0.5, (ii) P(en+1|(e1, … ,en)) = P(en+1), and thus (by the law of large numbers) (iii) P(limn→∞ freqn(1) = 0.5) = 1. As entailed immediately by proposition 4.8, relative to a state-uniform prior all computable learning algorithms for binary sequences must have the same expected success. The NFL theorems are a generalization of this earlier philosophical result, which also covers probabilistic prediction methods, defined as probability functions P(predn+1|(e1, … ,en)). In this context, Wolpert's nonclairvoyance requirement asserts that P(predn+1|(e1, … ,en+k)) = P(predn+1|(e1, … ,en)); that is, the predictions' probabilities depend only on past, not future, events.
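Fact (9.1) can also be checked by brute force for small n. The following sketch (an illustration added here, not part of Wolpert's formalism) enumerates all 2^n binary sequences and confirms that a deterministic predictor—object-induction OI is used as the example—earns success rate k/n on exactly C(n,k) of them.

```python
from itertools import product
from math import comb

def OI(past):
    # Object-induction: predict 1 iff n = 0 or the frequency of 1s is >= 0.5.
    return 1 if not past or sum(past) / len(past) >= 0.5 else 0

n = 10
counts = {}
for seq in product((0, 1), repeat=n):          # all 2^n event sequences
    hits = sum(OI(seq[:i]) == seq[i] for i in range(n))
    counts[hits] = counts.get(hits, 0) + 1

# Exactly C(n, k) sequences yield success rate k/n, for every k.
assert all(counts.get(k, 0) == comb(n, k) for k in range(n + 1))
print(counts)
```

Replacing OI by any other deterministic predictor leaves the assertion intact, which is precisely the counting content of fact (9.1).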


9.1.1  The Long-Run Perspective

The NFL theorem applies not only to all independent (object-level) learning algorithms but also holds for all dependent (meta-level) strategies, given that they are applied to a fixed finite toolbox of independent prediction methods. Every finite combination of a fixed set of prediction algorithms is itself a prediction algorithm. So the question arises: If the NFL theorem is true, how can AW meta-induction, when applied to a fixed set of independent (nonclairvoyant) methods, be dominant? How can it sometimes be better in the long run than certain other methods, such as those mentioned in section 8.3? Is this a contradiction?

Our answer to this question in regard to the long-run perspective can be summarized as follows: No, the contradiction is only apparent. Indeed, there are many AW-accessible methods whose long-run success rate is strictly smaller than that of AW in some worlds (event sequences)2 and never greater than that of AW in any world. Let us call these methods Minf (for "inferior"). Nevertheless, the state-uniform expectation values of the success rates of AW and Minf are equal because the state-uniform distribution that Wolpert assumes assigns a probability of zero to all worlds in which AW dominates Minf; so these worlds do not affect the probabilistic expectation value. Put into a nutshell, for weakly homogeneous loss functions free lunches do exist but receive a probability of zero under a state-uniform prior.

Wolpert (1996, 1376ff.) explicates the NFL theorems in his "extended Bayesian framework," in which a prediction is modeled by a probabilistic hypothesis that is based on a training set and projected to a test item. We abstain from reproducing this framework; a detailed presentation is found in Wolpert (1996). Instead we explain the proof of the NFL theorems in application to prediction games. A prediction game can be considered as an iteration of the procedure of selecting the first n events as training set and the next event as test item. Thus, the test item of a given round is added to the training set of the next round. Because a state-uniform prior induces a uniform distribution over possible events conditional on all possible "pasts," the proof of the NFL theorems applies to every round of a prediction game and thus to the entire game.

In the adaptation of Wolpert's proof (1996, 1354ff., theorems 1 and 3) to prediction games it is assumed that every prediction method M is paired with each possible event sequence (e) exactly once, or at least equally often: P((e)) = P({((e),Π): M ∈ Π}). Moreover, if M is a dependent method, then it is

2. In the NFL context, each prediction method meets each possible event sequence exactly once. Therefore, we can identify possible worlds with event sequences.


always applied to a fixed set of independent methods so that for each past event sequence (e1, … ,en) the prediction of M is uniquely defined. (In other words, if M gets applied to a different set of candidate methods, it counts as a different method.)

The strong version of the no free lunch theorem assumes that the loss function is homogeneous. This notion is defined in two steps (Wolpert 1996, 1349).

(Definition 9.1)  Homogeneous loss function
(i) Loss function "loss" is homogeneous for loss value c iff the number of possible event values e ∈ Val for which a given prediction leads to a loss of c is the same for all possible predictions—or formally, iff the sum ∑e∈Val δ(c,loss(pred,e)) takes the same value Δ(c) for all pred ∈ Valpred, where δ(x,y) is the Kronecker delta function—that is, δ(x,y) =def 1 if x = y; else = 0.
(ii) A loss function is homogeneous iff it is homogeneous for all possible loss values c.
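Definition 9.1 can be tested mechanically. This small sketch (the event spaces and prediction grids are hypothetical choices for illustration) confirms that the zero-one loss is homogeneous, while the natural loss with graded predictions is not.

```python
def is_homogeneous(loss, preds, vals):
    # Definition 9.1: for every loss value c, the number of event values
    # producing loss c must be the same for all possible predictions.
    rows = {p: [loss(p, e) for e in vals] for p in preds}
    loss_values = {c for row in rows.values() for c in row}
    return all(len({row.count(c) for row in rows.values()}) == 1
               for c in loss_values)

zero_one = lambda pred, e: 0 if pred == e else 1
print(is_homogeneous(zero_one, [0, 1, 2], [0, 1, 2]))   # True: Δ(0)=1, Δ(1)=q-1

natural = lambda pred, e: abs(pred - e)
print(is_homogeneous(natural, [0, 0.5, 1], [0, 1]))     # False for graded predictions
```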

The requirement of a homogeneous loss function is satisfied for binary or discrete prediction games with the zero-one loss function, which has only two possible loss values: loss(pred,e) = 0 if pred = e, and loss(pred,e) = 1 if pred ≠ e. Assuming q possible events, Val = {v1, … ,vq}, then for every possible prediction pred ∈ Valpred ⊇ Val, the number of possible event values leading to a loss of zero is one, and the number of possible event values leading to a loss of 1 is q − 1.

(Proposition 9.1)  Strong no free lunch theorem
Given a state-uniform P distribution over the space of event sequences with q possible event values and a homogeneous loss function, the following holds for every (nonclairvoyant) prediction method M, every possible loss value c, and n ≥ 0.
(i) For every (e1, … ,en): P(loss(predn+1,en+1) = c | (e1, … ,en)) = Δ(c)/q. In words: the probability that M earns loss c in the prediction of the "next" event equals Δ(c)/q (see definition 9.1), conditional on every possible sequence of "past" events.
(ii) Given a zero-one loss function, P(sucn(M) = k/n) = C(n,k) • (Δ(0)/q)^k • (Δ(1)/q)^(n−k). In words: the probability that M's success rate after n rounds is k/n is given by the above binomial formula.

Proof: Appendix 12.39.


Proposition 9.1(ii) asserts that the state-uniform probability that a prediction method earns a certain success rate is the same for all methods. Because this holds for all possible success rates, it implies that the state-uniform probability—and in the finite case the number—of worlds in which a method M1 outperforms another method M2 is the same as that in which M2 outperforms M1. This is the usual formulation of the strong NFL theorem.

If properly real-valued predictions are allowed (while the events may still be binary), then a reasonable loss function will assign a loss different from zero or one to some predictions that are different from zero or one. Such a loss function is no longer homogeneous. So the strong NFL theorem does not apply to AW or any other real-valued prediction method. This is a strong restriction of this theorem. Real-valued predictions not only make sense in application to real-valued events but also to binary events, for instance, by predicting their probabilities. Even in binary games the effect of real-valued predictions can be simulated, as we know, by randomizing binary predictions or by assuming a collective of binary forecasters whose mean success approximates a real-valued prediction. Only a weak version of the NFL theorem holds for games with binary events and real-valued predictions, provided the loss function is weakly homogeneous:

(Definition 9.2)  Weakly homogeneous loss function
Loss function "loss" is weakly homogeneous iff for each possible prediction of the next event the sum of all losses over all possible events is the same—or, formally, iff for all n ≥ 0 and pred ∈ Valpred, ∑e∈Val loss(pred,e) = a constant c*.3

For binary games with real-valued predictions and the natural loss function, the condition of definition 9.2 is satisfied because for every prediction pred ∈ [0,1], loss(pred,1) + loss(pred,0) = (1 − pred) + pred = 1. Under this assumption the following weak NFL theorem holds for the probabilistic expectation value (ExpP) of the loss and success rate of a prediction method M, where "(e1−n)" abbreviates "(e1, … ,en)" and Val(C) = {loss(pred,e): pred ∈ Valpred, e ∈ Val} is the set of possible loss values.

3. Wolpert mentions weakly homogeneous loss functions and the weak no free lunch theorem in a small paragraph on page 1354 ("More generally, for an even broader set of loss functions …"). For our purpose this version of his theorem is more important.
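The weak-homogeneity check for the natural loss over binary events can likewise be verified mechanically; the grid of real-valued predictions below is an illustrative assumption.

```python
def weakly_homogeneous(loss, preds, vals):
    # Definition 9.2: the sum of losses over all possible events must be
    # one and the same constant c* for every possible prediction.
    sums = {round(sum(loss(p, e) for e in vals), 12) for p in preds}
    return len(sums) == 1

natural = lambda pred, e: abs(pred - e)
preds = [i / 10 for i in range(11)]                 # real-valued predictions in [0,1]
print(weakly_homogeneous(natural, preds, [0, 1]))   # True: c* = 1 for every prediction
```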


(Proposition 9.2)  Weak no free lunch theorem
Given a state-uniform P distribution over the space of event sequences with q possible event values and a weakly homogeneous loss function, the following holds for every (nonclairvoyant) prediction method M and n ≥ 0:
(i) ExpP(loss(predn+1,en+1)|(e1−n)) =def ∑c∈Val(C) c • P(loss(predn+1,en+1) = c | (e1−n)) = c*/q. In other words, the expectation value of M's loss in the prediction of the "next" event equals c*/q (see definition 9.2), conditional on every possible sequence of "past" events.
(ii) ExpP(sucn(M)) =def ∑1≤k≤n (k/n) • P(sucn(M) = k/n) = 1 − c*/q. In other words, the expectation value of M's success rate after an arbitrary number of rounds is 1 − c*/q.

Proof: Appendix 12.40. The "identities by definition" in (i) and (ii) recapitulate the definitions of the expectation value.

Proposition 9.2 implies that for real-valued predictions over binary events, every possible prediction (or learning) method has the same expected success, given a state-uniform prior distribution. However, as we will see in section 9.1.2, the distribution over the possible successes is significantly different for different methods.

We emphasize that the state-uniformity of the probability distribution is a necessary condition of the application of the NFL theorems to prediction games. Because prediction games are iterations of one-shot learning procedures, this requires that the distribution is uniform conditional on every possible past sequence, which is tantamount to demanding state-uniformity of P. There are generalizations of NFL theorems for one-shot learning procedures to certain symmetric but nonuniform P-distributions, but they are not valid for prediction games. For example, Rao, Gordon, and Spears (1995) consider generalization algorithms that have to learn a binary n-bit sequence by generalizing from a given training sequence of k bits to the test sequence given by the remaining n−k bits. The authors demonstrate for this learning scenario (Rao, Gordon, and Spears 1995, table 1, section 3.2) that the weak NFL theorem holds for every P distribution over n-bit sequences that is invariant under exchange of zeros and ones; that is, P(e1, … ,en) = P(e1*, … ,en*), where 0* = 1 and 1* = 0. However, this result does not generalize to prediction games. To see this, consider a distribution that satisfies the symmetry requirement of Rao, Gordon, and Spears (1995) but gives highly


regular sequences a much higher probability than irregular sequences. For example, consider three-element sequences with k = 1 and n−k = 2 (i.e., e1 is a training event) and the symmetric P distribution P(111) = P(000) = 0.2, P(011) = P(100) = 0.1, P(010) = P(101) = 0.1, and P(110) = P(001) = 0.1. Then the object-inductive prediction method OI (which predicts predn+1 = 1 if freqn(1) ≥ 0.5 or n = 0; else predn+1 = 0) has an expected absolute success of 0.2 • (2 + 2) + 0.1 • (1 + 0) + 0.1 • (0 + 1) + 0.1 • (1 + 1) = 1.2, whereas the expected absolute success of the methods "predict always 1" and "predict always 0" is 0.2 • (2 + 0) + 0.1 • (2 + 0) + 0.1 • (1 + 1) + 0.1 • (1 + 1) = 1.0. In conclusion, online learning in prediction games is a nontrivial improvement over one-shot learning procedures because its vulnerability to NFL theorems is confined to state-uniform distributions.

For prediction games with real-valued (graded) events, standard loss functions are not even weakly homogeneous. Here "free lunches" are possible. For example, assume a prediction game with the possible events 0.1, 0.3, 0.5, 0.7, and 0.9, and a natural loss function. Then the sum of the losses over all possible events is 1.2 for the prediction 0.5; 1.4 for the predictions 0.3 and 0.7; and 2.0 for the predictions 0.1 and 0.9. Here the averaging method that always predicts 0.5 has a significantly higher expected success than, for example, the method that always predicts 0.9. This does not mean, however, that graded worlds are generally more induction friendly than discrete worlds. Even for real-valued events one can prove the following restricted no free lunch result (which we add to the list of Wolpert's NFL results).

(Proposition 9.3)  Restricted no free lunch result for real-valued events
Assume Val ⊆ [0,1] is a set of possible real-valued events that are symmetrically distributed around the mean value 0.5 (i.e., Val = {0.5 − a1, 0.5 − a2, … ,0.5 − ak, 0.5, 0.5 + a1, 0.5 + a2, … ,0.5 + ak}, for 0 < a1 < a2 < … < ak ≤ 0.5). […]

[…] with a reliability > 0.5 and to laypeople with a reliability < 0.5 […]

1. Prisoner's dilemma games. […] u21 > u11 > u22 > u12 holds. The crux of this game is that regardless of what player 2 does, it is always better for player 1 to defect than to cooperate. So the average fitness of defection is always


higher than that of cooperation. Therefore, in the prisoner's dilemma game, cooperative behavior must die out so long as no additional mechanisms are introduced that punish defection. Without such mechanisms, the only evolutionarily stable equilibrium is 100 percent defectors.

2. Coordination games. Here, two possible actions must be "coordinated," in the sense that they are only successful if both partners play the same action. Thus, the utility matrix satisfies u11, u22 > u12, u21: the diagonal utility values are higher than the nondiagonal ones. In the resulting evolutionary dynamics, the only two possible stable equilibria are the two pure and coordinated equilibria in which either all individuals play action A1, or all of them play action A2. From the game-theoretic viewpoint, the learning game is neither a prisoner's dilemma nor a coordination game but rather falls under the third category.

3. Hawk-dove games. Here, A1 is again an action of cooperation, and A2 is an action of noncooperation. However, in this game the utilities satisfy the relations u21 > u11 > u12 > u22. In distinction from the prisoner's dilemma game, the utility of a bilateral defection (u22) now has the smallest value in the matrix. Thus, if the other player is noncooperative, it is better for me to cooperate; if the other player is cooperative, it is better for me not to cooperate. In a well-known instance of this game, there exists a good that both actors want to possess, and action A1 corresponds to "giving way" (dove) while action A2 corresponds to "being ready to fight" (hawk). The worst case is given when both players fight. The utility structure of this game leads to a stable mixed equilibrium with a certain equilibrium frequency p* of action A1 (cooperate). This equilibrium frequency is reached when the mean fitnesses of the two possible actions are exactly equal.

(10.2)  Equilibrium frequency in hawk-dove games
p* • u11 + (1 − p*) • u12 = p* • u21 + (1 − p*) • u22
Resulting solution condition for a nontrivial mixed equilibrium:
p* = (u22 − u12) / (u11 − u21 + u22 − u12)

The learning game has the structure of a hawk-dove game. To be sure, its moves—individual versus social learning—do not have much to do with yielding versus fighting but more to do with being willing to pay the costs


that are necessary for the welfare of all or refusing to pay these costs. However, the characteristic utility pattern is the same as in hawk-dove games: if my partner is a meta-inductivist (imitates me), it is better for me to learn individually (instead of imitating him, which brings me nothing), while if my partner is a (moderately successful) individual learner, it is better for me to imitate him. Here is a concrete example of a utility matrix in a learning game between meta-inductivists (MI) and object-inductivists (OI):

             Player 2:
              OI   MI
Player 1: OI   2    1
          MI   6    0

Object-induction brings me less when my partner imitates me than when he learns by himself, because in the first case I have to share the total payoff of my learning with him. Meta-induction brings me more than costly object-induction if I merely have to imitate my object-inductivistic partner, but it brings me nothing if my partner is also a meta-inductivist. If we insert these values into the equation for the equilibrium frequencies, we obtain:

(10.3)  Equilibrium frequency between individual learners and meta-inductivists
p* = p(OI) = (0 − 1) / (2 − 6 + 0 − 1) = 1/5 = 20%
(1 − p*) = p(MI) = 80%
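As a quick check on (10.2) and (10.3), the equilibrium frequency can be computed directly from the utility matrix; the function below simply implements the solution condition of (10.2).

```python
def mixed_equilibrium(u11, u12, u21, u22):
    # Solve p*·u11 + (1-p*)·u12 = p*·u21 + (1-p*)·u22 for p* (eq. 10.2).
    return (u22 - u12) / (u11 - u21 + u22 - u12)

# Learning game: action A1 = object-induction (OI), A2 = meta-induction (MI).
p_oi = mixed_equilibrium(u11=2, u12=1, u21=6, u22=0)
print(p_oi, 1 - p_oi)    # 0.2 0.8: 20% individual learners, 80% imitators
```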

Thus, under these utility assumptions, an evolutionarily stable epistemic society consists of 20 percent individual learners or "nonconformists" and of 80 percent meta-inductive learners or "conformists." These considerations may explain why empirically we find in more-or-less all societies, besides a majority of conformists, always a certain (usually small) fraction of nonconformists. They are needed for maintaining a high cognitive fitness in an epistemically specialized society.

11  Conclusion and Outlook: Optimality Justifications as a Philosophical Program

11.1  Optimality Justifications as a Means of Stopping the Justificational Regress

In chapter 3 we defended the framework of a critical foundation-oriented epistemology, whose development began in the Enlightenment era of philosophy, although some aspects date back to antiquity. Its ultimate goal was to lay down universal rational standards for knowledge and justification, which could replace the worldview of fundamentalistic religious authorities who had earlier ruled human ideas. The major challenge for the foundation-oriented program is the problem of finding a means to stop the justificational regress: the apparent necessity to base each justification of an epistemic principle upon premises that are themselves in need of justification. The epistemological strategy defended in this book was not to try to escape this challenge by weakening justificational standards, but rather to follow the traditional enlightenment program in its attempt to justify every belief that does not belong to the minimal class of directly evident basic beliefs—that is, introspective beliefs and analytic beliefs. We pointed out that if one follows this account, the major share of the justification load rests on content-expanding nondeductive arguments or inferences. Essential for this version of the foundation-oriented program is, therefore, the possession of higher order justifications of these content-expanding inferences that can avoid the pitfalls of circularity or infinite regress.

There are two major types of content-expanding inferences: inductions and abductions. We argued that the first and foremost type of content-expanding inference is inductive and that any attempt to justify abduction has to be based on a presupposed justification of induction. In conclusion, for the program of a foundation-oriented epistemology that neither ends up in dogmatism nor in relativism, the solution of Hume's problem of induction—the problem of finding a noncircular epistemic justification of induction—is essential.


In this book we developed a new type of higher order justification for inductive inferences that we called optimality justifications. Optimality justifications can stop the justificational regress because they do not attempt to "prove" that a cognitive method (here, induction) is reliable—something that, by Hume's arguments, cannot be done—but rather that it is optimal, i.e., that it is the best we can do to achieve our epistemic goal, which in the case of induction is predictive success. We showed that a noncircular optimality justification of induction cannot be achieved at the object-level of predicting events; but it can be achieved at the meta-level of strategies for the selection and combination of prediction methods on the basis of their observed success records. We called this the account of meta-induction.

As a major tool in our investigation of different meta-inductive strategies, we developed the notion of a prediction game and established a series of mathematical results about prediction games. We showed that a particular variant of meta-induction, attractivity-weighted meta-induction, is provably universally access-optimal in the sense that in every possible world it will achieve maximal predictive success among all prediction strategies that are accessible to the epistemic decision-maker. This result holds strictly for predictive success in the long run, but it applies also to the short run, modulo a small short-run regret whose worst-case bounds can be calculated. We proved a variety of similar results for interesting variants of meta-induction and generalized these results from prediction games to all kinds of action games. In chapter 10 we applied the account of meta-induction in a variety of scientific fields, from cognitive science and social epistemology to the theory of cultural evolution.

The given justification of meta-induction was mathematically analytic or a priori, as it did not assume anything about the nature of the considered possible worlds. Moreover, the analytic or a priori justification of meta-induction implied a contingent or a posteriori justification of object-induction in the real world, to the extent that so far object-induction has turned out to be the most successful prediction strategy. In conclusion, our insights about meta-induction give us at least a partial solution to Hume's problem of induction.

Our results about meta-induction lead us to a general insight for the foundation-oriented program of enlightenment epistemology. There exists a possibility to stop the justificational regress by means of epistemic optimality justifications that establish that certain epistemic strategies are universally access-optimal. In the remaining part of this chapter let us briefly examine how the method of optimality justifications can be generalized from (meta-)induction to other epistemic tasks.


11.2  Generalizing Optimality Justifications

In chapter 7 we saw that meta-induction can be generalized to every domain of epistemic methods, provided (1) there exists a clearly defined goal that makes it possible to define success measures, and (2) the environment gives us feedback about the extent to which this goal is reached, which may be temporally delayed but must be provided after a finite time. By way of this generalization, we can apply meta-induction not only to evaluate prediction methods but also to evaluate hypotheses, theories, or methodological designs in regard to their predictive success. Similarly we may apply meta-induction to practical domains such as medical diagnosis or other sorts of purpose-related tasks for which there exist different competing strategies.

The application of optimality justifications to other foundational problems in epistemology is more difficult. For an overview, we will discuss five of these applications, emphasizing that these considerations are only meant to inspire ideas whose precise elaboration is left to future work.

11.2.1  The Problem of the Basis: Introspective Beliefs

Not all philosophers agree that introspective beliefs are epistemically safe. For example, introspective beliefs about one's past experiences rely on memory, and memory is fallible. If introspective beliefs are formulated in a public language, then one may be in error about the semantics of that language. In section 3.2, we dealt with these objections by restricting the introspective realm to one's present experiences and one's own private language. We argued that if introspective beliefs are restricted in this way, they constitute optimal candidates for directly evident basic beliefs. We did not assume that introspective beliefs are infallible, only that we can rely on them in almost all cases; exceptions are only given when a brain or a mind acts in a schizophrenic way. In other words, the essence of this kind of optimality justification of introspective beliefs consists in the fact that a normal mind does not have any other possibility than simply to accept them as given. Thus, for introspective beliefs our optimality argument runs as follows: their best alternative justification degenerates into an only alternative justification.

11.2.2  The Choice of the Logic

In section 3.2, we argued that prima facie the justification of (classical) deductive inferences is unproblematic because one can prove semantically that these inferences are strictly truth preserving. This is true, but for the proof of


this semantic fact one again needs the principles of classical logic, now stated within the meta-language in which the semantic rules are expressed. For example, the semantic proof of the truth-preserving nature of the conjunction rule "p∧q/p" goes as follows: (1) True(p∧q) implies (2) True(p)∧True(q) (by the definition of ∧'s truth table), which implies (3) True(p) by the conjunction rule. Thus, to prove the conjunction rule in the object language we need the conjunction rule in the meta-language. Does this mean that we are again in the threatening situation of an epistemic circle or infinite regress? No. It merely means that semantic explications, although philosophically insightful, cannot stop the justificational regress. At some meta-language level we must stop the regress by assuming the principles of classical logic as given, or in other words, as basic in the explained sense. Technically this is done by assuming an axiomatic system—a system of axioms and rules from which (hopefully) all other logically valid theorems can be derived.

What justifies us in considering classical logic as basic? The traditional answer to this question points to the fact that there is a crucial difference between the problem of justifying induction and that of justifying deduction: although we can easily imagine possible worlds in which induction fails (whence induction needs justification), we can hardly imagine possible worlds in which logic fails, because we presuppose our logic already in the representation of these worlds. For this reason, deductive logic is basic and needs no justification.

Unfortunately, this justification is not fully convincing because it presupposes that possible worlds are represented by means of classical logic. However, there are alternatives to classical logic. Nonclassical logics do not make the classical assumptions but rather assume, for example, more than two truth values (such as "true," "false," and "undetermined"). How can one justify classical logic, or a system of logic at all, in view of this situation of "logical pluralism"? The situation may seem hopeless, but in fact it is not, because logical systems are translatable into each other. Different methods of translation between logical systems have been suggested in the literature (see Jeřábek 2012; Wansing and Shramko 2008); most of them are abstract insofar as they do not preserve the structure of the translated sentence. In what follows we suggest a translation method that preserves structure by means of introducing additional concepts into the language of classical logic, Lc. For example, a three-valued nonclassical logic may be translated into Lc by introducing three additional concepts into Lc: the propositional operators of "being true" (T), "being false" (F), and "being undetermined" (U).
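As an informal illustration of this translation idea (a toy encoding, not the axiom system AxLuk developed below), the following sketch defines the three Lukasiewicz truth values together with the two-valued operators T, U, and F, and checks the semantic axioms for negation that are stated in the next paragraph.

```python
# Three Lukasiewicz truth values; negation maps t->f, u->u, f->t.
VALUES = ["t", "u", "f"]
NEG = {"t": "f", "u": "u", "f": "t"}

# The added classical-language operators: T, U, F are two-valued (True/False).
T = lambda v: v == "t"
U = lambda v: v == "u"
F = lambda v: v == "f"

# Verify the semantic axioms T(¬S) <-> F(S), U(¬S) <-> U(S), F(¬S) <-> T(S)
# for every possible three-valued value of S.
for v in VALUES:
    assert T(NEG[v]) == F(v)
    assert U(NEG[v]) == U(v)
    assert F(NEG[v]) == T(v)
print("negation axioms hold for all three values")
```

Whatever value a sentence S takes in the three-valued semantics, the statements T(S), U(S), and F(S) are classically bivalent, which is what makes the translation into classical logic possible.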

If S is a sentence of the three-valued logic, then the sentences T(S), F(S), and U(S) are nevertheless two-valued. Based on this fact, every semantic axiom or rule of a three-valued nonclassical logic can be translated into a corresponding axiom or rule formulated in the expanded language of the classical two-valued logic. For example, Łukasiewicz's three-valued truth table for negation is represented by the three semantic axioms T(¬S) ↔ F(S), U(¬S) ↔ U(S), and F(¬S) ↔ T(S). By representing all truth tables of Łukasiewicz's three-valued logic via semantic axioms of this kind and adding the axiom T(S) ∨! U(S) ∨! F(S) (with ∨! for the exclusive disjunction), we obtain the axiom system AxLuk of Łukasiewicz's logic in the language of classical logic. Now each sentence S of the three-valued logic can be translated into the corresponding sentence T(S) of classical logic so that the property of validity is preserved; that is, a sentence is logically true in the three-valued logic exactly if its translation is logically true in the corresponding axiomatic system in classical logic: ⊨Luk S iff AxLuk ⊨class T(S).

The same translation strategy applies to all many-valued logics representable by means of finitely many truth values. For example, many paraconsistent logics can be characterized by means of finite truth value matrices, including truth values such as "both true and false" (Priest 1979, 2013, section 3.6). We conjecture that a similar translation strategy applies to all kinds of nonclassical logics (even those not characterizable by finite matrices), because all systems of nonclassical logics known to me use classical logic in their meta-language in which they describe the semantics of their nonclassical principles. Therefore there must exist ways to translate the principles of these logics into classical logic by introducing additional operators into the language of classical logic corresponding to the semantical concepts of the nonclassical logic. An elaboration of this idea is work for the future.

What this argument would show, if it is correct, is that every nonclassical logic can be represented within classical logic by using an appropriate extension of the language. This would give us an optimality justification: by using classical logic, our conceptual representation system can only gain but never lose, because if another logic turns out to have advantages for certain purposes, we can translate and thus embed it into classical logic. This optimality justification does not make any presuppositions except the existence of the two logical systems and the translation function that is defined in the classical meta-language (which is of the same conceptual type as the object-language).

In many cases there also exists an inverse translation function, from the classical into the nonclassical logic. For this reason the translation argument only shows that classical logic is optimal, not that it is dominant in the sense of being better than, say, three-valued logics. For example, it can be shown that classical logic is also translatable into three-valued logics. So defenders of a three-valued logic can argue that their system is optimal, too, because they can translate every bivalent system into their three-valued logic. But this fact does not undermine the force of the optimality justification; it merely draws the picture of a situation of logical pluralism in harmony, because the alternative logical systems are intertranslatable. Still, one may prefer classical logic as psychologically more natural because it fits better with the way our mind or brain is thinking, but this is a different matter that will not be pursued here.
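To make the translation strategy concrete, the following schematic sketch (our illustration; the Python encoding is one arbitrary choice among many) represents Łukasiewicz's three truth values, defines the classical operators T, F, and U as two-valued predicates, and checks by brute force that the two-valued axioms reproduce the three-valued truth tables for negation and conjunction.

from itertools import product

# Lukasiewicz's three truth values: 1 = true, 0 = false, 0.5 = undetermined.
VALUES = (0, 0.5, 1)

def neg(v):            # three-valued negation
    return 1 - v

def conj(v, w):        # three-valued conjunction (minimum)
    return min(v, w)

# The classical operators T, F, U: two-valued predicates on the
# three-valued truth values.
def T(v): return v == 1
def F(v): return v == 0
def U(v): return v == 0.5

# The semantic axioms for negation: T(-S) <-> F(S), U(-S) <-> U(S), F(-S) <-> T(S).
for v in VALUES:
    assert T(neg(v)) == F(v)
    assert U(neg(v)) == U(v)
    assert F(neg(v)) == T(v)

# Analogous two-valued axioms for conjunction, e.g. T(S & S') <-> T(S) and T(S').
for v, w in product(VALUES, repeat=2):
    assert T(conj(v, w)) == (T(v) and T(w))
    assert F(conj(v, w)) == (F(v) or F(w))

# Exhaustiveness and exclusiveness: T(S) v! U(S) v! F(S).
for v in VALUES:
    assert sum([T(v), U(v), F(v)]) == 1

print("All two-valued axioms reproduce the three-valued truth tables.")

Of course, such a finite check does not replace the general claim that ⊨Luk S iff AxLuk ⊨class T(S); it merely illustrates how the added two-valued operators simulate the three-valued semantics.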

11.2.3  The Choice of a Conceptual System

Prima facie, the choice of a language or conceptual system seems to be an ideal candidate for the application of meta-induction. We should choose that language—that system of primitive nonlogical concepts—by means of which our hypotheses and theories are empirically most successful. The implementation of this idea, however, is not straightforward, because to compare two different languages one needs a superordinate common language into which one can translate the two languages. In Schurz (2013, section 2.9) it is argued that for observational concepts this problem is solvable by ostensive learning procedures (recall section 4.2). By means of ostensive concept learning it is always possible to find a joint common language, even for members of cultures that do not share a common language or worldview. For theoretical concepts this is not the case; rather, the choice of an optimal theoretical language goes hand in hand with the choice of an optimal theory, which is our next point.

11.2.4  The Choice of a Theory

Meta-induction recommends itself for the evaluation of competing theories in regard to their empirical success. This success may be measured not only in terms of predictive success but also in regard to other kinds of cognitive success, such as unificatory or explanatory success. Moreover, we may also combine theories according to their attractivities by means of one of the methods explained in section 7.5.
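As a deliberately simplified illustration of such success-based combination (the weighting rule below is our own toy choice and not one of the official attractivity methods of section 7.5), consider:

import random

random.seed(0)

def weighted_meta_prediction(predictions, successes, exponent=2):
    """Combine the methods' current predictions (values in [0,1]),
    weighting each method by its past success rate."""
    weights = [s ** exponent for s in successes]
    total = sum(weights)
    if total == 0:                       # no method has any success yet
        return sum(predictions) / len(predictions)
    return sum(w * p for w, p in zip(weights, predictions)) / total

# Toy run: method 0 is well calibrated, method 1 predicts at random.
score = [0.0, 0.0]
meta_score = 0.0
for n in range(1, 1001):
    event = random.random() < 0.7        # binary event with frequency 0.7
    preds = [0.7, random.random()]       # method 0: calibrated; method 1: noise
    meta = weighted_meta_prediction(preds, [s / n for s in score])
    # score of a round: 1 minus the natural loss |prediction - event|
    for i, p in enumerate(preds):
        score[i] += 1 - abs(p - event)
    meta_score += 1 - abs(meta - event)

print([round(s / 1000, 3) for s in score], round(meta_score / 1000, 3))

In the toy run the meta level approaches the success rate of the best method, which is the kind of behavior the attractivity-based methods of section 7.5 guarantee under much weaker assumptions.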

The meta-inductive justification of theories considers theories as instruments of prediction or systematization and thus corresponds to an instrumentalist justification of theories, which van Fraassen (1980) calls their "empirical adequacy." For instrumentalists this is all that theory evaluation can do. By contrast, for realists in the philosophy of science the inference from the (meta-inductively inferred) empirical adequacy of a theory to its realistic truth or truthlikeness is important. As we explained in section 2.6, this is not an inductive but an (irreducibly) abductive inference, because theories contain theoretical concepts (or latent variables) that are not contained in the description of the data sequences on which the meta-inductive success evaluation is based. This brings us to the final and most difficult epistemological domain of generalizing optimality justifications.

11.2.5  The Justification of Abductive Inference

The observation of past data and success records of empirical predictions does not give us any direct feedback about the fit of a theory's theoretical structure with the unobservable (hidden) structure in our domain. Such direct feedback does not exist because theoretical parameters are unobservable. Therefore, the method of meta-induction cannot be applied to the abductive inference from empirical adequacy to theoretical truthlikeness. The same is true for the abductive inference from the (inductively inferred) regularities in our introspective experiences to the existence of an external reality with certain properties as the best explanation of these introspectively experienced regularities (recall section 3.2). In section 4.1 we argued that by postulating sufficiently many "latent" (or theoretical) variables (such as divine or occult powers, etc.) we could "explain" everything we want. Such explanations are speculative and entirely post factum. In Schurz (2008b, 2013, section 5.10.4) it is suggested that abductive inferences to explanatory hypotheses that involve latent (or theoretical) variables are only scientifically worthwhile if they entail "use-novel" predictions (in the sense of Worrall 2006) by means of which they can be independently tested. This is demonstrably the case for abductive inferences to common causes, which offer unified explanations of intercorrelated empirical regularities or dispositions.

Schurz (2008b, section 7.4) analyzes the inference to external reality as a common cause abduction. The hypothesis of external objects provides a common cause explanation of a huge set of intercorrelations between our introspective experiences. First, there are the intra-sensual correlations, in particular those within our system of visual perceptions: there are potentially an infinite number of two-dimensional visual images of the same perceptual object on the retina, but all these two-dimensional images are strictly correlated with the position and angle at which we look at three-dimensional objects. So these correlations have a common cause explanation in terms of external objects in a three-dimensional space. Second, there are the inter-sensual correlations between different sensual experiences, in particular between visual perceptions and tactile perceptions, which are similarly explained by the assumption of external objects in a three-dimensional space.¹

1. Note that even the justification of introspective memory beliefs involves abduction: it requires the combination of an inductive and an abductive inference step. Assume that I have just gotten up from bed this morning and am looking for my watch. I remember that I put my watch on the table in the kitchen before I went to bed last night. Of course, there is no way to travel into the past and find out by direct observation whether my memory belief is indeed true. But I can confirm the reliability of my memory belief by the following two inferential steps. (1) From the assumed reliability of my memory beliefs I infer by abduction that I really did put my watch on the table in the kitchen before I went to bed last night. (2) By induction and background knowledge (nobody has entered my house since last night) I infer that the watch is still lying there, so I will see it lying on the table when I go into the kitchen. When I then really go into the kitchen and see the watch lying on the table, the reliability of my memory beliefs is confirmed.

The question arises of how the cognitive optimality of abductive inferences can be justified in a noncircular way. We see two ways of doing this: an instrumentalist way and a realist way. Instrumentalistically we can argue that by performing abductive inferences we always take the advantage of explaining and representing our system of experiences by the best available theoretical model—that is, by the most simple and most unified theory. Although this justification is instrumental, it goes beyond mere consideration of predictive success to considering matters of unification and economy, which belong to the dimension of cognitive costs that was added to meta-inductive evaluations in section 7.6.

More precisely, the instrumental optimality justification in terms of cognitive success works as follows: should some part of our theoretical model be false, one of two cases may occur. Either we observe this in the form of an incorrect prediction—and as soon as this happens, we will take steps to correct our theory. Or we never observe it because our experiences are limited, so nothing happens—and we continue to operate with an instrumentally optimal theory even though it is false. It is, however, false in a way that cannot be empirically detected by us and thus will not practically harm us. So by performing abductive inferences to unifying theoretical models we can only gain but not lose something. Presumably even empiricists such as van Fraassen (1989, 142ff.), who reject standard accounts of inferences to the best explanation (IBE), would accept this instrumentalistic justification.

Can more than such an instrumentalist justification be given—a justification that directly infers the realistic truthlikeness of the theoretical part of an empirically successful theory? A naive argument of this sort is Putnam's no miracle argument (1975, 73). It says, roughly, that without the assumption of realism the empirical success of science would be a sheer miracle. More precisely, the best explanation of the empirical success of theories is the realist assumption that their theoretical "superstructure" is approximately true and hence its theoretical terms refer to real constituents of the world.

There exist broad controversies about this argument. A famous objection is Laudan's pessimistic meta-induction (Laudan 1997). It points to the fact that in the history of science one can recognize radical changes in the theoretical superstructure, although there was continuous progress at the level of empirical success. On simple meta-inductive grounds, we should expect that the theoretical superstructure of our presently accepted theories will be overthrown in the future as well, so they can in no way be expected to be approximately true.

Scientific realists have made different sorts of attempts to disprove Laudan's pessimistic meta-induction (see Psillos 1999; Votsis and Schurz 2011). Assuming certain assumptions of causality, Schurz (2009d) proved the following correspondence theorem: if an outdated theory T and a presently accepted theory T* share a certain empirical success but have an incompatible theoretical superstructure (or "ontology"), then there must exist certain relations of correspondence between the theoretical concepts of the two theories that are responsible for their empirical success. To obtain from this result a justification of the abductive inference to truthlikeness, Schurz (2009d) applied the correspondence theorem to the relation between an empirically successful theory T and an unknown ideally true theory T+. For this application the theorem entails that there must be a correspondence between certain parts of the theoretical model of T and that of T+, which means that the theoretical model of T is at least partially true.

In this account, the realistic interpretation of the theoretical core parameters of empirically adequate theories is based on their consideration as common causes, which presupposes certain causality principles. But how can causality principles be justified in turn? An answer to this question is developed in Schurz and Gebharter (2016), where we suggest justifying the principles of the theory of causal Bayes nets by a fundamental abduction: they are justified as the best explanations of two properties of statistical correlations—screening off and linking up. However, the explanations involved in this abductive inference are not themselves causal ones, as this would immediately lead us into a circle. Rather, the assumption of real cause-effect relations is justified because it offers the best explanatory unification of certain statistical phenomena. This is again an instrumentalistic optimality justification.
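The screening-off property just mentioned can be illustrated by a small simulation (our toy example; the numerical parameters are arbitrary): in a common-cause structure the two effects are unconditionally correlated, but the correlation vanishes conditional on the cause.

import random

random.seed(1)

# Common cause Z with two effects X and Y: X and Y are correlated,
# but Z screens them off (they are independent conditional on Z).
N = 200_000
data = []
for _ in range(N):
    z = random.random() < 0.5
    x = random.random() < (0.9 if z else 0.2)   # P(X|Z) vs. P(X|not-Z)
    y = random.random() < (0.8 if z else 0.1)
    data.append((z, x, y))

def cov_xy(rows):
    n = len(rows)
    mx = sum(x for _, x, _ in rows) / n
    my = sum(y for _, _, y in rows) / n
    return sum((x - mx) * (y - my) for _, x, y in rows) / n

print("unconditional cov(X,Y):", round(cov_xy(data), 4))                   # clearly > 0
print("cov(X,Y | Z):   ", round(cov_xy([r for r in data if r[0]]), 4))     # close to 0
print("cov(X,Y | not-Z):", round(cov_xy([r for r in data if not r[0]]), 4))# close to 0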

In conclusion, we regard it as an open question whether a noncircular optimality justification of the abductive inference to reality can be given that is stronger than an instrumentalistic justification in terms of predictive success and cognitive economy.

11.3  New Foundations for Foundation-Oriented Epistemology

This concludes our brief sketch of ways of generalizing optimality justifications to other domains of foundation-oriented epistemology. Generally speaking, optimality justifications constitute new foundations for foundation-oriented epistemology. Let us finally try to locate, in a preliminary way, the place of the account of optimality justifications in the landscape of epistemological positions in the history of enlightenment philosophy. For this purpose, the philosopher Immanuel Kant shall figure as our lighthouse.

Certainly the epistemic optimality account does not belong to pre-Kantian metaphysical accounts that were based on uncritically accepted premises, which later on turned out to be unjustified by the skeptical challenges of empirical scientists and philosophical empiricists, in particular by those of David Hume. What our account shares with the Kantian philosophy is the Copernican turn toward the inner "transcendental" dimension of knowledge—the question of its ultimate cognitive foundations, presuppositions, and justifications. In contrast to Kant, however, we neither assume nor argue that certain cognitive methods or principles are a priori, in the sense that we must apply them as necessary presuppositions of cognition. Kantian a priorism is not tenable, and modern philosophy has shown time and again that no transcendental argument can prove the a priori validity or necessity of a cognitive method or principle. Even at the most fundamental level, there are choices. There is more than one method, and more than one way to go. However, what one can still have in such a situation of foundational pluralism are optimality justifications by means of strategies that are universally access-optimal because of their built-in learning capacities. This is the central innovation of the proposed account of optimality justification. In conclusion, if Kant's philosophy is called transcendental a priorism, then the account proposed in this book can be called transcendental optimalism.

12  Appendix: Proof of Formal Results

12.1  Proof of Proposition 4.2

The equivalence (1) ⇔ (2) goes back to de Finetti ([1937] 1964; also see Carnap 1980). Regarding note 6 of chap. 4: Spielman (1976) showed that if P is σ-additive, (1) implies that the p's in (2) satisfy statistical independence.

Implication (2) ⇒ (3)(i) is proved as follows: Let {H1, … ,Hn} be a partition of hypotheses of the form Hi =def "p(Fx) = ri." Then equation (4.6) implies for the tautological event: P(Fa ∨ ¬Fa) = ∑_{1≤i≤n} 1 • P(Hi) = P(H1 ∨ … ∨ Hn) = 1. Because each Hi entails that the event Fx possesses a frequency limit, the subjective probability that the event Fx does not possess a frequency limit is zero. (The generalization to the continuous case is straightforward.)

Implication (2) ⇒ (3)(ii) is shown by applying equation (4.6) to the conditionalized probability function P(− |Hk). Then we obtain P(Ea|Hk) = ∑_{1≤i≤n} ri • P(Hi|Hk) = rk • 1 = rk—that is, the StPP.

Vice versa, (3) implies (2), because by Bayes's theorem, P(Fa) is given as P(Fa) = ∑_{1≤i≤n} P(Fa|Hi) • P(Hi) + P(Fa|X) • P(X), where X is the assertion that Fx does not possess a frequency limit; because the latter case has probability zero by (3)(i), we get P(Fa) = ∑_{1≤i≤n} P(Fa|Hi) • P(Hi) and hence by (3)(ii) P(Fa) = ∑_{1≤i≤n} ri • P(Hi), which is (2). Q.E.D.

12.2  Proof of Proposition 4.3

For (a): The proof is based on the Cauchy-Schwartz inequality and can be found in Humburg (1971, 233, theorem 5). Humburg uses in addition "Reichenbach's axiom," which (as Humburg shows) is itself derivable under the assumption of nondogmaticity.

For (b): We start from these two equations, which follow from the integral version of proposition 4.2:

(i) P(freq_n(F) = k/n) = (n over k) • ∫₀¹ x^k • (1 − x)^(n−k) • D(x) dx, and

(ii) P(Fa_{n+1} ∧ freq_n(F) = k/n) = (n over k) • ∫₀¹ x^(k+1) • (1 − x)^((n+1)−(k+1)) • D(x) dx.

From (i) and (ii) we obtain

(*) P(Fa_{n+1} | freq_n(F) = k/n) = [∫₀¹ x^(k+1) • (1 − x)^(n−k) • D(x) dx] / [∫₀¹ x^k • (1 − x)^(n−k) • D(x) dx].

Because D is continuous, D's slope over any point r is finite. Hence for every small ε > 0 there exists a positive and sufficiently small interval [r ± a] around r in which the density D varies by at most an amount of ε. Because D is everywhere positive in [r ± a], the integral ∫_{r−a}^{r+a} D(x) dx represents a nonvanishing fraction of the total probability. This means that all preconditions of the principle of stable estimation after Edwards, Lindman, and Savage (1963) are satisfied, which implies that if we let n and k grow with constant ratio k/n, then the integral ∫₀¹ x^k • (1 − x)^(n−k) • D(x) dx deviates from that which would be obtained through a uniform prior distribution D(x) = 1 by a vanishingly small factor ε*: ε* → 0 when ε → 0. By proposition 4.7a below, ∫₀¹ x^k • (1 − x)^(n−k) dx = 1/((n over k) • (n + 1)). Thus by (the proof of) proposition 4.7b, P(Fa_{n+1} | freq_n(F) = k/n) is ε-close to (k + 1)/(n + 2) and converges to k/n for n → ∞. Q.E.D.
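The asserted convergence can be checked numerically. In the following sketch (our illustration; the Beta(2,2) density is an arbitrary example of a continuous nondogmatic prior), the predictive probability is approximated by numerical integration and compared with (k + 1)/(n + 2) and k/n.

# Numerical check of proposition 4.3(b): for a continuous nondogmatic prior
# density D, the predictive probability approaches (k+1)/(n+2) and hence k/n.
def predictive(n, k, D, steps=20_000):
    num = den = 0.0
    for i in range(1, steps):
        x = i / steps
        w = x**k * (1 - x)**(n - k) * D(x)
        num += x * w          # the numerator's integrand has one extra factor x
        den += w
    return num / den

D = lambda x: 6 * x * (1 - x)      # a nonuniform (Beta(2,2)) prior density

for n in (10, 100, 1000):
    k = round(0.7 * n)             # keep the ratio k/n fixed at about 0.7
    p = predictive(n, k, D)
    print(n, round(p, 4), round((k + 1) / (n + 2), 4), round(k / n, 4))
# As n grows, all three values approach one another, as the proof asserts.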

12.3  Proof of Proposition 4.4

According to proposition 4.8,

D(H_r | freq_n(F) = k/n) = (1/c) • p_{H_r}(freq_n(F) = k/n) • D(H_r) = (1/c) • (n over k) • r^k • (1 − r)^(n−k) • D(H_r),

with c =def ∫₀¹ p_{H_x}(E) • D(H_x) dx.

Thus, writing [r•n] for the integer part of r•n,

D(H_{r+ε} | freq_n(F) = [r•n]/n) / D(H_r | freq_n(F) = [r•n]/n)
= [(n over [r•n]) • (r + ε)^[r•n] • (1 − r − ε)^(n−[r•n]) • D(H_{r+ε})] / [(n over [r•n]) • r^[r•n] • (1 − r)^(n−[r•n]) • D(H_r)]
= (*): a • [(r + ε)^[r•n] • (1 − r − ε)^(n−[r•n]) / (r^[r•n] • (1 − r)^(n−[r•n]))],

where a =def D(H_{r+ε})/D(H_r), which is by nondogmaticity of D(H_x) a finite positive value. The expression x^k • (1 − x)^(n−k) is a β-distribution having its peak at x = k/n, which gets infinitely steep for n → ∞. It follows from these properties that lim_{n→∞} ((r + ε)^[r•n] • (1 − r − ε)^(n−[r•n]) / (r^[r•n] • (1 − r)^(n−[r•n]))) = 0, so (*) converges to 0 for n → ∞. Q.E.D.
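The behavior of the quotient (*) can likewise be checked numerically (our illustration; the values of r, ε, and the prior ratio are arbitrary choices):

from math import log, exp

# Ratio of posterior densities D(H_{r+eps}|E)/D(H_r|E) for a sample with
# frequency r, computed via log-likelihoods to avoid numerical underflow.
def log_posterior_ratio(n, r, eps, prior_ratio=1.0):
    k = round(r * n)
    return (log(prior_ratio)
            + k * (log(r + eps) - log(r))
            + (n - k) * (log(1 - r - eps) - log(1 - r)))

r, eps = 0.7, 0.05
for n in (10, 100, 1000, 10_000):
    print(n, exp(log_posterior_ratio(n, r, eps)))
# The ratio tends to 0: the posterior concentrates on the true frequency r.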

12.4  Proof of Proposition 4.5

With P(H) =def h we have P(H|E) …

12.5  Proof of Proposition 4.6

P(∀xFx) > 0 implies lim_{n→∞} P(Fa_1 ∧ … ∧ Fa_n) > 0. With "Π" for the consecutive product, P(Fa_1 ∧ … ∧ Fa_n) = Π_{1≤i≤n} P(Fa_i | Fa_{i−1} ∧ … ∧ Fa_1). Thus lim_{n→∞} P(Fa_1 ∧ … ∧ Fa_n) can only be greater than zero if P(Fa_{n+1} | Fa_1 ∧ … ∧ Fa_n) converges to 1 for n → ∞. Q.E.D.
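The role of the antecedent can be made vivid by a mixture prior (our toy example, not part of the original proof): if the universal hypothesis receives prior weight a > 0, the predictive probability is computable in closed form and converges to 1.

# Mixture prior: weight a on the universal hypothesis "all individuals are F"
# (statistical probability 1), weight 1-a spread uniformly over [0,1].
# Then P(Fa_1 & ... & Fa_n) = a + (1-a)/(n+1), since the uniform part
# integrates x^n over [0,1].
def p_all_f(n, a):
    return a + (1 - a) / (n + 1)

def predictive(n, a):
    # P(Fa_{n+1} | Fa_1 & ... & Fa_n) = P(first n+1 are F) / P(first n are F)
    return p_all_f(n + 1, a) / p_all_f(n, a)

a = 0.1
for n in (1, 10, 100, 1000):
    print(n, round(predictive(n, a), 5))
# With a > 0 the predictive probability converges to 1. (With a = 0 it also
# converges to 1 here, by Laplace's rule, but then lim P(Fa_1 & ... & Fa_n) = 0;
# the proposition states a sufficient, not a necessary, condition.)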

12.6  Proof of Proposition 4.7

Let s_i(F: k/n) (for k ≤ n) denote the "ith" complete n-sample description asserting that k particular individuals out of n given individuals have property F. Assuming statistical independence, the statistical probability of each s_i(F: k/n) is given as r^k • (1 − r)^(n−k), where r = p(F) is the unknown statistical probability value of F. Integrating this value with a uniform prior distribution D(r) = 1 over r, one obtains (Billingsley 1995, 279):

(*) P(s_i(F: k/n)) = ∫₀¹ r^k • (1 − r)^(n−k) • 1 dr = 1/((n over k) • (n + 1)).

For (a): Because there are (n over k) possible complete sample descriptions s_i(F: k/n), whose disjunction is equivalent with freq_n(F) = k/n, (a) is a consequence of (*).

For (b): There exist (n over k) possible complete sample descriptions of the form s_i(F: (k+1)/(n+1)) which satisfy Fa_{n+1}; we can write them as "Fa_{n+1} ∧ s_i(F: k/n)." Therefore

P(Fa_{n+1} | freq_n(F) = k/n) = (n over k) • P(Fa_{n+1} ∧ s_i(F: k/n)) / ((n over k) • P(s_i(F: k/n))) = P(s_i(F: (k+1)/(n+1))) / P(s_i(F: k/n)) = (by (*) above) [(n over k) • (n + 1)] / [(n+1 over k+1) • (n + 2)] = [n! • (n + 1) • (k + 1)! • (n − k)!] / [(n + 1)! • (n + 2) • k! • (n − k)!] = (k + 1)/(n + 2). Q.E.D.
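Both computations can be verified mechanically with exact rational arithmetic (a quick check added here, not part of the original proof):

from fractions import Fraction
from math import comb, factorial

# (*): the integral of x^k (1-x)^(n-k) over [0,1] equals the Beta function
# value B(k+1, n-k+1) = k!(n-k)!/(n+1)!, which is 1/(comb(n,k)*(n+1)).
def sample_description_prob(n, k):
    return Fraction(factorial(k) * factorial(n - k), factorial(n + 1))

for n in range(1, 12):
    for k in range(n + 1):
        assert sample_description_prob(n, k) == Fraction(1, comb(n, k) * (n + 1))
        # (b): the ratio of the two sample-description probabilities
        # yields Laplace's rule (k+1)/(n+2).
        ratio = sample_description_prob(n + 1, k + 1) / sample_description_prob(n, k)
        assert ratio == Fraction(k + 1, n + 2)

print("Proposition 4.7(a) and (b) verified for all n <= 11.")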

12.7  Proof of Proposition 4.8

For (a): This holds simply because there are twice as many state descriptions verifying s_n(F) as verifying Fa_{n+1} ∧ s_n(F).

For (b): This is obvious from the explanation of binary representations of real numbers in the text preceding proposition 4.8.

For (c): (a) implies that P satisfies the principle of independence; that is, P satisfies the same laws as a statistical distribution function with p(Fx) = 1/2. So (c) follows from (a) by the law of large numbers (see section 4.1). Q.E.D.

12.8  Proof of Proposition 4.9

A general proof is found in Schurz (2015b, prop. 9-4, app. 10.3.16). Here we confine the proof of proposition 4.9 to the simplest situation. Assume there are only two epistemically possible hypotheses H1: p(F) ∈ [q1 ± a_n] and H2: p(F) ∈ [q2 ± a_n] with q1 ≠ q2, P(H1) = h1 and P(H2) = (1 − h1). By assumption, P(Hi | freq(F:s_n) = qi) = 95% (for i ∈ {1,2}). Application of Bayes's theorem yields

P(Hi | freq(F:s_n) = qi) = P(freq(F:s_n) = qi | Hi) • P(Hi) / ∑_{1≤j≤2} P(freq(F:s_n) = qi | Hj) • P(Hj).

We abbreviate the likelihoods as follows: P(freq(F:s_n) = qi | Hi) =def Li for i ∈ {1,2}, and P(freq(F:s_n) = qi | Hj) =def Li* for i ≠ j ∈ {1,2}. Obviously Li > Li*. By the (approximative) normality of the sample distribution, the two likelihoods L1* and L2* (as well as L1 and L2) are equal. So we can write Li = L and Li* = L*, and obtain: P(Hi | freq(F:s_n) = qi) = L • hi/(L • hi + L* • (1 − hi)), for i ∈ {1,2}, where h2 = 1 − h1. Because L > L*, the value of this fraction can only coincide for i = 1 and i = 2 if h1 = h2 holds—that is, if the probability distribution over the two hypotheses is uniform. Q.E.D.
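Numerically (our illustration; the likelihood values are arbitrary, subject to L > L*), the two posteriors coincide at the same confidence level only under the uniform prior h1 = 1/2:

def posterior(h1, L, Lstar):
    """Posterior of H1 given the datum favoring H1, for prior P(H1) = h1,
    likelihood L = P(datum|H1) and Lstar = P(datum|H2), with L > Lstar."""
    return L * h1 / (L * h1 + Lstar * (1 - h1))

L, Lstar = 0.4, 0.02
for h1 in (0.5, 0.7, 0.9):
    post1 = posterior(h1, L, Lstar)          # P(H1 | datum favoring H1)
    post2 = posterior(1 - h1, L, Lstar)      # P(H2 | datum favoring H2)
    print(h1, round(post1, 4), round(post2, 4))
# Only for h1 = 0.5 do the two posteriors coincide, as the proof asserts.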

12.9  Proof of Proposition 5.1 (+ Corollary)

We have four senses of optimality and dominance. M* is strictly long-run dominant: Then for every method M ∈ M(G) there exists a world ((e),Π) ∈ G↑{M} in which M* occurs and is optimal but M is not optimal. Thus liminf_{n→∞}(suc_n(M) − maxsuc_n) < 0 …

… > 4 • ε, the value of e_k/e_n lies (approximately) between 1 and 2. So 3 • ε = 6 • e_n ≥ 3 • e_k, whence (by the statistics of the normal distribution) the probability that |suc_k(P2) − p| ≤ 3 • ε (three standard errors) is greater than 99.8 percent. The probability that between times k and n P1's success rate exceeds that of P2 is given as the product of 95.5 percent and 99.8 percent, which is approximately 95 percent. Thus

(d) with p ≈ 95%, ITB's regret between times k and n is zero.

Until round k, ITB's success rate is given by memo (6.4)(ii) as the mixture

(e) suc_k(ITB) = f • suc_k(P2|ITB) + (1 − f) • suc_k(P1|ITB),

where f stands short for freq_k(fav(ITB) = P2). With a probability of p ≥ 99.8 percent, P2's success does not fall below p = limsuc(P2) by more than three standard errors, and the probability that suc_k(P1) falls below this value is much smaller. Thus by (e),

(f) with probability p ≥ 99.8%, suc_k(ITB) ≥ p − 3 • [p • (1 − p)/(π • n)]^0.5.

By (d) we have that

(g) with probability of p ≈ 95%, reg_n(ITB) ≤ π • reg_k(ITB).

In the worst case that was assumed in (c) we have suc_k(P1) = p + δ − ε. This and (g) gives us that with probability of p ≈ 95%, reg_n(ITB) ≤ π • (δ − ε + 3 • [p • (1 − p)/(n • π)]^0.5). Q.E.D.

12.16  Proof of Theorem 6.3

Let A be εITB's last favorite (recall the remark before theorem 6.3).

For theorem 6.3(2): Two cases are possible. Either (a) s = w. Then εITB's favorite at time w is εITB's last favorite, A, whence A must be an ε-best player. Or (b) assume the first switch after w occurs at time s > w. Then εITB's new favorite must be an ε-best player in BP and εITB stops switching favorites, which implies that this new favorite is εITB's last favorite, A. In both cases εITB's success converges against the success of A (the initial losses vanish in the limit), and because A's success lies by at most ε below the maximal success of the other non-MI players, εITB's success ε-approximates this maximal success.

For theorem 6.3(1): Recall that s = max(w, εITB's last switch time). Again there are two cases.

Case 1, s = w: The absolute success of εITB until time w is zero in the worst case, because εITB may be deceived by alternative players whose success rates oscillate around each other with amplitudes > ε (see figure 6.5). Thus ∀n ≥ w: suc_n(εITB) ≥ suc_n(A) − (w/n) • suc_w(A). Because ∀n ≥ w: maxsuc_n ≥ suc_n(A) ≥ maxsuc_n − ε, it follows that ∀n ≥ w: suc_n(εITB) ≥ maxsuc_n − ε − (w • maxsuc_w)/n.

Case 2, s > w: Between times w and s, εITB has only one favorite; we call him B. By reasoning as in case 1 we obtain that

(1) ∀n with w < n ≤ s: suc_n(εITB) ≥ suc_n(B) − ε − (w • maxsuc_w)/n.

Because B was leading between times w and s, we also have

(1*) ∀n with w < n < s: suc_n(εITB) ≥ maxsuc_n − ε − (w • maxsuc_w)/n.

At time s, A becomes εITB's new favorite, which means that suc_{s−1}(A) ≤ suc_{s−1}(B) + ε, or in terms of absolute successes, (i) abs_{s−1}(A) ≤ abs_{s−1}(B) + ε • (s − 1). At time s, A has earned a score of at most one (in the binary case 1) and B less than one (in the binary case 0). Together with (i) this implies abs_s(A) ≤ abs_s(B) + ε • (s − 1) + 1, and hence

(2) suc_s(B) ≥ suc_s(A) − ε • (s − 1)/s − 1/s.

It follows from (1 + 2) that

(3) suc_s(εITB) ≥ suc_s(A) − ε • (s − 1)/s − (w • maxsuc_w + 1)/s.

Beginning at time s + 1, εITB earns the same scores as A. Therefore (and by 3) we obtain

(4) ∀n ≥ s: suc_n(εITB) ≥ suc_n(A) − ε • (s − 1)/n − (w • maxsuc_w + 1)/n.

Because ∀n ≥ s, suc_n(A) ≥ maxsuc_n − ε, (4) gives us

(5) ∀n ≥ s: suc_n(εITB) ≥ maxsuc_n − ε • (n + s − 1)/n − (w • maxsuc_w + 1)/n.

This together with the result of case 1 gives us result 6.3(1ii). Because (n + s − 1)/n is close to 2 in the worst case, it follows from (5) that

(6) ∀n ≥ s: suc_n(εITB) ≥ maxsuc_n − 2 • ε − (w • maxsuc_w + 1)/n.

(6) + (1*) together with the result of case 1 give us result 6.3(1i). Q.E.D.
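The long-run behavior asserted by theorem 6.3 is easy to reproduce in simulation (our sketch; the environment, the players' hit rates, and ε are arbitrary choices, and the switching rule omits the additional 1/n margin used in the formal definition):

import random

random.seed(2)

eps, rounds = 0.05, 20_000
accs = [0.55, 0.65, 0.80]                 # the non-MI players' hit rates
score = [0] * 3
meta_score = 0
fav = 0                                    # index of epsilon-ITB's current favorite

for n in range(1, rounds + 1):
    event = random.random() < 0.6
    preds = [event if random.random() < a else (not event) for a in accs]
    meta_pred = preds[fav]                 # epsilon-ITB imitates its favorite
    for i, p in enumerate(preds):
        score[i] += (p == event)
    meta_score += (meta_pred == event)
    # switch favorites only if someone beats the favorite by more than eps
    rates = [s / n for s in score]
    best = max(range(3), key=lambda i: rates[i])
    if rates[best] > rates[fav] + eps:
        fav = best

print("player success rates:", [round(s / rounds, 3) for s in score])
print("epsilon-ITB success rate:", round(meta_score / rounds, 3))
# epsilon-ITB's success rate ends up within roughly eps of the maximal rate.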

12.17  Proof of Theorem 6.4

Let BP ⊆ {P1, … ,Pm} be the subset of non-MI players whose limit success is at most ε/2 below maxlimsuc. We prove that BP is a subset of ε-best players with winning time w = n_α, where α =def min({δ/2, ε/4}). Condition (i) of our definition of "BP" (in the paragraph before theorem 6.3) is satisfied because by our assumptions, after time w the players in BP deviate by at most ε/4 from their limit successes, which deviate by at most ε/2 from each other. Thus after time w the success rates of the players in BP deviate by at most ε/2 + ε/4 + ε/4 = ε from each other, which means that εITB does not switch its favorites. Condition (ii) of the definition of BP is satisfied because after time w the success rates of the non-MI players do not deviate from their limit success rates by more than δ/2, whence the success rates of the players in BP are always strictly greater than those of the non-MI players outside of BP. The claim of theorem 6.4 now follows from theorem 6.3. Q.E.D.

12.18  Proof of Theorem 6.5

For 6.5(1i): (a) Because xMI imitates for each time a systematic deceiver, xMI's score for all times is zero. (b) The average score of the non-MI players in each round is (m − 1)/m, because exactly one of them is xMI's favorite and predicts with score 0, while the other non-MI players predict correctly and earn score 1. So for all times n, suc_n(P) = (m − 1)/m. (c) Result (b) implies that at any time at least one non-MI player has a success rate ≥ (m − 1)/m.

For 6.5(1ii): (a) If xMI = ITB, then ITB switches its favorite every round in the cyclic ordering P1, P2, … , Pm, P1, … (etc.). Thus ITB imitates each deceiver with a limiting frequency of 1/m. So the limiting frequency with which each deceiver earns a score of 1 is (m − 1)/m, and this is the deceiver's limiting success rate. (b) If xMI = εITB, then each deceiver's success rate oscillates endlessly around his or her mean success rate of (m − 1)/m (see 6.5(1i)), with maxima and minima that converge to a limsup and a liminf, respectively, and that cannot deviate by more than ε from the deceiver's constant mean success rate (because at every switch time n, with O designating εITB's old and N εITB's new favorite, it holds that suc_n(O) < (m − 1)/m and suc_n(N) = suc_n(O) + ε + 1/n > (m − 1)/m).

For 6.5(2): Let k = max({k1, … ,km}). So for all n ≥ k and 1 ≤ i ≤ m, suc_n(Pi|xMI) ≤ suc_n(Pi) − δ. This implies by memo (6.4) that for all n ≥ k: suc_n(xMI) ≤ ∑_{1≤i≤m} freq_n(fav(xMI) = Pi) • (suc_n(Pi) − δ), and lim_{n→∞}(maxsuc_n − suc_n(xMI)) ≥ δ follows. Q.E.D.
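The systematic-deception scenario of 6.5(1) can be reproduced directly (our sketch): every non-MI player predicts correctly except when it is ITB's current favorite, in which case it predicts incorrectly.

m, rounds = 4, 10_000
score = [0] * m
itb_score = 0
fav = 0                                   # ITB's current favorite

for n in range(1, rounds + 1):
    event = True                          # the event itself is irrelevant here
    # systematic deceivers: predict falsely iff they are the current favorite
    preds = [(not event) if i == fav else event for i in range(m)]
    itb_score += (preds[fav] == event)    # ITB imitates its favorite: always wrong
    for i, p in enumerate(preds):
        score[i] += (p == event)
    rates = [s / n for s in score]
    fav = max(range(m), key=lambda i: rates[i])   # ITB switches to the best

print("deceivers' success rates:", [round(s / rounds, 3) for s in score])
print("ITB's success rate:", itb_score / rounds)   # stays at 0.0

With m = 4 the deceivers' rates settle at (m − 1)/m = 0.75 while ITB scores zero, and the favorite cycles through P1, … , Pm exactly as described in 6.5(1ii)(a).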

12.19  Proof of Theorem 6.6

Proof of theorem 6.6(1): According to memo (6.4)(ii) we have

(a) suc_n(ITBN) = ∑_{1≤i≤m} freq_n(fav(ITBN) = Pi) • suc_n(Pi|ITBN).

From time n(i) until time n, Pi's ITBN-conditional success rate is frozen, so

(b) suc_n(Pi|ITBN) = suc_{n(i)}(Pi|ITBN).

n(i)−1 is the last time at which Pi was ITBN's favorite and, hence, was not recorded as a deceiver. Assume Pi meets the start conditions at time n(i)−1. Then

(c) suc_{n(i)−1}(Pi|ITBN) ≥ suc_{n(i)−1}(Pi) − ε_d,

which implies suc_{n(i)}(Pi|ITBN) ≥ suc_{n(i)}(Pi) − ε_d − 1/n(i), which by (b) and n(i) ≤ n implies

(d) suc_n(Pi|ITBN) ≥ suc_{n(i)}(Pi) − ε_d − 1/n.

By substituting (d) in (a) and factoring out we obtain:

(e) suc_n(ITBN) ≥ [∑_{1≤i≤m} freq_n(fav(ITBN) = Pi) • suc_{n(i)}(Pi)] − ε_d − 1/n.

In (e) the lower bound of suc_n(ITBN) is overestimated because non-MI players that do not satisfy the start conditions at time n(i)−1 do not necessarily satisfy inequality (c). Because d is the number of those non-MI players, their frequency of being ITBN's favorite until time n is at most (d • k1)/n. By subtracting this term from the lower bound in (e) we get a correct lower bound; this gives us theorem 6.6(1). Q.E.D.

Proof of theorem 6.6(2): We prove that the number of switches of ITBN's favorite is finite. It follows that after some time s*, either (1) ITBN favors one nondeceiving non-MI player forever, or (2) all non-MI players are deceivers forever. In both cases, ITBN ε-approximates the maximal success in the class of nondeceiving players—in case 2 for trivial reasons, and in case 1 because ITBN's success converges against the success of ITBN's permanent favorite after time s*, which is in turn always ε-close to the maximal success.

For reductio ad absurdum, assume that there are players who switch from a nonfavorite to a favorite (of ITBN) infinitely many times. By a suitable reindexing, let P1, … ,Pk (with k ≥ 2) be all of these players. Because of k ≤ m and X_n > 1, proposition 6.5 entails

(1) ∑_{1≤i≤k} freq_{s_n}(Pi) • suc_{s_n}(Pi|ITBN) < suc_{s_n}(fav_{s_n+1}(ITBN)) − ε for every switch point s_n.

There must exist a time point w after which ITBN's favorite is always one of the P1, … ,Pk, and all Pi (1 ≤ i ≤ k) meet the start conditions (definition 6.3b+c). We consider now switch points s_n lying beyond w (s_n ≥ w). In inequality 1, the frequencies freq_{s_n}(Pi), for 1 ≤ i ≤ k, no longer add up to one, but only to γ_{s_n} =def 1 − (w/s_n) • δ_w, where δ_w is the relative frequency of times until w at which players different from P1, … ,Pk were favorites. Note that lim_{n→∞} γ_{s_n} = 1, because lim_{n→∞}(w/s_n) = 0. We define the renormalized frequencies freq*_{s_n}(Pi) =def freq_{s_n}(Pi)/γ_{s_n}, which add up to one (for 1 ≤ i ≤ k). Using this definition we obtain from equation 1 that for all s_n ≥ w:

(2) γ_{s_n} • ∑_{1≤i≤k} freq*_{s_n}(Pi) • suc_{s_n}(Pi|ITBN) < suc_{s_n}(fav_{s_n+1}(ITBN)) − ε.

Using the equation ∑ = γ • ∑ + (1 − γ) • ∑ ≤ γ • ∑ + (1 − γ) (because ∑ ≤ 1), where ∑ stands for the sum-term in equation 2, we obtain from equation 2 that for all s_n ≥ w:

(3) ∑_{1≤i≤k} freq*_{s_n}(Pi) • suc_{s_n}(Pi|ITBN) < suc_{s_n}(fav_{s_n+1}(ITBN)) − ε + (1 − γ_{s_n}).

Define δ =def ε − ε_d. We pass to the first switch point lying past w at which the term (1 − γ_{s_n}) (which converges to zero) has become smaller than δ/2. We call this the distinguished switch point σ1. Because ε = ε_d + δ, equation 3 implies:

(4) For all s_n ≥ σ1: ∑_{1≤i≤k} freq*_{s_n}(Pi) • suc_{s_n}(Pi|ITBN) < suc_{s_n}(fav_{s_n+1}(ITBN)) − ε_d − δ/2.

The sum-term in equation 4 is a weighted average of the success rates suc_{s_n}(Pi|ITBN) for 1 ≤ i ≤ k. Let P_min(σ1) be the first-best player among P1, … ,Pk which has a minimal ITBN-conditional success rate at σ1. By the laws concerning weighted averages, equation 4 implies:

(5) For all s_n ≥ σ1: suc_{s_n}(P_min(σ1)|ITBN) < suc_{s_n}(fav_{s_n+1}(ITBN)) − ε_d − δ/2.

We now construct what we call a min-max hypercycle (see figure 12.1): We pass from σ1 to the next distinguished switch point, call it σ2, at which the player P_min(σ1) becomes favorite once again; this switch point must exist because (by assumption) each player in P1, … ,Pk achieves favorite-status infinitely many times. The ITBN-conditional success rate of P_min(σ1) is frozen between σ1 and σ2, and at time σ2, P_min(σ1) is a nondeceiver (meeting the start conditions), so it must hold:

(6) suc_{σ1}(P_min(σ1)|ITBN) = suc_{σ2}(P_min(σ1)|ITBN) ≥ suc_{σ2}(P_min(σ1)) − ε_d.

However (by our construction), P_min(σ1) = fav_{σ2+1}(ITBN). So equation 5 implies for s_n = σ2:

(7) suc_{σ2}(P_min(σ2)|ITBN) < suc_{σ2}(P_min(σ1)) − ε_d − δ/2.

It follows from equations 6 and 7 that the player P_min(σ2), who has a minimal ITBN-conditional success rate at σ2, must be different from P_min(σ1), and his ITBN-conditional success rate at σ2, compared with P_min(σ1)'s at σ1, has decreased by at least δ/2:

(8) suc_{σ2}(P_min(σ2)|ITBN) ≤ suc_{σ1}(P_min(σ1)|ITBN) − δ/2.

We use the distinguished switch point σ2 and the player P_min(σ2) to construct the next min-max hypercycle; that is, we pass to the next switch point σ3 at which P_min(σ2) becomes favorite again, with the result that a different player's ITBN-conditional success rate drops by at least δ/2 below that of P_min(σ2), and so on (see figure 12.1). Because every min-max hypercycle enforces a decline of the minimum of the ITBN-conditional success rates by at least δ/2, and because success rates cannot drop below zero, there must come a distinguished switch point σ* after which no further hypercycle is possible (namely when this minimum is smaller than δ/2). This contradicts the assumption that the players in {P1, … ,Pk} switch from a nonfavorite to a favorite infinitely many times and concludes the proof of theorem 6.6(2). Q.E.D.

Figure 12.1  Hypercycles in the proof of theorem 6.6(2). P_min(σ1) becomes new favorite at time σ2. Therefore, P_min(σ2)'s ITBN-conditional success rate has dropped below that of P_min(σ1) by at least δ (in the proof only by a value ≥ δ/2, because of the initial loss 1 − γ_{s_n}).
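The idea behind ITBN can be sketched as follows (our simplified reconstruction; the start conditions of definition 6.3 and the exact bookkeeping of deception records are not modeled in detail): the meta level tracks each player's favorite-conditional success rate and refuses to return to players whose conditional success has fallen more than ε_d below their unconditional success.

import random

random.seed(3)

m, rounds, eps_d = 4, 20_000, 0.05
# player 0 is honest with hit rate 0.8; players 1..3 are clairvoyant deceivers:
# they predict correctly unless they are the current favorite.
score = [0] * m                 # unconditional successes
fav_score = [0] * m             # successes earned while being the favorite
fav_rounds = [0] * m            # rounds spent as favorite
itbn_score = 0
fav = 0

for n in range(1, rounds + 1):
    event = random.random() < 0.6
    preds = [event if random.random() < 0.8 else (not event)]      # honest
    preds += [(not event) if i == fav else event for i in range(1, m)]
    correct = (preds[fav] == event)
    itbn_score += correct
    fav_score[fav] += correct
    fav_rounds[fav] += 1
    for i, p in enumerate(preds):
        score[i] += (p == event)
    def flagged(i):             # recorded as a deceiver
        return fav_rounds[i] > 0 and \
               fav_score[i] / fav_rounds[i] < score[i] / n - eps_d
    candidates = [i for i in range(m) if not flagged(i)] or [fav]
    fav = max(candidates, key=lambda i: score[i] / n)

print("unconditional success rates:", [round(s / rounds, 3) for s in score])
print("ITBN success rate:", round(itbn_score / rounds, 3))
# ITBN is exploited at most briefly by each deceiver, then settles on the
# honest player and approaches its success rate.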


12.20  Proof of Proposition 6.5

At each switch point s_i (1 ≤ i ≤ n), the new favorite N has achieved a surplus of absolute success points that is (slightly) more than ε • s_i, compared with the success of the old (actual) favorite O at time s_i. Adding this surplus success to that which O has relative to εITB gives the surplus success of the new favorite N relative to εITB. Thus, by induction over n, we obtain that the success of εITB's new favorite at s_n is (slightly) more than the sum of these absolute surplus successes between times s_1 and s_n, which is ε • s_1 + ε • s_2 + … + ε • s_n = s_n • X_n • ε. Thus,

(i) suc_{s_n}(fav_{s_n+1}(εITB)) > suc_{s_n}(εITB) + X_n • ε.

Now by memo (6.4)(ii), εITB's success rate at time s_n is given as

(ii) suc_{s_n}(εITB) = ∑_{1≤i≤m} freq_{s_n}(Pi) • suc_{s_n}(Pi|εITB), where freq_{s_n}(Pi) • suc_{s_n}(Pi|εITB) =def 0 if freq_{s_n}(Pi) = 0.

Equations (i) and (ii) entail proposition 6.5. Q.E.D.

12.21  Proof of Theorem 6.7

There exists a time point n* after which every non-MI player has either been k1 times ITCB's favorite, or will never be a favorite again. After time point n*, ITCB evaluates all non-MI players that become favorites after this time according to their favorite-conditional success rates. As explained in the paragraph before theorem 6.7, every switch of ITCB's favorite after time n* requires a decrease of the best favorite-conditional success rate by an amount of at least ε. So there can be at most 1/ε favorite-switches after time n*. Thus after some time n** ≥ n*, ITCB will stop switching favorites and, thus, will ε-approximate the maximal favorite-conditional success rate in the long run. Q.E.D.

12.22  Convexity of Linear, Polynomial, and Exponential Loss Functions

We abbreviate "loss" with L, "e_n" with e, and "pred_n" with p, and show that for all e ∈ [0,1], weights γ ∈ [0,1] and predictions a