# Logical Introduction to Probability and Induction 0190845384, 9780190845384

A Logical Introduction to Probability and Induction is a textbook on the mathematics of the probability calculus and its

English Pages 304 [305] Year 2018

Cover
Half title
A Logical Introduction To Probability And Induction
Contents
Preface
Acknowledgments
1. Logic
1.1. Propositional Logic
1.2. Predicate Logic
1.3. Exercises
2. Set Theory
2.1. Elementary Postulates
2.2. Exercises
3. Induction
3.1. Confirmation and induction
3.2. The problem of induction
3.3. Hume’s argument
4. Deductive Approaches to Confirmation
4.1. Analysis and explication
4.2. The ravens paradox
4.3. The prediction criterion
4.4. The logic of confirmation
4.5. The satisfaction criterion
4.6. Falsificationism
4.7. Hypothetico-deductive confirmation
4.8. Exercises
5. Probability
5.1. The probability calculus
5.2. Examples
5.3. Conditional probability
5.4. Elementary consequences
5.5. Probabilities on languages
5.6. Exercises
6. The Classical Interpretation of Probability
6.1. The principle of indifference
6.2. Bertrand’s paradox
6.3. The paradox of water and wine
7. The Logical Interpretation of Probability
7.1. State descriptions and structure descriptions
7.2. Absolute confirmation and incremental confirmation
7.3. Carnap on Hempel
7.4. The justification of logic
7.5. The new riddle of induction
7.6. Exercises
8. The Subjective Interpretation of Probability
8.1. Degrees of Belief
8.2. The Dutch Book Argument
8.3. The Gradational Accuracy Argument
8.4. Bayesian Confirmation Theory
8.5. Updating
8.6. Bayesian Decision Theory
8.7. Exercises
9. The Chance Interpretation of Probability
9.1. Chances
9.2. Probability in physics
9.3. The principal principle
10. The (Limiting) Relative Frequency Interpretation of Probability
10.1. The justification of induction
10.2. The straight(-forward) rule
10.3. Random variables
10.4. Independent and identically distributed random variables
10.5. The strong law of large numbers
10.6. Degrees of belief, chances, and relative frequencies
10.7. Descriptive statistics
10.8. The central limit theorem
10.9. Inferential statistics
10.10. Exercises
11. Alternative Approaches to Induction
11.1. Formal learning theory
11.2. Putnam’s argument
References
Index

##### Citation preview

A LOGICAL INTRODUCTION TO PROBABILITY AND INDUCTION

FRANZ HUBER

University of Toronto

Oxford University Press is a department of the University of Oxford. It furthers the University’s objective of excellence in research, scholarship, and education by publishing worldwide. Oxford is a registered trade mark of Oxford University Press in the UK and in certain other countries.

Published in the United States of America by Oxford University Press, 198 Madison Avenue, New York, NY 10016, United States of America.

© Oxford University Press 2019

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, without the prior permission in writing of Oxford University Press, or as expressly permitted by law, by license or under terms agreed with the appropriate reproduction rights organization. Inquiries concerning reproduction outside the scope of the above should be sent to the Rights Department, Oxford University Press, at the address above. You must not circulate this work in any other form and you must impose this same condition on any acquirer.

CIP data is on file at the Library of Congress

ISBN 978–0–19–084538–4 (Pbk.)
ISBN 978–0–19–084539–1 (Hbk.)

Paperback printed by WebCom, Inc., Canada
Hardback printed by Bridgeport National Bindery, Inc., United States of America

CONTENTS

Preface
Acknowledgments
1. Logic
1.1. Propositional Logic
1.2. Predicate Logic
1.3. Exercises
Readings
2. Set Theory
2.1. Elementary Postulates
2.2. Exercises
Readings
3. Induction
3.1. Confirmation and induction
3.2. The problem of induction
3.3. Hume’s argument
Readings
4. Deductive Approaches to Confirmation
4.1. Analysis and explication
4.2. The ravens paradox
4.3. The prediction criterion
4.4. The logic of confirmation
4.5. The satisfaction criterion
4.6. Falsificationism
4.7. Hypothetico-deductive confirmation
4.8. Exercises
Readings
5. Probability
5.1. The probability calculus
5.2. Examples
5.3. Conditional probability
5.4. Elementary consequences
5.5. Probabilities on languages
5.6. Exercises
Readings
6. The Classical Interpretation of Probability
6.1. The principle of indifference
6.2. Bertrand’s paradox
6.3. The paradox of water and wine
Reading
7. The Logical Interpretation of Probability
7.1. State descriptions and structure descriptions
7.2. Absolute confirmation and incremental confirmation
7.3. Carnap on Hempel
7.4. The justification of logic
7.5. The new riddle of induction
7.6. Exercises
Readings
8. The Subjective Interpretation of Probability
8.1. Degrees of Belief
8.2. The Dutch Book Argument
8.3. The Gradational Accuracy Argument
8.4. Bayesian Confirmation Theory
8.5. Updating
8.6. Bayesian Decision Theory
8.7. Exercises
Readings
9. The Chance Interpretation of Probability
9.1. Chances
9.2. Probability in physics
9.3. The principal principle
Readings
10. The (Limiting) Relative Frequency Interpretation of Probability
10.1. The justification of induction
10.2. The straight(-forward) rule
10.3. Random variables
10.4. Independent and identically distributed random variables
10.5. The strong law of large numbers
10.6. Degrees of belief, chances, and relative frequencies
10.7. Descriptive statistics
10.8. The central limit theorem
10.9. Inferential statistics
10.10. Exercises
Readings
11. Alternative Approaches to Induction
11.1. Formal learning theory
11.2. Putnam’s argument
Readings
References
Index

PREFACE

A Logical Introduction to Probability and Induction is an introduction to the mathematics of the probability calculus and its applications in philosophy. On the mathematical side, we will study those parts of propositional and predicate logic as well as elementary set theory that we need to formulate the probability calculus. On the philosophical side, we will mainly be concerned with the so-called problem of induction and its reception in the philosophy of science, where it is often discussed under the heading of ‘confirmation theory.’ In addition, we will consider various interpretations of probability. These are philosophical accounts of the nature of probability that interpret the mathematical structure that is the probability calculus.

The book is divided into five sections. The first section, Chapters 1–2, provides us with the relevant background in logic and set theory. It will occupy the first two weeks. The second section, Chapters 3–5, covers Hume’s argument for the thesis that we cannot justify induction; Hempel’s work on the logic of confirmation and the ravens paradox; Popper’s falsificationism and hypothetico-deductive confirmation; as
well as Kolmogorov’s axiomatization of the probability calculus. It will occupy three to four weeks. The third section, Chapters 6–8, covers the classical, logical, and subjective interpretation of probability. Topics include Carnap’s inductive logic and the distinction between absolute and incremental confirmation; Goodman’s philosophy of induction and the new riddle of induction; Haack’s dilemma for deduction; the Dutch Book and gradational accuracy arguments for the thesis that subjective degrees of belief ought to obey the probability calculus; Bayesian confirmation theory; update rules for subjective probabilities or probabilistic degrees of belief; as well as Bayesian decision theory. It will occupy four to five weeks. The fourth section, Chapters 9–10, is devoted to the chance and (limiting) relative frequency interpretation of probability. Topics include probability in physics; Lewis’ principal principle relating subjective probabilities and chances; Reichenbach’s “straight-(forward) rule;” the strong law of large numbers relating chances and (limiting) relative frequencies; descriptive statistics and the distinction between singular and generic variables; the central limit theorem relating sample means and expected values; as well as estimation with confidence intervals and the testing of statistical hypotheses. It will occupy four weeks and contains a section on the interplay between the three major interpretations of probability: subjective probabilities, chances, and relative frequencies. Along the way, we will come across probability puzzles such as Bertrand’s paradox and the paradox of water and wine, as well as paradoxes from logic and set theory such as the liar paradox and Russell’s paradox. Sections 10.3.-10.9 are centered around the strong law of large numbers and the central limit theorem. They are mathematically advanced and can be considered optional.

The last week of a course is usually best spent reviewing material from previous weeks, and, perhaps, putting things in perspective by mentioning alternative approaches. The final section, Chapter 11, contains a suggestion. The primary aim of this book is to equip students with the ability to successfully carry out arguments, which is arguably (sic!) the most important philosophical skill. Fifty exercises that may be solved in groups rather than individually will help attain this end. Another skill that is important in philosophy is the ability to draw conceptual distinctions. Students are best asked to explain some of the distinctions introduced in this textbook jointly in the classroom, and individually, in the form of exam questions similar to those listed in the instructor’s manual. The latter also contains the solutions to the fifty exercises.

ACKNOWLEDGMENTS

I am very grateful to Claus Beisbart, Joseph Berkovitz, Michael Miller, Jonathan Weisberg, and, especially, Alan Hájek, Rory Harder, and Christopher Hitchcock for their helpful feedback on an earlier draft of this book. Rory Harder has also created the figures.

CHAPTER 1

Logic

1.1 PROPOSITIONAL LOGIC

The sentence “Angela Merkel is chancellor of Germany in August 2017” means, or expresses, the proposition that Angela Merkel is chancellor of Germany in August 2017. Sentences come in at least two forms: as abstract types and as concrete tokens. Consider:

Toronto is a city. Toronto is a city. It is not the case that Toronto is not a city.

Is there one sentence in the above line, or are there two sentences, or even three? The correct answer depends on whether we understand sentences as tokens or as types. There are two sentence types and three sentence tokens in the line. Both are different from the one proposition that these three sentence tokens express, or mean, viz. that Toronto is a city.

Logicians are lazy. They use propositional variables, or sentence letters, as place-holders (“variables”) for sentence tokens. For instance:

p   Angela Merkel is chancellor of Germany in August 2017.
q   Toronto is a city.
r   The Great Wall of China stretches from east to west.
⋮
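The type/token count in the Toronto example can be mimicked with a few lines of Python. This is only an illustrative sketch: it assumes that list entries stand in for sentence tokens and that distinct strings stand in for sentence types, and the variable names are invented here. The single proposition that all three tokens express is not something the code models.

```python
# The three sentence tokens from the example line, represented as strings.
tokens = [
    "Toronto is a city.",
    "Toronto is a city.",
    "It is not the case that Toronto is not a city.",
]

# Assumption: two tokens belong to the same type exactly when
# they are the same string, so a set collects the types.
types = set(tokens)

print(len(tokens))  # 3 sentence tokens
print(len(types))   # 2 sentence types
```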

Sentence tokens can be combined, or connected, to form more complex sentence tokens by connectives:

Negation: ¬
it is not the case that . . .
(¬q) It is not the case that Toronto is a city.

Conjunction: ∧ (&, .)
. . . and . . .
(q ∧ r) The Great Wall of China stretches from east to west, and Toronto is a city.

Disjunction: ∨ (from the Latin word ‘vel’ that has the same meaning as the combination of English words ‘and/or’)
. . . and/or . . .; . . . or . . . (or both)
(q ∨ r) The Great Wall of China stretches from east to west, or Toronto is a city (or both).

Material Conditional: → (⊃)
if . . ., then . . .; it is not the case that . . ., or . . . (or both)
(q → r) If Toronto is a city, then the Great Wall of China stretches from east to west.

Material Biconditional: ↔
. . . if and only if . . . (iff); . . . just in case . . .
(q ↔ r) The Great Wall of China stretches from east to west if and only if Toronto is a city.

Officially we define a formal language L for propositional logic in the following recursive way:

1. Every propositional variable, or sentence letter, ‘p,’ ‘q,’ ‘r,’ . . . is a sentence, or well-formed formula (wff), of L.
2. If α and β are sentences of L, then so are (¬α), (¬β), (α ∧ β), (α ∨ β), (α → β), and (α ↔ β).
3. Nothing else is a sentence of L.

Speakers use languages to talk about things. Those things can be the chancellor of Germany in August 2017, the weather, or a language. When a language can be used to talk or write about
another language, the former has the latter as an object language. In 1–3 above, the formal language L is our object language. The language in which we talk or write about an object language is a metalanguage for this object language. In 1–3 above, ordinary English is our metalanguage for the object language L.

Here is another example of this distinction. As a matter of fact, my English has a somewhat funny (German) accent, and after class, students sometimes talk about it. In this case, the language the students talk about, the object language, is my English with the funny accent. In contrast to this, the language in which the students talk, the metalanguage, is their perfect English. Of course, as I am writing this, I use my English with the funny accent, and one of the things I am writing about is the students’ perfect English. Therefore, two languages can both be metalanguages for each other.

Another distinction we need is that between use and mention. The following two sentences are both true.

Toronto is a city.
‘Toronto’ consists of seven letters.

In the first sentence, we are using the word ‘Toronto.’ In the second sentence, we are mentioning the word ‘Toronto.’ The convention is to use left and right single quotes (‘ and ’, respectively) to indicate that one is mentioning rather than using a symbol. Take a moment to reflect on what you do when you introduce yourself to someone, as I do at the beginning of the first class of a course. Determine which of the following introductions are philosophically correct:

I am Franz.
My name is Franz.
I am ‘Franz.’
My name is ‘Franz.’

Now consider again the recursive definition above and, in particular, the Greek letters ‘α’ and ‘β.’ These Greek letters
are symbols of the metalanguage that we use to talk about the sentences of the object language L such as ‘(p ∧ q)’ and ‘¬r.’ These Greek letters are not part of the object language. In contrast to this, the symbols ‘¬,’ ‘∧,’ ‘∨,’ ‘→,’ and ‘↔’ are symbols of the object language L, and so are the left and right parentheses, ‘(’ and ‘),’ respectively. This means that the string of symbols ‘(α ∧ β)’ contains both symbols that are, and symbols that are not, part of the object language L. Strictly speaking, we would have to write the following:

‘(’α‘∧’β‘)’ is a sentence of L.

However, as mentioned, logicians are lazy. They have introduced the symbols ‘⌜’ and ‘⌝’ to put single quotes around every symbol between them whenever single quotes are to be placed, and not when not. These symbols are called (left and right) Quine quotes (or, to be philosophically correct, these symbols are called ‘Quine quotes’), being named after the philosopher Quine, who introduced them in Quine (1940). Now that we have discussed the distinction between use and mention we can ignore it again, as it quickly becomes quite cumbersome to always use these quotes.

Before we move on to the next topic, let me note that there are languages, such as the English which I am using to write this book, that are so rich in expressive power that they can be metalanguages for themselves. That is, we can use such languages to talk in them about them. In fact, this is exactly what I am doing in this paragraph! Such languages are given a special name, viz. ‘self-referential languages.’ And, while they are great, they also cause a lot of philosophical trouble. This is illustrated by the following ‘liar sentence’:

L   This sentence is false.

If the sentence L in the line above is true, then it is false, and if it is false, then it is true. So L is true if and only if it is false.
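The recursive definition of the language L given earlier in this section lends itself to a short recursive check. The following Python sketch is not part of the book. It assumes that sentence letters are single lowercase Latin letters and that compound sentences are encoded as nested tuples, and the names `is_wff` and `show` are invented for illustration. Python plays the role of the metalanguage here, and the tuples stand for sentences of the object language.

```python
BINARY = {"∧", "∨", "→", "↔"}

def is_wff(f):
    """Return True if f encodes a sentence of L under clauses 1-3."""
    # Clause 1: every sentence letter is a sentence.
    if isinstance(f, str):
        return len(f) == 1 and "a" <= f <= "z"
    # Clause 2: negations and binary compounds of sentences are sentences.
    if isinstance(f, tuple):
        if len(f) == 2 and f[0] == "¬":
            return is_wff(f[1])
        if len(f) == 3 and f[1] in BINARY:
            return is_wff(f[0]) and is_wff(f[2])
    # Clause 3: nothing else is a sentence.
    return False

def show(f):
    """Write a sentence with the full parenthesization of the definition."""
    if isinstance(f, str):
        return f
    if len(f) == 2:
        return "(¬" + show(f[1]) + ")"
    return "(" + show(f[0]) + " " + f[1] + " " + show(f[2]) + ")"

print(is_wff((("p", "∧", "q"), "→", "r")))  # True
print(is_wff(("p", "∧")))                   # False: a connective lacks its second sentence
print(show(("¬", ("q", "∧", "r"))))         # (¬(q ∧ r))
```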

Negation: A negated sentence (¬α) is true just in case the sentence α is false. A negated sentence (¬α) is false just in case the sentence α is true.

Conjunction: A conjunctive sentence (α ∧ β) is true just in case both conjuncts α and β are true. A conjunctive sentence (α ∧ β) is false just in case the conjunct α is false or the conjunct β is false or both conjuncts α and β are false.

Disjunction: A disjunctive sentence (α ∨ β) is true just in case the disjunct α is true or the disjunct β is true or both disjuncts α and β are true. A disjunctive sentence (α ∨ β) is false just in case both disjuncts α and β are false.

Material conditional: A material conditional (α → β) is true just in case the antecedent α is false or the consequent β is true or both. A material conditional (α → β) is false just in case the antecedent α is true and the consequent β is false.

Material biconditional: A material biconditional (α ↔ β) is true just in case the sentence α and the sentence β have the same truth value. A material biconditional (α ↔ β) is false just in case the sentence α and the sentence β have different truth values.

Unfortunately, the meaning of many English conditionals, or if-then sentences, is not captured by the material conditional. For this reason, philosophers have come up with other conditional connectives besides the material conditional. Among these probably the most important one for philosophical purposes is the counterfactual conditional which captures the meaning of ‘if’ in sentences such as ‘If things had been such and so, things would have been thus and so.’ The antecedents, or if-clauses, of these conditionals may involve a contrary-to-fact supposition (hence the name ‘counterfactuals’). Can you think of a reason why a contrary-to-fact supposition may cause trouble for the material conditional?

The above truth conditions can be summarized by what logicians call a truth table. Also, note that I have stopped using single quotes when it became too cumbersome. Otherwise I should have written ‘what logicians call a ‘truth table.’’

| α | β | (¬α) | (α ∧ β) | (α ∨ β) | (α → β) | (α ↔ β) |
|---|---|------|---------|---------|---------|---------|
| T | T | F | T | T | T | T |
| T | F | F | F | T | F | F |
| F | T | T | F | T | T | F |
| F | F | T | F | F | T | T |
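The truth conditions for the connectives can also be written as small truth functions, from which the table is generated mechanically. The sketch below is an illustration rather than anything taken from the book. It assumes that Python's `True` and `False` stand for the truth values T and F, and the function names are made up for this example.

```python
from itertools import product

# One truth function per connective, following the truth conditions above.
def neg(a):
    return not a

def conj(a, b):
    return a and b

def disj(a, b):
    return a or b

def cond(a, b):
    # Material conditional: false only if the antecedent is true
    # and the consequent is false.
    return (not a) or b

def bicond(a, b):
    return a == b

def tv(x):
    return "T" if x else "F"

# One row per assignment of truth values to α and β.
rows = []
for a, b in product([True, False], repeat=2):
    rows.append((tv(a), tv(b), tv(neg(a)), tv(conj(a, b)),
                 tv(disj(a, b)), tv(cond(a, b)), tv(bicond(a, b))))

print("α β ¬α α∧β α∨β α→β α↔β")
for row in rows:
    print(" ".join(row))
```

Each printed row lists the values in the same column order as the truth table.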

1.2 PREDICATE LOGIC

Sentences talk about objects, the properties these objects have, and the relations they stand in. Objects are referred to, denoted by, or named by, names such as ‘Angela Merkel.’ Since logicians are lazy, they use the shorter individual constants, which are usually small letters from the beginning of the alphabet.

a   Angela Merkel
b   Toronto
c   Montréal
⋮

Properties of one object are referred to by predicates, or predicate symbols, and relations between two or more objects are referred to by relation symbols. These are usually capital letters from the middle of the alphabet.

F   . . . is chancellor of Germany in August 2017
G   . . . has more inhabitants than . . .
⋮

A predicate and an individual constant can be combined to form a sentence, similarly for a binary relation symbol and two individual constants (that may be two tokens of the same type).

F (a)   Angela Merkel is chancellor of Germany in August 2017.
G (b, c)   Toronto has more inhabitants than Montréal.

It is customary to identify predicate symbols with unary relation symbols, and propositional variables, or sentence letters, with 0-ary relation symbols. This has the consequence that propositional logic is included in predicate logic as a special case.

Besides individual constants there are individual variables. These are usually small letters from the end of the alphabet. They make predicate logic both powerful and difficult.

x
y
⋮

I have not included a right column for individual variables because they generally do not occur on their own. Instead they are generally bound by the existential quantifier ‘∃x’ or the universal quantifier ‘∀y.’

Existential quantifier: ∃x
there exists an x such that . . . x . . .; some x is such that . . . x . . .; at least one x is such that . . . x . . .

Universal quantifier: ∀y
every y is such that . . . y . . .; all y are such that . . . y . . .; each y is such that . . . y . . .

When we translate English sentences into well-formed formulas of predicate logic, it is often helpful to proceed in two steps. In a first step, we move from the English language to the regimented English language, which is a clumsy version of the English language that contains no ambiguities and makes
the predicate-logical form of all sentences clear. For instance, consider the English sentences:

There is a Republican U.S. president in August 2017.
All Canadian cities have at most as many inhabitants as Toronto.

These two sentences from the English language are transformed into the following two sentences from the regimented English language:

There exists at least one object x such that: x is U.S. president in August 2017 and x is Republican.
All things x are such that: If x is a Canadian city, then x has at most as many inhabitants as Toronto.

In a second step, we can then transform the sentences from the regimented English language into the formal language of predicate logic:

∃x (P (x) ∧ R (x))
∀x (C (x) → M (x, b))

We now subsume the formal language for propositional logic under the richer formal language for predicate logic, also called ‘L,’ which is defined recursively as follows.

1. If ‘t₁,’ . . ., ‘tₙ’ are n terms, that is, individual constants or individual variables, and ‘R’ is an n-ary relation symbol (which includes propositional variables, or sentence letters, as the special case where n = 0), then ‘R (t₁, . . . , tₙ)’ is a well-formed formula of L. Specifically, it is an atomic formula.

2. If α and β are well-formed formulas of L, and if ‘x’ is an individual variable, then (¬α), (¬β), (α ∧ β), (α ∨ β), (α → β), (α ↔ β), as well as ∃x (α) and ∀x (α) are also well-formed formulas of L. Specifically, they are complex formulas.
3. Nothing else is a well-formed formula, or simply formula, of L.

In contrast to the previous definition, the first clause now is more general and includes the previous first clause as a special case. The same is true for the second clause. Predicate logic thus covers propositional logic as a special case.

The following is one version of how the truth conditions for existentially and universally quantified formulas can be defined. It is not standard, as it assumes that we have a name, or individual constant, for each object. The standard semantics does not make this assumption, but it is considerably more complex (Shapiro 2013: sct. 4, Zach 2016: ch. 5). For our purposes, the present version will do.

We use the notation ‘α [a/x]’ (read: ‘a’ for ‘x’ in α) to denote that every free occurrence, or token, of the individual variable ‘x’ in the well-formed formula α has been replaced by an occurrence, or token, of the individual constant ‘a.’ For instance, consider the well-formed formula ‘∃y (L (x, y))’ in which the individual variable ‘x’ occurs freely, but in which the individual variable ‘y’ is bound by the quantifier ‘∃y’ and so does not occur freely. ∃y (L (x, y)) [a/x] is ‘∃y (L (a, y)),’ because ‘x’ occurs freely in ‘∃y (L (x, y))’ and so is replaced by ‘a.’ ∃y (L (x, y)) [b/y] is ‘∃y (L (x, y))’ because ‘y’ does not occur freely in ‘∃y (L (x, y)),’ and so nothing is replaced. Finally, ∃y (L (x, y)) [c/z] is also ‘∃y (L (x, y))’ because ‘z’ does not occur at all in ‘∃y (L (x, y)).’ Note that the ‘y’ next to the ‘∃’ does not count as an occurrence of ‘y’ in ‘∃y (L (x, y)).’ Instead it is part of the quantifier which is ‘∃y’ rather than ‘∃.’
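The substitution notation α [a/x] can be made concrete with a short recursive function. This is a sketch under stated assumptions and not the book's own formalism: formulas are encoded as tagged tuples, with tags such as `"atom"`, `"exists"`, and `"forall"` chosen here for illustration, and the function name `substitute` is hypothetical.

```python
def substitute(formula, const, var):
    """Return formula[const/var]: every free occurrence of the
    variable var is replaced by the individual constant const."""
    tag = formula[0]
    if tag == "atom":
        _, relation, terms = formula
        return ("atom", relation,
                tuple(const if t == var else t for t in terms))
    if tag in ("exists", "forall"):
        _, bound, body = formula
        if bound == var:
            # var is bound by this quantifier, so nothing below is free.
            return formula
        return (tag, bound, substitute(body, const, var))
    # Connectives ("not", "and", ...): substitute in each subformula.
    return (tag,) + tuple(substitute(sub, const, var) for sub in formula[1:])

# The formula ∃y (L (x, y)) from the text.
f = ("exists", "y", ("atom", "L", ("x", "y")))

print(substitute(f, "a", "x"))       # 'x' is free, so it becomes 'a'
print(substitute(f, "b", "y") == f)  # True: 'y' is bound, nothing is replaced
print(substitute(f, "c", "z") == f)  # True: 'z' does not occur at all
```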

Existential quantifier: An existentially quantified formula ∃x (α) is true just in case there is at least one individual constant ‘a’ such that α [a/x] is true. An existentially quantified formula ∃x (α) is false just in case all individual constants ‘a’ are such that α [a/x] is false. Universal quantifier: A universally quantified formula ∀x (α) is true just in case all individual constants ‘a’ are such that α [a/x] is true. A universally quantified formula ∀x (α) is false just in case there is at least one individual constant ‘a’ such that α [a/x] is false. Now that we have defined the truth conditions, or meaning, of the connectives and quantifiers, we can define the concepts that make clear that logic is the study of the validity, or value, of arguments. An argument consists of one or more premises to the left of the therefore symbol ‘∴’ and a conclusion to its right. An argument is logically valid if and only if the premises logically imply the conclusion. An argument is logically sound if and only if it is logically valid, and all its premises are true. Thus, the conclusion of a logically sound argument is also true. Logical truth: A formula α is logically true, |= α, just in case α is true in all logically possible cases. Below I will say more about what these logically possible cases are. For now, a few examples will do. The sentence ‘Toronto is a city or Toronto is not a city’ is logically true because it is true in all logically possible cases: if Toronto is a city, and also if Toronto is not a city. In symbols: |= (q ∨ (¬q)). Logical consequence (special version): A formula α logically implies a formula β, or β is a logical consequence of α, α |= β, just in case β is true in all logically possible cases in which α is true.
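The truth conditions for quantified formulas stated above can be tried out on a tiny example. The sketch below rests on several assumptions that are not in the text: there are only three individual constants, every object has a name, and the atomic truths are listed in a made-up set called `FACTS`. Instead of rewriting the formula, the code records which constant currently replaces a variable in a dictionary, which has the same effect as forming α [a/x].

```python
CONSTANTS = ["a", "b", "c"]  # assumption: one name for each object

# Hypothetical atomic truths: F (a) and G (b, c).
FACTS = {("F", ("a",)), ("G", ("b", "c"))}

def is_true(formula, env=None):
    """Evaluate a formula, trying every constant for a quantified variable."""
    env = env or {}
    tag = formula[0]
    if tag == "atom":
        _, relation, terms = formula
        named = tuple(env.get(t, t) for t in terms)
        return (relation, named) in FACTS
    if tag == "not":
        return not is_true(formula[1], env)
    if tag == "and":
        return is_true(formula[1], env) and is_true(formula[2], env)
    if tag == "or":
        return is_true(formula[1], env) or is_true(formula[2], env)
    if tag == "if":
        return (not is_true(formula[1], env)) or is_true(formula[2], env)
    if tag == "iff":
        return is_true(formula[1], env) == is_true(formula[2], env)
    if tag == "exists":
        _, var, body = formula
        # True just in case at least one constant makes the body true.
        return any(is_true(body, {**env, var: c}) for c in CONSTANTS)
    if tag == "forall":
        _, var, body = formula
        # True just in case all constants make the body true.
        return all(is_true(body, {**env, var: c}) for c in CONSTANTS)
    raise ValueError(f"unknown tag: {tag}")

F_x = ("atom", "F", ("x",))
print(is_true(("exists", "x", F_x)))  # True, because F (a) is listed
print(is_true(("forall", "x", F_x)))  # False, because F (b) is not listed
print(is_true(("if", ("atom", "F", ("a",)), ("exists", "x", F_x))))  # True
```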

The sentence ‘Toronto is a city’ logically implies the sentence ‘Toronto is a city, or the Great Wall of China stretches from east to west’ because the latter sentence is true in all logically possible cases in which the former sentence is true: if Toronto is a city and the Great Wall of China stretches from east to west, and also if Toronto is a city and the Great Wall of China does not stretch from east to west. In symbols: q |= (q ∨ r). Since arguments generally contain more than one premise, this definition needs to be generalized as follows: Logical consequence (general version): Several formulas α1 , α2 , . . . logically imply a formula β, α1 , α2 , . . . |= β, just in case β is true in all logically possible cases in which all formulas α1 , α2 , . . . are true. The sentences ‘p,’ ‘q,’ and ‘((p ∧ q) → r)’ logically imply the sentence ‘r’ because ‘r’ is true in all logically possible cases in which ‘p’ and ‘q’ (and, hence, ‘(p ∧ q)’) as well as ‘(p ∧ q) → r’ are true: that is, in the one logically possible case where all of ‘p’ and ‘q’ and ‘r’ are true. In symbols: p, q, ((p ∧ q) → r) |= r. Logical equivalence: A formula α is logically equivalent to a formula β just in case: β is a logical consequence of α, and α is a logical consequence of β. The sentence ‘Toronto is a city’ is logically equivalent to the sentence ‘It is not the case that Toronto is not a city’ because these two sentences are logical consequences of each other. In symbols: q |= ¬¬q and ¬¬q |= q. Logical equivalence can also be defined as follows: Logical equivalence (variant): A formula α is logically equivalent to a formula β just in case the sentence α ↔ β is logically true.

That is, logically equivalent formulas have the same truth value in all logically possible cases. This means that logically equivalent sentences express, or mean, the same proposition.

Of course, these definitions say little if we do not specify what the logically possible cases (that is, the models of model theory) are. For now, the logically possible cases are the lines in a truth table. A formula that is logically true with this understanding of the logically possible cases is said to be logically true in propositional logic. A formula that logically implies, or is logically equivalent to, another formula with this understanding of the logically possible cases is said to logically imply, or to be logically equivalent to, the latter formula in propositional logic.

Every logical truth, logical implication, and logical equivalence in propositional logic is also a logical truth, logical implication, and logical equivalence in predicate logic, respectively. The converse is not true, though, because propositional logic, in contrast to predicate logic, does not have any rules for quantifiers. It treats quantified formulas as sentence letters that cannot be analyzed further. This is illustrated by the following three examples.

The formula ∀x (M (x)) → ∃x (M (x)) (read: If everything is material, then something is material) is logically true in predicate logic. It is not logically true in propositional logic because the latter treats ∀x (M (x)) and ∃x (M (x)) as two distinct sentence letters that cannot be analyzed further. Therefore, there is a line in the truth table in which ∀x (M (x)) is true and ∃x (M (x)) is false. This line in the truth table shows that ∀x (M (x)) → ∃x (M (x)) is not logically true in propositional logic.

The argument F (a) ∴ ∃x (F (x)) (read: Angela Merkel is chancellor of Germany in August 2017; therefore, someone is chancellor of Germany in August 2017) is logically valid in predicate logic. It is not logically valid in propositional logic because the latter treats F (a) and ∃x (F (x)) as two distinct
sentence letters that cannot be analyzed further. Therefore, there is a line in the truth table in which F (a) is true and ∃x (F (x)) is false. This line in the truth table shows that F (a) ∴ ∃x (F (x)) is not logically valid in propositional logic.

The formula ¬∀x (M (x)) (read: Not everything is material) is logically equivalent to the formula ∃x (¬ (M (x))) (read: Something is not material) in predicate logic. The first formula is not logically equivalent to the second formula in propositional logic because the latter treats ∀x (M (x)) and ∃x (¬ (M (x))) as two distinct sentence letters that cannot be analyzed further. Therefore, there is a line in the truth table in which ∀x (M (x)) is false (so that its negation ¬∀x (M (x)) is true) and ∃x (¬ (M (x))) is false. This line in the truth table shows that ¬∀x (M (x)) is not logically equivalent to ∃x (¬ (M (x))) in propositional logic.

We will come across one principle, the principle of the substitution of logical equivalents (SLE), where the distinction between logical equivalence in propositional logic as opposed to logical equivalence in predicate logic is crucial.

Finally, if you think the first example, much like the claim that something is or is not material, ∃x (M (x) ∨ ¬M (x)), should not be a logical truth, and the second example, much like the argument: Winnie-the-Pooh is a bear and speaks English; therefore, some bears speak English, B (w) ∧ E (w) ∴ ∃x (B (x) ∧ E (x)), should not be a logically valid argument, then you are a proponent of inclusive logic (Nolt 2014). Inclusive logic rejects the assumptions of classical logic, which we are using, that at least one thing exists and that names only refer to existing things.

1.3 EXERCISES

The truth table for the formula ‘((p ∨ q) ∧ ¬ (q))’ is obtained by first identifying the different types of propositional variables, or sentence letters, of the formula. These are ‘p’ and ‘q.’ Next, we list all the possible assignments of truth values to these types of propositional variables.

| p | q |
|---|---|
| T | T |
| T | F |
| F | T |
| F | F |

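Then these truth values of the propositional variables, or sentence letters, are written underneath all tokens, or occurrences, of them in the formula:

| p | q | (p | ∨ | q) | ∧ | ¬ | q |
|---|---|----|---|----|---|---|---|
| T | T | T |   | T |   |   | T |
| T | F | T |   | F |   |   | F |
| F | T | F |   | T |   |   | T |
| F | F | F |   | F |   |   | F |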

In order to reduce the use of parentheses, we adopt the convention that ‘¬’ binds stronger than ‘∧’ and ‘∨,’ and that ‘∧’ and ‘∨’ bind stronger than ‘→’ and ‘↔.’ In the above table I have omitted all parentheses that are not needed to avoid ambiguities in scope. Next, we work our way from the propositional variables, or sentence letters, to the first connective, then the next, and so on. . . p T T F F

q (p ∨ q) ∧ ¬ q T T T F T F T F T F T F T F T F F F T F

p T T F F

q (p ∨ q) ∧ ¬ q T T T T F T F T T F T F T F T T F T F F F F T F

16

INTRODUCTION TO PROBABILITY AND INDUCTION

. . .until we reach the main connective of the formula, ‘∧’: p T T F F

q (p ∨ T T T F T T T F T F F F

q) T F T F

∧ F T F F

¬ F T F T

q T F T F
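The procedure above can be sketched in a few lines of Python. This is only an illustration, not part of the book's method: the helper names `truth_table` and `show` are made up for this sketch, and Python's `and`, `or`, and `not` stand in for ‘∧,’ ‘∨,’ and ‘¬.’

```python
from itertools import product

def truth_table(variables, formula):
    """Return one (assignment, value) pair per line of the truth table."""
    rows = []
    # product([True, False], ...) lists T before F, as in the tables above.
    for values in product([True, False], repeat=len(variables)):
        assignment = dict(zip(variables, values))
        rows.append((assignment, formula(**assignment)))
    return rows

def show(value):
    """Render a Python boolean as 'T' or 'F'."""
    return "T" if value else "F"

# The formula (p ∨ q) ∧ ¬q from the text.
formula = lambda p, q: (p or q) and not q

table = truth_table(["p", "q"], formula)
for assignment, value in table:
    print(show(assignment["p"]), show(assignment["q"]), "|", show(value))
```

The printed column to the right of the bar corresponds to the column under the main connective ‘∧’ in the final table.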

Exercise 1: Write down the truth table for the following formula ‘((p ∧ q) ∨ (¬ (¬q))),’ or simply ‘(p ∧ q) ∨ ¬¬q.’

In this way we can show, or prove, that the formula ‘¬ (p ∧ ¬p)’ is logically true because it has a ‘T’ under its main connective ‘¬’ in all lines of the truth table.

p   |   ¬   (p   ∧   ¬   p)
T   |        T           T
F   |        F           F

p   |   ¬   (p   ∧   ¬   p)
T   |        T       F   T
F   |        F       T   F

p   |   ¬   (p   ∧   ¬   p)
T   |        T   F   F   T
F   |        F   F   T   F

p   |   ¬   (p   ∧   ¬   p)
T   |   T    T   F   F   T
F   |   T    F   F   T   F

Exercise 2: Show that the formula ‘(p → (p ∨ q)),’ or simply ‘p → p ∨ q,’ is logically true by showing that it has a ‘T’ under its main connective ‘→’ in all lines of the truth table.

In this way we can also show, or prove, that two formulas are logically equivalent by showing that they have the same truth value under their main connective in all lines of the truth table. For instance, we can show in this way that the two formulas ‘(p ∨ q)’ and ‘¬ (¬p ∧ ¬q)’ are logically equivalent because they have the same truth value under their main connective ‘∨’ and ‘¬,’ respectively, in all lines of the truth table.

p   q   |   (p   ∨   q)   |   ¬   (¬   p   ∧   ¬   q)
T   T   |    T   T   T    |   T    F   T   F   F   T
T   F   |    T   T   F    |   T    F   T   F   T   F
F   T   |    F   T   T    |   T    T   F   F   F   T
F   F   |    F   F   F    |   F    T   F   T   T   F

Exercise 3: Show that the two formulas ‘(p ∧ q)’ and ‘¬ (¬p ∨ ¬q)’ are logically equivalent by showing that they have the same truth value under their main connective ‘∧’ and ‘¬,’ respectively, in all lines of the truth table.

Furthermore, we can also show, or prove, in this way that one formula logically implies another formula by showing that the second sentence, or formula, has a ‘T’ under its main connective in all lines of the truth table, if any, where the first formula has a ‘T’ under its main connective. For instance, we can show in this way that the formula ‘¬p’ logically implies the formula ‘p → q’ because the second formula has a ‘T’ under its main connective ‘→’ in all lines of the truth table in which the first formula has a ‘T’ under its main connective ‘¬.’

p   q   |   ¬   p   |   p   →   q
T   T   |   F   T   |   T   T   T
T   F   |   F   T   |   T   F   F
F   T   |   T   F   |   F   T   T
F   F   |   T   F   |   F   T   F
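The three uses of truth tables described so far (logical truth, logical equivalence, logical implication) can be sketched as brute-force checks over all lines of the table. The function names below are hypothetical and chosen only for this illustration; the formulas checked are the worked examples from the text, not the exercises.

```python
from itertools import product

def assignments(variables):
    """Yield every line of the truth table as a dict of truth values."""
    for values in product([True, False], repeat=len(variables)):
        yield dict(zip(variables, values))

def is_logically_true(variables, formula):
    # 'T' under the main connective in all lines.
    return all(formula(**a) for a in assignments(variables))

def are_equivalent(variables, first, second):
    # Same truth value under the main connective in all lines.
    return all(first(**a) == second(**a) for a in assignments(variables))

def implies(variables, first, second):
    # 'T' for the second formula in all lines where the first has a 'T'.
    return all(second(**a) for a in assignments(variables) if first(**a))

def conditional(a, b):
    """Material conditional a → b."""
    return (not a) or b

non_contradiction = lambda p: not (p and not p)      # ¬(p ∧ ¬p)
disjunction = lambda p, q: p or q                    # p ∨ q
de_morgan = lambda p, q: not ((not p) and (not q))   # ¬(¬p ∧ ¬q)
not_p = lambda p, q: not p                           # ¬p
p_implies_q = lambda p, q: conditional(p, q)         # p → q

print(is_logically_true(["p"], non_contradiction))
print(are_equivalent(["p", "q"], disjunction, de_morgan))
print(implies(["p", "q"], not_p, p_implies_q))
```

Note that `implies` is trivially satisfied when the first formula has no ‘T’ at all, which matches the qualification "if any" in the text.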

Exercise 4: We adopt the convention that the main connective of a propositional variable, or sentence letter, is the propositional variable, or sentence letter, itself. Show that the formula ‘q’ logically implies the formula ‘p → q’ by showing that the second formula has a ‘T’ under its main connective ‘→’ in all lines of the truth table in which the first formula has a ‘T’ under its main connective ‘q.’

Exercise 5: Show that the formula ‘p ∧ q’ logically implies the formula ‘¬p ↔ ¬q’ by showing that the second formula has a ‘T’ under its main connective ‘↔’ in all lines of the truth table in which the first formula has a ‘T’ under its main connective ‘∧.’

The method of truth tables allows us to show, or prove, logical truths, logical implications, and logical equivalences in propositional logic. The principle of the substitution of logical equivalents facilitates this task. SLE says that the formula α [β/γ] is logically equivalent to the formula α, if the formula β is logically equivalent to the formula γ in propositional logic. Here α [β/γ] results from α by replacing all occurrences of γ in α by an occurrence of β. The restriction in SLE that β and γ are logically equivalent in propositional logic is most important!

To show, or prove, logical truths, logical implications, and logical equivalences in predicate logic, we need additional tools. The first of these is the principle of existential generalization (EG). EG says that the formula ∃x (α [x/a]) follows logically in predicate logic from the formula α, provided the individual variable ‘x’ does not occur in α. Here the formula α [x/a] results from α by replacing all occurrences of the individual constant ‘a’ in α by an occurrence of the individual variable ‘x.’ We make use of this principle when we say that, in predicate logic, the sentence ‘There exists a chancellor of Germany in August 2017’ follows logically from the sentence ‘Angela Merkel is chancellor of Germany in August 2017,’ F (a) |= ∃x (F (x)). The second tool is the principle of universal instantiation (UI).
UI says that the formula α [a/x] follows logically in predicate logic from the formula ∀x (α), where the formula α [a/x] results from α by replacing all free occurrences of the individual variable ‘x’ in α by an occurrence of the individual constant ‘a.’ We make use of this principle when we say that, in predicate logic, the sentence ‘Muhammad Ali is mortal if Muhammad Ali is human’ follows logically from the sentence ‘All humans are mortal,’ ∀x (H (x) → M (x)) |= H (a) → M (a). The third tool is the principle of universal generalization (UG), and it is by far the most difficult one. UG says that the formula ∀x (α [x/c]) follows logically in predicate logic from the formula α [c], provided the individual constant ‘c’ is arbitrary (that is, has not occurred in any formula that was used to logically infer α [c]), and provided the individual variable ‘x’ is new (that is, does not occur in α). One way to think of this principle is as licensing any-all inferences: From the premise that any object c has a certain property (recall: ‘c’ is arbitrary) one may and ought to infer the conclusion that all objects have this property.

Before applying these principles in the chapters to follow, a note on terminology. Logicians often restrict the term ‘sentence’ to those well-formed formulas that do not contain any free occurrences of individual variables. In this terminology, ‘∀x (F (x))’ is both a sentence and a well-formed formula, whereas ‘F (x)’ is only a well-formed formula but not also a sentence. I will try to avoid this terminology, which is why I am formulating some things in seemingly odd ways.

READINGS

Textbooks that cover similar material as this book are:

Hacking, Ian (2001), An Introduction to Probability and Inductive Logic. Cambridge: Cambridge University Press.
Skyrms, Brian (1966/2000), Choice and Chance: An Introduction to Inductive Logic. 4th ed., Belmont, CA: Wadsworth Thomson Learning.

Recommended further readings for the material in the first chapter are:

Klement, Kevin C. (2016a), Propositional Logic. In J. Fieser & B. Dowden (eds.), Internet Encyclopedia of Philosophy.
Papineau, David (2012), Philosophical Devices: Proofs, Probabilities, Possibilities, and Sets. Oxford: Oxford University Press. Chapter 10.

and perhaps also

Papineau, David (2012), Philosophical Devices: Proofs, Probabilities, Possibilities, and Sets. Oxford: Oxford University Press. Chapter 11.
Shapiro, Stewart (2013), Classical Logic. In E.N. Zalta (ed.), Stanford Encyclopedia of Philosophy.

CHAPTER 2

Set Theory

2.1 ELEMENTARY POSTULATES

A set is a collection of things, entities, or objects. In the way philosophers use the term, subjects such as you and I are also objects. It’s not rude of me to say that you are an object. If anything, it would be rude of me to say that you are not an object, as that would be saying something along the lines that you do not exist. For instance, the government of Argentina in 2017 is or can be thought of as the set containing Mauricio Macri and the members of his cabinet. Similarly, the Library of Alexandria is or can be thought of as the set containing all its books. We use the curly brackets ‘{ ’ and ‘ }’ to denote sets, and it is important to distinguish a set from its members, or elements: Mauricio Macri and his ministers are human, but the set containing them is not. The set C containing all and only Canadian cities with more than 1 million inhabitants is

C = {Toronto, Montréal, Calgary}.

C is also the set of all objects x such that x is a Canadian city with more than 1 million inhabitants, that is,

C = {x : x is a Canadian city with more than 1 million inhabitants}.

Together with the colon ‘:’ preceded by the individual variable ‘x’ the curly brackets ‘{ ’ and ‘ }’ bind the second occurrence of the
individual variable ‘x’ in the above line. They do so much like the quantifiers ‘∀x’ and ‘∃y’ bind the individual variables ‘x’ and ‘y’ in ‘∀x∃y (L (x, y)).’ The order in which the members, or elements, of a set are listed does not matter:

{Toronto, Montréal, Calgary} = {Calgary, Montréal, Toronto}

The number of times a member, or an element, is listed does not matter either:

{Toronto, Montréal, Calgary} = {Toronto, Toronto, Calgary, Montréal, Toronto}

We use ‘∈’ to denote that the object mentioned to the left of ‘∈’ is a member, or an element, of the set mentioned to the right of ‘∈’, and ‘∉’ to denote that it is not. For instance, Toronto ∈ C and Vancouver ∉ C. Sets S and T are identical just in case they contain the same members, or elements. This is known as the principle of Extensionality.

Extensionality: For all sets S and T, S = T if and only if for all objects x, x ∈ S just in case x ∈ T.

In particular, we have for all sets S: S = {x : x ∈ S}, which will turn out to be a very useful identity. We use ‘⊆’ to denote that all members of the “subset” mentioned to the left of ‘⊆’ are also members of the “superset” mentioned to the right of ‘⊆.’ For instance, C ⊆ {x : x is a Canadian city}. We use ‘∅’, or ‘{},’ to denote the empty set that has no members, or elements, and whose existence set theory postulates. There are many ways to describe the empty set (for example, as the set of objects that are not identical to themselves, or as the set of objects that are both material and immaterial), but there exists just one empty set.

Furthermore, we have for all sets S and T: S = T if, and only if, S ⊆ T and T ⊆ S. For instance, since C = {Toronto, Calgary, Montréal}, we get that {Toronto, Montréal, Calgary} ⊆ C and C ⊆ {Montréal, Toronto, Calgary}. The right-to-left direction of this equivalence is useful when we want to prove that two sets are identical. This follows if we can establish that they are subsets of each other. In addition, we have for all sets S: ∅ ⊆ S and S ⊆ S. Set theory postulates the existence of further sets besides the empty set. However, unlike the empty set, whose existence set theory postulates “categorically,” these further sets are only postulated to exist on the condition that there already are some sets. If S and T are sets, then there exists the intersection of S and T, S ∩ T, which is the set of objects that are elements of both S and T:

S ∩ T = {x : (x ∈ S) ∧ (x ∈ T)}

For instance, C ∩ {Toronto} = {Toronto}.

[Venn diagram of the sets S and T illustrating the intersection S ∩ T]

If S and T are sets, then there exists the union of S and T, S ∪ T, which is the set of objects that are elements of S or of T or of both S and T:

S ∪ T = {x : (x ∈ S) ∨ (x ∈ T)}

For instance, {Calgary, Montréal} ∪ {Montréal, Toronto} = C.

[Venn diagram of the sets S and T illustrating the union S ∪ T]

If S and T are sets, then there exists the complement of S with respect to T, T \ S, which is the set of objects that are members of T, but that are not members of S:

T \ S = {x : (x ∈ T) ∧ ¬ (x ∈ S)} = {x : (x ∈ T) ∧ (x ∉ S)}

For instance, C \ {Toronto} = {Calgary, Montréal}.

[Venn diagram of the sets S and T illustrating the complement T \ S]
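The examples given so far can be replayed with Python's built-in `set` type, which happens to behave extensionally in the same way: order and repetition in a set literal make no difference. This is a rough sketch for illustration only; the variable names and the helper `identical` are made up here, and Python sets are merely a finite stand-in for the sets of set theory.

```python
C = {"Toronto", "Montréal", "Calgary"}

# The order of listing and the number of times a member is listed do not matter.
same_order = C == {"Calgary", "Montréal", "Toronto"}
same_repeats = C == {"Toronto", "Toronto", "Calgary", "Montréal", "Toronto"}

# Membership: Toronto ∈ C and Vancouver ∉ C.
toronto_in_C = "Toronto" in C
vancouver_in_C = "Vancouver" in C

def identical(S, T):
    """S = T if and only if S ⊆ T and T ⊆ S."""
    return S <= T and T <= S

# Intersection, union, and complement from the examples in the text.
intersection = C & {"Toronto"}
union = {"Calgary", "Montréal"} | {"Montréal", "Toronto"}
complement = C - {"Toronto"}

# The empty set is a subset of every set.
empty_is_subset = set() <= C

print(same_order, same_repeats, toronto_in_C, vancouver_in_C)
print(intersection, union == C, complement, empty_is_subset)
```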

If S is a set, then there exists the power set of S, ℘ (S), which is the set of all subsets of S:

℘ (S) = {A : A ⊆ S}

For instance, the power set of C, ℘ (C), is the set that contains the following eight sets as elements:

∅, {Toronto}, {Montréal}, {Calgary}, {Toronto, Montréal}, {Toronto, Calgary}, {Calgary, Montréal}, C

This means that ℘ (C) is the following set of sets:

℘ (C) = {∅, {Toronto}, {Montréal}, {Calgary}, {Toronto, Montréal}, {Toronto, Calgary}, {Calgary, Montréal}, C}
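As a small illustration, the power set of C can be enumerated in Python. The function name `power_set` is hypothetical, and `frozenset` is used only because Python requires the members of a set to be hashable; nothing in set theory itself depends on this detail.

```python
from itertools import combinations

def power_set(S):
    """Return the set of all subsets of S, each as a frozenset."""
    elements = list(S)
    return {
        frozenset(subset)
        for size in range(len(elements) + 1)
        for subset in combinations(elements, size)
    }

C = frozenset({"Toronto", "Montréal", "Calgary"})
P = power_set(C)

print(len(P))            # number of subsets of C
print(frozenset() in P)  # the empty set is a member of the power set
print(C in P)            # so is C itself
```

For a finite set with n members this enumeration yields 2 to the power of n subsets, which is why C, with three members, has a power set with eight members.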

The power set of a set cannot be pictured easily, as it is a set whose members, or elements, are sets as well. Note that the above principles have all been formulated in the language of predicate logic. This means that set theory can be formulated as a list of sentences of, or as a theory in, predicate logic. This has the consequence that we can apply all the logical principles from the previous chapter to prove claims in set theory. For instance, we can show, or prove, that it is a logical truth that every object x either is, or is not, an element of any given set S. Here is how. First, let ‘y’ and ‘T’ be arbitrary individual constants. ‘∈’ is the binary relation of set theoretic membership, or elementhood. In other words, let y and T be arbitrary objects of which we assume nothing whatsoever. ‘∈ (y, T),’ or more perspicuously, ‘y ∈ T’ is a well-formed formula, and so (y ∈ T) ∨ ¬ (y ∈ T) is logically true. We can show this by the following truth table, where ‘t’ is the sentence letter for ‘y ∈ T’:

t   |   t   ∨   ¬   t
T   |   T   T   F   T
F   |   F   T   T   F

The next step to arrive at our claim consists in noting that everything is a thing or an object. This includes sets, which are the objects satisfying the postulates of set theory. If it is true that, for an arbitrary object y and an arbitrary object T, y does or does not stand in relation ∈ to T, then it is also true that, for an arbitrary object y and an arbitrary object T that has the property of being a set, y does or does not stand in relation ∈ to T. Alternatively, we can use the method of truth tables to show that (y ∈ T) ∨ ¬ (y ∈ T) logically implies set (T) → (y ∈ T) ∨ ¬ (y ∈ T). Now we apply the principle of universal generalization (UG), which says that, in predicate logic, set (T) → (y ∈ T) ∨ ¬ (y ∈ T) logically implies ∀S (set (S) → (y ∈ S) ∨ ¬ (y ∈ S)),
because ‘T’ was arbitrary (that is, it has not occurred before we introduced it above), and because ‘S’ is new (that is, it does not occur in ‘set (T) → (y ∈ T) ∨ ¬ (y ∈ T)’). A second application of the principle of universal generalization says that, in predicate logic, ∀S (set (S) → (y ∈ S) ∨ ¬ (y ∈ S)) logically implies ∀x (∀S (set (S) → (x ∈ S) ∨ ¬ (x ∈ S))), because ‘y’ was arbitrary (that is, it has not occurred before we introduced it above), and because ‘x’ is new (that is, does not occur in ‘∀S (set (S) → (y ∈ S) ∨ ¬ (y ∈ S))’). This completes our proof.

The above postulates, or axioms, postulate the existence of various sets: the empty set, the power set of a set, the union of sets, the intersection of sets, the complement of a set with respect to a set. Another principle, the so-called unrestricted comprehension principle, has been postulated by Frege (1893/1903), who thought that for each property P there exists the set S_P of objects that possess the property P, S_P = {x : x possesses property P}. For instance, P may be the property of being a Canadian city with more than 1 million inhabitants. (Here we count everything as a property that can be described by a well-formed formula of predicate logic α [x] in which the individual variable ‘x’ occurs freely.) As we have just seen, sets can be members of sets, just as chefs can cook the dinners of chefs. For instance, each set S is a member of its power set, and so is the empty set, but not conversely: S ∈ ℘ (S) and ∅ ∈ ℘ (S), but ℘ (S) ∉ S and ℘ (S) ∉ ∅. Russell (1902) used the following property of sets to show that the unrestricted comprehension principle is logically false, or contradictory: Set S possesses the Russell property just in case S ∉ S. Compare: A chef is special if and only if she does not cook her own dinner. According to the unrestricted comprehension
principle, for each property there exists the set of objects that have this property. The Russell property is a property, and so the unrestricted comprehension principle implies that there exists the set, the so-called “Russell set” S_R, containing all and only the objects that possess the Russell property:

S_R = {S : S ∉ S}

This cannot be true, though. Consider the question whether the Russell set has the Russell property. Suppose first it does so that S_R ∉ S_R. In this case, S_R possesses the Russell property and so is a member of S_R, S_R ∈ S_R. Suppose next it does not so that S_R ∈ S_R. In this case, S_R is a member of S_R and so possesses the Russell property, S_R ∉ S_R. Hence, S_R ∈ S_R if and only if S_R ∉ S_R, which is logically false.

In the same way we can prove that it is logically false that there exists a chef who cooks the dinners of all and only those chefs who do not cook their own dinners, that is, it is logically false that there exists a chef who cooks the dinners of all and only the special chefs. Suppose there exists such a chef and consider the question if she is special and does not cook her own dinner. Suppose first she does not cook her own dinner and so is special. Then she is one of those chefs who she is cooking dinner for, and so she is not special after all. Suppose next she cooks her own dinner. Then she is not special, and so is not one of those chefs she is cooking dinner for. Hence, she is special if and only if she is not, which is a contradiction, that is, a sentence that is logically false.

The set theory we use relies on a weaker version of the unrestricted comprehension principle that is known as the restricted comprehension axiom. The latter principle says that for each set S and each property P there exists the set S_P of objects which are members of S and possess the property P. The restricted comprehension axiom is not logically false.
It avoids Russell’s paradox because it assumes there to be a set S (say, the set of Canadian cities) and then merely postulates the existence of the subset of S whose members have a given property (for example, the set of Canadian cities with more than 1 million inhabitants).
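The chef argument above can be double-checked by brute force for small kitchens, and restricted comprehension has a natural counterpart in Python's set comprehensions. Both pieces below are sketches under stated assumptions: the function names are hypothetical, the search only covers one, two, and three chefs (so it illustrates the argument rather than proving it), and the set of numbers used for restricted comprehension is an arbitrary stand-in.

```python
from itertools import product

def special_chef_exists(n):
    """Search every 'cooks dinner for' relation on n chefs for a chef who
    cooks for all and only the chefs who do not cook their own dinner."""
    chefs = range(n)
    pairs = [(a, b) for a in chefs for b in chefs]
    for bits in product([False, True], repeat=len(pairs)):
        cooks = dict(zip(pairs, bits))
        for c in chefs:
            # c cooks for x if and only if x does not cook for x.
            if all(cooks[(c, x)] == (not cooks[(x, x)]) for x in chefs):
                return True
    return False

results = [special_chef_exists(n) for n in (1, 2, 3)]
print(results)

def restricted_comprehension(S, has_property):
    """The subset of an already given set S whose members have the property."""
    return {x for x in S if has_property(x)}

S = set(range(10))  # an arbitrary given set, assumed for illustration
evens = restricted_comprehension(S, lambda x: x % 2 == 0)
print(evens)
```

The search fails for the same reason as in the text: taking x to be the chef c herself, the condition would require that she cooks her own dinner if and only if she does not.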

2.2 EXERCISES

Let us show that, in set theory, the following is true of every set P:

P = P ∪ P

1. P = {x : x ∈ P}   from Extensionality.
2. P = {x : (x ∈ P) ∨ (x ∈ P)}   from 1. and because x ∈ P is logically equivalent to (x ∈ P) ∨ (x ∈ P) in propositional logic, which can be shown by the method of truth tables.
3. P = {x : x ∈ (P ∪ P)}   from 2. and the definition of ∪.
4. P = P ∪ P   from 3. and Extensionality.

Here are the relevant truth tables, where ‘p’ is the sentence letter for ‘x ∈ P’:

p   |   p   |   p   ∨   p
T   |   T   |   T   T   T
F   |   F   |   F   F   F

p   |   p   ∨   p   |   p
T   |   T   T   T   |   T
F   |   F   F   F   |   F

Since we are arguing inside the scope of ‘{x : . . . x . . .},’ we need to be careful and so will restrict ourselves to what is logically true in propositional logic. In this section, logical equivalence means logical equivalence in propositional logic. The principle that allows us to substitute a formula inside the scope of ‘{x : . . . x . . .}’ for another formula that is logically equivalent to the former in propositional logic is the principle of Extensionality. It implies that the curly brackets do not create what philosophers call a “hyperintensional” context in which this is not allowed. This is different for concepts such as actual belief which creates a hyperintensional context.

Here is an example. That I believe p, say, that Moscow is the capital of Russia, does not imply that I believe ¬¬p, that is, that it is not the case that Moscow is not the capital of Russia. This is so despite the fact that ¬¬p is logically equivalent to p in propositional logic, a fact which I may fail to realize. (Of course, it may well be that I should believe ¬¬p if I believe p. Perhaps rational belief, in contrast to actual belief, does not create a hyperintensional context.)

Recall the definition of a subset. For any sets S and T, S is a subset of T, S ⊆ T, if and only if for all objects x: If x is an element of S, x ∈ S, then x is an element of T, x ∈ T. We can put this in the notation of predicate logic:

S ⊆ T ↔ ∀x ((x ∈ S) → (x ∈ T))

Since we want to restrict ourselves to propositional logic in this section, we will use a different formula in which the individual variable ‘x’ occurs freely, that is, is not bound by the universal quantifier ‘∀x,’ namely: (x ∈ S) → (x ∈ T). (The latter formula follows logically from ∀x ((x ∈ S) → (x ∈ T)) in predicate logic.) Armed with this, let us show next that, in set theory, the following is true for every subset P of any given set W:

P = P ∩ W

1. P = {x : x ∈ P}   from Extensionality.
2. P = {x : (x ∈ P) ∧ (x ∈ W)}   from 1. and because x ∈ P is logically equivalent to (x ∈ P) ∧ (x ∈ W) given our assumption (x ∈ P) → (x ∈ W), that is, P ⊆ W.
3. P = {x : x ∈ (P ∩ W)}   from 2. and the definition of ∩.
4. P = P ∩ W   from 3. and Extensionality.

Note that we have to justify each step in such a derivation, and we are only allowed to refer to the principles of set theory and whatever we can show to follow from these principles by propositional logic alone. This means that we have to restrict
ourselves to the method of truth tables and the principle of the substitution of logical equivalents (SLE). We adopt the convention that the justification for a claim is written to the right of the claim that is to be justified. Let us show next that, in set theory, the following is true of every subset P of any given set W:

P = W \ (W \ P)

1. W \ (W \ P) = {x : x ∈ (W \ (W \ P))}   from Extensionality.
2. W \ (W \ P) = {x : (x ∈ W) ∧ ¬ (x ∈ (W \ P))}   from 1. and the definition of \.
3. W \ (W \ P) = {x : (x ∈ W) ∧ ¬ ((x ∈ W) ∧ ¬ (x ∈ P))}   from 2. and the definition of \.
4. W \ (W \ P) = {x : (x ∈ W) ∧ (¬ (x ∈ W) ∨ (x ∈ P))}   from 3. and because ¬ (x ∈ W) ∨ (x ∈ P) is logically equivalent to ¬ ((x ∈ W) ∧ ¬ (x ∈ P)), so SLE implies that (x ∈ W) ∧ (¬ (x ∈ W) ∨ (x ∈ P)) is logically equivalent to (x ∈ W) ∧ ¬ ((x ∈ W) ∧ ¬ (x ∈ P)).
5. W \ (W \ P) = {x : ((x ∈ W) ∧ ¬ (x ∈ W)) ∨ ((x ∈ W) ∧ (x ∈ P))}   from 4. and because ((x ∈ W) ∧ ¬ (x ∈ W)) ∨ ((x ∈ W) ∧ (x ∈ P)) is logically equivalent to (x ∈ W) ∧ (¬ (x ∈ W) ∨ (x ∈ P)).
6. W \ (W \ P) = {x : (x ∈ W) ∧ (x ∈ P)}   from 5. and because (x ∈ W) ∧ (x ∈ P) is logically equivalent to ((x ∈ W) ∧ ¬ (x ∈ W)) ∨ ((x ∈ W) ∧ (x ∈ P)).
7. W \ (W \ P) = {x : (x ∈ P) ∧ (x ∈ W)}   from 6. and because (x ∈ P) ∧ (x ∈ W) is logically equivalent to (x ∈ W) ∧ (x ∈ P).
8. W \ (W \ P) = {x : x ∈ (P ∩ W)}   from 7. and the definition of ∩.
9. W \ (W \ P) = {x : x ∈ P}   from 8. and the previously established result that P = P ∩ W if P ⊆ W.
10. W \ (W \ P) = P   from 9. and Extensionality.
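Before attempting a derivation of this kind, it can help to sanity-check the identity on a small example. The sketch below tests the three identities derived above for every subset P of one small set W. The set W and the helper `subsets` are assumptions made for this illustration, and a check on one finite set is of course no substitute for the derivation itself.

```python
from itertools import combinations

def subsets(W):
    """Yield every subset of the finite set W."""
    elements = list(W)
    for size in range(len(elements) + 1):
        for combo in combinations(elements, size):
            yield set(combo)

W = {1, 2, 3, 4}  # an arbitrary small set, assumed for illustration

# P = P ∪ P
idempotent = all(P == P | P for P in subsets(W))
# P = P ∩ W for every subset P of W
absorbs = all(P == P & W for P in subsets(W))
# P = W \ (W \ P) for every subset P of W
double_complement = all(P == W - (W - P) for P in subsets(W))

print(idempotent, absorbs, double_complement)
```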

In logic and mathematics, there are always many different ways to show that something is true. In particular, this is so when we want to show that something is true in set theory. Usually it will be helpful to start with the longer side of an identity claim or equation, but this is merely a heuristic rather than a rule. It will also be helpful to picture the situation with a Venn diagram (Venn 1880). We have used Venn diagrams to illustrate the definitions of set theoretic intersection, union, and complementation in the previous section.

Exercise 6: Show that, in set theory, the following is true of all sets P and Q:
P ∩ Q = Q ∩ P

Exercise 7: Show that, in set theory, the following is true of all sets P, Q, and R:
P ∩ (Q ∩ R) = (P ∩ Q) ∩ R

Exercise 8: Show that, in set theory, the following is true of all subsets P and Q of any given set W:
P = (P ∩ Q) ∪ (P ∩ (W \ Q))

Exercise 9: Show that, in set theory, the following is true of all subsets P and Q of any given set W:
(W \ P) ∪ (W \ Q) = W \ (P ∩ Q)

Exercise 10: Show that, in set theory, the following is true of all sets P, Q, and R:
P ∩ (Q ∪ R) = (P ∩ Q) ∪ (P ∩ R)

Here is another exercise in case you want to practice a bit more. Show that, in set theory, the following is true of all subsets P and Q of any given set W:
P ∪ Q = (P ∩ Q) ∪ ((P ∩ (W \ Q)) ∪ ((W \ P) ∩ Q))

Solution:

1. (P ∩ Q) ∪ ((P ∩ (W \ Q)) ∪ ((W \ P) ∩ Q)) = {x : x ∈ ((P ∩ Q) ∪ ((P ∩ (W \ Q)) ∪ ((W \ P) ∩ Q)))}   from Extensionality.
2. (P ∩ Q) ∪ ((P ∩ (W \ Q)) ∪ ((W \ P) ∩ Q)) = {x : (x ∈ (P ∩ Q)) ∨ (x ∈ ((P ∩ (W \ Q)) ∪ ((W \ P) ∩ Q)))}   from 1. and the definition of ∪.
3. (P ∩ Q) ∪ ((P ∩ (W \ Q)) ∪ ((W \ P) ∩ Q)) = {x : (x ∈ (P ∩ Q)) ∨ ((x ∈ (P ∩ (W \ Q))) ∨ (x ∈ ((W \ P) ∩ Q)))}   from 2. and the definition of ∪.
4. (P ∩ Q) ∪ ((P ∩ (W \ Q)) ∪ ((W \ P) ∩ Q)) = {x : ((x ∈ P) ∧ (x ∈ Q)) ∨ (((x ∈ P) ∧ (x ∈ (W \ Q))) ∨ ((x ∈ (W \ P)) ∧ (x ∈ Q)))}   from 3. and the definition of ∩.
5. (P ∩ Q) ∪ ((P ∩ (W \ Q)) ∪ ((W \ P) ∩ Q)) = {x : (((x ∈ P) ∧ (x ∈ Q)) ∨ ((x ∈ P) ∧ (x ∈ (W \ Q)))) ∨ ((x ∈ (W \ P)) ∧ (x ∈ Q))}   from 4. and because (((x ∈ P) ∧ (x ∈ Q)) ∨ ((x ∈ P) ∧ (x ∈ (W \ Q)))) ∨ ((x ∈ (W \ P)) ∧ (x ∈ Q)) is logically equivalent to ((x ∈ P) ∧ (x ∈ Q)) ∨ (((x ∈ P) ∧ (x ∈ (W \ Q))) ∨ ((x ∈ (W \ P)) ∧ (x ∈ Q))).
6. (P ∩ Q) ∪ ((P ∩ (W \ Q)) ∪ ((W \ P) ∩ Q)) = {x : ((x ∈ P) ∧ ((x ∈ Q) ∨ (x ∈ (W \ Q)))) ∨ ((x ∈ (W \ P)) ∧ (x ∈ Q))}   from 5. and because (x ∈ P) ∧ ((x ∈ Q) ∨ (x ∈ (W \ Q))) is logically equivalent to ((x ∈ P) ∧ (x ∈ Q)) ∨ ((x ∈ P) ∧ (x ∈ (W \ Q))).
7. (P ∩ Q) ∪ ((P ∩ (W \ Q)) ∪ ((W \ P) ∩ Q)) = {x : ((x ∈ P) ∧ ((x ∈ Q) ∨ ((x ∈ W) ∧ ¬ (x ∈ Q)))) ∨ ((x ∈ (W \ P)) ∧ (x ∈ Q))}   from 6. and the definition of \.
8. (P ∩ Q) ∪ ((P ∩ (W \ Q)) ∪ ((W \ P) ∩ Q)) = {x : ((x ∈ P) ∧ (((x ∈ W) ∧ (x ∈ Q)) ∨ ((x ∈ W) ∧ ¬ (x ∈ Q)))) ∨ ((x ∈ (W \ P)) ∧ (x ∈ Q))}   from 7. and because (x ∈ W) ∧ (x ∈ Q) is logically equivalent to x ∈ Q given our assumption (x ∈ Q) → (x ∈ W), that is, Q ⊆ W.
9. (P ∩ Q) ∪ ((P ∩ (W \ Q)) ∪ ((W \ P) ∩ Q)) = {x : ((x ∈ P) ∧ (x ∈ W)) ∨ ((x ∈ (W \ P)) ∧ (x ∈ Q))}   from 8. and because x ∈ W is logically equivalent to ((x ∈ W) ∧ (x ∈ Q)) ∨ ((x ∈ W) ∧ ¬ (x ∈ Q)).
10. (P ∩ Q) ∪ ((P ∩ (W \ Q)) ∪ ((W \ P) ∩ Q)) = {x : (x ∈ P) ∨ ((x ∈ (W \ P)) ∧ (x ∈ Q))}   from 9. and because x ∈ P is logically equivalent to (x ∈ P) ∧ (x ∈ W) given our assumption (x ∈ P) → (x ∈ W), that is, P ⊆ W.
11. (P ∩ Q) ∪ ((P ∩ (W \ Q)) ∪ ((W \ P) ∩ Q)) = {x : (x ∈ P) ∨ (((x ∈ W) ∧ ¬ (x ∈ P)) ∧ (x ∈ Q))}   from 10. and the definition of \.
12. (P ∩ Q) ∪ ((P ∩ (W \ Q)) ∪ ((W \ P) ∩ Q)) = {x : (x ∈ P) ∨ (((x ∈ W) ∧ (x ∈ Q)) ∧ ¬ (x ∈ P))}   from 11. and because ((x ∈ W) ∧ (x ∈ Q)) ∧ ¬ (x ∈ P) is logically equivalent to ((x ∈ W) ∧ ¬ (x ∈ P)) ∧ (x ∈ Q).
13. (P ∩ Q) ∪ ((P ∩ (W \ Q)) ∪ ((W \ P) ∩ Q)) = {x : (x ∈ P) ∨ ((x ∈ Q) ∧ ¬ (x ∈ P))}   from 12. and because x ∈ Q is logically equivalent to (x ∈ W) ∧ (x ∈ Q) given our assumption (x ∈ Q) → (x ∈ W), that is, Q ⊆ W.
14. (P ∩ Q) ∪ ((P ∩ (W \ Q)) ∪ ((W \ P) ∩ Q)) = {x : (x ∈ P) ∨ (x ∈ Q)}   from 13. and because (x ∈ P) ∨ (x ∈ Q) is logically equivalent to (x ∈ P) ∨ ((x ∈ Q) ∧ ¬ (x ∈ P)).
15. (P ∩ Q) ∪ ((P ∩ (W \ Q)) ∪ ((W \ P) ∩ Q)) = {x : x ∈ (P ∪ Q)}   from 14. and the definition of ∪.
16. (P ∩ Q) ∪ ((P ∩ (W \ Q)) ∪ ((W \ P) ∩ Q)) = P ∪ Q   from 15. and Extensionality.
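The identity just derived can also be checked mechanically on a small example, which makes visible where the assumptions P ⊆ W and Q ⊆ W are used. In the sketch below the set W, the helper `subsets`, and the counterexample sets are all assumptions chosen for illustration; the check complements the derivation and does not replace it.

```python
from itertools import combinations

def subsets(W):
    """Yield every subset of the finite set W."""
    elements = list(W)
    for size in range(len(elements) + 1):
        for combo in combinations(elements, size):
            yield set(combo)

def right_hand_side(P, Q, W):
    """(P ∩ Q) ∪ ((P ∩ (W \\ Q)) ∪ ((W \\ P) ∩ Q))"""
    return (P & Q) | ((P & (W - Q)) | ((W - P) & Q))

W = {1, 2, 3}  # an arbitrary small set, assumed for illustration

# The identity holds for all subsets P and Q of W.
partition_holds = all(
    P | Q == right_hand_side(P, Q, W)
    for P in subsets(W)
    for Q in subsets(W)
)

# It can fail when P is not a subset of W.
P_outside, Q_empty = {4}, set()
lhs = P_outside | Q_empty
rhs = right_hand_side(P_outside, Q_empty, W)

print(partition_holds, lhs, rhs)
```

The failing case corresponds to steps 10 and 13 of the solution, which appeal to the assumptions that P and Q are subsets of W.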

As a postscriptum, note that the meaning of the curly brackets ‘{ ’ and ‘ }’ is entirely different from the meaning of the parentheses ‘( ’ and ‘ ).’ Parentheses are used to mark scope. For instance, the following sentence is ambiguous in scope:

Angela Merkel is not chancellor of Germany in August 2017 and Toronto is a city or the Great Wall of China stretches from east to west.

Parentheses, much like commas, are used to disambiguate sentences:

Angela Merkel is not chancellor of Germany in August 2017 and Toronto is a city, or the Great Wall of China stretches from east to west.

Angela Merkel is not chancellor of Germany in August 2017, and Toronto is a city or the Great Wall of China stretches from east to west.

The first sentence has the following logical form:

((¬p ∧ q) ∨ r)

It is true because r is true. The second sentence has the following logical form:

(¬p ∧ (q ∨ r))

It is false because p is true, so ¬p is false.

In contrast to parentheses, curly brackets allow you to create sets. For instance, Beyoncé is a spatio-temporally extended, concrete object (in the philosophical sense of the term). You can bump into her, because she is spatially extended, and you can do so more than once because she is temporally extended. I do not recommend it, though. We use curly brackets to create sets. For instance, {Beyoncé} is the set that contains Beyoncé as its only member. You cannot bump into this set because sets are abstract objects (very roughly, something like “ideas”). The set {Beyoncé} is totally different from the American singer. Moreover, both the set and the singer are totally different from (Beyoncé) because the string of symbols ‘(Beyoncé)’ does not even make sense (in the terminology of logicians, it is not well-formed).

Finally, please note that we use the symbol ‘∈’ to denote set theoretic membership, or elementhood, as in the following true claim:

Kālidāsa ∈ {Kālidāsa}

We do not use the symbol ‘ε’ or similar-looking symbols to denote this relation between sets and their elements, or members.
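The difference in scope between the two readings discussed above can be made concrete by evaluating both logical forms under the truth values the text assigns (p, q, and r all true). This is only an illustrative sketch with made-up variable names. One caveat is flagged as an assumption of the illustration: unlike the formal language of the text, Python does accept a parenthesized name such as `("Beyoncé")` and simply treats it as the string itself, so the analogy to ill-formed strings is imperfect.

```python
p = True  # Angela Merkel is chancellor of Germany in August 2017
q = True  # Toronto is a city
r = True  # the Great Wall of China stretches from east to west

first_reading = ((not p) and q) or r    # ((¬p ∧ q) ∨ r)
second_reading = (not p) and (q or r)   # (¬p ∧ (q ∨ r))

print(first_reading, second_reading)

# Curly brackets create a set; the set differs from its only member.
singleton = {"Beyoncé"}
member_of_singleton = "Beyoncé" in singleton
set_differs_from_member = singleton != "Beyoncé"

print(member_of_singleton, set_differs_from_member)
```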

READINGS

Recommended further readings for the material in the second chapter are:

Papineau, David (2012), Philosophical Devices: Proofs, Probabilities, Possibilities, and Sets. Oxford: Oxford University Press. Chapter 1.
Steinhart, Eric (2009), More Precisely: The Math You Need to Do Philosophy. Peterborough: Broadview Press. Chapter 1.

and perhaps also

Klement, Kevin C. (2016b), Russell’s Paradox. In J. Fieser & B. Dowden (eds.), Internet Encyclopedia of Philosophy.
Papineau, David (2012), Philosophical Devices: Proofs, Probabilities, Possibilities, and Sets. Oxford: Oxford University Press. Chapter 2.

CHAPTER 3

Induction

3.1 CONFIRMATION AND INDUCTION

Data from controlled experiments and uncontrolled observations are said to confirm scientific theories as well as everyday hypotheses. We use different locutions to express this relation of confirmation. For instance, someone in Honolulu might say that the clouds in the sky speak in favor of the hypothesis that it will rain soon. A worried parent might say that a positive result of an allergy test indicates that the allergy tested for is present. And in 1915, the German physicist Einstein might have said that Mercury’s anomalous 43 arc seconds (per century) advance of its perihelion confirms the general theory of relativity (Einstein 1915, Earman 1992). We often talk as if it were objects or events such as clouds or test results that do the confirming. However, for the purposes of developing a theory of confirmation, it is best to construe confirmation as a relation between sentences or between the propositions these sentences express, or mean. A sentence e describing the relevant objects or events confirms a sentence h describing the relevant hypothesis or theory. This renders confirmation a two-place relation between sentences (or propositions): e confirms h. However, confirmation is better construed as a three-place relation between a sentence describing the information one has (sometimes called the “evidence”), a sentence describing
the hypothesis one is interested in, and a sentence describing some background assumptions. The clouds in the sky speak in favor of the hypothesis that it will rain soon only given certain meteorological assumptions. The positive test result indicates that the allergy tested for is present only given the background assumption that the test is somewhat reliable. The anomalous perihelion of Mercury confirms the general theory of relativity only given the background assumption that no grave observational errors were made. This means that we are dealing with a three-place relation: e confirms h given b. Just as sentences express, or mean, propositions, words express, or mean, concepts. These concepts can come in a qualitative, comparative, and quantitative form (Carnap 1950/ 1962: §8). For instance, when we say that today is warm, but yesterday was not, we are using a qualitative concept of warmth: A day either has the quality of being warm, or it does not. When we say that today is warmer than yesterday, we are using a comparative concept of warmth: Days are compared to each other with regards to their warmth. Finally, when we say that today’s temperature equals 20 degrees Celsius, we are using a quantitative concept of warmth: A day has a particular quantity of warmth. Confirmation too comes in a qualitative, comparative, and quantitative form. We use a qualitative concept of confirmation when we say that the clouds in the sky confirm the hypothesis that it will rain soon. We use a comparative concept of confirmation when we say that the positive test result confirms more the hypothesis that some allergy tested for is present than the absence of a rash confirms the hypothesis that the allergy is absent. We use a quantitative concept of confirmation when we say that the anomalous perihelion of Mercury confirms the general theory of relativity to a specific degree. In the latter case, confirmation is a four-place relation: e confirms h given b to degree r. 
Even this is not the whole story, as the confirmation relation also depends on the language we use and, as we will see in later chapters, on yet another factor.

A quantitative concept automatically gives rise to a comparative concept: e₁ confirms h₁ given b₁ more than e₂ confirms h₂ given b₂ if and only if the degree to which e₁ confirms h₁ given b₁, c(h₁, e₁ | b₁), is greater than the degree to which e₂ confirms h₂ given b₂, c(h₂, e₂ | b₂), that is, c(h₁, e₁ | b₁) > c(h₂, e₂ | b₂). Sometimes a comparative or quantitative concept also gives rise to a qualitative concept. For instance, we could say that a day is qualitatively warm if and only if the day’s quantitative temperature is greater than 20 degrees Celsius.
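As a small illustration of how a quantitative concept induces a comparative and a qualitative one, the warmth example can be sketched in Python. The function names and the sample temperatures are made up for illustration; only the 20-degree threshold comes from the text.

```python
# Hypothetical helper names; the threshold of 20 degrees Celsius is the one
# used in the text's example of a qualitative concept of warmth.
WARM_THRESHOLD_C = 20.0


def warmer_than(temp_a: float, temp_b: float) -> bool:
    """Comparative concept: day a is warmer than day b."""
    return temp_a > temp_b


def is_warm(temp: float, threshold: float = WARM_THRESHOLD_C) -> bool:
    """Qualitative concept: a day is warm iff its temperature exceeds the threshold."""
    return temp > threshold


# Made-up quantitative data (degrees Celsius) for two days.
today, yesterday = 23.5, 18.0

print(warmer_than(today, yesterday))
print(is_warm(today), is_warm(yesterday))
```

The same pattern would apply to confirmation: given a degree-of-confirmation function c, the comparative concept is obtained by comparing two of its values.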


that one has. The problem of induction also arises for this more general formulation of the principle of induction: Why should, or ought, one follow the principle of induction?

For purposes of illustration, let us consider two extremely simple candidates for the principle of induction. The principle of instantial induction says the following: From the premise that all objects about which one has enough information (that is, whether they are F and whether they are G) are G if they are F one may and ought to inductively infer the conclusion that the next object is G if it is F. For instance, from the premise that all students one has asked, and whose answer one remembers, reported to live within 20 miles of campus, one may and ought to inductively infer the conclusion that the next student one asks will report so too.

The principle of universal induction says the following: From the premise that all objects about which one has enough information (that is, whether they are F and whether they are G) are G if they are F one may and ought to inductively infer the conclusion that all Fs are G. For instance, from the premise that the sun has risen on every day of which one remembers whether the sun has risen, one may and ought to inductively infer the conclusion that the sun rises every day.

The argument in the next section concludes that we cannot justify the principle of induction if it is one of these two principles, or any other principle, no matter how sophisticated (as long as there is just one principle of induction). Here are examples of inductive inferences that are applications of a more sophisticated principle of induction. Among others, they illustrate that some inductive inferences are stronger than others: the premises of the former confirm their conclusions more than the premises of the latter confirm theirs. This is why we speak of inductive strength.
The connoisseur in the restaurant who orders a bottle of wine first tastes a sip of the wine and then inductively infers that the entire bottle is good and tempered appropriately from the datum that the sip is


good and tempered appropriately. The spam filter of your e-mail program inductively infers that an incoming e-mail is spam from its similarity to the e-mails you previously deleted unread and marked as ‘spam.’ In 1930, the statistician and biologist Fisher might have inductively inferred the sex ratio model from our population’s approximately even sex ratio (Fisher 1930). Of course, in each case there are also background assumptions, but we will mostly suppress these. We can reformulate these examples in terms of confirmation. The sip’s being good confirms that the entire bottle is good; the similarity of the incoming e-mail to those previously marked as ‘spam’ confirms that it too is spam; our population’s approximately even sex ratio confirms Fisher’s sex ratio model. More generally, we can say that e confirms h given b if and only if h may and ought to be inductively inferred from e and b. We will see later that there are at least two distinct concepts of confirmation, viz. absolute and incremental confirmation. The equivalence just stated holds for absolute confirmation; it does not hold for incremental confirmation. Different scientists study different aspects of inductive inference. Statisticians study how to best obtain and describe data as well as how to make statistical inferences—which are particular inductive inferences—from these data (see Sections 10.7–10.9). Computer scientists study how to program computers to make inductive inferences on our behalf and are particularly interested in questions of computability. The latter characterizes how difficult the recommendations of a particular principle of induction are to implement. Psychologists and cognitive scientists study how humans and animals actually learn and make inductive inferences. They aim at a descriptive account. Finally, philosophers are interested in how humans and cognitively less restricted agents ought to make inductive inferences, and why they ought to do so. 
They aim at a normative account.


3.3 HUME’S ARGUMENT

In A Treatise of Human Nature (1739) and an abbreviated version thereof, An Enquiry Concerning Human Understanding (1748), Hume distinguishes between relations of ideas, which we can allegedly know a priori, and matters of fact, which we can only know a posteriori. For instance, that circles are not square is a relation of ideas and that bread nourishes is a matter of fact. Hume claims that we can know the former to be true a priori (that is, without experience), but that we can know the latter to be true only a posteriori (that is, after having experienced something).

The distinction between what we can know a priori and what we can know only a posteriori is an epistemological one. The distinction between relations of ideas and matters of fact resembles the metaphysical distinction between necessity (that is, what could not have been otherwise) and contingency (that is, what could have been otherwise). It also resembles the distinction from the philosophy of language between what is analytically true (that is, true in virtue of meaning alone) and what is synthetically true (that is, true in part because of what reality is like). For instance, the sentence “Teenagers are less than 20 years of age” is analytically true, whereas the sentence “Some teenagers attend school” is synthetically true.

According to Hume, the basis of, or justification for, our beliefs about matters of fact is causal information, that is, information about cause and effect. Our causal beliefs (or the causal information we have) in turn are based on, or justified by, inferences from experience. But what, Hume asks, is the basis of, or justification for, our inferences from experience? Hume answers that nothing justifies our inferences from experience, although we do, of course, reason in this way all the time out of habit or custom: “Custom, then, is the great guide of human life” (Hume 1748/1993, section V, part I).


Below we will formulate a contemporary version of Hume’s argument for the thesis that we cannot justify the principle of induction, whatever its precise form. Here is how Hume puts things:

These two propositions are far from being the same, I have found that such an object has always been attended with such an effect, and I foresee, that other objects, which are, in appearance, similar, will be attended with similar effects. (Hume 1748/1993, section IV, part II)

Let [humans] be once fully perswaded [. . .] of these two principles, that there is nothing in any object, consider’d in itself, which can afford us a reason for drawing a conclusion beyond it; and, that even after the observation of the frequent or constant conjunction of objects, we have no reason to draw any inference concerning any object beyond those of which we have had experience; (Hume 1739/1896, book 1, part 3, section 12)

Recall that an argument consists of one or more premises and a conclusion. The conclusion of our version of Hume’s argument says that we cannot justify the principle of induction, whatever its precise form. Its three premises say the following.

Premise 1 says that we can justify the principle of induction only if there is a deductively valid or an inductively strong argument which does not presuppose its conclusion, whose premises are restricted to information we have, and whose conclusion says that the principle of induction “holds.” The principle of induction holds if and only if it leads from true premises to true conclusions in all or most of the logically possible cases (whatever the precise meaning of ‘most’ if there are infinitely many logically possible cases).

Premise 1 distinguishes between deductively valid and inductively strong arguments. The deductively valid arguments are precisely the logically valid arguments from Chapter 1.


We just call them ‘deductively’ valid now to contrast them with inductively strong arguments. In Hume’s terminology, the deductively valid arguments are arguments for relations of ideas which we can know a priori. Inductively strong arguments are arguments for matters of fact which we can know only a posteriori. As mentioned earlier, our terminology makes clear that inductive strength is a matter of degree (some inductive inferences are better or stronger than others), although this will become important only in later chapters.

Premise 2 says there is no deductively valid argument that does not presuppose its conclusion, whose premises are restricted to information we have, and whose conclusion says that the principle of induction holds. Premise 3 says there is no inductively strong argument that does not presuppose its conclusion, whose premises are restricted to information we have, and whose conclusion says that the principle of induction holds.

This is our version of Hume’s argument:

Premise 1: We can justify the principle of induction only if there is a deductively valid or an inductively strong argument that does not presuppose its conclusion, whose premises are restricted to information we have, and whose conclusion says that the principle of induction holds.

Premise 2: There is no deductively valid argument that does not presuppose its conclusion, whose premises are restricted to information we have, and whose conclusion says that the principle of induction holds.

Premise 3: There is no inductively strong argument that does not presuppose its conclusion, whose premises are restricted to information we have, and whose conclusion says that the principle of induction holds.

Conclusion: We cannot justify the principle of induction.


Different philosophers will have different views on what precisely is information we have. Some might say the information we have consists of the propositions we know to be true. Others might say it consists of the propositions we believe to be true, or perhaps the true propositions we believe to be true. Still others might say it consists of the propositions we assume to be true. As long as there is some information that we do not have (that is, some question to which we do not have the answer), it does not matter much which of these options or their combinations, if any, we choose (see, however, the caveat that follows).

Our version of Hume’s argument is logically (that is, deductively) valid. This becomes clear by considering its propositional-logical form:

1. p → q ∨ r    assumption
2. ¬q    assumption
3. ¬r    assumption
4. ¬p    from 1., 2., and 3. by the method of truth tables
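The truth-table check behind step 4 can be carried out mechanically. The following is a minimal sketch, not part of the original text: the helper names `implies` and `is_valid` are made up for illustration, and p, q, r stand for the three propositional letters of the form above.

```python
from itertools import product


def implies(a: bool, b: bool) -> bool:
    """Material conditional: a → b."""
    return (not a) or b


def is_valid(premises, conclusion, n_vars: int) -> bool:
    """An argument form is valid iff no row of the truth table makes
    all premises true and the conclusion false."""
    for row in product([True, False], repeat=n_vars):
        if all(prem(*row) for prem in premises) and not conclusion(*row):
            return False
    return True


# Propositional-logical form of the argument: p → q ∨ r, ¬q, ¬r, therefore ¬p.
premises = [
    lambda p, q, r: implies(p, q or r),
    lambda p, q, r: not q,
    lambda p, q, r: not r,
]
conclusion = lambda p, q, r: not p

hume_form_valid = is_valid(premises, conclusion, 3)
print(hume_form_valid)  # True: the form is deductively valid
```

Validity of the form says nothing about whether the three premises are true, which is the question the text turns to next.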

Of course, just calling the three premises ‘assumptions’ does not justify them. Why should we believe or accept premise 2? Well, every question to which we do not have the answer is, by definition, a question whose answer requires information we do not have. The principle of induction is a general principle that applies to all questions, including those to which we do not have the answer. Furthermore, it is a truth of logic that there is no deductively valid argument whose premises are restricted to information we have, and whose conclusion says something that is the answer to a question to which we do not have the answer. Therefore, as long as there is some information that we do not have, there is no deductively valid argument whose premises are restricted to information we have, and whose conclusion says that the


principle of induction holds (and, hence, no such argument that does not presuppose its conclusion). Thus, premise 2 is true as long as there is a question to which we do not have the answer. Note that the reason premise 2 is true is not that there is no deductively valid argument whose conclusion says that the principle of induction holds. In fact, there is such an argument. For instance, this one:

1. All principles hold.    assumption
2. The principle of induction holds.    from 1. by UI

The reason premise 2 is true is that there is no deductively valid argument whose conclusion says that the principle of induction holds and whose premises are restricted to information we have. The premise of the logically valid argument above is false, and so goes beyond the information we have (here the choice of what we take to be information we have matters somewhat).

Why should we accept or believe premise 3? Well, any inductively strong argument for the conclusion that the principle of induction holds has to be inductively strong in the sense of this very principle. Hence, any such argument presupposes that the principle of induction holds. Yet this is precisely the conclusion we want to derive. So, any inductively strong argument for the conclusion that the principle of induction holds presupposes its conclusion: It is circular. Therefore, there is no inductively strong argument whose conclusion says that the principle of induction holds and which does not presuppose its conclusion (and, hence, no such argument whose premises are restricted to information we have).

An argument that presupposes its conclusion is called a ‘petitio principii.’ It is surprisingly difficult to spell out exactly when an argument presupposes its conclusion. For instance, does the logically valid argument above whose premise says that all principles hold presuppose its conclusion which says that the


principle of induction holds? If it does, doesn’t every logically valid argument? Premise 1 is indeed just an assumption of Hume’s philosophy. For instance, one could question that there are no other good arguments besides deductively valid and inductively strong ones. Relatedly, one could question that there is just one principle of induction. If there is more than one principle of induction, then one of them could perhaps be justified by an argument that is inductively strong in the sense of another principle of induction. The latter could perhaps be justified by an argument that is inductively strong in the sense of yet another principle of induction (or maybe the first). And so on. Another possible objection will be mentioned later.

READINGS

A recommended further reading for the material in the third chapter is:

Hume, David (1748/1993), An Enquiry Concerning Human Understanding. Ed. by E. Steinberg. Indianapolis, IN: Hackett. Sections VI.

and perhaps also

Vickers, John (2014), The Problem of Induction. In E.N. Zalta (ed.), Stanford Encyclopedia of Philosophy.

CHAPTER 4

Deductive Approaches to Confirmation

4.1 ANALYSIS AND EXPLICATION

Recall that information e confirms hypothesis h given background assumptions b just in case h is the conclusion of an inductively strong argument with premises e and b. This equivalence employs qualitative concepts. The corresponding equivalence with quantitative concepts says that information e confirms hypothesis h given background assumptions b to degree r just in case the inductive strength of the argument with premises e and b and conclusion h equals r.

In contrast to Hume, who was interested in justifying the principle of induction, Hempel, in “Studies in the Logic of Confirmation” (1945), aimed at defining a qualitative concept of confirmation in purely logical, or “syntactical,” terms. Hempel took such a definition to be a precondition for a definition of a quantitative concept of confirmation as well as other concepts such as rational belief and meaningfulness.

There are two closely related, but slightly different, philosophical activities that aim at a definition of a concept: analysis and explication. An analysis of a concept lists conditions, jointly referred to as the analysans, that are individually necessary and jointly sufficient for the concept to be analyzed, the analysandum.


For instance, in the dialog Theaetetus, Plato discusses the analysandum knowledge and considers its analysis as justified true belief. Each of justification, truth, and belief is individually necessary for knowledge, and, according to the analysis considered, these three conditions are jointly sufficient. An analysis aims at a definition of the analysandum, which is often an important or unclear concept, in terms of other concepts that are taken to be clearer. It is important that the phrase expressing the analysans has the exact same meaning as the phrase expressing the analysandum. This is why Gettier, in “Is Justified True Belief Knowledge?” (1963), can allegedly refute the analysis of knowledge as justified true belief by the method of counterexamples. One counterexample, where the analysans applies but the analysandum does not (or vice versa), is enough to show that the two are not, or do not have, the exact same meaning.

An explication of a concept aims to improve a potentially defective concept. The concept to be explicated is the explicandum, and the explicating concept is the explicatum. In contrast to an analysis, an explication does not necessarily aim at an explicatum that is, or has, the exact same meaning as the explicandum. The improved concept, the explicatum, may differ (in meaning) from the potentially defective concept, the explicandum. According to (the second edition of) Carnap’s Logical Foundations of Probability (1950/1962), an explication must be such that:

1. The explicatum is similar (in meaning) to the explicandum, but considerable differences (in meaning) are permitted.
2. The characterization of the explicatum is exact.
3. The explicatum is a fruitful concept, which means something along the lines of being useful for the formulation of many universal statements or “laws.”
4. The explicatum is as simple as (1–3) permit.


Let us consider an example from a discipline other than philosophy, namely biology. The definition of the concept of fish that is used in evolutionary biology excludes whales, whereas whales used to belong to what were once called ‘fish.’ When this definition was proposed, nobody claimed that whales provide a counterexample to evolutionary biology. Instead people changed their concepts and adjusted their usage of the word ‘fish.’ (Perhaps something similar is currently happening to the concepts of race and gender.) For philosophical methodology, this has the important consequence that the method of counterexamples cannot be used to refute an explication. A counterexample merely shows that two concepts are not identical (in meaning), but not that they are dissimilar (in meaning).

4.2 THE RAVENS PARADOX

While Hempel took himself to be engaging in conceptual analysis, his method of listing conditions of adequacy for any definition of confirmation is reminiscent of the “subjectively plausible desiderata” that contemporary confirmation theorists list when engaging in conceptual explication. We can think of these conditions of adequacy as search criteria, much like the search criteria that we enter when we shop online. For instance, when you shop online for a condo, you enter search criteria such as your price range, the distance from your workplace, the number of bedrooms, the floor the condo is on, and whether the condo has a balcony. Hempel was also entering such search criteria, except that he was not shopping online for a condo but searching for a definition of confirmation.

One of these conditions of adequacy is the equivalence condition (EC). It says that sentences that are logically equivalent, and hence express the same content, are always confirmed together. In other words, what matters for the confirmation of a sentence is the content it expresses, not


the particular way the sentence phrases its content. (We will suppress the background assumptions b.)

Equivalence condition: For all sentences e, h₁, and h₂, if e confirms h₁ and h₁ and h₂ are logically equivalent, then e confirms h₂.

Later we will come across further conditions of adequacy, but for now the equivalence condition will do. Next we need something like the principle of universal induction, except that we have to formulate it in terms of confirmation rather than induction. To this end, it will be helpful to recall our equivalence that e confirms h if and only if h may and ought to be inductively inferred from e. The principle of universal induction tells us that we may and ought to inductively infer the conclusion that all Fs are G from the premise that all objects about which one has enough information (that is, whether they are F and whether they are G) are G if they are F. In particular, we may and ought to inductively infer the conclusion that all Fs are G from the premise that the one object about which one has enough information is F and G. Our equivalence turns this into Nicod’s criterion (named after Nicod 1930): Universal hypotheses of the form ‘All Fs are G,’ ∀x (F(x) → G(x)), are confirmed by their so-called “instances,” which are sentences of the form ‘a is F and a is G,’ F(a) ∧ G(a). Note that instances of ∀x (F(x) → G(x)) differ from instantiations of ∀x (F(x) → G(x)). The latter are sentences of the form F(a) → G(a).

Before we turn to the ravens paradox, let me remind you that Hempel aimed at a definition of confirmation, whereas Hume was concerned with the justification of induction. This is a crucial difference that reflects what is sometimes called the linguistic turn in 20th-century analytic philosophy.

Universal hypotheses of the form ‘All Fs are G’ are universally quantified if-then sentences. The ravens hypothesis


is the universal hypothesis that all ravens are black, ∀x (R(x) → B(x)). According to Nicod’s criterion, it is confirmed by its instances, which are sentences reporting that some object a is a black raven, that is, a is a raven and a is black, R(a) ∧ B(a).

The ravens hypothesis is logically equivalent to the universal hypothesis that everything that is not black is not a raven, ∀x (¬B(x) → ¬R(x)). We can use the method of truth tables to show that the well-formed formula R(x) → B(x) is logically equivalent to the well-formed formula ¬B(x) → ¬R(x) in propositional logic. This allows us to apply the principle of the substitution of logical equivalents (SLE), which tells us that the well-formed formula α[¬B(x) → ¬R(x) / R(x) → B(x)] is logically equivalent to the well-formed formula α. In our case, α is the ravens hypothesis ∀x (R(x) → B(x)), and α[¬B(x) → ¬R(x) / R(x) → B(x)] is the hypothesis ∀x (¬B(x) → ¬R(x)), which results from ∀x (R(x) → B(x)) by replacing all occurrences (there is just one) of R(x) → B(x) by an occurrence of ¬B(x) → ¬R(x).

According to Nicod’s criterion, a report that some object a is a non-black non-raven confirms the universal hypothesis that everything that is not black is not a raven. According to the equivalence condition and the logical equivalence between this hypothesis and the ravens hypothesis just established, a report that some object a is a non-black non-raven also confirms the ravens hypothesis. Allegedly this is absurd. Allegedly it seems to be false that the ravens hypothesis is confirmed by reports of non-black non-ravens. (While Nicod’s criterion and the equivalence condition do not imply that the ravens hypothesis is confirmed by reports of, say, red socks and bluejays, the satisfaction criterion from Section 4.5 implies this.)

Here is a more formal derivation of what is known as the ravens paradox:

1. Equivalence condition    assumption
2. Nicod’s criterion    assumption
3. ∀x (R(x) → B(x)) and ∀x (¬B(x) → ¬R(x)) are logically equivalent    from SLE and because R(x) → B(x) and ¬B(x) → ¬R(x) are logically equivalent in propositional logic
4. ¬B(a) ∧ ¬R(a) confirms ∀x (¬B(x) → ¬R(x))    from 2.
5. ¬B(a) ∧ ¬R(a) confirms ∀x (R(x) → B(x))    from 1., 3., and 4.

While you figure out if this conclusion is really absurd, here is another truth of logic. The ravens hypothesis is logically equivalent to the universal hypothesis that everything that is green or not green is either not a raven or black or both, ∀x (G(x) ∨ ¬G(x) → ¬R(x) ∨ B(x)). We can show this by applying SLE as well as the method of truth tables, which shows R(x) → B(x) and G(x) ∨ ¬G(x) → ¬R(x) ∨ B(x) to be logically equivalent in propositional logic.

According to Nicod’s criterion, a report that some object a is green or not green, and not a raven or black (that is, a report saying that something is not a non-black raven that could be used to falsify or refute the ravens hypothesis) confirms the universal hypothesis ∀x (G(x) ∨ ¬G(x) → ¬R(x) ∨ B(x)). The latter is logically equivalent to the ravens hypothesis. Therefore, the equivalence condition implies that a report saying that something cannot be used to falsify the ravens hypothesis confirms it. This means that reports to the effect that something cannot be used to falsify universal hypotheses of the form ‘All Fs are G’ confirm these hypotheses. Allegedly this is absurd because it makes it too easy to confirm these hypotheses.

Here is a more formal derivation of this result:

1. Equivalence condition    assumption
2. Nicod’s criterion    assumption
3. ∀x (R(x) → B(x)) and ∀x (G(x) ∨ ¬G(x) → ¬R(x) ∨ B(x)) are logically equivalent    from SLE and because R(x) → B(x) and G(x) ∨ ¬G(x) → ¬R(x) ∨ B(x) are logically equivalent in propositional logic
4. (G(a) ∨ ¬G(a)) ∧ (¬R(a) ∨ B(a)) confirms ∀x (G(x) ∨ ¬G(x) → ¬R(x) ∨ B(x))    from 2.
5. (G(a) ∨ ¬G(a)) ∧ (¬R(a) ∨ B(a)) confirms ∀x (R(x) → B(x))    from 1., 3., and 4.

Our options are to reject the equivalence condition, to reject Nicod’s criterion, or to accept the allegedly absurd conclusion that hypotheses of the above-mentioned form are confirmed by reports to the effect that something cannot be used to falsify them.

Hempel thought that the absurdity is merely apparent. We think it is absurd that a report that some object a is a non-black non-raven confirms the ravens hypothesis because we implicitly make many background assumptions. However, once we make these background assumptions explicit, or we drop these implicitly made background assumptions, the appearance of absurdity disappears: In the absence of any further assumptions, information that some object a cannot be used to falsify a hypothesis really does confirm the latter, if only somewhat. Since Hempel aimed at a purely logical or “syntactical” definition of a concept of confirmation that does not incorporate any background assumptions, there is no problem after all. We will return to the ravens paradox in Chapter 8.
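The propositional equivalences that step 3 of each derivation in this section relies on can be checked by brute force over truth tables. The sketch below is only an illustration under stated assumptions: the helper names are made up, and R, B, G are treated as propositional letters standing for R(x), B(x), and G(x) for a fixed object. It does not model confirmation itself, only the logical equivalences.

```python
from itertools import product


def implies(a: bool, b: bool) -> bool:
    """Material conditional: a → b."""
    return (not a) or b


def equivalent(f, g, n_vars: int) -> bool:
    """Two formulas are logically equivalent iff they agree on every row."""
    return all(f(*row) == g(*row)
               for row in product([True, False], repeat=n_vars))


# Matrix of the ravens hypothesis and of its two reformulations.
ravens = lambda R, B, G: implies(R, B)
contrapositive = lambda R, B, G: implies(not B, not R)
tautologous_antecedent = lambda R, B, G: implies(G or not G, (not R) or B)

# Nicod "instance" of the third formulation versus the plain description
# "is not a non-black raven", i.e. cannot be used to falsify the hypothesis.
nicod_instance = lambda R, B, G: (G or not G) and ((not R) or B)
non_falsifier = lambda R, B, G: not (R and not B)

print(equivalent(ravens, contrapositive, 3))          # True
print(equivalent(ravens, tautologous_antecedent, 3))  # True
print(equivalent(nicod_instance, non_falsifier, 3))   # True
```

The third check makes the point of the second derivation concrete: every object that is not a non-black raven satisfies the instance, so by Nicod’s criterion and the equivalence condition every such report counts as confirming.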

4.3 THE PREDICTION CRITERION

Hempel next considered the idea that successful predictions confirm the hypotheses predicting them, much like successful predictions of pollsters or weatherpersons boost their trustworthiness. To make this idea formally precise, we will now combine the tools from logic and set theory and consider


sets of formulas. We say that a set of formulas Γ = {α₁, α₂, . . .} logically implies a formula β, Γ ⊨ β, if and only if β is true in every logically possible case in which all of the formulas in Γ are true. Given this terminology, a slightly restricted version of the prediction criterion (PC) says the following:

Prediction criterion: For all sentences h and sets of sentences B, B confirms h if there is a sentence e in B such that B without e does not logically imply e, B \ {e} ⊭ e, but does so in the presence of h, (B \ {e}) ∪ {h} ⊨ e.

Much like Nicod’s criterion, the prediction criterion provides a sufficient condition for confirmation. In contrast to this, the equivalence condition provides a necessary condition for confirmation. The following reformulation of the equivalence condition makes this clear: For all sentences e and h₁, e confirms h₁ only if e confirms every sentence h₂ that is logically equivalent to h₁.

Here is an example. Our background assumptions include the information that subway tokens are made of two metals to prevent counterfeiting. Consider the hypothesis that all metals expand when heated, h = ∀x (M(x) ∧ H(x) → E(x)). Suppose we have data to the effect that a particular token a is heated and expands: B = {M(a), H(a), E(a)}. According to universal instantiation (UI), our universal hypothesis that all metals expand when heated logically implies that the particular token a expands if it is metallic and heated, ∀x (M(x) ∧ H(x) → E(x)) ⊨ M(a) ∧ H(a) → E(a). Given this, we can use the method of truth tables to establish the following two truths of logic: B without E(a) does not logically imply E(a), that is, {M(a), H(a)} ⊭ E(a), but does so in the presence of M(a) ∧ H(a) → E(a), that is, {M(a), H(a)} ∪ {M(a) ∧ H(a) → E(a)} ⊨ E(a), and, hence (see below), also in the presence of h. The prediction criterion then implies that our data regarding the particular token a confirm the hypothesis that all metals expand when heated.
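The two truth-table facts used in the token example can be verified with a short script. This is a hedged sketch: the helper `entails` and the variable names are invented for illustration, and the letters M, H, E stand for the atomic sentences M(a), H(a), E(a). The script checks only the propositional instance M(a) ∧ H(a) → E(a); the further step to the quantified hypothesis h is not modelled here.

```python
from itertools import product


def entails(premises, conclusion, n_vars: int) -> bool:
    """Γ ⊨ β iff β is true in every row in which all members of Γ are true."""
    return all(conclusion(*row)
               for row in product([True, False], repeat=n_vars)
               if all(prem(*row) for prem in premises))


# Atomic sentences about the token a.
M_a = lambda M, H, E: M   # a is metallic
H_a = lambda M, H, E: H   # a is heated
E_a = lambda M, H, E: E   # a expands

# Instantiation of the hypothesis for a: M(a) ∧ H(a) → E(a).
h_instance = lambda M, H, E: (not (M and H)) or E

# B \ {E(a)} alone does not imply E(a) ...
without_h = entails([M_a, H_a], E_a, 3)
# ... but it does once the instantiation of h is added.
with_h = entails([M_a, H_a, h_instance], E_a, 3)

print(without_h, with_h)  # False True
```

Both conditions of the prediction criterion are thus met for e = E(a), at least at the level of the propositional instance.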


The logical principle we have just used is the principle of monotonicity, which holds for classical logic but not necessarily for other logics. More precisely, the principle of monotonicity (M) says that a set of formulas Γ⁺ logically implies a formula β, if the set of formulas Γ logically implies β, and if Γ⁺ logically implies every formula in Γ. In particular, this is so if Γ is a subset of Γ⁺.

One of the shortcomings of the prediction criterion is that there are some universal hypotheses that make predictions only in the presence of other universal hypotheses. Hence, the former can never be confirmed on the basis of the prediction criterion. For instance, consider the hypothesis that every human who lives at least as long as every human who ever has lived or will live will have consumed alcohol at most moderately, ∀x (H(x) ∧ ∀y (H(y) → L(x, y)) → M(x)). To confirm this hypothesis on the basis of the prediction criterion, we would need information to the effect that some human a has lived at least as long as every human who ever has lived or will live, H(a) and ∀y (H(y) → L(a, y)). The latter is itself a universal hypothesis that we can at best infer inductively but that we can never establish on the basis of observation alone.

4.4 THE LOGIC OF CONFIRMATION

Let us now consider four more of Hempel’s conditions of adequacy for any definition of confirmation. As before, we will suppress the background assumptions b.

Entailment condition: For all sentences e and h, if e logically implies h, then e confirms h.

Special consequence condition: For all sentences e, h₁, and h₂, if e confirms h₁ and h₁ logically implies h₂, then e confirms h₂.


Special consistency condition: For all sentences e, h₁, and h₂, if e is not logically false and confirms h₁, and h₁ logically implies ¬h₂, then e does not confirm h₂.

Converse consequence condition: For all sentences e, h₁, and h₂, if e confirms h₁ and h₂ logically implies h₁, then e confirms h₂.

The entailment condition (EntC) formulates the idea that conclusive proof is a special case of confirmation. For instance, that there are palm trees and jackals in the Sahara proves conclusively, and hence also confirms, that there is life in some deserts (given background assumptions such as that the Sahara is a desert, that jackals and palm trees are animals and plants, respectively, and that animals and plants are living objects).

The special consequence condition (SCC) formulates the idea that confirmation is transferred from the confirmed hypothesis to its logical consequences. For instance, before moving to Canada, I gathered data confirming the conjunctive hypothesis that Ottawa is the capital of Canada and that Ottawa is a city in Ontario. According to the special consequence condition, I thereby also gathered data confirming the hypothesis that the capital of Canada is a city in Ontario.

The special consistency condition (SConsC) formulates the idea that confirmation of a hypothesis is confirmation that the hypothesis is true, and so is not confirmation that the hypothesis is false. The (non-contradictory) data I gathered before moving to Canada confirm the conjunctive hypothesis that Ottawa is the capital of Canada and that Ottawa is a city in Ontario. According to the special consistency condition, these data do not confirm the further hypotheses that the capital of Canada is not a city in Ontario and that there is no capital of Canada. The reason is that the confirmed hypothesis logically implies that both of these further hypotheses are false.

DEDUCTIVE APPROACHES TO CONFIRMATION

The converse consequence condition (CCC) formulates the idea that hypotheses are often part of larger theoretical networks. For instance, Newton’s second law of motion states, roughly, that the net force on an object equals the object’s mass multiplied by its acceleration, F = m · a. Together with the first and third laws, the second law of motion forms classical mechanics. If we now have data confirming the second law of motion, then, according to the converse consequence condition, we also have confirmation for classical mechanics as a whole of which the second law of motion is a part.

A different example is provided by the “theory” that all humans need to eat. The humans we meet are all contemporaries of ours, and they all need to eat, so our data confirm the hypothesis that all our contemporaries need to eat. This hypothesis is part of the more general theory that all humans, and not just all our contemporaries, need to eat. According to the converse consequence condition, our data also confirm this more general theory.

The equivalence condition (EC), which was mentioned earlier, has not been listed because it follows logically from the special consequence condition (and also from the converse consequence condition, as you will show). Let us show this. First note that EC is a universally quantified if-then sentence. When we want to show that a universally quantified sentence ∀x (α) follows from another sentence, we need to apply universal generalization (UG). This means we first have to consider an arbitrary instantiation α [c/x] of ∀x (α). If we can show that this arbitrary instantiation is true, then we can use UG to conclude that the universally quantified statement ∀x (α [c/x] [x/c]) (that is, ∀x (α)) is true. Furthermore, when we want to show that an if-then sentence (or conditional) α → β is true, it is useful to first assume the if-part (or antecedent) α, then derive the then-part (or consequent) β, and then, finally, remove the assumption and put it back as if-part before the derived then-part to obtain α → β. This is called conditional proof (CP).

1. For all sentences x, y, and z: If x confirms y and y logically implies z, then x confirms z.
   (SCC)
2. a confirms b and b is logically equivalent to c.
   (Assumption for CP as well as for UG, where a, b, and c are arbitrary sentences that have not occurred prior to 2., that is, in 1.)
3. a confirms b and b logically implies c.
   (From 2. and the definition of logical equivalence.)
4. For all sentences y and z: If a confirms y and y logically implies z, then a confirms z.
   (From 1. by UI with a for x.)
5. For all sentences z: If a confirms b and b logically implies z, then a confirms z.
   (From 4. by UI with b for y.)
6. If a confirms b and b logically implies c, then a confirms c.
   (From 5. by UI with c for z.)
7. a confirms c.
   (From 3. and 6.)
8. If a confirms b, and b and c are logically equivalent, then a confirms c.
   (From 2. and 7. by CP, which puts 2. as if-part in front of the then-part 7.)
9. For all sentences z: If a confirms b, and b is logically equivalent to z, then a confirms z.
   (From 8. by UG, which replaces c, which was arbitrary (did not occur prior to 2.), by z, which is new (did not occur in 8.).)
10. For all sentences y and z: If a confirms y, and y is logically equivalent to z, then a confirms z.
   (From 9. by UG, which replaces b, which was arbitrary (did not occur prior to 2.), by y, which is new (did not occur in 9.).)
11. For all sentences x, y, and z: If x confirms y, and y is logically equivalent to z, then x confirms z.
   (From 10. by UG, which replaces a, which was arbitrary (did not occur prior to 2.), by x, which is new (did not occur in 10.).)
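The derivation above establishes that EC follows from SCC for every relation of confirmation whatsoever. As a finite sanity check, and not as a substitute for the proof, the sketch below enumerates all 512 two-place relations on a tiny, made-up stock of three sentences and looks for one that satisfies SCC but violates EC. The sentences and the assumed entailment facts between them are hypothetical choices made only for illustration.

```python
from itertools import product

# Hypothetical mini-language: "p" and "p ∧ p" are distinct sentences that
# entail each other; "q" is logically unrelated to both.
SENTENCES = ["p", "p ∧ p", "q"]
ENTAILS = {(s, s) for s in SENTENCES} | {("p", "p ∧ p"), ("p ∧ p", "p")}

def equivalent(y, z):
    return (y, z) in ENTAILS and (z, y) in ENTAILS

def satisfies_scc(confirms):
    """If x confirms y and y logically implies z, then x confirms z."""
    return all((x, z) in confirms
               for (x, y) in confirms for z in SENTENCES
               if (y, z) in ENTAILS)

def satisfies_ec(confirms):
    """If x confirms y and y is logically equivalent to z, then x confirms z."""
    return all((x, z) in confirms
               for (x, y) in confirms for z in SENTENCES
               if equivalent(y, z))

PAIRS = list(product(SENTENCES, repeat=2))

def all_relations():
    """Yield every two-place relation on SENTENCES as a set of pairs."""
    for bits in product([False, True], repeat=len(PAIRS)):
        yield {pair for pair, keep in zip(PAIRS, bits) if keep}

counterexamples = [r for r in all_relations()
                   if satisfies_scc(r) and not satisfies_ec(r)]
print(len(counterexamples))
```

The search finds no counterexample, as the derivation predicts. The check is not vacuous: the relation containing only the pair ("q", "p") violates EC, and accordingly it violates SCC as well.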


Let us say that a sentence e PC-confirms a sentence h if, and only if, e is a conjunction of two sentences e1 and e2 such that h ∧ e1 logically implies e2, but e1 does not. This is, of course, nothing but an alternative formulation of the concept of confirmation that is in play in the prediction criterion when the set of sentences B is finite. (As an aside, note that both formulations have the awkward feature that whether there is confirmation for some hypothesis h depends on the particular phrasing of the sentences in the set of sentences B and of the sentence e. This can be fixed, but it renders the two formulations more complicated.)

PC-confirmation satisfies the converse consequence condition. In other words, the converse consequence condition is true of PC-confirmation. To show this, we need to show that the converse consequence condition comes out true if we replace all occurrences of the word ‘confirmation’ in it by an occurrence of the word ‘PC-confirmation.’ That is, we need to show that the following sentence is true: For all sentences x, y, and z: If x PC-confirms y, and z logically implies y, then x PC-confirms z. This is again a universally quantified if-then sentence, and so we need to apply UG. As before, it will be helpful to proceed by conditional proof (CP).

1. a PC-confirms b and c logically implies b.
   (Assumption for CP as well as for UG, where a, b, and c are arbitrary sentences that have not occurred prior to 1.)
2. a is a conjunction of two sentences e1 and e2 such that b ∧ e1 logically implies e2, but e1 does not.
   (From 1. and the definition of PC-confirmation.)
3. a is a conjunction of two sentences e1 and e2 such that c ∧ e1 logically implies e2, but e1 does not.
   (From 1., 2., and M.)


4. a PC-confirms c.
   (From 3. and the definition of PC-confirmation.)
5. If a PC-confirms b, and c logically implies b, then a PC-confirms c.
   (From 1. and 4. by CP, which puts 1. as if-part in front of the then-part 4.)
6. For all sentences z: If a PC-confirms b, and z logically implies b, then a PC-confirms z.
   (From 5. by UG, which replaces c, which was arbitrary (did not occur prior to 1.), by z, which is new (did not occur in 5.).)
7. For all sentences y and z: If a PC-confirms y, and z logically implies y, then a PC-confirms z.
   (From 6. by UG, which replaces b, which was arbitrary (did not occur prior to 1.), by y, which is new (did not occur in 6.).)
8. For all sentences x, y, and z: If x PC-confirms y, and z logically implies y, then x PC-confirms z.
   (From 7. by UG, which replaces a, which was arbitrary (did not occur prior to 1.), by x, which is new (did not occur in 7.).)

Hempel’s conditions of adequacy are search criteria for a definition of confirmation, much like the search criteria one enters when shopping online for a condo: one’s price range, the distance from one’s workplace, the number of bedrooms, the floor the condo is on, and whether the condo has a balcony. The worst thing that can happen when one shops online for a condo is that there is no condo that meets all one’s criteria. In this case, the search returns no results.

For purposes of illustration, suppose first one is looking for a three-bedroom condo within walking distance of campus for less than $100,000. Chances are that one’s search on the local real estate sites will not return any results. However, this is not because it would be logically impossible to find such a condo. It is just unlikely in the current real estate market. Now suppose one is looking for a one-bedroom condo that is on the first, or ground, floor because one has a fear of heights, and that also has a balcony because of the resale value. Now one’s search is guaranteed to not return any results. The reason is that balconies are, by definition, not on the first floor. Given this definition, it is logically impossible to find a one-bedroom condo with a balcony on the first floor, anywhere, and not just within walking distance of campus.

Something similar happened with Hempel’s search for a definition of confirmation. For starters, note that there are sentences, or else this book would not exist. Furthermore, every sentence confirms itself because of the entailment condition and the logical truth that every sentence logically implies itself. So there is at least one sentence e that confirms e. We can assume that e is not logically false (if e is logically false, then its negation is not, and we work with the latter instead). Furthermore, e logically implies ¬¬e. Therefore, the special consistency condition implies that e does not confirm ¬e. So there are sentences e and h (take h to be ¬e) such that e does not confirm h. In other words, not every sentence confirms every sentence.

A concept of confirmation according to which every sentence confirms every sentence is trivial. We have just seen that the entailment condition and the special consistency condition require that confirmation not be trivial. Unfortunately, triviality follows from Hempel’s conditions of adequacy. Taken together these conditions are therefore as inconsistent as a search for a one-bedroom condo with a balcony on the first floor. Here is a proof that triviality follows from the entailment condition and the converse consequence condition.

1. a logically implies a ∨ b.
   (Truth of logic, where a and b are arbitrary sentences that have not occurred prior to 1.)
2. a confirms a ∨ b.
   (From 1. and EntC.)
3. b logically implies a ∨ b.
   (Truth of logic.)
4. a confirms b.
   (From 2. and 3. by CCC.)


5. For all sentences y: a confirms y.
   (From 4. by UG, which replaces b, which was arbitrary (did not occur prior to 1.), by y, which is new (did not occur in 4.).)
6. For all sentences x and y: x confirms y.
   (From 5. by UG, which replaces a, which was arbitrary (did not occur prior to 1.), by x, which is new (did not occur in 5.).)

How did Hempel respond? Much like a condo hunter who gives up on the idea of a balcony and settles for a one-bedroom condo on the first floor, Hempel rejected the converse consequence condition. In addition, Hempel presented a definition of confirmation that satisfies the entailment condition, the special consequence condition, and the special consistency condition. Before turning to this definition in the next section, let me note that, despite Hempel’s “triviality result,” the converse consequence condition has been popular in the philosophy of science, presumably because it is in the spirit of Popper’s falsificationism (Section 4.6) and is satisfied by hypothetico-deductive confirmation (Section 4.7).
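The triviality proof in this section can also be watched at work in a small model. The sketch below makes the simplifying assumption that a sentence is identified with the set of truth-value assignments (to two sentence letters) in which it is true. It starts from the pairs required by the entailment condition and keeps adding whatever the converse consequence condition demands until nothing changes. The function name is hypothetical.

```python
from itertools import product

# Assumption of this sketch: sentences are sets of truth-value assignments
# to two sentence letters, so there are 16 sentences up to equivalence.
WORLDS = list(product([False, True], repeat=2))
PROPS = [frozenset(w for w, keep in zip(WORLDS, bits) if keep)
         for bits in product([False, True], repeat=len(WORLDS))]

def entails(p, q):
    return p <= q

def close_under_entc_and_ccc():
    # EntC: whenever e logically implies h, e confirms h.
    confirms = {(e, h) for e in PROPS for h in PROPS if entails(e, h)}
    changed = True
    while changed:
        changed = False
        # CCC: if e confirms h1 and h2 logically implies h1, e confirms h2.
        for (e, h1) in list(confirms):
            for h2 in PROPS:
                if entails(h2, h1) and (e, h2) not in confirms:
                    confirms.add((e, h2))
                    changed = True
    return confirms

entailment_pairs = sum(1 for e in PROPS for h in PROPS if entails(e, h))
closure = close_under_entc_and_ccc()
print(entailment_pairs, len(closure), len(PROPS) ** 2)
```

In this model the entailment condition alone requires 81 of the 256 possible pairs; adding the converse consequence condition forces all 256, so every sentence confirms every sentence.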

4.5 THE SATISFACTION CRITERION

The development of a hypothesis h for a set of individual constants I is the hypothesis Dev_I(h) that h is true if the individuals named in I are all the objects that exist. For instance, let h be the universally quantified hypothesis that all students live near campus, ∀x (S(x) → C(x)), and suppose I is the set containing the individual constants ‘a,’ which is another name for Ann, ‘b,’ which is another name for Bob, ‘c,’ which is another name for Claire, and ‘d,’ which is another name for Didier. The development of h for I is the hypothesis that all of Ann, Bob, Claire, and Didier live near campus if they are students,

Dev_I(h) = ((S(a) → C(a)) ∧ (S(b) → C(b)) ∧ (S(c) → C(c)) ∧ (S(d) → C(d))).

In other words, the development of h for I is a conjunction saying that Ann lives near campus if she is a student, and Bob lives near campus if he is a student, and Claire lives near campus if she is a student, and Didier lives near campus if he is a student.

Now consider the existentially quantified hypothesis g that at least one student lives near campus, ∃x (S(x) ∧ C(x)). The development of g for I says that at least one of Ann, Bob, Claire, and Didier is a student and lives near campus,

Dev_I(g) = ((S(a) ∧ C(a)) ∨ (S(b) ∧ C(b)) ∨ (S(c) ∧ C(c)) ∨ (S(d) ∧ C(d))).

In other words, the development of g for I is a disjunction saying that Ann is a student and lives near campus, or Bob is a student and lives near campus, or Claire is a student and lives near campus, or Didier is a student and lives near campus.

When we form the development of a hypothesis for a set of individual constants, a universal quantifier is replaced by a big conjunction with one conjunct per individual constant. An existential quantifier is replaced by a big disjunction with one disjunct per individual constant. All other parts of the hypothesis remain unchanged. The development of the hypothesis f that Angela Merkel is chancellor of Germany in August 2017 for I is just f, as there are no quantifiers in f.

In defining confirmation, Hempel proceeded in two steps by first defining a special case of confirmation called ‘direct confirmation,’ and then broadening or generalizing this special case. This is a common procedure in philosophy that is also used in the definition of other concepts such as causation (some philosophers first define a special case of causation called ‘direct causation,’ and then generalize or broaden this special case by including “chains” of direct causation).

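The recipe for forming developments (a conjunction for each universal quantifier, a disjunction for each existential quantifier) is mechanical, so it can be sketched in a few lines. The helper names below are hypothetical, and formulas are represented simply as strings; this is an illustration of the recipe, not a general parser for predicate logic.

```python
def develop_universal(body, constants):
    """Replace ∀x by a conjunction with one conjunct per individual constant."""
    return "(" + " ∧ ".join(body(c) for c in constants) + ")"

def develop_existential(body, constants):
    """Replace ∃x by a disjunction with one disjunct per individual constant."""
    return "(" + " ∨ ".join(body(c) for c in constants) + ")"

CONSTANTS = ["a", "b", "c", "d"]  # Ann, Bob, Claire, Didier

# h: all students live near campus, ∀x (S(x) → C(x))
dev_h = develop_universal(lambda x: f"(S({x}) → C({x}))", CONSTANTS)

# g: at least one student lives near campus, ∃x (S(x) ∧ C(x))
dev_g = develop_existential(lambda x: f"(S({x}) ∧ C({x}))", CONSTANTS)

# Nested quantifiers are developed from the outside in, for example
# ∀x∃y (L(x, y)) for the two constants 'a' and 'b'.
dev_l = develop_universal(
    lambda x: develop_existential(lambda y: f"L({x}, {y})", ["a", "b"]),
    ["a", "b"],
)

print(dev_h)
print(dev_g)
print(dev_l)
```

A sentence without quantifiers needs no helper at all: its development is the sentence itself.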
A set of sentences B directly confirms a hypothesis h if, and only if, B logically implies the development of h for those individual constants that occur essentially in B. An individual constant occurs essentially in a set of sentences just in case there is no logically equivalent set of sentences (that is, one where both sets logically imply all sentences of the other set) that does not use it. For instance, let B = {S(a) ∧ C(a), ¬S(b) ∧ C(b), ¬S(c) ∧ C(c), ¬S(d), F(e) ∨ ¬F(e)}. Then I is the set containing the individual constants ‘a,’ ‘b,’ ‘c,’ and ‘d’ because these are the individual constants that occur essentially in B. ‘e,’ on the other hand, does not occur essentially in B because there is an alternative set of sentences, namely {S(a) ∧ C(a), ¬S(b) ∧ C(b), ¬S(c) ∧ C(c), ¬S(d)}, that is logically equivalent to B and does not use ‘e.’ B logically implies Dev_I(h). Therefore, B directly confirms h.

Direct confirmation is a special case of Hempel-confirmation. A set of sentences B Hempel-confirms a hypothesis h if, and only if, the set of all hypotheses that are directly confirmed by B, DC(B), logically implies h. For instance, let l be the hypothesis that everybody loves someone, ∀x∃y (L(x, y)), and let B = {L(a, b), L(b, b)}. I is the set containing the individual constants ‘a’ and ‘b.’ Dev_I(l) = (L(a, a) ∨ L(a, b)) ∧ (L(b, a) ∨ L(b, b)). B directly confirms l, so l ∈ DC(B). l logically implies that Claire loves someone, ∃y (L(c, y)). Therefore B, which tells us that Ann loves Bob and that Bob loves himself, Hempel-confirms the hypothesis that Claire loves someone.

The above definition of Hempel-confirmation is known as the satisfaction criterion (SC). It can also be formulated as follows: A sentence e SC-confirms a sentence h if, and only if, the set containing the sentence e, {e}, Hempel-confirms h.

It is a useful exercise to show that SC-confirmation satisfies the entailment condition, the special consequence condition, and the special consistency condition, much like a south-facing one-bedroom condo on the first floor on campus meets the search criteria of a one-bedroom condo on the first floor within walking distance of campus.
As before, this means showing that these conditions are true of SC-confirmation. One does so by replacing every occurrence of ‘confirmation’ in them by an occurrence of ‘SC-confirmation,’ and then applying the definition of SC-confirmation to show that the resulting conditions come out true.

Finally, we can now say a bit more about the logically possible cases of predicate logic. Recall that, in propositional logic, the logically possible cases are the lines in a truth table that lists all the relevant sentence letters. We have formulated predicate logic by assuming that there is a name, or individual constant, for each object. Given this assumption, the logically possible cases of predicate logic are or can be represented as sets of sentences B that include the sentence ‘P(a)’ or the sentence ‘¬P(a),’ but not both, for every predicate ‘P’ and every individual constant ‘a;’ as well as the sentence ‘R(a, b)’ or the sentence ‘¬R(a, b),’ but not both, for every binary relation symbol ‘R’ and any two (not necessarily distinct) individual constants ‘a’ and ‘b;’ and so on for all other n-ary relation symbols and n occurrences of individual constants. These sets of sentences are “maximal consistent”: One cannot derive a contradiction such as p ∧ ¬p from them (consistency), but one can derive such a contradiction from them as soon as one adds one new sentence (maximality). In Chapter 7, we will call these sets state descriptions.
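Because the logically possible cases can be represented as state descriptions, a claim of logical implication in a small language can be checked by brute force: a premise logically implies a conclusion just in case the conclusion is true in every state description in which the premise is true. The sketch below applies this to the direct-confirmation example from this section. It assumes a language with only the predicates ‘S’ and ‘C’ and the constants ‘a’ to ‘d’, takes the set I of essentially occurring constants as given rather than computing it, and leaves out the logically true member F(e) ∨ ¬F(e) of B. All function names are hypothetical.

```python
from itertools import product

CONSTANTS = ["a", "b", "c", "d"]
ATOMS = [(pred, c) for pred in ("S", "C") for c in CONSTANTS]

def state_descriptions():
    """Yield every assignment of truth values to the atomic sentences."""
    for values in product([True, False], repeat=len(ATOMS)):
        yield dict(zip(ATOMS, values))

def evidence(w):
    # B = {S(a) ∧ C(a), ¬S(b) ∧ C(b), ¬S(c) ∧ C(c), ¬S(d)}
    return (w[("S", "a")] and w[("C", "a")]
            and not w[("S", "b")] and w[("C", "b")]
            and not w[("S", "c")] and w[("C", "c")]
            and not w[("S", "d")])

def dev_h(w):
    # Development of ∀x (S(x) → C(x)) for I = {a, b, c, d}
    return all((not w[("S", c)]) or w[("C", c)] for c in CONSTANTS)

def dev_g(w):
    # Development of ∃x (S(x) ∧ C(x)) for I
    return any(w[("S", c)] and w[("C", c)] for c in CONSTANTS)

def dev_everyone_is_a_student(w):
    # Development of ∀x (S(x)) for I, a hypothesis added for contrast
    return all(w[("S", c)] for c in CONSTANTS)

def logically_implies(premise, conclusion):
    return all(conclusion(w) for w in state_descriptions() if premise(w))

print(logically_implies(evidence, dev_h))
print(logically_implies(evidence, dev_g))
print(logically_implies(evidence, dev_everyone_is_a_student))
```

On these assumptions B logically implies the developments of h and g, so it directly confirms both hypotheses, whereas it does not directly confirm the contrast hypothesis that everyone is a student.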

4.6 FALSIFICATIONISM

In Logik der Forschung (1935) (“Logic of Research”), Popper rejects the logic of induction for the following reason: It leads to an infinite regress, as we can justify induction only inductively, or to “a priorism,” as we can only postulate or assume, but never derive or justify, the principle of induction.

This argument is similar to the regress argument for academic skepticism about knowledge. The latter starts by postulating the requirement that one knows a proposition only if one can justify this proposition by some other known proposition, which is sometimes called a reason for the first proposition. Then it proceeds by pointing out that the known proposition must itself be justified by some further known proposition, and so on. The following three options emerge: The chain of justifications or reasons goes on ad infinitum; the chain of justifications or reasons stops at an arbitrary proposition that is dogmatically claimed to be known; or, finally, the chain of justifications or reasons eventually returns to the proposition one has started with, in which case one is caught in a vicious circle.

The academic skeptic about knowledge finds all three options wanting and concludes that we do not know any proposition. The infinitist finds the first option best and claims that we can justify a proposition by an infinite chain of reasons. The foundationalist finds the second option best and claims that there are propositions which we know even if we do not have any reasons for these propositions. The coherentist finds the third option best and claims that there can be justification that is non-viciously circular. (These responses can also be combined.)

Having rejected the logic of induction, Popper goes on to replace it with the deductive method of hypothesis testing. Popper distinguishes between the context of discovery and the context of justification. The former is the context in which scientists come up with hypotheses, for example, the context in which the German chemist Kekulé allegedly dreamed of the ring shape of the benzene molecule. This context is irrelevant to epistemology, for which only the context of justification matters. To continue with our example, the latter is the context in which Kekulé argues for the truth of his novel theory and defends it against objections.

The deductive method of hypothesis testing consists in deductively inferring an empirically testable consequence e from the hypothesis h that is to be tested.
As a first approximation, we can define a sentence to be empirically testable if, and only if, it is a conjunction of one or more atomic sentences or their negations. Atomic sentences are sentences of the form ‘R(a1, ..., an),’ where ‘R’ is an n-ary relation symbol and ‘a1,’ ..., ‘an’ are n individual constants. If an atomic sentence is to be empirically testable, its relation symbol ‘R’ must denote an observable relation (which raises the question whether there even are any relations or properties in the external world that are directly observable and not just indirectly inferable). If e is verified, then h is temporarily corroborated. (What Popper calls ‘corroboration’ roughly corresponds to what we call ‘confirmation,’ although there are differences.) If e is falsified, then h is forever falsified. What one never does or never should do, according to Popper, is to make inductive inferences from particular facts to general hypotheses. The alleged truth of logic underlying the deductive method of hypothesis testing is the claim that “general” (this presumably means ‘quantified’) hypotheses can only be falsified by atomic sentences reporting particular facts, but cannot be verified by such sentences.

The background against which Popper proposes this method is logical positivism, in particular in the form in which its proponents in the Vienna Circle have defended it (Uebel 2006). At some point, these philosophers proposed verification as a criterion of meaningfulness or significance that demarcates the meaningful claims of scientific disciplines such as physics from the meaningless claims of speculative disciplines such as metaphysics.
A slogan associated with this verificationist criterion of meaning is that “the meaning of a sentence is its method of verification.” Instead of verification, Popper proposes falsifiability, though not as a criterion of meaningfulness, but as a criterion of demarcation that demarcates the scientific claims of, say, physics and empirical psychology from the potentially meaningful, but pseudo-scientific claims of astrology and psychoanalysis. According to Popper, a sentence is scientific if, and only if, it is falsifiable. The idea is that an empirical-scientific system must be able to founder on experience. Note that the criterion is falsifiability, not falsification. A sentence does not have to be falsified (and, hence, false) in order to be scientific. It merely has to be (logically?) possible to falsify it. Popper also thinks the objectivity of scientific claims lies in their intersubjective testability. This means that only those experimental results whose experiments can be repeated can be used to test a hypothesis.

Popper does not quite escape the dilemma that he cites as a reason for rejecting the logic of induction. This is the “problem of the basis.” The empirically testable consequences on the basis of which we falsify or corroborate general scientific hypotheses (that is, the “basic sentences” that are the potential falsifiers or corroborators of scientific hypotheses) must, of course, be intersubjectively testable and falsifiable themselves in order to be scientific. This in turn requires those basic sentences to have empirically and intersubjectively testable consequences, and so on ad infinitum. Popper stops this infinite regress by claiming that the decision as to which sentences to accept as the empirical basis ultimately rests on convention.

Among the further problems for Popper’s falsificationism is that many seemingly scientific hypotheses are not falsifiable. The sentence ‘Every planet will have some form of life on it at some point in time,’ ∀x (P(x) → ∃y∃t (L(y, x, t))), is not falsifiable. The reason is that we would need to find a planet a on which there is no form of life y at any time t. Yet the sentence P(a) ∧ ¬∃y∃t (L(y, a, t)), or P(a) ∧ ∀t∀y¬ (L(y, a, t)), describing such a planet a includes itself the universal hypothesis ∀t∀y¬ (L(y, a, t)) that can at best be falsified, but not verified, by an empirically and intersubjectively testable sentence.
This is because there will always be a future time t at which some form of life y might develop on planet a.

Next there is the problem that falsifying consequences often indicate errors that do not lie in the “target” hypothesis that is tested, but elsewhere, say, errors of measurement, or errors in an “auxiliary” hypothesis that was relied upon in testing the target hypothesis. This is sometimes called the Quine-Duhem thesis. For instance, when testing the hypothesis that an allergy is present, one relies on the auxiliary hypotheses that the test is somewhat reliable and that one did not misread the test result. In The Aim and Structure of Physical Theory (1914), Duhem points out that hypothesis testing is always holistic in this sense. Hempel, in “Problems and Changes in the Empiricist Criterion of Meaning” (1950), and Quine, in “Two Dogmas of Empiricism” (1951), extend this thesis to the claim that not only the testing of sentences, but even their meaning, is holistic.

Finally, there is the problem that the deductive method of hypothesis testing relies upon deduction, which is, of course, just as much in need of justification as induction. As we will see in Section 7.4, the justification of deduction faces the same dilemma between infinite regress and “a priorism” as the logic of induction.

4.7 HYPOTHETICO-DEDUCTIVE CONFIRMATION

While Popper rejects the logic of induction, falsificationism provides the background for one of the most popular accounts of confirmation, viz. hypothetico-deductive (HD) confirmation. The idea behind this concept of confirmation is similar to the prediction criterion: Hypotheses are confirmed if they survive many and severe tests, and if they make many strong and successful predictions. Sentence e HD-confirms sentence h given sentence b just in case the conjunction of h and b, h ∧ b, logically implies e, but b does not.

HD-confirmation, much like PC-confirmation, satisfies the converse consequence condition, as you will show below. Here we will establish a feature of HD-confirmation that is considered to cast doubt on its adequacy. The problem of irrelevant conjunction is that HD-confirmation satisfies the following condition: For all sentences e, h, and b, if e confirms h given b, then it holds for every sentence t that e confirms the conjunction h ∧ t given b. Here is a proof that this condition is true of HD-confirmation.

1. a HD-confirms c given d.
   (Assumption for UG and CP, where a, c, and d are arbitrary sentences that have not occurred prior to 1.)
2. c ∧ d logically implies a.
   (From 1. by the definition of HD-confirmation.)
3. (c ∧ f) ∧ d logically implies a.
   (From 2. and M, where f is an arbitrary sentence that has not occurred prior to 3.)
4. a HD-confirms c ∧ f given d.
   (From 3. by the definition of HD-confirmation.)
5. For all sentences w: a HD-confirms c ∧ w given d.
   (From 4. by UG, which replaces f, which was arbitrary (did not occur prior to 3.), by w, which is new (did not occur in 4.).)
6. If a HD-confirms c given d, then for all sentences w: a HD-confirms c ∧ w given d.
   (From 1. and 5. by CP, which puts 1. as if-part in front of the then-part 5.)
7. For all sentences z: If a HD-confirms c given z, then for all sentences w: a HD-confirms c ∧ w given z.
   (From 6. by UG, which replaces d, which was arbitrary (did not occur prior to 1.), by z, which is new (did not occur in 6.).)
8. For all sentences y and z: If a HD-confirms y given z, then for all sentences w: a HD-confirms y ∧ w given z.
   (From 7. by UG, which replaces c, which was arbitrary (did not occur prior to 1.), by y, which is new (did not occur in 7.).)


9. For all sentences x, y, and z: If x HD-confirms y given z, then for all sentences w: x HD-confirms y ∧ w given z.
   (From 8. by UG, which replaces a, which was arbitrary (did not occur prior to 1.), by x, which is new (did not occur in 8.).)

For instance, Mercury’s anomalous 43 arc seconds (per century) advance of its perihelion HD-confirms the general theory of relativity (GTR) given “appropriate” background assumptions. Therefore, it also HD-confirms the conjunction of GTR and the claim that there is life on Mars. Allegedly this seems to be false.

A similar problem is the problem of irrelevant disjunction, which is that HD-confirmation satisfies the following condition: For all sentences e, h, and b, if e confirms h given b, then it holds for every sentence t that the disjunction e ∨ t confirms h given b. For instance, the claim that Muhammad Ali is immortal if he is human HD-confirms the claim that all humans are immortal because the latter logically implies the former. Therefore, the true disjunction that Toronto is a city or Muhammad Ali is immortal if he is human HD-confirms the hypothesis that all humans are immortal. Again, allegedly this seems to be false.

Nicod’s criterion, the prediction criterion, Hempel’s satisfaction criterion, Popper’s falsificationism, and hypothetico-deductive confirmation only employ tools from deductive logic. They define concepts of confirmation (or corroboration) in terms of the logical consequence relation between sentences. All sciences, whether they are the life sciences, the natural sciences such as physics and chemistry, or the social sciences such as economics, political science, and sociology, heavily draw on the resources of statistics and probability theory. Yet statistical and probabilistic hypotheses do not logically imply anything that is itself not statistical or probabilistic, respectively, and not logically true. Therefore, they do not logically imply anything that has empirical content.


This is because statistical and probabilistic reasoning differs from deductive reasoning and cannot be captured solely in terms of the logical consequence relation. The two problems for hypothetico-deductive confirmation mentioned in this section illustrate this: e logically implies e ∨ t, but the former seems to confirm sentences that the latter does not; and h is logically implied by h ∧ t, but the former seems to be confirmed by sentences that the latter is not. This means that all these deductive approaches to confirmation (or corroboration) miss an indispensable part of actual science. We will turn to this part in the next chapter, which builds on logic and set theory (that is, the material from Chapters 1 and 2), but goes beyond it.
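The two problems discussed in this section can be checked mechanically in a small model. The sketch below assumes, for illustration only, that a sentence is identified with the set of truth-value assignments (to two sentence letters) in which it is true, so that conjunction is intersection and disjunction is union. Irrelevant conjunction is tested with the full definition of HD-confirmation. Irrelevant disjunction is tested with the simplified definition that suppresses the background assumption, on which e HD-confirms h just in case h logically implies e. The function names are hypothetical, and a brute-force check of this kind is an illustration, not a proof for all languages.

```python
from itertools import product

# Assumption of this sketch: sentences are sets of truth-value assignments
# to two sentence letters.
WORLDS = list(product([False, True], repeat=2))
PROPS = [frozenset(w for w, keep in zip(WORLDS, bits) if keep)
         for bits in product([False, True], repeat=len(WORLDS))]

def entails(p, q):
    return p <= q

def hd_confirms(e, h, b):
    """e HD-confirms h given b: h ∧ b logically implies e, but b does not."""
    return entails(h & b, e) and not entails(b, e)

def irrelevant_conjunction_holds():
    # If e HD-confirms h given b, then e HD-confirms h ∧ t given b.
    return all(hd_confirms(e, h & t, b)
               for e, h, b, t in product(PROPS, repeat=4)
               if hd_confirms(e, h, b))

def irrelevant_disjunction_holds():
    # Simplified definition: e HD-confirms h just in case h entails e.
    # If e HD-confirms h, then e ∨ t HD-confirms h.
    return all(entails(h, e | t)
               for e, h, t in product(PROPS, repeat=3)
               if entails(h, e))

print(irrelevant_conjunction_holds())
print(irrelevant_disjunction_holds())
```

Both checks succeed in this model, which matches the observation that tacking an arbitrary conjunct onto the hypothesis, or an arbitrary disjunct onto the evidence, never destroys HD-confirmation so defined.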

4.8 EXERCISES

Exercise 11: Show that HD-confirmation satisfies the converse consequence condition. You may suppress the background assumption in the definition of HD-confirmation so that sentence e HD-confirms sentence h just in case h logically implies e.

Exercise 12: Show that the equivalence condition follows logically from the converse consequence condition.

Exercise 13: The strong party hypothesis says that everybody who had time attended the party, ∀x (T(x) → P(x)). The weak party hypothesis says that somebody had time and attended the party, ∃x (T(x) ∧ P(x)). Albert, Simone, and Jean-Paul are notorious partygoers, although they do not have time to attend every party. Write down the development of the strong party hypothesis, as well as the development of the weak party hypothesis, for the set containing the names ‘Albert,’ ‘Simone,’ and ‘Jean-Paul.’


Exercise 14: A party scenario is a description of Albert, Simone, and Jean-Paul with respect to whether or not they had time, and with respect to whether or not they attended the party. Describe a party scenario that validates or satisfies the development of the strong party hypothesis as well as the development of the weak party hypothesis. That is, describe a party scenario in which the development of the strong party hypothesis as well as the development of the weak party hypothesis are true. Next describe a party scenario that satisfies the development of the strong party hypothesis, but that invalidates or dissatisfies the development of the weak party hypothesis. That is, describe a party scenario in which the development of the strong party hypothesis is true, but in which the development of the weak party hypothesis is false.

Exercise 15: Show that HD-confirmation satisfies the following condition: For all sentences e and h, if e confirms h, then it holds for every sentence t that the disjunction e ∨ t confirms h. You may suppress the background assumption in the definition of HD-confirmation so that sentence e HD-confirms sentence h if and only if h logically implies e.

As an afterthought to Exercise 14, note that the party scenarios you are asked to describe can be extended to state descriptions. These state descriptions are the logically possible cases for predicate logic if we assume that there is a name or an individual constant for every object. Every sentence that is true in a party scenario remains true in some state description that extends the party scenario. This means that any such sentence is true in at least one state description or logically possible case. This in turn means that any such sentence has been shown to not be logically false, that is, to be “satisfiable.” The construction of such scenarios or “models” is for predicate logic what the method of truth tables is for propositional logic: It is the most important method of showing that a sentence is not logically false in predicate logic.

READINGS

The recommended readings for Chapter 4 include:

Hempel, Carl G. (1945), Studies in the Logic of Confirmation. Mind 54, 1–26, 97–121.

and perhaps also

Popper, Karl R. (1935/2002), The Logic of Scientific Discovery. London, New York: Routledge. Part I: A Survey of Some Fundamental Problems, 3–26.

Sprenger, Jan (2011), Hypothetico-Deductive Confirmation. Philosophy Compass 6, 497–508.

CHAPTER 5

Probability

5.1 THE PROBABILITY CALCULUS

In Grundbegriffe der Wahrscheinlichkeitsrechnung (1933) (“Basic Concepts of the Probability Calculus”), the Russian mathematician Kolmogoroff, or Kolmogorov, provides the first axiomatization of the probability calculus. Probabilities can be defined on the set-theoretic framework of an algebra of propositions as well as the logical framework of a formal language. The propositions in an algebra can be thought of as the meanings, or contents, of the sentences of a formal language. Occasionally mathematicians call the elements of an algebra ‘events,’ but this is philosophically incorrect: The elements of an algebra are abstract sets, whereas events such as the 2016 U.S. presidential election are spatiotemporally extended concrete objects. Which of these two options, propositions or sentences, we choose depends on the application: Sometimes it is more convenient to define probabilities for the sentences of a formal language, and sometimes it is more convenient to define probabilities directly for the meanings of these sentences.

The core concept of the probability calculus is that of a probability space, which consists of three components. Its definition comes in several steps characterizing these three components. Let us illustrate how things work with possible pizza slice sizes, or areas of a baking tray on which to put pizza dough. Step 0 is the easiest step and consists in one being given a baking tray. The baking tray is the first component, and it can
be chosen arbitrarily. Steps 1-3 presuppose the first component and characterize the second component as the collection of all possible pizza slice sizes, or areas of the baking tray on which to put pizza dough.

1. The entire baking tray is a possible pizza slice size, or area of the baking tray on which to put pizza dough.
2. If the left half of the entire baking tray is a possible pizza slice size, or area of the baking tray on which to put pizza dough, then so is the right half. More generally, if one area of the baking tray is a possible pizza slice size, or area of the baking tray on which to put pizza dough, then so is the remaining area of the baking tray.
3. If two areas of the baking tray are possible pizza slice sizes, or areas of the baking tray on which to put pizza dough, then so is the combined area of the baking tray.

The second component, that is, the collection of all possible pizza slice sizes, or areas of the baking tray on which to put pizza dough, is a precondition for the third component. The third component is the distribution of one pound of pizza dough across the baking tray. It is characterized by steps 4-6.

4. The amount of pizza dough on each possible pizza slice size, or area of the baking tray on which to put pizza dough, is not negative.
5. The amount of pizza dough on the entire baking tray equals one pound.
6. The amount of pizza dough on a combined area of the baking tray consisting of two non-overlapping possible pizza slice sizes, or areas of the baking tray on which to put pizza dough, equals the amount on the first area plus the amount on the second area.

The possible pizza slice sizes, or areas of the baking tray on which to put pizza dough, are the carriers of the various amounts of pizza dough. That is, each possible pizza slice size has a specific amount of pizza dough on it. Note that, at least in principle, not every area of the baking tray must be a possible pizza slice size. One may decide that some areas are too small or otherwise unfit to be possible pizza slice sizes. One may also decide to not put any pizza dough on some possible pizza slice size. These latter areas differ from the former areas in that they do have a specific amount of pizza dough on them. It is just that this amount is zero, but that is, of course, also an amount.

Here is a different analogy. Suppose you move and sell the contents of your home. You put a price tag on each item you sell, including the items you give away for free. The items you sell are the carriers of the prices and price tags, just as propositions will be the carriers of probabilities. Now suppose someone wants to buy page 17 of your copy of Hume’s Treatise. Page 17 is not for sale, though; only the entire book is. In this sense, page 17 is “priceless”: Neither it, nor any combination of items including it such as your armchair and page 17, has a price or price tag. This is very different for the items you give away for free. They all have a price and price tag, namely (the price tag saying) $0.

Probabilities, or amounts of probability, behave like amounts of pizza dough, and the propositions to which probabilities are assigned behave like possible pizza slice sizes, or areas of the baking tray on which to put pizza dough. What follows in the rest of this section is the definition of a probability space ⟨W, 𝒜, Pr⟩ whose three components are a non-empty set of possible worlds W, an algebra of propositions 𝒜 over W, and a probability measure Pr on 𝒜. Each of these three components presupposes the previous one.¹

¹ The arrow-shaped brackets ‘⟨’ and ‘⟩’ denote “tuples,” which are similar to sets, except that the order in which their elements are listed matters, as does the number of times an element is listed. A tuple with two elements is called an “ordered pair,” and tuples with three, four, or five elements are called “triples,” “quadruples,” and “quintuples,” respectively.

The first component is a non-empty set W, which is given to one in step 0. The elements of this set can be anything. However, if we want the elements of the second component to be the meanings of sentences (that is, propositions), then it is best to think of the elements of the given non-empty set W as possible cases or “possible worlds” (Menzel 2013). Just about every philosopher has a theory of what possible worlds, or ways the world could have been, are. Some think possible worlds are real in the same manner the reality we inhabit is real. Others think possible worlds are more like ideas that describe or conceptualize different ways reality could have been. On all accounts possible worlds are alternatives to each other so that no two of them are compatible: If one possible world is identical to, or accurately describes, reality, then no other possible world is, or does, so as well. If the possible worlds are logically possible worlds or cases, then they are logical alternatives to each other. In particular, then, there is at most (usually: exactly) one possible world in W that is identical to, or accurately describes, reality. This possible world is called the actual world.

We assume possible worlds to be primitives. This means we assume that we have a sufficient understanding of the concept of a possible world so that we can use it to define other concepts. This may seem unsatisfactory, but it is important to note that one always has to assume some concepts as primitive. For instance, I am assuming the concepts I am using to explain probability as primitives in this book. You do the same when you are pointing out one of the many mistakes in it. If I didn’t understand what a mistake is, or if I understood it in a way that differs from your understanding, then you would not be able to point out a mistake (in your sense) to me.


The second component is the algebra of propositions over W. It presupposes the first component W and is defined in steps 1-3. An algebra of propositions 𝒜 over W is a set of subsets of W such that for all subsets A and B of W:

1. The entire set of possible worlds is a proposition, that is, W ∈ 𝒜;
2. If A is a proposition, then so is its complement with respect to W, that is, if A ∈ 𝒜, then (W \ A) ∈ 𝒜; and
3. If A and B are propositions, then so is their union, that is, if A ∈ 𝒜 and B ∈ 𝒜, then (A ∪ B) ∈ 𝒜.

Now that we have specified what to assign probabilities to, we only need to say how to assign them. This is what we do in steps 4-6, which define the third component. The third component is a probability measure Pr on 𝒜. It presupposes the second component 𝒜 and, therefore, also the first component W.

A probability measure is a function. Functions, or functional relations, are very common. For instance, the biological-mother-of function maps each human to her or his biological mother. Each human has exactly one biological mother, although some humans have the same biological mother, just as each proposition has exactly one probability, although some propositions have the same probability. Similarly, the height-of function maps each human to her height. Each human has exactly one height, although some humans have the same height. The child-of relation is not a function. It is not the case that each human has exactly one child. Some humans have no child, while others have more than one child.

A function is a relation between objects of one set and objects of another (or the same) set that is “total on the left” and “unique on the right.” That is, a function is a relation that relates each object of one set, the domain of the function, with exactly one object of another (or the same) set, the co-domain of the function. The amount-of-pizza-dough function relates each
possible pizza slice size with exactly one number, which is its amount of pizza dough. The probability function relates each proposition with exactly one number, which is its probability. Different possible pizza slice sizes and propositions can be related with the same number. However, no possible pizza slice size or proposition can be related with more than one number. Otherwise we could not speak of its amount of pizza dough or its probability, respectively.

The probability function Pr relates each proposition B from the algebra of propositions 𝒜 with exactly one real number, which is its probability Pr (B). The symbol ‘ℝ’ denotes the set of real numbers. To indicate that Pr is a function relating each proposition in 𝒜 with exactly one real number in ℝ, mathematicians sometimes use the notation ‘Pr : 𝒜 → ℝ.’ The arrow ‘→’ here does not represent a material conditional. Since probability functions are functions, and functions are total on the left (that is, on 𝒜) and unique on the right (that is, on ℝ), it follows that, for all A and B in 𝒜: If A = B, then Pr (A) = Pr (B). We will frequently make use of this feature. Finally, measures are functions with special properties. Since probability functions have these properties, we will call them probability measures.

Now we are in a position to formulate steps 4-6. A function whose domain is an algebra of propositions 𝒜 over a non-empty set of possible worlds W, and whose co-domain is the set of real numbers ℝ, Pr : 𝒜 → ℝ, is a probability measure if, and only if, the following holds for all propositions A and B in 𝒜:

4. Pr (A) ≥ 0 (Non-negativity)
5. Pr (W) = 1 (Normalization)
6. If (A ∩ B) = ∅, then Pr (A ∪ B) = Pr (A) + Pr (B) (Additivity)

The sets A and B in steps 4-6 are not arbitrary subsets of W but propositions in the algebra 𝒜. A subset of W that is not also
a proposition in 𝒜 does not have probability 0. Instead, it is not even assigned a probability and so has no probability. It is meaningless to speak of the probability of such a subset of W, just as it is meaningless to speak of the price of page 17 of your copy of Hume’s Treatise if you only sell the entire book.

Some distributions of pizza dough leave some areas of the baking tray on which to put pizza dough empty. However, if we want to bake pizza as the Italians do, namely with a thin crust, then we need to distribute a positive amount of pizza dough on every area of the baking tray on which to put pizza dough. Such Italian-style probability measures that assign some positive amount of probability to every non-empty or consistent proposition in the algebra on which they are defined are called regular. More precisely, a probability measure Pr : 𝒜 → ℝ is regular if, and only if, for every proposition A in 𝒜: if A ≠ ∅, then Pr (A) > 0.
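For a finite set of possible worlds, steps 1-6 and the definition of regularity can be checked mechanically. The following Python sketch is an illustration only, not part of the probability calculus itself: the function names is_algebra, is_probability_measure, and is_regular are our own, and representing Pr as a dictionary from propositions to fractions is an assumption made for the sake of the example. The two-element set of worlds and the two measures on it are likewise hypothetical.

```python
from fractions import Fraction


def is_algebra(worlds, family):
    """Check steps 1-3 for a finite family of subsets of worlds."""
    worlds = frozenset(worlds)
    family = {frozenset(a) for a in family}
    if worlds not in family:                               # step 1
        return False
    if any(worlds - a not in family for a in family):      # step 2
        return False
    return all(a | b in family for a in family for b in family)  # step 3


def is_probability_measure(worlds, family, pr):
    """Check steps 4-6 for pr, a dict from frozensets to numbers."""
    worlds = frozenset(worlds)
    family = {frozenset(a) for a in family}
    if not is_algebra(worlds, family) or set(pr) != family:
        return False          # pr must be defined on exactly the propositions
    if any(pr[a] < 0 for a in family):                     # step 4
        return False
    if pr[worlds] != 1:                                    # step 5
        return False
    return all(pr[a | b] == pr[a] + pr[b]                  # step 6
               for a in family for b in family if not (a & b))


def is_regular(pr):
    """Every non-empty proposition receives positive probability."""
    return all(p > 0 for a, p in pr.items() if a)


W = {"H", "T"}
power_set = [set(), {"H"}, {"T"}, {"H", "T"}]
fair = {frozenset(): Fraction(0), frozenset({"H"}): Fraction(1, 2),
        frozenset({"T"}): Fraction(1, 2), frozenset(W): Fraction(1)}
two_headed = {frozenset(): Fraction(0), frozenset({"H"}): Fraction(1),
              frozenset({"T"}): Fraction(0), frozenset(W): Fraction(1)}

print(is_algebra(W, power_set))                    # True
print(is_algebra(W, [set(), {"H"}, W]))            # False: {T} is missing
print(is_probability_measure(W, power_set, fair))  # True
print(is_regular(fair), is_regular(two_headed))    # True False
```

Both dictionaries satisfy steps 4-6, but only the first is regular: the second assigns probability 0 to the non-empty proposition {T}, which is still a probability, unlike the absence of any value for a subset outside the algebra.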

5.2 EXAMPLES

Example 1. Let us consider an example. We are tossing a coin, and there are two possible worlds, cases, or outcomes: H if the coin lands on heads, and T if the coin lands on tails. This means that our set of possible cases or worlds W is {H, T}. We could, of course, choose a different set of possible cases, say the one including H and I sneeze when tossing the coin, H and I do not sneeze when tossing the coin, T and I sneeze when tossing the coin, T and I do not sneeze when tossing the coin. However, since we are not interested in my sneezing when tossing the coin, that would be pointless. Since it is up to us to pick the set of possible cases, we pick the simpler set containing just H and T. Now that we have our set of cases W, we need to specify the algebra of propositions 𝒜 over W. According to step 1, W
itself has to be an element of 𝒜. Since W ∈ 𝒜, step 2 implies that (W \ W) ∈ 𝒜. Of course, W \ W = ∅. Therefore we have just shown that ∅ ∈ 𝒜. Since W ∈ 𝒜 and ∅ ∈ 𝒜, step 3 implies that W ∪ ∅, ∅ ∪ W, W ∪ W, as well as ∅ ∪ ∅ are all elements of 𝒜. However, W ∪ ∅ = ∅ ∪ W = W ∪ W = W, and ∅ ∪ ∅ = ∅, so these sets already are in 𝒜. W and ∅, but no other sets, must be elements of every algebra 𝒜 over W. The set {∅, W}, which is a set of subsets of W, is the smallest algebra over W. Of course, usually we want to include more propositions, say the proposition {H} that the coin lands on heads. However, if {H} ∈ 𝒜, then step 2 implies that W \ {H} = {T} ∈ 𝒜. Since there are no other subsets of W left, the set {∅, {H}, {T}, W}, which is, of course, just the power set of W, ℘ (W), is the largest algebra over W. Let us work with the power set of {H, T} as our algebra of propositions over {H, T}.

What is the probability that the coin lands on heads? That is, what is the probability of {H}? The probability calculus does not tell us, and it can be any real number between 0 and 1 (inclusive)! What the probability calculus (in particular, step 5) tells us is that the probability of {H, T} equals 1, Pr ({H, T}) = 1. Since {H, T} ∩ ∅ = ∅, step 6 tells us that Pr ({H, T} ∪ ∅) = Pr ({H, T}) + Pr (∅). Since {H, T} ∪ ∅ = {H, T} and Pr is a function (and so Pr (A) = Pr (B) if A = B, for all A and B in the domain of Pr), this implies that Pr ({H, T}) = Pr ({H, T}) + Pr (∅), that is, 1 = 1 + Pr (∅). Elementary calculus then implies that Pr (∅) = 0. This result holds true in general, not just when W = {H, T}.

Suppose we think that the probability that the coin lands on heads is a half, Pr ({H}) = 1/2. In this case, the probability calculus specifies the probability that the coin lands on tails. Here is how. Since {H} ∩ {T} = ∅, step 6 tells us that Pr ({H} ∪ {T}) = Pr ({H}) + Pr ({T}).
Of course, {H} ∪ {T} = {H, T}, and since Pr is a function, we get Pr ({H, T}) = Pr ({H}) + Pr ({T}), that is, 1 = 1/2 + Pr ({T}). Elementary calculus then implies that Pr ({T}) = 1/2.
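The reasoning about the smallest and the largest algebra over {H, T} can be replayed for any finite set of possible worlds. The sketch below is a minimal illustration rather than a method taken from the text: the helper generate_algebra is a hypothetical name, and it simply closes a starting family of sets under complement and union (steps 2 and 3) until nothing new appears. The value 1/2 for heads is, as above, an assumption.

```python
from fractions import Fraction


def generate_algebra(worlds, generators=()):
    """Smallest algebra over a finite set of worlds containing the generators."""
    worlds = frozenset(worlds)
    algebra = {worlds} | {frozenset(g) for g in generators}
    changed = True
    while changed:
        changed = False
        for a in list(algebra):
            for b in list(algebra):
                for candidate in (worlds - a, a | b):   # steps 2 and 3
                    if candidate not in algebra:
                        algebra.add(candidate)
                        changed = True
    return algebra


W = {"H", "T"}
smallest = generate_algebra(W)             # {∅, W}
with_heads = generate_algebra(W, [{"H"}])  # the power set of W

# Steps 5 and 6 fix Pr({T}) once Pr({H}) is chosen.
pr_heads = Fraction(1, 2)                  # an assumption, as in the text
pr_tails = 1 - pr_heads

print(sorted(sorted(a) for a in smallest))    # [[], ['H', 'T']]
print(len(with_heads), pr_tails)              # 4 1/2
```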


determined by Pr ({3}) = 1/6 is the proposition that the die will not show three eyes, that is, the proposition that it will show one, two, four, five, or six eyes. Much like in the first example, steps 5 and 6 imply that Pr ({1, 2, 4, 5, 6}) = 5/6. Suppose we think that the probabilities that the die shows any one number between one and six are all equal: Pr ({1}) = · · · = Pr ({6}) = 1/6. Then step 6 determines the probability that the die shows an even number of eyes as follows. Since {2} ∩ {4} = ∅, step 6 implies that Pr ({2} ∪ {4}) = Pr ({2}) + Pr ({4}). This in turn implies that Pr ({2, 4}) = 1/6 + 1/6 = 2/6, since {2} ∪ {4} = {2, 4}, Pr is a function, our assumption about the probabilities of {2} and {4}, and elementary calculus. Furthermore, since {2, 4}∩ {6} = ∅, step 6 implies that Pr ({2, 4} ∪ {6}) = Pr ({2, 4}) + Pr ({6}). Since {2, 4} ∪ {6} = {2, 4, 6}, we get Pr ({2, 4, 6}) = 2/6 + 1/6 = 1/2 for analogous reasons.
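Repeated applications of step 6 of this kind amount to summing the probabilities of the singletons. As a minimal sketch, assuming that all six faces are equally probable and that the algebra is the power set (the names atom_weight and pr are ours, not the text's), one can recompute the values above and confirm additivity for every pair of non-overlapping propositions:

```python
from fractions import Fraction
from itertools import combinations

W = range(1, 7)
atom_weight = {k: Fraction(1, 6) for k in W}   # assumption: all six faces equally probable


def pr(proposition):
    """Probability of a proposition = sum of the weights of its worlds."""
    return sum((atom_weight[k] for k in proposition), Fraction(0))


# The power set of W: all 2**6 = 64 propositions.
propositions = [frozenset(c) for r in range(7) for c in combinations(W, r)]

additive = all(pr(a | b) == pr(a) + pr(b)
               for a in propositions for b in propositions if not (a & b))

print(pr({2, 4}), pr({2, 4, 6}), pr({1, 2, 4, 5, 6}))   # 1/3 1/2 5/6
print(len(propositions), additive)                      # 64 True
```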

5.3 CONDITIONAL PROBABILITY

Another element of the probability calculus is the definition of conditional probability. In contrast to steps 4-6, which are axioms or assumptions about how probability behaves, this definition is a stipulation of a new concept, namely conditional probability, in terms of the concept of probability that is characterized or axiomatized by steps 4-6. This stipulation works much like the abbreviation ‘UN’ for ‘United Nations.’

Conditional probability is a very useful concept. Often we cannot say what the probability is that some proposition is true (say, the probability that I have pizza tonight). However, we often find it easier to determine the probability of some proposition given certain assumptions or conditions (say, that I order in and have Italian for dinner tonight). The probability that I have pizza tonight given that I order in and have Italian for dinner is close to 1. In this particular case, the reason is that I
always have pizza or pasta when I have Italian, that I have pasta when I cook Italian myself, and pizza when ordering Italian.

Conditional probabilities are defined as ratios of nonconditional probabilities. The latter are specified by a probability measure Pr on an algebra of propositions 𝒜 over a non-empty set of possible worlds W. Given such a probability measure Pr and some proposition C from 𝒜 with positive probability, Pr (C) > 0, the probability measure conditional on C is that function Pr (· | C) that has the same domain 𝒜 and co-domain ℝ as Pr and is such that for all propositions A from 𝒜:

Pr (A | C) = Pr (A ∩ C) / Pr (C)

If the condition C has probability zero, Pr (C) = 0, there is no probability measure conditional on C. In this case, Pr (· | C) is undefined. The reason is that it does not make sense to divide by zero. Because of step 4, the condition cannot have negative probability, so the case of a condition with negative probability never arises. Finally, the ‘·’ in ‘Pr (· | C)’ has nothing to do with multiplication, but indicates where we write the name ‘A’ for the proposition A, namely to the left of the bar ‘|’ indicating conditionality. We could just as well drop it and adopt a different notation, say ‘Pr_C’ and ‘Pr_C (A).’

Let us return to our examples. In the first example, we have the algebra {∅, {H}, {T}, {H, T}} and the probabilities Pr ({H}) = Pr ({T}) = 1/2. What is the conditional probability that the coin lands on heads given that it lands on tails? The answer is, of course, zero, and this can be shown as follows: Pr ({H} | {T}) = Pr ({H} ∩ {T}) / Pr ({T}) by the definition of conditional probability, which applies because the condition {T} has positive probability, Pr ({T}) = 1/2 > 0. Since {H} ∩ {T} = ∅ and Pr is a function, elementary calculus as well as our previously established result that Pr (∅) = 0 imply that Pr ({H} | {T}) = Pr (∅) / (1/2) = 0 / (1/2) = 0.

What is the conditional probability that the coin lands on heads given that it lands on heads or tails? The answer is a half, as can be shown as follows: Pr ({H} | {H, T}) = Pr ({H} ∩ {H, T}) / Pr ({H, T}) by
the definition of conditional probability, which applies because the condition {H, T} has positive probability according to step 5, Pr ({H, T}) = 1 > 0. Since {H} ∩ {H, T} = {H} and Pr is a function, our assumption and elementary calculus imply that Pr ({H} | {H, T}) = (1/2) / 1 = 1/2.

These two results hold true in general. For all propositions A and C in 𝒜: If Pr (C) > 0 and A ∩ C = ∅, then Pr (A | C) = 0, and Pr (A | W) = Pr (A). The latter result has been taken as motivation by Popper (1955) and the Hungarian mathematician Rényi (1955) to axiomatize conditional probability instead of nonconditional probability. Nonconditional probability is then defined in terms of conditional probability as that function Pr (·) that has domain 𝒜 and co-domain ℝ and is such that for all A in 𝒜: Pr (A) = Pr (A | W). We have followed Kolmogorov and proceeded in the opposite direction: We have axiomatized nonconditional probability and then defined conditional probability in terms of it.

In the third and fourth examples, we are tossing the coin twice. Let us assume that the algebra in the third example is the power set of {HH, HT, TH, TT}, and the probabilities are Pr ({HH}) = Pr ({HT}) = Pr ({TH}) = Pr ({TT}) = 1/4. Let us further assume that the algebra in the fourth example is the power set of {2H, 1H1T, 2T} and the probabilities are Pr ({2H}) = Pr ({1H1T}) = Pr ({2T}) = 1/3. What is the conditional probability that the coin lands on heads on one toss given that it lands on heads on the other toss? In the third example, this question requires us to compute the conditional probability of {HH} given {HH, HT, TH}. In the fourth example, this question requires us to compute the conditional probability of {2H} given {2H, 1H1T}. It is a useful exercise to calculate these two conditional probabilities and to explain why the results are different. In the fifth example, the algebra is the power set of {1, 2, 3, 4, 5, 6} and the probabilities are Pr ({1}) = · · ·
= Pr ({6}) = 1/6. We have shown the probability that the die shows an even number of eyes to be a half, Pr ({2, 4, 6}) = 1/2. Suppose I throw the die. The die lands and shows a certain number of eyes that I see, but that you do not see. You want to learn if the die shows two eyes because this is the proposition you have bet on. I tease you and only tell you that the die shows an even number of eyes. In this case, you will want to calculate the conditional probability that the die shows two eyes given that it shows an even number of eyes. Here is how to proceed: Pr ({2} | {2, 4, 6}) = Pr ({2} ∩ {2, 4, 6}) / Pr ({2, 4, 6}) by the definition of conditional probability, which applies because Pr ({2, 4, 6}) = 1/2 > 0. Since {2} ∩ {2, 4, 6} = {2} and Pr is a function, elementary calculus implies that Pr ({2} | {2, 4, 6}) = (1/6) / (1/2) = 1/3.
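The definition of conditional probability translates directly into a small function. What follows is a sketch under the assumptions of the fifth example (six equally probable faces, the power set as algebra); the names pr and pr_given are ours, and returning None for a condition of probability zero is just one possible way of representing “undefined.”

```python
from fractions import Fraction

atom_weight = {k: Fraction(1, 6) for k in range(1, 7)}   # assumption: fair die


def pr(proposition):
    """Probability of a proposition = sum of the weights of its worlds."""
    return sum((atom_weight[k] for k in proposition), Fraction(0))


def pr_given(a, c):
    """Pr(a | c) = Pr(a ∩ c) / Pr(c), or None if Pr(c) = 0."""
    a, c = set(a), set(c)
    if pr(c) == 0:
        return None
    return pr(a & c) / pr(c)


even = {2, 4, 6}
print(pr_given({2}, even))             # 1/3
print(pr_given({3}, even))             # 0, since {3} ∩ even = ∅
print(pr_given({2}, range(1, 7)))      # 1/6, that is, Pr({2} | W) = Pr({2})
print(pr_given({2}, set()))            # None: Pr(∅) = 0, so undefined
```

The second and third lines of output are instances of the two general results stated earlier: Pr (A | C) = 0 whenever A ∩ C = ∅ and Pr (C) > 0, and Pr (A | W) = Pr (A).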

5.4 ELEMENTARY CONSEQUENCES

Here is a more formal derivation of the result that the empty set ∅ is an element of every algebra of propositions 𝒜 over any non-empty set of possible worlds W.

1. W is a proposition, that is, W ∈ 𝒜
   from step 1.
2. The complement of W is a proposition, that is, (W \ W) ∈ 𝒜
   from 1. by step 2.
3. The empty or inconsistent set is a proposition, that is, ∅ ∈ 𝒜
   from 2. and set theory, which implies that (W \ W) = ∅.

Strictly speaking, we would have to apply UG and proceed as follows:

1. B is an algebra of propositions over the non-empty V
   assumption for CP as well as UG, where B and V are arbitrary sets.
2. V ∈ B
   from 1. by step 1.
3. (V \ V) ∈ B
   from 2. by step 2.
4. ∅ ∈ B
   from 3. and set theory, which implies that (V \ V) = ∅.
5. If B is an algebra over the non-empty V, then ∅ ∈ B
   from 1. and 4. by CP.
6. For all sets W: If B is an algebra over the non-empty W, then ∅ ∈ B
   from 1. and 5. by UG, which replaces V, which was arbitrary (has not occurred prior to 1.), by W, which is new (does not occur in 5.).
7. For all sets 𝒜 and W: If 𝒜 is an algebra over the non-empty W, then ∅ ∈ 𝒜
   from 1. and 6. by UG, which replaces B, which was arbitrary (has not occurred prior to 1.), by 𝒜, which is new (does not occur in 6.).

However, from now on the shorter proofs skipping UG are fine.

Here is a more formal derivation of the result that, for any probability measure Pr on any algebra of propositions 𝒜 over any non-empty set of possible worlds W, Pr (A) = Pr (A | W) for all propositions A in 𝒜.

1. Pr (A | W) = Pr (A ∩ W) / Pr (W)
   from the definition of conditional probability, which applies because Pr (W) = 1 > 0, which follows from step 5 and elementary calculus.
2. Pr (A | W) = Pr (A) / Pr (W)
   from 1., set theory, which implies that A ∩ W = A if A ⊆ W, and because Pr is a function, which implies that Pr (A ∩ W) = Pr (A) if A ∩ W = A (note: A ⊆ W because A is a proposition in 𝒜).
3. Pr (A | W) = Pr (A) / 1
   from 2. and step 5, which implies that Pr (W) = 1.
4. Pr (A | W) = Pr (A)
   from 3. and elementary calculus.

To state the next result, we need the concept of a partition of a non-empty set of possible worlds W. While a partition of W is not an algebra over W, it is also a set of subsets of W. The elements of a partition are called “cells,” and they have to be non-empty, mutually exclusive, and jointly exhaustive of W. More precisely, a set of subsets of W, P, is a partition of cells of W if, and only if, (i) C ≠ ∅ for every cell C of P; (ii) A ∩ C = ∅ for any two distinct cells A and C of P; and (iii) every element x in W is an element of some cell C of P. We will only be interested in non-trivial partitions of W, that is, partitions of W other than {W} (other than ∅, if W is empty).

The smallest, or most coarse-grained, partitions of a non-empty set of possible worlds W are the sets {A, W \ A}, where A is any non-empty subset of W. In the first example, {{H}, {T}} is the smallest, or most coarse-grained, and, in fact, the only (non-trivial) partition of {H, T}. In the fifth example, one smallest, or most coarse-grained, partition results from the non-empty or consistent proposition that the die shows an even number and its complement with respect to {1, 2, 3, 4, 5, 6}: {{2, 4, 6}, {1, 3, 5}}. The largest, or most fine-grained, partition is always the set of singletons of the elements of W. A singleton is a set containing exactly one member. In the fifth example, the set of singletons of the elements is {{1}, {2}, {3}, {4}, {5}, {6}}.

With the help of the concept of a partition, we can now state the law of total probability and Bayes’ theorem from Bayes’ “An Essay towards Solving a Problem in the Doctrine of Chances” (1763). Both come in a special as well as a general form.
In its special form, the law of total probability says that for any probability measure Pr on any algebra of propositions 𝒜 over any non-empty set of possible worlds W and all propositions B and C in 𝒜: If Pr (C) > 0 and Pr (W \ C) > 0, then

Pr (B) = Pr (B | C) · Pr (C) + Pr (B | W \ C) · Pr (W \ C).

The if-clause guarantees that Pr (B | C) and Pr (B | W \ C) are defined. In order to formulate things in a slightly less complicated manner, we will assume from now on that we are considering a fixed but arbitrary probability space ⟨W, 𝒜, Pr⟩.

In its general form, the law of total probability says that for all propositions B in 𝒜 and any partition {C₁, C₂, …} of W: If all cells C₁, C₂, … have positive probability, that is, if Pr (C₁) > 0, Pr (C₂) > 0, …, then

Pr (B) = Pr (B | C₁) · Pr (C₁) + Pr (B | C₂) · Pr (C₂) + ⋯

To further simplify things, we will also assume from now on that B, C, and all C₁, C₂, … are arbitrary propositions in 𝒜. In its special form, Bayes’ theorem says that

Pr (C | B) = Pr (C) · Pr (B | C) / Pr (B), if Pr (B) > 0 and Pr (C) > 0.

The general form of Bayes’ theorem results from the special form and the law of total probability. If Pr (B) > 0, Pr (C) > 0, and Pr (W \ C) > 0, then

Pr (C | B) = [Pr (C) · Pr (B | C)] / [Pr (B | C) · Pr (C) + Pr (B | W \ C) · Pr (W \ C)].

Or, even more general, where {C₁, C₂, …} is a partition of W: If all cells C₁, C₂, … have positive probability, that is, if Pr (C₁) > 0, Pr (C₂) > 0, …, and if Pr (B) > 0, then

Pr (C | B) = [Pr (C) · Pr (B | C)] / [Pr (B | C₁) · Pr (C₁) + Pr (B | C₂) · Pr (C₂) + ⋯].

Of course, we still have to prove that the law of total probability and the special form of Bayes’ theorem follow from the probability calculus. Let us do so!


Both claims are if-then claims, so we will proceed by conditional proof. We start with the special form of the law of total probability.

1. Pr (C) > 0 and Pr (W \ C) > 0
   assumption for CP.
2. B = (B ∩ C) ∪ (B ∩ (W \ C))
   from set theory.
3. Pr (B) = Pr ((B ∩ C) ∪ (B ∩ (W \ C)))
   from 2. and because Pr is a function, which implies that Pr (B) = Pr ((B ∩ C) ∪ (B ∩ (W \ C))) if B = (B ∩ C) ∪ (B ∩ (W \ C)).
4. Pr (B) = Pr (B ∩ C) + Pr (B ∩ (W \ C))
   from 3. and step 6, which applies because (B ∩ C) ∩ (B ∩ (W \ C)) = ∅, which follows from set theory.
5. Pr (B) = Pr (B ∩ C) · [Pr (C) / Pr (C)] + Pr (B ∩ (W \ C)) · [Pr (W \ C) / Pr (W \ C)]
   from 1., 4., and elementary calculus.
6. Pr (B) = Pr (B | C) · Pr (C) + Pr (B | W \ C) · Pr (W \ C)
   from 5. and the definition of conditional probability, which applies because Pr (C) > 0 and Pr (W \ C) > 0 according to 1.
7. If Pr (C) > 0 and Pr (W \ C) > 0, then Pr (B) = Pr (B | C) · Pr (C) + Pr (B | W \ C) · Pr (W \ C)
   from 1. and 6. by CP.

Now we can proceed to prove Bayes’ theorem:

1. Pr (B) > 0 and Pr (C) > 0
   assumption for CP.
2. Pr (C | B) = Pr (C ∩ B) / Pr (B)
   from the definition of conditional probability, which applies because Pr (B) > 0 according to 1.
3. Pr (C | B) = [Pr (C) · Pr (B ∩ C)] / [Pr (C) · Pr (B)]
   from 1., 2., set theory, and elementary calculus.
4. Pr (C | B) = Pr (C) · Pr (B | C) / Pr (B)
   from 3. and the definition of conditional probability, which applies because Pr (C) > 0 according to 1.
5. If Pr (B) > 0 and Pr (C) > 0, then Pr (C | B) = Pr (C) · Pr (B | C) / Pr (B)
   from 1. and 4. by CP.
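A numerical check is no substitute for these proofs, but it can help to see the law of total probability and Bayes’ theorem at work on a concrete partition. The sketch below assumes the fair die of the fifth example; the partition into three cells, the proposition B that the die shows at most three eyes, and the helper names pr, pr_given, and is_partition are all chosen by us for illustration.

```python
from fractions import Fraction

W = frozenset(range(1, 7))
atom_weight = {k: Fraction(1, 6) for k in W}   # assumption: fair die


def pr(proposition):
    """Probability of a proposition = sum of the weights of its worlds."""
    return sum((atom_weight[k] for k in proposition), Fraction(0))


def pr_given(a, c):
    """Pr(a | c); only call this when pr(c) > 0."""
    return pr(a & c) / pr(c)


def is_partition(worlds, cells):
    """Cells must be non-empty, mutually exclusive, and jointly exhaustive."""
    cells = list({frozenset(c) for c in cells})
    non_empty = all(len(c) > 0 for c in cells)
    exclusive = all(not (a & c)
                    for i, a in enumerate(cells) for c in cells[i + 1:])
    exhaustive = frozenset().union(*cells) == frozenset(worlds)
    return non_empty and exclusive and exhaustive


cells = [frozenset({1, 2}), frozenset({3, 4}), frozenset({5, 6})]
B = frozenset({1, 2, 3})       # the die shows at most three eyes

# General law of total probability: sum of Pr(B | Ci) * Pr(Ci) over all cells.
total = sum((pr_given(B, c) * pr(c) for c in cells), Fraction(0))

# Bayes' theorem for the cell C = {1, 2}.
C = cells[0]
bayes = pr(C) * pr_given(B, C) / total

print(is_partition(W, cells))          # True
print(total == pr(B))                  # True: law of total probability
print(bayes == pr_given(C, B), bayes)  # True 2/3
```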


For the general versions of the law of total probability and of Bayes’ theorem, one needs to use the principle of mathematical induction that is part of arithmetic (that is, the theory of natural numbers), and that has nothing to do with induction in our sense. In addition, we need a strengthening of step 6 that in turn presupposes a strengthening of step 3. I will state these two strengthenings here, but we will not use them in any proofs. (The concept of countability is explained in Chapter 11.)

If all of the (countably many) subsets C₁, C₂, … of W are propositions in 𝒜, then so is their union C₁ ∪ C₂ ∪ …

If no two of the (countably many) propositions C₁, C₂, … in 𝒜 overlap, then Pr (C₁ ∪ C₂ ∪ …) = Pr (C₁) + Pr (C₂) + ⋯

5.5 PROBABILITIES ON LANGUAGES

Before we turn to some exercises, let us state an alternative formulation of the probability calculus. In this formulation, the carriers of probability, that is, the things to which we assign probabilities, are not propositions of some algebra but sentences or well-formed formulas of some formal language. As before, there are three components. The first component specifies the individual constants, individual variables, predicates, relation symbols, and propositional variables, or sentence letters, and is called the non-empty “vocabulary” V. The non-empty vocabulary V can be chosen arbitrarily. It is given to one in step 0 just as the baking tray and the non-empty set of possible worlds W. The second component is the formal language L over this non-empty vocabulary V. We have already introduced the three steps characterizing it in Chapter 1.

1. If ‘t₁’, …, ‘tₙ’ are n terms (that is, individual constants or individual variables), and ‘R’ is an n-ary relation
symbol (which includes propositional variables, or sentence letters, as the special case where n = 0), then ‘R (t₁, …, tₙ)’ is a well-formed formula of L.
2. If α and β are well-formed formulas of L, and if ‘x’ is an individual variable, then (¬α), (¬β), (α ∧ β), (α ∨ β), (α → β), (α ↔ β), as well as ∃x (α) and ∀x (α) are also well-formed formulas of L.
3. Nothing else is a well-formed formula of L.

The third component is the probability function whose co-domain is still the set of real numbers ℝ, but whose domain is now the formal language L over the non-empty vocabulary V. To distinguish it from the probability measure on an algebra of propositions, we call it just probability. A function whose domain is a formal language of well-formed formulas L over some non-empty vocabulary V and whose co-domain is the set of real numbers ℝ, Pr : L → ℝ, is a probability if, and only if, the following holds for all well-formed formulas α and β in L:

4. Pr (α) ≥ 0 (Non-negativity)
5. Pr (α) = 1 if α is logically true (Normalization)
6. Pr (α ∨ β) = Pr (α) + Pr (β) if α ∧ β is logically false (Additivity)

It is a useful exercise to show that these axioms imply that sentences or well-formed formulas that are logically equivalent, and, hence, express the same proposition, are assigned the same probability. For all well-formed formulas α and β in L: Pr (α) = Pr (β) if α and β are logically equivalent.

Regular probabilities as well as conditional probabilities on a formal language are defined in analogy to regular probability measures and conditional probability measures on an algebra. Finally, it is important to note that conditional probabilities
are not (nonconditional) probabilities that some conditional sentence or conditional well-formed formula is true. In   particular, the conditional probability Pr α | β is not, in general,   equal to the probability Pr β → α that the material conditional β → α is true.
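The difference between a conditional probability and the probability of a material conditional can be checked on a small case. The following is a minimal sketch, not part of the text: it assumes a toy language with two sentence letters, a and b, and an equal weight of 1/4 on each of the four truth-value assignments. The weights and the helper names `pr` and `pr_given` are illustrative assumptions.

```python
from fractions import Fraction
from itertools import product

# Assumption: a toy language with two sentence letters, a and b.
# A "world" is a truth-value assignment (a, b); each gets weight 1/4.
worlds = list(product([True, False], repeat=2))
weight = {w: Fraction(1, 4) for w in worlds}


def pr(sentence):
    """Probability of a sentence, given as a function from worlds to bool."""
    return sum(weight[w] for w in worlds if sentence(w))


def pr_given(sentence, condition):
    """Conditional probability Pr(sentence | condition) as a ratio."""
    p_condition = pr(condition)
    if p_condition == 0:
        raise ValueError("undefined: the condition has probability 0")
    return pr(lambda w: sentence(w) and condition(w)) / p_condition


def alpha(w):
    return w[0]


def beta(w):
    return w[1]


def beta_implies_alpha(w):
    # Material conditional: false only if beta is true and alpha is false.
    return (not beta(w)) or alpha(w)


conditional_probability = pr_given(alpha, beta)
probability_of_conditional = pr(beta_implies_alpha)
print(conditional_probability, probability_of_conditional)
```

On these assumed weights the two quantities come out as 1/2 and 3/4, so they differ even in this very simple case.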

5.6 EXERCISES

Exercise 16: We are considering an algebra 𝒜 over a non-empty set of possible worlds W. Show that the intersection of A and B is a proposition if both A and B are propositions, that is, (A ∩ B) ∈ 𝒜 if A ∈ 𝒜 and B ∈ 𝒜.

Exercise 17: Consider the set F = {Albert, Jean-Paul, Simone}. What is the largest, or most fine-grained, partition of F, and how many cells does it contain?

Exercise 18: Consider the same F as before. Show that the following set of F’s subsets, ℱ = {{Albert}, {Jean-Paul}, {Simone}}, is a partition of F into cells.

Exercise 19: At every party, somebody is the last person to thank the host and go home (no two people can thank the host simultaneously). The usual and only suspects are the three philosophers Albert, Jean-Paul, and Simone so that we are considering the set of possible worlds F = {Albert, Jean-Paul, Simone}. What is the smallest, or most coarse-grained, algebra over F that contains the proposition that it was Jean-Paul, {Jean-Paul}, as an element?

Exercise 20: Let F be the same set of possible worlds as before. Show that the set of F’s subsets G is an algebra of propositions over F, where G = {∅, {Jean-Paul}, {Albert, Simone}, {Albert, Jean-Paul, Simone}}.

For the following eight exercises, we are considering a fixed probability measure Pr on a fixed algebra of propositions 𝒜 over a fixed non-empty set of possible worlds W. You may skip the use of the principle of universal generalization (UG).

Exercise 21: Show that for all propositions A in 𝒜: If Pr (A) > 0, then Pr (A | A) = 1.

Exercise 22: Show that Pr (∅) = 0.

Exercise 23: Show that for all propositions A in 𝒜: Pr (W \ A) = 1 − Pr (A).

Exercise 24: Show that for all propositions A and B in 𝒜: Pr (A) = Pr (A ∩ B) + Pr (A ∩ (W \ B)).

Exercise 25: Show that for all propositions A and B in 𝒜: Pr (A ∪ B) = Pr (A) + Pr (B) − Pr (A ∩ B).

Exercise 26: In this exercise, you show that conditional probabilities are probabilities, that is, that every conditional probability measure is a probability measure. Let C be a fixed proposition in 𝒜 with positive probability, Pr (C) > 0. Show that the function Pr (· | C) with domain 𝒜 and co-domain R is a probability measure on 𝒜.

Exercise 27: Show that for all propositions A and B in 𝒜: If A ⊆ B, then Pr (B) ≥ Pr (A).

Exercise 28: Show that for all propositions A and B in 𝒜: If A ⊆ B and Pr (B) > 0, then Pr (A | B) ≥ Pr (A).

Exercise 29: Suppose you are increasingly annoyed by the amount of spam among your e-mails, which makes up 20% of all the e-mails you receive. To block spam, you install a filter, which filters all e-mails you receive that contain the phrase “lottery win.” The filter is 99% reliable, which means two things: 99% of the spam you receive contains the phrase “lottery win” and so gets filtered, and 99% of the e-mails you receive that are not spam do not contain the phrase “lottery win” and so do not get filtered. The formal language we consider is built up from a vocabulary containing two sentence letters: s says that the next e-mail you receive is spam, f says that the next e-mail you receive gets filtered. Based on the information above, you think the probabilities are as follows: Pr (s) = 0.2, Pr (f | s) = 0.99, and Pr (¬f | ¬s) = 0.99. What is the conditional probability that the next e-mail you receive is spam given that it gets filtered, Pr (s | f)? You may assume that for all sentences α and γ,

LE: Pr (α) = Pr (γ), if α and γ are logically equivalent,
N: Pr (¬α) = 1 − Pr (α) and Pr (¬α | γ) = 1 − Pr (α | γ), and
BT: Pr (α | γ) = Pr (α) · Pr (γ | α) / [Pr (γ | α) · Pr (α) + Pr (γ | ¬α) · Pr (¬α)], if Pr (α) > 0, Pr (¬α) > 0, and Pr (γ) > 0

(this is, of course, just Bayes’ theorem for sentences).

Exercise 30: Suppose you are concerned that you have a particular allergy. The allergy is rare (0.1% of the population has the allergy), but to ease your worries, you get tested for the allergy. The test is 98% reliable, which means two things: 98% of the people who have the allergy and get tested test positive, and 98% of the people who do not have the allergy and get tested test negative. The formal language we consider is built up from a vocabulary containing three sentence letters: a says that you have the allergy, t says that you get tested for the allergy, and p says that you test positive for the allergy. Based on the previous information, you think the probabilities are as follows: Pr (a) = 0.001, Pr (t) = 1, Pr (p | a ∧ t) = 0.98, and Pr (¬p | ¬a ∧ t) = 0.98. What is the conditional probability that you have the allergy given that you get tested for the allergy and test positive, Pr (a | p ∧ t)? You may assume LE, N, and BT from the previous exercise.
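The principle BT from Exercise 29 can be turned into a small calculation. The following sketch is only an illustration: the function name `bayes_posterior` and the numbers in the usage line are hypothetical and deliberately differ from those in Exercises 29 and 30, so that the exercises remain to be solved.

```python
from fractions import Fraction


def bayes_posterior(prior, likelihood, likelihood_given_negation):
    """Return Pr(alpha | gamma) by Bayes' theorem for sentences (BT).

    prior                      -- Pr(alpha), strictly between 0 and 1
    likelihood                 -- Pr(gamma | alpha)
    likelihood_given_negation  -- Pr(gamma | not-alpha)
    """
    if not 0 < prior < 1:
        raise ValueError("BT requires Pr(alpha) > 0 and Pr(not-alpha) > 0")
    # Denominator of BT; it equals Pr(gamma) by the law of total probability.
    pr_gamma = likelihood * prior + likelihood_given_negation * (1 - prior)
    if pr_gamma == 0:
        raise ValueError("BT requires Pr(gamma) > 0")
    return prior * likelihood / pr_gamma


# Hypothetical numbers, chosen only for illustration.
posterior = bayes_posterior(Fraction(1, 10), Fraction(9, 10), Fraction(1, 5))
print(posterior)
```

When an exercise states Pr (¬γ | ¬α) rather than Pr (γ | ¬α), the principle N gives the third argument as 1 − Pr (¬γ | ¬α).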

READINGS

Among the recommended readings for Chapter 5 are:

Papineau, David (2012), Philosophical Devices. Proofs, Probabilities, Possibilities, and Sets. Oxford: Oxford University Press. Chapters 7–8.

Steinhart, Eric (2009), More Precisely: The Math You Need to Do Philosophy. Peterborough, ON: Broadview Press. Chapter 5.

and perhaps also

Papineau, David (2012), Philosophical Devices. Proofs, Probabilities, Possibilities, and Sets. Oxford: Oxford University Press. Chapter 9.

CHAPTER 6

The Classical Interpretation of Probability

6.1 THE PRINCIPLE OF INDIFFERENCE

The probability calculus specifies how probabilities behave mathematically, but it does not specify the probabilities of most propositions, nor does it tell us everything about the nature of probability. For this we need philosophy. Different philosophers have come up with different interpretations or theories of probability. These are philosophical accounts of the nature of probability that try to say more about what probability is.

The first interpretation that we will discuss is the classical interpretation of probability. It is formulated by what Jakob Bernoulli calls the principle of insufficient reason in Ars Conjectandi (1713) (“The Art of Conjecturing”) and Laplace calls the principle of indifference in Essai Philosophique sur les Probabilités (1814) (“Philosophical Essay on Probabilities”). According to this principle, the probability of a proposition is the ratio of the number of favorable cases to the number of all possible cases provided these cases are “equally possible.” Since we cannot divide by zero or infinity, this presupposes that there is at least one possible case, and no more than finitely many.

In our first example, there are two possible cases: The coin lands on heads, H, and the coin lands on tails, T. If we consider the proposition that the coin lands on heads, {H}, then there is one favorable case—namely the case where the coin lands on heads, H. If the two possible cases are “equally possible,” then the principle of indifference tells us that the probability that the coin lands on heads is a half, Pr ({H}) = 1/2. In this example, it is perhaps plausible that the two cases are “equally possible.” In our second example, there are three possible cases: The coin lands on heads, H, the coin lands on tails, T, and the coin neither lands on heads nor lands on tails, N. In this example, it is presumably not plausible that the three possible cases are “equally possible.” Therefore, the principle of indifference, and with it the classical interpretation of probability, is silent as to what the probability is that the coin lands on heads, or even if there is such a probability. In our fifth example, it is perhaps plausible again that the principle of indifference applies. We might also be able to use it to determine the probability that the next president is born on a Monday, or on a February 29. In the first case, there are seven possible cases, namely the seven days of the week the next president could be born on, and one favorable case, namely Monday. In the second case, there are 4 · 365 + 1 = 1461 possible cases and one favorable case, namely February 29. However, since the number of births is higher in some months than in others, one may question if the principle of indifference really applies in the second case. Concepts can come in a qualitative, comparative, and quantitative form. Possibility is generally considered to be a qualitative concept: A case either is possible, or it is not. 
If possibility is a qualitative concept, then all possible cases are, in a sense, “equally possible.” In this case, the principle of indifference applies as long as there is at least one possible case, and no more than finitely many. However, as illustrated by our second example, it will often deliver very strange probabilities.
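The ratio of favorable to possible cases can be written out directly. The following is a minimal sketch, assuming that the cases are given as a finite, non-empty collection and are treated as “equally possible”. The function name `classical_probability` and the encoding of the four-year cycle are illustrative assumptions.

```python
from fractions import Fraction


def classical_probability(favorable, possible):
    """Ratio of favorable cases to all possible cases (finite, non-empty)."""
    possible = set(possible)
    favorable = set(favorable)
    if not possible:
        raise ValueError("there must be at least one possible case")
    if not favorable <= possible:
        raise ValueError("every favorable case must be a possible case")
    return Fraction(len(favorable), len(possible))


# First example: two possible cases, one favorable.
heads = classical_probability({"H"}, {"H", "T"})

# Second example, read qualitatively: three "possible" cases.
heads_of_three = classical_probability({"H"}, {"H", "T", "N"})

# Birthday on a Monday: seven possible cases, one favorable.
week = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
monday = classical_probability({"Mon"}, week)

# Birthday on February 29: the days of a four-year cycle (assumed encoding,
# with year 4 as the leap year), giving 4 * 365 + 1 = 1461 possible cases.
days_per_year = {1: 365, 2: 365, 3: 365, 4: 366}
cycle = [(year, day) for year, n in days_per_year.items()
         for day in range(1, n + 1)]
leap_day = (4, 60)  # day 60 of the leap year: 31 days of January plus 29
february_29 = classical_probability({leap_day}, cycle)

print(heads, heads_of_three, monday, february_29)
```

The value 1/3 for the second example illustrates the strange probabilities that result when every possible case counts as “equally possible.”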

In the present context, the most plausible candidate for a quantitative concept of possibility is probability so that “equally possible” means equally probable. In this case, the principle of indifference is true but useless as an interpretation of probability: It uses the concept of probability to say what probability is.

In addition to these concerns, the principle of indifference has been claimed to be inconsistent. Two famous arguments against it take the form of a paradox: One is the paradox of water and wine, and another is one of the paradoxes from Bertrand’s Calcul des Probabilités (1889) (“Calculus of Probabilities”). Both attack a generalization of the principle of indifference that also applies if there are infinitely many cases.

An admittedly much less convincing variant of these paradoxes that works with finitely many cases is provided by our third and fourth examples. We toss a coin twice. What is the probability that the coin lands on heads on both tosses? In our third example, there are four possible cases that seem to be “equally possible” (HH, HT, TH, and TT), and so the answer is a quarter. In our fourth example, there are three possible cases that seem to be “equally possible” (2H, 1H1T, and 2T), and so the answer is a third. Contradiction!
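The two ways of counting cases for two tosses can be enumerated mechanically. This is a small sketch with illustrative variable names: ordered outcomes stand for the third example, and outcomes with the order ignored stand for the fourth.

```python
from fractions import Fraction
from itertools import product

# Third example: the order of the tosses matters (HH, HT, TH, TT).
ordered_cases = list(product("HT", repeat=2))
both_heads_ordered = Fraction(
    sum(1 for case in ordered_cases if case == ("H", "H")),
    len(ordered_cases),
)

# Fourth example: only the number of heads matters (2H, 1H1T, 2T).
# Sorting each outcome identifies HT with TH.
unordered_cases = {tuple(sorted(case)) for case in ordered_cases}
both_heads_unordered = Fraction(
    sum(1 for case in unordered_cases if case == ("H", "H")),
    len(unordered_cases),
)

print(both_heads_ordered, both_heads_unordered)
```

The same counting rule yields 1/4 on the first set of cases and 1/3 on the second, so the difference lies entirely in which cases are taken as given.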

6.2 BERTRAND’S PARADOX

Consider a circle like the circles that follow. Within each circle we can inscribe an equilateral triangle, that is, a triangle whose sides are equally long. The triangles that are inscribed in our circles are such equilateral triangles. Now suppose we draw “at random” a chord through the circle, that is, a straight line from one point on the circle to another. The dotted and dashed lines through our circles are such chords. Drawing a chord through the circle at random is supposed to guarantee that all chords through the circle are “equally possible” and, by the principle of indifference, equally probable. This in turn allows us to “count” (in a sense that also works if we have infinitely many items) the chords to determine the probability that the following question asks about: What is the probability that a “randomly” drawn chord through the circle is longer than a side of the inscribed equilateral triangle? In our circles, the dashed chords are longer than a side of the inscribed equilateral triangle, while the dotted chords are shorter.

The three answers to follow will also make use of the concept of randomness. They do so to make sure that the possible cases considered are “equally possible,” as this is the condition of applicability for the principle of indifference. The latter principle then implies that the “equally possible” cases are also equally probable. Randomness is a difficult concept, and its relation to probability (and possibility) is not straightforward (Eagle 2012). We will bracket this complication and assume that we have a sufficient understanding of randomness.

Here is the first answer. We choose two points on the (circumference of the) circle “at random.” The randomness is supposed to guarantee that all pairs of points on the circle are “equally possible” and, by the principle of indifference, equally probable. Then we draw the chord between these two points. Next we rotate the inscribed triangle so that one of its three vertices coincides with one of the two arbitrarily chosen points. Finally, we see that two-thirds of the points on the circumference give rise to dotted chords that lie outside the inscribed equilateral triangle and that are shorter than its sides. Therefore, one-third of the points on the circumference give rise to dashed chords that go through the inscribed equilateral triangle and that are longer than its sides. “Counting” these points on the circumference delivers the answer that the probability is a third.

Here is the second answer. We choose a radius and a point on this radius “at random.” The randomness is supposed to guarantee that all points on the radius are “equally possible” and, by the principle of indifference, equally probable. Then we draw a chord through this point that is perpendicular to the radius. Next, we rotate the inscribed triangle so that one of its three sides is parallel to the chord. Finally, we see that half of the points on the radius give rise to dotted chords that lie outside the inscribed equilateral triangle and that are shorter than its sides. Therefore, half of the points on the radius give rise to dashed chords that go through the inscribed equilateral triangle and that are longer than its sides. “Counting” these points on the radius (it does not matter which radius, as the answer is always the same) delivers the answer that the probability is a half.

Here is the third answer. We choose a point in (the area of) the circle “at random.” Again, the randomness is supposed to guarantee that all points in the circle are “equally possible” and, by the principle of indifference, equally probable. Then we draw the unique chord that has this point as its midpoint. Next we draw the largest circle inside the inscribed equilateral triangle. The two circles have the same midpoint. We see that three-quarters of the points in the larger circle lie outside the smaller circle. These points give rise to dotted chords that are shorter than the sides of the inscribed equilateral triangle. Therefore, one-quarter of the points in the larger circle lie inside the smaller circle. These points give rise to dashed chords that are longer than the sides of the inscribed equilateral triangle. “Counting” these points in the larger circle delivers the answer that the probability is a quarter.

The principle of indifference seems to give three different answers to one and the same question. Therefore, it seems to be inconsistent, and this is the paradox.

To solve the paradox, we need to recall that a probability space has three components: a non-empty set of possible worlds or cases, an algebra of propositions over this set, and a probability measure that has this algebra as its domain.

The first answer assumes that the possible worlds or cases are the points on a line of length 2 · r · π, which is the length of the circumference of the circle if r is the length of a radius of the circle (π = 3.14 . . . is the irrational number pi):

W1 = {x ∈ R : 0 ≤ x ≤ 2 · r · π} = [0, 2rπ]

The algebra A1 over this set is generated by the subintervals [a, b] of the interval [0, 2rπ]. Its elements, viz. intervals and complements and countable unions, are called Borel sets. The probability measure Pr1 on this algebra assigns to each subinterval [a, b] of [0, 2rπ] its normalized length, that is, the probability (b − a) / (2 · r · π). Pr1 ([a, b]) = (b − a) / (2 · r · π) “counts” the points in every subinterval of [0, 2rπ].

The second answer assumes that the possible worlds or cases are the points on a line of length r, which is the length of every radius of the circle:

W2 = {x ∈ R : 0 ≤ x ≤ r} = [0, r]

The algebra A2 over this set is generated by the subintervals [a, b] of the interval [0, r]. The probability measure Pr2 on this algebra assigns to each subinterval [a, b] of [0, r] its normalized length, that is, the probability (b − a) / r. Pr2 ([a, b]) = (b − a) / r “counts” the points in every subinterval of [0, r].

The third answer is a bit more difficult to explain. It assumes that the possible worlds or cases are the two-dimensional points in the disc whose radii have length r. The algebra A3 over this set is the set of two-dimensional Borel sets that are contained in the disc. The probability measure Pr3 on this algebra assigns to each two-dimensional Borel set A its normalized area, that is, the probability area (A) / (r² · π). Pr3 (A) = area (A) / (r² · π) “counts” the two-dimensional points in every two-dimensional Borel set that is contained in the disc.

What this means is that our question (that is, what is the probability that a “randomly” drawn chord is longer than a side of the inscribed equilateral triangle?) can be interpreted in at least three different ways. As a consequence, the principle of indifference has not been shown to be inconsistent; instead, each of the three answers responds to a different question.

Informally, the set of possible worlds or cases is a way of carving up or conceptualizing reality. What Bertrand’s paradox illustrates is that there is no such thing as the probability that something is the case; there is only the probability that something is the case relative to a way of carving up or conceptualizing reality. The principle of indifference does not tell us how to carve up or conceptualize reality. It presupposes that this has already been done in its mention of the possible cases that need to be judged “equally possible” in order for the principle to apply.
The important lesson from Bertrand’s paradox is that we must specify the first and second element of a probability space, the non-empty set of possible worlds or cases and the algebra of propositions over it, before we can even meaningfully speak of probability.
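The three answers can be approximated by simulation, with one sampling procedure for each of the three sets of possible cases. The following is a rough sketch under stated assumptions: a circle of radius r = 1, Python's pseudo-random generator standing in for drawing “at random,” and hypothetical function names. It estimates the three probabilities and proves nothing.

```python
import math
import random


def chord_from_endpoints(rng, r):
    """First answer: two points chosen uniformly on the circumference."""
    t1 = rng.uniform(0.0, 2.0 * math.pi)
    t2 = rng.uniform(0.0, 2.0 * math.pi)
    return 2.0 * r * abs(math.sin((t1 - t2) / 2.0))


def chord_from_radius_point(rng, r):
    """Second answer: a point chosen uniformly on a radius; the chord is
    perpendicular to the radius at that point."""
    d = rng.uniform(0.0, r)
    return 2.0 * math.sqrt(r * r - d * d)


def chord_from_midpoint(rng, r):
    """Third answer: a point chosen uniformly in the disc (by rejection
    sampling from the enclosing square) serves as the chord's midpoint."""
    while True:
        x = rng.uniform(-r, r)
        y = rng.uniform(-r, r)
        d_squared = x * x + y * y
        if d_squared <= r * r:
            return 2.0 * math.sqrt(r * r - d_squared)


def estimate(method, n=100_000, r=1.0, seed=0):
    """Relative frequency of chords longer than the triangle's side."""
    rng = random.Random(seed)
    side = r * math.sqrt(3.0)  # side of the inscribed equilateral triangle
    longer = sum(1 for _ in range(n) if method(rng, r) > side)
    return longer / n


estimates = {
    "endpoints": estimate(chord_from_endpoints),
    "radius point": estimate(chord_from_radius_point),
    "midpoint": estimate(chord_from_midpoint),
}
for name, value in estimates.items():
    print(f"{name}: {value:.3f}")
```

Each procedure fixes a different set of possible cases before any counting is done, which is why the estimates cluster around a third, a half, and a quarter, respectively.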

6.3 THE PARADOX OF WATER AND WINE

The paradox of water and wine is mentioned in Richard von Mises’ Wahrscheinlichkeit, Statistik und Wahrheit (1928) (“Probability, Statistics, and Truth”), though it is perhaps due to von Kries’ Die Principien der Wahrscheinlichkeitsrechnung (1886) (“The Principles of the Probability Calculus”). The latter develops the range interpretation of probability, which is an alternative to the classical interpretation of probability that we do not have space to discuss.

The paradox is as follows. There is a quantity of liquid that consists of water and wine. The ratio of wine to water, x, lies between 1/3 and 3. What is the probability that the ratio of wine to water is at most 2, Pr (x ≤ 2)?

Here is the first answer that we get from the principle of indifference. It seems that each possible ratio of wine to water, x, between 1/3 and 3 is “equally possible.” Hence, Pr (x ≤ 2) = (2 − 1/3) / (3 − 1/3) = 5/8.

The ratio of water to wine, y, also lies between 1/3 and 3. Furthermore, for any number k between 1/3 and 3, 1/3 ≤ k ≤ 3, the ratio of wine to water is at most k, x ≤ k, if, and only if, the ratio of water to wine is at least 1/k, y ≥ 1/k. Against this background here is the second answer that we get from the principle of indifference. It seems that each possible ratio of water to wine, y, between 1/3 and 3 is “equally possible.” Hence, Pr (x ≤ 2) = Pr (y ≥ 1/2) = (3 − 1/2) / (3 − 1/3) = 15/16.

The proportion of wine, z, lies between 1/4 and 3/4 of the total quantity of liquid, since z = x / (1 + x) and x lies between 1/3 and 3. Furthermore, the ratio of wine to water, x, is at most 2, x ≤ 2, if and only if the ratio of water to wine, y, is at least 1/2, y ≥ 1/2. This in turn holds if and only if the proportion of wine, z, is at most 2/3 of the total quantity of liquid, z ≤ 2/3. Against this background here is the third answer that we get from the principle of indifference. It seems that each possible proportion of wine, z, between 1/4 and 3/4 is “equally possible.” Hence, Pr (x ≤ 2) = Pr (y ≥ 1/2) = Pr (z ≤ 2/3) = (2/3 − 1/4) / (3/4 − 1/4) = 5/6.

As before, the principle of indifference seems to be inconsistent because it seems to give three different answers to one and the same question. However, as before, what the paradox of water and wine really shows is that we must be careful in interpreting the following question: What is the probability that the ratio of wine to water is at most 2? In particular, we need to specify the underlying set of possible worlds or cases, and the algebra of propositions over this set, before we can even meaningfully ask this question.

The first answer assumes that the possible cases are the possible ratios of wine to water, x:

W1 = {x ∈ R : 1/3 ≤ x ≤ 3}

The first algebra A1 is generated by the subintervals of the interval [1/3, 3]. The first probability measure Pr1 assigns to each subinterval [a, b] of [1/3, 3] its normalized length, that is, Pr1 ([a, b]) = (b − a) / (3 − 1/3).

The second answer assumes that the possible cases are the possible ratios of water to wine, y:

W2 = {y ∈ R : 1/3 ≤ y ≤ 3}

The second algebra A2 is generated by the subintervals of the interval [1/3, 3]. The second probability measure Pr2 assigns to each subinterval [a, b] of [1/3, 3] its normalized length, that is, Pr2 ([a, b]) = (b − a) / (3 − 1/3).

The third answer assumes that the possible cases are the possible proportions of wine of the total quantity of liquid, z:

W3 = {z ∈ R : 1/4 ≤ z ≤ 3/4}

The third algebra A3 is generated by the subintervals of the interval [1/4, 3/4]. The third probability measure Pr3 assigns to each subinterval [a, b] its normalized length, that is, Pr3 ([a, b]) = (b − a) / (3/4 − 1/4).

Each of the three answers is perfectly acceptable. We just have to be aware that each of the three answers makes very different assumptions about one’s uncertainty or ignorance. The third answer reveals that the first answer is biased towards wine: According to the first probability measure, it is three times as probable that there is more wine in the liquid as that there is more water in the liquid, Pr1 (x > 1) = 3/4. The third answer also reveals that the second answer is biased towards water: According to the second probability measure, it is three times as probable that there is more water in the liquid as that there is more wine in the liquid, Pr2 (y > 1) = 3/4.

Von Mises intended the paradox of water and wine to be an argument against the logical and subjective interpretations of probability and for the interpretation of probability as (limiting) relative frequency. We will discuss these interpretations in the chapters to follow. Before doing so, let us briefly return to the shortcoming of the principle of indifference that it can only be applied when there are finitely many possible cases. This shortcoming is remedied by the principle of maximum entropy from E.T. Jaynes’ “Information Theory and Statistical Mechanics” (1957), which generalizes the principle of indifference and which we have applied in the previous cases. A drawback of this latter principle is that it does not always determine a unique probability measure: Sometimes more than one probability measure maximizes entropy.

Entropy is introduced in Claude Shannon’s “A Mathematical Theory of Communication” (1948) as a measure of the expected information or uncertainty that is contained in a probability measure. If the set of possible worlds or cases is finite, W = {w1, . . . , wn}, and the algebra of propositions is the power set of W, then the entropy H of a probability measure Pr on this algebra is measured as follows:

H (Pr) = − [Pr ({w1}) · log2 Pr ({w1}) + · · · + Pr ({wn}) · log2 Pr ({wn})]

(Here log2 is the logarithm to base 2, although other bases can be chosen as well.) If Pr ({wi}) = 1 for some possible world wi in W, then H (Pr) = 0, as Pr is minimally uncertain or maximally informative. The opposite is the case for the uniform probability measure Pru that assigns a uniform probability of 1/n to each of the propositions {w1}, . . . , {wn}, that is, Pru ({w1}) = · · · = Pru ({wn}) = 1/n. The uniform probability measure Pru is maximally uncertain or minimally informative. We have used it, among other places, in our first and fifth examples from Section 5.2.

The subjective or Bayesian interpretation of probability as degree of belief distinguishes between prior and posterior probabilities. Prior probabilities are the probabilities before information has been received, and posterior probabilities are the probabilities after information has been received. Since it is often not clear what the prior probabilities are or should be, some philosophers use the maximum entropy principle to determine these prior probabilities.

An even more general principle is the principle of relative entropy from Kullback and Leibler’s “On Information and Sufficiency” (1951). The latter principle measures the “distance” of one probability measure from another probability measure subject to the satisfaction of a constraint. Updating a probability measure (that is, obtaining a posterior from a prior probability measure, a topic we will discuss in Section 8.5) can be interpreted as minimizing the distance of the posterior probability measure from the prior probability measure subject to the constraint that one has become certain of the information one is updating on. The update rules of strict conditionalization and Jeffrey conditionalization that we will consider minimize relative entropy.
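Two calculations from this section can be checked numerically: the three answers to the water and wine question, and the entropy of a probability measure on a finite set of cases. The following is a minimal sketch with hypothetical function names. It takes the interval for the proportion of wine to be [1/4, 3/4], which follows from z = x / (1 + x) with x between 1/3 and 3, and it uses the standard convention that a case with probability 0 contributes 0 to the entropy.

```python
import math
from fractions import Fraction


def normalized_length(a, b, low, high):
    """Probability of [a, b] under the uniform measure on [low, high]."""
    return (Fraction(b) - Fraction(a)) / (Fraction(high) - Fraction(low))


third = Fraction(1, 3)

# First answer: uniform over the ratio of wine to water, x in [1/3, 3].
answer_x = normalized_length(third, 2, third, 3)
# Second answer: uniform over the ratio of water to wine, y in [1/3, 3].
answer_y = normalized_length(Fraction(1, 2), 3, third, 3)
# Third answer: uniform over the proportion of wine, z in [1/4, 3/4].
answer_z = normalized_length(Fraction(1, 4), Fraction(2, 3),
                             Fraction(1, 4), Fraction(3, 4))


def entropy(probabilities):
    """Shannon entropy (base 2) of a probability measure on finitely many
    cases, given as the list of probabilities of the singletons."""
    if any(p < 0 for p in probabilities):
        raise ValueError("probabilities must be non-negative")
    if not math.isclose(sum(probabilities), 1.0):
        raise ValueError("probabilities must sum to 1")
    # Convention: cases with probability 0 contribute nothing to the sum.
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


certain = entropy([1.0, 0.0, 0.0, 0.0])
uniform = entropy([0.25, 0.25, 0.25, 0.25])
skewed = entropy([0.7, 0.1, 0.1, 0.1])

print(answer_x, answer_y, answer_z)
print(certain, uniform, skewed)
```

With four cases, the measure that is certain of one case has entropy 0, the uniform measure has entropy log2 4 = 2, and the skewed measure in the example falls in between.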

READING The recommended reading for Chapter 6 is: Hájek, Alan (2011), Interpretations of Probability. In E.N. Zalta (ed.), Stanford Encyclopedia of Philosophy.

CHAPTER 7

The Logical Interpretation of Probability

7.1 STATE DESCRIPTIONS AND STRUCTURE DESCRIPTIONS

The logical interpretation of probability has been developed by, among others, Carnap and the British economist Keynes, in A Treatise on Probability (1921), as a basis for inductive logic. One way to think of this interpretation is as the result of replacing the “equally possible” cases in the classical interpretation of probability by logically possible cases. To this end, we consider a formal language L for predicate logic with individual constants ‘a,’ ‘b,’ and so on and relation symbols ‘F,’ ‘G,’ and so on (these include predicates, as we can conceive of properties as one-place relations). As in Chapter 1, we assume (the vocabulary of) L to contain a name for each individual.

A state description for L is a description of all individuals L can talk about that is consistent and L-maximal, that is, as complete as L allows. A state description for L says of each individual a and each property F that L can talk about whether or not a has F, and it says of any two individuals a and b and any binary relation G that L can talk about whether or not a stands in relation G to b, and so on. If L contains only finitely many individual constants and relation symbols, we can represent a state description for L as a long conjunction that has every atomic formula or its negation, but not both, as a conjunct. Otherwise it is best represented as the set of these sentences (these sets are the “maximal consistent” sets from the end of Section 4.5). The reason is that sentences are generally assumed to be of finite length, whereas sets can contain infinitely many members. Under the assumption that L contains a name for each individual, these sets are or can be used to represent the logically possible cases of predicate logic.

In contrast to a state description, a structure description for L says of each “maximal consistent” property P that L can talk about how many individuals there are with this property, and of each “maximal consistent” binary relation G that L can talk about how many pairs of individuals stand in this relation to each other, and so on. If L contains only finitely many individual constants and relation symbols, we can represent a structure description as a disjunction of state descriptions. Otherwise it is best represented as a set of state descriptions.

Let us suppose that the language L contains only finitely many individual constants and relation symbols so that there are only finitely many state and structure descriptions. Each function which assigns weights g1, . . . , gn between 0 and 1 inclusive, 0 ≤ g1, . . . , gn ≤ 1, to the n state descriptions s1, . . . , sn such that g1 + · · · + gn = 1 induces a probability on L. (It will be helpful to review the definition of a probability on a formal language from Section 5.5.) In “Logisch-Philosophische Abhandlung” (1921) (“Logico-Philosophical Treatise”), Wittgenstein proposed that the logical probability w on L results from assigning each of the n state descriptions s1, . . . , sn the same weight 1/n.
In contrast to this, Carnap thought that each of the i (i ≤ n) structure descriptions z1, . . . , zi should be assigned the same weight 1/i. Once this is done the weight of a structure description should be divided equally among the state descriptions whose disjunction this structure description consists in. The result is the logical probability m∗, and the conditional probability that is based on it is also known as the (absolute) confirmation function c∗. That is, if γ is a formula of L with positive logical probability, m∗ (γ) > 0, then the (absolute) confirmation function c∗ is the conditional probability m∗ (· | γ) that is defined for all formulas α of L by the ratio m∗ (α ∧ γ) / m∗ (γ).

There is a crucial difference between Wittgenstein’s and Carnap’s proposals. Wittgenstein’s proposal does not, but Carnap’s proposal does, allow “learning from experience” in the following sense. Suppose a police officer checks the license plates of twenty cars and finds that nineteen have not expired, but that one has. The conditional probability or degree of absolute confirmation that the license plate of the twenty-first car has not expired given that the license plates of nineteen out of the first twenty cars the police officer has checked have not expired should be greater than the nonconditional probability that the license plate of the twenty-first car has not expired:

m∗ (¬Expired (a21) | 19 out of a1, . . . , a20 have not expired) > m∗ (¬Expired (a21))

This allows learning from experience in the sense that it tells us that we should consider something to be more probable to occur in the future on the assumption that it has occurred frequently in the past. More generally, for appropriate k we get the result that the conditional probability or degree of absolute confirmation that the next object is P given that k out of the N objects about which we have the information whether they are P are P, and the remaining N − k are ¬P, is greater than the nonconditional probability that the next object is P:

m∗ (P (aN+1) | k out of a1, . . . , aN are P, and N − k are ¬P) > m∗ (P (aN+1))

In contrast to this, Wittgenstein’s proposal implies that the conditional probability that the next object is P given that any k between 0 and N inclusive, 0 ≤ k ≤ N, out of the N objects about which we have the information whether they are P are P, and the remaining N − k are ¬P, is equal to the nonconditional probability that the next object is P:

w (P (aN+1) | k out of a1, . . . , aN are P, and N − k are ¬P) = w (P (aN+1))

This does not allow learning from experience in the sense that it tells us that we should consider the probability that something occurs in the future to be independent of how many times it has occurred in the past.

The above presupposes that learning (or better: updating) works by moving from nonconditional probabilities to conditional probabilities. We will discuss this proposal in Section 8.5. For now, let us consider an example of a formal language with two individual constants ‘a’ and ‘b’ and one predicate ‘F.’ This language generates the following four state descriptions or logically possible cases:

s1 = {F (a), F (b)} or better for our purposes s1 = F (a) ∧ F (b)
s2 = {F (a), ¬F (b)} or better for our purposes s2 = F (a) ∧ ¬F (b)
s3 = {¬F (a), F (b)} or better for our purposes s3 = ¬F (a) ∧ F (b)
s4 = {¬F (a), ¬F (b)} or better for our purposes s4 = ¬F (a) ∧ ¬F (b)

These four state descriptions generate the following three structure descriptions:

There are two Fs and zero ¬Fs: z1 = {s1} = {F(a) ∧ F(b)} or z1 = s1 = F(a) ∧ F(b).
There is one F and one ¬F: z2 = {s2, s3} = {F(a) ∧ ¬F(b), ¬F(a) ∧ F(b)} or z2 = s2 ∨ s3 = (F(a) ∧ ¬F(b)) ∨ (¬F(a) ∧ F(b)).
There are zero Fs and two ¬Fs: z3 = {s4} = {¬F(a) ∧ ¬F(b)} or z3 = s4 = ¬F(a) ∧ ¬F(b).

As mentioned, Wittgenstein’s proposal is to assign the same weight 1/4 to the four state descriptions s1, s2, s3, and s4. Carnap’s proposal consists in first assigning the same weight 1/3 to the three structure descriptions z1, z2, and z3. Then these weights are divided equally among the state descriptions a structure description consists in:

There are two Fs and zero ¬Fs: m∗(z1) = m∗(s1) = m∗(F(a) ∧ F(b)) = 1/3.
There is one F and one ¬F: m∗(z2) = m∗(s2 ∨ s3) = 1/3, so m∗(s2) = m∗(F(a) ∧ ¬F(b)) = 1/6 and m∗(s3) = m∗(¬F(a) ∧ F(b)) = 1/6.
There are zero Fs and two ¬Fs: m∗(z3) = m∗(s4) = m∗(¬F(a) ∧ ¬F(b)) = 1/3.

Initially the probability that b is F equals a half:

$$\begin{aligned}
m^{*}(F(b)) &= m^{*}((F(a) \wedge F(b)) \vee (\neg F(a) \wedge F(b))) && \text{logic} \\
&= m^{*}(F(a) \wedge F(b)) + m^{*}(\neg F(a) \wedge F(b)) && \text{additivity and logic} \\
&= 1/3 + 1/6 && \text{Carnap's proposal} \\
&= 1/2 && \text{elementary calculus}
\end{aligned}$$

Now we receive the information that a is F and consider the conditional probability that b is F given that a is F:

$$\begin{aligned}
m^{*}(F(b) \mid F(a)) &= \frac{m^{*}(F(b) \wedge F(a))}{m^{*}(F(a))} && \text{def. of cond. prob.} \\
&= \frac{m^{*}(F(a) \wedge F(b))}{m^{*}((F(a) \wedge F(b)) \vee (F(a) \wedge \neg F(b)))} && \text{logic} \\
&= \frac{m^{*}(F(a) \wedge F(b))}{m^{*}(F(a) \wedge F(b)) + m^{*}(F(a) \wedge \neg F(b))} \\
&= \frac{1/3}{1/3 + 1/6} && \text{Carnap's proposal} \\
&= 2/3 && \text{elementary calculus}
\end{aligned}$$
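The two calculations for the language with constants ‘a’ and ‘b’ and predicate ‘F’ can be checked mechanically. The following Python sketch is only an illustration: the function names are hypothetical, a state description is represented as a pair of truth values (F(a), F(b)), and a structure description is identified by the number of individuals that are F.

```python
from fractions import Fraction
from itertools import product

# State descriptions: tuples (F(a), F(b)) of truth values.
states = list(product([True, False], repeat=2))

def wittgenstein_weights(states):
    """w: the same weight for every state description."""
    return {s: Fraction(1, len(states)) for s in states}

def carnap_weights(states):
    """m*: equal weight per structure description, split equally within it."""
    # Group state descriptions by how many individuals are F.
    structures = {}
    for s in states:
        structures.setdefault(sum(s), []).append(s)
    weights = {}
    for members in structures.values():
        for s in members:
            weights[s] = Fraction(1, len(structures)) / len(members)
    return weights

def prob(weights, event):
    """Probability of a proposition, given as a predicate on state descriptions."""
    return sum(w for s, w in weights.items() if event(s))

def cond_prob(weights, event, given):
    """Conditional probability as a ratio; requires prob(given) > 0."""
    return prob(weights, lambda s: event(s) and given(s)) / prob(weights, given)

Fa = lambda s: s[0]
Fb = lambda s: s[1]

m_star = carnap_weights(states)
w = wittgenstein_weights(states)

print(prob(m_star, Fb), cond_prob(m_star, Fb, Fa))  # 1/2 2/3
print(prob(w, Fb), cond_prob(w, Fb, Fa))            # 1/2 1/2
```

Exact fractions are used so that the results can be compared directly with the values 1/3, 1/6, 1/2, and 2/3 in the text.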

Thus, we see that m∗(F(b) | F(a)) > m∗(F(b)) on Carnap’s proposal, whereas w(F(b) | F(a)) = w(F(b)) on Wittgenstein’s (it is a useful exercise to show this!). Here is the general formula that Carnap presents in Logical Foundations of Probability (1950/1962):

$$m^{*}(F(a_{N+1}) \mid k \text{ out of } a_1, \ldots, a_N \text{ are } F, \text{ and } N - k \text{ are } \neg F) = \frac{k + \omega}{N + \kappa} = \frac{k + \kappa \cdot \frac{\omega}{\kappa}}{N + \kappa}$$

Let us unwrap this scary-looking formula. First we need to understand the concept of a “maximal consistent” predicate. Suppose the language L contains three predicates ‘F,’ ‘G,’ and ‘H.’ Then there are eight maximal consistent predicates in L, namely: F ∧ G ∧ H, F ∧ G ∧ ¬H, F ∧ ¬G ∧ H, ¬F ∧ G ∧ H, F ∧ ¬G ∧ ¬H, ¬F ∧ G ∧ ¬H, ¬F ∧ ¬G ∧ H, and ¬F ∧ ¬G ∧ ¬H. The predicate ‘F’ is equivalent to the disjunction of those four maximal consistent predicates that contain ‘F’ rather than ‘¬F’: (F ∧ G ∧ H) ∨ (F ∧ G ∧ ¬H) ∨ (F ∧ ¬G ∧ H) ∨ (F ∧ ¬G ∧ ¬H). The sense in which two predicates are equivalent is that they

are true of the exact same things on purely logical grounds. Also, note that it is predicates that are maximal consistent, not properties as I suggested earlier to keep things simple. Maximal consistency depends on the language we use, and a predicate may be maximal consistent in one language and fail to be so in another.

κ in the previous formula is the number of maximal consistent predicates of the language. ω is what Carnap calls the “logical width” of the predicate ‘F’, that is, the number of maximal consistent predicates whose disjunction is equivalent to ‘F.’ In our example κ = 8 and ω = 4. ω/κ is called the “relative (logical) width” of the predicate ‘F.’ The important point for us is that all these factors are purely logical in that they are determined by the formal language L. N, on the other hand, is the number of individuals about which one has the information whether they are F, and k is the number of individuals about which one has the information that they are F. k/N is the observed relative frequency of Fs among a1, . . . , aN. The important point for us is that these latter factors are not purely logical but empirical, for they depend on the information one has.

The previous formula works for some cases but delivers odd results for others. Therefore, Carnap generalizes it in The Continuum of Inductive Methods (1952), where he introduces a parameter λ that weighs between the purely logical factor ω/κ of the relative (logical) width of a predicate ‘F’ and the empirical factor of the observed relative frequency k/N of Fs among a1, . . . , aN:

$$m_{\lambda}(F(a_{N+1}) \mid k \text{ out of } a_1, \ldots, a_N \text{ are } F, \text{ and } N - k \text{ are } \neg F) = \frac{k + \lambda \cdot \frac{\omega}{\kappa}}{N + \lambda}$$

Here is the interesting point. If we let the weighing parameter λ increase without bound, λ → ∞ in the notation of

mathematicians (the arrow ‘→’ has nothing to do with the material conditional or the arrow in the characterization of a function), then the purely logical factor becomes more and more important, and the empirical factor becomes less and less important. In the limit the empirical factor does not matter anymore at all, and the probability is entirely determined by the purely logical factor. This limiting case is precisely Wittgenstein’s proposal:

$$\begin{aligned}
w(F(a_{N+1}) \mid k \text{ out of } a_1, \ldots, a_N \text{ are } F, \text{ and } N - k \text{ are } \neg F)
&= \lim_{\lambda \to \infty} m_{\lambda}(F(a_{N+1}) \mid k \text{ out of } a_1, \ldots, a_N \text{ are } F, \text{ and } N - k \text{ are } \neg F) \\
&= \lim_{\lambda \to \infty} \frac{k + \lambda \cdot \frac{\omega}{\kappa}}{N + \lambda} \\
&= \frac{\omega}{\kappa} = w(F(a_{N+1}))
\end{aligned}$$
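The value of this limit can be checked in one intermediate step that the text leaves implicit: with k and N held fixed, divide numerator and denominator by λ and let the terms k/λ and N/λ go to zero.

```latex
\lim_{\lambda \to \infty} \frac{k + \lambda \cdot \frac{\omega}{\kappa}}{N + \lambda}
  = \lim_{\lambda \to \infty} \frac{\frac{k}{\lambda} + \frac{\omega}{\kappa}}{\frac{N}{\lambda} + 1}
  = \frac{0 + \frac{\omega}{\kappa}}{0 + 1}
  = \frac{\omega}{\kappa}
```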

Another case is to set the weighing factor λ equal to zero. In this case, the purely logical factor does not matter anymore at all, and the probability is entirely determined by the empirical factor of the observed relative frequency. This extreme case is a probabilified version of Reichenbach’s “straight(-forward) rule” from Experience and Prediction (1938) that we will discuss in Section 10.2:

$$\begin{aligned}
r(F(a_{N+1}) \mid k \text{ out of } a_1, \ldots, a_N \text{ are } F, \text{ and } N - k \text{ are } \neg F)
&= m_{0}(F(a_{N+1}) \mid k \text{ out of } a_1, \ldots, a_N \text{ are } F, \text{ and } N - k \text{ are } \neg F) \\
&= \frac{k + 0 \cdot \frac{\omega}{\kappa}}{N + 0} \\
&= \frac{k}{N}
\end{aligned}$$

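A third case is to set the weighing factor λ equal to κ. This is precisely Carnap’s original proposal:

$$\begin{aligned}
m^{*}(F(a_{N+1}) \mid k \text{ out of } a_1, \ldots, a_N \text{ are } F, \text{ and } N - k \text{ are } \neg F)
&= m_{\kappa}(F(a_{N+1}) \mid k \text{ out of } a_1, \ldots, a_N \text{ are } F, \text{ and } N - k \text{ are } \neg F) \\
&= \frac{k + \kappa \cdot \frac{\omega}{\kappa}}{N + \kappa} \\
&= \frac{k + \omega}{N + \kappa}
\end{aligned}$$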

A well-known special case of the last proposal results when the language L contains just one predicate, in which case ω = 1 and κ = 2. It is called the “rule of succession” and has been used as early as Laplace (1814):

$$s(F(a_{N+1}) \mid k \text{ out of } a_1, \ldots, a_N \text{ are } F, \text{ and } N - k \text{ are } \neg F) = \frac{k + 1}{N + 2}$$
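The general formula with the parameter λ and its special cases can be put side by side in a few lines of Python. This is a sketch under stated assumptions: the function names are hypothetical, and the license-plate numbers (19 out of 20) are evaluated as if the language contained just one predicate, so that ω = 1 and κ = 2.

```python
from fractions import Fraction

def lambda_rule(k, N, omega, kappa, lam):
    """(k + lam * omega/kappa) / (N + lam), as an exact fraction.

    Raises ZeroDivisionError when N + lam == 0, for example for the
    straight rule (lam = 0) before any observation has been made.
    """
    return (k + Fraction(lam) * Fraction(omega, kappa)) / (N + Fraction(lam))

def straight_rule(k, N, omega, kappa):
    """lam = 0: only the observed relative frequency k/N matters."""
    return lambda_rule(k, N, omega, kappa, lam=0)

def carnap_m_star(k, N, omega, kappa):
    """lam = kappa: Carnap's original proposal (k + omega) / (N + kappa)."""
    return lambda_rule(k, N, omega, kappa, lam=kappa)

def rule_of_succession(k, N):
    """One predicate (omega = 1, kappa = 2): (k + 1) / (N + 2)."""
    return carnap_m_star(k, N, omega=1, kappa=2)

k, N = 19, 20
print(straight_rule(k, N, omega=1, kappa=2))       # 19/20
print(rule_of_succession(k, N))                    # 10/11
print(float(lambda_rule(k, N, 1, 2, lam=10**6)))   # close to omega/kappa = 0.5
print(rule_of_succession(0, 0))                    # 1/2
```

With k = 1 and N = 1 the rule of succession returns 2/3, which agrees with the value of m∗(F(b) | F(a)) computed for the two-constant language.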

The rule of succession is also well-defined if one has no information yet (that is, N = 0 and, hence, k = 0), whereas the straight(-forward) rule and its probabilified version are undefined in this case. This is an important feature of Carnap’s proposal, as I will explain now. Carnap’s general formula mirrors the debate between the continental rationalists Descartes, Spinoza, and Leibniz (Dea 2012) and the British empiricists Locke, Berkeley, and Hume (Markie 2013). The parameter λ characterizes the importance and power attributed to the mind. The empiricists hold that there is no a priori knowledge and that all knowledge comes from the senses. Without experience, the mind is an empty slate, a tabula rasa, that contains no information, so cannot assign any probability. This position is mirrored in Reichenbach’s

proposal which puts all the weight on the empirical factor and which, quite fittingly, is undefined in the absence of information. In contrast to this, the rationalists hold that there is a priori knowledge, and this position is mirrored in proposals which put some positive weight on the purely logical factor. Wittgenstein’s proposal is the hyperrationalist one that all probability assignments are entirely a priori. Carnap’s earlier proposal puts some weight on the purely logical factor of the relative (logical) width of a predicate, and some weight on the empirical factor of the observed relative frequency. It strikes a middle course between Reichenbach’s empiricist proposal and Wittgenstein’s rationalist or logical proposal, much like Kant tried to synthesize the British empiricist and continental rationalist traditions. Carnap’s philosophical positions are generally classified as belonging to logical empiricism (Creath 2011), or logical positivism, that we have encountered in Chapter 4, although I think it is fair to say that he, much like Kant, leans more towards the rationalist or logical side. In concluding this section, let us briefly return to the justification, rather than definition, of induction. Kant (1781) distinguishes between analytic and synthetic truths, and between a priori and a posteriori knowledge. As mentioned in Section 3.3, the former has become a distinction in the philosophy of language, according to which a true sentence is analytically true if, and only if, it is true in virtue of its meaning, and synthetically true otherwise, that is, if it is true in part because of what reality is like. The latter has become a distinction in epistemology, according to which knowledge is a priori if, and only if, it can be obtained without experience, and a posteriori otherwise, that is, if it can be obtained only with the help of the senses. 
Both distinctions differ from the metaphysical distinction between necessity and contingency, which bears some resemblance to Hume’s

distinction between relations of ideas and matters of fact (as does the analytic/synthetic distinction). Kant claims that there are synthetic truths that we can know a priori. As an example, he cites the claim—made by Newtonian physics (Newton 1687)—that space satisfies the axioms of Euclid’s Elements. According to Einstein (1915)’s general theory of relativity, this claim is false. Therefore, Kant’s claim is not only not true a priori, but not true at all. Against this background, Carnap (1934) argues that it is not only the example that Kant cites that is false, but Kant’s general claim that there are any synthetic truths that we can know a priori. The claim that there are no synthetic truths that we can know a priori is one of Carnap’s first philosophical contributions, before writing hundreds of pages on the definition or explication of logical probability. Much later, in the volume The Philosophy of Rudolf Carnap (1963) edited by Schilpp, Carnap, for the first time, turns his attention to “the controversial problem of the justification of induction” (Carnap 1963: 978). For Carnap, the justification of induction boils down to the justification of the axioms of inductive logic that characterize the probability m∗ , or the set of probabilities mλ . These axioms include the three probability axioms non-negativity, normalization, and additivity, and many more besides. According to Carnap, the “reasons [for accepting the axioms of inductive logic] are based upon our intuitive judgments concerning inductive validity” (Carnap 1963: 978). Carnap takes this to imply that “[i]t is impossible to give a purely deductive justification of induction” and that these “reasons are a priori” (Carnap 1963: 978). Elsewhere, in Carnap (1934), he identifies analytic truth with deductive derivability or provability in a logical system. 
Given this understanding of analyticity, Carnap’s claim is that the reasons for accepting the axioms of inductive logic are synthetic truths that we can know a priori—truths whose existence he previously denied.

7.2 ABSOLUTE CONFIRMATION AND INCREMENTAL CONFIRMATION

The next two concepts are crucial. We consider a probability Pr on a formal language L over a non-empty vocabulary V. We say that α absolutely confirms β given γ to degree d (in the sense of the probability Pr) if, and only if, the conditional probability of β given α ∧ γ is defined and equals d, Pr(α ∧ γ) > 0 and Pr(β | α ∧ γ) = d. Note that this is a quantitative concept of confirmation. Its qualitative counterpart is defined as follows. α absolutely confirms β given γ (in the sense of the probability Pr) if, and only if, the conditional probability of β given α ∧ γ is defined and “sufficiently” high, that is, higher than some threshold r that is at least a half, Pr(β | α ∧ γ) > r ≥ 1/2.

In contrast to this, α incrementally confirms β given γ (in the sense of the probability Pr) if, and only if, the conditional probability of β given α ∧ γ is defined and greater than the probability of β given γ, Pr(α ∧ γ) > 0 and Pr(β | α ∧ γ) > Pr(β | γ). Note that this is a qualitative concept of confirmation. The idea is that for information to confirm a hypothesis the information must raise or increase the probability of the hypothesis. This concept has many quantitative counterparts, some of which are motivated by the following slightly more general concept: α is positively relevant to β given γ (in the sense of the probability Pr) if, and only if, Pr(γ) > 0 and Pr(α ∧ β | γ) > Pr(α | γ) · Pr(β | γ). Positive probabilistic relevance is slightly more general than incremental confirmation because it is also well-defined if α ∧ γ (though not γ) has probability zero. Both concepts are symmetric (it is a useful exercise to show this!): If α incrementally confirms β given γ, then β incrementally confirms α given γ; and, if α is positively relevant to β given γ, then β is positively relevant to α given γ. As mentioned, incremental confirmation has many quantitative counterparts.
These include the degree of

incremental confirmation as it is defined by the distance measure D and the ratio measure R:

$$D = \Pr(\beta \mid \alpha \wedge \gamma) - \Pr(\beta \mid \gamma) \qquad R = \Pr(\beta \mid \alpha \wedge \gamma) / \Pr(\beta \mid \gamma)$$

Which measure to use in a particular situation depends on what aspects of incremental confirmation one wants to focus on. For instance, in a salary negotiation, one can discuss a potential raise in terms of percents or absolute amounts of dollars. If the salary increase is supposed to compensate for inflation, then discussing the potential increase in percents will be useful. On the other hand, if the salary increase is supposed to reflect one’s achievements during the previous year in a way that is independent of one’s salary bracket, so that senior employees get the same increase for the same achievements as junior employees, then it may be best to use absolute amounts of dollars. The distance measure measures incremental confirmation in terms of absolute amounts of probabilities, while the ratio measure measures it in terms of percents. Both of these measures are legitimate measures of incremental confirmation. One just has to keep in mind that they are focusing on different aspects of incremental confirmation. There is no right or wrong here, just more or less useful for various purposes.
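A small Python sketch may help to see how the two measures can come apart. The function names are hypothetical. The first pair of probabilities is taken from the two-constant example of this chapter (α = F(a), β = F(b), γ a tautology); the other two pairs are made-up numbers chosen purely for illustration.

```python
from fractions import Fraction

def distance_measure(posterior, prior):
    """D = Pr(β | α ∧ γ) − Pr(β | γ)."""
    return posterior - prior

def ratio_measure(posterior, prior):
    """R = Pr(β | α ∧ γ) / Pr(β | γ); requires Pr(β | γ) > 0."""
    return posterior / prior

# Pairs (Pr(β | α ∧ γ), Pr(β | γ)).
cases = {
    "two-constant example": (Fraction(2, 3), Fraction(1, 2)),
    "hypothetical improbable hypothesis": (Fraction(2, 100), Fraction(1, 100)),
    "hypothetical probable hypothesis": (Fraction(9, 10), Fraction(1, 2)),
}

for name, (posterior, prior) in cases.items():
    d = distance_measure(posterior, prior)
    r = ratio_measure(posterior, prior)
    print(f"{name}: D = {d}, R = {r}")
```

In the two hypothetical cases the improbable hypothesis has the larger ratio (2 versus 9/5) but the smaller distance (1/100 versus 2/5), which is the sense in which the measures focus on different aspects of incremental confirmation.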

7.3 CARNAP ON HEMPEL

Carnap defines qualitative confirmation as positive probabilistic relevance or incremental confirmation, and quantitative confirmation as conditional probability or absolute confirmation. This is a curious choice, as it does not settle on either incremental confirmation or absolute confirmation as the concept of confirmation. Recall Hempel’s four conditions of adequacy for any definition of confirmation: the entailment condition, the special consequence condition, the special

consistency condition, and the converse consequence condition. All of these conditions are formulated for a qualitative concept of confirmation. In discussing them, Carnap notes that the entailment condition is satisfied by, or true of, incremental confirmation under certain restrictions, namely if the confirmed hypothesis h does not have probability 1 and the confirming information i has positive probability, Pr (h) < 1 and Pr (i) > 0 (it is a useful exercise to show this!). Regarding the special consequence condition, Carnap suggests that “Hempel has in mind as explicandum the following relation: ‘the degree of confirmation of h on i is greater than r’, where r is a fixed value, perhaps 0 or 1/2” (Carnap 1962: 475). That is, while the entailment condition, under certain restrictions, is satisfied by, or true of, incremental confirmation, the explicandum or analysandum Hempel allegedly has in mind when formulating the special consequence condition is the different qualitative concept of absolute confirmation. Regarding the special consistency condition Carnap notes that “Hempel regards it as a great advantage of any explicatum satisfying [the special consistency condition] “that it sets a limit, so to speak, to the strength of the hypotheses which can be confirmed by given evidence” [. . .] This argument does not seem to have any plausibility for our explicandum,” (Carnap 1962: 477) which is the qualitative concept of incremental confirmation. “But,” Carnap continues, “it is plausible for the second explicandum mentioned earlier: the degree of confirmation exceeding a fixed value r. Therefore we may perhaps assume that Hempel’s acceptance of [the special consistency condition] is due again to an inadvertent shift to the second explicandum” (Carnap 1962: 477–478). Carnap’s discussion of Hempel’s conditions can be summarized as follows. Hempel’s entailment condition is satisfied by, or true of, incremental confirmation under

certain restrictions. Hempel’s special consequence and special consistency conditions are satisfied by, or true of, absolute confirmation for r ≥ 1/2 (again, it is a useful exercise to show this!), but they are not satisfied by, or true of, incremental confirmation. Therefore, so Carnap concludes, Hempel confuses absolute and incremental confirmation when presenting these three conditions of adequacy. (This is an interesting conclusion in light of Carnap’s own choice mentioned above.) Finally, when Hempel presents the converse consequence condition, he gets completely confused, so to speak, because this fourth condition is satisfied neither by absolute confirmation nor by incremental confirmation. Carnap’s analysis is neither very charitable, nor is it very good. First, note (or better: show) that the entailment condition is not only satisfied by incremental confirmation but is also satisfied by absolute confirmation—indeed, it is so without restrictions! Instead of mixing up absolute and incremental confirmation, Hempel presents three conditions that are satisfied by the qualitative concept of absolute confirmation for r ≥ 1/2 (as well as Hempel’s own satisfaction criterion). In addition, Carnap’s analysis leaves open what concept of confirmation Hempel has in mind when presenting the converse consequence condition. An alternative analysis of Hempel’s conditions interprets them as characterizing the end that confirmation is a means to attaining. The first three conditions characterize a concept of confirmation that aims at logically weak or probable hypotheses. The fourth condition characterizes a concept of confirmation that aims at logically strong or informative hypotheses. Hempel’s triviality result then becomes the insight that these two concepts are conflicting. Consequently, Hempel rejects one concept (that is, the concept of confirmation that aims at informative hypotheses), and settles for the other concept of confirmation that aims at probable hypotheses.

The satisfaction criterion defines this latter concept of confirmation.

7.4 THE JUSTIFICATION OF LOGIC

In Fact, Fiction, and Forecast (1954/1983), Goodman suggests that, with custom, Hume did not merely want to describe our inductive practices, but also justify them. Goodman compares inductive logic to deductive logic. He suggests that a particular deductive inference is justified by conforming to the valid general rules of deductive logic, and that the valid general rules of deductive logic in turn are justified by conforming to accepted deductive practice, that is, which particular deductive inferences we actually make and find acceptable.

“This looks flagrantly circular. [. . .] But this circle is a virtuous one. The point is that rules and particular inferences are justified by being brought into agreement with each other. A rule is amended if it yields an inference we are unwilling to accept; an inference is rejected if it violates a rule we are unwilling to amend.” (Goodman 1983: 64)

According to Goodman, the same is true of inductive logic: “[a]n inductive inference, too, is justified by conformity to general rules, and a general rule by conformity to accepted inductive inferences.” (Goodman 1983: 65) Therefore, the problem of the justification of induction reduces to the problem of the definition of valid inductive rules: “[t]he problem of induction is not a problem of demonstration, but a problem of defining the difference between valid and invalid predictions.” (Goodman 1983: 65) The general rules of deductive logic Goodman refers to are rules such as conditional proof and universal generalization. The particular deductive inferences Goodman refers to are

particular instantiations or applications of these rules such as those in your solutions to the exercises. An example of a possible general rule of inductive logic is the principle of universal induction. A particular inductive inference is inferring the conclusion that all ravens are black from the premise that all objects about which one has the relevant information (that is, whether they are ravens and whether they are black) are black if they are ravens.

Contrary to what Goodman suggests, there is no universal agreement between the general rules of deductive logic and the particular deductive inferences we make and find acceptable in practice. As a case in point, consider the following three general rules of deductive logic that are valid, that is, their conclusion (to the right of the therefore symbol ∴) is true in all logically possible cases in which their premise (to the left of the therefore symbol ∴) is true:

γ ∴ α → γ

γ ∴ α ∨ γ

¬α ∴ α → γ
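That these three rules are valid in the sense just stated can be confirmed with a truth table. The following Python sketch is an illustration only: the helper names are hypothetical, and α and γ are treated as atomic sentences, which suffices because the rules are schemas.

```python
from itertools import product

def implies(p, q):
    """Material conditional p → q."""
    return (not p) or q

# Each rule is (name, premise, conclusion); both are functions of (alpha, gamma).
rules = [
    ("γ ∴ α → γ", lambda a, g: g, lambda a, g: implies(a, g)),
    ("γ ∴ α ∨ γ", lambda a, g: g, lambda a, g: a or g),
    ("¬α ∴ α → γ", lambda a, g: not a, lambda a, g: implies(a, g)),
]

def is_valid(premise, conclusion):
    """True if the conclusion is true in every case in which the premise is true."""
    return all(
        conclusion(a, g)
        for a, g in product([True, False], repeat=2)
        if premise(a, g)
    )

for name, premise, conclusion in rules:
    print(name, is_valid(premise, conclusion))  # each line ends in True
```

For contrast, the same check rejects the invalid pattern α → γ ∴ γ, because the premise is true and the conclusion false when both α and γ are false.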

Now consider the following three particular deductive inferences that conform to (that is, are instantiations or applications of) these three general rules, respectively. I submit that few people will make these inferences or find them acceptable in practice:

1. Athens is the capital of Greece. Therefore, if the moon is made of cheese, then Athens is the capital of Greece.
2. Athens is the capital of Greece. Therefore, Athens is the capital of Greece, or the moon is made of cheese.
3. Tehran is not the capital of Greece. Therefore, if Tehran is the capital of Greece, then the moon is made of cheese.

According to one view, the general rules of logic, whether of deductive logic or inductive logic, are normative principles that tell one how one ought to reason or make inferences.

Furthermore, according to one view of normativity, or rationality, viz. instrumentalism, what one ought to do is to take the means to one’s ends. On this view norms are hypothetical imperatives, as opposed to categorical imperatives, that are conditional upon some end. If we combine these two views, we arrive at the instrumentalist view that the general rules of logic are hypothetical imperatives that tell one how one ought to reason or make inferences and that are conditional upon some cognitive end. (These ends are cognitive, as we assume that obeying a logical rule will directly serve only a cognitive end.) In order to justify such a hypothetical imperative, one has to show that obeying the rule or imperative (that is, doing what it says one ought to do) actually is a means to attaining the end the imperative is conditional upon. In particular, in order to justify the general rules of logic, one has to show that obeying these rules (that is, reasoning or making inferences in accordance with them) actually is a means to attaining the cognitive end these rules are conditional upon. A consequence of instrumentalism is that, strictly speaking, it is meaningless to say that a general rule of logic, or any other rule for that matter, is, or is not, justified. The reason is that, according to instrumentalism, justification is not a property of such a rule but rather a relation that obtains, or does not obtain, between obeying the rule on the one hand and attaining a certain end on the other hand. To justify a rule, or norm, or imperative, is to provide an argument whose premises one’s audience accepts and whose conclusion says that obeying the rule actually is a means to attaining the end the rule is conditional upon. Yet this requires us to specify an end before we can even meaningfully ask if a norm, or rule, or imperative is, or is not, justified relative to this end. Let us consider an example other than logic. Suppose I ask or tell you: Do not smoke! 
Why should you obey this rule, or

norm, or imperative, and refrain from smoking? Well, certainly not just because I tell you so. Instead, you should obey this norm and refrain from smoking because doing so is a means to attaining the end of not developing lung cancer, an end I presume you to have. If you are asking me to justify this norm to you, then I point out that I am assuming you to possess a certain end, namely to avoid developing lung cancer, and that not smoking significantly lowers the risk of developing lung cancer, and is a (non-deterministic, causal) means to attaining the end that I am assuming you to have. Of course, if you do not have this end and, instead, desire to develop lung cancer, then my assumption is false, and you may not need to obey this norm. However, this does not affect the validity of the norm in question, which is conditional upon the end I have falsely assumed you to have, and which consists in not smoking being a means to attaining the end of not developing lung cancer. Let us return to logic. What is the cognitive end that the rules of deductive logic are conditional upon? The answer is given by two theorems that are known as the soundness and completeness theorems for classical logic. Soundness says that all particular inferences that conform to, or are instantiations of, the general rules of deductive logic are “truth-preserving with logical necessity”: Their conclusion is true in all logically possible cases in which all their premises are true. Completeness says that only particular inferences that conform to, or are instantiations of, the general rules of deductive logic are truth-preserving with logical necessity—that is, all particular inferences that are truth-preserving with logical necessity conform to, or are instantiations of, the general rules of deductive logic. The method of truth tables only works because of these two results. 
This means that the general rules of deductive logic, understood as hypothetical imperatives of reasoning or making inferences, are justified relative to the cognitive end of

reasoning or making inferences in a way that is truth-preserving with logical necessity. For all and only those particular inferences that conform to, or are instantiations of, the general rules of deductive logic are truth-preserving with logical necessity. Reasoning or making inferences according to these rules can be shown to be a means to attaining the end of reasoning or making inferences in a way that is truth-preserving with logical necessity. One should obey these rules and reason in accordance with the general rules of logic to the extent that one desires to reason in a way that is truth-preserving with logical necessity.

Unfortunately, a new problem arises. In “The Justification of Deduction” (1976), Haack compares deductive logic to inductive logic and notes that Hume’s argument carries over from the inductive case to the deductive case. Haack’s preferred variant of Hume’s argument takes the form of a dilemma.

Hume’s dilemma for induction: A deductive justification of induction would be too strong because it would show that induction always leads from true premises to true conclusions. An inductive justification of induction would be circular.

Haack’s dilemma for deduction: An inductive justification of deduction would be too weak because it would merely show that deduction usually leads from true premises to true conclusions. A deductive justification of deduction would be circular.

Haack (1976) is right in claiming that a deductive justification of deduction, like an inductive justification of induction, is circular. The soundness and completeness theorems for classical

logic prove something about classical logic by relying on the very rules that are valid in classical logic. As a consequence, the deductive justification of deduction relative to the cognitive end of reasoning in a way that is truth-preserving with logical necessity is circular, and this is the new problem. However, Haack is not quite right in claiming that a deductive justification of induction would be too strong. According to the instrumentalist view of logic, we first need to specify a cognitive end before we can even meaningfully ask if the principle of induction is justified relative to this cognitive end. As Hume points out, there is no deductively valid argument which does not presuppose its conclusion, whose premises are restricted to information we have, and whose conclusion says that the principle of induction holds. That is, we cannot justify the principle of induction by a deductively valid argument relative to the cognitive end of reasoning in a way that leads from true premises to true conclusions in all or most of the logically possible cases. However, this is compatible with there being other cognitive ends relative to which we can justify the principle of induction by a deductively valid argument. Specifically, as we will see in Section 10.2, Reichenbach provides a deductively valid argument which does not presuppose its conclusion, whose premises may belong to information we have, but whose conclusion does not say that the principle of induction leads from true premises to true conclusions in all or most of the logically possible cases. Instead, the conclusion of Reichenbach’s argument says that a particular principle of induction—the straight(-forward) rule—is a means to attaining the cognitive end of “converging to the correct answer” (in a sense to be explained). To the extent that one has this cognitive end one ought to obey the straight(-forward) rule. 
Since we can establish this means-end relationship between obeying the straight(-forward) rule and

attaining the cognitive end of converging to the correct answer by a deductively valid argument which does not presuppose its conclusion and whose premises may belong to information we have, it turns out that, pace Hume and Haack, we can justify induction deductively after all. We just cannot do so relative to the cognitive end of reasoning in a way that leads from true premises to true conclusions in all or most of the logically possible cases. (As an aside, note that it is also not quite correct that an inductive justification of deduction would be too weak. For, I submit, all particular inferences about which we have the information whether they conform to the general rules of deductive logic, and whether their premises and conclusion are true or false, have at least one false premise or a true conclusion if they conform to the general rules of deductive logic. Applied to this premise, the principle of universal induction yields the conclusion that all particular inferences that conform to the general rules of deductive logic have at least one false premise or a true conclusion. That is, applied to this premise the principle of universal induction yields the conclusion that the rules of deductive logic always—and not just usually—lead from true premises to true conclusions.)

7.5 THE NEW RIDDLE OF INDUCTION

Goodman’s work on induction is even better known for the so-called “new riddle of induction,” which has led to the demise of Carnap’s attempts to define probability in purely “syntactical,” or logical, terms. Here is one version of this riddle. The information we have includes the proposition that all objects about which we have the information whether they are emeralds and whether they are green are green if they are emeralds. This proposition can be expressed by the following
sentence e, where a1, ..., aN are the objects about which we have the information that they are emeralds:

e = a1 is a green emerald, and ..., and aN is a green emerald

Now consider the hypothesis h that the first emerald to be observed on or after January 1, 2100, is green. On most of the purely logical, or syntactical, accounts of confirmation we have come across so far, e confirms h. Let us introduce a new term. We say that an object x is grue at time t if, and only if, x is green at t and t is a time prior to January 1, 2100, or x is blue at t and t is a time on or after January 1, 2100. Given this definition and the fact that all the information we have is restricted to times prior to January 1, 2100, a fact which is itself part of the information we have, e is logically equivalent to the following sentence e∗:

e∗ = a1 is a grue emerald, and ..., and aN is a grue emerald

Now consider the hypothesis h∗ that the first emerald to be observed on or after January 1, 2100, is grue. On most of the purely logical, or syntactical, accounts of confirmation we have come across so far, such as Hempel’s and Carnap’s, e∗ confirms h∗. This is because the logical, or syntactical, relationship between e∗ and h∗ is exactly the same as that between e and h, respectively. However, given that blue objects are not green, which is part of the information we have, the hypothesis h∗ says that the first emerald to be observed on or after January 1, 2100, is blue. So h∗ contradicts the hypothesis h unless there is no emerald that will be observed on or after January 1, 2100. Thus, the information we have can be used to confirm two (conditionally) inconsistent hypotheses and, indeed, just about any hypothesis, or no hypothesis at all—just about any hypothesis because ‘blue’ can be replaced with any other predicate, or no hypothesis at all because of the following consideration.
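The grue construction can be checked mechanically. The following minimal Python sketch is an illustration only: the observation dates and all function names are hypothetical, and colours are modeled as plain strings. It shows that e and e∗ agree on observations made before January 1, 2100, while no single colour makes both h and h∗ true on or after that date.

```python
from datetime import date

CUTOFF = date(2100, 1, 1)  # the date used in the definition of 'grue'


def is_grue(colour: str, t: date) -> bool:
    """Grue at t: green at t with t before CUTOFF, or blue at t with t on or after CUTOFF."""
    return (colour == "green" and t < CUTOFF) or (colour == "blue" and t >= CUTOFF)


# Hypothetical evidence: every emerald observed so far was green, before the cutoff.
observations = [
    ("green", date(1990, 5, 1)),
    ("green", date(2017, 3, 9)),
    ("green", date(2024, 12, 31)),
]

# e: all observed emeralds are green; e_star: all observed emeralds are grue.
e = all(colour == "green" for colour, _ in observations)
e_star = all(is_grue(colour, t) for colour, t in observations)


def h(colour: str, t: date) -> bool:
    """h: the first emerald observed on or after the cutoff is green."""
    return colour == "green"


def h_star(colour: str, t: date) -> bool:
    """h_star: the first emerald observed on or after the cutoff is grue."""
    return is_grue(colour, t)


future = date(2100, 1, 1)
# Colours that would make both hypotheses true of an emerald observed at `future`.
jointly_satisfying = [c for c in ("green", "blue") if h(c, future) and h_star(c, future)]
print(e, e_star, jointly_satisfying)
```

The sketch only models the two colours mentioned in the text; it does not settle which of the two hypotheses the evidence ought to confirm.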


from the above that these aspects cannot be specified in purely logical, or syntactical, terms alone. This conclusion may be familiar from discussions of curve-fitting. For any finitely many data points a1, ..., aN and any predicted future point aN+1 there is some curve on which all of these points lie. No matter what the data a1, ..., aN are, they can be used to confirm any prediction aN+1 whatsoever. In the figure that follows, the data points ⟨x, y⟩ are ⟨1, 1⟩, ⟨2, 3⟩, and ⟨3, 2⟩, which may be, say, the number of births y in some local community on the day x of the year 2017. For any prediction about the number of births on the fourth day of 2017—such as ⟨4, 1⟩, ⟨4, 2⟩, and ⟨4, 3⟩—there is some function or curve f(x) = y that goes through all the data points—and, hence, is confirmed by these data—as well as the predicted point, which is thus confirmed as well.

[Figure: plot of the data points and the predicted points; x-axis from 0 to 5, y-axis from 0 to 4.]
As mentioned, this conclusion has led to the demise of Carnap’s project of defining probability in purely logical, or syntactical, terms and paved the way for the rise of the subjective or Bayesian interpretation of probability as degree of belief that we will discuss in the next chapter. Before doing so, let me briefly mention Goodman’s own solution.


According to Goodman, the objects about which we do not have information resemble the objects about which we do have information in those aspects that are projectible, but not those that are not. Projectibility is inductive inferability, and the idea is that “lawlike” universal hypotheses of the form ‘All Fs are G’ such as that all emeralds are green and that all metals conduct electricity are indeed confirmed by their instances, but that accidental or spurious generalizations such as that all emeralds are grue and that all coins in my pocket are loonies are not. We may project or inductively infer that further metals will conduct electricity from the information that previous metals have, and that further emeralds will be green from the information that previous emeralds are. However, we may not project or inductively infer that further coins in my pocket will be loonies from the information that previous coins are, or that further emeralds will be grue from the information that previous emeralds are. Goodman thinks a hypothesis is lawlike only if it is formulated in terms of predicates such as ‘emerald’ and ‘green’ and ‘metal’ and ‘conducts electricity’ that are well entrenched, and that a predicate is well entrenched if and only if it has been used frequently in the past. Predicates such as ‘grue’ and ‘coin in my pocket’ are not well entrenched because they have not been used frequently in the past. If the rules of (inductive) logic are normative principles, as, for example, the instrumentalist view of logic has it, this amounts to saying that we ought to project or infer inductively what we do in fact project or infer inductively—and this is to infer an Ought from an Is, an inference that Hume (1739) alleges does not hold. Why is this so? Whether a predicate is well entrenched is determined by the frequency with which it has been used in the past. Therefore, it is a factual question whether a predicate is well entrenched. 
Consequently, it is also a factual question whether a hypothesis is formulated in terms of predicates that
are well entrenched. Suppose all the objects about which we have the information whether they are F and whether they are G are G if they are F. Should we project or infer inductively that the next F to be observed will be G, too? On Goodman’s view, the answer to this question is determined by the factual matter whether ‘F’ and ‘G’ are well entrenched. What we ought to project or infer inductively is determined by, and inferred from, what is the case.

7.6 EXERCISES

Exercise 31: Consider two objects, viz. today a and tomorrow b, and one property S they can have, viz. whether or not the sun rises on them. List the four state descriptions and three structure descriptions that the two individual constants ‘a’ and ‘b’ and one predicate ‘S’ give rise to.

Exercise 32: Specify the logical probability of the four state descriptions from Exercise 31 according to Wittgenstein’s proposal w as well as according to Carnap’s proposal m∗.

Exercise 33: We continue with the same example. First observe whether or not the sun has risen today and write down your observational result. Then compute the nonconditional logical probability that the sun rises tomorrow and the conditional logical probability that the sun rises tomorrow given your observational result. Do this on Wittgenstein’s proposal w as well as on Carnap’s proposal m∗. Finally, determine whether your observational result incrementally confirms the hypothesis that the sun rises tomorrow on Wittgenstein’s proposal w as well as on Carnap’s proposal m∗.

Exercise 34: We consider a probability Pr on a formal language L over a non-empty vocabulary V and the following two definitions. A formula α incrementally confirms a formula β in the sense of Pr if, and only if, Pr(α) > 0 and Pr(β | α) > Pr(β). A formula α is positively relevant to a formula β in the sense of Pr if, and only if, Pr(α ∧ β) > Pr(α) · Pr(β). Show that a formula α is positively relevant to a formula β in the sense of Pr if α incrementally confirms β in the sense of Pr. You may skip the use of the principle of universal generalization (UG).

Exercise 35: Continuing Exercise 34, show that ¬α is positively relevant to ¬β in the sense of Pr if the formula α incrementally confirms the formula β in the sense of Pr. You may skip the use of the principle of universal generalization (UG). In addition, you may rely on the “lemma” N that for all formulas α from L: Pr(¬α) = 1 − Pr(α). (A lemma is a theorem that one uses in the proof of another theorem.)

READINGS

The recommended readings for Chapter 7 include:

Goodman, Nelson (1954/1983), Fact, Fiction, Forecast. Cambridge, MA: Harvard University Press. Chapter 3: The New Riddle of Induction.

and perhaps also

Carnap, Rudolf (1962), Logical Foundations of Probability. 2nd ed. Chicago: University of Chicago Press. §87 (468–478).
Crupi, Vincenzo (2015), Confirmation. In E.N. Zalta (ed.), Stanford Encyclopedia of Philosophy.
Haack, Susan (1976), The Justification of Deduction. Mind 85, 112–119.
Huber, Franz (2007), Confirmation and Induction. In J. Fieser & B. Dowden (eds.), Internet Encyclopedia of Philosophy.

CHAPTER 8

The Subjective Interpretation of Probability

8.1 DEGREES OF BELIEF

Contemporary Bayesian confirmation theory has evolved out of Carnap’s project of constructing an inductive logic upon a logical interpretation of probability. However, it frees itself from the constraints of the logical interpretation of probability. This development is foreshadowed in Carnap’s own work. Initially he supplemented the three probability axioms of non-negativity, normalization, and additivity with many additional axioms that narrow down the set of all probabilities on a formal language over some non-empty vocabulary to just the single probability m∗. Later he dropped some of these axioms, while keeping others, thus arriving at the set of probabilities mλ that includes m∗ as well as (uncountably) many other probabilities (for uncountability see Chapter 11). According to the subjective or Bayesian interpretation, probabilities are the degrees of belief, or credences, an ideal cognitive agent ought to have. We can define a cognitive agent to be ideal if, and only if, every cognitive action that is physically possible is a cognitive action that is possible for her. This
implies that she can perform any cognitive action that she ought to perform—that is, the principle that Ought implies Can (McConnell 2014) does not release her from any cognitive obligations. In particular, an ideal cognitive agent always gets to voluntarily decide what to believe to what degree, never forgets any of her degrees of belief, and is always certain of all logical and conceptual truths. On a radical subjective interpretation of probability, non-negativity, normalization, and additivity are the only constraints on an ideal cognitive agent’s degree of belief function. This implies that every probability (measure), and not just some mλ, is a permissible degree of belief function for an ideal cognitive agent. The subjective interpretation of probability comes with the thesis that only probability measures are permissible degree of belief functions for an ideal cognitive agent. The radical subjective interpretation strengthens this thesis to the claim, known as probabilism, that all and only probability measures are permissible degree of belief functions for an ideal cognitive agent. Why should an ideal cognitive agent’s degree of belief function obey the probability calculus? Some philosophers think that Cox’s (1946) theorem as well as the representation theorems of measurement (not measure) theory (Krantz, Luce, Suppes & Tversky 1971) provide an answer to this question. However, the two arguments that have attracted the most attention are the Dutch book argument and the gradational accuracy argument. We will discuss these two arguments in turn.

8.2 THE DUTCH BOOK ARGUMENT

We define an ideal cognitive agent’s betting ratio for a sentence α as that number r such that she is willing to pay up to r dollars
for, and sell for r dollars or more, the following bet: \$1 if α is true, and \$0 if α is false. We assume the ideal cognitive agent to have a unique betting ratio for every sentence in her language. Apart from the instrumentalist understanding of normativity, or rationality, according to which one ought to take the means to one’s ends, and according to which norms are hypothetical imperatives that are conditional on some end, there is the deontological understanding. The latter has it that there are some things one just ought to do or will or intend—not because doing or willing or intending these things is a means to attaining some end one might have, but because it simply is one’s duty or obligation to do or will or intend these things. On this view, some norms are categorical imperatives that tell one what one’s duties are. The Dutch book argument can be formulated by adopting a deontological or an instrumentalist understanding of normativity. In the former case, the conclusion tells the ideal cognitive agent what her duties (qua believer) are. However, one also needs to postulate what an ideal cognitive agent must not do or will or intend, that is, what some of her other duties are. Therefore, the deontological version of the Dutch book argument, like every deontological argument, inevitably involves an element of dogmatism: It postulates a duty that itself is not justified any further. In the instrumentalist case, no such assumptions are needed. The conclusion says that believing in a particular way is a means to attaining a certain end that the ideal cognitive agent may, or may not, have.

The Dutch Book Argument (deontological version)

Premise 1: An ideal cognitive agent’s degrees of belief are identical to her betting ratios.
Premise 2: An ideal cognitive agent should not be willing to accept a series of bets that guarantees a monetary loss; that is, a Dutch book. It is her duty (perhaps qua actor rather than qua believer) not to do or will or intend so.
Premise 3 (Dutch book theorem): An ideal cognitive agent’s betting ratios violate the probability calculus if, and only if, she is willing to accept a series of bets that guarantees a monetary loss.
Conclusion: An ideal cognitive agent’s degrees of belief ought to obey the probability calculus. It is her duty (qua believer) that they do.

The right-to-left or if-direction of the Dutch book theorem presupposes that every bet is a bet on some sentence α in the ideal cognitive agent’s language L that returns \$1 if α is true, and \$0 if α is false (see also Halpern 2003: 79).

The Dutch Book Argument (instrumentalist version)

Premise 1: An ideal cognitive agent’s degrees of belief are identical to her betting ratios.
Premise 2 (Dutch book theorem): An ideal cognitive agent’s betting ratios violate the probability calculus if, and only if, she is willing to accept a series of bets that guarantees a monetary loss.
Conclusion: An ideal cognitive agent’s degrees of belief ought to obey the probability calculus given that she has the end of not being willing to accept a series of bets that guarantees a monetary loss. Doing the former is a means to attaining the latter end that she may, or may not, have.

Note that the required means in the hypothetical imperative of the latter conclusion cannot be detached from the end the hypothetical imperative is conditional upon: Even if it is a fact that the ideal cognitive agent has the end of not being willing to accept a series of bets that guarantees a monetary loss, one cannot infer that her degrees of belief nonconditionally ought to obey the probability calculus from the fact that she has this
end. On the instrumentalist view of normativity, Oughts are not derived from, but conditional on ends. The Dutch book argument is due to Ramsey’s “Truth and Probability” (1926) and de Finetti’s “La Prévision: ses lois logiques, ses sources subjectives” (1937) (“Foresight: its logical laws, its subjective sources”). It has been criticized on the grounds that degrees of belief—while perhaps being measurable by betting ratios, or the betting behavior they give rise to, under favorable conditions—are not identical to them. Therefore, Premise 1 is false. The general theme of these criticisms is that degrees of belief are mental states that should be “cognitively rational,” whereas betting ratios are defined in terms of practical actions or intentions to act that should be “pragmatically rational.” Yet cognitive and pragmatic rationality can come apart. Put in deontological terms, the problem is that there is a difference between one’s duties qua believer and one’s duties qua actor. In response to these criticisms, other philosophers have come up with the depragmatized Dutch book argument (Vineberg 2016). The latter works with the concept of a fair betting ratio, which “depragmatizes” the concept of a betting ratio and is defined as follows. An ideal cognitive agent’s fair betting ratio for a sentence α is that number r such that she considers the following bet for a price of r dollars to be fair: \$1 if α is true, and \$0 if α is false. This definition replaces the pragmatic concept of willing to pay or receive some amount of money for a bet by the cognitive concept of considering a bet for a certain price to be fair. We assume the ideal cognitive agent to have a unique fair betting ratio for every sentence in her language. As before, the right-to-left or if-direction of the Dutch book theorem presupposes that every bet is a bet on some sentence α in the ideal cognitive agent’s language L that returns \$1 if α is true, and \$0 if α is false.


The Depragmatized Dutch Book Argument (deontological version)

Premise 1: An ideal cognitive agent’s degrees of belief are identical to her fair betting ratios.
Premise 2: An ideal cognitive agent should not consider any Dutch book to be fair. It is her duty (qua believer) not to do so.
Premise 3 (Dutch book theorem): An ideal cognitive agent’s fair betting ratios violate the probability calculus if, and only if, she considers at least one Dutch book to be fair.
Conclusion: An ideal cognitive agent’s degrees of belief ought to obey the probability calculus. It is her duty (qua believer) that they do.

The Depragmatized Dutch Book Argument (instrumentalist version)

Premise 1: An ideal cognitive agent’s degrees of belief are identical to her fair betting ratios.
Premise 2 (Dutch book theorem): An ideal cognitive agent’s fair betting ratios violate the probability calculus if, and only if, she considers at least one Dutch book to be fair.
Conclusion: An ideal cognitive agent’s degrees of belief ought to obey the probability calculus given that she has the end of not considering any Dutch book to be fair. Doing the former is a means to attaining the latter end that she may, or may not, have.

To illustrate, let α be ‘It will be rainy, but not sunny, tomorrow’ and β ‘It will be sunny, but not rainy, tomorrow.’ α ∧ β is logically false. Now you get to determine your betting ratios r for α, s for β, and t for α ∨ β. I then get to decide which of the following three bets I am selling to you, and which of them I am buying from you. You are willing to buy these bets from me as well as sell them to me, for we assume you to have a unique betting ratio for every sentence of your language.


Bet 1: Returns \$1 if α is true, and \$0 if α is false.
Bet 2: Returns \$1 if β is true, and \$0 if β is false.
Bet 3: Returns \$1 if α ∨ β is true, and \$0 if α ∨ β is false.

Suppose first that r + s > t. In this case, I sell bets 1 and 2 to you and buy bet 3 from you. Your payoff looks as follows (α and β cannot both be true):

α   β   bet 1    bet 2    bet 3    total
T   F   1 − r    −s       t − 1    t − r − s < 0
F   T   −r       1 − s    t − 1    t − r − s < 0
F   F   −r       −s       t        t − r − s < 0

Suppose next that r + s < t. In this case, I buy bets 1 and 2 from you and sell bet 3 to you. Your payoff looks as follows (again, α and β cannot both be true):

α   β   bet 1    bet 2    bet 3    total
T   F   r − 1    s        1 − t    r + s − t < 0
F   T   r        s − 1    1 − t    r + s − t < 0
F   F   r        s        −t       r + s − t < 0
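The two payoff tables can be recomputed for concrete betting ratios. The following Python sketch is an illustration under stated assumptions: the function name `dutch_book_payoffs` and the sample ratios are hypothetical, and the bookie is assumed to pick the direction of the bets exactly as described above. It returns your total payoff in each of the three possible cases.

```python
from fractions import Fraction


def dutch_book_payoffs(r, s, t):
    """Total payoff for the agent in the cases (alpha, beta) = (T, F), (F, T), (F, F).

    r, s, t are the agent's betting ratios for alpha, beta, and (alpha or beta).
    """
    cases = [(True, False), (False, True), (False, False)]
    # sign = +1: the agent buys bets 1 and 2 and sells bet 3 (the case r + s > t).
    # sign = -1: the agent sells bets 1 and 2 and buys bet 3 (used otherwise).
    sign = 1 if r + s > t else -1
    totals = []
    for alpha, beta in cases:
        bet1 = sign * (int(alpha) - r)           # buying costs r and returns 1 if alpha
        bet2 = sign * (int(beta) - s)            # buying costs s and returns 1 if beta
        bet3 = -sign * (int(alpha or beta) - t)  # bet 3 is taken in the opposite direction
        totals.append(bet1 + bet2 + bet3)
    return totals


# Ratios violating additivity in both directions, and one additive assignment.
too_high = dutch_book_payoffs(Fraction(3, 5), Fraction(3, 5), Fraction(1))
too_low = dutch_book_payoffs(Fraction(1, 5), Fraction(1, 5), Fraction(1, 2))
additive = dutch_book_payoffs(Fraction(1, 4), Fraction(1, 4), Fraction(1, 2))
print(too_high, too_low, additive)
```

Exact fractions are used so that the comparison r + s > t is not disturbed by floating-point rounding.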

The one and only way for you to set your betting ratios for these three bets without being willing to accept a Dutch book is to have r + s = t. This is precisely what the additivity axiom of the probability calculus requires. In the presence of non-negativity and normalization, our definition of (fair) betting ratios restricts the maximal gain or loss in a bet to \$1. This restriction can be dropped by adopting the following more general definition. An ideal cognitive agent’s fair betting ratio for a sentence α is that number r = b/(a + b) such that she considers the following bet to be fair: a dollars if α is true, and −b dollars if α is false, where a + b ≠ 0. Similarly for betting ratios. To guarantee that the “stakes” a + b of the
bet do not affect the ideal cognitive agent’s (fair) betting ratio, it is generally assumed that she values money linearly: For any positive real number s, she values s additional dollars always to the same positive degree, no matter how many dollars she already has. The latter assumption implies that the ideal cognitive agent is neither risk averse nor risk prone. She exemplifies risk aversion if she is willing to accept, or considers as fair, a bet that returns \$1 if α is true, and −\$1 if α is false, but is not willing to accept, or does not consider as fair, a bet that returns \$1,000 if α is true, and −\$1,000 if α is false. She exemplifies risk proneness if she does the opposite. Premise 1 remains problematic even in the depragmatized Dutch book argument. It requires that the ideal cognitive agent’s fair betting ratio for α is (in some appropriate sense) independent of the truth value of α. This is not the case for sentences whose truth value the ideal cognitive agent cares about, say, that she will be a billionaire soon, β. She might consider as fair a bet that returns a = \$1 if ¬β is true, and −b = −\$899,999,999 if ¬β is false, even if her degree of belief that she will not be a billionaire soon is smaller than b/(a + b) = 899,999,999/900,000,000. Premise 1 is also violated by an ideal cognitive agent who enjoys gambling in the casino and considers as fair a bet such as roulette that returns \$35 if the ball lands on 36, and −\$1 if the ball lands on 0 or 1 or . . . or 35, even though her degree of belief that the ball lands on 36 equals 1/37, and so is smaller than 1/(35 + 1) = 1/36. Ramsey’s original formulation of the Dutch book argument avoids these problems by working with utility instead of money and by assuming the existence of at least one “ethically neutral” proposition or sentence that the ideal cognitive agent considers to be as probable as its negation. 
An ethically neutral sentence is one whose truth value has no utility or value for the ideal cognitive agent other than that it settles a bet.
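The arithmetic behind the two counterexamples can be checked with the general definition r = b/(a + b). The sketch below is illustrative only: the function name `betting_ratio` is hypothetical, and the expected-value line is an added calculation that is not part of the original argument.

```python
from fractions import Fraction


def betting_ratio(a, b):
    """Ratio r = b / (a + b) for a bet paying a if alpha is true and -b if alpha is false."""
    if a + b == 0:
        raise ValueError("the stakes a + b must be non-zero")
    return Fraction(b, a + b)


# Billionaire example: a = 1 and b = 899,999,999 on the sentence 'not beta'.
billionaire_ratio = betting_ratio(1, 899_999_999)

# Roulette example: a = 35 and b = 1 on 'the ball lands on 36'.
roulette_ratio = betting_ratio(35, 1)
credence_36 = Fraction(1, 37)  # degree of belief stated in the text

# Added illustration: expected monetary value of the roulette bet under that credence.
expected_value = 35 * credence_36 - 1 * (1 - credence_36)
print(billionaire_ratio, roulette_ratio, expected_value)
```

The roulette bet is considered fair at a ratio of 1/36 although the agent's degree of belief is only 1/37, which is the mismatch that Premise 1 rules out.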


The drawback of moving from money to utility is that one must now assume that utility is additive in order for the Dutch book theorem to remain true. However, while \$1 plus \$1 equals \$2, it is not plausible that the value or utility of one right shoe plus the value or utility of another (or the same) right shoe equals the value or utility of two right shoes. Plausibly, the latter is smaller. The drawback of assuming the existence of an ethically neutral sentence that the ideal cognitive agent considers to be as probable as its negation is that it is not clear if this assumption can be made precise without presupposing the thesis that her degrees of belief (should) obey the probability calculus—the very thesis that the Dutch book argument is supposed to justify. For what does it mean to consider an ethically neutral sentence to be as probable as its negation other than that this sentence and its negation belong to the language on which the ideal cognitive agent’s degree of belief function is defined, that this function obeys the probability calculus, and that it assigns a half to this sentence and its negation?

8.3 THE GRADATIONAL ACCURACY ARGUMENT

In “Accuracy and Coherence: Prospects for an Alethic Epistemology of Partial Belief” (2009), Joyce provides a genuinely nonpragmatic vindication of probabilism, that is, the thesis that all and only probability measures are permissible degree of belief functions for an ideal cognitive agent. The inaccuracy of the ideal cognitive agent’s degree of belief b(A) in the proposition A in the possible world w equals the “distance”—in whichever way it is measured—between b(A) and the truth value of A in w, w(A). We identify the latter with 1 if A is true in w and with 0 if A is false in w. The inaccuracy
of the ideal cognitive agent’s entire degree of belief function b in the possible world w is determined by the inaccuracies of her degrees of belief b(A) in A in w, for all propositions A. Joyce restricts the discussion to finite partitions of, rather than arbitrary algebras over, the set of all possible worlds W. We will follow him in this regard, and to stress this restriction, I will call the propositions cells. According to Joyce (2009: 288), his theorem “readily generalizes to the case where [the domain of the degree of belief functions] is not a partition.” There are two steps in determining the inaccuracy i(b, w) of a degree of belief function b in a possible world w. First, one needs to determine the inaccuracies of individual degrees of belief b(A) for particular cells A in the possible world w. Then one needs to aggregate these individual inaccuracies of particular cells to the overall inaccuracy of the entire degree of belief function b in the possible world w. One way of first measuring, and then aggregating, individual inaccuracies to the overall inaccuracy employs the so-called “Brier score” B (Brier 1950). B measures the individual inaccuracies by the squared differences between b(A) and w(A) and then aggregates by taking the weighted sum of the individual inaccuracies. Where {A1, ..., An} is a finite partition of the set of all possible worlds W:

$$B(b, w) = \frac{(b(A_1) - w(A_1))^2 + \cdots + (b(A_n) - w(A_n))^2}{n}$$
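The Brier score is straightforward to compute. Here is a minimal Python sketch, assuming that a degree of belief function on a partition with n cells is represented as a sequence of n numbers and a possible world as the matching sequence of truth values 1 and 0; the function name `brier` and the sample credences are hypothetical.

```python
from fractions import Fraction


def brier(b, w):
    """Brier score: the sum of squared differences between b(A) and w(A), divided by n."""
    if len(b) != len(w):
        raise ValueError("b and w must be defined on the same partition")
    return sum((bi - wi) ** 2 for bi, wi in zip(b, w)) / len(b)


# A partition with three cells and the world in which the first cell is true.
world = [1, 0, 0]
credences = [Fraction(1, 2), Fraction(1, 4), Fraction(1, 4)]

score = brier(credences, world)  # (1/4 + 1/16 + 1/16) / 3
perfect = brier(world, world)    # the truth values themselves have inaccuracy 0
print(score, perfect)
```

Lower scores mean greater accuracy, so the truth-value assignment of the actual world is the most accurate function by this measure.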

A degree of belief function b whose domain is a finite partition of the set of all possible worlds W is accuracy dominated if, and only if, there exists another degree of belief function b∗ whose domain is the same partition and which is such that i (b, w) ≥ i (b∗ , w) for all possible worlds w in W, and i (b, v) > i (b∗ , v) for at least one possible world v in W. If an ideal cognitive agent’s degree of belief function on a finite partition is accuracy dominated, then (and then only) there exists an alternative degree of belief function on this partition which is at least
as accurate in all possible worlds, and strictly more accurate in some. Joyce suggests that, in this situation, an ideal cognitive agent can only improve her cognitive state if she moves from her current, accuracy-dominated degree of belief function to a degree of belief function that accuracy dominates the former. In deontological terms, we can say that it is the ideal cognitive agent’s duty (qua believer) not to have an accuracy-dominated degree of belief function. An instrumentalist is content with pointing out that an ideal cognitive agent may have the cognitive end of not having an accuracy-dominated degree of belief function. Suppose our measure of the overall inaccuracy i of entire degree of belief functions in possible worlds is “finite and continuous.” This means that i’s co-domain does not contain infinity ∞, and that small changes in i’s argument result in small changes in i’s value: Roughly, for any possible world w, if the difference between i’s arguments x and y is “small,” then so is the difference between i’s values i(x, w) and i(y, w) (mathematicians call the elements of a function’s domain “arguments,” and the elements of its co-domain “values,” but these arguments and values have nothing to do with philosophers’ arguments and values). Suppose further that our measure of the overall inaccuracy i satisfies the two conditions below. The first constrains how the individual inaccuracies of particular degrees of belief for various cells in a possible world are aggregated to the overall inaccuracy of an entire degree of belief function in this possible world.

Truth Directedness: For every possible world w in W and any two degree of belief functions b and b∗ whose domain is the finite partition {A1, ..., An} of W: if b(A) ≥ b∗(A) ≥ w(A) or b(A) ≤ b∗(A) ≤ w(A) for all cells A in {A1, ..., An}, and if b(A) > b∗(A) ≥ w(A) or b(A) < b∗(A) ≤ w(A) for at least one of these cells A, then i(b, w) > i(b∗, w).

Coherent Admissibility: For all probability measures Pr whose domain is the finite partition {A1, ..., An} of W and all degree of belief functions b with the same domain: either i(Pr, w) ≤ i(b, w) for all possible worlds w in W, or i(Pr, v) < i(b, v) for at least one possible world v in W, or both.

Suppose a degree of belief function’s individual degrees of belief for all cells are at least as close to what is the truth according to possible world w as the individual degrees of belief of another degree of belief function. Suppose further that the first degree of belief function’s individual degree of belief for at least one cell is strictly closer to what is the truth according to possible world w than the other degree of belief function’s individual degree of belief for this cell. According to truth directedness, the first degree of belief function is overall closer to what is the truth according to possible world w than the other degree of belief function. Coherent admissibility states that probability measures are not accuracy dominated. Joyce’s reasons for this are that the chances that we will discuss in the next chapter are probabilities, and that, according to a principle to be discussed there as well, these chances—when the ideal cognitive agent is certain of them, and when she has no undermining or “inadmissible” information—should guide her subjective degrees of belief. Therefore, the latter should be probabilities, too. For this argument to work, every probability measure needs to be a possible chance function, which is not clear. In addition, we need to assume that chances are probabilities rather than derive this result from said principle, as we will do. For our purposes, it is best to think of coherent admissibility as characterizing the audience of Joyce’s argument.
These are philosophers who are already convinced that all probability
measures are permissible degree of belief functions for ideal cognitive agents but who think that some other functions—such as some Dempster-Shafer belief functions (Dempster 1967; 1968 and Shafer 1976)—that violate the probability axioms might be as well. Addressing this audience, Joyce is able to prove that, if all probability measures are permissible degree of belief functions for ideal cognitive agents, then (all and) only probability measures are permissible degree of belief functions for ideal cognitive agents.

Joyce’s theorem: If inaccuracy is measured by a finite and continuous function satisfying truth directedness and coherent admissibility, then all and only probability measures are not accuracy dominated.

This theorem provides the basis for the gradational accuracy argument.

The Gradational Accuracy Argument (deontological version)

Premise 1: Inaccuracy is measured by a finite and continuous function satisfying truth directedness and coherent admissibility.
Premise 2: An ideal cognitive agent should not have a degree of belief function that is accuracy dominated. It is her duty (qua believer) not to have such a degree of belief function.
Premise 3 (Joyce’s theorem): If inaccuracy is measured by a finite and continuous function satisfying truth directedness and coherent admissibility, then an ideal cognitive agent’s degree of belief function violates the probability calculus if, and only if, it is accuracy dominated.
Conclusion: An ideal cognitive agent’s degree of belief function ought to obey the probability calculus. It is her duty (qua believer) that it does.

The Gradational Accuracy Argument (instrumentalist version)
Premise 1 Inaccuracy is measured by a finite and continuous function satisfying truth directedness and coherent admissibility.
Premise 2 (Joyce’s theorem) If inaccuracy is measured by a finite and continuous function satisfying truth directedness and coherent admissibility, then an ideal cognitive agent’s degree of belief function violates the probability calculus if, and only if, it is accuracy dominated.
Conclusion An ideal cognitive agent’s degree of belief function ought to obey the probability calculus given that she has the cognitive end of having a degree of belief function that is not accuracy dominated. Doing the former is a means to attaining the latter cognitive end that she may, or may not, have.

The two figures below illustrate a probabilistic degree of belief function a and a non-probabilistic degree of belief function b on a partition with two and three cells, respectively. In the two-dimensional case, the cells are the points ⟨0, 1⟩ and ⟨1, 0⟩. In the three-dimensional case, the cells are the points ⟨0, 0, 1⟩, ⟨0, 1, 0⟩, and ⟨1, 0, 0⟩.

In the two-dimensional case, the probabilistic degree of belief functions are the points on the line connecting the two cells ⟨0, 1⟩ and ⟨1, 0⟩. The non-probabilistic degree of belief functions are all the other points ⟨x, y⟩. In this case, the probabilistic degree of belief functions have the form ⟨x, 1 − x⟩ for 0 ≤ x ≤ 1.

In the three-dimensional case, the probabilistic degree of belief functions are the points on the plane that is spanned by the three cells ⟨0, 0, 1⟩, ⟨0, 1, 0⟩, and ⟨1, 0, 0⟩. The non-probabilistic degree of belief functions are all the other points ⟨x, y, z⟩. In this case, the probabilistic degree of belief functions have the form ⟨x, y, 1 − x − y⟩ for 0 ≤ x ≤ 1, 0 ≤ y ≤ 1, and x + y ≤ 1.

Joyce’s theorem holds for all finite and continuous measures of inaccuracy satisfying truth directedness and coherent admissibility. Among these measures, the two-dimensional Euclidean distance √((xc − xd)² + (yc − yd)²) between c = ⟨xc, yc⟩ and d = ⟨xd, yd⟩ as well as the three-dimensional Euclidean distance √((xc − xd)² + (yc − yd)² + (zc − zd)²) between c = ⟨xc, yc, zc⟩ and d = ⟨xd, yd, zd⟩ lend themselves to a visual representation. If w is any possible world in the cell ⟨0, 1⟩, say, the Euclidean inaccuracy of a = ⟨xa, ya⟩ in w is √((xa − 0)² + (ya − 1)²). If w is any possible world in the cell ⟨0, 0, 1⟩, say, the Euclidean inaccuracy of a = ⟨xa, ya, za⟩ in w is √((xa − 0)² + (ya − 0)² + (za − 1)²).

By looking at the first figure, we see that for each non-probabilistic degree of belief function b off the line connecting the two cells ⟨0, 1⟩ and ⟨1, 0⟩ there exists a probabilistic degree of belief function a on this line that is closer to what is the truth according to possible world w, no matter which of the two cells contains the possible world w. This probabilistic point a is the point where the line connecting the two cells ⟨0, 1⟩ and ⟨1, 0⟩ meets the line which is perpendicular to it and contains the non-probabilistic point b. On the other hand, there is no probabilistic point a on the line connecting the two cells ⟨0, 1⟩ and ⟨1, 0⟩ for which there exists another (probabilistic or non-probabilistic) point b on or off this line that is closer to what is the truth according to possible world w, no matter which of the two cells contains the possible world w. Moving the probabilistic point a closer to the possible worlds in the cell ⟨0, 1⟩, say, will result in a degree of belief function that is further away from the possible worlds in the cell ⟨1, 0⟩.

[Figure: the two-dimensional case, with axes x and y running from 0 to 1, showing a probabilistic point a and a non-probabilistic point b.]

As an aside, the Dempster-Shafer belief functions mentioned earlier correspond to the points ⟨x, y⟩ with x + y ≤ 1 for 0 ≤ x ≤ 1 and 0 ≤ y ≤ 1. These are precisely the points in the lower triangle on the left whose hypotenuse is the line of probabilistic points connecting the two cells ⟨0, 1⟩ and ⟨1, 0⟩.

By looking at the second figure, we see that for each non-probabilistic degree of belief function b off the plane that is spanned by the three cells ⟨0, 0, 1⟩, ⟨0, 1, 0⟩, and ⟨1, 0, 0⟩ there exists a probabilistic degree of belief function a on this plane that is closer to what is the truth according to possible world w, no matter which of the three cells contains the possible world w. This probabilistic point a is the point where the plane that is spanned by the three cells ⟨0, 0, 1⟩, ⟨0, 1, 0⟩, and ⟨1, 0, 0⟩ meets the line which is orthogonal to it and contains the non-probabilistic point b. On the other hand, there is no probabilistic point a on the plane that is spanned by the three cells ⟨0, 0, 1⟩, ⟨0, 1, 0⟩, and ⟨1, 0, 0⟩ for which there exists another (probabilistic or non-probabilistic) point b on or off this plane that is closer to what is the truth according to possible world w, no matter which of the three cells contains the possible world w.
Moving the probabilistic point a closer to the possible worlds in the cells ⟨0, 0, 1⟩ and ⟨0, 1, 0⟩, say, will result in a degree of belief function that is further away from the possible worlds in the cell ⟨1, 0, 0⟩.

[Figure: the three-dimensional case, with axes x, y, and z running from 0 to 1, showing a probabilistic point a and a non-probabilistic point b.]

The Dempster-Shafer belief functions now correspond to the points ⟨x, y, z⟩ with x + y + z ≤ 1 for 0 ≤ x ≤ 1, 0 ≤ y ≤ 1, and 0 ≤ z ≤ 1. These are precisely the points in the lower tetrahedron on the left whose base is the plane of probabilistic points that is spanned by the three cells ⟨0, 0, 1⟩, ⟨0, 1, 0⟩, and ⟨1, 0, 0⟩.
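The geometric picture can be checked numerically. The following is a minimal sketch, assuming Euclidean inaccuracy and assuming that the orthogonal projection of b stays inside the unit square or cube; the function names and the sample points b2 and b3 are illustrative choices, not taken from the text.

```python
import math

def euclidean_inaccuracy(belief, world):
    """Euclidean distance between a belief vector and a cell's truth-value vector."""
    return math.sqrt(sum((x - t) ** 2 for x, t in zip(belief, world)))

def project_onto_probabilities(belief):
    """Orthogonal projection onto the line or plane of vectors that sum to 1.

    Assumption: the projected point has no negative coordinate; otherwise the
    nearest probabilistic point lies on the boundary and needs a different method.
    """
    shift = (1 - sum(belief)) / len(belief)
    return tuple(x + shift for x in belief)

def accuracy_dominates(a, b, worlds):
    """True if a is strictly closer to the truth than b in every cell."""
    return all(euclidean_inaccuracy(a, w) < euclidean_inaccuracy(b, w) for w in worlds)

# Two-dimensional case: cells <0, 1> and <1, 0>.
worlds_2d = [(0, 1), (1, 0)]
b2 = (0.2, 0.4)                      # hypothetical non-probabilistic point
a2 = project_onto_probabilities(b2)  # roughly (0.4, 0.6)

# Three-dimensional case: cells <0, 0, 1>, <0, 1, 0>, and <1, 0, 0>.
worlds_3d = [(0, 0, 1), (0, 1, 0), (1, 0, 0)]
b3 = (0.5, 0.5, 0.6)                 # hypothetical non-probabilistic point
a3 = project_onto_probabilities(b3)  # roughly (0.3, 0.3, 0.4)

print(accuracy_dominates(a2, b2, worlds_2d))
print(accuracy_dominates(a3, b3, worlds_3d))
```

Both checks come out true because, by the Pythagorean theorem, the squared distance from b to any cell equals the squared distance from a to that cell plus the squared distance from b to a.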

8.4 BAYESIAN CONFIRMATION THEORY

Let us assume that we are dealing with a regular probability Pr on some formal language over some non-empty vocabulary and that we are only considering sentences that are neither logically true nor logically false. In this case, there is no difference between incremental confirmation and positive probabilistic relevance, and there are many equivalent ways to define this concept. Here are some of them.

e incrementally confirms / is irrelevant to or independent of / incrementally disconfirms h given b in the sense of Pr if, and only if,

Pr (h | e ∧ b) is greater than / equal to / smaller than Pr (h | b)
Pr (h ∧ e | b) is greater than / equal to / smaller than Pr (h | b) · Pr (e | b)
Pr (e | h ∧ b) is greater than / equal to / smaller than Pr (e | ¬h ∧ b)
Pr (e | h ∧ b) is greater than / equal to / smaller than Pr (e | b)
Pr (h | e ∧ b) is greater than / equal to / smaller than Pr (h | ¬e ∧ b)

These different but equivalent definitions of the qualitative concept of incremental confirmation suggest different quantitative concepts of incremental confirmation that turn out not to be equivalent.

e incrementally confirms h given b in the sense of Pr to degree x if and only if x is equal to:

D = Pr (h | e ∧ b) − Pr (h | b)                  the distance measure (Earman 1992)
C = Pr (h ∧ e | b) − Pr (h | b) · Pr (e | b)     the Carnap measure (Carnap 1962)
L = Pr (e | h ∧ b) / Pr (e | ¬h ∧ b)             the likelihood ratio (Fitelson 1999)
R = Pr (e | h ∧ b) / Pr (e | b)                  the ratio measure (Milne 1996)
S = Pr (h | e ∧ b) − Pr (h | ¬e ∧ b)             the Joyce-Christensen measure (Christensen 1999, Joyce 1999)

(The likelihood ratio and the ratio measure are often preceded by a log operation which results in the log-likelihood ratio and the log-ratio measure, respectively. For our purposes, the above variations will do.)
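To see the five measures at work, here is a small sketch that computes them from a finite joint distribution. The representation of worlds as pairs of truth values, the helper names, and the sample numbers are assumptions made for illustration only; the background assumption b defaults to a tautology.

```python
from fractions import Fraction as F

def pr(joint, event, given=lambda w: True):
    """Conditional probability Pr(event | given) over a finite joint distribution."""
    denominator = sum(p for w, p in joint.items() if given(w))
    numerator = sum(p for w, p in joint.items() if given(w) and event(w))
    return numerator / denominator

def confirmation_measures(joint, h, e, b=lambda w: True):
    """Return the measures D, C, L, R, and S for evidence e, hypothesis h, background b."""
    h_b = lambda w: h(w) and b(w)
    e_b = lambda w: e(w) and b(w)
    not_h_b = lambda w: (not h(w)) and b(w)
    not_e_b = lambda w: (not e(w)) and b(w)
    h_and_e = lambda w: h(w) and e(w)
    return {
        "D": pr(joint, h, e_b) - pr(joint, h, b),
        "C": pr(joint, h_and_e, b) - pr(joint, h, b) * pr(joint, e, b),
        "L": pr(joint, e, h_b) / pr(joint, e, not_h_b),
        "R": pr(joint, e, h_b) / pr(joint, e, b),
        "S": pr(joint, h, e_b) - pr(joint, h, not_e_b),
    }

# Hypothetical regular probability over worlds (h is true?, e is true?).
joint = {
    (True, True): F(3, 10),
    (True, False): F(1, 10),
    (False, True): F(2, 10),
    (False, False): F(4, 10),
}
h = lambda w: w[0]
e = lambda w: w[1]

result = confirmation_measures(joint, h, e)
for name, value in result.items():
    print(name, value)
```

In this example all five measures agree qualitatively that e confirms h: D, C, and S are positive, and L and R are greater than 1.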

The sense in which these quantitative concepts of incremental confirmation are not equivalent is the following. Some of them say that one hypothesis is more confirmed by some piece of information given some background assumption than a second hypothesis is confirmed by a second piece of information given a second background assumption, whereas others disagree and say the opposite, and these measures do so while agreeing that the two pieces of information qualitatively confirm the two hypotheses given the two background assumptions. The equivalence these measures lack is called “ordinal equivalence” because the orderings these measures generate for a fixed probability among triples of hypotheses, pieces of information, and background assumptions are not the same.

As an analogy, consider the gross domestic products of the United States and Canada in 2017 and 2018, respectively. Suppose they are $20 and $21 trillion in the United States and $2 and $2.2 trillion in Canada, so that the gross domestic product of both countries has seen an increase: $1 trillion in the United States and $0.2 trillion in Canada. Now consider the question whether the increase in gross domestic product has been greater in the United States or in Canada. If we consider the absolute amount by which the GDP has increased, much like the distance measure considers the absolute increase in probability, the increase in the United States is significantly greater than the increase in Canada—namely $1 trillion versus $0.2 trillion. However, if we consider the relative amount by which the GDP has increased, much like the ratio measure considers the relative increase in probability, then the increase in Canada is significantly greater than the increase in the United States—namely 10% versus 5%. The same plurality of points of view exists for increases in probability or amounts of incremental confirmation.
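The failure of ordinal equivalence can be made concrete with a few lines of code. This sketch uses the fact that, by Bayes' theorem, the ratio measure R = Pr (e | h ∧ b) / Pr (e | b) equals Pr (h | e ∧ b) / Pr (h | b), so both D and R can be computed from a prior and a posterior. The GDP figures are those of the analogy; the two probability cases are hypothetical numbers chosen for illustration.

```python
def distance_measure(prior, posterior):
    """D: absolute increase from prior to posterior."""
    return posterior - prior

def ratio_measure(prior, posterior):
    """R: relative increase, posterior divided by prior."""
    return posterior / prior

# GDP analogy (trillions of dollars, 2017 and 2018).
us = (20.0, 21.0)
canada = (2.0, 2.2)
print(distance_measure(*us) > distance_measure(*canada))  # absolute: US ahead
print(ratio_measure(*us) < ratio_measure(*canada))        # relative: Canada ahead

# Hypothetical confirmation cases: (Pr(h | b), Pr(h | e and b)).
case_1 = (0.5, 0.7)
case_2 = (0.01, 0.1)
d_prefers_case_1 = distance_measure(*case_1) > distance_measure(*case_2)
r_prefers_case_2 = ratio_measure(*case_1) < ratio_measure(*case_2)
print(d_prefers_case_1, r_prefers_case_2)
```

Both measures agree that the evidence confirms the hypothesis in each case, yet they order the two cases in opposite ways.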
If you think that it is a problem that there is more than one measure of incremental confirmation, just as there is more than one way of measuring increases in gross domestic product, because you think there is just one true quantitative concept of incremental confirmation, then you are a monist about measures of incremental confirmation. If you think this is not a problem because you think there are different but equally legitimate quantitative concepts of incremental confirmation, then you are a pluralist about measures of incremental confirmation. In Section 7.2, I adopted the latter pluralist position when I said that there is no right or wrong in choosing between different measures of incremental confirmation, only more or less useful for various purposes.

Apart from this embarrassment of riches, the history of Bayesian confirmation theory consists primarily of success stories. Let us consider a few of these.

First, Bayesian confirmation theory retains the insight of Popper’s falsificationism. Suppose information e and background assumption b jointly falsify hypothesis h. This means e ∧ b logically implies ¬h, so

Pr (h | e ∧ b)
= Pr (h ∧ e ∧ b) / Pr (e ∧ b)    [definition of conditional probability]
= 0 / Pr (e ∧ b)    [e ∧ b logically implies ¬h]
< Pr (h | b)    [elementary calculus, Pr is a regular probability]

Hence, e incrementally disconfirms h given b.

Similarly, Bayesian confirmation theory retains the insight of hypothetico-deductivism. Suppose information e HD-confirms hypothesis h given background assumption b. This means h ∧ b logically implies e, so

Pr (e | h ∧ b)
= Pr (e ∧ h ∧ b) / Pr (h ∧ b)    [definition of conditional probability]
= Pr (h ∧ b) / Pr (h ∧ b)    [h ∧ b logically implies e]
> Pr (e | b)    [elementary calculus, Pr is a regular probability]

Hence, e incrementally confirms h given b.

Recall Hempel’s solution of the ravens paradox. Non-black non-ravens can indeed be used to confirm the ravens hypothesis, just as black ravens can. We take this to be counterintuitive only because we implicitly make many background assumptions. One such background assumption may be that there are more non-black things than ravens, b. Given this background assumption it is plausible to assume that one’s degree of belief that an arbitrary or “randomly chosen” object a is not black is greater than one’s degree of belief that this object is a raven:

Pr (¬B (a) | b) > Pr (R (a) | b)

Let us also assume that the ravens hypothesis is independent (in the sense of one’s regular degree of belief function Pr) of whether an arbitrary or “randomly chosen” object a is a raven as well as of whether such an object a is not black:

Pr (R (a) | ∀x (R (x) → B (x)) ∧ b) = Pr (R (a) | b)
Pr (¬B (a) | ∀x (R (x) → B (x)) ∧ b) = Pr (¬B (a) | b)

The idea behind the latter two assumptions is that the truth of the ravens hypothesis should be independent (in the sense of one’s regular degree of belief function Pr) of how many ravens there are as well as of how many non-black objects there are. Hence, it should also be independent of whether an arbitrary or “randomly chosen” object is a raven and of whether such an object is not black. Whether these assumptions are plausible is another question.

Given all these assumptions, we can show the following. First, both a black raven and a non-black non-raven can be used to incrementally confirm the ravens hypothesis. Second, a report of a non-black non-raven confirms the ravens hypothesis to a smaller degree than a report of a black raven. This is so according to D, R, and L for incremental confirmation, as well as according to Pr for absolute confirmation. In order to show this, we first prove the following first lemma:

Pr (R (a) ∧ B (a) | b) < Pr (¬R (a) ∧ ¬B (a) | b)

Pr (R (a) ∧ B (a) | b)
= Pr (R (a) ∧ B (a) | b) + Pr (R (a) ∧ ¬B (a) | b) − Pr (R (a) ∧ ¬B (a) | b)    [elementary calculus]
= Pr ((R (a) ∧ B (a)) ∨ (R (a) ∧ ¬B (a)) | b) − Pr (R (a) ∧ ¬B (a) | b)    [additivity, logic]
= Pr (R (a) | b) − Pr (R (a) ∧ ¬B (a) | b)    [logic]
< Pr (¬B (a) | b) − Pr (R (a) ∧ ¬B (a) | b)    [first assumption, elementary calculus]
= Pr ((¬B (a) ∧ R (a)) ∨ (¬B (a) ∧ ¬R (a)) | b) − Pr (R (a) ∧ ¬B (a) | b)    [logic]
= Pr (¬B (a) ∧ R (a) | b) + Pr (¬B (a) ∧ ¬R (a) | b) − Pr (R (a) ∧ ¬B (a) | b)    [additivity, logic]
= Pr (R (a) ∧ ¬B (a) | b) + Pr (¬B (a) ∧ ¬R (a) | b) − Pr (R (a) ∧ ¬B (a) | b)    [logic]
= Pr (¬B (a) ∧ ¬R (a) | b)    [elementary calculus]
= Pr (¬R (a) ∧ ¬B (a) | b)    [logic]

With the help of this first lemma, we now prove the following second lemma:

Pr (B (a) | R (a) ∧ b) < Pr (¬R (a) | ¬B (a) ∧ b)

Pr (B (a) | R (a) ∧ b)
= Pr (B (a) ∧ R (a) ∧ b) / Pr (R (a) ∧ b)    [definition of conditional probability]
= (Pr (B (a) ∧ R (a) ∧ b) · Pr (b)) / (Pr (R (a) ∧ b) · Pr (b))    [elementary calculus, Pr is a regular probability]
= Pr (B (a) ∧ R (a) | b) / Pr (R (a) | b)    [definition of conditional probability, elementary calculus]
= Pr (R (a) ∧ B (a) | b) / Pr ((R (a) ∧ B (a)) ∨ (R (a) ∧ ¬B (a)) | b)    [logic]
= Pr (R (a) ∧ B (a) | b) / (Pr (R (a) ∧ B (a) | b) + Pr (R (a) ∧ ¬B (a) | b))    [additivity, logic]
< Pr (¬R (a) ∧ ¬B (a) | b) / (Pr (¬R (a) ∧ ¬B (a) | b) + Pr (R (a) ∧ ¬B (a) | b))    [first lemma, calculus: if x < y, then x/(x + z) < y/(y + z) for x, y, z > 0]
= Pr (¬R (a) ∧ ¬B (a) | b) / Pr ((¬R (a) ∧ ¬B (a)) ∨ (R (a) ∧ ¬B (a)) | b)    [additivity, logic]
= Pr (¬R (a) ∧ ¬B (a) | b) / Pr (¬B (a) | b)    [logic]
= Pr (¬R (a) | ¬B (a) ∧ b)    [definition of conditional probability]

Now we turn to the proof of the theorem below. We will make use of the equation Pr (α ∧ γ) = Pr (γ) · Pr (α | γ) for Pr (γ) > 0. This is an immediate consequence, or “corollary,” of the definition of conditional probability.

Pr (∀x (R (x) → B (x)) | R (a) ∧ B (a) ∧ b) > Pr (∀x (R (x) → B (x)) | ¬R (a) ∧ ¬B (a) ∧ b) > Pr (∀x (R (x) → B (x)) | b)

Pr (∀x (R (x) → B (x)) | R (a) ∧ B (a) ∧ b)
= Pr (∀x (R (x) → B (x)) ∧ R (a) ∧ B (a) ∧ b) / Pr (R (a) ∧ B (a) ∧ b)    [definition of conditional probability]
= Pr (∀x (R (x) → B (x)) ∧ b ∧ R (a)) / Pr (R (a) ∧ B (a) ∧ b)    [logic]
= (Pr (∀x (R (x) → B (x)) ∧ b) · Pr (R (a) | ∀x (R (x) → B (x)) ∧ b)) / Pr (R (a) ∧ B (a) ∧ b)    [corollary]
= (Pr (∀x (R (x) → B (x)) ∧ b) · Pr (R (a) | b)) / Pr (R (a) ∧ B (a) ∧ b)    [second assumption]
= (Pr (∀x (R (x) → B (x)) ∧ b) · Pr (R (a) ∧ b)) / (Pr (B (a) ∧ R (a) ∧ b) · Pr (b))    [definition of conditional probability, logic]
= Pr (∀x (R (x) → B (x)) ∧ b) / (Pr (B (a) | R (a) ∧ b) · Pr (b))    [definition of conditional probability, elementary calculus]
> Pr (∀x (R (x) → B (x)) ∧ b) / (Pr (¬R (a) | ¬B (a) ∧ b) · Pr (b))    [second lemma, calculus: x < y if, and only if, z/(y · u) < z/(x · u) for x, y, z, u > 0]
= (Pr (∀x (R (x) → B (x)) ∧ b) · Pr (¬B (a) ∧ b)) / (Pr (¬R (a) ∧ ¬B (a) ∧ b) · Pr (b))    [definition of conditional probability, elementary calculus]
= (Pr (∀x (R (x) → B (x)) ∧ b) · Pr (¬B (a) | b)) / Pr (¬R (a) ∧ ¬B (a) ∧ b)    [definition of conditional probability]
= (Pr (∀x (R (x) → B (x)) ∧ b) · Pr (¬B (a) | ∀x (R (x) → B (x)) ∧ b)) / Pr (¬R (a) ∧ ¬B (a) ∧ b)    [third assumption]
= Pr (∀x (R (x) → B (x)) ∧ b ∧ ¬B (a)) / Pr (¬R (a) ∧ ¬B (a) ∧ b)    [corollary]
= Pr (∀x (R (x) → B (x)) ∧ ¬R (a) ∧ ¬B (a) ∧ b) / Pr (¬R (a) ∧ ¬B (a) ∧ b)    [logic]
= Pr (∀x (R (x) → B (x)) | ¬R (a) ∧ ¬B (a) ∧ b)    [definition of conditional probability]

[this establishes the first inequality, now we turn to the second]

= Pr (∀x (R (x) → B (x)) ∧ ¬R (a) ∧ ¬B (a) ∧ b) / Pr (¬R (a) ∧ ¬B (a) ∧ b)    [definition of conditional probability]
= Pr (∀x (R (x) → B (x)) ∧ b ∧ ¬B (a)) / Pr (¬R (a) ∧ ¬B (a) ∧ b)    [logic]
= (Pr (∀x (R (x) → B (x)) ∧ b) · Pr (¬B (a) | ∀x (R (x) → B (x)) ∧ b)) / Pr (¬R (a) ∧ ¬B (a) ∧ b)    [corollary]
= (Pr (∀x (R (x) → B (x)) ∧ b) · Pr (¬B (a) | b)) / Pr (¬R (a) ∧ ¬B (a) ∧ b)    [third assumption]
= (Pr (∀x (R (x) → B (x)) ∧ b) · Pr (¬B (a) ∧ b)) / (Pr (¬R (a) ∧ ¬B (a) ∧ b) · Pr (b))    [definition of conditional probability]
= Pr (∀x (R (x) → B (x)) ∧ b) / (Pr (¬R (a) | ¬B (a) ∧ b) · Pr (b))    [definition of conditional probability, elementary calculus]
= Pr (∀x (R (x) → B (x)) | b) / Pr (¬R (a) | ¬B (a) ∧ b)    [definition of conditional probability]
> Pr (∀x (R (x) → B (x)) | b)    [elementary calculus, Pr is a regular probability]
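The two lemmas and the theorem can be checked on a toy model. The sketch below makes several simplifying assumptions that are not part of the text: the ravens hypothesis is treated as a single proposition H that rules out a being a non-black raven, the background assumption b is taken to be a tautology, and the parameter values are hypothetical numbers chosen so that the three assumptions hold.

```python
from fractions import Fraction as F

# Hypothetical parameters: Pr(H), Pr(R(a)), Pr(not B(a)), and the probability
# of a non-black raven given that the ravens hypothesis is false.
pr_h, pr_raven, pr_nonblack, pr_counterexample = F(1, 2), F(1, 10), F(6, 10), F(1, 20)

# Worlds are triples (H, R(a), B(a)); H together with a non-black raven is impossible.
joint = {
    (True, True, True): pr_h * pr_raven,
    (True, False, True): pr_h * (1 - pr_raven - pr_nonblack),
    (True, False, False): pr_h * pr_nonblack,
    (False, True, True): (1 - pr_h) * (pr_raven - pr_counterexample),
    (False, True, False): (1 - pr_h) * pr_counterexample,
    (False, False, True): (1 - pr_h) * (1 - pr_raven - pr_nonblack + pr_counterexample),
    (False, False, False): (1 - pr_h) * (pr_nonblack - pr_counterexample),
}

def pr(event, given=lambda H, R, B: True):
    """Conditional probability Pr(event | given) in the toy model."""
    denominator = sum(p for w, p in joint.items() if given(*w))
    numerator = sum(p for w, p in joint.items() if given(*w) and event(*w))
    return numerator / denominator

hypothesis = lambda H, R, B: H
raven = lambda H, R, B: R
nonblack = lambda H, R, B: not B
black_raven = lambda H, R, B: R and B
nonblack_nonraven = lambda H, R, B: (not R) and (not B)

# The three assumptions of the text.
first_assumption = pr(nonblack) > pr(raven)
second_assumption = pr(raven, hypothesis) == pr(raven)
third_assumption = pr(nonblack, hypothesis) == pr(nonblack)

# First lemma, second lemma, and the theorem.
first_lemma = pr(black_raven) < pr(nonblack_nonraven)
second_lemma = pr(lambda H, R, B: B, raven) < pr(lambda H, R, B: not R, nonblack)
posterior_black_raven = pr(hypothesis, black_raven)
posterior_nonblack_nonraven = pr(hypothesis, nonblack_nonraven)
theorem = posterior_black_raven > posterior_nonblack_nonraven > pr(hypothesis)

print(posterior_black_raven, posterior_nonblack_nonraven, pr(hypothesis))
```

In this model the black raven raises the probability of the hypothesis from 1/2 to 2/3, whereas the non-black non-raven raises it only to 12/23.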

Bayesian confirmation theory also avoids the problems of irrelevant conjunction and disjunction that trouble hypothetico-deductive confirmation. Suppose information e incrementally confirms hypothesis h given background assumption b in the sense of some probability Pr, that is, Pr (h ∧ e | b) > Pr (h | b) · Pr (e | b). Even then it need not be the case that e incrementally confirms h ∧ i given b in the sense of Pr, or that e ∨ i incrementally confirms h given b in the sense of Pr.

Before showing this, let me draw your attention to the fact that this Bayesian solution to the problems of irrelevant conjunction and disjunction uses not only the mathematics of the probability calculus. It also relies on the philosophical interpretation of probability as degree of belief. According to probabilism, any probability is a permissible degree of belief function for an ideal cognitive agent. Therefore, to show the two
claims above, it is sufficient to specify one probability and four sentences standing in the right relations. The following example will do:

Pr (p ∧ q ∧ r) = 1/8
Pr (p ∧ q ∧ ¬r) = 1/2
Pr (p ∧ ¬q ∧ r) = 1/16
Pr (p ∧ ¬q ∧ ¬r) = 1/16
Pr (¬p ∧ q ∧ r) = 1/8 − 0.00000001
Pr (¬p ∧ q ∧ ¬r) = 0.00000001
Pr (¬p ∧ ¬q ∧ r) = 1/16
Pr (¬p ∧ ¬q ∧ ¬r) = 1/16

The eight sentences in this example are the eight state descriptions of the formal language whose vocabulary consists of the three propositional variables p, q, and r. The numbers assigned to these state descriptions are non-negative and sum to 1. Therefore, the function Pr is a probability on the formal language whose vocabulary consists of these three propositional variables.

Let us first show that p incrementally confirms q given p ∨ ¬p, but that p incrementally disconfirms, so does not incrementally confirm, q ∧ r given p ∨ ¬p.

Pr (p ∧ q | p ∨ ¬p) = 5/8 > 3/4 · 3/4 = Pr (p | p ∨ ¬p) · Pr (q | p ∨ ¬p)
Pr (p ∧ (q ∧ r) | p ∨ ¬p) = 1/8 < 3/4 · (1/4 − 0.00000001) = Pr (p | p ∨ ¬p) · Pr (q ∧ r | p ∨ ¬p)

For the second claim, we reason as follows. Since p incrementally confirms q given p ∨ ¬p, ¬q incrementally confirms ¬p given p ∨ ¬p according to Exercise 35 and the symmetry of incremental confirmation. Furthermore, since p incrementally disconfirms q ∧ r given p ∨ ¬p, ¬ (q ∧ r) incrementally disconfirms ¬p given
p ∨ ¬p according to Exercises 35 and 38 and the symmetry of incremental confirmation. Now ¬ (q ∧ r) is logically equivalent to ¬q ∨ ¬r. Thus, we have established that ¬q incrementally confirms ¬p given p ∨ ¬p, while ¬q ∨ ¬r incrementally disconfirms, so does not incrementally confirm, ¬p given p ∨ ¬p.

As mentioned, here we rely on the subjective or Bayesian interpretation of probability as degree of belief. We assume every probability on every language to be a permissible degree of belief function for an ideal cognitive agent. Otherwise we could not just pick any probability on some language to establish that the irrelevancies plaguing hypothetico-deductive confirmation do not automatically plague Bayesian confirmation. For instance, to convince Carnap (1950) that incremental confirmation does not face these problems, we would have to show that the above inequalities are true of some sentences when the probability is m∗ .

This flexibility of Bayesian confirmation theory does not come without cost. As Glymour notes in Theory and Evidence (1980), in many instances the data that are said to confirm theories are “old” in the sense that scientists are already certain of these data long before the theories they allegedly confirm have even been invented. For instance, Mercury’s anomalous 43 arc-second advance per century of its perihelion was discovered in the 19th century by Le Verrier (1859), long before Einstein (1915) proposed the general theory of relativity in the 20th century. This gives rise to the so-called problem of old evidence. If we take the subjective interpretation of probability as degree of belief seriously, then the probability of data e of which one is already (close to) certain is (close to) 1 (close to, as one’s degree of belief function may be a regular probability). However, this strips these data of any confirmatory potential (or almost any, if one is merely close to certain, though this depends on the measure of incremental confirmation).
For as you will show: Pr (h | e ∧ b) = Pr (h | b) for all sentences h, if Pr (e | b) = 1.
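Returning to the eight-state counterexample used above against irrelevant conjunction and disjunction, its four claims can be verified mechanically. The following sketch uses exact fractions and takes the background assumption to be the tautology p ∨ ¬p, as in the text; the helper names are illustrative.

```python
from fractions import Fraction as F

eps = F(1, 10**8)  # the 0.00000001 of the example

# State descriptions as triples of truth values (p, q, r).
pr_state = {
    (True, True, True): F(1, 8),
    (True, True, False): F(1, 2),
    (True, False, True): F(1, 16),
    (True, False, False): F(1, 16),
    (False, True, True): F(1, 8) - eps,
    (False, True, False): eps,
    (False, False, True): F(1, 16),
    (False, False, False): F(1, 16),
}

def pr(event):
    """Probability of a sentence, given as a truth function of p, q, and r."""
    return sum(weight for state, weight in pr_state.items() if event(*state))

def confirms(e, h):
    """e incrementally confirms h given a tautology: Pr(h and e) > Pr(h) * Pr(e)."""
    return pr(lambda p, q, r: h(p, q, r) and e(p, q, r)) > pr(h) * pr(e)

def disconfirms(e, h):
    """e incrementally disconfirms h given a tautology: Pr(h and e) < Pr(h) * Pr(e)."""
    return pr(lambda p, q, r: h(p, q, r) and e(p, q, r)) < pr(h) * pr(e)

P = lambda p, q, r: p
Q = lambda p, q, r: q
Q_AND_R = lambda p, q, r: q and r
NOT_P = lambda p, q, r: not p
NOT_Q = lambda p, q, r: not q
NOT_Q_OR_NOT_R = lambda p, q, r: (not q) or (not r)

checks = {
    "p confirms q": confirms(P, Q),
    "p disconfirms q and r": disconfirms(P, Q_AND_R),
    "not-q confirms not-p": confirms(NOT_Q, NOT_P),
    "not-q or not-r disconfirms not-p": disconfirms(NOT_Q_OR_NOT_R, NOT_P),
}
for claim, holds in checks.items():
    print(claim, holds)
```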

Finally, while Bayesian confirmation theory avoids the problem of irrelevant conjunction in its general form, this problem recurs in a slightly restricted form. Ignoring the background assumption, suppose i is independent of h as well as of h ∧ e in the sense of your regular degree of belief function Pr. This is one sense of what it means for i to be irrelevant to h and e. Then e incrementally confirms h ∧ i if e incrementally confirms h. Here is a proof of this consequence:

Pr (h ∧ i | e)
= Pr (h ∧ i ∧ e) / Pr (e)    [definition of conditional probability]
= Pr (h ∧ e) · Pr (i) / Pr (e)    [i is independent of h ∧ e in the sense of Pr]
> Pr (h) · Pr (i)    [e incrementally confirms h]
= Pr (h ∧ i)    [i is independent of h in the sense of Pr]
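A quick numerical illustration of this restricted problem follows. It is only a sketch: the distribution over h and e and the probability of i are hypothetical, and i is made independent of h and of h ∧ e by construction, namely by multiplying every weight by Pr (i) or 1 − Pr (i).

```python
from fractions import Fraction as F

# Hypothetical regular probability over (h, e) in which e confirms h.
pr_he = {
    (True, True): F(3, 10),
    (True, False): F(1, 10),
    (False, True): F(2, 10),
    (False, False): F(4, 10),
}
pr_i = F(1, 3)  # hypothetical probability of the irrelevant conjunct i

# Worlds are triples (h, e, i); i is independent of everything else.
joint = {
    (h, e, i): weight * (pr_i if i else 1 - pr_i)
    for (h, e), weight in pr_he.items()
    for i in (True, False)
}

def pr(event, given=lambda h, e, i: True):
    """Conditional probability Pr(event | given) over the joint distribution."""
    denominator = sum(w for s, w in joint.items() if given(*s))
    numerator = sum(w for s, w in joint.items() if given(*s) and event(*s))
    return numerator / denominator

hypothesis = lambda h, e, i: h
evidence = lambda h, e, i: e
h_and_i = lambda h, e, i: h and i

e_confirms_h = pr(hypothesis, evidence) > pr(hypothesis)
e_confirms_h_and_i = pr(h_and_i, evidence) > pr(h_and_i)
print(e_confirms_h, e_confirms_h_and_i)
```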

8.5 UPDATING

When you began reading this chapter, you had a specific degree of belief function. When you are finished reading this chapter, you will have a different degree of belief function. In between you will hopefully have learned something, and you will certainly have received new information. An update rule tells you how you ought to revise your degrees of belief when you receive new information.

Your old degree of belief function is represented by a probability measure Pr on some algebra A over some non-empty set of possible worlds W. New information can be received in various formats. One can become certain of a proposition E from A, Pr∗ (E) = 1, that one deemed possible before, Pr (E) > 0. One can also adopt new degrees of belief
Pr∗ (E1 ) , . . . , Pr∗ (En ) for the cells E1 , . . . , En of a partition of W all of which are propositions in A. The Bayesians’ update rule for the former case is strict conditionalization. Their update rule for the latter case is the more general Jeffrey conditionalization (Jeffrey 1965).

Let us first consider strict conditionalization. Suppose Pr : A → R is the ideal cognitive agent’s old probabilistic degree of belief function at time t, and E is the smallest, or logically strongest, proposition she becomes certain of between t and time t∗ . This means her new probabilistic degree of belief for E at time t∗ equals 1, Pr∗ (E) = 1. Suppose further her probabilistic degrees of belief are not directly affected in any other way such as forgetting etc. Then the ideal cognitive agent’s new probabilistic degree of belief function Pr∗ : A → R at t∗ should be her old conditional degree of belief function, that is, for all propositions H:

Pr∗ (H) = Pr (H | E) = Pr (H ∩ E) / Pr (E)

Note that this presupposes that E has not been assigned degree of belief zero at time t, Pr (E) > 0. It is also important to distinguish between the definition of conditional probability which is just that—a definition—and the update rule of strict conditionalization. The latter is a substantial norm telling an ideal cognitive agent how to revise or update her degree of belief function when she becomes certain of a proposition. Update rules are not part of the probability calculus and only make sense for the subjective interpretation of probability as degree of belief. They do not make sense for other interpretations of probability.

Recall Example 5 from Section 5.2. We roll a die and you initially assign a probability of 1/6 to each of the six propositions that the die shows one eye, {1}, that the die shows two eyes, {2}, . . ., and that the die shows six eyes, {6}. These probabilities are
now interpreted as your degrees of belief. We suppose you are interested in the proposition that the die shows two eyes. After rolling it I tell you that the die shows an even number, {2, 4, 6}. We have calculated your conditional degree of belief that the die shows two eyes given that it shows an even number of eyes to be 1/3. If you were to become certain that the die shows an even number of eyes upon me telling you so, strict conditionalization would tell you to raise your degree of belief that the die shows two eyes from 1/6 to 1/3.

However, just because I tell you so does not mean that you believe me, let alone that you become certain of the proposition that the die shows an even number of eyes. Suppose that, instead of becoming certain that the die shows an even number of eyes, your new degree of belief that it does is 3/4, Pr∗ ({2, 4, 6}) = 3/4. This might be because you trust me somewhat, yet also assign some degree of belief to the possibilities that I misled you, that I made an observational error, that you misheard what I said, and that I misspoke. Your new degree of belief that the die shows an odd number of eyes is 1/4, Pr∗ ({1, 3, 5}) = 1/4. The set of possibilities is {1, 2, 3, 4, 5, 6}. Your degrees of belief have changed on the partition {{2, 4, 6} , {1, 3, 5}} from 1/2 and 1/2 to 3/4 and 1/4, respectively. They have not changed on any larger, or more fine-grained, partition. What should your new degree of belief be that the die shows two eyes, Pr∗ ({2})? Strict conditionalization does not tell us, but Jeffrey conditionalization does:

Pr∗ ({2})
= Pr∗ ({2} | {2, 4, 6}) · Pr∗ ({2, 4, 6}) + Pr∗ ({2} | {1, 3, 5}) · Pr∗ ({1, 3, 5})    [by the law of total probability]
= Pr ({2} | {2, 4, 6}) · Pr∗ ({2, 4, 6}) + Pr ({2} | {1, 3, 5}) · Pr∗ ({1, 3, 5})    [by Jeffrey conditionalization]
= 1/3 · 3/4 + 0 · 1/4 = 1/4    [your old and new degrees of belief]

Jeffrey conditionalization thus tells you to hold onto those conditional degrees of belief whose conditions are the smallest, or logically strongest, propositions that have been directly affected by the experiential event that has taken place between t and t∗ . Here is the official formulation.

Suppose Pr : A → R is the ideal cognitive agent’s old probabilistic degree of belief function at time t, and {E1 , . . . , En } is the largest, or most fine-grained, partition of W on which her probabilistic degrees of belief change between t and time t∗ . This means her new probabilistic degrees of belief at time t∗ are Pr∗ (E1 ) , . . . , Pr∗ (En ) (E1 , . . . , En are propositions in A). Suppose further her probabilistic degrees of belief are not directly affected in any other way such as forgetting etc. Then the ideal cognitive agent’s new probabilistic degree of belief function Pr∗ : A → R at t∗ should be obtained from her old one by holding fixed those conditional degrees of belief whose conditions are E1 , . . . , En , that is, for all propositions H:

Pr∗ (H) = Pr (H | E1 ) · Pr∗ (E1 ) + · · · + Pr (H | En ) · Pr∗ (En )

Again, note that one can assign positive new degrees of belief only to propositions that one has deemed possible before. Strict conditionalization is the special case of Jeffrey conditionalization where the “experiential partition” consists of E and W \ E, and the new probabilistic degrees of belief are 1 and 0, respectively.

Let us use this result to illustrate why it is important to hold fixed not all, but only those conditional degrees of belief whose conditions are the smallest, or logically strongest, propositions that are directly affected by the experiential event that takes place between t and t∗ . Suppose
you become certain that the die shows four eyes, Pr∗ ({4}) = 1. Your new degree of belief that it does not show four eyes is zero, Pr∗ ({1, 2, 3, 5, 6}) = 0. That is, your experiential partition is {{4} , {1, 2, 3, 5, 6}}, and your new probabilistic degrees of belief are 1 and 0, respectively. Now, by telling you that the die shows four eyes, I am, albeit indirectly, also telling you that it shows an even number of eyes, so you should also assign one as your new probabilistic degree of belief to this proposition. However, when calculating your new degree of belief that the die shows two eyes you should not hold onto your conditional degree of belief that it does so given that the die shows an even number of eyes, Pr ({2} | {2, 4, 6}) = 1/3 ≠ Pr∗ ({2} | {2, 4, 6}) = 0. You should only hold onto your conditional degree of belief that the die shows two eyes given that the die shows four eyes, Pr ({2} | {4}) = Pr∗ ({2} | {4}) = 0. Otherwise your new degree of belief that the die shows two eyes is not zero—as it should be once you become certain that it shows four eyes—but 1/3!

Why should an ideal cognitive agent update her probabilistic degrees of belief according to strict conditionalization, or according to Jeffrey (and thus also strict) conditionalization? One answer to the former question is based on a Dutch book theorem for strict conditionalization (Lewis 1999, Teller 1973). An ideal cognitive agent is willing to accept a series of possibly conditional bets that guarantees a monetary loss if, and only if, her betting ratios violate the probability calculus or strict conditionalization. One answer to both questions is based on a Dutch book theorem for Jeffrey conditionalization (Armendt 1980, Skyrms 1987). An ideal cognitive agent is willing to accept a series of possibly conditional and successive bets that guarantees a monetary loss if, and only if, her betting ratios violate the probability calculus or Jeffrey conditionalization.
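The two update rules of this section can be sketched in a few lines of code and applied to the die example. The representation of a degree of belief function as a dictionary from outcomes to exact fractions, and the function names, are assumptions made for illustration.

```python
from fractions import Fraction as F

old = {eyes: F(1, 6) for eyes in range(1, 7)}  # your old degrees of belief

def pr(dist, proposition):
    """Degree of belief in a proposition, given as a set of outcomes."""
    return sum(dist[outcome] for outcome in proposition)

def strict_conditionalize(dist, evidence):
    """New degrees of belief after becoming certain of the evidence."""
    total = pr(dist, evidence)
    if total == 0:
        raise ValueError("the evidence must have had positive degree of belief")
    return {o: (dist[o] / total if o in evidence else F(0)) for o in dist}

def jeffrey_conditionalize(dist, new_weights):
    """new_weights maps each cell of the experiential partition to its new degree of belief."""
    if sum(new_weights.values()) != 1:
        raise ValueError("the new degrees of belief for the cells must sum to 1")
    new = {o: F(0) for o in dist}
    for cell, weight in new_weights.items():
        total = pr(dist, cell)
        if total == 0:
            if weight != 0:
                raise ValueError("a cell with old degree of belief 0 cannot become possible")
            continue
        for o in cell:
            # Hold fixed the old conditional degree of belief given this cell.
            new[o] += dist[o] / total * weight
    return new

even, odd = frozenset({2, 4, 6}), frozenset({1, 3, 5})

certain_even = strict_conditionalize(old, even)
uncertain_even = jeffrey_conditionalize(old, {even: F(3, 4), odd: F(1, 4)})
print(certain_even[2], uncertain_even[2])  # 1/3 1/4

# Becoming certain of four eyes: the experiential partition is {{4}, {1, 2, 3, 5, 6}}.
four, not_four = frozenset({4}), frozenset({1, 2, 3, 5, 6})
certain_four = jeffrey_conditionalize(old, {four: F(1), not_four: F(0)})
print(certain_four[2])  # 0
```

With the experiential partition {{4} , {1, 2, 3, 5, 6}} and new degrees of belief 1 and 0, Jeffrey conditionalization returns the same function as strict conditionalization on {4}, which illustrates the special-case claim made above.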

8.6 BAYESIAN DECISION THEORY

The subjective interpretation of probability is particularly important for the social sciences, where it figures prominently in decision and game theory. An ideal (cognitive or noncognitive) agent chooses an act from a finite set of alternatives. The outcome of an act depends on which possible world in W is actual, which is something the ideal agent is not certain of. Instead she has various probabilistic degrees of belief about which cell of a sufficiently large, but finite partition of W contains the actual world. These cells are called states of the world (there have to be sufficiently many of them so that the outcome of an act is the same for every possible world in any given state of the world). If the acts a1, . . . , an the ideal agent can take, the states of the world s1, . . . , sm, and the outcomes oi,j = ⟨ai, sj⟩ of every act ai in every state of the world sj are given, a decision problem can be represented by a decision matrix:

n × m decision matrix

        s1                   s2                   . . .   sm
a1      o1,1 = ⟨a1, s1⟩      o1,2 = ⟨a1, s2⟩      . . .   o1,m = ⟨a1, sm⟩
a2      o2,1 = ⟨a2, s1⟩      o2,2 = ⟨a2, s2⟩      . . .   o2,m = ⟨a2, sm⟩
. . .   . . .                . . .                . . .   . . .
an      on,1 = ⟨an, s1⟩      on,2 = ⟨an, s2⟩      . . .   on,m = ⟨an, sm⟩

The ideal agent may face the decision to take or not take her umbrella. Depending on whether or not it will rain her decision will result in one of four outcomes:

                       rain r                      no rain ¬r
take umbrella          with umbrella in rain       with umbrella when no rain
don’t take umbrella    without umbrella in rain    without umbrella when no rain


Which of these two acts the ideal agent should take depends on the utility she assigns to the four outcomes, as well as her probabilistic degrees of belief for the two states of the world. Suppose the numbers are as follows:

                       Pr (r) = .2    Pr (¬r) = .8
take umbrella          80             90
don’t take umbrella    −10            100

The decision rule of Bayesian decision theory is the principle of maximizing expected utility. It recommends that the ideal agent take one of those acts that maximize her expected utility. The expected utility of an act ai is calculated by summing over the utilities u (ai, sj) of the outcomes ⟨ai, sj⟩ of taking act ai in the states of the world sj, weighted by the probabilities Pr (sj) of these states sj. Our ideal agent maximizes her expected utility by taking the umbrella.

                       r              ¬r               expected utility EU
take umbrella          80 · .2     +  90 · .8      =   88
don’t take umbrella    −10 · .2    +  100 · .8     =   78
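The calculation above can be reproduced with a few lines of code. This is only a sketch of the principle of maximizing expected utility for the umbrella example; the function name expected_utility and the dictionary layout are illustrative assumptions, not notation from the text.

```python
def expected_utility(utilities, probabilities):
    """Sum of u(a, s_j) * Pr(s_j) over the states of the world s_j."""
    return sum(u * p for u, p in zip(utilities, probabilities))

# Degrees of belief for the states (rain, no rain) from the umbrella example.
pr_states = [0.2, 0.8]

# Utilities of each act in the states (rain, no rain).
utility_table = {
    "take umbrella": [80, 90],
    "don't take umbrella": [-10, 100],
}

eu = {act: expected_utility(us, pr_states) for act, us in utility_table.items()}
best_act = max(eu, key=eu.get)

for act, value in eu.items():
    print(f"{act}: EU = {value:.1f}")
print("maximizes expected utility:", best_act)
```

The same function applies to any finite decision matrix in which the probabilities of the states do not depend on the acts.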

The general formula looks as follows:

       s1                         ...        sm                           EU (ai)
a1     u (a1, s1) · Pr (s1)    +  ...  +     u (a1, sm) · Pr (sm)    =    EU (a1)
...    ...                        ...        ...                          ...
an     u (an, s1) · Pr (s1)    +  ...  +     u (an, sm) · Pr (sm)    =    EU (an)

In this set-up, due to Savage (1954), each state of the world is assumed to be independent of every act in the sense of the ideal agent’s probabilistic degree of belief function (as well as in
a causal sense). Acts are defined as functions whose domain is the set of all states of the world and whose co-domain is the set of all outcomes. Therefore, an act a assigns to each state of the world s exactly one outcome a (s) = o = ⟨a, s⟩. In Chapter 1 we have introduced the distinction between type and token. This distinction applies not only to sentences but also to acts. The acts in Bayesian decision theory are conceived of as tokens, not as types. The acts our ideal agent can take do not include some general act type of taking an umbrella. They only include the specific act token of taking a specific umbrella at a specific time at a specific location in a specific manner.

In classical Bayesian decision theory, the objects of the ideal agent’s desires are outcomes. Acts are mere means to attain outcomes. The utilities the ideal agent assigns to the outcomes represent how much she desires these outcomes. We have already learned that concepts can come in a qualitative, comparative, and quantitative form. A comparative concept is measured on an ordinal scale, and a quantitative concept is measured on a cardinal scale. It is a contested issue whether utility is measured on an ordinal or a cardinal scale. If utility is measured on an ordinal scale, the utility numbers carry only comparative information about the order between the utilities of the outcomes. This is made precise by the following general definition. Two functions f and g with the same domain X and the set of real numbers R as co-domain are ordinally equivalent if, and only if, for all arguments x and y in X:

f (x) ≥ f (y) if, and only if, g (x) ≥ g (y)

We have come across the concept of ordinal equivalence before. In Section 8.4, we have considered measures of incremental confirmation that assign real numbers to triples of hypotheses, pieces of information, and background assumptions. Now it is utility functions that assign real numbers to outcomes.


The principle of maximizing expected utility requires utility to be measured on a cardinal scale. Cardinal scales come in different forms. Interval scales carry information not only about the order between the utilities of the outcomes but also about the differences or intervals between the utilities of the outcomes. They allow us to say such things as that the difference between the utilities of the first and second best outcome is twice the difference between the utilities of the second and third best outcome. More generally, two functions f and g with the same domain X and co-domain R are equivalent on an interval scale if, and only if, there are real numbers k > 0 and m such that for all arguments x in X: g (x) = k · f (x) + m. If two functions are equivalent on an interval scale, they are also ordinally equivalent. Ratio scales carry information not only about the order and differences or intervals between the utilities of the outcomes. They also carry information about their proportions or ratios. They allow us to say such things as that the utility of the best outcome is twice the utility of the second best outcome. More generally, two functions f and g with the same domain X and co-domain R are equivalent on a ratio scale if, and only if, there is a real number k > 0 such that for all arguments x in X: g (x) = k · f (x). If two functions are equivalent on a ratio scale, they are also equivalent on an interval scale (m = 0). Temperature in Celsius and Fahrenheit are measured on an interval scale, and the two are equivalent on this scale. Temperature in Kelvin is measured on a ratio scale because it has an absolute zero point, and so are money, mass, length, time, and volume. Probability is measured on an absolute scale. Two functions f and g with the same domain X and co-domain R are absolutely equivalent if, and only if, for all arguments x in X: g (x) = f (x). If two functions are absolutely equivalent, they are also equivalent on a ratio scale (k = 1). 
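The three notions of equivalence can be tested mechanically on finite domains. The sketch below represents a function as a dictionary from arguments to real numbers and uses the Celsius and Fahrenheit example from the text; the sample points, the function names, and the tolerance tol are assumptions made for illustration only.

```python
def ordinally_equivalent(f, g):
    """f(x) >= f(y) if and only if g(x) >= g(y), for all arguments x and y."""
    xs = list(f)
    return all((f[x] >= f[y]) == (g[x] >= g[y]) for x in xs for y in xs)

def interval_equivalent(f, g, tol=1e-9):
    """Are there k > 0 and m such that g(x) = k * f(x) + m for all x?"""
    xs = list(f)
    # Two arguments with different f-values fix the candidate k and m.
    pair = next(((x, y) for x in xs for y in xs if f[x] != f[y]), None)
    if pair is None:
        # f is constant, so g has to be constant as well (any k > 0 works).
        return all(abs(g[x] - g[xs[0]]) <= tol for x in xs)
    x0, y0 = pair
    k = (g[x0] - g[y0]) / (f[x0] - f[y0])
    m = g[x0] - k * f[x0]
    return k > 0 and all(abs(g[x] - (k * f[x] + m)) <= tol for x in xs)

def ratio_equivalent(f, g, tol=1e-9):
    """Is there k > 0 such that g(x) = k * f(x) for all x?"""
    xs = list(f)
    x0 = next((x for x in xs if f[x] != 0), None)
    if x0 is None:
        return all(abs(g[x]) <= tol for x in xs)
    k = g[x0] / f[x0]
    return k > 0 and all(abs(g[x] - k * f[x]) <= tol for x in xs)

# Illustrative sample points on the Celsius scale.
celsius = {"water freezes": 0.0, "body temperature": 37.0, "water boils": 100.0}
fahrenheit = {x: 1.8 * c + 32 for x, c in celsius.items()}
doubled = {x: 2 * c for x, c in celsius.items()}  # a purely formal rescaling

print(ordinally_equivalent(celsius, fahrenheit))  # True
print(interval_equivalent(celsius, fahrenheit))   # True
print(ratio_equivalent(celsius, fahrenheit))      # False
print(ratio_equivalent(celsius, doubled))         # True
```

Celsius and Fahrenheit agree on order and on ratios of differences, but not on ratios of values, which is why they are equivalent on an interval scale only.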
Let us consider an example where the states of the world are not independent of the acts in the sense of the ideal agent’s
probabilistic degree of belief function (nor in a causal sense). The ideal agent decides between studying or not studying for the exam. The two states of the world are that she passes or fails the exam.

                  pass exam p              fail exam ¬p
study s           pass with studying       fail with studying
don’t study ¬s    pass without studying    fail without studying

The ideal agent’s conditional degree of belief that she passes given that she studies is higher than her conditional degree of belief that she passes given that she does not study. Suppose her conditional degrees of belief and utilities are as follows.

                  pass exam p                    fail exam ¬p
study s           90 with Pr (p | s) = 0.9       −10 with Pr (¬p | s) = 0.1
don’t study ¬s    100 with Pr (p | ¬s) = 0.2     0 with Pr (¬p | ¬s) = 0.8

To deal with such cases, Richard Jeffrey (1965) proposes an alternative set-up in which the ideal agent assigns, not only her utilities, but also her probabilistic degrees of belief to the outcomes. Now it is the latter (that is, the outcomes) that are the cells of a partition of the set of all possible worlds W, whereas acts and states of the world are now conceived of as unions or disjunctions of outcomes. In calculating the expected utility of the acts, Jeffrey’s version of Bayesian decision theory (so-called evidential decision theory) replaces the nonconditional probabilistic degrees of belief Pr (si) of the states si by the conditional probabilistic degrees of belief Pr (si | aj) of the states si conditional on the acts aj.

                  pass exam p       fail exam ¬p        expected utility EU
study s           90 · 0.9      +   −10 · 0.1       =   80
don’t study ¬s    100 · 0.2     +   0 · 0.8         =   20


The general formula looks as follows:

       s1                              ...        sm                                EU (aj)
a1     u (a1, s1) · Pr (s1 | a1)    +  ...  +     u (a1, sm) · Pr (sm | a1)    =    EU (a1)
...    ...                             ...        ...                               ...
an     u (an, s1) · Pr (s1 | an)    +  ...  +     u (an, sm) · Pr (sm | an)    =    EU (an)
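The evidential formula differs from Savage's only in that each act comes with its own conditional degrees of belief for the states. A minimal sketch for the exam example follows; the function name evidential_expected_utility and the nested dictionaries are hypothetical choices made for this illustration.

```python
def evidential_expected_utility(utilities, conditional_probs):
    """Sum of u(a, s) * Pr(s | a) over the states s, for one fixed act a."""
    return sum(utilities[s] * conditional_probs[s] for s in utilities)

# Utilities u(a, s) from the exam example.
utility = {
    "study": {"pass": 90, "fail": -10},
    "don't study": {"pass": 100, "fail": 0},
}

# Conditional degrees of belief Pr(s | a) from the exam example.
pr_state_given_act = {
    "study": {"pass": 0.9, "fail": 0.1},
    "don't study": {"pass": 0.2, "fail": 0.8},
}

# For each act, the conditional degrees of belief over the states sum to one.
for act, probs in pr_state_given_act.items():
    assert abs(sum(probs.values()) - 1.0) < 1e-9

eu = {
    act: evidential_expected_utility(utility[act], pr_state_given_act[act])
    for act in utility
}
best_act = max(eu, key=eu.get)

for act, value in eu.items():
    print(f"{act}: EU = {value:.1f}")
print("maximizes expected utility:", best_act)
```

With these numbers studying comes out at 80 and not studying at 20, as in the table for the exam example.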

Even evidential decision theory is not without criticism, as is illustrated by the following example known as Newcomb’s problem (Nozick 1969). There are two boxes. One box is transparent and contains \$1,000. The other box is opaque and contains \$0 or \$1,000,000, depending on a predictor’s prediction. The ideal agent can take the opaque box only, or both boxes. If the predictor predicts the ideal agent will take the opaque box only, \$1,000,000 will be put into the opaque box. If the predictor predicts the ideal agent will take both boxes, \$0 will be put into the opaque box. The predictor first gets to make her prediction. After she has done so, the ideal agent gets to decide between taking the opaque box only or both boxes. This guarantees that the ideal agent’s decision does not causally influence the predictor’s prediction, a fact the ideal agent is certain of. The decision problem the ideal agent faces can be represented by the following decision matrix:

                    predicts one box O    predicts two boxes T
take one box o      \$1,000,000           \$0
take two boxes t    \$1,001,000           \$1,000

Importantly, the ideal agent is also certain of the following fact: The game has been played 999 times before, and the predictor has always made the correct prediction. Because of the latter information, the ideal agent assigns the following conditional degrees of belief to the states of the world conditional on the acts:

                    predicts one box O                    predicts two boxes T
take one box o      \$1,000,000; Pr (O | o) = 0.999       \$0; Pr (T | o) = 0.001
take two boxes t    \$1,001,000; Pr (O | t) = 0.001       \$1,000; Pr (T | t) = 0.999

The ideal agent’s expected monetary value (so called because we value the outcomes in terms of money rather than utility) is computed as follows:

     O                           T                        expected monetary value
o    \$1,000,000 · 0.999     +   \$0 · 0.001          =   \$999,000
t    \$1,001,000 · 0.001     +   \$1,000 · 0.999      =   \$2,000

Assuming that the ideal agent wants to maximize her expected monetary value, evidential decision theory recommends that she take the opaque box only. However, the received view is that she should take both boxes. The reason is that her decision to take one box only (or both boxes) cannot causally influence the predictor’s prediction, and taking both boxes guarantees that she will receive an additional \$1,000 no matter what the predictor has predicted. The received view also has it that her conditional degrees of belief are “reasonable” or permissible. One response to Newcomb’s problem is to deny that it is a problem. The ideal agent’s conditional degrees of belief are not permissible, one might say, because they fail to reflect the fact that her decision cannot causally influence the predictor’s prediction. Conversely, if one insists that her degrees of belief are permissible, then she really should take the opaque box only. An alternative line is taken by causal decision theory (Joyce 1999, Weirich 2016). The ideal agent’s degrees of belief are permissible, it says, but she should take both boxes, as correlation is not causal relevance. The perfect correlation
between the ideal agent’s 999 decisions and the predictor’s 999 predictions should inform the ideal agent’s conditional degrees of belief for the states given the acts. However, this correlation does not imply causal relevance between decisions and predictions. To take this into account, the conditional probabilistic degrees of belief for the states given the acts of evidential decision theory should be replaced by different probabilistic degrees of belief that reflect the ideal agent’s beliefs about the causal efficacy of the acts to bring about the states. Since the acts in Newcomb’s problem are not causally efficacious by design, these alternative probabilistic degrees of belief are the same for both states. Therefore, the expected monetary value of taking both boxes will come out as \$1,000 higher.
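The contrast between the two recommendations in Newcomb's problem can be made concrete. The sketch below first recomputes the evidential expected monetary values and then, as a simplified stand-in for the causal calculation, uses one and the same act-independent degree of belief q that the predictor has predicted one box. The variable q, the names PAYOFF and expected_monetary_value, and the grid of values for q are assumptions of this illustration, not part of the text.

```python
from fractions import Fraction

# Monetary outcomes from the decision matrix of Newcomb's problem.
PAYOFF = {
    ("one box", "predicts one box"): 1_000_000,
    ("one box", "predicts two boxes"): 0,
    ("two boxes", "predicts one box"): 1_001_000,
    ("two boxes", "predicts two boxes"): 1_000,
}

def expected_monetary_value(act, pr_predicts_one_box):
    """Weight the payoffs of an act by the given degree of belief in state O."""
    q = pr_predicts_one_box
    return (PAYOFF[(act, "predicts one box")] * q
            + PAYOFF[(act, "predicts two boxes")] * (1 - q))

# Evidential decision theory: Pr(O | o) = 0.999 and Pr(O | t) = 0.001.
evidential = {
    "one box": expected_monetary_value("one box", Fraction(999, 1000)),
    "two boxes": expected_monetary_value("two boxes", Fraction(1, 1000)),
}

def causal_advantage_of_two_boxes(q):
    """Difference in expected monetary value when both acts share the same q."""
    return (expected_monetary_value("two boxes", q)
            - expected_monetary_value("one box", q))

print(evidential["one box"], evidential["two boxes"])  # 999000 2000
print({causal_advantage_of_two_boxes(Fraction(k, 10)) for k in range(11)})  # {Fraction(1000, 1)}
```

Whatever value q takes, taking both boxes comes out exactly \$1,000 ahead once the degree of belief in the prediction no longer varies with the act.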

8.7 EXERCISES

Exercise 36: Suppose your background assumption b and your regular probabilistic degree of belief function Pr are such that the hypothesis that all swans are white, ∀x (S (x) → W (x)), is independent of the claim that an arbitrary or “randomly chosen” object a is a swan, S (a):

Pr (S (a) | ∀x (S (x) → W (x)) ∧ b) = Pr (S (a) | b)

Suppose further that you are not certain what color a is given that it is a swan and, in particular, that you are not certain that a is white given that it is a swan:

Pr (W (a) | S (a) ∧ b) < 1

Show that the claim that a is a white swan, S (a) ∧ W (a), incrementally confirms (in the sense of your regular probabilistic degree of belief function Pr) the hypothesis that
all swans are white, ∀x (S (x) → W (x)), given your background assumption b. That is, show that the following holds:

Pr (∀x (S (x) → W (x)) | S (a) ∧ W (a) ∧ b) > Pr (∀x (S (x) → W (x)) | b)

Exercise 37: We are considering a probability Pr on some language L over some non-empty vocabulary V. Show that the following holds for all sentences e. If Pr (e) = 1, then for all sentences h: Pr (h | e) = Pr (h). You may skip the first, but not the second, use of universal generalization (UG). The more general claim from Section 8.4 then follows from the fact that conditional probabilities are probabilities.

Exercise 38: We are considering a regular probability Pr on some language L over some non-empty vocabulary V. Show that the following holds for all sentences e and h: e incrementally confirms h if, and only if, e incrementally disconfirms ¬h. Note that this requires you to show two if-then claims: if e incrementally confirms h, then e incrementally disconfirms ¬h; and, if e incrementally disconfirms ¬h, then e incrementally confirms h. We are ignoring the background assumption b, and you may skip the use of UG as well as pick any of the five definitions of incremental dis-/confirmation, or probabilistic ir-/relevance, from Section 8.4.

Exercise 39: We toss a coin twice. There are four possible outcomes: HH, HT, TH, and TT. You consider two hypotheses. Hypothesis F says that the coin is fair and that the chance of heads equals the chance of tails and that the two coin tosses are “independent and identically distributed” (in a sense we will make precise in Chapter 10). Hypothesis B says that the coin is biased and that the chance of heads is twice the chance of tails and that the two
coin tosses are “independent and identically distributed.” You obey the so-called “principal principle” (to be introduced in the next chapter), so your conditional degrees of belief for the four possible outcomes are as follows:

Pr (HH | F) = Pr (HT | F) = Pr (TH | F) = Pr (TT | F) = 1/2 · 1/2 = 1/4
Pr (HH | B) = 2/3 · 2/3 = 4/9
Pr (HT | B) = 2/3 · 1/3 = 2/9
Pr (TH | B) = 1/3 · 2/3 = 2/9
Pr (TT | B) = 1/3 · 1/3 = 1/9

Initially your probabilistic degree of belief function Pr assigns 1/2 to both hypotheses F and B, Pr (F) = Pr (B) = 1/2. Calculate your nonconditional probabilistic degrees of belief for the four possible outcomes, that is, calculate:

Pr (HH) =?    Pr (HT) =?
Pr (TH) =?    Pr (TT) =?

Exercise 40: Continuing Exercise 39, calculate your conditional probabilistic degrees of belief for hypothesis F given each of the four possible outcomes, as well as your conditional probabilistic degrees of belief for hypothesis B given each of the four possible outcomes, that is, calculate:

Pr (F | HH) =?    Pr (F | HT) =?
Pr (F | TH) =?    Pr (F | TT) =?
Pr (B | HH) =?    Pr (B | HT) =?
Pr (B | TH) =?    Pr (B | TT) =?

Then determine which of the four possible outcomes incrementally confirms F, and which incrementally confirms B, in the sense of your probabilistic degree of belief function Pr.


Exercise 41: The utilities of attending a concert by Beyoncé b, Justin Bieber j, Drake d, and Rihanna r on your personal interval scale are as follows: u (b) = 100, u (j) = 700, u (d) = 900, and u (r) = 1000. You do not have utilities for anything else. Which of the following utility functions with the same domain {b, j, d, r} and co-domain R are equivalent to u on your personal interval scale?

f (b) = 201, f (j) = 1401, f (d) = 1801, and f (r) = 2001.
g (b) = 1, g (j) = 2, g (d) = 3, and g (r) = 17.
h (b) = 3, h (j) = 6, h (d) = 9, and h (r) = 51.

Hint: If two functions f and g with domain X and co-domain R are equivalent on an interval scale, then it holds for all arguments x, y, z, w in X: (f (x) − f (y)) / (f (z) − f (w)) = (g (x) − g (y)) / (g (z) − g (w)).

Exercise 42: Continuing Exercise 41, (a) show that u, f, g, and h are ordinally equivalent. (b) Suppose g is measured on a ratio scale. Show that h is equivalent to g on this scale but that f is not equivalent to g on this scale. Hint: If two functions f and g with domain X and co-domain R are equivalent on a ratio scale, then it holds for all arguments x and y in X: f (x) / f (y) = g (x) / g (y).

Exercise 43: The numbers in the following decision matrix represent the utilities on your personal interval scale and your probabilistic degrees of belief. Calculate your expected utility for the four acts and then determine which act maximizes your expected utility.

      s1 with Pr = 1/2    s2 with Pr = 1/8    s3 with Pr = 3/8
a1    81                  64                  1
a2    100                 0                   121
a3    25                  100                 144
a4    16                  9                   400


Exercise 44: We say that a utility function u is a linear function of money if, and only if, u is a function whose domain consists of all possible amounts of money, whose co-domain is the set of real numbers R, and there are real numbers a > 0 and b such that for all possible amounts of money m: u (m\$) = m · a + b. Suppose the ideal agent’s utility function u is a linear function of money in the following decision matrix. Calculate her expected utilities for the two acts and determine which act maximizes her expected utility.

      s1                                      s2
a1    u (\$40) with Pr (s1 | a1) = 0.7        u (\$7) with Pr (s2 | a1) = 0.3
a2    u (\$50) with Pr (s1 | a2) = 0.6        u (\$5) with Pr (s2 | a2) = 0.4

Exercise 45: You can decide between buying a yacht or not buying a yacht. Your utility for becoming very rich is 100, whereas your utility for not becoming very rich is 10. You do not have utilities for any other outcomes, which implies that you do not value having a yacht. Your utilities are measured on an interval scale and you face the following decision.

                      very rich r    not very rich ¬r
buy yacht b           100            10
don’t buy yacht ¬b    100            10

You are certain of the following true piece of information. Almost everybody who buys a yacht is very rich, but only very few people who do not buy a yacht are very rich. On the basis of this information you assign the following probabilistic degrees of belief:

                      very rich r                     not very rich ¬r
buy yacht b           100 with Pr (r | b) = 0.999     10 with Pr (¬r | b) = 0.001
don’t buy yacht ¬b    100 with Pr (r | ¬b) = 0.01     10 with Pr (¬r | ¬b) = 0.99

(a) What, if anything, should you do according to the principle of maximizing expected utility in evidential decision theory?

(b) You are also certain that buying a yacht is not an effective means for becoming very rich. On the contrary, buying a yacht causes you to be bankrupt, so is causally efficacious for not becoming very rich. In light of this information, does causal decision theory recommend the same decision that was recommended by evidential decision theory?

READINGS

The recommended readings for Chapter 8 include:

Fitelson, Branden (2006), The Paradox of Confirmation. Philosophy Compass 1, 95–113.
Talbott, William (2008), Bayesian Epistemology. In E.N. Zalta (ed.), Stanford Encyclopedia of Philosophy.
Vineberg, Susan (2016), Dutch Book Arguments. In E.N. Zalta (ed.), Stanford Encyclopedia of Philosophy.

and perhaps also

Briggs, Rachael (2014), Normative Theories of Rational Choice: Expected Utility. In E.N. Zalta (ed.), Stanford Encyclopedia of Philosophy.
diFate, Victor (2016), Evidence. In J. Fieser & B. Dowden (eds.), Internet Encyclopedia of Philosophy.
Easwaran, Kenny (2011), Bayesianism. Philosophy Compass 6, 312–320, 321–332.
Joyce, James M. (2003), Bayes’ Theorem. In E.N. Zalta (ed.), Stanford Encyclopedia of Philosophy.
Pettigrew, Richard (2015), Epistemic Utility Arguments for Probabilism. In E.N. Zalta (ed.), Stanford Encyclopedia of Philosophy.
Steele, Katie & Orri Stefánsson, Hlynur (2015), Decision theory. In E.N. Zalta (ed.), Stanford Encyclopedia of Philosophy.
Weirich, Paul (2016), Causal decision theory. In E.N. Zalta (ed.), Stanford Encyclopedia of Philosophy.

A book length introduction to the vast field of decision and game theory is:

Peterson, Martin (2009), An Introduction to Decision Theory. Cambridge: Cambridge University Press.

CHAPTER 9

The Chance Interpretation of Probability

9.1 CHANCES

In contrast to deductive approaches to confirmation, Bayesian confirmation theory can handle the confirmation of statistical hypotheses specifying the (objective) chance that some proposition is true. Unlike subjective degrees of belief, which are features of the internal world or our minds, chances are, according to some accounts, features of the external world or reality. However, unlike observable relative frequencies, which are also features of the external world, chances are not “empirically accessible.” That is, we cannot directly observe or perceive, but only indirectly or inductively infer, what the chances are that various propositions are true (see, however, the qualification below). In this regard, chance is similar to force and electrons in physics, which also cannot be directly observed, but which are nonetheless postulated by physical theory. This is why they are called theoretical entities (and their names are called theoretical terms). Since chances are postulated by philosophical theory, and not everybody agrees what the right theory is, not everybody
agrees what chances are, or even if there are chances. Pessimists think talk of chances is just nonsense. Antirealists think talk of chances makes sense but does not commit us to the existence of a theoretical entity in the external world. For instance, idealists hold that chances are mind-dependent constructs or ideas, while reductionists think chances reduce to, and Humeans think they “supervene” on, other things such as (limiting) relative frequencies. For chances to supervene on (limiting) relative frequencies, the chances must be the same whenever the (limiting) relative frequencies are. Finally, realists take the talk of chances at face value and hold that they are real. Sports give rise to many examples of chances (at least if we bracket issues surrounding free will). The chance that the Raptors win a game if they are not behind after the third quarter is over 70% because the Raptors always play their best game in the fourth quarter. Before her first Wimbledon final, the chance that Serena Williams starts her first game with an ace was higher on the first service than it was on the second service. Now that chance is zero because she did not start her first game with an ace. Chances thus change over time, which is why my initial claim that chances are not empirically accessible needs to be refined: The chance that an event takes place at a time before the event takes place, or does not take place, is not empirically observable. The chance that an event takes place at a time after the event has (not) taken place is one (zero). Therefore, this chance can be determined if it can be determined whether the event in question took place. Some claims in the natural sciences are probabilistic. Here is an example. In chemistry we learn that the half-life of uranium-238 is about 4.5 billion years. In all probability, it will take about 4.5 billion years until half of a given number of uranium-238 atoms decay. But what does “probability” mean here? 
Furthermore, a given uranium-238 atom either decays, or
it does not decay. So what does it mean to say that its half-life is about 4.5 billion years? One way to make sense of the claim that the half-life of a particular uranium-238 atom is about 4.5 billion years is to interpret this claim as saying that the chance that the particular uranium-238 atom decays within 4.5 billion years equals a half. On one account, chances are the propensities (Popper 1957) with which events happen or propositions are true. On this account, the chance that a coin lands on heads is a property of the coin-tossing set-up, which includes the physical composition of the coin, the size and shape of the coin and the way it is tossed, the humidity and pressure of the air, and so on. Here chance represents the tendency of the coin-tossing set-up to produce the event that the coin lands on heads. On another account, the chances are whatever the “best system” (Lewis 1986) says they are. Here the best system is selected among those physical theories or systems whose non-probabilistic claims are true, and whose probabilistic claims provide a statistical summary of the events that happen. The best system is the one which achieves the best trade-off between simplicity and informativeness as well as fit of its probabilistic claims with the (relative frequencies of the) events that happen. While this is not the place to defend any one account of chance over another, the following section may not hide my view that the most promising account of chance is a form of modal idealism (it is called “modal” because chance, like necessity and possibility, is a modality that specifies the mode in—in this case: the chance with—which something is true). According to this view, chance, much like causation, is a mind-dependent construct or idea that has no reality itself but that we find extremely useful in representing, and navigating through, reality.

190

INTRODUCTION TO PROBABILITY AND INDUCTION

9.2 PROBABILITY IN PHYSICS

Just as the subjective interpretation of probability is particularly important for the social sciences, the interpretation of probability as chance is of particular importance for the natural sciences, where it plays a prominent, yet different, role in statistical and quantum mechanics.

Statistical mechanics (Frigg 2008, Pathria & Beale 2011, Sklar 2015) distinguishes between the microstates and macrostates a physical system can be in. A macrostate can be realized by many different microstates, and in general, only the former will be observable or measurable. For instance, an “isolated” physical system, that is, a physical system considered in isolation and assumed not to interact with its environment (such as an ideal gas), may contain N particles. Each particle has a particular position and momentum in each dimension of space. This means that there are two parameters for each particle in each dimension of space, and 2 · 3 parameters for each particle in three-dimensional space. For an isolated physical system with N particles in three-dimensional space, this gives rise to 2 · 3 · N “degrees of freedom.” The set of all possible specifications of these 6 · N degrees of freedom (represented by the 6 · N-fold “Cartesian product” of R) is called the phase space. Every element or “point” in the phase space represents a possible microstate of this physical system. The microstates play a role similar to that of the cells of a partition of the set of all possible worlds W, or the state descriptions of a formal language L. They are the most fine-grained or detailed descriptions of the physical system that are assigned a probability. Every less detailed or more coarse-grained description can be thought of as a union or disjunction of them.


The number of particles in a physical system as well as its volume and energy are both microscopic and macroscopic “variables” (we define variables in the next chapter). It makes sense to ascribe these properties to both a macrostate of a physical system as well as a microstate. In contrast to these, temperature and pressure are only macroscopic variables. They are defined in terms of the distribution of microstates realizing a given macrostate and so are statistical in nature. It makes sense to ascribe these properties to a macrostate of a physical system but not a microstate. The distribution of the microstates over the macrostates is called the density of (or volume of the phase space occupied by) the macrostates: The more microstates there are that realize a given macrostate, the denser this macrostate. We can think of a macrostate as the disjunction or union of all microstates realizing it. According to the additivity axiom, the probability of a macrostate is the sum of the probabilities of all microstates realizing the macrostate. The additivity axiom applies because the microstates are mutually exclusive or incompatible: At any given point in time, a physical system can be in only one microstate. The microstates in the phase space evolve with time, and their dynamics is described by the “Hamiltonian.” Now, classical mechanics is deterministic (Hoefer 2016): A complete specification of the microstate at any one point in time determines the microstate at every—later or earlier—point in time. In addition, the evolution of the microstates is time-reversal invariant. This means that an evolution from the microstate X (t0 ) at time t0 to the microstate X (t0 + τ) at a later time t0 + τ, τ > 0, is consistent with the laws of motion of classical mechanics only if the “reversal” of this evolution is so, too. 
The reversal results by reversing the sign of the momentum of each particle in each dimension of space (that is, + becomes − and − becomes +), and by keeping the positions fixed. Starting
this reversed evolution in microstate X (t0 + τ) at time t0 + τ results in a microstate X (t0 + 2 · τ) at time t0 + 2 · τ that is the reversal of X (t0 ), R (X (t0 )). In contrast to this, the evolution of the macrostates is not time-reversal invariant (Callender 2016). Instead, the evolution is usually from macrostates of lower Boltzmann entropy to macrostates of higher Boltzmann entropy (Boltzmann entropy is a special case of Shannon’s concept of entropy from Section 6.3). Once a physical system is in a macrostate with maximal Boltzmann entropy, it is in equilibrium. For instance, an isolated physical system such as your study in the winter will have a certain temperature. Once you open the window of your study, it will interact with its environment, which also has a certain temperature and is sometimes called the “heat bath.” Initially the entire system consisting of your study embedded in its environment is in a macrostate with low Boltzmann entropy: While it is cold outside, it will still be warm in your study. With the window of your study being open, the temperatures of the two systems will, in all probability, change until your study and the environment have the same temperature. From this point on, the entire system is in equilibrium. According to the so-called principle of equal a priori probabilities, all microstates realizing a given macrostate of a physical system in equilibrium are equally probable. This is the first occurrence of probability in statistical mechanics. We consider the evolution of the microstate of a physical system with time. The probability of a macrostate corresponds to how much time on average a physical system spends in this macrostate. This is why these first probabilities are called time averages. It is very difficult to determine time averages, so there is a second occurrence of probability in statistical mechanics. 
Instead of considering the evolution of the microstate of a physical system with time, we now imagine all the possible


microstates realizing the macrostate the system is in at a given moment in time. In reality, the physical system is in precisely one microstate, but we imagine all the possible microstates it could be in at the given moment in time. The collection of all these “mental copies” (Pathria and Beale 2011: 25) is called the ensemble. Depending on how the micro- and macrostates in it are characterized one is dealing with the microcanonical, canonical, or grand canonical ensemble. For the microcanonical ensemble, we consider a physical system with N particles confined to a region of space of volume V and total energy E. The macrostate of the physical system is characterized by the three parameters N, V, and E. For the canonical ensemble, it is temperature T rather than total energy E that characterizes the macrostate in combination with N and V. Ω (N, V, E) is the number of microstates realizing the macrostate characterized by N, V, and E. The Boltzmann entropy of this macrostate is S (N, V, E) = k · ln (Ω (N, V, E)) (Uffink 2014). k is the so-called Boltzmann constant, and ln is the natural logarithm. The physical system is in equilibrium if it is in the macrostate that is realized by the largest number of microstates. According to the principle of equal a priori probabilities, these microstates are all equally probable. There has been a debate (Pathria & Beale 2011: sct. 1.6) about what the microstates of a physical system are to which the same (let us call it) statistical probability should be assigned according to the principle of equal a priori probabilities. This debate is analogous to the debate (discussed in Section 7.1) between Wittgenstein and Carnap about what the descriptions are to which one should assign the same logical probability: state descriptions or structure descriptions. Initially the microstates were identified with state descriptions which specify of each particle what its energy level is. 
In response to the so-called Gibbs paradox (Pathria & Beale 2011: sct. 1.5), the microstates are now identified with structure descriptions which merely specify


how many particles ni have energy level εi (where the sum of all ni equals the total number of particles N, and the weighted sum of all energy levels ni · εi equals the total energy E). The elements of the ensemble are the merely possible, imagined microstates that could realize a given macrostate (plus the one microstate the physical system actually is in). They are not the actual microstates that the physical system is in at different moments in time, as in the evolution of the microstates in the phase space. Since the probability of a macrostate now corresponds to the proportion of the ensemble containing the microstates that could realize it, these second probabilities are called ensemble averages. If the physical system in question is ergodic, time and ensemble averages coincide (Frigg, Berkovitz, & Kronz 2016). So much for statistical mechanics when the underlying mechanics is classical. If the latter is replaced by quantum mechanics (Ismael 2015), the microstates are quantum states described by the wave function. The evolution of the quantum states with time follows Schrödinger’s equation and is deterministic. Indeterminism may enter the picture when a “measurement” is carried out. The so-called eigenstates of an “observable” are the possible outcomes of a measurement of this observable. According to the “collapse postulate,” when a measurement of an observable B on a physical system in a quantum state is carried out, the system collapses into an eigenstate of B. In this situation, quantum mechanics does not determine which eigenstate of B the system collapses into; it only specifies, through Born’s rule, the probabilities of these eigenstates. Note that these probabilities in quantum mechanics are different from the probabilities in statistical mechanics. To wrap up, the principle specifying the statistical probabilities is the principle of equal a priori probabilities. The principle specifying the quantum mechanical probabilities is Born’s rule. 
Statistical probabilities and quantum mechanical


probabilities are distinct, and both of them are at play in quantum statistical mechanics (Pathria & Beale 2011: ch. 5; here a third principle is needed for the quantum statistical probabilities that guarantees that the mental copies or elements of the ensemble are completely disentangled from each other). While this is highly controversial territory (Beisbart & Hartmann 2011, Ben-Menahem & Hemmo 2012), it is these statistical and quantum probabilities that are the primary candidates for the interpretation of probability as chance. Whereas time averages could perhaps still be made sense of in terms of the (limiting) relative frequencies to be discussed in more detail in the next chapter, this is much less clear for ensemble averages, and it is essentially impossible for quantum mechanical probabilities because these are generally (see, however, Born 1954) taken to require an interpretation that makes sense of “single-case” probabilities. Single-case probabilities are probabilities of (propositions about) events that occur, or do not occur, one single time without being actually repeated across time, as in the case of time averages, and without being hypothetically repeated in the imagination, as in the case of ensemble averages. A given uranium-238 atom either decays, or it does not decay. To make sense of the claim that the half-life of a particular uranium-238 atom is about 4.5 billion years, one needs single-case probabilities such as chances and degrees of belief (as in quantum Bayesianism; see Healey 2016); (limiting) relative frequencies, whether they are actual or hypothetical, won’t do.
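The counting behind the Boltzmann entropy and the two candidate notions of a microstate can be made concrete with a small sketch. The toy system below (three particles, energy levels 0, 1, and 2 in arbitrary units, total energy 3) and all function names are illustrative assumptions, not taken from the text. The sketch enumerates the state descriptions realizing the macrostate, groups them into structure descriptions, and evaluates S = k · ln (Ω).

```python
from collections import Counter
from itertools import product
from math import log

K_B = 1.380649e-23  # Boltzmann constant in J/K (exact SI value)


def state_descriptions(n_particles, levels, total_energy):
    """All assignments of an energy level to each particle with the given total energy."""
    return [s for s in product(levels, repeat=n_particles) if sum(s) == total_energy]


def structure_descriptions(states):
    """Group state descriptions by occupation numbers (how many particles per level)."""
    return Counter(tuple(sorted(Counter(s).items())) for s in states)


def boltzmann_entropy(omega):
    """S = k * ln(omega), where omega is the number of microstates."""
    return K_B * log(omega)


# Hypothetical toy macrostate: N = 3 particles, levels 0, 1, 2, total energy E = 3.
states = state_descriptions(3, (0, 1, 2), 3)
structures = structure_descriptions(states)

print("state descriptions:", len(states))
print("structure descriptions:", len(structures))
print("entropy if microstates are state descriptions:", boltzmann_entropy(len(states)))
print("entropy if microstates are structure descriptions:", boltzmann_entropy(len(structures)))
```

In this toy case, equal probabilities over state descriptions and equal probabilities over structure descriptions give different probabilities to the occupation pattern "one particle per level", which is the kind of difference the debate mentioned above turns on.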

9.3 THE PRINCIPAL PRINCIPLE

Let us consider a partition {H1 , . . . , Hn } consisting of n statistical hypotheses specifying alternative chances ch (E) = xi for some proposition E. This presupposes that the chance of E exists, which is a nontrivial assumption, but that we are not certain what it is. We merely have various probabilistic degrees of belief about what the chance of E might be. Pr (E | Hi ∩ B) is called the likelihood of Hi on E given B, and Pr (Hi | B) is called the prior of Hi given B. Together the priors of the Hi and the likelihoods of the Hi on E determine the posterior of each statistical hypothesis H given E and B, Pr (H | E ∩ B), via Bayes’ theorem (provided Pr (E ∩ B) > 0, Pr (H1 ∩ B) > 0, . . ., Pr (Hn ∩ B) > 0):

$$\Pr(H \mid E \cap B) = \frac{\Pr(E \mid H \cap B) \cdot \Pr(H \mid B)}{\Pr(E \mid H_1 \cap B) \cdot \Pr(H_1 \mid B) + \cdots + \Pr(E \mid H_n \cap B) \cdot \Pr(H_n \mid B)}$$

In the simplest case, we have just two statistical hypotheses, H and its complement W \ H, and no, or tautological, background assumptions W:

$$\Pr(H \mid E) = \frac{\Pr(E \mid H) \cdot \Pr(H)}{\Pr(E \mid H) \cdot \Pr(H) + \Pr(E \mid W \setminus H) \cdot \Pr(W \setminus H)}$$
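Bayes' theorem over a partition can be sketched in a few lines of code. The function name and the use of exact fractions are illustrative choices, not part of the text. The usage lines apply the function to the coin illustration discussed next, with 101 hypotheses, equal priors, and each likelihood set equal to the hypothesized chance (an assumption the text goes on to make explicit).

```python
from fractions import Fraction


def posteriors(priors, likelihoods):
    """Posterior of each hypothesis in a partition, given priors Pr(H_i)
    and likelihoods Pr(E | H_i), via Bayes' theorem."""
    joint = [lik * prior for lik, prior in zip(likelihoods, priors)]
    evidence = sum(joint)  # Pr(E), by the law of total probability
    if evidence == 0:
        raise ValueError("Pr(E) must be positive")
    return [j / evidence for j in joint]


# Coin illustration: H_0, ..., H_100, each with prior 1/101.
priors = [Fraction(1, 101)] * 101
# Assumption: the likelihood of heads under H_k equals the hypothesized chance k/100.
likelihoods = [Fraction(k, 100) for k in range(101)]

post = posteriors(priors, likelihoods)
# Hypotheses whose posterior exceeds their prior are incrementally confirmed by E.
confirmed = [k for k in range(101) if post[k] > priors[k]]

print(post[17], post[50])
print(confirmed[0], confirmed[-1])
```

Exact fractions avoid rounding, so the posteriors can be compared with the priors without a tolerance.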

To illustrate, suppose we deal with 101 statistical hypotheses. Hi is the statistical hypothesis that the chance that the coin lands on heads on the next toss, E, equals i%, ch (E) = i/100. H17 says the chance equals 17%, and H50 says the chance equals 50%. Let us assume that the priors of H0 , H1 , . . ., H100 are the same, Pr (Hi ) = 1/101, that E is assigned a positive probabilistic degree of belief, Pr (E) > 0, and that there are no, or only tautological, background assumptions. Bayes’ theorem tells us what the posteriors of these statistical hypotheses are conditional on the assumption E that the coin lands on heads on the next toss:

$$\Pr(H_k \mid E) = \frac{\Pr(E \mid H_k) \cdot \Pr(H_k)}{\Pr(E \mid H_0) \cdot \Pr(H_0) + \cdots + \Pr(E \mid H_{100}) \cdot \Pr(H_{100})} = \frac{\frac{k}{100} \cdot \frac{1}{101}}{\frac{0}{100} \cdot \frac{1}{101} + \frac{1}{100} \cdot \frac{1}{101} + \frac{2}{100} \cdot \frac{1}{101} + \cdots + \frac{100}{100} \cdot \frac{1}{101}} = \frac{k}{0 + 1 + 2 + \cdots + 100} = \frac{k}{50 \cdot 101}$$

Since the posterior k/(50 · 101) exceeds the prior 1/101 exactly when k > 50, this means that E incrementally confirms Hk if, and only if, k > 50, which makes sense: That the coin lands on heads on the next toss confirms those hypotheses that say that the coin is biased towards heads and disconfirms those hypotheses according to which the coin is biased against heads. However, we have implicitly made use of a principle in this calculation that needs to be made explicit. We have replaced the likelihood Pr (E | Hi ) of the statistical hypothesis Hi on E with what the chance of E is according to Hi , ch (E) = i%. In other words, we have replaced the subjective probability or degree of belief in E given Hi , Pr (E | Hi ), with what the chance of E is according to Hi . Since the probability calculus allows an ideal cognitive agent to assign any degree of belief between zero and one inclusive to E given Hi , an additional principle besides non-negativity, normalization, and additivity is needed to guarantee that this conditional degree of belief is equal to the hypothesized chance. This principle is so important that Lewis, in “A Subjectivist’s Guide to Objective Chance” (1980), calls it the principal principle. As noted earlier, chances vary across time. Let cht (A) = x be the proposition that the chance at time t that proposition A is true exists, and equals x. Let Bt be a proposition that is “admissible” for A at time t (more on admissibility below). According to the principal principle, an ideal cognitive agent’s initial or a priori degree of belief function Pr should be regular and such that for all propositions A, and all propositions B that are admissible for A at t as well as consistent with cht (A) = x:

$$\Pr(A \mid \mathrm{ch}_t(A) = x \cap B) = x$$

The idea is that, for propositions for which chances exist, their chances guide the ideal cognitive agent’s initial or a priori degrees of belief in these propositions in the absence


degrees of belief of mine anymore in the presence of the former information about Trump’s win in Florida. What information is admissible for which propositions at which times will depend on one’s theory of chance. Lewis assumes that “historical” information about times prior to t is admissible at t for all propositions. In particular, the complete history of possible world w up to time t, Hwt , is admissible at t for all propositions. In addition, Lewis assumes that information about how chances at time t depend on the complete history up to t is also admissible for all propositions. In particular, possible world w’s theory of chance, Tw , is admissible at any time t for all propositions. Lewis identifies this theory with the conjunction, or intersection, of all “history-to-chance-conditionals” ‘if Ht , then cht (A) = x’ that are true at w, where Ht specifies a complete possible history up to time t. In the presence of these two assumptions, the principal principle implies that an ideal cognitive agent’s initial or a priori degree of belief function Pr should be such that for all possible worlds w, times t, and propositions A for which the chance at w at t exists:

$$\mathrm{ch}_{wt}(A) = \Pr(A \mid H_{wt} \cap T_w)$$

This consequence of the principal principle says that the chance distribution of any possible world w at any time t, chwt , comes from what an ideal cognitive agent’s initial or a priori degree of belief function should be by conditionalizing on the complete history of w up to t, Hwt , as well as w’s theory of chance, Tw . It has itself two important consequences. First, chances obey the probability calculus because the degrees of belief an ideal cognitive agent should have initially or a priori do so. This means that we do not have to postulate that chances satisfy the probability calculus but can derive this result from the principal principle and the thesis that only probability measures are permissible degree of belief functions for an ideal cognitive


agent. Second, “[w]hat’s past is no longer chancy” (Lewis 1980: 273), as already indicated earlier. Let us first prove the last claim. Suppose the chance of A at time t in possible world w, chwt (A), exists. If A is about the history of w up to t, then there are two possible cases: Hwt ⊆ A if A is true at w, and Hwt ∩ A = ∅ if A is false at w. Suppose first that Hwt ⊆ A. Then

1. $\mathrm{ch}_{wt}(A) = \Pr(A \mid H_{wt} \cap T_w)$, from the consequence of the principal principle.
2. $\mathrm{ch}_{wt}(A) = \frac{\Pr(A \cap H_{wt} \cap T_w)}{\Pr(H_{wt} \cap T_w)}$, from 1. by the definition of conditional probability, which applies because Pr is regular and Hwt ∩ Tw is non-empty or consistent.
3. $\mathrm{ch}_{wt}(A) = \frac{\Pr(H_{wt} \cap T_w)}{\Pr(H_{wt} \cap T_w)}$, from 2., set theory, and because Hwt ⊆ A by assumption.
4. $\mathrm{ch}_{wt}(A) = 1$, from 3. and elementary calculus.

Suppose next that Hwt ∩ A = ∅.

1. $\mathrm{ch}_{wt}(A) = \Pr(A \mid H_{wt} \cap T_w)$, from the consequence of the principal principle.
2. $\mathrm{ch}_{wt}(A) = \frac{\Pr(A \cap H_{wt} \cap T_w)}{\Pr(H_{wt} \cap T_w)}$, from 1. by the definition of conditional probability, which applies because Pr is regular and Hwt ∩ Tw is non-empty or consistent.
3. $\mathrm{ch}_{wt}(A) = \frac{\Pr(\emptyset)}{\Pr(H_{wt} \cap T_w)}$, from 2., set theory, and because Hwt ∩ A = ∅ by assumption.
4. $\mathrm{ch}_{wt}(A) = 0$, from 3., Exercise 22, and elementary calculus.

Thus, for any proposition A about the history of any possible world w at any time t: If chwt (A) exists, then chwt (A) = 1 or chwt (A) = 0.
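The result that propositions about the past have chance 1 or 0 can be checked in a small finite model. Everything in the sketch below is an illustrative assumption rather than Lewis's own construction: worlds are pairs of coin-toss outcomes, the time t lies between the two tosses, the prior is an arbitrary regular one, and every world is taken to share the same theory of chance, so that Tw is the whole set of worlds.

```python
from fractions import Fraction
from itertools import product

# Hypothetical worlds: (first toss, second toss); time t falls between the tosses.
WORLDS = list(product("ht", repeat=2))
# Hypothetical regular prior: every world receives positive probability.
PRIOR = {w: Fraction(n, 10) for w, n in zip(WORLDS, (1, 2, 3, 4))}
# Assumption: all worlds share one theory of chance, so T_w is the set of all worlds.
THEORY = frozenset(WORLDS)


def pr(prop):
    """Prior probability of a proposition, modeled as a set of worlds."""
    return sum(PRIOR[w] for w in prop)


def cond(a, b):
    """Conditional probability Pr(a | b), defined only if Pr(b) > 0."""
    if pr(b) == 0:
        raise ValueError("conditioning proposition has probability zero")
    return pr(a & b) / pr(b)


def history(w):
    """H_wt: the worlds that agree with w on everything up to t (the first toss)."""
    return frozenset(v for v in WORLDS if v[0] == w[0])


def chance(a, w):
    """ch_wt(a) = Pr(a | H_wt ∩ T_w)."""
    return cond(a, history(w) & THEORY)


actual = ("h", "t")
first_heads = frozenset(v for v in WORLDS if v[0] == "h")   # about the past, true at actual
first_tails = frozenset(v for v in WORLDS if v[0] == "t")   # about the past, false at actual
second_heads = frozenset(v for v in WORLDS if v[1] == "h")  # about the future

print(chance(first_heads, actual), chance(first_tails, actual), chance(second_heads, actual))
```

Only the proposition about the second toss receives a chance strictly between 0 and 1 in this model.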


As to the consequence of the principal principle itself, we will be content with an informal proof. According to Lewis’ assumptions, Hwt ∩ Tw is admissible at time t for all propositions because both Hwt and Tw are, and because Lewis also assumes that (so-called Boolean) combinations of admissible information are themselves admissible. Hwt ∩ Tw is also consistent, or non-empty, and since Pr is regular, Pr (Hwt ∩ Tw ) > 0. Hence, Pr (A | Hwt ∩ Tw ) is well-defined for all propositions A. (In reality, things are more complicated because it is not the case that for every algebra of propositions there is a regular probability measure whose domain is this algebra. For the algebra we are implicitly assuming here, and in lines 2 of the above proofs, there likely is no regular probability measure whose domain is this algebra.) Suppose the chance of A at w at t exists, and equals x, so that cht (A) = x is true at w, that is, chwt (A) = x. Hwt ∩ Tw implies cht (A) = x. This is because Tw includes the history-to-chance conditional ‘if Hwt , then cht (A) = x’ that specifies how, at possible world w, the chance of A at t depends on the complete history of w up to time t. Hence:

$$\begin{aligned}
\mathrm{ch}_{wt}(A) &= x && \text{by assumption} \\
&= \Pr(A \mid \mathrm{ch}_t(A) = x \cap H_{wt} \cap T_w) && \text{by the principal principle} \\
&= \Pr(A \mid H_{wt} \cap T_w) && \text{by the implication mentioned above}
\end{aligned}$$

This concludes our informal proof of the above-mentioned consequence of the principal principle. A different way of stating the idea behind the principal principle is that it tells an ideal cognitive agent to treat chance as an expert in the sense of Gaifman’s (1988) “A Theory of Higher Order Probabilities”, at least initially or a priori, and


in the absence of inadmissible information. In “Belief and the Will” (1984), van Fraassen puts forth the reflection principle. According to it, an ideal cognitive agent should treat her own future degrees of belief as experts. For all propositions A, times t, and later times t′:

$$\mathrm{Pr}_t(A \mid \mathrm{Pr}_{t'}(A) = r) = r$$

As in the case of the principal principle, it is not clear if this principle is always well-defined (see the above remark in parentheses). Likely it will not be possible for Pr_t to assign a positive probability to all propositions Pr_t′ (A) = r. Therefore, not all conditional probabilities Pr_t (A | Pr_t′ (A) = r) will be defined. A closely related principle, which is subject to an analogous remark, requires the ideal cognitive agent to treat her current degrees of belief as experts. For all propositions A:

$$\Pr(A \mid \Pr(A) = r) = r$$

This principle implies that the ideal cognitive agent should be certain of what her own current degrees of belief are, that is, for all propositions A: If Pr (A) = r, then Pr (Pr (A) = r) = 1. This principle is also the probabilistic version of the principles of positive and negative introspection proposed in Hintikka’s Knowledge and Belief (1961). For all propositions A: If the ideal cognitive agent believes A, then she should believe that she believes A. If the ideal cognitive agent does not believe A, then she should believe that she does not believe A.


All these principles belong to the growing field of formal epistemology (Weisberg 2015), which includes Bayesian epistemology (Talbott 2008) as a special case.

READINGS

The recommended readings for Chapter 9 include:

Briggs, Rachael (2010), The Metaphysics of Chance. Philosophy Compass 5, 938–952.

and perhaps also

Lewis, David K. (1980), A Subjectivist’s Guide to Objective Chance. In R.C. Jeffrey (ed.), Studies in Inductive Logic and Probability. Vol. II. Berkeley: University of California Press, 263–293. Reprinted with Postscripts in D. Lewis (1986), Philosophical Papers. Vol. II. Oxford: Oxford University Press, 83–132.

Weisberg, Jonathan (2015), Formal Epistemology. In E.N. Zalta (ed.), Stanford Encyclopedia of Philosophy.

CHAPTER 10

The (Limiting) Relative Frequency Interpretation of Probability

10.1 THE JUSTIFICATION OF INDUCTION

Let us grant that Bayesian confirmation theory adequately explicates or analyses or defines the concept of confirmation. Then it holds by definition that the conclusion of an inductively strong argument is confirmed by its premises. But does this also justify us in believing the conclusion of an inductively strong argument whose premises are restricted to information we have? Don’t we first have to justify our analysis or explication or definition of confirmation? According to Hume, we cannot justify the principle of induction. This has led some philosophers to replace the project of justifying induction by the project of defining induction. According to Goodman and Carnap, we can allegedly justify our definition of inductively strong arguments by appeals to accepted inductive practice, or intuitions about inductive validity. Even if this was true, it would still not justify us in


believing the conclusions of these so-defined inductively strong arguments, but merely the way we talk about them. The principle of induction is a method for answering questions to which one does not have the answer in light of information that one has. The third premise of Hume’s argument says that there is no inductively strong argument which does not presuppose its conclusion, whose premises are restricted to information we have, and whose conclusion says that the principle of induction holds. The reason is that any inductively strong argument whose conclusion says that the principle of induction holds presupposes rather than derives its conclusion. The second premise of Hume’s argument says that there is no deductively valid argument which does not presuppose its conclusion, whose premises are restricted to information we have, and whose conclusion says that the principle of induction holds. The reason is that the principle of induction is a general principle that applies to all questions, including those to which we do not have the answer. Furthermore, it is a truth of logic that there is no deductively valid argument whose premises are restricted to information we have, and whose conclusion says something that is the answer to a question to which we do not have the answer. Therefore, as long as there is some question to which we do not have the answer (that is, some information that we do not have), there is no deductively valid argument whose conclusion says that the principle of induction holds and whose premises are restricted to information we have. There is little hope to get around these two premises. The first premise of Hume’s argument says that we can justify the principle of induction only if there is a deductively valid or an inductively strong argument which does not presuppose its conclusion, whose premises are restricted to information we have, and whose conclusion says that the principle of induction holds. 
We have assumed that the principle of induction holds if, and only if, it leads from true premises to


true conclusions in all or most of the logically possible cases. Put in terms of the instrumentalist view of logic, the first premise thus says that we can justify the principle of induction only if we can show or prove it to be the means to attaining the end of reasoning from true premises to true conclusions in all or most of the logically possible cases. Here to show or prove is to provide a deductively valid or an inductively strong argument which does not presuppose its conclusion, and whose premises are restricted to information we have. The second and third premises of Hume’s argument tell us that we cannot show or prove the principle of induction to be a means to attaining this end. However, as indicated at the end of Section 7.4, this is compatible with us being able to show or prove induction to be a means to attaining a different end. Indeed, in light of the fact that it is deductive logic that is a means to attaining the cognitive end of reasoning from true premises to true conclusions in all logically possible cases, and bracketing Haack’s dilemma for deduction, it is only to be expected that induction is a means to attaining a different cognitive end. Now, whatever this different cognitive end may be, there will still be no inductively strong argument which does not presuppose its conclusion, whose premises are restricted to information we have, and whose conclusion says that induction is a means to attaining this different cognitive end. The reason is still that any inductively strong argument whose conclusion says that the principle of induction is a means to attaining the cognitive end, whatever it may be, that makes inductively strong arguments desirable presupposes rather than derives its conclusion. 
However, even if there is some information that we do not have, there may now be a deductively valid argument which does not presuppose its conclusion, whose premises are restricted to information we have, and whose conclusion says that induction is a means to attaining this different cognitive end. This is so


because this different cognitive end may be sufficiently modest for there to be a deductively valid argument whose conclusion says that the principle of induction is a means to attaining this different cognitive end, and whose premises are restricted to information we have. Precisely such a deductive justification of induction relative to a cognitive end that differs from reasoning from true premises to true conclusions in all or most of the logically possible cases is given in Reichenbach’s Experience and Prediction (1938).

10.2 THE STRAIGHT(-FORWARD) RULE

Like sentences (Chapter 1) and acts (Section 8.6), events come in the form of types and tokens. Reichenbach is interested in the limit of the relative frequency with which various event types are “instantiated” in various sequences of event tokens. The event type may be that a particular coin lands on heads, and the sequence of event tokens may be the sequence of all tosses of this coin that land on heads or tails. To figure out what these limits are, Reichenbach proposes a statistical version of the principle of induction that is known as the straight(-forward) rule. Here is a contemporary version:

From the premise that m out of the n event tokens about which one has the information whether they are of type A are of type A, one may and ought to infer the conclusion that the limit of the relative frequency of event tokens of type A in any sequence of event tokens continuing these n event tokens equals m/n.

What Reichenbach shows is that this rule is a means to attaining the cognitive end of converging to the limit of the relative frequency of any event type in any sequence of event tokens for which there is such a limit. Suppose we are tossing a coin three times, and it lands twice on heads and once on tails: m = 2 and n = 3. We are interested


in the limit of the relative frequency of the event type heads in any sequence of event tokens that continues these three tosses. According to the straight(-forward) rule, we may and ought to infer that the limit of the relative frequency of the event type heads in any sequence of event tokens that continues these three tosses equals 2/3. Now, the three tosses may be continued, or they may not, and if they are, then there may be finitely many further tosses, or infinitely many further tosses. If there are no further tosses with this coin, or only finitely many, then the limit of the relative frequency of the event type heads in this sequence of finitely many event tokens exists and equals the relative frequency, that is, 2/3 if there are no further tosses, and k/N if the entire sequence contains N event tokens of which k instantiate, or fall under, the event type heads. If there are infinitely many further tosses with this coin, then the limit of the relative frequency of the event type heads in this infinite sequence of event tokens may or may not exist. If it exists, then this is because the relative frequencies converge, as in the following sequence of relative frequencies: 2/3, 3/4, 4/5, 5/6, . . . , (n − 1)/n, . . ., which converges to 1. If the limit does not exist, then this is because the relative frequencies diverge, as in the following sequence of relative frequencies: heads on every toss until the relative frequency of the event type heads is above 0.9; then tails on every toss until the relative frequency of the event type heads is below 0.1; then heads on every toss until the relative frequency of the event type heads is again above 0.9; and so on. You may say that no coin will ever be tossed infinitely many times, and this is a perfectly legitimate position.
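The two kinds of sequence just described can be simulated on finite initial segments. The sketch below is only an illustration: the function names are hypothetical, heads is coded as 1 and tails as 0, and the first three tosses are assumed to come in the order heads, heads, tails. A finite run cannot establish the existence or non-existence of a limit; it merely displays the pattern.

```python
def relative_frequencies(tosses):
    """Relative frequency of heads (coded as 1) after each toss."""
    heads = 0
    freqs = []
    for n, toss in enumerate(tosses, start=1):
        heads += toss
        freqs.append(heads / n)
    return freqs


def heads_forever(n_tosses):
    """Heads, heads, tails, then heads on every later toss: frequency (n - 1)/n."""
    return [1, 1, 0] + [1] * (n_tosses - 3)


def oscillating(n_tosses, high=0.9, low=0.1):
    """Heads until the frequency exceeds `high`, then tails until it drops
    below `low`, then heads again, and so on."""
    tosses, heads, outcome = [], 0, 1
    for n in range(1, n_tosses + 1):
        tosses.append(outcome)
        heads += outcome
        freq = heads / n
        if outcome == 1 and freq > high:
            outcome = 0
        elif outcome == 0 and freq < low:
            outcome = 1
    return tosses


converging = relative_frequencies(heads_forever(10_000))
diverging = relative_frequencies(oscillating(10_000))

print(converging[2:6], converging[-1])
print(min(diverging[100:]), max(diverging[100:]))
```

In the oscillating case each swing takes roughly nine times as many tosses as all tosses before it, which is why the relative frequencies never settle.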
The reason we nevertheless have to discuss infinite sequences of event tokens is that some philosophers have suggested that the probability that some event token instantiates a particular event type equals the limit of the relative frequency of event tokens of this type in the hypothetical sequence of event tokens that would


result if the event was “repeated” infinitely many times (von Mises 1928, perhaps also Venn 1866). We have already come across this idea in Section 9.2 in the discussion of ensemble averages. Now, event tokens cannot be repeated, only event types can. The idea is that there are two event types—say, tossings of a particular coin, and landings of this particular coin on heads—and one considers a hypothetical sequence of infinitely many event tokens instantiating the former event type—say, tossings of a particular coin. Then one asks if the limit of the relative frequency of event tokens instantiating the second event type—say, landings of this particular coin on heads—among event tokens instantiating the first event type exists (and if so, what it is equal to). Of course, by doing so we are leaving the realm of the empirical or observable because we are employing so-called counterfactual conditionals. As mentioned in Chapter 1, these counterfactual conditionals say what would have been the case, if certain conditions—conditions that may well be contrary-to-fact—had obtained. Counterfactual conditionals are at least as contested as the probabilities that are defined in terms of them on the above proposal, and we will not discuss them in any detail. Suffice it to say that counterfactual conditionals are often considered to be closely related to causation (Menzies 2014), that is, the relation between causes and their effects that we have come across in Sections 3.3 and 8.6. Reichenbach justifies the straight(-forward) rule, which is his inductive logic or principle of induction, by a deductively valid argument. The premises of this argument contain enough mathematics to make sense of the numbers mentioned in the straight(-forward) rule. Since this is less than the mathematics required for the probability calculus, we will suppress these premises. The conclusion of this argument does not say that the straight(-forward) rule leads from true premises to true


conclusions in all or most of the logically possible cases. Instead it says that the straight(-forward) rule is a means to attaining the cognitive end of converging to the limit of the relative frequency of any event type in any sequence of event tokens for which there is such a limit, a cognitive end that one may, or may not, have. Reichenbach’s argument, the so-called vindication of induction, is an instrumentalist argument like the instrumentalist versions of the Dutch book and gradational accuracy arguments. It has inspired an entire discipline of such arguments called formal learning theory that we will discuss in Chapter 11.

The vindication of induction

Premise (Reichenbach’s theorem): For any sequence of event tokens s and any type of event A: The limit of the relative frequency of event tokens of type A in the sequence of event tokens s exists if, and only if, the straight(-forward) rule converges to this limit.

Conclusion: An ideal cognitive agent ought to obey the straight(-forward) rule given that she has the cognitive end that her conjectures converge to the limit of the relative frequency of any event type in any sequence of event tokens for which there is a limit of the relative frequency with which this event type is instantiated. Doing the former is a means to attaining the latter cognitive end that she may, or may not, have.

Reichenbach’s theorem implies that, for any sequence of event tokens s and any event type A, if some rule converges to the limit of the relative frequency of A in s, then the straight(-forward) rule does so. We prove both claims informally. If the straight(-forward) rule converges to the limit of the relative frequency of some event type A in some sequence of event tokens s, then there is some rule that converges to this

F R EQ UE NC Y IN T ER PR E TAT ION OF PROB ABIL I T Y

211

limit. If there is some rule that converges to this limit, then this limit exists. If this limit exists, then it equals $\lim_{n \to \infty} \frac{\#(A, s \mid n)}{n}$, where # (A, s | n) is the number of event tokens instantiating the event type A in the first n event tokens of s. This is also the number the straight(-forward) rule converges to if this limit exists (and, hence, if some rule converges to this limit). If no limit exists, then there is nothing to converge to, and the straight(-forward) rule is as unsuccessful as any other rule conjecturing limits. This means that the straight(-forward) rule is successful if, and only if, some rule conjecturing limits is. This situation is somewhat similar to that of a patient who has a disease that can be cured only, if at all, by a risky operation: If anything saves the patient, the risky operation does, but there is no guarantee that the latter will do so. One objection to the vindication of induction is associated with the slogan: “In the long run we are all dead” (Keynes 1923: 80). The idea is that nobody has, or should have, the cognitive end of converging to the limit of the relative frequency of any event type in any sequence of event tokens for which there is such a limit. Even if this was the case, it would miss the point. The vindication of induction is an instrumentalist argument. It does not presuppose that one has this end, or, as deontological arguments do, that one ought to have it. Instead it establishes a means-end relationship without suggesting that one ought to, or does, have the end in question. Furthermore, the situation in deductive logic is analogous. Just as in the long run we are all dead, nobody lives in any logically possible world other than (the one described by) the actual one. Yet this does not prevent people from desiring to have beliefs that are consistent, that is, true in some logically possible world that need not be the actual one.
While consistency alone may not make beliefs desirable, consider how undesirable beliefs are that are not even consistent. Similarly, mere convergence to the truth may not make a sequence of
conjectures desirable, but consider how undesirable a sequence of conjectures is that does not even converge to the truth. It is no coincidence that “estimators” in statistics are called consistent if their conjectures converge (in an even weaker sense; see Section 10.5). Another objection is that the straight(-forward) rule is not the only rule that converges to the limit of the relative frequency of any event type in any sequence of event tokens for which there is such a limit. Rule 17 is another one: From the premise that m out of the n event tokens about which one has the information whether they are of type A are of type A, one may and ought to infer the conclusion that the limit of the relative frequency of event tokens of type A in any sequence of event tokens continuing these n event tokens equals (m + 17) /n. In response to this objection, one may again point out that the situation in deductive logic is analogous. There are many different systems of general rules for classical logic (Klement 2016a mentions two). However, all these systems are functionally equivalent in the sense that they all license the exact same particular inferences. This means that, as far as the cognitive end of reasoning in a way that is truth-preserving with logical necessity is concerned, all these different systems are on a par. Of course, these different systems are not on a par with respect to other ends—say, the end of deriving conclusions in as few steps as possible. Here is a different example. Drinking a glass of water quenches one’s thirst if, and only if, drinking a glass of water or some other drink does. As far as the end of quenching one’s thirst is concerned, drinking a glass of water is on a par with drinking another drink. The two are functionally equivalent with respect to this end. 
Of course, there are other ends, such as quenching one’s thirst with a sweet or low-calorie drink, with respect to which drinking a glass of water is not on a par with drinking another drink. However, that is beside the point
because that is not claimed. Similarly, the straight(-forward) rule and rule 17 are on a par concerning the cognitive end of converging to the limit of the relative frequency of any event type in any sequence of event tokens for which there is such a limit. The two are functionally equivalent with respect to this cognitive end. It is not claimed that these two rules are functionally equivalent with respect to any other end, say, the end of making true or accurate conjectures about the type of the next event token.

Given that one has the end of quenching one’s thirst with a low-calorie drink, one ought to drink a glass of water. Yet, one cannot just drink a glass of water. One can only drink a particular glass of water containing a certain number of H2O molecules in a particular manner. However, drinking another glass of water containing the same number of H2O molecules, but different ones, in a different manner would do equally well: It would be functionally equivalent with respect to the end of quenching one’s thirst with a low-calorie drink. Hence, given that one has this end, what one really ought to do is something that is functionally equivalent to drinking a particular glass of water containing a certain number of H2O molecules in a particular manner, where functional equivalence is understood with respect to the end of quenching one’s thirst with a low-calorie drink. In the same way, one cannot just follow the straight(-forward) rule or rule 17 or some alternative rule that converges to the same limits. One can only follow one of these alternative rules. In light of all this, we must formulate the conclusion of the vindication of induction as follows: An ideal cognitive agent ought to obey a rule that is functionally equivalent to the straight(-forward) rule given that she has the cognitive end that her conjectures converge to the limit of the relative frequency of any event type in any sequence of event tokens for which there is such a limit.
Functional equivalence is understood with respect to this cognitive end.
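
The functional equivalence of the straight(-forward) rule and rule 17 can be illustrated with a small Python sketch. The sequence used here (the pattern A, A, B repeated, so that the limiting relative frequency of A is 2/3) and the function names are assumptions made for illustration only; they do not come from the text.

```python
from fractions import Fraction


def straight_rule(m, n):
    """Conjecture m/n as the limiting relative frequency."""
    return Fraction(m, n)


def rule_17(m, n):
    """Conjecture (m + 17)/n, as rule 17 in the text does."""
    return Fraction(m + 17, n)


def conjectures(sequence, event_type, rule):
    """Apply a rule to every initial segment of the sequence."""
    m = 0
    out = []
    for n, token in enumerate(sequence, start=1):
        if token == event_type:
            m += 1
        out.append(rule(m, n))
    return out


# Hypothetical sequence of event tokens: A, A, B repeated 10,000 times.
sequence = ["A", "A", "B"] * 10_000
limit = Fraction(2, 3)

straight = conjectures(sequence, "A", straight_rule)
seventeen = conjectures(sequence, "A", rule_17)

# The two conjectures always differ by exactly 17/n, which shrinks to zero.
for n in (3, 300, 30_000):
    print(n, float(straight[n - 1]), float(seventeen[n - 1]))
```

After three event tokens rule 17 conjectures 19/3, which is not even a possible relative frequency, whereas the straight(-forward) rule conjectures 2/3. This fits the point made above: the two rules are on a par only with respect to the end of converging to the limit, not with respect to other ends such as making accurate conjectures early on.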

it attempts to conjecture the limit of the relative frequency of some event type in any sequence of event tokens continuing the event tokens about which one has the information whether they are of the given type. To be fair, Reichenbach was also interested in using one’s information about relative frequencies to assign probabilities to propositions such as that the next event token is of a given type. The vindication of induction does not carry over to the different cognitive end of using one’s information about relative frequencies to assign probabilities. However, this does not change the fact that Reichenbach has taught us that we can justify induction by a deductively valid argument which does not presuppose its conclusion, whose premises are restricted to information we presumably have, and whose conclusion says that the principle of induction is a means to attaining a cognitive end that differs from the end of reasoning from true premises to true conclusions in all or most of the logically possible cases.

10.3 RANDOM VARIABLES

One of the most important theorems in probability theory is the strong law of large numbers. Once again we consider a non-empty set of possible worlds W and an algebra of propositions $\mathcal{A}$ over W. This time we additionally consider a second non-empty set V and a second algebra $\mathcal{B}$ over V. The first concept we need to define is that of a random variable. Like many technical terms, it is a misnomer insofar as a random variable is neither random, nor is it a variable. Instead, it is a function. Recall from Chapter 5 that functions are mappings from their domain into their co-domain such that each element of the domain, each argument, is mapped to exactly one element of the co-domain, the argument’s value
under the function. The domain of a random variable is W, and its co-domain is V, so that a random variable X is a function from W into V, $X : W \to V$. However, a random variable is not just any function from W into V. Instead, a function X from W into V is a random variable if, and only if, it is a “measurable function”: Each element B in $\mathcal{B}$ is such that “its inverse image under X,” $X^{-1}(B)$, is a proposition in $\mathcal{A}$. (The term ‘measurable function’ derives from the terminology of calling the elements of algebras “measurable sets.”) The inverse image of B under X, $X^{-1}(B)$, is the following subset of W:

$$X^{-1}(B) = \{w \in W : X(w) \in B\}$$

If X is to be a random variable (or $\mathcal{A}$-$\mathcal{B}$ measurable function), the inverse image of every element B in the algebra $\mathcal{B}$ must be a proposition in the algebra $\mathcal{A}$ on which the probability measure Pr is defined. The point of this requirement is that we want Pr to assign probabilities to these inverse images. We can do this only if they are propositions in $\mathcal{A}$, which is the domain of the probability measure Pr.

Let us consider two examples in which we interpret the domain W differently, not as a set of possible worlds that are (logical) alternatives to each other, but as a population of individuals (more on this in Section 10.7). Suppose W is the set of humans, V is the set of natural numbers including zero, and X is the number-of-children function mapping each human in W to exactly one natural number in V, namely this human’s number of children. Suppose further that the algebra $\mathcal{B}$ is the power set of V so that one of the elements in $\mathcal{B}$ is the set $B = \{v \in V : v \geq 2\}$. The inverse image of this element of $\mathcal{B}$ under the number-of-children function, $X^{-1}(\{v \in V : v \geq 2\})$, is the set of humans from W who have at least two children: $\{w \in W : X(w) \geq 2\}$. In order for the number-of-children function to be a random variable, this inverse image has to be an element of the algebra $\mathcal{A}$. The point
of this requirement is that we want Pr to assign a probability to the set of humans who have at least two children. We can do so only if this set is an element of the algebra $\mathcal{A}$ on which the probability measure Pr is defined. The probability of the set of humans who have at least two children is $\Pr(X^{-1}(\{v \in V : v \geq 2\}))$. Sometimes this is said to be the probability that a “randomly chosen” human has at least two children.

Alternatively, suppose X is the height-in-centimeters function mapping each human in W to exactly one natural number in V, namely this human’s height in centimeters. As before, the algebra $\mathcal{B}$ is the power set of V so that one element in $\mathcal{B}$ is the set $B = \{170\}$. The inverse image of this element of $\mathcal{B}$ under the height-in-centimeters function, $X^{-1}(\{170\})$, is the set of humans from W whose height equals 170 cm: $\{w \in W : X(w) = 170\}$, or simply: $X = 170$. In order for the height-in-centimeters function to be a random variable, this inverse image has to be an element of the algebra $\mathcal{A}$. Again, the point of this requirement is that we want Pr to assign a probability to the set of humans whose height equals 170 cm. We can do so only if this set is an element of the algebra $\mathcal{A}$ on which the probability measure Pr is defined. The probability of the set of humans whose height equals 170 cm is $\Pr(\{w \in W : X(w) = 170\})$, or simply $\Pr(X = 170)$. Sometimes this is said to be the probability that a “randomly chosen” human is 170 cm tall.
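
For finite sets the measurability requirement can be checked mechanically. The following Python sketch assumes a hypothetical population of four humans and represents the number-of-children function as a dictionary; the names and numbers are invented for illustration and are not from the text.

```python
def inverse_image(X, B):
    """Return X^{-1}(B) = {w in W : X(w) in B} for a function X given as a dict."""
    return frozenset(w for w, value in X.items() if value in B)


def power_set(V):
    """All subsets of a finite set V, as frozensets."""
    subsets = [frozenset()]
    for item in V:
        subsets += [s | {item} for s in subsets]
    return set(subsets)


def is_random_variable(X, algebra_A, algebra_B):
    """X is measurable iff every inverse image of an element of B lies in A."""
    return all(inverse_image(X, B) in algebra_A for B in algebra_B)


# Hypothetical population and number-of-children function (assumed values).
X = {"ann": 0, "bob": 2, "cat": 3, "dan": 0}
W = frozenset(X)
V = frozenset({0, 1, 2, 3})
algebra_B = power_set(V)

at_least_two = frozenset(v for v in V if v >= 2)
print(sorted(inverse_image(X, at_least_two)))  # ['bob', 'cat']

fine_algebra = power_set(W)        # every set of humans is a proposition
coarse_algebra = {frozenset(), W}  # only the two trivial propositions

print(is_random_variable(X, fine_algebra, algebra_B))    # True
print(is_random_variable(X, coarse_algebra, algebra_B))  # False
```

With the coarse algebra the function fails to be a random variable because, for example, the set of humans without children is not a proposition, so no probability could be assigned to it.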

10.4 INDEPENDENT AND IDENTICALLY DISTRIBUTED RANDOM VARIABLES

One way to make sure that a function X from W into V is a random variable is to let it generate the algebra of propositions $\mathcal{A}$. This works by first requiring $\mathcal{A}$ to include the inverse images $X^{-1}(B)$ of all elements B from the algebra $\mathcal{B}$ over V. Then one considers the smallest set that contains all these inverse images
under X and that is, as we say, “closed under complementation and countable union” so that steps 1-3 of the definition of an algebra as well as the countably infinite version of step 3 from Section 5.4 are satisfied. The algebra of propositions generated by a random variable in this way is always unique. Therefore, we can say that a proposition A is about a random variable X if, and only if, A is an element of the algebra of propositions that is generated by X. If we have several random variables $X_1, X_2, \ldots$ we first let each of them generate its algebra. Then we consider the smallest set that contains all these algebras as subsets and that is closed under complementation and countable union. This super-sized algebra is the algebra of propositions on which the probability measure Pr is defined.

The next two concepts have already appeared informally in Exercise 39. In order to make them formally precise, it is best to think of a random variable as an experiment token such as the flipping of a coin or the rolling of a die. The possible outcome types of an experiment token often are numbers, as when we roll a die and consider how many eyes it shows. However, this is not essential, as illustrated by the coin toss. Here the outcome types are heads and tails. In general, the possible outcome types of an experiment token are the values v in the set V, which is the co-domain of the experiment token.

We consider a possibly infinite set of random variables, or experiment tokens. The random variables, or experiment tokens, in this set are independent in the sense of Pr if, and only if, for every finite subset of this set, say the set containing the random variables $Y_1, \ldots, Y_n$, and every proposition $A_1$ about $Y_1$, . . ., and every proposition $A_n$ about $Y_n$:

$$\Pr(A_1 \cap \ldots \cap A_n) = \Pr(A_1) \cdot \ldots \cdot \Pr(A_n)$$

The random variables, or experiment tokens, in this set are identically distributed in the sense of Pr if, and only if, for any
two random variables Y and Z and every element B from $\mathcal{B}$:

$$\Pr(Y^{-1}(B)) = \Pr(\{w \in W : Y(w) \in B\}) = \Pr(Z^{-1}(B)) = \Pr(\{w \in W : Z(w) \in B\})$$

(The first and third equalities follow from the definition of inverse images.)

The next concept has already appeared informally, too, and we are now in a position to make it formally precise. We consider a possibly infinite sequence of random variables, or experiment tokens: $X_1, X_2, \ldots$. The difference between a set and a sequence is that the order in which the elements of a sequence are listed matters, and so do multiple occurrences. (In the terminology of Section 5.1, a sequence is a tuple.) All these random variables, or experiment tokens, $X_1, X_2, \ldots$ have the same domain W and the same co-domain V. Therefore, we can think of them as repetitions of the same experiment type X. We will momentarily consider an example where the experiment type is the tossing of a particular coin, and the experiment tokens are the first toss of this coin, the second toss of this coin, and so on. The relative frequency of the possible outcome type v from V in the first n repetitions of the experiment type X in the possible world w from W is the proportion of outcome tokens of type v among the outcome tokens $X_1(w), \ldots, X_n(w)$:

$$\mathrm{rf}(v, X^n(w)) = \#\{i : X_i(w) = v, 1 \leq i \leq n\}/n$$

Let us consider an example. Our experiment type is tossing a particular coin, X. The first repetition of this experiment type (that is, the first toss of this coin) is the experiment token or random variable $X_1 : W \to \{H, T\}$ whose domain is the set of all possible worlds W. Its co-domain is the set of possible outcome types $\{H, T\}$, which includes H for heads and T for tails. The second repetition of this experiment type (that is, the second toss of this coin) is the experiment token or random variable
$X_2 : W \to \{H, T\}$. And so on. Note that $X_1$, $X_2$, and all the other tosses with this coin have the same domain W and the same co-domain $\{H, T\}$; otherwise they would not be repetitions of the same experiment type X. W need not be the set of all logically possible cases. It is sufficient if W specifies for each natural number i whether the coin lands on heads or tails on the i-th toss. (In fact, W can be constructed in this way from the co-domain of the experiment type X as the infinite Cartesian product of V.) V is the set containing H for heads and T for tails, and the algebra over $V = \{H, T\}$ is the power set. The algebra of propositions $\mathcal{A}$ over W is the super-sized algebra mentioned above that is generated by all the coin tosses, that is, the first coin toss $X_1$, the second coin toss $X_2$, and so on. Finally, Pr is the probability measure on the algebra $\mathcal{A}$.

To say that the coin tosses $X_1, X_2, \ldots$ are independent in the sense of Pr is to say that for any n coin tosses $Y_1, \ldots, Y_n$, any proposition $A_1$ about $Y_1$, . . ., and any proposition $A_n$ about $Y_n$:

$$\Pr(A_1 \cap \ldots \cap A_n) = \Pr(A_1) \cdot \ldots \cdot \Pr(A_n)$$

In this formulation, I have used ‘Y’s instead of ‘X’s because the n coin tosses need not be $X_1, \ldots, X_n$, but may be, say, the n = 4 coin tosses $X_{17}$, $X_5$, $X_{88}$, and $X_{20}$, respectively. The various propositions about a particular coin toss, say, $X_{17}$, are the contradictory proposition $\emptyset$, the tautological proposition W, the proposition $X_{17} = H$, that is, $\{w \in W : X_{17}(w) = H\}$ that the 17th toss with the coin lands on heads, and the proposition $X_{17} = T$, that is, $\{w \in W : X_{17}(w) = T\}$ that the 17th toss with the coin lands on tails. This allows us to simplify the claim that the coin tosses $X_1, X_2, \ldots$ are independent to the claim that, for any n coin tosses $Y_1, \ldots, Y_n$, where each $v_i$ is either H or T:

$$\Pr(Y_1 = v_1 \cap \ldots \cap Y_n = v_n) = \Pr(Y_1 = v_1) \cdot \ldots \cdot \Pr(Y_n = v_n)$$

This means that the probability of any outcome of any toss is independent, in the sense of Pr, of the outcomes of any finite
number of other coin tosses. For instance, the probability that the coin lands on heads on the 17th toss is independent of the outcomes of the first 16 tosses. In particular, if the probability that the coin lands on heads on the 17th toss equals a half, then so does the conditional probability that the coin lands on heads on the 17th toss given that it has landed on tails on the first 16 tosses (provided the latter condition has positive probability).

To say that the coin tosses $X_1, X_2, \ldots$ are identically distributed in the sense of Pr is to say that for any two coin tosses $X_i$ and $X_j$:

$$\Pr(X_i = H) = \Pr(X_j = H)$$

This means that the probability that the coin lands on heads (or, for that matter, tails) is the same on any two (and, hence, all) tosses. For instance, if the probability that the coin lands on heads on the first toss equals a half, then so does the probability that it lands on heads on the 17th toss.

Combined, these two concepts are very powerful. If the coin tosses $X_1, X_2, \ldots$ are independent and identically distributed, or iid, then the probability that the coin lands on, say, heads on any n coin tosses $Y_1, \ldots, Y_n$ equals:

$$\Pr(Y_1 = H \cap \ldots \cap Y_n = H) \overset{\text{i}}{=} \Pr(Y_1 = H) \cdot \ldots \cdot \Pr(Y_n = H) \overset{\text{id}}{=} \Pr(Y_1 = H)^n$$

Finally, here is how our example illustrates the concept of relative frequency. One of the possible worlds in W is the actual world @. Suppose we toss the coin three times, and it lands on heads on the first two tosses and on tails on the third: $X_1(@) = H$, $X_2(@) = H$, and $X_3(@) = T$. This means that the relative frequency of the outcome type heads H in the first three repetitions of the experiment type of tossing a coin X in the actual world @ equals 2/3, $\mathrm{rf}(H, X^3(@)) = 2/3$.
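
The coin example can be made concrete for three tosses. The Python sketch below assumes that W is the finite set of all triples of outcomes and that Pr is the product measure with chance 1/2 for heads; both are simplifying assumptions for illustration (the text allows infinitely many tosses and any probability measure). The independence check only spot-checks the product condition for the three tosses taken together, not for every finite subset.

```python
from fractions import Fraction
from itertools import product

OUTCOMES = ("H", "T")
N_TOSSES = 3
P_HEADS = Fraction(1, 2)  # assumed probability of heads on each toss

# Possible worlds: every way three tosses can come out.
W = list(product(OUTCOMES, repeat=N_TOSSES))


def pr_world(w):
    """Product measure: multiply the per-toss probabilities."""
    result = Fraction(1)
    for outcome in w:
        result *= P_HEADS if outcome == "H" else 1 - P_HEADS
    return result


def pr(proposition):
    """Probability of a proposition, that is, of a set of possible worlds."""
    return sum((pr_world(w) for w in proposition), Fraction(0))


def toss_is(i, v):
    """The proposition X_i = v, with i counted from 1."""
    return {w for w in W if w[i - 1] == v}


def rf(v, outcomes):
    """Relative frequency of outcome type v among the given outcome tokens."""
    return Fraction(sum(1 for o in outcomes if o == v), len(outcomes))


# Independence (spot check): Pr(X1 = a, X2 = b, X3 = c) equals the product.
independent = all(
    pr(toss_is(1, a) & toss_is(2, b) & toss_is(3, c))
    == pr(toss_is(1, a)) * pr(toss_is(2, b)) * pr(toss_is(3, c))
    for a, b, c in W
)
# Identical distribution: Pr(X_i = H) is the same for every toss.
identical = len({pr(toss_is(i, "H")) for i in range(1, N_TOSSES + 1)}) == 1

actual_world = ("H", "H", "T")
print(independent, identical, rf("H", actual_world))  # True True 2/3
```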

10.5 THE STRONG LAW OF LARGE NUMBERS

Often the set of possible outcome types V is a set of numbers, say, when the random variables are the rolls of a die, and the possible outcome types are 1, 2, 3, 4, 5, and 6. In this case, we can consider not only the relative frequency of some outcome type such as 4 in some possible world in the first n repetitions of the experiment type of rolling the die. We can also consider the outcome type that has occurred on average. The mean of the first n repetitions of the experiment type X in the possible world w, $\overline{X}^n(w)$, is defined as follows:

$$\overline{X}^n(w) = (X_1(w) + \cdots + X_n(w))/n$$

For instance, suppose we roll the die four times, and it shows two eyes on the first roll, four eyes on the second roll, three eyes on the third roll, and one eye on the fourth roll, that is, $X_1(@) = 2$, $X_2(@) = 4$, $X_3(@) = 3$, and $X_4(@) = 1$. Then the mean of the first four repetitions of the experiment type of rolling the die in the actual world @ is $\overline{X}^4(@) = (2 + 4 + 3 + 1)/4 = 2.5$.

If the set of possible outcome types is a set of numbers, we can consider not only the probability that some outcome type occurs in an experiment token. We can also consider the outcome type the probability measure Pr “expects” the experiment token to result in. The expected value of the experiment token $X_n$, $\mathrm{Exp}(X_n)$, is defined as follows:

$$\mathrm{Exp}(X_n) = v_1 \cdot \Pr(X_n = v_1) + v_2 \cdot \Pr(X_n = v_2) + \cdots$$

For instance, suppose the probability measure Pr is the one from Exercise 5 in Section 5.2 that assigns a probability of 1/6 to each of the six propositions that the die shows one eye, {1}, that the die shows two eyes, {2}, . . ., and that the die shows six eyes, {6}. The value that this probability measure expects the fifth roll of the die to result in is $\mathrm{Exp}(X_5) = 1 \cdot 1/6 + 2 \cdot 1/6 + 3 \cdot 1/6 + 4 \cdot 1/6 + 5 \cdot 1/6 + 6 \cdot 1/6 = 3.5$. Note that neither the mean nor the expected value has to be a possible value or outcome type from V. This is illustrated by our example: Neither 2.5 nor 3.5 is among the possible values or outcome types from $V = \{1, 2, 3, 4, 5, 6\}$.

The strong law of large numbers tells us something about the relationship between the limiting relative frequency of an outcome type such as heads or four in a sequence of experiment tokens such as tosses of a particular coin or rollings of a particular die, and the probability that this outcome type will be instantiated by a particular experiment token. If the set of possible values or outcome types is a set of numbers, it also tells us something about the relationship between the limiting mean in a sequence of experiment tokens, and the expected value of a particular experiment token. The strong law of large numbers was proven as early as Borel (1909), and there are different formulations of it. We will rely on the formulation by Kolmogorov (1933), which assumes the set of possible values or outcome types V to be the set of real numbers $\mathbb{R}$. The algebra $\mathcal{B}$ over V is the set of Borel sets from Section 6.2, that is, the smallest set that contains all intervals and is closed under complementation and countable union.

Kolmogorov’s Strong Law of Large Numbers

Let $X_1, X_2, \ldots$ be a sequence of random variables, or experiment tokens, whose domain is a non-empty set of possible worlds W and whose co-domain is the set of real numbers $\mathbb{R}$ so that these experiment tokens are repetitions of the same experiment type X. Let $\mathcal{A}$ be the super-sized algebra of propositions over W that is generated by these random variables when the algebra $\mathcal{B}$ over V is the set of Borel sets, and let Pr be a probability measure on $\mathcal{A}$. $\mathcal{A}$ and Pr are assumed to satisfy the strengthened versions of steps 3 and 6, respectively, from Section 5.4.
If $X_1, X_2, \ldots$ are independent and identically distributed in the sense of Pr, then the expected value of any one of them, say, $X_{n+1}$, is finite, $\mathrm{Exp}(X_{n+1}) < \infty$, if, and only if, there exists a proposition A in $\mathcal{A}$ whose probability equals one and which is such that, for all possible worlds w in A, the mean of the first n repetitions of the experiment type X in the possible world w converges to the expected value of the next, or any other, experiment token,

$$\lim_{n \to \infty} |\overline{X}^n(w) - \mathrm{Exp}(X_{n+1})| = 0.$$
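
The die example and the behavior described by the theorem can be sketched in Python. The first two computations reproduce the mean 2.5 and the expected value 3.5 from the text exactly. The simulation that follows is only an illustration under the assumption that a seeded pseudo-random generator stands in for iid rolls of a fair die; it shows one possible world and proves nothing about convergence.

```python
import random
from fractions import Fraction


def mean(outcomes):
    """Mean of the given outcome tokens: their sum divided by their number."""
    return sum(outcomes, Fraction(0)) / len(outcomes)


def expected_value(distribution):
    """Sum of v * Pr(X = v) over the possible outcome types v."""
    return sum((v * p for v, p in distribution.items()), Fraction(0))


# Probability 1/6 for each of the six outcome types, as in the text.
fair_die = {v: Fraction(1, 6) for v in range(1, 7)}

print(mean([2, 4, 3, 1]))        # 5/2
print(expected_value(fair_die))  # 7/2

# Assumed stand-in for iid rolls: a seeded pseudo-random generator.
rng = random.Random(0)
rolls = [rng.randint(1, 6) for _ in range(100_000)]
for n in (10, 1_000, 100_000):
    print(n, float(mean(rolls[:n])))
```

In a typical run the printed means move closer to 3.5 as n grows, which is the pattern the strong law says holds in almost all possible worlds.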

Recall that only those sets of possible worlds have probabilities that are propositions in the algebra. Mathematicians say that a set of possible worlds almost surely contains the actual world if, and only if, the set itself, or (if the set is not a proposition) one of its subsets, has probability one. Given this terminology, Kolmogorov’s strong law of large numbers says that, in numerical experiments that are iid, the means almost surely converge to the expected values if, and only if, these expected values are finite. In the language of statistics, it says that means are strongly consistent estimators of expected values if, and only if, these expected values are finite.

We do not have to assume that the set of possible values or outcome types is a set of numbers. If we drop this assumption, the strong law of large numbers can be formulated as follows.

The Strong Law of Large Numbers

Let $X_1, X_2, \ldots$ be a sequence of random variables, or experiment tokens, whose domain is a non-empty set of possible worlds W and whose co-domain is a finite set of possible outcome types V so that these experiment tokens are repetitions of the same experiment type X. Let $\mathcal{A}$ be the super-sized algebra of propositions over W that is generated by these random variables when the algebra $\mathcal{B}$ over V is the power set of V, and let v be an arbitrary outcome type from V. Finally, let Pr be a probability measure on $\mathcal{A}$, and assume $\mathcal{A}$ and Pr to satisfy the strengthened versions of steps 3 and 6, respectively, from Section 5.4.
If $X_1, X_2, \ldots$ are independent and identically distributed in the sense of Pr, then there exists a proposition A in $\mathcal{A}$ whose probability equals one and which is such that, for all possible worlds w in A, the relative frequency of the outcome type v in the first n repetitions of the experiment type X in the possible world w converges to the probability that the next, or any other, experiment token results in this outcome type v:

$$\lim_{n \to \infty} |\mathrm{rf}(v, X^n(w)) - \Pr(X_{n+1} = v)| = 0.$$

The strong law of large numbers says that the relative frequencies of the outcome types in iid experiments almost surely converge to the probabilities of these outcome types. In the language of statistics, it says that relative frequencies are strongly consistent estimators of probabilities. The strong law of large numbers relies on the infinite version of the additivity axiom. The weak law of large numbers does not.

The Weak Law of Large Numbers

Let $X_1, X_2, \ldots$ be a sequence of random variables, or experiment tokens, whose domain is a non-empty set of possible worlds W and whose co-domain is a finite set of possible outcome types V so that these experiment tokens are repetitions of the same experiment type X. Let $\mathcal{A}$ be the super-sized algebra of propositions over W that is generated by these random variables when the algebra $\mathcal{B}$ over V is the power set of V, and let v be an arbitrary outcome type from V. Finally, let Pr be a probability measure on $\mathcal{A}$, and assume $\mathcal{A}$ to satisfy the strengthened version of step 3 from Section 5.4, though Pr is not assumed to satisfy the strengthened version of step 6.
If $X_1, X_2, \ldots$ are independent and identically distributed in the sense of Pr, then for every $\varepsilon > 0$, the probability that the relative frequency of the outcome type v in the first n repetitions of the experiment type X differs from the probability that the next, or any other, experiment token results in this outcome type v by at least $\varepsilon$ converges to zero, that is, for all $\varepsilon > 0$:

$$\lim_{n \to \infty} \Pr(\{w \in W : |\mathrm{rf}(v, X^n(w)) - \Pr(X_{n+1} = v)| \geq \varepsilon\}) = 0.$$

The set of possible worlds $\{w \in W : \lim_{n \to \infty} |\mathrm{rf}(v, X^n(w)) - \Pr(X_{n+1} = v)| = 0\}$ may but need not be a proposition in the algebra $\mathcal{A}$. If it is, the strong law of large numbers can be formulated as follows: If $X_1, X_2, \ldots$ are iid in the sense of Pr, then

$$\Pr(\{w \in W : \lim_{n \to \infty} |\mathrm{rf}(v, X^n(w)) - \Pr(X_{n+1} = v)| = 0\}) = 1.$$

No such qualification is required in the case of the weak law of large numbers. The reason is that for every $\varepsilon > 0$ and every natural number n, the set of possible worlds $\{w \in W : |\mathrm{rf}(v, X^n(w)) - \Pr(X_{n+1} = v)| \geq \varepsilon\}$ is a proposition in $\mathcal{A}$. The weak law of large numbers was proven as early as Bernoulli (1713). It says that the probability that the relative frequencies differ from the probabilities by at least any fixed positive amount converges to zero. In the language of statistics, it says that relative frequencies are weakly consistent estimators of probabilities. Weak consistency is defined in terms of “convergence in probability.” It is implied by strong consistency, which is defined in terms of almost sure convergence. The latter in turn is implied by “pointwise” convergence, which is the concept used in Reichenbach’s theorem.
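
For iid trials with two outcome types, the probability that appears in the weak law can be computed exactly, because the number of occurrences of the outcome type v in n trials then follows a binomial distribution. The Python sketch below does this for the assumed values Pr(X = v) = 1/2 and ε = 1/10; these values and the function name are chosen for illustration and are not from the text.

```python
from fractions import Fraction
from math import comb


def prob_deviation_at_least(n, p, eps):
    """Exact Pr(|rf - p| >= eps) for n iid trials, each with probability p for v."""
    total = Fraction(0)
    for k in range(n + 1):
        # k is the number of trials that result in outcome type v, so rf = k/n.
        if abs(Fraction(k, n) - p) >= eps:
            total += comb(n, k) * p**k * (1 - p) ** (n - k)
    return total


p = Fraction(1, 2)     # assumed probability of the outcome type v
eps = Fraction(1, 10)  # assumed tolerance

for n in (10, 100, 1000):
    print(n, float(prob_deviation_at_least(n, p, eps)))
```

The printed probabilities shrink as n grows, as the weak law requires. Exact fractions are used so that the comparison with ε is not affected by floating-point rounding.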

10.6 DEGREES OF BELIEF, CHANCES, AND RELATIVE FREQUENCIES

As a theorem, the strong law of large numbers holds for every interpretation of probability. Therefore, we can choose to interpret the probability measure Pr as a chance measure. On this interpretation, the strong law of large numbers says that the chance equals one that the relative frequencies in repetitions of experiment tokens that are independent and identically distributed, in the sense of the chance measure, converge to the chances. Thus, on this interpretation, the strong law of large numbers relates limiting relative frequencies and chances. In Section 9.3 we came across the principal principle that relates chances and probabilistic degrees of belief. Let us use this principle and theorem, as well as Bayes’ theorem and the update rule of strict conditionalization, to shed light on the interplay between the three major interpretations of probability: probabilistic degrees of belief, chances, and (limiting) relative frequencies.

Let Cr be the ideal cognitive agent’s initial probabilistic degree of belief function, and ch the chance measure. The principal principle and Bayes’ theorem imply that conditionalizing on information about relative frequencies of ch-iid repetitions of experiments affects her degrees of belief in chance hypotheses in the following way (the factorial operation ! is defined as follows: $n! = 1 \cdot 2 \cdot \ldots \cdot (n - 1) \cdot n$):

$$\mathrm{Cr}(\mathrm{ch}(X_{n+1} = v) = p \mid \mathrm{rf}(v, X^n) = q) = \frac{n! \cdot p^{n \cdot q} \cdot (1 - p)^{n \cdot (1 - q)}}{(n \cdot q)! \cdot (n \cdot (1 - q))!} \cdot \frac{\mathrm{Cr}(\mathrm{ch}(X_{n+1} = v) = p)}{\mathrm{Cr}(\mathrm{rf}(v, X^n) = q)}$$

This assumes that the chance hypothesis $\mathrm{ch}(X_{n+1} = v) = p$ and the information $\mathrm{rf}(v, X^n) = q$ that the relative frequency of the outcome type v in the first n ch-iid repetitions of the
experiment type X equals q were both deemed possible initially, that is, assigned a positive initial degree of belief. Suppose the ideal cognitive agent becomes certain that $\mathrm{rf}(v, X^n) = q$. Suppose further she updates her degrees of belief by strict conditionalization. For fixed n, the term $\frac{n! \cdot p^{n \cdot q} \cdot (1 - p)^{n \cdot (1 - q)}}{(n \cdot q)! \cdot (n \cdot (1 - q))!}$ is larger the closer the relative frequency q is to the hypothesized chance p. The strong law of large numbers then implies that the chance equals one that, in the limit, the ideal cognitive agent’s degree of belief will be highest for the true chance hypothesis. In other words, by obeying the principal principle and by updating her degrees of belief by strict conditionalization, an ideal cognitive agent who becomes certain of what the true relative frequencies in iid repetitions of some experiment are will almost surely and in the limit “learn” what the chances are, provided that her initial degrees of belief for the true chance hypothesis and the information received are not zero.

Let us continue the example in which the relative frequency of the outcome type heads H in the first three repetitions of the experiment type of tossing a coin X in the actual world @ equals 2/3, $\mathrm{rf}(H, X^3(@)) = 2/3$. Let us assume we assigned a positive initial degree of belief to these relative frequencies and now become certain of them in the sense that we assign degree of belief one to the proposition $\mathrm{rf}(H, X^3) = 2/3$, that is, $\{w \in W : \mathrm{rf}(H, X^3(w)) = 2/3\}$. Furthermore, suppose the coin tosses $X_1, X_2, \ldots$ are iid in the sense of the chance measure ch, and we are certain of this. To some extent, the art of designing an experiment consists in making sure that this assumption is met, but at the end of the day, it is just that: an assumption. Now consider the hypothesis that the coin is fair, $\mathrm{ch}(X_4 = H) = 1/2$, and the hypothesis that the coin is biased towards heads 2:1, $\mathrm{ch}(X_4 = H) = 2/3$. Both of them are assigned a positive initial degree of belief. Finally, we assume not only that the chance that the coin lands on heads exists, but also that it is either 1/2 or 2/3, so that one of these two hypotheses is true.

The principal principle and Bayes’ theorem imply that our degree of belief function Cr should be such that:

$$\mathrm{Cr}(\mathrm{ch}(X_4 = H) = 1/2 \mid \mathrm{rf}(H, X^3) = 2/3) = \frac{3! \cdot (1/2)^{3 \cdot (2/3)} \cdot (1/2)^{3 \cdot (1/3)}}{(3 \cdot (2/3))! \cdot (3 \cdot (1/3))!} \cdot \frac{\mathrm{Cr}(\mathrm{ch}(X_4 = H) = 1/2)}{\mathrm{Cr}(\mathrm{rf}(H, X^3) = 2/3)} = \frac{3}{8} \cdot \frac{\mathrm{Cr}(\mathrm{ch}(X_4 = H) = 1/2)}{\mathrm{Cr}(\mathrm{rf}(H, X^3) = 2/3)}$$

$$\mathrm{Cr}(\mathrm{ch}(X_4 = H) = 2/3 \mid \mathrm{rf}(H, X^3) = 2/3) = \frac{3! \cdot (2/3)^{3 \cdot (2/3)} \cdot (1/3)^{3 \cdot (1/3)}}{(3 \cdot (2/3))! \cdot (3 \cdot (1/3))!} \cdot \frac{\mathrm{Cr}(\mathrm{ch}(X_4 = H) = 2/3)}{\mathrm{Cr}(\mathrm{rf}(H, X^3) = 2/3)} = \frac{4}{9} \cdot \frac{\mathrm{Cr}(\mathrm{ch}(X_4 = H) = 2/3)}{\mathrm{Cr}(\mathrm{rf}(H, X^3) = 2/3)}$$

Becoming certain that the relative frequency of heads in the first three tosses of the coin equals 2/3, where these coin tosses are assumed to be iid, and updating our degrees of belief by strict conditionalization improves our degree of belief in the biased-chance hypothesis $\mathrm{ch}(X_4 = H) = 2/3$ compared to our degree of belief in the fair-chance hypothesis $\mathrm{ch}(X_4 = H) = 1/2$, provided both of them have not been assigned degree of belief zero to begin with, and provided that the information we have received has not been assigned degree of belief zero either. Suppose we continue to become certain of the true relative frequencies of heads in tosses with this coin, which presupposes that we did not assign degree of belief zero to any of these relative frequencies, and these tosses actually are iid. Then the strong law of large numbers guarantees that the chance is one that we eventually assign the highest degree of belief to the true hypothesis about what the chance of heads in tosses with this coin is (again, assuming that this chance exists and the true hypothesis is one of the two chance hypotheses considered by us).
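
The two factors 3/8 and 4/9 can be recomputed, and the update carried through to actual numbers, with a short Python sketch. The text leaves the initial degrees of belief open, so the equal priors of 1/2 for each chance hypothesis used below are an assumption made purely for illustration. Since the text assumes that the chance of heads is either 1/2 or 2/3, the degree of belief in the evidence is obtained here from the two hypotheses by the law of total probability.

```python
from fractions import Fraction
from math import factorial


def likelihood(n, q, p):
    """The factor n! * p^(n*q) * (1-p)^(n*(1-q)) / ((n*q)! * (n*(1-q))!)."""
    k = n * q  # number of repetitions that resulted in the outcome type
    if k.denominator != 1:
        raise ValueError("n * q must be a whole number of outcomes")
    k = int(k)
    coefficient = Fraction(factorial(n), factorial(k) * factorial(n - k))
    return coefficient * p**k * (1 - p) ** (n - k)


def posterior(priors, n, q):
    """Strict conditionalization on rf = q, given priors over chance hypotheses."""
    joint = {p: prior * likelihood(n, q, p) for p, prior in priors.items()}
    evidence = sum(joint.values(), Fraction(0))  # Cr(rf = q) by total probability
    return {p: value / evidence for p, value in joint.items()}


fair, biased = Fraction(1, 2), Fraction(2, 3)
q = Fraction(2, 3)  # relative frequency of heads in the first three tosses

print(likelihood(3, q, fair))    # 3/8
print(likelihood(3, q, biased))  # 4/9

# Assumed initial degrees of belief: 1/2 for each of the two chance hypotheses.
priors = {fair: Fraction(1, 2), biased: Fraction(1, 2)}
updated = posterior(priors, 3, q)
print(updated[fair], updated[biased])  # 27/59 32/59
```

With these assumed priors the degree of belief in the biased-chance hypothesis rises from 1/2 to 32/59, while that in the fair-chance hypothesis falls to 27/59.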

10.7 DESCRIPTIVE STATISTICS

Some readers will have come across probability in connection with statistics, and we will conclude this section with a discussion of the latter. Mathematically, random variables are measurable functions, but philosophically we can interpret them in at least two distinct ways: as singular variables and as generic variables. Singular variables have as their domain a set of possible worlds that are (logical) alternatives to each other: No two possible worlds are (logically) compatible with each other. Hence, the information we have access to is restricted to the value v a singular variable X takes on in one possible world. If we make no mistake, this is the actual world @ so that we have access to the information X(@) = v.

In contrast to this, generic variables have as their domain a set of objects that are not (logical) alternatives to each other. This set is called the population. The fact that the objects or individuals in the population are not (logical) alternatives to each other allows us to obtain information about several (in principle even all) of them. The information that a generic variable Y takes on the values $v_1, \ldots, v_n$ for the n objects $a_1, \ldots, a_n$, respectively, in the domain of Y is called a sample (outcome) of size n, $Y(a_1) = v_1, \ldots, Y(a_n) = v_n$. A sample outcome is a specification of the value of a generic variable for the individuals in a finite subset of the entire population.

The two random variables a-human’s-number-of-children and a-human’s-height-in-centimeters are generic variables: Their domain is the population of all humans. A specification of the number of children of the presidents of the United States is a sample outcome for the first generic variable. A specification of the height-in-centimeters of this season’s players of the University of Connecticut Huskies (a famous
women’s basketball team) is a sample outcome for the second generic variable. Statistics usually works with generic variables. Descriptive statistics is mostly concerned with the description of the empirically accessible information that is contained in a sample, as well as how to best obtain samples. In contrast to this, inferential statistics is primarily concerned with the inferences that can be drawn about the entire population from the information in a sample. These inferences can take different forms. They can be estimates of parameters such as the mean or average value in the entire population from information about these parameters in a sample. They can also be inferences to accept or reject statistical hypotheses about the entire population, or to believe these statistical hypotheses to a specific degree. Most of the issues in the philosophy of statistics (Howson & Urbach 2005, Mayo 1996, Seidenfeld 1979, Romeijn 2014) concern inferential statistics. Among others, there is a debate between frequentist or classical statistics (Fisher 1925; 1935, Neyman & Pearson 1967) and Bayesian statistics (de Finetti 1970, Edwards & Lindman & Savage 1963, Lindley 1965, Savage 1954). While the average number of children of the presidents of the United States may be indicative of the average number of children of all humans, the average height-in-centimeters of this season’s players of the University of Connecticut Huskies will not be indicative of the average height-in-centimeters of all humans. To make sure the parameters of the sample such as the mean or average value are indicative of these parameters in the entire population, statisticians have developed various sophisticated sampling techniques. One of these is random sampling, which is supposed to make plausible the assumption that the individuals are selected from the population “at random.” This means that all individuals are chosen with the

same probability, and independently so that the iid assumption of the laws of large numbers holds.

Generic variables represent properties that the individuals in the population have, such as hair color, birth rank, and height-in-centimeters. Like all concepts, these properties can come in a qualitative, comparative, and quantitative form. For the most part, statistics is concerned with the typical or average value of generic variables in samples and the population.

If the generic variable represents a qualitative concept such as hair color, then the typical values in a sample are given by the modes of the sample (there may be more than one mode). The modes of a sample outcome are the most frequent values in the sample, that is, those values $v_i$ such that the number of individuals in the sample with value $v_i$ is at least as great as the number of individuals in the sample with value $v_k$, for every value $v_k$. If we have a sample of five humans, two of whom have black hair, two of whom have brown hair, and one of whom has red hair, then the modes are black hair and brown hair. If we have a sample of five humans, four of whom have black hair and one of whom has blonde hair, then the mode is black hair.

If the generic variable represents a comparative concept such as birth rank, then the typical or average value in a sample is given by the median of the sample. The median of a sample outcome (of odd size) is the middle value in the sample, that is, the smallest value $v_i$ such that at least half of the individuals in the sample have a value $v_k$ that is not greater than $v_i$ (it is this comparison that requires the concept to be comparative). If we have a sample of five humans, two of whom are firstborns, two of whom are second-borns, and one of whom is a third-born, then the median is second-born. If we have a sample of five humans, four of whom are firstborns and one of whom is a second-born, then the median is firstborn.
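The definitions of mode and median given above translate directly into code. The following Python sketch is one way to do so; the function names `modes` and `median` are hypothetical, samples are assumed to be non-empty, and birth ranks are encoded as the integers 1, 2, 3 purely for illustration.

```python
from collections import Counter


def modes(sample):
    """Return the set of most frequent values in a non-empty sample."""
    counts = Counter(sample)
    highest = max(counts.values())
    return {value for value, count in counts.items() if count == highest}


def median(sample):
    """Return the smallest value v such that at least half of the
    individuals in the sample have a value not greater than v.

    The values only need to be comparable, not quantitative.
    """
    ordered = sorted(sample)
    for v in ordered:
        at_most_v = sum(1 for u in ordered if u <= v)
        if 2 * at_most_v >= len(ordered):
            return v


hair_a = ["black", "black", "brown", "brown", "red"]
hair_b = ["black", "black", "black", "black", "blonde"]
print(modes(hair_a), modes(hair_b))

# Birth ranks encoded as integers: 1 = firstborn, 2 = second-born, 3 = third-born.
ranks_a = [1, 1, 2, 2, 3]
ranks_b = [1, 1, 1, 1, 2]
print(median(ranks_a), median(ranks_b))
```

The mode only counts occurrences, whereas the median relies on the ordering of the values, which mirrors the distinction between qualitative and comparative concepts.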

If the generic variable represents a quantitative concept such as height-in-centimeters, then the average value in a sample is given by the mean of the sample. We have already defined the mean for singular variables. For generic variables Y, the mean of Y for the n individuals $a_1, \ldots, a_n$ is defined as follows:

$$\overline{Y}[a_1, \ldots, a_n] = \overline{Y}^n = \frac{Y(a_1)}{n} + \cdots + \frac{Y(a_n)}{n}$$

If we have a sample of five humans with heights 150 cm, 163 cm, 171 cm, 165 cm, and 190 cm, then the mean of the generic variable height-in-centimeters in this sample is $\frac{150}{5} + \frac{163}{5} + \frac{171}{5} + \frac{165}{5} + \frac{190}{5} = 167.8$. This is also the mean for this variable when the sample consists of five humans with heights 165 cm, 166 cm, 171 cm, 165 cm, and 172 cm. The values in the latter sample diverge less from the mean. This divergence from the mean is captured by the variance and standard deviation of a sample, which are defined as follows (for ease of readability, we will drop the reference to the n individuals $a_1, \ldots, a_n$ from now on):

$$\mathrm{Var}(Y^n) = \frac{\left(Y(a_1) - \overline{Y}^n\right)^2}{n} + \cdots + \frac{\left(Y(a_n) - \overline{Y}^n\right)^2}{n}, \qquad \sigma(Y^n) = \sqrt{\mathrm{Var}(Y^n)}$$

Squaring the differences of the individual values from the mean in calculating the variance renders all differences non-negative and amplifies larger differences. Taking the square root of the variance in calculating the standard deviation compensates somewhat for the latter effect. In our examples, the variances are:

$$\frac{(150 - 167.8)^2}{5} + \frac{(163 - 167.8)^2}{5} + \frac{(171 - 167.8)^2}{5} + \frac{(165 - 167.8)^2}{5} + \frac{(190 - 167.8)^2}{5} = 170.16$$

$$\frac{(165 - 167.8)^2}{5} + \frac{(166 - 167.8)^2}{5} + \frac{(171 - 167.8)^2}{5} + \frac{(165 - 167.8)^2}{5} + \frac{(172 - 167.8)^2}{5} = 9.36$$

The standard deviations are $\sqrt{170.16} \approx 13.04$ and $\sqrt{9.36} \approx 3.06$, respectively.

Often one considers not just one but two generic variables, say height-in-centimeters and weight-in-kilograms. Suppose the humans in our first sample of size five have weights 55 kg, 60 kg, 90 kg, 70 kg, and 88 kg so that the mean is 72.6 kg. The covariance of two generic variables Y and Z measures how much correlation or association there is between these two variables. Where the sample contains the n individuals $a_1, \ldots, a_n$, it is defined as follows:

$$\mathrm{cov}(Y^n, Z^n) = \frac{\left(Y(a_1) - \overline{Y}^n\right) \cdot \left(Z(a_1) - \overline{Z}^n\right)}{n} + \cdots + \frac{\left(Y(a_n) - \overline{Y}^n\right) \cdot \left(Z(a_n) - \overline{Z}^n\right)}{n}$$

The covariance of height-in-centimeters and weight-in-kilograms in our sample of five humans is

$$\begin{aligned}
&\frac{(150 - 167.8) \cdot (55 - 72.6)}{5} + \frac{(163 - 167.8) \cdot (60 - 72.6)}{5} + \frac{(171 - 167.8) \cdot (90 - 72.6)}{5} \\
&\quad + \frac{(165 - 167.8) \cdot (70 - 72.6)}{5} + \frac{(190 - 167.8) \cdot (88 - 72.6)}{5} \\
&= \frac{313.28 + 60.48 + 55.68 + 7.28 + 341.88}{5} = 155.72
\end{aligned}$$

The fact that there is a correlation between two generic variables does not imply that there is causal relevance between these two variables (Hitchcock 2010). Among others, correlation
is a symmetric relation, but causal relevance is not. While there is no agreement on the precise relationship between correlation and causal relevance, height presumably is causally relevant to weight but not conversely. In general, the study of causal relevance in terms of correlation requires the concept of conditional correlation, or conditional probabilistic dependence, between random variables (Arntzenius 2010, Pearl 2009, Spirtes & Glymour & Scheines 2000).

We want to distinguish two relationships: the relationship of causal relevance between (singular or generic) variables and the relationship of causation between a (singular or generic) variable’s taking on a particular value (for a possible world or for an individual from the population, respectively). In the case of generic variables, causal relevance is a relationship between generic properties such as height and weight, whereas causation is a relationship between an individual’s instantiating or having two generic properties such as Al’s height of 163 cm and his weighing 60 kg. In the case of singular variables, causal relevance is a relationship between particularized properties or “tropes” (Maurin 2013) such as Al’s height and his weight, whereas causation is a relationship between propositions such as that Al is 163 cm tall and that he weighs 60 kg.

Sometimes statisticians draw a distinction between independent and dependent variables, as well as between explanatory and response variables. The idea is that one can repeatedly manipulate and control the former variables to check for a correlation with the latter. It is important to note that such repeated manipulations, or interventions (Woodward 2016), can be performed only on generic variables that are defined on a population from which one can draw samples. Such repeated manipulations cannot be performed on singular variables.
Conversely, a logically precise formulation of the propositions that stand in causal and probabilistic relationships requires the use of singular variables that are defined on a set
of possible worlds that are (logical) alternatives to each other. This is why we have stated the laws of large numbers in terms of singular variables representing individual experiment tokens that we thought of as repetitions of an experiment type. We will do the same for the central limit theorem below.

The concepts of mean, variance, and standard deviation can be defined for samples of size n as well as for the entire and possibly infinite population on which a generic variable is defined. In the latter case, the definitions are the same as before, provided that the probability that any one individual is selected is the same as the probability that any other individual is selected. Otherwise one has to weight the various contributions of these individuals not by 1/n, as we have done above when we divided the numbers $Y(a_i)$ and $\left(Y(a_i) - \overline{Y}^n\right)^2$, respectively, by n, but by their probability of being selected, $\Pr(\{a_i\})$. The mean now becomes the expected value (see, however, the caveat below), in analogy to the terminology for singular variables (see Section 10.5). Where $v_1, v_2, \ldots$ are the possible values of the generic variable Y, and $\Pr(Y = v_i)$ is shorthand for $\Pr(\{a \in P : Y(a) = v_i\})$,

$$\begin{aligned}
\mu(Y) &= Y(a_1) \cdot \Pr(\{a_1\}) + Y(a_2) \cdot \Pr(\{a_2\}) + \cdots \\
&= v_1 \cdot \Pr(Y = v_1) + v_2 \cdot \Pr(Y = v_2) + \cdots = \mathrm{Exp}(Y)
\end{aligned}$$

$$\begin{aligned}
\mathrm{Var}(Y) = \sigma^2(Y) &= \left(Y(a_1) - \mu(Y)\right)^2 \cdot \Pr(\{a_1\}) + \left(Y(a_2) - \mu(Y)\right)^2 \cdot \Pr(\{a_2\}) + \cdots \\
&= \left(v_1 - \mathrm{Exp}(Y)\right)^2 \cdot \Pr(Y = v_1) + \left(v_2 - \mathrm{Exp}(Y)\right)^2 \cdot \Pr(Y = v_2) + \cdots
\end{aligned}$$

The probability measure Pr is defined on an algebra over the entire population P. We assume this algebra to be the power set of P, and P to be finite. If P is infinite and countable (see
Chapter 11), the mean, or expected value, and variance may depend on the order in which the individuals in P are selected, and the possible values in V are enumerated. If P is infinite and uncountable, the range of the generic variable Y—that is, the set of values v in V such that Y assigns at least one individual a from P to v—may be uncountable as well. In this case, one has to work with a so-called probability density function and the so-called Lebesgue integral. Pr is not defined on an algebra over a set of possible worlds W. To indicate this, we will use subscripts below. If the population is finite, Pr determines the mean, variance, and standard deviation of every generic variable Y on this population as well as, for any value v, the probability Pr ({a ∈ P : Y (a) ≤ v}) = Pr (Y ≤ v). This probability represents the (weighted) proportion of individuals in the entire population to which the generic variable Y assigns a value that is smaller than or equal to v. Often one is interested in these probabilities for all values v of a random variable Y. Their specification is called a cumulative probability distribution. A cumulative probability distribution specifies the probable distribution of the values of a generic variable in a population. It makes essential reference to a random variable whose values are plotted on the x-axis. Their probabilities are plotted on the y-axis. (In the uncountable case, the cumulative probability distribution results by taking the Lebesgue integral of the probability density function. It measures the area underneath the graph of the probability density function.) If the population is finite, and each individual has the same probability of being selected, the cumulative probability of a value v is the proportion of individuals in the population to which the generic variable Y assigns a value that is not greater than v. 
Otherwise it represents the weighted proportion of these individuals, where the weight is the probability that one of these individuals is selected.
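For a finite population, the weighted definitions of mean, variance, and cumulative probability can be sketched in a few lines of Python. The example below treats the five humans from the earlier height sample as if they were the entire population, which is an assumption made only for illustration, and the unequal selection probabilities in `skewed` are invented. The function names are hypothetical.

```python
from fractions import Fraction


def mean(values, probs):
    """Population mean: each value weighted by its selection probability."""
    return sum(v * p for v, p in zip(values, probs))


def variance(values, probs):
    """Population variance: squared deviations weighted in the same way."""
    mu = mean(values, probs)
    return sum((v - mu) ** 2 * p for v, p in zip(values, probs))


def cumulative(values, probs, v):
    """Pr(Y <= v): the weighted proportion of individuals with value at most v."""
    return sum(p for y, p in zip(values, probs) if y <= v)


heights = [150, 163, 171, 165, 190]

# Equal selection probabilities 1/n reproduce the sample formulas.
uniform = [Fraction(1, 5)] * 5
print(float(mean(heights, uniform)), float(variance(heights, uniform)))
print(cumulative(heights, uniform, 165))

# Hypothetical unequal selection probabilities (they sum to one).
skewed = [Fraction(1, 2)] + [Fraction(1, 8)] * 4
print(float(mean(heights, skewed)), cumulative(heights, skewed, 165))
```

With equal weights the results agree with the sample mean 167.8 and sample variance 170.16 computed earlier; with unequal weights both the mean and the cumulative probabilities shift towards the individuals that are more likely to be selected.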

Famous cumulative probability distributions include the Bernoulli distribution (for random variables with two values), the exponential distribution, the uniform distribution (whose probability density function is flat-shaped), and, above all, the normal or Gauss(ian) distribution that is pictured below (and whose probability density function is bell-shaped). Different distributions are characterized by different parameters: The Bernoulli distribution is characterized by the probability of one of the two values, the exponential distribution is characterized by the mean, and the normal distribution is characterized by the mean and standard deviation.
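The cumulative probability distributions just listed have simple closed forms. The following Python sketch uses the standard textbook formulas; the function names are hypothetical, the Bernoulli variable is assumed to take the values 0 and 1, and the uniform distribution is parameterized by its two endpoints, a choice that is not stated in the text.

```python
import math


def bernoulli_cdf(x, p):
    """Bernoulli distribution on the values 0 and 1; p is the probability of 1."""
    if x < 0:
        return 0.0
    if x < 1:
        return 1.0 - p
    return 1.0


def exponential_cdf(x, mean):
    """Exponential distribution, characterized by its mean."""
    return 0.0 if x < 0 else 1.0 - math.exp(-x / mean)


def uniform_cdf(x, low, high):
    """Uniform distribution on the interval [low, high]."""
    if x <= low:
        return 0.0
    if x >= high:
        return 1.0
    return (x - low) / (high - low)


def normal_cdf(x, mean, sd):
    """Normal distribution, characterized by mean and standard deviation."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))


print(bernoulli_cdf(0, 0.3), exponential_cdf(2.0, 2.0))
print(uniform_cdf(0.25, 0.0, 1.0), normal_cdf(1.96, 0.0, 1.0))
```

Each function returns Pr(Y ≤ x) for the given parameters, so the parameters mentioned in the text are exactly the arguments that fix the distribution.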

10.8 THE CENTRAL LIMIT THEOREM

Much like the strong law of large numbers, the central limit theorem relates the mean in the first n repetitions of an experiment type to the expected value of these repetitions. The difference is that the focus is now on the probable distribution of the possible sample means and its relation to the expected value. The central limit theorem was proven as early as de Moivre (1718) and Laplace (1812). We will state the version due to Lindeberg (1920, 1922) and Lévy (1925).

Lindeberg-Lévy Central Limit Theorem

Let $X_1, X_2, \ldots$ be a sequence of random variables, or experiment tokens, whose domain is a non-empty set of possible worlds W and whose co-domain is the set of real numbers $\mathbb{R}$ so that these experiment tokens are repetitions of the same experiment type X. Let A be the super-sized algebra of propositions over W that is generated by these random variables when the algebra B over V is the set of Borel sets, and let Pr be a probability measure on A. A and Pr are assumed to satisfy the strengthened versions of steps 3 and 6, respectively, from Section 5.4. If $X_1, X_2, \ldots$ are independent and identically distributed in the sense of Pr, and their common variance is finite, $\mathrm{Var}(X_{n+1}) = \sigma(X_{n+1})^2 < \infty$, so that their common expected value is finite, too, $\mathrm{Exp}(X_{n+1}) < \infty$, then the difference between the mean $\overline{X}^n$ of the first n experiment tokens and the expected value $\mathrm{Exp}(X_{n+1})$, $\overline{X}^n - \mathrm{Exp}(X_{n+1})$, standardized by $\sigma(X_{n+1})/\sqrt{n}$ (that is, $Z_n = \frac{\sqrt{n} \cdot \left(\overline{X}^n - \mathrm{Exp}(X_{n+1})\right)}{\sigma(X_{n+1})}$), converges in distribution to the standard normal or Gauss(ian) distribution N(0, 1) with mean 0 and standard deviation 1.

As in the statement of the strong law of large numbers, we consider a sequence of singular variables $X_1, X_2, \ldots$ that are independent and identically distributed so that we can think of them as repetitions of an experiment type X. Since they are iid, all of these singular variables have the same expected value $\mathrm{Exp}(X_{n+1})$ and the same standard deviation $\sigma(X_{n+1})$. The mean for these singular variables has been defined in Section 10.5, and their variance and standard deviation are defined in the same way as the variance and standard deviation of generic variables, except that the probability measure $\Pr_W$ is defined on an algebra over a set W of possible worlds that are (logical) alternatives to each other (namely the super-sized algebra A from Section 10.4). Where $v_1, v_2, \ldots$ are the possible values of the singular variable $X_n$, and $\Pr(X_n = v_i)$ is shorthand for $\Pr_W(\{w \in W : X_n(w) = v_i\})$,

$$\mathrm{Var}(X_n) = \left(v_1 - \mathrm{Exp}(X_n)\right)^2 \cdot \Pr(X_n = v_1) + \left(v_2 - \mathrm{Exp}(X_n)\right)^2 \cdot \Pr(X_n = v_2) + \cdots, \qquad \sigma(X_n) = \sqrt{\mathrm{Var}(X_n)}$$

Each new repetition of the experiment type X generates a new sample of an ever larger size, namely the sample of all repetitions so far. After the first repetition, we consider the mean of the sample of size 1, $\overline{X}^1$. After the second repetition, we consider the mean of the sample of size 2, $\overline{X}^2$. After the n-th repetition, we consider the mean of the sample of size n, $\overline{X}^n$. Note that, in contrast to the expected value $\mathrm{Exp}(X_{n+1})$, these means of samples of size n do not make reference to the probability measure $\Pr_W$. Instead they are defined in terms of the relative frequencies of the possible outcomes $v_i$. The means in samples of size n, $\overline{X}^n$, are themselves singular variables with domain W and co-domain $\mathbb{R}$. Their expected values $\mathrm{Exp}(\overline{X}^n)$ equal the common expected value $\mathrm{Exp}(X_{n+1})$ of the singular variables $X_1, X_2, \ldots$. Thus, they are the same for all n. In contrast to this, their standard deviations $\sigma(\overline{X}^n)$ depend on n and equal $\sigma(X_{n+1})/\sqrt{n}$.

Now, at each point n we consider the difference between the mean $\overline{X}^n$ and the common expected value $\mathrm{Exp}(X_{n+1})$, $\overline{X}^n - \mathrm{Exp}(X_{n+1})$. In addition, we amplify this difference by dividing it by the standard deviation of the mean $\overline{X}^n$, $\sigma(X_{n+1})/\sqrt{n}$, to obtain the “standardized difference”

$$Z_n = \frac{\sqrt{n} \cdot \left(\overline{X}^n - \mathrm{Exp}(X_{n+1})\right)}{\sigma(X_{n+1})}.$$

These standardized differences $Z_n$ are again singular variables with domain W and co-domain $\mathbb{R}$. They are now the ones that converge “in distribution” to the standard normal distribution, that is, the normal distribution with mean 0 and standard deviation 1, which is also known as the z-distribution. Convergence in distribution means that the cumulative probability distribution $\Pr_W(Z_n \le z)$ converges to the cumulative probability distribution $\Phi$ of the standard normal distribution. That is, it holds for all real numbers z: $\lim_{n \to \infty} \Pr_W(Z_n \le z) = \Phi(z)$. This in turn means that for every real number z and every natural number k there is a natural number m such that, for every n that is at least as great as m, the difference between $\Pr_W(Z_n \le z)$
and $\Phi(z)$ is smaller than 1/k, $\left|\Pr_W(Z_n \le z) - \Phi(z)\right| < 1/k$. (In general, the definition of convergence in distribution refers only to those real numbers z at which the function to which convergence takes place is “continuous.” In the case of the function $\Phi$, these are all real numbers. Convergence in distribution is implied by convergence in probability and, hence, by almost sure and pointwise convergence.)

Here is a picture of the standard normal distribution N(0, 1) with mean 0, standard deviation 1, and cumulative probability distribution $\Phi(z) = \frac{1}{\sqrt{2 \cdot \pi}} \int_{-\infty}^{z} e^{-x^2/2} \, dx$ (and probability density function $\varphi(x) = \frac{1}{\sqrt{2 \cdot \pi}} e^{-x^2/2}$):

[Figure: graph of the probability density function $\varphi(x) = \frac{1}{\sqrt{2 \cdot \pi}} e^{-x^2/2}$ for x from −4 to 4, with vertical axis from 0 to 0.5. The areas under the graph are labeled, from left to right: 0.1%, 2.1%, 13.6%, 34.1%, 34.1%, 13.6%, 2.1%, 0.1%.]
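Convergence in distribution can be made concrete with an exact calculation. The following Python sketch assumes, purely for illustration, that the experiment tokens are iid tosses of a fair coin with heads coded as 1 and tails as 0, so that the expected value and the standard deviation both equal 1/2. It then compares $\Pr_W(Z_n \le z)$ with $\Phi(z)$ at z = 1 for growing n. The function names are hypothetical.

```python
from math import comb, erf, sqrt


def phi_cdf(z):
    """Cumulative probability distribution of the standard normal distribution."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))


def prob_z_at_most(n, z):
    """Exact Pr(Z_n <= z) for n iid tosses of a fair coin (heads = 1, tails = 0).

    With k heads the mean is k/n. Since Exp = 1/2 and sigma = 1/2,
    Z_n = sqrt(n) * (k/n - 1/2) / (1/2) = (2k - n) / sqrt(n).
    Each of the 2**n equally probable sequences counts once.
    """
    favourable = sum(comb(n, k) for k in range(n + 1) if 2 * k - n <= z * sqrt(n))
    return favourable / 2**n


for n in (25, 100, 2500):
    gap = abs(prob_z_at_most(n, 1.0) - phi_cdf(1.0))
    print(n, round(gap, 4))
```

The printed gaps shrink as n grows, which is what the definition requires: for any bound 1/k, the gap eventually stays below it.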

Sometimes the central limit theorem is paraphrased as saying that the sample mean converges in distribution to the population mean. This paraphrase is then used to justify estimates of the population mean $\mu(Y)$ of a generic variable Y on the basis of the sample mean $\overline{Y}^n$ in a sample of size n. This makes sense if the sample mean $\overline{Y}^n$ equals the sample mean $\overline{X}^n(@)$ of the first n repetitions $X_1, \ldots, X_n$ of some experiment type X in the actual world @, and if the probability that the i-th individual $a_i$ has a value that is not greater than v, $\Pr_W(\{w \in W : X_i(w) \le v\})$, equals the weighted proportion of individuals in the entire population P that have a value that is not greater than v, $\Pr_P(\{a \in P : Y(a) \le v\})$, that is, if it holds for all i and v: $\Pr_W(\{w \in W : X_i(w) \le v\}) = \Pr_P(\{a \in P : Y(a) \le v\})$. The latter implies that, for all i, $\mathrm{Exp}(X_i) = \mu(Y)$, where the expected value $\mathrm{Exp}(X_i)$ is calculated relative to $\Pr_W$ and the population mean $\mu(Y)$ is calculated relative to $\Pr_P$.

For purposes of illustration, suppose the generic variable Y is a-human’s-height-in-centimeters. The former assumption may be plausible if the singular variables $X_1, X_2, \ldots$ represent the experiment tokens of measuring $a_1$’s height-in-centimeters, $a_2$’s height-in-centimeters, and so on, and they are iid in the sense of the probability measure $\Pr_W$ that is defined on an algebra over a set of possible worlds W. The latter assumption is highly nontrivial, though, even if the sample is a random sample and every individual has the same probability of being selected. For it equates the probability that the i-th individual is at most v cm tall, $\Pr_W(\{w : X_i(w) \le v\})$, with the weighted proportion of humans in the entire population that are at most v cm tall, $\Pr_P(\{a \in P : Y(a) \le v\})$, and it does so for all individuals $a_i$ and all values v. In other words, the latter assumption equates the single-case probability that, say, Al is at most 163 cm tall with the statistical probability or proportion of humans that are at most 163 cm tall.

Why should this equation between single-case and statistical probabilities hold? It certainly does not hold on conceptual grounds, where this equation amounts to a conflation of single-case chances or degrees of belief on the one hand and actual or hypothetical (limiting) relative frequencies on the other hand. Furthermore, it forces one to address a version of the reference class problem: Which population P is one referring to, or referencing, when considering the single-case probability that Al is at most 163 cm tall? The population of humans in Austria in 2017, the population of humans who are alive in 2017, the population of all humans who ever have lived, the population of all humans who ever have lived or will live? The central limit theorem does not tell us.

The strong law of large numbers and central limit theorem establish a connection between relative frequencies and sample means on one side and chances (or degrees of belief) and expected values on the other. Without further assumptions, they do not establish a connection between sample means and population means. We want to keep this in mind when relying on these results in inferential statistics.

10.9 INFERENTIAL STATISTICS

Bracketing that expected values and population means do not coincide without further assumptions, let us look at how the central limit theorem is used to estimate population means on the basis of sample means. This is done by drawing a sample of size n and calculating the sample mean and sample standard deviation, as we have done with our sample of five humans whose mean height in centimeters $\overline{Y}^5$ was 167.8 and whose standard deviation $\sigma(Y^5)$ was $\sqrt{170.16}$. The latter number is used to estimate the population standard deviation by multiplying it by $\sqrt{\frac{5}{5-1}}$, which yields $\sqrt{170.16} \cdot \sqrt{1.25} \approx 14.58$. The general formula for this estimate $s_n(Y)$ of the population standard deviation $\sigma(Y)$ on the basis of the sample standard deviation in a sample of size n, $\sigma(Y^n)$, is $s_n(Y) = \sigma(Y^n) \cdot \sqrt{\frac{n}{n-1}}$. The reason we have to estimate the population standard deviation $\sigma(Y)$ is that we can only determine it with the population mean $\mu(Y)$, yet the latter is precisely the information we do not have, and ultimately want to estimate. As with any estimate, the real value $\sigma(Y)$ of the population
standard deviation may differ significantly from its estimated value $s_n(Y)$, especially when the sample size n is “small” (often this is taken to mean that n is smaller than 30). The “Bessel correction” $\sqrt{\frac{n}{n-1}}$ is intended to mitigate this risk, but it only goes so far.

Next we use the estimate of the population standard deviation $s_n(Y)$ to make an estimate of the population standard deviation of, not Y itself, but the mean of Y in a sample of size n, $\overline{Y}^{(n)}$. Note that $\overline{Y}^{(n)}$ is a generic variable with co-domain $\mathbb{R}$ whose domain is the set of all samples of size n (the n-fold Cartesian product of the population P if we sample “with replacement” so that individuals can be drawn repeatedly). Specifically, $\overline{Y}^{(n)}$ is distinct from $\overline{Y}^n = \overline{Y}[a_1, \ldots, a_n]$, which is a number, namely the mean of Y in a specific sample of size n, that is, $\overline{Y}^n$ is a possible value of the generic variable $\overline{Y}^{(n)}$.

Why do we consider the mean? As mentioned in the previous section, the means $\overline{X}^n$ of the first n repetitions of an experiment type are real-valued random variables if the individual repetitions $X_n$ are, and, importantly, they have the same expected value $\mathrm{Exp}(\overline{X}^n) = \mathrm{Exp}(X_n) = \mu(Y)$ (bracketing that expected value $\mathrm{Exp}(X_n)$ and population mean $\mu(Y)$ do not coincide without further assumptions). Since we are interested in estimating this expected value $\mathrm{Exp}(X_n)$ or population mean $\mu(Y)$, we can focus on the means $\overline{X}^n = \overline{Y}^{(n)}$. We want to do so because the means of these random variables are eventually distributed normally according to the central limit theorem, even if the random variables themselves are not. It is this information about the normal distribution of means that makes the central limit theorem so central to estimation: The assumption, underpinned by the central limit theorem, that the mean $\overline{Y}^{(n)}$ is distributed normally around the expected value or population mean $\mu(Y) = \mu(\overline{Y}^{(n)})$ allows us to determine the probability $\Pr(\{\overline{Y}^n\})$ of every possible
sample mean $\overline{Y}^n$, including the probability of the one sample mean $\overline{X}^n(@)$ that we actually obtained. On the basis of this information, we can then make an estimate about, or inference to, the population mean (see, however, the caveat about this inference mentioned below).

[Figure: the probability $\Pr(\{\overline{Y}^n\})$ (vertical axis) plotted against the possible values $\overline{Y}^n$ of the mean $\overline{Y}^{(n)}$ in a sample of size n (horizontal axis).]

The estimate of the population standard deviation of the mean in samples of size n, $\sigma(\overline{Y}^{(n)})$, on the basis of the sample standard deviation in a sample of size n, $\sigma(Y^n)$, equals

$$s_n\left(\overline{Y}^{(n)}\right) = \frac{s_n(Y)}{\sqrt{n}} = \frac{\sigma(Y^n) \cdot \sqrt{\frac{n}{n-1}}}{\sqrt{n}} = \frac{\sigma(Y^n)}{\sqrt{n-1}}.$$

In our example, we divide the sample standard deviation $\sqrt{170.16}$ by $\sqrt{5-1}$ to obtain $\sqrt{170.16}/\sqrt{4} \approx 6.52$. This estimate is called standard error.

The standard error can finally be used to estimate the population mean on the basis of Student’s (1908) t-distribution. The t-distribution approximates the standard normal or z-distribution, and converges to it if the sample size n increases without bound. We use the t-distribution instead of the z-distribution if the standard deviation $\sigma(\overline{Y}^{(n)})$ is not assumed to be given, but estimated by the standard error $s_n(\overline{Y}^{(n)})$. If the sample size n is small, as it is in our case, we additionally assume that not only the sample mean in samples of size n, $\overline{Y}^{(n)}$, but the random variable Y itself is distributed normally. Otherwise, not even the use of the t-distribution
instead of the z-distribution compensates for the small sample size and the potential error introduced in estimating the population standard deviation (of the mean). We also assume that the sample is a random sample so that the individuals are selected with equal probability, as well as independently. Without the latter independence assumption, the iid-condition of the central limit theorem is not met.¹

¹ At this point, one may wonder why not directly estimate the population mean in the same way we have directly estimated the population standard deviation, namely by taking the observed sample mean $\overline{Y}^n$ to be the estimate of the population mean $\mu(Y)$? This is indeed an option. It is called a point estimate, whereas we will make an interval estimate.

The result of this process of estimating the population mean on the basis of the sample mean in a sample of size n, $\overline{Y}^n$, is a confidence interval. A confidence interval for the population mean $\mu(Y)$ is an interval of the form:

$$\left[\overline{Y}^n - t_n \cdot s_n\left(\overline{Y}^{(n)}\right),\ \overline{Y}^n + t_n \cdot s_n\left(\overline{Y}^{(n)}\right)\right]$$

The t-value $t_n$ depends on the sample size n as well as the confidence level. For a 95%-confidence interval for samples of size five, it is 2.776. Therefore, the 95%-confidence interval for our population mean is approximately [149.7, 185.9]. If the sample size is not just five but sufficiently large so that t- and z-values are approximately equal, it is $\left[\overline{Y}^n - 1.96 \cdot s_n\left(\overline{Y}^{(n)}\right),\ \overline{Y}^n + 1.96 \cdot s_n\left(\overline{Y}^{(n)}\right)\right]$. One can, of course, also consider other confidence intervals, say the 90%- or 99%-confidence interval. Assuming a sufficiently large sample, the numbers in these cases are 1.645 and 2.576, respectively.

What’s more important to us than the particular values of the t- and z-distribution is to understand the reasoning behind the use of confidence intervals in the estimation of population means. For purposes of illustration, we will work with the 95%-confidence interval and a sufficiently large sample so that t- and z-value equal approximately 1.96. Assuming that the sample mean $\overline{Y}^{(n)}$ in samples of size n is distributed normally and its population standard deviation equals $\sigma(\overline{Y}^{(n)})$, it follows that the statistical probability that a sample of size n has a mean $\overline{Y}^n$ that lies within $1.96 \cdot \sigma(\overline{Y}^{(n)})$ of the expected value of the mean, $\mu(\overline{Y}^{(n)})$, equals 0.95:

$$\Pr\left(\left\{\overline{Y}^n : \mu\left(\overline{Y}^{(n)}\right) - 1.96 \cdot \sigma\left(\overline{Y}^{(n)}\right) \le \overline{Y}^n \le \mu\left(\overline{Y}^{(n)}\right) + 1.96 \cdot \sigma\left(\overline{Y}^{(n)}\right)\right\}\right) = 0.95$$

The sample mean $\overline{Y}^n$ is information we have, and the standard deviation $\sigma(\overline{Y}^{(n)})$ can be estimated by the standard error $s_n(\overline{Y}^{(n)})$. The quantity we are interested in is the expected value or population mean of Y, $\mu(Y)$. The latter equals the expected value of the sample mean $\overline{Y}^{(n)}$ in samples of size n, $\mu(\overline{Y}^{(n)})$. Hence:

    Pr Y n : μ (Y) − 1.96 · σ Y (n) ≤ Y n ≤ μ (Y) + 1.96 · σ Y (n) = 0.95

This says that, with a probability of 95%, the sample mean we have access to lies within a certain distance of the population mean we are interested in. Yet to say that the sample mean lies within a certain distance of the population mean is to say that the population mean lies within this distance of the sample mean. Mathematically we get this result as follows:   μ (Y) − 1.96 · σ Y (n) ≤ Y n ≤ μ (Y) + 1.96 · σ Y (n)   −1.96 · σ Y (n) − Y n ≤ −μ (Y) ≤ 1.96 · σ Y (n) − Y n | −μ (Y) − Y n   1.96 · σ Y (n) + Y n ≥ μ (Y) ≥ Y n − 1.96 · σ Y (n) | · (−1)


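It follows that

$$\Pr\left(\overline{Y}^{n} : \overline{Y}^{n} - 1.96 \cdot \sigma\left(\overline{Y}^{(n)}\right) \le \mu(Y) \le \overline{Y}^{n} + 1.96 \cdot \sigma\left(\overline{Y}^{(n)}\right)\right) = 0.95$$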

Here the quantity of interest is the population mean $\mu(Y)$, the quantity we can estimate is the standard deviation of the sample mean in samples of size $n$, $\sigma(\overline{Y}^{(n)})$, and the quantity we have access to is the observed sample mean $\overline{Y}^{n}$. It is tempting to paraphrase this result as follows: With a probability of 95%, the uncertain population mean lies within an estimable distance of the observed sample mean. However, this would be a mistake, for we have not established that

$$\Pr\left(\mu(Y) : \overline{Y}^{n} - 1.96 \cdot \sigma\left(\overline{Y}^{(n)}\right) \le \mu(Y) \le \overline{Y}^{n} + 1.96 \cdot \sigma\left(\overline{Y}^{(n)}\right)\right) = 0.95$$
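The repeated-sampling reading of this result can be illustrated by simulation. The following Python sketch is not part of the original text: it assumes a normally distributed population with the made-up values `MU = 170.0` and `SIGMA = 10.0`, treats the standard deviation as known, and uses the hypothetical helper name `coverage`. It draws many samples of size $n$ and counts how often the interval around the observed sample mean contains the fixed population mean.

```python
import math
import random


def coverage(mu, sigma, n, trials, seed=0):
    """Fraction of repeated samples whose 95% interval contains mu.

    mu and sigma describe a hypothetical normal population; the interval
    uses the known standard deviation of the mean, sigma / sqrt(n).
    """
    rng = random.Random(seed)
    half_width = 1.96 * sigma / math.sqrt(n)
    hits = 0
    for _ in range(trials):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        mean = sum(sample) / n
        # mu is fixed; only the sample mean varies from trial to trial.
        if mean - half_width <= mu <= mean + half_width:
            hits += 1
    return hits / trials


MU, SIGMA = 170.0, 10.0  # illustrative values, not taken from the text
rate = coverage(MU, SIGMA, n=25, trials=10_000)
print(round(rate, 3))  # close to 0.95; the exact value depends on the seed
```

What varies across the trials is the sample mean, not $\mu(Y)$, which is why the 95% attaches to the sampling procedure rather than to any hypothesis about the population mean.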

The classical or frequentist statistician only determines the probability that the observed sample mean falls within a given interval. She does not also determine the probability that the population mean falls within this, or any other, interval. Only sample means and other sample parameters are assigned a probability; hypotheses about the population mean and other population parameters are not. On one view, the reason is that probabilities exist only for (propositions about) event types that actually are—or hypothetically could be—instantiated repeatedly, such as sample means which can be instantiated repeatedly by sampling repeatedly. Single-case probabilities for (propositions about) event types that are and could be instantiated only one single time, such as the population mean which is what it is, do not exist. Alternatively, we may want to test a hypothesis about the population mean or some other population parameter. To do so, we have to come up with a hypothesis, the so-called null hypothesis, that is specific enough to determine the probability of every possible value of this parameter in samples of a given


size. If we want to test a hypothesis about the population mean and assume the latter to be distributed normally, it suffices that the null hypothesis specifies a unique population mean. If the sample one observes is too improbable, the null hypothesis is rejected, and, provided it is specified, the alternative hypothesis is accepted. Minimally, the alternative hypothesis says that the null hypothesis is not true, but in general, it is much more specific. There are two mistakes one can make in testing a null hypothesis. One can reject the null hypothesis when it is true, and one can fail to reject the null hypothesis when it is false. The former mistake is called a type I error, the latter a type II error. The significance level α of a test is the probability of a type I error, that is, the probability of rejecting the null hypothesis given that it is true. The power 1 − β of a test is related to the probability of avoiding a type II error. It is the probability of rejecting the null hypothesis given that the alternative hypothesis is true. The latter probability is defined if the alternative hypothesis, too, is specific enough to determine the probability of every possible value of the parameter of interest in samples of a given size. Merely taking the alternative hypothesis to be the negation of the null hypothesis does not guarantee this. The specification of the alternative hypothesis is optional. If it is specified, one compares the probability of the observed value of the sample parameter on the null hypothesis to the probability of the observed value of the sample parameter on the alternative hypothesis. The ratio of these two probabilities is the likelihood-ratio. 
For a null and alternative hypothesis that are both simple (that is, sufficiently specific to determine all possible population parameters so that the cumulative probability distribution is uniquely specified), the likelihood-ratio determines the most powerful test for a given significance level (Neyman and Pearson 1933). To illustrate, suppose we are looking at a screen that displays the numerals ‘1’ and ‘0.’ We wonder whether the


numbers 1 and 0 convey any meaning, perhaps a secret code or whether somebody has won \$1 or \$0. Our null hypothesis is that no meaning is conveyed and that the occurrence of these numbers is random so that the numbers occur independently and the probability of 1 as well as the probability of 0 is 1/2. We test the null hypothesis at the 80%-significance level. This means we have to specify the 80%-confidence interval for the null hypothesis. If we observe a result that falls outside this interval, the null hypothesis gets rejected at the 80%-significance level. Otherwise it does not get rejected. We do not specify an alternative hypothesis, so nothing gets accepted.

The possible sample outcomes are sequences of 0s and 1s of length four. They determine the possible sample means: There is one sequence with four 0s whose mean is 0, there are four sequences with one 1 and three 0s whose mean is 1/4, there are six sequences with two 1s and two 0s whose mean is 1/2, there are four sequences with three 1s and one 0 whose mean is 3/4, and there is one sequence with four 1s whose mean is 1. According to the null hypothesis, the 1s and 0s are independent, and they have the same probability of occurring. Therefore, the probabilities for these possible sample means are as follows: 1/16 for a mean of 0, 4/16 for a mean of 1/4, 6/16 for a mean of 1/2, 4/16 for a mean of 3/4, and 1/16 for a mean of 1. The population mean $\mu(Y)$ and, hence, the expected value of the mean in samples of size four, $\mu(\overline{Y}^{(4)})$, is 1/2.

In general, we would now calculate the standard error $\sigma(\overline{Y}^{(4)})$ and use it and the t-value to determine the 80%-confidence interval around the population mean $\mu(Y)$. However, since our null hypothesis is sufficiently specific to allow us to calculate the probability of every possible sample outcome and mean, we do not need to use the t-distribution in testing it.
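The probabilities just listed can be checked by brute-force enumeration. The sketch below is an illustration rather than part of the original text; the function name `mean_distribution` and its parameter `p_one` are hypothetical. It lists every sequence of 0s and 1s of a given length, groups the sequences by their mean, and adds up their probabilities under the null hypothesis.

```python
from fractions import Fraction
from itertools import product


def mean_distribution(n, p_one=Fraction(1, 2)):
    """Map each possible sample mean to its probability under the null
    hypothesis that the digits are independent with Pr(1) = p_one."""
    dist = {}
    for outcome in product((0, 1), repeat=n):
        ones = sum(outcome)
        prob = p_one ** ones * (1 - p_one) ** (n - ones)
        mean = Fraction(ones, n)
        dist[mean] = dist.get(mean, Fraction(0)) + prob
    return dist


dist4 = mean_distribution(4)
for mean in sorted(dist4):
    print(mean, dist4[mean])  # fractions are printed in lowest terms

# Probability of a sample mean within 1/4 of the population mean 1/2.
within = sum(
    p for m, p in dist4.items() if abs(m - Fraction(1, 2)) <= Fraction(1, 4)
)
print(within)  # 7/8
```

Exact fractions are used instead of floating-point numbers so that the results can be compared directly with the values 1/16, 4/16, and 6/16 given above.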
In this example, we reject the null hypothesis at the 80%-significance level if we observe a sequence of four 0s


or a sequence of four 1s. In all other cases, we do not reject the null hypothesis at this significance level. The reason is that the sample means of 0000 and 1111 (that is, 0 and 1) are the only ones that fall outside the 80%-confidence interval. We get this result as follows. On the null hypothesis, the statistical probability that we observe a sample mean $\overline{Y}^{4}$ within 1/4 of the population mean $\mu(Y) = 1/2$ equals 87.5%, but the statistical probability that we observe a sample mean within less than 1/4 of the population mean is at most 37.5%, and hence smaller than 80%:

$$\Pr\left(\overline{Y}^{4} : 0.25 \le \overline{Y}^{4} \le 0.75\right) = 14/16 = 0.875$$

$$\Pr\left(\overline{Y}^{4} : 0.25 < \overline{Y}^{4} < 0.75\right) = \Pr\left(\overline{Y}^{4} : \overline{Y}^{4} = 1/2\right) = 6/16 = 0.375$$

Now suppose the possible sample outcomes are sequences of 0s and 1s of length three. In this case, there is one sequence with three 0s whose mean is 0, there are three sequences with one 1 and two 0s whose mean is 1/3, there are three sequences with two 1s and one 0 whose mean is 2/3, and there is one sequence with three 1s whose mean is 1. According to the null hypothesis, the 1s and 0s are independent and equally probable. Therefore, the probabilities for these possible sample means are as follows: 1/8 for a mean of 0, 3/8 for a mean of 1/3, 3/8 for a mean of 2/3, and 1/8 for a mean of 1. As before, the population mean $\mu(Y)$ and, hence, the expected value of the mean in samples of size three, $\mu(\overline{Y}^{(3)})$, is 1/2. The smallest interval around the population mean that has a probability of at least 0.8 is the entire interval [0, 1] of all possible sample means. Every proper subinterval of [0, 1] that is symmetric around the population mean has a probability of at most 0.75 on the null hypothesis, as the probability of the interval (0, 1) equals 0.75 (it just contains the two means 1/3 and 2/3 whose probabilities sum to 6/8). Since no possible sample mean


falls outside the interval [0, 1], our experiment does not test the null hypothesis severely enough for us to possibly reject it at the 80%-significance level. In particular, we do not reject the null hypothesis if we observe the sample outcome 111 with mean 1. Assume we observe the sample outcome 111 with mean 1. This time, however, suppose our stopping rule is the following. We check the first number. If it is 0, we stop the experiment. If it is 1, we continue. If the second number is 0, we stop the experiment. If it is 1, we continue to the third number and then stop the experiment. A stopping rule is a protocol for performing an experiment. In the previous case, the stopping rule was to observe a sequence of 1s and 0s of length three, which resulted in eight possible sample outcomes and four possible sample means. This time the possible sample outcomes are 0, 10, 110, and 111. The possible sample means now are 0, 1/2, 2/3, and 1, and they occur with a probability of 1/2, 1/4, 1/8, and 1/8, respectively. The probability that we observe a sample mean within the interval [0, 2/3] is now 0.875, and so greater than 0.8. This means that we can reject the null hypothesis at the 80%-significance level if we observe the sample outcome 111 with mean 1, even though the probability of this sample outcome and its mean is the same as before, namely 1/8! What is decisive for the rejection of the null hypothesis in this case is the probability of other, merely possible sample outcomes and means that have not been observed. Likelihoodists (Hacking 1965, Edwards 1972/1992, Royall 1997) think this is a problem. According to them, the likelihood or probability of the data on the hypothesis should determine the impact these data have on the hypothesis; the probability of merely possible data that turn out to be false should not matter. Let us consider one more test involving two generic variables. 
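The effect of the stopping rule discussed above can also be checked by enumeration. The following sketch is illustrative only; the names `stop_at_first_zero` and `sample_mean` are hypothetical. It lists the outcomes that are possible when we stop after the first 0 or after three digits, together with their sample means and their probabilities on the null hypothesis.

```python
from fractions import Fraction

HALF = Fraction(1, 2)


def stop_at_first_zero(max_len):
    """Outcomes when we stop after the first 0 or after max_len digits,
    with their probabilities under the null hypothesis Pr(1) = Pr(0) = 1/2."""
    outcomes = {}
    for k in range(max_len):
        # k ones followed by a single 0 ends the experiment early.
        outcomes["1" * k + "0"] = HALF ** (k + 1)
    outcomes["1" * max_len] = HALF ** max_len
    return outcomes


def sample_mean(outcome):
    """Mean of a sequence of digits written as a string such as '110'."""
    return Fraction(outcome.count("1"), len(outcome))


outcomes = stop_at_first_zero(3)
for outcome, prob in outcomes.items():
    print(outcome, sample_mean(outcome), prob)

# Probability of a sample mean in [0, 2/3], that is, of anything but 111.
inside = sum(p for o, p in outcomes.items() if sample_mean(o) <= Fraction(2, 3))
print(inside)  # 7/8
```

The outcome 111 has probability 1/8 under both protocols; what changes with the stopping rule is only the set of unobserved alternatives and hence the probability of the interval that excludes 111.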
A teacher holds a midterm exam and notices five students who have not attended class before. She is interested in the effect that attending class has on students’ grades,


so asks these five students to participate in an experiment. The students agree and attend class for the rest of the term. The teacher’s null hypothesis is that attending class makes no difference to the students’ grades. After the final exam, the teacher has two samples of the same size, namely the five students’ scores on the midterm and final exams. These two samples can be merged into a new sample of the same size whose values show the difference in the students’ scores as follows.

| | Midterm exam M | Final exam F | Difference D |
|---|---|---|---|
| Student 1 | 61 | 87 | 26 |
| Student 2 | 91 | 99 | 8 |
| Student 3 | 88 | 96 | 8 |
| Student 4 | 44 | 82 | 38 |
| Student 5 | 57 | 77 | 20 |

Note that the number 61 in the row for student 1 and the column for the midterm exam is shorthand for $M(\text{student 1}) = 61$, and similarly for the other numbers. The null hypothesis says that attending class does not make a difference, $\mu(D) = 0$. The observed sample mean $\overline{D}^{5}$ is

$$\frac{26 + 8 + 8 + 38 + 20}{5} = 20,$$

and the sample standard deviation $\sigma(D_5)$ is

$$\sqrt{\frac{(26-20)^2 + (8-20)^2 + (8-20)^2 + (38-20)^2 + (20-20)^2}{5}} = \frac{\sqrt{648}}{\sqrt{5}}.$$

The t-score counts how many standard errors

$$s_5\left(\overline{D}^{(5)}\right) = \frac{\sqrt{5}}{\sqrt{5-1}} \cdot \frac{\sqrt{648}}{\sqrt{5}} \cdot \frac{1}{\sqrt{5}} = \sqrt{32.4}$$

the observed sample mean is away from the population mean according to the null hypothesis:

$$\frac{\overline{D}^{5} - \mu(D)}{s_5\left(\overline{D}^{(5)}\right)} = \frac{20 - 0}{\sqrt{32.4}} \approx 3.514$$
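The arithmetic can be retraced step by step. The sketch below is an illustration with hypothetical variable names; it recomputes the differences, the sample standard deviation (dividing by $n$), the estimated population standard deviation (dividing by $n - 1$), the standard error, and the t-score from the scores in the table.

```python
import math

midterm = [61, 91, 88, 44, 57]
final = [87, 99, 96, 82, 77]
diffs = [f - m for m, f in zip(midterm, final)]

n = len(diffs)
mean = sum(diffs) / n
sum_sq = sum((d - mean) ** 2 for d in diffs)

sample_sd = math.sqrt(sum_sq / n)               # sigma(D_5), divides by n
estimated_pop_sd = math.sqrt(sum_sq / (n - 1))  # s_5(D), divides by n - 1
standard_error = estimated_pop_sd / math.sqrt(n)

mu_null = 0  # population mean according to the null hypothesis
t_score = (mean - mu_null) / standard_error

print(diffs)              # [26, 8, 8, 38, 20]
print(mean)               # 20.0
print(round(t_score, 3))  # 3.514
```

The critical values themselves (such as 2.776) are not computed here; they come from a table of the t-distribution.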


This means that the teacher can reject the null hypothesis at the 95%-significance level, for which the t-score has to be at least 2.776 for a sample of size five. She cannot reject the null hypothesis at the 98%-significance level, though, for the t-score is less than 3.747. This would be different if the teacher’s null hypothesis had been that attending class makes no positive difference to the students’ grades, $\mu(D) \le 0$. In this latter case, she could reject the null hypothesis at the 98%-significance level because the t-score for such a “one-sided” or “one-tailed” test at the $(100 - n)$%-significance level equals the t-score for a “two-sided” or “two-tailed” test at the $(100 - 2 \cdot n)$%-significance level.

As in the case of confidence intervals, the classical or frequentist statistician does not determine the probability of the null hypothesis. She only determines the probability of the possible sample outcomes or means. In contrast to this, the Bayesian statistician assigns subjective probabilities both to the possible sample outcomes or means and to statistical hypotheses about the population mean and other population parameters. The Bayesian statistician can do so because she assigns a prior subjective probability to the various statistical hypotheses about the population mean. These prior subjective probabilities are then updated in light of the observed sample outcome or mean along the lines of Sections 9.3 and 10.6.

As the preceding sections have made clear, there are a lot of assumptions that enter statistical inference. The design of experiments is supposed to render some of these assumptions plausible. However, even if it succeeds, it is an art rather than a science, and other assumptions remain. Our job as philosophers is to point these assumptions out, not in a destructive effort to criticize for the sake of criticizing, but in a constructive effort to improve our fellow scientists’ arguments. Any conclusion can be obtained from flawed premises. Therefore, we need to make sure that we are aware of all assumptions we make.


10.10 EXERCISES

Exercise 46: You have a sample of five students whose scores on the final exam are 77, 46, 88, 90, and 77 out of 100 points. Determine the mode(s), median, and mean of this sample.

Exercise 47: You have a sample of seven students who did not complete their homework assignments prior to the midterm exam, but who did complete their homework assignments for the rest of the term after the midterm exam. These students’ scores are as follows:

| | Midterm M | Final F |
|---|---|---|
| Student 1 | 60 | 88 |
| Student 2 | 81 | 95 |
| Student 3 | 87 | 97 |
| Student 4 | 54 | 82 |
| Student 5 | 67 | 87 |
| Student 6 | 91 | 90 |
| Student 7 | 64 | 84 |

Calculate the sample mean, sample variance, and sample standard deviation for these two samples.

Exercise 48: Continuing Exercise 47, we are interested in the difference between the grades of the seven students in the midterm and final exams. To this end, we test the null hypothesis that completing the homework assignments does not make a (positive or negative) difference $D$ to the students’ grades, $\mu(D) = 0$. The two samples we have give us a new sample $D_7$ of the same size for the difference $D$ that completing the homework assignments makes.


First determine the new sample outcome $D_7$ as well as the new sample mean $\overline{D}^{7}$. Next, calculate the sample standard deviation $\sigma(D_7)$. Now use the latter to estimate the population standard deviation and the population standard deviation of the mean of samples of size seven, that is, determine the estimated population standard deviation $s_7(D)$, as well as the standard error $s_7(\overline{D}^{(7)})$. Finally, test the null hypothesis that completing the homework assignments makes no difference, $\mu(D) = 0$, at the 99%-significance level. For samples of size seven, the t-score for such a (two-sided) test at the 99%-significance level is 3.707.

Exercise 49: You have a random sample of forty humans. Seven have no sibling, eighteen have one sibling, twelve have two siblings, two have three siblings, and one has five siblings. Estimate the population mean or expected value $\mu(S)$ of the generic variable a-human’s-number-of-siblings $S$ by calculating the 99%-confidence interval for $\mu(S)$. The t-value for the 99%-confidence interval for samples of size 40 is 2.708.

Exercise 50: To celebrate the end of the term, you and your friends consider going to an elusive club to which entry seems to be granted randomly. Since you want to avoid waiting in line in vain, you test the statistical hypothesis that entry $E$ is granted randomly. The sample you have consists of five party-goers who have been selected randomly; that is, independently and with equal probability. If a party-goer is admitted to the club, the result is 1; otherwise it is 0. The null hypothesis you want to test says that, for every party-goer $p$, the chance of entry equals the chance of denial of entry, $\Pr(E(p) = 1) = \Pr(E(p) = 0) = 1/2$.

First, determine all possible sample outcomes and all possible sample means, as well as the probability of the possible sample means on the null hypothesis.


Second, determine the population mean $\mu(E)$, which is equal to the expected value of the mean in samples of size five, $\mu(\overline{E}^{(5)})$, as well as the population variance of the mean in samples of size five, $\mathrm{Var}(\overline{E}^{(5)})$, and the standard error $\sigma(\overline{E}^{(5)})$.

Third, test the null hypothesis at the 90%-significance level and decide if it gets rejected at this level when the observed sample outcome is 11111 so that the observed sample mean is 1. The t-value for such a test at the 90%-significance level for samples of size five is 2.132, but you do not have to use it.

Fourth, test the null hypothesis at the 90%-significance level and decide if it gets rejected at this level when the stopping rule for the experiment is to check whether four randomly (that is, independently and with equal probability) selected party-goers are admitted and the observed sample outcome is 1111 so that the observed sample mean is 1. The t-value for such a test at the 90%-significance level for samples of size four is 2.352, but you do not have to use it.

Fifth, determine whether the previous verdict on the null hypothesis changes if the observed sample outcome and sample mean are the same (that is, 1111 and 1, respectively), but the stopping rule for the experiment is as follows: Observe up to four randomly selected party-goers, but stop after the first 0 is observed.

READINGS

The recommended readings for Chapter 10 include:

Reichenbach, Hans (1938), Experience and Prediction. An Analysis of the Foundations and the Structure of Knowledge. Chicago: University of Chicago Press. §§38–40 (339–363).


and perhaps also

Romeijn, Jan-Willem (2014), Philosophy of Statistics. In E.N. Zalta (ed.), Stanford Encyclopedia of Philosophy.

A very accessible introduction to classical statistics is:

Burdess, Neil (2010), Starting Statistics: A Short, Clear Guide. Los Angeles: Sage Publications.

CHAPTER 11

Alternative Approaches to Induction

11.1 FORMAL LEARNING THEORY

Besides the probability calculus, there are several other approaches to inductive reasoning that also work by assigning numbers to propositions. These include the theory of Dempster-Shafer belief functions briefly mentioned in Section 8.3, possibility theory (Dubois & Prade 1988), and ranking theory (Spohn 2012). Halpern’s Reasoning about Uncertainty (2003) provides an excellent overview.

A genuine alternative to these approaches is formal learning theory, which was developed in Kelly’s The Logic of Reliable Inquiry (1996) (see also Martin & Osherson 1998). Instead of assigning numbers to propositions representing how strongly the propositions should be believed, formal learning theory considers the (objective) reliability with which a certain rule, such as the principle of universal induction or the straight(-forward) rule, gives the correct answer to a given question. The main idea is very simple: A rule is permissible in answering a question only if the rule reliably gives the correct answer to the question or no rule does so.


bijective function from the first set to the second set, namely the function that pairs Abelard with Heloise, Barack with Michelle, and Sheherazade with King Shahryar. One set is the set of natural numbers {0, 1, 2, . . .}, and a set that has as many elements as it is called countably infinite. For instance, the set of even natural numbers {0, 2, 4, . . .} is countably infinite because there is a bijective function from it to the set of natural numbers, namely: 0 is paired with 0, 2 is paired with 1, 4 is paired with 2, and in general the even number n is paired with the natural number n/2. The set of even natural numbers has as many elements as the set of natural numbers, even though the former is a proper subset of the latter. For finite sets, this cannot happen, but for infinite sets it does. In fact, a set can be defined to be infinite if, and only if, it has as many elements as one or more of its proper subsets. A set that is finite or countably infinite is countable. A set that is not countable is uncountable. The set of real numbers is uncountable. So is the power set of the set of natural numbers, which has as many elements as the set of real numbers. Cantor’s (1878) continuum hypothesis says that there is no infinity between this infinity and countable infinity. Now, the power set of a set always has more elements than the set it is a power set of. In particular, this is true for infinite sets, which implies that there are infinitely many infinities. First, there is the infinity of the set of natural numbers, viz. countable infinity. Next there is the infinity of the power set of the set of natural numbers. Then there is the infinity of the power set of the power set of the set of natural numbers, and then the infinity of the power set of this power set, and so on. According to the generalized continuum hypothesis, there are no further infinities in between these infinities (Koellner 2013). 
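The pairing of even numbers with natural numbers can be spelled out on a finite stretch of both sets. The sketch below is an illustration only; the function names `to_natural`, `to_even`, and `power_set` are hypothetical. A program can of course only inspect finitely many pairs, so it illustrates the bijection without proving anything about the infinite sets. The last lines show the finite analogue of the claim that a power set is larger than the set it is a power set of.

```python
from itertools import combinations


def to_natural(even):
    """Pair the even number n with the natural number n / 2."""
    return even // 2


def to_even(natural):
    """Inverse pairing: the natural number k goes back to 2 * k."""
    return 2 * natural


evens = [0, 2, 4, 6, 8, 10]
print([(e, to_natural(e)) for e in evens])
# The two maps undo each other, so no two even numbers share a partner.
print(all(to_even(to_natural(e)) == e for e in evens))


def power_set(items):
    """All subsets of a finite collection."""
    return [set(c) for r in range(len(items) + 1) for c in combinations(items, r)]


# A finite set with k elements has a power set with 2 ** k elements.
print(len(power_set([0, 1, 2])))  # 8
```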
To introduce the terminology of formal learning theory, we only need the concepts of finiteness and countable infinity. $e = \langle e_1, \ldots, e_n \rangle$ is a finite sequence of $n$ data points. $e$ may


be the finite sequence that consists of the following four pieces of information: $e = \langle a_1 \text{ is black}, a_1 \text{ is a raven}, a_2 \text{ is white}, a_2 \text{ is a raven} \rangle$. In contrast to this, $\varepsilon$ is a countably infinite sequence of data points that is best thought of as the observable correlate of a possible world. $\varepsilon$ may be a countably infinite sequence of data points saying of various objects that they are ravens and what color they are: $\langle a_1 \text{ is black}, a_1 \text{ is a raven}, a_2 \text{ is black}, a_2 \text{ is a raven}, \ldots \rangle$. As $\varepsilon$ consists of bits of information, we will call it “empirical possibility.” This understanding of $\varepsilon$ is also the reason for the clumsy formulation in the examples from the beginning of this chapter. The motivation for assuming $\varepsilon$ to be countably infinite is that we will never receive more than finitely many pieces of information, though any finite number is possible. The information received is assumed to be true, and we will additionally assume that the ideal cognitive agent does not forget any of the information she ever receives.

$\varepsilon \mid n$ is the finite initial segment of the empirical possibility $\varepsilon$ of length $n$. In our example, $\varepsilon \mid 3 = \langle a_1 \text{ is black}, a_1 \text{ is a raven}, a_2 \text{ is black} \rangle$. The empirical content of a hypothesis $H$ is the set of empirical possibilities $\varepsilon$ in which $H$ is true. It is also denoted by ‘$H$,’ $H = \{\varepsilon : H \text{ is true in } \varepsilon\}$. Similarly for the empirical content of the background assumptions $K$, $K = \{\varepsilon : K \text{ is true in } \varepsilon\}$. The empirical content of a hypothesis is called decidable, verifiable, or falsifiable if, and only if, the question whether it is true is decidable, verifiable, or falsifiable, respectively.

Now we can finally say how formal learning theory characterizes an inductive method, namely as a function $\delta$ whose domain is the set $\mathbf{e}$ of all finite sequences of data points $e$ and whose co-domain is the set $\mathbf{H}$ of the empirical contents of all hypotheses, the set of all “empirical hypotheses” $H$. $\delta$ conjectures empirical hypotheses in response to finitely many data points, much like the principle of universal induction or the straight(-forward) rule output hypotheses in response to


finitely many bits of information. An inductive method $\delta$ is said to stabilize to the empirical hypothesis $H$ on the empirical possibility $\varepsilon$ if, and only if, there is a natural number $m$ such that for all natural numbers $n \ge m$: $\delta(\varepsilon \mid n) = H$.

A question $Q$ is (represented as) a partition of the set of empirical possibilities, where each cell of the partition characterizes one of the possible answers to the question. For instance, the question in which Canadian city with more than 1 million inhabitants I reside in 2017 is (represented as) the partition whose first cell consists of all empirical possibilities in which I reside in Calgary, whose second cell consists of all empirical possibilities in which I reside in Montréal, and whose third cell consists of all empirical possibilities in which I reside in Toronto. Each cell contains many different empirical possibilities, as there are many different addresses in each of these three cities at which I could reside.

An inductive problem is a question $Q$ given some background assumptions $K$, $\langle Q, K \rangle$. For each empirical possibility $\varepsilon$ in $K$, there exists exactly one empirical hypothesis $H$ in $Q$ that is true in $\varepsilon$. This is the correct answer to the question, and we denote it by ‘$H(\varepsilon)$.’ Continuing the above example, if we receive the information that I live on Bloor Street in Toronto, the correct answer in the actual empirical possibility is Toronto. This would also be the correct answer if we received the information that I live on University Avenue in Toronto. An inductive method $\delta$ solves an inductive problem $\langle Q, K \rangle$ if, and only if, for each empirical possibility $\varepsilon$ in $K$: $\delta$ stabilizes to $H(\varepsilon)$. This means that, no matter which of the empirical possibilities is the actual one, the inductive method eventually gives the correct answer, and continues to do so forever after, though without necessarily signaling that it has arrived at the correct answer.
Finally, we say that an inductive problem is solvable if, and only if, some inductive method solves it.
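These definitions can be made concrete with a toy inductive method. The sketch below is an illustration under simplifying assumptions, not Kelly's formalism: data points are represented as hypothetical `(name, kind, color)` triples, the question has just the two answers `ALL_BLACK` and `NOT_ALL_BLACK`, and the empirical possibility is produced by a made-up generator. Since a program can only inspect a finite initial segment, the final check shows the conjecture settling on that segment; stabilization proper is a claim about the entire infinite sequence.

```python
from itertools import islice

ALL_BLACK = "all ravens are black"
NOT_ALL_BLACK = "not all ravens are black"


def delta(e):
    """Inductive method: map a finite sequence of data points to a conjecture.

    Each data point is a (name, kind, color) triple, a simplified stand-in
    for the pieces of information in the text.
    """
    for _name, kind, color in e:
        if kind == "raven" and color != "black":
            return NOT_ALL_BLACK
    return ALL_BLACK


def conjectures(epsilon, length):
    """delta applied to the initial segments epsilon | 1, ..., epsilon | length."""
    prefix = list(islice(epsilon, length))
    return [delta(prefix[:n]) for n in range(1, length + 1)]


def world_with_white_raven():
    """Hypothetical empirical possibility in which the third raven is white."""
    i = 1
    while True:
        yield (f"a{i}", "raven", "white" if i == 3 else "black")
        i += 1


history = conjectures(world_with_white_raven(), 6)
print(history)
# From n = 3 onward the conjecture no longer changes on this prefix.
print(len(set(history[2:])) == 1)
```

For this particular method the settling is not an accident of the prefix: once a non-black raven appears in the data, every longer initial segment contains it, so the conjecture can never revert.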


Against this background we can now state one of Kelly’s theorems:

Kelly’s theorem. An inductive problem $\langle Q, K \rangle$ with countable question $Q = \{H_1, H_2, \ldots\}$ and background assumptions $K$ saying that exactly one of the answers is correct, $K = H_1 \cup H_2 \cup \ldots$, is solvable if, and only if, each possible answer $H$ in $Q$ is a countable union of falsifiable empirical hypotheses $F_i$, $H = F_1 \cup F_2 \cup \ldots$.

(Genin & Kelly 2017 extend Kelly’s theorem to statistical hypotheses.) A corollary of Kelly’s theorem is that the strong law of large numbers cannot be strengthened from holding for all possible worlds $w$ in some proposition $A$ whose probability equals one to holding for all possible worlds. The relative frequencies in iid experiments converge to the chances only with chance one, or almost surely; they do not also converge to the chances surely, that is, with logical (or even just mathematical) necessity.

11.2 PUTNAM’S ARGUMENT

In “Degree of Confirmation” and Inductive Logic (1963), Putnam argues against Carnap’s inductive logic by showing that not every “adequate” inductive method can be represented by a Carnapian confirmation function or conditional logical probability. For the logical probability to be adequate is for it to be defined on a formal language that is sufficiently rich (it needs to be sufficiently rich to describe arithmetic, that is, the theory of natural numbers, as well as the ordering induced by


space-time, but these details do not matter for our purposes). In addition, the logical probability needs to satisfy the following condition: The instance confirmation of every true and effective hypothesis eventually becomes greater than a half, and remains so forever after.

A hypothesis $h$ in the sufficiently rich formal language is said to be effective if, and only if, $h$ provably says of each individual constant ‘$a_i$’ whether or not $a_i$ has a given property $M$, and it says nothing else. (Provably means that, if true, there exists a proof of $h \rightarrow M(a_i)$ in the sufficiently rich language.) The $n$-th instance confirmation of the universally quantified hypothesis $\forall x (M(x))$ is the conditional logical probability that the $(n+1)$-st individual $a_{n+1}$ has property $M$ given that the first $n$ individuals $a_1, \ldots, a_n$ all have property $M$, $\Pr(M(a_{n+1}) \mid M(a_1) \wedge \cdots \wedge M(a_n))$.

The focus on instance confirmations is a concession to Carnap whose inductive logic has the unwelcome consequence that universally quantified hypotheses have logical probability zero, and so $\Pr(\forall x (M(x)) \mid M(a_1) \wedge \cdots \wedge M(a_n)) = 0$ (this problem was later fixed by Hintikka 1966). The idea behind Putnam’s concession is that, despite this, one accepts the universally quantified hypothesis $\forall x (M(x))$ if its instance confirmation is greater than a half, so that one eventually accepts it, and continues to do so forever after, if there is a natural number $m$ such that for all natural numbers $n \ge m$: $\Pr(M(a_{n+1}) \mid M(a_1) \wedge \cdots \wedge M(a_n)) > 1/2$.

Putnam’s condition implies that for all natural numbers $m$ and predicates ‘$M$’ there provably is a natural number $n$ such that ($\pm M$ stands for either $M$ or $\neg M$):

$$\Pr\left(M(a_{m+n+1}) \mid \pm M(a_1) \wedge \cdots \wedge \pm M(a_m) \wedge M(a_{m+1}) \wedge \cdots \wedge M(a_{m+n})\right) > 1/2$$


Putnam then shows that no logical probability on a sufficiently rich language has this property if it satisfies the above condition. This means there is no logical probability on a sufficiently rich language that satisfies Putnam’s condition and leads one to accept every true and effective hypothesis after finitely many steps and forever after (in the sense that for every true and effective hypothesis there is some point from which on this hypothesis is accepted). However, as Putnam also shows, there is a non-probabilistic inductive method that leads one to accept every true and effective hypothesis after finitely many steps and forever after (in the same sense as above). The upshot of this is the following: Carnap’s inductive logic, instead of enabling one to learn hypotheses, turns out to prevent one from learning hypotheses that can be learned! Putnam’s argument is a classical argument in the tradition of formal learning theory: An inductive method that prevents one from learning hypotheses that can be learned is not permissible. Previously we said that a rule is permissible in answering a question if, and only if, the rule answers the question as reliably as the question can be answered, that is, at least as reliably as every rule. Given this terminology, what Putnam shows is that no logical probability is permissible in answering all of the following questions: Is a given effective hypothesis from the sufficiently rich language true? The reason is that no logical probability answers all of these questions as reliably as Putnam’s non-probabilistic rule. It is to be noted, though, that Putnam assumes—with Carnap, but without contemporary Bayesian confirmation theorists—that the probability measure has to be “computable;” that is, the question what probability a sentence has must be a decidable question. In Bayes or Bust? A Critical Examination of Bayesian Confirmation Theory (1992), Earman conjectures that, when this assumption is dropped, every inductive problem


that can be answered at all can be answered by a probabilistic inductive method. Juhl (1996) proves this conjecture to be true for certain special cases. In concluding, let us return to Reichenbach’s vindication of induction. In the terminology of formal learning theory, it shows that the straight(-forward) rule is permissible in answering all of the following questions: What is the limit of the relative frequency of a given event type in a given sequence of event tokens? The reason is that every inductive problem of this sort that can be solved at all, that is, that is solved by some rule conjecturing limits, is solved by the straight(-forward) rule. Unlike Carnap’s inductive logic, the straight(-forward) rule does not prevent one from learning hypotheses that can be learned.
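The behavior of the straight(-forward) rule on a sequence whose relative frequency has a limit can be sketched in a few lines. The example below is an illustration, assuming the usual formulation of the rule (conjecture that the limiting relative frequency equals the relative frequency observed so far); the function name `straight_rule` and the repeating sequence 1, 1, 0 are made up for the purpose.

```python
from fractions import Fraction
from itertools import cycle, islice


def straight_rule(tokens, event):
    """Conjecture that the limiting relative frequency of `event` equals
    its relative frequency among the tokens observed so far."""
    return Fraction(sum(1 for t in tokens if t == event), len(tokens))


# Hypothetical sequence repeating 1, 1, 0 forever; its limiting relative
# frequency of 1s is 2/3.
sequence = cycle([1, 1, 0])
observed = list(islice(sequence, 300))

for n in (1, 2, 10, 100, 300):
    print(n, straight_rule(observed[:n], 1))
```

On this sequence the conjectures approach 2/3 without ever having to equal it at any particular stage, which matches the sense of solving a problem in the limit: the rule is not required to signal when it has arrived at the correct answer.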

READINGS

The recommended readings for Chapter 11 include:

Schulte, Oliver (2012), Formal Learning Theory. In E.N. Zalta (ed.), Stanford Encyclopedia of Philosophy.

and perhaps also

Putnam, Hilary (1963), “Degree of Confirmation” and Inductive Logic. In P.A. Schilpp (ed.), The Philosophy of Rudolf Carnap. La Salle, IL: Open Court, 761–783.

Advanced texts that continue and go beyond the material in this book are:

Earman, John (1992), Bayes or Bust? A Critical Examination of Bayesian Confirmation Theory. Cambridge, MA: MIT Press.
Hájek, Alan, & Hitchcock, Christopher (2016), The Oxford Handbook of Probability and Philosophy. Oxford: Oxford University Press.
Halpern, Joseph Y. (2003), Reasoning about Uncertainty. Cambridge, MA: MIT Press.
Hawthorne, James (2012), Inductive Logic. In E.N. Zalta (ed.), Stanford Encyclopedia of Philosophy.
Howson, Colin, & Urbach, Peter (2005), Scientific Reasoning: The Bayesian Approach. 3rd ed. La Salle, IL: Open Court.
Huber, Franz (2016), Formal Representations of Belief. In E.N. Zalta (ed.), Stanford Encyclopedia of Philosophy.
Huber, Franz, & Schmidt-Petri, Christoph (2009, eds.), Degrees of Belief. Synthese Library 342. Dordrecht: Springer.
Pettigrew, Richard, & Weisberg, Jonathan (forthcoming), The Open Handbook of Formal Epistemology.

REFERENCES

Armendt, Brad (1980), Is There a Dutch Book Argument for Probability Kinematics? Philosophy of Science 47, 583–588.
Arntzenius, Frank (2010), Reichenbach’s Common Cause Principle. In E.N. Zalta (ed.), Stanford Encyclopedia of Philosophy.
Bayes, Thomas, & Price, Richard (1763), An Essay towards Solving a Problem in the Doctrine of Chances. By the Late Rev. Mr. Bayes, F. R. S. Communicated by Mr. Price, in a Letter to John Canton, A. M. F. R. S. Philosophical Transactions 53, 370–418.
Beisbart, Claus, & Hartmann, Stephan (2011, eds.), Probability in Physics. Oxford: Oxford University Press.
Ben-Menahem, Yemima, & Hemmo, Meir (2012, eds.), Probability in Physics. The Frontiers Collection. Dordrecht: Springer.
Bernoulli, Jakob (1713), Ars Conjectandi. Basel: Impensis Thurnisiorum, Fratrum.
Bertrand, Joseph (1889), Calcul des Probabilités. Paris: Gauthier-Villars.
Borel, Émile F. (1909), Les Probabilités Dénombrables et leurs Applications Arithmétiques. Rendiconti del Circolo Matematico di Palermo 27, 247–271.
Born, Max (1954), The Statistical Interpretation of Quantum Mechanics. Nobel Lecture.
Brier, Glenn W. (1950), Verification of Forecasts Expressed in Terms of Probability. Monthly Weather Review 78, 1–3.

Briggs, Rachael (2010), The Metaphysics of Chance. Philosophy Compass 5, 938–952.
Briggs, Rachael (2014), Normative Theories of Rational Choice: Expected Utility. In E.N. Zalta (ed.), Stanford Encyclopedia of Philosophy.
Burdess, Neil (2010), Starting Statistics: A Short, Clear Guide. Los Angeles: Sage Publications.
Callender, Craig (2016), Thermodynamic Asymmetry in Time. In E.N. Zalta (ed.), Stanford Encyclopedia of Philosophy.
Cantor, Georg (1874), Ueber eine Eigenschaft des Inbegriffes aller reellen algebraischen Zahlen. Journal für die reine und angewandte Mathematik 77, 258–262.
Cantor, Georg (1878), Ein Beitrag zur Mannigfaltigkeitslehre. Journal für die reine und angewandte Mathematik 84, 242–258.
Carnap, Rudolf (1934), Logische Syntax der Sprache. Vienna: Springer.
Carnap, Rudolf (1950/1962), Logical Foundations of Probability. 2nd ed. Chicago: University of Chicago Press.
Carnap, Rudolf (1952), The Continuum of Inductive Methods. Chicago: University of Chicago Press.
Carnap, Rudolf (1963), Replies and Systematic Expositions. Probability and Induction. In P.A. Schilpp (ed.), The Philosophy of Rudolf Carnap. La Salle, IL: Open Court, 966–998.
Christensen, David (1999), Measuring Confirmation. Journal of Philosophy 96, 437–461.
Cox, Richard T. (1946), Probability, Frequency and Reasonable Expectation. American Journal of Physics 14, 1–13.
Creath, Richard (2011), Logical Empiricism. In E.N. Zalta (ed.), Stanford Encyclopedia of Philosophy.
Crupi, Vincenzo (2015), Confirmation. In E.N. Zalta (ed.), Stanford Encyclopedia of Philosophy.
Dea, Shannon (2012), Continental Rationalism. In E.N. Zalta (ed.), Stanford Encyclopedia of Philosophy.
de Finetti, Bruno (1937), La Prévision: Ses Lois Logiques, Ses Sources Subjectives. Annales de l’Institut Henri Poincaré 7, 1–68. Engl. Transl. by H.E. Kyburg, Jr. as “Foresight: Its Logical Laws, Its Subjective Sources.” In H.E. Kyburg, Jr., & H.E. Smokler (1964, eds.), Studies in Subjective Probability. New York: Wiley, 93–158.
de Finetti, Bruno (1970), Teoria delle probabilità (vol. I, II). Torino: Einaudi.

de Moivre, Abraham (1718), The Doctrine of Chances. London: W. Pearson.
Dempster, Arthur P. (1967), Upper and Lower Probabilities Induced by a Multivalued Mapping. The Annals of Mathematical Statistics 38, 325–339.
Dempster, Arthur P. (1968), A Generalization of Bayesian Inference. Journal of the Royal Statistical Society (Series B, Methodological) 30, 205–247.
diFate, Victor (2016), Evidence. In J. Fieser & B. Dowden (eds.), Internet Encyclopedia of Philosophy.
Dubois, Didier, & Prade, Henri (1988), Possibility Theory. An Approach to Computerized Processing of Uncertainty. New York: Plenum.
Duhem, Pierre (1914/1991), The Aim and Structure of Physical Theory. Transl. by P.P. Wiener. Princeton: Princeton University Press.
Eagle, Antony (2012), Chance versus Randomness. In E.N. Zalta (ed.), Stanford Encyclopedia of Philosophy.
Earman, John (1992), Bayes or Bust? A Critical Examination of Bayesian Confirmation Theory. Cambridge, MA: MIT Press.
Easwaran, Kenny (2011), Bayesianism. Philosophy Compass 6, 312–320, 321–332.
Edwards, A.W.F. (1972/1992), Likelihood. 2nd ed. Baltimore, MD: Johns Hopkins University Press.
Edwards, Ward & Lindman, Harold & Savage, Leonard J. (1963), Bayesian Statistical Inference for Psychological Research. Psychological Review 70, 193–242.
Einstein, Albert (1915), Erklärung der Perihelbewegung des Merkur aus der allgemeinen Relativitätstheorie. Königlich Preußische Akademie der Wissenschaften (Berlin). Sitzungsberichte (1915), 831–839.
Euclid (BCE/1926), The Thirteen Books of Euclid’s Elements. Ed. by T.L. Heath. 3 vols. 2nd ed. Cambridge, UK: Cambridge University Press.
Fisher, Ronald A. (1925), Statistical Methods for Research Workers. Edinburgh: Oliver and Boyd.
Fisher, Ronald A. (1930), The Genetical Theory of Natural Selection. Oxford: Clarendon Press.
Fisher, Ronald A. (1935), The Design of Experiments. Edinburgh: Oliver and Boyd.

Fitelson, Branden (1999), The Plurality of Bayesian Measures of Confirmation and the Problem of Measure Sensitivity. Philosophy of Science 66, S362–S378.
Fitelson, Branden (2006), The Paradox of Confirmation. Philosophy Compass 1, 95–113.
Frege, Gottlob (1893/1903), Grundgesetze der Arithmetik. Band I/II. Jena: Verlag Hermann Pohle.
Frigg, Roman (2008), A Field Guide to Recent Work on the Foundations of Statistical Mechanics. In D. Rickles (ed.), The Ashgate Companion to Contemporary Philosophy of Physics. London: Ashgate, 99–196.
Frigg, Roman, & Berkovitz, Joseph & Kronz, Fred (2016), The Ergodic Hierarchy. In E.N. Zalta (ed.), Stanford Encyclopedia of Philosophy.
Gaifman, Haim (1988), A Theory of Higher Order Probabilities. In B. Skyrms & W.L. Harper (eds.), Causation, Chance, and Credence. Vol. I. Dordrecht: Kluwer, 191–219.
Genin, Konstantin, & Kelly, Kevin T. (2017), The Topology of Statistical Verifiability. EPTCS 51, 236–250.
Gettier, Edmund L. (1963), Is Justified True Belief Knowledge? Analysis 23, 121–123.
Glymour, Clark (1980), Theory and Evidence. Princeton: Princeton University Press.
Goodman, Nelson (1954/1983), Fact, Fiction, Forecast. 4th ed. Cambridge, MA: Harvard University Press.
Haack, Susan (1976), The Justification of Deduction. Mind 85, 112–119.
Hacking, Ian (1965), Logic of Statistical Inference. Cambridge, UK: Cambridge University Press.
Hacking, Ian (2001), An Introduction to Probability and Inductive Logic. Cambridge: Cambridge University Press.
Hájek, Alan (2011), Interpretations of Probability. In E.N. Zalta (ed.), Stanford Encyclopedia of Philosophy.
Hájek, Alan, & Hitchcock, Christopher (2016), The Oxford Handbook of Probability and Philosophy. Oxford: Oxford University Press.
Halpern, Joseph Y. (2003), Reasoning about Uncertainty. Cambridge, MA: MIT Press.
Hawthorne, James (2012), Inductive Logic. In E.N. Zalta (ed.), Stanford Encyclopedia of Philosophy.

Healey, Richard (2016), Quantum-Bayesian and Pragmatist Views of Quantum Theory. In E.N. Zalta (ed.), Stanford Encyclopedia of Philosophy.
Hempel, Carl G. (1945), Studies in the Logic of Confirmation. Mind 54, 1–26, 97–121.
Hempel, Carl G. (1950), Problems and Changes in the Empiricist Criterion of Meaning. Revue Internationale de Philosophie 41, 41–63.
Hitchcock, Christopher (2010), Probabilistic Causation. In E.N. Zalta (ed.), Stanford Encyclopedia of Philosophy.
Hintikka, Jaakko (1961), Knowledge and Belief. An Introduction to the Logic of the Two Notions. Ithaca, NY: Cornell University Press.
Hintikka, Jaakko (1966), A Two-Dimensional Continuum of Inductive Methods. In J. Hintikka & P. Suppes (eds.), Aspects of Inductive Logic. Amsterdam: North-Holland Publishing, 113–132.
Hoefer, Carl (2016), Causal Determinism. In E.N. Zalta (ed.), Stanford Encyclopedia of Philosophy.
Howson, Colin, & Urbach, Peter (2005), Scientific Reasoning: The Bayesian Approach. 3rd ed. La Salle, IL: Open Court.
Huber, Franz (2007), Confirmation and Induction. In J. Fieser & B. Dowden (eds.), Internet Encyclopedia of Philosophy.
Huber, Franz (2016), Formal Representations of Belief. In E.N. Zalta (ed.), Stanford Encyclopedia of Philosophy.
Huber, Franz, & Schmidt-Petri, Christoph (2009, eds.), Degrees of Belief. Synthese Library 342. Dordrecht: Springer.
Hume, David (1739/1896), A Treatise of Human Nature. Ed. by L.A. Selby-Bigge. Oxford: Clarendon Press.
Hume, David (1748/1993), An Enquiry Concerning Human Understanding. Ed. by E. Steinberg. Indianapolis: Hackett.
Ismael, Jenann (2015), Quantum Mechanics. In E.N. Zalta (ed.), Stanford Encyclopedia of Philosophy.
Jaynes, Edwin T. (1957), Information Theory and Statistical Mechanics. The Physical Review 106, 620–630; 108, 171–190.
Jeffrey, Richard C. (1965/1983), The Logic of Decision. 2nd ed. Chicago: University of Chicago Press.
Joyce, James M. (1999), The Foundations of Causal Decision Theory. Cambridge, UK: Cambridge University Press.
Joyce, James M. (2003), Bayes’ Theorem. In E.N. Zalta (ed.), Stanford Encyclopedia of Philosophy.

Joyce, James M. (2009), Accuracy and Coherence: Prospects for an Alethic Epistemology of Partial Belief. In F. Huber & C. Schmidt-Petri (eds.), Degrees of Belief. Synthese Library 342. Dordrecht: Springer, 263–297.
Juhl, Cory (1996), Objectively Reliable Subjective Probabilities. Synthese 109, 293–309.
Kant, Immanuel (1781), Critik der reinen Vernunft. Riga: Johann Friedrich Hartknoch.
Kelly, Kevin T. (1996), The Logic of Reliable Inquiry. Oxford: Oxford University Press.
Keynes, John M. (1921), A Treatise on Probability. London: Macmillan.
Keynes, John M. (1923), A Tract on Monetary Reform. London: Macmillan.
Klement, Kevin C. (2016a), Propositional Logic. In J. Fieser & B. Dowden (eds.), Internet Encyclopedia of Philosophy.
Klement, Kevin C. (2016b), Russell’s Paradox. In J. Fieser & B. Dowden (eds.), Internet Encyclopedia of Philosophy.
Koellner, Peter (2013), The Continuum Hypothesis. In E.N. Zalta (ed.), Stanford Encyclopedia of Philosophy.
Kolmogoroff, Andrej N. (1933), Grundbegriffe der Wahrscheinlichkeitsrechnung. Berlin: Springer.
Krantz, David H., Luce, Duncan R., Suppes, Patrick, & Tversky, Amos (1971), Foundations of Measurement (Volume 1). New York: Academic Press.
Kullback, Solomon, & Leibler, Richard A. (1951), On Information and Sufficiency. The Annals of Mathematical Statistics 22, 79–86.
Laplace, Pierre Simon (1812), Théorie Analytique des Probabilités. Paris: Courcier.
Laplace, Pierre Simon (1814), Essai Philosophique sur les Probabilités. Paris: Courcier.
Le Verrier, Urbain (1859), Lettre de M. Le Verrier à M. Faye sur la théorie de Mercure et sur le mouvement du périhélie de cette planète. Comptes rendus hebdomadaires des séances de l’Académie des sciences (Paris) 49, 379–383.
Lévy, Paul (1925), Calcul des Probabilités. Paris: Gauthier-Villars.
Lewis, David K. (1980), A Subjectivist’s Guide to Objective Chance. In R.C. Jeffrey (ed.), Studies in Inductive Logic and Probability. Vol. II. Berkeley: University of California Press, 263–293. Reprinted with Postscripts in D. Lewis (1986), Philosophical Papers. Vol. II. Oxford: Oxford University Press, 83–132.
Lewis, David K. (1999), Why Conditionalize? In D. Lewis (1999), Papers in Metaphysics and Epistemology. Cambridge: Cambridge University Press, 403–407.
Lindeberg, Jarl W. (1920), Über das Exponentialgesetz in der Wahrscheinlichkeitsrechnung. Annales Academiae Scientiarum Fennicae, Series A, 16, 1–23.
Lindeberg, Jarl W. (1922), Eine neue Herleitung des Exponentialgesetzes in der Wahrscheinlichkeitsrechnung. Mathematische Zeitschrift 15, 211–225.
Lindley, Dennis V. (1965), Introduction to Probability and Statistics from a Bayesian Viewpoint (part 1, 2). Cambridge: Cambridge University Press.
Markie, Peter (2013), Rationalism vs. Empiricism. In E.N. Zalta (ed.), Stanford Encyclopedia of Philosophy.
Martin, Eric, & Osherson, Daniel (1998), Elements of Scientific Inquiry. Cambridge, MA: MIT Press.
Maurin, Anna-Sofia (2013), Tropes. In E.N. Zalta (ed.), Stanford Encyclopedia of Philosophy.
Mayo, Deborah G. (1996), Error and the Growth of Experimental Knowledge. Chicago: University of Chicago Press.
McConnell, Terrance (2014), Moral Dilemmas. In E.N. Zalta (ed.), Stanford Encyclopedia of Philosophy.
Menzel, Christopher (2013), Possible Worlds. In E.N. Zalta (ed.), Stanford Encyclopedia of Philosophy.
Milne, Peter (1996), log [Pr (H | E ∧ B) / Pr (H | B)] is the One True Measure of Confirmation. Philosophy of Science 63, 21–26.
Newton, Isaac (1687), Philosophiæ Naturalis Principia Mathematica. London: Jussu Societatis Regiæ ac Typis Joseph Streater.
Neyman, Jerzy, & Pearson, Egon S. (1933), On the Problem of the Most Efficient Tests of Statistical Hypotheses. Philosophical Transactions of the Royal Society of London, Series A, 231, 289–337.
Neyman, Jerzy, & Pearson, Egon S. (1967), Joint statistical papers by J. Neyman and E.S. Pearson. Cambridge, UK: Cambridge University Press.

Nicod, Jean (1930), Foundations of Geometry and Induction. Transl. by P.P. Wiener. London: Routledge and Kegan Paul Ltd.
Nolt, John (2014), Free Logic. In E.N. Zalta (ed.), Stanford Encyclopedia of Philosophy.
Nozick, Robert (1969), Newcomb’s Problem and Two Principles of Choice. In N. Rescher (ed.), Essays in Honor of Carl G. Hempel. Dordrecht: Reidel, 114–146.
Papineau, David (2012), Philosophical Devices: Proofs, Probabilities, Possibilities, and Sets. Oxford: Oxford University Press.
Pathria, R.K., & Beale, Paul D. (2011/1973), Statistical Mechanics. 3rd ed. Burlington, MA: Academic Press.
Pearl, Judea (2009), Causality: Models, Reasoning, and Inference. 2nd ed. Cambridge, UK: Cambridge University Press.
Peterson, Martin (2009), An Introduction to Decision Theory. Cambridge, UK: Cambridge University Press.
Pettigrew, Richard (2015), Epistemic Utility Arguments for Probabilism. In E.N. Zalta (ed.), Stanford Encyclopedia of Philosophy.
Pettigrew, Richard, & Weisberg, Jonathan (forthcoming), The Open Handbook of Formal Epistemology.
Plato (BCE/1997), Complete Works. Ed. by J.M. Cooper. Indianapolis: Hackett.
Popper, Karl R. (1935/2002), The Logic of Scientific Discovery. London, New York: Routledge.
Popper, Karl R. (1955), Two autonomous axiom systems for the calculus of probabilities. British Journal for the Philosophy of Science 6, 51–57.
Popper, Karl R. (1957), The Propensity Interpretation of the Calculus of Probability and the Quantum Theory. In S. Körner (ed.), Observation and Interpretation: a Symposium of Philosophers and Physicists. London: Butterworths, 65–70.
Putnam, Hilary (1963), “Degree of Confirmation” and Inductive Logic. In P.A. Schilpp (ed.), The Philosophy of Rudolf Carnap. La Salle, IL: Open Court, 761–783.
Quine, Willard V.O. (1940), Mathematical Logic. Cambridge, MA: Harvard University Press.
Quine, Willard V.O. (1951), Two Dogmas of Empiricism. The Philosophical Review 60, 20–43.

Ramsey, Frank P. (1926), Truth and Probability. In Ramsey, Frank P. (1931), The Foundations of Mathematics and Other Logical Essays. Ed. by R.B. Braithwaite. London: Kegan, Paul, Trench, Trubner & Co., New York: Harcourt, Brace, and Company, 156–198.
Reichenbach, Hans (1938), Experience and Prediction. An Analysis of the Foundations and the Structure of Knowledge. Chicago: University of Chicago Press.
Rényi, Alfred (1955), On a New Axiomatic System for Probability. Acta Mathematica Academiae Scientiarum Hungaricae 6, 285–335.
Romeijn, Jan-Willem (2014), Philosophy of Statistics. In E.N. Zalta (ed.), Stanford Encyclopedia of Philosophy.
Royall, Richard (1997), Statistical Evidence: A Likelihood Paradigm. London: Chapman and Hall.
Russell, Bertrand (1902), Letter to Frege. In J. van Heijenoort (1967, ed.), From Frege to Gödel. Cambridge, MA: Harvard University Press, 124–125.
Savage, Leonard J. (1954), The Foundations of Statistics. New York: Wiley.
Schulte, Oliver (2012), Formal Learning Theory. In E.N. Zalta (ed.), Stanford Encyclopedia of Philosophy.
Seidenfeld, Teddy (1979), Philosophical Problems of Statistical Inference: Learning from R.A. Fisher. Theory and Decision Library 22. Dordrecht: D. Reidel.
Shafer, Glenn (1976), A Mathematical Theory of Evidence. Princeton: Princeton University Press.
Shannon, Claude E. (1948), A Mathematical Theory of Communication. The Bell System Technical Journal 27, 379–423, 623–656.
Shapiro, Stewart (2013), Classical Logic. In E.N. Zalta (ed.), Stanford Encyclopedia of Philosophy.
Sklar, Lawrence (2015), Philosophy of Statistical Mechanics. In E.N. Zalta (ed.), Stanford Encyclopedia of Philosophy.
Skyrms, Brian (1966/2000), Choice and Chance: An Introduction to Inductive Logic. 4th ed., Belmont, CA: Wadsworth Thomson Learning.
Skyrms, Brian (1987), Dynamic Coherence and Probability Kinematics. Philosophy of Science 54, 1–20.

Spirtes, Peter, Glymour, Clark, & Scheines, Richard (2000), Causation, Prediction, and Search. 2nd ed. Cambridge, MA: MIT Press.
Spohn, Wolfgang (2012), The Laws of Belief. Ranking Theory and its Philosophical Applications. Oxford: Oxford University Press.
Sprenger, Jan (2011), Hypothetico-Deductive Confirmation. Philosophy Compass 6, 497–508.
Steele, Katie, & Orri Stefánsson, Hlynur (2015), Decision theory. In E.N. Zalta (ed.), Stanford Encyclopedia of Philosophy.
Steinhart, Eric (2009), More Precisely: The Math You Need to Do Philosophy. Peterborough: Broadview Press.
Student (1908), The Probable Error of a Mean. Biometrika 6, 1–25.
Talbott, William (2008), Bayesian Epistemology. In E.N. Zalta (ed.), Stanford Encyclopedia of Philosophy.
Teller, Paul (1973), Conditionalization and Observation. Synthese 26, 218–258.
Uebel, Thomas (2006), Vienna Circle. In E.N. Zalta (ed.), Stanford Encyclopedia of Philosophy.
Uffink, Jos (2014), Boltzmann’s Work in Statistical Physics. In E.N. Zalta (ed.), Stanford Encyclopedia of Philosophy.
van Fraassen, Bas C. (1984), Belief and the Will. Journal of Philosophy 81, 235–256.
Venn, John (1866), The Logic of Chance. An Essay on the Foundations and Province of the Theory of Probability, With Especial Reference to Its Logical Bearings and its Application to Moral and Social Science, and to Statistics. London: Macmillan.
Venn, John (1880), On the Diagrammatic and Mechanical Representation of Propositions and Reasonings. Philosophical Magazine and Journal of Science (Fifth Series) 10, 1–18.
Vickers, John (2014), The Problem of Induction. In E.N. Zalta (ed.), Stanford Encyclopedia of Philosophy.
Vineberg, Susan (2016), Dutch Book Arguments. In E.N. Zalta (ed.), Stanford Encyclopedia of Philosophy.
von Kries, Johannes (1886), Die Principien der Wahrscheinlichkeitsrechnung. Eine logische Untersuchung. Freiburg: J.C.B. Mohr.
von Mises, Richard (1928), Wahrscheinlichkeit, Statistik und Wahrheit. Wien: Julius Springer.

Weirich, Paul (2016), Causal decision theory. In E.N. Zalta (ed.), Stanford Encyclopedia of Philosophy.
Weisberg, Jonathan (2015), Formal Epistemology. In E.N. Zalta (ed.), Stanford Encyclopedia of Philosophy.
Wittgenstein, Ludwig (1921), Logisch-Philosophische Abhandlung. Annalen der Naturphilosophie 14, 185–262.
Woodward, James F., Causation and Manipulability. In E.N. Zalta (ed.), Stanford Encyclopedia of Philosophy.
Zach, Richard (2016), Sets, Logic, Computation. An Open Logic Text. Calgary: University of Calgary.

INDEX

accuracy argument, gradational, 8.3
acts. See decision theory, Bayesian
additivity. See probability measure
algebra of propositions, 79
analysis, conceptual, 47–48
argument, 11
average, ensemble vs. time, 192, 194–195, 209
Bayes, Thomas. See Bayes’ theorem; decision theory, Bayesian; probability, interpretation of: subjective
Bayes’ theorem, 89–90
Bernoulli, Jakob, 226, 238. See also indifference, principle of
Bertrand, Joseph. See under paradox
betting ratio, 140–141, 143
Borel set, 104, 223

Carnap, Rudolf, 167, 193, 204
  measure, 156
  objections to, 7.5, 11.2
  See also concepts, qualitative, comparative, and quantitative; explication, conceptual; probability, interpretation of: logical
case, logically possible, 11, 65, 73, 111. See also world, possible
causation, 41, 63, 189, 209, 235
central limit theorem, 10.8
chance, 10.6. See also under probability, interpretation of
  propensity vs. best system interpretation of, 189
comprehension principle, restricted vs. unrestricted, 26–28
concepts, qualitative, comparative, and quantitative, 37–38, 99–100

condition of adequacy, 49–50, 55–57
conditional
  counterfactual, 6, 209
  material (see logic: propositional)
conditionalization
  Jeffrey, 171
  strict, 169
conditional probability, 85
confidence interval, 246
confirmation, 36–38, 40, 59, 63–64, 69, 113
  incremental vs. absolute, 7.2, 155–156
consequence
  empirically testable, 66–67
  logical, 11–12, 54
countability, 92, 104, 139, 218, 223, 236–337, 261–263, 265
criterion, 54, 64, 67–68
  Nicod’s, 50
curve-fitting, 135

decision theory, Bayesian, 8.6
  causal vs. evidential, 177–180
de Finetti, Bruno, 231. See also Dutch book argument
degree of belief, 139–140, 10.6
description, state vs. structure, 7.1
dilemma for deduction, Haack’s, 130–131, 206
distribution, 238
  cumulative probability, 237–238
  standard normal, 240
  t-, 245
  See also under figure(s)
Dutch book argument, 8.2
entropy, 108–109
equivalence, 12, 49, 212
error
  standard, 245, 247, 250, 253, 256–257
  type I vs. type II, 249
estimate, 243–244
estimators, 212, 224, 226
event, 36, 75, 171, 207, 248
evidence, 36, 124
  problem of old, 167
expected value of an experiment token, 222–224, 238–239
explication, conceptual, 48
extensionality, principle of, 22
falsifiability, 67–68
figure(s)
  of accuracy argument, 153, 155
  of Bertrand’s paradox, 102–103
  of binomial distribution, 245
  of new riddle of induction, 135
  of standard normal or Gauss(ian) distribution, 241
  of Venn diagrams, 23–24
Fisher, Ronald A., 40, 231
frequency, (limiting) relative. See under probability, interpretation of
function, 79, 216, 261
generalization, principles of existential and universal, 18–19

Glymour, Clark, 167, 235
Goodman, Nelson, 7.4–7.5, 204
Haack, Susan. See dilemma for deduction, Haack’s
Hempel, Carl G., 47, 64, 69, 7.3. See also condition of adequacy; criterion: Nicod’s
Hume, David, 47, 50, 119, 120–121, 126, 130–132, 136, 188, 204–206
  against the justification of induction, 3.3, 130
hypothesis, 62, 68–69, 187, 248–249, 266
  lawlike, 136
  null vs. alternative, 248–249
induction
  new riddle of, 7.5
  principles of, 39
  problem of, 3.2
  vindication of, 210
indifference, principle of, 98
instantiation, principle of universal, 18–19
insufficient reason, principle of. See indifference, principle of
irrelevant conjunction and disjunction, problems of, 70–71
Jeffrey, Richard C., 177. See also conditionalization: Jeffrey
Joyce, James M. See accuracy argument, gradational; decision theory, Bayesian: causal vs. evidential
justification, 38, 41, 48, 50, 66, 69, 120–121, 7.4, 10.1
Kelly’s theorem, 265
Keynes, John M., 211. See also probability, interpretation of: logical
Kolmogorov, Andrej N., 75. See also under large numbers, law of
language
  formal, 2, 9–10, 92–93
  object and meta-, 2–3
Laplace, Pierre Simon, 119. See also central limit theorem; indifference, principle of
large numbers, law of
  Kolmogorov’s strong, 223–224
  strong, 224–225
  weak, 225–226
learning theory, formal, 201, 11.1
Lévy, Paul. See central limit theorem
Lewis, David K., 172, 189. See also principal principle
likelihood-ratio, 156, 249
Lindeberg, Jarl W. See central limit theorem
logic
  predicate, 1.2
  propositional, 1.1
mean, 222, 233
mechanics, statistical and quantum, 9.2
median and mode, 232
model theory, 5, 13
monotonicity, principle of, 55

Neyman, Jerzy and Pearson, Egon S., 231, 249
non-negativity. See probability measure
normalization. See probability measure
normativity, 40
  deontological vs. instrumental, 127–128, 136, 141, 143
paradox
  Bertrand’s, 6.2
  Gibbs, 193
  liar, 4–5
  ravens, 4.2
  Russell’s, 26–28
  of water and wine, 6.3
partition, 89
Popper, Karl R., 4.6–4.7, 86, 158. See also chance: propensity vs. best system interpretation of; falsifiability
principal principle, 9.3
probabilism, 140, 165
probability, interpretation of
  chance, 9.1–9.3
  classical, 6.1–6.3
  (limiting) relative frequency, 10.1–1.10
  logical, 7.1–7.6
  subjective, 8.1–8.7
probability measure, 80, 93
  regular, 81
probability space, 5.1
proof theory, 5
Putnam, Hilary, 11.2
quantifiers. See logic: predicate

Ramsey, Frank P. See Dutch book argument
randomness, 101, 217
reference class problem, 214–215, 242
reflection principle, 202
regress argument, 65–66
Reichenbach, Hans, 119–120, 131, 207. See also induction: vindication of; straight(-forward) rule
relevance, positive probabilistic, 122, 155–156
risk, 146
Russell, Bertrand. See under paradox
sample, 230–232
Savage, Leonard, 174, 231
scale, ordinal, cardinal, interval, ratio, and absolute, 175–176
set theory, axioms of, 22–25
Shannon, Claude E., 192. See also entropy
single-case probability, 195, 242, 248
stabilization, 261, 264
standard deviation, 233
state, micro- and macro-, 190
statistics, 40, 10.7, 10.9
stopping rule, 252
straight(-forward) rule, 118, 131, 10.2, 263, 268
test, power and significance level of a, 249
testing, deductive method of hypothesis, 66–67
total probability, law of, 89–90

truth
  logical, 11
  table, 22–23
  value, 5
type vs. token distinction, 1
use vs. mention distinction, 3–4
utility, 146–147
  principle of maximizing expected, 174
validity, logical, 11
variable, random, 215–216, 235
  independent and identically distributed, 181–182, 218–221, 224–226, 329
  singular vs. generic, 230
variance and covariance, 233–234
Venn, John, 209, 31. See also under figure(s)
von Kries, Johannes. See paradox: of water and wine
von Mises, Richard, 209. See also paradox: of water and wine
Wittgenstein, Ludwig. See probability, interpretation of: logical
world, possible, 78, 230. See also case, logically possible