
The Computational Nature of Language Learning and Evolution
Partha Niyogi
Current Studies in Linguistics 43
linguistics/cognitive science

The nature of the interplay between language learning and the evolution of a language over generational time is subtle. We can observe the learning of language by children and marvel at the phenomenon of language acquisition; the evolution of a language, however, is not so directly experienced. Language learning by children is robust and reliable, but it cannot be perfect or languages would never change — and English, for example, would not have evolved from the language of the Anglo-Saxon Chronicles. In this book Partha Niyogi introduces a framework for analyzing the precise nature of the relationship between learning by the individual and evolution of the population. Learning is the mechanism by which language is transferred from old speakers to new. Niyogi shows that the evolution of language over time will depend upon the learning procedure—that different learning algorithms may have different evolutionary consequences. He finds that the dynamics of language evolution are typically nonlinear, with bifurcations that can be seen as the natural explanatory construct for the dramatic patterns of change observed in historical linguistics. Niyogi investigates the roles of natural selection, communicative efficiency, and learning in the origin and evolution of language—in particular, whether natural selection is necessary for the emergence of shared languages.

Over the years, historical linguists have postulated several accounts of documented language change. Additionally, biologists have postulated accounts of the evolution of communication systems in the animal world. This book creates a mathematical and computational framework within which to embed those accounts, offering a research tool to aid analysis in an area in which data is often sparse and speculation often plentiful.

Partha Niyogi is Professor of Computer Science and Statistics at the University of Chicago.

“A thoughtful and original analysis of important problems in the history, evolution, and acquisition of language.” —Steven Pinker, Johnstone Professor of Psychology, Harvard University, and author of The Language Instinct, Words and Rules, How the Mind Works, and The Blank Slate

“Partha Niyogi introduces new perspectives on the link between language acquisition and language change across generations. He theorizes with a mathematician’s rigor, generalizes across biological, political, and cultural distinctions, and offers intriguing simulations.” —David W. Lightfoot, Professor of Linguistics, Georgetown University, and Assistant Director, National Science Foundation

“A study of the relationship between language learning by individuals and the evolution of language gives rise to a host of issues, all of them controversial. Niyogi has first provided a lucid discussion of many of these issues and then suggested a very interesting formal and computational model, based primarily on a distribution over the grammars of a population of learners. Linguists and computational linguists of different persuasions will find this book very rewarding.” —Aravind K. Joshi, Henry Salvatori Professor of Computer and Cognitive Science, University of Pennsylvania

“Biological evolution can only be understood by thinking in terms of populations. This book helps us to think in terms of linguistic populations. The vast array of examples and models offers a wealth of tools for understanding the dynamics of the subtle interplay between language evolution and language learning.” —Karl Sigmund, Faculty for Mathematics, University of Vienna

Cover: Tower of Babel, Pieter Brueghel the Elder, 1563, oil on panel, courtesy of The Bridgeman Art Library

ISBN 0-262-14094-2
The MIT Press, Massachusetts Institute of Technology, Cambridge, Massachusetts 02142, http://mitpress.mit.edu

The Computational Nature of Language Learning and Evolution

The Computational Nature of Language Learning and Evolution Partha Niyogi

The MIT Press Cambridge, Massachusetts London, England

© 2006 Massachusetts Institute of Technology

All rights reserved. No part of this book may be reproduced in any form by any electronic or mechanical means (including photocopying, recording, or information storage and retrieval) without permission in writing from the publisher.

MIT Press books may be purchased at special quantity discounts for business or sales promotional use. For information, please e-mail special [email protected] or write to Special Sales Department, The MIT Press, 55 Hayward Street, Cambridge, MA 02142-1315.

This book was set in LaTeX by Partha Niyogi. Printed and bound in the United States of America.

Library of Congress Cataloging-in-Publication Data
Niyogi, Partha.
The computational nature of language learning and evolution / Partha Niyogi.
p. cm. — (Current studies in linguistics ; 43)
Includes bibliographical references and index.
ISBN 0-262-14094-2 (alk. paper)
1. Language acquisition. 2. Language and languages — Origin. 3. Computational linguistics. 4. Linguistic change. 5. Multilingualism. 6. Language and culture. I. Title. II. Current studies in linguistics series ; 43.
P118.N57 2006
401.93—dc22

10 9 8 7 6 5 4 3 2 1

Contents

Preface   xiii
Acknowledgments   xvii

I  The Problem   1

1  Introduction   3
   1.1  Language Acquisition   8
   1.2  Variation — Synchronic and Diachronic   13
   1.3  More Examples of Change   16
        1.3.1  Phonetic and Phonological Change   16
        1.3.2  Syntactic Change   22
   1.4  Perspective and Conceptual Issues   25
        1.4.1  The Role of Learning   27
        1.4.2  Populations versus Idiolects   28
        1.4.3  Gradualness versus Abruptness (or the S-Shaped Curve)   29
        1.4.4  Different Time Scales of Evolution   30
        1.4.5  Cautionary Aspects   31
   1.5  Evolution in Linguistics and Biology   32
        1.5.1  Scientific History   34
   1.6  Summary of Results   37
        1.6.1  Main Insights   39
   1.7  Audience and Connections to Other Fields   42
        1.7.1  Structure of the Book   44

II  Language Learning   47

2  Language Acquisition: The Problem of Inductive Inference   49
   2.1  A Framework for Learning   50
        2.1.1  Remarks   51
   2.2  The Inductive Inference Approach   56
        2.2.1  Discussion   60
        2.2.2  Additional Results   63
   2.3  The Probably Approximately Correct Model and the VC Theorem   71
        2.3.1  Sets and Indicator Functions   71
        2.3.2  Graded Distance   72
        2.3.3  Examples and Learnability   72
        2.3.4  The Vapnik-Chervonenkis (VC) Theorem   74
        2.3.5  Proof of Lower Bound for Learning   76
        2.3.6  Implications   79
        2.3.7  Complexity of Learning   81
        2.3.8  Final Words   82

3  Language Acquisition: A Linguistic Treatment   83
   3.1  Language Learning and the Poverty of Stimulus   86
   3.2  Constrained Grammars — Principles and Parameters   88
        3.2.1  Example: A Three Parameter System from Syntax   89
        3.2.2  Example: Parameterized Metrical Stress in Phonology   94
   3.3  Learning in the Principles and Parameters Framework   96
   3.4  Formal Analysis of the Triggering Learning Algorithm   100
        3.4.1  Background   100
        3.4.2  The Markov Formulation   102
        3.4.3  Derivation of the Transition Probabilities for the Markov TLA Structure   109
   3.5  Conclusions   111
   3.6  Appendix   113
        3.6.1  Unembedded Sentences For Parametric Grammars   113
        3.6.2  Proof of Learnability Theorem   113

4  Language Acquisition: Memoryless Learning   117
   4.1  Characterizing Convergence Times for the Markov Chain Model   117
        4.1.1  Some Transition Matrices and Their Convergence Curves   118
        4.1.2  Absorption Times   123
        4.1.3  Eigenvalue Rates of Convergence   123
   4.2  Exploring Other Points   129
        4.2.1  Changing the Algorithm   130
        4.2.2  Distributional Assumptions   132
        4.2.3  Natural Distributions – CHILDES CORPUS   134
   4.3  Batch Learning Upper and Lower Bounds: An Aside   135
   4.4  Generalizations and Variations   138
        4.4.1  Markov Chains and Learning Algorithms   138
        4.4.2  Memoryless Learners   140
        4.4.3  The Power of Memoryless Learners   141
   4.5  Other Kinds of Learning Algorithms   142
   4.6  Conclusions   144
   4.7  Appendix: Proofs for Memoryless Algorithms   146

III  Language Change   153

5  Language Change: A Preliminary Model   155
   5.1  An Acquisition-Based Model of Language Change   157
   5.2  A Preliminary Model   160
        5.2.1  Learning by Individuals   161
        5.2.2  Population Dynamics   162
        5.2.3  Some Examples   164
   5.3  Implications and Further Directions   177
        5.3.1  An Example from Yiddish   177
        5.3.2  Discussion   179
        5.3.3  Future Directions   182

6  Language Change: Multiple Languages   187
   6.1  Multiple Languages   187
        6.1.1  The Language Acquisition Framework   187
        6.1.2  From Language Learning to Population Dynamics   188
   6.2  Example 1: A Three-Parameter System   195
        6.2.1  Homogeneous Initial Populations   196
        6.2.2  Modeling Diachronic Trajectories   205
        6.2.3  Nonhomogeneous Populations: Phase-Space Plots   211
   6.3  Example 2: Syntactic Change in French   218
        6.3.1  The Parametric Subspace and Data   219
        6.3.2  The Case of Diachronic Syntactic Change in French   220
        6.3.3  Some Dynamical System Simulations   223
   6.4  Conclusions   229

7  An Application to Portuguese   233
   7.1  Portuguese: A Case Study   234
        7.1.1  The Facts of Portuguese Language Change   234
   7.2  The Logical Basis of Language Change   238
        7.2.1  Galves Batch Learning Algorithm   239
        7.2.2  Batch Subset Algorithm   246
        7.2.3  Online Learning Algorithm (TLA)   247
   7.3  Conclusions   248

8  An Application to Chinese Phonology   251
   8.1  Phonological Merger in the Wenzhou Province   252
   8.2  Two Forms in a Population   257
        8.2.1  Case 1   257
        8.2.2  Analysis   258
        8.2.3  Case 2   260
        8.2.4  Case 3   262
        8.2.5  Case 4   263
        8.2.6  Remarks and Discussion   264
   8.3  Examining the Wenzhou Data Further   265
   8.4  Error-Driven Models   268
        8.4.1  Asymmetric Errors   269
        8.4.2  Bifurcations and the Actuation Problem   270
   8.5  Discussion   271
        8.5.1  Sound Change   271
        8.5.2  Connections to Population Biology   272
   8.6  Conclusions   273

9  A Model of Cultural Evolution and Its Application to Language   275
   9.1  Background   275
   9.2  The Cavalli-Sforza and Feldman Theory   277
   9.3  Instantiating the CF Model for Languages   279
        9.3.1  One-Parameter Models   279
        9.3.2  An Alternative Approach   281
        9.3.3  Transforming NB Models into the CF Framework   282
   9.4  CF Models for Some Simple Learning Algorithms   284
        9.4.1  TLA and Its Evolution   284
        9.4.2  Batch- and Cue-Based Learners   288
        9.4.3  A Historical Example   289
   9.5  A Generalized NB Model for Neighborhood Effects   298
        9.5.1  A Specific Choice of Neighborhood Mapping   300
   9.6  A Note on Oblique Transmission   302
   9.7  Conclusions   303

10  Variations and Case Studies   305
    10.1  Finite Populations   305
         10.1.1  Finite Populations   306
         10.1.2  Stochastic Dynamics   306
         10.1.3  Evolutionary Behavior as a Function of N   307
    10.2  Spatial Effects   314
         10.2.1  Spatial Variation and Dialect Formation   314
         10.2.2  A General Spatial Model   316
    10.3  Multilingual Learners   320
         10.3.1  Bilingualism Modeled as a Lambda Factor   322
         10.3.2  Further Remarks   327
         10.3.3  A Bilingual Model for French   330
    10.4  Conclusions   336

IV  The Origin of Language   339

11  The Origin of Communicative Systems: Communicative Efficiency   341
    11.1  Communicative Efficiency of Language   343
         11.1.1  Communicability in Animal, Human, and Machine Communication   345
    11.2  Communicability for Linguistic Systems   346
         11.2.1  Basic Notions   346
         11.2.2  Probability of Events and a Communicability Function   349
    11.3  Reaching the Highest Communicability   351
         11.3.1  A Special Case of Finite Languages   351
         11.3.2  Generalizations   359
    11.4  Implications for Learning   359
         11.4.1  Estimating P   360
         11.4.2  Estimating Q   362
         11.4.3  Sample Complexity Bounds   363
    11.5  Communicative Efficiency and Linguistic Structure   366
         11.5.1  Phonemic Contrasts and Lexical Structure   367
         11.5.2  Functional Load and Communicative Efficiency   368
         11.5.3  Perceptual Confusibility and Functional Load   371

12  The Origin of Communicative Systems: Linguistic Coherence and Communicative Fitness   375
    12.1  General Model   376
         12.1.1  The Class of Languages   376
         12.1.2  Fitness, Reproduction, and Learning   377
         12.1.3  Population Dynamics   378
    12.2  Dynamics of a Fully Symmetric System   379
         12.2.1  Fixed Points   380
         12.2.2  Stability of the Fixed Points   385
         12.2.3  The Bifurcation Scenario   391
    12.3  Fidelity of Learning Algorithms   392
         12.3.1  Memoryless Learning   393
         12.3.2  Batch Learning   395
    12.4  Asymmetric A Matrices   398
         12.4.1  Breaking the Symmetry of the A Matrix   398
         12.4.2  Random Off-Diagonal Elements   399
         12.4.3  Final Remarks   402
    12.5  Conclusions   402

13  The Origin of Communicative Systems: Linguistic Coherence and Social Learning   405
    13.1  Learning Only from Parents   406
    13.2  Social Learning: Learning from Everybody   408
         13.2.1  The Symmetric Assumption   408
         13.2.2  Coherence for n = 2   409
    13.3  Coherence for General n   415
         13.3.1  Cue-Frequency Based Batch Learner   415
         13.3.2  Evolutionary Dynamics of Batch Learner   416
    13.4  Proofs of Evolutionary Dynamics Results   418
         13.4.1  Preliminaries   418
         13.4.2  Equilibria   420
         13.4.3  Stability   422
         13.4.4  Bifurcations   426
    13.5  Coherence for a Memoryless Learner   432
    13.6  Learning in Connected Societies   433
         13.6.1  Language Evolution in Locally Connected Societies   434
         13.6.2  Magnetic Systems: The Ising Model   435
         13.6.3  Analogies and Implications   438
    13.7  Conclusions   441

V  Conclusions   443

14  Conclusions   445
    14.1  A Summary of the Major Insights   446
         14.1.1  Learning and Evolution   446
         14.1.2  Bifurcations in the History of Language   448
         14.1.3  Natural Selection and the Emergence of Language   449
    14.2  Future Directions   449
         14.2.1  Empirical Validation   453
         14.2.2  Connections to Other Disciplines   456
    14.3  A Concluding Thought   458

Bibliography   459

Index   479

Preface

This book explores the interplay between learning and evolution in the context of linguistic systems. For several decades now, the process of language acquisition has been conceptualized as a procedure that maps linguistic experience onto linguistic knowledge. If linguistic knowledge is characterized in computational terms as a formal grammar and the mapping procedure is algorithmic, this conceptualization admits computational and mathematical modes of inquiry into language learning. Indeed, such a view is implicit in most modern approaches to the subject in linguistics, cognitive science, and artificial intelligence.

Learning (acquisition) is the mechanism by which language is transmitted from old speakers to new. Therefore, the evolution of language over generational time in linguistic populations will depend upon the learning procedure used by the individuals in it. Yet the interplay between learning by the individual and evolution of the population can be quite subtle. We need tools to reason about the phenomena and elucidate the precise nature of the relationships involved. To this end, this book presents a framework in which to conduct such an analysis.

Most people can directly observe the learning of language by children and marvel at the phenomenon of language acquisition. In contrast, few people have direct experience with the unfolding history of a language. Picking up an Old English text like the Anglo-Saxon Chronicles is not always part of our daily existence. People doing this, however, will find a language that is incomprehensible to modern English speakers. This leads to the following question: if in the ninth century A.D., people in England spoke a language like that in the Anglo-Saxon Chronicles, this is the language their children should have learned — and their children after them, and so on. How, then, did it come to be that the process of iterative learning by successive generations led to the evolution of English so far from its origins? What does it imply for how English might be a thousand years from now?

Of course, the problem is not limited to English alone. Language change


and evolution is ubiquitous. It happens in most languages, in their syntax, their phonology, and their lexicon. It manifests itself in language birth and death phenomena, in creolization, and in dialect formation. It is happening around us as we speak. More mysteriously, it has happened over evolutionary time scales as the language capacity evolved from prelinguistic versions of it. There is thus a tension between language learning and language evolution. On the one hand, the learning of language by children is robust and reliable. On the other hand, it cannot be perfect or else languages (barring major migratory effects) would not change. This book is an attempt to resolve this tension. The analytic framework introduced here considers a population of linguistic agents. Linguistic agents are of two types: mature users of a language and learners who acquire a language from the other users. Each learner acquires language based on its own primary linguistic data, i.e., linguistic examples received from other users in the community. By taking an ensemble average across learners, we can derive the average linguistic composition of the mature speakers of the next generation. Thus the average linguistic composition evolves as a dynamical system. The framework is noteworthy for its shift of emphasis from the individual to the population in the analysis of learning and its evolutionary consequences. Much of language learning theory (often termed learnability theory) focuses on an idealized speaker-hearer interaction in a homogeneous linguistic environment. In this tradition, one is concerned with whether the learner will converge to the unique target grammar of the parent as more and more data becomes available. In contrast, I analyze learning algorithms in the case in which the learner is immersed in a heterogeneous linguistic environment. There is no unique target grammar and the learner never converges. Instead, there is a distribution of grammars in the linguistically mature population, and the learner matures after a finite time corresponding to its developmental learning period. In this setting, my main results may be summarized as follows. First, I elucidate the subtle nature of the relationship between learning and evolution. In particular, I show that different learning algorithms may have different evolutionary consequences. Therefore, we are able to bring to bear both developmental and evolutionary data and arguments to judge the plausibility of various learning algorithms for language acquisition. Second, I find that the dynamics of language evolution are typically nonlinear. Further, there are often bifurcations that lead to a change in the stability profile of the equilibrium distribution of languages. The parameters associated with such bifurcations are naturally interpretable as the frequency of usage of


various linguistic expressions. Thus, much like phase transitions in physics, I argue that the continuous drift of such frequency effects could lead to discontinuous changes in the stability of languages over time. I claim that these bifurcations are the natural explanatory construct for the dramatic patterns of change observed in historical linguistics. Third, I investigate the role of natural selection, communicative efficiency, and learning in the origin and evolution of language. In particular, I investigate the conditions under which shared languages (communicative systems) might emerge. I show that if individuals learn from a single agent in the population, then natural selection is necessary for the emergence of shared languages. On the other hand, if individuals learn from multiple agents in the community (social learning), then shared languages might emerge even in the absence of natural selection. It is natural to compare linguistic and biological evolution. In biological evolution, one studies how biological (genotypic or phenotypic) diversity evolves under the action of various inheritance mechanisms (sexual and asexual reproduction) and natural selection. In language evolution, one studies how linguistic (syntactic, phonological, and so on) diversity evolves. However, the mechanism of transmission is not inheritance. Rather, it is learning by individual children. Moreover, whereas in biological evolution, one acquires (via inheritance) one’s genes from one’s parents alone, in linguistic evolution, one might acquire linguistic features from a greater variety of individuals. Further, the sense in which natural selection and fitness may be meaningfully considered in language evolution remains unclear. These similarities and differences have marked the history of both subjects. Since the promulgation of the Indo-European thesis by William Jones, historical linguistics was the preoccupation of linguists of the nineteenth century. Darwin was clearly influenced by some of these ideas and in the Descent of Man, he has often remarked on these analogies. In the twentieth century, evolutionary ideas were integrated with the genetic and molecular biology revolution. Correspondingly, the traditional questions of nineteenth century linguistics are being reformulated with the insights of modern generative linguistics. The study of language evolution has a special significance in the scheme of things because it makes it possible for us to transmit information in a nongenetic manner across generations. That is why, as humans, we have such a profound sense of history, culture, and tradition. Learning, rather than inheritance, is the basis of this transmission of information. It is of interest, therefore, to understand the evolutionary properties of systems where the mechanism of transmission is learning rather than inheritance. More


generally, my effort to understand the relationship between the population and the individual is a variation on a theme that cuts across many subjects where one studies the behavior of a complex system of many interacting components. Statistical physics, population biology, individual and collective choice in economics, and the study of social and cultural norms provide other examples.

This book represents a small step toward a larger understanding of the issues in language learning and evolution. This larger understanding will require mathematical models, computer simulations, empirical data analysis, and controlled experiments. The insights will illuminate the nature of communication in humans, animals, and machines. They will have implications for how information is acquired and propagated in linguistics, biology, and computer science.

P.N.
Chicago, December 2005

Acknowledgments

This book is the outcome of more than ten years of thinking about the problems of language learning, change, evolution, and their interplay. My perspective on this subject has been shaped over the years by collaborations, discussions, and debates with a variety of people across a range of disciplines.

Thanks to Ken Wexler for first getting me started with a conversation in December 1993. I was then a graduate student at MIT working on problems of learning and inference in humans and machines. Ken suggested that I consider the relation between learning in one generation and change across many generations. This led me inevitably to evolutionary questions and resulted in a 1995 publication titled “The Logical Problem of Language Change” (MIT AI Lab Technical Report No. 1516). Bob Berwick, my collaborator on that paper, provided intellectual companionship, and many parts (Chapters 3, 4, and 5) of this book are based on papers written with him in those early years. A second collaborative phase began with Martin Nowak and Natalia Komarova, as we worked on a series of papers on models of language evolution and explored the similarities and differences with biological evolution. Substantial parts of Chapters 11 and 12 reflect joint work with them. I thank them all for their ideas, help, and friendship.

Along the way, numerous linguists, cognitive scientists, biologists, mathematicians, and computer scientists provided critiques, arguments, sanity checks, advice, encouragement, and appreciation. I would especially like to thank Tony Kroch, David Lightfoot, Bill Wang, Salikoko Mufwene, John Goldsmith, Charles Yang, Morris Halle, Norbert Hornstein, Ed Stabler, Dan Osherson, Martin Nowak, Natalia Komarova, Stuart Kurtz, Janos Simon, Bob Berwick, Ted Briscoe and Stephen Smale. Thanks also to Dinoj Surendran (Chapter 11) and Thomas Hayes (Chapter 13) who collaborated with me on some aspects of the research program while they were students at The University of Chicago.

The University of Chicago has provided a superb intellectual environment for pursuing these ideas. I thank them and especially the Department


of Computer Science for their support. My parents and brother provided encouragement and love. My sons Nikhil and Kabir (now three years old) gave me a direct appreciation of the problem of language acquisition and taught me how little we really understand about these matters. Finally, my wife Parvati believed in this research program from the beginning. For this, and for much more, I dedicate this book to her.

Part I

The Problem

Chapter 1

Introduction

Let us begin with a fact. The two sentences in (1a,b) are constructed with English words. All native speakers of English recognize that (1a) is grammatically well-formed (“correct” or “natural”) while (1b) is grammatically ill-formed. Following standard practice in the linguistics literature, I have indicated the ill-formed expression by an asterisk.

(1)  a.  He ran from there with his money.
     b.  ∗He his money with there from ran.

We accept this fact because we know English, and this knowledge seems to endow us with the ability to recognize grammaticality and thus separate grammatical sentences from ungrammatical ones. Of course, we weren’t born knowing this fact. We learned English as children — presumably from exposure to sentences in English from parents and caretakers. As adults with a mature knowledge of English, we are now able to discriminate between well-formed expressions and ill-formed ones.¹

¹ There is often disagreement among researchers and lay people alike regarding the firmness and reliability of grammaticality judgments. Some of this disagreement is well founded and calls for a more nuanced interpretation of grammatical rules. However, from time to time, alarmists have suggested that grammaticality is not a useful notion at all and is frequently violated in natural language. Part of this feeling may arise from a confusion between competence and performance issues, between prescriptive and descriptive notions of grammar, between idiolectal and communal languages. Even such alarmists must concede, however, that certain expressions are clearly well-formed (such as (1a)) and others are clearly ill-formed (such as (1b)) and about these judgments there can be no reasonable disagreement. This is usually a good starting point from which one can invoke various softer notions of grammaticality possibly using probability theory as a tool. For a discussion on the role of grammaticality judgments in providing empirical support for various linguistic theories, see Schütze 1996.


In normal circumstances children acquire the language of their parents. Thus children growing up today in a relatively homogeneous English speaking environment would learn English from their parents, and even if they had not encountered sentences (1a) or (1b) before, their judgment on these sentences would agree with that of their parents. That is essentially what it means to learn one’s native language.² Thus language would be successfully transmitted from parent to child. Indeed, if one polled three consecutive generations of English speakers in an English-speaking community, one would find general agreement on the grammatical judgments of (1a) and (1b).

Let us imagine an English-speaking community today. Let V be the vocabulary, i.e., the set of unique words in English. One can then form strings over V (elements of V*), and (1a) and (1b) are two such strings. Each adult in such a community has an internal system of rules (knowledge) that allows him or her to decide which elements of V* are acceptable and which are not. For individual i let E_i ⊂ V* be the set of acceptable sentences for that individual. Correspondingly, I_i is the internal system of rules that characterizes the linguistic knowledge and therefore the extensional set E_i of the ith speaker. The sets E_i might differ slightly from individual to individual, but they must largely agree, for otherwise speakers would not share the same language.³ Most of these sets E_i would contain (1a) but not contain (1b). Let E denote the intersection of the sets E_i, i.e., E = ∩_i E_i. We can interpret E to be the set of sentences that would be considered grammatically acceptable by everyone in the community of adults. In fact, we would find (1a) to be an element of E while (1b) is not.

Children growing up in such a community would hear the ambient sentences in their linguistic environment from their parents, caretakers, and others they come in contact with. On the basis of such exposure, they too would “learn English”, that is, they would acquire a system of rules and correspondingly a language.

² One might quibble that children disagree some with their parents. While this is arguably true, this disagreement can never be extreme. Such extreme disagreement would lead to breakdown in successful linguistic communication between parent and child. Note that I use the term parent rather loosely to denote parents, caretakers, and others in the immediate vicinity. Note also that in linguistically heterogeneous communities, the role of parents may be less important than that of others. These issues will get clearer as we proceed.

³ A few remarks are worthwhile. E_i is typically an infinite set for which a finite characterization may be provided by I_i. E′ = ∪_i E_i corresponds to the set of all sentences “English speakers” in the community produce. The elements of E_i are observable but the object of fundamental significance is I_i. These distinctions are related to those between E-language and I-language that appear in the work of Chomsky.
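As a purely illustrative aside (not from the book), the set-theoretic notation above is easy to make concrete in a few lines of code. The toy speakers and sentences below are invented for the sketch, and the finite sets stand in for the infinite E_i generated by real internal grammars I_i.

```python
# Each speaker's idiolect E_i is modeled here as a finite set of acceptable strings
# over the vocabulary V; in reality E_i is an infinite subset of V* generated by I_i.
speakers = {
    "speaker_1": {"he ran from there with his money", "he ran home"},
    "speaker_2": {"he ran from there with his money", "she ran home"},
    "speaker_3": {"he ran from there with his money", "he ran home", "she ran home"},
}

# E = intersection of all the E_i: the sentences every adult in the community accepts.
E = set.intersection(*speakers.values())

print(E)                                          # {'he ran from there with his money'}
print("he ran from there with his money" in E)    # True, like (1a)
print("he his money with there from ran" in E)    # False, like (1b)
```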


For the ith such child, let us denote the internal system of rules he or she acquires by I_i^c and the corresponding extensional set by E_i^c. The mechanisms of language acquisition guide the learning child toward the language of the ambient community. Therefore, one ought to find that E^c = ∩_i E_i^c mirrors E. In fact, if one looks at the last hundred years in a relatively homogeneous English-speaking community, this seems to be roughly true. Indeed, we are easily able to read English texts from a hundred years ago. In other words, if all children acquired the language of their parents (read parental generation), and if generation after generation of children acquired the language of their parents, then language would be successfully transmitted from one generation to the next and the linguistic composition of every generation would look exactly like the linguistic composition of the previous one. A thousand years from now, English-speaking communities would still judge (1a) to be grammatical and (1b) to be not so. Languages would not change with time.

But they do! In fact, historical linguistics is the study of how, why, when, and in what form languages change with time. So let us now go back a thousand years. Shown below is an extract from the Anglo-Saxon Chronicles taken from the writers of English in 878 A.D. (reproduced from Trask 1996). The original is italicized and a word-for-word gloss is provided below.

Her ...Ælfred cyning ...gefeaht wiþ ealne here, and hine
Here Alfred king... fought against whole army and it

geflymde and him aefter rad oþ þet geweorc, and þaer saet
put to flight and it after rode to the fortress and there camped

XIIII niht, and þa sealde se here him gislas and myccle
fourteen nights and then gave the army him hostages and great

aþas, þet he of his rice woldon, and him eac geheton
oaths that they from his kingdom would [go] and him also promised

þet heora cyng fulwihte onfon wolde, and hi þaet gelaston
that their king baptism receive would and they that did

It is striking that the language has changed so much and at so many different levels that it is barely recognizable as English today. Let us ignore for the moment changes in pronunciation and lexical items and focus instead on the underlying word order and grammaticality. I have underlined some “odd” portions of the passage for this reason. Clearly, grammaticality judgments in the ninth century were quite different from those today. The set E describing the language of English speakers in 878 A.D. is quite different from what it is today. There are many points of difference, but let us examine a certain systematic difference a little more closely.

There are some regularities in the underlying system of rules that characterize “well-formedness” (grammaticality) and result in the sets E both of English today and of English in the ninth century. For example, English today has VO word order, i.e., the verb (V) in a verb phrase precedes the object (O). Thus we have phrasal fragments such as

ate [with a spoon]
kicked [the ball]
jumped [over the fence]

and so on. This fact has received treatment in a variety of linguistic formalisms. For example, getting ahead of ourselves for the moment, we can introduce the notion of the head of a phrase, which for a verb phrase would be the verb, for a prepositional phrase the preposition, and so forth. English today might be deemed head-first. As a result, in combining words into phrases and ultimately sentences, English speakers put the verb before its object, the preposition before its argument, and so forth. Some other languages (see Bengali later, for example) are the other way around, and languages tend to be on the whole fairly systematic and internally consistent on this point. Now consider the following phrasal fragments from English of the ninth century.

þa Darius geseah þaet he oferwunnen beon wolde
then Darius saw that [he conquered be would]
(Orosius 128.5)

& him aefterfylgende waes
and [him following was]
(Orosius 236.29)

Nu ic wille eac þaes maran Alexandres gemunende beon
now I will also [the great Alexander considering be]
(Orosius 110.10)

Clearly, the language of Old English speakers was underlyingly OV. So what went on? These were the kinds of sentences that children presumably heard. The primary linguistic data that children received was consistent with an OV-type grammar, and therefore, this is what we would expect the children to have acquired. If, indeed, English was homogeneous in 800 A.D., and children learned the language of their parents, and their children after them, and so on, why did the language change?

These are not changes that are easily explained away by sociological considerations of changing political or technological times, innovations, fads, and the like. It is not a word here, an idiomatic expression there, a nuance here, or an accent there — it is deep and systematic change in the underlying word order of sentences — changes that would accumulate over recursions in hierarchically structured phrases, leading to such dramatic examples as

ondraedende þaet Laecedemonie ofer hie ricsian mehten swa hie aer dydon
dreading that Laecedemonians over them rule might as they before did
“dreading that the Laecedemonians might rule over them as they had done in the past”
(Orosius 98.17)

or

þeh ne geortriewe ic na Gode þaet he us ne maege gescildan
although not shall-distrust I never to-God, that he us not can shield
“although I shall never distrust God so much as to think he cannot shield us”
(Orosius 86.3)

The phenomena are quite striking and the puzzle is quite real. There are two forces that seem to be at odds with each other. On the one hand we have language acquisition — the child learning the language of its parents successfully. If acquisition is robust and reliable, one would think that language (grammars, linguistic knowledge) would be reliably transmitted from one generation to the next. On the other hand we have language change — the language of a community drifting over generational time, sometimes just a little bit, sometimes drastically, and sometimes not at all.
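The head-direction intuition used above (head-first Modern English versus the head-final orders of the Old English fragments) can be caricatured in a few lines. This is only an illustrative sketch; the function and phrase representation are invented here and are not the parametric systems developed later in the book.

```python
def linearize(head, complement, head_first=True):
    """Order a head and its complement according to a single word-order setting."""
    return f"{head} {complement}" if head_first else f"{complement} {head}"

# Head-first (VO, prepositional) orders, as in Modern English:
print(linearize("kicked", "the ball"))                      # kicked the ball
print(linearize("over", "the fence"))                       # over the fence

# Flipping the single setting yields OV-style orders, as in the ninth-century examples:
print(linearize("kicked", "the ball", head_first=False))    # the ball kicked
print(linearize("over", "the fence", head_first=False))     # the fence over
```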


And there you have the heart of the problem of historical linguistics. Given that children attempt to learn the language of their parents and caretakers, why do languages change with time? Why don’t they remain stable over all time? How fast do they change? In which direction do they change? What are the envelopes of possible change? What are the factors that influence change? These are the kinds of questions that historical linguistics wishes to answer — and indeed, historical linguists over the years have postulated many accounts of documented language change, in a number of linguistic subdomains from phonetics to syntax and in a number of different languages of the world. This book creates a mathematical and computational framework within which to embed those accounts. Such a computational treatment of historical linguistics compels us to make arguments about change precise and to work out the logical consequences of such arguments — consequences that might not be obvious from a more informal treatment of the subject. The work in this book is therefore presented as a research tool to judge the adequacy of competing accounts of language change — to aid us in our thinking as we reason about the forces behind such change — to prevent us from falling into the usual pitfalls of Kiplingesque just-so stories in an area where data is often sparse and speculation often plentiful. More generally, over the course of this book, I will discuss the themes of learning, communication, language, evolution, and their intertwined relationships. Let me elaborate.

1.1  Language Acquisition

The question of how we come to acquire our native language has received a central position in the current conceptualization of linguistic theory. Learning a language is characterized as ultimately developing a system of rules (a grammar) on the basis of linguistic examples encountered during the learning period. The language learning algorithm⁴ is therefore a map

    A : D → g

⁴ The terms learning, acquisition, and development carry different connotations and correspondingly different pictures of the same process. This leads to acrimonious debates, and it is safest perhaps to use the more neutral term map to denote the procedure that takes linguistic experience (data) as input and produces a computational system (grammar) as output. This map is the learning map, acquisition map, or development map, depending upon one’s point of view. I will generally use the term learning as well as the metaphors and concepts of learning theory to discuss this map and its consequences. It is also worth remarking that the grammar the child develops is probably not the result of conscious meditative deliberation, as is the case in developing a strategy for chess. Rather, it is like a reflex — an instinctual reaction to one’s linguistic environment.


where D denotes data and g the grammar. What is remarkable about this map is that it involves generalization. Of all the different grammars that may be compatible with the data, the child develops a particular one — one that goes beyond the data and one that is remarkably similar to that of its parents in normal and homogeneous environments. The nontrivial task of generalizing to a grammar from finite data leads to the so-called logical problem of language acquisition. This has received considerable computational attention. Beginning with the work of Gold 1967 and Solomonoff 1964, continuing with Feldman 1972, Blum and Blum 1975, Angluin 1980ab, on to Jain et al 1998, a rich tradition of research in inductive inference and learnability theory exists that casts the language-acquisition problem in a formal setting that consists of the following key components:

1. Target grammar. g_t ∈ G is a target grammar drawn from a class of possible target grammars (G). Grammars are representational devices for generating languages. Languages are subsets of Σ* where Σ is a finite alphabet in the usual way.

2. Example sentences. s_i ∈ L_{g_t} are example sentences generated by the target grammar and presented to the learner. Here s_i is the ith example sentence in the learner’s data set and L_{g_t} is the target language corresponding to the target grammar.

3. Hypothesis grammars. h ∈ H are hypothesis grammars drawn from a class of possible hypothesis grammars that learners (children) construct on the basis of exposure to example sentences in the environment. These grammars are then used to generate and comprehend novel sentences not encountered in the learning process.

4. Learning algorithm. A is an effective procedure by which grammars from H are selected (developed) on the basis of example sentences received by the child.

These components are introduced to meaningfully discuss the problem of generalization in language acquisition. Consider a somewhat idealized parent-child interaction over the course of language acquisition. The parent has internal knowledge of a particular language (grammar) so that by his or her reckoning, arbitrary strings of words can be assigned grammaticality values. Thus an English-speaking parent would know that sentence (1a)


above was grammatical while (1b) was not. This language (grammar) is taken to be the target⁵ language (grammar) that children must acquire and do in normal circumstances. In a natural language-acquisition setting, children are not directly instructed as to the nature of the grammar that generates sentences of the target language. Rather, they are exposed to sentences of the ambient language as a result of spoken interaction with the world. Thus, their linguistic experience consists of example sentences (mostly from the target language) they hear, and this constitutes their so-called primary linguistic data. On exposure to such linguistic examples, language acquisition is the process by which a grammar is learned (developed, acquired, induced/inferred) so that when novel sentences are produced by parents, children will (among other things) be able to correctly judge their grammaticality and in fact will be able to produce ones of their own as well. This leads to successful ongoing communication between parent and child.

Successful generalization to novel sentences is the key aspect of language acquisition. Thus in our idealized parent-child interaction one might imagine that neither sentence (1a) nor (1b) was encountered by the learning child over the period of learning English. When the learning period is over, the child’s judgment of (1a) and (1b) would agree with that of the parents — the child has been able to go beyond the data to successfully generalize to novel sentences. This is what it means for the child to learn the language of its parents.

Scholars have conceptualized the learning procedure of the child as constructing grammatical hypotheses about the target grammar after encountering sentences in the primary linguistic data. Let h_n ∈ H be the grammatical hypothesis after the nth sentence. Successful generalization requires at the very least that the learner’s hypothesis come closer and closer to the target as more and more data become available. In other words, the learner’s hypothesis converges to the target (in some sense indicated here by the

⁵ Much of learning theory proceeds with the idealized assumption that there is actually a target grammar. This is a necessary position if one wishes to understand the phenomenon of generalization. This is a reasonable position if one considers an idealized parent-child interaction or an idealized homogeneous community. In practice, however, there is always linguistic variation and children acquire a linguistic system from the varied input they receive from the community at large. Therefore, there really is no target in the learning process. Rather, the learning algorithm is a map from data to grammatical systems. In this book we try to understand what happens if we iterate this map in situations that correspond to heterogeneous communities.

distance metric d(h_n, g_t)) as the data goes to infinity, i.e.,

    lim_{n→∞} d(h_n, g_t) = 0        (1.1)
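To see how the pieces of the framework fit together (a target grammar g_t, example sentences s_i drawn from L_{g_t}, a hypothesis class H, an algorithm A, and the convergence criterion (1.1)), here is a minimal simulation sketch. Everything in it is a stand-in: the two "grammars" are toy finite languages, the learner is a simple error-driven guesser (one of many possible instantiations of A), and d(h_n, g_t) is just the indicator of whether the current hypothesis is the target.

```python
import random

# A toy hypothesis class H = {g1, g2}; each "grammar" is identified with the finite
# language it generates (real grammars generate infinite subsets of Sigma*).
H = {
    "g1": {"ate with a spoon", "kicked the ball"},   # VO-style fragments
    "g2": {"with a spoon ate", "the ball kicked"},   # OV-style fragments
}

def learn(target, n_examples=50, seed=0):
    """Error-driven learner: keep the current hypothesis while it accepts the example
    sentence; switch to the other hypothesis whenever it does not."""
    rng = random.Random(seed)
    h = rng.choice(sorted(H))                  # initial hypothesis h_0
    distances = []
    for _ in range(n_examples):
        s = rng.choice(sorted(H[target]))      # example sentence s_i drawn from L_{g_t}
        if s not in H[h]:                      # error: current hypothesis rejects s_i
            h = "g2" if h == "g1" else "g1"
        distances.append(0 if h == target else 1)   # d(h_n, g_t)
    return h, distances

final_h, d = learn(target="g1")
print(final_h)    # 'g1' -- the learner has converged to the target
print(d[-5:])     # [0, 0, 0, 0, 0] -- d(h_n, g_t) has gone to 0
```

The learning algorithms analyzed later in the book (for example, the Triggering Learning Algorithm of Chapter 3) are more refined learners of this error-driven flavor, studied over much richer grammar classes and data distributions.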

Language acquisition is after all a particular cognitive instantiation of a generic problem in learning theory, and it is no surprise that the framework here is quite general and applicable to a variety of learning problems in linguistic and nonlinguistic domains. For our purposes, it is important to point out that though we have begun on a fairly traditional note with grammars and languages as characterizations of syntactic phenomena, the framework is quite general and is not committed to any particular linguistic theory or even linguistic domain.

A number of different aspects of this framework need to be emphasized. First, G or H could represent grammars for syntax in more traditional generative linguistics traditions such as Government and Binding theory (GB), Minimalism, Head-driven Phrase Structure Grammar (HPSG), Lexical-Functional Grammar (LFG), Tree Adjoining Grammar (TAG), and so on. It might represent syntactic grammars with less traditional notational systems such as those that arise in connectionist traditions or in recent statistical linguistics traditions. In the areas of phonology, G (and correspondingly H) might represent grammars for phonology in any tradition, e.g., Optimality Theory, parameterized theories for metrical stress, Finite State Phonology, and so on. As a matter of fact, G need not even be a class of symbolic grammars. It might be a class of real-valued functions characterizing the decision boundary in some acoustic-phonetic-perceptual space between two phonemic classes. Such a decision boundary also needs to be learned by children in order to acquire relevant phonetic distinctions and build up a phonological system.

Second, the example sentences (where s_i denotes the ith example sentence) might be strings of lexical items, annotated lexical strings, parse trees of example sentences, (form, meaning) pairs such as pairings of syntactic structure with semantic representation and so on. In the case of phonology, they may be surface forms, acoustic waveforms, stress patterns, and the like.

Third, the learning algorithm will undoubtedly depend upon the representations used for grammars in H and examples s_i. Learning algorithms vary from parameter-setting algorithms in the Principles and Parameters tradition, constraint reranking algorithms in Optimality Theory, parameter estimation methods based on statistical criteria like Expectation Maximization (EM), Maximum Entropy and related methods, gradient descent and Backpropagation in neural networks, and so on.


Thus, depending upon the domain and the phenomena of interest, an appropriate notational system for grammars and a cognitively plausible learning algorithm is used in formal explorations in the study of language acquisition. We will encounter several such instantiations over the course of the book.

Finally, the question of generalization characterized by the convergence criterion in Equation 1.1 can be studied under a number of different notions of convergence. The entire framework can be probabilized so that sentences are now drawn according to an underlying probability distribution. One can then study convergence on all data sequences, on almost all data sequences, strong and weak convergence in probability, and so on. The norm in which convergence takes place can vary from extensional set differences (the L¹(μ) norm where μ is a measure on Σ* and languages are indicator functions on Σ*) to intensional differences between grammars as defined by the distance between Gödel numberings in an enumeration of candidate grammars. The resulting learning-theoretic frameworks vary from the Probably Approximately Correct framework of Valiant (1984) and Vapnik (1982) to the inductive inference framework of Gold (1967).

The necessary and sufficient conditions for successful generalization by a learning algorithm have been the topic of intense investigation by the theoretical communities in computer science, mathematics, statistics, and philosophy. They point to the inherent difficulty of inferring an unknown target from finite resources, and in all such investigations, one concludes that tabula rasa learning is not possible. Thus children do not entertain every possible hypothesis that is consistent with the data they receive but only a limited class of hypotheses. This class of grammatical hypotheses H is the class of possible grammars children can conceive and therefore constrains the range of possible languages that humans can invent and speak. It is Universal Grammar (UG) in the terminology of generative linguistics.

Thus we see that there is a learnability argument⁶ at the heart of the modern approach to linguistics. The inherent intractability of learning a language in the absence of any constraints suggests that the only profitable

⁶ This is usually articulated as the Argument from Poverty of Stimulus (APS). There are strong and weak positions one can take on this issue and this has been the subject of much debate and controversy. The theoretical implausibility of tabula rasa learning and the empirical evidence relating to child language development suggest that H is a proper subset of the set of unrestricted rewrite rule systems (equivalent to Turing Machines). What the precise nature of H is and whether it admits a low dimensional characterization is a matter of reasonable debate. Over the course of this book, I work with certain plausible choices for illustrative purposes.


direction is to try and figure out what the appropriate constraints are. Linguistic theory attempts to elucidate the nature of the constraints that underlie H; psychological learning theory concentrates on elucidating plausible learning algorithms A. Together they posit a solution to the problem of language acquisition. Language acquisition is the launching point for our discussion of language change. If language acquisition is the mode of transmission of language from one generation to the next, what are its long-term evolutionary consequences over generational time? How do these relate to the historically observed trajectories of language change and evolution? This is the primary issue that I will attempt to resolve over the course of this book.
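As a small illustration of the probabilized setting mentioned a few paragraphs above, the extensional L¹(μ) distance between two languages can be computed directly when μ has finite support. The languages and the distribution below are invented for the example.

```python
# d(L1, L2) = sum over sentences s of mu(s) * |1_{L1}(s) - 1_{L2}(s)|,
# i.e., the mu-probability of the symmetric difference of the two languages.
def l1_distance(L1, L2, mu):
    return sum(p for s, p in mu.items() if (s in L1) != (s in L2))

# Toy distribution mu over four sentences, and two overlapping languages.
mu = {"s1": 0.4, "s2": 0.3, "s3": 0.2, "s4": 0.1}
L1 = {"s1", "s2", "s3"}
L2 = {"s1", "s2", "s4"}

print(l1_distance(L1, L2, mu))   # 0.2 + 0.1, printed as 0.30000000000000004 (floating point)
```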

1.2 Variation — Synchronic and Diachronic

A ubiquitous fact of human language is the variation that exists among the languages of the world. At the same time, the fact that language is learnable suggests that this variation cannot be arbitrary. In fact, theories of Universal Grammar attempt to circumscribe the degree of variation possible in the languages of the world. Since H characterizes the set of possible grammatical hypotheses humans can entertain, at any point in time or space each natural language corresponds to a particular grammar g belonging to H. For example, shown below are two sentences of Bengali (Bangla) with a word-for-word translation.

(2) a.  o    or    paisa    niye    shekhan    theke    dourolo.
        He   his   money    with    there      from     ran.

    b. *o    dourolo    theke    shekhan    niye    or    paisa.
        He   ran        from     there      with    his   money.

Clearly Bengali has a different system of grammaticality rules from English today, so that unlike English, (2b) is deemed ill-formed while (2a) is well-formed. Even if one ignores the fact that the two languages use different lexical items, it is easy to recognize that they use different linguistic (syntactic, in this case) forms to convey precisely the same meaning. The variation across languages might occur at several different levels. For a start, they might have different lexical items. Further, the system of rules that determine grammaticality might consist of phonetic, phonological, syntactic, semantic, pragmatic, and other considerations. Two languages might have different lexicons but similar syntactic systems, as is the case for


Hindi and Urdu, two languages spoken in large parts of South Asia. They might also have similar lexicons but different syntactic systems, as is often the case for dialects of the same language. Or they might share similar lexical and syntactic properties yet have very different phonological systems, as is the case for the different forms of English spoken around the world. While the modules governing the different aspects of the grammatical system of a language all need to be specified to define a full-blown grammar in UG, in particular inquiries of linguistic phenomena, one considers H to cover the variation that is relevant depending upon the domain under consideration.

My discussion so far has been as if languages have an existence that is independent of the individuals that speak them. Perhaps it is important to clarify my point of view. H denotes the set of possible linguistic systems that humans may possess. In any community, let the ith individual possess the system gi ∈ H. In a homogeneous community most of the gi's are similar and one might say that these individuals speak a common language, so that terms like “English”, “Spanish”, and so on refer to these communally accepted common languages. In general, though, there is always variation and these variants may be referred to as different idiolects, dialects, or languages based on social and political considerations. This variation refers to the synchronic variation across individuals in space at any fixed point in time.

This book concerns itself with variation along a different dimension — the variation in the language of spatially localized communities over generational time. Thus one could study the linguistic behavior of the population of the British Isles over generational time and, as I remarked in the opening section, this has shown some striking changes over the years. Indeed, historical phenomena and diachronic variation are properly the objects of study in historical linguistics, and this book presents a computational framework in which to conduct that study. Since the goal is to understand the possible behaviors of linguistic systems changing with time, we will be led to a dynamical systems framework and will derive several such dynamical systems over the course of this book.

The starting point for the derivation of such dynamical systems is a class of grammars H and a learning algorithm A to learn grammars in this class. To see the interplay between the two in a population setting, imagine for a moment that there were only two possible languages in the world, i.e., H = {h1, h2} defining the two languages Lh1 ⊂ Σ∗ and Lh2 ⊂ Σ∗ over a finite alphabet Σ. Consider now a completely homogeneous linguistic community where all adults speak the language Lh1 corresponding to the grammar h1. A typical child in this community receives example sentences, and utilizing a


learning procedure A, constructs grammatical hypotheses. Let us denote by hn the grammatical hypothesis the learning child has after encountering n sentences. Suppose that each child is given an infinite number of sentences to acquire its language so that limn→∞ d(hn, h1) = 0, i.e., the child converges to the language of the adults. This happens for all children, and the next generation would consist of homogeneous speakers of Lh1. There would be no change.

Now consider the possibility that the child is not exposed to an infinite number of sentences but only to a finite number N, after which it matures and its language crystallizes. Whatever grammatical hypothesis the child has after N sentences, it retains for the rest of its life. In such a setting, if N is large enough, it might be the case that most children acquire Lh1 but a small proportion ε end up acquiring Lh2. In one generation, a completely homogeneous community has lost its pure character7 — a proportion 1 − ε speak the original language while a proportion ε speak a different one. What happens in the third generation? Will the proportion ε grow further and eventually take over the population over generational time? Or will it decrease again? Or will it reach a stable ε∗? Or will it bounce back and forth in a limit cycle? It will obviously depend upon how similar the two languages Lh1 and Lh2 are, the size of N, the learning algorithm A, the probability with which sentences are presented to the learner, and so on. In order to reason through the possibilities, one will need a precise characterization of the dynamics of linguistic populations under a variety of assumptions. We will consider several variations to this theme over the course of this book.

Even a simplified setting like this is not without significant linguistic applications. In a large majority of interesting cases of language change, two languages or linguistic types come into contact and their interaction can then be tracked over the years through historical texts and other sources. For example, in the case of English, it is believed that there were two variants of the language — a northern variant and a southern one that differed in word order and grammaticality — and that contact between the two led to one variety sweeping through the population. I will consider this and several other cases in greater detail over the course of this book.

7. It is worth noting that one need not necessarily consider starting conditions that are homogeneous. The dynamics will relate the linguistic states of any two successive generations. One may then consider these dynamical systems from any initial condition — those that relate to mixed states corresponding to language contact may be of particular interest.
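Before deriving any dynamics formally, one can already get a feel for this thought experiment by simulation. The following sketch is my own construction for illustration, not a model defined at this point in the book: each child hears N sentences drawn from the parental generation, counts how many unambiguously betray h1 or h2 (with assumed probabilities a and b), keeps the grammar with more support, and guesses on a tie.

```python
# A minimal Monte Carlo sketch (illustrative assumptions throughout) of the
# two-language thought experiment: finite exposure N, a simple counting learner,
# and a population update from one generation to the next.

import random

def learn(alpha, N, a=0.3, b=0.2):
    """One child: count sentences diagnostic of h1 vs. h2 and keep the majority."""
    n1 = n2 = 0
    for _ in range(N):
        if random.random() < alpha:           # sentence produced by an h1 speaker
            n1 += random.random() < a         # diagnostic of h1 with probability a
        else:                                 # sentence produced by an h2 speaker
            n2 += random.random() < b         # diagnostic of h2 with probability b
    if n1 == n2:
        return random.choice(["h1", "h2"])    # no decisive evidence: guess
    return "h1" if n1 > n2 else "h2"

def next_generation(alpha, N, children=10000):
    """Fraction of the next generation acquiring h1, given a fraction alpha of h1 parents."""
    return sum(learn(alpha, N) == "h1" for _ in range(children)) / children

alpha = 1.0                                   # generation 0: a homogeneous h1 community
for t in range(8):
    print(t, round(alpha, 3))
    alpha = next_generation(alpha, N=5)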

1.3 More Examples of Change

The case of English syntactic change with which I opened this chapter is only one of a myriad of cases of historical change across linguistic communities of the world for which documented evidence exists. It is worth emphasizing, therefore, both that there is a genuine phenomenon at hand and that such phenomena are pervasive. Let us consider additional examples drawn from different linguistic subsystems and different regions of the world. They present interesting puzzles to work on.

1.3.1 Phonetic and Phonological Change

The earliest studies of historical change were often in the domain of sound change — phonetic and phonological changes occurring in various languages — and the Neogrammarian enterprise of the early twentieth century brought it to the center stage of historical linguistics.

The Great English Vowel Shift

In the Middle English (ME) period from the fourteenth to the sixteenth century, the long vowels of English underwent a cyclic shift so that pronunciations of words using these long vowels changed systematically. A simplified version of the cyclic shift of vowels is presented below (for more details, see Wolfe 1972).

Back Vowels

The back vowels are produced with the tongue body at the back of the vocal cavity, resulting in a lowered first formant (Stevens 1998). I will consider in this system the following four vowels: (1) the diphthong /au/ as in the modern English word “loud”; (2) /aw/ as in the modern English word “law”; (3) /o:/ as in the word “grow”; and (4) /u:/ as in “boot”. The pronunciations of the words in the ME phonological system went through the following cyclic shift:

/au/ → /aw/ → /o:/ → /u:/ → /au/

Thus, the word “law” (pronounced /law/ today) was pronounced differently as /lau/ in ME. See Table 1.1.

/au/ → /aw/    /aw/ → /o:/    /o:/ → /u:/    /u:/ → /au/
law            grow           boot           loud
saw            mow            moot           proud
bought         hose           loose          house

Table 1.1: A partial glimpse of the vowel shift in Middle English. Words which share the same vowel are shown in each column. Each of these words went through a systematic change in pronunciation indicated by the vowel shift shown in the top row. Thus “grow” (pronounced /gro:/ today) was pronounced /graw/ before. Words pronounced with an /o:/ before are pronounced with an /u:/ today (words in the third column).
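As a purely illustrative aside of my own (not the book's notation), the chain in Table 1.1 can be treated as a permutation on vowel symbols and applied mechanically to the ME-style forms above.

```python
# A toy rendering of the back-vowel drag chain as a mapping on vowel symbols.
SHIFT = {"au": "aw", "aw": "o:", "o:": "u:", "u:": "au"}   # /au/->/aw/->/o:/->/u:/->/au/

ME_VOWELS = {"law": "au", "grow": "aw", "boot": "o:", "loud": "u:"}   # vowel each word had in ME

for word, old_vowel in ME_VOWELS.items():
    print(f"{word}: /{old_vowel}/ -> /{SHIFT[old_vowel]}/")
# law: /au/ -> /aw/, grow: /aw/ -> /o:/, boot: /o:/ -> /u:/, loud: /u:/ -> /au/
```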

A similar cyclic shift occurred for front vowels. Thus at one point in time (before the fourteenth century) speakers in England pronounced words in a particular way using a vocalic system that was in place at the time. Consider a random child growing up in such an environment. Such a child would have presumably heard “house”, “mouse”, “proud”, and “loud” all being pronounced with the vowel /u:/. Why would they not learn the same pronunciation? One might argue that the actual pronunciation of words is sometimes sloppy and therefore listeners might misperceive the pronunciations of the words. However, sloppy pronunciation might have a random distribution around the canonical pronunciation, and in that case it is not clear at all that such random mispronunciation effects would have a directional effect and systematicity that would accumulate over generations. Even if a few children misconverged, what is the guarantee that the new pronunciation system will actually spread through the population over generational time? One might reasonably argue that there was either language variation or language contact resulting in two pronunciation systems existing in the population at some time so that competition between these two systems would have led to the gradual loss of one over time. In the absence of a deeper analysis, this argument seems speculative and evades the problem.

Then there is the matter of the cyclic nature of the change. Such cyclic changes are often referred to as drag chains in the historical linguistics literature. Because a particular vowel shifts, i.e., a pronunciation changes, it leaves a gap in the vowel system with an unutilized vowel. At the same time, unless other vowels shift too, a number of homophonous pairs will be created, leading to possible confusion. Imagine for a moment that /o:/ changed to /u:/ in word pronunciations. Therefore “boat” and “boot”


would be a homophonous pair (we are considering modern pronunciations here to make the point). In order to eliminate confusion, perhaps, speakers and listeners will feel compelled to shift the pronunciation of “boot”. This might now create new confusions (“boot” with “bout”, for example) that need to be eliminated, leading to further changes and so on in a chain reaction to the first change from /o:/ to /u:/. Again, a number of questions arise. Why, for example, don’t speakers and listeners simply exchange the vowels /o:/ and /u:/? That would fill the gap in the vowel system, eliminate homophony, and present a satisfactory solution. In order to reason coherently through the various possibilities without resorting to dubious arguments, one will need to tease apart several notions: individual learning by children; tendencies by speakers, listeners, and learners to avoid gaps and reduce homophonies; the fact that words are used with varying frequencies and some vowel mergers might have greater consequences for communication than others; and the effect of all of the above at a population level leading to systematicities in population behavior. It is almost impossible to examine the interplay between these factors by verbal argument alone. One will be compelled, therefore, to consider computational models in the spirit of those developed over the course of this book.

Phonological Mergers and Splits

Phonological mergers occur when two phonemes that are distinguished by speakers of the language stop being distinguished. This implies that certain acoustic-phonetic differences are no longer given phonological significance by users of the language. The reverse process occurs when a phoneme (typically allophonic variations) splits into two. Some examples of historical change along this dimension are illustrated below.

Sanskrit, Hindi, and Bengali

In Sanskrit, there were (and are) three different unvoiced strident fricatives that vary by place of articulation. These are shown below with the point of constriction of the vocal tract in producing these sounds varying from the front of the cavity to the back from 1 through 3.

1. /s/ alveolar-dental as in sagar (sea)
2. /xh/ retroflex as in purush (man)
3. /sh/ postalveolar as in shakti (energy)

Phonological mergers have occurred in two descendants of Sanskrit — Hindi and Bengali. In Hindi, the retroflexed and postalveolar fricatives have


merged into a single postalveolar (palatal) one so that the “sh” in purush is pronounced identically to the “sh” in shakti. Thus there are only two strident (unvoiced) fricatives in the phonological system of the language. In Bengali, all three have merged into a single palatal fricative so that the fricatives in sagar, purush, and shakti are all pronounced identically. The words in question are Sanskrit originals that have been retained in the daughter languages with altered pronunciations. Interestingly, the orthographic system used in writing preserves the distinction between each of the three different fricatives, so a different symbol is used for the fricatives in 1, 2, and 3 although they are pronounced in the same way by Bengali speakers. Similarly, Hindi inherited the Devanagari orthographic system of Sanskrit and distinguishes the fricatives in the written form although 2 and 3 have merged.

An example of a similar merger can be considered from Spanish, where an ancestral form of the language had both /b/ and /v/ as distinct phonemes. In the modern version of the language, these phonemes are merged. However, the old spelling has been retained so that boto (meaning “dull”) and voto (meaning “vote”) are spelled differently yet both are pronounced with a word-initial /b/ by modern Spanish speakers.

Wu Dialect in Wenzhou Province

Zhongwei Shen (1997) describes two detailed studies of phonological change in the Wu dialects. I consider here as an example the monophthongization of /oy/ resulting in a phonological merger with the rounded front vowel /o/. This sound change is apparently not influenced by contact with Mandarin and is conjectured to be due to phonetic similarities between the two sounds. These two phonological categories were preserved as distinct by many speakers, but over a period of time, the distinction was lost and their merger created many homophonous pairs. Thus, the word for “cloth” — /poy/42 — now became homophonous with the word for “half” — /po/42, and similarly, the word for “road” — /loy/ — became homophonous with the word for “in disorder” — /lo/11. A list of 35 words with the diphthong /oy/ is presented in Shen 1993, and some of these are reproduced in Table 1.2.

/poy/42    “cloth”
/doy/31    “graph”
/moy/31    “to sharpen”
/toy/42    “jealous”
/soy/42    “to tell”

Table 1.2: A subset of the words of the Wu dialect that underwent change over the last one hundred years. The vowels were all diphthongs that changed to monophthongs. The numeric superscript denotes the tonal register of the vowel (unchanged).

The phonetic difference between the two sounds lies in movements of the first and second formants. Both of the sounds in question are long vowels. The monophthong /o/ has a first formant at around 600 Hz and a second formant at 2200 Hz. The diphthong /oy/ has a first formant that starts around 600 Hz and gradually drops down to 350 Hz, while the second formant increases slightly above 2200 Hz. The change from diphthong to monophthong can in principle be gradual, with no compelling phonetic reason to make this change abrupt. Each word participating in the change has two alternative pronunciations in the population: the original pronunciation using the diphthong and an altered pronunciation using the monophthong. At one point all speakers used the original pronunciation. Gradually speakers adopted the other pronunciation and today, everyone uses the monophthongized pronunciation of the word. I consider this example in some detail later in the book (Chapter 8). In particular, I will examine several plausible learning mechanisms and work out their evolutionary consequences for the case when two distinct linguistic forms are present in the population. By doing so, we will arrive at a better understanding of the stable modes of the linguistic population and under what conditions a switch from one stable mode to another might happen.

Other Assorted Changes

A wide variety of phonetic and phonological changes have been studied and discussed in the rich literature on historical linguistics and language change. Let us briefly consider a few more examples for illustrative purposes. Yiddish is descended from Middle High German (MHG), which is itself descended from Old High German (OHG). In OHG, words could end in voiced obstruents, e.g., tag for “day”. A change occurred from OHG to MHG so that word-final voiced obstruents were devoiced (see the discussion in Trask 1996). Thus tag became tac, and gab (“he gave”) became gap, weg (“way”) became wec, aveg (“away”) became avec, and so on. This can be expressed as the rule

[+obstruent, +voice] → [+obstruent, −voice] / ___ #
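As a purely illustrative aside of my own (not the book's), the rule is easy to render as a toy computation; the orthographic substitutions g→c, b→p, d→t below simply mirror the spellings in the examples above and are an assumption about how to display the devoiced segments.

```python
# A toy illustration of OHG-to-MHG final devoicing: word-final voiced obstruents
# become voiceless, shown here with orthographic stand-ins for the segments.

DEVOICE = {"g": "c", "b": "p", "d": "t"}

def final_devoice(word: str) -> str:
    """Apply [+obstruent, +voice] -> [-voice] / ___ # to a single word form."""
    if word and word[-1] in DEVOICE:
        return word[:-1] + DEVOICE[word[-1]]
    return word

for w in ["tag", "gab", "weg", "tage", "wege"]:
    print(w, "->", final_devoice(w))
# tag -> tac, gab -> gap, weg -> wec; tage and wege are unchanged, since their
# obstruents are not word-final -- these are the alternating pairs discussed below.
```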

Of course, voiced obstruents that are not in word-final position remain voiced. Thus the plural forms of the words are tage (“days”) and wege (“ways”). Modern German retains this rule. In Yiddish, on the other hand, the forms of the same words are tog (“day”), weg (“way”), and so on. At the same time, words without alternations8 such as avec are pronounced with a voiceless stop. Consider now the sequence of transformations from OHG to MHG to Yiddish. The devoicing rule was added from OHG to MHG. Now one could postulate that (i) a new rule was added in Yiddish so that word-final unvoiced consonants were voiced, or (ii) the devoicing rule that was added in MHG was simply lost again. If (i) were true, then it would not explain why avek remains unvoiced. Therefore (ii) must be closer to the truth. A plausible explanation is that words where alternations provide clues as to their underlying form (such as tac-tage) were reanalyzed as voiced. Words without alternations suggesting the possibility of a voiced underlying form were analyzed as unvoiced. Since the devoicing rule was lost anyway, the reanalyzed form was not subject to devoicing, which explains the modern form of Yiddish words.

A number of issues now arise. Rules are part of the phonological grammatical system. Why would rules arise and be lost? One explanation might lie in variation existing in the population. Perhaps some portion of the population had the devoicing rule and some did not. Given conflicting data, it is possible that some children acquired the devoicing rule while others did not. How might learning by children, frequency of usage of different forms, and variation in the population interact to create the circumstances under which a rule might be gained and the circumstances under which a rule might be lost? We need ways of thinking about these issues in order to sharpen our understanding of the factors involved.

Like the Chinese example of Shen described earlier, another example of a linguistic change in progress comes from William Labov’s pioneering study of vowel centralization on Martha’s Vineyard. In the speech of Martha’s Vineyard, the diphthongs /ai/ and /au/ as in “light” and “house” are centralized. This is unusual for New England, and Labov studied a large number of subjects of varying ages with respect to the degree of centralization of each of the diphthongs. A measure of centralization called the centralization index (CI) was constructed and could be plotted for each subject by age. Strikingly, it was observed that centralization decreased with age, with the oldest group having the lowest CI. The youngest group, however, had a low CI too, suggesting that centralization increased over time and then started decreasing again. This can be related to occupation, social stratification, and degree of identification with the island, and serves as an example of social forces interacting with linguistic forces that has been studied in a quantitative manner in the sociolinguistic tradition pioneered by Labov (see Labov 1994 for an account).

8. Root words that had inflections where the relevant obstruent was voiced in some cases but not in others are referred to as alternations. Thus tac-tage and wec-wege are alternations.

1.3.2 Syntactic Change

As I have discussed before, changes in the grammatical properties of linguistic populations occur in many different linguistic domains, and here I review some cases of syntactic change.

French

Old French had a number of properties, including (i) V2 — the tendency of (finite) verbs to move to second position in matrix clauses, and (ii) pro-drop — the ability to drop the pronominal subject from a sentence without sacrificing the grammaticality of the resulting expression. Let us examine the case of pro-drop for a moment. In some languages of the world, like Modern English, the pronominal subject of a sentence has to be present in the surface form for the sentence to be deemed grammatical in that language. Thus in the English sentences below, (3a) is grammatical while (3b) is not.

(3) a.  He went to the market.
    b. *Went to the market.

Modern Italian, on the other hand, allows one to drop the subject if the putative subject can be unambiguously inferred by pragmatic or other considerations. Thus both (4a) and (4b) (meaning “I speak”) are grammatical.

(4) a.  Io parlo.
    b.  Parlo.

Or consider another Italian sentence with pro-drop.

Giacomo   ha    detto   che    ha          telefonato.
Giacomo   has   said    that   (he) has    telephoned.

It has been suggested that this aspect of syntactic structure defines a typological distinction between languages of the world, with some allowing pro-drop and others not. It turns out that Old French used to allow pro-drop, while Modern French does not. Consider the following two sentences taken from the discussion on French change in Clark and Roberts 1993.

Ainsi   s’amusaient bien   cette   nuit.
thus    (they) had fun     that    night.

and

Si      firent        grant   joie   la    nuit.
thus    (they) made   great   joy    the   night.

Both these sentences are ungrammatical in Modern French. Again we are led to the usual puzzles. At one time, French children would have had enough exposure to the language of their times that they would have learned that pro-drop was allowable and acquired the relevant grammatical rule, much as Italian children do today. Why then did they stop acquiring it? Maybe a few didn’t acquire it, the frequency of usage became rare, it triggered the rule in fewer and fewer children as time went on, and ultimately it died out. This story needs to be made more precise with data, models, and a deeper understanding of the interaction of learning, grammar, and population dynamics. Clark and Roberts (1993) and Yang (2002) have taken steps in this direction, and we will revisit this problem later in the book.

Yiddish

Yiddish is the language of Jews of Eastern and Central Europe and is descended from medieval German with considerable influence from Hebrew and Slavic languages as well. Like English and French, Yiddish underwent some remarkable syntactic changes, leading to different word-order formations in the modern version of the language. One particular change had to do with the location of the auxiliary verb with respect to the subject and the verb phrase in clauses. Following Chomsky 1986, one might let the auxiliary verb belong to the functional category INFL (which bears inflectional markers) and thus distinguish between the two basic phrase-structure alternatives as in (5a) and (5b).

(5) a. [Spec [VP INFL]]IP
    b. [Spec [INFL VP]]IP

The inflectional phrase (IP) describes the whole clause (sentence) with an inflectional head (INFL), a verb-phrase argument (VP) for this INFL head, and a specifier (Spec). The item in specifier position is deemed the subject of the sentence. In Modern English, for example, phrases are almost always of type (5b). Thus the sentence (6)

(6) [John [can [read the blackboard]VP]]IP

corresponds to such a type with “John” being in Spec position, “can” being the INFL-head, and “read the blackboard” being the verb phrase. If we deem structures like (5a) to be INFL-final and structures like (5b) to be INFL-medial, we find that languages on the whole might be typified according to which of these phrase types is preponderant in the language.9 Interestingly, Yiddish changed from a predominantly INFL-final language to a categorically INFL-medial one over the course of a transition period from 1400 A.D. to about 1850 A.D. Santorini 1993 has a detailed quantitative analysis of this phenomenon, and shown below are two unambiguously INFL-final sentences of early Yiddish (taken from Santorini 1993). Such sentences would be deemed ungrammatical in the modern categorically INFL-medial Yiddish.

ds     zi     droyf      nurt   givarnt   vern
that   they   there-on   only   warned    were
(Bovo 39.6, 1507)

ven   der   vatr     doyts    leyan   kan
if    the   father   German   read    can
(Anshel 11, 1534)

To illustrate this point quantitatively, a corpus analysis of Yiddish documents over the ages yields the statistics shown in Table 1.3. Clauses with simple verbs are analyzed for INFL-medial and INFL-final distributions of phrase structures. More statistics are available in Santorini 1993, but this simple case illustrates the clear and unmistakable trend in the distribution of phrase types.

9. It should be mentioned that while this typological distinction is largely accepted by linguists working in the tradition of Chomsky 1981, there is still considerable debate as to how cleanly languages fall into one of these two types. For example, while Travis (1984) argues that INFL precedes VP in German and Zwart (1991) extends the analysis to Dutch, Schwartz and Vikner (1990) provide considerable evidence arguing otherwise. Part of the complication often arises because the surface forms of sentences might reflect movement processes from some other underlying form in often complicated ways. But this is beyond the scope of this book.

Time Period    INFL-medial    INFL-final
1400–1489      0              27
1490–1539      5              37
1540–1589      13             59
1590–1639      5              81
1640–1689      13             33
1690–1739      15             20
1740–1789      1              1
1790–1839      54             3
1840–1950      90             0

Table 1.3: Relative numbers of INFL-medial and INFL-final structures in clauses with simple verbs (at different points in time). Taken from the study of the history of Yiddish in Santorini 1993.

It is worth mentioning here that while Santorini 1993 expresses the statistics within the notational conventions of Chomsky 1986, almost any reasonable grammatical formalism would capture this variation and change, with two different grammatical types or forms in competition with one gradually yielding to the other over generational time. Again one might wonder about the causes of such a change, the stable grammatical modes of populations, the directionalities involved, and the like. As quantitative measures of the sort described here are made to characterize the historical phenomena at hand, one is led irrevocably toward quantitative and computational models to attain a deeper understanding of the underlying processes involved.
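To make the trend explicit, here is a small computation of my own over the counts in Table 1.3; it does nothing more than report the proportion of INFL-medial clauses in each period, the quantity whose trajectory the following sections discuss.

```python
# Proportion of INFL-medial clauses per period, computed from Table 1.3.

periods = ["1400-1489", "1490-1539", "1540-1589", "1590-1639", "1640-1689",
           "1690-1739", "1740-1789", "1790-1839", "1840-1950"]
medial  = [0, 5, 13, 5, 13, 15, 1, 54, 90]
final   = [27, 37, 59, 81, 33, 20, 1, 3, 0]

for p, m, f in zip(periods, medial, final):
    total = m + f
    share = m / total if total else float("nan")
    print(f"{p}: {share:.2f} INFL-medial ({m}/{total})")
# The share climbs from 0.00 in 1400-1489 to 1.00 in 1840-1950; note the tiny
# sample (1 + 1 clauses) in 1740-1789, where the estimate is unreliable.
```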

1.4 Perspective and Conceptual Issues

This book is a computational treatise on historical and evolutionary phenomena in human language. At the outset it may not be entirely clear that there are meaningful computational questions and that such a computational treatment is possible, profitable, or necessary in the discourse on historical linguistics. After all, one does not typically study human social and political history with computational tools. On the other hand, evolutionary biology is today a heavily mathematized discipline. In fact the mathematization of evolutionary biology began in the early twentieth century to resolve the apparent conflict between the ideas of Mendel, Darwin, and other evolutionary thinkers — conflicts that were difficult to resolve by verbal reasoning alone.


Human language is interesting because it is in part cultural and in part biological. The part that is biological belongs more readily to the natural sciences and is amenable to a treatment by the usual modes of inquiry in the natural sciences. I have tried to illustrate in previous sections some of the examples of language change that belong to this domain and some of the issues that arise in the study of such phenomena for which a computational analysis becomes necessary. The overall rationale behind such an approach and the possibility of a computational treatment rests on three aspects of language that are central to my point of view and worth highlighting separately.

1. Language has form. The linguistic objects of distinctive features, phonemes, syllables, morphemes, words, phrases, and sentences have reasonably concrete representations and display systematic regularities that give language form. This formal aspect separates it from amorphous cultural convention and makes it amenable to study by formal or mathematical means. Indeed, the discipline of formal language theory evolved in part to provide the apparatus to describe this formal structure and associated linguistic phenomena. Interestingly enough, grammars, automata, and languages are central also to investigation in logic and computer science, and many of the ideas I present in this book are possible to articulate only because of this link between computer science and linguistics. One might quibble about the details of this form, about grammaticality judgments, about competence and performance issues, about functionality issues. One might argue that the true goal of language is communication after all, that the meaning of sentences is paramount and their form not all that sacrosanct. However, one will still have to concede that we don’t speak word salad. Of all the different ways of conveying the same meaning, a particular language will choose a limited number of ways to give form to that meaning. Thus English chooses (1a) while Bengali (2b) — it could easily have been the other way around and indeed in 800 A.D. it was. When one moves away from syntactic to phonological phenomena the link between form and meaning becomes even more remote, and it is in some ways easier to recognize this strict yet arbitrary formal aspect of language in phonological systems.

2. Language is learned. Unlike other modalities like vision or olfaction, where the role of learning is unclear beyond some plasticity in the neural apparatus, language is clearly and indisputably learned. When we are born we don’t know language. We are exposed to linguistic data and we learn it. In many ways, it fits quite neatly into the framework of learning from examples, and in fact the field of formal inductive inference arose to study the tractability of the problem of language acquisition. The ability to learn has been a central topic of investigation in artificial intelligence, and a variety of computational tools ranging from abstract theory to computer simulation have been brought to bear in this enterprise.

3. Languages vary. Variation across the languages of the world is a ubiquitous fact of human existence. In many ways it might have been quite convenient if they did not vary at all — if there was one perfect language that was hardwired in our genes and we all grew up speaking the same language. While this is not true, in some ways perhaps it is not far from the truth, for while we are not born with the details of a particular language, it is likely that we are born with the class H that limits possible variations in some sense. This book attempts to create the computational framework for studying diachronic variation.

Thus the mathematical and computational tools that will be utilized to characterize each of these aspects of language are

1. Formal Language Theory and related areas to describe linguistic form and linguistic structures.

2. Learning Theory to characterize the problem of language acquisition and learning.

3. Dynamical Systems to characterize the diachronic evolution of linguistic populations over time.

In the rest of the book we will see how these different areas of mathematics come together in our computational approach to the problem. As we proceed, we will need to tease apart several issues that need to be kept in mind for a complete treatment of historical phenomena in linguistics. Indeed, historical linguists have considered these phenomena at various points in time.

1.4.1 The Role of Learning

Clearly language is acquired by children — most significantly from the input provided by the previous generation of speakers in the community. The idea


that language change is contingent on language learning has been a longstanding one. As early as the nineteenth century we have the following observations:

...the main occasion of sound change consists of the transmission of sounds to new individuals. (Paul 1891, 53–54)

More strikingly, the British linguist Henry Sweet argued that

...if languages were learned perfectly by the children of each generation, then language would not change: English children would still speak a language as old at least as Anglo-Saxon and there would be no such languages as French or Italian. (Sweet 1899, 75)

More recently, Halle (1962), Kiparsky (1965), Weinreich, Labov, and Herzog (1968), Wang (1969, 1991), and Ohala (1993) have invoked the connection between language change and language learning in explicit or implicit ways in the phonological domain. Similarly, in syntax, Lightfoot (1979, 1998), Roberts (1992), Kroch (1989, 1999), and Yang (2002), among others, have argued this connection strongly. This book contributes to the effort to explore systematically the precise nature of the relationship between language acquisition and language change.

1.4.2 Populations versus Idiolects

Isolated instances of mislearning or idiosyncratic linguistic behavior are clearly of little consequence unless they spread through the community over time to result in large-scale language change. In any meaningful discourse on language change, one therefore needs to distinguish between the population and the individuals in it. Individual speaker-hearers (language users) might differ from each other at any single point in time and this characterizes the synchronic variation in the population at that point in time. However, one can also discuss average characteristics of the population as a whole and in some sense, when one talks about a language changing with time, one is talking about the average characteristics of the population changing over successive generations. After all, an individual occupies only one generation. Historical linguistics often confuses this issue. Part of the reason is that our data about language change often comes from individual writers. Strong


trends in different individual writers over successive generations are certainly suggestive of larger-scale population-level effects but do not necessarily imply them. Mufwene (2001), Labov (1994), and Shen (1997) have in various ways emphasized this difference. Shen (1997) provides the source of the Wenzhou data that is discussed in a later chapter; the data arose from explicitly sampling multiple people in the population for each generation. An important goal of this book is to explore the relationship between change at the individual level and change at the population level.

1.4.3 Gradualness versus Abruptness (or the S-Shaped Curve)

The rate and time course of language change have been the object of study and speculation by historical linguists for some time. Since most linguistic changes are ultimately categorical ones, the possibility exists for a language to change categorically — and therefore abruptly — from one generation to the next. Empirical studies of the process have always yielded, however, a more graded behavior, and much has been made of the so-called S-shaped curve denoting the change in linguistic behavior (average population behavior, typically) over successive generations. Bailey (1973, 77) discusses the S-curve:

A given change begins quite gradually; after reaching a certain point (say, twenty percent), it picks up momentum and proceeds at a much faster rate and finally tails off slowly before reaching completion. The result is an S-curve, ... (Bailey 1973, 77)

Similarly, Osgood and Sebeok (1954) discuss the S-shaped nature of change while introducing the notion of community (population) and the possibility of change being actuated by children (learning):

The process of change in a community would most probably be represented by an S-curve. The rate of change would probably be slow at first, appearing in the speech of innovators, or more likely young children; become relatively rapid as these young people become the agents of differential reinforcement; and taper off as fewer older and more marginal individuals remain to continue the old forms. (Osgood and Sebeok 1954)

Weinreich, Labov, and Herzog (1968, 133) also discuss the S-shaped curve as follows:


...the progress of language change through a community follows a lawful course, an S-curve from minority to majority to totality. (Weinreich, Labov, and Herzog 1968, 133)

As we see, for some time now there has been a discourse on the importance and pervasiveness of the S-shaped change in historical linguistics. Of course, the “knee” of the S could be sharp, reflecting a sudden transition from one linguistic usage to another, or it could be gradual over many centuries. Lightfoot (1998) argues that many of the changes in English syntax from Old to Middle to Modern English were actually quite categorical and sudden. Why should the changes be S-shaped? A historicist account would claim this to be one of the “historical laws” that govern language change with time. An alternative position — and one I explore in this book — would consider this to be an epiphenomenon. I attempt to derive the long-term evolutionary consequences of short-term language learning by children. As a result, the book provides some understanding of when trajectories can be expected to be S-shaped. The collection of quantitative historical (or pseudo-historical) datasets, along with an interest in explaining qualitative S-shaped behavior, has prompted researchers in recent times (Kroch 1989; Shen 1997) to explore mathematical models of the phenomena. I discuss them at a later point in this book.
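For readers who prefer to see the shape rather than imagine it, the following sketch simply tabulates a logistic function, the curve most often used to model such trajectories (for example, in the logistic analyses of Kroch 1989); the slope and midpoint values are arbitrary choices of mine.

```python
import math

def logistic(t, s=1.2, t0=5.0):
    """Proportion of the incoming variant at time t: slope s, midpoint t0."""
    return 1.0 / (1.0 + math.exp(-s * (t - t0)))

for t in range(11):   # slow start, rapid middle, slow approach to completion
    print(t, round(logistic(t), 3))
```

One attraction of this parameterization is that the log-odds log(p/(1 − p)) = s(t − t0) are linear in time, so frequencies of competing forms of the kind in Table 1.3 can be assessed by fitting a straight line to logit-transformed data; whether the knee of the S is sharp or gradual is then just a matter of the slope s.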

1.4.4 Different Time Scales of Evolution

It is worth noting that there are two distinct time scales at which one can study the evolutionary history of linguistic systems. One time scale corresponds to historical linguistics, i.e., the period after modern humans arose and the human language faculty was in place. Much of our discussion so far has been at this time scale. We have seen how the linguistic systems of humans living in different geographic regions have undergone change (evolution) over time. A second time scale corresponds to the origin and evolution of the human language faculty from prelinguistic versions of it that may have existed in our prehuman ancestors. In a discussion of the major transitions of evolution, Maynard Smith and Szathmary (1995) consider the evolution of human language to be the last major transition. These two time scales present interesting similarities and differences. In both, one needs to concern oneself with population dynamics, individual learning, social networks, and linguistic systems. However, it is likely that


on historical time scales, natural selection (differential reproductive fitness based on communicative advantage) is less important than it is on evolutionary time scales. Another matter of importance is the range of data available to empirically ground theories and explanations. If one is interested in human language, much more data is available at historical time scales while almost none is available at evolutionary time scales. For studies at evolutionary time scales, therefore, one will probably have to resort to cross-species comparative studies across different animal communication systems (see Hauser 1997 for this point of view). What can be said about human language as a result of such comparisons remains unclear. In Parts II and III of this book, my discussion is mostly about human language and examples from various linguistic systems are provided. The discussion in Part IV, however, has considerable relevance to both animal and artificial communication systems and should be read with this thought in mind.

1.4.5 Cautionary Aspects

Long-term change in a language is complicated by several compounding factors. First, sociopolitical considerations often enter the picture. The undue influence of one person or group of persons on society might result in the propagation of their linguistic preference across the population over time. Prestige, power, and influence are difficult to formalize and model precisely and are often best left alone in this regard. I will concentrate in this book on those kinds of phenomena for which we believe a linguistic rather than extralinguistic (sociological) explanation is possible or likely. Nevertheless, I am acutely aware that explanatory possibilities from sociopolitical considerations need to be carefully considered at all times, for “naturalistic” explanations are often proffered too eagerly while the underlying causes reside elsewhere. As a matter of fact, the interaction of social forces with linguistic considerations is explored fully in the kind of quantitative sociolinguistic work pioneered by Weinreich, Labov, and Herzog 1968 and discussed at length in Labov 1994.

A second complication arises from the nature of the data available and the testability of theories. Because the discipline is inherently a historical one, it is hard to replay the tape of life or conduct experiments of any sort (see Ohala 1993 for an interesting suggestion of laboratory experiments simulating sound change). At the same time, historical records often show clear patterns of regularity — with data so strikingly regular and abundant that the force of the phenomena becomes compelling. Of course this problem is not peculiar


to linguistics and is shared by all scientific disciplines that focus on historical phenomena from evolutionary biology to cosmology. In recent times, the collection of large electronic corpora of linguistic facts and documents (see, for example, the collections of the Linguistic Data Consortium) has also provided fruits for historical linguistics. The Penn Helsinki corpus of Middle English consists of a million parsed sentences from a variety of texts in the Middle English period, from which it is possible to collect statistics on the frequency of various kinds of constructions and track their change with time. This is beginning to be repeated for a number of other languages. For example, texts of European Portuguese in the period from 1600 to 1800 A.D. have been collected and are beginning to be annotated and made available in computer format as part of the Tycho Brahe project (Galves and Galves; see http://www.ime.usp.br/∼tychobrahe). Field studies of languages changing with time in the last fifty years have been conducted in a variety of languages — including creoles (see Mufwene 2001; DeGraff 2001; Rickford 1987), sign languages (Senghas and Coppola 2001), Chinese (Shen 1997), American English dialects (Labov 1994; Christian, Wolfram, and Bube 1988), British English (Milroy and Milroy 1985, 1993; Bauer 1994) and so on — that provide the empirical base on which historical linguistics and language change are founded. It is not possible to do justice to the variety of such field studies and I will cite and deal with only a limited number of case studies over the course of this book.

1.5 Evolution in Linguistics and Biology

Each of the issues discussed above arises albeit in a different form in the domain of evolutionary biology. Heritability and modes of transmission of genetic information leading to the similarity between children and parents is a crucial feature of biological populations. A variety of such modes of transmission from sexual to asexual reproduction have been considered by evolutionary biologists. The interaction of genetically transmitted information and inputs from the environment contributes to the developmental sequence of a biological organism and its ultimate mature form. In the case of human language, the mode of transmission is learning rather than genetic reproduction. Learning by children results in linguistic similarity between parent and child. However, parents are not the only influence on the child’s linguistic development; the general linguistic environment of the growing child plays a role as well. Thus while sexually reproducing biological organisms inherit their genetic makeup from


their parents, children acquire their language based on the linguistic composition of the parental generation at large. For this reason, it is likely that language evolution is more like epidemiology or ecology than like Mendelian genetics.

Population thinking pervades all of biology. The individual organism and the population of which the organism is a part are distinguished, and the entire field of population biology attempts to work out the population dynamics resulting from individual interactions that may vary from biological reproduction to strategies for survival in predator-prey systems. From gradualness to punctuated equilibria, biologists have pondered the various dynamical aspects of changing populations (see Hofbauer and Sigmund 1988 or Nagylaki 1992 for mathematical reviews). Because individual learning is the mechanism by which language is transmitted from the speakers of one generation to those of the next, the theory of learning will play a central role in the development of the evolutionary models that I consider in this book. As a result, the precise nature of these models and of the mathematics that surrounds their analysis is quite distinct from that encountered in the literature on evolutionary biology.

The importance of the population as an object of study results in an emphasis on observing and characterizing variation and typology within the population. Evolutionary biologists since the time of Darwin have been interested in the diversity of biological life forms — how it arises, what maintains it, and how it evolves. Correspondingly, linguists have always been interested in linguistic diversity in space and time. In this context, it is interesting to reflect upon Lewontin’s (1978) review of sufficient conditions for evolution by natural selection. These are

1. There is variation in . . . behavioral traits among members of a species (the principle of variation);

2. The variation is in part heritable . . . in particular, offspring resemble their parents (the principle of heredity);

3. Different variants leave different numbers of offspring either in immediate or in remote generations (the principle of differential fitness).

In the case of language evolution, we are interested in linguistic behavior. As I have noted already, in any population there is variation in linguistic behavior. This variation is in part inheritable though the mechanism of inheritance is based on learning — not just from the parents but from a larger group of people.


The case of differential fitness is a tricky point. There is no obvious sense in which speakers of one linguistic variant are reproductively more successful than speakers of another in recent historical times. It is also not clear that natural languages are getting “fitter” in any sense over historical time though this may be a matter of some debate. Fitness may be viewed, however, as the differential transmission of linguistic variants, and the role of communicative efficiency, principles of “least effort” (Zipf 1948) in production and perception of speech, and the differential ease of learnability of different features of a language will all need to be sorted out. When one considers evolutionary scenarios in prehistoric times and questions like the origin of language in modern humans or the evolution of different kinds of communication systems in the animal world in general, then it is quite possible that communicative ability plays a role in survival or mate selection and therefore has direct bearing on reproductive success.

1.5.1 Scientific History

A little bit of historical perspective on these parallels and differences between language and biology is helpful. Since the discovery of the relatedness of the members of the Indo European family of languages by William Jones (see Collected Works, Volume III) in the late eighteenth century, historical linguistics dominated the research agenda of much of the nineteenth century. Linguistic family trees were constructed by various methods in an attempt to uncover relatedness and descent of languages. Darwin, living in that century, was by his own admission greatly influenced by these ideas and several times in The Descent of Man, he likens biological diversity to linguistic diversity. Species were like languages. Reproductive compatibility was like communicative compatibility. Like languages, species too could evolve over time and be descended from each other: The formation of different languages and of distinct species, and the proofs that both have been developed through a gradual process, are curiously parallel.* But we can trace the formation of many words further back than that of species, for we can perceive how they actually arose from the imitation of various sounds. We find in distinct languages striking homologies due to community of descent, and analogies due to a similar process of formation. The manner in which certain letters or sounds change when others change is very like correlated growth. We have in both cases the re-duplication of parts, the effects of long-

continued use, and so forth. The frequent presence of rudiments, both in languages and in species, is still more remarkable. . . . Languages, like organic beings, can be classed in groups under groups; and they can be classed either naturally according to descent, or artificially by other characters. . . . The survival or preservation of certain favoured words in the struggle for existence is natural selection. (Descent of Man, C. Darwin 1871)

Both Jones and Darwin were radicals in their own ways. To suggest that Sanskrit (the language of the colonized Indians) was in the same family as Latin and Greek (languages with which the imperial masters identified strongly) was against the ingrained notions of those colonial times. To suggest that humans and apes belonged to the same broader family of primates went strongly against the deeply held beliefs of those religious times. At various points since the promulgation of evolutionary theory by Darwin and Mendel, linguists have taken up the analogy between biological evolution and language change. For example, the German scholar August Schleicher did much work on Indo-European linguistic-tree reconstruction and was influenced both by Linnaeus’ taxonomy and Darwin’s ideas. In 1863, he published a manuscript entitled The Darwinian Theory and the Science of Language. Similarly, the Danish linguist Otto Jespersen was also influenced by the Darwinian approach and advanced the view that there was an evolutionary scale and languages were on the whole getting “fitter” and more efficient with the passage of time. In recent times, with the rise of generative linguistics, which has viewed language as a distinct cognitive and therefore biological trait, these analogies have become more precise. For example, Lightfoot (1998), Kroch (1999), McMahon (2000), Piattelli-Palmarini (1989), Pinker (1994), Aitchison (2000), Wang (1991), Mufwene (2001), Croft (2000), and Jenkins (2001), among others, have elaborated on this connection.

In the twentieth century, both the politics and the science changed. Particularly following the cognitive revolution in linguistics identified most strongly with Chomsky, there was a shift in focus from diachronic to synchronic phenomena as the object of study. Linguistic structure and its acquisition were better understood. In biology, following the genetic revolution brought about by Watson and Crick, the genetic basis of biological variation began to be probed and evolutionary theory quickly incorporated these mechanisms into its models and explanations. Similarly, for the last


twenty years, the insights from generative grammar and mechanisms of language acquisition have been used to reexamine the issues and questions of historical linguistics and language evolution. However, historical and evolutionary points of view are vastly more vigorous in biology today than they are in linguistics. It is time, perhaps, to change that. Clearly, biological systems are extremely complex, and biological form is arguably even harder to characterize quantitatively than linguistic form. The forces shaping biological evolution are by no means simpler than those shaping linguistic evolution. Yet biologists have recognized the utility of computational and mathematical thinking to reason through the morass of possibilities and seeming tautologies to shed light on important trends. For more than seventy years now, evolutionary biology has been steadily mathematized with a wide range of models, from probability theory to game theory. For a random sampling of this aspect of the field, see Fisher 1930; Wright 1968–1978; Crow and Kimura 1970; Maynard Smith 1982; Hofbauer and Sigmund 1988. Given this trend, and given that many aspects of linguistic inquiry were greatly mathematized by the Chomskyan revolution in the 1950s, it is somewhat surprising that the study of language evolution has avoided mathematical analysis until recently. Over the last decade, however, a significant body of work has begun to emerge on computational approaches to the problem, opening up the way to such modes of inquiry into historical linguistics and language evolution. (References are too numerous to cite in full. A partial list includes Yang 2002; Niyogi and Berwick 1997; Steels 2001; Clark and Roberts 1993; Freedman and Wang 1996; Shen 1997; Briscoe 2000; Batali 1998; Kirby 1999; Hurford 1989; Hurford and Kirby 2001; Nowak and Krakauer 1999; Nowak, Komarova, and Niyogi 2001; Cucker, Smale, and Zhou 2004; Abrams and Strogatz 2003; Wang, Ke, and Minett 2004; Cangelosi and Parisi 2002; Christiansen and Kirby 2003. Since 1996, an International Conference on Language Evolution has been held every two years. See also the website http://www.isrl.uiuc.edu/amag/langev/ for more references.)

Finally, no account would be complete without noting that there has also been a distinct tradition in the study of cultural evolution that has many potential points of contact with language evolution. Pioneering mathematical accounts of cultural evolution have been proposed by Cavalli-Sforza and Feldman (1981), Boyd and Richerson (1985), and Axelrod (1984). Of these, the first-mentioned work has the greatest overlap with the approach in this book. Chapter 9 is devoted to similarities and differences between the two approaches. More recently, at a very different level of analysis, an empirical
study of the similarities between genes, people, and languages is reported in Cavalli-Sforza 2001. Ruhlen (1994, 1997), in a number of scholarly works following in the tradition of Greenberg (1966, 1974, 1978), constructs phylogenetic trees using techniques from classical comparative linguistics, influenced by perspectives from biological and cultural evolution.

1.6 Summary of Results

As I have noted above, there are many similarities and differences between evolutionary processes in linguistics and biology. Rather than dwell too much on analogies, I will develop the logic of language evolution on its own terms over the course of this book. Let me reiterate my point of view.

1. Linguistic behavior and underlying linguistic knowledge may be characterized as a formal system. Let H represent the range of such formal systems that humans could possibly possess.

2. Variation within any population (at time t) may be characterized by a probability distribution Pt on H. For any h ∈ H, Pt(h) represents the proportion of individuals using the system h.

3. An individual child born within this population will acquire language based on a learning algorithm A that maps its primary linguistic data onto linguistic systems (elements of H). The distribution of data that the typical child receives will depend upon the distribution of linguistic types in the previous generation, Pt, and the mode of interaction between the child and its environment.

Stipulations (1), (2), and (3) taken together will allow us to deduce a map Pt → Pt+1 that characterizes how linguistic variation evolves over time. (A minimal simulation sketch of this iteration appears at the end of this section, following the discussion of Figure 1.1.)

The interplay between learning by individuals and change (evolution) in populations is subtle. We do not currently have good intuitions about the precise nature of this relationship and the possible forms it could take. Progress on this question is key to developing good theories of how and why languages change and evolve. It is extremely difficult to make progress by verbal arguments alone, and therefore it makes sense to study this question in some formal abstraction. So although I try to engage linguistic facts throughout this book, much of my discussion remains abstract. Another aspect of the book is its focus on mathematical models where the relationship between various objects may be formally (provably) studied. A complementary approach is to consider the larger class of computational models where one resorts to simulations.


[Figure 1.1 appears here. The schematic aligns Language Evolution with Biological Evolution: grammatical variation in adults (grammars g1, g2, g5, ...) parallels genetic variation in adults; transmission via learning (different data sets D1 and D2 mapped by the learning algorithm A to grammars g and h) parallels transmission via inheritance; grammatical variation in children parallels genetic variation in children; the role of natural selection is marked "??".]

Figure 1.1: The logic of language evolution. There is grammatical variation in the population of parents. g1, g2, g5 are some of the grammatical systems (idiolects) attested in the parental generation as shown. Different children have different linguistic experiences (D1 and D2 are two different data sets that two different children receive). Each child has the same learning algorithm A that maps these different linguistic experiences onto different grammatical systems (g and h respectively). Thus there is variation in the next generation of speakers. The sense in which natural selection plays a role in language evolution is unclear. No commitment is made at this point to the precise nature of the g's or the learning algorithm A.


Mathematical models, with their equations and proofs, and computational models, with their programs and simulations, provide different and important windows of insight into the phenomena at hand. In the first, one constructs idealized and simplified models, but one can then reason precisely about the behavior of such models and therefore be very sure of one's conclusions. In the second, one constructs more realistic models, but because of their complexity one will need to resort to heuristic arguments and simulations. In summary, for mathematical models the assumptions are more questionable but the conclusions are more reliable; for computational models, the assumptions are more believable but the conclusions more suspect.
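
To make stipulations (1)-(3) concrete, here is a minimal simulation sketch in Python. Every particular in it is a hypothetical illustration of my own (two grammars, made-up ambiguity frequencies, a crude majority-vote learner), not a model analyzed in this book; the point is only the shape of the computation: sample each child's primary linguistic data from the current population state Pt, apply a learning algorithm A, and read off Pt+1 from the grammars the children acquire.

    import random

    # Hypothetical two-grammar world: grammar 0 and grammar 1.
    # Speakers of grammar i produce an "ambiguous" sentence with
    # probability AMBIG[i]; otherwise they produce a sentence that
    # unambiguously signals their grammar.  These numbers are made up.
    AMBIG = {0: 0.3, 1: 0.5}

    def learn(examples):
        """Illustrative learner A: adopt whichever grammar supplied
        more unambiguous evidence (ties broken at random)."""
        votes = {0: 0, 1: 0}
        for g, ambiguous in examples:
            if not ambiguous:
                votes[g] += 1
        if votes[0] == votes[1]:
            return random.choice([0, 1])
        return max(votes, key=votes.get)

    def step(p1, n_examples=10, n_children=5000):
        """One generation: P_t (fraction using grammar 1) -> P_{t+1}."""
        acquired_1 = 0
        for _ in range(n_children):
            examples = []
            for _ in range(n_examples):
                g = 1 if random.random() < p1 else 0     # sample a speaker
                ambiguous = random.random() < AMBIG[g]   # sample a sentence
                examples.append((g, ambiguous))
            acquired_1 += learn(examples)
        return acquired_1 / n_children

    p1 = 0.5
    for t in range(10):
        print(t, round(p1, 3))
        p1 = step(p1)

Iterating step() traces out a population map Pt → Pt+1; it is the fixed points and bifurcations of maps of this general kind that are studied in later chapters.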

1.6.1 Main Insights

Let me summarize the main insights that emerge from the investigations conducted in this book.

Learning and Evolution

Learning at the individual level and evolution at the population level are related. Furthermore, we see that different learning algorithms have different evolutionary consequences. Thus every theory of language acquisition also makes predictions about the nature of language change. Such theories may therefore be tested not only against developmental psycholinguistic data but also against historical and evolutionary data.

Over the course of this book, I explore many different learning algorithms and their evolutionary consequences. We will see that the evolutionary dynamics can depend in subtle ways on whether learning operates with online memoryless algorithms or global batch algorithms. Similarly, there is a difference between symmetric algorithms like the trigger-based algorithms and asymmetric algorithms like the cue-based algorithms. While both satisfy learnability criteria, they have different evolutionary profiles. In the context of P&P-based algorithms, there is some debate as to whether there are default, marked states during the acquisition process. Such marked states would give rise to asymmetric learning algorithms, and their different evolutionary consequences may then be judged against historical data.

We will also see the role of critical age periods (the maturation parameter) in learning and evolution. If learning stops and the mature language crystallizes after a number n of examples have been received, we will see that the evolutionary dynamics are characterized by degree-n polynomial maps. Although such high-degree polynomial maps may have complicated behavior in general, in the particular case of language they operate in bounded
parameter regimes. Thus, though bifurcations typically arise, chaos typically does not.

We will also see the differences between learning algorithms that learn from the input provided by a single individual (parent, teacher, or caretaker) and algorithms that learn from the input provided by the community at large. This point is elaborated shortly.

I have considered some examples of language change to provide linguistic grounding and to give the reader a sense of how linguistic data may be engaged using the approaches described here. Particular note should be taken of the treatment of phonological change in Chinese and Portuguese and of syntactic change in French and English. An assortment of other changes are scattered across the book.

Finally, it is worth noting that much of learning theory in language acquisition is developed in the context of the classical Chomskyan idealization of an "ideal speaker-hearer in a homogeneous linguistic environment". As a result, one typically assumes that there is a target grammar that the learner tries to reach. This book drops the homogeneity assumption and analyzes the implications of learning theory in a heterogeneous population with linguistic variation. Learning theory has not been systematically developed in such a context before.

Bifurcations in the History of Language

A major insight that emerges from the analytic treatment pursued here is the role of bifurcations11 in population dynamics as an explanatory construct to account for major transitions in language. When one writes down the dynamics one would expect in linguistic populations under a variety of assumptions, again and again one notices that (1) the dynamics is typically nonlinear, and (2) there are bifurcations (phase transitions) which may be interpretable in linguistic terms as the change of language from one seemingly stable mode to another. There are numerous examples of such bifurcations in this book.

Footnote 11: Bifurcations may be recognized as phase transitions in a number of dynamical models in physics and biology. The most familiar instances of phase transitions are those that lead to changes in the state of materials, e.g., water turning into ice or an iron bar becoming a permanent magnet (the Ising and Potts models). In both cases, statistical physics models may be constructed, and temperature is the parameter that is varied in such models. It is seen that a critical threshold in temperature separates the qualitatively different regimes of behavior of the material. Thus temperature may change continuously across this threshold, leading to a discontinuous change in the state of the material. For a more precise discussion of this point, see Chapter 13.


In Chapter 5, I introduce models with two languages in competition. For a TLA-based learner in the P&P model, we will see that the equilibrium state depends upon the relationship of a to b, where a and b are the frequencies with which ambiguous forms are generated by speakers of each of the two languages in question. If a = b, then no evolutionary change is possible. If a < b, then one of the two languages is stable and the other is unstable; for a > b the reverse is true. We thus see that it is possible for a language to go from a stable to an unstable state because of a change in the frequencies with which expressions are produced. In a cue-based model of learning, also discussed in Chapter 5, we will see that there is a bifurcation from a regime with two stable equilibria to one with only a single stable equilibrium as k (the number of learning samples) and p (the cue frequency) vary as a function of each other. For models inspired by change in European Portuguese, or those inspired by phonological change in Chinese, similar bifurcations arise. In chapters on the emergence of grammatical coherence, we will see how there is a bifurcation point below which stable solutions include all languages (polymorphism) and above which stable solutions contain a single shared language in the community at large. (A toy simulation conveying the flavor of the two-language case appears at the end of this section.)

These results provide some understanding of how a major transition in the linguistic behavior of a community may come about as a result of a minor drift in usage frequencies, provided those usage frequencies cross a critical threshold. Though usage frequencies may drift from one generation to the next, the underlying linguistic systems may remain stable. But if these usage frequencies cross a threshold, rapid change may come about. Thus a novel solution to the actuation problem (the problem of what initiates language change) is posited.

Natural Selection and the Emergence of Language

In Part IV of this book, I shed some light on the complex nature of the relationship between communicative efficiency and fitness, social connectivity, learnability, and the emergence of shared linguistic systems. For example, I discuss the emergence of grammatical coherence, i.e., a shared language of the community in the absence of any centralized agent that enforces such coherence. Two different models are considered in Chapters 12 and 13. In one, children learn from their parents alone. In the other, they learn from the entire community at large. In both models, it is found that coherence emerges only if the learning fidelity is high, i.e., for every possible target grammar g, the learner will learn it with high confidence (with probability > γ). After examining conditions for learnability, we see that the complexity
of the class of possible grammars H, the size of the learning set n, and the confidence γ are all related. For a fixed n, if γ is to be large, then H must be small. Thus, in addition to the traditional learning-theoretic considerations, we see that there may be evolutionary constraints on the complexity of H, the class of grammars made available by Universal Grammar. In order to stably maintain a shared language in a community, the class of possible languages must be restricted and something like Universal Grammar must be true.

A second insight emerges from considering the difference between the two models of Chapters 12 and 13. We see that if one learns from parents alone, then natural selection based on communicative fitness is necessary for the emergence of a shared linguistic system. On the other hand, if one learns from the community at large, then natural selection is not necessary. Now, in human societies, the social connectivity pattern ensures that each individual child receives linguistic input from multiple people in the community. In such societies, it is therefore not necessary to postulate mechanisms of natural selection for the emergence of language. On the other hand, in those kinds of animal societies where learning occurs in the "nesting phase" with input primarily from one teacher, one may need to invoke considerations of natural selection. This is the case for some bird song communities, for example.
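
Before moving on, here is the toy simulation promised above. It is a deliberate simplification of my own devising, not the book's TLA-based derivation: assume a child simply adopts the language of the first unambiguous string it happens to hear, with α the fraction of language-1 speakers and a, b the ambiguous-form frequencies of languages 1 and 2. Then the chance that a child acquires language 1 is α(1 − a) / (α(1 − a) + (1 − α)(1 − b)), and iterating this map already displays the asymmetry described above.

    def next_alpha(alpha, a, b):
        """Toy update: alpha is the fraction of language-1 speakers;
        a, b are the ambiguous-form frequencies of languages 1 and 2.
        A child adopts the language of the first unambiguous string it hears."""
        w1 = alpha * (1 - a)          # weight of unambiguous language-1 input
        w2 = (1 - alpha) * (1 - b)    # weight of unambiguous language-2 input
        return w1 / (w1 + w2)

    def run(a, b, alpha0=0.5, generations=30):
        alpha = alpha0
        for _ in range(generations):
            alpha = next_alpha(alpha, a, b)
        return alpha

    # a = b: the map is the identity, so any starting mixture persists.
    print(run(a=0.4, b=0.4, alpha0=0.3))   # stays near 0.3
    # a < b: the mixture drifts toward language 1 (alpha -> 1).
    print(run(a=0.2, b=0.6, alpha0=0.3))
    # a > b: the mixture drifts toward language 2 (alpha -> 0).
    print(run(a=0.6, b=0.2, alpha0=0.7))

When a = b the map is the identity and any mixture persists; when the frequencies differ, one language is driven toward fixation, illustrating how a drift in usage frequencies alone can destabilize a previously stable state.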

1.7 Audience and Connections to Other Fields

This book is a computational treatment of the interplay between learning, language, and evolution. The subject matter and ideas reside at the boundary of several disciplines that form the intended audience for the book. The audience includes the following:

1. The linguist ought to be interested from a variety of perspectives. Those studying language acquisition can now examine how acquisition at the individual level may have long-term evolutionary consequences at the population level. Those studying historical linguistics will find here a computational framework to investigate and explain the phenomena of their discipline. Finally, those interested in the origin of language, and in how evolutionary considerations may possibly shape the structure of language, will find some new tools with which to reason about their theories.

2. The computer scientist in domains like computational linguistics and artificial intelligence is introduced here to a new domain of study
with its own phenomena that have received little computational attention in the past. While computational linguistics is a fairly old discipline with its roots in mechanical translation, in recent times the focus has largely been on computational models of learning and parsing. Computational work in historical or evolutionary linguistics has been extremely rare in the past, though a gradual stream of work beginning in the mid-nineties has started to gain momentum.

Many subdisciplines of artificial intelligence might find an interest in this work. Those in the areas of artificial life and artificial societies might be interested in the behavior of societies of linguistic agents and their dynamics. While this book concentrates on those cases for which analytic understanding is possible, a larger set of phenomena might be realistically pursued within the framework of agent-based simulations. This book does not pursue such an approach. A grand challenge in artificial intelligence is to understand how humans acquire language and to get a computer to do the same. Work in this tradition has ranged from abstract theories of language acquisition discussed earlier (Wexler and Culicover 1980; Osherson, Stob, and Weinstein 1986; Blum and Blum 1975; Gold 1967; Feldman 1972; Angluin 1988; Sakakibara 1990) to computational work (Berwick 1985; Feldman et al. 1996; Regier 1996; Siskind 1992). An unusual and previously unexploited window into the phenomenon of language acquisition is provided by the facts of language change.

Variation and variability among speakers is possibly the primary source of difficulty for computers to automatically process natural languages. This arises at all levels, from speech recognition to language understanding to language translation. Why does this variation arise? What constrains it? What propagates it? These are important questions to resolve, and a better understanding of variability is critical to progress in spoken-language systems. Here I take a fundamentally historical attitude toward linguistic variation, one that has almost never been taken in the synchronic view of variation that is implicit in natural-language processing.

3. The mathematician interested in dynamical systems will find a variety of concrete iterated function maps that arise in the study of historical linguistics, some of which are of considerable mathematical difficulty. The study of dynamical systems has been fed by problems in physics, biology, and economics but never before from linguistics, as far as I
know.

4. The evolutionary biologist may be interested in a new domain with much of the same character. I have already touched upon the similarities and differences between evolutionary processes in language and biology. Researchers in animal communication, signaling, and ethology may find synergies between their own perspectives and tools and those developed here.

5. Social scientists like anthropologists interested in culture and its transmission, economists interested in bounded rationality and its evolutionary consequences, and evolutionary psychologists interested in evolutionary perspectives on cognitive behavior will find a parallel here in the behavior of linguistic learners as they learn language and evolve over time.

6. This book discusses the relationship between the macroscopic behavior of a linguistic population and the microscopic behavior of the linguistic agents in this population. As a result, a theme that runs across this book is the analysis of the emergent properties of a complex system of several interacting components. This theme arises in different ways in studies of pattern formation in biology, physics, and various social sciences. Researchers interested in the study of complexity and complex phenomena may find new applications in the analysis of language as a complex adaptive system.

1.7.1 Structure of the Book

The rest of the book is organized as follows. In the next several chapters, I introduce the basic dynamical-system model for studying language change. This is developed over two parts. The starting point for my narrative is the problem of language acquisition. In Part II of this book (Chapters 2–4) I discuss the philosophical problem of inductive inference that lies at the heart of the language-learning problem. I introduce frameworks for the analysis of learning algorithms that will play a useful role in later chapters. In Part III, I begin by deriving the dynamics of linguistic populations (Chapters 5 and 6) that form the core model for much of the rest of the book. I show how models of language change depend upon the learning algorithm, the class of grammars, and sociolinguistic considerations of frequency of language use. I continue in Part III by applying my model (Chapters 6–10) to various special cases. As a result, variations and extensions of the basic model are
fleshed out. Many of these chapters have a running discussion of relevant linguistic phenomena that provide the motivation for this entire exercise.

In Part IV, I consider the trickier problem of the origin of linguistic communicative systems. I explore in some detail two themes. First, I consider the matter of communicative efficiency and discuss a probabilistic formulation for two linguistic agents communicating in a shared world. Second, I examine the role of fitness based on communicative efficiency in language evolution. I consider evolutionary models with fitness in Chapter 12, where I analyze the situation in which individuals reproduce at differential rates that are proportional to their communicative success. In Chapter 13, I consider language models without fitness but with social learning, where individuals learn from everybody. In both cases, I discuss when and how linguistic coherence emerges and relate this coherence threshold to the learning fidelity of the individual learner.

In summary, my goal in this book is to shed light on the nature of the relationship between individual language learning, grammatical families, and population dynamics. In a nutshell, I wish to understand how the distribution of different grammatical types in a population will evolve under a variety of learning algorithms and modes of interaction. Chapters 2 through 13 are an exploration of this theme in some mathematical detail. Computational and mathematical modeling forces us to be precise in our reasoning as we proceed. In Part V, I conclude: I take stock of the situation, outline my essential results, and suggest directions for future work.

Part II

Language Learning

Chapter 2

Language Acquisition: The Problem of Inductive Inference

An appropriate point to begin our whole narrative is to consider the problem of language acquisition. Children learn, with seemingly effortless ease, the language of their parents and caretakers. Let me begin by considering the computational difficulty of the problem that children solve so routinely. In order to get started, I will consider a language to be a set of sentences. Given a finite alphabet (I take this to be the lexicon) Σ, I denote by Σ∗ the universe of all possible finite strings (sentences) in the usual way. A language then is simply a subset of Σ∗, a subset consisting of the well-formed strings. In the previous chapter, I have considered several examples from natural languages like English, French, and Yiddish to illustrate how some strings (sentences) are in the language1 while others are not. The underlying grammar of a language determines which sentences are acceptable and which are not. Later in this chapter, I will consider alternative and perhaps more general conceptions of language, but these will not change the fundamental import of the discussion here.

Footnote 1: Note that in a formal sense, it is quite uncontroversial to speak of a language as a subset of Σ∗. What is potentially more problematic is the notion of a natural language like English. I will take the position that individuals have their own individual languages and that these might differ from each other. These individual languages are all natural languages. If the members of a community have so much overlap in their languages that mutual intelligibility is extremely high, then one might label this shared language "English" or "French" or "Chinese", as the case may be.

Consider a child born and raised in a homogeneous English-speaking
community. Such a child is exposed to a finite number of sentences as a result of interaction with its linguistic environment. Yet on the basis of this, the child is able to generalize and thus form and understand novel sentences it has not encountered before. It is this ability to infer (induce) the novel elements of a set that is the cornerstone of successful language learning.

Let us consider the following idealized setting for such a phenomenon. Imagine the community speaks a language Lt ⊂ Σ∗. This is the target language that the learner must identify or approximate in some sense. As a result of interaction with the community, the child learner ultimately has access to a sequence (in time) of sentences

s1, s2, s3, s4, . . . , sn, . . .

where si ∈ Lt is the ith example available to the learner. Suppose that the learner makes a guess about the target language Lt after each new sentence becomes available. Consider the case when the learner has no prior knowledge about the nature of Lt. Suppose that the kth example has just been received and the learner now has to make a guess about the identity of the target. What should the learner do? All it knows is that the target contains s1 through sk; there are an infinite number of possible languages that contain s1 through sk. Any of them could be the target. How many of them should (can) the learner consider? What should the learner guess? Such is the dilemma of the learning child when confronted with finite linguistic evidence over the course of language acquisition.

A theoretical analysis of this situation will rapidly lead us to the conclusion that learning in the complete absence of prior information is impossible. But first, let us consider a framework within which we can meaningfully discuss the problem of learning and inductive inference and better understand the various issues that underlie effective learning.

2.1 A Framework for Learning

The canonical problem to which much of language learning can conceptually be reduced is that of identifying an unknown set on the basis of example sentences. The framework within which analysis of such problems can usefully be conducted consists of the following components:

1. Target language: Lt ∈ L is a target language drawn from a class L of possible target languages. This is the language the learner must identify on the basis of examples.

2. Example sentences: si ∈ Lt are drawn from the target language and presented to the learner. Here si is the ith such example sentence.

3. Hypothesis languages: h ∈ H are drawn from a class H of possible hypothesis languages that child learners construct on the basis of exposure to example sentences in the environment.

4. Learning algorithm: A is an effective procedure by which languages from H are chosen on the basis of the example sentences received by the learner.

For further discussion, see Osherson, Stob, and Weinstein 1986; Valiant 1984; Niyogi 1998; Wexler and Culicover 1980; Jain et al. 1998.
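
The four components above translate directly into a small amount of code. In the Python sketch below, the particular family of languages, the text, and the learning algorithm are toy choices of my own made purely for illustration; what matters is the shape of the interfaces: a target language is a membership predicate, example sentences arrive one at a time, and a learning algorithm maps each finite data stream to a hypothesis (here, an index into the family).

    from itertools import islice
    from typing import Callable, Iterator, List

    # 1. Target language: identified extensionally with a membership test.
    #    Purely for illustration, language k is "strings of a's of length
    #    at least k" over the one-letter alphabet {a}.
    def language(k: int) -> Callable[[str], bool]:
        return lambda s: set(s) <= {"a"} and len(s) >= k

    # 2. Example sentences: an (infinite) text of positive examples.
    def text(k: int) -> Iterator[str]:
        n = k
        while True:
            yield "a" * n
            n += 1

    # 3 & 4. A hypothesis class (indices k = 0, 1, 2, ...) and a learning
    #    algorithm A mapping finite data streams to hypotheses.  This toy A
    #    guesses the largest k consistent with everything seen so far.
    def A(data: List[str]) -> int:
        return min(len(s) for s in data)

    data: List[str] = []
    for s in islice(text(3), 10):       # present ten examples from target k = 3
        data.append(s)
        print(s, "->", A(data))         # the hypothesis after each example

    guess = A(data)
    print("final hypothesis k =", guess, "; accepts 'aa'?", language(guess)("aa"))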

2.1.1 Remarks

At this point, some clarifying remarks are worthwhile.

1. Languages and grammars: I have formulated the question of learning a language as essentially that of identifying a set. I have not so far specified the nature of these sets. I now impose the restriction that these languages must be computable. Therefore, following the computability thesis of Church and Turing, I will consider only recursively enumerable (r.e.) languages. An r.e. language has a potentially infinite number of sentences and has a finite representation in terms of a Turing Machine or a Phrase Structure Grammar. Thus languages, machines, and grammars are formally equivalent ways (Harrison 1978) of specifying the same set. In my notation, g typically refers to a grammar and Lg refers to the language generated by it. With slight abuse of notation, in much of what follows, I will refer to the sets in question as grammars or languages depending upon the context. The r.e. languages constitute an enumerable set, and effective procedures exist to enumerate the grammars (machines) that generate them (Rogers 1958; Hopcroft and Ullman 1979). Since a grammar is a finite representation of the language, it is reasonable to suppose that language users and learners work with these finite representations rather than with the infinite languages themselves. Since many different grammars may be compatible with the same language, this raises the question of intensional versus extensional learning, a distinction that is captured in the contrast between I-language and E-language (Chomsky 1986) and that is worthwhile to keep in mind as the book develops. Thus, in a formal sense, we really have a collection G of possible target
grammars, and L is then defined as

L = { Lg | g ∈ G }

It is finally worth noting that all computable Phrase Structure Grammars may be enumerated as g1, g2, . . . . Given any r.e. language L there are an infinite number of gi's such that Lgi = L. Then any collection of grammars may be defined by specifying their indices in an acceptable enumeration.

2. Example sentences: Sentences will be presented to the learner one at a time. One might imagine many different modes of interaction with the environment as a result of which such sentences become available. However, it is worthwhile to note that the ability of children to learn a language does not seem to be sensitive to the precise order in which sentences are presented to them (Newport, Gleitman, and Gleitman 1977; Schieffelin and Eisenberg 1981). Hence, in my treatment of learnability I will mostly require that a psychologically plausible learning algorithm can be shown to converge to the target in a manner that is independent of the order of presentation of sentences. In much of this book, I will almost exclusively consider examples presented in i.i.d. fashion2 according to a fixed, underlying probability distribution μ. The distribution μ characterizes the relative frequency of different kinds of sentences that children are likely to encounter during language acquisition. For example, they are more likely to hear shorter sentences than substantially embedded longer sentences. This distribution μ might have support over all of Σ∗, in which case both positive (sentences in the target language) and negative (sentences not in the target language) examples will be presented to the learner.

Footnote 2: In reality, linguistic experience is conducted with clear dependencies and correlations between successive sentences. This dependence is based on considerations of pragmatics and discourse and is difficult to model precisely. However, it is possible that such dependencies affect the semantic content of sentences but not their syntactic structure. For example, the sentences "John ate an apple" and "Bill moved the car" have different lexical choices and correspondingly different semantic content. They may occur as part of different conversations, with different probability distributions over the precise sentences that follow them. However, both have similar syntactic structure and are of the form Subject-Verb-Object (SVO). It may be that if sentences are viewed as strings over syntactic categories, the i.i.d. assumption is a believable one. Further, it may also be that while there are immediate dependencies between sentences based on discourse, there are no long-range dependencies, and the i.i.d. assumption is like sampling from the stationary distribution of the corresponding Markov process. In any event, the probabilistic assumption of i.i.d. sentences is used as a convenient mathematical device to get a handle on the fact that different syntactic forms occur with different frequencies. The precise consequences of such an assumption may then be understood, opening the path to understanding more complicated phenomena. This is one of the many abstractions I make in order to be able to take first steps in reasoning about what is otherwise a very complex situation.

On the other hand, μ might have support only over Lt, in which case only positive examples are presented to the learner. This latter case is psychologically the more real, because considerable evidence seems to exist suggesting that children do not have much exposure to negative examples over their learning period3 (Brown and Hanlon 1970; Hirsh-Pasek, Treiman, and Schneiderman 1984; Demetras, Post, and Snow 1986).

3. Learning algorithm: The learning algorithm A is an effective procedure allowing the learning child to construct hypotheses about the identity of the target language on the basis of the examples it has received. In principle, any learner, be it the child learner over the course of language acquisition or a machine learner, has to follow a procedure or algorithm and is therefore subject to the computational laws that govern such processes. In particular, following the Church-Turing thesis, I will accept the equivalence of partial recursive functions and effective procedures, and therefore consider learning algorithms to be mappings from the set of all finite data streams to hypotheses in H. A particular finite data stream of k example sentences may be denoted as (s1, s2, . . . , sk). Let Dk = {(s1, . . . , sk) | si ∈ Σ∗} = (Σ∗)k be the set of all possible sequences of k example sentences that the learner might potentially receive. Then we can define D = ∪k>0 Dk to be the set of all finite data sequences. Since Dk is enumerable, so is D, and we can then take A to be a partial recursive function

A : D → H

where H is the enumerable set of hypothesis grammars4 (languages). Given a data stream t ∈ D, the learner's hypothesis is given by A(t) and is an element of H.

Footnote 3: Much of the time, members of the adult community simply produce sentences in their language, giving the child exposure to positive examples. The only source of negative examples therefore comes from mistakes that children make during language acquisition, and the feedback from this experience is often absent, inadequate, or misleading.

Footnote 4: A language is a set with a potentially infinite number of sentences. However, languages have finite representations in terms of grammars. It is reasonable therefore to postulate that human knowledge of a language has a compact encoding in terms of a grammar. Correspondingly, learners conjecture (develop) grammars as they attempt to learn a language. The set H is the hypothesis set of possible grammars they may conjecture (develop) in the course of learning a language. The map A from linguistic experience D to grammatical hypotheses H may be viewed as the language learning map, the language development map, or the language growth map, depending on one's point of view.

In much of this book, my treatment of learning will largely be in a probabilistic setting with natural ties to statistical learning theory as developed in Vapnik 1982, 1998, or in Valiant 1984, as well as to probabilistic settings of inductive inference in the extended Gold tradition (Jain et al. 1998). In this I will deviate from my strict formulation of the learner as a deterministic procedure to consider probabilistic learners that are allowed to flip a coin to choose hypotheses that are sometimes elements of H and sometimes subsets of H. An important thing to note is that the behavior of the learner for a particular data stream (s1, . . . , sk) ∈ Dk is independent of the target language or languages from which the data is drawn. It depends only on the data stream and can be predicted, either deterministically or probabilistically, if the learning algorithm is analyzable.

Some kinds of learning procedures are worth introducing here, and I will return to them at several points over the course of this book. A consistent learner always maintains a hypothesis (call it hn after n examples) that is consistent with the entire data set it has received so far. Thus, if the data set the learner has received is denoted by (s1, . . . , sn), then the learner's grammatical hypothesis hn is such that each si ∈ Lhn. An empirical risk minimizing learner uses the following procedure:

hn = arg min_{h ∈ H} R(h; (s1, . . . , sn))

The risk function R(h; (s1, . . . , sn)) measures the fit to the empirical data consisting of the example sentences (s1, . . . , sn). In many cases, this minimization might not be unique, in which case the learner will need a further criterion to decide which of the minima should be picked as a hypothesis language. Some version of Occam's razor is natural to consider here. For example, the learner might conjecture the smallest or simplest grammar that fits the data well.5

A memoryless learning algorithm is one whose hypothesis at every point depends only on the current example and the previous hypothesis. Let tn = (s1, . . . , sn) ∈ D be a data set with n examples and let tn−1 be the first n − 1 examples (i.e., tn−1 = (s1, . . . , sn−1)) of this data set. Then A is such that A(tn) depends only upon A(tn−1) and sn.

Learning by enumeration is another common strategy. Here the learner enumerates all possible grammars in H in some order. Let this enumeration be h1, h2, . . . .

Footnote 5: This idea finds a clear instantiation in the Minimum Description Length principle (MDL) of Rissanen and its application to language learning as in Rissanen and Ristad 1992, de Marcken 1996, Brent and Cartwright 1996, Brent 1999, and Goldsmith 2001. Earlier treatments of this idea are to be found in the evaluation metric of Chomsky 1965 and the complexity functions of Feldman 1972.

It then begins with the conjecture h1. Upon receiving new example sentences, the learner simply goes down this list and updates its conjecture to the first one that is consistent with the data seen so far. Variants on this basic idea may be considered. (A small code sketch of this enumeration strategy appears at the end of these remarks.)

4. Criterion of success: A significant component of the learning framework involves a criterion of success, so that we can measure how well the learner has learned at any point in the learning process. This takes the form of a distance measure d, so that for any target grammar gt and any hypothesis grammar h, one can define the distance between the target grammar and the hypothesis grammar as d(gt, h). If hn is the learner's hypothesis after n sentences have been received, then learnability implies that

lim_{n→∞} d(gt, hn) = 0

In other words, the learner's hypothesis converges to the target in the limit. If the learning algorithm A is such that for every possible target grammar gt ∈ G, the learner's hypothesis converges to it (when presented with a data sequence from the target grammar), then the family G is said to be learnable by A. By varying d, and by varying whether the convergence needs to be in a probabilistic sense or not, different convergence criteria may be obtained. I will consider a few different ones in the treatment that follows. Two related notions might be introduced here. I use the term generalization error to refer to the quantity d(gt, hn) that measures the distance of the learner's hypothesis (after n examples) from the target. Learnability implies that the generalization error eventually converges to zero as the number of examples goes to infinity. In a statistical setting, the generalization error is a random variable and can only converge to zero in a probabilistic sense (Vapnik 1982; Valiant 1984).

5. Generality of the framework: It is worthwhile to emphasize that the basic framework presented here for the analysis of learning systems is quite general. The target and hypothesis classes L and H might consist of grammars in a generative linguistics tradition. Examples of such grammars include those in the tradition of Government and Binding (Chomsky 1981); Head-Driven Phrase Structure Grammar (HPSG; Pollard and Sag 1994); Lexical-Functional Grammar (LFG; Kaplan and Bresnan 1982; Bresnan 2001); Autolexical Grammar (Sadock 1991); and so on. They might consist of general grammatical families such as finite-state grammars (DFAs), context-free grammars (CFGs), tree-adjoining grammars (TAGs; Joshi, Levy, and Takahashi 1975), and so forth. They might consist of grammars in a connectionist tradition. I do not commit myself to any representational issues here but
choose a generic enumeration scheme to enumerate the grammars in the class.6 Again, no commitment is made just yet to learning algorithms; they could in principle be grammatical inference procedures, gradient-descent schemes like those occurring in connectionist learning, Minimum Description Length (MDL) learning, maximum-likelihood learning via the EM algorithm, and so on. Different research traditions use different grammatical families, different representations, and correspondingly different learning algorithms. Most of these are analyzable in the framework considered here.

In what follows, I always consider the case in which the hypothesis class H is equal to the target class G. In other words, LH = {Lh | h ∈ H} = L. If this were not the case, then for some languages (those in L \ LH) the learner could never converge to the target, because it could never hypothesize the target language (grammar). Such languages could never come to be spoken by humans and therefore would not exist as natural languages.

Footnote 6: The fact that grammars may be enumerated follows from the computability thesis we have adopted here.
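
The sketch promised above: learning by enumeration for a toy family of languages (again an illustrative choice of my own, not one the book commits to). The learner walks down a fixed enumeration of grammars and conjectures the first one consistent with everything seen so far. Written this way it is a batch, consistent learner; a memoryless variant would retain only the current index and compare it against the latest example.

    from typing import Callable, List

    # Illustrative enumerable family of hypotheses (my own toy choice):
    # grammar i generates the finite language {0, 1, ..., i}.
    def grammar(i: int) -> Callable[[int], bool]:
        return lambda s: 0 <= s <= i

    def consistent(i: int, data: List[int]) -> bool:
        g = grammar(i)
        return all(g(s) for s in data)

    def learn_by_enumeration(data: List[int]) -> int:
        """Conjecture the first grammar in the enumeration that is
        consistent with every example seen so far."""
        i = 0
        while not consistent(i, data):
            i += 1
        return i

    # A text for the target grammar 5 (the language {0, ..., 5}).
    text = [2, 0, 5, 3, 3, 1, 4, 5]
    seen: List[int] = []
    for s in text:
        seen.append(s)
        print(s, "->", learn_by_enumeration(seen))
    # The conjecture stabilizes at 5 once the largest element, 5, has appeared.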

2.2 The Inductive Inference Approach

The first study of this problem of identifying sets was conducted in the sixties with the pioneering work of Gold (1967). Later work by Feldman (1972), Blum and Blum (1975), Angluin (1980a, 1980b, 1988), Osherson, Stob, and Weinstein (1986), Pitt (1989), Gasarch and Smith (1992), Jain et al. (1998), and a host of others elaborates on this theme. Jain et al. 1998 provides an excellent and updated exposition of the various technical results in this area. In this section, I provide a brief introduction to the paradigm of identification in the limit and a proof of Gold's celebrated result. I then proceed to interpret this result and its extensions in the context of the natural phenomenon of language acquisition. First, I need to define a few terms.

Definition 2.1 A text t for a language L is an infinite sequence of examples s1, . . . , sn, . . . such that (i) each si ∈ L and (ii) every element of L appears at least once in t.

I denote by t(k) the kth element of the text. This is simply sk. I denote by tk the first k elements of the text. This is simply s1 . . . sk. Thus tk ∈ Dk (if represented as a k-tuple). Let us now define the notion of learnability in the limit for this framework.

Definition 2.2 Fix a distance7 metric d, a target grammar gt ∈ G, and a text t for the target language Lgt. The learning algorithm A identifies (learns) the target gt (Lgt) on the text t in the limit if

lim_{k→∞} d(A(tk), gt) = 0

If the target grammar gt is identified (learned) by A for all texts of Lgt, the learning algorithm is said to identify (learn) gt in the limit. If all grammars in G can be identified (learned) in the limit, then the class of grammars G (correspondingly, the class of languages L) is said to be identifiable (learnable) in the limit. Thus learnability is equivalent to identification in the limit. Now consider the case where G = H. After some more notational niceties, I will prove the following fundamental result.

Definition 2.3 Given a finite sequence x = s1, s2, . . . , sn (of length n), I denote the length of the sequence by lh(x) = n. For such a sequence x, I denote by x ⊆ L the fact that each si in x is contained in the language L. I denote by range(x) the set of all unique elements of x. Thus x ⊆ L is shorthand for the more set-theoretically correct statement range(x) ⊆ L. The concatenation of two sequences x = x1, . . . , xn and y = y1, . . . , ym is denoted by x ◦ y = x1, . . . , xn, y1, . . . , ym. Now I am in a position to state the following theorem.

Footnote 7: The choice of the distance metric d allows us to consider many different notions of convergence to a target grammar. It is worth noting that any two grammars g and h define corresponding sets of expressions (languages) Lg and Lh. However, the metric d is defined on the space of grammars and may incorporate both extensional and intensional terms. For example, a purely extensional notion of distance would be one in which d(g, h) depends only on Lg and Lh and nothing else. In that case, for all grammars f such that Lf = Lg we would have d(g, h) = d(f, h). If ig and ih are the indices of g and h in some acceptable enumeration of the grammars, then a purely intensional notion of distance could be d(g, h) = |ig − ih|. In some sense, both the specific choices discussed above are unsatisfactory. In fact, from a cognitive perspective, there is a long tradition of thought in generative grammar that attributes some psychological reality to grammars. A child's developing linguistic knowledge is codified, represented, and interpreted in terms of its grammar. Consequently, some grammars may be interpretable while others may not. Therefore, the child's convergence to the parent's grammar ought to include considerations of grammatical representation. On the other hand, since the index of a grammar may bear no natural relationship to the extensional set identified as the corresponding language, a distance measure like d(g, h) = |ig − ih| might end up ignoring extensional agreement altogether, resulting in an absurdity. In much of my discussion in this chapter, I will discuss learning based on extensional criteria to understand fundamental limitations of inductive inference. In subsequent chapters, I will take a more linguistic point of view and consider intensional criteria in models of language acquisition.

Theorem 2.1 (after Blum and Blum; ε-version) If A identifies a target grammar g in the limit, then for every ε > 0 there exists a locking data set lε ∈ D such that (i) lε ⊆ Lg, (ii) d(A(lε), g) < ε, and (iii) d(A(lε ◦ σ), g) < ε for all σ ∈ D where σ ⊆ Lg.

In other words, after encountering the locking data set, the learner will be ε-close to the target as long as it continues to be given sentences from the target language.

Proof: I prove by contradiction. Suppose no locking data set exists. Then for every l ∈ D such that l ⊆ Lg and d(A(l), g) < ε, there must exist some σl ∈ D (where σl ⊆ Lg) that has the property d(A(l ◦ σl), g) ≥ ε. I will use this fact to construct a text for Lg on which A will not identify the target. Begin with a text r = s1, s2, . . . , sn, . . . for Lg. Now construct a new text q in the following manner. Let q(1) = s1. If d(A(q(i)), g) < ε, then I pick a σq(i) that violates the locking property and update the text by letting q(i+1) = q(i) ◦ σq(i) ◦ si+1. If d(A(q(i)), g) ≥ ε, then I simply let q(i+1) = q(i) ◦ si+1. Since an si is added at each stage of the text-creation process, it is clear that q is a valid text. At the same time, it is clear that A can never converge to g for the text q. This is because every time it conjectures a grammar h such that d(h, g) < ε (say at q(j)), it is forced to conjecture some other grammar that is not in the ε-neighborhood of g by the time it reaches q(j) ◦ σq(j). In other words, the learner's conjectures on q are such that d(A(qi), g) ≥ ε infinitely often.

This suggests that if a grammar g (correspondingly, a language Lg) is identifiable (learnable) in the limit, a locking data set exists that "locks" the learner's conjectures to within an ε-ball of the target grammar after this locking data set has been encountered. In what follows, I will consider the important and classical case of exactly identifying the target language in the limit. Here the distance metric is 0–1 valued and given by d(g1, g2) = 0 if and only if Lg1 = Lg2 (and d(g1, g2) = 1 otherwise). Putting ε = 1/2 in the previous theorem, I get the following classical result.

Theorem 2.2 (Blum and Blum 1975) If A identifies a target grammar g in the limit, then there exists a locking data set l ∈ D such that (i) l ⊆ Lg, (ii) d(A(l), g) = 0, and (iii) d(A(l ◦ σ), g) = 0 for all σ ∈ D where σ ⊆ Lg.

Utilizing this result, I can now prove Gold's famous theorem.

Theorem 2.3 (Gold 1967) If the family L consists of all the finite languages and at least one infinite language, then it is not learnable (identifiable) in the limit.

Proof: I prove by contradiction. Suppose that an algorithm A is able to identify the family L. Then, in particular, it is able to identify the infinite language (call it Linf). Therefore, by Theorem 2.2, a finite locking data sequence must exist (call it σinf). Consider the language L′ = range(σinf). This is a finite language and therefore is in L. Furthermore, one can construct a text t for L′ such that tk = σinf where k = lh(σinf). However, by the locking property, on such a text t the algorithm A converges to Linf. Therefore, A does not identify L′ and so does not identify the family L.

Since both the regular and the context-free languages contain all the finite languages and many infinite ones, it immediately follows as a corollary that

Corollary 2.1 The families of languages represented by (i) Deterministic Finite Automata and (ii) Context-Free Grammars are not identifiable in the limit.

Thus, all grammatical families in the core Chomsky hierarchy of grammars are unlearnable in this sense. Before I examine the implications of this result, it is worthwhile to consider a formulation of the necessary and sufficient conditions for the learnability of a family of languages. The proof of Gold's result presented here uses the notion of locking sequences, which was introduced by Blum and Blum some years after Gold's original paper. The use of such locking sequences in the proof illustrates how they hold the key to the learnability of language families. This is indeed the case, as the following theorem shows.

Theorem 2.4 (Angluin 1980) The family L is learnable (identifiable) if and only if for each L ∈ L there is a finite subset DL ⊆ L such that if any L′ ∈ L contains DL, then L′ is not a proper subset of L.

Now I prove the sufficient part. Suppose that for each L, there exists a DL ⊆ L such that if DL ⊆ L for some L ∈ L, then L ⊂ L. I now show how to construct an algorithm A to learn the family L. Let L1 , L2 , . . . , be an enumeration of the languages in L where for each L i there exists a DLi as discussed. Consider the following learning algorithm. On an input data stream σ ∈ D such that lh(σ) = k, the learner searches for the least index i ≤ k such that DLi ⊆ range(σ) ⊆ Li . If no such i exists, the learner simply conjectures L1 as a default. I need to prove that this algorithm will identify every language in L. Let the target language be Lj . Consider a text t for Lj . After k ≥ j example sentences have been encountered, the learner could potentially conjecture Lj . The only reason it may not is if it conjectures some L i where i < j. I claim that for every i < j, the learner will either not conjecture L i at all or will eventually abandon it for good. To see this, consider some L i (i < j). There are two cases to consider: Case I: Lj \ Li = ∅. This means that Lj ⊂ Li . But then DLi cannot be a subset of Lj . Since range(tk ) ⊆ Lj for all k, it can never be the case that DLi ⊆ range(tk ) ⊆ Li . The learner can never conjecture L i at any point. Case II: Lj \ Li = ∅. This means there is some sentence s ∈ L j that is not in Li . Eventually this sentence must occur in t k for some k. For all n > k, therefore, it can never be the case that range(t n ) ⊆ Li . Thus the learner will never conjecture Li after this point.

2.2.1 Discussion

In light of these results, let us now take stock of the situation.

Remark 1. I have considered here the difficulty of inferring a set from examples of members of this set. The problem is clearly motivated by the inference problem children have to solve in the process of learning a language. A particular language may be viewed as a set of well-formed expressions. The nature of this set is unknown to the child before language acquisition. In syntax, for example, this refers to the set of well-formed sentences (or phrases or constructions, depending upon one's point of view); in phonology, it refers to the set of well-formed phonological forms. Every such set has a finite representation in terms of an underlying grammar. Crucially, children are not exposed directly to the grammar; they are exposed only to the expressions8 of their language, and their task is to learn the grammar that provides a compact encoding of the ambient language they are exposed to.

Footnote 8: See Lightfoot 1991 for a discussion of learning from cues or phrasal fragments that seem to have a status somewhere in between an expression and a grammatical rule.

Remark 2. It is worth noting that the precise notion of convergence depends upon the distance metric d imposed on the space of grammars. I have considered in some of the above theorems the case in which the metric d(g1, g2) is 0–1 valued and depends only on whether Lg1 = Lg2. In this metric the learner may converge on the correct extensional set but may not converge to the correct grammar. In fact, the learner is not required to converge to any single grammar at all. Some would argue that this notion of convergence may be behaviorally plausible but cognitively implausible. Taking the view that grammars have a cognitive status, and therefore that the maturation of the child's linguistic knowledge must be captured in the development of its grammar, one may impose the stronger metric where d(gi, gj) = |i − j| or |C(i) − C(j)|, where i, j are indices of the grammars and C(i) and C(j) are measures of grammatical complexity in some sense. This notion of convergence is significantly more stringent, and much less is learnable in this case. In much of this chapter, I will review results using the more relaxed extensional (behavioral) notion of convergence so that we appreciate the impossibility of tabula rasa learning even in this setting. It is finally worth noting that a related notion of convergence was originally proposed by Gold (1967), where it was required that the learner stabilize (converge) to some grammar that was extensionally correct. This may be viewed as a compromise between the two extremes that have just been outlined.

Remark 3. It may be argued that the proper role of language is to mediate a mapping between form and meaning, or symbol and referent, in the Saussurean (Saussure 1983) sense. When one takes this point of view, one reduces a language to a mapping from Σ1∗ to Σ2∗, where Σ1∗ is the set of all linguistic expressions (over some alphabet Σ1) and Σ2∗ is a characterization of the set of all possible meanings. In this framework, a language L is regarded as a subset of Σ1∗ × Σ2∗, in other words, a potentially infinite set of (form-meaning) pairs or associations. There are, in fact, many possible formalizations of the notion of a language for our purposes. I enumerate below a list of possibilities.

1. L ⊂ Σ∗ – this is the central and traditional notion of a language both in computer science and linguistics.

2. L ⊂ Σ1∗ × Σ2∗ – a subset of form-meaning pairs, and in a formal sense no different from notion 1.

3. L : Σ∗ → [0, 1] – a language maps every expression to a real number between 0 and 1. For any expression s ∈ Σ∗, the number L(s) characterizes the degree of well-formedness of that expression, with L(s) = 1 denoting perfect grammaticality and L(s) = 0 denoting a complete lack of it. Note that notion 1 simply considers languages to realize mappings from Σ∗ to {0, 1}; thus sentences are either grammatical or ungrammatical, with no notion of graded grammaticality. The support of this function may be a restricted subset of Σ∗, and this restricted subset may be regarded as a language in the sense of notion 1.

4. L is a probability distribution μ on Σ∗ – this is the usual notion of a language in statistical language modeling.

5. L is a probability distribution μ on Σ1∗ × Σ2∗.

It is worthwhile to observe that each of the extended notions of language makes the learning problem for the child harder rather than easier. Thus the nonlearnability results discussed earlier have a significance greater than the particular formal context in which they have been developed.

Remark 4. The arguments presented above suggest that if the space L = H of possible languages is too large, then the family is no longer learnable. As a matter of fact, taking the result of Gold 1967, we see that even the space of regular languages (DFAs) is too large for learnability by this account. It is also worth emphasizing that the arguments are of an informational rather than computational nature. In other words, we consider A to be a mapping from D to H. Even if this mapping were not computable, the negative results about learnability presented here would still stand.

Remark 5. One may think that the unlearnability of regular languages is due to the fact that all the finite sets are included in this family. Since we know that natural languages are never finite, one may ask what sort of learnability results would hold for families of languages L where each member of L was required to be infinite. For example, is the class of infinite regular languages unlearnable? While Gold's Theorem does not apply to this case, it is worth noting that the proof technique does carry over. Consider two languages L1 and L2 such that L1 ⊂ L2 and, further, L2 \ L1 is an infinite set. Clearly one may find two such languages in L = {infinite regular languages}. Let σ2 be a locking sequence for L2. Then clearly, L1 ∪ range(σ2) is a language that contains σ2, is a proper subset of L2, and is contained in L. One sees that this language will not be learnable from a text whose prefix is σ2, because the learner will lock on to L2 on such a text.


Remark 6. The most compelling objection to the classical inductive inference paradigm comes from statistical quarters. It seems unreasonable to expect the learner to exactly identify the target on every single text. The natural framework that modifies both assumptions is the so-called Probably Approximately Correct learning framework (Valiant 1984), which tries to learn the target approximately with high probability. I discuss this in the next section, but it is worthwhile to note here that the PAC model also emphasizes identification in the limit. The quantity d(g_t, h_n) is now a random variable that must converge to 0 — not on every single data sequence as in the classical Gold-style inductive inference framework, but in probability (weak convergence of random variables). Before we consider the PAC approach, let us first review some additional results in more stochastic formulations that have existed in the inductive inference framework.

2.2.2

Additional Results

In the classical framework of the previous section, no assumption was made about the source that generated the text. Let us assume that the text is generated in i.i.d. fashion according to a probability measure μ on the sentences of the target language L. In much of this book, I will adopt this assumption in order to derive probabilistic bounds on the performance of the language learner that will be critical in deriving my models of language change. The measure μ has support on L. Therefore, one may define corresponding measures on the product spaces. Thus, we have the measure μ^2 on the product space L × L. All texts t from the language L that have been generated according to i.i.d. draws from μ will be such that t_2 ∈ L × L. Similarly, we can define the measure μ^3 on the space L × L × L and so on. By the Kolmogorov Extension Theorem, a unique measure μ^∞ is guaranteed to exist on the set T = ∏_{i=1}^{∞} L_i (where L_i = L for each i). The set T consists of all texts that may be generated from L by i.i.d. draws according to μ, i.e., t ∈ T is such that t(k) ∈ L for all k and each t(k) is drawn in i.i.d. fashion according to μ. The measure μ^∞ is defined on T and thus we have a measure on the set of all texts.

Theorem 2.5 Let A be an arbitrary learning algorithm and g be an arbitrary (not necessarily target) grammar. Then the set of texts on which the learning algorithm A converges to g is measurable.


Proof: Consider the set

A = {t | lim_{k→∞} d(A(t_k), g) = 0}

This is the set of all texts on which the learning algorithm converges to the grammar g. To see that this is measurable, it is enough to see that for each k, l, the set B_{k,l} (defined as below) is measurable.

B_{k,l} = {t | d(A(t_k), g) < 1/l}

As a matter of fact, μ^∞(B_{k,l}) = μ^k({T ∈ L^k | d(A(T), g) < 1/l}). Now let

C_l = ∪_i ∩_{m>i} B_{m,l}

Thus C_l is simply the set of texts for which the learning algorithm eventually makes conjectures that lie within a 1/l ball of the grammar g. By the usual properties of sigma algebras, C_l is clearly measurable. Finally, we see that A = ∩_{l=1}^{∞} C_l is measurable too.

As a result, it is possible to define learning with measure 1 in the following way.

Definition 2.4 Let g be a target grammar and texts be presented to the learner in i.i.d. fashion according to a probability measure μ on L_g. If there exists a learning algorithm A such that μ^∞({t | lim_{k→∞} d(A(t_k), g) = 0}) = 1, then the target is said to be learnable with measure 1. The family G is said to be learnable with measure 1 if all grammars in G are learnable with measure 1 by some algorithm A.

According to this notion of learnability, it is worthwhile to note that

1. If the measure μ is known in a certain sense, the entire family of r.e. sets becomes learnable with measure 1. We will see a proof of this shortly.

2. On the other hand, if μ is unknown, the Superfinite languages (those having all the finite languages and at least one infinite language) are not learnable. Thus, the class of learnable languages is not enlarged for distribution-free learning in a stochastic setting.

3. Computable distributions make languages learnable. Thus any collection of computable distributions is identifiable in the limit. This is comparable to Gold 1967, where it is shown that if examples are constrained to be produced by effective procedures (called computable presentations), the class of learnable languages is significantly enhanced.

Let us consider the set of r.e. languages and let L 1 , L2 , L3 , . . . be an enumeration of them. Let measure μ i be associated with language Li . Thus μi has support on Li and if Li is to be the target language, then examples are drawn in i.i.d. fashion according to this measure. As discussed above, by the extension theorem a natural measure μ i,∞ exists on the set of texts for Li . Then it is possible to prove that Theorem 2.6 With strong prior knowledge about the nature of the μ i ’s, the family L of r.e. languages is measure 1 learnable. Proof: The proof follows that in Osherson, Stob, and Weinstein 1986. Let s1 , s2 , . . . be an enumeration of all finite strings (sentences) in Σ ∗ . For an example sequence σ ∈ D, let us say that σ agrees with L j through n if for each si (i ≤ n), we have si ∈ Lj ⇔ si ∈ range(σ). In other words, σ and Lj agree on the membership of the first n sentences of Σ ∗ . Now let us first introduce the set Aj,n,m = {t|t is a text for Lj and ∃i ≤ n|si ∈ Lj \ range(tm )} Thus Aj,n,m is the set of all texts for Lj such that one of the first n elements of Σ∗ is in Lj but does not appear in tm . It is easy to see that Aj,n,m+1 ⊆ Aj,n,m. Therefore μj,∞(Aj,n,m ) is a monotonically decreasing function of m. As a matter of fact, since ∩∞ m=1 Aj,n,m = φ, we have lim μj,∞ (Aj,n,m ) = 0

as m → ∞

for every fixed j, n. Next we define the function d(n) as d(n) = least m such that μ_{i,∞}(A_{i,n,m}) ≤ 2^{−n} for all i ≤ n. In other words, if one fixes n, then after seeing at least d(n) examples, one is guaranteed that if the target were one of L_1 through L_n, with high probability one would get the membership values for each of the sentences s_1 ∈ Σ* through s_n ∈ Σ*. It is also clear that d(n) is a monotonically increasing function of n. Further, as n grows, one would eventually establish


the membership of every sentence and thus identify the target language with measure 1. The learning algorithm would work as follows. On an input sequence σ ∈ D, let m = lh(σ). Then, the learning algorithm finds the largest n such that (i) n ≤ m and (ii) d(n) ≤ m. I will indicate such an n by d^{−1}(m). It is therefore appropriate to conjecture one of L_1 through L_n. Let j be the least integer such that j ≤ n and L_j agrees with σ through n. If no such j exists, then let j = 1. The algorithm conjectures L_j on input sequence σ.

Now I need to prove that the algorithm learns the target grammar with measure 1. Let the target language be such that k is the least index for it. Consider now the set of texts on which the learning algorithm does not converge to L_k. This is given by

B = {t | A(t_n) ≠ L_k for infinitely many n}

I will show that the measure of B is 0. To see this consider the following intermediate set X_k given by

X_k = ∩_i ∪_{m>i} A_{k,d^{−1}(m),m}

Now, Claim 1: B ⊆ X_k. To see this, consider a t ∈ B. This means that on t, we have A(t_m) ≠ L_k infinitely often. There are two reasons why A(t_m) might not be equal to L_k. They are (i) t_m and L_k do not agree through d^{−1}(m), or (ii) there exists some L_i (i < k) such that L_i and t_m agree through d^{−1}(m). However, for any such L_i, since t_m and L_i must eventually disagree for some m, no such L_i can be conjectured infinitely often. Therefore the only reason that something other than L_k is conjectured infinitely often is that t_m and L_k must not agree through d^{−1}(m) for infinitely many m's. Therefore, t ∈ X_k = ∩_i ∪_{m>i} A_{k,d^{−1}(m),m}.

Claim 2: ∩_i ∪_{m>i} A_{k,d^{−1}(m),m} ⊆ ∩_i ∪_{n>i} A_{k,n,d(n)}. To see this, assume t ∈ ∩_i ∪_{m>i} A_{k,d^{−1}(m),m}. Now we need to show that t ∈ ∩_i ∪_{n>i} A_{k,n,d(n)}. To show the latter, we need to show that for every i, there exists some n > i such that t ∈ A_{k,n,d(n)}. However, since t ∈ ∩_i ∪_{m>i} A_{k,d^{−1}(m),m}, therefore t ∈ A_{k,d^{−1}(m),m} for some m > d(i + 1). Let n = d^{−1}(m) for such an m. Clearly, n ≥ i + 1. Therefore, t ∈ A_{k,n,m} where d(n) ≤ m and n ≥ i + 1. Since A_{k,n,l+1} ⊆ A_{k,n,l} for all l, we clearly have t ∈ A_{k,n,d(n)} for some n > i.


Finally notice that since Σ_n μ_{k,∞}(A_{k,n,d(n)}) < ∞, by the Borel-Cantelli lemma, and Claims 1 and 2, μ_{k,∞}(X_k) = 0. The set of texts on which the learning algorithm does not converge has measure zero.

It might appear that by simply changing the requirement from learning on all texts to learning on almost all texts, the class of learnable languages is significantly enlarged. This is however misleading since the measure 1 learnability of the above theorem requires one to know d(n). This is a very strong assumption indeed. In particular, if the learning algorithm works for one particular set of measures {μ_i}, it is very easy to perturb the source measures so as to ensure nonlearnability. Therefore, a more natural requirement as one moves to a probabilistic framework is to require learnability in a distribution-free sense.

Definition 2.5 Consider a target grammar g and a text stochastically presented by i.i.d. draws from the target language L_g according to a measure μ. If a learning algorithm exists that can learn the target grammar with measure 1 for all measures μ, then g is said to be learnable in a distribution-free sense. A family of grammars G is learnable in a distribution-free sense if there exists a learning algorithm that can learn every grammar in the family with measure 1 in a distribution-free sense.

It is worthwhile to note that when one considers statistical learning, the distribution-free requirement is the natural one and all statistical estimation algorithms worth their salt are required to converge in a distribution-free sense. When this restriction is imposed, the class of learnable families is not enlarged. In particular, it is possible to prove

Theorem 2.7 (Angluin 1988) If a family of grammars G is learnable with measure 1 (on almost all texts) in a distribution-free sense, then it is learnable in the limit in the Gold sense (on all texts).

As a matter of fact, one can prove the even stronger theorem

Theorem 2.8 (Pitt 1989) If a family of grammars G is learnable with measure p > 1/2, then it is learnable in the limit in the Gold sense.

This immediately implies of course that regular languages are not measure 1 learnable in a distribution-free sense. We will soon turn our attention to a model of learning that requires only weak convergence of the learner to the target. Here we will only require that lim P[d(A(t_k), g) > ε] = 0 as k → ∞,


leading to the Probably Approximately Correct (PAC) model of learning. Before we do so, however, it is worthwhile to mention a few positive results on the learning of grammatical families that are worth keeping in mind for a more complete and nuanced understanding of the possibilities and limitations of learning and inductive inference. 1. We see that given a family of languages L, if the learning algorithm knows the source distribution of each of the languages, the family is learnable with measure 1. If, on the other hand, the text is presented stochastically from some unknown distribution, the family is not learnable. One might frame the question of language learning as essentially identifying (learning in some sense) a measure μ from a collection of measures M. This reduces to a density estimation problem, which in principle is harder than function approximation (or set identification). Indeed, if no constraints are put on the family M, then identifying the target measure is not possible. It is reasonable to ask under what conditions the family M becomes identifiable in the limit. One answer to this question was provided by Angluin 1988, where it was proved that a uniformly computable distribution could be identified in the limit in a certain sense. Let M = {μ0 , μ1 , μ2 , . . .} be a computable (hence enumerable) family of distributions. Define the distance between two distributions μi and μj as d(μi , μj ) = sup |μi (x) − μj (x)| x∈Σ∗

This is the L∞ norm on sequences. The family M is said to be uniformly computable if there exists a total recursive function f(i, x, ε) such that for every i, for every x ∈ Σ*, and for every ε, f(i, x, ε) outputs a rational number p such that |μ_i(x) − p| < ε. The learner receives a text probabilistically drawn according to an unknown target measure from the family M. After k examples are received, the learning algorithm guesses A(t_k) ∈ M. It is possible to construct a learning algorithm that has the property lim d(A(t_k), μ_j) = 0 as k → ∞.

Special cases of this result are the learnability of stochastic finite-state grammars (van der Mude and Walker 1978) and stochastic context-free grammars (Horning 1969).

It is worth noting that uniform computability is a very strong prior constraint on the set of all distributions. For example, in the context of stochastic context-free grammars, one obtains probability measures on context-free languages by tying probabilities to context-free rules. As a result, the probability distributions are always such that longer strings become exponentially less likely. An arbitrary collection of probability measures with support on context-free languages need not be uniformly computable and so need not be learnable.

2. A second class of positive results arise from active learning on the part of the learner. Here the learner is allowed to make queries about the membership of arbitrary elements x ∈ Σ* (membership queries). This allows the regular languages to be learned in polynomial time though context-free grammars remain unlearnable (Angluin 1987; Angluin and Kharitonov 1995). Other query-based models of learning with varying degrees of psychological reality have also been considered. They enlarge the family of learnable languages but none allow all languages to be learnable (Gasarch and Smith 1992). It is certainly reasonable to consider the possibility that children explore the environment and this active exploration facilitates learning and circumvents some of the intractability inherent in inductive inference. On the other hand, the ability to make arbitrary membership queries seems to be too strong and it is likely that the learning child possesses this ability only to a limited extent.

3. The problem of inference is seen to be difficult because the learner is required to succeed on all or almost all texts. It is natural to consider further restrictions on the set of texts on which the learner is required to succeed. These are as follows:

(a) Recursive texts. A text t is said to be recursive if {t_n | n ∈ N} is recursive. The learner is required to converge to the target language for all recursive texts that correspond to this language. It is possible to show there exists a map A (from data sets to grammars) such that all Phrase Structure Grammars are learnable. Unfortunately this map is not computable, and the following theorem (see Jain et al. 1998) holds: If a computable map exists that can learn a family of languages L from recursive texts, then L is algorithmically learnable from all texts. This result implies that restricting learnability to recursive texts does not enlarge the family of learnable languages.


(b) Ascending texts. A text t is said to be ascending if for all n < m, the length of t(n) is less than or equal to the length of t(m), i.e., sentences are presented in increasing order of length. It is possible to show that there are language families L that are learnable from ascending texts but not learnable from all texts. Superfinite families (i.e., families containing all the finite languages and at least one infinite language), however, remain unlearnable in this setting.

(c) Informant texts. A text t for a language L is said to be an informant if it consists of both positive and negative examples. Every element of Σ* appears in the text with a label indicating whether it belongs to the target language or not. All recursively enumerable sets are learnable from informant texts. The general consensus in empirical studies of language acquisition seems to be that children are hardly ever exposed to negative examples. While it is true that there may be ways to get indirect evidence of negative examples, it still seems unlikely that the learning child ever gets an opportunity to sample the space of negative examples with enough coverage to get an unbiased estimate of the target language.

4. One may consider weaker convergence criteria. For example, an overly weak convergence criterion is to put a metric on the family of languages defined by d(L_1, L_2) = Σ_{s∈Σ*} μ(s)|1_{L_1}(s) − 1_{L_2}(s)| (where μ is a fixed measure on Σ*) and define convergence in this norm. This strategy (Wharton 1974) leads to the unfortunate consequence that the finite languages become dense in the space of r.e. languages and therefore a learning procedure need only output finite languages in order to learn successfully. A more satisfactory weakening is provided by the framework of anomalies where one is required to learn the target language up to (at most) k mistakes. Gold identification corresponds to k = 0 and it is possible to show (Case and Smith 1983) that a proper hierarchy of learnable families is obtained as k varies. Yet another criterion is the notion of strongly approaching the target grammar (Feldman 1972) where the learner must eventually be dislodged from all incorrect hypotheses and supply a correct hypothesis infinitely often.

5. One may consider various ways to incorporate structure into the learning problem leading to learnability results. Examples include learning context-free grammars from structured examples (Sakakibara 1990) or more recently the work on learning categorial grammars (Kanazawa 1998). If one were able to provide a cognitively plausible justification for how such structures were made available to the learning child, then such approaches would provide a natural framework for structured learning of linguistic families.

I have tried in the previous sections to trace the central developments and results of the theory of inductive inference that continues to provide the basic formal framework for reasoning about language acquisition. Some caveats notwithstanding, the main implication of these results is that learning in the complete absence of any prior information is infeasible. Through the various sections I have tried to provide the reader with a feel for some of the prior knowledge that is used to prove the technical results.
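Several of the positive results reviewed above rest on the same enumerate-and-test idea: conjecture the least-index grammar consistent with the data seen so far. The following is a hedged sketch of that idea on an invented toy family of finite languages; it is not presented as any particular algorithm from the text, and the strategy does not succeed on arbitrary families (Gold's theorem).

    def enumeration_learner(family, text_prefix):
        """Conjecture the least index j such that L_j contains every example
        seen so far; return None if no candidate is consistent."""
        seen = set(text_prefix)
        for j, Lj in enumerate(family):
            if seen <= Lj:
                return j
        return None

    # A toy nested family of finite languages over {a}: L_j = {a, aa, ..., a^(j+1)}.
    family = [{"a" * i for i in range(1, j + 2)} for j in range(5)]

    # A text (positive-only presentation) for the target L_3 = {a, aa, aaa, aaaa}.
    text = ["a", "aaa", "aa", "aaaa", "aa", "a", "aaaa"]

    for k in range(1, len(text) + 1):
        print(k, enumeration_learner(family, text[:k]))
    # The conjecture can undershoot only until every element of the target has
    # appeared in the text; from then on it stabilizes at index 3.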

2.3

The Probably Approximately Correct Model and the VC Theorem

Another significant approach to learning theory is the decidedly statistical route pursued by a large number of researchers in computer science and statistics. The central theoretical framework for such an approach was pioneered by Vapnik and Chervonenkis (1971) and elaborated fully in Vapnik 1998. In the context of computer science, this work was introduced with additional computational complexity considerations by Valiant (1984) as the Probably Approximately Correct (PAC) model which has stimulated a rich dialogue between computer science and statistics over the last two decades. The canonical problem in statistical learning theory is the learning of functions. The concept class and the hypothesis class are classes of functions f : X → Y where X and Y are arbitrary sets. The learner is required to converge (identify, learn) to the target in the limit. However, this convergence is probabilistic as I now clarify.

2.3.1

Sets and Indicator Functions

In the canonical framework of statistical learning theory, the class F of possible target functions (usually referred to as the concept class in the PAC literature) and the hypothesis class H are both classes of functions f : X → Y . In the case of language, it is natural to consider X to be the set Σ∗ — the set of all possible strings, and Y to be the set {0, 1}. Therefore for a particular language L ⊂ Σ ∗ , we can define the indicator function associated to L as 1L (x) : Σ∗ → {0, 1}


where 1L (x) = 1 if and only if x ∈ L. Thus, identifying or learning a language is equivalent to learning the indicator function corresponding to that language. Languages now have three natural representations: as recursively enumerable subsets of Σ∗ , as Turing machines or programming systems or Phrase Structure Grammars, or as indicator functions over Σ ∗ . In my discussions so far, I have always taken F = H and I will continue with this assumption.

2.3.2

Graded Distance

The discussion of language learning in the inductive inference framework is dominated by the notion of exact identification where the distance measure d(1_L, 1_{L′}) = 1 if and only if L ≠ L′ and = 0 otherwise. It may reasonably be argued that such a distance does not allow for a natural graded topology on the space of possible languages. Therefore, I rectify this by considering the L1(P) topology on the space of languages as follows. Define a probability measure P on Σ*. Then, the L1(P) distance between two languages L and L′ might be defined as

d(L, L′) = Σ_{s∈Σ*} |1_L(s) − 1_{L′}(s)| P(s)

Given any language L, we can therefore naturally define the ε-neighborhood of the language as

N_L(ε) = {L′ | d(L, L′) < ε}

This allows us to consider languages that are arbitrarily close to each other and potentially alleviate the apparent pathologies introduced by the notion of exact identification in the limit.
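A small sketch may help fix the definition. Assuming a measure P with finite support (an invented toy example, not from the text), the L1(P) distance can be computed directly:

    def l1_distance(L_a, L_b, P):
        """d(L_a, L_b) = sum over s of |1_{L_a}(s) - 1_{L_b}(s)| * P(s)."""
        return sum(abs((s in L_a) - (s in L_b)) * p for s, p in P.items())

    P = {"ab": 0.5, "ba": 0.3, "abab": 0.2}     # a toy probability measure on Σ*
    L_a = {"ab", "abab"}
    L_b = {"ab", "ba"}

    print(l1_distance(L_a, L_b, P))   # 0.3 + 0.2 = 0.5
    # Strings outside the support of P contribute nothing, so two languages can
    # differ on infinitely many strings and still be at L1(P)-distance zero.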

2.3.3

Examples and Learnability

Assume that examples are randomly presented to the learner according to a probability distribution P on Σ ∗ . In the generic framework of function learning, the learner is presented with both positive and negative examples. Undoubtedly, the task of learning the target function with balanced (positive and negative) examples is easier than learning from positive examples alone. While the latter is the more natural setting for language acquisition, let us first develop the basic insights of statistical learning theory in the context of the easier function learning problem in order to understand some essential constraints on inductive inference from finite data.
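The data-generation process just described (draw x i.i.d. from P, label it with the target indicator function) can be sketched as follows; the particular distribution and target language below are arbitrary illustrative choices, not taken from the text.

    # Sketch of the example-generating process: x ~ P i.i.d., y = 1_L(x).
    import random

    def draw_examples(P, target, m, rng=random.Random(0)):
        strings, weights = zip(*P.items())
        xs = rng.choices(strings, weights=weights, k=m)
        return [(x, int(x in target)) for x in xs]

    P = {"a": 0.4, "ab": 0.3, "abb": 0.2, "abbb": 0.1}
    target = {"ab", "abbb"}                 # the target language L
    print(draw_examples(P, target, 5))
    # prints five (string, label) pairs; labels indicate membership in the target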


Let L be the target language. Thus, 1_L : Σ* → {0, 1} is the target function corresponding to this language. Examples are (x, y) pairs where x ∈ Σ* (drawn according to P) and y = 1_L(x). On the basis of these examples, the learner hypothesizes functions in H (recall that H consists of functions from Σ* to {0, 1}). As before, the learner is a mapping from possible data sets to the hypothesis class. In view of the fact that positive and negative examples are received, we will need to redefine D_k – the set of all data streams of length k – to be

D_k = {(z_1, . . . , z_k) | z_i = (x_i, y_i); x_i ∈ Σ*, y_i ∈ {0, 1}}

After receiving l data points, the learner conjectures a function ĥ_l ∈ H. The most natural statistical learning procedure to consider is one that minimizes empirical risk. According to this procedure, the learner's hypothesis ĥ_l is chosen as

ĥ_l = arg min_{h∈H} (1/l) Σ_{i=1}^{l} |y_i − h(x_i)|

I have indicated the learner’s hypothesis by hˆl where the ˆsymbol explicitly denotes that the learner’s hypothesis hˆl is a random function. This is simply because the learner’s hypothesis depends upon the randomly generated data. More generally, a learning algorithm A is an effective procedure mapping data sets to hypotheses, i.e., A : ∪∞ i=1 Dk −→ H Therefore, hˆl = A(dl ) where dl is a random element of Dl . Successful learning would require the learner’s hypothesis to converge to the target as the number of data points goes to infinity. Because the learner’s hypothesis is a random function, it is now natural to consider this convergence in probability. Hence, we can define the following: Definition 2.6 The learner’s hypothesis hˆl converges to the target (1L ) in probability, if and only if for every  > 0 lim P[d(hˆl , 1L ) > ] = 0

as l → ∞.

Some remarks are worthwhile. First, note that the probability distribution P does double duty for us. On the one hand, it allows us to define the distance between languages as the L 1 (P ) distance between their corresponding indicator functions. Convergence is therefore measured in this norm. On


the other hand, it also provides the distribution with which data is drawn and presented to the learner. It therefore characterizes the probabilistic behavior of the random function hˆl . Crucially, however, P is unknown to the learner except through the random draws of examples. Second, note that the notion of convergence is exactly the notion of weak convergence of a random variable. In the standard case of inductive inference treated earlier (for a text t) the distance d(A(t k ), L) is a deterministic sequence that must converge to zero for learnability. Now, the text t is generated randomly in i.i.d. fashion. Therefore, d(A(t k ), L) is a random variable that is required to converge to zero in probability. This notion of weak convergence is usually stated in an (, δ) style in PAC formulations of learning theory in computer science communities. If hˆl converges to the target 1L in a weak sense, it follows that for every  > 0 and every δ > 0, there exists an m(, δ) such that for all l > m(, δ), we have P[d(hˆl , 1L ) > ] < δ In other words, with high probability (> 1 − δ), the learner’s hypothesis (hˆl ) is approximately close (within an  in the appropriate norm) to the target language. The quantity m(, δ) is usually referred to as the sample complexity of learning. Finally, we are able to define the notion of learnability within this PAC framework. Definition 2.7 Consider a target language L. If there exists a learning algorithm A such that for any distribution P on Σ ∗ according to which examples ((x, 1L (x)) pairs) are drawn and presented to A, the learner’s hypothesis converges to the target in probability, then the target language L is said to be learnable. A family of languages L is said to be learnable if there exists a learning algorithm A which can learn every language in the family uniformly, i.e, for every , δ > 0, there exists an m(, δ) such that if the learning algorithm draws at least l > m(, δ) examples, then for all probability distributions P and target languages L ∈ L, with probability > 1 − δ, the learner’s hypothesis is - close to the target, i.e., d( hˆl , 1L ) < .
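To see the (ε, δ) reading of Definitions 2.6 and 2.7 in action, one can simulate empirical risk minimization over a small hypothesis class and estimate how often the resulting hypothesis is more than ε from the target. The class, distribution, and parameters below are invented for illustration only; this is a sketch, not an experiment from the text.

    import random

    STRINGS = ["a", "ab", "abb", "abbb", "abbbb", "abbbbb"]
    P = {s: 1.0 / len(STRINGS) for s in STRINGS}                 # uniform P
    H = [set(STRINGS[:i]) for i in range(len(STRINGS) + 1)]      # nested hypotheses
    target = H[3]                                                # the target language

    def d(L_a, L_b):                                             # L1(P) distance
        return sum(abs((s in L_a) - (s in L_b)) * p for s, p in P.items())

    def erm(sample):                                             # minimize empirical risk
        return min(H, key=lambda L: sum(int(x in L) != y for x, y in sample))

    def error_rate(l, epsilon, trials=2000, rng=random.Random(1)):
        bad = 0
        for _ in range(trials):
            xs = rng.choices(STRINGS, weights=[P[s] for s in STRINGS], k=l)
            sample = [(x, int(x in target)) for x in xs]
            bad += d(erm(sample), target) > epsilon
        return bad / trials                                      # estimate of P[d > epsilon]

    for l in (5, 20, 80):
        print(l, error_rate(l, epsilon=0.1))
    # The estimated probability of an epsilon-bad hypothesis shrinks as l grows,
    # which is exactly the weak-convergence requirement of Definition 2.6.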

2.3.4

The Vapnik-Chervonenkis (VC) Theorem

It is natural now to consider what classes L are learnable under this new framework of learnability. The most fundamental characterization of learnability was provided by the pioneering work of Vapnik and Chervonenkis (summarized in Vapnik 1998). I provide a treatment of their work in the current context of language learning.


Let L be a collection of languages and H be the associated collection of indicator functions on Σ*. Thus H = {1_L | L ∈ L}. Now let the target language be L_t ∈ L. Correspondingly, the target function is 1_{L_t} ∈ H. Note that

L_t = arg min_{L∈L} E[|1_{L_t} − 1_L|]

Within the framework of empirical risk minimization, the learner's hypothesis is given by

L̂_l = arg min_{L∈L} (1/l) Σ_{i=1}^{l} |y_i − 1_L(x_i)|

The language L̂_l empirically chosen by the learner corresponds to an indicator function 1_{L̂_l} from the class H of hypothesis functions on Σ*. According to our previously denoted notation, ĥ_l = 1_{L̂_l}. Recall, however, that the

learner in general need not be an empirical risk minimizing learner. The following theorem (also developed in Blumer et al. 1989) applies in complete generality irrespective of the learning algorithm used. Theorem 2.9 (Vapnik and Chervonenkis 1971 1991) Let L be a collection of languages and H be the corresponding collection of functions. Then L is learnable if and only if H has finite VC dimension. In order to make sense of this theorem, we need to define the VC dimension of the family H of functions. We first develop the notion of shattering. Definition 2.8 A set of points x1 , . . . , xn is said to be shattered by H if for every set of binary vectors b = (b1 , b2 , . . . , bn ), there exists a function hb ∈ H such that hb (xi ) = 1 ⇔ bi = 1. In other words, for every different way of partitioning the set of n points into two classes, a function (a different function for every different partition) in H is able to implement the partition. Obviously, H must have at least 2 n different functions in it. Definition 2.9 The VC dimension of a set of functions H is d if there exists at least one set of cardinality d that can be shattered and no set of cardinality greater than d that can be shattered by H. If no such finite d exists, the VC dimension is said to be infinite. Theorem 2.9 provides a completely different characterization of learnable families from those that have emerged in our previous treatments. Before I examine its consequences for language learning, let us consider a proof of the necessity of finite VC dimension for learnability to hold.
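Definitions 2.8 and 2.9 can be checked mechanically for small finite classes. The sketch below, with an invented toy class of nested "threshold" languages, tests shattering by brute force and computes the VC dimension of the class over a small universe of strings.

    from itertools import combinations

    STRINGS = ["a" * n for n in range(1, 7)]
    H = [set(STRINGS[:i]) for i in range(len(STRINGS) + 1)]   # nested "thresholds"

    def shattered(points, H):
        """Is every binary labeling of `points` realized by some h in H?"""
        realized = {tuple(x in h for x in points) for h in H}
        return len(realized) == 2 ** len(points)

    def vc_dimension(universe, H):
        d = 0
        for k in range(1, len(universe) + 1):
            if any(shattered(c, H) for c in combinations(universe, k)):
                d = k
        return d

    print(shattered(("a", "aa"), H))        # False: the labeling (0, 1) is impossible
    print(vc_dimension(STRINGS, H))         # 1, as expected for nested thresholds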


2.3.5


Proof of Lower Bound for Learning

Recall that learnability requires that for every ε > 0 and δ > 0, there exists a finite m(ε, δ) such that if the learner draws m > m(ε, δ) examples, then P[d(ĥ_m, h) > ε] < δ for all h ∈ H and for all distributions P on Σ* according to which instances are provided. I will now show that if H has VC dimension = d, then

m(ε, δ) > (d/4) log_2(3/2) + log_2(1/(8δ))

Therefore it immediately follows that if H has infinite VC dimension, then m(ε, δ) cannot be finite. Consequently, the class H is unlearnable.

Preliminaries: Assume H has VC dimension = d. I now construct a probability distribution on Σ* that will force the learner to draw at least m > m(ε, δ) > (d/4) log_2(3/2) + log_2(1/(8δ)) examples to learn some target function in H to ε-accuracy with confidence greater than 1 − δ. Since the VC dimension is d, there must exist a set of points x_1, x_2, . . . , x_d with each x_i ∈ Σ* such that these points can be shattered. Consider a probability distribution P that puts measure 1/d on each of these points and zero measure on the rest of the elements of Σ*. Let X = {x_1, . . . , x_d}. According to this probability distribution P, any two functions h_1 ∈ H and h_2 ∈ H will have d(h_1, h_2) = 0 if and only if h_1 and h_2 agree on each of the x_i's. We may consider h_1 to be equivalent to h_2 if d(h_1, h_2) = 0. It is easy to check that this is an equivalence relation and there are exactly 2^d different equivalence classes in H. Accordingly, we will not distinguish between different members of the same equivalence class and for the rest of the proof, it is sufficient to assume that |H| = 2^d.

Step 1: Let z be a random draw of m i.i.d. examples from Σ* according to P and z_h be the labeled data set obtained by labeling the points according to the function h ∈ H. Then z_h ∈ D_m and A(z_h) is the learner's hypothesis on receiving this labeled data set. Imagine that z contained exactly l distinct elements of X, i.e., the other d − l elements did not occur in the data set. There are 2^l different ways in which the l instances in z may be labeled by a potential target function h. Let H_i ⊂ H be the collection of functions that label these instances according to the ith labeling scheme. Thus we see that H_1 through H_{2^l} are a disjoint partitioning of H. Consider the sum

Σ_{h∈H} d(A(z_h), h) = Σ_{i=1}^{2^l} Σ_{h∈H_i} d(A(z_h), h)    (2.1)

Note that for each H_i, there are exactly 2^{d−l} different functions in it. These functions agree on the l distinct instances found in z. Therefore, for each h ∈ H_i, the data set z_h is the same. On the remaining d − l instances that have not been seen, these functions label them in each of 2^{d−l} different ways. Consider the quantity d(A(z_h), h). On the d − l instances that have not been seen, let A(z_h) and h disagree on j instances. In that case, we have d(A(z_h), h) ≥ j/d. There are C(d−l, j) different ways in which h and A(z_h) might disagree with each other on exactly j of the unseen d − l instances. Therefore, we have

Σ_{h∈H_i} d(A(z_h), h) ≥ Σ_{j=0}^{d−l} C(d−l, j) (j/d) = (2^{d−l}/d) ((d−l)/2)    (2.2)

Combining 2.1 and 2.2, we have

Σ_{h∈H} d(A(z_h), h) ≥ (2^d/d) ((d−l)/2)

Step 2: Let S = {z | z has l distinct elements}. Then we have

Σ_{z∈S} P(z) (1/2^d) Σ_{h∈H} d(A(z_h), h) ≥ (1/d) ((d−l)/2) P(S)

where P(z) denotes the probability of drawing the instance set z and P(S) denotes the total probability of drawing instances in the set S. Changing the order of sums, we see

(1/2^d) Σ_{h∈H} Σ_{z∈S} P(z) d(A(z_h), h) ≥ (1/d) ((d−l)/2) P(S)

from which it is clear that there exists at least one h* ∈ H such that

Σ_{z∈S} P(z) d(A(z_{h*}), h*) ≥ (1/d) ((d−l)/2) P(S)

Step 3: We see that h* is a candidate target function for which the learner's hypothesis is potentially inaccurate quite often. Consider the set

S_β = {z ∈ S | d(A(z_{h*}), h*) > β}

In other words, S_β is the set of draws of m instances (with exactly l unique elements) on which the learner's hypothesis differs from the target by more than β. We can lower bound P(S_β) by noticing that

(1/d) ((d−l)/2) P(S) ≤ Σ_{z∈S_β} P(z) d(A(z_{h*}), h*) + Σ_{z∈S∖S_β} P(z) d(A(z_{h*}), h*) ≤ P(S_β) + β(P(S) − P(S_β))

From this we have

(1 − β) P(S_β) ≥ ((1/d)((d−l)/2) − β) P(S)

Step 4: Let β = ε. Then we see that if the target were h*, then with probability at least P(S_ε) the learner's hypothesis would be more than ε away from the target. Let us find the conditions under which

((1/d)((d−l)/2) − ε) P(S) > δ

Since l was arbitrary, we can let l = d/2. In that case, for all ε < 1/8, it is easy to check that

((1/d)((d−l)/2) − ε) P(S) > (1/8) P(S)

Therefore, it is enough to find the conditions for P(S) > 8δ. Recall that P(S) is the probability of drawing exactly l distinct items from d items in m i.i.d. trials. There are C(d, l) different ways of choosing l items. For each such choice, there are l! different ways in which the items could appear in the first l positions. Consider the ith such choice. Any sequence of m examples (denoted by z) such that (i) its first l items correspond to this choice and (ii) the remaining m − l items are made up of only elements of this l-set is a member of S. Let S^(i) be the set of all elements of S that satisfy this property. Clearly

P(S^(i)) = (1/d)^l (l/d)^(m−l)

Since each of the S^(i) are disjoint subsets of S, we have that

P(S) ≥ C(d, l) l! (1/d)^l (l/d)^(m−l)

Step 5: Put l = d/2. Then we have

C(d, l) l! (1/d)^l (l/d)^(m−l) = C(2l, l) l! (1/(2l))^l (1/2)^(m−l) = ((2l)!/((l!)(l^l))) (1/2)^m

Now,

(2l)!/((l!)(l^l)) = ∏_{i=1}^{l} (1 + i/l) ≥ (1 + 1/2)^{l/2}

Therefore, if

m < (d/4) log_2(3/2) + log_2(1/(8δ))

then P(S) > 8δ and hence

P(S_ε) ≥ (1/8) P(S) > δ.

Thus if the target were h*, the probability that the learner's hypothesis is more than ε away from the target is greater than δ. If H had infinite VC dimension we can always choose a d large enough so that for every ε < 1/8, the probability of making a mistake larger than δ can be correspondingly arranged. The class of infinite VC dimension will therefore be unlearnable in this sense.
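As a rough numerical reading of the bound just derived, one can plug in values for d and δ (with ε < 1/8); the arithmetic below is only a back-of-the-envelope sketch.

    from math import log2

    def sample_size_lower_bound(d, delta):
        # m(eps, delta) > (d/4) * log2(3/2) + log2(1/(8*delta)) for eps < 1/8
        return d / 4 * log2(1.5) + log2(1 / (8 * delta))

    for d in (10, 100, 1000):
        print(d, round(sample_size_lower_bound(d, delta=0.05), 1))
    # prints roughly 2.8, 15.9, and 147.6: the required number of examples grows
    # linearly with the VC dimension d, so a class of infinite VC dimension
    # cannot be learned from any finite sample.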

2.3.6

Implications

Thus we see that the class of languages L must be such that H = {1 L |L ∈ L} must have finite VC dimension for learnability to hold. An immediate corollary is Corollary 2.2 The class of all finite languages is unlearnable in the PAC setting.


Remark. The PAC framework is often misunderstood to be one that allows a larger collection of languages to be learned than the Gold framework. In this context it is worth making two remarks. First, we have just seen that the set of all finite languages is not PAC learnable. It is easy to check that this family is Gold learnable. Second, in the PAC setting, one learns from both positive and negative examples. If the learner is allowed such a privilege in the Gold setting (learning from informants), we have seen already that the entire family of r.e. sets may be identified. It is also possible to define collections of languages that are PAC learnable but not Gold learnable. Hence we conclude that PAC and Gold are just different frameworks with no obvious relationship.

Corollary 2.3 The class of languages represented by (i) Finite Automata (DFAs) (ii) Context Free Grammars (CFGs) are both unlearnable in the PAC setting.

Interestingly, from two very different but plausible frameworks for inferring a language from finite examples, one arrives at the unlearnability of the basic families in the Chomsky hierarchy unless further constraints are put on the problem of language acquisition. One possible constraint that may be put on the class of languages is a constraint on the number of rewrite rules or the size of the representation in some notational system. For example, consider the set of all languages describable by a DFA with at most n states over an alphabet Σ where |Σ| = k. Call this family H_n. It is immediate that H_n is a finite class, and an upper bound on its size is given by

|H_n| ≤ [n^k]^n

Therefore we get that the VC dimension of H_n is bounded by

VC(H_n) ≤ log_2([n^k]^n) ≤ nk log_2(n)

Similar calculations may be conducted for more interesting families of languages and an Occam principle may then be used.

A second aspect of the framework of statistical learning theory merits more discussion. The distance d(A(t_k), L_t) between the learner's hypothesis and the target is required to decrease eventually to zero as more and more data become available, i.e., as k → ∞. It is of interest to know the rate at which this convergence occurs and this issue has been a significant direction of work in the field. In general, using Hoeffding bounds on uniform laws of large numbers, it is usually possible to guarantee a rate of O(1/√k). The rate of convergence takes on a particular significance in a cognitive context because "learnability in the limit" is ultimately only an idealized notion that is not realized in practice. After all, humans have only a finite amount of linguistic experience on the basis of which they must generalize to novel situations. Furthermore, there appears to be a critical (maturational) time period over which much of language acquisition takes place. At the end of this learning phase, children develop mature grammars that remain relatively unaltered over their lifetime. Therefore, it becomes of interest to characterize the probability with which a typical child might acquire the target grammar after its critical linguistic experience during the learning phase. The techniques of statistical learning theory allow us to get a handle on this question. In the rest of this book, this probabilistic characterization of learnability will be used to quantify the degree to which the grammar of children might differ from that of their parents – thereby opening the door to the study of language change.
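For a feel of the numbers, the counting bound VC(H_n) ≤ nk log_2(n) for bounded-size DFAs can be combined with a generic PAC sample-size expression. The expression used below, m ≈ (1/ε)(d ln(1/ε) + ln(1/δ)), is one standard form and is my assumption here, not a formula taken from the text.

    from math import log, log2, ceil

    def dfa_vc_upper_bound(n_states, alphabet_size):
        return n_states * alphabet_size * log2(n_states)

    def pac_sample_size(d, eps, delta):
        # one standard (assumed) PAC-style sample-size form
        return ceil((d * log(1 / eps) + log(1 / delta)) / eps)

    d = dfa_vc_upper_bound(n_states=10, alphabet_size=2)   # 20 * log2(10), about 66.4
    print(d, pac_sample_size(d, eps=0.1, delta=0.05))
    # prints roughly 66.4 and 1560: a finite requirement, in contrast with the
    # unconstrained class of all DFAs.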

2.3.7

Complexity of Learning

Much of the research on inductive inference began with attempts to understand the conditions under which a learning algorithm would eventually converge to a target grammar. Even if learnability is possible in principle, one needs to analyze the complexity of the learning task. It could well be that convergence would take too long rendering the algorithms cognitively implausible for computational reasons. Much of the PAC and VC based analyses inject complexity-theoretic ideas into an evaluation of learnability. The discussion in prior sections focused on the informational complexity of learning with bounds on how many examples one would require to learn approximately well. It is also worth noting that given a finite sample, the task of choosing an appropriate grammar that fits the data may be an optimization problem of some difficulty. A range of hardness results clarify this phenomenon. For example, it is NP-hard to find the smallest DFA consistent with a set of positive and negative example sentences (Gold 1978). From a different point of view, Kearns and Valiant (1989) show that efficient DFA inference is hard under certain cryptographic assumptions. Abe and Warmuth (1992) consider the computational complexity of learning more general families of probabilistic grammars (e.g., Hidden Markov Models) and show that it is the computational complexity rather than informational complexity that is the barrier to efficient learnability for such cases.


2.3.8


Final Words

In this chapter, I have reviewed frameworks for meaningfully studying the abstract problem of inferring a language from examples. The problem was studied in great generality with minimal prior commitment to particular linguistic or cognitive predispositions. It is worthwhile to qualify my position here. I believe in linguistic structure. That is, I do believe that the objects of a language like phonemes, syllables, morphemes, phrases, sentences, and so on may be given a formal status and a particular language may then be characterized by the unique combinatorial and compositional structure of its linguistic objects. At the same time, the formal expressions of a language do refer (semantically) to states of affairs in the world and in this sense, language mediates a relationship between form and meaning. Beyond that, in this chapter, I have made minimal commitment to the nature of linguistic structure in the world’s languages or the procedure by which they are acquired by children. Therefore, my discussion has centered largely on the general problem of acquiring formal systems via learning algorithms. The general import of the results presented here is that tabula rasa learning, i.e., learning in the complete absence of prior information, is infeasible. Successful language acquisition therefore must come about because of the constraints inherent in the interaction of the learning child with its linguistic environment. The nature of these constraints is a matter of some debate, and in the next section I take a more linguistically oriented view of this state of affairs.

Chapter 3

Language Acquisition: A Linguistic Treatment

In the previous chapter, we considered in a somewhat abstract setting the inherent difficulty of inferring the identity of a potentially infinite set on the basis of examples from this set. We considered this problem from many different points of view to highlight, in particular, how learning with unconstrained hypothesis classes is impossible. Let us summarize the essential flow of the argument so far to appreciate the implications for linguistic theory and cognitive science.

1. Languages have a formal structure and in this sense may be viewed as sets of well-formed expressions. We have already discussed several different notions of language that are elaborations of this idea, leading ultimately to perhaps the most general notion of a language as a probability distribution on permissible form-meaning pairs. One may choose to work at any appropriate point in the linguistic hierarchy — from phonology to morphology to syntax to semantics — and at any such level one may meaningfully discuss the nature of the well-formed expressions in any particular natural language. Successful adult usage of a language relies on tacit knowledge of the nature of these well-formed expressions — knowledge that is acquired over the course of language acquisition.

2. On exposure to finite amounts of data, children are able to learn a language and generalize to produce and understand novel expressions they may have never encountered before.


3. Language acquisition does not depend upon the order of presentation of sentences and it largely proceeds on the basis of positive examples alone. There is very little explicit instruction to children regarding the nature of the grammatical rules that underlie the well-formedness of expressions. If we view grammars as compact representations of sets of expressions, then it is reasonable to think of language acquisition as a process of grammar construction, i.e., developing compact representations of the linguistic experience encountered over the course of language acquisition. Clearly children develop these representations without explicit instruction regarding the appropriate nature of such representations. 4. All naturally occurring languages are learnable by children. Thus children raised in a Mandarin-speaking environment of China would learn Mandarin, in the English-speaking environment of Chicago would learn English, and so on. Therefore there are no language specific (specific to a particular natural language, that is) predispositions. Furthermore, the class of possible natural languages must be such that every member of this class is learnable. 5. In the complete absence of prior information (constraints on the process of language acquisition), successful generalization to novel expressions is impossible. Therefore, it must be the case that children do not consider every possible set (of well-formed expressions) that is consistent with the data they receive. They consider some hypotheses and discard others – thus the class H must be constrained in some fashion. 6. Therefore the real issue at hand is the nature of the constraints that guide the learning child toward the correct grammatical hypotheses. Linguistic theory in the generative tradition attempts to circumscribe the range of grammatical hypotheses that humans might entertain. In this sense, most formal grammatical theories may be viewed as theories about the nature of H. Developmental psychologists attempt to characterize the constraints on the nature of A — the learning algorithm that children may plausibly use during language acquisition. Taken together, these constraints constitute an explanation for how successful language acquisition may come about. In this chapter, I begin the discussion by examining in some detail a language learning problem in a highly constrained setting that some linguists and psychologists have considered to be a useful model for study. This constrained setting illustrates the interaction of constraints from


1. Linguistic theory as embodied in grammar formalisms such as Government and Binding (Chomsky 1981; Haegeman 1991), Head-Driven Phrase Structure Grammar (Pollard and Sag 1994), Optimality Theory (Prince and Smolensky 1993), and approaches that may be accommodated within a broad construal of the Principles and Parameters view (Chomsky 1986).

2. Psychological learning theory as embodied in learning algorithms that make minimal demands on the learner in terms of memory and computational burdens. Examples of such algorithms include the Triggering Learning Algorithm (Gibson and Wexler 1994), the Stratified Learning Hierarchies of Tesar and Smolensky (1996,2000), and related approaches described in Bertolo 2001.

The origin of some of the ideas presented in this chapter lies in an attempt to formulate the learning-theoretic underpinnings for the Triggering Learning Algorithm (TLA) presented in “Triggers” (Gibson and Wexler 1994). In the next section, I provide a glimpse of some of the linguistic reasoning that accompanies the learning-theoretic considerations underlying the generative approach to linguistics. I then give a brief account of the Principles and Parameters framework, and the issues involved in learning within this framework. This sets the stage for our investigations, and I use as a starting point the TLA, working on a three-parameter syntactic subsystem first analyzed by Gibson and Wexler. Most of the chapter analyzes the TLA from the perspective of learnability. Issues pertaining to parameter learning in general, and the TLA in particular, are discussed at appropriate points. In the next chapter, I continue with the analysis developed here. I show how the framework allows us to characterize the sample complexity of learning in parameterized linguistic spaces. I then generalize well beyond the Principles and Parameters approach. In particular, I discuss how the Markov chain framework developed for the analysis of the TLA in parameterized linguistic spaces is applicable to any learning algorithm for any class of grammars (languages). Some general properties and special cases are then considered for important classes of cognitively plausible learning algorithms. The techniques thus developed will allow us to characterize the behavior of the individual learner precisely and create the learning-theoretic foundation upon which the rest of the book is developed.

CHAPTER 3

3.1

86

Language Learning and the Poverty of Stimulus

While it is clear from the discussion in the previous chapter that the class H of possible grammatical hypotheses must be constrained to ensure successful learnability, the analysis was conducted in a very abstract setting providing virtually no insight into the possible nature of these constraints. Much of linguistic theory, however, is developed with the goal of elucidating the precise nature of such constraints. The learning-theoretic arguments presented formally in the Gold setting and its variants in the previous chapter are developed somewhat more informally as the poverty of stimulus argument in the vast literature in generative linguistics. To develop some appreciation for the nature of the reasoning involved, consider the following paradigmatic case of question formation in English. (1)

a. John is bald.
b. Is John bald?

Users of English are able to take declarative statements as in (1a) and convert them to appropriate questions as in (1b). Thus both (1a) and (1b) are sequences of English words that speakers of English recognize as grammatical. Presumably, their knowledge (unconscious) of the underlying grammar of English enables them to make this judgment. Suppose the learner is provided with (1a) and (1b) as examples of English sentences. On the basis of this, there are many different rules that the learner may logically infer. Let us consider two rules that are consistent with the data. 1. Given a declarative sentence, take the second word in the sentence and move it to the front to form the appropriate question. 2. Given a declarative sentence, take the first “is” and move it to the front. Now consider the novel declarative statement where there are multiple instances of the word “is”. (2)

a. The man who is running is bald.

What might the appropriate interrogative form of this statement be? We instinctively recognize that (2b) is the appropriate interrogative form as opposed to (2c) below.


(2)

b. Is the man who is running < > bald?
c. ∗Is the man who < > running is bald?

Examining (2a–c) we see that neither rule 1 nor rule 2 is adequate for making the correct generalization from (2a) to (2b) but not (2c). In order to make the correct generalization, one must recognize that grammatical sequences have an internal structure. Thus, one may write (1a) as (1)

a. [John] is [bald]

and write (2a) as (2)

a. [The man who is running] is [bald]

It is now clear that rule 1 is clearly wrong whereas rule 2 may be saved with some additional consideration. To fully develop the rules for question formation will require significantly greater analysis, but at this point there are already two main conclusions that linguists would draw. The first is that given a finite amount of data, there are always many grammatical rules consistent with the data but that make different generalizations about novel examples. We have already explored this issue in some mathematical detail in the previous chapter, where the problem of inferring the correct rule system (grammar) was studied. The second conclusion is that sentences have an internal structure and these constituents play an important role in determining the correct grammatical system with the right generalization properties. The poverty of stimulus argument1 is developed through several paradigmatic cases of the form that we have just considered from many different areas of language (from morphology to syntax) and in many languages of the world. A rich literature exists on the kinds of generalizations children appear to make during language acquisition. This is accompanied by much theorizing about the kinds of grammatical objects and rules that are invoked in the different languages of the world. See Atkinson 1992 as well as Crain and Thornton 1998 for a treatment of language acquisition from the perspective of generative linguistics. 1

The poverty of stimulus argument has been and has remained controversial. While the discussion in the previous chapter argues that some form of prior constraint is inevitable, disagreements are over the precise nature of these constraints and to what extent they are language specific. For two different kinds of discussions about this question, see Pullum and Scholz 2002 and Elman et al. 1996.
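The two candidate rules discussed above can be made concrete as string operations and run on the text's own examples (1a) and (2a). The sketch below is purely illustrative; it manipulates flat word strings precisely because that is the (inadequate) structure-blind view under discussion.

    def rule_1(declarative):
        """Move the second word of the sentence to the front."""
        words = declarative.rstrip(".").split()
        return " ".join([words[1]] + words[:1] + words[2:]).capitalize() + "?"

    def rule_2(declarative):
        """Move the first "is" to the front."""
        words = declarative.rstrip(".").split()
        i = words.index("is")
        return " ".join(["is"] + words[:i] + words[i + 1:]).capitalize() + "?"

    print(rule_1("John is bald."))                    # Is john bald?   (right question)
    print(rule_2("John is bald."))                    # Is john bald?   (right question)
    print(rule_1("The man who is running is bald."))  # Man the who is running is bald?  (wrong)
    print(rule_2("The man who is running is bald."))  # Is the man who running is bald?  (wrong; this is essentially (2c))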


One particular approach to these questions has been the Principles and Parameters (P&P) approach that I discuss in the next section. Before proceeding to examine the basic tenets and some examples of the P&P approach to language, it is worthwhile to clarify that it represents only one particular point of view. Many other approaches exist and most linguistic theories, including those subsumed by P&P may be regarded as tentative theses about the nature of H — theses that may well turn out to be inadequate or wrong. Therefore it would certainly be premature for me to commit myself to any one particular articulation of H. Rather, I aim to present the basic principles of learning and evolution in a manner that is general enough to be adapted to any particular linguistic setting depending upon one’s theoretical persuasion. With that qualifying remark, let us move on.

3.2

Constrained Grammars — Principles and Parameters

Having recognized the need for constraints on the class of grammars H, researchers have investigated several possible ways of incorporating such constraints in the classes of grammars to describe the natural languages of the world. Examples of this range from linguistically motivated grammars such as Head-Driven Phrase Structure Grammars (HPSG), Lexical-Functional Grammars (LFG), Government and Binding (GB), and Optimality Theory (OT) to bigrams, trigrams and connectionist schemes suggested from an engineering consideration of the design of spoken-language systems. Note that every such grammar suggests a very specific model for human language, with its own constraints and its own complexity. The Principles and Parameters framework (Chomsky 1981) attempts to describe H in a parametric fashion. It tries to articulate the “universal” principles common to all the natural languages of the world and the parameters of variation across them. On this account, roughly speaking, there are a finite number of principles governing the production of human languages. These abstract principles can take one of several (finite) specific forms — this specific form manifests itself as a rule, peculiar to a particular language (or class of languages). The specific form that such an abstract principle can take is governed by setting an associated parameter to one of several values. In typical versions of theories constructed within such a framework, one therefore ends up with a parameterized class of grammars. The parameters are Boolean valued. Setting them to one set of values defines the


grammar of German (say); setting them to another set of values defines the grammar, perhaps, of Chinese. One may also view this as an attempt to recover the principal dimensions of language variation. A high-level analogy with data analysis is perhaps appropriate here. Given a set of data points x 1 , . . . xn in a k-dimensional space (where k is large), the well-known technique of principal components analysis projects the data into a subspace that preserves as much of the variation in the data set as possible. Parametric modeling of the data is then conducted in this lower-dimensional subspace. Each naturally occurring language may be viewed as a data point in a very high dimensional space. To reduce the problem of modeling the space of all naturally occurring languages to more manageable proportions, one tries to reduce the dimensionality (perhaps drastically) by considering subspaces or modules that are “important” in some sense. Although the syntactic framework of Government and Binding (Chomsky 1981) is most closely associated with the P&P framework, a broad interpretation of this point of view would include several additional linguistic theories such as HPSG, varieties of LFG, Optimality Theory, and so on. For example, in a constraint-based theory such as Optimality Theory, there are n universal constraints and different natural languages differ in the ranking they give to each of these constraints. On this account, there are n! different natural-language grammars. These ideas are best illustrated in the form of examples. I will provide two examples, drawn from syntax and phonology respectively.
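The combinatorics implicit in this picture is easy to spell out: n Boolean parameters define 2^n candidate grammars, while n ranked Optimality Theory constraints define n! grammars. The toy arithmetic below is illustrative only.

    from itertools import product
    from math import factorial

    def num_pp_grammars(n_parameters):
        return 2 ** n_parameters          # one grammar per Boolean parameter vector

    def num_ot_grammars(n_constraints):
        return factorial(n_constraints)   # one grammar per total ranking of constraints

    print(list(product([0, 1], repeat=3)))            # the eight settings for n = 3
    print(num_pp_grammars(3), num_ot_grammars(3))     # 8 vs. 6
    print(num_pp_grammars(20), num_ot_grammars(20))   # 1048576 vs. 2432902008176640000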

3.2.1 Example: A Three-Parameter System from Syntax

Different aspects of syntactic competence are treated in a modular fashion, and here I discuss three syntactic parameters that were first extensively studied in the framework of learning theory by Gibson and Wexler (1994).

Two X-bar Parameters

A classic example of a parametric grammar for syntax comes from X-bar theory (see Jackendoff 1977 for an early exposition). This describes a parameterized Phrase Structure Grammar, which defines the production rules for constituent phrases, and ultimately sentences, in the language. The general format for phrase structure is summarized by the following parameterized production rules:


XP → Spec X′ (p1 = 0) or X′ Spec (p1 = 1)
X′ → Comp X′ (p2 = 0) or X′ Comp (p2 = 1)
X′ → X

XP refers to an X-phrase, where X, or the “head”, is a lexical category like N (noun), V (verb), A (adjective), P (preposition), and so on. Thus, one could generate a noun phrase (denoted by NP) where the head X = N is a noun, or a verb phrase (VP) and other kinds of phrases by a recursive application of these production rules. “Spec” is short for specifier — in other words, that part of the phrase that “specifies” it, roughly like the old in the noun phrase the old book. “Comp” is short for the complement, roughly a phrase’s arguments, like an ice-cream in the verb phrase ate an ice-cream, or with envy in the adjective phrase green with envy. Spec and Comp are constituents and could be phrases with their own specifiers and complements. Furthermore, in a particular phrase, the Spec-position or the Comp-position might be blank (in these cases, Spec → ∅ or Comp → ∅, respectively). Thus we might include the additional rules

Spec → XP ; Spec → ∅
Comp → XP ; Comp → ∅

Further, these rules are parameterized. Languages can be Spec-first (p1 = 0) or Spec-final (p1 = 1). Similarly, they can be Comp-first or Comp-final. For example, the parameter settings of English are (Spec-first, Comp-final). Applying these rules recursively, one can thus generate embedded phrases of arbitrary length in the language. As an example, consider the following English noun phrase (after Haegeman 1991):

[The investigation (of [the corpse]_NP )_PP (after lunch)_PP ]_NP

The constituents are indicated by bracketing. The square brackets [ ] are used to indicate a noun phrase (NP) while parentheses ( ) are used to indicate a prepositional phrase (PP). A partial derivation of the entire phrase is provided below:

XP → [Spec X′] → [Spec [X′ Comp]] → [Spec [X′ XP]]
   → [Spec [[X′ XP] XP]] → [Spec [[X XP] XP]] → [the [[N PP] PP]]


Thus, the N expands into the noun “investigation” and the two prepositional phrases expand into “of the corpse” and “after lunch”, respectively, by a similar application of these rules. In general, the derivation may be described as a tree structure. Shown in Figure 3.1 is an embedded phrase that demonstrates the use of the X-bar production rules (with the English parameter settings) to generate another arbitrary English phrase.


Figure 3.1: Analysis of an English sentence. The parameter settings for English are Spec-first and Comp-final.

In contrast, the parameter settings of Bengali are (Spec-first, Comp-first). The translation of the same sentence is provided in Figure 3.2. Notice how a difference in the Comp-parameter setting causes a systematic difference in the word orders of the two languages. Some linguists believe that, as far as basic, underlying word order is concerned, X-bar theory covers all the important


possibilities for natural languages.[2] Languages of the world simply differ in their settings with respect to the X-bar parameters.

[2] A variety of other formalisms have been developed to take care of finer details of sentence structure. This has to do with case theory, movement, government, binding, and so on. See Haegeman 1991. There is also the issue of scrambling and how to deal with languages having apparently free word order.


Figure 3.2: Analysis of the Bengali translation of the English sentence of Figure 3.1. The parameter settings for Bengali are Spec-first and Comp-first.
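To make the effect of the two X-bar parameters concrete, here is a minimal sketch (mine, not part of the original exposition) that linearizes a simplified phrase, each head taking at most one specifier and one complement, under a given setting of p1 and p2. With the English settings it yields the head-before-complement order of Figure 3.1; with the Bengali settings it yields the mirror-image complement-before-head order of Figure 3.2.

    # A phrase is (head, spec, comp); spec and comp are phrases or None.
    # p1: 0 = Spec-first, 1 = Spec-final.  p2: 0 = Comp-first, 1 = Comp-final.
    def linearize(phrase, p1, p2):
        if isinstance(phrase, str):                 # a bare lexical head
            return [phrase]
        head, spec, comp = phrase
        head_part = linearize(head, p1, p2)
        spec_part = linearize(spec, p1, p2) if spec is not None else []
        comp_part = linearize(comp, p1, p2) if comp is not None else []
        xbar = comp_part + head_part if p2 == 0 else head_part + comp_part   # X' level
        return spec_part + xbar if p1 == 0 else xbar + spec_part             # XP level

    # VP "ran [PP from [NP there]]", simplified to one complement per head.
    vp = ("ran", None, ("from", None, ("there", None, None)))
    print(" ".join(linearize(vp, p1=0, p2=1)))   # Spec-first, Comp-final:  ran from there
    print(" ".join(linearize(vp, p1=0, p2=0)))   # Spec-first, Comp-first:  there from ran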

One Transformational Parameter (V2)

The two parameters described above define generative rules to obtain basic word-order combinations permitted in the world’s languages. As mentioned before, there are many other aspects that govern the formation of sentences. For example, there are transformational rules that determine the production of surface word order from the underlying (base) word-order structure obtained from the production rules above. We saw an example of this earlier when we studied the relation between the interrogative form and the declarative form of the same sentence. The interrogative form was obtained by a transformation of the declarative. This transformation involved moving an appropriate constituent (in the example considered, it was the word “is”) to the front. These transformational rules may also be parameterized, and one such parameterized transformational rule that governs the movement of words within a sentence is associated with what has come to be known as the V2 parameter. It is observed that in German and Dutch declarative sentences, the relative order of verbs and their complements seems to vary depending upon whether the clause in which they appear is a root clause or a subordinate clause. Consider the following German clauses:

(3) a. ...dass Karl das Buch kauft.
       ...(that Karl the book buys.)
       ...that Karl buys the book.

    b. Karl kauft das Buch.
       Karl buys the book.

This seems to present a complication in that from these sentences it is not clear whether German is Comp-first (as example (3a) seems to suggest where the complement “das Buch” precedes the verb “kauft”) or Comp-final (as example (3b) seems to suggest). It is believed (Haegeman 1991) that the underlying word-order form is Comp-first (like Bengali, and unlike English, in this respect); however, the V 2 parameter is set for German (p 3 = 1). This implies that finite verbs must move so as to appear in exactly the second position in root declarative clauses (p 3 = 0 would mean that this need not be the case). Therefore the surface form of (3b) is derived by applying such a V 2 constraint after generating a base form according to Comp-first generative rules. This may be viewed as a specific application of a more general transformational rule Move-α. For details and analysis, see (Haegeman 1991). Each of these three parameters can take one of two values. There are thus eight possible grammars (grammatical modules really), and correspondingly eight languages by extension, generated in this fashion. At this stage, the languages are defined over a vocabulary of syntactic categories, like N, V, and so on. Applying the three parameterized rules, one would obtain different ways of combining these syntactic units to form valid expressions


(sentences) in each of the eight languages. The appendix contains a list of the set of unembedded (degree-0) sentences obtained for each of the languages, L1 through L8 , in this parametric system. The vocabulary has been modified so that sentences are now defined over more abstract units than syntactic categories.

3.2.2 Example: Parameterized Metrical Stress in Phonology

The previous example dealt with a parameterized family for a syntactic module of grammar. Let us now consider an example from phonology. Our example relates to the domain of metrical stress, which describes the possible ways in which words in a language can be accented during pronunciation. Consider the English word “candidate”. This is a three-syllable word, composed of the following syllables: /can/, /di/, and /date/. A native speaker of American English typically pronounces this word by stressing its first syllable. Similarly, such a native speaker would also stress the first syllable of the tri-syllabic word “/al/-/pha/-/bet/”, so that it almost rhymes with “candidate”. In contrast, a French speaker would stress the final syllable of both these words — a contrast that is perceived as a “French” accent by the English ear. For simplicity, assume that stress has two levels, i.e., each syllable in each word can be either stressed or unstressed.[3] Thus, an n-syllable-long word could have, in principle, as many as 2^n different possible ways of being stressed. For a particular language, however, only a small number of these ways are phonologically well-formed. Other stress patterns sound accented or awkward. Words could potentially be of arbitrary length.[4] Thus one could write phonological grammars — a functional mapping from these words to their correct stress pattern. Further, different languages correspond to different such functions, i.e., they correspond to different phonological grammars. Within the Principles and Parameters framework, an attempt is made to parameterize these phonological grammars.

[3] While I have not provided a formal definition of either stress or syllable, I hope that at some level the concepts are intuitive to the reader. It should, however, be pointed out that linguists differ on their characterization of both these objects. For example, how many levels can stress have? Typically (Halle and Idsardi 1992), three levels are assumed. Similarly, syllables are classified into heavy and light syllables. I have discounted such niceties for ease of presentation.

[4] One shouldn’t be misled by the fact that a particular language has only a finite number of words. When presented with a foreign word, or a “nonsense” word one hasn’t heard before, one can still attempt to pronounce it. Thus, the system of stress assignment rules in our native language probably dictates the manner in which we choose to pronounce it. Speakers of different languages would accent these nonsense words differently.


Let us consider a simplified version of two principles associated with three Boolean-valued parameters that play a role in the Halle and Idsardi 1992 metrical stress system. These principles describe how a multisyllable word can be broken into its constituents (recall how sentences were composed of constituent phrases in syntax) before stress assignment takes place. This is done by a bracketing schema that places brackets at different points in the word, thereby marking (bracketing) off different sections as constituents. A constituent is then defined as a syllable sequence between consecutive brackets. In particular, a constituent must be bounded by a right bracket on its right edge, or a left bracket on its left edge (both these conditions need not be satisfied simultaneously). Further, it cannot have any brackets in the middle. Finally, note that not all syllables of the word need be part of a constituent. A sequence of syllables might not be bracketed by either an appropriate left or right bracket — such a sequence cannot have a stress-bearing head, and might be regarded as an extrametrical sequence. Here are some further distinctions.

1. The edge parameters: there are two such parameters.
   a. Put a left (p1 = 0) or right (p1 = 1) bracket.
   b. Put the above-mentioned bracket exactly one syllable after the left (p2 = 0) edge or before the right (p2 = 1) edge of the word.

2. The head parameter: each constituent (made up of one or more syllables) has a “head”. This is the stress-bearing syllable of the constituent, and is in some sense the primary or most important syllable of that constituent (recall how syntactic constituents, the phrases, had a lexical head). This phonological head could be the leftmost (p3 = 0) or rightmost (p3 = 1) syllable in the constituent.

Suppose the parameters are set to the following set of values: [p1 = 0, p2 = 0, p3 = 0]. Figure 3.3 shows how some multisyllable words would have stress assigned to them. In this case, any n-syllable word would have stress in exactly the second position (if such a position exists) and no other. In contrast, if [p1 = 0, p2 = 0, p3 = 1], the corresponding language would stress the final syllable of all multisyllable words. Monosyllabic words are unstressed in both languages. These three parameters represent a very small (almost trivial) component of stress-pattern assignment. There are many more parameters that describe metrical stress assignment in more complete fashion.

At this level of analysis, for example, the language Koya has p3 = 0, while Turkish has p3 = 1; see Kenstowicz 1994 for more details. This example provides a flavor of how the problem of stress assignment can be described formally by a parametric family of functions. The analysis of parametric spaces developed in this chapter can be equally well applied to such stress systems.

Figure 3.3: Depiction of stress-pattern assignment to words of different syllable length under the parameterized bracketing scheme described in the text. The two panels correspond to the settings [p1 = 0, p2 = 0, p3 = 0] (stress on the second syllable) and [p1 = 0, p2 = 0, p3 = 1] (stress on the final syllable).
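The bracketing principles above can also be rendered as a short program. The sketch below is an illustration under simplifying assumptions (a single inserted bracket, two stress levels, and a constituent taken to be the maximal syllable span bounded by that bracket); it reproduces the two patterns of Figure 3.3.

    def assign_stress(n_syllables, p1, p2, p3):
        """Stressed positions (1-indexed) for an n-syllable word under the
        simplified three-parameter bracketing scheme described in the text.
        p1: 0 = left bracket, 1 = right bracket
        p2: 0 = one syllable after the left edge, 1 = one syllable before the right edge
        p3: 0 = head is the leftmost syllable of the constituent, 1 = the rightmost"""
        gap = 1 if p2 == 0 else n_syllables - 1      # bracket sits after syllable `gap`
        if gap <= 0 or gap >= n_syllables:
            return set()                             # no room for a bracket: no constituent
        if p1 == 0:                                  # left bracket: constituent extends rightward
            constituent = list(range(gap + 1, n_syllables + 1))
        else:                                        # right bracket: constituent extends leftward
            constituent = list(range(1, gap + 1))
        return {constituent[0] if p3 == 0 else constituent[-1]}

    print([assign_stress(n, 0, 0, 0) for n in range(1, 6)])  # [set(), {2}, {2}, {2}, {2}]
    print([assign_stress(n, 0, 0, 1) for n in range(1, 6)])  # [set(), {2}, {3}, {4}, {5}]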

3.3 Learning in the Principles and Parameters Framework

Language acquisition in the Principles and Parameters framework reduces to the estimation of the parameters corresponding to the “target” language. A child is born in an arbitrary linguistic environment. It receives examples in the form of sentences it hears in its linguistic environment. On the basis of example sentences it hears, it presumably learns to set the parameters appropriately. Thus, referring to our three-parameter system for syntax, if the child is born in a German-speaking environment and hears German sentences, it should learn to set the V2 parameter (to +V 2), and the Specparameter to Spec-first. Similarly, a child hearing English sentences should learn to set the Comp-parameter to Comp-final. In principle, the child is thus solving a parameter estimation problem — an unusual class of parameter estimation problems, no doubt, but in spirit, little different from the parameter estimation problems that are encountered in statistical learning. One can thus ask a number of questions about such problems. What sort of data does the child need in order to set the target parameters? Is such data readily available to the child? How often is such data made available to the child? What sort of algorithms does the child use in order to set the


parameters? How efficient are these algorithms? How much data does the child need? Will the child always converge to the target “in the limit”?

Language acquisition, in the context of parameterized linguistic theories, thus gives rise to a class of learning problems associated with finite parameter spaces. Furthermore, as emphasized particularly by Wexler in a series of works (Hamburger and Wexler 1975; Wexler and Culicover 1980; Gibson and Wexler 1994), the finite character of these hypothesis spaces does not solve the language-acquisition problem. As Chomsky notes in Aspects of the Theory of Syntax (1965), the key point is how the space of possible grammars — even if finite — is “scattered” with respect to the primary language input data. It is logically possible for just two grammars (or languages) to be so near each other that they are not separable by psychologically realistic input data. This was the thrust of Hamburger and Wexler as well as Wexler and Culicover’s earlier work on the learnability of transformational grammars from simple data (with at most two embeddings). More recently, a significant analysis of specific parameterized theories has come from Gibson and Wexler 1994. They propose the Triggering Learning Algorithm — a simple, psychologically plausible algorithm that children might conceivably use to set parameters in finite parameter spaces. Investigating the performance of the TLA on the three-parameter syntax subsystem discussed in the previous example yields the surprising result that the TLA cannot achieve the target parameter setting for every possible target grammar in the system. Specifically, there are certain target parameter settings for which the TLA could get stuck in local maxima from which it would never be able to leave, and consequently, learnability would never result.

In this chapter, our interest lies in both the learnability and the sample complexity of the finite hypothesis classes suggested by the Principles and Parameters theory. An investigation of this sort requires us to define the important dimensions of the learning problem — the issues that need to be systematically addressed. Figure 3.4 provides a schematic representation of the space of possibilities that needs to be explored in order to completely understand and evaluate a parameterized linguistic family from a learning-theoretic perspective. The important dimensions are summarized in the following list.


Figure 3.4: The space of possible learning problems associated with parameterized linguistic theories. Each axis (parameterization, distribution of the data, noise, learning algorithm, memory requirements) represents an important dimension along which specific learning problems might differ. Each point in this space specifies a particular learning problem. The entire space represents the class of learning problems that are considered interesting within the Principles and Parameters framework.


1. The parameterization of the language space itself: a particular linguistic theory would give rise to a particular choice of universal principles and associated parameters. Thus, one could vary, along this dimension of analysis, the parameterization of the hypothesis classes that need to be investigated. The parametric system for metrical stress is due to Halle and Idsardi (1992). A variant, investigated by Dresher and Kaye (1990), can equally well be subjected to analysis.

2. The distribution of the input data: once a parametric system is decided upon, one must, then, decide the probability distribution according to which data (i.e., sentences generated by some target grammar belonging to the parameterized family of grammars) is presented to the learner. Clearly, not all sentences occur with equal frequency. Some are more likely than others. How does this affect learnability? How does this affect sample complexity? One could, of course, attempt to come up with distribution-independent bounds on the sample complexity. This, as we will soon see, is not possible.

3. The presence, and nature, of noise, or extraneous examples: in practice, children are exposed to noise (sentences that are inconsistent with the target grammar) due to the presence of foreign or idiosyncratic speakers, disfluencies in speech, or a variety of other reasons. How does one model noise? How does it affect sample complexity or learnability or both?

4. The type of learning algorithm involved: a learning algorithm is an effective procedure mapping data to hypotheses (parameter values). Given that the brain has to solve this mapping problem, it then becomes of interest to study the space of algorithms that can solve it. How many of them converge to the target? What is their sample complexity? Are they psychologically plausible?

5. The use of memory: this is not really an independent dimension, in the sense that it is related to the kind of algorithm used. The TLA and variants, as we will soon see, are memoryless algorithms. These can be modeled by a Markov chain.

This is the space that needs to be explored. By making a specific choice along each of the five dimensions discussed (corresponding to a single point in the five-dimensional space of Figure 3.4), we arrive at a specific learning problem. Varying the choices along each dimension (thereby traversing the entire space of Figure 3.4) gives rise to the class of learning problems associated with parameterized linguistic theories. For our analysis, we choose as a concrete starting point the Gibson and Wexler Triggering Learning Algorithm (TLA) working on the three-parameter syntactic subsystem in the example shown. In our space of language learning problems, this corresponds to (1) a three-way parameterization, using mostly X-bar theory;


(2) a uniform sentence distribution over unembedded (degree-0) sentences; (3) no noise; (4) a local gradient ascent search algorithm; and (5) memoryless (online) learning. Following our analysis of this learning system, we consider variations in learning algorithms, sentence distribution, noise, and language/grammar parameterizations.

3.4 Formal Analysis of the Triggering Learning Algorithm

Let us start with the TLA. I first show that this algorithm and others like it are completely modeled by a Markov chain. I explore the basic computational consequences of this fundamental fact, including some surprising results about sample complexity and convergence time, the dominance of random walk over hill climbing, and the potential applicability of these results to actual child-language acquisition and possibly language change — a theme that I build upon over the course of this book.

3.4.1 Background

Following Gold (1967) and Gibson and Wexler (1994), the basic framework is that of identification in the limit. Recall Gold’s assumptions from the previous chapter. The learner receives an (infinite) sequence of (positive) example sentences from some target language. After each example presentation, the learner either stays in the same state or moves to a new state (changes its parameter settings). If after some finite number of examples the learner converges to the correct target language and never changes its guess, then it has correctly identified the target language in the limit; otherwise, it fails. In the Gibson and Wexler model (and others) the learner obeys two additional fundamental constraints: (1) the single-value constraint — the learner can change only one parameter value at each step; and (2) the greediness constraint — if the learner is given a positive example it cannot recognize, it will change a parameter value only if, with the new parameter settings, it is now able to analyze the new sentence. The TLA can then be precisely stated as follows. See Gibson and Wexler 1994 for further details.

• [Initialize] Step 1. Start at some random point in the (finite) space of possible parameter settings, specifying a single hypothesized grammar with its resulting extension as a language.

• [Process input sentence] Step 2. Receive a positive example sentence si on the ith iteration (examples drawn from the language of a single target grammar, L(Gt)), from a uniform distribution on the degree-0 sentences of the language (we relax this distributional constraint later on).

• [Learnability on error detection] Step 3. If the current grammar parses (generates) si, then go to Step 2; otherwise, continue.

• [Single-step hill climbing] Step 4. Select a single parameter uniformly at random to flip from its current setting, and change it (0 mapped to 1, 1 to 0) iff that change allows the current sentence to be analyzed.

• [Iterate] Step 5. Go to Step 2.
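For readers who prefer code, the five steps translate into a few lines. The sketch below is mine, not Gibson and Wexler’s: it assumes each grammar is available extensionally, as a set of degree-0 surface strings keyed by its parameter vector, and it draws target sentences uniformly, as in Step 2.

    import random

    def tla_step(hypothesis, sentence, languages, n_params):
        """One memoryless TLA update; `hypothesis` is a tuple of 0/1 parameter values,
        `languages` maps each parameter tuple to its set of degree-0 surface strings."""
        if sentence in languages[hypothesis]:       # Step 3: current grammar analyzes it
            return hypothesis
        i = random.randrange(n_params)              # Step 4: pick one parameter at random
        flipped = list(hypothesis)
        flipped[i] = 1 - flipped[i]
        flipped = tuple(flipped)
        # Greediness: keep the flip only if it renders the sentence analyzable.
        return flipped if sentence in languages[flipped] else hypothesis

    def tla_run(target, languages, n_params, n_examples):
        hypothesis = tuple(random.randrange(2) for _ in range(n_params))   # Step 1
        target_sentences = sorted(languages[target])
        for _ in range(n_examples):                 # Steps 2 and 5
            sentence = random.choice(target_sentences)
            hypothesis = tla_step(hypothesis, sentence, languages, n_params)
        return hypothesis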

Of course, this algorithm never halts in the usual sense. Gibson and Wexler aim to show under what conditions this algorithm converges “in the limit”—that is, after some number, m, of steps, where m is unknown, the correct target parameter settings will be selected and never changed. They investigate the behavior of the TLA on a linguistically natural, threeparameter subspace (of the complete linguistic parametric space that involves many more parameters). This subspace was discussed in Section 3.2.1 and will be reviewed again shortly. Note that a grammar in this space is simply a particular n-length array of 0’s and 1’s; hence there are 2 n possible grammars (languages). Gibson and Wexler’s surprising result is that the simple three-parameter space they consider is unlearnable in the sense that positive-only examples can lead to local maxima — incorrect hypotheses from which a learner can never escape. More broadly, they show that learnability in such spaces is still an interesting problem, in that there is a substantive learning theory concerning feasibility, convergence time, and the like, that must be addressed beyond traditional linguistic theory and that might even choose between otherwise adequate linguistic theories. Remark. Various researchers (Clark and Roberts 1993; Frank and Kapur 1992; Gibson and Wexler 1994; Lightfoot 1991; Fodor 1998, Bertolo 2001) have explored the notion of triggers as a way to model parametric language learning. For these researchers, triggers are essentially sentences from the target that cannot be analyzed by the learner’s current grammatical hypothesis and thereby indirectly inform it about the correct hypothesis. Gibson and Wexler suggest that the existence of triggers for every (hypothesis, target) pair in the space suffices for TLA learnability to hold. As we will see later, one important corollary of our stochastic formulation shows that


this condition does not suffice. In other words, even if a triggered path exists from the learner’s hypothesis language to the target, the learner might, with high probability, not take this path, resulting in nonlearnability. A further consequence is that many of Gibson and Wexler’s proposed cures for nonlearnability in their example system, such as a “maturational” ordering imposed on parameter settings, simply do not apply. On the other hand, this result reinforces Gibson and Wexler’s basic point that apparently simple parameter-based language-learning models can be quite subtle—so subtle that even a seemingly complete computer simulation can fail to uncover learnability problems.

3.4.2 The Markov Formulation

Given this background, I turn directly to the formalization of parameter-space learning in terms of Markov chains. This formalization is in fact suggested but left unpursued in a footnote in Gibson and Wexler 1994.

Parameterized Grammars and their Corresponding Markov Chains

Consider a parameterized grammar (language) family with n parameters. We picture the 2^n-size hypothesis space as a set of points; see Figure 3.5 for the three-parameter case. Each point corresponds to one particular vector of parameter settings (languages, grammars). Call each point a hypothesis state or simply state of this space. As is conventional, we define these languages over some alphabet.[5] One state is the target language (grammar). Without loss of generality, we may place the (single) target language at the center of this space. Since by the TLA the learner is restricted to moving at most one binary value in a single step, the theoretically possible transitions between states can be drawn as (directed) lines connecting parameter arrays (hypotheses) that differ by at most one binary digit (a 0 or a 1 in some corresponding position in their arrays). (Recall that the distance between the grammars in parameter space is the so-called Hamming distance.) We may further place weights, b_ij, on the transitions from state i to state j. These correspond to the probabilities that the learner will move from hypothesis state i to state j. In fact, given a probability distribution over the sentences of the target language L(G), we can carry out an exact calculation of these transition probabilities themselves.

[5] Following standard notation, Σ denotes a finite alphabet and Σ* denotes the set of all finite strings (sentences) obtained by concatenating elements of Σ.



Figure 3.5: The 8 parameter settings in the GW example, shown as a Markov structure. Directed arrows between circles (states, parameter settings, grammars) represent possible nonzero (possible learner) transitions. The target grammar (in this case, number 5, setting [0 1 0]), lies at dead center. Around it are the three settings that differ from the target by exactly one binary digit; surrounding those are the 3 hypotheses two binary digits away from the target; the third ring out contains the single hypothesis that differs from the target by 3 binary digits. Note that the learner can either stay in the same state or step in or out one ring (binary digit) at a time, according to the single-step learning hypothesis; but some transitions are not possible because there is no data to drive the learner from one state to the other under the TLA. Numbers on the arcs denote transition probabilities between grammar states; these values are not computed by the original GW algorithm. The next section shows how to compute these values, essentially by taking language set intersections.


Thus, we can picture the TLA learning space as a directed, labeled graph V with 2^n vertices.[6] As mentioned, not all these transitions will be possible in general. For example, by the single-value hypothesis, the system can only move one bit at a time. Also, by assumption, only differences in surface strings can force the learner from one hypothesis state to another. For instance, if state i corresponds to a grammar that generates a language that is a proper subset of another grammar hypothesis j, there can never be a transition from j to i, and there might be one from i to j. Further, it is clear that once we reach the target grammar there is nothing that can move the learner from this state, since no positive evidence will cause the learner to change its hypothesis. Thus, there must be a loop from the target state to itself, and no exit arcs. In the Markov chain literature, this is known as an Absorbing State (A). Obviously, a state that leads only to an absorbing state will also drive the learner to that absorbing state. If a state corresponds to a grammar that generates some sentences of the target there is always a loop from that state to itself, with some nonzero probability. Finally, let us introduce the notion of a closed set of states C to be any proper subset of states in the Markov chain such that there is no arc from any of the states in C to any state outside C in the Markov chain (see Isaacson and Madsen 1976; Resnick 1992; and later in this chapter for further details). In other words, it is a set of states from which there is no way out to other states lying outside this set. Clearly, a closed set with only one element (state) is an absorbing state. Note that in the absence of noise, the target state is always an Absorbing State in the systems under discussion. This is because once the learner is at the target grammar, all examples it receives are analyzable and it will never exit this state. Consequently, the Markov chains we will consider always have at least one A. Given this formulation, one can immediately provide a very simple learnability theorem stated in terms of the Markov chains corresponding to finite parameter spaces and learning algorithms.[7] I do this below.

[6] Gibson and Wexler construct an identical transition diagram in the description of their computer program for calculating local maxima. However, this diagram is not explicitly presented as a Markov structure; it does not include transition probabilities, which we will see lead to crucial differences in learnability results. Of course, topologically, both structures must be identical.

[7] Note that learnability requires that the learner converge to the target state from any initial state in the system.


Markov Chain Criteria for Learnability

I have argued how the behavior of the Triggering Learning Algorithm can be formalized by a Markov chain. This argument will be formally completed by providing details of the transition probabilities in a little while. While the formalization is provided for the TLA, every memoryless learning algorithm A for identifying a target grammar gf from a family of grammars G via positive examples can be formalized as a Markov chain M (see the following chapter). In particular, M has as many states as there are grammars in G, with the states in M being in 1-1 correspondence with grammars g ∈ G. The target grammar gf corresponds to a target state sf of M. We call M the Markov chain associated with the triple (A, G, gf), and the triple itself a memoryless learning system, or learning system for short. The triple completely decides the topology of the chain. The transition probabilities of the chain are related to the probability P with which sentences are presented to the learner. An important question of interest is whether or not the learning algorithm A identifies the target grammar in the limit. The following theorem shows how to translate this conventional Gold-learnability criterion for identifiability in the limit into a corresponding Markov chain criterion for such memoryless learning systems. We first recall the familiar definition for a probabilistic version of Gold learnability:

Definition 3.1 Consider a family of grammars G, a target grammar gf ∈ G, and a learning algorithm A that is exposed to sentences from the target according to some arbitrary distribution P. Then gf is said to be Gold learnable by A for the distribution P if and only if A identifies gf in the limit with probability 1. A family of grammars G is Gold learnable if and only if each member of G is Gold learnable.

The learnability theorem below says that if a target grammar gf ∈ G is to be Gold learnable by A, then the Markov chain associated with the particular learning system must be restricted in a certain way. To understand the statement of the theorem, we first recall the related notions of absorbing state and closed set of states. Intuitively, these terms refer to Markov chain connectivity and associated probabilities: an absorbing state has no exit link to any other state, while a closed set of states is the extension of the absorbing state notion to a set of states. They have already been


introduced informally in the earlier section for pedagogical reasons. They are reproduced here again for completeness of the current formal account.

Definition 3.2 Given a Markov chain M, an absorbing state of M is a state s ∈ M that has no exit arcs to any other states of M.

Since by the definition of a Markov chain the sum of the transition probabilities exiting a state must equal 1, it follows that an absorbing state must have a self-loop with transition probability 1. In a learning system that makes transitions based on error detection, the target grammar will be an absorbing state, because once the learner reaches the target state, all examples are analyzable and the learner will never exit that state.

Definition 3.3 Given a Markov chain M, a closed set of states (C) is any proper subset of states in M such that there is no arc from any of the states in C to any state not in C.

If two states belong to the same closed set C then there may be transitions from one to the other. Further, there can be transitions from states outside C to states within C. However, there cannot be transitions from states within C to states outside C. Clearly, an absorbing state represents the special case of a closed set of states consisting of exactly one element, namely, the absorbing state itself. I can now state the learnability theorem.

Theorem 3.1 Let ⟨A, G, gf ∈ G⟩ be a memoryless learning system. Let sentences from the target be presented to the learner according to the distribution P and let M be the Markov chain associated with this learning system. Then the target gf is Gold learnable by A for the distribution P if and only if M is such that every closed set of states in it includes the target state corresponding to gf.

Proof: This has been relegated to the appendix for continuity of reading.

Thus, if we are interested in the Gold learnability of a memoryless learning system, one could first construct the Markov chain corresponding to such a system and then check to see if the closed sets of the chain satisfy the conditions of the above theorem. If and only if they do, the system is Gold learnable. I now provide an informal example of how to construct a Markov chain for a parametric family of languages. This is followed by a formal account


of how to compute the transition probabilities of the Markov chain. Finally, I note some additional properties of the learning system that fall out as a consequence of my analysis. For example, my analysis is consistent with the subset principle, it can handle a variety of algorithms, and it can even handle noise.
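The criterion of Theorem 3.1 can be checked mechanically once the chain's topology (the arcs of nonzero probability) is known: since any closed set of states either contains the target or witnesses a state from which the target is unreachable, the condition is equivalent to the target being reachable from every state. A small sketch of such a check (the function name is mine):

    from collections import deque

    def gold_learnable(arcs, target):
        """arcs[s] = set of states reachable from s in one step with nonzero
        probability. Returns True iff the target is reachable from every state,
        i.e., iff every closed set of states contains the target (Theorem 3.1)."""
        reverse = {s: set() for s in arcs}
        for s, outs in arcs.items():
            for t in outs:
                reverse[t].add(s)
        seen, queue = {target}, deque([target])     # BFS backward from the target
        while queue:
            u = queue.popleft()
            for v in reverse[u] - seen:
                seen.add(v)
                queue.append(v)
        return seen == set(arcs)

Applied to the chain of Figure 3.5 with target state 5, this check fails, since states 2 and 4 can never reach the target.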

Example. Consider the following three-parameter system studied by Gibson and Wexler (1994). Its binary parameters are: (1) Spec(ifier) first (0) or last (1); (2) Comp(lement) first (0) or last (1); and Verb Second constraint (V2) does not exist (0) or does exist (1). Recalling the discussion in the previous section, I follow standard linguistic convention. Thus, by specifier I mean the part of a phrase that “specifies” that phrase, roughly, like the old in the old book ; by complement I mean roughly a phrase’s arguments, like an ice-cream in John ate an ice-cream or with envy in green with envy. There are also seven possible “words” in this language: S, V, O, O1, O2, Adv, and Aux, corresponding to subject, verb (main), object, direct object, indirect object, adverb, and auxiliary verb. There are twelve possible surface strings for each (−V2) grammar and eighteen possible surface strings for each (+V2) grammar if we restrict ourselves to unembedded or “degree-0” examples for reasons of psychological plausibility (for discussion, see Wexler and Culicover 1980; Lightfoot 1991; Gibson and Wexler 1994). Note that the “surface strings” of these languages are actually phrases such as [subject verb object] as in John ate an ice-cream. Figure 3 of Gibson and Wexler summarizes the possible binary parameter settings in this system. For instance, parameter setting 5 corresponds to the array [0 1 0]= Spec-first, Comp-final, and −V2, which works out to the possible basic English surface phrase order of subject–verb–object (SVO). As shown in Gibson and Wexler’s Figure 3, the other possible arrangements of surface strings corresponding to this parameter setting include S V; S V O1 O2 (two objects, as in give John an ice-cream); S Aux V (as in John will eat); S Aux V O; S Aux V O1 O2; Adv S V (where Adv is an adverb, like quickly); Adv S V O; Adv S V O1 O2; Adv S Aux V; Adv S Aux V O; and Adv S Aux V O1 O2.

Table 3.1 in the appendix gives a complete list of all degree-0 (unembedded) sentences (expressions) for each of the eight different grammars in this simple system. As shown in the figure, English and French correspond to the language L5 , Bengali and Hindi correspond to L7 , while German and Dutch correspond to L8 .


The Markov Chain for the Three-Parameter Example Suppose the target language is SVO (subject verb object, or “English” setting 5=[0 1 0]). Within the Gibson and Wexler three-parameter system, there are 23 = 8 possible hypotheses, so we can draw this as an eight-point Markov configuration space, as shown in Figure 3.5. The shaded rings represent increasing distance in parameter space (Hamming distances) from the target. Each labeled circle is a Markov state, a possible array of parameter settings or grammar, hence specifies a possible target language. Each state is exactly one binary digit away from its possible transition neighbors. Each labeled, directed arc between the points is a possible transition from state i to state j, where the labels are the transition probabilities; note that the probabilities from all arcs exiting a state sum to 1. I will show how to compute these probabilities immediately below. The target grammar, a double circle, lies at the center. This corresponds to the (English) SVO language. Surrounding the bull’s-eye target are the three other parameter arrays that differ from [0 1 0] by one binary digit each; I picture these as a ring 1 Hamming distance away from the target: [0, 1, 1], corresponding to Gibson and Wexler’s parameter setting 6 in their figure 3 (Spec-first, Comp-final, +V2, basic order SVO+V2); [0 0 0], corresponding to Gibson and Wexler’s setting 7 (Spec-first, Comp-first, −V2, basic order SOV); and [1 1 0], Gibson and Wexler’s setting 1(Spec-final, Comp-final, −V2, basic order VOS). Around this inner ring lie three parameter setting hypotheses, all two binary digits away from the target: [0 0 1], [1 0 0], and [1 1 1] (grammars 2, 3, and 8 in Gibson and Wexler’s figure 3). Finally, one more ring out, three binary digits different from the target, is the hypothesis [1 0 1], corresponding to target grammar 4. It is easy to see from inspection of the figure that there are exactly two Absorbing States in this Markov chain, that is, states that have no exit arcs with nonzero probability. One absorbing state is the target grammar (by definition). The other absorbing state is state 2 (corresponding to language VOS+V2, i.e., [1 1 1]). Finally, state 4 (parameter setting [1 0 1]), while not an absorbing state in itself, has no path to the target. It has arcs that lead only to itself or to state 2 (an absorbing state that is not the target). These two states correspond to the local maxima at the head of Gibson and Wexler’s figure 4. Hence this target language is not learnable. Besides demonstrating these local maxima, the next section shows that there are in fact other states from which the learner will, with high probability, never reach the correct target.


3.4.3 Derivation of the Transition Probabilities for the Markov TLA Structure

I have discussed in the previous section how the behavior of the TLA can be modeled as a Markov chain. The argument is incomplete without a characterization of the transition probabilities of the associated Markov chain. I first provide an example and follow it with a formal exposition.

Example. Consider again the three-parameter system in Figure 3.5 with target language 5. What is the probability that the learner will move from state 8 to state 6? The learner will make such a transition if it receives a sentence that is analyzable according to the parameter settings of state 6, but not according to the parameter settings of state 8. For example, a sentence of the form (S V O1 O2), as in Peter gave John an ice-cream, could drive the learner to change its parameter settings from 8 to 6. If one assumes a probability distribution with which sentences from the target are presented to the learner, one could find the total probability measure of all such sentences and use it to calculate the appropriate transition probability.

Formalization

The computation of the transition probabilities from the language family can be done by a direct extension of the procedure given in Gibson and Wexler 1994. Let the target language Lt consist of the strings s1, s2, . . . , i.e.,

Lt = {s1, s2, s3, . . .}

Let there be a probability distribution P on these strings. Suppose the learner is in a state s corresponding to the language Ls. Consider some other state k corresponding to the language Lk. What is the probability that the TLA will update its hypothesis from Ls to Lk after receiving the next example sentence? First, observe that due to the single-valued constraint, if k and s differ by more than one parameter setting, then the probability of this transition is zero. In fact, the TLA will move from s to k only if the following two conditions are met: (1) the next sentence it receives (say, ω, occurring with probability P(ω)) is analyzable by the parameter settings corresponding to k and not by the parameter settings corresponding to s; and (2) the TLA happens to pick and change the one parameter (out of n) that would move it to state k.

Event 1 occurs with probability Σ_{ω ∈ (Lk \ Ls) ∩ Lt} P(ω). This is simply the probability measure associated with all strings ω that are both in the target Lt and in Lk but not in the language Ls (the learner’s currently hypothesized


language). Event 2 occurs with probability 1/n, since the parameter to flip is chosen uniformly at random out of the n possible choices. Thus the co-occurrence of both these events yields the following expression for the total probability of transition from s to k after one step:

P[s → k] = Σ_{ω ∈ (Lk \ Ls) ∩ Lt} (1/n) P(ω)

Since the total probability over all the arcs out of s (including the self-loop) must be 1, we obtain the probability of remaining in state s after one step as:

P[s → s] = 1 − Σ_{k a neighboring state of s} P[s → k]

In other words, the probability of remaining in state s is 1 minus the probability of moving to any of the other (neighboring) states. Finally, given any parameter space with n parameters, we have 2^n languages. Fixing one of them as the target language Lt, we obtain the following procedure for constructing the corresponding Markov chain. Note that this is simply the Gibson and Wexler procedure for finding local maxima, with the addition of a probability measure on the language family.

• [Assign distribution] Fix a probability measure P on the strings of the target language Lt.

• [Enumerate states] Assign a state to each language, i.e., each Li.

• [Normalize by the target language] Intersect all languages with the target language to obtain, for each i, the language L̄i = Li ∩ Lt. Thus with state i, associated with language Li, we now associate the language L̄i.

• [Take set differences] For any two states i and k, i ≠ k, if they are more than 1 Hamming distance apart, then the transition P[i → k] = 0. If they are 1 Hamming distance apart, then P[i → k] = (1/n) P(L̄k \ L̄i). For i = k, we have P[i → i] = 1 − Σ_{j≠i} P[i → j].

Remark. This model captures the dynamics of the TLA completely. We note that the learner’s movement from one language hypothesis to another is driven by purely extensional considerations — that is, it is determined by set differences between language pairs. A detailed investigation of this point is beyond the scope of this chapter.


Example (continued). For our three-parameter system, we can follow the above procedure to calculate set differences and build the Markov figure straightforwardly. For example, consider P[8 → 6]; we compute (L6 \ L8) ∩ L5 = {S V O1 O2, S Aux V O, S Aux V O1 O2}. This set has three degree-0 sentences. Assuming a uniform distribution on the twelve degree-0 strings of the target L5, we obtain the value of the transition from state 8 to state 6 to be (1/3)(3/12) = 1/12. Further, since the normalized language L̄1 for state 1 is the empty set, the set difference between states 1 and 5 (L̄5 \ L̄1) yields the entire target language, so there is a (high) transition probability from state 1 to state 5. Similarly, since states 7 and 8 share some target-language strings in common, such as S V, and do not share others, such as Adv S V and S V O, the learner can move from state 7 to 8 and back again.
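The construction procedure and the worked example above translate directly into code. The following sketch (with illustrative names) assumes each language is given extensionally, for instance as the degree-0 sets of Table 3.1 in the appendix, together with a uniform distribution over the target's degree-0 sentences; for target L5 it reproduces arc values such as P[8 → 6] = 1/12.

    def hamming(a, b):
        return sum(x != y for x, y in zip(a, b))

    def build_chain(languages, target):
        """languages: dict mapping each parameter tuple to its set of degree-0 strings.
        Returns T with T[i][k] = P[i -> k] under the TLA, assuming a uniform
        distribution over the sentences of the target language."""
        n = len(target)
        Lt = languages[target]
        norm = {g: languages[g] & Lt for g in languages}    # normalize by the target
        T = {g: {} for g in languages}
        for i in languages:
            stay = 1.0
            for k in languages:
                if hamming(i, k) == 1:
                    p = len(norm[k] - norm[i]) / (n * len(Lt))   # (1/n) * P(normalized set difference)
                    if p > 0:
                        T[i][k] = p
                        stay -= p
            T[i][i] = stay
        return T

    # States whose only exit arc is the self-loop are absorbing; any absorbing
    # state other than the target signals a local maximum of the kind in Figure 3.5.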

3.5 Conclusions

In this chapter I have provided a brief account of the linguistic reasoning that lies behind the Principles and Parameters approach to linguistic theory (Chomsky 1981). I have considered a psycho-linguistically motivated algorithm for language acquisition within this framework. Once the mathematical formalization has been given many additional properties of this particular learning system now become evident. For example, an issue that is amenable to analysis in the current formalization has to do with the existence of subset-superset pairs of languages. The existence of such pairs does not alter the procedure by which the Markov chain is computed, nor does it alter the validity of our main learnability theorem. However, it is clear from our analysis that if the target happens to be a subset language, the superset language will correspond to an absorbing state. This is because all target sentences are analyzable by the superset language and if the errordriven learner happens to be at the state corresponding to it, it will never exit. This additional absorbing state automatically implies nonlearnability by our theorem. However, following Gibson and Wexler and others working in this tradition, I will assume that such complications do not typically arise in the parametric systems under discussion in the current and the next chapter. It is now easy to imagine other alternatives to the TLA that will avoid the local maxima problem: we can vary any of the five aspects of the language learning models described at the beginning of this chapter. To take just one example, as it stands, the learner is allowed to change only one parameter setting at a time. If we relax this condition so that the learner can


change more than one parameter at a time, i.e., the learner can conjecture hypotheses far from its current one (in parameter space), then the problem with local maxima disappears. It is easy to see that in this case, there can be only one Absorbing State, namely, the target grammar. All other states have exit arcs (under the previous assumption of no subset-superset relations). Thus, by our main theorem, such a system is learnable. As another variant, consider the possibility of noise — that is, occasionally the learner gets strings that are not in the target language. Gibson and Wexler state (footnote 4) that this is not a problem: the learner need only pay attention to frequent data. But this is of course a serious problem for the model; how is the learner to “pay attention” to frequent data? Unless some kind of memory or frequency-counting device is added, the learner cannot know whether the examples it receives are noise or not. If the learner is memoryless, then there is always some finite probability, however small, of escaping a local maximum. Clearly, the memory window has to be large enough to ensure that sufficient statistics are computable to distinguish noise from relevant data. A serious investigation of this issue is beyond the scope of this chapter. To explore these and other possible variations systematically, I will return, in the next chapter, to the five-way classification scheme for learning models introduced at the beginning of this chapter. First, I consider details about sample complexity. Next, I turn to questions about the distribution of the input data, and ask how this changes the sample complexity results. I also consider more realistic input distributions — in particular, those obtained using statistics computed from the CHILDES corpus (MacWhinney 1996). Finally, I briefly consider sample complexity issues if the learning algorithms operate in batch rather than online mode. Needless to say, the Principles and Parameters framework discussed here represents a very particular approach to describing the class H of possible natural-language grammars within which learning algorithms like the Triggering Learning Algorithm have been formulated. In the next chapter, we will also see how the learning framework developed in this context is general enough to accommodate a wider variety of approaches to the problem of language acquisition. The ability to characterize rates of learning within the Markov framework developed here will take on added significance as we move on in subsequent chapters to study the problem of language change and evolution.


3.6 Appendix

3.6.1 Unembedded Sentences for Parametric Grammars

Table 3.1 provides the unembedded (degree-0) sentences from each of the eight grammars (languages) obtained by setting the three parameters of the three-parameter syntactic system of Gibson and Wexler 1994 to different values. The languages are referred to as L1 through L8.

Language | Spec | Comp | V2 | Degree-0 unembedded sentences

L1 | 1 | 1 | 0 | VS, VOS, VO1O2S, AuxVS, AuxVOS, AuxVO1O2S, AdvVS, AdvVOS, AdvVO1O2S, AdvAuxVS, AdvAuxVOS, AdvAuxVO1O2S

L2 | 1 | 1 | 1 | SV, SVO, OVS, SVO1O2, O1VO2S, O2VO1S, SAuxV, SAuxVO, OAuxVS, SAuxVO1O2, O1AuxVO2S, O2AuxVO1S, AdvVS, AdvVOS, AdvVO1O2S, AdvAuxVS, AdvAuxVOS, AdvAuxVO1O2S

L3 | 1 | 0 | 0 | VS, OVS, O2O1VS, VAuxS, OVAuxS, O2O1VAuxS, AdvVS, AdvOVS, AdvO2O1VS, AdvVAuxS, AdvOVAuxS, AdvO2O1VAuxS

L4 | 1 | 0 | 1 | SV, OVS, SVO, SVO2O1, O1VO2S, O2VO1S, SAuxV, SAuxOV, OAuxVS, SAuxO2O1V, O1AuxO2VS, O2AuxO1VS, AuxVS, AdvVOS, AdvVO2O1S, AdvAuxVS, AdvAuxOVS, AdvAuxO2O1VS

L5 (English, French) | 0 | 1 | 0 | SV, SVO, SVO1O2, SAuxV, SAuxVO, SAuxVO1O2, AdvSV, AdvSVO, AdvSVO1O2, AdvSAuxV, AdvSAuxVO, AdvSAuxVO1O2

L6 | 0 | 1 | 1 | SV, SVO, OVS, SVO1O2, O1VSO2, O2VSO1, SAuxV, SAuxVO, OAuxSV, SAuxVO1O2, O1AuxSVO2, O2AuxSVO1, AdvVS, AdvVSO, AdvVSO1O2, AdvAuxSV, AdvAuxSVO, AdvAuxSVO1O2

L7 (Bengali, Hindi) | 0 | 0 | 0 | SV, SOV, SO2O1V, SVAux, SO2O1VAux, AdvSV, AdvSOV, AdvSO2O1V, AdvSVAux, SOVAux, AdvSOVAux, AdvSO2O1VAux

L8 (German, Dutch) | 0 | 0 | 1 | SV, SVO, OVS, SVO2O1, O1VSO2, O2VSO1, SAuxV, SAuxOV, OAuxSV, O1AuxSO2V, O2AuxSO1V, AdvVS, AdvVSO, AdvVSO2O1, AdvAuxSV, AdvAuxSOV, SAuxO2O1V, AdvAuxSO2O1V

Table 3.1: A list of all the degree-0 (unembedded) expressions for each of eight different grammatical types. The eight grammatical types (denoted by L1 through L8) are obtained by setting the Spec (0 = Spec-first), Comp (0 = Comp-first), and V2 (0 = no V2) parameters respectively. Expressions are not lexicalized but are denoted as strings over grammatical categories. These categories are S (subject), O (object), V (verb), Adv (adverb), Aux (auxiliary verb), O1 (direct object), and O2 (indirect object) for the sentence types shown. Taken from Gibson and Wexler 1994.

3.6.2 Proof of Learnability Theorem

To establish the theorem, we recall three additional standard terms associated with the Markov chain states: (1) equivalent states (2) recurrent states and (3) transient states. We then present another standard result about the


form of any Markov chain: its canonical decomposition in terms of closed, equivalent, recurrent, and transient states.

Some Markov State Terminology

Definition 3.4 Given a Markov chain M, and any pair of states s, t ∈ M, we say that s is equivalent to t if and only if s is reachable from t and t is reachable from s, where by reachable we mean that there is a path from one state to another.

Two states s and t are equivalent if and only if there is a path from s to t and a path from t to s. Using the equivalence relation defined above, we can divide any M into equivalence classes of states. All the states in one class can reach and are reachable from the states in that class.

Definition 3.5 Given a Markov chain M, a state s ∈ M is recurrent if the chain returns to s in a finite number of steps with probability 1.

Definition 3.6 Given a Markov chain M, and a state s ∈ M, if s is not recurrent, then s is transient.

We will need the following simple property about transient states later:

Lemma 3.1 Given a Markov chain M, if t is a transient state of M, then, for any state s ∈ M,

lim_{n→∞} p_st(n) = 0

where p_st(n) denotes the probability of going from state s to state t in exactly n steps.

Proof: (Sketch) Proposition 2.6.3 in Resnick 1992 states that

Σ_{n=1}^{∞} p_st(n) < ∞

Therefore, Σ_n p_st(n) is a convergent series. Thus p_st(n) → 0 as n → ∞.


Canonical Decomposition

A particular Markov chain might have many closed sets of states (see Definition 3.3 earlier), and these need not be disjoint; they might also be subsets of each other. However, even though there can be many closed sets in a particular Markov chain, the following standard result shows that there is a canonical decomposition of the chain (Lemma 3.2) that will be useful to us in proving the learnability theorem.

Lemma 3.2 Given a Markov chain M, we may decompose M into disjoint sets of states as follows: M = T ∪ C1 ∪ C2 ∪ . . . , where (i) T is a collection of transient states and (ii) the Ci's are closed, equivalence classes of recurrent states.

Proof: This is a standard Markov chain result; see Corollary 2.10.2 in Resnick 1992.

We can now proceed to a proof of the main learnability theorem.

Proof of Main Theorem

(⇒) We need to show that if the target grammar is learnable, then every closed set in the chain must contain the target state. By assumption, target grammar gf is learnable. Now assume for the sake of contradiction that there is some closed set C that does not include the target state associated with the target grammar. If the learner starts in some s ∈ C, by the definition of a closed set of states, it can never reach the target state. This contradicts the assumption that gf was learnable.

(⇐) Assume that every closed set of the Markov chain associated with the learning system includes the target state. We now need to show that the target grammar is Gold learnable. First, we make use of some properties of the target state in conjunction with the canonical decomposition of Lemma 3.2 to show that every non-target state must be transient. Then we make use of Lemma 3.1 about transient states to show that the learner must converge to the target grammar in the limit with probability 1. First, note the following properties of the target state:

(i) By construction, the target state is an absorbing state, i.e., no other state is reachable from the target state.


(ii) Therefore, no other state can be in an equivalence relation with the target state, and the target state is in an equivalence class by itself.

(iii) The target state is recurrent, since the chain returns to it with probability 1 in one step (the target state is an absorbing state).

These facts about the target state show that the target state constitutes a closed class (say C_i) in the canonical decomposition of M. However, there cannot be any other closed class C_j, j ≠ i, in the canonical decomposition of M. This is because, by the definition of the canonical decomposition, any other such C_j must be disjoint from C_i, while by the hypothesis of the theorem such a C_j must contain the target state, leading to a contradiction. Therefore, by the canonical decomposition lemma, every other state in M must belong to T, and must therefore be a transient state.

Now denote the target state by s_f. The canonical decomposition of M must therefore be of the form T ∪ {s_f}. Without loss of generality, let the learner start at some arbitrary state s. After any integer number n of positive examples, we know that

    Σ_{t∈M} p_st(n) = 1,

because the learner has to be in one of the states of the chain M after n examples with probability 1. But by the decomposition lemma and our previous arguments, M = T ∪ {s_f}. Therefore we can rewrite this sum as two parts, one corresponding to the transient states and the other corresponding to the final state:

    Σ_{t∈T} p_st(n) + p_{s s_f}(n) = 1.

Now take the limit as n goes to infinity. By the transient-state lemma, every p_st(n) goes to zero for t ∈ T. There are only a finite (known) number of states in T. Therefore Σ_{t∈T} p_st(n) goes to zero. Consequently, p_{s s_f}(n) goes to 1. But that means that the learner converges to the target state in the limit (with probability 1). Since this is true irrespective of the starting state of the learner, the learner converges to the target with probability 1, and the associated target grammar g_f is Gold learnable.
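The criterion just established is easy to check mechanically once a transition matrix is in hand: because the target state is absorbing by construction, every closed set contains the target state exactly when the target state is reachable (through transitions of positive probability) from every state. The following is a minimal sketch of such a check; the function name and the toy matrix are illustrative, not taken from the text.

```python
import numpy as np

def target_in_every_closed_set(T, target):
    """Learnability check: with the target state absorbing, every closed set
    contains the target iff the target is reachable from every state."""
    n = T.shape[0]
    can_reach = {target}            # states from which the target is reachable
    changed = True
    while changed:
        changed = False
        for s in range(n):
            if s not in can_reach and any(T[s, t] > 0 for t in can_reach):
                can_reach.add(s)
                changed = True
    return len(can_reach) == n

# Toy 3-state chain: state 2 is the absorbing target, but state 0 is also
# absorbing, so {0} is a closed set that excludes the target.
T = np.array([[1.0, 0.0, 0.0],
              [0.2, 0.3, 0.5],
              [0.0, 0.0, 1.0]])
print(target_in_every_closed_set(T, target=2))   # False: not Gold learnable
```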

Chapter 4

Language Acquisition: Memoryless Learning

In this chapter, we continue our exploration of the Markov chain framework for the analysis of learning algorithms for language acquisition. There are two main themes to this exploration. First, we see how the framework allows us to get a theoretical handle on the important question of rates of learnability. Not only must the learner converge to the target in the limit, it must do so at psychologically plausible rates. I develop this point in the next few sections. Second, we see in later parts of this chapter that all learning algorithms may be characterized by inhomogeneous Markov chains. More significantly, however, the cognitively interesting class of memory-limited learning algorithms may ultimately be characterized by first-order Markov chains. I explore this in subsequent parts of this chapter. I conclude with a brief overview of computational work in language acquisition that engages the research traditions in linguistics and psychology at varying levels of detail.

4.1  Characterizing Convergence Times for the Markov Chain Model

The Markov chain formulation gives us some distinct advantages in theoretically characterizing the language acquisition problem. First, we have already seen how, given a Markov chain, one could investigate whether or not it has exactly one absorbing state corresponding to the target grammar. This is equivalent to the question of whether any local maxima exist. One could also look at other issues (like stationarity or ergodicity assumptions) that might potentially affect convergence. Later we will consider several variants of the TLA and analyze them formally within the Markov framework. We will also see that these variants do not suffer from the local maxima problem associated with Gibson and Wexler's TLA.

Perhaps the most significant advantage of the Markov chain formulation is that it allows us to also analyze convergence times. Given the transition matrix of a Markov chain, the problem of how long it takes to converge has been well studied. This question is of crucial importance in learnability. Following Gibson and Wexler 1994, I believe that it is not enough to show that the learning problem is consistent, i.e., that the learner will converge to the target in the limit. We also need to show, as Gibson and Wexler point out, that the learning problem is feasible, i.e., that the learner will converge in "reasonable" time. This is particularly true in the case of finite parameter spaces, where consistency might not be as much of a problem as feasibility. The Markov formulation allows us to address the feasibility question. It also allows us to clarify the assumptions about the behavior of data and learner inherent in such an approach. We begin by considering a few ways in which one could formulate the question of convergence times.

4.1.1  Some Transition Matrices and Their Convergence Curves

Let us begin by following the procedure detailed in the previous chapter to explicitly calculate a few transition matrices. Consider the three-parameter example that was informally considered before. The target grammar was grammar 5 (according to our numbering of the languages in Figure ??). For simplicity, let us first assume a uniform distribution on the degree-0 strings in L_5, i.e., the probability that the learner sees a particular string s_j in L_5 is 1/12, because there are 12 (degree-0) strings in L_5. We can now compute the transition matrix:

[8 × 8 transition matrix for target L_5 under the uniform distribution: rows = From L1–L8, columns = To L1–L8]


where 0’s occupy matrix entries if not otherwise specified. Notice that both 2 and 5 correspond to absorbing states; thus this chain suffers from the local maxima problem. Note also (following Figure 3.5 as well) that state 4 exits either to itself or to state 2 and is also a problematic initial state. For a given transition matrix T, it is possible to compute 1 T∞ = lim T m m→∞

If T is the transition probability matrix of a chain, then T ij , i.e., the element of T in the ith row and jth column is the probability that the learner moves from state i to state j in one step. It is a well-known fact that if one considers the corresponding i, j element of T m , then this is the probability that the learner moves from state i to state j in exactly m steps. Correspondingly, the i, jth element of T ∞ is the probability of going from initial state i to state j “in the limit” as the number of examples goes to infinity. For learnability to hold irrespective of which state the learner starts in, the probability that the learner reaches state 5 should tend to 1 as m goes to infinity. This means that column 5 of T ∞ should consist of 1’s, and the matrix should contain 0’s everywhere else. Actually we find that T m converges to the following matrix as m goes to infinity:

T^∞ =

    From \ To    L1    L2    L3    L4    L5    L6    L7    L8
    L1           0     2/5   0     0     3/5   0     0     0
    L2           0     1     0     0     0     0     0     0
    L3           0     2/5   0     0     3/5   0     0     0
    L4           0     1     0     0     0     0     0     0
    L5           0     0     0     0     1     0     0     0
    L6           0     0     0     0     1     0     0     0
    L7           0     0     0     0     1     0     0     0
    L8           0     0     0     0     1     0     0     0

Examining this matrix, we see that if the learner starts out in states 2 or 4, it will certainly end up in state 2 in the limit. These two states correspond to local maxima grammars in Gibson and Wexler's framework. If the learner starts in either of these two states, it will never reach the target. From the matrix we also see that if the learner starts in states 5 through 8, it will certainly converge in the limit to the target grammar. The situation regarding states 1 and 3 is more interesting, and not covered in Gibson and Wexler 1994. If the learner starts in either of these states, it will reach the target grammar with probability 3/5 and reach state 2 (the other absorbing state) with probability 2/5. Thus we see that local maxima (states unconnected to the target) are not the only problem for learnability. As a consequence of our stochastic formulation, we see that there are initial hypotheses from which triggered paths exist to the target; however, the learner will not take these paths with probability 1. In our case, because of the uniform distribution assumption, we see that the path to the target will only be taken with probability 3/5. By making the distribution more favorable, this probability can be made larger, but it can never be made 1. This analysis considerably increases the number of problematic initial states from those presented in Gibson and Wexler 1994. While the broader implications of this are not clear, it certainly renders moot some of the linguistic implications of Gibson and Wexler's analysis. (For example, Gibson and Wexler rely on "connectedness" to obtain their list of local maxima. From this (incorrect) list, noticing that all local maxima were +Verb Second (+V2), they argued for ordered parameter acquisition or "maturation". In other words, they claimed that the V2 parameter was more crucial, and had to be set earlier in the child's language acquisition process. My analysis shows that this is incorrect, an example of how computational analysis can aid the search for adequate linguistic theories.)

Obviously one can examine other details of this particular system. However, let us now look at a case where there is no local maxima problem. This is the case when the target languages have verb-second (V2) movement in Gibson and Wexler's three-parameter case. Consider the transition matrix (shown below) obtained when the target language is L_1. Again we assume a uniform distribution on strings of the target. Here we find that T^m does indeed converge to a matrix with 1's in the first column and 0's elsewhere. Consider the first column of T^m. It is of the form

    (p_1(m), p_2(m), p_3(m), p_4(m), p_5(m), p_6(m), p_7(m), p_8(m)),

where p_i(m) denotes the probability of being in state 1 at the end of m examples for the case in which the learner started in state i.

[8 × 8 TLA transition matrix for target language L_1 under a uniform distribution on its degree-0 strings: rows = From L1–L8, columns = To L1–L8; unspecified entries are 0]

For learnability, we naturally require

    lim_{m→∞} p_i(m) = 1,

and for the example at hand this is indeed the case. Figure 4.1 shows a plot of the following quantity as a function of m, the number of examples:

    p(m) = min_i { p_i(m) }.

The quantity p(m) is easy to interpret. For example, p(m) = 0.95 means that for every initial state of the learner the probability that it is in the target state after m examples is at least 0.95. Further, there is one initial state (the worst initial state with respect to the target, which in our example is L_8) for which this probability is exactly 0.95. We find on looking at the curve that the learner converges with high probability within 100 to 200 (degree-0) example sentences, a psychologically plausible number. One can now of course proceed to examine actual transcripts of child input to calculate convergence times for more realistic distributions of examples.

Now that I have made a first attempt to quantify the convergence time, several other questions can be raised. How does convergence time depend upon the distribution of the data? How does it compare with other kinds of Markov structures with the same number of states? How will the convergence time be affected if the number of states increases, i.e., the number of parameters increases? How does it depend upon the way in which the parameters relate to the surface strings? Are there other ways to characterize convergence times? I now proceed to answer some of these questions.
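As a rough illustration of how a curve like the one in Figure 4.1 can be computed, the sketch below raises a transition matrix to successive powers and records the probability of having reached the target from the worst initial state. The 8 × 8 TLA matrix itself is assumed to have been constructed as described above; it is not reproduced here.

```python
import numpy as np

def p_curve(T, target, max_m=400):
    """Return p(m) = min_i (T^m)[i, target] for m = 1, ..., max_m."""
    P = np.eye(T.shape[0])
    curve = []
    for _ in range(max_m):
        P = P @ T                       # P now holds T^m
        curve.append(P[:, target].min())
    return np.array(curve)

# Assuming T is the TLA transition matrix with target grammar L1 in state 0:
# curve = p_curve(T, target=0)
# m_95 = 1 + int(np.argmax(curve >= 0.95))   # examples needed for p(m) >= 0.95
```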


Figure 4.1: Convergence as a function of number of examples. The horizontal axis denotes the number of examples received and the vertical axis represents the probability of converging to the target state. The data from the target is assumed to be distributed uniformly over degree-0 sentences. The solid line represents TLA convergence times and the dotted line is a Random Walk Algorithm (RWA). Note that random walk actually converges faster than the TLA in this case.

4.1.2  Absorption Times

In the previous section, I computed the transition matrix for a fixed (in principle, arbitrary) distribution and characterized the rate of convergence in a certain way. In particular, I plotted p(m) (the probability of converging from the most unfavorable initial state) against m (the number of samples). However, this is not the only way to characterize convergence times. Given an initial state, the time taken to reach the absorption state (known as the absorption time) is a random variable. One can compute the mean and variance of this random variable. For the case when the target language is L_1, we have seen that the transition matrix has the block form

    T = ( 1  0 )
        ( R  Q )

Here Q is a seven-dimensional square matrix. The mean absorption times from states 2 through 8 are given by the vector (see Isaacson and Madsen 1976)

    μ = (I − Q)^{-1} 1,

where 1 is a seven-dimensional column vector of 1's. The vector of second moments is given by μ_2 = (I − Q)^{-1}(2μ − 1). Using this result, we can now compute the mean and standard deviation of the absorption time from the most unfavorable initial state of the learner. (Note that the distribution of absorption times is fairly skewed in such cases and so is not symmetric about the mean, as may be seen from the previous curves.) The four learning scenarios considered (see Table 4.1) are the TLA with uniform and increasingly malicious distributions (discussed later), and the random walk (also discussed later).
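A sketch of this computation in code, assuming the full transition matrix is available and the target state index is known (the matrix itself is not reproduced here):

```python
import numpy as np

def absorption_time_moments(T, target):
    """Mean and standard deviation of the time to absorption in `target`,
    via the fundamental matrix (I - Q)^{-1} of the transient states."""
    others = [i for i in range(T.shape[0]) if i != target]
    Q = T[np.ix_(others, others)]                 # transitions among non-target states
    I = np.eye(len(others))
    ones = np.ones(len(others))
    mu = np.linalg.solve(I - Q, ones)             # mean absorption times
    mu2 = np.linalg.solve(I - Q, 2 * mu - ones)   # second moments
    return mu, np.sqrt(mu2 - mu ** 2)             # means and standard deviations
```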

4.1.3  Eigenvalue Rates of Convergence

I have shown how to characterize learnability by Markov chains. Recall that Markov chains corresponding to memoryless learning algorithms have an associated transition matrix T. We saw that T^k was the transition matrix after k examples, and in the limiting case,

    lim_{k→∞} T^k = T^∞.

In general, the structure of T^∞, as discussed earlier, determined whether the target grammar was learnable with probability 1. The rate at which T^k converges to T^∞ determines the rate at which the learner converges to the target "in the limit". This rate allows us to bound the sample complexity in a formal sense, i.e., it allows us to bound the number of examples needed before the learner will be at the target with high confidence. In this section, we develop some formal machinery borrowed from classical Markov chain theory that is useful for bounding the rate of convergence of the learner to the target grammar for learnable target grammars. We first develop the notion of an eigenvalue of a transition matrix and show how this can be used to construct an alternative representation of T^k. We then discuss the limiting distributions of Markov chains from various initial conditions, and finally combine all these notions to formally state some results for the rate at which the learner converges to the target.

    Learning scenario     Mean abs. time    St. Dev. of abs. time
    TLA (uniform)         34.8              22.3
    TLA (a = 0.99)        45,000            33,000
    TLA (a = 0.9999)      4.5 × 10^6        3.3 × 10^6
    RW                    9.6               10.1

Table 4.1: Mean (col. 1) and standard deviation (col. 2) of absorption times to the target state for TLA with different distributions and the Random Walk Algorithm. See text for more explanation.

Eigenvalues and Eigenvectors

Many properties of a transition matrix can be characterized by its eigenvalues and eigenvectors.

Definition 4.1 A number λ is said to be an eigenvalue of a matrix T if there exists some nonzero vector x satisfying

    x T = λ x.

Such a row vector x is called a left eigenvector of T corresponding to the eigenvalue λ. Similarly, a nonzero column vector y satisfying

    T y = λ y

is called a right eigenvector of T.

It can be shown that the eigenvalues of a matrix T can be obtained by solving

    |λI − T| = 0                                                  (4.1)

where I is the identity matrix and |M| denotes the determinant of the matrix M.

Example. Consider the matrix

    T = ( 2/3  1/3 )
        ( 1/3  2/3 )

Such a matrix could, for example, be the transition matrix for a learner in a parametric space with two grammars, i.e., a space defined by one Boolean-valued parameter. In order to solve for the eigenvalues of the matrix, we need to solve

    |λI − T| = | λ − 2/3    −1/3   |
               |  −1/3    λ − 2/3  | = 0.

This reduces to the quadratic equation

    (λ − 2/3)^2 = 1/9,

which can be solved to yield λ = 1 and λ = 1/3 as its two solutions. It can easily be seen that the row vector x = (1, 1) is a left eigenvector corresponding to the eigenvalue λ = 1. As a matter of fact, all multiples of (1, 1) are eigenvectors for this particular eigenvalue. Similarly, it can also be seen that x = (1, −1) is a left eigenvector for the eigenvalue λ = 1/3.

In general, for an m × m matrix T, Equation 4.1 is an mth-order equation and can be solved to yield m solutions (possibly complex-valued) for λ. Two other facts about eigenvalue solutions of such transition matrices are worth noting here:

1. For transition matrices corresponding to finite Markov chains, it is possible to show that λ = 1 is always an eigenvalue. Further, it is the largest eigenvalue, in that any other eigenvalue λ is less than 1 in absolute value, i.e., |λ| < 1.

2. For transition matrices corresponding to finite Markov chains, the multiplicity of the eigenvalue λ = 1 is equal to the number of closed classes in the chain.

In our example above, we do see that λ = 1 is an eigenvalue. It has multiplicity 1, indicating that there is only one closed class in the chain; in the example, the class consists of the two states of the chain.
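The hand calculation above is easy to reproduce numerically; a minimal sketch (left eigenvectors of T are the ordinary eigenvectors of its transpose):

```python
import numpy as np

T = np.array([[2/3, 1/3],
              [1/3, 2/3]])

eigvals, eigvecs = np.linalg.eig(T.T)     # columns of eigvecs are left eigenvectors of T
print(np.sort(eigvals.real))              # [1/3, 1], as computed above
# The eigenvector for lambda = 1 is a multiple of (1, 1);
# the eigenvector for lambda = 1/3 is a multiple of (1, -1).
```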


Representation of T^k

The eigenvalues and associated eigenvectors can be used to represent T^k in a form that is convenient for bounding the rate of its convergence to T^∞. This representation is only valid for matrices that are of full rank, i.e., m × m matrices that have m linearly independent left eigenvectors.

Let T be an m × m transition matrix. Let it have m linearly independent left eigenvectors x_1, . . . , x_m corresponding to eigenvalues λ_1, λ_2, . . . , λ_m. One could then define the matrix L whose rows are the left eigenvectors of the matrix T. Thus

    L = ( x_1 )
        ( x_2 )
        (  .  )
        ( x_m )

Clearly, since the rows of L are linearly independent, its inverse L^{-1} exists. It turns out that the columns of L^{-1} are the right eigenvectors of T. Let the ith column of L^{-1} be y_i, i.e.,

    L^{-1} = ( y_1  y_2  . . .  y_m )

Now we can represent T^k in a convenient form, stated in the following lemma:

Lemma 4.1 Let T be an m × m transition matrix having m linearly independent left eigenvectors x_1, . . . , x_m corresponding to eigenvalues λ_1, . . . , λ_m. Further, let L be the matrix whose rows are the left eigenvectors, and let the columns of L^{-1} be the y_i's. Then

    T^k = Σ_{i=1}^{m} λ_i^k y_i x_i.

Thus, according to the lemma above, T^k can be represented as a linear combination of m fixed matrices (y_i x_i). The coefficients of this linear combination are the λ_i^k. Clearly, we see that the rate of convergence of T^k is now bounded by the rate of convergence of terms like λ_i^k.

Example (continued). Continuing our previous example, we can construct the matrices L and L^{-1} out of the left eigenvectors. In fact, using our solutions from before, we see that

    L = ( 1   1 )        and        L^{-1} = ( 1/2   1/2 )
        ( 1  −1 )                            ( 1/2  −1/2 )

The rows of L are the x_i's and the columns of L^{-1} are the y_i's.

Initial Conditions and Limiting Distributions

Recall that the learner could start in any initial state. One could quantify the initial condition of the learner by putting a distribution on the states of the Markov chain according to which the learner picks its initial state. Let this be denoted by the row vector Π_0 = (π_1(0), π_2(0), . . . , π_m(0)). Thus, π_i(0) is the probability with which the learner picks the ith state as the initial state. For example, if the learner were equally likely to start in any state, then π_i(0) = 1/m for all i.

The above characterizes the probability with which the learner is in each of the states before having seen any examples. The learner would then move from state to state according to the transition matrix T. After k examples, the probability with which the learner would be in each of the states is given by

    Π_k = Π_0 T^k.

Finally, one could characterize the limiting distribution as

    Π = lim_{k→∞} Π_k = Π_0 T^∞.                                  (4.2)

Clearly, Π characterizes the probability with which the learner is in each of the states "in the limit". Suppose the target were L_1, and it were Gold learnable; then the first element of the vector Π would be 1 and all other elements would be 0. In other words, the probability that the learner is at the target in the limit is 1, and the probability that the learner is at some other (non-target) state in the limit is correspondingly 0.

Rate of Convergence

We are interested in bounding the rate at which Π_k converges to Π. We see that this rate depends on the rate at which T^k converges to T^∞ (Equation 4.2), which in turn depends upon the rates at which the λ_i^k's converge to 0 by Lemma 4.1 (for i > 1). As we have discussed, λ_1 = 1. Consequently, we can bound the rate of convergence by the rate at which the kth power of the second largest eigenvalue converges to 0. Thus we can state the following theorem.


Theorem 4.1 Let the transition matrix characterizing the behavior of the memoryless learner be T. Further, let T have eigenvalues λ_1, . . . , λ_m, with m linearly independent left eigenvectors x_1, . . . , x_m and m right eigenvectors y_1, . . . , y_m, and λ_1 = 1. Then the distance between the learner's state after k examples and its state in the limit is given by

    ‖ Π_k − Π ‖ = ‖ Σ_{i=2}^{m} λ_i^k Π_0 y_i x_i ‖ ≤ ( max_{2≤i≤m} |λ_i| )^k  Σ_{j=2}^{m} ‖ Π_0 y_j x_j ‖

Let us first apply this theorem to the illustrative example of this section.

Example (continued). We have already solved for the eigenvalues of T and constructed the matrices L and L^{-1}. The rows of L are the row vectors x_i and the columns of L^{-1} are the column vectors y_i. Assuming that the learner is three times as likely to start in state 1 as in state 2, i.e., Π_0 = (3/4, 1/4), we can show that

    ‖ Π_k − Π ‖ ≤ (1/2) (1/3)^k.

Thus the rate at which the learner converges to the limiting distribution over the state space is of the order of (1/3)^k. Note that 1/3 is the second largest eigenvalue of the transition matrix.

Transition Matrix Recipes

The above discussion allows us to see how one could extract useful learnability properties of the memoryless learner from the transition matrix characterizing the behavior of that learner on the finite parameter space. As a matter of fact, I can now outline a procedure whereby one could check for the learnability and sample complexity of learning in such parameter spaces:

1. Construct the transition matrix T for the memoryless learner according to the arguments developed earlier. The corresponding chain has 2^n states if there are n Boolean-valued parameters in the grammatical theory.

2. Compute the eigenvalues of the matrix T.

3. If the multiplicity of the eigenvalue λ = 1 is more than 1, then there are additional closed classes and, by the learnability theorem, the target grammar is not Gold learnable.


4. If the target is Gold learnable, and the eigenvectors are linearly independent, then use Theorem 4.1 to bound the rate of convergence. If the eigenvectors are not linearly independent, then one will need to project into the appropriate subspace of lower dimension and compute the rates in the subspace. See Isaacson and Madsen 1976 for general details and Rivin and Komarova 2003 for specific calculations pertaining to these kinds of learning algorithms.

Using such a procedure, we can bound the rate of convergence of each of the following learning scenarios for the three-parameter syntactic subsystem we have examined in some detail in previous examples (Table 4.2). In each case, the target is the language L_1. The learning algorithm is the TLA with different sentence distributions (parameterized by a, with b, c, d chosen to make sentences outside of A equally likely; see the next section). We also considered the Random Walk Algorithm (no greediness, no single value; see the next section) with a uniform sentence distribution. The rate of convergence is denoted as a function of the number of examples.

    Learning scenario     Rate of convergence
    TLA (uniform)         O(0.94^k)
    TLA (a = 0.99)        O((1 − 10^{-4})^k)
    TLA (a = 0.9999)      O((1 − 10^{-6})^k)
    Random Walk           O(0.89^k)

Table 4.2: Bounds on the rate of convergence to the target for TLA under different distributional assumptions and the Random Walk Algorithm. Here k is the number of examples. We see how the second eigenvalue changes for each of these cases.
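As a minimal sketch of steps 2–4 of this recipe (step 1, constructing T itself, is as described earlier, with the target state made absorbing):

```python
import numpy as np

def learnability_report(T, tol=1e-9):
    """Eigenvalue-based check: the multiplicity of the eigenvalue 1 counts the
    closed classes, and the second-largest |eigenvalue| bounds the rate of
    convergence as in Theorem 4.1."""
    lams = np.linalg.eigvals(T)
    mult_one = int(np.sum(np.abs(lams - 1.0) < tol))
    second = float(np.sort(np.abs(lams))[-2])
    return {
        "gold_learnable": mult_one == 1,      # a single closed class (the target state)
        "second_eigenvalue": second,          # convergence to the target is O(second ** k)
    }

# For the two-state example above, second_eigenvalue is 1/3,
# so convergence is O((1/3) ** k), matching the bound derived in the text.
```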

4.2  Exploring Other Points

I have developed, by now, a complete set of tools to characterize the learnability and sample complexity of memoryless algorithms working on finite parameter spaces. I applied these tools to a specific learning problem that corresponded to a single point in our five-dimensional space, a point previously investigated by Gibson and Wexler. I also provided an account of how our new analysis revised some of their conclusions and had possible applications to linguistic theory. Here we now explore some other points in the space. In the next section, we consider varying the learning algorithm, while keeping other assumptions about the learning problem identical to those before. Later, we vary the distribution of the data.

4.2.1  Changing the Algorithm

As one example of the power of this approach, we can compare the convergence time of the TLA to that of other algorithms. The TLA observes the single-value and greediness constraints. We consider the following three simple variants, obtained by dropping either or both of the single-value and greediness constraints.

Random walk with neither greediness nor single-value constraints: We have already seen this example before. Suppose the learner is in a particular state. Upon receiving a new sentence, it remains in that state if the sentence is analyzable. If not, the learner moves uniformly at random to any of the other states and stays there waiting for the next sentence. This is done without regard to whether the new state allows the sentence to be analyzed.

Random walk with no greediness but with the single-value constraint: The learner remains in its original state if the new sentence is analyzable. Otherwise, the learner chooses one of the parameters uniformly at random and flips it, thereby moving to an adjacent state in the Markov structure. Again this is done without regard to whether the new state allows the sentence to be analyzed. However, since only one parameter is changed at a time, the learner can only move to neighboring states at any given time.

Random walk with no single-value constraint but with greediness: The learner remains in its original state if the new sentence is analyzable. Otherwise the learner moves uniformly at random to any of the other states and stays there if and only if the sentence can be analyzed there. If the sentence cannot be analyzed in the new state, the learner remains in its original state.

Figure 4.2 shows the convergence times for these three algorithms when L_1 is the target language. Interestingly, all three perform better than the TLA for this task (learning the language L_1). More generally, it is found that the variants converge faster than the TLA for every target language. Further, they do not suffer from local maxima problems. In other words, the class of languages is not learnable by the TLA, but it is learnable by these variants. This is another striking consequence of our analysis. The TLA seems to be the algorithm most preferred by psychologists.
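For concreteness, here is one way the four update rules (the TLA and the three variants) might be written down; `analyzes(g, s)` is a stand-in for a parser that says whether grammar g can analyze sentence s, and is not something defined in the text.

```python
import random

def step(g, s, n_params, analyzes, greedy, single_value):
    """One update of a memoryless learner over parameter vectors (tuples of 0/1).
    greedy=True, single_value=True corresponds to the TLA; the three variants
    drop one or both constraints."""
    if analyzes(g, s):
        return g                                   # error-driven: no error, no change
    if single_value:
        i = random.randrange(n_params)             # flip exactly one parameter
        cand = g[:i] + (1 - g[i],) + g[i + 1:]
    else:
        cand = g
        while cand == g:                           # any other state, uniformly at random
            cand = tuple(random.randrange(2) for _ in range(n_params))
    if greedy and not analyzes(cand, s):
        return g                                   # greediness: adopt only if s is analyzable
    return cand
```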


Figure 4.2: Convergence rates for different learning algorithms when L 1 is the target language. The curve with the slowest rate (large dashes) represents the TLA. The curve with the fastest rate (small dashes) is the Random Walk (RWA) with no greediness or single-value constraints. Random walks with exactly one of the greediness and single-value constraints have performances in between these two and are very close to each other.


The failure of the TLA to learn the three-parameter space was used to argue for maturational theories, alternate parameterizations, and parameter ordering. In view of the fact that the failure of the TLA can be corrected by fairly simple alterations (and we have barely scraped the tip of the iceberg as far as exploring the space of possible algorithms is concerned), one should examine the conceptual support (from psychologists) for the TLA more closely before drawing any serious linguistic conclusions. This remains yet another example of how the computational perspective can allow us to rethink cognitive assumptions. Of course, it may be that the TLA has empirical support, in the sense of independent evidence that children do use this procedure (given by the pattern of their errors, and so on).

4.2.2  Distributional Assumptions

In an earlier section we assumed that the example data was generated according to a uniform distribution on the sentences of the target language. I computed the transition matrix for a particular target language and showed that convergence times were of the order of 100–200 samples. In this section I show that the convergence times depend crucially upon the distribution. In particular, we can choose a distribution that makes the convergence time as large as we want. Thus the distribution-free convergence time for the three-parameter system is infinite.

As before, we consider the situation where the target language is L_1. There are no local maxima problems for this choice. We begin by letting the distribution be parameterized by the variables a, b, c, d, where

    a = P(A = {Adv V S})
    b = P(B = {Adv V O S, Adv Aux V S})
    c = P(C = {Adv V O1 O2 S, Adv Aux V O S, Adv Aux V O1 O2 S})
    d = P(D = {V S})

Thus each of the sets A, B, C, and D contains different degree-0 sentences of L_1. Clearly the probability of the set L_1 \ {A ∪ B ∪ C ∪ D} is 1 − (a + b + c + d). The elements of each defined subset of L_1 are equally likely with respect to each other. Setting positive values for a, b, c, d such that a + b + c + d < 1 now defines a unique probability for each degree-0 sentence in L_1. For example, the probability of (Adv V O S) is b/2, the probability of (Adv Aux V O S) is c/3, that of (V O S) is (1 − (a + b + c + d))/6, and so on.

[Table 4.3 here: the 8 × 8 transition matrix with rows = From L1–L8 and columns = To L1–L8; each nonzero entry is a simple function of a, b, c, and d, and unspecified entries are 0.]

Table 4.3: Transition matrix corresponding to a parameterized choice for the distribution on the target strings. In this case the target is L_1 and the distribution is parameterized according to Section 4.7.2.

We can now obtain the transition matrix corresponding to this distribution. This is shown in Table 4.3. Compare this matrix with that obtained with a uniform distribution on the sentences of L_1 in the earlier section. This matrix has nonzero elements (transition probabilities) exactly where the earlier matrix had nonzero elements. However, the value of each transition probability now depends upon a, b, c, and d. In particular, if we choose a = 1/12, b = 2/12, c = 3/12, d = 1/12 (this is equivalent to assuming a uniform distribution) we obtain the transition matrix corresponding to a uniform distribution.

Looking more closely at the general transition matrix, we see that the transition probability from state 2 to state 1 is (1 − (a + b + c))/3. Clearly, if we make a arbitrarily close to 1, then this transition probability is arbitrarily close to 0, so that the number of samples needed to converge can be made arbitrarily large. Thus choosing large values for a and small values for b will result in large convergence times. This means that the sample complexity cannot be bounded in a manner that is distribution-free, because by choosing a highly unfavorable distribution the sample complexity can be made as high as possible.

For example, Figure 4.3 gives the convergence curves calculated for different choices of a, b, c, d. We see that for a uniform distribution the convergence occurs within 200 samples. By choosing a distribution with a = 0.9999 and b = c = d = 0.000001, the convergence time can be pushed up to as much as 50 million samples. (Of course, this distribution is presumably not psychologically realistic.) For a = 0.99, b = c = d = 0.0001, the sample complexity is on the order of 100,000 positive examples.

Figure 4.3: Rates of convergence for the TLA with L_1 as the target language for different distributions. The y-axis plots the probability of converging to the target after m samples and the x-axis is on a log scale, i.e., it shows log(m) as m varies. The solid line denotes the choice of an "unfavorable" distribution characterized by a = 0.9999, b = c = d = 0.000001. The dotted line denotes the choice a = 0.99, b = c = d = 0.0001, and the dashed line is the convergence curve for a uniform distribution, the same curve as plotted in the earlier figure.
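To make the dependence on the distribution concrete, the state-2-to-state-1 transition probability quoted above can be evaluated directly (a trivial sketch; the parameter values are those used for the curves in Figure 4.3):

```python
def p_state2_to_state1(a, b, c):
    """Transition probability from state 2 to state 1 in the parameterized
    matrix of Table 4.3: (1 - (a + b + c)) / 3."""
    return (1 - (a + b + c)) / 3

print(p_state2_to_state1(1/12, 2/12, 3/12))       # uniform distribution: 1/6
print(p_state2_to_state1(0.99, 0.0001, 0.0001))   # unfavorable: ~0.0033
print(p_state2_to_state1(0.9999, 1e-6, 1e-6))     # highly unfavorable: ~3.3e-05
```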

4.2.3  Natural Distributions: The CHILDES Corpus

Given the dependence of the sample complexity upon distributional assumptions, it is of interest to examine the fidelity of the model using real language distributions. For this purpose I carried out some preliminary direct experiments using the CHILDES database (MacWhinney 1996): the caretaker English input to "Nina" and the German input to "Katrin", which consist of 43,612 and 632 sentences, respectively. I note, following well-known results by psycholinguists, that both corpora contain a much higher percentage of aux-inversion and wh-questions than "ordinary" text (e.g., the LOB corpus): 25,890 questions and 11,775 wh-questions in the English corpus; 201 and 99 in the German corpus; but only 2,506 questions, or 3.7%, out of 53,495 LOB sentences.

To test convergence, an implemented system using a newer version of de Marcken's partial parser (see de Marcken 1990) analyzed each degree-0 or degree-1 sentence as falling into one of the input patterns SVO, S Aux V, and so on, as appropriate for the target language. Sentences not parsable into these patterns were discarded (presumably "too complex" in some sense, following a tradition established by many other researchers; see Wexler and Culicover 1980 for details). Some examples of caretaker inputs follow:

    this is a book? what do you see in the book? how many rabbits?
    what is the rabbit doing? (. . .) is he hopping? oh. and what is he playing with?

    red mir doch nicht alles nach! ("don't just repeat everything I say!")
    ja, die schwätzen auch immer alles nach ("yes, they always repeat everything") (. . .)

When the examples are run through the TLA, we discover that convergence falls roughly along the TLA convergence time displayed in Figure 4.1, roughly 100 examples to asymptote. Thus, the feasibility of the basic model is confirmed by actual caretaker input, at least in this simple case, for both English and German. One may explore this model with other languages and distributional assumptions. However, there is one very important new complication that must be taken into account: we have found that one must (obviously) add patterns to cover the predominance of auxiliary inversions and wh-questions. However, that largely begs the question of whether the language is verb-second or not. Thus, as far as we can tell, we have not yet arrived at a satisfactory parameter-setting account for V2 acquisition.

4.3  Batch Learning Upper and Lower Bounds: An Aside

So far we have discussed a memoryless learner moving from state to state in parameter space and hopefully converging to the correct target in finite time. As we saw, this was modeled well by our Markov formulation. In this section, however, we step back and consider upper and lower bounds for learning finite language families if the learner were allowed to remember all the strings encountered and optimize over them. Needless to say, this might not be a psychologically plausible assumption, but it can shed light on the information-theoretic complexity of the learning problem.

Consider a situation where there are n languages L_1, L_2, . . . , L_n over an alphabet Σ. Each language can be represented as a subset of Σ*, i.e.,

    L_i = {ω_i1, ω_i2, . . .};  ω_ij ∈ Σ*.

The learner is provided with positive data (strings that belong to the language) drawn according to a distribution P on the strings of a particular target language. The learner is to identify the target. It is quite possible that the learner receives strings that are in more than one language. In such a case the learner will not be able to uniquely identify the target. However, as more and more data becomes available, the probability of having received only ambiguous strings becomes smaller and smaller, and eventually the learner will be able to identify the target uniquely. An interesting question to ask, then, is how many samples the learner needs to see so that with high confidence it is able to identify the target, i.e., so that the probability is less than δ that after seeing that many samples, the learner is still ambiguous about the target. The following theorem provides a lower bound.

Theorem 4.2 The learner needs to draw at least

    M = max_{j≠t} ln(1/δ) / ln(1/p_j)

samples (where p_j = P(L_t ∩ L_j)) in order to be able to identify the target with confidence greater than 1 − δ.

Proof: Suppose the learner draws m (less than M) samples. Let k = arg max_{j≠t} p_j. This means (1) that M = ln(1/δ)/ln(1/p_k), and (2) that with probability p_k the learner receives a string that is both in L_k and L_t, and hence will be unable to discriminate between the target and the kth language. After drawing m samples, the probability that all of them belong to the set L_t ∩ L_k is (p_k)^m. In such a case, even after seeing m samples, the learner will be in an ambiguous state. Now (p_k)^m > (p_k)^M, since m < M and p_k < 1. Finally, since M ln(1/p_k) = ln((1/p_k)^M) = ln(1/δ), we see that (p_k)^m > δ. Thus the probability of being ambiguous after m examples is greater than δ, which means that the confidence of being able to identify the target is less than 1 − δ.

This simple result allows us to assess the number of samples we need to draw in order to be confident of correctly identifying the target. Note that if the distribution of the data is very unfavorable, i.e., the probability of receiving ambiguous strings is quite high, then the number of samples needed can actually be quite large. While the previous theorem provides the number of samples necessary to identify the target, the following theorem provides an upper bound for the number of samples that are sufficient to guarantee identification with high confidence.

Theorem 4.3 If the learner draws more than

    M = ln((N − 1)/δ) / ln(1/b_t)

samples, then it will identify the target with confidence greater than 1 − δ. (Here b_t = max_{j≠t} P(L_t ∩ L_j) and N is the total number of languages in the family.)

Proof: Let the target be L_t. Define A_i to be the event that L_t and L_i are not distinguishable after n examples. The probability of event A_i (denoted by P(A_i)) is p_i^n, where p_i = P(L_t ∩ L_i). Thus A_i occurs if all n example sentences belong to both L_t and L_i. Now, the probability that at least one of the events A_i occurs is given by P(∪_{j≠t} A_j). Using the union bound, we have

    P(∪_{j≠t} A_j) ≤ Σ_{j≠t} P(A_j) ≤ (N − 1) b_t^n.

For this to be smaller than δ, we need (1/b_t)^n > (N − 1)/δ, or n > M = ln((N − 1)/δ)/ln(1/b_t). Thus, if more than M examples are drawn, the probability of being unable to distinguish the target language from any one of the other languages is made small.

To summarize, this section provides simple upper and lower bounds on the sample complexity of exact identification of the target language from positive data. The δ parameter that measures the confidence of the learner in being able to identify the target is suggestive of a PAC (Valiant 1984) formulation. However, there are two crucial differences. First, in the PAC formulation, one is interested in an ε-approximation to the target language with at least 1 − δ confidence. In our case, this is not so. Since we are not allowed to approximate the target, the sample complexity shoots up with the choice of unfavorable distributions. Second, the learner has to make do with only positive data. In the classical PAC setting, the learner has access to both positive and negative examples. Recalling our discussion of the PAC framework from Chapter 2, it is worthwhile to note that any finite family of languages is PAC learnable, and upper and lower bounds on the sample complexity for learning such families are easily derived following the usual analysis (Vapnik 1998). I do not explore these sorts of questions any further in the rest of the book.
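A small sketch of Theorems 4.2 and 4.3 in code; the overlap probabilities supplied in the example are made up for illustration.

```python
import numpy as np

def batch_sample_bounds(overlaps, delta):
    """Lower bound (Theorem 4.2) and upper bound (Theorem 4.3) on the number of
    samples needed to identify the target with confidence 1 - delta.
    `overlaps` lists p_j = P(L_t intersect L_j) for each non-target language j."""
    p = np.asarray(overlaps, dtype=float)
    N = len(p) + 1                                          # total number of languages
    lower = np.max(np.log(1 / delta) / np.log(1 / p))
    upper = np.log((N - 1) / delta) / np.log(1 / p.max())
    return lower, upper

# Example: three competitors whose overlap with the target has probability
# 0.5, 0.3, and 0.1 under the target distribution.
print(batch_sample_bounds([0.5, 0.3, 0.1], delta=0.05))     # roughly (4.3, 5.9)
```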

4.4  Generalizations and Variations

The previous sections introduced the Markov chain framework for analyzing the learnability of grammars in the Principles and Parameters (P&P) tradition. I now show that this framework has general applicability well beyond the scope of P&P and the Triggering Learning Algorithm.

4.4.1  Markov Chains and Learning Algorithms

Consider a learning algorithm A specified by a mapping from D → H, where as before D is the set of all finite-length data streams and H is a class of hypothesis grammars (languages). The learning algorithm is conceptualized as an online procedure that develops grammatical hypotheses after each new example sentence. Suppose the learner has an initial hypothesis h_0. After each new sentence has been received, it updates its hypothesis. Let us denote its hypothesis after n examples as h_n. One can then reasonably ask with what probability the following event happens:

    Event: h_n = g; h_{n+1} = f.

The probability of this event is simply given by the measure of the set

    A_{n,g} ∩ A_{n+1,f} = {t ∈ T | A(t_n) = g} ∩ {t ∈ T | A(t_{n+1}) = f}.

Here T is the set of all texts and t is a generic element of this set, i.e., it is a particular text. In accordance with the notation introduced in Chapter 2, t_n refers to the first n sentences in the text t, and therefore t_n ∈ D. We discussed the natural measure μ∞ on T that exists by the Kolmogorov Extension Theorem, and thus both sets A_{n,g} and A_{n+1,f} are measurable. Therefore, one may define

    P[h_{n+1} = f | h_n = g] = μ∞(A_{n+1,f} ∩ A_{n,g}) / μ∞(A_{n,g}),

provided that μ∞(A_{n,g}) > 0. If μ∞(A_{n,g}) = 0, this means that after n examples have been received, the probability with which the learner will conjecture g at this stage is exactly 0.

We can now naturally define an inhomogeneous Markov chain. The state space corresponds to the set of possible grammars in H: for each grammar g ∈ H we have a state in the chain.


At each point in time, the learner has a particular grammatical hypothesis, and therefore the chain is in the state corresponding to that grammar. At the nth time step (after n examples have been received), the transition matrix of the chain is given by

    T_n(g, f) = P[h_{n+1} = f | h_n = g] = P[g → f] = μ∞(A_{n+1,f} ∩ A_{n,g}) / μ∞(A_{n,g})      (4.3)

Since T_n(g, f) as defined by Equation 4.3 can be evaluated only for those g for which μ∞(A_{n,g}) > 0, we need to specify the values of T_n for other choices of g. To do this it is enough to choose a set of positive numbers α_g such that Σ_{g∈H} α_g = 1. Therefore, we can define

    T_n(g, f) = α_f  ⇔  μ∞(A_{n,g}) = 0                                                          (4.4)

It is easy to check that, for all g,

    Σ_{f∈H} T_n(g, f) = 1.

It is similarly easy to check that

    P[h_{n+1} = f] = μ∞(A_{n+1,f}) = Σ_{g∈H} P[h_n = g] T_n(g, f).

The transition matrix T_n is time dependent and characterizes the evolution of the learner's hypothesis from example sentence to example sentence. Thus we make the following observations:

1. Let H be a class of possible target grammars.

2. Let g ∈ H be the target grammar.

3. Let μ be a probability measure on L_g ⊂ Σ* according to which sentences are presented to the learner.

4. By the Kolmogorov Extension Theorem, a unique measure μ∞ exists on the set of all possible texts T, as discussed.

5. Any arbitrary learner A : D → H may be exactly characterized by an inhomogeneous Markov chain with as many states as there are grammars in H and whose transition matrix T_n (after n steps) is given by Equations 4.3 and 4.4, respectively.

We have thus seen the proof of

140

Theorem 4.4 Any deterministic learning algorithm may be characterized by an inhomogeneous Markov chain. The behavior of the chain depends upon the learning algorithm A, the target grammar g ∈ H, and the probability measure μ on Lg ∈ Σ∗ . The target grammar g is learnable (with measure 1) if and only if the chain settles in the state corresponding to the target grammar. Thus learnability of grammars is related to the convergence of nonstationary Markov chains. In general, such an inhomogeneous chain converges to its limiting distribution if the chain is ergodic. Conditions for the ergodicity of inhomogeneous chains may be expressed in a variety of ways, most notably utilizing the notion of the ergodic coefficient. This, in turn, might allow us to obtain learnability conditions expressed in the language of Markov chains rather than recursion theory. Thus, for example, if the learning algorithm A is such that (i) the associated Markov chain is ergodic, and (ii) for each n, Tn is such that all closed sets contain the target state, then the hypothesis generated by A will converge to the target grammar in the limit. It is worthwhile to note, however, that ergodicity is not necessary for learnability since the chain need not converge to the target state from all initial distributions. A more involved discussion of the relationship between the learnability of grammars and the convergence of the corresponding inhomogeneous chains is beyond the scope of this book. We turn now to the consideration of the class of memoryless learning algorithms that are characterized by stationary chains. The conditions for the convergence of such chains are obtained from the analysis of the TLA provided earlier.

4.4.2

Memoryless Learners

Memoryless algorithms may be regarded as those that have no recollection of previous data, or previous inferences made about the target function. At any point in time, the only information upon which such an algorithm acts is the current data and the current hypothesis (state). A memoryless algorithm may then be characterized as an effective procedure mapping this information to a new hypothesis. In general, given a particular hypothesis state (h in H, the hypothesis space), and a new datum (sentence, s in Σ ∗ ), such a memoryless algorithm will map onto a new hypothesis (g ∈ H). Of course, g could be the same as h or it could be different depending upon the specifics of the algorithm and the datum. Formally, therefore, the algorithm A must be such that for all n and for

141

MEMORYLESS LEARNING

all texts t ∈ T , we have A(tn+1 ) = a(A(tn ), t(n + 1)) where a is a mapping from H × Σ∗ to H. Following our previous discussion, the behavior of such an algorithm is also characterized by a Markov chain. It is easy to see that Tn (g, f ) = P rob[hn+1 = f |hn = g] = μ({s ∈ Σ∗ |a(g, s) = f }) where as before we have assumed that the text is generated by sampling in i.i.d. fashion according to a probability measure μ on Σ ∗ . Clearly Tn is independent of n and the resulting Markov chain is a stationary one.

4.4.3

The Power of Memoryless Learners

Pure memoryless learners belong to the more general class of memory limited learners. Memory limited learners develop grammatical hypothesis based on a finite memory of sentences they have heard over their lifetime. The following definition provides a useful formalization of the notion: Definition 4.2 (Wexler and Culicover 1980) For any finite data stream u ∈ D, let u− be the data stream consisting of all but the last sentence of u. Thus if n = lh(u) and u = s1 , s2 , . . . , sn , then u− = s1 , s2 , . . . , sn−1 . Let u− k be the data stream consisting of the last k elements of u. Thus if n > k then u− k ∈ D is such that u− k = sn−(k−1) , sn−(k−2) , . . . , sn . A learning algorithm A is said to be k-memory limited if for all u, v ∈ D, such that (i) u− k = v − k, and (ii) A(u− ) = A(v − ), we have A(u) = A(v). To put it differently, A(u) depends only upon the previous grammatical hypothesis (A(u− )) and the last k sentences heard (u− k). Using this, it is easy to develop the notion of a memory limited learner. This is given by Definition 4.3 A learning algorithm A is memory limited if there exists some integer m such that A is m-memory limited. Thus, in general, a memory limited learner is required only to have a finite memory. No bound is set on the size of the memory it is required to have. From a cognitive point of view, such memory limited learners have great appeal since it seems like a natural way to characterize the fact that learning children are unlikely to have arbitrary unbounded memory of their data.


One might think that the class of languages learnable by memory-limited learners is larger than that learnable by memoryless learners. This, however, turns out not to be the case.

Theorem 4.5 If a class of grammars H is learnable in the limit by a k-memory limited learning algorithm A_k, then there exists some memoryless learning algorithm that is also able to learn it.

Proof: Omitted.

In the last two chapters, we have implicitly adopted a probabilistic model of learning where the learner is required to converge to the target in probability (alternatively, with measure 1). It is worthwhile to make the following additional observation.

Theorem 4.6 If a class of grammars H is learnable in the limit (in the classical Gold sense), then it is learnable with measure 1 by some memoryless learner.

Proof: The proof has been relegated to the appendix for continuity of ideas.

Thus, the class of memoryless learners is quite general. Consequently, the characterization of memoryless learners by first-order Markov chains takes on a general significance that far exceeds the original context of the Triggering Learning Algorithm.

4.5  Other Kinds of Learning Algorithms

We have examined in some detail the problem of language acquisition in the P&P framework, with particular attention to models surrounding the Triggering Learning Algorithm of Gibson and Wexler 1994. By this time it should be clear, however, that the basic framework that has been applied for a more penetrating analysis of the TLA is considerably more general in its scope. It is therefore timely for us to note that there has been significant computational activity in the area of language acquisition in a variety of linguistic and cognitive traditions. Let us consider a few different organizing strands for this kind of research.

1. In Optimality Theory (Prince and Smolensky 1993), grammatical variation is treated via constraints rather than rules. Surface expressions, be they phonological forms or syntactic forms, are deemed acceptable if they violate the least number of constraints. In many instantiations of this theory, one begins with a finite number of constraints C_1, . . . , C_n. In the grammar of a particular language, these constraints are ordered in importance and determine the ranking and relative importance of constraint violations for candidate surface forms. Thus there are in principle n! different grammars possible. The task of the learning child is conceptualized as determining the appropriate ordering for the target grammar given example sentences they hear. An extensive treatment of the learning algorithms appropriate for this framework is provided in Tesar and Smolensky 2000. Iterative strategies utilizing error-driven constraint demotion are online and memoryless and may be exactly characterized by a random walk on a state space of n! possible grammars.

2. Algorithms for learning grammars in different linguistic traditions have been considered by Briscoe (2000) in an LFG framework, Yang (2000) in a GB framework, Stabler (1998) in a minimalist framework with movement, Fodor (1994, 1998), Sakas (2000), Bertolo (2001), and Clark and Roberts (1993) in P&P frameworks, and Neumann and Flickinger (1999) in an HPSG framework. They present interesting variations, consider subtleties involved in learning complex grammar families, and investigate issues when one models natural languages as composed of multiple grammars.

3. An important thread in computational studies of language acquisition attempts to clarify the manner in which semantic considerations enter the process of acquiring a language. Learning procedures rely on semantic feedback to acquire formal structures in a more functionalist perspective on language acquisition. See Feldman et al. 1996, Regier 1996, or Siskind 1992 for computational explorations of these themes.

4. Other probabilistic accounts of language acquisition attempt to encode prior knowledge of language, to varying degrees of specificity, in a minimum description length framework, as in Brent 1999, Goldsmith 2001, and de Marcken 1996.

5. Connectionist and other data-driven approaches to language acquisition may also be characterized formally within the frameworks provided in this chapter and the previous one. Examples of such approaches are Daelemans 1996, Gupta and Touretzky 1994, Charniak 1993, MacWhinney 1987, 2004, and so on.


Most of these approaches to language acquisition attempt to characterize in computational terms the procedures of language learning in a variety of cognitive settings with varying degrees of preconceived notions. All of these are ultimately analyzable within the general computational framework considered over the last three chapters.

4.6  Conclusions

In this chapter I have continued my investigation of language acquisition within the P&P framework with central attention to the kinds of models inspired by the TLA. The problem of learning parameterized families of grammars has several different dimensions, as I have emphasized earlier. One needs to investigate the learnability for a variety of algorithms, distributional assumptions, parameterizations, and so on. In this chapter, I have emphasized that it is not enough to merely check for learnability in the limit (as previous research within an inductive inference Gold framework has tended to do; see, for example, Jain et al. 1998); one also needs to quantify the sample complexity of the learning problem, i.e., how many examples does the learning algorithm need to see in order to be able to identify the target grammar with high confidence? In order to get a handle on this question, I take my Markov analysis to the next logical stage — that of characterizing convergence times. A rich literature exists on characterizing the invariant distributions of Markov chains and the rate at which the chain converges to it. I have provided in this chapter a brief survey of some of the important techniques that are involved in this analysis. We saw the dependence of the convergence rates upon the probability distribution according to which example sentences were presented to the learner. We considered pathological distributions that significantly increased convergence times as well as more natural distributions obtained from the CHILDES corpus. Although much of the analysis was inspired by the TLA, it is important to recognize that the general framework is considerably broader in scope. Any learning algorithm on any enumerable class of grammars may be characterized as an inhomogeneous Markov chain. Any memory-limited learning algorithm (as biological learning algorithms must be) is ultimately a first order Markov chain. Much of the cognitively motivated computational work on language acquisition — reviewed briefly in Section 4.5 — may then be analyzed satisfactorily within this framework. How a child acquires its native language presents one of the deepest

scientific problems in cognitive science today. While we are still quite far from a complete understanding of the process, much research in linguistics, psychology, and artificial intelligence has been conducted with this problem in mind. Because a natural language with its phonetic distinctions, morphological patterns, and syntactic forms has a certain kind of formal structure, computational modeling has played an important role in helping us reason through various explanatory possibilities for how such a formal system may be acquired. Chapters 2 through 4 of this book present many variations of the basic computational framework and an overview of the central insights that must inform us as we search for a solution. Throughout these past few chapters, language acquisition is framed in a conventional setting as an idealized parent-child interaction with a single homogeneous target grammar (language) that must be attained over the course of this interaction. I use this as a building block for the more natural setting of learners immersed in linguistically heterogeneous populations. As a result, the problems of language change and evolution on the one hand, and language acquisition on the other, become irrevocably linked. I elaborate on this theme in the rest of the book.

4.7 Appendix: Proofs for Memoryless Algorithms

The power of memoryless algorithms comes from the fact that it is possible for an algorithm to code the data set it has received so far into its current conjecture. This is a consequence of two important and well-known facts that I state without proof.

Proposition 4.1 There is a mapping f : N × N → N that is one-to-one, onto, and therefore invertible.

Therefore, any pair of natural numbers i, j can be coded as k = f(i, j) such that, from knowing the value of k, one is able to decode it by f^{-1}(k) = (i, j). By applying this recursively, one may code any finite number of natural numbers. For example, if one were to code three numbers i, j, k, then one may do this by f(f(i, j), k). Applying f^{-1} twice to this number would recover the three original numbers. To make this idea work in general, one will also need to code the total number of numbers being coded. This will indicate to the receiver how many times the f^{-1} operation needs to be applied to the coded number to recover the original numbers. Thus the true code for the numbers i, j, k would be given by l = f(3, f(f(i, j), k)). Upon applying f^{-1} once to l, one recovers 3 (the total number of natural numbers being coded). This indicates that f^{-1} needs to be applied two more times to recover the three original numbers. The extension to coding any finite number of natural numbers is clear.

This means that the current data set may be coded as a natural number. Enumerate the elements of Σ* as s_1, s_2, s_3, . . . . Let t be a text presented to the learner. Let i_1, i_2, . . . , i_n be the indices of the first n sentences in the text. Thus s_{i_1} = t(1), s_{i_2} = t(2), and so on. Then at stage n, the learner will have encountered t_n = s_{i_1} s_{i_2} . . . s_{i_n}. The learner may encode this data by encoding the n integers i_1, . . . , i_n. Let us denote this coding procedure as l = code(t_n).

A second fact is a consequence of the s-m-n theorem. Let g_1, g_2, . . . be an enumeration of Phrase Structure Grammars (equivalent to the r.e. sets) in an acceptable programming system. Let L_1, L_2, . . . correspond to their respective languages. As is well known, for any r.e. set L, there are an infinite number of indices j_1, j_2, . . . such that L_{j_k} = L. Further, an infinite number of such indices may be enumerated by the padding function, as the following theorem indicates.

Theorem 4.7 There exists a one-to-one, onto, computable function pad : N × N → N

such that L_{pad(i,j)} = L_i for all i, j and pad(i, j) is an increasing function of j for each i.

Now recall the basic learning-theoretic setting. Consider an acceptable enumeration of grammars. Then grammars may be specified by specifying their index in this enumeration. Consider A ⊆ N. Then this specifies G = {g_i | i ∈ A} and L = {L_i | i ∈ A}. Any learning algorithm is a map from data to grammars. Consider learning according to a prespecified 0-1 valued metric d such that d(g_i, g_j) = 0 ⇔ L_i = L_j. This is the same as requiring extensional (behavioral) convergence.

Theorem 4.8 If a family of grammars G (correspondingly a family of languages L) is identifiable (on all texts and with an extensional norm) in the limit by an algorithm A, then it is identifiable (on all texts and with an extensional norm) in the limit by some memoryless algorithm.

Proof: The memoryless algorithm A_memoryless works by coding the data, calling A as a subroutine on this data, and padding the output of A. More formally, consider text t = s_{i_1} s_{i_2} . . . as input to the learning algorithm. The initial guess of the learning algorithm before seeing any data is g_1. After seeing the first data point s_{i_1} the learner calls A to obtain g_m = A(s_{i_1}). It codes the data as k = code(t_1). It uses the padding function to obtain p = pad(m, k). The learner outputs g_p. At stage n, let the learner's hypothesis be g_j. Let (i, k) = pad^{-1}(j). Then L_i = L_j and uncode(k) = s_{i_1} . . . s_{i_n}. On input s_{i_{n+1}} the learner recovers t_{n+1} by appending it to uncode(k). Thus it has effectively created t_{n+1}. It calls A to obtain g_m = A(t_{n+1}). Let l = code(t_{n+1}). Finally let p = pad(m, l). The learner outputs g_p. It is clear that at each stage n, if g_{m_n} = A_memoryless(t_n) and g_{l_n} = A(t_n) then L_{m_n} = L_{l_n}. Therefore, if the target grammar is g (language L),

lim_{n→∞} d(g, g_{m_n}) = lim_{n→∞} d(g, g_{l_n}) = 0
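To make the coding machinery used in this proof concrete, here is a small illustrative sketch in Python — not part of the original text — of one standard pairing function (the Cantor pairing) together with the recursive coding of a finite sequence of natural numbers described above. The function names pair, unpair, code, and uncode are merely illustrative stand-ins for f, f^{-1}, and the coding procedure.

```python
from math import isqrt

def pair(i, j):
    # Cantor pairing: a one-to-one, onto map from N x N to N.
    return (i + j) * (i + j + 1) // 2 + j

def unpair(k):
    # Invert the Cantor pairing.
    w = (isqrt(8 * k + 1) - 1) // 2
    j = k - w * (w + 1) // 2
    return w - j, j

def code(indices):
    # Code a nonempty finite sequence of naturals as one natural number,
    # prefixing the length so the receiver knows how often to unpair.
    acc = indices[0]
    for x in indices[1:]:
        acc = pair(acc, x)
    return pair(len(indices), acc)

def uncode(k):
    # Recover the original sequence from its code.
    n, acc = unpair(k)
    out = []
    for _ in range(n - 1):
        acc, x = unpair(acc)
        out.append(x)
    out.append(acc)
    return out[::-1]

assert uncode(code([3, 5, 7])) == [3, 5, 7]
```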

In the above theorem, we see that while the memoryless learner converges extensionally to the right set, it need not stabilize on any grammar. One may be interested in a stronger notion of convergence in which on each text

t the learner stabilizes on a grammar g_t such that d(g_t, g) = 0 (here g is the target grammar). An algorithm A is said to stabilize on a grammar g_t on text t if there exists a point in time n such that for all m > n, we have that A(t_m) = g_t, i.e., after seeing n sentences, the learner's grammatical hypothesis is always g_t. It turns out that while it is not possible to construct a memoryless algorithm that converges in this strong sense on every single text, it is possible to construct one that converges on almost all texts. In other words, the following theorem is true.

Theorem 4.9 If a family of grammars G (correspondingly languages L) is identifiable in the limit (by stabilizing on an appropriate grammar on all texts) by some learning algorithm, it is identifiable with measure 1 (by stabilizing on an appropriate grammar on almost all texts) by some memoryless learning algorithm.

Proof: Let L ∈ L be the target language and let j be the least index for it, i.e., j is the least index such that L_j = L. Let t = s_{i_1} s_{i_2} . . . be a text from the target obtained by sampling according to μ in i.i.d. fashion. Note that μ has support on L. We first begin by making the following two observations.

1. Suppose the text t is such that each element of L occurs infinitely often in t. We call such a text a rich text. Then by the laws of probability, it is possible to show (here μ^∞ is the product measure on all texts obtained by the Kolmogorov Extension Theorem in the standard way) that

μ^∞({t | t is a rich text of L}) = 1

We will now construct a memoryless learning algorithm that can identify L on all rich texts.

2. Since L is identifiable in the limit, by the necessary and sufficient conditions for identifiability discussed in Chapter 2, we know that for every L ∈ L, there exists a finite set D_L ⊆ L such that no language L' ∈ L that contains D_L can be a proper subset of L.

The description of the memoryless learning algorithm follows. The learner's hypothesis after n examples have been processed will be denoted by the number h(n). h(n) is the index of the grammar (g_{h(n)}) or language (L_{h(n)}) that is hypothesized after n examples. The learner will maintain a small evidentiary set S composed of example sentences and a counter c that counts how many times it enters cases 3 and 4 of the algorithm below. Note that the evidentiary set S may be coded as a natural number.


Now consider the following learning algorithm. At stage 0, i.e., after 0 examples have been seen, the algorithm is initialized by having S = ∅, c = 0. The learner's initial guess is h(0) = pad(M, f(code(S), c)) where M is the smallest index k such that L_k ∈ L. At stage n, i.e., after n examples have been seen, the learner updates its hypothesis in the following way. Note that from h(n) it can recover S and c uniquely.

if (D_{L_{h(n)}} ⊆ S ∪ {s_{i_{n+1}}} ⊆ L_{h(n)}) then
    if (c ≤ i_{n+1}) then
        [Case 1] h(n+1) = h(n)
    else
        [Case 2] S = S ∪ {s_{i_{n+1}}}
                 p = f(code(S), c)
                 if (S and p don't change) then h(n+1) = h(n)
                 else h(n+1) = pad(h(n), p)
                 endif
    endif
else
    c = c + 1
    S = S ∪ {s_{i_{n+1}}}
    p = f(code(S), c)
    if (∃ smallest l ≤ n s.t. L_l ∈ L and D_{L_l} ⊆ S ⊆ L_l) then
        [Case 3] h(n+1) = pad(l, p)
    else
        [Case 4] h(n+1) = pad(M, p)
    endif
endif

First, we must agree that this is memoryless. To see this, simply notice that the new hypothesis depends only upon the previous hypothesis and the current data. This is because from the previous hypothesis, both the hypothesized language and the evidentiary set S may be recovered. Second, notice that at all stages, the algorithm outputs a hypothesis that is in the family L.

Next, we prove that if the algorithm stabilizes on a grammar, it stabilizes on one that generates the target language L. Assume that the algorithm

converges on index k, i.e., h(n) = k for all n large enough. Since cases 3 and 4 result in a change of hypotheses, it must be in case 1 or 2 after a finite number of examples have been seen. From this stage on, it is clear that DLk ⊆ S ∪ {t(n)} ⊆ Lk for all n. Since every element of L occurs infinitely often in t, we see that every element of L must be contained in Lk , i.e., L ⊆ Lk . On the other hand, DLk ⊆ S ∪ {t(n)} ⊆ L. Therefore by the definition of DLk and the learnability of L we have that L cannot be a proper subset of Lk . Therefore it must be that Lk = L. Finally, we show that it must converge (stabilize). The proof is by contradiction. Suppose not. This means that it changes its hypothesis infinitely often. Therefore cases 2,3,4 must occur infinitely often. Let us argue that cases 3 and 4 can occur only a finite number of times. We will prove this by contradiction. Assume it occurs an infinite number of times. Therefore c increases without bound. Consider an arbitrary s i ∈ L. We know that si occurs infinitely often in t. If on any of the instances it occurs, the algorithm enters case 3 or 4, then s i will be included in S after that point. On the other hand, if on each occasion, it enters case 1 or 2, then the moment c > i (and this moment must come since c increases without bound); the algorithm will enter case 2 at that stage and s i will be included in S after that point. Thus every s i ∈ L will get included in S eventually. Now consider the elements of DL . Since DL is a finite set, there is a finite time (stage N ) such that after N examples have been received, all elements of DL have been included in S so that DL ⊆ S. Let k = max(N, j). Consider the case where the learner enters case 3 or 4 after stage k. Since Lj = L, we have that (i) DLj ⊆ S ⊆ Lj . Consider any i < j. We will now argue that the learning algorithm can hypothesize Li only a finite number of times after this. Suppose not. Then it must be the case that (ii) DLi ⊆ S ⊆ Li an infinite number of times. Since every element of L = Lj eventually gets included in S, this means that L j ⊆ Li . Yet, by the learnability of L, if DLi ⊆ S ⊆ Lj , it cannot be that Lj is a proper subset of Li . Therefore, it must be that Li = Lj . Yet we know that j is the smallest index for Lj = L leading to a contradiction. Thus, every hypothesis Li where i < j will be eventually discarded when the learner enters case 3 or 4 after stage k. Therefore, the learner will eventually hypothesize Lj if it enters case 3 or 4 after stage k. Having done this, it is clear that it will never subsequently enter case 4 or 3. This leads to a contradiction in our assumption that the algorithm enters case 3 or 4 an infinite number of times. Therefore cases 3 and 4 must occur only a finite number of times and there is a maximum number C that c achieves, after which it never grows.

Thus eventually, the learner will only enter case 1 or 2. Now we will argue that the learner’s grammatical hypothesis can change only a finite number of times after this. First we note that there are only a finite number of sentences si ∈ L such that i < C. The algorithm can enter case 2 only when one of these si ’s occurs in the text. Each of these s i ’s will eventually get included in S when the algorithm enters case 2. After this point whether the algorithm is in case 1 or 2, the set S does not change and c = C does not change. Therefore, h(n) does not change with n and the learner converges on a fixed index.

Part III

Language Change

Chapter 5

Language Change: A Preliminary Model

For the last forty years, synchronic linguistics has been driven by the so-called "logical problem of language acquisition" — the problem of how children come to acquire the language of their parents. In Part II (Chapters 2 through 4) I discussed in some detail the inherent difficulty of inferring a language on the basis of finite exposure to it. Within the analytic framework of our discussion, we concluded that successful language learning would be possible only in the presence of adequate prior constraints on the class of possible grammars H and correspondingly on the class of possible learning algorithms A. Linguists in the generative tradition have proposed theories of universal grammar that constrain the range of grammatical hypotheses that children might entertain during language learning. Psycholinguists, developmental psychologists, and cognitive scientists have pursued a range of approaches from empirical studies of child-language acquisition to formal analyses of the process (Pinker 1984; Wexler and Culicover 1980; Crain and Thornton 1998; Slobin 1985-97; Gleitman and Landau 1994; Newport and Aslin 2000). These shed light on the possible mechanisms or algorithms of language learning.

Language acquisition may be viewed as the mechanism by which language is transmitted from parent to child — and in fact, from one generation of language users to the next. Perfect language acquisition would imply perfect transmission. Children would acquire perfectly the language of their parents, language would be mirrored perfectly in successive generations, and languages would not change with time. Yet, in Chapter 1, we saw a number

of documented cases of such change. Therefore, for languages to change with time, children must do something differently from their parents. There is thus a tension between language learning on the one hand and language change on the other. Perfect language learning would imply no change. At the same time, language learning cannot be so imperfect that the learned language of the children does not resemble at all that of the parents (more generally, the linguistic environment). If, due to slight imperfections of language learning, the linguistic composition of the population shifts just a bit, can this slight shift lead eventually to a significant change over long time scales? This is a question that I address over the next few chapters. In a discussion of the history of English, Lightfoot (1991) clearly points to such a possibility:

As somebody adopts a new parameter setting, say a new verb-object order, the output of that person's grammar often differs from that of other people's. This in turn affects the linguistic environment, which may then be more likely to trigger the new parameter setting in younger people. Thus a chain reaction may be created. (Lightfoot 1991, 162)

In fact, linguists have long been occupied with describing phonological, syntactic, and semantic change, often appealing to an analogy between language change and evolution, but rarely going beyond this. For instance, Lightfoot (1991, 163-65) talks about language change in this way: "Some general properties of language change are shared by other dynamic systems in the natural world . . . In population biology and linguistic change there is constant flux. . . . If one views a language as a totality, as historians often do, one sees a dynamic system." Indeed, entire books have been devoted to the description of language change using the terminology of population biology: genetic drift, clines, and so forth.1 Other scientists have explicitly made an appeal to dynamical systems in this context; see especially Hawkins and Gell-Mann 1992. It is only over the last decade, however, that this connection has begun to be seriously and formally explored.

In this chapter, I make explicit the link between learning and evolutionary dynamics. In particular, I show formally that a model of language change emerges as a logical consequence of language acquisition, an argument made informally by Lightfoot (1991). We will see that Lightfoot's intuitive metaphor of dynamical systems is essentially correct as is his proposal for turning language-acquisition models into language-change models.

1. For example, see Nichols 1992, and more recently Mufwene 2001.

5.1 An Acquisition-Based Model of Language Change

How does the combination of a grammatical theory and learning algorithm lead to a model of language change? I begin my treatment by arguing that the problem of language acquisition at the individual level leads logically to the problem of language change at the group or population level. Consider a population speaking a particular language. 2 This is the target language — children are exposed to primary linguistic data (PLD) from this source, typically in the form of sentences uttered by caretakers (adults). The logical problem of language acquisition calls for an explanation of how children acquire this target language from their primary linguistic data — in other words, it calls for an adequate learning theory. Following the development of the previous chapters, I take a learning theory to be simply a mapping from primary linguistic data to the class of Phrase Structure Grammars (computational systems), usually an effective procedure, and so an algorithm. For example, in a typical inductive inference model, given a stream of sentences, an acquisition algorithm would update its grammatical hypothesis with each new sentence according to some computable process. We encountered various formal criteria for learnability, all of which require that the algorithm’s output hypothesis converge to the target in some sense as more and more data become available. Now suppose that we fix an adequate grammatical theory and an adequate acquisition algorithm. There are then essentially two means by which the linguistic composition of the population could change over time. First, if the primary linguistic data presented to the child is altered (due to any number of causes, perhaps to presence of foreign speakers, contact with another population, disfluencies, and the like), the sentences presented to the learner (child) are no longer consistent with a single target grammar. In the face of this input, the learning algorithm might no longer converge to the target grammar. Indeed, it might converge to some other grammar (g 2 ), or it might converge to g2 with some probability, g3 with some other probabil2

In my analysis this implies that all the adult members of this population have internalized the same grammar (corresponding to the language they speak).


ity,3 and so forth. In either case, children attempting to solve the acquisition problem using the same learning algorithm could internalize grammars different from the parental (target) grammar. In this way, in one generation the linguistic composition of the population can change. 4 Second, even if the PLD comes from a single target grammar, the actual data presented to the learner is truncated, or finite. After a finite sample sequence, children may, with nonzero probability, hypothesize a grammar different from that of their parents. This can again lead to a differing linguistic composition in succeeding generations. In short, the diachronic model is as follows: individual children learn a language based on linguistic input generated from the grammar of their caretaker (the target grammar). After a finite number of examples, some acquire a grammar very similar to that of their caretakers, but others may have acquired something different. The next generation will therefore no longer be linguistically homogeneous. The third generation of children will hear sentences produced by the second — a different distribution — and they, in turn, will attain a different set of grammars. Over successive generations, the linguistic composition evolves as a dynamical system. In this manner, language acquisition and language change become intimately related. Figure 5.1 provides a pictorial perspective on this state of affairs. Language acquisition takes a microscopic view of the situation. The object of study is the individual child and one studies how its linguistic (grammatical) hypotheses evolve from example sentence to sentence over its developmental lifetime. If one were to step away and take a macroscopic view of the same phenomena, one could make the object of one’s study the entire community or population. If one studied how the linguistic composition of the population were to evolve from generation to generation over evolutionary time scales, then one would end up with models of language change. Such models of language change are driven by models of language learning. Thus every model of language learning at the individual level could have potentially different evolutionary consequences at the population level. This book explores such consequences in some detail. 3

Recall from the previous chapters that convergence is measurable. Therefore it makes sense to speak of the probability of acquiring a particular grammar under any source distribution. 4 Sociological factors affecting language change affect language acquisition in exactly the same way, yet are abstracted away from the formalization of the logical problem of language acquisition. In this same sense, I similarly abstract away such causes here, though they can be brought into the picture as variation in probability distributions and learning algorithms. I will consider a few such variations in future chapters.

[Figure 5.1 appears here: a schematic contrasting the microscopic view (the individual, whose grammatical hypothesis develops with the number of examples) with the macroscopic view (the population, whose linguistic composition evolves with the generation number).]

Figure 5.1: Language acquisition takes a microscopic view of the situation. It focuses on the individual language learner and studies how its language (grammar) develops on exposure to primary linguistic data over its critical learning period. Language change and evolution takes a macroscopic view with a focus on the population over generational time scales. One now studies how the linguistic composition of the population evolves from generation to generation.


On this view, language change is a logical consequence of specific assumptions about 1. The grammar hypothesis space — the space of possible grammars that humans might acquire. In the Principles and Parameters framework of modern linguistic theory, this reduces to the choice of a particular parameterization. 2. The language acquisition device — the learning algorithm the child uses to develop grammatical hypotheses on the basis of data. 3. The primary linguistic data — the distribution of sentences that a child is exposed to and that affect its linguistic development. If we specify (1) through (3) for a particular generation, we should, in principle, be able to compute the linguistic composition for the next generation. In this manner, we can compute the evolving linguistic composition of the population from generation to generation; we arrive at a dynamical system. We now proceed to make this calculation precise. We first review a standard language acquisition framework, and then show how to derive a dynamical system from it. We begin with the simplest possible model — that of two languages in competition with each other.

5.2 A Preliminary Model

Our discussion is motivated by a syntactic view of the world5 where each language is viewed as a set of expressions that are well-formed according to the rules of some underlying grammar. Therefore, languages may be treated formally as subsets of Σ* where Σ is a finite alphabet (denoting, for example, the lexical items). Imagine a world with only two possible languages L1 and L2 where each Li is a subset of Σ* in the usual way. In general, L1 and L2 are not disjoint. Sentences belonging to both L1 and L2 are ambiguous and may be parsed according to the underlying grammar (g1 and g2 respectively) of each language. We consider a case where each individual is a user of precisely one language – this is the monolingual case. The language of the individual is acquired during a learning period (over childhood) on the basis of exposure to linguistic examples provided by the ambient linguistic community.

To make matters simple, we divide the population neatly into coincident generations and now consider two successive generations. The state of any adult generation is simply described by a single variable αt (the subscript t denoting generation number). Here αt is the proportion of individuals speaking language L1 in generation t — therefore, a proportion 1 − αt of the population consists of users of language L2. Let us characterize the probability with which speakers of L1 produce sentences by P1 and similarly P2 for speakers of L2. Thus a sentence s ∈ Σ* will be produced with probability P1(s) by a user of L1 and with probability P2(s) by a user of L2. If s is not an element of L1, clearly P1(s) = 0 and similarly, if s is not an element of L2, then P2(s) = 0.

5. The general methodology is equally applicable to phonology. One may view a phonological grammar as defining a set of well-formed phonological expressions. The set of such well-formed expressions may be defined using a notational system that utilizes a phonological alphabet.

5.2.1 Learning by Individuals

We begin by examining the acquisition of language by individuals in the population. Language acquisition is the process of developing grammatical hypotheses on the basis of linguistic experience, i.e., exposure to linguistic data during childhood. Within the purview of generative linguistics, this is conceptually regarded as choosing an appropriate grammar from a class of potential grammars G (Universal Grammar or UG) on the basis of primary linguistic data. In this example, we consider the case where there are only two potential grammars — g1 and g2 underlying the languages L1 and L2 respectively. Consider now a learning procedure (algorithm) to choose a language based on linguistic examples. As we saw in previous chapters, this can, in general, be characterized as a mapping from linguistic data sets to the hypothesis6 set {L1 , L2 }. Following the notation developed previously, we let Dn be the set of all potential finite data streams of exactly n sentences each and D = ∪i≥1 Di to be the set of all finite length data sets. The learning algorithm A is a computable mapping from D to {L 1 , L2 }. 6

Although I refer to the learner as choosing a language L1 or L2 , it is worthwhile to clarify that languages are typically infinite sets of sentences that have finite representations in terms of their grammars. Therefore, with their finite brains and finite lifetimes, learners presumably choose grammars. In general, there may be many grammars that are weakly equivalent (generate the same language), but in our setting in this chapter, there is a oneto-one mapping between the class of grammars G = {g1 , g2 } and the class of languages L = {L1 , L2 }. Therefore, we identify G with L and each may be viewed as the hypothesis set in our setting.


Now fix a probability distribution P on Σ* according to which sentences are drawn independently at random and presented to the learner. After k such examples are drawn, the learner's data set can be denoted by dk = {s1, s2, . . . , sk} where each si is drawn according to the distribution P. Clearly dk is an element of Dk. In this setting, it is possible to define the following object:

p_k = P[A(d_k) = L1]

In other words, p_k is the probability with which the learning algorithm will guess L1 after k randomly drawn sentences are presented to it. Now p_k will in general depend upon the probability distribution P that generates the data as well as the learning algorithm A. We will denote this dependence by p_k(A, P). In this probabilistic setting, it is worthwhile to recall the natural notion of learnability, which requires that the learner's hypothesis must converge to the target as the data goes to infinity. This simply means that if the probability distribution P had support on L1 so that only sentences of L1 occurred in the data sets of the learner (i.e., L1 is the target language), then

lim_{k→∞} p_k(A, P = P1) = 1

Similarly, if the probability distribution P had support on L2 so that only sentences of L2 were presented to the learner, then lim_{k→∞} p_k(A, P = P2) = 0. We will evaluate p_k for several different choices of A shortly.
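Because p_k is defined as a probability over random draws of k example sentences, it can always be estimated by simulation once A and P are fixed. The following sketch is purely illustrative (it is not part of the original text): it uses, for concreteness, the simple memoryless flip-on-error learner that reappears in Section 5.2.3, with sentences abstracted into "unambiguously L1" versus "ambiguous" types, and the value a = 0.3 is hypothetical.

```python
import random

def memoryless_learner(examples):
    # Start with a random hypothesis; flip it whenever the current hypothesis
    # cannot parse (generate) the incoming sentence.
    hyp = random.choice(["L1", "L2"])
    for s in examples:
        if hyp == "L1" and s == "L2only":
            hyp = "L2"
        elif hyp == "L2" and s == "L1only":
            hyp = "L1"
    return hyp

def estimate_pk(k, a, trials=50000):
    # P has support on L1: a sentence is ambiguous with probability a,
    # and unambiguously L1 otherwise.  Estimate p_k = P[A(d_k) = L1].
    hits = 0
    for _ in range(trials):
        data = ("ambiguous" if random.random() < a else "L1only" for _ in range(k))
        hits += memoryless_learner(data) == "L1"
    return hits / trials

for k in (1, 2, 5, 10, 20):
    print(k, estimate_pk(k, a=0.3))   # approaches 1 as k grows, as learnability requires
```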

5.2.2 Population Dynamics

The previous section discussed how the grammatical hypothesis of the individual learner develops over a series of linguistic examples during a critical learning period (i.e., until linguistic maturation time). The central question in that context is whether or not the learner’s hypothesis gets closer and closer to the target and eventually converges to it as more and more data become available. Of course, convergence to the target only occurs as the number of data goes to infinity — and learners live only finite lives. As a matter of fact, learners do not endlessly update their hypotheses but “mature” after a point and live with their mature hypothesis thereafter. Let us assume that maturation occurs after K examples have been presented to the learner. This assumption is consistent with evidence from developmental psychology suggesting that there is a critical age effect in language learning.


In this section, we consider the evolutionary implications at the population level of learning procedures at the individual level. Let us begin by considering a completely homogeneous population where all adult speakers speak the language L1. Consider now the generation of children in this community. These children attempt to learn the language of the adults. A typical child will receive examples drawn according to a probability distribution P1. Over its learning period, it will receive K examples and with probability pK(A, P1) the typical child will acquire the language L1. With probability 1 − pK(A, P1), however, the child might acquire the language L2. Therefore, when the generation of children mature into adulthood, the population of new adults will no longer be homogeneous. In fact, a proportion pK(A, P1) will be L1 users and a proportion 1 − pK(A, P1) will be L2 users. In this fashion, the linguistic compositions of two successive generations may be related to each other.

We need not have started with a homogeneous adult population. Imagine now that the state of the adult population is denoted by αa where αa is the proportion of L1 users in the adult population. Now consider the generation of children. They will receive example sentences from the entire adult population — which in this case is a mixed population. In particular, they will receive examples drawn according to the distribution

P = αa P1 + (1 − αa)P2

On receiving example sentences from this distribution, children proceed as before. A proportion pK(A, P) will acquire L1. Letting αc be the proportion of children who grow up to be L1 speakers, we see that

αc = pK(A, αa P1 + (1 − αa)P2)

In this manner, we see that αc can be expressed in terms of αa. In this example, the linguistic composition of the population can be characterized by a single variable αt. This denotes the proportion of the population that consists of L1 users in generation t. By considering the behavior of the typical child and then averaging over the entire population of children, we have related the linguistic composition of two successive generations as follows:

αt+1 = pK(A, αt P1 + (1 − αt)P2)        (5.1)

In order to do this, we assumed

1. The population could be isolated into coincident generations.

2. Children receive data drawn from the entire adult population in a manner that reflects the distribution of languages in the adult population.

3. The probability of drawing sentences (given by P1 and P2) does not change with time.

4. The learning algorithm A constructs a single hypothesis language (grammar) after each example sentence is received. After maturation it ends up with a single language (grammar).

5. Population sizes are infinite.

We will return to a discussion of these assumptions later. Let us now consider some examples where we make specific choices regarding the learning algorithm and derive the evolutionary consequences. In particular, the functional relationship between αt and αt+1 (Equation 5.1) will be explicitly derived for a number of different algorithms.
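Equation 5.1 can be turned directly into a simulation: fix a learning algorithm, estimate p_K under the mixed distribution αt P1 + (1 − αt)P2, and iterate over generations. The sketch below is illustrative only and is not part of the original text; the memoryless learner and the parameter values a, b, K are hypothetical choices that anticipate the next subsection.

```python
import random

def memoryless_learner(examples):
    # One concrete choice of A: flip the hypothesis whenever the current one
    # cannot parse the input (the TLA-style learner of Section 5.2.3).
    hyp = "L1" if random.random() < 0.5 else "L2"
    for s in examples:
        if hyp == "L1" and s == "L2only":
            hyp = "L2"
        elif hyp == "L2" and s == "L1only":
            hyp = "L1"
    return hyp

def sample_sentence(alpha, a, b):
    # Draw one sentence type from the mixture alpha*P1 + (1 - alpha)*P2,
    # recording only whether it is unambiguous for L1, ambiguous, or unambiguous for L2.
    r = random.random()
    if r < alpha * (1 - a):
        return "L1only"
    if r < alpha * (1 - a) + alpha * a + (1 - alpha) * b:
        return "ambiguous"
    return "L2only"

def next_alpha(learner, alpha, a, b, K, n_children=20000):
    # Monte Carlo estimate of p_K(A, alpha*P1 + (1 - alpha)*P2), i.e., Equation 5.1.
    wins = sum(
        learner(sample_sentence(alpha, a, b) for _ in range(K)) == "L1"
        for _ in range(n_children)
    )
    return wins / n_children

alpha = 0.8
for t in range(8):
    alpha = next_alpha(memoryless_learner, alpha, a=0.3, b=0.1, K=5)
    print("generation", t + 1, "proportion of L1 users ~", round(alpha, 3))
```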

5.2.3 Some Examples

A variety of dynamical maps are obtained by different choices for (i) the maturation time K and (ii) the learning algorithm A. We consider three different examples here.

A: Memoryless Learners

A memoryless learner — described in Chapter 4 — is one whose hypothesis at every stage depends only upon the current input sentence and the previous hypothesis it had. There is a wide class of such algorithms and one in particular has received considerable attention in the linguistic parameter-setting literature. This is the Triggering Learning Algorithm (TLA) of Gibson and Wexler 1994, which was extensively analyzed in Chapters 3 and 4. While the algorithm works for any finite parameter space in general, the particular instantiation for the two-language case is as follows.

Triggering Learning Algorithm (TLA)

• [Initialize] Step 1. Start with an initial hypothesis (either L1 or L2) chosen uniformly at random.
• [Process input sentence] Step 2. Receive a positive example sentence si at the ith time step.
• [Learnability on error detection] Step 3. If the current grammatical hypothesis parses (generates) si, then go to Step 2 to receive the next example sentence; otherwise, continue.
• [Single-step hill climbing] Step 4. Flip the current hypothesis and go to Step 2 to receive the next example sentence.
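As a purely illustrative sketch (not part of the original text), the four steps above can be transcribed almost literally into Python. The two toy "languages" below are hypothetical finite sets of surface strings introduced only for the example; any membership test could play the role of parses().

```python
import random

# Two hypothetical toy languages, given simply as finite sets of surface strings.
L1 = {"s v", "s v o", "o v s"}
L2 = {"s v o", "s o v", "o s v"}

def parses(language, sentence):
    return sentence in language

def tla(example_stream):
    # Step 1: start with an initial hypothesis chosen uniformly at random.
    hyp = random.choice([("L1", L1), ("L2", L2)])
    for s in example_stream:                 # Step 2: receive a positive example.
        if parses(hyp[1], s):                # Step 3: keep the hypothesis if it parses.
            continue
        hyp = ("L2", L2) if hyp[0] == "L1" else ("L1", L1)   # Step 4: flip otherwise.
    return hyp[0]

# A learner exposed to K = 20 sentences drawn uniformly from L1:
examples = (random.choice(sorted(L1)) for _ in range(20))
print(tla(examples))
```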

The population at any point in time can be characterized by a single variable (αt for the tth generation) that describes the proportion of L1 users in the population. Now one may ask the question: If children were TLA learners, then how would the population evolve? The precise nature of the evolution will depend not only upon the algorithm A (in this case, the TLA) but also the probability distribution with which sentences are produced by L1 and L2 users respectively. For this case, it turns out that it is sufficient to characterize P1 and P2 by two parameters a and b, given as follows:

a = P1[L1 ∩ L2];  1 − a = P1[L1 \ L2]

and similarly

b = P2[L1 ∩ L2];  1 − b = P2[L2 \ L1]

Here L1 ∩ L2 refers to the set of ambiguous sentences — those that can be parsed (generated) by the underlying grammars of both languages. Thus a is the probability with which such ambiguous sentences are produced by L1 users and b is the same for L2 users. If we now assume that K = 2, i.e., the maturation time is short, it is fairly easy to show that the evolution occurs according to the following update rule:

Theorem 5.1 The linguistic composition in the (t + 1)th generation (α_{t+1}) is related to the linguistic composition of the tth generation (α_t) in the following way:

α_{t+1} = Aα_t² + Bα_t + C

where A = (1/2)((1 − b)² − (1 − a)²), B = b(1 − b) + (1 − a), and C = b²/2.

Proof: Let the adult population have αt proportion of L1 users. We need to compute the probability with which the learner acquires L 1 after 2 examples. First note that the probability with which a random example belongs to (a) L1 \ L2 , (b) L1 ∩ L2 , (c) L2 \ L1 is given by (i) αt (1 − a), (ii) αt a + (1 − αt )b, and (iii) (1 − αt )(1 − b) respectively. Now with probability 1 2 , the learner chooses L1 as its initial hypothesis. There are two different

ways in which it could retain its hypothesis after two examples. These are (i) its hypothesis is L1 after both the first and the second example; (ii) its hypothesis is L2 after the first and L1 after the second. Case (i) will occur if both examples lie in L1. This happens with probability (αt + (1 − αt)b)². Case (ii) will occur if the first example is in L2 \ L1 and the second example is in L1 \ L2. This happens with probability ((1 − αt)(1 − b))(αt(1 − a)). Similarly, with probability 1/2, the learner begins with hypothesis L2. There are again two different ways in which it could have a hypothesis of L1 after two examples. These are given by (i) it flips its hypothesis to L1 after the first example and retains L1 after the second example; (ii) it retains L2 after the first example but switches to L1 after the second. The probability with which each of these cases occurs can be easily calculated in a similar manner and putting it all together after some algebra, the update rule is obtained.

A few remarks concerning this dynamical system are in order:

Remark 1. When a = b, the system has exponential growth. When a ≠ b the dynamical system is a quadratic map (which can be reduced by a transformation of variables to the logistic, and shares the same dynamical properties). We note that Cavalli-Sforza and Feldman (1981), using a different formulation, also obtain a quadratic map in such cases for the example of general 'vertical' cultural change. This is discussed at length in Chapter 9.

Remark 2. The scenario a ≠ b is much more likely to occur in practice — consequently, we are more likely to see logistic change rather than exponential change. Note that logistic change will give rise to the S-shaped pattern that historical linguists have often observed in field studies of language change. See Figure 5.2 for a graphic display.

Remark 3. Logistic maps are known to be chaotic. However, in our system it is easy to show that:

Theorem 5.2 Due to the fact that a, b ≤ 1, the dynamical system never enters the chaotic regime.

Remark 4. We obtain a class of dynamical systems. The quadratic nature of our map comes from the fact that K = 2. If we choose other values for K we would get cubic and higher-order maps. In general, it is possible to show that for a fixed, finite K, the evolutionary dynamics is given by

Theorem 5.3 If individual learners in a population of TLA learners have a maturation time K, the population evolves according to

α_{t+1} = [B + (1/2)(A − B)(1 − A − B)^K] / (A + B)

[Figure 5.2 appears here: percentage of L1 speakers (0.0–1.0) plotted against generations (0–100).]

Figure 5.2: Evolution of linguistic populations whose speakers differ only in the V 2 parameter setting. This reduces to a two-language model differing by one linguistic parameter. Note the exponential growth when a = b. The different exponential curves are obtained by varying the value a = b. When a is not equal to b, the system has a qualitatively different (logistic) growth. By varying the values of a and b we get the different logistic curves. It has been the empirical observation of many that language change undergoes an S-shaped trajectory. Here we show how such a trajectory may emerge as a result of the underlying dynamics of individual learners.
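Trajectories of the kind shown in Figure 5.2 are easy to reproduce by iterating the K = 2 update rule of Theorem 5.1 directly. The sketch below is illustrative only (not part of the original text); the values of a and b are hypothetical, chosen with a < b so that the proportion of L1 users rises toward a stable mix close to 1.

```python
def tla_k2_step(alpha, a, b):
    # Theorem 5.1: alpha_{t+1} = A*alpha_t**2 + B*alpha_t + C for K = 2.
    A = 0.5 * ((1 - b) ** 2 - (1 - a) ** 2)
    B = b * (1 - b) + (1 - a)
    C = b ** 2 / 2
    return A * alpha ** 2 + B * alpha + C

alpha = 0.01                 # start with almost no L1 speakers
a, b = 0.05, 0.15            # hypothetical ambiguity rates, a < b
trajectory = [alpha]
for _ in range(100):
    alpha = tla_k2_step(alpha, a, b)
    trajectory.append(alpha)
print([round(x, 3) for x in trajectory[::10]])
```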


where α_t is the proportion of L1 users in generation t, A = (1 − α_t)(1 − b), and B = α_t(1 − a).

Proof: Recall that the TLA learner may be exactly characterized as a Markov chain — in this case with two states — one (state 1) corresponding to L1 and the other (state 2) to L2. Let this transition matrix be T. Then it is easy to see that

T_{12} = (1 − α_t)(1 − b) = A;   T_{21} = α_t(1 − a) = B

Since T is a stochastic matrix, T_{11} = 1 − T_{12} and T_{22} = 1 − T_{21}. Let us denote by T^{(n)} the matrix of transitions after n examples have been received. Thus T^{(n)} = T^n. Clearly the probability of acquiring L1 after n examples is given by (1/2)(T^{(n)}_{11} + T^{(n)}_{21}). Using the Chapman-Kolmogorov equation, we have

T^{(1)}_{11} = 1 − A

and

T^{(n)}_{11} = (1 − A) T^{(n−1)}_{11} + B T^{(n−1)}_{12} = B + (1 − A − B) T^{(n−1)}_{11}

By induction, we have

T^{(n)}_{11} = B / (A + B) + A(1 − A − B)^n / (A + B)

Similarly, it is possible to solve for T^{(n)}_{21} as

T^{(n)}_{21} = B / (A + B) − B(1 − A − B)^n / (A + B)

Putting these together, the final expression for the probability of acquiring L1 after K examples is found. Since population size is infinite, this is the proportion of L1 speakers in the next generation and the update rule is thus derived. The evolutionary modes in this setting may now be investigated. Although the map looks potentially very complex, the dynamics turns out to be surprisingly simple. For any fixed K, it is possible to show

1. The function f(α) = [B + (1/2)(A − B)(1 − A − B)^K] / (A + B) may be re-expressed by expanding out the term (1 − A − B)^K as Σ_{i=0}^{K} (K choose i) (−1)^i (A + B)^i. When this is done, we see that the expression reduces to

f(α) = 1/2 + (1/2) Σ_{i=1}^{K} (K choose i) (A − B) (−(A + B))^{i−1}

Thus, the function f is seen to be a polynomial (in α) of degree K.

2. There is only one stable fixed point in the interval [0, 1] to which the population converges from all initial conditions.

3. If a = b, the stable point is given by α = 1/2. From all initial conditions, populations move to this mix.

4. If a ≠ b, then the location of the stable fixed point changes. In particular, if a > b, then the stable fixed point is very close to α = 0 and the population will mostly speak L2 eventually. On the other hand if a < b, the stable fixed point is very close to α = 1, and the population will mostly speak L1 eventually.

To see (2), it is sufficient to show that the map α_{t+1} = f(α_t) is such that f is continuous, f(0) > 0, f(1) < 1, the equation x = f(x) has only one root in [0, 1], and |f'(x)| < 1 at this root. The proof of these last two facts is tedious and omitted for our discussion. To see (3), we can substitute a = b in the update rule. After some algebra, we see that the update rule reduces to

α_{t+1} = α_t(1 − b^K) + b^K/2

This is a linear recurrence relation and it is easily seen that α_t → 1/2 exponentially as t tends to infinity. To investigate (4) a little more closely, we show in Figure 5.3 the stable fixed point as a and b vary for K held fixed at K = 5. Notice that at a = b, the stable fixed point is 1/2. When a ≠ b, the stable fixed point changes very rapidly to a value close to 0 or 1 depending upon whether a > b or vice versa. Note, however, that this transition is rapid for small values of a, b, while it is more gradual for values of a, b close to 1. In fact, the gradualness of this transition increases as K becomes small. For large values of K, the transition becomes sharper and becomes abrupt for K = ∞, as outlined in the next remark.

[Figure 5.3 appears here: a surface plot of the stable fixed point as a function of a and b.]

Figure 5.3: The stable fixed point as a function of a and b. Note that if a ≠ b, the stable fixed point is close to 1 or 0 for most values of a, b. The transition of the fixed point from a value close to 1 to a value close to 0 is quite sharp for low values of a, b and gradual for values of a, b close to 1.
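The surface in Figure 5.3 can be explored numerically by iterating the finite-K map of Theorem 5.3 until it settles. The sketch below is illustrative only (not part of the original text) and assumes a, b < 1 so that A + B > 0 on [0, 1]; the particular (a, b) pairs are arbitrary probes of the surface.

```python
def tla_step(alpha, a, b, K):
    # Finite-K update rule of Theorem 5.3 (assumes a, b < 1).
    A = (1 - alpha) * (1 - b)
    B = alpha * (1 - a)
    return (B + 0.5 * (A - B) * (1 - A - B) ** K) / (A + B)

def stable_point(a, b, K=5, alpha0=0.5, iters=2000):
    # Iterate from alpha0; the map has a single stable fixed point in [0, 1].
    alpha = alpha0
    for _ in range(iters):
        alpha = tla_step(alpha, a, b, K)
    return alpha

for a, b in [(0.3, 0.3), (0.4, 0.2), (0.2, 0.4), (0.9, 0.8), (0.8, 0.9)]:
    print(a, b, round(stable_point(a, b), 3))
```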

Remark 5. If we let the number of examples K become larger and larger (eventually tending to ∞) the limiting map is obtained simply as

α_{t+1} = f(α_t) = α_t(1 − a) / [α_t(1 − a) + (1 − α_t)(1 − b)]

It is easy to see that f'(α) is given by

f'(α) = (1 − b)(1 − a) / [(1 − b) + α(b − a)]²

For this map, a dynamical systems analysis yields the following results:

1. If a = b, then α_{t+1} = α_t — in other words, the initial population mix is preserved forever.

2. If a > b, there are exactly two fixed points. This is immediate from the solutions of α = f(α). α = 0 is a stable fixed point while α = 1 is an unstable point. This is easily seen by substituting α = 0 and α = 1 in the expression for f'(α).


3. If a < b, the two fixed points switch stability so that α = 0 is now unstable while α = 1 is the stable fixed point. Again, this is easily seen from the expression for f'(α).

Remark 6. The parameters a and b determine the evolution of the population and its stable modes. As I have mentioned before, they represent respectively the proportion of L1 and L2 sentences that are ambiguous. It is conceivable that one might be able to estimate these parameters from synchronic or diachronic corpora.

A: Batch Error-Based Learner

In contrast to the memoryless learner, a batch learner waits until the entire data set of K examples has been received. Then, it simply chooses the language that is more consistent with the data received. For each language Li, one can define an error measure (denoted by e(Li)) as

e(Li) = ki / K

where ki is the number of example sentences in the data set that are not analyzable according to the grammar of Li. Then a simple decision rule is

L̂ = arg min_{Li ∈ {L1, L2}} e(Li)

This amounts to the following rule: 1. Group the K example sentences of the data set (PLD) into three classes: (A) those sentences that belong to L 1 alone and are not analyzable by the underlying grammar of L 2 ; (B) those sentences that are analyzable by the underlying grammar of L 1 but are ambiguous in that they are also analyzable by the underlying grammar of L 2 , i.e., sentences belonging to L1 ∩ L2 ; (C) those sentences that belong to L 2 alone and are not analyzable by the underlying grammar of L 1 . Let n1 , n2 , n3 be the number of examples of type A,B,C respectively. 2. Clearly, n1 + n2 + n3 = K. Choose L1 if n1 > n3 , choose L2 if n3 > n1 . If n1 = n3 , one might choose either L1 or L2 according to a deterministic or randomized rule. For illustrative purposes, let us consider a version of this algorithm that chooses L1 if n1 ≥ n3 and L2 otherwise. For this learning algorithm, it is

possible to show that the proportion of L1 users in two successive generations (α_t and α_{t+1}, respectively) is related by the following update rule:

α_{t+1} = Σ_{(n1, n2, n3) : n1 ≥ n3, n1 + n2 + n3 = K} [K! / (n1! n2! n3!)] [p1(α_t)]^{n1} [p2(α_t)]^{n2} [p3(α_t)]^{n3}

where p1(α_t) = α_t(1 − a); p2(α_t) = α_t a + (1 − α_t)b; p3(α_t) = (1 − α_t)(1 − b). In general, the nature of the population dynamics is different in this case from the previous one. An analysis of this iterated map reveals the following:

1. If b = 1, then p3(α) = 0. Consequently, n3 will always be 0 and n1 ≥ n3 with probability 1. Therefore, we see that α_{t+1} = 1 and remains fixed at this value thereafter.

2. If b ≠ 1 and a = 1, we have p1(α) = 0. Therefore n1 = 0 and the probability that n1 ≥ n3 is the same as the probability that n3 = 0. Therefore the map reduces to

α_{t+1} = [1 − (1 − α_t)(1 − b)]^K

For this case, α = 0 is not a fixed point while α = 1 is a fixed point. The stability of α = 1 depends upon the value of K(1 − b). In fact, we see that α = 1 is stable if and only if b > 1 − 1/K. For smaller values of b, this fixed point is unstable and a new stable point arises in the open interval (0, 1).

3. For most other choices of a, b we see that α = 1 is stable. There are exactly two other fixed points α1 < α2 with αi ∈ (0, 1) where α1 is stable and α2 is unstable.

4. If we let K go gradually to infinity, we see that n1/K → p1 while n3/K → p3. Therefore α_t → 1 when p1 > p3, i.e., α(1 − a) > (1 − α)(1 − b). It is clear that α = 0 and α = 1 are both stable fixed points while α = (1 − b) / [(1 − b) + (1 − a)] is an unstable fixed point in between.

5. It is worthwhile to note that the version of the batch learner considered here is asymmetric in that if n1 = n3, the algorithm chooses L1. This asymmetry is less important in large-K situations where the probability of the event n1 = n3 is small. One may consider a symmetric

version of this algorithm where the learner chooses L1 with probability 1/2 when n1 = n3. The evolutionary dynamics corresponding to this algorithm is qualitatively similar to that corresponding to the current (asymmetric) one for a, b < 1.
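The batch learner's update rule can be evaluated exactly by summing over all triples (n1, n2, n3) with n1 ≥ n3. The sketch below is illustrative only (not part of the original text); the starting proportion and the values of a, b, and K are hypothetical.

```python
from math import comb

def batch_update(alpha, a, b, K):
    # Probability that an asymmetric batch learner (ties go to L1) with K examples
    # ends up choosing L1, i.e., the proportion of L1 users in the next generation.
    p1 = alpha * (1 - a)                  # sentence in L1 \ L2
    p2 = alpha * a + (1 - alpha) * b      # ambiguous sentence
    p3 = (1 - alpha) * (1 - b)            # sentence in L2 \ L1
    total = 0.0
    for n1 in range(K + 1):
        for n3 in range(K - n1 + 1):
            if n1 >= n3:
                n2 = K - n1 - n3
                weight = comb(K, n1) * comb(K - n1, n3)   # multinomial coefficient
                total += weight * p1 ** n1 * p2 ** n2 * p3 ** n3
    return total

alpha = 0.5
for t in range(20):
    alpha = batch_update(alpha, a=0.3, b=0.4, K=10)
print(round(alpha, 3))
```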

A: Cue-Based Learner

A cue-based learner examines the data set for cues to a linguistic parameter setting. Let a set C ⊆ (L1 \ L2) be a set of examples that are cues to the learner that L1 is the target language. If such cues occur often enough in the learner's data set, the learner will choose L1, otherwise the learner chooses L2. This follows the cue-driven approach advocated in Lightfoot 1998 and is often associated with theories of markedness of linguistic parameters. This approach is instantiated in the following procedure. Let the learner receive K examples. Out of the K examples, say k are from the cue set. Then, if k/K > τ the learner chooses L1, otherwise the learner chooses L2. One can again determine the evolutionary dynamics of the population based on such a learner. Let P1(C) = p, i.e., p is the probability with which an L1 user produces a cue. If a proportion α_t of adults use L1, then we see that the probability with which a cue is presented to a typical child is given by pα_t and so the probability with which k > Kτ is given by

Σ_{Kτ ≤ i ≤ K} (K choose i) (pα_t)^i (1 − pα_t)^{K−i}

and therefore, we get

α_{t+1} = Σ_{Kτ ≤ i ≤ K} (K choose i) (pα_t)^i (1 − pα_t)^{K−i}

Here, αt is the proportion of L1 users in the tth generation. The evolutionary modes of the population clearly depend upon the value of the parameter p. An analysis reveals the following: 1. For p = 0, cues are never produced so the only stable point is α = 0, to which the population converges in exactly one step. 2. For small values of p, α = 0 is the only fixed point of the population and it is stable.

[Figure 5.4 appears here: fixed points (vertical axis) plotted against the parameter p (horizontal axis).]

Figure 5.4: The bifurcation diagram for variation of the parameter p. For this example, K = 50 and τ = 0.6 Notice how for small values of p there is only one stable point at α = 0. As p increases a new pair of fixed points arises — one of which is unstable (dotted) and the other stable (solid). For any value of p on the x-axis, the y-axis denotes the values of the fixed points.

3. As p increases a bifurcation occurs when two new fixed points arise. There are three fixed points in all — α = 0, which remains stable; α = α1, which is unstable; and α = α2 > α1, which is stable.

4. For p = 1, there are two stable fixed points (α = 0 and α = 1) and there is exactly one unstable fixed point in between.

Shown in Figure 5.4 is the bifurcation diagram as p is varied for the case where K = 50 and τ = 0.6. Notice that for small values of p, there is only one stable point at α = 0. As p increases, a new pair of fixed points arises — one of which is stable and the other unstable. To see the qualitative nature of the dynamics, we need to investigate the fixed points of the functional map α_{t+1} = f(α_t) = g(pα_t) where

g(y) = Σ_{i=Kτ}^{K} (K choose i) y^i (1 − y)^{K−i}

By inspection of f (α), it is easy to see that f (0) = 0 for all values of p.


To show stability, it is sufficient to show that |f'(0)| < 1. Note that

f'(α) = p g'(pα)

Differentiating g term by term, we get (let K_τ be the smallest integer bigger than Kτ)

g'(y) = Σ_{k=K_τ}^{K−1} (K choose k) [k y^{k−1} (1 − y)^{K−k} − (K − k) y^k (1 − y)^{K−k−1}] + K y^{K−1}

Expanding this out, we see

g'(y) = Σ_{k=K_τ}^{K−1} [K! / ((K − k)! k!)] k y^{k−1} (1 − y)^{K−k} − Σ_{k=K_τ}^{K−1} [K! / ((K − k)! k!)] (K − k) y^k (1 − y)^{K−k−1} + K y^{K−1}

Factoring K out of the expression, we have

g'(y) = K [ Σ_{k=K_τ}^{K−1} ((K − 1)! / ((K − k)! (k − 1)!)) y^{k−1} (1 − y)^{K−k} ] − K [ Σ_{k=K_τ}^{K−1} ((K − 1)! / (k! (K − k − 1)!)) y^k (1 − y)^{K−k−1} − y^{K−1} ]

After canceling terms, we see

g'(y) = K (K − 1 choose K_τ − 1) y^{K_τ−1} (1 − y)^{K−K_τ}        (5.2)

Clearly, f  (0) = pg  (0) = 0. Stability of the fixed point α = 0 is shown. We next need to show that there can be at most two fixed points in the interval (0, 1]. First note that f (1) = g(p) < 1 if p < 1. The fixed points are the points where f (x) crosses the graph of the function h(x) = x. Since (i) f (0) = 0, (ii) f  (0) = 0, and (iii) f (x) is continuous, we see immediately that the number of such crossings in (0, 1] will be even. Let there be 2m such crossings in all and let the crossings be indicated by α1 , α2 , . . . , α2m . Further the slope of f (x) will be alternately greater and smaller than h = 1 at each of these points. Thus we have

f  (α1 ) > 1, f  (α2 ) < 1, f  (α3 ) > 1, and so on. Let there be four or more fixed points. Then clearly, in the interval (α 1 , α3 ) the graph of f  goes from being > 1 to < 1 to > 1 again. Therefore, there must be a point where f  = 0. By the same logic, in the interval (α2 , α4 ) there must be another distinct point where f  = 0. Therefore there must be at least two distinct points where f  vanishes. However, it is possible to show that in the entire interval (0, 1), the derivative f  vanishes at most once. To see this, notice that f  (y) = p2 g (py) Now g  (y) = Ay i (1 − y)j where A, i, j can be read off from Equation 5.2. Therefore g (y) = A[iy i−1 (1 − y)j − jy i (1 − y)j−1 ] = Ay i−1 (1 − y)j−1 [i − (i + j)y] It is clear that g  (y) has exactly one zerocrossing in (0, 1) and therefore g (py) will have at most one zerocrossing in (0, 1). This leads to a contradiction. A limiting analysis may be conducted by allowing the number of examples K → ∞. It is seen that k → pαt K Therefore, if pαt < τ , all learners choose L2 and the population evolves to a homogeneous community of L2 speakers in one generation. Conversely, if pαt > τ , all learners choose L1 and the population evolves to a homogeneous community of L1 speakers in one generation. From this, the following observations may be made 1. If p < τ , then pα < τ for all α ∈ [0, 1], and so the only stable fixed point is α = 0. 2. If p > τ , then there are two stable fixed points. From all initial conditions α0 ∈ [0, τ /p), the population converges to α = 0. From all initial conditions α0 ∈ (τ /p, 1], the population converges to α = 1. We see that the dynamics of the finite sample case (finite K) are qualitatively similar to the infinite sample case. In conclusion we see that a homogeneous population of L 2 users will always remain stable and can never change to a population of L 1 users. On the other hand, a homogeneous population of L 1 users will remain stable only in a certain regime of p values. As p changes the basin of attraction

shrinks and after a critical value of p the population switches to a stable mode of L2 speakers. Thus a change from L1 to L2 is possible — a change the other way is never possible. We will examine later the implications of this for various explanations of directional changes in linguistic grammars.
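The bifurcation behavior just described can be reproduced numerically: for each value of p, iterate the finite-K cue-based map from a low and a high starting proportion and record where each settles. The sketch below is illustrative only (not part of the original text); it implements the threshold as i ≥ ⌈Kτ⌉ and uses K = 50, τ = 0.6 as in Figure 5.4, with an arbitrary grid of p values.

```python
from math import comb, ceil

def cue_update(alpha, p, K, tau):
    # Probability that at least a fraction tau of the K examples are cues,
    # when a cue is presented with probability p * alpha.
    q = p * alpha
    k_min = ceil(K * tau)
    return sum(comb(K, i) * q ** i * (1 - q) ** (K - i) for i in range(k_min, K + 1))

def settle(alpha0, p, K=50, tau=0.6, iters=500):
    alpha = alpha0
    for _ in range(iters):
        alpha = cue_update(alpha, p, K, tau)
    return alpha

for p in [0.2, 0.5, 0.7, 0.9, 1.0]:
    lo, hi = settle(0.05, p), settle(0.95, p)
    print(p, round(lo, 3), round(hi, 3))
```

For small p both starting points collapse to α = 0; once p is large enough, a second stable point near α = 1 appears and the high starting point remains there, which is the bistability visible in Figure 5.4.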

5.3 Implications and Further Directions

The simple two-language models of the preceding sections are not without significant linguistic applications. In many cases of language change, one finds that there are two variants (dialects, grammars) differing by a significant linguistic parameter that coexist in a population in varying proportions at different points in time. Often, linguistic change leads to the gradual loss of one variant from the population entirely — in many cases following an S-shaped pattern over time. For example, the loss of verb-second (V2) from grammars of Old English to that of Modern English is a much-studied instance of precisely such a change (Lightfoot 1998; Kroch and Taylor 1997). Other examples include the loss of verb-second (V2) from Old to Modern French (Clark and Roberts 1993; Roberts 1992), the change in subordinate-clause word order in Yiddish considered in Santorini 1993, and so on. As an example of such a change consider the following case of Yiddish reported in Santorini 1993.

5.3.1 An Example from Yiddish

Yiddish underwent some significant linguistic changes from the fifteenth to the nineteenth century A.D. One particular change had to do with the location of the auxiliary verb with respect to the subject and the verb phrase in clauses. For convenience, I reproduce here the discussion about such a change from Chapter 1 of this book. Following Chomsky 1986, one might let the auxiliary verb belong to the functional category INFL (which bears inflectional markers) and thus distinguish between the two basic phrase-structure alternatives as in (5a) and (5b).

(5) a. [Spec [VP INFL] ]IP
    b. [Spec [INFL VP] ]IP

The inflectional phrase (IP) describes the whole clause (sentence) with an inflectional head (INFL), a verb-phrase argument (VP) for this INFL head, and a specifier (Spec). The item in specifier position is deemed the subject of the sentence. In Modern English, for example, phrases are almost always of type (5b). Thus sentence (6)


(6) [John [can [read the blackboard ]VP ]]IP

corresponds to such a type, with “John” being in Spec position, “can” being the INFL-head, and “read the blackboard” being the verb phrase. If we deem structures like (5a) to be INFL-final and structures like (5b) to be INFL-medial, we find that languages on the whole might be typified according to which of these phrase types is preponderant in the language.7 Interestingly, Yiddish changed from a predominantly INFL-final language to a categorically INFL-medial one over the course of a transition period from 1400 A.D. to about 1850 A.D. Santorini 1993 has a detailed quantitative analysis of this phenomenon, and shown below are two unambiguously INFL-final sentences of early Yiddish (taken from Santorini 1993). Such sentences would be deemed ungrammatical in the modern categorically INFL-medial Yiddish.

ds    zi    droyf     nurt  givarnt  vern
that  they  there-on  only  warned   were
(Bovo 39.6, 1507)

ven  der  vatr    doyts   leyan  kan
if   the  father  German  read   can
(Anshel 11, 1534)

To illustrate this point quantitatively, a corpus analysis of Yiddish documents over the ages yields the statistics shown in Table 5.1. Clauses with simple verbs are analyzed for INFL-medial and INFL-final distributions of phrase structures. More statistics are available in Santorini 1993, but this simple case illustrates the clear and unmistakable trend in the distribution of phrase types. It is worth mentioning here that while Santorini 1993 expresses the statistics within the notational conventions of Chomsky 1986, almost any reasonable grammatical formalism would capture this variation and change, with two different grammatical types or forms in competition with one gradually yielding to the other over generational time. 7

It is worthwhile to reiterate that while this typological distinction is largely accepted by linguists working in the tradition of Chomsky 1981, there is still considerable debate as to how cleanly languages fall into one of these two types. For example, while Travis (1984) argues that INFL precedes VP in German and Zwart (1991) extends the analysis to Dutch, Schwartz and Vikner (1990) provide considerable evidence arguing otherwise. Part of the complication often arises because the surface forms of sentences might reflect movement processes from some other underlying form in often complicated ways. But this is beyond the scope of this book.

Time Period    INFL-medial    INFL-final
1400–1489            0             27
1490–1539            5             37
1540–1589           13             59
1590–1639            5             81
1640–1689           13             33
1690–1739           15             20
1740–1789            1              1
1790–1839           54              3
1840–1950           90              0

Table 5.1: Number of sentences of INFL-medial and INFL-final type in corpora of Yiddish from 1400 to 1950.
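The trend in Table 5.1 is easier to see as the proportion of INFL-medial clauses in each period (keeping in mind that some periods contribute very few clauses). A minimal computation from the table's counts:

```python
# Proportion of INFL-medial clauses per period, computed from the counts in Table 5.1.
periods = ["1400-1489", "1490-1539", "1540-1589", "1590-1639", "1640-1689",
           "1690-1739", "1740-1789", "1790-1839", "1840-1950"]
medial = [0, 5, 13, 5, 13, 15, 1, 54, 90]     # INFL-medial counts
final_ = [27, 37, 59, 81, 33, 20, 1, 3, 0]    # INFL-final counts

for period, m, f in zip(periods, medial, final_):
    total = m + f
    print(f"{period}: {m:3d}/{total:3d} clauses INFL-medial ({m / total:.2f})")
```

The proportions rise from 0 to 1 over the period covered, with most of the movement concentrated in the later centuries, which is the S-shaped pattern discussed above.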

5.3.2 Discussion

We see that the case of Yiddish discussed earlier falls into a setting where there are two grammatical variants in competition with each other. The grammatical variants are characterized as INFL-medial and INFL-final respectively. In the beginning, all members of the population seemed to be using an INFL-final grammar, while by the twentieth century all members were using an INFL-medial grammar. Two aspects of this linguistic change are interesting for our purposes. The first is that over a period of time, the population moves from one seemingly stable state to another. The second is that both stable states seem to be homogeneous, i.e., while the population passed through mixed modes during intermediate years, the initial and final states of the population seem homogeneous and categorically INFL-final and INFL-medial respectively. As I have discussed earlier, the learning of language by children is the key mechanism by which language is transmitted over generational time. Any coherent discussion of why languages change must of necessity take this into account. In fact, one must assign to it a possibly central role in the discourse. While this position has been taken by many historical linguists, only a computational treatment allows this position to be elaborated precisely and exposes the subtleties involved in the analysis. If we assume that individuals used one or the other variant consistently,8

8. This claim is often contentious and sometimes clearly falsifiable empirically. Nonetheless it is a useful exercise to examine the consequences of such a claim because it clarifies the subtle interplay between learning and evolution under various assumptions. See later parts of this book for dropping the assumptions.


the dynamical evolution of populations may be characterized using the models described earlier. In this chapter, we considered three different learning models, and the evolutionary consequences of each were carefully studied. It is worthwhile to reflect on our general findings:

1. The evolutionary modes of a TLA-based learner were characterized by a and b. If a = b, no change was possible. However, if a ≠ b, only one stable mode of the population was reached. Whether L1 or L2 dominates the population depends upon whether a > b or not.

2. The evolutionary modes of a batch learner were characterized by a and b. Both L1 and L2 were stable modes of the population, with each having its own basin of attraction.

3. The evolutionary modes of a cue-based learner were characterized by p. For low values of p only L2 is stable. For higher values of p both L1 and L2 are stable modes, each with its own basin of attraction. A bifurcation between these two evolutionary scenarios occurs at a critical value of p.

How might we explain language change under each of these learning models? First, it is worthwhile to note that under all three models, the stable modes of a population are entirely homogeneous, or largely so, with one language greatly dominating the other. However, the details are different.

According to (1), the only way change is possible is if the relative values of a and b change. Thus a population may be stable with all members speaking L1 (corresponding to a < b). If, for some reason, in some generation a becomes greater than b, the stable mode will become unstable. A slight drift out of homogeneity will then move the population entirely to the other language. Both a change from L1 to L2 and a change from L2 to L1 are possible within this framework.

According to (2), the population will have to jump from one basin of attraction to the other, crossing a barrier, for the language to change. The only plausible way for this to happen is if language contact as a result of migration brings about a change in the linguistic composition, driving the population mix from one basin to the other. Unlike case (1), internally driven change is not possible by drifting of the parameters a and b from generation to generation.

According to (3), there is a critical parameter effect and an asymmetry in the evolutionary possibilities. A homogeneous L2-speaking population is


the only stable mode for small values of p, and therefore no change is possible in this regime. For large values of p both L1 and L2 are stable modes, and a change from one to the other requires the population to jump from one basin of attraction to the other. However, parameter drifting might drive a stable L1-speaking population to a stable L2-speaking one if the value of p changes from above the critical threshold to below it. While change in this direction may be internally driven, a change in the other direction is never possible by parameter drifting over generations. Only language contact and migration would explain a change in the other direction. This asymmetry is of particular significance in all models of learning and language where a distinction is made between marked and unmarked linguistic parameter settings.

Major Insights

There are thus two major insights that have emerged:

1. Different learning algorithms might have different evolutionary consequences. These evolutionary consequences may be tested using historical data. As a result, diachronic facts may be brought to bear as additional evidence in judging the adequacy of different theories of language acquisition.

2. The evolutionary dynamics are typically nonlinear, and the bifurcations (phase transitions) associated with them may provide a suitable theoretical construct to explain language change. This provides a novel solution to the actuation problem. These results raise the possibility that discontinuous change may result from a continuous drift in the frequencies with which cues, triggers, and so on, are produced.

It is worth noting that these insights emerge only as a result of computational analysis following some simplified model construction. The relationship between learning and evolution can be quite subtle, and it is hard to see how, for example, one might discover the possibility of bifurcations from verbal arguments alone. The case of Yiddish described here was provided only as an illustrative example of a situation where the language of the population seemed to be changing along one linguistic dimension. The different models of learning we have considered may be derived from psycholinguistic considerations of language development — we see clearly in these sections how they have different consequences for language change. In later portions of this book,


we will examine similar models in the context of other linguistic changes in languages like Portuguese, Chinese, French, and English.

5.3.3 Future Directions

The models in this chapter have already brought into sharp focus the interplay between learning and change in linguistic populations. At the same time, one needs to recognize the drastic simplifications that have been made in order to formulate this first coherent model. It is worthwhile to reflect on some of these simplifying assumptions and the possibility of relaxing them in more complex models of this process. 1. Multiple languages. Clearly, there are more than two languages in the world. The space G represents the hypothesis space that learning algorithms operate on and from which they draw grammatical hypotheses over their learning period. While this space G is in principle all of Universal Grammar, in linguistic applications of a more specific nature, one might consider a subset of G to be the more appropriate object to model and study. For example, in studies of syntax, its acquisition and change, it may be meaningful to ignore the phonological components of UG that might have no bearing on the phenomena at hand. Depending upon the submodule of syntax under study, other “irrelevant” modules might also be usefully ignored. Thus, the space G that is formulated in models of language change is really a very low-dimensional projection of the high dimensional space of UG. It is the linguist’s intuition and understanding of the phenomena that provides the appropriate lowdimensional projection. Indeed, linguists might differ on this matter, and the consequences then need to be worked out. Having thus argued that in most useful applications, the space G will be low dimensional, it might still consist of more than two grammars (languages) and it is important to extend such models to multilingual settings. The extension to n-language families has already been considered (Niyogi and Berwick 1997; Yang 2000). Setting up the models for such cases is easy enough — analytical solutions are harder to come by and one might need to resort to simulations. I consider the n-language case in the next chapter. 2. Finite populations. One reason we have been able to derive deterministic dynamical maps relating successive generations to each other is the assumption of infinite population size that allows us to take ensemble averages of individual behavior over the entire population. In


practice, of course, populations are always finite. If the population sizes are large, then the assumption of infinite sizes may not be too bad. If, on the other hand, population sizes are very small, then one might need to consider the implications of such small sizes more carefully. Let us consider briefly the effect of finite population sizes on the two-language models discussed in this chapter. Recall that each individual child attains L1 with probability pK(A, αa P1 + (1 − αa)P2). From this we concluded that a proportion pK of the children would end up as L1 users. This statement is exactly true if there were an infinite number of children. Imagine, instead, there were only N children in the population. Each child could end up either as an L1 speaker or as an L2 speaker. In fact, with probability (pK)^N, all children would acquire L1; with probability (1 − pK)^N all would acquire L2; and different intermediate mixes are possible with probabilities given by the binomial distribution. Thus all evolutionary trajectories are possible; the question is — which ones are likely or probable? The evolution is characterized now as a stochastic process rather than a deterministic dynamical system (a minimal simulation sketch along these lines is given at the end of this section). The consequences of this can be worked out. The details are beyond the scope of this chapter and are pursued in a later chapter (Chapter 10).

3. Generational structure. In attempting to derive the relationship between successive generations, we have assumed that generations move in clean time steps. In practice, of course, the generational structure is a little more complex than this. One might therefore need to divide time into finer intervals and consider the cohort of learning children at each such time interval. The primary linguistic data that this cohort receives is now drawn from a more diverse group of older people in the population. This group would consist of parents, grandparents, older cohorts, and so on. For example, in the two-language models described earlier, one might proceed as follows: Let the state of cohort t be described by a variable αt (as before, where αt denotes the proportion of the cohort using the language L1). Consider now the (t + 1)th cohort of learning children. Assume that they receive data drawn from the previous three cohorts (who may, for example, be characterized as the cohort of young adults, parents, and grandparents respectively) in equal proportions. Then the probability distribution with which data is presented to the (t + 1)th cohort of learners is given by

P = (1/3)(αt + αt−1 + αt−2) P1 + [1 − (1/3)(αt + αt−1 + αt−2)] P2

where we have assumed that all cohorts are equal in size and influence. Given this setup, it is easy to see that αt+1 is now going to be given by αt+1 = pK (A, P ) and in this manner, αt+1 will depend upon αt ,αt−1 , and αt−2 respectively. The resulting dynamics may be analyzed using the traditional tools. 4. Spatial population structure. We have assumed in the models that speakers of both language types are evenly distributed throughout the population. Further, the child learners all receive data from the entire adult population. In other words, all children receive data drawn from the same probability distribution and this distribution reflects the mix of L1 and L2 speakers in the adult population as a whole. Reality, as always, might be more complicated. Speakers of different linguistic types might reside in different “neighborhoods”. Children born in different neighborhoods might receive data drawn from different probability distributions that reflect the linguistic composition of their neighborhood. For example, one might imagine an L1 -speaking neighborhood and an L2 -speaking neighborhood whose population sizes are in the ratio αt : (1 − αt ). Children born in the L1 -speaking neighborhood might receive data drawn mostly according to P1 while those born in the L2 -speaking neighborhood might receive data mostly drawn according to P2 . The evolutionary consequences of such a spatial structure in the population need to be worked out and represents an important direction for future research. Chapters 9 and 10 discuss spatial models. 5. Multilingual acquisition. The learning algorithm A realizes a mapping from linguistic data sets to grammatical hypotheses. In particular, we have restricted the learner to having precisely one grammatical conjecture at each point in time. Furthermore, at the end of the learning period (i.e., after receiving K examples) it is assumed that the learner will end up with precisely one language. If the target distribution corresponds to a unique grammar, it is certainly reasonable to expect the learner to end up with exactly one


language. In case the target distribution is mixed, i.e., not consistent with a single unique target grammar, natural models of the learning process should allow the possibility of multilingual rather than monolingual acquisition. Thus, for example, in the two-language case of this chapter, one might allow the possibility that the learner acquires both languages (in some ratio, perhaps). For the case of English, for example, Kroch and Taylor (1997) argue that learners were effectively bilingual, having acquired both dialectal variations in different proportions. Yang (2002) considers such a learning algorithm and explores the evolutionary consequences. I explore the multilingual setting in Chapters 8 and 10.

6. Nonvertical and other modes of transmission. We have considered vertical modes (from parent to child or from one generation to the next) as the primary mode of transmission of language over time. It is often remarked that the interaction of a cohort of language users with each other in a social setting shapes the way language develops in children and therefore the way it evolves over time. The effects of children of the same generation on each other might be viewed as a nonvertical (horizontal) mode of transmission of language. It might therefore become necessary to consider such alternative modes of transmission for a more complete understanding of the complexities involved in such processes.

The effects of each of these assumptions can be systematically explored. Together they constitute important directions for future work in this nascent field of computational studies of language change. Some of these directions have already begun to be explored. Others await further explication. In the next few chapters, we will consider a variety of such generalizations and case studies. These will further elucidate the central theme of the current chapter — the nature of the relationship between language learning and language change.
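As a concrete illustration of point 2 above (finite populations), the deterministic update can be replaced by binomial sampling over N children. The sketch below is schematic: the function standing in for pK(A, αP1 + (1 − α)P2) is a hypothetical stand-in (here the cue-based map of Section 5.2 with arbitrary parameter values), and the population size and random seeds are likewise arbitrary.

```python
# Finite-population variant of the two-language model (see point 2 above).
# Each of N children independently becomes an L1 speaker with probability
# q = acquisition_prob(alpha_t); the next generation's L1 share is then a
# Binomial(N, q) draw divided by N, rather than exactly q.
import math
import random

def acquisition_prob(alpha, p=0.6, tau=0.3, K=200):
    """Hypothetical stand-in for p_K(A, alpha*P1 + (1-alpha)*P2): the cue-based
    map g(p*alpha) with arbitrary parameter values."""
    m = math.ceil(tau * K)
    y = p * alpha
    return sum(math.comb(K, k) * y**k * (1 - y)**(K - k) for k in range(m, K + 1))

def simulate(N=50, alpha0=1.0, generations=12, seed=0):
    rng = random.Random(seed)
    alpha, trajectory = alpha0, [alpha0]
    for _ in range(generations):
        q = acquisition_prob(alpha)
        l1_children = sum(rng.random() < q for _ in range(N))   # Binomial(N, q) draw
        alpha = l1_children / N
        trajectory.append(alpha)
    return trajectory

# Different seeds give different trajectories; as N grows the runs concentrate
# around the deterministic dynamics of the infinite-population model.
for seed in range(3):
    print([round(a, 2) for a in simulate(seed=seed)])
```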

Chapter 6

Language Change: Multiple Languages

6.1 Multiple Languages

The previous chapter examined some preliminary models that arise as a result of two languages in competition with each other. I now consider a more general case where n languages may be potentially present in the population at any point in time. I begin by following the usual logic in deriving dynamical models of language change from models of language acquisition. The basic framework is developed in the next few sections. As we will see, the n-language case gives rise to n − 1 dimensional discrete-time dynamical systems. While a complete analytic understanding is well beyond our current scope, we will conduct two simulation studies later in this chapter to get some insight into the evolutionary trajectories of such systems in a linguistic context. Both these simulation studies are developed in the context of a case of syntactic change observed in the evolution of French from the twelfth century to modern times. Over the course of these simulations, we will see how the dynamical systems framework might allow us to engage the issues involved in language change in a concrete, formal, and reasoned fashion.

6.1.1 The Language Acquisition Framework

To formalize the model, I must first state my assumptions about grammatical theories, learning algorithms, and sentence distributions: 1. Denote by G, a family of possible (target) grammars. Each grammar


g ∈ G defines a language L(g) ⊆ Σ* over some alphabet Σ in the usual way. I will particularly consider the case where G has n grammars denoted by g1 through gn, corresponding to n languages L1 through Ln, respectively.

2. Denote by P a probability distribution on Σ* according to which sentences are drawn and presented to the learner. Let speakers of Li produce sentences according to a distribution Pi. Thus Pi has support on Li ⊆ Σ*. Therefore, if the adult population is linguistically homogeneous (with grammar g1), then P = P1. If the adult population is composed of two groups of equal size, one of which speaks L1 and the other L2, then P = (1/2)P1 + (1/2)P2.

3. Denote by A the acquisition algorithm that children use to hypothesize a grammar on the basis of input data. Following the notation I developed previously, A is a computable mapping from the set of all possible finite data streams D to the set of possible grammatical hypotheses G.

Given this setting, I will adopt a notion of probabilistic convergence as a notion of learnability. Thus a grammar gi ∈ G is learnable if on presentation of example sentences in i.i.d. fashion according to Pi, the learner converges to the target with probability 1, i.e.,

lim_{n→∞} P[A(dn) = gi] = 1

where dn ∈ D is a random data set of n sentences drawn according to Pi.

6.1.2 From Language Learning to Population Dynamics

The framework for language learning focuses on the behavior of learners attempting to infer grammars on the basis of linguistic data. At any point in time, n, (i.e., after encountering n example sentences) the learner may have any of the grammatical hypotheses in G. In particular, let us denote by pn (h), the probability with which it has the hypothesis h ∈ G. Now consider what happens when there is a population of such learners. Since an arbitrary learner has a probability pn (h) of developing hypothesis h (for every h ∈ G), it follows that a fraction pn (h) of the population of learners internalize the grammar h after n examples. We therefore have a current state of the learning population after n examples. This state of the population might well be different from the state of the parental population.


Assume for a moment that after N examples, maturation occurs, i.e., the grammatical hypothesis after N examples crystallizes in the learner's mind and is retained and used for communication for the rest of its adult life. Then one would arrive at the state of the mature population for the next generation.1 This new generation now produces sentences for the following generation of learners according to the distribution of grammars in its population. In this manner the process repeats itself and the linguistic composition of the population evolves from generation to generation. I can now define a discrete-time dynamical system by providing its two necessary components:

A State Space: A set of system states, S. Here the state space is the space of possible linguistic compositions of the population. Each state is described by a distribution Ppop on G describing the language spoken by the population. At any given point in time, t, the system is in exactly one state s ∈ S.

An Update Rule: How the system states change from one time step to the next. Typically, this involves specifying a functional mapping, f, that maps st ∈ S to st+1.

As a linguistic example, consider the three-parameter syntactic space described in Gibson and Wexler 1994 and examined at length in previous chapters. This system defines eight possible “natural” grammars — thus G has eight elements. We can picture a distribution on this space as shown in Figure 6.1. In this particular case, the state space is

S = {P ∈ R^8 | Σ_{i=1}^{8} Pi = 1}

Here we interpret the state as the linguistic composition of the population. For example, a distribution that puts all its weight on grammar g1 and 0 everywhere else indicates a homogeneous population that speaks a language corresponding to grammar g1. Similarly, a distribution that puts a probability mass of 1/2 on g1 and 1/2 on g2 denotes a population (nonhomogeneous) with half its speakers speaking a language corresponding to g1 and half speaking a language corresponding to g2.

1. Maturation seems to be a reasonable hypothesis in this context. After all, it seems even more unreasonable to imagine that learners are forever wandering around in hypothesis space. There is evidence from developmental psychology to suggest a maturational account such that after a certain point children mature and retain their current grammatical hypotheses forever. There are, however, subtle effects on the adult's native language due to factors such as second-language learning or immersion in a foreign-language community as a result of migration. Such effects may be specifically modeled in a systematic manner, but I leave aside such considerations for now.


Figure 6.1: A simple illustration of the state space for the 3-parameter syntactic case. There are 8 grammars. A probability distribution on these 8 grammars, as shown above, can be interpreted as the linguistic composition of the population. Thus, a fraction P1 of the population has internalized grammar g1, and so on.

To see in detail how the update rule may be computed, consider the acquisition algorithm A. The state at time t (given by Ppop,t) determines the distribution of speakers of different languages in the parental population as a whole. Therefore, one can obtain the distribution with which sentences from Σ* will be presented to a typical learner. Recall that the ith linguistic group in the population, speaking language Li, produces sentences with distribution Pi on Σ*. Therefore for any ω ∈ Σ*, the probability with which ω will be presented to the learner is given by

P(ω) = Σ_i Pi(ω) Ppop,t(i)

This fixes the distribution P with which sentences are presented to the learner. Note that I have crucially assumed that there is perfect spatial mixing so that each individual child learner is exposed to the parental population in an unbiased way, i.e., they are exposed to all the different linguistic types in proportion to their numbers in the entire adult population. I will consider departures from this assumption in a later chapter. The logical problem of language acquisition also assumes some success criterion for attaining the mature target grammar. For our purposes, I take this as being one of two broad possibilities: either (1) the usual scenario of identification in the limit, which I will call the limiting sample case; or (2) identification in a fixed, finite time, which I will call the finite sample case.2

2. Of course, a variety of other success criteria, e.g., convergence within some epsilon, or polynomial in the size of the target grammar, are possible; each leads to a potentially different language-change model. I do not pursue these alternatives here.


Consider case (2) first. Here, one draws n example sentences according to distribution P, and the acquisition algorithm develops hypotheses (A(dn) ∈ G). One can, in principle, compute the probability with which the learner will posit hypothesis hi after n examples:

Finite Sample:  P[A(dn) = hi] = pn(hi)    (6.1)

Now turn to case (1), the limiting case. Here learnability requires that pn(gt) converge to 1 for the case where a unique target grammar gt exists. However, in general, there need not be a unique target grammar since the linguistic population can be nonhomogeneous. Even so, recall that since convergence is measurable, the following limiting behavior still exists:

Limiting Sample:  lim_{n→∞} P[A(dn) = hi] = p(hi)    (6.2)

Let us now turn from the individual child to the population as a whole. For each grammar hi ∈ G, the individual child learner adopts (internalizes) this grammar with probability pn(hi) in the finite sample case or with probability p(hi) in the limiting sample case. In a population of such individuals one would therefore expect a proportion pn(hi) or p(hi) respectively to have internalized grammar hi. In other words, the linguistic composition of the next generation is given by Ppop,t+1(hi) = pn(hi) for the finite sample case and by Ppop,t+1(hi) = p(hi) in the limiting sample case. In this fashion, the algorithm A induces the map

Ppop,t → Ppop,t+1

Remarks.

1. In deriving this update rule, I have assumed that population sizes are infinite, so that the proportion of Li speakers in the next generation is exactly equal to the probability with which the typical child acquires Li. For small population sizes, the deviation from this may be quite significant. This leads one to specify the linguistic evolution as a stochastic process rather than a deterministic dynamical system, and I will examine the consequences of this in Chapter 10.

2. For a Gold-learnable (by the algorithm A) family of languages and a limiting sample assumption, homogeneous populations are always stable. This is simply because each child, and therefore the entire population, always eventually converges to and thus attains the unique target grammar in each generation.


3. The finite sample case, however, is different from the limiting sample case. Suppose we have solved the maturation problem — that is, we know roughly the time, or number of examples N, the learner takes to develop its mature (adult) hypothesis. In that case p N (h) is the probability that a child internalizes the grammar h, and p N (h) is the percentage of speakers of Lh in the next generation. Note that under this finite sample analysis, even for a homogeneous population with all adults speaking a particular language (corresponding to grammar g, say), pN (g) will not be 1 — that is, there will be a small percentage of learners who have misconverged. This percentage could blow up over several generations, and we therefore have potentially unstable languages. 4. The formulation is very general. Any triple < A, G, {P } > yields a dynamical system.3 In short: < G, A, {Pi } >−→ ( dynamical system) Note that the formulation does not assume any particular linguistic theory — the elements of G could be specified in any suitable formalism from Optimality Theory to Government and Binding to Connectionism. Nor have I assumed any particular learning algorithm or distribution with which sentences are drawn. Of course, I have implicitly assumed a learning model, i.e., positive examples are drawn in i.i.d. fashion and presented to the learner. The dynamical systems formalization follows as a logical consequence of this learning framework. One can conceivably imagine other learning frameworks — these would potentially give rise to other kinds of dynamical systems — but I do not formalize them here. In previous chapters I examined the problem of learnability within parametric systems. In particular, I showed how the behavior of any memoryless learning algorithm can be modeled as a Markov chain. This analysis allows us to solve Equations 6.1 and 6.2 and thereby obtain the update equations for the associated dynamical system. Let me now show how to derive such models in detail. I first provide the particular < G, A, {P i } > triple, and then give the update rule. 3

Note that this probability could evolve with generations as well. That will complete all the logical possibilities. However, for simplicity, I assume that this does not happen.


A Learning System Triple

1. G: Assume there are n binary-valued linguistic parameters — this leads to a space G with 2^n different grammars.

2. A: Let us imagine that the child learner follows some memoryless (incremental) algorithm to set parameters. For the most part, I will assume that the algorithm is the Triggering Learning Algorithm (TLA) or one of the variants discussed in previous chapters.

3. {Pi}: Let speakers of the ith language, Li, in the population produce sentences according to the distribution Pi. For the most part, I will assume in my simulations that this distribution is uniform on degree-0 (unembedded) sentences.

The Update Rule

I can now compute the update rule associated with this triple. Suppose the state of the parental population is Ppop,n on G. Then I can obtain the distribution P on the sentences of Σ* according to which sentences will be presented to the learner. Once such a distribution is obtained, then given the Markov equivalence established earlier, I can compute the transition matrix T according to which the learner updates its hypotheses with each new sentence. From T I can finally compute the following quantities, one for the finite sample case and one for the limiting sample case:

P[Learner's hypothesis = hi ∈ G after m examples] = { (1/2^n) 1^T_{2^n} T^m }[i]    (6.3)

In the above equation, 1_{2^n} represents the 2^n-dimensional column vector with all its components taking the value 1, and 1^T_{2^n} is simply its transpose. The (i, j) element of the matrix T^m contains the probability with which the learner moves from an initial hypothesis of hi to a hypothesis of hj after exactly m examples. I have assumed that the learner chooses an initial hypothesis uniformly at random from the 2^n different hypotheses in G. (1/2^n) 1^T_{2^n} T^m is therefore a row vector, and { (1/2^n) 1^T_{2^n} T^m }[i] denotes its ith component. Similarly, making use of the limiting distributions of Markov chains (Resnick 1992) one can obtain the following (where ONE is a 2^n × 2^n matrix with all ones):


P[Learner's hypothesis = hi “in the limit”] = { (1/2^n) 1^T_{2^n} (I − T + ONE)^{-1} }[i]    (6.4)

These expressions allow me to compute the linguistic composition of the population from one generation to the next according to my analysis of the previous section. Remark. The limiting distribution case is more complex than the finite sample case and requires some careful explanation. There are two possibilities. If there is just a single target grammar, then by definition, the learners all identify the target correctly in the limit, and there is no further change in the linguistic composition from generation to generation. This case is essentially uninteresting. If there are two or more target grammars, then recalling our analysis of learnability from the previous chapters, there can be no absorbing states in the Markov chain corresponding to the parametric grammar family. In this situation, a single learner will oscillate between some set of states in the limit. In this sense, learners will not converge to any single, correct target grammar. However, there is a sense in which we can characterize limiting behavior for learners: although a given learner will visit each of these states infinitely often in the limit, it will visit some more often than others. The exact proportion of the time the learner will be in a particular state is given by Equation 6.4. Therefore, since we know the fraction of the time the learner spends in each grammatical state in the limit, we assume that this is the probability with which it internalizes the grammar corresponding to that state in the Markov chain. An alternative interpretation is worth emphasizing. The limiting distribution may be viewed as characterizing the probability with which the learner will use each of the grammars after seeing a large (infinite) number of examples. Hence, one might interpret the final state of the learner as having internalized multiple grammars. For more discussion of multilingual acquisition, see later chapters. One can summarize the basic computational framework for modeling language change as follows: 1. Let π1 be the initial population mix, i.e., the percentage of different language speakers in the community. Assuming that the i th group of speakers produces sentences with probability P i , we obtain the probability P with which sentences in Σ ∗ occur for the next generation of learners.


2. From P we obtain the transition matrix T for the Markov learning model and using Equations 6.3 and 6.4, we derive the distribution of the linguistic composition π2 for the next generation. 3. The second generation now has a population mix of π 2 . We repeat step 1 and obtain π3 . Continuing in this fashion, in general we can obtain πi+1 from πi . This completes the abstract formulation of the dynamical system model. While this characterizes formally the relationship between learning and change in linguistic populations, the following questions still remain: • Can we really compute all the relevant quantities to specify completely the parameters of the 2n dimensional dynamical system? • Can we evaluate the behavior (phase-space characteristics) of the resulting dynamical system? • Does the dynamical system model shed light on diachronic models and linguistic theories generally? In the remainder of this chapter I give some concrete answers to these questions within the Principles and Parameters framework of modern linguistic theory. I consider first the three-parameter syntactic subsystem for which I have already conducted a learnability analysis in previous chapters. I derive the evolutionary consequences of learning in this setting and explore the factors on which evolutionary trajectories depend. Having thus generated some insight into possible evolutionary behaviors, I then turn to the analysis of a five-parameter system used by Clark and Roberts (1993) in their discussion of the historically observed syntactic changes in French.
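The three steps just summarized can be sketched directly in code. Everything grammar-specific is abstracted away here: the per-language transition matrices below are hypothetical 3-grammar stand-ins for the 2^n-state chains of the Markov analysis, and mixing them linearly according to the population state is an assumption of the sketch (it holds when each transition probability is an expectation over the sentence distribution, as for the memoryless learners considered here).

```python
# Sketch of steps 1-3: population mix pi_t -> sentence distribution -> learner
# chain T -> next mix pi_{t+1}.  T_PER_LANGUAGE[i] is a hypothetical transition
# matrix for a learner whose input comes entirely from language i.
import numpy as np

T_PER_LANGUAGE = np.array([
    [[0.95, 0.03, 0.02], [0.30, 0.65, 0.05], [0.25, 0.05, 0.70]],   # input from L1
    [[0.70, 0.25, 0.05], [0.02, 0.96, 0.02], [0.05, 0.30, 0.65]],   # input from L2
    [[0.70, 0.05, 0.25], [0.05, 0.70, 0.25], [0.02, 0.03, 0.95]],   # input from L3
])

def next_generation(pi, n_examples=128):
    """One application of the update rule (finite sample case, Eq. 6.3)."""
    T = np.tensordot(pi, T_PER_LANGUAGE, axes=1)        # chain induced by the mixed input
    p0 = np.full(len(pi), 1.0 / len(pi))                 # uniform initial hypothesis
    return p0 @ np.linalg.matrix_power(T, n_examples)

def limiting_occupancy(T, horizon=20000):
    """Long-run fraction of time spent in each state from a uniform start
    (the quantity Equation 6.4 expresses in closed form)."""
    p = np.full(T.shape[0], 1.0 / T.shape[0])
    acc = np.zeros(T.shape[0])
    for _ in range(horizon):
        acc += p
        p = p @ T
    return acc / horizon

pi = np.array([1.0, 0.0, 0.0])          # homogeneous L1-speaking initial population
for _ in range(10):                      # iterate the dynamical system for 10 generations
    pi = next_generation(pi)
print("mix after 10 generations:", pi.round(3))
print("limiting occupancy of the final chain:",
      limiting_occupancy(np.tensordot(pi, T_PER_LANGUAGE, axes=1)).round(3))
```

The simulations reported in the rest of the chapter follow exactly this pattern, with T computed from the actual grammar family and sentence distributions rather than from made-up matrices.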

6.2 Example 1: A Three-Parameter System

Recall that every choice of < G, A, {Pi } > gives rise to a unique dynamical system. I begin by making specific choices for these three elements: 1. G : This is a three-parameter syntactic subsystem described in Gibson and Wexler 1994. Thus G has exactly eight grammars, generating languages from L1 through L8 described in previous chapters. 2. A : The memoryless algorithms we consider are the TLA and the variants obtained by dropping either or both of the single-valued and greediness constraints.


3. {Pi } : For the most part, we assume sentences are produced according to a uniform distribution on the degree-0 sentences of the relevant language, i.e., Pi is uniform on (degree-0 sentences of) L i . Ideally, a complete investigation of diachronic possibilities would involve varying G, A, and P and characterizing the resulting dynamical systems by their phase-space plots. Rather than exploring this entire space, I first consider only systems evolving from homogeneous initial populations, under four basic variants of the learning algorithm A while holding G and {P i } fixed. This will give us an initial grasp of how linguistic populations may change. Indeed, linguistic change has been studied before; even the dynamical system metaphor itself has been invoked. Our computational paradigm allows us to go further than these previous qualitative descriptions; for example, (1) we can precisely calculate what the rates of change will be; (2) we can determine what diachronic population-curve changes will look like, without stipulating in advance that they must be S-shaped (sigmoid) or not, and without curve fitting to a predefined functional form.

6.2.1 Homogeneous Initial Populations

First we consider the case of a homogeneous population — no noise or confounding factors like foreign target languages. How stable are the languages in the three-parameter system in this case? To determine this, we begin with a finite-sample analysis with n = 128 example sentences (recall by the analysis of previous chapters that learners converge to all target languages in the three-parameter system with high probability after hearing this many sentences). Some small proportion of the children misconverge; the goal is to see whether this small proportion can drive language change — and if so, in what direction. To give the reader some idea of the possible outcomes, let us consider the four possible variations in the learning algorithm (±Single-step, ±Greedy), holding fixed the sentence distributions and size of the learning sample. Variation 1: A = TLA (+Single Step, +Greedy); P i = Uniform; Finite Sample = 128 Suppose the learning algorithm is the Triggering Learning Algorithm (TLA). Table 6.1 shows the language mix after thirty generations. Languages are numbered from 1 to 8. Recall that +V2 refers to a language that has the verb-second property, and −V2 to one that does not.

Initial Language    Change to Language?
1 (−V2)             2 (0.85), 6 (0.1)
2 (+V2)             2 (0.98); stable
3 (−V2)             6 (0.48), 8 (0.38)
4 (+V2)             4 (0.86); stable
5 (−V2)             2 (0.97)
6 (+V2)             6 (0.92); stable
7 (−V2)             2 (0.54), 4 (0.35)
8 (+V2)             8 (0.97); stable

Table 6.1: Language change driven by misconvergence from a homogeneous initial linguistic population. A finite-sample analysis was conducted allowing each child learner 128 examples to internalize its grammar. After 30 generations, initial populations drifted (or not, as shown in the table) to different final linguistic compositions. For example, the third row shows that a homogeneous initial population consisting entirely of L 3 speakers evolves over many generations to one with mostly L 6 (48 percent) and L8 (38 percent) speakers and a smattering of other speakers. Observations. Some striking patterns regarding the resulting population mixes may be noted. I have collected together some observations from preliminary simulations: 1. All the +V2 languages are relatively stable, i.e., the linguistic composition of a +V2 community did not vary significantly over 30 generations. This means that every succeeding generation mostly acquired the target parameter settings — a smattering acquired alternative settings — but no significant parameter drifts were observed over time. 2. In contrast, populations speaking −V2 languages all drift to +V2 languages. Thus a population speaking L 1 winds up speaking mostly L2 (85% speaking L2 ). A population speaking language L 7 gradually shifts to a population with 54 percent speaking L 2 and 35 percent speaking L4 (with a smattering of other speakers) and apparently remains basically stable in this mix thereafter. Note that the relative stability of +V2 languages and the tendency of −V2 languages to drift to +V2 is exactly contrary to evidence in the linguistic literature. Lightfoot (1991), for example, claims that the tendency to lose


V2 dominates the reverse tendency in the world’s languages. Certainly, both English and French lost the V2 parameter setting — an empirically observed phenomenon that needs to be explained. Immediately, then, we see that our dynamical system does not evolve in the expected manner. The reason could be due to any of the assumptions behind the model: the parameter space, the learning algorithm, the initial conditions, or the distributional assumptions about sentences presented to learners. Exactly which is in error remains to be seen, but nonetheless our example shows concretely how assumptions about a grammatical theory and learning theory can make evolutionary, diachronic predictions — in this case, incorrect predictions that falsify the assumptions. 3. The rates at which the linguistic composition change vary significantly from language to language. Consider for example the change of L 1 to L2 . Figure 6.2 shows the gradual decrease in speakers of L 1 over successive generations along with the increase in L 2 speakers. We see that over the first six or seven generations very little change occurs, but over the next six or seven generations the population changes at a much faster rate. Note that in this particular case the two languages differ only in the V2 parameter, so the curves essentially plot the gain of V2. In contrast, consider Figure 6.3 which shows the decrease of L 5 speakers and the shift to L2 . Here we note a sudden change: over a space of just four generations, the population shifts completely. Analysis of the time course of language change has been given some attention in linguistic analyses of diachronic syntax change, and I return to this issue later. 4. We see that in many cases a homogeneous population splits up into different linguistic groups, and seems to remain stable in that mix. In other words, certain combinations of language speakers seem to asymptote toward equilibrium (at least through thirty generations). For example, a population of L7 speakers shifts over five or six generations to one with 54 percent speaking L 2 and 35 percent speaking L4 and remains that way with no shifts in the distribution of speakers. Of course, we do not know for certain whether this is really a stable mixture. It could be that the population mix could suddenly shift after another 100 generations. What we really need to do is characterize the stable points of these dynamical systems. Other linguistic mixes can be inherently unstable; they might drift systematically to stable


Figure 6.2: Percentage of a population (denoted as a proportion) speaking languages L1 (−V2)and L2 (+V2), denoted on the y-axis, as the population evolves over some number of generations, measured on the x-axis. The plot has been shown only up to twenty generations, because the proportions of L1 and L2 speakers do not vary significantly thereafter. Note that this curve is “S” shaped. Kroch (1989) imposes such a shape using models from population biology, while I derive this shape as an emergent property of the dynamical model. L1 and L2 differ only in the V2 parameter setting. The initial condition is a homogeneous L 1 speaking population.


Figure 6.3: Percentage of the population speaking languages L5 (SVO −V2) and L2 (VOS +V2) as the population evolves over a number of generations. Note that a complete shift from L5 to L2 occurs over just four generations.

situations, or might shift dramatically (as with language L1).

5. It seems that the observed instability and drifts are to a large extent an artifact of the learning algorithm. Remember that the TLA suffers from the problem of local maxima.4 We note that those languages whose acquisition is not impeded by local maxima (the +V2 languages) are stable over time. Languages that have local maxima are unstable; in particular they drift to the local maxima over time. Consider L 7 . If this is the target language, then there are two local maxima (L 2 and L4 ) and these are precisely the states to which the system drifts over time. The same is true for languages L 5 and L3 . In this respect, the behavior of L1 is quite unusual since it actually does not have any local maxima, yet it tends to flip the V2 parameter over time. Now let us consider a different learning algorithm from the TLA that does not suffer from local maxima problems, to see whether this changes the dynamical system results. Variation 2: A = +Greedy, −Single-Value; P i = Uniform; Finite Sample = 128 Consider a simple variant of the TLA obtained by dropping the single-valued constraint. This implies that the learner is no longer constrained to change just one parameter at a time: on being presented with a sentence it cannot analyze, it chooses any of the alternative grammars and attempts to analyze the sentence with it. Greediness is retained; thus the learner retains its original hypothesis if the new one is also not able to analyze the sentence. Given this new learning algorithm, and retaining all the other original assumptions, Table 6.2 shows the distribution of speakers after thirty generations. Observations. In this situation there are no local maxima, and the evolutionary pattern takes on a very different nature. Two distinct observations can be made: 1. All homogeneous populations eventually drift to a strikingly similar population mix, irrespective of what language they start from. What 4

4. I regard local maxima of a language Li as alternative absorbing states (sinks) in the Markov chain for that target language. This formulation differs slightly from the conception of local maxima in Gibson and Wexler 1994, a matter discussed at some length in Niyogi and Berwick 1993. Thus according to our definition L4 is not a local maximum for L5 and consequently no shift is observed.

Initial Language    Change to Language?
1 (−V2)             2 (0.41), 4 (0.19), 6 (0.18), 8 (0.13)
2 (+V2)             2 (0.42), 4 (0.19), 6 (0.17), 8 (0.12)
3 (−V2)             2 (0.40), 4 (0.19), 6 (0.18), 8 (0.13)
4 (+V2)             2 (0.41), 4 (0.19), 6 (0.18), 8 (0.13)
5 (−V2)             2 (0.40), 4 (0.19), 6 (0.18), 8 (0.13)
6 (+V2)             2 (0.40), 4 (0.19), 6 (0.18), 8 (0.13)
7 (−V2)             2 (0.40), 4 (0.19), 6 (0.18), 8 (0.13)
8 (+V2)             2 (0.40), 4 (0.19), 6 (0.18), 8 (0.13)

Table 6.2: Language change driven by misconvergence. A finite-sample analysis was conducted allowing each child learner (following the TLA with single-value constraint dropped) 128 examples to internalize its grammar. Initial populations were linguistically homogeneous, and they drifted to different linguistic compositions. The major language groups after thirty generations have been listed in this table. Note how all initially homogeneous populations tend to the same composition. is unique about this mix? Clearly it is a kind of stable fixed point. Is it the only one, however? What does it depend upon as a function of linguistic and other parameters? Further simulations and theoretical analyses are needed to resolve these questions; I leave them open at this point. 2. All homogeneous populations drift to a population mix of only +V2 languages. Thus, the V2 parameter is gradually set over succeeding generations by all people in the community (irrespective of which language they speak). In other words, as before, there is a tendency to gain rather than lose V2, contrary to the empirical facts. As an example, Figure 6.4 shows the changing percentage of the population speaking the different languages, starting off from a homogeneous population speaking L5 . As before, learners who have not converged to the target in 128 examples are the driving force for change here. Note again the time evolution of the grammars. For about five generations there is only a slight decrease in the percentage of speakers of L 5 . Then the linguistic patterns switch rapidly over the next seven generations to a relatively stable mix. Note the S-shaped nature of some (but not all) of the trajectories.


Figure 6.4: Time evolution of grammars using a greedy learning algorithm with no single-value constraint in place.


Initial Language              Change to Language?
Any language (homogeneous)    1 (0.11), 2 (0.16), 3 (0.10), 4 (0.14), 5 (0.12), 6 (0.14), 7 (0.10), 8 (0.13)

Table 6.3: Language change driven by misconvergence, using two different acquisition algorithms that do not obey a local gradient-ascent rule (a greediness constraint). A finite-sample analysis was conducted with the learning algorithm following a random-step algorithm or else a single-step algorithm, along with 128 examples to internalize its grammar. Initial populations were linguistically homogeneous, and they drifted to different linguistic compositions. The major language groups after 30 generations have been listed in this table. Note that all initially homogeneous populations converge to the same final composition. Variations 3 and 4: −Greedy, ±Single-Value constraint; P i = Uniform; Finite Sample = 128 Having dropped the single-value constraint, we consider the next obvious variation in the learning algorithm: dropping greediness while varying the single-value constraint. Again, our goal is to see whether this makes any difference in the resulting dynamical system. This gives rise to two different learning algorithms: (1) allow the learning algorithm to pick any new grammar at most one parameter value away from its current hypothesis (retaining the single-value constraint, but without greediness, i.e., the new grammar does not have to be able to parse the current input sentence); (2) allow the learning algorithm to pick any new grammar at each step (no matter how far away from its current hypothesis). In both cases, the population mix after thirty generations is the same irrespective of the initial language of the homogeneous population. These results are shown in Table 6.3. Observations.

All languages are represented in the final equilibrium mix.

1. Both algorithms yield dynamical systems that arrive at the same population mix after thirty generations. The path by which they arrive at this mix is, however, not the same (see Figure 6.5). 2. The final population mix contains all languages in significant proportion. This is in distinct contrast to the previous situations, where we saw that −V2 languages were eliminated over time.


6.2.2 Modeling Diachronic Trajectories

With some basic intuitions in hand as to how diachronic systems may evolve given different learning algorithms, I turn next to the question of population trajectories. One aspect of diachronic evolutionary trajectories that has attracted repeated attention from historical linguists is the so-called S-shape that they often take. For example, Bailey (1973) proposed a “wave” model of linguistic change: linguistic replacements follow an S-shaped curve over time. In Bailey’s own words (taken from Kroch 1989), A given change begins quite gradually; after reaching a certain point (say, twenty percent), it picks up momentum and proceeds at a much faster rate; and finally tails off slowly before reaching completion. The result is an S-curve: the statistical differences among isolects in the middle relative times of the change will be greater than the statistical differences among the early and late isolects. While we see already that some evolutionary trajectories in our computational model may have a “linguistically classical” S-shape, the smoothness of such S-shaped trajectories may vary considerably. Crucially, however, our current formalization allows us to be much more precise than this. Unlike the previous work in diachronic linguistics that we are familiar with, we can explore systematically the space of possible trajectories and thus examine the factors that affect their evolutionary time course, without assuming an a priori S-shape. The idea that linguistic changes follow (and ought to follow) an S-curve has also been proposed by Osgood and Sebeok (1954) and Weinreich, Labov, and Herzog (1968). More specific logistic forms have been advanced by Altmann et al. (1983) and Kroch (1989). Here, the idea of a logistic functional form is borrowed from population biology, where it is demonstrable that the logistic governs the replacement of organisms and of genetic alleles that differ in Darwinian fitness. However, Kroch (1989) concedes that “unlike in the population biology case, no mechanism of change has been proposed from which the logistic form can be deduced.” Crucially, in our case, I suggest a specific mechanism of change: an acquisition-based model where the combination of grammatical theory, learning algorithms, and distributional assumptions on sentences drive change. The specific form might or might not be S-shaped, and might have varying


Figure 6.5: Time evolution of linguistic composition for the situations where the learning algorithm is −Greedy, +Single Value (dotted line) and −Greedy, −Single Value (solid line). Only the percentages of people speaking L1 (−V2) and L2 (+V2) are shown. The initial population is homogeneous and speaks L1. The percentage of L1 speakers gradually decreases to about 11 percent. The percentage of L2 speakers rises from 0 percent to about 16 percent. The two dynamical systems converge to the same population mix; however, their trajectories are not exactly the same — the rates of change are slightly different, as shown in this plot.


rates of change.⁵ Furthermore, as we saw over the last two chapters, many different equations may all be consistent with an S-shaped trajectory, and these equations may be derived from different learning algorithms. Among the other factors that affect evolutionary trajectories are maturation time — the number of sentences available to the learner before it internalizes its adult grammar — and the probability distributions according to which such sentences are presented to the learner. I examine these in turn.

The Effect of Maturation Time or Sample Size

One obvious factor influencing the evolutionary trajectories is the maturation time, i.e., the number (N) of sentences the child is allowed to hear before forming its mature hypothesis. This was fixed at 128 in all the systems shown so far (based in part on our explicit computation for the Markov convergence time in this situation). Figure 6.6 shows the effect of varying N on the evolutionary trajectories. As usual, I plot only a subspace of the population. In particular, I plot the percentage of L2 speakers in the population with each succeeding generation. The initial composition of the population was taken to be homogeneous (with all adult language users speaking L1).

Observations. Here we see that the maturation time affects the evolutionary trajectories, which need not be S-shaped in general.

1. The initial rate of change of the population is highest when the maturation time is smallest, i.e., when the learner is allowed the least amount of time to develop its mature hypothesis. This is not surprising. If the learner were allowed access to a lot of examples to form its mature hypothesis, most learners would reach the target grammar. Very few would misconverge, and the linguistic composition would change little over the next generation. On the other hand, if the learner were allowed very few examples to develop its hypothesis, many would misconverge, possibly causing great change over one generation.

5 Of course, I do not mean to say that we can simulate any possible trajectory — that would make the formalism empty. Rather, I am exploring the initial space of possible trajectories, given some example initial conditions that have already been advanced in the literature. Because the mathematics for dynamical systems is in general quite complex, at present we cannot make general statements of the form, "under these particular initial conditions the trajectory will be sigmoidal, and under these other conditions it will not be." I have conducted only very preliminary investigations demonstrating that, potentially at least, reasonable, distinct initial conditions can lead to demonstrably different trajectories.


2. The "stable" linguistic compositions seem to depend upon maturation time. For example, if learners are allowed only 8 examples, the percentage of L2 speakers rises quickly to about 0.26. On the other hand, if learners are allowed 128 examples, the percentage of L2 speakers eventually rises to about 0.41.

3. Note that the trajectories do not have an S-shaped curve. The proportion of L2 speakers rises beyond its stable value before falling again and reaching the equilibrium position. Part of this is surely the consequence of plotting one variable of what is essentially a multivariable dynamical system. Two further remarks are worth making in this context. First, recall the many different examples of two-language models that we considered in the previous chapter. For some parameter values, the change observed was S-shaped; for others it was not. Therefore, the fact that linguistic trajectories in general need not be S-shaped ought not to come as a surprise in the n-dimensional setting. Second, when linguists study historical change they usually focus on one or two most salient (fastest changing) linguistic parameters. The S-shaped curves referred to in the historical linguistics literature are usually plots of one parameter in what is always a multidimensional evolutionary system. In this sense, our one-dimensional plots derived from multidimensional systems take on an added touch of realism.

4. The maturation time is related to the order of the dynamical system. In particular, following our discussion of learning models in previous chapters, it is easy to see that if N is the maturation time, then the dynamical systems correspond to degree-N polynomial maps (a small numerical sketch follows).
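The dependence on maturation time can be illustrated with a deliberately simplified two-grammar example; the matrices below are hypothetical placeholders, not the transition matrices of the three-parameter system. Because the mixed transition matrix is linear in the population proportion, raising it to the Nth power makes the one-generation update a degree-N polynomial map, and sweeping N shows how misconvergence out of a homogeneous population shrinks as maturation time grows.

```python
import numpy as np

# Hypothetical two-grammar illustration (not the book's three-parameter system):
# T1, T2 stand in for the learner's transition matrices when the target is
# grammar L1 (state 0) or grammar L2 (state 1).
T1 = np.array([[1.0, 0.0],
               [0.3, 0.7]])   # target L1: the L1 state (index 0) is absorbing
T2 = np.array([[0.8, 0.2],
               [0.0, 1.0]])   # target L2: the L2 state (index 1) is absorbing

def next_generation(alpha, N):
    """Fraction of next-generation L2 speakers when a fraction alpha of the
    parents speak L2 and learners mature after N examples.  The mixed matrix
    is linear in alpha, so the result is a degree-N polynomial in alpha."""
    T = (1 - alpha) * T1 + alpha * T2
    start = np.array([0.5, 0.5])   # uniform initial hypothesis (an assumption)
    return (start @ np.linalg.matrix_power(T, N))[1]

# Misconvergence out of a homogeneous L1 population, for several maturation
# times, mirroring the sweep in Figure 6.6.
for N in (8, 16, 32, 64, 128, 256):
    print(N, round(next_generation(0.0, N), 4))
```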

The Effect of Sentence Distributions (Pi)

Another important factor influencing evolutionary trajectories is the distribution Pi with which sentences of the ith language, Li, are presented to the learner. In a certain sense, the grammatical space and the learning algorithm jointly determine the order of the dynamical system. On the other hand, sentence distributions are much like the parameters of the dynamical system (see Section 6.2.3). Clearly the sentence distributions affect rates of convergence within one generation. Further, by putting greater weight on certain word forms rather than others, they might influence systemic evolution in certain directions. We have already encountered the possibly subtle effects that sentence distributions might have in the two-language models


Figure 6.6: Time evolution of linguistic composition when varying the maturation time (sample size). The learning algorithm used is +Greedy, −Single Value. Only the percentage of people speaking L2 (+V2) is shown. The initial population is homogeneous and speaks L1. The maturation time was varied through 8, 16, 32, 64, 128, and 256, giving rise to the six curves shown. The curve with the highest initial rate of change corresponds to a maturation time of 8 examples. The initial rate of change decreases as the maturation time N increases. The value at which these curves asymptote also seems to vary with the maturation time, increasing monotonically with it.


of Chapter 5. We saw that not only did the numerical value of the fixed points change with sentence distributions, but there could also be bifurcations leading to a qualitative change in the dynamics altogether.

To illustrate the idea in the multidimensional setting, consider an example focusing on the interaction between L1 and L2 speakers in the community as the sentence distributions with which these speakers produce sentences change. Recall that so far in this chapter, we have assumed that all speakers produce sentences with uniform distributions on degree-0 sentences of their respective languages. Now we consider alternative distributions parameterized by a value p (a small constructive sketch follows the discussion of Figure 6.7 below):

1. Let L1,2 = L1 ∩ L2.

2. P1: Speakers of L1 produce sentences so that all degree-0 sentences of L1,2 are equally likely and their total probability is p. Further, sentences of L1 \ L1,2 are also equally likely, but their total probability is 1 − p.

3. P2: Speakers of L2 produce sentences so that all degree-0 sentences of L1,2 are equally likely and their total probability is p. Further, sentences of L2 \ L1,2 are also equally likely, but their total probability is 1 − p.

4. The other Pi's are all uniform over degree-0 sentences.

The parameter p determines the weight on the sentence patterns common to the languages L1 and L2. Figure 6.7 shows the evolution of the L2 speakers as p varies. Here the learning algorithm is +Greedy, +Single Value (TLA, or local gradient-ascent) and the initial population is homogeneous: 100% L1, 0% L2. Note that the system moves in different ways as p varies. When p is very small (0.05), i.e., sentences common to L1 and L2 occur infrequently, we find that in the long run the percentage of L2 speakers does not increase; the population stays put with mostly L1 speakers. However, as p grows, more strings of L2 occur, and the dynamical system changes so that the long-term percentage of L1 speakers decreases and that of L2 speakers increases. When p reaches 0.75 the initial population evolves into a completely L2-speaking community. After this, as p increases further, we notice (see p = 0.95) that the L2 speakers increase but can never rise to 100 percent of the population; there is still a residual L1-speaking component. This is to be expected, because for such high values of p, many strings common to L1 and L2 occur frequently. This means that a learner could


sometimes converge to L1 just as well as to L2, and some learners indeed begin to do so, increasing the number of L1 speakers.

This example shows us that if we wanted a homogeneous L1-speaking population to move to a homogeneous L2-speaking population, we could, by choosing our distributions appropriately, drive the grammatical dynamical system in the appropriate direction. It suggests another important application of the dynamical system approach: one can work backward and examine the conditions needed to generate a change of a certain kind. By checking whether such conditions could have possibly existed historically, we can falsify a grammatical theory or a learning paradigm. Note that this example showed the effect of sentence distributions, and how to alter them to obtain desired evolutionary envelopes. One could, in principle, alter the grammatical theory or the learning algorithm in the same fashion — leading to a tool to aid the search for an adequate linguistic theory.⁶
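A small constructive sketch of the p-parameterized distributions defined above: given the degree-0 sentence sets of two languages (the sets used here are invented stand-ins; the real ones come from the three-parameter theory), the shared sentences jointly receive probability p and the remaining sentences jointly receive 1 − p, uniformly within each group.

```python
def p_weighted_distribution(own_sentences, other_sentences, p):
    """Distribution over a speaker's degree-0 sentences in which sentences
    shared with the other language jointly carry probability p and the
    remaining sentences jointly carry 1 - p, uniformly within each group
    (the P1/P2 construction described in the text)."""
    shared = own_sentences & other_sentences
    exclusive = own_sentences - shared
    dist = {}
    for s in shared:
        dist[s] = p / len(shared)
    for s in exclusive:
        dist[s] = (1 - p) / len(exclusive)
    return dist

# Hypothetical stand-ins for the degree-0 sentences of L1 and L2.
L1_sentences = {"S V", "S V O", "O V S"}
L2_sentences = {"S V", "S V O", "S Aux V"}

P1 = p_weighted_distribution(L1_sentences, L2_sentences, p=0.75)
P2 = p_weighted_distribution(L2_sentences, L1_sentences, p=0.75)
```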

6.2.3 Nonhomogeneous Populations: Phase-Space Plots

For our three-parameter system, we have been able to characterize the update rules for the dynamical systems corresponding to a variety of learning algorithms. Each dynamical system has a specific update procedure according to which the states evolve from some homogeneous initial population. A more complete characterization of the dynamical system would be achieved by obtaining phase-space plots of this system. Such phase-space plots are pictures of the state space S filled with trajectories obtained by letting the system evolve from various initial points (states) in the state space.

Phase-Space Plots: Grammatical Trajectories

Earlier, I described the relationship between the state of the population in one generation and the next. In our case, let Π denote an eight-dimensional vector (state variable). Specifically, Π = (π1, . . . , π8), with πi ≥ 0 and Σ_{i=1}^{8} πi = 1, i.e., Π ∈ Δ(7), where Δ(7) is the seven-dimensional simplex, as discussed earlier. The following schema reiterates the chain of dependencies involved in the update rule governing system evolution. The state of the population at time t (in generations) allows us to compute the transition matrix T for the Markov chain associated with the memoryless learner. Now, depending upon whether we want (1) an asymptotic analysis

6 Again, I stress that we obviously do not want so weak a theory that we can arrive at any possible initial conditions simply by carrying out reasonable changes to the sentence distributions. This may, of course, be possible; we have not yet examined the general case.


Figure 6.7: The evolution of L2 speakers in the community for various values of p (a parameter related to the sentence distributions Pi; see text). The algorithm used was the TLA; the initial population was homogeneous, speaking only L1. The curves for p = 0.05, 0.75, and 0.95 have been plotted as solid lines.


or (2) a finite-sample analysis, we compute (1) the limiting behavior of T^m as m (the number of examples) goes to infinity (for an asymptotic analysis), or (2) the value of T^N (where N is the number of examples after which maturation occurs). This allows us to compute the next state of the population. Thus Π(t + 1) = g(Π(t)), where g is a complex nonlinear relation.

Π(t) =⇒ P on Σ∗ =⇒ T =⇒ T^m =⇒ Π(t + 1)

If we choose a certain initial condition Π1, the system will evolve according to the above relation and one can obtain a trajectory of Π in the eight-dimensional space over time. Each initial condition yields a unique trajectory and one can then plot these trajectories, obtaining a phase-space plot. Each such trajectory corresponds to a curve in the seven-dimensional hyperplane (simplex) given by Σ_{i=1}^{8} πi = 1. One cannot directly display such a high-dimensional object, but in Figure 6.8, we plot the projection of a particular trajectory onto a two-dimensional subspace given by (π1(t), π2(t)) (the proportion of speakers of L1 and L2) at different points in time. As mentioned earlier, with a different initial condition we get a different grammatical trajectory. The complete state-space picture is thus filled with all the different trajectories corresponding to different initial conditions. They all seem to be converging to the same fixed point. Figure 6.9 shows this.

Stability Issues

The phase-space plots show that many initial conditions yield trajectories that seem to converge to a single point in the state space. In the dynamical system terminology, this corresponds to a stable fixed point of the system — a population mix that stays at the same composition. Many natural questions arise at this stage. What are the conditions for stability? How many fixed points are there in a given linguistic system? How can we solve for them? These are interesting questions, but detailed answers are not within the scope of the current chapter. I will provide partial answers in certain contexts over the course of the rest of this book. For the time being, in lieu of a more complete analysis, let us first consider at least the equations that allow one to characterize the stable population mixes.

First, some notational preliminaries. Consider a memoryless learning algorithm. As before, let Pi be the distribution on the sentences of the ith language Li. From Pi, we can construct Ti, the transition matrix whose elements are given by the explicit procedure documented in previous chapters. The matrix Ti characterizes the Markov development of a memoryless


Figure 6.8: Subspace of a phase-space plot. The plot shows (π1(t), π2(t)) as t varies, i.e., the proportion of speakers speaking languages L1 and L2 in the population. The initial state of the population was homogeneous (speaking language L1). The algorithm used was +Greedy, −Single Value.


Figure 6.9: Subspace of a phase-space plot. The plot shows (π1(t), π2(t)) as t varies for different nonhomogeneous initial population conditions. The algorithm used was +Greedy, −Single Value.


learner when the target language is Li (and sentences from the target are produced according to Pi). Note that fixing the Pi's fixes the Ti's, and thus the Pi's are a different sort of "parameter" that characterizes how the dynamical system evolves.⁷ Now let the state of the parent population at time t be Π(t). Therefore the probability distribution according to which the learner is exposed to example sentences is given by

P = Σ_{i=1}^{8} πi(t) Pi        (6.5)

Recall that the memoryless learner chooses hypotheses depending only upon its previously held grammatical hypothesis and the newly available example sentence. In this sense, one may usefully characterize the memoryless learner as a computable mapping A : H × Σ∗ → H, where A(h, s) determines what the algorithm will hypothesize next if it currently held the hypothesis h ∈ H and the sentence s ∈ Σ∗ was presented to it. The transition matrix T under the influence of examples presented according to P may then be easily derived. Recall that

Tij = P[hi → hj] = Σ_{s∈A} P(s)

where A = {s ∈ Σ∗ | A(hi, s) = hj} is the set of all sentences on which the learning algorithm would change its hypothesis⁸ from hi to hj. However, using Equation 6.5, we see that Tij is simply

Tij = Σ_{s∈A} Σ_{k=1}^{8} πk(t) Pk(s) = Σ_{k=1}^{8} πk(t) Σ_{s∈A} Pk(s) = Σ_{k=1}^{8} πk(t) Tk(i, j)

7 There are thus two distinct kinds of parameters in our model: first, parameters that define the 2^n languages and define the state space of the system; and second, the Pi's that characterize the way in which the system evolves and are therefore the parameters of the complete grammatical dynamical system.

8 Strictly speaking, the memoryless algorithms we have been considering in this example are all variants of the TLA. Such algorithms are not deterministic but randomized, for which the transition matrix formulas were derived in earlier chapters on learning. While the derivation presented here holds only for deterministic memoryless algorithms, an extension to the randomized case is easily achieved. I omit such a derivation for expository purposes, though the reader may check the validity of the extension.


where Tk(i, j) refers to the (i, j) element of the matrix Tk. The matrix Tk is the transition matrix of the Markov chain characterizing the behavior of the learner when the target language is Lk, and therefore example sentences are drawn according to Pk. Since T can be expressed in terms of the πi's and Ti's, we have the following statements:

Statement 6.1 (Finite Case) A fixed point of the grammatical dynamical system (derived from a ±Greedy, ±Single Value learner operating on the three-parameter (eight-grammar) space with k examples to choose its final hypothesis) is a solution of the following equation:

Π = (π1, . . . , π8) = (1/8, . . . , 1/8) (Σ_{i=1}^{8} πi Ti)^k

Note: This equation is obtained simply by setting Π(t + 1) = Π(t). Note, however, that this is an example of a nonlinear multidimensional iterated function map. The complete analysis of such dynamical systems is nontrivial and beyond the scope of the current chapter.

Similarly, for the limiting (asymptotic) case, the following holds:

Statement 6.2 (Limiting or Asymptotic Analysis) A fixed point of the grammatical dynamical system (derived from a ±Greedy, ±Single Value learner operating on the three-parameter (eight-grammar) space with infinitely many examples to choose its mature hypothesis) is a solution of the following equation:

Π = (π1, . . . , π8) = (1, . . . , 1) (I − Σ_{i=1}^{8} πi Ti + ONE)^{−1}

where ONE is the 8 × 8 matrix with all its entries equal to 1.

Note: Again this is trivially obtained by setting Π(t + 1) = Π(t). The expression on the right provides an analytic expression for the update equation in the asymptotic case. See Resnick 1992 for details. All the caveats mentioned in the finite-case statement apply here as well.

Remark. I have barely scratched the surface as far as the theoretical characterization of these grammatical dynamical systems is concerned. The main purpose of this chapter is to continue the argument that these dynamical systems exist as a logical consequence of assumptions about a grammatical space and an acquisition theory. I have demonstrated only some preliminary


simulations with these higher-dimensional systems. From a theoretical perspective, it would be much more valuable to have complete characterizations of such systems. Because the systems described above are multidimensional and nonlinear, the possibility exists for multiple stable and unstable equilibria, as well as more complicated behavior such as cycles, bifurcations, and ultimately chaos. Keep in mind, however, that some of these regimes may not be entered because the matrices T i are constrained to be stochastic matrices. This is analogous to the one-dimensional models we examined in the previous chapter, where the corresponding dynamical systems did not enter the chaotic regime because the parameters a, b were bounded.
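As a purely numerical illustration of the finite-sample update behind Statement 6.1, the sketch below iterates Π(t + 1) = (1/8, . . . , 1/8)(Σi πi Ti)^k from a homogeneous initial state until the state stops changing. The Ti used here are random stochastic matrices standing in for the transition matrices that, in the model, are derived from the learning algorithm and the sentence distributions Pi; only the shape of the computation, not the numbers, is meant to be informative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder transition matrices: in the model, T_i comes from the learning
# algorithm and the sentence distribution P_i of language L_i.  Here they are
# random stochastic matrices, purely for illustration.
n = 8
Ts = [rng.random((n, n)) for _ in range(n)]
Ts = [T / T.sum(axis=1, keepdims=True) for T in Ts]

def update(pi, k=128):
    """One generation: Pi(t+1) = (1/n, ..., 1/n) (sum_i pi_i T_i)^k."""
    T = sum(p * Ti for p, Ti in zip(pi, Ts))
    start = np.full(n, 1.0 / n)
    return start @ np.linalg.matrix_power(T, k)

# Iterate from a homogeneous initial population until (numerically) fixed.
pi = np.zeros(n)
pi[0] = 1.0
for _ in range(100):
    new_pi = update(pi)
    if np.allclose(new_pi, pi, atol=1e-10):
        break
    pi = new_pi
print(np.round(pi, 3))
```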

6.3 Example 2: Syntactic Change in French

So far, our examples have been based on a three-parameter linguistic theory for which we derived several different dynamical systems. Our goal was to concretely instantiate our philosophical arguments, sketching the factors that influence evolutionary trajectories. In this section, we briefly consider a different parametric linguistic system studied by Clark and Roberts (1993). The historical context in which Clark and Roberts advanced their linguistic proposal is the evolution of Modern French from Old French. Their parameters are intended to capture some, but of course not all, of this change. They too use a learning algorithm — in their case, a genetic algorithm — to account for historical change but do not analyze their model from the viewpoint of dynamical systems. Here I adopt their parameterization, with all its strengths and weaknesses, but consider an alternative learning paradigm and the dynamical systems approach. Extensive simulations in the earlier section reveal that while the learnability problem of the three-parameter space can be solved by stochastic hill-climbing algorithms, the long-term evolution of these algorithms has a behavior that is at variance with the diachronic change actually observed in historical linguistics. In particular, we saw how there was a tendency to gain rather than lose the V2 parameter setting. While this could well be an artifact of the class of learning algorithms considered, a more likely explanation is that loss of V2 (observed in many of the world’s languages like French, English, and so forth) is due to an interaction of parameters and triggers other than those considered in the previous section. I investigate this possibility and begin by first reviewing Clark and Roberts’s alternative parametric theory.


6.3.1 The Parametric Subspace and Data

I now consider a syntactic space requiring five (boolean-valued) parameters. I do not attempt to describe these parameters. The interested reader should consult Haegeman 1991 as well as Clark and Roberts 1993 for details.

1. p1: Case assignment under agreement (p1 = 1) or not (p1 = 0).

2. p2: Case assignment under government (p2 = 1) or not (p2 = 0). Relevant triggers for this parameter include "Adv V S" and "S V O".

3. p3: Nominative clitics.

4. p4: Null subject. Here relevant triggers would include "wh V S O".

5. p5: Verb-second (V2). Triggers include "Adv V S" and "S V O".

These five parameters define a space of thirty-two grammars. Each grammar in this parameterized system can be represented by a string of five bits depending upon the values of p1, . . . , p5. For instance, the first bit position corresponds to case assignment under agreement. We can now look at the surface strings (sentences) generated by each such grammar. For the purpose of explaining how Old French evolved into Modern French, Clark and Roberts consider the list of key sentences provided in Table 6.4. The parameter settings required to generate each sentence are provided in brackets; an asterisk is a "doesn't matter" value, and an "X" means any phrase. The parameter settings provided in brackets represent the grammars that generate the sentence. For example, the sentence form "adv V S" (corresponding to quickly ran John — an incorrect word order in English) is generated by all grammars that have case assignment under government (the second element of the array set to 1, p2 = 1) and verb-second movement (p5 = 1). The other parameters can be set to any value. Clearly there are eight different grammars that can generate (alternatively, parse) this sentence. Similarly, there are sixteen grammars that generate the form "S V O" (eight corresponding to parameter settings of [*1**1] and eight corresponding to parameter settings of [1***0]) and four grammars that generate "(S) V Y" (a small encoding sketch follows the table).

Remark. Note that the set of sentences Clark and Roberts consider is only a subset of the total number of degree-0 sentences generated by the thirty-two grammars in question. In order to directly compare their model with ours, we have not attempted to expand the data set or fill out the space any further. As a result, the grammars do not all have unique extensional properties, i.e., some pairs of grammars generate the same set of sentences and are weakly equivalent in this setting.

adv V S        [*1**1]
S V O          [*1**1] or [1***0]
wh V S O       [*1***]
wh V s O       [**1**]
X (pro) V O    [*1*11] or [1**10]
X V s          [**1*1]
X s V          [**1*0]
X S V          [1***0]
(S) V Y        [*1*11]

Table 6.4: A list of the key sentences that serve as triggers to drive learning in the five-parameter system of Clark and Roberts 1993. Parameter settings are indicated in brackets. An asterisk indicates that the relevant sentence is parsable under both settings of the relevant parameter. An X or Y denotes a placeholder that may be occupied by any phrase. wh refers to a question word like who, where, and so on. V refers to verb, S to subject, O to object, pro to pronoun, s to subject clitic, and adv to adverb.
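The bit-string representation and the bracketed trigger patterns of Table 6.4 lend themselves to a direct encoding. The sketch below is an illustration of mine, not part of Clark and Roberts's machinery: it enumerates the thirty-two grammars and counts how many generate a given sentence type. The counts reproduce the figures quoted above (eight for "adv V S", sixteen for "S V O", four for "(S) V Y").

```python
from itertools import product

# The thirty-two grammars of the five-parameter space, as 5-bit strings.
grammars = ["".join(bits) for bits in product("01", repeat=5)]

def matches(grammar, pattern):
    """A grammar generates a sentence type if it agrees with the pattern on
    every position where the pattern is not '*'."""
    return all(p in ("*", g) for g, p in zip(grammar, pattern))

def generators(patterns):
    """Grammars generating a sentence type whose trigger entry is a list of
    alternative patterns (e.g. S V O is [*1**1] or [1***0])."""
    return [g for g in grammars if any(matches(g, p) for p in patterns)]

print(len(generators(["*1**1"])))            # adv V S -> 8 grammars
print(len(generators(["*1**1", "1***0"])))   # S V O   -> 16 grammars
print(len(generators(["*1*11"])))            # (S) V Y -> 4 grammars
```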

6.3.2 The Case of Diachronic Syntactic Change in French

Let us continue with the analysis of Clark and Roberts within this parameter space. Historical data, primarily in the form of written texts, suggest that the language spoken in France underwent a parametric change from the twelfth century A.D. to modern times. In particular, it has been observed that both V2 and pro-drop are lost. Examples illustrating this change are provided below.

Loss of null subjects (pro-drop):

(Old French; +pro-drop)
Si firent grant joie la nuit.
thus (they) made great joy the night.

(Modern French; −pro-drop)
∗Ainsi s'amusaient bien cette nuit.
thus (they) had fun that night.

Old French
adv V S        [*1**1]
S V O          [*1**1] or [1***0]
wh V S O       [*1***]
X (pro) V O    [*1*11] or [1**10]

Table 6.5: Sentence types corresponding to the parameter settings of Old French. After Clark and Roberts 1993.

Loss of V2:

(Old French; +V2)
Lors oirent ils venir un escoiz de tonoire.
then heard they come a clap of thunder.

(Modern French; −V2)
∗Puis entendirent-ils un coup de tonnerre.
then heard-they a clap of thunder.

Clark and Roberts (1993) observe that it has been argued that this transition was brought about by the introduction of new word orders during the fifteenth and sixteenth centuries, resulting in generations of children acquiring slightly different grammars and eventually culminating in the grammar of Modern French. A categorization of the historical process into three basic periods (after Clark and Roberts 1993) is as follows.

Old French; Setting [11011]

The language spoken in the twelfth and thirteenth centuries had verb-second movement and null subjects, both of which were dropped by the twentieth century. Of particular interest in the analysis of Clark and Roberts are some of the sentence types that are generated by the parameter settings corresponding to Old French (Table 6.5). Note that from this data set it appears that the parameters corresponding to case agreement and nominative clitics remain ambiguous. In particular, Old French is in a subset-superset relation with another language (generated by the parameter settings 11111). In this case, possibly some kind of subset principle (Berwick 1985) could be used by the learner; otherwise it is not clear how the data would allow the learner to converge to the Old French grammar in the first place. None of the ±Greedy, ±Single Value algorithms would converge uniquely to the grammar of Old French.

Middle French
adv V S        [*1**1]
S V O          [*1**1] or [1***0]
wh V S O       [*1***]
wh V s O       [**1**]
X (pro) V O    [*1*11] or [1**10]
X V s          [**1*1]
X s V          [**1*0]
X S V          [1***0]
(s) V Y        [*1*11]

Table 6.6: A listing of sentence types in the Middle French period. The string (X)VS occurs with a frequency of 58% and SV(X) with 34% in Old French texts. It is argued that this frequency of (X)VS is high enough to cause the V2 parameter to be triggered to a setting of +V2.

Middle French

In Middle French, the data is not consistent with any of the thirty-two target grammars. This corresponds therefore to a heterogeneous population speaking a mixture of the different grammars, giving rise in turn to all the different data types. Analysis of texts from that period reveals that some old forms (like Adv V S) decreased in frequency and new forms (like Adv S V) increased. Clark and Roberts argue that such a frequency shift causes "erosion" of V2, bringing about parameter instability and ultimately convergence to the grammar of Modern French. In this transition period (i.e., when Middle French was spoken/written) the data is of the form shown in Table 6.6. Thus, we have old sentence patterns like Adv V S (though it decreases in frequency and becomes only 10%), S V O, X (pro) V O, and wh V S O. The new sentence patterns that emerge at this stage are Adv S V (which increases in frequency to 60%), X subjclitic V, X V subjclitic, (pro) V Y, and wh V subjclitic O.

Modern French; Setting [10100]

By the eighteenth century, French had lost both the V2 parameter setting and the null-subject parameter setting. The sentence patterns consistent


with Modern French parameter settings are SVO [*1**1] or [1***0], X S V [1***0], V s O [**1**]. Note that this data, though consistent with Modern French, will not trigger all the parameter settings. In this sense, Modern French (just like Old French) is not uniquely learnable from data. However, as before, we shall not concern ourselves overly with this, for the relevant parameters (V2 and null subject) are uniquely set by the data here.

6.3.3 Some Dynamical System Simulations

Using a TLA-like learning algorithm as a starting point, we can again obtain a dynamical system characterization for the linguistic population in this parametric space. For illustrative purposes, I show the results of two simulations conducted with such dynamical systems.

Homogeneous Populations [Initial State: Old French]

Let us conduct a simulation on this new parameter space using the Triggering Learning Algorithm. Recall that the relevant Markov chain in this case has thirty-two states. Our goal will be to see if misconvergence by learners can be a sufficient reason to drive a population from speakers of Old French to speakers of Modern French. Consequently, we start the simulation with a homogeneous population speaking Old French (parameter setting = 11011). As before, we can observe the linguistic composition of the population over several generations. It is observed that in one generation, 15 percent of the children converge to grammar 01011; 18 percent to grammar 01111; 33 percent to grammar 11011 (the target); and 26 percent to grammar 11111, with very few having converged to other grammars. Thereafter, the population consists mostly of speakers of these four languages, with one important difference: 15 percent of the speakers eventually lose V2. In particular, they have acquired the grammar 11110. Figure 6.10 shows the percentage of the population speaking each of the four languages mentioned above as the population evolves over twenty generations. Notice that in the space of a few generations, the speakers of 11011 and 01011 have dropped out altogether. Most of the population now speaks language 11111 (46 percent) or 01111 (27 percent). Fifteen percent of the population speaks 11110, and there is a smattering of other speakers. The population remains roughly stable in this configuration thereafter.

Observations. I collect my observations on these simulations. It is worth noting the weak equivalence and subset relations between the different


Figure 6.10: Evolution of speakers of different languages in a population starting off with speakers only of Old French. The four curves correspond to the grammars 11011, 11111, 01111, and 01011; the horizontal axis shows the number of generations and the vertical axis the percentage of speakers.


grammars to which the population converges.

1. On examining the four languages to which the system converges after one generation, we notice that they share the same settings for the principles [Case assignment under government], [pro-drop], and [V2]. These correspond to the three parameters that are uniquely set by data from Old French. The other two parameters can take on any value. Consequently four languages are generated, all of which satisfy the data from Old French.

2. Recall my earlier remark that due to insufficient data, there were equivalent grammars in the parameter system. It turns out that in this particular case, the grammars (01011) and (11011) are identical as far as their extensional properties are concerned, as are the grammars (11111) and (01111). Thus, extensionally, there are only two languages.

3. There is a subset relation between the two sets described in (2). The grammar (11011) is in a subset relation with (11111). This explains why after a few generations most of the population switches to either (11111) or (01111) (the superset grammars).

4. An interesting feature of the simulation is that 15 percent of the population eventually acquires the grammar (11110), i.e., they have lost the V2 parameter setting. This is the only sign of instability of V2 that is apparent from our simulations so far (for greedy algorithms, which are psychologically preferred). Recall that for such algorithms, the V2 parameter was very stable in the previously conducted simulations using the three-parameter system. This suggests that an explanation for loss of V2 may lie in the structure of UG and the interaction of the parameters that make up UG.

Heterogeneous Populations (Mixtures)

The previous section showed that with no new (foreign) sentence patterns, the grammatical system starting out with only Old French speakers showed some tendency to lose V2. However, the grammatical trajectory did not terminate in Modern French. In order to more closely duplicate this historically observed trajectory, I will examine alternative initial conditions. I will start the simulations with an initial condition that is a mixture of two sources: data from Old French and data from New French (reproducing, in


this sense, data similar to that obtained from the Middle French period). Thus children in the next generation observe new surface forms. Most of the surface forms observed in Middle French are covered by this mixture.

Observations. Although small amounts of data from a Modern French source cause disproportionate change in the V2 parameter setting, I am still not able to reproduce the historical trajectories from these assumptions.

1. On performing the simulations using the TLA as a learning algorithm on this parameter space, an interesting pattern is observed. Suppose the learner is exposed to sentences where 90 percent are generated by the grammar of Old French (11011) and 10 percent by the grammar of Modern French (10100). We find that within one generation, 22 percent of the learners have converged to the grammar (11110) and 78 percent to the grammar (11111). Thus the learners set each of the parameter values to 1 except the V2 parameter setting. Modern French is a non-V2 language, and 10 percent of data from Modern French is sufficient to cause 22 percent of the speakers to lose V2. This is the behavior over one generation. The new population (consisting of 78 percent speaking grammar (11111) and 22 percent speaking grammar (11110)) remains stable forever.

2. Figure 6.11 shows the proportion of speakers who have lost V2 after one generation, as a function of the proportion of sentences from the Modern French source. The shape of the curve is interesting. For small values of the proportion of the Modern French source, the slope of the curve is greater than 1. Thus there is a greater tendency of speakers to lose V2 than to retain it. Thus 10 percent of novel sentences from the Modern French source cause 20 percent of the population to lose V2; similarly, 20 percent of novel sentences from the Modern French source cause 40 percent of the speakers to lose V2. This effect wears off later. This seems to capture computationally the intuitive notion of many linguists that a small change in inputs provided to children could drive the system toward larger change.

3. Unfortunately, there are several shortcomings of this particular simulation. First, we notice that mixing Old and Modern French sources does not cause the desired (historically observed) grammatical trajectory from Old to Modern French (corresponding in our system to movement from state (11011) to state (10100) in the Markov chain). Although we find that a small injection of sentences from Modern


Figure 6.11: Tendency to lose V2 as a result of new word orders introduced by the Modern French source in the population dynamics resulting from a TLA-based Markov learner.


French causes a larger percentage of the population to lose V2 and gain subject clitics (which are historically observed phenomena), nevertheless the entire population retains the null-subject setting and case assignment under government. It should be mentioned that Clark and Roberts argue that the change in case assignment under government is the driving force that allows alternate parse trees to be formed and causes the parametric loss of V2 and null subject. In this sense, it is a more fundamental change.

4. If the dynamical system is allowed to evolve, it ends up in either of the two states (11111) or (11110). This is essentially due to the subset relations these states (languages) have with other languages in the system. Another complication in the system is the equivalence of several different grammars (with respect to their surface extensions); e.g., given the data we are considering, the grammars (01011) and (11011) (Old French) generate the same sentences. This leads to multiplicity of paths, convergence to more than one target grammar, and general inelegance of the state-space description.

General Insights and Future Directions

I have considered two different scenarios for the plausible change of a population of Old French speakers to one of Modern French speakers. One scenario embodies the hypothesis that the change is internally driven, with misconverging speakers changing the linguistic composition over generational time. My simulations with initial conditions corresponding to homogeneous populations were directed at exploring this hypothesis. While some signs of instability of the V2 parameter are observed, only about 15 percent of the population loses this parameter over time. A second scenario follows the hypothesis that the change is driven by language contact, with two different linguistic types in contact with each other and the population drifting gradually from one to the other. Again, we see that small injections of Modern French speakers in the population may have an effect that is disproportionate to their numbers in the first few generations. This effect, however, decreases over generational time, so that eventually only 22 percent have lost V2. This population mix remains stable thereafter.

Thus neither of the two simulations is able to replicate the change from Old to Modern French. These simulations are used as an aid in our reasoning, for the consequences of the two above-mentioned scenarios are difficult to deduce by verbal arguments alone. Furthermore, the computational


framework compels the historical linguist to articulate the explanations for change in a precise manner. The very fact that these two precisely articulated scenarios fail to reproduce the change demonstrates the power of this framework to falsify potential explanations.

We can now consider several possible reasons for the failure to explain the complete change from Old to Modern French. First, using more data and filling out the state space might yield greater insight. Second, TLA-like hill-climbing algorithms do not pay attention to the subset principle explicitly. Given the number of subset-superset relations among grammars in the five-parameter space, it would be interesting to explicitly program this into the learning algorithm and observe the evolution thereafter. Third, there are often cases where several different grammars generate the same sentences or at least fit the data equally well. Algorithms that operate only on surface strings are then unable to distinguish between these grammars. As a result, one finds convergence to all of the weakly equivalent grammars with different probabilities in our stochastic setting. We saw an example of this for convergence to the four states earlier. Clark and Roberts (1993) suggest an elegance criterion that looks at the parse trees to decide between these grammars. This difference between strong generative capacity and weak generative capacity can easily be incorporated into the model and its consequences examined more thoroughly. Transition probabilities would then depend not upon the surface properties of the grammars alone, but also upon the elegance of derivation for each surface string. Finally, rather than examining the evolution of the population, we could look at the evolution of the distribution of sentence types. We can also obtain bounds on the frequencies with which the new data in the Middle French period must occur so that the correct drift is observed.

I do not explore any of these directions here but list them as examples of meaningful questions that arise largely through the application of computational thinking to the historical problem at hand. As one explores these questions further, a more nuanced understanding of the reasons behind the change will doubtless emerge.

6.4 Conclusions

In this chapter, we have continued our exploration of the relationship between language learning and language change. The central argument of this book has been that any specification of (i) linguistic theory as articulated by models of generative grammar (broadly construed) and (ii) learning theory


as articulated by models of how those grammars are attained by learning children leads logically to models of grammatical evolution and diachronic change. These models of grammatical evolution therefore represent the evolutionary consequences of models of linguistic theory and language learning.

From a programmatic perspective, this argument has two important consequences. First, it allows us to take a formal, analytic view of historical linguistics. Most accounts of language change have tended to be descriptive in nature. In contrast, I place the study of historical or diachronic linguistics in a formal framework. In this sense, my conception of historical linguistics is closest in spirit to evolutionary theory and population biology. Second, this approach allows us to formally pose a diachronic criterion for the adequacy of grammatical theories. A significant body of work in learning theory has already sharpened the learnability criterion for grammatical theories — in other words, the class of grammars G must be learnable by some psychologically plausible algorithm from primary linguistic data. Now we can go one step further. The class of grammars G (along with a proposed learning algorithm A) can be reduced to a dynamical system whose evolution must be consistent with the true evolution of human languages (as reconstructed from historical data).

In Chapter 5, I considered one-parameter models of language change. In this chapter, I considered the general n-language setting, where n different linguistic types are potentially present in the population at all times. To concretely demonstrate that the grammatical dynamical systems need not be impossibly difficult to compute (or simulate), I explicitly showed how to transform parameterized theories and memoryless learning algorithms into dynamical systems. The specific simulations in this chapter have been conducted with the goal of generating some insight into syntactic change, and I considered in some detail the case of syntactic change in French from the twelfth to the twentieth century. While the simulations conducted here are too incomplete to have any long-term linguistic implications, I hope they will provide a starting point for research in this direction. Some interesting insights obtained along the way merit reiteration:

1. Some light was shed on the time course (the S-shape) of evolution. In particular, we saw how this was a derivative of more fundamental assumptions about initial population conditions, sentence distributions, and learning algorithms.

2. Notions of linguistic system stability were formally developed. Thus, certain linguistic parameters could change with time, while others might

remain stable. This can now be measured, and the conditions for stability or change can be investigated. In Chapter 5, we saw analytically how the stability of a system might change as a result of drift in frequencies. These correspond to bifurcations and provide an important theoretical construct for explaining language change and evolution.

3. It was demonstrated how one could manipulate the system (by changing the algorithm, or the sentence distributions, or the maturational time) to allow evolution in certain directions. These logical possibilities suggest the kinds of modifications needed in the linguistic theory for greater explanatory adequacy.

4. In the case study of French, we saw that the V2 parameter was more stable in the three-parameter case than it was in the five-parameter case. This suggests that the explanation for the loss of V2 (actually observed in history) may reside in the nature of the parameters available to the learning child, i.e., the structure of G (though I suggest great caution before drawing strong conclusions on the basis of this study).

Now that the basic framework has been developed in some generality, in the next few chapters I will apply this kind of thinking to particular cases of language change studied in historical linguistics. This exercise will allow us to make issues concrete by grounding our models in real cases. More generally, it will also allow us to better understand the role of such models in reasoning about evolutionary and historical phenomena in linguistic populations.

Chapter 7

An Application to Portuguese

In this chapter we consider a specific historical example of language change — the change in the grammatical system governing the placement of clitics in European Portuguese from the sixteenth to the eighteenth century A.D. The example illustrates the potential power of our computational approach in explicating the nature of the interaction of two important cognitive phenomena: language learning and language change. The first, language learning, occurs at the level of the individual — children acquire the language (grammar) of their caretakers, a cognitive ability that has been broadly investigated via a range of computational and experimental methodologies. The second, language change, occurs at the level of the population: it is individual language learners whose collective, ensemble properties constitute a distribution of linguistic knowledge, and this distribution of linguistic knowledge might change over generational time scales.

Against the backdrop of Portuguese change, we will develop models to explore the interplay between language learning and language change. Two major insights emerge: first, that different language learning algorithms may have different evolutionary consequences, and second, that the learning-derived dynamics is typically nonlinear, giving rise to bifurcations as the parameters change continuously. Thus it may be possible for subtle changes in parameters to lead to discontinuous changes in the stability of linguistic systems. These insights were already obtained in previous chapters. Seeing them again in this new and concrete linguistic context reinforces our belief in their significance. I argue that these insights generalize across the specifics of particular models and therefore provide some genuine understanding of


the forces that shape language change and evolution.

7.1 Portuguese: A Case Study

In what follows, I will present certain aspects of the historical evolution of European (as distinct from Brazilian or Goan, for example) Portuguese and demonstrate how computational models allow us to sharpen our questions and clarify our reasoning in constructing explanations for the phenomena at hand. In particular, we will see how different learning algorithms have different evolutionary consequences, some of which are incompatible with the historically observed trend. We will also see how the evolutionary consequences depend subtly upon parameter values, and that this dependence can only be worked out by mathematical analysis.

It should be noted that a far more detailed investigation of Portuguese is being carried out as part of an interdisciplinary project coordinated at the University of São Paulo, Brazil. A corpus of historical Portuguese texts (the Tycho Brahe corpus) has been collected, is being linguistically annotated, and statistical analyses of the major trends are being conducted. Central to the research effort associated with the Tycho Brahe corpus are mathematical models of learning and dynamical change that are similar in spirit to and motivated by the same philosophical concerns as those discussed in this book. The interested reader may consult http://www.ime.usp.br/∼tycho for further details. Also see Galves and Galves 1995, Fernandez and Galves 1999, Cassandro et al. 1999, and Britto et al. 1999 for research reports.

7.1.1 The Facts of Portuguese Language Change

We focus on a particular change involving an interaction of phonological and syntactic components of the grammar of Portuguese. The discussion that follows is adapted from Galves and Galves 1995 and elaborated in the publications associated with the Tycho Brahe project. Portuguese has always been an SVO language, exemplified by the typical word order in the following sentence:

Paulo ama Virginia.
Paulo loves Virginia.

Roughly, over a period of 200 years, starting in 1800 A.D., "classical" Portuguese (CP) underwent a change in clitic placement. Clitics are morphological items that attach to syntactic heads to form a lexical complex. In


the cases discussed below, the pronominal clitic a attaches to the verb ama to form a lexical complex. However, the clitic may attach before ("a-ama") or after ("ama-a") the verb in question and is referred to as a proclitic or an enclitic, respectively. From the sixteenth century or before until the beginning of the nineteenth century, both proclitics and enclitics were possible in root declarative sentences (with nonquantified subjects), as given by examples (1) and (2) below (from Galves and Galves 1995; also referred to as G&G in later portions of this chapter).

(1) Paulo a ama.
    Paulo her loves.

(2) Paulo ama-a.
    Paulo loves-her.

In sentences with a quantified subject (containing a wh-element, for example) proclisis was and continues to be the only option, as in (3) below.

(3) Quem a ama?
    Who her loves?

Galves and Galves 1995 summarize the relevant historical facts as follows: "During the nineteenth century a change affecting the syntax of clitic placement occurred in the language spoken in Portugal. . . . As a result, sentences like (1) became agrammatical and (2) remained as the only option for root affirmative sentences with non-quantified subjects. This change, however, did not concern sentences like (3) with quantified or Wh-subjects in which proclisis was, and continues to be, the only option."

Table 7.1 shows the percentage of enclitics in the writing of Portuguese authors in the classical period. In contexts where there is variation, proclisis is clearly dominant. In contrast, the writings of authors from 1799 onward show a clear dominance of enclisis, as summarized in Table 7.2. Modern European Portuguese has no proclisis in sentences with nonquantified subjects at all. I refer to the language of modern European Portuguese as European Portuguese (EP) in my discussion.

Galves and Galves (1995) offer an explanation of this change by proposing a link between phonology and syntax. Roughly speaking, they argue that phonological changes in Portuguese altered the stress contours associated with the sentences, and consequently the probabilities with which these different sentence types occurred. This difference in stress is what


Author             Proclisis   Enclisis   %Enclisis
Gusmao (1695)          27          0           0
Castro (1700)          15          1           7
Oliviera (1702)        39          7          16
Judeu (1705)           27          6          19
Verney (1713)          14         11          44
Marques (1728)         30         10          25
Marquesa (1750)        34         23          40

Table 7.1: Data extracted from the works of authors in the classical period, from the end of the seventeenth century to the middle of the eighteenth. In contexts where there is variation, proclisis is clearly dominant.

Author             Proclisis   Enclisis   %Enclisis
Garrett (1799)         11         37          77
Camilo (1825)           6         70          92
Dinis (1839)            3         24          88

Table 7.2: Data extracted from authors from the nineteenth century. Note the increasing percentage of enclisis in the same contexts. Modern European Portuguese uses exclusively enclitics.


learning hinges on, and so the historical change. Depending upon one's linguistic persuasion, one may argue with the details of such an explanation, but for present purposes I will accept it to illustrate how different learning algorithms might have different evolutionary consequences for historical prediction. I will therefore ignore for the moment the linguistic implications of the various algorithms and concentrate only on their computational properties. To each sentence I will assign (1) a morphological word sequence, (2) a stress contour, and (3) a syntactic structure. For example, again following G&G's analysis, sentence type (1) will remain only in CP, while the two sentences (2) and (3) above will have different stress patterns for CP and EP. I omit a detailed description of the stress¹ assignment and syntactic² properties, because they are not necessary for our analysis. All we need to know is that G&G assume that the stress contours corresponding to sentence types (1), (2), and (3), which I denote simply as c1, c2, c3, follow a Markov chain description and, more importantly, govern the probability with which sentences are produced.³ Thus, if two sentences have the same stress contour, then they will be produced with the same probability (given

1 In the treatment of G&G, given a bracketed clause, a stress mark S is assigned to each stressed word and a non-stress mark U is assigned to each unstressed word according to the framework of the metrical grid theory of Halle and Idsardi 1992. It follows that for Classical Portuguese, sentence type (1) has the stress assignment S U S, (2) has the stress assignment S S (where the morphological complex "ama-a" bears a single stress mark), and (3) has the assignment S U S. The probability of producing sentences is assumed to depend upon the stress assignment. Hence in Classical Portuguese, sentence types (1) and (3) are produced with equal probabilities. Similarly, the stress assignments according to the grammar of EP may be determined.

2 Following Salvi 1990, Madeira 1992, and Manzini 1992, the following points are made in the syntactic treatment of Portuguese over the ages: (1) Only one functional category contains the clitic and the verb in both proclitic and enclitic positions. Proclisis corresponds to a structure in which the clitic has adjoined to the verb in Infl. (2) In Classical Portuguese enclitic constructions the subject lies outside the border of the clause, contrary to what happens in proclitic constructions. (3) The landing site for the subject in European Portuguese enclitic constructions is Spec/CP. (4) The specifier position, which is the landing site of non-interrogative subjects in Classical Portuguese, is no longer available in European Portuguese. (5) Enclisis appears in a position entering into complementary distribution with wh and Focus. Galves and Galves (1995) claim that the change from Classical to European Portuguese results from a reinterpretation of the position of the subject in enclitic constructions.

3 I am of course aware that this assumption of G&G may also be questioned; one might substitute for it any other more plausible relation between stress and sentence types — if any; this assumption is simply designed as a bridge to get the child from a presumably observable surface fact to a sentence type.


In short, for the purposes of this chapter, it is sufficient to assume that there are simply two grammars (in accordance with the assumptions of G&G): GCP, denoting the grammar of CP (earlier), and GEP, denoting the grammar of EP. Furthermore, the only data that is relevant (ignoring other aspects of the grammar) is as follows:

CP: c1 produced with probability p; c2 produced with probability 1 − 2p; and c3 produced with probability p.

EP: c1 not produced; c2 produced with probability 1 − q; and c3 produced with probability q.

Any (historically changing) population will now by assumption contain a mix of speakers of CP and EP. The CP speakers produce the sentence types shown above with probabilities that are parameterized by p. The EP speakers produce the sentence types shown above with probabilities that are parameterized by q. Thus we have defined:

1. The class of grammars G = {GEP, GCP}. Note that the language of GEP is a proper subset of the language of GCP in our setting.

2. The probabilities with which speakers of GEP and GCP produce sentences (parameterized by p and q).

We can therefore derive the evolutionary consequences at the population level for a variety of learning algorithms. I now proceed to do so.

7.2 The Logical Basis of Language Change

In our model the logical basis of change is language learning: the possibility of mislearning the particular target grammar of one’s caretakers. As we have argued, if children always converged on the language of their parents, then their language would be the same as that of their parents, and this would be true from each generation to the next. Consequently, for languages to change from one generation to the next, children must attain a language different from that of their parents. In our setting, there are two different linguistic types — GCP and GEP — that are represented in the population. On the basis of example sentences from the previous generation, children acquire either of the two grammars. Some may acquire GCP and others may acquire GEP. The probability with which they do so will depend on (i) the distribution of the different
grammatical types in the adult population, (ii) the probability with which sentences are produced by each of the grammatical types in the population, and (iii) the learning algorithm children use to infer a grammar. A complete analysis of the behavior of the individual learner will allow us to analyze the behavior of the population. In order to do this, we make the following idealizations for population modeling: (1) nonoverlapping discrete generations: the population consists of parents and children, with parents being the source of linguistic data and children being the learners; (2) no neighborhood effects: the mix of linguistic types in the entire adult population determines the source of sentences, and this distributional source is identical for all children; (3) adults do not change their grammar/language over their lifetime, i.e., a monolingual maturation hypothesis; (4) children have a finite time to acquire the grammar, i.e., a learning maturation hypothesis. These assumptions may be systematically dropped and the consequences examined. We do this in a later chapter.
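To make this production model concrete, the following minimal Python sketch simulates the primary linguistic data a single child would receive under these idealizations. It is only an illustration: the function name and the particular values of α, p, q, and N are my own choices, not quantities estimated from the historical record.

import random

def draw_sentences(alpha, p, q, N, rng=random):
    """Draw N sentence types from a population in which a fraction alpha
    speaks CP and the remaining fraction 1 - alpha speaks EP."""
    # Mixture probabilities of the three surface types:
    # CP produces c1, c2, c3 with probabilities p, 1 - 2p, p;
    # EP produces them with probabilities 0, 1 - q, q.
    probs = {
        "c1": alpha * p,
        "c2": alpha * (1 - 2 * p) + (1 - alpha) * (1 - q),
        "c3": alpha * p + (1 - alpha) * q,
    }
    return rng.choices(list(probs), weights=list(probs.values()), k=N)

# Illustrative values only.
print(draw_sentences(alpha=0.7, p=0.05, q=0.06, N=10))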

7.2.1 Galves Batch Learning Algorithm

Galves and Galves (1995) describe a maximum likelihood approach to grammatical inference that may be summarized as follows:

1. Draw n examples (sentences).

2. Compute the likelihoods, i.e., P(Sn | GCP) and P(Sn | GEP).

3. Use the maximum likelihood principle to choose between the grammars.

In order to derive the evolutionary consequences of such a learning procedure, we first need to be able to analyze the behavior of the individual learner.

Analysis of Individual Learning Algorithm

For the analysis of the algorithm, we assume that sentences are drawn in i.i.d. fashion according to a probability distribution determined by the stress contours of the relevant sentences, as indicated in the previous section. First, consider the form of the likelihoods. Let the example sentences be Sn = {s1, s2, . . . , sn}. Due to the i.i.d. assumption, P(Sn | GCP) is given by ∏_{i=1}^{n} P(si | GCP). Suppose that the set of n examples consists of a draws of
c1, b draws of c3, and n − a − b draws of c2. Then the following is immediately clear:

P(Sn | GCP) = p^a (1 − 2p)^{n−a−b} p^b

and

P(Sn | GEP) = (0)^a (1 − q)^{n−a−b} q^b

Consequently, the individual child, following the likelihood principle, will choose the grammar EP (GEP) only if (1) no instances of c1 occur in its sample, and (2) the number of occurrences of c2 and c3 are such that q^b (1 − q)^{n−b} > (1 − 2p)^{n−b} p^b. There are three cases to consider.

Case 1. p < q < 2p.

Decision Rule: For this case, it is possible to show that the child (following the maximum likelihood rule) always chooses GEP if no instances of c1 occur. This is simply because 1 − q > 1 − 2p and q > p.

Population Update: Suppose that the proportion of speakers of GCP in the ith generation is αi. Then the probability of drawing c1 is given by αi p. Consequently, the probability of drawing a set of n examples without a single draw of c1 is (1 − αi p)^n. This is of course the probability with which the individual child chooses the grammar of European Portuguese, GEP. Thus the update rule has the following form:

αi+1 = 1 − (1 − αi p)^n

Case 2. q < p < 2p.

Decision Rule: In this case, the maximum likelihood decision rule reduces to the following. Choose GEP if and only if (1) a = 0, i.e., no instances of c1 occur; and (2) b < nγ, where

γ = log((1 − q)/(1 − 2p)) / [log((1 − q)/(1 − 2p)) + log(p/q)]

For all other data sets, choose GCP.

Population Update: As usual, let αi be the proportion of the previous generation speaking GCP. Using the fact that the numbers of each of the three sentence types have a multinomial distribution, it can be shown that events (1) and (2) above occur with a total probability equal to Σ_{k=0}^{nγ} \binom{n}{k} P^k Q^{n−k}, where P = αi p + (1 − αi) q and Q = αi (1 − 2p) + (1 − αi)(1 − q). Thus the update rule has the following form:

αi+1 = 1 − Σ_{k=0}^{nγ} \binom{n}{k} P^k Q^{n−k}


Note that since γ is a real number, nγ is not integer valued. Therefore, in the binomial expression, the sum should be taken up to and including the largest integer less than nγ.

Case 3. p < 2p < q.

Decision Rule: The maximum likelihood decision rule reduces to: choose GEP if and only if (1) a = 0; and (2) b > nγ, where

γ = log((1 − 2p)/(1 − q)) / [log(q/p) + log((1 − 2p)/(1 − q))]

Otherwise, choose GCP.

Population Update: As usual, let αi be the proportion of the previous generation speaking GCP. It can be shown that the update rule has the following form:

αi+1 = 1 − Σ_{k=nγ}^{n} \binom{n}{k} P^k Q^{n−k}

where P and Q are as in Case 2. Again note that because nγ is not integer valued, the sum in the binomial expression should be taken from the smallest integer larger than nγ.

System Evolution

We have shown above how the behavior of the population can be characterized as a dynamical system and have derived the update rules for such a system for a maximum likelihood learning algorithm. The dynamical system captures the evolutionary consequences of this particular learning algorithm. Let us elaborate on the evolutionary possibilities. In principle, these may then be matched against the observed historical trends.

Case 1.

1. α = 0 is a fixed point, i.e., if the initial population consists entirely of EP (GEP) speakers, it will always remain that way. Furthermore, if np < 1, then this is a stable fixed point. It is also the only fixed point between 0 and 1. Thus in this case a population speaking entirely CP would gradually be converted to one speaking entirely EP.

2. If np > 1, then α = 0 remains a fixed point but now becomes unstable. For this case, an additional fixed point (stable) is now created between 0 and 1. All initial population compositions will tend to this particular mix of GCP and GEP speakers. Figure 7.1 shows the fixed (equilibrium) point as a function of n and p.


Figure 7.1: The fixed point of the dynamical system (on the z-axis) as a function of n (on the x-axis) and 1/p (on the y-axis).


Case 2.

1. Unlike Case 1, the dynamical evolution now depends upon both p and q in addition to n.

2. It is easily seen that α = 0 is no longer a fixed (equilibrium) point (unless p = q). Consequently, irrespective of their initial composition, populations will always contain some CP speakers.


3. It is possible to show that there is exactly one fixed (stable) point and all initial populations will tend to this value. Shown in Figure 7.2 is a plot of the fixed point as a function of q and p for a fixed value of n. Notice the multiple ridges in the profile suggesting sensitivity to the value of q around some critical points. Shown in Figure 7.3 is a plot of the fixed point as a function of p for various choices of n, keeping q fixed at 0.1.


Figure 7.2: The fixed point of the dynamical system (on the z-axis) as a function of q (on the x-axis) and p (on the y-axis). The value of n was held fixed at 5.


Figure 7.3: The fixed point of the dynamical system as a function of p (on the x-axis) for various values of n (the curves correspond to n = 3, 5, 8, 12, and 17). Here q was held fixed at 0.1 and p was allowed to vary from 0.1 to 0.5.


Case 3.

1. Like Case 2, the dynamical evolution depends upon both p and q in addition to n.

2. Again, it is easily seen that α = 0 is no longer a fixed point. Therefore, the speakers of CP can never be eliminated altogether for p and q in this range.

We can again plot the fixed points of the resulting dynamical system as a function of q and p where n is held fixed at 5, or for various values of n, keeping p fixed. We omit the figures for reasons of space. The results are: again the ridges in the landscape suggest a great sensitivity of the final equilibrium point to slight changes in the values of p and q. CP speakers are never completely eliminated, although their frequency can get quite low in certain regions.

What are the important conclusions from this analysis? In short, children using the maximum likelihood rule will choose GEP over GCP. However, a dynamical systems analysis must be carried out to see if that will suffice to “wipe out” CP. Only in Case 1 will CP be lost completely (provided p < 1/n). In all other cases, there will always remain some CP speakers within the community. In fact, the evolutionary properties can be quite subtle. Consider the following three example cases.

Example 1. Let p = 0.05, q = 0.02, and n = 4. In this case, if the parental generation were all speaking CP (α = 1), then a simple computation shows that the probability with which the child would pick GEP is 0.66, i.e., it is greater than one-half. Thus, in spite of the fact that the majority of children choose the grammar of European Portuguese, the speakers of Classical Portuguese will never die out completely. In fact, the fixed point is 0.11. Roughly 11 percent of the population will continue to speak CP. In the absence of this analysis, one might naively argue that since a majority of learners choose GEP, this trend will continue generation after generation, leading the population to lose all its CP speakers eventually. This argument, as we have just seen, would be mistaken.

Example 2. Let p = 0.05, q = 0.06, and n = 8. If this were the case, and the parental generation all spoke CP, it turns out that the probability with which the individual child would pick GEP would again be 0.66. However, now the speakers of CP would all be lost and the population would move to its stable fixed point containing only speakers of EP.

Example 3. If p = 0.05, q = 0.06, and n = 21, however, it is easily seen that CP speakers can never be completely lost.
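These numbers can be checked by iterating the Case 2 update rule directly. The short Python sketch below is only a numerical illustration of the formulas derived above (the variable names are mine); with the values of Example 1 it yields the probability 0.66 of picking GEP from all-CP parents and the fixed point of roughly 0.11.

from math import comb, log

p, q, n = 0.05, 0.02, 4                      # Example 1 (Case 2, since q < p)

gamma = log((1 - q) / (1 - 2 * p)) / (log((1 - q) / (1 - 2 * p)) + log(p / q))
b_max = int(n * gamma)                       # largest integer below n*gamma

def next_alpha(alpha):
    """One generation under the Galves batch (maximum likelihood) learner, Case 2."""
    P = alpha * p + (1 - alpha) * q                    # probability of hearing c3
    Q = alpha * (1 - 2 * p) + (1 - alpha) * (1 - q)    # probability of hearing c2
    prob_ep = sum(comb(n, k) * P**k * Q**(n - k) for k in range(b_max + 1))
    return 1 - prob_ep                                 # fraction acquiring GCP

print(round(1 - next_alpha(1.0), 2))   # P(child picks GEP | all-CP parents) ~ 0.66
alpha = 1.0
for _ in range(200):
    alpha = next_alpha(alpha)
print(round(alpha, 2))                 # stable fixed point ~ 0.11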

7.2.2 Batch Subset Algorithm

Most importantly, we see that the above learning algorithm makes specific predictions about the change of the linguistic composition of the population as a whole. For purposes of exploration, let us turn our attention to a simple modification of the previous learning algorithm that we call the Batch Subset Algorithm because (i) it incorporates the Subset Principle (see discussion in Berwick 1985), and (ii) all the data is processed “at once.” Our aim is to demonstrate how readily one may carry out changes in the learning algorithm and investigate their consequences within the model.

1. Draw n examples.

2. If c1 occurs even once, choose GCP; otherwise choose GEP.

Since EP is a subset of CP for the data at hand, such a learner would choose the grammar of EP (GEP) as its default grammar unless it received contradictory data (in this case c1, which informs it that the target is not GEP but GCP). Of course, such a learning algorithm is guaranteed to converge to the correct target as the data goes to infinity. A natural question to ask is whether it makes a different prediction about how the population would evolve. Assume that a proportion αi of the adult population speaks CP. Let us then calculate the probability with which a typical child internalizes GCP. After n i.i.d. draws, the child would internalize GEP if no examples of c1 occur. Since examples of c1 are produced for the child with probability pαi, the probability of not encountering a single example of c1 in n draws is given by (1 − pαi)^n. Therefore the probability of internalizing GCP is given by

αi+1 = 1 − (1 − αi p)^n

One can already see that the evolutionary properties for this learning algorithm are different from the previous one. The dynamics are always given by the same update rule irrespective of the values of p and q. In fact, the evolution, which is totally independent of q, is identical to Case 1 of the dynamics of the previous learning algorithm. Naturally, it has the same equilibrium behavior as Figure 7.1. It is worthwhile to reflect a bit on the bifurcations that take place as n changes. For very large n, the stable fixed point of the population is at a value close to α = 1. In fact, as n → ∞, this stable fixed point tends to 1. In other words, for large n, the stable mode of the population is to
speak mostly CP. For very small n, on the other hand — in fact, for n < 1/p — the only stable mode of the population is to speak entirely EP. Therefore a consistent explanation of the change from CP to EP within the framework of such a subset-like learning algorithm might invoke a sudden change of n from large to small values. Thus, if for some reason the total usage of clitics in these contexts decreased dramatically, so that children had far fewer examples on which to base their decision of which grammatical principle to internalize, one could move from a stable population of CP speakers to one of EP speakers. Of course, one can now examine more data to check whether this is indeed a plausible explanation. This example illustrates how the computational analysis sharpens considerably the possible explanatory scenarios, leading to more precise questions that one may then try to resolve.
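The bifurcation in n can be seen by iterating the Batch Subset update rule itself. The sketch below uses illustrative values of my own choosing (p = 0.05, so that 1/p = 20): for n well above 1/p the population settles at a high proportion of CP speakers, while for n below 1/p the same initial population drains away entirely to EP.

def subset_step(alpha, p, n):
    """Batch Subset learner: a child acquires GCP iff it hears c1 at least once."""
    return 1 - (1 - alpha * p) ** n

p = 0.05                          # illustrative value; the critical sample size is 1/p = 20
for n in (40, 10):                # n above and below 1/p
    alpha = 1.0                   # start from a population speaking entirely CP
    for _ in range(500):
        alpha = subset_step(alpha, p, n)
    print(n, round(alpha, 3))     # n = 40 keeps a large CP majority; n = 10 goes to 0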

7.2.3 Online Learning Algorithm (TLA)

One might argue that a batch learning mechanism where all the data is processed at once is cognitively less plausible than an online procedure with memory limitations. Let us therefore briefly consider the evolutionary consequences of an example of a canonical online learning algorithm — the Triggering Learning Algorithm (TLA) described in previous chapters. In this context, the TLA works as follows. Choose a hypothesis initially at random. Stay with the current hypothesis until a counterexample comes, at which point flip to the opposite hypothesis. Following the treatment in Chapter 5, we see that there are two languages LCP and LEP. The update rule (for the proportion of LCP speakers) may be derived as

αi+1 = [B + (1/2)(A − B)(1 − A − B)^n] / (A + B)

where (i) a is the probability with which speakers of LCP produce sentences in LCP ∩ LEP, (ii) b is the probability with which speakers of LEP produce sentences in LCP ∩ LEP, (iii) B is the probability with which unique triggers of LCP are presented to the learner, and (iv) A is the probability with which unique triggers of LEP are presented to the learner. It is easy to check that

b = 1;  a = 1 − p;  B = αi p;  A = 0

Plugging this in, one obtains the update rule as

αi+1 = 1 − (1/2)(1 − αi p)^n


where αi and αi+1 are the proportion of the population speaking CP in generation i and i + 1 respectively. As usual, n is the number of examples drawn. One can now make the following two observations about this evolution.

1. Since αi ≥ 1/2 for all i, the proportion of CP speakers can never be reduced to less than 1/2 of the population. Consequently, one is able to see immediately that the TLA does not have the right evolutionary properties to explain the change from CP to EP.

2. It is possible to show that there is exactly one stable fixed point between 1/2 and 1 to which the system evolves. The exact value of this stable fixed point depends upon n and p. If p = 0, then the only stable point is α = 1/2.
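The same kind of iteration makes the TLA's limitation visible. The sketch below, with illustrative values of p and n of my own choosing, shows the population settling at a CP proportion strictly between 1/2 and 1, so the change to EP can never run to completion under this learner.

def tla_step(alpha, p, n):
    """TLA dynamics for the CP/EP pair: alpha_{i+1} = 1 - (1/2)(1 - alpha_i p)^n."""
    return 1 - 0.5 * (1 - alpha * p) ** n

alpha, p, n = 1.0, 0.05, 8        # illustrative values
for _ in range(100):
    alpha = tla_step(alpha, p, n)
print(round(alpha, 3))            # stable fixed point, always above 0.5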

7.3 Conclusions

In this chapter, we have grounded our discussion of language change in the context of the historically observed change of clitic placement in Portugal from the sixteenth to the eighteenth century. This is a case for which considerable data seems to exist and is being systematized as part of the Tycho Brahe collections. Let us reflect on the two themes that are illustrated in this chapter and that constitute two of the central insights of this book. First, we note that different language learning strategies on the part of individual learners can lead to qualitatively different evolutionary trajectories over time. We saw this for the Galves Batch Algorithm, the Batch Subset Algorithm, and the Online TLA. Second, we note that the dynamics are typically nonlinear. As a result, the interaction between learning and change can be quite subtle. The dynamical systems mathematics therefore becomes essential, because intuitions can lead one astray. For example, using the learning algorithm proposed by Galves and Galves 1995, we see that the evolutionary consequences depend subtly upon the parameter values p and q and the operating regime of the system. Bifurcations arise as p and q change continuously, leading the system from one qualitative regime to another. Again and again, in a number of different settings, we encounter such bifurcations, pointing to the pervasiveness of phase-transition-like phenomena in language evolution. These considerations illustrate how the mathematical and computational approach discussed in this book allows us to probe more deeply, sharpen questions, correct fallacious reasoning, and generally tease apart issues that
arise in constructing explanations for these kinds of changes. While it is certainly too early to claim a successful resolution of the Portuguese puzzle, the effort associated with the Tycho Brahe project represents strong steps in this direction.

Chapter 8

An Application to Chinese Phonology

In a number of cases of language change studied in historical linguistics, one observes two linguistic forms that have coexisted for a period of time in changing ratios, leading ultimately to the complete loss of one form from the entire population. In this chapter, we consider a population analysis of such a situation. To ground our discussion in a concrete linguistic example, we consider a case of phonological merger in Chinese (taken from Shen 1993, 1997). Over the course of this example, I develop the various issues that need to be resolved as linguists try to account for linguistic changes of the sort that occurred in Wenzhou over the last century. These issues are sometimes tricky to tease apart, and the primary role of this chapter is to introduce a computational framework within which one can embed various linguistic accounts, examine their consequences, and generally reason about them.

Historical phenomena in language present a particularly interesting window on synchronic linguistics and language acquisition. While considerable computational work exists in the latter two areas, historical linguistics has received minimal computational attention in the past. Yet there is clearly an interplay between the range of possible synchronic variation, the modes of transmission of language from one generation to the next (effected by language acquisition, broadly construed), and diachronic variation over generational time scales. This chapter explores and exploits such an interplay in the particular context of a documented instance of phonological change in China. It is part of our ongoing effort in this book to better understand the nature of
the relationship between language acquisition by individuals and language change in populations. We will see over the course of this chapter how the various issues surrounding language change are brought into sharp focus as a result of our computational reasoning. It will also give us a welcome opportunity to depart from the syntax-centric formulation of our models to consider an explicitly phonological example. This will allow us to see both the generality of our computational framework and the universality of the issues that arise in any treatment of language change. We will also see an interesting contrast between linguistic evolution and biological evolution. Our models indicate that in contrast to biological evolution, the particulate acquisition of language (equated with categorical monolingualism) eliminates linguistic diversity, while the blending acquisition of language (equated with multilingualism) maintains linguistic diversity. Note that in biological evolution, particulate inheritance maintains diversity while blending inheritance eliminates it (see, e.g., Fisher 1930 for a discussion of this issue). Finally, we will reflect again on the role of bifurcations in language change and see how they may provide a novel solution to the actuation problem — the problem of why a change may come about in the first place.

8.1 Phonological Merger in the Wenzhou Province

Zhongwei Shen (1997) describes two detailed studies of phonological change in the Wu dialects of China. We consider here as an example the monophthongization of /oy/ resulting in a phonological merger with the rounded front vowel /o/. This sound change is apparently not influenced by contact with Mandarin and is conjectured to be due to phonetic similarities between the two sounds. These two phonological categories were preserved as distinct by many speakers, but over a period of time, the distinction was lost and their merger created many homophonous pairs. Thus, the word for “cloth” — /poy/42 — now became homophonous with the word for “half” — /po/42, and similarly, the word for “road” — /loy/ — became homophonous with the word for “in disorder” — /lo/11. A list of thirty-five words with the diphthong /oy/ is presented in Shen 1993, and some of these are reproduced in Table 8.1. The phonetic difference between the two sounds lies in movements of the first and second formants. Both of the sounds in question are long vowels. The monophthong /o/ has a first formant at around 600 Hz. and a second formant at 2200 Hz. The diphthong /oy/ has a first formant that starts around 600 Hz. and gradually drops down to 350 Hz., while the second formant increases slightly above 2200 Hz.

/poy/42   “cloth”
/doy/31   “graph”
/moy/31   “to sharpen”
/toy/42   “jealous”
/soy/42   “to tell”

Table 8.1: A subset of the words that underwent change.

The change from the diphthong to the monophthong can in principle be gradual, with no compelling phonetic reason to make this change abrupt. During the process of this phonological merger, each word in this list had two alternative pronunciations: (i) the original pronunciation with the diphthongized vowel, and (ii) the changed pronunciation with the diphthongized vowel replaced by the monophthongized equivalent. In an effort to locate the rate at which the change spreads through the speakers in a population and through the words in the lexicon, Shen conducted a field study in the summer of 1990, and the results are reported in Shen 1993, 1997. A total of 363 subjects were questioned regarding the pronunciation of each of the 35 words, from which it was possible to elicit the distribution of pronunciations in the population as a whole. The subjects were distributed in age from 15 to 77. A striking pattern was observed, providing a veritable snapshot of the change in process. Among the older people the original pronunciation of the word was in vogue. The proportion of people using the original pronunciation decreased with decreasing age, so that among the youngest people in the community, the original pronunciation was almost completely replaced by the new one. Figure 8.1 shows the percentage of people using the original pronunciation as a function of age for two different words in the lexicon. It is clear that different generations of speakers preferentially use different forms of the word. It is also worthwhile to observe that a true diachronic picture of the language changing over time would be observed by literally sampling in time, i.e., sampling the average behavior of the population (represented by the percentage of speakers who use the original form) at different points in time. This would require cross-sectional data over many years. This is usually infeasible, and the technique of using the synchronic variation and factoring it by generations provides an estimate of “apparent time” (introduced by Labov 1966), allowing us a window into the diachronic variation.


Figure 8.1: Gradual change in the percentage of speakers using the monophthong version of a word, plotted against apparent time in years. The two panels correspond to two different words that underwent a change in pronunciation. The words are /poy/ (top) and /loy/ (bottom) respectively. After Shen (1997).


We thus see a phonological (sound) change in progress with two linguistic forms in competition: (i) a diphthongized form, and (ii) a monophthongized form. In almost characteristic fashion, one form is completely replaced by the other. Why does such a change happen? What initiates it? Why does it go to completion? Why don’t mixed populations remain stable over all time? These are the canonical questions that arise in many studies in historical linguistics. Let us consider some of the key issues that linguists have to resolve as they construct explanations of such phenomena:

1. The role of learning: Clearly language is acquired by children — most significantly from the input provided by the previous generation of speakers in the community. Now at one point in time, the two phonological forms were being distinguished and only the diphthongized form was being used by all speakers for each of the words in question. Children heard this form of the word and most of them should have acquired this form. Indeed, if all of them had acquired this form, the language would not have changed from one generation to the next in this regard. So some children must have acquired the monophthongized version, and thus two successive generations differed from each other. A common explanation for the kind of phonological change discussed here is the phonetic similarity between the two phonological items in question. While phonological classes are discrete, their phonetic realizations are, of course, continuous. For phonetically similar sounds the overlap in the distributions of the two sounds might be considerable, often leading to errors in production or perception. As a result, some child learners might conceivably end up learning an alternative pronunciation, leading to phonological variation in the population and ultimately change. The idea that language change is contingent on language learning has been a long-standing one, and indeed Shen (1997) devotes some attention to this connection. We will examine this in a mathematically precise manner over the course of this chapter.

2. Populations versus idiolects: Isolated instances of mislearning or idiosyncratic linguistic behavior are clearly of little consequence unless they spread through the community over time to result in large-scale language change. Central to our point of view in this book is the distinction between the population and the individuals in it — the
distinction between the behavior of the individual language users and the group linguistic characteristics of the population as a whole. Shen (1997) provides the Wenzhou data discussed here and explicitly samples multiple people in the population for each generation, and the curves of Figure 8.1 are plots of average speaker behavior with respect to pronunciation. Another goal of this chapter is to explore the relationship between change at the individual level and change at the population level in this context.

3. Lexical diffusion versus the Neogrammarian hypothesis: In phonological changes of the sort described in Wenzhou, there are multiple words in the lexicon where each word has two (or more) forms at any given point in time. The neogrammarian position has been that the change (phonetically gradual and phonologically abrupt) occurs in all words at the same rate. In contrast, the lexical diffusion theory (Wang 1969) suggests that the change is initiated in some words and gradually “diffuses” through the lexicon to completion.

4. Monolingualism versus bilingualism: At any given point in time when the two linguistic forms are competing, do speakers choose one form over the other, or do they essentially become bilingual users of both forms? These two different cases might have very different consequences for languages changing over time, and for the issue of whether such a change goes to completion. We will elaborate on this later in the chapter.

5. The role of frequency (statistical) versus rule (categorical) effects: While language is typically conceptualized as categorical (algebraic), there are clearly statistical effects at the margins of categorical behavior. Linguistic expressions — be they words, phrases, or sentences of different types — do not all occur with the same frequency in language use. The probability with which various forms are heard might affect the acquisition of those forms and therefore the transmission of forms from one generation to the next. This accumulated effect might ultimately lead to categorical change.

As we construct explanations for various historically observed trends, we need to tease these issues apart. We need tools to help us reason about such issues to separate the plausible from the implausible. In the next section, I explicitly consider a series of models of word learning and examine the long-term evolutionary consequences at the population level of various word-learning models at the individual level.

8.2 Two Forms in a Population

I develop some models of word learning in the context of the phonological merger in Wenzhou. Clearly, there are many words in the lexicon; let us focus on one particular word with its two alternative pronunciations. In generation t, let a proportion αt of the population use the changed form of a particular word. Correspondingly, a proportion 1 − αt of the population uses the original form of the word. The next generation of children will now obviously hear both forms of the word being used by adults at large. We will now consider the following four situations in turn.

1. Children hear both forms of the word. Each child receives a different random draw of words from the entire adult population. Each child acquires the form that occurs more often in a certain sense.

2. Children hear both forms of the word. Each child receives a different random draw of words. However, these words are heard only from the parents and therefore reflect the linguistic form of each parent. Each child acquires the form that occurs more often in a certain sense.

3. Children hear both forms of the word. Each child acquires both forms. However, it uses the two forms in a ratio that reflects the ratio of the two forms in the data set it received (from the entire adult population) during the learning phase.

4. Children hear both forms of the word. These words are heard only from the parents. Each child acquires both forms. Moreover, it uses the two forms in a ratio that reflects the ratio of the two forms in the data set it received during the learning phase.

As we will see in the analysis that follows, these different situations lead to different evolutionary behavior at the population level. Linguistic explanations of the ultimate loss of one form will therefore need to take these differences into account.

8.2.1 Case 1

Let us assume that each child hears N words, after which it acquires one or other of the two linguistic forms of the word. Since the words are randomly drawn from the entire adult population of speakers, the probability with which an arbitrary child will hear exactly k words of form 1 is given by

\binom{N}{k} (αt)^k (1 − αt)^{N−k}

Let us further assume that a child acquires form 1 if it occurs at least K times during its learning phase. Therefore the probability with which an arbitrary child will acquire form 1 is clearly

Σ_{k=K}^{N} \binom{N}{k} (αt)^k (1 − αt)^{N−k}

Naturally there will be variation in the population of children. This variation will arise due to particular differences in the primary linguistic data set that each individual child is exposed to. Thus some children will have acquired form 1; others will have acquired form 2. Crucially, however, in the population at large, the proportion of children who have acquired form 1 is also given by

αt+1 = Σ_{k=K}^{N} \binom{N}{k} (αt)^k (1 − αt)^{N−k}    (8.1)

Thus we have immediately related the proportion of form 1 users in two successive generations. This gives us an iterated map that we examine in some detail below.
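The behavior of this map is easy to see numerically. The sketch below iterates it for N = 20 and K = 8 (one of the (N, K) choices used for Figure 8.2 later in the chapter); the starting mixes are illustrative values of my own choosing. A population starting below the interior unstable point loses form 1 altogether, while one starting above it drives form 2 out.

from math import comb

def f(alpha, N=20, K=8):
    """Probability that a child hears form 1 at least K times out of N draws."""
    return sum(comb(N, k) * alpha**k * (1 - alpha)**(N - k) for k in range(K, N + 1))

for alpha0 in (0.2, 0.5):              # starting mixes below / above the unstable point
    alpha = alpha0
    for _ in range(100):
        alpha = f(alpha)
    print(alpha0, round(alpha, 3))     # 0.2 -> 0.0 (form 1 lost), 0.5 -> 1.0 (form 1 fixed)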

8.2.2 Analysis

Note that Equation 8.1 represents a nonlinear (polynomial) iterated map from the interval I = [0, 1] to itself whose properties now need to be examined. It is possible to show the following:

1. There are two stable equilibrium points, given by α = 0 and α = 1 respectively.

2. There is exactly one unstable equilibrium point in the interior, α* ∈ (0, 1).

3. There are no further equilibria.

To see this, consider the map f(α) = Σ_{k=K}^{N} \binom{N}{k} α^k (1 − α)^{N−k}. Shown in Figure 8.2 is a graph of this function for an arbitrary choice of N = 20 and various choices of K.


Figure 8.2: The binomial map f(x) = Σ_{i=K}^{N} \binom{N}{i} x^i (1 − x)^{N−i} plotted against x. The dashed line is the identity map f(x) = x. N is chosen to be 20 and the four curves correspond to choices of K = 4, 8, 12, 16, respectively.

By inspection of f(α), it is easy to see that f(0) = 0 and f(1) = 1. To show stability, it is sufficient to show that |f′(α)| < 1 for those values α = 0, 1 respectively. Differentiating f term by term, we get

f′(α) = Σ_{k=K}^{N−1} \binom{N}{k} [ k α^{k−1} (1 − α)^{N−k} − (N − k) α^k (1 − α)^{N−k−1} ] + N α^{N−1}

Expanding this out, we see

f′(α) = Σ_{k=K}^{N−1} (N!/((N−k)! k!)) k α^{k−1} (1 − α)^{N−k} − Σ_{k=K}^{N−1} (N!/((N−k)! k!)) (N − k) α^k (1 − α)^{N−k−1} + N α^{N−1}

Factoring N out of the expression, we have

f′(α) = N [ Σ_{k=K}^{N−1} ((N−1)!/((N−k)! (k−1)!)) α^{k−1} (1 − α)^{N−k} − Σ_{k=K}^{N−1} ((N−1)!/(k! (N−k−1)!)) α^k (1 − α)^{N−k−1} + α^{N−1} ]

After canceling terms, we see

f′(α) = N \binom{N−1}{K−1} α^{K−1} (1 − α)^{N−K}    (8.2)

Clearly, f′(0) = f′(1) = 0. Stability is shown. It is easy to see that there is at least one fixed point in the interior. For that, we only need to consider h(x) = f(x) − x and notice that (i) h(0) = h(1) = 0 and (ii) h′(0) = h′(1) = −1. From (i) and (ii) we have that there must exist 1/2 > ε1 > 0 and 1/2 > ε2 > 0 such that h(ε1) < 0 and h(1 − ε2) > 0. Since h(x) is continuous, there must exist some α* such that h(α*) = 0, i.e., α* is an equilibrium point. Further, it is easy to see that since h(ε1) < 0 and h(1 − ε2) > 0, and h is continuous, there are an odd number of interior roots in (0, 1). These roots correspond to equilibrium points that are alternately unstable and stable. Let the roots be given by α1, . . . , α2m+1. Then α1 is unstable, α2 is stable, α3 is unstable, and so on. However, the structure of f′(x) (having only one local maximum in (0, 1)), as derived in Equation 8.2, shows that there is only one root and it has to be unstable.

8.2.3 Case 2

As a starting point, we consider the case of random mating between individuals in the population. This gives rise to four types of parental groups (AA refers to the type where the father uses form A and the mother uses form A; the other three types arise from the other possible combinations). If a proportion αt of the population uses form A, and there is no difference between males and females in this regard, one can compute the proportions of each of the four parental types in the population. This is indicated in Table 8.2. Assume that children are born into each parental type equally often. Recall that each child hears N words drawn at random from the parents. One can now list the source distributions for each of the parental types. These are 1. Children of AA hear only the form A and consequently acquire the form A. 2. Children of BB hear only the form B and consequently acquire the form B.

3. Children of AB and BA hear both forms. Each form has probability 1/2 of occurring. The child acquires either form A or form B using the same criterion as in Case 1, i.e., it chooses A if this form occurs at least K times in the set of N words.

Thus, according to the notation of Table 8.2, we see that b3 = 1 and b0 = 0. Clearly, b2 = b1, and this is given by

b2 = b1 = Σ_{k=K}^{N} \binom{N}{k} (1/2)^k (1/2)^{N−k}

Pat. Form   Mat. Form   P(Ch. Form = A)   P(Types)   Rand. Mating
A           A           b3                p3         (αt)^2
A           B           b2                p2         αt(1 − αt)
B           A           b1                p1         αt(1 − αt)
B           B           b0                p0         (1 − αt)^2

Table 8.2: The linguistic forms of each parent (A and B indicate the two kinds of forms). The third column indicates the probability with which a child of each parental type will acquire the form A. Column 4 indicates the probability of each parental type and its value under random mating (Column 5).

We can now compute the proportion of children who have acquired form A. This is given by

Σ_{i=1}^{4} bi pi = 1·(αt)^2 + b2·(2 αt (1 − αt)) + 0·(1 − αt)^2 = αt (2b2 − (2b2 − 1) αt)

Thus we have again related the proportion of children acquiring form A in two successive generations. This yields the logistic map, and the analysis of this classic iterated map is well known. We can make the following observations:

1. The evolution depends upon the value of b2. If b2 = 1/2, then αt+1 = αt. Therefore the population never changes in linguistic composition.


2. If b2 ≠ 1/2, there are two fixed points, α = 0 and α = 1, corresponding to the two homogeneous populations. Exactly one of these is stable. If b2 > 1/2 then α = 1 is stable; if b2 < 1/2 then α = 0 is stable.

Variations of the Parental Model

The model discussed above closely follows that developed in Niyogi 2002, and is intimately related to cultural models developed in Cavalli-Sforza and Feldman 1981. This connection is discussed at length in the next chapter. Several variations have been considered in Cavalli-Sforza and Feldman 1981, and we briefly consider a few in the current context. First, note that we have assumed that both parents have an equal role in producing primary linguistic data for the child. In general, the update rule is given by

αt+1 = αt (b − (b − 1) αt)

where b = b1 + b2. This does not affect the qualitative nature of the results. Second, consider relaxing the assumption of random mating. After Cavalli-Sforza and Feldman 1981, one might model this as follows. Let a proportion p of the population mate in an assortative way (i.e., the type AB does not occur) and a proportion 1 − p mate at random. In that case, the update rule is given by

αt+1 = αt [p + (1 − p)b + αt (1 − p)(1 − b)]

where b = b2 + b1 as before. Again, this does not affect the qualitative nature of the results. A greater variety of models may be developed by considering the neighborhood structure of the population or oblique transmission (across generations). These are considered in the next chapter, and their applicability to the Wenzhou case is not considered any further.
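As a quick numerical check of the basic random-mating map derived above (before these variations), the sketch below computes b2 for two illustrative choices of K with N = 10 and iterates the update rule: when b2 > 1/2 the population is driven to α = 1, and when b2 < 1/2 it is driven to α = 0. The parameter values are my own choices, used only for illustration.

from math import comb

def b2(N, K):
    """Probability that a child of mixed (AB or BA) parents acquires form A."""
    return sum(comb(N, k) * 0.5**N for k in range(K, N + 1))

def case2_step(alpha, b):
    """The logistic-style update alpha_{t+1} = alpha_t (2b - (2b - 1) alpha_t)."""
    return alpha * (2 * b - (2 * b - 1) * alpha)

for K in (4, 7):                        # with N = 10: b2 > 1/2 and b2 < 1/2 respectively
    b = b2(10, K)
    alpha = 0.5
    for _ in range(500):
        alpha = case2_step(alpha, b)
    print(K, round(b, 3), round(alpha, 3))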

8.2.4 Case 3

First consider generation number t of adult speakers. Let an arbitrary adult in this population use the two forms in the ratio λt. The distribution P(λt) of possible λt-values captures the variation in the adult population with respect to the usage of these two forms. The mean of this distribution is given by

E[λt] = ∫ λt P(λt) dλt


This mean characterizes the average use of each of these two forms in the population as a whole. Now consider the arbitrary child. This child hears words uttered by random speakers with random usage of each of the two forms. With probability P(λt) it comes across a speaker who uses form A with probability λt. Therefore, averaging over all speakers, the total probability of hearing a word of form A is given by

∫ P(λt) λt dλt = E[λt]

This arbitrary child hears N words in all. Some of these (say n) are of form A and the rest (N − n) are of form B. Let us assume that the child matures into adulthood using both forms in a manner that reflects its childhood learning experience. Thus the child uses form A with probability n/N. Clearly n is a random variable; different children will have different experiences, and there will be a variation in the population of speakers of the next generation – t + 1. However, the average over this population is given by

E[λt+1] = E[n/N] = E[n]/N = (N E[λt])/N = E[λt]

Interestingly, the average use of the two forms in the population does not change at all! There is no evolution and the usage of both forms in the population is preserved.

8.2.5 Case 4

We consider this variant of Case 3 for completeness. Here the child is exposed only to data from his or her parents. Assume that in generation t the father has a λ-value of λF and the mother has a λ-value of λM. Both values of λ are samples from a distribution of λ-values in the population that can be characterized by P(λ) — a probability distribution over λ-values in the usual way. Assuming random mating, as before, the proportion of children whose fathers have λ = λF and mothers have λ = λM is given by P(λM)P(λF). Such children will receive N instances of the word each. Assuming that parents are equally likely to speak to the children (as before), we have that with probability 1/2 the father will provide a word and with probability 1/2 the mother will. Therefore, the probability with which the child will hear form 1 is given by

p = (1/2) λF + (1/2) λM


Let us assume that of the N instances of the word, n are of form 1 and the remaining N − n are of form 2. The child would estimate λ as n/N. Denoting by λ̂(λM, λF) the estimate of λ that such a child (whose parental λ-values are denoted by λM and λF respectively) constructs, we have

λ̂(λM, λF) = n/N

Clearly, different children of the same parental type will actually get different random draws, and therefore there might be some variation in the population of such children. However, their mean value is given by

E[λ̂(λM, λF)] = E[n]/N = (N p)/N = p = (1/2) λF + (1/2) λM

Thus the average λ-value that such children will internalize is given by (1/2) λF + (1/2) λM. Averaging over children of all parental types, we have

E[λ̂] = ∫∫ [(1/2) λF + (1/2) λM] P(λM) P(λF) dλM dλF = E[λ]

Clearly, the average value of λ in the generation of children (after learning) is the same as the average value of λ in the generation of adults. As in Case 3, there is no evolution or change in the population over time.

8.2.6 Remarks and Discussion

The four cases we have considered fall into two broad categories. Cases 1 and 2 correspond to a situation where children make a categorical choice in the linguistic variable of interest, i.e., they choose exactly one of the two linguistic forms that exist in the population as a whole. Cases 3 and 4, in contrast, correspond to a situation where children adopt both forms but in a ratio that reflects the frequency of occurrence of the two linguistic forms in the population. Two different modes of transmission were considered for each of these two subcases. There are two interesting results of the analysis. First, we observe that in Cases 1 and 2, mixed populations are inherently unstable. In Case 1, we see that an unstable equilibrium exists for a mixed population, but if the population mix shifts even slightly from this unstable balance, one of the two forms is driven to extinction. In Case 2, the parameter b2 has to be exactly equal to 1/2 for a population mix to remain that way, or else one of the two variants is driven to extinction. In Cases 3 and 4, populations do


not have any inherent tendency to change. Both forms are preserved in the population over generational time. Second, we observe that in Cases 1 and 2, there is variation in the population that is forced by the categorical nature of language use, i.e., which of the two forms speakers use. In Cases 3 and 4, the speakers are all potentially bilingual (i.e., use both forms) and in a proportion λ that varies from individual to individual. One can compute the variance of λ in the population, and this characterizes the variation present in language use in the population. Interestingly, we see

Var(λt+1) = Var(n/N) = (1/N^2) Var(n)

where each individual child simply estimates the λ-value as n/N. We saw earlier that n = Σ_{i=1}^{N} Xi, where Xi is a 0–1 valued random variable that takes on the value 1 if form 1 occurs and 0 otherwise. Form 1 occurs with probability E[λt]. Thus,

Var(n) = N Var(Xi) = N E[λt](1 − E[λt])

Therefore

Var(λt+1) = E[λt](1 − E[λt]) / N

But we have seen that E[λt] remains the same from generation to generation. Thus E[λt+1] = E[λt] = α (say). Both the mean and the variance in the population are fixed for all time. Thus we see that in Cases 1 and 2, the variation in the population is gradually lost over time as one of the two forms is driven out of existence. In contrast, in Cases 3 and 4, the variation is maintained. This leads us to conjecture that it is the categorical nature of language that forces change. We will explore this in greater detail in later sections.
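The contrast with the blending learners of Case 3 can also be simulated directly. In the sketch below (the population size, N, and the initial mix of usage rates are arbitrary illustrative choices), each child sets its λ to the fraction of form-A tokens it happened to hear from randomly chosen adults; the population mean of λ stays essentially where it started, generation after generation, rather than drifting to fixation as in Cases 1 and 2.

import random

def blending_generation(lambdas, N, rng):
    """Case 3: each child hears N tokens, each from a randomly chosen adult,
    and adopts form A with probability equal to the fraction of form-A tokens heard."""
    children = []
    for _ in range(len(lambdas)):
        heard_a = sum(rng.random() < rng.choice(lambdas) for _ in range(N))
        children.append(heard_a / N)
    return children

rng = random.Random(0)
pop = [0.7] * 5000 + [0.2] * 5000        # arbitrary initial mix; mean lambda = 0.45
for t in range(5):
    print(t, round(sum(pop) / len(pop), 3))    # the mean hardly moves across generations
    pop = blending_generation(pop, 10, rng)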

8.3 Examining the Wenzhou Data Further

Let us reexamine the Wenzhou data, keeping in mind the various issues and the analyses of the previous section. The data has been presented and studied in great detail by Shen (1993), and I present certain aspects of that study that are particularly interesting in the light of the current analysis. A central conclusion of this analysis is that categorical behavior by language users is more likely to result in change than bilingual or blending behavior, which tends to maintain the variation in the population. The first interesting question for us was to figure out which of these two kinds of linguistic behavior was closer to reality in the case of Wenzhou. An examination of the data in Shen 1993 suggests that for a particular word with its two forms, people tend to always use one form consistently. A single individual never uses both forms for the same word. Of course, there is variation in the population, with a higher percentage of older people using form 1 (diphthongized) and a higher percentage of younger people using form 2 (monophthongized). This is consistent with our analysis, since a categorical use of forms tends to be stable only in homogeneous populations with only one form present. Shown below are the percentages of speakers who use the monophthongized form of each word in eleven age brackets (from Shen 1993) for the five words presented in the earlier section.

Word (diph.)    Meaning          Average age of group
                                 17.7  21.4  26.9  32.2  36.7  42.2  46.5  51.6  57.3  61.0  72.8
Number sampled                     24    28    30    43    47    65    54    39    19     8     6
/poy/42         “cloth”            88    86    87    58    34    14     4     8     0     0     0
/doy/31         “graph”            70    86    93    65    40    20     9     8     0    13     0
/moy/31         “to sharpen”       77    93    93    70    43    23    11    10     0     0     0
/toy/42         “jealous”          88    86    87    58    34    14     4     8     0     0     0
/soy/42         “to tell”          88    86    87    58    34    14     4     8     0     0     0

Table 8.3: Percentage of speakers using the monophthongized form for each of five words. The first column provides the word, the second its meaning, and the columns after that indicate the percentage of speakers using the changed form for each of eleven age groups. The eleven age groups are indicated by their average age in the first row (with the number of people sampled in each age group in the second row). Adapted from Shen 1993.

The thirty-five different words do not all change at the same time, nor do they change at the same rate. If one studies the distribution of the two forms of each word in the population, then the oldest people all use form 1 (the diphthongized version) and the youngest mostly use form 2 (the monophthongized version) of the word. The speakers in the middle (intermediate age) show a variation along two directions: in the number of words they have changed, and in which words they have changed. The change in some of the words occurs before the change in other words. Further, Shen (1993) finds that the rate of change is higher for words that have started changing later. This lends support to the lexical diffusion theory put forward by Wang (1969).

The preceding discussion suggests that Cases 1 and 2 are closer to the underlying phenomena than Cases 3 and 4. Consider now Cases 1 and 2
in greater detail. Is it possible to disambiguate these two cases from the data? In other words, in the acquisition of the appropriate pronunciation for each word (form), do speakers learn only (mostly) from parents, or is it more likely that they learn from the entire population? Furthermore, is it possible to evaluate the threshold from the data? Are these thresholds different for different words?

Consider the analysis under Case 1. We see that both α = 0 and α = 1 are stable equilibrium conditions, where α is the proportion of form 1 (monophthong-form) users. Let the unstable interior equilibrium condition be α′. The initial condition was α = 0, and under the modeling assumptions this would remain stable over all time. Therefore, internally driven change is not possible. Language contact with another population consisting of mostly form 1 users is necessary for change. In particular, the number of new speakers must be enough to move the effective population mix from one basin of attraction to another, i.e., α > α′. In other words, a slight introduction of form 1 users in the population is not sufficient — it must actually be greater than α′.

Now consider the analysis under Case 2. In this case, exactly one of the conditions α = 0 and α = 1 is stable. The other is an unstable equilibrium. Clearly, α = 0 cannot be the stable situation — for then the system would never be driven to change. Therefore, α = 1 must be the stable situation where all speakers tend to produce monophthongized versions of the word, and this is what the system ultimately tends to anyway. However, for this to be the case, b2 > 1/2, which means that K < N/2. In other words, it is easier to learn form 1 (monophthongized) than it is to learn form 2. In this case, even the slightest contact with form 2 users will drive the entire population to change completely. Of course, this raises the question of why the population found itself in an unstable equilibrium in the first place. It is only an examination of the social setting in which the change comes about that will yield further insight into this issue.

If substantial language contact came about, both Case 1 and Case 2 are likely scenarios. However, in the absence of substantial language contact, Case 2 is the more likely candidate. It should also be said that on the face of it the dynamics of Case 1 seem more satisfying, since it is desirable that both linguistic systems be stable. It is reasonable to think that Wenzhou speakers maintained the diphthongized form of the words for many generations in a relatively stable mode until change came about over the last century. It is hard to imagine how a basically unstable linguistic system would be maintained for so long. A puzzling question for us is — what initiates the change (the so-called
actuation problem)? It is reasonable to assume that at one point in time only the diphthongized form of the word was used by the population. Such a population is actually in (stable) equilibrium — so what caused it to change? And why didn’t the change get initiated earlier? The analysis so far has not shed any light on this question. In the next section, we consider error-driven models. In particular, we see that if there are asymmetries in the errors during the learning period, the resulting population dynamics may have bifurcations. These bifurcations have the potential to successfully resolve the actuation problem — a problem that has resisted a satisfactory explanatory account in the scientific literature so far.

8.4 Error-Driven Models

In the analysis above, it has been assumed that transmission between speaker and hearer is essentially error-free. Each learner internalizes a particular form of the word and produces it upon reaching adulthood. Listeners perceive this particular form upon receiving it. In reality, the transmission is likely to have some errors. This is especially so in phonetic and phonological communication, where the speaker’s intent might be misperceived by the listener. For example, in Wenzhou, it is conceivable that due to phonetic similarity, the diphthong was perceived as a monophthong (and vice versa) on occasion by listeners. Let us examine the consequences of such errors in the kinds of models discussed above. Let the probability with which form 1 is misperceived (i.e., perceived as form 2) be ε1 and the probability with which form 2 is misperceived as form 1 be ε2. Then if form 1 was uttered with probability α, one can compute the probability with which a random utterance was perceived as form 1. This is given by

α′ = α(1 − ε1) + (1 − α)ε2

Clearly, if ε1 = ε2 = 0, then we have the error-free case and α′ = α — that is, the probability of hearing form 1 is the same as the probability with which form 1 is produced by speakers. Thus, if a learning child hears N instances of the word directed at it, the probability of perceiving at least K of them as form 1 is simply given by

Σ_{i=K}^{N} \binom{N}{i} (α′)^i (1 − α′)^{N−i} = f(α′)


In the models of Cases 1 and 2, the population contains a mix of type 1 (users of form 1) and type 2 (users of form 2) speakers in proportion α to 1 − α. Consider Case 1. The source probability of form 1 is simply α. As a result of communication errors along the channel, the probability with which form 1 is perceived is given by α′. The probability with which the typical learning child acquires (internalizes) form 1 as the preferred pronunciation is given by f(α′). The update rule for Case 1 is simply

αt+1 = f(α′t) = f(αt(1 − ε1) + (1 − αt)ε2)

One can now conduct an equilibrium analysis of this update rule in the usual way. When ε1 = ε2, i.e., the errors are symmetric, an analysis yields the following:

1. The stable fixed points at α = 0 and α = 1 are now lost. Instead, all fixed points are interior.

2. There are two stable fixed points, at α = α1 and α = α2.

3. There is one unstable interior point at α = α* ∈ (α1, α2).

8.4.1 Asymmetric Errors

The case when ε1 ≠ ε2 is the most interesting for our purposes. For simplicity, let us assume that ε2 = 0 while ε1 > 0. In this case, form 1 may be misperceived by the listener on occasion while form 2 is never misperceived. In the context of our case study, this amounts to assuming, for example, that diphthongs are occasionally misperceived as monophthongs while the reverse never occurs. In this case, we see that the update rule is given by

α_{t+1} = f(α′_t) = f(α_t(1 − ε1))

In other words,

α_{t+1} = Σ_{i=K}^{N} (N choose i) (pα_t)^i (1 − pα_t)^(N−i)


where p = 1 − ε1. We encountered this map in Chapter 5. Recall that the dynamics corresponding to this depend upon the value of p. For large values of p close to 1, the system is bistable with two stable fixed points α = 0 and α = α2 ≤ 1. There is one unstable fixed point α* ∈ (0, α2). Below a critical threshold, i.e., when p < pc, the system has only one stable fixed point given by α = 0. Thus there are two regimes the system can be in: one corresponding to a bistable regime when there are two possible stable modes and the other corresponding to a regime when there is only one possible stable mode. The bifurcation between these two regimes occurs at the critical value pc.
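The two regimes can be located by direct iteration of the map. A rough numerical sketch follows; N, K, and the grid of p values are arbitrary illustrative choices, and pc itself depends on these choices.

```python
from math import comb

def f(x, N=20, K=11):
    """P(at least K of N tokens are perceived as form 1)."""
    return sum(comb(N, i) * x ** i * (1 - x) ** (N - i) for i in range(K, N + 1))

def attractors(p, N=20, K=11):
    """Stable modes reached by iterating alpha -> f(p * alpha) from a coarse grid."""
    found = set()
    for i in range(21):
        a = i / 20
        for _ in range(500):
            a = f(p * a, N, K)
        found.add(round(a, 3))
    return sorted(found)

for p in (0.95, 0.85, 0.75, 0.70, 0.65):
    print(p, attractors(p))   # two modes (bistable) for large p; only alpha = 0 below p_c
```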

8.4.2 Bifurcations and the Actuation Problem

The bifurcation noted here serves as a useful explanatory construct for the actuation problem. Recall our discussion of Section 8.3. The various symmetric models of Cases 1 through 4 had no bifurcations and left us puzzled as to why a population would move from one stable state to another. Let us now offer an explanation for this puzzle. Assume that errors are asymmetric, and that the threshold ratio K/N and p = 1 − ε1 are fixed from generation to generation. In other words, neither the learning algorithm used by the learner nor the perceptual discriminability changes from one generation to the next. It is possible to compute the value of the critical point pc, and it is seen that pc depends upon N, i.e., the total number of times the word in question was heard (in either form) by the learner. In particular, it is seen that pc increases as N decreases. One can now imagine the following situation. Consider a particular word with two possible forms. At one point in time, the word was used often in discourse so that learners had many examples of that word on which to base their learning decision. Thus N was large and p > pc. In this situation, there are two stable modes, and one of these, α = α2 ≈ 1, corresponds to a situation where most (all) of the speakers use form 1 (the diphthong). Now consider what happens if the word starts being used less frequently in subsequent generations. The value of N decreases, pc correspondingly increases, and at some point, if the word is used infrequently enough, a bifurcation occurs. At this point, the evolutionary dynamics enter the regime where p < pc. Now there is only one stable point (α = 0) corresponding to a population of form 2 users. A population of form 1 users is unstable, and gradually the population drifts from users of diphthongs to users of monophthongs for the word in question.
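The claimed dependence of pc on N can be illustrated numerically. In the sketch below the learner's threshold ratio K/N is held fixed at 0.55; this ratio, the grid, and the convergence test are arbitrary assumptions made only for illustration.

```python
from math import comb, ceil

def f(x, N, K):
    """P(at least K of N perceived tokens are form 1)."""
    return sum(comb(N, i) * x ** i * (1 - x) ** (N - i) for i in range(K, N + 1))

def is_bistable(p, N, K, iters=200):
    """True if iterating alpha -> f(p * alpha) from alpha = 1 does not collapse to 0."""
    a = 1.0
    for _ in range(iters):
        a = f(p * a, N, K)
    return a > 0.5

def critical_p(N, ratio=0.55):
    """Approximate p_c on a grid of width 0.01, holding the threshold ratio K/N fixed."""
    K = ceil(ratio * N)
    for p in [i / 100 for i in range(100, 0, -1)]:
        if not is_bistable(p, N, K):
            return round(p + 0.01, 2)   # last value of p that was still bistable
    return 0.0

for N in (100, 50, 20, 10):
    print(N, critical_p(N))   # p_c increases as N (how often the word is heard) decreases
```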


In this manner, a change may come about simply because of a drift in the frequency with which the word is used during the learning period. This posits an entirely different mechanism for initiating language change. Under the assumption of our error-free models of Cases 1 through 4, the only way in which we could explain language change was by invoking language contact. Language contact would bring about a dramatic change in the mix of linguistic types in the population, thereby driving the population from one equilibrium state to the basin of attraction of another. In the absence of such a dramatic reconstitution of the population, it was difficult to see how change might come about. Our current explanatory framework differs from the theory of language contact in two ways. First, it is possible for change to be internally driven with no dramatic reconstitution of the population due to language contact. Second, it is possible for only a slight change in the value of N to bring about the bifurcation. Let us denote the dependence of p c on N explicitly by writing it as pc (N ). Since pc depends continuously on N , we see that adjacent values of N may let pc (N ) be greater than or less than p. Hence, a subtle and gradual change in frequency-based behavior may lead to a dramatic change in the stable language of subsequent generations. The discovery of bifurcations in this context must count as one of the central insights obtained by the mathematical approach to the study of language evolution embodied in this book.

8.5 Discussion

As we have discussed earlier, the case where two linguistic forms are in competition is almost ubiquitous in historical linguistics.

8.5.1 Sound Change

Sound change is one of the earliest and best-studied phenomena in historical linguistics, and one for which extensive fieldwork is often available. Indeed Wang (1977), Labov (1994), and Kiparsky (1982) have all concerned themselves with changes of this nature. Explanatory frameworks for such changes often reside at the phonological level in notions of phonological-rule restructuring (see Kiparsky 1982) and at the phonetic level in notions of speech production and perception. For example, Ohala (1989) discusses how “diachronic variation is drawn from a pool of synchronic variation”. It is argued that there is synchronic variation among the speakers of a language and a language learner will need


to deal with this variation. Further, miscommunication may occur between speaker and hearer due to articulatory sloppiness on the part of the speaker or perceptual confusion on the part of the hearer (learner). Several phonetic sources of such miscommunication were considered in an admirable discussion of potential sources of sound change. To take an example from Ohala 1989, consider the classic sound change of Indo-European labiovelar stops to labial stops in Classical Greek.

Proto-IE      Classical Greek
*ekwos        hippos          “horse”
*gwiwos       bios            “life”

It is believed that at one point, the word for “horse” had a labiovelar stop and later it had only a labial stop. Again we have two forms in competition, just like the much better documented case of Wenzhou we examined in this chapter. One form gradually gave way to the other. While the two sounds are confusable and learners might incorrectly learn the wrong word, the analysis in the current chapter shows that this by itself is not enough to drive an entire population to change. Much more subtle population-level analysis must be conducted to see how sounds may be transmitted from speaker to speaker across generations before one can explain the clear directionality of the change.

8.5.2 Connections to Population Biology

Population biology has long considered mathematical models of gene transmission to characterize the evolution of gene frequencies in populations. In many ways, the models considered here are very similar in spirit to population biology models — what is evolving is the frequency of linguistic forms rather than biological forms. The laws of transmission of linguistic forms from one generation to the next are assumed to be governed by the language acquisition process. The macroscopic (population) consequences of such transmission over generational time scales need to be explored, and this chapter represents a step in such a direction. In contrast to the case of language, the laws of transmission of genes (or other biological properties) from one generation to the next are usually governed by reproduction. There are some further interesting connections to population biology. We see a distinction in evolutionary consequences for “categorical” as opposed to “blending” behavior in language learners. This difference is analogous to the distinction between evolutionary consequences of particulate versus blending inheritance. As is well known (see, for example, Fisher 1930), particulate inheritance maintains variation in the population while blending


inheritance eliminates it. In contrast, we see that categorical behavior in language eliminates variation and drives linguistic change to completion while blending behavior maintains such variation and tends to suppress change. It is also noteworthy that simple and abstract mathematical models played an important role in reasoning about the distinction between particulate and blending inheritance by Fisher and other writers of that period.

8.6 Conclusions

Using the case of lexical diffusion in the Wenzhou province of China as a motivating example for which concrete data exists, this chapter has explored models of language change when two linguistic forms coexist in a population and are transmitted from one generation to the next through the process of language acquisition. A number of different models were developed, leading to four different cases that were systematically studied. Interestingly enough, none of these four cases provides a satisfactory account of the change. The careful study of these four cases and their failure to account for the change in Wenzhou province highlights the power of our computational approach. It allows us to work out the consequences of various assumptions about language and learning and demonstrates the falsifiability of the models constructed. It brings into sharper focus the mystifying question that has motivated much of this book. Why does a linguistic community change from one stable mode to another? One answer to this question may be that language contact between two different groups as a result of migration may dramatically alter the linguistic composition of the population and move it into a basin of attraction of the other stable point. A second answer may be that a drift in certain parameters of learning may cause a bifurcation in the population dynamics. This bifurcation could explain why a previously stable linguistic system becomes unstable. The asymmetric error-driven model of this chapter illustrates this point by showing how the change in the frequency with which a word is used during the learning period crucially affects the evolutionary dynamics. A third answer may be that change comes about because of random drift caused by finite population sizes. This is a possibility we did not explore in this chapter, but we will take it up at a later point in this book. Finally, it is also worth reiterating an important difference between two modes of individual linguistic behavior as far as long-term evolutionary consequences are concerned. It was found that categorical behavior on the part


of learners results in an inherent tendency of linguistic populations to change with time to a homogeneous stable mode with only one linguistic form surviving. Blending behavior on the part of the learner leads to both forms being preserved in the population at large. This is in contrast to models of inheritance in evolutionary biology where blending inheritance eliminates variation while particulate inheritance preserves it. It will become important to incorporate this insight in an account of language change over time.

Chapter 9

A Model of Cultural Evolution and Its Application to Language

9.1 Background

The evolutionary paradigm has applicability beyond the particular case of genetic reproduction to a wide variety of situations — from ecology (May 1973) to cooperation and conflict in populations (Axelrod 1984; Nowak and Sigmund 1993) to the evolution of culture and cognition (Boyd and Richerson 1985; Cavalli-Sforza and Feldman 1981). In this chapter, we take a closer look at models of cultural evolution and outline the relationship between such models and the models of language change that we have discussed so far. As we have seen, under the assumptions of contemporary linguistic theory, change in linguistic behavior of human populations must be a result of a change in the internal grammars that successive generations of humans employ. The question then becomes: Why do the grammars of successive generations differ from each other? In order to answer this question, we need to know how these grammars are acquired in the first place and how the grammars of succeeding generations are related to each other. If such a relationship is uncovered, one might then be able to systematically predict the envelope of possible changes and relate them to actually observed historical trajectories. It is worthwhile to note that in considering the evolution of the linguistic system, we have not invoked the genetic changes that may be going on in


human populations at the same time. The explanatory paradigm does not rely on notions of “linguistic” fitness, genetic transmission, or differential reproduction. In this sense, the evolution of language over historical time scales may be viewed as a certain kind of neutral cultural evolution. At the same time, this does not preclude the study of the correlation between languages and genes (see, e.g., Cavalli-Sforza 2001) or the intriguing possibility of coevolution of genes and languages. In a remarkable treatise in 1981, the evolutionary biologists L. Cavalli-Sforza and M. Feldman outlined a general model of cultural change that was inspired by models of biological evolution and has potential and hitherto unexploited applicability to the case of language. Indeed, many motivating examples in Cavalli-Sforza and Feldman 1981 were taken from the field of language change. However, the applicability of such models to language change was not formally pursued there. In this chapter, I introduce their basic model and provide one possible way in which the Principles and Parameters approach to grammatical theory (construed in the broadest possible way) is amenable to their modeling framework.¹ The framework for the computational characterization of changing linguistic populations discussed in this book so far has evolved from an original series of papers by Niyogi and Berwick (1995, 1997). We explore here the formal connections between these two approaches for the case of two linguistic variants in competition. In particular, we show how evolutionary trajectories in one framework can be formally translated into the other and discuss their similarities and differences. To ground the discussion in a particular linguistic context, I show the application of such models to generate insight into possible evolutionary trajectories for the case of diachronic evolution of English from the 9th century to the 15th century A.D. Finally, I utilize the insights of the Cavalli-Sforza and Feldman framework to develop an extended model to characterize the effect of spatial (geographic) location. This allows us to study spatial effects on the linguistic interactions between individuals in a population and the evolutionary consequences of such interactions.

¹ It is best to state up front that I do not attempt in this chapter to provide any review or systematic coverage of the various mathematical approaches that have been taken to the problem of cultural evolution. This would be quite beyond our current scope. For example, Boyd and Richerson 1985 is an important contribution to cultural evolution and could well have been a comparison point for us. I chose to focus on Cavalli-Sforza and Feldman 1981 in part because the mathematical treatment therein was most directly comparable to my approach. Correspondingly, similarities and differences could then be sharply outlined.

9.2 The Cavalli-Sforza and Feldman Theory

Cavalli-Sforza and Feldman (1981) outline a theoretical model for cultural change over generations. Such a model is inspired by the transmission of genetic parameters over generations and serves as a point of entry to studying the complex issue of gene-culture coevolution. In a cultural setting we have “cultural” parameters that are transmitted from parents to children with certain probabilities. In the model (hereafter referred to as the CF model in this chapter), the mechanism of transmission is unknown — only the probabilities of acquiring one of several possible variations of the trait are known. I reproduce their basic formulation for vertical transmission (from one generation to the next) of a particular binary-valued trait. Assume a particular cultural trait has one of two values. Some examples of traits they consider are political orientation (Democrat/Republican) or health habits (smoker/nonsmoker) and so on. Let the two values be denoted by H and L. Each individual is assumed to have exactly one of these two values. However, such a value is presumably not innate but learned. A child born to two individuals (mother and father) will acquire one of these two possible values over its lifetime. The probability with which it will acquire each of these traits depends upon its immediate environment — in the standard case of their model (though variations are considered²), these traits are acquired from its parents. Thus one can construct Table 9.1. The first three columns of Table 9.1 are self-explanatory. As one can see, parental compositions can be one of four types depending upon the values of the cultural traits of each of the parents. We denote by bi the probability with which a child of the ith parental type will attain the trait L (with 1 − bi, it attains H). In addition, let pi be the probability of the ith parental type in the population. Finally, we let the proportion of people having type L in the parental generation be ut. Here t indexes the generation number, and therefore the proportion of L types in the parental generation is given by ut and the proportion of L types in the next generation (children who mature into adults) is given by ut+1.

² Pure vertical transmission involves transmission of cultural parameters from parents to children. They also consider oblique transmission, where members of the parental generation other than the parents affect the acquisition of the cultural parameters, and horizontal transmission, where members of the same generation influence the individual child. I discuss in a later section the approach of Niyogi and Berwick 1995 that involves oblique transmission of a particular sort, different from the treatment in Cavalli-Sforza and Feldman 1981 (which is briefly discussed at the end of the current chapter).

Paternal Val.   Maternal Val.   P(Child = L)   P(Types)   Prob.
L               L               b3             p3         u_t^2
L               H               b2             p2         u_t(1 − u_t)
H               L               b1             p1         u_t(1 − u_t)
H               H               b0             p0         (1 − u_t)^2

Table 9.1: The cultural types of parents and children related to each other by their proportions in the population. The values depicted are for vertical transmission and random mating.

Under random mating, one sees that the proportion of parents of type (L, L), i.e., male L types married to female L types, is u_t^2. Similarly one can compute the probability of each of the other combinations. Given this, Cavalli-Sforza and Feldman go on to show that the proportion of L types in the population will evolve according to the following quadratic update rule:

u_{t+1} = B u_t^2 + C u_t + D        (9.1)

where B = b3 + b0 − b1 − b2, C = b2 + b1 − 2b0, and D = b0. In this manner, the proportion of L types in generation t + 1 (given by u_{t+1}) is related to the proportion of L types in generation t (given by u_t). A number of properties and variations of this basic evolutionary behavior are then evaluated (Cavalli-Sforza and Feldman 1981) under different assumptions. Thus, we see that evolution (change) of cultural traits within the population is essentially driven by the probabilities with which children acquire the traits given their parental types. The close similarity of this particular model³ to biological evolution is clear: (1) like gene-types, trait values are discrete, and (2) their transmission from one generation to another depends (in a probabilistic sense) only on the trait-values (gene-types) of the parents. The basic intuition Cavalli-Sforza and Feldman attempted to capture in their model is that cultural traits are acquired (learned) by children from their parents. Thus, by noting the population mix of different parental types and the probabilities with which they are transmitted, one can compute the evolution of these traits within the population. In the next section I show

³ To avoid misinterpretation, it is worthwhile to mention that extensions to continuous-valued traits have been discussed. Those extensions have less relevance for the case of language since linguistic objects are essentially discrete.


how to apply this model to language change.
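Equation 9.1 is easy to explore numerically. The following sketch is mine rather than Cavalli-Sforza and Feldman's; the transmission probabilities bi are arbitrary illustrative values, not estimates from data.

```python
def cf_step(u, b3, b2, b1, b0):
    """One generation of the CF vertical-transmission model (Equation 9.1).
    u is the proportion of L types among adults; random mating is assumed."""
    B = b3 + b0 - b1 - b2
    C = b2 + b1 - 2 * b0
    D = b0
    return B * u ** 2 + C * u + D

# Illustrative transmission probabilities: children of two L parents almost always
# acquire L, children of two H parents rarely do, mixed pairs fall in between.
u = 0.2
for t in range(20):
    u = cf_step(u, b3=0.95, b2=0.6, b1=0.6, b0=0.05)
print(round(u, 3))   # converges to the single stable fixed point of the quadratic map
```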

9.3 Instantiating the CF Model for Languages

In order to apply the model to the phenomena of language change, the crucial point to appreciate is that the mechanism of language transmission from generation to generation is “language learning”, i.e., children learn the language of their parents as a result of exposure to the primary linguistic data they receive from their linguistic environment. Therefore, in this particular case, the transmission probabilities bi in the model above will depend upon the learning algorithm that children employ. I outline this dependence for a simplified situation corresponding to two language types in competition.

9.3.1 One-Parameter Models

Assume there are two languages in the world — L1 and L2. Such a situation might effectively arise if two languages differing by a linguistic parameter are in competition with each other, and I have discussed many two-language models in previous chapters. In a later section I will also discuss the historical example of syntactic change in English for which this is a reasonable approximation. I consider languages to be subsets of Σ* in the usual sense where Σ is a finite alphabet. Furthermore, underlying each language Li is a grammar gi that represents the internal knowledge that speakers of Li possess. Individuals are assumed to be native speakers of exactly one of these two languages. Furthermore, let speakers of L1 produce sentences with a probability distribution P1 and speakers of L2 produce sentences with a distribution P2. There are now four parental types, and children born to each of these parental types are going to be exposed to different linguistic inputs and as a result will acquire different languages with different probabilities. In the abstract, let us assume that children follow some acquisition algorithm A that operates on the primary linguistic data they receive and comes up with a grammatical hypothesis — in our case, a choice of g1 or g2 (correspondingly L1 or L2). Following Chapter 2, we let Dk be the set of all k-tuples of sentences (s1, . . . , sk) where si ∈ Σ*. Each such k-tuple denotes a candidate dataset consisting of k sentences that might constitute the primary linguistic data a child receives. Clearly Dk is the set of all candidate datasets of size k. Then A is a computable mapping from the set ∪_{k=1}^{∞} Dk to {g1, g2}. We now make the following assumptions:


1. Children of parents who speak the same language receive examples only from the unique language their parents share, i.e., children of parents speaking L1 receive sentences drawn according to P1 and children of parents speaking L2 receive examples drawn according to P2.

2. Children of parents who speak different languages receive examples from an equal mixture of both languages, i.e., they receive examples drawn according to (1/2)P1 + (1/2)P2.

3. After k examples, children “mature” and whatever grammatical hypothesis they have, they retain for the rest of their lives.

Thus the learning algorithm A operates on the sentences it receives. These sentences in turn are drawn at random according to a probability distribution that depends on the parental type. We now define the following quantity:

g(A, P, k) = Σ_{w ∈ Dk : A(w) = g1} Π_{i=1}^{k} P(wi)        (9.2)

Recall that each element w ∈ Dk is a k-tuple of sentences. In Equation 9.2 we denote the ith sentence of w by wi. Therefore, g(A, P, k) is the probability with which the algorithm A hypothesizes grammar g1 given a random i.i.d. draw of k examples according to probability distribution P. Clearly, g characterizes the behavior of the learning algorithm A if sentences were drawn according to P. It is worthwhile to note that learnability (in the limit, in a stochastic generalization of Gold 1967) requires the following:

Statement 9.1 If the support of P is L1 then lim_{k→∞} g(A, P, k) = 1, and if the support of P is L2 then lim_{k→∞} g(A, P, k) = 0.

In practice, of course, we have made the assumption that children “mature” after k examples: so a reasonable requirement is that g be high if P has support on L1 and low if P has support on L2. Given this, we can now write down the probability with which children of each of the four parental types will attain the language L1. These are shown in Table 9.2. Thus we can express the bi’s in the CF model of cultural transmission in terms of the learning algorithm. This is reasonable because, after all, the bi’s attempt to capture the fact that traits are “learned” — in the case of languages, they are almost certainly learned from exposure to linguistic data.


Paternal Lang.   Maternal Lang.   P                       Prob. Child speaks L1
L1               L1               P1                      b3 = g(A, P1, k)
L1               L2               (1/2)P1 + (1/2)P2       b2 = g(A, (1/2)P1 + (1/2)P2, k)
L2               L1               (1/2)P1 + (1/2)P2       b1 = g(A, (1/2)P1 + (1/2)P2, k)
L2               L2               P2                      b0 = g(A, P2, k)

Table 9.2: The probability with which children attain each of the language types, L1 and L2, depends upon the parental linguistic types, the probability distributions P1 and P2, and the learning algorithm A.

Under random mating,⁴ we see that the population evolves according to Equation 9.1. Substituting the appropriate g's from Table 9.2 in place of the bi's, we obtain an evolution that depends upon P1, P2, A, and k.
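For any concrete learning algorithm A, the quantity g(A, P, k), and hence the bi of Table 9.2, can be estimated by simulation. The sketch below uses two toy "languages" of a few sentence types and a deliberately simple batch learner; all of these choices are illustrative assumptions, not the algorithms analyzed in this book.

```python
import random

# Toy languages as sets of sentence types; their overlap is the ambiguous part.
L1 = {"s1", "s2", "amb1", "amb2"}
L2 = {"s3", "s4", "amb1", "amb2"}
ONLY_L1, ONLY_L2 = L1 - L2, L2 - L1

def learner(data):
    """An illustrative batch learner: choose g1 if unambiguous L1 evidence is at
    least as frequent as unambiguous L2 evidence, otherwise choose g2."""
    n1 = sum(s in ONLY_L1 for s in data)
    n2 = sum(s in ONLY_L2 for s in data)
    return "g1" if n1 >= n2 else "g2"

def g(P, k, trials=20000):
    """Monte Carlo estimate of g(A, P, k): the probability the learner returns g1."""
    sents, weights = zip(*P.items())
    hits = sum(learner(random.choices(sents, weights=weights, k=k)) == "g1"
               for _ in range(trials))
    return hits / trials

P1 = {"s1": 0.3, "s2": 0.3, "amb1": 0.2, "amb2": 0.2}   # production by L1 speakers
P2 = {"s3": 0.3, "s4": 0.3, "amb1": 0.2, "amb2": 0.2}   # production by L2 speakers
mix = {s: 0.5 * P1.get(s, 0) + 0.5 * P2.get(s, 0) for s in L1 | L2}

k = 10
print(g(P1, k), g(mix, k), g(P2, k))   # these are b3, b2 (= b1), and b0 of Table 9.2
```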

9.3.2 An Alternative Approach

In previous chapters, we have developed a series of models (hereafter, we refer to this class of models as NB⁵ models) for the phenomenon, making the following simplifying assumptions.

1. The population can be divided into children (learners) and adults (sources).

2. All children in the population are exposed to sentences drawn from the same distribution.

3. The distribution with which sentences are drawn depends upon the distribution of language speakers in the adult population.

The equations for the evolution of the population under these assumptions were derived. Let us consider the evolution of two-language populations. At any point, one can characterize the state of the population by a single variable (st ∈ [0, 1]) denoting the proportion of speakers of L1 in the population. Further assume, as before, that speakers of L1 produce sentences with distribution P1 on the sentences of L1 and speakers of L2 produce sentences with distribution P2 on the sentences of L2.

⁴ I have only considered the case of random mating here for illustrative convenience. The extension to more assortative forms of mating can be carried out using the standard techniques in population biology.

⁵ After Niyogi and Berwick 1995, 1997, where the original formulation was articulated and analyzed.


The evolution of st over time (the time index t denotes generation number) was derived in terms of the learning algorithm A, the distributions P1 and P2, and the maturation time k. This has the form

s_{t+1} = f(s_t) = g(A, s_t P1 + (1 − s_t)P2, k)

The interpretation is clear. If the previous state was st, then children are exposed to sentences drawn according to st P1 + (1 − st)P2. The probability with which the average child will attain L1 is correspondingly provided by g, and therefore one can expect that this will be the proportion of L1 speakers in the next generation, i.e., after the children mature to adulthood. In previous chapters, I have derived the specific functional form of the update rule f (equivalently g) for a number of different learning algorithms. In the next section, I show how these two approaches to characterizing the evolutionary dynamics of linguistic populations are related. Specifically, I show how the evolutionary update rule f in the NB framework is explicitly related to the update rule in the CF framework.

9.3.3 Transforming NB Models into the CF Framework

Let the NB update rule be given by s_{t+1} = f(s_t). Then, we see immediately that:

1. b3 = f(1)

2. b2 = b1 = f(0.5)

3. b0 = f(0)

The CF update rule is now given by Equation 9.1. The update, as we have noted, is quadratic and the coefficients can be expressed in terms of the NB update rule f. Specifically, the system evolves as

s_{t+1} = (f(1) + f(0) − 2f(0.5)) s_t^2 + (2f(0.5) − 2f(0)) s_t + f(0)        (9.3)

Thus we see that if we are able to derive the NB update rule, we can easily transform it to arrive at the CF update rule for evolution of the population. The difficulty of deriving both rules rests upon the difficulty of deriving the quantity g that appears in them. Notice further that the CF update rule is always quadratic, while the NB update rule is in general not quadratic. Remarks. It is worthwhile to reflect on the difference in the evolutionary dynamics of CF- and NB-type models:


1. The evolutionary dynamics of the CF model depends upon the value of f at exactly 3 points. Thus one might have very different update rules in the NB model corresponding to different iterated maps f, yet if these different f’s agree at x = 0, 1/2, and 1, the corresponding CF update would be the same.

2. If f is linear, then the NB and CF update rules are exactly the same. If f is nonlinear, these update rules potentially differ.

3. The CF update is a quadratic iterated map and has one stable fixed point to which the population converges from all initial conditions. The NB model may have multiple stable fixed points.

4. For some learning algorithms, there may be qualitatively similar evolutionary dynamics for NB and CF models. For example, in the case of the Triggering Learning Algorithm (TLA) discussed below, this is the case. For some other learning algorithms, the qualitative behavior may be quite different. This is the case for the batch- and cue-based learning algorithms, which are bistable in the NB framework but have a single stable fixed point in the CF formulation.

The essential difference in the nature of the two update rules stems from the different assumptions made in the modeling process. Particularly, Niyogi and Berwick (1995,1997) assume that all children receive input from the same distribution. Cavalli-Sforza and Feldman (1981) assume that children can be grouped into four classes depending on their parental type. The crucial observation at this stage is that by dividing the population of children into classes that are different from each other, we derive alternative evolutionary laws. In a later section we utilize this observation to divide children into classes that depend on their geographic neighborhood. This will allow us to derive a generalization of the NB model for neighborhoods. Before proceeding any further, let us now translate the update rules derived in Niyogi and Berwick (1995,1997) into the appropriate CF models. The update rules are derived for memoryless learning algorithms operating on grammars. We consider an application to English with grammars represented in the Principles and Parameters framework.
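Before doing so, it is worth noting how mechanical the translation of Equation 9.3 is once f is in hand. In the sketch below, f is an arbitrary illustrative threshold-type update, not one derived for a particular learner in this book.

```python
from math import comb

def f(s, N=10, K=7):
    """An illustrative nonlinear (bistable) NB update: P(Binomial(N, s) >= K)."""
    return sum(comb(N, i) * s ** i * (1 - s) ** (N - i) for i in range(K, N + 1))

def nb_to_cf(f):
    """Return the CF quadratic update built from an NB update rule f (Equation 9.3)."""
    b3, b10, b0 = f(1.0), f(0.5), f(0.0)
    return lambda s: (b3 + b0 - 2 * b10) * s ** 2 + (2 * b10 - 2 * b0) * s + b0

def iterate(update, s0, generations=50):
    s = s0
    for _ in range(generations):
        s = update(s)
    return round(s, 3)

cf = nb_to_cf(f)
for s0 in (0.3, 0.9):
    print(s0, iterate(f, s0), iterate(cf, s0))
# The NB rule is bistable (0.3 falls to 0, 0.9 rises to 1); its quadratic CF
# translation has a single stable fixed point, so both starting points end up there.
```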

9.4 CF Models for Some Simple Learning Algorithms

In this section we consider some simple learning algorithms⁶ (like the Triggering Learning Algorithm of Gibson and Wexler 1994, and the batch- and cue-based learners of previous chapters) and show how their analysis within the NB model can be plugged into Equation 9.3 to yield the dynamics of linguistic populations under the CF model.

9.4.1 TLA and Its Evolution

How will the population evolve if the learning algorithm A in question is the Triggering Learning Algorithm (or related memoryless learning algorithms in general)? The answer is simple. We know how the TLA-driven system evolves in the NB model (from analyses in previous chapters). All we need to do is to plug such an evolution into Equation 9.3 and we are done. Recall that the TLA is as follows:

1. Initialize: Start with a randomly chosen input grammar.

2. Receive the next input sentence, s.

3. If s can be parsed under the current hypothesis grammar, go to 2.

4. If s cannot be parsed under the current hypothesis grammar, choose another grammar uniformly at random.

5. If s can be parsed by the new grammar, retain the new grammar; else go back to the old grammar.

6. Go to 2.

We have already seen that such an algorithm can be analyzed as a Markov chain whose state space is the space of possible grammars and whose transition probabilities depend upon the distribution P with which sentences are drawn. Using such an analysis, the function f can be computed. For the

⁶ These algorithms have been chosen here for illustrative purposes to develop the connections between individual acquisition and population change in a concrete manner in both NB and CF models. Replacing them by other learning algorithms does not alter the spirit of the major points I wish to make in this chapter but rather the details of some of the results we might obtain here. In general, acquisition algorithms can now be studied from the point of view of adequacy with respect to historical phenomena, a point that I have elaborated at length in previous chapters.


case of two grammars (languages) in competition under the assumptions of the NB model, this function f is seen to be

f(s_t) = s_t(1 − a) / [(1 − b) + s_t(b − a)] + [b − s_t(b − a)]^k [(1 − b) + s_t(a + b − 2)] / (2[(1 − b) + s_t(b − a)])        (9.4)

In Equation 9.4, the evolving quantity s_t is the proportion of L1 speakers in the community. The update rule depends on parameters a, b, and k that need further explanation. The parameter a is the probability with which ambiguous sentences (sentences that are parsable by both g1 and g2) are produced by L1 speakers, i.e., a = Σ_{w ∈ L1 ∩ L2} P1(w); similarly, b is the probability with which ambiguous sentences are produced by L2 speakers, i.e., b = Σ_{w ∈ L1 ∩ L2} P2(w). Finally, k is the number of sentences that a child receives from its linguistic environment before maturation. It is interesting to note that the only way in which the update rule depends upon P1 and P2 is through the parameters a and b that are bounded between 0 and 1 by construction. It is not obvious from Equation 9.4 but it is possible to show that f is a polynomial (in s_t) of degree k. Having obtained f(s_t), one obtains the quadratic update rule of the CF model by computing the bi's according to the formulas given in the earlier section. These are seen to be as follows:

b3 = 1 − a^k/2 ;   b0 = b^k/2 ;   b1 = b2 = (1 − a)/[(1 − a) + (1 − b)] + ((a + b)/2)^k (a − b) / (2[(1 − a) + (1 − b)])
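A short numerical sketch of these dynamics follows; it is not from the book, and the values of a, b, and k are arbitrary illustrative choices with a > b, so that ambiguous sentences are more frequent among L1 speakers.

```python
def f_tla(s, a, b, k):
    """NB update rule for the TLA with two grammars (Equation 9.4)."""
    denom = (1 - b) + s * (b - a)
    return (s * (1 - a) / denom
            + (b - s * (b - a)) ** k * ((1 - b) + s * (a + b - 2)) / (2 * denom))

def cf_from_f(f):
    """CF quadratic update obtained from f via Equation 9.3."""
    b3, b10, b0 = f(1.0), f(0.5), f(0.0)
    return lambda s: (b3 + b0 - 2 * b10) * s ** 2 + (2 * b10 - 2 * b0) * s + b0

a, b, k = 0.4, 0.33, 10
nb = lambda s: f_tla(s, a, b, k)
cf = cf_from_f(nb)

s_nb = s_cf = 0.1                 # start with 10 percent L1 speakers
for _ in range(30):
    s_nb, s_cf = nb(s_nb), cf(s_cf)
print(round(s_nb, 3), round(s_cf, 3))
# With a > b the L1 fraction declines toward 0 in both models;
# the NB decline is the faster of the two.
```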

The following remarks are in order:

1. For k = 2, i.e., where children receive exactly two sentences before maturation, both the NB and CF models yield quadratic update rules for the evolution of the population. For the NB model, the following is true: (i) for a = b, there is exponential growth (or decay) to one fixed point of p* = 1/2, i.e., populations evolve until both languages are in equal proportion and they coexist at this level; (ii) for a ≠ b, there is logistic growth (or decay) and in particular, if a < b, then there is one stable fixed point p*(a, b) whose value depends upon a, b, and is greater than 1/2. If a > b then there is again one stable fixed point p*(a, b) that is less than 1/2. Populations tend to the stable fixed point from all initial conditions in logistic fashion. The value of p* as a function of a and b is shown in Figure 9.1.

2. For k = 2, the evolution of the CF model is as follows: (i) for a = b, there is exponential growth (or decay) to one fixed point of p* = 1/2;


Figure 9.1: The fixed point p*(a, b) for various choices of a and b for the NB model with k = 2.

(ii) for a ≠ b, there is still one stable fixed point whose value can be seen as a function of a and b in Figure 9.2; (iii) for b > a, the value of this fixed point is greater than 1/2; for a > b, the value is less than 1/2. While the overall qualitative behavior of the two models for this value of k is quite similar, the value of p*(a, b) is not identical. This can be seen from Figure 9.3, where we plot the difference (between p*_NB and p*_CF) in values of the fixed point obtained for each choice of a and b.

3. If one considers the limiting case where k → ∞, i.e., where children are given an infinite number of examples to mature, then the evolution of both the NB and the CF models have the same qualitative character. There are three cases to consider: (i) for a = b, we find that s_{t+1} = s_t, i.e., there is no change in the linguistic composition; (ii) for a > b, the population composition s_t tends to 0; (iii) for a < b, the population composition s_t tends to 1. Thus one of the languages drives the other out and the evolutionary change proceeds to completion. However, the rates at which this happens differ under the differing assumptions of the NB and the CF models. This difference is explored in a later section as we consider the application of the models to the historical evolution of English syntax. It is worthwhile to add that in real life, a = b is unlikely to be exactly true — therefore language contact between populations is likely to drive one out of existence.


Figure 9.2: The fixed point p*(a, b) for various choices of a and b for the CF model with k = 2.


Figure 9.3: The difference in the values of p*(a, b) for the NB model and the CF model, p*_NB − p*_CF, for various choices of a and b with k = 2. A flat surface taking a value of zero at all points would indicate that the two were identical. This is not the case.


Additionally, the limiting case of large k is also more realistic since children typically get adequate primary linguistic data over their learning years in order to acquire a unique target grammar with high probability in homogeneous linguistic communities where a unique target grammar exists. In the treatment of this chapter, we have always assumed that learners attain a single target grammar. Often, when two languages come in contact, learners typically attain both grammars in addition to a reasonable understanding of the social and statistical distribution of the two grammars in question. This can be handled within the framework we discuss here by requiring the learner to learn (estimate) a mixture factor (λ ∈ [0, 1], say) that decides in what proportion the two grammars are to be used. A value of λ = 0 or λ = 1 would then correspond to the case where the learner had attained a unique grammar. One can then analyze a population of such learners to characterize their evolutionary consequences. I do not discuss such an analysis here.

9.4.2 Batch- and Cue-Based Learners

Consider the batch learner of Chapter 5. The update rule is provided by

α_{t+1} = f(α_t) = Σ_{(n1, n2, n3) : n1 ≥ n3, n1 + n2 + n3 = K} [K! / (n1! n2! n3!)] p1(α_t)^{n1} p2(α_t)^{n2} p3(α_t)^{n3}

where p1(α_t) = α_t(1 − a); p3(α_t) = (1 − α_t)(1 − b); p2(α_t) = aα_t + b(1 − α_t). An analysis of this iterated map reveals that for all 0 < a, b < 1, the system has two stable fixed points α = α* and α = 1 where α* < 1. There is one unstable fixed point α ∈ (α*, 1) in between. This instantiates a case in which the NB dynamics is bistable. In contrast, the CF model will have only one stable fixed point. In fact, in the CF model, we see that 1 is always a fixed point. In this setting, the CF update rule is given by

s_{t+1} = (f(0) + 1 − 2f(1/2)) s_t^2 + 2(f(1/2) − f(0)) s_t + f(0)

Putting s_t = 1, we see that 1 is a fixed point of the CF system. The derivative at 1 is given by

2(f(0) + 1 − 2f(1/2)) + 2(f(1/2) − f(0)) = 2(1 − f(1/2))

Thus we see that if f(1/2) > 1/2, then s = 1 is a stable fixed point; otherwise s = 1 is unstable. In contrast, of course, the NB system is always stable at 1.
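The bistability of this NB map can be checked directly; in the sketch below, K, a, and b are arbitrary illustrative values.

```python
from math import factorial

def multinomial(K, n1, n2, n3):
    return factorial(K) // (factorial(n1) * factorial(n2) * factorial(n3))

def f_batch(alpha, a, b, K):
    """NB update for the batch learner: P(n1 >= n3) over K draws, where n1, n2, n3
    count unambiguous-L1, ambiguous, and unambiguous-L2 sentences respectively."""
    p1 = alpha * (1 - a)
    p2 = a * alpha + b * (1 - alpha)
    p3 = (1 - alpha) * (1 - b)
    total = 0.0
    for n1 in range(K + 1):
        for n3 in range(K - n1 + 1):
            if n1 >= n3:
                n2 = K - n1 - n3
                total += multinomial(K, n1, n2, n3) * p1**n1 * p2**n2 * p3**n3
    return total

a, b, K = 0.3, 0.4, 12
for alpha0 in (0.05, 0.6):
    alpha = alpha0
    for _ in range(100):
        alpha = f_batch(alpha, a, b, K)
    print(alpha0, round(alpha, 3))   # two different stable modes: the NB dynamics is bistable
```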


One may also consider the cue-based learner of Chapter 5. Here the NB dynamics is given by

α_{t+1} = f(α_t) = Σ_{i ≥ Kτ} (K choose i) (pα_t)^i (1 − pα_t)^{K−i}

There is a regime where the NB dynamics has only one stable fixed point α = 0 and a regime where the NB dynamics has two stable fixed points α = 0 and α = α* > 0. Of course, as usual, the CF dynamics has exactly one stable fixed point. Since f(0) = 0, we see that the CF dynamics is given by

s_{t+1} = (f(1) − 2f(1/2)) s_t^2 + 2f(1/2) s_t

Clearly, we see that s = 0 is always a fixed point. To see the stability of this fixed point, we need to differentiate at 0. We obtain

2f(1/2) < 1 ⇔ s = 0 is stable

Thus when f(1/2) < 1/2, we see that s = 0 is the only stable fixed point. When f(1/2) > 1/2, we have

f(1/2) > 1/2 ≥ f(1)/2

In this regime, the stable fixed point of the CF system is given by

s = (2f(1/2) − 1) / (2f(1/2) − f(1))
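The closed-form CF fixed point can be checked against direct iteration; a small sketch in which p, K, and the cue threshold τ are illustrative values:

```python
from math import comb, ceil

def f_cue(alpha, p=0.55, K=10, tau=0.3):
    """NB update for the cue-based learner: P(at least ceil(K*tau) of K examples are cues)."""
    x = p * alpha
    return sum(comb(K, i) * x**i * (1 - x)**(K - i) for i in range(ceil(K * tau), K + 1))

b3, b_half = f_cue(1.0), f_cue(0.5)        # note f(0) = 0 for this learner
cf = lambda s: (b3 - 2 * b_half) * s**2 + 2 * b_half * s

s = 0.5
for _ in range(200):
    s = cf(s)
predicted = (2 * b_half - 1) / (2 * b_half - b3)
print(round(s, 3), round(predicted, 3))    # direct iteration matches the closed-form fixed point
```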

In summary, from the preceding discussion we see that the evolutionary characteristics of a population of linguistic agents can be precisely derived under certain simplifying assumptions. I have shown how the differing assumptions of the NB model and the CF model yield dynamical systems with different behaviors and how these models relate to each other.

9.4.3 A Historical Example

So far our development has been fairly abstract. To ground the current discussion in a particular context, let us consider the phenomena surrounding the evolution of Old English to Modern English and its treatment within both kinds of models. One of the significant changes in the syntax of English as it evolved from the 9th to the 14th century is the change in its word order. Consider, for example, the following passage taken from the Anglo-Saxon Chronicles (878 A.D.) and reproduced in Trask 1996:


Her ... Ælfred cyning ... gefeaht wiþ ealne here, and hine
Here ... Alfred king ... fought against whole army and it
geflymde, and him aefter rad oþ þet geweorc, and þaer saet
put to flight and it after rode to the fortress and there camped
XIIII niht, and þa sealde se here him gislas and myccle
fourteen nights and then gave the army him hostages and great
aþas, þet hi of his rice woldon, and him eac geheton
oaths that they from his kingdom would [go] and him also promised
þet heora cyng fulwihte onfon wolde, and hi þaet gelaston ...
and their king baptism receive would and they that did

The original text is in italics and a word-for-word translation (gloss) is provided immediately below each line of the passage. Some phrases have been underlined to indicate the unusual word order prevalent in the writing of the times. Sampling the historical texts over the period from Old to Middle English, one finds that the early period shows three major alternations: (i) verb phrases (VP) may show object-verb (OV) or verb-object (VO) order; (ii) the inflectional head (I) may precede (I-medial) or follow (I-final) the verb phrase; and (iii) there may or may not be movement of the inflected verb to the head of CP (complementizer position in clauses) (following the notation of Government and Binding Theory; see Haegeman 1991). For the purposes of the discussion in this chapter, I will collapse the OV/VO and I-final/I-medial distinctions into a single head-complement parameter within the rubric of the Principles and Parameters approach to grammatical theory. The movement of the finite verb to second position is related to the V2 parameter — modern German and Dutch are +V2 while modern English is −V2. Therefore, the two grammatical parameters at issue are:

1. The head-complement parameter. This denotes the order of constituents in the underlying phrase-structure grammar. Recall from X-bar theory that phrases XP have a head (X) and a complement, e.g., the verb phrase ate with a spoon and the prepositional phrase with a spoon have as a head the verb ate and the preposition with respectively. Grammars of natural languages could be head-first or head-final. Thus X-bar phrase-structure rules have the form (X and Y are arbitrary syntactic categories in the notation below):

head-first: (i) XP → X′ YP  (ii) X′ → X
head-final: (i) XP → YP X′  (ii) X′ → X


2. The V2 parameter. This denotes the tendency in some languages where the finite verb moves from its base position to the head of the complementizer (C of CP) by V to I to C raising. The specifier of CP has to be filled, resulting in the verb appearing to be in the second position in a linear order of constituents. Grammars of natural languages could be +V2 or −V2. Thus

+V2: Obligatory movement of V to I to C, and the specifier of CP is filled.
−V2: V2 movement absent.

Modern English is exclusively head-first and −V2. Old English seems to be largely head-final and +V2. How did such remarkable changes in grammars occur? There are several competing accounts for these changes (see chapters by Kroch and Taylor; Lightfoot; and Warner in Van Kemenade 1997 for discussions) but there seems to be some agreement that there were two competing grammars — a northern Scandinavian-based +V2 grammar and a southern indigenous −V2 grammar. The first of these grammars was lost as the populations came into contact. Invoking learnability arguments as an explanation for such a change, Lightfoot (1997, 265–266) writes:

Children in Lincolnshire and Yorkshire, as they mingled with southerners, would have heard sentences whose initial elements were non-subjects followed by a finite verb less frequently than the required threshold; if we take seriously the statistics from the modern V2 languages and take the threshold to be about 30% of matrix clauses with initial non-subject in Spec of CP, then southern XP-Vf forms, where the Vf is not I-final and where the initial element is not a wh item or negative, are too consistently subject-initial to trigger a V2 grammar. [PN remark: implying that the +V2 grammar was therefore lost over time].

These are the kinds of arguments that can be modeled precisely and tested for plausibility within the framework I have discussed here. I will not attempt in this section to do justice to the various accounts of the historical change of English in a serious manner, because this is well beyond the scope of the current chapter. However, for illustrative purposes, I discuss below the evolutionary trajectories of populations with two competing grammar types that come into contact. The grammar types have been chosen to capture the parametric oppositions that played themselves out over the course of the historical evolution of English.


Case 1: +V2/−V2 for Head-first Grammars

Imagine that two linguistic populations came together and the two languages in competition differed only by one parameter — the V2 parameter. Further assume that all other grammatical parameters of these two languages were identical to modern English. Children growing up in the mixed communities would hear sentences from both grammatical types. Suppose they set (learned) all other grammatical parameters correctly, and it was only in the V2 parameter that children differed from each other in how they set it — i.e., some acquired the +V2 grammar and some acquired the −V2 grammar. How would the population evolve? Would the +V2 grammar die out over time? What conditions must exist for this to happen? These questions can be addressed within the framework that we have developed over the course of this book. To begin with, we need to identify the sets L1 and L2. Following Gibson and Wexler 1994, we derive the set of degree-0 sentences⁷ (with no recursion) that are associated with the +V2 and −V2 grammars. These are listed below, where S = subject, V = verb, O1 = direct object, O2 = indirect object, Aux = auxiliary, and Adv = adverb.

g1: −V2; Head-first; Spec-first
L1 = { S V, S V O, S V O1 O2, S Aux V, S Aux V O, S Aux V O1 O2, Adv S V, Adv S V O, Adv S V O1 O2, Adv S Aux V, Adv S Aux V O, Adv S Aux V O1 O2 }

The grammar underlying these sentences corresponds to that of modern English. For example, the sentence type (S Aux V O1 O2) maps to realized (lexical sequences) sentences like John will eat beef in London.

g2: +V2; Head-first; Spec-first
L2 = { S V, S V O, O V S, S V O1 O2, O1 V S O2, O2 V S O1, S Aux

⁷ Of course, both L1 and L2 have infinitely many sentences. Recall that the evolutionary properties of the population will depend upon the probability distributions P1 and P2 with which sentences are produced. In practice, due to cognitive limitations, speakers produce sentences with bounded recursion. Therefore P1 and P2 will have effective support on a finite set only. Furthermore, the learning algorithm of the child A operates on sentences, and a psycholinguistic premise is that children learn only on the basis of degree-0 sentences (Gibson and Wexler 1994; Lightfoot 1991) and all sentences with recursion are ignored in the learning process. I have adopted this premise for the purposes of this discussion. Therefore only degree-0 sentences are considered in this analysis.


V, S Aux V O, O Aux S V, S Aux V O1 O2, O1 Aux S V O2, O2 Aux S V O1, Adv S V, Adv V S O, Adv V S O1 O2, Adv Aux S V, Adv Aux S V O, Adv Aux S V O1 O2 }

This grammar requires obligatory movement of the inflected verb to second position (actually to C, and the specifier of CP must be filled). Thus, an example of a sentence (not following English word order, of course) corresponding to the sentence type Adv V S O1 O2 is often saw we many students in London. The ambiguous sentence types are those that have different but valid parses under each of the two grammatical systems. For example, the sentence type S V corresponding to the sentence John left belongs to each of the two languages (extensionally). However, this surface form has two different derivations. Under g1, the subject John is the specifier of the inflectional head. Under g2, the verb left moves from I to C and the +V2 constraint forces the subject to move to occupy the specifier of the C. To an external agent, it is not obvious what the underlying parse is and therefore whether there is underlying verb movement or not. The set of ambiguous sentence types is simply given by intersecting the two languages:

L1 ∩ L2 = { S V, S V O, S V O1 O2, S Aux V, S Aux V O, S Aux V O1 O2 }

In previous sections, we have considered several variants of both the CF and NB models for two languages in competition. Recall that when the learning algorithm is the TLA, for large k, the qualitative behavior of the two models is similar. In particular, L1 would drive L2 out from all initial conditions if and only if a < b. Here a is the probability measure on the set of ambiguous sentences produced by speakers of L1, and b is the probability measure on the set of ambiguous sentences produced by speakers of L2. This situation would lead to the loss of +V2 grammar types over time. Under the unlikely but convenient assumption that P1 and P2 are uniform distributions on degree-0 sentences of their respective languages (L1 and L2), we see that

a = 1/2 > b = 1/3

Therefore, the +V2 grammar, rather than being lost over time, would tend to be gained over time. Shown in Figure 9.4 are the evolutionary trajectories in the CF and NB models for various choices of a and b. Some further remarks are in order:


Figure 9.4: Trajectories of V2 growth. Shown in the figure are the evolving trajectories of st = proportion of +V2 grammars in the population over successive generations. The solid curves denote the evolutionary trajectories under the NB model; the dashed curves denote the trajectories under the CF model. Two different initial population mixes are considered: (a) 0.1 initial +V2 speakers; (b) 0.5 initial +V2 speakers. For each initial mix and each model (CF and NB), the upper curve (faster change) corresponds to the choice a = 0.5, b = 0.33 and the lower curve to a = 0.4, b = 0.33. Notice that in this regime, the NB model has a faster rate of change than the CF model.


1. The directionality of change is predicted by the relationship of a with b. While uniform distributions of degree-0 sentences predict that the V2 parameter would be gained rather than lost over time, the empirical validity of this assumption needs to be checked. From corpora of child-directed sentences in synchronic linguistics, and aided perhaps by some historical texts, one might try to empirically assess the distributions P1 and P2 by measuring how often each of the sentence types occurs in spoken language and written texts. These empirical measures have not been constructed.

2. The dynamical systems that we have derived and applied to this particular case hold only for the case of memoryless learning algorithms like the TLA. In previous chapters, we discussed a larger variety of learning algorithms and noted that their evolutionary consequences were potentially quite different.

Case 2: OV/VO for +V2 Grammars

Here we consider a head-first (comp-final) grammar in competition with a head-final (comp-first) grammar where both are +V2 grammars that have the same settings for all other parameters — settings that are the same as those of Modern English. Therefore, one of the two grammars (head-first setting) is identical to modern English except for the V2 parameter. It is also the same as g2 of the previous section. The other grammar differs from Modern English by two parameters. As in the previous section, following Gibson and Wexler 1994 we can derive the degree-0 sentences associated with each of the two languages. We do this below:

g1: +V2; Head-first; Spec-first
L1 = { S V, S V O, O V S, S V O1 O2, O1 V S O2, O2 V S O1, S Aux V, S Aux V O, O Aux S V, S Aux V O1 O2, O1 Aux S V O2, O2 Aux S V O1, Adv S V, Adv V S O, Adv V S O1 O2, Adv Aux S V, Adv Aux S V O, Adv Aux S V O1 O2 }

This grammar is the same as g2 of the previous section. The competing grammar is head-final but with a +V2 parameter setting.


g2: +V2; Head-final; Spec-first
L2 = { S V, S V O, O V S, S V O2 O1, O1 V S O2, O2 V S O1, S Aux V, S Aux O V, O Aux S V, S Aux O2 O1 V, O1 Aux S O2 V, O2 Aux S O1 V, Adv V S, Adv V S O, Adv V S O2 O1, Adv Aux S V, Adv Aux S O V, Adv Aux S O2 O1 V }

An example of a sentence type corresponding to Adv V S O2 O1 is often saw we in London many students. We can therefore straightforwardly obtain the set L1 ∩ L2 as

L1 ∩ L2 = { S V, S V O, O V S, O1 V S O2, O2 V S O1, S Aux V, O Aux S V, Adv V S O, Adv Aux S V }

Assuming P1 and P2 are uniform distributions on the degree-0 sentences of their respective languages, we see that

a = 1/2 = b

Therefore, under the assumptions of both the NB and the CF models, there is no particular tendency for one grammar type to overwhelm the other. Language mixes would remain the same. If, for some reason, a became slightly less than b, we see that the head-final (comp-first) language would be driven out and only the head-first language would remain. This would replicate the historically observed trajectory for the case of English. The rate is faster for the NB model than it is for the CF model.

A Final Note

Taking stock of our modeling results, we see that when a +V2 and a −V2 grammar come together (other parameters being the same), there is an inherent asymmetry with the −V2 grammar being more likely to lose out in the long run. On the other hand, when a head-first and head-final grammar come together, there is no particular proclivity to change — the directionality could go either way. The reason for this asymmetry is seen to be in the asymmetry in the number of surface degree-0 sentences that are compatible with each of the grammars in question, with +V2 grammars giving rise to a larger variety of surface sentences. Therefore, ambiguous sentences (those parsable with both +V2 and −V2 constraints) constitute a smaller proportion of the total sentence types of such grammars, leading to a directional asymmetry in values of a and b in the model framework. In conclusion, however, it is worthwhile to reiterate our motivation in working through this particular example of syntactic change in English.



Figure 9.5: Trajectories of head-first growth. Shown in the figure are the evolving trajectories of st = proportion of head-first grammars in the population over successive generations. The solid curves denote the evolutionary trajectories under the NB model; the dashed curves denote the trajectories under the CF model. Two different initial population mixes are considered: (a) 0.1 initial head-first speakers; (b) 0.5 initial head-first speakers. For each initial mix and each model (CF and NB), the upper curve (faster change) corresponds to a = 0.4 and b = 0.6, and the lower curve to a = 0.47 and b = 0.53. Notice that the NB model has a faster rate of change than the CF model.


There are many competing accounts of how English changed over the years. Among other things, these accounts differ in (i) the precise grammatical characterization of the two grammars in competition, (ii) the number of parametric changes that occurred and their description in the context of a grammatical theory, (iii) the nature of the learning mechanism that children employ in learning grammars (e.g., monolingual versus bilingual acquisition), and so on. Each of these factors can be modeled, and the plausibility of any particular account can then be assessed. To give the reader a sense of how this might happen in a linguistically grounded manner, we worked through these examples — not to make a linguistic point but to demonstrate the applicability of this kind of computational thinking to historical problems.

9.5 A Generalized NB Model for Neighborhood Effects

The basic model for vertical transmission of cultural (linguistic) traits by Cavalli-Sforza and Feldman (1981) proceeds by dividing children into four classes depending upon their parental types. The children of each class then receive input sentences from a different distribution depending upon their parental type. The Niyogi and Berwick approach, on the other hand, assumes that all children in the population receive inputs from the same distribution, one that depends on the linguistic composition of the entire parental generation. In this section, we consider a generalization of both approaches with a particular view to modeling "neighborhood" effects in linguistic communities. The key idea here is that in heterogeneous language communities, speakers often tend to cluster in linguistically homogeneous neighborhoods. Consequently, children growing up in the community might receive data drawn from different distributions depending upon their spatial location within the community at large. Imagine as usual a two-language population consisting of speakers of L1 or L2. We now let the parental generation of speakers reside in adjacent neighborhoods. Children receive sentences drawn from different distributions depending upon their location in this neighborhood. At one end of the scale, children receive examples drawn only from L1. At the other end of the scale, children receive examples drawn only from L2. In the middle — at the boundary between the two neighborhoods as it were — are children who receive examples drawn from both sources. Let us develop the notion further. Let children of type α be those who receive examples drawn according to a distribution P = αP1 + (1 − α)P2.



Figure 9.6: Examples of h mappings between the location n and the α-type of the children occupying that location. Here the value of st (proportion of L1 speakers) is taken to be 0.3 for illustrative purposes. Therefore the interval [0, 0.3] is the L1-speaking neighborhood; the interval [0.3, 1] is the L2-speaking neighborhood. For any location n, the value of h(n) represents the proportion of L1 speakers the child occupying that location is exposed to.

Here P1 is the probability with which speakers of L1 produce sentences and P2 is the probability with which speakers of L2 produce sentences. The quantity α ∈ [0, 1] is the proportion of L1 speakers that an α-type child is effectively exposed to. Children will be of different α-types depending upon their spatial location. How do we characterize location? Let location be indicated by a one-dimensional real-valued variable n in the interval [0, 1]. Let speakers be uniformly distributed on this interval so that speakers of L1 are close to n = 0 and speakers of L2 are close to n = 1. Let the proportion of L1 speakers in the population be st. Therefore, all children located in [0, st] are in the L1-speaking neighborhood and all children located in [st, 1] are in the L2-speaking neighborhood. Let us now define the mapping from neighborhood to α-type by α = h(n), where h : [0, 1] → [0, 1]. We leave the exact form of h undefined, except to note that it should possess certain reasonable properties: h(0) should be close to 1, h(1) should be close to 0, h(st) should be close to 1/2, and h should be monotonically decreasing. Shown in Figure 9.6 are some plausible mappings h that mediate the relation between the location of the child in the neighborhood and its α-type; the x-axis denotes location and the y-axis the α-type of a learner.


We now have learners distributed uniformly in location and a mapping from location to α-type provided by h. One can therefore easily compute the probability distribution of children by α-type: this is just the distribution of the random variable α = h(n), where n is uniform. Let this distribution be Ph(α) over [0, 1]. Now a child (learner) of type α receives sentences drawn according to P = αP1 + (1 − α)P2. According to the notation developed earlier, it therefore has a probability f(α) of attaining the grammar of L1. (This is provided by an analysis of the learning algorithm in the usual way, i.e., f(α) = g(A, αP1 + (1 − α)P2, k).) Therefore, if children of type α are distributed in the community according to the distribution Ph(α) and each child of type α attains L1 with probability f(α), we see that in the next generation the proportion of speakers of L1 is provided by Equation 9.5:

st+1 = ∫_0^1 Ph(α) f(α) dα     (9.5)

9.5.1 A Specific Choice of Neighborhood Mapping

For purposes of illustration, let us choose a specific form for h. In particular, let us assume that it is piecewise linear in the following way (Equation 9.6; the solid line of Figure 9.6):

h(0) = 1;  h(1) = 0;  h(st) = 1/2;
h(n) = 1 − n/(2st) for n < st;  h(n) = (1 − n)/(2(1 − st)) for n > st     (9.6)

Thus, clearly, h is parameterized by st. For such an h, it is possible to show that Ph is piecewise uniform, given by the following:

Ph(α) = 2st if α > 1/2;  Ph(α) = 2(1 − st) if α < 1/2;  Ph(α) = 0 if α ∉ [0, 1]     (9.7)

In previous sections, we discussed the form of the NB update rule f = g(A, st P1 + (1 − st)P2, k) for some specific choices of learning algorithms. From Equation 9.4, we see that it is a polynomial of degree k. Putting this into Equation 9.5, we get the update rule with neighborhood effects to be

st+1 = 2(1 − st) ∫_0^{1/2} f(α) dα + 2st ∫_{1/2}^1 f(α) dα     (9.8)
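A minimal numerical sketch of Equation 9.8 (Python). The map f used here is a generic monotone placeholder, not one of the specific learning-algorithm maps derived earlier; the closed-form fixed point printed at the end is the one derived in item 1 just below.

```python
import numpy as np

def f(alpha):
    # Placeholder monotone acquisition map on [0, 1]; in the text f would come
    # from the analysis of a particular learning algorithm.
    return alpha**2 / (alpha**2 + (1 - alpha)**2)

# Numerical values of the two integrals in Equation 9.8 (simple midpoint average).
xs1 = np.linspace(0.0, 0.5, 20001)
xs2 = np.linspace(0.5, 1.0, 20001)
A1 = np.mean(f(xs1)) * 0.5            # ∫_0^{1/2} f(α) dα
A2 = np.mean(f(xs2)) * 0.5            # ∫_{1/2}^1 f(α) dα

s = 0.1                               # illustrative initial proportion of L1 speakers
for t in range(50):
    s = 2 * (1 - s) * A1 + 2 * s * A2  # Equation 9.8: linear in s

s_star = 2 * A1 / (1 + 2 * (A1 - A2))  # closed-form fixed point (derived below)
print(round(s, 4), round(s_star, 4))   # the iterate converges to the fixed point
```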

Since α is a dummy variable in the above integrals, the effect of the neighborhood is to reduce the update rule to a linear one. This is in striking contrast to the original NB update rule (a degree-k polynomial) and the CF update rule (quadratic). It is worthwhile to reflect on a few aspects of such behavior:

1. The linear map implies an exponential growth (or decay) to a stable fixed point whose value is given by

s* = 2 ∫_0^{1/2} f(α) dα / (1 + 2(∫_0^{1/2} f(α) dα − ∫_{1/2}^1 f(α) dα))

2. Notice that s* = 0 requires ∫_0^{1/2} f(α) dα = 0. Correspondingly, s* = 1 requires ∫_{1/2}^1 f(α) dα = 1/2. Neither is very likely — therefore, no language is likely to be driven out of existence completely. If one chooses the update rule f for large k (= ∞), one can compute these quantities exactly. It is then possible to show that the fixed point s* is never 0 or 1. In contrast, both the NB and CF models result in one language becoming extinct if a ≠ b.

We see that the particular form of the update rule obtained with such neighborhood effects depends upon the functional form of the mapping h. In general, however, this approach allows us to compute the evolutionary trajectories of populations where children have arbitrary α-types. It is worthwhile to recall the original CF and NB models of the previous sections in this light. The CF model is derivable from this perspective with a particular choice of Ph(α), namely a probability mass function with Ph(α = 1) = st^2; Ph(α = 1/2) = 2st(1 − st); Ph(α = 0) = (1 − st)^2. The NB model of previous sections is equivalent to choosing Ph(α) to be a delta function, i.e., Ph(α) = δ(α − st).

Remark. It is important to recognize two aspects of the neighborhood model introduced here. First, the function h is not a fixed function but depends upon the proportion st of L1 speakers at any time; therefore, h changes from generation to generation (as st evolves). Second, the population of mature adults is always organized into two linguistically homogeneous neighborhoods in every generation.


Of course, children in a particular neighborhood might acquire different languages. It is implicitly assumed that on maturation, the children (now adults) reorganize themselves into homogeneous neighborhoods. It is this reorganization into homogeneous neighborhoods that prevents the elimination of any one language from the system. Another (more complete) way to characterize neighborhood effects is to treat the proportion of L1 speakers in the tth generation as a function that varies continuously with distance (n) in the neighborhood. It is this function that evolves from generation to generation. Without additional simplifying assumptions, this treatment requires techniques well beyond the scope of this chapter and will be the subject of future work. A preliminary account of this approach is provided in the next chapter.

9.6 A Note on Oblique Transmission

A substantial part of Cavalli-Sforza and Feldman 1981 is devoted to the study of oblique and horizontal transmission of cultural traits. In the basic vertical model that we have discussed so far, cultural traits are vertically transmitted from parents to children with varying probabilities, and the evolution of such traits is then studied. In oblique transmission, one considers the effect that members of the parental generation at large have on the transmission of cultural traits. Assume there are two traits L and H as before. Then one models the development of the cultural trait in the individual child in two stages:

1. Stage 1: Children acquire, on the basis of preliminary exposure to their parents, a "juvenile" state that is one of L and H. This process is similar to vertical transmission, and one may thus characterize the probabilities with which L or H may be attained. These probabilities will depend upon the cultural traits of the parents in the manner that has been discussed previously.

2. Stage 2: Juveniles acquire a "mature" state on the basis of exposure to the rest of the adult population. In this stage, one computes the probability with which trait transitions occur. Thus one characterizes the probability with which a juvenile of type L remains a type L upon maturation and, similarly, the probability with which a juvenile of type H remains so after maturation. These trait-transition probabilities will now depend upon the frequency of these traits in the entire adult population.


Consider stage 2. What is the probability with which a type L child might change to a type H one? This will surely depend upon the proportion of type H adults in the parental generation. Let P[L → H] = φ1(α), where α is the proportion of type H adults. Similarly, let P[H → L] = φ2(α). If the functional forms φ1 and φ2 are known, then it is straightforward to compute the dynamics of H types in the population. Indeed, Cavalli-Sforza and Feldman (1981) consider some simple choices for φ1 and φ2 and conduct an extensive analysis of the resulting dynamics. In general, however, it is clear from the discussion in this and preceding chapters that in the context of language change and evolution, φ1 and φ2 need to be derived from learning-theoretic considerations. Furthermore, φ1 and φ2 are unlikely to be simple in general. Much of the current book may be viewed as a contribution toward the characterization of such quantities for a variety of learning algorithms. For example, if the child learner were using the TLA, the following observations may be made:

1. The behavior of a TLA-based learner is characterized by a Markov chain with transition matrix given by T(α) (following the analysis of Chapter 3).

2. Assume that in the maturation phase, the learner receives l examples. Then the probability of an L1 → L2 transition after l examples is simply given by the (1, 2) element of T^l(α). Similarly, the probability of an L2 → L1 transition after l examples is given by the (2, 1) element of T^l(α).

In this manner φ1 and φ2 are easily obtained (see the sketch below). In general, given an NB-style analysis, one may then be able to compute the two-stage model outlined above. I do not pursue this in any more detail here.
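The sketch referred to above is a minimal matrix-power computation (Python). The particular 2x2 transition matrix T(α) used here is a made-up placeholder, not the TLA chain actually derived in Chapter 3.

```python
import numpy as np

def T(alpha, eps=0.1):
    # Hypothetical learner transition matrix per example: the chance of moving
    # toward a grammar grows with the proportion of that grammar's data heard.
    to_2 = eps * (1 - alpha)          # L1 -> L2 step probability
    to_1 = eps * alpha                # L2 -> L1 step probability
    return np.array([[1 - to_2, to_2],
                     [to_1, 1 - to_1]])

def phi(alpha, l):
    """Trait-transition probabilities after l maturation-phase examples."""
    Tl = np.linalg.matrix_power(T(alpha), l)
    phi1 = Tl[0, 1]                   # the (1, 2) element of T^l(alpha)
    phi2 = Tl[1, 0]                   # the (2, 1) element of T^l(alpha)
    return phi1, phi2

print(phi(alpha=0.7, l=20))
```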

9.7 Conclusions

In this chapter, I have discussed the basic model of Cavalli-Sforza and Feldman 1981 for cultural transmission and change. I have shown how this provides a framework in which to think about problems of language change and evolution. Language acquisition serves as the mechanism by which language is transmitted from parents to children. By suitably averaging over a population we are then able to derive the population dynamics, i.e., the evolutionary trajectories of the linguistic composition of the population as a whole from generation to generation.


I have shown how the approach of Cavalli-Sforza and Feldman 1981 relates to that of Niyogi and Berwick 1995, 1997 and how to go back and forth between the two kinds of models. For the particular case of two languages in competition, I have derived several dynamical systems under varying assumptions. I have also considered the generalization of such models to explicitly take into account the effect of spatial clustering of speakers into linguistic neighborhoods and have investigated the consequences of such neighborhood effects. The case of two languages in competition is of some significance since historical cases of language change and evolution are often traceable to a point in time when speakers of two language types came into contact with each other. As a particular case of this, I considered the evolution of English syntax from Old to Middle to Modern English. While the various linguistic explanations for such a change were not considered in a serious fashion, I demonstrated in this chapter how one might apply the computational framework developed here to test the plausibility of differing accounts.

Chapter 10

Variations and Case Studies

In Chapters 5 through 9, I have developed a variety of models of language change motivated by different potential applications to historically observed real-life cases. Over the course of this development, it has gradually become clear that such computational models allow us to bring into sharper focus the issues involved in studying and explaining linguistic diversity and language change. In order to make initial progress, I have made simplifying assumptions. These assumptions permit a point of entry into the subject and result in valuable first-order insights. While some variations of the basic assumptions have been studied in preceding chapters, it is worthwhile to reflect on other important issues that have been inadequately dealt with previously. In this chapter, I pause to consider some of these issues and discuss the nature of the modeling involved in resolving them in some sort of coherent fashion. Each of these represents a direction that is worthy of systematic study in its own right. I hope the preliminary analyses provided below are illustrative.

10.1 Finite Populations

In the analyses of preceding chapters, I have typically assumed that population sizes were infinite. This allowed me to derive deterministic dynamical systems that described the evolution of the population. In general, however, populations are finite, and this gives rise to stochastic dynamics. For large population sizes, the deterministic and stochastic dynamics behave similarly, but for small population sizes the two might differ from each other quite substantially. To see this, let us reconsider some of the two-language models of Chapter 5.

10.1.1 Finite Populations

Suppose the space L of possible languages consists of exactly two languages L1 and L2. Suppose learning children follow an algorithm A to infer a language from the linguistic data they receive over their learning phase. Then, following the argument of Chapter 5, one derives the corresponding deterministic population dynamics as αt+1 = f(αt, A, k), where αt is the proportion of L1 users in the tth generation. Recall that f is obtained by computing the probability with which a typical child (following A) acquires L1 on receiving k linguistic examples. A variety of such maps were derived for different choices of A and k. Consider now the situation in which there are a finite number, N, of adults. Each adult in generation t is characterized by a variable Xi(t) (for the ith adult). Xi(t) ∈ {0, 1} takes the value 1 if the ith adult speaks L1 and the value 0 if he or she speaks L2. Thus the linguistic configuration of the population is denoted by the vector (X1(t), X2(t), . . . , XN(t))^T. The average linguistic behavior of the population is denoted by

Y(t) = (1/N) Σ_{i=1}^{N} Xi(t)

Y(t) is the fraction of the population that uses L1. Note that Y(t) ∈ {0/N, 1/N, 2/N, . . . , N/N}. Further, for very large N, we have Y(t) ≈ E[Xi(t)] = αt, i.e., the fraction of L1 users is close to what it would have been if the population size were infinite.

10.1.2 Stochastic Dynamics

Let us now characterize the evolution of Y (t) over time. Consider the next generation of N children who are exposed to linguistic data from the previous generation. A typical child of the next generation receives data drawn at random from the previous generation, which contains a proportion Y (t) of L 1 users. What is the probability with which such a child will attain L 1 ? This is simply given by f (Y (t), A, k). Thus for the ith child we see that X i (t + 1) is a random variable that takes the value 1 with probability f (Y (t), A, k) and the value 0 with probability 1 − f (Y (t), A, k).

The average linguistic behavior is correspondingly given by

Y(t + 1) = (1/N) Σ_{i=1}^{N} Xi(t + 1)

Thus Y(t + 1) is a random variable that takes values in {0/N, 1/N, 2/N, . . . , N/N}. Further, the probability distribution of Y(t + 1) is determined entirely by the value of Y(t). The evolution of Y(t) can be characterized by a Markov chain with N + 1 states. Each state is identified with an element of {0/N, . . . , N/N}. The transition matrix of this chain is given by (1 ≤ i, j ≤ N + 1)

Tij = P[Y(t + 1) = (j − 1)/N | Y(t) = (i − 1)/N] = C(N, j − 1) f_{i−1}^{j−1} (1 − f_{i−1})^{N−(j−1)}     (10.1)

where fi = f(i/N, A, k) and C(N, j − 1) is the binomial coefficient. Thus we have stochastic evolution of the average linguistic behavior from generation to generation: Y(t) → Y(t + 1). It is now possible to study the effect of N on this evolutionary behavior.
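As a minimal computational sketch (Python), the transition matrix of Equation 10.1 can be built directly from f and N; the map f used in the example call is a placeholder, not one of the maps derived in the text.

```python
import numpy as np
from scipy.stats import binom

def transition_matrix(f, N):
    """(N+1) x (N+1) Markov transition matrix of Equation 10.1.

    State i corresponds to Y(t) = i/N; row i is the Binomial(N, f(i/N))
    distribution over the number of L1 speakers in the next generation.
    """
    states = np.arange(N + 1)
    T = np.zeros((N + 1, N + 1))
    for i in range(N + 1):
        T[i, :] = binom.pmf(states, N, f(i / N))
    return T

f = lambda y: 0.1 + 0.8 * y            # placeholder acquisition map
T = transition_matrix(f, N=50)
print(T.shape, T.sum(axis=1)[:3])      # each row sums to 1
```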

10.1.3 Evolutionary Behavior as a Function of N

First, let us recall that for infinite N we have, for any t,

Y(t) = lim_{N→∞} (1/N) Σ_{i=1}^{N} Xi(t) = αt = E[Xi(t)] = f(Y(t − 1), A, k)

Now let us consider the evolution for finite N.

Large N

For large N, one may invoke the Central Limit Theorem and conclude that Y(t) is normally distributed, for it is the average of a large number of i.i.d. variables. The mean and variance of Y(t) are given by μ = E[Xi(t)] and σ^2 = (1/N) μ(1 − μ) respectively. Thus, we have

Y(t) ≈ f(Y(t − 1)) + η(f(Y(t − 1)), N)

where η is a normally distributed random variable with zero mean and variance given by (1/N) f(Y(t − 1))(1 − f(Y(t − 1))). Since f(Y(t − 1)) ∈ [0, 1], we have the variance of η upper bounded by 1/(4N). For notational convenience, we have dropped the explicit dependence of f on A and k. In other words, the evolution of average linguistic behavior is a noisy version of the deterministic dynamics, where the variance of the noise is bounded by 1/(4N). Clearly, the larger N is, the more closely the dynamics will resemble the behavior of the infinite system. To see this, let us consider as an example the dynamics of the TLA learner with k = 2. Recall that this was provided by the following quadratic map (Chapter 5):

αt+1 = f(αt) = A αt^2 + B αt + C

where A = (1/2)((1 − b)^2 − (1 − a)^2); B = b(1 − b) + (1 − a); C = b^2/2. When a = b, the system has exponential change to a stable point of α* = 1/2. When a > b or a < b, the system has logistic growth to a fixed point given by a valid solution to the quadratic equation. Shown in Figure 10.1 are the evolutionary trajectories for a = b and a < b respectively. Note how the trajectories are a noisy version of the deterministic ones. The noisiness decreases with N, so that for large N the trajectories start resembling the original deterministic dynamics quite closely.
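A minimal simulation sketch in the spirit of Figure 10.1 (Python). The values a = b = 0.2 follow the top panel of that figure; the initial condition, generation count, and the smallest N are illustrative choices.

```python
import numpy as np

a, b = 0.2, 0.2                        # the a = b case of the top panel
A = 0.5 * ((1 - b)**2 - (1 - a)**2)
B = b * (1 - b) + (1 - a)
C = b**2 / 2
f = lambda x: A * x**2 + B * x + C     # deterministic quadratic map

rng = np.random.default_rng(0)
T, alpha0 = 400, 0.05

det = alpha0
for _ in range(T):
    det = f(det)                       # deterministic (infinite-population) trajectory

def finite_run(N):
    # Finite-population trajectory: next state is Binomial(N, f(Y_t)) / N.
    y = alpha0
    for _ in range(T):
        y = rng.binomial(N, f(y)) / N
    return y

for N in (100, 800, 10000):
    print(N, round(finite_run(N), 3), round(det, 3))   # noise shrinks as N grows
```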

Small N

For small values of N, a normal approximation is no longer valid and one will have to understand the behavior of the Markov chain given by Equation 10.1. Note that if the following is true,

∀i, 0 < fi < 1,

the Markov chain is ergodic and has a stationary distribution given by a solution to

x^T T = x^T

Thus, although the dynamics has no fixed points or stable attractors anymore, the stationary distribution characterizes how often the population is in the state Y(t) = i/N (for each i) over time. x is an (N + 1)-dimensional vector whose jth element is proportional (equal, if suitably scaled) to the probability that Y(t) = (j − 1)/N in the limit. Therefore, if the stationary distribution has a sharp peak at x(j), then it corresponds to Y(t) taking the value (j − 1)/N most often. Thus, we should expect the stationary distribution of the chain to have peaks at or close to the stable points of the original map f. Let us examine this for some choices of A and correspondingly f.



Figure 10.1: Evolutionary trajectories for the quadratic map. The top panel shows the case a = b = 0.2, when the deterministic system moves exponentially to a stable attractor of α* = 1/2. This is shown by the solid curve. The superimposed dashed curves show three random trajectories for finite N (= 10000, 800, 100). The bottom panel shows the case a = 0.01, b = 0.02. Again, the solid curve is the infinite-population trajectory and the dashed curves are three random trajectories for N = 20000, 5000, and 1000 respectively. 1000 generations have been simulated. In both cases, the "noisiness" of the trajectories decreases with N.



Figure 10.2: Stationary distributions for three choices of N . The peakedness of the distribution decreases from the top panel (N = 150) through the middle panel (N = 75) to the bottom panel (N = 30).

The Triggering Learning Algorithm

For finite k the map f has been derived for the TLA in Chapter 5. Let us consider the case a = b as before. In general, for an arbitrary k, the map f is given by

αt+1 = f(αt) = αt(1 − b^k) + b^k/2

This is a linear map and αt → 1/2. One can compute T for various choices of k and b. The numerical simulations reported below were conducted for k = 9 and b = 0.7. No particular significance should be attributed to the choice of these parameter values. Figure 10.2 shows the stationary distribution as a function of the size of the population N. Notice that for large N, the distribution is peaked around 1/2 as expected. As N decreases, the distribution starts flattening out. Note that for each choice of N, the stationary distribution is a probability distribution with support on the set {0/N, 1/N, . . . , N/N}. Since the support set is different for each N, the scale on the y-axis will be different for each of the distributions. However, the spread can be inspected visually.
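A minimal sketch (Python) of the stationary-distribution computation behind Figure 10.2, using the TLA map above with k = 9 and b = 0.7 from the text; the stationary vector is obtained as the left eigenvector of the Equation 10.1 transition matrix with eigenvalue 1.

```python
import numpy as np
from scipy.stats import binom

k, b = 9, 0.7
f = lambda alpha: alpha * (1 - b**k) + b**k / 2     # TLA map for the case a = b

def stationary(f, N):
    # Build T (Equation 10.1) and solve x^T T = x^T for the stationary distribution.
    states = np.arange(N + 1)
    T = np.array([binom.pmf(states, N, f(i / N)) for i in range(N + 1)])
    vals, vecs = np.linalg.eig(T.T)
    x = np.abs(np.real(vecs[:, np.argmin(np.abs(vals - 1))]))
    return x / x.sum()

for N in (150, 75, 30):
    x = stationary(f, N)
    print(N, np.argmax(x) / N)       # the peak sits near 1/2, flattening as N shrinks
```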


Figure 10.3: Trajectories for the random evolution of the population with the detailed Markov analysis for population sizes given by N = 150 (top panel), N = 75 (middle panel), and N = 30 (bottom panel) respectively.

The true trajectory for each N is given by the evolution of the corresponding Markov chain. The initial condition was taken to be x = 0, i.e., all members of the population speaking L2. From this initial condition, the deterministic system moves exponentially to 1/2. Shown in Figure 10.3 are trajectories for the deterministic and the Markov chain simulations respectively. Note how the population average moves to 1/2 in a noisy manner, with the degree of noise increasing as N becomes smaller. In each of the cases considered in Figures 10.2 and 10.3 respectively, the stationary distribution is peaked at 1/2, which means that eventually the population is mostly divided equally between L1 and L2 speakers. An interesting inversion occurs when N is very small. Consider, for example, the situation in Figure 10.4, where the stationary distribution is plotted for N = 25 and N = 23 respectively. At N = 24, the distribution is still peaked at 1/2. However, at N = 23, the distribution suddenly inverts and now has two peaks, near 0 and 1 respectively. This means that most of the time the population consists of mostly L1 speakers or mostly L2 speakers, but it flip-flops between these two situations. Thus, we see an inversion — for "large" N, the population moves to a heterogeneous linguistic state; for "small" N, the population at any point in time is almost always homogeneous. In this sense, the behavior at small N may be qualitatively different from the behavior at large N.


Figure 10.4: Stationary distributions for N = 25 (top panel) and N = 23 (bottom panel) respectively.

Cue Learner; Finite k

Consider now the cue learner described in Chapter 5. In the two-language setting, there is a probability p with which speakers of L1 might produce cues for the next generation. The map f for the infinite-population case was derived to be

f(αt) = Σ_{i=l}^{k} C(k, i) (p αt)^i (1 − p αt)^{k−i}

It was seen that for a fixed k and l, as p varied, there was a bifurcation, so that for large values of p there are two stable fixed points. One of these is α1* = 0 and the other is α2* < 1. In between is an unstable fixed point. Therefore, in this setting the infinite population may converge to either of the two stable points depending upon the initial conditions. In sharp contrast, the finite population will always converge eventually to 0.
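Before giving the argument, here is a minimal numerical sketch of the contrast (Python). The cue-learner parameters k, l, p and the population size N are illustrative values, not those used in the text; for these values the infinite-population map also has a nonzero stable fixed point (numerically, near 0.9), yet the finite chain still ends up at 0.

```python
import numpy as np
from scipy.stats import binom

k, l, p = 10, 3, 0.5                      # illustrative cue-learner parameters
def f(alpha):
    # P(at least l cues among k examples), cues heard with probability p*alpha.
    return binom.sf(l - 1, k, p * alpha)

N = 15
states = np.arange(N + 1)
T = np.array([binom.pmf(states, N, f(i / N)) for i in range(N + 1)])

print(T[0, :3])                           # f(0) = 0, so state 0 is absorbing

# Raise T to a huge power by repeated squaring: every row piles up on state 0,
# i.e., the finite population eventually becomes all-L2, and the nonzero stable
# point of the infinite-population map is never "discovered".
M = T.copy()
for _ in range(60):
    M = M @ M
print(np.round(M[:, 0], 3))               # first column is ~1 from every starting state
```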


To see this formally, let us first make some simple observations about the transition matrix of the Markov chain characterizing the behavior of the finite population. Note that

P(Y(t + 1) = j/N | Y(t) = 0) = C(N, j) f_0^j (1 − f_0)^{N−j}

But f_0 = f(0, A, k) = 0. Hence we have P(Y(t + 1) = j/N | Y(t) = 0) = 0 for every j ≠ 0. Thus clearly 0 is an absorbing state. Since for every other i > 0 we have 0 < fi < 1, it is easily seen that (i) there are no other absorbing states and (ii) every other state has a nonzero probability of transitioning to 0. By the standard theory of Markov chains reviewed in Chapters 3 and 4, it is immediate that the chain will eventually settle at 0, i.e., the finite population will eventually converge to a situation where all members speak L2. The other stable point of the infinite-population analysis is never discovered.

Summary and Further Directions

Thus we see that the behavior of the finite population may be quite different from that of the infinite population in subtle ways. In general, the following statements may be made:

1. If there are no extremal fixed points, i.e., neither 0 nor 1 is a fixed point, then the behavior of the finite population is similar to the infinite one for large N and may differ for small N. This is exemplified by the example of the TLA studied in this section.

2. Extremal fixed points correspond to absorbing states of the finite population. When this occurs, no other stable attractors of the deterministic system are discovered in the finite-population setting.

What I have presented so far is only the tip of the iceberg as far as understanding the effects of finite population sizes is concerned. I hope, however, to have provided some sense of the techniques involved in studying this issue in greater detail. The general question reduces to understanding the relation between the deterministic dynamical system (infinite population) and the corresponding stochastic process (finite population). If the deterministic system displays bifurcations or chaos, it is quite unclear how the corresponding stochastic process would behave. The analysis in this section was restricted to the two-language case, where we obtained some preliminary insights. For the more general case when |L| = n, we obtain a Markov chain with n^N different states. An effective analysis of such chains presents many combinatorial challenges.


Such questions are beyond the scope of this book and remain to be investigated systematically. The interested reader is referred to Komarova and Nowak 2003 for more detailed analysis of the case of finite populations in language. A large literature exists on genetic drift in finite populations (see Kimura 1983), where similar questions are studied in the context of biological evolution.

10.2 Spatial Effects

In most of our models so far, we have assumed that all children receive their primary linguistic data from the same source distribution. This source distribution depends upon the mixture of different linguistic types in the parental population. In other words, we assumed a social-connectivity pattern that is "perfectly mixed," in the sense that the different linguistic agents move freely in the society and influence each other equally. Reality, as always, is more complicated. For example, the community may be spatially segregated into different "neighborhoods," with each such neighborhood having a different mix of linguistic types. Children born in these different neighborhoods are therefore exposed to different source distributions and end up having different linguistic experiences. This leads naturally to spatially distributed models, and a brief discussion was conducted in Chapter 9. Let us reconsider some of these spatial effects. We begin by representing the geographic extent of the population by the unit square [0, 1] × [0, 1]. To keep matters simple, let us again work within the confines of the two-language setting where L, the space of possible languages, consists of exactly two languages L1 and L2. Then the linguistic distribution of the population in space and time may be represented by a function

gt(x, y) : [0, 1] × [0, 1] → [0, 1]

Thus gt(x, y) ∈ [0, 1] is the proportion of L1 users at location (x, y) and time t.

10.2.1 Spatial Variation and Dialect Formation

Consider a typical child born at location (x, y). Assume that it is exposed to linguistic data from mature language users at that location. If it uses a learning algorithm A and operates on the basis of k examples, then one might compute the probability with which such a child would acquire L1. Following the usual analysis of previous chapters, we conclude that this probability is equal to f(gt(x, y), A, k), and therefore

gt+1(x, y) = f(gt(x, y), A, k)

In this setting, at each location (x, y), we have dynamics given by the deterministic map f. In previous chapters (particularly Chapter 5 for two-language models) we showed how to compute f(·, A, k) for a variety of different learning algorithms. It is worthwhile to reflect on the implications of this spatial model. Imagine that f has two stable fixed points, α = 0 and α = 1. We encountered several maps having such a property. This suggests that populations would converge to a unique language (either L1 or L2) depending upon the initial conditions. Now imagine a spatially distributed population with the initial conditions provided by g0(x, y). At each location (x, y) the population would converge to a fixed language. However, the language that "emerges" would depend upon the initial condition. At all points (x, y) where the value g0(x, y) lies in the basin of attraction of α = 0, we see that eventually L2 would be spoken by everyone. Correspondingly, at all points (x, y) where the value g0(x, y) lies in the basin of attraction of α = 1, the population would eventually converge on L1. Thus linguistic differentiation would emerge in much the same way as species get formed. Local dialects would appear.

Figure 10.5 illustrates this. There are six panels in the figure. Each panel shows the distribution of L1 speakers (across space) at a particular point in time. We assume that the learners use the cue-based strategy of Chapter 5. If the cue frequency p is within a certain interval, there are two stable fixed points, α = 0 and α ≈ 1, with an unstable fixed point given by α = α*. In our simulations, p was chosen so that α* = 0.34. The top two panels denote the initial state of linguistic differentiation in the square region for two initial conditions that are only very slightly different from each other. At any point (x, y) in the region, the darkness is proportional to the percentage of L2 speakers. The initial condition is mostly gray, indicating that a mixture of both kinds of speakers exists in all regions. Further, in both cases, there is a region in the upper-left corner where the mixture is slightly different from the rest of the region. In the upper-left panel, the initial condition is such that the initial percentages of L1 speakers are 0.33 and 0.35 respectively (in the upper-left subregion and the rest of the square). In the upper-right panel, the initial percentages of L1 speakers in these same regions are 0.31 and 0.33 respectively. Thus the upper-left corner has a slightly lower initial proportion of L1 speakers (higher proportion of L2 speakers) than the rest. The two initial conditions corresponding to the upper two panels are numerically different but qualitatively the same. In fact, the two pictures are hardly distinguishable by eye. Yet from such qualitatively similar initial conditions, as time unfolds, the spatial distribution of languages evolves in very different ways. In the panels on the left, we see that strong linguistic differentiation emerges, so that the two different areas become linguistically homogeneous but with two different languages. Different regions with different languages have emerged. On the other hand, in the panels on the right, we see that in the long run there is no difference between the linguistic composition of the two regions. Both regions are in the basin of attraction of the same language (L2), and eventually this language is spoken throughout, with no spatial segregation at all. No dialect formation emerges. Thus the initial conditions determine which language comes to be spoken in a particular region. Very slight differences in initial conditions might lead to different languages being spoken, and linguistic diversity might thus arise because of different initial conditions in different regions of the world.
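A minimal grid simulation in this spirit (Python). The map f below is a generic bistable placeholder with an unstable fixed point near 0.34, not the exact cue-learner map of Chapter 5; the grid size and number of generations are also illustrative.

```python
import numpy as np

def f(g):
    # Placeholder bistable map on [0,1]: stable fixed points at 0 and 1,
    # unstable fixed point at 0.34 (mimicking the text's alpha* = 0.34).
    a_star = 0.34
    return np.clip(g + 0.5 * g * (1 - g) * (g - a_star), 0.0, 1.0)

def run(init_corner, init_rest, steps=200, n=50):
    # n x n grid over [0,1]^2; the upper-left quarter starts at init_corner.
    g = np.full((n, n), init_rest)
    g[: n // 2, : n // 2] = init_corner
    for _ in range(steps):
        g = f(g)                 # purely local dynamics: g_{t+1}(x,y) = f(g_t(x,y))
    return g

left = run(0.33, 0.35)           # corner below alpha*, rest above: differentiation
right = run(0.31, 0.33)          # both below alpha*: everyone converges to L2
print(left[0, 0].round(2), left[-1, -1].round(2))
print(right[0, 0].round(2), right[-1, -1].round(2))
```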

10.2.2 A General Spatial Model

A more general account of spatial dynamics is provided by the following model. We develop this within the two-language framework as before. Let X ⊂ R × R be a spatial region.¹ At location x ∈ X, one may let gt(x) ∈ [0, 1] denote the proportion of L1 users in the population at time t. Thus the spatial distribution of linguistic types at time t is given by the function gt : X → R. We wish to characterize the evolution of gt over time. We denote spatial influences by the influence function

I : X × X → R

where I(z, x) denotes the influence of speakers at location x ∈ X on learners at location z ∈ X. For normalization purposes, we require that ∫_X I(z, x) dx = 1. Thus, for any fixed z, the quantity I(z, x) may be interpreted as a probability density function denoting the likelihood of drawing a random example from a speaker at location x. If X were identified with social factors, then the influence function I characterizes the linguistic influence of one social group on another, leading to a mathematization of sociolinguistic networks in the spirit of Milroy and Milroy 1985. More recently, there has been a profusion of interest in the structure and formation of networks in a variety of areas, from biology and sociology to the Internet (see, for example, Strogatz 2001 as well as Newman, Barabasi, and Watts 2003 for popular and accessible treatments).

¹More generally, we may identify X with any parameters of social, ethnic, economic, or other factors. For example, each x ∈ X may denote a particular socio-economic group, and gt(x) then denotes the linguistic composition of that group. The influence function I characterizes the linguistic influence of one group on another.


[Figure 10.5 panels; the x/y labels printed beneath the six panels read 0.33/0.35, 0.31/0.33, 0.1/0.27, 0.2/0.4, 0/0.85, and 0/0.]

Figure 10.5: The evolution of linguistic diversity in space and time. Each panel represents the linguistic diversity in space at a point in time. The darkness of the (x, y) location is proportional to the percentage of L 2 speakers in the population. The three left panels (from top to bottom) represent snapshots of the linguistic diversity at three points in the evolutionary process from an initial condition (top), after ten generations (middle), and after twenty generations (bottom). The three right panels (from top to bottom) represent snapshots of the linguistic diversity from a different initial condition (top), after ten generations (middle), and after twenty generations (bottom). In each panel, there are two subregions — the upper-left-hand corner and the rest — with different proportions of L 2 speakers. These are indicated by the numbers x/y below each panel where x and y are the percentages of L1 speakers in the two subregions respectively. Although the two initial conditions represented by the top two panels are qualitatively the same, the numerical difference has put them in different basins of attraction. As a result, the population evolves in different ways. In the left panels, regional linguistic differentiation occurs; in the right panels, it does not.


Consider a child learner born at location z. If this learner develops a language based on inputs drawn from speakers in its own location, the evolution of language at that location would be given by f(gt(z)), following the arguments of the previous section. However, if the child is exposed to data from different locations according to I, then one may define the intermediate function ht : X → R, where

ht(z) = ∫_X I(z, x) gt(x) dx

Here ht(z) denotes the overall probability with which a learner at z might encounter a speaker of L1 at time t. Thus the evolution of language is characterized by f(ht(z)). Therefore, we get

gt+1(z) = f(ht(z)) = f ∘ (LI gt)(z)     (10.2)

where ∘ denotes composition and LI is the linear operator given by (LI g)(z) = ∫_X I(z, x) g(x) dx. The evolution of gt may be studied for different choices of I and f. A proper investigation is beyond the scope of the current book, but let us provide some preliminary insights. Let f : [0, 1] → [0, 1] denote a monotonic map whose iteration leads to dynamics with two stable attractors α1 < α2 and an unstable fixed point α* ∈ (α1, α2). The basin of attraction of α1 is given by [0, α*), and that of α2 is given by (α*, 1]. We have encountered such maps before. Then, if the spatial evolution of language is provided by Equation 10.2, the following results are true:

Proposition 10.1 If gt is a constant function, it remains so for all time. The evolution of the constant is characterized by f.

Proof: Let gt(z) = αt. Then it immediately follows that ht(z) = ∫_X I(z, x) αt dx = αt. Therefore gt+1(z) = f(αt) = αt+1. Thus gt is always a constant function.

In other words, Proposition 10.1 states that if there is no spatial diversity to begin with, such diversity will not arise. Recall that the dialect-formation model of the previous section required some initial diversity, distributed around the boundary between the basins of attraction of two different attractors. This result is therefore consistent with the earlier results.

Proposition 10.2 If at all locations z we have gt(z) > α*, then spatial diversity will be eliminated, i.e., ∀z, gt(z) → α2.

Proof: Let αt = inf_{z∈X} gt(z). If X is compact, we know that αt > α*. We also know that for every z, ht(z) = ∫_X I(z, x) gt(x) dx ≥ ∫_X I(z, x) αt dx = αt. Since f is monotone, we have f(ht(z)) ≥ f(αt). Therefore αt+1 ≥ f(αt). Iterating k times, we get 1 ≥ αt+k ≥ f^k(αt). Since lim_{k→∞} f^k(αt) = α2, we have that lim_{k→∞} αt+k ≥ α2. Now let βt = sup_{z∈X} gt(z). We see that ht(z) = ∫_X I(z, x) gt(x) dx ≤ ∫_X I(z, x) βt dx ≤ βt. Therefore f(ht(z)) ≤ f(βt), from which we conclude βt+1 ≤ f(βt). Iterating k times, we get lim_{k→∞} βt+k ≤ lim_{k→∞} f^k(βt) = α2. Since αt ≤ βt, the result follows.

Even if there is diversity to begin with, if all the initial conditions lie in the basin of attraction of one attractor, then spatial diversity will eventually be eliminated as the linguistic composition moves toward that attractor at all locations. If all initial conditions are not within the same basin of attraction, then more complicated behavior may arise. In the dialect-formation model of the previous section, we assumed that I(z, x) = δ(z − x), where δ is the Dirac delta function. In other words, different regions were linguistically isolated from each other. This led to each region developing its own language depending upon the initial conditions of that region. For other kinds of spatial interactions, other evolutionary possibilities emerge.
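A minimal discretized sketch of Equation 10.2 (Python). Both the influence kernel and the map f here are illustrative placeholders, and X is taken as the interval [0, 1] for simplicity.

```python
import numpy as np

n = 200
xs = np.linspace(0, 1, n)                  # discretize X = [0, 1]
dx = xs[1] - xs[0]

def f(g):
    # Placeholder bistable monotone map: attractors 0 and 1, unstable point 0.5.
    return np.clip(g + g * (1 - g) * (g - 0.5), 0.0, 1.0)

# Gaussian influence kernel I(z, x), each row normalized to integrate to 1.
sigma = 0.05
I = np.exp(-(xs[:, None] - xs[None, :])**2 / (2 * sigma**2))
I /= I.sum(axis=1, keepdims=True) * dx

g = np.where(xs < 0.3, 0.4, 0.6)           # initial diversity straddling the unstable point
for _ in range(200):
    h = (I * g[None, :]).sum(axis=1) * dx  # h_t(z) = ∫ I(z, x) g_t(x) dx
    g = f(h)                               # Equation 10.2

print(g[:3].round(2), g[-3:].round(2))     # outcome depends on I and the initial condition
```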

Proposition 10.3 Even though f has no cycles, the spatial evolution could display oscillations because of spatial interactions.

Proof: We construct an example of cyclic behavior in the population. Let X be divided into two disjoint regions X1 and X2 such that X1 ∪ X2 = X. Let gt(x) = 1/2 + ε for all x ∈ X1 and gt(x) = 1/2 − ε for all x ∈ X2. We will choose ε appropriately later. Now let I be as follows. For all z ∈ X1, we have ∫_{X2} I(z, x) dx = β > 1/2. Similarly, for all z ∈ X2, we have ∫_{X2} I(z, x) dx = 1 − β. Let us characterize the evolution of spatial diversity for this choice of I. We see that for z ∈ X1, we have ht(z) = ∫_X I(z, x) gt(x) dx = β(1/2 − ε) + (1 − β)(1/2 + ε) = 1/2 + ε(1 − 2β). Therefore, we get

gt+1(z) = f(1/2 − (2β − 1)ε)

Similarly, for z ∈ X2, we have gt+1(z) = f(1/2 + (2β − 1)ε). Consider f : [0, 1] → [0, 1] to be monotone increasing with f(0) = 0; f(1/2) = 1/2; f(1) = 1; and f(1/2 + ε) + f(1/2 − ε) = 1 for all 0 ≤ ε ≤ 1/2. Further, let the dynamics arising from the iteration of f be such that there are two stable attractors (0 and 1) and one unstable fixed point (1/2) in the middle. Then f has no cycles. For such an f we have f′(1/2) > 1. Therefore, it is possible to find ε* and β such that

f(1/2 − (2β − 1)ε*) = 1/2 − ε*

For this choice of ε* we see that gt cycles between 1/2 + ε* and 1/2 − ε* in a period-2 cycle at all locations.

In the construction of the previous proposition there are essentially two regions X1 and X2, where each region influences the other more than itself. This leads to a cycling behavior as each region follows the other. By having n regions, one can get period-n cycles. One may eventually get chaos, though we do not explore that possibility here. In summary, we see that the spatial distribution of language and the interactions between different regions may have considerable effect on the dynamics, depending upon both the initial conditions and the nature of the influence function I. It is worth noting in conclusion that the formulation is general enough to cover many different kinds of variation. Thus X need not represent spatial diversity but may be used to model ethnic, social, economic, or other forms of diversity that may influence the language-learning process at any point.
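A quick numerical check of this construction (Python). The symmetric map f below is an illustrative choice satisfying the stated conditions, not one derived in the text, and β = 0.9 is arbitrary.

```python
import numpy as np
from scipy.optimize import brentq

def f(x):
    # Monotone increasing, f(0)=0, f(1/2)=1/2, f(1)=1, f(1/2+e)+f(1/2-e)=1, f'(1/2)=2.
    return x**2 / (x**2 + (1 - x)**2)

beta = 0.9
# Solve f(1/2 - (2*beta - 1)*eps) = 1/2 - eps for a nonzero eps*.
phi = lambda eps: f(0.5 - (2 * beta - 1) * eps) - (0.5 - eps)
eps_star = brentq(phi, 1e-6, 0.5)

g1, g2 = 0.5 + eps_star, 0.5 - eps_star    # initial values on the two regions X1, X2
for t in range(6):
    h1 = beta * g2 + (1 - beta) * g1       # region X1 listens mostly to X2
    h2 = beta * g1 + (1 - beta) * g2       # and vice versa
    g1, g2 = f(h1), f(h2)
    print(round(g1, 4), round(g2, 4))      # the two regions swap values: a period-2 cycle
```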

10.3 Multilingual Learners

In studying the relationship between learning and evolution, we have always assumed that the language-acquisition process leads the learner to a single grammar. This is acquired on the basis of the primary linguistic data that it is exposed to over the learning period. Formally, the learning algorithm is treated as a map A : Data → H where H is a class of possible natural-language grammars. Much of the interesting and nontrivial population dynamics arises when the primary linguistic data comes from a mixture of sources, e.g., from two different grammatical systems that are present in the adult generation.


An interesting and important issue is whether and when the learning child actually becomes bilingual (or, more generally, multilingual). In this case, the child does not acquire a single grammatical system but instead acquires multiple distinct ones. For example, if there are two linguistic systems (variants) L1 and L2 in the adult population, the learning child might actually acquire both. We consider this possibility in this section. The issue of acquiring multiple linguistic systems raises a host of subsidiary questions. For one, how might the learning child even know that there are two (or n) different linguistic systems in its linguistic environment? The literature on language acquisition is often imprecise on this question. One approach to multilingual acquisition is to claim that the child categorizes individuals into linguistic groups and figures out (somehow) that there are n different systems to acquire. Thereafter, acquisition proceeds in parallel as if there were n targets, and the linguistic data is streamed based on the individual generating it. For example, a child might conclude from the social dynamics that the grandparents speak one language, the parents another, and the neighbors a third language. The child then proceeds to learn three different languages, treating the data differently depending upon who produces it. This is usually the case when the lexical items used by the different language users are quite distinct. The languages are thus kept apart, and multilingual acquisition reduces to three parallel monolingual acquisition systems. The previously studied monolingual models apply immediately. This is classical multilingualism. However, the case of greater interest is when the linguistic systems interact. This is especially so if they share the same or similar lexical items but have different grammatical rules. In this setting, it is as if the child acquires neither a single system nor multiple separate systems but rather a "mixture" of competing systems that it uses simultaneously. For example, in their insightful study of language contact and change, Kroch and colleagues argue strongly that over the Middle English period, there were southern and northern dialects that came into contact, and speakers had multiple conflicting systems that they used simultaneously (see Kroch and Taylor 1997; Kroch, Taylor, and Ringe 2000). The most natural model for this version of multilingual behavior is to treat linguistic knowledge as a distribution over competing grammatical systems. Thus the learning algorithm is a map

A : Data → {μ | μ is a probability measure on H}

If H consists of exactly n elements, i.e., there are n linguistic types, then probability measures on H can be identified with elements of the (n − 1)-dimensional simplex Δ^(n−1). Thus any measure on H is given by (α1, . . . , αn), where Σ_{i=1}^{n} αi = 1 and αi ≥ 0. Therefore, the learning algorithm may be characterized as

A : Data → Δ^(n−1)

We consider this for the case n = 2 in what follows. It is hoped that the analysis is illuminating and that generalizations to the general n case can be easily imagined.

10.3.1 Bilingualism Modeled as a Lambda Factor

Following the treatment of Kroch and Taylor 1997, let us consider an explicit model of bilingualism. Individuals are now bilingual, with two internal grammars (say g1 and g2) that they alternate between in production and comprehension. A parameter λ ∈ [0, 1] determines the frequency of usage of the two grammars. Thus, for example, a speaker/hearer with λ = 0.2 would use g1 20% of the time and g2 80% of the time.

The Bilingual Individual and the Population

Imagine a population of bilingual adults, with the ith adult using the two grammars in the ratio λi. Each adult therefore has his or her characteristic λ, and one can characterize the variation in the adult population by a distribution over values of λ. Accordingly, let P(λ) denote a probability density function with support in [0, 1] that characterizes the distribution of λ-values in the adult population. Consider a particular adult having a value λ. The probability with which this adult produces an arbitrary sentence s ∈ Σ* is given by

Pλ(s) = λP1(s) + (1 − λ)P2(s)     (10.3)

where P1(s) is the probability with which s is generated while using grammar g1 and P2(s) is the probability with which s is generated while using grammar g2. Equation 10.3 characterizes the probability with which sentences are produced (and consequently received by listeners/children) by a single adult speaker. However, in reality the true source is not a single adult speaker but the entire population of adults, with adults having λ-values given by the distribution P(λ). Although it seems like this might complicate matters considerably, the analysis turns out to be quite simple.


To see this, consider the probability with which an arbitrary sentence s ∈ Σ* is produced. This is given by

P(s) = ∫_0^1 Pλ(s) P(λ) dλ     (10.4)

Putting the value of Pλ(s) from Equation 10.3 into Equation 10.4, we have

P(s) = P1(s) ∫_0^1 λ P(λ) dλ + P2(s) ∫_0^1 (1 − λ) P(λ) dλ

Clearly,

P(s) = E[λ] P1(s) + (1 − E[λ]) P2(s)

where E[λ] = ∫_0^1 λ P(λ) dλ is the mean value of λ in the population and obviously lies between 0 and 1. Thus, having a mixed source with distribution P(λ) is equivalent to having a single source switching between the two grammars with λ = E[λ]. Therefore, from the point of view of the child, the variation (of λ's) in the adult population can be replaced by a single adult whose λ-value is given by the population mean. This will considerably simplify matters in the analysis of change.

The Child as a Lambda Learner

Children are exposed to sentences, and from this exposure they acquire (estimate) a value of λ that they then employ, as adult users of language, to switch between the two grammars in sentence production. According to this point of view, and in contrast to other models of language acquisition considered in this context, child learners are not explicitly monolingual. They do not attain a single unique grammar but rather acquire both. This reflects the fact that a single grammar is not adequate to account for the conflicting data they receive; rather, two grammars are attained and are in competition, with the parameter λ deciding the rate at which the two grammars are employed. Crucially, this parameter λ is estimated by children on the basis of the primary linguistic data they receive. There is also variation in the population of children. Let us assume that all children use the same algorithm (procedure; denoted by A) to determine a suitable value of λ. Let us also assume that each child in the community receives its primary linguistic data from the bilingual adult population that has been characterized in the previous section. However, each child receives a different random draw of sentences from the same mixed source of adults. Because each child receives a potentially different data set, it attains a different value of λ if the primary linguistic data is finite.


Thus, one has a population of children who receive different sets of sentences and therefore attain different values of λ, leading to variation in λ-values in the population of children. Under these assumptions, one might now attempt to characterize the distribution of λ-values in the population of children. By doing so, one will have determined the relationship between the linguistic composition of children and that of adults, i.e., the relationship between the linguistic compositions of two successive generations. The precise nature of such a relationship is determined by the λ-estimation procedure the canonical child employs, and it characterizes the evolutionary consequences (over generational time scales) of λ-estimation (over developmental time scales). We now outline three candidate λ-estimation procedures a child could potentially employ. In the next section, we will determine their evolutionary consequences. Imagine a child receives n random sentences from the adult population. Sentences can be of three types:

1. Sentences that are analyzable by g1 but not by g2, i.e., those that belong to L1 \ L2.

2. Sentences that are analyzable by both g1 and g2, i.e., those that belong to L1 ∩ L2.

3. Sentences that are analyzable by g2 but not by g1, i.e., those that belong to L2 \ L1.

Of the n sentences the child receives, let n1 be of type 1, n2 be of type 2, and n3 be of type 3. Clearly, n = n1 + n2 + n3. Now, three candidate algorithms for λ-estimation the child might utilize are

1. A1: λ̂ = n1/(n1 + n3). The child ignores all ambiguous sentences and uses only unambiguous triggers in its count.

2. A2: λ̂ = n1/(n1 + n2 + n3). The child interprets ambiguous sentences (type 2) as having been produced by g2.

3. A3: λ̂ = (n1 + n2)/(n1 + n2 + n3). The child interprets ambiguous sentences (type 2) as having been produced by g1.

Two remarks are in order. Algorithms A2 and A3 are mirror images, in that g2 and g1 respectively are given preferential treatment in the analysis of ambiguous sentences.


If g2 and g1 represent first- and second-language grammars respectively, this implies a tendency of the child to analyze ambiguous sentences exclusively using the first or second language respectively. Alternatively, one may consider g2 or g1 (respectively) to be a default or marked state according to which ambiguous sentences are interpreted. Various interpretations may thus be given to the asymmetry of estimation.

Evolutionary Consequences

Let ut be the mean value of λ in the population of adults, i.e., ut = E[λ]. Similarly, let ut+1 be the mean value of λ in the population of children (after maturation; therefore the population of the next generation of adults at time t + 1). It is possible to derive the functional relationship between ut+1 and ut under the assumptions described in earlier sections, and we provide and analyze such relationships below. First, some preliminaries. Recall that P1 is the probability with which speakers of L1 produce sentences and P2 is the probability with which speakers of L2 produce theirs. Further, let

a = P1[L1 ∩ L2] and b = P2[L1 ∩ L2]

Thus a, b are the weights that P1 and P2 respectively put on the ambiguous sentences that are analyzable under both grammars. It turns out that these are the only parameters of P1 and P2 that enter the functional relationship between ut+1 and ut for the three λ-estimation procedures described above.

Theorem 10.1 Let ut be the mean value of λ in generation t and ut+1 be the mean value of λ in generation t + 1. Assume that individual learners estimate λ according to each of the three algorithms described earlier. Then ut and ut+1 are related by the following for each of the three procedures:

A1: ut+1 = ut(1 − a) / (ut(1 − a) + (1 − ut)(1 − b))

A2: ut+1 = ut(1 − a)

A3: ut+1 = ut(1 − b) + b


Proof: Consider A2. For this setting,

ut+1 = E[ n1 / (n1 + n2 + n3) ] = E[n1] / n = ut(1 − a)

Next consider A3. For this setting,

ut+1 = E[ (n1 + n2) / (n1 + n2 + n3) ] = (E[n1] + E[n2]) / n = ut(1 − a) + ut a + (1 − ut)b = ut(1 − b) + b

Finally, consider A1.

ut+1 = E[ n1 / (n1 + n3) ] = \sum_{n1 + n2 + n3 = n} \frac{n1}{n1 + n3} \binom{n}{n1, n2, n3} α^{n1} β^{n2} γ^{n3}

where α = (1 − a)ut, β = a ut + b(1 − ut), and γ = (1 − b)(1 − ut) respectively. Now, collecting terms by k = n1 + n3,

\sum_{n1, n2, n3} \frac{n1}{n1 + n3} \binom{n}{n1, n2, n3} α^{n1} β^{n2} γ^{n3}
 = \sum_{k=0}^{n} \binom{n}{k} β^{n−k} \sum_{n1=0}^{k} \frac{n1}{k} \binom{k}{n1} α^{n1} γ^{k−n1}
 = \sum_{k=0}^{n} \binom{n}{k} β^{n−k} (1 − β)^{k} \sum_{n1=0}^{k} \frac{n1}{k} \binom{k}{n1} \left(\frac{α}{1 − β}\right)^{n1} \left(\frac{γ}{1 − β}\right)^{k−n1}
 = \sum_{k=0}^{n} \binom{n}{k} β^{n−k} (1 − β)^{k} \frac{α}{1 − β}
 = \frac{α}{1 − β} = \frac{ut(1 − a)}{ut(1 − a) + (1 − ut)(1 − b)}

Let us examine the consequences of this result. 1. The three different λ-estimation techniques lead to three different dynamical systems. Each dynamical system characterizes the evolution of the mean λ-value of the population from generation to generation. Crucially, nowhere have we assumed a particular form for the evolution of this mean λ-value. Instead, we have postulated different mechanisms by which children might estimate a value of λ from the data they receive. By taking ensemble averages over the entire population of children, we have then been able to derive the evolution of mean-λ as a logical consequence. The three different λ-estimation procedures essentially differ on how they treat ambiguous sentences, i.e., those that are parsed under both grammatical hypotheses in question. This results in the different evolutionary properties. We can now compare these predicted evolutionary trajectories to the ones that have been actually observed in history.


2. The three dynamical systems have different evolutionary properties. The evolution under λ-estimation method A1 is characterized by equations that are familiar to us from Chapter 5. Recall that such a system has the following behavior: (a) if a = b, then ut+1 = ut, i.e., there is no change over generational time; (b) if a > b, then ut → 0 over generational time; (c) if a < b, then ut → 1 over generational time.

3. Under λ-estimation method A2, it is easy to check that ut → 0 (for all a > 0).

4. Under λ-estimation method A3, it is similarly easy to check that ut → 1 (for all b > 0).
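The three maps of Theorem 10.1 are easy to iterate numerically. The following Python sketch confirms the limiting behavior just listed; the particular values of a, b, and the starting point are illustrative assumptions, not values from the text:

def update_A1(u, a, b):
    # A1: ignore ambiguous sentences
    return u * (1 - a) / (u * (1 - a) + (1 - u) * (1 - b))

def update_A2(u, a, b):
    # A2: ambiguous sentences credited to g2
    return u * (1 - a)

def update_A3(u, a, b):
    # A3: ambiguous sentences credited to g1
    return u * (1 - b) + b

def iterate(update, u0, a, b, generations=300):
    u = u0
    for _ in range(generations):
        u = update(u, a, b)
    return u

if __name__ == "__main__":
    a, b, u0 = 0.3, 0.2, 0.5   # illustrative values with a > b
    for name, f in (("A1", update_A1), ("A2", update_A2), ("A3", update_A3)):
        print(name, round(iterate(f, u0, a, b), 4))
    # With a > b one expects A1 -> 0, A2 -> 0, A3 -> 1, as stated above.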

10.3.2

Further Remarks

Memoryless Bilingual Learner

The learning algorithm considered in the previous sections operates in a batch mode where it needs to keep a count of the different kinds of sentences (whether s ∈ L1 \ L2, etc.) it has received. A different algorithm that is more faithful to the memoryless online principles discussed in Chapter 3 was developed and considered by Yang (2000). A brief exposition follows.

We consider the case where |H| = 2. Thus there are two grammars g1 and g2 in H characterizing two different languages L1 and L2 respectively. Before seeing any examples the learner has a uniform prior on H. This may be characterized by the number p1(0) = p2(0) = 1/2. Thus p1(0) denotes the "weight" for g1 and p2(0) = 1 − p1(0) denotes the weight for g2. The learner receives an example stream s1, s2, . . . , one at a time. After the nth example has been received, let the distribution be characterized by (p1(n), p2(n)) where p1(n) + p2(n) = 1. A new sentence sn+1 is now heard by the learner. Then

1. with probability pi(n) the learner picks gi.

2. if sn+1 is understood (analyzable) by gi, then
   pi(n + 1) = (1 − γ)pi(n) + γ
   pj(n + 1) = (1 − γ)pj(n),   j ≠ i
   else
   pi(n + 1) = (1 − γ)pi(n)
   pj(n + 1) = (1 − γ)pj(n) + γ,   j ≠ i
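A direct simulation of this learner is straightforward. The Python sketch below uses the sentence-type probabilities from Section 10.3.1; the specific parameter values are assumptions for illustration only. It returns a single child's estimate p1(k) after exposure to an adult population with mean λ equal to u:

import random

def memoryless_bilingual_learner(k, u, a, b, gamma=0.01):
    # u: mean lambda of the adult population; a = P1[L1 ∩ L2], b = P2[L1 ∩ L2]
    p1 = 0.5  # uniform prior: p1(0) = p2(0) = 1/2
    weights = [u * (1 - a),            # type 1: analyzable by g1 only
               u * a + (1 - u) * b,    # type 2: analyzable by both grammars
               (1 - u) * (1 - b)]      # type 3: analyzable by g2 only
    for _ in range(k):
        t = random.choices((1, 2, 3), weights=weights)[0]
        picks_g1 = random.random() < p1
        understood = t in (1, 2) if picks_g1 else t in (2, 3)
        reward_g1 = picks_g1 if understood else not picks_g1
        # linear reward-penalty update
        p1 = (1 - gamma) * p1 + (gamma if reward_g1 else 0.0)
    return p1

if __name__ == "__main__":
    random.seed(0)
    # Averaging over many children approximates the next generation's mean lambda.
    kids = [memoryless_bilingual_learner(k=500, u=0.6, a=0.3, b=0.2) for _ in range(2000)]
    print(round(sum(kids) / len(kids), 3))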

In this algorithm p1 plays the role of λ in the previous discussion. Instead of estimating λ using frequency counts, the algorithm estimates it using a


classical linear reward-penalty scheme (Bush and Mosteller 1955). Suppose maturation occurs after k sentences have been received. Then, the variable of interest is p1(k). This characterizes the probability with which the mature learner will use the grammatical system g1 in future interactions. One might say that the learner estimates λ to be λ̂ = p1(k).

In the population setting, different children will receive different random draws of sentences and arrive at different estimates of λ. There is variation in the λ-values of the children. As before, it is of interest to characterize the distribution, and in particular the average value, of λ̂ = p1(k) in the population of child learners. I show how to do this. Note that in general p1(n + 1) is related to p1(n) in the following way:

p1(n + 1) = (1 − γ)p1(n) + γ Xn

where Xn is a random variable taking values in {0, 1}. Denote the probability of Xn = 1 by An. Therefore, by taking expectations we have

E[p1(n + 1) | p1(n)] = (1 − γ)p1(n) + γ An    (10.5)

Now let us compute An . As usual, we assume that ut is the mean value of the λs in the adult population (generation t). Therefore the following probabilities are immediate. The probability that a random sentence belongs to L1 \ L2 is equal to ut (1 − a). The probability that a random sentence belongs to L1 is equal to ut + (1 − ut )b. In order to compute An , note that Xn = 1 when either of two disjoint events occur: 1. The learner chooses g1 (with probability p1 (n)) and the sentence sn+1 ∈ L1 . This entire event occurs with probability (ut + (1 − ut )b) p1 (n) 2. The learner chooses g2 (with probability (1 − p1 (n))) and the sentence sn+1 ∈ L1 \ L2 . This entire event occurs with probability (ut (1 − a)) (1 − p1 (n)) Thus we have An = (ut + (1 − ut )b) p1 (n) + (ut (1 − a)) (1 − p1 (n)) Putting this into Equation 10.5, we have E[p1 (n + 1)|p1 (n)] = αp1 (n) + β


where α = 1 − γ + γ(a ut + b(1 − ut)) and β = ut(1 − a)γ. Taking expectations with respect to p1(n) we get the following recurrence relation:

E[p1(n + 1)] = α E[p1(n)] + β

Since E[p1(0)] = 1/2, we have

ut+1 = E[p1(k)] = (1/2) α^k + β (1 − α^k) / (1 − α)

For the limiting case of k = ∞, we have

ut+1 = β / (1 − α) = ut(1 − a) / [(1 − a)ut + (1 − b)(1 − ut)]
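A quick numerical check of this derivation is easy to set up. The following sketch (illustrative parameter values only) compares the iterated recurrence for E[p1(n)], its closed form, and the k → ∞ limit:

def expected_p1(k, u, a, b, gamma=0.01):
    # alpha and beta as defined above
    alpha = 1 - gamma + gamma * (a * u + b * (1 - u))
    beta = u * (1 - a) * gamma
    e = 0.5
    for _ in range(k):            # iterate E[p1(n+1)] = alpha * E[p1(n)] + beta
        e = alpha * e + beta
    closed = 0.5 * alpha ** k + beta * (1 - alpha ** k) / (1 - alpha)
    limit = beta / (1 - alpha)    # value as k -> infinity
    return e, closed, limit

print(expected_p1(k=500, u=0.6, a=0.3, b=0.2))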

Note that this is exactly the same update rule as obtained in the previous section.

Final Remarks

While the bilingual learners were elaborated for the two-language setting, natural generalizations to the n-language case exist. For example, for the λ-estimation procedures discussed earlier, one might consider algorithms that count cues. Elements of Li \ Lj count as cues and one might count occurrences of members of each of these sets in the primary linguistic data and combine them usefully to obtain a total estimate λ̂ = (λ1, . . . , λn) ∈ Δ(n−1). Many such estimates are possible, and I do not explore their evolutionary consequences here. Similarly, the linear reward scheme of Yang (2000) has a natural extension to the n-language case. This general extension is considered in Yang 2000, and I direct the interested reader to that exposition.

On a concluding note, it is worthwhile to reflect on the difference between bilingual and monolingual learning when batch-learning algorithms are used. Consider first the monolingual setting. The learning algorithm chooses L1 if n1 > n3. Recall that n1 is the number of sentences belonging to L1 \ L2 that occur in the data set and n3 is the corresponding number belonging to L2 \ L1. The update rule for this case is given by

αt+1 = \sum_{n1 > n3} \binom{n}{n1, n2, n3} p1(t)^{n1} p2(t)^{n2} p3(t)^{n3}


where p1 (t) = αt (1 − a); p3 (t) = (1 − αt )(1 − b); p2 (t) = 1 − p1 (t) − p3 (t). Each individual in the population uses exactly one language and α t is the proportion of them that use L1 in generation t. The dynamics of this map has two stable equilibria, α∗ ≈ 0 and α∗ ≈ 1. Thus, if a community is homogeneous with most members speaking L 2 , it is hard to imagine how one might move to one consisting of mostly L 1 speakers. The population will have to jump from one basin of attraction into another. This may only come about because of large-scale language contact or migratory effects. In contrast, consider the bifurcations that occur with the bilingual learner. Each individual in the population uses both languages in the ratio λ i (t) (for the ith individual in the tth generation). We are interested in the evolution of the mean ut = E[λi (t)] (where the expectation is taken with respect to individuals). The update rule is provided by

ut+1 = ut(1 − a) / [ut(1 − a) + (1 − ut)(1 − b)]

In this setting, there is only one stable equilibrium although both α ∗ = 0 and α∗ = 1 are fixed points. Which of these is stable depends upon the relationship of a and b with respect to each other. Thus a switch in usage frequencies from a > b to b > a would cause a switch in the stability of the population. It is easy to imagine how language change might come about. It is interesting to note, however, that with memoryless algorithms the update rule (whether the monolingual TLA or the scheme outlined earlier (Yang 2000)) gives rise to similar bifurcation diagrams where language change is easier to imagine.
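The contrast can be made concrete with a small numerical experiment. The sketch below (n = 20 sentences per learner and the values of a and b are illustrative assumptions) shows the monolingual batch map retaining two stable equilibria while the bilingual map drives every interior starting point to the same limit:

from math import comb

def monolingual_update(alpha, a, b, n=20):
    # Fraction of children who acquire L1, i.e., who observe n1 > n3.
    p1 = alpha * (1 - a)          # probability of an unambiguous L1 sentence
    p3 = (1 - alpha) * (1 - b)    # probability of an unambiguous L2 sentence
    p2 = 1 - p1 - p3              # probability of an ambiguous sentence
    total = 0.0
    for n1 in range(n + 1):
        for n3 in range(n - n1 + 1):
            if n1 > n3:
                n2 = n - n1 - n3
                total += comb(n, n1) * comb(n - n1, n3) * p1**n1 * p3**n3 * p2**n2
    return total

def bilingual_update(u, a, b):
    return u * (1 - a) / (u * (1 - a) + (1 - u) * (1 - b))

if __name__ == "__main__":
    a, b = 0.3, 0.2
    for x0 in (0.1, 0.9):
        x = u = x0
        for _ in range(100):
            x, u = monolingual_update(x, a, b), bilingual_update(u, a, b)
        print(f"start {x0}: monolingual -> {x:.3f}, bilingual -> {u:.3f}")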

10.3.3

A Bilingual Model for French

Let us briefly consider the application of this point of view to reanalyze the case of syntactic change in French during the period from the fourteenth to the seventeenth century (A.D.). Our first computational analysis of this case was conducted in Chapter 6 within a five parameter monolingual framework of Clark and Roberts 1993. Here we redo the analysis within a bilingual framework. Our analysis continues to draw heavily from the work of Ian Roberts (linguistic work in Roberts 1993 and computational work in Clark and Roberts 1993) and is influenced by a more recent treatment in Yang 2002.


Linguistic Background

Let us recall the syntactic setting. The discussion that follows is conducted within the Principles and Parameters tradition (Chomsky 1981) of linguistic theory. Two dominant parametric changes occurred in French syntax over the period under consideration. First, there was loss of subject pro-drop. In Old French (like Modern Italian), a pronominal subject could be dropped, as the following examples show.

Loss of null subjects

Modern French: *Ainsi s'amusaient bien cette nuit.
               thus (they) had fun that night.

Old French: Si firent grant joie la nuit.
            thus (they) made great joy the night.

Second, there was loss of verb-second phenomena (V2). Old French was a V2 language, so that V could raise to C (with the specifier typically filled) and occupy the second position in the linear order of the constituents. This is no longer true, as the following examples show:

Loss of V2

Modern French: *Puis entendirent-ils un coup de tonnerre.
               then heard-they a clap of thunder.

Old French: Lors oirent ils venir un escoiz de tonoire.
            then heard they come a clap of thunder.

Thus the situation is simply summarized as follows. In the beginning there was a relatively stable and homogeneous grammatical system that was +V2 and had null subjects (pro-drop). At the end, there was again a relatively stable and homogeneous grammatical system that had lost both V2 and pro-drop. In the middle there was variation, with multiple grammatical variants coexisting in the population. Computational Analysis I make the following assumptions: 1. Each speaker is potentially bilingual/multilingual with multiple grammatical systems that provide the basis for linguistic use.


2. Similarly, each child potentially acquires multiple grammatical systems based on its linguistic experience. In particular, in periods when there is linguistic variation in the adult population and the data received is not consistent with a single grammar, the child will accordingly acquire multiple systems. For illustrative purposes, I will focus on the competition between two grammatical systems. The two grammars are denoted by g + and g− respectively. The corresponding sets of surface expressions (sentences) are denoted by L+ and L− respectively. When using the grammatical system g+ speakers produce sentences with a probability distribution P + (over L+ ) and similarly when using g− speakers produce sentences with a probability distribution P− (over L− ). For example, if g+ were a head-first grammar without verb-second movement and no prodrop (like modern French), then L + consists of elements like (1) SVO (subject-verb-object; like the Modern English Mary sees the children or the Modern French Marie voit les enfants); (2) XSVO (like the English After dinner, John read the newspaper) and so on. In general, in our analysis, various choices may be made for g + and g− and the evolutionary consequences may then be examined. Since speaker/learners are potentially bilingual, each speaker has a grammatical-mix factor λ ∈ [0, 1] that characterizes how often the speaker uses g + as opposed to g− . In general, a speaker with mix factor λ produces sentences with a probability distribution given by λP + + (1 − λ)P− . Note that this distribution is over the set L+ ∪ L− . Thus there may be internal variation within each speaker and the expressions produced by such a speaker are not consistent with a single unique grammar. We can view the space of possible grammatical systems as G = {h|h = λg1 + (1 − λ)g2 }, where G is a space of formal convex combinations denoting multiple grammatical systems. As noted in earlier sections, there is also external variation in the adult population. Thus different individuals have potentially different λ-values and we can imagine the distribution of λ-values in the adult population. A summary statistic for the average linguistic behavior of the population as a whole may be provided by the mean value of λ, which we denote by E[λ]. We are usually interested in the evolution of this quantity over generational time. Bifurcations and Syntactic Change I have discussed evolutionary dynamics for a variety of learning algorithms. Two algorithms are of particular interest. One is algorithm A 1 of Sec-


tion 10.3.1. The other is the memoryless λ-estimation procedure of Section 10.3.2. Both are algorithms by which the individual child might estimate a λ-value on the basis of linguistic data. Recall that both learning algorithms resulted in the same evolutionary dynamics for E[λ] over time. This dynamical system is given by

xt+1 = a xt / (a xt + b(1 − xt))    (10.6)

where

1. xt = E[λt] is the average value of λ in the parental generation (generation t).
2. a = \sum_{s ∈ L+ \ L−} P+(s) is the probability with which a speaker when using g+ produces a trigger for that grammatical system.
3. b = \sum_{s ∈ L− \ L+} P−(s) is the probability with which a speaker when using g− produces a trigger for that grammatical system.

Recall the dynamical behavior of Equation 10.6.

1. If a > b then x = 1 is the only stable point. From all initial conditions, the population will converge over evolutionary time to a homogeneous population of g+ users.
2. If a < b then x = 0 is the only stable point. From all initial conditions, the population converges to g−.
3. If a = b then xt+1 = xt for all t. There is no change.

We may interpret the facts of language change in terms of this bifurcation. Thus, on this account, one would suggest that a homogeneous stable population of g+ users (x = 1) could become unstable if the frequencies of sentences changed so that a became less than b while before it was the other way around. Under this condition, we see that the introduction of even the slightest variation in the population would cause the language of the community to move to one of g− users, i.e., large-scale language change as a result of a bifurcation. It is also interesting to note that while syntactic diglossia is permitted within the grammatical and acquisition framework, it is usually eliminated over time unless a is exactly equal to b in such models.

Looking more closely at the grammatical theories and the data, we find that if there was no pro-drop, a +V2 grammar tends to be quite stable in comparison to a −V2 grammar if this is the only parametric difference


between the two grammars. Following the analysis in Roberts 1993 and Yang 2000, we may take the two grammars to be 1. g+ : the +V2 grammar has expressions like SVO (subject-verb-object; with verb typically in C and subject in Spec-C) and VS patterns like XVSO, OVS, and so on. 2. g− : the −V2 grammar (like Modern French) has expressions like SVO (subject-verb-object; with subject in Spec-IP) and XSVO (in general, V> 2 patterns). Following our analysis above, we see that SVO patterns do not count as triggers. The proportion of XSVO (trigger for −V2) and XVSO (trigger for +V2) patterns in the speech of g− and g+ users respectively will determine the evolution of the population. Preliminary statistics (following Yang 2000) based on the speech of modern −V2 (like English and French) and +V2 (like German and Dutch) languages suggest that a = 0.3 while b = 0.2. Consequently, the +V2 grammar would remain stable. Let us now consider a +V2 (with pro-drop) grammar in competition with a −V2 (with pro-drop) grammar. Then we have the following patterns. 2 (Note that +V2 grammars with pro-drop will not generate VO expressions, presumably because subj is in Spec-CP and this needs to be filled.) 1. g+ : SVO; XVSO; XVO 2. g− : SVO; VO; XVO; XSVO Let us do a simple calculation. Assume that with probability p the subject is a pronoun rather than a full NP. Further, with probability d, the pronoun is dropped in a pro-drop (null subject) language. If d = 1 then the prodrop is obligatory. If d = 0 then the language does not allow pro-drop. 2 These patterns are provided for illustrative purposes. In reality, of course, there are many more expressions that are generated by each of the two grammars, but these may be deemed irrelevant to the discussion at hand. Consequently, the probabilities provided may be treated as normalized after discarding these irrelevant distributions. More precisely, I am assuming the following. L+ and L− each contain a potentially infinite number of expressions. I restrict my discussion to a set A ⊂ Σ∗ of expressions where A = {SVO, XVSO, XVO, VO, XSVO}. Then for any element a ∈ A, when I put P+ (a) and similarly in values for P+ (a) in our calculations, I actually use the value P+ (A∩L +) for P− . There are two other potentially important considerations that I have eliminated from my current discussion. First, I am restricting myself to verb-medial grammars. It has been proposed that +V2 systems tend to be more stable in verb-final grammatical systems than in verb-medial ones. I do not explore this issue in part because of my supposition that French was verb-medial throughout. Second, there was a point when subject pronouns started behaving as clitics, and it is quite possible that this behavior affected the interpretation of surface expressions during the language-learning phase and altered the relevant primary linguistic data to weaken V2. I do not consider this issue any further.


Then we see that

P+(SVO) = 0.7
P+(XVSO) = 0.3((1 − p) + p(1 − d))
P+(XVO) = 0.3pd

To clarify the logic of the calculations, let us consider the probability with which XVSO would be produced by a speaker of g+. With probability 0.3 a baseform of XVSO would be produced. Now we need to calculate the probability with which this is overtly expressed, i.e., the subject is not dropped. There are two cases: the subject position is filled by a full NP (with probability 1 − p), in which case it cannot be dropped; and the subject position is filled with a pronoun (probability p), but this pronoun is not dropped (probability 1 − d). Multiplying this out and adding the two cases, we obtain P+(XVSO) = 0.3((1 − p) + p(1 − d)). Probability values for the other expressions are obtained similarly. Now consider the probabilities with which g− speakers produce their expressions. It is simply seen that

P−(SVO) = 0.8(1 − pd)
P−(VO) = 0.8pd
P−(XSVO) = 0.2(1 − pd)
P−(XVO) = 0.2pd

Given P− and P+, we can calculate a and b to be

a = 0.3(1 − pd) and b = 0.8pd + 0.2(1 − pd)

From this we see that for a > b we need 0.3(1 − pd) > 0.8pd + 0.2(1 − pd), or

pd < 1/9

Thus we see that if d = 0, i.e., the language has no pro-drop, then the dynamics is in the regime a > b, and correspondingly, the +V2 grammar remains stable. If, on the other hand, the product pd rises above 1/9, then a bifurcation occurs and the +V2 grammar becomes unstable. One might then ask how a +V2 and pro-drop grammar (as Old French putatively was) remained stable in the first place. According to this analysis it must be because p was small (so that the product pd < 1/9). Now notice that if this were the state of affairs, then the only way in which change would come about is if p increased to cross the threshold. While this is happening it is crucial that the null subject is not being lost, i.e., d is not decreasing. By this analysis +V2 is lost before the null subject is lost. If the null subject were lost first, then d = 0, the dynamics would always be in the regime pd < 1/9, and the +V2 parameter would always remain stable. On this account, +V2 was lost before the null subject was lost. Further, +V2 was lost because of the increase in p, i.e., the use of pronominal subjects in the speech of the times.

The above analysis is empirically anecdotal because we have plugged in plausible numbers for the probabilities. The point of the exercise was to show again how a bifurcation may be at the root of language change and how the conditions for change depend in a subtle way on the frequencies with which expressions are produced. In this case, the product pd is seen to determine the stability of the language. Further, we obtain a linguistic prediction, that +V2 must have been lost before the null-subject parameter was lost. The loss of +V2 must have been triggered by the increase in the use of pronominal subjects.
10.4

Conclusions

We have considered three separate issues that arise in the accurate modeling of the evolution of linguistic populations. First, there is the question of the finiteness of population sizes. In general, assuming an infinite population size allows one to derive deterministic dynamics by taking suitable ensemble averages over the population. When populations are finite, the contingencies of individual decisions do not converge to their theoretical mean behavior. As a result, the population dynamics is now stochastic. For large or medium population sizes, this stochastic behavior is like a noisy version of the original deterministic dynamics. For small population sizes, the stochastic behavior could be systematically different. Second, we considered the effects of spatial organization of the population into linguistic neighborhoods. The most interesting result from such

337

VARIATIONS AND CASE STUDIES

a consideration is a proposal about dialect formation. Since there may be multiple attractors for the population dynamics, the actual stable language that emerges in a given region will depend upon the initial conditions of the population in that region. Assuming that humans are distributed with different initial conditions in different regions of the world, one might then see the formation of different dialects, with each dialect corresponding to a attractor of the linguistic system. Third, we examined the population dynamics for multilingual learners. In a two-language setting, we considered two different bilingual learning procedures and derived the corresponding dynamical equations. In those linguistic settings where multilingual behavior is important to take into account, models of this sort will need to be developed. A case study of syntactic change in French was then conducted within this paradigm. From this analysis, we concluded that a bifurcation could have led to the loss of +V2. We conjecture that +V2 must have been lost before the null subject was lost, and further that it must have been the increase in the use of pronominal subjects in the speech of the times that triggered the loss of +V2. These predictions are empirically verifiable. Each of these investigations is incomplete but suggestive. It shows how a relevant aspect of language evolution might be systematically modeled and the kinds of results one might potentially obtain. I leave a more extensive coverage of these issues to future research.

Part IV

The Origin of Language

Chapter 11

The Origin of Communicative Systems: Communicative Efficiency In this part of the book, we turn our attention to the question of the origin of language, i.e., the genesis of human language from prelinguistic versions of it. The issue is notoriously difficult to investigate because empirical facts that pertain to this matter are particularly hard to come by. At the same time, the genesis of the language faculty certainly counts as one of the major evolutionary transitions – on a par with other significant evolutionary events such as the origin of protocells, eukaryotes, sexual reproduction, and the like (see Maynard Smith and Szathmary 1995). So in some sense, this might seem like the holy grail of linguistic inquiry — how and why did this peculiar recursive communication system of human language arise in biological populations? This particular question has attracted significant attention in recent times (see, for example, the Proceedings of the International Conferences on Language Evolution, held every two years since 1996). Whatever story one chooses to tell about how language arose, one will need to invoke learning-theoretic considerations. Thus learning at the individual level and evolution at the population level will continue to be important themes in our narrative. However, we will now need to formally develop certain other notions that have played a minimal role in the models that have been discussed so far. First, there is the notion of communicative efficiency. This has been a recurring theme and a constant subtext in linguistic studies from a func-

CHAPTER 11

342

tionalist perspective that takes the central defining feature of language to be its communicative function (see Halliday 1994 and Newmeyer 1998, for perspectives from different points of view; also Kirby 1999, for a discussion in the context of language evolution). Human language allows us a rich flexibility to describe and communicate a variety of thoughts and events to each other. While one may seriously debate the degree to which the structure of language reflects its communicative function, there is no denying the fact that among other things, language is also used for communication. There is also no denying the fact that human language allows a richer medium of communication than other animal languages. It is certainly possible that communicative efficiency was important in the evolution of competing linguistic groups where the different groups had different communicative ability. To fully develop this argument, we need a notion of differential fitness and natural selection. In much of this book, we have been considering modern1 human languages, where all languages are more or less equally efficient. However, one might imagine populations of organisms that are basically similar except with respect to their communicative language. If communicative efficiency provides biological fitness to individuals in terms of increased ability to reproduce or survive, then would populations converge to coherent linguistic states? As usual, we will be interested in the stable modes of linguistic populations under such assumptions. In addition to stability, we will examine the closely related notion of coherence. A coherent population is one in which most members of the population have a shared common language, i.e., the population is linguistically homogeneous. Often, one observes coherent populations in settings where there is no obvious centralizing agency that enforces such coherence. We investigate whether coherence can emerge through individual interactions — an idea that is explored in different contexts in the literature on self-organization and emergence in complex systems. These themes arise in the study of populations of interacting linguistic agents in animal, human, and artificial populations. For example, the notion of emergence is invoked in artificial intelligence where one asks whether intelligent behavior can evolve or emerge by interaction between simple de1 I mean modern in evolutionary time scales, i.e., the evolution of human language after the origin of the current human species from some prehuman ancestor. It is my assumption that in this modern period, there has been no genetic evolution of the language faculty, but rather cultural evolution of particular natural languages within the scope ordained by the space of possible human languages (universal grammar). This evolutionary process falls within the purview of historical linguistics.

343

COMMUNICATIVE EFFICIENCY

centralized agents rather than reflecting preprogrammed structure (design by centralized planning) (e.g., see the work of Luc Steels and colleagues in the context of designing languages for robots). Similarly, studies of signaling or animal communication also need to deal with such issues. This part of the book studies the interplay between communicative efficiency, learning, fitness, and coherence in some abstraction. In the current chapter, I develop the notion of communicative efficiency and fitness. I characterize language as a probabilistic association between form and meaning. I then provide a natural definition of communicative efficiency between two linguistic agents possessing different languages. I consider the implications of this definition for learning and evolution. I also perform an empirical study on large linguistic corpora to find that the structure of the lexicon of several modern languages does not reflect optimality in terms of communicative efficiency. In the light of this, one should take seriously the possibility that communicative efficiency as developed in these chapters plays less of a role in human language and may be a more useful concept in studies of animal or artificial communication systems. In the next two chapters, I consider a population of linguistic agents evolving with (Chapter 12) and without (Chapter 13) fitness. This fitness is derived from communicative efficiency. In both cases we find that coherent states emerge if the learning fidelity is high, i.e., if the learner is able to acquire the target language with high confidence. In particular, we see that there is a bifurcation point in the dynamics when there is a transition from incoherent to coherent states. We also find that if linguistic agents learn only (primarily) from a single individual in the population (typically a parent, e.g., among certain animal species that learn in the nesting phase), then fitness is necessary for coherence to emerge. On the other hand, if linguistic agents learn from the population at large, then coherence might emerge even in the absence of fitness.

11.1

Communicative Efficiency of Language

Consider two linguistic agents in a shared world. The agents desire to communicate different messages (meanings) to each other. Such a situation arises in a number of different contexts in natural and artificial communication systems, and it is important in such cases to be able to quantify the rate of success in information transfer, in other words, the mutual intelli-

CHAPTER 11

344

gibility or communicative efficiency2 of the agents. Each agent possesses a communicative device or a language that allows it to relate code (signal) and message, form and meaning, syntax and semantics, depending upon the context in which the communication arises. If they share the same language and this language is expressive enough and unambiguous, then mutual intelligibility will be very high. If, on the other hand, they do not share the same language, or the languages are inexpressive or ambiguous, the mutual intelligibility will be much lower. In this chapter I present an analysis of this situation. This analysis grew out of joint work with Martin Nowak and Natalia Komarova and was first presented in Komarova and Niyogi 2004. Following that development, I view languages as probabilistic associations between form and meaning and develop a natural measure of intelligibility, F (L 1 , L2 ), between two languages, L1 and L2 . This is a generalization of a similar function introduced in the context of language evolution by Hurford (1989). I ask the following question: If there is a biological/cultural/technological advantage for an agent to increase its intelligibility with the rest of the population, what are the ways to do this? The task of increasing intelligibility reduces ultimately to three related subproblems: 1. Given a language L, what language L  maximizes the mutual intelligibility F (L, L ) for two-way communication about the shared world? 2. What are some acquisition mechanisms/learning algorithms that can serve the task of improving intelligibility? 3. What are the consequences of individual language-acquisition behavior on the population dynamics and the communicative efficiency of an interacting population of linguistic agents? In the next few chapters, I develop a mathematical framework to address these questions analytically. In this chapter, I address questions (1) and (2) above. You will see, surprisingly, that the optimal language L  for maximizing communicative efficiency with a user of L need not be the same as L. In general, if L has ambiguities, then L  will end up being different from L. You will also see that in general, there may be no language L  that achieves the maximal communicative efficiency with L but there always exist languages that come arbitrarily close to this maximum. Our main result 2

I use the terms mutual intelligibility and communicative efficiency interchangeably in this book.

345

COMMUNICATIVE EFFICIENCY

is an algorithm for constructing such arbitrarily good approximations. Using this result, I demonstrate an algorithm that can learn such languages from linguistic exchanges with L. Bounds on the sample complexity of such learning algorithms are then provided. Finally, I conduct empirical studies on linguistic corpora of several languages. These studies suggest that the lexicons of these languages are not structured for optimal communicative efficiency in the sense described here. The implications of this are discussed. In the next chapter, I examine the evolutionary dynamics of populations (question 3 above) within this framework.

11.1.1

Communicability in Animal, Human, and Machine Communication

The simplest situation where communicability is readily defined corresponds to the case where the “language” may be viewed as an association matrix, A. Such a matrix simply links referents to signals. If there are M referents and N signals, then A is an N × M matrix. The entries, a ij , define the relative strength of the association between signal i and meaning j. The matrix A thus characterizes the behavior of the linguistic agent in production mode, where it may produce any of the signals corresponding to a particular meaning in proportion to the strength of the association, and in comprehension mode, where it may interpret a particular signal as any of the meanings in proportion to the association strengths. The specific settings in which such a scheme is a useful description include animal communication, human languages, and artificial languages. For instance, it often makes sense to talk about a lexical matrix as a formal description of human mental vocabularies. This approach is introduced to describe the arbitrary relations between discrete words and discrete concepts of human languages (Hurford 1989; Miller 1996 ; Regier et al. 2001; Regier 2003; Tenenbaum and Xu 2000; Komarova and Nowak 2001). Each column of the lexical matrix corresponds to a particular word meaning (or concept); each row corresponds to a particular word form (or word image). In the Saussurean terminology of arbitrary sign, the lexical matrix provides the link between signifie and signifiant (Saussure 1983). An equivalent of a lexical matrix is also at the basis of any animal communication system, where it defines the relation between animal signals and their specific meanings (Hauser 1997; Smith 1977; Macedonia and Evans 1993; Cheney and Seyfarth 1990). A classic example of this is alarm calls in primates. There are a finite number of referents that are coded using acoustic signals and decoded appropriately by recipients.

CHAPTER 11

346

Infinite association matrices can be used as a description of human languages. Human grammars mediate a complex mapping between form and meaning. Here, the space of possible signals is the set of all strings (sentences) over a finite syntactic alphabet and the set of possible meanings is the set of all strings over some semantic alphabet. Most crucially, the sets of possible sentences and meanings are infinite. This accounts for the infinite expressibility of human grammars. In artificial intelligence, the problem arises in many different settings. A number of studies have focused on the interaction of linguistic agents with each other in simulated worlds, exploring the question of whether coherent or coordinated communication ultimately emerges (see, for example, Steels 1996; Steels and Vogt 1997; Steels and Kaplan 1998; Oliphant 1999; Briscoe 2000; Kirby 1999). Much of this kind of research employs the simulation methodology of Artificial Life. In this chapter, I create a mathematical framework for these kinds of problems and derive a number of analytic results. Ultimately, I examine language coordination (coherence) in a community of linguistic agents, which has natural synergies with research on multiagent systems (see, for example, Boutilier, Shoham, and Wellman 1997, for an overview of multiagent systems). In the design of natural-language understanding systems, the goal is to develop a computer system that is able to communicate with a human. The statistical approach to this problem assumes an underlying probabilistic model for the human source. This probabilistic model is then recovered or learned from data, either by randomly drawn samples as in the case of corpus linguistics or statistical language modeling (see Charniak 1993 as well as Manning and Schutze 1999 for overviews of this point of view) or via some interactive exchanges and semantic reinforcement (Gorin, Levinson, and Gertner 1991; Isbell et al. 2001). The primary implication of the results here is that optimal communication with a language user might require us to learn a language that is different from the target source.

11.2

Communicability for Linguistic Systems

I will now develop a formal characterization of communicative efficiency between the users of two linguistic systems.

11.2.1

Basic Notions

I regard a linguistic system as an association between form and meaning. Let S be the set of all possible linguistic forms (sentences or signals) and

347

COMMUNICATIVE EFFICIENCY

M be the set of all possible semantic objects (meanings or referents). Note that depending on the context, the elements of S can be words, codes, expressions, forms, signals, or sentences. The elements of M can be meanings, messages, events, or referents. I will use the general term signals for elements of S and meanings for elements of M. The sets S and M need not be finite, but it is essential that they are enumerable. For natural languages, because of their discrete underlying structure, the sets S and M may be viewed as countable. In the lexical setting, S is the set of all words and therefore is naturally countable, while the countability of M (the meanings) is motivated by category formation in semantic space. In the case of human grammars, we may let S = Σ ∗1 be the set of all possible strings over a syntactic alphabet (Σ 1 ) and M = Σ∗2 be the set of all possible strings over a semantic alphabet (Σ 2 ). Note that in this case S and M are infinite. I define a communication system, or a language, to be a probability measure μ over S × M. Note that in the case of finite languages (human or artificial lexicons and animal communication systems), μ is related to the association matrix, A, by means of a simple rescaling. Let us enumerate all possible signals, i.e., the elements of the set S, as s1 , s2 , s3 , . . . and all possible meanings (elements of M) as m 1 , m2 , m3 , . . .. The coding and decoding schemes of the agent are contained in the measure μ in the following manner. Each user of μ is characterized by an encoding matrix P and a decoding matrix Q where    μ(si , mj )/ p μ(sp , mj ), if 0, p μ(sp , mj ) >(11.1) Pij ≡ μ(si |mj ) = 0, otherwise,    μ(sj , mi )/ p μ(sj , mp ), if 0, p μ(sj , mp ) >(11.2) Qji ≡ μ(mi |sj ) = 0, otherwise. Both P and Q matrices are easily interpreted. P ij is simply the probability of producing the signal si given that one wishes to convey the meaning m j . Similarly, Qij is the probability of interpreting the expression s i to mean mj by the same user. Matrices analogous to P and Q were introduced in Hurford 1989; however, they were not explicitly related through a common measure, μ. An effective connection between P and Q has been employed for a particular learning mechanism, called the Saussurean (Hurford 1989; Oliphant 1999). Remarks. 1. The user of a language is characterized in production mode by the matrix P and in comprehension mode by the matrix Q. This captures

CHAPTER 11

348

the fact that given a particular meaning, there might be many different ways to express it. Correspondingly, given a particular signal, there may be no unique interpretation. Thus ambiguities in sentence interpretation or polysemy in lexical semantics are incorporated. 2. We have required that P and Q arise from a common measure μ. This ensures that the user of a language is self-consistent in production and comprehension. If P and Q were completely decoupled, one could potentially have pathological situations like the following. A language user could use the word forms /cat/ and /dog/ to refer to the concepts “cat” and “dog” respectively in production mode. Yet in comprehension mode, the same language user might interpret /cat/ as “dog” and /dog/ as “cat” respectively. Such contradictory behavior in production and comprehension is eliminated by requiring P and Q to be related to a common association matrix and more generally a probability measure. 3. While a measure μ uniquely defines the corresponding P and Q matrices, the converse is not generally true. Given the P and Q matrices it might be possible to find more than one μ that would have the correct encoding and decoding matrices. An example with 2 × 2 matrices is P = Q = I and     1/2 0 1/3 0 μ2 = μ1 = 0 1/2 0 2/3 Clearly, both μ1 and μ2 lead to the same P and Q. In order to avoid such ambiguities I consider equivalence classes of measures. I will say that two measures μ1 and μ2 are equivalent to each other (μ1 ≡ μ2 ) if and only if the corresponding P and Q matrices are equal, i.e., P (1) = P (2) and Q(1) = Q(2) . 4. For a probability measure μ let us introduce Sμ = {s ∈ S| ∃m ∈ M such that μ(s, m) > 0}. This defines the set of signals that will be used in production or comprehension by a linguistic agent. In the sense of formal language theory, this is the set of well-formed syntactic expressions. In a human language setting, Sμ corresponds to the traditional notion of language as a set of grammatical expressions generated by some underlying grammar. My definition of a language as a measure μ contains this traditional notion of language as the support of the marginal probability of

349

COMMUNICATIVE EFFICIENCY μ. Similarly, we can define Mμ = {m ∈ M| ∃s ∈ S such that μ(s, m) > 0} This defines the set of all meanings that are expressible by the linguistic agent. If Mμ = M, then all meanings can be expressed. If Mμ is a proper subset of M, this means that some meanings are left unexpressed.

5. The probability measure μ, the sets S μ and Mμ , and the matrices P and Q in humans and animals presumably arise out of highly structured systems in the brain. In fact, it is clear that in human languages, these objects may not vary arbitrarily. A significant activity in generative linguistics attempts to characterize the nature of this structure and the variation that exists among natural languages of the world. Correspondingly, fieldwork on animal communication systems (see, for example, the large literature on birdsongs) characterizes the structure and variation in systems within and across species.

11.2.2

Probability of Events and a Communicability Function

The communicating agents are immersed in a world and the need to communicate messages arises as corresponding events occur in this shared world. Thus one may define a measure σ on the set of possible meanings M according to which the agents need to communicate these meanings to each other. Given two communication systems, i.e., languages μ 1 and μ2 , the probability that an event occurs whose meaning is successfully communicated from μ1 to μ2 is given by   σ(mi ) μ1 (sj |mi )μ2 (mi |sj ) P[1 → 2] = i

j

Similarly, one may compute the probability with which an event is successfully communicated from μ2 to μ1 as   σ(mi ) μ2 (sj |mi )μ1 (mi |sj ) P[2 → 1] = i

j

We may then define the effective communicability function of μ 1 and μ2 as 1 F (μ1 , μ2 ) = (P[1 → 2] + P[2 → 1]) 2

CHAPTER 11

350

This is the mutual intelligibility or communicative efficiency between a user of μ1 and a user of μ2 . In matrix notation, this may be written as F (μ1 , μ2 ) = 1/2 [tr (P (1) Λ(Q(2) )T ) + tr(P (2) Λ(Q(1) )T )]

(11.3)

where Λ is a diagonal matrix such that Λii = σ(mi ), tr(A) denotes the trace of the matrix A, and P (i) , Q(i) refer to the coding and decoding matrices associated with measure μi . Note that tr(P (1) Λ(Q(2) )T ) is simply the probability that an event occurs and is successfully communicated from user of μ1 to user of μ2 . Remarks. 1. The function F (μ1 , μ2 ) is the average probability with which μ 1 and μ2 understand each other in two-way communication mode. The function F (μ1 , μ2 ) is symmetric with respect to its arguments. If |S| = |M| and μ1 is a probability measure with support only on the diagonal elements of S × M, then P = Q = I (where I is the identity matrix). The communicative efficiency F (μ1 , μ1 ) = 1. 2. F (μ1 , μ1 ) is the communicability of two identical linguistic agents. We have 0 < F (μ1 , μ1 ) ≤ 1 For two different agents μ1 and μ2 we also have 0 ≤ F (μ1 , μ2 ) = F (μ2 , μ1 ) ≤ 1 3. Note that the marginal μ(m) is not equal to σ(m). In other words, the language of an agent is simply given by μ and the conditional probabilities associated with it. The probability with which agents communicate different meanings is determined not by the language but by the external world in which the agents are grounded. Therefore, two agents might have high communicative efficiency in some world and low communicative efficiency in another one. 4. A function similar to our communicability function was introduced by Hurford (1989). However, all meanings were treated to have equal probabilities (a uniform measure σ), and thus the function was not suitable for infinite matrices.

351

COMMUNICATIVE EFFICIENCY

11.3

Reaching the Highest Communicability

Let us assume that one of the languages is given and call this language μ 0 . According to definition 11.3, for any language μ, we have (where σ i = σ(mi )) F (μ0 , μ) =

1 σj [μ0 (si |mj )μ(mj |si ) + μ(si |mj )μ0 (mj |si )] 2

(11.4)

i,j

Let us define the best response as a language μ ∗ , such that F (μ0 , μ∗ ) = sup F (μ0 , μ)

(11.5)

μ

In what follows, I will present an algorithm for constructing a best response or a language that in some sense approaches the best response. In particular, I show that the best response need not exist. However, an arbitrarily good response can be constructed. I show how to construct a family of languages (μ where  > 0) such that F (μ0 , μ ) can be made arbitrarily close to supμ F (μ0 , μ) — the maximum possible mutual intelligibility between a user of μ0 and a user of any allowable language.

11.3.1

A Special Case of Finite Languages

In order to keep the argument as transparent as possible, I will first make three simplifying assumptions. I will discuss relaxations later. 1. The languages are finite, and the matrices have the size N × M . 2. The distribution σ is uniform, i.e. σ i = 1/M ∀i. 3. The measure μ0 satisfies the property of unique maxima, i.e. for each i, there exist a unique p0 (i) and a unique r0 (i) such that μ0 (si |mp0 (i) ) = max μ0 (si |mp ), p

μ0 (mi |sr0 (i) ) = max μ0 (mi |sr ) r

(11.6) The last condition states that there exists strictly one element of each column of μ0 (s|m) (row of μ0 (m|s)) such that it is the biggest element in the column (row). Let us maximize each of the two terms in the right-hand side of expression 11.4 separately. First, we find a matrix Q ∗ such that   μ0 (si |mj )Q∗ij = max μ0 (si |mj )Qij (11.7) i,j

Q

i,j

CHAPTER 11

352

where we maximize over all matrices Q whose elements are non-negative and sum up to one within each row. This results in the following definition of Q∗ :  1, μ0 (si |mj ) = maxp μ0 (si |mp ) ∗ (11.8) Qij = 0, otherwise In other words, in order to construct the best decoder, Q ∗ , we need to find the largest elements in each of the rows of μ 0 (s|m) and put “ones” at the correspondent slots of Q∗ . The rest of the entries of the matrix Q ∗ are zero. This is a well defined operation because of the property of unique maxima. Similarly, we can find the matrix P ∗ such that   Pij∗ μ0 (mj |si ) = max Pij μ0 (mj |si ) P

i,j

i,j

where we maximize over all matrices P whose elements are nonnegative and sum up to one within each column. The best encoder, P ∗ , is given by  1, μ0 (mj |si ) = maxp μ0 (mj |sp ) ∗ (11.9) Pij = 0, otherwise i.e., we maximize each column of the matrix μ 0 (m|s). Now, we have the best encoder and the best decoder for the language μ 0 . Finding the matrices P ∗ and Q∗ completes the task of the obverter of Oliphant 1997. However, in our setting, the two matrices cannot be independent, but they need to be related by a common measure. If a measure μ ∗ existed such that μ∗ (s|m) = P ∗ , μ∗ (m|s) = Q∗ then it would satisfy Equation 11.5, thus defining the best response. It turns out that in general, μ∗ does not exist. However, there always exists a measure that approximates the performance of P ∗ and Q∗ arbitrarily well. It is convenient to use the following shorthand notation: Pij0 = μ0 (si |mj ),

Q0ij = μ0 (mj |si )

We are ready to formulate the following: Theorem 11.1 (Komarova and Niyogi 2004) For any finite language μ0 satisfying the property of unique maxima, and a uniform probability distribution σ, we have sup F (μ0 , μ) = 1/(2M ) tr(P 0 (Q∗ )T + P ∗ (Q0 )T ) μ

353

COMMUNICATIVE EFFICIENCY

In order to prove this statement, we need to show that (a) For all μ, F (μ0 , μ) ≤ 1/(2M )tr(P 0 (Q∗ )T + P ∗ (Q0 )T ) (b) There exists a family of languages, μ  , such that lim |1/(2M ) tr(P 0 (Q∗ )T + P ∗ (Q0 )T ) − F (μ0 , μ )| = 0

→0

The proof of (a) immediately follows from the definitions of the best decoder and the best encoder. The rest of this subsection is devoted to developing an algorithmic proof of (b) following exactly the development of Komarova and Niyogi 2004. Given the matrices Q ∗ and P ∗ , we will build a family of measures, μ , such that lim μ (s|m) = P ∗ ,

→0

lim μ (m|s) = Q∗

→0

(11.10)

This is not a trivial task, which is demonstrated by the following example. Suppose that the P ∗ and Q∗ matrices are given by     1 0 0 1 ∗ ∗ Q = P = 0 1 1 0 It is clear that we cannot find a measure μ  that would satisfy conditions 11.10 for this pair (P ∗ , Q∗ ). Fortunately, it turns out that situations like this never arise. In order to prove this we will need to consider some auxiliary matrices. The Auxiliary Matrix and the Absence of Loops Let us define an auxiliary matrix X in the following way:  1, Pij∗ + Q∗ij > 0 Xij = 0, otherwise This means that the matrix X contains nonzero entries at the slots where either of the matrices, P ∗ or Q∗ , contains a nonzero entry. Now let us draw lines connecting all the “ones” of the X matrix that belong to the same row, and all the “ones” of the X matrix that belong to the same column. We will obtain some (disjoint) graphs. Let us refer to the “ones” of the X matrix as vertices. Lemma 11.1 Suppose that a finite measure μ 0 has the property of unique maxima. Graphs constructed as described above do not contain any closed loops.

CHAPTER 11

354

x α1,β1 A 1

xα3,β1

0 1

x α1,β2 x α2,β2 xα2,β3 xα3,β3

0

0

0

1

1

P*

Q* 0

1

1

0

x

x x P

Q x

x

x

Figure 11.1: No loops in graphs. Proof: Let us assume that there exists a closed loop. It looks like a polygon with right angles. Let us consider its “turning points”, i.e., those points that simultaneously belong to a horizontal and a vertical line. Suppose there are 2K such vertices (this can only be an even number). We will refer to these vertices as xαi ,βj , where the pair of integers, (αi , βj ), gives the coordinates of the vertex. Clearly, 1 ≤ i, j ≤ K. Without loss of generality, let xα1 ,β1 be connected with xα1 ,β2 with a horizontal line. Then xα1 ,β2 is connected with xα2 ,β2 with a vertical line, and so on, until xαK ,β1 is connected with xα1 ,β1 with a vertical line, thus closing the loop (see Figure 11.1, where we used K = 3). It is possible to show that exactly half of the vertices corresponds to “ones” of the P ∗ matrix, and the rest to “ones” of the Q ∗ matrix. If a vertex corresponds to a “one” of the Q∗ matrix, then the corresponding slot of the P ∗ matrix is zero, and vice versa. This is a direct consequence of the property of unique maxima. Let us now suppose that Q∗α1 ,β1 = 1, Pα∗1 ,β1 = 0 (the alternative is that ∗ Pα1 ,β1 = 1, Q∗α1 ,β1 = 0, in which case the proof remains very similar). This

355

COMMUNICATIVE EFFICIENCY

means that Q∗α1 ,β2 = 0, because by construction (see formula 11.8), there can be only one nonzero element in the same row of the Q ∗ matrix. Then the element Pα∗1 ,β2 = 1, because the corresponding vertex is present in the X matrix. This leads to Pα∗2 ,β2 = 0 (we can only have one positive element in each column of the P ∗ matrix, Equation 11.9). This argument can be continued around the loop. The Q∗ elements along the loop are alternating between 0 and 1, and so are the elements of the P ∗ matrix, see Figure 11.1. We can conclude that Pα01 ,β1 > Pα01 ,β2 , because by construction, positive elements in the Q∗ matrix correspond to the largest elements in the corresponding rows of the P 0 matrix. Similarly, we obtain 2K inequalities: Pα0i ,βi > Pα0i ,βi+1 , Q0αi ,βi+1

>

(11.11)

Q0αi ,βi+1 ,

1≤i≤K

(11.12)

(here we set αK+1 ≡ α1 and βK+1 ≡ β1 ). In Figure 11.1, the maximum elements of the rows of P 0 and the columns of Q0 are marked by crosses. The arrows indicate the direction towards the larger elements. I will now show that system (11.11-11.12) is incompatible. In order to do this, I write μ0 (si , mj ) = Q0ij Mi , where Mi is the sum of the elements of  the ith row of the matrix μ0 : Mi ≡ k μ0 (si , mk ). Then I can rewrite Pij0 in terms of Q0 and M : Q0ij Mi μ0 (si , mj ) = 0 Pij0 =  k μ0 (sk , mj ) k Qkj Mk System (11.11-11.12) can be presented as a closed chain of inequalities for Q0 :  Q0α1 ,β1

>

k Q0α1 ,β2  k

>

k Q0α2 ,β3  k



Q0α2 ,β2

... Q0αi ,βi

> ...

Q0αK ,βK

>

Q0kβ1 Mk Q0kβ2 Mk Q0kβ2 Mk

,

Q0α1 ,β2 > Q0α2 ,β2

(11.13)

, Q0α2 ,β3 > Q0α3 ,β3 Q0kβ3 Mk  0 k Qkβ Mk 0 Qαi ,βi+1  0 i , Q0αi ,βi+1 > Q0αi+1 ,βi+1 Q M k kβi+1 k  0 k Qkβ Mk Q0αK ,β1  0 K , Q0αK β1 > Q0α1 ,β1 (11.14) Q M k kβ1 k

CHAPTER 11

356

P 0 k Q 1 Mk 0 From the first two inequalities we know that > Qα2 ,β2 P Qkβ , then 0 Mk P k 0kβ2 Q Mk 1 . Conusing the next pair we similarly derive that Q 0α1 ,β1 > Q0α3 ,β3 Pk Qkβ 0 k kβ3 Mk P 0 Mk kQ tinuing along the chain, at the Kth step we have Q 0α1 ,β1 > Q0αK ,βK P Q0kβ1 M . k k kβK Using the last two inequalities, we finally obtain: Q 0α1 ,β1 > Q0α1 ,β1 . This con-

Q 0α1 ,β1

tradiction proves that there can be no closed loops in the matrix X.

Constructing the Matrix μ Now we can systematically build the matrix μ  . From Lemma 11.1 it follows that if we connect all the vertices of the matrix X by horizontal and vertical lines, the resulting (disjoint) graphs will contain no closed loops. Some of the graphs may only consist of one vertex. For each of these graphs we will perform the following procedure. Take a pair of vertices. If they are connected by a horizontal (vertical) line, refer to the corresponding entries of the Q ∗ matrix (P ∗ matrix). One of them will be one and the other, zero. Draw an arrow on the graph from the element corresponding to zero to the element corresponding to one. Repeat this for all pairs of vertices. Next, starting from some vertex, replace the corresponding element in the X matrix by , and then, following the arrows, keep replacing the elements of X by entries of the form  k , where the integer k increases or decreases from one vertex to the next depending on the direction of the arrow (we can always do this because by Lemma 11.1, there are no closed loops in the graphs of matrix X). We will call the resulting matrix A . The measure μ is obtained by renormalizing the elements of the matrix A :  Akl (11.15) μ (si , mj ) = Aij / k,l

Remark. In the algorithm above we used powers of the small parameter ε, namely ε^k, to assign to vertices of the matrix X. More generally, we can use any functions of ε, f_k(ε), such that lim_{ε→0} f_k(ε)/f_{k+1}(ε) = 0. Thus, the family μ_ε found above is just one of many such families.

Proof of Theorem 11.1

We are now ready to complete the proof of Theorem 11.1, part (b).


[Figure 11.2 here: the matrices P*, Q*, X, and A_ε computed for the example below.]

Figure 11.2: Construction of A_ε for Example (11.3.1). We first form P0 and Q0 matrices by normalizing columns and rows of μ0 respectively; this step is not shown here. Then we can construct the best encoder, P*, by identifying the maximal elements in the columns of Q0, and the best decoder, Q*, by identifying the maximal elements in the rows of P0, see the top of the figure. Next, we combine the positive elements (or vertices) of P* and Q* to create the auxiliary matrix X. The vertices of X that belong to the same column (row) are connected. In order to define the direction of the arrows, we have to refer to the matrices P* and Q*. If two vertices are connected by a vertical line, we find the corresponding elements of the P* matrix (they are encircled); the direction of the arrow is always toward the "one" of the P* matrix. Similarly, if two vertices are connected by a horizontal line, we find the corresponding elements of the Q* matrix (encircled) and direct the arrow toward the "one" of the Q* matrix. Finally, we build the A_ε matrix by replacing the "ones" of the X matrix by powers of ε. The powers of ε must be arranged in such a way that in each of the connected graphs, the arrows point from a smaller entry to a larger entry. Note that P* and Q* are not related by a common measure, i.e., no measure exists such that P* and Q* are its conditional probabilities in the sense described in this chapter.


Proof: Let us show that Equation 11.10 holds. In order to find entries of μ_ε(s|m), we need to renormalize each column of the matrix μ_ε so that its elements sum up to one. Obviously, each column will contain at most one segment of one of the graphs. By construction, the biggest element of this segment of the graph corresponds to the positive element of Q*. In the limit ε → 0, the other elements will be vanishingly small in comparison with the biggest one, and the resulting column of the μ_ε(s|m) matrix will be identical to the corresponding column of the P* matrix. The same argument holds for rows of the μ_ε(m|s) matrix, which in the limit become the rows of the Q* matrix. Thus we conclude that the algorithm of Section 11.3.1 leads to constructing a family of measures μ_ε that satisfy the requirements of Theorem 11.1.

Example. Consider the following 5 × 5 matrix:

μ0 = (1/1245) ×
[  1   92   53   88   39 ]
[ 64    8   77   15   48 ]
[  2   42   60   68   66 ]
[ 23   81    2   73   65 ]
[ 90   42   50   59   37 ]    (11.16)

For this language, sup_μ F(μ0, μ) = 394/6225. In Figure 11.2 I show the calculated P* and Q* matrices, and then construct the X and the A_ε matrices. The family μ_ε is given by

μ_ε = 1/(3(1 + ε) + ε²) ×
[ 0   ε   1   0   0 ]
[ ε²  ε   0   0   0 ]
[ 0   0   0   0   1 ]
[ 0   0   0   0   ε ]
[ 1   0   0   0   0 ]

Remark. As ε → 0, F(μ0, μ_ε) → sup_μ F(μ0, μ). If we let μ* = μ_ε|_{ε=0}, i.e., μ_ε evaluated at 0, we note that μ* ≠ μ0 in general. Further, F(μ0, μ*) < sup_μ F(μ0, μ). Thus, we have that lim_{ε→0} μ_ε = μ*, yet F(μ0, μ*) < lim_{ε→0} F(μ0, μ_ε) = sup_μ F(μ0, μ). This is a consequence of a discontinuity in the definition of the communicability function, F(L_1, L_2). In particular, the conditional probabilities entering Definition 11.4 are discontinuous when all the elements of a column or a row of μ are zero; see Equations 11.1 and 11.2. Thus the value of F(μ0, μ_ε) may have a jump at ε = 0.
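To make the first step of this construction concrete, the following is a minimal numerical sketch (mine, not from the text; the function and variable names are illustrative, and it assumes uniform σ and unique maxima). It forms the conditional matrices P0 and Q0 from a joint measure and reads off the best decoder Q* and best encoder P*:

```python
import numpy as np

def best_response_matrices(mu0):
    """Given an (unnormalized) joint measure mu0[i, j] over (sentence s_i, meaning m_j),
    return (P0, Q0, Pstar, Qstar), assuming uniform sigma and unique maxima."""
    mu0 = np.asarray(mu0, dtype=float)
    P0 = mu0 / mu0.sum(axis=0, keepdims=True)   # mu(s_i | m_j): each column sums to 1
    Q0 = mu0 / mu0.sum(axis=1, keepdims=True)   # mu(m_j | s_i): each row sums to 1

    # Best decoder Q*: for each sentence (row), a 1 on the meaning maximizing P0.
    Qstar = np.zeros_like(mu0)
    Qstar[np.arange(mu0.shape[0]), P0.argmax(axis=1)] = 1.0

    # Best encoder P*: for each meaning (column), a 1 on the sentence maximizing Q0.
    Pstar = np.zeros_like(mu0)
    Pstar[Q0.argmax(axis=0), np.arange(mu0.shape[1])] = 1.0
    return P0, Q0, Pstar, Qstar
```

The remaining step of the algorithm — connecting the vertices of X and assigning powers of ε along the arrows — would then be layered on top of these matrices.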

11.3.2  Generalizations

In the discussion in this chapter, I worked with three restrictions on the measures μ. These were (i) the property of unique local maxima in the rows and columns of μ(s|m) and μ(m|s) respectively; (ii) uniform distribution of events in the world that need to be communicated; and (iii) finite cardinality of S and M. Each of these can be dropped. A proper discussion would lead us to technical details that are beyond the scope of this book, and so I omit them here. I say just a few words about each of these issues; the interested reader may find a complete treatment in Komarova and Niyogi 2004. If there are multiple maxima, it turns out that loops may exist and Lemma 11.1 needs to be modified. Fortunately, this can be done; the notion of a neutral vertex of a graph needs to be introduced. If events do not occur with uniform probability, we simply redefine P* and Q* accordingly. For example, Q* is redefined as

Q*_{ij} = 1 if μ0(s_i|m_j) σ_j = max_p μ0(s_i|m_p) σ_p, and 0 otherwise,

and P* is similarly redefined. Everything else works out as before. Finally, to go from finite matrices to infinite matrices, we proceed by noting that a measure μ on a countably infinite space can be approximated arbitrarily closely by a measure with finite support. This reduces the infinite case to the finite case and the results discussed here apply immediately.

11.4

Implications for Learning

From the preceding discussion it is now clear that in order to maximize mutual intelligibility with a language user (characterized by the measure μ), it may be necessary to use a different measure μ*, where μ* ≠ μ. This fact has implications both for learning and for the evolution of populations of linguistic agents. Let us first consider the problem of an agent trying to learn a language in order to communicate with some other agent whose language is characterized by the measure μ. Recall that μ* (the best response) itself may not exist; however, an arbitrarily close approximation μ_ε (for any ε) does exist. Therefore the best the learner can do is estimate μ_ε. What degree of accuracy, ε, is useful or necessary will depend upon the particular application in mind. Since the measure μ is unknown to the learner at the outset, there are two natural learning scenarios depending upon how much information is available to the learner on each interaction.


1. Full information: This corresponds to the situation where the learner is able to sample μ directly to get (sentence, meaning) pairs. Thus, when the teacher speaks, both sentence and meaning are directly accessible. The strategy of the learner is to estimate μ as well as it can, derive from it the P* and Q* matrices, and ultimately estimate μ_ε using the procedure described in the previous sections.

2. Partial information: In most natural settings, however, the meaning may not be directly accessible. In other words, the learner only hears the sentence while the intended meaning is latent. What the learner reasonably may have access to is whether its interpretation of the sentence was successful or not. On the basis of this information, the learner must somehow derive the optimal communication strategy. I refer to this as learning with partial information.

Thus we see that (1) full information and (2) partial information suggest two different frameworks for learning; in either case, the learner has to estimate the P and Q matrices of the teacher.

11.4.1

Estimating P

An important task for the learner is to estimate Q*, which is derivable from the P matrix of the teacher. Recall that

Q*_{ij} = 1 if σ_j μ(s_i|m_j) = max_p σ_p μ(s_i|m_p), and 0 otherwise.

Learning with Full Information

The learner, in this case, has access to (s, m) (sentence, meaning) pairs every time the teacher produces a sentence. We can define the event

A_ij = {Teacher produces s_i to communicate m_j}

The probability of event A_ij is simply σ_j μ(s_i|m_j). Therefore, if the teacher produces n (sentence, meaning) pairs at random in i.i.d. fashion, and k_ij denotes the number of times event A_ij occurs among them, then the ratio

â_ij(n) = k_ij / n

is an empirical estimate of the probability of the event A_ij. By the law of large numbers, as n → ∞ we have â_ij(n) → σ_j μ(s_i|m_j)


with probability 1. For the case under consideration, we can even bound the rate at which this convergence occurs. For example, applying Hoeffding's inequality, we have

P[ |â_ij(n) − σ_j μ(s_i|m_j)| > ε ] ≤ 2e^{−ε²n/2}

This convergence is guaranteed for fixed (i, j). In general, the learner must estimate a collection of events. The total number of events is given by the total number of possible (sentence, meaning) pairs. As before, let us assume that there are N possible sentences and M possible meanings. Therefore, there are NM different events whose probabilities need to be estimated. The collection of events A_ij, i = 1, . . . , N; j = 1, . . . , M, is disjoint. For a finite collection of such events, we will derive a uniform law of large numbers. Let the event E_ij be

E_ij = { |â_ij(n) − σ_j μ(s_i|m_j)| > ε }

Then, by the union bound, we obtain

P[ ∪_{i,j} E_ij ] ≤ Σ_{i,j} P(E_ij) ≤ 2NM e^{−ε²n/2}

Therefore, we have

P[ ∀i, j: |â_ij(n) − σ_j μ(s_i|m_j)| ≤ ε ] > 1 − 2NM e^{−ε²n/2}

Thus, with high probability (depending upon the number of examples, n), all the empirical estimates â_ij(n) are close to the corresponding σ_j μ(s_i|m_j). Estimating the σ_j μ(s_i|m_j)'s is the first step to estimating the Q* matrix that is required for the optimal communication system.

Learning with Partial Information

Now consider the setup in (2), where the learner has no access to the meaning directly but has to guess a meaning and is told after the event whether the guess was correct or incorrect. Thus the learner has access to asymmetric information: if the guess was correct, the learner knows the true intended meaning; if the guess was incorrect, the learner merely knows what the meaning was not. As it turns out, this does not dramatically change the state of affairs. To see this, let the learner guess a meaning uniformly at random. Thus with probability 1/M the learner chooses a meaning m_j. Each time


the teacher produces a sentence, the intended meaning may be successfully communicated or not. Define the event

A_ij = {Teacher produces s_i; Learner guesses m_j; Communication successful}

The probability of event A_ij is simply (1/M) σ_j μ(s_i|m_j). The event A_ij is observable since the learner knows (i) what sentence has been uttered by the teacher, (ii) what meaning it (the learner) assigned to the sentence, and (iii) whether communication was successful. Therefore, after n sentences have been produced by the teacher, the learner can count k_ij — the number of times event A_ij has occurred — and can make an empirical estimate of the probability of A_ij as

â_ij(n) = k_ij / n

By the same argument as before, we see that â_ij(n) converges in probability to (1/M) σ_j μ(s_i|m_j), and the rates are provided by the Hoeffding bounds. Since M is fixed in advance and known, this allows the learner to estimate σ_j μ(s_i|m_j) for each i, j arbitrarily well. Let us be a little more precise about the rates of convergence. The learner's estimate of σ_j μ(s_i|m_j) is really M â_ij, where â_ij is defined above. Therefore we have that

P[ |M â_ij − σ_j μ(s_i|m_j)| > ε ] = P[ |â_ij − (1/M) σ_j μ(s_i|m_j)| > ε/M ] ≤ 2e^{−ε²n/(2M²)}

Thus the confidence in the ε-good estimate of σ_j μ(s_i|m_j) is poorer than before. By the same union-bound argument as in the full-information case, we have a uniform bound as follows:

P[ ∀i, j: |M â_ij − σ_j μ(s_i|m_j)| ≤ ε ] > 1 − 2NM e^{−ε²n/(2M²)}    (11.17)
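As an illustration of the two estimation schemes just described, here is a minimal simulation sketch (not from the book; the names and the sampling setup are mine). In both cases the learner forms counts k_ij, rescales them, and takes row-wise maxima to recover Q*:

```python
import numpy as np

rng = np.random.default_rng(0)

def estimate_Qstar_full(mu_sm, sigma, n):
    """Estimate Q* from n (sentence, meaning) pairs produced by the teacher.
    mu_sm[i, j] = mu(s_i | m_j); sigma[j] = prior probability of meaning m_j."""
    N, M = mu_sm.shape
    joint = mu_sm * sigma                          # sigma_j * mu(s_i | m_j)
    draws = rng.choice(N * M, size=n, p=joint.ravel() / joint.sum())
    counts = np.bincount(draws, minlength=N * M).reshape(N, M)
    a_hat = counts / n                             # estimates sigma_j * mu(s_i | m_j)
    Qstar_hat = np.zeros((N, M))
    Qstar_hat[np.arange(N), a_hat.argmax(axis=1)] = 1.0
    return Qstar_hat

def estimate_Qstar_partial(mu_sm, sigma, n):
    """Same task, but the learner only sees the sentence and a success signal
    for a uniformly random guessed meaning; M*a_hat estimates sigma_j*mu(s_i|m_j)."""
    N, M = mu_sm.shape
    joint = mu_sm * sigma
    counts = np.zeros((N, M))
    for _ in range(n):
        k = rng.choice(N * M, p=joint.ravel() / joint.sum())
        i, j_true = divmod(k, M)                   # teacher's sentence and meaning
        j_guess = rng.integers(M)                  # learner's uniform guess
        if j_guess == j_true:                      # communication successful
            counts[i, j_guess] += 1
    a_hat = M * counts / n
    Qstar_hat = np.zeros((N, M))
    Qstar_hat[np.arange(N), a_hat.argmax(axis=1)] = 1.0
    return Qstar_hat
```

The partial-information estimator needs roughly M² more interactions for the same accuracy, mirroring the extra factor of M² in the exponent of the bound above.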

11.4.2  Estimating Q

Let us now consider the task of estimating P*, which is derivable from the Q matrix of the teacher. The same arguments as in the previous section apply. Recall that

P*_{ij} = 1 if μ(m_j|s_i) = max_p μ(m_j|s_p), and 0 otherwise.

Learning with Full Information

Here the learner has direct access to the meaning assigned by the teacher to each sentence. Therefore, the learner need only pick a sentence uniformly at random (with probability 1/N) and produce it for the teacher to hear. Let us define the event

P[|ˆ aij −

−2 n

P[∀i, j|M N a ˆij − μ(mj |si )| ≤ ] > 1 − N M 2e 2N 2 Learning with Partial Information

The learner simply picks a (sentence, meaning) pair uniformly at random (with probability N1M ). Define the event Aij = Learner produces (si , mj ); Communication is successful. The event Aij is observable by the learner on each trial. The probability of event Aij is N1M μ(mj |si ). After n trials (where the learner speaks), the learner counts the number kij of times event Aij occurs. Therefore, we again have −2 n 1 μ(mj |si )| > ] ≤ 2e 2M 2 N 2 P[|ˆ aij − NM Using the same arguments as before, we have −2 n

P[∀i, j|M N a ˆij − μ(mj |si )| ≤ ] > 1 − N M 2e 2M 2 N 2

11.4.3

(11.18)

Sample Complexity Bounds

Now we can put the pieces together to determine the number of learning events that need to occur so that with high probability, the learner will be able to develop a language with -good communicability. Let the teacher’s measure be μ. We will assume that μ is such that the P and Q matrices have unique row-wise and column-wise maxima respectively. First let us introduce the margin by which the maximum value clears all other values in the row and column respectively. This margin will play an important role in determining the number of learning events.

CHAPTER 11

364

Definition 11.1 For each i, let ji∗ = arg maxj σj μ(si |mj ) and for each j, let i∗j = arg maxi μ(mj |si ). Then, we define the margin γ to be the largest real number such that ∀j = ji∗

σji∗ μ(si |mji∗ ) ≥ σj μ(si |mj ) + γ and μ(mj |si∗j ) ≥ μ(mj |si ) + γ

∀i = i∗j

Learning with Partial Information We have described how to estimate the Q ∗ and P ∗ matrices; the following theorem provides a bound on the number of examples needed to ensure correct estimates: Theorem 11.2 (Komarova and Niyogi 2004) If the total number, n, of interactions between teacher and learner (with partial information) is greater 2 2 than 64Mγ 2 N log( 4Mδ N ), then with high probability > 1 − δ, the learner can construct a measure that will give arbitrarily good communicability with the teacher. Proof: Let there be n/2 interactions where the teacher speaks and the learner listens and n/2 interactions of the other form. The learner constructs estimates of σj μ(si |mj ) and μ(mj |si ) in the manner described previously. Let the estimates be denoted by pˆij and qˆij respectively. By setting  = γ/4 in Equations 11.17 and 11.18, we obtain −γ 2 n

P[∀i, j|ˆ pij − σj μ(si |mj )| ≤ γ/4] > 1 − 2N M e 64M 2 and

−γ 2 n

P[∀i, j|ˆ qij − μ(mj |si )| ≤ γ/4] > 1 − 2N M e 64M 2 N 2 Using the fact that P(A ∩ B) ≥ P(A) − 1, we  can see that with  + P(B) −γ 2 n −γ 2 n probability greater than 1 − 2N M e 64M 2 N 2 + e 64M 2 , the estimates pˆij and qˆij are both within γ/4 of the true values. The learner chooses Q ∗ and P ∗ using the estimated matrices. Let us first consider the case of Q ∗ . For each i the learner desires to obtain ji∗ , given by ji∗ = arg max σj μ(si |mj ) j

The learner chooses ˆji = arg max pˆij j

365

COMMUNICATIVE EFFICIENCY

and we claim that ˆji = ji∗ . In order to prove this, assume that this is not the case. Then we get immediately: σji∗ μ(si |mji∗ ) ≥ σˆji μ(si |mˆji ) + γ However, we have the following chain of inequalities: σˆji μ(si |mˆji ) ≥ pˆiˆji − γ/4 ≥ pˆiji∗ − γ/4 ≥ σji∗ μ(si |miji∗ ) − γ/2 which leads to a contradiction. This argument holds for every i, therefore, since ˆji = ji∗ for each i, the Q∗ matrix is identified exactly. Similarly, we can show that the P ∗ matrix is also identified exactly. The only thing that remains is to ensure that n is large enough so that this occurs with high probability. We have   −γ 2 n −γ 2 n −γ 2 n 2 2 2 64M N 64M +e ≤ 4N M e 64M 2 N 2 ≤ δ 2N M e This is satisfied for n > 64Mγ 2 N log( 4Mδ N ). Thus, with probability greater than 1 − δ, both P ∗ and Q∗ are identified exactly. Now the procedure of approximating the measure may be applied. Remarks. 2

2

1. The number of examples is seen to be a function of M, N, and γ. The margin γ, which depends upon the teacher's language μ, determines, in some sense, how easy it is for the learner to estimate the Q* and P* matrices. It therefore characterizes the learning difficulty of μ in this setting.

2. Infinite matrices are not learnable. In fact, infinite dimensional spaces are known to be unlearnable (see Vapnik 1998), and therefore further constraints will be required on the space of possible measures to which the teacher's language belongs. Such constraints are presumably embodied in the Universal Grammar (Chomsky 1986) constraints that learning agents bring to the task.

3. The constants in the bound on sample complexity may be tightened, although the order is essentially correct. For example, we have let the interactions be symmetric, i.e., the numbers of sentences the learner produces and receives are the same. It is easy to check that a more favorable bound is obtained when the learner speaks N² times as often as it listens. In this case, it is enough to have (32M²(N² + 1)/γ²) log(4MN/δ) interactions in all.


Learning with Full Information

For completeness, let us state the number of interactions needed to learn in setting (1). This is given by the following theorem:

Theorem 11.3 (Komarova and Niyogi 2004) If the total number, n, of interactions between teacher and learner (with full information) is greater than (64N²/γ²) log(4MN/δ), then with high probability > 1 − δ, the learner can construct a measure that will give arbitrarily good communicability with the teacher.

The proof is very similar to that of the previous theorem, and I omit it for this reason. It is noteworthy that learning with full information requires a factor of M² fewer interactions. This is not surprising since the meanings are accessible, and the larger the number, M, of different concepts, the greater the difference between learning with full and partial information.
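The two bounds are simple enough to evaluate directly. The sketch below (mine; parameter values are arbitrary illustrations) computes the sufficient numbers of interactions from Theorems 11.2 and 11.3 and their ratio:

```python
import numpy as np

def n_partial(M, N, gamma, delta):
    """Interactions sufficient under partial information (Theorem 11.2 bound)."""
    return 64 * M**2 * N**2 / gamma**2 * np.log(4 * M * N / delta)

def n_full(M, N, gamma, delta):
    """Interactions sufficient under full information (Theorem 11.3 bound)."""
    return 64 * N**2 / gamma**2 * np.log(4 * M * N / delta)

# Example: M = N = 10 meanings/sentences, margin 0.05, confidence 1 - delta = 0.95.
print(n_partial(10, 10, 0.05, 0.05) / n_full(10, 10, 0.05, 0.05))  # ratio = M**2 = 100
```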

11.5

Communicative Efficiency and Linguistic Structure

There have often been discussions about the role of communicative efficiency in the structure and possibly the evolution of language. Functionalist perspectives on linguistics hold that the primary function of language is communication or transfer of information from speaker to hearer. Furthermore, it is argued that such communicative pressures shape the structure of language so that over time, the structure of language adapts to facilitate the transfer of information. While there is undoubtedly some merit to this point of view, one must examine the empirical evidence carefully to tease apart the domains in which communicative efficiency may play an explanatory role and those in which it does not. In this section, I provide an empirical study of the structure of lexical items that suggests that a tight coupling between communicative efficiency and lexical structure may not always be present. This is joint work with Dinoj Surendran and is reported at length in Surendran 2003. More largescale studies of this sort are necessary to fully flesh out this issue in other domains.

11.5.1  Phonemic Contrasts and Lexical Structure

Consider the finite number of words w1 , . . . , wn in English. Each word wi may be viewed as a string of phonemes.3 Now consider spoken communication between a speaker and hearer of English. Assume the speaker produces a single word. If all the phonemes in the sequence are heard correctly by the hearer, then the word has been successfully transmitted from speaker to hearer. If the speech-perception apparatus of the speaker is error-free, then all words (barring homophones) can be distinguished from each other and communication would be perfect. Communicative efficiency would be high. Now imagine that the hearer can tell all phonemes apart except /p/ and /b/, i.e., the hearer cannot distinguish between voiced and unvoiced labial stop consonants. As a result, the hearer would not be able to tell apart words that differ by exactly this phonemic contrast — in this case, the voicing feature for labial stop consonants. For example, the hearer would not be able to tell apart the words “pat” (/pat/) and “bat” (/bat/) or “pit” (/pit/) and “bit” (/bit/) and so on. Information would no longer be perfectly transmitted from speaker to hearer. By considering all the words in the lexicon that rely on this distinction, one can try to quantify how much information is lost on the whole. Let us do this now. A natural measure of uncertainty and information content is provided by entropy. If p1 , . . . , pn are the probabilities with which the n words are used on average, then the entropy of the entire lexicon is given by H(W ) =

 i

pi log(

1 ) pi

To appreciate this measure, note that if all words were equally likely, then H(W ) =

n  1 log(n) = log(n) n i=1

Thus, the information content of the lexicon grows with the size of the lexicon. The entropic measure accounts for the fact that all words may not 3

More generally, modern linguistic theory views each phoneme to be a complex of distinctive features. The segmental phonology of the Sound Pattern of English (Chomsky and Halle 1968) decomposes each word into a sequence of feature bundles, and phonological operations are defined on feature spaces rather than phoneme spaces. More recently, autosegmental phonology (Goldsmith 1976 and later work) considers these features to be arranged in an overlapping pattern along multiple tiers rather than along a linear time sequence. The results presented here may be reexpressed in terms of features rather than phonemes in both segmental and autosegmental frameworks.

CHAPTER 11

368

be equally likely. If a particular set of three words is used almost all the time and the rest of the words are hardly ever used, then it is informally clear that the effective size of this lexicon is closer to 3 than it is to n. The entropic formulation H(W ) captures this distinction naturally. H(W ) may be regarded as an average measure of the information transmitted from speaker to hearer by transmitting words of the lexicon. If the phonemic distinction between /b/ and /p/ collapses, then the lexicon W may be partitioned into cohorts (homophone classes after the phonetic collapse) of words that are indistinguishable within each cohort. The hearer can only distinguish cohorts from each other, and the size of the lexicon as perceived by the hearer reduces to the number of cohorts. Let c1 ({/p/, /b/}), . . . , ck ({/p/, /b/}) be the k cohorts obtained by losing the distinction between /b/ and /p/. Let us denote this reduced lexicon as W ({/p/, /b/}) = {c1 ({/p/, /b/}), . . . , ck ({/p/, /b/})}. The probability with which the hearer will encounter a word that belongs to c1 ({/p/, /b/}) is given by  pi P1 = wi ∈c1 ({/p/,/b/})

Similarly, the probability for each of the other cohorts is calculated as P i for the ith cohort. The information content of this reduced lexicon is given by H(W ({/p/, /b/})) =



Pi log(

i

1 ) Pi

The normalized loss of information is given by F L(W ({/b/, /p/}) =

H(W ) − H(W ({/p/, /b/}) H(W )

(11.19)

This is a quantitative measure of the functional load carried by the ability to distinguish between /p/ and /b/. It is easy to check that 0 ≤ F L ≤ 1. Thus F L is the fraction of information lost (at the lexical level) by losing the ability to distinguish between /p/ and /b/. It is the lexical work this phonetic contrast does for the language in question.

11.5.2

Functional Load and Communicative Efficiency

The lexical entropy H(W ) is an information-theoretic measure of how much information is transmitted on average from speaker to hearer via the lexicon. It is therefore correlated with the communicative efficiency of the lexicon of

369

COMMUNICATIVE EFFICIENCY

the language. The functional load F L(W ({/p/, /b/})) is a relative measure of how much information is lost as a result of not being able to make the distinction between /p/ and /b/. It is clear that for every pair of phonemes /π i / and /πj / we can define the functional load of that phonetic contrast. The functional load is therefore correlated with the loss of communicative efficiency because of being unable to make the phonetic contrast. In previous sections, we have developed the notion of communicative efficiency in terms of the probability with which the speaker and hearer are able to understand each other on average. We can also formulate functional load in these terms. For example, if all words can be distinguished from each other, then the probability of successful transmission of lexical items is equal to 1. On the other hand, if /b/ and /p/ cannot be distinguished, then what is the new probability of successful transmission? This will depend upon the precise nature of the listener’s guessing strategy. When a speaker produces a particular word, the listener will only be able to perceptually identify the cohort to which the word belongs. Thereafter, the listener will have to use some other strategy to guess a particular word in the cohort. Let us assume a randomized strategy where for each word w i in cohort cj , the learner guesses that word with a probability proportional to f (pi ) where pi is the probability with which the word is used in general. Different choices of f may be made. For example, one could have (i) f (p i ) = pi , (ii) f (pi ) = log(pi ), (iii) f (pi ) = 1( else 0) ⇔ pi > pk ∀k such that wk , wi ∈ cj . Then the probability of transmission is given by t=

k  

pj g(pj )

i=1 wj ∈ci

where g(pi ) =

P

f (pi ) {k|wi ,wk ∈cj }

f (pk ) .

This may be computed for every contrast

and provides a measure of communicative efficiency of the contrast. The probability of error is given by e = 1 − t and is a measure of the loss of communicative efficiency as a result of being unable to make the contrast. Unfortunately, the exact numerical value of this quantity depends upon the choice of the functional form f and for this reason, we use Equation 11.19 in the rest of this discussion. It is worth noting, however, that all these characterizations of communicative efficiency are correlated with each other. To see the simplest example of this, let us again assume that all n words in the lexicon are equally

CHAPTER 11

370

likely. Then the communicative efficiency may be calculated as t=

k  k   k 1 1 1 1  g(pj = ) = g(pj = ) = n n n w ∈c n n w ∈c i=1

j

i=1

i

j

i

Therefore the loss in communicative efficiency is e =1−t=1−

k n

Thus if there are n cohorts, then each cohort consists of a unique word and there is no loss in communicative efficiency. If there is only one cohort, then the only communicative efficiency that remains is through random guessing. For large lexical sizes, the loss is almost 1. Now consider the information-theoretic measure of functional load. For cohort ci , one may calculate Pi = |cni | where |ci | is the size of the ith cohort, i.e., the number of words in that cohort. Then it is easily seen that H(W ) =  log(n), H(W ({/p/, /b/})) = ki=1 |cni | log( |cni | ), and  1 |ci | log(|ci |) n log(n) k

FL =

i=1

It is possible to provide a range in which F L must lie. By Jensen’s inequality, we have that k  n |ci | log( ) ≤ log(k) n |ci | i=1

log(k) . On the other hand, because there are n from which we get F L ≥ 1 − log(n) words distributed among k cohorts, by a version of the pigeonhole principle, we have |ci | ≤ n − k + 1 ∀i

Applying this, we have F L ≤ 1−

log(n−k+1) . log(n)

Thus

log(n − k + 1) log(k) ≤ FL ≤ log(n) log(n)

Thus we see that both F L and loss of communicative efficiency decrease as a function of the number of cohorts k.

371

11.5.3

COMMUNICATIVE EFFICIENCY

Perceptual Confusibility and Functional Load

Speakers of a language have some choice regarding the lexical items they wish to use. In general, for optimal communication, it is advantageous to have as few homophones as possible and to be able to distinguish as many words as possible. Therefore, if a particular phonetic contrast is difficult to make perceptually, then an optimally structured lexicon should not rely heavily on making this distinction. For example, if it is very hard to distinguish /m/ (labial nasal consonant) from /n/ (alveolar nasal consonant), it would seem suboptimal and inefficient to have a lot of words in the language that differ from each other by exactly this single distinction. Confusion would be rampant. In other words, if a phonetic contrast is difficult to make, it should have low functional load. On the other hand, if a contrast is easy to make, it should be utilized in developing lexical distinctions. Thus if communicative efficiency played a role in the evolution of linguistic structure, we should observe a correlation between the perceptual difficulty of making a phonetic contrast and the functional load of that contrast. Data was collected from several languages (Dutch, English, and Chinese) to examine this question. The perceptual confusibility between phonemes is a psychoacoustic property that depends upon the similarity between the acoustic realizations of those phonemes and the ability of the brain and its perceptual mechanisms to discriminate between them based on their acoustic differences. Psychophysical data on phoneme-confusion matrices may be obtained for several languages from the long tradition of research in experimental psycholinguistics. Lexical data where lexical items are annotated with citation form and colloquial pronunciation patterns, frequency of usage, and semantic and syntactic information was obtained from the more recent tradition of research in corpus-based linguistics. Surprisingly, we find in all of these languages that there is no significant correlation between functional load and confusibility. The lexicon is therefore not optimally adjusted to our perceptual apparatus. The design of the lexicon compromises on its communicative efficiency. As an example of this phenomenon, consider Figure 11.3, where the functional load is plotted against perceptual confusibility for phonetic distinctions in English. Sixteen obstruents in American English have been considered. There are therefore 120 distinct pairs of obstruents to consider. For each such pair of obstruents, the x-axis plots the functional load carried by the distinction corresponding to this pair. This functional load is based on syllable data from the Switchboard corpus (Greenberg 1996) of American

CHAPTER 11

372

Lexical Importance vs Perceptual Difficulty for all obstruent pairs x,y in American English −1.5

−2

log probability of confusing x with y

−2.5

−3

−3.5

−4

−4.5

−5

−5.5

−6

−6.5

0

0.002

0.004

0.006 0.008 functional load of x and y

0.01

0.012

Figure 11.3: Functional Load and Perceptual Confusibility. English. The y-axis plots the log probability of confusing the two phonemes in the pair. The probability is derived from a confusion matrix of Wang and Bilger 1973 on American English speakers and subjects. The probability is a weighted combination of the probabilities of hearing phoneme i when phoneme j was spoken and vice versa. The weighting is from a method proposed by Luce (1959) and Wagenaar, described in Van der Kamp and Pols 1971. As one can see, the correlation is only slight. In fact, the correlation coefficient as measured is positive at 0.34. On the other hand, an optimally designed lexicon would show negative correlation with high perceptual confusibility (high probability of confusing) corresponding to low functional load. These results are robust. The immediate objection that may be raised is that other contextual cues help in identifying the word uniquely — cues that a purely perception-based account does not take into account appropriately. To counter this, one may consider subcategories of words based on other attributes and repeat the above experiment on these subcategories. Thus, one may examine words by syntactic category, semantic class (gleaned from

373

COMMUNICATIVE EFFICIENCY

the WordNet Project; Fellbaum 1998), stress patterns, and syllable structure, and this lack of correlation remains. Similarly, one may factor in the additional information provided by visual cues besides the acoustic ones in perceptual consfusibility, and the lack of correlation remains. Rigorous empirical tests on these issues on many different languages may be found in Surendran 2003. What do we conclude from this? It points to the fact that in this particular context, the structure of the lexicon does not display any sign of having been optimized to suit the perceptual limitations of humans. There are several possible interpretations of this empirical result. First, it points to the possibility that over historical time scales, communicative efficiency might play little role in the structure of natural languages as they are today. In studies of language evolution and historical linguistics, it is tempting to imagine a scenario where a protolanguage originates and then evolves over historical time scales to adapt itself to the communicative needs of humans. If we follow this logic, we would expect that the structure of the phonetic inventory and the structure of the lexicon would coevolve over historical time scales to yield better-adapted lexicons than the ones that are currently attested in modern languages. Second, it may well be that factors other than the one considered here may need to be incorporated in a significant way into a proper formulation of functional load or communicative efficiency. For example, an important factor that we have not discussed is ease of production. While a lot of mathematical work exists on speech production in general, there are no quantitative characterizations of the ease of production for which empirical data is available. One way around this might be to consider suitable proxies such as developmental data about when different phonemes arise in child speech. I leave this as an open question. Third, it may be that internal optimization of linguistic interfaces rather than external optimization of communicative efficiency is the key fact driving change and evolution. For example, a modular view of linguistics supported in the generative tradition (see Chomsky 1995) suggests that the interfaces between the various modules of a linguistic system (phonological, syntactic, semantic) and the conceptual-intensional system of thought need to be well matched for smooth working of the system as a whole. It is these interfaces that may be optimized with respect to each other. Thus, a functionalist perspective on this might suggest that the primary function of language is to aid thought rather than communication. Seeking evidence for optimality in communication may prove to be futile. These caveats underscore the importance of constantly relating models

CHAPTER 11

374

and theories to empirical facts and phenomena. With that sobering thought, let us continue.

Chapter 12

The Origin of Communicative Systems: Linguistic Coherence and Communicative Fitness

In this chapter, we consider a population of interacting linguistic agents where communicative efficiency provides Darwinian fitness that translates into reproductive success. The basic framework differs from the models in Part III in three important respects. First, we assume here that the rate at which an agent produces offspring depends upon its communicative efficiency with the rest of the population at large. This is a particular interpretation of how the forces of natural selection might operate in a communicative setting. In keeping with the fundamental framework of this entire book, we assume that the precise language of the parent is not genetically transmitted to the child. Rather, the child will have to learn this language on the basis of linguistic examples. However, unlike most of the models of Part III, the parent is assumed to be the dominant source of primary linguistic data for the child — in other words, the child learns mostly from the parents. Finally, while all the models of Part III resulted in iterated maps that assumed a generational structure, here we will work with differential equations that provide one way to effectively deal with the overlapping-generations issue. The central theme of this chapter is the interplay between individual learning and population dynamics. I will outline the conditions for the emergence of coherent linguistic states where most of the population converges to a single shared language over time. These conditions are seen to


be related to the learning fidelity of individual children. In particular, if the learning fidelity is high, coherence is achieved. If it is low, no such coherence emerges. More interestingly, the population dynamics undergoes a bifurcation from incoherent to coherent regimes as the learning fidelity is changed. Learning fidelity is the probability with which the child successfully learns the language of its parents. Since this probability will depend upon the complexity of the space of possible grammars, we see that the emergence of coherence is related now to this complexity. In this sense, learnability, evolvability, and grammatical complexity are linked. This chapter is based on joint work with Martin Nowak and Natalia Komarova and the results presented here first appeared in Nowak, Komarova, and Niyogi 2001.

12.1

General Model

12.1.1

The Class of Languages

Following the previous chapter, we take a language to be a probability measure μ on a countable set. Let us assume that there are n possible languages¹ given by μ_1, . . . , μ_n. We characterize the mutual intelligibility between these languages by an n × n matrix A. The (i, j) entry of A (denoted by A_ij) is the probability that a speaker using language μ_i is understood by a hearer using μ_j. Using the notion of communicative efficiency developed previously, a natural choice for A_ij is

A_ij = Σ_{m∈M} σ(m) Σ_{s∈S} μ_i(s|m) μ_j(m|s)

where S is the space of possible expressions or signals and M is the space of possible meanings. σ is a measure on M that denotes the prior probability with which the need arises to communicate different meanings. The rest of the chapter does not depend in any crucial sense on how A_ij is defined as long as such a matrix of n² numbers exists. In general, 0 ≤ A_ij ≤ 1. However, I will conduct a detailed analysis only for the symmetric

¹ In general, one might consider the space of languages to be a continuous space of possible measures. Conventionally, however, language is taken to be categorical, discrete, combinatorial, and therefore countable. In order to engage this traditional conception of language, I have chosen countable spaces of languages. The countably infinite case presents considerable mathematical difficulty, and I focus in this chapter only on the case of n languages in competition with each other. The finite n case is particularly suited to analyses in the principles and parameters tradition. However, it is worth noting that the analysis is conducted here for arbitrary n and therefore the large n limit may be viewed by some as a better approximation of reality.


case where A_ii = 1 and A_ij = a if i ≠ j. In other words, each language has perfect mutual intelligibility with itself. For any pair of different languages, the mutual intelligibility is a and does not depend upon the pair involved. Thus the languages are all equally fit and there are no clusters within the space of languages. The more complex case where languages have different fitnesses or group into more or less mutually intelligible clusters can also be treated but will not be dealt with in any detail here.

12.1.2

Fitness, Reproduction, and Learning

Let us consider a population of constant size. Each person uses only one language. The fraction of people who speak the language μ_j is denoted by x_j. Thus the linguistic state of the population is given by n numbers x_1, . . . , x_n such that x_i ≥ 0 and Σ_{j=1}^n x_j = 1.

Fitness

The overall fitness of an individual with language μ_i is its average communicative efficiency with the rest of the population as a whole. Recall that the mutual intelligibility between μ_i and μ_j can be given by the following symmetric formula:

F(μ_i, μ_j) = (1/2)(A_ij + A_ji)

The average communicative efficiency of a speaker of μ_i is then defined to be

f_i = f_0 + Σ_{j=1}^n F(μ_i, μ_j) x_j    (12.1)

Here, f0 is the background fitness that does not depend on the person’s language. The language-dependent fitness is related not just to the person’s own language but also to the proportion of people speaking various languages in the population. Thus a speaker of μ 1 would have a fitness of f0 + 1 if everyone speaks μ1 in the population but a different (lower) fitness if everyone else speaks μ2 . Ultimately, the evolutionary dynamics will depend upon the F matrix, and the tools of the previous chapter are used to provide a plausible measure in terms of the communicative efficiency matrix A.


Differential Reproduction

Following the central tenet of natural selection, we assume that individuals reproduce in proportion to their fitness.² One may argue that successful communication about potentially life-threatening events aids the survival of organisms in an uncertain world. Therefore those who have higher communicative efficiency have greater chances of survival and correspondingly greater ability to reproduce.

Learning

Children learn from their parents. For simplicity we assume that each child has only one parent, i.e., each child learns from one teacher. We allow for mistakes during language acquisition. These mistakes may be due to the finite size of the primary linguistic data or the input of nonparents or both. At any rate, it is possible to learn from a person with language μ_i and end up speaking language μ_j. The probability of such a transition is denoted by Q_ij. The matrix Q depends on the matrix A because the latter defines how close different grammars are to each other (and therefore, how easy it is to confuse them with each other in the learning process). The dependence of Q on A can be modeled if we make assumptions about the precise nature of learning. Much of Part II of this book discusses the theory of learning, and the tools developed therein may be used to evaluate the dependence of Q on A.

12.1.3

Population Dynamics

How will the population evolve? Assume a generational structure and consider two successive generations. Let the state of the population in generation t be given by x_1(t), . . . , x_n(t). Since reproduction is proportional to fitness, the proportion of the next generation who are offspring of μ_j users is given by f_j x_j(t) / Σ_{i=1}^n f_i x_i(t). Each of these children attempts to learn μ_j. The proportion of μ_j users in the next generation is easily seen to be

x_j(t+1) = Σ_{i=1}^n x_i(t) f_i Q_ij / Σ_{k=1}^n f_k x_k(t)

² It has been wisely noted (see, e.g., Ariew and Lewontin 2004) that the notion of fitness has received many disparate and mutually inconsistent treatments in the evolutionary literature. Therefore it becomes unclear what one means by fitness in evolutionary contexts in general. We will not enter into such philosophical tangles at this point but proceed with our particular interpretation of it for the time being.


In order to avoid the assumption of generational structure, we will discretize time finely and consider a natural system of ordinary differential equations that characterize the dynamics. A first candidate for this is

ẋ_j = Σ_i f_i x_i Q_ij,   1 ≤ j ≤ n

This captures the fact that the rate of growth of x_j ought to be proportional to the probability with which children become μ_j users. Unfortunately, the above equation can give rise to unbounded growth for some of the x_j's. The initial state of the population is a point on the n-simplex. The dynamics needs to be defined so as to constrain the evolutionary trajectory to lie on the n-simplex from any such initial condition. In order to do this, we need additional constraints:

1. Total population size is constant, i.e., Σ_{j=1}^n x_j(t) = 1. This is realized by ensuring that the net differential is 0, i.e., Σ_{j=1}^n ẋ_j = 0.

2. Population sizes are positive, i.e., ∀j, t: x_j(t) ≥ 0. In order to achieve this we need that if x_j = 0 for some j, then ẋ_j ≥ 0.

Incorporating these constraints, the dynamics of a population (x_1, . . . , x_n) may be suitably captured by the following general system of ordinary differential equations:

ẋ_j = Σ_i f_i x_i Q_ij − φ x_j,   1 ≤ j ≤ n    (12.2)

where φ = Σ_{m=1}^n f_m x_m. Note that φ may be interpreted as the average fitness of the population, and its language-dependent part is the grammatical coherence. Thus it defines the overall probability that a sentence uttered by a randomly chosen agent is understood by another randomly chosen agent. Equation 12.2 is similar to a quasi-species equation (Eigen and Schuster 1979), but has frequency-dependent fitness values (Nowak 2000). We analyze this in some detail for symmetrically distributed language spaces.

12.2

Dynamics of a Fully Symmetric System

In order to investigate system 12.2, we need to specify the matrices A and Q. Let us consider the simplest case where A_ij = a, a constant, for all i ≠ j, and A_ii = 1. We will refer to such a matrix as a fully symmetric A


matrix. It corresponds to the situation where all languages have the same communicability with each other. The fitness in this case is simply

f_i = (1 − a) x_i + a + f_0    (12.3)

Next, I introduce the notion of a learning fidelity, q, which is the probability of learning the teacher's (parent's) language, i.e., the probability of learning language μ_i given that the teacher speaks μ_i. We will assume all languages are equally easy (or hard) to learn, i.e., q does not depend upon i. In keeping with our assumption of an equidistant configuration of languages, we further assume that they are equally confusible on average. Thus, if a mistake is made in learning the parent's language, then it is equally likely that the person will speak μ_j, j ≠ i, for any j. The probability of being taught μ_i and learning μ_j is therefore u = (1 − q)/(n − 1), for each j ≠ i. The quantity u is called the error rate of language learning. Therefore the Q matrix is defined by

Q_ii = q,   Q_ij = u = (1 − q)/(n − 1),   i ≠ j    (12.4)

The learning accuracy satisfies 1/n ≤ q ≤ 1. Perfect learning implies q = 1, i.e., no mistakes are made; q = 1/n is equivalent to random guessing on the part of the learner. With these assumptions, system 12.2 becomes

ẋ_j = (1 − a) [ −x_j³ + x_j² q + ((1 − q)/(n − 1) − x_j) Σ_{i≠j} x_i² ] − (a + f_0)(1 − q)(n x_j − 1)/(n − 1)    (12.5)

for all 1 ≤ j ≤ n.
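Before analyzing the fixed points, it may help to see the dynamics numerically. The following is a minimal simulation sketch (mine, not from the text; the Euler integration, the renormalization step, and the parameter values are illustrative choices) of Equation 12.2 with the fully symmetric A and Q matrices just defined:

```python
import numpy as np

def simulate(n=10, a=0.5, f0=1.0, q=0.95, steps=20000, dt=0.01, seed=0):
    """Integrate x_j' = sum_i f_i x_i Q_ij - phi x_j (Equation 12.2)
    for the fully symmetric A and Q of this section."""
    rng = np.random.default_rng(seed)
    A = np.full((n, n), a); np.fill_diagonal(A, 1.0)
    u = (1.0 - q) / (n - 1)
    Q = np.full((n, n), u); np.fill_diagonal(Q, q)
    x = rng.dirichlet(np.ones(n))            # random initial point on the simplex
    for _ in range(steps):
        f = f0 + A @ x                       # f_i = f0 + sum_j F(mu_i, mu_j) x_j
        phi = f @ x                          # average fitness
        x = x + dt * ((f * x) @ Q - phi * x) # forward Euler step
        x = np.clip(x, 0.0, None); x /= x.sum()   # counter numerical drift off the simplex
    return x

for q in (0.80, 0.99):
    print(q, round(simulate(q=q).max(), 3))  # largest language share
```

For low learning fidelity the population stays near the uniform state (largest share close to 1/n), while for fidelity above the coherence threshold one language comes to dominate, which is the bifurcation discussed in the remainder of the chapter.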

12.2.1

Fixed Points

To begin, we will look for fixed points of system 12.5. These correspond to solutions of x˙j = 0. The dynamics is given by a set of cubic equations. We will exploit symmetry in order to understand the structure of the solutions. In general, one may consider m-language solutions 3 where m languages are used with different frequencies (say X1 , X2 , . . . , Xm ) and the rest are used 3

It turns out that all m-language solutions have the form that X1 = X2 = . . . = Xm . then each coordinate is a root of To see this, note if x ¯ =“ (x1 , . . . , xn ) is a fixed “ point, ””

1−q the polynomial (1 − a) −x3 + x2 q + (α − x) n−1 − x − (a+f0 )(1−q)(nx−1) where α = n−1 P P 2 1 j xj . This polynomial has three roots given by n , r, s. Since the constraint j xj = 1 must be satisfied by the solution, we see that there are two possible solutions: (i) all x j ’s are equal to n1 ; (ii) m of the xj ’s are equal to r and the rest are equal to s. See Mitchener 2003 for this observation and more developments following from it.

381

LINGUISTIC COHERENCE AND COMMUNICATIVE FITNESS 1−

Pm

X

i=1 i with the same frequency (given by (n−m) each). The most important class of solutions that we will examine in some detail in the rest of this chapter are the one-language solutions.

Factoring the Cubic Let us set xl = X, xm = (1 − X)/(n − 1), m = l. This corresponds to the case when all languages except one are used with the same frequency. Without loss of generality, we can take l = 1. From system 12.5 with a zero left-hand side, we obtain n equations for the unknown X. They are compatible, because the equations for x 2 , x3 , . . ., xn are identical, and their sum is just the equation for x1 (due to the conservation of the number of people). In other words, each of the equations from the second to the last one is nothing but the first equation divided by n − 1. Therefore, we only need to solve the first equation, (1 − X)2 X −X q+ n−1 3

2



1−q X− n−1

 +

(1 − q)(a + f0 )(nX − 1) = 0 (12.6) (1 − a)(n − 1)

This is a cubic equation in X and can be factored as (nX − 1)(AX 2 + BX + C) = 0 where A=

2q − 1 − qn 1−q (1 − q)(a + f0 ) 1 ;B = + ;C = 2 n−1 (n − 1) (1 − a)(n − 1) (n − 1)2

The Solutions The cubic equation (Equation 12.6) has three solutions. The factorization provided above makes it clear what these are. One solution is X0 = 1/n

(12.7)

and corresponds to the uniform distribution (i.e., all grammars occur in the population equally often). The other two solutions correspond to the two √ −B± B 2 −4AC . Putting roots of the quadratic factor and are given by X ± = 2A in the values of A, B, C, from above we get √ −(1 − a)(1 + (n − 2)q) ∓ D (12.8) X± = 2(a − 1)(n − 1)

CHAPTER 12

382

1.0 f0=10 0.8 f0=0.5 f0=0

γ

0.6 0.4 0.2 0.0 0.0

0.2

0.4

0.6

0.8

1.0

a

Figure 12.1: The threshold value, γ, of learning accuracy, in the limit of large values of n. For q > q1 ≈ γ, asymmetric solutions become possible. The coefficient γ is plotted as a function of a for different values of the background fitness, f0 . where D = 4[−1 − a(n − 2) − f0 (n − 1)](1 − q)(n − 1)(1 − a) + (1 − a)2 [1 + (n − 2)q]2 (12.9) These two solutions describe a less symmetrical situation, when one grammar is the most (least) preferred one and is used with frequency X ± , and the rest of the grammars are used equally often. Therefore there are 2n + 1 solutions in all: (i) uniform solution; (ii) n solutions of the form X i = X+ for some i; and (iii) n solutions of the form X i = X− for some i. Existence of One-Language Solution The one-language solutions correspond to the solutions given by Equation 12.8. Real-valued solutions exist only if D ≥ 0. This, in turn, is equivalent to the existence condition q ≥ q 1 , where q1 =

4 + 2W (n − 1)3/2 − 2f0 (n − 1)2 − 3n − a(2n2 − 7n + 6) (1 − a)(n − 2)2

(12.10)

and W = (1 + f0 )[1 + a(n − 2) + f0 (n − 1)]. Thus we see that q1 is a coherence threshold above which one-language solutions may exist and below which only the uniform solution is possible. The uniform solution always exists.

383

LINGUISTIC COHERENCE AND COMMUNICATIVE FITNESS

Properties of Coherence Threshold It is worthwhile to reflect on the nature of the coherence threshold q 1 and its dependence on factors such as a (the confusibility between different languages), n (the size of the space of possible languages), and f 0 (the background fitness). n = 2: In the special case of n = 2, q 1 is given by q1 = (3+a+4f0 )/(4(1+f0 )). Thus q1 increases linearly with a and has the value 1 when a = 1. Large n: For n  1/(a + f0 ), we have:   1 q1 = γ + O n where γ=

(12.11)

" 2 ! (a + f0 )(1 + f0 ) − (a + f0 ) 1−a

(12.12)

We observe (see Figure 12.1) that γ is a monotonically increasing function of (1−a)(a+f0 ) √ . a, and it is equal to 1 when a = 1. Note that 1 − γ = 2 (a+f0 +

(a+f0 )(1+f0 ))

If a is close to 1, so that a = 1 −  and  → 0, we have γ = 1 − /(4(f 0 + 1)) + O(2 ). The coefficient γ also grows with f 0 reaching 1 as f0 → ∞. More precisely, we have γ = 1 − (1 − a)/(4f 0 ) + O(1/f02 ). a = f0 = 0: In the special case of a = f0 = 0, the existence condition looks like 3 4 + 2(n − 1) 2 − 3n (12.13) q1 = (n − 2)2 √ For n  1 we obtain q1 = 2/ n + O(1/n), i.e., the asymptotic behavior is quite different. Summary Remarks For small values of q(< q1 ), only the uniform solution exists. At q = q 1 , a bifurcation occurs. Solution 12.8 emerges and is shown in Figure 12.2. For all values of a and f0 , at q = 1 we have X+ = 1 and X− = 0. At the point where the solution first appears (q = q 1 ), the value is approximately √ √ X± ≈ q21 ≈ γ2 for large n. For the setting f0 = 0, we have X± ≈ a/(1+ a). We note that because of the choice of the A and Q matrices, system 12.5 is highly symmetric and its solutions are degenerate. That is, by relabeling variables, we can pick any of the n grammars to be the “chosen” one, and

CHAPTER 12

384

1.0 0.8

X+

X

0.6 0.4 0.2 0.0 0.85

X-

1/n 0.90 q

q1

0.95

q2

1.00

Figure 12.2: The solutions X = X0 , X+ and X− . Here, a = 0.5, f0 = 1, and n = 10. Stable solutions are represented by solid lines, and unstable ones by dashed lines (see Section 12.2).

then we will have n equivalent solutions of the form xl = X, xj =

1−X , where X = X0 , X+ or X− , ∀j = l n−1

(12.14)

for any l such that 1 ≤ l ≤ n. Perturbations of the A or Q matrix will in general lift the degeneracy, which may result in the following changes: (i) in general, all values of xj , j = l, will be different from each other, and (ii) solutions of the form (12.14) will have different shapes for different values of l (in other words, X0 , X+ and X− will depend on l). In the next section we will see that solution 12.14 with x l = X− is always unstable and the one with xl = X0 (the uniform solution) loses stability as q grows further. Only solutions with x l = X+ remain stable for high values of learning accuracy. When the A matrix is not fully symmetric, the X + type solutions have a more complicated form, but one important feature will persist. It turns out that these solutions can be characterized by one language whose share grows as q approaches unity, whereas the frequency of other languages decreases. μl will be called the preferred, or “chosen” language and the languages μj with j = l will be called secondary languages. Finally, it is important to reemphasize that the fixed points found in this section are not the only possible fixed points of system 13.1. It turns out, however, that m-language solutions correspond to saddlenodes and are unstable along certain directions. For large values of q, the one-language

385

LINGUISTIC COHERENCE AND COMMUNICATIVE FITNESS

solutions with Xi = X+ are the only attractors of this system. It also turns out that there are no periodic orbits (oscillations) in the symmetric system. In much of this chapter we only concentrate on the three fixed points found above. Later we will briefly examine the case for asymmetric systems where the dynamics could potentially get much more complicated.

12.2.2

Stability of the Fixed Points

Given a system of n ordinary differential equations of the form x˙j = Fj (x1 , x2 , . . . , xn ); j = 1, . . . , n we can check for the stability of a fixed point x ∗ = (x∗1 , x∗2 , . . . , x∗n )T by locally perturbing around x∗ to yield xj = x∗j + y˜j . Taking exponential perturbations of the form y˜j = eΓt yj , we get Γt

x˙j = Γyj e

n  ∂Fj ≈ ( |x∗ )eΓt yi ∂xi i=1

yielding the following eigenvalue problem n  ∂Fj ( |x∗ )yi ; j = 1, . . . , n Γyj = ∂xi i=1

∂F

Thus the n × n matrix J where Jij = ∂xji |x∗ is the Jacobian, whose eigenvalues determine stability. In our case, since we are interested in solutions on  the n − 1 dimensional simplex, we have the additional constraint that j yj = 0. If there exists a Γ > 0, the system is unstable. If all eigenvalues are strictly less than 0, the system is stable. Let us check the stability of solution 12.14; we will take l = 1. Following the well-developed techniques of a linear stability analysis outlined earlier, ˜j , j > 1 (here we perturb the solution by taking x1 = X + y˜1 , xj = 1−X n−1 + y X can be X0 , X+ or X− ). We substitute this into system 12.5 and linearize with respect to y˜j . Next, we introduce exponential behavior of the perturbations, i.e., (˜ y1 , . . . , y˜n )T = eΓt (y1 , . . . , yn )T The eigenvalue problem corresponds to Γy = Jy where y = (y 1 , . . . , yn )T . Since the Jacobian J is computed at   (1 − X) T (1 − X) (1 − X) ∗ , ,..., x = X, n−1 n−1 n−1

CHAPTER 12

386

We see that the elements of J consist of exactly five distinct values at each time. As a result, we obtain a system of linear equations for y 1 , . . . , yn with the following form   ym = 0, Cyj + D ym + Ey1 = 0, 2 ≤ j ≤ n Ay1 + B m>1

m>1 m = j

where A, B, C, D, and E are constants given by A= B= C= D= E=

∂F1 ∂x1 ∂F1 ∂xj ∂Fj ∂xj ∂Fj ∂xk ∂Fj ∂x1

! " 2 − − Γ = (1 − a) −3X 2 + 2Xq − (n − 1)( 1−X ) n−1 = −Γ = = =

2(1 − a)( 1−X )( 1−q − X) ! n−1 n−1 1−X 2 (1 − a) 2q( 1−X n−1 ) − (n + 1)( n−1 ) X−q 2(1 − a)( 1−X n−1 )( n−1 ) 2(1 − a)X( X−q n−1 )

n(a+f0 )(1−q) n−1

" − X2 −

−Γ

n(a+f0 )(1−q) n−1

−Γ

respectively.  Because of the conservation of the number of people, we have nj=1 yj = 0. We therefore need to solve n for the above linear system under this constraint. Replacing y1 by − m=2 ym , we get: (A − B)

n 

ym = 0

(12.15)

m=2

(C − D)yj + (D − E)

n 

ym = 0, 2 ≤ j ≤ n

(12.16)

m=2

Here, the first equation is the sum of the other (n−1) equations (by construction of equation 13.1 and is therefore satisfied as long as the other (n − 1) equations are satisfied. To ensure the existence of nontrivial solutions of linear system 12.16, we require that the determinant of the corresponding (n − 1) × (n − 1) matrix is zero. The matrix [M ij ] has the form Mii = C − D, Mij = D − E for i = j, and its determinant is given by (C − 2D + E)n−2 (C − D + (n − 2)(D − E))

(12.17)

Determinant 12.17 is zero if C = 2D − E (the corresponding Γ is denoted as Γ1 ) or if C − D + (n − 2)(D − E) = 0 (the corresponding Γ is denoted as Γ 2 ). Note that in the special case of n = 2 we only have the latter condition. By examining the sign of Γ1,2 , we can study the stability of solutions X 0 , X+ , and X− . If at least one of the growth rates is positive, the corresponding solution is unstable.

387

LINGUISTIC COHERENCE AND COMMUNICATIVE FITNESS

The Uniform Solution For X = X0 = 1/n, we have      1 2 n(2q − 1) − 1 (1 − a) − n (1 − q)(f0 + a) (12.18) Γ1 = Γ 2 = n(n − 1) This gives a threshold condition for learning accuracy. That is, for q > q 2 , Γ1,2 become positive and the uniform solution loses stability. The value q 2 is given by n2 (f0 + a) + (n + 1)(1 − a) . (12.19) q2 = n[n(f0 + a) + 2(1 − a)] The value q2 corresponds to the point where X− = X0 . Thus, the uniform solution loses stability at the point q where it meets solution X − . For large n (n  1/(a + f0 )), we have     1 1−a 1 +O (12.20) q2 = 1 − n a + f0 n2 Note that in the case a = f0 = 0, we have q2 = 1/2 + 1/(2n). The Asymmetric Solutions First, we examine the case n > 2. The growth rate for the two asymmetric solutions is presented in Figure 12.3. It turns out that for the solution X + , both Γ1 and Γ2 are nonpositive for all q ≥ q1 (the solid lines in Figure 12.3). This means that the asymmetric solution X + is stable everywhere in the domain of its existence. Thus, for higher values of learning accuracy, the system prefers a state when one of the grammars is used very often, whereas the rest of them have an equal (and small) share. For X− , the situation is different. In the domain q 1 ≤ q ≤ 1, one of the growth rates is positive whereas the other is negative (at the point q = q 2 they are both zero, the dotted lines in Figure 12.3). This means that the solution X− is unstable (it is neutrally stable for q = q 2 ). It is instructive to compare the eigenvectors corresponding to the eigenvalues Γ 1 and Γ2 . The former one has y1 = 0, and the latter one has y1 = 0. For q > q2 , Γ1 > 0, which means that the solution X− loses stability in such a way that x1 stays the same, but the rest of the grammars fail to keep a uniform distribution. The complete proof of these stability results is tedious and not reported here. However, to convince the reader of their veracity and to provide some insight into why they hold, let us consider a large- n analysis for the eigenvalues.

CHAPTER 12

388

0.10

Growth rate, Γ

0.00

q1

q2

-0.10

Γ1

-0.20

Γ2

-0.30 -0.40 -0.50 0.90

0.92

0.94 0.96 Learning accuracy, q

0.98

1.00

Figure 12.3: The growth rates (eigenvalues) for the one-grammar solutions X+ (solid lines) and X− (dashed lines), as functions of q. Here, a = 0.5, f0 = 1 and n = 20.

Stability of X+ for Large n Note that X+ > where q ≥ q1 ≈ γ + O( n1 ). Now where C  = C + Γ =

∂Fj ∂xj .

1 + (n − 2)q q ≈ 2(n − 1) 2

Γ1 = C  − 2D + E For all n, the following inequalities hold:

C  ≤ −X 2 (1 − a) − (a + f0 )(1 − q) + 2q(1 − a) |D| = |2(

1−X n−1

2(1 − a) 1−X X −q )( )(1 − a)| ≤ n−1 n−1 (n − 1)2

|E| = |2X(

2(1 − a) X −q )(1 − a)| ≤ n−1 n−1

Therefore, we have 1 Γ1 ≤ C  + |2D| + |E| ≤ −X 2 (1 − a) − (a + f0 )(1 − q) + O( ) n

389

LINGUISTIC COHERENCE AND COMMUNICATIVE FITNESS

For sufficiently large n, Γ1 is therefore seen to be negative. Consider Γ2 . This is given by the expression Γ2 = C  − D + (n − 2)(D − E) ≈ C  − nE + nD Since D/E =

1 1−X n−1 ( X )

= O( n1 ), we have that Γ2 ≈ C  − nE

For large n (ignoring O( n1 ) terms), we have Γ2 ≈ −(1 − a)X 2 − (a + f0 )(1 − q) + 2X(q − X)(1 − a) But

−(1 − a)X 2 − (a + f0 )(1 − q) + 2X(q − X)(1 − a) q q ≤ −(1 − a)X 2 − (a + f0 )(1 − q) + 2( )( )(1 − a) 2 2 q2 ≤ −(1 − a)X 2 − (a + f0 )(1 − q) + (1 − a) 2 Now putting in X = X+ , we see √ q D 1 + (n − 2)q + ≈ +α X= 2(n − 1) 2(1 − a)(n − 1) 2 Therefore, we have q q2 Γ2 ≤ −(1 − a)( + α)2 − (a + f0 )(1 − q) + (1 − a) 2 2 q2 ) + q(a + f0 ) − (a + f0 ) − (1 − a)(qα + α2 ) 4 Now let us find an approximate expression for α. We see that for large n, we have # q 2 (a + f0 )(1 − q) − (12.21) α≈ 4 1−a We therefore see that = (1 − a)(

(1 − a)(

q2 ) + q(a + f0 ) − (a + f0 ) − (1 − a)α2 ≈ 0 4

and putting this into the approximate upperbound for Γ 2 , we have Γ2 ≤ −(1 − a)qα ≤ 0

CHAPTER 12

390

Stability of X− for Large n A similar analysis may be conducted for X − . Here we need to consider two different regions (q1 < q < q2 and q2 < q < 1 following Figure 12.3). Let us begin by taking q to be slightly larger than q 1 so that q 1 √ ≤ X− ≤ 1 n 2 In this regime, we show that Γ2 > 0. Recall that Γ2 = C  − D + (n − 2)(D − E) 1−X Note that D E = X(n−1) ≤ Therefore, we may take

√ (1−q1 ) n n−1

and for large n this is approximately 0.

Γ2 ≈ C  − nE

as before. Following the approximations made earlier, we see that this reduces to Γ2 ≈ −(1 − a)X 2 − (a + f0 )(1 − q) + 2X(q − X)(1 − a) + φ " ! 1−X where φ = (1 − a) ( n−1 )(2q − (1 − X)) . A detailed analysis of φ suggests that it is either negligible compared to the other terms or is positive (or both). So we ignore φ in what follows. We also have X − ≈ 2q − α. Putting this into the above equation we have q q q Γ2 ≈ −(1 − a)( − α)2 − (a + f0 )(1 − q) + 2( − α)( + α)(1 − a) 2 2 2 Simplifying and canceling terms, we have Γ2 = (1 − a)

q2 − (a + f0 )(1 − q) − (1 − a)α2 + (1 − a)qα − 2α2 (1 − a) 4

But, from Equation 12.21, this reduces to   Γ2 ≈ (1 − a) qα − 2α2 = α(1 − a)(q − 2α) Again, from Equation 12.21, we see that 0≤α≤

q 2

391

LINGUISTIC COHERENCE AND COMMUNICATIVE FITNESS

Therefore, we have Γ2 ≈ α(1 − a)(q − 2α) ≥ 0 Thus, Γ2 is in general positive for large n. By the same argument as before, Γ1 is negative for large n in this regime. Now consider the regime where q > q 2 . Note that for very large n, the value of the threshold q2 ≈ 1. It turns out that for finite n, the value of X − at q = q2 is exactly equal to n1 and correspondingly for all q ≥ q 2 we have X− ≤ n1 . It is easy to check that in this region |D| ≥ |E|. Therefore, we have Γ1 = C  − 2D + E ≥ C  − D ≥ C  For convenience, put in q = 1. For this value, X − = 0 and we see that Γ1 ≥ C  = φ =

1−a ≥0 n−1

Since Γ1 is a continuous function of q, we see that it must be positive in the neighborhood of q = 1. This neighborhood is arbitrarily small for large n and so we conclude that Γ1 ≥ 0 in this regime. This concludes our analysis. For completeness we consider the value n = 2. In this special case, q2 coincides with q1 . Therefore, for q < q1 = q2 , the uniform solution x1,2 = 1/2 is stable, and for higher values of learning accuracy, it loses stability. We have a pitchfork bifurcation with two equivalent stable solutions, (x1 , x2 ) = (X+ , X− ) and (x1 , x2 ) = (X− , X+ ).

12.2.3

The Bifurcation Scenario

To sum up the bifurcation picture (Figure 12.2), we note that for 0 ≤ q < q 1 the only stable solution is the uniform solution 1/n, then between q 1 and q2 both the uniform solution and solutions 12.14 with X + (the one-grammar solutions) are stable, and finally, for q > q 2 the uniform solution loses its stability and the one-grammar solutions remain stable. At the point q = q1 , where the nonuniform solutions first appear, the corresponding average fitness (assuming that n is large) is ! φasym =

"2

(a + f0 )(1 + f0 ) − f0 − a 1−a

+ a + f0

(12.22)

whereas the average fitness of the uniform solution (for large n) is φunif = a + f0

(12.23)

CHAPTER 12

392

2.0

Average fitness, φ

1.8 1.6 1.4 1.2 1.0 0.85

q

0.90 1 0.95 Learning accuracy, q

q2

1.00

Figure 12.4: Total fitness of the stable solutions of a system with a fully symmetric A matrix, as a function of learning accuracy, q. Parameters of the system are as in Figure 12.2.

One can see that as the system goes to a one-grammar solution, the average fitness (and the grammatical coherence) experience a jump, Δf = (1 − a)(γ/2)2 +O(1/n), see Figure 12.4. Note that if a = 1−, then Δf ∼ /4. As q increases to 1, the total fitness of the one-grammar solution monotonically increases to 1+f0 , whereas the fitness of the uniform solution stays constant. It is convenient to present the stability diagram in terms of the error rate, u (see Figure 12.5). Clearly, as n grows, it becomes harder and harder to maintain one grammar. Also, one can see that there is always a bistability region where the uniform solution and X+ coexist. Indeed, for the existence of a one-grammar solution we need u ≤ u1 = c1 /n, c1 ≡ 1 − γ

(12.24)

For the uniform solution to lose stability we need u ≤ u2 = c2 /n2 , c2 ≡ (1 − a)/(a + f0 )

(12.25)

The above inequalities are derived in the case of large n and a + f 0 > 0.

12.3

Fidelity of Learning Algorithms

In preceding sections we have examined in some detail how the population dynamics depends upon the learning fidelity q of the individual learner.

393

LINGUISTIC COHERENCE AND COMMUNICATIVE FITNESS 0.14

0.12

Error rate, u

0.1

Uniform solution 0.08

0.06

u

1

0.04

u

Bistability region

2

0.02

Asymmetric solution 0

2

4

6

8

10

12

14

16

18

20

n

Figure 12.5: The stability diagram in terms of the error rates, u1,2 . Here, a = 0.5, f0 = 1.

Bifurcations in the population dynamics are seen to occur at q = q 1 and q = q2 . In the model, individual learners attempt to learn the parental language based on example sentences they receive. Consequently, the number of such linguistic examples is naturally related to the learning fidelity. In general, the greater the number of examples on which learners base their estimates, the higher the probability of learning the parental language correctly, and correspondingly the higher the q value at which the population dynamics operates. An analysis of the learning algorithm will yield the precise relationship between the number of examples (b) and the learning fidelity (q). To illustrate this point, let us consider two learning algorithms that have been discussed in previous chapters.

12.3.1

Memoryless Learning

In Chapters 3 and 4, I introduced the class of memoryless learners that modify their grammatical hypotheses in an online adaptive fashion. There are many variants but for convenience, consider the following algorithm: 1. Initial hypothesis: The learner chooses a language uniformly at random from among the n different languages μ 1 , . . . , μn .

CHAPTER 12

394

2. Updating hypothesis: Let the learner’s hypothesis after n sentences be μj(n) . When the (n + 1)th sentence sn+1 is heard, the learner updates its hypothesis in the following manner: (a) if sn is understood, j(n + 1) = j(n) (b) else j(n + 1) is chosen uniformly at random to be one of the n − 1 languages not equal to μj(n) . Following the analysis of previous chapters, the behavior of such a learner may be characterized by a Markov chain. Such a chain has n states — one for each language. The probability distribution over the states (at each point in time) describes the probability with which the learner will hypothesize each of the n different languages at that time. The initial probability distribution of the learner is uniform: p (0) = (1/n, . . . , 1/n)T , i.e., each of the languages has the same chance to be picked at the initial moment. The discrete-time evolution of the vector p (t) is characterized by a Markov chain with a transition matrix, T (k), which depends on the teacher’s language, μk . This matrix is defined by $ T (k)ij =

(1−Aki ) (n−1)

Aki

if i = j if i = j

Recall that Aki is the probability with which a learner using language μ i will understand a sentence from the target language (in this case μ k ). This is therefore the probability with which the learner will retain the hypothesis μi for the next time step. With probability 1 − A ki , the learner will switch its hypothesis to one of the other n − 1 languages. After b samplings, the k-th row of matrix Q is given by (p (b) )T = (0) (p )T [T (k)]b obtained with the transition matrix T (k). Therefore we can specify the (i, j) element of Q as Qij = [(p(0) )T T (i)b ]j

(12.26)

This expression captures the relation between matrices A and Q. For instance, if we assume that the off-diagonal entries of the A matrix are constant and equal to each other (the fully symmetric case), then, according to Equation 12.26, the off-diagonal entries of the Q matrix are also equal to each other, and Equation 12.4 holds. Expression 12.26 can be used to evaluate the learning accuracy and the error rate in terms of a. It is easy to

395

LINGUISTIC COHERENCE AND COMMUNICATIVE FITNESS

check that  1−a b n−1 q = 1− 1− n−1 n  b 1−a 1 1− u = n n−1 

(12.27) (12.28)

Note that limb→∞ q = 1 for any fixed n and 0 ≤ a < 1. This simply means that learning fidelity may be arbitrarily close to 1 depending upon how many (b) examples are made available to the learner. When a = 1, all languages are mutually comprehensible. For this case, q = n1 , which is the lowest possible value of learning accuracy. We can use results of the previous section to find conditions for b, the number of sampling sentences per individual, which would allow the population to maintain a particular grammar. We will assume that the number n is large and use inequalities 12.24 and 12.25. In order for solution X + to exist, we see that 1 n log (12.29) b ≥ b1 = 1−a c1 The uniform solution loses stability if b ≥ b2 =

n n log 1−a c2

(12.30)

The constants c1 and c2 are defined in formulas 12.24 and 12.25.

12.3.2

Batch Learning

At another end of the spectrum of learning algorithms lie the batch-learning algorithms. Such algorithms form their final decision by globally optimizing over the entire collection of linguistic examples received over the learning period. Consider the following simple instantiation of such an algorithm. Denote the example set by S = {s1 , s2 , . . . , sb }. For each sentence s and language μ let us define the comprehensibility function c(s, μ) to be a 0 − 1 valued function that takes the value 1 if s is comprehensible 4 to a speaker of μ. 4 There are many different notions that may be invoked to properly define comprehensibility. One may consider parsability of s according to the grammatical rules underlying P μ. This is equivalent to determining whether m μ(s, m) > 0. Alternatively, one might take into account the intended meaning behind s and let c(s, μ) = 1 if a speaker of μ is able to correctly infer the true meaning. In doing so, one might follow the treatment of the previous chapter.

CHAPTER 12

396

1. For each of the candidate languages μ 1 , . . . , μn determine its total comprehensibility score by 1 c(si , μj ) b b

Cj (b) =

i=1

Note that Cj (b) is a random variable if the examples are drawn in some random fashion. 2. Determine the set of empirically optimal languages to be U = {μj |Cj (b) = max Ci (b)} i

3. If |U| = 1, then choose the unique optimal language as the guess. If there are multiple optimal languages, i.e., |U| > 1, then choose any of the n languages uniformly at random. It is worthwhile to make a remark about step (3) of the above algorithm. If there are multiple optimal languages (|U| > 1) it might seem natural to choose one of the elements of U at random rather than choosing one of the n languages at random, as is done by the above algorithm. The reason for considering the above strategy is merely to simplify analysis and to ensure that Qij = Qik for all distinct i, j, k. It is easy to check that for symmetric A matrices this property will be satisfied by the above algorithm, so that the analysis of the dynamics will go through as before. Furthermore, it is easy to check that the stated algorithm is unbiased in the sense that the learner will converge to the right language as b → ∞. For a symmetric A matrix, let us now compute bounds on b for this batch learner. We assume Aij = a if i = j and Aii = 1 ∀i. Let the parental language be μk . Consider b examples drawn in i.i.d. fashion and presented to the learner. To begin, note that each of the C j (b)’s is an empirical average of b i.i.d. random variables (the c(s i , μj )’s) with mean Ak j. Therefore, it follows from the simple law of large numbers that for each j the following is true: lim Cj (b)→a( with probability 1); j = k b→∞

and Ck (b) = 1 ∀b Therefore the set U will always contain μ k as a member. Let us first compute the probability that it will be the lone member. For this to be the case, it

397

LINGUISTIC COHERENCE AND COMMUNICATIVE FITNESS

must be that for every other language μ j , there is at least one sentence s ∈ S that is incomprehensible to it, i.e., c(s, μ j ) = 0. Accordingly let us introduce the event E j which is the event that at least one sentences in S is incomprehensible according to μ j . The probability that U has a unique member is therefore given by P[∩i =k Ei ] But P[∩i =k Ei ] = 1 − P[∪i =k E¯i ] ≥ 1 −



P[E¯i ]

i =k

Now E¯i is simply the probability that every sentence in S is comprehensible according to μi . This is simply P[E¯i ] = ab Therefore, we have P(∩i =k Ei ) ≥ 1 − (n − 1)ab Now let us consider Qkk . This is given by Qkk = P(∩i =k Ei ) +

1 1 n−1 (1 − P(∩i =k Ei )) = + P(∩i =k Ei ) = q n n n

In order for q to be greater than q 1 , it is sufficient for " n−1 ! 1 + 1 − (n − 1)ab > q1 n n Simplifying, we get (n−1) ) log( n(1−q 1) 2

b>

log( a1 )

= Ω(log(n))

It is worthwhile to note that although we have assumed A ki = a for all distinct k, j, the above analysis would work for any arbitrary choice of A matrix as long as it was diagonal dominant, i.e., as long as A kj > Aki for all i = k. The constants would change and the dependence would be on maxi =k |Aii − Aik | rather than a. The dependence on n would remain unchanged. We thus see that the number of sample sentences needed for a community of batch learners to develop a coherent language grows as log n, whereas memoryless learners need b ∝ n sentences (formula 12.29). This is a consequence of the fact that batch learners have perfect memory, whereas memoryless learners only remember one sentence at a time.

CHAPTER 12

12.4

398

Asymmetric A Matrices

In this chapter I have provided in some detail the analysis of a population of language users where the languages μ 1 , . . . , μn are in a symmetric configuration with respect to each other. In particular, I have assumed that A ii = 1 for all i and Aij = a for all distinct i, j. A natural question that now arises is what happens when the matrix A is not symmetric. It is worth noting that A affects the evolutionary dynamics in two different ways. First, the F matrix is defined via A. By construction, even if the A matrix is not symmetric, the corresponding F matrix will be symmetric. Second, A influences the values of the Q matrix depending upon the learning algorithm used. In the simulations that follow, the memoryless learner is used in all cases. Let us now consider some simple examples where some or all symmetries of the A matrix are broken. In order to investigate this we also need to specify the Q matrix, which as we saw previously depends on the A matrix as well as on the learning algorithm. For convenience, we will assume in what follows that the learner utilizes the online memoryless algorithm described in the previous section. The first example assumes that all languages but one are in some sense equivalent, which means that certain symmetries remain in the system, even though the corresponding A matrix is no longer fully symmetric (Section 12.4.1). The second example considers a random configuration of languages leading to a very general system where no symmetries remain (Section 12.4.2). An analytic consideration of these examples is beyond the scope of the current chapter. To provide some insight, however, I present some numerical simulations that reveal the essential character of the results. As before, the population evolution undergoes bifurcations as b changes. For small b, only the uniform solution exists. For large enough b new “one-grammar” solutions (corresponding to X+ ) emerge. The bifurcation diagrams are different now, and it is seen that some of the languages are suppressed while others are enhanced.

12.4.1

Breaking the Symmetry of the A Matrix

The A matrices that we have considered so far possessed such symmetries that all one-language solutions (for each language μ i ) were identical. This is not the case in general. All nonsymmetric perturbations of a fully symmetric A matrix lead to the effect of suppressing some languages and enhancing others.

399

LINGUISTIC COHERENCE AND COMMUNICATIVE FITNESS

Let us consider the slightest perturbation of the A matrix by replacing one element aij = a with aij = a + ξ. A simulation was conducted with the following assumptions: 1. Each language μi defines a probability distribution over Σ ∗ . The support of this distribution may be characterized by an underlying grammar Gi . Thus LGi ⊂ Σ∗ is the support of μi and a speaker of μi produces a sentence s ∈ LGiwith probability μi (s). With this assumption we have that aij = s∈LG ∩LG μi (s). i

j

2. The learning algorithm follows the memoryless procedure described in the previous section. We observe the following picture. For low values of b, an interior solution (approximately but not exactly uniform) is the only stable one. As b increases, bifurcations occur. The branch of the stable asymmetric solution X+ corresponding to the grammar Gi will split off from the other one-grammar solutions, whereas solutions with grammars G l , l = i, l = j, will stay together. In other words, the one-grammar solution with G j as the preferred grammar will deviate ever so slightly from the rest of the grammars. It turns out that if ξ > 0, the grammar G i will be suppressed (and the grammar Gj will be very slightly advantaged), and if ξ < 0, the grammar Gi will be enhanced (and Gj will be slightly suppressed). This means that for ξ < 0, the solution with grammar G i will come into existence earlier (for smaller values of q and b) and will have a larger total fitness (see Figure 12.6, where i = 1, j = 2, a = 0.5, and ξ = −0.4). This makes sense because negative (positive) values of ξ mean that the grammar i has a smaller (larger) intersection with the rest of the grammars. When this grammar becomes preferred, it stands out more (less) than other grammars would in its place, i.e., it corresponds to higher (lower) values of X+ and has a correspondingly larger (smaller) total fitness. Figure 12.6 shows a picture of the bifurcation diagram where grammar Gi is the first to emerge and the other grammars emerge later and simultaneously.

12.4.2

Random Off-Diagonal Elements

Let us now consider an example of a nonsymmetric system where the A matrix is composed of random elements. In particular, we take a ii = 1 but the off-diagonal elements of the A matrix are random numbers uniformly

CHAPTER 12

400

Average fitness, φ

1.00 0.90 G1

0.80 0.70

G2...Gn

0.60 0.50 0

20

40 60 80 Learning events, b

100

Figure 12.6: The growth rates for the asymmetric matrix A with all off-diagonal a ij = 0.5 except a12 = 0.1. The solution with the G1 as the preferred grammar is advantageous in comparison with the rest of the one-grammar solutions, it has a higher coherence and comes into existence for smaller values of b. The grammar G2 is slightly suppressed. distributed between zero and one. As a result, no symmetries are left in the system. Again, bifurcations are seen to occur. For small values of b, an interior solution is the only stable one. All languages are represented in the population. If the number of learning events, b, is high, there are still n stable one-grammar solutions. In Figure 12.7 one can see seventeen of twenty possible stable solutions. On the basis of computer simulations, it is possible to make several observations about the dynamics arising from such general matrices: 1. Unlike the symmetric case, one-grammar solutions with different dominant grammars correspond to different values of φ, the grammatical coherence of the population. Thus each of the solutions is represented by a separate line. This is consistent with the progression from the symmetric case where all solutions lie on the same line through the partially symmetric case of the previous section where one of the solutions lies on one line and the rest of the solutions lie on another. 2. The number of stable one-grammar solutions grows with b. Some of the grammars become advantaged and have a lower threshold of existence. Some are suppressed until much higher values of b. Such behavior was already present in the previous example, where one of

401

LINGUISTIC COHERENCE AND COMMUNICATIVE FITNESS

Average fitness, φ

1.00 0.90 0.80 0.70 0.60 0.50 0

100 200 300 400 Learning events, b

500

Figure 12.7: The fitness of the system with the A matrix consisting of uniformly distributed random numbers 0 ≤ aij ≤ 1. The number of grammars is n = 20, and f0 = 0. the grammars emerged earlier than the rest. The value of b at which the first bifurcation takes place can be roughly predicted by using formula 12.29 with a = 1/2, i.e., the average value of the elements aij . In general, (based on numerical simulations), the first bifurcation point b at which a coherent grammar emerges seems to be roughly estimated with formula 12.29, where we use the average value a = a. Furthermore, the range of the interval over which various grammars emerge (as a result of bifurcations) increases with the range of the distribution of the aij values. 3. Another interesting feature that can be clearly observed in Figure 12.7 is that the lowest-fitness solution (which corresponds to the uniform solution of the fully symmetric case) flows smoothly into one of the one-grammar solutions (the “second best” one for this particular realization of A). This effect can be predicted from standard bifurcation theory. That is, general perturbations of a pitchfork-like bifurcation will lead to smoothing out sharp edges and avoiding cross-sections, and might also cause the disappearance of “knees” (bistability regions) like those seen in Figure 12.2. We conclude that systems with random A matrices behave in a predictable way, and many of the elements of the dynamics can be understood from the analysis of symmetric systems and their perturbations. However, an extended study of a system with randomly chosen a ij is still needed to

CHAPTER 12

402

describe the full bifurcation picture. Such a description is beyond the scope of the current chapter. For more analysis, see Komarova and Rivin 2003.

12.4.3

Final Remarks

The general system of equations (Equation 13.1) may give rise to much more complicated behavior if symmetries in the F and Q matrix are independently controlled (rather than through a common A matrix as in the preceding analysis). For example, choose n = 3. Let F be a symmetric matrix as before. For a Q matrix with the following structure ⎛ ⎞ p q r Q=⎝ r p q ⎠ q r p one observes stable oscillations (limit cycles) and spiral attractors depending on the precise values of p, q, r. The reason for this is intuitively clear from the cyclic structure of Q, where the errors are such as to induce a rotation. By considering larger systems and having different cycles interact with each other, one may have a path to chaos through a period-doubling mechanism. These discussions are beyond the scope of the current chapter, but some analysis pertaining to these questions may be found in Mitchener 2003.

12.5

Conclusions

In this chapter, we have studied in some detail the evolution of a population of language users while making two major assumptions about them: 1. Natural selection: We assume that communicative fitness confers reproductive success, so that the rate at which indivduals produce offspring is related to their communicative efficiency with the rest of the population at large. 2. Local learning: We assume that language is not transmitted genetically from parent to child. Rather, languages have to be learned by children. Crucially, however, we assume that children learn from their parents. Under these assumptions, the primary question we investigate is when coherence, i.e., a shared language, would emerge in the population. As a result of the analysis of this chapter, we now see that grammatical coherence is possible only if the learning accuracy of children is sufficiently high.

403

LINGUISTIC COHERENCE AND COMMUNICATIVE FITNESS

Bifurcations are seen to occur in the population dynamics as the learning fidelity changes. For low learning accuracy, the only stable mode is the uniform mode, where all languages are used in the population with roughly equal frequency. As the learning accuracy increases, new onelanguage modes arise, where most of the population speaks one language and a small fraction speaks a smattering of the rest. The system undergoes spontaneous symmetry breaking. There are n stable equilibria. Which one is attained depends upon the initial conditions. While the general relationship between the fidelity thresholds (q 1 and q2 ), the number of learning events b, and the size of the space of languages n is intuitive enough, the details depend upon the precise learning procedure that children use. As an illustrative example we have considered two learning algorithms that make very different demands on the cognitive ability of individuals — a batch learner and a memoryless learner. A memoryless learner requires O(n) examples for language emergence, while the batch learner requires O(log(n)) examples for the same. The equations of language dynamics studied here are similar to the quasispecies equations of evolutionary biology (Eigen and Schuster 1979) but with frequency-dependent fitness terms. In genetic evolution, genes are transmitted from parents to children with possible mutations. The mutation rate determines the fidelity of such a transmission. In contrast, in linguistic populations, languages are transmitted from parents to children via learning. The learning accuracy determines the fidelity of such a transmission. It is also worthwhile to reflect on the potential role of natural selection in linguistic evolution. Natural selection may be interpreted in two different ways. In its most literal interpretation, it seems unlikely that it has played a role in historical times. It is certainly possible, however, that on evolutionary time scales the ability to communicate effectively affects the chances of survival and therefore reproduction. One can, however, provide alternative interpretations of natural selection in terms of social influence and imitative behavior on the part of mature adults and child learners respectively. Individuals with higher communicative fitness may have a higher influence on learning children (alternatively they may be imitated more often). As a result the effective number of child learners attempting to learn a particular mature adult’s language may depend upon the communicative efficiency of that adult. This interpretation is also consistent with the same set of equations analyzed in this chapter. Several variations may be considered. I provided a brief discussion of asymmetric configurations of languages. Bifurcations are still seen to occur. Each language now has a different threshold before it can emerge, but the

CHAPTER 12

404

essential spirit of the results remains the same as in the symmetric case. The effect of finite populations, spatial distributions of the language speakers, bilingual learning, and so on can be studied as in previous chapters. I do not pursue such variations here. One might choose to tell an evolutionary story based on the results of this chapter. For language to emerge in a population, the learning fidelity must be high. It is possible that over evolutionary time scales, the learning ability of individual children crossed the coherence threshold. This may be due to 1. The separate evolution of learning algorithms 2. The increased maturation time of learners, allowing them more examples (b) with which to learn and therefore greater fidelity 3. A decrease in the size (n) of the search space of possible languages Each of these is consistent with the evolution of language. In general, though, this raises an important and puzzling question. Why is there linguistic variation at all? It is clear that there is an evolutionary pressure for n to be small. It is not clear what counteracting pressures there are for n to be large. If such pressures were found, then the tension between the two would support intermediate sizes of Universal Grammar and would therefore be consistent with the state of affairs in naturally occurring linguistic systems. I leave such questions for future research.

Chapter 13

The Origin of Communicative Systems: Linguistic Coherence and Social Learning In this chapter, I continue the investigation of the conditions under which linguistic coherence might emerge in a population of linguistic agents. I am primarily interested in understanding how a community might arrive at a shared language through the actions of interacting individual learners in the absence of a centralizing agent that enforces linguistic homogeneity in some sense. From the analysis of the previous chapter, we already appreciate the strong yet subtle relationship between the learning fidelity of individual learners and the emergence of shared languages in the population. We examined a setting with the following assumptions: (i) learning agents (children) learn from one teacher (parent), and (ii) communicative efficiency (fitness) translates into reproductive success. Under those conditions, we discussed a model where we were able to derive a coherence threshold (corresponding to a bifurcation point) that depended upon the learning fidelity of the individual learner. Below this coherence threshold, the only stable mode was the uniform mode when all linguistic types were equally represented in the population. One may liken this condition to the Tower of Babel, where many different languages are present and communicative efficiency is low. Above this coherence threshold, one-language modes that correspond to population states with shared languages become possible. The communicative efficiency

CHAPTER 13

406

is therefore high. In this chapter, we consider alternative models in which coherence might emerge. The most significant dimension in which these models differ from each other is in the analysis of individual child learners. Under our model assumptions, we will see that if children learn only from their parents, then fitness is necessary to ensure coherence. Without fitness, coherence will never emerge, no matter how good the learning ability of individual children. In contrast, if children learn from a wider pool of people than just their parents, coherence might emerge without natural selection (fitness). Thus social learning may result in shared languages. Under symmetry conditions that parallel those of the previous chapter, we develop a model for the evolution of n linguistic types in the population. Each linguistic type is equally easy to learn. The learner is exposed, however, to data from a mixture of these types and ends up learning a language that is most consistent with its data set. We analyze the model and characterize equilibria and stability. As before, we find that bifurcations exist. If the number of examples provided to learners is small so that learning fidelity is low, the only equilibrium is the uniform mode where all linguistic types are equally present in the population. As the number of examples increases, a bifurcation point is reached after which the uniform solution becomes unstable and stable one-grammar solutions arise. A significant portion of this chapter is based on joint work with Thomas Hayes. To set the stage for the arguments that follow, let us briefly recapitulate our findings for the setting when children learn only from their parents.

13.1

Learning Only from Parents

If there are n linguistic types given by the measures μ 1 through μn , then following the arguments of the previous chapter, the linguistic evolution of the population is characterized by the following equation:  fi xi Qij − φxj , 1 ≤ j ≤ n, (13.1) x˙j = i

Here xj is the proportion of individuals in the population that speak the language corresponding to μj . An individual born to a speaker of μ i learns a language based on data provided by its parent. The fidelity of the languagelearning map is provided by the matrix Q, where Q ij denotes the probability learning μj . The fitness with which an offspring of a speaker of μ i ends up  fi of a speaker of μi is given by the expression fi = nj=1 xj F (μi , μj ), where

407

LINGUISTIC COHERENCE AND SOCIAL LEARNING

F (μi , μj ) is the mutual intelligibility between a speaker of μ i and a speaker of μj . Under the symmetric assumptions of the previous chapter, we have 1. (i) Qii = q (ii)Qij =

1−q n−1 ; i

= j

2. (i) F (μi , μi ) = 1 (ii)F (μi , μj ) = a; i = j  3. fi = j xj F (μi , μj ) = (1 − a)xi + a + f0 When evolution is governed by fitness as in Equation 13.1, we see there is a coherence threshold (q1 ) for learning fidelity (given by q). When q < q 1 , the only stable mode of the population is the uniform mode in which all linguistic types are equally represented. When q > q 1 , new one-grammar solutions emerge. These correspond to population states in which a majority speak a shared language. Let us now consider the same evolutionary dynamics but without having a fitness that depends on communicative efficiency F . In other words, we assume for all i, fi = f0  Therefore, we have φ = f0 i xi = f0 . For this case, the dynamical equations reduce to  f0 xi Qij − f0 xj , 1 ≤ j ≤ n (13.2) x˙j = i

Note that this is a set of linear differential equations. Equilibria are computed by setting x˙j = 0. This leads to linear equations for which the only solution is given by the uniform solution x j = n1 for all j. Thus from all initial conditions, populations move to an equilibrium state in which all languages are equally represented in the population. No bifurcations occur and no shared languages ever emerge. The above discussion clarifies how fitness based on communicative efficiency is an important ingredient in the emergence of communal or shared languages. In many cases of animal communication — for example, in some species of songbirds — the infant of the species learns in the nesting phase where it is exposed to primarily one teacher. In such cases, it seems likely that we will need to invoke arguments from natural selection and fitness to provide an explanation for the emergence of shared communication systems. On the other hand, we will soon see that if learning is based on input provided by the population at large, i.e., social learning, then shared systems might emerge without natural selection.

CHAPTER 13

13.2

408

Social Learning: Learning from Everybody

Now consider the case when the individual learner learns on the basis of the linguistic input derived from the entire adult population. This setting was considered in some detail in Part III of this book. I now reexamine those models from the point of view of coherence.

13.2.1

The Symmetric Assumption

The analysis of language evolution in the previous chapter was conducted for the symmetric case where F (Li , Li ) = 1 and F (Li , Lj ) = a when i = j. For ease of analysis we will make an analogous symmetric assumption now. Let there be n languages L1 , L2 , . . . , Ln ⊂ Σ∗ . For each language Li , we assume there are a set of expressions that are perceptually salient for the purpose of learning. Depending upon one’s theoretical persuasion, there are many possible candidates for such a set. Examples of such expressions may be “triggers” of various sorts as discussed in trigger-based accounts of language acquisition (e.g., Fodor 1998; Gibson and Wexler 1994), “cues” as in Lightfoot 1998, or in general, any expression that provides a linguistic indicator to the child regarding the identity of the grammar that generated the expression. Following the learning-theoretic discussion in Chapter 2, we may even let the set of such expressions to be L i \ (∪j =i Lj ). I denote the set of such perceptually salient expressions to be C i ⊆ Li such that Ci ∩ Cj = φ for all i = j. Speakers of Li produce sentences according to a probability distribution μi . In particular, let μi (Ci ) = ai Our general model of learning will be as follows. The learning child scans its input for cues. If the cues for L i occur often enough, the child will acquire the language Li . A number of learning algorithms may be designed around this general principle by taking different computational and cognitive requirements into account. If ai > 0 for every i, i.e., every language has a nonempty (positive measure) cue set, then these learning algorithms will be able to identify every language in the family {L 1 , . . . , Ln }. In the analysis that follows, we will assume a i = a for every i. Thus the languages have cue sets of equal measure and are therefore equally easy to learn by unbiased algorithms. This amounts to a symmetric assumption about the ease of learning languages.

409

LINGUISTIC COHERENCE AND SOCIAL LEARNING

13.2.2

Coherence for n = 2

We begin by considering the case in which there are two possible languages — L1 and L2 . Each has a cue set — C1 and C2 respectively. Speakers of L1 produce sentences with a probability distribution μ 1 such that μ1 (C1 ) = a. Speakers of L2 produce sentences according to μ2 such that μ2 (C2 ) = a. At any point in time, there may be a mixture of L 1 and L2 speakers. Let the proportion of L1 speakers (at time t) be x1 (t) and the proportion of L2 speakers be x2 (t). Children are exposed to both kinds of speakers and therefore potentially hear both kinds of cues. Depending upon the ratio of L1 and L2 speakers in their linguistic environment, they are more likely to hear one or the other type of cue. Let us consider the evolution of this population under the following learning algorithm. Cue-Frequency Based Batch Learner The learning algorithm receives k examples in all. The algorithm counts 1. k1 : the number of examples that are cues for L 1 , i.e., examples that belong to C1 2. k2 : the number of examples that are cues for L 2 , i.e., examples that belong to C2 3. k3 : the number of examples that are not a cue for either language Clearly, k = k1 + k2 + k3 . The learning procedure is empirically driven and simple. If (i) k1 > k2 , the learner chooses L1 ; (ii) if k2 > k1 , the learner chooses L2 ; and (iii) if k1 = k2 , the learner chooses any one of the two languages (with probability 12 each). If this is the learning algorithm the typical child uses, one may compute the probability with which such a child acquires L 1 . Assuming that sentences are produced in i.i.d. fashion, the probability that k 1 > k2 is given by

f1 (a, x1 (t), x2 (t), k) =

 {(k1 ,k2 ,k3 )∈I1 }



 k (ax1 (t))k1 (ax2 (t))k2 (1 − a)k3 k1 k2 k3

where I1 = {(k1 , k2 , k3 )|k1 > k2 ; k1 + k2 + k3 = k} Similarly, the probability that k2 > k1 is given by

CHAPTER 13

410



f2 (a, x1 (t), x2 (t), k) =

{(k1 ,k2 ,k3 )∈I2 }



 k (ax1 (t))k1 (ax2 (t))k2 (1 − a)k3 k1 k2 k3

where I2 = {(k1 , k2 , k3 )|k2 > k1 ; k1 + k2 + k3 = k} Note that by symmetry, we have f2 (a, x1 (t), x2 (t), k) = f1 (a, x2 (t), x1 (t), k) The probability that a typical child acquires L 1 after k sentences is now given by 1 f1 + (1 − f2 − f1 ) 2 Therefore, the population dynamics is given by 1 x1 (t + 1) = (1 + f1 (a, x1 (t), x2 (t), k) − f2 (a, x1 (t), x2 (t), k)) 2

(13.3)

Using the fact that x1 (t) + x2 (t) = 1 for all t, we can eliminate x2 (t) from the above equation to obtain a one-dimensional map g : [0, 1] −→ [0, 1] such that x1 (t + 1) = g(x1 (t)). For example, g may be expressed in terms of f 1 as 1 1 (13.4) g(x) = + (f1 (a, x, 1 − x, k) − f1 (a, 1 − x, x, k)) 2 2 Evolutionary Dynamics Equation 13.4 determines the evolution of L 1 types in the population. The following observations may now be made: 1. The dynamics depends upon the number of example sentences k that individual children hear before maturation. In particular, g is a degreek polynomial map. 2. A fixed point is provided by x1 = x2 = 12 . This corresponds to the uniform solution where both languages are spoken in equal proportion. 3. For small values of k, this is the only fixed point. It is stable.

411

LINGUISTIC COHERENCE AND SOCIAL LEARNING

4. As k increases, new coherent states emerge where one of the languages is the dominant language spoken by a majority of the agents. These correspond to stable one-language modes. 5. As k increases, the uniform state becomes unstable. 6. The critical values of k at which the bifurcations occur depend upon the value of a. In general these critical values become larger as a becomes smaller. Since k takes on integer values, it is more natural to hold k fixed and study the bifurcations as a changes continuously from 0 to 1. When a is close to 0, only the uniform mode is stable. As a increases, the bifurcations occur, the uniform mode becomes unstable, and new stable one-language equilibria arise. 7. The values of a, k may be related to learning fidelity in a natural way. When a, k have small values, the learner is given too little information on the basis of which language is learned. As a result, learning fidelity is low, the system is noisy, and shared languages do not emerge. When a, k have large values, the learner is given a lot of information on the basis of which language is learned, learning fidelity is high, the system is less noisy, and shared languages emerge. To see (2), simply put in x1 = x2 = 12 and notice that f1 = f2 for this situation. To see (3), let us evaluate the map g for k = 2 and k = 3 respectively. This will also provide some insight into the relationship between a and k for bifurcations to occur. Consider k = 2. It is easy to check that f1 (a, x1 , x2 , 2) = (ax1 )2 + 2(ax1 )(1 − a) and f2 (a, x1 , x2 , 2) = (ax2 )2 + 2(ax2 )(1 − a) Putting this into Equation 13.3, we have g(x1 ) =

1 2x1 − 1 + (2a − a2 ) 2 2

Clearly x1 = 12 is the only solution. Now consider k = 3. For this case, f1 (a, x1 , x2 , 3) = (ax1 )3 + 3(ax1 )2 (ax2 ) + 3(ax1 )2 (1 − a) + 3(ax1 )(1 − a)2

CHAPTER 13

412

A similar expression holds for f 2 . Substituting into Equation 13.3, we have  1 x1 − x2  3 2 a (x1 + x1 x2 + x22 ) + 3a2 x1 x2 + 3a2 (1 − a) + 3a(1 − a)2 g(x1 ) = + 2 2 where x1 + x2 = 1. This further simplifies to g(x1 ) =

 1 2x1 − 1  + 1 − (1 − a)3 + 2a3 x1 (1 − x1 ) 2 2

Clearly, g is a polynomial of degree 3, and solving for its fixed points yields three solutions: +

2a3 − 1 x = ;x = 2

4a6 − 8a3 (1 − a)3 4a3

The second pair of solutions exist only when 4a6 − 8a3 (1 − a)3 ≥ 0 or (

a 3 ) ≥2 1−a

Thus we see that for k = 2, the uniform solution is the only solution. For k = 3, additional solutions exist only if a is large enough. If a is too small, then k will need to be much higher than 3 for bifurcations to occur. This already reflects the behavior suggested by (4),(5), and (6) in the above list. We will now provide arguments to gain some insight into the validity of (4),(5), and (6). Let us begin by concentrating on the stability of the fixed point at x = 12 . From Equation 13.4, we have that g (x) =

1  [f (a, x, 1 − x, k) − f1 (a, 1 − x, x, k)] 2 1

Now f1 (a, x, 1 − x, k) =

 k  k−k  3 (1 − a)k3 k xk1 −1 (1 1 k1 >k2 k1 k2 k3 a  k k −1 −k2 x 1 (1 − x) 2



− x)k2

Putting in x = 12 , we see f1 (a,

  k  1 1 1 , , k) = ak−k3 (1 − a)k3 (k1 − k2 )( )k1 +k2 −1 2 2 2 k1 k2 k3 k1 >k2

413

LINGUISTIC COHERENCE AND SOCIAL LEARNING

A similar calculation reveals that   k  1 1 1  ak−k3 (1 − a)k3 (k2 − k1 )( )k1 +k2 −1 f2 (a, , , k) = 2 2 2 k1 k2 k3 k1 >k2

We thus have   k  1 a (1 − a)k3 (k1 − k2 )( )k−k3 −1 g (x = , a, k)) = a 2 2 k1 k2 k3

(13.5)

k1 >k2

We immediately see from this expression that g  (x = 12 , a = 0, k)) = 0 for all values of k ≥ 1. By continuity and differentiability of g  , we see that for each k, there exists a sufficiently small a k such that |g  (x = 12 , a, k))| < 1 for all a < ak . Now let us turn our attention to g  (x = 12 , a, k)) for the case a = 1. The following proposition is true. Proposition 13.1 For a = 1, the uniform solution becomes unstable for large k. In particular, we have 1 lim g (x = , a = 1, k) = ∞ k→∞ 2

Proof: From Equation 13.5, we have 1 g (x = , a = 1, k)) = 2

=2



 k1 >k2 ;k3 =0

 1 k (k1 − k2 )( )k−k3 −1 2 k1 k2 k3

 k  l> k2

1 (l − (k − l))( )k . 2 l

   It is sufficient to show that l> k kl (2l − k)( 12 )k grows to ∞ as a function 2 of k. Notice that  k   k  1 1  k  1 1 (2l − k)( )k ≥ (2l − k)( )k ≥ k 6 ( )k 2 2 2 l l l 1 1 k l> 2

The quantity

l> k2 +k 6

 1

l> k2 +k 6

k  l

l> k2 +k 6

( 12 )k is the probability with which an unbiased

coin would turn up heads at least

k 2

1

+ k 6 times in k independent tosses.

CHAPTER 13

414

Denote this probability by Pk . Then by an application of the Central Limit Theorem, we know that ⎛

1 6



−k 1 lim Pk = lim φ ⎝ % ⎠ = k→∞ 2 k

k→∞

4

where φ is the cumulative distribution of the univariate Normal. Therefore, we have 1 1 lim g (x = , a = 1, k)) ≥ lim k 6 Pk = ∞ k→∞ k→∞ 2

From this proposition, we see that there exists a K such that for all k > K, g  (x = 12 , a = 1, k) > 1. At the same time, we know that g  (x = 1 2 , a = 0, k) = 0. Therefore, by continuity, there exists for each k > K a critical point ak such that g  (x = 21 , a = ak , k) = 1. This corresponds to the bifurcation point at which the uniform solution becomes unstable as a changes continuously. Once the uniform solution becomes unstable, new stable solutions must arise. These solutions correspond to situations where one language is spoken by a majority of the population. Thus shared languages emerge. To see this, fix a k > K and choose a > ak so that the uniform solution is unstable, i.e., g (x = 21 , a, k) > 1. Now notice that g(x = 1, a, k) < 1 and g(x = 0, a, k) > 0 while g(x = 12 , a, k) = 12 . Since g  (x = 12 , a, k) > 1, there exist two points (x− and x+ ) in the neighborhood of 12 such that g(x+ ) − x+ > 0 and g(x− ) − x− < 0. Consider the function h(x) = g(x, a, k) − x and note that (i) h(0) > 0 and h(x− ) < 0 and (ii) h(1) < 0 and h(x+ ) > 0. By continuity of h, we see that there are stable equilibrium points x ∗,− ∈ (0, 12 ) and x∗,+ ∈ ( 12 , 1). These correspond to situations where a majority speak L2 and L1 respectively. We do not have an analytic form for the relationship between a k and k. However, from numerical simulations, it seems to be the case that a k decreases as k increases. Therefore, we see that if a = a ∗ , the one-language solutions arise only when k is large enough, i.e., for all k such that a k < a∗ . As an example, I show in Figure 13.1 the bifurcation diagrams for k = 14 and k = 7 respectively. These correspond to classic pitchfork bifurcations where two stable points arise simultaneously as the uniform fixed point becomes unstable.

415

LINGUISTIC COHERENCE AND SOCIAL LEARNING 1 0.8 0.6 0.4 0.2 0

0

0.1

0.2

0.3

0.4

0.5 a

0.6

0.7

0.8

0.9

1

0

0.1

0.2

0.3

0.4

0.5 a

0.6

0.7

0.8

0.9

1

1 0.8 0.6 0.4 0.2 0

Figure 13.1: Bifurcation diagrams for k = 14 (top) and k = 7 (bottom) as the cue-frequency a varies.

13.3

Coherence for General n

We now consider the more general case in which there are n linguistic types corresponding to n languages L1 , L2 , . . . , Ln . For each language Li , there is an associated cue set Ci ⊂ Li . Elements of the cue set may be interpreted as expressions that provide a clue to the learner regarding the identity of the grammatical system underlying L i . Speakers of Li produce sentences with probability distribution μi such that μi (Ci ) = a. As before, child learners hear a total of k sentences from the mixture of languages in their linguistic environment. Consider the following learning algorithm they may use to learn a language.

13.3.1

Cue-Frequency Based Batch Learner

Children scan the input for cues to each of the languages and choose the language for which maximally many cues occur in their linguistic experience. If there are multiple languages with the same number of cues, then a language is chosen at random. 1. Count cues: Let ki (i = 1, . . . , n) be the number of sentences in the data set that belong to Ci . Let kn+1 be the number of noncues that

CHAPTER 13

416

occur in the child’s linguistic experience. Clearly,

n+1 i=1

ki = 1.

2. Find maximal languages: Determine the languages whose cues occur most often. Let I = {i|ki = max1≤j≤n kj }. Thus I consists of the indices of such maximal languages. 3. Choose mature language: If there is a single language whose cues occur most often, i.e., |I| = 1, then this language is chosen as the mature language. Otherwise, a language is chosen at random. There are two variations: (a) Simple-minded: choose any one of the n languages with probability n1 . (b) Careful: Let L = {Li |i ∈ I}. Choose one of the languages in L 1 . with probability |I|

13.3.2

Evolutionary Dynamics of Batch Learner

One may now try to characterize the population dynamics under this learning algorithm. Let the state of the populationat time t be given by x(t) = (x1 (t), x2 (t), . . . , xn (t)) where xi (t) ≥ 0 and ni=1 xi (t) = 1. The probability distribution with which sentences are presented to the typical child is now given by n  xi (t)μi μ= i=1

With probability ax1 (t), the child will receive a cue for L 1 , and in general, with probability axi (t), it will receive a cue for Li . Finally, with probability 1 − a, a noncue will be heard. Assuming sentences are presented in i.i.d. fashion according to μ, the probability of hearing k 1 cues for L1 , k2 cues for L2 and so on is given by   k kn+1 pk11 pk22 . . . pn+1 k1 k2 . . . kn kn+1 where pi = axi (t) for i ∈ {1, . . . , n} and pn+1 = 1 − a. Therefore the probability that k1 is strictly greater than k2 , . . . , kn is given by  k  kn+1 k pk11 pk22 . . . pn+1 F1 (x(t)) = k k∈I1

  where we use k to denote the n + 1-tuple given by (k 1 , . . . , kn+1 ) and kk   k . The sum is taken over the denotes the multinomial quantity k1 k2 ...k n+1

417

LINGUISTIC COHERENCE AND SOCIAL LEARNING

set I1 , which consists of all ordered partitions of k into n + 1 nonnegative integers (k1 , . . . , kn+1 ) such that k1 is strictly greater than k2 , . . . , kn . In other words, I1 = {k = (k1 , k2 , . . . , kn , kn+1 )|k1 > k2 , . . . , kn ;

n+1 

ki = 1}

i=1

In a similar way, the probability that k2 is strictly greater than k1 , k3 , . . . , kn can be calculated. Let this be F2k (x(t)). We can thus define F1k , . . . , Fnk . Note that for any i, j, we have Fik (. . . , xi , . . . , xj , . . .) = Fjk (. . . , xj , . . . , xi , . . .) Under the simple-minded version of the batch-learning algorithm described above, we can compute the probability that the learner will choose L 1 after k examples have been heard. This is given by F1k (x(t))

+ (1 −

n 

Fik (x(t)))

i=1

1 n

Thus the population dynamics is given by a map f k : Δn−1 → Δn−1 where Δn−1 is the (n − 1)-dimensional simplex in R n given by  xi = 1, (∀i)xi ≥ 0} Δn−1 = {(x1 , . . . , xn ) ∈ Rn : i

The map f k = (f1k , f2k , . . . , fnk ) has n components where the jth component is given by fjk (x(t))

= xj (t + 1) =

Fjk (x(t))

+ (1 −

n  i=1

Fik (x(t)))

1 n

(13.6)

One may now investigate the evolutionary dynamics associated with the map $f^k$ with a particular view to the emergence of linguistic coherence. The following observations may now be made:

1. Since $f^k$ is a continuous map from $\Delta^{n-1}$ to itself, in general it may have a continuum of fixed points. It is possible to show, however, that due to the special structure of $f^k$, there can be at most a finite number of fixed points (equilibria). In particular, for any $a \in [0, 1]$ and $k > 0$, there are no more than $k(2^n)$ equilibria. Furthermore, all fixed points lie on critical lines in the simplex that we describe at length below.

2. For small values of $k$, the only fixed point of the map is given by the uniform solution $x = (\frac{1}{n}, \ldots, \frac{1}{n})$. This is stable.

3. For any fixed $k$ that is sufficiently large, the number of fixed points varies with $a$. Thus as $a$ changes from 0 to 1, there is a bifurcation in the dynamics. For small values of $a$, there is only one fixed point, the uniform point where all languages are equally represented in the population. As $a$ increases and crosses a critical point, other fixed points arise.

4. For large values of $a$, only the one-language solutions (where one of the languages dominates the population) are stable. The rest (including the uniform solution) are unstable.

5. The same behavior may also be observed by holding $a$ fixed and changing $k$. For small values of $k$ the uniform point is the only fixed point, and the other fixed points arise as $k$ increases beyond a critical value. The values of $a, k$ may be related to learning fidelity in the natural way.
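To make the map of Equation 13.6 concrete, here is a small Python sketch that computes $F_i^k$ exactly by enumerating cue-count outcomes and then iterates $f^k$. It is an illustration of the dynamics just defined; the function names (`F1`, `f_map`) and the particular parameter values are mine, chosen only for the example.

```python
import itertools
from math import factorial, prod

def F1(x, a, k):
    """F_1^k(x): probability that, among k i.i.d. examples, cues for L_1
    strictly outnumber the cues for every other language."""
    n = len(x)
    p = [a * xi for xi in x]                                # cue probabilities a*x_i
    q = 1.0 - a                                             # probability of a noncue
    total = 0.0
    for ks in itertools.product(range(k + 1), repeat=n):    # candidate cue counts k_1..k_n
        m = k - sum(ks)                                     # k_{n+1}: number of noncues
        if m < 0 or any(ks[0] <= ks[j] for j in range(1, n)):
            continue
        coeff = factorial(k) // (prod(factorial(c) for c in ks) * factorial(m))
        total += coeff * prod(pi ** c for pi, c in zip(p, ks)) * q ** m
    return total

def f_map(x, a, k):
    """One generation of the population dynamics (Equation 13.6)."""
    n = len(x)
    Fs = []
    for i in range(n):
        xi = x[:]                                           # F_i is F_1 with x_1 and x_i swapped
        xi[0], xi[i] = xi[i], xi[0]
        Fs.append(F1(xi, a, k))
    rest = 1.0 - sum(Fs)                                    # mass assigned by the random choice on ties
    return [Fs[i] + rest / n for i in range(n)]

# Iterate from a slightly non-uniform initial state with n = 3 languages.
x = [0.4, 0.3, 0.3]
for _ in range(30):
    x = f_map(x, a=0.9, k=10)
print([round(v, 3) for v in x])   # for large a, k the initially most common language should dominate
```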

13.4 Proofs of Evolutionary Dynamics Results

In this section, I provide formal proofs and informal arguments supporting the analysis of the evolutionary dynamics outlined earlier.

13.4.1 Preliminaries

Ultimately, we wish to analyze the dynamics associated to the map $f^k$. It is useful, however, to introduce some intermediate maps along the way.

Definition 13.1 Let $\ell \geq 1$. The function $h^{\ell}: \Delta^{n-1} \to \Delta^{n-1}$ is defined by $h_i^{\ell}(x) =$ the conditional probability that, given exactly $\ell$ cues for the languages $L_1, \ldots, L_n$, and the existence of a unique language with the most cues, $L_i$ is that language.

Observation: Let $p_{\ell}(x)$ denote the probability that, given language distribution $x$ and $\ell$ cues, there is a unique most-cued language in the learning experience. For $1 \leq i \leq n$, let $S_i$ be the set of ordered partitions of $\ell$ into $n$ nonnegative integers $(k_1, \ldots, k_n)$ such that $k_i$ is strictly greater than every other $k_j$. Then $h^{\ell}$ is given by the formula

$$h_i^{\ell}(x) = \frac{1}{p_{\ell}(x)} \sum_{\mathbf{k} \in S_i} \binom{\ell}{\mathbf{k}} \prod_{j=1}^{n} x_j^{k_j}.$$


Note also that $p_{\ell}$ is given by the formula

$$p_{\ell}(x) = \sum_{i=1}^{n} \sum_{\mathbf{k} \in S_i} \binom{\ell}{\mathbf{k}} \prod_{j=1}^{n} x_j^{k_j}
= \sum_{i=1}^{n} \sum_{\mathbf{k} \in S_1} \binom{\ell}{\mathbf{k}} \prod_{j=1}^{n} x_j^{k_{\sigma_i(j)}}
= \sum_{\mathbf{k} \in S_1} \binom{\ell}{\mathbf{k}} \Big(\prod_{j=1}^{n} x_j^{k_j}\Big) \sum_{i=1}^{n} \Big(\frac{x_i}{x_1}\Big)^{k_1 - k_i}$$

where $\sigma_i(j) = 1$ if $j = i$, $i$ if $j = 1$, and $j$ otherwise.

Observation: For $\ell = 1$, $h^{\ell}$ is the identity function. For $\ell \geq 2$, $h^{\ell}$ has exactly $2^n - 1$ fixed points, namely, those inputs $x = (x_1, \ldots, x_n)$ for which all nonzero coordinates $x_i$ are equal.

Definition 13.2 Let $\ell \geq 0$. The function $g^{\ell}: \Delta^{n-1} \to \Delta^{n-1}$ is defined by $g_i^{\ell}(x) =$ the conditional probability that, given exactly $\ell$ cues for the languages $L_1, \ldots, L_n$, $L_i$ is the selected language. Recall that if there is no unique most-cued language among $L_1, \ldots, L_n$, then the chosen language is selected uniformly at random from among all $n$ (not just the most-cued).

Observation: $g^{\ell}$ is given by the formula $g^{\ell} = p_{\ell} h^{\ell} + (1 - p_{\ell})u$, where $u$ is the constant function $(1/n, \ldots, 1/n)$. In particular, $g^{\ell}$ is a convex combination of $h^{\ell}$ and $u$.

Definition 13.3 Let $k > 0$, $a \in [0, 1]$. The function $f^k: \Delta^{n-1} \to \Delta^{n-1}$ is defined by $f_i^k(x) =$ the probability that $L_i$ is the selected language, given $k$ sentences for the languages $L_1, \ldots, L_n$, each drawn with probability $x_i$, and having probability $a$ of being a cue. Recall that, if there is no unique most-cued language among $L_1, \ldots, L_n$, then the chosen language is selected uniformly at random from among all $n$ (not just the most-cued).

Observation: $f^k$ is given by the formula

$$f^k = \sum_{\ell=0}^{k} \binom{k}{\ell} a^{\ell} (1-a)^{k-\ell}\, g^{\ell}.$$

This shows that $f^k$ is a convex combination of $g^0, \ldots, g^k$, which implies that $f^k$ is a convex combination of $h^1, \ldots, h^k$ and $u$. When $a = 1$, $f^k = g^k$.

13.4.2 Equilibria

In this section, I show that $f^k$ may have at most a finite number of fixed points (Corollary 13.3). Furthermore, these fixed points have a special structure. If $x = (x_1, \ldots, x_n)$ is a fixed point of $f^k$, then there can be at most two distinct values for the $x_i$'s (Corollary 13.2). We begin with a key lemma.

Lemma 13.1 Let $\ell > 1$, and suppose $x = (x_1, \ldots, x_n)$, where $x_1 > x_2 > x_3$. Let $M$ be the $3 \times 3$ matrix

$$M := \begin{pmatrix} h_1^{\ell}(x) & x_1 & 1 \\ h_2^{\ell}(x) & x_2 & 1 \\ h_3^{\ell}(x) & x_3 & 1 \end{pmatrix}.$$

Then $\det(M) > 0$.

Proof: We will show how to rewrite the left column of $M$ as a positive linear combination of columns of the form

$$\begin{pmatrix} x_1^A (x_2^B - x_3^B)/(x_2 - x_3) \\ x_2^A (x_3^B - x_1^B)/(x_3 - x_1) \\ x_3^A (x_1^B - x_2^B)/(x_1 - x_2) \end{pmatrix}$$

where $A \geq B$ are positive integers. Once this is established, the result follows by multilinearity of the determinant, together with the observation that

$$\begin{vmatrix} x_1^A (x_2^B - x_3^B)/(x_2 - x_3) & x_1 & 1 \\ x_2^A (x_3^B - x_1^B)/(x_3 - x_1) & x_2 & 1 \\ x_3^A (x_1^B - x_2^B)/(x_1 - x_2) & x_3 & 1 \end{vmatrix}
= \begin{vmatrix} 1 & x_3^B & x_3^A \\ 1 & x_2^B & x_2^A \\ 1 & x_1^B & x_1^A \end{vmatrix} \geq 0$$

Equality holds iff $A = B$, since when $A > B$, the second matrix is a submatrix of a doubly increasing Vandermonde matrix ($x_3 < x_2 < x_1$ and $0 < B < A$). Now, by definition of $h^{\ell}$, we know

$$p_{\ell}\, h_1^{\ell}(x) = \sum_{\mathbf{k} \in S_1} \binom{\ell}{\mathbf{k}} \prod_{j=1}^{n} x_j^{k_j}$$

Fix values for $k_1, k_4, k_5, \ldots, k_n$. Observe that this fixes the sum $S = k_2 + k_3$, and allows $(k_2, k_3)$ to take on exactly the values $(S-M, M), (S-M+1, M-1), \ldots, (M, S-M)$, where $M = \min\{k_1 - 1, S\}$ is the maximum allowable

value for $k_2$ or $k_3$. Also note that the coefficient $\binom{\ell}{\mathbf{k}}$ is a decreasing function of $|k_2 - k_3|$. This allows us to rewrite our sum as

$$p_{\ell}\, h_1^{\ell}(x) = \sum_{k_1, k_4, \ldots, k_n} \Big(\prod_{j \neq 2,3} x_j^{k_j}\Big) \sum_{J=0}^{\lfloor M - S/2 \rfloor} \alpha(\mathbf{k}, J) \sum_{K = S-M+J}^{M-J} x_2^{K}\, x_3^{S-K} \qquad (13.7)$$

where the coefficients $\alpha(\mathbf{k}, J)$ are defined by

$$\alpha(\mathbf{k}, J) := \begin{cases} \binom{\ell}{k_1,\, S-M,\, M,\, k_4, \ldots, k_n} & J = 0 \\[4pt] \binom{\ell}{k_1,\, S-M+J,\, M-J,\, k_4, \ldots, k_n} - \binom{\ell}{k_1,\, S-M+J-1,\, M-J+1,\, k_4, \ldots, k_n} & J \neq 0 \end{cases}$$

For the range of $J$ in the sums above, $\alpha(\mathbf{k}, J)$ is always positive. The innermost sum in (13.7) evaluates to

$$(x_2 x_3)^{S-M+J}\, (x_2^B - x_3^B)/(x_2 - x_3)$$

where $B = 2M - S - 2J + 1$. This lets us rewrite 13.7 as

$$p_{\ell}\, h_1^{\ell}(x) = \sum_{k_1, k_4, \ldots, k_n} \sum_{J=0}^{\lfloor M - S/2 \rfloor} \beta(\mathbf{k}, J)\, x_1^A (x_2^B - x_3^B)/(x_2 - x_3)$$

where $A = k_1 - S + M - J$, and $\beta(\mathbf{k}, J)$ is defined by

$$\beta(\mathbf{k}, J) = \alpha(\mathbf{k}, J)\, (x_1 x_2 x_3)^{S-M+J} \prod_{j=4}^{n} x_j^{k_j}$$

Noting that $h_2^{\ell}(x) = h_1^{\ell}(x_2, x_3, x_1, x_4, \ldots, x_n)$ and $h_3^{\ell}(x) = h_1^{\ell}(x_3, x_1, x_2, x_4, \ldots, x_n)$ and that $\beta(\mathbf{k}, J)$ is symmetric in $x_1, x_2, x_3$, we can write the left column of the original matrix, scaled by the positive factor $p_{\ell}(x)$ (which does not affect the sign of the determinant), as

$$p_{\ell}(x)\begin{pmatrix} h_1^{\ell}(x) \\ h_2^{\ell}(x) \\ h_3^{\ell}(x) \end{pmatrix} = \sum_{k_1, k_4, \ldots, k_n} \sum_{J=0}^{\lfloor M - S/2 \rfloor} \beta(\mathbf{k}, J) \begin{pmatrix} x_1^A (x_2^B - x_3^B)/(x_2 - x_3) \\ x_2^A (x_3^B - x_1^B)/(x_3 - x_1) \\ x_3^A (x_1^B - x_2^B)/(x_1 - x_2) \end{pmatrix}$$

This establishes that the original determinant is nonnegative. It remains to be checked that, when $\ell > 1$, at least one term satisfying $A > B$ has a nonzero coefficient. The term corresponding to the partition $\mathbf{k} = (\ell, 0, \ldots, 0)$ has $A = \ell$, $B = 1$ and coefficient $\beta(\mathbf{k}, 0) = 1$, which completes the proof.


Corollary 13.1 The statement of Lemma 13.1 remains true if $g^{\ell}$ or $f^k$ is substituted for $h^{\ell}$, as long as $\ell, k > 1$ and $a > 0$. For $\ell, k \in \{0, 1\}$, the determinant is zero.

Proof: As has been observed already, $g^{\ell}$ and $f^k$ are convex combinations of $h^0, \ldots, h^k$, where $h^0$ is taken by convention to equal $u$. By multilinearity of the determinant, the result follows, with strict inequality from the fact that $h^{\ell}$ occurs with nonzero coefficient in $g^{\ell}$, and $h^k$ occurs with nonzero coefficient in $f^k$ when $a > 0$.

Corollary 13.2 If $x$ is a fixed point of one of the functions $h^{\ell}, g^{\ell}, f^k$, then either $\ell, k \leq 1$, $a = 0$, or there are at most 2 distinct values among $x_1, \ldots, x_n$.

Proof: If there are at least three distinct values among $x_1, \ldots, x_n$, then by relabeling, we may assume $x_1 > x_2 > x_3$. But since $x$ is a fixed point, the matrix

$$\begin{pmatrix} f_1(x) & x_1 & 1 \\ f_2(x) & x_2 & 1 \\ f_3(x) & x_3 & 1 \end{pmatrix}$$

has its first two columns identical, and hence has determinant zero, contradicting Corollary 13.1.

Corollary 13.3 Fix $n, k, a$. Then, unless $k = 1$ and $a = 1$, $f^k$ has at most $1 + (k-1)(2^{n-1} - 1)$ fixed points.

Proof: When $k \leq 1$ or $a = 0$, the result is obvious. So assume $k > 1$ and $a > 0$. By Corollary 13.2, all the fixed points lie on the $2^{n-1} - 1$ critical lines corresponding to unordered partitions of $[n]$ into two nonempty subsets. Within each of these lines, the condition of being a fixed point can be expressed by a degree-$k$ nonconstant polynomial of one variable having a root. Since this can have at most $k$ roots, and $u$ is a common fixed point on all the critical lines, there can be at most $(k-1)(2^{n-1} - 1)$ additional fixed points.

13.4.3 Stability

From the previous section, we see that there are only a finite number of fixed points of the dynamical system given by $f^k$. Our next concern is to understand the stability of these fixed points. Consider a fixed point $x$ satisfying $x = f^k(x)$. By Corollary 13.2, we see that the coordinates ($x_i$'s) of $x$ may have at most two distinct values. We may thus classify fixed points into one of three types.


1. Type A: $x$ corresponds to a state where there is a unique dominant language. In other words, $x$ is such that $x_{\pi(1)} > x_{\pi(2)} = x_{\pi(3)} = \ldots = x_{\pi(n)}$ for some permutation $\pi$ of the set $\{1, 2, \ldots, n\}$ that orders the coordinates in terms of the magnitude of $x_i$.

2. Type B: $x$ corresponds to a state where there are multiple dominant languages. In other words, for some permutation $\pi$ and $l > 1$,
$$x_{\pi(1)} = \ldots = x_{\pi(l)} > x_{\pi(l+1)} = \ldots = x_{\pi(n)}$$

3. Type C: $x$ is the uniform solution given by $x = (\frac{1}{n}, \ldots, \frac{1}{n})$.

We can now state the following theorem:

Theorem 13.1 (Hayes and Niyogi 2005) Fixed points of Type B are always unstable.

Proof: Consider a fixed point of Type B given by $x_1 = x_2 = \ldots = x_l > x_{l+1} = x_{l+2} = \ldots = x_n$. Suppose this is stable. This means there exists a $\delta > 0$ such that for all $y \in \Delta^{(n-1)} \cap B_{\delta}(x)$, the sequence given by $y(t+1) = f(y(t))$ from the initial condition $y(0) = y$ converges to $x$. We will now demonstrate the existence of a particular $y \in \Delta^{(n-1)} \cap B_{\delta}(x)$ for which the above convergence is not true. Pick

$$y = x + \Big(\epsilon, -\frac{\epsilon}{n-1}, \ldots, -\frac{\epsilon}{n-1}\Big)$$

By construction, $y \in \Delta^{(n-1)}$, and for $\epsilon < \sqrt{\tfrac{n-1}{n}}\,\delta$, we see that $y \in B_{\delta}(x)$. Note also that

$$y_1 > y_2 > y_n$$

Now by Lemma 13.2, we have that

$$\lim_{t \to \infty} (y_2(t) - y_n(t)) = 0$$

However, since $x_2 - x_n > 0$, we see that the sequence $y(t)$ does not converge to $x$.


Lemma 13.2 Let $y \in \Delta^{n-1}$ be a point where the coordinates are ordered so that (i) $y_i \geq y_j$ whenever $i < j$ and (ii) $y_1 > y_2 > y_n$. Then the iterative dynamics of $f$ defined by $y(t+1) = f(y(t))$ from the initial point $y$ is such that

$$\lim_{t \to \infty} (y_2(t) - y_n(t)) = 0$$

Proof: Consider the sequence $\{y, y(1), y(2), \ldots\}$ where $y(t) = f(y(t-1))$. We will study this sequence in the three-dimensional space obtained by applying the coordinate projection $P_3: \mathbb{R}^n \to \mathbb{R}^3$ given by $x = P_3 y = (y_1, y_2, y_n)^T$. Thus, we obtain the sequence of 3-tuples $\{x, x_1, \ldots\}$ where $x_i = P_3 y(i)$. In this three-dimensional space, with origin defined by $(0, 0, 0)^T$, let $u = (1, 1, 1)^T$, $e_1 = (1, 0, 0)^T$ and $e_3 = (0, 0, 1)^T$ respectively. Following usual conventions, we can identify vectors with points in the space and we will do so in our proof. Viewing this space in the direction of $u$ (so that the point $u$ is directly above the origin), we get the picture of Figure 13.2. Following the picture, consider an arbitrary $x_t$ and $x_{t+1} = P_3[f(y_t)]$. The following observations may be made.

1. For every $t$, $x_t$ and $e_3$ are on opposite sides of the plane defined by $u$ and $e_1$. To see this, simply notice that $|e_1\, x_t\, u| \geq 0$ while $|e_1\, e_3\, u| < 0$, keeping in mind that the equation of the plane is given by $|e_1\, x\, u| = 0$.

2. For every $t$, $e_1$ and $x_t$ are on the same side of the plane defined by $u$ and $e_3$. To see this, notice that $|e_3\, e_1\, u| > 0$ and $|e_3\, x_t\, u| \geq 0$.

3. For every $t$, $x_t$ and $e_1$ are on opposite sides of the plane defined by $u$ and $x_{t+1}$. To see this, notice that $|e_1\, x_{t+1}\, u| \geq 0$ while $|x_t\, x_{t+1}\, u| \leq 0$.

This justifies the picture of Figure 13.2 where we view all points from the direction $u$. Following this picture, we can define for any vector $z \in \mathbb{R}^3$ the angle that the vector makes with the plane defined by $u$ and $e_1$. Let $\theta(z)$ be this angle. Formally, we have

$$\cos\theta(z) = \frac{(P_u z) \cdot (P_u e_1)}{\|P_u z\|\,\|P_u e_1\|}$$

where $P_u$ denotes the projection onto the two-dimensional plane perpendicular to $u$. Now consider the set

$$A = \{(z_1, z_2, z_3)^T = P_3 z \mid z_1 \geq z_2 \geq z_3;\ z \in \Delta^{(n-1)};\ 0 \leq \theta((z_1, z_2, z_3)^T) \leq \theta(-e_3)\}$$


Figure 13.2: A view of the points $x(t)$, $x(t+1)$, $u$, $e_1$, $e_3$ in the direction of $u$. Thus the origin is directly below $u$ and what is observable in the picture is the projection of each vector onto the plane perpendicular to $u$.


On this set $A$ one can also define the function $\Delta\theta$ to be the change in $\theta$ after one step of the dynamics defined by $f$. In other words, for $P_3 z \in A$,

$$\Delta\theta(P_3 z) = \theta(P_3 z) - \theta(P_3 f(z))$$

Clearly the sequence $x, x_1, \ldots$ lies in $A$. Observations 2 and 3 taken together show that $\theta(x_t)$ is a decreasing function of $t$. Therefore, we have

$$\forall t \quad \theta(x_t) \geq \theta(x_{t+1}) \geq 0$$

Therefore $\lim_{t \to \infty} \theta(x_t)$ must exist. Let this limit be $\theta^*$. Since $A$ is a compact set and $x_t$ is a sequence in $A$, we know that it must have a convergent subsequence. Let $x_{t_n}$ be the convergent subsequence and $x^* \in A$ be its limit point. Since $\theta(v)$ is a continuous function on $A$, we have that $\theta(x^*) = \theta^*$. Since $\Delta\theta(x_t) = \theta(x_t) - \theta(x_{t+1})$, we have $\Delta\theta(x^*) = 0$. By the latter condition, and Corollary 13.2, we have that there can be at most two distinct values in $x^*$. There are therefore three cases:

(a) $x^*(1) = x^*(2) = x^*(3)$

(b) $x^*(1) > x^*(2) = x^*(3)$

(c) $x^*(1) = x^*(2) > x^*(3)$

By observation 1, case (c) cannot be true. Since only case (a) or case (b) is true, we see that $x^*$ lies on the plane defined by $u$ and $e_1$. Consequently, $\theta^* = \theta(x^*) = 0$. This, in turn, means that $\lim_{t \to \infty} x_t(2) - x_t(3) = 0$.

13.4.4 Bifurcations

From the previous sections, we see that there could be at most a finite number of equilibria (fixed points). These have been classified into three types (A, B, C) and the instability of Type B equilibria has already been established. The map $f^k: \Delta^{(n-1)} \to \Delta^{(n-1)}$ is parameterized by $a, k$, and both the number of equilibria and their stability depend upon the values of $a$ and $k$. We now discuss the bifurcations in the dynamics as $a$ and $k$ change.


Small a, k Regime

We first note that for small values of $a, k$, there is only one fixed point, corresponding to the uniform solution $x = (1/n, \ldots, 1/n)^T$. To see this, consider $a = 0$. Then $F_i^k = 0$ for all $i$. Thus, from Equation 13.6, we have, for all $i$,

$$f_i^k = \frac{1}{n}$$

and we see that the population moves to the uniform equilibrium in one time step for all values of $k$.

It seems that for small values of $k$, the uniform solution is the only fixed point for all $a \in [0, 1]$. An analytic proof of this with a quantitative characterization of what small $k$ means in this context is missing at the moment. However, let us briefly consider the special cases of $k = 1, 2, 3$, respectively.

For $k = 1$, we see that $F_i^1 = a x_i$ and therefore the dynamical map is given by

$$f_i^1 = a x_i + (1 - a)\frac{1}{n}$$

Solving for $f_i^1(x) = x_i$, we see that the uniform solution is the only fixed point.

For $k = 2$, we have $F_i^2 = 2 a x_i (1 - a) + (a x_i)^2$ and the dynamical map is now given by

$$f_i^2 = 2 a x_i (1 - a) + a^2 x_i^2 + \Big(1 - 2a(1-a) - a^2 \sum_{j=1}^{n} x_j^2\Big)\frac{1}{n}$$

The fixed points are given by $f_i^2(x) = x_i$, and noting that $\big(1 - 2a(1-a) - a^2 \sum_{j=1}^{n} x_j^2\big)\frac{1}{n}$ does not depend upon $i$, we have that

$$x_i - a^2 x_i^2 - 2 a x_i (1-a) = x_j - a^2 x_j^2 - 2 a x_j (1-a)$$

for all $i, j$. From the above equation, we have that either (i) $x_i = x_j$ for all $i, j$, or (ii) $x_i + x_j = \frac{1 - 2a(1-a)}{a^2}$ for all $i, j$. It is easy to check that (ii) is not possible, so that the uniform solution ($x_i = x_j$ for all $i, j$) is the only possible solution.

For $k = 3$, we have $F_i^3 = (a x_i)^3 + 3 a x_i (1-a)^2 + 3 (a x_i)^2 (1 - a x_i)$. Noting that $f_j^3 = F_j^3 + (1 - \sum_{i=1}^{n} F_i^3)\frac{1}{n}$, we have that any fixed point ($f_j^3(x) = x_j$) must satisfy

$$x_j - F_j^3(x) = x_i - F_i^3(x)$$


for all $i, j$. Substituting the expression for $F_i^3$ in the above equation, we have that for all $i, j$, the following must be true:

$$x_i - x_j = -2a^3(x_i^3 - x_j^3) + 3a(1-a)^2(x_i - x_j) + 3a^2(x_i^2 - x_j^2)$$

From this we get either (i) $x_i = x_j$ for all $i, j$ or (ii) $3a^2(x_i + x_j) - 2a^3(x_i^2 + x_j^2 + x_i x_j)$ is independent of $i, j$. Considering case (ii) further, we see that this reduces to either (a) $x_i = x_j$ for all $i, j$ or (b) $3a^2 - 2a^3(x_j + x_k + x_i) = 0$. Clearly $x_j + x_k + x_i \leq 1$ while $\frac{3}{2a} > 1$, so that (b) is impossible. Thus the uniform solution ($x_i = x_j$ for all $i, j$) remains the only fixed point.

Stability of the Uniform Solution

Let us now consider the uniform solution (Type C) and conduct an analysis of its stability. It is easy to check that the uniform solution $x = (\frac{1}{n}, \ldots, \frac{1}{n})$ is always a fixed point of the iterated map given by Equation 13.6 for all values of $a, k$. This is seen by noting that (by symmetry) $F_i^k(x) = F_j^k(x)$ for all $i, j$ and then substituting in Equation 13.6. To analyze stability for any fixed point $x^*$ of the map $x(t+1) = f^k(x)$, we need to check if the map is contracting in all directions along the simplex. Any perturbation of $x^*$ along the simplex may be given by $x = y + x^*$ where $y \in \mathbb{R}^n$ and $\sum_{i=1}^{n} y_i = 0$. Following standard linear stability analysis, we note that $f^k(x) - f^k(x^*) \approx Jy$ where $J$ is the Jacobian (of $f^k$) evaluated at $x^*$. Therefore the stability of $x^*$ is determined by the value of $S$ where

$$S = \max_{y^T \mathbf{1} = 0} \frac{\|Jy\|}{\|y\|}$$

Since $f^k: \Delta^{n-1} \to \Delta^{n-1}$ depends upon both $a$ and $k$ (in addition to $n$), we see that the value of $S$ depends upon $n$, $a$, and $k$. Therefore, we will denote this dependence explicitly by writing $S(n, a, k)$. To evaluate $S(n, a, k)$ for the uniform solution, let us consider the structure of the Jacobian $J$. Noting that $f_i^k = F_i^k + \frac{1}{n}(1 - \sum_{j=1}^{n} F_j^k)$, we see that

$$J_{ij} = \frac{\partial f_i^k}{\partial x_j} = \frac{n-1}{n}\frac{\partial F_i^k}{\partial x_j} - \frac{1}{n}\sum_{l \neq i}\frac{\partial F_l^k}{\partial x_j}$$

One may check that by symmetry, we have

$$\frac{\partial F_i^k}{\partial x_j} = \frac{\partial F_m^k}{\partial x_l} \quad \text{for all distinct } i, j, m, l$$

when evaluated at the uniform point $x^* = (1/n, \ldots, 1/n)^T$. Further, for all


$i, j$, we have $\frac{\partial F_j^k}{\partial x_j} = \frac{\partial F_i^k}{\partial x_i}$. Letting (for all distinct $i, j$)

$$\frac{\partial F_i^k}{\partial x_i} = A; \qquad \frac{\partial F_i^k}{\partial x_j} = B$$

we see that

$$J_{ii} = \frac{n-1}{n}(A - B); \qquad J_{ij} = \frac{1}{n}(B - A)$$

Because of the special symmetric structure of $J$, it is easy to check that its eigenvalues and eigenvectors are as follows: the smallest eigenvalue is $\lambda = 0$ and the corresponding eigenvector is $\mathbf{1}$ (the all ones vector). The next eigenvalue is $A - B$ (multiplicity $n - 1$) and the eigenspace spanned by the eigenvectors is the $(n-1)$-dimensional subspace orthogonal to $\mathbf{1}$. Therefore, by a familiar Rayleigh-Ritz argument, we see that

$$S(n, a, k) = A - B = \frac{\partial F_i^k}{\partial x_i} - \frac{\partial F_i^k}{\partial x_j}$$

Now note that

$$\frac{\partial F_1^k}{\partial x_1}\Big|_{x^*} = a \sum_{\mathbf{k} \in I} \binom{k}{\mathbf{k}}\, k_1\, (1-a)^{k_{n+1}} \Big(\frac{a}{n}\Big)^{k - k_{n+1} - 1} \qquad (13.8)$$

and

$$\frac{\partial F_1^k}{\partial x_2}\Big|_{x^*} = a \sum_{\mathbf{k} \in I} \binom{k}{\mathbf{k}}\, k_2\, (1-a)^{k_{n+1}} \Big(\frac{a}{n}\Big)^{k - k_{n+1} - 1} \qquad (13.9)$$

where $\mathbf{k} = (k_1, k_2, \ldots, k_{n+1})$ refers to an ordered partition of $k$ into $n+1$ nonnegative integers as usual. The set $I$ is given by the following:

$$I = \Big\{(k_1, k_2, \ldots, k_n, k_{n+1}) \;\Big|\; k_1 > k_2, \ldots, k_n;\ \sum_{i=1}^{n+1} k_i = k\Big\}$$

Therefore, we have

$$S(n, a, k) = a \sum_{\mathbf{k} \in I} \binom{k}{\mathbf{k}}\, (k_1 - k_2)\, (1-a)^{k_{n+1}} \Big(\frac{a}{n}\Big)^{k - k_{n+1} - 1} \qquad (13.10)$$

The quantity $S(n, a, k)$ determines the stability of the uniform solution of the map $f^k$ for the parameter value $a$. We see that $S(n, a = 0, k) = 0$


for all $n, k$, suggesting that the uniform solution is stable at $a = 0$. Now let us consider the value of $S(n, a, k)$ at $a = 1$. This is given by

$$S(n, a = 1, k) = \sum_{(k_1, \ldots, k_n, k_{n+1} = 0) \in I} \binom{k}{k_1\, k_2\, \ldots\, k_n\, 0}\, (k_1 - k_2)\, \Big(\frac{1}{n}\Big)^{k-1}$$

We now prove the following theorem:

Theorem 13.2 (Hayes and Niyogi 2005) The uniform solution ($a = 1$) becomes unstable for large $k$. In particular,

$$\lim_{k \to \infty} S(n, a = 1, k) = \infty$$

Proof: We prove by induction. The base case corresponds to $S(n = 2, a = 1, k)$, and we have already proved that $\lim_{k \to \infty} S(n = 2, a = 1, k) = \infty$ (Proposition 13.1). The induction step is as follows. Suppose $\lim_{k \to \infty} S(n, a = 1, k) = \infty$. We observe the following chain of relations:

$$S(n+1, a = 1, k) = \sum_{k_1 > k_2, \ldots, k_{n+1}} (k_1 - k_2) \binom{k}{k_1 \ldots k_{n+1}} \Big(\frac{1}{n+1}\Big)^{k-1}$$

$$\geq \sum_{k_{n+1} = 0}^{\lfloor k/(n+1) \rfloor} \; \sum_{k_1 > k_2, \ldots, k_n} (k_1 - k_2) \binom{k}{k_1 \ldots k_n\, k_{n+1}} \Big(\frac{1}{n+1}\Big)^{k-1}$$

$$= \sum_{k_{n+1} = 0}^{\lfloor k/(n+1) \rfloor} \binom{k}{k_{n+1}} \frac{(n+1)\, n^{k - k_{n+1}}}{(n+1)^{k}} \sum_{k_1 > k_2, \ldots, k_n} (k_1 - k_2) \binom{k - k_{n+1}}{k_1 \ldots k_n} \Big(\frac{1}{n}\Big)^{k - k_{n+1}}$$

$$\geq \sum_{k_{n+1} = 0}^{\lfloor k/(n+1) \rfloor} \binom{k}{k_{n+1}} \Big(\frac{1}{n+1}\Big)^{k_{n+1}} \Big(\frac{n}{n+1}\Big)^{k - k_{n+1}} S(n, a = 1, k - k_{n+1})$$

Since (i) $\lim_{l \to \infty} S(n, a = 1, l) = \infty$, and (ii) $k - k_{n+1} \geq \frac{n}{n+1}k$, it follows that $\lim_{k \to \infty} S(n, a = 1, k - k_{n+1}) = \infty$ for each $k_{n+1}$ in the above expression. Further, since by the Central Limit Theorem, we know that


$$\lim_{k \to \infty} \sum_{k_{n+1} = 0}^{\lfloor k/(n+1) \rfloor} \binom{k}{k_{n+1}} \Big(\frac{1}{n+1}\Big)^{k_{n+1}} \Big(\frac{n}{n+1}\Big)^{k - k_{n+1}} = \phi(0) = \frac{1}{2}$$

(where $\phi$ is the cumulative distribution of the unit normal), we see that

$$\lim_{k \to \infty} \sum_{k_{n+1} = 0}^{\lfloor k/(n+1) \rfloor} \binom{k}{k_{n+1}} \Big(\frac{1}{n+1}\Big)^{k_{n+1}} \Big(\frac{n}{n+1}\Big)^{k - k_{n+1}} S(n, a = 1, k - k_{n+1}) = \infty$$

The theorem is proved.

The Emergence of Coherence

We can now piece together a picture of the bifurcations by which shared languages emerge as a result of the dynamics of language evolution in our setting. From the previous section, we see that $\lim_{k \to \infty} S(n, a = 1, k) = \infty$. We now make the following observations. Consider any sufficiently large $k$ such that $S(n, a = 1, k) > 1$. For such a $k$, we have a bifurcation as $a$ changes continuously from $a = 0$ to $a = 1$. Specifically,

1. For $a = 0$, the only fixed point of the map $f^k$ is the uniform solution and this is stable. Thus, from all initial conditions, the population moves to a situation where all possible languages are equally represented in the population.

2. Since $S(n, a = 0, k) = 0$ while $S(n, a = 1, k) > 1$, by continuity, we see that there exists a bifurcation point $a^*$ where $S(n, a^*, k) = 1$, and when $a > a^*$, the uniform solution becomes unstable.

3. The bifurcation point $a = a^*$ when the uniform solution becomes unstable corresponds also to the point where new one-grammar solutions emerge. To see this fact, consider a critical line segment on the simplex $\Delta^{(n-1)}$ that joins the uniform point $u = (\frac{1}{n}, \ldots, \frac{1}{n})^T \in \Delta^{(n-1)}$ to one of the corners $e_1 = (1, 0, \ldots, 0)^T \in \Delta^{(n-1)}$. This line may be parameterized by the real number $\alpha \in [0, 1]$, and any point $x \in \Delta^{(n-1)}$ on this line may be represented as $x = \alpha e_1 + (1 - \alpha)u$. Now consider the map $f^k$ restricted to this critical line. This restriction defines a map $g: [0, 1] \to [0, 1]$ as follows. For any $x = \alpha e_1 + (1 - \alpha)u$, consider $y = f^k(x)$. It is easy to check that $y$ lies on the critical line segment, so that it can be represented as $y = \alpha_y e_1 + (1 - \alpha_y)u$. Now we set $g(\alpha) = \alpha_y$. It is easy to check that $g(0) = 0$, $g(1) < 1$ and $g'(0) > 1$. Therefore there exists a point $\alpha^* \in (0, 1)$ such that $g(\alpha^*) = \alpha^*$. This corresponds to a point $x^* = \alpha^* e_1 + (1 - \alpha^*)u$ that is a fixed point of $f^k$. Further, since $\alpha^* > 0$, it is easy to check that $x^*$ is such that $x^*(1) > \frac{1}{n}$. In other words, there is a dominant language in the community. It is also easy to check by standard arguments that this fixed point is stable.

4. It is worth noting that we see the emergence of coherence for large $a, k$ without any notion of differential fitness and natural selection. What brings about coherence instead is the fact that all language learners are immersed in the same population and learn from the same source distribution, which is a mixture of the languages of the previous generation. When $k$ is large enough, the population eventually settles on a common language. The transition from the Tower of Babel mode with a uniform solution to a one-grammar mode with a shared (majority) language occurs via a bifurcation in the dynamics, as we see.
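As a numerical companion to observations 2 and 3 above, the following Python sketch estimates $S(n, a, k)$ by finite differences of $F_1^k$ at the uniform point and scans $a$ for the first value where $S$ exceeds 1. The function names and the grid-search procedure are mine; they are one convenient way to locate $a^*$ numerically, not part of the analysis above.

```python
import itertools
from math import factorial, prod

def F1(x, a, k):
    """F_1^k(x): probability that cues for L_1 strictly outnumber all other cues."""
    n = len(x)
    p, q, total = [a * xi for xi in x], 1.0 - a, 0.0
    for ks in itertools.product(range(k + 1), repeat=n):
        m = k - sum(ks)                          # number of noncues
        if m < 0 or any(ks[0] <= ks[j] for j in range(1, n)):
            continue
        coeff = factorial(k) // (prod(factorial(c) for c in ks) * factorial(m))
        total += coeff * prod(pi ** c for pi, c in zip(p, ks)) * q ** m
    return total

def S(n, a, k, eps=1e-6):
    """Finite-difference estimate of S(n,a,k) = dF_1/dx_1 - dF_1/dx_2 at the uniform point."""
    u = [1.0 / n] * n
    base = F1(u, a, k)
    x1 = u[:]; x1[0] += eps
    x2 = u[:]; x2[1] += eps
    return ((F1(x1, a, k) - base) - (F1(x2, a, k) - base)) / eps

def critical_a(n, k, grid=100):
    """Smallest a on a grid for which S(n, a, k) > 1, i.e., where the uniform
    solution loses stability and stable one-language solutions appear."""
    for j in range(grid + 1):
        a = j / grid
        if S(n, a, k) > 1.0:
            return a
    return None                                  # uniform solution stays stable on [0, 1]

print(critical_a(n=3, k=10))                     # an estimate of the bifurcation point a*
```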

13.5 Coherence for a Memoryless Learner

So far in this chapter, I have spent some time analyzing the evolutionary dynamics of a batch learner. We saw how coherence would emerge in a population of such learners via a bifurcation. The other important class of learning algorithms we have considered in this book are the memoryless learning algorithms. Interestingly, for a prototypical memoryless learning algorithm, we see that coherence never emerges. Formally, the memoryless algorithm is as follows:

1. Initialize. Choose a language uniformly at random.

2. Iterate. At each time step, receive a new example sentence. If this sentence is a cue for $L_i$, then change the hypothesis language to $L_i$. Otherwise, retain the current hypothesis language.

It is easy to check that this algorithm satisfies the learnability requirement. By our familiar Markov analysis, the behavior of the learner can be understood as a Markov chain with $n$ states and a transition matrix $T$ whose $(i, j)$ term is given by

$$T_{ij} = \begin{cases} a x_j & \text{if } j \neq i \\ (1 - a) + a x_i & \text{otherwise} \end{cases}$$

We see that $T = (1 - a)I + a\mathbf{1}x^T$ where $I$ is the $n \times n$ identity matrix and $x \in \Delta^{n-1}$ is the state of the population in generation $t$. The probability


with which learners will settle on the different languages after $k$ examples are drawn is given by our familiar analysis, resulting in the following dynamics:

$$(x(t+1))^T = \Big(\frac{1}{n}, \ldots, \frac{1}{n}\Big) T^k \qquad (13.11)$$

Because of the structure of $T$, we see that

$$T^k = \big((1-a)I + a\mathbf{1}x^T\big)^k = (1-a)^k I + \big(1 - (1-a)^k\big)\mathbf{1}x^T$$

Using Equation 13.11, we see that

$$x(t+1) = (1-a)^k u + \big(1 - (1-a)^k\big)x(t)$$

where $u = (\frac{1}{n}, \ldots, \frac{1}{n})^T$ is the uniform point. It is straightforward to check that this simple linear dynamics results in the population moving to the uniform solution $u$ from all initial conditions for all values of $a \in [0, 1)$. For $a = 1$, the dynamics is given by the identity map so that no change is possible. The initial distribution of languages in the population is preserved for all time.
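The contrast with the batch learner is easy to see numerically. The following short Python sketch, with parameter values chosen arbitrarily for illustration, iterates the linear update just derived and shows the population drifting to the uniform point.

```python
# A minimal sketch of the memoryless learner's population dynamics:
# x(t+1) = (1-a)^k * u + (1 - (1-a)^k) * x(t), with u the uniform point.
def memoryless_step(x, a, k):
    n = len(x)
    w = (1.0 - a) ** k                 # weight pulling the population toward uniform
    return [w / n + (1.0 - w) * xi for xi in x]

x = [0.7, 0.2, 0.1]                    # an initially skewed population, n = 3
for _ in range(100):
    x = memoryless_step(x, a=0.5, k=3)
print([round(v, 3) for v in x])        # approaches (1/3, 1/3, 1/3): no shared language emerges
```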

13.6 Learning in Connected Societies: Analogies to Statistical Physics

Our model in this chapter with its bifurcations from incoherent to coherent states is reminiscent of models of phase transitions in statistical physics. I have touched upon this analogy in the past and we will now try to explore this a little more concretely here. In the setting for language evolution considered in this chapter, there are n possible linguistic types. We analyzed the behavior of a population of linguistic agents where each agent belongs to one of these types. Learning by individual agents defined a natural dynamical evolution of the population, and we characterized the conditions under which coherent (one-language) modes emerged. In statistical physics, one considers a physical system made up of an ensemble of interacting components (particles). The degree of interaction between these particles is typically governed by temperature in a manner such that the interactions decrease as the temperature increases. One then analyzes some macroscopic property of the system as a function of temperature. In many such systems, one finds a phase transition between two different regimes of behavior separated by a critical temperature T c . Examples include the transitions between different states of matter (solid, liquid,


gas) or the transitions associated with the loss of permanent magnetization at high temperatures. My goals in exploring this analogy are twofold. First, by developing these connections to statistical physics, I hope that intuitions and insights developed in each discipline may be transferred usefully across disciplinary lines. Second, I note that much of this book has focused on the case where agents learn from everybody, i.e., the population is perfectly mixed. This allows for the possibility of long-range interactions between agents and a central focus of this book is on the bifurcations that arise in this setting. I have also considered the case where agents learn from only one person, i.e., the population is largely disconnected and we have noted the lack of bifurcations in this situation. An important question for me is to understand what happens when there are intermediate (local) correlations between agents. One way to formulate these local interactions is by developing a graph structure where vertices are identified with linguistic agents and the edges denote lines of communication along which examples are provided to learning agents situated at these vertices. Once this formulation is adopted, synergies with statistical physics become more apparent and consequently worth exploring. In the next section, I formulate language evolution on a locally connected graph. Following that, I discuss the classical Ising model of spins associated with a graph. Finally, I consider analogies, implications, and directions of future research in Section 13.6.3. My development focuses on the special case in which n = 2.

13.6.1 Language Evolution in Locally Connected Societies

Imagine a spatial connectivity pattern in terms of a graph that has the structure of a square lattice (a $\sqrt{N} \times \sqrt{N}$ lattice with $N$ vertices in all) as shown in Figure 13.3. Each site (vertex) is associated with a random variable $X_{i,j}(t) \in \{0, 1\}$. The value of $X_{i,j}(t)$ may be identified with the language of the linguistic agent occupying the location $(i, j)$ at time $t$ in the graph (lattice). Now $X_{i,j}(t+1)$ is the language at that same location $(i, j)$ at the next time step. In our setting, $X_{i,j}(t+1)$ is a random variable such that

$$P[X_{i,j}(t+1) = 1] = g(a, \mu_{i,j}, k) \quad\text{where}\quad \mu_{i,j} = \frac{1}{4}\big(X_{i+1,j}(t) + X_{i,j+1}(t) + X_{i-1,j}(t) + X_{i,j-1}(t)\big)$$


This corresponds to the following protocol: the language at each location $(i, j)$ is updated by a learning procedure where the learner obtains example sentences at random from the neighboring agents and uses a cue-based learning algorithm to determine its language. The function $g$ is obtained by the usual analysis of the previous sections. As before, $a$ is the probability with which speakers of each language produce cues and $k$ is the number of examples each learning agent hears during the learning period. An example of the function $g$ is given in Equation 13.4. One may now study the evolution of the following object:

$$\alpha_N(t) = \frac{1}{N}\sum_{i,j} X_{i,j}(t)$$

The quantity $\alpha_N(t)$ corresponds to the average fraction of $L_1$ speakers at time $t$.

Remark. If the connectivity pattern of the graph is such that every location is connected to every other location, then all the $X_{i,j}(t)$'s would be identically distributed. Because of the local connectivity described here, the $X_{i,j}(t)$'s are no longer identically distributed. For complete graphs, I believe that $\alpha(t) = \lim_{N \to \infty} \alpha_N(t)$ is well defined and the following is true:

$$\alpha(t+1) = \lim_{N \to \infty} \alpha_N(t+1) = g(a, \alpha(t), k)$$

Thus the bifurcation described in earlier sections of this chapter is recovered as the limiting behavior of large, complete graphs where the graph size goes to infinity. I conjecture that even for locally connected graphs of the sort described in this section, we would observe bifurcations as the graph size was appropriately increased to infinity. The reason for this conjecture rests on analogies to similar phenomena in statistical physics, as I describe shortly.
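For readers who want to experiment, here is a small Python sketch of this lattice protocol for $n = 2$. Since Equation 13.4 is not reproduced in this section, the function `g` below is my own rendering of the $n = 2$ batch-learner choice probability (cues for each language heard in proportion to the neighbors' languages, ties broken at random); the periodic boundary conditions and all parameter values are likewise assumptions made only for the illustration.

```python
import random
from math import factorial

def g(a, mu, k):
    """Probability that a batch learner hearing k examples, with an L_1 fraction
    of mu among its neighbors, ends up speaking L_1 (n = 2; ties broken at random)."""
    p1, p2, q = a * mu, a * (1.0 - mu), 1.0 - a
    prob = 0.0
    for k1 in range(k + 1):
        for k2 in range(k + 1 - k1):
            m = k - k1 - k2                     # number of noncues
            coeff = factorial(k) // (factorial(k1) * factorial(k2) * factorial(m))
            w = coeff * p1 ** k1 * p2 ** k2 * q ** m
            if k1 > k2:
                prob += w
            elif k1 == k2:
                prob += 0.5 * w
    return prob

def step(grid, a, k, rng=random):
    """One synchronous update of every site from its four lattice neighbors (periodic boundaries)."""
    L = len(grid)
    new = [[0] * L for _ in range(L)]
    for i in range(L):
        for j in range(L):
            mu = (grid[(i + 1) % L][j] + grid[(i - 1) % L][j] +
                  grid[i][(j + 1) % L] + grid[i][(j - 1) % L]) / 4.0
            new[i][j] = 1 if rng.random() < g(a, mu, k) else 0
    return new

L = 20
grid = [[random.randint(0, 1) for _ in range(L)] for _ in range(L)]
for _ in range(50):
    grid = step(grid, a=0.95, k=20)
print(sum(map(sum, grid)) / (L * L))            # alpha_N: fraction of L_1 speakers after 50 steps
```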

13.6.2 Magnetic Systems: The Ising Model

Now consider the well studied two-dimensional Ising model of statistical physics. We discuss this model in the context of phase transitions in magnetic systems. In this setting, each site on the square lattice of Figure 13.3 is associated with a particle having a spin that takes one of two values: up or down. A state of the magnetic system is denoted by a configuration of spins $\bar{s} \in \{-1, +1\}^N$ where $\bar{s}_{i,j}$ denotes the spin at site $(i, j)$. Given a magnetic field $B$ and a correlation strength $J$, one may write down the Hamiltonian (energy) of a configuration of spins as

$$H(\bar{s}) = -J \sum_{(i,j) \sim (k,l)} \bar{s}_{i,j}\, \bar{s}_{k,l} \;-\; B \sum_{i,j} \bar{s}_{i,j}$$


Figure 13.3: A two-dimensional $L \times L$ lattice with $N = L^2$ vertices in all. Each vertex is identified with a linguistic agent with a particular language in the context of language evolution and with a particle with a spin in the context of statistical physics. Each vertex is connected to its immediate neighbors as shown.

In the term $\sum_{(i,j) \sim (k,l)} \bar{s}_{i,j}\bar{s}_{k,l}$, the sum is taken over all neighboring sites and measures the degree of alignment of neighboring spins. The term $B\sum_{i,j}\bar{s}_{i,j}$ measures the alignment of the spins with the direction of the magnetic field. Thus each spin configuration $\bar{s} \in \{-1, +1\}^N$ has an energy value associated with it. The Hamiltonian, in turn, defines an equilibrium probability distribution on the set of all possible spin configurations as

$$P[\bar{s}] = \frac{e^{-\beta H(\bar{s})}}{Z_N} \qquad (13.12)$$

where $Z_N = \sum_{\bar{s}} e^{-\beta H(\bar{s})}$ is the normalizing factor referred to as the partition function. The quantity $\beta$ is equal to $\frac{1}{kT}$ where $T$ is the temperature and $k$ is Boltzmann's constant. This characterizes the probability with which the particle system will take on different spin configurations at thermal equilibrium. A parameter of interest is the average spin (magnetization) in the absence of an external magnetic field. It is easy to check that the average magnetization is given by

$$M_N(T) = \frac{1}{N}\sum_{\bar{s}}\sum_{i,j}\bar{s}_{i,j}\, P[\bar{s}] = \frac{kT}{N}\left.\frac{\partial F}{\partial B}\right|_{B=0}$$

where F = ln(ZN ) is the Free energy. An important limiting condition is the so called thermodynamic limit where one lets N → ∞, i.e., the lattice (graph) size grows to infinity, and in this limiting case, one may define m(T ) = limN →∞ MN (T ). Thus m(T ) may be interpreted as the average magnetization in a system with an infinite number of particles. An analysis of this two-dimensional Ising model reveals the following phase transition phenomenon. There exists a critical temperature T C such that for all T > TC , the equilibrium state of the material is a state of demagnetization, i.e., m(T ) = 0, or there are as many up spins as down spins on average. For T < TC , it turns out that m(T ) is nonzero implying that spontaneous magnetization occurs and the material may therefore be in a state of permanent magnetization at that temperature. Remark 1. A number of technical issues relate to how the thermodynamic limit N → ∞ is taken and the existence of well-defined probability measures on the configuration space in this limit. These equilibrium measures correspond to Gibbs measures, and the phase transitions correspond to whether there is a single Gibbs measure or multiple Gibbs measures in the thermodynamic limit. To provide some intuition, it is worth noting that because


of the particular exponential form of the probability distribution in Equation 13.12, we see that at high temperatures (large T ), the value of β is close to 0 and the distribution is relatively flat. Thus all configurations are more or less equally likely, the effect of interactions is less, individual particle spins behave almost at random, and the average magnetization is close (equal) to zero. At T = 0, when β = ∞, the system will settle into a minimum energy configuration where all spins will align. A phase transition may be expected between these two regimes, but a rigorous proof is beyond the scope of the current book. Remark 2. What has been discussed above is a static theory of magnetization that describes the probability distribution of spin configurations at thermal equilibrium at different temperatures. A number of dynamic processes on associated graphs (lattices) have been proposed that correspond to various Markov Chain Monte Carlo methods for sampling from this equilibrium distribution. For example, single spin flip Glauber dynamics is a procedure in which a site is picked at random and using a local rule, the spin at that site remains the same or flips with a probability that is based on the configuration of its neighbors and the strength of interaction. In ferromagnetic materials, spins tend to align with their neighbors. Remark 3. The formulation and initial development of the theory of Ising systems may be traced back to the first half of the twentieth century as attempts were made to construct theoretical models for the existence of phase transitions in magnetic systems. More recent developments over the last thirty years have elaborated many aspects of such systems. Researchers have studied the behavior of the order parameter (average magnetization) near the critical point leading to insights about power law characteristics and universality. For example, using techniques from the theory of renormalization groups, many now believe that the phenomena of phase transitions and behavior near the critical point depend on some global properties (e.g. the dimension of the lattice — one may consider higher dimensional lattices in general rather than the two dimensional case we described here) rather than the details of the particular interaction. A large number of local interaction behaviors may thus give rise to the same global patterns. For a treatment of this subject, see Kadanoff 2000.
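As a small illustration of the dynamic procedures mentioned in Remark 2, the Python sketch below runs single-spin-flip Glauber dynamics on a periodic square lattice, in units where $J$ and Boltzmann's constant equal 1 and with zero external field; the lattice size, temperature, and number of sweeps are chosen by me purely for illustration.

```python
import math, random

def glauber_sweep(spins, T, J=1.0, B=0.0, rng=random):
    """One sweep of single-spin-flip Glauber dynamics on a periodic square lattice."""
    L = len(spins)
    for _ in range(L * L):
        i, j = rng.randrange(L), rng.randrange(L)
        nb = (spins[(i + 1) % L][j] + spins[(i - 1) % L][j] +
              spins[i][(j + 1) % L] + spins[i][(j - 1) % L])
        dE = 2.0 * spins[i][j] * (J * nb + B)               # energy cost of flipping this spin
        if rng.random() < 1.0 / (1.0 + math.exp(dE / T)):   # Glauber flip probability
            spins[i][j] *= -1

L, T = 20, 1.5                                 # T below the critical temperature (about 2.27 in these units)
spins = [[random.choice([-1, 1]) for _ in range(L)] for _ in range(L)]
for _ in range(500):
    glauber_sweep(spins, T)
m = abs(sum(map(sum, spins))) / (L * L)        # magnitude of the average magnetization
print(m)                                       # typically close to 1 below T_c, near 0 well above it
```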

13.6.3 Analogies and Implications

We can now draw analogies between our model of language evolution on a lattice and the two-dimensional Ising model of statistical physics. Linguistic agents are like particles. Spins are like languages. Each linguistic


agent takes on one of two languages while each particle has one of two spins. The linguistic composition of the population (what percentage speaks L 1 ) is like the average magnetization of the spin system. In the language evolution case, the quantity a behaves like temperature. When a = 0, each agent behaves independently of the others and the only equilibrium to be expected is the uniform mode in which both languages are equally represented in the population. This is like the behavior of the magnetic system at very high temperatures when spins are largely uncorrelated and the average magnetization is zero. I conjecture therefore that one ought to observe a phase transition as a increases from 0 to 1 for the specific language evolution model with local interactions described in Section 13.6.1. I leave this for future work. A number of implications of the analogy are worth noting. First, we observe that the models in the first part of this chapter are formulated in the more general case with n linguistic types (languages). My description in this section, however, has been in terms of the special case when n = 2. The corresponding generalization of Ising models with n-valued spins is known as the Potts model and similar issues may be studied in that context. Second, I believe that my analysis in this chapter (working in the continuum limit of infinitely many linguistic agents) corresponds closely to the situation in which every agent is connected to every other, i.e., the graph of interactions is not the locally connected lattice of Figure 13.3 but rather is a complete graph where each vertex is connected to all other vertices. The solution of Ising systems with a complete graph correspond rather directly to mean field theories where one assumes that the effective field on every particle is the same. Mean field approximations are widely used in statistical physics and have been found to be useful. One may regard many of the models of this book as mean field approximations to more realistic models of language evolution. It is worth noting, however, that the approach in this book has not followed the formalisms or style of statistical mechanical systems. Rather, by taking the continuum limit, we arrive immediately at non-linear dynamical systems and the bifurcation theory of iterated maps plays the central role in our development. Third, following Remark 3 of the previous section, I believe that the qualitative insights of our “mean field” analysis will translate into other models of language evolution with local (rather than global) interactions as well. In fact, one of the central insights of this book is that bifurcations may arise in language evolution as a result of the interactions of individual language learners. Following Remark 3 (previous section), I believe that this insight will be robust to details of the particular interactions.


Finally, it is worth noting that although there are many analogies to spin systems, there are also some differences. In particular, unlike the case in statistical physics, for models of language evolution, there is no static theory. There is no quantity analogous to the Hamiltonian that may describe an equilibrium distribution from first principles. Language evolution is fundamentally a dynamic theory where the dynamics is derived from the behavior of individual learning. Second, the details of the model are sufficiently different that a rigorous proof of the behavior of language evolution systems in the thermodynamic limit is not obvious at the present moment.

We thus see that our approach to language evolution may be characterized in terms of constructs that are similar to statistical physics systems. A general setting for language evolution consists of the following ingredients:

1. A graph G = (V, E), where V is the vertex set and E is the edge set, represents the linguistic connectivity pattern of the society. Each vertex v ∈ V may be identified as the site of a linguistic agent. The connectivity pattern may reflect social, geographical, economic, or other factors.

2. Let L be a collection of possible languages. These may be identified with sets of expressions, grammars, alternative pronunciations of words, etc. We have seen examples of these over the course of this book. In our current chapter, L = {L_1, ..., L_n} where each L_i ⊂ Σ*.

3. A linguistic configuration of the society at time t is given by x_v(t) ∈ L for each v ∈ V.

4. A language learning algorithm A determines how the linguistic configuration evolves as each agent is exposed to examples from its neighbors and updates its language. Thus, this defines x_v(t+1) for each v ∈ V.

Depending upon precise choices for each of the four above, one may obtain many versions of the language evolution problem. Some of these problems may be amenable to study by the techniques developed in probability theory (e.g., percolation processes) and statistical physics. Others may require new techniques. It is my hope that knowledge developed about phase transitions and critical phenomena may be usefully applied in this linguistic setting. The peculiarities of the language evolution problem may also bring fresh perspectives and new problems for theorists interested in complex systems of simple interacting agents. For an approach to language acquisition and evolution based on ideas from statistical physics, see the work of Antonio Galves and colleagues (Cassandro et al. 1999).

13.7 Conclusions

In this chapter, we reexamined the conditions under which linguistic coherence might emerge in a population of linguistic agents. In particular, we considered a setting where individual children learn from the entire adult population at large rather than a single teacher (parent) alone. We concentrated on the case in which there were n possible languages (linguistic types) that were symmetrically arranged in that all languages were equally easy to learn. Languages were learned on the basis of cues provided by speakers to child learners. The learning algorithm was a batch learner that counted cues and hypothesized a language that was consistent with most cues. Under these assumptions I derived the evolutionary dynamics of the population. This was seen to be a map f k : Δn−1 → Δn−1 and we showed that this could have only a finite number of fixed points (equilibria). More interestingly, there were two regimes for this map. In the small a, k regime, only the uniform fixed point was stable and populations converged to this from all initial conditions. Thus no shared language emerged. As a, k changed values, a bifurcation led to the loss of stability of the uniform solution and new stable one language modes arose. Shared languages emerged. Thus, yet again, we see the role of bifurcations in the dynamics of language evolution. The values of a and k may be related to the learning fidelity of the learner. Thus small a, k corresponds to a regime in which there is too little information on the basis of which learners make their linguistic decisions. The system is noisy and shared languages do not emerge. Large a, k corresponds to a regime in which a lot of information is provided to individual learners, learning fidelity is high, the system is less noisy, and shared languages emerge. The small a, k regime may be compared to high thermal noise in statistical physics systems or high mutation rates in biological evolutionary systems. Correspondingly, the large a, k regime may be compared to low thermal noise in statistical physics or low mutation rates in biological evolution and the bifurcations in language evolution may be qualitatively compared to those that are seen in physics and biology respectively. It is also worthwhile for us to reflect on the difference between learning from one teacher and learning from the population. Our analysis suggests that when learning from one teacher, differential fitness and natural selection play an important role for the emergence of coherent states corresponding to shared languages in the population. In our particular models, we find that without natural selection, coherence will never emerge, no matter how good the learning ability of individual learners. In contrast, in social learning,


as in the model of this chapter, coherence emerges without any recourse to natural selection. This insight must be kept in mind when constructing explanatory paradigms for the evolution of communication systems in the animal kingdom. Thus, I argue that in those evolutionary scenarios (such as in some species of songbirds) when learning is in the nesting phase and primarily from one teacher, natural selection must play an important role in the evolution of shared systems. On the other hand, in social settings such as human linguistic communication, one need not invoke notions of natural selection for language evolution. While shared communication systems were seen to emerge for the batch learner in a social learning situation, a similar situation was not true for a memoryless learner. In fact, the population dynamics resulting from the individual behavior of the memoryless learner converged to a uniform solution for all values of k. Thus no matter how good the learning ability of the learner, shared languages do not emerge. This suggests that the emergence of shared communication systems is not implied simply by high learning ability. Rather, the details of the learning algorithm and the influence of natural selection interact in a subtle way to bring this about.

Part V

Conclusions

Chapter 14

Conclusions

Parts II, III, and IV of this book have explored the themes of language, learning, and evolution in some detail. In this concluding chapter, I take stock of the situation, assess the progress made so far, and outline important directions for future understanding. Recall the essential logic of the argument presented here. There are three main observations:

1. Language is composed of units such as distinctive features, phonemes, syllables, morphemes, phrases, and so on. In any particular linguistic system, these have systematic relationships with each other. These relationships may be given a formal characterization (algebraic or statistical). Thus linguistic knowledge and behavior are usefully captured by an underlying computational system (grammar).

2. At any time t there may be variation among the linguistic systems of different mature speakers in the population. This is the synchronic variation (across space) at time t.

3. Learning is the process by which language is transmitted to new speakers — children and others who enter the population by birth or recent migration. This gives rise to evolutionary dynamics at the population level. Thus, we relate learning at the individual level to change and evolution in the linguistic characteristics of the population over generational time.

It is difficult to reason about the relationship between (1), (2), and (3) by verbal arguments alone. For this reason, a number of computational and mathematical models were considered. I do not posit any single model of language evolution. Rather, my goal here has been to lay the foundations for a framework within which a number of different models may be constructed to understand different aspects of the phenomena at hand.

14.1 A Summary of the Major Insights

Let us now review my major results to highlight some of the important insights that have been obtained. It is worthwhile to emphasize that many of these insights are only obtained by the construction of a mathematical model that brings questions and issues into sharper focus than before.

14.1.1 Learning and Evolution

Learning at the individual level and evolution at the population level are related. Furthermore, we see that different learning algorithms may have different evolutionary consequences. Thus every theory of language acquisition also makes predictions about the nature of language change. Such theories may therefore be tested not only against developmental psycholinguistic data but also against historical and evolutionary data. Over the course of this book, I have explored many different learning algorithms and worked out their evolutionary consequences. In Chapter 5, I developed a basic understanding of two different linguistic systems in competition with each other. Three learning algorithms were studied: a memoryless learner (the TLA), a batch learner, and an asymmetric cuebased learner. These three had different evolutionary dynamics. Not only were the precise equations for the dynamical system different, their qualitative behaviors were also different. The memoryless learner gave rise to a single stable attractor to which the population converged from all initial conditions. The batch learner gave rise to a bistable situation with two stable attractors. The cue-based learner had two different regimes — one with a single stable attractor and one with two stable attractors. Each of these three learning algorithms satisfies the learnability criterion. In a homogeneous setting with a single target grammar, the learning algorithm identifies the target in the limit as the data goes to infinity. Yet, they have different evolutionary properties. This theme repeated itself across many chapters. In Chapter 7 on Portuguese, three different algorithms were studied. These were the Galves Batch Algorithm, the Batch Subset Algorithm, and the Online Learning Algorithm (TLA). Different evolutionary dynamics were obtained. In Chapter 8 on Chinese, four different cases were examined. We see that there is a difference depending upon whether learners make a categorical decision or a blending decision in phonetic/phonological acquisition. We have also seen the role of critical age periods (the maturation parameter) in learning and evolution. If the learning stops and the mature


language crystallizes after a number n of examples have been received, we see that the evolutionary dynamics are characterized by degree n polynomial maps. These polynomial maps were encountered in a number of different settings in Chapters 5 through 8, and 13. Although such high-degree polynomial maps may have complicated behavior in general, in the particular case of language, they operate in bounded parameter regimes. Thus, though bifurcations typically arise, chaos typically does not. In particular, we see that there is a qualitative difference in evolutionary behavior between smalln and large-n settings. In Chapter 12, we uncovered a similar relationship between evolvability and learning fidelity. Learning fidelity is a measure of how well the learner learns a potential target language, and this is naturally related to the amount of linguistic data it receives over the learning period. Thus, a change in the developmental lifecycle of protohumans (so that a longer “gestation” period was available for language learning) could result in a qualitatively different evolutionary pattern for the emergence of communal languages. Finally, we see the differences between learning algorithms that learn from the input provided by a single individual (parent, teacher, or caretaker) versus algorithms that learn from the input provided by the community at large. This theme was explored in Chapter 8 (Chinese), Chapter 9 (models of cultural evolution), and most interestingly in Chapters 12 and 13. I comment on the findings of Chapters 12 and 13 shortly. Throughout, I have considered examples of language change to ground abstract discussion in linguistic reality and to provide some sense to the reader as to how linguistic data may be engaged using the approaches described here. To this effect, I discussed phonological change in Chinese and Portuguese and syntactic change in French and English. An assortment of other changes are scattered across the book to motivate the discussion. Much of learning theory in language acquisition is developed in the context of the classical Chomskyan idealization of an “ideal speaker hearer in a homogeneous linguistic environment”. As a result we typically assume that there is a target grammar that the learner tries to reach. This book has dropped the homogeneous assumption and concentrated on analyzing the implications of learning theory in a heterogeneous population with linguistic variation. Learning theory has not been systematically developed in such a context before.

14.1.2 Bifurcations in the History of Language

A major insight that emerges from the analytic treatment pursued over the preceding chapters has to do with the existence of bifurcations (phase transitions) in the dynamics of language evolution. These bifurcations may be viewed as appropriate explanatory constructs to account for major transitions in language. Throughout this book, I have derived the dynamics of linguistic populations under a variety of assumptions. Again and again, we notice that (a) the dynamics is typically nonlinear, and (b) there are bifurcations and these may be interpretable in linguistic terms as the change of language from one seemingly stable mode to another. We have encountered numerous such examples of bifurcations in this book. In Chapter 5, I considered models with two languages in competition. For a TLA-based learner for learning in the Principles and Parameters model, we noted that the equilibrium state depended upon the relationship of a with b where a and b are the frequencies with which ambiguous forms are generated by speakers of each of the two languages in question. If a = b, then no evolutionary change is possible. If a < b, then one of the two languages is stable; the other is unstable. For a > b the reverse is true. We thus saw that it was possible for a language to go from a stable to an unstable state because of a change in the frequencies with which expressions are produced. In a cue-based model of learning, also discussed in that chapter, we saw that there was a bifurcation from a regime with two stable equilibria to one with only a single stable equilibrium as k (the number of learning samples) and p (the cue frequency) varied as a function of each other. For models inspired by change in European Portuguese or those inspired by phonological change in Chinese, similar bifurcations arose. In Chapters 12 and 13 I studied the emergence of grammatical coherence. We saw that there was a bifurcation point below which the only stable solution was the uniform solution where all languages are equally represented in the population. Above this bifurcation point, one-language solutions were seen to emerge. These correspond to coherent states where the entire population has converged to the same language. This bifurcation point was related to the learning fidelity of individual learners. These results provide some understanding of how a major transition in the linguistic behavior of a community may come about as a result of a minor drift in usage frequencies, provided those usage frequencies cross a critical threshold. The usage frequencies may drift from one generation to the next but the underlying linguistic systems may remain stable. However, if these usage frequencies cross a threshold, then rapid change may come


about. Thus a novel solution to the actuation problem (the problem of what initiates language change) is posited.

14.1.3 Natural Selection and the Emergence of Language

In Part IV of this book, I shed some light on the complex nature of the relationship between communicative efficiency and fitness, social connectivity, learnability, and the emergence of shared linguistic systems. For example, I study the emergence of grammatical coherence, i.e., a shared language of the community in the absence of any centralized agent that enforces such coherence. Two different models are considered in Chapters 12 and 13. In one, children learn from their parents alone. In the other, they learn from the entire community at large. In both models, it is found that coherence emerges only if the learning fidelity is high, i.e., for every possible target grammar g, the learner will learn it with high confidence (with probability > γ). In the light of the discussion on learnability, we see that the complexity of the class of possible grammars H, the size of the learning set n, and the confidence γ are all related. For a fixed n, if γ is to be large, then H must be small. Thus in addition to the traditional learning-theoretic considerations, we see that there may be evolutionary constraints on the complexity of H — the class of Universal Grammar. In order to stably maintain a shared language in a community, the class of possible languages must be restricted, and something like Universal Grammar must be true. A second insight emerges from considering the difference in the two models of Chapters 12 and 13. We see that if one learns from parents alone, then natural selection based on communicative fitness is necessary for the emergence of a shared linguistic system. On the other hand, if one learns from the community at large, then natural selection is not necessary. In human societies, the social connectivity pattern ensures that each individual child receives linguistic input from multiple people in the community. In such societies, it is therefore not necessary to postulate mechanisms of natural selection for the emergence of language. On the other hand, in those kinds of animal societies where learning occurs in the “nesting phase” with input primarily from one teacher, one may need to invoke considerations of natural selection. This is the case for some birdsong communities, for example.

14.2

Future Directions

It is clear that a number of simplifying idealizations had to be made to obtain the results and insights discussed above. There are a number of issues that remain poorly understood and a number of directions in which the models could be taken toward greater realism. I discuss these below.

1. In the models I have considered, there have usually been two settings: either children learn from equal exposure to all members of the population (most of the models in this book), or children learn from individual members alone (parents or randomly selected teachers, as in Chapter 12). In many cases of interest, there is a more complicated social structure that is best expressed as a network of influences on the learning child. In that case, the population ought to be modeled as a graph in which each node denotes a location or individual and the edges denote the influences that individuals (or locations) have on each other. In this book, I have essentially considered complete graphs and degenerate graphs with no edges at all. Dynamics on locally connected graphs were briefly explored in Chapter 10 and again in Chapter 13, but much more remains to be done; a better understanding is ultimately required of the effect that network topology has on the dynamics of language evolution (a toy illustration is sketched after this list). This direction will tie together the recent interest in social networks (see, for example, Strogatz 2003, or Newman, Barabasi, and Watts 2003) with our interest in language evolution.

2. Two other aspects of population structure were briefly examined in Chapter 10. The first is that population sizes are finite. This yields a stochastic process rather than a dynamical system, and the relationship between the stationary distribution of this process and the attractors of the dynamical system needs to be better understood. In particular, how do the bifurcations that occur in the dynamical systems manifest themselves in the corresponding stochastic processes? The second is that most of the models have assumed a generational structure blocked into discrete generations, with dependence only between successive generations. It would be of interest to know the effect of more complicated generational structures on learning and evolution.

3. In much of my analysis, I have assumed that the grammatical system of a human may be characterized by a single computational system (in the sense of Church-Turing). This corresponds to a speaker-hearer being monolingual in the traditional sense of generative linguistics. In the presence of variation, children might become effectively bilingual (or multilingual). The effect of this was briefly studied in a few chapters. For example, in Chapter 8, on phonological change in Chinese, I considered both bilingual and monolingual models of learning and observed that they could have different evolutionary consequences. Again, in Chapter 10, some preliminary models of bilingual learning were discussed.

4. Closely related to the previous point is the relationship between first- and second-language learning and their relative effects on language change. In the models of this book, I have made no distinction between these two possibly different modes of learning, their interrelationships, and their evolutionary consequences. One could consider learning models with two phases, corresponding to the acquisition of a native language and its characteristics followed by a later period of second-language acquisition. Such models would have particular relevance to the analysis of the linguistic adaptation of immigrants in a new society.

5. The general framework presented here makes no commitment to the details of linguistic theory or learning theory. Every choice of H (the class of human grammatical systems) and A (the learning algorithm used by children) gives rise to a potentially different population dynamics. Many of the examples were worked out for grammars in the Principles and Parameters tradition and appropriate learning algorithms in that context. It is possible to provide an analogous development in which the entire analysis is carried out within an alternative framework such as Optimality Theory. I leave such alternative analyses to future work.

6. The role of fitness and natural selection is poorly understood at present. I presented an analysis in Chapters 11 and 12, but many problems remain. First, in what sense does knowing a language confer fitness upon the individual who knows it? The particular characterization of communicative efficiency as a proxy for fitness is motivated heavily by an information-theoretic view that takes the primary purpose of language to be communication. But it may well be that communicative ability is an accidental side effect of linguistic competence, and that the true function of language is that it confers the ability for thought and reasoning. In that case, communicative success might have little to do with language, and the evolutionary pressures would be very different from those developed here. An empirical study of the structure of lexical items already suggests that, in the narrow sense of communicative efficiency developed in this book, languages do not seem to be optimized for communication. However, much more remains to be done here.

7. In this book, I have failed to provide any genuine insight into the evolution of novel structures; much of my analysis has focused on competition between existing structures. I have always assumed a preexisting class H of possible communication systems, and this determines the state space of the dynamical system. The sense in which novelty arises in the framework is the following: we can consider initial conditions of the population that lie in a region of the state space containing many unattested communication systems, and the dynamical process might unfold in such a way that previously unattested systems come into existence. We have seen how bifurcations may provide a mechanism for this. However, these unattested systems were possible by construction of the evolutionary process itself. In contrast, we might ask: how does one formulate an evolutionary process in which the state space itself is modified, so that genuinely new structures are seen to evolve? For example, an important question from a linguistic point of view is: how did recursion evolve? A high-level answer would claim that recursion evolves in stages. In the first stage, there are systems with nonrecursive rules defined over primitive objects; primitive communication systems that map symbol to object, like certain kinds of alarm-call systems in animal communication, possibly belong to this type. In the second stage, categories (groups of primitive objects) evolve. In the final stage, rules are defined over categories, and once rules over categories exist, recursion follows almost immediately. A cogent formulation of this general evolutionary sequence is an important problem for future work.

8. Part IV of this book deals with the conditions necessary for the emergence of coherence. We see that for linguistic coherence to emerge, learning fidelity must be high, and for learning fidelity to be high, H needs to be a highly restricted family (in our example model, the cardinality n must be low). So there is clearly some evolutionary pressure for H to be a small set. Why, then, doesn't H shrink further and further until it becomes a singleton set? In that case, H would consist of only one language, eliminating the need for any learning and the possibility of any linguistic variation in the population. We have identified pressures that force H to be small, from both learning-theoretic and evolutionary considerations. What are the pressures that force H to be large? A plausible argument is that if H consists of only one language, then the details of this language must be encoded in the genome and would take up a great deal of space. A large H therefore corresponds to smaller requirements on the genetic code and, in some sense, greater flexibility. Clarifying these issues in a quantitative way presents another important direction for future work.

9. I have discussed human language and its origins in some detail over the course of this book, but the framework applies in general to the evolution of any communication system that is acquired by learning. In the context of the evolutionary origin of human language, relevant data are difficult to come by. It may therefore be more profitable to explore some of the natural evolutionary questions in the context of animal communication, where data can be collected more easily. Specialized communication systems are observed in a large number of animal species. A familiar example is the bee dance, which bees use to communicate the location of a food source to other bees in their hive. The dance follows a figure eight and has three parameters: (i) the principal axis points toward the direction of the food; (ii) the time taken to traverse the figure represents the distance to the source; (iii) the vigor of the dance represents the amount of food present. Can one provide a coherent account of how such a specialized communication system arose? Similarly, can one provide accounts of the evolution of song in birds, or of communication signals in bats? These present significant challenges for the future.
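As promised in item 1, here is a toy illustration (my own construction, not a model from the book) of how network topology can matter. Each node holds one of two grammars; in every generation each node is replaced by a learner that adopts the majority grammar in its neighborhood (including the node itself). On a completely connected population the dynamics snap to a single shared grammar, whereas on a ring with only local connections stable dialect-like regions can persist.

```python
import random

def step(grammars, neighbors):
    """One synchronous generation: each node adopts the majority grammar
    among its neighbors (its own grammar included; ties broken at random)."""
    new = []
    for i in range(len(grammars)):
        ones = sum(grammars[j] for j in neighbors[i])
        if 2 * ones > len(neighbors[i]):
            new.append(1)
        elif 2 * ones < len(neighbors[i]):
            new.append(0)
        else:
            new.append(random.randint(0, 1))
    return new

def ring(n, k=2):
    """Each node is influenced by itself and its k nearest neighbors on each side."""
    return [[(i + d) % n for d in range(-k, k + 1)] for i in range(n)]

def complete(n):
    """Each node is influenced by the entire population."""
    return [list(range(n)) for _ in range(n)]

random.seed(1)
n = 201
init = [random.randint(0, 1) for _ in range(n)]
for name, nbrs in [("complete graph", complete(n)), ("ring (local)", ring(n))]:
    g = list(init)
    for _ in range(100):
        g = step(g, nbrs)
    print(name, "-> fraction using grammar 1:", round(sum(g) / n, 2))
```

Even this crude update rule exhibits the qualitative contrast at stake in item 1: global mixing yields a single shared grammar, while purely local connectivity can freeze the population into coexisting regions.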

14.2.1

Empirical Validation

While model construction allows us to achieve a sharper understanding of the questions and issues in language evolution, ultimately these models remain speculative unless they are used to engage empirical facts in a productive way. The empirical validation of these models remains an important ongoing direction for future work.

It must be understood that in a historical discipline like evolution, controlled experiments are difficult to design, so the kind of success we can hope for is not the sort seen in certain areas of physics or engineering. Rather, we hope for the sort of understanding that mathematical models provide in evolutionary biology: the ability to separate plausible from implausible theories, to sort out inconsistencies in reasoning, and in general to obtain a deeper qualitative understanding of the phenomena. At the same time, it is worth noting that even in many areas of physics, a numerical match to the data is difficult. Nevertheless, quantitative models are developed, and they play an important role in qualitative understanding. Fluid mechanics and turbulence are a good example: it is impossible to predict the motion of a leaf on a stormy day, yet dynamical systems models of the general properties of fluid flow have been developed. Cosmology, pattern formation (e.g., sandpiles), soft condensed matter physics, and so on provide further examples.

As a concrete empirical phenomenon treated within the mathematical framework of this book, recall the case of French. We saw that there was a syntactic change leading to the loss of V2 and the loss of pro-drop in the grammar of French. One may model this change under different assumptions: (a) speakers are bilingual and there is grammatical variation within each speaker; (b) speakers are monolingual and the locus of variation is at the level of the population; (c) there is linkage among the grammatical parameters so that they may not be treated as independent, i.e., the nature of the parameterization matters; and so on. Depending on the precise assumptions, several different kinds of evolutionary models were explored in some detail in Chapters 6 and 10. Most interestingly, we found that not all models were consistent with the historically observed trends. From this kind of analysis we were led to conclude that (i) V2 must have been lost before pro-drop was lost, and (ii) the loss of V2 was triggered by an increase in the frequency with which pronominal subjects were used. Both statements therefore count as predictions that are independently verifiable by further data collection. Similar examples of syntactic and morphological change were provided for English (Chapter 9) and Portuguese (Chapter 7), respectively. I worked through different kinds of models for each of those cases; these models allowed us to check our intuitions about the forces that led to the change and to reason more generally about the interplay between learning and change.

Another empirically motivated study is the case of phonological change in the Wu dialect considered in Chapter 8. A phonological merger between a diphthong and a monophthong is attested in the Wu dialect spoken in Wenzhou, as a result of which several pairs of rhyming words became homophonous. A number of questions can then be asked. What initiated the change? Why did the change go to completion? Under what conditions do we expect competing forms to coexist in the population? What is the possible effect on the population dynamics of noise in the perception and production of speech? In order to reason about these questions, I considered several different models for the acquisition of phonological forms. We saw that if individuals treated phonological forms in a categorical way, then the equilibrium states of the population corresponded to pure states in which all members used the same form. In contrast, if individuals used both forms, then the equilibrium states were not pure and there was no tendency for the language to change or for a change to go to completion. Thus different hypotheses about individual behavior were separated, and their predictions about individual behavior may be empirically falsified. The bifurcations that arise in such models were again seen as suitable constructs to account for the actuation problem.

In Chapter 11, I discussed a formulation of communicative efficiency based on the notion of information transfer from speaker to hearer. Using a variant of this formulation, we saw that the lexicon of English is not optimally adapted to the perceptual limitations of humans; in particular, phonetic distinctions that are hard to make appear to bear a great functional load. This provided an immediate empirical challenge to the notion that language evolves in the direction of greater communicative efficiency.

I hope that these examples give the reader a sense of how one may engage empirical facts. But this has been a largely theoretical book, and much more remains to be done. Historical linguistics is filled with examples of language change and with competing explanations that account for them. I provided some illustrative examples in Chapter 1, and many more empirically grounded studies need to be performed before the approach embodied in this book realizes its full potential.

The empirical validation of models for the origin of language presents an even greater challenge. At present, there is very little data that points directly to the different stages in the origin of human language from prelinguistic versions of it. However, there is much more data on other kinds of natural communication systems found in the animal world (see Hauser 1997 for an account). Perhaps the most empirically fruitful direction of future work would be to sharpen the questions and models of evolutionary thinking by considering case studies of the origin of such communication systems.
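The contrast between the two hypotheses about individual behavior can be made concrete with a toy calculation (an illustration of the distinction, not the Chapter 8 model itself). Let x be the population frequency of one of the two competing phonological variants. A categorical learner samples n tokens and adopts whichever variant it heard more often, so the frequency is driven toward the pure states 0 or 1; a probability-matching learner reproduces the input frequency, so nothing moves and the change never goes to completion.

```python
from math import comb

def categorical_update(x, n=11):
    """Fraction of next-generation speakers who adopt variant A when each
    learner hears n tokens and categorically adopts the majority variant
    (n odd, so there are no ties)."""
    return sum(comb(n, k) * x**k * (1 - x)**(n - k) for k in range(n // 2 + 1, n + 1))

def matching_update(x, n=11):
    """A probability-matching learner uses variant A with probability equal
    to its frequency in the input, so the expected frequency is unchanged."""
    return x

for update in (categorical_update, matching_update):
    x = 0.6
    for _ in range(30):
        x = update(x)
    print(update.__name__, "-> frequency after 30 generations:", round(x, 3))
# categorical_update -> close to 1.0 (the change goes to completion)
# matching_update    -> 0.6         (stable variation, no completion)
```

Which behavioral hypothesis is closer to the truth is exactly the kind of question that data such as the Wenzhou case can, in principle, help decide.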


14.2.2


Connections to Other Disciplines

A central theme of this book is the relationship between local interactions at the individual level and global behavior at the population level. This theme arises in a number of different disciplines, and obvious connections emerge.

I have noted at many points the analogies between evolution in linguistic populations and evolution in biological populations. Grammars, like genes, are formal and discrete. Unlike genes, however, grammars are transmitted from one generation to the next via learning, and the evolutionary dynamics that results from this different transmission law has been the central concern of this book. Nevertheless, there are strong synergies between the methods and concerns of evolutionary biology and those of evolutionary linguistics. At some point in the future, when the biological basis of grammars is understood, perhaps the two disciplines will be unified more deeply and it will make sense to study gene-grammar coevolution in human populations.

Statistical physics has long concerned itself with deriving the characteristics of an ensemble from the statistics of individual particles. There is particular interest in the existence and nature of phase transitions, which play an important role in describing the behavior of materials at different temperatures. I considered the analogies between language evolution and the Ising systems of statistical physics in Chapter 13. In the Ising model, each particle has a spin that takes one of two values. The particles are identified with the vertices of a graph, and the connectivity of the graph determines how the particles interact. The temperature T controls the strength of the interaction, and there is a phase transition in the equilibrium spin distribution as the temperature changes smoothly: above a critical temperature Tc, in the high thermal noise regime, the only equilibrium distribution is one in which, on average, as many spins are up as down; below the critical temperature, spontaneous spin alignment may occur. This phase transition is analogous to the bifurcations we observed in the models of language evolution in Chapter 13. Linguistic agents are like particles, languages are like spins, and learning fidelity (quantified by the parameters a and k) is like temperature. A bifurcation led to the transition from incoherent regimes, in which the only stable mode was the uniform mode, to coherent regimes in which "alignment of languages" occurred, so that a majority of the population spoke a shared language.

In this book, I have analyzed the interplay between individual behavior and the ensemble by taking the continuum limit of individual agents, deriving nonlinear dynamical systems, and invoking the theory of bifurcations in nonlinear dynamics. It is my strong intuition that an alternative development is possible using the techniques of percolation theory and statistical mechanics, leading to a characterization of language change in terms of phase transitions. I leave this as a possibility for the future.

Human linguistic behavior is grounded in our biology, yet it is closely linked to our cultural identity. Consequently, language evolution may also be viewed as a particular instantiation of cultural evolution; indeed, many theories of cultural evolution (notably those of Cavalli-Sforza and Feldman) have taken this viewpoint. I considered the relationship between the Cavalli-Sforza and Feldman theory and that of Niyogi and Berwick in Chapter 9. It is possible that some of the tools and techniques described here could also serve theories of cultural evolution in other domains. Developments affecting music, writing, or games are particularly good examples, since learning plays an important role in all three. However, the form that learning theory must take to best describe the transmission process in each of these domains is probably closer to teaching than to the kind of unconscious learning that takes place in language, and more fundamental work will be needed to build on these connections profitably. The game-theoretic analysis of evolutionary processes implicit in Boyd and Richerson 1985 may also be used to model the evolution of language; it would be interesting to see theories of language evolution developed in those terms.

Other areas of the social sciences with which synergies exist are economics and psychology. In economics, there is a long tradition of understanding the macroscopic behavior of economic systems in terms of the behavior of individual economic agents; social choice theory, welfare economics, and evolutionary economics come to mind. A recent and unusual exploration of the interface between economics and linguistics may be found in Rubinstein 2000. Evolutionary thinking is also beginning to play a role in psychology with the development of evolutionary psychology, in which the evolution of cognitive traits is studied. Since language surely counts as an important cognitive trait, the study of language evolution embodied in this book may be viewed as a contribution toward an understanding of the principles of evolutionary psychology in a concrete and empirically verifiable context.

There are many close ties to computer science and artificial intelligence. A grammar is a computational system and learning is an algorithmic process, so the heart of my approach is grounded in the techniques of computer science. This book may be viewed as a contribution to mathematical and computational linguistics, a subdiscipline of artificial intelligence.
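To give the Ising analogy above a concrete handle, the mean-field version of the physical model reduces to a single self-consistency equation, m = tanh(m/T), where m is the average spin (read: the degree to which the population has aligned on one language) and T plays the role of transmission noise. Iterating this equation, a textbook statistical-physics calculation rather than the Chapter 13 model itself, exhibits the phase transition at T = 1: above it the only solution is m = 0, the incoherent state, while below it a nonzero m appears, the analogue of a shared language.

```python
import math

def mean_field_magnetization(T, iterations=2000):
    """Iterate the mean-field self-consistency equation m = tanh(m / T),
    starting from a slightly aligned state."""
    m = 0.5
    for _ in range(iterations):
        m = math.tanh(m / T)
    return m

for T in (2.0, 1.2, 0.9, 0.7, 0.5):
    print(f"T = {T}: |m| -> {abs(mean_field_magnetization(T)):.3f}")
# Above the critical temperature (T > 1) the magnetization decays to zero;
# below it (T < 1) a nonzero, "aligned" solution appears.
```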


These connections are obvious and need no further elaboration. It is worthwhile, however, to mention some other areas of artificial intelligence that have points of contact with the ideas in this book. The areas of artificial life and multi-agent systems deal with the behavior of groups of intelligent agents, with a particular focus on emergent phenomena in those communities. This work ranges from market-based models of computational economies (see Boutilier, Shoham, and Wellman 1997) to herding and flocking behavior in robot communities (Mataric 1998) to coordination in multi-agent systems (Shoham and Tennenholtz 1997; Tohme and Sandholm 1999). Part IV of this book focuses on emergent phenomena in a population of linguistic agents, and the results there may also be viewed as a theoretical contribution to artificial life. The artificial life approach to language evolution is most strongly represented in the work of Luc Steels.

14.3

A Concluding Thought

I have attempted here to provide an evolutionary perspective on language: on how and why languages change, why they take the form they do, and how the structure, acquisition, and evolution of language are all interdependent. In the understanding of social and natural phenomena, a historical perspective often provides a deeper appreciation of why things are the way they are. In this context, it is worth recalling Dobzhansky's famous remark:

Nothing in biology makes sense except in the light of evolution.

In the natural world, language and communication are grounded in the biology of living organisms. I hope that such an evolutionary perspective will provide a richer understanding of the fundamental nature of human language, and more generally of communication in humans, animals, and machines.

Bibliography

[1] N. Abe and M. Warmuth. On the computational complexity of approximating distributions by probabilistic automata. In Proceedings of Computational Learning Theory, 1992. [2] D. M. Abrams and S. H. Strogatz. Modelling the dynamics of language death. Nature, 424(900), 2003. [3] J. Aitchison. The Seeds of Speech: Language Origin and Evolution. Cambridge: Cambridge University Press, 2000. [4] G. Altmann, H. van Buttlar, W. Rott, and U. Strauss. A law of change in language. In B. Brainerd, editor, Historical Linguistics, pages 104–115. Bochum: Studienverlag Dr. N. Brockmeyer, 1983. [5] D. Angluin and M. Kharitonov. When won't membership queries help. Journal of Computer and System Sciences, 50:336–355, 1995. [6] Dana Angluin. Inductive inference of formal languages from positive data. Information and Control, 45:117–135, 1980a. [7] Dana Angluin. Finding patterns common to a set of strings. Journal of Computer and System Sciences, 21:46–62, 1980b. [8] Dana Angluin. Learning regular sets from queries and counterexamples. Information and Computation, 75:87–106, 1987. [9] Dana Angluin. Identifying languages from stochastic examples. Technical Report YALEU/DCS/RR-614, New Haven, CT: Yale University, March 1988. [10] A. Ariew and R. Lewontin. The confusions of fitness. British Journal for the Philosophy of Science, 55:347–363, 2004.


[11] Martin Atkinson. Children’s Syntax: Introduction to Principles and Parameters Theory. Oxford: Blackwell, 1992. [12] R. Axelrod. The Evolution of Cooperation. New York: Basic Books, 1984. [13] Charles-James Bailey. Variation and Linguistic Theory. Washington, DC: Center for Applied Linguistics, 1973. [14] J. Batali. Computational simulations of the emergence of grammar. In J. R. Hurford, M. Studdert-Kennedy, and C. Knight, editors, Approaches to the Evolution of Language: Social and Cognitive Bases, pages 405–426. Cambridge: Cambridge University Press, 1998. [15] L. Bauer. Watching English Change. Longman, 1994. [16] Stefano Bertolo. Language Acquisition and Learnability. Cambridge: Cambridge University Press, 2001. [17] R. C. Berwick. The Acquisition of Syntactic Knowledge. Cambridge, MA: MIT Press, 1985. [18] M. Blum and L. Blum. Towards a mathematical theory of inductive inference. Information and Control, 28:125–155, 1975. [19] A. Blumer, A. Ehrenfeucht, D. Haussler, and M. Warmuth. Learnability and the Vapnik-Chervonenkis dimension. Journal of the ACM, 36(4):929–965, 1989. [20] C. Boutilier, Y. Shoham, and M. Wellman. Economic principles of multi-agent systems. Artificial Intelligence Journal, 94(1-2):1–6, 1997. [21] R. Boyd and P. Richerson. Culture and the Evolutionary Process. Chicago: University of Chicago Press, 1985. [22] L. E. Breivik and E. H. Jahr. Language Change: Contributions to the Study of Its Causes. New York: Mouton de Gruyter, 1989. [23] M. Brent. Surface cues and robust inference as a basis for the early acquisition of subcategorization frames. In L. Gleitman and B. Landau, editors, The Acquisition of the Lexicon, pages 433–470. Cambridge, MA: MIT Press, 1994. [24] M. Brent. An efficient, probabilistically sound algorithm for segmentation and word discovery. Machine Learning, 45:71–105, 1999.


[25] M. Brent and T. Cartwright. Distributional regularity and phonotactics are useful for segmentation. Cognition, 61:93–125, 1996. [26] Joan Bresnan. Lexical-Functional Syntax. Oxford: Blackwell, 2001. [27] E. J. Briscoe. Grammatical acquisition: Inductive bias and coevolution of language and the language acquisition device. Language, 76(2):245– 296, 2000. [28] Helena Britto, Charlotte Galves, Ilza Ribeiro, Marina Augusto, and Ana Paula Scher. Morphological annotation system for automatic tagging of eletronic textual corpora: From English to Romance languages. In Oriente, editor, Anais do VI Simposio Internacional de Comunicacion Social, pages 582–589. Santiago de Cuba, 1999. [29] R. Brown and C. Hanlon. Derivational complexity and order of acquisition in child speech. In J. R. Hayes, editor, Cognition and the Development of Language. New York: Wiley, 1970. [30] R. Bush and F. Mosteller. Stochastic Models for Learning. New York: Wiley, 1955. [31] A. Cangelosi and D. Parisi. Simulating the Evolution of Language. Berlin: Springer-Verlag, 2002. [32] J. Case and C. Smith. Comparison of identification criteria for machine inductive inference. Theoretical Computer Science, 25:193–220, 1983. [33] M. Cassandro, P. Collet, A. Galves, and C. Galves. A statistical physics approach to language acquisition and language change. Physica A, 263:427–437, 1999. [34] L. Cavalli-Sforza. Genes, Peoples, and Languages. University of California Press, 2001. [35] L. L. Cavalli-Sforza and M. W. Feldman. Cultural Transmission and Evolution: A Quantitative Approach. Princeton, NJ: Princeton University Press, 1981. [36] Eugene Charniak. Statistical Language Learning. Cambridge, MA: MIT Press, 1993. [37] D. Cheney and R. M. Seyfarth. How Monkeys See the World: Inside the Mind of Another Species. Chicago: University of Chicago Press, 1990.


[38] N. Chomsky and M. Schutzenberger. The algebraic theory of contextfree languages. In P. Braffort and D. Hirschberg, editors, Computer Programming and Formal Languages, pages 118–161. Amsterdam: North Holland, 1963. [39] Noam Chomsky. Aspects of the Theory of Syntax. Cambridge, MA: MIT Press, 1965. [40] Noam Chomsky. Lectures on Government and Binding. Dordrecht: Foris, 1981. [41] Noam Chomsky. Knowledge of Language: Its Nature, Origin and Use. New York: Praeger, 1986. [42] Noam Chomsky. The Minimalist Program. Cambridge, MA: MIT Press, 1995. [43] D. Christian, W. Wolfram, and N. Bube. Variation and Change in Geographically Isolated Communities: Appalachian English and Ozark English. Birmingham: University of Alabama Press, 1988. [44] M. Christiansen and S. Kirby. Language Evolution. Oxford: Oxford University Press, 2003. [45] R. Clark and I. Roberts. A computational model of language learnability and language change. Linguistic Inquiry, 24:299–345, 1993. [46] Stephen Crain. Language acquisition in the absence of experience. Brain and Behavioral Sciences, pages 597–612, 1991. [47] Stephen Crain and Rosalind Thornton. Investigations in Universal Grammar: A Guide to Experiments in the Acquisition of Syntax and Semantics. Cambridge, MA: MIT Press, 1998. [48] W. Croft. Explaining Language Change: An Evolutionary Approach. Harlow, Essex: Longman, 2000. [49] J. F. Crow and M. Kimura. An Introduction to Population Genetics Theory. Minneapolis: Burgess, 1970. [50] F. Cucker, S. Smale, and D-X. Zhou. Modeling language evolution. Foundations of Computational Mathematics, 4(3), 2004.


[51] Walter Daelemans. Abstraction considered harmful: Lazy learning of language processing. In Proceedings of the 6th Belgian-Dutch Conference on Machine Learning, pages 3–12, 1996. [52] Carl de Marcken. Parsing the LOB corpus. In Proceedings of the Meeting of the Association for Computational Linguistics, pages 243– 251, 1990. [53] Carl de Marcken. Unsupervised Language Acquisition. PhD thesis, MIT, Cambridge, MA, 1996. [54] Ferdinand de Saussure. Course in General Linguistics. 1916. [55] Ferdinand de Saussure. Course in general linguistics. In C. Bally, A. Sechehaye, and R. Harris, editors, Course in General Linguistics (English Translation). London: Duckworth, 1983. [56] M. DeGraff. Language Creation and Language Change. Cambridge, MA: MIT Press, 2001. [57] M. Demetras, K. Post, and C. Snow. Feedback to first language learners: The role of repetitions and clarification questions. Journal of Child Language, 13:275–292, 1986. [58] B. E. Dresher and J. D. Kaye. A computational learning model for metrical phonology. Cognition, 34:137–195, 1990. [59] M. Eigen and P. Schuster. The Hypercycle: A Principle of Natural Self Organization. Berlin: Springer, 1994. [60] J. Elman, E. Bates, M. Johnson, A. Karmiloff-Smith, Domenico Parisi, and Kim Plunkett. Rethinking Innateness. Cambridge, MA: MIT Press, 1996. [61] Jerome A. Feldman. Some decidability results on grammatical inference and complexity. Information and Control, 20(3):244–262, 1972. [62] Jerome A. Feldman, George Lakoff, David Bailey, Srini Narayanan, Terry Regier, and Andreas Stolcke. L0—the first five years of an automated language acquisition project. Artificial Intelligence Review, 10:103–129, 1996. [63] C. Fellbaum. WordNet: An Electronic Lexical Database. Cambridge, MA: MIT Press/Bradford Books, 1998.


[64] Roberto Fernandez and Antonio Galves. Identifying features in the presence of competing evidence: The case of first-language acquisition (preprint). 1999. [65] Roberto Fernandez and Antonio Galves. Identifying features in the presence of competing evidence: The case of first language acquisition. In Dynamical Systems: From Crystal to Chaos, pages 52–62. World Scientific, 2000. [66] Ronald Fisher. The Genetical Theory of Natural Selection. Oxford: Oxford University Press, 1930. [67] J. D. Fodor. How to obey the subset principle: Binding and locality. In B. Lust, G. Hermon, and J. Kornfilt, editors, Syntactic Theory and First Language Acquisition: Cross-linguistic Perspectives, Vol. 2: Binding, Dependencies and Learnability. Hillsdale, NJ: Erlbaum, 1994. [68] J. D. Fodor. Unambiguous triggers. Linguistic Inquiry, 29:1–36, 1998. [69] R. Frank and S. Kapur. On the use of triggers in parameter setting. Technical Report IRCS-92-52, Philadelphia: University of Pennsylvania Institute for Research in Cognitive Science, February 1993. [70] A. Galves and C. Galves. A case study of prosody driven language change. Technical report, Sao Paulo, Brazil: UNICAMP-University of Sao Paulo, 1995. [71] W. I. Gasarch and C. H. Smith. Learning via queries. Journal of the ACM, 39:649–674, 1992. [72] E. Gibson and K. Wexler. Triggers. Linguistic Inquiry, 25:407–454, 1994. [73] L. R. Gleitman and B. Landau. Acquisition of the Lexicon. Cambridge, MA: MIT Press, 1994. [74] E. Mark Gold. Language identification in the limit. Information and Control, 10(5):447–474, 1967. [75] E. Mark Gold. Complexity of automaton identification from given data. Information and Control, 37:302–320, 1978. [76] John Goldsmith. Autosegmental Phonology. PhD thesis, MIT, Cambridge, MA, 1976.


[77] John Goldsmith. Autosegmental and Metrical Phonology. Oxford: Blackwell, 1990. [78] John Goldsmith. Unsupervised learning of the morphology of a natural language. Computational Linguistics, 27(2):153–198, 2001. [79] A. Gorin, S. Levinson, and A. Gertner. Adaptive acquisition of language. In Proceedings of International Conference on Acoustics, Speech, and Signal Processing, pages 805–808, 1991. [80] J. Greenberg. Language Universals with Special Reference to Feature Hierarchies. The Hague: Mouton, 1966. [81] J. Greenberg. Language Typology: A Historical and Analytical Overview. The Hague: Mouton, 1974. [82] J. Greenberg, C. Ferguson, and E. Moravcsik. Universals of Language. Stanford, CA: Stanford University Press, 1978. [83] S. Greenberg. The switchboard transcription project. In Proceedings of the Large Vocabulary CSR Summer Research Workshop, Baltimore, USA, April 1996. [84] P. Gupta and D. Touretzky. Connectionist models and linguistic theory: Investigations of stress systems in language. Cognitive Science, 18:1–50, 1994. [85] Liliane Haegeman. Introduction to Government and Binding Theory. Oxford: Blackwell, 1991. [86] M. Halle. Phonology in generative grammar. Word, 18:54–72, 1962. [87] M. Halle and W. J. Idsardi. General properties of stress and metrical structure. In Eric Ristad, editor, DIMACS Workshop on Human Language March 20-22, 1992 (Dimacs Series on Discrete Mathematics and Theoretical Computer Science). American Mathematical Society, 1992. [88] M. Halle and W. J. Idsardi. General properties of metrical structure. In J. Goldsmith, editor, The Handbook of Phonological Theory. Oxford: Blackwell, 1995. [89] M. Halle and J-R. Vergnaud. An Essay on Stress. Cambridge, MA: MIT Press, 1987.


[90] M. Halliday. An Introduction to Functional Grammar. London: Edward Arnold, 1994. [91] H. Hamburger and K. Wexler. A mathematical theory of learning transformational grammar. Journal of Mathematical Psychology, 12:137–177, 1975. [92] Michael A. Harrison. Introduction to Formal Language Theory. Reading, MA: Addison-Wesley, 1978. [93] M. Hauser. The Evolution of Communication. Cambridge, MA: MIT Press, 1997. [94] J. A. Hawkins and M. Gell-Mann. The Evolution of Human Languages (Volume XI of the Santa Fe Institute Studies in the Sciences of Complexity). Reading, MA: Addison-Wesley, 1992. [95] Bruce Hayes. Metrical Stress Theory: Principles and Case Studies. Chicago: University of Chicago Press, 1995. [96] T. P. Hayes and P. Niyogi. A model for language emergence with social learning (preprint). 2005. [97] K. Hirsh-Pasek, R. Treiman, and M. Schneiderman. Brown and Hanlon revisited: Mothers’ sensitivity to ungrammatical forms. Journal of Child Language, 11:81–88, 1984. [98] J. Hofbauer and K. Sigmund. The Theory of Evolution and Dynamical Systems. Cambridge: Cambridge University Press, 1988. [99] J. E. Hopcroft and J. D. Ullman. Introduction to Automata Theory, Languages, and Computation. Reading, MA: Addison-Wesley, 1979. [100] James J. Horning. A study of grammatical inference. PhD thesis, Stanford University, 1969. [101] James Hurford. Biological evolution of the saussurean sign as a component of the language acquisition device. Lingua, 77(2):187–222, 1989. [102] James Hurford and Simon Kirby. Learning, culture and evolution in the origin of linguistic constraints. In P. Husbands and I. Harvey, editors, Fourth European Conference on Artificial Life. Cambridge, MA: MIT Press, 1997.


[103] James Hurford and Simon Kirby. The emergence of linguistic structure: An overview of the iterated learning model. In A. Cangelosi and D. Parisi, editors, Simulating the Evolution of Language. Berlin: Springer-Verlag, 2001. [104] D. L. Isaacson and R. W. Madsen. Markov Chains: Theory and Applications. New York: Wiley, 1976. [105] C. Isbell, C. Shelton, M. Kearns, S. Singh, and P. Stone. A social reinforcement learning agent. In Proceedings of Agents: Montreal, Canada, pages 377–384, 2001. [106] Ray Jackendoff. X-Bar Syntax: A Study of Phrase Structure. Cambridge, MA: MIT Press, 1977. [107] S. Jain, D. Osherson, J. Royer, and A. Sharma. Systems that Learn: An Introduction to Learning Theory, 2nd ed. Cambridge, MA: MIT Press, 1998. [108] L. Jenkins. Biolinguistics: Exploring the Biology of Language. Cambridge: Cambridge University Press, 2001. [109] William Jones. The Collected Works of William Jones in 6 Volumes. London: Robinson and Evans, 1799. [110] A. K. Joshi, L. S. Levy, and M. Takahashi. Tree adjunct grammars. Journal of Computer and System Sciences, 10(1), 1975. [111] L. P. Kadanoff. Statistical Physics: Statics, Dynamics, and Renormalization. World Scientific, 2000. [112] M. Kanazawa. Learnable Classes of Categorial Grammars. Stanford, CA: CSLI Publications, 1998. [113] R. Kaplan and M. Kay. Regular models of phonological rule systems. Computational Linguistics, 20(3):331–379, 1994. [114] Ronald M. Kaplan and Joan Bresnan. Lexical-functional grammar: A formal system for grammatical representation. In Joan Bresnan, editor, The Mental Representation of Grammatical Relations, MIT Press Series on Cognitive Theory and Mental Representation, pages 173–281. Cambridge, MA: MIT Press, 1982.


[115] M. Kearns and L. Valiant. Cryptographic limitations on learning boolean formulae and finite automata. In Proceedings of the 21st Annual ACM STOC., 1989. [116] Michael Kenstowicz. Phonology in Generative Grammar. Oxford: Blackwell, 1994. [117] M. Kimura. The Neutral Theory of Molecular Evolution. Cambridge: Cambridge University Press, 1983. [118] Paul Kiparsky. Phonological Change. PhD thesis, MIT, Cambridge, MA, 1965. [119] Paul Kiparsky. Explanation in Phonology. Dordrecht: Foris, 1982. [120] Simon Kirby. Function Selection and Innateness: The Emergence of Language Universals. Oxford: Oxford University Press, 1999. [121] N. Komarova and P. Niyogi. Optimizing the mutual intelligibility of mutual agents in a shared world. Artificial Intelligence Journal, 154:1– 42, 2004. [122] N. Komarova, P. Niyogi, and M. Nowak. The evolutionary dynamics of grammar acquisition. Journal of Theoretical Biology, 209(1):43–59, 2001. [123] N. Komarova and M. Nowak. Evolutionary dynamics of the lexical matrix. Bulletin of Mathematical Biology, 63(3):451–485, 2001. [124] N. Komarova and M. Nowak. Language dynamics in finite populations. Journal of Theoretical Biology, 221:445–457, 2003. [125] N. L. Komarova and I. Rivin. Harmonic means, random polynomials, and stochastic matrices. Advances in Applied Mathematics, 31(2):501– 526, 2003. [126] Anthony Kroch. Reflexes of grammar in patterns of language change. Journal of Language Variation and Change, 1:199–244, 1989. [127] Anthony Kroch. Syntactic change. In Mark Baltin and Chris Collins, editors, Handbook of Syntax. Oxford: Blackwell, 1999. [128] Anthony Kroch and A. Taylor. Verb movement in old and middle english: dialect variation and language contact. In A. van Kemenade and N. Vincent, editors, Parameters of Morphosyntactic Change. Cambridge: Cambridge University Press, 1997.


[129] Anthony Kroch, A. Taylor, and D. Ringe. The Middle English verb second constraint: A case study in language contact and language change. In S. Herring, S. Schoesler, and P. van Reenen, editors, Textual Parameters in Older Language: Current Issues in Linguistic Theory. Philadelphia: John Benjamins, 2000. [130] William Labov. The Social Stratification of English in New York City. Washington, DC: Center for Applied Linguistics, 1966. [131] William Labov. Principles of Linguistic Change, Volume 1: Internal factors. Oxford: Blackwell, 1994. [132] David Lightfoot. Principles of Diachronic Syntax. Cambridge: Cambridge University Press, 1979. [133] David Lightfoot. How to Set Parameters: Arguments from Language Change. Cambridge, MA: MIT Press/Bradford Books, 1991. [134] David Lightfoot. Shifting triggers and diachronic reanalyses. In A. van Kemenade and N. Vincent, editors, Parameters of Morphosyntactic Change. Cambridge: Cambridge University Press, 1997. [135] David Lightfoot. The Development of Language: Acquisition, Change and Evolution. Oxford: Blackwell, 1998. [136] R. Duncan Luce. Individual Choice Behavior: A Theoretical Analysis. New York: Wiley, 1959. [137] J. M. Macedonia and C. S. Evans. Variation among mammalian alarm call systems and the problem of meaning in animal signals. Ethology, 93:177–197, 1993. [138] Brian MacWhinney. The competition model. In B. MacWhinney, editor, Mechanisms of Language Acquisition. Hillsdale, NJ: Lawrence Erlbaum, 1987. [139] Brian MacWhinney. The CHILDES system. American Journal of Speech Language Pathology, 5:5–14, 1996. [140] Brian MacWhinney. A unified model of language acquisition. In J. Kroll and A. DeGroot, editors, Handbook of Bilingualism:Psycholinguistic Approaches. Oxford: Oxford University Press, 2004.


[141] A. M. Madeira. On clitic placement in European Portuguese. In H. van de Koot, editor, UCL Working Papers in Linguistics 4, University College London, 1992. [142] Christopher D. Manning and Hinrich Schutze. Foundations of Statistical Natural Language Processing. Cambridge, MA: MIT Press, 1999. [143] M. R. Manzini. Second position dependencies (paper presented at the 8th Workshop on Germanic Syntax, Troms.). 1992. [144] Maja J. Mataric. Coordination and learning in multi-robot systems. IEEE Intelligent Systems, pages 6–8, March-April 1998. [145] R. M. May. Stability and Complexity in Model Ecosystems. Princeton, NJ: Princeton University Press, 1973. [146] A. McMahon. Change, Chance, and Optimality. Oxford: Oxford University Press, 2000. [147] G. Miller. The Science of Words. New York: Scientific American Library, 1996. [148] J. Milroy. Linguistic Variation and Change: On the Historical Sociolinguistics of English. Oxford: Blackwell, 1992. [149] J. Milroy and L. Milroy. Linguistic change, social network and speaker innovation. Journal of Linguistics, 21:339–384, 1985. [150] J. Milroy and L. Milroy. Real English: The Grammar of English Dialects in the British Isles. London: Longman, 1993. [151] W. G. Mitchener. Bifurcation analysis of the fully symmetric language dynamical equation. Journal of Mathematical Biology, 43(3):265–285, 2003. [152] W. G. Mitchener. A Mathematical Model of Human Languages: The Interaction of Game Dynamics and Learning Processes. PhD thesis, Princeton University, 2003. [153] Salikoko Mufwene. The Ecology of Language Evolution. Cambridge: Cambridge University Press, 2001. [154] T. Nagylaki. Introduction to Theoretical Population Genetics. Berlin: Springer-Verlag, 1992.


[155] G. Neumann and D. Flickinger. Learning stochastic lexicalized tree grammars from HPSG. Technical report, DFKI, Saarbrucken, 1999. [156] M. Newman, A. Barabasi, and D. Watts. The Structure and Dynamics of Complex Networks. Princeton, NJ: Princeton University Press, 2003. [157] F. Newmeyer. Language Form and Language Function. Cambridge, MA: MIT Press/Bradford Books, 1998. [158] E. Newport, H. Gleitman, and L. R. Gleitman. Mother, I’d rather do it myself: Some effects and non-effects of maternal speech style. In C. E. Snow and C. A. Ferguson, editors, Talking to Children: Language Input and Acquisition. Cambridge: Cambridge University Press, 1977. [159] E. L. Newport and R. N. Aslin. Innately constrained learning: Blending old and new approaches to language acquisition. In S.C. Howell, S.A. Fish, and T. Keith-Lucas, editors, Proceedings of the 24th Annual Boston University Conference on Language Development. Cascadilla Press, 2000. [160] J. Nichols. Linguistic Diversity in Space and Time. Chicago: University of Chicago Press, 1992. [161] P. Niyogi. The Informational Complexity of Learning: Perspectives on Neural Networks and Generative Grammar. Boston: Kluwer Academic Publishers, 1998. [162] P. Niyogi. The computational study of diachronic linguistics. In D. Lightfoot, editor, Syntactic Effects of Morphological Change. Cambridge: Cambridge University Press, 2002. [163] P. Niyogi. Models of cultural evolution and their application to language change. In E. J. Briscoe, editor, Language Evolution through Language Acquisition. Cambridge: Cambridge University Press, 2002. [164] P. Niyogi. Phase transitions in language evolution. In L. Jenkins, editor, Variation and Universals in Biolinguistics. Elsevier Press, 2004. [165] P. Niyogi and R. C. Berwick. Formalizing triggers: A learning model for finite spaces. Technical Report AI Memo-1443, Cambridge, MA: MIT, November 1993.


[166] P. Niyogi and R. C. Berwick. The Logical Problem of Language Change. Technical Report AI Memo: 1516, Cambridge, MA: MIT, 1995. [167] P. Niyogi and R. C. Berwick. A dynamical systems model of language change. Complex Systems, 11:161–204, 1997. [168] P. Niyogi and R. C. Berwick. Evolutionary consequences of language learning. Linguistics and Philosophy, 20:697–719, 1997. [169] M. Nowak. The evolutionary biology of language. Philosophical Transactions of the Royal Society of London, 355:1615–1622, 2000. [170] M. Nowak, N. Komarova, and P. Niyogi. Evolution of universal grammar. Science, 291:114–118, 2001. [171] M. Nowak, N. Komarova, and P. Niyogi. Computational and evolutionary aspects of language. Nature, 417:611–617, 2002. [172] M. Nowak and D. Krakauer. The evolution of language. Proceedings of the National Academy of Sciences of the United States of America, 96(14):8028–8033, 1999. [173] M. Nowak and K. Sigmund. Chaos and the evolution of cooperation. Proceedings of the National Academy of Sciences, 90:5091–5094, 1993. [174] John Ohala. Sound change is drawn from a pool of synchronic variation. In L. E. Breivik and E. H. Jahr, editors, Language Change: Contributions to the Study of Its Causes. New York: Mouton de Gruyter, 1989. [175] John Ohala. The phonetics of sound change. In Jones, editor, Historical Linguistics: Problems and Perspectives, pages 237–278. Longman, 1993. [176] M. Oliphant. The learning barrier: Moving from innate to learned systems of communication. Adaptive Behavior, 7:371–384, 1999. [177] M. Oliphant and J. Batali. Learning and the emergence of coordinated communication. Center for Research on Language Newsletter, 11(1), 1997. [178] Paulus Orosius. In Janet Bately, editor, The Old English Orosius. Oxford: Oxford University Press, 1980.


[179] C. Osgood and T. Sebeok. Psycholinguistics: A survey of theory and research problems. Journal of Abnormal and Social Psychology, 49(4):1–203, 1954. [180] Daniel N. Osherson, M. Stob, and S. Weinstein. Systems That Learn. Cambridge, MA: MIT Press, 1986. [181] Hermann Paul. Principles of the History of Language (translated from the 1880 German edition by H. A. Strong). College Park, MD: McGroth Publishing Company, 1970. [182] M. Piatelli-Palmarini. Evolution, selection, and cognition: From learning to parameter-setting in biology and in the study of language. Cognition, 31(1), 1989. [183] Steven Pinker. Language Learnability and Language Development. Cambridge, MA: Harvard University Press, 1984. [184] Steven Pinker. The Language Instinct: The New Science of Language and Mind. New York: William Morrow, 1994. [185] L. Pitt. Probabilistic inductive inference. Journal of the ACM, 36(2):383–433, 1989.

[186] Carl J. Pollard and Ivan A. Sag. Head-Driven Phrase Structure Grammar. Chicago: University of Chicago Press, 1994. [187] Alan Prince and Paul Smolensky. Optimality Theory: Constraint Interaction in Generative Grammar. Technical Report TR-2, New Brunswick, NJ: Rutgers University Cognitive Science Center, 1993. [188] G. K. Pullum and B. Scholz. Empirical assessment of poverty of stimulus arguments. Linguistic Review, 19:9–50, 2002. [189] T. Regier. The Human Semantic Potential: Spatial Language and Constrained Connectionism. Cambridge, MA: MIT Press, 1996. [190] T. Regier. Emergent constraints on word-learning: A computational review. Trends in Cognitive Sciences, 7:263–268, 2003. [191] T. Regier, B. Corrigan, R. Cabasan, A. Woodward, M. Gasser, and L. Smith. The emergence of words. In Proceedings of the Cognitive Science Society, 2001.

[192] Sidney Resnick. Adventures in Stochastic Processes. Boston: Birkhauser, 1992. [193] J. Rickford. Dimensions of a Creole Continuum. Stanford, CA: Stanford University Press, 1987. [194] J. Rissanen and E. S. Ristad. Language acquisition in the MDL framework. In E. S. Ristad, editor, Language Computations. American Mathematical Society (DIMACS series), 1992. [195] I. Roberts. Verbs and Diachronic Syntax: A Comparative History of English and French. Kluwer Academic Publishers, 1992. [196] H. Rogers. Godel numberings of partial recursive functions. Journal of Symbolic Logic, 23:331–341, 1958. [197] A. Rubinstein. Economics and Language. Cambridge: Cambridge University Press, 2000. [198] M. Ruhlen. The Origin of Language: Tracing the Evolution of the Mother Tongue. New York: Wiley, 1994. [199] M. Ruhlen. On the Origin of Language. Stanford, CA: Stanford University Press, 1997. [200] Jerrold Sadock. Autolexical Syntax: A Theory of Parallel Grammatical Representations. Chicago: University of Chicago Press, 1991. [201] Y. Sakakibara. Learning context-free grammars from structural data in polynomial time. Theoretical Computer Science, 76:223–242, 1990. [202] W. Sakas. Ambiguity and the Computational Feasibility of Syntax Acquisition. PhD thesis, City University of New York, 2000. [203] G. Salvi. La sopravvivenza della legge di wackernagel nei dialetti occidentali della peninsola iberica. Medioevo Romanzo, 15:177–210, 1990. [204] Beatrice Santorini. The rate of phrase structure change in the history of Yiddish. Language Variation and Change, 5:257–283, 1993. [205] B. Schieffelin and A. Eisenberg. Cultural variation in children's conversations. In R. Schiefelbusch and D. Bricker, editors, Early Language: Acquisition and Intervention. University Park Press, 1981.


[206] Carson T. Schutze. The Empirical Base of Linguistics: Grammaticality Judgments and Linguistic Methodology. Chicago: University of Chicago Press, 1996. [207] B. Schwartz and S. Vikner. All verb second clauses are CPs. Working Papers in Scandinavian Syntax, 38:1–48, 1990. [208] A. Senghas and M. Coppola. Children creating language: How Nicaraguan sign language acquired a spatial grammar. Psychological Science, 12, 4:323–328, 2001. [209] Zhongwei Shen. The Dynamic Process of Sound Change. PhD thesis, University of California at Berkeley, 1993. [210] Zhongwei Shen. Exploring the dynamic aspect of sound change. Chinese Journal of Linguistics (monograph), 11, 1997. [211] Y. Shoham and M. Tennenholtz. On the emergence of social conventions: Modeling, analysis, and simulations. Artificial Intelligence Journal, 94:139–166, 1997. [212] J. M. Siskind. Naive Physics, Event Perception, Lexical Semantics, and Language Acquisition. PhD thesis, MIT, 1992. [213] J. M. Siskind. A computational study of cross-situational techniques for learning word-to-meaning mappings. Cognition, 61:39–91, 1996. [214] Dan Slobin. The crosslinguistic study of language acquisition: Vol. I-IV. Mahwah, NJ: Lawrence Erlbaum, 1985-97. [215] J. Maynard Smith. Evolution and the Theory of Games. Cambridge: Cambridge University Press, 1982. [216] J. Maynard Smith and E. Szathmary. The Major Transitions in Evolution. Oxford: Oxford University Press, 1995. [217] W. J. Smith. The Behavior of Communicating. Cambridge, MA: Harvard University Press, 1977. [218] Ray J. Solomonoff. A formal theory of inductive inference. Information and Control, 7:1–22, March 1964. [219] E. Stabler. Acquiring languages with movement. Syntax, 1:72–97, 1998.


[220] L. Steels. Self-organizing vocabularies. In C. Langton, editor, Proceedings of ALIFE V, 1996. [221] L. Steels. Social and cultural learning in the evolution of human communication. In K. Oller, D. Griebel, and K. Plunkett, editors, Evolution of Communication Systems: A Comparative Approach. Cambridge, MA: MIT Press, 2001. [222] L. Steels. Evolving grounded communication for robots. Trends in Cognitive Science, 7(7):308–312, 2003. [223] L. Steels and F. Kaplan. Spontaneous lexicon change. In Proceedings of COLING-ACL: Montreal, 1998. [224] L. Steels and P. Vogt. Grounding adaptive language games in robotic agents. In P. Husbands and I. Harvey, editors, Proceedings of the Fourth European Conference on Artificial Life, 1997. [225] Kenneth Stevens. Acoustic Phonetics. Cambridge, MA: MIT Press, 1998. [226] S. Strogatz. Exploring complex networks. Nature, 410:268–276, 2001. [227] D. Surendran. The functional load of phonological contrasts. Master’s thesis, The University of Chicago, 2003. [228] Henry Sweet. The Practical Study of Languages. 1899. [229] J. Tenenbaum and F. Xu. Word learning as Bayesian inference. In Proceedings of the 22nd Annual Conference of the Cognitive Science Society, 2000. [230] Bruce Tesar and Paul Smolensky. Learnability in Optimality Theory (short version). Technical Report JHU-CogSci-96-2, Baltimore: Johns Hopkins University Cognitive Science Department, 1996. [231] Bruce Tesar and Paul Smolensky. Learnability in Optimality Theory. Cambridge, MA: MIT Press, 2000. [232] F. Tohme and T. Sandholm. Coalition formation processes with belief revision among bounded rational self-interested agents. Journal of Logic and Computation, 9(6):793–815, 1999. [233] R. L. Trask. Historical Linguistics. London: Edward Arnold, 1996.


[234] L. Travis. Parameters and Effects of Word Order Variation. PhD thesis, MIT, 1984.
[235] Leslie Valiant. A theory of the learnable. Communications of the ACM, 27(11):1134–1142, 1984.
[236] L. J. Th. van der Kamp and L. C. W. Pols. Perceptual analysis from confusions between vowels. Acta Psychologica, 35:64–77, 1971.
[237] A. van der Mude and A. Walker. On the inference of stochastic regular grammars. Information and Control, 38:310–329, 1978.
[238] Ans van Kemenade and Nigel Vincent. Parameters of Morphosyntactic Change. Cambridge: Cambridge University Press, 1997.
[239] V. Vapnik. Estimation of Dependences Based on Empirical Data. New York: Springer-Verlag, 1982.
[240] V. Vapnik. Statistical Learning Theory. New York: Wiley, 1998.
[241] V. N. Vapnik and A. J. Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and Its Applications, 16:264–280, 1971.
[242] V. N. Vapnik and A. J. Chervonenkis. The necessary and sufficient conditions for consistency of the method of empirical risk minimization. Pattern Recognition and Image Analysis, 1:284–305, 1991.
[243] W. A. Wagenaar. Application of Luce's choice axiom to form discrimination. Nederlands Tijdschrift voor Psychologie, 23:96–108.
[244] Marilyn DeMorrest Wang and Robert C. Bilger. Consonant confusions in noise. Journal of the Acoustical Society of America, 54(5):1248–1266, 1973.
[245] W. S-Y. Wang. Competing changes as a cause of residue. Language, 45:9–25, 1969.
[246] W. S-Y. Wang. The Lexicon in Phonological Change. The Hague: Mouton, 1977.
[247] W. S.-Y. Wang, J. Y. Ke, and J. W. Minett. Computational studies of language evolution. In C. R. Huang and W. Lenders, editors, Computational Linguistics and Beyond, pages 65–106. Institute of Linguistics, Academia Sinica, 2004.
[248] A. Warner. The structure of parametric change and V-movement in the history of English. In A. van Kemenade and N. Vincent, editors, Parameters of Morphosyntactic Change. Cambridge: Cambridge University Press, 1997.
[249] D. Watts. Six Degrees: The Science of a Connected Age. New York: Norton, 2003.
[250] Uriel Weinreich, William Labov, and Marvin Herzog. Empirical foundations for a theory of language change. In W. Lehmann and Y. Malkiel, editors, Directions for Historical Linguistics, pages 97–195. Austin, TX: University of Texas Press, 1968.
[251] K. Wexler and P. Culicover. Formal Principles of Language Acquisition. Cambridge, MA: MIT Press, 1980.
[252] R. M. Wharton. Approximate language identification. Information and Control, 26:236–255, 1974.
[253] Patricia M. Wolfe. Linguistic Change and the Great Vowel Shift in English. Berkeley: University of California Press, 1972.
[254] Sewall Wright. Evolution and the Genetics of Populations, Vols. 1–4. Chicago: University of Chicago Press, 1968–1978.
[255] Charles Yang. Knowledge and Learning in Natural Language. PhD thesis, MIT, 2000.
[256] Charles Yang. Knowledge and Learning in Natural Language. Oxford: Oxford University Press, 2002.
[257] C. Jan-Wouter Zwart. Clitics in Dutch: Evidence for the position of INFL. Groninger Arbeiten zur germanistischen Linguistik, 33:71–92, 1991.

Index

Absorbing state, 104, 106
Actuation problem, 181
Anglo-Saxon Chronicles, 5, 289
Artificial intelligence, 346
Ascending texts, 69
Asymmetric errors, 269
Bailey, 29, 205
Batch error-based learner, 171
Batch learner, 288
Batch learning, 395
Batch learning bounds, 135
Batch subset algorithm, 246
Bengali, 13, 18
Bengali parameters, 92
Best response, 351
Bifurcations, 40, 180, 241, 248, 270, 332, 391, 402, 426, 448
Bilingual acquisition, 185
Bilingualism, 322
Biological evolution, 32
Cavalli-Sforza and Feldman theory, 277
Centralization Index, 21
Chaos, 166
Child language sentence types, 134
Childes corpus, 112, 134
Chinese phonology, 251
Church-Turing thesis, 51
Closed set of states, 104, 106
Coherence threshold, 382
Communicability function, 346
Communicative efficiency, 341, 343, 350
Computational complexity, 81
Convergence of Markov language model, 123
Critical threshold, 180
Cue-based learner, 173, 288, 312
Cue-frequency batch learner, 409, 415
Cultural evolution, 275
Darwin, 35
Darwinian fitness, 375
Darwinian revolution, 33, 34
Descent of Man, 35
Diachronic variation, 13
Dialect formation, 314
Differential fitness, 342
Differential reproduction, 377
Distribution-free learning, 67
Distributional effects on language learning, 132
Dynamic system, 157
Dynamical system, 192
Dynamical systems, 164
Emergence of language, 41, 398, 402, 431, 449
Empirical risk minimization, 54
enclitics, 234
English change, 289
English parameters, 91
English vowel shift, 16
European Portuguese authors, 235
Finite population effects, 182
Finite populations, 305
Fitness, 377
Formal languages, 51
French syntactic change, 330
French syntax, 22, 218
French V2 loss, 22
Frequency effects, 208
Functional load, 366, 368
Galves batch learning algorithm, 239
GB grammar, 11, 56, 88
Generalization, 49
Generalization error, 55
Generational structure, 183
Gold learning, 100
Gold's theorem, 58
Grammar competition, 320
Grammatical inference, 56
Grammaticality judgments, 3, 6
Head complement relations, 90, 292
Henry Sweet, 28
Hermann Paul, 28
Heterogeneous populations, 211
Hindi, 18
Homogeneous populations, 196
HPSG grammar, 11, 56, 88
Identification in the limit, 57, 100
Idiolects, 28
Individual learning, 161, 188
Indo-European thesis, 34
Inductive inference, 56
Informant texts, 70
Information theory, 370
Inhomogeneous Markov chain, 138
Ising model, 435
Language acquisition, 8, 50
Language emergence, 41, 398, 402, 431, 449
Language growth, 8
Language learning convergence times, 117
Learnability, 55, 106, 162
Learning, 378
Learning algorithm, 53
Learning and evolution, 39, 446
Learning fidelity, 380, 392
Learning from parents, 406
Learning in nesting phase, 407
Learning in the limit, 57
Learning rates, 117
Learning system triple, 193
Learning with full information, 360
Learning with measure 1, 63, 64
Learning with partial information, 361
Lexical diffusion, 256
Lexical structure, 366
Lexical-Functional grammar, 11, 56, 88
Lightfoot, 156, 291
Linguistic coherence, 342, 376, 402, 405, 409, 415
Linguistic Data Consortium, 32
Linguistic diversity in space, 319
Linguistic equilibria, 380, 420
Linguistic stability, 385, 422
Local learning, 402
Locking lemma, 58
Logical problem of language acquisition, 8, 155
Logistic map, 166
Major transitions in evolution, 341
Markov chain convergence, 117
Markov chain formulation, 100, 102, 104
Maturation time, 207
Maximum likelihood, 239
Memoryless bilingual learner, 327
Memoryless learner, 130, 164, 201, 327, 432
Memoryless learning, 54, 105, 138, 140, 393
Metrical phonology, 94
Metrical stress, 94
Microscopic and macroscopic linguistics, 158
Middle French, 222
Minimum description length, 54
Modern French, 222
Multiagent systems, 346
Multilingual acquisition, 185
Multilingual learners, 320
Multiple language model, 187
Mutual intelligibility, 343, 350
Natural selection, 35, 41, 342, 377, 402, 449
Neighborhood effects, 298
Niyogi Berwick model, 281
Null subject, 22, 220, 330
Oblique transmission, 302
Old English, 5, 156, 289
Old French, 221
One-Language equilibria, 382
One-Language mode, 406
Origin of Language, 341
Orosius, 6
Oscillations, 319
Osgood and Sebeok, 29
OV grammar, 7
PAC language learning, 135
PAC learning, 63
PAC model, 12, 71
Padding, 146
Parametric change, 195, 219, 291
Penn Helsinki corpus, 32
Perceptual confusibility, 371
Phase-space plots, 211
Phonemic contrasts, 366
Phonological merger, 252
Population dynamics, 162, 188, 193, 223, 239, 241, 258, 378, 379, 410, 416
Population dynamics stability, 213
Portuguese change, 233
Portuguese clitic placement, 233
Poverty of stimulus, 86
Primary linguistic data, 10, 158
Principal components analysis, 89
Principles and Parameters, 88, 89, 96
Probably Approximately Correct, 12, 63, 71
proclitics, 234
pro-drop, 22, 220, 330
Proof of language learning, 113
Quadratic map, 166
Quasi-species equations, 403
Recurrent states, 114
Recursive texts, 69
Reinforcement learning, 346
S-shaped curve, 29, 166, 205
Sample complexity, 363
Sanskrit, 18
Saussure, 61, 347
Sentence distributions, 208
Sentence types in parametric space, 113
Shared language, 382
Social learning, 406, 408
Sound change, 255, 271
Spatial effects, 314
Spatial population, 314
Spatial population structure, 184
Statistical physics analogies, 433
Stochastic context-free grammars, 69
Stochastic dynamics, 306
Stochastic grammars, 69
Synchronic variation, 13
T. Dobzhansky, 458
Text, 56
The Actuation problem, 270
Three parameter system, 89, 113, 195
Tower of Babel, 406
Transformational rules, 92
Transient states, 114
Tree Adjoining grammar, 11, 56
Triggering Learning Algorithm (TLA), 100, 164, 247, 284, 310
Triggers, 101
Tycho Brahe corpus, 32, 234
Uniformly computable families, 68
Update rule, 189, 193
V2 loss, 220, 330
V2 movement, 93, 295
V2 stability, 197
Vapnik-Chervonenkis theory, 74
VC dimension, 75
Verb Object order, 6
Verb Object relations, 295
VO grammar, 7
Vowel centralization, 21
Weinreich, Labov, Herzog, 29
Wenzhou province, 19, 251
William Jones, 34
Word order, 6
Wu dialect, 19, 251
X-bar theory, 89
Yiddish phonology, 20
Yiddish syntax, 23, 177

Current Studies in Linguistics
Samuel Jay Keyser, general editor

1. A Reader on the Sanskrit Grammarians, J. F. Staal, editor
2. Semantic Interpretation in Generative Grammar, Ray Jackendoff
3. The Structure of the Japanese Language, Susumu Kuno
4. Speech Sounds and Features, Gunnar Fant
5. On Raising: One Rule of English Grammar and Its Theoretical Implications, Paul M. Postal
6. French Syntax: The Transformational Cycle, Richard S. Kayne
7. Panini as a Variationist, Paul Kiparsky, S. D. Joshi, editor
8. Semantics and Cognition, Ray Jackendoff
9. Modularity in Syntax: A Study of Japanese and English, Ann Kathleen Farmer
10. Phonology and Syntax: The Relation between Sound and Structure, Elisabeth O. Selkirk
11. The Grammatical Basis of Linguistic Performance: Language Use and Acquisition, Robert C. Berwick and Amy S. Weinberg
12. Introduction to the Theory of Grammar, Henk van Riemsdijk and Edwin Williams
13. Word and Sentence Prosody in Serbocroatian, Ilse Lehiste and Pavle Ivić
14. The Representation of (In)definiteness, Eric J. Reuland and Alice G. B. ter Meulen, editors
15. An Essay on Stress, Morris Halle and Jean-Roger Vergnaud
16. Language and Problems of Knowledge: The Managua Lectures, Noam Chomsky
17. A Course in GB Syntax: Lectures on Binding and Empty Categories, Howard Lasnik and Juan Uriagereka
18. Semantic Structures, Ray Jackendoff
19. Events in the Semantics of English: A Study in Subatomic Semantics, Terence Parsons
20. Principles and Parameters in Comparative Grammar, Robert Freidin, editor
21. Foundations of Generative Syntax, Robert Freidin
22. Move α: Conditions on Its Application and Output, Howard Lasnik and Mamoru Saito
23. Plurals and Events, Barry Schein
24. The View from Building 20: Essays in Linguistics in Honor of Sylvain Bromberger, Kenneth Hale and Samuel Jay Keyser, editors
25. Grounded Phonology, Diana Archangeli and Douglas Pulleyblank
26. The Magic of a Common Language: Jakobson, Mathesius, Trubetzkoy, and the Prague Linguistic Circle, Jindrich Toman
27. Zero Syntax: Experiencers and Cascades, David Pesetsky
28. The Minimalist Program, Noam Chomsky
29. Three Investigations of Extraction, Paul M. Postal
30. Acoustic Phonetics, Kenneth N. Stevens
31. Principle B, VP Ellipsis, and Interpretation in Child Grammar, Rosalind Thornton and Kenneth Wexler
32. Working Minimalism, Samuel Epstein and Norbert Hornstein, editors
33. Syntactic Structures Revisited: Contemporary Lectures on Classic Transformational Theory, Howard Lasnik with Marcela Depiante and Arthur Stepanov
34. Verbal Complexes, Hilda Koopman and Anna Szabolcsi
35. Parasitic Gaps, Peter W. Culicover and Paul M. Postal
36. Ken Hale: A Life in Language, Michael Kenstowicz, editor
37. Flexibility Principles in Boolean Semantics: The Interpretation of Coordination, Plurality, and Scope in Natural Language, Yoad Winter
38. Phrase Structure Composition and Syntactic Dependencies, Robert Frank
39. Representation Theory, Edwin Williams
40. The Syntax of Time, Jacqueline Guéron and Jacqueline Lecarme, editors
41. Situations and Individuals, Paul D. Elbourne
42. Wh-Movement: Moving On, Lisa L.-S. Cheng and Norbert Corver, editors
43. The Computational Nature of Language Learning and Evolution, Partha Niyogi