

Statistical Learning and Language Acquisition

Studies in Second and Foreign Language Education 1

Editors

Anna Uhl Chamot
Wai Meng Chan

De Gruyter Mouton

Statistical Learning and Language Acquisition

edited by

Patrick Rebuschat
John N. Williams

De Gruyter Mouton

ISBN 978-1-934078-23-5
e-ISBN 978-1-934078-24-2
ISSN 2192-0982

Library of Congress Cataloging-in-Publication Data
A CIP catalog record for this book has been applied for at the Library of Congress.

Bibliographic information published by the Deutsche Nationalbibliothek
The Deutsche Nationalbibliothek lists this publication in the Deutsche Nationalbibliografie; detailed bibliographic data are available on the Internet at http://dnb.dnb.de.

© 2012 Walter de Gruyter, Inc., Boston/Berlin
Cover image: Creatas/Thinkstock
Typesetting: RoyalStandard, Hong Kong
Printing: Hubert & Co. GmbH & Co. KG, Göttingen
Printed on acid-free paper
Printed in Germany
www.degruyter.com

Preface

Reading the papers in Rebuschat and Williams' volume, "Statistical learning and language acquisition," brings me both back in time and forward to the future. I suppose that is appropriate, given that the kind of learning at issue is precisely the kind that avails itself of prior experience to predict future events. The construct of statistical learning is both intuitively appealing and frustratingly vague. The appeal of statistical learning, I believe, derives from its apparent simplicity: it would be sensible for learners to exploit distributions of events in their environments to predict future events. Unfortunately, the flip side of this apparent simplicity is that the construct is so easily applied that it is difficult to decide where statistical learning rightly begins and ends. Appropriately, then, the chapters in the current volume bring out both the pleasures and the pitfalls of accounts that invoke statistical learning mechanisms.

When Dick Aslin, Elissa Newport, and I began to work on our collaborative studies on infant and adult statistical language learning in the early 1990s, we were keenly aware of the history surrounding these ideas. Their roots lie in the structural linguistics of Leonard Bloomfield (1933) & Zellig Harris (1955), and in prior experimental and theoretical work by Hayes & Clark (1970), Goodsitt, Morgan, & Kuhl (1993), Braine (1966), Reber (1967), Morgan & Newport (1981), Maratsos & Chalkley (1980), and many others. Despite the long history of research and debate surrounding these ideas, I did not anticipate the field's reaction to our initial infant studies. There were two interesting and surprising dimensions to those reactions.

The first dimension spanned responses ranging from "Duh!" to "Impossible!" Some colleagues, particularly those in the visual sciences, responded to our initial studies by saying: "Of course learners track statistics in environmental input; how could they not?" At the other extreme, some readers questioned the idea that statistical information could have any efficacy whatsoever given the complexities of natural language: "How could a learning ability that allows you to remember wallpaper patterns possibly have anything to do with real linguistic input?" While the current incarnations of these perspectives are markedly less extreme, they continue to provide necessary counterpoints as we work to expand and refine our theories.

The second dimension is also still quite current.
After we published our first paper on infant statistical language learning, some readers responded by saying: "Wow! This is real evidence for a language learning device; look at the speed and ease with which infants in these studies learned a novel linguistic structure." Other readers responded by saying: "Wow! This is real evidence for a general learning device; these results suggest that language learning must be subserved by the same machinery that we use to learn across varied domains." Of course, that initial paper was not intended to directly address questions of domain-specificity versus domain-generality. But, as is evident from the papers in this volume, the subsequent decade has seen a great deal of research focused on this issue.

Reading these chapters, I'm struck by the breadth of questions that have emerged and reemerged over the past two decades of research on statistical learning. Issues of nativism and empiricism continue to fascinate us. Because learning requires both innate machinery and experience as input to that machinery, it is a fertile domain within which to explore nature-nurture questions. We continually return to issues of ecological validity, even as we constantly attempt to refine our methods to move closer to studying learning "in the wild." We continue to grapple with fundamental issues: What are the computations that learners perform? What are the units over which those computations are performed? Are there developmental differences that affect which units are tracked and which computations are prioritized? What is the locus of constraints on learning, and does this differ across domains? What are the most appropriate ways to model statistical learning processes? What is the relationship between statistical learning and other key cognitive constructs (implicit learning, associative learning, procedural learning, working memory, attention, conscious awareness)? How should statistical learning fit with current thinking about language evolution and neural plasticity? These are not new questions. But the ways in which the authors in this volume address them are new and very exciting. It is notable that some consensus has emerged: Nobody takes statistical learning for granted ("Duh!"), and nobody is arguing that these learning mechanisms are entirely irrelevant. We are all working to determine the role that statistical learning should play within our broader theories.

Finally, the papers in this volume suggest that accounts that invoke statistical learning mechanisms have moved into the mainstream of subfields well beyond first language acquisition. Second language acquisition is of particular interest, given that immersion in an L2 is essentially an implicit learning experience. Application to language and cognitive disorders is a natural extension, and research on individual differences has huge potential. Music is an ideal companion domain for research on language; the role of expectation has a long and illustrious history in music theory (Meyer, 1956), and I expect that we will see the continued emergence of rich statistical learning accounts in this domain.
What will be contained in the next edition of this volume, a few years down the line? It's hard to know. Statistical learning accounts may become a full-fledged alternative to more traditional perspectives in language acquisition, as well as in other domains where these models are beginning to be applied (e.g., social cognition, perception for action, etc.). Or aspects of ideas from this framework may be integrated into other types of accounts, playing a role where needed. To a large extent, the future of statistical learning hinges on the answers to the questions laid out by the authors in this volume. I can't wait to find out what the answers are!

Jenny Saffran

References

Bloomfield, L. 1933. Language. New York: Holt.
Braine, M. D. S. 1966. Learning the positions of words relative to a marker element. Journal of Experimental Psychology, 72, 532–540.
Goodsitt, J. V., Morgan, J. L., & Kuhl, P. K. 1993. Perceptual strategies in prelingual speech segmentation. Journal of Child Language, 20, 229–252.
Harris, Z. S. 1955. From phoneme to morpheme. Language, 31, 190–222.
Hayes, J. R., & Clark, H. H. 1970. Experiments in the segmentation of an artificial speech analog. In J. R. Hayes (Ed.), Cognition and the development of language. New York: Wiley.
Maratsos, M., & Chalkley, M. A. 1980. The internal language of children's syntax: The ontogenesis and representation of syntactic categories. In Katherine Nelson (Ed.), Children's language, Vol. 2 (pp. 127–213). New York: Gardner Press.
Meyer, L. 1956. Emotion and meaning in music. Chicago: University of Chicago Press.
Morgan, J. L., & Newport, E. L. 1981. The role of constituent structure in the induction of an artificial language. Journal of Verbal Learning and Verbal Behavior, 20, 67–85.
Reber, A. S. 1967. Implicit learning of artificial grammars. Journal of Verbal Learning and Verbal Behavior, 6, 855–863.

List of contributors

Althea Bauernschmidt, Department of Psychological Sciences, Purdue University
Morten H. Christiansen, Department of Psychology, Cornell University
Christopher M. Conway, Department of Psychology, Saint Louis University
Zoltan Dienes, School of Psychology and Sackler Centre for Consciousness Science, University of Sussex
Nick C. Ellis, Department of Psychology and English Language Institute, University of Michigan
Judit Gervain, Laboratoire Psychologie de la Perception, Institut Neurosciences Cognition, Université Paris Descartes
Michael H. Goldstein, Department of Psychology, Cornell University
Kalim Gonzales, Department of Psychology, University of Arizona
Rebecca Gómez, Department of Psychology, University of Arizona
Michelle A. Gremp, Central Institute for the Deaf, Washington University School of Medicine
Phillip Hamrick, Department of Linguistics, Georgetown University
Jessica F. Hay, Department of Psychology, University of Tennessee – Knoxville
Jill Lany, Department of Psychology, University of Notre Dame
Psyche Loui, Department of Neurology, Beth Israel Deaconess Medical Center and Harvard Medical School
Jacques Mehler, Language, Cognition and Development Lab, Scuola Internazionale Superiore di Studi Avanzati (SISSA), Trieste
Jennifer B. Misyak, Department of Psychology, Cornell University
Daniel Navarro, School of Psychology, University of Adelaide
Marina Nespor, Facoltà di Psicologia, Università di Milano-Bicocca
Matthew Brook O'Donnell, English Language Institute, University of Michigan
Luca Onnis, Department of Second Language Studies, University of Hawaii, Manoa
Amy Perfors, School of Psychology, University of Adelaide
Pierre Perruchet, Laboratoire d'Etude de l'Apprentissage et du Développement, CNRS-UMR 5022, Université de Bourgogne, Dijon
David B. Pisoni, Department of Psychological and Brain Sciences, Indiana University
Bénédicte Poulin-Charronnat, Laboratoire d'Etude de l'Apprentissage et du Développement, CNRS-UMR 5022, Université de Bourgogne, Dijon
Patrick Rebuschat, School of Linguistics and English Language, Bangor University
Michelle Sandoval, Department of Psychology, University of Arizona
Mohinish Shukla, Department of Brain and Cognitive Sciences, University of Rochester
Kenny Smith, Language Evolution and Computation Research Unit, School of Philosophy, Psychology and Language Sciences, University of Edinburgh
Anne D. Walk, Department of Psychology, Saint Louis University
Geraint A. Wiggins, Department of Computing and Centre for Cognition, Computation and Culture, Goldsmiths, University of London
John N. Williams, Department of Theoretical and Applied Linguistics, University of Cambridge

Table of contents

Preface (Jenny Saffran) . . . . . v
List of contributors . . . . . ix
Introduction: Statistical learning and language acquisition (Patrick Rebuschat and John Williams) . . . . . 1
Statistical-sequential learning in development (Jennifer B. Misyak, Michael H. Goldstein and Morten H. Christiansen) . . . . . 13
Bootstrapping language: Are infant statisticians up to the job? (Elizabeth K. Johnson) . . . . . 55
Sensitivity to statistical information begets learning in early language development (Jessica F. Hay and Jill Lany) . . . . . 91
Word segmentation: Trading the (new, but poor) concept of statistical computation for the (old, but richer) associative approach (Pierre Perruchet and Bénédicte Poulin-Charronnat) . . . . . 119
The road to word class acquisition is paved with distributional and sound cues (Michelle Sandoval, Kalim Gonzales and Rebecca Gómez) . . . . . 145
Linguistic constraints on statistical learning in early language acquisition (Mohinish Shukla, Judit Gervain, Jacques Mehler and Marina Nespor) . . . . . 171
The potential contribution of statistical learning to second language acquisition (Luca Onnis) . . . . . 203
Statistical learning and syntax: What can be learned, and what difference does meaning make? (John N. Williams and Patrick Rebuschat) . . . . . 237
Statistical construction learning: Does a Zipfian problem space ensure robust language learning? (Nick C. Ellis and Matthew Brook O'Donnell) . . . . . 265
Can we enhance domain-general learning abilities to improve language function? (Christopher M. Conway, Michelle A. Gremp, Anne D. Walk, Althea Bauernschmidt and David B. Pisoni) . . . . . 305
Conscious versus unconscious learning of structure (Zoltan Dienes) . . . . . 337
How implicit is statistical learning? (Phillip Hamrick and Patrick Rebuschat) . . . . . 365
What Bayesian modelling can tell us about statistical learning: what it requires and why it works (Amy Perfors and Daniel J. Navarro) . . . . . 383
Evolutionary perspectives on statistical learning (Kenny Smith) . . . . . 409
Statistical learning: What can music tell us? (Psyche Loui) . . . . . 433
"I let the music speak": Cross-domain application of a cognitive model of musical learning (Geraint A. Wiggins) . . . . . 463
Index . . . . . 495

Introduction: Statistical learning and language acquisition

Patrick Rebuschat and John Williams

Recent years have witnessed an increasing interest in empiricist approaches to language acquisition (see Behrens, 2009; Ellis, 2006a, 2006b; Elman et al., 1996; Goldberg, 2006; MacWhinney, 1999; Redington & Chater, 1998; Tomasello, 2003). This development was driven, in part, by two observations, namely that (i) infants' environment is considerably richer in linguistic and non-linguistic cues than previously anticipated and that (ii) infants are able to make extensive use of these cues when acquiring language. Both findings suggest a greater role for learning than traditionally assumed by nativist approaches to language development (e.g. Anderson & Lightfoot, 2002; Chomsky, 1966, 1986, 1988; Crain & Pietroski, 2001; Roeper & Williams, 1987). Among empiricist approaches, research conducted on statistical learning, i.e. our ability to make use of statistical information in the environment to bootstrap language acquisition, has been particularly fruitful. Statistical learning research was sparked by the work of Jenny Saffran, Elissa Newport, and Richard Aslin (Saffran, Aslin, & Newport, 1996; Saffran, Newport, & Aslin, 1996) and developed into a major research strand in developmental psychology (see Gómez, 2007; Saffran, 2003, for reviews). Statistical learning involves computations based on units or patterns, which can include linguistic elements such as speech sounds, syllables, syntactic categories and form-meaning mappings. The types of statistical computation range from simple frequency counts to the tracking of co-occurrence information and conditional probability. Research in statistical learning generally focuses on infant or child language acquisition, though studies with adult subjects are also common. In terms of methodology, the most distinctive features of statistical learning research are the careful manipulation of statistical information in the input and the use of artificial languages (see Gómez & Gerken, 2000, for a review).

In their seminal study, Saffran, Aslin, and Newport (1996) investigated whether 8-month-old infants could use statistical information to solve the problem of word segmentation, i.e. to discover word boundaries in running speech.
Infants were exposed to two minutes of a continuous speech stream that contained four three-syllable nonsense words (e.g., tupiro, padoti). The "words" were repeated in random order, and a speech synthesizer was used to generate a continuous auditory sequence (e.g., bidakupadotigolabubidakupadotigolabubidakutupiro...). The sequence contained no pauses, stress differences or any other acoustic cues between words, so that the only cues to word boundaries were the transitional probabilities between syllables. The transitional probability within words was 1.0, given that the first syllable of a word was always followed by the second, and the second syllable by the third (e.g., tu– was always followed by –pi–, and –pi– followed by –ro). The transitional probability between words was 0.33 because the final syllable of a given word could be followed by the initial syllable of three different words (e.g., –ro could be followed by go–, bi–, or pa–). Infants were then tested by means of the head-turn preference procedure to determine whether they could recognize the difference between trained items (tupiro, golabu) and novel items (dapiku, tilado).
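The arithmetic behind these transitional probabilities is easy to make concrete. The sketch below is purely illustrative, not the original study's stimulus generator: it concatenates the four nonsense words named above into a continuous syllable stream (with no immediate word repetitions) and then estimates forward transitional probabilities from adjacent syllable pairs; the variable and function names are our own.

```python
# A minimal, illustrative sketch (not the original study's materials): build a
# syllable stream from the four nonsense words and estimate forward
# transitional probabilities P(next syllable | current syllable).
import random
from collections import Counter

WORDS = [("tu", "pi", "ro"), ("go", "la", "bu"),
         ("bi", "da", "ku"), ("pa", "do", "ti")]

random.seed(0)
stream, prev = [], None
for _ in range(300):  # roughly the scale of a two-minute synthesized stream
    word = random.choice([w for w in WORDS if w != prev])  # no immediate repeats
    stream.extend(word)
    prev = word

pair_counts = Counter(zip(stream, stream[1:]))  # counts of adjacent syllable pairs
context_counts = Counter(stream[:-1])           # how often each syllable precedes another

def transitional_probability(current, nxt):
    """Estimate P(nxt | current) = count(current followed by nxt) / count(current)."""
    return pair_counts[(current, nxt)] / context_counts[current]

print(transitional_probability("tu", "pi"))  # within a word: 1.0
print(transitional_probability("ro", "go"))  # across a word boundary: about 0.33
```

Note that the no-repeat constraint in this sketch is what yields roughly 0.33 across word boundaries; if a word could immediately repeat itself, the between-word value would drift toward 0.25 instead.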

Saffran, Aslin and Newport (1996) found that the 8-month-olds successfully discriminated between familiar and unfamiliar stimuli, which suggests that infants are highly sensitive to statistical information (here, transitional probabilities) and that they can use this information to succeed in a complex learning task (word segmentation). This early research on statistical learning was important for demonstrating that infants are "intuitive statisticians" (Ellis, 2006b), who are able to make extensive use of environmental cues when acquiring language. Importantly, subsequent research has shown that the capacity for statistical learning is maintained throughout adulthood (e.g., Saffran, Newport, & Aslin, 1996) and that statistical learning is not restricted to the task of word segmentation. After more than a decade of experimental research, there is ample evidence that both infants and adults can exploit the statistical structure of their environment in order to succeed in a wide variety of linguistic tasks, including phonological learning (e.g., Maye, Weiss, & Aslin, 2008; Maye, Werker, & Gerken, 2002), word learning (e.g., Estes, Evans, Alibali, & Saffran, 2007; Yu & Smith, 2007; Smith & Yu, 2008) and syntactic development (e.g. Gerken, Wilson, & Lewis, 2005; Saffran & Wilson, 2003; Thompson & Newport, 2007). There is also evidence that the cognitive mechanism involved in statistical learning is not specific to language acquisition but rather domain-general in nature, i.e. the learning mechanism applies to statistical information in the environment, irrespective of the nature of the stimulus (auditory, visual, tactile, etc.; see Saffran & Thiessen, 2007, for discussion).

For example, several experiments have demonstrated that infants and adults can track sequential statistics in nonlinguistic auditory stimuli (e.g., Saffran, Johnson, Aslin, & Newport, 1999) and visual stimuli (e.g., Bulf, Johnson, & Valenza, 2011; Fiser & Aslin, 2002a, 2002b). Studies on cotton-top tamarin monkeys (Hauser, Newport, & Aslin, 2001) and rodents (e.g., Toro & Trobalón, 2005) further suggest that basic aspects of statistical learning are not unique to human learners. Finally, it is widely accepted that the process of statistical learning can occur incidentally, i.e. subjects can acquire the statistical structure of language without the conscious intention to learn, making the process of statistical learning analogous to that of implicit learning (see also Dienes, this volume; Hamrick & Rebuschat, this volume; Misyak, Goldstein, & Christiansen, this volume).

This Volume

The present volume brings together researchers from a variety of disciplines (cognitive psychology, computer science, corpus linguistics, developmental psychology, psycholinguistics) in order to assess the progress made in statistical learning research, to critically appraise the role of statistical learning in language acquisition, and to determine future directions to take in this interdisciplinary enterprise. The volume was inspired by an eponymous symposium which the editors organized for the 2009 edition of the Georgetown University Round Table on Languages and Linguistics (GURT). The feedback we received from the symposium presenters and conference delegates was very positive throughout, and when we were approached by Mouton de Gruyter regarding the possibility of producing an edited volume on the same topic we readily agreed to do so. Three presentations from our original symposium were converted into much expanded and updated chapters (Ellis & O'Donnell, Williams & Rebuschat, and Hay & Lany). The remaining contributors were recruited specifically for this volume.

Each chapter in this volume was peer-reviewed by 2–3 anonymous reviewers and by the two editors. In addition, many chapters were used as readings in a postgraduate course on the Implicit and Explicit Learning of Languages (Ling-494), offered by the first editor at Georgetown University. This enabled us to gain feedback on the readability of texts and on the clarity of the arguments expressed by the authors.
The final product is a volume that is written in an accessible and engaging fashion and that gives readers a snapshot of the exciting research that has examined the role of statistical learning in language acquisition.

Jennifer Misyak, Michael Goldstein and Morten Christiansen focus on two distinct, but closely related research traditions, namely implicit learning (Reber, 1967) and statistical learning (Saffran, Aslin, & Newport, 1996). Both approaches focus on how we acquire information from the environment and both rely heavily on the use of artificial grammars. Perruchet & Pacton (2006) suggested that implicit and statistical learning represent two approaches to a single phenomenon. Conway & Christiansen (2006) went as far as combining the two in name: implicit statistical learning. Misyak, Goldstein and Christiansen's aim is to promote the synergistic fusion of the two approaches by highlighting theoretical and methodological similarities and by providing researchers with a thorough and much-needed synthesis of current research in both fields.

Elizabeth Johnson evaluates the contribution of statistical learning to solving the bootstrapping problem. The chapter focuses on infant learners and the task of word segmentation, but Johnson's observations apply to many levels of spoken language acquisition. She first provides a brief overview of the progress made in statistical learning research. This is followed by an engaging discussion of five questions and challenges faced by distributional models of language development: Does the ability to track patterns in an artificial language scale up to the challenge of natural language? What are the units that language learners keep track of, and what type of calculations do they perform? Can distributional models predict children's difficulties? How much knowledge is innate and how much is acquired? And how should looking-time data in statistical learning research be interpreted?

Jessica Hay and Jill Lany also concentrate on the role of statistical information in infant language development. Their chapter begins with three important observations. Firstly, many of the early statistical learning experiments employed artificial languages that lack the rich, multidimensional structure of natural language. Secondly, many studies presented subjects with stimuli that are devoid of semantic information. Both of these aspects arguably reduce the ecological validity of studies. Thirdly, early research has little to say about how statistical learning at one level (e.g. syllables) relates to statistical learning about other aspects of language (e.g., word classes). Hay and Lany then describe several recent studies that have begun to address these gaps. The work reviewed in their chapter shows that infants are highly adept at tracking statistical regularities in an artificial language even with tasks that more closely approximate the problems faced over the course of learning a natural language.
Importantly, this research also shows how sensitivity to statistical structure in one area of language can bootstrap the learning of other, more complex dimensions of language structure.

Pierre Perruchet and Bénédicte Poulin-Charronnat propose that statistical learning phenomena can be interpreted as end-products of associative learning processes and that the associative approach can provide a stronger and more appropriate framework within which to examine statistical learning. Their emphasis is on the widely-studied task of word segmentation. After describing their thesis in detail, Perruchet and Poulin-Charronnat discuss different explanations for our sensitivity to statistical structure (associative, attention-based, and interference-based accounts). They then explore how statistical computation can be integrated with other factors that are known to play an important role in word segmentation (acoustical cues and contextual information) in a unified, dynamic perspective that is based on the associative learning tradition. They conclude by considering evidence from behavioural experiments and computational modelling.

Michelle Sandoval, Kalim Gonzales and Rebecca Gómez focus on the acquisition of word classes. They first consider three cues to word class – distributional, phonological and prosodic – and review studies that examined the role of these cues in the acquisition of lexical categories. Sandoval, Gonzales and Gómez then discuss how these multiple sources of information are integrated in word class acquisition. Their chapter concludes with a discussion of how learners might scale up from purely form-based categories to lexical classes.

Mohinish Shukla, Judit Gervain, Jacques Mehler and Marina Nespor suggest that a synthesis between rationalist and empiricist approaches might be necessary to account for a complex phenomenon like language acquisition and propose that three types of mechanisms – rule-based, distributional and perceptual – are required to explain how languages are acquired. The authors begin by defining statistical learning and by reviewing several key studies. In the following sections, they then investigate how a powerful, domain-general statistical learning mechanism interacts with other, language-specific and perceptual processes. Specifically, they consider how linguistic representations constrain the use of statistical information at the phonemic, morphological, syntactic, and prosodic levels.

Luca Onnis reflects on the potential contribution of statistical learning to second language (L2) acquisition. In the first part, Onnis discusses four principles based on statistical learning research that can be applied to L2 learning scenarios.
These general learning principles are: (i) "Integrate probabilistic sources of information", (ii) "Seek invariance in the signal", (iii) "Reuse learning mechanisms", and (iv) "Learn to predict." In the second part, Onnis then elaborates on how these principles can be put to use for specific problems arising in L2 acquisition and teaching. He considers evidence from both behavioural experiments and computational analyses of corpora.

John Williams and Patrick Rebuschat focus on the acquisition of second language (L2) syntax in adult learners. Their chapter discusses the contribution of statistical learning to L2 syntactic development and the role of prior linguistic knowledge. An obvious criticism of artificial language experiments is that learners are often exposed to meaningless stimuli. Williams and Rebuschat describe a series of experiments that employed semi-artificial languages, i.e. systems in which the complexity of natural language was maintained and semantic information was present. Their findings support the view that syntactic structure can be induced from an analysis of the contingencies between words. However, they also suggest that there are limitations to what can be learned.

Nick Ellis and Matt O'Donnell present the results of a corpus analysis that was designed to test the generalizability of construction grammar theories of language learning. The linguistic focus is on Verb-Argument Constructions (VACs); the corpus in question is the British National Corpus (BNC). The chapter begins with a description of the main tenets of construction grammar and usage-based approaches to language acquisition. This is followed by a discussion of determinants of construction learning (frequency, function, and contingency of form-function mapping). The next section of the chapter is dedicated to a thorough description of the corpus analysis and its results. Ellis and O'Donnell find that constructions are Zipfian in their type-token distributions in usage, selective in their verb form occupancy, and coherent in their semantics. They suggest that these characteristics make linguistic constructions robustly learnable by a statistical learning mechanism.

Christopher Conway, Michelle Gremp, Anne Walk, Althea Bauernschmidt, and David Pisoni discuss whether statistical learning abilities can be enhanced to improve language function. They begin by reviewing evidence highlighting the importance of statistical learning in language acquisition and processing. They then describe recent research that used computerized training techniques that were designed to improve working memory. This provides the background for a discussion of two studies that assessed the effectiveness of a new adaptive training task for improving domain-general learning abilities.
The first study focuses on adult subjects with normal hearing. The second study considers children who are deaf or hard of hearing. Conway, Gremp, Walk, Bauernschmidt and Pisoni's findings confirm that the basic mechanisms of learning and memory can be trained, and that training tasks such as theirs might be employed as an intervention for treating disorders of language and learning.

One of the widely discussed questions in implicit learning research is whether the knowledge acquired in sequence learning and artificial grammar experiments is, in fact, implicit. Zoltan Dienes presents a methodology for determining the conscious (explicit) and unconscious (implicit) status of knowledge. Dienes first provides a definition of unconscious knowledge. He then discusses different measures of awareness, with a special emphasis on subjective measures. After introducing the distinction between structural and judgment knowledge, Dienes then presents extensive evidence in support of subjective measures of awareness.

Phillip Hamrick and Patrick Rebuschat describe an experiment that investigated whether a typical statistical learning experiment results in implicit knowledge, explicit knowledge, or both. The experiment combined the cross-situational word learning paradigm (Yu & Smith, 2007) and the subjective measures of awareness developed by Dienes (this volume; Dienes & Scott, 2005). Subjects were exposed under either incidental or intentional learning conditions. Hamrick and Rebuschat found clear learning effects under both conditions. However, subjects in the intentional group developed both implicit and explicit knowledge, while the subjects in the incidental group developed primarily implicit knowledge. The experiment illustrates the usefulness of including measures of awareness when researching statistical learning.

Amy Perfors and Daniel Navarro explore the why and what of statistical learning from a computational modelling perspective. Perfors and Navarro propose that Bayesian techniques can be particularly useful for understanding what kinds of learners and assumptions are necessary for successful statistical learning. Their chapter begins with a brief introduction to Bayesian modelling, contrasting it with the other widely-used computational approach to statistical learning (connectionism). The remainder of the chapter is structured around a series of key questions: What is statistical learning? What data does statistical learning operate on? What knowledge does the learner acquire from the data? What assumptions do learners make about the data? What prior knowledge does the learner possess? Finally, why does statistical learning work?


Kenny Smith approaches the topic of statistical learning from an evolutionary perspective. Smith first describes generative and non-generative approaches to language universals and language evolution. He then discusses recent research on linguistic variation as a test-case for exploring debates on the link between learning biases and universals in language design. The chapter concludes with a discussion of the biological evolution of the language faculty.

Psyche Loui approaches the topic of statistical learning from a nonlinguistic perspective, with a special focus on music. The central thesis of her chapter is that much of our musical knowledge can be acquired by means of experience with the statistical regularities in the input. Loui begins her chapter with a discussion of the modality-independence of statistical learning and then briefly reviews research on how we acquire implicit knowledge of music. This sets the stage for a description of several of Loui's experiments on the acquisition of an artificial musical system by adult learners. The artificial system is based on the Bohlen-Pierce scale, a novel scale that is entirely different from existing musical systems. Loui's paradigm allowed her to address several important questions, e.g., What aspects of musical structure can be learned? How quickly can we acquire pitch, timbre, etc.? How much does emotion in music depend on statistical regularities? The chapter concludes with an outline of possible future directions.

Geraint Wiggins presents the Information Dynamics of Music (IDyOM) model of musical melody processing. A special feature of this model is its multidimensionality, i.e. it is capable of modelling perceptual phenomena whose percepts are multidimensional constructs. Importantly, even though it was designed as a model of melody processing, IDyOM can be applied to other, non-musical domains. Wiggins takes a strong view of statistical learning, in which statistical estimation is paramount in cognition. IDyOM is not presented merely as a way of capturing regularities in the observed data, but as a theory of the processing mechanism itself. That is, IDyOM is viewed as a simulation of actual cognitive processing. The chapter begins with a discussion of the relationship between language and music and a survey of the relevant literature in statistical linguistics. Wiggins then presents a detailed overview of IDyOM. The chapter concludes with a study that explored whether IDyOM is able to model a basic linguistic task (syllable identification) by means of the same information-theoretic principles that apply in melody segmentation.


Acknowledgements

The volume, and the symposium on which it was based, would not have been possible without the extensive help of many people. We would like to thank Ron Leow and Cristina Sanz, the organizers of the 2009 edition of GURT, for inviting us to organize a symposium, and we thank our presenters for making it such a successful event. With regard to the volume, we are most grateful to our authors for their excellent contributions and for agreeing to peer-review several chapters. Without their hard work, there would be no volume. At Mouton de Gruyter, we are very grateful to Cathleen Petree for proposing this project in the first place. Sadly, Cathleen passed away only a few months after we began working on the volume, and we are saddened that she is not around to see the finished book. We would like to thank Emily Farrell, our new editor at Mouton de Gruyter, for her continued support, and Wolfgang Konwitschny, our production editor, for his assistance in producing the volume. At Georgetown, we are very grateful to Elizabeth Kissling, who worked as our editorial assistant and made our lives significantly easier. Finally, several PhD students provided their feedback on the chapters, for which we would like to thank them: Phillip Hamrick, Katie Kim, Elizabeth Kissling, Julie Lake, and Kaitlyn Tagarelli.

References

Anderson, S. R., & Lightfoot, D. 2002. The language organ: Linguistics as cognitive psychology. Cambridge: Cambridge University Press.
Behrens, H. 2009. Usage-based and emergentist approaches to language acquisition. Linguistics, 47(2), 383–411.
Bulf, H., Johnson, S. P., & Valenza, E. 2011. Visual statistical learning in the newborn infant. Cognition, 121, 127–132.
Chomsky, N. 1966. Cartesian linguistics: A chapter in the history of rationalist thought. New York: Harper & Row.
Chomsky, N. 1986. Knowledge of language: Its nature, origin, and use. New York: Praeger.
Chomsky, N. 1988. Language and problems of knowledge: The Managua lectures. Cambridge, MA: MIT Press.


Conway, C. M., & Christiansen, M. H. 2006. Statistical learning within and between modalities. Psychological Science, 17, 905–912.
Crain, S., & Pietroski, P. 2001. Nature, nurture and Universal Grammar. Linguistics & Philosophy, 24, 139–186.
Dienes, Z. this volume. Conscious versus unconscious learning of structure. In P. Rebuschat & J. N. Williams (Eds.), Statistical learning and language acquisition. Berlin: Mouton de Gruyter.
Dienes, Z., & Scott, R. 2005. Measuring unconscious knowledge: Distinguishing structural knowledge and judgment knowledge. Psychological Research, 69, 338–351.
Ellis, N. C. 2006a. Cognitive perspectives on SLA: The Associative-Cognitive CREED. AILA Review, 19, 100–121.
Ellis, N. C. 2006b. Language acquisition as rational contingency learning. Applied Linguistics, 27, 1–24.
Elman, J., Bates, E., Johnson, M., Karmiloff-Smith, A., Parisi, D., & Plunkett, K. 1996. Rethinking innateness: A connectionist perspective on development. Cambridge, MA: MIT Press.
Fiser, J., & Aslin, R. N. 2002a. Statistical learning of higher-order temporal structure from visual shape sequences. Journal of Experimental Psychology: Learning, Memory, and Cognition, 28, 458–467.
Fiser, J., & Aslin, R. N. 2002b. Statistical learning of new visual feature combinations by infants. Proceedings of the National Academy of Sciences, 99, 15822–15826.
Gerken, L., Wilson, R., & Lewis, W. 2005. Infants can use distributional cues to form syntactic categories. Journal of Child Language, 32, 249–268.
Goldberg, A. E. 2006. Constructions at work: The nature of generalization in language. Oxford: Oxford University Press.
Gómez, R. 2007. Statistical learning in infant language development. In M. Gaskell (Ed.), The Oxford handbook of psycholinguistics (pp. 601–616). Oxford: Oxford University Press.
Gómez, R. L., & Gerken, L. 2000. Infant artificial language learning and language acquisition. Trends in Cognitive Sciences, 4(5), 178–186.


Graf Estes, K. M., Evans, J., Alibali, M. W., & Saffran, J. R. 2007. Can infants map meaning to newly segmented words? Statistical segmentation and word learning. Psychological Science, 18, 254–260.
Hamrick, P., & Rebuschat, P. this volume. How implicit is statistical learning? In P. Rebuschat & J. N. Williams (Eds.), Statistical learning and language acquisition. Berlin: Mouton de Gruyter.
Hauser, M., Newport, E. L., & Aslin, R. N. 2001. Segmentation of the speech stream in a nonhuman primate: Statistical learning in cotton-top tamarins. Cognition, 78, B41–B52.
Lieven, E., & Tomasello, M. 2008. Children's first language acquisition from a usage-based perspective. In P. Robinson & N. Ellis (Eds.), Handbook of cognitive linguistics and second language acquisition (pp. 168–196). London: Routledge.
MacWhinney, B. (Ed.) 1999. Emergence of language. Mahwah, NJ: Lawrence Erlbaum.
Maye, J., Weiss, D. J., & Aslin, R. N. 2008. Statistical phonetic learning in infants: Facilitation and feature generalization. Developmental Science, 11, 122–134.
Maye, J., Werker, J. F., & Gerken, L. 2002. Infant sensitivity to distributional information can affect phonetic discrimination. Cognition, 82, B101–B111.
Misyak, J. B., Goldstein, M. H., & Christiansen, M. H. this volume. Statistical-sequential learning in development. In P. Rebuschat & J. N. Williams (Eds.), Statistical learning and language acquisition. Berlin: Mouton de Gruyter.
Perruchet, P., & Pacton, S. 2006. Implicit learning and statistical learning: One phenomenon, two approaches. Trends in Cognitive Sciences, 10, 233–238.
Reber, A. S. 1967. Implicit learning of artificial grammars. Journal of Verbal Learning and Verbal Behavior, 6, 855–863.
Redington, M., & Chater, N. 1998. Connectionist and statistical approaches to language acquisition: A distributional perspective. Language and Cognitive Processes, 13, 129–191.
Roeper, T., & Williams, E. (Eds.) 1987. Parameter setting. Amsterdam: Reidel.
Saffran, J. R. 2003. Statistical language learning: Mechanisms and constraints. Current Directions in Psychological Science, 12, 110–114.


Saffran, J. R., & Thiessen, E. D. 2007. Domain-general learning capacities. In E. Hoff & M. Schatz (Eds.), Blackwell handbook of language development (pp. 68–86). Oxford: Blackwell.
Saffran, J. R., & Wilson, D. P. 2003. From syllables to syntax: Multi-level statistical learning by 12-month-old infants. Infancy, 4, 273–284.
Saffran, J. R., Aslin, R. N., & Newport, E. L. 1996. Statistical learning by 8-month-old infants. Science, 274, 1926–1928.
Saffran, J. R., Johnson, E. K., Aslin, R. N., & Newport, E. L. 1999. Statistical learning of tone sequences by human infants and adults. Cognition, 70, 27–52.
Saffran, J. R., Newport, E. L., & Aslin, R. N. 1996. Word segmentation: The role of distributional cues. Journal of Memory and Language, 35, 606–621.
Smith, L., & Yu, C. 2008. Infants rapidly learn word–referent mappings via cross-situational statistics. Cognition, 106(3), 1558–1568.
Thompson, S. P., & Newport, E. L. 2007. Statistical learning of syntax: The role of transitional probability. Language Learning and Development, 3, 1–42.
Tomasello, M. 2003. Constructing a language: A usage-based theory of language acquisition. Cambridge, MA: Harvard University Press.
Toro, J. M., & Trobalón, J. B. 2005. Statistical computations over a speech stream in a rodent. Perception & Psychophysics, 67, 867–875.
Yu, C., & Smith, L. B. 2007. Rapid word learning under uncertainty via cross-situational statistics. Psychological Science, 18, 414–420.

Statistical-sequential learning in development

Jennifer B. Misyak, Michael H. Goldstein and Morten H. Christiansen

1. Introduction

The child's mind is embedded in a flowing, stimuli-rich world of perceived regularities. Children learn to engage their surroundings skillfully, in a manner reflecting knowledge of astoundingly complex structural patterns. From infants to adults, studying the unsupervised, ostensibly unconscious nature underlying many of these early-acquired processes has continually fueled research in the fields of "implicit learning" and "statistical learning" since their onset. The convention has been to trace the empirical genesis of implicit learning research from the early work of Reber (1967) in the 1960s, and to follow modern statistical learning developments since the seminal work of Saffran and colleagues in the 1990s (Saffran, Aslin, and Newport 1996). This chapter, though, is not about tracing the vibrant histories of the implicit and statistical learning fields, but rather about interrelating the accrued developmental findings from both literatures – while in the service of promoting a synergistic fusion between their two approaches. For despite their disparately-pursued lines of work to date, establishing this common discourse may be easier to accomplish than otherwise presupposed. Witness, for instance, the close overlap among operational definitions provided by their paradigm progenitors.¹

Implicit learning has thus been related as "the process by which knowledge about complex stimulus domains is acquired largely independent of conscious awareness of either the process or the products of acquisition" (Reber and Allen 2000: 227). Reber's initial account (1967) also included characteristics such as the "efficient responding" of the organism and the development of "a strong sensitivity to the lawfulness [...] [existing in a] stimulus array"; in other words, participants may "become sensitive to the statistical nature of their environment without using explicit or verbalizable strategies" when the stimuli they receive is "patterned" or "ordered." He emphasized how this was "closely akin to Gibson and Gibson's (1955) perceptual learning and is [...] a rudimentary inductive process which is intrinsic in such phenomena as language learning and pattern perception" (863). Regarding statistical learning in its modern form, Saffran and collaborators introduced this as a "powerful mechanism for the computation of statistical properties of the language input" (Saffran, Aslin, et al. 1996: 1926), emphasizing the rapidity and adeptness by which infant learners incidentally extracted relevant regularities. This was otherwise termed "the process of statistical learning, or the detection of patterns of sounds, words, and classes of words in the service of discovering underlying structure" (Saffran 2002: 172) and could be relevant for learning complex (i.e., hierarchical) linguistic forms (Saffran 2001). As subsequent findings suggested that such learning was widely applicable to a variety of nonlinguistic domains, this empirical definition became more general, e.g., "the ability to discover units via their statistical coherence" (Saffran 2003: 111), and construed more broadly by researchers, e.g., as pertaining to "the discovery of statistical structure in the environment" (Gómez 2006: 90). Close terminological correspondences such as these invite parallels – or demand a more rigorous partitioning and defining of phenomena (see Perrig 2001 for analogous claims in conjunction with implicit memory and procedural knowledge).

[Footnote 1] While these characterizations are mostly limited to those of Reber and Saffran et al., key ideas surrounding "implicit learning" and "statistical learning" may also be connected to earlier theoretical contributions. The use of surface-level distributional information to identify relevant structure in language, for example, is a notion that has been recognized within structural linguistics and information theory by Bloomfield (1933), Harris (1955), and Shannon (1948). The formal metric of "transitional probability" in statistical learning segmentation studies was also provided within Miller and Selfridge's (1950) account for how "the statistical dependencies between successive units form the basis for a study of verbal context" (177). Despite methodological confounds in testing, Hayes and Clark's (1970) artificial segmentation experiment with adults was a forerunner to those of Saffran and colleagues. With regard to implicit learning, other researchers prior to Reber had also devoted attention to the subject of unconscious cognitive processes. Among these, Jenkins (1933: 471) wrote about "'incidental learning' – that is to say, learning which occurs in the absence of a specific intent to remember"; and Thorndike and Rock (1934) wrote about "learning without awareness of what is being learnt or intent to learn it" in an article of that same name. Clark Hull (1920) also discussed implicit/incidental learning phenomena in his published dissertation.

It is our contention, though, that the resemblances in this case point to a common learning mechanism(s)² – albeit historically seen from different theoretical orientations and empirical traditions – and hereafter referred to as statistical-sequential learning. That is, much of the phenomena revealed through statistical learning and implicit learning approaches concerns the learning of sequential material and largely taps into the same probabilistically-sensitive, associative-based mechanism(s) recognized as belonging to "statistical learning" proper.³

One potential discrepancy, however, between the descriptions above relates to Reber's notion of the process as proceeding via unconscious rule abstraction, in which the participant tacitly apprehends "a valid, if partial, representation of the actual underlying rules of the [finite-state] language" (Reber and Allen 1978: 191). Conversely, statistical learning is construed as a process driven by statistical properties of the input, which results in participants' probabilistic knowledge of the constraints governing stimuli formation. However, the sensitivity of participants to the "lawfulness" per se in sequentially arrayed material of implicit learning experiments has been sharply contested and has given rise to statistically-based explanations (without symbolic or abstract rules) of the computational principles entailed by successful performance (e.g., Cleeremans 1993). Although still an active matter of debate, if one is convinced by converging evidence that such learning is indeed driven by sensitivity to the statistical properties of the stimuli, then this natural affinity between the two fields should be readily apparent.

[Footnote 2] Statistical learning may involve multiple subsystems that are modality-specific and that operate in parallel over distinct perceptual dimensions (Conway and Christiansen 2006, 2009; for analogous theoretical views in the traditional implicit learning literature, see Goschke, Friederici, Kotz, and van Kampen 2001; Keele, Ivry, Mayr, Hazeltine, and Heuer 2003). Additionally, it is not known whether statistical learning for adjacent and nonadjacent dependencies respectively – two types, or aspects, of statistical learning performance – entails shared or separate processing mechanism(s) in adult learners (a question raised by findings in Misyak and Christiansen, 2012); see also Friederici, Bahlmann, Heim, Schubotz, and Anwander (2006) and Pacton and Perruchet (2008). Hence, wherever wording to the effect of "a [statistical learning] mechanism" may be encountered in the text, this should be interpreted in a potentially distributive sense without necessarily inferring singularity.

[Footnote 3] By using the term "statistical-sequential learning" as denoting the particular convergence of many findings across the statistical and implicit learning fields with respect to at least one kind of common underlying mechanism (i.e., probabilistic, associative-based and sequential), we are not suggesting that the merger of findings from the two fields cannot be construed as forming other meaningful overlaps (e.g., with respect to more "implicit" learning processes).

They both entail incidental learning of sequential patterns (in spatiotemporal, temporal, or visually-arrayed distributions) that are defined by statistical relations over units perceived by the learner and that are processed in a manner respecting intrinsic regularities or probabilistic constraints of the input.

Some researchers have realized this connection. For instance, following in similar vein to the perspective informing the collection by Ellis (1994), Saffran, Newport, Aslin, Tunick, and Barrueco (1997) recognized that the literatures on "incidental learning" (which includes implicit learning and frequency estimation research) and natural language acquisition (which includes statistical learning, by this view) "would each be well served by a consideration of the theoretical and empirical concerns of the other" since these mutually suggest "pertinent" mechanisms for respective researchers (104). Perruchet and Pacton (2006: 237) delineated the growing convergence between results in the implicit learning and statistical learning fields, concluding that they appear to be "one phenomenon" that explores "the same domain-general incidental learning processes." They also note the increasing cross-referential synonymy of "implicit" and "statistical" learning terms, mentioning the example of Conway and Christiansen's (2006) coinage of "implicit statistical learning" as emblematic of their potentially future confluence. However, beyond such claims and cross-references, there has been little (if any) attempt towards truly integrating and synthesizing findings across these wide literatures. Stronger efforts for communication between the literatures should be encouraged, as it would simultaneously widen and deepen the knowledge base for researchers in both fields. This chapter is a modest step in that direction; namely, it aims to underscore and support the theorized affinity of the statistical and implicit learning fields by providing a synthesis among findings to date. Its scope is confined to a human developmental context for two reasons: to fill in pre-adulthood timeline gaps from the canonical statistical learning literature alone, as well as to complement studies from the implicit learning literature that yield some equivocal findings during infancy; and to direct attention to the largely unasked but important question of developmental change.

In addressing developmental change, this chapter may be admittedly considered unorthodox. Developmental invariance is one of the central tenets, or corollaries, falling out of the theory on unconscious cognition posited by Reber, in which implicit learning is viewed as recruiting upon phylogenetically conserved and evolutionarily stable processes of high, basic adaptive value since antiquity ("the primacy of the implicit"; Reber, 1993).

By Reber's framework, implicit learning has been expected to exhibit age independence, neurobiological robustness, little intraindividual variation, and remarkable cross-species commonality. These assumptions have in turn deterred many implicit learning researchers from directly seeking developmental trends. And the assumption of age-independence seems even to have found its way into the canonical statistical learning literature; see especially Saffran et al. (1997, but note the conflicting evidence for developmental differences later found in Saffran 2001).

Such largely unconscious processes may indeed have basic and evolutionarily old roots, as well as recruit upon mechanisms shared across species. Fittingly, researchers from both implicit and statistical learning fields have specifically invoked parallels to principles from the classical Rescorla-Wagner model (1972) of animal learning along several lines, e.g., regarding the detection of predictive co-variation of stimulus events (Reber and Allen 2000), the similar subjection to perceptual constraints (Creel, Newport, and Aslin 2004), the use of prediction-based estimation from conditional statistics, or contingent probabilities (Aslin, Saffran, and Newport 1998; Hunt and Aslin 2001; Swingley 2005), and in relation to attention-based accounts of statistical learning (Pacton and Perruchet 2008). Despite some cross-species commonalities⁴, though, earlier claims of implicit learning's (and by inference, statistical learning's) neurobiological robustness across impaired populations, and reputed lack of substantive differences across individuals, are being eroded by converging, recent evidence suggesting that systematic variations do in fact exist. Even Reber and Allen (2000) more recently have conceded the existence of some individual differences, referring back for support to findings from Reber, Walkenfeld, and Hernstadt (1991); they conversely argue now for which theoretical framework should be used to interpret these differences. Amid such shifting ground, this chapter's ancillary aim is to reappraise the remaining, fundamental postulate of developmental invariance, with ensuing implications for our understanding of the nature of statistical-sequential learning and the factors by which it may be influenced.

[Footnote 4] There may, nonetheless, be important differences in both quantitative performance and the nature of limitations on statistical-sequential learning abilities across humans, non-human primates, and non-primates (Conway and Christiansen, 2001; Newport, Hauser, Spaepen and Aslin 2004; Saffran et al. 2008; Toro and Trobalón 2005; see also related discussion by Weiss and Newport 2006).


The remainder of this chapter is organized into five main divisions. The next section considers various paradigm implementations used in the developmental statistical and implicit learning literatures. Subsequently, in Sec. 3, attention is directed towards areas of overlap between implicit and statistical learning research, the convergence of which delimits the statistical-sequential learning phenomena discussed throughout this chapter. In Sec. 4, we highlight various aspects of infants' and children's statistical-sequential learning as they relate to the processing of sequential relations and the tracking of probabilistic dependencies. With this punctuated empirical review in place, major developmental trends are then identified and further elaborated upon with regard to potential underlying factors (Sec. 5). The conclusion then ties together prospects and future directions for one way of bridging the implicit and statistical learning literatures within a developmental context.

2. Common ground amid paradigmatic diversity

Before commencing our empirical overview/synthesis, a few words on methodology are in order. Understanding the basics and logic of four prominent paradigms will stand the reader in good stead through the remainder of the larger discussion that follows. Our exposition of these paradigms will proceed in chronological order of their introduction.

In the first paradigm – artificial grammar learning (AGL; Reber 1967) – participants are typically instructed to memorize or observe exemplars presented during a training phase. Often, these exemplars are visual letter strings (e.g., PTTVPS) generated from an artificial finite-state grammar, but they can in principle be composed of any distinctive set of stimulus tokens varying along a perceptual dimension (e.g., auditory nonwords, musical tones, shapes) that are arranged in sequence according to the grammar (see Figure 1). Importantly, participants are not apprised of the existence of any underlying regularities until the test phase, when they are informed that the stimulus strings follow a set of rules specifying the particular orderings among constituents. Without, however, being told the precise nature of these rules, participants are then asked to classify additional strings as either "grammatical" or "ungrammatical," relying upon intuitions or impressions of familiarity to guide their judgments. Participants typically achieve above-chance classification performance on the task, even when test items comprise grammatical exemplars that were never directly encountered in training (i.e., requiring generalization of the grammar to new strings) and despite being unable to provide verbal reports of actual patterns or rules. (Participants usually claim that they were merely "guessing.")


Figure 1. An illustration of an artificial finite-state grammar adapted from Reber and Lewis (1977). Strings are generated by starting at the leftmost node and following possible paths marked by the arrows to other nodes. The succession of letters associated with the arrows encountered along the traced path corresponds to a grammatical string sequence. For example, following the arrow from node 1 to 3, the arcing arrow back to 3, and then the respective arrows to nodes 5, 4 and 6 produces the letter string PTTVPS.
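To make the generation procedure concrete, the short Python sketch below performs such a random walk over a small transition table. The particular nodes and arc labels are only loosely modelled on Figure 1 and are our own illustrative assumptions rather than a faithful reproduction of Reber and Lewis's (1977) grammar.

import random

# Illustrative finite-state grammar: each node lists its outgoing arcs as
# (emitted letter, next node) pairs; node 6 is the terminal node. The arcs
# are assumptions for illustration, not the exact published grammar.
GRAMMAR = {
    1: [("P", 3), ("T", 2)],
    2: [("S", 2), ("X", 4)],
    3: [("T", 3), ("V", 5)],   # the self-loop on node 3 permits repeated Ts
    4: [("X", 3), ("S", 6)],
    5: [("P", 4), ("V", 6)],
}

def generate_string(max_len=12):
    """Walk the grammar from node 1 until the terminal node (or a length cap)."""
    node, letters = 1, []
    while node != 6 and len(letters) < max_len:
        letter, node = random.choice(GRAMMAR[node])
        letters.append(letter)
    return "".join(letters)

random.seed(0)
print([generate_string() for _ in range(5)])  # five grammatical training-style strings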

These performances have been construed as evidence for participants' incidental encoding of the regularities of the grammar during training and their manifestation of this knowledge through meeting the task demands during test; and, as alluded to earlier, they are well suited to statistically based accounts (though not always without dispute) of the computational processes that mediate successful performance. Thus, participants may evince knowledge of complex statistical relationships, even in the absence of reported awareness of any underlying structure and without direct intentions to discover such regularities.

Although artificial grammar learning remains a fruitful paradigm within both implicit learning and statistical learning work, few studies have been conducted with children, especially as standard grammaticality judgments require metacognitive skills not present very early in development. There are a few informative exceptions, though, for work with older children ages nine to eleven (Don, Schellenberg, Reber, DiGirolamo, and Wang


2003; Gebauer and Mackintosh 2007; van den Bos 2007), and a couple of studies reported in the standard statistical learning literature, with children of six to seven years (Saffran 2001, 2002). Furthermore, similar methodological principles can be seen as informing the design and interpretation of other experiments in very young children. Accordingly, classic statistical learning studies with children and infants have involved familiarizing participants with carefully manipulated, frequency-balanced subsets of stimulus strings, sequences, or streams from a grammatical 'corpus' or miniature language, and then probing for sensitivity to the statistical relations by measuring more naturalistic behavioral responses to statistically consistent and inconsistent test items (more on this below).

A paradigm that has been successfully extended to adults and children alike is the serial reaction-time paradigm (SRT; Nissen and Bullemer 1987; informative studies with child participants include Bremner, Mareschal, Destrebecqz, and Cleeremans 2007; Meulemans, Van der Linden, and Perruchet 1998; Thomas and Nelson 2001; and Thomas et al. 2004). In a prototypical task instantiation, participants are asked to respond as quickly and accurately as possible to trials of presented "targets" (e.g., illuminated lights) occurring at discrete locations on a computer screen, with each location mapping onto a particular response key or button. Unbeknownst to participants, target appearances follow a repeating or probabilistic sequence of locations. After many trials, participants become increasingly adept at anticipating and responding swiftly to the targets. When the predictive sequence is disrupted, however, either through "noisy," interspersed sequence-breaks or a continuous block of trials consisting of randomly generated target locations, accuracy decreases and response latencies increase; when target locations conform again to the training sequence, participants' reaction-time (RT) performance dramatically "recovers" (e.g., Schvaneveldt and Gómez 1998; Thomas and Nelson 2001). Because of the indirectness of the instructions and the task demand for speeded responses, which discourages explicit reflection and strategizing, SRT work has yielded convincing demonstrations of participants' sensitivity to violations of sequential structure and of incidental learning for sequence-embedded patterns.
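For readers who want the logic of the dependent measure spelled out, the sketch below constructs structured and random SRT blocks and notes, in comments, how the learning index is typically derived; the particular sequence, block length, and location set are placeholders of ours rather than the parameters of the studies just cited.

import random

def srt_block(sequence=(3, 1, 2, 4, 1, 3, 4, 2, 1, 4), n_trials=80,
              random_block=False, seed=4):
    """Return target locations for one SRT block.

    Structured blocks cycle through a fixed repeating sequence; a 'random'
    transfer block draws unpredictably from the same set of locations.
    Sequence and block length are illustrative placeholders."""
    rng = random.Random(seed)
    locations = sorted(set(sequence))
    if random_block:
        return [rng.choice(locations) for _ in range(n_trials)]
    return [sequence[i % len(sequence)] for i in range(n_trials)]

# Learning is typically indexed by the RT cost when the trained sequence is
# disrupted (mean RT in the random block minus mean RT in adjacent structured
# blocks) and by the RT 'recovery' once the trained sequence resumes.
print(srt_block()[:12])
print(srt_block(random_block=True)[:12])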


For the youngest of subjects, however, other methods are necessary to assess incidental sequence knowledge. In the implicit learning literature, the visual expectation paradigm (VExP; Haith, Hazan, and Goodman 1988; Haith, Wentworth, and Canfield 1993) has been used with infants as young as two months to investigate their formation of expectations for upcoming visual events comprising predictable sequential patterns. A video monitor displays pictures for brief durations, separated by interstimulus intervals (without intervening visual input), and projected in distinct locations relative to the center of the infant's visual field (i.e., left versus right, up-down, left-center-right, or within a triadic-pivot arrangement); the location and/or timing of visual events furthermore accord with either a predictive or a randomly ordered series. There are generally three dependent variables of main interest: reaction times (RTs) for eye saccades to correctly anticipated upcoming stimulus locations in the predictive sequences (compared against RTs for non-predictive sequences); the frequency of accurate anticipatory (i.e., non-reactive, as opposed to elicited) saccades to target locations comprising the predictive series; and any facilitation effect on RTs (i.e., shorter latency to shift fixation to a predictive location upon the onset of a visual event). Importantly, infants do shift their visual fixations to the location where a future picture will appear prior to that picture's actual onset. Thus, it has been possible for researchers to obtain a behavioral index of infants' expectations of visual event sequences by measuring anticipatory RTs (assessed against an appropriate RT baseline for when events unfold in a relatively unpredictable manner). Such work has indicated that infants at a very early age rapidly form expectations based on detecting basic spatiotemporal regularities governing the predictive sequences.

Finally, the early infant statistical learning studies (e.g., Aslin et al. 1998; Gómez and Gerken 1999; Saffran, Aslin et al. 1996) have employed variants or adaptations of existing infant habituation-dishabituation and preference methods to investigate statistical learning. They have used syllables, nonwords, tones, or shapes as the elements instantiating the statistical relations of their artificial training grammars or sequences. Thus, for example, the word-segmentation study of Aslin et al. used a familiarization method to expose eight-month-old infants to a three-minute continuous speech stream (e.g., pabikugolatudaropitibudo...) consisting of four trisyllabic nonce words concatenated together in random order, without immediate word repetitions and with each word occurring with a controlled frequency across the stream. While there were no acoustic cues (pauses, prosodic contours, etc.) marking artificial word boundaries, the edges of the nonce words could be successfully segmented or "extracted out" from the continuous sound sequence by utilizing statistical information governing sequence-element transitions: namely, word-internal, successive syllable transitions (P = 1.00) carry higher conditional probabilities than pairwise syllables straddling word boundaries (for this study, P = .50). To assess whether infants were in fact sensitive to such cues for segmenting


the stream, testing involved twelve trial presentations of two types of test sequences: repetitions of either single words (e.g., pabiku) or repetitions of part-words (e.g., tudaro). Infants demonstrated that they could discriminate between the two types on the basis of the relevant distributional statistics, orienting reliably longer in the direction of the loudspeaker on trials in which it emanated the words of the artificial language. Capitalizing on the attentional and natural orienting responses of infants, differential looking performance at test thus provided a measure of successful discrimination, and hence statistical sensitivity, in preverbal infants. This work exemplifies the prototypical design structure of many infant statistical learning studies to date, and these in turn have contributed valuably towards our understanding of statistical learning mechanisms.
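The segmentation logic just described can be restated computationally. The Python sketch below builds a toy syllable stream from four nonce words, tallies forward transitional probabilities, and shows the dip at word boundaries; the miniature lexicon, stream length, and the resulting boundary value (about 1/3 here, rather than the .50 produced by the frequency-balanced design above) are illustrative assumptions, not the materials of Aslin et al. (1998).

import random
from collections import Counter

# Toy lexicon of four trisyllabic nonce words (illustrative, not actual stimuli).
WORDS = [("pa", "bi", "ku"), ("go", "la", "tu"), ("da", "ro", "pi"), ("ti", "bu", "do")]

def make_stream(n_words=300, seed=1):
    """Concatenate randomly ordered words with no immediate repetitions."""
    rng = random.Random(seed)
    stream, prev = [], None
    for _ in range(n_words):
        word = rng.choice([w for w in WORDS if w is not prev])
        stream.extend(word)
        prev = word
    return stream

def forward_tps(stream):
    """P(next syllable | current syllable) for every attested syllable pair."""
    pair_counts = Counter(zip(stream, stream[1:]))
    first_counts = Counter(stream[:-1])
    return {pair: count / first_counts[pair[0]] for pair, count in pair_counts.items()}

tps = forward_tps(make_stream())
# Word-internal transitions (e.g., pa->bi) come out at 1.0, whereas transitions
# straddling a word boundary (e.g., ku->go) are much lower, so positing a
# boundary wherever the transitional probability dips recovers the nonce words.
for pair in [("pa", "bi"), ("bi", "ku"), ("ku", "go")]:
    print(pair, round(tps.get(pair, 0.0), 2))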

3. Phenomenological boundaries

Across various experimental designs employed in the implicit and statistical learning literatures, and from infancy to early adolescence and beyond, individuals display an exquisite sensitivity to statistical aspects of their environments. The range of statistical information permitted on such accounts is theoretically broad in principle, as the underlying perspective endorsed here is that the statistical computations realized by the learner are carried out by mechanisms capable of extracting and integrating statistical information from multiple sources to home in on the most reliable regularities of the input, given as well the learner's constraints and perceptual biases. This does not deny, however, the advantages of investigating different types of statistical cues and identifying the contexts in which they may differentially aid the discovery of structure.

Accordingly, within any sequentially distributed input, there are a priori potentially many statistical cues available to the learner: (simple) frequency, co-occurrences, transitional probabilities (and "conditional probabilities" more generally, which can describe nonadjacent relationships), and higher-order conditionals (e.g., second-order, third-order, ... nth-order probabilities). Researchers have also claimed psychological plausibility for other metrics, such as predictive probability (v. Rescorla 1967; "DeltaP," Shanks 1995) and normative [bi-directional] contingency (see Perruchet and Peereman 2004). Successful empirical demonstrations of statistical-sequential learning thus do not necessarily imply that learning must be defined or accounted for in terms of one particular prominent statistical metric, or only a couple, already demonstrated in prior work (e.g., forward and backwards


transitional probabilities; Jones and Pashler 2007; Pelucchi, Hay, and Saffran 2009a, 2009b; Perruchet and Desaulty 2008).

Despite the clear relevance of frequency information (whether simple or normalized) for statistical learning mechanisms, the research reviewed in this chapter will not focus on raw frequency as a type of statistical cue. Frequency effects are ubiquitous across many cognitive-developmental domains and pertain to a diverse plethora of phenomena extending outside traditional implicit learning and statistical learning terrain; so studies in frequency estimation research per se (cf. Hasher and Zacks 1984) may be less likely to strongly constrain theorizing specific to the operations of statistical-sequential learning mechanisms. Instead, interested readers are directed to the helpful review compiled by Zacks and Hasher (2002) that places twenty-five years of frequency processing work within the context of a range of human behaviors – including, notably, language acquisition, statistical learning, constraint-based sentence processing, and Bayesian reasoning.

For different reasons, other interesting work in the implicit learning literature has been omitted here as well. Despite the rapprochement between various implicit memory and statistical learning phenomena, certain work within the former (i.e., pertaining to perceptual/conceptual priming, eyeblink conditioning, complex visual search, motor pursuit tracking) is excluded from our discussion, as it does not principally entail processing of sequentially distributed forms of information. Also not discussed here are phenomena with unclear or tenuous connections to statistical-sequential learning mechanisms, as they do not straightforwardly entail the processing of both statistical and sequential information; i.e., contextual cueing (e.g., in atypically and typically developing populations: Roodenrys and Dunn 2008; Vaidya, Huger, Howard, and Howard Jr 2007), covariation detection (in older children: Fletcher, Maybery, and Bennett 2000; Maybery, Taylor, and O'Brien-Malone 1995), "dynamic systems control" tasks (as part of a cross-methodological investigation: Gebauer and Mackintosh 2007), invariance learning (cf. Lewicki, Czyzewska, and Hoffman 1987), and "probabilistic classification" tasks (cf. Knowlton, Squire, and Gluck 1994).

Finally, a few caveats are in order concerning whether the learning of repeating sequences (of which asymmetric series alternations, as in VExP work, are a simple example) resides within the purview of statistical-sequential learning mechanisms. It may be objected, for instance, that a different kind of mechanism could potentially mediate the learning of sequentially arrayed input in fixed and repeating patterns – such as one mechanism subserving


simple rote memory effects, and another mechanism ranging over more probabilistically patterned, continuous input. Individuals certainly can and do engage in explicit memorization and recall of digits, letters, etc., with attention paid to the consecutive ordering of units – and such intentional processes are not the focus of this chapter. But the incidental, largely implicit, and (by many accounts) automatic nature of the learning processes that we review would seem to militate against this explanation, and it poses difficulties for the differential application of such dual mechanisms prior to discovering the nature of the sequential input along such lines. And the parallel operation of both types of hypothetical mechanisms is less parsimonious, especially when a powerful (yet simple) mechanism can adeptly handle each. A dual-view hypothesis may be further difficult to reconcile with a larger theoretical perspective in which sequence memory is inextricably tied to sequential processing capabilities (MacDonald and Christiansen 2002), as suggested for instance by neural network models widely used in statistical and implicit learning research (e.g., Cleeremans and McClelland 1991; Dienes 1992; Keele and Jennings 1992; Mirman, Graf Estes, and Magnuson 2010; Misyak, Christiansen, and Tomblin 2010; Servan-Schreiber, Cleeremans, and McClelland 1988).

Further, the most widely cited documentation of statistical learning in the developmental literature comes from word segmentation studies (i.e., as in the original statistical learning studies described in Sec. 2), in which transitional probabilities are theorized to be computed for pairwise elements within a single sequence composed of fixed sequence-fragments. That is, the artificial trisyllabic words are fixed, contiguous orderings of certain phonemes, the latter of which are concatenated together to form a continuous input stream with probabilistic orderings among such fixed orderings. These studies naturally do not presuppose the recognition of such nonce words as units over which more probabilistic computations are performed. Locating the boundaries among such constituents is in essence the goal of the segmentation task. The continuous nature of the speech stream and the novelty of the artificial words necessitate sensitivity to differences in conditional probabilities between each of the pairwise syllables (irrespective of their absolute or graded contingency values) in order to identify the relevant word boundaries in the service of processing the speech input on-line.

It may still be, though, that some forms of fixed sequential learning, such as learning for lists of pre-individuated serial items, invoke different encoding and representation strategies – with learning for such fixed (typically "singular" or non-repeating, as well as truncated in length)


sequences relying more on sensitivity to ordinal than to associative information among series-internal elements. Evidence has been reviewed suggesting that humans and non-human primates may track the ordinality of relations among serial elements when presented in such a manner (see Conway and Christiansen 2001). Yet sequential statistical learning in an artificial grammar task also shows modality-specific effects paralleling known auditory-visual effects of recency and primacy in serial recall (cf. Conway and Christiansen 2009; see also discussion in Conway and Pisoni 2008). So even if learning of fixed sequences in these circumscribed contexts recruits different encoding and representation strategies from learning non-fixed sequences, they may still point to similar constraints or principles operative in both. For example, linguistic combinatorial structure is both probabilistic and deterministic. Fixed sequences composing words may be initially identified on the basis of distributional information and later comprise the units (or "chunks") for the fixed sequences constituting idioms and stock phrases, as well as the non-fixed sequences characterizing novel sentences (see McCauley and Christiansen 2011 for a possible model). And perhaps this is so, mutatis mutandis, for admixtures of fixed, deterministic, and probabilistic micro- and macro-structures in other developmental domains (e.g., visual scene processing and object-parts/object-based recognition, where dimensional features may be perfectly or variably correlated).

Given these arguments, this chapter does not delve into incidental learning of ordinal relations (for a recent example, see Lewkowicz 2008). As our understanding of underlying mechanisms deepens, it may be prudent to reexamine such manifestations of learning more closely with respect to the claim of shared mechanisms. However, findings from humans' learning of continuous, fixed, repeating sequences will be included in our discussions here. As a small confirmation that this may currently be the right approach, an emerging appreciation for VExP results in the statistical learning field may already be underway. For instance, Saffran and Thiessen (2007: 74) recently noted that transitional probabilities may be "only one particular example of statistical learning" if one more broadly considers evidence for the learning of regularities in one's natural environment; in this regard, they acknowledge the important findings of Canfield and Haith (1991) concerning the learning of predictive event sequences in preverbal infants. It should also be noted that simple, repeating sequences comprise only part of the relevant literatures we review, and that sequences that are probabilistic in nature have also been studied. Thus, we turn in the next section to an overview of statistical-sequential learning in development, beginning with VExP findings and unambiguous repeating sequences, and


bridging over to work on context-dependent sequences, probabilistic dependencies, and other statistical structures.

4. A sketch of the learning landscape

At their core, the implicit and statistical learning literatures speak to fundamental processes underlying a diverse panoply of incidentally acquired, complex skills. Accordingly, the subsections below emphasize the broad nature of statistical-sequential learning mechanism(s) across individuals. Findings from both literatures are thereby briefly highlighted with respect to general characterizations that apply widely across human cognitive-developmental domains.

4.1. Learning fixed, continuous sequences

4.1.1. Asymmetric or simple, repeating sequences

As early as two months, infants in VExP studies show evidence of forming expectations of upcoming stimulus locations from symmetrically alternating series (i.e., in Left-Right or Right-Left patterns) (Wentworth and Haith 1992). By three months, infants also exhibit faster and more frequent anticipatory saccades to asymmetric 2/1 repeating series (e.g., L-L-R) and predictive 3/1 (L-L-L-R, or R-R-R-L) patterns (Canfield and Haith 1991). Older infants (by about eight or twelve months) tested in the VExP paradigm also display some anticipatory gaze behavior for upcoming visual targets whose locations form a predictable triadic-pivot series (i.e., ABCBABCB..., with the "B" location as the series' pivot point among the three locations) (Reznick, Chawarska, and Betts 2000).

It has been noted that use of the term "expectation" need not imply explicit recognition of patterns (Reznick et al. 2000), and the VExP paradigm itself is probably best considered along the lines of a procedural task (cf. Nelson 1995), in which skilled performance commonly reflects incidental learning and the coordination of complex sequential input with motor responses. Regarding sequence-specific knowledge, while between-subjects VExP analyses provide evidence for global probability matching (e.g., greater eye shifts back to the more frequent, or "home-side," location in 3/1 than in 2/1 asymmetric conditions), within-subjects analyses also indicate sensitivity to spatiotemporal regularities inhering over and above simply the proportion of picture appearances to a given side. For example, in the 3/1 condition, infants are more likely to appropriately shift to the


less frequent "target" side after the third "home-side" event appearance than after the second or first "home-side" event. Given the specific experimental design of a study, encoding may also extend beyond location to accommodate specific (visual) event content, inter-event contingencies, and the temporal flow rate of the stimulus sequences (Adler, Haith, Arehart, and Lanthier 2008; Wentworth and Haith 1992; Wentworth, Haith, and Hood 2002).

VExP measures exhibit moderate internal consistency and reflect stable individual differences over the short term in early infancy (Haith and McCarty 1990; Rose, Feldman, Jankowski, and Caro 2002). But evidence for age-related differences within the first year is partly equivocal, with both longitudinal and cross-sectional designs reporting improvements up to nine months, but no improvements (and fewer anticipations) between nine and twelve months (Canfield et al. 1997; Reznick et al. 2000). Rose and colleagues, using a longitudinal design, did find support for increasing anticipations from seven to twelve months with a traditional cut-off of 200 ms distinguishing anticipatory from reactive saccades, but not when employing a more conservative criterion of 150 ms. In these cases, the difficulty of establishing an appropriate cutoff amid substantial individual variability in response latencies was further compounded by the increased processing speed associated with higher ages across the first year of infancy. While Reznick and colleagues have posited an underlying change in the nature of the expectations formed at twelve months, corresponding to maturation of the medial temporal lobes, Canfield et al. have suggested that something as simple as motivational requirements may have been at issue – that is, what may have been an interesting visual stimulus set for younger infants may be considerably less engaging for the oldest infants in the group, resulting in their underperformance. More systematic studies are needed, though, to confirm this conclusion.

In summary, within the two minutes or shorter period of exposure to a repeating, symmetric or asymmetric series, infants throughout the first year demonstrate remarkably rapid on-line facilitation and anticipation for basic regularities intrinsic to the independently unfolding spatiotemporal sequences in their visual environment.

4.1.2. Context-dependent sequences

In context-dependent (i.e., nth-order) progressions, the occurrence of a sequence-element depends upon the context associated with its preceding element. For example, given a 1-2-1-3-1-2... sequence, being able to anticipate the location after a "1" requires knowing the temporal context of


whether a "2" or a "3" preceded the "1." Sequences with context-dependent transitions can be either deterministic (repeating) or probabilistic.

While four-month-olds have shown above-chance accuracy in anticipatory saccades for unambiguous, simple repeating sequences (e.g., 1-2-3) mapping onto a triangular configuration of spatial locations, they seem unable to perform above chance for the context-dependent transitions of more ambiguous sequences (e.g., 1-2-1-3) given roughly comparable exposure time (i.e., 27 and 32 single-location trials, respectively, corresponding to 9 and 8 sequence repetitions each) (Clohessy, Posner, and Rothbart 2001). Clohessy et al. further assessed performance for these same sequences in ten- and eighteen-month-olds. Ten-month-olds did not show anticipatory learning for the context-dependent sequence (even when exposure was doubled, i.e., expanded to 2 sessions), but eighteen-month-olds could show anticipations for both types of sequences. Bremner et al. (2007) found successful learning for two-year-olds performing on a six-element deterministic spatial sequence (e.g., A-C-B-D-A-B) and a subsequent generation task using an adapted SRT paradigm. This is quite notable because, previously, sequence learning in SRT paradigms had not been conducted with children younger than four years, as in Thomas and Nelson's (2001) study. However, the performance evidenced by the two-year-olds may also be considered an important extension of the sequential learning skills evidenced at eighteen months in Clohessy et al.'s VExP study for anticipating a deterministic sequence with fewer elements.

Across two studies with children of four, six to seven, and ten to eleven years, performance on a deterministic ten-element SRT task (e.g., 1-3-2-4-1-2-3-4-2-4) showed similar learning magnitudes across age groups (Meulemans et al. 1998; Thomas and Nelson 2001). These findings are consistent with standard expectations in the literature that, although general processing speed improves with age, the sequence learning effect nonetheless remains comparable across age groups. However, Thomas and Nelson reported that the "number of anticipatory button presses to correct locations show[ed] evidence of developmental change" (2001: 375). Nonetheless, they refrained from forming conclusions about developmental changes in implicit learning as such, under concerns that this measure might be construed as tapping more "explicit" learning. However, a functional magnetic resonance imaging (fMRI) study conducted later by Thomas et al. (2004), again with seven- and eleven-year-olds, reported evidence of differential neural recruitment by children and adults on an SRT task, the latter of whom also performed significantly better on the learning index. Particularly, there were age-related differences in neural activity for premotor cortex, putamen, hippocampus, inferotemporal


cortex, and parietal cortex. The sharpest age discrepancy was in greater recruitment of fronto-striatal circuitry and hippocampal activation in adults, although such activity in these regions was not significantly correlated with the magnitude of the learning effect.

4.2. Tracking probabilistic dependencies

In this subsection, we shift attention from work on repeating sequences to studies employing non-repeating sequences or artificial grammars.

4.2.1. Adjacent dependencies

The event-related brain responses of sleeping neonates indicate that the ability to use statistical cues (such as co-occurrence frequencies) to discriminate lexical boundaries among adjacent phonemes in a continuous artificial speech stream is present as early as one-half to two days after birth (Teinonen, Fellman, Näätänen, Alku, and Huotilainen 2009). Using behavioral measures, infants by two months are also sensitive to the co-occurrence frequencies obtaining across a continuous stream of geometric shapes with reliable shape-pairings (bigrams) (Kirkham, Slemmer, and Johnson 2002). Beyond sensitivity to co-occurrences, studies further show robust statistical segmentation processes using transitional probabilities by five-and-a-half months (E. K. Johnson and Tyler 2010). This finding suggests an earlier age for successful learning performance analogous to that demonstrated in the premier studies of statistical learning, conducted with eight-month-olds (as elaborated in greater detail in Sec. 2) (Aslin et al. 1998; Saffran, Aslin, et al. 1996). And at twelve months, infants appear able to use adjacent probabilities to concurrently track pairwise syllables and nonwords belonging to an artificial language (Saffran and Wilson 2003), and as a first step towards learning form-based categories from nonword sequences of aX and bY strings (Gómez and Lakusta 2004). Interestingly, in the latter study, infants can generalize even where there is some inconsistency in the input, e.g., an 84/16 consistent-to-inconsistent ratio (but do not show generalization in a 68/32 condition).

Skipping ahead to about six years of age, earlier word segmentation and AGL work with young child learners of six to nine years would suggest that statistical-sequential learning effects may be age-invariant (e.g., Don et al. 2003; Saffran et al. 1997). However, a later study by Saffran (2001) provided evidence for clear age differences. Six- to nine-year-old children and adults were both trained and tested on an artificial grammar containing predictive dependencies. While all participants demonstrated significant


learning, the adults consistently outperformed the children, prompting Saffran to write that the results "suggest that children may possess a limited ability to acquire syntactic knowledge via statistical information. While their performance was not as strong as the adults', the children did acquire rudimentary aspects of the phrase structure of the language" (508). Another study, by van den Bos (2007), compared the learning performance of ten- and eleven-year-old children with that of adults on an artificial grammar task, varying the usefulness of the underlying structure with respect to a cover task. Although the qualitative learning effect was the same, adults in the study acquired quantitatively greater knowledge of second-order dependencies than did the children. And finally, a study of visual statistical learning using a triplet segmentation task (Arciuli and Simpson 2011) was conducted with children ranging from five to twelve years. Quantitative improvements in discrimination performance for legal triplets on a forced-choice posttest were documented with increasing age.

It is quite possible that the results from these few artificial language studies might implicate poor metacognitive judgments as the source of differences between children and adults (and between younger and older children), rather than statistical-sequential learning skill per se. On the other hand, reports of age differences in learning are consistent with other work in VExP and SRT paradigms (and canonical statistical learning paradigms in the next section). Thus it is becoming evident that an assumption of developmental invariance for statistical-sequential learning is not a foregone conclusion.

4.2.2. Nonadjacent dependencies

Studies in the implicit learning literature tend not to investigate "nonadjacent dependencies," defined as relationships in which another element (or elements) intervenes between two dependent elements, as their primary aim. More customarily, they may investigate learning of higher-order conditionals under the assumption that the concomitant non-local dependencies embedded in the stimulus sequences are learnt through the chunking of adjacencies (though see, e.g., Kuhn and Dienes 2005, Pacton and Perruchet 2008, and Remillard 2008, for findings in the implicit learning literature with adults). Thus, what is currently known about incidental learning of nonadjacencies over development is mostly limited to studies conducted within the canonical statistical learning literature.

In infants (as in adults), it has been demonstrated that relatively high variability in the set size from which an "intervening" middle element of


a string is drawn facilitates learning of the nonadjacent relationship between the two specific, flanking elements (Gómez 2002). In other words, when exposed to artificial grammar strings of the form aXd and bXe, individuals display sensitivity to the nonadjacent dependency-pairs (i.e., the a_d and b_e relations) when the elements composing the X are drawn from a large set distributed across many exemplars (e.g., when |X| = 18 or 24). Performance is hindered, however, when the variability of the set size for the X is intermediate (e.g., |X| = 12) or low (e.g., |X| = 2).

Gómez and collaborators have assessed young infants' learning of such nonadjacent grammars with auditory nonword stimuli (monosyllabic tokens for a, b, d, and e; bisyllabic tokens for the X's) instantiating the string-elements. Experiments (using a familiarization method and head-turn preference procedure) involved approximately three minutes of exposure to a 2-dependency nonadjacency grammar, followed by a phase in which infants were tested on their ability to discriminate grammatical strings belonging to the familiarized grammar or a foil grammar. While twelve-month-olds were unable to successfully discriminate strings following high-variability training conditions (Gómez and Maye 2005), fifteen-, seventeen-, and eighteen-month-olds were able to make the discriminations (Gómez 2002; Gómez and Maye 2005). Such performance results also obtain across a four-hour delay between familiarization and test, and across different environmental settings (i.e., when familiarized to the grammar at home and then tested in the lab; Gómez et al. 2006). Gómez and Maye also reported age-group differences in looking-time trends to grammatical test items; fifteen- and seventeen-month-olds exhibited familiarity and novelty preferences, respectively, supporting the researchers' conclusion that skill in detecting nonadjacencies appears to emerge by later infancy, with more robust tracking evidenced at seventeen and eighteen months.5

5. This interpretation is informed by the influential Hunter-Ames model (Hunter and Ames 1988), in which greater stimulus complexity or partial encoding by infants is predicted to elicit familiarity preferences (longer looking/listening times for test stimuli that are consistent with the training exemplars) rather than the opposite pattern (i.e., a preference for attending longer to the novel, or inconsistent, test items). While the generality of interpreting preference patterns has not been definitively established (and is thus open to dispute), the observation that twelve-month-olds were unable to demonstrate learning of the nonadjacent grammar under the same experimental conditions as the older infants further underscores Gómez and Maye's conclusion.
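To make the variability manipulation concrete, the sketch below generates familiarization strings of the aXd / bXe type under different middle-element set sizes; the frame and filler tokens, string counts, and set sizes are placeholders of ours, not Gómez's (2002) published stimuli.

import random

def axd_strings(n_strings=48, set_size=18, seed=2):
    """Generate aXd / bXe familiarization strings with a variable middle set.

    The nonadjacent frames (a_d and b_e) are fixed; only the middle element
    varies. Tokens here are placeholders, not the published stimuli."""
    rng = random.Random(seed)
    frames = [("pel", "rud"), ("vot", "jic")]             # the a_d and b_e dependencies
    fillers = [f"X{i}" for i in range(1, set_size + 1)]   # middle-element pool of size |X|
    return [(a, rng.choice(fillers), d)
            for a, d in (rng.choice(frames) for _ in range(n_strings))]

# High- vs. low-variability conditions differ only in the size of the filler
# pool; the frame-to-frame nonadjacent contingency is identical in both.
print(axd_strings(n_strings=4, set_size=24))
print(axd_strings(n_strings=4, set_size=2))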


5. Developmental changes

Age differences in the magnitude of learning effects were observed across different age groups of young children in comparison to adults when learning deterministic and probabilistic sequences in SRT tasks. Such quantitative differences between child learners and adults were also observed in AGL paradigms. In light of the assumption that children should be naturally better statistical learners than adults, given sensitive-period effects observed in early language acquisition (J. S. Johnson and Newport 1989; Newport 1990), these findings may be counterintuitive and surprising.

Differences between learning adjacencies and nonadjacencies within a developmental context could potentially be an extension of quantitative performance-level differences, in that, as structural complexity purportedly increases, we see later ages of proficiency documented. The complexity of artificial grammars studied in the literature thus far seems to have some correspondence with dependency length – and with measures, such as topological entropy, that correlate with length – over and above simply the number of associations or the predictability of the grammar (van den Bos and Poletiek 2008). However, the simplified structures employed to date do not exhaust the full range of potential structures amenable to statistical tracking (such as the embedded long-distance dependencies and cross-serial dependencies evidenced in natural language). Attempts to identify facilitative contexts for acquiring remote dependencies are still in their early stages. So "dependency length" may be more of a useful starting point than a conclusive identification of the main source of statistical-sequential learning "complexity." Neural network modeling of sequential processing in natural language suggests that complexity (as corresponding to measures reflecting human processing difficulty, such as protracted on-line RTs) is not reducible to dependency length and reflects interactions between experiential variations and constraints intrinsic to the architecture of the learning mechanism (Christiansen and MacDonald 2009; MacDonald and Christiansen 2002). Nonetheless, with this initial provisional definition, the findings reviewed herein are consistent with an interpretation of improved statistical-sequential learning performance with age for incidentally detecting increasingly "complex" structures, including but not confined to explaining the behavioral emergence of nonadjacency tracking (see Figure 2).

Figure 2. A provisional timeline of learning that indicates behavioral developments documented within the statistical learning (top) and implicit learning (bottom) literatures. As the timeline is intended to illuminate potential developmental changes, findings of learning sensitivity that are established exclusively from neurophysiological measures are omitted (e.g., ERP measures in Teinonen et al. 2009). Neural sensitivity can be evidenced even without behavioral discrimination on standard performance measures (Turk-Browne et al. 2009), thus making neurophysiological comparisons to quantitative behavioral assessments (especially null findings) less straightforward. Nonetheless, developmental trends suggested by neuroscience data, as discussed in Sec. 6.2., may be particularly fruitful for understanding neural mechanisms involved in statistical-sequential learning.

With certain stipulations, deterministic nonadjacencies in statistical learning bear surface resemblance to the second-order context-dependent sequences in the implicit learning literature. That is to say, a context-dependent sequence with 2nd-order relations (e.g., 1-2-1-3-1-2-1-3..., which is the precise structure studied by Clohessy et al. 2001, noted earlier) parallels in its embedded relations the nonadjacency grammar-strings of type 2_3 or 3_2 as investigated in infants by Gómez and colleagues (see Sec. 4.2.2.) (n.b., set-size X = 1, which is a "zero variability," potentially learning-conducive condition; Onnis, Christiansen, Chater, and Gómez 2003; Onnis, Monaghan, Christiansen, and Chater 2004). Interestingly, robust learning for these kinds of sequential regularities appears around 18 months in both literatures. Indeed, as mentioned earlier, much implicit learning work generally does not discriminate whether second-order context-dependent sequence relations are represented in the same manner as nonadjacencies of the kind studied in the statistical learning literature (although theoretically both forms of learned representations would be consistent with statistical-sequential learning mechanisms).

5.1. Transient cognitive constraints

While various cognitive constraints might be operative in accounting for patterns of developmental performance differences, here we concentrate on one promising factor that has already received some attention in the literature and that has particular utility for explaining the emergence of nonadjacency skills. The provisional hypothesis is that infants begin with limitations in the information that they can process in time, and are thus restricted in the number of elements that can be effectively related to one another within the distance of this processing window. Following Santelmann and Jusczyk (1998) and others, the number of intervening elements (or syllable/word constituents) between dependencies comprised of similar units is used as a working definition of "distance" (precluding for now the issue of precise temporal duration). As the processing space expands during later development, this allows the infant to efficiently exploit and integrate more of an element's preceding context to discover appropriate nonadjacent relations that would otherwise be obscured. Indeed, a narrow processing window may act as an initial "filter" to constrain the problem space of potential mappings (Newport 1988, 1990), thus focusing the infant's attention on more basic, local dependencies that can be later applied over longer distances, when the temporal window grows. Another related possibility is that a narrow window may act as an initial "amplifier" for detecting the covariation of input elements, because smaller sampling of a distribution increases the likelihood of observing correlations that are more extreme in magnitude than the true associations


(Kareev 1995; Kareev, Lieberman, and Lev 1997). Given the structure of language, such memory-based constraints (in contrast to those of adults) might paradoxically contribute toward superior performance in language acquisition for child learners (Newport's "less is more" hypothesis).

Computationally, "less is more" has a parallel in one of the two methods used in Elman's (1993) "starting small" simulations, in which Elman manipulated the resetting of a simple recurrent network's context units in order to simulate the child's initially reduced and then growing window for relating dependencies. Such a procedure enabled the network, thus "handicapped," to learn a complex language corpus that it had previously failed to master without recourse to such developmental limitations (or without having received incrementally staged input that externally mimicked such constraints). Conway, Ellefson, and Christiansen (2003) further investigated the effects of "starting small" in artificial grammar learning experiments with adults. In support of a starting-small hypothesis, they documented a learning advantage for participants trained with incrementally staged input on complex visual grammars.

The notion that initial developmental constraints (or staged starting-small input) might scaffold the acquisition of more complex dependency-forms or tracking-skills than otherwise possible (or in a relatively quicker or more robust manner) is not unique to "purely" cognitive or linguistic phenomena, and may cut across perceptual development, too. For instance, it has been similarly postulated (with some support from neural-connectionist simulations) that the early limitations in human newborns' visual acuity may actually promote the subsequent development of the binocular disparity sensitivities emerging around four months (Dominguez and Jacobs 2003). More generally, such a hypothesis is consistent with categorization schemas of asynchronously developing experience-expectant brain systems in mammals (Greenough, Black, and Wallace 1987). Empirically, the developmental timing of nonadjacency-related skills at eighteen months in statistical learning paradigms parallels emerging sensitivity at this same age in natural language learning to morphosyntactic relationships obtaining over one to three intervening syllables (Santelmann and Jusczyk 1998). This is consistent as well with Gómez and Maye's (2005) suggested interpretation of their developmental results (described in Sec. 4.2.2.). At the neural level, synaptic pruning might also be a mechanism for such gains in memory performance (and can also be explored computationally in neural networks via "selective pruning" of nodes and connection weights) (cf. Quartz and Sejnowski 1997).
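Returning briefly to the small-sample "amplifier" idea above, that claim lends itself to a quick numerical check. The sketch below draws many small and large samples from a correlated bivariate normal distribution and tallies how often the sample correlation looks more extreme than the population value; the distribution, sample sizes, correlation, and threshold are arbitrary choices for illustration, not parameters from Kareev's studies.

import numpy as np

def extreme_rate(n, true_r=0.3, threshold=0.5, n_samples=5000, seed=3):
    """Proportion of size-n samples whose sample correlation exceeds `threshold`,
    a value more extreme than the population correlation `true_r`."""
    rng = np.random.default_rng(seed)
    cov = [[1.0, true_r], [true_r, 1.0]]
    hits = 0
    for _ in range(n_samples):
        x, y = rng.multivariate_normal([0.0, 0.0], cov, size=n).T
        if np.corrcoef(x, y)[0, 1] >= threshold:
            hits += 1
    return hits / n_samples

# Small samples scatter widely around the population correlation, so a sizeable
# fraction of them display a correlation more extreme than the true one, whereas
# large samples rarely do (illustrative check, not data from the cited studies).
for n in (7, 60):
    print(n, extreme_rate(n))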


Thus, the hypothesis of a limited temporal processing window has current empirical and theoretical support (see also Goldstein et al. 2010), and can be investigated in greater detail from various angles within a computational perspective. On the latter note, however, Rohde and Plaut's (2003) neural network simulations ("less is less") failed to replicate Elman's findings, thus questioning computational support for the hypothesis. However, our explanation here dodges these particular concerns. That is to say, although intriguing and unresolved, the issue of whether early limitations in processing space would be beneficial per se for the infant (under certain circumstances, perhaps) is not decisive for the validity of whether performance differences in statistical-sequential learning, especially the emergence of nonadjacency tracking, can be traced to such a window. What is important to our exposition, then, is the idea that such transient cognitive limitations do appear to exist and that the notion of a temporal processing window, as expounded above, may offer a powerful framework for organizing the existing developmental data to date.

5.2. Changes in underlying neural structures

Work reviewed for the statistical-sequential learning of context-dependent sequences suggested differential recruitment of cortical and subcortical structures between children and adults, with the latter showing greater hippocampal activation (although this was not associated with the magnitude of the learning effect per se). Statistical-sequential learning, as a form of procedural learning, is likely to involve the participation of the basal ganglia through cortico-striatal circuits (based on supportive molecular and neuropsychological evidence, as well as theoretical views; Ackermann 2008; Christiansen, Kelly, Shillcock, and Greenfield 2010; Conway and Pisoni 2008; Lieberman 2002); this participation continually receives much attention in implicit learning accounts, especially in relation to putative performance "dissociations" among impaired populations. It has also been suggested that the basal ganglia may play a role in speech and language development via feedback-driven vocal learning, such as in the socially guided statistical learning of phonological patterns through contingent interactions between caregivers and prelinguistic, vocalizing infants (Goldstein and Schwade 2008).

However, the age-related differences in activation patterns with respect to hippocampal involvement are less clear. Minimally, they are interesting in that they dovetail with recent neuroimaging work implicating a potential role for both the striatum (right caudate activation) and the hippocampus in the


on-line statistical learning performance of adults, using a visual triplet-segmentation task with glyphs from Sabean and Ndjuka syllabaries (Turk-Browne, Scholl, Chun, and Johnson 2009). The researchers have postulated potentially different roles and pathways for each, with the hippocampus putatively mediating more abstract learning and the striatum involved in more specific associative encoding.

Of further interest, the age-related differences in hippocampal activation patterns might play a role in explaining conflicting findings on the effect of sleep on children's statistical-sequential learning. Within the implicit learning literature, Fischer, Wilhelm and Born (2007) had reported a sleep-dependent deterioration effect on children's learning of second-order contingencies in an SRT task. They had assessed the level of learning retention by seven- to eleven-year-olds when retested on the SRT sequence after either a period of sleep or an equal interval of wakefulness. The difference in average reaction times to grammatical versus ungrammatical SRT trials in the test blocks was the study's dependent measure of learning. Twenty- to thirty-year-old adults exhibited an improvement in this RT learning measure when retested following sleep, but a decrement in learning when retested in the wake condition. Compared to adults, however, the magnitude of learning decreased in the group of children following sleep, while remaining unchanged at retest in the wake condition.

This sleep-dependent deterioration in learning for the children contrasts with results from Gómez, Bootzin, and Nadel's (2006) statistical learning study, in which naps promoted better statistical learning generalization of nonadjacencies for infants. That is, fifteen-month-olds who napped within a 4-hour interval between a familiarization period and testing were able to discriminate between nonadjacency pairings that were either consistent or inconsistent with the artificial grammar generating the string they were presented with on the first test trial, but did not exhibit veridical recall for the specific pairings they had been acquainted with during familiarization. In contrast, fifteen-month-olds without an intervening nap displayed veridical discrimination for prior dependency-strings, but not generalization of the nonadjacent structure to novel pairings. These results may nonetheless fit with those in Fischer et al.'s (2007) study, because the performance decrement in Fischer et al. was observed for veridical dependencies, rather than for abstracted relations to new forms based on prior grammar probabilities. Furthermore, in speculating on the differential effects of sleep in adults and children, Fischer et al. state:


Moreover, for different learning tasks, a competitive interference between striatal and hippocampal systems has been shown (Packard and Knowlton 2002; Schroeder, Wingard, and Packard 2002; Poldrack et al. 2001). In this framework, opposite effects of sleep on implicit learning in children and adults might reflect that sleep in children leads to a preferential strengthening of hippocampal aspects of the memory representation, whereas sleep in adults strengthens caudate involvement. (224)

In support of this hypothesis, Fischer et al. note that whereas the children and adults in their study did not differ in amounts of rapid-eye-movement (REM) sleep, children's sleep was characterized by a greater amount of slow-wave sleep (SWS). In turn, SWS is associated with hippocampus-dependent memory consolidation (see the review by Marshall and Born 2007). Greater caudate involvement may lead to enhanced procedural learning, thus explaining adults' gains in SRT learning performance. However, strengthening hippocampal aspects of memories may not impact the implicit/indirect indices of learning tapped by the SRT, thus explaining the lack of gains in children after the sleep interval. (Recall again that in the Thomas et al. (2004) study (reported in Sec. 4.1.2.), adults' learning showed evidence of greater reliance on hippocampal recruitment than children's during on-line task performance, but this difference was not significantly linked to the SRT learning measure.)

What other learning outcome, then, might be affected by strengthening hippocampal-dependent memories? No measures of learning beyond the procedural ones reflected by SRT performance were reported in Fischer et al. (2007). However, if Turk-Browne et al. (2009) are correct in connecting the hippocampus to the formation of more abstract representations in their statistical learning study, and if infants' napping comprises a strengthening of hippocampal-dependent associations, this might account for the enhancement specific to generalization of the nonadjacencies to the novel pairings in Gómez et al. (2006). It would also be consistent with hypotheses put forward by Gómez et al. at a more cognitive level of description. Namely, in speculating as to why sleep enhances generalization after learning in infants, Gómez et al. proposed three possibilities: 1) the preferential weighting of abstract vs. specific information changes after sleep, 2) infants forget details of the items implementing the specific nonadjacent pairings after sleep, and 3) sleep prolongs the learning-dependent processing necessary for abstraction to later occur. There is not conclusive evidence here for putting the matter to rest, but for now it provides further room for much speculation (and thoughts to sleep on).


5.3. Early perceptual development

Because statistical-sequential learning is closely mediated by perceptual features of the input and by modality constraints (Conway and Christiansen 2005, 2006, 2009; Conway et al. 2007), part of a thorough picture will likely include the manner in which perceptual systems develop in response to early environmental input and are recruited through experience. Early perceptual learning, at least for certain gross-level acoustic or prosodic patterns, begins in utero during the last trimester (e.g., DeCasper and Spence 1986; see the review of Gómez and Gerken 2000, Box 1, as well as computational simulations of scaffolding from prenatal "filtered" stimuli in Christiansen, Dale, and Reali 2010), suggesting a developmental trajectory that may have a substantial prenatal initial progression or foundation. Postnatally, the evidence reviewed herein indicates that visual and spatiotemporal sequential learning of co-occurrence frequencies is present at two months (e.g., as in VExP findings and the statistical learning study by Kirkham et al. 2002) and older (e.g., by nine months; Emberson, Misyak, Schwade, Christiansen, and Goldstein 2008), with further statistical learning of regularities within visual arrays documented at nine months (Fiser and Aslin 2002).

In the auditory domain, natural language acquisition skills (some of which likely recruit upon statistical-sequential learning mechanisms) have also been investigated in very early infancy. Rudimentary auditory sequential learning abilities appear present as early as one-half to two days after birth (Teinonen et al. 2009), and evidence of learning transitional probabilities embedded within linguistic stimuli is seen by five-and-a-half months (E. K. Johnson and Tyler 2010). However, auditory statistical learning for non-speech stimuli such as tones has seemingly not been documented in infants younger than eight months (Saffran et al. 1999). Systematic comparative work (using comparable procedures and stimuli) does not currently exist for deriving firm conclusions about potential modality-driven performance patterns and differences in infancy (Misyak, Emberson, Schwade, Christiansen, and Goldstein 2009), although one study in children indicates somewhat better learning of predictive dependencies in the auditory versus the visual modality (Saffran 2002). Thus, the timeline gaps in these cases probably reflect the fact that studies have typically not been motivated by comparative developmental trends – at least not with regard to modality or performance-level differences – rather than reflecting absences of ability per se.

It is further unknown how learning in infants and children may differ among specific dimension-modality pairings (i.e., comparisons for learning between visual and auditory stimuli, occurring in arrayed or sequential format, when regularities are also encoded along the dimension of location or variable timings). Kirkham, Slemmer, Richardson, and Johnson (2007), however, report work suggesting that sequential spatial-location regularities (of more complex form than those studied in the VExP task) may be more difficult for infants than other forms of sequence learning tasks and that proficiency may manifest later in perceptual development. We are also far from systematic investigations in infancy/childhood of feature-dimension pairings. For instance, color and shape were always perfectly correlated in Fiser and Aslin's (2002) and Kirkham et al.'s (2002) visual statistical/sequential learning studies, thus preempting developmental considerations for the kind of phenomena investigated by Turk-Browne and colleagues (Turk-Browne et al. 2008; Turk-Browne and Scholl 2009) in adults regarding "bound object representations" and spatiotemporal generalization abilities. In contrast to fairly developed auditory abilities at birth (Lasky and Williams 2005; Saffran, Werker, and Werner 2006), the visual system undergoes more dramatic changes during the first year. In newborns, preferential-looking estimates of visual acuity are approximately 1 cycle per degree (cpd; equivalent to 20/600 Snellen), developing to 3 cpd (20/200 Snellen) at 3 months and reaching about 12 cpd (20/50 Snellen) by the end of the first year (Birch, Gwiazda, Bauer Jr, Naegele, and Held 1983; Courage and Adams 1990; Dobson and Teller 1978). Such early, transient limitations in the detail of infants' visual fields may thus narrow their perceptual focus to close-range visual stimuli and impose a more sequentially constrained format on the visual images they perceive – which could in turn favor visual statistical learning of temporally distributed sequences over that of spatially arrayed sequences in early processing. There are differential effects of temporal and spatial formats for auditory and visual statistical learning in adults, with visual-temporal conditions eliciting the poorest performance (Conway and Christiansen 2009); but it has not been established whether and how such performance patterns might be shaped by, and/or possibly depart from, early perceptual experiences/biases. One intriguing possibility, therefore, is that the kind of early processing constraint alluded to above, in tandem with an abundance of such early visual experiences during the first months, may temporarily place young infants' visual statistical learning of sequences above or on a par with analogous learning for auditory sequences.

Another, complementary, idea is that the structure of prelinguistic social interaction with caregivers shapes infant attention in ways that facilitate specific forms of statistical learning (Goldstein et al. 2010). Such comparative learning among modalities is an ongoing matter of investigation, and if the former hypotheses bear out, it would form a surprising counterpoint to the auditory superiority in performance observed in adults (Conway and Christiansen 2005, 2009; Saffran 2002). Beyond the development of sequential learning through particular sensory experiences (and, by implication, for distributed modality-constrained subsystems of statistical learning; see Note 2), there may further be a possible role for general principles in early neural development that facilitate and maintain so-called "entrenched" perceptual "discrimination" abilities (cf. Scott, Pascalis, and Nelson 2007). That is, it has been proposed that such mechanisms may be broader (more domain-general) than traditionally supposed. For instance, the language-specific "narrowing" characteristic of infants' later babbling, as a product of increasing exposure to the ambient language, might also be driven or facilitated by experiences that are supralinguistic in some form; more specifically, with respect to incorporated phonological patterns and articulatory/acoustic features, it can be shaped by contingent parental feedback (Goldstein, King, and West 2003; Goldstein and West 1999) and "socially guided statistical learning" (Goldstein and Schwade 2008). Given, though, the evidence for gradiency effects in many "discrete" categorization performance tasks (e.g., Dale, Kehoe, and Spivey 2007; McMurray, Tanenhaus, Aslin, and Spivey 2003; see also Spivey 2007), an account encompassing these principles may also be heavily context-sensitive and ultimately entail a probabilistic cue-weighting explanation/mechanism that need not be perceptually specific in its explanatory range or extension. Assuming the requisite experiences for shaping such behavioral response patterns, these may in turn provide a "representational platform", or weighted biases, over which related input features or cues for statistical-sequential learning mechanisms may be integrated. Furthermore, regarding so-called perceptual "enhancement" processes, it is likely that the statistical distribution of the featural input itself plays a large role in such phenomena (cf. Maye, Weiss, and Aslin 2008; Maye, Werker, and Gerken 2002). In sum, empirical investigation of many of the links to statistical-sequential learning across development hypothesized throughout this section is still awaited. As yet, despite their promise, there are no systematic cross-sectional or longitudinal data to inform our understanding of patterns and trajectories in learning performance across modalities and/or with respect to related perceptual phenomena.

6. Future developmental strides for the merger of statistical and implicit learning work

A synthesis of findings from across the implicit and statistical learning literatures suggests that these two fields may have much to offer one another. It also indicates that their convergence may be especially fruitful for exploring issues related to infant/childhood cognitive constraints, underlying neural mechanisms, and early perceptual development. Furthermore, despite the orthodox assumption of age invariance, it appears that the possibility of developmental changes across studies merits much stronger consideration than it has received to date. Abandoning the presumption of developmental invariance might also provide the impetus for much-needed longitudinal and cross-sectional designs. There are conspicuous age gaps that reflect the nature of existing work in the area: much work concentrates on early infancy but does not clearly connect learning continuously across childhood. By going beyond documenting the age of successful learning for different statistical-sequential skills towards providing more detailed developmental trajectories, the projected merger of research findings across implicit and statistical learning paradigms will not only become truly developmental, but may perhaps flourish even more prominently past its early formal youth.

References

Ackermann, Hermann 2008 Cerebellar contributions to speech production and speech perception: Psycholinguistic and neurobiological perspectives. Trends in Neurosciences, 31, 265–272.
Adler, Scott A., Marshall M. Haith, Denise M. Arehart, and Elizabeth C. Lanthier 2008 Infants' visual expectations and the processing of time. Journal of Cognition and Development, 9, 1–25.
Arciuli, Joanne, and Ian C. Simpson 2011 Statistical learning in typically developing children: The role of age and speed of stimulus presentation. Developmental Science, 14, 464–473.

Aslin, Richard N., Jenny R. Saffran, and Elissa L. Newport 1998 Computation of conditional probability statistics by 8-month-old infants. Psychological Science, 9, 321–324.
Birch, E. E., J. Gwiazda, J. A. Bauer Jr, J. Naegele, and R. Held 1983 Visual acuity and its meridional variations in children aged 7–60 months. Vision Research, 23, 1019–1024.
Bloomfield, Leonard 1933 Language. Chicago: University of Chicago Press.
Bremner, Andrew J., Denis Mareschal, Arnaud Destrebecqz, and Axel Cleeremans 2007 Cognitive control of sequential knowledge in 2-year-olds: Evidence from an incidental sequence-learning and -generation task. Psychological Science, 18, 261–266.
Canfield, Richard L., and Marshall M. Haith 1991 Young infants' visual expectations for symmetric and asymmetric stimulus sequences. Developmental Psychology, 27, 198–208.
Canfield, Richard L., Elliott G. Smith, Michael P. Brezsnyak, Kyle L. Snow, Richard N. Aslin, Marshall M. Haith, Tara S. Wass, and Scott A. Adler 1997 Information processing through the first year of life: A longitudinal study using the visual expectation paradigm. Monographs of the Society for Research in Child Development, 62, i–160.
Christiansen, Morten H., Rick Dale, and Florencia Reali 2010 Connectionist explorations of multiple-cue integration in syntax acquisition. In S. P. Johnson (Ed.), Neoconstructivism: The new science of cognitive development (pp. 87–108). New York: Oxford University Press.
Christiansen, Morten H., M. Louise Kelly, Richard C. Shillcock, and Katie Greenfield 2010 Impaired artificial grammar learning in agrammatism. Cognition, 116, 382–393.
Christiansen, Morten H., and Maryellen C. MacDonald 2009 A usage-based approach to recursion in sentence processing. Language Learning, 59 (Suppl. 1), 126–161.
Cleeremans, Axel 1993 Mechanisms of implicit learning: A connectionist model of sequence processing. Cambridge, MA: MIT Press.
Cleeremans, Axel, and James L. McClelland 1991 Learning the structure of event sequences. Journal of Experimental Psychology: General, 120, 235–253.
Clohessy, Anne B., Michael I. Posner, and Mary K. Rothbart 2001 Development of the functional visual field. Acta Psychologica, 106, 51–68.
Conway, Christopher M., and Morten H. Christiansen 2001 Sequential learning in non-human primates. Trends in Cognitive Sciences, 5, 539–546.

Conway, Christopher M., and Morten H. Christiansen 2005 Modality-constrained statistical learning of tactile, visual, and auditory sequences. Journal of Experimental Psychology: Learning, Memory, and Cognition, 31, 24–39.
Conway, Christopher M., and Morten H. Christiansen 2006 Statistical learning within and between modalities: Pitting abstract against stimulus-specific representations. Psychological Science, 17, 905–912.
Conway, Christopher M., and Morten H. Christiansen 2009 Seeing and hearing in space and time: Effects of modality and presentation rate on implicit statistical learning. European Journal of Cognitive Psychology, 21, 561–580.
Conway, Christopher M., Michelle R. Ellefson, and Morten H. Christiansen 2003 When less is less and when less is more: Starting small with staged input. In Rick Alterman and David Kirsh (Eds.), Proceedings of the 25th Annual Cognitive Science Society (pp. 270–275). Boston, MA: Cognitive Science Society.
Conway, Christopher M., Robert L. Goldstone, and Morten H. Christiansen 2007 Spatial constraints on visual statistical learning of multi-element scenes. In D. S. McNamara and J. G. Trafton (Eds.), Proceedings of the 29th Annual Cognitive Science Society (pp. 185–190). Austin, TX: Cognitive Science Society.
Conway, Christopher M., and David B. Pisoni 2008 Neurocognitive basis of implicit learning of sequential structure and its relation to language processing. In Guinevere F. Eden and D. Lynn Flowers (Eds.), Annals of the New York Academy of Sciences, 1145, 113–131.
Courage, Mary L., and Russell J. Adams 1990 Visual acuity assessment from birth to three years using the acuity card procedure: Cross-sectional and longitudinal samples. Optometry and Vision Science, 67, 713–718.
Creel, Sarah C., Elissa L. Newport, and Richard N. Aslin 2004 Distant melodies: Statistical learning of nonadjacent dependencies in tone sequences. Journal of Experimental Psychology: Learning, Memory, and Cognition, 30, 1119–1130.
Dale, Rick, Caitlin Kehoe, and Michael J. Spivey 2007 Graded motor responses in the time course of categorizing atypical exemplars. Memory and Cognition, 35, 15–28.
DeCasper, Anthony J., and Melanie J. Spence 1986 Prenatal maternal speech influences on newborns' perception of speech sounds. Infant Behavior and Development, 9, 133–150.
Dienes, Zoltán 1992 Connectionist and memory-array models of artificial grammar learning. Cognitive Science, 16, 41–79.

Dobson, Velma, and Davida Y. Teller 1978 Visual acuity in human infants: A review and comparison of behavioral and electrophysiological studies. Vision Research, 18, 1469–1483.
Dominguez, Melissa, and Robert A. Jacobs 2003 Does visual development aid visual learning? In P. T. Quinlan (Ed.), Connectionist models of development: Developmental processes in real and artificial neural networks (pp. 257–278). New York: Psychology Press.
Don, Audrey J., E. Glenn Schellenberg, Arthur S. Reber, Kristen M. DiGirolamo, and Paul P. Wang 2003 Implicit learning in children and adults with Williams syndrome. Developmental Neuropsychology, 23, 201–225.
Ellis, Nick C. (Ed.) 1994 Implicit and explicit learning of languages. London: Academic Press.
Elman, Jeffrey L. 1993 Learning and development in neural networks: The importance of starting small. Cognition, 48, 71–99.
Emberson, Lauren L., Jennifer B. Misyak, Jennifer A. Schwade, Morten H. Christiansen, and Michael H. Goldstein 2008 Face-to-face: Visual statistical learning with complex natural stimuli. XVIth International Conference on Infant Studies (abstracts ed., p. 259). Vancouver, Canada.
Fischer, Stefan, Ines Wilhelm, and Jan Born 2007 Developmental differences in sleep's role for implicit off-line learning: Comparing children with adults. Journal of Cognitive Neuroscience, 19, 214–227.
Fiser, József, and Richard N. Aslin 2002 Statistical learning of new visual feature combinations by infants. Proceedings of the National Academy of Sciences, 99, 15822–15826.
Fletcher, Janet, Murray T. Maybery, and Sarah Bennett 2000 Implicit learning differences: A question of developmental level? Journal of Experimental Psychology: Learning, Memory, and Cognition, 26, 246–252.
Friederici, Angela D., Jörg Bahlmann, Stefan Heim, Ricarda I. Schubotz, and Alfred Anwander 2006 The brain differentiates human and non-human grammars: Functional localization and structural connectivity. Proceedings of the National Academy of Sciences, 103, 2458–2463.
Gebauer, Guido F., and Nicholas J. Mackintosh 2007 Psychometric intelligence dissociates implicit and explicit learning. Journal of Experimental Psychology: Learning, Memory, and Cognition, 33, 34–54.

Goldstein, Michael H., Andrew P. King, and Meredith J. West 2003 Social interaction shapes babbling: Testing parallels between birdsong and speech. Proceedings of the National Academy of Sciences, 100, 8030–8035.
Goldstein, Michael H., and Jennifer A. Schwade 2008 Social feedback to infants' babbling facilitates rapid phonological learning. Psychological Science, 19, 515–523.
Goldstein, Michael H., Heidi R. Waterfall, Arnon Lotem, Joseph Y. Halpern, Jennifer A. Schwade, Luca Onnis, and Shimon Edelman 2010 General cognitive principles for learning structure in time and space. Trends in Cognitive Sciences, 14, 249–258.
Goldstein, Michael H., and Meredith J. West 1999 Consistent responses of human mothers to prelinguistic infants: The effect of prelinguistic repertoire size. Journal of Comparative Psychology, 113, 52–58.
Gómez, Rebecca L. 2002 Variability and detection of invariant structure. Psychological Science, 13, 431–436.
Gómez, Rebecca L. 2006 Dynamically guided learning. In Y. Munakata and M. H. Johnson (Eds.), Attention and performance XXI: Processes of change in brain and cognitive development (pp. 87–110). Oxford: Oxford University Press.
Gómez, Rebecca L., Richard R. Bootzin, and Lynn Nadel 2006 Naps promote abstraction in language-learning infants. Psychological Science, 17, 670–674.
Gómez, Rebecca L., and LouAnn Gerken 1999 Artificial grammar learning by 1-year-olds leads to specific and abstract knowledge. Cognition, 70, 109–135.
Gómez, Rebecca L., and LouAnn Gerken 2000 Infant artificial language learning and language acquisition. Trends in Cognitive Sciences, 4, 178–186.
Gómez, Rebecca L., and Laura Lakusta 2004 A first step in form-based category abstraction by 12-month-old infants. Developmental Science, 7, 567–580.
Gómez, Rebecca L., and Jessica Maye 2005 The developmental trajectory of nonadjacent dependency learning. Infancy, 7, 183–206.
Goschke, Thomas, Angela D. Friederici, Sonja A. Kotz, and Anja van Kampen 2001 Procedural learning in Broca's aphasia: Dissociation between the implicit acquisition of spatio-motor and phoneme sequences. Journal of Cognitive Neuroscience, 13, 370–388.
Greenough, William T., James E. Black, and Christopher S. Wallace 1987 Experience and brain development. Child Development, 58, 539–559.

Haith, Marshall M., Cindy Hazan, and Gail S. Goodman 1988 Expectation and anticipation of dynamic visual events by 3.5-month-old babies. Child Development, 59, 467–479.
Haith, Marshall M., and Michael E. McCarty 1990 Stability of visual expectations at 3.0 months of age. Developmental Psychology, 26, 68–74.
Haith, Marshall M., Naomi Wentworth, and Richard L. Canfield 1993 The formation of expectations in early infancy. In C. Rovee-Collier and L. P. Lipsitt (Eds.), Advances in infancy research (Vol. 8). Norwood, NJ: Ablex.
Harris, Zellig S. 1955 From phoneme to morpheme. Language, 31, 190–222.
Hasher, Lynn, and Rose T. Zacks 1984 Automatic processing of fundamental information: The case of frequency of occurrence. American Psychologist, 39, 1372–1388.
Hayes, J. R., and H. H. Clark 1970 Experiments in the segmentation of an artificial speech analogue. In J. R. Hayes (Ed.), Cognition and the development of language (pp. 221–234). New York: Wiley.
Hull, Clark Leonard 1920 Quantitative aspects of the evolution of concepts: An experimental study. Psychological Monographs, 28.
Hunt, Ruskin H., and Richard N. Aslin 2001 Statistical learning in a serial reaction time task: Access to separable statistical cues by individual learners. Journal of Experimental Psychology: General, 130, 658–680.
Hunter, Michael A., and Elinor W. Ames 1988 A multifactor model of infant preferences for novel and familiar stimuli. Advances in Infancy Research, 5, 69–95.
Jenkins, John G. 1933 Instruction as a factor in 'incidental' learning. The American Journal of Psychology, 45, 471–477.
Johnson, Elizabeth K., and Michael D. Tyler 2010 Testing the limits of statistical learning for word segmentation. Developmental Science, 13, 339–345.
Johnson, Jacqueline S., and Elissa L. Newport 1989 Critical period effects in second language learning: The influence of maturational state on the acquisition of English as a second language. Cognitive Psychology, 21, 60–99.
Jones, James, and Harold Pashler 2007 Is the mind inherently forward looking? Comparing prediction and retrodiction. Psychonomic Bulletin & Review, 14, 295–300.
Kareev, Yaakov 1995 Through a narrow window: Working memory capacity and the detection of covariation. Cognition, 56, 263–269.

Kareev, Yaakov, Iris Lieberman, and Miri Lev 1997 Through a narrow window: Sample size and the perception of correlation. Journal of Experimental Psychology: General, 126, 278–287.
Keele, Steven W., Richard Ivry, Ulrich Mayr, Eliot Hazeltine, and Herbert Heuer 2003 The cognitive and neural architecture of sequence representation. Psychological Review, 110, 316–339.
Keele, Steven W., and Peggy J. Jennings 1992 Attention in the representation of sequence: Experiment and theory. Human Movement Science, 11, 125–138.
Kirkham, Natasha Z., Jonathan A. Slemmer, and Scott P. Johnson 2002 Visual statistical learning in infancy: Evidence for a domain general learning mechanism. Cognition, 83, B35–B42.
Kirkham, Natasha Z., Jonathan A. Slemmer, Daniel C. Richardson, and Scott P. Johnson 2007 Location, location, location: Development of spatiotemporal sequence learning in infancy. Child Development, 78, 1559–1571.
Knowlton, Barbara J., Larry R. Squire, and Mark A. Gluck 1994 Probabilistic classification learning in amnesia. Learning and Memory, 1, 106–120.
Kuhn, Gustav, and Zoltán Dienes 2005 Implicit learning of nonlocal musical rules: Implicitly learning more than chunks. Journal of Experimental Psychology: Learning, Memory, and Cognition, 31, 1417–1432.
Lasky, Robert E., and Amber L. Williams 2005 The development of the auditory system from conception to term. Neoreviews, 6, 141–152.
Lewicki, Pawel, Maria Czyzewska, and Hunter Hoffman 1987 Unconscious acquisition of complex procedural knowledge. Journal of Experimental Psychology: Learning, Memory, and Cognition, 13, 523–530.
Lewkowicz, David J. 2008 Perception of dynamic and static audiovisual sequences in 3- and 4-month-old infants. Child Development, 79, 1538–1554.
Lieberman, Philip 2002 On the nature and evolution of the neural bases of human language. American Journal of Physical Anthropology, 35, 36–62.
MacDonald, Maryellen C., and Morten H. Christiansen 2002 Reassessing working memory: A comment on Just & Carpenter (1992) and Waters & Caplan (1996). Psychological Review, 109, 35–54.
Marshall, Lisa, and Jan Born 2007 The contribution of sleep to hippocampus-dependent memory consolidation. Trends in Cognitive Sciences, 11, 442–450.

Maybery, Murray T., Margaret Taylor, and Angela O'Brien-Malone 1995 Implicit learning: Sensitive to age but not IQ. Australian Journal of Psychology, 47, 8–17.
Maye, Jessica, Daniel J. Weiss, and Richard N. Aslin 2008 Statistical phonetic learning in infants: Facilitation and feature generalization. Developmental Science, 11, 122–134.
Maye, Jessica, Janet F. Werker, and LouAnn Gerken 2002 Infant sensitivity to distributional information can affect phonetic discrimination. Cognition, 82, B101–B111.
McCauley, Stewart M., and Morten H. Christiansen 2011 Learning simple statistics for language comprehension and production: The CAPPUCCINO model. Submitted manuscript.
McMurray, Bob, Michael K. Tanenhaus, Richard N. Aslin, and Michael J. Spivey 2003 Probabilistic constraint satisfaction at the lexical/phonetic interface: Evidence for gradient effects of within-category VOT on lexical access. Journal of Psycholinguistic Research, 32, 77–97.
Meulemans, Thierry, Martial Van der Linden, and Pierre Perruchet 1998 Implicit sequence learning in children. Journal of Experimental Child Psychology, 69, 199–221.
Miller, George A., and Jennifer A. Selfridge 1950 Verbal context and the recall of meaningful material. The American Journal of Psychology, 63, 176–185.
Mirman, Daniel, Katharine Graf Estes, and James S. Magnuson 2010 Computational modeling of statistical learning: Effects of transitional probability versus frequency and links to word learning. Infancy, 15, 471–486.
Misyak, Jennifer B., and Morten H. Christiansen 2012 Statistical learning and language: An individual differences study. Language Learning, 62, 302–331.
Misyak, Jennifer B., Morten H. Christiansen, and J. Bruce Tomblin 2010 Sequential expectations: The role of prediction-based learning in language. Topics in Cognitive Science, 2, 138–153.
Misyak, Jennifer B., Lauren L. Emberson, Jennifer A. Schwade, Morten H. Christiansen, and Michael H. Goldstein 2009 Face-to-face, word-for-word: Comparing infants' statistical learning of visual and auditory sequences. Poster session presented at the Biennial Meeting of the Society for Research in Child Development, Denver, CO.
Nelson, Charles A. 1995 The ontogeny of human memory: A cognitive neuroscience perspective. Developmental Psychology, 31, 723–738.
Newport, Elissa L. 1988 Constraints on learning and their role in language acquisition: Studies of the acquisition of American Sign Language. Language Sciences, 10, 147–172.

Newport, Elissa L. 1990 Maturational constraints on language learning. Cognitive Science, 14, 11–28.
Newport, Elissa L., Marc D. Hauser, Geertrui Spaepen, and Richard N. Aslin 2004 Learning at a distance II: Statistical learning of non-adjacent dependencies in a non-human primate. Cognitive Psychology, 49, 85–117.
Nissen, Mary Jo, and Peter Bullemer 1987 Attentional requirements of learning: Evidence from performance measures. Cognitive Psychology, 19, 1–32.
Onnis, Luca, Morten H. Christiansen, Nick Chater, and Rebecca L. Gómez 2003 Reduction of uncertainty in human sequential learning: Evidence from artificial language learning. In R. Alterman and D. Kirsh (Eds.), Proceedings of the 25th Annual Cognitive Science Society (pp. 886–891). Boston, MA: Cognitive Science Society.
Onnis, Luca, Padraic Monaghan, Morten H. Christiansen, and Nick Chater 2004 Variability is the spice of learning, and a crucial ingredient for detecting and generalizing in nonadjacent dependencies. In K. Forbus, D. Gentner, and T. Regier (Eds.), Proceedings of the 26th Annual Cognitive Science Society (pp. 1047–1052). Chicago, IL: Cognitive Science Society.
Pacton, Sébastien, and Pierre Perruchet 2008 An attention-based associative account of adjacent and nonadjacent dependency learning. Journal of Experimental Psychology: Learning, Memory, and Cognition, 34, 80–96.
Pelucchi, Bruna, Jessica F. Hay, and Jenny R. Saffran 2009a Learning in reverse: Eight-month-old infants track backward transitional probabilities. Cognition, 113, 244–247.
Pelucchi, Bruna, Jessica F. Hay, and Jenny R. Saffran 2009b Statistical learning in a natural language by 8-month-old infants. Child Development, 80, 674–685.
Perrig, Walter J. 2001 Implicit memory, cognitive psychology of. In N. J. Smelser and P. B. Baltes (Eds.), International Encyclopedia of the Social and Behavioral Sciences (Vol. 11, pp. 7241–7245). Amsterdam: Elsevier.
Perruchet, Pierre, and Stéphane Desaulty 2008 A role for backward transitional probabilities in word segmentation? Memory & Cognition, 36, 1299–1305.
Perruchet, Pierre, and Sébastien Pacton 2006 Implicit learning and statistical learning: One phenomenon, two approaches. Trends in Cognitive Sciences, 10, 233–238.
Perruchet, Pierre, and Ronald Peereman 2004 The exploitation of distributional information in syllable processing. Journal of Neurolinguistics, 17, 97–119.

Quartz, Steven R., and Terrence J. Sejnowski 1997 The neural basis of cognitive development: A constructivist manifesto. Behavioral and Brain Sciences, 20, 537–596.
Reber, Arthur S. 1967 Implicit learning of artificial grammars. Journal of Verbal Learning and Verbal Behavior, 6, 855–863.
Reber, Arthur S. 1993 Implicit learning and tacit knowledge: An essay on the cognitive unconscious. New York: Oxford University Press.
Reber, Arthur S., and Rhianon Allen 1978 Analogic and abstraction strategies in synthetic grammar learning: A functionalist interpretation. Cognition, 6, 189–221.
Reber, Arthur S., and Rhianon Allen 2000 Individual differences in implicit learning: Implications for the evolution of consciousness. In R. G. Kunzendorf and B. Wallace (Eds.), Individual differences in conscious experience (pp. 227–247). Amsterdam: John Benjamins Publishing Company.
Reber, Arthur S., and Selma Lewis 1977 Implicit learning: An analysis of the form and structure of a body of tacit knowledge. Cognition, 5, 333–361.
Reber, Arthur S., Faye F. Walkenfeld, and Ruth Hernstadt 1991 Implicit and explicit learning: Individual differences and IQ. Journal of Experimental Psychology: Learning, Memory, and Cognition, 17, 888–896.
Remillard, Gilbert 2008 Implicit learning of second-, third-, and fourth-order adjacent and nonadjacent sequential dependencies. The Quarterly Journal of Experimental Psychology, 61, 400–424.
Rescorla, Robert A. 1967 Pavlovian conditioning and its proper control procedures. Psychological Review, 74, 71–80.
Rescorla, Robert A., and Allan R. Wagner 1972 A theory of Pavlovian conditioning: Variations in the effectiveness of reinforcement and nonreinforcement. In A. H. Black and W. F. Prokasy (Eds.), Classical conditioning II: Current research and theory (pp. 64–99). New York: Appleton-Century-Crofts.
Reznick, J. Steven, Katarzyna Chawarska, and Stephanie Betts 2000 The development of visual expectations in the first year. Child Development, 71, 1191–1204.
Rohde, Douglas L. T., and David C. Plaut 2003 Less is less in language acquisition. In P. T. Quinlan (Ed.), Connectionist models of development: Developmental processes in real and artificial neural networks (pp. 189–232). New York: Psychology Press.

Roodenrys, Steven, and Naomi Dunn 2008 Unimpaired implicit learning in children with developmental dyslexia. Dyslexia, 14, 1–15.
Rose, Susan A., Judith F. Feldman, Jeffery J. Jankowski, and Donna M. Caro 2002 A longitudinal study of visual expectation and reaction time in the first year of life. Child Development, 73, 47–61.
Saffran, Jenny R. 2001 The use of predictive dependencies in language learning. Journal of Memory and Language, 44, 493–515.
Saffran, Jenny R. 2002 Constraints on statistical language learning. Journal of Memory and Language, 47, 172–196.
Saffran, Jenny R. 2003 Statistical language learning: Mechanisms and constraints. Current Directions in Psychological Science, 12, 110–114.
Saffran, Jenny R., Richard N. Aslin, and Elissa L. Newport 1996 Statistical learning by 8-month-old infants. Science, 274, 1926–1928.
Saffran, Jenny R., Marc Hauser, Rebecca Seibel, Joshua Kapfhamer, Fritz Tsao, and Fiery Cushman 2008 Grammatical pattern learning by human infants and cotton-top tamarin monkeys. Cognition, 107, 479–500.
Saffran, Jenny R., Elizabeth K. Johnson, Richard N. Aslin, and Elissa L. Newport 1999 Statistical learning of tone sequences by human infants and adults. Cognition, 70, 27–52.
Saffran, Jenny R., Elissa L. Newport, Richard N. Aslin, Rachel A. Tunick, and Sandra Barrueco 1997 Incidental language learning: Listening (and learning) out of the corner of your ear. Psychological Science, 8, 101–105.
Saffran, Jenny R., and Erik D. Thiessen 2007 Domain-general learning capacities. In E. Hoff and M. Shatz (Eds.), Handbook of Language Development (pp. 68–86). Cambridge: Blackwell.
Saffran, Jenny R., Janet F. Werker, and Lynne A. Werner 2006 The infant's auditory world: Hearing, speech, and the beginnings of language. In R. Siegler and D. Kuhn (Eds.), Handbook of Child Development (pp. 58–108). New York: Wiley.
Saffran, Jenny R., and Diana P. Wilson 2003 From syllables to syntax: Multilevel statistical learning by 12-month-old infants. Infancy, 4, 273–284.
Santelmann, Lynn M., and Peter W. Jusczyk 1998 Sensitivity to discontinuous dependencies in language learners: Evidence for limitations in processing space. Cognition, 69, 105–134.

Scott, Lisa S., Olivier Pascalis, and Charles A. Nelson 2007 A domain-general theory of the development of perceptual discrimination. Current Directions in Psychological Science, 16, 197–201.
Schvaneveldt, Roger W., and Rebecca L. Gómez 1998 Attention and probabilistic sequence learning. Psychological Research, 61, 175–190.
Servan-Schreiber, David, Axel Cleeremans, and James L. McClelland 1988 Encoding sequential structure in simple recurrent networks (CMU Technical Report CMU-CS-335-87). Pittsburgh, PA: Computer Science Department, Carnegie-Mellon University.
Shanks, David R. 1995 The psychology of associative learning. Cambridge: Cambridge University Press.
Shannon, C. E. 1948 A mathematical theory of communication. Bell System Technical Journal, 27, 379–423, 623–656.
Spivey, Michael J. 2007 The continuity of mind. New York: Oxford University Press.
Swingley, Daniel 2005 Statistical clustering and the contents of the infant vocabulary. Cognitive Psychology, 50, 86–132.
Teinonen, Tuomas, Vineta Fellman, Risto Näätänen, Paavo Alku, and Minna Huotilainen 2009 Statistical language learning in neonates revealed by event-related brain potentials. BMC Neuroscience, 10, 21.
Thomas, Kathleen M., Ruskin H. Hunt, Nathalie Vizueta, Tobias Sommer, Sarah Durston, Yihong Yang, and Michael S. Worden 2004 Evidence of developmental differences in implicit sequence learning: An fMRI study of children and adults. Journal of Cognitive Neuroscience, 16, 1339–1351.
Thomas, Kathleen M., and Charles A. Nelson 2001 Serial reaction time learning in preschool- and school-age children. Journal of Experimental Child Psychology, 79, 364–387.
Thorndike, E. L., and R. T. Rock 1934 Learning without awareness of what is being learned or intent to learn it. Journal of Experimental Psychology, 17, 1–19.
Toro, Juan M., and Josep B. Trobalón 2005 Statistical computations over a speech stream in a rodent. Perception & Psychophysics, 67, 867–875.
Turk-Browne, Nicholas B., Philip J. Isola, Brian J. Scholl, and Teresa A. Treat 2008 Multidimensional visual statistical learning. Journal of Experimental Psychology: Learning, Memory, and Cognition, 34, 399–407.

Turk-Browne, Nicholas B., and Brian J. Scholl 2009 Flexible visual statistical learning: Transfer across space and time. Journal of Experimental Psychology: Human Perception and Performance, 35, 195–202.
Turk-Browne, Nicholas B., Brian J. Scholl, Marvin M. Chun, and Marcia K. Johnson 2009 Neural evidence of statistical learning: Efficient detection of visual regularities without awareness. Journal of Cognitive Neuroscience, 21, 1934–1945.
Vaidya, Chandan J., Marianne Huger, Darlene V. Howard, and James H. Howard Jr 2007 Developmental differences in implicit learning of spatial context. Neuropsychology, 21, 497–506.
van den Bos, Esther 2007 Implicit artificial grammar learning: Effects of complexity and usefulness of the structure. Unpublished PhD dissertation, Leiden University, Netherlands.
van den Bos, Esther, and Fenna H. Poletiek 2008 Effects of grammar complexity on artificial grammar learning. Memory & Cognition, 36, 1122–1131.
Weiss, Daniel J., and Elissa L. Newport 2006 Mechanisms underlying language acquisition: Benefits from a comparative approach. Infancy, 9, 241–257.
Wentworth, Naomi, and Marshall M. Haith 1992 Event-specific expectations of 2- and 3-month-old infants. Developmental Psychology, 28, 842–850.
Wentworth, Naomi, Marshall M. Haith, and Roberta Hood 2002 Spatiotemporal regularity and interevent contingencies as information for infants' visual expectations. Infancy, 3, 303–322.
Zacks, Rose T., and Lynn Hasher 2002 Frequency processing: A twenty-five year perspective. In P. Sedlmeier and T. Betsch (Eds.), Frequency processing and cognition (pp. 21–36). New York: Oxford University Press.

Bootstrapping language: Are infant statisticians up to the job?

Elizabeth K. Johnson

1. Introduction

Over the years, spoken language acquisition has attracted the attention of intellects from many disciplines. After much debate, two facts are apparent. On the one hand, it is clear that the ability to learn a language must be at least partially built into our human psyche. Even chimps, our closest evolutionary cousins, fail to learn spoken language the way human children do. This is true even if they are cared for and spoken to as if they were human children (Terrace, Petitto, Sanders, and Bever 1979; see, however, Savage-Rumbaugh and Fields 2000). On the other hand, it is also clear that human language acquisition crucially depends on experience. Infants exposed to French learn French, infants exposed to Swahili learn Swahili (see also Curtiss 1977). But what must be inherited and what must be learned? And how do infants learn what they need to learn? All current models of language acquisition represent different answers to these very basic questions.

Currently, many of the most popular models of early language acquisition are what could generally be described as distributional models. Most of these models place a heavy burden on the learning capabilities of prelingual infants. Some researchers working within the distributional framework strive to show how much children could accomplish in the absence of innate linguistic knowledge (e.g. Elman 1999), whereas others still emphasize the importance of linguistically motivated constraints or expectations in statistical learning (e.g. Gervain, Nespor, Mazuka, Horie, and Mehler 2008; Mehler, Peña, Nespor, and Bonatti 2006; Yang 2004). The main focus of this chapter will be to examine distributional learning with the goal of better understanding what exactly we do and do not know about the ability of infants to extract linguistic generalizations from the speech signal. Although the discussion presented in this chapter is meant to apply to many levels of spoken language acquisition (e.g. phonology, morphology, syntax), the examples used to illustrate my points will be drawn primarily from the infant word segmentation literature.

This is simply a reflection of my own research interests, as well as of the fact that much of the recent work on distributional learning has focused on potential solutions to the word segmentation problem. My primary goal in writing this chapter is to stimulate critical thinking about the key challenges facing different types of distributional models. In the end, I am afraid that this chapter will pose more questions than answers.

The diversity of languages that a child must be equipped to learn is immense. For example, Finnish has a rich inflectional morphology, highly predictable stress, and vowel harmony, whereas English, by contrast, is morphologically impoverished, exhibits difficult-to-predict stress, and has no vowel harmony. Likewise, German is an intonation language with very long words, whereas Vietnamese is a tone language with relatively short words. And Italian is a syllable-timed language with a simple syllable structure and clear syllable boundaries, whereas Russian is a stress-timed language with complicated syllable structure and less clear syllable boundaries. The examples provided here just begin to scratch the surface in terms of the possible sound structure patterns exhibited by the thousands of languages spoken by humans around the world. Remarkably, no matter what their characteristics, all natural languages seem to be acquired equally readily by normally developing children. This poses a potential problem for experience-based accounts of language acquisition. How could a single learning strategy account for how children are capable of mastering (or at least gaining a handle on) all of these different types of linguistic systems with relative ease in the first few years of life?

In the decades prior to the mid 1990s, most researchers would have avoided this dilemma by arguing that the key to language acquisition lies in the linguistic knowledge all humans inherit. Through our innate endowment, we are born knowing all of the possible rules that can exist in a language, thereby reducing language acquisition to the process of simply working out which rules apply to the language currently being learned (Chomsky 1957). In the late 1970s and early 1980s, language researchers who opposed this Chomskyan perspective had little ground to stand on, since there was no other fully convincing explanation for how children could acquire the wide variety of languages in the world so quickly. The rules and structure of language were viewed as too complex and unpredictable to be learned, especially by young infants with limited cognitive resources. And early speech perception research initially seemed to support the notion that much of language knowledge was innate, as study after study suggested that neonates were born able to discriminate virtually all possible phoneme contrasts in much the same way as adults (see Jusczyk 1997, for review).

A brief article published in Science in 1996 marked the beginning of a dramatic shift in the way language researchers viewed early language acquisition (Saffran, Aslin, and Newport 1996). In this study, the authors set out to demonstrate that infants could indeed acquire some complicated aspects of language using experience-dependent learning mechanisms. They chose to investigate the word segmentation problem because they viewed it as a tractable problem faced by all language-learning infants. Eight-month-old infants were exposed to an artificial language containing four trisyllabic CVCVCV words. The made-up words were strung together in random order with the stipulation that no single word ever followed itself in immediate succession. The speech stream was produced with a flat intonation and contained no pauses between words. Since the words in this language had no meaning, and the language lacked any prosodic or pause cues to word boundaries, words only existed in the sense that their syllables co-occurred consistently. That is, the words only existed in a statistical sense. Remarkably, after a mere two minutes of exposure, infants listened longer to trisyllabic syllable sequences that spanned word boundaries than to the trisyllabic sequences that corresponded to statistical words. This finding revealed that 8-month-olds are capable of segmenting words from an artificial language using statistical cues alone (in the remainder of this chapter, we will often refer to this particular type of statistical segmentation cue as a syllable distribution cue). In the closing words of the authors:

'Our results raise the intriguing possibility that infants possess experience-dependent mechanisms that may be powerful enough to support not only word segmentation but also the acquisition of other aspects of language. . . the existence of computational abilities that extract structure so rapidly suggests that it is premature to assert a priori how much of the striking knowledge base of human infants is primarily a result of experience-independent mechanisms. In particular, some aspects of early development may turn out to be best characterized as resulting from innately biased statistical learning mechanisms rather than innate knowledge. If this is the case, then the massive amount of experience gathered by infants during the first postnatal year may play a far greater role in development than has previously been recognized.' (Saffran, Aslin, and Newport 1996)
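To make the statistical structure of the stream described above concrete, the following minimal sketch (in Python; mine, for exposition only) builds a stream in the spirit of this design and then recovers the 'words' purely from syllable distributions. The four made-up word forms, the stream length, and the 0.5 boundary threshold are illustrative assumptions rather than the original stimuli or procedure; the point is simply that in such a stream syllable-to-syllable predictability is perfect inside words and drops sharply at word boundaries.

    import random
    from collections import Counter

    # Four made-up trisyllabic CV words (illustrative forms, not the original stimuli)
    WORDS = ["dapiku", "tilado", "burobi", "pagotu"]
    SYLLABLES = [[w[i:i + 2] for i in range(0, 6, 2)] for w in WORDS]

    def make_stream(n_words=300, seed=1):
        """Concatenate words in random order; no word ever follows itself."""
        rng = random.Random(seed)
        stream, prev = [], None
        for _ in range(n_words):
            word = rng.choice([w for w in SYLLABLES if w is not prev])
            stream.extend(word)
            prev = word
        return stream

    def forward_tps(stream):
        """Forward transitional probability: P(Y | X) = freq(XY) / freq(X)."""
        pair_counts = Counter(zip(stream, stream[1:]))
        first_counts = Counter(stream[:-1])
        return {(x, y): n / first_counts[x] for (x, y), n in pair_counts.items()}

    stream = make_stream()
    tps = forward_tps(stream)

    # Posit a word boundary wherever the transitional probability 'dips':
    # inside a word it is 1.0 here, across a word boundary roughly 1/3.
    segments, current = [], [stream[0]]
    for x, y in zip(stream, stream[1:]):
        if tps[(x, y)] < 0.5:
            segments.append("".join(current))
            current = []
        current.append(y)
    segments.append("".join(current))

    print(sorted(set(segments)))  # the four statistically defined 'words'

Run on such a toy stream, this dip-based rule recovers the four words exactly; the empirical question taken up later in this chapter is whether anything nearly this clean is available to infants in natural speech.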

This landmark study was not the first to discuss how statistical information could be used to extract linguistic generalities from language (e.g. Harris 1955), nor was it the first to use an artificial language to study language learning in human adults (e.g. Hayes and Clark 1970; Morgan and Newport 1981; Valian and Coulson 1988) or infants (e.g. Goodsitt, Morgan, and Kuhl 1993).

It was not even the first to suggest that syllable distribution cues could help listeners find word boundaries (e.g. Morgan and Saffran 1995). Nonetheless, this remarkable study struck a chord in the field, making two very timely and pivotal contributions that reshaped the way researchers thought about (and studied) early language acquisition. First, the Saffran et al. study firmly established the use of artificial languages as a mainstream approach to studying infant language development (see Gómez and Gerken 2000, for review). Second, this study suggested that infants possessed truly powerful statistical learning mechanisms that might enable them to learn far more from the ambient environment than researchers had previously imagined possible. Thus, this one study not only inspired language researchers to develop new hypotheses concerning the development of language skills in children, it also gave infant researchers a very powerful tool for experimentally testing these hypotheses (see Bates and Elman 1997, for discussion). In the 15 years since the initial publication of Saffran, Aslin, and Newport (1996), scores upon scores of infant artificial language learning studies have been published (see Aslin and Newport 2008, for a review). Many of these studies have built directly upon Saffran et al.'s initial work, continuing to explore how infants might learn to segment words from speech. Studies have shown that the statistical learning mechanisms that infants use to segment words from an artificial language appear to be domain general, since infants apply them to sequences of tones and visual objects just as readily as they apply them to sequences of syllables (e.g. Saffran, Johnson, Newport, and Aslin 1999; Kirkham, Slemmer, and Johnson 2002). And infants have been shown to track not only simple co-occurrence frequencies between syllables (as tested by the original Saffran et al. 1996 study), but also conditional probabilities between syllables (Aslin, Saffran, and Newport 1998). Rats, on the other hand, appear to succeed only at tracking the simpler co-occurrence frequencies (Toro and Trobalón 2005). Some have suggested that rats' inability to track conditional probabilities might in part explain why they do not develop human-like language (Aslin and Newport 2008). Amazingly, infants have also been shown to readily extract backward transitional probabilities (Pelucchi, Hay, and Saffran 2009b), as well as non-adjacent relationships between segments in an artificial language (Newport and Aslin 2004).
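Since the distinctions among these statistics matter for the arguments that follow, it may help to state them in terms of simple counts over the syllable stream (the notation is mine, added for exposition, and is not that of the studies cited). For a syllable pair XY, the raw co-occurrence frequency is just freq(XY), whereas the forward and backward transitional probabilities condition that count on one of the two syllables:

\[
\mathrm{TP}_{\mathrm{forward}}(X \rightarrow Y) = \frac{\mathrm{freq}(XY)}{\mathrm{freq}(X)}, \qquad
\mathrm{TP}_{\mathrm{backward}}(X \rightarrow Y) = \frac{\mathrm{freq}(XY)}{\mathrm{freq}(Y)}
\]

A learner tracking only co-occurrence frequency is sensitive to how often XY occurs; a learner tracking transitional (conditional) probabilities is additionally sensitive to how predictable Y is given X, or X given Y, which is what separates the conditional-probability and backward-probability findings just cited from simple frequency tracking.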

English-learning infants have even been shown to track transitional probabilities between syllables in highly constrained but nonetheless natural Italian speech (Pelucchi, Hay, and Saffran 2009a). In short, the evidence that infants rely on transitional probabilities between syllables to segment words from speech is quite convincing. If you pick up a recently published undergraduate textbook, chances are you might just read that infants solve the word segmentation problem by tracking transitional probabilities between syllables (e.g. Byrd and Mintz 2010). Specifically, most infant speech perception researchers suppose that infants first segment a limited number of words from speech by tracking transitional probabilities between syllables in the ambient language and positing word boundaries at 'dips' in these transitions. Then, by analyzing the sound properties of these words, infants could deduce language-specific cues to word boundaries in their language, such as lexical stress placement and phonotactic constraints. Indeed, transitional probabilities between syllables seem an ideal solution to the word segmentation problem because such a strategy relies entirely on bottom-up information, circumventing the need to resort to built-in linguistic constraints to explain how infants manage to segment words from speech.

The impact of the original Saffran et al. (1996) paper goes far beyond the study of word segmentation. Infants' success at using transitional probabilities between syllables to extract words from an artificial language helped increase the popularity of connectionist models of cognitive development, and has inspired dozens of studies examining the possibility that infants use statistical mechanisms to learn virtually every other level of language structure. On the segmental level, tracking the distribution of phonetic contrasts in speech has been argued to help infants work out the phonemic inventory of their language (e.g. Maye, Werker, and Gerken 2002; Peperkamp, Calvez, Nadal, and Dupoux 2006). It may also help infants learn the patterning of speech sounds with respect to word boundaries (e.g. Chambers, Onishi, and Fisher 2003; Seidl and Buckley 2005). On the morphosyntactic level, young infants have been argued to track non-adjacent dependencies between linguistic elements, an ability that could help them learn morphosyntactic patterns such as the is . . . -ing dependency in English (e.g. 'is singing' versus 'can singing'; Gómez and Maye 2005). Similar strategies could be used to work out the word class of newly learned words (e.g. Mintz 2003; Monaghan, Christiansen, and Chater 2007). On the syntactic level, there is some evidence that syntactic grammars are learnable via statistical learning mechanisms (e.g. Saffran 2001; Saffran and Wilson 2003; Thompson and Newport 2007; see, however, De Vries, Monaghan, Knecht, and Zwitserlood 2008).

And on the semantic level, research has suggested that language learners could use simple cross-situational statistics to work out the referents of words in an artificial language (e.g. Smith and Yu 2008). The sheer amount of information that artificial language learning studies suggest infants might be able to glean from their language input is staggering, and would have been truly unimaginable 20 years ago.

In the early days after the initial publication of Saffran, Aslin, and Newport (1996), many researchers saw statistical cues such as transitional probabilities between syllables as just another cue (amongst many cues) that infants might have in their arsenal to use in the task of word segmentation (e.g. Jusczyk, Houston, and Newsome 1999). And many researchers found it easier to accept that these statistical cues would be used for 'low-level' problems such as word segmentation rather than for higher-level acquisition challenges such as syntax and word-form-to-referent mapping. But in more recent years, the general zeitgeist of the language acquisition field has truly changed. In the past decade and a half, distributional learning has clearly replaced innate knowledge as the most widely accepted explanation for the better part of children's language learning prowess. Indeed, in some cases, reliance on any sort of innate phonological knowledge has been presented as a weakness in models of developmental speech perception (e.g. see the discussion of 'external components' in Monaghan and Christiansen 2010).

At this point in time, when distributional models of early language acquisition are continuing to climb in popularity, a useful exercise might be to sit back and take stock of both the strengths and weaknesses of these models. As has been summarized in this chapter thus far, the strengths of this type of theoretical approach have been outlined in the literature time and time again and are in large part stunningly (and elegantly) clear. Statistical learning has been argued to provide a universal bottom-up strategy for pulling words out of speech. These models can also explain how infants might begin to learn nearly all other levels of linguistic structure. Since the great majority of descriptions of statistical models in the literature are extremely positive, I would like to spend the larger part of the remainder of this chapter taking a more critical stance and discussing the flip side of the coin. What are the drawbacks to these models? What aspects of these models are underspecified? What behavioral data do these models have difficulty explaining? What assumptions do these models make? Are the assumptions valid? And what does the term 'distributional model' mean? Perhaps the term is so underspecified that it carries little meaning at all.

Note that I do not dwell on the problem areas of distributional models because I think these models are weak; on the contrary, the models are quite powerful, as evidenced by the impact they have had on the field. I think that the statistical learning mechanisms that have been revealed in the past 15 years of research are truly exciting, and distributional models are here to stay. Nonetheless, I think it is important to consider some of the challenges still faced by these models, because identifying and testing weaknesses in popular models of psychological behavior is an important part of the scientific process. Only by testing and challenging currently popular models can our theories move forward and continue to grow and improve to meet the challenges posed by new data and by possible new alternative theories of early language acquisition.

2. Five challenges faced by distributional models of language acquisition

2.1. Does it scale up?

As mentioned above, language scientists have known for many years that statistical patterns in the input are linked to the linguistic structure of languages. However, these statistical patterns were thought to be too inconsistent and too complex to be acquired by young infants and children. Thus, researchers assumed that children must possess substantial innate knowledge about possible language structures in order to acquire language as quickly and as seemingly effortlessly as they do. But now we have substantial experimental evidence to suggest that infants are far better at picking up statistical patterns in the input than we initially thought. Using the artificial language learning paradigm, researchers have created miniature languages containing patterns that are reflective of linguistically relevant statistical patterns in natural language. Infants' ability to learn many of these patterns within a very short period of time suggests that these same patterns can be learned from natural language input. However, there is a problem with using these studies as evidence for the way in which real language learning occurs. The artificial languages used in many infant artificial grammar learning studies are so simplistic that one must wonder whether the ability to learn a pattern in these artificial languages will necessarily scale up to the challenge of learning a pattern in natural language input. Consider, for a moment, the language used in the original Saffran et al. (1996) study on word segmentation.

This language contained four trisyllabic words, and each syllable had a simple CV structure. All syllables were clearly enunciated, fully stressed, and of equivalent duration. And no syllable occurred in more than one word. In other words, the language was not very speech-like at all. The authors admit the language is simplistic, but also note that natural language contains many other cues to word boundaries aside from transitional probabilities. Thus, they never claim that the tracking of transitional probabilities alone could solve the whole segmentation problem. Later publications, however, make stronger claims regarding infants' reliance on transitional probabilities between syllables to segment words from speech. For example, Thiessen and Saffran (2003, 2007) argue that transitional probabilities between syllables are perhaps the first cue used by children to segment words from speech. These cues could then be used to learn other important cues to word boundaries, such as the placement of lexical stress. Other studies have argued that phonotactic cues to word boundaries could be learned from transitional probabilities, once again suggesting that transitional probabilities between syllables are the first and most important cue infants use to begin tackling the word segmentation problem (Sahni, Seidenberg, and Saffran 2010). The clear assumption in these studies is that infants can track transitional probabilities between syllables in natural speech in much the same way that they can track them in a highly simplified artificial language (i.e. in the absence of any other cues to word boundaries). Analogous arguments have been made for the acquisition of other levels of linguistic knowledge (e.g. Onnis and Christiansen 2008). But is it valid to assume that the ability to track linguistic patterns in a highly simplified language will necessarily scale up to the challenge of natural language?

There are a few ways to test this assumption. In a perfect world, we would know the transitional probabilities between syllables in natural language (or, for example, the statistical likelihood of a word being uttered when a particular object is in sight). Testing the ecological validity of distributional models of word segmentation (or of the acquisition of any other level of linguistic structure) would then simply involve seeing whether infants were sensitive to this information in the environment. But for numerous obvious reasons, such an experiment is clearly impossible (see Johnson and van Heugten, 2012, for discussion). Who is to say precisely how many times a given child has heard or experienced a particular pattern in the input? And even if we knew, who is to say the information was attended to? The next best option for testing the ecological validity of these models might be to present infants with an unfamiliar natural language and see whether they use transitional probabilities to locate likely word boundaries.


The next best option for testing the ecological validity of these models might be to present infants with an unfamiliar natural language and see whether they use transitional probabilities to locate likely word boundaries. Such an experiment has in fact been carried out (Pelucchi, Hay, and Saffran 2009a; 2009b). In this cleverly designed study, English-learning 8.5-month-olds were presented with Italian passages containing repeated tokens of two trochaic bisyllabic words (e.g. fuga and melo). The infants had never heard Italian before, and the syllables in these target words did not occur elsewhere in the passages. The researchers could therefore be assured that, at least for the English-learning infants they tested, the transitional probability between the two syllables in each of these words was 1.0. The passages contained two additional target words that occurred just as often as the first two target words (e.g. bici and casa). Importantly, however, the initial syllables of these words also occurred as monosyllabic words in the passage (e.g. ca and bi), so for this second set of target words the transitional probability between the syllables was .33. Amazingly, the English learners not only segmented the high transitional probability Italian target words from the passage, they also differentiated between the high transitional probability words and the low transitional probability Italian words. The authors concluded that infants can track transitional probabilities in speech in the face of naturalistic speech sound variation. This study represents the strongest support to date for the notion that infants use transitional probabilities between syllables to extract words from natural speech.

Note, however, that although this study does demonstrate that English learners can track transitional probabilities in Italian, it does not demonstrate that infants use transitional probabilities to segment their first words from real-world language. One reason is that the training phase in the Italian study was artificially constructed so that the statistical cues would be particularly salient to infants. To put it more simply, by creating a stream of Italian speech that contained many repetitions of the target words produced by a single speaker in a short period of time, and by ensuring that the transitional probabilities between the syllables in the high and low transitional probability target words were markedly different (1.0 and .33, respectively), this study employed an unnatural sample of a natural language. In English, the dips in transitional probabilities would likely be far less dramatic. And in other languages with many monosyllabic words, such as Thai, it seems likely that the syllable transition cues to word boundaries would be even weaker than in English.


Moreover, the choice of language materials used in the Pelucchi et al. study may have affected the findings, since tracking syllables is probably relatively easy in a syllable-timed language with clear syllable boundaries, such as Italian, compared to a stress-timed language with less clear syllable boundaries, such as Russian or Hungarian. For this reason, it would be interesting to repeat this study with Russian rather than Italian stimuli (or to test infants learning a language other than English on their ability to track transitional probabilities in English). It would also be interesting to carry out additional studies to ensure that the infants were truly segmenting the whole words (rather than parts of words) from speech, as is sometimes the case in artificial language studies (see Jusczyk, Houston, and Newsome 1999; Johnson, 2005; Johnson and Jusczyk 2003b, for related discussion). Last, but certainly not least, since the researchers' goal was to show that transitional probabilities could be tracked in natural speech, they did not control for intonation and other prosodic grouping cues that likely provided additional cues to word boundaries. Thus, for all of these reasons, despite representing a very important and exciting step forward in the study of infant statistical learning abilities, these Italian segmentation studies do not entirely answer our question of interest. They show that exaggerated transitional probabilities can be tracked in natural Italian speech containing many cues to word boundaries, but they do not show that infants can track the naturally occurring transitional probabilities between syllables in natural language (which are presumably much less clear-cut than the 1.0 versus .33 contrast used in the Italian study), and use these cues to bootstrap all other language-specific segmentation cues.

A third possible way to test whether infants' ability to track transitional probabilities between syllables can scale up to the challenge of natural language would be to take an artificial language and systematically make it a bit more like a natural language, in order to gauge the likelihood that infants can track distributional information in natural language. By doing so, one could begin to address the 'scaling problem' while also maintaining the exquisite experimental control offered by artificial language studies. This is precisely the approach initially taken by Peter Jusczyk and myself. We created a natural speech analogue of the original Saffran et al. (1996) artificial language. We then added a speech cue that either conflicted or aligned with the statistical cue in the artificial language. In both cases, 8-month-olds extracted the words aligned with the speech cues, suggesting that at least at 8 months infants appear to weigh speech cues to word boundaries more heavily than syllable transition cues (see also Thiessen and Saffran 2003; Johnson and Seidl 2009).


In a more recent study, rather than adding an additional speech cue to the speech stream, we instead focused on removing regularities other than the transitional probabilities between syllables that could have been helping infants extract words from simple artificial languages (Johnson and Tyler 2010; see also Johnson and Jusczyk 2003). We had many regularities to choose from, including syllable structure, word length, and uniform phoneme length. There are at least two ways in which these regularities could have helped infants track transitional probabilities in the original artificial language. First, by excluding all other natural language variability, highly simplified artificial languages may highlight syllable distributions as an important pattern in the input (much more so than they are highlighted in natural languages). Second, once infants commence transitional probability tracking in a simplified artificial language, sound structure regularities could serve as additional segmentation cues, boosting initial distribution-based hypotheses regarding likely word boundaries (see Sahni, Seidenberg and Saffran 2010, for a related discussion).

As a first step towards exploring the role of variability in infants' ability to track transitional probabilities, we chose to focus on word length. We created two sets of artificial languages: one with 4 bisyllabic words (the uniform word length language) and one with 2 bisyllabic words and 2 trisyllabic words (the mixed word length language). Despite the fact that both the uniform word length and the mixed word length languages contained equally strong statistical cues to word boundaries, both the 5.5- and the 8.5-month-olds we tested succeeded in segmenting words only from the uniform word length language. That is, infants succeeded if they were exposed to the language containing both syllable transition cues to word boundaries and word length regularities, but they failed if they were exposed to the language containing only syllable transition cues to word boundaries. Since all natural languages contain words of varying length, this study gives one reason to fear that infants' ability to track transitional probabilities in a highly simplified artificial language might not scale up to the challenge of natural language. Note that some studies have shown that infants can use transitional probabilities between syllables to segment words from a language containing words of variable length; however, these studies involved artificial languages that contained both statistical and prosodic cues to word boundaries (Thiessen, Hill, and Saffran 2005). Even adults appear to have great difficulty using syllable distribution cues alone to find word boundaries in artificial languages containing words of varying lengths (Tyler and Cutler 2009).


Given how difficult it appears to be to track transitional probabilities between syllables (even in a simple artificial language), one is left wondering whether infants might have an easier solution up their sleeve (e.g. see the discussion of utterance-level prosody in Endress and Hauser, 2010; Johnson and Seidl 2008; Johnson and Tyler 2010). Further research in this area is clearly needed.

2.2. Which unit? And once you pick a unit, which statistic?

Artificial language learning studies are not the only line of evidence in support of distributional models of language acquisition. Computational models demonstrating what information statistical learning mechanisms could, in theory, extract from the speech signal provide another important line of evidence for distributional learning theories. But all computational models have to make assumptions regarding what sort of information infants can perceive and process, and calculating statistics requires having some unit over which to do the calculations. It is impossible to implement a computational model of early language acquisition without making assumptions about the relevant units and types of calculations. At first blush, choosing a unit to track in the input sounds like a trivial challenge for most models of early language development. The infant speech perception literature provides reason to choose either the syllable or the phoneme as the basic unit to be tracked (see Jusczyk, 1997, for a review; see Vihman and Vihman, 2011, for a different perspective). Indeed, models such as PARSER (Perruchet and Vinter 1998) and Swingley's (2005) segmentation model have chosen the syllable as the basic unit, whereas models such as PUDDLE (Monaghan and Christiansen 2010) and BOOTLEX (Batchelder 2002) have chosen the phoneme. Later on, as children begin learning the words of their language (by tracking statistical relationships between phonemes and/or syllables), models typically assume that words (and even bound morphemes) serve as the units over which additional computations are performed.
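To make concrete just how many commitments even the simplest such model embodies, the sketch below implements the bare bones of a distributional segmenter: it commits to a unit, a statistic (forward transitional probability), and a decision rule (posit a boundary wherever the statistic dips below a threshold). It is a minimal illustration only, not a reimplementation of PARSER, PUDDLE, BOOTLEX, or Swingley's model, and the four made-up 'words', the idealized stream, and the boundary threshold are all assumptions adopted purely for exposition.

import random
from collections import Counter

# Four hypothetical trisyllabic 'words' (illustrative only; not items from any published study).
WORDS = [("bu", "pa", "da"), ("go", "la", "tu"), ("ti", "me", "ro"), ("no", "ki", "su")]

def make_stream(n_tokens=300, seed=1):
    """Concatenate randomly chosen words: an idealized familiarization stream."""
    random.seed(seed)
    stream = []
    for _ in range(n_tokens):
        stream.extend(random.choice(WORDS))
    return stream

def forward_tps(units):
    """Forward transitional probability for every attested pair of adjacent units."""
    pair_counts = Counter(zip(units, units[1:]))
    first_counts = Counter(units[:-1])
    return {pair: count / first_counts[pair[0]] for pair, count in pair_counts.items()}

def segment(units, threshold=0.65):
    """Posit a word boundary wherever the forward TP falls below the threshold."""
    tps = forward_tps(units)
    words, current = [], [units[0]]
    for x, y in zip(units, units[1:]):
        if tps[(x, y)] < threshold:
            words.append(" ".join(current))
            current = []
        current.append(y)
    words.append(" ".join(current))
    return words

stream = make_stream()
print(sorted(set(segment(stream))))
# With syllables as the unit, word-internal TPs are 1.0 and boundary TPs hover
# around .25, so the four 'words' are recovered. The outcome depends entirely
# on the unit chosen, the statistic computed over it, and how variable the input is.

Every choice made in this toy example (the unit, the statistic, the threshold) is one that a real model, and by hypothesis a real infant, would somehow have to make.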


However, even a seemingly simple assumption like the choice of the unit over which a child performs calculations is not without its controversies. How did infants extract this unit in the first place? Was the knowledge inborn? Or is it simply a product of the auditory system? Indeed, the classic infant speech perception story is that children are born perceiving syllabic units as well as nearly all segmental contrasts present in the world's languages (Jusczyk 1997; Werker and Tees 1984). However, we now know that this is a bit of an oversimplification. First, most newborn speech perception studies have only been carried out under ideal listening conditions involving the presentation of clearly articulated isolated syllables or words. We still have a lot to learn about how infants perceive speech contrasts in different contexts (e.g. Bortfeld and Morgan 2010) or distracting environments (e.g. Newman 2009). Moreover, not all speech contrasts are initially perceived equally easily (see Jusczyk 1997, for a review). We now know that acquiring the phoneme inventory of one's language does not simply involve losing sensitivity to contrasts that do not signal changes in word meaning in the native language; it also involves a fair amount of tuning into sound contrasts that were initially difficult to perceive (Narayan, Werker, and Beddor 2010; Tsao, Liu, and Kuhl 2006). And some learning environments surely make this more difficult to accomplish than others (Sundara and Scutellaro 2011). Moreover, categorizing syllables and segments (and even words) can be complicated by various language-specific coarticulatory and suprasegmental phenomena (e.g. Curtin, Mintz, and Christiansen 2005). What this means is that regardless of whether you choose the phoneme or the syllable (or even the word) as your basic unit, the information a young infant pulls from the speech signal will obviously be very different from the information an adult (or a highly trained transcriber listening to speech recordings) will pull from it. This is especially true since many models suggest that infants are tracking these statistics at around 6 months of age or so, before infants have mastered the segmental inventory of their language or have even learned whether they are learning a tone language or an intonation language (Mattock, Molnar, Polka, and Burnham 2008).

The situation becomes even more complicated when you consider the differences between broadly transcribed speech and spoken language. Spoken and broadly transcribed language do not just differ in the amount of information they carry, they also differ in the type of patterns they highlight. For example, spoken language contains rich prosodic grouping cues and immense fine-grained acoustic-phonetic variation in the realization of speech sounds. No two realizations of the same word are ever identical, and there is evidence to suggest that infants are sensitive to even more fine-grained information than this from very early on (Johnson 2003; 2008; Johnson and Jusczyk 2001; McMurray and Aslin 2005). Transcriptions, on the other hand, typically represent each realization of a word in an identical fashion. Representing the input in such a categorical fashion assumes that infants have already solved the many-to-one mapping challenges caused by the lack of invariance in the speech signal. It is worrisome that many of the challenges faced by an infant hearing spoken language (such as dealing with connected speech processes like graded assimilations, resyllabification, casual speech reduction, and stress shifts) are typically not dealt with in computational models.


Is it fair to assume that infants know what type of acoustic variation is linguistically important and what type is not? Some studies have suggested that infants are adept at coping with the lack of invariance in the production of speech sounds (Jusczyk, Pisoni, and Mullennix 1992; Kuhl 1979; van Heugten and Johnson 2012), while other studies have suggested that infants have a fair bit of difficulty recognizing commonalities between acoustically distinct realizations of syllables and words (Bortfeld and Morgan 2010; Houston and Jusczyk 2000; Schmale, Cristià, Seidl and Johnson 2010; Singh, Morgan and White 2004). Only further work in this area will be able to clarify the situation.

Some computational modelers have simply dismissed concerns over the fact that fine-grained information was not incorporated in their models, claiming that such information could only help children find information in the speech signal (e.g. Batchelder 2002). In other words, their models are conservative estimates of children's performance because they have not incorporated all of the useful information present in real speech. But statements such as these assume an incredible amount in terms of the speech perception and processing abilities of young infants. Is this the best strategy for researchers who claim to be designing models that reduce the amount of built-in knowledge children have to be equipped with? Other modelers have tried to correct for some of the most basic differences between spoken and written language by replacing some dictionary pronunciations with the pronunciations we typically see in specific speech environments, or by adding some random variation to the realization of words (Cairns, Shillcock, Chater and Levy 1997; Monaghan and Christiansen 2010; Swingley 2005). Although such moves are a step in the right direction, the gap between spoken and transcribed language remains enormous. And connected speech processes, such as assimilation, are often (if not always) gradient (e.g. Gow 2002). Even if modelers were to use some sort of complex mathematical vector in their models that carried all of the variation (some useful, some likely distracting) contained within the speech signal, this would still not solve the problem of units in computational models. How would infants know to categorize and make use of all of this variation? How would they know to calculate statistics over phonological units without first knowing the units? Very recent work has shown that adding naturalistic variation to a corpus-based model of segmentation seriously hinders the model's performance (Rytting, Brew and Fosler-Lussier 2010), but it seems that a better understanding of how infants perceive and process speech variability is necessary before such conclusions can be extended to human segmentation behavior.


It is possible that the same variability that hinders the performance of computational models of word segmentation may in fact help young human listeners (see Johnson 2003; Rost and McMurray 2009, for related discussions).

Setting aside the difficulties of choosing the unit over which infants should calculate their statistics (and the question of how infants overcome the many-to-one mapping problems involved in identifying these units in an adult-like manner), there is another enormous challenge in clearly defining exactly what type of statistical formula children may be working out in their heads. Even in the simple artificial languages specifically designed to address whether infants are capable of tracking a specific type of statistic, there are often multiple ways the patterns in the input could be tracked (e.g. Bonatti, Peña, Nespor and Mehler 2006; Endress and Mehler 2009; Perruchet and Desaulty 2008; Perruchet, Peereman & Tyler 2006). For example, do humans compute primarily forward or backward transitional probabilities? Are they learning rules or patterns? Are they calculating simple statistics or more complex statistics? Are all segments (and syllables) treated equally in these calculations? Are adjacent relationships easier to learn than non-adjacent relationships? How big does a 'dip' in a transitional probability have to be before a listener hypothesizes that a word boundary has occurred? The list goes on and on. And of course, the type of calculations you assume infants can perform, and the criteria by which you evaluate the success of any given distributional learning model, can have an enormous impact on the conclusions you draw (e.g. Yang 2004).

One unresolved mystery in the literature that I have long been interested in revolves around the calculations infants perform over the input they receive during artificial language training. In all studies examining adults' ability to segment words (or tone words) from an artificial language containing no cue to word boundaries other than the transitional probabilities between syllables, adults consistently perceive partwords consisting of the last two syllables of one word and the first syllable of another as more 'word-like' than partwords consisting of the last syllable of one word and the first two syllables of another (e.g. all else being equal, if bupada and golatu are statistical words, then padago will sound more word-like than dagola; Saffran, Johnson, Aslin and Newport 1999; Saffran, Newport and Aslin 1996). It is not immediately clear to me how attention to transitional probabilities alone can account for this effect, because both partwords are equally 'word-like' in terms of their transitional probabilities.
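To spell out why the statistics themselves cannot distinguish the two partwords, consider the example items above under the standard forward transitional probability and a stream in which the statistical words follow one another in random order (both assumptions made here purely for illustration):

\[ \begin{aligned} \textit{padago}: &\quad \mathrm{TP}(pa \rightarrow da) = 1.0 \ (\text{word-internal}), \qquad \mathrm{TP}(da \rightarrow go) = p \ (\text{word-spanning})\\ \textit{dagola}: &\quad \mathrm{TP}(da \rightarrow go) = p \ (\text{word-spanning}), \qquad \mathrm{TP}(go \rightarrow la) = 1.0 \ (\text{word-internal}) \end{aligned} \]

where p is whatever (low) probability the boundary transition da-go happens to have in the stream. Each partword thus contains exactly one perfect word-internal transition and one identical word-spanning transition, so a learner consulting only these numbers has no basis for preferring one partword over the other.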


The Saffran et al. (1996) infant word segmentation paradigm avoids this interpretational difficulty by testing infants only on the partwords consisting of the last syllable of one word plus the first two syllables of another (the partwords that adults found least 'word-like'). However, one study that examined infants' sensitivity to the transitional probabilities between the first and second syllables of a statistical word versus the second and third syllables found a pattern very reminiscent of the adult pattern. When exposed to an artificial language like the one used in Saffran et al., infants seem to perceive the last two syllables of a statistical word as more 'word-like' than the first two syllables (Johnson and Jusczyk 2003b). This is despite the fact that the transitional probabilities between the first two and the last two syllables of the words were held equal. This behavior of recognizing part of a word as familiar does not neatly match up with the behavior we see in natural language segmentation studies (in general, infants tend to segment whole words from speech, not parts of words; Houston, Santelmann and Jusczyk 2004; Jusczyk, Houston and Newsome 1999). This example serves to illustrate how little we really know about which statistics infants are actually tracking, even in very simplified artificial languages.

Studies on the acquisition of morphosyntactic dependencies provide another example where infant behavior in a natural language study does not entirely line up with behavior in an artificial language learning study. Based on artificial language studies, it has been claimed that increased variation in the tokens occurring between two non-adjacent elements should make it easier to learn the relationship between those non-adjacent elements (e.g. Gómez and Maye 2005). For example, Gómez (2002) found that toddlers exposed to nonsense strings with an A-X-B structure (e.g. pel wadim rud, pel kicey rud, etc.) learned the non-adjacent dependency between A and B when 24 different elements occurred in the X position, but not when only 3 or 12 elements occurred in that position. Gómez concluded that as the strength of the adjacent dependencies diminishes (i.e. as the number of elements that can occur in the X position increases), infants shift their attention from adjacent dependencies to non-adjacent dependencies.
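The arithmetic behind this trade-off is worth making explicit. Under the simplifying assumption that each filler is equally likely to occur in the X position, the adjacent statistic weakens as the filler set grows, while the non-adjacent statistic never changes:

\[ \mathrm{TP}(A \rightarrow X) \;=\; \frac{1}{\text{number of possible fillers}} \;\approx\; .33,\; .08,\; \text{or}\; .04 \quad \text{for 3, 12, and 24 fillers,} \]
\[ \text{whereas} \quad \mathrm{TP}(A \rightarrow \ldots \rightarrow B) \;=\; 1.0 \ \text{throughout.} \]

On this account, it is only once the adjacent statistics have become this unreliable that learners fall back on the invariant non-adjacent relationship.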


Van Heugten and Johnson (2010) were interested in whether any evidence for this phenomenon could be observed in natural language acquisition. Interestingly, in a study combining a perception experiment with a corpus analysis, van Heugten and Johnson found no evidence that the acquisition of natural language non-adjacent dependencies is affected by variability in the material intervening between the elements forming the dependency. More specifically, they found that Dutch infants appeared to learn the diminutive dependency before the plural dependency, despite the fact that their corpus analysis revealed that the plural dependency tends to have far more intervening token variability in the input than the diminutive dependency. Taken at face value, this could be taken as evidence that the acquisition of non-adjacent dependencies in artificial languages may depend on different computational mechanisms than the acquisition of non-adjacent dependencies in natural languages. Of course, natural language examples cannot be nearly as well controlled as artificial language examples, so it may be the case that other complicating factors were driving the order of acquisition of these dependencies. Clearly, one exception does not necessarily break the generality. Nonetheless, it certainly motivates one to investigate this issue further.

To summarize, a broadly transcribed orthographic representation of speech (even with some corrections for typical pronunciation variants in particular contexts) is a completely different sort of animal than the speech actually produced in everyday interactions (e.g. Johnson 2004; Shockey 2003). Even something as simple as a syllable count can become complicated when normal everyday reduction is taken into account. Consider that the word probably can be produced with one [prai], two [prabli], or three [prabbli] syllables in different speech conditions. And connected speech processes, such as assimilation and coarticulation, are not easily captured in most current computational models because they are graded rather than binary, all-or-none phenomena. Adults are not thrown off by the lack of invariance in the speech signal and appear to be highly sensitive to the acoustic-phonetic detail of utterances. Recent research has shown that adults use this information to work out the intended meaning behind others' utterances. We are only just beginning to understand how infants deal with variation and acoustic-phonetic detail in speech. We know very little about how infants perceive speech in context. And even if we assume that infants can extract the same information that adults can extract, we still have very little understanding of which of the many possible statistical analyses infants are performing over the input. When you combine these concerns with those expressed in the previous section of this chapter ('Does it scale up?'), you are faced with the real possibility that the solutions natural language learners use to master natural languages may be not just quantitatively but qualitatively different from those used by artificial language learners. We are not the first to worry about these types of issues. In the words of Soderstrom, Conwell, Feldman and Morgan (2009):


'. . . extant models have been hand-crafted for particular problems, selecting relevant properties of input and learning mechanisms to arrive at predetermined output structures. Although these models may provide proofs-in-principle on possible ways to solve language learning problems, they do not illuminate how the learner knows how to select and organize the input appropriately for any particular task, what the most appropriate output representations might be, or how the learner chooses specific statistical analyses to pursue.'

2.3. Can distributional models predict children's difficulties?

A good test of any model is its ability to predict errors or difficulties in performance. One very nice aspect of distributional models of word segmentation is that they make strong, testable predictions about where infants should make segmentation errors. Below I summarize some behavioral tests of these predictions. More research is clearly needed in this area, but at present it seems that the infant data do not line up as neatly as one would like with the predictions made by the strongest (or perhaps simplest) proposed models of distributional learning. I will illustrate this issue with examples from the segmentation literature.

If a language learner were to rely heavily on transitional probabilities between syllables to extract an initial cohort of words from speech, there are two types of errors that we would expect to see. First, children should under-segment frequently co-occurring words such as 'see it' or idiomatic phrases such as 'piece of cake', treating them as one word. Second, children should over-segment words containing highly frequent spurious words or morphemes embedded within them, such as the 'be' in 'behave', the 'a' in 'parachute', or the 'ing' in 'singing'. To some degree, corpus studies have supported the notion that such errors should be seen. For example, in a study designed to demonstrate that infants can segment words from speech by tracking transitional probabilities between syllables, Swingley (2005) reports the following patterns in his results:

'Examination of the Dutch false alarms suggests two factors that conspired to reduce the accuracy of the Dutch bisyllable analyses. One was the number of fixed expressions consisting of pairs of monosyllabic words. For example, the Dutch false alarm hou vast ("hold on") contains two words that hardly ever occurred in other contexts. As noted previously, several bisyllabic false alarms were conventional expressions, particularly in the Dutch analyses.


More importantly, Dutch infant-directed speech contains more trisyllabic words than similar English speech; on occasion these words were not detected as trisyllables, but did trigger postulation as bisyllables. Examples include the first two syllables of boterham ("sandwich"), mannetje ("little man"), mopperen ("being grumpy"), vreetzakje ("little eating-too-much person"), and zonnetje ("little sun"), and the last two syllables of olifant ("elephant") and eventjes ("just" or "for a little while"). Some of these words are morphologically complex, consisting of a bisyllabic word and the diminutive suffix -je or -tje. The Dutch diminutive is productive and frequent, making full trisyllables containing the diminutive suffix difficult to extract. Thus, to some degree the relatively low accuracy of the Dutch analyses can be traced to structural properties of the language.'

Behavioral studies, however, suggest that infants perform far better than a distributional model relying solely on transitional probabilities would predict. Research has shown that infants (like adults) do not, as some models might predict, simply pull out recurrent patterns in the input (Mattys and Jusczyk 2001). Moreover, infants do not over-segment multisyllabic words containing spurious function words embedded within them. For example, 7.5-month-olds familiarized with passages containing trisyllabic words such as 'parachute' subsequently recognize the word 'parachute', but not the words 'pair' or 'chute' (Houston, Santelmann and Jusczyk 2004). This is despite the fact that the syllable 'a' is a highly frequent word in English that might be expected to trigger over-segmentation of longer words containing it. At the same time, 7.5-month-olds familiarized with repeated three-word phrases such as 'pair of mugs' pull out 'pair' and 'mugs' as units, but not 'pair of mugs' (Johnson, van de Weijer and Jusczyk 2001). These studies suggest that young infants may have some strategy besides simple syllable distribution analyses to help them detect the intended word boundaries in speech.

Other studies point to similar conclusions. Eight-month-olds familiarized with a passage containing the word 'catalogue' segment out the word 'catalogue', but not the word 'cat'. At the same time, 8-month-olds familiarized with a passage containing repetitions of the phrase 'cat a log' segment out the word 'cat' but not the word 'catalogue' (Johnson 2003). In a related study that controlled for both intonation boundaries and the occurrence of spuriously embedded function morphemes, 7.5- and 12-month-olds recognized the word 'toga' only when they were familiarized with passages containing repetitions of the phrase 'toga lore'. Infants did not recognize the word 'toga' when they were familiarized with a passage containing repetitions of the phrase 'toe galore' (Johnson 2003; 2008a).


And finally, in a study examining Dutch-learning 11-month-olds' segmentation of polysyllabic words ending in the highly frequent diminutive suffix '-je', it was found that, for the most part, infants tended to perceive the suffix as part of the polysyllabic word (i.e. the suffix was not blindly 'stripped off', as might be predicted by some distributional models; Johnson 2008b). Taken together, these studies suggest that 1) infants do not always over-segment polysyllabic words containing frequently occurring spurious function words embedded within or attached to them, and 2) repeatedly re-occurring strings of words are not necessarily under-segmented (see, however, Johnson, 2003, for evidence that infants perceive idiomatic renditions of word strings as more word-like than their literal word string counterparts).

At the same time, there are some findings in the literature that appear to show that infants really should (at least sometimes) make the errors predicted by simple distributional models of word segmentation. For example, studies have shown that infants segment the nonsense word 'breek' from the utterance 'thebreek' (which contains a real function word within it), but they do not segment the word 'breek' from the utterance 'kuhbreek' (which does not contain a real function word within it; Shi, Cutler, Werker and Cruickshank 2006). The authors interpreted this study as evidence that the infants 'stripped off' the frequent word 'the' to discover the new word 'breek' (see Christophe, Guasti, Nespor, Dupoux and Van Ooyen 1997, for discussion). Can we reconcile this evidence for function word stripping with the fact that the occurrence of 'a' in catalogue and parachute does not cause 8-month-olds to over-segment these words into three-word phrases? There are at least two good ways to reconcile these findings. It is very possible that the 8-month-olds tested in the catalogue segmentation study had not yet learned the function words embedded within the trisyllabic words. Perhaps the results would have been different had we tested slightly older infants. Another possibility is that infants are sensitive to the acoustic-phonetic differences between spurious and intended renditions of words. In other words, infants may be able to do something not predicted by most distributional models of word segmentation: they may be somehow sensitive to speakers' intended productions and differentiate between spurious and intended function words (see Conwell and Morgan 2007; Johnson 2003, for a related discussion). Using the fine-grained acoustic-phonetic structure of natural utterances to work out the parse of a speaker's intended message may actually be necessary to explain word segmentation, given the prevalence of spurious embedded words in languages such as English (McQueen, Cutler, Briscoe and Norris 1995).


This notion fits nicely with a growing body of literature demonstrating that attention to the fine-grained acoustic-phonetic structure of speech is also important for adult speech perception (e.g. McMurray, Tanenhaus and Aslin 2002; Shatzman and McQueen 2006; Spinelli, McQueen and Cutler 2003). Note, however, that if this is true, then it suggests a very large gap between what infants are attending to in the speech signal and what information corpus models based on orthographic transcriptions are tracking. It may be the case that models of developmental speech perception could predict infants' segmentation errors much more accurately if they took some of the factors discussed in this section into account. Clearly, additional research is needed in this area.

2.4. How much innate knowledge is too much innate knowledge?

By the end of their first year of study in Psychology, undergraduates are already citing the mantra that no behavior is completely learned or completely innate. Rather, all human abilities depend on a combination of experience-independent and experience-dependent factors. Of course they are right. Given what we know today about the complex interaction between our biological endowment and our environment, it would be ludicrous to believe anything else. Indeed, from the mid-nineties onward, nearly all (if not all) proponents of distributional models of language development would have heartily agreed with this statement. And even prior to the popularization of distributional models, even the staunchest supporters of the nativist perspective had to allow a substantial learning component into their models, because every language is unique (in other words, even if the parameters along which languages can vary are innate, they still have to be set). Indeed, one could argue that infants' statistical learning abilities simply justify the assumptions underlying many parameter-setting models. But given that language abilities must be based on a combination of learned and innate factors, how much innate knowledge is too much (or too little) innate knowledge?

Some have invoked Occam's Razor to answer this question. According to Occam's Razor, if you have two hypotheses that explain an observation equally well, then the simplest, most parsimonious hypothesis is best. Using this sort of logic, some have argued that a model allowing for any built-in (or innate) constraints on processing is less acceptable than one that can do without them. But is this a justifiable use of Occam's Razor?


Who is to say whether it is more parsimonious to propose that 1) infants have calculator-like brains that can expertly track and extract a specific generalization from a particular multi-level complex pattern in the input, or 2) infants are born with predispositions or innate constraints (be they perceptual or cognitive) that narrow the range of solutions infants consider when faced with working out a particular pattern in their language input? Perhaps we need to allow for both possibilities, and avoid ruling out the latter a priori simply because an experience-based calculation (no matter how complex) can in theory do the work (especially given the concerns expressed above in the section 'Which unit? And once you pick a unit, which statistic?').

To make this discussion more concrete, I will illustrate my point with an example from the domain of word segmentation. Brent and Cartwright (1996) improved the performance of their distributional model of word segmentation by requiring that all possible parses contain a vowel. Similar constraints have been built into other adult models of word segmentation (Frank, Goldwater, Mansinghka, Griffiths and Tenenbaum 2007) and online word recognition (Norris, McQueen, Cutler and Butterfield 1997). Others have proposed a linguistically motivated and very useful constraint that no word should contain more than one syllable carrying primary stress (Yang 2004). Still others have suggested that statistics are tracked differently across different types of speech segments (Mehler, Peña, Nespor and Bonatti 2006). What all of these suggestions have in common is that they are not proposing ad hoc constraints on early speech perception; they are proposing linguistically motivated constraints. However, some members of the modeling community seem to suggest that, if possible, we should reject the need for these constraints because the use of such a constraint assumes innate knowledge and is therefore not parsimonious (e.g. Monaghan and Christiansen 2010). But is this justified? It seems to me that one needs to be careful not to confuse the most parsimonious way to design a computational model with the most parsimonious way to explain early language acquisition. Rather than blindly invoking Occam's Razor, a better way to address this issue might be to at least give linguistically motivated constraints a healthy consideration. For example, why not run perceptual studies designed to test whether infants actually possess a constraint against considering segmental strings lacking vowels as possible words? Indeed, there is some evidence in the literature that such a constraint might exist. For example, English-learning 12-month-olds behave like adults in that they segment 'win' from 'window' but not from 'wind' (Johnson, Jusczyk, Cutler and Norris 2003). Additional behavioral data from younger infants would be useful to help decide whether it is parsimonious to include such a constraint in computational models of word segmentation.


Note, however, that if such a constraint were included in computational models, it would have to be flexible enough to deal with language-specific exceptions such as Slovak prepositions (Hanulikova, McQueen and Mitterer 2010). In short, it is clear that infants have excellent statistical learning abilities, and the distribution of linguistic patterns surely provides infant statisticians with useful information. However, does this mean that we should allow no innate biases or learning constraints beyond a tendency to look for statistical patterns in the input? It seems to me that built-in constraints on models of early language learning can be implemented in a parsimonious fashion, and in the end I suspect they will prove necessary to explain several aspects of early language acquisition.

2.5. Looks can be deceiving: the potential dangers of rich interpretations

Nearly all of the infant behavioral studies testing distributional models of language acquisition that have been reported thus far in this chapter are based on one very simple dependent measure: length of look. Indeed, most infant testing paradigms use length or speed of look because we cannot ask infants to verbally indicate whether they, for example, recognize a word or a grammatical construction. Looking paradigms have revolutionized the field of developmental psychology. However, there has been a longstanding uneasiness with the difficulties involved in interpreting infant looks (e.g. Aslin 2007; Cohen 2001). My own view is that looking paradigms are absolutely invaluable in the study of infant perception and cognition. However, it is important that the results of these studies be interpreted very cautiously, especially when examining the processing of higher-level structures of language (see Kooijman, Johnson and Cutler 2008, for a related discussion).

In the past, one common approach to compensating for the weaknesses of looking procedures was to run many experiments addressing the same issue, trying to control for as many alternative explanations as possible. For example, there were fifteen (fifteen!) experiments in the original paper claiming that English-learning infants use lexical stress to identify word boundaries in fluent speech (Jusczyk, Houston and Newsome 1999). Certainly, even in the 1990s, reporting 15 experiments in a single paper was out of the ordinary. But nowadays one is hard-pressed to find a paper on infant language learning with more than 2 or 3 experiments.


This would be fine if we were all interpreting our results rather cautiously, but even when we try to be cautious it is sometimes hard to imagine all the different possible explanations for the results we obtain (not to mention the pressure to publish exciting findings, and to do so very quickly). I am not suggesting that papers with 2 or 3 infant experiments should not be published, but certainly, in the best-case scenario, multiple follow-up studies and replications would also be published in order to ensure the generality of the conclusions drawn in the original set of studies.

Studies examining infants' sensitivity to grammatical constructions provide a good example of the shortcomings (and frustrations) of looking time data. For example, toddlers have been shown to listen longer to utterances containing grammatical morphosyntactic dependencies (e.g. she is walking the dog) than to utterances containing ungrammatical dependencies (e.g. she can walking the dog; Höhle, Schmitz, Santelmann and Weissenborn 2006; Santelmann and Jusczyk 1998). This has been interpreted as evidence that infants have learned (or are sensitive to) this dependency. The results of artificial language studies have further suggested that these types of dependencies could be learned by tracking non-adjacent dependencies in speech (Gómez and Gerken 2000). Indeed, more recent studies have actually shown that the dependencies that are most strongly marked by statistical cues in the input are the very dependencies that toddlers first show evidence of knowing in looking time studies (van Heugten and Johnson 2010). All of this seems to be strong convergent evidence that young children learn morphosyntactic dependencies by tracking non-adjacent dependencies between syllables in their linguistic input.

However, there is another, slightly less exciting explanation. Perhaps infants simply like to listen to things that sound familiar. Things that occur frequently in the environment sound familiar; thus things that are statistically frequent in the environment attract longer looking times (see van Heugten and Johnson 2010; van Heugten and Shi 2009, for a related discussion). How does this differ from saying that toddlers have developed sensitivity to discontinuous dependencies by tracking the statistical relationship between non-adjacent elements in their language input? In fact, whether or not it differs depends entirely on whether you adopt a conservative or a rich interpretation of the looking time studies. A rich interpretation would credit the child with sophisticated grammatical knowledge, whereas a more conservative interpretation would simply credit the child with picking up on a pattern frequently heard in the input. Most researchers reporting that infants are sensitive to non-adjacent (or discontinuous) dependencies in natural speech word their findings very carefully. Saying that an infant is 'sensitive to discontinuous dependencies' makes no theoretical claim regarding the underlying nature of the grammatical knowledge of the child.


Can tracking discontinuous dependencies actually be the learning mechanism that serves as the main driving force behind the acquisition of abstract grammatical constructions in English? Or are input distributions and statistics just triggers that help infants work out which of the many possible structures of human language they are encountering? Maybe the language competency possessed by children is just item-based, and is not really based on abstract grammatical knowledge at all (e.g. Tomasello 2000)? It would be helpful if, in the future, language researchers would more clearly spell out their assumptions regarding the knowledge driving infants' looks. Is it grammatical knowledge? Is it statistical knowledge? Is there even a difference between grammatical and statistical knowledge? If there is a difference between the two, then we need to develop an explanation for how and when children shift from statistical to grammatical knowledge, which will require a way to test the underlying nature of children's sensitivity to linguistic patterns. If developmentalists propose no shift away from the initial statistical knowledge, then they must be held accountable for explaining studies demonstrating that adults have abstract linguistic knowledge. Looking time measures will be helpful in addressing these issues, but the use of convergent measures, more clearly laid-out theoretical assumptions, and cleverly designed experiments will also be necessary. In my lab, we are beginning to address these issues by presenting toddlers with strictly familiar patterns (i.e. statistically common in the input, but ungrammatical) versus grammatical items (statistically less common, but nonetheless fully grammatical). Preliminary work in this area suggests that toddlers' knowledge of discontinuous dependencies might not truly be grammatical in the adult sense of the word (van Heugten and Johnson 2011). More work in this area will surely emerge in the near future as researchers move beyond demonstrating the remarkable statistical learning abilities of infants and towards developing more comprehensive theories of language development.

3. Closing Comments

The title of this chapter poses a question: are infant statisticians up to performing the job of bootstrapping language? In large part, the answer might depend on how you define the 'job' of language acquisition. Clearly, infants are statisticians of some sort, as study after study has demonstrated how exquisitely attuned they are to statistical patterns in their environment.


And many aspects of language are reflected in its statistical structure (especially if one assumes adult-like representations of the input). But when it comes to language acquisition, it is not yet entirely clear what information infants need to learn, what units of information they are tracking, or what calculations they might be performing over these units. And the looking time data we so often use to address these questions are frustratingly ambiguous (and tempting to over-interpret). Indeed, we do not yet really know whether the task of artificial language learning is quantitatively or qualitatively different from natural language learning. As much as I appreciate the beauty of a well-designed artificial language study, I must admit that I fear the language learning task faced by a child in an artificial language learning study differs both qualitatively and quantitatively from the task faced by a child learning a language in the real world. At the same time, artificial language learning studies allow us to test hypotheses in the laboratory that we cannot realistically test in the real world.

So how shall we, as a field, move forward towards discovering an answer to this difficult question? My advice would be to keep chipping away at the many challenges to distributional models, and to try to better define the role distributional learning plays in language acquisition. I would also recommend looking to the information-rich patterns in the speech signal as another potentially important cue to language structure (e.g. see Endress and Mehler 2009; Johnson and Seidl 2008; Johnson and Tyler 2010, for a discussion of the potential importance of utterance-level prosody in early language acquisition). I also look forward to seeing more work focused on understanding the fundamentals of how infant speech perception differs from adult speech perception, and how experience sculpts infant language learners into adult language users. By better understanding these basic issues, we will be in a better position to judge the feasibility of the many distributional models proposed on the basis of artificial language and corpus studies.

In closing, regardless of which level of linguistic structure we focus on, the language acquisition 'problem' is remarkably far from being solved. At this point, it is impossible to say with any certainty whether or not infant statisticians are capable of bootstrapping language. Researchers are still at the early stages of grappling with what implications our discovery of infants' statistical learning abilities should have for our theories of language acquisition. The last 15 years have been an exciting time to be involved in language acquisition research. As researchers continue to tackle the challenges posed in this chapter, I expect the next 15 years to be equally exciting.


Acknowledgements

I thank Anne Cutler, Rochelle Newman, Amanda Seidl, and two reviewers (Amy Perfors plus one anonymous reviewer) for very useful feedback on an earlier version of this chapter. All errors are of course my own. Funding was provided by NSERC (Natural Sciences and Engineering Research Council of Canada) and CFI (Canada Foundation for Innovation).

References

Aslin, Richard N. 2007 What's in a look? Developmental Science, 10: 48–53.
Aslin, Richard N. and Elissa L. Newport 2008 What statistical learning can and can't tell us about language acquisition. In J. Colombo, P. McCardle, and L. Freund (eds.), Infant Pathways to Language: Methods, Models, and Research Directions. Mahwah, NJ: Lawrence Erlbaum Associates.
Aslin, Richard N., Jenny R. Saffran and Elissa L. Newport 1998 Computation of conditional probability statistics by 8-month-old infants. Psychological Science, 9: 321–324.
Batchelder, Eleanor 2002 Bootstrapping the lexicon: A computational model of infant speech segmentation. Cognition, 83: 167–206.
Bates, Elizabeth and Jeffrey L. Elman 1997 Learning rediscovered: A perspective on Saffran, Aslin, and Newport. Science, 274: 1849–1850.
Bonatti, Luca L., Marcela Peña, Marina Nespor and Jacques Mehler 2006 How to hit Scylla without avoiding Charybdis: Comment on Perruchet, Tyler, Galland, and Peereman. Journal of Experimental Psychology: General, 135: 314–326.
Bortfeld, Heather and James L. Morgan 2010 Is early word-form processing stress-full? How natural variability supports recognition. Cognitive Psychology, 60: 241–266.
Brent, Michael and Timothy Cartwright 1996 Distributional regularity and phonotactic constraints are useful for segmentation. Cognition, 61: 93–125.
Byrd, Dani M. and Toben H. Mintz 2010 Discovering Speech, Words, and Mind. Wiley-Blackwell Publishing.
Cairns, Paul, Richard Shillcock, Nick Chater and Joe Levy 1997 Bootstrapping word boundaries: A bottom-up corpus-based approach to segmentation. Cognitive Psychology, 33: 111–153.


Chambers, Kyle E., Kris H. Onishi and Cynthia Fisher 2003 Infants learn phonotactic regularities from brief auditory experience. Cognition, 87: B69–B77.
Chomsky, Noam 1957 Syntactic structures. The Hague, NL: Mouton.
Christophe, Anne, T. Guasti, Marina Nespor, Emmanuel Dupoux and B. Van Ooyen 1997 Reflections on phonological bootstrapping: Its role for lexical and syntactic acquisition. Language and Cognitive Processes, 12: 585–612.
Cohen, Leslie B. 2001, April Uses and misuses of habituation: A theoretical and methodological analysis. Symposium paper presented at the Society for Research in Child Development Meeting, Minneapolis, MN.
Conwell, Erin and James L. Morgan 2007 Resolving grammatical category ambiguity in acquisition. In H. Caunt-Nulton, S. Kulatilake, and I. Woo (eds.), Proceedings of the 32nd Annual Boston University Conference on Language Development: Vol. 1 (pp. 117–128). Somerville, MA: Cascadilla Press.
Curtin, Suzanne, Toben Mintz and Morten Christiansen 2005 Stress changes the representational landscape: Evidence from word segmentation. Cognition, 96: 233–262.
Curtiss, Susan 1977 Genie: A psycholinguistic study of a modern-day "wild child". Boston: Academic Press.
De Vries, Meinou, Padraic Monaghan, Stefan Knecht and Pienie Zwitserlood 2008 Syntactic structure and artificial grammar learning: The learnability of embedded hierarchical structures. Cognition, 107: 763–774.
Endress, Ansgar D. and Jacques Mehler 2009 Primitive computations in speech processing. Quarterly Journal of Experimental Psychology.
Endress, Ansgar D. and Marc D. Hauser 2010 Word segmentation with universal prosodic cues. Cognitive Psychology, 61: 177–199.
Frank, Michael C., S. Goldwater, V. Mansinghka, T. Griffiths and J. Tenenbaum 2007 Modeling human performance on statistical word segmentation tasks. In D. S. McNamara and G. Trafton (eds.), Proceedings of the 29th Annual Meeting of the Cognitive Science Society, 281–286. Mahwah, NJ: Lawrence Erlbaum.
Gervain, Judit, Marina Nespor, Reiko Mazuka, Ryota Horie and Jacques Mehler 2008 Bootstrapping word order in prelexical infants: A Japanese-Italian cross-linguistic study. Cognitive Psychology, 57: 56–74.


Gómez, Rebecca and Lou Ann Gerken 2000 Infant artificial language learning and language acquisition. Trends in Cognitive Sciences, 4: 178–186.
Gómez, Rebecca 2002 Variability and the detection of invariant structure. Psychological Science, 13: 431–436.
Gómez, Rebecca and Jessica Maye 2005 The developmental trajectory of nonadjacent dependency learning. Infancy, 7: 183–206.
Goodsitt, J. V., James L. Morgan and Patricia Kuhl 1993 Perceptual strategies in prelingual speech segmentation. Journal of Child Language, 20: 229–252.
Gow, David W. 2002 Does English coronal place assimilation create lexical ambiguity? Journal of Experimental Psychology: Human Perception and Performance, 28: 163–179.
Harris, Zellig 1955 From phoneme to morpheme. Language, 31: 190–222.
Hayes, J. R. and H. H. Clark 1970 Experiments in the segmentation of an artificial speech analog. In J. R. Hayes (ed.), Cognition and the Development of Language. New York: Wiley.
Höhle, Barbara, Michaela Schmitz, Lynn Santelmann and Jürgen Weissenborn 2006 The recognition of discontinuous verbal dependencies by German 19-month-olds: Evidence for lexical and structural influences on children's early processing capacities. Language Learning and Development, 2: 277–300.
Houston, Derek M. and Peter W. Jusczyk 2000 The role of talker-specific information in word segmentation by infants. Journal of Experimental Psychology: Human Perception and Performance, 26: 1570–1582.
Houston, Derek M., Lynn Santelmann and Peter W. Jusczyk 2004 English-learning infants' segmentation of trisyllabic words from fluent speech. Language and Cognitive Processes, 19: 97–136.
Hanulikova, Adriana, James McQueen and Holger Mitterer 2010 Possible words and fixed stress in the segmentation of Slovak speech. Quarterly Journal of Experimental Psychology, 63: 555–579.
Johnson, Elizabeth K. 2003 Word segmentation during infancy: The role of subphonemic cues to word boundaries. Unpublished doctoral dissertation, Johns Hopkins University.
Johnson, Elizabeth K. 2005 English-learning infants' representations of word-forms with iambic stress. Infancy, 7: 95–105.

Johnson, Elizabeth K. 2008a Dutch infants' segmentation of diminutive word forms: An exploration of morpheme stripping and lexical embedding. Paper presented at the 16th International Conference on Infant Studies, Vancouver, Canada.
Johnson, Elizabeth K. 2008b Infants use prosodically conditioned acoustic-phonetic cues to extract words from speech. Journal of the Acoustical Society of America, 123: EL144–EL148.
Johnson, Elizabeth K. & Peter W. Jusczyk 2001 Word segmentation by 8-month-olds: When speech cues count more than statistics. Journal of Memory and Language, 44: 548–567.
Johnson, Elizabeth K. & Peter W. Jusczyk 2003a Exploring statistical learning by 8-month-olds: The role of complexity and variation. In D. Houston, A. Seidl, G. Hollich, E. Johnson & A. Jusczyk (Eds.), Jusczyk Lab Final Report (pp. 141–148).
Johnson, Elizabeth K. & Peter W. Jusczyk 2003b Exploring possible effects of language-specific knowledge on infants' segmentation of an artificial language. In D. Houston, A. Seidl, G. Hollich, E. Johnson & A. Jusczyk (Eds.), Jusczyk Lab Final Report (pp. 141–148).
Johnson, Elizabeth K., Peter W. Jusczyk, Anne Cutler & Dennis Norris 2003 Lexical viability constraints on speech segmentation by infants without a lexicon. Cognitive Psychology, 46: 65–97.
Johnson, Elizabeth K. & Amanda Seidl 2008 Clause segmentation by 6-month-olds: A crosslinguistic perspective. Infancy, 13: 440–455.
Johnson, Elizabeth K. & Amanda Seidl 2009 At 11 months, prosody still outranks statistics. Developmental Science, 12: 131–141.
Johnson, Elizabeth K. & Michael D. Tyler 2010 Testing the limits of statistical learning for word segmentation. Developmental Science, 13: 339–345.
Johnson, Elizabeth K., Joost Van de Weijer & Peter W. Jusczyk 2001 Word segmentation by 7.5-month-olds: Three words do not equal one. In BUCLD 25: Proceedings of the 25th Annual Boston University Conference on Language Development (Vol. 2, pp. 389–400). Somerville, MA: Cascadilla Press.
Johnson, Elizabeth K. & Marieke van Heugten 2012 Infant artificial language learning. In N. M. Seel (Ed.), Encyclopedia of the Sciences of Learning. Heidelberg: Springer-Verlag.

Johnson, Keith 2004 Massive reduction in conversational American English. In K. Yoneyama & K. Maekawa (Eds.), Spontaneous speech: Data and analysis. Proceedings of the 1st session of the 10th international symposium (pp. 29–54). Tokyo, Japan: The National Institute for Japanese Language.
Jusczyk, Peter W. 1997 The discovery of spoken language. Cambridge, MA: MIT Press.
Jusczyk, Peter W., Derek Houston & Mary Newsome 1999 The beginnings of word segmentation in English-learning infants. Cognitive Psychology, 39: 159–207.
Jusczyk, Peter W., David Pisoni & James Mullennix 1992 Some consequences of stimulus variability on speech processing by 2-month-old infants. Cognition, 43: 253–291.
Kirkham, Natasha Z., Jonathan Slemmer & Scott P. Johnson 2002 Visual statistical learning in infancy: Evidence of a domain general learning mechanism. Cognition, 83: B35–B42.
Kooijman, Valesca, Elizabeth K. Johnson & Anne Cutler 2008 Reflections on reflections of infant word recognition. In Friederici & Thierry (Eds.), Early Language Development: Bridging Brain and Behaviour (Trends in Language Acquisition Research) (pp. 91–114). Amsterdam, The Netherlands: John Benjamins.
Kuhl, Patricia 1979 Speech perception in early infancy: Perceptual constancy for perceptually dissimilar vowel categories. Journal of the Acoustical Society of America, 66: 1668–1679.
Mattock, Karen, M. Molnar, Linda Polka & Denis Burnham 2008 The developmental course of lexical tone perception in the first year of life. Cognition, 106: 1367–1381.
Mattys, Sven & Peter W. Jusczyk 2001 Do infants segment words or continuous recurring patterns? Journal of Experimental Psychology: Human Perception and Performance, 27: 644–655.
Maye, Jessica, Janet Werker & Lou Ann Gerken 2002 Infant sensitivity to distributional information can affect phonetic discrimination. Cognition, 82: 101–111.
McMurray, Bob & Richard N. Aslin 2005 Infants are sensitive to within-category variation in speech perception. Cognition, 95: B15–B26.
McMurray, Bob, Michael Tanenhaus & Richard N. Aslin 2002 Gradient effects of within-category phonetic variation on lexical access. Cognition, 86: B33–B42.
McQueen, James M., Anne Cutler, T. Briscoe & Dennis Norris 1995 Models of continuous speech recognition and the contents of the vocabulary. Language and Cognitive Processes, 10: 309–331.
Mehler, Jacques, Marcela Peña, Marina Nespor & Luca Bonatti 2006 The soul of language does not use statistics: Reflections on vowels and consonants. Cortex, 42: 846–854.

Mintz, Toben 2003 Frequent frames as a cue for grammatical categories in child directed speech. Cognition, 90: 91–117.
Monaghan, Padraic, Morten Christiansen & Nick Chater 2007 The phonological–distributional coherence hypothesis: Cross-linguistic evidence in language acquisition. Cognitive Psychology, 55: 259–305.
Morgan, James L. & Elissa L. Newport 1981 The role of constituent structure in the induction of an artificial language. Journal of Verbal Learning and Verbal Behavior, 20: 67–85.
Morgan, James L. & Jenny Saffran 1995 Emerging integration of sequential and suprasegmental information in preverbal speech segmentation. Child Development, 66: 911–936.
Newman, Rochelle S. 2009 Infants' listening in multitalker environments: Effect of the number of background talkers. Attention, Perception & Psychophysics, 71: 822–836.
Newport, Elissa L. & Richard N. Aslin 2004 Learning at a distance: I. Statistical learning of non-adjacent dependencies. Cognitive Psychology, 48: 127–162.
Norris, Dennis G., James M. McQueen, Anne Cutler & Sally Butterfield 1997 The possible-word constraint in the segmentation of continuous speech. Cognitive Psychology, 34: 191–243.
Narayan, Chandan, Janet Werker & Patricia Beddor 2010 The interaction between acoustic salience and language experience in developmental speech perception: Evidence from nasal place discrimination. Developmental Science, 13: 407–420.
Pelucchi, Bruna, Jessica Hay & Jenny Saffran 2009a Statistical learning in a natural language by 8-month-old infants. Child Development, 80: 674–685.
Pelucchi, Bruna, Jessica Hay & Jenny Saffran 2009b Learning in reverse: Eight-month-old infants track backwards transitional probabilities. Cognition, 113: 244–247.
Peperkamp, Sharon, R. Le Calvez, J.-P. Nadal & Emmanuel Dupoux 2006 The acquisition of allophonic rules: Statistical learning with linguistic constraints. Cognition, 101: B31–B41.
Perruchet, Pierre & S. Desaulty 2008 A role for backward transitional probabilities in word segmentation? Memory & Cognition, 36: 1299–1305.
Perruchet, Pierre, Ronald Peereman & Michael Tyler 2006 Do we need algebraic-like computations? A reply to Bonatti, Peña, Nespor, and Mehler (2006). Journal of Experimental Psychology: General, 135: 322–326.

Perruchet, Pierre & Annie Vinter 1998 PARSER: A model for word segmentation. Journal of Memory and Language, 39: 246–263.
Rost, Gwyneth & Bob McMurray 2009 Speaker variability augments phonological processing in early word learning. Developmental Science, 12: 339–349.
Rytting, C. Anton, Chris Brew & Eric Fosler-Lussier 2010 Segmenting words from natural speech: Subsegmental variation in segmental cues. Journal of Child Language, 37: 513–543.
Saffran, Jenny R. 2001 The use of predictive dependencies in language learning. Journal of Memory and Language, 44: 493–515.
Saffran, Jenny R., Richard N. Aslin & Elissa L. Newport 1996 Statistical learning by eight-month-old infants. Science, 274: 1926–1928.
Saffran, Jenny R., Elizabeth K. Johnson, Richard N. Aslin & Elissa L. Newport 1999 Statistical learning of tonal sequences by human infants and adults. Cognition, 70: 27–52.
Saffran, Jenny R. & Erik D. Thiessen 2003 Pattern induction by infant language learners. Developmental Psychology, 39: 484–494.
Saffran, Jenny R. & D. P. Wilson 2003 From syllables to syntax: Multilevel statistical learning by 12-month-old infants. Infancy, 4: 273–284.
Sahni, Sarah D., Mark S. Seidenberg & Jenny R. Saffran 2010 Connecting cues: Overlapping regularities support cue discovery in infancy. Child Development, 81: 727–736.
Santelmann, Lynn M. & Peter W. Jusczyk 1998 Sensitivity to discontinuous dependencies in language learners: Evidence for limitations in processing space. Cognition, 69: 105–134.
Savage-Rumbaugh, Sue & William M. Fields 2000 Linguistic, cultural and cognitive capacities of bonobos (Pan paniscus). Culture and Psychology, 6: 131–153.
Schmale, Rachel, Alejandrina Cristià, Amanda Seidl & Elizabeth K. Johnson 2010 Developmental changes in infants' ability to cope with dialect variation in word recognition. Infancy, 15: 650–662.
Seidl, Amanda & Eugene Buckley 2005 On the learning of arbitrary phonological rules. Language Learning and Development, 1: 289–316.
Shatzman, Keren B. & James M. McQueen 2006 Prosodic knowledge affects the recognition of newly acquired words. Psychological Science, 17: 372–377.

Shi, Rushen, Anne Cutler, Janet F. Werker & Marisa Cruickshank 2006 Frequency and form as determinants of functor sensitivity in English-acquiring infants. Journal of the Acoustical Society of America, 119: EL61–EL67.
Shockey, Linda 2003 Sound patterns of spoken English. Oxford: Blackwell Publishing.
Singh, Leher, James L. Morgan & Katherine S. White 2004 Preference and processing: The role of speech affect in early spoken word recognition. Journal of Memory and Language, 51: 173–189.
Spinelli, Elsa, James M. McQueen & Anne Cutler 2003 Processing resyllabified words in French. Journal of Memory and Language, 48: 233–254.
Smith, Linda & Chen Yu 2008 Infants rapidly learn word-referent mappings via cross-situational statistics. Cognition, 106: 1558–1568.
Soderstrom, Melanie, Erin Conwell, Naomi Feldman & James L. Morgan 2009 The learner as statistician: Three principles of computational success in language acquisition. Developmental Science, 12: 409–411.
Sundara, Megha & Adrienne Scutellaro 2011 Rhythmic distance between languages affects the development of speech perception in bilingual infants. Journal of Phonetics, 505–513.
Swingley, Daniel 2005 Statistical clustering and the contents of the infant vocabulary. Cognitive Psychology, 50: 86–132.
Terrace, Herb S., Laura-Ann Petitto, R. J. Sanders & T. G. Bever 1979 Can an ape create a sentence? Science, 206: 891–902.
Thiessen, Erik D., Emily Hill & Jenny R. Saffran 2005 Infant-directed speech facilitates word segmentation. Infancy, 7: 53–71.
Thiessen, Erik & Jenny Saffran 2003 When cues collide: Use of stress and statistical cues to word boundaries by 7- to 9-month-old infants. Developmental Psychology, 39: 706–716.
Thiessen, Erik D. & Jenny R. Saffran 2007 Learning to learn: Infants' acquisition of stress-based strategies for word segmentation. Language Learning and Development, 3: 73–100.
Thompson, Susan P. & Elissa L. Newport 2007 Statistical learning of syntax: The role of transitional probability. Language Learning and Development, 3: 1–42.
Tomasello, Michael 2000 Do young children have adult syntactic competence? Cognition, 74: 209–253.

Toro, Juan M. & Joseph Trobalon 2005 Statistical computations over a speech stream in a rodent. Perception and Psychophysics, 67: 867–875.
Tsao, Feng-Ming, Huei-Mei Liu & Patricia Kuhl 2006 Perception of native and non-native affricate–fricative contrasts: Cross-language tests on adults and infants. Journal of the Acoustical Society of America, 120: 2285–2294.
Tyler, Michael & Anne Cutler 2009 Cross-language differences in cue use for speech segmentation. Journal of the Acoustical Society of America, 126: 367–376.
Valian, Virginia & Seana Coulson 1988 Anchor points in language learning: The role of marker frequency. Journal of Memory and Language, 27: 71–86.
Van Heugten, Marieke & Elizabeth K. Johnson 2010 Linking infants' distributional learning abilities to natural language acquisition. Journal of Memory and Language, 63: 197–209.
Van Heugten, Marieke & Elizabeth K. Johnson 2011 Is dependency acquisition by children grammatical or distributional in nature? Poster presented at the Society for Research in Child Development Biennial Meeting, Montreal, Canada.
Van Heugten, Marieke & Elizabeth K. Johnson 2012 Infants exposed to fluent natural speech succeed at cross-gender word recognition. Journal of Speech, Language, and Hearing Research, 55: 554–560.
Van Heugten, Marieke & Rushen Shi 2009 French-learning toddlers use gender information on determiners during word recognition. Developmental Science, 12: 419–425.
Vihman, Marilyn & Virve-Anneli Vihman 2011 From first words to segments: A case study in phonological development. In E. V. Clark & I. Arnon (Eds.), How children make linguistic generalizations: Experience and variation in learning a first language (Trends in Language Acquisition Research) (pp. 109–133). Amsterdam: Benjamins.
Werker, Janet F. & Richard C. Tees 1984 Cross-language speech perception: Evidence for perceptual reorganization during the first year of life. Infant Behavior and Development, 7: 49–63.
Yang, Charles D. 2004 Universal grammar, statistics, or both? Trends in Cognitive Sciences, 8: 451–456.

Sensitivity to statistical information begets learning in early language development

Jessica F. Hay and Jill Lany

Growing interest in potential sources of statistical regularities in the environment, and in infants' ability to learn these regularities, has led to a considerable amount of research examining the role of statistical learning in early development. One prevalent form of statistical information is co-occurrence. When two elements reliably co-occur, there is typically a meaningful relationship between those elements, and thus co-occurrence patterns potentially provide a very powerful source of information across different domains and modalities. Kirkham, Slemmer, and Johnson (2002) found that by 2 months of age, infants can detect sequential relationships in a series of visually presented shapes, and Baldwin and colleagues (Baldwin, Andersson, Saffran, & Meyer, 2008) found that adults can use statistical information to learn regularities in dynamic action sequences. Also in the visual domain, 9-month-old infants can track elements that reliably co-occur across time and space, which tend to belong to the same object and thus provide information about object boundaries (Fiser & Aslin, 2002). Statistical information is also a potential cue to important aspects of language structure, and the extent to which sensitivity to co-occurrence regularities plays a role in infant language development is the focus of this chapter.

One kind of co-occurrence relationship that may be particularly important, and that has been the focus of much research, is the sequential relationship between syllables: syllables that reliably co-occur often belong to the same word, while syllables that rarely co-occur are more likely to span word boundaries (e.g., Swingley, 2005). This type of sequential co-occurrence relationship is referred to as conditional or transitional probability (TP). Generally, transitional probability is the probability of one event given the occurrence of another event. This statistic reflects more than the frequency with which one element follows another, as it adjusts for the base rate of the first event or element. The TP of Y given X is represented by Eq. (1):

TP = P(Y | X) = frequency(XY) / frequency(X)    (1)
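Before turning to the worked example that follows, a minimal computational sketch of Eq. (1) may be helpful. The Python snippet below is purely illustrative (the syllable stream is invented and does not come from any of the studies discussed in this chapter): it tabulates, for each adjacent syllable pair XY, how often Y follows X relative to how often X occurs as a context.

```python
from collections import Counter

def forward_tps(syllables):
    """Forward transitional probabilities P(Y | X) = frequency(XY) / frequency(X),
    computed over every adjacent pair in a syllable-tokenized stream."""
    pair_counts = Counter(zip(syllables, syllables[1:]))
    # Base-rate counts of X: only tokens that actually serve as a context
    # (i.e., all but the final token) enter the denominator.
    context_counts = Counter(syllables[:-1])
    return {(x, y): count / context_counts[x]
            for (x, y), count in pair_counts.items()}

# Hypothetical toy stream (invented syllables, for illustration only).
stream = "ti bu do la ti bu ka la ti bu do ka".split()
for (x, y), tp in sorted(forward_tps(stream).items()):
    print(f"TP({y} | {x}) = {tp:.2f}")
```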

Take, for example, the following series of likely two-word phrases that an infant might hear: pretty baby, happy baby, cute baby, pretty kitty, pretty doggy, pretty horsey. In this micro-corpus, every time an infant hears ba, it is followed by by. Therefore, within the word 'baby' the syllable sequence has a high transitional probability (TP = 1.0), and infants may therefore treat 'baby' as a good word candidate. Although the syllable ba predicts the syllable by, hearing ty (from 'pretty') does not necessarily predict ba (from 'baby'). In this micro-corpus, the transitional probability between the syllables ty and ba is 0.25, and therefore 'tyba' (which crosses a word boundary) is a poorer word candidate in English than 'baby'. Corpus analyses suggest that TPs between syllables are an imperfect but potentially useful cue to word boundaries in natural speech (Swingley, 2005; though see Yang, 2004, for a different view), as transitional probabilities between syllables tend to be higher within words and lower across word boundaries. Thus, TP could serve as a word boundary cue for pre-lingual infants facing the challenge of segmenting words from fluent speech.

Infants appear to be remarkably sensitive to statistical regularities beginning in early infancy. For example, Saffran, Aslin, and Newport (1996) found that 8-month-old infants can track these probabilities in an artificial language using synthesized speech, and Teinonen and colleagues (Teinonen, Fellman, Näätänen, Alku, & Huotilainen, 2009) reported evidence of sensitivity to sequential statistics in speech from ERP recordings in neonates. These and many other studies over the last two decades have yielded important insights about infants' ability to track statistical regularities in their environment (see Hay, 2009; Lany & Saffran, in press, for a review). However, many such studies have employed materials lacking the complexity of patterns found in the real world. For example, the artificial languages used in many experiments on word segmentation consist entirely of sequential regularities among a small number of syllables and lack the rich, multidimensional structure of natural language. Thus, it is unclear whether infants would still use statistical regularities when presented with natural language input that contains many potentially competing sources of information. Second, many of these studies present children with streams of speech that lack meaning. Because learning language entails learning semantic information in addition to phonological regularities, it is important to understand how statistical learning mechanisms contribute to other aspects of language development, such as word learning. Finally, another important goal is to understand how statistical learning at one level, such

as tracking co-occurrence relationships among syllables, might relate to learning about other aspects of language, such as the co-occurrence relationships between words and word categories that are important in grammatical structure.

In this chapter we will describe several recent studies that have begun to bridge the gap between studies suggesting that infants are highly adept at tracking statistical regularities in artificial languages and tasks that more closely approximate the problems faced over the course of learning a natural language. Moreover, we will describe how sensitivity to relatively simple statistical regularities allows infants to learn new and increasingly complex dimensions of language structure, or how statistical learning begets learning.

1. Statistical Learning in a Natural Language

The primary evidence supporting the existence of statistical learning mechanisms in infants comes from studies employing artificial language materials, typically miniature languages that have been designed to capture some aspect of structure found in natural language. For example, in the original statistical segmentation studies infants were presented with a synthetically produced speech stream, without pauses or other acoustic cues to word boundaries. The only available cue was a dip in the TPs (and other related sequential statistics, such as mutual information) between syllables and/or segments at word boundaries. Infants were then tested on their ability to discriminate sequences corresponding to words versus nonwords (syllables from the language assembled in a novel order), or a more subtle comparison, words versus part-words (syllable sequences spanning word boundaries, which can be matched for frequency in the speech stream; e.g., Aslin, Saffran, & Newport, 1998).

Although artificial language materials have been invaluable for the initial investigation of infant statistical learning mechanisms, it is obvious that such stimuli lack the complexity of a natural language on virtually every possible dimension. This problem of ecological validity has been acknowledged throughout the literature on infant statistical language learning. For example, the artificial languages tend to contain few words (typically just four to six), have a limited set of phonemes and syllables, and lack other sequential regularities, rhythmic patterning, pitch changes, and other acoustic variability associated with natural languages. In addition,

words are repeated extremely frequently during exposure (45–90 times). Furthermore, there are typically no other sequential regularities present, such as those engendered by syntactic structure (though see Saffran & Wilson, 2003). For these reasons, the learning challenges presented by artificial languages are quite different from those presented by natural languages.

In order to address this concern, researchers have systematically increased the complexity of artificial languages used in infant studies in several ways. For example, researchers have included naturally produced syllables instead of synthetically produced syllables (Graf Estes, Evans, Alibali, & Saffran, 2007; Sahni, Seidenberg, & Saffran, 2010), varied word length (Johnson & Tyler, 2010; Mersad & Nazzi, 2010; Thiessen, Hill, & Saffran, 2005) and stress cues (Johnson & Jusczyk, 2001; Thiessen & Saffran, 2003, 2007), and provided multiple correlated cues to word boundaries (Sahni et al., 2010). Generally, these studies have found that infants are still able to track statistical regularities in the input despite greater complexity (although see Johnson & Tyler, 2010, for a counter-example), and that in some cases additional complexity facilitates statistical learning (e.g., Thiessen et al., 2005). However, while these miniature languages contain more structure than earlier artificial materials, they nonetheless lack the richness of natural languages.

In one early study examining infants' capacity to segment words from natural language, Jusczyk and Aslin (1995) presented 7.5-month-old infants with multiple repetitions of isolated words (such as bike, cup, dog, or feet) and then tested their listening preferences for sentences containing the word versus sentences without the word. In a second condition, infants were first familiarized with the sentences containing the target words and were then tested on their listening preferences for the isolated words. In both conditions, infants recognized the words in the novel context, suggesting that they must have noticed some similarity between the words in the sentences and the isolated words. Since transitional probability was not expressly manipulated, it is not clear whether mere familiarity with the target words or statistical coherence in the target words was driving these effects.

Recently, however, Pelucchi and colleagues (Pelucchi, Hay, & Saffran, 2009a, 2009b) demonstrated that English-learning 8-month-old infants track statistical coherence, and specifically transitional probabilities, even when presented with complex language input that is naturally produced, grammatically correct, and semantically meaningful. Importantly, the natural language they used was not English, which English-learning infants may

be able to segment using known words (Bortfeld, Morgan, Golinkoff, & Rathbun, 2005), but instead was an unfamiliar natural language – Italian. The Italian corpora maintained virtually all of the complexities found in natural language, but the internal transitional probabilities between syllable sequences were expressly manipulated in a subset of the words. To that end, Pelucchi et al.'s (2009a, Experiment 3) familiarization corpora consisted of grammatically correct and semantically meaningful standard Italian sentences, produced in a lively voice by a female native Italian speaker. The sentences contained four Italian target words with a strong/weak stress pattern: fuga, melo, casa, and bici. Importantly, two of the target words, fuga and melo, were High Transitional Probability (HTP) words, as their component syllables never appeared without each other (i.e., fu never appeared in the absence of ga, and vice versa). Thus, the TP of, for example, fuga corresponds to:

TP(ga | fu) = f(fuga) / f(fu)

Because fu never appeared without ga, the internal transitional probability of fuga (and of melo) was 1.0. Two other words, casa and bici, were Low Transitional Probability (LTP) words, as there were additional occurrences of the syllables ca and bi, all in strong (stressed) position. As a consequence, the TPs of casa and bici were 0.33. A counterbalanced language was also created to control for arbitrary listening preferences at test. Although the two pairs of words, fuga/melo and casa/bici, were equally frequent, they contained different internal TPs. In addition to the sequential co-occurrence statistics of the target words, the Italian corpus contained much of the variability associated with natural language, including 23 different consonants, 7 vowels, over 100 unique syllables, a wide variety of syllable types (e.g., V, CV, VC, CCV), and varied word length (i.e., 1-, 2-, 3-, and 4-syllable words), and also contained the varied acoustic and prosodic patterns characteristic of naturally spoken Italian. Despite the complexity of the natural language input, 8-month-old infants discriminated high vs. low transitional probability words following a very brief familiarization period (2 min 15 s).

In a second and related study, Pelucchi, Hay, and Saffran (2009b) investigated the directionality of the computation of sequential statistics. In principle, transitional probability can be defined in two different and symmetrical ways, depending on how the normalization step is performed. In addition to the forward TPs (hereafter FTP) described in Eq. (1), it is also

possible to compute backward TPs (hereafter BTP), measuring the likelihood of X preceding Y, as shown in Eq. (2):

BTP = P(X | Y) = frequency(XY) / frequency(Y)    (2)
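The asymmetry between Eq. (1) and Eq. (2) is easiest to see with concrete counts. The sketch below (Python; the numbers are invented for illustration and are not the actual frequencies from the Pelucchi et al. corpora) mirrors the LTP manipulation described below: extra occurrences of a word's first syllable lower its forward TP while leaving its backward TP at 1.0.

```python
def forward_tp(freq_xy, freq_x):
    """FTP = P(Y | X) = frequency(XY) / frequency(X), as in Eq. (1)."""
    return freq_xy / freq_x

def backward_tp(freq_xy, freq_y):
    """BTP = P(X | Y) = frequency(XY) / frequency(Y), as in Eq. (2)."""
    return freq_xy / freq_y

# Hypothetical counts for an LTP word like "casa" (syllables ca + sa):
n_casa = 30          # made-up number of "ca-sa" tokens
n_ca = 3 * n_casa    # "ca" also occurs in other words, tripling its count
n_sa = n_casa        # "sa" never occurs outside "casa"

print(forward_tp(n_casa, n_ca))   # 0.333... -> low forward TP
print(backward_tp(n_casa, n_sa))  # 1.0      -> backward TP stays high
```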

Indeed, in a corpus analysis of English infant-directed speech, Swingley (1999) demonstrated that FTP and BTP are equally informative as independent cues to word boundaries. In the Italian corpora used by Pelucchi et al. (2009a, Experiment 3), the transitional probabilities within the LTP words were lowered by including additional occurrences of the first syllable of each of those words. For example, the transitional probability of casa was lowered (to 0.33) by including other words that contained the syllable ca. However, the backward transitional probability remained high, at 1.0, because sa was always preceded by ca. Thus, Pelucchi et al. (2009b) presented another group of 8-month-old infants with a familiarization corpus analogous to the one described above, only this time the transitional probabilities within the LTP words were lowered by including additional occurrences of the second syllable of each of the LTP words. Again, 8-month-old infants discriminated high vs. low transitional probability words following a very brief familiarization period.

The Pelucchi et al. (2009a, 2009b) studies illuminate the power of statistical learning mechanisms and extend our understanding of their computational range; infants readily track both forward and backward TPs even in complex, natural linguistic input. Natural languages represent a noisy stimulus, in which the words of interest are interspersed amidst myriad other words, word repetitions are necessarily limited, and TPs are just one of many regularities in the input (e.g., prosodic patterns, morphological agreement, word order, etc.). These results thus provide a striking demonstration of statistical learning: infants detected sequential probabilities despite the substantial richness and complexity of the experimental materials. In fact, while such complexity has been considered to be a potential drawback of natural language materials, this complexity may prove to be advantageous for infant learners (for a discussion see Morgan, Shi, & Allopenna, 1996; Landau & Gleitman, 1985). Natural languages have the benefit of providing infants with multiple redundant cues to word boundaries, and are inherently more engaging than artificial languages. Indeed, prior research suggests that making artificial languages just a little more natural, by using infant-directed speech

intonation contours, can facilitate statistical learning (Thiessen et al., 2005). Similarly, infants learn more about word order when a word sequence is sung rather than spoken, providing engaging redundant cues to learning (Thiessen & Saffran, 2009).

2. The Role of Sequential Statistics in Linking Sounds to Meaning

Pelucchi et al.'s (2009a, 2009b) demonstration of statistical learning in a natural language allowed for greater ecological validity than previous experiments using artificial languages. However, these results – successful discrimination between test items that differed in transitional probability – tell us little about the representations that infants formed while listening to the fluent speech. For example, does the output of the segmentation task provide infants with something akin to words that are useful for subsequent language development? Infants may be able to use word segmentation processes to identify sound sequences that are likely to be individual words and thus ready to be associated with meanings. Alternatively, the representations that emerge from this statistical learning process may fail to intersect with subsequent word learning. On this latter view, statistical learning mechanisms, while available to infant language learners faced with fluent speech, do not render representations that can serve as potential words, thereby limiting the explanatory power of statistical learning accounts.

To begin to address these issues, Graf Estes et al. (2007) designed a method to study the connection between word segmentation and word learning. In this experiment, 17-month-old infants were first familiarized with an artificial language. The language consisted of four bisyllabic words, recorded so that the transitional probabilities between syllables were the only reliable cues to word boundaries. Infants were then presented with a word-object association task. They were taught two novel object-label associations where the labels were words from the artificial language (TP = 1.0), nonwords (sequences that did not appear in the language; TP = 0.0), or part-words (sequences that crossed word boundaries in the artificial language; TP = 0.5). Following habituation, infants were tested using a modified Switch paradigm (Werker, Cohen, Lloyd, Casasola, & Stager, 1998). Half of the test trials were Same trials, in which the object-label pairings presented during habituation remained unchanged. The other half of the test trials were Switch trials, in which the object and label pairings were switched such that Object A was paired with Label B and

vice versa. The results confirmed that the syllable co-occurrence information presented in the initial speech stream affected infants' subsequent word learning success: infants who were taught labels that were words in the speech stream showed significantly longer looking to the Switch trials than to the Same trials, indicating that they had learned the object-label pairings. However, infants who were taught labels that were nonwords or part-words from the language did not show differential looking to the two types of test trials. Only the high transitional probability sequences appeared to function as good candidate words. This pattern of results suggests not only that infants are able to track statistical regularities in sound sequences, but also that the output of this process can function as the input to future word learning.

Word learning tasks, combined with word segmentation tasks, provide a window into what infants learn through tracking statistical regularities in speech streams, because infants can use the output of statistical learning to support subsequent label-object association learning: one learning task feeds the next. However, like most statistical learning experiments, Graf Estes et al. (2007) used simple artificial language materials. Thus, it is not clear from the Graf Estes et al. (2007) study whether infants tracking regularities in natural languages are also able to form representations of sound sequences that are ready to link to meanings.

A more recent study by Hay and colleagues (Hay, Pelucchi, Graf Estes, & Saffran, 2011) examined whether infants can use the output of statistical learning in a natural language as the input to word learning. To that end, they combined the method developed by Graf Estes and colleagues (Graf Estes et al., 2007) with the Italian materials developed by Pelucchi et al. (2009a). In Experiment 1, 17-month-old infants were first familiarized with an Italian corpus of fluent speech, followed by a label-object association task. The crucial manipulation concerned the statistics of the labels relative to the Italian corpus. Half of the infants were trained on label-object pairings in which the labels were words from the corpus with a high internal transitional probability (HTP condition). The other half of the infants were trained on label-object pairings in which the labels had a low internal transitional probability (LTP condition). Importantly, both types of words occurred equally often in the Italian corpus. In both the HTP and LTP conditions, infants exhibited significantly longer looking times on Switch trials than on Same trials, suggesting that they readily learned both types of label-object pairings. The transitional probabilities internal to the labels did not appear to mediate learning; infants

successfully mapped both HTP (TP = 1.0) and LTP (TP = 0.33) words as labels for objects.

By 17 months of age, infants are fairly skilled at learning new object-label associations, even in the absence of previous experience with the labels in fluent speech (e.g., Werker, Fennell, Corcoran, & Stager, 2002). It is thus possible that infants in Hay et al.'s (2011) first experiment did not need an extra boost from the word segmentation phase to successfully learn the labels. To rule out this possibility, Hay et al. (2011) ran a control experiment (Experiment 2) in which infants completed the label-object association task from Experiment 1 without any initial familiarization with the Italian corpus. They reasoned that if infants can learn the object-label associations based solely on their experience during the habituation phase of the label-object association task, then infants should also learn the label-object pairings in Experiment 2. In the absence of experience with the sequential statistics in the Italian corpus, infants were unsuccessful in learning to map the novel words to the novel objects. This suggests that infants in Experiment 1 needed the opportunity to segment the target words from fluent speech in order to map those target words onto novel objects.

An interesting puzzle emerges when comparing the label-learning patterns observed in Hay et al. (2011) with the findings of Graf Estes et al. (2007). In both studies, infants learned labels following segmentation experience. In Hay et al.'s (2011) Experiment 1, infants successfully mapped HTP and LTP Italian words as labels for objects. In contrast, Graf Estes et al. found that infants successfully learned high transitional probability sequences as object labels, but not low probability sequences that had occurred as part-words in an artificial language. One possible explanation for these diverging results is that the target words in Experiment 1 were trochees, whereas there were no prosodic cues present in the fluent speech used by Graf Estes et al. (2007). Thus, it is possible that the 17-month-old English-learning infants ignored the internal transitional probabilities of the target words and instead used a trochaic-based parsing strategy to segment words from the fluent speech. Another possible explanation concerns the directionality of the computation of the sequential statistics. In the Italian corpora used by Hay et al. (2011), the transitional probabilities within the LTP words were lowered by including additional occurrences of the first syllable of each of those words. However, the backward transitional probability remained high, at 1.0. This manipulation suggests a potentially important difference from the low transitional probability part-word labels used by Graf Estes

et al. (2007), which contained low transitional probabilities in both directions (both forward and backward TPs = 0.5). It is thus possible that the high backward transitional probability of the LTP words played a crucial role in segmentation and subsequent word learning in Experiment 1.

In their third experiment, Hay et al. (2011) manipulated both forward and backward transitional probabilities in the LTP labels. As in Experiment 1, the HTP items contained high transitional probabilities (TPs = 1.0) in both directions. However, both the forward and backward transitional probabilities for the LTP words were lowered to 0.33 by adding further occurrences of the words' first and second syllables elsewhere in the corpus. Following familiarization with the Italian corpus, only infants habituated to HTP label-object pairings exhibited significantly longer looking times on Switch than Same trials; infants presented with LTP labels showed no difference in looking time. These results suggest that prior exposure to the fluent speech facilitated infants' ability to learn novel object labels, but only when the transitional probability between syllable sequences was high in at least one direction. When the forward and backward transitional probabilities were both lowered to 0.33, as in the LTP words used as labels in Hay et al.'s third experiment, infants failed to learn the label-object associations. Results from Hay et al.'s three experiments suggest that the internal cohesiveness of novel words, as measured by the strength of their internal transitional probabilities, influences how readily infants map these words to novel objects.

Recent research has also examined how real-world experience with phoneme co-occurrence affects subsequent word learning. For example, Graf Estes, Edwards, and Saffran (2011) examined how experience with native-language phonotactic regularities affects subsequent word learning. They exposed English-learning 18-month-old infants to two novel object-label pairings consisting of either phonotactically legal sound sequences (i.e., sound sequences that were consistent with English phonotactics), dref and sloob, or two labels with phonotactically illegal sound sequences (i.e., sequences that are inconsistent with English phonotactics), dlef and sroob. Using a looking-while-listening procedure (Fernald, Zangl, Portillo, & Marchman, 2008), Graf Estes and colleagues (Graf Estes et al., 2011) found that at test infants who were exposed to the phonotactically legal labels showed a larger increase in looking to the target objects than infants presented with phonotactically illegal labels. Consistent with these findings, Storkel (2001) has demonstrated that preschool-aged children (aged 3 to 6 years) are able to learn common sound sequence labels (i.e., labels with high phonotactic probability) with fewer exposures and retain

them with better accuracy than rare sequence labels (i.e., sequences with low phonotactic probability). These effects appear to be mediated by vocabulary size, suggesting that the more experience children have with native-language phonotactics, the greater the effect phonotactics have on future word learning. Phonotactic probability also affects speech segmentation in infants as young as 9 months of age (e.g., Mattys & Jusczyk, 2001; Mattys, Jusczyk, Luce, & Morgan, 1999). Thus, experience with native-language phoneme co-occurrence regularities also appears to impact future language learning, in both younger and older children.

In sum, experience with the sequential co-occurrence statistics present in natural language facilitates the identification of candidate words – words that are ready to be linked to meaning. Demonstrating that infants can track and use statistical regularities when faced with the complexity of natural language represents an important step in advancing our understanding of the role of statistics in natural language acquisition. Importantly, the output of speech segmentation feeds into learning increasingly complex dimensions of language structure.
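Phonotactic probability, as used in the studies just described, is itself a co-occurrence statistic, this time computed over a lexicon rather than over a running speech stream. The sketch below is only a rough, hypothetical illustration (letters stand in for phonemes, the mini-lexicon is invented, and this is not the actual measure used by Storkel or Graf Estes et al.): it scores a candidate label by the average corpus frequency of its adjacent sound pairs, so that an unattested sequence such as a "dl" onset drags the score down.

```python
from collections import Counter
from itertools import chain

def biphone_probs(lexicon):
    """Relative frequency of each adjacent sound pair across a lexicon."""
    pairs = list(chain.from_iterable(zip(w, w[1:]) for w in lexicon))
    total = len(pairs)
    return {pair: count / total for pair, count in Counter(pairs).items()}

def avg_biphone_prob(word, probs):
    """Mean biphone probability of a candidate label; unattested pairs score 0."""
    pairs = list(zip(word, word[1:]))
    return sum(probs.get(pair, 0.0) for pair in pairs) / len(pairs)

# Invented mini-lexicon, with letters as crude stand-ins for phonemes.
lexicon = ["drip", "dress", "sled", "slip", "bed", "red"]
probs = biphone_probs(lexicon)
print(avg_biphone_prob("dref", probs))  # "dr" and "re" are attested -> higher score
print(avg_biphone_prob("dlef", probs))  # "dl" never occurs -> lower score
```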

3. Statistical Learning and Grammatical Structure: Simple Grammatical Patterns

Thus far we have reviewed evidence suggesting that infants are sensitive to TPs in fluent synthesized speech and, even more impressively, in naturally spoken Italian sentences. Moreover, this sensitivity begets subsequent word learning, such that sequences with high internal TPs are more readily mapped to visual referents. In addition to facilitating word segmentation and subsequent lexical development, sensitivity to co-occurrence information promotes learning other important dimensions of language structure, such as regularities in how words are combined into phrases and sentences.

For example, Saffran and Wilson (2003) tested whether tracking TPs across syllables to find words would subsequently allow infants to track TPs across newly segmented words. They played 12-month-old infants synthesized speech strings in which TPs between syllables reliably cued word boundaries (i.e., within-word TPs between syllables were 1.0, and TPs of syllables spanning word boundaries were 0.25). However, the strings also contained regularities in how words were combined that could only be detected by tracking the TPs between the words. Infants were then tested on their ability to discriminate novel grammatical strings that were consistent with the word-order regularities in the training strings from ungrammatical

ones. Infants showed significant discrimination despite the fact that syllable-level TPs could not be used to distinguish between the grammatical and ungrammatical test strings. These findings suggest that sensitivity to co-occurrence relationships between syllables promotes learning sequential regularities at a higher level of language structure: segmenting words promotes learning statistical relationships between words.

Gómez and Gerken (1999) also found evidence that infants can use TPs to learn word order regularities within multi-word "sentences" in an artificial language. In their experiment, 12-month-old infants were familiarized with strings of nonsense words. As in natural language, word order was variable. In particular, the occurrence of a word did not predict the identity of the following word with 100% certainty, but some word sequences had higher TPs than others. Infants subsequently discriminated novel strings that contained high TPs from ungrammatical strings that contained lower TPs, indicating that they learned the probabilistic co-occurrence relationships between words. These findings suggest that tracking TPs between words could play an important role in learning about higher-level language structure, and research with adults is consistent with this hypothesis. For example, Saffran (2001) found that adults can use these cues to learn phrase-structure information in an artificial grammar. Likewise, Thompson and Newport (2007) found that transitional probability cues within and across phrases are enhanced by common features of natural language such as phrasal movement and repetition, as well as the presence of optional phrases, and that such conditions promote adults' acquisition of phrase structure within an artificial language.

While the previous findings suggest that infants and adults can learn sequential relationships between words – indeed, infants can do so by 12 months of age – learning the grammatical structure of a language requires learning how words from different grammatical categories can be combined. In other words, beyond tracking specific word sequences that are grammatical, such as "the toy", infants and toddlers must learn more abstract patterns, such as the fact that determiners like "the" and "a" precede nouns rather than follow them. This more abstract sensitivity is critical to grammatical development because it supports generalization to novel utterances. How do infants begin to learn these aspects of their language?

Several studies suggest that within a given language, the phonological properties of content words (e.g., nouns, verbs) and function words (e.g., determiners, pronouns, prepositions) are substantially different. For example, in both

English and Mandarin Chinese, function words are, on average, shorter, lower in amplitude, and simpler in syllable structure relative to content words (Morgan, Shi, & Allopenna, 1996). Amazingly, newborn infants can discriminate between content and function words, presumably on the basis of these cues (Shi, Werker, & Morgan, 1999). However, the specific phonological dimensions on which function and content words differ vary from language to language. In contrast, corpus studies of diverse languages (e.g., English, Korean, Italian) suggest that function words have a substantially higher token frequency than content words (Cutler & Carter, 1987; Gervain, Nespor, Mazuka, Horie, & Mehler, 2008; Kucera & Francis, 1967). These findings suggest that tracking word frequency might result in a rough distinction between these kinds of words.

Infants' sensitivity to this broad distinction between function and content words may promote subsequent language development by providing cues to simple grammatical patterns (Gervain et al., 2008). In particular, within a given language, phrases of different types tend to have similar underlying structure. For example, their corpus analyses revealed that in Japanese function words tend to occur in phrase-final position, following content words, whereas in Italian function words tend to occur in phrase-initial position and are followed by content words. Gervain and colleagues found that by 8 months of age, infants prefer the patterning consistent with their native language, even when instantiated in an artificial language. These findings suggest that sensitivity to grammatical structure might be bootstrapped from experience with the ordering of function and content words at phrase boundaries.

Beyond forming coarse groupings of function words vs. content words, infants must form finer-grained grammatical categories, such as noun, verb, and determiner. One statistical cue that might promote forming these categories is referred to as a distributional cue, that is, regularities in the ordering of individual words that might point to commonalities among words. For example, nouns tend to occur after determiners, while verbs tend to follow pronouns and auxiliaries. Many studies have investigated the role of distributional cues in the acquisition of grammatical categories and their co-occurrence relationships using variants of an aX bY artificial language. These languages typically consist of the nonsense-word categories a, b, X, and Y, and restrictions on how words from these categories can be combined: as precede Xs but not Ys, and bs precede Ys but not Xs, similar to determiner-noun and auxiliary-verb co-occurrence relationships in English. When the Xs and Ys differ in their distributional properties alone, some learning is possible: learners encode and remember

the specific strings they were trained on, and also positional regularities, such as whether specific words occur in string-initial or string-final position. Critically, however, they fail to form word categories and detect their co-occurrence relationships (Smith, 1969). Interestingly, adults can use distributional information to form word categories when words from different categories are flanked by words providing distributional cues on either side, or "frequent frames", suggesting that distributional cues alone may be sufficient if they are particularly strong (Mintz, 2002). Importantly, the presence of overlapping cues, such as correlations between words' distributional and phonological cues, also facilitates learning the category-level co-occurrence relationships in adults (Frigo & McDonald, 1998; Braine, 1987).

Natural languages incorporate multiple cues marking syntactic categories, such that words from different syntactic categories are distinguished by their distributional properties (Cartwright & Brent, 1997; Mintz, 2003; Mintz, Newport, & Bever, 2002; Monaghan, Chater, & Christiansen, 2005; Redington, Chater, & Finch, 1998), as well as by their phonological properties (Farmer, Christiansen, & Monaghan, 2006; Kelly, 1992; Monaghan et al., 2005). For example, nouns tend both to occur after "a" and "the" and to have a strong-weak stress pattern. The fact that learners often benefit from the presence of these correlated cues in learning categories and their co-occurrence relationships is consistent with the possibility that correlated distributional and phonological cues facilitate learning by reducing computational and memory demands on learners. Rather than having to remember numerous individual aX or bY combinations to form word categories and learn their co-occurrence relationships, learners can track much simpler co-occurrence relationships between as and one phonological feature, and between bs and a different phonological feature. The relationships between as and bs and distinctive phonological features are referred to as marker-feature relationships, because the as and bs resemble categories that serve as grammatical markers, as opposed to conveying semantic information.

To test whether infants can also use correlated cues to form grammatical categories, Gerken, Wilson, and Lewis (2005) exposed 17-month-old English-learning infants to Russian words from two categories that contained correlated cues marking their grammatical class. Specifically, Russian words often consist of a stem plus multiple grammatical morphemes marking case and grammatical gender. The familiarization set consisted of feminine words ending in the case markers "oj" and "u" and masculine words ending in the case markings "ya" and "em". The case markings provided distributional cues to the feminine or masculine

category membership of the stems. Critically, a phonological cue marking the category membership of the words was present on half of the words: three of the feminine words contained a derivational suffix "k", and three of the masculine words contained the suffix "tel". Thus, in many feminine words "k" was followed by "oj" and "u" (e.g., "polkoj" and "polku"), while in many masculine words "tel" was followed by "ya" and "em" (e.g., "zhitelya" and "zhitelyem"). Infants who were familiarized with these words were subsequently able to distinguish between novel grammatical words containing those relationships and ungrammatical ones: for example, even if they had not heard "zhitelyem", they were able to distinguish it from the ungrammatical "zhitelu". Using artificial language materials, Gómez and Lakusta (2004) found that 12-month-olds can also learn correlations between distributional and phonological features when 100% or 87%, but not 63%, of Xs and Ys contain distinctive phonological features. These findings suggest that tracking the correlations between distributional and phonological features of words plays an important role in learning grammatical patterns during infancy.

Statistical Learning and Grammatical Structure: Nonadjacent Dependencies

The studies of Gerken et al. (2005), Gómez and Gerken (1999), and Gómez and Lakusta (2004) suggest that infants can track predictive relationships between adjacent segments such as words and word categories. However, in natural languages, predictive relationships can also occur between nonadjacent elements, as in the relationship between auxiliaries such as "is" and the progressive inflection "ing" (e.g., "is running", "is eating", "is talking"). Likewise, a plural noun predicts plural marking on the subsequent verb, but the noun and verb can be separated by modifiers, as in "The kids who were late to school are in trouble." Tracking nonadjacent dependencies may be more difficult than tracking adjacent dependencies because elements must be remembered long enough to be linked to other elements occurring later in time. Another challenge presented by nonadjacent dependencies is that there are many potentially irrelevant relationships for the learner to track, as dependent elements can be separated by several word elements. Consistent with the high demands involved in learning nonadjacent dependencies, both infants and adults have substantial difficulty learning them, and succeed in doing so only under highly restricted circumstances. For example, Newport and Aslin (2004) found that adults successfully detect

nonadjacent dependencies between similar segments (between consonants or between vowels) but not between nonadjacent syllables. In contrast, Peña, Bonatti, Nespor, and Mehler (2002) found that adults used reliable transitional probabilities between nonadjacent syllables to segment words in a continuous speech stream. The participants did not, however, generalize to novel words that maintained the nonadjacent dependencies but contained a novel middle syllable. These findings suggest that there may be some constraints on the kinds of nonadjacent dependencies that are readily acquired.

Tracking nonadjacent relationships appears to pose even more substantial challenges to infants. For example, while infants can track the relationships between adjacent elements in language-like materials well before they turn a year old (e.g., Saffran et al., 1996), they show evidence of sensitivity to grammatical relationships involving nonadjacent elements in their native language only at about 18 months of age (Santelmann & Jusczyk, 1998). Gómez (2002) found that 18-month-olds can also learn nonadjacent dependencies in artificial language materials, provided that potentially competing adjacent dependencies are highly variable and difficult to track. In subsequent research, Gómez and Maye (2005) found that 15-month-old infants also track nonadjacent dependencies, but that 12-month-olds fail to do so. This developmental pattern suggests that increases in infants' memory capacity over the second year facilitate nonadjacent dependency learning. However, prior language experience may also play an important role in the development of this ability.

Just as tracking TPs between adjacent syllables can facilitate learning higher-order patterns in how the resulting words are combined (Saffran & Wilson, 2003), Lany, Gómez, and Gerken (2007) found that exposing adults to adjacent dependencies greatly facilitates their ability to learn nonadjacent dependencies (see also Conway, Ellefson, & Christiansen, 2003; Elman, 1993; Newport, 1990; although see Rohde & Plaut, 1999). For infants to benefit from prior experience, they would have to generalize from strings containing adjacent dependencies to novel strings containing nonadjacent dependencies. Because infants can be more perceptually bound than adults, there is active debate as to whether infants and children are capable of forming such generalizations (Tomasello, 2000; Gertner, Fisher, & Eisengart, 2006; Fisher, 2002). Thus, Lany and Gómez (2008) tested whether 12-month-old infants given experience with adjacent dependencies could learn more difficult nonadjacent ones. They familiarized two groups of infants with strings from an aX bY artificial language as the infants played quietly. Infants in the Experimental group heard aX and bY strings in which Xs and Ys differed in their phonological


properties (Xs were disyllabic and Ys were monosyllabic), and thus the X- and Y-word categories were marked by correlated cues. Infants in the Control condition heard a mixture of aX, aY, bX, and bY strings, such that as and bs were followed by different sets of words (i.e., there were distributional cues present), but those words did not conform to any phonological regularities (i.e., both as and bs predicted disyllabic and monosyllabic words). The Control language thus provided equivalent exposure to individual vocabulary elements and had the same intonational patterns and positional regularities as the aX bY language. Critically, however, it did not contain correlated cues marking word categories and their co-occurrence relationships. Infants in both conditions were subsequently habituated to strings containing nonadjacent aX and bY dependencies, but the particular dependencies had been withheld from familiarization. Only the Experimental infants subsequently dishabituated to strings containing violations of the nonadjacent dependencies. These results suggest that while 12-month-olds typically fail to track nonadjacent dependencies, given relevant prior experience with simpler adjacent dependencies they can subsequently detect novel nonadjacent relationships between words from those categories. Interestingly, Lany and Gomez (2008) found that only female infants generalized to novel nonadjacent dependencies on the basis of experience with adjacent ones. Willits, Lany, & Saffran (under review) also recently found evidence that at 24 months of age female infants are better able to generalize nonadjacency learning than male infants. This pattern of results is consistent with other findings of sex differences in language development. For example, females tend to have more words in their vocabularies than males of the same age (Nelson, 1973), and also tend to lead males in early word combinations (Schacter, Shore, Hodapp, Chalfin, & Bundy, 1978). Hartshorne and Ullman (2006) found that female toddlers are more likely to generalize the regular past-tense ending to irregular verbs (e.g., saying "holded" instead of "held"), particularly for irregular verbs sharing phonological features with regular verbs (e.g., the irregular verb "hold" is phonologically similar to the regular verbs "fold" and "mold"). They suggested that females formed stronger associations between phonologically related verbs, and as a result inappropriately generalized the regular past-tense morphology to irregular phonological neighbors. These findings suggest that female infants may form stronger associations between phonologically related words, and they may shed light on the sex difference observed in Lany & Gomez (2008). If female infants are more sensitive to phonological similarities, they might be better able to notice similarities between X- and Y-elements based on their phonological


properties (i.e., the syllable-number cue), and subsequently to generalize to novel nonadjacent aX and bY combinations. Whether these findings reflect differences in phonological processing, memory development, or other processes involved in language acquisition, they pose an intriguing question for further research on the underlying basis of sex differences in language development. In sum, the findings from artificial language studies suggest that infants may begin to learn the distributional and phonological properties marking grammatical categories such as "noun" and "verb", and their co-occurrence relationships, by the time they are 12 months old. Indeed, there is evidence that English-learning infants begin tracking such distributional regularities in their native language by 12 months of age (Mintz, 2006), and that German-learning infants do so by 14 months of age (Hohle, Weissenborn, Kiefer, Schulz, & Schmitz, 2004). Thus, infants' experience with statistical cues appears to provide a critical foundation for learning adjacent relationships, as well as more difficult nonadjacent ones.
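The aX bY design just described is easy to picture as a small stimulus-generation script. The sketch below is purely illustrative: the vocabulary items, string counts, and random seed are invented for the example and are not the materials used by Lany and Gomez (2008); only the logic of the two conditions (correlated versus uncorrelated cues) follows the description above.

```python
import random

# Hypothetical vocabulary (invented; the actual stimuli differed).
a_words = ["alt", "ush"]                           # marker words of category a
b_words = ["ong", "erd"]                           # marker words of category b
x_words = ["coomo", "fengle", "loga", "paylig"]    # disyllabic X words
y_words = ["deech", "ghope", "jic", "skige"]       # monosyllabic Y words

def experimental_strings(n, rng):
    """aX and bY only: the marker (a vs. b) and the phonological cue
    (disyllabic vs. monosyllabic) are perfectly correlated."""
    strings = []
    for _ in range(n):
        if rng.random() < 0.5:
            strings.append(f"{rng.choice(a_words)} {rng.choice(x_words)}")
        else:
            strings.append(f"{rng.choice(b_words)} {rng.choice(y_words)}")
    return strings

def control_strings(n, rng):
    """aX, aY, bX, bY: as and bs are still followed by different sets of words
    (a distributional cue), but each set mixes disyllabic and monosyllabic
    items, so the phonological cue no longer marks the categories."""
    a_followers = [x_words[0], x_words[1], y_words[0], y_words[1]]
    b_followers = [x_words[2], x_words[3], y_words[2], y_words[3]]
    strings = []
    for _ in range(n):
        if rng.random() < 0.5:
            strings.append(f"{rng.choice(a_words)} {rng.choice(a_followers)}")
        else:
            strings.append(f"{rng.choice(b_words)} {rng.choice(b_followers)}")
    return strings

rng = random.Random(1)
print(experimental_strings(5, rng))
print(control_strings(5, rng))
```

In the Experimental condition the marker words and the syllable-number cue always co-vary, whereas in the Control condition each marker is still followed by its own set of words, but that set mixes disyllabic and monosyllabic items.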

4. Linking Sounds to Meaning: Statistical Cues Marking Word Categories Promote Learning the Semantic Properties of Word Categories

Beyond promoting sensitivity to grammatical patterns, infants' experience with statistical regularities in the sound stream that cue grammatical patterns may provide a foundation for learning words' meanings. As previously mentioned, word frequency may provide clues to grammatical structure, such as the broad distinction between function words and content words. Because content words such as "cat", "ball", "shoe", "run", and "crawl" tend to be mapped to referents, while function words such as "the", "on", or "that" do not have such concrete semantic referents, Hochmann, Endress, and Mehler (2010) hypothesized that tracking frequency information may also play a role in learning word meanings. To test this, they presented monolingual Italian-learning 17-month-olds with a corpus of unfamiliar French sentences exemplifying these frequency imbalances between function and content words. After listening to the familiarization corpus, infants were presented with a determiner-noun phrase while viewing a picture. The determiner had been presented much more frequently than the content word in the familiarization corpus. Infants were then tested on whether they had formed a stronger association with the more frequent determiner or with the less frequent content word. Despite the fact that the determiner and noun were paired with the picture equally often, infants appeared to form a selective association between the noun and


the picture. Infants who were exposed to the phrase-picture pairings with no exposure to the corpus containing information about the frequency imbalance showed no such preference. These findings suggest that when hearing multi-word phrases in the context of a potential referent, infants may be biased to map the relatively infrequent word to the referent. Beyond sensitivity to which words are likely to be mapped to concrete semantic referents, statistical cues might facilitate forming a selective mapping between content words and referents by narrowing the pool of possible referents considered as likely candidates. In particular, words from di¤erent lexical categories are correlated with di¤erent semantic properties: e.g., nouns tend to refer to objects and people, adjectives to properties such as color or texture, and verbs to actions or events. Given the evidence that infants can use correlated statistical cues to form word categories, in a recent study Lany and Sa¤ran (2010) tested whether they can also capitalize on experience with such cues to learn the semantic properties of categories. They first played 22-month-old infants an aX bY artificial language similar to the one used by Lany & Gomez (2008): Infants in the Experimental group heard aX and bY phrases, and thus for them the X and Y words di¤ered in their distributional and phonological properties (i.e., membership in the X and Y categories was reliably marked by correlated phonological and distributional cues). In contrast, infants in the Control group were familiarized with aX and bY phrases in addition to aY and bX phrases, as in the study by Lany & Gomez (2008). Thus, in the Control group, words’ phonological properties did not covary with their distributional properties, and the word categories were not marked by correlated cues. Infants in both conditions were then trained on pairings between phrases from the language and pictures of unfamiliar animals and vehicles. Critically, a subset of aX and bY phrases were heard by infants in both Experimental and Control groups, and these familiar aX phrases were paired with animal pictures, and familiar bY phrases were paired with vehicle pictures (or vice versa). Thus, for infants in both conditions there were reliable associations between individual Xs and Ys and specific referents, and aX phrases always referred to referents of one kind or category, and bY phrases always referred to instances of the other category. Interestingly, only the Experimental infants learned the trained picture-phrase associations, despite the fact that Control infants had the same amount of experience with them. Moreover, only the Experimental infants successfully generalized to novel pairings: when hearing a new word with both distributional and phonological properties of other words referring to animals, they mapped the word to a novel animal over a novel


vehicle. The findings of this experiment suggest that infants’ experience with reliable cues in the initial listening phase promoted learning the semantic properties of individual words, as well as the generalization that aX phrases refer to animals and bY phrases to vehicles. These findings are consistent with research on the role of sensitivity to correlated features in learning object categories. In particular, object categories (e.g., cups, dogs, birds, and trees) are structured such that properties or features characteristic of a category tend to co-occur within individual instances of the category. For example, the presence of one feature of a tree (e.g., branches) is correlated with the presence of the others (e.g., leaves, bark, and trunk) but less strongly correlated with the presence of features associated with objects from other categories (e.g., ceramic handles, fur, and beaks). Mareschal, Powell, Westermann, and Volein (2005) and Younger (1985) found that when the features of a set of fantastical animals form 2 clusters (e.g., fuzzy tails co-occur with ears, while vs. feathered tails co-occur with antlers), infants group the animals into categories corresponding to correlations between values on these dimensions. However, when feature values are randomly distributed, infants do not group the animals into separate categories. These findings suggest that there may be a parallel in the mechanisms underlying forming perceptually based object categories and forming grammatical categories. An important question for future research is whether mature representation of grammatical categories can be characterized in terms of sensitivity to correlated cues, or whether linguistic categories are ultimately organized in a fundamentally di¤erent way. While the findings of Lany and Sa¤ran (2010) suggest that infants readily learn intercorrelations between the distributional, phonological, and semantic properties of word categories, it is possible that the respective associations they formed between semantic information and each of the statistical cues could di¤er in strength. Thus, Lany and Sa¤ran (2011) probed the role of each of these cues in the word learning process. Specifically, after listening to the artificial language containing correlated cues and being trained on associations between phrases and pictures as in Lany and Sa¤ran (2010, Experimental Condition), infants were tested on their ability to find a novel referent given either only a distributional cue or only a phonological cue to category membership. Interestingly, infants’ use of phonological and distributional cues to generalize was related to their native-language proficiency. Specifically, infants with smaller vocabularies successfully used phonological cues to generalize to the novel referents, but failed to use distributional cues. In contrast, infants with larger native-


language vocabularies and more advanced grammatical skills used distributional cues to words’ category membership to generalize. These findings suggest that distributional and phonological information contribute independently to word-learning, and may also develop on di¤erent timetables. Given the evidence that children who are delayed in their vocabulary development, or ‘‘late talkers’’, as well as children with Specific Language Impairment, show deficits in using grammatical knowledge (frequent grammatical morphemes) to build their vocabularies (Moyle, Ellis Weismer, Evans, & Lindstrom, 2007), future work in this paradigm may shed light on the processes critical to their acquisition and successful use in word learning. The findings of Lany and Sa¤ran (2010; 2011) suggest that infants can use statistical information, and specifically distributional and phonological cues, to discover novel associations between word categories and semantic properties. According to the syntactic bootstrapping hypothesis (e.g., Landau & Gleitman, 1985), learning the meanings of words crucially relies on sensitivity to syntactic information, such as the syntactic frame in which a novel word is heard. The findings of Lany and Sa¤ran suggest that experience with the distributional and phonological regularities marking words’ category membership may play an important role in the development of the ability to use structural information to learn words. In future studies it will be important to pursue this link between statistical learning and syntactic bootstrapping by testing whether infants capitalize on statistical cues marking word categories in their native language. For example, given the evidence that English and German-learning infants are sensitive to correlated cues marking word categories like ‘‘noun’’ and ‘‘verb’’ by 12–14 months, this sensitivity could support subsequent word learning in natural language just as they support word-learning in an artificial language. Thus, an important question is whether statistical regularities play a role in the infants’ use of syntactic bootstrapping in their native language. Work by Sandhofer and Smith (2007) suggests that this might be the case; in speech to infants and children adjectives are frequently used in ambiguous frames, and in fact are often used in frames that nouns can also be used in. Children have di‰culty learning adjectives when they are presented in such ambiguous frames, but using novel adjectives in more distinctive frames (frames that distinguish them from nouns) can facilitate learning adjectives. An interesting possibility that should be addressed in future research is that statistical cues may play a role in the di‰culties children have in learning other word classes, such as verbs.
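Returning briefly to the frequency cue with which this section began (Hochmann, Endress, and Mehler 2010), the proposed bias is simple enough to state as a one-line decision rule. The toy corpus below is invented (it is not the French familiarization corpus used in that study), and the rule is only a schematic rendering of the idea that the less frequent word of a phrase is the better candidate for mapping to the referent.

```python
from collections import Counter

# Toy familiarization corpus: determiner-like items recur constantly,
# content-like items are individually rare (all strings hypothetical).
corpus = ("le bamoule le sipou le dakou la finoke la bamoule "
          "le tiblu la sipou le finoke la dakou le tiblu").split()
freq = Counter(corpus)

def likely_referring_word(phrase):
    """Given a two-word phrase heard with a picture, bet on the less
    frequent word as the one that maps onto the referent."""
    w1, w2 = phrase.split()
    return w1 if freq[w1] < freq[w2] else w2

print(likely_referring_word("le dakou"))   # -> 'dakou', the content-like word
```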


Conclusions

Statistical learning plays an important role in learning structure in the sound stream, such as phonological regularities important for word segmentation as well as word-order regularities relevant to acquiring grammatical structure. Critically, sensitivity to statistical structure in one area of language can bootstrap infants' learning of other dimensions of language structure. For example, using TPs between syllables to segment fluent speech subsequently facilitates learning word-order patterns. Likewise, tracking correlations between the distributional properties of words and their phonological properties can facilitate learning complex grammatical structure. Moreover, tracking statistical regularities at all of these levels facilitates learning semantic information. By 17 months of age, words with good sequential statistics are more readily associated with referents than sound sequences that do not have the characteristics of typical words in that language (Graf Estes et al., 2007, in press; Hay et al., 2011). Once infants have begun to learn associations between words and referents, this knowledge influences the associations that infants will subsequently form: words that have the statistical properties of a particular category are readily mapped to new referents from that category (Lany & Saffran, 2010, 2011). Altogether, these studies suggest that infants are able to track statistical regularities relevant to many aspects of language despite the considerable complexity of such structures, and they shed important light on the mechanisms supporting language development.

References

Aslin, R. N., Saffran, J. R., & Newport, E. L. 1998 Computation of conditional probability statistics by 8-month-old infants. Psychological Science, 9, 321–324.
Baldwin, D., Andersson, A., Saffran, J., & Meyer, M. 2008 Segmenting dynamic human action via statistical structure. Cognition, 106, 1382–1407.
Bortfeld, H., Morgan, J. L., Golinkoff, R. M., & Rathbun, K. 2005 Mommy and me: Familiar names help launch babies into speech-stream segmentation. Psychological Science, 16, 298–304.
Braine, M. D. S. 1987 What is learned in acquiring word classes – A step toward an acquisition theory. In B. MacWhinney (Ed.), Mechanisms of language acquisition. Hillsdale, NJ: Lawrence Erlbaum Associates.


Cartwright, T. A. & Brent, M. R. 1997 Syntactic categorization in early language acquisition: formalizing the role of distributional analysis. Cognition, 63, 121–170. Conway, C. M., Ellefson, M. R. & Christiansen, M. H. 2003 When less is less and when less is more: Starting small with staged input. In Proceedings of the 25th Annual Conference of the Cognitive Science Society, 270–275. Mahwah, NJ: Lawrence Erlbaum. Cutler, A. & Carter, D. M. 1987 The predominance of strong initial syllables in the English vocabulary. Computer Speech and Language, 2, 133–142. Elman, J. L. 1993 Learning and development in neural networks: The importance of starting small. Cognition, 48, 71–99. Farmer, T. A., Christiansen, M. H. & Monaghan, P. 2006 Phonological typicality influences on-line sentence comprehension. Proceedings of the National Academy of Sciences, 103, 12203– 12208. Fernald, A., Zangl, R., Portillo, A. L., & Marchman, V. A. 2008 Looking while listening: Using eye movements to monitor spoken language comprehension by infants and young children. In I. A. Sekerina, E. M. Fernandez, & H. Clahsen (Eds.), Developmental psycholinguistics: On-line methods in children’s language processing (pp. 97–135). Amsterdam, Netherlands: John Benjamins Publishing Company. Fisher, C. 2002 The role of abstract syntactic knowledge in language acquisition: A reply to Tomasello (2000). Cognition, 82, 259–278. Fiser, J. & Aslin, R. N. 2002 Statistical learning of new visual feature combinations by infants. Proceedings of the National Academy of Sciences, 99, 15822– 15826. Frigo, L. & McDonald, J. L. 1998 Properties of phonological markers that a¤ect the acquisition of gender-like subclasses. Journal of Memory & Language, 39, 218– 245. Gerken, L. A., Wilson, R., & Lewis, W. 2005 Seventeen-month-olds can use distributional cues to form syntactic categories. Journal of Child Language, 32, 249–268. Gertner, Y., Fisher, C., & Eisengart, J. 2006 Learning words and rules: Abstract knowledge of word order in early sentence comprehension. Psychological Science, 17, 684– 691. Gervain, J., Nespor, M., Mazuka, R., Horie, R., & Mehler, J. 2008 Bootstrapping word order in prelexical infants: A Japanese– Italian cross-linguistic study. Cognitive Psychology, 57, 56–74.


Gómez, R. L. 2002

Variability and detection of invariant structure. Psychological Science, 13(5), 431–436. Go´mez, R. L. & Gerken, L. 1999 Artificial grammar learning by 1-year-olds leads to specific and abstract knowledge. Cognition, 70, 109–135. Go´mez, R. L. & Lakusta, L. 2004 A first step in form-based category abstraction in 12-month-old infants. Developmental Science, 7, 567–580. Go´mez, R. L. & Maye, J. 2005 The developmental trajectory of nonadjacent dependency learning. Infancy, 7(2), 183–206. Graf Estes, K., Edwards, J., & Sa¤ran, J. R. 2011 Phonotactic constraints on infant word learning. Infancy, 16, 180–197. Graf Estes, K., Evans, J. L., Alibali, M. W., & Sa¤ran, J. R. 2007 Can infants map meaning to newly segmented words? Statistical segmentation and word learning. Psychological Science, 18, 254– 260. Hartshorne, J. K. & Ullman M. T. 2006 Why girls say ‘holded’ more than boys. Developmental Science, 9, 21–32. Hay, J. F. 2009 Statistical learning of language. In: Squire L.R. (Ed). Encyclopedia of Neuroscience, vol. 9, pp. 387–392. Oxford: Academic Press. Hay, J. F., Pelucchi, B., Graf Estes, K., & Sa¤ran, J. R. 2011 Linking sounds to meanings: Infant statistical learning in a natural language. Cognitive Psychology, 63, 93–106. Hochmann, J., Endress, A. D., & Mehler, J. 2010 Word frequency as a cue for identifying function words in infancy. Cognition, 115, 444–457. Hohle, B., Weissenborn, E., Kiefer, D., Schulz, A., & Schmitz, M. 2004 Functional elements in infants’ speech processing: The role of determiners in the syntactic categorization of lexical elements. Infancy, 5(3), 341–353. Johnson, E. K. & Jusczyk, P. W. 2001 Word segmentation by 8-month-olds: When speech cues count more than statistics. Journal of Memory and Language, 44, 548– 567. Johnson, E. K. & Tyler, M. D. 2010 Testing the limits of statistical learning for word segmentation. Developmental Science, 13(2), 339–345. Jusczyk, P. W. & Aslin, R. N. 1995 Infants’ detection of the sound patterns of words in fluent speech. Cognitive Psychology, 29, 1–23.

Kelly, M. H. 1992


Using sound to solve syntactic problems: the role of phonology in grammatical category assignments. Psychological Review, 99, 349–364. Kirkham, N. Z., Slemmer, J. A., & Johnson, S. P. 2002 Visual statistical learning in infancy: evidence for a domain general learning mechanism. Cognition, 83, B35–B42. Kucera, H. & Francis, N. 1967 A computational analysis of present-day American English. Providence, RI: Brown University Press. Landau, B. & Gleitman, L. R. 1985 Language and experience: Evidence from the blind child. Cambridge, MA: Harvard University Press. Lany, J. & Go´mez, R. L. 2008 Twelve-month-olds benefit from prior experience in statistical learning. Psychological Science, 19, 1247–1252. Lany, J., Go´mez, R. L., & Gerken, L. 2007 The role of prior experience in language acquisition. Cognitive Science, 31, 481–507. Lany, J. & Sa¤ran, J. R. 2010 From statistics to meanings: Infants’ acquisition of lexical categories. Psychological Science, 21, 284–291. Lany, J. & Sa¤ran, J. R. 2011 Statistical learning mechanisms in infancy. In P. Rakik & J. Rubenstein (Eds.), Developmental Neuroscience – Basic and Clinical Mechanisms. Elsevier. Lany, J. & Sa¤ran, J. R. 2011 Interactions between statistical and semantic information in infant language development. Developmental Science, 14, 1207– 1219. Mattys, S. L. & Jusczyk, P. W. 2001 Phonotactic cues for segmentation of fluent speech by infants. Cognition, 78, 91–121. Mattys, S. L., Jusczyk, P. W., Luce, P. A., & Morgan, J. L. 1999 Phonotactic and prosodic e¤ects on word segmentation in infants. Cognitive Psychology, 38, 465–494. Mersad K. & Nazzi T. 2010 When mom helps statistics. Poster presented at the International Conference on Infant Studies, Baltimore, MD. Mintz, T. H. 2002 Category induction from distributional cues in an artificial language. Memory & Cognition, 30, 678–686. Mintz, T. H. 2003 Frequent frames as a cue for grammatical categories in child directed speech. Cognition, 90, 91–117.


Mintz, T. H. 2006

Finding the verbs: distributional cues to categories available to young learners. In K. Hirsh-Pasek & R. M. Golinko¤ (Eds.), Action Meets Word: How Children Learn Verbs, p. 31–63. New York: Oxford University Press. Mintz, T. H., Newport, E. L., & Bever, T. G. 2002 The distributional structure of grammatical categories in speech to young children. Cognitive Science, 26, 393–424. Monaghan, P., Chater, N., & Christiansen, M. H. 2005 The di¤erential contribution of phonological and distributional cues in grammatical categorization. Cognition, 96, 143–182. Morgan, J. L., Shi, R. & Allopenna, P. 1996 Perceptual bases of rudimentary grammatical categories. In J. L. Morgan & K. Demuth (Eds), Signal to syntax. Hillsdale, NJ: Erlbaum. Moyle, M. J., Weismer, S. E., Evans, J. L., and Lindstrom, M. J. 2007 Longitudinal relationships between lexical and grammatical development in typical and late-talking children. Journal of Speech, Language, and Hearing Research, 50, 508–528. Nelson, K. 1973 Structure and strategy in learning to talk. Monographs of the Society for Research in Child Development, 38. Newport, E. L. 1990 Maturational constraints on language learning. Cognitive Science, 14, 11–28. Newport. E. L. & Aslin, R. N. 2004 Learning at a distance: I. Statistical learning of non-adjacent dependencies. Cognitive Psychology, 48, 127–162. Pelucchi, B., Hay, J. F., & Sa¤ran, J. R. 2009a Statistical learning in a natural language by 8-month-old infants. Child Development, 80(3), 674–685. Pelucchi B., Hay J. F., & Sa¤ran J. R., 2009b Learning in reverse: Eight-month old infants track backward transitional probabilities. Cognition, 113, 244–247. Pena, M., Bonatti, L. L., Nespor, M., & Mehler, J. 2002 Signal-driven computations in speech processing. Science, 298, 604–607. Redington, M., Chater, N., & Finch, S. 1998 Distributional information: a powerful cue for acquiring syntactic categories. Cognitive Science, 22, 425–469. Rohde, D. L. T. & Plaut, D. C. 1999 Language acquisition in the absence of explicit negative evidence: How important is starting small? Cognition, 72, 67–109. Santelmann, L. & Jusczyk, P. 1998 Sensitivity to discontinuous dependencies in language learners: Evidence for processing limitations. Cognition, 69, 105–134.

Saffran, J. R. 2001


The use of predictive dependencies in language learning. Journal of Memory and Language, 44, 493–515. Sa¤ran, J. R., Aslin, R. N., & Newport, E. L. 1996 Statistical learning by 8-month-old infants. Science, 274, 1926– 1928. Sa¤ran, J. R. & Wilson, D. P. 2003 From syllables to syntax: Multi-level statistical learning by 12month-old infants. Infancy, 4, 273–284. Sahni, S. D., Seidenberg, M., & Sa¤ran, J. R. 2010 Connecting cues: Overlapping regularities support cue discovery in infancy. Child Development, 81, 727–736. Sandhofer, C. & Smith, L. B. 2007 Learning adjectives in the real world: How learning nouns impedes learning adjectives. Language, Learning and Development, 3(3), 233–267. Schacter, F. F., Shore, E., Hodapp, R., Chalfin, S., & Bundy, C. 1978 Do girls talk earlier?: Mean length of utterance in toddlers. Developmental Psychology, 13, 388–392. Shi, R., Werker, J., & Morgan, J. 1999 Newborn infants’ sensitivity to perceptual cues to lexical and grammatical words. Cognition, 72, B11–B21. Smith, K. H. 1969 Learning co-occurrence restrictions: rule induction or rote learning? Journal of Verbal Learning and Verbal Behavior, 8(2), 319– 321. Storkel, H. L. 2001 Learning new words: Phonotactic probability in language development. Journal of Speech, Language, and Hearing Research, 44, 1321–1337. Swingley, D. 1999 Conditional probability and word discovery: A corpus analysis of speech to infants. In Proceedings of the 21st Annual Meeting of the Cognitive Science Society (pp. 724–729). Mahwah, NJ.: LEA. Swingley, D. 2005 Statistical clustering and the contents of the infant vocabulary. Cognitive Psychology, 50, 86–132. Teinonen, T., Fellman, V., Na¨a¨ta¨nen, R., Alku, P., & Huotilainen, M. 2009 Statistical language learning in neonates revealed by eventrelated brain potentials. BMC Neuroscience, 10(1), 21. Thiessen, E. D. & Sa¤ran, J. R. 2003 When cues collide: Use of stress and statistical cues to word boundaries by 7- to 9-month-old infants. Developmental Psychology, 39, 706–716.


Thiessen E. D., Hill E. A., & Sa¤ran J. R. 2005 Infant-directed speech facilitates word segmentation. Infancy, 7(1), 53–71. Thiessen E. D. & Sa¤ran, J. R. 2007 Learning to learn: Infants acquisition of stress-based strategies for word segmentation. Language and Development, 3, 73–100. Thiessen, E. D. & Sa¤ran, J.R. 2009 How the melody facilitates the message and vice versa in infant learning and memory. Annals of the New York Academy of Sciences, 1169, 225–233. Thompson, S. P. & Newport, E. L. 2007 Statistical learning of syntax: The role of transitional probability. Language Learning and Development, 3, 1–42. Tomasello, M. 2000 Do young children have adult syntactic competence? Cognition, 74, 209–253. Werker, J. F., Cohen, L. B., Lloyd, V. L., Casasola, M., & Stager, C. L. 1998 Acquisition of word-object associations by 14-month-old infants. Developmental Psychology, 34, 1289–1309. Werker, J. F., Fennell, C. T., Corcoran, K. M., & Stager, C. L. 2002 Infants’ ability to learn phonetically similar words: E¤ects of age and vocabulary size. Infancy, 3(1), 1–30. Willits, J. A., Lany, J., & Sa¤ran, J. R. under review Infants can use meaning to learn about higher order linguistic structure. Yang, C. 2004 Universal grammar, statistics, or both. Trends in Cognitive Sciences, 8, 451–456. Younger, B. A. 1985 The segregation of items into categories. Child Development, 56, 1574–1583.

Word segmentation: Trading the (new, but poor) concept of statistical computation for the (old, but richer) associative approach

Pierre Perruchet and Bénédicte Poulin-Charronnat

1. Introduction: The state of affairs

Oft-cited studies have shown that infants, children, and adults can extract word-like units (hereafter: words) from an artificial language in which these units have been concatenated without any phonological or prosodic markers (e.g., Saffran, Newport, and Aslin 1996). This attests to the fact that listeners are able to exploit the statistical information available in language. More precisely, boundaries between words could be found through the computation of transitional probabilities (hereafter: TPs; Aslin, Saffran, and Newport 1998). Participants would exploit the fact that, on average, the TPs between word-internal syllables are stronger than the TPs between syllables spanning word boundaries.1 The idea that word segmentation of artificial languages attests to the computation of TPs is taken as a foundational principle in most papers in the domain.

1. In keeping with the prevalent view, only the syllabic level will be considered. However, it is worth stressing that statistics computed at the level of phonemes could be just as relevant, and perhaps even more relevant, for finding word boundaries (e.g., Hockema 2006). All the proposals of this chapter could be applied at the phonemic level as well.

Although statistical structure is the only source of information made available in the experimental studies mentioned above, it is commonly thought that word segmentation of natural languages also relies on other cues. The role of phonological and prosodic features, such as lexical stress placement, in word discovery is the best documented (e.g., Creel, Tanenhaus, and Aslin 2006; Curtin, Mintz, and Christiansen 2005; Thiessen and Saffran 2007). Some of these cues (e.g., pauses) are certainly universal, gestalt-like cues for the formation of perceptual units, and even cues that are seemingly language-specific could be more universal than once thought (Berent and Lennertz 2010; Endress and Hauser 2010; Yoshida, Iversen, Patel, Mazuka, Nito, Gervain, and Werker 2010). In addition to acoustical cues, a second


potential source of information (hereafter coined as ‘‘contextual information’’) is provided by the known words surrounding the to-be-discovered new words. To borrow an example given by Dahan and Brent (1999): ‘‘If look is recognized as a familiar unit in the utterance Lookhere! then look will tend to be segmented out and the remaining contiguous stretch, here, will be inferred as a new unit’’ (p. 165). Various experimental evidence of this phenomenon has been reported (Bortfeld, Morgan, Golinko¤, and Rathbun 2005; Cunillera, Camara, Laine, and Rodriguez-Fornells 2010; Dahan and Brent 1999; Perruchet and Tillmann 2010). The view that statistical computations are complemented by the exploitation of acoustical cues and contextual information is quite consensual (e.g., Aslin et al. 1998; Christiansen, Allen, and Seidenberg 1998; Gomez 2007; Seidenberg and MacDonald 1999; Thiessen and Sa¤ran 2003). There are some disagreements, however, regarding how statistical computations combine with other cues, and notably with prosodic or phonological information2. The action of di¤erent cues can be thought of as being mediated by independent processes, which would operate in parallel. Statistical computations would be blind to the perceptual properties of the material. This is the view advocated by Shukla, Nespor, and Mehler (2007), at least with regard to prosody. The authors suggest that TP computations (or other forms of statistical computations) over syllabic representations of speech rely on encapsulated, automatic processes, which proceed irrespective of prosodic break-points. Prosody would act subsequently as a filter, suppressing possible word-like units that straddle two prosodic constituents. Another possibility is that statistical learning is ‘‘guided’’ by perceptual factors. It has long been claimed that the exploitation of statistical regularities needs to be constrained by external factors. The acoustical properties of the speech flow could serve as such constraints (e.g., Gomez 2007; Onnis, Monaghan, Chater, and Richmond 2005; Sa¤ran 2002; Seidenberg and MacDonald 1999). Still another view is that statistical computations would be performed on representations that embed prosodic or phonological 2. To the best of our knowledge, the role of contextual information in the discovery of new words has never been considered jointly with the role of statistical computation. A TP, for instance, is a value inherent to a pair of syllables, which does not depend on whether the local context in which this pair of syllables appears is known by the learner. As asserted by Dahan and Brent (1999), ‘‘transitional-probability computations do not take into account the segmentation points in previous utterances; in other words, having isolated some words does not help in isolating other words or even the same words later on’’ (p. 166).


information. Curtin et al. (2005), for instance, suggested that stressed and unstressed syllables with the same segmental content could be treated as different primitives for the computation of TPs. Our starting point is that the lack of any principled constraints regarding the interplay of statistical computations with the other factors that have been shown to influence word extraction in natural languages stems from the conceptual indeterminacy of the notion of statistical computation itself. To begin with, what is meant exactly by computation in this context is not clearly defined. Some authors may have in mind the type of formal computations that a statistician would perform in the same situation (the main difference being that human learners would perform them unconsciously), whereas others may prefer to think that learners approximate statistics through the progressive tuning of associative links, on the model of neural networks. However, whatever the preferred option, the same caveat remains: the notion of statistical computation remains underdetermined, a piece superimposed on the architecture of the mind.
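Before turning to our thesis, it may help to make the standard TP account concrete. The following sketch segments a toy syllable stream by positing word boundaries at local dips in forward TP; the mini-lexicon, stream length, and boundary heuristic are illustrative assumptions, not the stimuli or procedure of any particular study cited above.

```python
from collections import Counter
import random

# Illustrative mini-lexicon of trisyllabic nonsense words (invented here).
lexicon = [("tu", "pi", "ro"), ("go", "la", "bu"),
           ("bi", "da", "ku"), ("pa", "do", "ti")]

rng = random.Random(0)
stream, prev = [], None
for _ in range(300):                       # concatenate words with no pauses
    word = rng.choice([w for w in lexicon if w != prev])
    stream.extend(word)
    prev = word

# Forward transitional probabilities: TP(y | x) = freq(xy) / freq(x).
pair_freq = Counter(zip(stream, stream[1:]))
syll_freq = Counter(stream[:-1])
tp = {pair: n / syll_freq[pair[0]] for pair, n in pair_freq.items()}

# Posit a word boundary at local dips in TP between successive syllables.
tps = [tp[(stream[i], stream[i + 1])] for i in range(len(stream) - 1)]
boundaries = {0, len(stream)}
for i in range(1, len(tps) - 1):
    if tps[i] < tps[i - 1] and tps[i] < tps[i + 1]:
        boundaries.add(i + 1)              # cut after syllable i

cuts = sorted(boundaries)
words = ["".join(stream[a:b]) for a, b in zip(cuts, cuts[1:])]
print(Counter(words).most_common(6))       # the lexicon items should dominate
```

On such a stream, word-internal TPs are close to 1 while TPs spanning word boundaries hover around 1/3, so the dips line up with the intended boundaries. The question pursued in the following sections is whether this behavioral outcome really requires anything like this computation.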

2. Our thesis

The statement above could prompt us to delve into the notion of statistical computation, with the elaboration of a new integrative framework as an ultimate objective. But do we actually need a new framework? Is it plausible that word segmentation would require a mode of learning that, despite its claimed importance and pervasiveness, would have gone undetected despite the overwhelming amount of research devoted to learning processes over so many decades? In a nutshell, our thesis is that the phenomena currently encompassed under the label of "statistical learning" are nothing else than the end-product of ubiquitous associative learning processes and, moreover, that the associative tradition in fact provides the basic architecture for a much more powerful integrative framework than that provided by the recent literature on word segmentation and language acquisition. Such a thesis may look paradoxical. Associative learning may sound like an outdated, old-fashioned concept, one that lost much of its relevance for language after Chomsky's (1959) famous commentary on Skinner's Verbal Behavior. Statistical learning, by contrast, seemingly provides a flourishing framework, and there is some irony in replacing a new and promising concept with an older, apparently worn-out one.


An immediate objection to our proposal could be that an associative framework is a priori ill-suited because it would be unable to account for the learning of TPs in the first place. Going against this objection, we will show in the next section (Section 3) that it has been known for some 40 years that conventional processes of associative learning make learners sensitive to TPs, and even to more complex measures of statistical association. Of major interest is that researchers, far from appealing to some unspecified computational abilities to account for this sensitivity, have provided interpretations relying on simple psychological processes. Section 4 will show that endorsing an associative view does not only change our understanding of the way human learners exploit the statistical structure of the input to extract words: an associative view naturally accounts for the action of the other factors that have been shown to be influential in word segmentation, namely the presence of acoustical cues and contextual information. Moreover, as we will argue in Section 5, the mechanisms involved in the exploitation of statistical regularities, when considered in a dynamic perspective, give the other sources of information an influence that would be far weaker if those sources were considered in isolation. To summarize, our claim is that trading the recent statistical learning view for the older associative tradition allows a dynamic integration of phenomena that would otherwise require an array of limited-scope and ad-hoc processes.

3. Accounting for the sensitivity to statistical regularities

3.1. Which statistics?

Before examining how associative principles account for the human sensitivity to statistical regularities, one needs to make clear the type of statistics that is involved here and, notably, to assess whether the current focus on TPs is actually warranted. For the sake of simplicity, let us consider the case of two successive events, A and X. Table 1 displays a standard 2 × 2 contingency matrix between the two events. A first index of relationship is given by a, which represents the number of AX pairs. For somewhat obvious reasons, the pure co-occurrence frequency is quite limited as an indicator of the strength of the relationship between two events. The TP, which is the probability that A is followed by X and can be computed as TP(X|A) = a/(a + b), indisputably provides a more relevant measure. Considering co-occurrence frequency and TP may lead to opposite conclusions about the strength of an association. For instance, "ED" is a more frequent bigram than "QU" in written English,


Table 1. A contingency matrix: "a" stands for the number of AX sequences, "b" for the number of occurrences of A not followed by X, "c" for the number of occurrences of X not preceded by A, and "d" for the number of events comprising neither A nor X. Not-X can be read as any context from which X is absent (this is the usual case in the conditioning domain) or, alternatively, as an identified event or class of events (this is the usual case in the word segmentation literature), and likewise for not-A.

                          e2
      e1            X          not-X
      A             a            b
      not-A         c            d

hence suggesting that "E" is more strongly associated with "D" than "Q" with "U". However, "Q" is much more predictive of "U" than "E" is predictive of "D" ("E" is more often followed by "R" or "S" than by "D"; source: Wikipedia, letter frequency). Aslin et al. (1998) provided the first demonstration in the word segmentation domain that humans, and more precisely 8-month-old infants, are sensitive to TPs when the raw co-occurrence frequency has been controlled. However, TP as such provides only a part of the information about the relationship between A and X. To assess whether A is a useful cue for the occurrence of X, the probability of X in the presence of A should be compared with the probability of X in the absence of A. If, for instance, there are better or more salient predictors of X, it may be adaptive to ignore A and to focus on more relevant events, irrespective of whether A carries some predictive information about the occurrence of X when it is considered in isolation. The resulting statistic is Delta P, which is computed as: Delta P = a/(a + b) − c/(c + d). In a classical paper, Rescorla (1968) demonstrated that rats were not only sensitive to TPs (rather than raw frequencies), but also to Delta P, and this result has been confirmed and generalized to many other species in subsequent studies. It is worth noting that neither TP nor Delta P is a complete measure of association, because both are limited to assessing the forward relation between A and X (i.e., the probability that X follows A). The strength of an association may also be related to the backward relation (i.e., the probability that A precedes X). Backward TP, denoted TP′, can be computed as TP′(A|X) = a/(a + c), and likewise, Delta P′ can be computed as Delta P′ = a/(a + c) − b/(b + d). Relying on forward or


backward relations may, again, lead to opposite conclusions. For instance, in the bigram "QU", the forward TP is nearly 1, whereas the backward TP is very low, because "U" can be preceded by most other letters of the alphabet. Perruchet and Desaulty (2008) showed that adult participants were able to learn the words of an artificial language when the only available cues were the backward TPs. This ability was confirmed by Pelucchi, Hay, and Saffran (2009) in 8-month-old infants. It makes sense to conceive that the tightness of the association between A and X depends on both forward and backward relations. Interestingly, Pearson r, commonly called rΦ with dichotomous data, can be expressed as the geometric mean of the forward and backward Delta Ps (Perruchet and Peereman 2004), and written as:

rΦ = √[(a/(a + b) − c/(c + d)) · (a/(a + c) − b/(b + d))]

Note that alternative measures of association (such as χ2 and mutual information) also assess bidirectional relations. A few studies suggest that learners could be sensitive to bidirectional measures of association, such as Pearson r (Perruchet and Peereman 2004) and mutual information (Swingley 2005).3

3. As an aside, this discovery has implications in the debate surrounding the question of the relative weight different factors may play in language acquisition. Indeed, Yang (2004) has questioned whether statistics could play a substantial role, based on analyses of child-directed corpora showing that even an optimal exploitation of (forward) TPs has nearly negligible usefulness in extracting the words of natural language. The usefulness of such statistics may have been underestimated, however. Swingley (1999) considered bidirectional relations, and when his analyses were restricted to the words that occurred five or more times in the corpus, more than 60% of the extracted units were words (i.e., accuracy score), and less than 40% of actual words were not extracted (i.e., completeness score). These values clearly undermine the a priori argument that statistical information would be too impoverished to be useful in word learning, and, although this does not prove that infants actually use this information, it would seem somewhat ill-adaptive that a source of information that is both useful (Swingley 1999) and easily exploitable by learners, including infants (e.g., Saffran et al. 1996), would be neglected.
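For readers who prefer to see these definitions side by side, here is a small numerical illustration computed from the cell counts of Table 1. The counts are invented for the example, and the code is merely a restatement of the formulas above (plus a check against the textbook phi coefficient), not a claim about how learners operate.

```python
import math

def association_stats(a, b, c, d):
    """Forward/backward TP and Delta P, plus r_phi, from the cell counts
    of a 2 x 2 contingency table laid out as in Table 1."""
    tp_forward = a / (a + b)                    # TP(X|A)
    tp_backward = a / (a + c)                   # TP'(A|X)
    dp_forward = a / (a + b) - c / (c + d)      # Delta P
    dp_backward = a / (a + c) - b / (b + d)     # Delta P'
    # Both Delta Ps share the sign of (ad - bc), so their product is >= 0;
    # the geometric mean below therefore equals the magnitude of phi.
    r_phi = math.sqrt(dp_forward * dp_backward)
    return tp_forward, tp_backward, dp_forward, dp_backward, r_phi

# Invented counts: A is almost always followed by X, but X frequently
# occurs without A (the "QU"-like case discussed in the text).
print(association_stats(a=50, b=2, c=400, d=1000))

# Cross-check against the standard phi coefficient for the same table.
a, b, c, d = 50, 2, 400, 1000
phi = (a * d - b * c) / math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
print(phi)   # matches r_phi above
```

With these counts the forward TP is high while the backward TP is low, and the square root of the product of the two Delta Ps coincides with the phi coefficient, as stated above.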


3.2. Associative interpretations

Up to now, whereas our initial objective was to account for the behavioral sensitivity to forward TPs through simple associative mechanisms, we have seemingly moved in the wrong direction. Indeed, the statistics to which human learners are sensitive are in fact much more complex than forward TPs.4 From a computational standpoint, the increase in complexity is indeed unquestionable: as can be seen above, the formula for (forward) TPs is included as a small initial component in the formula for rΦ. We intend to show now that sensitivity to complex statistical measures such as rΦ (and a fortiori to simpler measures) can in fact be accounted for by very simple mechanisms. Let us consider again the 2 × 2 contingency table above. To recap, claiming that performance does not depend only on the frequency of co-occurrence means that, with a held constant, performance also depends on b (increasing b decreases the forward TP), on c (increasing c decreases the backward TP), or on both b and c (increasing b and c decreases the correlation). In other terms, for a fixed number of AX pairs, the probability of creating an association between A and X depends on the number of A and/or X events that are perceived in other contexts. Overall, the larger the number of A and/or X events perceived in other contexts, the more difficult the formation of the AX chunk. We propose below two complementary accounts of this outcome, an attention-based account and an interference-based account, both relying on associative learning principles.

3.3. An attention-based account

A problem in referring to the associative learning tradition is that everyone who has heard about the effects of repetition and extinction, and who knows concepts such as memory strength, decay, and interference, may believe they have mastered the basic tenets of this approach. These notions are indeed important. However, there are also other essential notions that are often neglected, the role played by attentional factors certainly being the main one. Indeed, a fundamental principle is that the formation of any associative link between two elements depends on the learner's joint attention to those elements. This principle obviously holds for complex and supervised forms of learning (any teacher presumably attempts to capture students' attention), but for the most simple and implicit forms of

4. Note that learners' ability to exploit backward TPs as well as forward TPs is no longer compatible with the prediction-based logic of Simple Recurrent Networks (SRNs), which are the most widely used computational models of sequential learning. Because relying on SRNs for forward TPs and on other mechanisms for backward TPs would lack parsimony, this pattern of results suggests that SRNs are not the most appropriate models in this context.


learning as well (e.g., Hsiao and Reber 1998; Logan 1988; Mackintosh 1975; Pacton and Perruchet 2008). Unsurprisingly, the word segmentation domain does not make exception (Toro, Sinnett, and Soto-Faraco 2005). Note that conceiving attention as a condition for the formation of an association between two events is not a late and surreptitious addition from cognitive researchers to the conventional associative view. A study reported in a book published in 1932 by Thorndike (who may hardly be suspected of cognitive penchant) illustrates the point. We report this study with some details5, because it also enlightens what is meant exactly by ‘‘attention’’ in this context. Thorndike gave learners 254 word-number pairs. Half of the participants were asked to hear the words and numbers without paying special attention to them, while the other half were asked to pay as close attention as they could do to the materials. In a first test, participants were presented with a word and asked to supply its associated number in the training list. Results revealed that subjects who actively attended to the list outperformed those who were told to remain passive, but the recall score of passive subjects was largely above chance, nevertheless. More importantly, in a second test, the participants were presented with a number and asked to supply the word that began the next pair in the training list. In fact, unbeknown to the participants, some word-number pairs were so placed in the training series that a particular word always followed a particular number. Recall scores were at chance on the second test, even for the attentive group (for recent, conceptually similar findings, see Pacton and Perruchet 2008). For Thorndike, these results exemplify a condition for associative learning that he coined as the principle of belonging. The property of belongingness refers to as whether events are perceived as going together, without a logical or causal link between the events of concern being required in any way. The intended meaning is that incoming information is naturally divided into a succession of units, and that a necessary condition for the creation of an associative link is that the to-be-associated events are perceived as belonging to a same unit. If a list is perceived as a succession of word-number pairs, associations between the (final) number of a pair and the (initial) word of the next pair are not learned, whatever the ‘‘objective’’ contiguity of the two events, and whatever the amount of attention devoted to the task. The principle of belonging usefully specifies the claim 5. Based on the report given by Postman (1962). We thank David Shanks for having drawn our attention to this work and made it available on his website.


that attention is necessary for learning. Certainly, some quantitative level of orientation towards the to-be-learned materials is needed, but the amount of attention learners naturally pay to their environment, without special intentional e¤ort and concentration, seems to be su‰cient, as exemplified in the ‘‘passive’’ group of Thorndike (1932). The critical issue is the qualitative content of the fleeting attentional focus. A condition for the establishment of stable associations between two events is the perception of these events within a single attentional chunk. As an aside, learning dependency on attention is essential to the explanatory power of an associative view 6. Indeed, one of the recurrent objections against the relevance of associative processes for language acquisition is the idea that the number of possible associations between the elements displayed in the input is so large that an explanation based on the exploitation of statistical regularities would be doomed to failure due to combinatorial explosion (e.g., Pinker 1984). This objection essentially reflects a misrepresentation of the laws of associative learning, because it is known for long that the possibility of formation of new associations depends on a number of constraints. However, the most pervasive of these constraints is certainly those stemming from the need of attention: Attention serves as a natural filter to avoid combinatorial explosion. Interestingly, this filter does not act as a blind mechanism that would operate a random selection among the possible candidates for the creation of new associations. Indeed, attention is naturally oriented towards events that have high chance of being relevant in the current context, due to the intrinsic properties of these events (e.g., attention is captured by novelty, and novel events are presumably those that need to be integrated in new associative networks) and to the e‰ciency of social cues, even in infants (e.g., Wu and Kirkham 2010). Coming back to the concern of this chapter, how may attention account for the behavioral sensitivity to statistical structure? The classical literature on animal conditioning has provided some responses (e.g., Mackintosh 1975; Pearce and Hall 1980). We do not intend to enter into the intricacies of the debates surrounding this issue. Rather, we present here a simplistic, general sketch, in the hope of making clear the gist of attention-based 6. Paradoxically, considerable e¤ort has been devoted to demonstrate the possibility of learning without attention or awareness of the to-be-learned relationships, presumably driven by the idea that relaxing learning from the constraints inherent to limited attentional resources seemingly extends its power and field of application. Arguably, the exact opposite is true.


theories. Referring again to the terminology used in the 2 × 2 contingency table above, it is obvious that the attention devoted to A does not depend only on the number of AX pairs (a), but more generally on the overall number of A events (a + b). Indeed, the amount of attention devoted to a given event is known to be inversely related to its frequency, because repetitions induce habituation. As a consequence, increasing b (holding a fixed) necessarily decreases the amount of attention devoted to A, and hence the probability for A and X to be perceived in a single attentional chunk. This accounts for the sensitivity to forward TPs. It suffices to switch A and X around (and, as a consequence, b and c) in the reasoning above to account for the sensitivity to backward TPs. These permutations are perfectly consistent with our knowledge about the role of attention, given that in this account A and X play a symmetrical role (i.e., the formation of an associative link depends on the joint attention to both events). Of course, accounting for bidirectional measures of association such as Pearson r naturally follows.

3.4. An interference-based account

In the reasoning above, the only envisioned consequence of increasing the number of b (and/or c) events in Table 1 bears on the raw frequency of A (and/or X): increasing the frequency of an event entails some attentional deficit for this event. However, the matter is a little more complex. In fact, increasing the number of b (and/or c) events in Table 1 also increases the number of potential associations of A (and/or X) with one or several other events. For example, if a counts the sequence gati and b counts gafo and gamu, the presence of the latter events means that ga is now associated with three different syllables, ti, fo, and mu. This consideration does not invalidate the reasoning above, and hence an explanation based on attention still holds: the syllable ga may receive a larger amount of attention if ga is heard only in gati than if ga also appears in gafo or gamu, simply because ga turns out to be less frequent in the former case than in the latter. However, another phenomenon may combine with the modulation of attention, namely the generation of interference. We refer here to nothing else than the classical paradigm of interference, as described in any psychology textbook: people learn a first list of pairs (usually denoted AB) and then a second list of pairs that bears a relation to the initial target pairs (AC). Learning AC has a more detrimental influence on the retrieval of AB than learning a list of unrelated items (e.g., DE). In keeping with this phenomenon, the memory


for gati will be impaired by the presentation of gafo or gamu, with regard to a situation where ga would always be followed by ti. This effect directly accounts for the behavioral sensitivity to forward TPs. Because interference also occurs backwards (the memory for gati would be impaired by the presentation of foti or muti), the processes of interference also account for the behavioral sensitivity to backward TPs and, by way of generalization, to bidirectional measures of association.

Any explanation relying on simple and general principles to account for complex phenomena is often questioned for its power by those who find ad-hoc modules or mechanisms more attractive. Fortunately, implementing these simple accounts in computational models makes it possible to address this concern directly. PARSER (Perruchet and Vinter 1998) is a computational model devised to discover words in a nonsegmented speech flow without involving any principles or processes other than those belonging to the associative tradition. Based on the phenomenon that, in humans, attentional coding of the incoming information naturally segments the material into disjoint parts, the model is provided online with a succession of candidate units, some of them relevant to the structure of the language and others irrelevant. The relevant units emerge through a selection process based on forgetting. Crucially, forgetting is the end product of both decay and interference. Decay is implemented as a linear decrement of the weight of the candidate units across the training session. If forgetting were due only to decay, PARSER would be sensitive only to the raw frequency of co-occurrences (i.e., a in Table 1): The candidate units resisting forgetting would simply be those that occur most frequently in the speech flow. Interference allows the model to be sensitive to much more sophisticated measures of cohesiveness. Let us consider two artificial languages, one (L1) in which gati would be a word, and a second language (L2) in which gati would occur as a part-word (i.e., at least one word would end in ga and at least one other word would begin with ti). It is of course possible that gati occurs as frequently in L1 as in L2 (as in Aslin et al.'s 1998 design), hence making a decay process inefficient for discovering that gati is a word in L1 but not in L2. The point is that in L2, ga will necessarily be followed by other syllables, hence making possible the creation of other candidate units, such as gafo or gamu, which generate interference with gati. As a consequence of interference, gati should normally disappear from the lexicon of a learner trained with L2, by contrast with a learner trained with L1, for whom interference would be null or reduced. These processes should account for the model's sensitivity to forward TPs. As
for the attention-based model described above, the roles of the two to-be-associated events can be switched, hence accounting for backward TPs and bidirectional measures of association. Modeling of the data fully confirmed these predictions (e.g., Perruchet and Peereman 2004). Whereas some critics of PARSER doubted that such a simple model would be sensitive to more than raw co-occurrence frequencies (e.g., Hunt and Aslin 2001), this achievement demonstrates that very elementary associative principles are powerful enough to account for the behavioral sensitivity to sophisticated statistics. To sum up our main point so far: The concepts developed in the literature on associative learning and memory are sufficient to account for the behavioral sensitivity to measures of association that include the standard TPs, but also much more sophisticated and powerful statistical measures. Of course, this does not rule out the idea that human learners compute TPs or other statistics, as the prevalent view contends. However, there is no need for such a postulate: Behavioral sensitivity to these measures can be understood alternatively as a by-product of a few ubiquitous processes, which are at the core of the associative tradition.
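To make the measures at stake concrete, the following short sketch (our illustration, not part of the chapter's argument) computes the forward TP, the backward TP, and a bidirectional measure (the phi coefficient, i.e., the Pearson r for binary variables) from a 2 × 2 contingency table. The cells a, b, and c are used as in Table 1; d, standing for the cases in which neither A nor X occurs, is our own completion of the usual 2 × 2 layout, and the example counts are invented.

```python
import math

def association_measures(a, b, c, d):
    """a = AX co-occurrences, b = A without X, c = X without A, d = neither."""
    forward_tp = a / (a + b) if (a + b) else 0.0     # P(X | A)
    backward_tp = a / (a + c) if (a + c) else 0.0    # P(A | X)
    denom = math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
    phi = (a * d - b * c) / denom if denom else 0.0  # bidirectional measure (Pearson r)
    return forward_tp, backward_tp, phi

# Invented counts: 'ga ti' heard 30 times, 'ga' followed by another syllable
# 60 times, 'ti' preceded by another syllable 10 times, 900 other transitions.
print(association_measures(30, 60, 10, 900))
```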

4. Integrating other sources of information

As recalled in the introductory section, acoustical cues in the speech flow and contextual information are also exploited in the word segmentation of natural languages. Insofar as the exploitation of statistical structure is attributed to ad-hoc computational abilities, as in the mainstream tradition, the role played by these factors appears to be superimposed on statistical computation, without principled constraints to predict whether and how the multiple factors could interact. Our thesis is that once the vague notion of "statistical computation" has been traded for well-known, specified mechanisms of associative learning and memory, a unified, integrative interpretation arises naturally. Let us consider again the contingency matrix in Table 1. The core condition for the formation of an association between A and X, or in other words the formation of the chunk AX, is the learner's joint attention to A and X. In the prior section, we examined how pure distributional factors (i.e., the occurrence of A and/or X in other contexts) can modulate the amount of attention devoted to these events and the pattern of interference, hence accounting for the sensitivity to the statistical structure. However, it is obvious that the chance for A and X to belong to a same
attentional chunk at the outset of training also depends on a number of factors, notably prosodic and phonological factors. Let us assume that the sequence AX appears in one of the three following conditions (these conditions are freely inspired by Shukla et al. 2007):

(1) [S1-S2-S3-A-X-S6-S7-S8-S9-S10] [S1-S2-S3-A- . . .
(2) [S1-S2-S3-S4-S5-S6-S7-S8-A-X] [S1-S2-S3-S4- . . .
(3) [X-S2-S3-S4-S5-S6-S7-S8-S9-A] [X-S2-S3-S4- . . .

where the Sx are different syllables (their numbering conveys no information other than the fact that they are different), and square brackets mark intonational phrases, that is, sets of syllables bounded by natural break-points in speech. For instance, "] [" may be a short pause in the speech flow. From a purely statistical standpoint, the relation between A and X is the same in all three cases. If one endorses the view that statistical computations are performed by the listeners, there is no reason to think that these computations would differ between the three conditions, and hence any difference between conditions will be taken as attesting to the action of other processes. By contrast, if the apparent results of statistical computations are accounted for by the action of ubiquitous processes of associative learning and memory, as proposed above, the predictions that can be put forth are straightforward. In (1), AX is included within an intonational phrase, so it is possible for AX to be perceived within an attentional chunk. This is not necessary, however: Ten syllables surely exceed the range of an attentional percept, and it is possible that spontaneous processes of chunking introduce a subjective boundary between A and X. The probability for A and X to be separated under condition (2) is certainly far smaller, due to the fact that the pair AX is bounded on the right by a natural break-point. By contrast, the probability for A and X to be perceived in a single attentional chunk is nearly null under condition (3), given that they are separated by a natural break-point. Therefore an associative view predicts that discovering AX will be possible in (1), optimal in (2), because the edge effect maximizes the probability for AX to be perceived within a single percept, and nearly impossible in (3), because there is no chance for AX to be perceived as an attentional unit. These predictions are direct consequences of Thorndike's principle of belonging. Interestingly, these predictions are clearly confirmed by the results of Shukla et al. (2007; see also Seidl and Johnson 2006, for related data on infants).
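As a purely illustrative sketch (our own, with an arbitrary maximum chunk span, and capturing only the all-or-none part of the prediction, not the graded difference between conditions (1) and (2)), the constraint can be written as a simple test: A and X can enter an associative link only if they can fall within one attentional chunk, and chunks never span an intonational-phrase boundary.

```python
def can_be_chunked(phrases, a, x, max_span=4):
    """phrases: list of intonational phrases, each a list of syllables."""
    for phrase in phrases:
        for i, syl in enumerate(phrase):
            if syl == a and x in phrase[i + 1:i + max_span]:
                return True   # A and X close together within one phrase
    return False

# Condition (2): A-X at the right edge of a phrase -> chunkable.
print(can_be_chunked([["s1", "s2", "s3", "A", "X"]], "A", "X"))          # True
# Condition (3): A and X separated by a phrase boundary -> not chunkable.
print(can_be_chunked([["X", "s2", "s3", "A"], ["X", "s2"]], "A", "X"))   # False
```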

The effect of contextual information in the speech flow on the discovery of new words can easily be accounted for within the very same framework. Indeed, attentional chunks do not overlap. In Thorndike's (1932) experiment outlined above, if word1-number1 is a unit and word2-number2 is the following unit, number1-word2 cannot also be perceived as a unit. As a consequence, the element following a known chunk is naturally perceived as the beginning of a new chunk. Let us consider the two following conditions:

(1) S1-S2-S3-A-X-S6-S7- . . .
(2) S1S2S3-A-X-S6S7- . . .

where the deletion of hyphens in (2) is intended to mean that "S1S2S3" and "S6S7" are known words, instead of being perceived as sequences of independent syllables. The probability of perceiving AX as a single percept is obviously higher under condition (2) than under condition (1), and hence again, although a computational view would consider the two situations as similar, the predictions of an associative framework are straightforward: AX should be easier to learn under condition (2) than under condition (1), as found in empirical studies (e.g., Bortfeld et al. 2005; Cunillera et al. 2010).

To sum up, in an associative view, the effects of statistical structure, acoustical cues, and contextual information cannot be separated. All of them modify the probability that the to-be-related events (A and X in Table 1, which may be, for instance, the syllables composing a word in real-world settings) are perceived within a single attentional unit, which serves as raw material for the action of associative processes.

5. Towards a dynamic view

5.1. A general outline

Up to now, only general principles from the associative tradition have been involved. To the best of our knowledge, their application to the word segmentation issue is novel, as is the demonstration that simple interference processes may account for the sensitivity to TPs and more complex measures of association. However, we did not introduce any new postulates or concepts. By contrast, the proposal that follows relies on a principle that, although in no way contradictory to the laws of associative learning and memory, has not yet been exploited in this approach. This principle posits that among the conditions likely to focus attention
on a set of components – or, in other words, to fulfill Thorndike's principle of belonging – is the fact that these components have been perceived as a chunk in earlier processing episodes. If the earlier processing episodes have been so frequent that a long-lasting representational unit has been created, the above principle is trivial: Everyone would agree that a familiar word or object is perceived as a whole by adults. The new point here is that the phenomenon occurs at a much smaller scale, for instance during the few seconds or minutes following a single episode. Although the literature on priming could provide at least indirect evidence supporting this idea, more direct demonstrations have been provided in the domain of object perception (e.g., Scholl 2001). Objects are often defined as the product of gestalt-like grouping principles, presumably innate, such as the principles of continuation or common fate. It is also largely acknowledged that long-term familiarity with objects leads their features to be processed as a whole. However, it has also been shown that even a recent and sporadic experience with a novel shape is sufficient to facilitate its processing as a unitary whole upon subsequent occurrences (e.g., Zemel, Behrmann, Mozer, and Bavelier 2002). When applied to the language domain, this kind of phenomenon lies at the root of a dynamic approach to word segmentation.

How does this principle work? Let us assume that a given sequence, say gati, has been perceived for the first time within a single attentional chunk due to its acoustical properties, for instance because it was displayed at the end of an intonational phrase. It is highly unlikely that this single presentation would be sufficient to create gati as a definitive lexical unit, and it is not certain that gati will appear on subsequent occurrences in such favorable conditions. Let us suppose that the next occurrence of gati happens within the speech flow without any acoustical markers. If "statistical learning" is conceived of as statistical computations, there is no reason for gati to be processed differently from any comparable sequence of two syllables. In our framework, by contrast, gati will be specifically strengthened, because provisional internal representations now guide the learner's attention as a substitute for external cues. The very same reasoning holds for the action of contextual information. We referred in the introduction to Dahan and Brent's (1999) example: If look is a word of the language, then here will be perceived as a new unit when hearing "Lookhere!". Our claim is that the same is true if look is a provisional, short-lived chunk. Suppose, for instance, that a child knowing neither look nor here is told "Look! Lookhere!". Because the first occurrence of look will presumably be processed as a (provisional) unit due to
the prosodic markers, "lookhere", instead of possibly being perceived as a single unit, will be correctly segmented, with the beneficial consequences of both strengthening look and creating here. What the prior examples illustrate is that a given unit does not need to be perceived in acoustically or contextually favorable conditions on each of its occurrences in order to acquire a stable internal representation: Once this unit has been perceived as such, it will naturally be strengthened on the near subsequent occurrences, whatever its perceptual salience, due to the fact that even short-lived representational units capture attention. Word segmentation, in this framework, appears as the end result of a dynamic organization. Initially, words (or parts of words) may be perceived as subjective units, due to their acoustical properties or to contextual information, which have been shown to be especially salient in child-directed speech. Then these units are strengthened by the action of associative processes, because the nascent and short-lived representational units generated by earlier experiences themselves create the conditions for their own strengthening, namely their processing as a single attentional percept. This phenomenon is all the more adaptive given that analyses of natural language corpora have shown that the probability of encountering a word that has just been met is high in the near future and then decreases as time elapses (e.g., Anderson and Schooler 1991).

5.2. Empirical and computational evidence

This framework makes it possible to draw original predictions. If the processing of acoustical and contextual information is independent from statistical computations, the effects of statistical, acoustical, and contextual information should be roughly additive (assuming no ceiling effects). In the framework outlined above, the various sources of information have clearly interactive effects. For instance, minor and sporadic acoustical factors providing a positive cue for word discovery may have a strong effect. Also, if prosodic factors prevent any possibility for two syllables to be perceived in the same attentional focus (as in Condition (3) of the above example, where the constituents are on either side of a prosodic boundary), no association will be created, irrespective of the strength of the relation that a statistician could calculate. These predictions were tested in a recent study (Perruchet and Tillmann 2010). The experimental situation was quite similar to the situation introduced by Saffran and collaborators (e.g., Saffran et al. 1996).
Participants had to listen to an artificial language composed of six trisyllabic artificial words, randomly concatenated without any pauses. The main difference from the standard situation was that the Initial Word Likeness (IWL) of three of the six artificial words (hereafter, the biased words) was manipulated. IWL refers to the probability that a new sound sequence will be considered a word, due to its acoustical properties (Bailey and Hahn 2001). For one group of participants (IWL+), the biased words, when heard in a continuous speech stream, were spontaneously perceived as words more often than part-words, whereas the relation was reversed for the second group of participants (IWL−). Participants from both groups were presented with two successive two-alternative forced-choice tests (a word versus a part-word) and had to select the syllable set forming a word in the previously heard syllable stream. The first test occurred after a very limited exposure to the language. Unsurprisingly, performance for the biased words was better in Group IWL+ than in Group IWL−, due to the effects of acoustical factors (Figure 1, Initial conditions, black bars). The second test occurred at the end of the experiment, as usual. Two opposite predictions were possible. If one posits that statistical computations are independent from the effect of acoustical factors, then the improvement in performance should be the same in both groups, hence generating additive effects. By contrast, if one posits that the initial biases in favor of some chunks are exploited by associative mechanisms following the dynamic organization outlined above, then there should be a positive interaction: Positively biased words should be learned better than negatively biased words. Results clearly supported the second hypothesis (Figure 1, the four black bars in the top panel). The other three words of the language were unbiased, which means that, on average, the IWL of these words and the IWL of the part-words generated by the concatenation of these words did not differ. The unbiased words, although identical in their IWL and their statistical properties, were learned more quickly in Group IWL+ than in Group IWL− (Figure 1, the four grey bars in the top panel), attesting that participants were able to exploit their growing knowledge of the biased words to guide the discovery of unbiased words.
Figure 1. Proportion of correct responses in the initial and final tests, as a function of Group, for biased and unbiased items. IWL stands for Initial Word Likeness and designates the probability that a new sound sequence will be considered a word of the language due to its intrinsic properties, before any training. The biased items were positively biased in Group IWL+ and negatively biased in Group IWL−. Unbiased items were identical for the two groups. The top panel shows the data collected from adult humans, and the bottom panel shows the simulations with PARSER (adapted from Perruchet and Tillmann 2010).
To sum up, Perruchet and Tillmann's (2010) paradigm made it possible to investigate the joint influences of three factors on the discovery of new word-like units in a continuous artificial speech stream: the statistical structure of the ongoing input, the initial word likeness of parts of the speech flow, and the contextual information provided by the earlier emergence of other word-like units. Results showed that these sources of information have strong and interactive influences on word discovery, as clearly anticipated in our framework. To assess the quality of fit in a quantitative way, the data were simulated with PARSER (Perruchet and Vinter 1998). PARSER was introduced above to illustrate how interference makes learners sensitive to statistics, but one crucial aspect was passed over in silence: The model implements the principle that perception is dynamically guided by the emerging representational units. Although the role of acoustical factors was not implemented in the original version, adding it is straightforward. In the simulations, the initial selection of the candidate units was biased in such a way that, instead of being randomly drawn within a given length range, the candidate units were selected (within the same range) as a function of their relative IWL, which had previously been assessed in human participants. When the standard parameters used in Perruchet and Vinter (1998) were applied, ceiling effects were observed. When forgetting was progressively increased until ceiling effects disappeared in all conditions, the pattern of results obtained with PARSER reproduced the main effects observed for human participants, as reported in the bottom panel of Figure 1. Learning of biased items was better for Group IWL+ than for Group IWL−, even though taking performance in the first test as a baseline controlled for the direct effect of IWL on word/part-word selection, and this difference transferred to the unbiased items. The fact that PARSER was successful in generating the pattern observed in human participants without any ad-hoc algorithmic changes is worthy of note.
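To make the mechanisms discussed in this section easier to picture, here is a deliberately stripped-down sketch in the spirit of PARSER, written for this illustration rather than taken from the published model: the chunk-size range, the numerical parameters, the IWL-weighted choice of candidate units, and the invented syllable stream are all our own assumptions, and the published model's dynamic guidance of perception by already-formed units is omitted.

```python
import random

def parser_like(stream, iwl, steps=3000, gain=1.0, decay=0.05, interference=0.005):
    """stream: list of syllables; iwl: relative word likeness of candidate units."""
    lexicon = {}                                      # candidate unit -> weight
    pos = 0
    for _ in range(steps):
        if pos > len(stream) - 4:
            pos = 0
        # Candidate units of 1-3 syllables starting at pos, sampled by relative IWL.
        candidates = [tuple(stream[pos:pos + n]) for n in (1, 2, 3)]
        weights = [iwl.get(c, 1.0) for c in candidates]
        percept = random.choices(candidates, weights=weights, k=1)[0]
        pos += len(percept)
        lexicon[percept] = lexicon.get(percept, 0.0) + gain      # strengthening
        for unit in list(lexicon):
            if unit == percept:
                continue
            lexicon[unit] -= decay                               # forgetting by decay
            if set(unit) & set(percept):
                lexicon[unit] -= interference                    # forgetting by interference
            if lexicon[unit] <= 0:
                del lexicon[unit]                                # unit is forgotten
    return sorted(lexicon, key=lexicon.get, reverse=True)[:6]

words = [("ga", "ti", "bu"), ("fo", "la", "mi"), ("nu", "re", "ko")]
stream = [syl for _ in range(400) for syl in random.choice(words)]
print(parser_like(stream, {("ga", "ti", "bu"): 3.0}))   # invented IWL bias for one word
```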

6. Discussion

The starting point of this chapter was the conceptual indeterminacy of the notion of "statistical computation", which is usually invoked to account for the human sensitivity to the statistical structure of the environment, including language. We showed that behavioral sensitivity to statistical structure was evidenced long ago in the context of research on conditioning in animals, and was interpreted as the by-product of basic associative learning principles. At the core of these interpretations is the ubiquitous necessity that two events be perceived within a single attentional chunk in the first place in order to be created as a long-lasting representational unit, a phenomenon already described as the "principle of belonging" by Thorndike (1932). We argued that putting associative learning principles as the primary cause of the behavioral sensitivity to statistical
structures has the unique advantage of also accounting for the action of the other factors that have been shown to be influential in word segmentation, especially the presence of acoustical cues and contextual information. Finally, we have shown that an associative framework, when complemented with the idea that even a fleeting, short-lived representation of a set of features as a unit shapes further perception, opens the way to a dynamic interpretation of word segmentation, in which the contributing factors exhibit overadditive interactions. This view has received some experimental and computational support (Perruchet and Tillmann 2010).

Although our approach is critical of the prevalent account of statistical learning, it is worth noting that we have no reservations with regard to the current empirical studies in the domain and, moreover, we fully endorse the view that statistical learning plays a much more substantial role in language acquisition than once thought. In this regard, our approach stands in direct opposition to another critical vision, which emphasizes the need to supplement statistical learning mechanisms with symbolic machinery (e.g., Endress, Dehaene-Lambertz, and Mehler 2007; Endress and Mehler 2009; Endress, Nespor, and Mehler 2009). In this view, the action of any perceptual factor modulating the end result of statistical computations is attributed to the action of specialized detectors. For instance, observing that components located at the boundaries of a sequence are processed more efficiently than components located in a middle position, the authors infer the action of an "edge detector" coding this information in a symbolic format. This symbolic information is then sent to other systems, including mechanisms devoted to statistical computations, and acts as a constraint: Any event situated at a boundary is taken as potentially more significant than other events. To take an analogy from physics, it is as if, to account for the fact that ice melts at about 0 degrees Celsius, a "heat detector" continuously monitored the temperature of the water, stored this information in a propositional format, and, when the temperature warmed above the target value, sent to another system in charge of the melting task a message such as "The melting point has been reached". This account is obviously nonsensical in the physical domain, but half a century of prevalence of the cognitivist, information-processing view means that what we construe as its counterpart in the cognitive sciences is not always perceived as such. Instead of adding a (symbolic) layer to the explanatory sketch, our approach amounts to the withdrawal of a recently proposed notion, namely the idea that the mind performs statistical computations. When the principle that the formation of a cognitive unit requires the initial
processing of its components into a single attentional percept has been laid down, the action of distributional factors, as well as that of other variables guiding attention – such as edges – follows directly through functional couplings. The question of whether the principles exploited here for the word segmentation issue can be extended to account for other aspects of language acquisition lies beyond the scope of this chapter. Such an extension could connect with other approaches relying on similar principles, such as the emergentist theory developed by MacWhinney (e.g., MacWhinney 2010). It has also been proposed as part of a general model of the mind framed around the concept of self-organizing consciousness (e.g., Perruchet 2005; Perruchet and Vinter 2002).

References

Anderson, John R. and Lael J. Schooler (1991) Reflections of the environment in memory. Psychological Science, 2: 396–408.
Aslin, Richard N., Jenny R. Saffran and Elissa L. Newport (1998) Computation of conditional probability statistics by 8-month-old infants. Psychological Science, 9: 321–324.
Bailey, Todd M. and Ulrike Hahn (2001) Determinants of wordlikeness: Phonotactics or lexical neighborhoods? Journal of Memory and Language, 44: 568–591.
Baker, Christopher I., Carl R. Olson and Marlene Behrmann (2004) Role of attention and perceptual grouping in visual statistical learning. Psychological Science, 15: 460–466.
Berent, Iris and Tracy Lennertz (2010) Universal constraints on the sound structure of language: Phonological or acoustic? Journal of Experimental Psychology: Human Perception and Performance, 36: 212–223.
Bortfeld, Heather, James L. Morgan, Roberta M. Golinkoff and Karen Rathbun (2005) Mommy and Me. Psychological Science, 16: 298–304.
Chomsky, Noam (1959) A review of B. F. Skinner's Verbal Behavior. Language, 35: 26–58.
Christiansen, Morten H., Joseph Allen and Mark S. Seidenberg (1998) Learning to segment speech using multiple cues: A connectionist model. Language and Cognitive Processes, 13: 221–268.
Creel, Sarah C., Michael K. Tanenhaus and Richard N. Aslin (2006) Consequences of lexical stress on learning an artificial lexicon. Journal of Experimental Psychology: Learning, Memory, and Cognition, 32: 15–32.
Cunillera, Toni, Estela Camara, Matti Laine and Antoni Rodriguez-Fornells (2010) Words as anchors: Known words facilitate statistical learning. Experimental Psychology, 57: 134–141.
Curtin, Suzanne, Toben H. Mintz and Morten H. Christiansen (2005) Stress changes the representational landscape: Evidence from word segmentation. Cognition, 96: 233–262.
Dahan, Delphine and Michael R. Brent (1999) On the discovery of novel wordlike units from utterances: An artificial-language study with implications for native-language acquisition. Journal of Experimental Psychology: General, 128: 165–185.
Endress, Ansgar D., Ghislaine Dehaene-Lambertz and Jacques Mehler (2007) Perceptual constraints and the learnability of simple grammars. Cognition, 105(3): 577–614.
Endress, Ansgar D. and Mark D. Hauser (2010) Word segmentation with a universal prosodic mechanism. Cognitive Psychology, 61: 177–199.
Endress, Ansgar D. and Jacques Mehler (2009) The surprising power of statistical learning: When fragment knowledge leads to false memories of unheard words. Journal of Memory and Language, 60: 351–367.
Endress, Ansgar D., Marina Nespor and Jacques Mehler (2009) Perceptual and memory constraints on language acquisition. Trends in Cognitive Sciences, 13: 348–353.
Gómez, Rebecca (2007) Statistical learning in infant language development. In M. G. Gaskell (Ed.), The Oxford handbook of psycholinguistics. New York: Oxford University Press.
Hockema, Steve A. (2006) Finding words in speech: An investigation of American English. Language Learning and Development, 2: 119–146.
Hsiao, Andrew T. and Arthur S. Reber (1998) The role of attention in implicit sequence learning: Exploring the limits of the cognitive unconscious. In Michael Stadler and Peter Frensch (Eds.), Handbook of implicit learning (pp. 471–494). Thousand Oaks, CA: Sage Publications.
Hunt, Ruskin H. and Richard N. Aslin (2001) Statistical learning in a serial reaction time task: Access to separable statistical cues by individual learners. Journal of Experimental Psychology: General, 130: 658–680.
Logan, Gordon D. (1988) Toward an instance theory of automatization. Psychological Review, 95: 492–527.
Mackintosh, Nicholas J. (1975) A theory of attention: Variations in the associability of stimuli with reinforcement. Psychological Review, 82: 276–298.
MacWhinney, Brian (2010) Language development. In M. Bornstein and M. Lamb (Eds.), Developmental science: An advanced textbook (pp. 467–508). New York: Psychology Press.
Onnis, Luca, Padraic Monaghan, Nick Chater and Korin Richmond (2005) Phonology impacts segmentation in online speech processing. Journal of Memory and Language, 53: 225–237.
Pacton, Sébastien and Pierre Perruchet (2008) An attention-based associative account of adjacent and nonadjacent dependency learning. Journal of Experimental Psychology: Learning, Memory, and Cognition, 34: 80–96.
Pearce, John M. and Geoffrey Hall (1980) A model for Pavlovian learning: Variations in the effectiveness of conditioned but not unconditioned stimuli. Psychological Review, 87: 532–552.
Pelucchi, Bruna, Jessica F. Hay and Jenny R. Saffran (2009) Learning in reverse: Eight-month-old infants track backwards transitional probabilities. Cognition, 113: 244–247.
Perruchet, Pierre and Stéphane Desaulty (2008) A role for backward transitional probabilities in word segmentation? Memory & Cognition, 36: 1299–1305.
Perruchet, Pierre (2005) Statistical approaches to language acquisition and the self-organizing consciousness: A reversal of perspective. Psychological Research, 69: 316–329.
Perruchet, Pierre and Ronald Peereman (2004) The exploitation of distributional information in syllable processing. Journal of Neurolinguistics, 17: 97–119.
Perruchet, Pierre and Annie Vinter (1998) PARSER: A model for word segmentation. Journal of Memory and Language, 39: 246–263.
Perruchet, Pierre and Annie Vinter (2002) The self-organized consciousness. Behavioral and Brain Sciences, 25: 297–388.
Perruchet, Pierre and Barbara Tillmann (2010) Exploiting multiple sources of information in learning an artificial language: Human data and modeling. Cognitive Science, 34: 255–285.
Pinker, Steven (1984) Language learnability and language development. Cambridge, MA: Harvard University Press.
Postman, Leo (1962) Rewards and punishments in human learning. In L. Postman (Ed.), Psychology in the making: Histories of selected research problems (pp. 331–401). New York: Knopf.
Rescorla, Robert A. (1968) Probability of shock in the presence and absence of CS in fear conditioning. Journal of Comparative and Physiological Psychology, 66: 1–5.
Saffran, Jenny R. (2002) Constraints on statistical language learning. Journal of Memory and Language, 47: 172–196.
Saffran, Jenny R., Elissa L. Newport and Richard N. Aslin (1996) Word segmentation: The role of distributional cues. Journal of Memory and Language, 35: 606–621.
Scholl, Brian J. (2001) Objects and attention: The state of the art. Cognition, 80: 1–46.
Seidenberg, Mark S. and Maryellen C. MacDonald (1999) A probabilistic constraints approach to language acquisition and processing. Cognitive Science, 23: 569–588.
Seidl, Amanda and Elisabeth Johnson (2006) Infant word segmentation revisited: Edge alignment facilitates target extraction. Developmental Science, 9: 565–573.
Shukla, Mohinish, Marina Nespor and Jacques Mehler (2007) An interaction between prosody and statistics in the segmentation of fluent speech. Cognitive Psychology, 54: 1–32.
Swingley, Daniel (1999) Conditional probability and word discovery: A corpus analysis of speech to infants. In M. Hahn and S. C. Stoness (Eds.), Proceedings of the 21st annual conference of the Cognitive Science Society (pp. 724–729). Mahwah, NJ: LEA.
Swingley, Daniel (2005) Statistical clustering and the contents of the infant vocabulary. Cognitive Psychology, 50: 86–132.
Thiessen, Erik D. and Jenny R. Saffran (2003) When cues collide: Use of stress and statistical cues to word boundaries by 7- to 9-month-old infants. Developmental Psychology, 39: 706–716.
Thiessen, Erik D. and Jenny R. Saffran (2007) Learning to learn: Infants' acquisition of stress-based strategies for word segmentation. Language Learning and Development, 3: 73–100.
Thorndike, Edward L. (1932) The fundamentals of learning. New York: Teachers College, Columbia University.
Toro, Juan M., Scott Sinnett and Salvador Soto-Faraco (2005) Speech segmentation by statistical learning depends on attention. Cognition, 97: B25–B34.
Wu, Rachel and Natasha Z. Kirkham (2010) No two cues are alike: Depth of learning during infancy is dependent on what orients attention. Journal of Experimental Child Psychology, 107: 118–136.
Yang, Charles D. (2004) Universal Grammar, statistics or both? Trends in Cognitive Sciences, 8: 451–456.
Yoshida, Kazuhiro, John R. Iversen, Aniruddh D. Patel, Reiko Mazuka, Hiromi Nito, Judith Gervain and Janet Werker (2010) The development of perceptual grouping biases in infancy: A Japanese-English cross-linguistic study. Cognition, 115: 356–361.
Zemel, Richard S., Marlene Behrmann, Michael C. Mozer and Daphne Bavelier (2002) Experience-dependent perceptual grouping and object-based attention. Journal of Experimental Psychology: Human Perception and Performance, 28: 202–217.

The road to word class acquisition is paved with distributional and sound cues

Michelle Sandoval, Kalim Gonzales and Rebecca Gómez

1. Introduction

The ability to classify words into grammatical categories represents a major milestone in language acquisition. Once children have grouped words into separate grammatical categories, they can begin to generate novel utterances that conform to the morphosyntactic patterns in their language. One of the clearest signs that children have begun to generate their own grammar-conforming utterances is when they start applying regular patterns (or rules) to irregular words. For example, children commonly apply the regular plural -s and past tense -ed to irregular nouns and verbs, not just to regular ones. They can thus be observed producing words like mans and hitted, forms that are not the result of mimicking adults. Importantly, we do not hear these same children saying things like animaled. Instead, they apply inflections to their appropriate word classes, providing further evidence that these children distinguish lexical categories. Of course, it is possible that children begin categorizing words into word classes long before they begin overregularizing them, and possibly before they begin producing full grammatical phrases. The learning mechanisms available to children for classifying words into categories are the topic of this chapter. In the ensuing sections we consider distributional, phonological and prosodic cues to word class (specifically nouns and verbs) and their usefulness for learners, how learners might bootstrap from correlated cues to category structure, and the problem of scaling up from purely form-based categories to lexical classes.1

1. Although the proposals below should in principle apply to words in a variety of classes, most of the research has focused on nouns and verbs. Therefore we limit our review to these word classes.


Throughout we consider the relative contributions of distributional and correlated cues to learning, as well as the extent to which young learners are able to capitalize on these cues.

2. Distributional cues to word classes

Researchers have long sought to determine whether distributional patterns signal grammatical class independent of semantic cues. In one of the earliest accounts of distributional learning, Maratsos and Chalkley (1980) proposed that language learners could identify the grammatical category of a word based purely on its lexical context. For example, nouns could be categorized apart from verbs because only nouns are directly preceded by articles like the. However, critics have challenged this view on the grounds that a distributional learner would be thrown off by homophones belonging to multiple word classes (Pinker 1987). According to Pinker's argument, hearing a homophone in the same sentential context as another word would lead a learner to categorize these words together and infer that both words share other sentential contexts. For example, a distributional learner who observed fish and oysters in sentences 1a and 1b, and then observed fish in a verb context (e.g., sentence 1c), would infer that oysters could also appear in a verb context (e.g., sentence 1d).

(1) a. He eats fish
    b. He eats oysters
    c. He can fish
    d. *He can oysters

Skeptical of this and other similar arguments, a number of investigators have sought to determine how closely various distributional analyses could, when based on a large corpus of input, approximate actual word categories. These analyses have distinguished nouns and verbs with a surprisingly high level of accuracy, both when performed over corpora of adult-directed speech (Brill 1991; Finch and Chater 1992, 1994; Schütze 1993) and over corpora of child-directed speech (Chemla, Mintz, Bernal, and Christophe 2009; Mintz 2003; Mintz, Newport, and Bever 1995, 2002; Mintz 1996; Cartwright and Brent 1997; Redington, Chater, and Finch 1998). For instance, corpus analyses of child-directed speech by Mintz and colleagues (Mintz 2003; Chemla et al. 2009) have focused on the usefulness of non-adjacent dependencies for classifying word classes. Specifically, Mintz et al. have focused on frequent frames. Frequent frames are "ordered pairs of words that frequently co-occur with exactly one word position
intervening" (Mintz 2003, p. 93). An example of a frequent frame is you-X-it, where X represents the intervening word position. Analyses based on these frames ask how successfully the words in a corpus can be separated into lexical categories using only information about which frames each word occupies. Mintz (2003) used this analysis procedure to categorize words in 6 corpora of speech to children under 2.6 years of age, focusing on the 45 most frequent frames in each corpus. There were two gauges of the analyses' success: accuracy and completeness. To compute accuracy, word tokens from the same frame were paired exhaustively. Each pair of word tokens was then classified as a Hit or a False Alarm, depending on whether the two tokens were from the same grammatical class (noun, verb, etc.). Accuracy was computed by dividing the number of Hits by the number of Hits plus False Alarms, with a resulting value of 1 representing the highest level of accuracy. Completeness represented the extent to which any given pair of words from the same grammatical class was grouped together. Completeness was computed by dividing the number of Hits by the number of Hits plus Misses, with Misses being the number of times any two categorized words from the same grammatical class failed to appear together in one or more frame-based categories (range 0–1). Results of the accuracy measure were the most striking: the average level across the 6 corpora was nearly perfect (.98). Further, this level was significantly higher than that for a group of control categories (.46), formed by randomly interchanging words across the frame-based categories. In contrast to accuracy, the average level of completeness was only .07, although a comparison to the control categories (at .03) yielded a significant difference. Together, the results from these two measures suggest that although words from the same category do not always occupy the same frame, words occupying the same frame are almost always from the same category, providing some information to the child. Thus, it appears that a frame-based analysis could be useful to young children for accurately grouping together at least some of the words in their input.
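The two measures just described can be made concrete with a short sketch (ours, not Mintz's procedure); it works over word types rather than the token pairing used in the original analysis, and the toy frames and gold-standard category labels are invented.

```python
from itertools import combinations

def accuracy_completeness(frame_categories, gold):
    """frame_categories: frame -> words it groups; gold: word -> true category."""
    hits = false_alarms = 0
    grouped = set()                                   # same-class pairs grouped by a frame
    for words in frame_categories.values():
        for w1, w2 in combinations(words, 2):
            if gold[w1] == gold[w2]:
                hits += 1
                grouped.add(frozenset((w1, w2)))
            else:
                false_alarms += 1
    categorized = {w for ws in frame_categories.values() for w in ws}
    misses = sum(1 for w1, w2 in combinations(sorted(categorized), 2)
                 if gold[w1] == gold[w2] and frozenset((w1, w2)) not in grouped)
    accuracy = hits / (hits + false_alarms) if hits + false_alarms else 0.0
    completeness = hits / (hits + misses) if hits + misses else 0.0
    return accuracy, completeness

frames = {("you", "it"): ["put", "want", "see"], ("the", "is"): ["dog", "put"]}
gold = {"put": "V", "want": "V", "see": "V", "dog": "N"}
print(accuracy_completeness(frames, gold))            # (0.75, 1.0)
```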


This frame-based analysis makes two important assumptions about young children's linguistic capabilities that are supported by behavioral data. First, a vast majority of frames consist of closed class words, such as a and the; therefore, young children could quickly identify frames if they were sensitive to this class of words. (We will see ample evidence that infants can detect closed class words from birth and have expectations about their distributional contexts by 10.5 months [Shi, Werker, and Morgan 1999; Shady 1996]; see Section 3.) Second, frames require children to be able to track nonadjacent dependencies between words. Behavioral evidence for this ability comes from Santelmann and Jusczyk (1998), who investigated infants' ability to track nonadjacent dependencies between a closed class word and a bound morpheme. Their results indicated that 18-month-olds could distinguish between passages containing is and -ing (...everyone is baking...) on the one hand and passages containing violations of those dependencies (...everyone can baking...) on the other. More recent results have shown sensitivity to nonadjacent dependencies between closed class words and bound morphemes in infants as young as 16 months of age (Soderstrom, White, Conwell, and Morgan 2007; see also Van Heugten and Johnson 2010; Van Heugten and Shi 2010). And, in the artificial language learning literature, Gómez (Gómez 2002; Gómez and Maye 2005) has reported that 15- and 18-month-olds can attend to nonadjacent co-occurrences when the middle words in 3-element strings occur with high variability. Lany and Gómez (2008) have further shown non-adjacent dependency learning in younger, 12-month-old females after prior exposure to bigrams (e.g., aX) that would later occur in a nonadjacent relationship (e.g., acX). Thus, there is evidence that infants can learn frequent frames, but is there evidence for categorization? Mintz (2006) has reported results using natural language showing that, by 12 months, infants are able to categorize on the basis of distributional information. However, even though infants are presented with frames, it is unclear whether infants are categorizing on the basis of frames or bigrams. St. Clair, Monaghan, and Christiansen (2010) report results from a computational model showing that flexible frames – frames formed by combining the bigram components of the traditional Mintz frame (i.e., aXb → aX + Xb) – offer higher accuracy and better coverage of corpora similar to those analyzed by Mintz (2003).2 St. Clair et al. trained connectionist models – an aXb fixed frame model and an aX + Xb flexible frame model – to learn the mapping between the distributional information and the framed lexical category.3

2. The terms accuracy and coverage are analogous to the accuracy and completeness measures in Mintz (2003); however, St. Clair et al. (2010) combined the two measures in order to determine how well the models could predict the category of each word in a given corpus.
3. Two bigram models (aX and Xb) were also trained in St. Clair et al. (2010) but were not as successful as flexible frames.
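Before the model architecture is described further, the following sketch (our own illustration, not St. Clair et al.'s connectionist model) may help to visualize the two kinds of distributional context being compared: a word's fixed-frame contexts (the two surrounding words taken jointly) versus its flexible-frame contexts (the preceding and following words taken separately). The toy corpus is invented.

```python
from collections import defaultdict

def frame_contexts(tokens):
    """Collect fixed (aXb) and flexible (aX, Xb) distributional contexts."""
    fixed, flexible = defaultdict(set), defaultdict(set)
    for a, x, b in zip(tokens, tokens[1:], tokens[2:]):
        fixed[x].add((a, b))           # aXb: both neighbours jointly define the context
        flexible[x].add(("left", a))   # aX: preceding word alone
        flexible[x].add(("right", b))  # Xb: following word alone
    return fixed, flexible

corpus = "you put it down you want it now you see it".split()
fixed, flexible = frame_contexts(corpus)
print(fixed["put"])      # {('you', 'it')}
print(flexible["put"])   # {('left', 'you'), ('right', 'it')}
```

In this toy corpus, put, want, and see share the fixed frame you_it, while the flexible parts would also let a word be grouped with them when only one side of the frame recurs.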


The three layers of the models included: 1) input units, which were the frames (a_b) for the fixed frame model or, for the flexible frames model, the preceding or succeeding words of the bigrams (a_ or _b); 2) the hidden layer; and 3) output units, which were the grammatical categories present in the corpora. The input to both models included all the frames in the given corpus (6 corpora were analyzed).4 At both early and late stages of training, the flexible frames model led to greater accuracy in classifying all the nouns and verbs in each corpus than the fixed frames model did. Thus, flexible frames seem to combine the high coverage and low accuracy of bigrams (Monaghan, Chater, and Christiansen 2005) with the low coverage and high accuracy of frames (Mintz 2003) into a high-coverage and high-accuracy distributional environment.

In sum, corpus analyses suggest that distributional information can be used to categorize lexical items and that infants are clearly capable of learning frequent frames; however, the type of distributional information employed by infants for acquiring grammatical categories is still unknown. In fact, it is likely that bigram and frame-based information are employed at different developmental stages (see Gómez and Maye 2005 and St. Clair et al. 2010 for similar accounts). This is supported by the artificial language learning literature showing that, by 12 months, infants readily track bigram information (Gómez and Lakusta 2004) but cannot readily learn non-adjacent dependencies (Gómez and Maye 2005; Lany and Gómez 2008).

3. Sound cues that distinguish word classes

We have seen evidence that distributional cues can be used to categorize nouns and verbs, but a greater proportion of words could be classified if children had another source of information to help them distinguish categories (functional and lexical categories, as well as lexical categories from each other).5 In a review of the literature on cues that distinguish word classes, Monaghan and Christiansen (2008) list 22 phonological and prosodic characteristics found to be useful for English (see Table 1).

4. Analyses with the 45 most frequent frames as input led to similar results.
5. Indeed, corpus analyses suggest that phonological characteristics and distributional characteristics may be differentially useful for different types of words. For instance, phonological characteristics may be more helpful in categorizing verbs and in distinguishing closed class from open class words, whereas distributional information may be more helpful when categorizing nouns (Christiansen and Monaghan 2006; Monaghan et al. 2007).
Table 1. Sound cues that distinguish word classes in English (adapted from Monaghan and Christiansen 2008). Each cue is followed by the word class distinction it signals and the supporting references.

How long is the word in terms of the number of phonemes? Nouns > Verbs; Open > Closed (Kelly 1992; Morgan, Shi, and Allopenna 1996)
How long is the word in terms of the number of syllables? Nouns > Verbs; Open > Closed (Kelly 1992; Morgan et al. 1996; Cassidy and Kelly 1991)
What proportion of the syllables contain schwa or a syllabic consonant? Open < Closed (Cutler 1993)
Does the first syllable contain schwa? Open < Closed (Cutler 1993; Cutler and Carter 1987)
Does the word receive lexical stress? Open > Closed (Gleitman and Wanner 1982)
Does the first syllable receive stress? Nouns > Verbs (Kelly and Bock 1988)
Is the stressed vowel more likely to be a front vowel? Nouns < Verbs (Sereno and Jongman 1990)
Are the vowels of the word more likely to be front vowels? Nouns < Verbs (Monaghan, Chater, and Christiansen 2005)
Are the vowels of the word more likely to be high vowels? Nouns < Verbs (Monaghan, Chater, and Christiansen 2005)
Does the word end in a voiced consonant? Nouns > Verbs (Kelly 1992)
Does the word begin with /ð/? Open < Closed (Campbell and Besner 1981)
What proportion of the consonants are coronals? Open < Closed (Morgan et al. 1996)
What proportion of the consonants are nasals? Nouns > Verbs (Kelly 1992)
Are plosives more likely to occur in the word? Open > Closed (Monaghan, Christiansen, and Chater 2007)
Are fricatives more likely to occur in the word? Open < Closed (Monaghan, Christiansen, and Chater 2007)
Are dentals more likely to occur in the word? Open < Closed (Monaghan, Christiansen, and Chater 2007)
Are velars more likely to occur in the word? Nouns < Verbs; Open > Closed (Monaghan, Christiansen, and Chater 2007)
Are bilabials more likely to occur in the word onset? Nouns > Verbs (Monaghan et al. 2007)
Are approximants more likely to occur in the word onset? Nouns < Verbs; Open > Closed (Monaghan et al. 2007)
How many consonants are present in the word's onset? Open > Closed (Shi, Morgan, and Allopenna 1998)
How many consonants are present per syllable? Open > Closed (Morgan et al. 1996)
Does the word end in /d/ or /d/? Adjective > Other category (Marchand 1969)


Although these cues are listed as characteristics of English words, a subset of them apply to other languages as well, such as Dutch (Durieux and Gillis 2001), French and Japanese (Monaghan et al. 2007), and Mandarin and Turkish (Shi et al. 1998). Moreover, other prosodic cues distinguishing nouns from verbs have been found in infant-directed speech in English (Conwell and Morgan 2008), French (Shi and Moisan 2008), and Mandarin (Li, Shi, and Hua 2010) (e.g., duration of words, syllables, and vowels, and frequency changes within words). Given these differences, what is the evidence that children use such information?

Several experimental studies have shown that adults and children are sensitive to phonological cues individually and in combination. Adults and 3- to 6-year-old children can classify a novel word as a noun or a verb based solely on syllable number (Cassidy and Kelly 1991; although note that other cues were present and uncontrolled). Additionally, Fitneva, Christiansen, and Monaghan (2009) have demonstrated that English-speaking seven-year-olds in a French immersion program show sensitivity to French phonological typicality in the early stages of word learning, after approximately 2 years of exposure to the language, suggesting that word classification using phonological typicality may be an ability that develops relatively quickly. Younger learners also show sensitivity to phonetic and prosodic information. Newborns discriminate open and closed class words, showing that, even at birth, infants are sensitive to the numerous cues distinguishing these classes (Shi et al. 1999). By 6 months, infants show a clear preference for listening to open class over closed class words, dishabituating to closed class items but not to open class ones (Shi and Werker 2001, 2003). Höhle and Weissenborn (2003) have shown that German-learning 7.5-month-olds, but not 6-month-olds, recognize closed class elements they have been familiarized with when these words occur in continuous speech. By 10.5 months, infants have established a full representation of frequent closed class words, but do not reliably show knowledge of the distributional contexts of these words (Shady 1996). In Shady's experiments, 10.5-month-olds recognized when the words was, is, the, a, of, with, and that were replaced with phonologically similar nonsense syllables. Evidence of sensitivity to the phonological properties of closed class elements has been replicated in an electrophysiological study showing that 11-month-olds, but not 10-month-olds, recognize when function morphemes have been replaced with phonologically dissimilar morphemes (Shafer, Shucard, Shucard and Gerken 1998). However, 10.5-month-olds did not notice when the locations of closed class words were swapped such that the words occurred in an incorrect
grammatical position (i.e., Has man this bought two cakes versus This man has bought two cakes), although they do begin to show evidence of such knowledge by 16 months (Shady 1996). Consistent with this finding, German-exposed infants are able to detect when a novel word is placed in the incorrect syntactic position in continuous speech (i.e., a noun in a verb position) by 14 to 16 months of age if, during familiarization, they have heard the novel word follow a determiner (Höhle, Weissenborn, Kiefer, Schulz, and Schmitz 2004). Thus, the studies by Shady (1996) and Höhle et al. (2004) suggest that by 16 months, infants not only recognize closed class elements in speech, but they also recognize and learn from the distributional contexts of these closed class elements – a prime example of the integration of sound cues and distributional cues in natural language. Conwell and Morgan (in press) furthermore show that 13-month-olds can discriminate noun uses from verb uses of the same word by noting the length and pitch-change differences that are characteristic of these words in child-directed speech. Critically, however, being able to distinguish functional and lexical categories by using phonological information does not tell us whether this type of information is being used for the formation of these categories. For this type of evidence, we turn to the artificial language literature investigating whether infants can use phonological cues in conjunction with distributional or morphological information to form functional-like and lexical-like categories (Gómez and Lakusta 2004; Gerken, Wilson, and Lewis 2005).

4. From correlated cues to categories

In reality, word class acquisition cannot be purely distributional, phonological, morphological, or semantic. Rather, the solution must involve some integration of these sources of evidence, as all of these aspects are part of the end state of the category. And although these cues are probabilistic in nature, there exist multiple cues that can be associated with specific word classes. The beauty of multiple correlated cues is apparent in situations where one cue is absent yet other cues are present to suggest a classification. We have reviewed evidence that learners are sensitive to phonological and distributional information, but how do they exploit these cues for categorization and how does the existence of multiple cues aid categorization?6

4. From correlated cues to categories In reality, word class acquisition cannot be purely distributional, phonological, morphological, or semantic. Rather the solution must involve some integration of these sources of evidence, as all of these aspects are part of the end state of the category. And although these cues are probabilistic in nature, there exist multiple cues that can be associated with specific word classes. The beauty of multiple correlated cues is apparent in situations where one cue is absent yet other cues are present to suggest a classification. We have reviewed evidence that learners are sensitive to phonological and distributional information, but how do they exploit these cues for categorization and how does the existence of multiple cues aid categorization?6 6. We have reviewed evidence for the existence of distributional and phonological cues in speech, but semantic and morphological cues clearly play an important role in acquiring word classes.


Braine (1987) conducted one of the first studies examining correlated cues and their role in word class acquisition. This research was inspired by findings showing that learners have difficulty learning categories when provided with distributional information only, namely bigram information (Braine 1966; Smith 1966, 1969). In these early studies, adults were provided with the bigram patterns MN and PQ. The M category occurred before Ns, while P items preceded Qs. Each category was made up of letters, and each bigram of letter pairs. During training, learners heard repetitions of these pairs, with some pairs withheld for test. Participants were tested on their knowledge of 1) training pairs, 2) the withheld grammatical pairs, 3) unheard ungrammatical pairs that retained word-class position (e.g., MQ and PN), and 4) random unheard pairs that violated position (e.g., NP). Learners were equally likely to accept stimulus types 1–3, preferring all of them over type 4, demonstrating that they had learned the positional information (e.g., M letters occur in first position) but not the co-occurrence information (e.g., M letters occur before N letters, not Q letters). Braine (1987) hypothesized that this failed learning resulted from a lack of phonological or semantic correlates that could be used to distinguish the two sets of co-occurrences. He suspected that adults were learning a grammar of the form XY, where the X letters were the M and P sets and the Y letters were the N and Q sets, because there was not enough information for them to learn that these co-occurrences came from two distinct categories. To investigate this hypothesis, Braine used nonsense words instead of letters and added a semantic cue by providing referential information for a subset of the M and P words. He speculated that the semantic cue would lead learners to distinguish the two sets and begin to learn the correct adjacent dependencies. In his artificial language, adults heard N and Q phrases indicating one, two, or more than two nouns while seeing pictures, with half of the M words paired with pictures of men and half of the P words paired with pictures of women (gender was the semantic referential cue). The rest of the M and P words referred to inanimate objects. There were two words each for one, two, and more than two, for a total of six number words, such that each gender category had its own word for the number of nouns in the scene. The control group received the same training except that the male/female distinction was not correlated with the M and P word distinction (i.e., M words were not exclusively male in that they could refer to female pictures as well). As in the previous MN PQ studies, some of the MN and PQ pairings were withheld to test participants' ability to generalize to novel pictures of people and inanimate objects. Adult learners in the experimental group successfully formed categories of M, N, P and Q, showing correct MN and PQ naming for the novel pictures. Critically,
correct naming was shown in the experimental group even when participants were provided with the pictures of inanimate objects, suggesting that the correlated semantic cue made learning of the categories easier.7 Braine’s results suggest that learners are able to form categories if provided with two converging cues, in this case semantic and distributional information, even if one cue is probabilistic. The cues allow associations not just between lexical items, but between categories themselves. Braine (1987) explained this process in a step-based account where the first step requires distinguishing M and P words using a cue that is unique to each category (See Figure 1).8 Distinguishing categories plays an important role in Step 1 as learners must associate M and P category cues to the respective N and Q words. Once this occurs, learners categorize N words separately from Q words based on their patterned co-occurences with M or P words. For example, learners discover that the N word for ‘‘one’’ cooccurs with the M-feature male, whereas the Q word for ‘‘one’’ co-occurs with the P-feature female. Once the co-occurrence between number word and semantic feature (male or female) is discovered learners move on to Step 2, where they show evidence of making inferences at the category level. This is particularly important for novel M and P words that do not have features because a learner exposed to one N-word paired with featureless M (e.g. MN1) can then go on to generalize that M word in unattested MN2 and MN3 combinations (See Figure 1 below). In other words, once learners connect N words with the unique feature ‘co-occurs with male’, each word that co-occurs with an N is automatically assigned that feature whether the feature is overtly present. As an example, in Braine (1987), the referential cue was not needed during the generalization test because the categories of N and Q had been formed and the cooccurrence between MN and PQ had been established. Following Braine’s lead and the discovery of several phonological cues to gender subclasses (Karmilo¤-Smith 1978; Levy 1983; Mills 1986), Frigo and MacDonald (1998) investigated the robustness of correlated cues by testing adult learners in the MN PQ paradigm using phonological markers to distinguish M and P words and article-like N and Q words, instead of semantic correlates. Generalization was shown when participants were 7. There were a few subjects in the control group that also showed generalization. Interestingly, Braine reported that some of these subjects had formed ad hoc semantic correlates for the categories. 8. In Braine (1987) the cue is semantic, however he noted that learners could just as well use phonological or morphological information to form separate M and P categories.


Figure 1. Braine’s (1987) two-step model for word-class generalization.

Following Braine's lead and the discovery of several phonological cues to gender subclasses (Karmiloff-Smith 1978; Levy 1983; Mills 1986), Frigo and McDonald (1998) investigated the robustness of correlated cues by testing adult learners in the MN PQ paradigm using phonological markers to distinguish M and P words and article-like N and Q words, instead of semantic correlates. Generalization was shown when participants were provided with redundant, high-salience features that consistently marked at least 60% of the M and P names. In contrast, as in Braine (1987), participants failed to generalize in an unsystematic condition that did not have a consistent cue distinguishing M and P categories. Frigo and McDonald's findings also provided evidence for a step-based learning process, as learners only showed categorization (evidenced by generalization to combinations with featureless M and P words) if they also showed generalization to novel MN and PQ pairs for which M and P words had features.

In the first investigation with infants, Gómez and Lakusta (2004) used the MN PQ paradigm to investigate Step 1 learning. In this study, 12-month-olds were familiarized with strings containing two co-occurrence relationships, aX and bY (or aY and bX), where each letter represented a different word category. Each a- and b-element came from a set of two words and each X- and Y-element belonged to a set of six words.9

9. The aX bY language is very much like the MN PQ language with the exception that a- and b-elements always come from a small set and behave like closed class elements, whereas X- and Y-elements come from a relatively larger set and are the words that carry the important marker that previously distinguished M and P categories.

Furthermore, X- and Y-words were distinguishable by syllable number such that in a first condition all Xs were monosyllabic and all Ys were disyllabic.


Syllable number was a probabilistic cue in the second and third conditions, with five of the six words (83/17 condition) and four of the six words (67/33 condition) following the co-occurrence pattern consistently. The words that did not follow this pattern came from the alternate category (i.e. X and Y words were swapped). Thus, the predominant co-occurrence relationship common across the training strings was that a-elements preceded monosyllabic words and b-elements preceded disyllabic words. Gómez and Lakusta found that 12-month-olds could learn a pair of co-occurrence relationships when only 83% of training strings honored those relationships, with the remaining strings violating them. However, infants no longer demonstrated detection of the most frequent pattern in their training language when the proportion of conflicting strings was increased from 17% to 33% (i.e. from the 83/17 to the 67/33 condition). Thus, like adults, 12-month-olds can associate functor-like elements with a category cue and use this knowledge to generalize to novel marked category members, even when the distinguishing category cue is not perfectly correlated with the functor elements. Therefore, it appears that infants can accomplish Step 1 learning, but the design of this study did not allow for testing of Step 2 learning.

For evidence of Step 2 learning we look to a study conducted by Gerken, Wilson, and Lewis (2005). Gerken et al. used Russian gender markers to examine whether 17-month-olds could generalize to unheard unmarked words (Step 2 learning). Infants were provided with a set of words that had masculine endings –ya and –yem, feminine endings –oj and –u, and their respective female (/k/) and male (/tel'/) derivational inflections. During training, 3 of 5 of the words were marked with the derivational and case cues (e.g., rubashkoj, rubashku, zhitel'ya, zhitel'yem), while the rest of the words carried the case endings only (e.g., vannu, pisar'ya) (see Table 2 below). Two single-marked words from each of the gender categories were withheld during training for comparison with ungrammatical items (single-marked words with the opposing category's case ending). In two experiments infants listened longer at test to ungrammatical items over the unattested grammatical items as compared to a group that was not exposed to consistent cues to gender during training. Thus, infants were able to associate the correlated cues (/k/ to –oj and –u, /tel/ to –ya and –yem), form a generalization that words heard with –ya also occur with –yem, and apply this knowledge to words without the inflectional cue. However, it is important to note that this learning was fragile; while the infants did show a consistent learning pattern in a replication study (Exp. 3 in Gerken et al.), in the initial experiment the learning appeared in the last block of testing only (Exp. 2). Regardless, infants do appear sensitive to converging phonological cues and are able to accomplish Step 2 learning when given only 2 minutes of training.

Table 2. Adapted from Gerken, Wilson, and Lewis (2005), Exp. 2. In the original table, plain typeface words were used during training, and withheld grammatical and ungrammatical test items appeared in bold (the ungrammatical items in bold italics); here the ungrammatical test items are listed separately.

Feminine words
  Double-marked: polkoj, polku; rubashkoj, rubashku; ruchkoj, ruchku; knigoj, knigu
  Single-marked: vannoj, vannu; korovoj, korovu
  Ungrammatical: vannya, korov'yem

Masculine words
  Double-marked: uchitel'ya, uchitel'yem; stroitel'ya, stroitel'yem; zhitel'ya, zhitel'yem
  Single-marked: korn'ya, korn'yem; medved'ya, medved'yem; pisar'ya, pisar'yem
  Ungrammatical: medvedoj, pisaru

The studies presented thus far emphasize the importance of converging cues for categorization using a bigram design (i.e. MN PQ or aX bY). In contrast, adults in Mintz (2002) were provided with a frame-based artificial language – three-word phrases where the first and the last word in the phrase had a 1:1 co-occurrence relationship. Mintz's study asked whether frames provide enough information to the learner to ensure that generalization can take place. Training phrases came in four different varieties: 1) full paradigm: 3 frames with four middle words for a total of 12 sentences, 2) partial paradigm: 1 frame with three of the four middle words used in the full paradigm phrases, 3) alternate paradigm: 1 frame with three new middle words, and 4) no-paradigm: four frames with four new middle words. The critical comparison was between responses to the partial paradigm sentences (with the word withheld from training) and alternate paradigm sentences (also with the missing fourth middle word). Importantly, only the frame from the partial paradigm sentences should be grouped with the frames from the full paradigm sentences, as they share the same set of middle words. Participants rated the partial paradigm sentences as more familiar, showing that they had formed a category of the four middle words and recognized the correct environment (i.e. distribution) for that category. The results of this study raise the question of how this learning is possible when the previous studies have suggested that distributional information is not enough – that a second feature must be present for the learner to take note of the presence of a category.


There are several differences between Mintz's study and other studies examining distributionally based learning. The first is that Mintz's paradigm only involved the learning of one main category. The second is that the phrases in this study were not of the type aX bY or MN PQ; rather they were of the type aXb. What previous studies have shown is a failure to learn an aX bY type grammar when the X's and Y's are not cued with semantic or phonological information. That is, they require the learner to determine two adjacent dependencies (i.e. bigrams) when the only type of information provided is distributional, whereas Mintz requires the learner to acquire one non-adjacent dependency and provides the learner with only distributional information. Critically, the frame provides the learner with two cues to the category of the middle word insofar as both the first and the third words cue the middle word (also see St. Clair et al. 2010).

There are a few ways Mintz's distributional cues could impact learning. One possibility is that learners are still being provided with converging cues (albeit distributional) and thus Braine's steps still apply. Another possibility is that frames provide a clearer distinction between words to be categorized and words that cue categories, providing the learner a place to start. Mintz (2002) describes this as a type of figure-ground effect that leads learners to think of the middle words in terms of the frames they occur in or vice-versa. In other words, what phonological and semantic cues provide is not necessarily a cue for noticing categories "non-distributionally"; rather it is a toehold to begin learning the distributional information. Therefore the patterning of words in frames may allow the learner to distinguish one type of information from another for the learning of category co-occurrence relationships. The last possibility (see Gerken, Wilson, Gómez, and Nurmsoo 2009) assumes that learners build analogical models. By this view, learners search a database for items similar to a learning instance based on a set of potentially shared features. Items sharing features are grouped into sets called 'supracontexts' defined by two properties: Proximity, by which items from the database that share more features with the given form will appear in more supracontexts and will therefore have a greater chance of being used as an analogical model; and Gang Effects, by which, if a group of similar examples behaves alike, the probability of selecting one of these examples as an analogical model is increased. Thus proximity effects are much like Braine's distinguishing cue and gang effects are very similar to distributional environments. However, as Gerken et al. point out, there is no need for category discovery via a unique marker such as the –k ending on a subset of Russian feminine nouns. Instead, Proximity and Gang Effects both contribute to paradigm completion, raising the possibility that paradigms with relatively strong Gang Effects are discoverable even when no unique marker is present (a finding documented by Mintz 2002).


The current research does not favor one approach over another. Still, it is clear that a minimum of two cues is necessary for categorization and generalization to take place (whether distributional or phonological). It is also important to note that although Mintz showed successful categorization, more research is needed to clarify how infant learners would accept new middle words into a category using only frame information. On the same note, Mintz does not deny the role of non-distributional sources in category induction; rather, his results show that distributional environments are stronger cues than was previously thought.
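The frame-based logic discussed above can be sketched in a few lines: collect the (first word, last word) frames from three-word phrases and group the middle words licensed by each frame; frames that share middle-word sets point to a common category. The toy corpus and overlap threshold below are invented for illustration and do not reproduce Mintz's materials or analyses.

```python
from collections import defaultdict

# Toy three-word "phrases": the first and last word form a frame (A_B) and the
# middle word is the element to be categorized. The corpus is hypothetical and
# only illustrates the bookkeeping involved in a frame-based analysis.
corpus = [
    ("alt", "deech", "ush"), ("alt", "vabe", "ush"), ("alt", "skige", "ush"),
    ("ong", "deech", "erd"), ("ong", "vabe", "erd"), ("ong", "skige", "erd"),
    ("alt", "tam", "erd"),   # a phrase that does not fit either frame's paradigm
]

def middle_words_by_frame(phrases):
    """Map each (first, last) frame to the set of middle words it licenses."""
    frames = defaultdict(set)
    for first, middle, last in phrases:
        frames[(first, last)].add(middle)
    return frames

def shared_categories(frames, overlap=2):
    """Group frames whose middle-word sets overlap by at least `overlap` items."""
    items = list(frames.items())
    groups = []
    for i, (f1, s1) in enumerate(items):
        for f2, s2 in items[i + 1:]:
            if len(s1 & s2) >= overlap:
                groups.append((f1, f2, s1 | s2))
    return groups

if __name__ == "__main__":
    frames = middle_words_by_frame(corpus)
    for frame, middles in frames.items():
        print(frame, "->", sorted(middles))
    for f1, f2, category in shared_categories(frames):
        print("frames", f1, "and", f2, "share a middle-word category:", sorted(category))
```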

5. Learning of phrase structure based solely on statistical learning?

Several artificial language studies with adult, child, and infant learners suggest that learners can acquire classes of phrasal units – a step towards understanding syntactic category acquisition (Saffran 2001; Saffran, Hauser, Seibel, Kapfhamer, Tsao, and Cushman 2008; Thompson and Newport 2007; Saffran 2002). Specifically, Saffran (2001) investigated whether the acquisition of phrase structure could be influenced by phrase-internal predictive dependencies between word classes. Saffran reasoned that if her hypothesis were correct, learners might be expected to detect phrasal units based solely on their internal dependencies, independent of other cues shown to highlight phrasal units, such as prosody, functor word distributions, and concord morphology (Morgan, Meier, and Newport 1987). To test this prediction, Saffran asked whether adult participants could identify phrasal units with internal dependencies between their constituent word classes in an artificial language (adapted from Morgan and Newport 1981) in which the phrasal units were not highlighted by any other potential cues.

As shown in Figure 2, the familiarization grammar consisted of a sequence of two obligatory phrases, AP and BP, followed by an optional third phrase, CP. AP consisted of one word class, A, followed by an optional second word class, D, such that D predicted A but not vice versa. This is loosely analogous to the relationship between determiners and nouns in noun phrases. CP had an identical structure to AP except with different word classes, obligatory C and optional G. Finally, BP could consist either of a single word class, E, or of an embedded CP followed by an obligatory word class, F, which would be predicted by the C in the embedded CP. Across familiarization strings, each word class was instantiated by 2 or 4 different lexical members of that particular class.


Figure 2. Figure adapted from Saffran (2001), artificial grammar stimuli.

For example, in any given string A was instantiated by biff, hep, mib, or rud, but E was instantiated by jux or vot, yielding strings such as biff jux, biff klor lum loke, and rud pell sig pilk dupp. An experimental group was familiarized with the grammar for 30 minutes on each of two consecutive days, while a control group took part only in testing. Although multiple tests were administered, we discuss the one of most relevance here, which was intended by Saffran to most directly assess acquisition of phrasal knowledge based solely on predictive dependencies between word classes. On each trial, participants were presented with two sequences of words. Each sequence was a fragment of a string generated by the familiarization grammar. However, only one sequence was a self-contained phrase (AP, BP, or CP); the other fragment spanned a phrase boundary. Importantly, however, both fragments consisted entirely of legal bigrams and also occurred with equal frequency during familiarization, ruling out the possibility that learners made their discriminations based on bigram frequency or some other surface characteristic. Participants were instructed to indicate "which fragment seemed like a better or more coherent group or unit from the nonsense language" (Saffran 2001, p. 499). The experimental group selected BP and CP (but not AP) phrases over non-phrases significantly more often than did the control group. Importantly, the number of dependency violations at the level of word classes was a significant covariate of the experimental group's fragment endorsements (phrases had no dependency violations and non-phrases had one or more). The results were interpreted as clear support for the hypothesis that phrase-internal predictive dependencies are by themselves sufficient for revealing phrasal units.

Although Saffran (2001) makes an important contribution by showing that learners can acquire the correct ordering of phrasal units based on statistical information alone, a drawback to the study is that even though learners were tested on novel sentences, they were not required to generalize to novel category members; thus it is possible that adults learned patterns at the word level and not the category level.


These issues are also present and acknowledged in other studies examining the acquisition of phrasal structure by infants (Saffran et al. 2008) and adults (Thompson and Newport 2007). Future studies examining this problem should include generalization tests to novel category members, as such tests are critical for determining whether learners have abstracted a category from the input. Evidence of abstraction is vital if these mechanisms are being argued to be part of the development of syntactic structure – a system built of abstract categories. (We will address this further in Section 6.)
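As a concrete illustration of the predictive-dependency grammar sketched in Figure 2, the snippet below generates strings from a toy version of that phrase structure. Only the A and E items are taken from the example above; the memberships assumed for D, C, G, and F, and the probabilities of the optional elements, are invented for illustration.

```python
import random

# Toy version of the phrase structure described above:
#   S -> AP BP (CP);  AP -> A (D);  BP -> E | CP F;  CP -> C (G)
VOCAB = {
    "A": ["biff", "hep", "mib", "rud"],   # taken from the example above
    "E": ["jux", "vot"],                  # taken from the example above
    "D": ["klor", "pell"],                # assumed members, for illustration only
    "C": ["lum", "sig"],                  # assumed members, for illustration only
    "G": ["pilk", "neb"],                 # assumed members, for illustration only
    "F": ["loke", "dupp"],                # assumed members, for illustration only
}

def generate_string(rng):
    """Generate one string from the toy grammar, returning words and their classes."""
    def ap():
        return ["A"] + (["D"] if rng.random() < 0.5 else [])
    def cp():
        return ["C"] + (["G"] if rng.random() < 0.5 else [])
    def bp():
        return ["E"] if rng.random() < 0.5 else cp() + ["F"]
    classes = ap() + bp() + (cp() if rng.random() < 0.5 else [])
    return [rng.choice(VOCAB[c]) for c in classes], classes

if __name__ == "__main__":
    rng = random.Random(1)
    for _ in range(5):
        words, classes = generate_string(rng)
        print(" ".join(words), "   classes:", " ".join(classes))
```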

6. Discussion: From form-based categories to lexical categories

The studies reviewed above suggest that learners can form categories, that they learn the features of particular categories and use them to identify new category members. However, important aspects of lexical category acquisition are missing from the studies reviewed in the above sections. A lexical category is defined by the distributional and morphological environments its members occur in. When a child or infant learns a lexical category they learn 1) the structural or distributional environments of that category, 2) the semantics of the members of that category, and 3) the sound properties or morphophonemic properties of that category. The most popular lexical category accounts invoke some notion of innate knowledge and, in fact, no full account of word class acquisition has been proposed that has not posited innate knowledge of syntactic information. In order to propose such an account, several issues need to be addressed. Although corpus studies have revealed the sound properties of particular categories, we still do not know what combination of these properties is used by the infant learner in lexical category acquisition. We have evidence that there are approximately 22 phonological cues in English that can be used in word class acquisition; however, we do not have evidence that infants are sensitive to all these cues or how these cues may aid the infant in word class acquisition. More experimental work is needed to address these disparities and to test the use of converging cues. Similarly, although infants seem to be highly sensitive to distributional information in speech, it is not clear whether the form-based categories derived from studies by Gerken, Gómez, Mintz and colleagues scale up to real-world lexical category acquisition.10

10. For discussions of the scaling-up problem with regard to statistical word segmentation see Yang (2004), Pelucchi, Hay, and Saffran (2009) and Johnson and Taylor (2010).


A first step in proving that the above categorization studies show lexical category acquisition would be to show learning of the phrasal privileges of these categories, which is one of the important features of syntactic categories. We can take this first step by integrating Braine’s multiplecue approach into the statistical approach to phrase structure acquisition (Sa¤ran 2001; Sa¤ran 2002; Sa¤ran et al. 2008; Thompson and Newport 2007). Because previous studies have shown that correlated cues lead to the formation of abstract categories, incorporating multiple cues into the artificial phrase structure grammars would lead to better evidence of transitional probability computation over word class categories. Specifically this addition would allow for the inclusion of the critical generalization tests to novel category members that are the hallmark of category abstraction. Proving that learners are acquiring rules about categories and generalizing these rules to novel category members would help validate the role of statistical learning in the acquisition of syntactic categories. Additionally, many distributional accounts have pushed aside the role of meaning in lexical category acquisition, but meaning may be another cue that can be used in multiple cue integration and in fact may prove to be particularly useful for the acquisition of a noun category. Lany and Sa¤ran (2010) provide evidence from an artificial grammar learning study that infants can detect the convergence of made-up morphological inflections (e.g. –el in fengel ), distributional information (e.g. a- precedes X and b- precedes Y), and semantic information (X words refer to vehicles and Y words refer to animals). Lany and Sa¤ran tested infants’ word learning of concrete objects, however the results suggest that infants can map semantic information onto members of previously formed categories if they are provided with the aXbY language first. This study is a first step in incorporating semantic cues into word class acquisition studies and it also is a step towards providing convincing evidence that infants in these studies are treating these artificial language categories they learn as categories relevant to language learning. Finally, both corpus analyses and artificial language studies should continue to be aware of the cues that infants could be sensitive to at particular ages. One limitation of the results discussed in this chapter is that they do not address categorization for other word classes besides noun and verb. Even if one acknowledges the claim that nouns and verbs ‘‘are the fundamental and universal primitives from which grammars are constructed’’ (Mintz et al. 2002, p. 394), the fact still stands that the most parsimonious mechanism for grammatical category acquisition would be one that could account for acquisition of all of the major syntactic categories,


not just nouns and verbs alone. An objection to this point might be that the analyses could reflect a veritable trajectory in children’s category acquisition whereby noun and verb knowledge is established before knowledge for other categories. It also may be that lexical category acquisition strategies are not one size fits all, rather each category benefits from a di¤erent combination of cues. For instance, noun categorization may benefit more from semantic based information than verb categorization (see Gentner 1982 and Maratsos 1990). In sum, we have reviewed a number of findings contributing to the puzzle of how children acquire word classes through statistical learning including distributional, phonological and prosodic cues to word class and their usefulness for learners, how learners might bootstrap from correlated cues to category structure, and how learning might scale up from purely form-based categories to lexical classes. Although the requisite cues are present and there is evidence that young learners are able to capitalize on such cues to acquire word class, a caveat is in order. The artificial languages used to study the learning mechanisms involved are necessarily quite simple. Although these languages do not scale up to the complexity of a natural language, they do provide insight into the learning mechanisms at work. The next step in further investigation of the problem of how children acquire word classes will be to make predictions about learning in the context of natural languages driven by the hypotheses discovered with the use of artificial languages much in the same way models are used to test theories. References Braine, Martin D. S. 1966 Learning the positions of words relative to a marker element. Journal of Experimental Psychology, 72, 532–540. Braine, Martin D. S. 1987 What is learned in acquiring word classes – A step toward an acquisition theory. In Brian MacWhinney (Ed.), Mechanisms of Language Acquisition (pp. 65–87). Hillsdale, NJ: Lawrence Erlbaum Associates. Brill, Eric 1991 Discovering the lexical features of a language. In Proceedings of the 29th Annual Meeting of the Association for Computational Linguistics. Berkeley, CA.


Campbell, Ruth and Besner, Derek 1981 This and thap – constraints on the pronunciation of new, written words. Quarterly Journal of Experimental Psychology, 33, 375– 396. Cartwright, Timothy A. and Brent, Michael R. 1997 Syntactic categorization in early language acquisition: Formalizing the role of distributional analysis. Cognition, 63, 121–170. Cassidy, Kimberly W. and Kelly, Michael H. 1991 Phonological information for grammatical category assignments. Journal of Memory and Language, 14, 333–352. Chemla, Emmanuel, Mintz, Toben H., Bernal, Savita, and Christophe, Anne 2009 Categorizing words using ‘frequent frames’: What cross-linguistic analyses reveal about distributional acquisition strategies. Developmental Science, 12, 396–406. Christiansen, Morten H. and Monaghan, Padraic 2006 Discovering verbs through multiple-cue integration. In Kathryn Hirsh-Pasek and Roberta Michnick Golinko¤, Action meets word: How children learn verbs (pp. 88–107). New York: Oxford University Press. Conwell, Erin and Morgan, James L. 2008 Learning about cross-category word use: The role of prosodic cues. Poster presented at the 16th International Conference on Infant Studies, Vancouver, CA. Conwell, Erin, and Morgan, James L. submitted for publication When parents verb nouns: Resolving the ambicategoricality problem. Cutler, Anne 1993 Phonological cues to open- and closed-class words in the processing of spoken sentences. Journal of Psycholinguistic Research, 22, 109–131. Cutler, Anne and Carter, David M. 1987 The predominance of strong initial syllables in the English vocabulary. Computer Speech and Language, 2, 133–142. Durieux, Gert and Gillis, Steven 2001 Predicting grammatical classes from phonological cues: An empirical test. In Ju¨rgen Weissenborn, and Barbara Ho¨hle (Eds.), Approaches to bootstrapping: Vol. 1. Phonological, lexical, syntactic and neurophysiological aspects of early language acquisition (pp. 189–229). Amsterdam: John Benjamins. Finch, Steven P. and Chater, Nick 1992 Bootstrapping syntactic categories. In Proceedings of the 14th Annual Conference of the Cognitive Science Society of America. Hillsdale, NJ: Lawrence Erlbaum Associates.


Finch, Steven P. and Chater, Nick 1994 Distributional bootstrapping: From word class to proto-sentences. In Proceedings of the Cognitive Science Society of America. Hillsdale, NJ: Lawrence Erlbaum Associates. Fitneva, Stanka A., Christiansen, Morten H. and Monaghan, Padraic 2009 From sound to syntax: phonological constraints on children’s lexical categorization of new words. Journal of Child Language, 36, 967–997. Frigo, Lenore and McDonald, Janet L. 1998 Properties of phonological markers that a¤ect the acquisition of gender-like subclasses. Journal of Memory and Language, 39, 218–245. Gentner, Dedre 1982 Why nouns are learned before verbs: Linguistic relativity versus natural partitioning. In Stan Kuczaj (Ed.), Language development: Vol. 2. Language, thought and culture (pp. 310–334). Hillsdale, NJ: Lawrence Erlbaum Associates. Gerken, LouAnn, Wilson, Rachel, Go´mez, Rebecca L., and Nurmsoo, Erica 2009 The relation between linguistic analogies and lexical categories. In James P. Blevins and Juliette Blevins (Eds.), Analogy in grammar: Form and acquisition (pp. 101–117). New York: Oxford University Press. Gerken, LouAnn, Wilson, Rachel, and Lewis, William 2005 Infants can use distributional cues to form syntactic categories. Journal of Child Language, 32, 249–268. Gleitman, Lila R. and Wanner, Eric 1982 Language acquisition: the state of the state of the art. In Eric Wanner and Lila R. Gleitman (Eds.), Language Acquisition: The state of the art (pp. 3–48). New York: Cambridge University Press. Go´mez, Rebecca L. 2002 Variability and detection of invariant structure. Psychological Science, 13, 431–436. Go´mez, Rebecca L. and Lakusta, Laura 2004 A first step in form-based category abstraction by 12-month-old infants. Developmental Science, 7, 567–580. Go´mez, Rebecca L. and Maye, Jessica 2005 The developmental trajectory of nonadjacent dependency learning. Infancy, 7, 183–206. Ho¨hle, Barbara and Weissenborn, Ju¨rgen 2003 German-learning infants’ ability to detect unstressed closed-class elements in continuous speech. Developmental Science, 6, 122–127. Ho¨hle, Barbara, Weissenborn, Ju¨rgen, Kiefer, Dorothea, Schulz, Antje, and Schmitz, Michaela 2004 Functional elements in infants’ speech processing: The role of determiners in the syntactic categorization of lexical elements. Infancy, 5, 341–353.


Karmilo¤-Smith, Annette 1978 The interplay between syntax, semantics and phonology in language acquisition processes. In Robin N. Campbell and Philip T. Smith (Eds.), Advances in the psychology of language – language development and mother-child interaction (pp. 1–23). London: Plenum Press. Kelly, Michael H. 1992 Using sound to solve syntactic problems: The role of phonology in grammatical category assignments. Psychological Review, 99, 349–364. Kelly, Michael H. and Bock, J. Kathryn 1988 Stress in time. Journal of Experimental Psychology: Human Perception and Performance, 14, 389–403. Lany, Jill and Sa¤ran, Jenny R. 2010 From statistics to meaning: Infants’ acquisition of lexical categories. Psychological Science, 21, 284–291. Levy, Yonata 1983 The acquisition of Hebrew plurals: The case of the missing gender category. Journal of Child Language, 10, 107–121. Li, Aijun, Shi, Rushen, and Hua, Wu 2010 Prosodic cues to noun and verb categories in infant-directed Mandarin speech. In Proceedings of the Fifth International Conference on Speech Prosody. Chicago, Illinois. Maratsos, Michael 1990 Are actions to verbs as objects are to nouns? On the di¤erential semantic bases of form, class, category. Linguistics, 28, 1351– 1380. Maratsos, Michael and Chalkley, Mary Anne 1980 The internal language of children’s syntax: The ontogenesis and representation of syntactic categories. In Katherine Nelson (Ed.), Children’s language Vol. 2. (pp. 127–213). New York: Gardner Press. Mills, Anne E. 1986 The acquisition of gender: A study of English and German. Berlin: Springer-Verlag. Mintz, Toben H. 2002 Category induction from distributional cues in an artificial language. Memory and Cognition, 30, 678–686. Mintz, Toben H. 2003 Frequent frames as a cue for grammatical categories in child directed speech. Cognition, 90, 91–117. Mintz, Toben H., Newport, Elissa L., and Bever, Thomas G. 1995 Distributional regularities of grammatical categories in speech to infants. In J. Beckman (Ed.), Proceedings of the 25th Annual Meeting of the North Eastern Linguistics Society (pp. 43–54). Amherst, MA: Graduate Linguistic Student Association.


Mintz, Toben H., Newport, Elissa L., and Bever, Thomas G. 2002 The distributional structure of grammatical categories in speech to young children. Cognitive Science, 26, 393–424. Monaghan, Padraic, Chater, Nick, and Christiansen, Morten H. 2005 The di¤erential role of phonological and distributional cues in grammatical categorisation. Cognition, 96, 143–182. Monaghan, Padraic, and Christiansen, Morten H. 2008 Integration of multiple probabilistic cues in syntax acquisition. In Shanley E. M. Allen, Caroline F. Rowland, Annick De Houwer, and Steven Gillis (Series Eds.) and Heike Behrens (Vol. Ed.), Trends in language acquisition research: Vol. 6. Corpora in language acquisition research: History, methods, perspectives (pp. 139–163). Amsterdam: John Benjamins. Monaghan, Padraic, Christiansen, Morten H., and Chater, Nick 2007 The phonological-distributional coherence hypothesis: Crosslinguistic evidence in language acquisition. Cognitive Psychology, 55, 259–305. Morgan, James L., and Newport, Elissa L. 1981 The role of constituent structure in the induction of an artificial language. Journal of Verbal Learning and Verbal Behavior, 20, 67–85. Morgan, James L., Meier, Richard P., and Newport, Elissa L. 1987 Structural packaging in the input to language learning: Contributions of prosodic and morphological marking of phrases to the acquisition of language. Cognitive Psychology, 19, 498–550. Morgan, James L., Shi, Rushen, and Allopena, Paul 1996 Perceptual bases of rudimentary grammatical categories: Toward a broader conceptualization of bootstrapping. In James L. Morgan and Katherine Demuth (Eds.), Signal to syntax: Bootstrapping from speech to grammar in early acquisition. Hillsdale, NJ: Lawrence Erlbaum Associates. Pinker, Steven 1987 The bootstrapping problem in language acquisition. In Brian MacWhinney (Ed.), Mechanisms of Language Acquisition. Hillsdale, NJ: Lawrence Erlbaum Associates. Redington, Martin, Chater, Nick, and Finch, Steven P. 1998 Distributional information: A powerful cue for acquiring syntactic categories. Cognitive Science, 22, 425–469. Sa¤ran, Jenny R. 2001 The use of predictive dependencies in language learning. Journal of Memory and Language, 44, 493–515. Sa¤ran, Jenny R. 2002 Constraints on statistical language learning. Journal of Memory and Language, 47, 172–196.


Sa¤ran, Jenny R., Hauser, Marc, Seibel, Rebecca, Kapfhamer, Joshua, Tsao, Fritz, and Cushman, Fiery 2008 Grammatical pattern learning by human infants and cotton-top tamarin monkeys. Cognition, 107, 479–500. Santelmann, Lynn M. and Jusczyk, Peter W. 1998 Sensitivity to discontinuous dependencies in language learners: Evidence for limitations in processing space. Cognition, 69, 105– 134. Schu¨tze, Hinrich 1993 Part-of-speech induction from scratch. In Proceedings of the 31st Annual Meeting of the Association for Computational Linguistics. Columbus, OH. Sereno, Joan A. and Jongman, Allard 1990 Phonological and form class relations in the lexicon. Journal of Psycholinguistic Research, 19, 387–404. Shady, Michelle 1996 Infants’ sensitivity to function morphemes. Unpublished doctoral dissertation, State University of New York at Bu¤alo. Shafer, Valerie L., Shucard, David W., Shucard, Janet L., and Gerken, LouAnn 1998 An electrophysiological study of infants’ sensitivity to the sound patterns of English speech. Journal of Speech, Language, and Hearing Research, 41, 874–886. Shi, Rushen, and Moisan, Annick 2008 Prosodic cues to noun and verb categories in infant-directed speech. In Harvey Chan, Heather Jacob, and Enkeleida Kapia (Eds.), BUCLD 32: Proceedings of the 32th Annual Boston University Conference on Language Development (pp. 450–461). Boston, MA: Cascadilla Press. Shi, Rushen, Morgan, James L., and Allopena, Paul 1998 Phonological and acoustic bases for early grammatical category assignment: A cross-linguistic perspective. Journal of Child Language, 25, 169–201. Shi, Rushen, and Werker, Janet F. 2001 Six-month-old infants’ preference for lexical words. Psychological Science, 12, 70–75. Shi, Rushen, and Werker, Janet F. 2003 The basis of preference for lexical words in 6-month-old infants. Developmental Science, 6, 484–488. Shi, Rushen, Werker, Janet F., and Morgan, James L. 1999 Newborn infants’ sensitivity to perceptual cues to lexical and grammatical words. Cognition, 72, 11–21. Smith, Kirk H. 1966 Grammatical intrusions in the recall of structures letter pairs: Mediated transfer or position learning? Journal of Experimental Psychology, 72, 580–588.


Smith, Kirk H. 1969 Learning co-occurrence restrictions: Rule learning or rote learning? Journal of Verbal Behavior, 8, 319–321. Soderstrom, Melanie, White, Katherine S., Conwell, Erin and Morgan, James L. 2007 Receptive grammatical knowledge of familiar content words and inflection in 16-month-olds. Infancy, 12, 1–29. St. Clair, Michelle C., Monaghan, Padraic, and Christiansen, Morten H. 2010 Learning grammatical categories from distributional cues: Flexible frames for language acquisition. Cognition, 116, 341–360. Thompson, Susan P. and Newport, Elissa L. 2007 Statistical learning of syntax: The role of transitional probability. Language Learning and Development, 3, 1–42. Van Heugten, Marieke and Shi, Rushen 2010 Infants’ sensitivity to non-adjacent dependencies across phonological phrase boundaries. Journal of the Acoustical Society of America, 128, EL223-EL228. Van Heugten, Marieke and Johnson, Elizabeth K. 2010 Linking infants’ distributional learning abilities to natural language acquisition. Journal of Memory and Language, 63, 197–209.

Linguistic constraints on statistical learning in early language acquisition

Mohinish Shukla, Judit Gervain, Jacques Mehler and Marina Nespor

1. Introduction

Two opposing ideas have dominated thinking about the origins of human knowledge. Both schools greatly influenced psychology at different epochs of the 20th century. Empiricism, arguing that human knowledge originates in the outside world and is mostly learned through the senses, inspired behaviorism, one of the dominant trains of thought in psychology in the late 19th and early 20th centuries. By contrast, since the 1950s, the cognitive revolution, with its roots in rationalism, has proposed that a considerable part of human knowledge is innate and not acquired through experience. Research on the development of language has been at the forefront of these debates (Mehler and Dupoux 1994). Arguments have been put forth both for an innate, rule-based language faculty (Bertoncini et al. 1988) and for an account based exclusively on general-purpose learning mechanisms (Elman et al. 1996; Tomasello 2000), often statistical in nature. Recently, a synthesis has started to emerge, asking not whether language acquisition is governed by our genetic endowment or by general learning mechanisms, but rather what aspects of language acquisition are governed by which mechanism. Further, it has been recognized that the innate constraints that guide language learning need not themselves be only linguistic in character, but could be, in part, derived from primitive perceptual computations (Endress, Nespor, and Mehler 2009; Gervain and Mehler 2010). This integrative view emphasizes not only that mechanisms of all three types, rule-based, statistical, and perceptual, are essential for a comprehensive theory of language acquisition, but also that these mechanisms very often interact in interesting ways in the course of language development. Rule-based computations together with distributional statistics and primitive perceptual and memory constraints are essential to explain how language acquisition arises.


Although these mechanisms are shared with other species (e.g., Gallistel and King 2009), only humans acquire the grammar underlying their language of exposure. Possibly, specific interactions between the various mechanisms are part of the human cognitive endowment and play an important role in defining our unique language competence. The aim of the present paper is to discuss some of these interactions. We will show how a general statistical mechanism conspires with rule extraction mechanisms and primitive perceptual constraints, enabling infants to learn the words and grammatical rules of their native language.

2. What is statistical learning?

At the very foundations of information theory (Shannon 1948) lies the observation that the statistical structure of natural language, conceived of as a discrete symbolic system, is such that its units are neither equiprobable nor independent of each other. The simple probability of a unit is derived from its frequency of occurrence, whereas the conditional probability of a unit in a given context is defined as its frequency of occurrence in this context. Thus, in an absolute sense, the word man is more frequent, i.e. more probable, than the word feather. However, in the context light as a . . . , feather becomes more probable than most other words. The effects of simple probability or frequency on language acquisition, use and processing have been well established for a long time (Forster and Chambers 1973; Zipf 1935). The last two decades have witnessed an increasing interest in the contribution of conditional probabilities to language acquisition and language use. Indeed, according to some proposals (Elman et al. 1996), learning a language is nothing more than learning the probability distributions over speech sounds. In the present article, we take the position that distributional cues, including conditional probabilities, are just one of several cues that aid the infant in acquiring the ambient language.

The intuition that the distribution of conditional probabilities, i.e. the strength of the statistical coherence among units, might provide cues for the segmentation of continuous speech into its constituents dates back to American structural linguistics. Harris (1955) proposed a way to establish morpheme boundaries in unsegmented utterances of Native American languages based on the idea that distributional coherence is stronger between phonemes that fall inside the same morpheme than between those that span morpheme boundaries.


However, it was not until Hayes and Clark's (1970) initial study that experimental evidence emerged that human adults are indeed able to use conditional probabilities to segment a continuous speech analog. The relevance of statistically based speech segmentation for language acquisition has been demonstrated by Saffran, Aslin, and Newport (1996), who showed that 8-month-old infants were able to use the forward transitional probabilities between syllables in an artificial speech stream to segment it into words. Forward transitional probabilities are defined as the probability of occurrence of a unit B given a preceding unit A, or TP(A → B) = F(AB)/F(A), where F(X) is the frequency of unit X. Saffran and her colleagues constructed the artificial speech stream by concatenating four trisyllabic nonce words (e.g. tupiro, golabu, bidaku, padoti) in such a way that no word could repeat adjacently (tupirogolabubidakutupiro. . .). This structure yielded TPs of 1.0 between adjacent syllables within a word and TPs of 0.33 across word boundaries. Dips in TP values were thus the only cues to word boundaries. After only 2 min of exposure to such a continuous speech stream, infants discriminated the words of the stream from part-words, defined as a trisyllabic sequence obtained from the last syllable of a word and the first two syllables of the subsequent word in the stream, or the last two syllables of a word and the first syllable of the subsequent word (e.g. rogola etc.). They showed longer looking times for the novel stimuli, i.e. the part-words. Notice that both words and part-words were familiar to infants as they both occurred in the stream, although the words differed from part-words both in being more frequent and in having higher average TPs. In a subsequent study, Aslin, Saffran, and Newport (1998) constructed artificial speech streams wherein the words and part-words were matched in frequency, such that the primary difference between the two types of items was the presence of a TP dip in part-words but not in words. Infants nevertheless discriminated the two item types, indicating that they were tracking conditional probabilities rather than mere frequency of occurrence. These results suggest that even young infants are able to use conditional statistical information for segmenting continuous speech in an efficient manner to extract words.

These findings gave rise to a large body of research investigating the exact nature of this mechanism. First, at least for certain aspects, statistical learning appears not to be a specifically human ability (Conway and Christiansen 2001). Toro and Trobalon (2005) found that rats were able to segment words out of a continuous speech stream, although they succeeded only with speech sequences where simple co-occurrence frequencies could be used as a cue, and failed when conditional probabilities needed to be used. They also failed to extract more complex non-adjacent dependencies, a task that human adults (Peña et al. 2002) and infants (Marchetto 2009) can perform.
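To make the TP(A → B) = F(AB)/F(A) computation concrete, the sketch below builds a syllable stream in the spirit of the design just described and computes forward transitional probabilities between adjacent syllables. The stream construction is simplified for illustration and is not the original stimulus set; within-word TPs come out at 1.0 and boundary TPs at roughly 0.33, as in the description above.

```python
import random
from collections import Counter

# Four trisyllabic nonce words, as in the example above; the stream is a
# simplified stand-in for the familiarization material (no adjacent repeats).
WORDS = ["tupiro", "golabu", "bidaku", "padoti"]

def syllabify(word):
    # The items are strict CV syllables, so split every two characters.
    return [word[i:i + 2] for i in range(0, len(word), 2)]

def build_stream(n_tokens=300, seed=0):
    rng, stream, prev = random.Random(seed), [], None
    for _ in range(n_tokens):
        word = rng.choice([w for w in WORDS if w != prev])  # no adjacent repetition
        stream.extend(syllabify(word))
        prev = word
    return stream

def forward_tps(stream):
    """TP(A -> B) = F(AB) / F(A) over adjacent syllables in the stream."""
    pair_freq = Counter(zip(stream, stream[1:]))
    syll_freq = Counter(stream[:-1])
    return {(a, b): f / syll_freq[a] for (a, b), f in pair_freq.items()}

if __name__ == "__main__":
    tps = forward_tps(build_stream())
    print("within-word:", round(tps[("tu", "pi")], 2), round(tps[("pi", "ro")], 2))
    print("across a boundary:", round(tps[("ro", "go")], 2))  # roughly 0.33
```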


Exploring the developmental trajectory of statistical learning abilities, several groups have reported that this ability emerges very early on and is already operational at birth, both for non-linguistic auditory stimuli, i.e. pure tones (Kudo et al. 2006), and for naturalistic syllables (Teinonen et al. 2009). As the above results already suggest, statistical learning is domain-general. In addition to linguistic and non-linguistic auditory stimuli (Saffran et al. 1999), statistical learning has been demonstrated in a motor (serial reaction time) task (Hunt and Aslin 2001), with tactile stimuli (Conway and Christiansen 2005), and with visual stimuli, both in adults (Fiser and Aslin 2005; Conway and Christiansen 2005) and in infants (Fiser and Aslin 2002; Kirkham, Slemmer, and Johnson 2002).

The original studies by Saffran et al. (1996) used a familiarization of only 2 minutes, suggesting that statistical learning is powerful and relatively fast. Nevertheless, it requires sufficient time to allow sampling from the input material (Endress and Bonatti 2007). As the initial studies used a posteriori measures (e.g. recognition of extracted words in a test phase following familiarization), the time course of statistical learning had remained unknown for a long time. Recently, however, several behavioral (Gómez, Bion, and Mehler 2010) and electrophysiological studies (Abla, Katahira, and Okanoya 2008; Loui et al. 2009; Teinonen et al. 2009; Buiatti, Pena, and Dehaene-Lambertz 2009) have been conducted to characterize statistical learning on-line. These studies provide converging evidence that behavioral and neurophysiological signatures of segmentation start to emerge earlier than successful segmentation as reported behaviorally using off-line measures. The electrophysiological studies have also revealed the neural correlates of segmentation. In Abla et al.'s (2008) study, an increased N400 was observed at middle frontal and central sites in participants with high off-line (behavioral) performance, while Buiatti et al. (2009) found reduced brain oscillations at frequencies corresponding to single syllables, but greater oscillatory power at frequencies corresponding to trisyllabic units. In sum, these results suggest that statistical learning is a robust, domain-general, age-independent and not specifically human ability. The subsequent sections of the chapter will investigate how this powerful, domain-general mechanism might contribute to the acquisition of the native language and how it interacts with other, language-specific and perceptual, processes.


3. Constraints on statistical learning at the phonemic level: The different roles of vowels and consonants

In the previous section, we exclusively reviewed the distributional properties of language. The various findings summarized above suggest that humans are indeed able to compute distributional properties over speech sounds, and that these can potentially aid in language acquisition. Statistical learning is typically referred to as a "domain-general" process, that is, a process that proceeds in a similar manner irrespective of the input. In contrast, domain-specific mechanisms are not shared across modules (like audition and vision), suggesting specific computations tailor-made to a particular module (but see Conway and Christiansen 2005). Even within the auditory domain, language-specific mechanisms might contrast with other auditory mechanisms. However, the division of labor between the general and specific mechanisms is not clear-cut. In this section we examine how some learning mechanisms that are supposedly general in nature appear to be constrained by specifically linguistic representations. Here we consider how linguistic representations constrain the use of statistical information in the detection of words in continuous speech. In particular, it has been shown that adults, as well as 8-month-old infants, can segment an artificial language in which the only cues available for word segmentation are the transitional probabilities between syllables. However, the computation of TPs appears to be constrained at the phonemic level, in that consonants, but not vowels, lend themselves to TP computations (Bonatti et al. 2005) – the so-called C/V Hypothesis (Nespor, Peña, and Mehler 2003).

3.1. The linguistic basis of the C/V hypothesis

It has been hypothesized that there is a (partial) division of labor between consonants (C) and vowels (V) in the interpretation of linguistic properties: while the main function of consonants consists in conveying lexical distinctions, the main role of vowels is that of allowing the identification of the rhythmic class to which a language belongs, as well as of specific properties of syntactic structure, and, in many cases, morphological structure (Nespor, Peña, and Mehler 2003). Probably inspired by the writing system of Hebrew, where the letters represent only consonants and identify the general meaning of words, and the vowels are diacritics that identify morphosyntactic information, such as gender and number, Spinoza (1677) considered vowels the soul of


the letters and the consonants bodies without soul. Spinoza thus had clear intuitions about the di¤erent nature of the two categories: consonants and vowels. The hypothesis of a functional distinction between consonants (Cs) and vowels (Vs) is based on evidence from di¤erent disciplines that investigate language. First of all, Cs are cross-linguistically more numerous than Vs, the most common C:V ratio being 20:5. In extreme cases, Cs outnumber Vs to a greater extent, e.g. in Hausa the C:V ratio is 32C: 5V and in Arabic 29C: 3V. Cases like Swedish with 16 Cs and 17 Vs are extremely rare. The larger number of Cs as compared to the number of Vs crosslinguistically makes Cs relatively more informative than Vs, suggesting that their information load may be at the basis of their functional specialization for lexical interpretation. This specialization, however, goes beyond their numerical superiority, as seen from the fact that their division of labor remains unchanged in languages in which there is a similar proportion of Vs and Cs. It has, in fact, been shown that in word recognition, lexical selection is constrained less tightly by vocalic than by consonantal information both in languages with a high C:V ratio, like Spanish, and in languages with a balanced C:V ratio, like Dutch (Cutler et al. 2000). If asked to change one phoneme to convert a non-word into a word, participants more often change a V than a C. Thus when presented with a non-word, e.g. kebra, participants most often come up with the word cobra, rather than with the word zebra, indicating that Cs are more resistant to change than Vs in defining lexical items. The results thus indicate that the more distinctive role of Cs with respect to Vs is independent of the variation in C/V ratio across languages. The more distinctive role of Cs may also be attributed to the nature of the vocal tract, which allows for more consonantal than vocalic distinctions. However, the fact that even in systems with a similar number of distinctive Cs and Vs, the role of distinguishing lexical entries is mainly carried by Cs supports the hypothesis of two distinct functional roles for the categories of Cs and Vs. The C/V hypothesis sees the di¤erent functions of the two categories as an e¤ect of the relative stability of Cs across di¤erent contexts, as opposed to the great variability of Vs, the main carriers of prosody. These distinct properties render di¤erences in quality particularly important for Cs, and di¤erence in quantity – i.e. relative prominence – especially relevant for Vs. The variation of Vs in quantity – pitch, intensity and duration – gives them the role of interpreting morphosyntactic structures.


So far, to the best of our knowledge, the C/V hypothesis has been investigated exclusively on non-tonal languages. In languages in which vowels bear distinctive tones, such as Mandarin or Thai, the tones themselves – though carried by Vs – may well behave more like Cs than like Vs. Future research will have to establish whether this is the case. Besides their restricted numerosity and their variability due to prosody, Vs also have a poor distinctive capacity as they have a tendency to lose distinctiveness. In many languages, Vs harmonize throughout a domain, i.e., they become more similar to each another. In other languages, they lose their quality in unstressed positions. In English, for example, unstressed Vs centralize and become schwa. In still other languages, the restricted distinctive power of Vs in unstressed position is only partial, in that their variation is larger in stressed than in unstressed position. For example, in European Portuguese, there are 8 Vs in stressed positions, but only 4 in unstressed positions. Thus the qualitative distinctions between Vs – poorer than that of Cs to begin with – further diminishes due to a number of phonological phenomena. Consonants and vowels have radically distinctive roles in cuing linguistic information in some languages: only Cs have the role of constituting lexical roots, for example, in Semitic languages, as noted by Spinoza. A trisyllabic root like gdl relates to the concept of big, while the vowels around the consonants generate di¤erent word categories or word forms. For example, gadol and gdola are adjectives meaning big, masculine and feminine, respectively; while giddel and gaddal are verbs meaning (he) grew, transitive and intransitive, respectively. Thus in languages of this type, exclusively Cs accomplish the role of distinguishing lexical roots. Symmetric systems in which Vs constitute the lexical roots and Cs supply morphosyntactic information are unattested. Phenomena of the type described above are at the basis of Goldsmith (1976)’s proposal to establish di¤erent levels of representation for Vs and Cs, each of the two categories constituting sequences in which they are adjacent on their respective tier. These di¤erent tiers, or levels of representation, are meant to account for phenomena that apply only to one category, ignoring the other, as vowel harmony or tonal spreading for the vocalic tier and lexical roots for the consonantal tier. 3.2. Consonants and vowels in the computation of transitional probabilities As observed above, the mechanism that allows humans to use transitional probabilities to segment a sequence of items is domain general, applying


to syllables, but also to musical tones and visual stimuli. The generality of this learning mechanism does not rule out the possibility that specific linguistic factors might influence the domain over which TPs are computed. This possibility has been explored in two different ways: on the one hand, it has been hypothesized that, given long distance phenomena in language, TPs could be calculated on non-adjacent syllables (see Section 4.2). On the other hand, given the independence of vocalic and consonantal representations (Goldsmith 1976), it has been hypothesized that TPs could be computed on one but not on the other level, that is, on elements that, though not phonetically adjacent, are adjacent at an abstract level of representation. The hypothesis that TPs could be calculated at this abstract level of representation was first formulated in Newport and Aslin (2004), who proposed that, if speech triggers the construction of separate V and C tiers, then TPs should be computed over the two types of representations equally efficiently. However language, in addition to providing representations, may provide constraints as to which representations lend themselves to TP computations. Thus an alternative view proposes that, given the specialization of Cs for the lexicon, TPs should be computed on the consonantal tier. They should, however, not be computed on the vocalic tier, since Vs are variable, being the main carriers of prosody, and thus have mainly a grammatical function, and are often very restricted in number. Bonatti et al. (2005) thus proposed that Vs are processed independently of their local statistical distribution (cf. also Mehler et al. 2006).

In their first experiment, participants were exposed to a continuous stream of CV syllables that was a random concatenation of three tri-syllabic word "families." The families were defined by fixed consonantal frames and varying vowels, e.g., puRagy, puRegy, poRegy or malitu, malyto, melytu. As a consequence, TPs between Cs were high word internally (TPs = 1) and were lower between words (TPs = 0.3). The vowels of the words, instead, varied such that the TPs between Vs were comparable within or between words. Thus TP computation over vowels alone or between syllables could not be exploited to identify words. Having to choose between words and part-words, participants significantly preferred words. This result shows that adults can exploit TPs between Cs to segment a continuous speech stream, a conclusion already reached by Newport and Aslin (2004) in an experiment with different materials and design. In a symmetric experiment, Bonatti et al. (2005) tested whether Vs play a similar role. In this experiment, words were organized into families, this time defined by fixed vocalic frames and varying consonants, e.g., põkima, põRila, tõRima or kumepã, kuletã, Rulepã. That is, words were


defined by TPs of 1 word internally, exclusively on Vs, while Cs varied. Having to choose between words and part-words, participants were at chance, showing their inability to compute TPs when relying on Vs. In a third experiment, participants had to choose between words and partwords, after exposure to a stream that contained both consonantal and vocalic words, but mismatched with respect to each other. That is, a stream could be segmented on the basis of either the Cs or the Vs, but not both. Participants preferentially relied on TPs on Cs over TPs on Vs, confirming the results of the previous experiments. Segmentation can be achieved by relying on Vs only when they are enhanced by nonprobabilistic information, for example, when a stream contains long stretches of immediate repetitions of the same vocalic pattern, as in the material in Newport and Aslin (2004). Restricting TP computations to the C level is economical from the point of view of language acquisition. An asymmetry in the perception of Vs and Cs has been detected also in newborns, who have been shown to be more sensitive to vocalic than to consonantal changes (Bertoncini et al. 1988). This asymmetry between the two categories is highlighted in the TIGRE model (Mehler, Pallier and Christophe 1996). In this Time-Intensity Grid Representation, it was proposed that infants initially represent speech as a temporal pattern of high- and low-intensity stretches, which roughly correspond respectively to Vs and Cs. 3.3. Consonants and vowels in the extraction of generalizations Although the vocalic level of representation does not lend itself to the computation of transitional probabilities, as seen above, other types of computations are possible over Vs. These – as the C/V hypothesis predicts – disregard quality distinctions. On the basis of the TIGRE model and the salience of Vs in the speech stream, as well as the fact that both adults and newborns discriminate two languages if they belong to di¤erent rhythmic classes (e.g. Italian and Dutch or Spanish and English), but not if they belong to the same class (e.g. English and Dutch or Italian and Spanish, Nazzi, Bertoncini and Mehler 1998), an acoustic correlate of rhythm has been proposed. Specifically, the phonetic correlates of di¤erent rhythmic classes of languages (Pike 1945) have been defined as the percentage of time occupied by Vs in an utterance (%V), as well as the variability (measured as the standard deviation) of the C intervals (DC) (Ramus, Nespor and Mehler 1999). This is one way in which Vs – because of quantitative di¤erences – o¤er cues to grammar, in this case phonology. The identifica-

tion of the rhythmic class to which a language belongs, in fact, o¤ers a cue to the complexity of the syllabic repertoire: high %V and low DC are predictive of a restricted syllabic repertoire, with Vs recurring at rather regular intervals, thus implying a majority of simple syllables. This is typically the case of so-called mora-timed languages, such as Japanese (Ladefoged 1993). Low %V and high DC are predictive of a rich syllabic repertoire, thus Vs recurring at very variable intervals, implying complex syllables interspersed with simple ones. This is the case of so-called stresstimed languages, such as Dutch or English. In between moraic and stress languages are the so-called syllable-timed languages, such as Spanish and French, with a syllabic repertoire that falls in between that of stress languages and that of moraic languages (Pike 1945). It has, in addition, been shown that Vs, but not Cs, lend themselves to the extraction of generalizations (Toro et al. 2008). Toro et al. (2008) exposed participants to a stream that contained consonantal sequences coherent in terms of TPs, and vocalic sequences that followed a simple structural organization. As in previous studies (Bonatti et al. 2005; Newport and Aslin 2004), listeners found words in a continuous speech stream by using distributional information carried by Cs. Listeners were also able to extract a structural regularity from Vs and apply it to novel items. Thus both statistical and structural information are extracted from the same stream, but on di¤erent categories and for di¤erent purposes: either to identify words on the basis of Cs, or to detect structural generalizations on the basis of Vs. In a second experiment, the roles of Cs and Vs were reversed, Vs being coherent in terms of TPs within a sequence, and Cs following a simple generalization. Consistent with previous results (Bonatti et al. 2005), participants were unable to use the distributional information over Vs for segmentation. Importantly, they were also unable to generalize the structural organization over Cs, thus confirming that di¤erent mechanisms are exploited: a lexical mechanism to extract words from the speech stream, on the one hand, and the extraction of generalizations, a mechanism to identify grammatical regularities, on the other hand. From the study of Toro et al. (2008), the conclusion may be drawn that the processing of linguistic stimuli is constrained by language-relevant representations, and not by a blind general-purpose mechanism. If the latter were the case, Vs and Cs should play the same role. The specific role of Vs in the extraction of generalizations is a further confirmation of the C/V hypothesis.
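
The kind of tier-based computation discussed in this section can be made concrete with a short sketch (here in Python). The CV "word families" below are invented simplifications of the materials described above, and the stream-building procedure is only schematic; the point is merely to show how forward TPs can be computed separately over the consonantal and the vocalic tier of the same syllable stream.

    import random
    from collections import Counter

    # Invented CV "word families": a fixed consonantal frame (p-R-g, m-l-t)
    # with varying vowels, loosely modelled on the family design described above.
    families = [
        [("p", "u"), ("R", "a"), ("g", "y")],
        [("p", "u"), ("R", "e"), ("g", "y")],
        [("m", "a"), ("l", "i"), ("t", "u")],
        [("m", "e"), ("l", "y"), ("t", "o")],
    ]

    # Build a continuous stream by randomly concatenating word tokens.
    stream = []
    for _ in range(500):
        stream.extend(random.choice(families))

    def forward_tps(symbols):
        """Estimate P(next symbol | current symbol) for every attested bigram."""
        firsts = Counter(symbols[:-1])
        bigrams = Counter(zip(symbols[:-1], symbols[1:]))
        return {pair: n / firsts[pair[0]] for pair, n in bigrams.items()}

    consonant_tier = [c for c, v in stream]
    vowel_tier = [v for c, v in stream]

    # Word-internal consonant transitions (p->R, R->g, m->l, l->t) come out high,
    # while between-word transitions (g->p, g->m, t->p, t->m) are diluted by the
    # random concatenation. In the published designs the vowels are arranged so
    # that vowel-tier TPs are comparable within and between words.
    print("Consonant tier TPs:", forward_tps(consonant_tier))
    print("Vowel tier TPs:", forward_tps(vowel_tier))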

Since generalizations could potentially be preferentially extracted over Vs because of their acoustic salience, rather than because of their categorical role in language perception and acquisition, it was asked whether the asymmetry between Vs and Cs could be due to lower-level acoustic differences between the two categories. The categorical distinction between Vs and Cs has been further demonstrated in Toro et al. (2008), where it was shown that when, in the familiarization stream, the V duration is reduced to one third of that of Cs, participants still generalize the structure over the barely audible Vs.
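
The rhythm measures mentioned earlier in this section, %V and ΔC (rendered DC above; Ramus, Nespor and Mehler 1999), are likewise easy to state as a computation once an utterance has been segmented into vocalic and consonantal intervals. The interval durations in the sketch below are invented for illustration; in practice they would come from acoustic measurements of real utterances.

    from statistics import pstdev

    # Hypothetical interval durations (in seconds) for one utterance, as
    # alternating consonantal ("C") and vocalic ("V") stretches.
    intervals = [("C", 0.08), ("V", 0.11), ("C", 0.15), ("V", 0.09),
                 ("C", 0.07), ("V", 0.12), ("C", 0.20), ("V", 0.10)]

    v_durations = [d for label, d in intervals if label == "V"]
    c_durations = [d for label, d in intervals if label == "C"]

    total = sum(d for _, d in intervals)
    percent_v = 100 * sum(v_durations) / total   # %V: proportion of time that is vocalic
    delta_c = pstdev(c_durations)                # ΔC: variability of consonantal intervals

    # High %V with low ΔC is expected for mora-timed languages such as Japanese;
    # low %V with high ΔC for stress-timed languages such as English or Dutch.
    print(f"%V = {percent_v:.1f}, deltaC = {delta_c:.3f}")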

4. Constraints on statistical learning at the morphological level: Extracting morphological regularities Linguistically, an utterance (a sentence) can be analyzed as a nested hierarchy of constituents, from the individual phonemes and syllables up to phonological words and phrases, in their overt, linear order. The linear order of constituents at any level of organization is ultimately derived from the grammar of the language. Intuitively, however, a sentence is seen as a linear string of words, and the syntax of a language can be broadly defined by the ordering of words of the di¤erent parts of speech. For example, the English sentence ‘‘John ate apples’’, in which the order is Subject-VerbObject would, in Hindi, be ‘‘John apples ate’’ – a Subject-Object-Verb order. However, the words themselves can be made up of sub-parts – for example, the word APPLE in English can be realized as the singular apple or the plural apples. Similarly, the word WALK can be realized as walk, walked or walking. In Hindi, the noun John from the sentence above would be realized as John-ne, where the -ne marks the nominative case. That is, words themselves are linear strings of morphemes like stems (walk) and a‰xes (-ed or -ing). Morphology is the study of word-internal phenomena. Cross-linguistically, words can change their shape with their function to di¤erent degrees. Broadly speaking, such changes are traditionally classified as inflection, where a word changes its shape for syntactic reasons (e.g., walk–walked ), and derivation, where a word changes its shape for syntactic as well as semantic reasons (e.g., walk–walker), although this distinction is not quite clear-cut. Such shape changes can be seen along an orthogonal dimension, more relevant to the current discussion, as the di¤erence between concatenative and non-concatenative morphology. All the examples discussed

above are of the first type, where it is fairly easy to separate the various morphemes (e.g., walk-ed ). In contrast, irregular verbs in English demonstrate non-concatenative morphology, as in the verb sing and its past-tense form sang. With respect to concatenative morphology, we can ask of words the same question that we asked of sequences of words: are there predictive, statistical relationships between morphemes that enable the segmentation and acquisition of morphemes within words in a manner analogous to the segmentation and acquisition of words within continuous speech/ utterances? Naturally, the two questions are inter-related and expected to be language-dependent. For example, analytic languages like Chinese have almost no inflectional morphology, while agglutinative languages like Turkish have a rich morphology, such that a single word can be made up of several morphemes. What kind of information (if any) might statistical measures like TPs provide for infants acquiring languages with di¤erent morphological properties? Can distributional strategies like computing frequencies and transition probabilities between pairs of units account both for word segmentation and discovering the morphology of a language? 4.1. Using distributional regularities to identify morphemes Gervain (submitted) provided a first answer to this question by examining the role of TPs in segmenting words from two typologically distinct languages, Italian and Hungarian. In addition, the ability to segment morphemes from morphologically complex words was also analyzed for Hungarian, a heavily agglutinating language. For this analysis, she used corpora of child-directed speech from the CHILDES database (MacWhinney 2000), and computed three measures of pair-wise statistical coherence: forward and backward transition probabilities and mutual information. The first two are conditional probabilities: for a bisyllabic sequence A-B, they reflect either the probability of an upcoming B after having encountered A, or the probability of A being the preceding syllable, having encountered B; normalized by the overall probability of encountering A or B, respectively. Mutual information, in contrast, was defined as the (log 2) probability of encountering the sequence A-B, normalized by the joint probability of encountering A or B alone. To summarize the results, Gervain (submitted) found that, while the distribution of forward- and backward-TPs alone does not discriminate the two languages, performance (as measured by accuracy and completeness) was better for forward-TPs with the Italian corpus and for backward-TPs

for the Hungarian corpus. For Hungarian, where both between-word and word-internal (morpheme) boundaries were examined, positing the presence of a boundary was successful, but when the two types of boundaries were considered separately, segmentation performance was better for word boundaries compared to (word-internal) morpheme boundaries. In addition to these analyses, Gervain et al (2008) also examined the role of word frequency, comparing child-directed speech corpora for Italian (from the CHILDES database, MacWhinney 2000) and Japanese (from the Japanese Mother-Child Conversation corpus, Mazuka, Igarashi, and Nishikawa 2006). To summarize, she first found that, as expected, function words (the equivalents of English ‘‘a’’ or ‘‘the’’) were more frequent than content words like verbs or nouns. Second, the position of the frequent elements with respect to utterance boundaries1 di¤ered in a languagespecific way: Japanese had more frequent-element-final utterances and Italian had more frequent-element-initial utterances. This di¤erence in the location of frequent elements was related to the dominant word order of the two languages: Japanese is predominantly an OV (Object-Verb, or Complement-Head) language and Italian is predominantly a VO language. Studies in language typology have revealed that the order of verbs and their objects in a language correlates with other orderings, importantly, between functional elements and content words (e.g., Dryer 1992). Therefore, frequency aids in detecting which words are functional elements, and the position of these elements relative to utterance boundaries (initial or final) indicates word order, a basic aspect of grammar. Taken together, two statistical properties in the speech stream can aid infants in language acquisition. TPs can provide potential word boundaries, while the frequency of ‘‘words’’ extracted by TPs, and their position relative to utterance edges, indicates the word order in the language. Notice, however, that there is no direct connection between which utterance edge the frequent words are typically aligned with and a grammatical property like OV or VO word order. Indeed, in order to acquire the word order of the language from an examination of the location of frequent elements, the infant must (a) be sensitive to this relationship and (b) be predisposed to induce word order from this relationship. In fact, Gervain et al (2008) have already shown that infants are indeed sensitive to the order of infrequent and frequent elements, and extend the pattern found in their language of exposure to artificial, ambiguous stimuli. Nonetheless, it remains 1. An utterance is a chunk of speech that is bounded by distinct auditory pauses, and thus presents clear perceptual boundaries.

to be shown that they use this ability to induce the specific grammar of their language. These data also indicate that there are language-specific di¤erences, such that a single strategy cannot perform uniformly across di¤erent languages. For example, the observation that backward-TPs are better at word segmentation in Hungarian while forward-TPs are better in Italian suggests that there must be some additional constraints that determine which strategy (if any) is preferred in a given language. One possibility is that these constraints themselves are statistical in nature. For example, the learner might try to optimize the Minimum Description Length (MDL) metric for the encountered speech data as a whole, and choose whichever measure performs best for a given language. Alternately, the constraints might be imposed by other linguistic and non-linguistic factors. Indeed, while frequency might be a factor in separating the class of function words from the class of content words, neither class is uniform. Thus, the content words form distributional classes of their own, such as nouns and verbs. In contrast, from a morphological perspective, words formed from non-concatenative and concatenative means can both belong to the same class, as defined by their syntactic distribution (e.g, sang and walked are both past-tense verbs). It is therefore possible that other general statistical procedures might be useful for extracting linguistic categories. Indeed, several researchers have shown that words from the same category (e.g. nouns or verbs) share a host of distributional properties. For example, they occur in some frequent frames (Mintz 2002, Chemla et al. 2009), or they share various acoustic-phonetic characteristics (e.g., Farmer, Christiansen, and Monaghan 2006). However, there is a chicken-and-egg problem with regards to linguistic categories like verbs and nouns, and their distributional properties: are such linguistic categories induced from the input or are they derived ? The di¤erence lies in whether the learner posits such categories immediately upon encountering an appropriate linguistic input (induction), or whether the categories are mere labels on arbitrary, distributionally coherent sets of words in the input. Notice that an underlying representation of grammar that includes abstract categories will, due to its systematicity, produce non-random distributions of words in the ‘surface structure’ of language. Conversely, a sensitivity to such distributional properties in the input can aid in the rapid acquisition of the underlying categories. As outlined above, the data from Gervain (2008, submitted) shows that there exists a correlation between the distribution of frequent elements with respect to utterance edges and a grammatical property of language, and that infants

are sensitive to this pattern. More recently, Hochmann and colleagues (2010, under review) have shown that infants indeed make the right kind of inference about the linguistic role of elements that differ solely in their distributional properties. In particular, these authors find that, when exposed to novel frequent and infrequent syllables, infants at 17 months of age already prefer to treat the infrequent syllables as object labels2. For the purpose of this chapter, we will assume that there are indeed distributional properties of the input that can lead to the extraction of linguistic categories. We therefore ask whether the extraction and generalization of such categories is also subject to cognitive constraints. This topic is of special relevance to the study of morphology, since the domain of application of morphological rules can be quite complex. We have concentrated primarily on rules at the level of words, e.g., the plural form (e.g., knight–knights). But a morphological domain can consist of a series of words, as for the English possessive, which applies at the right edge of an entire noun phrase (e.g., in [the knights that say Ni]'s horses, the possessive 's modifies the entire noun phrase in brackets).

4.2. Extracting non-adjacent morphological regularities

Morphological regularities within a language can be quite complex, and can even relate morphemes that are not adjacent. For example, in Italian, the derivation to turn an adjective into a verb (the English equivalent of pretty–prettify) may involve adding a prefix such as a- and a suffix -re, as in a.rrossi.re, "redden" or a.vvicina.re, "(get) closer." Therefore, much recent work has concentrated on understanding in detail both how morphemes and morpheme classes can be extracted, and how regularities between them are extracted. Since the findings of Saffran, Aslin & Newport (1996), it has been well established that infants and adults can use TPs to recover (statistically) coherent multisyllabic nonce "words" from artificially created sequences of syllables that are the random concatenation of those nonce words (Aslin, Saffran, and Newport 1998; Peña et al. 2002; amongst many others). Such TP computations are constrained in several ways, some of which have been reviewed above. Here, we concentrate on empirical observations aimed at trying to understand how relationships between distant elements (e.g., the previously mentioned Italian rule a-X-re) are computed.

2. See also Nespor et al. (2008) for acoustic/prosodic correlates of word order.
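
Before considering these non-adjacent cases in detail, it is useful to make explicit the three pair-wise coherence measures used in the corpus analyses of Section 4.1, since the non-adjacent dependencies discussed below involve the same kind of conditional probabilities, only computed over non-adjacent positions. In the usual formulation, the forward TP of a sequence A-B is P(B|A), the backward TP is P(A|B), and the (pointwise) mutual information is log2 of P(A-B) divided by the product P(A)P(B). The toy syllable sequence in the sketch below is invented; the actual analyses were of course run over child-directed corpora.

    import math
    from collections import Counter

    # Toy syllable sequence; real analyses would use child-directed corpora.
    syllables = "ba bi ku ba bi do ra bi ku ba bi ku se ra do ba bi ku".split()

    unigram = Counter(syllables)
    bigram = Counter(zip(syllables, syllables[1:]))
    n_uni = len(syllables)
    n_bi = len(syllables) - 1

    def forward_tp(a, b):
        return bigram[(a, b)] / unigram[a]          # P(B | A)

    def backward_tp(a, b):
        return bigram[(a, b)] / unigram[b]          # P(A | B)

    def mutual_information(a, b):
        p_ab = bigram[(a, b)] / n_bi
        p_a, p_b = unigram[a] / n_uni, unigram[b] / n_uni
        return math.log2(p_ab / (p_a * p_b))        # pointwise MI, in bits

    for pair in [("ba", "bi"), ("bi", "ku")]:
        print(pair, forward_tp(*pair), backward_tp(*pair), mutual_information(*pair))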

Newport and Aslin (2000) presented adults with continuous streams of concatenated trisyllabic nonce words in which only the TP between the first and the third syllable was high, and found that adults were incapable of segmenting out the nonce words. In contrast, adult participants were perfectly able to segment such words when the TPs between only the consonants (which are non-adjacent, being separated by vowels) were high, while the TPs between the vowels were low. Subsequently, it has been established that there is an asymmetry between vowels and consonants, such that TPs over consonants can be readily computed, while TPs over vowels can be computed only under certain circumstances (Newport and Aslin 2004, Bonatti et al. 2005; amongst others). Peña et al. (2002) exposed adults to continuous streams of three randomly concatenated trisyllabic nonce words, in which the first syllable predicted the third with a TP of 1.0, while the middle syllable was randomly chosen from a set of three different syllables. That is, each word conformed to an A-x-C structure, where 'x' is variable. Of particular interest in these studies are the conditions in which a part-word (e.g., x-C1-A2, where A1-C1 are the fixed elements of one word, and A2-C2 are the fixed elements of another) was contrasted with a rule-word, wherein the middle, 'x' syllable had never occurred in that position (e.g., A1-A2-C1). The authors found that after 10 minutes of exposure to such a continuous stream, participants did not prefer rule-words over part-words. After an extended familiarization (30 min), there was in fact a preference for the part-words. However, when they inserted subliminal (25 ms) gaps between the words, participants preferred rule-words over part-words even after just two minutes of familiarization. Peña et al. (2002) hypothesized that the subliminal pauses changed the nature of the computation from a purely statistical one to one involving the induction of generalizations by providing a bracketing of the input. Subsequently, Endress and colleagues (see Endress, Nespor, and Mehler 2009 for an overview) suggested that the relevant change introduced by the subliminal pauses was to provide perceptual edges. For instance, Endress, Scholl & Mehler (2005) showed that simple rules could be easily generalized at the edges of utterances, but were much more difficult (or impossible) to generalize in the middles. For example, participants were exposed to 7-syllable sequences, in which a repetition occurred either at the first and second position in the sequence (AAbcdef) or at the fourth and fifth position (abcDDef); participants were able to induce the repetition-at-edge rule, while they were unable to induce the repetition-in-the-middle

rule. Further, adults were capable of inducing two classes comprising arbitrary syllable lists and computing their relations only when the two classes were at the edges of short lists. That is, adults could learn a rule of the form AxxC, but not of the form xACx, where ‘A’ and ‘C’ are arbitrary syllable classes3. A further source of information in inducing generalizations is the variability of the tokens that comprise each distinct class. For example, Go´mez (2002) showed that 18-month-old infants were more likely and better able to generalize a rule of the form A-x-C if the middle ‘x’ element was highly variable (e.g., drawn from a set of 24) than when it was less variable (e.g., drawn from a set of 2). Indeed, several researchers have posited that variability plays a key role in inducing generalizations by highlighting the commonalities between the tokens (see also Onnis et al. 2004). Indeed, Gerken (2006) trained 9-month-old infants to syllable triplets that conformed to the rule A-A-B, but the final syllable was either fixed (‘di’) or variable. She found that infants exposed to the A-A-di condition only generalized to other triplets that ended in di, but the infants exposed to the triplets with a variable last syllable generalized to arbitrary A-A-B triplets. More recently, Gerken has shown that even a small amount of variability in the third syllable is su‰cient for infants to induce the A-A-B rule (Gerken 2010). More generally, it has been suggested (Newport 1999 and references therein; Newport and Aslin 2000) that variability in the input, as determined by the probabilistic co-occurrence of tokens with respect to each other might provide the necessary ingredients for generalizing the grammatical rules in the language. From the work of Endress and colleagues, we can extend this to say that the co-occurrence of tokens with respect to each other and with respect to perceptual primitives like edges or repetitions might help bootstrap the grammar of the language.
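
The A-x-C designs just described can also be sketched in a few lines. The syllables and frames below are invented; the sketch simply builds words with a variable middle element and verifies that, by construction, the first syllable of a word predicts its third syllable with a TP of 1, while the adjacent transitions into and out of the middle position are diluted by its variability, which is broadly the statistical situation explored by Peña et al. (2002) and Gómez (2002).

    import random
    from collections import Counter

    # Invented A_i ... C_i frames with a variable middle syllable ("x").
    frames = [("pu", "ki"), ("be", "ga"), ("ta", "do")]
    middles = ["li", "ra", "fo", "se", "nu", "mi"]   # a larger set means higher variability

    words = [[a, random.choice(middles), c] for a, c in random.choices(frames, k=400)]
    stream = [s for w in words for s in w]

    def tp(pairs):
        """Forward TP for each attested pair: P(second | first)."""
        firsts = Counter(a for a, b in pairs)
        counts = Counter(pairs)
        return {p: n / firsts[p[0]] for p, n in counts.items()}

    adjacent = list(zip(stream, stream[1:]))
    nonadjacent = [(w[0], w[2]) for w in words]      # first -> third syllable of each word

    # Every A_i -> C_i non-adjacent TP is 1.0 by construction, whereas the
    # adjacent A -> x and x -> C transitions are diluted by the variable middles.
    print("non-adjacent:", tp(nonadjacent))
    print("adjacent (sample):", dict(list(tp(adjacent).items())[:6]))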

3. Although such generalizations are possible when 'A' and 'C' are natural classes, like stops or glides.

5. Constraints on statistical learning at the syntactic level

Originally, statistical learning was proposed as a mechanism for learners to segment words from the continuous speech stream. More recently, it has been suggested that statistical information might also help learners extract syntactic information such as word order or phrase structure from

the input. This is possible because underlying syntactic regularities leave a statistical signature on the surface, i.e. they result in non-homogeneous, non-equiprobable distributions of morphemes, words and phrases. In a series of experiments with adults, (Thompson, and Newport 2007) investigated the learnability of phrasal grouping in a complex artificial grammar, where phrase boundaries were marked by dips in TPs. The dips were obtained as a result of four di¤erent syntactic manipulations: (i) the optionality of phrases (e.g. [AB][CD][EF], [AB][CD]), (ii) the movement of phrases (e.g. [AB][EF][CD]), (iii) the repetition of phrases (e.g. [AB][CD][EF][CD]) and (iv) form classes of di¤erent sizes (e.g. [AB][CD][EF] with 4 tokens in A, C and E, and 2 tokens in B, D and F). As controls, the authors used the same artificial grammar with the same manipulations as in the experimental conditions, except that manipulations were allowed to apply to any two adjacent form classes irrespectively of phrasal bracketing (e.g. optional drop: ABCF, movement: ADEFBC etc.), not only to those constituting a phrase. Participants learned the canonical linear order of form classes better than chance in all experimental and control conditions, but participants in the experimental conditions outperformed participants in the control conditions. These results indicate that phrasal bracketing is not necessary for the acquisition of simple surface ordering, but when available, it improves performance. Thompson and Newport (2007) thus argue that syntactic operations such as movement, repetition or ellipsis created statistical distributions in surface linguistic forms that learners could use to detect phrase boundaries and extract phrasal grouping. This finding, they claim, does not imply that statistical information alone is su‰cient for learners to extract phrase structure from complex natural language input. Rather, as they argue, other converging cues, such as prosody or semantics, are probably necessary to signal the full complexity of the syntactic structure of natural languages (see also Takahashi and Lidz 2008). Indeed, Gervain et al. (2008), and Bion, Benavides & Nespor (2011), as well as Gervain and Werker (under revision) have shown that word frequency and prosody, respectively, help prelexical infants acquire the basic word order of their native language(s). One of the basic design features of natural languages is the division of labor between frequent functional elements, which signal grammatical relations, and less frequent content words, which carry lexical meaning. At the level of word tokens, functors are typically highly frequent, whereas content words are infrequent. Indeed, in large corpora of typologically di¤erent languages, the 30–50 most fre-

quent words have been found to be functors (Gervain et al. 2008; Kucera and Francis 1967). Further, the relative position of functors and content words has been shown to correlate with basic word order (Dryer 1992; Gervain et al. 2008), e.g. postpositions are found in the O(bject)-V(erb) language Japanese (Tokyo ni Tokyo from; lit. ‘from Tokyo’), whereas prepositions are found in Italian, a VO language (sul tavolo on-the table; lit. ‘on the table’). Consequently, tracking the most frequent words and their position relative to infrequent words is a good heuristic strategy to determine the basic word order of a language. Gervain et al. (2008) found that 8-month-old, i.e. prelinguistic infants were indeed able to use this strategy. The authors constructed an artificial grammar in which frequent and infrequent words alternated. A four-syllable long basic unit, AXBY, where categories A and B had one token each ( fi and ge, respectively), whereas categories X and Y contained nine di¤erent tokens each, was repeatedly concatenated. In the resulting stream (gedofidegekufiragekafipa. . .), A and B tokens were nine times more frequent than X and Y tokens (although the categories themselves were equally frequent). This stream was presented to monolingual Japanese and monolingual Italian infants for a familiarization time of 4 minutes. The initial and final 15 sec of the stream were ramped in amplitude, masking phase information. As a result, the stream was ambiguous in structure between a frequent word initial (AXBY: [ gedofide] [ gekufira] [ ge. . .) and a frequent word final parse (XBYA: [dofidege] [kufirage] [ka. . .). Test items were four-syllable-long sequences that followed the frequent-initial or the frequent-final parse (gedofide and kufirage, respectively). Four test trials of each type were presented to infants and their looking times were measured using the headturn preference procedure. As predicted, Japanese infants looked longer at frequent-final items than frequent-initial ones, while Italian infants had the opposite preference. Thus infants showed a preference for the word order that was characteristic of their native language, indicating that they track frequent words and use them as anchor points with respect to which the position of infrequent content words can be encoded (Braine 1963, 1966; Valian and Coulson 1988). To test the universal applicability of the anchoring hypothesis, Gervain et al. (under revision) has extended the empirical scope of the investigations by testing an additional OV language, Basque and an additional VO language, French. In this case, adult participants were tested with an artificial language very similar to the one used with infants, except that three frequent and three infrequent categories were used (i.e. AXBYCZ).
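
The frequency-based anchoring design described above is also easy to reproduce schematically. In the sketch below, the A and B functor-like tokens are the ones named in the text (fi and ge), while the X and Y tokens are invented placeholders standing in for the nine content-like tokens per category used in the actual studies. The sketch shows only that concatenating AXBY units makes the frequent tokens about nine times more frequent than any individual infrequent token, and that the resulting stream is ambiguous between a frequent-initial and a frequent-final bracketing.

    import random
    from collections import Counter

    A, B = "fi", "ge"                               # single tokens, as in the original design
    X = [f"x{i}" for i in range(9)]                 # placeholder infrequent (content-like) tokens
    Y = [f"y{i}" for i in range(9)]

    units = [[A, random.choice(X), B, random.choice(Y)] for _ in range(200)]
    stream = [s for u in units for s in u]

    counts = Counter(stream)
    print("frequency of 'fi':", counts[A], " of one X token:", counts[X[0]])

    # The same stream is compatible with two bracketings:
    #   frequent-initial  [A X B Y][A X B Y] ...   (functor-initial, VO-like)
    #   frequent-final    A [X B Y A][X B Y A] ... (functor-final, OV-like)
    print("frequent-initial parse of first unit:", stream[0:4])
    print("frequent-final parse:", stream[1:5])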

As predicted, speakers of the OV languages, i.e. Japanese and Basque participants, preferred the frequent word final test items significantly more often than VO speakers, i.e. Italian and French participants, confirming that this frequency-based anchoring mechanism is indeed sensitive to the word order type of the native language. These findings might also shed some light on the absence of phrase structure learning in the variable class size condition of Thompson and Newport’s (2007) study. Their variable class size manipulation created a stream similar to the one used here, in which more frequent and less frequent word tokens alternate. (Note, however, that the ratio between the frequency of the two types, 2:4 in Thompson and Newport (2007) vs. 1:9 in Gervain et al. (under revision), remains an important di¤erence between the two experiments.) The explanation for the failure of Thompson and Newport’s (2007) participants to learn phrase structure might not be the level at which they were tracking statistical information (form classes, and not word tokens), as the authors claim, but rather the fact that they were taught and tested on frequent word final phrases, the order of which goes against their Englishspeaking participants’ native frequent-initial word order. Future work is needed to test this possibility. The above findings suggest that di¤erences in word frequency and the resulting conditional probability distributions provide powerful cues to word order. However, as noted, these cues are not always su‰cient. This is the case of bilingual infants who are exposed to an OV and a VO language at the same time, e.g. Japanese and English, as these infants hear both frequent word initial and frequent word final patterns in their input. They need additional cues to discriminate between them. One such cue that has been proposed in the literature (Christophe et al. 2003; Nespor et al. 2008; Nespor and Vogel 1986) is phrasal prosody. Prosody might be a useful cue, because the physical realization of phrasal prominence correlates with word order and might provide a unique signal to it. In OV languages and in the OV phrases of mixed languages such as German or Dutch, prosodic prominence is realized as a pitch and intensity contrast, with the prosodically and informationally prominent infrequent content word being realized with higher pitch and intensity than the non-prominence frequent functor (‘Tokyo ni). In VO languages and in the VO phrases of mixed languages, phrasal prosody is carried by a durational contrast, with the infrequent content word being longer than the frequent functor (to Rome). This low-level, acoustic di¤erence between the patterns, i.e. a pitch/intensity contrast vs. a durational contrast, might be used to dis-

criminate the two word orders and grammars4. Indeed, Bion et al. (2011) found that specific acoustic markers of prominence influence the grouping of speech sequences by 7-month-old infants. Specifically, when familiarized with syllables alternating in pitch, infants showed a preference to listen to pairs of syllables that had high pitch in the first syllable. However, when familiarized with syllables alternating in duration, they did not show any preference. Gervain & Werker (under revision), however, found that 8-month-old OV-VO bilinguals are able to use prosody, in conjunction with word frequency, to selectively activate one or the other of their native word orders. The authors familiarized two groups of OV-VO bilinguals with an artificial grammar similar in structure to the one used in Gervain et al. (2008). However, in this case, prosody was added to the stream. One group of infants heard the stream with OV prosody, the other with VO prosody. In the test phase, they were tested on the same frequent-initial and frequent-final items as before, with no prosody. The infants preferred the word order that correlated with the prosody they heard during familiarization, i.e. infants exposed to the OV prosody preferred frequent-final items, while infants exposed to the VO prosody looked longer at frequent-initial ones. These results indicate that prelexical infants are able to combine prosody with statistical information to learn a basic syntactic property of their native language.

6. Constraints on statistical learning at the prosodic level: Prosodic contours as units for segmentation Fluent speech has been investigated as a continuous sequence of prosodically flat syllables in order to understand if language learners are sensitive to the purely distributional properties of speech. However, real speech is far from being a monotonous sequence of syllables, but is instead characterized as a hierarchy of prosodic units, ranging from the individual phonemes to utterances (Selkirk 1984, Nespor and Vogel 1986). 4. Whether there is an automatic, hard-wired auditory bias to group elements contrasting in pitch/intensity trochaically and elements contrasting in duration iambically (B. Hayes, 1995) or whether this grouping needs to be learned (Iversen, Patel, and Ohgushi 2008) is currently debated. Further studies are needed to determine the complete developmental trajectory of the bias. However, by about 6–8 months of age, the bias presents some asymmetries (Bion et al. 2011; Yoshida et al. 2010) and could be used to bootstrap syntax.

Therefore, several researchers have asked how the addition of prosody might influence statistical strategies for segmenting speech, but the results have been mixed. For example, Sa¤ran, Newport & Aslin (1996) presented American adults with continuous streams of syllables in which a minimal notion of prosody was implemented as the lengthening of the initial or final syllable of trisyllabic nonce words. Compared to a no-lengthening condition, these authors found that lengthening the initial syllable had no e¤ect (segmentation was better than chance in both conditions – with and without initial lengthening), while lengthening the final syllable enhanced segmentation. However, in a similar task, Toro, Rodrı´guez-Fornells, and Sebastia´n-Galle´s (2007) found above-chance performance but no significant di¤erences between initial-, final- and no-stress condition with Spanish adults. In contrast these authors found that random- or medially-placed stress resulted in chance performance. The results are not easy to interpret – English words typically have word-initial stress, while Spanish words typically have stress on the penultimate syllable. In fact, in an online segmentation and word detection task, Vroomen, Tuomainen & Gelder (1998, Experiment 3) found that stress placement a¤ected performance in a language-specific manner: Finnish and Dutch speakers benefited from word-initial stress, which is the common pattern in these two languages. Toro et al. (2007) therefore proposed that, in continuous speech streams made up entirely of nonsense words, perceptual factors like (the acoustic correlates of ) stress serve as anchor points. Segmentation is attempted around these points. Their findings thus indicate that the processing of linguistically impoverished artificial speech might engage non-linguistic, general auditory mechanisms more than specifically linguistic ones. Turning to infant data, Thiessen and Sa¤ran (2003) exposed 7- and 9month-old American infants to continuous syllable streams that pitted prosodic (stress) cues against statistical (TP) cues. The 9-month-olds used the stress location to segment the streams, preferring stress-initial words – the common pattern for English, while the 7-month-olds preferred the iambic, high-TP sequences as words. These results extended earlier findings that between 7 and 9 months of age, English-speaking infants prefer trochaic (stressed-unstressed) words (Jusczyk, Cutler, and Redanz 1993; Echols, Crowhurst, and Childers 1997; Jusczyk, Houston, and Newsome 1999; Johnson and Jusczyk 2001). More recent data shows that 4-montholds are already sensitive to the stress pattern of their native language (Friederici, Friedrich, and Christophe 2007), echoing previous findings that even 6-month-old American infants prefer the trochaic pattern, irrespective of the order of the syllables (Morgan and Sa¤ran 1995). There-

fore, it now appears that infants might learn the common word stress pattern of their language early on, but place greater reliance on the stress pattern for segmenting speech only at a later developmental stage. However, all these studies rely on lexical stress patterns, which are clearly language-specific. In contrast, it has been hypothesized that larger prosodic phrases like intonational phrases or utterances might be universal, being based on physiological constraints (e.g., Maeda 1974; Lieberman and Blumstein 1988; Strik and Boves 1995). Indeed, it is known that by 2 months of age, infants are already sensitive to prosodic phrases. For example, they are better at memorizing the order of two words when the words form part of the same prosodic phrase, compared to when a prosodic phrase boundary separates the two (Mandel, Jusczyk, and Nelson 1994; Mandel, Kemler, and Jusczyk 1996). Several studies have shown that infants are sensitive to larger prosodic boundaries (Hirsh-Pasek et al. 1987; Jusczyk 1989; Kemler et al. 1989; Jusczyk, Pisoni, and Mullennix 1992; Morgan, Swingley, and Mitirai 1993; Morgan 1994; Pannekamp, Weber, and Friederici 2006). For example, infants prefer utterances with pauses (Hirsh-Pasek et al 1987) or buzzes (Morgan, Swingley, and Miritai 1993) artificially inserted at clausal edges, rather than in the middle of clauses, and show neurophysiological correlates for intonational phrase boundaries similar to adults (Pannekamp, Weber and Friederici 2006). A similar conclusion may also be drawn from a study by Seidl and Johnson (2006), where the authors found that 8-month-olds were better at segmenting nonce words from the edges than the middles of sentences in a passage. How can prosody help locate word boundaries in fluent speech? While stressed syllables might indicate the beginnings or ends of words, phrasal prosody instead indicates the boundaries of series of words. According to the principles of the prosodic hierarchy (Nespor and Vogel 1986), words are aligned with the edges of such larger prosodic phrases. Therefore, the language learner might hypothesize that the edges of large prosodic phrases are also word edges. That is, words are not expected to straddle the boundary of one prosodic phrase and the next. In order to test this hypothesis, Shukla, Nespor, and Mehler (2007) exposed Italian adults to continuous sequences of syllables, which were made up of one set of syllables that occurred at random (noise syllables), and a second set of syllables making up four nonce words that were interspersed within the noise syllables. These streams were either monotonous or were generated as a continuous series of intonational phrases (IPs), mimicking Italian IP prosody. The critical manipulation was to place two of the four nonce words internal to the prosodic contours (IPs), while the

other two straddled adjacent IPs. These authors found that, while in the absence of prosody all four words were preferred over part-words5, in the prosodic condition only the IP-internal words showed evidence of being segmented, while the IP-straddling words did not. This pattern of results was found even when the prosodic phrases mimicked IPs recorded in Japanese, a language that the participants were completely unfamiliar with. Shukla, Nespor, and Mehler (2007) also compared the segmentation of trisyllabic nonce words that were internal to or at the edges of the IPs as in the experiments reported above. Recall (Section 4) that perceptual edges have been proposed as perceptual primitives that aid in the bootstrapping of language. In line with this, the authors found that nonce words at the edges of the IPs were better segmented than those in the middles. Of course, these experiments with adults only provide an existential proof for an interaction between phrasal prosody and segmentation in carving words out of fluent speech. It is not clear if prelinguistic infants can utilize the purportedly universal boundary cues that accompany larger prosodic phrases in a similar manner. That is, adults might be relying on a learnt heuristic (real words are aligned with prosodic phrase edges) and extrapolating to the artificial stimuli. Therefore, more recently, Shukla, White, and Aslin (in press) exposed 6-month-old American infants to a simplified artificial language, in which bisyllabic, high-TP target (nonce) words were embedded in short, two-IP utterances. For example, the target word mu-ra can occur in the utterance j‘-mu-ra # le-s‘ 6, or in the utterance j‘-mu # ra-le-s‘, where the # represents an IP boundary. Further, in this task the infants were required to learn an association between (portions of ) the heard sentences and an on-screen target object. Following an initial training phase, infants were exposed to a test phase that was a variant of the looking-while-listening paradigm (cf. Fernald et al. 2008). It was found that 6-month-olds indeed show di¤erential looking responses only to the high-TP, high-frequency bisyllable corresponding to the word mu-ra, but showed no di¤erential responses to the part-words. Critically, while infants familiarized with sentences in which the word was aligned with a prosodic edge mapped the word onto the target, infants familiarized with statistically identical sentences, but with words straddling the prosodic boundary, mapped the word onto distractor objects on the screen. The authors conclude that only when the statistically coherent sequence (the 5. Due to the design of the experiment, the part-words had a frequency of zero, and were hence more comparable to the non-words from previous paradigms. 6. The /‘/ designates a reduced, centralized vowel, like schwa.

word) is internal to a phrasal prosodic constituent can infants simultaneously segment it and map it onto referents at a very early age. Utterances and IPs are particularly well-marked in the speech input, thus providing clear perceptual boundaries. The above-mentioned studies indicate that adults, and even pre-lexical infants, are constrained to treat large prosodic phrase boundaries as word boundaries. However, there is also evidence that boundaries of smaller phrases, like the phonological phrase, also constrain the location of words in fluent speech (Christophe et al. 2003; Soderstrom et al. 2003). In an artificial language setting, Shukla and Nespor (2008) exposed adult Italian participants to syllable sequences in which nonce words had high TPs over the syllables and the consonants, but not over the vowels alone; the sequences were so constructed that the vowels of each statistical part-word were all the same (e.g., DOMO[PU-SUBU]GA, where the part-word in square brackets only contains the vowel ‘U’). Adult responses indicated that they restricted computation of TPs to the part-words; i.e., to syllable sequences misaligned with the high-TP nonce words. They thus hypothesized that, since prosodic domains are primarily signaled through the vocalic tier (Nespor and Vogel 1986), the presence of identical vowels might bind the syllables of the partword into a prosodic domain, and TPs might be constrained within such domains. Indeed, earlier studies (Suomi, McQueen, and Cutler 1997; Vroomen, Tuomainen, and Gelder 1998) have suggested that vowel harmony, a phenomenon wherein, in some languages, all the vowels in a word share a phonetic feature, constrains segmentation in a segmentation and word-detection task. These findings underline the importance of the vocalic tier in determining prosodic groups, the edges of which are constrained to coincide with the edges of words by the language system (see also Section 3.1).
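
The constraint that words should not straddle major prosodic boundaries can be expressed as a simple filter on statistically defined candidates. The utterance, the candidate words, and the boundary position in the sketch below are invented; the sketch is only meant to illustrate the logic suggested by Shukla, Nespor, and Mehler (2007): first compute statistically coherent candidate words, then discard any candidate that spans an intonational-phrase boundary.

    # Syllabified utterance with the position of an intonational-phrase (IP)
    # boundary given as an index in the syllable sequence (invented example).
    syllables = ["je", "mu", "ra", "le", "se", "do", "ki", "na"]
    ip_boundary_after = 2        # an IP boundary falls after the third syllable

    # Suppose a TP-based segmenter proposed these bisyllabic candidates,
    # each given as (start_index, syllable_pair).
    candidates = [(1, ("mu", "ra")), (2, ("ra", "le")), (5, ("do", "ki"))]

    def straddles_boundary(start, length, boundary):
        """True if the candidate occupies syllables on both sides of the boundary."""
        return start <= boundary < start + length - 1

    accepted = [c for c in candidates
                if not straddles_boundary(c[0], len(c[1]), ip_boundary_after)]

    # ("ra", "le") is rejected because it crosses the IP boundary, even though
    # its internal TP might be just as high as that of the other candidates.
    print(accepted)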

7. Conclusion

Statistical regularities of any kind derive from underlying structured processes. Describing the observed statistical distribution alone does not constitute an explanation of the underlying mechanisms that generate the observed regularities. Indeed, the observation of a surface distributional regularity can be taken as a signature of an underlying process that must be discovered. For example, the distribution of the locations of photons passing through a narrow slit is hypothesized to arise from the quantum physics of light. Climatic patterns, to provide another example, are likely to be under-

stood as the complex interaction of geophysical forces. Similarly, the nonrandom distributional structure of human language implies an underlying generative system. Organisms cannot perceive the distributional structure of stimuli that are not perceived by the sensorium. For example, humans cannot directly perceive ultraviolet light, and hence cannot discover any statistical structure defined over shades of ultraviolet. However, perceiving the distributional structure alone does not ensure the extraction of the appropriate underlying generative mechanisms. Thus, for example, while humans and monkeys share several aspects of auditory perception, only humans induce a generative grammatical system when exposed to auditory speech input. Indeed, the ease and uniformity of language acquisition by human infants suggest that there are species-specific cognitive traits that enable only humans to acquire generative systems given the speech signal. In this view, therefore, infants are programmed to attend to specific cues in the speech input and use these to automatically induce a grammar. While it has been convincingly shown that infants are sensitive to distributional regularities in the linguistic input, this is just one of many cues that aid in the discovery of the underlying structure. We argue that any cue – distributional, prosodic, or general perceptual – is useful only in so far as the language learner can utilize it to induce the appropriate underlying generative mechanism. Further, as we have shown in this chapter, the appropriate cues for learning a certain aspect of language might arise from an interaction between two or more cues. For example, we showed that even infant learners make a distinction between statistically coherent bisyllables that are prosodically either well- or ill-formed, and only treat the former as potential word candidates. While there has been a wealth of data demonstrating the potential use of distributional information in language learning, most studies have removed other cues inherent in speech (like prosody), in order to isolate the purely statistical learning mechanisms that infants possess. However, the kinds of generalizations that infants make with these stripped-down versions of natural speech might not be implemented when they are exposed to the full complexity of speech. Therefore, future research is needed to understand not just which cues are available to the language learner, but also which cues are actually used to make the appropriate inferences about the underlying grammatical system. Finally, while di¤erent aspects of language like phonology or morphology might rely on di¤erent sets of cues in the input, the relative importance of such cues might show variation across languages (e.g. Yang 2004). It is therefore fundamental to

investigate these questions in a cross-linguistic perspective, taking into account language variation.

References Abla, Dilshat, Kentaro Katahira and Kazuo Okanoya 2008 On-line Assessment of Statistical Learning by Event-related Potentials. Journal of Cognitive Neuroscience 20(6): 952–964. Aslin, Richard. N., Jenny R. Sa¤ran and Elissa Newport 1998 Computation of conditional probability statistics by 8-monthold infants. Psychological Science 9(4): 321–324. Bion, Ricardo, Silvia Benavides and Marina Nespor 2011 Acoustic markers of prominence influence adults’ and infants’ memory of speech sequences. Language and Speech. Braine, Martin D. 1963 On learning the grammatical order of words. Psychological Review 70(4): 323–348. Braine, Martin D. 1966 Learning the positions of words relative to a marker element. Journal of Experimental Psychology 72(4): 532–540. Buiatti, Marco, Marcela Pena and Ghislaine Dehaene-Lambertz 2009 Investigating the neural correlates of continuous speech computation with frequency-tagged neuroelectric responses. NeuroImage 44(2): 509–519. Chomsky, Noam 1959 A review of B. F. Skinner’s Verbal Behavior. Language 35(1): 26– 58. Christophe, Anne, Arielle Gout, Sharon Peperkamp and James Morgan 2003 Discovering words in the continuous speech stream: The role of prosody. Journal of Phonetics 31: 585–598. Christophe, Anne, Marina Nespor, Maria Teresa Guasti and Brit van Ooyen 2003 Prosodic structure and syntactic acquisition: The case of the head-direction parameter. Developmental Science 6(2): 211–220. Conway, Christopher M. and Morten H. Christiansen 2001 Sequential learning in non-human primates. Trends in Cognitive Sciences 5: 529–546. Conway, Christopher M. and Morten H. Christiansen 2005 Modality-constrained statistical learning of tactile, visual, and auditory sequences. Journal of Experimental Psychology: Learning, Memory, and Cognition 31(1): 24–39. Dryer, Matthew S. 1992 The Greenbergian Word Order Correlations. Language 68: 81– 138.

Elman, Je¤rey L., Elisabeth A. Bates, Mark H. Johnson, Annette Karmilo¤Smith, Domenico Parisi and Kim Plunkett 1996 Rethinking Innateness: A Connectionist Perspective on Development. Cambridge, Mass.: MIT Press. Endress, Ansgar. D. and Luca L. Bonatti 2007 Rapid learning of syllable classes from a perceptually continuous speech stream. Cognition 105(2): 247–299. Endress, Ansgar. D., Marina Nespor and Jacques Mehler 2009 Perceptual and memory constraints on language acquisition. Trends in Cognitive Sciences 13(8): 348–353. Farmer, Thomas A., Morten H. Christiansen and Padraic Monaghan 2006 Phonological typicality influences on-line sentence comprehension. Proceedings of the National Academy of Sciences 103: 12203– 12208. Fernald, Anne, Renate Zangl, Ana Luz Portillo and Virginia Marchman 2008 Looking while listening: Using eye movements to monitor spoken language comprehension by infants and young children. In: Irina A. Sekerina, Eva M. Fernandez and Harald Clahsen (eds.), Developmental Psycholinguistics: On-line Methods in Children’s Language Processing, 87–135. Amsterdam: John Benjamins. Fiser, Jo´zsef and Richard N. Aslin 2001 Unsupervised statistical learning of higher-order spatial structures from visual scenes. Psychological Science 12(6): 499–504. Fiser, Jo´zsef and Richard N. Aslin 2002 Statistical learning of higher-order temporal structure from visual shape sequences. Journal of Experimental Psychology: Learning, Memory, and Cognition 28(3): 458–467. Fiser, Jo´zsef and Richard N. Aslin 2005 Encoding multielement scenes: statistical learning of visual feature hierarchies. Journal of Experimental Psychology: General 134(4): 521–537. Forster, Kenneth I. and Susan M. Chambers 1973 Lexical access and naming time. Journal of Verbal Learning and Verbal Behavior 12(6): 627–635. Gallistel, C. Randy and Adam P. King 2009 Memory and the computational brain: Why cognitive science will transform neuroscience. New York: Wiley/Blackwell. Gerken, LouAnn 2006 Decisions, decisions: Infant language learning when multiple generalizations are possible. Cognition 98: B67–B74. Gerken, LouAnn 2010 Infants use rational decision criteria for choosing among models of their input. Cognition 115(2): 362–366. Gervain, Judit and Jacques Mehler 2010 Speech Perception and Language Acquisition in the First Year of Life. Annual Review of Psychology 61: 191–218.

Gervain, Judit, Marina Nespor, Reiko Mazuka, Ryota Horie and Jacques Mehler 2008 Bootstrapping word order in prelexical infants: A Japanese-Italian cross-linguistic study. Cognitive Psychology 57(1): 56–74. Gervain, Judit, Nuria Sebastia´n-Galle´s, Begona Dı´az, Itziar Laka, Reiko Mazuka, Naoto Yamane, Marina Nespor and Jacques Mehler under revision Word Frequency Bootstraps Word Order: Cross-Linguistic Evidence. Gervain, Judit and Janet F. Werker under revision Prosody Cues Word Order in 7-month-old Bilingual Infants. Goldsmith, John A. 1976 An overview of autosegmental phonology. Linguistic Analysis 2: 23–68. Go´mez, David. M., Ricardo A. H. Bion and Jacques Mehler 2010 The word segmentation process as revealed by click detection. Language and Cognitive Processes 26(2): 212–223. Harris, Zelig 1955 From phoneme to morpheme. Language 31: 190–222. Hayes, Bruce 1995 Metrical stress theory: Principles and case studies. Chicago: University of Chicago Press. Hayes, J. R. and Herbert H. Clark 1970 Experiments in the segmentation of an artificial speech analogue. In: J. R. Hayes (ed.), Cognition and the Development of Language, 221–234. New York: Wiley. Hochmann, Jean-Re´my under review Word Frequency, Function Words and the Second Gavagai Problem. Hunt, Ruskin H. and Richard N. Aslin 2001 Statistical learning in a serial reaction time task: Access to separable statistical cues by individual learners. Journal of Experimental Psychology: General 130(4): 658–680. Iversen, John R., Aniruddh D. Patel and Kengo Ohgushi 2008 Perception of rhythmic grouping depends on auditory experience. Journal of the Acoustical Society of America 124(4): 2263–2271. Jusczyk, Peter W., David B. Pisoni and John Mullennix 1992 Some consequences of stimulus variability on speech processing by 2-month-old infants. Cognition 43(3): 253–291. Kirkham, Natasha Z., Jonathan Slemmer and Scott P. Johnson 2002 Visual statistical learning in infancy: evidence for a domain general learning mechanism. Cognition 83(2): B35–B42. Kucera, Henry and W. Nelson Francis 1967 Computational analysis of present-day American English. Providence: Brown University Press.

Kudo, Noriko, Yulri Nonaka, Katsumi Mizuno and Kazuo Okanoya 2006 Statistical learning and word segmentation in neonates: an ERP evidence. Paper presented at the annual meeting of the XVth Biennial International Conference on Infant Studies. Lieberman, Philip and Sheila E. Blumstein 1988 Speech physiology, speech perception, and acoustic phonetics. In: Cambridge Studies in Speech Science and Communication. Cambridge: Cambridge University Press. Loui, Psyche, Elaine H. Wu, David L. Wessel and Robert T. Knight 2009 A generalized mechanism for perception of pitch patterns. The Journal of Neuroscience 29(2): 454–459. Maeda, Shinji 1974 A characterization of fundamental frequency contours of speech. Speech Communication 16: 193–210. Marchetto, Erika 2009 Discovering words and rules from speech input: an investigation into early morphosyntactic acquisition mechanisms. PhD Dissertation, SISSA, Trieste, Italy. Mehler, Jacques, Christophe Pallier and Anne Christophe 1996 Psychologie cognitive et acquisition des langues [Cognitive psychology and language acquisition]. Medecine Sciences 12: 94–99. Morgan, James L. and Jenny R. Sa¤ran 1995 Emerging integration of segmental and suprasegmental information in prelingual speech segmentation. Child Development 66: 911–936. Morgan, James L. 1994 Converging measures of speech segmentation in preverbal infants. Infant Behavior and Development 17: 389–403. Morgan, James L., Daniel Swingley and Kumiko Mitirai 1993 Infants listen longer to speech with extraneous noises inserted at clause boundaries. Paper presented at the biennial meeting of the Society for Research in Child Development. Nespor, Marina and Irene Vogel 1986 Prosodic phonology (Studies in Generative Grammar Vol. 28). Holland: Dordrecht. Reprinted 2007 Germany: Foris. Nespor, Marina, Mohinish Shukla, Ruben van de Vijver, Cinzia Avesani, Hanna Schraudolf and Caterina Donati 2008 Di¤erent phrasal prominence realization in VO and OV languages. Lingue e Linguaggio 7(2): 1–28. Newport, Elissa L. 1999 Reduced input in the acquisition of signed languages: Contributions to the study of creolization. In: Michel DeGra¤ (ed.). Cambridge, MA: MIT Press. Newport, Elissa L. and Richard N. Aslin 2000 Innately constrained learning: Blending old and new approaches to language acquisition. In: Catherine S. Howell, Sarah A. Fish

and Thea Keith-Lucas (eds.), Proceedings of the 24th Annual Boston University Conference on Language Development. Somerville, MA: Cascadilla Press. Onnis, Luca, Padraic Monaghan, Morten H. Christiansen and Nick Chater 2004 Variability is the spice of learning, and a crucial ingredient for detecting and generalising nonadjacent dependencies. Proceedings of the 26th Annual Conference of the Cognitive Science Society. Pannekamp, Ann, Christiane Weber and Angela D. Friederici 2006 Prosodic processing at sentence level in infants. NeuroReport 17: 675–678. Pen˜a, Marcela, Lucal L. Bonatti, Marina Nespor and Jacques Mehler 2002 Signal-driven computations in speech processing. Science 298(5593): 604–607. Pinker, Steven 1984 Language learnability and language development. Cambridge, MA: Harvard University Press. Sa¤ran, Jenny R., Richard N. Aslin and Elissa L. Newport. 1996 Statistical learning by 8-month-old infants. Science 274(5294): 1926–1928. Sa¤ran, Jenny R., Elizabeth K. Johnson, Richard N. Aslin and Elissa L. Newport 1999 Statistical learning of tone sequences by human infants and adults. Cognition 70(1): 27–52. Seidl, Amanda and Elizabeth K. Johnson 2006 Infant word segmentation revisited: Edge alignment facilitates target extraction. Developmental Science 9: 565–573. Selkirk, Elisabeth O. 1984 Phonology and Syntax: The relation between sound and structure. Cambridge, MA: MIT Press. Shannon, Claude E. 1948 A mathematical theory of communication. Bell System Technical Journal 27: 379–423 and 623–656. Shukla, Mohinish and Marina Nespor 2008 The vowel tier constrains statistical computations over the consonantal tier. Poster presented at AMLaP 2008, Cambridge, UK. Shukla, Mohinish, Marina Nespor and Jacques Mehler 2007 An interaction between prosody and statistics in the segmentation of fluent speech. Cognitive Psychology 54(1): 1–32. Shukla, Mohinish, Katherine S. White and Richard N. Aslin in press Prosody guides the rapid mapping of auditory word forms onto visual objects in 6-mo-old infants. Proceedings of the National Academy of Sciences. Spinoza, Baruch [1677] Hebrew Grammar.

202

Mohinish Shukla, Judit Gervain, Jacques Mehler and Marina Nespor

Strik, Helmer and Louis Boves 1995 Downtrend in F0 and Psb. Journal of Phonetics 23: 203–220. Suomi, Kari, James M. McQueen and Anne Cutler 1997 Vowel harmony and speech segmentation in Finnish. Journal of Memory and Language 36(3): 422–444. Takahashi, Eri and Je¤rey Lidz 2008 Beyond statistical learning in syntax. In Anna Gavarro´ and M. Joa˜o Freitas (eds.), Proceedings of Generative Approaches to Language Acquisition (GALA) 2007, 446–456. Newcastle-uponTyne: Cambridge Scholars Publishing. Teinonen, Tuomas, Vineta Fellman, Risto Na¨a¨ta¨nen, Paavo Alku and Minna Huotilainen 2009 Statistical language learning in neonates revealed by event-related brain potentials. BMC Neuroscience 10: 21–28. Thompson, Susan P. and Elissa L. Newport 2007 Statistical learning of syntax: The role of transitional probability. Language Learning and Development 3(1): 1–42. Tomasello, Michael 2000 Do young children have adult syntactic competence? Cognition 74(3): 209–253. Toro, Juan M and Josep B. Trobalon 2005 Statistical computations over a speech stream in a rodent. Perception and psychophysics 67(5): 867–875. Toro, Juan M., Antoni Rodriguez-Fornells and Nu´ria Sebastia´n-Galle´s 2007 Stress placement and word segmentation by Spanish speakers. Psicolo´gica 35: 53–76. Toro, Juan M., Mohinish Shukla, Marina Nespor and Ansgar D. Endress 2008 The quest for generalizations over consonants: Asymmetries between consonants and vowels are not the by-product of acoustic di¤erences. Perception and Psychophysics 70(8): 1515–1525. Valian, Virginia and Seana Coulson 1988 Anchor points in language learning: The role of marker frequency. Journal of Memory and Language 27: 71–86. Vroomen, Jean, Jyrki Tuomainen and Beatrice de Gelder 1998 The roles of word stress and vowel harmony in speech segmentation. Journal of Memory and Language 38(2): 133–149. Yang, Charles D. 2004 Universal Grammar, statistics or both? Trends in cognitive sciences 8(10): 451–456. Yoshida, Katherine A., John R. Iversen, Aniruddh D. Patel, Reiko Mazuka, Hirmi Nito, Judit Gervain and Janet F. Werker 2010 The development of perceptual grouping biases in infancy: A Japanese-English cross-linguistic study. Cognition 115: 356–361. Zipf, George K. 1935 The Psychobiology of Language. Boston, MA: Houghton Mi¿in.

The potential contribution of statistical learning to second language acquisition

Luca Onnis

1. Introduction

Many fundamental aspects of human learning can be characterized as problems of induction – finding patterns and generalizations in space and time in conditions of uncertainty and from limited exposure. Some of these problems include deriving abstract categories from experience (e.g., Tenenbaum & Griffiths, 2001); learning word meanings from their co-occurrence with perceived events in the world (e.g., Frank, Goodman, & Tenenbaum, 2009; Yu & Smith, 2007); learning the similarity and difference of meanings from their co-occurrence with other words (Landauer & Dumais, 1997); and acquiring the different levels of linguistic structure (e.g., Bod, 2009; Solan, Horn, Ruppin, & Edelman, 2005). Although the areas of application and specific theoretical claims vary, all these forms of inductive learning can be described under a common framework for problems arising in developmental psychology (Gopnik & Tenenbaum, 2007), inductive reasoning (Chater & Oaksford, 2008), language acquisition (Bannard, Lieven, & Tomasello, 2009; Solan, Horn, Ruppin & Edelman, 2005), computational linguistics (Bod, 2002; Chater & Manning, 2006; Jurafsky, 2003), and machine learning (MacKay, 2003). This common framework encompassing experimental and computational approaches can be termed distributional or statistical learning, because the focus is on how learners discover structure from probabilistic information in the environment.1

1. Here I mainly use the term statistical learning to comply with the general trend in the literature. Terms like distributional/probabilistic learning/approaches are equally viable and often used interchangeably in the literature.

Behavioral studies have indicated that infants, toddlers, and adults can rapidly extract structural properties of stimuli from probabilistic information inherent in the input they are exposed to. For example, before the age of three children implicitly use frequency distributions to learn which phonetic units distinguish words in their native language (Kuhl, 2002; 2004;


Maye, Werker, & Gerken, 2002), use the transitional probabilities between syllables to segment words (Sa¤ran, Aslin, & Newport, 1996), use word distributions to discover syntactic-like relations among adjacent and nonadjacent elements (see also Hay & Lany, this volume), learn to form abstract categories (Gomez & Gerken, 2000), and rapidly establish formmeaning mappings under conditions of uncertainty (Smith & Yu, 2008). The behavioral findings on humans’ remarkable statistical learning abilities have been enhanced and complemented by the rapid development of robust and sophisticated computational methods that learn from corpora of natural language (e.g. Bayesian, connectionist, dynamical systems; see Chater & Manning, 2006; Gri‰th et al., 2010; McClelland et al., 2010). Such methods make it possible to obtain detailed information on the nature of the input to which learners are exposed (e.g., distributional properties of language), as well as the structured environment in which learners interact (e.g. parent-child naturalistic dyadic interactions, Dale & Spivey, 2006; Goldstein, King, & West, 2003; Roy, 2003). Importantly, these methods now allow researchers to simulate the putative mechanisms responsible for the behavioral findings. This chapter asks what specific role statistical learning might play in understanding processes of second language (L2) learning after the acquisition of a first language. My first goal is to propose that L2 learners may (be made to) become attuned to useful distributional regularities (to be discussed below). Such regularities correlate non-randomly with structural properties of language, for instance, phonetic boundaries, word units in connected speech, phrasal constituents, morphemic structure, and lexical semantics, suggesting that at least part of the acquisition of language may involve the acquisition of knowledge of distributional regularities. In particular, I want to propose four general learning principles that can be gleaned from the statistical learning literature and applied to L2 learning scenarios. These principles are: (1) Integrate information sources; (2) Seek invariant structure; (3) Reuse learning mechanisms for di¤erent tasks; and (4) Learn to predict. These principles are exemplified in four studies that highlight the benefits of statistical learning at the sublexical, lexical, morpho-syntactic, phrasal and lexico-semantic levels. They all explore how distributional information can be brought to bear on assisting second language learning2. 2. Coverage here is selective and illustrative rather than comprehensive. Other important related literature can be found in the accompanying chapters of this book (e.g., Ellis & O’Donnell; Johnson; Williams & Rebuschat, this volume).


My second goal is to elaborate on how these principles derived from experimental studies in the laboratory can be put to use for specific problems arising in second language acquisition, and sketch out some practical suggestions for bridging this laboratory-style research with L2 instructional practices. The upshot is that statistical learning can be used as a diagnostic toolkit for identifying learner needs and pinpointing specific areas for improvement in adult learners of language. In addition, statistical learning principles can be e¤ectively implemented as supportive solutions to enhancing instruction and curricula. By considering the implications that statistical learning research may hold for practical aspects of learning, I hope to indicate some tentative directions in the integration of basic and applied research. Lastly, a third goal is to propose that computational analyses of language corpora and behavioral experiments can be jointly used in the service of the two goals above.

2. Origins and development of statistical learning The origins of probabilistic approaches to language can be traced back to structural linguistics, and the focus on finding regularities in languages. In the 1950s Zelig Harris (1954) proposed a series of heuristics for discovering phonemes and morphemes, based on the distributional properties of these units in natural languages. For Harris and distributional linguists, the process of discovering the structure of an unknown language (e.g., an indigenous language of the Amazon basin) was akin to cracking a code created with a secret language. This intuitive idea was explored mathematically by Claude Shannon (1948), who developed encryption/decryption systems during World War II based on the statistical structure of sequences of letters in an encrypted message. His concept of entropy in information theory describes how much ‘uncertainty’ there is in a signal transmitted over time. This uncertainty can be reduced if for example one knows that the frequency for the character E is much more common in English than the frequency of the character Z. Shannon (1951) contributed the first rigorous statistical approach to language as a sequence of letters. In the 1950s information-theoretic ideas inspired much work in the nascent cognitive psychology, including the first experimental studies on the learnability of formal linguistic systems. In a project named Grammarama, George Miller (1956, 1967) asked adult participants to memorize strings of letters such as XLLVXL that – unbeknownst to them – were either random or followed a set of grammatical rules of sequencing (e.g., L must follow X ). The grammatical strings were generated by devices


called finite-state grammars, a class of possible formal grammars, and were hence called artificial grammars. Miller found that participants memorized grammatical strings much more quickly than random strings, suggesting they had become sensitive to some of the rules generating such strings.
In the 1960s interest declined in both theoretical and behavioral approaches to distributional learning. The distributional methods of Harris were seen as insufficient to capture the hierarchical linguistic relations postulated by Chomsky (1957). In a similar way, Miller's attempts to explore the learnability of language-like systems were questioned because of a perceived lack of common ground between artificial grammars and natural languages to make plausible generalizations from one to the other. For a couple of decades distributional approaches played a minor role in language research. Artificial grammars were seen as better suited to the study of general processes of implicit learning, not necessarily related to language acquisition (e.g., Reber, 1967).
It was not until the 1990s that artificial grammars were applied to infants and toddlers (e.g., Mattys, Jusczyk, Luce & Morgan, 1999; Saffran, Aslin & Newport, 1996), documenting their remarkable abilities to use a variety of probabilistic regularities in the speech signal. Adult learners were also studied on the assumption that they usefully approximated 'human simulations' of infant learning (Gillette, Gleitman, Gleitman, & Lederer, 1999; Redington & Chater, 1996). Compared with the implicit learning studies of earlier decades, researchers began to create miniature languages that more closely mimicked distributional and structural aspects of natural languages (see sections 3 to 5 below for practical examples). The 1990s also saw the development of sophisticated computational analyses of language corpora. For example, Nick Chater and colleagues (Redington, Chater, and Finch, 1998) provided large-scale computational analyses of child-directed language transcriptions showing that distributional information may actually be extremely useful to children in acquiring abstract syntactic categories of words, such as nouns and verbs (see also Redington and Chater, 1996, 1997).
Recently, an even more direct link between statistical learning and natural language has been documented. Studies that directly compare statistical learning and language processing (e.g., in within-subject designs where the same participants are tested on statistical learning as well as natural language tasks) are finding that similar cognitive and neural mechanisms may be recruited for both syntactic processing of linguistic stimuli and statistical learning of structured sequence patterns more generally (Christiansen, Conway, & Onnis, 2012; Misyak & Christiansen, 2010). Moreover, a breakdown in statistical learning abilities has been documented in individuals diagnosed with agrammatic aphasia (Christiansen, Kelly, Shillcock & Greenfield, 2010) and in children with specific language impairment (SLI; Evans, Saffran, & Robe-Torres, 2009).
Having provided a brief historical overview of statistical learning, in the next sections I present research exemplifying the four learning principles to be applied to second language learning.

3. Learning principle I: Integrate probabilistic sources of information Words in natural languages exhibit a rich statistical sublexical structure. Speech sounds (and written words), both within and between words, do not occur with the same frequency and in any order, but display distributional regularities in the sequences they form. Are these regularities related to the order of sounds (phonotactics) and letters (orthotactics) a mere epiphenomenon, or can they actually provide useful cues to learn a language? Research to date already suggests that these properties of words are used in the context of segmenting speech (Sa¤ran, Aslin, & Newport, 1996), identifying phonetic contrasts (Thiessen, 2007), detecting orthographic (Pacton, Perruchet, Fayol, & Cleeremans, 2001) and phonotactic restrictions (Chambers, Onishi, & Fisher, 2003), and constraining speech production errors (Dell, Reed, Adams, & Meyer, 2000). Furthermore, knowledge of phoneme distributions may aid in di¤erent aspects of language learning simultaneously, such as speech segmentation and identification of lexical categories (Christiansen, Onnis, & Hockema, 2009). In line with this work, a preliminary study in my laboratory is investigating whether implicit knowledge of phonotactics (distribution of sounds in speech) and orthotactics (distribution of letters in text) might assist the learning of novel phonetic contrasts. As practical examples, both English and Chinese native speakers are insensitive to the singleton/ geminate distinction in Italian, e.g., /pala/ (shovel) versus /palla/ (ball), and Japanese learners of English need to learn a new phonemic distinction between /l/ and /r/. Methods targeted at improving speech perception contrasts have mainly focused on training learners with minimally dissimilar word pairs, such as ‘right-light’ (e.g., Akahane-Yamada et al., 2004; Lively, Pisoni, & Yamada, 1994). Although partially successful, this type of training potentially loses structural cues like phonotactic probability, since the training regime makes /r/ and /l/ equally likely in all contexts. The computational analyses and experiment described next are an initial


demonstration that sublexical distributional information may facilitate the identification of novel contrasts in adult second language learners.

3.1. Corpus analyses of probabilistic phonotactics

In order to assess the informativeness of contextual cues for predicting an /l/ or /r/ segment and the letters L and R in English words, statistical analyses were carried out on the English Lexicon Project (ELP; Balota et al., 2007),3 which includes a phonetic transcription. I calculated the probability of phonetic and orthographic sequences that immediately precede and follow the segments /l/ and /r/ and the letters L and R respectively. Immediate context was operationalized as two elements (segments or letters) to the left and to the right of a target element (the letter L or R, or the segment /l/ or /r/). I refer to such immediate contexts as phonotactic and orthotactic frames. For example, the word CURTAIN yields the frame U*T (if one flanking letter is considered to each side of R) and the frame CU*TA (if two flanking letters are considered to each side of R). The question was whether frames can be used to reliably predict the target segment. The degree of informativeness of a given frame can be estimated as the conditional probability as follows: P(/l/ | frame) = freq(frame occurring with /l/) / (freq(frame occurring with /l/) + freq(frame occurring with /r/)). For example: P(L | CU_TA) = freq(CULTA) / (freq(CULTA) + freq(CURTA)).

3. This corpus is composed of more than 40,000 English word types accompanied by their log-frequency of use. The corpus data reported here are part of a manuscript in preparation. The experimental data reported in Section 3.2 form part of a thesis for the Advanced Graduate Certificate in Second Language Studies at the University of Hawaii (Uchida, 2010).

For each frame type found in the ELP corpus, the conditional probability was estimated as above. Figure 1 provides a histogram showing the distribution of phonotactic frames (left panel) and orthotactic frames (right panel) as a function of how likely they are to flank an L, given the proportion of occurrences in the English corpus in which an L or an R was found. The bar height indicates the number of frame types with a given probability of having an L between them. There are 100 bins in the histogram, so each bin accounts for a probability range of .01.

Figure 1. Phonemes and letters immediately flanking L and R in English words are highly predictive of either L or R. y-axis: the number of frame types that predict an L (as opposed to an R); x-axis: the probability of predicting an L versus an R. Of the 589 phonotactic frame types and 589 orthotactic frame types found in English, most predict an L with a high degree of certainty (rightmost column) or predict an R with a high degree of certainty (leftmost column; a probability = 0 for predicting L equals a probability = 1 for predicting R). Analyses with frame tokens exhibit a similar bimodal distribution.

The figure illustrates that the distribution of frames in English words is strongly bimodal. Most frames are associated only with an L or with an R segment, but not both. Indeed, the left- and right-most bins account for 60% of the frame types occurring with L and R in speech. This analysis shows the distribution to be very informative in terms of identifying two distinct categories, and is similar in the case of letters and phonemes. In other words, a typical L1 or L2 speaker exposed to reasonable amounts of natural English input will experience L (but not R) mostly in the frame K_E and R (but not L) mostly in the frame Z_A. Even if this fact is unbeknownst to speakers at the conscious level, the information is important for SLA researchers and language teachers when identifying learning difficulties or when designing materials that may support the learning of the L/R distinction. This is because these consistent frames become predictive of when an L or an


R is more likely. Thus, speakers may act upon this information in their regular language use. As a corollary, if learners are helped to become sensitive to this type of cue, their implicit mechanisms may be tuned to perhaps learn to use it predictively as well. 3.2. An experiment with English pseudowords The computational analyses carried out on natural language such as the one above provide a way to estimate the potential informativeness of a cue inherent in a language – here phonotactics and orthotactics of English. Do people actually use such cues? This question was tested in a letter guessing game similar to the classic ‘hangman’, in which English native speakers and Japanese learners of English were presented with a list of orthographic pseudowords lacking one letter (e.g., SA*G ). The game consisted of guessing which one of two letters is the most likely for a given pseudoword. Critical trials contained the R-L pair (‘‘Is L or R the missing letter?’’), while filler trials contained other letter pairs (e.g., ‘‘Is M or N the missing letter?’’). The critical trials were 90 frames from the corpus analyses above, one third were sampled from the 30 most frequent frames in the left-most bin of Figure 1, or those being in principle very informative in predicting an L (L-informative). Another 30 critical trials were chosen among the 30 most frequent frames in the right-most bin of Figure 1 (R-informative). As a control, another 30 critical trials were frames sampled among those having closer to or equal to 0.5 probability, or those being the least informative (LR-ambiguous). Results indicated that both native English speakers and Japanese learners of English preferred L-responses most for trials containing frames predictive of L and least for trials containing frames predictive of R. For frames that were ambiguous between L and R, the di¤erence in preference for L or R was not significant for both native and non-native groups. Thus, both groups’ responses reflected the bimodal distribution of orthographic frames in English. Participants did not just make a random guess about a single letter in isolation. Rather, their linguistic choices under uncertainty were guided intuitively by the integration of a larger context of information, the sublexical distribution of letters in English. There were two further interesting results pertaining to the Japanese participants. First their preference for reading in English (measured on a self-assessment scale) correlated with better predictions for the missing letter, suggesting that experience with reading texts in a second language may naturally induce sensitivity to ortho-


tactics. Second, Japanese participants were poor on the classic perception discrimination task with various spoken tokens of /l/ versus /r/, which tests perception of sounds in isolation. The encouraging results on their ability to use orthotactics leaves open the possibility that when tested on a perception task that involves phonotactic frames these participants may improve their perception judgments and better discriminate /l/ and /r/ sounds. Thus, the hard problem of perceiving novel speech contrasts may be hardest when tested in isolation, and yet it may be attenuated in the presence of other sources of information available in the signal (coarticulation may be another cue not investigated here). At present, these results suggest that pronunciation practices that situate learning targets within highly informative phonotactic contexts may also be advantageous in principle. Inviting L2 English learners to listen to statements, then say whether they are ‘obviously true’, ‘strongly implied’, or ‘clearly false’ may help surmount problems with training using minimal pairs. Because these phrases exhibit a variety of /l/ and /r/ frames, some of them strongly favoring one phoneme or the other (refer to the underlined segments), they may be easier for learners who have already had some exposure to the language to produce. Also, the pedagogical emphasis can be on communication and intelligibility, rather than the far more di‰cult ability to distinguish between sounds in contexts disguising such distributional tendencies (e.g., ‘light’ and ‘right’). In the next section, I illustrate a case of useful distributional regularities above the lexical level, the discovery of non-adjacent morphosyntactic relations in language.
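Before turning to that principle, the frame-probability computation described in Section 3.1 can be made concrete with a short program. The sketch below is only an illustration of the logic, not the script used for the ELP analyses: the toy word list, the two-letter frame size, and the function name are assumptions introduced here for expository purposes, and a real analysis would also work over phonetic transcriptions.

    from collections import defaultdict

    def frame_probabilities(words, flank=2):
        """Estimate P(L | frame) for every orthotactic frame flanking L or R.

        A frame is the `flank` letters to the left and right of a target L/R;
        e.g., CURTAIN with flank=2 yields the frame CU_TA around its R.
        """
        counts = defaultdict(lambda: {"L": 0, "R": 0})
        for word in words:
            padded = "#" * flank + word.upper() + "#" * flank   # mark word edges
            for i, ch in enumerate(padded):
                if ch in ("L", "R"):
                    frame = padded[i - flank:i] + "_" + padded[i + 1:i + 1 + flank]
                    counts[frame][ch] += 1
        # P(L | frame) = freq(frame with L) / (freq(frame with L) + freq(frame with R))
        return {f: c["L"] / (c["L"] + c["R"]) for f, c in counts.items()}

    # Toy lexicon; the reported analysis used the ~40,000-word ELP corpus.
    for frame, p in sorted(frame_probabilities(
            ["curtain", "culture", "pilot", "pirate", "collect", "correct"]).items()):
        print(frame, round(p, 2))

Frames whose estimated probability is close to 0 or 1 correspond to the R-informative and L-informative contexts discussed above; frames near .5 are the ambiguous cases that served as controls in the experiment just described.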

4. Learning principle II: Seek invariance Various aspects of inflectional morphology, such as gender and number agreement on noun phrases and verb phrases, remain particularly di‰cult to master even for second language learners at advanced levels of proficiency (e.g., Montrul, Foote, & Perpinan, 2008; Slabakova, 2008). The phenomenon has generated a lively debate on the nature of such insensitivity, with some accounts claiming lack of accessibility to L1-like linguistic knowledge, and others placing the burden on online processing deficiencies (for a review, see Clahsen & Felser, 2006). While much attention has been devoted to the theoretical underpinnings of such insensitivity, and pedagogical research has addressed improving morphosyntax (see, e.g., Spada & Tomita’s 2010 meta-analysis of the e¤ects of instruction on simple and


complex linguistic features), few studies have taken underlying statistical learning ability into account, although studies have looked at the influence of frequency of forms (e.g., Ellis & Schmidt, 1998) and the role of implicit and explicit learning (e.g., Robinson, 2005). Are there ways to improve L2 processing abilities, for instance by making the target structures distributionally salient?

4.1. Artificial languages that mimic natural language

As noted in the introduction, artificial grammar learning tasks have been used extensively to inquire into the nature of human implicit processes (Cleeremans, Destrebecqz, & Boyer, 1998; Shanks, 2005) and their relationship to language knowledge (Kaufman, DeYoung, Gray, Jiménez, Brown, & Mackintosh, 2010; Misyak & Christiansen, 2011). Tasks that tap into language processes typically involve exposing participants to sentence-like sequences of word-like stimuli (presented either visually or auditorily), such as these: pel wadim jic, vot puser tood, dak wadim rud, vot loga tood. While these pseudo-sentences appear random, they respect some underlying rule defined a priori by the experimenter, and learners exposed to limited exemplars in relatively brief sessions end up becoming sensitive to such rules, even though they often cannot explicitly verbalize what the hidden rules were. For example, the pseudo-sentences above were constructed by Gomez (2002; Figure 2) to simulate the learning of nonadjacencies similar to morphosyntactic agreement and other non-local structural regularities in natural languages: each specific first word predicts a specific third word all the time. In the examples above, pel predicts jic, vot predicts tood, and dak predicts rud consistently (e.g., the probability P(jic | pel) = 1), while the second (middle) word has no predictive value; for example, wadim precedes any third word with equal probability (P(jic | wadim) = 0.33). It is possible to test learners' implicit knowledge after training by presenting grammatical (pel wadim jic) as well as ungrammatical sentences (*pel wadim rud), and even sentences that have zero probability: for instance, the sentence pel hiftam jic is not encountered during the learning phase, but it crucially maintains the correct structural non-adjacent relations (pel __ jic).
In my brief review of the origins of statistical learning I noted that an important development in the use of artificial grammars has been their much closer contact with natural language phenomena. For example, Gomez (2002) noted that sequences in natural languages typically involve some items belonging to a relatively small set (functor words and morphemes like am, the, -ing, -s, are) interspersed with items belonging to a very large set (e.g. nouns, verbs, adjectives). Such asymmetry translates into patterns of highly invariant nonadjacent items separated by highly variable material (am cooking, am working, am going, etc.). How do learners detect non-adjacent invariant structures? Gomez showed that the variability of the material intervening between dependent elements (the first and third word in her study) modulates the ability to detect non-local dependencies in the grammar above. Learning improves consistently as the variability of elements that occur between two dependent items increases. One explanation for this pattern is that when the set of items that participate in the dependency is small relative to the set of elements intervening, the nonadjacent dependencies stand out as invariant structure against the changing background of more varied material, as in pel wadim jic, pel puser jic, pel coomo jic, pel loga jic, dak coomo rud, dak wadim rud, dak puser rud, etc. (see Figures 2 and 3, columns 2–5; the different intervening words are indicated as indexed Xs).
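The logic of this aXb design can also be illustrated computationally. The sketch below generates training strings from a Gomez-style grammar and checks the two statistics at stake: the first word predicts the third word perfectly, while the middle slot carries no information and becomes more varied as its set size grows. The placeholder middle words, set sizes, and function names are invented here for illustration and are not the original stimuli.

    import random

    # Three nonadjacent dependencies: each first word fully predicts the third word.
    DEPENDENCIES = {"pel": "jic", "vot": "tood", "dak": "rud"}

    def make_strings(n_middle, n_strings=72, seed=0):
        """Generate aXb strings; n_middle sets the variability of the middle slot."""
        rng = random.Random(seed)
        middles = ["X%d" % i for i in range(1, n_middle + 1)]   # placeholder middle words
        return [(a, rng.choice(middles), DEPENDENCIES[a])
                for a in (rng.choice(list(DEPENDENCIES)) for _ in range(n_strings))]

    def nonadjacent_probability(strings, first, third):
        """Transitional probability P(third | first) across the intervening word."""
        with_first = [s for s in strings if s[0] == first]
        return sum(1 for s in with_first if s[2] == third) / len(with_first) if with_first else 0.0

    corpus = make_strings(n_middle=24)                    # high-variability condition
    print(nonadjacent_probability(corpus, "pel", "jic"))  # -> 1.0: pel always predicts jic
    print(len({s[1] for s in corpus}))                    # many middle types: X is uninformative

Under the zero-variability condition the same sketch applies with n_middle=1: the single middle word is then the invariant element, and the dependent first and third words vary against it.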

Figure 2. The underlying structure of the artificial grammars used by Gomez (2002; columns 2–5) and Onnis et al. (2003; 2004; columns 1–5). Sentences with three non-adjacent dependencies are constructed with an increasing number of possible intervening X words. Gomez used 2, 6, 12, and 24 intervening words. Onnis et al. added a new condition in which X = 1.


Figure 3. Data from Onnis et al. (2003) incorporating the original Gomez experiment. Learning of non-adjacent dependencies results in a U-shaped curve as a function of the variability of intervening X words, in five conditions of increasing variability.

This e¤ect also holds in the absence of variability of intervening words shared by di¤erent nonadjacent items, as in pel wadim jic, vot wadim tood, dak wadim rud (the word wadim is common to all sentences, see Figure 2 and 3, first column), as the intervening material becomes invariant with respect to the variable dependencies (Onnis, Christiansen, Chater, & Gomez, 2003). In natural languages, long-distance relationships such as singular and plural agreement between noun and verb may in fact be separated by the same material, for example the books on the shelf are dusty and the book[0] on the shelf is dusty. Importantly, while artificial grammars appear at first limited in their generativity, they can be used to test learners’ knowledge to generalize correctly to novel sentences never encountered in the training. For example, the ability of adult learners to endorse pel hiftam jic while rejecting *pel hiftam rud (where hiftam is a new word) is modulated by the same conditions of zero or high variability (Onnis, Monaghan, Christiansen, & Chater, 2004). The upshot of these studies on variability is that there is a


striking tendency to detect variant versus invariant structure that is in turn adaptive to the informational demands of their input (for putative mechanisms responsible for this effect see Gomez & Maye 2005). And learning a rule such as a non-adjacent dependency is not an all-or-none phenomenon, but is mediated by the distributional properties in which such dependencies happen to be experienced.

4.2. Invariance is at hand

A complementary line of studies using artificial grammars and corpora of naturalistic child-directed speech has highlighted how invariant linguistic structure may be at hand, namely available to learners in a short window of time. Here I would like to show how invariant features of language can be detected when two sentences are allowed to partially overlap in immediate succession in a text or in the speech stream. In a study by Onnis, Waterfall and Edelman (2008) adult learners were asked to find the novel words of an alien language out of unparsed (unsegmented) whole sentences such as kedmalburafuloropesai. In the absence of acoustic and prosodic cues (the sentences were generated by speech synthesis software), each sentence could in principle be composed of a range of possible words, from a single long word to as many words as there were identifiable syllabic clusters. Again, as the words in the sentences were all novel, the task was difficult, and it simulated some of the features and conditions involved in second language learning. It was found that learners were significantly better at the word segmentation task when a portion of sentences in the training set were ordered so as to partially repeat themselves one after the other (e.g., kedmalburafuloropesai, rafuloro), as opposed to a control learning situation in which no sentences overlapped immediately (although the training set was composed of exactly the same sentences in both conditions). Such immediate partial repetition across sentences facilitates comparison. When aligned, the partial overlap of the two sentences suggests three candidate units (kedmalbu, rafuloro, pesai) without the need for learners to entertain all possible unit candidates over several sentences. Importantly, the study also found evidence for a global learning effect. That is, not only did learners more reliably prefer word units heard in partially self-repeated sentences during learning, but they also segmented units that never occurred in such order more accurately (e.g. gianaber, kiciorudanamjeisulcaz). Similar results were found in a second experiment in which the phrasal structure of sentences was to be discovered, suggesting that the same mechanisms of comparison of invariant structure can signal structure at different levels of linguistic analysis.
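The alignment-and-comparison mechanism invoked here can be conveyed with a minimal sketch: when one utterance partially repeats the immediately preceding one, the shared stretch is a natural candidate unit and the unshared residue suggests further boundaries. The segmentation heuristic and the example strings below are my own illustration under that assumption, not the procedure used in the original study.

    from difflib import SequenceMatcher

    def align_and_segment(prev_utterance, next_utterance):
        """Propose candidate units by aligning two partially overlapping utterances."""
        match = SequenceMatcher(None, prev_utterance, next_utterance).find_longest_match(
            0, len(prev_utterance), 0, len(next_utterance))
        shared = prev_utterance[match.a:match.a + match.size]
        if not shared:
            return [prev_utterance]          # no overlap, so no comparison is possible
        before = prev_utterance[:match.a]
        after = prev_utterance[match.a + match.size:]
        return [unit for unit in (before, shared, after) if unit]

    # The second 'utterance' partially repeats the first, as in the variation-set condition.
    print(align_and_segment("kedmalburafuloropesai", "rafuloro"))
    # -> ['kedmalbu', 'rafuloro', 'pesai']

Applying the same comparison over successive partially overlapping sentences would let candidate units accumulate, which is one way to think about the global learning effect reported above.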


How can these laboratory studies inform L2 instruction? First, it is possible to construct L2 teaching materials that reflect the principles of variability over invariant structure described above. Given the instructed nature of L2 learning, the input to an L2 learner can be manipulated to a large extent – and much more flexibly than the input to a child. For instance, applying the concept of large variability to morphosyntactic relations in Spanish might involve a sequence of sentences like: Tengo las botas para el matrimonio. Tengo las pelı´culas para el fin de semana. Tengo las pelotas para el nin˜o. In the examples above the female gender and plural number agreements (las __ -as) are repeated while the intervening lexical items are modified (bot-, pelı´cul-, pelot-). The prediction is that a large enough number of intervening words should facilitate the extraction of the invariant nonadjacent relations (las __ -as), either implicitly or by promoting the explicit noticing of the target structure (Schmidt, 2001). It may also be useful to keep constant the non-target elements of the sentence, so that the target elements can vary. For instance, according to the zero variability condition described in Onnis et al. (2003; 2004) the following learning situation could also have a facilitative e¤ect, where the same lexical item is shared between the two gender-agreement constructions in Spanish: Mabel es la amiga de Carlos Carlos es el amigo de Maria. I have described principles of distributional learning that exploit the contrast between variable and constant materials. As mentioned earlier, one of the goals of this chapter is to show how such principles may apply to a range of learning situations, and thus be reusable in di¤erent tasks. Theoretically, this raises the possibility to understand human learning in terms of a relative small set of mechanisms. Practically, second language learning problems that are treated as di¤erent or unrelated may be amenable to similar distributional solutions. 5. Learning principle III: Reuse learning mechanisms In the introduction section I discussed how statistical learning can be viewed as a framework for identifying inductive learning mechanisms. How many mechanisms can we identify? Are they restricted to specific


tasks and/or modalities? While there are different types of sensitivities (e.g., to conditional probabilities – see Section 3 – and to invariant structure – Section 4), an assumption of parsimony makes it reasonable to assume that only a small number of such mechanisms exists, and these are recruited for different learning tasks and linguistic levels of representation. The aim of this section is thus to illustrate how the same learning mechanisms that apply to finding words in running speech and discovering phrasal units (discussed in Section 4) may also apply when learning form-meaning mappings.
Expanding a paradigm used by Yu & Smith (2007), Onnis, Edelman and Waterfall (2011) demonstrated that form-meaning mappings in a word learning task were learned significantly better when the input was structured as partial self-repetitions. During a learning phase, adult participants saw multiple novel pictures simultaneously and heard multiple novel words, creating ambiguity regarding correct word-to-picture mappings for a given trial. For instance, when four words and four pictures were presented simultaneously on each trial, there could be 4 × 4 possible word-referent combinations (Figure 4). The participants' task was to infer the correct word-picture mappings across these training trials. At test, they heard a single word and selected the picture (among four) that they thought mapped onto that word. Importantly, this task can only be solved if relations between words and referents are tracked across multiple trials, hence the term cross-situational learning (e.g., Yu & Smith, 2007). Onnis and colleagues were able to show that a learning condition where a specific single word-referent pair repeated successively across any two given trials (while all other pairs differed) contributed to the immediate disambiguation of the word-scene mappings. Importantly, even pairs that had not appeared in such contiguous conditions were shown to be learned, suggesting that principles of local alignment and comparison did not only affect the pairs involved locally, but had a global benefit on learning the form-meaning pairs.
Cross-situational learning offers a useful approach to modeling naturalistic L2 learning, in addition to yielding results that can potentially inform foreign language instruction. For example, Robinson and Ellis (2008), following Slobin (1996), have referred to the adjustment required to use conceptually novel form-meaning correspondences as rethinking for speaking. In this respect, word-referent learning experiments may shed light on the learning of lexical items and the structuring of conceptual domains in a second language. In addition, L2 vocabulary acquisition takes place in a rich extra-linguistic context. Quine (1960) illustrated this in his hypothetical account of a field linguist attempting to discern the meaning of the


Figure 4. Two possible trials in the cross-referential learning paradigm used by Onnis, Edelman, & Waterfall (2011). In each single trial the simultaneous presentation of 4 novel words and referents makes the form-meaning mapping task impossible. However, across trials learners were able to reduce uncertainty, by comparing the elements that changed versus those that stayed constant. In the example here, one word-referent pair remains constant across the two trials – which one is it?

word gavagai (see also Ellis, 2005). In order to approximate the complexity of cross-situational learning in naturalistic environments, laboratory studies may need to increase the number of referents in a given trial, since visual scenes in the real world typically present much richer evidence for learning (see Cenoz & Gorter, 2008, for a multimodal account of the L2 linguistic landscape). Because the composition of actual visual scenes may overwhelm learners’ computational abilities, prior knowledge of conceptual and social domains, which is readily available to L2 learners, could also be incorporated in further experiments to inform the design of computer-based vocabulary tutorials and explore how learners might solve Quine’s dilemma outside of virtual learning environments.
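A rough computational analogue of the cross-situational task may help clarify how the ambiguity is resolved: simply accumulating word-referent co-occurrence counts over trials, and choosing for each word the referent it has most often appeared with, is enough to recover the mappings in this toy version. The trial structure and vocabulary below are invented for the example; the experiments themselves used novel pictures and spoken pseudowords.

    from collections import defaultdict

    def cross_situational_learner(trials):
        """Accumulate word-referent co-occurrence counts across ambiguous trials."""
        counts = defaultdict(lambda: defaultdict(int))
        for words, referents in trials:
            for w in words:
                for r in referents:
                    counts[w][r] += 1        # within a trial, every pairing is possible
        # At test, choose the referent most often co-present with each word.
        return {w: max(refs, key=refs.get) for w, refs in counts.items()}

    # Each trial presents four words and four referents with no within-trial cue
    # about which goes with which (cf. Figure 4).
    trials = [
        (["bosa", "gasser", "manu", "colat"], ["DOG", "SHOE", "BALL", "CUP"]),
        (["bosa", "regli", "tuvak", "pimo"], ["DOG", "KEY", "TREE", "HAT"]),
        (["gasser", "tuvak", "colat", "pimo"], ["SHOE", "TREE", "CUP", "HAT"]),
        (["manu", "regli", "bosa", "gasser"], ["BALL", "KEY", "DOG", "SHOE"]),
    ]
    print(cross_situational_learner(trials)["bosa"])   # -> DOG

A partial self-repetition manipulation of the kind used by Onnis and colleagues amounts to ordering the trials so that one pair recurs across two adjacent trials, which disambiguates that pair immediately rather than only after many trials.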


The training interventions briefly envisioned here await robust evidence before they can be adapted to real-world L2 scenarios, but they open up ways to connect basic research to instructional concerns. In the following section, I conclude my overview of illustrative examples by looking at how meaning can be inferred from distributions of words across texts, and how knowledge of lexical distributions improves reading fluency.

6. Learning principle IV: Learn to predict Corpus analyses suggest that many words entail probabilistic semantic consequences that can be expressed as expectations for upcoming words. For instance, in English, the verb provide typically precedes positive words as in to provide assistance/benefits/relief, while the verb cause typically precedes negative items, as in to cause death/damage/disruptions (Sinclair, 1996). Interestingly, while the denotational meaning of say cause involves an agent and an e¤ect, there is little reason to assume a priori that in actual use cause may be associated with negative words (Guo et al., in press). Furthermore, while many speakers are fortunate enough to never directly experience negative events such as bleeding, war, and death, they learn that for instance wars are caused by famine rather than wars are provided by famine. Thus, although not the only way of discovering meaning, the connotational meaning of certain words may emerge as being distributed over the co-text and co-speech of their occurrences in natural language. On these assumptions, connotational meaning naturally lends itself to being modeled by distributional analyses of corpora. One class of available computational models of semantic knowledge – semantic space models – represents each word as a vector in a highdimensional state space (Rogers & McClelland, 2004; Vigliocco, Vinson, Lewis, & Garrett, 2004). The meaning of a word is obtained from the frequency distributions of the words that occur in the immediate context of a target word, over a large corpus. This method captures empirically the intuition that words that occur in the same sorts of contexts tend to be similar in meaning. For example, road and street are similar because they occur in similar co-texts (down the road/street, cross the road/street, the road/street to the left) and are dissimilar from tea and co¤ee, which co-occur with other words (co¤ee/tea and sugar, pour a cup of co¤ee/tea). Using a vector space model, Onnis, Farmer, Baroni et al. (2009) were able to derive the semantic orientation (valence tendency) of a number of words such as cause, provide, encounter, markedly, largely, impressive,


purely on distributional grounds. This orientation was measured as a signed value for each word. The authors further obtained independent human values of semantic orientation in a sentence continuation task. Native speakers of English were asked to provide a free completion for sentences like ‘‘The mayor was surprised when he encountered. . .’’. When the portions of sentence continuations were scored as positive or negative on a Likert scale by a di¤erent group of participants, their values correlated significantly with those assigned by the vector space model on a purely distributional way. This suggests that a) native speakers are sensitive to the general semantic orientation of a word, and constrain their free productions to accommodate it; and b) the semantic orientation of a word can be inferred automatically by simple distributional properties of texts (the computer model does not have any inbuilt notion of semantics). Computer models like this one might approximate to a fair degree the cognitive mechanisms available to human learners. The presence of valence tendencies may facilitate language comprehension in real-time situations. If producing a given word in a sentence, say the verb to encounter, prompted speakers to narrow down the set of possible sentence continuations, then on the comprehender’s side sensitivity to this semantic valence tendency may help anticipate the sentence continuation, resulting in a measurable gain in comprehension fluency. This idea was tested in a self-paced reading experiment in which words in sentences were presented one by one incrementally, and participants pressed a key on the keyboard to read the next word. This allows the measurement of reading times for each given word in a sentence. It was found that on-line reading was slowed down significantly in sentences that contained an incongruent semantic orientation (e.g., the news on television caused optimism in the audience), as opposed to when the sentences contained a congruent semantic orientation (the news on television caused pessimism in the audience). There is mounting evidence in the sentence processing literature that humans use expectations as the sentence unfolds in order to reduce the set of possible competitors to a word or sentence continuation (e.g., Altmann and Kamide, 1999; Tanenhaus et al., 1995). At each time step the linguistic processor uses the currently available input and the lexical information associated with it to anticipate possible ways in which the input might continue. An important consideration is that distributional patterns of words a¤ord speakers the necessary fluent generativity to understand and produce not only crystallized collocations (e.g. to cause damage which has a high co-occurrence and is probably learned by rote), but also novel ‘on-the-fly’


combinations of words that are nonetheless congruent with the general valence tendency of a given word. Thus, learning about distributions of words in the lexicon may support generative processes and is not limited to rote memorization processes.
Explaining how learners acquire new vocabulary, as well as how they become fluent speakers, figures prominently in second language research. In this section I have offered a glimpse of a distributional account of how lexical semantics may be acquired and how it improves language fluency. Researchers have long recognized the role of learning phraseology in developing proficiency, for example collocations and other extended units of meaning (e.g., Boers et al., 2006; Gries, 2008). The study reported here further shows that having knowledge of language-specific selectional restrictions and probabilistic tendencies is not a mere matter of sounding 'native-like' from a stylistic point of view. Rather, there is a correlation between knowledge of language-specific phraseology and language fluency in native speakers (for studies of L2 see Howarth, 1998; Onnis, 2001; Towell et al., 1996).
The study also offers some methodological advances. Proposals about vocabulary learning have often been described in qualitative mentalistic terminology that may not entirely provide causal and mechanistic explanations. Exactly how are the denotational and connotational meanings of words learned? I have argued that at least some aspects of lexical semantics can in principle be derived distributionally from a corpus using simple computational procedures. While still underdeveloped for instructional purposes, this approach opens up ways to think about what types of texts and word distributions within texts can optimize the salience of, for example, semantic orientations. Thus, one promise of computational modeling for second language learning is the possibility of making assumptions explicit and testable under specific conditions in computer simulations, as well as in testable conditions with human learners.
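To make the semantic space idea of this section more tangible, the sketch below builds word vectors from co-occurrence counts in a toy corpus and compares them with the cosine measure, so that words occurring in similar co-texts come out as similar. It is a minimal, hedged illustration of the general approach; it is not the model used by Onnis, Farmer, Baroni and colleagues, and the corpus, window size, and function names are invented here.

    import math
    from collections import Counter

    def context_vectors(sentences, window=2):
        """Count, for each word, the words occurring within +/- window positions."""
        vectors = {}
        for sentence in sentences:
            tokens = sentence.lower().split()
            for i, w in enumerate(tokens):
                context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
                vectors.setdefault(w, Counter()).update(context)
        return vectors

    def cosine(v1, v2):
        """Cosine similarity between two sparse count vectors."""
        dot = sum(v1[w] * v2[w] for w in set(v1) & set(v2))
        norm = math.sqrt(sum(c * c for c in v1.values())) * math.sqrt(sum(c * c for c in v2.values()))
        return dot / norm if norm else 0.0

    corpus = [
        "walk down the road", "walk down the street",
        "cross the road carefully", "cross the street carefully",
        "pour a cup of coffee", "coffee and sugar please",
    ]
    v = context_vectors(corpus)
    print(cosine(v["road"], v["street"]))   # high: road and street share co-texts
    print(cosine(v["road"], v["coffee"]))   # low: different co-texts

Semantic orientation could then be approximated, for instance, by comparing a word's vector with the vectors of clearly positive and clearly negative anchor words, which is in the spirit of the valence measure described above.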

7. Discussion: Distributional approaches to SLA In this chapter I have proposed that by looking at language learning as induction of patterns and generalizations over patterns, important insights can be gained, not only for L1 but also in L2 research. I have further suggested some ways in which L2 instruction inspired by principles as well as methodologies o¤ered by statistical learning may help adult learners capitalize on distributional information that correlates with di¤erent types of linguistic structure at di¤erent levels of analysis – sublexical, morpho-


syntactic, lexical and phrasal, and lexico-semantic. My overarching goal has been to make the case for a closer integration of the research paradigm and methods of statistical learning and research on second language acquisition. I also wanted to stress the role of miniature artificial languages for unveiling principles of adult human learning. To date, most miniature languages involving adults have been intended to simulate scenarios of child language acquisition. Adult learners are thought of as useful ‘human simulations’ (Gillette, Gleitman, Gleitman, & Lederer, 1999) that approximate some learning behavior in infancy. However, these studies may also be directly linked to adult second language acquisition, because adults already possess knowledge of a linguistic system when they engage in learning a novel miniature language. As such, artificial grammar experiments with adults can be seen as useful human simulations of second language learning processes. Sections 3 to 6 reviewed relevant literature and proposed four principles of learning. Section 3 contributed the idea that learning di‰culties can be overcome by integrating di¤erent probabilistic sources to the task at hand. A traditional view that sees language separated in modular representational levels (e.g., phonetic – phonemic – sublexical) may underestimate the large redundancy of probabilistic information available in the signal. Accordingly, the perception of a foreign sound would be treated as a purely acoustic problem, and as such its solution sought at the acoustic level only. Instead, phonotactic and orthotactic regularities (along with other information yet to be assessed) may come in handy in recognizing the di‰cult sound. Sections 4 and 5 discussed the principle that learners seek invariance in the signal. Becoming sensitive to what changes versus what stays constant in the linguistic environment can highlight structural relations in language such as word boundaries, non-adjacent dependencies, syntactic phrases, and form-meaning mappings. Importantly, the putative underlying mechanisms of alignment and comparison of candidate structures are simple enough general learning mechanisms and can be ‘recycled’ at di¤erent levels of linguistic representation, providing a general framework for learning structure (see further below). In Section 6 I discussed how probabilistic lexico-semantic constraints impose choices on sentence continuations in free productions. In addition, knowledge of lexical semantics improves fluency in realistic conditions such as when reading text. Finally, Section 3 and 6 together contributed the idea of integrating computational analyses of language to make experimental predictions about which statistical properties are useful for learning and processing language. Computational analyses of corpora allow one to


assess the a priori usefulness of one or more probabilistic cues, which can then be evaluated empirically with language learners. In sum, statistical approaches to language contribute a diagnostic toolkit for testing what is easy and di‰cult to learn in experimentally controlled settings, and may further o¤er supportive solutions to instructional needs. 7.1. Implications for L2 instruction While it is early to sketch a map of how statistical learning will inform educational practices in meaningful ways, I speculate here on a few possibilities. For instance, statistical learning can be seen as complementary to existing techniques of input-based enhancement, which attempt to make certain features of the language more salient (e.g., Sharwood Smith, 1991). While textual enhancement can be achieved via manipulation of typographical cues such as bolding or italics, meta-analytic reviews of this research domain show that learners exposed to enhanced texts barely outperform those exposed to unenhanced, flooded texts on targeted grammatical features (Lee & Huang, 2008). It may be possible, therefore, to structure texts such that certain distributional properties enhance a particular target structure. In this respect, presenting a di‰cult structure in variation sets might inherently bring it to the attention of the learner, giving rise to the establishment of form-meaning connections. In addition, attempts to direct attention to L2 mappings may result in even greater performance gains when cues are made salient. That is, instructional interventions that orient learners to multiple distributional cues in ways that take advantage of the contribution of each cue in the real-time comprehension or production of fully-formed sentences or utterances may further reinforce learning. Such proposals are consistent with an emerging consensus on the part of researchers from both generative (Slabakova, 2008) and cognitiveinteractionist (Ortega, 2007) traditions who recommend practicing form and function in meaningful contexts. Therefore, one major advantage of applying statistical learning to second language teaching is its potential applicability to actual learning scenarios. If certain distributional properties of the input accelerate learning (as documented in several independent experiments on adult artificial language learning in this volume), then it is possible in principle to tailor the learner’s experience to reflect such optimal conditions, providing conditions of ‘statistically structured input’, in line with existing work (e.g., Lee & Van Patten, 2003).


Statistical learning research on L2 has also practical advantages that work in L1 settings does not. The initial stages of the development of language in infants and young children are mostly under parental control and di‰cult to modify with explicit interventions. Conversely, modulating the input an L2 learner receives can be practically achieved in various flexible ways, either in the classroom, or via educational software, or via the construction of materials that incorporate statistical learning principles. 7.2. The relationship with implicit learning While artificial language studies have been used in SLA, most have focused on the nature of implicit learning (see Dienes, this volume; Shanks, 2005) and knowledge in L2 (see Hamrick and Rebuschat, this volume; Leung and Williams, 2006; Schmidt, 1994), rather than on providing mechanisms of statistical learning. In most cases these studies do not directly include manipulations of distributional information in their designs, as opposed to the studies presented here. In this respect, research on statistical learning can be seen as complementary and orthogonal to the implicit/explicit distinction, the latter still being a useful framework for investigating processes of human learning. Statistical learning may occur on a cline from completely implicit to explicit. For example, a textbook or a learning task may present scenarios and sentences that implicitly form variation sets (see Section 4). The outcome of learning may at this point be fully explicit (a sort of ah-ah experience: ‘‘I recognize that what stays constant here may be an L2 construction’’), or less so, with the construction standing out without direct awareness on the part of the learner – who is perhaps engaged in encoding or decoding the meaning or the pragmatic relevance of the event. Furthermore, it is possible to direct L2 learners to explicitly find patterns of invariance in collections of texts, as indicated by pedagogical uses of corpora (e.g., Aston, Bernardini, & Steward, 2004). The relation between statistical regularities and implicit learning can be quite complex in second language learning. While certain distributional properties of language, especially low-level ones such as probabilistic phonotactics, are definitely learnt implicitly in one’s first language and may appear di‰cult to teach explicitly, there is also evidence to the contrary. Al-jasser (2008) reported on a pre-post test intervention study investigating the e¤ect of teaching English phonotactics to Arabic speakers with the purpose of improving their lexical segmentation abilities. His post-test results showed significant gains in the lexical segmentation of running speech in English. Therefore,


while it is quite reasonable to assume that statistical learning in infancy and childhood is implicit, for second language learning this line of research o¤ers non-intuitive insights beyond the classic implicit/explicit divide. 7.3. Defusing the internalist/externalist debate Research into statistical learning, in addition to guiding the development of novel instructional interventions, may also provide theoretical insight into the mechanisms underlying existing forms of L2 instruction, the e¤ectiveness of which has already been demonstrated. The trend in L2 research toward meta-analytic reviews (Norris & Ortega, 2006, 2011) has generated robust evidence for, among other areas, the role of interaction in learning in another language (Keck, Iberri-Shea, Tracy-Ventura, Wa-Mbaleka, 2006; Mackey & Goo, 2007). Many researchers now see the divide between social and cognitive dimensions of learning as hurtful to a better understanding of language and communication, in both first and second language research. While in this chapter I have focused on finding language-internal regularities in the input, such regularities need not be e¤ective in isolation, because there is already evidence that they do take e¤ect in social settings. Statistical sensitivity develops both within the linguistic input learners are exposed to, and across the linguistic and non-linguistic exchanges with their interlocutors during social interaction. Thus, distributional information inherent in the input along with social interaction can provide reliable cues to discovering structural and abstract properties of language (for a review, see Meltzo¤, Kuhl, Movellan, & Sejnowski, 2009). In this respect, one general framework for statistical learning that invokes cognitive principles directly relevant to interactionist approaches has been put forth by Goldstein and colleagues (2010). This framework uses the acronym ACCESS as a mnemonic for several key principles in learning from distributional patterns (Align Candidates, Compare, Evaluate Statistical/Social significance). Each of these components has a clear analogue in interactionist SLA research. To begin, L2 interaction is fundamentally a matter of exposure to input through conversational discourse, as illustrated by the following example, adapted from a classroom study on learner interaction in computer-mediated communication. Here, Kin and Gin are exchanging opinions in a communicative task: (1) Kin: If you don’t have much money, you can’t go university. Gin: but why do you go to the university?


Her interlocutor's response offers Kin an immediate opportunity to align candidates. For example, she may pay attention (Schmidt, 2001) to the partial reformulation of the verb phrase 'go to the university'. Kin's ability to restructure her knowledge of the usage required here may rely on cognitive comparison, during which learners' output "must be compared with the relevant data available from the contingent utterances of their more competent interlocutors" (Doughty, 2001, p. 225). As hypothesized by Laufer and Hulstijn (2001), task-induced involvement is what drives L2 learning in this case, through need, search, and evaluation. The involvement load hypothesis acknowledges that motivational as well as cognitive components are involved in incidental second language learning (see also Dörnyei, 2009). The statistical significance of the information Kin is presented with is registered according to mechanisms detailed throughout this chapter (but see Ellis, 2006 on related factors that impede learning from input). Finally, SLA theory offers several theoretical perspectives emphasizing the sociocultural (Lantolf, 2000), sociocognitive (Atkinson, 2011), and socially distributed (Markee & Seo, 2009) aspects of L2 interaction that may help interpret the social significance of the linguistic choices in the present dyadic exchange. In sum, an interactionist account of SLA that incorporates principles of statistical learning is not merely possible; in many respects it already exists. What remains to be done is to articulate these connections more explicitly in order to strengthen future empirical work. To conclude, I have argued that there is an important potential role for statistical learning research in terms of direct links to practical aspects of second language learning and instruction, namely diagnosing learner needs, enhancing instruction and curricula, and defining principles to put into practice in a variety of ways, as called for by the specific details of the learning context.

Acknowledgements I would like to thank Shimon Edelman, Kevin Gregg, Daniel Jackson, Hannah Jones, Elizabeth Kissling, Phillip Hamrick, Julie Lake, William O’Grady, Lourdes Ortega, Patrick Rebuschat, Dick Schmidt, and two anonymous reviewers for their comments on earlier versions of this chapter. The manuscript also benefited from useful discussions with several graduate students in the SLS program at the University of Hawaii. The author was partially supported by a Language Learning Research Grant.


References Akahane-Yamada, R., Kato, H., Adachi, T., Watanabe, H., Komaki, R., Kubo, R., Takada, T, and Ikuma, Y. 2004 ATR CALL: A speech perception/production training system utilizing speech technology, The 18th International Congress on Acoustics, III, 2319–2320. Al-jasser, F. 2008 The e¤ect of teaching English phonotactics on the lexical segmentation of English as a foreign language. System, 36, 1, 94–106. Altmann, G.T.M., & Kamide, Y. 1999 Incremental interpretation at verbs: Restricting the domain of subsequent reference. Cognition, 73, 247–264. Atkinson, D. 2010 Extended, embodied cognition and second language acquisition. Applied Linguistics, 31, 599–622. Aston, G., Bernardini, S., & Stewart D. (Eds.) 2004 Corpora and language learners. Amsterdam: Benjamin. Balota, D.A., Yap, M.J., Cortese, M.J., Hutchison, K.A., Kessler, B., Loftis, B., Neely, J.H., Nelson, D.L., Simpson, G.B., & Treiman, R. 2007 The English Lexicon Project, Behavior Research Methods, 39, 445–459. Bannard, C., Lieven, E. & Tomasello, M. 2009 Modeling children’s early grammatical knowledge, PNAS, 106, 41, 17284–17289. Bod, R. 2009 From exemplar to grammar: A probabilistic analogy-based model of language learning, Cognitive Science, 33, 752–793. Boers, F., J. Eyckmans, J. Kappel, H. Stengers & M. Demecheleer 2006 Formulaic sequences and perceived oral proficiency: Putting a lexical approach to the test. Language Teaching Research, 10, 245–261. Cenoz, J., & Gorter, D. 2008 The linguistic landscape as an additional source of input in second language acquisition. IRAL, 46, 267–287. Chambers, K.E., Onishi, K.H., & Fisher, C. 2003 Infants learn phonotactic regularities from brief auditory experience. Cognition, 87, B69–B77. Chater, N., & Manning, C.D. 2006 Probabilistic models of language processing and acquisition. Trends in Cognitive Sciences, 10, 335–344. Chater, N., & Oaksford, M. (Eds.) 2008 The probabilistic mind: Prospects for Bayesian cognitive science. Oxford: Oxford University Press.


Chomsky, N. 1957 Syntactic Structures. Mouton. Christiansen, M.H., Kelly, M.L., Shillcock, R.C. & Greenfield, K. 2010 Impaired artificial grammar learning in agrammatism. Cognition, 116, 382–393. Christiansen, M., Onnis, L., & Hockema, S. 2009 The secret is in the sound: From unsegmented speech to lexical categories. Developmental Science, 12(3), 388–395. Christiansen, M.H., Conway, C., & Onnis, L. 2007 Neural responses to structural incongruencies in language and statistical learning point to similar underlying mechanisms. In Proceedings of the 29th Annual Meeting of the Cognitive Science Society. Clahsen, H. & C. Felser 2006 How native-like is non-native language processing? Trends in Cognitive Sciences, 10, 564–570. Cleeremans, A., Destrebecqz, A., & Boyer, M. 1998 Implicit learning: News from the front, Trends in Cognitive Sciences, 2, 406–416. Dale, R., & Spivey, M.J. 2006 Unraveling the dyad: Using recurrence analysis to explore patterns of syntactic coordination between children and caregivers in conversation. Language Learning, 56, 3, 391–430. Dell, G.S., Reed, K.D., Adams, D.R., & Meyer, A.S. 2000 Speech errors, phonotactic constraints, and implicit learning: A study of the role of experience in language production. Journal of Experimental Psychology: Learning, Memory, & Cognition, 26, 1355–1367. Dienes, Z. this volume Conscious versus unconscious learning of structure. Do¨rnyei, Z. 2009 Individual di¤erences: Interplay of learner characteristics and learning environment. Language Learning, 59, 230–248. Doughty, C. 2001 Cognitive underpinnings of focus on form. In P. Robinson (Ed.), Cognition and second language instruction (pp. 206–257). Cambridge: Cambridge University Press. Ellis, N.C. 2005 At the interface: Dynamic interactions of explicit and implicit language knowledge. Studies in Second Language Acquisition, 27, 305–352. Ellis, N.C. 2006 Selective attention and transfer phenomena in L2 acquisition: Contingency, cue competition, salience, interference, overshadowing, blocking, and perceptual learning. Applied Linguistics, 27(2), 164–194.


Ellis, N.C., & Schmidt, R. 1998 Rules or associations in the acquisition of morphology? The frequency by regularity interaction in human and PDP learning of morphosyntax. Language and Cognitive Processes, 13, 307–336. Ellis, N. & O’Donnell, M. this volume Statistical construction learning: Does a Zipfian problem space ensure robust language learning? Evans, J.L., Sa¤ran, J.R., and Robe-Torres, K. 2009 Statistical learning in children with Specific Language Impairment. Journal of Speech, Language and Hearing Research, 52, 321–335. Frank, M.C., Goodman, N.D., and Tenenbaum, J.B. 2009 Using speakers’ referential intentions to model early crosssituational word learning, Psychological Science, 20, 578–585. Gillette, J., Gleitman, H., Gleitman, L., & Lederer, A. 1999 Human simulations of vocabulary learning. Cognition, 73, 35– 176. Goldstein, M.H., King, A.P., & West, M.J. 2003 Social interaction shapes babbling: Testing parallels between birdsong and speech. Proceedings of the National Academy of Sciences, 100, 13, 8030–8035. Goldstein, M.H., Waterfall, H.R., Lotem, A., Halpern, J.Y., Schwade, J.A., Onnis, L., et al. 2010 General cognitive principles for learning structure in time and space. Trends in Cognitive Sciences, 14, 6, 249–258. Gomez, R.L. 2002 Variability and Detection of Invariant Structure. Psychological Science, 13, 5, 431–436. Gomez, R L., & Gerken, L. 2000 Infant artificial language learning and language acquisition. Trends in Cognitive Sciences, 4, 178–187. Gomez, R.L., & Maye, J. 2005 The Developmental Trajectory of Nonadjacent Dependency Learning. Infancy, 7, 2, 183–206. Gopnik, A., & Tenenbaum, J. 2007 Bayesian networks, Bayesian learning and cognitive development. Developmental Science (special section on Bayesian and BayesNet approaches to development), 10, 3, 281–287. Gries, S. 2008 Corpus-based methods in analyses of SLA data. In Peter Robinson & Nick C. Ellis (eds.), Handbook of cognitive linguistics and second language acquisition, 406–431. New York: Routledge, Taylor & Francis Group. Gri‰ths, T.L., Chater, N., Kemp, C., Perfors, A., & Tenenbaum, J.B. 2010 Probabilistic models of cognition: Exploring representations and inductive biases. Trends in Cognitive Sciences, 14, 357–364.


Guo, X., Zheng, L., Zhu, L., Yang, Z., Chen, C., Zhang, L., Ma, W., & Dienes, Z. in press Acquisition of conscious and unconscious knowledge of semantic prosody. Consciousness & Cognition. Hay, J. & Lany, J. this volume Sensitivity to Statistical Information Begets Learning in Early Language Development. Hamrick, P. & Rebuschat, P. this volume How implicit is statistical learning? Harris, Z.S. 1954 Distributional structure. Word, 10, 146–162. Howart, P. 1998 Phraseology and Second Language Proficiency. Applied Linguistics, 19, 1, 24–44. Jurafsky, D. 2003 Probabilistic modeling in psycholinguistics: Linguistic comprehension and production. In R. Bod, J. Hay, and S. Jannedy (Eds.), Probabilistic Linguistics, MIT Press. Kaufman, S.B., DeYoung, C.G., Gray, J.R., Jime´nez, L., Brown, J., & Mackintosh, N. 2010 Implicit learning as an ability. Cognition, 116, 321–340. Keck, C.M., Iberri-Shea, G., Tracy-Ventura, N., & Wa-Mbaleka, S. 2006 Investigating the empirical link between task-based interaction and acquisition: A meta-analysis. In J. M. Norris & L. Ortega (Eds.), Synthesizing research on language learning and teaching (pp. 91–131). Amsterdam: John Benjamins. Kuhl, P.K. 2004 Early language acquisition: Cracking the speech code. Nature Reviews Neuroscience, 5, 831–843. Kuhl, P.K. 2000 A new view of language acquisition. Proceedings of the National Academy of Science, 97, 11850–11857. Landauer, T.K., & Dumais, S.T. 1997 A solution to Plato’s problem: The latent semantic analysis theory of acquisition, induction, and representation of knowledge, Psychological Review, 1, 2, 211–240. Lantolf, J. (Ed.). 2000 Sociocultural theory and second language learning. Oxford: Oxford University Press. Laufer, B., & Hulstijn, J. 2001 Incidental vocabulary acquisition in a second language: The construct of task-induced involvement. Applied Linguistics, 22, 1–26. Lee, J., & Van Patten, B. 2003 Making Communicative Language Happen. New York: McGraw Hill.


Lee, S., & Huang, H. 2008 Visual input enhancement and grammar learning: A metaanalytic review. Studies in Second Language Acquisition, 30, 307– 331. Leung, J. & Williams, J.N. 2006 Implicit learning of form-meaning connections. In Sun, R. & Miyake, N. (Eds) Proceedings of the Annual Meeting of the Cognitive Science Society, pp. 465–470. Mahwah, N.J.: Lawrence Erlbaum. MacKay, D.J.C. 2003 Information Theory, Inference, and Learning Algorithms, Cambridge University Press. Mackey, A., & Goo, J. 2007 Interaction research in SLA: A meta-analysis and research synthesis. In A. Mackey (Ed.), Conversational interaction in second language acquisition (pp. 407–452). Oxford: Oxford University Press. Markee, N., & Seo, M. 2009 Learning talk analysis. IRAL, 47, 37–63. Miller, G. 1967 The psychology of communication. New York: Basic Books. Misyak, J.B., & Christiansen, M.H. in press Statistical learning and language: An individual di¤erences study. Language Learning. Lively, S.E., Pisoni, D.B. & Yamada, R.A. 1994 Training Japanese listeners to identify English /r/ and /l/: III. Long-term retention of new phonetic categories, Journal of the Acoustical Society of America, 96, 4, 2076–2087. Maye, J., Werker, J.F. & Gerken, L. 2002 Infant sensitivity to distributional information can a¤ect phonetic discrimination. Cognition, 82, 3, B101–B111. McClelland, J.L., Botvinick, M.M., Noelle, D.C., Plaut, D.C., Rogers, T.T., Seidenberg, M.S., and Smith, L.B. 2010 Letting Structure Emerge: Connectionist and Dynamical Systems Approaches to Understanding Cognition. Trends in Cognitive Sciences, 14, 348–356. McClelland, J.L. 1998 Connectionist models and Bayesian inference. In Rational models of cognition, ed. by Mike Oaksford and Nick Chater, 21–53. Oxford: Oxford University Press. Meltzo¤, A.N., Kuhl, P.K., Movellan, J., & Sejnowski, T.J. 2009 Foundations for a new science of learning. Science, 325, 284– 288. Miller, G.A. 1956 Information and memory, Scientific American, 1956, 195 (2), 42– 47.


Miller, G.A. 1958 Free recall of redundant strings of letters. Journal of Experimental Psychology, 56, 485–491.
Misyak, J.B., Christiansen, M.H. & Tomblin, J.B. 2010 Sequential expectations: The role of prediction-based learning in language. Topics in Cognitive Science, 2, 138–153.
Montrul, S., Foote, R., & Perpiñán, S. 2008 Gender agreement in adult second language learners and Spanish heritage speakers: The effects of age and context of acquisition. Language Learning, 58, 3, 503–553.
Norris, J., & Ortega, L. 2010 Research synthesis. Language Teaching, 43, 461–479.
Norris, J., & Ortega, L. (Eds.). 2006 Synthesizing research on language learning and teaching. Amsterdam: John Benjamins.
Onnis, L. 2001 Fluency in native and non-native speakers. In Carli A. (Ed.), Aspetti linguistici e interculturali del bilinguismo (pp. 20–139). Milano: Franco Angeli.
Onnis, L., Farmer, T., Baroni, M., Christiansen, M.H., and Spivey, M.J. 2009 Generalizable distributional regularities aid fluent language processing: The case of semantic valence tendencies. Special issue of the Italian Journal of Linguistics, 20(1), 129–156.
Onnis, L., Christiansen, M.H., Chater, N., and Gomez, R. 2003 Reduction of uncertainty in human sequential learning: Evidence from artificial language learning. Proceedings of the 25th Annual Conference of the Cognitive Science Society (pp. 886–891). Mahwah, NJ: Lawrence Erlbaum.
Onnis, L., Monaghan, P., Christiansen, M.H., & Chater, N. 2004 Variability is the spice of learning, and a crucial ingredient for detecting and generalizing nonadjacent dependencies. In Proceedings of the 26th Annual Conference of the Cognitive Science Society.
Onnis, L., Waterfall, H. & Edelman, S. 2008 Learn locally, act globally: Learning language from variation set cues. Cognition, 109, 423–430.
Onnis, L., Edelman, S., & Waterfall, H. 2011 Local statistical learning under cross-situational uncertainty. In L. Carlson, C. Hölscher and T. Shipley (Eds.), Proceedings of the 33rd Annual Conference of the Cognitive Science Society.
Onnis, L., Uchida, Y. & Magnuson, J. in preparation Distributional phonotactic cues assist the perception of speech contrasts.
Ortega, L. 2007 Meaningful L2 practice in foreign language classrooms: A cognitive-interactionist SLA perspective. In R.M. Dekeyser (Ed.),


Practice in a second language: Perspectives from applied linguistics and cognitive psychology (pp. 180–207). Cambridge: Cambridge University Press. Pacton, S., Perruchet, P., Fayol, M., & Cleeremans, A. 2001 Implicit learning out of the lab: The case of orthographic regularities. Journal of Experimental Psychology: General, 130, 401– 426. Quine, W.V.O. 1960 Word and object. Cambridge, MA: MIT Press. Reber, A.S. 1967 Implicit Learning of Artificial Grammars. Journal of Verbal Learning and Verbal Behavior, 6, 855–863. Redington, M. & Chater, N. 1998 Connectionist and statistical approaches to language acquisition: A distributional perspective. Language and Cognitive Processes, 13, 129–191. Redington, M., Chater, N., & Finch, S. 1998 Distributional information: A powerful cue for acquiring syntactic categories. Cognitive Science, 22, 425–469. Redington, M. & Chater, N. 1997 Probabilistic and distributional approaches to language acquisition. Trends in Cognitive Sciences, 1, 273–281. Redington, M. & Chater, N. 1996 Transfer in artificial grammar learning: A reevaluation. Journal of Experimental Psychology: General, 125, 123–138. Robinson, P. 2005 Cognitive abilities, chunk-strength, and frequency e¤ects in implicit artificial grammar and incidental L2 learning: Replications of Reber, Walkenfeld, and Hernstadt (1991) and Knowlton and Squire (1996) and their relevance for SLA, Studies in Second Language Acquisition, 27, 2, 235–268. Robinson, P., & Ellis, N.C. 2008 Conclusion: Cognitive linguistics, second language acquisition, and L2 instruction – issues for research. In P. Robinson & N.C. Ellis (Eds.), Handbook of cognitive linguistics and second language acquisition (pp. 489–545). New York: Routledge. Rogers, T.T. & McClelland, J.L. 2004 Semantic Cognition: A Parallel Distributed Processing Approach. Cambridge, MA: MIT Press. Roy, D. 2009 New Horizons in the Study of Child Language Acquisition. Proceedings of Interspeech 2009. Brighton, England. Sa¤ran, Aslin, & Newport 1996 Statistical Learning by 8-Month-Old Infants. Science, 274 (5294). 1926–1928.


Shanks, D.R. 2005 Implicit learning. In K. Lamberts and R. Goldstone (Eds.), Handbook of Cognition (pp. 202–220). London: Sage.
Shannon, C. 1951 Prediction and Entropy of Printed English. Bell System Technical Journal, 30, 50–64. Reprinted in D. Slepian (Ed.) (1974), Key Papers in the Development of Information Theory. New York: IEEE Press.
Shannon, C. 1948 A Mathematical Theory of Communication. Bell System Technical Journal, 27, 379–423 and 623–656. Reprinted in D. Slepian (Ed.) (1974), Key Papers in the Development of Information Theory. New York: IEEE Press.
Sharwood Smith, M. 1991 Speaking to many minds: On the relevance of different types of language information for the L2 learner. Second Language Research, 7, 2, 118–132.
Schmidt, R. 2001 Attention. In P. Robinson (Ed.), Cognition and second language instruction (pp. 3–32). Cambridge: Cambridge University Press.
Schmidt, R. 1994 Implicit learning and the cognitive unconscious: Of artificial grammars and SLA. In N. C. Ellis (Ed.), Implicit and Explicit Learning of Languages (pp. 165–209). London: Academic Press.
Sinclair, J. 1996 The search for units of meaning. Textus, IX, 75–106.
Slabakova, R. 2008 Meaning in the second language. Berlin: Mouton de Gruyter.
Slobin, D.I. 1996 From "thought and language" to "thinking for speaking". In J.J. Gumperz & S.C. Levinson (Eds.), Rethinking linguistic relativity (pp. 70–96). Cambridge: Cambridge University Press.
Smith, L.B., & Yu, C. 2008 Infants rapidly learn word-referent mappings via cross-situational statistics. Cognition, 106, 1558–1568.
Solan, Z., Horn, D., Ruppin, E., and Edelman, S. 2005 Unsupervised learning of natural languages. Proceedings of the National Academy of Science, 102, 11629–11634.
Spada, N., & Tomita, Y. 2010 Interactions between type of instruction and type of language feature: A meta-analysis. Language Learning, 60(2), 263–308.
Tanenhaus, M., Spivey-Knowlton, M., Eberhard, K., & Sedivy, J. 1995 Integration of visual and linguistic information in spoken language comprehension. Science, 268, 1632–1634.


Tenenbaum, J.B., and Gri‰ths, T.L. 2001 Generalization, similarity, and Bayesian inference, Behavioral and Brain Sciences, 24, 629–641. Thiessen, E.D. 2007 The e¤ect of distributional information on children’s use of phonemic contrasts. Journal of Memory and Language, 56, 16–34. Tokowicz, N., & Warren, T. in press Beginning adult L2 learners’ sensitivity to morphosyntactic violations: A self-paced reading study. Towell, R., Hawkins, R., & Bazergui, N. 1996 The development of fluency in advanced learners of French. Applied Linguistics, 17, 84–119. Uchida, Y. 2010 Measuring knowledge of English Orthotactics in Japanese learners of English: Towards the establishment of a training scheme for /l/-/r/ Perception. Unpublished thesis for the Advanced Graduate Certificate, Department of Second Language Studies, University of Hawai‘i at Manoa. Vigliocco, G., Vinson, D.P, Lewis, W. & Garrett, M.F. 2004 Representing the meanings of object and action words: The featural and unitary semantic space hypothesis. Cognitive Psychology, 48, 422–488. Williams, J.N. 2004 Implicit learning of form-meaning connections. In J. Williams, B. VanPatten, S. Rott, and M. Overstreet (Eds.), Form Meaning Connections in Second Language Acquisition. Mahwah, NJ: Lawrence Erlbaum Associates. 2004, pp. 203–218. Yu, C., and Smith, L.B. 2007 Rapid word learning under uncertainty via cross-situational statistics. Psychological Science, 18 (5), 414–420.

Statistical learning and syntax: What can be learned, and what difference does meaning make?1

John N. Williams and Patrick Rebuschat

Many contributors to this volume have discussed statistical learning in the context of the word segmentation problem, stressing the role of contingencies between forms as an important cue to locating the boundaries between words. Having established the power of this kind of learning mechanism, an important question is whether it can be extended into other aspects of language learning. Here we focus on the learning of word order regularities, primarily in adult second language (L2) learning. Is this just another form of sequence learning? Just as potential word boundaries can be induced through a statistical analysis of the contingencies between syllables, can syntactic structure be induced through an analysis of the contingencies between words? An obvious place to look for research that is related to this issue is within the literature on artificial grammar learning (AGL) involving sequences of letters generated by finite-state grammars, or serial reaction time (SRT) tasks involving sequences of positions of a single stimulus. This is referred to generally as "implicit learning" research (see Cleeremans, Destrebecqz, & Boyer, 1998; Perruchet, 2008; A. S. Reber, 1993; Shanks, 2005, for overviews). Following Misyak et al. and Dienes (this volume) we shall regard statistical learning and implicit learning research as essentially tapping a common underlying statistical learning mechanism. In both cases learning is assumed to occur non-intentionally, without explicit hypothesis formation or testing, and certainly without conscious access to the kinds of computations that appear to produce the learning effects (e.g. calculations of transition probabilities, or some similar statistic). Clearly this assumption is more obviously valid when not only the learning process, but also the resulting knowledge, is shown to be implicit, as is often claimed in AGL experiments. If people do not know what they have learned then it seems highly unlikely that they would be aware of how they learned it.

1. We would like to thank Phillip Hamrick and Rebecca Sachs for helpful comments and Michelle To for generating our Matlab figures.
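The transitional-probability statistic referred to in the opening paragraph can be illustrated with a minimal sketch: forward transitional probabilities between adjacent syllables are computed from a continuous stream, and word boundaries are posited where they dip. The syllable "words", the threshold, and the toy stream below are invented for illustration; they are not the materials of any of the studies cited here.

```python
from collections import Counter

def transitional_probabilities(stream):
    """P(next syllable | current syllable) for adjacent pairs in a stream."""
    pairs = list(zip(stream, stream[1:]))
    pair_counts = Counter(pairs)
    first_counts = Counter(a for a, _ in pairs)
    return {(a, b): pair_counts[(a, b)] / first_counts[a] for a, b in pair_counts}

def posit_boundaries(stream, threshold=0.7):
    """Place a word boundary wherever the forward transitional probability
    falls below the threshold (a crude stand-in for a local dip)."""
    tps = transitional_probabilities(stream)
    words, current = [], [stream[0]]
    for a, b in zip(stream, stream[1:]):
        if tps[(a, b)] < threshold:
            words.append("".join(current))
            current = []
        current.append(b)
    words.append("".join(current))
    return words

# Toy stream built from three invented "words": pabiku, tibudo, golatu.
# Within-word transitions have probability 1.0; between-word transitions
# are at most 2/3, so the threshold separates them.
stream = "pa bi ku ti bu do go la tu ti bu do pa bi ku go la tu pa bi ku ti bu do".split()
print(posit_boundaries(stream))
# ['pabiku', 'tibudo', 'golatu', 'tibudo', 'pabiku', 'golatu', 'pabiku', 'tibudo']
```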


Within the AGL tradition an important issue has been whether people learn the underlying structure of an artificial grammar. If they do then their knowledge should generalise to new surface forms, just as knowledge of natural language grammars generalises to an infinite number of sentences. Reber (1969) investigated the effect of changing the lexicon or the grammar of the artificial language during the learning episode. The study consisted of two parts. The first part was a learning task that required subjects to memorize strings generated by a finite-state grammar. The second part consisted of a transfer task. Subjects were asked to continue memorizing letter sequences, but the letter sequences were modified without warning. For one group of subjects, the sequences were made up of the same letters as those in the first part, but a different finite-state grammar was used to generate them (old lexicon, new rules). For another group of subjects, the rules were the same, but the actual letters used to represent the grammar were changed (old rules, new lexicon). Reber found that changing the rules had a disruptive effect on subjects' memorization performance, but changing the lexicon had no detrimental effect. The memorization advantage observed in Reber (1967, Experiment 1) was maintained as long as the rules remained the same. That is, subjects could "transfer" the knowledge acquired while memorizing one set of letter sequences to the memorization of a different set of letter sequences, even though both sets featured different lexicons. Reber concluded that implicit learning results in an abstract representation of the structure displayed in the stimulus environment. The transfer effect has been frequently replicated (Altmann, Dienes, & Goode, 1995; Brooks & Vokey, 1991; Gomez, 1997; Gómez & Schvaneveldt, 1994; Matthews et al., 1989; Tunney & Altmann, 2001; Whittlesea & Dorken, 1993). Several studies have shown transfer to be limited to certain conditions (Berry, 1991; Berry & Broadbent, 1984, 1988) but its existence seems largely noncontroversial. What remains contentious, however, is how to explain the underlying process. Reber's (1967, 1969) initial assumption was that subjects acquire abstract, rule-based knowledge during AGL experiments (see Reber, 1989, p. 114). According to the abstractionist account, the mental representation established during AGL consists of a symbolic structure that is independent of the original surface form of the training materials (Manza & Reber, 1997; A. S. Reber, 1989; A. S. Reber & Lewis, 1977). The transfer effect is explained by assuming that subjects learn rules which capture the structure of the training stimuli and then use this knowledge when judging whether test stimuli follow the same rules or not (e.g. Knowlton & Squire, 1996; Manza & Reber, 1997).


An alternative view is that they merely learn the local patterns of alternation and repetition that the grammar generates, known as its "repetition structure" (Tunney & Altmann, 2001). Although repetition structures are an abstract form of knowledge (Marcus et al., 1999), and may indeed be relevant to learning certain phonological rules, they are not the kinds of abstract rules that many linguists believe underlie natural language grammars in general. What the AGL tradition does tell us, however, is that people are at least able to implicitly learn chunks of surface form; i.e. frequently occurring bigrams and trigrams in the input strings (Johnstone & Shanks, 1999; Perruchet & Pacteau, 1990; Servan-Schreiber & Anderson, 1990). AGL learning can be modelled by chunking mechanisms (Cleeremans & Dienes, 2008). This is supported by research showing failures to implicitly learn grammars that cannot be induced through chunking surface forms (Matthews et al., 1989; Shanks, Johnstone, & Staggs, 1997). An important difference between typical AGL experiments and natural language learning is that AGLs exclude meaning. Therefore, they might underestimate the power of implicit learning to induce linguistic rules. There have been previous studies of miniature natural-like languages where the words have clear referents and the sentences are used to convey coherent messages (e.g., Friederici, Steinhauer, & Pfeifer, 2002; Morgan-Short et al., 2010; Mueller, Girgsdies, & Friederici, 2008; Mueller et al., 2005). These studies do appear to show learning of word order and agreement rules. But it is not clear whether the learning processes at work were purely statistical because the participants were in a situation where they had to work out the system in order to perform the task; that is, they were exposed to the system under intentional learning conditions. The main focus of those studies was on the knowledge that was acquired, and the associated brain regions and ERP responses evoked when it was used, rather than the learning process itself. But since these studies did not employ incidental learning paradigms it is not clear to what extent the results are due to simple statistical learning. Another problem with typical artificial language studies is that participants have to learn a very small system, consisting of just a few lexical items (just 14 words in the studies cited above). This makes it difficult to construct strong tests of generalisation, which should involve novel sentences with new lexis. In order to show that people have acquired knowledge that is more abstract than just surface chunks or transition probabilities between words we have to show generalisation to sentences with new lexis.
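The chunk-based account sketched above is often operationalised as associative chunk strength: the average training frequency of a test string's bigrams and trigrams. The following sketch shows one common way of computing it; the letter strings are invented for illustration and are not items from any of the grammars cited here.

```python
from collections import Counter

def ngrams(string, n):
    return [string[i:i + n] for i in range(len(string) - n + 1)]

def chunk_strength(test_string, training_strings):
    """Average training frequency of the test string's bigrams and trigrams,
    one common operationalisation of associative chunk strength."""
    counts = Counter()
    for s in training_strings:
        for n in (2, 3):
            counts.update(ngrams(s, n))
    chunks = ngrams(test_string, 2) + ngrams(test_string, 3)
    return sum(counts[c] for c in chunks) / len(chunks)

# Invented letter strings standing in for the output of a finite-state grammar.
training = ["TPTXVS", "TPPTXVS", "TXXVPS", "TPTXXVS"]
print(chunk_strength("TPTXVS", training))  # high: built from familiar chunks
print(chunk_strength("TSVXPT", training))  # low: same letters, unfamiliar chunks
```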


A method for examining learning of natural language word order regularities

To overcome the problems identified above Rebuschat & Williams (in press) used a semi-artificial language in which English lexis was combined with the syntax of a natural language that was unfamiliar to the participants. This resulted in sentences with apparently "scrambled" word order that were nevertheless perfectly comprehensible, e.g., "A few months ago ranted Chris about the government's plans", "George repeated today that the movers his furniture scratched", "After the instructor a sword brandished, focused Brian more on his defensive stance". In the exposure phase participants made plausibility judgments on sentences, hence engaging in a meaning-focused task that did not require them to analyse word order patterns (e.g., "Rose abandoned in the evening her cats on planet Venus" is implausible). After this they were informed that "The scrambling of the previous sentences was not arbitrary but followed a complex system" and were required to perform a grammaticality judgement test (GJT). Because native language lexis was employed, these sentences could use entirely new content words, and hence tested generalisation of syntactic knowledge in a way that was not confounded by the familiarity of specific sequences of words. The status of the knowledge that was acquired was assessed for each judgment through subjective measures of confidence (not confident, somewhat confident, very confident) and source attribution (guess, intuition, memory, rule) (following Dienes & Scott, 2005). We assumed that judgements of low and moderate confidence and/or guess or intuition source attribution reflected implicit knowledge (see Dienes, this volume). It must be stressed that this paradigm was developed in the context of studying the initial stages of L2 acquisition. One might envisage a situation in which the second language learner has acquired some vocabulary in the language, begins to derive meaning from sentences in the input, but does not actually explicitly analyse the structure of those sentences. Following models of bilingual lexical development we assume that foreign language words initially associate with the meanings of first language translation equivalents (Kroll & Stewart, 1994; Kroll & Tokowicz, 2005). We also assume that they inherit the grammatical properties of their translation equivalents, as shown, for example, by cross-language syntactic priming (Salamoura & Williams, 2007; Schoonbaert, Hartsuiker, & Pickering, 2007), and cross-language priming of grammatical gender (Salamoura & Williams, 2008). Hence, in an experimental context, even if one were to burden participants with the task of learning novel lexical items, these


Table 1. Descriptions and Examples of the Three Verb Placement Rules

V2: Finite verb placed in second phrasal position of main clauses that are not preceded by a subordinate clause.
    Example: Today bought John the newspaper in the supermarket.

V1: Finite verb placed in first position in main clauses that are preceded by a subordinate clause.
    Example: Because his parents the newspaper in the supermarket bought, spent John the evening in his study.

VF: Finite verb placed in final position in all subordinate clauses.
    Example: Peter repeated today that the movers his furniture scratched.

would merely inherit semantic and syntactic information from their first language equivalents. Therefore one might as well use first language words themselves, relieve the participants of the burden of learning and processing a large number of novel lexical items, and exploit the size of the native language vocabulary to construct strong tests of generalisation. Returning to the Rebuschat & Williams (in press) study, the GJT test items either followed syntactic patterns that had been encountered in the exposure phase, or new patterns that violated the word order rules. The rules at issue concerned German verb placement, which is conditioned by clause type and clause sequence. In the exposure phase there were three basic patterns, exemplified in table 1. Note that these are not simple word position rules since the structural patterns concern the ordering of phrases (e.g., Usually drew Jack his clients in a realistic fashion and A few months ago ranted Jessica about the government’s plans are both V2 constructions even though the verb is the fifth word in the second sentence). In the test phase participants performed a GJT on sentences with new lexis that either obeyed these verb position rules or violated them (see table 2). A control group performed this GJT without any prior exposure (they were instructed to judge how likely they thought it would be that each sentence would be grammatical in the world’s languages). None of the participants had any knowledge of German or languages with relevantly similar word order rules. The first question was what participants would learn incidentally about the word order rules of the language. The possibilities were that they would learn (i) associations between surface forms (e.g. that John was preceded by bought), (ii) structural patterns defined over abstract representations of


Table 2. Grammatical and Ungrammatical Patterns Used in the Testing Set

Grammatical
  V2:     Yesterday scribbled David a long letter to his family.
  V2VF:   Recently have his parents an accountant consulted.
  V2-VF:  Paul argued recently that the chairman the wrong figures displayed.
  VF-V1:  Because his children fairy tales loved, invented John many stories.

Ungrammatical
  *VF:    Recently Jim the Boston Marathon in four hours ran.
  *V3VF:  Yesterday the guitar was by David smashed.
  *V2-V1: Recently maintained David that abstained his father from unhealthy food.
  *VF-VF: Because his son an instrument wanted, David with the music teacher talked.

phrasal types, (e.g. for the V2 structure ‘Time Phrase – Verb – Subject – Object’), or (iii) generalised syntactic rules (e.g., verb in second phrasal position in single clause sentences). If (i) were the case then participants should show no discrimination between grammatical and ungrammatical structures because all test sentences involve new lexis. In the case of (ii) and (iii) there should be discrimination between grammatical and ungrammatical sentences. Because the grammatical sentences repeat syntactic patterns from training it is not possible in this case to say whether they have learned patterns or rules (but see below for an experiment that attempts this). It was found that the group who had received exposure performed significantly better than the controls on the grammatical structures (71% versus 36%), but did not di¤er on the ungrammatical structures (47% and 51%). Thus, there was clear evidence that, at a minimum, the experimental group incidentally learned the syntactic patterns underlying the exposure sentences, which it must be emphasised in this case concern the sequences of phrases and not individual words. The failure to reliably reject ungrammatical structures may suggest that there was no learning of grammatical rules, assuming that rules enable a clear identification of what is ruled out, as well as what is ruled in. However, because participants may be able to identify cases where a rule applies, but be unsure about the grammatical


status of cases where it does not, then this is not strong evidence for a failure to learn rules. Analysis of the data in terms of awareness measures revealed that overall performance was significantly above chance even for judgements where participants claimed that they were using intuition, as well as for judgements where they said they were using rules. Thus there was a mixture of implicit and explicit knowledge of the target structures, consistent with work in artificial grammar learning (Dienes, this volume). In one sense these results are inconsistent with AGL research, on the basis of which we might have expected only learning of surface chunks, and hence no generalisation to sentences with new lexis. Of course, this inconsistency only arises if we adopt naı¨ve assumptions about the kinds of representations over which people are able to learn the structure of sentences. If the learning mechanism is able to operate over more abstract categories than word forms, for example underlying phrasal categories, then what we have here is a simple case of learning sequences of phrase types. This secures generalisation to sentences with new lexis in a way that does not occur in AGL experiments. Similarly, Kaschak & Glenberg (2004) showed that participants can incidentally learn a novel construction in their native language which then generalises to sentences with new lexis. And Hudson Kam (2009) in an experiment employing an entirely artificial miniature language showed that adults can learn the word orders associated with specific verbs as abstract patterns which generalise to new lexis. The Rebuschat & Williams (in press) study extends these findings to a situation where the abstract patterns are defined over phrasal types and not individual words. However, given the chance performance on ungrammatical structures we cannot claim that participants learned the rules governing how clause type and sequence determines word order patterns within clauses.

Can abstract rules be learned? The potential for learning abstract grammatical rules was explored more fully by Williams & Kuribara (2008) and Williams (2010) in the context of Japanese word order regularities. English words were combined with Japanese word order and case markers to form sentences such as ‘‘John-ga Mary-ni ring-o gave’’ (John gave Mary a ring), sentences which with prior instruction on the meanings of the case markers are readily interpretable (-ga marks nominative, -o accusative, and –ni dative). The procedure involved presenting many such ‘‘Japlish’’ sentences, reflecting a variety of


constructions, in an exposure phase where the participant’s task was to judge sentences for semantic plausibility (e.g. ‘‘John-ga Mary-ni planet-o gave’’, ‘John gave Mary a planet’, would be implausible). The grammatical focus of these studies was on two interrelated properties of Japanese syntax: scrambling and head direction. The canonical Japanese word order is S(I)OV in simple sentences, and S[S(I)OV]V in complex sentences (e.g. Mary-ga John-ga book-o stole that said, ‘Mary said that John stole a book’). But it is possible to scramble constituents to produce other word orders. From the perspective of generative theory the principle is that scrambling is defined as movement in the direction opposite to the head direction (Saito & Fukui, 1998). Since Japanese is right-headed, movement to the left is possible, resulting in structures in which the verb is always clause-final, but other arguments can appear in di¤erent positions (e.g., OSV, ISOV). In complex sentences arguments can even be moved out of the embedded clause (e.g. OS[SV]V, Book-o Mary-ga John-ga stole that said). We asked if participants were only exposed to a subset of possible scrambled structures (e.g., involving object and adjunct movement) whether they would subsequently generalise to structures involving movement of other constituents (e.g., indirect objects). This would indicate learning of a generalised notion of scrambling. On the other hand, they might only accept instances of scrambling of a type that they have experienced in the input. In this case they could only be said to have learned specific syntactic patterns. Of course, it is also important to test whether scrambling is appropriately constrained. Therefore there were also ungrammatical test structures that violated the principle of scrambling because the verb was not in final position. Participants received either 194 or 388 sentences in the artificial language in the plausibility task. They then performed a GJT on sentences containing new lexis. There was also a control group who performed the GJT with no prior exposure. The results provided no evidence for learning a generalised notion of scrambling, nor indeed of the head-final property of the language. Rather, what people appeared to learn were the structures that they were trained on – as abstract syntactic patterns. But they did not learn that scrambling could be applied to new constituents (e.g., indirect objects). Nor did they learn the rule that all clauses had to end in verbs; that is, the head-final characteristic of the language. Despite the fact that every clause that they had received in training had ended in a verb they were still not better than chance at rejecting complex structures in which an embedded clause did not end in a verb, e.g., *S[SVO]V and *S[OVS]V. Rejection of simple sentences that did not end in a verb was rather better, e.g. *SIVO,


*IOVS. Thus, they may have learned that sentences should end with a verb, but this was not the same as knowing that all clauses have to end with a verb, as would be required if the head direction of the language had been acquired. But the participants did not simply learn a restricted set of syntactic patterns. If this had been the case they would have reliably rejected sentences that did not conform to patterns they had encountered before. This was clearly not the case, since absolute acceptance rates of new scrambles were around 60% in all conditions (this rate was not affected by level of exposure), and acceptance of ungrammatical structures was around 40% for the participants who had received exposure. An alternative hypothesis is that participants were basing their decisions on the similarity between a particular test sentence and the underlying statistical structure of the corpus of exposure sentences, and that acceptance rates reflected the familiarity of structures with respect to this statistical knowledge base. But the computation of similarity must have been based on abstract linguistic categories, rather than word forms. This hypothesis was supported by connectionist simple recurrent networks (SRNs) that were trained on the input coded as grammatical categories (e.g., S, O, V) and reproduced the relative rates of acceptance of the different test structures (Williams, 2010).
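The SRN modelling reported in Williams (2010) is not reproduced here, but a minimal Elman-style network over category-coded sequences conveys the idea: the network learns to predict the next grammatical category, and the average surprisal it assigns to a test sequence can be read as an (un)familiarity score. Everything below – the category inventory, toy corpus, network size, learning rate, and the shortcut of training only the output weights – is an invented illustration, not the architecture actually used in that study.

```python
import numpy as np

rng = np.random.default_rng(0)

CATEGORIES = ["S", "I", "O", "V", "#"]      # '#' marks the end of a clause
IDX = {c: i for i, c in enumerate(CATEGORIES)}
N, HIDDEN = len(CATEGORIES), 10

def one_hot(cat):
    v = np.zeros(N)
    v[IDX[cat]] = 1.0
    return v

# Input->hidden and context->hidden weights stay fixed (a crude shortcut);
# only the hidden->output readout is trained.
W_in = rng.normal(0, 0.5, (HIDDEN, N))
W_ctx = rng.normal(0, 0.5, (HIDDEN, HIDDEN))
W_out = rng.normal(0, 0.5, (N, HIDDEN))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def step(cat, context):
    h = np.tanh(W_in @ one_hot(cat) + W_ctx @ context)
    return softmax(W_out @ h), h

def train(sequences, epochs=200, lr=0.1):
    """One-step (truncated) updates of the readout weights using the
    softmax/cross-entropy gradient (p - target) * hidden_state."""
    global W_out
    for _ in range(epochs):
        for seq in sequences:
            context = np.zeros(HIDDEN)
            for cur, nxt in zip(seq, seq[1:]):
                p, h = step(cur, context)
                W_out -= lr * np.outer(p - one_hot(nxt), h)
                context = h

def surprisal(seq):
    """Mean negative log probability of each next category: lower values mean
    the sequence is more familiar to the trained network."""
    context, total = np.zeros(HIDDEN), 0.0
    for cur, nxt in zip(seq, seq[1:]):
        p, context = step(cur, context)
        total -= np.log(p[IDX[nxt]])
    return total / (len(seq) - 1)

# Toy corpus: canonical SOV clauses plus some scrambled OSV clauses.
corpus = [list("SOV#")] * 40 + [list("OSV#")] * 20
train(corpus)
print(surprisal(list("SOV#")))   # low: trained, verb-final pattern
print(surprisal(list("SVO#")))   # higher: verb not clause-final
```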

The contribution of linguistic knowledge

One way of construing the learning process occurring in the experiments described above is as a form of sequence learning over abstract grammatical categories. But is this the same kind of sequence learning process that is operative in, say, the more familiar statistical learning experiments that look at learning sequences of syllables (Saffran, Newport, & Aslin, 1996)? Does the nature of what is tracked and counted make any difference? As Johnson (this volume) points out, the view that learning is driven by a statistical learning mechanism leaves open the nature of the representations over which it operates. On the other hand, it is possible that because the representations are now grammatical, or semantic, different, non-statistical, learning mechanisms come into play. A few previous studies have sought to assess how the addition of grammatical and semantic information affects the learning process. Robinson (2005) compared the performance of the same individuals on AGL and incidental natural language learning (participants were exposed to a microlanguage based on Samoan).


He focused on two areas: first, whether the two learning tasks would pattern in similar ways with respect to measures of individual differences (IQ, measures of aptitude, etc.), and second, whether effects of chunk strength (Knowlton & Squire, 1996) would be found in both cases. He found very different patterns of results for the two tasks. There were different patterns of correlations with individual differences, and whilst effects of chunk strength and grammaticality were obtained in AGL, no such effects were obtained for Samoan. Thus different learning processes appeared to be at work in AGL and incidental natural language learning. In fact the addition of meaning may decrease the learnability of sequences. When presented as meaningless syllable sequences, repetition-based patterns like AAB (e.g. wo-wo-fe) are easily learned by adults and infants (Marcus et al., 1999), but when the categories correspond to grammatical categories, e.g. scavenge-listen-camel (VVN), these simple patterns are not (Endress & Hauser, 2009). It appears that linguistic knowledge can actually make us blind to patterns that would otherwise be readily learnable. Endress & Hauser argue that this is because syntactic processing, and learning, is modularised. Presumably this means that it does not draw on the same sequential learning mechanisms that operate in other domains. Mueller et al. (2008) examined learning of Japanese syntax in a miniature system and found that participants were actually more likely to produce native-like ERP signatures when meaning was removed. They suggest that the demands of semantic processing may interfere with learning the grammatical regularities. In this case it is meaning that makes people blind to syntax. On the other hand the presence of semantic information might increase learnability. Non-adjacent dependencies, for example the A-B association in AXB sequences, have long been a focus of attention in the statistical learning literature, and in general have proved difficult to learn (Newport & Aslin, 2004; see also Hay & Lany, this volume; Perruchet & Poulin-Charronnat, this volume). Yet Amato & MacDonald (2010) found that in a language involving meaning, the equivalent of non-adjacent dependencies was easily learned. The extent to which grammatical and semantic knowledge is helpful may depend upon the naturalness of the regularities that need to be learned. The only exception to the pattern-blindness found by Endress & Hauser (2009) was when the sequence made syntactic sense, e.g. baby-water-juggle (NNV) and clever-fragile-water (AAN). They argue that if the input can be interpreted syntactically (even if not matching syntactic patterns in the native language) its underlying structure can be learned. The ready learnability of the syntactically possible is reminiscent of a study by Cleary & Langley (2007), who found evidence for retention of the word order of meaningless, but grammatical, sentences, but not for ungrammatical sentences.


For example, 'Beautiful transportation sheds temporary plants' primed sentences with the same sequence of grammatical categories. However, ungrammatical strings like 'sour a kick clean balloon hard' did not. The findings from Endress & Hauser (2009) and Cleary & Langley (2007) suggest that for grammatical knowledge to be helpful it has to resonate with the novel input, otherwise it might have a detrimental effect. In order to gauge the impact of linguistic knowledge and/or meaning on learning word order regularities we need to employ methods that allow direct comparisons between linguistic and non-linguistic instantiations of the same underlying sequence. Here we shall discuss two studies in which we have tried to do this, albeit using very different methods.

Comparing meaningful sentences and nonsense analogues

In the Williams (2010) study the impact of grammatical knowledge and/or meaning was assessed by replacing the lexical items in the Japlish sentences with nonsense syllables. Each grammatical category was substituted with a set of similar nonsense syllables. For example, grammatical subjects were randomly replaced with one of si/se/sa/so, objects with one of pi/pe/pa/po, and verbs with one of ki/ka/ko/ku. For wh-words, what became fu, who became fe, and when became fa. The complementiser that was replaced with me. So for example, the original Japlish sentence Pilot-ga that runway-o saw became Se-ga pi-o ku; Writer-ga who-ni what-o handed? became Si-ga fe-ni fu-o ki?; Detective-ga suspect-ga that car-o stole that announced became Si-ga so-ga pe-o ko me ku. During the exposure phase participants performed a probe recognition task on each string (e.g., Si-ga fe-ni fu-o ki? might be followed by the probe fe-o, and the correct answer would be 'no'). The test strings were again nonsense analogues of the original test items, and care was taken that the syllable sequences were unique. Participants were asked to indicate whether they thought the strings were generated by the same system as the exposure strings. The results from the nonsense analogue and the (388) Japlish exposure condition were then compared. The mean endorsement of grammatical items was almost equivalent (70% and 72% for the analogue and Japlish respectively), and although endorsement of ungrammatical items was slightly lower in the analogue than in Japlish (24% and 30% respectively), this difference was not significant.
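The substitution scheme just described maps each grammatical category onto a small set of nonsense syllables while the case markers are carried over. A minimal sketch of how such analogues might be generated is given below; the syllable sets follow the description in the text, but the function names and the category-coded example are invented for illustration.

```python
import random

random.seed(1)

# Syllable sets for each grammatical category, following the substitution
# scheme described in the text; case markers (-ga, -ni, -o) are kept as is.
SYLLABLES = {
    "S": ["si", "se", "sa", "so"],
    "O": ["pi", "pe", "pa", "po"],
    "V": ["ki", "ka", "ko", "ku"],
    "WHAT": ["fu"], "WHO": ["fe"], "WHEN": ["fa"],
    "THAT": ["me"],
}

def to_analogue(coded_sentence):
    """Replace each (category, case-marker) element with a random syllable
    from that category's set, keeping the case marker."""
    out = []
    for category, marker in coded_sentence:
        out.append(random.choice(SYLLABLES[category]) + marker)
    return " ".join(out)

# 'Writer-ga who-ni what-o handed?' coded as category/marker pairs.
coded = [("S", "-ga"), ("WHO", "-ni"), ("WHAT", "-o"), ("V", "")]
print(to_analogue(coded))  # e.g. 'se-ga fe-ni fu-o ka'
```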


To evaluate the correspondence between the experiments at a finer level of detail, the correlation was calculated over the mean acceptance rate for each type of test structure (there were 21 different types, with 4 sentences per type). The correlation between the acceptance rates in the two experiments was r = 0.832, p < 0.001, meaning that 69% of the variability in Japlish was accounted for by endorsement rates in the analogue. The correspondence between the results of the two experiments suggests that similar learning processes were at work. However, the correlation between the results of the studies was not perfect. Such discrepancies as did exist could be traced to the role of meaning and linguistic knowledge in the learning of, and making judgments upon, Japlish. For example, the long distance scrambles OS[SIV]V and IS[SOV]V were endorsed relatively poorly in Japlish compared to the nonsense analogue. It is reasonable to attribute this to the unusual mapping between linear surface form and meaning that results from extraction from the embedded clause (e.g., That book-o John-ga student-ga borrowed that claimed, John claimed that the student borrowed that book). Thus, the presence of meaning in Japlish actually suppressed acceptance of a pattern that, as a pattern, was more highly endorsed in the analogue. Whether the effort after semantic interpretation actually suppressed learning, or just the judgement process, is not clear however. Also, in Japlish, multiple wh- sentences such as Who-ga what-o bought? were endorsed at the same high level as their declarative counterparts such as Mary-ga book-o read, even though they were less frequent in the input. In contrast, the acceptance rate for the analogues of the wh- patterns was lower than the declaratives (e.g., The fe-ga fu-o ko pattern was endorsed at a significantly lower rate than the si-ga pe-o ku pattern), as would be predicted by their frequency. Given that the same –ga –o pattern is present in all cases, the difference is likely to arise because in Japlish, wh-questions and declaratives are encoded in terms of the same underlying grammatical pattern (e.g., SOV). Wh-words and nouns are already known to be in some sense functionally equivalent. But in the analogue there is no prior reason to suppose functional equivalence between syllables of the f_ class and the s_ and p_ classes. The results of the comparison between Japlish and the analogue therefore suggest two ways in which prior grammatical knowledge and meaning can impact on learning. First, the complexity of semantic interpretation may detract either from pattern learning, or reduce the acceptability of complex structures in a GJT, a result that is similar to Mueller et al. (2008) who also targeted Japanese. Second, the potential for a grammatical interpretation changes the representations that are counted by the statistical


learning mechanism. Structures that on the surface appear to be different (e.g. Who-ga what-o bought and Mary-ga book-o read) reinforce the same underlying pattern because they share a common abstract coding (SOV). In this case the coding is interpretable, and hence linguistic knowledge facilitates acquisition of a structure that is of relatively low frequency in the input. However, it must be stressed that overall the close correspondence between Japlish and the analogue suggests that the underlying learning mechanism was similar in the two cases. In the case of the nonsense analogue it is plausible that a simple statistical learning mechanism that tracks sequential contingencies between syllables was at work. And indeed, the same kind of SRN that was used to model Japlish produced a remarkably close fit to these data (r² = 0.96). Although the fit obtained for the Japlish model was significantly lower (r² = 0.83, significantly different at p < 0.05 using a Fisher r-to-z transformation), the high fit in the two cases suggests that a similar learning mechanism was at work, except that it operated over different kinds of representation.
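The Fisher r-to-z comparison mentioned above can be reproduced with a few lines of arithmetic. In the sketch below, the n of 21 test-structure types is taken from the text, and the test treats the two correlations as independent, which is a simplification for these data.

```python
from math import atanh, erfc, sqrt

def compare_correlations(r1, n1, r2, n2):
    """Two-tailed z test for the difference between two independent
    correlations via Fisher's r-to-z transformation."""
    z1, z2 = atanh(r1), atanh(r2)
    se = sqrt(1 / (n1 - 3) + 1 / (n2 - 3))
    z = (z1 - z2) / se
    p = erfc(abs(z) / sqrt(2))  # two-tailed p from the normal distribution
    return z, p

# r values corresponding to the two model fits (r^2 = 0.96 and 0.83),
# each computed over the 21 test-structure types.
z, p = compare_correlations(sqrt(0.96), 21, sqrt(0.83), 21)
print(round(z, 2), round(p, 3))  # roughly z = 2.3, p ≈ 0.02, i.e. p < 0.05
```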

Linguistic and non-linguistic serial reaction time tasks

If we regard grammar learning as a form of sequence learning then it could be argued that GJTs are a poor method for measuring it because they do not measure sequence processing as such. For instance, they do not tell us which aspect of a target sentence the participant found unacceptable. Nor do they show that sequence knowledge affects real-time processing, as opposed to some off-line reflective process. And of course, psycholinguistic work on sentence processing has long since abandoned the use of GJTs in favour of on-line reading time and event-related potential (ERP) measures. Assuming that these performance effects are a reflection of the linguistic knowledge that is actually used in language comprehension and production, then research on language learning needs to adopt these kinds of measures as well (for examples see Amato & MacDonald, 2010; Friederici, Steinhauer, & Pfeifer, 2002; Kaschak & Glenberg, 2004). What we need, therefore, are on-line methods that measure processing of sequences in time. Furthermore, in order to gauge the impact of linguistic knowledge and meaning we also need to make comparisons with processing the same sequence in a non-linguistic task. The standard paradigm for examining sequence learning in psychology is the serial reaction time (SRT) task (e.g., Cleeremans & McClelland, 1991), in which participants track the movement of a stimulus by hitting corresponding response keys.


Unbeknownst to the participants, the sequence of stimulus positions follows a regular pattern and, by comparing reaction times to sequences that do or do not follow this pattern, learning effects can be evaluated. The SRT has become the preferred method for examining implicit learning in recent years, superseding the use of the AGL paradigm, most especially in neuropsychological research given that it is simple to administer to patient groups. The use of this paradigm also reflects a growing recognition of the importance of sequence learning, not only in motor skills, but also in complex cognitive skills such as language (Lieberman, 2007; Ullman, 2004). Because in an SRT task the regularity to be learned involves a sequence of screen positions, the stimulus itself can either have linguistic content or not, whilst keeping the underlying sequence, as defined over screen positions, constant. Instead of seeing a constant stimulus move around screen positions the stimuli can be words or phrases that comprise a meaningful sentence. Screen positions can correspond to grammatical categories of phrases so that the spatial sequence is an analogue of the grammatical sequence. The task becomes a cross between a segment-by-segment reading procedure and an SRT task. The methodology allows sensitive on-line measurement of sequence learning effects and direct measurement of the impact of linguistic meaning. We have used this modified SRT procedure to extend the Rebuschat & Williams (in press) work on learning German word order. Working on the assumption that people encode the exposure sentences as sequences of grammatical phrases, a sentence like Yesterday evening ate John a pizza at the restaurant would be learned as an underlying phrasal sequence TP V S O Adj (where TP stands for temporal adverbial phrase, and Adj stands for adjunct). The sequences can then be instantiated in a linguistic or non-linguistic SRT simply by assigning screen positions and corresponding keys to grammatical categories. Figure 1 shows one of the two layouts used in the experiment. The linguistic version of the task used words as stimuli, and the non-linguistic version used a constant meaningless stimulus (we used the string 'XXX'). In the linguistic version the task is akin to a phrase-by-phrase reading paradigm except that the position of the phrase on the screen is variable, and the response key changes accordingly. In the linguistic version a plausibility decision was required after every string in the exposure phase. In the non-linguistic version there was no additional task. The exposure phase was followed by a surprise test phase in which participants tapped through more sequences but this time, after each one, had to indicate whether it conformed to any of the patterns seen in the exposure phase.

Figure 1. A possible layout of screen positions (top pane) and response keys (bottom pane) in the SRT experiments. Note that labels were not presented to the participants. Note: CMP = complementizer, SUB = subordinator, TP = time phrase, V = Verb, S = Subject, O = Object, Adj = adjunct, Beg = Beginning of sentence, End = End of sentence.

In other words they were essentially performing a recognition memory task for sequence structures. The test structures are shown in table 3. There were 5 examples of each type, or the same number of repetitions of the structures in the non-linguistic version. The exposure phase sentences comprised 40 examples of each of the grammatical structures (V2, V2-VF, VF-V1), or the equivalent number of sequences in the non-linguistic version. In the linguistic version different lexis was used in the exposure and test sentences. The test sentences were formed into three sets in which the same words were configured into different word orders and rotated around conditions between participants. This enabled comparisons of reaction time profiles to be made in such a way as to control for lexical effects. A total of 24 participants were tested in the linguistic and non-linguistic versions respectively. None of them knew any German.

The reaction time effects in the SRT tasks were remarkably similar for the linguistic and non-linguistic versions. In both cases there were longer response times in the ungrammatical than the grammatical sequences at the points at which the sequences were violated.

Table 3. Example test items used in the SRT task

Grammatical
  V2       Yesterday scribbled David a letter to his brother.
  V2-VF    Yesterday acknowledged David that in the evening Jenny the computer stole.
  VF-V1    Because his dog many shoes devoured asked yesterday Peter the vet for advice.

Ungrammatical
  *V1      Scribbled yesterday David a letter to his brother.
  *VF      Yesterday David a letter to his brother scribbled.
  *V2-V2   Yesterday acknowledged David that in the evening stole Jenny the computer.
  *VF-V2   Because his dog many shoes devoured, yesterday asked Peter the vet for advice.

However, this was only the case for the short sequences/simple sentences, i.e. for the *V1 vs V2 and *VF vs V2 contrasts. By way of exemplification, figure 2 shows the reaction time profiles for the *VF vs V2 comparison for each version. No such slow-downs were evident in the long sequences. The addition of linguistic meaning did not appear to make the structure of long strings more learnable by this measure, but neither did it detract from learning of the short strings. Sequence learning was similar in the linguistic and non-linguistic versions.

Reaction times were not the only measure of learning in this experiment. Participants also made recognition memory decisions after each test string. The results are shown in figure 3. Discrimination between grammatical and ungrammatical structures (equivalent in this context to a grammaticality judgement task, GJT) is better for the non-linguistic version than the linguistic version. Mean endorsement rates for grammatical and ungrammatical structures are 61% and 42% in the non-linguistic version, but 68% and 63% in the linguistic version (the interaction is significant, p < 0.05). In fact, in the linguistic version only the V2 vs *V1 comparison is significant, p < 0.001. The contrast between the linguistic and non-linguistic versions is particularly striking for the *VF structure, which is readily discriminable from V2 in the non-linguistic version, p < 0.001, but actually has a numerically higher endorsement rate than V2 for the linguistic version. The addition of linguistic meaning obscured sensitivity to underlying sequential regularities by this measure.
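Both learning measures described here come down to simple summaries of the test-phase data: mean reaction times for ungrammatical versus grammatical sequences at each segment position, and endorsement rates per test structure. The following is a minimal sketch over a hypothetical trial-level dataset; the field names and numbers are illustrative, not the actual data files.

```python
from collections import defaultdict

# Hypothetical trial-level records (field names and values are illustrative).
trials = [
    {"structure": "V2",  "grammatical": True,  "endorsed": True,
     "rts": {"TP": 480, "V": 455, "S": 470, "O": 460, "Adj": 465}},
    {"structure": "*VF", "grammatical": False, "endorsed": True,
     "rts": {"TP": 485, "S": 530, "O": 470, "Adj": 475, "V": 560}},
    # ... one record per test trial
]

def mean(xs):
    return sum(xs) / len(xs)

# 1. RT measure: at each segment position, is the ungrammatical order slower?
by_pos = defaultdict(lambda: {True: [], False: []})
for t in trials:
    for pos, rt in t["rts"].items():
        by_pos[pos][t["grammatical"]].append(rt)
for pos, groups in by_pos.items():
    if groups[True] and groups[False]:
        print(pos, "violation cost:", mean(groups[False]) - mean(groups[True]), "ms")

# 2. GJT measure: endorsement rate per test structure.
endorsed = defaultdict(list)
for t in trials:
    endorsed[t["structure"]].append(1 if t["endorsed"] else 0)
for s, flags in endorsed.items():
    print(s, "endorsement rate:", round(mean(flags), 2))
```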

In the case of the V2 vs *VF discrimination there is a clear dissociation between the GJT and RT data in the linguistic version. Within SRT research, when recognition judgements show no effect of grammaticality but SRT performance does, it is typically concluded that the sequential knowledge tapped by the SRT is unconscious (Norman et al., 2007; P. J. Reber & Squire, 1998). This dissociation logic has been criticised on the grounds that the SRT may just be more sensitive than the judgement task (Shanks & Perruchet, 2002; Shanks, Wilkinson, & Channon, 2003). However, in the present case we also found the reverse dissociation – in the non-linguistic version there was discrimination between complex grammatical and ungrammatical structures in GJT but not in the SRT (see also P. J. Reber & Squire, 1999, for evidence of effects on recognition but not RT in an SRT task). We would argue, therefore, that the RT and GJT data are reflecting different kinds of knowledge.

If different knowledge sources are driving these tasks, what is their nature, and what is their relevance to language learning? The SRT is commonly assumed to tap procedural learning. Given the similar results for the two versions, the simplest and most conservative interpretation is that in both cases the effects reflected learning in a "unidimensional" (Keele et al., 2003) system that is only interested in the sequence as defined over screen position, whether the stimuli make linguistic sense or not. Learning was not affected by the presence of the secondary task of comprehension in the linguistic version, and at least in the case of *VF the RT effect seems to be implicit. Immunity to secondary task interference and implicitness are hallmarks of unidimensional learning according to Keele et al. (2003). Note also that learning here displays the limitations that one might expect of sequence learning in general; namely a sensitivity to the edges of short strings (hence rejection of *V1 and *VF) and an insensitivity to violations in the middle of long strings (Shukla et al., this volume).2

What kind of knowledge was tapped by the GJT? The obvious possibility is that in the linguistic version the GJT taps into linguistic representations of the sentences that are distinct from the underlying procedural knowledge reflected in the SRT. If we assume that, at this level, the material is represented in terms of sentences and clauses then the pattern of results makes sense.

2. Of course, this is not to say that long strings would not be learned with more exposure – connectionist simulations of this system (using an SRN) show that sensitivity to *V1 and *VF simply emerges before sensitivity to *V2-V2 and *VF-V2.

Figure 2. Reaction times (msec) in the SRT task for the V2 vs *VF comparison. The upper panel displays the results of the linguistic version and the lower panel the results of the non-linguistic version. In both graphs, the categories (TP, *S, O, Adj, *V) are arranged on the x-axis in the order in which they occurred in the ungrammatical sentences, and the categories that deviate from the expected grammatical V2 order are marked with asterisks. All differences between the V2 and *VF structures are significant at these positions. In each panel, errors are ±1 S.E.M. based on data from 24 subjects.3

3. It could be argued that the differences between response times for a particular segment, e.g. V, arise because of positional effects. For example, the difference in RT for the verb could be because the verb is in second position in the V2 structure, but final position in the *VF structure. To evaluate this possibility linguistic and non-linguistic control conditions were run in which participants tapped through the test items with no prior exposure phase. There were no differences between grammatical and ungrammatical strings, showing that the differences shown in Figure 2 are effects of sequence knowledge, and not positional effects.

Figure 3. Endorsement rates in the trained groups for each test structure (V2, *V1, *VF, V2VF, *V2V2, VFV1, *VFV2), in the linguistic and non-linguistic versions. Errors are ±1 S.E.M. based on data from 24 subjects in each condition.

The acceptance of *VF can plausibly be regarded as an overgeneralisation, based on the V2-VF structure, that sentences can end in verbs. And the elimination of discrimination between complex structures in the linguistic version plausibly reflects the fact that all complex sentences are composed of familiar clauses. Indeed, as figure 3 shows, the endorsement rates for the complex structures were uniformly higher in the linguistic version than the non-linguistic version (p < 0.001 across all complex structures). Clearly the participants had not learned the rules of clause type and

clause sequence, but if they had learned V2, V1 and VF as possible clause structures then all of the complex sentences would presumably have seemed fairly familiar. Therefore, at the level of knowledge tapped by the GJT, linguistic knowledge might obscure sensitivity to regularities that are apparent when the material is treated as pure sequence.

In terms of the learning mechanisms discussed in this volume, the GJT results in the linguistic version seem compatible with Perruchet & Poulin-Charronat's (this volume) unit formation view. They argue that associations between elements will not be learned unless they are held within the same attentional chunk. For example, in the domain of segmentation, prosody imposes an intonational phrase structure on the sound stream, and the edges of these units are a critical factor in learning sequential structure (Shukla, Nespor, & Mehler, 2007; Shukla et al., this volume). Associations will not be learned between elements that straddle an intonational phrase boundary (see Perruchet & Poulin-Charronat, this volume, for discussion). Likewise, it is not unreasonable to assume that in the larger language learning context, attentional chunks might be supported by meaning. What this means for our semi-artificial language is that the contingencies between elements at clause boundaries will be very hard to learn. And if beginnings and ends of strings are perceived as being edges of sentences (as high-level meaning-defined chunks) then the generalisation that sentences end in verbs will lead to acceptance of *VF.

This experiment demonstrates how procedural knowledge of sequences may be dissociable from other forms of knowledge, as tapped by different tasks. In the case of language, where syntactic and semantic analyses can lay down deeper and deeper layers of representation, this should come as no surprise. Nor should it be any surprise if the learning processes operating on these other layers of representation are not simply sequence learning processes of the type assumed to underlie, for example, learning to segment continuous speech.

Conclusion

In the studies reported here we have departed from the tradition of statistical learning and AGL research by exploring situations in which pre-existing grammatical knowledge and meaning can be brought to bear. We have defended this approach in the context of L2 acquisition on the assumption that L2 words tend to associate with the meanings and grammatical properties of their L1 translation equivalents, at least in the initial stages. The

question we have posed is whether, when this rich linguistic knowledge is available, the nature or power of the learning process differs. Clearly syntactic knowledge and meaning provide domains over which learning processes can operate that go beyond the kind of form-level learning traditionally examined in statistical learning research. As Johnson (this volume) remarks in the context of first language acquisition, "all computational models have to make some assumptions regarding what sort of information infants can perceive and process, and calculating statistics requires having some unit over which to do the calculations." Clearly, the adult L2 learner has access to a whole arsenal of linguistic categories that can be brought to bear on the input. And Sandoval, Gonzalez & Gomez (this volume), in the context of category learning, suggest that "at that point learners could acquire phrase types characteristic of their native language where the content of the phrases would not be individual words (as in Saffran, 2001) but would be categories" – that is, having acquired lexical categories through statistical learning, then statistical learning can operate upon those categories to learn syntactic structure. Thus, in Japlish there was evidence that wh- and non-wh arguments, though different at the level of form, were treated as equivalent, say at the abstract level of subjects and objects, in the meaningful version. And in the German SRT experiment, the distinction between 'long and short sequence', present in the non-linguistic version, was lost in the linguistic version where both were treated as 'sentences' (inviting an overgeneralisation of VF from complex sentences to simple ones).

If representational re-description alters the shape of the learning landscape, does it alter the nature of the learning mechanism? Is the learning process still essentially statistical – specifically one that is tuned to picking up contingencies between successive events (as might be modelled in a connectionist SRN, for example)? In some cases, yes, it appears so. For example, Shukla et al. (this volume) provide evidence that lexical segmentation is learned by a process that is specifically tuned to consonants. But having imposed the relevant linguistic categorisation, the learning mechanism could still be regarded as essentially statistical. And in the present case, the fit of the Japlish real and analogue results to an SRN suggested that similar statistical sequence learning mechanisms were operating. Thus, the kind of process that tracks contingencies between syllables in the speech stream seems operative at the level of tracking contingencies between grammatical categories in sentences. However, our German SRT study suggests that this assumption may be too simplistic, for here there was much less alignment between the linguistic

and non-linguistic versions. In this case the GJT results for the linguistic version were more suggestive of chunking at the sentence and clause level, even though sensitivity to sequential structure was apparent on a procedural performance measure. What these results remind us of, then, is that multiple levels of representation may be formed simultaneously, tapped by different tasks, and that different learning mechanisms may be operative at different levels.

Our research also reveals limitations on what can be learned. The Japlish studies showed no learning of a generalised notion of scrambling, nor of head-final verb position as a generalised constraint. The German studies showed no evidence of learning the verb placement rules in terms of clause type and sequence. Rather, at best what the participants appeared to learn were just the syntactic patterns that they had received in training, and their responses to novel items in the grammaticality judgement tasks were determined by analogy to these patterns. These patterns were abstractly represented, hence supporting generalisation to sentences with new words (Endress, Nespor, & Mehler, 2009; Kaschak & Glenberg, 2004).

Of course, these limitations could be just a reflection of limited exposure. Although we have not examined the effects of prolonged exposure, we can make a prediction based on simulation work in Williams & Kuribara (2008). When the SRN simulation of Japlish was trained to asymptote (in terms of the error over training items) the only effect in the test was to enhance the response to trained items. The response to ungrammatical and novel grammatical test items did not change. This is presumably because, no matter how many times the network cycles through the training set, the statistical structure of the training data, on which the response to novel items is based, does not change. Thus, although the ability to discriminate old and new items improved, this was because of the greater familiarity of trained patterns, and not a reduced response to ungrammatical ones. Similarly, the effect of increased exposure to the German system would be predicted to have the effect of improving acceptance of trained patterns, but it would not improve rejection of ungrammatical ones. We predict therefore that the human ability to accept novel grammatical structures or reject ungrammatical ones will not change with increasing exposure because both are determined by analogy to examples encountered in the input. If performance on these structures does improve it must be because other learning processes, of a non-statistical kind, are engaged, for example explicit learning, or at the other extreme from statistical learning, a form of implicit learning guided by universal grammar.
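For readers unfamiliar with the architecture, the following is a minimal sketch of an Elman-style simple recurrent network (SRN) of the kind referred to above, trained to predict the next phrase category in a sequence. The category inventory, training sequences, and hyperparameters are illustrative assumptions; this is not the Williams & Kuribara (2008) implementation.

```python
# Minimal Elman-style SRN sketch: predict the next phrase category.
# Illustrative assumptions throughout; not the authors' simulation code.
import numpy as np

rng = np.random.default_rng(0)
CATS = ["Beg", "TP", "V", "S", "O", "Adj", "End"]   # hypothetical category set
IDX = {c: i for i, c in enumerate(CATS)}
N, H = len(CATS), 20                                # input/output size, hidden units

W_xh = rng.normal(0, 0.1, (H, N))   # input -> hidden
W_hh = rng.normal(0, 0.1, (H, H))   # context (previous hidden) -> hidden
W_hy = rng.normal(0, 0.1, (N, H))   # hidden -> output
b_h, b_y = np.zeros(H), np.zeros(N)

def one_hot(cat):
    v = np.zeros(N); v[IDX[cat]] = 1.0; return v

def train(sequences, epochs=200, lr=0.1):
    global W_xh, W_hh, W_hy, b_h, b_y
    for _ in range(epochs):
        for seq in sequences:
            h_prev = np.zeros(H)                         # context reset per sequence
            for t in range(len(seq) - 1):
                x, target = one_hot(seq[t]), IDX[seq[t + 1]]
                h = np.tanh(W_xh @ x + W_hh @ h_prev + b_h)
                z = W_hy @ h + b_y
                p = np.exp(z - z.max()); p /= p.sum()    # softmax prediction
                dz = p.copy(); dz[target] -= 1.0         # cross-entropy gradient
                dh = (W_hy.T @ dz) * (1 - h ** 2)        # backprop one step only,
                W_hy -= lr * np.outer(dz, h); b_y -= lr * dz   # treating the context
                W_xh -= lr * np.outer(dh, x); b_h -= lr * dh   # as a fixed extra input
                W_hh -= lr * np.outer(dh, h_prev)              # (Elman-style training)
                h_prev = h

def surprise(seq):
    """Summed prediction error over a sequence; higher = less expected."""
    h_prev, total = np.zeros(H), 0.0
    for t in range(len(seq) - 1):
        h = np.tanh(W_xh @ one_hot(seq[t]) + W_hh @ h_prev + b_h)
        z = W_hy @ h + b_y
        p = np.exp(z - z.max()); p /= p.sum()
        total += -np.log(p[IDX[seq[t + 1]]])
        h_prev = h
    return total

# Train on a V2-like order only; the untrained *V1 order should be more surprising.
v2 = ["Beg", "TP", "V", "S", "O", "Adj", "End"]
v1 = ["Beg", "V", "TP", "S", "O", "Adj", "End"]
train([v2] * 10)
print(surprise(v2), surprise(v1))
```

The point of the sketch is the one made in the text: with more training cycles the network's responses to trained patterns strengthen, but the statistical structure of the training set, and hence its treatment of novel patterns, does not change.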

Any discussion of the potential limitations of the statistical learning mechanism in relation to language must also be seen in the context of one's view of the nature of language. From a generative linguistic perspective, a failure to learn generalised notions of scrambling or head position would look like a fatal failing of the statistical approach. However, other views lay far more emphasis on the learning of syntactic patterns, rather than rules, such as the broadly emergentist (Ellis, 1998, 2002), usage-based (Lieven & Tomasello, 2008) and construction grammar approaches (see Ellis & O'Donnell, this volume; Goldberg, 2006). The nature of what is learned when we learn a first language continues to be hotly debated. But in L2 research there is a growing consensus that, at least in areas of grammar that are dissimilar to the L1, native-like competence or processing ability may not be achievable (Tolentino & Tokowicz, 2011). In this context we should not be perturbed by evidence for the limited power of statistical learning. This may in fact be an indication of the handicap under which the L2 learner is operating.

References

Altmann, G. T. M., Dienes, Z., & Goode, A. (1995). Modality independence of implicitly learned grammatical knowledge. Journal of Experimental Psychology: Learning, Memory and Cognition, 21, 899–912.
Amato, M. S., & MacDonald, M. C. (2010). Sentence processing in an artificial language: Learning and using combinatorial constraints. Cognition, 116, 143–148.
Berry, D. C. (1991). The role of action in implicit learning. Quarterly Journal of Experimental Psychology, 43, 881–906.
Berry, D. C., & Broadbent, D. E. (1984). On the relationship between task performance and associated verbalisable knowledge. Quarterly Journal of Experimental Psychology, 36, 209–231.
Berry, D. C., & Broadbent, D. E. (1988). Interactive tasks and the implicit-explicit distinction. British Journal of Psychology, 79, 251–272.
Brooks, R. L., & Vokey, R. J. (1991). Abstract analogies and abstracted grammars: Comments on Reber (1989) and Mathews et al. Journal of Experimental Psychology: Learning, Memory, and Cognition, 120, 316–323.
Cleary, A. M., & Langley, M. M. (2007). Retention of the structure underlying sentences. Language and Cognitive Processes, 22, 614–628.
Cleeremans, A., Destrebecqz, A., & Boyer, M. (1998). Implicit learning: News from the front. Trends in Cognitive Sciences, 2, 406–417.
Cleeremans, A., & McClelland, J. L. (1991). Learning the structure of event sequences. Journal of Experimental Psychology: General, 120, 235–253.
Dienes, Z., & Scott, R. (2005). Measuring unconscious knowledge: Distinguishing structural knowledge and judgment knowledge. Psychological Research, 69, 338–351.
Ellis, N. C. (1998). Emergentism, connectionism and language learning. Language Learning, 48, 631–664.
Ellis, N. C. (2002). Frequency effects in language processing: A review with implications for theories of implicit and explicit language acquisition. Studies in Second Language Acquisition, 24, 143–188.
Endress, A. D., & Hauser, M. D. (2009). Syntax-induced pattern deafness. Proceedings of the National Academy of Sciences, 106, 21001–21006.
Endress, A. D., Nespor, M., & Mehler, J. (2009). Perceptual and memory constraints on language acquisition. Trends in Cognitive Sciences, 13, 348–353.
Friederici, A. D., Steinhauer, K., & Pfeifer, E. (2002). Brain signatures of artificial language processing: Evidence challenging the critical period hypothesis. Proceedings of the National Academy of Sciences of the United States of America, 99, 529–534.
Goldberg, A. E. (2006). Constructions at Work. Oxford: Oxford University Press.
Gómez, R. L. (1997). Transfer and complexity in artificial grammar learning. Cognitive Psychology, 33, 154–207.
Gómez, R. L., & Schvaneveldt, R. W. (1994). What is learned from artificial grammars? Transfer tests of simple associative knowledge. Journal of Experimental Psychology: Learning, Memory, and Cognition, 20, 396–410.
Johnstone, T., & Shanks, D. R. (1999). Two mechanisms in implicit artificial grammar learning? Comment on Meulemans and Van der Linden (1997). Journal of Experimental Psychology: Learning, Memory, and Cognition, 25, 524–531.
Kaschak, M. P., & Glenberg, A. M. (2004). This construction needs learned. Journal of Experimental Psychology: General, 133, 450–467.
Keele, S. W., Ivry, R., Mayr, U., Hazeltine, E., & Heuer, H. (2003). The cognitive and neural architecture of sequence representation. Psychological Review, 110, 316–339.
Knowlton, B. J., & Squire, L. R. (1996). Artificial grammar learning depends on implicit acquisition of both abstract and exemplar-specific information. Journal of Experimental Psychology: Learning, Memory, and Cognition, 22, 169–181.
Kroll, J. F., & Stewart, E. (1994). Category interference in translation and picture naming: Evidence for asymmetric connections between bilingual memory representations. Journal of Memory and Language, 33, 149–174.
Kroll, J. F., & Tokowicz, N. (2005). Models of bilingual representation and processing: Looking back and to the future. In J. F. Kroll & A. M. B. de Groot (Eds.), Handbook of Bilingualism (pp. 531–553). Oxford: Oxford University Press.
Lieberman, P. (2007). The evolution of human speech: Its anatomical and neural bases. Current Anthropology, 48, 39–66.
Lieven, E., & Tomasello, M. (2008). Children's first language acquisition from a usage-based perspective. In P. Robinson & N. C. Ellis (Eds.), Handbook of Cognitive Linguistics and Second Language Acquisition (pp. 168–196). New York: Routledge.
Manza, L., & Reber, A. S. (1997). Representing artificial grammars: Transfer across stimulus forms and modalities. In D. C. Berry (Ed.), How Implicit Is Implicit Learning? (pp. 73–106). Oxford: Oxford University Press.
Marcus, G. F., Vijayan, S., Bandi Rao, S., & Vishton, P. M. (1999). Rule learning in 7-month-old infants. Science, 283, 77–80.
Matthews, R. C., Buss, R. R., Stanley, W. B., Blanchard-Fields, F., Cho, J.-R., & Druhan, B. (1989). The role of implicit and explicit processes in learning from examples: A synergistic effect. Journal of Experimental Psychology: Learning, Memory, and Cognition, 15, 1083–1100.
Morgan-Short, K., Sanz, C., Steinhauer, K., & Ullman, M. T. (2010). Second language acquisition of gender agreement in explicit and implicit training conditions: An event-related potential study. Language Learning, 60, 154–193.
Mueller, J. L., Girgsdies, S., & Friederici, A. D. (2008). The impact of semantic-free second-language training on ERPs during case processing. Neuroscience Letters, 443, 77–81.
Mueller, J. L., Hahne, A., Fujii, Y., & Friederici, A. D. (2005). Native and nonnative speakers' processing of a miniature version of Japanese as revealed by ERPs. Journal of Cognitive Neuroscience, 17, 1229–1244.
Norman, E., Price, M. C., Duff, S. C., & Mentzoni, R. A. (2007). Gradations of awareness in a modified sequence learning task. Consciousness and Cognition, 16, 809–837.
Perruchet, P. (2008). Implicit learning. In H. L. Roediger (Ed.), Cognitive Psychology of Memory. Learning and Memory: A Comprehensive Reference (Vol. 2, pp. 597–621). Oxford: Elsevier.
Perruchet, P., & Pacteau, C. (1990). Synthetic grammar learning: Implicit rule abstraction or explicit fragmentary knowledge? Journal of Experimental Psychology: General, 119, 264–275.
Reber, A. S. (1967). Implicit learning of artificial grammars. Journal of Verbal Learning and Verbal Behavior, 6, 855–863.
Reber, A. S. (1969). Transfer of syntactic structure in synthetic languages. Journal of Experimental Psychology, 81, 115–119.
Reber, A. S. (1989). Implicit learning and tacit knowledge. Journal of Experimental Psychology: General, 118, 219–235.
Reber, A. S. (1993). Implicit Learning and Tacit Knowledge. Oxford: Oxford University Press.
Reber, A. S., & Lewis, S. (1977). Toward a theory of implicit learning: The analysis of the form and structure of a body of tacit knowledge. Cognition, 5, 333–361.
Reber, P. J., & Squire, L. R. (1998). Encapsulation of implicit and explicit memory in sequence learning. Journal of Cognitive Neuroscience, 10, 248–263.
Reber, P. J., & Squire, L. R. (1999). Intact learning of artificial grammars and intact category learning by patients with Parkinson's disease. Behavioral Neuroscience, 113, 235–242.
Rebuschat, P., & Williams, J. N. (in press). Implicit learning of second language syntax. Applied Psycholinguistics.
Robinson, P. (2005). Cognitive abilities, chunk-strength, and frequency effects in implicit artificial grammar and incidental L2 learning: Replications of Reber, Walkenfeld, and Hernstadt (1991) and Knowlton & Squire (1996) and their relevance for SLA. Studies in Second Language Acquisition, 27, 235–268.
Saffran, J. R. (2001). The use of predictive dependencies in language learning. Journal of Memory and Language, 44, 493–515.
Saffran, J. R., Newport, E. L., & Aslin, R. N. (1996). Word segmentation: The role of distributional cues. Journal of Memory and Language, 35, 606–621.
Salamoura, A., & Williams, J. N. (2007). Processing verb argument structure across languages: Evidence for shared representations in the bilingual lexicon. Applied Psycholinguistics, 28, 627–660.
Salamoura, A., & Williams, J. N. (2008). The representation of grammatical gender in the bilingual lexicon: Evidence from Greek and German. Bilingualism: Language and Cognition, 10, 257–275.
Schoonbaert, S., Hartsuiker, R. J., & Pickering, M. J. (2007). The representation of lexical and syntactic information in bilinguals: Evidence from syntactic priming. Journal of Memory and Language, 56, 153–171.
Servan-Schreiber, D., & Anderson, J. R. (1990). Learning artificial grammars with competitive chunking. Journal of Experimental Psychology: Learning, Memory, and Cognition, 16, 592–608.
Shanks, D. R. (2005). Implicit learning. In K. Lamberts & R. Goldstone (Eds.), Handbook of Cognition (pp. 202–220). London: Sage.
Shanks, D. R., Johnstone, T., & Staggs, L. (1997). Abstraction processes in artificial grammar learning. Quarterly Journal of Experimental Psychology: Human Experimental Psychology, 50A, 216–252.
Shanks, D. R., & Perruchet, P. (2002). Dissociation between priming and recognition in the expression of sequential knowledge. Psychonomic Bulletin & Review, 9, 362–367.
Shanks, D. R., Wilkinson, L., & Channon, S. (2003). Relationship between priming and recognition in deterministic and probabilistic sequence learning. Journal of Experimental Psychology: Learning, Memory, and Cognition, 29, 248–261.
Shukla, M., Nespor, M., & Mehler, J. (2007). An interaction between prosody and statistics in the segmentation of fluent speech. Cognitive Psychology, 54, 1–32.
Tolentino, L. C., & Tokowicz, N. (2011). Across languages, space, and time: A review of the role of cross-language similarity in L2 (morpho)syntactic processing as revealed by fMRI and ERP methods. Studies in Second Language Acquisition, 33, 91–125.
Tunney, R. J., & Altmann, G. T. M. (2001). Two modes of transfer in artificial grammar learning. Journal of Experimental Psychology: Learning, Memory and Cognition, 27, 614–639.
Ullman, M. T. (2004). Contributions of memory circuits to language: The declarative/procedural model. Cognition, 92, 231–270.
Whittlesea, B. W. A., & Dorken, M. D. (1993). Incidentally, things in general are particularly determined: An episodic-processing account of implicit learning. Journal of Experimental Psychology: General, 122, 227–248.
Williams, J. N. (2010). Initial incidental acquisition of word order regularities: Is it just sequence learning? Language Learning, 60, 221–244.
Williams, J. N., & Kuribara, C. (2008). Comparing a nativist and emergentist approach to the initial stage of SLA: An investigation of Japanese scrambling. Lingua, 118, 522–553.

Statistical construction learning: Does a Zipfian problem space ensure robust language learning?

Nick C. Ellis and Matthew Brook O'Donnell

1. Introductory Overview

One of the key mysteries of language development is that each of us as learners has had different language experiences and yet somehow we have converged on broadly the same language system. From diverse, noisy samples, we end up with similar competence. How so? Some views hold that there are constraints in the learner's estimation of how language works, as expectations of linguistic universals pre-programmed in some innate language acquisition device. Others hold that the constraints are in the dynamics of language itself – that language form, language meaning, and language usage come together to promote robust induction by means of statistical learning over limited samples. The research described here explores this question with regard to English verbs, their grammatical form, semantics, and patterns of usage.

As a child, you engaged your parents and friends talking about things of shared interest using words and phrases that came to mind, and all the while you learned language. We were privy to none of this. Yet somehow we have converged upon a similar-enough 'English' to be able to communicate here. Our experience allows us similar interpretations of novel utterances like "the ball mandoolz across the ground" or "the teacher spugged the boy the book." You know that mandool is a verb of motion and have some idea of how mandooling works – its action semantics. You know that spugging involves some sort of transfer, that the teacher is the donor, the boy the recipient, and that the book is the transferred object. How is this possible, given that you have never heard these verbs before?

Each word of the construction contributes individual meaning, and the verb meanings in these Verb-Argument Constructions (VACs) are usually at the core. But the larger configuration of words has come to carry meaning as a whole too. The VAC as a category has inherited its schematic meaning from all of the examples you have heard. Mandool inherits its interpretation from the echoes of the verbs that occupy this VAC – words like come,

walk, move, . . . , scud, skitter and flit – in just the same way that you can conjure up an idea of the first author’s dog Phoebe, who you have never met either, from the conspiracy of your memories of dogs. Knowledge of language is based on these types of inference, and verbs are the cornerstone of the syntax-semantics interface. To appreciate your idea of Phoebe, we would need a record of your relevant evidence (all of the dogs you have experienced, in their various forms and frequencies) and an understanding of the cognitive mechanisms that underpin categorization and abstraction. In the same way, if we want a scientific understanding of language knowledge, we need to know the evidence upon which such psycholinguistic inferences are based, and the relevant psychology of learning. These are the goals of our research. To describe the evidence, we take here a sample of VACs based upon English form, function, and usage distribution. The relevant psychology of learning, as we will explain, suggests that learnability will be optimized for constructions that are (1) Zipfian in their type-token distributions in usage (the most frequent word occurring approximately twice as often as the second most frequent word, which occurs twice as often as the fourth most frequent word, etc.), (2) selective in their verb form occupancy, and (3) coherent in their semantics. We assess whether these factors hold for our sample of VACs. In summary, our methods are as follows; we will return to explain each step in detail. We search a tagged and dependency-parsed version of the British National Corpus (BNC 2007), a representative 100-million word corpus of English, for 23 example VACs previously identified in the Grammar Patterns volumes (Francis, Hunston, and Manning 1996; Hunston and Francis 1996) resulting from the COBUILD corpus-based dictionary project (Sinclair 1987). For each VAC, such as the pattern V(erb) across N(oun phrase), we generate (1) a list of verb types that occupy each construction (e.g. walk, move, skitter). We tally the frequencies of these verbs to produce (2) a frequency ranked type-token profile for these verbs, and we determine the degree to which this is Zipfian (e.g. come 474 . . . spread 146 . . . throw 17 . . . stagger 5; see Fig. 1 below). Because some verbs are faithful to one construction while others are more promiscuous, we next produce (3) a contingency-weighted list which reflects their statistical association (e.g. scud, skitter, sprawl, flit have the strongest association with V across N ). Because verbs are highly polysemous, we apply word sense disambiguation algorithms to assign (4) senses to these verbs in the sentences where they are present, according to WordNet (Miller 2009). We use techniques for identifying clustering and degrees of separation in networks to determine (5) the degree to which there is semantic cohesion of the verbs

occupying each construction (e.g., semantic fields travel and move are most frequent for V across N), and whether they follow a prototype/radial category structure.

In order to gauge the degree to which each VAC is more coherent than expected by chance in terms of the association of its grammatical form and semantics, we generate a distributionally-yoked control (a 'control ersatz construction', CEC), matched for type-token distribution but otherwise randomly selected to be grammatically and semantically uninformed. Through the comparison of VACs and CECs on these various measures, and following what is known of the psychology of learning, we assess the consequences for acquisition. This work is a preliminary interdisciplinary test, across significantly large language usage and learning corpora, of the generalizability of construction grammar theories of language learning informed by cognitive linguistics, learning theory, categorization, statistical learning, usage-based child language acquisition, and complex systems theory.
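One way such a distributionally-yoked control might be generated is sketched below. The sampling scheme is our illustration of the idea just described, not necessarily the authors' exact procedure; the verb pool is invented, and only the V across N frequencies quoted earlier (come 474, spread 146, throw 17, stagger 5) come from the text.

```python
# Illustrative sketch of building a "control ersatz construction" (CEC):
# keep the real VAC's token-frequency profile, but fill each frequency slot
# with a verb drawn at random from the corpus, so the control is matched
# for type-token distribution yet grammatically and semantically uninformed.
import random

def make_cec(vac_profile, corpus_verbs, seed=0):
    """vac_profile: list of (verb, token_count) for the real VAC, ranked by count.
    corpus_verbs: pool of verb types attested anywhere in the corpus."""
    rng = random.Random(seed)
    fillers = rng.sample(corpus_verbs, len(vac_profile))   # random, distinct verb types
    # Pair each random verb with the corresponding token count from the real VAC.
    return list(zip(fillers, (count for _, count in vac_profile)))

vac = [("come", 474), ("spread", 146), ("throw", 17), ("stagger", 5)]
pool = ["eat", "sing", "own", "paint", "forget", "borrow", "melt", "argue", "sleep"]
print(make_cec(vac, pool))
```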

2. Construction Grammar and Usage

Constructions are form-meaning mappings, conventionalized in the speech community, and entrenched as language knowledge in the learner's mind. They are the symbolic units of language relating the defining properties of their morphological, lexical, and syntactic form with particular semantic, pragmatic, and discourse functions (Goldberg 2006, 1995). Verbs are central in this: their semantic behavior is strongly intertwined with the syntagmatic constraints governing their distributions. Construction Grammar argues that all grammatical phenomena can be understood as learned pairings of form (from morphemes, words, idioms, to partially lexically filled and fully general phrasal patterns) and their associated semantic or discourse functions: "the network of constructions captures our grammatical knowledge in toto, i.e. it's constructions all the way down" (Goldberg, 2006, p. 18). Such beliefs, increasingly influential in the study of child language acquisition, emphasize data-driven, emergent accounts of linguistic systematicities (e.g., Tomasello 2003; Clark and Kelly 2006).

Frequency, learning, and language come together in usage-based approaches which hold that we learn linguistic constructions while engaging in communication (Bybee 2010). The last 50 years of psycholinguistic research provides the evidence of usage-based acquisition in its demonstrations that language processing is exquisitely sensitive to usage frequency at all levels of language representation from phonology, through lexis and

syntax, to sentence processing (Ellis 2002). That language users are sensitive to the input frequencies of these patterns entails that they must have registered their occurrence in processing. These frequency effects are thus compelling evidence for usage-based models of language acquisition which emphasize the role of input. Language knowledge involves statistical knowledge, so humans learn more easily and process more fluently high frequency forms and 'regular' patterns which are exemplified by many types and which have few competitors (e.g., MacWhinney 2001). Psycholinguistic perspectives thus hold that language learning is the associative learning of representations that reflect the probabilities of occurrence of form-function mappings. If constructions as form-function mappings are the units of language, then language acquisition involves inducing these associations from experience of language usage. Constructionist accounts of language acquisition thus involve the distributional analysis of the language stream and the parallel analysis of contingent perceptuo-motor activity, with abstract constructions being learned as categories from the conspiracy of concrete exemplars of usage following statistical learning mechanisms (Christiansen and Chater 2001; Jurafsky and Martin 2000; Bybee and Hopper 2001; Bod, Hay, and Jannedy 2003; Ellis 2002; Perruchet and Pacton 2006) relating input and learner cognition.

3. Determinants of Construction Learning

Psychological analyses of the learning of constructions as form-meaning pairs are informed by the literature on the associative learning of cue-outcome contingencies, where the usual determinants include: (1) input frequency (type-token frequency, Zipfian distribution), (2) form (salience and perception), (3) function (prototypicality of meaning), and (4) interactions between these (contingency of form-function mapping) (Ellis and Cadierno 2009). We will briefly consider each in turn, along with studies demonstrating their applicability.

3.1. Input Frequency

3.1.1. Construction Frequency

Frequency of exposure promotes learning and entrenchment (e.g., Anderson 2000; Ebbinghaus 1885; Bartlett [1932] 1967). Learning, memory and

perception are all affected by frequency of usage: the more times we experience something, the stronger our memory for it, and the more fluently it is accessed. The more recently we have experienced something, the stronger our memory for it, and the more fluently it is accessed [hence your reading this sentence more fluently than the preceding one]. The more times we experience conjunctions of features, the more they become associated in our minds and the more these subsequently affect perception and categorization; so a stimulus becomes associated to a context and we become more likely to perceive it in that context.

Frequency of exposure also underpins statistical learning of categories (Mintz 2002; Hunt and Aslin 2010; Lakoff 1987; Taylor 1998; Harnad 1987). Human categorization ability provides the most persuasive testament to our incessant unconscious figuring or 'tallying'. We know that natural categories are fuzzy rather than monothetic. Wittgenstein's (1953) consideration of the concept game showed that no set of features that we can list covers all the things that we call games, ranging as the exemplars variously do from soccer, through chess, bridge, and poker, to solitaire. Instead, what organizes these exemplars into the game category is a set of family resemblances among these members – son may be like mother, and mother like sister, but in a very different way. And we learn about these families, like our own, from experience. Exemplars are similar if they have many features in common and few distinctive attributes (features belonging to one but not the other); the more similar are two objects on these quantitative grounds, the faster are people at judging them to be similar (Tversky 1977). The greater the token frequency of an exemplar, the more it contributes to defining the category, and the greater the likelihood it will be considered the prototype. The operationalization of this criterion predicts the speed of human categorization performance – people more quickly classify as dogs Labradors (or other typically sized, typically colored, typically tailed, typically featured specimens) than they do dogs with less common features or feature combinations like Shar Peis or Neapolitan Mastiffs. Prototypes are judged faster and more accurately, even if they themselves have never been seen before – someone who has never seen a Labrador, yet who has experienced the rest of the run of the canine mill, will still be fast and accurate in judging it to be a dog (Posner and Keele 1970). Such effects make it very clear that although people don't go around consciously counting features, they nevertheless have very accurate knowledge of the underlying frequency distributions and their central tendencies.

3.1.2. Type and Token Frequency

Token frequency counts how often a particular form appears in the input. Type frequency, on the other hand, refers to the number of distinct lexical items that can be substituted in a given slot in a construction, whether it is a word-level construction for inflection or a syntactic construction specifying the relation among words. For example, the "regular" English past tense -ed has a very high type frequency because it applies to thousands of different types of verbs, whereas the vowel change exemplified in swam and rang has much lower type frequency. The productivity of phonological, morphological, and syntactic patterns is a function of type rather than token frequency (Bybee and Hopper 2001). This is because: (a) the more lexical items that are heard in a certain position in a construction, the less likely it is that the construction is associated with a particular lexical item and the more likely it is that a general category is formed over the items that occur in that position; (b) the more items the category must cover, the more general are its criterial features and the more likely it is to extend to new items; and (c) high type frequency ensures that a construction is used frequently, thus strengthening its representational schema and making it more accessible for further use with new items (Bybee and Thompson 2000). In contrast, high token frequency promotes the entrenchment or conservation of irregular forms and idioms; the irregular forms only survive because they are high frequency. There is related evidence for type-token matters in statistical learning research (Gómez 2002; Onnis et al. 2004). These findings support language's place at the center of cognitive research into human categorization, which also emphasizes the importance of type frequency in classification.

3.1.3. Zipfian Distribution

In natural language, Zipf's law (Zipf 1935) describes how the highest frequency words account for the most linguistic tokens. Zipf's law states that the frequency of words decreases as a power function of their rank in the frequency table. If pf is the proportion of words whose frequency in a given language sample is f, then pf ∝ f^(−α), with α ≈ 1. Zipf showed this scaling law holds across a wide variety of language samples. Subsequent research provides support for this law as a linguistic universal. Many language events across scales of analysis follow his power law: phoneme and letter strings (Kello and Beltz 2009), words (Evert 2005), grammatical constructs (Ninio 2006; O'Donnell and Ellis 2010), formulaic phrases (O'Donnell and Ellis 2009), etc.
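As a toy illustration of this rank-frequency relation, the snippet below counts token frequencies, ranks the types, and estimates the power-law exponent by least squares on the log-log plot. The word counts are invented for illustration, not corpus data.

```python
# Toy illustration of a Zipfian rank-frequency profile and a least-squares
# estimate of its exponent on the log-log plot (invented counts, not data).
import math
from collections import Counter

tokens = (["the"] * 64 + ["of"] * 32 + ["and"] * 21 + ["to"] * 16 +
          ["in"] * 13 + ["verb"] * 11 + ["walk"] * 9 + ["across"] * 8)
freqs = sorted(Counter(tokens).values(), reverse=True)

xs = [math.log(r) for r in range(1, len(freqs) + 1)]   # log rank
ys = [math.log(f) for f in freqs]                      # log frequency
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
print(f"estimated exponent: {-slope:.2f}")             # close to 1 for a Zipfian profile
```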

Scale-free laws also pervade language structures, such as scale-free networks in collocation (Solé et al. 2005; Bannard and Lieven 2009), in morphosyntactic productivity (Baayen 2008), in grammatical dependencies (Ferrer i Cancho & Solé, 2001, 2003; Ferrer i Cancho, Solé, & Köhler, 2004), and in networks of speakers, and language dynamics such as in speech perception and production, in language processing, in language acquisition, and in language change (Ninio 2006; Ellis 2008). Zipfian covering, where, as concepts need to be refined for clear communication, they are split, then split again hierarchically, determines basic categorization, the structure of semantic classes, and the language form-semantic structure interface (Steyvers and Tenenbaum 2005; Manin 2008). Scale-free laws pervade both language structure and usage.

And not just language structure and use. Power law behavior like this has since been shown to apply to a wide variety of structures, networks, and dynamic processes in physical, biological, technological, social, cognitive, and psychological systems of various kinds (e.g. magnitudes of earthquakes, sizes of meteor craters, populations of cities, citations of scientific papers, number of hits received by web sites, perceptual psychophysics, memory, categorization, etc.) (Newman 2005; Kello et al. 2010). It has become a hallmark of Complex Systems theory. Zipfian scale-free laws are universal. Complexity theorists suspect them to be fundamental, and are beginning to investigate how they might underlie language processing, learnability, acquisition, usage and change (Beckner et al., 2009; Ellis & Larsen-Freeman, 2009b; Ferrer i Cancho & Solé, 2001, 2003; Ferrer i Cancho et al., 2004; Solé et al., 2005). Various usage-based/functionalist/cognitive linguists (e.g., Boyd & Goldberg, 2009; Bybee, 2008, 2010; Ellis, 2008a; Goldberg, 2006; Goldberg, Casenhiser, & Sethuraman, 2004; Lieven & Tomasello, 2008; Ninio, 1999, 2006) argue that it is the coming together of these distributions across linguistic form and linguistic function that makes language robustly learnable despite learners' idiosyncratic experience and the 'poverty of the stimulus'.

In first language acquisition, Goldberg, Casenhiser & Sethuraman (2004) demonstrated that there is a strong tendency for VACs to be occupied by one single verb with very high frequency in comparison to other verbs used, a profile which closely mirrors that of the mothers' speech to these children. They argue that this promotes language acquisition: in the early stages of learning categories from exemplars, acquisition is optimized by the introduction of an initial, low-variance sample centered upon prototypical exemplars. This low variance sample allows learners to get a fix on what will account for most of the category members, with the bounds

of the category being defined later by experience of the full breadth of exemplar types. In naturalistic second language (L2) acquisition, Ellis and Ferreira-Junior (2009) investigated type/token distributions in the items comprising the linguistic form of English VACs (VL verb locative, VOL verb object locative, VOO ditransitive) and showed that VAC verb type/token distribution in the input is Zipfian and that learners first acquire the most frequent, prototypical and generic exemplar (e.g. put in VOL, give in VOO, etc.).

3.2. Function (Prototypicality of Meaning)

Categories have graded structure, with some members being better exemplars than others. In the prototype theory of concepts (Rosch and Mervis 1975; Rosch et al. 1976), the prototype as an idealized central description is the best example of the category, appropriately summarizing the most representative attributes of a category. As the typical instance of a category, it serves as the benchmark against which surrounding, less representative instances are classified. Ellis & Ferreira-Junior (2009) show that the verbs that L2 learners first used in particular VACs are prototypical and generic in function (go for VL, put for VOL, and give for VOO). The same has been shown for child language acquisition, where a small group of semantically general verbs, often referred to as light verbs (e.g., go, do, make, come) are learned early (Clark 1978; Ninio 1999; Pinker 1989). Ninio (1999) argues that, because most of their semantics consist of some schematic notion of transitivity with the addition of a minimum specific element, they are semantically suitable, salient, and frequent; hence, learners start transitive word combinations with these generic verbs. Thereafter, as Clark describes, "many uses of these verbs are replaced, as children get older, by more specific terms. . . . General purpose verbs, of course, continue to be used but become proportionately less frequent as children acquire more words for specific categories of actions" (p. 53).

3.3. Interactions between these (Contingency of Form-Function Mapping)

Psychological research into associative learning has long recognized that while frequency of form is important, so too is contingency of mapping (Shanks 1995). Consider how, in the learning of the category of birds, while eyes and wings are equally frequently experienced features in the

exemplars, it is wings which are distinctive in differentiating birds from other animals. Wings are important features for learning the category of birds because they are reliably associated with class membership; eyes are not. Raw frequency of occurrence is less important than the contingency between cue and interpretation. Distinctiveness or reliability of form-function mapping is a driving force of all associative learning, to the degree that the field of its study has been known as 'contingency learning' since Rescorla (1968) showed that for classical conditioning, if one removed the contingency between the conditioned stimulus (CS) and the unconditioned (US), preserving the temporal pairing between CS and US but adding additional trials where the US appeared on its own, then animals did not develop a conditioned response to the CS. This result was a milestone in the development of learning theory because it implied that it was contingency, not temporal pairing, that generated conditioned responding. Contingency, and its associated aspects of predictive value, information gain, and statistical association, have been at the core of learning theory ever since. It is central in psycholinguistic theories of language acquisition too (Ellis 2008; MacWhinney 1987; Ellis 2006, 2006; Gries and Wulff 2005), with the most developed account for L2 acquisition being that of the Competition model (MacWhinney 1987, 1997, 2001).

Ellis and Ferreira-Junior (2009) use a variety of metrics to show that VAC acquisition is determined by their contingency of form-function mapping. They show that the one-way dependency statistic ΔP (Allan 1980) that is commonly used in the associative learning literature (Shanks 1995), as well as collostructional analysis measures current in corpus linguistics (Gries and Stefanowitsch 2004; Stefanowitsch and Gries 2003), predict effects of form-function contingency upon L2 VAC acquisition. Other researchers use conditional probabilities to investigate contingency effects in VAC acquisition. This is still an active area of inquiry, and more research is required before we know which statistical measures of form-function contingency are more predictive of acquisition and processing.

Ellis and Larsen-Freeman (2009) provided computational (Emergent connectionist) serial-recurrent network simulations of these various factors as they play out in the emergence of constructions as generalized linguistic schema from their frequency distributions in the input. This fundamental claim, that Zipfian distributional properties of language usage help to make language learnable, has thus begun to be explored for these three VACs, at least. But three VACs is a pitifully small sample of English grammar. It remains an important research agenda to explore its generality across the wide range of the verb constructicon.
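The one-way dependency statistic ΔP mentioned above can be illustrated with a small worked example: ΔP is the probability of the outcome given the cue minus the probability of the outcome in the absence of the cue. Cue and outcome can be taken in either direction; below we treat the verb as cue and the construction as outcome, and the token counts are hypothetical, chosen only to show the direction of the calculation.

```python
# Worked illustration of ΔP = P(outcome | cue) − P(outcome | no cue),
# computed from a 2x2 contingency table of verb-by-construction counts.
def delta_p(a, b, c, d):
    """a: verb in construction; b: verb elsewhere;
    c: other verbs in construction; d: other verbs elsewhere."""
    return a / (a + b) - c / (c + d)

# A faithful verb (most of its tokens occur in the construction)...
print(delta_p(a=90, b=10, c=910, d=99_000))     # high contingency (~0.89)
# ...versus a promiscuous verb with the same construction frequency.
print(delta_p(a=90, b=9_910, c=910, d=89_090))  # near-zero contingency
```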

The primary motivation of construction grammar is that we must bring together linguistic form, learner cognition, and usage. An important consequence is that constructions cannot be defined purely on the basis of linguistic form, or semantics, or frequency of usage alone. All three factors are necessary in their operationalization and measurement. Psychology theory relating to the statistical learning of categories suggests that constructions are robustly learnable when they are (1) Zipfian in their type-token distributions in usage, (2) selective in their verb form occupancy, and (3) coherent in their semantics. Our research aims to assess this for a sample of the verbal grammar of English, analyzing the way VACs map form and meaning, and providing an inventory of the verbs that exemplify these constructions and their frequency.

4. Method

As a starting point, we considered several of the major theories and datasets of construction grammar such as FrameNet (Fillmore, Johnson, and Petruck 2003). However, because our research aims to empirically determine the semantic associations of particular linguistic forms, it is important that such forms are initially defined by bottom-up means that are semantics-free. There is no one in corpus linguistics who 'trusts the text' more than Sinclair (2004) in his operationalizations of linguistic constructions on the basis of repeated patterns of words in collocation, colligation, and phrases. Therefore we chose the definitions of VACs presented in the Verb Grammar Patterns (Hunston and Francis 1996) that arose out of the COBUILD project (Sinclair 1987) for our first analyses. We focus on a convenience sample of 23 constructions for our initial explorations here. Most of these follow the verb – preposition – noun phrase structure, such as V into N, V after N, V as N (Goldberg 2006), but we also include other classic examples such as the ditransitive, and the way construction (Jackendoff 1997).

4.1. Step 1 Construction Inventory: COBUILD Verb Patterns

The form-based patterns described in the COBUILD Verb Patterns volume (Francis, Hunston, and Manning 1996) take the form of word class and lexis combinations, such as V across N, V into N and V N N. For each of these patterns the resource provides information as to the structural configurations and meaning groups found around these patterns through detailed concordance analysis of the Bank of English corpus

during the construction of the COBUILD dictionary. For instance, the following is provided for the V across N pattern (Francis, Hunston, and Manning 1996): The verb is followed by a prepositional phrase which consists of across and a noun group. This pattern has one structure: * Verb with Adjunct. I cut across the field.
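For concreteness, an entry like this might be held as a simple data structure before being translated into corpus queries in Step 3 below. The field names here are our own illustration, not the COBUILD format; the structural notes are those reported for this pattern.

```python
# Illustrative (hypothetical field names) representation of one inventory entry.
V_ACROSS_N = {
    "name": "V across N",
    "slots": ["V", "prep:across", "NP"],
    "structure": "Verb with Adjunct",          # from the Grammar Patterns entry
    "constraints": {"verb_passive": False,     # the verb is never passive
                    "pp_role": "adjunct"},     # the prepositional phrase is an adjunct
    "typical_verbs": ["brush", "cut", "fall", "flicker", "flit",
                      "plane", "skim", "sweep"],
    "example": "I cut across the field.",
}
```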

Further example sentences drawn from the corpus are provided, and a list of verbs that are found in the pattern and are semantically typical is given. For the V across N pattern these are: brush, cut, fall, flicker, flit, plane, skim, sweep. No indication is given as to how frequent each of these types is or how comprehensive the list of types is. Further structural (syntactical) characteristics of the pattern are sometimes provided, such as the fact that for V across N the prepositional phrase is an adjunct and that the verb is never passive. There are over 700 patterns of varying complexity in the Grammar Patterns volume. In subsequent work we hope to analyze them all in the same ways we describe here for our sample of 23.

4.2. Step 2 Corpus: BNC XML Parsed Corpora

To get a representative sample of usage, the analysis of verb type-token distribution in the kinds of construction patterns described in Step 1 should be done across corpora in the magnitude of the tens or hundreds of millions of words. Searching for the pattern as specified requires that the corpora be part-of-speech tagged, and some kind of partial parsing and chunking is necessary to apply the necessary structural constraints (see Mason and Hunston 2004 for exploratory methodology). For this initial work, we chose to use the 100 million word BNC (2007) on account of its size, the breadth of text types it contains and the consistent lemmatization and part-of-speech tagging. Andersen et al. (2008) parsed the XML version of the BNC using the RASP parser (Briscoe, Carroll, and Watson 2006). RASP is a statistical feature-based parser that produces a probabilistically ordered set of parse trees for a given sentence and additionally a set of grammatical relations that capture "those aspects of predicate-argument structure that the system is able to recover and is the most stable and grammar independent representation available" (Briscoe, Carroll, and
For each VAC, we translate the formal specifications from the COBUILD patterns into queries to retrieve instances of the pattern from the parsed corpus.

4.3. Step 3 Searching Construction Patterns

Using a combination of part-of-speech, lemma and dependency constraints, we construct queries for each of the construction patterns. For example, the V across N pattern is identified by looking for sentences that have a verb form within 3 words of an instance of across as a preposition, where there is an indirect object relation holding between across and the verb and the verb does not have any other object or complement relations to following words in the sentence. Table 1 shows our 23 constructions, the number of verb types that occupy them, the total number of tokens found, and the type-token ratio. We have still to carry out a systematic precision-recall analysis, but the strict constraints using the dependency relations provide us with good precision, and the size of the corpus results in a reasonable number of tokens for distributional analysis. In future, we plan to use a number of different parsers [e.g. Stanford (Klein and Manning 2003), Pro3Gres (Schneider, Rinaldi, and Dowdall 2004), MALT (Nivre, Hall, and Nilsson 2004), and Link (Grinberg, Lafferty, and Sleator 1995)] over the same corpora and use a consensus-based selection method in which a sentence will be counted if two or more parsers agree (according to queries particular to their parsing output) that it is an instance of a particular construction pattern. Further, we will select samples of certain VAC distributions for manual evaluation.

4.4. Step 4 A Frequency Ranked Type-Token VAC Profile

The sentences extracted using the procedure outlined above for each of the construction patterns are stored in a document database. This database can then be queried to produce verb type distributions such as those in Table 2 for the V across N VAC pattern. These distributions appear to be Zipfian, exhibiting the characteristic long-tailed distribution in a plot of rank against frequency. We have developed scripts in R (R Development Core Team 2008) to generate logarithmic plots and linear regression to examine the extent of this trend. Dorogovstev & Mendes (2003) outline the use of logarithmic binning of frequency against log cumulative frequency for measuring distributions of this type. Linear regression can be applied to the resulting plots and the goodness of fit (R²) and the slope (g) recorded.
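To make this step concrete, the following is a minimal sketch in R (the environment mentioned above) of one way to bin a frequency list logarithmically and fit the regression. The frequency vector is an invented Zipf-like toy and the binning scheme is a simplified stand-in for the Dorogovstev & Mendes procedure, not the scripts actually used for the analyses reported here.

```r
# Toy Zipf-like verb frequency vector standing in for one VAC's type distribution
freqs <- sort(c(474, 203, 197, 175, 146, 114, 102, 93, 80, 75,
                62, 57, 52, 42, 40, 39, 38, 38, 35, 34,
                rep(10, 50), rep(3, 200), rep(1, 500)), decreasing = TRUE)

# Group the types into logarithmic frequency bins, then record, per bin, the
# mean frequency and the cumulative number of types at or above that frequency
bins   <- cut(log(freqs), breaks = 20)
binned <- data.frame(mean_freq = as.numeric(tapply(freqs, bins, mean)),
                     n_types   = as.numeric(tapply(freqs, bins, length)))
binned <- binned[!is.na(binned$mean_freq), ]
binned$cum_types <- rev(cumsum(rev(binned$n_types)))

# Regression in log-log space: the absolute slope approximates the Zipfian
# exponent and R-squared indexes the goodness of fit
fit <- lm(log(cum_types) ~ log(mean_freq), data = binned)
summary(fit)$r.squared
abs(coef(fit)[["log(mean_freq)"]])

plot(log(binned$mean_freq), log(binned$cum_types),
     xlab = "log frequency", ylab = "log cumulative type count")
abline(fit)
```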


Table 1. Type-Token data for 23 VACs drawn from COBUILD Verb Patterns retrieved from the BNC

Construction | Types | Tokens | TTR | Lead verb type
V about N | 365 | 3519 | 10.37 | talk
V across N | 799 | 4889 | 16.34 | come
V after N | 1168 | 7528 | 15.52 | look
V among pl-N | 417 | 1228 | 33.96 | find
V around N | 761 | 3801 | 20.02 | look
V as adj | 235 | 1012 | 23.22 | know
V as N | 1702 | 34383 | 4.95 | know
V at N | 1302 | 9700 | 13.42 | look
V between pl-N | 669 | 3572 | 18.73 | distinguish
V for N | 2779 | 79894 | 3.48 | look
V in N | 2671 | 37766 | 7.07 | find
V into N | 1873 | 46488 | 4.03 | go
V like N | 548 | 1972 | 27.79 | look
V N N | 663 | 9183 | 7.22 | give
V off N | 299 | 1032 | 28.97 | take
V of N | 1222 | 25155 | 4.86 | think
V over N | 1312 | 9269 | 14.15 | go
V through N | 842 | 4936 | 17.06 | go
V to N | 707 | 7823 | 9.04 | go
V towards N | 190 | 732 | 25.96 | move
V under N | 1243 | 8514 | 14.60 | come
V way prep | 365 | 2896 | 12.60 | make
V with N | 1942 | 24932 | 7.79 | deal

Figure 1 shows such a plot for verb type frequency in the V across N construction pattern extracted from the BNC, grouping types into 20 logarithmic bins according to their frequency. Each point represents one bin, and a verb from each bin is randomly selected to label the point, with its token frequency in parentheses. For example, the type look occurs 102 times in the V across N pattern and is placed into the 15th bin with the types go, lie and lean. Points towards the lower right of the plot indicate high-frequency, low-type-count bins, and those towards the top left indicate low-frequency, high-type-count bins, that is, the fat or long tail of the distribution.


Figure 1. Verb type distribution for V across N

Figure 2 shows the same type of plot for verb type frequency in the ditransitive V N N construction pattern, extracted and binned in the same way. Both distributions produce a good fit (R² > 0.99) with a straight regression line, indicating Zipfian type-token frequency distributions for these constructions.


Figure 2. Verb type distribution for V N N

Inspection of the construction verb types, from most frequent down, also demonstrates that, as in prior research (Ellis & Ferreira-Junior, 2009b; Goldberg et al., 2004; Ninio, 1999, 2006), the lead member is prototypical of the construction and generic in its action semantics.


If Zipf's law applies across language, then any sample of language will be Zipfian-distributed, rendering such findings potentially trivial (we elaborate on this in Step 7). But they become much more interesting if the company of verb forms occupying a construction is selective, i.e. if the frequencies of the particular VAC verb members cannot be predicted from their frequencies in language as a whole. We measure the degree to which VACs are selective in this way using a variety of measures, including a chi-square goodness-of-fit test and the statistic '1 − τ', where Kendall's τ measures the correlation between the rank frequencies of the verbs in the construction and in the language as a whole. Higher scores on both of these metrics indicate greater VAC selectivity. Another useful measure is the Shannon entropy of the distribution. Entropy is a measure of the uncertainty associated with a random variable – it is affected by the number of types in the system and the distribution of the tokens across those types. If there is just one type, then the system is far from random, and entropy is low. If there are ten types of equal probability, the system is quite random, but if 99% of the tokens are of just one type, it is far less random, and so on. The lower the entropy, the more coherent the VAC verb family. Construction scores on all these measures are given later in Table 4.

4.5. Step 5 Determining the Contingency between Verbs and VACs

Some verbs are closely tied to a particular construction (for example, give is highly indicative of the ditransitive construction, whereas leave, although it can form a ditransitive, is more often associated with other constructions such as the simple transitive or intransitive). As we described above, the more reliable the contingency between a cue and an outcome, the more readily an association between them can be learned (Shanks 1995), so constructions with more faithful verb members are more transparent and thus should be more readily acquired (Ellis 2006). The measures of contingency that we adopt here are (1) faithfulness – the proportion of tokens of total verb usage that appear in this particular construction (e.g., the faithfulness of give to the ditransitive is approximately 0.40; that of leave is 0.01), (2) directional one-way association, ΔP (ΔP Construction → Word: give 0.314, leave 0.003; ΔP Word → Construction: give 0.025, leave 0.001) (e.g. Ellis & Ferreira-Junior, 2009), and (3) directional mutual information (MI Word → Construction: give 16.26, leave 11.73; MI Construction → Word: give 12.61, leave 9.11), an information science statistic that has been shown to predict language processing fluency (e.g., Ellis, Simpson-Vlach, and Maynard 2008; Jurafsky 2002).
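As an illustration of the contingency calculations, the R fragment below derives faithfulness and the two ΔP statistics from the four cells of a verb-by-construction contingency table. The counts echo the come row of Table 2 below, and the grand total is the overall verb token count reported in the Discussion, so this is a sketch of the formulas rather than the production analysis.

```r
# deltaP and faithfulness from a 2x2 verb-by-construction contingency table.
# Counts echo the 'come' row of Table 2 (V across N); N is the total number
# of verb tokens analysed, as given in the Discussion section.
verb_in_constr <- 474
verb_total     <- 122107
constr_total   <- 4889
N              <- 16141058

verb_out_constr  <- verb_total   - verb_in_constr     # verb elsewhere
other_in_constr  <- constr_total - verb_in_constr     # other verbs in the construction
other_out_constr <- N - verb_total - other_in_constr  # everything else

faithfulness <- verb_in_constr / verb_total           # P(construction | verb)

# deltaP = P(outcome | cue) - P(outcome | no cue)  (Allan 1980)
dP_word_to_constr <- verb_in_constr / verb_total -
                     other_in_constr / (other_in_constr + other_out_constr)
dP_constr_to_word <- verb_in_constr / constr_total -
                     verb_out_constr / (verb_out_constr + other_out_constr)

round(c(faith = faithfulness, dP_w_c = dP_word_to_constr, dP_c_w = dP_constr_to_word), 3)
# roughly 0.004, 0.004 and 0.089, in line with the 'come' row of Table 2
```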


Table 2. Top 20 verbs found in the V across N construction pattern in the BNC

Verb | Constr. freq. | Corpus freq. | Faith. | Token × Faith. | MI w→c | MI c→w | ΔP w→c | ΔP c→w
come | 474 | 122107 | 0.0039 | 1.840 | 15.369 | 10.726 | 0.004 | 0.089
walk | 203 | 17820 | 0.0114 | 2.313 | 16.922 | 15.056 | 0.011 | 0.040
cut | 197 | 16200 | 0.0122 | 2.396 | 17.016 | 15.288 | 0.012 | 0.039
run | 175 | 36163 | 0.0048 | 0.847 | 15.687 | 12.800 | 0.005 | 0.034
spread | 146 | 5503 | 0.0265 | 3.874 | 18.142 | 17.971 | 0.026 | 0.030
move | 114 | 34774 | 0.0033 | 0.374 | 15.125 | 12.295 | 0.003 | 0.021
look | 102 | 93727 | 0.0011 | 0.111 | 13.534 | 9.273 | 0.001 | 0.015
go | 93 | 175298 | 0.0005 | 0.049 | 12.498 | 7.333 | 0.000 | 0.008
lie | 80 | 18468 | 0.0043 | 0.347 | 15.527 | 13.610 | 0.004 | 0.015
lean | 75 | 4320 | 0.0174 | 1.302 | 17.530 | 17.708 | 0.017 | 0.015
stretch | 62 | 4307 | 0.0144 | 0.893 | 17.260 | 17.442 | 0.014 | 0.012
fall | 57 | 24656 | 0.0023 | 0.132 | 14.621 | 12.287 | 0.002 | 0.010
get | 52 | 146096 | 0.0004 | 0.019 | 11.922 | 7.020 | 0.000 | 0.002
pass | 42 | 18592 | 0.0023 | 0.095 | 14.588 | 12.661 | 0.002 | 0.007
reach | 40 | 21645 | 0.0018 | 0.074 | 14.298 | 12.152 | 0.002 | 0.007
travel | 39 | 8176 | 0.0048 | 0.186 | 15.666 | 14.924 | 0.004 | 0.007
fly | 38 | 8250 | 0.0046 | 0.175 | 15.616 | 14.861 | 0.004 | 0.007
stride | 38 | 1022 | 0.0372 | 1.413 | 18.629 | 20.887 | 0.037 | 0.008
scatter | 35 | 1499 | 0.0233 | 0.817 | 17.957 | 19.663 | 0.023 | 0.007
sweep | 34 | 2883 | 0.0118 | 0.401 | 16.972 | 17.734 | 0.011 | 0.007

Table 2 lists some of these contingency measures for the verbs occupying the V across N VAC pattern. It is instructive to reorder the distribution according to these measures and consider the top items in terms of how characteristic of the VAC semantics they are (this is a standard option for each VAC listed on the website we are developing to allow open access to our analyses). For the V across N VAC pattern, more generic movement verbs come, walk, cut, run, spread and move top the list ordered by token frequency. But when ordered according to verb-to-construction faithfulness, the items are much more specific in their meaning, though of low frequency: scud, skitter, sprawl, flit, emblazon and slant. The average faithfulness, MI and ΔP scores across the members of the construction are also important metrics, illustrating the degree to which VACs are selective in their membership. We show examples later in Table 4.
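A minimal R sketch of two of these selectivity metrics follows, using only the ten most frequent V across N verbs from Table 2 as toy input (a real VAC profile would use the full verb type list):

```r
# Construction frequencies and overall corpus frequencies for the ten most
# frequent V across N verbs (from Table 2)
constr_freq <- c(474, 203, 197, 175, 146, 114, 102, 93, 80, 75)
corpus_freq <- c(122107, 17820, 16200, 36163, 5503, 34774, 93727, 175298, 18468, 4320)

# Shannon entropy of the construction's type-token distribution:
# lower values mean the tokens are concentrated on fewer verb types
p <- constr_freq / sum(constr_freq)
shannon_entropy <- -sum(p * log2(p))

# 1 - Kendall's tau between construction frequency and corpus frequency:
# higher values mean the construction's verb preferences diverge from
# overall verb frequency in the language as a whole
one_minus_tau <- 1 - cor(constr_freq, corpus_freq, method = "kendall")

c(entropy = shannon_entropy, selectivity = one_minus_tau)
```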


4.6. Step 6 Identifying the Meaning of Verb Types Occupying the Constructions

We are investigating several ways of analyzing verb semantics. Because our research aims to empirically determine the semantic associations of particular linguistic forms, ideally the semantic classes we employ should be defined in a way that is free of linguistic distributional information, otherwise we would be building in circularity. Therefore distributional semantic methods such as Latent Semantic Analysis (LSA, Landauer et al. 2007) are not our first choice here. Instead, here we utilize WordNet, a distribution-free semantic database based upon psycholinguistic theory which has been in development since 1985 (Miller 2009). WordNet places words into a hierarchical network. At the top level, the hierarchy of verbs is organized into 559 distinct root synonym sets ('synsets', such as move1 expressing translational movement, move2 movement without displacement, etc.) which then split into over 13,700 verb synsets. Verbs are linked in the hierarchy according to relations such as hypernym [verb Y is a hypernym of the verb X if the activity X is a (kind of) Y (to perceive is a hypernym of to listen)] and hyponym [verb Y is a hyponym of the verb X if the activity Y is doing X in some manner (to lisp is a hyponym of to talk)]. Various algorithms to determine the semantic similarity between WordNet synsets have been developed which consider the distance between the conceptual categories of words as well as the hierarchical structure of WordNet (Pedersen, Patwardhan, and Michelizzi 2004). Polysemy is a significant issue when working with lexical resources such as WordNet, particularly when analyzing verb semantics. For example, in WordNet the lemma forms move, run and give used as verbs are found in 16, 41 and 44 different synsets respectively. To address this we have applied word sense disambiguation tools specifically designed to work with WordNet (Pedersen and Kolhatkar 2009) to the sentences retrieved at Step 3.

4.7. Step 7 Generating Distributionally-Matched, Control Ersatz Constructions (CECs)

Miller (1965), in his preface to the MIT Press edition of Zipf's (1935) Psychobiology of Language, claimed that Zipfian type-token frequency distributions are essentially uninteresting artifacts of language in use rather than important factors in acquisition. His "monkey at the typewriter" (1957) word generation model produces random words of arbitrary average length as follows: with probability s, a word separator is generated at each step; with probability (1 − s)/N, a letter from an alphabet of size N is generated, each letter having the same probability.
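The behaviour of this generator is easy to check by simulation. The short R sketch below implements the model as just described; the choices of s = 0.2 and a 26-letter alphabet are ours, purely for illustration.

```r
# A quick simulation of Miller's 'monkey at the typewriter' model: at each
# keystroke emit a space with probability s, otherwise one of N equiprobable
# letters; the 'word' frequencies are then rank-ordered.
set.seed(1)
monkey_text <- function(n_keys, s = 0.2, N = 26) {
  keys  <- ifelse(runif(n_keys) < s, " ", sample(letters[1:N], n_keys, replace = TRUE))
  words <- strsplit(paste(keys, collapse = ""), " +")[[1]]
  words[nchar(words) > 0]
}
freqs <- sort(table(monkey_text(2e5)), decreasing = TRUE)

# Roughly linear in log-log space, i.e. a Zipf-like rank-frequency curve
plot(log(seq_along(freqs)), log(as.numeric(freqs)),
     xlab = "log rank", ylab = "log frequency")
```

Plotting log frequency against log rank for the resulting 'words' gives a roughly straight line, which is precisely Miller's point.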


The fact that the monkey-at-the-typewriter model produces gibberish that is nevertheless Zipfian-distributed rendered Zipf's law uninteresting for linguistics for several decades (see also Manning and Schütze 1999). Li (1992) reawakened the issue with further demonstrations that random texts exhibit Zipf's-law-like word frequency distributions. Ferrer-i-Cancho and Solé (2002) responded by showing that random texts lose the Zipfian shape in the frequency versus rank plot when words are restricted to a certain length, which is not the case in real texts. As they conclude: "By assuming that Zipf's law is a trivial statistical regularity, some authors have declined to include it as part of the features of language origin. Instead, it has been used as a given statistical fact with no need for explanation. Our observations do not give support to this view." Nevertheless, Yang (2010) claims that item/usage-based approaches to language acquisition, which typically make use of the notion of constructions, have failed to amass sufficient empirical evidence and to apply the necessary statistical analysis to support their conclusions. He asserts that it is the Zipfian nature of language itself ('the sparse data problem') that gives rise to apparent item-specific patterns. In response to these possibilities, for every VAC we analyze, we generate a distributionally-yoked control which is matched for type-token distribution but otherwise randomly selected to be grammatically and semantically uninformed. We refer to these distributions as 'control ersatz constructions' (CECs). We then assess, using paired-sample tests, the degree to which VACs are more coherent than expected by chance in terms of the association of their grammatical form and semantics. We show such comparisons for the VACs and their yoked CECs later in Tables 4, 5 and 6. The goal in generating CECs is to produce a distribution with the same number of types and tokens as the VAC. To do this we use the following method. For each type in a distribution derived from a VAC pattern (e.g. walk in V across N occurs 203 times), ascertain its corpus frequency (walk occurs 17820 times in the BNC) and randomly select a replacement type from the list of all verb types in the corpus found within the same frequency band (e.g. from learn, increase, explain, watch, stay, etc., which occur with similar frequencies to walk in the BNC). This results in a matching number of types that reflect the same general frequency profile as those from the VAC. Then, using this list of replacement types, sample the same number of tokens (along with their sentence contexts) as in the VAC distribution (e.g. 4889 for V across N) following the probability distribution of the replacement types in the whole corpus (e.g. walk, with a corpus frequency of 17820, will be sampled roughly twice as often as extend, which occurs 9290 times).
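The following R sketch caricatures this yoking procedure under simplifying assumptions: the few VAC types and their counts are taken from Table 2, but the lexicon of candidate replacement verbs is invented, and the sampled sentence contexts that the real procedure preserves are omitted.

```r
set.seed(2)
# A few V across N types with their construction token counts and corpus
# frequencies (from Table 2); the lexicon of candidate replacements is invented
vac <- data.frame(type        = c("come", "walk", "cut", "run"),
                  vac_tokens  = c(474, 203, 197, 175),
                  corpus_freq = c(122107, 17820, 16200, 36163))
lexicon <- data.frame(type        = paste0("v", 1:5000),
                      corpus_freq = round(rexp(5000, rate = 1/5000)) + 1)

# Put every corpus frequency into one of 20 logarithmic frequency bands
all_freq     <- c(vac$corpus_freq, lexicon$corpus_freq)
breaks       <- exp(seq(log(min(all_freq)), log(max(all_freq)), length.out = 21))
lexicon$band <- cut(lexicon$corpus_freq, breaks, include.lowest = TRUE)
vac$band     <- cut(vac$corpus_freq,     breaks, include.lowest = TRUE)

# Step A: draw, for each VAC type, a replacement verb from the same band
pick_one <- function(b) {
  pool <- lexicon$type[lexicon$band == b]
  if (length(pool) == 0) pool <- lexicon$type   # fall back if the band is empty
  sample(pool, 1)
}
replacements <- vapply(as.character(vac$band), pick_one, character(1))

# Step B: resample the same number of tokens, with each replacement verb drawn
# in proportion to its frequency in the corpus as a whole
n_tokens   <- sum(vac$vac_tokens)
weights    <- lexicon$corpus_freq[match(replacements, lexicon$type)]
cec_tokens <- sample(replacements, n_tokens, replace = TRUE, prob = weights)
sort(table(cec_tokens), decreasing = TRUE)
```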


The resulting distribution has an identical number of types and tokens as its matching VAC, although, if the VAC does attract particular verbs, the lead members of the CEC distribution will have a token frequency somewhat lower than those in the VAC.

4.8. Step 8 Evaluating Semantic Cohesion in the VAC Distributions

We have suggested that an intuitive reading of VAC type-token lists such as the one in Table 2 shows that the list ordered by tokens captures the most general and prototypical senses (come, walk, move etc. for V across N and give, make, tell, offer for V N N), while the list ordered by faithfulness highlights some quite construction-specific (and low-frequency) items, such as scud, flit and flicker for V across N. Using the structure of the verb component of the WordNet dictionary, where each synset can be traced back to a root or top-level synset, we are able to compare the semantic cohesion of the top 20 verbs, using their disambiguated WordNet senses, from a given VAC to its matching CEC. For each verb in a VAC or CEC we therefore query the database for the disambiguated WordNet senses of the verb in the instance sentences. For example, in V across N, the verb type move occurs 114 times across 5 synsets: move.v.1 (86x), move.v.2 (18x), move.v.3 (5x), move.v.7 (1x) and move.v.8 (4x). Each of these synsets can be traced back to a top or root level synset or may itself be that synset: move.v.1 → travel.v.1, move.v.2 → move.v.2, move.v.3 → move.v.3, move.v.7 → change.v.3, move.v.8 → act.v.1. Table 3 shows this for the V across N VAC pattern, where the synsets come.v.1, walk.v.1, run.v.1, move.v.1, go.v.1, fall.v.2, pass.v.1, travel.v.1, stride.v.1 and stride.v.2 account for 744 of the 4889 tokens (15%) and share the top level hypernym synset travel.v.01. In comparison, the most frequent root synset for the matching CEC, pronounce.v.1, accounts for just 4% of the tokens. The VAC has a much more compact semantic distribution, in that 5 top level synsets account for a third of its tokens, compared to the 21 required to account for the same proportion for the CEC. We have explored three methods of evaluating the differences between the semantic sense distributions, such as the one in Table 3, for each VAC-CEC pair. First, we can measure the amount of variation in the distribution (i.e. its compactness) using Shannon entropy, as we did in Step 4. For these semantic distributions this can be done according to (1) the number of sense types per root (V across N VAC: 2.75, CEC: 3.37), ignoring the token frequency column in Table 3, and (2) the token frequency per root (V across N VAC: 2.08, CEC: 3.08); the lower the entropy, the more coherent the VAC verb semantics.
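A compact R illustration of these two entropy measures follows, computed over a toy root-synset summary that mimics the top rows of the VAC panel of Table 3 below:

```r
# Toy root-synset summary: number of distinct sense types and token frequency
# per root (values echo the top rows of the V across N panel of Table 3)
roots <- data.frame(root          = c("travel.v.01", "be.v.03", "be.v.01",
                                      "move.v.02", "change.v.02"),
                    n_sense_types = c(10, 9, 10, 5, 15),
                    tokens        = c(744, 259, 233, 210, 198))

shannon <- function(x) { p <- x / sum(x); -sum(p * log2(p)) }

type_entropy  <- shannon(roots$n_sense_types)  # ignores the token column
token_entropy <- shannon(roots$tokens)         # weights roots by token frequency
c(type = type_entropy, token = token_entropy)  # lower = more compact semantics
```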


Table 3. Disambiguated WordNet senses for the top 20 verbs found in the V across N VAC and yoked CEC distributions from the BNC and the root verb synsets to which they belong (top 12 root synsets shown for VAC and CEC).

Actual V across N VAC distribution

Root verb synset | Specific WordNet senses | Freq. | Cum. %
travel.v.01 | come.v.1, walk.v.1, run.v.1, move.v.1, go.v.1, fall.v.2, pass.v.1, travel.v.1, stride.v.1, stride.v.2 | 744 | 15
be.v.03 | come.v.9, run.v.3, go.v.7, lie.v.1, stretch.v.1, pass.v.6, reach.v.6, sweep.v.5, sweep.v.8 | 259 | 21
be.v.01 | come.v.12, come.v.14, cut.v.25, run.v.12, look.v.2, lie.v.4, lean.v.3, fall.v.16, fall.v.4, get.v.33 | 233 | 25
move.v.02 | cut.v.1, run.v.26, move.v.2, lean.v.2, fly.v.4 | 210 | 30
change.v.02 | come.v.4, cut.v.39, run.v.38, run.v.39, spread.v.4, go.v.4, lean.v.1, stretch.v.3, stretch.v.9, fall.v.26, fall.v.3, get.v.12, get.v.2, pass.v.18, fly.v.7 | 198 | 34
spread.v.01 | spread.v.1, scatter.v.3 | 172 | 37
move.v.03 | cut.v.14, run.v.6, spread.v.2, move.v.3, stretch.v.11, reach.v.3, sweep.v.2 | 106 | 39
get.v.01 | run.v.36, get.v.1 | 41 | 40
touch.v.01 | fly.v.3 | 30 | 41
reach.v.01 | reach.v.1 | 26 | 41
guide.v.05 | sweep.v.3 | 14 | 42
happen.v.01 | come.v.19, come.v.3, pass.v.8 | 14 | 42

Random V across N CEC distribution

Root verb synset | Specific WordNet senses | Freq. | Cum. %
pronounce.v.01 | say.v.6 | 193 | 4
be.v.01 | make.v.31, go.v.10, go.v.6, take.v.38, come.v.14, look.v.2, need.v.2, work.v.14, seem.v.1 | 183 | 8
travel.v.01 | go.v.1, come.v.1 | 154 | 11
make.v.03 | make.v.3, make.v.5, see.v.4, give.v.13, think.v.5, work.v.11 | 123 | 13
think.v.03 | see.v.5, know.v.6, give.v.10, think.v.1, think.v.2, think.v.3, try.v.2 | 100 | 15
move.v.02 | say.v.5, set.v.1, put.v.1 | 84 | 17
transfer.v.05 | give.v.17, give.v.3 | 84 | 19
understand.v.01 | see.v.24, take.v.6, work.v.24 | 83 | 21
know.v.01 | know.v.1 | 73 | 22
use.v.01 | give.v.18, use.v.1, work.v.23, put.v.4 | 73 | 24
remove.v.01 | take.v.17 | 72 | 25
change.v.02 | make.v.30, go.v.17, go.v.30, go.v.4, see.v.21, see.v.3, know.v.5, take.v.5, come.v.4, give.v.26, find.v.12, leave.v.8 | 67 | 26


These figures are calculated for all 23 VACs and CECs and are shown in Tables 4 and 5 as (1) type entropy per root synset and (2) token entropy per root synset. Secondly, we can develop the observation from the distribution in Table 3 that the top three root synsets in the VAC account for 25% (1236) of the tokens, compared to 11% (530) for the CEC. Third, we quantify the semantic coherence or 'clumpiness' of the disambiguated senses for the top 20 verb forms in the VAC and CEC distributions using measures of semantic similarity from WordNet and Roget's. Pedersen et al. (2004) outline six measures in their Perl WordNet::Similarity package, three (path, lch and wup) based on the path length between concepts in WordNet synsets and three (res, jcn and lin) that incorporate a measure called 'information content' related to concept specificity. For instance, using the res similarity measure (Resnik 1995), the top 20 verbs in the V across N VAC distribution have a mean similarity score of 0.353 compared to 0.174 for the matching CEC.

5. Results

Our core research questions concern the degree to which VAC form, function, and usage promote robust learning. As we explained in the theoretical background, the psychology of learning as it relates to these psycholinguistic matters suggests, in essence, that learnability will be optimized for constructions that are (1) Zipfian in their type-token distributions in usage, (2) selective in their verb form occupancy, and (3) coherent in their semantics. The values of the 23 VACs on the metrics we have described so far are given in Table 4, along with those for their yoked CECs in Table 5. Table 6 contrasts the VACs and the CECs on these measures using paired-sample t-tests. The results demonstrate the following.

5.1. Type-token Usage Distributions

All of the VACs are Zipfian in their type-token distributions in usage (VACs: M g = 1.00, M R² = 0.98). So too are their matched CECs (M g = 1.12, M R² = 0.96). The fit is slightly better for the VACs than the CECs because the yoked-matching algorithm tends to make the topmost types of the CEC somewhat less extreme in frequency than is found in the real VACs (because particular verbs are attracted to particular VACs), and so the fit line is not pulled out into so extreme a tail.
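As a sketch of the kind of contrast reported in Table 6, the paired t-test below takes the 1 − τ columns of Tables 4 and 5 (below) as its two samples; it is illustrative only, and is not claimed to reproduce the exact p values of the published analysis.

```r
# 1 - tau for the 23 VACs (Table 4) and their yoked CECs (Table 5)
vac_tau <- c(0.74, 0.77, 0.69, 0.77, 0.77, 0.76, 0.87, 0.79, 0.80, 0.73, 0.72,
             0.71, 0.66, 0.66, 0.60, 0.88, 0.87, 0.83, 0.72, 0.78, 0.70, 0.81, 0.81)
cec_tau <- c(0.17, 0.19, 0.22, 0.35, 0.22, 0.31, 0.14, 0.21, 0.26, 0.10, 0.15,
             0.11, 0.24, 0.20, 0.28, 0.13, 0.22, 0.24, 0.18, 0.25, 0.24, 0.21, 0.16)

# Paired comparison across the 23 constructions (d.f. = 22)
t.test(vac_tau, cec_tau, paired = TRUE)
```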

Table 4. Values for our 23 Verb Argument Constructions on metrics of Zipfian distribution, verb form selectivity, and semantic coherence.

VAC Pattern | R² | Entropy | g | χ² | 1−τ | Mean MI w→c | Mean ΔP c→w | Type entropy per root synset | Token entropy per root synset | Proportion of tokens covered by top 3 synsets | lch | res
V about N | 0.98 | 3.79 | 0.80 | 29919 | 0.74 | 15.55 | 0.011 | 3.17 | 2.42 | 0.45 | 0.162 | 0.271
V across N | 0.99 | 5.30 | 1.08 | 23324 | 0.77 | 15.49 | 0.003 | 2.75 | 2.08 | 0.25 | 0.194 | 0.353
V after N | 0.99 | 5.04 | 1.04 | 48065 | 0.69 | 12.87 | 0.002 | 3.33 | 2.12 | 0.31 | 0.103 | 0.184
V among pl-N | 0.99 | 5.36 | 1.43 | 9196 | 0.77 | 17.51 | 0.009 | 2.93 | 2.79 | 0.11 | 0.096 | 0.174
V around N | 0.97 | 5.51 | 1.17 | 40241 | 0.77 | 15.96 | 0.004 | 2.80 | 2.43 | 0.19 | 0.155 | 0.284
V as adj | 0.96 | 4.05 | 0.98 | 8993 | 0.76 | 17.88 | 0.020 | 3.20 | 2.48 | 0.34 | 0.078 | 0.141
V as N | 0.99 | 4.84 | 0.80 | 184085 | 0.87 | 10.36 | 0.003 | 3.55 | 2.56 | 0.25 | 0.079 | 0.146
V at N | 0.97 | 4.94 | 1.02 | 66633 | 0.79 | 12.51 | 0.003 | 3.23 | 1.72 | 0.36 | 0.099 | 0.185
V between pl-N | 0.98 | 5.17 | 1.08 | 47503 | 0.80 | 15.18 | 0.005 | 3.11 | 2.61 | 0.21 | 0.078 | 0.149
V for N | 0.97 | 5.58 | 0.79 | 212342 | 0.73 | 9.54 | 0.002 | 3.38 | 2.70 | 0.16 | 0.117 | 0.198
V in N | 0.96 | 6.22 | 0.96 | 61215 | 0.72 | 10.48 | 0.002 | 3.56 | 2.90 | 0.10 | 0.079 | 0.138
V into N | 0.98 | 5.22 | 0.82 | 82396 | 0.71 | 11.44 | 0.003 | 3.21 | 2.39 | 0.26 | 0.168 | 0.289
V like N | 0.98 | 4.80 | 1.08 | 12141 | 0.66 | 15.84 | 0.009 | 2.99 | 1.92 | 0.34 | 0.121 | 0.216
V N N | 0.99 | 3.79 | 0.84 | 51652 | 0.66 | 11.52 | 0.004 | 3.21 | 2.38 | 0.41 | 0.139 | 0.236
V off N | 0.98 | 4.89 | 1.29 | 10101 | 0.60 | 17.84 | 0.011 | 2.64 | 2.46 | 0.21 | 0.198 | 0.358
V of N | 0.97 | 4.26 | 0.76 | 319284 | 0.88 | 11.15 | 0.003 | 3.31 | 2.56 | 0.33 | 0.11 | 0.189
V over N | 0.98 | 5.95 | 1.08 | 77407 | 0.87 | 13.72 | 0.002 | 2.87 | 2.33 | 0.17 | 0.237 | 0.404
V through N | 0.99 | 5.37 | 1.11 | 29525 | 0.83 | 14.84 | 0.003 | 3.05 | 2.10 | 0.26 | 0.147 | 0.266
V to N | 0.95 | 5.02 | 0.92 | 25729 | 0.72 | 13.50 | 0.003 | 2.88 | 2.59 | 0.19 | 0.189 | 0.325
V towards N | 0.98 | 4.36 | 1.16 | 15127 | 0.78 | 19.59 | 0.017 | 2.68 | 2.35 | 0.31 | 0.149 | 0.274
V under N | 0.97 | 5.74 | 1.10 | 19244 | 0.70 | 13.13 | 0.002 | 3.07 | 2.54 | 0.16 | 0.14 | 0.248
V way prep | 0.99 | 3.61 | 0.83 | 29827 | 0.81 | 17.26 | 0.013 | 3.27 | 2.46 | 0.39 | 0.105 | 0.194
V with N | 0.98 | 5.59 | 0.96 | 192521 | 0.81 | 12.56 | 0.003 | 3.16 | 2.50 | 0.18 | 0.136 | 0.231
Mean | 0.98 | 4.97 | 1.00 | 69412 | 0.76 | 14.16 | 0.006 | 3.10 | 2.41 | 0.26 | 0.134 | 0.237


Table 5. Values for our 23 Control Ersatz Constructions on metrics of Zipfian distribution, verb form selectivity, and semantic coherence.

VAC Pattern | R² | Entropy | g | χ² | 1−τ | Mean MI w→c | Mean ΔP c→w | Type entropy per root synset | Token entropy per root synset | Proportion of tokens covered by top 3 synsets | lch | res
V about N | 0.94 | 4.80 | 1.04 | 441 | 0.17 | 14.02 | 0.004 | 3.52 | 3.07 | 0.15 | 0.084 | 0.152
V across N | 0.96 | 5.55 | 1.14 | 232 | 0.19 | 13.29 | 0.003 | 3.37 | 3.08 | 0.11 | 0.098 | 0.174
V after N | 0.97 | 5.95 | 1.21 | 222 | 0.22 | 12.05 | 0.002 | 3.65 | 3.11 | 0.09 | 0.083 | 0.146
V among pl-N | 0.99 | 5.39 | 1.45 | 867 | 0.35 | 15.25 | 0.006 | 3.42 | 3.10 | 0.10 | 0.068 | 0.123
V around N | 0.96 | 5.57 | 1.17 | 366 | 0.22 | 13.71 | 0.003 | 3.51 | 3.14 | 0.10 | 0.093 | 0.16
V as adj | 0.96 | 4.74 | 1.26 | 1232 | 0.31 | 15.97 | 0.010 | 2.90 | 2.84 | 0.16 | 0.165 | 0.286
V as N | 0.94 | 5.98 | 0.99 | 203 | 0.14 | 15.97 | 0.010 | 3.64 | 3.07 | 0.09 | 0.088 | 0.154
V at N | 0.96 | 6.01 | 1.18 | 248 | 0.21 | 11.67 | 0.002 | 3.53 | 3.11 | 0.08 | 0.083 | 0.151
V between pl-N | 0.97 | 5.46 | 1.19 | 329 | 0.26 | 13.52 | 0.003 | 3.33 | 3.04 | 0.11 | 0.092 | 0.149
V for N | 0.93 | 6.22 | 0.94 | 205 | 0.10 | 8.42 | 0.001 | 3.83 | 3.12 | 0.08 | 0.075 | 0.198
V in N | 0.96 | 6.33 | 1.05 | 228 | 0.15 | 9.47 | 0.001 | 3.70 | 3.07 | 0.09 | 0.082 | 0.138
V into N | 0.94 | 5.81 | 0.86 | 225 | 0.11 | 9.62 | 0.002 | 3.71 | 3.12 | 0.10 | 0.088 | 0.289
V like N | 0.95 | 5.46 | 1.32 | 678 | 0.24 | 14.57 | 0.004 | 3.38 | 3.10 | 0.12 | 0.083 | 0.216
V N N | 0.94 | 5.48 | 1.05 | 226 | 0.20 | 11.98 | 0.002 | 3.53 | 3.10 | 0.10 | 0.093 | 0.236
V off N | 0.96 | 4.89 | 1.23 | 3853 | 0.28 | 16.22 | 0.010 | 3.42 | 3.20 | 0.13 | 0.072 | 0.358
V of N | 0.93 | 5.70 | 0.97 | 193 | 0.13 | 10.38 | 0.002 | 3.72 | 3.08 | 0.10 | 0.078 | 0.189
V over N | 0.97 | 5.99 | 1.18 | 264 | 0.22 | 11.79 | 0.002 | 3.60 | 3.07 | 0.09 | 0.081 | 0.404
V through N | 0.97 | 5.66 | 1.19 | 293 | 0.24 | 12.98 | 0.002 | 3.51 | 3.21 | 0.09 | 0.099 | 0.266
V to N | 0.93 | 5.46 | 1.05 | 253 | 0.18 | 12.48 | 0.003 | 3.52 | 3.11 | 0.10 | 0.091 | 0.325
V towards N | 0.98 | 4.52 | 1.14 | 3414 | 0.25 | 17.08 | 0.015 | 3.15 | 2.84 | 0.20 | 0.085 | 0.274
V under N | 0.97 | 5.99 | 1.18 | 237 | 0.24 | 11.82 | 0.002 | 3.59 | 3.22 | 0.08 | 0.14 | 0.248
V way prep | 0.95 | 4.32 | 0.90 | 1628 | 0.21 | 11.82 | 0.002 | 3.57 | 2.96 | 0.22 | 0.105 | 0.194
V with N | 0.95 | 6.07 | 1.05 | 220 | 0.16 | 10.22 | 0.002 | 3.57 | 3.13 | 0.09 | 0.136 | 0.231
Mean | 0.96 | 5.54 | 1.12 | 698 | 0.21 | 12.80 | 0.004 | 3.51 | 3.08 | 0.11 | 0.094 | 0.22


Table 6. Comparisons of values for our 23 VACs and CECs on metrics of Zipfian distribution, verb form selectivity, and semantic coherence.

Measure | Mean VACs | Mean CECs | p value for paired t-test (d.f. 22)
R² | 0.98 | 0.96 | 1.6e-06 ***
Entropy | 4.97 | 5.54 | 4.4e-06 ***
g | 1.00 | 1.12 | 4.89e-04 ***
χ² | 69412 | 698 | 5.5e-18 ***
1−τ | 0.76 | 0.21 | 1.9e-03 ***
Mean MI w→c | 14.16 | 12.80 | 1.1e-02 ***
Mean ΔP c→w | 0.006 | 0.004 | 5.1e-05 ***
Type entropy per root synset | 3.10 | 3.51 | 1.7e-08 ***
Token entropy per root synset | 2.41 | 3.08 | 1.2e-10 ***
Proportion of tokens covered by top 3 synsets | 0.26 | 0.11 | 3.2e-08 ***
lch | 0.134 | 0.094 | 2.0e-04 ***
res | 0.237 | 0.22 | 1.6e-06 ***



Inspection of the graphs for each of the 23 VACs shows that the highest frequency items take the lion's share of the distribution and, as in prior research (Ellis & Ferreira-Junior, 2009b; Goldberg et al., 2004; Ninio, 1999, 2006), the lead member is prototypical of the construction and generic in its action semantics (see the rightmost column in Table 1).

5.2. Family Membership and Type Occupancy

VACs are selective in their verb form family occupancy. There is much less entropy in the VACs than the CECs, with fewer forms of a less evenly distributed nature (M distribution entropy VAC 4.97, CEC 5.54, p < .0001). The deviation of the distribution from verb frequency in the language as a whole is much greater in the VACs than the CECs (M χ² VAC 69411, CEC 698, p < .0001). The lack of overall correlation between VAC verb frequency and overall verb frequency in the language is much greater in the VACs (M 1−τ VAC 0.76, CEC 0.21, p < .002). Individual verbs select particular constructions (M MI w→c VAC 14.16, CEC 12.80, p < .01) and particular constructions select particular words (M ΔP c→w VAC 0.006, CEC 0.004, p < .0001). Overall, then, there is greater contingency between verb types and constructions.

5.3. Semantic Coherence

VACs are coherent in their semantics, with lower type (M VAC 3.10, CEC 3.51, p < .0001) and token (M VAC 2.41, CEC 3.08, p < .0001) sense entropy. Figure 3 shows the distributions of the root synsets for the top 20 types of each of the VAC-CEC pairs through plots of logarithmic token frequency against rank – in each pair, fewer senses cover more of the VAC uses than of the CEC uses. Figure 3 also shows the proportion of tokens accounted for by the top three root synsets (e.g. for V across N: VAC 0.25, CEC 0.11). The proportion of the total tokens covered by their three most frequent WordNet roots is much higher in the VACs (M VAC 0.26, CEC 0.11, p < .0001). Finally, the VAC distributions are higher on the Pedersen semantic similarity measures (M lch VAC 0.13, CEC 0.09, p < .0002; M res VAC 0.24, CEC 0.22, p < .0001).

6. Discussion

Twenty-three constructions is a better sample of constructions than three, and the 16,141,058 tokens of verb usage analyzed here are a lot more representative than the 14,474 analyzed in Ellis & Ferreira-Junior (2009a, b).


Figure 3. Distribution of WordNet root verb synsets for VACs and CECs


Nevertheless, the conclusions from those earlier studies seem to generalize. We have shown:

– The frequency distribution for the types occupying the verb island of each VAC is Zipfian.
– The most frequent verb for each VAC is much more frequent than the other members, taking the lion's share of the distribution.
– The most frequent verb in each VAC is prototypical of that construction's functional interpretation, albeit generic in its action semantics.
– VACs are selective in their verb form family occupancy:
  – Individual verbs select particular constructions.
  – Particular constructions select particular verbs.
  – There is greater contingency between verb types and constructions.
– VACs are coherent in their semantics.

Psychology theory relating to the statistical learning of categories suggests that these are the factors which make concepts robustly learnable. We suggest, therefore, that these are the mechanisms which make linguistic constructions robustly learnable too, and that they are learned by similar means.

7. Future Work

7.1. An Exhaustive Inventory of English VACs

This is still a small sample from which to generalize. In subsequent work we intend to analyze the 700+ patterns of the Grammar Patterns volume as found in the 100 million words of the BNC. Other theories of construction grammar start from different motivations, some more semantic [e.g. FrameNet (Fillmore, Johnson, and Petruck 2003) and VerbNet (Kipper et al. 2008; Palmer 2010; Levin 1993)], some alternatively syntactic [e.g. the Erlangen Valency Patternbank (Herbst and Uhrig 2010; Herbst et al. 2004)], and so present different, complementary descriptions of English verb grammar. Given time, we hope to analyze usage patterns from these descriptions too. We are particularly interested in whether these inventories represent optimal partitioning of verb semantics, starting with basic categories of action semantics and proceeding to greater specificity via Zipfian mapping.

7.2. Learner Language

We are also interested in extending these approaches to learner language to investigate whether first and L2 learners' acquisition follows the same
construction distributional profiles. We have done some initial pilot work to test the viability of our methods by extracting 18 of the same VAC patterns from American English and British English child language acquisition corpora in CHILDES (MacWhinney 2000) transcripts. Child directed speech (CDS, over 6.8 million words) was separated from the speech of the target child (over 3.6 million words) for the UK and USA components of the database, where dependency parsing of each utterance is available (Sagae et al. 2007). The same analysis steps described here are equally viable with learner language. In our initial explorations (O'Donnell and Ellis submitted) we build on the types of analysis carried out in Goldberg, Casenhiser & Sethuraman (2004) that demonstrate how the frequency profiles of CDS are reproduced in child language. For example, for the V across N VAC pattern, go is the most frequent type in both CDS and child speech. Likewise, for V over N we found go and jump as the first types in both samples. For V with N the top 4 types, play, go, do, come, are shared, as they are for V under N: go, look, get, hide, and the top two for V like N: look and go. The nature of CDS with respect to more general English can also be examined. Applying the various contingency and semantic measures discussed above, we found the 10 most faithful types to the VAC pattern V like N were: (1) from the BNC: glitter, behave, gleam, bulge, shape, flutter, glow, shine, sound, sway (with a wup similarity score of 0.3559), and (2) for CDS: sound, act, shape, smell, taste, look, yell, feel, talk, fit (wup 0.4564). This initial analysis points both to the more frequent use of generic verbs (e.g. go and do) in CDS and to a tighter semantic coherence in the items most associated with specific VACs. These steps need next to be done for the complete inventory of VACs so that a comparison can be made of general usage (BNC), CDS, and child language acquisition at different stages.

7.3. Determinants of Learning

Once we have these parallel datasets of sufficient scale, we can undertake a principled empirical analysis of the degree to which the psychological factors outlined really do determine acquisition. For each VAC in the input we will have the data relating to frequency, distributional, contingency, and semantic factors which learning theory considers important in acquisition. With the staged child language acquisition analyzed in the same way, we can test these predictions and explore how the different factors conspire in the emergence of language.


7.4. Modeling Acquisition

As we have argued in an upcoming review of statistical corpus linguistics and language cognition (Ellis in press), the field as a whole needs to work out how to combine the various corpus metrics that contribute to learnability into a model of acquisition, rather than a series of piecemeal univariate snapshots. We have developed some connectionist methods for looking at this and trialed them with just the three VACs VL, VOL, and VOO (Ellis and Larsen-Freeman 2009), but that enterprise and the current one are of hugely different scales. We need models of acquisition that relate such VAC measures as applied to the BNC and CDS to longitudinal patterns of child language and L2 acquisition.

8. Conclusion

This research shows some promise towards an English verb grammar operationalized as an inventory of VACs, their verb membership and their type-token frequency distributions, their contingency of mapping, and their semantic motivations. Our initial analyses show that constructions are (1) Zipfian in their type-token distributions in usage, (2) selective in their verb form occupancy, and (3) coherent in their semantics. Psychology theory relating to the statistical learning of categories suggests that these are the factors which make concepts robustly learnable. We suggest, therefore, that these are the mechanisms which make linguistic constructions robustly learnable too, and that they are learned by similar means.

9. Epilogue

Phoebe was a black and brindle collie-cross (Figure 4). She was 12 years old when we brought her to (VOL to) the US. It was Michigan, February, blue skies over 12″ of snow. We collected her, dehydrated, from (VOL from) DTW, left the airport, and pulled onto (VL onto) the nearest safe verge to let her out (VOL out) of her travel-kennel. It had been a long flight and we were somewhat concerned, but after a typically warm reunion, she looked at (VL at) the strange whiteness, and then, like a wolf pouncing on (VL on) a mouse, she ponked into (VL into) the snow.


Figure 4. Phoebe. One particular dog. How was your estimate?

References

Allan, Lorraine G. 1980. A note on measurement of contingency between two binary variables in judgment tasks. Bulletin of the Psychonomic Society 15: 147–149.
Andersen, Øistein E., Julien Nioche, Ted Briscoe, and John Carroll. 2008. The BNC parsed with RASP4UIMA. Proceedings of the Sixth International Language Resources and Evaluation (LREC08): 28–30.
Anderson, John Robert. 2000. Cognitive psychology and its implications (5th ed.). New York: W. H. Freeman.
Baayen, Harald. 2008. Corpus linguistics in morphology: morphological productivity. In Corpus linguistics: An international handbook, edited by A. Lüdeling and M. Kytö. Berlin: Mouton de Gruyter.
Bannard, Colin and Elena Lieven. 2009. Repetition and reuse in child language learning. In Formulaic language: Volume 2. Acquisition, loss, psychological reality, and functional explanations, edited by R. Corrigan, E. A. Moravcsik, H. Ouali and K. M. Wheatley. Amsterdam: John Benjamins.


Bartlett, Frederick Charles. [1932] 1967. Remembering: A study in experimental and social psychology. Cambridge: Cambridge University Press.
BNC. 2007. BNC XML Edition. http://www.natcorp.ox.ac.uk/corpus/.
Bod, Rens, Jennifer Hay, and Stefanie Jannedy, eds. 2003. Probabilistic linguistics. Cambridge, MA: MIT Press.
Briscoe, Ted, John Carroll, and Rebecca Watson. 2006. The second release of the RASP system. Proceedings of the COLING/ACL 2006 Interactive Presentation Sessions, Sydney, Australia.
Bybee, Joan. 2010. Language, usage, and cognition. Cambridge: Cambridge University Press.
Bybee, Joan and Paul Hopper, eds. 2001. Frequency and the emergence of linguistic structure. Amsterdam: Benjamins.
Bybee, Joan and Sandra Thompson. 2000. Three frequency effects in syntax. Berkeley Linguistic Society 23: 65–85.
Christiansen, Morten H. and Nick Chater, eds. 2001. Connectionist psycholinguistics. Westport, CT: Ablex.
Clark, Eve V. 1978. Discovering what words can do. In Papers from the parasession on the lexicon, Chicago Linguistics Society, April 14–15, 1978, edited by D. Farkas, W. M. Jacobsen and K. W. Todrys. Chicago: Chicago Linguistics Society.
Clark, Eve V. and Barb Kelly, eds. 2006. Constructions in acquisition. Chicago: University of Chicago Press.
Dorogovstev, Sergey N. and José F. F. Mendes. 2003. Evolution of networks: From biological nets to the Internet and WWW. Oxford: Oxford University Press.
Ebbinghaus, Hermann. 1885. Memory: A contribution to experimental psychology. Translated by H. A. R. C. E. B. (1913). New York: Teachers College, Columbia.
Ellis, Nick C. 2002. Frequency effects in language processing: A review with implications for theories of implicit and explicit language acquisition. Studies in Second Language Acquisition 24(2): 143–188.
Ellis, Nick C. 2006. Language acquisition as rational contingency learning. Applied Linguistics 27(1): 1–24.


Ellis, Nick C. 2006. Selective attention and transfer phenomena in SLA: Contingency, cue competition, salience, interference, overshadowing, blocking, and perceptual learning. Applied Linguistics 27(2): 1–31.
Ellis, Nick C. 2008. The dynamics of second language emergence: Cycles of language use, language change, and language acquisition. Modern Language Journal 41(3): 232–249.
Ellis, Nick C. 2008. Usage-based and form-focused language acquisition: The associative learning of constructions, learned-attention, and the limited L2 endstate. In Handbook of cognitive linguistics and second language acquisition, edited by P. Robinson and N. C. Ellis. London: Routledge.
Ellis, Nick C. in press. What can we count in language, and what counts in language acquisition, cognition, and use? In Frequency effects in cognitive linguistics (Vol. 1): Statistical effects in learnability, processing and change, edited by S. T. Gries and D. S. Divjak. Berlin: Mouton de Gruyter.
Ellis, Nick C. and Teresa Cadierno. 2009. Constructing a second language. Annual Review of Cognitive Linguistics 7 (Special section): 111–290.
Ellis, Nick C. and Fernando Ferreira-Junior. 2009. Construction learning as a function of frequency, frequency distribution, and function. Modern Language Journal 93: 370–386.
Ellis, Nick C. and Fernando Ferreira-Junior. 2009. Constructions and their acquisition: Islands and the distinctiveness of their occupancy. Annual Review of Cognitive Linguistics 7: 111–139.
Ellis, Nick C. and Diane Larsen-Freeman. 2009. Constructing a second language: Analyses and computational simulations of the emergence of linguistic constructions from usage. Language Learning 59 (Supplement 1): 93–128.
Ellis, Nick C., Rita Simpson-Vlach, and Carson Maynard. 2008. Formulaic language in native and second-language speakers: Psycholinguistics, corpus linguistics, and TESOL. TESOL Quarterly 42(3): 375–396.
Evert, Stefan. 2005. The statistics of word cooccurrences: Word pairs and collocations. University of Stuttgart, Stuttgart.
Ferrer-i-Cancho, Ramon and Ricard V. Solé. 2002. Zipf's law and random texts. Advances in Complex Systems 5(1): 1–6.


Fillmore, Charles J., Christopher R. Johnson, and Miriam R. L. Petruck. 2003. Background to FrameNet. International Journal of Lexicography 16: 235–250.
Francis, Gill, Susan Hunston, and Elizabeth Manning, eds. 1996. Grammar Patterns 1: Verbs. The COBUILD Series. London: Harper Collins.
Goldberg, Adele E. 1995. Constructions: A construction grammar approach to argument structure. Chicago: University of Chicago Press.
Goldberg, Adele E. 2006. Constructions at work: The nature of generalization in language. Oxford: Oxford University Press.
Goldberg, Adele E., Devin M. Casenhiser, and Nitya Sethuraman. 2004. Learning argument structure generalizations. Cognitive Linguistics 15: 289–316.
Gómez, Rebecca. 2002. Variability and detection of invariant structure. Psychological Science 13: 431–436.
Gries, Stefan Th. and Anatol Stefanowitsch. 2004. Extending collostructional analysis: a corpus-based perspective on 'alternations'. International Journal of Corpus Linguistics 9: 97–129.
Gries, Stefan Th. and Stefanie Wulff. 2005. Do foreign language learners also have constructions? Evidence from priming, sorting, and corpora. Annual Review of Cognitive Linguistics 3: 182–200.
Grinberg, Dennis, John Lafferty, and Daniel Sleator. 1995. A robust parsing algorithm for link grammars. Carnegie Mellon University Computer Science technical report CMU-CS-95-125.
Harnad, Steven, ed. 1987. Categorical perception: The groundwork of cognition. New York: Cambridge University Press.
Herbst, Thomas, David Heath, Ian Roe, and Dieter Götz. 2004. Valency Dictionary of English. Berlin/New York: Mouton de Gruyter.
Herbst, Thomas and Peter Uhrig. 2010. Erlangen Valency Patternbank. Available from http://www.patternbank.uni-erlangen.de/cgi-bin/patternbank.cgi?do=help.
Hunston, Susan and Gill Francis. 1996. Pattern grammar: A corpus driven approach to the lexical grammar of English. Amsterdam: Benjamins.
Hunt, Ruskin H. and Richard N. Aslin. 2010. Category induction via distributional analysis: Evidence from a serial reaction time task. Journal of Memory and Language 62: 98–112.


Jackendoff, Ray. 1997. Twistin' the night away. Language 73: 543–559.
Jurafsky, Daniel. 2002. Probabilistic modeling in psycholinguistics: Linguistic comprehension and production. In Probabilistic linguistics, edited by R. Bod, J. Hay and S. Jannedy. Cambridge, MA: MIT Press.
Jurafsky, Daniel and James H. Martin. 2000. Speech and language processing: An introduction to natural language processing, speech recognition, and computational linguistics. Englewood Cliffs, NJ: Prentice-Hall.
Kello, Chris T. and Brandon C. Beltz. 2009. Scale-free networks in phonological and orthographic wordform lexicons. In Approaches to phonological complexity, edited by I. Chitoran, C. Coupé, E. Marsico and F. Pellegrino. Berlin: Mouton de Gruyter.
Kello, Chris T., Gordon D. A. Brown, Ramon Ferrer-i-Cancho, John G. Holden, Klaus Linkenkaer-Hansen, Theo Rhodes, and Guy C. Van Orden. 2010. Scaling laws in cognitive sciences. Trends in Cognitive Science 14: 223–232.
Kipper, Karin, Anna Korhonen, Neville Ryant, and Martha Palmer. 2008. A large-scale classification of English verbs. Language Resources and Evaluation 41: 21–40.
Klein, Dan and Christopher D. Manning. 2003. Fast exact inference with a factored model for natural language parsing. In Advances in Neural Information Processing Systems. Cambridge, MA: MIT Press.
Lakoff, George. 1987. Women, fire, and dangerous things: What categories reveal about the mind. Chicago: University of Chicago Press.
Landauer, Thomas K., Danielle S. McNamara, Simon Dennis, and Walter Kintsch, eds. 2007. Handbook of Latent Semantic Analysis. Mahwah, NJ: Lawrence Erlbaum.
Levin, Beth. 1993. English verb classes and alternations: A preliminary analysis. Chicago: Chicago University Press.
Li, Wentian. 1992. Random texts exhibit Zipf's-law-like word frequency distribution. IEEE Transactions on Information Theory 38(6): 1842–1845.
MacWhinney, Brian. 1987. Applying the Competition Model to bilingualism. Applied Psycholinguistics 8(4): 315–327.
MacWhinney, Brian. 1987. The Competition Model. In Mechanisms of language acquisition, edited by B. MacWhinney. Hillsdale, NJ: Erlbaum.


MacWhinney, Brian. 1997. Second language acquisition and the Competition Model. In Tutorials in bilingualism: Psycholinguistic perspectives, edited by A. M. B. De Groot and J. F. Kroll. Mahwah, NJ: Lawrence Erlbaum Associates.
MacWhinney, Brian. 2000. The CHILDES project: Tools for analyzing talk, Vol. 1: Transcription format and programs (3rd ed.).
MacWhinney, Brian. 2000. The CHILDES project: Tools for analyzing talk, Vol. 2: The database (3rd ed.).
MacWhinney, Brian. 2001. The Competition Model: The input, the context, and the brain. In Cognition and second language instruction, edited by P. Robinson. New York: Cambridge University Press.
Manin, Dmitrii Y. 2008. Zipf's law and avoidance of excessive synonymy. Cognitive Science 32: 1075–1098.
Manning, Chris D. and Hinrich Schütze. 1999. Foundations of statistical natural language processing. Cambridge, MA: MIT Press.
Mason, Oliver and Susan Hunston. 2004. The automatic recognition of verb patterns: A feasibility study. International Journal of Corpus Linguistics 9: 253–270.
Miller, George A. 1957. Some effects of intermittent silence. American Journal of Psychology 70: 311–314.
Miller, George A. 1965. Preface to the MIT Press publication of G. K. Zipf (1935), The psychobiology of language: An introduction to dynamic philology. Boston, MA: MIT Press.
Miller, George A. 2009. WordNet – About us. Princeton University.
Mintz, Toben. 2002. Category induction from distributional cues in an artificial language. Memory & Cognition 30: 678–686.
Newman, Mark. 2005. Power laws, Pareto distributions and Zipf's law. Contemporary Physics 46: 323–351.
Ninio, Anat. 1999. Pathbreaking verbs in syntactic development and the question of prototypical transitivity. Journal of Child Language 26: 619–653.
Ninio, Anat. 2006. Language and the learning curve: A new theory of syntactic development. Oxford: Oxford University Press.


Nivre, Joakim, Johan Hall, and Jens Nilsson. 2004. Memory-based dependency parsing. Proceedings of the Eighth Conference on Computational Natural Language Learning (CoNLL).
O'Donnell, Matthew B. and Nick C. Ellis. submitted. 'Zipfing across the input': Analyzing the distribution and semantics of verbs in verb argument patterns in child directed and general English. In BUCLD 35. Boston.
O'Donnell, Matthew B. and Nick C. Ellis. 2009. Measuring formulaic language in corpora from the perspective of language as a complex system. Paper read at the 5th Corpus Linguistics Conference, 20–23 July 2009, University of Liverpool.
O'Donnell, Matthew B. and Nick C. Ellis. 2010. Towards an inventory of English verb argument constructions. Proceedings of the 11th Annual Conference of the North American Chapter of the Association for Computational Linguistics, Los Angeles.
Onnis, Luca, Padraic Monaghan, Morten H. Christiansen, and Nick Chater. 2004. Variability is the spice of learning, and a crucial ingredient for detecting and generalising nonadjacent dependencies. Proceedings of the 26th Annual Conference of the Cognitive Science Society.
Palmer, Martha. 2010. VerbNet: A class based verb lexicon. Available from http://verbs.colorado.edu/~mpalmer/projects/verbnet.html.
Pedersen, Ted and Varada Kolhatkar. 2009. WordNet::SenseRelate::AllWords – A broad coverage word sense tagger that maximizes semantic relatedness. Proceedings of the Demonstration Session of the Human Language Technology Conference and the Tenth Annual Meeting of the North American Chapter of the Association for Computational Linguistics, Boulder, Colorado.
Pedersen, Ted, Siddharth Patwardhan, and Jason Michelizzi. 2004. WordNet::Similarity – Measuring the relatedness of concepts. Proceedings of the Fifth Annual Meeting of the North American Chapter of the Association for Computational Linguistics (NAACL 2004).
Perruchet, Pierre and Sébastien Pacton. 2006. Implicit learning and statistical learning: one phenomenon, two approaches. Trends in Cognitive Sciences 10: 233–238.
Pinker, Steven. 1989. Learnability and cognition: The acquisition of argument structure. Cambridge, MA: Bradford Books.
Posner, Michael I. and Stephen W. Keele. 1970. Retention of abstract ideas. Journal of Experimental Psychology 83: 304–308.


R Development Core Team. 2010. R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria.
Rescorla, Robert A. 1968. Probability of shock in the presence and absence of CS in fear conditioning. Journal of Comparative and Physiological Psychology 66: 1–5.
Resnik, Philip. 1995. Using information content to evaluate semantic similarity in a taxonomy. IJCAI: 448–453.
Rosch, Eleanor and Carolyn B. Mervis. 1975. Cognitive representations of semantic categories. Journal of Experimental Psychology: General 104: 192–233.
Rosch, Eleanor, Carolyn B. Mervis, Wayne D. Gray, David M. Johnson, and Penny Boyes-Braem. 1976. Basic objects in natural categories. Cognitive Psychology 8: 382–439.
Sagae, Kenji, Eric Davis, Alon Lavie, Brian MacWhinney, and Shuly Wintner. 2007. High-accuracy annotation and parsing of CHILDES transcripts. In Proceedings of the ACL-2007 Workshop on Cognitive Aspects of Computational Language Acquisition. Prague, Czech Republic.
Schneider, Gerold, Fabio Rinaldi, and James Dowdall. 2004. Fast, deep-linguistic statistical dependency parsing. Paper read at the Recent Advances in DG workshop, Coling 2004, Geneva.
Shanks, David R. 1995. The psychology of associative learning. New York: Cambridge University Press.
Sinclair, John. 2004. Trust the text: Language, corpus and discourse. London: Routledge.
Sinclair, John, ed. 1987. Looking Up: An account of the COBUILD project in lexical computing. London: Collins ELT.
Solé, Ricard V., Bernat Murtra, Sergi Valverde, and Luc Steels. 2005. Language networks: their structure, function and evolution. Trends in Cognitive Sciences 12.
Stefanowitsch, Anatol and Stefan Th. Gries. 2003. Collostructions: Investigating the interaction between words and constructions. International Journal of Corpus Linguistics 8: 209–243.
Steyvers, Mark and Josh Tenenbaum. 2005. The large-scale structure of semantic networks: statistical analyses and a model of semantic growth. Cognitive Science 29: 41–78.
Taylor, John R. 1998. Syntactic constructions as prototype categories. In The new psychology of language: Cognitive and functional approaches to language structure, edited by M. Tomasello. Mahwah, NJ: Erlbaum.


Tomasello, Michael. 2003. Constructing a language. Boston, MA: Harvard University Press.
Tversky, Amos. 1977. Features of similarity. Psychological Review 84: 327–352.
Wittgenstein, Ludwig. 1953. Philosophical investigations. Translated by G. E. M. Anscombe. Oxford: Blackwell.
Yang, Charles. 2010. Who's afraid of George Kingsley Zipf? http://www.ling.upenn.edu/~ycharles/papers/zipfnew.pdf.
Zipf, George K. 1935. The psycho-biology of language: An introduction to dynamic philology. Cambridge, MA: The M.I.T. Press.

Can we enhance domain-general learning abilities to improve language function?

Christopher M. Conway, Michelle A. Gremp, Anne D. Walk, Althea Bauernschmidt and David B. Pisoni

One of the most exciting recent developments in the cognitive neurosciences is the growing realization that the mind and brain are highly plastic. Whereas traditionally the brain was thought to be relatively immutable past a certain age, there is a growing body of evidence suggesting that neural connections remain modifiable even late into adulthood (Kleim & Jones, 2008; van Praag, Kempermann, & Gage, 2000). In addition, recent research has demonstrated that even relatively short-term training can lead to improvements in neurocognitive abilities such as working memory capacity (Klingberg, 2010), with improvements transferring to a host of non-trained tasks of memory and cognition. In addition to working memory (WM), it may be fruitful to explore whether it is possible to enhance statistical learning abilities. Here we use the term statistical learning in a fairly broad sense, to refer to incidental learning that results in sensitivity to structured patterns in the environment. Under this definition, statistical learning is related to other forms of nondeclarative pattern learning abilities such as implicit learning (Perruchet & Pacton, 2006) and sequential learning (Conway, 2012). These kinds of domain-general learning abilities appear to be important for language acquisition and processing (Conway & Pisoni, 2008; Gervain & Mehler, 2010; Gogate & Hollich, 2010; Gupta & Dell, 1999; Kuhl, 2004; Reber, 1967; Saffran, 2003; Ullman, 2004). The question we address is whether it is possible to improve statistical learning abilities by capitalizing on the highly plastic nature of the mind and brain, in a similar vein to the demonstrations of improvements to WM capacity. Given that statistical learning is important for language acquisition and processing, we should expect that if statistical learning can be enhanced through some type of training regimen, this would result in a better facility for acquiring and processing language. Providing a demonstration of the causal effects of enhanced statistical learning on language acquisition would be important theoretically as well
as clinically. A number of language and communication disorders may in fact be due, at least in part, to disturbances of domain-general learning abilities such as statistical learning and procedural memory (Nicolson & Fawcett, 2007; Ullman & Pierpont, 2005), including dyslexia (Howard, Howard, Japikse, & Eden, 2006), specific language impairment (Evans, Saffran, & Robe-Torres, 2009), and language delays caused by a period of deafness early in development (Conway, Pisoni, Anaya, Karpicke, & Henning, 2011). In this chapter, we first review recent evidence highlighting the importance of statistical learning for language in populations both with and without a language or communication disorder. We then describe recent work using computerized training techniques that were designed to improve WM. This provides the background for presenting the results of a novel adaptive training task that we have developed to improve domain-general learning abilities. We review two studies still in their formative stages, the first with normal-hearing adults, the second with children who are deaf or hard of hearing. The initial findings, although still preliminary, show the promise of adaptively training basic elementary mechanisms of learning and memory to improve language and communication functions.

1. Statistical learning and language processing

It is widely accepted that statistical learning is important for language acquisition and processing. For instance, statistical learning mechanisms are thought to be important for word segmentation (Saffran, Aslin, & Newport, 1996), word learning (Graf Estes, Evans, Alibali, & Saffran, 2007; Mirman, Magnuson, Graf Estes, & Dixon, 2008), and the acquisition of syntax (Gomez & Gerken, 2000; Ullman, 2004). Previous work has shown that knowledge of the statistical probabilities in language can enable a listener to better identify – and perhaps even implicitly predict – the next word that will be spoken (Miller, Heise, & Lichten, 1951; Rubenstein, 1973; cf. Bar, 2007). This use of top-down knowledge becomes especially apparent when the speech signal is perceptually degraded, which is the case in many real-world situations. When ambient noise or multitalker babble degrades parts of a spoken utterance, the listener must rely on long-term knowledge of the statistical regularities in language to implicitly predict the next word that will be spoken based on the previous spoken words, thus improving speech perception and language comprehension (Elliott, 1995; Kalikow,
Stevens, & Elliott, 1977; McClelland, Mirman, & Holt, 2006; Miller et al., 1951; Pisoni, 1996).

Surprisingly, despite the voluminous work on statistical learning and the suggestions of its importance for language acquisition and processing, until recently no studies had shown an empirical association between individual differences in statistical learning abilities and language. Recently, we investigated whether statistical learning abilities would be associated with one particular measure of everyday language performance: how well one uses preceding sentence context to implicitly predict upcoming linguistic units (Conway et al., 2010). The rationale is that statistical learning might provide a language user with knowledge that constrains the possible set of words that will be heard next in a sentence. For example, consider the following two sentences:

(1) Her entry should win first prize.
(2) The arm is riding on the beach.

The final word in sentence (1) is highly predictable, while the final word in sentence (2) is not predictable. Therefore, when these two sentences are presented to participants under degraded listening conditions, long-term knowledge of language structure can improve perception of the final word in sentence (1) more so than in (2). We argue, then, that performance on the first type of sentence ought to be more closely associated with fundamental statistical learning abilities because it relies on one's knowledge of word predictability that accrued implicitly over many years of exposure to language. On the other hand, performance on the second type of sentence simply relates to how well one perceives speech in noise, where knowledge of sequential word predictability is less useful.

We directly tested this hypothesis by assessing healthy adult participants on both statistical learning and speech perception tasks. In the statistical learning task, participants observed and then immediately reproduced visual color sequences on a touch-screen monitor (Figure 1). Unbeknownst to participants, the task consisted of two parts, a learning phase and a test phase, which differed only in terms of the sequences used. In the learning phase, the sequences were constrained such that only certain colors (e.g., blue) would ever occur following certain others (e.g., green). In the test phase, participants were presented with novel sequences that either contained the same statistical regularities as before or were completely random, such that any color could occur no matter what preceded it (except that immediate color repetitions were not allowed).
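To make the structure of the two phases concrete, the sketch below generates learning-phase and test-phase sequences of the kind just described. The particular colors and the transition table are illustrative assumptions only; they are not the actual stimulus lists used by Conway et al. (2010).

```python
import random

COLORS = ["red", "green", "blue", "yellow"]

# Hypothetical learning-phase constraint: each color may be followed
# only by the colors listed for it (the actual pairings are not
# specified in the chapter and are invented here for illustration).
TRANSITIONS = {
    "red": ["green", "blue"],
    "green": ["blue", "yellow"],
    "blue": ["yellow", "red"],
    "yellow": ["red", "green"],
}

def structured_sequence(length):
    """Learning-phase sequence: each successor is drawn from the constrained set."""
    seq = [random.choice(COLORS)]
    while len(seq) < length:
        seq.append(random.choice(TRANSITIONS[seq[-1]]))
    return seq

def random_sequence(length):
    """Test-phase 'random' sequence: any color may follow any other,
    except that immediate repetitions are not allowed."""
    seq = [random.choice(COLORS)]
    while len(seq) < length:
        seq.append(random.choice([c for c in COLORS if c != seq[-1]]))
    return seq

if __name__ == "__main__":
    print(structured_sequence(6))
    print(random_sequence(6))
```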


Figure 1. Depiction of the visual sequential statistical learning task used in Conway et al., 2010, similar to that used in previous work (Conway et al., 2007; Karpicke & Pisoni, 2004). Participants view a sequence of colored squares (700-msec duration, 500-msec ISI) appearing on the computer screen (top) and then, 2000 msec after sequence presentation, they must attempt to reproduce the sequence by pressing the touch-panels in correct order (bottom). The next sequence occurs 2000 msec following their response.
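For readers who want to see how the trial timing in Figure 1 maps onto an experimental script, here is a minimal sketch. The `show` and `collect_response` callables are hypothetical stand-ins for the display and touch-screen routines of whatever presentation software is used.

```python
import time

# Timing parameters taken from the Figure 1 caption.
STIM_DURATION = 0.700   # each colored square shown for 700 ms
ISI = 0.500             # 500 ms inter-stimulus interval
RESPONSE_DELAY = 2.000  # 2000 ms between presentation and reproduction
ITI = 2.000             # 2000 ms between response and the next sequence

def run_trial(sequence, show, collect_response):
    """Present one sequence, then collect the reproduction attempt."""
    for color in sequence:
        show(color)
        time.sleep(STIM_DURATION)
        show(None)          # blank screen during the ISI
        time.sleep(ISI)
    time.sleep(RESPONSE_DELAY)
    response = collect_response()
    time.sleep(ITI)
    return response == sequence   # True if reproduced in correct order
```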


Learning was assessed by observing improvements to immediate memory span for statistically-consistent structured sequences (Botvinick, 2005; Conway et al., 2007; Hebb, 1961; Jamieson & Mewhort, 2005; Karpicke & Pisoni, 2004; Miller & Selfridge, 1950). That is, as participants were exposed to the statistical patterns, if any learning occurred, their immediate serial recall should improve for sequences that contained the same statistical regularities compared to ones that did not contain those regularities. This is an indirect measure of learning, which has a number of advantages over a more traditional direct measure such as explicitly asking which sequence was more "familiar" or "obeyed the rules" (Redington & Chater, 2002). Importantly, this indirect measure appears to provide a wider range of individual differences in performance as compared to explicit measures of implicit and statistical learning (Karpicke & Pisoni, 2004).

Participants also completed a speech perception in noise task. In this task, participants had to identify sentences spoken under degraded listening conditions in which half of the sentences ended on a highly predictable word (sentences of type 1) and half ended on a low-predictability word (sentences of type 2) (Elliott, 1995; Kalikow et al., 1977). To assess performance, we used the difference score suggested by Bilger and Rabinowitz (1979). This score was calculated by taking the difference between how well one perceives the final word in high-predictability sentences and how well one perceives the final word in low- or zero-predictability sentences. This difference score provides a means of assessing how well an individual is able to use sentence context to guide spoken language perception under degraded listening conditions.

Across three experiments, we found that individual differences in statistical learning abilities were significantly correlated with the sentence perception difference score (Figure 2). Importantly, the correlations remained even after controlling for sources of variance associated with non-verbal intelligence, verbal short-term memory and WM, attention and inhibition, and knowledge of vocabulary and syntax. We conclude that the common factor involved in both tasks – and which mediated the observed correlations – is sensitivity to the underlying statistical structure contained in sequential patterns, independent of general memory, intelligence, or linguistic abilities (Conway et al., 2010).

We propose that superior statistical learning abilities result in more detailed and robust representations of the structure of spoken language. Having a more detailed veridical representation of the likely probability that any given linguistic unit will follow based on what has already occurred
can in turn improve how well one can rely on top-down knowledge to help implicitly predict, and therefore perceive, the next word spoken in a sentence. Thus, forming predictions for upcoming language units may be an important way in which statistical learning directly supports language acquisition and processing (see also Misyak, Christiansen, & Tomblin, 2010).

Figure 2. Scatterplot of data from Experiment 3 (n = 59) of Conway et al. (2010). The x-axis displays the implicit statistical learning scores; the y-axis displays the word predictability difference scores for the spoken sentence perception task.
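A compact way to express the two measures and the correlation analysis described above is sketched below. The per-participant scores are invented for illustration, and the exact scoring units used by Conway et al. (2010) may differ.

```python
from scipy.stats import pearsonr

def sequence_learning_score(structured_correct, random_correct):
    """Indirect learning measure: recall advantage for test-phase sequences
    that follow the familiar regularities over fully random ones."""
    return structured_correct - random_correct

def spin_difference_score(high_pred_correct, low_pred_correct):
    """Bilger and Rabinowitz (1979) style difference score: benefit of a
    predictive sentence frame for identifying the final word in noise."""
    return high_pred_correct - low_pred_correct

# Hypothetical per-participant scores, for illustration only.
learning = [sequence_learning_score(s, r) for s, r in [(7, 5), (6, 6), (9, 5), (8, 7)]]
spin = [spin_difference_score(h, l) for h, l in [(18, 12), (14, 13), (20, 11), (17, 13)]]

r, p = pearsonr(learning, spin)
print(f"r = {r:.2f}, p = {p:.3f}")
```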

2. Statistical learning in language disorders

If domain-general statistical learning abilities are important for language acquisition and processing, then we might expect that what initially appear to be language-specific disorders may be due in part to disturbances of domain-general learning abilities. There is, in fact, a growing body of
evidence suggesting that this is indeed the case. Here, we review research examining statistical learning in specific language impairment and dyslexia. Then, we present the results of a new study with deaf children with cochlear implants, which supports the theory that statistical learning is a crucial part of typical language acquisition and that, if it is disturbed or developmentally delayed, successful language development can be impaired (Conway, Pisoni, Anaya, Karpicke, & Henning, 2011).

A growing body of research has established implicit learning impairments in individuals with various types of language disorders. For example, Plante, Gomez, and Gerken (2002) showed that a group of adults with language and reading impairments had more difficulty with an artificial grammar learning task than adults without a diagnosed language disorder. In terms of specific language impairment (SLI), recent research indicates that implicit learning abilities may be intact but significantly slower than in normal controls. For example, one study showed that in a serial reaction time task, adolescents with and without SLI showed evidence for implicitly learning embedded patterns (i.e., reaction times improved over trials), but learning rates for the SLI group were slower (Tomblin, Mainela-Arnold, & Zhang, 2007). Likewise, children diagnosed with SLI were able to learn an artificial language after 42 minutes, whereas controls learned it after only 21 minutes (Evans, Saffran, & Robe-Torres, 2009). It should be noted that, somewhat surprisingly, the SLI children were only able to learn the language when it was made up of speech (phonemic) stimuli; when it was made up of tones, performance for the SLI group did not reach above chance levels (Evans et al., 2009).

Regarding reading disorders such as dyslexia, the evidence on implicit learning is mixed. Studies using the visual serial reaction time task appear to show an absence of implicit learning, indicated by a failure to observe a decrease in reaction times to repeating patterns of stimuli in dyslexic participants (Menghini, Hagberg, Caltagirone, Petrosini, & Vicari, 2006; Vicari, Marotta, Menghini, Molinari, & Petrosini, 2003). However, other studies using techniques such as cued reaction time (Roodenrys & Dunn, 2008) and artificial grammar learning (Russeler, Gerth, & Munte, 2006) show unimpaired learning for individuals with dyslexia. Howard, Howard, Japikse, and Eden (2006) made a somewhat novel distinction: they showed that individuals diagnosed with dyslexia demonstrated normal learning on tasks involving spatial implicit learning, but showed impaired performance on tasks involving sequential implicit learning. With more research, this distinction may help resolve the previous divide in the literature on implicit learning in dyslexia.


A final population that offers an interesting test of the role of statistical learning in language is deaf children who have received a cochlear implant (CI). A CI is a medical prosthesis surgically implanted into the inner ear of a deaf child in order to provide sound by directly stimulating the auditory nerve. Although a CI provides the potential to develop age-appropriate speech and language abilities, it is well known that some children obtain little language benefit other than the awareness of sound from their implant (American Speech-Language-Hearing Association, 2004). Some of this variation in outcome has been shown to be due to certain demographic factors, such as age at implantation and length of deafness (Kirk et al., 2002; Tomblin, Barker, & Hubbs, 2007). However, these demographic variables leave a large amount of variance unexplained. It is likely that intrinsic cognitive factors, especially fundamental learning and memory abilities, contribute to language outcomes following implantation (Pisoni, 2000). Disturbances in statistical learning specifically may hold the key to understanding the enormous range of variation in language development in this population.

Deaf children with CIs also provide a unique opportunity to study neurocognitive plasticity and neural reorganization following the introduction of sound and spoken language after a period of auditory deprivation. Whereas most previous work with this clinical population has investigated the development of auditory perception, speech perception, and spoken language development following cochlear implantation, relatively few studies have examined more global learning and cognitive capabilities.

Recently we assessed visual sequential statistical learning abilities in a group of deaf children with CIs (Conway et al., 2011). Our aims were twofold: to assess the effects that a period of auditory deprivation and language delay may have on visual statistical learning skills; and to investigate the role that statistical learning plays in language outcomes following cochlear implantation. Our hypothesis was that deaf children with CIs would show disturbances in visual sequential statistical learning as a result of their relative lack of experience with sequential (auditory) patterns early on in development. Furthermore, we expected that statistical learning performance would be associated with measures of language development, with better statistical learners showing the best language outcomes post-implantation.

A group of deaf children with CIs engaged in a visual sequential learning task similar to the sequence reproduction task described earlier with adult participants. The results revealed that the CI children on average showed no learning (Figure 3, right), and were significantly worse than an age-matched group of hearing children (Figure 3, left).


Figure 3. Average statistical learning scores for CI children (right) and NH children (left). Data adapted from Conway et al. (2011). Error bars represent ±1 standard error.

Furthermore, performance on the statistical learning task was found to be significantly correlated with a standardized measure of language outcome, the Clinical Evaluation of Language Fundamentals, 4th Ed. (CELF-4; Semel, Wiig, & Secord, 2003), which has a particular emphasis on syntax-related language functions. That is, those children who were the best learners on the visual sequential statistical learning task showed the best language and syntax abilities as measured by the CELF-4. For the most part, these correlations remained significant even after controlling for the shared variance associated with duration of implant use (and age at which the device was implanted), forward and backward digit span, and vocabulary scores. In addition, performance on the statistical learning task was
associated with how well the children could use sentence context to perceive spoken words (Pisoni, Conway, Kronenberger, Henning, & Anaya, 2010), a finding that is consistent with the adult data presented earlier.

Why did these children show a disturbance to non-linguistic visual statistical learning skills? There is some indication that a period of auditory deprivation occurring early in development may have secondary cognitive and neural ramifications in addition to the obvious hearing-related effects (Conway, Pisoni, & Kronenberger, 2009). Specifically, because sound is a temporally-arrayed signal, a lack of experience with sound may affect how well one is able to encode, process, and learn serial patterns (Marschark, 2006; Rileigh & Odom, 1972; Todman & Seedhouse, 1994). Exposure to sound may provide a kind of "auditory scaffolding" in which a child gains vital experience and practice with learning and encoding sequential patterns in the environment (Conway et al., 2009). We suggest that a lack of experience with sound may delay or alter the development of domain-general processing skills – including statistical learning – that rely on the encoding and learning of temporal or sequential patterns, even for non-auditory inputs. Poor sequential learning skills therefore might help explain why this particular population may have difficulty learning spoken language even after hearing is restored through a cochlear implant.

In sum, across a variety of populations having a language or communication disorder, we find that domain-general learning abilities are associated with the impairment, and therefore may provide a key for successful intervention and treatment through novel focused training techniques.

3. Study 1: Computerized training in healthy adults

The relationship between statistical learning and language in both healthy individuals and those with language disorders makes it important to ask whether it is possible to improve these domain-general learning abilities. A number of studies have demonstrated the efficacy of using different kinds of cognitive training paradigms to improve aspects of perception, attention, and cognition (Dye, Green, & Bavelier, 2009; Klingberg, 2010; Rueda et al., 2005; Shalev, Tsal, & Mevorach, 2007; Tallal & Gaab, 2006). To our knowledge, there have been no published attempts to improve statistical learning or any other non-declarative learning ability. However, one cognitive domain that has received much interest in the cognitive training literature is WM. While the training tasks and populations have varied, there is a growing body of evidence suggesting that computerized
training tasks can improve WM capacity and, importantly, result in transfer to non-trained tasks of spatial and verbal WM, attention, and other cognitive functions (Curtis & D'Esposito, 2003; Olesen, Westerberg, & Klingberg, 2004; Holmes, Gathercole, & Dunning, 2009; Thorell, Lindqvist, Nutley, Bohlin, & Klingberg, 2009; Westerberg, Jacobaeus, Hirvikoski, Clevberger, Ostensson, Bartfai, & Klingberg, 2007; Klingberg, Fernell, Olesen, Johnson, Gustafsson, Dahlstrom, Gillberg, Forssberg, & Westerberg, 2005; Verhaeghen et al., 2004). The findings from these studies suggest that adaptive training on a visuospatial WM task appears to generalize in a domain-general manner to non-trained tasks of WM and other cognitive functions. For example, visuospatial WM training generalizes to inhibition (Klingberg et al., 2002; Klingberg et al., 2005; Olesen et al., 2004), attention (Westerberg et al., 2007), and verbal WM (Holmes et al., 2009; Thorell et al., 2009). Increased activity in the prefrontal cortex indicates that WM training has a direct impact on neural circuits of this brain region (Curtis & D'Esposito, 2003; Olesen et al., 2004), implying that training tasks directly alter the functioning of domain-general executive control mechanisms (Smith & Jonides, 1999), rather than merely improving the efficiency of modality-specific "slave" systems. It has also been proposed that the striatum, a brain area recognized for its role in implicit learning (Seger, 2006), plays an important role in mediating transfer effects to non-trained tasks of WM (Dahlin, Bäckman, Neely, & Nyberg, 2009; Dahlin, Neely, Larsson, Bäckman, & Nyberg, 2008).

These studies demonstrate the utility of improving cognitive function through computerized training techniques, leaving open the possibility that, like WM, statistical learning might also be amenable to training. As Klingberg (2010) rightfully pointed out, the synaptic mechanisms underlying WM capacity are governed by the same principles of neural plasticity as the rest of the brain. Thus, we might expect that statistical learning can also be enhanced using a similar approach. Whereas the standard short-term memory or WM task involves recalling a set of stimuli that have no intrinsic relation to each other, such as a series of random digits, most of our experiences in the world, such as events and scenes we encounter and interact with, have an underlying structure to them. How we acquire knowledge about these underlying regularities and statistical dependencies is arguably as important as how well we remember random, unstructured stimuli, if not more so. Enhancing statistical learning thus could have important and far-reaching ramifications, especially for populations having language delays or communication disorders.


Figure 4. Computerized training task used in Study 1 and Study 2. Participants view a 4 × 4 matrix of circles. A sequence of circles lights up, one at a time (the white circle depicted in each of the three scenes A, B, and C). Participants must reproduce each sequence in its entirety. Unknown to the participants, the presented sequences are not random; each circle can be followed by only 1 of 3 possible circles (shaded light grey). Note that in the actual task, all circles are colored the same, except for the one that is currently illuminated.

We recently created a novel computerized visual training task, and piloted it first with healthy adults (Bauernschmidt, Conway, & Pisoni, 2009; Conway, Bauernschmidt, & Pisoni, in preparation). The training task is a visual-spatial training procedure that is conceptually similar to other training tasks designed to improve WM abilities in adults and children (e.g., Holmes et al., 2009; Thorell et al., 2008). However, rather than using random sequences, we adaptively trained participants on nonrandom sequential patterns that share an underlying structure that can be implicitly learned. Thus, the novel facet of our task is that it adaptively trains participants to encode and reproduce sequences of visual stimuli conforming to underlying statistical regularities (see Figure 4).

In the training task, participants view a sequence of colored lights, occurring one at a time (white circles in panels A, B, and C) and then are required to reproduce what they saw by pressing the circles in correct order on the touch-sensitive monitor. Unbeknownst to the participants, each circle that lights up next in a sequence is not determined randomly but rather conforms to certain underlying statistical regularities. Specifically, any given circle has only three others that can legally follow it (shaded light grey). (Note that for the actual task, all circles are colored the same, except for the one that is currently lit.) As participants begin to implicitly uncover these regularities specifying which circles can occur
next, their performance on the recall task will improve, a sign of statistical learning occurring.

Table 1. Study 1 Training Schedule

Day 1 (Pre-Training):
• Stroop Color and Word Test
• Forward Digit Span
• Backward Digit Span
• Raven's Standard Progressive Matrices
• Visual Sequential Learning

Days 2–5 (Training):
Group 1: Adaptive Training, with Statistically Constrained Sequences
Group 2: Adaptive Training, with Pseudo-random Sequences
Group 3: Non-Adaptive Control, using Pseudo-random Sequences

Day 6 (Post-Training):
• Stroop Color and Word Test
• Forward Digit Span
• Backward Digit Span
• Raven's Standard Progressive Matrices
• Visual Sequential Learning

As with the WM training studies, an important aspect of this training task is that the lengths of the sequences presented to each participant are adaptively based on their performance level. Sequence lengths were based on a two-up, two-down metric. For example, if a subject starts at sequence length four and correctly reproduces all four items in that sequence, then their next trial will also be a sequence of length four. If the subject correctly reproduces all elements in the second sequence of length four, then they will move up to a sequence of length five in the next trial. If they incorrectly reproduce this sequence of length five, then their next trial will still be at sequence length five; if they respond incorrectly to this sequence as well, then their next sequence will be moved down to length four. And so on. Importantly, on each trial a new sequence is presented (at the individual's current length). The new sequence is randomly determined, but conforms to the underlying regularities as specified earlier.

Participants engaged in this visual-spatial sequence training task for four days (days 2–5, see Table 1), with each training session lasting no longer than 45 minutes. Crucially, the "grammar" or statistical patterns that dictate what circles/locations can legally occur next were re-randomized for each participant on each subsequent training day. Because each of the four days of training incorporated a new set of statistical regularities, our intention was that participants would gradually improve their abilities to learn a variety of statistical patterns and not any one specific set of regularities. This was done to encourage generalization by improving participants' abilities to "learn to learn".
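The following sketch illustrates the core of the adaptive procedure just described: a freshly randomized transition "grammar" in which each circle has three legal successors, sequence generation that follows it, and the two-up, two-down adjustment of sequence length. Details such as the starting length, the lower bound on length, and whether a circle may follow itself are assumptions made for illustration and are not specified in the chapter.

```python
import random

LOCATIONS = list(range(16))   # the 4 x 4 grid of circles in Figure 4

def make_grammar(seed=None):
    """Draw a new set of regularities: each location gets exactly three
    legal successors (self-transitions excluded here by assumption).
    Calling this at the start of a session mirrors the re-randomization
    of the statistical patterns on each training day."""
    rng = random.Random(seed)
    return {loc: rng.sample([x for x in LOCATIONS if x != loc], 3)
            for loc in LOCATIONS}

def make_sequence(grammar, length, rng=random):
    """Generate one statistically constrained sequence of the given length."""
    seq = [rng.choice(LOCATIONS)]
    while len(seq) < length:
        seq.append(rng.choice(grammar[seq[-1]]))
    return seq

def run_session(grammar, n_trials, reproduce, start_length=4, min_length=2):
    """Adaptive two-up, two-down loop. `reproduce(seq)` is a stand-in for
    the participant's touch-screen response and should return True when
    the sequence is reproduced exactly."""
    length, correct_run, error_run = start_length, 0, 0
    for _ in range(n_trials):
        seq = make_sequence(grammar, length)
        if reproduce(seq):
            correct_run, error_run = correct_run + 1, 0
        else:
            error_run, correct_run = error_run + 1, 0
        if correct_run == 2:                 # two correct in a row: length goes up
            length, correct_run = length + 1, 0
        elif error_run == 2:                 # two errors in a row: length goes down
            length, error_run = max(min_length, length - 1), 0
    return length

# Example: a simulated participant who can reproduce sequences of up to five items.
final_length = run_session(make_grammar(seed=1), n_trials=20,
                           reproduce=lambda s: len(s) <= 5)
print(final_length)
```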


By adaptively training participants on this task, we expected to enhance their ability to learn statistical patterns of any type. The key test, of course, is whether any such improvements to learning result in transfer effects, that is, improvements to non-trained tasks. To test generalization and transfer, all participants were given a set of pre-training measures on Day 1 (see Table 1) that included the sequential statistical learning task used in Conway et al. (2010) in addition to other tests of verbal short-term memory and WM (measured with the Forward and Backward digit span tasks from the WISC-III, Wechsler, 1991), executive control and inhibition (measured by the Stroop Color and Word test, Golden & Freshwater, 2002), and nonverbal reasoning (measured by the Raven's Standard Progressive Matrices, Raven, Raven, & Court, 2000). These same measures were given once training was complete on Day 6 in order to ascertain improvements to these non-trained tasks.

Finally, in order to ensure that any observed gains on non-trained tasks were not merely a result of a test-retest effect, participants were randomly assigned to one of three different training conditions. Group 1 engaged in an adaptive, statistically-constrained version of the training task already described above. The Group 2 training task was identical to Group 1's except that the sequences were pseudo-random rather than conforming to statistical regularities. The pseudo-random sequences were generated so that each element (circle) in the sequence could be followed by any other in the set with equal likelihood. Like Group 1, the Group 2 task also was adaptive; thus, Group 2 was very similar to previous WM training tasks. Finally, Group 3 served as a non-adaptive control using pseudo-random sequences. Participants in this group received visual sequences whose lengths varied randomly from trial to trial, not based on their performance as was the case for Groups 1 and 2. In sum, any training effects observed in Group 1 but not in Group 2 can be safely regarded as being due to the effect of including statistically-constrained sequences in the adaptive sequence task. Any training effects observed in Group 2 compared to Group 3 can be regarded as being due to the effect of using an adaptive (versus a non-adaptive) training paradigm.

Initial results are presented below for 56 adult participants (ages 18–30), with 20 participants in each of Groups 1 and 3, and 16 participants in Group 2 (Conway, Bauernschmidt, & Pisoni, in preparation). For each of the non-trained measures, a separate repeated measures ANOVA was run, with the within-subjects factor being the pre- vs. post-training dependent measure of interest, and the between-subjects factor being training
group (Groups 1–3). For all analyses, chronological age was used as a covariate. We report results from three of the non-trained measures below.

Figure 5. Effects of computerized training with healthy adults on Forward digit span scores (white bars: pre-training digit span scores; shaded bars: post-training digit span scores). Group 1 received adaptive training with statistically-constrained, structured sequences; Group 2 received adaptive training with pseudo-random sequences; and Group 3 served as a control group, receiving non-adaptive training with pseudo-random sequences. Error bars represent ±1 standard error.

Figure 5 shows the pre- and post-training scores for each group on the Forward digit span task. In this task, which serves as a measure of verbal short-term memory capacity, subjects were presented with lists of spoken digits at progressive lengths and asked to repeat the sequence aloud. There was a marginally significant interaction of training group × pre- vs. post-training
scores: F(2,45) = 2.62, p = .084. Paired t-tests were used to compare pre-training to post-training performance for each of the three conditions, to determine for which training groups Forward digit spans improved (or worsened) following training. As shown in Figure 5, only Group 1 (t(14) = 1.841, p = .087) and Group 2 (t(14) = 2.03, p = .062) showed signs of improvement following training. Thus, adaptive training of visuospatial sequences showed signs of improving verbal auditory short-term memory capacity, regardless of whether the visuospatial sequences were statistically-constrained or pseudo-random. This transfer from a visuospatial to a verbal memory task is consistent with previous research showing training-related transfer across modalities (Thorell et al., 2008).

Next, Figure 6 shows pre- and post-training scores for each group on the Stroop Color and Word test. In this version of the classic task, participants were asked to read three pages of words, colors, and color-words aloud. The Word page consisted of the words "red", "green", and "blue" arranged randomly and printed in black ink. The Color page consisted of 100 items written as XXXX, printed in either red, green, or blue ink. The Color-Word page consisted of the words from the Word page printed in the colors from the Color page. Participants were instructed to read the color of the print, not the word itself. Of course, for the Color-Word page, the words and colors do not always match, and as such, this requires inhibiting the natural and automatic response of reading the word. There was a marginally significant interaction of training group × pre- vs. post-scores: F(2,51) = 2.78, p = .07. Similar to the Forward digit span results, performance on the Stroop task improved following training only for Group 1 (t(18) = 3.04, p < .01) and Group 2 (t(15) = 1.86, p = .083). The control Group 3 showed no signs of change. These results suggest that adaptive training of visuospatial sequences (statistically-constrained or pseudo-random) can improve executive control and inhibition abilities, also consistent with previous research (e.g., Klingberg et al., 2005).

Finally, and of most relevance at present, Figure 7 shows the pre- and post-training scores on a non-trained task of implicit statistical learning, the sequence learning task described earlier and used in several published studies (Conway et al., 2007; Conway et al., 2010). In this task, participants saw a sequence of four colored squares light up on the screen and were asked to reproduce the sequence that they had just seen by pressing the colors on a touch-screen monitor in correct order. Unbeknownst to the participants, the sequences were generated by an artificial grammar that provides statistical constraints on which color can occur next. Learning was assessed by computing a difference score for performance on
statistically-constrained sequences minus performance on the sequences not conforming to the grammar.

Figure 6. Effects of computerized training with healthy adults on Stroop interference scaled scores (white bars: pre-training Stroop scores; shaded bars: post-training Stroop scores). Group 1 received adaptive training with statistically-constrained, structured sequences; Group 2 received adaptive training with pseudo-random sequences; and Group 3 served as a control group, receiving non-adaptive training with pseudo-random sequences. Error bars represent ±1 standard error.

There was a significant interaction of training group × pre- vs. post-scores: F(2,50) = .71, p < .05. Interestingly, for this non-trained sequential learning task, only Group 1 showed any hint of improvement (t(19) = 1.37, p = .18). Group 2 actually showed worse performance following training (t(14) = 3.31, p < .01); Group 3 showed no effect of training
either way. Thus, on this statistical learning measure we see a differential effect of using statistically-constrained versus pseudo-random sequences, with only the former showing any signs of improving statistical learning abilities on non-trained tasks.

Figure 7. Effects of computerized training with healthy adults on a non-trained sequential learning task (white bars: pre-training sequential learning scores; shaded bars: post-training sequential learning scores). Group 1 received adaptive training with statistically-constrained, structured sequences; Group 2 received adaptive training with pseudo-random sequences; and Group 3 served as a control group, receiving non-adaptive training with pseudo-random sequences. Error bars represent ±1 standard error.

In sum, these results suggest the following. First, only the adaptive training conditions (Groups 1 and 2) showed transfer effects to Forward
digit spans and Stroop inhibition scores. This result is consistent with other recent findings demonstrating the utility of using computerized adaptive training to improve aspects of WM and executive function (Klingberg, 2010). Second, only the group that was specifically trained on sequential patterns with statistical structure (Group 1) showed any sign of improving on the non-trained sequential statistical learning task. In fact, the training condition that incorporated random sequential patterns actually led to significantly worse performance on the statistical learning task following training.

Although preliminary, these results suggest that training participants to interact with random patterns actually hampers their ability to learn structured patterns following training. On the other hand, training participants to interact with structured patterns not only leads to marginally better abilities to learn structured patterns following training, but also improves other WM and executive functions. Thus, incorporating statistically-structured patterns into a WM training task appears to provide just as much benefit as using unstructured random patterns and may actually show some carryover and transfer to other tasks requiring statistical learning. These findings provide initial evidence for the feasibility of improving domain-general learning abilities in populations that have a language delay, an endeavor we turn to next.
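As a minimal sketch of the pre- versus post-training comparisons reported for the Study 1 measures above, a paired t-test can be run separately for each training group. The scores below are invented solely to show the shape of the analysis; the actual data are in Conway, Bauernschmidt, and Pisoni (in preparation).

```python
from scipy.stats import ttest_rel

# Hypothetical pre- and post-training scores for one outcome measure in one group.
pre = [5, 6, 5, 7, 6, 5, 6, 7, 5, 6]
post = [6, 7, 6, 7, 7, 6, 6, 8, 6, 7]

t, p = ttest_rel(post, pre)
print(f"t({len(pre) - 1}) = {t:.2f}, p = {p:.3f}")
```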

4. Study 2: Computerized training with deaf or hard of hearing children

As previously mentioned, there is now some evidence linking poor statistical learning abilities to impaired language function. Based on the findings in the previous section suggesting the possibility of improving statistical learning, we recently used our computerized training task with a group of children who are deaf or hard of hearing (d/hh) and who exhibit language delays, to determine whether enhancing domain-general learning abilities can lead to improvements to overall language functioning. On a related note, Kronenberger et al. (2011) recently established the efficacy of using computerized WM training tasks to improve verbal and nonverbal WM in deaf children with cochlear implants, with some effects lasting up to 6 months.

In the present study, which is still ongoing, 23 children who are d/hh (ages 5:10 to 11:4; mean 8:2) took part in 10 days of training utilizing the computerized training task previously described. Among this group, 10 had bilateral cochlear implants, 8 were fitted with one implant and one
hearing aid, and the remaining 5 children wore hearing aids in both ears. The children were assigned to one of two training conditions matched for chronological age. As with the adult study, sequences in the adaptive condition conformed to underlying statistical regularities, beginning at a length of three and increasing or decreasing in length based upon the two-up, two-down metric. The second condition was an active control group in which the sequence presentation was non-adaptive and pseudo-random in nature, with a constant sequence length of three. Pre- and post-training measures were selected to assess visual pattern memory, attention/inhibition, verbal WM, and visual sequential learning.

The children showed significant improvement on a number of non-trained tasks following training. Here, we focus on two of the measures, verbal WM and visual sequential learning. For the verbal WM task, a subset of 20 nonwords from the Children's Test of Nonword Repetition (Gathercole & Baddeley, 1996) was presented to participants via a loudspeaker at a level of 70–75 dB SPL. Participants were asked to repeat what they heard; responses were recorded and then scored for overall word accuracy and for syllable accuracy, that is, whether the response contained the same number of syllables as the stimulus presented. As shown in Figure 8, only children in the adaptive training condition showed a significant reduction in the number of syllable errors from the pre- to post-training session, F(1,11) = 10.170, p = .009, and this improvement was also sustained at a second post-training session measured 4–6 weeks later, F(1,11) = 7.301, p = .021.

This differential effect of sequence training, with only the adaptive group showing improvement, is also evident with a non-trained measure of visual sequential learning, as assessed by a version of the learning task described earlier. Figure 9 shows performance on this task with statistically-constrained sequences assessed before training and after the second post-test. Only participants in the adaptive condition showed significant improvement from pre-training to the second post-training session on the number of correctly reproduced statistically-constrained sequences, F(1,11) = 9.308, p = .011. Although the improvement in performance on the statistically-constrained sequences may suggest an improvement of sequential learning abilities, performance also improved on a set of pseudo-random sequences on this same task, raising the possibility that learning itself was not improved, but merely visual serial recall or sequential memory. We are currently exploring the viability of these alternative explanations.
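The syllable-accuracy scoring used for the nonword repetition task above can be illustrated with the sketch below. Counting runs of vowel letters is a rough stand-in for the phonetic syllable counts a trained scorer would make, and the example items are purely illustrative.

```python
VOWELS = "aeiou"

def syllable_count(word):
    """Crude syllable estimate: count runs of vowel letters."""
    count, in_vowel = 0, False
    for ch in word.lower():
        if ch in VOWELS and not in_vowel:
            count += 1
        in_vowel = ch in VOWELS
    return count

def syllable_accurate(target, response):
    """Syllable accuracy as described in the text: does the response
    contain the same number of syllables as the target nonword?"""
    return syllable_count(target) == syllable_count(response)

print(syllable_accurate("ballop", "ballot"))   # True: both two syllables
print(syllable_accurate("ballop", "lop"))      # False: two vs. one
```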


Figure 8. Effects of computerized training with deaf or hard of hearing children on nonword repetition errors (white bars: pre-training performance; shaded bars: post-training performance). Group 1 received adaptive training with statistically-constrained, structured sequences; Group 2 served as a control group, receiving non-adaptive training with pseudo-random sequences. Error bars represent ±1 standard error.

In summary, these d/hh children showed significant improvement in the nonword repetition and non-trained visual sequential learning tasks. These findings hold promising implications for the improvement of language skills for children who are deaf or hard of hearing, and possibly other clinical populations as well. The nonword repetition task utilizes the same processes necessary for learning new words, namely, multiple auditory, cognitive, linguistic, and articulatory speech-motor processes, without the use of previous knowledge or visual cues. We would therefore expect

that improvement on this task would carry over to gains in receptive and expressive vocabulary and other language functions. So too, if statistical learning skills were improved, as the current results suggest, then fundamental learning abilities important for language acquisition can be enhanced, and such gains are likely to lead to more generalized improvements to language, perhaps specifically for syntax-related aspects of language processing, a possibility that we will be exploring in the near future.

Figure 9. Effects of computerized training with deaf or hard of hearing children on a non-trained sequential learning task (white bars: pre-training performance; shaded bars: post-training performance). Group 1 received adaptive training with statistically-constrained, structured sequences; Group 2 served as a control group, receiving non-adaptive training with pseudo-random sequences. Error bars represent ±1 standard error.


5. General discussion

Due to the increasing body of evidence suggesting that domain-general statistical learning abilities are used in the service of language acquisition, and given recent work showing the utility of computerized adaptive training techniques, we believe it is important to attempt to improve statistical learning in order to treat language and communication disorders. The computerized training task that we have developed was based conceptually on recent WM training task designs. Our training task is relatively easy to implement and short in duration (45 minutes per day over 4–10 days), and it crucially incorporates underlying statistical regularities into the patterns. The results with adults show that training resulted in gains in verbal short-term memory, executive control, and a non-trained task of sequential statistical learning. The findings from children who are deaf or hard of hearing also showed improvements to verbal short-term memory and sequential learning following training. Although the findings are preliminary, they suggest that adaptive training of visuospatial statistically-constrained patterns can enhance broad domain-general skills of WM, inhibitory control, and statistical learning. This training task thus shows promise as a novel intervention for treating various disorders of language and learning.

If this training task does in fact improve performance on these three types of tasks (verbal short-term memory, inhibition, and statistical learning), it becomes important to ask what specific neurocognitive mechanism(s) were enhanced that led to these task improvements. Is there a common underlying function or set of functions shared by all three tasks? Fuster (2001) has argued that the prefrontal cortex (PFC) is critically involved in the temporal organization of behavior, including representing, formulating, and planning sequences of thought and action. For any complex sequential skill or behavior, the PFC is thought to be intimately involved because it allows for the integration of sensory cues with cognitive actions across time. Under this view, the PFC is important for any kind of sequencing or temporal function (Conway & Pisoni, 2008), including higher-level planning, executive memory, language processing, and sequential learning. The PFC has many interconnections with other sensory, motor, and subcortical regions, making it an ideal candidate for domain-general aspects of cognitive sequencing function (Miller & Cohen, 2001).

As other research has shown, adaptive WM training tasks appear to result in enhancements to the neural functioning of the prefrontal cortex (Curtis & D'Esposito, 2003; Olesen et al., 2004). For instance, Olesen et al.
(2004) had subjects practice three visuospatial memory tasks for a period of five weeks. Use of functional magnetic resonance imaging (fMRI) before, during, and after training showed increased activity in the prefrontal and parietal cortices. Similarly, Curtis and D'Esposito (2003) reported sustained prefrontal cortex activity during delay periods preceding the response portion of a visual WM task. The former study included a battery of neuropsychological tests as part of the pre- and post-training evaluation. Subjects showed significant improvement in performance on the Span board task and the Digit span task and in time on the Stroop test, illustrating, similarly to our results, transfer effects to non-trained tasks of WM and inhibition. These neuroimaging studies suggest that increases in cortical prefrontal activity during or following WM training are a sign of training-related plasticity in the neural systems supporting WM and other executive functions (Olesen et al., 2004).

Given the evidence of prefrontal activity and its relation to executive function, Funahashi (2001) proposed that the prefrontal cortex is responsible not only for storing and processing information, but also for assessing the input and providing information to neuronal systems to direct the processing of information in these systems. The processes of perception, motor control, and memory must be coordinated to accomplish the tasks of anticipating, planning, monitoring, and making a decision (Funahashi, 2001). The current evidence suggests that improvement on a visual-spatial sequence training task affects neural functioning of the prefrontal cortex and thus, perhaps by extension, executive and cognitive functions more generally, which may include statistical and sequential learning. The involvement of the prefrontal cortex in executive processes (Smith & Jonides, 1999; Funahashi, 2001) and evidence of increased prefrontal activity during spatial memory tasks (Curtis & D'Esposito, 2003; Olesen et al., 2004) thus lend support to the proposal that training on a visual-spatial task may carry over to other tasks involving different skills, including those requiring verbal memory or executive processing.

Although at present statistical learning is generally not considered to be an aspect of executive function, and if anything might rightfully be thought of as part of the nondeclarative/procedural learning system, there are reasons to believe that a connection may exist between some types of statistical learning and prefrontal cortical function. First, there is increasing neural evidence suggesting that the prefrontal cortex is involved during sequential learning and artificial grammar learning tasks (e.g.,
Fletcher et al., 1999; Forkstam et al., 2006; Petersson et al., 2004). Because our training task incorporates statistical patterns distributed across time – i.e., visual sequences – it is likely that the prefrontal cortex plays an important role in encoding these statistical regularities. Second, our training task likely promotes not merely statistical learning, but also cognitive control, attention, and inhibition. This is because at the beginning of every new training session the "rules" or statistical regularities change, and so participants must over-ride the regularities that had previously been acquired. In this way, successful performance on this task requires participants not only to focus on the current input sequence, but also to switch attention and inhibit prior learning in order to learn the new patterns. For these reasons, this training task may actually improve several overlapping elementary abilities (sequential learning, inhibition, attention, and serial recall) that are all mediated by the prefrontal cortex. Although we have no neural evidence yet, the behavioral evidence is consistent with this claim, with both verbal short-term memory and inhibitory control showing task gains following adaptive sequence training.

Of course, the ultimate objective remains to use these training tasks to improve learning and language abilities as a treatment for populations with language disorders. Notably, all three of the tasks that showed gains following training have been implicated as being important for language acquisition and processing: verbal short-term memory (Gathercole, Willis, & Baddeley, 1994), cognitive control (Deák, 2003), and of course, statistical and sequential learning (Conway et al., 2010). Clearly, the next step is to ascertain to what extent adaptive computerized training tasks such as this one, which target statistical learning processes and other prefrontal cortex related abilities, will show robust and lasting improvements to language function. In addition to treating language disorders, it may be possible to use this approach to help improve language acquisition for individuals learning a second language.

As this edited volume aptly indicates, we are beginning to realize the importance of domain-general statistical learning abilities for language. But we ought not to stop there. As recent research has amply demonstrated, our cognitive and neural systems are far more plastic and modifiable by experience than initially believed. By capitalizing on these theoretical and empirical developments, it may be possible to improve language functions by using novel computerized training techniques that specifically target domain-general learning abilities, offering great promise for alleviating disorders of language and communication.


Acknowledgements

Portions of this work were supported by the following grants from the National Institutes of Health: R03DC009485, T32DC00012, R01DC00111, R01DC000064, and R01DC009581.

References

American Speech-Language-Hearing Association 2004. Cochlear implants [Technical report]. Available from www.asha.org/policy.
Bar, M. 2007. The proactive brain: Using analogies and associations to generate predictions. Trends in Cognitive Sciences, 11(7), 280–289.
Bauernschmidt, A., Conway, C.M., & Pisoni, D.B. 2009. Working memory training and implicit learning. In Research on Spoken Language Processing Progress Report No. 29. Bloomington, IN: Speech Research Laboratory, Indiana University.
Bilger, R.C., & Rabinowitz, W.M. 1979. Relationships between high- and low-probability SPIN scores. The Journal of the Acoustical Society of America, 65(S1), S99.
Botvinick, M.M. 2005. Effects of domain-specific knowledge on memory for serial order. Cognition, 97, 135–151.
Conway, C.M. 2012. Sequential learning. In R.M. Seel (Ed.), Encyclopedia of the Sciences of Learning (pp. 3047–3050). New York, NY: Springer Publications.
Conway, C.M., Bauernschmidt, A., Huang, S.S., & Pisoni, D.B. 2010. Implicit statistical learning in language processing: Word predictability is the key. Cognition, 114, 356–371.
Conway, C.M., Karpicke, J., & Pisoni, D.B. 2007. Contribution of implicit sequence learning to spoken language processing: Some preliminary findings with hearing adults. Journal of Deaf Studies and Deaf Education, 12, 317–334.
Conway, C.M. & Pisoni, D.B. 2008. Neurocognitive basis of implicit learning of sequential structure and its relation to language processing. Annals of the New York Academy of Sciences, 1145, 113–131.
Conway, C.M., Pisoni, D.B., Anaya, E.M., Karpicke, J. & Henning, S.C. 2011. Implicit sequence learning in deaf children with cochlear implants. Developmental Science, 14, 69–82.


Conway, C.M., Pisoni, D.B., & Kronenberger, W.G. 2009. The importance of sound for cognitive sequencing abilities. Current Directions in Psychological Science, 18, 275–279.
Curtis, C.E. & D'Esposito, M. 2003. Persistent activity in the prefrontal cortex during working memory. Trends in Cognitive Sciences, 7, 415–423.
Dahlin, E., Bäckman, L., Neely, A.S., & Nyberg, L. 2009. Training of the executive components of working memory: Subcortical areas mediate transfer effects. Restorative Neurology and Neuroscience, 27, 405–419.
Dahlin, E., Neely, A.S., Larsson, A., Bäckman, L., & Nyberg, L. 2008. Transfer of learning after updating training mediated by the striatum. Science, 320, 1510–1512.
Deák, G.O. 2003. The development of cognitive flexibility and language abilities. Advances in Child Development and Behavior, 31, 271–327.
Dye, M.W.G., Green, C.S. & Bavelier, D. 2009. Increasing speed of processing with action video games. Current Directions in Psychological Science, 18, 321–326.
Elliott, L.L. 1995. Verbal auditory closure and the speech perception in noise (SPIN) test. Journal of Speech, Language, and Hearing Research, 38(6), 1363–1376.
Evans, J.L., Saffran, J.R., & Robe-Torres, K. 2009. Statistical learning in children with specific language impairment. Journal of Speech, Language, and Hearing Research, 52, 321–335.
Fletcher, P., Buchel, C., Josephs, O., Friston, K. & Dolan, R. 1999. Learning related neuronal responses in prefrontal cortex studied with functional neuroimaging. Cerebral Cortex, 9, 168–178.
Forkstam, C., Hagoort, P., Fernandez, G., Ingvar, M. & Petersson, K.M. 2006. Neural correlates of artificial syntactic structure classification. NeuroImage, 32, 956–967.
Funahashi, S. 2001. Neuronal mechanisms of executive control by the prefrontal cortex. Neuroscience Research, 39, 147–165.
Fuster, J. 2001. The prefrontal cortex – An update: Time is of the essence. Neuron, 30, 319–333.
Gathercole, S.E. & Baddeley, A.D. 1994. The Children's Test of Nonword Repetition. London: Psychological Corporation Europe.
Gathercole, S.E., Willis, C.S., & Baddeley, A.D. 1994. The Children's Test of Nonword Repetition: A test of phonological working memory. Memory, 2, 103–127.


Gervain, J., & Mehler, J. 2010. Speech perception and language acquisition in the first year of life. Annual Review of Psychology, 61, 191–218.
Gogate, L.J., & Hollich, G. 2010. Invariance detection within an interactive system: A perceptual gateway to language development. Psychological Review, 117, 496–516.
Golden, C., & Freshwater, S. 2002. The Stroop Color and Word Test. Wood Dale, IL: Stoelting Co.
Gomez, R.L., & Gerken, L. 1999. Artificial grammar learning by 1-year-olds leads to specific and abstract knowledge. Cognition, 70, 109–135.
Graf Estes, K., Evans, J.L., Alibali, M.W. & Saffran, J.R. 2007. Can infants map meaning to newly segmented words? Psychological Science, 18, 254–260.
Gupta, P., & Dell, G.S. 1999. The emergence of language from serial order and procedural memory. In B. MacWhinney (Ed.), The emergence of language (pp. 447–481). Hillsdale, NJ: Lawrence Erlbaum Associates.
Hebb, D.O. 1961. Distinctive features of learning in the higher animal. In J.F. Delafresnaye (Ed.), Brain mechanisms and learning (pp. 37–51). Oxford: Blackwell Scientific Publications.
Holmes, J., Gathercole, S. & Dunning, D. 2009. Adaptive training leads to sustained enhancement of poor working memory in children. Developmental Science, F1–F7.
Howard Jr., J.H., Howard, D.V., Japikse, K.C., & Eden, G.F. 2006. Dyslexics are impaired on implicit higher-order sequence learning, but not on implicit spatial context learning. Neuropsychologia, 44, 1131–1144.
Jamieson, R.K., & Mewhort, D.J.K. 2005. The influence of grammatical, local, and organizational redundancy on implicit learning: An analysis using information theory. Journal of Experimental Psychology: Learning, Memory, & Cognition, 31(1), 9–23.
Kalikow, D.N., Stevens, K.N., & Elliott, L.L. 1977. Development of a test of speech intelligibility in noise using materials with controlled word predictability. Journal of the Acoustical Society of America, 61(5), 1337–1351.
Karpicke, J.D., & Pisoni, D.B. 2004. Using immediate memory span to measure implicit learning. Memory & Cognition, 32(6), 956–964.
Kirk, K.I., Miyamoto, R.T., Lento, C.L., Ying, E., O'Neil, T., & Fears, B. 2002. Effects of age at implantation in young children. Annals of Otology, Rhinology, & Laryngology, 189, 69–73.


Kleim, J.A., & Jones, T.A. 2008. Principles of experience-dependent neural plasticity: Implications for rehabilitation after brain damage. Journal of Speech, Language, and Hearing Research, 51, 225–239.
Klingberg, T. 2010. Training and plasticity of working memory. Trends in Cognitive Sciences, 317–324.
Klingberg, T., Forssberg, H. & Westerberg, H. 2002. Training of working memory in children with ADHD. Journal of Clinical and Experimental Neuropsychology, 24, 781–791.
Klingberg, T., Fernell, E., Olesen, P.J., Johnson, M., Gustafsson, P., Dahlstrom, K., Gillberg, C.G., Forssberg, H. & Westerberg, H. 2005. Computerized training of working memory in children with ADHD – A randomized, controlled trial. Journal of the American Academy of Child & Adolescent Psychiatry, 44, 177–186.
Korkman, M., Kirk, U., & Kemp, S. 2007. NEPSY-II. San Antonio, TX: Psychological Corporation.
Kronenberger, W.G., Pisoni, D.B., Henning, S.C., Colson, B.G. & Hazzard, L.M. 2011. Working memory training for children with cochlear implants: A pilot study. Journal of Speech, Language, and Hearing Research, 54, 1182–1196.
Kuhl, P.K. 2004. Early language acquisition: Cracking the speech code. Nature Reviews Neuroscience, 5, 831–843.
Marschark, M. 2006. Intellectual functioning of deaf adults and children: Answers and questions. European Journal of Cognitive Psychology, 18, 70–89.
McClelland, J.L., Mirman, D., & Holt, L.D. 2006. Are there interactive processes in speech perception? Trends in Cognitive Sciences, 10, 363–369.
Menghini, D., Hagberg, G.E., Caltagirone, C., Petrosini, L., & Vicari, S. 2006. Implicit learning deficits in dyslexic adults: An fMRI study. NeuroImage, 33, 1218–1226.
Miller, E.K. & Cohen, J.D. 2001. An integrative theory of prefrontal cortex function. Annual Review of Neuroscience, 24, 167–202.
Miller, G.A., Heise, G.A., & Lichten, W. 1951. The intelligibility of speech as a function of the context of the test materials. Journal of Experimental Psychology, 41, 329–335.
Miller, G.A., & Selfridge, J.A. 1950. Verbal context and the recall of meaningful material. American Journal of Psychology, 63, 176–185.
Mirman, D., Magnuson, J.S., Estes, K.G., & Dixon, J.A. 2008. The link between statistical segmentation and word learning in adults. Cognition, 108, 271–280.

Misyak, J.B., Christiansen, M.H., & Tomblin, J.B. 2010 Sequential expectations: The role of prediction-based learning in language. Topics in Cognitive Science, 2, 138–153.
Nicolson, R.I., & Fawcett, A.J. 2007 Procedural learning difficulties: Reuniting the developmental disorders? Trends in Neuroscience, 30(4), 135–141.
Olesen, P.J., Westerberg, H., & Klingberg, T. 2004 Increased prefrontal and parietal activity after training of working memory. Nature Neuroscience, 7, 75–79.
Perruchet, P., & Pacton, S. 2006 Implicit learning and statistical learning: One phenomenon, two approaches. Trends in Cognitive Sciences, 10(5), 233–238.
Petersson, K.M., Forkstam, C., & Ingvar, M. 2004 Artificial syntactic violations activate Broca's region. Cognitive Science, 28, 383–407.
Pisoni, D.B. 1996 Word identification in noise. Language and Cognitive Processes, 11(6), 681–687.
Pisoni, D.B. 2000 Cognitive factors and cochlear implants: Some thoughts on perception, learning, and memory in speech perception. Ear & Hearing, 21, 70–78.
Pisoni, D.B., Conway, C.M., Kronenberger, W., Henning, S., & Anaya, E. 2010 Executive function, cognitive control, and sequence learning in deaf children with cochlear implants. In M. Marschark & P. Spencer (Eds.), Oxford Handbook of Deaf Studies, Language, and Education (pp. 439–457). New York, NY: Oxford University Press.
Plante, E., Gomez, R., & Gerken, L. 2002 Sensitivity to word order cues by normal and language/learning disabled adults. Journal of Communication Disorders, 35, 453–462.
Raven, J., Raven, J.C., & Court, J. 2000 Standard Progressive Matrices. San Antonio, TX: Harcourt Assessment.
Reber, A.S. 1967 Implicit learning of artificial grammars. Journal of Verbal Learning and Verbal Behavior, 6, 855–863.
Redington, M., & Chater, N. 1997 Probabilistic and distributional approaches to language acquisition. Trends in Cognitive Sciences, 1(7), 273–281.
Redington, M., & Chater, N. 2002 Knowledge representation and transfer in artificial grammar learning (AGL). In R.M. French & A. Cleeremans (Eds.), Implicit learning and consciousness: An empirical, philosophical, and computational consensus in the making (pp. 121–143). Hove, East Sussex: Psychology Press.

Rileigh, K.K., & Odom, P.B. 1972 Perception of rhythm by subjects with normal and deficient hearing. Developmental Psychology, 7, 54–61.
Roodenrys, S., & Dunn, N. 2008 Unimpaired implicit learning in children with developmental dyslexia. Dyslexia, 14, 1–15.
Rubenstein, H. 1973 Language and probability. In G.A. Miller (Ed.), Communication, language, and meaning: Psychological perspectives (pp. 185–195). New York: Basic Books.
Rueda, M.R., Rothbart, M.K., McCandliss, B.D., Saccomanno, L., & Posner, M.I. 2005 Training, maturation, and genetic influences on the development of executive attention. Proceedings of the National Academy of Sciences, 102, 14931–14936.
Rüsseler, J., Gerth, I., & Münte, T.F. 2006 Implicit learning is intact in adult developmental dyslexic readers: Evidence from serial reaction time task and artificial grammar learning. Journal of Clinical and Experimental Neuropsychology, 28, 808–827.
Saffran, J.R. 2003 Statistical language learning: Mechanisms and constraints. Current Directions in Psychological Science, 12(4), 110–114.
Saffran, J.R., Aslin, R.N., & Newport, E.L. 1996 Statistical learning by 8-month-old infants. Science, 274, 1926–1928.
Seger, C.A. 2006 The basal ganglia in human learning. The Neuroscientist, 12, 285–290.
Semel, E., Wiig, E.H., & Secord, W.A. 2003 Clinical evaluation of language fundamentals, fourth edition (CELF-4). Toronto, Canada: The Psychological Corporation/A Harcourt Assessment Company.
Shalev, L., Tsal, Y., & Mevorach, C. 2007 Computerized progressive attentional training (CPAT) program: Effective direct intervention for children with ADHD. Child Neuropsychology, 13, 382–388.
Smith, E.E., & Jonides, J. 1999 Storage and executive processes in the frontal lobes. Science, 283, 1657–1661.
Tallal, P., & Gaab, N. 2006 Dynamic auditory processing, musical experience, and language development. Trends in Neuroscience, 29, 382–390.

Thorell, L., Lindqvist, S., Nutley, S., Bohlin, G., & Klingberg, T. 2009 Training and transfer effects of executive functions in preschool children. Developmental Science, 1–8.
Todman, J., & Seedhouse, E. 1994 Visual-action code processing by deaf and hearing children. Language and Cognitive Processes, 9, 129–141.
Tomblin, J.B., Barker, B.A., & Hubbs, S. 2007 Developmental constraints on language development in children with cochlear implants. International Journal of Audiology, 46, 512–523.
Tomblin, J.B., Mainela-Arnold, E., & Zhang, X. 2007 Procedural learning in adolescents with and without specific language impairments. Language Learning and Development, 3, 269–293.
Ullman, M.T. 2004 Contributions of memory circuits to language: The declarative/procedural model. Cognition, 92, 231–270.
Ullman, M.T., & Pierpont, E.I. 2005 Specific language impairment is not specific to language: The procedural deficit hypothesis. Cortex, 41, 399–433.
van Praag, H., Kempermann, G., & Gage, F.H. 2000 Neural consequences of environmental enrichment. Nature Reviews: Neuroscience, 1, 191–198.
Verhaeghen, P., Cerella, J., & Basak, C. 2004 A working memory workout: How to expand the focus of serial attention from one to four items in 10 hours or less. Journal of Experimental Psychology: Learning, Memory, & Cognition, 30, 1322–1337.
Vicari, S., Marotta, L., Menghini, D., Molinari, M., & Petrosini, L. 2003 Implicit learning deficit in children with developmental dyslexia. Neuropsychologia, 41, 108–114.
Wechsler, D. 1991 Wechsler Intelligence Scale for Children – Third Edition. San Antonio, TX: The Psychological Corporation.
Westerberg, H., Jacobaeus, H., Hirvikoski, T., Clevberger, P., Ostensson, M.L., Bartfai, A., & Klingberg, T. 2007 Computerized working memory training after stroke – A pilot study.

Conscious versus unconscious learning of structure

Zoltan Dienes

1. Introduction

The ways we come to learn about the structure of complex environments are intimately linked to the conscious-unconscious distinction. Indeed, Reber (1967, 1989) argued that we could acquire unconscious knowledge of some structures we could not readily consciously learn about because of their complexity. Some authors agree there are two modes of learning distinguished by their conscious versus unconscious phenomenology (e.g. Scott & Dienes, 2010a). Others at least agree there are striking differences in phenomenology in different learning situations. For example, Shanks (2005), who does not accept there is such a thing as unconscious knowledge, nonetheless notes that when he himself performed a standard implicit learning task, "trying to articulate my knowledge, even only moments after performing the task, seem[ed] to require a Herculean effort of mental will that yield[ed] only the sketchiest useful information" (p. 211). By contrast, in yet other learning situations, knowledge can be readily described as it is being applied.

The everyday example of natural language makes the contrast between these phenomenologies stark. We all learnt the main grammatical constructions of our native language by about age five without even consciously knowing there was a grammar to be learnt. And as adults we still cannot describe all the rules we spontaneously use. Yet when we learn a second language as an adult we may spend considerable time memorising rules of grammar. The two methods of learning feel very different and produce different results. By suitably defining conscious versus unconscious we can describe the difference in phenomenologies. And that difference may well be a marker of different mechanisms of learning. Indeed, the difference in phenomenology is so striking that the distinction was constantly reinvented before it became part of an established literature, despite behaviouristic tendencies in psychologists to avoid the conscious versus unconscious terms (e.g. Broadbent, 1977; Hull, 1920; Lewicki, 1986; Phelan, 1965; Reber, 1967; Rommetveit, 1960; Smoke, 1932). Thus, the starting point for a definition of conscious versus unconscious should be one that picks out the real-life examples that motivate the distinction,
and not one that makes the distinction evaporate. Just as phenomenologically unconscious learning seems especially powerful when we consider language or perceptual-motor skills (Reed, McLeod, & Dienes, 2010), so many things are learnt only consciously (e.g. special relativity). Often what is interesting is the fact that we can or cannot learn a structure unconsciously (or consciously) – not just whether we can or cannot learn it. Thus, it is vital for learning researchers to have a means for determining the conscious or unconscious status of knowledge, suitably defined. Only then can we experimentally explore whether the phenomenologies mark qualitatively different, if possibly interacting, learning mechanisms – and also isolate those mechanisms on a trial-by-trial basis.

In this chapter I present a methodology for determining the conscious status of structural knowledge. First, I offer a definition of what an unconscious mental state is. Then I will use the definition to motivate a method of measuring the conscious-unconscious distinction. The method can be applied separately to different knowledge contents, specifically to "judgment" knowledge and "structural" knowledge, which will be defined. Next I review evidence that when the method is used to separate conscious from unconscious structural knowledge it isolates different learning systems (thus, the method shows its scientific worth). Finally, I present recent experiments showing its application (to cross-cultural differences and to learning language-like structure in the lab).

2. What is unconscious knowledge?

I will take conscious knowledge to be knowledge one is conscious of (cf. Rosenthal, 2005; also Carruthers, 2000). This may sound like a tautology, but it is not, as we now see. How are we ever conscious of anything, say, of a dog being there? Only by either perceiving that the dog is there or thinking that it is there; that is, only by having some mental state that asserts that the dog is there. Likewise, to be conscious of knowledge we need to have a mental state that asserts that the knowledge is there. In other words, a higher-order mental state (a mental state about a mental state) is needed to be conscious of knowing. To establish that knowledge is conscious one must establish that the subject is in a metacognitive state of knowing about knowing: and this claim is not a tautology. Simply showing that a person knows about the world (for example, by accurately discriminating states of the world) will not do for determining whether the knowledge is conscious. Establishing that there is knowledge is just a precondition
for establishing whether it is conscious or not. But establishing that there is knowledge clearly does not establish whether the knowledge is conscious or unconscious. In sum, accepting that conscious knowledge is knowledge one is conscious of is to accept that establishing the conscious status of knowledge requires establishing the existence of a metacognitive state (see Rosenthal, e.g. 2005, from whom this argument has been adapted to the case of knowledge). The argument can be put concisely: To be conscious of X requires a mental state about X; knowing is a mental state; thus, being conscious of knowing requires a mental state about a mental state.

Some argue that metacognition is one thing and a state being conscious is another; being aware that one is in a mental state is 'introspective' or 'reflective' or 'higher-order' consciousness, which is not necessary for a mental state to be simply conscious (Block, 2001; Dulany, 1991; Seth, 2008). While it seems odd to say that a mental state is conscious when a person is in no way conscious of being in that state, there is no point quibbling excessively over terminology. Those who argue that there is higher-order consciousness separate from a mental state being conscious can simply translate what I am calling "conscious knowledge" into "reflectively conscious knowledge" or whatever their favourite term is. The scientific problem I will be addressing is to determine whether the fact of whether or not one is aware of one's knowledge (or can be when probed) can distinguish qualitatively different types of knowledge or learning mechanisms.

Once this point is accepted, a lot of heat in the implicit learning literature can be sidestepped. For example, Dulany (1991), a critic of the existence of unconscious knowledge, nonetheless accepts as obvious that we can be in a mental state without being aware that we are (e.g. p. 109). And knowing without being aware of knowing is exactly what I am calling unconscious knowledge. Having some match to everyday use is the reason for using a term at all, and the definition of conscious knowledge as 'the knowledge of which one is conscious' in no way stretches the everyday use of the word. On the other hand, allowing knowledge that a person sincerely denies having to count as conscious probably does stretch normal usage. In the end, however, the test is not armchair argument or exact everyday usage, but whether the definition is useful in conjunction with a theory in predicting experimental findings (Dienes, 2008; Dienes & Scott, 2005; Dienes & Seth, 2010a; Merikle, 1992; Seth et al., 2008). And that is what we will explore: Does the definition pull itself up by its bootstraps by being scientifically useful? First, we need to find a way of operationalising the definition.

3. How can we measure awareness of knowing?

To establish that knowledge is conscious requires establishing the existence of a metacognitive state (Cleeremans, 2008; Dienes, 2008; Lau, 2008). Thus, establishing that a person can make a discrimination about states of affairs in the world ("worldly discrimination") with forced-choice discrimination or recognition tests establishes only that there is knowledge, not that it is conscious. How could we determine whether or not a person is conscious of their knowledge? One way is to ask them to freely report what they know about the domain. The logic of this method is that a person will only report as facts about the world what they think they know. Indeed, free report has been used as a test of conscious knowledge ever since the term "implicit learning" was coined by Reber (1967) to mean learning that produces unconscious knowledge.

Reber (1967) introduced the artificial grammar learning task for investigating implicit learning. He used a finite-state grammar to generate strings of letters. Subjects memorised such strings without being told that the strings were rule governed. After 5–10 minutes subjects were told of the existence of a set of rules, but not what they were. Subjects could classify new strings as grammatical or not, with 60–70% accuracy depending on materials. However, at the end of the experiment people found it hard to describe the rules of the grammar, and Reber took this as evidence that subjects lacked conscious knowledge.
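
For readers who want a concrete picture of the generation procedure, the sketch below walks a small finite-state grammar to produce training strings. The transition table is an invented toy grammar, not Reber's original, and the sketch is offered only as an illustration of how such strings can be produced.

```python
import random

# A toy finite-state grammar: each state maps to a list of
# (letter emitted, next state) transitions. "END" marks a legal stop.
# These transitions are invented for illustration only.
TRANSITIONS = {
    0: [("M", 1), ("V", 2)],
    1: [("T", 1), ("V", 3)],
    2: [("X", 3), ("T", 2)],
    3: [("X", 4), ("R", 2)],
    4: [("END", None)],
}

def generate_string(max_len=8):
    """Walk the grammar from state 0, emitting letters until END
    (strings are simply truncated at max_len for brevity)."""
    state, letters = 0, []
    while True:
        letter, nxt = random.choice(TRANSITIONS[state])
        if letter == "END" or len(letters) >= max_len:
            return "".join(letters)
        letters.append(letter)
        state = nxt

# Training set: the kind of strings subjects would memorise in the study phase.
training_strings = [generate_string() for _ in range(20)]
print(training_strings)
```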

Such post-task free report has been widely used to determine the extent of people's conscious knowledge. But it is not a compelling measure, because it involves subjects' attempts to recall their thinking some time after it happened (Ericsson & Simon, 1980). Reber and Lewis (1977) asked people to report their reasons after each classification instead of at the end of a set of trials, clearly an improvement, and a source of valuable information about the person's conscious knowledge. Nonetheless, the method is still problematic. Berry and Dienes (1993, p. 38) isolated two potential problems with any measure of conscious knowledge. The first is "simply not asking for the same knowledge that the subject used to classify" and the second is "differential test sensitivity" (the two problems also highlighted by Shanks and St John, 1994). In terms of the first, if people did not use general rules to classify but instead analogy to recollected training items, they may not report their actual source of grammaticality judgments if they believe the experimenter is only interested in hearing about rules (Brooks, 1978; Jamieson & Mewhort, 2009). Alternatively, people may not report using the experimenter's rules if specifically asked about them,
because they used a correlated but different rule (Dulany, 1962). This problem can be overcome by suitable instructions, but it does highlight the care that needs to be taken in collecting free reports (Ruenger & Frensch, 2010). In terms of the second problem, people may hold back from reporting knowledge if they are not completely confident in it. Why risk stating a rule that might be wrong? Thus, free report might easily underestimate the amount of conscious knowledge that a person has.

A method that deals with both these problems is eliciting confidence after every judgment. Any knowledge the participant is conscious of using as knowledge, no matter what its content, should be reflected in the participant's confidence. Further, using confidence ratings has an advantage over free report in that low confidence is no longer a means by which relevant conscious knowledge is excluded from measurement; rather, the confidence itself becomes the object of study and can be directly assessed on every trial. Indeed, Ziori and Dienes (2006) provided empirical evidence for the greater sensitivity of confidence-based methods over free report in detecting conscious knowledge.

The simplest confidence scale is for the subject to report "guess" if she believes the judgment had no basis whatsoever, and "know" if she believes the judgment constituted knowledge to some extent. If on the trials when the person says "guess" the discrimination performance is nonetheless above baseline, then there is evidence that the person has knowledge (performance above baseline) that she doesn't know she has (she says she is guessing). This is unconscious knowledge by the guessing criterion. If a person's knowledge states are conscious, she will know when she knows and when she is just guessing. In that case, there should be a relation between confidence and accuracy. Thus, a relation between confidence and accuracy indicates conscious knowledge, and a zero relation indicates unconscious knowledge by the zero-correlation criterion (Dienes, Altmann, Kwan, & Goode, 1995).

Confidence can also be elicited by various methods of gambling. Persaud, McLeod, and Cowey (2008) asked people to wager high or low on each grammaticality decision. Subjects were told that if their decision was wrong they would lose the amount of the wager and if their decision was right they would win that amount. Low wagers can be taken to reflect low confidence and high wagers higher confidence in one's grammaticality decision. The problem with wagering is that it is subject to loss aversion: A person may wager low even though they have some confidence in their decision, because they do not want any risk of losing the higher amount of money. Indeed, Dienes and Seth (2010b) showed empirically that high-low
wagering was sensitive to loss aversion. Further, Dienes and Seth introduced a loss-free gambling method for measuring confidence, to eliminate the confounding effect of loss aversion. On each trial subjects made two decisions. First was a grammaticality decision. Second was a choice of one of two cards the subject had just shuffled; one card had a reward printed on the back, the other was blank. Next the subject made a choice between the two decisions: For whichever decision they chose, if they got it right, they won the reward; if they got it wrong, nothing happened. Thus, subjects should always be motivated to bet on their grammaticality decision if they had the slightest confidence that it was right: loss aversion is irrelevant. Nonetheless, subjects sometimes chose to bet on the transparently random process rather than their own grammaticality decision, and on those trials subjects were still about 60% correct on their grammaticality decision. This is unconscious knowledge by the guessing criterion. Further, subjects were more accurate on the grammaticality decision when they bet on it rather than on the transparently random process: This is conscious knowledge by the zero-correlation criterion. Thus the method confirmed that artificial grammar learning involves a mix of trials, some on which subjects are aware of knowing grammaticality, and some on which they are not. The use of no-loss gambling helps subjects appreciate what we want them to understand by "guess": "Guessing" on the grammaticality decision means one expects to perform no better than a random process. In general, eliciting confidence by means of gambling makes it possible to measure the conscious status of knowledge in young children (Ruffman et al., 2001) and even some non-human primates (Kornell, Son, & Terrace, 2007).

A more indirect measure of awareness of knowing is the ability to control the use of knowledge (Jacoby, 1991: the "process dissociation procedure"). The logic of this method is that if one is aware of one's knowledge, one can control its use, if so instructed. Fu, Dienes, and Fu (2010) showed that in one implicit learning task (the SRT, or serial reaction time, task), when people said they were guessing in predicting the next element they also had no control over the use of knowledge: When asked to produce the next element according to the rules they were no more likely to produce rule-governed completions than when asked to produce the next element such that it violated the rules. Further, when people had some confidence they also had some control. So awareness of knowing as measured by confidence and as measured by control went together in this case. However, control and conscious knowledge do not always go together. Dienes et al. (1995) and Wan, Dienes, and Fu (2008) showed that people could control which of two
grammars to use while believing they were completely guessing. That is, unconscious knowledge can produce control, and thus the presence of control does not definitively indicate that the knowledge was conscious. Nonetheless, a lack of control (under self-paced conditions) is a good indicator of unconscious knowledge: If a person reliably produces grammatical choices when told to pick only ungrammatical items, without time pressure, that is often good evidence that the person is not actually aware of grammaticality (see Destrebecqz & Cleeremans, 2001; Fu, Fu, & Dienes, 2008; and Rohrmeier, Fu, & Dienes, submitted, for examples).

In sum, to measure awareness of knowing that an item belongs to a category, the method with least ambiguity is no-loss gambling. Nonetheless, verbal confidence, where "guess" has been defined to subjects as meaning equivalent to the outcome of a random process, also behaves well and is not correlated with loss aversion (Dienes & Seth, 2010b). Measures based on control can be informative, but may produce ambiguous findings.
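
As a concrete illustration of how the guessing criterion and the zero-correlation criterion can be scored, the sketch below analyses hypothetical per-trial records of accuracy and confidence. The data, the 50% baseline, and the choice of a binomial test and a point-biserial correlation (via SciPy) are illustrative assumptions, not a prescription from the studies cited above.

```python
from scipy import stats

# Hypothetical per-trial records: 1 = correct classification, 0 = incorrect,
# paired with a confidence rating on a 50-100 scale (50 = "guess").
trials = [
    {"correct": 1, "confidence": 50},
    {"correct": 1, "confidence": 70},
    {"correct": 0, "confidence": 50},
    {"correct": 1, "confidence": 90},
    {"correct": 1, "confidence": 50},
    {"correct": 0, "confidence": 60},
    {"correct": 1, "confidence": 80},
    {"correct": 1, "confidence": 50},
]

# Guessing criterion: accuracy on "guess" trials above the 50% baseline
# suggests knowledge the person does not know she has.
guess_outcomes = [t["correct"] for t in trials if t["confidence"] == 50]
k, n = sum(guess_outcomes), len(guess_outcomes)
p_guess = stats.binomtest(k, n, p=0.5, alternative="greater").pvalue
print(f"Guess-trial accuracy: {k}/{n}, one-sided binomial p = {p_guess:.3f}")

# Zero-correlation criterion: a positive confidence-accuracy relation
# indicates conscious judgment knowledge; a relation near zero suggests
# the knowledge is unconscious.
acc = [t["correct"] for t in trials]
conf = [t["confidence"] for t in trials]
r, p_r = stats.pointbiserialr(acc, conf)
print(f"Confidence-accuracy relation: r = {r:.2f}, p = {p_r:.3f}")
```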

4. Judgment versus structural knowledge

A person has conscious knowledge that p when they are aware of knowing that p, where p is any proposition. Mental states with different contents are different mental states. 'Knowing that p' is different from 'knowing that q' if p and q are different. Being conscious of knowing that p thus entails that we know that we know specifically p. Being aware of knowing q does not make knowledge of p conscious. Corollary: A good methodological rule whenever a claim is made about a state or process being conscious or unconscious is always to specify the content said to be conscious or unconscious (cf. Dienes & Seth, 2010c).

For example, implicit memory does not normally involve any unconscious knowledge per se, something that becomes clear as soon as one tries to specify the content of the knowledge involved. The presentation of a word may strengthen the connections between the letters in the part of the cortex that codes words, making the word a more likely completion to a stem. The knowledge applied in stem completion is what letters can follow other letters, and there is no reason to think this is unconscious when the person completes the stem. Further, there is no reason to think the person has any knowledge, conscious or unconscious, that the word was presented in the experiment (cf. Dulany, 1991). So implicit memory is not a case of unconscious knowledge. Implicit learning, by contrast, does involve unconscious knowledge.

What are the knowledge contents involved in an implicit learning experiment? When a person is exposed to a domain with some structure they often acquire knowledge about that structure (e.g. Reber, 1989; Gebhart, Newport, & Aslin, 2009). The knowledge might be of the conditional probabilities of successive elements, of allowable chunks and their probabilities of occurrence, of what letters can start a string, of allowable types of symmetries and their probabilities, of particular allowed sequences, and so on. Let us call all this knowledge 'structural knowledge'. For any of this knowledge to be conscious, one would have to be aware of having specifically that knowledge. For example, to consciously know that "An M can start a string" one would have to represent specifically "I know that an M can start a string". Such a metacognitive representation makes a particular piece of structural knowledge conscious, namely the knowledge that an M can start a string.

In a test phase a subject may make judgments about whether a presented item is grammatical or not: whether this string is grammatical, or whether this item can occur next in the sequence. Structural knowledge is brought to bear on the test item to form a new piece of knowledge, for example, that this item has the structure of the training items. Let us call this knowledge 'judgment knowledge'. That is, when a subject makes a judgment that p, the judgment knowledge has content p. The structural knowledge is whatever other knowledge the person had that enabled the judgment. For example, if a subject judges that "MTTVX is grammatical", the judgment knowledge is that "MTTVX is grammatical" and the structural knowledge the person may have used consists of things like "An M can start a string", "VX is an allowable bigram", and "MTTVT is an allowable string".

What knowledge do the methods of the last section determine the conscious status of? When a person makes a judgment followed by a confidence rating, the expressed confidence is in the judgment. Thus, confidence ratings – whether verbal reports, high-low wagering, or no-loss gambling – determine the conscious status of judgment knowledge. Similarly, Jacoby's method of measuring control, as applied to implicit learning (Destrebecqz & Cleeremans, 2001; Dienes et al., 1995; Fu, Fu, & Dienes, 2008; Fu, Dienes, & Fu, 2010; Jiménez, Vaquero, & Lupiáñez, 2006; Wan et al., 2008), measures the ability to control making a judgment, and hence measures the conscious status of judgment knowledge. If a person is confident that this string is grammatical, they are aware of knowing this string is grammatical, but that does not mean they are aware of the structural knowledge that enabled that judgment. Similarly, a person may, because
they consciously know that this item is grammatical, be able to choose that item or another one if instructed according to Jacoby's methods, but that control over the judgment does not mean the person consciously knows why it is grammatical.

The conscious status of judgment knowledge does not completely determine the conscious status of structural knowledge. Consider our knowledge of our native language. Our structural knowledge can be largely unconscious. Yet we may be sure that a given sentence is ungrammatical even if we do not know why: conscious judgment knowledge, unconscious structural knowledge. But if a key divide between different learning mechanisms is between conscious and unconscious structural knowledge rather than between conscious and unconscious judgment knowledge, then we need a method for measuring the conscious status of structural knowledge rather than just of judgment knowledge. Free report does measure the conscious status of structural knowledge. But as mentioned, free report has its problems.

Dienes and Scott (2005) devised a simple method for measuring the conscious status of structural knowledge that deals with these problems (see also Fu, Dienes, & Fu, 2010; Guo et al., 2011; Rebuschat, 2008; Scott & Dienes, 2008, 2010b, c; Wan et al., 2008; Chen et al., 2011). After every judgment subjects indicate what the basis of the judgment was according to a set of attribution categories: random, the judgment had no basis whatsoever; intuition, it had some basis but the subject had no idea what it was; familiarity, the decision was based on a feeling of familiarity but the subject had no idea what the familiarity itself was based on; recollection, the basis was a recollection of a string or strings or part(s) of strings from training; and rules, the basis was a rule or rules that the subject could state if asked. Assuming the subject's judgments are above baseline for each attribution: random attributions indicate that both judgment and structural knowledge were unconscious; intuition and familiarity attributions indicate that judgment knowledge was conscious but structural knowledge was unconscious; and recollection and rules attributions indicate that both judgment and structural knowledge were conscious. Thus, to measure the amount of unconscious structural knowledge one can pool together random, intuition and familiarity attributions, and to measure the amount of conscious structural knowledge one can pool together the recollection and rules attributions. If one wanted to compare conscious and unconscious judgment knowledge, random attributions could be compared with intuition and familiarity: the conscious status of structural knowledge has been controlled, because it is unconscious for each of these attributions.
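
The pooling logic just described can be made concrete in a few lines. The trial records below are hypothetical, and the groupings simply follow the attribution categories given above, with a 50% chance baseline assumed for a two-choice grammaticality judgment.

```python
from collections import defaultdict

# Hypothetical test-phase records: each trial pairs a structural knowledge
# attribution with whether the grammaticality judgment was correct (1/0).
trials = [
    ("random", 0), ("random", 1), ("intuition", 1), ("intuition", 1),
    ("familiarity", 1), ("familiarity", 0), ("familiarity", 1),
    ("recollection", 1), ("rules", 1), ("rules", 0), ("rules", 1),
]

# Pool attributions as in the text: random/intuition/familiarity reflect
# unconscious structural knowledge; recollection/rules reflect conscious
# structural knowledge.
POOL = {"random": "unconscious structural knowledge",
        "intuition": "unconscious structural knowledge",
        "familiarity": "unconscious structural knowledge",
        "recollection": "conscious structural knowledge",
        "rules": "conscious structural knowledge"}

scores = defaultdict(list)
for attribution, correct in trials:
    scores[POOL[attribution]].append(correct)

for pool, outcomes in scores.items():
    accuracy = sum(outcomes) / len(outcomes)
    # Compare against the 50% chance baseline of a two-choice judgment.
    print(f"{pool}: {accuracy:.0%} correct over {len(outcomes)} trials "
          f"(baseline 50%)")
```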

One criticism of free report is that subjects may believe the experimenter only wants to hear about rules and not specific recalled exemplars, or about some types of rule rather than others. The attributions, in contrast, do not presuppose any particular form of conscious structural knowledge beyond its being classifiable as rules or recollections. And if the subject does not want to report a rule that might be wrong, that is fine: We do not actually ask people to report their rules, just to indicate whether they have one. Further, confidence can be elicited independently of the felt basis of the judgment: If people used rules but felt they were just guessing, this can be indicated by a rules response and a separate confidence rating (Scott & Dienes, 2008). The attributions are recorded trial by trial, and can even be given by the very same button press as indicates the classification judgment (Dienes, Baddeley, & Jansari, 2012), so forgetting should not be an issue. The method, though simple, prima facie deals with many problems that face measures of conscious knowledge. But the acid test of its worth is whether it contributes to explaining empirical findings in a theoretically motivated way. Subjects may simply pick arbitrarily amongst the attributions according to whim or momentary bias. What evidence is there that the attributions reflect anything interesting?

5. Have subjective measures of the conscious status of structural knowledge proved their mettle?

First we need to situate conscious and unconscious knowledge within a theory. The theoretical claims should be sufficiently broad that they do not depend on a theory so idiosyncratic that few would wish to strongly associate the conscious-unconscious distinction with it, but sufficiently precise that some predictions can be made. Relatedly, the theory should go beyond defining conscious versus unconscious; it should introduce properties that differ between the conscious and the unconscious in an empirical, contingent way rather than a conceptual way, so that the properties can be empirically tested as differing between conscious and unconscious knowledge. (If the properties were conceptually associated with the conscious-unconscious distinction – i.e. necessarily part of our proposed concept of that distinction – one could not test whether the conscious-unconscious distinction was associated with those properties.) So here is a theoretical context that tries to satisfy these constraints.

5.1. The theoretical context

Prototypical unconscious knowledge is the structural knowledge embedded in the weights of a connectionist network (which could effectively learn exemplars, abstractions or both: Cleeremans & Dienes, 2008). Typically such knowledge does not need manipulation in working memory to be applied; it just needs activation running through it. Why should such knowledge empirically be unconscious? Because there is no reason why such knowledge should be input to a device that forms higher-order thoughts, or awareness of knowing specific contents. For example, if accurate thoughts about one's mental states are located in a specific location (e.g. the mid-dorsolateral prefrontal cortex according to Lau & Passingham, 2006), the values of synaptic weights in other brain regions would not normally be input to that location; only patterns of activation would be. A connectionist network can classify by an overall goodness-of-fit signal, which we postulate corresponds to a feeling of familiarity (Dienes, Scott & Wan, 2011), such a feeling communicating the existence of unconscious structural knowledge to more general conscious mechanisms (the function of "fringe feelings" according to Mangan, 1993, and Norman, Price, Duff, & Mentzoni, 2007). For example, in a test phase, if part of a string feels more familiar than another part, this can be used to form a conscious rule about why that might be so.

Prototypical conscious structural knowledge is knowledge formed by hypothesis testing: by the consideration of hypotheticals understood as such, and also by the use of recollection. Why should such knowledge be associated with awareness of knowing? First, the knowledge would be represented as a pattern of activation rather than a pattern of connection strengths – but this alone is not enough. Importantly, hypotheses, understood as such, need to be represented in a format which explicitly marks the distinction between reality and possibility, a level of explicitness Dienes and Perner (1999, 2002) argued was close to explicitly marking knowledge as knowledge. That is, the step from representing a tested hypothesis as such to conscious knowledge is a short one. Recollection intrinsically involves representing oneself as remembering, and hence involves awareness of knowing (Perner & Ruffman, 1995; Searle, 1983). Further, hypothesis testing (and recollection) usually require the use of working memory. It is plausible that information in working memory is generally available to different processing modules in the brain, including any that have the function of forming accurate higher-order thoughts.
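
To illustrate the idea of a goodness-of-fit signal standing in for familiarity, here is a deliberately minimal sketch: a single-layer network learns which letter tends to follow which in a set of invented training strings, and a test string is then scored by how well its transitions fit the learned weights. It is not any specific published model; the alphabet, strings, learning rate and scoring rule are assumptions made for illustration only.

```python
import numpy as np

ALPHABET = "MTVRX"
IDX = {c: i for i, c in enumerate(ALPHABET)}

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

class NextLetterNet:
    """One-layer network: current letter (one-hot) -> distribution over next letter."""
    def __init__(self, n, lr=0.1):
        self.W = np.zeros((n, n))
        self.lr = lr

    def train(self, strings, epochs=50):
        for _ in range(epochs):
            for s in strings:
                for a, b in zip(s, s[1:]):
                    p = softmax(self.W[IDX[a]])
                    target = np.zeros(len(ALPHABET))
                    target[IDX[b]] = 1.0
                    # Delta rule on the prediction error for this transition.
                    self.W[IDX[a]] += self.lr * (target - p)

    def familiarity(self, s):
        """Goodness of fit: mean log-probability of the string's transitions."""
        logps = [np.log(softmax(self.W[IDX[a]])[IDX[b]]) for a, b in zip(s, s[1:])]
        return float(np.mean(logps))

# Invented training strings standing in for the study phase of an AGL experiment.
net = NextLetterNet(len(ALPHABET))
net.train(["MTTVX", "MVRX", "MTVX", "MVRRX"])

# Higher (less negative) scores mean more familiar; such a signal could feed a
# grammaticality judgment without the weights themselves being reportable.
print(net.familiarity("MTVX"), net.familiarity("XRVTM"))
```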

348

Zoltan Dienes

As connectionist networks become adapted to particular domains, they learn to detect structures in new instances in that domain most easily according to the structures already learned and their prior probabilities. More generally, we postulate that people will unconsciously learn most easily those structures in the lab which have a high prior probability of being relevant in that domain, whether strictly statistical or not. Hypothesis testing can become fixated and ruled by prior probabilities too; but it can also make flexible jumps that can be useful or lead one systematically astray.

The model proposed is a dual-process one – learning can be based on a mechanism that acquires unconscious structural knowledge or a mechanism that acquires conscious structural knowledge. Dual-process models have been criticised in the implicit learning literature, and single-process models proposed, possessing the virtue of simplicity (e.g. Shanks, 2005; Cleeremans & Jiménez, 2002). These single-process authors use connectionist networks where all knowledge of the grammar is embedded in the weights. The 'single process' aspect of the models is that they could, if a metacognitive component were added, model both conscious and unconscious judgment knowledge with a single learning device that acquires unconscious structural knowledge (see Pasquali, Timmermans, & Cleeremans, 2010, for an integrated connectionist model of learning and metacognition). However, the models still leave a necessary distinction between conscious and unconscious structural knowledge, as they only model the latter (compare Shanks and St John's, 1994, distinction between exemplar-based and rule-based learning). Thus, the single-process proposals of Cleeremans and Shanks are in principle consistent with the framework developed here, and the framework is sufficiently broad that it should capture the intuitions of a large number of workers in the field.

5.2. The evidence

That is the theoretical context, showing that given a particular conceptual approach to the conscious-unconscious knowledge distinction (viz. conscious knowledge is knowledge one is conscious of), there are further properties, based on theoretical speculation, that should be empirically associated with conscious versus unconscious knowledge. Does the method of measuring conscious status by the structural knowledge attributions classify accurately often enough that it helps identify qualitatively different types of knowledge that fit in with this theoretical framework? We now consider the evidence. (Note that we do not require that the attributions
always classify accurately – no instrument in science does that – just that it classifies accurately enough that we can get on with the science, refining the measurement process as we go.)

1. Conscious, unlike unconscious, structural knowledge typically requires manipulation in working memory for the knowledge to be applied to a new test item. Thus, engaging executive resources at test should interfere specifically with conscious structural knowledge. Dienes and Scott (2005; experiment 2) found just such a dissociation, where random number generation at test interfered with the application of conscious structural knowledge but not at all with unconscious structural knowledge. (Interfering with perceptual rather than executive resources interferes with both conscious and unconscious structural knowledge, Tanaka et al., 2008; Eitam, Schul, & Hassin, 2009; Rowland & Shanks, 2006, as would be expected if the input to a connectionist learning device were degraded.)

2. When unconscious knowledge is formed for a domain consisting of simple statistics over well-established perceptual units, it should be largely accurate. Conscious knowledge may also be partially correct, but when it is wrong it will lead to the same mistake being repeated – the repeated application of a partially correct rule makes one both consistently correct and consistently incorrect (cf. Reber, 1989, from which this prediction is derived; see also Sun, 2002, whose conscious and unconscious systems embody a similar principle). Indeed, Dienes and Scott (2005; experiment 1) found that when people gave unconscious structural knowledge attributions, they did not systematically misclassify strings (i.e. if a string was misclassified one time it might be classified correctly the next). On the other hand, when people gave conscious structural knowledge attributions, if they made an error in classifying a string, the error was likely to be repeated. Likewise, Reed, McLeod and Dienes (2010) found that in a perceptual-motor domain where a simple rule exists for avoiding or creating interceptions between ourselves and other objects, a rule that no doubt has been crucial to survival in the evolutionary past, knowledge of the rule was unconscious as revealed by confidence ratings. Further, conscious knowledge was systematically wrong, misled by its own flexibility in considering all sorts of possibilities of how things might be.

3. As the postulated differences between conscious and unconscious structural knowledge reflect qualitatively different learning mechanisms, differences between the mechanisms should not be reducible to a single dimension, such as confidence (cf. Tunney, 2007). Indeed, recent work by
Andy Mealor at the University of Sussex showed that the reaction times for unconscious rather than conscious structural knowledge attributions were different (as it turned out, longer). Reaction times were also correlated with confidence, such that the lower the confidence the longer the RTs. So could the difference in RTs between conscious and unconscious knowledge be simply due to a difference in confidence? The answer is no: Once confidence was partialled out, the reaction time difference between conscious and unconscious structural knowledge remained.

The output of the unconscious mechanism is often not consciously experienced as anything, or at least not as the output of a learning mechanism (see Dienes, Scott & Wan, 2011). But often it is experienced as a feeling of familiarity which people can rate. Scott and Dienes (2008) and Wan et al. (2008) found a greater relation between rated feelings of familiarity and classification when people said they were using feelings of familiarity as their attribution than when they used other unconscious attributions (Dienes, Scott & Wan, 2011). Thus, the attribution of familiarity is not given arbitrarily. When feelings of familiarity are conscious, we can make the following prediction:

4. If people have consciously worked out some aspects of structure, they will have more knowledge relevant to making a classification than is contained in their feelings of familiarity. This is just what Scott and Dienes (2008) found. In a standard artificial grammar learning task, where we know people acquire some accurate conscious structural knowledge (Reber & Lewis, 1977), familiarity ratings of each item predicted grammaticality judgments for both conscious and unconscious structural knowledge, but judgments based on conscious structural knowledge had additional discriminative ability above and beyond rated familiarity (unlike judgments based on 'familiarity' attributions; cf. Scott & Dienes, 2010c).

Thus, the attribution method shows its mettle by not simply classifying different types of knowledge but by identifying a real divide in nature, separating out knowledge qualitatively different in ways expected on the basis of theory. There is also additional evidence that people do not hand out the attributions arbitrarily. Riccardo Pedersini, working at the University of Sussex, found that Galvanic Skin Response was different for correct and incorrect answers on an artificial grammar learning task. The difference was significant for each attribution category, and of a very similar magnitude for the unconscious structural knowledge attributions amongst themselves and for the conscious structural knowledge attributions amongst themselves – but strikingly different between conscious and unconscious
structural knowledge attributions.¹ In sum, the different phenomenology associated with conscious and unconscious structural knowledge corresponds to objective differences in the properties of the knowledge. More work remains to be done, of course: For example, are the structural knowledge attributions useful in separating knowledge with different time constants of decay (cf. Allen & Reber, 1980), or knowledge differentially related to IQ (cf. Gebauer & Mackintosh, 2007) or other individual-difference variables (cf. Scott & Dienes, 2010a)? To what extent does eliciting the ratings change what type of knowledge is used?

Unconscious knowledge as revealed by this method is substantial, replicable, and occasionally more powerful than conscious knowledge (Scott & Dienes, 2010c; Reed, McLeod, & Dienes, 2010). Typically 70% of responses in the artificial grammar learning and similar tasks are attributable to unconscious structural knowledge, with performance levels around 65%. Unconscious structural knowledge is not something that can be conveniently ignored. Note also that striking qualitative differences were obtained when responses were separated out on a trial-by-trial basis – certain task conditions may on balance favour conscious or unconscious knowledge, but tasks are unlikely to be process pure (Jacoby, 1991).

5.3. Summary

The conscious-unconscious distinction that comes out most strongly as a real divide in nature is between conscious and unconscious structural knowledge rather than between conscious and unconscious judgment knowledge. Unconscious judgment knowledge corresponds to cases of unconscious structural knowledge, so establishing that judgment knowledge is unconscious is useful in picking out one of the learning mechanisms; but conscious judgment knowledge (associated with intuition, familiarity and other fringe feelings) can also result from unconscious structural knowledge.

¹ In Dienes (2008), the Pedersini study was misreported as showing a difference in the time of application of conscious versus unconscious structural knowledge; in fact the zero point on the time axis for the graphs in Dienes (2008) was not defined as the moment the string was presented, but simply as three seconds before feedback was given (feedback was given a set time after the subject responded, not a set time from when the string was displayed). For this experiment, the time relative to the presentation of the string is unknown. Thus the time it takes for structural knowledge to apply cannot be inferred from the graphs. Nonetheless, the results do show a striking difference between conscious and unconscious structural knowledge.

Now that we have a method with some evidence of its worth (and more, of course, is needed), we can use it to explore the conscious-unconscious distinction further. We consider as examples some recent applications of the method: first, to the issue of whether there are Asian-Western cross-cultural differences in unconscious processes; and second, to the learning of stimuli modelled on natural language.

6. Cross-cultural differences in unconscious processes

Nisbett and colleagues have been arguing for the last couple of decades that one's cultural background can profoundly affect cognitive processes (e.g. Nisbett, 2003). Specifically, Asians compared to Westerners take a more global rather than analytic perspective, being especially sensitive to context in conscious perception, memory, reasoning and social attributions, with Westerners often having the reverse tendency. For example, Masuda and Nisbett (2001) presented Japanese and Americans with underwater scenes. In a subsequent recognition test, Japanese recognized previously seen objects more accurately when they saw them in their original settings rather than in novel settings, whereas this manipulation had relatively little effect on Americans. Japanese tended to pay attention to the scene globally, whereas Americans focused more on foreground objects. A wealth of studies have investigated cross-cultural differences in conscious processing, showing consistent medium to large effects for global/analytic differences. However, the question of whether unconscious processes are affected by culture remains unanswered.

Reber (1989) argued that some minimal level of attention was needed for implicit learning to occur (cf. also e.g. Jiménez & Méndez, 1999). Thus, one might expect different attentional preferences in different cultures to lead to acquiring unconscious knowledge of different types of structures. Kiyokawa, Dienes, Tanaka, Yamada, and Crowe (2012) tested this claim using an artificial grammar learning paradigm developed by Tanaka et al. (2008). Tanaka et al. (2008) showed how global versus local attention could be separated in artificial grammar learning. They used "GLOCAL" strings (an example is shown in Figure 1), which are chains of compound letters (Navon, 1977). A compound letter represents one large letter (i.e., a global letter) composed of a set of small letters (i.e., local letters). A critical feature of this stimulus is that while a GLOCAL string can be read as one string at the global level (NVJTVJ in Figure 1), it can also be read as another string at the local level (BYYFLB in Figure 1).

[Figure 1. A GLOCAL string]

Tanaka et al. found that when
people were instructed to attend at one particular level (global or local), they learned the grammar at that level, but not at the unattended level, confirming Reber's claim that a minimal amount of attention is needed for implicit learning (see also Eitam et al., 2009, for a related finding). Kiyokawa et al. asked psychology students from Chubu University in Japan and from the University of Sussex in the UK to attend to GLOCAL strings embodying two different grammars at the local and global levels. For the condition considered here, no instructions were given as to which level to attend. In a test phase, strings were presented in normal font, i.e. not in GLOCAL format, and knowledge of each grammar was tested, accompanied by structural knowledge attributions.

For conscious structural knowledge attributions, the proportion of correct responses was much higher for the global (83%) than for the local (53%) grammar for the Japanese participants, but there was no global advantage for the UK students (75% versus 76%). These results conceptually replicate the pre-existing literature: For conscious processing, Asian people show a greater global preference than Western people. The real contribution of the study comes from considering cross-cultural differences in unconscious knowledge. For unconscious structural knowledge, the proportion of correct responses was much higher for the global (67%) than for the local (51%) grammar for the Japanese participants, but there was no global advantage for the UK students (60% versus 61%). In sum, Japanese participants showed a striking global advantage, performing at chance on local structure, whereas the UK participants learned similarly from both global and local levels. Importantly, this effect occurred when people were apparently unaware of the contents of the structural knowledge they had induced. Thus, cultural biases can profoundly affect the contents of unconscious and not just conscious states.

7. Learning language-like structures

Our second example of the application of measuring the conscious status of structural knowledge is the learning of language-like structures. Rebuschat
and Williams (2009) pointed out that "despite the widespread recognition that language acquisition constitutes a prime example of implicit learning . . . relatively little effort has been made, within linguistics or experimental psychology, to investigate natural language acquisition within the theoretical framework provided by implicit learning research". Indeed, the structures typically investigated within the implicit learning field are exemplars, chunks, conditional probabilities, or repetition patterns (see e.g. Pothos, 2007, for a review). These structures are generally relevant for learning in almost any domain. But does learning within particular domains, e.g. language, come with biases or capabilities or limitations (either innate or based on experience) for learning particular structures beyond those most generic ones? Here we explore the acquisition of structures more closely resembling natural language than has been typical in the implicit learning literature.

Rebuschat and Williams (2009) presented English-speaking subjects with two-clause sentences constructed from English words, but with the order of words obeying the rules of German grammar (e.g. "Last year visited Susan Melbourne because her daughter in Australia lived"). In particular, the authors were interested in whether the subjects could learn the rules of verb phrase placement, which vary according to whether the clause is main or subordinate and the first or second clause in the sentence. In the training phase subjects assessed the semantic plausibility of the sentences. In the test phase, subjects were informed of the existence of rules and asked to classify sentences made of completely new words. The test sentences were either grammatical or violated one or other of the placement rules. After each grammaticality judgment subjects gave their structural knowledge attribution (guess, intuition, rules or memory).

When people used intuition they classified significantly above baseline, indicating unconscious structural knowledge. People also classified well when they said they used rules, which appears to indicate some conscious structural knowledge. However, in free report at the end of the experiment no subject could articulate useful rules. The latter result likely reflects the insensitivity of free report we have already discussed, and the greater sensitivity of the structural knowledge attributions in picking up conscious knowledge of structure. An alternative possibility is that the rules attributions may actually have reflected unconscious structural knowledge guiding people who consciously held only vague and uninformative rules. Be that as it may, the intuition attributions provide evidence for unconscious structural knowledge of verb placement regularities that can apply to new words. In other words, subjects had unconscious
knowledge of relations that went beyond exemplars, chunks or statistics over words in themselves, and applied to verb placement within phrases. Second language vocabulary acquisition is an area where conscious learning has been emphasized (e.g. Ellis, 1994). Guo et al. (2011) explored an aspect of vocabulary acquisition, namely semantic prosody, that plausibly involves unconscious knowledge. Semantic prosody is the contextual shading in meaning of a word, largely uncaptured by dictionary definitions. Prosodies are often positive or negative; that is, the target word is frequently used with either positive or else negative surrounding words. For example, the word ‘‘cause’’ may seem to have the simple meaning ‘‘to bring about’’, but because the word is largely used in contexts in which a negative event has been brought about (a tendency that the Oxford English Dictionary does not mention) the word has a negative semantic prosody. Chinese participants learning English were exposed to English sentences containing one of six pseudo-words, presented as real words, that substituted for a corresponding English word with known positive or negative prosody (like ‘‘cause’’). In the training phase, participants read sentences providing a consistent positive or negative context for each pseudo-word. In the test phase, participants judged the acceptability of the target pseudo-words in new sentences, which provided a context that was either consistent or inconsistent with the trained prosody. After each judgment, structural knowledge attributions were given. When participants gave unconscious as well as conscious structural knowledge attributions, they accurately discriminated appropriate and inappropriate contexts for the pseudo-words. Thus, second language vocabulary acquisition may be partly conscious, especially for core meanings, but people acquire both conscious and unconscious knowledge of shadings of meaning. Williams (2004, 2005) constructed a rule to create noun phrases in which determiners before nouns were categorized according to animacy: living things used one set of determiners and non-living things another. (In English determiners include: ‘the’, ‘a’, ‘that’, ‘this’.) Williams asked participants to translate Italian phrases into English, phrases which followed the form-animacy regularity mentioned. On a later test using both trained and generalization items, participants responded correctly on a forced-choice test of form-meaning connections. In a post task free report, most participants claimed that they were not aware of the relevance of animacy during training, leading Williams to suggest the knowledge of the use of di¤erent determiners for animate and inanimate objects was unconscious. Chen et al. (2011) conceptually replicated the procedure with Chinese subjects, using structural knowledge attributions to provide a more sensitive test of
the conscious status of the structural knowledge. Chen et al. used characters unknown to the participants as determiners, in sentences that were otherwise standard Chinese. As in the Williams experiments, the correct determiner varied according to the animacy (and also distance) of the modified noun phrase. In the training phase, subjects were exposed to sentences following the regularities. In the test phase, new sentences were judged for acceptability, followed by structural knowledge attributions. When structural knowledge was unconscious, people classified the right determiner according to the animacy of the modified noun phrase at about 60% correct, significantly above chance. People also classified about 60% when using conscious structural knowledge, though no subject was willing to report any regularity in a post-task free report. In natural languages determiners can be sensitive to a range of features. For example, in English, the determiners 'this' versus 'that' make a near-far distinction. In Mandarin, animacy is also relevant. Thus, animacy is a linguistically relevant feature, that is, a feature that in natural languages selects different determiner forms for different nouns. Chen et al. showed that use of another feature (smaller or larger than a prototypical dog) instead of animacy did not result in learning under the same conditions. Thus, implicit learning only becomes sensitive to some of the available regularities. We propose that these are the regularities with a high prior probability of being relevant within a particular domain, a proposal that needs further investigation (for other examples see Ziori & Dienes, 2008; also: Dienes, Kuhn, Guo, & Jones, 2012; Rohrmeier & Cross, 2010; Rohrmeier, Rebuschat, & Cross, in press). In sum, adults learning language structures acquire unconscious as well as conscious knowledge of a range of such structures in syntax and vocabulary.

8. Structural versus statistical learning

Some of the structures people learn in implicit learning experiments can be described as straightforwardly statistical: n-gram statistics, conditional probabilities, or even the joint or conditional probabilities of events a fixed distance apart (X - - - Y, where the blanks could be anything; Remillard, 2008). The semantic prosody in Guo et al. (2011), described above, is a statistical relation between a word and the valence of its context; the form-meaning correspondence in Williams (2004, 2005) and Chen et al. (2011) is a statistical association between a form and a semantic feature.
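For concreteness, the first kind of structure mentioned above (conditional probabilities over adjacent or fixed-distance elements) can be computed with a few lines of code. The toy syllable sequence and the function name below are purely illustrative assumptions, not materials from any of the studies cited.

    # Illustrative sketch: adjacent and fixed-distance conditional probabilities
    # computed over a toy syllable stream. Sequence and names are invented.
    from collections import Counter

    def conditional_probs(seq, lag=1):
        # Estimate P(Y | X) for elements occurring `lag` positions apart.
        pairs = Counter(zip(seq, seq[lag:]))
        firsts = Counter(seq[:len(seq) - lag])
        return {(x, y): n / firsts[x] for (x, y), n in pairs.items()}

    syllables = "pa bi ku ti bu do pa bi ku go la tu ti bu do".split()
    adjacent = conditional_probs(syllables, lag=1)  # bigram / transitional statistics
    distant = conditional_probs(syllables, lag=3)   # X - - - Y dependencies a fixed distance apart
    print(adjacent[("pa", "bi")])                   # 1.0: "bi" always follows "pa" in this toy stream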

However, not all structure is straightforwardly statistical. The structure of having mirror symmetry is not in itself statistical (Dienes & Longuet-Higgins, 2004; Kuhn & Dienes, 2005), although learning it may involve dealing with statistics (as e.g. the model of Kuhn & Dienes, 2008, does). While Rebuschat and Williams (2009) looked at learning what look like simple verb placement rules, true sensitivity to them requires sensitivity to phrases and clauses per se, and not just to words, or statistical relations between words at fixed positions. Similarly, implicit learning of recursively embedded phrase structure (Rohrmeier, Fu, & Dienes, submitted) involves more than learning statistics over the terminal elements themselves. Bayesian approaches of course recognize the need to specify prior structures (Perfors & Navarro, this volume). Bayesian approaches provide a framework for integrating learning of structure and statistics. However, calling the overall learning phenomenon ''statistical learning'' may prejudge divides in nature that may not exist (statistical versus structural learning) and may divert attention away from exploring the possible conscious vs unconscious learning of other interesting structures that are not straightforwardly regarded as statistical (e.g. the structural learning investigated by Z. P. Dienes & Jeeves, 1965; cf. Halford & Busby, 2007). Thus, it might be best to think of the field of research simply as the acquisition of (conscious and) unconscious knowledge of structure.

9. Conclusion

This chapter has argued that the distinction between conscious and unconscious structural knowledge is an important one for learning researchers to take into account. The chapter argues for a particular, simple trial-by-trial methodology (categorical attributions for the basis of judgments, as first introduced in Dienes & Scott, 2005) and attempts to justify it philosophically and scientifically. If the method does pick out the products of different learning mechanisms, as is argued, then even researchers not interested in the conscious–unconscious distinction per se, but simply interested in characterizing the nature of learning, would benefit from employing the method.

References

Allen, R., & Reber, A. S. 1980 Very long term memory for tacit knowledge. Cognition, 8, 175–185.

Berry, D. C. and Dienes, Z. 1993 Implicit learning: Theoretical and empirical issues. Hove: Lawrence Erlbaum. Block, N. 2001 Paradox and cross purposes in recent work on consciousness. Cognition, 79, 197–219. Broadbent, D. E. 1977 Levels, hierarchies, and the locus of control. Quarterly Journal of Experimental Psychology, 29, 181–201. Brooks, L. R. 1978 Non-analytic concept formation and memory for instances. In E. Rosch & B. Lloyd (Eds.), Cognition and concepts (pp. 169– 211). Hillsdale, NJ: Erlbaum. Carruthers, P. 2000 Phenomenal consciousness: A naturalistic theory. Cambridge University Press. Chen, W., Guo, X., Tang, J., Zhu, L., Yang, Z., & Dienes, Z. 2011 Unconscious Structural Knowledge of Form-meaning Connections. Consciousness & Cognition, 20, 1751–1760. Cleeremans, A. 2008 Consciousness: The radical plasticity thesis. Progress in Brain Science, 168, 19–33. Cleeremans, A., & Dienes, Z. 2008 Computational models of implicit learning. In R. Sun (Ed.), Cambridge Handbook of Computational Psychology. Cambridge University Press, pp. 396–421. Cleeremans, A. & Jime´nez, L. 2002 Implicit learning and consciousness: A graded, dynamic perspective. In R. M. French & A. Cleeremans (Eds.), Implicit Learning and Consciousness, Hove, UK: Psychology Press (pp. 1–40). Destrebecqz, A. & Cleeremans, A. 2001 Can sequence learning be implicit? New evidence with the process dissociation procedure Psychonomic Bulletin & Review, 8(2), pp. 343–350. Dienes, Z. 2008 Subjective measures of unconscious knowledge. Progress in Brain Research, 168, 49–64. Dienes, Z., Altmann, G., Kwan, L, Goode, A. 1995 Unconscious knowledge of artificial grammars is applied strategically. Journal of Experimental Psychology: Learning, Memory, & Cognition, 21, 1322–1338. Dienes, Z., Baddeley, R. J., & Jansari, A. 2012 Rapidly Measuring The Speed Of Unconscious Learning: Amnesics Learn Quickly And Happy People Slowly. PLoS ONE 7(3): e33400. doi:10.1371/journal.pone.0033400

Dienes, Z., Kuhn, G., Guo, X. Y., & Jones, C. 2012 Communicating structure, a¤ect and movement: Commentary on Bharucha, Curtis & Paroo. In Rebuschat, P., Rohrmeier, M., Cross, I., Hawkins (Eds), Language and Music as Cognitive Systems. Oxford University Press (pp. 156–168). Dienes, Z. & Longuet-Higgins, H. C. 2004 Can musical transformations be implicitly learned? Cognitive Science, 28, 531–558. Dienes, Z., & Perner, J. 1999 A theory of implicit and explicit knowledge. Behavioural and Brain Sciences, 22, 735–755. Dienes, Z., & Perner, J. (2002). A theory of the implicit nature of implicit learning. In French, R. M. & Cleeremans, A. (Eds), Implicit Learning and Consciousness: An Empirical, Philosophical, and Computational Consensus in the Making? Psychology Press (pp. 68–92). Dienes, Z., & Scott, R. 2005 Measuring unconscious knowledge: Distinguishing structural knowledge and judgment knowledge. Psychological Research, 69, 338–351. Dienes, Z., Scott, R. B., & Wan, L. L. 2011 The role of familiarity in implicit learning. Higham, P., & Leboe, J. (Ed.) Constructions of Remembering and Metacognition: Essays in honour of Bruce Whittlesea. Palgrave Macmillan (pp. 51–62). Dienes, Z., & Seth, A. 2010a The conscious and the unconscious. In G. F. Koob, M. Le Moal, & R. F. Thompson (Eds), Encyclopedia of Behavioral Neuroscience, volume 1, pp. 322–327. Oxford: Academic Press. Dienes, Z., & Seth, A. 2010b Gambling on the unconscious: A comparison of wagering and confidence ratings as measures of awareness in an artificial grammar task. Consciousness & Cognition, 19, 674–681. Dienes, Z., & Seth, A. 2010c Measuring any conscious content versus measuring the relevant conscious content: Comment on Sandberg et al. Consciousness & Cognition, 19, 1079–1080. Dienes, Z. P., & Jeeves, M. A. 1965 Thinking in structures. Hutchinson Educational. Dulany, D. E. 1962 The place of hypotheses and intentions: an analysis of verbal control in verbal conditioning. In C. W. Eriksen (Ed.), Behavior and awareness, Duke University Press, Durham, N.C. pp. 102– 129.

Dulany, D. E. 1991
Conscious representation and thought systems. In R. S. Wyer and T. K. Srull (Eds), Advances in social cognition, Lawrence Erlbaum Associates, Hillsdale, NJ, pp. 91–120. Eitam, B., Schul, Y., & Hassin, R. R. 2009 Goal relevance and artificial grammar learning. Quarterly Journal of Experimental Psychology, 2009, 62, 228–238. Ellis, N. C. 1994 Consciousness in second language learning: psychological perspectives on the role of conscious processes in vocabulary acquisition. AILA Review, 11, 37–56. Ericsson, K., & Simon, H. 1980 Verbal reports as data. Psychological Review, 87, 215–251. Fu, Q., Dienes, Z., & Fu, X. 2010 Can unconscious knowledge allow control in sequence learning? Consciousness & Cognition, 19, 462–475. Fu, Q., Fu, X., & Dienes, Z. 2008 Implicit sequence learning and conscious awareness. Consciousness and Cognition, 17, 185–202. Gebauer, G. F., Mackintosh, N. J. 2007 Psychometric Intelligence Dissociates Implicit and Explicit Learning. Journal of Experimental Psychology: Learning, Memory, & Cognition, 33, 34–54. Gebhart, A. L., Newport, E. L., and Aslin, R. N. 2009 Statistical learning of adjacent and non-adjacent dependencies among non-linguistic sounds. Psychonomic Bulletin & Review, 16, 486–490. Guo, X., Zheng, L., Zhu, L., Yang, Z., Chen, C., Zhang, L., Ma, W., & Dienes, Z. 2011 Acquisition of conscious and unconscious knowledge of semantic prosody. Consciousness & Cognition, 20, 417–425. Halford, G. S., & Busby, J. 2007 Acquisition of structured knowledge without instruction: The relational schema induction paradigm. Journal of Experimental Psychology: Learning, Memory, & Cognition, 33, 586–603. Hull, C. L. 1920 Quantitative aspects of the evolution of concepts: An experimental study. Psychological Monographs, 28, whole issue no. 123. Jacoby, L. L. 1991 A process dissociation framework: Separating automatic from intentional uses of memory. Journal of Memory and Language, 30, 513–541. Jime´nez, L., & Me´ndez, C. 1999 Which attention is needed for implicit sequence learning? Journal of Experimental Psychology: Learning, Memory, and Cognition, 25, 236–259.

Jamieson, R. K., & Mewhort, D. J. K. 2009 Applying an exemplar model to the artificial-grammar task: Inferring grammaticiality from similarity. Quarterly Journal of Experimental Psychology, 62, 550–575. Jime´nez, L., Vaquero, J. M. M., & Lupia´n˜ez, J. 2006 Qualitative Di¤erences Between Implicit and Explicit Sequence Learning. Journal of Experimental Psychology: Learning, Memory, and Cognition, 32, 475–490. Kiyokawa, S., Dienes, Z., Tanaka, D., Yamada, A., & Crowe, L. in press Cross Cultural Di¤erences in Unconscious Knowledge. Cognition. Kornell, N., Son, L. K., & Terrace, H. S. 2007 Transfer of metacognitive skills and hint seeking in monkeys. Psychological Science, 18, 64–7. Kuhn, G., & Dienes, Z. 2005 Implicit learning of non-local musical rules. Journal of Experimental Psychology: Learning, Memory, & Cognition, 31, 1417–1432. Kuhn, G., & Dienes, Z. 2008 Learning non-local dependencies. Cognition, 106, 184–206. Lau, H. C. 2008 A higher order Bayesian decision theory of consciousness. Progress in Brain Research, 168, 35–48. Lau, H. C., & Passingham, R. E. 2006 Relative blindsight in normal observers and the neural correlate of visual consciousness. Proceedings of the National Academy of Sciences, 103(49), 18763–18768. Lewicki, P. 1986 Nonconscious social information processing. New York: Academic Press. Mangan, B. 1993 Taking phenomenology seriously: the ‘‘fringe’’ and its implications for cognitive research. Consciousness and Cognition, 2, 89– 108. Masuda, T. & Nisbett, R. E. 2001 Attending holistically versus analytically: Comparing the context sensitivity of Japanese and Americans. Journal of Personality and Social Psychology, 81, 922–934. Merikle. P. M. 1992 Perception without awareness: Critical issues. American Psychologist, 47, 792–795. Navon, D. 1977 Forest before trees: The precedence of global features in visual perception. Cognitive Psychology, 9, 353–383. Nisbett, R. E. 2003 The geography of thought: How Asians and westerners think differently. New York: Free Press.

Norman, E., Price, M. C., Du¤, S. C., & Mentzoni, R. A. 2007 Gradations of awareness in a modified sequence learning task. Consciousness and Cognition, 16(4), 809–837. Pasquali, A., Timmermans, B., & Cleeremans, A. 2010 Know thyself: Metacognitive networks and measures of consciousness. Cognition, 117, 182–190. Perner, J., & Ru¤man, T. 1995 Episodic Memory and Autonoetic Conciousness: Developmental Evidence and a Theory of Childhood Amnesia. Journal of Experimental Child Psychology, 59, 516–548. Persaud, N., McLeod, P., & Cowey, A. 2007 Post-decision wagering objectively measures awareness. Nature Neuroscience, 10, 257–261. Phelan, J. G. 1965 A replication of a study on the e¤ects of attempts to verbalise on the process of concept attainment. Journal of Psychology, 59, 283–293. Pothos, E. M. 2007 Theories of artificial grammar learning. Psychological Bulletin, 133, 227–244. Reber, A. S. 1967 Implicit learning of artificial grammars. Journal of Verbal Learning and Verbal Behavior, 6, 317–327. Reber, A. S. 1989 Implicit learning and tacit knowledge. Journal of Experimental Psychology: General, 118(3), 219–235. Reber, A. S., & Lewis, S. 1977 Implicit learning: An analysis of the form and structure of a body of tacit knowledge. Cognition, 114, 14–24. Rebuschat, P. 2008 Implicit learning of natural language syntax. Unpublished PhD dissertation. University of Cambridge. Rebuschat, P. & Williams, J. 2009 Implicit learning of word order. In N. A. Taatgen & H. van Rijn (Eds.), Proceedings of the 31th Annual Conference of the Cognitive Science Society. Austin, TX: Cognitive Science Society. Reed, N., McLeod, P., & Dienes, Z. 2010 Implicit knowledge and motor skill: What people who know how to catch don’t know. Consciousness & Cognition, 19, 63–76. Remillard, G. 2008 Implicit learning of second-, third-, and fourth-order adjacent and nonadjacent sequential dependencies. The Quarterly Journal of Experimental Psychology, 61, 400–424.

Rohmeier, M., & Cross, I. 2010 Narmour’s principles a¤ect implicit learning of melody. In Demorest et al. (eds.), Proceedings of the 11th International Conference on Music Perception and Cognition (ICMPC 2010). Rohrmeier, M., Fu, Q., & Dienes, Z. submitted Implicit learning of recursive, hierarchical grammatical structures. Rohrmeier, M., Rebuschat, P., & Cross, I. 2011 Incidental and online learning of melodic structure. Consciousness & Cognition, 20(2): 214–22. Rommetveit, R. 1960 Stages in concept formation and levels of cognitive functioning. Scandinavian Journal of Psychology, 1, 115–124. Rosenthal, D. M. 2005 Consciousness and mind. Oxford: Oxford University Press. Rowland, L. A., & Shanks, D. R. 2006 Attention modulates the learning of multiple contingencies. Psychonomic Bulletin & Review, 13, 643–648. Ru¤man, T., Garnham, W., Import, A., & Connolly, D. 2001 Does eye gaze indicate implicit knowledge of false belief ? Charting transitions in knowledge. Journal of Experimental Child Psychology, 80, 201–224. Ru¨nger, D., & Frensch, P. A. 2010 Defining consciousness in the context of incidental sequence learning: theoretical considerations and empirical implications. Psychological Research, 74, 121–137. Scott, R. B., & Dienes, Z. 2008 The conscious, the unconscious, and familiarity. Journal of Experimental Psychology: Learning, Memory, & Cognition, 34, 1264– 1288. Scott, R., & Dienes, Z. 2010a The metacognitive role of familiarity in artificial grammar learning: Transitions from unconscious to conscious knowledge. In A. Efklides and P. Misailidi (Eds), Trends and Prospects in Metacognition Research. Springer (pp. 37–62). Scott, R. B., & Dienes, Z. 2010b Prior familiarity with components enhances unconscious learning of relations. Consciousness & Cognition, 19, 413–418. Scott, R. B., & Dienes, Z. 2010c Knowledge applied to new domains: The unconscious succeeds where the conscious fails. Consciousness & Cognition, 19, 391– 398. Searle, J. R. 1983 Intentionality: An essay in the philosophy of mind. Cambridge University Press.

Seth, A. K. 2008
Post-decision wagering measures metacognitive content, not sensory consciousness. Consciousness and Cognition, 17, 981–983. Seth, A., Dienes, Z., Cleeremans, A., Overgaard, M., & Pessoa, L. 2008 Measuring Consciousness: Relating Behavioural and Neurophysiological Approaches. Trends in Cognitive Sciences, 12, 314–321. Shanks, D. R. 2005 Implicit learning. In K. Lamberts & R. Goldstone (Eds.), Handbook of cognition (pp. 202–220). London: Sage. Shanks, D. R., & St. John, M. F. 1994 Characteristics of dissociable human learning systems. Behavioral & Brain Sciences, 17, 367–447. Smoke, K. L. 1932 An objective study of concept formation. Psychological Monographs, 42, whole number 191. Sun, R. 2002 Duality of mind: A bottom-up approach toward cognition. Erlbaum. Tanaka, D., Kiyokawa, S., Yamada, A., Dienes, Z., Shigemasu, K. 2008 Role of Selective Attention in Artificial Grammar Learning. Psychonomic Bulletin and Review, 15, 1154–1159. Tunney, R. J. 2007 The subjective experience of remembering in artificial grammar learning. European Journal of Cognitive Psychology, 19, 934– 952. Wan, L. L., Dienes, Z., & Fu, X. L. 2008 Intentional control based on familiarity in artificial grammar learning. Consciousness and Cognition, 17, 1209–1218. Williams, J. N. 2004 Implicit learning of form-meaning connections. In J. Williams, B. VanPatten, S. Rott & M. Overstreet (Eds.), Form Meaning Connections in Second Language Acquisition (pp. 203–218). Mahwah, NJ: Lawrence Erlbaum Associates. Williams, J. N. 2005 Learning without awareness. SALL, 27, 269–304. Ziori, E., & Dienes, Z. 2006 Subjective measures of unconscious knowledge of concepts. Mind & Society, 5, 105–122. Ziori, E. & Dienes, Z. 2008 How does prior knowledge a¤ect implicit and explicit concept learning? Quarterly Journal of Experimental Psychology, 61, 601– 624.

How implicit is statistical learning?

Phillip Hamrick and Patrick Rebuschat

Introduction

How learners extract knowledge from the environment is one of the fundamental questions in cognitive science. In this chapter, we will focus on two approaches that have gained in prominence over the past 15–20 years, namely implicit learning and statistical learning (see also Dienes, this volume; Misyak, Goldstein, & Christiansen, this volume). Implicit learning research began with Reber's (1967) early work and developed into one of the major paradigms in cognitive psychology (see Perruchet, 2008, for an overview). Statistical learning research was sparked by the work of Saffran and colleagues (Saffran, Aslin, & Newport, 1996) and now represents an important research strand in developmental psychology (see Gómez, 2007, for an overview). Both approaches focus on how we acquire information from the environment and both rely heavily on the use of artificial grammars. In typical experiments, subjects are first exposed to stimuli generated by an artificial system and then tested to determine what they have learned. Given these and other similarities, Perruchet & Pacton (2006) suggested that implicit and statistical learning represent two approaches to a single phenomenon. Conway & Christiansen (2006) go as far as combining the two in name: implicit statistical learning. Despite the considerable overlap between implicit and statistical learning research, there are several important differences. For example, one of the most distinctive features of statistical learning research is the careful manipulation of statistical information in the input. This aspect is generally absent in implicit learning studies. In addition, statistical learning research generally concentrates on how we acquire linguistic information, while implicit learning research focuses on information in general.1 For this reason, statistical learning researchers tend to employ artificial systems that resemble

1. As one of our reviewers pointed out, there are also several studies on statistical learning in different modalities and with non-linguistic stimuli.

natural languages more closely (phrase-structure grammars instead of finite-state systems, and pseudowords instead of letter sequences). Another important difference is that implicit learning researchers are generally concerned with the question of whether subjects acquire conscious (explicit) or unconscious (implicit) knowledge as a result of exposure. For this purpose, implicit learning studies usually contain measures of awareness.2 This holds across implicit learning paradigms (i.e., artificial grammar learning, sequence learning, control of complex systems). In contrast, statistical learning studies do not typically feature any measures of awareness as part of the experimental design. This is, of course, partially explained by the fact that infants are unable to provide verbal reports, indicate confidence levels, or perform on fragment completion tasks. However, many of the experiments conducted within the statistical learning framework employ adults as subjects, which means that basic measures of awareness could be administered. Usually, lack of awareness is assumed but not empirically assessed (see Aslin, Saffran, & Newport, 1999, though see e.g., Dell, Reed, Adams, & Meyer, 2000; Warker & Dell, 2006). As such, it is unclear whether statistical learning typically results in conscious or unconscious knowledge. Given that language comprehension and production are thought to be based on implicit knowledge, it seems important to determine whether subjects in statistical learning research develop this type of knowledge. The present study seeks to address this gap. Specifically, we investigated whether the knowledge acquired in a typical statistical learning experiment is conscious, unconscious, or both. We argue that measures of awareness can provide a richer understanding of the cognitive mechanisms involved in statistical learning and of the nature of the resulting knowledge. Before describing our experiment, though, it is important to review how the conscious and unconscious status of knowledge can be assessed.

2. In the research tradition started by Reber (1967), the use of the term implicit is generally restricted to those situations where subjects have acquired unconscious knowledge under incidental learning conditions. If incidental exposure in an experiment results in conscious knowledge, e.g. when subjects were able to figure out the rule system despite not having been told about its existence, the learning process is usually only characterized as being incidental and not as implicit. The same applies to those experiments that do not include a measure of awareness. The term explicit learning is usually applied to learning scenarios in which subjects are instructed to actively look for patterns, i.e. learning is intentional, a process which tends to result in conscious knowledge.

Measuring implicit and explicit knowledge of language

The question of whether knowledge acquired during incidental learning experiments is actually ''implicit'' is controversial.3 Proposals for measuring awareness include verbal reports, direct and indirect tests, and subjective measures (see Dienes & Seth, 2010; Rebuschat, submitted, for overviews).

Verbal reports

A common way of measuring awareness is to prompt subjects to verbalize anything they might have noticed while doing the experiment (e.g., Reber, 1967). Knowledge is considered to be unconscious if subjects perform above chance despite being unable to verbalize the knowledge that underlies their performance. The view that knowledge is unconscious when subjects are unable to verbalize the knowledge they have acquired has been criticized for a variety of reasons (see Perruchet, 2008). One problem is that subjects may only be able to verbalize knowledge after a long exposure period. Another problem is that verbal reports constitute a relatively insensitive and incomplete measure of awareness. For example, subjects may fail to verbalize knowledge because low-confidence knowledge retrieval may be difficult.

Direct and indirect tests

Several authors have advocated the contrastive use of direct and indirect tests as a more exhaustive measure of awareness (e.g., Reingold & Merikle, 1988). Direct and indirect tests are not, per se, measures of awareness, but actually measures of learning. Generally, the performance on two tasks is compared. The direct test is a measure that explicitly instructs subjects to make use of their knowledge (e.g., a free generation task). The task encourages subjects to access all relevant conscious knowledge in order to perform. The indirect test assesses subjects' performance without instructing them to use their acquired knowledge (e.g., serial reaction time task). Knowledge is assumed to be unconscious if an indirect test clearly indicates a learning effect, even though a direct test shows no evidence of learning. In the case of sequence learning, for example, Jiménez, Méndez, & Cleeremans (1996) used the serial reaction time (SRT; Nissen & Bullemer, 1987) task as an indirect measure and a generation task as a direct measure. Subjects performed on the two tasks successively. In the SRT task, subjects

3. As one of our reviewers pointed out, this might well be the reason why statistical learning researchers have avoided the implicit/explicit distinction in the first place.

saw a stimulus appear at one of several locations on a computer screen and were asked to press as fast and accurately as possible on the corresponding key. Unbeknownst to subjects, the sequence of successive stimuli was determined by an artificial grammar. In the generation task, subjects were asked to predict the location of the next stimulus by pressing the corresponding key. Jiménez et al. (1996) found that subjects had clearly learned to exploit the regularities inherent in the stimulus environment. More importantly, they also found that some knowledge about the sequential structure of the material was exclusively expressed in the indirect task (SRT), but not in the relatively similar direct task (generation). Their results suggest that this knowledge was unconscious. Direct and indirect measures have been criticized for a number of reasons, perhaps most often because the tests lack exclusivity. That is, they may not solely measure what they are supposed to. As Reingold & Merikle (1988) argue, direct tests are inadequate measures of conscious knowledge because they may be contaminated by unconscious knowledge. When direct tests indicate greater than zero sensitivity, it is unclear whether performance is driven exclusively by conscious knowledge or not. Thus, any approach based on these measures runs the risk of underestimating the influence of unconscious knowledge.

Subjective measures

Dienes (2004, 2008, this volume) has advocated the use of subjective measures in order to assess whether the knowledge acquired during Artificial Grammar Learning (AGL) tasks is conscious or unconscious. One way of dissociating conscious and unconscious processes is to collect confidence ratings (e.g., Dienes, Altmann, Kwan, & Goode, 1995). In AGL, for example, subjects can be asked to report, for each grammaticality judgment, how confident they were in their decision. Dienes et al. (1995) suggested two ways in which confidence rating data could serve as an index of unconscious knowledge. Firstly, knowledge can be considered unconscious if subjects believe they are guessing when their classification performance is, in fact, significantly above chance. Dienes et al. called this the guessing criterion. Secondly, knowledge is unconscious if subjects' confidence is unrelated to their accuracy. This criterion was labeled the zero correlation criterion by Dienes et al. Several studies have shown that performance on standard AGL tasks can result in unconscious knowledge according to these criteria (e.g., Dienes et al., 1995). Structural knowledge and judgment knowledge. One criticism that can be leveled at the use of confidence ratings concerns the type of knowledge that is assessed by this measure. Consider the case of natural language
acquisition (Dienes, 2008). Language acquisition is often considered a prime example of implicit learning. All cognitively unimpaired adults are able to discern grammatical sentences of their native language from ungrammatical ones, even though they are unable to report the underlying rule system. However, if asked how confident they are in their grammaticality decisions, most native speakers will report high confidence levels, as in: '' 'John bought an apple in the supermarket' is a grammatical sentence and I am 100% confident in my decision, but I do not know what the rules are or why I am right.'' Since in these cases accuracy and confidence will be highly correlated, does this mean that language acquisition is not an implicit learning process after all? Probably not. Dienes (2008; Dienes & Scott, 2005) proposed a convincing explanation for this phenomenon, based on Rosenthal's (2005) Higher-Order Thought Theory. Dienes suggested that, when subjects are exposed to letter sequences in an AGL experiment, they learn about the structure of the sequences. This structural knowledge can consist, for example, of knowledge of associations, whole exemplars, knowledge of fragments or knowledge of rules (e.g., ''A letter sequence can start with an M or a V.''). In the testing phase, subjects use their structural knowledge to construct a different type of knowledge, namely whether the test items shared the same underlying structure as the training items (e.g., ''MRVXX has the same structure as the training sequences.''). Dienes labeled this judgment knowledge. Both forms of knowledge can be conscious or unconscious. For example, a structural representation such as ''An R can be repeated several times.'' is only conscious if it is explicitly represented, i.e., if there is a higher-order thought such as ''I {know/think/believe, etc.} that an R can be repeated several times.'' Likewise, judgment knowledge is only conscious if there is a corresponding higher-order thought (e.g., ''I {know/think/believe, etc.} that MRVXX has the same structure as the training sequences.''). The guessing and the zero correlation criteria measure the conscious or unconscious status of judgment knowledge, not structural knowledge. Dienes & Scott (2005) assume that conscious structural knowledge leads to conscious judgment knowledge. However, if structural knowledge is unconscious, judgment knowledge could still be either conscious or unconscious. This explains why, in the case of natural language, people can be very confident in their grammaticality decisions without knowing why. Here, structural (linguistic) knowledge is unconscious while (metalinguistic) judgment knowledge is conscious. The phenomenology in this case is that of intuition, i.e. knowing that a judgment is correct but not knowing why. If, on the other hand, structural and judgment knowledge are unconscious,
the phenomenology is that of guessing. In both cases the structural knowledge acquired during training is unconscious. Dienes and Scott proposed that in AGL experiments the conscious status of both structural and judgment knowledge can be assessed concurrently by adding source attributions to the confidence ratings in the testing phase. That is, in addition to asking subjects how confident they were in their grammaticality judgments, one also prompts them to report the basis (or source) of their judgments. In the next section, we will describe an experiment that applied the subjective measures developed by Dienes (2004, 2008, this volume; Dienes & Scott, 2005) to an established paradigm in statistical learning research, namely cross-situational word learning.
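Before turning to that experiment, a minimal sketch of how the guessing criterion and the zero correlation criterion described above might be computed from trial-level data may be useful. The data, variable names, and the simple tests used below are illustrative assumptions rather than the analysis of any particular study.

    # Illustrative sketch only: the guessing and zero-correlation criteria applied
    # to made-up trial-level data (accuracy coded 1/0, confidence on a 50-100 scale).
    from scipy import stats

    accuracy   = [1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1]
    confidence = [50, 50, 60, 80, 50, 90, 70, 50, 100, 60, 50, 80]
    chance = 0.25  # chance level for a four-alternative forced-choice test

    # Guessing criterion: accuracy above chance on trials rated as pure guesses (50%).
    guesses = [a for a, c in zip(accuracy, confidence) if c == 50]
    t, p = stats.ttest_1samp(guesses, chance)
    print("guess-trial accuracy:", sum(guesses) / len(guesses), "p =", p)

    # Zero correlation criterion: is confidence related to accuracy at all?
    r, p = stats.pearsonr(confidence, accuracy)
    print("confidence-accuracy correlation:", r, "p =", p)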

Method

The following experiment had two objectives. The first objective was to illustrate how subjective measures can be applied to the investigation of statistical learning. The second objective was to determine what type of knowledge (conscious or unconscious) subjects acquire in a typical statistical (word) learning paradigm. To our knowledge, no statistical learning experiment has empirically assessed whether subjects acquire conscious or unconscious knowledge as a result of exposure. The experiment below adds subjective measures of awareness to the cross-situational word learning paradigm (e.g., Kachergis et al., 2010; Yu & Smith, 2007) to address this gap.4

Participants

Thirty native speakers of English (19 women and 11 men, mean age = 19.3) were randomly assigned to incidental or intentional learning conditions. There were no significant differences between the two groups in terms of age or language background, ps > .05.

4. A recent study by Kachergis et al. (2010) compared cross-situational word learning under incidental and intentional learning conditions. That is, the emphasis was on whether or not intention to learn played a role in word learning. However, the study did not include measures of awareness to assess the conscious or unconscious status of the acquired knowledge.

Stimuli

An artificial lexicon consisting of 27 auditory pseudowords was created for this experiment. All pseudowords were bisyllabic, stressed on the first syllable, and obeyed English phonotactics. The pseudowords were read aloud by a female native speaker of English, digitally recorded and subsequently edited by means of sound processing software (Audacity, version 1.2.4). Each pseudoword was then matched with one or more black-and-white drawings from the International Picture-Naming Project website (Szekely et al., 2004). To control for memorability, all possible stimuli (pseudowords and object images) were normed using two memory recognition tasks. Twelve undergraduates who were not involved in the main experiment participated in the memory recognition tasks. The procedure for both tasks was identical, with the only difference being whether pseudowords or pictures were used. For the pseudoword memory task, we divided our pool of pseudowords (n = 68) into set A (n = 34) and set B (n = 34). In the memorization phase, half the participants were then instructed to memorize set A, while the other half was instructed to memorize set B. In the test phase, all subjects performed on a recognition task: They were presented with the complete pool (n = 68) of pseudowords and asked to indicate which items they had previously encountered. The procedure for the picture memory task was the same, except that the stimuli in this version were images of objects (n = 68). All twelve participants were given both pseudoword and picture memory tasks. The order of the tasks was counterbalanced across participants. The analysis of the two tasks showed that the mean recognition rate was 3.52 for pseudowords and 3.76 for the pictures of objects. That is, on average 3.52 people recalled pseudowords correctly at test while 3.76 people recalled the pictures correctly. We decided to use items that had recognition rates between 3 and 4 (the closest to average) for our experiment, and discarded stimuli with memory rates of 1, 2, 5, and 6. This was because we neither wanted to use stimuli that were too easy to remember nor stimuli that were too difficult to retain. The selected stimuli were then randomly paired to create the artificial vocabulary. The lexicon was divided into 12 target items and 15 fillers.5 All filler items were unambiguous and only occurred once each in the input during

5. The reason for this manipulation is that we were also interested in determining the role of frequency in the development of implicit and explicit knowledge. These data will be reported elsewhere, however (Hamrick & Rebuschat, under review).

Table 1. Ambiguous and Unambiguous Target Items and Their Referents

Pseudoword   Referents (Co-Occurrence Frequency)
dobez        backpack (6), arrow (4), bathtub (2)
paylig       wheelchair (6), towel (4), bandage (2)
femod        bench (6), thumb (4), bridge (2)
whoma        comb (6), crib (4), fan (2)
houger       elephant (6), glass (4), pear (2)
jillug       ladder (6), leaf (4), mixer (2)
keemuth      mop (6)
nengee       panda (6)
zomthos      radio (4)
loga         stethoscope (4)
shrama       robot (2)
thueek       tank (2)

the exposure phase. The target items were subdivided into six lexically ambiguous pseudowords (one word, three matching referents) and six lexically unambiguous pseudowords (one word, one matching referent). All target words were manipulated in terms of their word-referent co-occurrence frequencies. Some words co-occurred with their matching referents six times, other words co-occurred with their appropriate referents four times, and others co-occurred with their appropriate referents twice. For example, the pseudoword houger occurred 12 times: Six times with an elephant, four times with a glass, and two times with a pear.

Procedure

The experiment was presented by means of a PC with a 15.6-inch screen using Microsoft PowerPoint 2007. Instructions were displayed in black text (Arial font sizes 20–24) on a white background. Pseudowords were played through headphones. The experiment consisted of an exposure phase and a testing phase. The testing phase was the same for both groups. The groups differed in how they interacted with the 57 exposure trials.

Exposure phase

In the exposure phase, subjects in both conditions were presented with the same 57 trials. In each trial, two images were displayed on the screen at the same time, one on the left, the other on the right side

Figure 1. Sample screenshot from the exposure phase. Participants in both experimental conditions were presented with the same 57 trials. In each trial, a fixation cross was first displayed for two seconds. This was followed by the concurrent presentation of two pictures and two spoken pseudowords. Importantly, the presentation order of the pseudowords was not related to the location of the image on the screen.

of the monitor. The two images were displayed for six seconds. While the images were on display, two pseudowords were played once. For example, subjects might see an image of a panda on the left and an image of a glass on the right, while hearing first the pseudoword houger, followed by the pseudoword femod. Importantly, the presentation order of the pseudowords was not related to the location of the image on the screen. That is, each word could refer either to the image on the left or to the image on the right. The only way for participants to learn the artificial vocabulary was to keep track of the pseudoword-object co-occurrences across trials. The order of trials was randomized for each participant.
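A minimal sketch of the co-occurrence bookkeeping that such cross-situational learning requires is given below. The toy trials and all names are invented for illustration and are not the experiment's actual materials or analysis code.

    # Illustrative sketch of cross-situational co-occurrence tracking across trials
    # that pair two spoken words with two pictures but give no cue as to which word
    # goes with which picture. Toy data; invented pairings.
    from collections import defaultdict

    cooccurrence = defaultdict(lambda: defaultdict(int))

    trials = [({"houger", "femod"}, {"elephant", "glass"}),
              ({"houger", "whoma"}, {"elephant", "comb"}),
              ({"femod", "whoma"}, {"bench", "comb"})]

    for words, objects in trials:
        for w in words:
            for o in objects:            # credit every heard word to every visible object
                cooccurrence[w][o] += 1

    # A word's best referent is simply the object it has co-occurred with most often.
    best = {w: max(obs, key=obs.get) for w, obs in cooccurrence.items()}
    print(best["houger"])                # 'elephant': 2 co-occurrences vs. 1 for the others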

Subjects in the intentional learning condition (n = 15) were told that they were participating in a word-learning experiment and were instructed to ''learn the meanings of the words''. They were also told that they would be tested afterwards. In contrast, subjects in the incidental learning condition (n = 15) were not informed about the true purpose of the experiment, nor did they know that they would be tested after the exposure phase. Instead, they were told that the objective of the study was to investigate how people with different language experience perceive and categorize objects. Their task during the exposure phase was to indicate how many objects on each slide were animate. There were three possible responses (zero, one, or two animate objects) and participants were instructed to

Figure 2. Sample screenshot of the four-alternative forced-choice (4AFC) picture matching task. The 4AFC task consisted of thirty trials. In each trial, participants were presented with four pictures, one in each corner of the screen, and a spoken pseudoword. Their task was to select the appropriate referent as quickly and accurately as possible. In addition, subjects were also asked to report how confident they were in their decision and what the basis of their decision was.

enter 0, 1, or 2 on their keypads. This task was made more difficult by the presence of pictures that were not easily classifiable as animate or inanimate (e.g., a thumb, a leaf). They were informed that they would have to do the task while hearing ''nonsense'' words through their headphones. In sum, all experimental subjects were exposed to the same 57 trials. The key difference between subjects in the intentional group and subjects in the incidental group is how they interacted with the stimuli. Subjects in the former group were instructed to learn the meanings of words, whereas subjects in the latter group were asked to perform on an irrelevant task and to treat the auditory pseudowords as a distraction.

Test phase

After the exposure phase, all participants were asked to complete a four-alternative forced-choice (4AFC) picture matching task. The 4AFC task consisted of thirty trials. In each trial, participants were presented with four pictures, one in each corner of the screen, and a spoken pseudoword. Their task was to select the appropriate referent as quickly and accurately as possible. For each trial, the screen contained one correct referent and three foils. Each picture was numbered (1 through 4) and participants indicated the best match by writing down their answers on an answer sheet. In addition, subjects were also asked to report how confident they were in their decision and what the basis of their decision was. Subjects were asked to place their confidence on a continuous scale, ranging from 50% (complete guess) to 100% (complete certainty). We emphasized that subjects should only use ''50%'' when they believed they were truly guessing, i.e., they might as well have flipped a coin. In the case of the source attributions, there were three response options: guess, intuition, and memory. The guess category indicated that subjects believed the classification decision to be based on a true guess. The intuition category indicated that they were somewhat confident in their decision but did not know why it was right. The memory category indicated that the judgment was based on the recollection of pseudoword-referent mappings from the exposure phase. All participants were provided with these definitions before starting the testing phase. At the end of the test phase, all subjects completed a debriefing questionnaire, which asked them to report if they had learned any of the pseudoword-referent mappings during exposure, whether or not they had used any specific learning strategies and, if so, what kind of strategies.

Results

Performance on the 4AFC task served as the measure of learning. Awareness was measured by means of confidence ratings and source attributions.

Four-alternative forced-choice task

The analysis of the 4AFC task showed that the incidental group classified 44.4% (SD = 7.5%) of the test items correctly and the intentional group 73.3% (SD = 10.7%). Both the incidental group, t(14) = 9.99, p < .05, and the intentional group, t(14) = 17.53, p < .05, performed significantly above chance (25%), which indicates that there was a clear learning effect for both groups. Further analysis showed that the difference between the two groups was significant, t(28) = 8.559, p < .001, i.e., the learning effect was greater under intentional learning conditions.

Confidence ratings

The average confidence level was 61.3% (SD = 7.2%) in the incidental group and 80.6% (SD = 6.3%) in the intentional group. The difference was significant, t(28) = 7.79, p < .05. Further analysis showed that accuracy and confidence were significantly correlated in the intentional group, r = .77, p < .05, but not in the incidental group, r = .45, p > .05. When intentional learners were confident in their decision, they tended to be accurate. This suggests that subjects in the intentional group had acquired conscious judgment knowledge: These participants were partially aware that they had acquired some knowledge during the exposure phase. In contrast, subjects in the incidental group were not aware of having acquired knowledge, despite the fact that their performance on the 4AFC task clearly indicates that they did. The zero correlation criterion was thus met in the case of the incidental group. We then analyzed all classification decisions for which subjects gave a 50% rating, i.e., they believed they had guessed when deciding on the appropriate referent for the pseudoword. When subjects in the incidental group gave a confidence rating of 50%, their classification performance was 34.0% (SD = 47.5%), which was significantly above chance, t(140) = 2.26, p < .05. In the case of the intentional group, when subjects gave a confidence rating of 50% their classification performance was 45.0% (SD = 50.3%), also significantly above chance, t(39) = 2.51, p < .05. That is, the guessing criterion for unconscious judgment knowledge was satisfied in
both groups. Subjects in both conditions had acquired at least some unconscious judgment knowledge. The confidence ratings thus indicate that the incidental group was largely unaware of having acquired knowledge during the exposure phase. In the case of the intentional group, subjects were clearly aware of having acquired knowledge (see correlation between confidence and accuracy), though some of their judgment knowledge did remain unconscious (as indicated by the guessing criterion).

Source attributions

In terms of proportion, the incidental group most frequently believed their classification decisions to be based on a guess or intuition (86% of judgments). The memory category was selected least frequently (only 14% of all judgments). That is, when performing on the 4AFC task, subjects in the incidental group generally based their decisions on the more implicit categories. In the case of the intentional group, the memory category was selected most frequently (61% of judgments), followed by guessing and intuition. In terms of accuracy, the analysis showed that the incidental group scored highest when reporting that their classification was based on memory, followed by the intuition and guess categories. The same pattern was observed in the intentional group, i.e., these subjects were most accurate when attributing their classification decision to memory. They were, however, considerably more accurate, performing close to 90% accuracy. Further analyses revealed significant effects of source attribution in both the incidental group, F(2,16) = 8.247, p < .05, and the intentional group, F(2,22) = 5.49, p < .05. In the case of the incidental group, the difference between decisions based on guessing and decisions based on intuition was significant, p < .05, as was the difference between decisions based on guessing and those based on memory, p < .05. In the case of the intentional group, the differences between decisions based on guessing and intuition, guessing and memory, and intuition and memory were all significant, p < .05. Interestingly, subjects in both groups performed significantly above chance across categories, irrespective of whether they attributed their decision to guessing, intuition, or memory. The guessing criterion was therefore satisfied in both groups: When subjects gave a confidence rating of 50%, indicating that they were forced to guess the right answer in the 4AFC task, their actual classification performance suggests that they had acquired the knowledge to make that decision. This suggests that subjects in both groups acquired at least some unconscious structural knowledge. Table 2 shows the classification performance for the different attributions.
Table 2. Accuracy and proportions (%) across source attributions

                           Guess    Intuition   Memory
Incidental    Accuracy     35.8*    48.5**      61.4**
              Proportion   44.2     41.7        14.1
Intentional   Accuracy     54.2**   61.9**      88.9**
              Proportion   23.2     27.9        48.9

Significance from chance (25%): *p < .01, **p < .001.

Verbal reports

Analysis of the verbal reports showed that learners in the intentional condition became aware of many pseudoword-referent pairs and were able to name a few. When prompted for strategies, the most commonly reported strategies were repeating the pseudowords, making a link between pseudowords and prior knowledge (e.g., ''that sounded like something in French''), and hypothesis testing. In contrast, subjects in the incidental group reported deliberately trying to block out the pseudowords. Indeed, many interpreted the pseudowords to be a distraction and consequently tried to ignore them.

Discussion

The results of the present experiment show that subjects can use statistical information to learn new words, which is consistent with previous research (e.g., Yu & Smith, 2007). Moreover, statistical word learning can take place under both intentional and incidental learning conditions (see also Kachergis et al., 2010). Subjects learn to associate pseudowords with their appropriate referents even when they are instructed to perform on an irrelevant task (indicating animacy) and disregard the auditory pseudowords. However, the learning effect is greater under intentional learning conditions. Instructing subjects to ''learn the meanings of the words'' resulted in higher accuracy in the 4AFC task. The analysis of the confidence ratings and the source attributions showed that the learning condition plays an important role in the type of
knowledge that subjects acquire. Under incidental learning conditions, subjects developed primarily implicit knowledge. These subjects were not aware of having acquired knowledge, reported most of their decisions to be based on guessing or on intuition, and performed significantly above chance in the discrimination task even when they believed they were guessing. In contrast, subjects in the intentional learning condition acquired primarily conscious knowledge. These subjects were aware of having acquired knowledge and tended to be highly accurate when reporting high levels of confidence. Moreover, these subjects also acquired some unconscious knowledge. These findings are consistent with implicit learning research, where it is often found that subjects develop unconscious knowledge even when explicitly instructed to discover rule structure (e.g., Guo et al., in press; Rebuschat, 2008, Experiment 6; Rebuschat & Williams, 2009). Our data suggest that incidental statistical learning is more likely to result in implicit knowledge. They also suggest that instructing subjects to learn the meanings of words prompts them to use strategies resulting in explicit knowledge. Interestingly, subjects using explicit strategies still acquired some unconscious knowledge. This supports the view that implicit learning may proceed in parallel with explicit learning (cf. Guo et al., in press). Thus, statistical word learning may result in both implicit and explicit knowledge, and learning conditions appear to influence the extent to which each type of knowledge develops. The experiment illustrates the usefulness of including measures of awareness when researching statistical learning. Depending on the learning conditions of the experiment, subjects may rely on general learning mechanisms, explicit strategies, or both, and the use of subjective measures provides a method for determining the contributions of different learning conditions. Likewise, it may also be the case that different statistical regularities differentially promote implicit or explicit knowledge (Perruchet & Pacton, 2006). Measures of awareness would allow us to empirically verify such a hypothesis. In summary, the present study shows that statistical word learning can result in both implicit and explicit knowledge; however, the amount and quality of each kind of knowledge was influenced by learning condition. Thus, subjective measures of awareness may provide researchers with a richer characterization of the knowledge acquired in statistical learning experiments. We argue that the present paper brings to light some important considerations for further investigations of the relationship between implicit learning and statistical learning.

Acknowledgments

The authors would like to thank Luke Amoroso, Natalie Brito, Katie Jeong-Eun Kim, Julie Lake, Kaitlyn Tagarelli and three anonymous reviewers for their helpful feedback. Special thanks to Jennifer Johnstone and Michelle To for their support.

References Aslin, R. N., Sa¤ran, J. R., & Newport, E. L. 1999 Statistical learning in linguistic and non-linguistic domains. In B. MacWhinney (Ed.), Emergentist approaches to language. Hillsdale, NJ: Lawrence Erlbaum. Conway, C. M., & Christiansen, M. H. 2006 Statistical learning within and between modalities. Psychological Science, 17, 905–912. Dell, G. S., Reed, K. D., Adams, D. R., Meyer, A. S. 2000 Speech errors, phonotactic constraints, and implicit learning: A study of the role of experience in language production. Journal of Experimental Psychology: Learning, Memory, and Cognition, 26, 1355–1367. Dienes, Z. 2004 Assumptions of subjective measures of unconscious mental states: Higher order thoughts and bias. Journal of Consciousness Studies, 11(9), 25–45. Dienes, Z. 2008 Subjective measures of unconscious knowledge. Progress in Brain Research, 168, 49–64. Dienes, Z. this volume Conscious versus unconscious learning of structure. In P. Rebuschat & J. N. Williams (Eds.) Statistical learning and language acquisition. Berlin: Mouton de Gruyter. Dienes, Z., & Scott, R. 2005 Measuring unconscious knowledge: distinguishing structural knowledge and judgment knowledge. Psychological Research, 69, 338–351. Dienes, Z., & Seth, A. 2010 The conscious and the unconscious. In G. Koob, M. Le Moal, & R. F. Thompson (Eds), Encyclopedia of Behavioral Neuroscience, vol. 1, (pp. 322–327). Oxford: Academic Press. Dienes, Z., Altmann, G., Kwan, L., & Goode, A. 1995 Unconscious knowledge of artificial grammars is applied strategically. Journal of Experimental Psychology: Learning, Memory, and Cognition, 21(5), 1322–1338.

Gómez, R. L. 2007
Statistical learning in infant language development. In M. G. Gaskell (Ed.), The Oxford Handbook of Psycholinguistics. Oxford: Oxford University Press. Guo, X., Zheng, L., Zhu, L., Yang, Z., Chen, C., Zhang, L., Ma, W., & Dienes, Z. in press Acquisition of conscious and unconscious knowledge of semantic prosody. Consciousness and Cognition. Hamrick, P., & Rebuschat, P. under review Measuring frequency e¤ects on the development of implicit and explicit lexical knowledge. Proceedings of the 2012 Georgetown University Roundtable on Languages and Linguistics Conference. Washington, D.C. Jime´nez, L., Me´ndez, C., & Cleeremans, A. 1996 Comparing direct and indirect measures of implicit learning. Journal of Experimental Psychology: Learning, Memory, and Cognition, 22, 948–969. Kachergis, G., Yu, C., & Shi¤rin, R. M. 2010 Cross-situational statistical learning: Implicit or intentional? Proceedings of the 32nd Annual Conference of the Cognitive Science Society. Austin, TX: Cognitive Science Society. Misyak, J. B., Goldstein, M. H., & Christiansen, M. H. this volume Statistical-sequential learning in development. In P. Rebuschat & J. N. Williams (Eds.) Statistical learning and language acquisition. Berlin: Mouton de Gruyter. Nissen, M. J., & Bullemer, P. 1987 Attentional requirements of learning: Evidence from performance measures. Cognitive Psychology, 19, 1–32. Perruchet, P. 2008 Implicit learning. In J. Byrne (Ed.), Learning and memory: A comprehensive reference (Vol. 2, pp. 597–621). Oxford: Elsevier. Perruchet, P., & Pacton, S. 2006 Implicit learning and statistical learning: One phenomenon, two approaches. Trends in Cognitive Sciences, 10(5), 233–238. Reber, A. S. 1967 Implicit learning of artificial grammars. Journal of Verbal Learning and Verbal Behavior, 6, 317–327. Rebuschat, P. 2008 Implicit learning of natural language syntax. Unpublished PhD dissertation, University of Cambridge. Rebuschat, P. submitted Measuring implicit and explicit knowledge in second language research: A review. Rebuschat, P., & Williams, J. N. 2009 Implicit learning of word order. Proceedings of the 31st Annual Conference of the Cognitive Science Society. Austin, TX: Cognitive Science Society.


Reingold, E. M., & Merikle, P. M. 1988 Using direct and indirect measures to study perception without awareness. Perception & Psychophysics, 44, 563–575.
Rosenthal, D. M. 2005 Consciousness and mind. Oxford: Oxford University Press.
Saffran, J. R., Aslin, R. N., & Newport, E. L. 1996 Statistical learning by 8-month-old infants. Science, 274, 1926–1928.
Szekely, A., Jacobsen, T., D'Amico, S., Devescovi, A., Andonova, E., Herron, D., . . . Bates, E. 2004 A new on-line resource for psycholinguistic studies. Journal of Memory and Language, 51, 247–250.
Warker, J. A., & Dell, G. S. 2006 Speech errors reflect newly learned phonotactic constraints. Journal of Experimental Psychology: Learning, Memory, and Cognition, 32, 387–398.
Yu, C., & Smith, L. B. 2007 Rapid word learning under uncertainty via cross-situational statistics. Psychological Science, 18, 414–420.

What Bayesian modelling can tell us about statistical learning: What it requires and why it works

Amy Perfors and Daniel J. Navarro

Introduction

The purpose of this chapter is to address some issues regarding the why and what of statistical learning, with a particular focus on Bayesian computational modelling and language acquisition. We begin with a brief introduction to Bayesian modelling, contrasting it with the other primary computational approach to statistical learning (connectionist modelling), and demonstrating how it clarifies some common confusions about what statistical learning is and is not. The chapter is structured around a series of questions: What is statistical learning? What data does statistical learning operate on? What knowledge does the learner acquire from the data? What assumptions do learners make about the data? What prior knowledge does the learner possess? Finally, why does statistical learning work? Each of these is a big topic in itself, so we aim only to provide a general introduction to them, covering some but not all of the issues involved.

What is statistical learning?

Statistical learning encompasses a wide variety of learning situations in which the knowledge acquired by the learner is highly dependent on the statistical structure of the data that they are given. From an empirical perspective, researchers are interested in finding out whether and to what extent people are sensitive to statistical structure (e.g., the frequencies of different events) when learning from data. From a formal perspective, we aim to describe the abstract principles and processes that are necessary to explain how the learner might acquire knowledge based on statistical input. Statistical learning can be distinguished from learning that relies solely on deterministic rules, such as the subset principle (Berwick, 1986), or from learning that requires a certain type of input before acting, like "trigger" learning (Gibson & Wexler, 1994).


Strictly speaking, any model that learns primarily by exploiting the statistical structure of the data is a statistical learner, but in practice computational modelling in cognitive science has focused on two particular types of statistical learners: connectionist models and Bayesian models. Connectionist modelling grew in popularity in cognitive science beginning in the 1980s, and has led to advances in our understanding of multiple areas in cognitive science, including categorisation, verb learning, and semantic representation. At its core connectionism is a learning theory inspired by the architecture of the brain. Connectionist networks consist of a collection of nodes connected by weighted links, which propagate activation between the nodes. Computation is performed by the pattern of activation that is passed, and learning is achieved by adjusting the link weights. Bayesian modelling is a somewhat more recent approach in cognitive science, having emerged over the last decade, but it is inspired by the statistical theory of probabilistic inference that dates back to the 18th century. It differs from connectionist modelling largely in terms of its assumptions about representation and learning. Since this chapter approaches statistical learning from a Bayesian perspective, we will now give a brief introduction to the Bayesian approach. Later on, we discuss the relationship between Bayesian and connectionist theories.

Bayesian theories of language and cognition draw their inspiration from probability theory, which provides a normative theory for describing how to learn from noisy data. The framework revolves around Bayes' rule, which describes how an ideal learner should update his or her beliefs in light of data, denoted d. Suppose that there is a set of hypotheses H = {h_1, h_2, h_3, ...} that the learner could potentially believe to be the correct theory as to the origin of the data d. This set of hypotheses is referred to as the hypothesis space. Before the data have been observed, the degree of belief that the learner assigns to the ith hypothesis is P(h_i), the prior probability that h_i is the correct one. Because each of these hypotheses yields precise predictions about what data would be expected if it were true, it is possible for the learner to assess the likelihood P(d | h_i), the probability of observing data d if the true hypothesis were h_i. Bayes' rule then provides a method for belief updating, in which the posterior probability P(h_i | d) that the learner assigns to this hypothesis is given by:

P(h_i | d) = P(d | h_i) P(h_i) / Σ_{h_j ∈ H} P(d | h_j) P(h_j)    (1)


In this expression, the numerator multiplies the prior by the likelihood: hypotheses that are more consistent with the data receive a larger multiplier than those that are not. The denominator is just a normalizing term, to make sure that the posterior probabilities all sum to 1. Qualitatively, the key idea is that the prior beliefs P(h_i) are modified by the data through the likelihood P(d | h_i). As more data are observed, the likelihood term becomes more and more important, and so the learner will come to assign the highest posterior probability to those hypotheses that are most consistent with the data, regardless of what the prior beliefs were. A Bayesian model, therefore, is built from three distinct parts. Firstly, we need to specify the hypothesis space H, the set of things that might be true. Secondly, we need to specify the learner's prior beliefs P(h). Finally, we need to specify the likelihood function P(d | h) that relates the hypothesis to data. A natural question to ask is how to interpret the model. In general, Bayesian models do not try to describe any particular cognitive processes. People are not literally assumed to be computing posterior probabilities by mechanically applying Equation 1. Indeed, as is often pointed out in both the statistics literature and the cognitive science literature, these calculations can be extremely time consuming. Even given the impressive computational power that the brain provides, it is highly unlikely that Bayesian updating describes a literal mechanism for human learning. Instead, what it gives us is an abstract, ideal solution to the learning problem at hand. As such, it provides answers about what is and is not possible for the learner to learn, and provides a standard against which real learners can be assessed. As a consequence, Bayesian models are not focused on cognitive processes: they are focused on trying to understand the abstract goals of the cognitive system. What problem does it solve? How do the constraints under which it solves that problem affect what is learned? Why does the cognitive system have these goals? What would a good solution look like, and why would it be good? Bayesian models, like any other family of models, vary in many particulars. Depending on the nature of the problem, they may incorporate different hypotheses and hypothesis spaces, different assumptions about how the data is sampled from the environment, and different assumptions about which hypotheses have the highest prior probability. One advantage of Bayesian modelling is that the assumptions of the model are made explicit. Representational constraints are specified clearly by describing the hypothesis space, preexisting beliefs are specified through the prior, and the learner's interpretation of data is specified through the likelihood.
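To make Equation 1 concrete, the update can be spelled out for a small, discrete hypothesis space. The sketch below is ours rather than the authors'; the three hypotheses and their numbers are invented purely for illustration, and the code simply multiplies each prior by its likelihood and renormalises, exactly as the equation prescribes.

# Illustrative sketch (Python) of the Bayesian update in Equation 1 for a
# small discrete hypothesis space. The hypotheses and numbers are invented.

def posterior(priors, likelihoods):
    """priors: dict h -> P(h); likelihoods: dict h -> P(d | h)."""
    unnormalised = {h: likelihoods[h] * priors[h] for h in priors}
    z = sum(unnormalised.values())            # the normalising denominator
    return {h: p / z for h, p in unnormalised.items()}

# Three toy hypotheses about the process that generated an observed datum d.
priors = {"h1": 0.5, "h2": 0.3, "h3": 0.2}
likelihoods = {"h1": 0.01, "h2": 0.10, "h3": 0.02}   # P(d | h) for each h

print(posterior(priors, likelihoods))
# h2 ends up with most of the belief: its higher likelihood outweighs its
# lower prior, as the qualitative description above suggests.

Repeating the same update datum by datum shows the likelihood gradually swamping the prior, which is the behaviour described in the text.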


As a consequence, it is comparatively easy to manipulate these assumptions and evaluate precisely how they affect what can be learned given certain data. Indeed, much of this chapter will discuss this question. The flip side of this advantage, however, leads to one of the main perceived disadvantages of Bayesian models: they can appear to have much more "built in" than other types of statistical learning models. This is particularly problematic for those who are interested in questions of innateness, for whom the goal is to build models that assume as little knowledge as possible a priori. While there are specific cases where Bayesian models rely on a lot of assumed knowledge, as a general critique of Bayesian models this problem is overstated, for two basic reasons. Firstly, drawing on work in Bayesian statistics, it is quite possible to construct Bayesian models that make very few assumptions. In nonparametric Bayesian models, for instance, the hypothesis spaces are constructed so as to have "support" across the space of all probability distributions. Without going into technical details, what this means is that these models do not place strong prior constraints on what knowledge the learner can acquire. Secondly, it is important to recognise that all models must incorporate some assumptions about what hypotheses can be represented, what hypotheses are considered more likely a priori, and so on. As illustrated by learnability problems such as Goodman's (1955) problem of induction, Quine's (1960) indeterminacy of translation, and Gold's (1967) theorem, when faced with a complex learning problem that can have an infinite number of possible solutions, some prior biases are necessary in order to learn effectively. In this respect, the main difference between the various modelling frameworks is not whether they make prior assumptions: it is the extent to which these assumptions are made explicit. Having described Bayesian models of statistical learning, we now explore how they can be used to elucidate and explore some of the issues involved in such learning, in particular the what and the why. We begin with one of the important "what" questions, in particular, how the units that statistical learning operates over affect the sort of inferences that can be made.

What data does statistical learning operate on?

At its most basic, statistical learning describes a process in which people acquire knowledge from probabilistic data. Within the Bayesian framework, this learning is governed by the likelihood function, P(d | h). A natural question to ask, therefore, is what actually constitutes the data d


from which people learn. This fundamental question attaches to a range of problems in the study of language. Two cases of particular interest are: (1) At what level of language (phoneme, syllable, word, etc.) should we describe the input? (2) What should count as a single observation: a type or a token? We choose these because they are interesting questions in their own right, but also because each of these will serve as the motivation for one of the later sections in the chapter.

At what level should we describe the learner's data?

In order to motivate our discussion, we consider one of the most empirically robust demonstrations of human statistical learning: the fact that people are extremely sensitive to the transition probabilities that characterize the short-range sequential dependencies between linguistic units (Saffran, Aslin, & Newport, 1996; Aslin, Saffran, & Newport, 1998; Saffran & Thiessen, 2003). The natural assumption is that sensitivity to transition probabilities can help the learner solve problems like word segmentation, since the transition probability for pairs of linguistic units that cross word boundaries tends to be lower than transition probabilities for units that are contained within the same word (Harris, 1955). However, the story is more subtle than this: transition-probability-based statistical learning is dependent on the units over which transition probabilities are calculated and used. In empirical work it has been typical to define transition probabilities at the level of the syllable, and to examine how these transition probabilities help people solve the word segmentation problem (e.g., Saffran et al., 1996; Aslin et al., 1998), although there are studies that look at word-level transition probabilities to learn syntax (e.g., Thompson & Newport, 2007). A typical experiment involves exposing learners to input from an artificial language in which the syllable transition probability tends to be low when two syllables lie on either side of the word boundary, and high when the two syllables are wholly contained within a single word. The fact that people can learn the correct word segmentations within the artificial language suggests that syllable-level transition probabilities are a powerful source of evidence. However, unfortunately, natural language is full of instances in which syllables cross word boundaries: Brent (1999b) gives the example of teak rail, whose syllabic boundaries are /ti/ and /krel/, but whose word boundaries are /tik/ and /rel/. Even if the syllable transition probability from /ti/ to /krel/ were low enough to indicate the presence of a word boundary, it does not provide the learner with the information required to infer that the location of the boundary is between /tik/ and /rel/.
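Before turning to how models handle such ambiguity, the basic syllable-level computation can be made concrete. The sketch below is our own: the three-word toy language is modelled on familiar artificial-language stimuli, and the rule "place a boundary wherever the forward transition probability falls below a fixed threshold" is a deliberate simplification rather than any published model.

# Sketch of syllable-level transitional probabilities (TPs) for segmentation.
# The miniature language and the fixed 0.5 boundary threshold are
# simplifications for illustration only.
import random
from collections import Counter

random.seed(0)
words = ["tupiro", "golabu", "bidaku"]
syllabify = lambda w: [w[i:i + 2] for i in range(0, len(w), 2)]

# An unsegmented "speech stream": 200 randomly chosen words, syllable by syllable.
stream = []
for _ in range(200):
    stream.extend(syllabify(random.choice(words)))

pair_counts = Counter(zip(stream, stream[1:]))
first_counts = Counter(stream[:-1])
tp = {pair: c / first_counts[pair[0]] for pair, c in pair_counts.items()}

# Insert a boundary wherever the forward TP falls below the threshold.
segments, current = [], [stream[0]]
for a, b in zip(stream, stream[1:]):
    if tp[(a, b)] < 0.5:
        segments.append("".join(current))
        current = []
    current.append(b)
segments.append("".join(current))
print(segments[:6])   # mostly recovers the original three words

Within-word TPs here are close to 1 and between-word TPs close to 1/3, so even this crude rule succeeds; the point of the teak rail example is precisely that natural language does not behave this conveniently.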


As a consequence of this ambiguity, most computational models of word segmentation use input in which phonemes are the basic unit, not syllables. However, computational models allow a finer-grained investigation of the issue: by comparing the performance of models that rely on different assumptions, we can obtain more insights into the level of abstraction at which transition probabilities are calculated. For instance, it might be possible for the learner to calculate transition probabilities at the syllable level only, and try to memorize the ambiguous cases. Alternatively, a model might operate solely at the phonetic level, requiring no syllable-level calculations. Finally, the learner might try something in between, in which syllable-level transition probabilities do the vast majority of the work, but the learner relies on phoneme-level transition probabilities to handle ambiguous cases. In short, comparing the performance of different computational models can yield further insight about what units transitional probabilities must be calculated over in order for words to be successfully segmented from fluent speech.

Learning from types or from tokens?

The second sense in which the units of analysis matter is whether we learn on the basis of types or tokens. Although there is extensive psycholinguistic and developmental evidence that people are sensitive to token-level frequency variation in a variety of contexts, there are also reasons to think that for at least some kinds of inference, it may be more sensible to rely on type frequencies rather than token frequencies (or an interpolation between the two). For instance, one of the most robust empirical regularities in language is the power-law distribution in the frequency of word tokens: a few words are extraordinarily common, and many are extremely infrequent (Zipf, 1932). These power-law distributions appear to apply to many other elements of language, not just words (see Briscoe, 2006, for an overview). However, many standard statistical models fail to capture this distribution; for instance, context-free grammars capture an exponential rather than power-law distribution. Recent work addresses this shortcoming using a two-stage Bayesian model for language learning called an adaptor grammar, which separates the question of how forms are generated from the question of how frequent those forms are (Goldwater, Griffiths, & Johnson, 2006b; Johnson, Griffiths, & Goldwater, 2007). The framework corresponds to assuming that language users can generate tokens either by drawing on a memory store of familiar types, or by generating a type anew based on deeper linguistic principles. The model can simultaneously


infer which underlying linguistic generalisation is correct, as well as whether it is more sensible to perform inferences on the basis of types, tokens, or a mixture of the two. This model has been used for unsupervised acquisition of the morphology of English (e.g., learning that the word helped can be parsed into the stem help and the separate past-tense suffix -ed). This sort of knowledge is useful to children for making productive generalisations of novel verbs. Adaptor grammars have also been applied to the problem of grammar induction. For instance, work by Perfors, Tenenbaum, and Regier (2011) suggests that the nature of the abstract grammatical inferences a learner is justified in making can depend on whether they assume that learning should be done over sentence types or tokens. A learner who assumes that grammar induction should be done on the basis of types will infer that grammars with hierarchical phrase structure are the best fit to child-directed speech; a learner who assumes that it should be done on the basis of tokens will prefer grammars without hierarchical phrase structure. Which of these learners is most sensible? We can address this question by evaluating what a learner capable of interpolating between types and tokens – determining which interpretation of the data will lead to the highest overall probability – would conclude, as in Johnson et al. (2007). Results indicate that such a learner would probably¹ infer that a more type-based analysis is more appropriate, and that, overall, the grammars with the highest probability are those with hierarchical phrase structure. The point here is that the inferences possible from types can be different from the inferences possible from tokens, and that realising which is most appropriate for a given problem is an important task facing the learner. Research into this question is still in its infancy, so many open questions remain.

1. Because the search problem in this sort of grammatical learning is so difficult, the results are based on approximations, rather than an exhaustive search of the entire space of grammars and type-to-token-based interpolations.
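The intuition behind a two-stage "reuse a stored type or generate a new one" process can be illustrated with a toy simulation. The sketch below is a simple Chinese-restaurant-style generator of our own, not the published adaptor grammar model, and the parameter values are invented.

# Toy two-stage token generator: each new token either reuses an existing
# stored type (with probability proportional to that type's token count) or
# creates a brand-new type. A simplification for illustration only, not the
# published adaptor grammar model.
import random
from collections import Counter

random.seed(1)
alpha = 5.0                       # willingness to create new types
counts = Counter()

for _ in range(5000):
    total = sum(counts.values())
    if random.random() < alpha / (alpha + total):
        word = f"type{len(counts)}"                    # generate a new type
    else:
        # reuse an existing type; frequent types are reused more often
        word = random.choices(list(counts), weights=counts.values())[0]
    counts[word] += 1

freqs = sorted(counts.values(), reverse=True)
print(freqs[:5], "...", freqs[-5:])
# A handful of very frequent types and a long tail of rare ones: token
# frequencies come out heavily skewed even though the generator of new types
# is extremely simple, which is the separation the adaptor idea exploits.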

What knowledge does the learner acquire from the data?

Uncovering the form and content of the mental representations that the learner relies upon is a central question in any discussion of language acquisition and cognitive science more generally. Are the rules of syntax best described in terms of regular grammars, context-free grammars, context-sensitive grammars, or something else? Are phonetic categories


represented in the same way as other perceptual categories? If so, are these categories represented in terms of prototypes, exemplars, distribution estimates, decision boundaries, or something else? These questions and many others all form part of the broader issue of describing the knowledge acquired by the learner. The scope of this issue is thus too wide to cover in detail. With this in mind, we restrict ourselves to two topics. Firstly, we follow on from our previous discussion of sensitivity to transition probabilities in word segmentation, and try to show the range of mental representations that can be explored by computational models. We then pick a single issue that these models open up (learning on multiple levels), and follow it through in some detail.

Modelling the word segmentation process

In our earlier discussion of transition probabilities, the focus was primarily on the data: when addressing the word segmentation problem, is it sensible to assume that the learner computes transition probabilities between phonemes, between syllables, or something more complex? However, while transition probabilities are presumably useful to the learner when trying to solve the word segmentation problem, it is clearly the case that the actual knowledge acquired by the learner is much richer. To see this, suppose that the learner has successfully calculated all transitional probabilities (be they phoneme transition probabilities or syllable transition probabilities). How should he or she segment the speech into words on the basis of this knowledge? Is it sufficient for the learner to infer word boundaries whenever the transition probability is under some threshold? If so, how is that threshold set? Alternatively, do the statistics of the rest of the language play some role in determining the word boundaries? If so, how does this work? These are the kinds of questions that computational models can help address. In doing so, they open up new and interesting questions. For example, few models directly apply a simple threshold to learn word segmentations. Instead, transition probability is assumed to be one cue available to the learner, who seeks to extract some explicit theory about the lexicon. In some cases, the model aims to learn the word boundaries directly (e.g., Elman, 1990; Cairns, Shillcock, & Chater, 1997; Christiansen, Allen, & Seidenberg, 1998), while other models focus on learning the lexicon itself, identifying the boundaries as a side effect (de Marcken, 1995; Brent & Cartwright, 1996; Perruchet & Vinter, 1998; Brent, 1999a; Venkataraman, 2001; Swingley, 2005; Goldwater, Griffiths, & Johnson,


2006a, 2009). The motivation for the second group of models is that word segmentation is not the primary goal of the learner: rather, the main goal is to identify the words themselves. As such, it is argued that the better approach is to build word segmentation models that are focused on learning the lexicon itself. Thus we have three different ideas about the nature of the knowledge that the learner acquires during the word segmentation process: raw transition probabilities, word segmentation data, lexical knowledge. These three ideas differ in terms of the extent to which the learner is assumed to abstract away from the raw data. Calculating transition probabilities involves very little abstraction, assigning word boundaries involves only the minimal amount of abstraction required to solve the segmentation problem, and lexical knowledge involves moving beyond the basic problem and focusing on the broader problems facing the language learner. Evaluating all three of these possibilities is important for understanding what kind of knowledge the learner acquires, and what kind of data is required to acquire it. Recent Bayesian models of word segmentation (Brent, 1999a; Goldwater et al., 2009), which tend to outperform other models on naturalistic corpora, have tended to be of the third type (focusing on lexical knowledge). They also more accurately capture human performance on artificial languages in the lab (Frank, Goldwater, Griffiths, & Tenenbaum, 2007). The interesting thing about these models is the extent to which they highlight how complex the learning problem is. Within these models there is a tension between the desire to have a simple lexicon, and the desire to assign high probability to the observed corpus. One way to maximise the probability of the data is to assume that the lexicon contains only a single "word", with that word being precisely identical to the entire corpus. At the other extreme, the simplest possible "lexicon" consists of a small number of words, one per phoneme, with word boundaries placed between all pairs of phonemes. In between these two extremes exists an optimal compromise solution, with a moderately large lexicon consisting of fairly short words, which assigns fairly high probability to the data. The most recent of these models (Goldwater et al., 2009) goes further than this, and shows that better word segmentation can be achieved if the learner calculates the transitional probabilities not just between the observed phonemes, but also between the words in the (learned) lexicon. As it turns out, if the learner ignores the transition probabilities between words (e.g., by assuming words are independent) then the result is systematic under-segmentation; learning based on within-word as well as between-word dependencies greatly improves the segmentation of naturalistic child-directed speech. In other


words, by constructing detailed computational models that can solve the statistical learning problem, we learn that the most successful representations are likely to be complex, and involve learning an explicit lexicon and tracking transition probabilities at multiple levels of linguistic analysis.

Learning on multiple levels

In the previous section, we illustrated how computational models help "flesh out" the ideas associated with the statistical learning of language, using the word segmentation problem as an example. In this section we now pick one of the issues raised in that discussion, and follow it through in some detail, across multiple problems. That issue is the question of learning on multiple levels. In the previous example, we saw that word segmentation is improved when the learner has the capacity to use knowledge from two different levels (in this case, phonemes and words) to assist them. However, the issue is quite general, and turns up in a range of problems in language and cognition. One especially interesting case is when the two "levels" refer to levels of abstraction: learning both information about specific items and information about general principles. For instance, acquiring categories involves learning information about how specific categories are organised (e.g., that balls tend to be round) as well as information about how categories in general tend to be organised (e.g., that count nouns are organised by shape). Empirical work demonstrates that children learn both of these elements (Landau, Smith, & Jones, 1988; Imai & Gentner, 1997; Smith, Jones, Landau, Gershkoff-Stowe, & Samuelson, 2002), and that this learning is probably driven by the statistical nature of the input (Samuelson & Smith, 1999). An analogous problem arises in a very different domain – acquiring verb argument constructions. In every language, different verbs take arguments in distinct constructions; for instance, the verb load can occur with two distinct locative constructions ("He loaded apples into the cart" and "He loaded the cart with apples"). Not all verbs can occur in all constructions: one can pour apples into a cart but not pour a cart with apples, and one can fill a cart with apples but not fill apples into a cart. Knowing which verbs can occur with which constructions is verb-specific knowledge, but children learning language also acquire verb-general knowledge about the sorts of constructions that verbs of different types can appear with (see, e.g., Baker, 1979; Pinker, 1989). This allows them to generalise sensibly about novel verbs, spontaneously producing sentences such as "He is mooping the cloth with marbles" when introduced to the novel


verb ‘mooping’ in the context of an experimenter placing marbles into a cloth (Gropen, Pinker, Hollander, & Goldberg, 1991). Simultaneous verbgeneral and verb-specific learning on the basis of the statistical structure of the input has been demonstrated in the lab as well, in artificial language learning studies (Wonnacott, Newport, & Tanenhaus, 2008). The fact that human learners are able to learn knowledge organized at multiple levels of abstraction raises a natural question: how is it possible to describe this learning, and how do we describe the structure of the knowledge that they acquire? This kind of learning has been shown to be possible for both connectionist (e.g., Kruschke, 1992; Colunga & Smith, 2005) and Bayesian (Navarro, 2006; Kemp, Perfors, & Tenenbaum, 2007; Gri‰ths, Sanborn, Canini, & Navarro, 2008; Heller, Sanborn, & Chater, 2009) learners, and some of these models have been applied to verb construction learning (Hsu & Gri‰ths, 2009; Perfors, Tenenbaum, & Wonnacott, 2010). The Bayesian framework in particular is quite revealing about the general computational principles that allow this learning to occur and the kind of knowledge that is acquired. The one thing that all of the Bayesian models discussed above have in common is that they are hierarchical models. In a standard Bayesian model the learner is assumed to postulate hypotheses that explain the observed data. Each individual learning problem (e.g., learning a single category) maps onto a single hypothesis. In a hierarchical model, however, the learner goes one step further and postulates more general hypotheses that can explain each of the individual ones. Following Goodman (1955), these ‘‘hypotheses about hypotheses’’ are called overhypotheses. Mathematically, what this means is that instead of having a single set of prior beliefs Pðhi Þ (as per Equation 1), the learner’s belief about a specific hypothesis hi are constrained by an overhypothesis. Thus, if the learner has the overhypothesis ok , his or her beliefs are described by the more structured distribution, Pðhi jok ÞPðok Þ. What this means is that as the learner refines their beliefs about specific hypotheses, they also acquire more general knowledge in the form of the overhypothesis. In principle, the level at which learning occurs could be extended upward even more, until the knowledge at the highest level is weak or general enough that it can be plausibly assumed to be innate. As in the problem of word segmentation, Bayesian models demonstrate what can be learned by a learner capable of statistical inference on multiple levels at once. Do people actually learn in this way? This is an empirical question that requires much more work to flesh out fully, but some indications are promising. For instance, it is empirically observed that children are capable of acquiring higher-order word learning general-


generalisations after learning relatively few words (Smith et al., 2002); this sort of rapid inference is one of the trademarks of Bayesian models, but is rarely observed in connectionist models. A recent model captures this learning (Kemp et al., 2007), and the same model captures human verb learning in artificial language learning tasks (Perfors et al., 2010). A related version of the model predicts that both higher-level and lower-level generalisations in category learning should be acquired at the same time (i.e., on the basis of the same amount of data). This prediction was empirically supported in an experiment with adults (Perfors & Tenenbaum, 2009). These models also demonstrate how it may be possible to acquire two different higher-order generalisations about two different kinds of things at the same time (for instance, learning that solids are organised by shape but non-solids are organised by texture). Although there is evidence that young children are capable of this sort of learning over the course of their first years, it is an open question whether it is possible for adults in a laboratory setting.
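A minimal sketch can show what learning at two levels looks like in practice. The "bags of marbles" setup below is the standard textbook illustration of overhypothesis learning; the particular counts, the two candidate overhypotheses, and the coarse grid over bag-level proportions are all invented for this illustration rather than taken from any published model.

# Sketch of overhypothesis learning with two levels: bag-specific colour
# proportions and a bag-general overhypothesis. All numbers are invented.
from math import comb

draws = 10
bags = [10, 0, 10, 0]        # black-marble counts from four bags of 10 draws

# Each overhypothesis is a distribution over a bag's own proportion theta.
over = {
    "o_uniform": {0.05: 0.5, 0.95: 0.5},   # bags are homogeneous in colour
    "o_mixed":   {0.5: 1.0},               # bags are an even mixture
}
prior = {"o_uniform": 0.5, "o_mixed": 0.5}

def bag_likelihood(k, theta_dist):
    # P(k black marbles out of `draws`), marginalising over the bag's own theta
    return sum(p * comb(draws, k) * t ** k * (1 - t) ** (draws - k)
               for t, p in theta_dist.items())

post = dict(prior)
for k in bags:
    for o in post:
        post[o] *= bag_likelihood(k, over[o])
z = sum(post.values())
print({o: round(v / z, 4) for o, v in post.items()})
# Nearly all posterior mass goes to o_uniform: the learner has acquired
# knowledge about bags in general, and that knowledge will in turn shape its
# guess about a brand-new bag after seeing only a single marble from it.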

What assumptions do learners make about the data?

In the previous section we discussed the kind of knowledge that a statistical learner is able to acquire. Closely related to this issue is the question of what the learner assumes about the data from which such knowledge is acquired. This issue was hinted at in our earlier discussion of the type-token distinction. As noted previously, inferences drawn from type-level data can be different from those drawn from token-level data. As such, if the learner assumes that the "power law" statistics observed at the token level are caused by extraneous processes (i.e., the adaptor in adaptor grammars) then the inferences will be different than if the learner treats these as an important characteristic of the data (i.e., if the token-level data are sampled independently). In other words, in order to successfully extract linguistic knowledge from data, the learner must also make assumptions about the manner in which those data were generated. These are referred to as sampling assumptions. In this section we discuss how sampling assumptions can play a role in influencing what a statistical learner learns, and the rate at which this knowledge is acquired. Much of the literature on sampling assumptions focuses on two extreme but illustrative cases, known as "strong sampling" and "weak sampling". Strong sampling corresponds to assuming that data is generated by the underlying process that the learner seeks to acquire: for instance,


sentences in a language are probably strongly sampled from some sort of underlying grammar, since the grammar is used to generate the sentences. Conversely, assuming weak sampling corresponds to assuming that the data can be generated independently of the process being learned about, and that process serves simply to label the data. For instance, the typical paradigm in a category-learning experiment corresponds to a weak sampling process: items are presented to the learner, and labelled as being either examples of the concept or not examples of the concept. One could assume weak sampling even if one only sees positive examples, of course, if one assumes that items are generated independently but are labelled by the underlying process, and that for some reason the negative items are never seen.

Because Bayesian models force an explicit specification of the learner's assumptions about the nature of the generative process, these models can clarify how sampling assumptions change the nature of the inferences that can be made. If a learner assumes strong sampling, then it is possible to learn a surprising amount from just a few data points. This is because the data points tell you something about the "size" of the concept. Broadly speaking, if there are h items in the concept – for instance, h animals in the immediate world that correspond to the label "dog", or h sentences that a grammar can generate – then the probability of generating any specific item is proportional to 1/h.² As a result, if there are n items generated independently, the probability of all n of them will be proportional to 1/h^n. In other words, after very few data points, the highest-probability hypotheses will be those that are most conservative with respect to those data points. This principle is known as the "size principle" (Tenenbaum & Griffiths, 2001; Navarro & Perfors, 2010).

2. It is only precisely 1/h in the case that items are sampled with equal probability from the underlying process, and completely independently from each other. If that changes – as, for instance, might occur in the case of grammars, because some sentences might be more probable than others by virtue of being shorter or using more frequent words or constructions – then the precise probability of any one item might diverge slightly from 1/h. Still, the probability will scale proportionally to 1/h, which is all that is necessary to drive the effect being discussed.

As a result of the size principle, it is possible to make strong inferences on the basis of relatively little data. Work by Xu and Tenenbaum (2007) demonstrates this type of inference in a word learning context. Intuitively, if we were shown one object – say, a dalmatian – and told that it was an


example of a "fep", we would not necessarily infer that "fep" means dalmatian; it could mean dog or animal or pet. Yet, if we were shown three examples of a "fep" and they were all dalmatians, we would think it much more likely that "fep" meant dalmatian. This is because dalmatian is a "smaller" concept than dog: dogs include all of the same items as dalmatians do, plus many others. If the underlying concept actually were dog, it would be somewhat surprising that only dalmatians happened to be generated, but this would not be a puzzle if the underlying concept were dalmatian. Xu and Tenenbaum (2007) found that both adults and children reason according to this intuition in a word-learning task like this. Even infants appear to make inferences that are consistent with a strong sampling assumption: after observing an experimenter blindly draw four red balls and one white ball out of a box, infants tend to assume that the box is predominantly red and are surprised if it turns out to be predominantly white (Xu & Garcia, 2008). Intriguingly, this effect vanishes when the infant is presented with evidence that the balls are being sampled in a different way. If they see the experimenter look into the box and selectively remove four red balls after evincing a preference for red balls, they are no longer surprised to find that the box is predominantly full of white ones: the sample wasn't drawn independently and at random from the underlying concept (Xu & Denison, 2009). Strong sampling assumptions may also enable a learner to overcome the problem of learning in the absence of negative evidence. If items are strongly sampled from the concept, there is never any negative evidence given, but the learner can nevertheless eventually constrain their generalisation in a sensible way. This is because the size principle captures the notion of a suspicious coincidence: as the number of examples increases, hypotheses that make specific predictions – those with more explanatory power – tend to be favored over those that are more vague. As the size of the data set approaches infinity, a Bayesian learner rejects larger or more overgeneral hypotheses in favor of more precise ones. With limited amounts of data, the Bayesian approach can make more subtle predictions, as the graded size-based likelihood trades off against the preference for simplicity in the prior. This is essentially the same idea as implicit negative evidence, which others have suggested (e.g., Braine & Brooks, 1995); the mathematics of Bayesian probability theory provides a principled quantitative justification for why a sensible learner should take it into account, and precisely how much to do so.
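The arithmetic behind the "fep" example is easy to reproduce. In the sketch below the extension sizes and the mild prior preference for the basic-level word are invented for illustration; the likelihood is simply (1/size)^n, as the size principle dictates under strong sampling.

# Size-principle sketch for the "fep" example: three nested hypotheses with
# invented extension sizes and an invented prior that mildly favours the
# basic-level meaning.
sizes = {"dalmatian": 20, "dog": 100, "animal": 500}
prior = {"dalmatian": 0.1, "dog": 0.6, "animal": 0.3}

def posterior_after(n):
    # All n examples are dalmatians, so every hypothesis remains consistent;
    # under strong sampling each has likelihood (1 / size) ** n.
    unnorm = {h: prior[h] * (1 / sizes[h]) ** n for h in sizes}
    z = sum(unnorm.values())
    return {h: round(p / z, 3) for h, p in unnorm.items()}

for n in (1, 3, 5):
    print(n, posterior_after(n))
# With one example "dog" and "dalmatian" are both plausible; after three or
# five dalmatian examples the most specific hypothesis takes nearly all of
# the posterior, mirroring the behavioural findings described above.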


Do learners always assume that data in the world is a result of strong sampling? This has not yet been studied extensively, but early work suggests both that there are individual differences, and that assumptions appear to depend somewhat on the nature of the task and domain. In some recent work, people were given different cover stories explaining how different sorts of categorical data were created: for instance, in one, participants were told that they had observed bacteria at certain temperatures and then were asked what range of temperatures such bacteria could survive (Navarro, Dry, & Lee, in press). Although there were individual differences in how people performed, many inferences suggested that people were assuming some mixture between strong and weak sampling: their generalisations sharpened somewhat with increasing data, but not as much as the size principle would predict. Interestingly, as long as a learner assumes anything other than purely weak sampling – that is, as long as they sharpen their generalisations with increasing data at all – it is still possible to constrain generalisations even without negative evidence. As before, the mathematics of probability theory can explain precisely how much generalisations should sharpen as a function of the degree to which the learner assumes strong sampling.
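One way to see how the strength of the sampling assumption controls the sharpening of generalisations is with a toy interval task in the spirit of the bacteria example. Everything in the sketch below, including the candidate intervals, the observations, and the constant used for the weak-sampling term, is invented for illustration.

# Sketch of mixing strong and weak sampling for an interval-generalisation
# task. Hypotheses are intervals [0, u]; all observations fall at or below 10.
uppers = range(10, 101, 10)          # candidate upper bounds u
obs = [10, 8, 9, 10, 7]              # five consistent observations

def posterior_mean_upper(mix):
    # mix = 1.0 is pure strong sampling, mix = 0.0 is pure weak sampling;
    # 0.01 stands in for the constant "labelling" likelihood under weak sampling.
    unnorm = {}
    for u in uppers:
        per_obs = mix * (1 / u) + (1 - mix) * 0.01
        unnorm[u] = per_obs ** len(obs)          # flat prior over u
    z = sum(unnorm.values())
    return sum(u * p / z for u, p in unnorm.items())

for mix in (0.0, 0.5, 1.0):
    print(mix, round(posterior_mean_upper(mix), 1))
# Under pure weak sampling the data never tighten the interval; any
# strong-sampling component pulls the posterior towards the smallest interval
# that covers the observations, and additional data would pull it harder.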

What prior knowledge does a statistical learner possess?

Assumptions about the sampling process are tied to another set of assumptions that can drive the shape and nature of a learner's generalisations, captured in Bayesian models via the prior. In the model of Navarro et al. (in press) above, people's assumptions about sampling were included as a part of the model parameters. This isn't meant to imply that such parameters can be completely arbitrary: for instance, consider the priors. Although most Bayesian models have enough flexibility to accommodate extremely strange priors if the modeller wants to, in practice a preference for simpler or more parsimonious hypotheses will emerge naturally without having to be deliberately engineered. This preference derives from the generative assumptions underlying the Bayesian framework, in which hypotheses are themselves generated by a process that produces a space of candidate hypotheses and the prior probability P(h) reflects the probability of generating h under that process. As an example, consider the grammar-learning work of Perfors et al. (2011) referred to earlier. It assumes that grammars are generated by an underlying grammar-generating process in which individual grammar rules


are created by generating non-terminal and terminal nodes from an underlying vocabulary according to certain specifications (e.g., for a context-free grammar rule, the left-hand side must always be a non-terminal, and so forth). For each choice that is made, there is some probability of choosing differently: for instance, one might have chosen NP to go in a particular "slot" in some rule, but it could equally have been a VP or PP. As a result, grammars that are longer and more complicated – that have more rules, more non-terminals, and more ways of creating legitimate rules – will be disfavoured in the prior; the more choices a hypothesis requires, the more likely it is that those choices could have been made in a different way, resulting in an entirely different hypothesis. More formally, because the prior probability of a hypothesis is the product of the probabilities for all choices needed to generate it, and the probability of making any of these choices in a particular way must be less than one, a hypothesis specified by strictly more choices will in general receive strictly lower prior probability. Although we have illustrated this with the grammar example, this is a general property of Bayesian modelling, and most priors (unless deliberately engineered otherwise) will naturally favour more parsimonious hypotheses with fewer parameters. The word segmentation model of Goldwater et al. (2009), for instance, favours segmentations that reflect a smaller underlying vocabulary of words, and category-learning models (e.g., Perfors & Tenenbaum, 2009) generally favour fewer categories. None of these models blindly prefers the simplest of all possible explanations, of course, because Bayesian probability theory trades off the preference for simplicity (in the prior) with goodness-of-fit to the data (in the likelihood). Since the likelihood is weighted more and more heavily as the amount of data increases, there can be interesting differences in what is learned as the amount of data increases. With little data, preferred hypotheses tend to be simpler, but as the amount of data grows, the complexity of the preferred hypothesis can also increase. Bayesian modelling makes explicit how different learning assumptions lead to a different kind of learning in another way, too. All Bayesian models explicitly specify the nature of the hypotheses that the model entertains. It is rare for all hypotheses to be explicitly enumerated and compared, since most hypothesis spaces are simply too large, if not infinite in size; in practice, the best hypothesis (the one with maximum posterior probability) is identified through intelligent search of the space. Models with different hypothesis spaces can therefore learn different things. For instance, a learner could not learn that context-free grammars are the best fit to child-directed speech if they were not capable of representing


and evaluating context-free grammars in the first place; a learner could not learn about an underlying category if it was incapable of representing the features of that category. This explicit representation of the hypothesis space can appear as a disadvantage to many, especially by comparison to connectionist models, which appear to build in less. However, as discussed earlier, all models must build in something, and connectionist models implicitly build in hypothesis spaces through their choice of architecture and other parameters; learning in connectionist models corresponds to searching over some of the hypotheses in the space. The main difference between the models is not, therefore, that Bayesian models necessarily build in a lot more innate machinery in terms of their hypothesis spaces than connectionist models do: it is that the nature of these hypothesis spaces is made explicit, rather than implicit, and it is that for connectionist models the focus is on the process of the search, whereas in Bayesian models the focus is on the nature of the solution. This leads to our final section, which addresses the issue of why: Why does statistical learning work in the first place, what does it mean to "work", and why is it beneficial to study statistical learning from a modelling perspective?
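The claim that a generative prior automatically penalises hypotheses built from more choices can be spelled out with a toy calculation. The per-choice probabilities below are invented; the only point is that the prior is a product of numbers less than one, so every extra rule costs probability.

# Toy illustration of a generative prior over grammars: the prior probability
# of a grammar is the product of the probabilities of all the choices needed
# to build it. The per-choice probabilities are invented.
from math import prod

P_ANOTHER_RULE = 0.5       # probability of deciding to add one more rule
P_SYMBOL = 1 / 8           # probability of any particular symbol choice

def grammar_prior(num_rules, symbols_per_rule=3):
    choices = []
    for _ in range(num_rules):
        choices.append(P_ANOTHER_RULE)                  # "add a rule"
        choices.extend([P_SYMBOL] * symbols_per_rule)   # fill in its symbols
    choices.append(1 - P_ANOTHER_RULE)                  # "stop adding rules"
    return prod(choices)

for n in (3, 6, 12):
    print(n, grammar_prior(n))
# Larger grammars receive exponentially smaller priors. They can still win
# the posterior, but only if their likelihood (fit to the data) is enough
# better to pay for the extra complexity.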

When and why does statistical learning work?

In a very real sense, Bayesian theories are genuinely rational theories of inference. What we mean by this is that Bayesian probability theory is not simply a set of ad hoc rules useful for manipulating and evaluating statistical information: it represents a set of unique, consistent rules for conducting plausible inference (Jaynes, 2003). In essence, it is an extension of deductive logic to the case where propositions have degrees of truth or falsity (and indeed, deductive inference is a special case of Bayesian inference). Just as formal logic describes a deductively correct way of thinking, Bayesian probability theory describes an inductively correct way of thinking. As Laplace (1816) said, "probability theory is nothing but common sense reduced to calculation." To illustrate what this means, suppose we were to try to come up with a set of desiderata that a system of "proper reasoning" should meet. This might include things like consistency, and qualitative correspondence with common sense. For instance, if you see some data supporting a new proposition A, you should conclude that A is more plausible rather than less; the more you think A is true, the less you should think it is false; if a conclusion can be reasoned multiple


ways, its probability should be the same regardless of how you got there; etc. Probability theory and Bayesian inference can be seen to follow from a mathematical formalization of these basic desiderata (Cox, 1946, 1961). In that sense, Bayesian inference follows from basic common sense, and supplies a sensible normative standard for inductive inference. Moreover, there are a number of results that show that Bayesian models have optimal or near-optimal predictive capabilities (see de Finetti, 1937; Dawid, 1984; Grünwald, 2007): a non-Bayesian reasoner attempting to predict the future will typically make worse predictions than a Bayesian one. On the surface, then, it would appear to be the case that Bayesian reasoning solves all of the problems of inductive inference for us, since it is a normative model for induction that follows from qualitative common sense reasoning. However, the story is somewhat more subtle than this. Earlier in this chapter we briefly referred to work by Goodman (1955), Quine (1960) and Gold (1967), all of which implied that when the learning problem is "complex" in some sense, it is impossible for any variety of statistical inference to work effectively. In the statistics literature, for instance, it is well established that if the world lacks any structure or the learner is simply insensitive to such structure, then no learning is possible (Schaffer, 1994; Rao, Gordon, & Spears, 1995). We mentioned this at the time in order to make the point that both connectionist and Bayesian models require some assumptions in order to work. In many ways, this point is unremarkable. There simply has to be some minimal set of a priori assumptions required for statistical learning to work, and both Bayesian and connectionist models must rely on such assumptions. From the connectionist perspective, what this means is that any suggestion that connectionist models do not "build things in" must be viewed with skepticism. Whether it be through the architecture, the learning rule, or the activation mechanisms, a connectionist model must supply constraints and assumptions: it cannot be otherwise. From a Bayesian perspective, one must be cautious about overly strong claims about "optimality". For instance, if the learner lives in a world in which "tomorrow is like today", a Bayesian model that assumes that "tomorrow will be the opposite of today" will perform extremely poorly. Moreover, there are even cases where it is possible to make Bayesian models behave suboptimally (Grünwald & Langford, 2007) or pathologically (Diaconis & Freedman, 1986); there are also clever ways of speeding up Bayesian learning beyond what the standard analysis might predict (Van Erven, Grünwald, & de Rooij, 2009). In other words, Bayesian modelling is not a panacea, nor an excuse to avoid thinking carefully about assumptions. To the extent that a


Bayesian model makes poor assumptions, it can perform just as poorly as any other type of model. In large part, this is the reason why we compare different Bayesian models against each other: to find out which ones make more sensible assumptions about the world. Returning to the cognitive science questions, it is an open question whether (and to what extent) human learners actually do behave in accordance with the optimal predictions of Bayesian theory. In some domains, it may certainly appear that they do not: for instance, it has long been noted that human decision making is rife with biases that appear to diverge markedly from what Bayesian reasoning would predict (e.g., Tversky & Kahneman, 1974). Nevertheless, in other areas, Bayesian models appear to fit human behaviour surprisingly well, some of which we have seen already, some of which applies in areas other than language acquisition (e.g., causal reasoning (Griffiths & Tenenbaum, 2009), sensorimotor control (Kording & Wolpert, 2006), and vision (Yuille & Kersten, 2006), among others). Even when humans are non-optimal, it is impossible to know this without being able to precisely specify what optimal thinking would amount to. Put another way, understanding how humans do think is often made easier if one can identify the ways in which people depart from the ideal: this is approximately the methodology by which Kahneman and Tversky derived many of their famous heuristics and biases. Moreover, specifying an optimal (or at least near-optimal) solution to the problem often makes it clear how it might be the case that people could still be reasoning sensibly, but be operating under additional constraints emerging from the nature of the underlying representation, or from having limited memory or processing available (e.g., Chase, Hertwig, & Gigerenzer, 1998). It is possible to capture these sorts of constraints within the Bayesian framework using a relatively recent approach called rational process modelling (Sanborn, Griffiths, & Navarro, 2010), which focuses more on the process by which the hypothesis space is searched, and provides a way to evaluate how a learner with limited capacity might approximate the optimal solution. Although rational process models have not yet been applied much in language (though see Pearl, Goldwater, & Steyvers, 2010, for an example where it has), they have been applied with some success in areas like decision making (Vul, Goodman, Griffiths, & Tenenbaum, 2009) and category learning (Sanborn et al., 2010), and are a promising future direction for those interested in exploring the extent to which humans approximate optimal inference.


understand what optimal reasoning would look like: it is useful for performing ideal learnability analysis. What must be "built into" the newborn mind in order to explain how infants eventually grow to be adult reasoners, with adult knowledge? One way to address this question is to establish the bounds of the possible: if some knowledge couldn't possibly be learned by an optimal learner presented with the type of data children receive, it is probably safe to conclude either that actual children couldn't learn it either, or that some of the assumptions underlying the model are inaccurate. The tools of Bayesian inference are well matched to this sort of problem, both because they force modellers to make all of these assumptions explicit, and also because of their representational flexibility and ability to calculate optimal inference.
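The gap between the ideal answer and what a capacity-limited learner might actually do can itself be made concrete. The sketch below is loosely in the spirit of the rational-process idea mentioned earlier: it approximates the ideal posterior by drawing a limited number of hypothesis samples from the prior and reweighting them by the likelihood. It is our own toy construction, not any published model, and all of the numbers are invented.

# Sketch of a capacity-limited approximation to the ideal Bayesian answer:
# instead of scoring every hypothesis, draw a few samples from the prior and
# reweight them by the likelihood. All numbers are invented.
import random

random.seed(2)
hypotheses = [f"h{i}" for i in range(50)]                # flat prior over 50 h's
likelihood = {h: (0.2 if h == "h7" else 0.01) for h in hypotheses}

def ideal_posterior(target="h7"):
    z = sum(likelihood[h] for h in hypotheses) / len(hypotheses)
    return (likelihood[target] / len(hypotheses)) / z

def sampled_posterior(n_samples, target="h7"):
    sample = random.choices(hypotheses, k=n_samples)     # draws from the prior
    weights = [likelihood[h] for h in sample]
    return sum(w for h, w in zip(sample, weights) if h == target) / sum(weights)

print("ideal", round(ideal_posterior(), 3))
for n in (10, 100, 1000):
    print(n, round(sampled_posterior(n), 3))
# With a handful of samples the estimate is noisy and can miss the best
# hypothesis entirely; with more samples it converges on the ideal answer.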

Conclusion

We have explored here some of the why and what of statistical learning, using the Bayesian framework to illuminate these questions. We demonstrated that Bayesian techniques can be useful for understanding what kinds of learners and assumptions are necessary for successful statistical learning: the units that such learning operates over and the levels of abstraction it includes both drive the nature of the inferences that can be made. Learners, too, must make assumptions about whether it is more reasonable to make inferences on the basis of types or tokens (or a mixture of the two) and whether the data was strongly or weakly sampled (or a mixture of the two). And, of course, any learner must incorporate – whether explicitly or implicitly – certain assumptions in the form of their prior biases and the nature of the hypotheses they can represent and consider. We discussed how these assumptions might drive what is learned, and how the Bayesian paradigm, since it forces them to be made explicit, can make it relatively straightforward to manipulate them and evaluate how that changes what can be learned. Of course, Bayesian modelling is only one tool in the toolbox available to researchers studying language acquisition. Although it can be very useful to be able to specify and understand the nature of the problem facing language learners as well as what an optimal solution would entail, this is only part of the scientific problem. In our view, maximal scientific progress can be made by combining computational modelling (of all sorts) with rich and precise empirical work exploring what people actually do learn, and clarifying and testing the assumptions


and predictions made by the models. Only by working in tandem can we most efficiently arrive at a full understanding of how and why people learn from the statistical regularities in their environment.

References

Aslin, R., Saffran, J., & Newport, E. 1998 Computation of conditional probability statistics by 8-month-old infants. Psychological Science, 9, 321–324.
Baker, C. 1979 Syntactic theory and the projection problem. Linguistic Inquiry, 10(4), 533–581.
Berwick, R. 1986 Learning from positive-only examples: The subset principle and three case studies. Machine Learning, 2, 625–645.
Braine, M., & Brooks, P. 1995 Verb argument structure and the problem of avoiding an overgeneral grammar. In Beyond names of things: Young children's acquisition of verbs (pp. 353–376). Hillsdale, NJ: Lawrence Erlbaum Associates.
Brent, M. 1999a An efficient, probabilistically sound algorithm for segmentation and word discovery. Machine Learning, 34, 71–105.
Brent, M. 1999b Speech segmentation and word discovery: A computational perspective. Trends in Cognitive Sciences, 3(8), 294–301.
Brent, M., & Cartwright, T. 1996 Distributional regularity and phonotactic constraints are useful for segmentation. Cognition, 61, 93–125.
Briscoe, E. 2006 Language learning, power laws, and sexual selection. In 6th International Conference on the Evolution of Language.
Cairns, P., Shillcock, R., & Chater, N. 1997 Bootstrapping word boundaries: A bottom-up approach to speech segmentation. Cognitive Psychology, 33, 111–153.
Chase, V., Hertwig, R., & Gigerenzer, G. 1998 Visions of rationality. Trends in Cognitive Sciences, 2(6).
Christiansen, M., Allen, J., & Seidenberg, M. 1998 Learning to segment speech using multiple cues. Language and Cognitive Processes, 13, 221–268.
Colunga, E., & Smith, L. 2005 From the lexicon to expectations about kinds: A role for associative learning. Psychological Review, 112(2), 347–382.

404

Amy Perfors and Daniel J. Navarro

Cox, R. 1946 Cox, R. 1961 Dawid, A. P. 1984

de Finetti, B. 1937

Probability, frequency, and reasonable expectation. American Journal of Physics, 14, 1–13. The algebra of productive inference. Baltimore, MD: Johns Hopkins University Press. Present position and potential developments: Some personal views, statistical theory, the prequential approach. Journal of the Royal Statistical Society, Series A, 147, 278–292. Prevision, its logical laws, its subjective sources. In H. Kyburg & H. Smokler (Eds.), In studies in subjective probability (2nd ed.). New York: J. Wiley and Sons.

de Marcken, C. 1995 The unsupervised acquisition of a lexicon from continuous speech (A.I. Memo No. 1558). Cambridge, MA: Massachusetts Institute of Technology. Diaconis, P., & Freedman, D. 1986 On the consistency of Bayes estimates. The Annals of Statistics, 14, 1–26. Elman, J. 1990 Finding structure in time. Cognitive Science, 14, 179–211. Frank, M. C., Goldwater, S., Gri‰ths, T. L., & Tenenbaum, J. B. 2007 Modeling human performance in statistical word segmentation. In D. McNamara & J. Trafton (Eds.), Proceedings of the 29th Annual Conference of the Cognitive Science Society (p. 281– 286). Austin, TX: Cognitive Science Society. Gibson, E., & Wexler, K. 1994 Triggers. Linguistic Inquiry, 25(3), 407–454. Gold, E. M. 1967 Language identification in the limit. Information and Control, 10, 447–474. Goldwater, S., Gri‰ths, T. L., & Johnson, M. 2006a Contextual dependencies in unsupervised word segmentation. In Proceedings of the 44th Annual Meeting of the Association for Computational Linguistics (p. 673–680). Sydney, Australia. Goldwater, S., Gri‰ths, T. L., & Johnson, M. 2006b Interpolating between types and tokens by estimating power law generators. In Y. Weiss, B. Scholkopf, & J. Platt (Eds.), Advances in Neural Information Processing Systems (Vol. 18, p. 459–466). Cambridge, MA: MIT Press. Goldwater, S., Gri‰ths, T. L., & Johnson, M. 2009 A Bayesian framework for word segmentation: Exploring the e¤ects of context. Cognition, 112, 21–54.

What Bayesian modelling can tell us about statistical learning Goodman, N. 1955

405

Fact, fiction, and forecast. Cambridge, MA: Harvard University Press. Gri‰ths, T. L., Sanborn, A., Canini, K., & Navarro, D. 2008 Categorization as non-parametric Bayesian density estimation. In M. Oaksford & N. Chater (Eds.), The probabilistic mind: Prospects for Bayesian cognitive science (p. 303–328). Oxford: Oxford University Press. Gri‰ths, T. L., & Tenenbaum, J. B. 2009 Theory-based causal induction. Psychological Review, 116, 661– 716. Gropen, J., Pinker, S., Hollander, M., & Goldberg, A. 1991 A¤ectedness and direct objects: The role of lexical semantics in the acquisition of verb argument structure. Cognition, 41, 153– 195. Gru¨nwald, P. 2007 The minimum description length principle. Cambridge, MA: MIT Press. Gru¨nwald, P., & Langford, J. (2007). Suboptimal behavior of Bayes and MDL in classification under misspecification. Machine Learning, 66, 119–149. Harris, Z. 1955 From phoneme to morpheme. Language, 31, 190–222. Heller, K., Sanborn, A., & Chater, N. 2009 Hierarchical learning of dimensional biases in human categorization. In Y. Bengio, D. Schuurmans, J. La¤erty, C. K. I. Williams, & A. Culotta (Eds.), Advances in Neural Information Processing Systems (Vol. 22, p. 727–735). Cambridge, MA: MIT Press. Hsu, A., & Gri‰ths, T. L. 2009 Di¤erential use of implicit negative evidence in generative and discriminative language learning. In Y. Bengio, D. Schuurmans, J. La¤erty, C. K. I. Williams, & A. Culotta (Eds.), Advances in Neural Information Processing Systems (Vol. 22, p. 754–762). Cambridge, MA: MIT Press. Imai, M., & Gentner, D. 1997 A cross-linguistic study of early word meaning: Universal ontology and linguistic influence. Cognition, 62 (169–200). Jaynes, E. 2003 Probability theory: The logic of science. Cambridge: Cambridge University Press. Johnson, M., Gri‰ths, T. L., & Goldwater, S. 2007 Adaptor grammars: A framework for specifying compositional nonparametric Bayesian models. In Advances in Neural Information Processing Systems (Vol. 19). Cambridge, MA: MIT Press. Kemp, C., Perfors, A., & Tenenbaum, J. B. 2007 Learning overhypotheses with hierarchical Bayesian models. Developmental Science, 10(3), 307–321.

406

Amy Perfors and Daniel J. Navarro

Kording, K., & Wolpert, D. 2006 Bayesian decision theory in sensorimotor control. Trends in Cognitive Sciences, 10(7), 319–326. Kruschke, J. 1992 ALCOVE: An exemplar-based connectionist model of category learning. Psychological Review, 99(1), 22–44. Landau, B., Smith, L., & Jones, S. 1988 The importance of shape in early lexical learning. Cognitive Development, 3 (299–321). Laplace, P. S. 1816 A philosophical essay on probabilities. Dover Publications. Navarro, D. 2006 From natural kinds to complex categories. In R. Sun & N. Miyake (Eds.), Proceedings of the 28th Annual Conference of the Cognitive Science Society (pp. 621–626). Austin, TX: Cognitive Science Society. Navarro, D., Dry, M., & Lee, M. in press Sampling assumptions in inductive generalization. Cognitive Science. Navarro, D., & Perfors, A. 2010 Similarity, feature discovery, and the size principle. Acta Psychologica, 133, 256–268. Pearl, L., Goldwater, S., & Steyvers, M. 2010 How ideal are we? Incorporating human limitations into Bayesian models of word segmentation. In K. Franich, K. Iserman, & L. Keil (Eds.) Proceedings of the 34th Annual Boston University Conference on Child Language Development (pp. 315–326). Somerville, MA: Cascadilla Press. Perfors, A., & Tenenbaum, J. B. 2009 Learning to learn categories. In N. Taatgen, H. van Rijn, L. Schomaker, & J. Nerbonne (Eds.), Proceedings of the 31st Annual Conference of the Cognitive Science Society (pp. 136–141). Austin, TX: Cognitive Science Society. Perfors, A., Tenenbaum, J. B., & Regier, T. 2011 The learnability of abstract syntactic principles. Cognition, 118(3), 306–338. Perfors, A., Tenenbaum, J. B., & Wonnacott, E. 2010 Variability, negative evidence, and the acquisition of verb argument constructions. Journal of Child Language, 37, 607–642. Perruchet, P., & Vinter, A. 1998 Parser: A model for word segmentation. Journal of Memory and Language, 39, 246–263. Pinker, S. 1989 Learnability and cognition: The acquisition of argument structure. Cambridge, MA: MIT Press.

What Bayesian modelling can tell us about statistical learning

407

Quine, W. v. O. 1960 Word and object. Cambridge, MA: MIT Press. Rao, R. B., Gordon, D., & Spears, W. 1995 For every generalization action, is there really an equal and opposite reaction? analysis of the conservation law for generalization performance. In Proceedings of the 12th International Conference on Machine Learning (p. 471–479). Sa¤ran, J., Aslin, R., & Newport, E. 1996 Statistical learning by 8-month-olds. Science, 274, 1926–1928. Sa¤ran, J., & Thiessen, E. 2003 Pattern induction by infant language learners. Developmental Psychology, 39, 484–494. Samuelson, L., & Smith, L. 1999 Early noun vocabularies: Do ontology, category structure, and syntax correspond? Cognition, 73, 1–33. Sanborn, A., Gri‰ths, T. L., & Navarro, D. 2010 Rational approximations to rational models: Alternative algorithms for category learning. Psychological Review, 117, 1144– 1167. Scha¤er, C. 1994 A conservation law for generalization performance. In Proceedings of the 11th International Conference on Machine Learning (p. 259–265). Smith, L., Jones, S., Landau, B., Gershko¤-Stowe, L., & Samuelson, L. 2002 Object name learning provides on-the-job training for attention. Psychological Science, 13(1), 13–19. Swingley, D. 2005 Statistical clustering and the contents of the infant vocabulary. Cognitive Psychology, 50, 86–132. Tenenbaum, J. B., & Gri‰ths, T. L. 2001 Generalization, similarity, and Bayesian inference. Behavioral and Brain Sciences, 24(2), 629–641. Thompson, S., & Newport, E. 2007 Statistical learning of syntax: The role of transitional probability. Language Learning and Development, 3, 1–42. Tversky, A., & Kahneman, D. 1974 Judgment under uncertainty: Heuristics and biases. Science, 135, 1124–1131. Van Erven, T., Gru¨nwald, P., & de Rooij, S. 2009 Catching up faster in Bayesian model selection and model averaging. In J. C. Platt, D. Koller, Y. Singer, & S. Roweis (Eds.), Advances in Neural Information Processing Systems (Vol. 20, p. 417–424). Venkataraman, A. 2001 A statistical model for word discovery in transcribed speech. Computational Linguistics, 27(3), 351–372.

408

Amy Perfors and Daniel J. Navarro

Vul, E., Goodman, N., Gri‰ths, T. L., & Tenenbaum, J. B. 2009 One and done? Optimal decisions from very few samples. In N. Taatgen, H. van Rijn, L. Schomaker, & J. Nerbonne (Eds.), Proceedings of the 31st Annual Meeting of the Cognitive Science Society (p. 148–153). Wonnacott, E., Newport, E., & Tanenhaus, M. 2008 Acquiring and processing verb argument structure: Distributional learning in a miniature language. Cognitive Psychology, 56, 165– 209. Xu, F., & Denison, S. 2009 Statistical inference and sensitivity to sampling in 11-month infants. Cognition, 112, 97–104. Xu, F., & Garcia, V. 2008 Intuitive statistics by 8-month-old infants. Proceedings of the National Academy of Sciences, 105(13), 5012–5015. Xu, F., & Tenenbaum, J. B. 2007 Word learning as Bayesian inference. Psychological Review, 114(2), 245–272. Yuille, A., & Kersten, D. 2006 Vision as Bayesian inference: Analysis by synthesis? Trends in Cognitive Sciences, 10(7), 301–308. Zipf, G. 1932 Selective studies and the principle of relative frequency in language. Cambridge, MA: Harvard University Press.

Evolutionary perspectives on statistical learning

Kenny Smith

1. Introduction

Language is a socially-learned system: children learn the language of their speech community, and this social learning underpins both the synchronic diversity of the world's languages (Chomsky, 1986; Evans & Levinson, 2009) and the ways in which languages change over time (e.g. Lightfoot, 1999; Blevins, 2004). Despite the widespread agreement over the role of social learning in language transmission, one of the central debates in the cognitive sciences concerns the precise nature of the language learning process: is it a highly-constrained process which applies specifically and exclusively to the linguistic domain, or does it involve the application of relatively unconstrained, domain-general processes? Classic generative accounts typically favour the former type of approach: central processes of language acquisition are taken to involve selection from among a highly-constrained set of possible grammars, with those constraints being assumed to be specific to language (e.g. Chomsky, 1980; Jackendoff, 2002, although recent years have seen something of a shift away from the domain-specificity claim: see e.g. Hauser, Chomsky, & Fitch, 2002; Chomsky, 2005). The standard generative account is motivated by poverty of the stimulus arguments which suggest that language learning from data of the sort that children encounter is likely to be difficult (but see Pullum & Scholz, 2002, for a critical review of poverty of the stimulus claims). This mismatch between the apparent difficulties of language learning and the apparent ease with which language is acquired motivates the hypothesis that language learners come to the language acquisition task with strong expectations about the nature of the system they are attempting to learn: given the (putative) insurmountable difficulties of learning a language from data via relatively unconstrained learning processes, all interesting structural properties of language must be prespecified in the learner: "It is, for the present, impossible to formulate an assumption about initial, innate structure rich enough to account for the fact that grammatical knowledge is attained on the basis of the evidence available
to the learner" (Chomsky, 1965, p. 58). Under this account, the crosslinguistic variation sustained by social learning is of a relatively superficial nature, limited to those aspects of language which can plausibly be learnt from data by relatively unrestricted processes of inference. The core of language, in particular grammar, is largely prespecified in the learner: innate constraints provide a small number of possible grammars which learners select among on the basis of observed behaviour.

However, a recent trend has been to revise these strong conclusions about the nature of language learning. This has in part been driven by a more careful consideration of the challenges inherent in language learning. For instance, the linguistic data that learners encounter has been shown to have a surprisingly rich statistical structure which learners are capable of identifying and exploiting (see articles in this volume, Monaghan & Christiansen, 2008, for review). Furthermore, models of language learning as a process of rational inference show that some aspects of language which have been presented as model cases of stimulus-poverty can in fact be acquired from realistic corpora of linguistic data using domain-general learning techniques (e.g. constraints on the use of anaphoric one: Foraker, Regier, Khetarpal, Perfors, & Tenenbaum, 2009).

The primary battleground for this debate over the nature of the human language learning device is naturally the facts of language acquisition: what kinds of structures can be learned, and what kinds of mechanisms must be employed to learn them. However, every theory of language acquisition also comes with (often implicit) implications for language typology, language change, and the evolution of the learning mechanism in question, which can provide an additional constraint on theories of language acquisition: indeed, as discussed below, the idea that language universals constitute evidence for the strongly-constrained learning process of the standard generative account is widespread (but, it will be argued here, erroneous). Understanding these implications requires an account of the link between the properties of individuals and the properties of collective systems of behaviour which occur in populations of such individuals, and as such is an example of "population thinking", as practised in evolutionary biology (Mayr, 1982; Boyd & Richerson, 2000). Standard generative accounts of language acquisition assume a direct and transparent mapping from the mental apparatus of language learners to language structure: the latter is directly encoded in the former. Consequently, theories of this sort offer attractively straightforward explanations of language universals, language change, and the evolution of the language
faculty, which have been well developed in the literature. Constraints on crosslinguistic variation are simply a manifestation of species-wide constraints on language learning, i.e. universals are pre-specified in the learner ("there is no doubt that a theory of language, regarded as a hypothesis about the innate language-forming capacity of humans, should concern itself with universals", Chomsky, 1965, p. 30). In the strongest possible form (e.g. Baker, 2001), the claim is that we can read off the structure of the language faculty from the typological distribution of languages in the world. Similarly, language change is explained in terms of misconvergence during language acquisition: changes are actuated by learners arriving at different settings of the underlying parameters than those selected by the individuals they learnt from (e.g. Lightfoot, 1979; Clark & Roberts, 1993; Lightfoot, 1999; Niyogi, 2006). Finally, Pinker and Bloom (1990) have argued that the generative account invites an obvious evolutionary explanation for the origins of the domain-specific, highly-constraining language acquisition process central to the standard generative account, namely the biological evolution of the language faculty under natural selection (again, see Chomsky, 2005, for a minimalist alternative).

In this article I will summarise research addressing equivalent questions for domain-general statistical learning accounts. In section 2 I will review models and experiments, employing iterated learning techniques, which link hard-to-observe biases in statistical learning to apparently categorical effects in language structure. The crucial linking mechanism, which allows weak biases in individuals to have strong effects on language structure, is cultural transmission: because language is a socially-learnt system, biases of learners are repeatedly applied to language as it is passed down the generations, resulting (under some circumstances) in trajectories of language change which yield strong eventual effects on language structure. In other words, biases inherent in statistical learning can drive language change which, on large timescales, yields language universals. In section 3 I will review the implications of this weak-bias-strong-effect scenario for the biological evolution of those learner biases. Under at least some sets of assumptions, the implication is that strong learning biases should not generally be favoured by selection for communication: on evolutionary grounds, we should therefore expect to see learning biases which are weak if domain-specific, with strong biases being restricted to domain-general learning capacities (with strength being selected for on non-linguistic grounds).
2. Cultural transmission, learning bias and language universals

2.1. The general argument

Under the standard generative account, there is an explicit and straightforward linkage between properties of individual language learners and language universals (shared features of the world's languages): constraints on the possible forms of language are hard-wired into learners, and this tacit knowledge of constraints on linguistic variation is shared across the species. In other words, language universals directly reflect the language acquisition device (LAD). Indeed, there has been some experimental work aimed at uncovering evidence for such biases (see e.g. Lidz, 2010). Statistical learning accounts should also be able to explain universals, and do so without recourse to strong, domain-specific constraints acting on statistical learning – one of the appeals of statistical learning is that it does not require us to posit such constraints. There are several such accounts in the literature (see e.g. Hurford, 1990; Deacon, 1997; Kirby, 1999; Christiansen & Chater, 2008; Chater & Christiansen, 2009; Evans & Levinson, 2009), all of which revolve around the fact that cultural transmission (transmission of behaviour via social learning) acts as a mechanism linking properties of individuals to properties of collective linguistic systems. This process of language transmission is sometimes called iterated learning: the outcome of learning at one generation provides the input to learning at the next. Rather than language structure being strongly constrained by a highly restrictive learning apparatus, the idea is that languages adapt over repeated episodes of learning and production in response to much weaker (and possibly domain-general) constraints inherent in language learning and use. One possibility is to explain universal features of language as adaptations (by the linguistic system) to universal constraints operating on language use, rather than acquisition: for instance, Evans and Levinson (2009) argue that some universals (e.g. a function-content split) might reflect convergent (cultural) evolution of the same design solution to common communicative problems in unrelated languages. Similarly, some language universals might arise from shared constraints on language processing, for instance a preference to minimise sequential processing costs (Hawkins, 1994). Under this account, language is under continuous pressure to minimise processing costs for learner/users, resulting in a pressure towards language types which embody relatively processable structures: universal tendencies for consistent head ordering would be one example
of the impact of this processing pressure on the typological distribution of languages (Hawkins, 1994; Kirby, 1999). There is also a large body of modelling literature that shows that weak biases in language learning, rather than use, can have strong effects on language design as a consequence of cultural transmission: in other words, as in standard generative accounts, there is a mapping from properties of the learning device to properties of language, but there is no requirement that strong universal tendencies reflect strong constraints in learners. While typologically unattested languages might be both possible and even learnable, the languages we see in the world will typically be selected from the restricted set of highly learnable languages: languages which are hard to learn will tend to change, and those which are easy to learn will tend to be preserved, eventually yielding languages which are generally well-fitted to the biases of language learners (Christiansen, 1994; Deacon, 1997; Kirby, 2000; Christiansen & Chater, 2008; Chater & Christiansen, 2009). In other words, strong or absolute constraints on the space of possible languages do not have to be explicitly and directly hard-wired into learners, since cultural transmission will ensure that the languages which learners tend to encounter will be drawn from the subset of highly learnable languages. We have previously termed this evolutionary pressure cultural selection for learnability (Brighton, Kirby, & Smith, 2005).

This theory is supported by a number of formal models of cultural evolutionary processes which show that weak biases can have strong effects as a consequence of their repeated application during cultural transmission. The simplest such model is provided by Boyd and Richerson (1985). Assume there are two languages, L1 and L2. We are interested in the proportion of a population which uses L1 at time t, which we denote by p_t, and will assume that learners are exposed to the linguistic variants of two cultural parents. Boyd & Richerson provide the biased learning rule given in Table 1, where B gives the strength of the bias in favour of L1, 0 ≤ B ≤ 1. Assuming that we are dealing with a very large population, sufficiently large that we don't have to concern ourselves with noise arising from the stochastic selection of cultural parents, the proportion of the population using L1 after an episode of transmission is then given by:

p_{t+1} = p_t + B p_t (1 - p_t)

This expression shows that the frequency of L1 increases wherever B > 0 and 0 < p_t < 1: whenever L1 is present in the population, even if rare, it will increase in frequency.
Table 1. A biased learning rule

Parent 1 language    Parent 2 language    Probability of offspring acquiring
                                          L1                 L2
L1                   L1                   1                  0
L1                   L2                   (1/2)(1 + B)       (1/2)(1 - B)
L2                   L1                   (1/2)(1 + B)       (1/2)(1 - B)
L2                   L2                   0                  1

Consequently, the bias of learners drives L1 to fixation in the population, with the rate of change depending on the frequency of L1 in the population. The key feature of this model is that any B > 0 will drive L1 to fixation, given sufficient time: weak biases in learners can have strong or categorical effects on the frequencies of competing linguistic variants in a language. In other words, language universals are not necessarily a direct reflection of strong biases in the language acquisition device, but may derive from weak biases in learners which have strong effects as a result of cultural transmission. These dynamics of course center around the learning algorithm used by Boyd and Richerson (1985), and alternative algorithms may yield different cultural dynamics. For instance, Griffiths and Kalish (2007) show that similar dynamics can apply as a consequence of cultural transmission in populations of learners who learn via Bayesian inference, but only under certain assumptions about the way in which learners select a hypothesis given the posterior distribution over hypotheses (specifically, only if learners select the hypothesis with the maximum a posteriori (MAP) probability: if learners instead sample a hypothesis proportionately to its posterior probability, a rather different cultural dynamic ensues).
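To make these dynamics concrete, the recursion above can be iterated directly. The sketch below is a minimal illustration in Python; the starting frequency and the two bias strengths are arbitrary values chosen for demonstration, not taken from Boyd and Richerson.

```python
# Iterate the Boyd & Richerson (1985) recursion p(t+1) = p(t) + B*p(t)*(1 - p(t)),
# where p is the proportion of L1 users and B is the strength of the learning bias.
# Starting frequency and bias strengths are illustrative values only.

def iterate_biased_transmission(p0, B, generations):
    """Return the trajectory of L1 frequencies across generations."""
    trajectory = [p0]
    p = p0
    for _ in range(generations):
        p = p + B * p * (1 - p)
        trajectory.append(p)
    return trajectory

if __name__ == "__main__":
    for B in (0.01, 0.1):  # a very weak bias and a moderately strong one
        traj = iterate_biased_transmission(p0=0.01, B=B, generations=2000)
        # first generation at which L1 makes up more than 99% of the population
        fixation = next((t for t, p in enumerate(traj) if p > 0.99), None)
        print(f"B = {B}: L1 frequency after 2000 generations = {traj[-1]:.4f}, "
              f"first exceeds 0.99 at generation {fixation}")
```

With either value of B the population ends up (effectively) fixed on L1; the strength of the bias affects only how quickly this happens.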

Are the biases of learner/users the only factor shaping the distribution of languages in the world? It has been argued (see e.g. Kirby, 2002; Zuidema, 2003; Brighton, Kirby, & Smith, 2005; Kirby, Dowman, & Griffiths, 2007; K. Smith, Kirby, & Smith, 2008) that, at a minimum, language must be seen as a compromise between the biases of learner/users (which would subsume the types of processing and learning biases discussed above) and other constraints acting on languages during their transmission. The classic example of this second class of constraint is the mismatch between the infinite expressivity of languages and the finite set of data from which such languages must be learned. This transmission bottleneck favours languages which can be recreated from a subset, i.e. which contain some structure which facilitates generalisation. Recursive compositionality is one such generalisation (e.g. Kirby, 2002; Brighton, 2002), and therefore can be explained as an adaptation by language in response to pressure arising during its transmission. Importantly, the adaptiveness of recursive compositionality is not solely a reflection of the biases of learners: while this evolutionary process requires the correct learner biases to be in place (in particular, a capacity to make certain generalisations), it does not arise as a consequence of these learning biases alone, but is modulated by the transmission bottleneck (Brighton, Smith, & Kirby, 2005). More generally, in the cultural evolutionary dynamic arising from transmission via Bayesian learning (with MAP hypothesis selection), the amount of data learners see influences the extent to which the prior biases of learners are exaggerated: when learners see relatively little data, weak biases in favour of one language or class of languages can be massively amplified as a result of transmission; in contrast, the more data learners see, the weaker the amplification of their prior bias is (Griffiths & Kalish, 2007; Kirby et al., 2007). This work suggests that the biases of language learners can't simply be read off from typological distributions of languages, for two reasons. Firstly, strong or categorical effects on the distribution of languages (i.e. universals) may be a consequence of weak or strong biases in language learners: the mere presence of a universal does not indicate that that universal is coded in the learner in a strong or absolute manner. Secondly, some universals may be due to the mediating cultural dynamic, rather than a direct reflection of properties of the language learning device: the transmission bottleneck is one such example of a mediating dynamic; population structure is another (K. Smith, 2009; Burkett & Griffiths, 2010). Without understanding how the mediating cultural dynamics might act to shape language, we cannot assume that any universal directly reflects properties of language learners, let alone assume they reflect strong or absolute constraints on learnable systems.
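The contrast between MAP and sampling learners, and the role of the amount of transmitted data, can be illustrated with a toy iterated learning chain. The sketch below is a minimal illustration in the spirit of the models cited above, not a reimplementation of them: the two-hypothesis space, the production-noise level, the prior and the chain length are all assumptions made for demonstration.

```python
import random

# A toy iterated learning chain with two hypotheses. L1 produces variant 'a' with
# probability 1 - NOISE, L2 with probability NOISE. Each learner has a prior favouring L1,
# observes n_data utterances produced by the previous learner, and adopts a hypothesis
# either by MAP selection or by sampling from the posterior. Parameter values are
# illustrative assumptions, not taken from the cited models.

NOISE = 0.4       # production noise
PRIOR_L1 = 0.7    # weak-to-moderate prior bias in favour of L1

def produce(language, n_data):
    p_a = 1 - NOISE if language == "L1" else NOISE
    return sum(random.random() < p_a for _ in range(n_data))  # count of 'a' variants

def posterior_l1(count_a, n_data):
    like_l1 = (1 - NOISE) ** count_a * NOISE ** (n_data - count_a)
    like_l2 = NOISE ** count_a * (1 - NOISE) ** (n_data - count_a)
    joint_l1 = PRIOR_L1 * like_l1
    joint_l2 = (1 - PRIOR_L1) * like_l2
    return joint_l1 / (joint_l1 + joint_l2)

def run_chain(n_data, strategy, generations=20000):
    language = "L2"   # start the chain on the prior-disfavoured language
    count_l1 = 0
    for _ in range(generations):
        post = posterior_l1(produce(language, n_data), n_data)
        if strategy == "MAP":
            language = "L1" if post > 0.5 else "L2"
        else:  # posterior sampling
            language = "L1" if random.random() < post else "L2"
        count_l1 += language == "L1"
    return count_l1 / generations

if __name__ == "__main__":
    random.seed(1)
    for n_data in (3, 9):
        for strategy in ("MAP", "sample"):
            share = run_chain(n_data, strategy)
            print(f"n_data={n_data}, {strategy:6s}: long-run proportion of L1 = {share:.2f}")
```

With these illustrative settings, sampling learners should spend close to 70% of generations on L1 (mirroring the prior) regardless of how much data they see, whereas MAP learners amplify the prior, and do so more strongly when each learner sees only three utterances than when they see nine.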

2.2. A specific example: linguistic variation

Linguistic variation provides a useful test-case for exploring some of these debates on the link between learning biases and universal tendencies in language design. Additionally, the recent experimental work in this area provides a concrete example of a case where these cultural evolutionary accounts can offer some direct insight into the nature of human statistical learning mechanisms.

Languages tend not to exhibit unpredictable variation: all other things being equal, any variation in language tends to be conditioned on semantic, pragmatic, phonological or sociolinguistic criteria (Givón, 1985). Conditioning of variation occurs at all levels of linguistic structure, including phonetics (e.g. sociolinguistic conditioning of vowel variants in English: Labov, 1963), morphology (e.g. phonological conditioning of plural allomorphs in English: Lass, 1984, p. 13–14), and syntax (e.g. semantic conditioning of noun classes in Dyirbal: Dixon, 1972; sociolinguistic and syntactic conditioning of copula/auxiliary BE in Bequia: Meyerhoff, 2008). Similarly, processes of language change tend to eliminate variability injected into the linguistic system via processes of phonological change: levelling of variation has been termed the first law of analogical evolution (i.e. language change: Mańczak, 1980). Language creation through creolization can also be characterised as the construction of a new language via levelling and regularisation of a pool of linguistic variants arising from radical language contact (Siegel, 2004).

Why do languages tend to exhibit conditioned, rather than unpredictable (or free) variation? And why do languages reliably change so as to eliminate such variability? As discussed in the introduction to this chapter, one possible explanation for such universal tendencies is that there is a simple mapping between properties of individuals and properties of language: the absence of free variation in language reflects the fact that language learners have strong or absolute biases against free variation. Child learners have such strong, readily-observed biases: Hudson Kam and Newport (2005, 2009) show that, when trained on an artificial language exhibiting unpredictable variation (namely a language in which two variants of a meaningless marker word alternate randomly), six-year-old children typically eliminate this unpredictable variation by eliminating one of the two variants entirely, producing only one version of the marker word on test. Adult learners undergoing the same learning procedure do not demonstrate this regularisation behaviour, and instead appear to reproduce the free variation of their linguistic input in their output. However, the evolutionary accounts discussed above suggest that regularity in language could also be a consequence of cumulative evolutionary processes acting on language during its cultural transmission: even weak learner biases, which may be hard to spot in standard artificial language learning paradigms, may have strong or categorical effects as a consequence of their repeated application. In other words, the absence of a strong signature of regularisation in individual learners of a particular sort (i.e. adults) does not necessarily mean that regularisation will not occur in
populations of such learners. In order to pinpoint the source of regularity in language, we have to establish whether populations of adult learners might also be capable of regularising a language. In order to resolve these questions, two recent papers investigate the consequences of cultural transmission explicitly, using an iterated artificial language learning methodology. Iterated artificial language learning (see Scott-Phillips & Kirby, 2010, for a review) involves combining standard artificial language learning tasks with cultural diffusion chain techniques designed to study simple cultural systems in the lab. In cultural diffusion experiments, an artificial population is created and seeded with some behaviour, that behaviour is passed from individual to individual in the population via social learning, and the change in that behaviour over time is observed: this may be to establish the presence of cultural transmission (e.g. in non-humans, as in Horner, Whiten, Flynn, & de Waal, 2006), to explore the social-learning strategies involved (e.g. Mesoudi, 2008; Mesoudi & O'Brien, 2008), or to explore the biases of learners (Mesoudi, Whiten, & Dunbar, 2006; Kalish, Griffiths, & Lewandowsky, 2007; Griffiths, Christian, & Kalish, 2008); see Whiten and Mesoudi (2008); Mesoudi and Whiten (2008) for reviews of the diffusion chain method.

In a standard artificial language learning experiment, participants are trained on some artificial language (e.g. a set of sequences of words or picture-label pairs following some abstract pattern), then tested to see how well they have learnt this language (e.g. to what extent they can produce or identify sequences or labellings drawn from the target language). In an iterated artificial language learning experiment, as in the standard diffusion chain method, the language a participant produces during testing is simply used as the target language for the next individual in a chain of transmission: the language is passed from participant to participant, and potentially changes as a result.

Reali and Griffiths (2009) describe an iterated artificial language learning experiment where participants are required to learn a set of object-label pairs. In the initial language, exhibiting unpredictable variation, each object is paired with two labels (i.e. the choice of label is unpredictable). Adult learners exposed to this unpredictable variation appear to match the unpredictability of their input language reasonably closely: there is no statistical signature of regularisation in the language they produce on test (as also demonstrated by Vouloumanos, 2008). However, as these languages are transmitted along diffusion chains they become less variable and more predictable: one of the two labels for each object is lost, eventually yielding a final system which exhibits no variability. This loss of
variability in the system of object labelling is consistent with a learning bias in adults, not observed in the initial group of participants, against many-to-one label-object mappings, i.e. some sort of mutual exclusivity bias (Markman & Wachtel, 1988).

K. Smith and Wonnacott (2010) present a related iterated artificial language learning experiment based around a slightly different learning task, where learners are trained on an artificial language which provides descriptions for scenes involving one or two animals. An initial group of adult participants attempted to learn a target language where nouns were marked for plurality using one of two plural markers, with the choice of plural marker varying unpredictably. Again, these isolate learners appear to capture this unpredictability fairly reliably: there is no strong statistical signature of a bias against unpredictability. None the less, and similarly to the Reali & Griffiths finding, the artificial languages increase in predictability over repeated episodes of transmission: after five generations, 9 of 10 diffusion chains exhibit entirely predictable plural marker usage. However, unlike in the Reali & Griffiths experiment, this increase in predictability is not achieved by eliminating one of the two variants: both plural markers are retained in the majority of the diffusion chains. The persistence of predictable variability is due to the availability of (rather minimal) linguistic context: over episodes of transmission, the usage of the two plural markers becomes conditioned on the noun being marked, such that plurality on some nouns is marked with one marker, and plurality on other nouns is marked with the alternative marker. The availability of conditioning context, absent from the Reali & Griffiths study, allows variability to be stable over repeated episodes of transmission in spite of a learner bias in favour of predictability: conditioned variability, of the sort witnessed in real languages, is the result. Interestingly, Hudson Kam and Newport (2009) report results on a much more complex artificial language learning task which are indicative of learner biases in this same direction: although the majority of adult learners in their experiment produce output which seems genuinely unpredictable, those that do introduce regularity into the system tend to do so by conditioning variants on the available linguistic context, even when no such conditioning is suggested in their input.

What then are the causes of regularity in natural languages? One possibility is that regularity is a signature of strong learner biases in child learners, as per the evidence provided by Hudson Kam and Newport (2005, 2009). An additional or alternative possibility is that this regularity is a consequence of far weaker biases inherent in adult statistical learning, which have a strong effect as a result of cultural transmission. Finally, it may be
the case that different regularisation processes are attributable to different learners and learning biases: for instance, regularisation by children may tend to eliminate variants, whereas regularisation by adults may tend to preserve variants via conditioning on linguistic context. Once such conditioned variation is established in a language, it seems to be learnable by children, even if the conditioning contexts are rather complex or dependent on sociolinguistic knowledge (Austin, Newport, & Wonnacott, 2006; J. Smith, Durham, & Fortune, 2007). This suggests a mismatch between ability to acquire systems of conditioned variability (both adults and children can do so) and imposition of conditioning on systems which exhibit free variation (only adults seem to impose conditioning on unpredictable variation, whereas children eliminate it) which is at least worthy of further study: it may be the case, for instance, that children extend the conditioning of variation in systems where it is partly conditioned, but do not do so when the variability of the input system is above some critical value. It is also reasonable to ask where adult biases in favour of conditioning of variability come from: an obvious possibility is that this is an acquired expectation about the structure of variation in natural languages.

3. Biological evolution of the language faculty

Given the assumption, inherent in the standard generative account, that structural features of language are a direct reflection of properties of the language faculty, evolution by natural selection is the best explanation for the origins of those properties (e.g. Pinker & Bloom, 1990; Pinker & Jackendoff, 2005): it seems reasonable to assume that evolution has favoured language faculties, and by direct extension languages, which have features which permit or enhance communication.1 The language faculty is therefore hypothesised to be domain-specific in the specifically evolutionary sense, in that it is an adaptation for the acquisition of a linguistic communication system (rather than, say, an adaptation for general-purpose reasoning about sequences or social relationships, which would be a domain-general capacity).

1. Note that, under this account, there is a separation between language evolution, which is the biological evolution of the language faculty, and language change, which is a cultural process involving shifts between possible languages permitted by the biologically-evolved language faculty. In accounts which emphasise the role of cultural processes in shaping language design (e.g. those discussed in section 2), this delineation between language change and language evolution is blurred: both are potentially a consequence of cultural evolutionary processes, the distinction being based on the timescale of that evolutionary process and the qualitative nature of the changes involved: see Scott-Phillips and Kirby (2010) for brief discussion.
Statistical learning accounts make rather different claims about the evolutionary origins of the human capacity for language, typically emphasising the domain-generality of statistical learning mechanisms (see e.g. Saffran & Thiessen, 2007, for review). Furthermore, statistical learning appears not to be isolated to humans (Hauser, Newport, & Aslin, 2001; Newport, Hauser, Spaepen, & Aslin, 2004; Toro & Trobalon, 2005; Takahasi, Yamada, & Okanoya, 2010): this suggests that it potentially relies on capacities which can either be selected for in the absence of language, or which have deep homologies in animal cognition. As such, statistical learning in humans is claimed to be domain-general in the evolutionary sense given above (i.e. to have evolved for its utility in a number of domains, including language), or might even represent a case of exaptation or evolutionary reappropriation, whereby mechanisms adapted for one purpose are pressed into service for other functions.2

This evolutionary picture is supported by a strand of modelling work, summarised below, which explores the evolutionary implications of the cultural evolutionary processes outlined in section 2 (see also Kirby & Hurford, 1997; Chater, Reali, & Christiansen, 2009, for related models). As discussed in that section, weak biases in individual learners can have strong effects in populations, as a result of their repeated application during cultural transmission – in other words, the assumption that language faculties map directly to languages, central to the evolutionary strand of the standard generative account, is incorrect. This has implications for the biological evolution of such biases: in particular, given that both strong and weak biases ultimately have equivalent effects on language structure, bias strength is shielded from selection acting on biases.

2. Note that the iterated artificial language learning experiments described in the previous section do not necessarily speak to the domain-specificity of the mechanisms involved: since the materials in these experiments are linguistic in nature, it may be the case that these experiments engage some domain-specific language learning capacity. Indeed, Reali and Griffiths (2009) show that adult learners bring different biases to a nonlinguistic task which is structurally analogous to their linguistic task, suggesting adult biases in favour of regularisation may be somewhat specialised for language, although as discussed in the main text, a range of other research emphasises the domain-generality of statistical learning mechanisms.
The evolutionary prediction is therefore that biological evolution should not deliver learning biases which are both strongly-constraining and language-specific. Furthermore, related coevolutionary work suggests that the frequency-dependent nature of communication weakens the selection pressure in favour of desirable learning biases when those biases are rare, again favouring accounts which do not appeal to the evolution of strong, domain-specific constraints on the learning of language.

3.1. Shielding of bias strength due to cultural evolution

Given that, as discussed above, weak biases can have strong cumulative effects, and that the stable outcomes of cultural transmission in populations can be identical under a range of bias strengths, it is possible that biological evolution is neutral with respect to strength of bias. Boyd & Richerson's (1985) directly biased transmission model can be used to illustrate this point. Consider a population where there are two strengths of bias in favour of L1, B1 > B2 > 0, with bias strength being genetically transmitted from parent to offspring. Further assume that individuals who acquire L1 receive some payoff relative to L2 learners: L1 has some inherent functional advantages over L2 (for instance, it might be more expressive than L2, or more concise). Under these conditions, as discussed in section 2.1, biased learning will drive L2 out of the population. Once L2 has been eliminated, any selective advantage associated with B1 will disappear: in a culturally homogeneous population, all learners learn the sole cultural variant in the population, regardless of bias strength (assuming the learning rule given in Table 1). In other words, and assuming that (1) cultural evolution is fast relative to biological evolution (i.e. selection will tend to act on populations which are at cultural equilibrium, in this case meaning dominated by L1) and (2) biases of differing strengths are equally costly, the model predicts selective neutrality with respect to bias strength: there will not be selection in favour of strong biases. If we relax this second assumption and instead assume that stronger biases are more costly, then there will be selection in favour of weaker biases. More generally, Boyd and Richerson (1985) show that any selective advantage for biased learning relative to unbiased learning (B = 0) depends on cultural variation in the population, and cultural evolution under directly biased transmission eliminates that variation, thereby eliminating evolutionary pressure in favour of such biases.
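The neutrality argument can be made concrete with a little arithmetic. Under the Table 1 rule, a learner with bias B whose two cultural parents are drawn from a population in which L1 has frequency p acquires L1 with probability p + Bp(1 - p), so any payoff advantage enjoyed by a stronger bias is proportional to p(1 - p) and disappears as L1 approaches fixation. The sketch below illustrates this; the payoff values and bias strengths are illustrative assumptions, not parameters taken from Boyd and Richerson.

```python
# Under the Table 1 rule, the probability of acquiring L1 given parental frequency p and
# bias B is: p^2 * 1 + 2*p*(1-p) * (1/2)*(1+B) + (1-p)^2 * 0 = p + B*p*(1-p).
# If acquiring L1 carries a payoff advantage, the expected payoff difference between a
# strong-bias and a weak-bias learner is proportional to p*(1-p), which vanishes once
# biased transmission has driven L1 to fixation. Payoffs and bias values are illustrative.

def prob_acquire_l1(p, B):
    """Probability of acquiring L1 under the Table 1 rule with two random cultural parents."""
    return p * p + 2 * p * (1 - p) * 0.5 * (1 + B)

def expected_payoff(p, B, payoff_l1=1.0, payoff_l2=0.5):
    q = prob_acquire_l1(p, B)
    return q * payoff_l1 + (1 - q) * payoff_l2

if __name__ == "__main__":
    strong_bias, weak_bias = 0.5, 0.05
    for p in (0.1, 0.5, 0.9, 0.99, 1.0):
        advantage = expected_payoff(p, strong_bias) - expected_payoff(p, weak_bias)
        print(f"L1 frequency {p:.2f}: payoff advantage of the stronger bias = {advantage:.4f}")
```

The advantage of the stronger bias shrinks towards zero as L1 approaches fixation: selection on bias strength depends on exactly the cultural variation that biased learning itself removes.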

K. Smith and Kirby (2008) present a related result for the evolution of prior biases for language learning in populations of MAP Bayesian learners. Populations of these learners behave similarly to Boyd & Richerson's biased learners, in that weak prior biases in favour of particular languages (or classes of languages) can have strong cumulative effects as a consequence of transmission. K. Smith and Kirby (2008) show that, if bias strength is genetically transmitted, selection operates on the stationary outcomes of cultural transmission (i.e. cultural change is fast relative to biological evolution), and selection favours learners who converge on the same language as their peers, then selection is neutral over strength of bias: strongly-biased learners receive no selective advantage over weakly-biased learners, since both are likely to occupy populations which are converged to the type of system matching their bias and are equally capable of acquiring such a system. Again, assuming some cost for (or mutational pressure against) strong biases, the net outcome of selection on bias strength should be the weakest-possible bias in favour of linguistic systems of particular kinds.

What do these results imply for the evolution of the human language capacity? Both models suggest that, once we relax the assumption that there is a direct mapping from properties of language learners to properties of languages (as we must, in situations where weak biases can have strong effects as a result of language transmission), we should not expect to see the evolution of strongly-constrained learning. This is true even if we assume that learners are under direct selection for communication, i.e. for coordinating with their peers. This does not mean that strong constraints on learning cannot evolve, only that (under the assumptions of these models) they cannot evolve if they are solely under selection for communication. In other words, these models do not speak against the evolution of strong domain-general constraints on learning, only constraints which are both strong and domain-specific to (i.e. evolved for) language.

3.2. Coordination and the evolution of learning biases

Boyd & Richerson's analysis of the evolution of direct bias assumes that there is some cultural variant which is inherently more functional than alternatives: selection will favour this bias up until the point where the cultural variance on which this selection depends is used up. However, the assumption of an a priori desirable cultural variant seems to apply less to coordination problems like language. While it may be the case that some linguistic variants are more functional than others, this is
potentially modulated by the requirement for language users to coordinate: using a more functional linguistic system is not necessarily advantageous if no-one else in your population uses (produces or understands) that system. In other words, using the optimal linguistic system is potentially less important than using the linguistic system that other members of your population use. This potentially makes evolving biases in favour of desirable communication systems difficult. Cumulative cultural evolution takes time, and consequently social learning may not be selected for when rare (Boyd & Richerson, 1996). Cumulative cultural evolution eventually produces behaviours which are more complex and functional than those which can be discovered via asocial learning processes (also known as individual learning: learning via independent exploration of the environment without social influence). Consequently there is strong selective pressure in favour of the ability to learn socially in populations where such complex behaviours have already been established via cultural processes. However, at the early stages of the process of cumulative cultural evolution, the advantages of learning socially will be limited (the behaviours available for social learning may be no more adaptive than those which could be achieved via asocial learning), and may be outweighed by the costs of social learning, resulting in selection in favour of purely asocial learning (Boyd & Richerson, 1996).

K. Smith (2004) shows that this evolutionary problem applies to the evolution of biased learning as well, and is particularly problematic for coordination problems such as language. As a test case, K. Smith (2004) considers the cultural transmission of vocabulary systems in populations where learners have some genetically-coded bias in favour of vocabularies with varying functionality. All other things being equal, a one-to-one vocabulary system is optimal in terms of communication: if two individuals share a one-to-one vocabulary system, the lack of ambiguity in the system enables a hearer to correctly identify the object communicated about by a speaker. In contrast, many-to-one vocabulary systems are less functional: they associate multiple objects with the same label, and consequently ambiguous words leave the hearer with some uncertainty as to the intended referent. As expected given the models discussed above, in populations which are homogeneous with respect to learning bias, the population's vocabulary system comes to reflect the biases of the learners, with consequences for communication in those populations: weak biases in favour of one-to-one mappings result in the cultural evolution of optimal communication systems; biases in favour of many-to-one mappings result in maximally ambiguous vocabularies, with consequently low levels of communicative accuracy.
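The functional difference between unambiguous and ambiguous vocabularies can be quantified directly. The sketch below is a minimal illustration only; the toy lexicons and the success measure (the probability that a hearer who guesses uniformly among the objects consistent with the heard label recovers the speaker's intended object) are assumptions made for demonstration, not details of K. Smith's (2004) model.

```python
# Expected communicative accuracy of a shared vocabulary: the speaker picks an object at
# random, produces its label, and the hearer guesses uniformly among the objects that the
# shared vocabulary maps to that label. One-to-one mappings give accuracy 1; many-to-one
# mappings leave the hearer uncertain. The lexicons below are illustrative toy examples.

def communicative_accuracy(vocabulary):
    """vocabulary: dict mapping each object to a label, shared by speaker and hearer."""
    objects = list(vocabulary)
    total = 0.0
    for obj in objects:
        label = vocabulary[obj]
        competitors = [o for o in objects if vocabulary[o] == label]
        total += 1 / len(competitors)  # hearer guesses uniformly among competitors
    return total / len(objects)

if __name__ == "__main__":
    one_to_one = {"obj1": "la", "obj2": "ti", "obj3": "mu", "obj4": "ke"}
    many_to_one = {"obj1": "la", "obj2": "la", "obj3": "la", "obj4": "la"}
    print("one-to-one accuracy: ", communicative_accuracy(one_to_one))   # 1.0
    print("many-to-one accuracy:", communicative_accuracy(many_to_one))  # 0.25
```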

The co-evolutionary extension to this model considers populations which are heterogeneous with respect to learning bias, in which parents transmit their bias to their offspring and reproduction is proportionate to communicative success, with individuals whose vocabulary systems enable them to successfully communicate with other population members being likely to reproduce more. A reasonable expectation would be that selection acting on the transmission of learning biases can identify those learning biases which lead, via cultural processes, to functional vocabularies, resulting in the eventual biological evolution of one-to-one learning biases and the cultural evolution of unambiguous vocabulary systems: biological evolution yields the learning biases which produce the cultural dynamics which yield optimal communication.

While the results of a number of simulation runs of the co-evolutionary model show that this is a possible outcome, it is however contingent on a period of genetic drift maintaining the advantageous bias in the population for sufficient time to allow its cumulative cultural effects to be felt: individuals bearing the desirable one-to-one bias receive no fitness advantage in populations where they are rare, for two reasons. Firstly, following the logic provided by Boyd and Richerson (1996), having a bias in favour of acquiring a specific type of system which is not already in place in the population is not an advantage. Secondly, the requirement that vocabularies be coordinated if they are to facilitate communication acts as a further barrier to evolving desirable biases: even if there are a few other learners in the population with the same desirable bias, they are unlikely to share their vocabulary, and will therefore receive no payoff for having the 'right' bias. While individuals with the apparently desirable bias do accrue some advantage if they remain in the population in sufficient numbers to begin the construction of a shared and unambiguous vocabulary, they are prone to elimination by drift (driven by stochastic sampling for reproduction) at the early stages of this process. Again, these models speak against the evolution of strong domain-specific biases for language: in situations where reproduction depends on communication, learning biases which ultimately favour the acquisition of communicatively-functional behaviours are unlikely to evolve, due to their initial low frequency. Again, these models do not speak to the evolution of domain-general biases, or the evolutionary re-appropriation of biases which become frequent in a population for reasons other than their communicative consequences: their predictions are, however, incompatible
with an evolutionary story which centers around the evolution of strong, domain-specific traits.

4. Conclusions

Every theory of language acquisition implies a particular theory of language universals, language change, and the evolution of the capacities underpinning language acquisition. These implicit predictions can be used as another source of constraint on theories of language acquisition, or linguistic theory more generally (Kinsella, 2009). In this chapter I have outlined how weak biases inherent in statistical learning can induce patterns of language change which deliver strong or universal tendencies in language design: the existence of language universals does not force us to conclude that there are absolute constraints on the kinds of languages that can be learnt, and in fact drawing that conclusion is fundamentally unsafe. Furthermore, the evolutionary modelling literature reviewed here shows that, in situations where there is no straightforward mapping from properties of learners to properties of languages, biological evolution will probably not deliver strongly-constrained domain-specific learning devices: weak biases or strong but domain-general biases, of the sort proposed in the statistical learning literature, are more plausible on evolutionary grounds.

References

Austin, A., Newport, E. L., & Wonnacott, E. 2006 Predictable versus unpredictable variation: Regularization in adult and child learners. (Paper presented at BUCLD 31)
Baker, M. 2001 The atoms of language: The mind's hidden rules of grammar. New York, NY: Basic Books.
Blevins, J. 2004 Evolutionary phonology: The emergence of sound patterns. Cambridge: Cambridge University Press.
Boyd, R., & Richerson, P. J. 1985 Culture and the evolutionary process. Chicago, IL: University of Chicago Press.
Boyd, R., & Richerson, P. J. 1996 Why culture is common but cultural evolution is rare. Proceedings of the British Academy, 88, 73–93.
Boyd, R., & Richerson, P. J. 2000 Memes: Universal acid or a better mousetrap. In R. Aunger (Ed.), Darwinizing culture: The status of memetics as a science (pp. 143–162). Oxford: Oxford University Press. Brighton, H. 2002 Compositional syntax from cultural transmission. Artificial Life, 8(1), 25–54. Brighton, H., Kirby, S., & Smith, K. 2005 Cultural selection for learnability: Three principles underlying the view that language adapts to be learnable. In M. Tallerman (Ed.), Language origins: Perspectives on evolution (pp. 291–309). Oxford: Oxford University Press. Brighton, H., Smith, K., & Kirby, S. 2005 Language as an evolutionary system. Physics of Life Reviews, 2(3), 177–226. Burkett, D., & Gri‰ths, T. L. 2010 Iterated learning of multiple languages from multiple teachers. In A. D. M. Smith, M. Schouwstra, B. de Boer, & K. Smith (Eds.), The Evolution of Language: Proceedings of the 8th International Conference (EVOLANG 8) (p. 58–65). Singapore: Word Scientific. Chater, N., & Christiansen, M. H. 2009 Language acquisition meets language evolution. Cognitive Science, 34(7), 1131–1157. Chater, N., Reali, F., & Christiansen, M. H. 2009 Restrictions on biological adaptation in language evolution. Proceedings of the National Academy of Sciences, USA, 27, 1015– 1020. Chomsky, N. 1965 Aspects of the theory of syntax. Cambridge, MA: MIT Press. Chomsky, N. 1980 Rules and representations. London: Basil Blackwell. Chomsky, N. 1986 Knowledge of language: its nature, origin and use. London: Praeger. Chomsky, N. 2005 Three factors in language design. Linguistic Inquiry, 36(1), 1–22. Christiansen, M. 1994 Infinite languages, finite minds: Connectionism, learning and linguistic structure. Unpublished doctoral dissertation, University of Edinburgh. Christiansen, M., & Chater, N. 2008 Language as shaped by the brain. Behavioral and Brain Sciences, 31, 489–509.


Clark, R., & Roberts, I. 1993 A computational model of language learnability and language change. Linguistic Inquiry, 24, 299–345.
Deacon, T. 1997 The symbolic species. London: Penguin.
Dixon, R. M. W. 1972 The Dyirbal language of North Queensland. Cambridge: Cambridge University Press.
Evans, N., & Levinson, S. C. 2009 The myth of language universals: Language diversity and its importance for cognitive science. Behavioral and Brain Sciences, 32, 429–492.
Foraker, S., Regier, T., Khetarpal, N., Perfors, A., & Tenenbaum, J. 2009 Indirect evidence and the poverty of the stimulus: The case of anaphoric one. Cognitive Science, 33, 287–300.
Givón, T. 1985 Function, structure, and language acquisition. In D. Slobin (Ed.), The crosslinguistic study of language acquisition (Vol. 2, pp. 1005–1028). Hillsdale, NJ: Lawrence Erlbaum.
Griffiths, T. L., Christian, B. R., & Kalish, M. L. 2008 Using category structures to test iterated learning as a method for identifying inductive biases. Cognitive Science, 32(1), 68–107.
Griffiths, T. L., & Kalish, M. L. 2007 Language evolution by iterated learning with Bayesian agents. Cognitive Science, 31, 441–480.
Hauser, M. D., Chomsky, N., & Fitch, W. T. 2002 The faculty of language: What is it, who has it, and how did it evolve? Science, 298, 1569–1579.
Hauser, M. D., Newport, E. L., & Aslin, R. N. 2001 Segmentation of the speech stream in a non-human primate: Statistical learning in cotton-top tamarins. Cognition, 78(3), B53–B64.
Hawkins, J. A. 1994 A performance theory of order and constituency. Cambridge: Cambridge University Press.
Horner, V., Whiten, A., Flynn, E. G., & de Waal, F. B. M. 2006 Faithful replication of foraging techniques along cultural transmission chains by chimpanzees and children. Proceedings of the National Academy of Sciences, USA, 103, 13878–13883.
Hudson Kam, C. L., & Newport, E. L. 2005 Regularizing unpredictable variation: The roles of adult and child learners in language formation and change. Language Learning and Development, 1, 151–195.


Hudson Kam, C. L., & Newport, E. L. 2009 Getting it right by getting it wrong: When learners change languages. Cognitive Psychology, 59, 30–66.
Hurford, J. R. 1990 Nativist and functional explanations in language acquisition. In I. M. Roca (Ed.), Logical issues in language acquisition (pp. 85–136). Dordrecht: Foris.
Jackendoff, R. 2002 Foundations of language: Brain, meaning, grammar, evolution. Oxford: Oxford University Press.
Kalish, M. L., Griffiths, T. L., & Lewandowsky, S. 2007 Iterated learning: Intergenerational knowledge transmission reveals inductive biases. Psychonomic Bulletin and Review, 14, 288–294.
Kinsella, A. R. 2009 Language evolution and syntactic theory. Oxford: Oxford University Press.
Kirby, S. 1999 Function, selection and innateness: The emergence of language universals. Oxford: Oxford University Press.
Kirby, S. 2000 Syntax without natural selection: How compositionality emerges from vocabulary in a population of learners. In C. Knight, M. Studdert-Kennedy, & J. Hurford (Eds.), The evolutionary emergence of language: Social function and the origins of linguistic form (pp. 303–323). Cambridge: Cambridge University Press.
Kirby, S. 2002 Learning, bottlenecks and the evolution of recursive syntax. In E. Briscoe (Ed.), Linguistic evolution through language acquisition: Formal and computational models (pp. 173–203). Cambridge: Cambridge University Press.
Kirby, S., Dowman, M., & Griffiths, T. L. 2007 Innateness and culture in the evolution of language. Proceedings of the National Academy of Sciences, USA, 104, 5241–5245.
Kirby, S., & Hurford, J. R. 1997 Learning, culture and evolution in the origin of linguistic constraints. In P. Husbands & I. Harvey (Eds.), Fourth European Conference on Artificial Life (pp. 493–502). Cambridge, MA: MIT Press.
Labov, W. 1963 The social motivation of a sound change. Word, 19, 273–309.
Lass, R. 1984 Phonology: An introduction to basic concepts. Cambridge: Cambridge University Press.
Lidz, J. 2010 Language learning and language universals. Biolinguistics, 4(2–3), 201–217.

Lightfoot, D. 1979 Principles of diachronic syntax. Cambridge: Cambridge University Press.
Lightfoot, D. 1999 The development of language: Acquisition, change, and evolution. Oxford: Blackwell.
Mańczak, W. 1980 Laws of analogy. In J. Fisiak (Ed.), Historical morphology (pp. 283–288). The Hague: Mouton.
Markman, E. M., & Wachtel, G. F. 1988 Children's use of mutual exclusivity to constrain the meaning of words. Cognitive Psychology, 20, 121–157.
Mayr, E. 1982 The growth of biological thought. Cambridge, MA: Harvard University Press.
Mesoudi, A. 2008 An experimental simulation of the "copy-successful-individuals" cultural learning strategy: Adaptive landscapes, producer-scrounger dynamics and informational access costs. Evolution and Human Behavior, 29, 350–363.
Mesoudi, A., & O'Brien, M. J. 2008 The cultural transmission of Great Basin projectile point technology I: An experimental simulation. American Antiquity, 73, 3–28.
Mesoudi, A., & Whiten, A. 2008 The multiple roles of cultural transmission experiments in understanding human cultural evolution. Philosophical Transactions of the Royal Society of London B, 363, 3489–3501.
Mesoudi, A., Whiten, A., & Dunbar, R. 2006 A bias for social information in human cultural transmission. British Journal of Psychology, 97, 405–423.
Meyerhoff, M. 2008 Bequia sweet/Bequia is sweet: Syntactic variation in a lesser-known variety of Caribbean English. English Today, 93, 31–37.
Monaghan, P., & Christiansen, M. H. 2008 Integration of multiple probabilistic cues in syntax acquisition. In H. Behrens (Ed.), Corpora in language acquisition research: History, methods, perspectives (pp. 139–164). Amsterdam: John Benjamins.
Newport, E. L., Hauser, M. D., Spaepen, G., & Aslin, R. N. 2004 Learning at a distance: II. Statistical learning of non-adjacent dependencies in a non-human primate. Cognitive Psychology, 49, 85–117.
Niyogi, P. 2006 The computational nature of language learning and evolution. Cambridge, MA: MIT Press.


Pinker, S., & Bloom, P. 1990 Natural language and natural selection. Behavioral and Brain Sciences, 13(4), 707–784.
Pinker, S., & Jackendoff, R. 2005 The faculty of language: What's special about it? Cognition, 95(2), 201–236.
Pullum, G. K., & Scholz, B. C. 2002 Empirical assessment of stimulus poverty arguments. The Linguistic Review, 19(1–2), 9–50.
Reali, F., & Griffiths, T. L. 2009 The evolution of frequency distributions: Relating regularization to inductive biases through iterated learning. Cognition, 111, 317–328.
Saffran, J. R., & Thiessen, E. D. 2007 Domain-general learning capacities. In E. Hoff & M. Shatz (Eds.), Handbook of language development (pp. 68–86). Cambridge: Blackwell.
Scott-Phillips, T. C., & Kirby, S. 2010 Language evolution in the laboratory. Trends in Cognitive Sciences, 14(9), 411–417.
Siegel, J. 2004 Morphological simplicity in pidgins and creoles. Journal of Pidgin and Creole Languages, 19, 139–162.
Smith, J., Durham, M., & Fortune, L. 2007 'Mam, my trousers is fa'in doon!' Community, caregiver and child in the acquisition of variation in a Scottish dialect. Language Variation and Change, 19, 63–99.
Smith, K. 2004 The evolution of vocabulary. Journal of Theoretical Biology, 228(1), 127–142.
Smith, K. 2009 Iterated learning in populations of Bayesian agents. In N. Taatgen & H. van Rijn (Eds.), Proceedings of the 31st Annual Conference of the Cognitive Science Society (pp. 697–702). Austin, TX: Cognitive Science Society.
Smith, K., & Kirby, S. 2008 Cultural evolution: Implications for understanding the human language faculty and its evolution. Philosophical Transactions of the Royal Society B, 363, 3591–3603.
Smith, K., Kirby, S., & Smith, A. D. M. 2008 The brain plus the cultural transmission mechanism determine the nature of language. Behavioral and Brain Sciences, 31, 533–534.
Smith, K., & Wonnacott, E. 2010 Eliminating unpredictable variation through iterated learning. Cognition, 116, 444–449.


Takahasi, M., Yamada, H., & Okanoya, K. 2010 Statistical and prosodic cues for song segmentation learning by Bengalese finches (Lonchura striata var. domestica). Ethology, 116(6), 481–489.
Toro, J. M., & Trobalon, J. B. 2005 Statistical computations over a speech stream in a rodent. Perception and Psychophysics, 67(5), 867–875.
Vouloumanos, A. 2008 Fine-grained sensitivity to statistical information in adult word learning. Cognition, 107, 729–742.
Whiten, A., & Mesoudi, A. 2008 Establishing an experimental science of culture: Animal social diffusion experiments. Philosophical Transactions of the Royal Society of London B, 363, 3477–3488.
Zuidema, W. H. 2003 How the poverty of the stimulus solves the poverty of the stimulus. In S. Becker, S. Thrun, & K. Obermayer (Eds.), Advances in Neural Information Processing Systems 15 (pp. 51–58). Cambridge, MA: MIT Press.

Statistical learning: What can music tell us?

Psyche Loui

1. Introduction

Statistical learning is a theory of language acquisition that has gained popularity in recent years. The utility of a statistical learning theory for language lies in its possible explanatory power for multiple phenomena in language acquisition. While many aspects of statistical learning will be covered elsewhere in this volume, in this chapter I will consider statistical learning in nonlinguistic modalities, with special focus on music. The central thesis of this chapter is that much of what we know and love about music can be acquired via experience with the statistical regularities of the world, which consist of frequencies and probabilities of sounds that interact with biological constraints of the central nervous system. These statistical regularities, in turn, may result from physical constraints of objects that produce sounds in the natural world.

2. Statistical learning and the problem of language acquisition

When faced with a sea of sounds, infants are required to develop a basic understanding of sounds to satisfy the pressure to communicate. This basic understanding includes identifying the boundaries between words, identifying the phonemes that are representative of the infant's own language, and distilling syntactical and morphological rules that govern their language. In recent years, these skills have been shown to be learnable solely via exposure to sounds that are governed by statistical relationships. In the seminal study on word segmentation, Saffran et al. (1996) presented eight-month-old infants with a continuous speech stream consisting of three-syllable nonsense words where adjacent syllables were presented with high transitional probability within the word, but low transitional probability between words. This familiarization phase was followed by test trials where infants were presented with words and non-words (Experiment 1) or words and part-words (Experiment 2). Results showed that eight-month-old infants can use the transitional probability between adjacent syllables


as a statistical cue to identify boundaries between words in an artificial language (Sa¤ran, Aslin, & Newport, 1996). Importantly, the infants were able to distinguish not only between words and non-words, but also between words and part-words that had appeared in the exposure stream but with lower transitional probabilities compared to words. Subsequent studies tested for word segmentation using stimuli that were matched for frequency, suggesting that transitional probability (or conditional probability) of syllable pairs is used as the important statistical cue for word boundaries (Aslin, Sa¤ran, & Newport, 1998). While the ability to find word boundaries is undeniably important, one question that perhaps relates to an even more fundamental aspect of language learning is how the sound constituents of words are acquired. Categorical perception of phonemes that belong to one’s own language emerges before one year of age (Werker & Tees, 1984). The question of whether this ability to distinguish phoneme contrasts in one’s own language can be acquired via statistical learning was investigated by Maye, Werker, and Gerken (2002). Maye et al. constructed speech sounds that varied along the phonetic continuum of voicing (from [da] to [ta]). Six- and eight-month-old infants were exposed to phonemes that were either unimodally distributed or bimodally distributed along this da-ta continuum. Infants were then tested on their ability to distinguish phonemes on opposite ends of the continuum, with the expectation that if infants were using the frequency distribution of phonemes in the input as a statistical cue, the unimodal and bimodal groups would behave di¤erently in this subsequent test of phoneme discrimination. Consistent with predictions from statistical learning, only infants who received bimodally distributed input were able to distinguish phonemes from the endpoints (Maye, Werker, & Gerken, 2002). While these data provide strong evidence for the frequency distribution of input as an important cue for learning, the authors noted that the infants who received unimodal distribution were surprisingly not performing above chance in a situation that had shown successful performance in previous studies (Pegg & Werker, 1997). This below-expected level of performance suggests that exposure (in this case, to input that is atypical of one’s own language) may actually lead to unlearning of previously discriminable di¤erences (Werker & Tees, 1984).
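
To make the distributional-learning idea concrete, the sketch below works through the logic behind bimodal versus unimodal exposure: given tokens of a phonetic cue such as voice onset time, a learner who compares one-category and two-category descriptions of the input should settle on two categories only when the input is bimodal. The numbers, and the use of scikit-learn's Gaussian mixture model as the stand-in learner, are illustrative assumptions, not a reconstruction of Maye et al.'s procedure.

    # Minimal sketch (not Maye et al.'s stimuli): does the distribution of a
    # phonetic cue (e.g. voice onset time in ms) look one-category or two-category?
    # The values and the GaussianMixture "learner" are illustrative assumptions.
    import numpy as np
    from sklearn.mixture import GaussianMixture

    rng = np.random.default_rng(0)

    # Unimodal exposure: tokens cluster around a single central value.
    unimodal = rng.normal(loc=30.0, scale=8.0, size=200)

    # Bimodal exposure: tokens cluster around two values ([da]-like vs [ta]-like).
    bimodal = np.concatenate([rng.normal(15.0, 5.0, 100), rng.normal(45.0, 5.0, 100)])

    def preferred_category_count(samples):
        """Compare 1- vs 2-component Gaussian mixtures by BIC (lower is better)."""
        x = samples.reshape(-1, 1)
        bics = [GaussianMixture(n_components=k, random_state=0).fit(x).bic(x)
                for k in (1, 2)]
        return 1 + int(np.argmin(bics))

    print(preferred_category_count(unimodal))  # typically 1
    print(preferred_category_count(bimodal))   # typically 2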

3. Multimodality of statistical learning

Although statistical learning was originally meant to explain word segmentation, it has now been extended to other domains as well. Results


similar to the original studies, showing successful learning of relationships between stimuli based only on statistical cues, have now been reported using sequentially as well as simultaneously presented stimuli including visual patterns (Fiser & Aslin, 2002; Kirkham, Slemmer, & Johnson, 2002), tactile sequences (Conway & Christiansen, 2005), and motor sequences (Hunt & Aslin, 2001). Notably, following a rigorous comparison among auditory, motor, and tactile sequence learning, Conway and Christiansen (2005) concluded that the auditory modality a¤orded quantitatively superior statistical learning outcomes compared to vision and touch. If the auditory system does o¤er some advantage for the statistical learning mechanism, then structures that primarily tap into the auditory system should be regarded as the most similar to language in its appeal to the statistical learning mechanism. In that regard, music o¤ers both a testable model for the domain-generality of statistical learning, as well as a unique but fertile testing ground on which theories of statistical learning can be borrowed for new explorations in the musical domain.

4. The human implicit knowledge of music Music is a pervasive human activity that is known and celebrated in all ages and in all cultures. Over and over again, the human brain is shown to possess knowledge of grammatical structures in music. For instance, knowledge of musical harmony has repeatedly been demonstrated in both musically trained and untrained humans (Krumhansl & Kessler, 1982). Reaction time studies have shown that performance levels on a variety of tasks, such as consonance detection (Bharucha & Stoeckig, 1986), timbre and contour discrimination (timbre discrimination: Tillmann, Bigand, Esco‰er, & Lalitte, 2006; contour discrimination: Loui & Wessel, 2007), and even phoneme identification (Bigand, Tillmann, Poulin, D’Adamo, & Madurell, 2001) and visual target identification (Esco‰er & Tillmann, 2008), are superior when the target is accompanied by expected harmonies as dictated by common-practice Western music theory. From the electrophysiological literature, grammatically unexpected musical events are shown to elicit several components of brain potentials. The Late Positive Component, a parietally-centered positive waveform, is largest around 600 ms after the onset of a melodically unexpected note (Besson & Faita, 1995). In contrast, harmonically unexpected chords are shown to elicit an Early Anterior Negativity at around 200 ms, followed by a Late Negativity largest prefrontally at around 500 ms (Koelsch, Gunter, Friederici, & Schroger,


2000). Even individuals without explicit musical training do show neural activations elicited by unexpected harmonic or melodic endings (harmonic endings: Koelsch, et al., 2000; melodic endings: Besson & Faita, 1995), although these e¤ects may be diminished or delayed relative to trained musicians (Besson & Faita, 1995). From a developmental standpoint, knowledge of musical structure has been investigated extensively in children and infants. An implicit sense of rhythm has been documented in newborn infants (Winkler, Haden, Ladinig, Sziller, & Honing, 2009), and a culturally-informed sense of musical meter develops before the age of one (Hannon & Trehub, 2005). Starting as young as 9 months of age, infants prefer consonant harmonic intervals over dissonant ones (Schellenberg & Trehub, 1996). Children as young as 4–5 years of age show the same Event-Related Potentials as adults in response to unexpected chords (Koelsch et al., 2003). At six years of age, children are faster and more accurate at perceptual tasks in which target stimuli co-occurred with expected chords relative to unexpected chords in a harmonic chord progression (Schellenberg, Bigand, Poulin-Charronnat, Garnier, & Stevens, 2005). As a whole, these developmental studies suggest that musical knowledge and expectations are already well-formed early in life. If statistical learning is to be invoked as an explanatory theory for knowledge in some or all of the above-mentioned aspects of music, then the putative statistical learning mechanism, which might make use of any event frequencies and probabilities of sound to help garner knowledge about musical structure, must be powerful enough to acquire rapidly, and must be in place quite early in life.

5. Statistical learning of tone sequences

The extent to which language and music are shared in the brain is a topic of increasing interest in recent years (Patel, 2008). Although music and speech appear on the surface to sound very different, they clearly share the common identity of being sound energy over time. Musical tones, in particular, share a roughly similar time scale as speech syllables, in the range of one to three events per second. Thus, the way we might learn any predictive structures between musical tones and between linguistic syllables may be similar at the most basic level, i.e. at the level of sound events that unfold over time. At a higher level, since the goals of producing tones and syllables may be different, the requirements for learning, or even the agreement on what constitutes learning, may shape the way statistics in tones and syllables constrain learning.


Studies have investigated the role of statistics in the origin of musical knowledge. One of the first statistical learning studies to employ nonlinguistic auditory stimuli used musical tone sequences instead of syllables (Sa¤ran, Johnson, Aslin, & Newport, 1999). Sequential tones (intervals in musical terms) formed the same sets of transitional probabilities as were used in the original study with syllable stimuli (Sa¤ran, et al., 1996). Results showed similar patterns as the 1996 study – adults and eightmonth-old infants were able to extract statistical regularities from the ‘‘tone words’’, suggesting that the learning mechanism that was previously thought to be of use in language acquisition can also be at work in nonlinguistic modalities such as music. Although striking parallels exist between this tone sequences study and the original study using syllable sequences, one major di¤erence emerges in this direct comparison between linguistic and musical stimuli. While syllables are usually identifiable (or at least reproducible with reasonable accuracy) by most individuals, tones are usually not identifiable in the same way. For instance, while most speakers can maintain a syllable in memory as its constituent phonemes (i.e. forming a memory for the syllable ‘‘ba’’ upon hearing it), most listeners cannot, upon hearing a single isolated musical A, maintain the same (absolute) pitch in memory for long (Siegel, 1972). In contrast, pairs of tones are best recognized as intervals with pitches relative to each other (relative pitch) rather than as tones with absolute pitches themselves (absolute pitch). This dichotomy between relative and absolute pitch was explored in further studies that specifically compared the performance on absolute and relative pitch in infants and adults (Sa¤ran & Griepentrog, 2001). Triplets of tones were presented with absolute pitch patterns (e.g. A-D-E) or relative pitch patterns (e.g. Perfect 4th up – Major 2nd up) that appeared with either high or low conditional probability. Results showed superior performance for relative pitch in adults, but superior performance for the absolute pitch condition in infants. This double dissociation led to the supposition of a developmental shift in pitch processing between infancy and adulthood (Sa¤ran, 2003): infants might be born with absolute pitch, but lose it without musical training or in cultures that do not rely on pitched information in their language or musical training (i.e. non-tonal-language cultures and nonmusicians). This hypothesis of widespread absolute pitch possession among infants may explain several observations from the absolute pitch literature: the increased incidence of absolute pitch among tonal language speakers, which may be attributable to increased meaningful exposure to pitched information from a young age (Deutsch, Dooley, Henthorn, & Head, 2009; Deutsch, Henthorn, & Dolson, 2004; Deutsch,


Henthorn, Marvin, & Xu, 2006), and the decreased rate of absolute pitch as age of musical training onset increases (Baharloo, Johnston, Service, Gitschier, & Freimer, 1998). However, more experimental evidence for this theory remains to be found. Among the adults in the tone sequences study by Sa¤ran et al. (1999), an interesting dissociation was also observed that highlighted the role of musical training: musically untrained individuals were at chance for the absolute pitch condition but above chance for the relative pitch condition, whereas musically trained adults performed above chance in both absolute and relative pitch conditions. Three possible explanations account for this e¤ect of musical training. Firstly, musicians are more likely to possess absolute pitch. Secondly, musicians might have been paying more attention overall, which led to superior performance in both relative pitch and absolute pitch conditions, whereas nonmusicians showed inferior performance overall, with the lower of the two conditions (absolute pitch) being indistinguishable from chance level. Thirdly, musicians might have acquired long-term memory of a stable pitch representation throughout their years of musical training, from which the pitch classes used in the exposure stream could have been compared, thus producing behavioral output similar to people with absolute pitch based on a relative pitch comparisons alone.
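
One way to see why absolute-pitch and relative-pitch listeners can extract different statistics from the same input is to code the same tone triplets in the two ways. The sketch below uses made-up MIDI pitch numbers, following the chapter's A-D-E example; it illustrates only the representational difference, not the experimental procedure.

    # Illustrative sketch: the same tone "words" counted under absolute-pitch
    # coding (pitch identities) versus relative-pitch coding (successive intervals).
    # Pitches are MIDI note numbers; the triplets are invented for illustration.
    from collections import Counter

    triplets = [(69, 74, 76), (69, 74, 76), (64, 69, 71)]  # A-D-E, A-D-E, E-A-B

    absolute_units = Counter(triplets)
    relative_units = Counter(tuple(b - a for a, b in zip(t, t[1:])) for t in triplets)

    print(absolute_units)  # two distinct units under absolute-pitch coding
    print(relative_units)  # one unit, (+5, +2): perfect 4th up, major 2nd up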

6. Music changes the brain, even in implicit musicians

Regardless of the mechanisms that led to results from these studies, the idea that musical training changes the brain and impacts cognitive performance is not new. Brain differences exist between musicians and nonmusicians in widespread regions of the brain, as shown in many studies using structural and functional neuroimaging (Gaser & Schlaug, 2003; Imfeld, Oechslin, Meyer, Loenneker, & Jancke, 2009; Kleber, Veit, Birbaumer, Gruzelier, & Lotze, 2009; Oechslin, Imfeld, Loenneker, Meyer, & Jancke, 2010; Schlaug, Jancke, Huang, Staiger, & Steinmetz, 1995; Sluming, Brooks, Howard, Downes, & Roberts, 2007) and electrophysiological studies (Besson & Faita, 1995; Loui, Grent-'t-Jong, Torpey, & Woldorff, 2005; Magne, Schon, & Besson, 2006; Pantev et al., 2003; Schlaug, Jancke, Huang, & Steinmetz, 1995; Tervaniemi, Rytkonen, Schroger, Ilmoniemi, & Naatanen, 2001). These structural and functional brain changes have led to the claim that music may act as a model of neuroplasticity by changing the brain through active and motivated engagement (Schlaug, Forgeard, Zhu, Norton, &


Winner, 2009). Although musicians, by definition, engage in music making more than nonmusicians do, it should be noted that the rules of music, like those of language, are largely acquired not by explicit instruction, but by implicit learning based on passive exposure to music in the environment. As noted by Leonard Meyer in his seminal work Emotion and Meaning in Music (1956): ‘‘Understanding music is not a matter of dictionary definitions, of knowing this, that, or the other rule of musical syntax and grammar, rather it is a matter of habits correctly acquired in one’s self and properly presumed in the particular work.’’ (Meyer, 1956, p. 61). The ‘‘habits correctly acquired’’ that Meyer noted, when placed in the context of statistical learning, can refer to any or all of the standard musical components that the field of music perception and cognition is concerned with today: pitch, rhythm, melody, harmony, tonality, meter, timbre, and emotional content in music. A normal human being is exposed to music in the environment at every point during the day. This exposure is compounded with the aid of modern technology such as iPods, resulting in thousands of hours of exposure to musical sounds before adulthood. These thousands of hours of exposure must result in a significant body of implicit knowledge, or at least sensitivity to statistical regularities within musical sounds, even amongst individuals with no formal musical training. While artificial grammars have been used to investigate language learning in many studies, studies in music perception have focused predominantly on Western classical music. Some studies have investigated cognitive and perceptual learning of alternate musical systems, such as Indian classical music (Castellano, Bharucha, & Krumhansl, 1984), Javanese scales (Lynch, Eilers, Oller, & Urbano, 1990), and North Sami Finnish melodies (Krumhansl et al., 2000). Some other studies have also investigated artificially generated harmony based on alternate scales such as modes (Jonaitis & Sa¤ran, 2009) . A few studies have also created artificial grammars from timbre sequences (Bigand, Perruchet, & Boyer, 1998; Tillmann & McAdams, 2004), but few studies have systematically come up with a new musical system that is guaranteed to be both unfamiliar to subjects and musically valid in the sense that it can enable new musical compositions. With the goal of investigating questions about the sources of musical knowledge without the possible confounds of a lifetime of musical experience, our approach is to create a completely novel but musically valid system of artificial grammars. We made use of the Bohlen-Pierce scale (Mathews, Pierce, Reeves, & Roberts, 1988), a novel scale that is completely di¤erent from existing musical systems. By composing a theoreti-


cally plausible tonal and harmonic structure in this scale, the resulting artificial musical system allows us to ask many questions: what aspects of musical structure can be learned? How quickly can we acquire pitch, timbre, and harmony and tonality? How much does emotion in music depend on statistical regularities?

7. A new musical system based on the Bohlen-Pierce scale

Musical systems around the world are based on an octave, which is a 2:1 ratio in frequency. Within the 2:1 frequency ratio, the equal-tempered Western chromatic scale has 12 logarithmically even divisions. With a starting point that tunes to 440 Hz (the concert A that is used as the gold standard for tuning), the formula of the Western chromatic scale is:

F = k * 2^(n/12),

where F is the frequency of each tone, k is the starting point of the scale (e.g. 440 Hz), and n is the number of steps along the scale. In contrast, the Bohlen-Pierce scale is based on the tritave, which is a 3:1 ratio in frequency. Within the 3:1 frequency ratio, the scale is divided into 13 logarithmically even steps, so that the formula is:

F = k * 3^(n/13)

The 13-step logarithmic division of the Bohlen-Pierce scale is useful because some of the resultant steps along the 3:1 frequency ratio approximate whole-integer ratios that have long been known to be consonant (Kameoka & Kuriyagawa, 1969a, 1969b) and therefore more pleasant to the ear. While the major triad in Western music approximates a 4:5:6 ratio in frequency, the new "major" chord in the Bohlen-Pierce scale approximates the 3:5:7 ratio, resulting in chords that sound pleasantly consonant together but are unlike any existing musical system. A comprehensive new musical system uses the pitches that form these new chords in the Bohlen-Pierce scale as nodes of the artificial grammar. Legal paths in the artificial grammar are defined as possible movement between the pitches in the chord progression: pitches within each chord can either move forward to any pitch within the following chord or move up, down, or stay the same within the same chord. Given three pitches within each chord, four chords in a chord progression, and eight-note-long melodies that are constrained to move from the first chord to the last in the chord progression (while allowing for repeats), the new musical system can generate thousands of possible melodies.
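
The two tuning formulas and the "legal path" constraint described above can be made concrete in a few lines of code. In the sketch below, the chord spellings, the four-chord progression, and the probability of advancing to the next chord are placeholder assumptions for illustration; they are not the parameters of the actual stimuli used in these studies.

    # Sketch of the two tuning formulas given above and of "legal path" melody
    # generation. Chord spellings and the progression are illustrative placeholders.
    import random

    def western_freq(n, k=440.0):
        """Equal-tempered chromatic scale: F = k * 2**(n / 12)."""
        return k * 2 ** (n / 12)

    def bohlen_pierce_freq(n, k=440.0):
        """Bohlen-Pierce scale: 13 equal divisions of the 3:1 tritave, F = k * 3**(n / 13)."""
        return k * 3 ** (n / 13)

    # The same number of scale steps lands on different frequencies in the two systems.
    print(round(western_freq(4), 1), round(bohlen_pierce_freq(4), 1))

    # Hypothetical four-chord progression; each chord is three Bohlen-Pierce scale steps.
    progression = [(0, 4, 7), (1, 5, 8), (2, 6, 9), (0, 4, 7)]

    def legal_melody(progression, length=8, seed=None):
        """Walk the progression: stay within the current chord or advance to the next,
        forcing enough advances that the melody ends on the final chord."""
        rng = random.Random(seed)
        melody = [rng.choice(progression[0])]
        chord_index = 0
        while len(melody) < length:
            remaining_notes = length - len(melody)
            remaining_chords = len(progression) - 1 - chord_index
            must_advance = remaining_notes == remaining_chords
            if remaining_chords > 0 and (must_advance or rng.random() < 0.5):
                chord_index += 1
            melody.append(rng.choice(progression[chord_index]))
        return melody

    steps = legal_melody(progression, seed=1)
    print([round(bohlen_pierce_freq(n), 1) for n in steps])  # one legal melody, in Hz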


The new musical system used in our studies di¤ers from existing musical systems in that it uses pitches that do not fit into the pitch categories of Western musical systems. In addition, it is free of any pre-existing knowledge or associations, or any influences of prior learning or long-term memory that the subjects may have. Using the Bohlen-Pierce scale, we can test how subjects respond when faced with a new musical world, yet the input of this new musical world can be manipulated with exquisite control. Thus this artificial musical system o¤ers an ideal environment with which to investigate hypotheses generated from statistical learning theory.

8. Learning pitch categories The first question we asked was whether people could learn to form expectancies for novel pitch categories. This was assessed using a seven-point rating system for subjective goodness of fit, in a Bohlen-Pierce version of the probe tone paradigm. In the probe tone paradigm (Krumhansl & Kessler, 1982), participants are presented with a melody followed by a tone, and their task is to rate how well the tone fits the melody. The procedure is repeated for each tone along the musical scale, and the resultant set of ratings for each tone is called the probe tone profile. The probe tone profile has been shown to be highly correlated with the frequency statistics of large bodies of musical compositions (Krumhansl, 1990). The degree of fit between probe tone profiles and standardized profiles, as defined by averaged ratings of large groups of trained musicians, is termed the ‘‘recovery score’’ and is taken to be a test of an individual’s sensitivity to tonality (Russo, 2009). For our probe tone study, subjects were given a melody in the BohlenPierce scale, followed by a single tone, and were told to rate how well the tone fit with the preceding melody. This test was done before and after exposure to a large corpus of melodies (400 melodies) in the Bohlen-Pierce scale, which formed the profile of tones to which subjects were exposed (i.e. the exposure profile). In analyzing the probe tone data, each individual’s probe tone profile was regressed against the exposure profile, resulting in a ‘‘recovery score’’ (correlation between rating profile and exposure profile) both before and after the exposure phase. Results showed a higher recovery score after exposure than before exposure: a stronger correlation was observed between probe tone ratings and the exposure profile after subjects had listened to the large corpus of melodies. In other words, subjects acquired sensitivity to the frequencies of tones in the Bohlen-Pierce scale


Figure 1. Pre-exposure and post-exposure probe tone ratings, and the exposure profile which was used as the target rating against which all the subjects’ ratings were correlated to obtain a ‘‘recovery score’’ for each individual subject.

during half an hour of exposure, such that their ratings significantly improved and became more highly correlated with their input (Loui, Wessel, & Hudson Kam, 2010). Figure 1 shows an example of pre-exposure and post-exposure probe tone ratings, and the exposure profile which was used as the target rating against which all the subjects’ ratings were correlated to obtain a ‘‘recovery score’’ for each individual subject.
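
A minimal version of the recovery-score computation described above might look like the following, with a simple Pearson correlation standing in for the regression of each listener's probe tone profile on the exposure profile. All of the numbers, and the assumption of one rating per scale step, are invented for demonstration.

    # Illustrative "recovery score": correlate a listener's probe tone ratings
    # (one per Bohlen-Pierce scale step) with the exposure profile, i.e. how
    # often each scale degree occurred in the melodies heard. Invented numbers.
    import numpy as np
    from scipy.stats import pearsonr

    exposure_profile = np.array([9, 2, 1, 3, 8, 1, 2, 7, 2, 1, 3, 1, 6], dtype=float)
    pre_ratings  = np.array([4, 4, 3, 4, 4, 3, 4, 4, 3, 4, 4, 3, 4], dtype=float)
    post_ratings = np.array([6, 3, 2, 3, 6, 2, 3, 5, 3, 2, 3, 2, 5], dtype=float)

    pre_recovery, _ = pearsonr(pre_ratings, exposure_profile)
    post_recovery, _ = pearsonr(post_ratings, exposure_profile)
    print(f"recovery before exposure: {pre_recovery:.2f}, after: {post_recovery:.2f}")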

9. Learning grammatical structure

While the probe tone ratings demonstrate sensitivity to the event frequencies of tones in a scale, one might ask what other statistics are at work when subjects are exposed to the Bohlen-Pierce scale. Results from previous language learning studies (Aslin et al., 1998; Saffran et al., 1996) show that conditional (transitional) probability, rather than simple frequencies, is the most important aspect of statistical input that is acquired in artificial grammars. Conditional probability is defined as the probability of one element (Y) given another element (X) (Saffran et al., 1996):

P(Y|X) = frequency of XY / frequency of X
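
Applied to sequences, this definition amounts to simple bigram counting. The toy corpus below, with letters standing in for scale degrees or syllables, is purely illustrative:

    # Direct implementation of the definition above, P(Y|X) = freq(XY) / freq(X),
    # over a toy corpus of sequences. The corpus is invented for illustration.
    from collections import Counter

    melodies = ["abcd", "abce", "cdab", "abcd"]

    pair_counts = Counter()
    first_counts = Counter()
    for melody in melodies:
        for x, y in zip(melody, melody[1:]):
            pair_counts[(x, y)] += 1
            first_counts[x] += 1

    def transitional_probability(x, y):
        return pair_counts[(x, y)] / first_counts[x] if first_counts[x] else 0.0

    print(transitional_probability("a", "b"))  # high: "b" always follows "a" here
    print(transitional_probability("b", "a"))  # zero in this toy corpus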


This definition lends itself readily to aspects of music theory that can readily be tested using the Bohlen-Pierce scale. In music we listen to every day, tones typically occur as sequentially-presented elements in a melody, but are also often accompanied by chords, which are simultaneously-presented tones. These chords also occur with a range of frequencies and transitional probabilities, and a sequence of chords that contains both frequencies and transitional probabilities is known as a chord progression. Much of traditional Western music contains chord progressions that recur with high transitional probability: for instance, the chord progression that appears in Pachelbel's Canon in D: D major – A major – b minor – f♯ minor – G major – D major – G major – A major (in music-theoretical Roman numeral notation: I – V – vi – iii – IV – I – IV – V) is a prototypical chord progression that has been used in many classical and contemporary compositions, a fact humorously illustrated by comedian Rob Paravonian in his famous Pachelbel Rant (http://www.youtube.com/watch?v=JdxkVQy7QLM). In Western music, the most prototypical chord progression is I – IV – V – I, with each chord implying three or more tones that belong to the same chord (see top panel of Figure 2). The reverse chord progression is considered grammatically incorrect in traditional Western music¹. Thus, in Western musical convention, the transitional probability of the V chord given the IV chord is high, whereas the transitional probability of the IV chord given the V chord is low. These common rules in chord progressions are typically considered to be the basis of musical syntax (Bernstein, 1973; Piston & DeVoto, 1987). While Western musical convention dictates that the IV – V chord transition occurs with higher transitional probability than its reverse, no such convention exists in the Bohlen-Pierce scale because of its novelty to most listeners. We can therefore imagine two parallel universes with opposite chord progressions – say I – VI – VIII – I and I – VIII – VI – I – that are equally legal and equally likely to a person with no exposure to the Bohlen-Pierce scale. With each chord implying three notes, each of the two four-chord progressions can contain 4 * 3 = 12 possible chord notes, each of which may occur with equal frequency but with opposite transitional probability. (The two Bohlen-Pierce chord progressions and their resultant tones are shown in the grey panel of Figure 2.)

1. Although the I – V – IV – I progression is considered ungrammatical in Western classical music, it occurs commonly in other musical cultures such as Blues and Bossa Nova – a fact that suggests that harmonic conventions in music are not universal.


Figure 2. Top: a prototypical chord progression in Western music in musical and Roman-numeral notations. The white box with 12 numbers is a numerical translation of the same chord progression, where each number represents a tone with the frequency given by substituting the number into ‘‘n’’ in the formula below. Each Roman numeral corresponds to a single chord, with its root specified by the degree of the Roman numeral. Bottom: two Bohlen-Pierce chord progressions and their resultant tones are shown in the grey panel.

These two opposite chord progressions form two parallel but opposite artificial grammars, which can be used as control stimuli for each other. In our experiments, each group of subjects was exposed to melodies that were legal in one of the two artificial grammars. Several sample files of the Bohlen-Pierce sequences are available at www.psycheloui.com/publications/downloads.

10. Testing for recognition and generalization

In our experiments we were generally interested in two questions. Given a set of melodies in the Bohlen-Pierce scale:


1. Could participants recognize grammatical melodies they had heard before?
2. Could participants generalize their knowledge to grammatical melodies that they had not heard before?

The two questions were addressed using two-alternative forced choice tests of recognition and generalization respectively. In each trial of these two-alternative forced choice tests, participants were given two melodies and were told to choose the one that sounded more familiar, without knowledge of whether the trial was for recognition or generalization. In the recognition trials, one of the melodies had been presented during the exposure phase (and was therefore grammatical), whereas the other melody was generated from the opposite grammar (and therefore had not been presented during the exposure phase). This was different from the generalization trials, where neither melody had been presented in the exposure phase, but one was grammatical and the other was not (i.e. the ungrammatical melody was generated from the opposite grammar, such that a melody that was ungrammatical for one group of subjects was grammatical for the other group). In a first experiment, five melodies in each grammar were presented repeatedly to participants for 25 minutes, resulting in about 100 presentations of each melody. Results from two-alternative forced choice tests showed near-ceiling levels of performance in recognition, but chance levels in generalization: participants recognized the previously-heard melodies but did not generalize their knowledge to new melodies in the same grammar. In the next experiment, we changed the exposure set so that 400 melodies in each grammar were presented once each, for an overall duration of 30 minutes. Results showed a very different pattern: participants performed above chance, but below ceiling levels, in both recognition and generalization (Loui et al., 2010). In other replications of the study we have used exposure sets of 10 melodies repeated 40 times each (Experiment 2 in Loui & Wessel, 2008) and 15 melodies repeated 27 times each (Experiment 1 in Loui & Wessel, 2008), for an overall exposure duration of 30 minutes in each experiment. The overall pattern of results is shown in Figure 3: recognition improves as the number of repetitions increases, whereas generalization improves as the set size of the melodies increases. Furthermore, people with musical training were not better at learning the musical system than people with no previous musical training (Loui et al., 2010). Considering results from multiple experiments altogether,


Figure 3. The overall pattern of results from two-alternative forced choice tests of recognition and generalization in four experiments with different set sizes of input. These data show a double dissociation between recognition and generalization: recognition improves as the number of repetitions increases, whereas generalization improves as the set size of the melodies increases.

a double dissociation emerges between recognition and generalization: fewer melodies repeated more times leads to successful recognition, whereas more melodies repeated fewer times leads to superior generalization. This pattern suggests that recognition and generalization are di¤erent mechanisms: recognition taps into rote memory for individual items, whereas generalization taps into a deeper and more flexible knowledge of grammatical structure underlying the surface instances that had been previously encountered. Generalization tests are probably more e¤ective at tapping into statistical learning ability of artificial grammars. The present generalization results are congruent with existing literature on the importance of multiple exemplars for learning, as seen in various domains including artificial grammar learning (Gomez, 2002; Meulemans & Van der Linden, 1997), object categorization (Needham, Dueker, & Lockhead, 2005), word segmentation (Houston & Jusczyk, 2000), and speech production (Lively, Pisoni, Yamada, Tohkura, & Yamada, 1994; Richtsmeier, Gerken, Go¤man, & Hogan, 2009).


11. Learning the structures of nontraditional scales In addition to the Bohlen-Pierce scale, other studies had investigated the acquisition of pitch structure using non-Western scales. In a series of experiments, Lynch et al. investigated sensitivity to pitch categories by testing for detection of deviations from expected pitch categories (mistuning) in Javanese and Western scales among Western infants and adults (Lynch, et al., 1990). Results showed that while Western six-month-old infants were equally able to detect mistunings in both Javanese and Western scales, adults were better at detecting mistunings in Western scales. Musically trained Western participants, in particular, were able to distinguish pitch categories at very small (0.4%) levels of mistuning, but their detection performance was consistently better at Western scales compared to Javanese scales. These results suggest that much of our knowledge of pitch categories is built up over many years of experience with music in our culture, such that qualitative shifts occur from infancy to adulthood in the small di¤erences in frequencies that we are able to accept as belonging to the same pitch categories. In the study by Lynch et al. (1990), even non-musician Westerners showed superior performance in detecting mistuning in Western music compared to Javanese music. This provides further support for the notion presented earlier, that even non-musicians have reasonably sophisticated understanding of musical pitch as a result of being exposed to pitched sounds in the environment. However, one question we might ask is what exactly constitutes musical input. While some musical sounds, such as hit songs on the radio, might be canonically accepted as musical stimuli, it is possible that the cognitive mechanisms that use statistical learning to build up musical knowledge may benefit from input that is much more general than what is typically considered to be musical. For instance, any sound that has pitch and amplitude modulations, such as the human voice, may already contain statistical information that is important to musical knowledge.

12. Sources of statistical knowledge in speech, music, and other sounds

Schwartz, Howe, and Purves (2003) tested the hypothesis that energy in human speech sounds is related to musical universals by doing a series of statistical analyses on speech sounds. Speech sounds from various speakers in multiple languages were analyzed using the fast-


Fourier transform (FFT)2. Results showed peaks of energy that align with periodic frequency ratios corresponding to whole-integer ratios in frequency (1:2, 1:3, 1:4, 2:3, etc.) Interestingly, these peaks of energy are also aligned with ratios that are most important in music, such as the musical fifth and third. The authors’ interpretation of these results is that musical scale structure is derived from the necessary statistical relationship between sensory stimuli and their physical sources, and that using speech sounds as a model, they were able to predict the universals of music. While the human voice and musical structure both contain periodic relationships, it is important to note that harmonicity, i.e. the property of having frequencies that are related by integer ratios, is by no means limited to speech and music theory, but is a physical property of many sounds in the living and nonliving world. Many nonhuman species give o¤ vocalizations that have periodic relationships in their harmonic partials (Hauser & McDermott, 2003), an observation that possibly gave rise to common musical labels for animal sounds (e.g. words such as ‘‘birdsong’’ or ‘‘whale song’’ ascribe musical attributes to animal sounds). In nonliving objects, vibrations in the frequency range of pitched sounds can also have periodic relationships, leading to periodic frequency components with harmonic ratios – a phenomenon readily noticeable when listening to electrical appliances with motors.
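
The kind of spectral analysis described above is easy to sketch on a synthesized harmonic tone, rather than on the recorded speech used by Schwartz et al.: the FFT magnitude spectrum of a periodic sound concentrates its energy at integer multiples of the fundamental. The tone and all of its parameters below are illustrative assumptions.

    # Sketch of the spectral analysis described above, applied to a synthesized
    # harmonic tone: the FFT magnitude spectrum shows energy concentrated at
    # integer multiples of the fundamental. Parameters are illustrative.
    import numpy as np

    sample_rate = 44100
    t = np.arange(sample_rate) / sample_rate  # one second of samples

    f0 = 220.0  # fundamental frequency in Hz
    # Harmonic complex: partials at 1x, 2x, 3x, 4x the fundamental, decaying amplitude.
    signal = sum((1.0 / h) * np.sin(2 * np.pi * f0 * h * t) for h in range(1, 5))

    spectrum = np.abs(np.fft.rfft(signal))
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)

    # Report the strongest spectral peaks as ratios to the fundamental.
    peak_bins = np.argsort(spectrum)[-4:]
    print(sorted(round(freqs[b] / f0, 2) for b in peak_bins))  # ~[1.0, 2.0, 3.0, 4.0]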

13. Timbre

Among periodic sounds produced by nonliving objects, the harmonic structure of timbres of musical instruments deserves to be examined in some detail. Timbre is the sound quality, especially of musical instruments, that is not captured by pitch, rhythm, or amplitude alone (Sethares, 2004). For instance, when a piano and a guitar are playing the same note for the same time, the main aspect of sound that is different between them is referred to as timbre. Studies in music perception and cognition have used multidimensional scaling, among other techniques, to investigate the

2. FFT analysis entails the mathematical decomposition of a sound to obtain information such as frequencies of components within the sound. FFT functions are now available in programming environments such as MATLAB and MaxMSP (Zicarelli, 1998), as well as freely available software such as Praat (Boersma & Weenink, 2010), Audacity (http://audacity.sourceforge.net), and PureData (http://puredata.info).


Figure 4. Frequency domain (top panels) and time-domain (bottom panels) waveforms of the same pitch (E-flat4) played on the violin, the trumpet, and the piano, showing timbre differences in the spectral centroid and in the temporal envelope.

qualities of sound that give rise to the percept of timbre. In the classical study by Wessel et al. (1979), instrumental sounds of various timbres were presented to subjects, who made pairwise judgments regarding the similarities between any given pair of sounds. Results from the multidimensional scaling solution of these sounds showed that the two most important components that predicted people’s ratings were spectral distribution and temporal envelope – two dimensions succinctly described as ‘‘brightness’’ and ‘‘bite’’ (Wessel, 1979). In Figure 4, each panel shows a time-domain waveform or a time-frequency plot (spectrogram) of the same pitch played on a single instrument. The left panel is the sound of a violin, the middle panel is of a trumpet, and the right panel is of a piano. Based on the spectrograms (top row), it is possible to see that the violin sound has energy in the highest frequencies within the spectrum compared to the other two instruments. Based on the time-domain waveforms (bottom row), in contrast, it is clear that the piano has the fastest rise in its temporal envelope, reaching its loudest point immediately after the beginning of the tone, but the sound decays quickly relative to the other instruments. In other words, the violin sound has the most ‘‘brightness’’, whereas the piano sound has the most ‘‘bite’’. These dimensions of timbre may be comparable to the phonological measures in speech: brightness may be thought of as an analog of formant frequency, whereas bite might


be analogous to voice onset time in speech. Importantly, all of these timbres share the quality of harmonicity: in other words, each sound contains energy at frequencies that are related in integer ratios to the fundamental frequency. For instance, a violin that is playing at a fundamental frequency of an A4 (440 Hz) also has energy at the first harmonic of 2 * 440 Hz = 880 Hz, a second harmonic of 3 * 440 Hz = 1320 Hz, a third harmonic of 4 * 440 Hz = 1760 Hz, and so on. The ubiquity of harmonicity in sounds again highlights the fact that by adulthood, most humans have had large amounts of exposure to the periodic ratios in music. If a learning mechanism is in place that uses harmonic information between frequencies in sounds to aid the acquisition of implicit musical knowledge, this learning mechanism should make the average listener highly proficient at forming the correct expectations for statistically probable and improbable notes in a pitch sequence. Furthermore, if such a harmonicity-dependent statistical learning mechanism uses the harmonic properties of sounds to build knowledge of musical structure, then sounds with alternate patterns of harmonicity might be used by the statistical learning mechanism to build knowledge of alternate types of musical structure. From this idea of alternate patterns of harmonicity, we can form a strong prediction: new sounds can help learn new types of structure, provided that the harmonic structure of the sound is congruent with the musical structure. This prediction can be tested using the Bohlen-Pierce scale, which makes use of 3:1 ratios instead of 2:1 ratios. If sounds with frequency structures that are congruent with the scale can help learn the structure of the scale, then people who are exposed to sounds with harmonic ratios of 3:1 should be better at learning the Bohlen-Pierce scale, relative to people who are exposed to sounds with the more common harmonic ratios of 2:1. The above hypothesis was tested in two groups of subjects who were exposed to a set of 400 melodies in the Bohlen-Pierce scale over the course of 30 minutes, and made probe tone ratings before and after the exposure phase. In one group of subjects, all sounds in the experiment (including exposure melodies and all testing material) had harmonic partials that were related in 3:1 frequency ratios. In the other group of subjects, all sounds in the experiment had harmonic partials that were related in 2:1 frequency ratios. The prediction was that subjects in the 3:1 ratio condition (i.e. the condition that is congruent with the Bohlen-Pierce scale) would learn the Bohlen-Pierce scale better than those in the 2:1 ratio (incongruent) condition. The results confirmed our predictions: subjects in the congruent


condition did better than subjects in the incongruent condition. Recovery scores obtained from probe tone ratings were higher in the congruent condition than in the incongruent condition, suggesting that timbre, or the frequency distribution of harmonic partials in sounds, is a useful cue for learning the structure of sounds (Loui, 2010). Importantly, subjects who heard the whole experiment in pure tones, which have no harmonic partials and are therefore neutral with respect to its timbre congruence with the Bohlen-Pierce scale, showed middling results: their recovery scores were lower than the congruent group, but were higher than the incongruent group. This suggests that an incongruent timbre condition can actually harm performance, perhaps by interfering with the frequencies and probabilities of harmonic partials that are useful input to the statistical learning mechanism. When confronted with novel sounds, the statistical learning system rapidly and constantly acquires a mental representation of the new sound structure, using both temporal and spectral cues as they become available. Other researchers have also investigated the e¤ects of timbre as a perceptual cue in statistical learning. Creel et al. (2004) asked the question of whether statistical relationships that were not temporally adjacent could be learned when the statistics were pitted against tone complexes that varied in the perceptual cue of timbre, here manipulated as a combination of harmonic partials. Their study tested the ability to learn two statistically predictive but temporally interleaved ‘‘languages’’ that were either similar or di¤erent in pitch range and timbre. Results showed that when the two languages were the same in pitch range and in timbre, learners acquired moderate regularities among adjacent tones but did not acquire highly consistent regularities among nonadjacent tones – thus, they did not learn the nonadjacent dependency. However, when the two sets of nonadjacent dependencies di¤ered in pitch range or in timbre, such that temporally adjacent tones varied in these perceptual cues, learners acquired statistical regularities among the similar but temporally nonadjacent elements. Finally, when pitch range and timbre were both moderately similar, both adjacent and nonadjacent statistics were learned. Taken together, this set of studies showed conclusively that statistical learning is governed not only by temporal adjacency and statistical regularity, but also by perceptual similarity in physical parameters such as pitch range and timbre (Creel, Newport, & Aslin, 2004). Another approach in the study of timbre as a grouping cue against statistical learning was taken by Tillmann and McAdams (2004), who familiarized subjects with tone sequences of instrumental and synthesized timbres that were either close together or far apart in timbre, and


then asked subjects at test to identify tone sequences that sounded most familiar (Tillmann & McAdams, 2004). This design was aided by the addition of a no-learning control group of subjects, who were not exposed to the familiarized sequence but were instructed at test to choose which tone sequences sounded better as a unit. When timbre cues (a perceptual cue) were pitted against statistical regularities (a learning cue), tone sequences with timbres that were close together resulted in the best statistical learning performance, whereas tone sequences with timbres that were far apart resulted in the worst statistical learning performance. The control group, which was not exposed to statistical regularities, made ratings based on timbre similarity alone. These results provide a clear picture that is congruent with Creel et al. (2004) in showing that timbre is a perceptual cue that can both aid and hinder the learning of statistical regularities.
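
The two timbre dimensions that recur throughout this section can be given rough numerical stand-ins: "brightness" as the spectral centroid of the magnitude spectrum, and "bite" as the time taken to reach peak amplitude. The sketch below applies these descriptors to two synthesized tones; it is an illustration of the descriptors, not of any stimulus set used in the studies discussed here.

    # Rough stand-ins for "brightness" (spectral centroid) and "bite" (time to
    # peak amplitude), applied to synthesized tones rather than recordings.
    import numpy as np

    def spectral_centroid(signal, sample_rate):
        """Amplitude-weighted mean frequency of the magnitude spectrum."""
        spectrum = np.abs(np.fft.rfft(signal))
        freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
        return float(np.sum(freqs * spectrum) / np.sum(spectrum))

    def attack_time(signal, sample_rate):
        """Seconds from onset until the waveform reaches its peak magnitude."""
        return int(np.argmax(np.abs(signal))) / sample_rate

    sample_rate = 44100
    t = np.arange(sample_rate) / sample_rate
    f0 = 311.1  # roughly E-flat4, as in Figure 4

    # Slow attack, single partial vs fast decay, many partials.
    dull_slow = np.sin(2 * np.pi * f0 * t) * np.minimum(t / 0.3, 1.0)
    bright_fast = sum(np.sin(2 * np.pi * f0 * h * t) / h for h in range(1, 9)) * np.exp(-3 * t)

    for name, sig in [("dull/slow", dull_slow), ("bright/fast", bright_fast)]:
        print(name, round(spectral_centroid(sig, sample_rate)), "Hz centroid,",
              round(attack_time(sig, sample_rate), 3), "s to peak")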

14. Expectation and statistical learning Having seen how statistical learning can interact with perceptual qualities such as timbre, a logical question concerns the nature of the statistical learning system itself: is it a cognitive mechanism by itself or is it a combination of other cognitive mechanisms such as memory and attention? While studies have shown that success in statistical learning in both auditory and visual modalities is dependent upon attentional resources (Toro, Sinnett, & Soto-Faraco, 2005), it appears that expectation is a major cognitive mechanism that interacts with statistical learning, at least in the domain of music (Huron, 2006). According to Huron (2006), expectation is a psychological state that is informed by statistical learning; in all aspects of music (pitch, rhythm, melody, harmony, and timbre), the frequencies and probabilities of the input are important cues that inform us as to what expectations are most appropriate. One way of investigating expectations is through electrophysiological recordings comparing the fulfillment and violation of expectations. Using the Event-Related Potential (ERP) technique, electrical waveforms recorded from the surface of the scalp have yielded the Early Anterior Negativity (EAN, also known as ERAN in other studies e.g. Koelsch, et al., 2000), which is elicited by musical sounds that are unexpected given their musical context. The EAN is thought to be an index of expectation and is sensitive to di¤erent states of attention (Loui, et al., 2005). In a recent study (Loui, Wu, Wessel, & Knight, 2009), we recorded ERPs from subjects while they listened to statistically probable and improbable chord progressions in the


Bohlen-Pierce scale. Results showed a large Early Anterior Negativity (EAN) and a Late Negativity (LN) in response to improbable, and therefore unexpected, chord progressions (Loui et al., 2009). These two waveforms, EAN and LN, disappeared when the chords were made to be equally probable, suggesting that expectation is strongly sensitive to event probability. Furthermore, the amplitude of the EAN is predicted by individual subjects’ behavioral performance on two-alternative forced choice tests of generalization mentioned earlier in this chapter, suggesting that the strength of expectations is an effective index of statistical learning ability (Loui et al., 2009). Further support for the coupling between expectation and statistical learning was observed by Tillmann and Poulin-Charronnat (2010), who trained subjects on an artificial grammar of tone sequences and then investigated expectancies for grammatical and ungrammatical tones using a priming paradigm and an in-tune/out-of-tune judgment. Results showed that grammatical tones were processed faster and more accurately than ungrammatical ones, suggesting that statistical learning trains expectations for each event in the input (Tillmann & Poulin-Charronnat, 2010).

15. Emotional content in music

Meyer famously posited that emotion and meaning in music are a result of implicit learning (Meyer, 1956). This theory is elegant because it groups the emotional content in music with the cognitive content, such that statistical learning can be conceived to have an effect on emotional functioning. In this regard, music might hold unique power over human function in its ability to modulate emotions, and understanding if and how statistical learning in music relates to emotion may have profound implications for findings such as the Mozart Effect (Rauscher, Shaw, & Ky, 1993), which has been posited as a byproduct of emotional arousal (Thompson, Schellenberg, & Husain, 2001). We investigated the question of how statistical learning might influence our subjective emotional feelings towards music. In the same studies that investigated recognition and generalization described earlier (Loui & Wessel, 2008; Loui et al., 2010), after subjects were exposed to melodies in the Bohlen-Pierce scale, we conducted an additional test of emotional preference. Participants were simply asked to rate how much they liked each melody after hearing it once; ratings were on a scale of 1 to 7 (7 being


most liked; 1 being least). The effect of exposure on liking was assessed as a difference score between average ratings for familiar (previously-encountered) items and unfamiliar (previously-unencountered) items. Results are shown in Figure 5: subjects generally rated melodies that they had heard before as more preferable, and preference ratings increased with the number of repetitions. The effect of repeated exposure on preference ratings echoes the Mere Exposure Effect, previously shown in both musical (Tan, Spackman, & Peaslee, 2006) and non-musical (Zajonc, 1968) domains. In this sense, results are conclusive: the frequency with which we hear a melody influences our emotional response to it. It is notable, however, that being able to identify a melody as grammatical had no effect on its preference rating. Across several experiments, we observed that preference change is correlated with performance in recognition tests, rather than with performance in generalization tests. This suggests that emotion in music is related to memory but is dissociated from grammar learning, at least given the limited exposure within the course of our experiments. Further studies are needed to see if prolonged exposure will lead to generalization of preferences to previously-unencountered grammatical items.

Figure 5. Results from preference ratings plotted as an effect of exposure: the y-axis represents a difference score between ratings for familiar and unfamiliar stimuli. The more the subjects heard a set of melodies, the more they reported liking it – a replication of the Mere Exposure Effect.
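To make the measure concrete, the following sketch computes the kind of difference score described above from liking ratings; the data and field names are invented for illustration and are not taken from the study.

```python
from statistics import mean

def preference_difference_score(ratings):
    """Difference between mean liking ratings for familiar and unfamiliar melodies.

    `ratings` is a list of (was_familiar, liking_1_to_7) pairs; the 1-7 scale
    follows the description in the text, but the values below are made up.
    """
    familiar = [r for seen, r in ratings if seen]
    unfamiliar = [r for seen, r in ratings if not seen]
    return mean(familiar) - mean(unfamiliar)

trial_ratings = [(True, 5), (True, 6), (False, 4), (False, 3), (True, 5), (False, 4)]
print(preference_difference_score(trial_ratings))   # positive = familiar items preferred
```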


16. Concluding remarks

In this chapter I aimed to review research in statistical learning, especially with regard to music. I began by outlining early studies on the theory of statistical learning in language. I then extended the discussion to the modality-independence of the statistical learning mechanism, with special emphases on the parallels between language and music. In the musical domain, the first studies on statistical learning made use of tone sequence stimuli, effectively treating each tone as a syllable. However, as more research provided us with a deeper understanding of what sound structure entails, we moved on to other aspects of music such as relative and absolute pitch, harmony and tonality, and timbre. Although musicians may be more finely tuned to pitch differences due to their intensive training with pitch, nonmusicians may still benefit from years of passive exposure to statistics in musical sounds in the environment. These statistics may include the frequency structure of timbre in harmonic sounds. Frequency structure helps in providing information toward scale structure, as shown by results from the probe tone paradigm. From the section on sound structure, I moved on to discuss the cognitive mechanisms that might give rise to statistical learning, especially in how expectation for frequent and probable events might inform statistical learning. I concluded by drawing links between statistical learning and emotion in music, and outlining some ideas for possible future directions. So what has music told us about statistical learning? Based on investigations of statistical learning with music, we now know that the human brain is continuously updating itself with statistical information and forming a stable mental representation of sound structure. Types of statistical information include frequencies of events, as assessed using probe tone ratings, and transitional probabilities, as assessed using two-alternative forced choice tests. Frequencies and probabilities of the sound environment are acquired even by people who only experience the world through passive exposure rather than through active involvement or formal training. Statistical information can be presented over time, as in tone sequences, or it can be presented simultaneously as single temporal events, as with frequencies in complex tones and pitches in chords. The statistical learning system is useful for sharpening and enhancing our expectations, as shown in ERP and priming studies. The recognition and generalization results showed that people also form memories for rote items; this rote memory may result in ceiling levels of recognition performance but may not be as important in helping to generalize internalized


knowledge to new instances. In contrast to rote memory, grammar learning ability – as measured by generalization – is a flexible process but requires exposure to large sets of items. Finally, statistical learning may play a role in changing our emotional behavior by informing us of our preference implicitly through the Mere Exposure Effect. By developing a musical system that offers extensive possibilities for new combinations of sound, the hope is that music can be seen as a window into fundamental investigations of how statistics influence human perception, cognition, and emotion.

Acknowledgements

I would like to thank Carla Hudson Kam, David Wessel, Ervin Hafter, and Robert Knight for their support and mentorship, Carol Krumhansl for helpful discussions, and Loes Bazen, Sang-Hee Min, and Jennifer Zuk for valuable comments on an earlier version of this manuscript.

References Aslin, R., Sa¤ran, J. R., & Newport, E. 1998 Computation of conditional probability statistics by 8-month old infants. Psychological Science, 9(4), 321–324. Baharloo, S., Johnston, P. A., Service, S. K., Gitschier, J., & Freimer, N. B. 1998 Absolute pitch: an approach for identification of genetic and nongenetic components. Am J Hum Genet, 62(2), 224–231. Bernstein, L. 1973 The Unanswered Question: Six Talks at Harvard (Vol. 1). Cambridge: Harvard University Press. Besson, M., & Faita, F. 1995 An Event-Related Potential (ERP) Study of Musical Expectancy: Comparison of Musicians With Nonmusicians. Journal of Experimental Psychology: Human Perception and Performance, 21(6), 1278–1296. Bharucha, J. J., & Stoeckig, K. 1986 Reaction time and musical expectancy: priming of chords. J Exp Psychol Hum Percept Perform, 12(4), 403–410. Bigand, E., Perruchet, P., & Boyer, M. 1998 Implicit learning of an artificial grammar of musical timbres. Cahiers de Psychologie Cognitive/Current Psychology of Cognition, 17(3), 577–600.


Bigand, E., Tillmann, B., Poulin, B., D’Adamo, D. A., & Madurell, F. 2001 The e¤ect of harmonic context on phoneme monitoring in vocal music. Cognition, 81(1), B11–20.

Boersma, P., & Weenink, D. 2010

Praat: doing phonetics by computer (Version 5.1.44). Retrieved from http://www.praat.org Castellano, M. A., Bharucha, J. J., & Krumhansl, C. L. 1984 Tonal hierarchies in the music of north India. Journal of Experimental Psychology: General, 113(3), 394–412. Conway, C. M., & Christiansen, M. H. 2005 Modality-constrained statistical learning of tactile, visual, and auditory sequences. Journal of Experimental Psychology: Learning, Memory, and Cognition, 31(1), 24–39. Creel, S. C., Newport, E. L., & Aslin, R. N. 2004 Distant Melodies: Statistical Learning of Nonadjacent Dependencies in Tone Sequences. JEP: Learning, Memory, and Cognition, 30(5), 1119–1130. Deutsch, D., Dooley, K., Henthorn, T., & Head, B. 2009 Absolute pitch among students in an American music conservatory: Association with tone language fluency. The Journal of the Acoustical Society of America, 125(4), 2398–2403. Deutsch, D., Henthorn, T., & Dolson, M. 2004 Absolute Pitch, Speech, and Tone Language: Some Experiments and a Proposed Framework. Music Perception, 21(3), 339–356. Deutsch, D., Henthorn, T., Marvin, E. W., & Xu, H. 2006 Absolute pitch among American and Chinese conservatory students: prevalence di¤erences, and evidence for a speech-related critical period. J Acoust Soc Am, 119(2), 719–722. Esco‰er, N., & Tillmann, B. 2008 The tonal function of a task-irrelevant chord modulates speed of visual processing. Cognition, 107(3), 1070–1083. Fiser, J. z., & Aslin, R. N. 2002 Statistical learning of new visual feature combinations by infants. PNAS, 99(24), 15822–15866. Gaser, C., & Schlaug, G. 2003 Brain structures di¤er between musicians and non-musicians. J Neurosci, 23(27), 9240–9245. Gomez, R. L. 2002 Variability and Detection of Invariant Structure. Psychological Science, 13(5), 431–437. Hannon, E. E., & Trehub, S. E. 2005 Metrical Categories in Infancy and Adulthood. Psychological Science, 16(1), 48–55. Hauser, M. D., & McDermott, J. 2003 The evolution of the music faculty: a comparative perspective. Nat Neurosci, 6(7), 663–668.


Houston, D. M., & Jusczyk, P. W. 2000 The role of talker-specific information in word segmentation by infants. Journal of Experimental Psychology: Human Perception and Performance, 26(5), 1570–1582. Hunt, R. H., & Aslin, R. H. 2001 Statistical Learning in a Serial Reaction Time Task: Access to Separable Statistical Cues by Individual Learners. Journal of Experimental Psychology: General, 130(4), 658–680. Huron, D. 2006 Sweet Anticipation: Music and the Psychology of Expectation (1 ed. Vol. 1). Cambridge, MA: MIT Press. Imfeld, A., Oechslin, M. S., Meyer, M., Loenneker, T., & Jancke, L. 2009 White matter plasticity in the corticospinal tract of musicians: a di¤usion tensor imaging study. Neuroimage, 46(3), 600–607. Jonaitis, E. M., & Sa¤ran, J. R. 2009 Learning Harmony: The Role of Serial Statistics. Cognitive Science, 33(5), 951–968. Kameoka, A., & Kuriyagawa, M. 1969a Consonance theory part I: consonance of dyads. J Acoust Soc Am, 45(6), 1451–1459. Kameoka, A., & Kuriyagawa, M. 1969b Consonance theory part II: consonance of complex tones and its calculation method. J Acoust Soc Am, 45(6), 1460–1469. Kirkham, N. Z., Slemmer, J. A., & Johnson, S. P. 2002 Visual statistical learning in infancy: evidence for a domain general learning mechanism. Cognition, 83, B35–B42. Kleber, B., Veit, R., Birbaumer, N., Gruzelier, J., & Lotze, M. 2009 The brain of opera singers: experience-dependent changes in functional activation. Cereb Cortex, 20(5), 1144–1152. Koelsch, S., Grossmann, T., Gunter, T. C., Hahne, A., Schroger, E., & Friederici, A. D. 2003 Children processing music: electric brain responses reveal musical competence and gender di¤erences. Journal of Cognitive Neuroscience, 15(5), 683–693. Koelsch, S., Gunter, T., Friederici, A. D., & Schroger, E. 2000 Brain indices of music processing: ‘‘nonmusicians’’ are musical. J Cogn Neurosci, 12(3), 520–541. Krumhansl, C. 1990 Cognitive Foundations of Musical Pitch: Oxford University Press. Krumhansl, C. L., & Kessler, E. J. 1982 Tracing the dynamic changes in perceived tonal organization in a spatial representation of musical keys. Psychol Rev, 89(4), 334–368.


Krumhansl, C. L., Toivanen, P., Eerola, T., Toiviainen, P., Jarvinen, T., & Louhivuori, J. 2000 Cross-cultural music cognition: cognitive methodology applied to North Sami yoiks. Cognition, 76(1), 13–58. Lively, S. E., Pisoni, D. B., Yamada, R. A., Tohkura, Y., & Yamada, T. 1994 Training Japanese listeners to identify English /r/ and /l/. III. Long-term retention of new phonetic categories. J Acoust Soc Am, 96(4), 2076–2087. Loui, P. 2010 Sound Frequency Distribution Enhances Statistical Learning. submitted. Loui, P., Grent-’t-Jong, T., Torpey, D., & Woldor¤, M. 2005 E¤ects of attention on the neural processing of harmonic syntax in Western music. Cognitive Brain Research, 25(3), 678–687. Loui, P., & Wessel, D. 2007 Harmonic expectation and a¤ect in Western music: E¤ects of attention and training. Perception & Psychophysics, 69(7), 1084– 1092. Loui, P., & Wessel, D. L. 2008 Learning and Liking an Artificial Musical System: E¤ects of Set Size and Repeated Exposure. Musicae Scientiae, 12(2), 207–230. Loui, P., Wessel, D. L., & Hudson Kam, C. L. 2010 Humans Rapidly Learn Grammatical Structure in a New Musical Scale. Music Perception, 27(5), 377–388. Loui, P., Wu, E. H., Wessel, D. L., & Knight, R. T. 2009 A Generalized Mechanism for Perception of Pitch Patterns. Journal of Neuroscience, 29(2), 454–459. Lynch, M. P., Eilers, R. E., Oller, D. K., & Urbano, R. C. 1990 Innateness, experience, and music perception. Psychological Science, 1(4), 272–276. Magne, C., Schon, D., & Besson, M. 2006 Musician children detect pitch violations in both music and language better than nonmusician children: behavioral and electrophysiological approaches. J Cogn Neurosci, 18(2), 199–211. Mathews, M. V., Pierce, J. R., Reeves, A., & Roberts, L. A. 1988 Theoretical and experimental explorations of the Bohlen-Pierce scale. J Acoustical Soc Am, 84, 1214–1222. Maye, J., Werker, J. F., & Gerken, L. 2002 Infant sensitivity to distributional information can a¤ect phonetic discrimination. Cognition, 82(3), B101–111. Meulemans, T., & Van der Linden, M. 1997 Associative chunk strength in artificial grammar learning. Journal of Experimental Psychology: Learning, Memory, and Cognition, 23(4), 1007–1028.


Meyer, L. 1956 Emotion and Meaning in Music: U of Chicago Press. Needham, A., Dueker, G., & Lockhead, G. 2005 Infants’ formation and use of categories to segregate objects. Cognition, 94(3), 215–240. Oechslin, M. S., Imfeld, A., Loenneker, T., Meyer, M., & Jancke, L. 2010 The plasticity of the superior longitudinal fasciculus as a function of musical expertise: a di¤usion tensor imaging study. [Original Research Article]. Frontiers in Human Neuroscience, 3, 1–12. Pantev, C., Ross, B., Fujioka, T., Trainor, L. J., Schulte, M., & Schulz, M. 2003 Music and learning-induced cortical plasticity. Ann N Y Acad Sci, 999, 438–450. Patel, A. 2008 Music, Language, and the Brain: Oxford University Press. Pegg, J. E., & Werker, J. F. 1997 Adult and infant perception of two English phones. Journal of the Acoustical Society of America, 102, 3742–3753. Piston, W., & DeVoto, M. 1987 Harmony: WW Norton. Rauscher, F. H., Shaw, G. L., & Ky, K. N. 1993 Music and spatial task performance. Nature, 365(6447), 611. Richtsmeier, P. T., Gerken, L., Go¤man, L., & Hogan, T. 2009 Statistical frequency in perception a¤ects children’s lexical production. Cognition, 111(3), 372–377. Russo, F. A. 2009 Towards a functional hearing test for musicians: The probe tone method. In M. Chasin (Ed.), Hearing Loss in Musicians (pp. 145– 152). San Diego, CA: Plural Publishing. Sa¤ran, J. R. 2003 Absolute pitch in infancy and adulthood: the role of tonal structure. Developmental Science, 6(1), 35–43. Sa¤ran, J. R., Aslin, R. N., & Newport, E. 1996 Statistical learning by 8-month-old infants. Science, 274, 1926– 1928. Sa¤ran, J. R., & Griepentrog, G. J. 2001 Absolute pitch in infant auditory learning: evidence for developmental reorganization. Dev Psychol, 37(1), 74–85. Sa¤ran, J. R., Johnson, E. K., Aslin, R. N., & Newport, E. L. 1999 Statistical learning of tone sequences by human infants and adults. Cognition, 70, 27–52. Schellenberg, E. G., Bigand, E., Poulin-Charronnat, B., Garnier, C., & Stevens, C. 2005 Children’s implicit knowledge of harmony in Western music. Developmental Science, 8(6), 551–566. Schellenberg, E. G., & Trehub, S. E. 1996 Natural musical intervals: Evidence from infant listeners. Psychological Science, 7(5), 272–278.


Schlaug, G., Forgeard, M., Zhu, L., Norton, A., & Winner, E. 2009 Training-induced neuroplasticity in young children. Ann N Y Acad Sci, 1169, 205–208. Schlaug, G., Jancke, L., Huang, Y., Staiger, J. F., & Steinmetz, H. 1995 Increased corpus callosum size in musicians. Neuropsychologia, 33(8), 1047–1055. Schlaug, G., Jancke, L., Huang, Y., & Steinmetz, H. 1995 In vivo evidence of structural brain asymmetry in musicians. Science, 267(5198), 699–701. Schwartz, D. A., Howe, C. Q., & Purves, D. 2003 The statistical structure of human speech sounds predicts musical universals. J Neurosci, 23(18), 7160–7168. Sethares, W. 2004 Tuning Timbre Spectrum Scale: Springer-Verlag. Siegel, W. 1972 Memory e¤ects in the method of absolute judgment. Journal of Experimental Psychology, 94(2), 121–131. Sluming, V., Brooks, J., Howard, M., Downes, J. J., & Roberts, N. 2007 Broca’s area supports enhanced visuospatial cognition in orchestral musicians. J Neurosci, 27(14), 3799–3806. Tan, S. L., Spackman, M. P., & Peaslee, C. L. 2006 The E¤ects of Repeated Exposure on Liking and Judgment of Intact and Patchwork Compositions. Music Perception, 23(5), 407–421. Tervaniemi, M., Rytkonen, M., Schroger, E., Ilmoniemi, R. J., & Naatanen, R. 2001 Superior formation of cortical memory traces for melodic patterns in musicians. Learn Mem, 8(5), 295–300. Thompson, W. F., Schellenberg, E. G., & Husain, G. 2001 Arousal, mood, and the Mozart e¤ect. Psychol Sci, 12(3), 248– 251. Tillmann, B., Bigand, E., Esco‰er, N., & Lalitte, P. 2006 The influence of musical relatedness on timbre discrimination. European Journal of Cognitive Psychology, 18(3), 343–458. Tillmann, B., & McAdams, S. 2004 Implicit Learning of Musical Timbre Sequences: Statistical Regularities Confronted With Acoustical (Dis)Similarities. Journal of Experimental Psychology: Learning, Memory, and Cognition, 30(5), 1131–1142. Tillmann, B., & Poulin-Charronnat, B. 2010 Auditory expectations for newly acquired structures. Q J Exp Psychol (Colchester), 1–19. Toro, J. M., Sinnett, S., & Soto-Faraco, S. 2005 Speech segmentation by statistical learning depends on attention. Cognition, 97(2), B25–B34.


Werker, J. F., & Tees, R. C. 1984 Phonemic and phonetic factors in adult cross-language speech perception. J Acoust Soc Am, 75(6), 1866–1878. Wessel, D. L. 1979 Timbre Space as a Musical Control Structure. Computer Music Journal, 3(2), 45–52. Winkler, I., Haden, G. P., Ladinig, O., Sziller, I., & Honing, H. 2009 Newborn infants detect the beat in music. Proc Natl Acad Sci U S A, 106(7), 2468–2471. Zajonc, R. B. 1968 Attitudinal E¤ects of Mere Exposure. Journal of Personality and Social Psychology, 11(224–228.). Zicarelli, D. 1998 An extensible real-time signal processing environment for Max. Paper presented at the Proceedings of the International Computer Music Conference, University of Michigan, Ann Arbor, USA.

‘‘I let the music speak’’: Cross-domain application of a cognitive model of musical learning

Geraint A. Wiggins

‘‘I’m hearing images, I’m seeing songs
No poet ever has painted. . .’’
‘‘I let the music speak’’ from The Visitors (ABBA, 1981)

1. Introduction

In this chapter, I introduce a new thread of work on language perception derived from earlier modelling efforts focused on the perception of music. I will briefly discuss the relationship between music and language, viewed as cognitive processes, and then I will describe a complex cognitive model of musical melody perception which is founded, on the one hand, in cognitive musicology and, on the other, in information theory. A particular feature of this model, which sets it apart from most, if not all, comparable models in the linguistics literature, is multidimensionality: it is capable of modelling perceptual phenomena whose percepts are multidimensional constructs – a capability without which models of music cannot succeed in general. Part of the contribution of the current work is to assess the importance of multidimensional processing in language perception. Throughout the work, I take what might be called a strong view of statistical language processing, in which statistical estimation is paramount in cognition, being used not merely as a theoretical analysis tool, but as a functional simulation (albeit at a rather abstract level) of the actual mechanism. The key notion here will be association, both diachronic and synchronic, and one aim in presenting the work is to sketch a contextual basis for why a literal statistical processing model may be credible. It follows, therefore, that the interest here is focused on the process and not on the data: I am not interested in the statistics of language corpora in their own right, but rather in the processes that brains use to estimate them. I will end with the results of a small-scale study of syllable identification – a well-rehearsed problem in language modelling, but one which


demonstrates the utility of the multidimensional model straightforwardly, and for which independent, objective ground truth is readily available. Before beginning my main exposition, I believe it is worthwhile to address a philosophical issue which frequently arises in the study of music, and which I believe has arisen, at least in the past, in the study of language. My position on this issue fundamentally underpins the reasoning presented in the present chapter, and is crucial to its sense. My position is this: music and language are both psychological phenomena, and they have no meaningful existence outside minds. Stated baldly, thus, this claim may seem uncontroversial (even when we consider notation: marks on a page still require interpretation), but its extension into rigorous study is more problematic. Specifically, one frequently observes the students of both phenomena discussing and treating their observable elements as though these were the full extent of the phenomena themselves. In the early twenty-first century, the audio-engineering community began to notice that it was not possible to analyse musical sound automatically, using purely mathematical, signal-processing methods, and thus achieve a perfect match with a predicted result proposed as a ‘‘ground truth’’1: indeed, such accuracy was limited to around 70% (Aucouturier, 2009; Aucouturier and Pachet, 2004), for just about any task requiring what might be called ‘‘musical intelligence’’. The limit on accuracy was named the ‘‘glass ceiling’’ and the missing 30%, the ‘‘semantic gap’’. I have discussed the methodological issues around this elsewhere (Wiggins, 2009, 2010); suffice it here to point out the two fundamental problems leading to this position: first, the omission of cognitive elements and listener models from the attempts; and, second, the related and false assumption that there is ‘‘ground truth’’ for music in general at all. Aucouturier (2009) argues, correctly, that it is necessary to borrow methods from psychology when analysing data from comparisons of human listeners with these automated systems, but omits the point that the music itself, as both form and function, is subjective and mutable in its very nature, because it is generated by human psychology in the first instance2. Both of these

1. This is a term frequently borrowed into engineering from geography, in which context its meaning is relatively clear. Regrettably, the analogy between geographical and psychological phenomena is rarely reliable. 2. On the other hand, language is grounded in reference to the world, and so its mutability is more restricted. This, and other arguments presented here, parallel thinking of Chater and Christiansen (2010) in the linguistic context.


aspects need to be addressed if scientific methods of studying music are to be successful. Equally, musicology has tended to avoid explicit reference to the listener, though human listening mechanisms are implicit in most successful explications of musical structure (Lerdahl and Jackendoff, 1983; Nattiez, 1975; Schenker, 1925). Parallel effects are to be found in linguistics, particularly in the emphasis on syntactic form arising from Chomsky (1965). Chomsky writes

Making the assumption that all of syntactic processing is somehow encoded in the surface form and/or deep structure of the language seems to me (and others, such as Steedman, 2000) to be fundamentally problematic, since it ignores the presence of the interpreting process altogether, claiming that is is ‘‘grammatically irrelevant’’. The notion of ‘‘correct’’ syntax, in the sense of a predefined listener’s grammar, can quickly be reduced to absurdity, on the one hand, by the art of E. E. Cummings3 or, on the other, by the (nevertheless e¤ectual) communication of the average teenager. It is easy to see how an over-emphasis on syntactic correctness (ground truth) can arise from Victorian teaching methods and their descendants, but it is equally easy to demonstrate that it is not necessary for e¤ective communication: a clear example Yoda the Jedi Master gives us4. Ultimately, therefore, a successful theory of language will account for the transmission of information not just via ‘‘correct’’ (syntactically expected) utterances, but also via ‘‘incorrect’’ (unexpected) ones, and, what is more, it will explore just how unexpected utterances can be before they are incomprehensible, and it will account for the changes that result from the spread of unexpected utterances in linguistic societies until they become expected. Essentially, therefore, such a theory can only be about mechanism, distinct from the substrate of study, because it needs to account for unpredicted change in 3. Contrary to popular myth, it is not correct to write Cummings’ name in lower case. 4. If sense this does not make, see Star Wars you must.


that substrate of study, at the meta-level with respect to its lexicon and structure (Wiggins, 2011). Ultimately, to be methodologically correct, we must refrain from thinking about music or language as though they were ‘‘out there’’, to be studied like physical or geographical phenomena, but instead look for their source, in mental process. The consequence of this is that one is compelled to seek learning-based approaches: otherwise, one cannot account for the mutability of music and language over time, nor for the variation clearly exhibited between individuals. I am quite sure that these points are obvious to many readers of this chapter. However, my personal experience shows that they are not obvious to everyone, and that not everyone is even aware of the distinctions I address in the current section. From the wrong side of that distinction, what follows will make no sense at all.

3. Contrary to popular myth, it is not correct to write Cummings’ name in lower case.
4. If sense this does not make, see Star Wars you must.

2. Studying the Relationship between Music and Language

The relationship between music and language at the cognitive level is a hot topic in the music cognition research world. There are broadly two common approaches to studying the relationship: looking for similarities in observed and reported behaviour; and comparing the neurophysiology of data processing in the respective domains (e.g., Patel, 2008). Both of these approaches have borne fruit, suggesting objective similarities and differences, and Patel (2008) has proposed specific hypotheses concerning shared resources for structural processing of audio sequences, and given empirical evidence to support them. Here, I consider a third way, enabled by computer science: a simulation approach, where computer programs embodying theories of processing mechanism can be applied (at an appropriate level of abstraction) to both domains, and differences and similarities can be objectively and repeatably observed. These observations serve as hypothesis generators, allowing us to make predictions about the corresponding cognitive systems, which can then be subject to empirical challenge in turn. In order to pursue such an approach, it is necessary to consider carefully the representation of the problem one is addressing. If one is to propose a theoretical mechanism, and operationalise it in a program, then the theory must be very precise indeed. Equally, since it is not currently computationally feasible to simulate a brain, it is necessary to consider different levels of representation at which the theory may reside, and choose one (or more) which properly model(s) the aspects of the behaviour one is interested


in. In the current work, I choose a level fairly distant from both the neural substrate and the observed human behaviour associated with language and music, but placed on a clear trajectory between them. Thus, we can expedite research by understanding different levels simultaneously and then fitting them together later, as opposed to working step-wise from either end. The level of data abstraction used here is that of what might be called atomic auditory percepts: that is to say, structures which are induced by sonic stimuli, and which are not subject to decomposition by everyday perceptual mechanisms. For example, the average listener does not (and cannot) hear the distinct harmonics which are physically present in a musical note or a phoneme, even though they are made explicit in the physical processing mechanism of the inner ear; however, the sensation of timbre (tone colour) results directly from the relative energy in those harmonics. At this level, there are common elements to represent in speech and music: each percept has timbre, pitch, amplitude and duration. In the world, timbre, pitch and amplitude of a given note may be dynamic, but here we work at a level above that change, so, for example, the twang of a sitar note is viewed as a single structure. These data are placed in temporal sequence, not overlapping – and here is an area where we idealise: while it is not unreasonable, at least as a first approximation, to view speech as an uninterrupted stream of phonemes from a given speaker, music which consists of a solo single line is less common (though, in the music of the world in general, it is more common than a Westerner might think). To address this in our work, my colleagues and I have carefully selected music which can be treated in this way, taking care not to cut against the musical grain when we prepare our corpora (Potter et al., 2007; Wiggins, 2010). The mechanism studied here performs unsupervised learning of percept sequences in time. The ability to deduce consequents from antecedents is fundamental to engagement with the world, whether this be encoded in genes, as in insects, or learned and reasoned in humans, so there is every reason to hypothesise that a mechanism enabling the capacity would be rather general and applicable to perceptual events as much as (and indeed prior to) more abstract concepts. The evolutionary incentive for this capacity is clear: it admits the experience of expectation, and expectation allows organisms to anticipate, and thus steal the march on organisms which do not. Further, given an expectation, one can identify what is not expected, and thus modulate attention (conscious or non-conscious5)


allowing more efficient use of cognitive resources: unexpected events need extra processing because they cannot be dealt with in the expected way. Finally, it is important to understand that I view the model presented in the next section as a simulation of actual cognitive processing. In other words, the model is not presented merely as a way of capturing regularities in the observed data for the purposes of study, but as a theory of processing mechanism in itself. It is clear that aspects of it are approximate (as I will explain below), and it is clear that it is a model of function and not of implementation, since there are no neurons (artificial or otherwise) in it. However, understanding function at various levels of abstraction is crucial to understanding implementation, and this is the motivation of the current and other similar work. In the context of cross-domain comparison, comparable performance of a model in two domains yields evidence that it is possible for the same cognitive mechanism to be implicated in both.

5. ‘‘Non-conscious attention’’, which looks, at first sight, like an oxymoron, is perhaps best thought of as the application of processing power.

3. The IDyOM Model of Musical Melody Perception

I now present an overview of the Information Dynamics of Music (IDyOM) model of musical melody processing. Many of the issues discussed in this section apply to sequences of auditory percepts other than musical melody, and that is the motivation for this paper. As a caveat: the use of the word ‘‘model’’ is problematic here, since it is the only appropriate term to use for the whole of the IDyOM theory-and-system, which is a model of a process, but also for some of its components, which are (Markov) models of data. This work is motivated by empirical evidence of implicit learning of statistical regularities in musical melody (Oram and Cuddy, 1995; Saffran et al., 1996, 1999). In particular, Krumhansl et al. (1999) presented evidence for the influence of higher-order distributions in melodic learning. Ponsford et al. (1999) used 3rd and 4th order models to capture implicit learning of harmony, and evaluated them against musicological judgements. So there is evidence that broadly the same kind of model can capture two different aspects of music (melody and harmony), but also that such models predict the expectations of untrained listeners as well as specialist theoreticians. The aim, then, was to construct a computational system embodying these theories and to subject them to rigorous testing.

3.1. An Unsupervised Learning Model

The core of IDyOM is a model of human melodic pitch prediction, invented by Marcus Pearce (2005), building on modelling methods invented by


Conklin (1990) and Conklin and Witten (1995), who, in turn, drew on methods from statistical linguistics (see §3.2). Its method of learning is unsupervised6, simulating implicit learning by exposure alone, as opposed to training, and so it is strongly bottom-up in the sense defined by Cairns et al. (1997). To do so, it uses a Markov model, or n-gram, system. I assume that readers of this volume will be familiar with such systems; if this is not the case, a detailed explanation is given by Manning and Schütze (1999, ch. 9). As the model is exposed to the corpus of musical melody from which it learns, it creates a compact representation of the data (Pearce, 2005), designed in such a way as to facilitate matching of sequences of notes against sequences encountered in the corpus. The basic idea of Markov/n-gram modelling is extended in two significant ways. First, the model is of variable order, using a back-off smoothing strategy that allows context matches of all possible lengths to contribute probability mass to the distribution resulting from each prediction (Cleary and Witten, 1984), and an escape strategy admitting estimation of distributions that include previously-unseen symbols (Cleary and Witten, 1984; Moffat, 1990). Pearce and Wiggins (2004) conducted a detailed comparative study of the methods available and concluded that the combination used in IDyOM was the most effective for musical melody. The back-off strategy, a development of Prediction by Partial Matching (PPM; Cleary and Witten, 1984) called PPM* (Cleary and Teahan, 1997), works by trying the longest possible context and working down to no context, summing the probabilities until the context is empty; each probability is weighted proportionally to the number of steps of back-off required to reach it. The escape method used in IDyOM is the one called Method C by Moffat (1990). Second, the model is multidimensional, in two ways. First, the IDyOM system is configured with two functionally identical (sub-)models, illustrated in Figure 1, one for Long Term (LTM), which is exposed to an entire corpus (modelling the learned experience of a listener), and one for Short Term (STM), which is exposed only to the current melody (modelling a specific listening experience).

6. This form of learning is sometimes referred to as ‘‘supervised, from positive examples only’’; however, it seems to me that the notion of supervision is misleading in the current context. Here, the data is merely presented, and there are merely examples, with no external notion of correctness. The question, then, is whether the learning mechanism correctly captures the behaviour being studied.
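The following Python sketch illustrates the general idea of a variable-order n-gram predictor with back-off. It is a deliberately simplified stand-in for the PPM*-based scheme IDyOM actually uses: the back-off weights and escape mass here are arbitrary constants rather than the Cleary–Witten/Moffat estimates, and the class and variable names are invented for illustration. The point is only to show how context matches of several lengths can each contribute probability mass to a single predictive distribution.

```python
from collections import defaultdict

class BackoffNGramModel:
    """Toy variable-order n-gram predictor with a crude back-off scheme."""

    def __init__(self, max_order=3, alphabet=None):
        self.max_order = max_order
        self.alphabet = set(alphabet or [])
        # counts[context][symbol] = number of times symbol followed context
        self.counts = defaultdict(lambda: defaultdict(int))

    def train(self, sequence):
        """Count symbol occurrences after every context up to max_order."""
        for i, symbol in enumerate(sequence):
            self.alphabet.add(symbol)
            for order in range(self.max_order + 1):
                if i - order < 0:
                    continue
                context = tuple(sequence[i - order:i])
                self.counts[context][symbol] += 1

    def predict(self, context):
        """Return a distribution over the alphabet given the preceding context."""
        context = tuple(context[-self.max_order:])
        dist = {s: 1e-6 for s in self.alphabet}   # small escape mass for unseen symbols
        # Back off from the longest matching context to the empty context,
        # down-weighting shorter contexts (a stand-in for PPM-style weighting).
        for steps_back in range(len(context) + 1):
            sub = context[steps_back:]
            seen = self.counts.get(sub)
            if not seen:
                continue
            total = sum(seen.values())
            weight = 0.5 ** steps_back
            for symbol, count in seen.items():
                dist[symbol] += weight * count / total
        norm = sum(dist.values())
        return {s: p / norm for s, p in dist.items()}

# Example: learn a toy melody as scale-degree symbols and predict the next note.
melody = [1, 2, 3, 1, 1, 2, 3, 1, 3, 4, 5, 3, 4, 5]
model = BackoffNGramModel(max_order=2)
model.train(melody)
print(model.predict([3, 4]))   # probability mass should concentrate on 5
```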


Figure 1. Schematic diagram of the IDyOM viewpoint model, showing a subset of available viewpoints. Basic viewpoints are in solid, rounded boxes; derived viewpoints are in double-edged rounded boxes; linked viewpoints are in hexagonal boxes; and threaded viewpoints are in broken-lined boxes. Di denotes a distribution produced by predicting from a particular viewpoint or viewpoint combination.

Each model produces a distribution (over all possible discrete values of the feature) predicting each note as the melody proceeds, and the two distributions may be combined to give a final output as illustrated in Figure 2, weighted by the Shannon (1948) entropy of the distribution (distributions containing more information are weighted more heavily: Conklin, 1990; Pearce et al., 2005). There are five different configurations: each model alone (STM, LTM), two models together (BOTH) – where the LTM is fixed and does not learn from the stimulus data to which the model is currently being exposed – and LTM+ and BOTH+, correspondingly – where the LTM does learn from stimulus data as the stimulus proceeds. LTM+, BOTH and BOTH+ are serious candidates as models of human music cognition; STM and LTM alone are included for completeness, but in fact both can tell us interesting things about musical structure (Potter et al., 2007).


Figure 2. Schematic diagram of combined IDyOM short term and long term models. Iconography is as in Figure 1.

The second multidimensional aspect is within each sub-model, where there are multiple distributions derived from multiple features of the data, as detailed in the next section; each feature may be thought of as having its own Markov model. These distributions are combined using the same weighting strategy, weighted by the Shannon (1948) entropy of the distribution (distributions containing more information are weighted more heavily), to give the overall output distributions of the STM and LTM. The full detail of this arrangement is given by Pearce et al. (2005) and Pearce (2005).
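As a rough illustration of entropy-based weighting, the sketch below blends two predictive distributions, giving more weight to the sharper (lower-entropy) one. The weighting function is a hypothetical simplification, not the scheme specified by Pearce et al. (2005).

```python
import math

def entropy(dist):
    """Shannon entropy (bits) of a distribution given as {symbol: probability}."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def combine(distributions, alphabet):
    """Blend several distributions over the same alphabet, favouring sharp ones."""
    max_entropy = math.log2(len(alphabet))        # entropy of the uniform distribution
    weights = []
    for dist in distributions:
        relative = entropy(dist) / max_entropy    # 0 = certain, 1 = uniform
        weights.append(1.0 / (relative + 1e-9))   # sharper predictions count for more
    total_weight = sum(weights)
    combined = {}
    for symbol in alphabet:
        combined[symbol] = sum(
            w * dist.get(symbol, 0.0) for w, dist in zip(weights, distributions)
        ) / total_weight
    norm = sum(combined.values())
    return {s: p / norm for s, p in combined.items()}

# Example: a confident long-term model and a flatter short-term model.
alphabet = ["C", "D", "E", "G"]
ltm = {"C": 0.70, "D": 0.10, "E": 0.10, "G": 0.10}
stm = {"C": 0.25, "D": 0.25, "E": 0.25, "G": 0.25}
print(combine([ltm, stm], alphabet))   # pulled towards the sharper LTM prediction
```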


It is important to note that at no stage in using the model is it given the answers that it is expected to produce (viz., measurements of human behaviour), nor is it optimised by any process having access to those answers. Thus, the predictions it makes about pitch and segmentation (see below) are in a sense epiphenomenal, and this is the strongest reason for proposing IDyOM, and the strong statistical view in general, as a veridical mechanistic model of music cognition at this level of abstraction: it does what it is required to do in an entirely mechanistic, domain-independent way, without being told how.

3.2. Data Representation

IDyOM operates at the level of abstraction described above: its input is a representation of musical percepts (notes) described in terms of two fundamental dimensions, pitch and time. These dimensions, however, give rise to multiple features of each note, derived from pitch or time or a combination of the two. Added to these representations of the percepts themselves is an explicit representation of sequence in time: the sequence is the fundamental unit of representation here. Conklin (1990) introduced a neat, uniform way of thinking about these features of data sequences, and it is used in IDyOM. Given a sequence of percepts, we define functions, known as viewpoints, which select initial contiguous subsequences of the available sequence and choose particular features of the percepts in those sequences. Thus, we might define a viewpoint that selects pitch from melodic data; at each point along the melody it will return the pitches of all the notes encountered so far, in sequence. These viewpoints form the contexts used to predict unseen symbols from the given data, the back-off strategy allowing the entirety of the viewpoint to be used, as well as incrementally shorter contexts, immediately preceding the unseen symbol. The model starts from basic viewpoints, which are literal selections of note features as they are presented to the system, including7 pitch, note start time, duration, mode, and tonic (key note). From these, further viewpoints may be derived, such as pitch interval (the perceived distance between two pitches). Viewpoints may be linked, which effectively creates a new, compound viewpoint whose alphabet is the cross-product of the alphabets of the two viewpoints being linked. This is denoted by A ⊗ B, where A and B are the source viewpoints. Finally, threaded viewpoints allow the selection of elements of a sequence, depending on an external predicate: for example, it is possible to select the pitch of the first note in each phrase of a melody, if phrasing information is given.

7. To be absolutely precise, some of the viewpoints named here are in 1-to-1 correspondence with viewpoints used in the actual implementation; however, this is theoretically immaterial. The names used here are more musically informative.
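The viewpoint idea can be illustrated with a few lines of Python: basic viewpoints read a feature directly from each note, derived viewpoints are computed from basic ones, and a linked viewpoint pairs two viewpoints so that its alphabet is the cross-product of theirs. The note encoding and function names below are invented for illustration and do not correspond to IDyOM’s internal representation.

```python
notes = [
    {"pitch": 60, "onset": 0.0, "duration": 1.0},
    {"pitch": 62, "onset": 1.0, "duration": 1.0},
    {"pitch": 64, "onset": 2.0, "duration": 0.5},
    {"pitch": 62, "onset": 2.5, "duration": 1.5},
]

def pitch(seq):                      # basic viewpoint: read the feature directly
    return [n["pitch"] for n in seq]

def duration(seq):                   # basic viewpoint
    return [n["duration"] for n in seq]

def pitch_interval(seq):             # derived viewpoint: difference between successive pitches
    ps = pitch(seq)
    return [None] + [b - a for a, b in zip(ps, ps[1:])]

def link(viewpoint_a, viewpoint_b):  # linked viewpoint: element-wise pairing (cross-product alphabet)
    def linked(seq):
        return list(zip(viewpoint_a(seq), viewpoint_b(seq)))
    return linked

interval_x_duration = link(pitch_interval, duration)

print(pitch(notes))                 # [60, 62, 64, 62]
print(pitch_interval(notes))        # [None, 2, 2, -2]
print(interval_x_duration(notes))   # [(None, 1.0), (2, 1.0), (2, 0.5), (-2, 1.5)]
```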


Each of these data-feature models is carefully considered in music-perceptual, musicological and mathematical terms (Wiggins et al., 1989), in some cases using feedback from musical expert participants (Pearce and Wiggins, 2007). Each viewpoint models a percept which is expressed and used in music theory, and thence there is clear, careful motivation for each feature8. Having said this, it is important to understand that we are not predisposing the key feature of the system, its operation over sequences of percept features, in any hard-coded or rule-based way. These features are merely the properties of the data, psychologically grounded at a level of abstraction below the level of interest of the current study, that are made available for prediction; thus, their use does not contradict my claims of domain-generality and methodological neutrality at the level of interest of sequence processing. How those properties arise is not our focus of interest in the current presentation, but will be the object of future work. The system itself selects which of the available representations is actually used, as described in the next section.

3.3. Viewpoint Selection

The learning system is enhanced by an optimisation step, based on the hypothesis that brains compress information, and that they do so efficiently.

8. Music theory is arguably the most formally developed example of a folk psychology currently extant, being based on extensive and careful study of the aural constructs used in a particular musical culture (Western art music), and their associated semiotic connotations, in terms of their usage in that culture. A point sometimes missed in the interdisciplinary music literature is that the constructs of music theory almost always correspond with perceptual principles identifiable in general auditory psychology. For example, the musical concept of melody relies on auditory streaming (Bregman, 1990) of sequences of pitched events (Wiggins et al., 1989), and artistic attempts deliberately to create alternative notions of melody which break these constraints, such as Schoenberg’s tonfarbenmelodie (Schoenberg, 1974), have met with less success. Further, Western music notation often encodes these musical properties (in particular, the overarching construct of tonality) implicitly.


The optimisation works by choosing the representation of the musical features from a pre-defined repertoire of music-theoretically valid representations, here defined by the set of viewpoints used in a model. For example, two pitch viewpoints (representations of pitch) are available, one in absolute terms and one in terms of the difference (interval, in musical terms) between successive notes. The system chooses the relative representation and discards the absolute one, because the relative representation allows the music to be represented independently of musical key, and this requires fewer symbols (by a factor of 12). There is evidence that humans may go through a similar process as exposure to music increases: infants demonstrate absolute pitch, but the vast majority quickly learn relative pitch, and this becomes the dominant percept (Saffran and Griepentrog, 2001). Nevertheless, there is also evidence that people who develop relative pitch retain their absolute perception at a non-conscious level (Levitin, 1994). Again, it is important to emphasise that no training, nor programmer intervention, with respect to or in favour of the solutions being sought, is involved here: using a hill-climbing search method applied over the set of all viewpoints present (Pearce, 2005), the system objectively picks the set of viewpoints that encodes the data in a model with the lowest possible average information content9 (h). Thus, the data itself determines the selection of the viewpoints best able to represent it efficiently; a level playing field for prediction is provided by the fact that each viewpoint distribution is converted into a basic one before comparison: thus, in the music work, h is computed from the pitch distribution of each model; here, it is computed from the phoneme distribution. The selection approach is a brute-force simulation of a more subtle process proposed in cognitive theories such as that of Gärdenfors (2000), which allow for the re-representation of conceptual spaces in response to newly learned data: in Gärdenfors’ terms, viewpoints are quality dimensions, which can be rendered redundant by new, alternative, learned additions to the representational ontology, and therefore forgotten, or at least de-emphasised. A general mechanism by which this may take place in our statistical model is a focus of our current research, beyond the scope of the current paper.

9. Manning and Schütze (1999) and Conklin (1990) call this same quantity ‘‘cross-entropy’’; I find the current terminology more accurately descriptive.
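A toy version of this selection step is sketched below: a greedy hill-climbing search adds whichever candidate viewpoint most reduces the average information content of the encoded sequence, and stops when no addition helps. For brevity, h is estimated here from a unigram model over the chosen encoding, whereas IDyOM computes it with its full variable-order model over basic-viewpoint predictions; the example data and viewpoint names are invented. On the repetitive toy melody, the search prefers the interval representation, echoing the absolute-versus-relative pitch example above.

```python
import math
from collections import Counter

def mean_information_content(symbols):
    """Average per-symbol information content (bits) under a unigram model."""
    counts = Counter(symbols)
    total = len(symbols)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def select_viewpoints(candidate_viewpoints, sequence):
    """Greedy (hill-climbing) selection of a viewpoint set that lowers h."""
    selected, best_h = [], float("inf")
    while True:
        best_addition = None
        for name, viewpoint in candidate_viewpoints.items():
            if any(name == n for n, _ in selected):
                continue
            trial = selected + [(name, viewpoint)]
            # Encode the sequence with the trial viewpoint set (tuples of features)
            encoded = list(zip(*[vp(sequence) for _, vp in trial]))
            h = mean_information_content(encoded)
            if h < best_h:
                best_h, best_addition = h, trial
        if best_addition is None:
            return [n for n, _ in selected], best_h
        selected = best_addition

# Example with two toy viewpoints over a repetitive pitch sequence.
seq = [60, 62, 64, 60, 62, 64, 60, 62, 64]
candidates = {
    "absolute_pitch": lambda s: list(s),
    "pitch_interval": lambda s: [None] + [b - a for a, b in zip(s, s[1:])],
}
names, h = select_viewpoints(candidates, seq)
print(names, round(h, 3))   # picks the interval viewpoint, the more compact encoding
```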


Table 1. Results from the IDyOM prediction experiments published by Pearce (2005) and Pearce and Wiggins (2006). The r2 statistic estimates what fraction of the variance in the human participants’ responses is accounted for by the model. IDyOM significantly outperforms its nearest competitor on this task (p < .01).

Data from                  Schellenberg’s (1996) model (r2)    IDyOM (r2)
Manzara et al. (1992)      .13                                 .63
Cuddy and Lunny (1995)     .68                                 .72
Schellenberg (1996)        .75                                 .83

3.4. Pitch Prediction

The primary purpose of the core IDyOM model was to simulate human melodic pitch prediction (Pearce, 2005; Pearce and Wiggins, 2006), and the data to which it was exposed led it to work in the context of Western tonal melody. The premise, derived from music theory (e.g., Meyer, 1956), is that expectation, anticipation, and subsequent fulfilment or denial thereof, is a major factor in the experience of music. This was tested by exposing the long term model to a corpus of 903 mid-European folk melodies, which capture the nature of tonal melody rather well from the musicological standpoint, and then comparing the predictions made (which are output in terms of distributions over the symbols in the alphabet of each viewpoint) during simulated listening by the BOTH+ configuration with those made by human participants in published studies (Cuddy and Lunny, 1995; Manzara et al., 1992; Schellenberg, 1996). The results are reproduced in Table 1, and a comparison with the model of Schellenberg (1996) is given for reference. The model generates the most accurate predictions of pitch expectation in the literature to date, but it is not possible to argue precisely why, other than that it is considerably more complex, as explained above, using both back-off and smoothing, and having both short- and long-term elements, which are missing from the earlier models.

3.5. Expectation Prediction and Neural Correlates

In more recent work on IDyOM, Pearce et al. (2010a) confirmed the relationship between the information content of its prediction outputs and the expectedness of melodic notes in context reported by human listeners: the predictions of the model correlate very well with human responses concerning perceived expectation (r2 = .78, p < .01). Further, they correlate quite well with the time taken in giving those responses (r2 = .56, p < .01). The correlation between expectation and response time supports


my earlier suggestion that unexpectedness may modulate (non-conscious) attention, or processing power: more processing power is recruited for unexpected events. Pearce et al. (2010a) also report on an EEG study of neural correlates of the expectation signals generated by the model. Neural responses to expected and unexpected notes were examined and corresponding reliable differences in signal and location were found; further results include analysis of phase synchrony in the expected and unexpected cases. A particularly interesting aspect is the identification of bursts of beta oscillatory activity during processing of musical expectations: most evidence for similar effects is associated with motor tasks, and it is suggested that this may be evidence for an implicit link between perception and action in musical behaviour.

3.6. Phrase Segmentation and Music Analysis

Following music-theoretical reasoning by Meyer (1956) and Narmour (1990) and empirical modelling work by Ferrand et al. (2002, 2003a, 2003b) concerning regular relationships between the strength of musical expectations and phrase boundaries, Pearce et al. (2010b) investigated the relationship between musical segmentation (chunking) and the outputs of the model. The prediction, broadly, is that decreases in strength of expectation, followed by a sharp increase, mark the ends of phrases, the beginning of the new phrase being the note that causes the increase. This was implemented by identifying peaks in IDyOM’s information content signal whose size was above a threshold determined by local context. IDyOM performed reasonably well in this task, predicting phrase boundaries in a corpus of 1,705 Germanic folk songs, with an average of about 46 events per melody, with precision/accuracy of .76 and recall/completeness of .50, yielding an F1 score of .58. It is hard to make a true comparison with others, since IDyOM is the only model to perform melodic segmentation by unsupervised learning: all the others in the literature use human-programmed rules. However, there is some evidence that IDyOM discovers information that is not present in the rule-based systems: an objectively optimised hybrid model composed of IDyOM and three rule-based systems (GPR2a, Lerdahl and Jackendoff, 1983; LBDM, Cambouropoulos, 2001; Grouper, Temperley, 2001) retained IDyOM under stepwise regression, thus demonstrating that IDyOM contributed information that the other models did not.
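A schematic version of the boundary-picking idea, together with the evaluation metrics quoted above, might look as follows; the threshold rule and the information-content profile are invented for illustration and are much cruder than the context-dependent threshold used by Pearce et al. (2010b).

```python
def boundaries_from_information_content(ic, k=1.2, window=3):
    """Mark a boundary where information content rises sharply above recent context.

    Toy peak-picking: note i starts a new phrase if its IC exceeds the mean of the
    previous `window` values by factor `k`. Both parameters are arbitrary here.
    """
    boundaries = []
    for i in range(window, len(ic)):
        recent = ic[i - window:i]
        if ic[i] > k * (sum(recent) / window):
            boundaries.append(i)
    return boundaries

def precision_recall_f1(predicted, actual):
    """Standard boundary-evaluation metrics over sets of boundary positions."""
    predicted, actual = set(predicted), set(actual)
    tp = len(predicted & actual)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(actual) if actual else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return precision, recall, f1

# Example with a made-up information-content profile (bits per note).
ic_profile = [2.1, 1.8, 1.5, 1.2, 3.9, 2.0, 1.6, 1.3, 1.1, 4.2, 2.2]
predicted = boundaries_from_information_content(ic_profile)
print(predicted)                               # peaks at indices 4 and 9
print(precision_recall_f1(predicted, [4, 9]))  # (1.0, 1.0, 1.0)
```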


In this part of the work, we have arrived at a conclusion strongly related to the work of, for example, Goldwater et al. (2006) and Christiansen et al. (1998), who report on comparable means of modelling segmentation in language. This is a particularly interesting meeting of ideas, because our approach was motivated entirely by music theory (Narmour, 1990), while the linguistic studies are motivated by the statistical properties of language corpora. Indeed, once one understands how the language segmentation approaches based on Markov models (appealing to the probabilities of expectations, mediated via Shannon information theory or otherwise) function, it is rather easy to see that, of course, they should work, simply because the occurrence of patterns in the data clearly and self-evidently defines the segmentation. The musical result is rather more surprising, and certainly relies much more heavily on context, both in the sense of the model’s context (the preceding sequence) and in terms of the musical context that is surrounding the perceived pitches. Most interesting of all is that the effect required should have been isolated and explained, really quite precisely, by a music theorist, but quite independently of the linguistic effect, using exactly the same mathematics. I believe this to be an important piece of supporting evidence for the fundamentally statistical nature of these aspects of human cognitive processing.

3.7. Shortcomings of the Model

This model is the first stage of an extended research programme of cognitive modelling. In this context, it is important that we note its shortcomings as well as its successes and potentials. I do so at this point to make a clear distinction between the issues that are outstanding for IDyOM as a whole, and those which are specific to the language-related work presented in the next section. First, the model is currently limited to monodic melodic music, which is only one aspect of the massively multidimensional range of music available; while our focus on melody is perceptually, musicologically and methodologically defensible, the other aspects need to be considered in due course. Elsewhere, we are studying the modelling of musical harmony (Whorley et al., 2008, 2010), following on from the early efforts of Ponsford et al. (1999). The aim of the current work is to extend IDyOM’s coverage beyond music. Second, and more fundamentally, the memory model used here is inadequate: the model exhibits total recall and its memory never fails. There is work to do on the statistical memory mechanism (currently based on exact literal recording and matching by identity) to model human associative memory more closely. Perhaps a neural network model would come closer, and I return to this possibility below.


Third, as explained above, the viewpoints used in the system are chosen from music theory and must be implemented by hand. This is useful for the purposes of research, because we are able to interpret, to some extent, what the model is doing by looking at the viewpoints it selects. For example, the viewpoint scale degree ⊗ pitch interval encodes aspects of tonal listening (Lerdahl, 2001), and this viewpoint consistently emerges from the compression of tonal music databases. However, a purer system would be capable of constructing its own viewpoints (based on established perceptual principles) and choosing new ones which lead to more compact models. This could be posited as a model of perceptual learning, in which new quality dimensions (Gärdenfors, 2000) are created in the perceiver's representation as they are required. This would greatly increase the power of the system, because it would be able to determine its own representation, by reflection.
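To make the idea of choosing viewpoints by compression concrete, the following minimal sketch in Python picks, from a set of candidate viewpoints, the one whose model of a corpus has the lowest average information content. The viewpoint functions, the add-one-smoothed bigram model and the toy melodies are illustrative assumptions of mine; they are not IDyOM's actual PPM-based machinery, nor its cross-validated selection procedure.

    import math
    from collections import defaultdict

    # A "viewpoint" is taken here to be any function from an event sequence to a
    # symbol sequence; these two are illustrative stand-ins only.
    def vp_pitch(seq):
        return [e['pitch'] for e in seq]

    def vp_interval(seq):
        return [b['pitch'] - a['pitch'] for a, b in zip(seq, seq[1:])]

    def avg_information_content(corpus, viewpoint):
        """Mean -log2 P(x_i | x_i-1), in bits, under an add-one-smoothed bigram
        model of the viewpoint-transformed corpus (a training-set estimate only;
        a fuller treatment would score held-out data)."""
        counts = defaultdict(lambda: defaultdict(int))
        alphabet = set()
        transformed = [viewpoint(seq) for seq in corpus]
        for symbols in transformed:
            alphabet.update(symbols)
            for prev, cur in zip(symbols, symbols[1:]):
                counts[prev][cur] += 1
        bits, n = 0.0, 0
        for symbols in transformed:
            for prev, cur in zip(symbols, symbols[1:]):
                row = counts[prev]
                p = (row[cur] + 1) / (sum(row.values()) + len(alphabet))
                bits -= math.log2(p)
                n += 1
        return bits / max(n, 1)

    def select_viewpoint(corpus, candidates):
        """Keep the candidate viewpoint that yields the most compact model."""
        return min(candidates,
                   key=lambda name: avg_information_content(corpus, candidates[name]))

    # Two toy melodies that are transpositions of one another: the interval
    # viewpoint compresses them better than raw pitch does, so it is selected.
    melodies = [[{'pitch': p} for p in (60, 62, 64, 62, 60)],
                [{'pitch': p} for p in (67, 69, 71, 69, 67)]]
    best = select_viewpoint(melodies, {'pitch': vp_pitch, 'interval': vp_interval})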

4. Associative Models of Learning

At the level of description given above, it is not hard to imagine a similarity between the IDyOM model and neural network models such as SRNs, and I have referred above to related work using SRNs. There are, however, significant differences, and it is worth briefly considering them. Also, SRNs have been successfully compared with (very simple) bigram models in the literature (Mozer, 1994), and it is therefore necessary to motivate a renewed interest in the Markovian approach.

IDyOM can be viewed as a model of a particular kind of associative learning, in two ways. First, and most obvious, is the diachronic association implicit in sequence learning, to which I referred above; of course, it is not the case that this diachronicity is implicit in the Markovian approach, but it is explicit in the IDyOM data representation presented here10. Second is the synchronic association implicit in the multidimensionality of the representation: the interplay between a linked viewpoint and its component viewpoints, under the empirically validated viewpoint weighting strategy (Pearce et al., 2005), encodes associations between particular elements of viewpoint alphabets in terms of increased contribution to the overall model. Threaded viewpoints allow something similar for

10. In other representations, such as that required to model harmony, current thinking is that the tight coupling between temporal sequence and Markov sequence may need to be relaxed (Whorley et al., 2010).


association with perceptual effects which take place at lower levels of abstraction than the main representation (e.g., musical metre). This is similar to the SRN model, where node activations, or activation patterns, may be linked together in comparable ways. However, the network differs strongly in being a homogeneous, unstructured entity. While this is an advantage in some applications, and allows in principle for more subtle interactions than IDyOM, it does not generally help in understanding what the network is actually doing. Methodologically speaking, therefore, IDyOM is a more transparent research tool, because its various outputs can be interpreted directly, and I take advantage of this here. To apply leverage to this particular advantage, a Bayesian version of the system is currently under development, which will allow more detailed analysis in future. In this context, the difference in functional construction that models the distinction between synchronic and diachronic association is paramount: since music is so heavily time-dependent when expressed, the mystery of how an entire song or symphonic movement can be internally conceptualised instantaneously cries out for investigation, and maintaining the distinction will help to do so, even should it turn out to be unnecessary in the final analysis.

Another significant difference is the learning mechanism. In IDyOM, there is no account of the actual process of deriving statistics: symbols are merely counted, albeit efficiently; a less abstract model will need to account for this in due course, and eventually the modelling will reach a level that is neurally motivated. The only optimisation mechanism currently present in IDyOM is the viewpoint selection step described above. Since this requires complete exposure, and since the dimensions selected vary as the model is exposed to more data, one might look for a relationship with sleep-dependent memory consolidation here (Stickgold, 2005).

The cross-domain application of IDyOM, as begun in the next section, is a means to identify which of these features are specific to music and which are general to both music and language. The synchronic associations represented in IDyOM's linked viewpoints will serve later to admit richer representations of linguistic data; an exciting question is the extent to which the viewpoint weighting strategy, which is in principle domain-independent, proves to be so in fact. Before we can begin to answer this question, however, an auditory-level account of syllable construction is needed, to feed into morphology and word learning, and that is the focus here.
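To illustrate how predictions from several viewpoints can be combined so that each contributes in proportion to how much it has to say, here is a deliberately simplified sketch. The arithmetic mixture with inverse-normalised-entropy weights, and the toy distributions, are my own stand-ins for exposition; they are not a reimplementation of the empirically validated weighting scheme of Pearce et al. (2005) that IDyOM uses.

    import math

    def entropy(dist):
        """Shannon entropy, in bits, of a {symbol: probability} distribution."""
        return -sum(p * math.log2(p) for p in dist.values() if p > 0)

    def combine(dists):
        """Entropy-weighted mixture of predictive distributions over one alphabet:
        each source is weighted by the inverse of its normalised entropy, so a
        confident (low-entropy) source dominates an uncertain one. A simplified
        stand-in for IDyOM's viewpoint-combination step."""
        alphabet = set().union(*(d.keys() for d in dists))
        h_max = math.log2(len(alphabet)) if len(alphabet) > 1 else 1.0
        weights = [1.0 / max(entropy(d) / h_max, 1e-6) for d in dists]
        mixed = {s: sum(w * d.get(s, 0.0) for w, d in zip(weights, dists))
                 for s in alphabet}
        total = sum(mixed.values())
        return {s: v / total for s, v in mixed.items()}

    # Toy case: a sharp phoneme model and a flatter stress-conditioned model.
    phoneme_view = {'t': 0.70, 'd': 0.20, 'n': 0.10}
    stress_view = {'t': 0.40, 'd': 0.35, 'n': 0.25}
    prediction = combine([phoneme_view, stress_view])  # weighted toward the sharper source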


5. Exploratory Study

5.1. Introduction

The study presented here is exploratory in nature, being intended as a proof of concept rather than a rigorous modelling attempt. From work such as that of Goldwater et al. (2006), it is clear that IDyOM should be able to model the activity in question, the segmentation of phoneme sequences into syllables, by applying the same information-theoretic principles that apply in melody segmentation. In this study, I use an idealised dataset and only two viewpoint dimensions, phoneme and stress; phoneme is the predicted viewpoint. Each distinct configuration of the model (STM, LTM, LTM+, BOTH, BOTH+) is exposed to all three different ways this data can be represented in IDyOM ({phoneme}, {phoneme ⊗ stress} and {phoneme, phoneme ⊗ stress}; stress alone is not applicable, because it cannot predict phoneme). The first of these representations constitutes a one-dimensional variable-order Markov system capable only of learning over one-dimensional data; the second is equivalent to a system of one-dimensional learning in which compound symbols are formed from two-dimensional data, making a larger (cross-product) alphabet for the learning model; and the third is two-dimensional, with its predictions calculated by entropic weighting as described above. In each modelling attempt, there is a parameter which is optimised according to Cohen's κ statistic, to increase the match between the syllable-segmentation "ground truth" supplied with the data and the predictions made by a simplified version of the peak-picking approach described above.

5.2. Data

The corpus consists of 2,342 sequences (utterances), constructed from an alphabet of 40 phonemes. Phonemes are represented in an IPA-like ASCII code, the details of which are irrelevant here. The corpus is supplied as meta-data with the TIMIT audio corpus (available from the Linguistic Data Consortium at www.ldc.upenn.edu). There was a mean of 34.81 phonemes per utterance, with 81,532 phonemes in all. The TIMIT corpus includes data from 8 different North American dialects: for the purposes of this preliminary study, which is aimed primarily at proof of concept, I used the dictionary comparators given by the compilers of TIMIT, rather than the actual dialect versions, because this makes the corpus more regular and therefore more learnable. Similarly, the stress dimension was taken


from annotations by the compilers of TIMIT, in three different levels: primary stress, secondary stress and no stress. Finally, the TIMIT meta-data includes markers for syllable, word and phrase boundaries, and these were used as a comparator to test the success of the model; regrettably, morpheme boundary information was not available. Example utterances are: "She had your dark suit in greasy wash water all year." and "Eating spinach nightly increases strength miraculously."

5.3. Method

The data were presented to the IDyOM system in a total of 15 conditions, as itemised in Table 2. In each condition, the resulting model was used to predict the information content of each phoneme in the corpus, in the context of its utterance prefix. The resulting signal was differentiated, and values larger than a parameter p were taken as boundaries. p was varied to give the largest possible κ for each model. Recall, again, that the segmentation information was not given to the system, and neither were negative examples, nor feedback on correctness. κ (Cohen, 1960) is calculated thus:

    κ = (P(a) - P(e)) / (1 - P(e))

where P(a) is the actual agreement, calculated from the known data, and P(e) is the agreement that would be expected by chance. Landis and Koch (1977) give a qualitative scale of notional significance to κ: 0.0 and below means no agreement; up to 0.2 is slight agreement; 0.2 to 0.4 is fair; 0.4 to 0.6 is moderate; 0.6 to 0.8 is substantial; and above 0.8 is almost complete agreement, with 1.0 the maximum. I use this scale as a rule of thumb, noting that it is not accepted by all statisticians. The statistic tends to the conservative, and so, apart from a certain pathological case, is safe to use. That case arises in circumstances where relatively very few of the datapoints are in one category: agreements with the relatively very many in the other category swamp the statistic, and it simply increases, topping out when all predictions are in the larger category. κ is somewhat similar to the more familiar F-score (F1), but it corrects for the agreement that would be expected by chance between the compared results, and this is a reason for using it here. The F1 values, and the precision/accuracy (P) and recall/completeness (R) values from which F1 is calculated, are given below for comparison.

Induction of statistics leads to an important methodological point. In this kind of modelling work, one can use statistics in ways which are


different from and complementary to those familiar in the analysis of conventional empirical results. In particular, as mentioned above, various statistical analyses can apply leverage to particular dimensions of the model, and identify which of a range of possible parameterisations gives the best fit to the empirical data, thus identifying which is the most effective model. Of course, when one does this, one faces all the dangers of over-fitting that arise in supervised machine learning (Cleeremans and Dienes, 2008; Wiggins, 2011) and in model selection (Honing, 2006), and so one must guard against them. In the current technique, it is not immediately clear what "over-fitting" would mean, since the only generalisation steps are inherent in the smoothing strategy, and are therefore separate from any optimisation carried out; what is more, since there is no objective function expressed in terms of the human behaviour being modelled, any "over-fitting" that might somehow happen would not be relative to the desired results.

5.4. Questions

The study is intended to explore various questions about the application of this model of learning to this aspect of language. The many other questions begged by this initial study are reserved for future work.

1. Does it work at all, and, if so, how well? To claim any kind of success, I would need κ > 0. To claim that it works well (given the mitigation of a relatively small learning corpus and an impoverished representation) I would need κ > 0.4. It would be surprising and disappointing indeed if the model were unable to perform this simple task at all, since simple bigram models have proven successful elsewhere.

2. Which configuration of IDyOM works best? We might predict a divergence here from the musical case. IDyOM's STM finds local structure in the current stimulus; this is evidently less important in language (other than poetry) than in music, so we might predict that an LTM-only configuration would be best.

3. Does IDyOM's multidimensionality make any difference? To claim that this is the case, the multidimensional model would need to predict more accurately than the others.

4. What difference does the representation make to the average information content of the model? The model with the lowest average information content (h) is the most efficient in terms of storage: if the compression hypothesis, above, is correct, we would expect the model with the lowest h to be the best at segmentation.


5. How does language processing differ from music processing, in terms of the model? Comparing the behaviour of the model in the two domains may suggest approaches to comparative empirical study.

5.5. Results

I now answer the questions posed above. Quantitative results are given in Table 2 and example segmentations in Table 3.

Table 2. Results from the preliminary study: all five configurations of the model were exposed to three different representations of the data. κ, precision/accuracy (P), recall/completeness (R) and the derived F-score (F1) values are shown, optimised for each model (parameter p shown, in bits), in comparison with TIMIT's own syllable annotations. h is the average information content of the learned model in bits. Note that, because the parameter p is chosen with reference to κ, the P, R and F1 values can be misleading: they are included here only for reference.

           {Phoneme}                            {Phoneme ⊗ Stress}
Model      h     p     κ     P    R    F1       h     p     κ     P    R    F1
STM        5.50  3.40  0.10  .72  .20  .20      5.50  2.60  0.11  .61  .16  .25
LTM        3.63  1.00  0.47  .71  .62  .66      3.59  1.00  0.47  .70  .63  .67
LTM+       3.62  1.00  0.47  .71  .63  .66      3.57  1.00  0.47  .71  .63  .67
BOTH       3.75  1.10  0.45  .70  .60  .65      3.71  1.10  0.45  .71  .58  .64
BOTH+      3.74  1.10  0.45  .70  .60  .65      3.70  1.10  0.45  .69  .62  .65

           {Phoneme, Phoneme ⊗ Stress}
Model      h     p     κ     P    R    F1
STM        5.48  2.30  0.11  .58  .18  .27
LTM        3.59  1.07  0.48  .71  .62  .67
LTM+       3.57  1.00  0.47  .71  .63  .66
BOTH       3.71  0.96  0.45  .69  .62  .66
BOTH+      3.70  0.99  0.46  .70  .62  .66

5.5.1. Does it work at all and, if so, how well?

All of the IDyOM configurations except STM model this task with κ > 0.4; Landis and Koch (1977) characterise κ ∈ [0.4, 0.6) as "moderate", so the


most successful model (κ = 0.48) achieves more than minimally "moderate", but less than "substantial", agreement. As noted above, κ is a very conservative statistic, especially over data whose categories are greatly different in size (as here: there are many more phoneme boundaries than syllable boundaries); to put this in context, the LTM configuration with two viewpoints has κ = 0.48, and predicts 75.6% of potential boundaries correctly. Ultimately, as I said above, it is no surprise that a model of this nature can perform this task to some degree (see, e.g., Lanchantin et al., 2008); the point here is to verify that IDyOM can indeed do what we would expect. Another important point to note is that we might expect IDyOM to do rather better at modelling morphemes than syllables; and indeed, inspection of the example segmentations in Table 3 and of the wider corpus supports this: false positives in the syllabisation task are often (though not always) at morpheme boundaries.

5.5.2. Which configuration of IDyOM works best?

One of the ways in which everyday language (as opposed to metrical and/or rhyming poetry) differs from music is that the internal structure of an utterance is relatively unimportant, as compared with the structure of the language being used. That is to say, for example, a word is a noun because of its semantic connotation, and not because the sounds in it bear any relation to the sounds in the verb that follows it. This, of course, is not the case in general, because some languages exhibit vowel harmony and/or mutation, and here is the connection with music: it is the relationships between pitch structures that make up the essence of Western tonal melody, and there is absolutely no external reference or definition grounding that relationship other than musical practice. The relationships realised in vowel harmony and mutation are localised, however, and do not show the sustained structural organisation of musical structure; only poetry, in language, offers this degree of organisation at the surface level. This leads me to the unastonishing prediction that the STM alone would work very poorly for syllabisation, because it will not learn from the corpus, but only from the prefix of each utterance as it proceeds. That prediction is confirmed in Table 2, and I leave the STM configuration here. The converse of this argument would lead me to suggest that the LTM and LTM+ configurations would be better than BOTH and BOTH+, and this is borne out by the results, though the difference in κ is not large. Since the LTM information is explicitly present in the BOTH configuration (and


LTM+ in BOTH+, respectively), there is an implication that the addition of the STM actually confounds the model, which is a stronger point than mere reduced success. The difference between the LTM and LTM+ configurations, and, respectively, BOTH and BOTH+, is less clear and will require further analysis with a larger corpus. Here, and in the musical case, the difference is very small and cannot be taken as a strong indication. This will require further work.

5.5.3. Does IDyOM's multidimensionality make any difference?

We would expect that the addition of new information in the data (in this case, stress information) would improve prediction, and there is some very slight evidence that this might be so, in the multidimensional LTM model. An increase of 0.01 in κ seems very small, but is more encouraging when expressed in absolute terms of correct predictions: the absolute increase in correct results over the single-viewpoint LTM model is 246, while the increase over the single-linked-viewpoint LTM model is 240. What is happening here is that, because of the entropic weighting between the viewpoints, the stress-linked viewpoint interjects where it has information to offer; otherwise it contributes relatively little to the predicted distribution. This effect will be particularly marked where the phoneme-only viewpoint carries relatively little information. Thus, we see a dialogue between the different features of the data, and further research is necessary to explore this issue, including the possibility of adding further dimensions, such as intonation, to assist learning.

5.5.4. What difference does the representation make to the average information content of the model?

However, the representation does make a difference to the value of h in the models, and the implication is that the extra information, while not noticeably improving performance, does improve the compressibility of the data. Thus, the most compressed models correspond clearly with the most successful models at prediction (the LTM and LTM+ configurations including stress), as was predicted. This accords with our hypothesis that efficient compression, rather than domain-specific reasoning, is a good heuristic for model optimisation.
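To give these h values a concrete interpretation (a back-of-the-envelope illustration only, using the corpus statistics reported in Section 5.2): a memoryless code over the 40-phoneme alphabet needs log2 40 ≈ 5.32 bits per phoneme, so the 81,532-phoneme corpus would occupy roughly 81,532 × 5.32 ≈ 434,000 bits, or about 53 kB. An LTM-type model with h ≈ 3.6 bits per phoneme would in principle encode the same corpus in roughly 81,532 × 3.6 ≈ 293,000 bits, about 36 kB, whereas the STM configuration (h ≈ 5.5) would actually cost slightly more than the uniform code. The ranking of these notional storage costs mirrors the ranking of segmentation performance in Table 2.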


According to the κ heuristic, the multidimensional LTM configuration is most successful, but by such a small margin that strong conclusions cannot be drawn, notwithstanding the absolute figures above. Again, further research is required to ascertain whether alternative representations, including further dimensions, will produce better models.

5.5.5. How does language processing differ from music processing, in terms of the model?

The results discussed above and summarised in Table 2 suggest that the model does perform the required linguistic task, and the configuration result suggests that it does so in a predictable way. Examples of the outputs can be seen in Table 3: it will be seen that, where syllable boundaries are missed, word boundaries are usually correct, and also that false-positive syllable boundaries are mostly on morpheme boundaries, as predicted earlier – this is borne out in inspection of the wider corpus. It seems clear that the stress viewpoint is substantially less important than the phoneme viewpoint, and this is not surprising, though one might have expected it to contribute more here. It is difficult to give defensible explanations in this very exploratory context: further experiments with larger corpora may well shed light on this question. In any case, the addition of the stress information was able to correct 1.2% of the errors made by the simpler system, and this demonstrates the potential utility of the multidimensional approach, which was the primary focus of this work. Of course, this specific result regarding stress cannot be assumed to generalise across languages, since some languages exhibit structural stress patterns which are not uniformly related to the morphology of the language or the semantics expressed11; however, the point is that the multi-dimensional mechanism required for music can be used productively in language, and this begs further research to understand whether it is thus used.

11. For example, in Welsh, stress is almost always on the second-last syllable, so the addition of a plural ending can change the stress of the stem: dang-OS-iad [daÐsiad], exposition; dan-gos-IAD-au [daÐsiadai], expositions (the ending -iad is effectively a single syllable here).


Table 3. Random example segmentations produced by the best-performing system. | denotes correctly predicted boundaries, : denotes false positives and . denotes false negatives according to the TIMIT syllable annotations; note the occurrence of false positives before plural endings, supporting my suggestion that morpheme, as one might expect, is the unit of learning here. The phoneme representation is borrowed directly from the TIMIT meta-data (qv: www.ldc.upenn.edu).

She had your dark suit in greasy wash water all year | sh iy | hh ae D | y uh r | D : aa r K | s : uw T . ih n | G r iy | s : iy . w ao sh | w ao . T er | ao l | y iy r |

Production may fall far below expectations | P r ax | D ah . K sh ih n | m ey | f ao l | f aa r | B ax . l : ow | eh K . s : P eh K . T : ey . sh ih n : z |

Don’t do Charlie’s dirty dishes | D ow n T | D uw | CH aa r | l iy : z . D er T iy | D ih | sh ih : z |

In wage negotiations the industry bargains as a unit with a single union | ih n | w ey : JH . n ax . G ow | sh iy | ey . sh ih n : z | dh iy . ih n | D ax . s T r iy | B aa r | G ax n : z . ae z . ey . y uw | n ih T | w ih : dh . ax | s ih ng . G ax l | y uw | n y ax n |

Jane may earn more money by working hard | JH ey n | m ey | er n | m ow : r | m ah . n iy | B ay . w er . K ih ng | hh aa r D |

A big goat idly ambled through the farmyard | ey . B ih G | G ow T | ay D . l iy | ae : m . B : ax l : D | th r uw | dh ax | f aa r m | y aa r D |
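For readers who wish to reproduce the style of evaluation described in Section 5.3, the following minimal sketch (with hypothetical variable names and toy data; it is not the code actually used for the study) derives boundary predictions by differentiating an information-content signal and thresholding it at p, and then scores them against reference boundaries with Cohen's κ. In the study proper, p is swept to maximise κ for each model.

    def predict_boundaries(ic, p):
        """A position i is predicted to start a new segment when the information
        content rises from symbol i-1 to symbol i by more than p bits."""
        return [i for i in range(1, len(ic)) if ic[i] - ic[i - 1] > p]

    def cohens_kappa(predicted, reference, n_positions):
        """kappa = (P(a) - P(e)) / (1 - P(e)) for the binary decision
        'boundary here or not' at each candidate position."""
        pred, ref = set(predicted), set(reference)
        p_a = sum((i in pred) == (i in ref) for i in range(n_positions)) / n_positions
        p_pred, p_ref = len(pred) / n_positions, len(ref) / n_positions
        p_e = p_pred * p_ref + (1 - p_pred) * (1 - p_ref)
        return (p_a - p_e) / (1 - p_e) if p_e < 1.0 else 0.0

    # Toy utterance: one sharp rise in information content, one true boundary.
    ic = [2.1, 1.8, 1.5, 4.9, 2.0, 1.7]
    predicted = predict_boundaries(ic, p=1.0)                            # -> [3]
    kappa = cohens_kappa(predicted, reference=[3], n_positions=len(ic))  # -> 1.0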

6. Summary and Conclusion

In this chapter, I have presented the philosophical and methodological background to a planned extended programme of cognitive modelling research based around a computational simulation approach that is fundamentally statistical. The model is strongly bottom-up and, in the musical context, has previously been used to show that at least some assumptions of top-down rule application are unnecessary. The model is viewed as a hypothetical veridical simulation of processing at a particular and precise level of abstraction: above the level of the note percept, and also of auditory streaming, but below the level of diachronic grouping. The question asked here, preceding a full investigation of the model's utility in language processing, is: do the extra mechanisms introduced to enable the modelling of music enhance the model's ability to model language, at an equivalently low level? The answer seems to be positive: the addition of multidimensionality seems to improve the performance of the model, though not very greatly. Better and richer data are likely to increase this difference. On the other hand, the aspects of the model that capture


the structure of a "current experience" of music are detrimental to language performance, and this is not surprising, since (put simply) most language does not rhyme. This issue points the way to interesting future research involving poetry.

One must, of course, be cautious in drawing conclusions from work as preliminary as this. Further, conclusions drawn from computational models can only be stated confidently after being subjected to rigorous empirical test. However, the evidence presented here does lead me tentatively to suggest that a common mechanism may be implicated in processing both music and language. This hypothesis sits neatly alongside the suggestion of Patel (2008) that there is a limited shared processing resource for music and language: a shared mechanism makes a shared resource limit easier to explain. However, there is no reason to suppose that a common mechanism should not be implemented in different places and in different ways, so it is too early to make strong claims on this front. The approach taken here and elsewhere, including in particular some recent evidence that the musical predictions of the model correspond with particular neural activity (Pearce et al., 2010a), points to some clear avenues for research on equivalents in language, which may shed further light on this issue.

The current work is a small piece of evidence contributing towards a picture of these aspects of cognition as a process of information management, where anticipation, experienced as expectation, is fundamental in focusing processing resource on information that is most cognitively relevant – which is to say, least expected, because less familiar information requires more processing. The connection with anticipation suggests a link to the most basic of experiences, that of simple, immediate causality in the world, which in turn provides the root of an evolutionary argument as to why such a mechanism might arise. Evidence that two apparently different cognitive phenomena such as music and language may share a basis in that mechanism adds weight to the argument, because deployment of a mechanism in multiple areas both simplifies the hypothetical system, thus making its evolution more likely, and increases the evolutionary advantage the mechanism conveys12. The fact that perceptual chunking seems to arise directly from this basic information management process is yet further evidence for a tight theory: expectation determines chunk boundaries, which in turn allow both compression of information and expectation

12. This said, it is important to make a distinction between biological and social evolution here: it seems likely that any biological selection pressure brought about by either language or music is strongly mediated by social selection.


generation to proceed more efficiently. Thus, the theory feeds back into higher cognitive levels, reducing the number of mechanisms to be explained in a larger theoretical framework. The ability of the IDyOM model to capture synchronic associations (in other words, conjunctions) offers further potential for modelling non-conscious expectation in rich representational environments, further testing the model, but also opening up new, uncharted opportunities, such as modelling the perception of sung lyrics; several studies suggest that words and music are integrated in this way (e.g., Peretz et al., 2004). The way is now clear to carry this work forward to larger corpora, and to other linguistic phenomena, including poetic forms. In due course, I aim to propose more detailed mechanisms that account for both musical and linguistic learning, and for the combination of the two in sung music. This, in turn, is likely to lead to new, richer theories of meaning, as the interaction between intra-musical association13 and the semantics of lyrics is explored.

13. To say that music has semantics is questionable (Wiggins, 1998, 2009), but there is clearly some quasi-semantic, semiotic association mechanism involved in the comprehension of musical structure (Deliège, 1987; Nattiez, 1975, 1990).

Acknowledgements

I gratefully acknowledge UK Engineering and Physical Sciences Research Council grant EP/H01294X, which supports the work presented here. I am also grateful for the support and feedback of my esteemed colleague Dr. Marcus Pearce, on whose model this work is based. The constructive and careful commentary of Amy Perfors and a second, anonymous, reviewer greatly enhanced the quality of the chapter.

References

Aucouturier, J.-J. 2009. Sounds like teen spirit: Computational insights into the grounding of everyday musical terms. In Minett, J. and Wang, W., editors, Language, Evolution and the Brain, Frontiers in Linguistics. Academia Sinica Press, Taipei.
Aucouturier, J.-J. and Pachet, F. 2004. Improving timbre similarity: How high's the sky? Journal of Negative Results in Speech and Audio Sciences, 1(1).
Bregman, A. S. 1990. Auditory Scene Analysis. The MIT Press, Cambridge, MA.


Cairns, P., Shillcock, R., Chater, N., and Levy, J. 1997. Bootstrapping word boundaries: A bottom-up corpus-based approach to speech segmentation. Cognitive Psychology, 33: 111–153.
Cambouropoulos, E. 2001. The Local Boundary Detection Model (LBDM) and its application in the study of expressive timing. In Proceedings of the International Computer Music Conference (ICMC'2001), Havana, Cuba.
Chater, N. and Christiansen, M. H. 2010. Language acquisition meets language evolution. Cognitive Science, 34(7): 1131–1157.
Chomsky, N. 1965. Aspects of the Theory of Syntax. MIT Press, Cambridge, MA.
Christiansen, M. H., Allen, J., and Seidenberg, M. S. 1998. Learning to segment speech using multiple cues: A connectionist model. Language and Cognitive Processes, 13: 221–268.
Cleary, J. G. and Teahan, W. J. 1997. Unbounded length contexts for PPM. The Computer Journal, 40(2/3): 67–75.
Cleary, J. G. and Witten, I. H. 1984. Data compression using adaptive coding and partial string matching. IEEE Transactions on Communications, 32(4): 396–402.
Cleeremans, A. and Dienes, Z. 2008. Computational models of implicit learning. In Sun, R., editor, Cambridge Handbook of Computational Psychology, pages 396–421. Cambridge University Press.
Cohen, J. 1960. A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1): 37–46.
Conklin, D. 1990. Prediction and entropy of music. Master's thesis, Department of Computer Science, University of Calgary, Canada.
Conklin, D. and Witten, I. H. 1995. Multiple viewpoint systems for music prediction. Journal of New Music Research, 24: 51–73.
Cuddy, L. L. and Lunny, C. A. 1995. Expectancies generated by melodic intervals: Perceptual judgements of continuity. Perception and Psychophysics, 57: 451–62.
Deliège, I. 1987. Grouping conditions in listening to music: An approach to Lerdahl and Jackendoff's grouping preference rules. Music Perception, 4: 325–360.
Ferrand, M., Nelson, P., and Wiggins, G. 2003a. Unsupervised learning of melodic segmentation: A memory-based approach. In Kopiez, R., Lehmann, A. C., Wolther, I.,


and Wolf, C., editors, Proceedings of the 5th Triennial ESCOM Conference, pages 141–144, Hanover, Germany. Hanover University of Music and Drama.
Ferrand, M., Nelson, P., and Wiggins, G. A. 2002. A probabilistic model for melody segmentation. In Proceedings of ICMAI'02. Springer-Verlag.
Ferrand, M., Nelson, P., and Wiggins, G. A. 2003b. Memory and melodic density: A model for melody segmentation. In Proceedings of the XIV Colloquium on Musical Informatics (XIV CIM 2003), pages 95–98, Florence, Italy.
Gärdenfors, P. 2000. Conceptual Spaces: The Geometry of Thought. MIT Press, Cambridge, MA.
Goldwater, S., Griffiths, T. L., and Johnson, M. 2006. Contextual dependencies in unsupervised word segmentation. In Calzolari, N., Cardie, C., and Isabelle, P., editors, Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the ACL, pages 673–680, East Stroudsburg, PA, USA. Association for Computational Linguistics.
Honing, H. 2006. Computational modeling of music cognition: A case study on model selection. Music Perception, 23(5): 365–376.
Krumhansl, C. L., Louhivuori, J., Toiviainen, P., Järvinen, T., and Eerola, T. 1999. Melodic expectation in Finnish spiritual hymns: Convergence of statistical, behavioural and computational approaches. Music Perception, 17(2): 151–195.
Lanchantin, P., Morris, A. C., Rodet, X., and Veaux, C. 2008. Automatic phoneme segmentation with relaxed textual constraints. In Proceedings of the Language Resources and Evaluation Conference (LREC).
Landis, J. and Koch, G. G. 1977. The measurement of observer agreement for categorical data. Biometrics, 33: 159–174.
Lerdahl, F. 2001. Tonal Pitch Space. Oxford University Press, Oxford.
Lerdahl, F. and Jackendoff, R. 1983. A Generative Theory of Tonal Music. MIT Press, Cambridge, MA.
Levitin, D. J. 1994. Absolute memory for musical pitch: Evidence from the production of learned melodies. Perception & Psychophysics, 56(6): 414–423.
Manning, C. D. and Schütze, H. 1999. Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA.


Manzara, L. C., Witten, I. H., and James, M. 1992. On the entropy of music: An experiment with Bach chorale melodies. Leonardo, 2: 81–8.
Meyer, L. B. 1956. Emotion and Meaning in Music. University of Chicago Press, Chicago.
Moffat, A. 1990. Implementing the PPM data compression scheme. IEEE Transactions on Communications, 38(11): 1917–1921.
Mozer, M. C. 1994. Neural network music composition by prediction: Exploring the benefits of psychoacoustic constraints and multi-scale processing. Connection Science, 6(2–3): 247–280.
Narmour, E. 1990. The Analysis and Cognition of Basic Melodic Structures: The Implication-Realization Model. The University of Chicago Press, Chicago.
Nattiez, J.-J. 1975. Fondements d'une sémiologie de la musique. Union Générale d'Éditions, Paris.
Nattiez, J.-J. 1990. Music and Discourse: Toward a Semiology of Music. Princeton University Press.
Oram, N. and Cuddy, L. L. 1995. Responsiveness of Western adults to pitch-distributional information in melodic sequences. Psychological Research, 57(2): 103–118.
Patel, A. D. 2008. Music, Language, and the Brain. Oxford University Press, Oxford.
Pearce, M. T. 2005. The Construction and Evaluation of Statistical Models of Melodic Structure in Music Perception and Composition. PhD thesis, Department of Computing, City University, London, UK.
Pearce, M. T., Conklin, D., and Wiggins, G. A. 2005. Methods for combining statistical models of music. In Wiil, U. K., editor, Computer Music Modelling and Retrieval, pages 295–312. Springer Verlag, Heidelberg, Germany.
Pearce, M. T., Herrojo Ruiz, M., Kapasi, S., Wiggins, G. A., and Bhattacharya, J. 2010a. Unsupervised statistical learning underpins computational, behavioural and neural manifestations of musical expectation. NeuroImage, 50(1): 303–314.
Pearce, M. T., Müllensiefen, D., and Wiggins, G. A. 2008a. Melodic segmentation: A new method and a framework for model comparison. In Proceedings of ISMIR 2008.
Pearce, M. T., Müllensiefen, D., Wiggins, G. A., and Frieler, K. 2008b. Perceptual segmentation of melodies: Ambiguity, rules and statistical learning. In Proceedings of the 10th International Conference on Music Perception and Cognition, Sapporo, Japan.


Pearce, M. T., Müllensiefen, D., and Wiggins, G. A. 2010b. The role of expectation and probabilistic learning in auditory boundary perception: A model comparison. Perception, 39: 1367–1391.
Pearce, M. T. and Wiggins, G. A. 2004. Improved methods for statistical modelling of monophonic music. Journal of New Music Research, 33(4): 367–385.
Pearce, M. T. and Wiggins, G. A. 2006. Expectation in melody: The influence of context and learning. Music Perception, 23(5): 377–405.
Pearce, M. T. and Wiggins, G. A. 2007. Evaluating cognitive models of musical composition. In Cardoso, A. and Wiggins, G. A., editors, Proceedings of the 4th International Joint Workshop on Computational Creativity, pages 73–80, London. Goldsmiths, University of London.
Peretz, I., Radeau, M., and Arguin, M. 2004. Two-way interactions between music and language: Evidence from priming recognition of tune and lyrics in familiar songs. Memory and Cognition, 32: 142–152. doi:10.3758/BF03195827.
Ponsford, D., Wiggins, G. A., and Mellish, C. 1999. Statistical learning of harmonic movement. Journal of New Music Research, 28(2): 150–177.
Potter, K., Wiggins, G. A., and Pearce, M. T. 2007. Towards greater objectivity in music theory: Information-dynamic analysis of minimalist music. Musicæ Scientiæ, 11(2): 295–322.
Saffran, J. R., Aslin, R. N., and Newport, E. L. 1996. Statistical learning by 8-month-old infants. Science, 274: 1926–1928.
Saffran, J. R. and Griepentrog, G. J. 2001. Absolute pitch in infant auditory learning: Evidence for developmental reorganization. Developmental Psychology, 37(1): 74–85.
Saffran, J. R., Johnson, E. K., Aslin, R. N., and Newport, E. L. 1999. Statistical learning of tone sequences by human infants and adults. Cognition, 70: 27–52.
Schellenberg, E. G. 1996. Expectancy in melody: Tests of the implication-realisation model. Cognition, 58(1): 75–125.
Schenker, H. 1925. Das Meisterwerk in der Musik, Volume I. Drei Masken Verlag, Munich.
Schoenberg, A. 1974. Letters. Faber, London. Edited by Erwin Stein; translated from the original German by Eithne Wilkins and Ernst Kaiser.
Shannon, C. 1948. A mathematical theory of communication. Bell System Technical Journal, 27: 379–423, 623–656.


Steedman, M. 2000. The Syntactic Process. Language, Speech, and Communication. The MIT Press, Cambridge, MA.
Stickgold, R. 2005. Sleep-dependent memory consolidation. Nature, 437: 1272–1278.
Temperley, D. 2001. The Cognition of Basic Musical Structures. MIT Press, Cambridge, MA.
Whorley, R., Wiggins, G. A., Rhodes, C., and Pearce, M. 2010. Development of techniques for the computational modelling of harmony. In Ventura et al., editors, Proceedings of the First International Conference on Computational Creativity.
Whorley, R. P., Pearce, M. T., and Wiggins, G. A. 2008. Computational modelling of the cognition of harmonic movement. In Proceedings of the 10th International Conference on Music Perception and Cognition, Sapporo, Japan.
Wiggins, G. A. 1998. Music, syntax, and the meaning of "meaning". In Proceedings of the First Symposium on Music and Computers, pages 18–23, Corfu, Greece. Ionian University.
Wiggins, G. A. 2009. Semantic Gap?? Schemantic Schmap!! Methodological considerations in the scientific study of music. In Proceedings of the 11th IEEE International Symposium on Multimedia, pages 477–482.
Wiggins, G. A. 2010. Cue abstraction, paradigmatic analysis and information dynamics: Towards music analysis by cognitive model. Musicae Scientiae, Special Issue: Understanding musical structure and form: papers in honour of Irène Deliège: 307–322.
Wiggins, G. A. 2011. Computer models of (music) cognition. In Rebuschat, P., Rohrmeier, M., Cross, I., and Hawkins, J., editors, Language and Music as Cognitive Systems. Oxford University Press, Oxford. In press.
Wiggins, G. A., Harris, M., and Smaill, A. 1989. Representing music for analysis and composition. In Balaban, M., Ebcioğlu, K., Laske, O., Lischka, C., and Soriso, L., editors, Proceedings of the Second Workshop on AI and Music, pages 63–71, Menlo Park, CA. AAAI.

Index Not all author citations are listed in the index; readers requiring complete lists of authors and works cited should consult the reference lists at the end of each chapter. abstract rules learning 243–245 abstract sensitivity 102 ACCESS 225 accuracy conscious and unconscious knowledge 349 measuring 147 acoustic factors 137, 138 acoustic-phonetic detail 71, 74–75 acoustic variation 68 acoustical cues 119–120 adaptor grammars 388–389 adjacent dependencies 29–30, 32 adults acoustic-phonetic detail 71, 74–75 anchoring hypothesis 189–190 auditory and visual statistical learning 40 computerized training 314–323 correlated cues in word classification 153–155 distributional cues 104 frame based artificial language 157–158 free variation 416–417 Grammarama 205–206 as human simulations 222 non-adjacent dependencies 105–106 nonce words 186–187 nonce words and noise syllables 193–194 phrasal grouping 188 phrase structure 159–161 pitch processing 438 sampling assumptions 396 sensitivity to pitch categories 447 age-related di¤erences 32 hippocampal involvement 36–38

agrammatic aphasia 207 ambiguity 111 analogical models 158 anchoring hypothesis 189–190 animacy 356 animal conditioning 126–127 animal learning 17 animal studies 3 artificial finite-state grammar 19 artificial grammar learning (AGL) 18–20, 29–30, 32, 237, 238, 368–370 global vs. local attention 352–353 implicit learning 340 influence and importance of 59–60 mimicry of natural language 212– 215 artificial language studies 30, 64, 70, 80, 108, 159, 162, 238–239 artificial languages applications 206 complexity 94 ecological validity 93 as mainstream approach 58 problems of 61–62, 92–93 types of infant calculations 69–70 Aslin, R. 1–2, 21, 57–59, 94, 105–106, 178, 186 associational learning 272–274 acoustic factors 137 associative interpretations 124–125 attention based-account 125–128 context and overview 119–121 dynamic view 132–137 earlier processing episodes 133 empirical and computational evidence 134–137 general outline 132–134



Information Dynamics of Music (IDyOM) 478–479 integrating sources of information 130–132 interference-based account 128–130 new principle 132–133 predictions 134–137 sensitivity to statistical regularities 122–130 summary and conclusions 137–139 thesis 121–122 types of statistical regularities 122– 125 associative learning processes 5 atomic auditory percepts 467 attention, as filter 126 attention-based accounts 17 attentional chunks 256 attentional factors 125–128 Aucouturier, J. J. 464 auditory sequential learning 39 auditory statistical learning 39 awareness 7 back-o¤ strategy 469 backward transitional probability (BTP) 95, 99–100, 123–124 Baldwin, D. 91 basal ganglia 36 Bayesian approaches 357 Bayesian inference 414 Bayesian modeling 389 advantages and disadvantages 386 assumptions 400, 401 description 384–385 hierarchical 393 interpretation 385 knowledge acquired 389–390, 391 learners’ assumptions about data 394–397 learner’s data 387–388 learning from types or tokens 388– 389 multiple level learning 392–394 overview 383

predictive capability 400 prior knowledge 397–399 success of statistical learning 399– 402 summary and conclusions 402–403 transitional probabilities (TPs) 387– 388 variations 385–386 word segmentation 390–392 Bayesian techniques 7 behavior as learned and innate 75 temporal organization 327 behavioral studies 73–74, 203–204 behaviorism 171 belongingness 126, 137 Berry, D. C. 340 bias 413–414, 416, 420–421 coordination, and evolution of bias 422–425 shielding of bias strength 421–422 transmission 424 biased learning rules 413–414 bilingualism 190 biological evolution 419–425 Bion, R. 188, 191 Bohlen-Pierce scale 8, 439–441 see also musical knowledge study Bonatti, L. L. 178–179 bootstrapping 4–5, 79–80, 112 correlated cues to category structure 145 word classification 163 boundaries, phenomenological 22–26 Boyd, R. 413–414, 421–423 brain e¤ects of music 438–440 plasticity 305 Braine, M. D. S. 153–155, 158, 162 Braine’s two-step model for word-class generalization 155 Bremner, A. J. 28 Brent, M. R. 76, 120, 133–134, 387 British National Corpus (BNC) 6 Bullemer, P. 20

Index C/V hypothesis 175–177, 180 Canfield, R. L. 27 Cartwright, T. 76 catalogue segmentation study 74 categories, semantic properties of 109 Chalkley, M. A. 146 Chater, N. 206 Chen, W. 355–356 child-directed speech, corpus analyses 146–149 Chomsky, N. 56, 206, 409–410, 465 chord progressions 443–444 Christiansen, M. H. 149, 151 chunk strength 246 chunking 30, 239, 256, 476 Clark, E. V. 272 Clark, H. H. 172–173 classic statistical learning studies 20 classical conditioning 273 Cleary, A. M. 246 Cleary, J. G. 469 Clinical Evaluation of Language Fundamentals, 4th Ed. (CELF-4) 313 Clohessy, A. B. 28 co-evolution 424 co-occurrences 29, 91–93, 101–102, 104, 122–124 of tokens 187 COBUILD 266, 276 COBUILD Verb Patterns 274–275, 277 cochlear implants (CI) 312–314 cognitive and neural mechanisms 206–207 cognitive constraints, transient 34–36 cognitive development, connectionist models 59 cognitivism 171 combinatorial explosion 126 completeness, measuring 147 computational analyses 206 computational modeling 7 types of infant calculations 69–70 use of term 121


computational models as conservative estimates of performance 68 units for tracking 66–69 computerized training adults 314–323 deaf/hard of hearing children 323– 326 discussion 327–329 e¤ects for deaf/hard of hearing children 325–326 e¤ects for healthy adults 319, 321– 323 generalization and transfer 318 training schedule 317 training task 316–317 concatenative morphology 182, 184 conditional probabilities 172–173, 442–443 conditional probability distributions 190 conditioned variability 418, 419 confidence 342–343, 345, 369 confidence ratings 341 Conklin, D. 472 connected speech processes 68 connectionist models 59, 383, 393– 394, 399, 400 connectionist networks 347, 348 conscious knowledge 338–339 and control 342–343 see also conscious-unconscious distinction conscious-unconscious distinction control 342–343 cross cultural di¤erences in unconscious processes 352–353 evaluation of subjective measures 346–352 evidence 348–351 judgment vs. structural knowledge 343–346 learning language-like structures 353–356



measuring awareness of knowing 340–343 overview 337–338 second language vocabulary acquisition 355 structural vs. statistical learning 356–357 summary and conclusions 357 theoretical context 347–348 unconscious knowledge 338–339 see also subjective measures study consonants 175–181 construction frequency 268–269 Construction Grammar 267–268 construction grammar theories 6 construction learning components 6 construction frequency 268–269 determinants 268–274 input frequency 268–272 interactions 272–274 prototypicality of meaning 272 token frequency 270–272 type frequency 270–272 constructions 267 content words, vs. function words 102–103, 188–189 context-dependent sequences 27–29 contextual information 120, 132, 133– 134, 138 contingencies 237 contingency matrix 123 contingent probabilities 17 control 342–343 control ersatz constructions (CECs) 282–284 see also Zipfian distribution study convergence implicit and statistical learning 16 in language learning 265 Conway, C. M. 35, 312 Conwell, E. 152 corpus analyses 70–71, 92, 96, 103, 146, 149, 162, 219 probabilistic phonotactics 208–210

correlated cues 104–105, 152–156, 162 correlated features, sensitivity to 109– 110, 111 Creel, S. C. 451, 452 creolization 416 cross cultural di¤erences, in unconscious processes 352–353 cross-domain modeling see Information Dynamics of Music (IDyOM) cross-linguistic variation 411 cross-referential learning 217–218 cross-situational learning 217–218 cross-situational word learning paradigm 7 cues, cross-language variation 196– 197 cultural di¤usion 417 cultural selection for learnability 413 cultural transmission 411, 412–415 general argument 412–415 learning bias and language universals 412–419 linguistic variation 415–419 vocabulary systems 423–424 Curtin, S. 121 Curtis, C. E. 328 Dahan, D. 120, 133–134 deaf/hard of hearing children computerized training 323–326 visual sequential learning 324 working memory (WM) 324 deafness 312–314 decay 129–130 degradation, of speech signal 306, 309 Delta P 123 derivation 181 D’Esposito, M. 328 determiners 356 developmental e¤ects 151–152 developmental invariance 16 Dienes, Z. 340, 341–343, 345, 347, 349, 350, 368–370

Index dimension-modality pairings, perceptual development 40 diminutive dependency 70–71 direct tests 367–368 directionality 99–100 directly biased transmission model 421 dishabituation 151 distributional cues 103–104, 110–111 word classification 146–149 distributional learning as favored viewpoint 60 overview 55–56 distributional models 55 assumptions 79 dependent measure 77–79 ecological validity 62–64 innate ability 75–77 predictive ability 72–75 rich interpretation 77–79 scalability 61–66 summary and conclusions 79–80 distributional patterns 220–221 distributional regularities 204 diversity, of language 56 domain-general accounts 409, 411 domain-general learning abilities computerized training in healthy adults 314–323 computerized training task 316–317 computerized training with deaf or hard of hearing children 323– 326 discussion 327–329 generalization and transfer 318 language disorders 310–314 overview 305–306 speech perception in noise 309 statistical learning abilities 307–310 statistical learning and language processing 306–310 Stroop Color and Word test 320– 321 training schedule 317 see also statistical learning

499

domain generality 174, 175, 177–178, 424–425 see also musical knowledge domain specificity 175, 412, 419–420, 424–425 domains, adaptation to 348 dual process models 348 dual-view hypothesis 24 Dulany, D. E. 339 dyslexia 311 earlier processing episodes 133 ecological validity artificial languages 63 distributional models 62–64 natural language studies 97 electrophysiological studies 174 Ellis, N. C. 217, 272, 273 Elman, J. L. 35 emotional content, of music 453–454 empiricism 171 Endress, A. 186–187, 246, 247 English/Chinese studies 355 English/Italian studies 95, 355–356 English/Japanese studies 243–245, 247–249, 257, 258 entropy 205, 280 errors, expected 72–74 Evans, N. 412 event related potential (ERP) measures 249 evolutionary perspective 8 biological evolution 419–425 coordination and evolution of bias 422–425 cultural transmission 412–419 linguistic variation 415–419 overview 409–411 statistical learning accounts 420 summary and conclusions 425 see also bias expectation and attention 467–468 of learners 409–410 musical knowledge 452–453



neural correlates 475 sentence processing 220 use of term 26 expected errors 72–74 experience-dependent learning mechanisms 57 explicit memorization and recall 24 exposure times 258 familiarity 78–79, 120, 133 feedback-driven vocal learning 36 Ferreira-Junior, F. 272, 273 figure-ground e¤ect 158 Fischer, F. 37–38 Fitneva, S. A. 151 fixed continuous sequences 26–29 fixed sequential learning 24–25 flexible frames model 148–149 forgetting 129 form-based categories 29, 161 form-function mappings 268 contingency 272–274 form-meaning correspondences 217 form-meaning mappings 217 forward transitional probability (FTP) 95, 99–100, 173 frame-based analysis 146–149 frames, distribution 208–210 free reports 340–341, 345–346 free variation 416–417 frequency 269 frequency e¤ects 23 frequency information 108–109 frequent frames 146–149 Frigo, L. 154–155 Fu, Q. 342–343 Fu, X. 342 Funahashi, S. 328 function words, vs. content words 102–103, 188–189 functional categories 152 functional magnetic resonance imaging (fMRI) study 28–29 Fuster, J. 327

galvanic skin response (GSR) 350– 351 gambling 341–342 game 269 Gang E¤ects 158 Ga¨rdenfors, P. 474 gender, and non-adjacent dependency learning 107–108 gender markers 153–157 generalization e¤ects of sleep 38 musical knowledge 444–446 generalization tests 161, 162, 239, 240 computerized training 318 generalizations inducing 186–187 roles of consonants and vowels 179–181 generative accounts 409, 410–411 Georgetown University Round Table on Languages and Linguistics (GURT) 3 Gerken, L. A. 102, 104–105, 156, 187 German studies 250–254, 257–258, 354–355 Gervain, J. 103, 182–185, 188, 189– 190, 191 Glenberg, A. M. 243 Goldberg, A. E. 271 Goldwater, S. 391, 398 Go´mez, R. L. 31, 34, 38, 70, 102, 105, 106–107, 148, 155, 187, 212–214 gradiency e¤ects 41 Graf Estes, K. 97–98, 99–100 grammar-conforming utterances, generation 145 grammar learning 389, 397–398 Grammarama 205–206 grammatical categories identifying 145 see also word classification grammatical information, and syntax learning 245–246 grammatical structure 101–108 nonadjacent dependencies 105–108

Index grammaticality judgement test (GJT) 240–241, 244, 249, 253, 256, 258 Gri‰ths, T. L. 414, 417–418 guessing criterion 341–342 Guo, X. 355 habituation-dishabituation 21 Haith, M. M. 20–21 harmonicity 450 Harris, Z. 172, 205, 206 Hartshorne, J. K. 107 Hasher, L. 23 Hauser, M. D. 246, 247 Hay, J. 63–64, 98–99, 100 Hayes, J. R. 172–173 hearing impairment 7 Hebrew 175 High Transitional Probability (HTP) 95–96, 98–99, 100 higher-level language structure 102 hippocampus 36–38 Hochmann, J. 108–109, 185 Ho¨hle, B. 151, 152 homophones 146 Howard, J. H. Jr. 311 Hudson Kam, C. L. 243, 416, 418 human melodic pitch prediction 468– 469 hypothesis space 384, 398–399 immersion programs 151 implicit learning 224–225, 237 artificial grammar learning (AGL) 340 assumptions 17 characteristics 13–14 impairments 311–313 knowledge contents of experiments 343–344 research focus 365 and statistical learning 4, 365–366 see also subjective measures study implicit memory 23, 343 incidental learning 16, 241–243

501

nonadjacencies 30 indirect tests 367–368 induction, problems of 203 infant language development, statistical information in 4–5 infant statistical learning studies 21 infants listening preferences 151 musical knowledge 436 pitch processing 437–438 problem of language acquisition 433–434 sampling assumptions 396 sensitivity to pitch categories 447 types of calculations 69–70 inflection 181 Information Dynamics of Music (IDyOM) 8 associational learning 478–479 context of study 463–465 cross-domain application 479–487 data 480–481 data representation 472–473 expectation, prediction and neural correlates 475–476 exploratory study 480–487 language processing vs. music processing 486 levels of representation 466–467 method 481–482 neural network models 478–479 overview of model 468–478 phrase segmentation 476–477 pitch prediction 474–475 preliminary study results 483 random example segmentations 483 relationship of music and language 466–468 research questions 482–483 results 483–487 schematics 470, 471 shortcomings 477–478 simulation approach 466–468 summary and conclusions 487–489



unsupervised learning model 468– 472 viewpoint selection 473–474 viewpoints 472–473 information theory 172 Initial Word Likeness (IWL) 135–137 innate ability 56–57, 60, 75–77 input-based enhancement 223 integrative view 171–172 intercorrelations 110 interference-based account 128–130 internal consistency, visual expectation paradigm (VExP) 27 internal dependencies 159–161 internalist/externalist debate 225 Italian/French study 108–109 Italian segmentation studies 63–64, 95 iterated artificial language learning 417–418 iterated learning 412 Japanese/American study 352 Japanese/UK study 353 Japlish 243–245, 247–249, 257, 258 Jime´nez, l. 367–368 Johnson, E. K. 64–65, 70, 257 judgment knowledge, vs. structural knowledge 343–346, 368–370 Juscyzk, P. 64–65, 70, 94, 148 Kalish, M. L. 414 Kaschak, M. P. 243 Kirby, S. 422 Kirkham, N. Z. 91 Kiyokawa, S. 353 knowledge conscious-unconscious distinction see separate heading direct and indirect tests 367–368 explicit and implicit 7 judgment vs. structural 343–346, 368–370 measuring awareness 340–343 measuring implicit and explicit 367

prototypical conscious structural knowledge 347 prototypical unconscious knowledge 347 subjective measures 368–370 verbal reports 367 see also musical knowledge Kronenberger, W. G. 323 Krumhansl, C. L. 468 Kuribara, C. 243–245, 258 label-learning 98–99 Lakusta, L. 105, 155 Langley, M. M. 246 language as psychological phenomenon 464– 466 relationship with music 466–468 rhythmic classes 179–180 language acquisition, empiricist approaches 1 language creation 416 language discrimination 179 language disorders 310–314 language experience, and non-adjacent dependency learning 106–107 language function, improving 6 language impairments 207 language learning, as inbuilt and experiential 55–56 language processing 306–310 vs. music processing 486 language-specific narrowing 41 language-specificity 184 language universals 412–415 Lany, J. 106–107, 109–110, 111, 148, 162 Larsen-Freeman, D. 273 lawfulness 15 learnability, and meaning 246 learnability problems 386 learning, and sleep 37–38 learning developments, timeline 34fig learning mechanisms 2–3, 145, 171, 257

conscious and unconscious structural knowledge 349 second language (L2) acquisition 216–219 learning principles 204 integrate probabilistic sources of information 207–211 second language (L2) acquisition 5–6 seek invariance 211–216 length of look 77–79 ‘less is more’ hypothesis 35 letter guessing game 210 Levinson, S. C. 412 Lewis, S. 340 lexical categories 5 acquisition 161–163 correlations 109 distinguishing 145, 147, 151, 152 lexical context 146 likelihood 384 limited temporal processing window hypothesis 35–36 linguistic categories, induction vs. derivation 184 linguistic constraints morphological level 181–187 overview 171–172 phonemic level 175–181 prosodic level 191–195 summary and conclusions 195–197 at syntactic level 187–191 linguistic generalities, extraction 57–58 linguistic knowledge, word order regularities 245–247 linguistic variation 415–419 listening preferences 151 looking paradigms 77–79 Low Transitional Probability (LTP) 95–96, 98–100 Lynch, M. P. 447 MacDonald, J. L. 154–155 mandooling 265

mapping, content words and referents 109–110 Maratsos, M. 146 Mareschal, D. 110 Markov model 469 Masuda, T. 352 Maye, J. 434 McAdams, S. 451–452 Mealor, A. 350 meaning and learnability 246 in lexical category acquisition 162 prototypicality of 272 and sounds 97–101, 108–111 memorability 371 memorization advantage 238 memorization and recall, explicit 24 memory implicit 343 musical 437 see also working memory (WM) memory capacity, and non-adjacent dependency learning 106 Mere Exposure Effect 454, 456 metacognition 338–339, 340 methodologies artificial grammar learning (AGL) 18–20, 29–30 context-dependent sequences 27–29 infant statistical learning studies 21–22 serial reaction-time paradigm (SRT) 20, 28, 32, 37–38 visual expectation paradigm (VExP) 20–21, 26–27 Meyer, L. 439, 453 Miller, G. A. 205–206, 282 Minimum Description Length (MDL) 184 Mintz, T. H. 146, 148, 157–159 modality-specific effects 25 Monaghan, P. 149, 151 monkey at the typewriter word generation model 282–283 mora-timed languages 180

Morgan, J. L. 152 morpheme boundaries 172–173 morphemes, identification 182–185 morphological regularities, nonadjacent 185–187 morphology 181–182, 389 morphosyntactic dependencies acquisition studies 70 infant learning 78 morphosyntactic patterns 59 morphosyntactic relationships 35 motivation, effects of 27 Mueller, J. L. 246 multidimensionality 463 multimodality 434–435 multiple level learning 392–394 music effects on brain 438–440 emotional content 453–454 exposure to 439 as psychological phenomenon 464–466 relationship with language 466–468 music perception 439 music processing, vs. language processing 486 musical knowledge 8 implicit 435–436 parallels with language learning 437 statistical learning of tone sequences 436–438 musical knowledge study Bohlen-Pierce scale 439–441 expectation 452–453 learning grammatical structure 442–444 learning pitch categories 441–442 nontraditional scales 447 recognition and generalization 444–446 sources of statistical knowledge 447–448 summary and conclusions 455–456 timbre 448–452 musical learning, cognitive model 463

musical memory 437 musical training 438 n-gram model 469 natural language 93–97 acquisition as implicit learning 369 as noisy stimulus 95–96 statistical sublexical structure 207 and word learning 98–99 word order regularities 240–243 natural language studies ecological validity 97 word segmentation 63–64 natural speech analogue, word segmentation studies 64–66 Nelson, C. A. 28 Nespor, M. 195 neural and cognitive mechanisms 206–207 neural correlates, of expectation 475 neural network models 24, 478–479 of sequential processing 32 neural network simulations (‘‘less is less’’) 36 neural structures, changes in 36–38 neurocognitive mechanisms, enhancement 327–328 Newport, E. 1–2, 35, 57–59, 102, 105–106, 178, 186, 188, 190, 416, 418 Nisbett, R. E. 352 Nissen, M. J. 20 non-adjacent dependencies 30–31, 70–71 sex differences 107–108 tracking 59 non-adjacent morphological regularities 185–187 non-concatenative morphology 182, 184 non-linguistic auditory stimuli 2–3 nonadjacencies age-related differences 32 deterministic 32, 34 nonadjacent dependencies 213–214

difficulty of learning 246 grammatical structure 105–108 violations 148 normative [bi-directional] contingency 22 object categories 110–111 object perception 133 Occam's Razor 75, 76 Oleson, P. J. 327–328 on-line reading time 249 Onnis, L. 34, 217, 219–220 orthotactics 207–208, 210–211 over-segmentation 72–74 overhypotheses 393 Pachet, F. 464 Pacton, S. 16 paradigms, diversity and commonality 18–22 parental feedback 41 PARSER 129–130, 137 partwords 69–70 Patel, A. D. 466, 488 patterning, of speech sounds 59 Pearce, M. T. 468, 473, 474–475, 476–477 Pearson r 124 Pedersini, R. 350–351 Pelucchi, B. 63–64, 94–95 Peña, M. 105–106, 186 perceptual constraints 17 perceptual development 39–42 general principles 41 starting small 35 perceptual edges 186–187, 194 perceptual factors as anchor points 192 and statistical learning 120 perceptual learning 14 Perfors, A. 389 Perner, J. 347 Perruchet, P. 16, 134–136, 256 Persaud, N. 341 phenomenological boundaries 22–26

phenomenologies, differences 337, 351 Phoebe 266, 295–296 phoneme distributions 207 phonemes, acquisition 434 phonemic inventory 59 phonetic contrasts 59 phonetic information 149–151 phonological cues 110–111 word classification 152, 161 phonological information 111, 120, 152, 158 phonological properties, differences 102–103 phonotactic regularities, and word learning 100–101 phonotactics 207–208 phrasal grouping 188 phrasal prosody 190–191 phrase boundaries 195 phrase structure 103 statistical learning 159–161 Pinker, S. 146 pitch, relative and absolute 437–438 Plante, E. 311 plasticity effects of music 438–440 of mind and brain 305 neuro-cognitive 312 Plaut, D. 36 plural dependency 70–71 polysemy 282 Ponsford, D. 468 posterior probability 384 Poulin-Charronnat, B. 256, 453 poverty of the stimulus 409 power-law distribution 388 prediction 306 prediction-based estimation 17 Prediction by Partial Matching (PPM) 469 predictive dependencies, visual and auditory 39 predictive probability 22 predictive relationships, nonadjacent dependencies 105–108

preference methods 21 preferential looking visual acuity estimates 40 prefrontal cortex (PFC), role and functioning 327–329 prelinguistic social interaction 41 primacy 25 principle of belonging 126–127, 137, 149 prior experience, and non-adjacent dependency learning 106–107 prior probability 384 probabilistic dependencies 29–31 probability 172 probability theory 384 probe tone paradigm 441–442 process dissociation procedure 342 processing window 34–35 prosodic cues, vs. statistical cues 192 prosodic information 120, 149–151 prosodic phrases 193–194 prosody 188–189 as filter 120 phrasal 190–191 semantic 355, 356 prototype theory of concepts 272 prototypical conscious structural knowledge 347 prototypical unconscious knowledge 347 prototypicality of meaning 272 Proximity 158 pseudowords 210–211 Quine, W. V. O. 217–218 RASP 275 rational inference models 410 rats, statistical learning mechanisms 58, 173 reading disorders 311 reading times 220 Reali, F. 417–418 Reber, A. S. 14, 15, 16–17, 18, 238, 337, 340, 352

Rebuschat, P. 240–242, 250–254, 354–355, 356 recency 25, 269 recognition, musical knowledge 444–446 recursive compositionality 415 Reed, N. 349 regularisation 416–417, 419 regularity 418–419 repeating sequences 23–24 asymmetric or simple 26–27 repetition rules 186–187 repetition structures 239 Rescorla, R. A. 273 Rescorla-Wagner model 17 Reznick, J. S. 27 rhythm, sense of 436 rhythmic classes, of language 179–180 Richerson, P. J. 413–414, 421–423 Robinson, P. 217, 245–246 robustness, correlated cues 154–157 Rohde, D. L. T. 36 Rose, S. A. 27 rule abstraction 15 Russian/English study 104–105 Saffran, J. 1–2, 14, 16, 25, 29–30, 57–58, 59, 62, 101, 102, 109–110, 111, 159–161, 162, 173, 192, 433–434, 437 sampling assumptions 394–397 Sandhofer, C. 111 Sandoval, M. 257 Santelmann, L. M. 148 scalability, distributional models 61–66 scale-free laws 270–271 scaling up, form based categories to lexical classes 145 Scott, R. 345, 349, 350, 369–370 scrambling 244 second language (L2) acquisition 5–6 adult learners 6 distributional approaches 221–226

educational practices 223–224 implicit learning 224–225 input-based enhancement 223 integration of information sources 207–211, 222 internalist/externalist debate 225 learning principles 204, 222 learning to predict 219–221, 222–223 overview 196–197 reusing learning mechanisms 216–219, 222 seeking invariance 211–216, 222 task-induced involvement 226 vocabulary acquisition 355 see also syntax learning semantic information 6, 92, 104, 110, 112 and syntax learning 245–246 semantic prosody 355, 356 semantic space models 219 sensitivity to correlated features 109–110, 111 function vs. content words 102–103 sentence processing 220 sequential patterns 16 sequential processing, neural network modeling of 32 serial reaction-time paradigm (SRT) 20, 28, 32, 237 effects of sleep 37–38 serial reaction time tasks, syntax learning 249–256 serial reaction-time test 367–368 Seth, A. K. 341–342 sex differences 107–108 Shady, M. 151, 152 Shanks, D. R. 337 Shannon, C. 205 Shannon entropy 280, 284, 470, 471 Shukla, M. 120, 131, 193–195, 257 simple probabilities 172 Simple Recurrent Networks (SRNs) 245, 249, 257, 258, 478, 479

simulation approach 466–468 Sinclair, J. 274 single process models 348 size principle 395–396 sleep, and statistical-sequential learning 37–38 Smith, K. 418, 422, 423–424 Smith, L. B. 111 social learning 409–410, 423 socially guided statistical learning 41 Soderstrom, M. 72 sound constituents, acquisition 434 sound cues 149–152 sounds, and meaning 97–101, 108–111 spatio-temporal sequential learning 39 species-specific cognitive traits 195 specific language impairment (SLI) 311 speech perception, effects of surroundings 66–67, 71 speech perception in noise 309 speech perception research 56–57 speech segmentation 173 see also word segmentation speech signal, degradation of 306, 309 speech sounds, patterning of 59 speech variability 68–69 Spinoza, B. 175–176 spoken language, and transcribed language 67–68, 71 St. Clair, M. C. 148–149 ‘starting small’ simulations 35 statistical cues 22 neonates 29 statistical learning basis of 1 cognitive mechanism 2–3 definitions 14, 305, 383–384 developmental trajectory 174 and language acquisition 433–434 limitations 257–258 multimodality 434–435 origins and development 205–207 possibility of enhancement 305

probabilities 172–173 research focus 365 success of 399–402 vs. structural learning 356–357 see also domain-general learning abilities statistical learning abilities 307–310 visual sequential statistical learning task 308 statistical learning mechanisms, infants 58–59 statistical regularities 112 constraints on 120–121 statistical-sequential learning developmental changes 32–42 effects of sleep 37–38 future development 42 learning landscape 26–31 neural structures 36–38 overview 13–15 perceptual development 39–42 stem completion 343 stimulus events, predictive covariation 17 Storkel, H. L. 100 stress placement 192–193 stress-timed languages 180 striatum 36–37 strong sampling 394–395 Stroop Color and Word test 320–321 structural knowledge, conscious-unconscious distinction 349 subjective measures of conscious states evidence 348–351 summary 351–352 theoretical context 347–348 subjective measures study confidence ratings 376–377 discussion 378–379 exposure phase 372–375 four-alternative forced-choice task 375, 376 method 370 participants 370 procedure 372–375

results 376–378 source attributions 377–378 stimuli 371–372 target items and referents 372 test phase 375 verbal reports 378 see also conscious-unconscious distinction; implicit learning supracontexts 158 Swingley, D. 72–73 Switch paradigm 97–98 syllable counts 71 syllable distribution analyses 73 syllable distribution cues 57–58 syllable-timed languages 180 symbolic information 138 synaptic pruning 35 syntactic bootstrapping hypothesis 111 syntactic categories, cues 104 syntactic form 465 syntactic grammars 59 syntactic information 187–188 syntax acquisition 306 syntax learning abstract rules learning 243–245 grammatical and ungrammatical patterns 242 incidental learning 241–243 limitations 258 linguistic knowledge 245–247 meaningful sentences and nonsense analogues 247–249 natural language word order regularities 240–243 overview 237 serial reaction time tasks 249–256 summary and conclusions 256–259 see also word order regularities synthesis of approaches 5 Tanaka, D. 352–353 task-induced involvement 226 Tenenbaum, J. B. 395–396 tests, direct and indirect 367–368

theory on unconscious cognition 16–17 Thiessen, E. 25, 62, 192 Thomas, K. M. 28 Thompson, S. P. 102, 188, 190 Thorndike, E. L. 126–127, 137 TIGRE model 179 Tillmann, B. 134–136, 451–452, 453 timbre 448–452 timeline, learning developments 34 fig. token frequency 270–272 Bayesian modeling 388–389 see also Zipfian distribution study token-level data 394 tokens co-occurrences of 187 generation 388–389 tone sequences 436–438 Toro, J. M. 173, 180, 192 transcribed language, and spoken language 67–68, 71 transfer effect 238–239 transient cognitive constraints 34–36 transitional probabilities (TPs) 2, 25, 29, 91–92, 122–125 and attention 128 Bayesian modeling 387–388 dips 188 directionality 99–100 English/Italian study 95 as first cues 62 higher-level language structure 102 and interference 129–130 reliance on 58–59 roles of consonants and vowels 177–179 sensitivity to 101 word boundaries 119 see also backward transitional probability (BTP); forward transitional probability (FTP) transmission bottleneck 414–415 transmission, constraints on 414–415 Trobalon, J. B. 173 Turk-Browne, N. B. 38

Tyler, M. 64–65 type frequency 270–272 Bayesian modeling 388–389 see also Zipfian distribution study type-level data 394 Ullman, M. T. 107 uncertainty 205, 280 unconscious knowledge 338–339 prototypical 347 see also conscious-unconscious distinction unconscious processes, roots of 17 underlying mechanisms 195 unit formation view 256 units for tracking 66–67 usage-based acquisition 267–268 utterances, analysis 181 valence tendencies 220 van den Bos, E. 30 Van Heugten, M. 70 vector space model 219–220 verb argument constructions 392–393 Verb-Argument Constructions (VACs) 6, 265–266 see also Zipfian distribution study verb placement rules 241 verb type-token distribution 275 verbal reports 367 viewpoints 472–474 visual expectation paradigm (VExP) 20–21, 26–27 age-related differences 27 internal consistency 27 visual sequential learning 39 deaf/hard of hearing children 324 visual statistical learning 30 visual system development 40 visuospatial memory tasks 328 vocabulary size, and cue use 110–111 vowels 175–181 Vroomen, J. 192 wagering 341–342

Wan, L. L. 342–343, 350 weak-bias-strong-effect 411 weak sampling 395 Weissenborn, J. 151 Werker, J. F. 191 Wessel, D. L. 449 Wiggins, G. A. 473, 474–475 Williams, J. N. 240–242, 243–245, 247, 250–254, 258, 354–355, 356 Wilson, D. P. 101 Wittgenstein, L. 269 Wonnacott, E. 418 word boundaries 183, 193–195 word categories 111 word classes 5 word classification corpus analyses 162 correlated cues 152–156, 162 cues to 145 distributional cues 146–149 gender markers 153–157 overview 145–146 phonological cues 161 phrase structure 159–161 sound cues 149–152 see also grammatical categories word frequency 183, 184, 188–190 word learning and natural language 98–99 and phonotactic regularities 100–101 and word segmentation 97–98 word length 65–66 word order regularities 102, 183 see also syntax learning word-picture mappings 217 word position 183 word segmentation 1–2, 5, 21–22, 29, 57, 306 associational learning see associational learning Bayesian modeling 390–392 cues 119–120 form-based to lexical categories 161–163

natural language studies 63–64 transitional probabilities (TPs) 58–59 and word learning 97–98 see also associational learning; speech segmentation word segmentation studies 24, 433–434 influence and importance of 59–60 Italian segmentation studies 63–64 natural speech analogue 64–66 WordNet 282 see also Zipfian distribution study words, sub-parts 181 working memory (WM) adaptive training 327–328 computerized training task 316–317 conscious and unconscious structural knowledge 349 deaf/hard of hearing children 324 hypothesis testing 347 improving 6 possibility of enhancement 305, 314–315 see also memory Xu, F. 395–396 Yang, C. 283 Younger, B. A. 110 Zacks, R. T. 23 zero-correlation criterion 341–342 Ziori, E. 341 Zipf's law 270–271 Zipfian covering 271 Zipfian distribution 270–272, 388 academic attitudes to 282–283 Zipfian distribution study BNC XML parsed corpora 275–276 Comparison of values for VACs and CECs 290–291 Construction Grammar 267–268 construction inventory 274–275

control ersatz constructions (CECs) 282–284 determinants of construction learning 268–274 determinants of learning 294 determining contingency between verbs and VACs 280–281 disambiguated WordNet senses 285–286 discussion 291–293 distribution of WordNet root verb synsets for VACs and CECs 292 evaluating semantic cohesion in VAC distributions 284–287 frequency ranked type-token VAC profile 276–280 future research 293–295 inventory of English VACs 293

learner language 293–294 method 266–267, 274–287 modeling acquisition 295 results 287–291 searching construction patterns 276 semantic coherence 291 summary and conclusions 295 top 20 verbs 281 type-token data 277 type-token usage distributions 287–291 values for CECs 289 values for VACs 288 verb form family occupancy 291 verb semantics 282 verb type distribution 278, 279 Zipfian scale-free laws 270–271