Second Language Speech Learning: Theoretical and Empirical Progress. ISBN 1108840639, 9781108840637




Second Language Speech Learning Theoretical and Empirical Progress Edited by Ratree Wayland

SECOND LANGUAGE SPEECH LEARNING

Including contributions from a team of world-renowned international scholars, this volume is a state-of-the-art survey of second language speech research, showcasing new empirical studies alongside critical reviews of existing influential speech learning models. It presents a revised version of Flege’s Speech Learning Model (SLM-r) for the first time, an update on a cornerstone of second language research. Chapters are grouped into five thematic areas: theoretical progress, segmental acquisition, acquiring suprasegmental features, accentedness and acoustic features, and cognitive and psychological variables. Every chapter provides new empirical evidence, offering new insights as well as challenges on aspects of the second language speech acquisition process. Comprehensive in its coverage, this book summarizes the state of current research in second language phonology and aims to shape and inspire future research in the field. It is an essential resource for academic researchers and students of second language acquisition, applied linguistics, and phonetics and phonology.

Ratree Wayland is Associate Professor in the Department of Linguistics at the University of Florida. She has published extensively on cross-language perception and production of lexical tones.

SECOND LANGUAGE SPEECH LEARNING Theoretical and Empirical Progress

Edited by RATREE WAYLAND, University of Florida

University Printing House, Cambridge CB2 8BS, United Kingdom
One Liberty Plaza, 20th Floor, New York, NY 10006, USA
477 Williamstown Road, Port Melbourne, VIC 3207, Australia
314–321, 3rd Floor, Plot 3, Splendor Forum, Jasola District Centre, New Delhi – 110025, India
79 Anson Road, #06–04/06, Singapore 079906

Cambridge University Press is part of the University of Cambridge. It furthers the University’s mission by disseminating knowledge in the pursuit of education, learning, and research at the highest international levels of excellence.

www.cambridge.org
Information on this title: www.cambridge.org/9781108840637
DOI: 10.1017/9781108886901

© Cambridge University Press 2021

This publication is in copyright. Subject to statutory exception and to the provisions of relevant collective licensing agreements, no reproduction of any part may take place without the written permission of Cambridge University Press.

First published 2021

A catalogue record for this publication is available from the British Library.

ISBN 978-1-108-84063-7 Hardback

Cambridge University Press has no responsibility for the persistence or accuracy of URLs for external or third-party internet websites referred to in this publication and does not guarantee that any content on such websites is, or will remain, accurate or appropriate.

On behalf of students who have benefited from her mentoring and colleagues who have been inspired by the creativity and breadth of her research on second language speech learning, we dedicate this volume to Susan Guion Anderson

Contents

List of Figures  page x
List of Tables  xvi
List of Contributors  xviii
Preface  xxiii
Acknowledgments  xxvii

Part I  Theoretical Progress  1

1 The Revised Speech Learning Model (SLM-r)  3
James Emil Flege and Ocke-Schwen Bohn

2 The Revised Speech Learning Model (SLM-r) Applied  84
James Emil Flege, Katsura Aoyama, and Ocke-Schwen Bohn

3 New Methods for Second Language (L2) Speech Research  119
James Emil Flege

4 Phonetic and Phonological Influences on the Discrimination of Non-native Phones  157
Michael D. Tyler

5 The Past, Present, and Future of Lexical Stress in Second Language Speech Production and Perception  175
Annie Tremblay

Part II  Segmental Acquisition  193

6 English Obstruent Perception by Native Mandarin, Korean, and English Speakers  195
Yen-Chen Hao and Kenneth de Jong

7 Changes in the First Year of Immersion: An Acoustic Analysis of /s/ Produced by Japanese Adults and Children  213
Katsura Aoyama

8 Effects of the Postvocalic Nasal on the Perception of American English Vowels by Native Speakers of American English and Japanese  228
Takeshi Nozawa and Ratree Wayland

Part III  Acquiring Suprasegmental Features  247

9 Relating Production and Perception of L2 Tone  249
James Kirby and Đinh Lư Giang

10 Production of Mandarin Tones by L1-Spanish Early Learners in a Classroom Setting  273
Lucrecia Rallo Fabra, Xialin Liu, Si Chen, and Ratree Wayland

11 Production of English Lexical Stress by Arabic Speakers  290
Wael Zuraiq and Joan A. Sereno

12 Variability in Speaking Rate of Native and Nonnative Speech  312
Melissa M. Baese-Berk and Ann R. Bradlow

Part IV  Accentedness and Acoustic Features  335

13 Comparing Segmental and Prosodic Contributions to Speech Accent  337
Marina Oganyan, Richard Wright, and Elizabeth McCullough

14 Do Proficient Mandarin Speakers of English Exhibit an Interlanguage–Speech Intelligibility Benefit When Tested with Complex Sound–Meaning Mapping Tasks?  350
Marta Ortega-Llebaria, Claire C. Chu, and Carrie Demmans Epp

15 Foreign Accent in L2 Japanese: Cross-Sectional Study  377
Kaori Idemaru, Misaki Kato, and Kimiko Tsukada

Part V  Cognitive and Psychological Variables  397

16 Self-Reported Effort of Listening to Nonnative Accented English Depends on Talker Pausing and Listener Working Memory Capacity  399
Mengxi Lin and Alexander L. Francis

17 Investigating the Role of Cognitive Abilities in Phonetic Learning of Foreign Consonants and Lexical Tones  418
Irina A. Shport

18 Auditory Priming Effects on the Pronunciation of Second Language Speech Sounds  439
Lindsay Leong, Trude Heift, and Yue Wang

19 Indexical Effects in Cross-Language Speech Perception: The Case of Japanese Listeners and English Fricatives  463
Benjamin Munson, Fangfang Li, and Kiyoko Yoneyama

20 The Role of Orienting Attention during Perceptual Training in Learning Nonnative Tones and Consonants  485
Ying Chen and Eric Pederson

Index  503

Figures

1.1 The generic three-level production–perception model assumed by the Speech Learning Model  page 12
1.2 The mean VOT (ms) in word-initial tokens of /p t k/ produced in English words in 1992 and 2003 by native Italian (NI) speakers in Canada and by 20 native English speakers and 20 NI speakers each of whom reported using English either more or less in 2003 compared to 1992  27
1.3 Mean VOT values in the production of English /t/ by native speakers of English and native Spanish early and late learners of English  56
2.1 Hypothetical cross-language mapping between a Japanese liquid consonant and two English liquids at four hypothetical stages of L2 development by native speakers of Japanese  85
2.2 The mean perceived dissimilarity of English /r/ and /l/ in the single-talker condition and the five-talker condition  92
2.3 The mean subjective familiarity ratings for English words obtained by Flege et al. (1995) for two groups of native Japanese speakers plotted as a function of the mean ratings obtained for the same words for native English speakers  99
2.4 An account of the effects of subjective lexical familiarity on native Japanese speakers’ identifications of /r/ and /l/ that was inspired by the Theory of Signal Detection  100
2.5 The preferred F3 values obtained from native speakers of English and native speakers of Japanese for English /r/ and /l/ (both groups) and Japanese /R/ (just the native Japanese speakers)  103
2.6 The mean ratings of /r/ and /l/ obtained for the 12 participants in three groups  108
2.7 The mean ratings of /r/ tokens produced by the members of three groups as a function of the F3 values in the rated tokens  109
3.1 Example of a test item from the Cumulative Use Index  135
3.2 Sample items from a Cumulative Use Index  137
3.3 The classification of the members of a VOT continuum ranging from /bi/ to /pi/ by native English (NE) monolinguals and native Spanish (NS) “near-monolinguals” using one of three response labels  147
3.4 Mean number of English words that were correctly recognized by native English (NE) speakers and two groups of native Italian (NI) speakers who arrived in Canada at the mean age of seven years but differed in how frequently they used Italian  149
6.1 Proportional accuracy for consonants in coda position plotted by proportional accuracy for the same consonant in onset position  203
8.1 Mean F1 and F2 frequencies of six vowels uttered by four native speakers averaged across five preplosive contexts  232
8.2 Classification overlap scores of six vowel pairs in preplosive and prenasal contexts  235
8.3 Mean percentages and standard errors of English and Japanese listeners’ discrimination accuracy of the six AE vowel pairs in preplosive and prenasal contexts  236
8.4 Mean percentages and standard errors of English and Japanese listeners’ identification accuracy of the six AE vowels in preplosive and prenasal contexts  238
9.1 Waveform and spectrogram of stimulus /taː33/ and f0 contours of synthesized perception stimuli  257
9.2 Average f0 contours for Southern Vietnamese tones across speakers by L1  259
9.3 Tone productions for six KG participants, averaged over repetitions of each target syllable  260
9.4 Mean discrimination accuracy by tone pair, averaged over speakers and repetitions  261
9.5 Tone productions for KM10 and KF1  262
10.1 Interspeaker normalized contours for Mandarin Chinese tone 1 (level) produced by the native Mandarin and Spanish learner groups  281
10.2 Interspeaker normalized contours for Mandarin Chinese tone 2 (rising) produced by the native Mandarin and Spanish learner groups  282
10.3 Interspeaker normalized contours in z-scores for Mandarin tone 3 (low dipping) produced by a group of 4 native Mandarin children and 12 Spanish children learning Mandarin  283
10.4 Interspeaker normalized pitch contours in z-scores for Mandarin tone 4 (high falling) produced by a group of 4 native Mandarin children and 12 Spanish children learning Mandarin  284
11.1 Duration ratio for words (first-syllable stress, second-syllable stress) for native speakers of English (NE), advanced Arabic learners of English (AALE), and beginning Arabic learners of English (BALE)  296
11.2 Fundamental frequency ratios for words (first-syllable stress, second-syllable stress) for native speakers of English (NE), advanced Arabic learners of English (AALE), and beginning Arabic learners of English (BALE)  297
11.3 Amplitude ratio for words (first-syllable stress, second-syllable stress) for native speakers of English (NE), advanced Arabic learners of English (AALE), and beginning Arabic learners of English (BALE)  298
11.4 Second formant frequency (F2) values for front and back vowels in the initial (first) syllable for first-syllable stressed words (nouns) and second-syllable stressed words (verbs) for native speakers of English (NE), advanced Arabic learners of English (AALE), and beginning Arabic learners of English (BALE)  300
11.5 Second formant frequency (F2) values for front and back vowels in the final (second) syllable for first-syllable stressed words (nouns) and second-syllable stressed words (verbs) for native speakers of English (NE), advanced Arabic learners of English (AALE), and beginning Arabic learners of English (BALE)  301
12.1 Speaking rate (syllables per second) for native and nonnative speakers reading paragraphs in the Wildcat Corpus  320
12.2 Rate change (calculated for consecutive utterances) for native and nonnative speakers reading paragraphs in the Wildcat Corpus  322
12.3 Absolute value of rate change (calculated across consecutive utterances) for native and nonnative speakers reading paragraphs in the Wildcat Corpus  323
12.4 L1 speaking rate for native speakers of three languages  324
12.5 Absolute value of rate change (calculated across consecutive utterances) for native speakers of three languages from the ALLSSTAR corpus producing speech in their native languages  325
12.6 Correlation between absolute value of rate change in L1 and L2 for native Korean and Mandarin speakers in the ALLSSTAR corpus  326
13.1 Accented rating score distribution for American English, Hindi, Korean, Spanish, and Mandarin speakers  344
13.2 Correlation between segmental and prosodic properties and accented ratings for each language  345
14.1 Participants’ ability to correctly identify the trait within sentences  365
14.2 The distribution of speaker accentedness ratings by listener and speaker type and the distribution of speaker comprehensibility ratings by listener and speaker type  366
15.1 Variable importance for learner groups  387
15.2 Eight top-ranked predictors for Y1 and their rankings for Y2 and Y4  389
16.1 Subjective evaluations for the three types of speech  409
16.2 Listening effort ratings by WMC and condition  411
17.1 The structure of the word-learning task  424
17.2 Error types in the word-identification task  429
17.3 Error types in the AX discrimination task with catch (same) and different trials  430
18.1 Vowel productions in a vowel space plot  450
18.2 High-front vowel productions as a function of prime-target congruency  451
19.1 Selected acoustic characteristics of word-initial fricatives produced by two men and two women, separated by sexual orientation  464
19.2 Selected acoustic characteristics of the vocalic bases of the stimuli in this study, separated by the word from which the base was excised and the perceived masculinity of the speaker  465
19.3 The proportion of “yes” responses to the question “Is this a ‘sh’?” to the 128 stimuli with the four-step /s/-/ʃ/ series, separated by stimulus step (1–4), listener language (Japanese vs. English), and perceived masculinity of the talkers  475
19.4 The proportion of “yes” responses to the question “Is this a ‘sh’?” for the Japanese-speaking listeners and the English-speaking listeners for the 128 stimuli with the four-step /s/-/ʃ/ series  476
19.5 The proportion of “yes” responses to the question “Is this an ‘s’?” to the 128 stimuli with the four-step /s/-/θ/ series, separated by stimulus step (1–4), listener language (Japanese vs. English), and perceived masculinity of the talkers  478
19.6 The proportion of “yes” responses to the question “Is this an ‘s’?” for the Japanese-speaking listeners and the English-speaking listeners for the 128 stimuli with the four-step /s/-/θ/ series  479
20.1 Consonant training paradigm  490
20.2 Tone training paradigm  490
20.3 Results of the AXB discrimination task in pretraining and posttraining tests of the consonant-attending group  492
20.4 Results of the AXB discrimination task in pretraining and posttraining tests of the tone-attending group  493
20.5 Results of the identification training combining both the consonant-attending and tone-attending groups  493
20.6 Mean errors of consonant discriminations by the consonant-attending group  494
20.7 Mean errors of consonant discriminations by the tone-attending group  494
20.8 Mean errors of tone discriminations by the tone-attending group  495
20.9 Mean errors of tone discriminations by the consonant-attending group  495

Tables

6.1 English obstruents examined in the current study and the closest Mandarin and Korean obstruents in the onset and coda positions  page 198
7.1 Characteristics of the native English (NE) and native Japanese (NJ) participants  217
7.2 Mean noise duration averaged across speakers in each group  219
7.3 Mean center of gravity (CoG) values averaged across speakers in each group  220
7.4 Mean noise amplitude averaged across speakers in each group  221
8.1 Perceptual assimilation: the most frequent responses in percentages and mean categorical goodness ratings  234
9.1 Production stimuli  254
9.2 Mean global Fréchet and DTW distances between KG and VN tone productions, from most to least similar  259
9.3 Mean discrimination accuracies for KM10 and KF1 by tone pair  262
10.1 Growth curve analysis results for the four Mandarin tones  281
12.1 Distribution of speakers across languages in the Wildcat Corpus  317
14.1 Average proficiencies of listeners  358
14.2 Speech characteristics of speakers  359
14.3 Accuracy of target traits produced by speakers  360
15.1 Speaker information  379
15.2 Test sentences and English translation  380
15.3 The results of mixed-effects modeling  385
16.1 Likert scale rating of listening effort, subjective intelligibility, and acceptability  405
16.2 Pause analysis  408
16.3 Mixed-models results for listening effort  411
17.1 Twelve stimulus words  423
17.2 Two-tailed Pearson correlations for all observed variables  427
18.1 Example priming task trial  445
18.2 Mean vowel duration produced by English and Mandarin speakers in the congruent and incongruent prime-target conditions  448
18.3 Mean F1 and F2 values for the target vowels produced by English and Mandarin speakers in the congruent and incongruent prime-target conditions  449
18.4 Mean percent correct intelligibility for the target vowels produced by English and Mandarin speakers in the congruent and incongruent prime-target conditions as judged by native listeners of English  453
18.5 Mean percent correct vowel identification by the English and Mandarin speakers  454
19.1 Acoustic characteristics of the fricative stimuli  472
19.2 Results of the most complex model predicting responses to the /s/-/ʃ/ stimuli  475
19.3 Results of the most complex model predicting responses to the /s/-/θ/ stimuli  478

Contributors

Katsura Aoyama, Department of Audiology and Speech-Language Pathology, University of North Texas
Melissa M. Baese-Berk, Department of Linguistics, University of Oregon
Ocke-Schwen Bohn, Department of English, Aarhus University
Ann R. Bradlow, Department of Linguistics, Northwestern University
Si Chen, Department of Chinese and Bilingual Studies, The Hong Kong Polytechnic University
Ying Chen, School of Foreign Studies, Nanjing University of Science and Technology
Claire C. Chu, Department of Linguistics, University of Pittsburgh
Kenneth de Jong, Department of Linguistics, Indiana University
Carrie Demmans Epp, Faculty of Science, University of Alberta
Lucrecia Rallo Fabra, Department of Spanish, Modern and Classical Philologies, University of the Balearic Islands
James Emil Flege, Professor Emeritus of Speech and Hearing Sciences, University of Alabama at Birmingham
Alexander L. Francis, Linguistics Program, Purdue University
Đinh Lư Giang, Department of Spanish Linguistics and Literature, Vietnam University of Social Sciences and Humanities
Yen-Chen Hao, Department of Modern Foreign Languages and Literatures, University of Tennessee
Trude Heift, Department of Linguistics, Simon Fraser University
Kaori Idemaru, Department of East Asian Languages and Literature, University of Oregon
Misaki Kato, Department of Linguistics, University of Oregon
James Kirby, School of Philosophy, Psychology, and Language Sciences, The University of Edinburgh
Lindsay Leong, Department of Linguistics, Simon Fraser University
Fangfang Li, Department of Psychology, University of Lethbridge
Mengxi Lin, Department of Linguistics, Purdue University
Xialin Liu, Centro Educativo Huayue
Elizabeth McCullough, Pacific Science Center
Benjamin Munson, Department of Speech-Language-Hearing Sciences, University of Minnesota
Takeshi Nozawa, Program in Language Education, Ritsumeikan University
Marina Oganyan, Department of Linguistics, University of Washington
Marta Ortega-Llebaria, Department of Linguistics, University of Pittsburgh
Eric Pederson, Department of Linguistics, University of Oregon
Joan A. Sereno, Department of Linguistics, University of Kansas
Irina A. Shport, Department of English, Louisiana State University
Annie Tremblay, Department of Linguistics, University of Kansas
Kimiko Tsukada, Department of Linguistics, Macquarie University, and School of Languages and Linguistics, The University of Melbourne
Michael D. Tyler, School of Psychology and the MARCS Institute for Brain, Behaviour, and Development, Western Sydney University
Yue Wang, Department of Linguistics, Simon Fraser University
Ratree Wayland, Department of Linguistics, University of Florida
Richard Wright, Department of Linguistics, University of Washington
Kiyoko Yoneyama, Department of English, Daito Bunka University
Wael Zuraiq, English Language and Literature, Hashemite University

Preface

The present volume is the outcome of the inspiration that Susan Guion Anderson provided to researchers working on cross-linguistic speech learning during her short but productive career. A professor of linguistics at the University of Oregon, Susan Guion Anderson passed away on December 24, 2011. Her 1996 doctoral dissertation, titled “Velar Palatalization: Coarticulation, Perception, and Sound Change,” reflected her passions for phonetics and historical linguistics. After graduation, she became an NIH postdoctoral fellow under the mentorship of Professor James Flege, where her research in second language (L2) phonology acquisition began. She was interested in theoretical questions related to the acquisition and representation of L2 phonetic categories as well as the influence of the native language’s (L1) phonological distribution and regularity. She is best known for a series of studies on lexical stress in which she challenged the previous generative account of L2 stress placement and provided an alternative approach that better explained empirical data. Working within the Speech Learning Model (SLM) and the Perceptual Assimilation Model (PAM), Susan also generated an impressive volume of work on L2 speech learning at the segmental level, focusing on the acquisition of English vowels and consonants by native Japanese speakers. The role of attention in L2 category formation was added to the breadth of her research toward the end of her career.

Theoretical Progress

In honor of Susan’s intellectual legacy, prominent researchers in second language speech research, including James Flege, Ocke-Schwen Bohn, Joan Sereno, Kenneth de Jong, Richard Wright, Benjamin Munson, Alexander Francis, Yue Wang, Annie Tremblay, Michael Tyler, James Kirby, Marta Ortega-Llebaria, and others, contributed either a critical review chapter or original empirical data on the role of phonetic, cognitive, and psychological factors in second language speech learning, at both the segmental and suprasegmental levels.

For the first time since its proposal in 2005 (Flege, ISCA Workshop on Plasticity, London, June 15–17, 2005), Flege’s revised Speech Learning Model (SLM-r) is formally and comprehensively presented, reflecting nearly three decades of theoretical advancement, most notably a shift in the model’s focus from accounting for age-related limits on sequential bilinguals’ ability to produce position-sensitive allophones of L2 vowels and consonants to the role of input in the reorganization of phonetic systems during naturalistic L2 learning (Chapter 1). Point-by-point comparisons between the original SLM and the SLM-r are explained explicitly and succinctly. The application of the SLM-r to existing empirical data on the acquisition of English /l/ and /r/ is exemplified in a separate chapter (Chapter 2). Complementing Chapter 1, Flege in Chapter 3 describes new methods for eliciting L1 and L2 speech samples that are representative of bilinguals’ production; for assessing L2 perception in order to determine whether a new phonetic category has been formed; for obtaining more accurate estimates of the amount of L1 and L2 use; and, finally, for measuring the quantity and quality of the L2 input to which learners have been exposed in order to determine the L2 distribution patterns that promote the formation of new L2 phonetic categories. In Chapter 4, Michael Tyler discusses four different sources of information that can be used to discriminate contrasting nonnative phones. Using the PAM as an example, he demonstrates how a cross-language speech perception model may account for these various sources of information. Methodological requirements for determining which sources of information listeners use for discrimination are then evaluated.

Complementary to the SLM-r and the PAM, which focus on L2 segmental acquisition, Annie Tremblay (Chapter 5) critically reviews a body of work on L2 lexical stress acquisition and suggests future research directions. The review highlights a shift in theoretical approaches to cross-linguistic stress acquisition research from the generative framework to the statistical regularity approaches pioneered by Susan Guion Anderson and colleagues, and to more recent approaches focusing on the effects of the phonological encoding and phonetic implementation of lexical stress in the native language on L2 stress perception and production accuracy. For future research on nonnative processing of lexical stress, Tremblay suggests refinements of these phonetic approaches, including testing the limits of the transfer of an existing L1 acoustic cue and of potential cross-domain transfer (i.e., from segmental to suprasegmental) to the processing of L2 lexical stress.

Empirical Progress

Every chapter provides empirical evidence offering new insights as well as challenges to aspects of the L2 speech acquisition process. For instance, while supporting the SLM-r hypothesis that production and perception coevolve (Chapter 1), Kirby and Giang (Chapter 9) present evidence suggesting that native-like articulatory specifications may not be necessary for accurate perception of L2 lexical tones. Ortega-Llebaria, Chu, and Demmans Epp (Chapter 14) challenge the hypothesis that the Interlanguage–Speech Intelligibility Benefit (ISIB) declines with L2 proficiency, using an intelligibility measure based on a task involving form–meaning mappings at the prosodic level. Contrary to some previous findings, Baese-Berk and Bradlow (Chapter 12) report that nonnative speech is more variable in speaking rate than native speech but that a speaker’s first language may not be a source of this variability.

Among the new insights revealed are the observations that language-dependent and language-specific factors, as well as L2 proficiency, modulate the transferability of L1 acoustic cues to the acquisition of L2 prosodic and consonantal features and to perceived degrees of foreign accent (Chapters 6, 8, 11, 13, and 15). In addition, the formation of L2 phonetic categories is a slow process even among early learners (Chapters 7 and 10) and may be influenced by a speaker’s indexical attributes, such as masculinity (Chapter 19). Furthermore, cognitive factors may influence the acquisition and processing of the L2 at the segmental, suprasegmental, and discourse levels (Chapters 16, 17, and 20).
Pedagogical Implications As its predecessor was, the SLM-r remains focused on speech learning at the segmental level in a natural setting; nonetheless, several pedagogical implications for speech learning among adult L2 learners in a classroom setting may be inferred from the model as well as from empirical data from contributing chapters. Only a few are mentioned here. First, owing to differences both in quantity and quality between L1 and L2 input, the SLM-r stipulates that native-like perception and production are virtually unattainable. Thus, the goal of L2 speech learning is not to become indistinguishable from a native speaker but to form a new L2 phonetic

xxvi

Preface

category with acoustic and articulatory specifications that are consistently and reliably distinguishable from those of the closest L1 category. Second, according to the SLM-r, production and perception coevolve, and accurate perception is no longer believed to take precedence over accurate production. As such, production and perception training should proceed in parallel. It should be noted, however, that the two skills draw on the same cognitive resources, so the focus at one time should be on one or the other skill, but not both. Third, both the SLM and the SLM-r maintain that L2 speech learning occurs, not at the abstract phonemic level, but at the “position-sensitive allophonic” level. Thus, exposure to all positional variants of a phoneme is necessary for its mastery. Fourth, though not specified by the SLM or the SLM-r, to detect phonetic divergence between an L2 and the closest L1 sound category, learners’ direct attention is required. That is, learners should be explicitly instructed to allocate their attention to specific L1–L2 phonetic deviations during training. Fifth, different L1 acoustic cues may be transferred to the learning of a novel L2 contrast among learners at different L2 instructional or proficiency levels. For example, it was found that advanced L1 Arabic–L2 learners of English, but not L1 Arabic speakers, approximated native English speakers in their use of amplitude and duration but not of fundamental frequency, despite all three acoustic dimensions being used to signal lexical stress contrast in Arabic. On the other hand, vowel reduction, a phonetic feature not exploited in lexical stress contrast in Arabic, was not transferred to English lexical stress production by either group of Arabic speakers (Chapter 11). Thus, production and perception training materials and methods should be accordingly designed to optimize their outcomes. 
Finally, besides linguistic features, a speaker's indexical properties in L1 speech may also affect the perceptual representations of phonetic categories formed by L2 learners. For example, it was found that the gender typicality of men's voices exerts a stronger influence on how voiceless fricative consonants are categorized by Japanese listeners than by English listeners (Chapter 19). This factor should be taken into consideration when L1 speech materials are chosen for training.

Acknowledgments

I am indebted to all the contributors for making this volume possible, particularly to Professor James Flege for his mentoring and devotion to research on second language speech learning. For years, Flege worked with single-minded dedication to resolve some of the core questions regarding how second language speech is learned. Like its predecessor, his revised Speech Learning Model (SLM-r) will guide research in this area for decades to come. Professor Ocke-Schwen Bohn also deserves special thanks. His tremendous contribution and dedication to the field, beyond the two chapters in this volume, are acknowledged with gratitude.



part i

Theoretical Progress

chapter 1

The Revised Speech Learning Model (SLM-r)

James Emil Flege and Ocke-Schwen Bohn*

Like its predecessor, the revised Speech Learning Model (SLM-r) focuses on the learning of L2 vowels and consonants (or “sounds,” for short) across the life-span. To define the context in which the original SLM (Flege, 1995) developed, we begin by presenting some key studies carried out before 1995. After summarizing the SLM with clarification of some key points, we present the SLM-r. The primary aim of the SLM-r differs from that of its predecessor, which was to “account for age-related limits on the ability to produce L2 vowels and consonants in a native-like fashion” (Flege, 1995, p. 237). The SLM focused on differences between groups of individuals who began learning an L2 before versus after the close of a supposed Critical Period (CP) for speech learning (Lenneberg, 1967). Closure of the CP was regarded as an undesired consequence of normal neurocognitive maturation that arose from diminished cerebral plasticity and a reduced ability to exploit L2 speech input. The SLM-r offers an account for differences between “early” and “late” learners, but its primary aim is to provide a better understanding of how the phonetic systems of individuals reorganize over the life-span in response to the phonetic input received during naturalistic L2 learning.

* The work presented here was supported by grants from the National Institute of Deafness and Other Communicative Disorders. Susan Guion played an important role in this research, and we truly miss her. Special thanks are also due to Katsura Aoyama, Wieke Eefting, Anders Højen, Satomi Imai, Ian MacKay, Murray Munro, Thorsten Piske, Carlo Schirru, Anna Marie Schmidt, Naoyuki Takagi, Amanda Walley and Ratree Wayland. We thank Charles Chang, Olga Dmitrieva, Francois Grosjean, Nikola Eger, Natalia Katushina and Juan Carlo Mora Bonilla for comments on earlier versions of this chapter.

1.1  Work Prior to 1995

A phonemic level of analysis dominated early L2 research. Bloomfield (1933, p. 79) posited that because monolinguals learn to respond only to






distinctive features, they can "ignore the rest of the gross acoustic mass that reaches [their] ears." Hockett (1958, p. 24) defined the phonological system of a language as "not so much a set of sounds as … a network of differences between sounds." Trubetskoy (1939) proposed that the phonological system of the native language (L1) acts like a "sieve" that, in the production of L2 words, passes only the phonetic information needed to distinguish words found in the L1. This approach shifted attention away from the language-specific phonetic details of the L1 to which children attune slowly during infancy and childhood, and it implied that such details might be inaccessible to individuals who learn the same language as an L2. One dissenting voice was that of Brière (1966), who maintained that the relative ease or difficulty of learning specific L2 sounds could only be predicted through "exhaustive" analyses of phonetic details (p. 795). The aim of the contrastive analysis (CA) approach was to identify learning problems that would need to be addressed through instruction in the foreign-language classroom. Its general prediction was that L2 phonemes that do not have a counterpart in the L1 would be difficult to learn whereas those having an L1 counterpart would be relatively easy to learn. The CA approach assumed that pronunciation errors observed in L2 speech were the result of faulty articulation (i.e., production), not the result of incorrect targets resulting from faulty perception. Just as importantly, the CA approach ignored the fact that the "same" sound found in two languages may differ greatly at the phonetic level. Another problem for the CA approach was that allophonic distributions of the "same" phonemes found in two languages often differ crosslinguistically, making point-by-point comparisons of phonemes difficult or meaningless (Kohler, 1981).
The phonemes in a contrastive analysis were defined primarily in terms of a static articulatory description of a single canonical variant. This ignored the fact that an important part of L1 acquisition is the integration of conditioned variants of a phoneme (Gupta & Dell, 1999; Song, Shattuck-Hufnagel, & Demuth, 2015). In addition, the CA approach tacitly assumed that L2 learners make errors even after having received adequate input and that knowing how the L1 is learned is irrelevant for an understanding of L2 speech learning. The one-time, one-size-fits-all CA approach soon fell from favor. As Lado had already noted in 1957, not all individual speakers of a single L1 make the same errors when speaking the same L2. Flege and Port (1981) showed that the distinctive features needed to distinguish L1 phonemes cannot be recombined freely to produce an L2 sound that is not present






in the L1. Most importantly, as noted in 1953 by Weinreich, the nature and extent of the mutual "phonic interference" between the sounds in a bilingual's two languages depends, in addition to phonological differences, on factors such as language dominance, demography (e.g., ethnicity, gender, age), years of L2 use, and the domains in which the L1 and L2 are used (pp. 83–110; see also Grosjean, 1998). In the 1970s, research began examining purely phonetic aspects of L2 segmental production and perception. Much of this early work focused on the voice-onset time (VOT) dimension in the production and perception of word-initial English stops by native Spanish speakers. For example, Elman, Diehl, and Buchwald (1977) examined the identification of naturally produced consonant-vowel syllables initiated by stops having VOT values in the "lead," "short-lag," or "long-lag" ranges. Spanish and English monolinguals labeled stops having short-lag VOT as /p/ and /b/, respectively. The Spanish-English bilinguals who participated were asked to label the same stimuli in two "language sets" intended to induce a Spanish or English perceptual processing mode. The effect of the language set manipulation was small for most participants, but five of the 31 bilingual participants, referred to as "strong" bilinguals, were far more likely to identify short-lag stops as /b/ in the English set than in the Spanish set. These five participants seem to have been early learners (R. Diehl, personal communication, June 3, 1990). Many accepted the hypothesis by Lenneberg (1967) that a critical period (CP) exists for speech learning and that it closes at about the age of 13 years as the result of normal neurological maturation. Lenneberg (1967, p. 176) also suggested that following the close of a CP, L2 learners cannot make "automatic use" of L2 input from "mere exposure" to the input as children do when learning their L1.
To evaluate the "automatic use" hypothesis, Flege and Hammond (1982) recruited native English (NE) university students in Florida. All of them had been previously exposed to Spanish-accented English and, in addition, were taking a Spanish class taught by a native Spanish speaker who spoke English with a strong Spanish accent. The students were asked to read English sentences containing two variable test words (The X is on the Y) with a feigned Spanish accent. The amount of prior exposure to Spanish-accented English the students had received was estimated by counting the number of expected "Spanish accent substitutions" (e.g., [vel] or [veɪl] for bail, [big] for big) they produced in the English test words. VOT was measured in additional test words beginning in /t/.




Members of both the higher- and lower-exposure groups shortened VOT in the direction of values typical of Spanish, but only the higher-exposure group produced significantly shorter VOT values than did the members of a control group who read the sentences without special instruction. Importantly, the students who produced significantly shortened VOT values when speaking English with a feigned Spanish accent did not accomplish this by using a short-lag English /d/ to produce the /t/-initial test words. Flege and Hammond (1982) concluded that monolingual adults are able to access cross-language phonetic differences through mere exposure to speech after the supposed closure of a CP for speech learning. The results indicated that NE monolinguals with substantial exposure to Spanish-accented English could not only detect phonetic differences between standard and Spanish-accented English, they could also store that information in long-term memory and use it to guide production (see also Reiterer, Hu, Sumathi, & Singh, 2013). Other research examined production and perception of the VOT dimension in L2 learners. Williams (1977) found that the "phoneme boundary" between stops such as /b/ and /p/ occurred at significantly longer VOT values for adult English than Spanish monolinguals. Flege and Eefting (1986) reported that this also held true for monolingual children. They also reported that, within languages, phoneme boundaries occurred at longer average VOT values for adults than for eight- to nine-year-old children. Indeed, the phoneme boundaries of NE 17-year-olds occurred at significantly shorter VOT values than those of NE adults, suggesting that attunement to L1 phonetic-level details may continue into the late teenage years. Not surprisingly, Flege and Eefting (1986) observed cross-language production differences that mirrored the above-mentioned perception differences in phoneme boundaries.
Both monolingual NE adults and children produced /p t k/ with longer VOT values than age-matched native Spanish (NS) monolinguals and, within both languages, adults produced longer VOT values than children did. Research began to focus on providing an explanation for differences observed for early and late learners. Flege (1991) compared VOT in stops produced by groups of NS adults differing in age of arrival in the United States (means = 2 vs. 20 years). These early and late learners also differed in percentage English use (means = 82 vs. 66 percent). The early learners produced English stops with native-like VOT values, both individually and as a group. The average values obtained for late learners, on the other






hand, were intermediate to the values observed for Spanish and English monolinguals. This finding suggests that the speech learning ability of the late learners may have been partially compromised, perhaps due to the closing of a critical period. The results of a speech imitation study (Flege & Eefting, 1988) suggested that NS early learners of English can form new long-lag phonetic categories for English /p t k/. This finding led Flege (1991) to suggest that the accurate production of VOT in English stops by NS early but not late learners arose from the inability of the late learners to form new phonetic categories. Had this been true, it would have provided a solid empirical basis for the CP proposed by Lenneberg (1967). However, the interpretation suggested by Flege (1991) was problematic for two reasons. First, the VOT values produced by individual NS late learners ranged from Spanish-like to English-like. If a CP exists, it should affect everyone in much the same way. Second, the results for the late learners may have reflected learning in progress rather than the performance that might have been evident had they received as much L2 phonetic input as monolingual NE children need to achieve an adult-like production of VOT. For this chapter, we estimated the years of full-time equivalent (FTE) English input that had been received by the NS participants in Flege (1991). These values were calculated by multiplying years of residence in the United States by proportion of English use (self-reported by each participant as a percentage). The mean estimated FTE years of English input was much higher for the early than the late learners (means = 17.2 vs. 9.2 FTE years). Thus, if category formation is a slow process requiring input that gradually accumulates over many years of daily use, the early–late difference observed by Flege (1991) might simply have been the result of input differences, not the loss of capacity by the late learners to form new phonetic categories.
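The FTE estimate is simple arithmetic. A minimal sketch follows; the function name is ours, and the residence values are back-computed for illustration from the reported percentages and FTE means, not figures stated in the chapter:

```python
def fte_years(residence_years: float, percent_l2_use: float) -> float:
    """Full-time-equivalent (FTE) years of L2 input: years of residence
    multiplied by the self-reported proportion of L2 use."""
    return residence_years * (percent_l2_use / 100.0)

# Illustrative only: at the reported 82% vs. 66% English use, residences
# of roughly 21 and 14 years reproduce the cited 17.2 vs. 9.2 FTE years.
early = fte_years(21, 82)  # ~17.2
late = fte_years(14, 66)   # ~9.2
```

The point of the calculation is that an early learner with a long residence and high L2 use can accumulate nearly twice the input of a late learner over a comparable interval, which is why LOR alone can mislead.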
FTE years of L2 input may be a somewhat better estimate of quantity of L2 input than LOR alone, but it says nothing regarding the quality of L2 input. Early learners acculturate more rapidly following immigration than late learners do (Cheung, Chudek, & Heine, 2011; Jia & Aaronson, 2003). Acculturation involves the establishment of social contacts with native speakers of the target L2. This means that the NS late learners tested by Flege (1991) were likely to have been exposed more often to Spanish-accented English than the early learners were, and so they may have been exposed to shorter VOT values in English words overall than the early learners and NE monolinguals.




The effect of foreign-accented input was observed in research examining NS early learners who learned English in an environment where Spanish-accented English was the rule rather than the exception. The mean VOT values obtained by Flege and Eefting (1987) for early learners in Puerto Rico were intermediate in value, in both production and perception, to values obtained for English and Spanish monolinguals, and so were similar to the values obtained for NS late learners of English in Texas. The difference between the early learners tested in Puerto Rico and Texas suggested that the quality of L2 input may matter more than the age of first exposure to an L2. Research in the period we are considering also showed that the magnitude of cross-language phonetic differences matters. Flege (1987) examined the production of French vowels by NE speakers who had lived in France for an average of 10 years. Unlike French, English has no /y/ and its /u/ differs acoustically from the /u/ of French. The three vowels of interest (French /y/ and /u/, English /u/) differ primarily in second-formant (F2) frequency, and NE speakers generally hear the French /y/ as their English /u/ (Levy, 2009a). Flege (1987) hypothesized that NE speakers would be able to produce the "new" French vowel, /y/, more accurately than the "similar" French /u/. In fact, the difference between the NE speakers and French monolinguals in terms of the crucial acoustic phonetic dimension, F2 frequency, was nonsignificant for /y/ but not for /u/. Flege (1992) further evaluated the new-similar distinction by examining the production of English vowels by native Dutch (ND) adults. The English vowel in hit (/ɪ/) was classified as "identical" to a Dutch vowel based on previously published acoustic data and on reports that the auditory differences between English /ɪ/ and the closest Dutch vowel are likely to go undetected by native Dutch-speaking listeners.
The English vowel in hat (/æ/) was classified as "new" because it occupies a portion of the vowel space not exploited by Dutch and because earlier research suggested that /æ/ is learnable. The vowels in heat, hoot, hot and hut (/i/, /u/, /ɑ/, /ʌ/) were each categorized as "similar" to a Dutch vowel. The results obtained by Flege (1992) for the "new" vowel in hat supported the view that ND late learners can form new phonetic categories for certain L2 vowels. However, the results obtained for English vowels classified as "similar" to a Dutch vowel did not support the hypothesis that native versus nonnative differences persist for L2 vowels that are similar but not identical to an L1 vowel. Two of the English vowels classified as similar were produced quite well but two others were produced poorly. Flege (1992, p. 162) concluded that no principled method existed






for distinguishing "new" from "similar" L2 sounds, and so the trichotomy "new-similar-identical" was not included in the SLM (Flege, 1995). Flege, Munro, and Skelton (1992) evaluated the effect of L2 experience by recruiting two groups each of native Mandarin (NM) and Spanish (NS) speakers. All had begun learning English as adults, but the same-language groups differed according to LOR in the United States (Mandarin means = 0.9 vs. 5.5 years; Spanish means = 0.4 vs. 9.0 years). The study focused on the production of word-final English /t/ and /d/ because these stops are not found in the final position of Mandarin or Spanish words. The authors hypothesized that the nonnatives with a relatively long residence in the United States would treat the word-final stops as "new" sounds and so produce them accurately. NE-speaking listeners were more successful overall in identifying the nonnative speakers' productions of /t/ than /d/ (means = 82 vs. 65 percent correct). Acoustic analyses revealed that the NM and NS speakers produced smaller acoustic phonetic differences between /t/ and /d/ (longer vowels before /d/, higher F1 offset frequency for /d/, more closure voicing in /d/, longer closure for /t/) than the NE speakers did. Stops produced by both "experienced" and "inexperienced" nonnatives were significantly less intelligible (means = 68 percent for both groups) than stops produced by the NE speakers. Within languages, the LOR-defined groups did not differ significantly. Of the 40 NM and NS speakers tested, just six produced word-final stops that were as intelligible as the stops produced by the NE speakers. One possible explanation for the frequent errors in nonnative speakers' final stop productions identified by Flege et al. (1992) is that adult learners of an L2 lack the capacity to learn new forms of speech. An alternative explanation is that the errors may have been the result of inadequate input.
Monolingual NE children need approximately five years of full-time English input in order to produce /t/ and /d/ accurately in word-final position (e.g., Smith, 1979). The nonnative speakers designated as "experienced" had an average of just 4.2 FTE (full-time equivalent) years of English input and were likely to have often heard other nonnatives produce the word-final English stops inaccurately. The same two explanations might be applied to the findings of Flege and Davidian (1984), who used a picture-naming task to elicit the production of /p t k/ and /b d ɡ/ in the final position of English words. Among the participants tested were immigrants from China and Mexico (12 each) who had all learned English as adults and had lived in Chicago for 4.2 years on average (range = 0.2 to 7.5 years). Unlike members of the




NE comparison group (n = 12), these late learners omitted (means = 2.3 vs. 3.4 percent), devoiced (means = 29.5 vs. 43.0 percent) and spirantized (means = 0.8 vs. 19.3 percent) the word-final English stops. The differing frequency of error types observed for the two L1 groups was readily understandable with reference to the inventory of word-final obstruents found in their L1s, but overall, they produced only about half of the stops without error. All 24 participants were enrolled in English as a second language classes at a local community college where they certainly heard one another, and other immigrants outside the classroom, producing final English stops with the same errors. At least some of them may have learned to accurately produce the wrong phonetic “models.” In summary, L2 speech research carried out prior to 1995 gradually began to focus on a phonetic rather than a phonemic level of analysis. Language-specific phonetic differences between the L1 and L2 became the focus of speech production and perception research. The existing research made clear that (1) the L1 phonetic system “interferes” with L2 speech learning; (2) some L2 sounds are learned better than others; (3) L2 sounds without an L1 counterpart might be learned more effectively than those with an L1 counterpart; and (4) the quantity and quality of L2 input that L2 learners receive may exert an important influence on phonetic-level learning. It appeared possible that early learners generally produce and perceive L2 sounds more effectively than late learners do because they, but not late learners, might be able to form new phonetic categories for L2 sounds. This inference was at odds, however, with evidence that late learners can gain access to L1–L2 phonetic differences, store the detected differences in long-term memory, and then use the stored perceptual representations to guide articulation.

1.2  The Speech Learning Model (SLM)

Flege (1995) observed that at a time when "children's sensorimotor abilities are generally improving, they seem to lose their ability to learn the vowels and consonants of an L2" (p. 234). We now know that earlier is generally better than later for those learning an L2, but only in the long run. Adults outperform children in the early stages of naturalistic L2 acquisition, but adult-child differences tend to recede over time until early learners outperform late learners (e.g., Jia, Strange, Wu, Collado, & Guan, 2006; Snow & Hoefnagel-Höhle, 1979). DeKeyser and Larson-Hall (2005) attributed the age-performance "cross-over" to age-related cognitive changes. If applied to L2 speech






learning, their hypothesis would mean that children learn L1 speech implicitly through massive exposure to the sounds making up the L1 phonetic inventory. Also by hypothesis, the efficacy of implicit learning mechanisms would be reduced following the close of a critical period, causing L2 learners to lose the ability to make "automatic" use of input from "mere exposure" to the sounds making up the L2 phonetic inventory (Lenneberg, 1967, p. 176). The ability to make effective use of ambient-language phonetic input is the acknowledged prerequisite for L1 speech acquisition (e.g., Kuhl, 2000). According to a "cognitive change" hypothesis (DeKeyser & Larson-Hall, 2005), late learners fare well in early stages of L2 learning through the use of explicit learning mechanisms, but such mechanisms are not well suited for the slow process of attunement to the language-specific details defining L2 sounds and their differences from L1 sounds. Early learners, on the other hand, learn L2 phonetic details well but slowly via implicit learning mechanisms. The SLM provided a way to understand the cross-over paradox without positing a loss of neural plasticity or a change in the cognitive mechanisms needed for speech learning. As mentioned earlier, research has shown (e.g., Flege & Hammond, 1982; see also Reiterer et al., 2013) that even late learners can gain access to the language-specific details defining L2 sounds. The SLM proposed that L2 phonetic input is accessible and that L2 learners of all ages exploit the same mechanisms and processes they used earlier for L1 speech learning, including the ability to create new phonetic categories for certain L2 sounds based on the experienced distributions of tokens defining those L2 sounds. The SLM focused on the development of language-specific phonetic categories and the phonetic realization rules used to implement those categories motorically.
The model assumed a generic three-level perception-production framework, illustrated in Figure 1.1, that envisages a flow of information from a sensory motor level to a phonetic category level to lexico-phonological representations (see, e.g., Evans & Davis, 2015). A precategorical, auditory level of processing is evident only in specific perceptual testing conditions and is imperceptible to listeners (e.g., Werker & Logan, 1985), whereas the distinction between the phonetic category and lexico-phonological levels is more readily evident. For example, listeners can “hear” (i.e., perceive) a sound in the speech stream even when the sound has been replaced by silence or noise, thereby removing any phonetic-level information (e.g., Samuel, 1981). Evidence



[Figure 1.1 appears here. Box labels recovered from the figure: LANGUAGE UNDERSTANDING / word recognition; phonological codes in lexical representations; language-specific phonetic categories ("inner speech"); language-specific realization rules; preattentive sensory codes; sensory motor codes.]

Figure 1.1  The generic three-level production–perception model assumed by the Speech Learning Model.

that sounds are categorized at a phonetic level is provided by the fact that monolingual listeners can recognize unfamiliar names heard for the first time. Phonetic categories have two important functions. First, they define the articulatory goals used by language-specific phonetic realization "rules" in producing speech (but see Best, 1995, for a different perspective). More specifically, the realization rules "specify the amplitude and duration of muscular contractions that position the speech articulators in space and time" (Flege, 1992, p. 165). Second, phonetic categories are used to access segment-sized units of speech that, in turn, are used to activate word candidates during lexical access. Listeners are usually not consciously aware of phonetic categories as they process speech because phonetic-level changes do not change meaning. However, language-specific phonetic categories are sufficiently rich in detail that they permit the detection of a fluent speaker as nonnative in as little as 30 ms (Flege, 1984). Moreover, phonetic-level differences can be detected when listeners focus attention on such differences (Best & Tyler, 2007; Pisoni, Aslin, Perey, & Hennessy, 1982).






1.2.1  Cross-Language Mapping

The SLM focused on sequential bilinguals who already possess a functioning phonetic system when first exposed to an L2. For such individuals, L2 phonetic learning is influenced importantly by the perceived relationships between the sounds making up the L2 phonetic inventory and those present in the L1 phonetic inventory. The SLM proposed that L1 and L2 sounds are perceptually linked to one another through a cognitive process called "interlingual identification," which operates automatically and subconsciously. When first exposed to the L2, learners interpret the "full range" (Flege, 1995, p. 241) of L2 sounds they encounter on the phonetic surface of the L2 as being instances, some better than others, of existing L1 phonetic categories. The SLM did not specify how much L2 input learners will need in order to establish stable patterns of interlingual identification. We speculate that the amount of exposure needed to do so may depend on the complexity of various L2 sounds, operationalized by the sounds' frequency of occurrence in the world's languages and the time needed by monolingual children to learn them.

1.2.2  Position-Sensitive Allophones

According to the SLM, the mapping of L2 to L1 sounds occurs at the level of position-sensitive allophones, not phonemes. This design feature was based on the observation (Kohler, 1981) that allophonic distributions of phonemes vary across languages and, within a single language, allophones may differ greatly in their articulatory and acoustic specification. Moreover, the relative importance of multiple acoustic cues to the categorization of a sound may differ according to position (see, e.g., Dmitrieva, 2019, for English word-medial vs. word-final stops).
Research has shown that learning one position-sensitive allophone of an L2 phoneme does not guarantee success in producing and perceiving other allophones of the same L2 phoneme (e.g., Mitterer, Reinisch, & McQueen, 2018; Mochizuki, 1981; Pisoni, Lively, & Logan, 1994; Rochet, 1995; Strange, 1992; Takagi, 1993). Iverson, Hazan, and Bannister (2005) found that training native Japanese (NJ) speakers to identify /r/ and /l/ in initial position increased identification accuracy for liquids in that position but not for medial liquids or liquids in initial clusters. To take another example, the presence of voiced and voiceless consonants in word-initial position in the L1 does not permit learners to produce the "same" consonants in the




word-final position of L2 words if such sounds do not also appear in the final position of L1 words (e.g., Flege et al., 1992; Flege & Davidian, 1984; Flege & Wang, 1989).

1.2.3  Age of First Exposure

The L1 phonetic system necessarily "interferes" with the learning of L2 sounds because sounds encountered on the phonetic surface of an L2 "map onto," that is, are perceptually linked to one or more L1 sounds. The perceptual links created for various sound pairings may vary in rapidity and consistency and may evolve as learners gain experience in the L2. Flege (1995, p. 263) suggested that L2 learners may only "gradually discern" the existence of phonetic differences between an L2 sound and the closest L1 sound and that when this happens "a phonetic category representation may be established for the new L2 sound which is independent of representations established previously for L1 sounds." By hypothesis, the likelihood of cross-language phonetic differences being discerned between pairs of L1 and L2 sounds decreases as a function of the age of first exposure to the L2. This change was attributed to increasing use of "higher-order invariants" as age of first exposure to the L2 increases. This was expected to make it increasingly difficult for L2 learners to "pick up" detailed phonetic-level information regarding L2 speech sounds (Flege, 1995, p. 266). The SLM age hypothesis was meant as an alternative to the critical period hypothesis, which claimed that age-related effects in L2 speech learning arise from a reduction of neurocognitive plasticity.

1.2.4  L2 Experience

The observation that L2 learners "gradually discern" L1–L2 phonetic differences implied changes over time as a function of L2 experience. The term "experience" has been used in diverse and inconsistent ways. In some early research, it was used to differentiate groups of participants who had studied an L2 in school from groups of participants who had not (e.g., Gottfried, 1984).
A difference in schooling, of course, might be expected to co-occur with differences in metalinguistic awareness (e.g., Levy & Strange, 2008, p. 151). Flege (1995) used the term “experience” with reference to conversational experience as was typical for L2 research at the time. More specifically for the SLM, the term experience was meant to indicate the cumulative speech input learners have received while communicating verbally in the L2, usually in face-to-face conversations.






Many researchers (including us) have used the variable "length of residence" (LOR) to index L2 experience because it can be readily obtained from a written questionnaire. Researchers have reasonably supposed that, for example, a German who had lived in the United States for 10 years would have heard and spoken English far more than a German who had lived there for just 1 year. It later became evident that LOR may at times provide a misleading index of quantity of L2 phonetic input. This is because LOR specifies only an interval of time, not what occurred during that interval. Not all immigrants begin using their L2 immediately upon arriving in the host country (e.g., Flege, Munro, & MacKay, 1995a, table I) or use their L2 on a regular basis even after years of residence there (Moyer, 2009, p. 162). The results of Flege and Liu (2001) suggested that LOR may provide a useful estimate of quantity of L2 input only for immigrants who have both the opportunity and the need to use their L2 regularly. Self-estimated percentage use of an L2 usually increases as LOR increases, but the relationship between the two is nonlinear. Important individual exceptions exist due to the circumstances of everyday life, for example, a rapid increase in L2 use when an immigrant marries a native speaker of the L2 or a decrease in L2 use following marriage to a fellow L1 speaker. Even more importantly, the LOR variable provides no insight into the quality of L2 input. The importance of input quality for speech perception can be seen in research with monolingual children. Many children who learn English as an L1 hear a single dialect of English. Such children recognize words in their native language less efficiently when hearing an unfamiliar dialect of their L1 or foreign-accented English (e.g., Bent, 2014; Buckler, Oczak-Arsic, Siddiqui, & Johnson, 2017).
Perceiving speech optimally requires adapting phonetic categories and their real-time use in recognizing and producing words to what has been heard and seen in the past, even the recent past. It is unknown at present how much L2 input is needed to form phonetic categories in an L2 and optimally adapt them to everyday use. This may depend, at least in part, on the uniformity of the L2 speech input that has been received. At least some children are exposed to a single variety of their native language, but children and adults who learn an L2 rarely if ever get uniform input. It is usually impossible for L2 learners, at least immigrants to a predominantly L2-speaking country, to avoid using their L2 in “mixed conversations.” A mixed conversation is one in which L2 learners converse with one or more monolingual native speakers of the target L2 and at least one other nonnative speaker. The L2 must be used by all participants due to the presence of the monolingual native speaker(s). In



James Emil Flege and Ocke-Schwen Bohn

such a context, L2 learners are likely to hear their L2 spoken with a foreign accent, often their own kind of foreign accent, by the other nonnative speaker(s) present.

1.2.5  Categories, Not Contrasts

The SLM focused on individual sounds in the L1 and L2 phonetic subsystems of L2 learners rather than on contrasts between pairs of sounds. The focus on individual, position-sensitive allophones was based in part on the assumption that listeners match the properties of an incoming sound (e.g., the [θ] in think) to a representation stored in long-term memory because in real-time speech processing it takes too long to eliminate multiple alternative candidates (e.g., “not [f],” “not [s],” “not [v]”). The categorization of speech sounds is considered the basis of speech perception in monolinguals (Holt & Lotto, 2010) and, in our view, this holds true for the perception of L2 sounds given that the L1 and L2 sounds are perceptually linked via the mechanism of interlingual identification.

Categorization and identification are not the same thing (Nosofsky, 1986; Smits, Sereno, & Jongman, 2006). A stimulus sound is categorized by computing its relative distance along multiple dimensions to multiple long-term memory representations, requiring generalizations across discriminably distinct tokens within categories. The identification of a sound requires that a decision be made regarding a sound’s unique identity and requires discrimination between categories. L2 and cross-language research has demonstrated the methodological importance of the distinction between categorization and identification tasks. For example, Bohn and Flege (1993) assessed how Spanish monolinguals perceived English stops in a two-alternative forced-choice identification task. The stimuli were multiple natural tokens of Spanish stops (prevoiced /d/, short-lag /t/) and English stops (short-lag /d/, long-lag /t/). 
The Spanish monolinguals consistently identified the long-lag English /t/ tokens as “t” even though they surely did not have an English /t/ category. Instead, they appear to have made use of an “X-not-X” strategy. Given the need to choose one of two response alternatives, “d” or “t,” they selected the “t” response for long-lag stimuli because these stimuli were clearly not instances of their Spanish /d/ category.

Many researchers have used N-alternative forced-choice tests, attempting to offer every reasonably possible response alternative. For example, MacKay, Meador, and Flege (2001) examined the perception of






English consonants by native Italian speakers. In addition to a written label for the target sounds (e.g., “s” for word-initial /s/ tokens), four other response alternatives selected on the basis of confusion matrices from earlier research were offered. While this reduces the problem, it does not guarantee that the choices provided included the response alternative that adequately represented what every individual participant perceived. Moreover, the proliferation of response alternatives may be a source of confusion, especially for nonnatives whose spelling-to-sound correspondences are not native-like.

The methodology used to assess perception is crucial for the evaluation of how “native-like” L2 learners are judged to be. Iverson and Evans (2007, p. 2852) noted, for example, that a two-alternative forced-choice (2AFC) test can reflect perceptual sensitivity as much as categorization. Díaz, Mitterer, Broersma, and Sebastián-Gallés (2012) evaluated L2 speech perception by determining the percentage of L2 learners who obtained scores falling within the native-speaker range. More of them met this criterion when L2 perception was evaluated using a categorization task than by an identification or lexical decision task. For these authors, a categorization task provides information regarding an “acoustic phonetic analysis” whereas the latter two tasks “involve lexical processes” (p. 680).

The responses obtained in a 2AFC test are often used to compute phoneme “boundaries.” Escudero, Sisinni, and Grimaldi (2014, p. 1583) proposed that in order to perceive L2 vowels more accurately, L2 learners may in certain instances need to “shift the boundary” between L1 categories. As we see it, a shift in boundary locations, if observed, is epiphenomenal, the result of learning-induced changes in the phonetic categories themselves. Boundary locations may vary as a function of experimental design. 
For example, Benders, Escudero, and Sjerps (2012) showed that phonetic context and stimulus range effects were smaller when listeners were offered five rather than just two response alternatives. The perception of L2 sounds has often been evaluated by examining how accurately pairs of L2 sounds can be discriminated. Best and Tyler (2007) summarize research showing that monolinguals tested in a laboratory setting usually discriminate two foreign sounds better if the two sounds map onto distinct L1 sounds (a 2-to-2 mapping pattern) than if they map onto a single L1 sound (a 2-to-1 mapping pattern). If the same monolinguals were later to learn the language from which the foreign sounds examined in a laboratory experiment were drawn, discrimination of the two foreign (now L2) sounds might be expected to improve over time. This could be attributed to changes in cross-language mapping




patterns (Best & Tyler, 2007; Tyler, 2019). Within the SLM framework, changes in cross-language mapping are important because such changes may lead to the formation or modification of phonetic categories. On this view, it is the use of distinct phonetic categories that results in improved discrimination. For pairs of L2 sounds that previously exhibited a 2-to-1 mapping pattern, this will require the formation of a new phonetic category.

1.2.6  L1 Phonetic Development Is Slow

The SLM proposed that the mechanisms and processes used to establish the elements making up the L1 phonetic system, including the ability to form phonetic categories, remain intact and available for L2 learning. Developmental research indicates that L1 phonetic categories develop slowly. Infants begin attuning to the phonetic categories that will eventually constitute their L1 phonetic inventory even before they have a lexicon, perceptually grouping sets of acoustically similar sounds into “equivalence classes” (Kuhl, 1983). The language-specific phonetic categories that guide production and perception evolve from equivalence classes and continue to develop long after children have established a phonemic inventory for their L1 in their first few years of life (see, e.g., Hazan & Barrett, 2000; Lee, Potamianos, & Narayanan, 1999).

Phonetic categories have traditionally been described as points in an n-dimensional perceptual space. Children’s attunement to the phonetic categories of their L1 is based on long-term exposure to statistically defined distributions of sound tokens. Each sound token is processed as an instance of a category, leaving a trace in episodic memory (e.g., Hintzman, 1986). The effect of L1 phonetic category development can be seen in research examining children’s ability to categorize L1 sounds, which continues to improve at least until the age of 15 years (see, e.g., Johnson, 2000; Markham & Hazan, 2004; Neuman & Hochberg, 1983). 
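The idea that categories emerge from statistically defined input distributions can be given a minimal computational sketch. Everything in the toy simulation below — the use of VOT as the sole cue, the Gaussian distributions, the category labels, and the running-mean prototype — is our illustrative invention, not a claim of the SLM-r or of the developmental studies cited above:

```python
import random

random.seed(1)

# Toy input: VOT values (ms) drawn from two statistically distinct
# distributions, standing in for short-lag /d/ and long-lag /t/ tokens.
tokens = [("d", random.gauss(15, 5)) for _ in range(200)] + \
         [("t", random.gauss(70, 10)) for _ in range(200)]

# "Equivalence classes" as running means: each category's prototype is
# simply the mean of all tokens grouped with it so far, so the prototype
# tracks the statistics of the input distribution.
prototypes = {}
counts = {}
for label, vot in tokens:
    n = counts.get(label, 0)
    prototypes[label] = (prototypes.get(label, 0.0) * n + vot) / (n + 1)
    counts[label] = n + 1

def categorize(vot):
    # A new token is assigned to the nearest prototype.
    return min(prototypes, key=lambda c: abs(prototypes[c] - vot))

print(round(prototypes["d"]), round(prototypes["t"]))  # roughly 15 and 70
print(categorize(20), categorize(65))
```

The point of the sketch is only that prototypes fall out of the input statistics: feed the learner a different distribution (e.g., foreign-accented VOT values) and the prototypes, and hence the category boundaries, land somewhere else.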
As monolingual children are exposed to an ever wider range of variant realizations of an L1 category, that is, more broadly tuned distributions, they become better able to recognize words spoken in an unfamiliar L1 dialect (Nathan, Wells, & Donlan, 1998; Bent, 2018) and to recognize L1 words spoken with a foreign accent (Bent, 2014; Bent & Holt, 2018). The end point of L1 phonetic category development has not yet been established but it surely extends beyond the age of seven years (Bent, 2014; Newton & Ridgway, 2015). Indeed, there is evidence that the fine-tuning of L1






categories extends over the entire life-span (e.g., Harrington, Palethorpe, & Watson, 2000). As children mature, they gradually produce L1 sounds with less variability and reduce the overlap in their productions of adjacent L1 categories (Lee et al., 1999).

The L1 phonetic categories that monolingual children develop are multidimensional, cue-weighted representations of sound classes residing in long-term memory. Each category is mediated by a narrow range of “best exemplars” (or “prototype”) that specifies the ideal weighting of a set of independent and continuously varying properties (perceptual cues). Prototypes define for listeners how realizations of a category ought to sound when produced by themselves (self-hearing) and by others. They provide a reference point that listeners can use when asked, in a laboratory experiment, to rate the members of an array of stimuli for category goodness (e.g., Miller, 1994; Smits et al., 2006). The use of prototypes enables listeners to reliably choose the “best exemplar” of a particular vowel category from an array of stimuli (e.g., Iverson & Evans, 2007, 2009; Johnson, Flemming, & Wright, 1993) and to detect “distortion” or “foreign accent” in productions of a specific sound they have been asked to evaluate auditorily (Flege, 1992, p. 170; Lengeris & Hazan, 2010). Phonetic category prototypes also play a role in the categorization of speech sounds. Iverson and Evans (2007) found, for example, that nonnative speakers’ accuracy in categorizing naturally produced English vowels varied as a function of how closely their perceptual prototypes for English vowels resembled those of NE speakers. 
1.2.7  L2 Phonetic Category Formation

The SLM proposed that L2 learners of any age, like infants exposed to what will become their L1, form auditory equivalence classes derived from statistical properties of the input distributions to which they have been exposed while using the L2 (e.g., Anderson, Morgan, & White, 2003; Maye, Werker, & Gerken, 2002). In L1 monolinguals, the equivalence classes evolve into language-specific phonetic categories without interference from another phonetic system. For individuals learning an L2, on the other hand, the formation and elaboration of new phonetic categories entails disrupting L2-to-L1 perceptual links as cross-language phonetic differences are discovered (“discerned”). The specification of how multiple cues are weighted for an L2 phonetic category is language-specific and so it must be learned. For example,




NE-speaking listeners use both spectral cues (i.e., formant frequencies) and duration to categorize English vowels (Flege, Bohn, & Yang, 1997). However, the frequency (spectral) cues are more important for NE-speaking listeners than the temporal cues are because temporal cues are not reliably present, or are substantially reduced, when English is spoken rapidly. In Swedish, on the other hand, duration is a more important cue to the categorization of certain vowels than are spectral cues (McAllister, Flege, & Piske, 2003).

L2 category formation is understood less well than L1 category formation, but it seems reasonable to think that L2 category formation takes at least as long as L1 category formation does. This is because the distributions of sounds defining each L2 category are likely to be less uniform than the distributions encountered by monolingual children. L2 learners, especially adults, are likely to be exposed to diverse dialects of the target L2 as well as to multiple foreign-accented renditions of the target L2 (Bohn & Bundgaard-Nielsen, 2009).

1.2.8  Factors Determining L2 Category Formation

According to the SLM, L2 learners of all ages retain the capacity to form new phonetic categories but will not do so for all L2 sounds differing auditorily from the closest L1 sound. By hypothesis, a new phonetic category will be formed for an L2 sound when learners discover (discern) phonetic differences between the L2 sound and the L1 sound(s) that is (are) closest in phonetic space to it. The SLM proposed that discerning L1–L2 phonetic differences, and thus the likelihood of a new category being formed for an L2 sound, depends on two factors. First, the greater the perceived phonetic dissimilarity between an L2 sound and the closest L1 sound(s), the easier it will be for L2 learners to discern cross-language phonetic differences. 
Second, the older L2 learners are when they are first exposed to an L2, the less likely they will be to discern cross-language phonetic differences. No consensus existed in 1995 regarding the best way to quantify degree of perceived cross-language dissimilarity, nor did an objective criterion exist for how great a perceived cross-language phonetic difference must be in order to initiate the process of category formation. Valid and reliable measures of cross-language dissimilarity are, of course, essential for testing SLM predictions. We will return to this issue again later in the chapter.
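One family of approaches to quantifying perceived cross-language dissimilarity treats sounds as points in a cue space and computes a listener-specific weighted distance between them. The sketch below is purely illustrative: the cue values, the weight profiles, and the weighted Euclidean metric itself are our inventions for exposition, not a metric proposed by the SLM-r or by the studies cited above. It shows only why cue weighting matters: listeners with different weightings perceive different degrees of dissimilarity for the same pair of sounds:

```python
# Weighted Euclidean distance in a normalized cue space (illustrative).
def cue_distance(sound_a, sound_b, weights):
    # sound_a, sound_b: dicts mapping cue names to normalized values
    return sum(weights[c] * (sound_a[c] - sound_b[c]) ** 2
               for c in weights) ** 0.5

# Hypothetical normalized cue values for an L2 vowel and the closest
# L1 vowel: they differ a little spectrally and a lot in duration.
l2_vowel = {"spectral": 0.8, "duration": 0.9}
l1_vowel = {"spectral": 0.6, "duration": 0.3}

# An English-like listener weights spectral cues heavily; a Swedish-like
# listener weights duration heavily (weights are invented).
english_weights = {"spectral": 0.8, "duration": 0.2}
swedish_weights = {"spectral": 0.2, "duration": 0.8}

d_en = cue_distance(l2_vowel, l1_vowel, english_weights)
d_sv = cue_distance(l2_vowel, l1_vowel, swedish_weights)
print(round(d_en, 3), round(d_sv, 3))
```

Because the duration difference dominates, the duration-weighting listener perceives the pair as more dissimilar than the spectral-weighting listener does; under the SLM logic, the former would be more likely to discern the cross-language difference and form a new category.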






1.2.9  Few If Any Perfect Learners

According to the SLM, the categories that L2 learners form for certain L2 sounds will likely never be identical to those of native speakers, but this does not in itself demonstrate a loss or diminution of the capacity for learning speech. New L2 phonetic categories are expected to differ from monolinguals’ if L2 learners have received less phonetic input than monolingual children need to reach adult-like levels of performance, or if the input distributions upon which L2 learners base their L2 categories differ from the distributions to which monolingual native speakers have been exposed. The latter is expected for virtually all L2 learners, especially those exposed to multiple dialects of the L2 and to foreign-accented renditions of the L2 by other nonnative speakers.

The SLM proposed that a new L2 phonetic category formed for an L2 sound might also differ from the phonetic categories of monolingual native speakers if the relative importance of the multiple features defining an L2 sound, as spoken by native speakers, differed from the relative importance of the same features in corresponding L1 sounds, or if the L2 sound was, at least in part, defined by some feature “not exploited in the L1” (Flege, 1995, pp. 241–243). Finally, L2 learners might differ from monolingual native speakers because of interactions between sounds making up the L1 and L2 phonetic subsystems. The SLM proposed that such interactions occur because L1 and L2 sounds exist in a common “phonological space.” In retrospect, we recognize that use of this term was a misnomer that caused confusion and will instead use the term “common phonetic space.” The categories making up a monolingual’s phonetic system tend to occupy positions in phonetic space that augment correct categorization. Operation of this language universal increases intercategory distances in phonetic space (see, e.g., Lindblom, 1990). According to the SLM (Flege, 1995, p. 
242), the elements making up the phonetic subsystems of bilinguals self-organize in the same way as do the sounds of languages or specific dialects of a language. As a result, a bilingual’s new L2 category might “deflect away” from a category in the L1 phonetic subsystem to augment intercategory distances in the common L1–L2 phonetic space of bilinguals.

1.2.10  L2 Effects on L1 Categories

L2 speech research prior to 1995 focused exclusively on the L2, but it is now clear that understanding how L2 sounds are learned also requires an




examination of how L1 sounds are produced and perceived. One well-known finding that compels this approach pertains to global foreign accent. The strength of an L2-inspired foreign accent in the L1 varies inversely as a function of an L1-inspired foreign accent in the L2 (Yeni-Komshian, Flege, & Liu, 2000; see also Flege, 2007).

The SLM proposed a mechanism that might account, at least in part, for L2-on-L1 effects. As mentioned, learners do not form new phonetic categories for all L2 sounds. Some L2 sounds are so similar to an L1 sound that an L1-for-L2 substitution would go unnoticed by monolingual speakers of the target L2 (Flege, 1992). What about L2 sounds that differ audibly from the closest L1 sound but for which a new category is not formed? The SLM proposed that “composite” (compromise) L1–L2 categories may develop. The perceptual link between an L2 sound and the closest L1 sound remains intact, and a composite L1–L2 category develops that is based on the combined distribution of sounds defining the L1 and L2 categories.

The SLM predicted that when a composite L1–L2 category develops, the L1 category may shift in the direction of the L2 category (e.g., MacKay, Flege, Piske, & Schirru, 2001). The magnitude of the shift, and whether it will be auditorily detectable by monolingual speakers of the L1, depends on the nature of the combined distributions. Specifically, the magnitude of the shift in a bilingual is expected to vary as a function of how much the L1 and L2 have been used cumulatively by the bilingual over the course of his or her life, how much the L1 and L2 have been used recently, and how dissimilar a bilingual perceives pairs of L1 and L2 sounds to be. 
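The composite-category mechanism can be given a toy numerical form. In the sketch below, the composite category mean is simply a use-weighted mixture of the L1 and L2 input distributions; the specific VOT values and the linear weighting scheme are our simplifications for illustration, not quantitative claims of the SLM:

```python
def composite_vot(l1_mean, l2_mean, l2_use):
    # Composite category mean as a mixture of the two input
    # distributions, weighted by relative L2 use (0..1).
    return (1 - l2_use) * l1_mean + l2_use * l2_mean

# Hypothetical mean VOT values (ms) for perceptually linked voiceless
# stops: a short-lag Romance-like L1 value and a long-lag English-like
# L2 value.
L1_VOT = 20
L2_VOT = 70

# Greater relative L2 use pulls the composite toward the L2 value.
# Because ONE composite category guides production in both languages,
# this shift surfaces in the L1 as well (an L2-on-L1 effect).
print(composite_vot(L1_VOT, L2_VOT, 0.2))  # mostly-L1 user: 30.0
print(composite_vot(L1_VOT, L2_VOT, 0.8))  # mostly-L2 user: 60.0
```

On this toy view, changes in recent language use simply move the mixture weight, which is consistent with the drifting VOT values reported later in the chapter for bilinguals whose language use patterns changed.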
1.2.11  Perception before Production

The fact that L2 learners often speak with a foreign accent (e.g., Flege, 1984) and produce errors in specific vowels and consonants (e.g., Flege & Munro, 1994) gave rise to the widespread belief that L2 production errors arise because of an age-related reduction in ability to learn new forms of articulation. The SLM challenged this view, proposing that production errors often have a perceptual basis. For the SLM, the accurate perception of an L2 sound is a necessary but not sufficient condition for its accurate production.

The SLM proposed that perceptual phonetic categories formed for L2 sounds and the realization rules used to motorically implement them will “align,” as in L1 acquisition. By hypothesis, the production of an L2 sound will “eventually correspond” to the properties specified in its phonetic






category representation (Flege, 1995, p. 239). The SLM did not provide an estimate for how long an alignment of the perceptual information stored in a phonetic category and the information encoded in motoric representations used to realize (produce) a phonetic category will take.

1.3  The Revised Speech Learning Model (SLM-r)

The SLM-r aims to account for how phonetic systems reorganize over the life-span in response to the phonetic input received during naturalistic L2 learning. Some aspects of the original SLM (Flege, 1995) have been carried forward to the SLM-r without change but other aspects are new. For example, the SLM-r continues to focus on sequential learning of an L2 following establishment of an L1 phonetic system rather than on the simultaneous learning of two languages in infancy and early childhood (see, e.g., Werker & Byers-Heinlein, 2008, for the latter), but the SLM “age hypothesis” has been replaced by a new hypothesis which, if correct, may help explain age-related effects on L2 speech learning.

The SLM was radical in its simplicity and this is even more the case for the SLM-r. If one needed a two-word summary of the SLM-r approach, those two words would be “no change”: there is no change in how the vowels and consonants found in an L1 and in an L2 are learned. The core premises of the SLM-r are that (1) the phonetic categories which are used in word recognition and to define the targets of speech production are based on statistical input distributions; (2) L2 learners of any age make use of the same mechanisms and processes to learn L2 speech that children exploit when learning their L1; and (3) native versus nonnative differences in L2 production and perception are ubiquitous not because humans lose the capacity to learn speech at a certain stage of typical neuro-cognitive development but because applying the mechanisms and processes that functioned “perfectly” in L1 acquisition to the sounds of an L2 does not yield the same results.

A difference in L1 and L2 learning outcomes will necessarily arise because
1. L1 sounds initially “substitute” for L2 sounds because the L2 sounds are automatically linked to sounds in the L1 phonetic inventory;
2. preexisting L1 phonetic categories interfere with, and sometimes block, the formation of new phonetic categories for L2 sounds; and
3. the learning of L2 sounds is based on input that differs from the input that monolingual native speakers of the target L2 receive when learning the same sounds.




The SLM-r shares the view of other theoretical models (e.g., Best & Tyler, 2007) that L2 speech learning is profoundly shaped by perceptual biases induced by the L1 phonetic system. The SLM-r has yet to be evaluated empirically. We think, however, that if furnished with adequate empirical data, the SLM-r will be able to provide an account of how these biases change as a function of exposure to L2 sounds.

We acknowledge that not all perceptual biases which L2 learners bring to the task of L2 learning, and which change as an L2 is learned, can be attributed to the L1 or L2. An interesting avenue for future research not developed in this chapter is the interaction of language-specific and universal perceptual biases. To paraphrase Nam and Polka (2016), “the phonetic landscape … is an uneven terrain” (p. 65), with certain classes of sounds having a special status for all language learners irrespective of previous language experience. For example, research inspired by the Natural Referent Vowel framework of Polka and Bohn (2003, 2011) has shown that vowels which are peripheral in the acoustic/articulatory vowel space have a special status, and there is some evidence from recent L1 and L2 research that consonants with an alveolar place of articulation, and stop consonants in general, have a special status for both L1 and L2 learners (Bohn, 2020).

1.3.1  Focus of the SLM-r

The focus of the SLM-r has changed in two important respects.

1.3.1.1  Early versus Late Learners

The SLM-r no longer focuses on differences between early and late learners. This is because research since 1995 has shown that the critical period (CP) hypothesis proposed by Lenneberg (1967) for L2 speech learning does not offer a plausible explanation for the age-related effects routinely seen in L2 speech learning research. Our reasoning is as follows.

First, differences between late learners and native speakers of the target L2 cannot be attributed to a loss of neural plasticity by the late learners. 
We now know that the adult brain retains considerable plasticity for processes relevant to L2 speech production and perception (e.g., Callan et al., 2003; Callan et al., 2004; Ylinen et al., 2010; Zhang & Wang, 2007).

Second, the CP hypothesis was based on an evaluation of foreign-accented L2 production that was misleading and incomplete. To be sure, immigrants who arrive in a predominantly L2-speaking country after puberty usually speak their L2 with stronger foreign accents than those






who arrived earlier. However, many early (“precritical period”) learners speak their L2 with a detectable foreign accent even after decades of primary L2 use, and the strength of their foreign accents will vary at least in part as a function of language use patterns (Flege, 2019). Furthermore, late learners’ foreign accents grow stronger following the supposed closure of a CP. As well, many immigrants who are observed to speak their L2 with a foreign accent either have not yet received enough L2 input, or have received too much foreign-accented L2 input (or both), to have reached their full potential in L2 pronunciation and perception.

Third, Lenneberg (1967) believed that foreign languages have to be “taught and learned through a conscious and labored effort” (p. 176) if L2 learning begins following closure of a CP. The subjective impression of relatively great effort, presumably in comparison to children learning their L1, is surely true for those attempting to learn an L2 in a foreign-language classroom. For those learning an L2 by immersion (e.g., immigrants), the sensation of “effort” can probably be attributed to the fact that L2 learners, especially late learners, usually have smaller lexicons than native speakers and deploy phonetic categories that are not optimally tuned to the L2 speech sounds they hear when attempting to access L2 words (Song & Iverson, 2018).

Finally, the CP hypothesis rested on the assumption that L2 learners can no longer gain “automatic access” to the language-specific phonetic properties of L2 sounds from “mere exposure” to the L2 following the closure of a CP (Lenneberg, 1967, p. 176). In fact, late learners can and do gain access to the phonetic details defining L2 sounds without special tutoring or using cognitive processes not previously exploited for L1 acquisition (e.g., de Leeuw & Celata, 2019; Flege & Hammond, 1982). 
1.3.1.2  Not Just “End State” Learners

The SLM-r no longer focuses on individuals who are highly experienced in the L2. We now recognize that it is virtually impossible for L2 learners to produce and perceive an L2 sound exactly like mature monolingual native speakers of the target L2. That being the case, it is no longer of theoretical interest to determine if the L2 performance of a particular learner is or is not indistinguishable from that of L2 native speakers. Most individuals who participate in L2 research have typically, perhaps inevitably, received different input than the members of a presumably representative native-speaker comparison group (Schmidtke, 2016). Input differences may well lead to subtle native versus nonnative differences, even in highly proficient and experienced L2 learners and even for L2




sounds that should be easy (Broersma, 2005). The simple fact of being bilingual may prevent the so-called mastery of L2 sounds (see Hopp & Schmid, 2013, for discussion). Early learners have been considered by many to be rapid and perfect learners of L2 speech when, in fact, early learners often differ from native speakers when examined closely. For example, Højen and Flege (2006) tested native Spanish (NS) adults who had learned English as children on the discrimination of three especially difficult pairs of English vowels. As expected, NS monolinguals discriminated the three pairs at near-chance levels. The early learners obtained substantially higher scores than the NS monolinguals did but, as a group, they differed significantly from NE speakers for two of the three pairs of English vowels. Other research suggests that the magnitude of differences between early learners and native speakers depends, at least in part, on differences in the relative frequency of L1 and L2 use (e.g., Bosch & Ramon-Casas, 2011; Flege, 2019; Mora, Keidel, & Flege, 2010, 2015). Persistent differences between native and nonnative speakers can be seen for some L2 learners in research examining comprehension. Nonnative speakers are less successful in recognizing L2 words than are native speakers, especially in nonideal listening conditions. This is due, at least in part, to nonnatives’ use of phonetic categories that differ from the phonetic categories deployed by native speakers (e.g., Garcia Lecumberri, Cooke, & Cutler, 2011; Imai, Walley, & Flege, 2005; Jongman & Wade, 2007). Such differences are evident even in some early learners who speak their L2 without an obvious foreign accent (Rogers et al., 2006), reflecting the fact that immigrants who are tested in L2 research are likely to have had substantially less exposure to L2 words than age-matched monolingual native speakers (Schmidtke, 2016) regardless of when they began learning their L2. 
As we now see it, the earlier SLM focus on “end state” learning was mistaken because it is necessary to examine early stages of L2 speech development in order to understand the process of L2 phonetic category formation. The earlier focus on highly experienced L2 learners assumed that, at some point, L2 speech learning reaches an asymptote or “ultimate” level of attainment. Even though the notion that L2 competence and performance fossilize is widely accepted (Han & Odlin, 2006) it has never been tested for L2 speech learning as far as we know. That being the case, we have decided to briefly summarize here the results of unpublished research carried out in Canada. In 1992, Murray Munro recorded a total of 24 NE speakers and 240 native Italian (NI)






speakers who had lived in Canada for 15 to 44 years (mean = 32.5 years). The data obtained were reported by Flege, Munro, and MacKay (1995b), who measured VOT in English words beginning in /p t k/ produced by the 264 participants. In 2003, Jim Flege and Ian MacKay rerecorded 20 NE and 150 NI speakers from the 1992 sample to determine if the NI participants had learned to produce English stops more accurately over the 10.5-year interval between the recordings. All participants were recorded in the same location using identical procedures, speech materials, and equipment; only the testers differed.

As expected, the VOT values obtained in 1992 and 2003 for NE speakers were much the same. As can be seen in Figure 1.2(a), this also held true for most but not all of the NI speakers. An inspection of the scatterplot revealed that 20 NI participants produced English stops with longer VOT values in 2003 than 1992 whereas 20 others showed the opposite pattern. When these 40 NI speakers were removed, the 1992–2003 VOT correlation obtained for the NI speakers increased from r(148) = 0.82 to r(108) = 0.95. The VOT values of the NI speakers who increased (n = 20) or decreased (n = 20) VOT over time were compared to the values obtained from the 20 NE speakers. The values obtained for the 60 participants, when submitted to a (3) Group × (2) Time ANOVA with repeated measures on time (1992 vs. 2003), yielded a significant interaction [F(2,57) = 96.6, p < 0.01]. This was because, as can be seen in Figure 1.2(b), the effect of time was nonsignificant for the NE speakers but significant in opposite directions for the two NI groups (p < 0.01).

Figure 1.2  The mean VOT (ms) in word-initial tokens of /p t k/ produced in English words in 1992 and 2003 (a) by native Italian (NI) speakers in Canada and (b) by 20 native English (NE) speakers and 20 NI speakers each of whom reported using English either more or less in 2003 compared to 1992. The error bars in (b) bracket ±1 SEM.

The two NI groups did not differ significantly in LOR. However, those who increased VOT in 2003 compared to 1992 reported using English significantly more frequently in 2003 than those who decreased VOT [means = 76.5 vs. 63.2 percent, F(1,38) = 4.2, p < 0.05] even though the two groups did not differ significantly in self-reported percentage English use in 1992 (p > 0.10). These results suggest that the language use patterns of immigrants usually do stabilize, but this does not place an upper limit on the human capacity for learning speech when phonetic input changes. The change in percentage English use by long-time NI immigrants in Canada was likely to have been the result of important life changes such as remarriage, a job change, relocation to a new neighborhood, or some combination of life changes. It is also possible, of course, that the changes in self-reported L2 use were accompanied by changes in how often the NI speakers heard English pronounced with an Italian foreign accent.

How much time and input are needed to induce VOT production changes in the L2? The results of Sancier and Fowler (1997) suggested that two months of input may suffice. These authors examined a bilingual who spent alternating periods in the United States and Brazil. She produced shorter VOT values in Portuguese than English stops, and her VOT values in both languages were shorter following a “several months stay” in Brazil than a comparable stay in the United States. The SLM-r interpretation of these results, to be elaborated below, is that the late learner studied by Sancier and Fowler (1997) had established a composite L1–L2 phonetic category for perceptually linked Portuguese and English voiceless stops. Her composite phonetic categories for voiceless stops, which specified the articulatory goals for the production of stops in English and Portuguese, were updated regularly to reflect recent input.

1.3.2  SLM-r Hypotheses

1.3.2.1  Perception and Production Coevolve

The SLM proposed that the accuracy of L2 segmental perception places an upper limit on the accuracy with which L2 sounds are produced. This hypothesis has been replaced by the hypothesis that L2 segmental
Her composite phonetic categories for voiceless stops, which specified the articulatory goals for the production of stops in English and Portuguese, were updated regularly to reflect recent input.

1.3.2  SLM-r Hypotheses

1.3.2.1  Perception and Production Coevolve

The SLM proposed that the accuracy of L2 segmental perception places an upper limit on the accuracy with which L2 sounds are produced. This hypothesis has been replaced by the hypothesis that L2 segmental



The Revised Speech Learning Model (SLM-r)



production and perception coevolve without precedence. The SLM-r “co-evolution” hypothesis arises from the observation of inconsistencies in L2 research and from evidence that a strong bidirectional connection exists between production and perception. This new evidence requires adding an arrow connecting phonetic-level production to perception in Figure 1.1.

Mitterer, Reinisch, and McQueen (2018) observed that from the standpoint of spoken word recognition, there is no need to assume that production and perception must be very similar. These authors noted, for example, that native Dutch (ND) speakers differ in the extent to which they produce Dutch /r/ as an approximant in postvocalic position and also in terms of what kind of trilled /r/ they use in prevocalic position. They observed that even though all ND speakers are able to recognize both [rot] and [ʀot] variants of the Dutch word for red, a given speaker was “unlikely to use both variants” (p. 90) in production.

The earlier SLM “upper limit” hypothesis was supported by several kinds of evidence. First, infants show an effect of ambient language input on perception before showing ambient language effects on production (Kuhl, 2000), and at least some of children’s segmental production errors can be attributed to an inability to discriminate a sound produced in error from the correct target sound (Eilers & Oller, 1976). Second, nonnative speakers can scale overall degree of perceived foreign accent in their L2 much like native speakers even though they themselves speak their L2 with a strong accent (Flege, 1988; MacKay, Flege, & Imai, 2006). Third, perceptual training leads to an improved production of both consonants and vowels (e.g., Bradlow et al., 1999; Lengeris & Hazan, 2010) in the absence of explicit training on production.
The primary source of support for the SLM upper limit hypothesis, however, was the observation of significant positive correlations between measures of segmental production and perception accuracy (Flege, 1999). The strength of such correlations seemed to vary according to the commensurability of the production and perception measures (e.g., Flege, 1999; Baker & Trofimovich, 2006; Kim & Clayards, 2019). However, several observations raised doubts about the correlational evidence. First, the presence of near-mergers, that is, the systematic production of differences that cannot be readily perceived (e.g., Labov, 1994), indicates that production and perception are not completely symmetrical. Second, some studies failed to yield significant positive correlations, and some even yielded inverse correlations (e.g., Darcy & Krüger,



James Emil Flege and Ocke-Schwen Bohn

2012; Peperkamp & Bouchon, 2011; Sheldon & Strange, 1982). Most importantly, the observation of significant positive production–perception correlations did not demonstrate causality. The correlations could just as easily be interpreted to mean that production accuracy places an upper limit on perceptual accuracy as the reverse (see, e.g., Best, 1995).

The evidence now at hand suggests that a strong bidirectional connection exists between production and perception. It is important to recognize, of course, that the correspondence between the two is never perfect, even in monolinguals. For example, Shultz, Francis, and Llanos (2012) examined NE speakers’ use of VOT and F0 onset frequency in the production and perception of words beginning in /b/ and /p/. Although participants made greater use of VOT than F0 onset frequency in perception, a significant inverse relation was observed between the two dimensions in production, leading the authors to conclude that the goals for “efficient” production and perception differ (p. EL99).

Johnson, Fleming, and Wright (1993) observed what they called a “hyperspace” effect. These authors asked native English listeners to select what they considered to be the best examples of various English vowel categories from a two-dimensional array of vowel stimuli differing in F1 and F2 frequencies. The F1 and F2 values in the participants’ production of English vowels were also analyzed acoustically. The NE participants tended to choose, as best exemplars of a vowel, stimuli having more peripheral frequency values than they themselves produced for the same vowel (see also Frieda, Walley, Flege, & Sloane, 2000; Newman, 2003). Here production and perception measures were correlated but differed in absolute value.
It is plausible that although the targets for the articulation of speech sounds are defined by perceptual representations (e.g., Tourville & Guenther, 2011), a bidirectional and co-equal link between the two is actively maintained (Chao et al., 2019; Perkell, Guenther et al., 2004; Perkell, Matthies et al., 2004). This is consistent with the observation (Reiterer et al., 2013) that the regulation of motor and sensory processes used in speech production and perception is localized in “partly overlapping, heavily interconnected brain areas” (p. 9). Guenther, Hampson, and Johnson (1998) noted that brain areas specialized for speech production are active during speech perception, and vice versa. These authors hypothesized that “auditory target” regions which develop during L1 acquisition guide articulation in space and time. In their view, articulatory gestures are planned as “trajectories in auditory perceptual space” that map onto “articulator movements.” This coupling



permits auditory goals to be achieved via “motor equivalent” gestures in which “constriction locations and degrees” may vary (p. 611). This capacity enables individual monolingual NE speakers, for example, to produce /r/ to the satisfaction of other NE-speaking listeners using very different articulatory gestures (e.g., Mielke, Baker, & Archangeli, 2016; Westbury, Hashi, & Lindstrom, 1998).

Evidence for the existence of bidirectional links has been provided by the results of perturbation studies. Houde and Jordan (1998; see also Houde & Jordan, 2002) altered the vocal output of adult NE monolinguals as they spoke so that the output differed from what they intended to say. Most participants managed to adapt their articulation so that their vocal output again corresponded to what they intended to say. When the auditory distortion was removed, the participants returned to their normal mode of production. Nasir and Ostry (2009) obtained a corollary finding for production. Most NE participants were able to compensate for unexpected perturbations of jaw position while producing /æ/. The better the compensation, the more the participants’ identifications of stimuli in an /ε/-/æ/ continuum were observed to shift before versus after the perturbations were administered.

The linkage between vocal production and perception appears to be uniquely human. Schulze, Vargha-Khadem, and Mishkin (2012) found that humans, unlike monkeys, are extremely good at storing lasting memories of speech sounds in long-term memory. They attributed this capacity to the evolution of robust and rapid links between the auditory system, localized in the posterior temporal region, and an oromotor sensory system in the ventrolateral frontal region of the human cortex. Schulze et al. (2012) examined the mimicry of words, nonwords, and environmental sounds by 36 normal young adults. They also examined participants’ auditory recognition memory for the same speech and nonspeech stimuli.
The participants were often unable to recognize an auditory stimulus they could not reproduce (mimic) or label. The authors hypothesized that a representation in long-term memory cannot be created unless a novel speech sound is “pronounceable,” that is, likely to “activate the speech production system automatically and subvocally” (p. 7123).

1.3.2.2  L2 Input

The SLM proposed that L2 learners gradually “discern” L1–L2 phonetic differences as they gain experience using the L2 in daily life, and that the accumulation of detailed phonetic information with increasing exposure



to statistically defined input distributions for L2 sounds will lead to the formation of new phonetic categories for certain L2 sounds. The SLM did not provide a method for measuring how phonetic information accumulates, nor how much phonetic input is needed to precipitate the formation of new L2 phonetic categories. The model simply pointed to years of L2 use as a metric of the quantity of L2 input. As mentioned, however, immigrants’ length of residence (LOR) in a predominantly L2-speaking environment is problematic because it does not vary linearly with the phonetic input that L2 learners receive and because it provides no insight into the quality of L2 input that has been received.

It is universally accepted that infants and preliterate children attune to the phonetic categories of the ambient language through “exposure to a massive amount of distributional information” (Aslin, 2014, p. 2; see also Kuhl et al., 2005). For the SLM-r, input is also crucial for the formation of language-specific L2 phonetic categories and composite L1–L2 phonetic categories. The SLM-r defines phonetic input as the sensory stimulation associated with L2 speech sounds that are heard and seen during the production of L2 utterances by others in meaningful conversations.

Input, which has both quantitative and qualitative dimensions, has proven difficult to measure. For now, we simply observe that full-time equivalent (FTE) years of L2 input provides a somewhat better estimate of input than LOR alone does. Years of FTE input is calculated by multiplying LOR by the proportion of L2 use (derived from questionnaire estimates of percentage L2 use). Consider, for example, two immigrants who have both lived in an L2-speaking country for 20 years but report using their L2 with unequal frequencies (90 vs. 30 percent of the time). The former has 18.0 FTE years of English input, the latter just 6.0 years.
Such a difference is likely to be crucial inasmuch as the former, but not the latter, immigrant has probably received as much input as monolingual children need to reach adult-like performance levels for certain L2 sounds.

Quality of input has been largely ignored in L2 speech research even though it may well determine the extent to which L2 learners differ from native speakers. As mentioned earlier, native Spanish (NS) adults who learned English in childhood but often heard Spanish-accented English were found to produce English /p t k/ with VOT values that were too short for English (as spoken by most English monolinguals), thereby resembling NS speakers who learned English as adults in a place where Spanish-accented English was not prevalent.
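The FTE calculation described above is simple arithmetic (LOR multiplied by the proportion of self-reported L2 use); a minimal sketch in Python, using the chapter's own worked example (the function name is ours, not the chapter's):

```python
def fte_years(lor_years: float, percent_l2_use: float) -> float:
    """Full-time equivalent (FTE) years of L2 input: length of residence
    multiplied by the proportion of self-reported L2 use."""
    return lor_years * (percent_l2_use / 100.0)

# The chapter's example: two immigrants, each with LOR = 20 years,
# reporting 90 vs. 30 percent L2 use.
print(fte_years(20, 90))  # 18.0
print(fte_years(20, 30))  # 6.0
```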



More fine-grained measures of the quantity and quality of input are clearly needed. Promising new methods for obtaining better measures of both are presented in Chapter 3. In addition to obtaining accurate input measures, it is important to note that the context in which input is assessed may also matter. For example, the time of day when L2 input is received may influence how well the input is consolidated and thus indirectly influence speech learning (Earle & Myers, 2015).

1.3.2.3  Perceived Cross-Language Dissimilarity

The SLM-r maintains the earlier SLM hypotheses that learners subconsciously and automatically relate L2 sounds to L1 phonetic categories, and that the greater the perceived phonetic dissimilarity of realizations of an L2 phonetic category from the realizations defining an L1 category, the more likely it is that a new phonetic category will be formed for the L2 sound.

As far as we know, the consistency of L2-to-L1 mapping patterns has not been studied longitudinally. It is probably the case, however, that mapping patterns stabilize as L2 phonetic input is received. Iverson and Evans (2007) examined cross-language mapping patterns before and after five vowel training sessions. The data they presented in their Table I indicated that nonnative participants were more consistent in their labeling of English vowels in terms of L1 categories in 20 of 23 possible instances at the second compared to the first time of observation.

Cross-language mapping patterns may vary when phonetic contexts alter the realization of an L2 sound in a language-dependent manner (Levy & Strange, 2008, p. 153), for example, English vowels spoken in different consonantal contexts (Bohn & Steinlen, 2003; Levy & Law, 2009). Levy (2009b, p. 2680) found that NE-speaking listeners perceived the French vowel /y/ as “most similar” to the American English vowel /u/ more often when the French vowel occurred in an alveolar than in a bilabial context (see also Levy, 2009a).
It remains to be determined how best to measure cross-language phonetic dissimilarity. The importance of doing so is widely accepted but a standard measurement procedure has not yet emerged (for discussions, see Bohn, 2002; Strange, 2007). Cross-language dissimilarity must be assessed perceptually rather than acoustically because acoustic measures sometimes diverge from what listeners perceive (e.g., Levy & Strange, 2008, p. 153; Johnson et al., 1993). The most common procedure currently being used in L2 research is to obtain two judgments of a single stimulus (e.g., Iverson & Evans, 2009; Strange, Bohn, Nishi, & Trent, 2005). Tokens of an L2 sound are randomly presented for classification



(labeling) in terms of L1 categories in an N-alternative forced-choice format. After labeling a token, listeners then rate it for degree of perceived dissimilarity from the L1 category just used to label the token. Many researchers have integrated labeling and rating data in an attempt to provide a metric of perceived L1–L2 phonetic “distance,” and thus to determine which L1 sound is closest in phonetic space to a target L2 sound. For example, Iverson and Evans (2007) multiplied the proportions of trials in which various L1 vowels had been used to label an English vowel by the average rating obtained for the various L2 vowels of interest on a continuous scale ranging from “close” to “far away.” The authors noted that, in a research design intended to compare groups of learners, this metric was “poor at predicting” whether various English vowels had been “learned” or not learned (p. 2852). Cebrian (2006) used a similar technique to assess the perceived phonetic distance between L1 (Catalan) and L2 (English) vowels, finding little difference between the measures obtained for a group tested in Spain (Catalonia) and the measures obtained for native Catalan participants who were long-time residents of Canada.

Both findings just mentioned appear to contradict the SLM-r proposal that as L2 learners gain experience in the L2 they will become better able to discern L1–L2 phonetic differences which will, in turn, increase the likelihood of a new L2 category being formed. We suspect that results such as these would not have been obtained had better measures of the perceived phonetic dissimilarity of an L2 sound from the closest L1 sound been obtained.

The label-then-rate technique is an example of what Tulving (1981) called an “ecphoric” task inasmuch as a physically presented stimulus must be compared to information stored in episodic memory. As we see it, this technique for assessing L1–L2 phonetic distance is problematic for several reasons.
The mean L1–L2 dissimilarity ratings calculated for various members of a group will necessarily be based on varying subsets of the L2 tokens that have been presented. This is because individuals may map L2 sounds onto L1 categories in differing ways. The process of classification requires participants to access information stored in long-term memory before rating a token for dissimilarity. Filling the interval between the classification and rating responses may influence the ratings, perhaps in diverse ways for various individual participants. Finally, this method can only be used with participants who are literate and can confidently use the labels provided by the experimenter.

An alternative method for assessing perceived dissimilarity is what Tulving (1981) would call a “perceptual” similarity task. Flege (2005a)



recommended assessing perceived cross-language phonetic dissimilarity by presenting, in a single trial, pairs of L1 and L2 sounds for ratings using an equal appearing interval scale. For this technique to be used effectively, the L1 and L2 sounds under evaluation must be represented by tokens produced by multiple monolingual native speakers of the learners’ L1 dialect, and multiple native speakers of the L2 dialect or variety being learned. As well, the L1 and L2 sounds under investigation should be represented by tokens representing a wide range of variants in a specific phonetic context (e.g., Lengeris, 2009, p. 141) rather than “best exemplars.”

From the perspective of the SLM-r, dissimilarity ratings must be obtained at an early stage of L2 learning if they are to serve as a predictor of whether a new category will eventually be formed. This is because the rated dissimilarity of L1–L2 sound pairs is likely to increase when a new category is formed for the L2 sound (Flege, Munro, & Fox, 1994, figure 5; see also Bohn & Ellegaard, 2019).

1.3.2.4  The Category Precision Hypothesis

According to Flege (1992), language-specific phonetic categories are characterized by a narrow range of “good” exemplars located in phonetic space within a perceptual “tolerance region.” Tokens falling slightly outside the tolerance region may be heard as intended but will nonetheless be judged to be distorted or foreign-accented instances of the category (see Flege, Takagi, & Mann, 1995, figure 4).
It is plausible that the good exemplars of a phonetic category (1) are found near the center of gravity of the statistical distribution of tokens of the category that an individual has encountered (Chao, Ochoa, & Daliri, 2019), (2) define the core acoustic properties of the category and their weighting in a way that maximizes categorization accuracy (Holt & Lotto, 2006), and (3) are deployed by listeners as a collective referent when they consciously rate the accuracy of production of various tokens of a phonetic category in a laboratory experiment (e.g., Miller, 1994) and when they subconsciously perceive degree of phonetic dissimilarity of a pair of L1 and L2 sounds in ordinary conversations.

As monolingual children mature, their L1 phonetic categories develop. The SLM proposed that L1 phonetic category development may impact L2 speech learning. More specifically, the development of L1 phonetic categories may make it progressively less likely for children and adolescents to discern cross-language phonetic differences and thus to form new phonetic categories (Flege, 1995, p. 266). The SLM hypothesis made explicit reference to the chronological age of L2 learners at the time of first exposure to an L2. As we now see it, the



SLM “age” hypothesis was problematic because it lacked specificity and because it is not possible to dissociate the state of development of learners’ L1 phonetic categories from their overall state of neurocognitive development at the time of first exposure to an L2. Consider, for example, two hypothetical participants, A and B, who are 38 years and 46 years of age when tested but were 8 and 16 years of age when first exposed to their L2. Participant A will surely produce and perceive L2 sounds more accurately than participant B after 30 years of L2 use. Such a difference could be attributed to a putative difference in neurocognitive maturation between the two participants when they were first exposed to their L2 (Lenneberg, 1967) or to a putative difference in the state of development of their L1 phonetic categories (Flege, 1995).

The SLM-r has replaced the “age” hypothesis with the “L1 category precision” hypothesis. According to the category precision hypothesis, the more precisely defined L1 categories are at the time of first exposure to an L2, the more readily the phonetic difference between an L1 sound and the closest L2 sound will be discerned and a new phonetic category formed for the L2 sound. The SLM-r operationalizes category precision as the variability of acoustic dimensions measured in multiple productions of a phonetic category. It should be noted, of course, that variability in the realization of phonetic categories that are adjacent in phonetic space will be related to the magnitude of intercategory distances in phonetic space. The source(s) of intersubject differences in category precision is (are) unknown at present. However, the SLM-r regards category precision as an endogenous factor that is potentially linked to individual differences in auditory acuity, early-stage (precategorical) auditory processing, and auditory working memory.
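The operationalization just given, category precision as token-to-token variability on a measured acoustic dimension, can be sketched as follows. The VOT values below are invented for illustration and come from no actual study:

```python
from statistics import stdev

# Hypothetical word-initial /t/ VOT measurements (ms) for two talkers.
talker_a = [72, 75, 74, 73, 76, 74, 75, 73, 74, 75]   # low variability
talker_b = [55, 90, 68, 82, 60, 95, 70, 88, 58, 84]   # high variability

def category_precision(tokens):
    """SD across multiple productions of a category; a smaller SD
    indicates a more precisely defined category in SLM-r terms."""
    return stdev(tokens)

print(round(category_precision(talker_a), 1))  # 1.2
print(round(category_precision(talker_b), 1))  # 14.6
```

In a multidimensional case (e.g., F1 and F2 of a vowel) the same idea generalizes to per-dimension SDs or to the dispersion of tokens around the category centroid.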
Cross-language phonetic research has focused on language-specific differences in the phonetic categories found in various languages. Importantly, the phonetic categories developed by individual monolingual speakers of a single language can differ as well. Consider, for example, the production and perception of word-initial tokens of English /p t k/. All NE adults produce these stops with long-lag VOT values in word-initial position, but the exact values that individuals typically produce vary substantially (Theodore, Miller, & DeSteno, 2009) even in the apparent absence of dialect differences (Docherty, Watt, Llamas, Hall, & Nycz, 2011). Similarly, individual differences exist in adults’ production of L1 vowels and these differences remain stable over time (Heald & Nusbaum, 2015).



As monolingual children mature, their production of L1 sounds generally becomes less variable (e.g., Kent & Forner, 1980). For example, Lee et al. (1999) observed that the normalized variability of vowel formant frequencies continues to decrease in children until at least 14 years of age. This developmental change in production is accompanied by changes in vowel perception, specifically, an increase in the steepness of slopes in identification functions (Hazan & Barrett, 2000; Walley & Flege, 1999). Importantly, however, individual differences in production variability continue to exist in adulthood. For example, variability in the production of VOT in English stops by NE children generally decreases until about the age of 12 years, but even NE adults may show differing degrees of variability in VOT production.

Heald and Nusbaum (2015) observed variability in formant frequency values in vowels produced by NE adults. We analyzed the standard deviations associated with the means of 63 formant frequency values (9 data samples, 7 English vowels) obtained from five NE females (Heald & Nusbaum, 2015, tables S6–S8). We found that one of the women, participant 3, produced vowel formant frequencies with significantly smaller SDs (Bonferroni-adjusted p < 0.001) than did the remaining four female participants. This held true for F1 frequencies (mean SD = 22.1 vs. 24.1 to 35.0), for F2 (mean SD = 58.1 vs. 87.8 to 155.7), and for F3 (mean SD = 95.0 vs. 136.0 to 158.4). Chao et al. (2019) also found that within-category vowel production variability differed substantially among NE adults and was strongly related to their category boundaries in a vowel identification task.

Perkell, Guenther et al. (2004; see also Perkell, Matthies et al., 2004) examined the perception and production of English vowels by NE-speaking adults.
The participants whose productions were more precise, that is, showed relatively little within-vowel variability and relatively large between-vowel distances, showed finer discrimination abilities. The authors suggested that a relatively great auditory sensitivity is associated with a relatively narrow target region in the realizations of vowel categories and this, in turn, is associated with relatively great precision in producing a vowel. Similar results were obtained by Franken, Acheson, McQueen, Eisner, and Hagoort (2017), who examined the production and discrimination of Dutch vowels by 40 ND adults. Vowel production precision was defined as relatively little within-category variability and relatively great between-category distances. Once again, vowel category precision was associated with relatively great auditory sensitivity. Lengeris and Hazan (2010) found that individual differences in category precision that were observed in the L1 were also evident in the L2.



The authors indexed individual differences in perceptual precision by analyzing the slopes of bilinguals’ identification functions. Those who were most consistent (precise) when identifying L1 (Greek) vowels were also most consistent when identifying L2 (English) vowels.

Previous research provides some support for the SLM-r category precision hypothesis, which will need to be evaluated in future research. Baker, Trofimovich, Flege, Mack, and Halter (2008) examined the interlingual identification of English and Korean vowels. Native Korean (NK) adults were more likely than NK children to identify English vowels in terms of a single Korean category. NK children identified an English vowel with a wider variety of Korean categories. The authors did not assess category precision, but it is likely that the adults’ categories were generally more precise than those of the children.

Kartushina and Frauenfelder (2013) provided more direct evidence that L1 category precision affects L2 speech learning. These authors examined the production and perception of French vowels by native Spanish (NS) adolescents who had studied French at school for about four years. French /e/ and /ε/ occupy a portion of acoustic vowel space where Spanish has just one vowel, /e/. Acoustic analyses showed more overlap in F1 and F2 values between the students’ Spanish /e/ productions and native French speakers’ productions of French /ε/ than between the students’ Spanish /e/ productions and French /e/. Kartushina and Frauenfelder (2013) reported that the students whose Spanish /e/ productions were closer in an F1–F2 acoustic space to French /e/ were better able to identify French /e/ in a five-alternative forced-choice test than the students whose Spanish /e/ productions were more distant from French /e/.
Students whose Spanish /e/ productions showed a relatively “compact” distribution, that is, relatively little token-to-token variability in the F1–F2 vowel space (greater “precision” in SLM-r terminology) were more accurate in identifying French /ε/ than students whose Spanish /e/ productions showed greater token-to-token variability (less precision). The authors hypothesized that the students who showed relatively little token-to-token variability in L1 vowel production may have been better able to discern phonetic differences between the Spanish and French vowels. The results obtained by Kartushina, Hervais-Adelman, Frauenfelder, and Golestani (2016) suggested that an influence of L1 category precision may be evident even in the earliest stages of L2 speech learning. These authors examined the production of Danish /ɔ/ and Russian /ɨ/ by 20 native French (NF) speakers who had no prior exposure to Danish or



Russian. The NF participants were asked to repeat multiple natural tokens of the foreign vowels as accurately as possible both before and after articulatory training on the vowels had been administered. The accuracy with which the foreign vowels were produced before and after training was assessed, as was the precision (token-to-token variability) with which the NF participants produced the foreign vowels and the closest French vowels. Kartushina et al. (2016) found that the NF participants produced the foreign vowels far more accurately, and with greater precision, after than before training. Most importantly for the present discussion, the training was not found to modify the precision with which the NF participants produced native French vowels. This supports the view that L1 category precision is an endogenous factor not shaped by language-specific phonetic factors.

Another finding supporting this conclusion comes from a recent study of vowel production in Yoloxóchitl Mixtec. DiCanio, Nam, Amith, García, and Whalen (2015) evaluated both the extension and precision of vowel categories in elicited and spontaneous speech samples. The authors noted that “with a few exceptions … their participants were very similar in their overall degree of vowel [production] variability across style” (p. 55). The two production samples differed systematically, but the seven talkers maintained between-vowel differences and exhibited similar degrees of precision in both.

For the SLM-r category precision hypothesis to be accepted, it will be necessary to show in prospective research that individual differences in L1 category precision affect the discernment of L1–L2 phonetic differences as predicted. It will also be necessary to show that differences in discernment of cross-language phonetic differences will impact the production and perception of L2 sounds.
It will also be valuable to determine if individual differences in L1 category precision affect how much L2 input learners need to establish consistent patterns of interlingual identification and if, as we suspect, individual differences in category precision in monolinguals derive from individual differences in auditory acuity, early-stage (precategorical) auditory processing, and auditory working memory.

1.3.2.5  Bilingual Phonetic Categories

The SLM-r proposes that the capacity for phonetic category formation remains intact over the life-span, but that new categories are not formed for all L2 sounds differing audibly from the closest L1 sound. By



hypothesis, the likelihood that a new phonetic category will be formed for an L2 sound depends on (1) the degree of perceived phonetic dissimilarity of an L2 sound from the closest L1 sound, (2) how precisely defined the closest L1 phonetic category is, and (3) the quantity and quality of L2 input that has been received.

Categories formed for L2 sounds are defined by the statistical properties of input distributions. This kind of distributional learning is slow in L1 acquisition, and so the SLM-r maintains that it will also be slow in L2 learning. Much more needs to be known about the time course of distributional learning, both in the L1 and in an L2. Feldman, Griffiths, and Morgan (2009) provided evidence that listeners need not estimate the entire distribution of instances of a category because “simply storing [a sufficient number of ] exemplars can provide an alternative method for estimating the distribution associated with a category” (p. 774). Furthermore, the development of categories through the estimation of distributions eliminates the need for learners to have a priori knowledge of what kind and how many categories exist in the L2 being learned (p. 774).

Purely distributional learning theories treat each token of a category as independent of neighboring sounds, ignoring higher-level structure. Feldman, Griffiths, Goldwater, and Morgan (2013) showed that the learning problem becomes substantially more tractable if one assumes that children developing L1 phonetic categories learn to categorize speech sounds and to recognize words simultaneously. However, modeling phonetic category formation in an L2 remains a considerable challenge when the data sets to which L2 learning models are exposed resemble the kind of highly variable input L2 learners actually receive (see Antetomaso et al., 2017, for a first attempt).

The SLM-r proposes that the formation of a new phonetic category for an L2 sound is a three-stage process.
First, an L2 learner must discern a phonetic difference (or differences) between the realizations of an L2 sound and the L1 sound that is closest to it in phonetic space. Second, a functional “equivalence class” of speech tokens that resemble one another, and so are close together in phonetic space, must emerge (Kuhl, 1991; Kuhl et al., 2008; Kluender, Lotto, Holt, & Bloedel, 1998). By hypothesis, the sounds making up such equivalence classes remain perceptually linked to the closest L1 sound until the distribution of tokens defining the equivalence class has stabilized. Third, at a later and as-yet undefined moment in phonetic development, the perceptual link between the L2 “equivalence” class and the L1 category will be sundered.
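The exemplar-based estimation of category distributions described above (Feldman et al., 2009) can be given a toy illustration in code. All class names, labels, and VOT values below are invented for the sketch; nothing here is part of the SLM-r itself:

```python
import math
from statistics import mean, pvariance

class ExemplarCategory:
    """A phonetic category defined only by the tokens stored for it."""

    def __init__(self, label):
        self.label = label
        self.exemplars = []  # raw stored tokens; no distribution is fitted in advance

    def hear(self, vot_ms):
        self.exemplars.append(vot_ms)

    def likelihood(self, vot_ms):
        # Estimate p(token | category) from the stored exemplars,
        # treating them as samples from a Gaussian.
        mu = mean(self.exemplars)
        var = pvariance(self.exemplars) or 1.0
        return math.exp(-(vot_ms - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# Build two categories purely from input tokens (invented VOT values, in ms).
short_lag = ExemplarCategory("/b/-like")
long_lag = ExemplarCategory("/p/-like")
for v in (5, 10, 12, 8, 15, 11):
    short_lag.hear(v)
for v in (60, 75, 80, 70, 65, 72):
    long_lag.hear(v)

def categorize(vot_ms):
    """Assign a token to whichever category makes it more likely."""
    return max((short_lag, long_lag), key=lambda c: c.likelihood(vot_ms)).label

print(categorize(9))   # a clearly short-lag token -> "/b/-like"
print(categorize(68))  # a clearly long-lag token -> "/p/-like"
```

Because `likelihood` is recomputed from whatever exemplars have accumulated, the category estimate sharpens gradually as input grows, which is congruent with the gradual view of category formation the model assumes.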



The Revised Speech Learning Model (SLM-r)



We speculate that this delinking may be speeded by growth of the L2 lexicon, at least in literate learners of an L2 (Bundgaard-Nielsen, Best, & Tyler, 2011). Once delinking has occurred, the development of a new L2 phonetic category will be based on statistical regularities of the distribution of L2 sounds that are implicitly categorized as instances of the new L2 phonetic category.

The SLM-r regards L2 phonetic category formation as a gradual process, not a one-time event. Consider, for example, the results of Thorin, Sadakata, Desain, and McQueen (2018). These authors trained native Dutch (ND) university students to produce and perceive the English vowels /ε/ and /æ/. The training resulted in somewhat greater improvement for English /æ/ than /ε/, which is consistent with the fact that, of the two English vowels, /æ/ is more distant in an F1–F2 space from the closest Dutch vowel, /ε/ (see also Díaz, Mitterer, Broersma, & Sebastián-Gallés, 2012; Flege, 1992). Evidence of discrimination peaks before training and a shift in posttraining phoneme boundaries suggested to the authors that the ND students already had “weak” phonetic categories for English /æ/ before training, which became “stronger” as a result of the training.

The SLM-r proposes that new phonetic categories will not be formed by individual learners for an L2 sound that the learners judge as being too similar phonetically to the closest L1 sound. Crucially, however, learners do not discard audible phonetic information in such cases. By hypothesis, a perceptual link between the L2 sound and the closest L1 sound will continue to exist and a composite L1–L2 phonetic category will develop, defined by the statistical regularities present in the combined distributions of the perceptually linked L1 and L2 sounds.

1.3.2.6  L1–L2 Interactions

The presence or absence of category formation is the key determinant of how phonetic systems and subsystems reorganize.
A method did not exist in 1995 for determining when a new L2 phonetic category had been formed and, alas, the same holds true today. One new technique we consider promising is presented in Chapter 3, but it has not yet been used in L2 speech research. Brain imaging techniques have developed substantially since 1995 and may someday provide a litmus test for L2 category formation. For example, if L2 learners use a new phonetic category when processing tokens of an L2 vowel, their categorization responses might be associated with “more efficient neural processing in frontal speech regions
implicated in phonetic processing” than would be the case if a new L2 phonetic category had not been formed (Golestani, 2016, p. 676). It will be of special interest to determine whether the processing of L2 sounds via a new phonetic category will ever demonstrate the “neural commitment” seen for L1 sounds, namely, focal cortical representations that persist for relatively brief intervals (Zhang, Kuhl, Imada, Kotani, & Tohkura, 2005). Meanwhile, interactions between L1 and L2 phonetic categories provide a reflex that is diagnostic of L2 category formation or its absence.

According to the SLM-r, new L2 categories may shift away from (i.e., dissimilate from) neighboring L1 categories to maintain phonetic contrast between certain pairs of L1 and L2 sounds. This is so because, by hypothesis, the L1 and L2 phonetic categories of a bilingual exist in a common phonetic space. In the absence of category formation for an L2 sound, on the other hand, the SLM-r predicts a merger of the phonetic properties of an L1 sound and the L2 sound to which it remains perceptually linked. This may cause the L1 sound to shift toward (assimilate to) the L2 sound in phonetic space. We will illustrate the interactions predicted by the SLM-r by considering the results of two studies examining the same participants.

Flege, Schirru, and MacKay (2003) provided evidence of L1–L2 category dissimilation. These authors examined the production of English /eɪ/ and Italian /e/ by four groups of native Italian (NI) immigrants who had lived in Canada for decades. The four NI groups differed orthogonally in age of arrival in Canada (early vs. late) and amount of continued Italian use (high vs. low). The 36 participants in the two “high-Italian-use” groups reported using Italian more than the 36 participants in the two “low-Italian-use” groups (means = 48 vs. 9 percent) and used Italian in more social contexts and with more other NI immigrants than members of the low-Italian-use groups did.
English /eɪ/ is produced with substantial formant movement, Italian /e/ with little or none. Acoustic analyses revealed that the 36 members of the two late learner groups (late-low, late-high) produced English /eɪ/ in an Italian-like way, that is, with significantly less formant movement than native English (NE) speakers produced. This suggested that many late learners had not formed new phonetic categories for English /eɪ/. However, the NI speakers who were likely to have received the most English native-speaker input, the early-low group, produced English /eɪ/ with significantly more formant movement than NE speakers did. This suggested that not only had members of the early-low group formed new phonetic categories for
English /eɪ/, their new L2 categories dissimilated from their Italian /e/ categories to maintain contrast in a common phonetic space.

The SLM-r proposes that category assimilation may occur when a new phonetic category has not been formed for one or more of the reasons stated above. By hypothesis, composite L1–L2 phonetic categories develop in such cases. MacKay et al. (2001) obtained evidence of category assimilation – the opposite of what was just reported for the same participants – in research examining how the four NI groups produced and perceived /b/ in English and Italian.

Confirming past research, MacKay et al. (2001) found that NE speakers produced English /b/ in three different ways: with full prevoicing that continued until stop release, with partial prevoicing that ceased before release, or as short-lag stops. Also confirmed was the fact that Italian /b/ is produced with full prevoicing. The authors reasoned that phonetic contrast must be maintained between the phonetic elements making up the L1 and L2 phonetic subsystems of bilinguals. That being the case, the NI speakers could not form a new “short-lag” phonetic category for English /b/ because Italian /p/ is realized with short-lag VOT values.

MacKay et al. (2001) found that members of the two high-Italian-use groups incorrectly identified naturally produced short-lag tokens of English /b/ as /p/ more often than did members of the two low-Italian-use groups. Members of the high-Italian-use groups produced English [b] in an English way, that is, as short-lag stops, less often than members of the two low-Italian-use groups did. The NI speakers likely to have received the most native-speaker English input over the course of their lives did something that never happens in Italian: they produced Italian /b/ with prevoicing that ceased before stop release. The authors suggested that the NI speakers “restructured” their Italian /b/ categories to varying degrees for use in English.
The SLM-r interpretation is that they developed composite L1–L2 categories based on the combined distribution of the Italian and English /b/ tokens they had heard over the course of their lives. Depending on how their composite Italian–English /b/ categories were specified, some NI speakers modified the realization rule used to produce /b/ in both English and Italian so that it no longer assured prevoicing that continued without interruption until stop release.

The L1–L2 interactions under discussion here might also contribute indirectly to cross-dialect differences. Caramazza and Yeni-Komshian (1974) found that French monolinguals in Quebec (Canada) produced
French /b d ɡ/ with English-like short-lag VOT values far more often than did French monolinguals in France (means = 59 vs. 6 percent), and they also produced French /p t k/ with somewhat longer VOT values. We speculate that both effects seen in Canadian French arose from the exposure of French monolinguals in Canada to the French spoken by French-English bilinguals whose L1 production had been altered by learning and using English.

According to the SLM-r, L1-on-L2 and L2-on-L1 effects arise inevitably because the phonetic elements of a bilingual’s L1 and L2 subsystems exist in a common phonetic space. Although the model makes no predictions regarding the magnitude of such effects, it is worth considering three factors that might modulate them.

To begin, Lev-Ari and Peperkamp (2013) measured VOT in English stops produced by English-French bilinguals who had lived in Paris for many years. The authors proposed that the magnitude of the L2 (French) on L1 (English) effects they observed may have been influenced by individual cognitive differences in “inhibitory skill.” Second, the magnitude of cross-language phonetic effects may depend on how bilinguals deploy their phonetic categories. We might expect L2-on-L1 effect sizes to increase, for example, the more strongly activated the L2 is when L1 performance is examined, when bilinguals are tested in the presence of other bilinguals, when they are using their L1 to discuss topics usually discussed while speaking the L2, and so on (e.g., Grosjean, 1998, 2001). Finally, we might expect L2-on-L1 effect sizes to increase as proficiency in the L2 increases as the result of more input and more frequent use of the L2. This might be evident even over fairly short periods of time. Casillas (2018), for example, tested NE-speaking university students who differed according to how many university-level Spanish-language classes they had taken.
The research examined the size of a shift in the location of /b/-/p/ phoneme boundaries in VOT continua that lexically induced “English” and “Spanish” modes of perception. Only the students who had taken the most Spanish-language classes showed significant phoneme boundary shifts.

1.3.2.7  Feature Weighting

The SLM proposed that a phonetic category formed for an L2 sound by an L2 learner might differ from the phonetic categories formed for the same sound by monolingual native speakers of the target L2 if the L2 sound were specified by “features … not exploited” in the learner’s L1 or
if the features (perceptual cues) defining the L2 sound were “weighted differently” than the features specifying the closest L1 sound (Flege, 1995, pp. 239–243). As we now see it, the earlier SLM “feature hypothesis” was incongruent with the model’s first postulate, namely, that “the mechanisms and processes used in learning the L1 sound system, including category formation, remain intact over the life-span and can be applied to L2 learning” (Flege, 1995, p. 239). The SLM-r abandons the earlier SLM feature hypothesis because of the emergence of findings showing that late learners can gain access to features used to define L2 categories that are not exploited in the L1 (see Chapter 2). It formally adopts the “full access” hypothesis proposed by Flege (2005b; see also Escudero & Boersma, 2004). We dedicate the remainder of this section to justifying the change.

L1 research. As L1 categories develop, the multiple cues defining them come to be weighted optimally for correct categorization. For example, research has shown that NE monolinguals use VOT as the primary cue when categorizing word-initial stops as phonologically voiced or voiceless, making lesser use of F0 onset frequency (e.g., Whalen, Abramson, Lisker, & Mody, 1993). The secondary cue, F0, generally exerts a measurable effect on categorization only for a subset of stimuli having ambiguous VOT values (Kong & Edwards, 2015, 2016; Lehet & Holt, 2017).

Learning to optimally integrate multiple cues to L1 phonetic categories requires years of input (e.g., Morrongiello, Robson, Best, & Clifton, 1984; Nittrouer, 2004). For example, Idemaru and Holt (2013) found that, to optimally categorize word-initial English liquids, monolingual NE children must learn to give greater weight to F3 onset frequency than to F2 onset frequency. Use of F3 frequency, the primary cue for /r/ categorization by NE adults, develops rapidly, but use of the secondary cue, F2, continues to develop beyond eight or nine years of age.
Differences in cue weighting depend importantly on cue reliability (Idemaru & Holt, 2011; Strange, 2011), which, in turn, depends on the statistical properties of input distributions to which individuals have been exposed during L1 acquisition (Holt & Lotto, 2006, pp. 3060–3062). Individual differences in cue weighting among monolinguals are likely to arise from exposure to different input distributions of tokens specifying a phonetic category (Clayards, 2018; Lee & Jongman, 2018). However, future research must explore the possibility that some individual differences in cue weighting arise from endogenous differences in auditory acuity, early-stage auditory processing, or auditory working memory.
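The dependence of cue weighting on cue reliability can be given a minimal numerical sketch. The input distributions below are invented: VOT separates the two stop categories cleanly while F0 onset overlaps, so a separation-based reliability index assigns VOT most of the weight. The d′-like index is a simplification chosen for the illustration, not a claim about how listeners actually compute reliability:

```python
from statistics import mean, pstdev

# Invented input tokens for a voiced/voiceless stop contrast on two cues.
input_tokens = {
    "voiced":    {"vot_ms": [8, 12, 10, 15, 9],   "f0_hz": [118, 125, 130, 122, 128]},
    "voiceless": {"vot_ms": [65, 72, 80, 70, 68], "f0_hz": [132, 140, 128, 138, 135]},
}

def reliability(cue):
    """Mean separation of the two categories in pooled-SD units:
    a crude d'-like index of how reliably the cue signals category."""
    a = input_tokens["voiced"][cue]
    b = input_tokens["voiceless"][cue]
    pooled_sd = (pstdev(a) + pstdev(b)) / 2
    return abs(mean(a) - mean(b)) / pooled_sd

raw = {cue: reliability(cue) for cue in ("vot_ms", "f0_hz")}
total = sum(raw.values())
weights = {cue: r / total for cue, r in raw.items()}

print(weights)  # VOT carries most of the weight; F0 onset carries little
```

Different learners exposed to different input samples would compute different `weights`, which is one way individual differences in cue weighting could arise from input distributions alone.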
The cue weighting patterns specified in phonetic categories are not applied rigidly by monolinguals during the categorization of sounds in their native language. Human speech perception is necessarily adaptive (Aslin, 2014), enabling listeners, for example, to better understand foreign-accented renditions of their L1 after a brief exposure to foreign-accented talkers (e.g., Bradlow & Bent, 2008). Also important is the fact that cue weighting may adapt dynamically to what has been heard recently (Lehet & Holt, 2017; Schertz, Cho, Lotto, & Warner, 2016) and can be modified through training (Francis, Kaganovich, & Driscoll-Huber, 2008).

Adaptation occurs at the segmental level in both production and perception. Nielsen (2011) found that NE speakers produced /p/ with significantly longer VOT values after hearing experimental stimuli with artificially lengthened VOT values (see also Clarke & Luce, 2005). Kraljic and Samuel (2006) found that NE adults could recalibrate their categorization of stops based on brief exposure to unusual productions of /t/ and /d/. The NE participants in Kraljic and Samuel performed a lexical decision task. Half of them heard a target sound that was ambiguous between /d/ and /t/ in words known to have /d/ (e.g., crocodile) while the remaining half were exposed to the same ambiguous stimuli in words known to have /t/ (e.g., cafeteria). The participants exposed to ambiguous stimuli in /d/ words gave more /d/ responses in a posttest than those exposed to ambiguous stimuli in /t/ words. McQueen, Tyler, and Cutler (2012) showed that the ability to recalibrate perception, which they termed “lexically guided retuning,” is already present in young monolingual children.

L2 research.
Iverson and Evans (2007) examined the production and perception of English vowels by native speakers of Spanish (n = 25), French (n = 19), German (n = 21), and Norwegian (n = 18) who had spent median periods in an English-speaking country that ranged from zero years (the Norwegians) to three years (the native French participants). The participants’ perceptual representations for English vowels were defined by having them select the best exemplars of various L1 and L2 vowel categories from a five-dimensional array of stimuli differing in F1 and F2 frequencies, formant movement patterns, and duration. Participants whose L1 made little or no use of duration and formant movement patterns were nevertheless observed to use those dimensions when selecting the best exemplars of English vowels. This suggested the nonnative speakers’ representations for English vowels incorporated
information pertaining to these dimensions. This conclusion regarding the use of previously unneeded dimensions was corroborated by another finding of the study, namely, that significantly fewer English vowels were correctly identified in noise when the previously unneeded dimensions were neutralized than when those dimensions remained present in the stimulus array.

Individual differences exist in cue weighting among monolinguals. That being the case, the SLM-r predicts that individual differences will also be evident in the production and perception of L2 sounds that are perceptually linked to L1 sounds via the mandatory and automatic mechanism of interlingual identification. Individual differences in cue weighting have in fact been observed in both early (Idemaru, Holt, & Seltman, 2012; Kim, 2012) and later phases of L2 learning (Chandrasekaran, Sampath, & Wong, 2010; Schertz et al., 2016).

The SLM-r proposes that the influence of L1 cue weighting patterns will be stronger for L2 sounds that remain perceptually linked to an L1 category than for L2 sounds for which a new L2 phonetic category has been formed. Cue weighting patterns for newly formed L2 phonetic categories are expected to develop as in monolingual L1 acquisition, that is, to be based on the reliability of the multiple cues to correct categorization present in input distributions.

In research examining the formation of nonspeech auditory categories, Holt and Lotto (2006) identified effects due to the variability of multiple dimensions in the input provided during laboratory training that might apply to L2 learning. This research might provide a starting point for explaining intersubject variability in L2 cue weighting. As far as we know, however, the suggestions offered by Holt and Lotto (2006) have not yet been applied to L2 speech learning, probably because of the difficulty of specifying the distribution of speech sounds to which learners of an L2 have been exposed.
As is the case for monolinguals, the cue weighting patterns evident for L2 learners may reflect the properties of speech stimuli heard in the recent past (Lehet & Holt, 2017; Schertz, Cho, Lotto, & Warner, 2016) and change in response to perceptual training (Francis, Kaganovich, & Driscoll-Huber, 2008; Giannakopoulou, Uther, & Ylinen, 2013; Hu et al., 2016; Schertz et al., 2016; Ylinen et al., 2010). Seen from the perspective of an attention-to-dimension perceptual learning model (Francis & Nusbaum, 2002), changes in cue weights induced by training underscore the importance of attentional allocation, something demonstrated with elegant simplicity by Pisoni et al. (1982).
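The exposure-driven recalibration discussed above (e.g., Kraljic & Samuel, 2006) can be given a toy exemplar sketch: tokens that are acoustically ambiguous but lexically disambiguated toward /d/ are absorbed into the /d/ distribution, shifting the category boundary so that more tokens subsequently count as /d/. All VOT values are invented, and the midpoint boundary rule is a deliberate simplification:

```python
from statistics import mean

# Invented VOT exemplars (ms) for word-initial /d/ and /t/ before exposure.
d_tokens = [10, 15, 12, 18, 14]
t_tokens = [60, 70, 65, 75, 68]

def boundary_ms():
    """Place the /d/-/t/ boundary midway between the two category means
    (a simplification of likelihood-based categorization)."""
    return (mean(d_tokens) + mean(t_tokens)) / 2

before = boundary_ms()

# Lexically guided retuning: a token with ambiguous VOT (40 ms) keeps
# occurring in words that must contain /d/, so the listener files it
# under /d/ and the /d/ distribution absorbs it.
for _ in range(5):
    d_tokens.append(40)

after = boundary_ms()
print(before, after)  # the boundary moves toward /t/, so more tokens now count as /d/
```

The /t/-biased exposure condition is the mirror image: filing the same ambiguous tokens under /t/ would pull the boundary in the opposite direction, yielding fewer /d/ responses.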
L2 research has focused on the identification of differences in cue weighting patterns between native and nonnative speakers. Consider, for example, the perception of English /i/ and /ɪ/. Flege et al. (1997) examined the identification of these vowels in a two-alternative forced-choice (2AFC) test by adult NE speakers and by two groups of 10 native Korean (NK) adults. The NK groups differed in full-time equivalent (FTE) years of English input (means = 4.3 and 0.3 for the relatively “experienced” and “inexperienced” groups). The synthetic vowel stimuli varied orthogonally in temporal and spectral dimensions (duration vs. F1 and F2 frequencies). All 10 NE speakers made predominant use of spectral cues whereas eight experienced and nine inexperienced NK speakers made predominant use of temporal cues.

Similarly, Kim, Clayards, and Goad (2018) examined the use of spectral and temporal cues in an /i/-to-/ɪ/ continuum by NK women and their children. These authors obtained a first sample 2.2 months after their participants had arrived in Canada to study English, and again 4, 8, and 12 months after the participants’ arrival. As expected, the NE speakers made greater use of spectral than temporal cues whereas the reverse held true for the NK speakers. However, the NK speakers made somewhat greater use of spectral cues over the course of the longitudinal study, with greater movement toward the English pattern evident for NK children than adults.

Although the findings of Flege et al. (1997) and Kim et al. (2018) were similar, the results of both studies are difficult to interpret because the NK participants in both studies often reversed category labels, something not seen in the NE participants’ responses. The reversals may have reflected confusion by some NK participants regarding how to associate the vowels they heard with the response alternatives that were offered (written labels in Flege et al., 1997; pictures in Kim et al., 2018).
Also, use of a 2AFC task might not have permitted listeners to adequately report phonetic-level categorization (see, e.g., Pisoni et al., 1982). The adult–child difference observed by Kim et al. (2018) is of special interest given that it was obtained longitudinally. However, interpretive difficulty exists here as well because of differences in the contexts of L2 learning for participants differing in age (see Flege, 2019). The NK adults attended English classes 22 hours per week. The children attended school five days a week (hours not reported) and also studied English after school for an average of 31 hours per week. We can infer that the children obtained more L2 input than their mothers, and possibly more native-speaker input as well.
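In studies of this kind, cue weights are commonly estimated by regressing listeners’ binary identification responses on the orthogonally varied stimulus dimensions, with the fitted coefficients serving as weight estimates. The sketch below applies that general logic to a hypothetical 5 × 5 spectral-by-duration grid and a simulated listener who, like the NE groups above, relies mainly on the spectral cue; it is not the actual analysis procedure of the studies cited:

```python
import itertools
import math

# Hypothetical 5 x 5 stimulus grid; each dimension is coded -2..+2 steps.
grid = list(itertools.product(range(-2, 3), repeat=2))

def simulated_response(spectral, duration):
    """Invented listener: responds '1' (= /i/) based mostly on spectrum."""
    return 1 if 2.0 * spectral + 0.3 * duration > 0 else 0

data = [(s, d, simulated_response(s, d)) for s, d in grid]

# Fit a logistic regression by plain batch gradient descent; the fitted
# coefficients estimate the listener's spectral and duration cue weights.
w_spec = w_dur = bias = 0.0
for _ in range(2000):
    g_spec = g_dur = g_bias = 0.0
    for s, d, y in data:
        p = 1.0 / (1.0 + math.exp(-(w_spec * s + w_dur * d + bias)))
        err = p - y
        g_spec += err * s
        g_dur += err * d
        g_bias += err
    w_spec -= 0.05 * g_spec
    w_dur -= 0.05 * g_dur
    bias -= 0.05 * g_bias

print(w_spec, w_dur)  # spectral weight clearly exceeds duration weight
```

With real 2AFC data the same regression recovers each listener’s weights, and comparing fitted coefficients across groups indexes, for example, spectral versus temporal reliance.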
Other L2 cue weighting research has examined the use of VOT and F0 as cues to the categorization of L2 stops. As already mentioned, VOT is the primary cue in NE-speaking listeners’ categorization of word-initial stops as /t/ or /d/, and F0 onset frequency is a secondary cue. In Korean, on the other hand, F0 is often the primary cue and VOT is often (but not always) a secondary cue. Schertz et al. (2015) examined NK speakers’ use of the two cues when perceiving English stops and measured the two dimensions in their English productions. The NK speakers produced reliable VOT and F0 differences between English voiced and voiceless stops. Greater variability was evident for perception than production, with some NK speakers using VOT as the primary cue, some using F0 as the primary cue, and some using both cues.

The results of Kong and Yoon (2013) provide insight into the role of input in the modification of cue weighting patterns. These authors examined the use of VOT and F0 in the perception of English stops by two groups of high school students in Seoul. The stimuli consisted of an array of tokens in which VOT and F0 onset frequency varied orthogonally. The “low-proficiency” students were enrolled in a regular high school whereas the “high-proficiency” students attended a special foreign-language high school. The students judged stimuli using a visual analog scale with endpoints ranging from “d-like” to “t-like.” The two groups differed little in their use of the VOT dimension, but members of the low-proficiency group were more sensitive to F0 variation than members of the high-proficiency group. This suggested that the more experienced students had reduced use of F0, the primary cue in Korean, when perceiving stops in English, where F0 is a secondary cue.

Finally, the results obtained by Dmitrieva (2019) suggested that L2 learners can learn to reverse the cue weighting pattern evident in their L1 when perceiving L2 sounds.
Dmitrieva (2019) tested 34 monolingual speakers each of English and Russian as well as 37 Russian-English bilinguals who had lived in the United States for an average of 39 years (all but four of whom had arrived after the age of 15 years). The participants labeled, as /k/ or /ɡ/, randomly presented stimuli from an array differing orthogonally in the duration of glottal pulsing present in the closures of word-final stops and the duration of the vowels preceding the final stops (six steps each). Russian monolinguals relied more on glottal pulsing than vowel duration whereas English monolinguals showed the reverse pattern. When tested in English, the bilinguals as a group did not differ from the monolingual English group. That is, many or most of the Russians had learned to give greater weight to vowel duration than glottal
pulsing. When tested in Russian, the bilinguals were found to make greater use of vowel duration than the Russian monolinguals did. This suggested that learning to optimally categorize word-final English stops had modified how the bilinguals perceived stops in their L1, Russian.

1.3.2.8  Individual Differences in Speech Learning Ability

The extent of native versus nonnative differences generally diminishes over time, especially for individuals who began learning the L2 in childhood, but such differences are usually found to persist even in many early learners (Flege, 2019). An important objective of L2 research is to account for intersubject variability, that is to say, the magnitude of persistent native versus nonnative differences. We have emphasized the importance of input as an explanation of intersubject variability. What, then, accounts for differences between individuals who seem to have received similar L2 input?

Individual differences in the capacity to learn speech at any age might contribute to the intersubject variability evident in L2 research. The time children need to attune fully to the ambient language phonetic system varies (e.g., Smit, Hand, Freilinger, Bernthal, & Bird, 1990) but, in the end, typically developing children learn to speak their L1 without noticeable pronunciation errors. L2 speech learning is of course different. It seems reasonable to assume that individual differences in L1 learning will be evident later when an L2 is learned, but we know of no systematic research testing this hypothesis. Preliminary support for the hypothesis is the finding that “good” and “poor” perceivers of L2 sounds may differ in their perception of L1 sounds when L1 perception is assessed with sufficient sensitivity at the appropriate processing level (Díaz, Mitterer, Broersma, Escera, & Sebastián-Gallés, 2015).
Given that most children eventually learn L1 speech adequately, the SLM-r proposes that endogenous differences in the capacity to learn speech, should they exist, will primarily impact how much input is needed to reach specific speech learning milestones rather than whether or not such milestones are eventually reached.

Differences in auditory acuity and in early-stage (precategorical) auditory processing are perhaps the most obvious factors to consider in the search for endogenous individual differences that will influence L2 speech learning. However, two early L2 speech studies dampened interest in auditory factors. Stevens, Liberman, Studdert-Kennedy, and Öhman (1969) observed little difference between native speakers of English and Swedish in the discrimination of front rounded vowels
found in Swedish but not English. This suggested that the auditory detection of differences between Swedish vowels was just as possible for listeners who were unfamiliar with Swedish vowels as it was for Swedes. Similarly, Miyawaki et al. (1975) found that native Japanese speakers who were unable to discriminate English “ra” and “la” syllables had no trouble discriminating the third formant (F3) components of the same stimuli when presented in isolation. Given that F3 frequency is the most important cue to the identification of a consonant as /r/ in English, many interpreted this finding to mean that the notorious “Japanese r-l problem” resided at a phonetic and/or phonological level rather than at an auditory level (for discussion, see Iverson, Wagner, & Rosen, 2016).

The findings just mentioned do not necessarily mean that individual differences in auditory acuity, early-stage (precategorical) auditory processing, and auditory memory are unimportant. The influence of endogenous “auditory” differences may be more evident in early than later stages of L2 learning. Moreover, effects of auditory-level processing differences are usually evident only in specific task conditions (e.g., Werker & Logan, 1985). In fact, later L2 research using methods that required the phonetic categorization of speech sounds soon revealed important effects of L1 background, including effects attributable to language experience and phonetic context (e.g., Gottfried, 1984; Levy & Strange, 2008).

Normal-hearing monolingual adults differ in terms of auditory acuity, for example, in how finely they can discriminate formant frequency differences in native-language vowels (Kewley-Port, 2001). Differences exist in early-stage (precategorical) auditory processing, such as the amplitude of the frequency-following response (Galbraith, Buranahirun, Kang, Ramo, & Lunde, 2000; Hoormann, Falkenstein, Hohnsbein, & Blanke, 1992).
Kidd, Watson, and Gygi (2007) identified individual differences in four basic auditory abilities, and also in the ability to recognize familiar nonspeech auditory events. This last auditory capability may have derived from individual differences in auditory working memory and/or attentional allocation.

The findings just mentioned raise the question of whether auditory acuity, and more generally precategorical auditory processing (Iverson, Wagner, & Rosen, 2016), affects how much native-speaker input is needed to learn L2 sounds once the L1 phonetic system has been established (Kachlika, Saito, & Tierney, 2019, p. 16). Existing research suggests that it does.
Hazan and Kim (2010) showed that auditory sensitivity to F2 frequency was the best single predictor of how much NE speakers benefited from computer-based training on a phonetic contrast found in Korean but not English. Lengeris and Hazan (2010) administered perceptual training on English vowels to 18 native Greek (NG) adults living in Athens. The L2 perceptual training yielded robust improvements in the accuracy with which the NG participants identified and produced English vowels. The NG participants’ ability to discriminate nonspeech stimuli (isolated F2 formants) was evaluated before training. Their ability to identify English vowels after the training correlated significantly with their discrimination of nonspeech sounds, Greek vowels, and English vowels prior to training (correlations ranging from r = 0.55 to 0.56). All three variables were also found to correlate significantly (r = 0.52 to 0.68) with measures of posttraining English vowel production accuracy.

Kachlika et al. (2019) evaluated the relation between auditory processing and the identification of two pairs of English vowels that are known to be difficult for native Polish (NP) speakers. The authors tested 40 Poles who had arrived in the United Kingdom after the age of 18 years and had lived in the United Kingdom for 1–6 years (mean = 3.6 years). The NP participants reported using English from 18 percent to 97 percent of the time (mean = 66 percent) and had studied English at school for 0.5–20 years (mean = 9.4 years). The percentage of correct identifications of English vowels correlated significantly with measures of the NP participants’ spectral processing and neural encoding of F2 formant frequency.

Individuals also differ in how they process, code, and store L2 sounds in memory.
Golestani, Molko, Dehaene, Le Bihan, and Pallier (2007) found that some native French (NF) speakers rapidly learned to distinguish an unfamiliar (foreign) phonetic distinction between dental and retroflex stops, but other NF speakers did so poorly or not at all. The behavioral difference between “fast” and “slow” learners was related to individual differences in functional neuroanatomy and lateralization of language processing. The ability to accurately mimic sounds is crucial for the learning of both L1 and L2 sounds. Most people readily note occasional disparities between what they meant to say, as specified by their phonetic categories, and self-heard vocal output. This is because talkers monitor their vocal output in real time via pathways that connect self-hearing, on the one hand, and phonetic categories and realization rules, on the other (e.g., Guenther, Hampson, & Johnson, 1998).



The Revised Speech Learning Model (SLM-r)



Reiterer et al. (2013) identified two subgroups of native German (NG) speakers for a study examining the production of German with a feigned English accent. Members of the two groups were equally able to speak English but differed in ability to mimic sentences from an unknown foreign language in a pretest. Members of the “high-” and “low-mimicry-ability” groups later produced German and English sentences without special instruction and were also asked to produce German sentences with an English foreign accent based on their prior exposure to English-accented German (see also Flege & Hammond, 1982). Acoustic analyses and brain imaging data obtained by Reiterer et al. (2013) pointed to between-group differences in the ability to access, store, and retrieve “auditory episodic events.” Members of the high-mimicry-ability group were said to deploy more “detailed phonetic” knowledge of English sounds and to have greater “phonological awareness” than members of the low-mimicry-ability group when producing their L1 (German) with a feigned L2 (English) foreign accent. Brain imaging data revealed between-group differences in the strength and extent of activation in the sensorimotor cortex in a zone within the left inferior parietal cortex thought to “integrate aspects of speech production, phonological representations [and] working memory.” The authors suggested that this may have been a “compensation strategy” by members of the low-mimicry-ability group arising from a “generally weaker auditory working memory” (p. 10). The importance of auditory working memory for L2 speech learning was shown by MacKay, Meador, and Flege (2001). These authors tested the hypothesis that variation in phonological short-term memory (PSTM) will exert a long-term effect on the identification of consonants in an L2. A total of 72 normal-hearing native Italian (NI) speakers who had all lived in Canada for at least 15 years (mean = 35 years) participated.
The NI participants differed substantially in age of arrival in Canada from Italy, years of Canadian residence, self-reported frequency of Italian use, and competence in Italian. PSTM was assessed by having the NI speakers repeat Italian nonwords ranging in length from two syllables (quite easy) to five syllables (very difficult). The PSTM scores were found to correlate significantly with accuracy in identifying both word-initial and word-final English consonants (r = 0.42 and 0.53, p < 0.001). The PSTM hypothesis was confirmed by regression analyses examining the influence of arrival age, years of residence, frequency of L1 use, L1 competence, and the PSTM scores on consonant identification accuracy. The NI participants’ ages of arrival in Canada and their competence in
Italian together accounted for 25.1 percent of the variance for word-initial consonants and 18.9 percent of the variance in the identification of word-final English consonants. The PSTM scores, when entered into the regression model after the other predictor variables, accounted for a significant additional 7.8 percent of the variance for word-initial consonants and a significant additional 14.8 percent for word-final English consonants. Many other individual differences that might potentially influence the basic capacity to learn speech have been identified in the literature. These include, to name but a few, musical ability (e.g., Slevc & Miyake, 2006), selective attention (e.g., Mora & Mora-Plaza, 2019), and phonemic coding ability (Saito, Sun, & Tierney, 2019). The role of these other variables will need to be investigated in greater detail in research focusing on individuals at both early and later stages of L2 learning.

1.3.2.9  Individual Differences in L1 Phonetic Categories

Research examining L2 speech learning has compared the performance of groups of L2 learners, for example, immigrants differing in age of arrival in a predominantly L2-speaking country (“early” vs. “late”) or frequency of continued L1 use (“high” vs. “low”). This approach tacitly assumes that all monolingual native speakers of the same L1 have identical, or at least very similar, phonetic categories, but this assumption is not always well founded. Differences have been observed in how individual monolinguals produce and perceive native-language vowels and consonants. Hillenbrand, Getty, Clark, and Wheeler (1995) observed substantial differences in how native English (NE) men, women and children produced American English vowels. The differences were especially evident in the production of certain pairs of English vowels. Consider, for example, the vowels found in the English words bed (/ε/) and bad (/æ/).
NE speakers normally make greater use of spectral quality than duration cues to distinguish these vowels (e.g., Flege et al., 1997), but Kim and Clayards (2019) observed “substantial” individual differences in the use of these cues in both perception and production (p. 781). Cebrian (2006) observed that NE speakers produce the vowel in bait (often symbolized as /eɪ/) with varying amounts of diphthongization. Individual differences in L1 vowel categories may be evident in a perception experiment examining which stimuli in an array listeners “prefer” (Frieda et al., 2000). Individual differences also exist for L1 consonants. For example, individual NE monolinguals may produce English fricatives with substantial
differences in centroid frequency and skew (Newman, Clouse, & Burnham, 2001). Some NE speakers always produce /b d ɡ/ with prevoicing, whereas others produce only short-lag VOT values (Dmitrieva, Llanos, Shultz, & Francis, 2015). NE monolinguals always realize /p t k/ with long-lag VOT values, but the actual values vary substantially and consistently across individuals (e.g., Allen, Miller, & DeSteno, 2003; Theodore, Miller, & DeSteno, 2009). These talker-specific differences in VOT production are evident both in connected speech and in the production of isolated words, and are systematic in the sense that individuals who produce relatively long VOT values for /p/ will do the same when producing /t/ and /k/ (Chodroff & Wilson, 2017). Moreover, individual differences in VOT production may correspond to the VOT values that NE listeners “prefer” in a goodness rating task (Newman, 2003). L1 phonetic category differences in children may arise from exposure to different dialects (Docherty et al., 2011; Evans & Iverson, 2004) as well as from other statistically defined differences in input distributions that cannot be attributed to dialect differences (e.g., Theodore, Monto, & Graham, 2020). Listeners are able to detect and remember systematic differences between individual talkers and to exploit this knowledge in speech perception. Word-recognition research has shown that individual differences in the production of L1 speech sounds may permit listeners to recognize the identity of a talker, even when indexical properties of the voice have been removed (Remez, Fellowes, & Rubin, 1997), and to better recognize words spoken by familiar talkers (Nygaard, Sommers, & Pisoni, 1994; Allen & Miller, 2004). The SLM-r proposes that individual differences in L1 phonetic categories will sometimes impact L2 speech learning.
For example, when first arriving in Spain, a NE speaker who produces little formant movement in the English word bait (/eɪ/) will likely be observed to produce the Spanish /e/, which shows little formant movement, more accurately than a NE speaker who produces the same English vowel with substantial formant movement (Cebrian, 2006). To illustrate this concept further, let’s consider the hypothetical case of NE speakers learning Danish. As shown in Figure 1.3, Flege, Frieda, Walley, and Randazza (1998) obtained mean VOT values for 60 English words as spoken by 20 NE monolinguals. The mean values ranged from 60 to 116 ms. Imagine that an investigator wished to evaluate the role of quantity of L2 input in a study examining how two groups of NE speakers who were matched for length of residence in Denmark but

[Figure 1.3 appears here: mean VOT values plotted in ascending order for three groups (NE monolinguals, AOA 1–14, AOA 16–50), with the y-axis ranging from 0 to 120 ms and subsets of the late learners annotated as “superstars,” “slow learners,” and “no learners.”]

Figure 1.3  Mean VOT values in the production of English /t/ by native speakers of English and native Spanish early and late learners of English. The error bars bracket ±1 SEM around the means, which were each based on 60 observations.

differed in frequency of Danish use (“low use” < 20 percent, “high use” > 80 percent) produced Danish stops. Imagine, furthermore, that the high-use group included individuals who produced English stops with a mean VOT value of 60 ms (participant “A”) and 116 ms (“B”) when they arrived in Denmark. Garibaldi and Bohn (2015) found that long-time NE-speaking residents of Denmark produced /t/ with comparable VOT values in English and Danish words (means = 91.7 and 90.7 ms, respectively). Both values were shorter on average than the VOT values produced in Danish words by Danes (mean = 140.0 ms). If participants A and B were both members of the high-use group and both had lived in Denmark for three years, B would likely be observed to produce Danish stops just like many Danes, whereas A would likely be found to produce Danish stops with VOT values that were far too short for Danish. A researcher who did not know how A and B produced English stops upon arrival in Denmark would erroneously conclude that B had somehow made better use of
Danish input than A did, and would be tempted to attribute this finding to an unknown individual difference in speech learning aptitude. Individual differences in the specification of L1 categories are rarely mentioned as a possible explanation for intersubject variability in L2 speech learning. A study hinting at this kind of explanation was that of Mayr and Escudero (2010). These authors examined the production of German vowels by seven NE college students who had studied German at school in England and eight others who had also studied for a year in Germany. The NE students were observed to differ in how they classified German vowels in terms of 14 different English vowels, but no systematic difference between the subgroups was observed in how accurately they produced German vowels. Native German-speaking listeners correctly identified only slightly more German vowels produced by the study-abroad students than by the students who had only studied German at school in England (means = 59 vs. 50 percent). Mayr and Escudero (2010) suggested that differences in cross-language mapping patterns may have contributed to the individual differences and that differences in the native L1 dialects of the students might also have played a role (p. 293). Two other studies are relevant to the SLM-r hypothesis that differing L1 categories may lead to differences in L2 speech learning. Escudero and Williams (2012) examined the AXB discrimination and forced-choice identification of Dutch vowels by native Spanish speakers from Spain and Peru. The authors attributed differences between the two Spanish dialect groups in the perception of some Dutch vowels to acoustic phonetic differences in the realization of vowels in the native L1 dialect (p. EL411). We suspect that the participants differing in L1 dialect had somewhat different L1 vowel categories. Chládková and Podlipský (2011) examined the perceptual assimilation of Dutch vowels to L1 vowels by native speakers of Bohemian and Moravian Czech.
The two dialects of Czech have the same phonemic inventory (five pairs of phonemically long and short vowels) but differ in how some vowels are specified phonetically. The authors observed differences between the two dialect groups in the perceptual assimilation of certain Dutch vowels to vowels in the native dialect but, unfortunately, did not determine how these differences affected the participants’ accuracy in producing and perceiving Dutch vowels. Some reported individual differences in L2 production may have arisen from inappropriate elicitation techniques (see Chapter 3) but individual differences that are reliable need to be explained. The SLM-r proposes that explanations for many or most intersubject differences in L2
production and perception can be obtained by examining (1) how individual learners specified L1 phonetic categories, in terms of both cue weighting and degree of category precision, when they began learning an L2; (2) how they mapped L2 sounds onto L1 categories; (3) how dissimilar they perceived L2 sounds to be from the closest L1 sound in their individual phonetic inventory; and (4) how much and what kind of L2 input they received. Should phonetically based explanations not account for true individual differences in the production and perception of L2 sounds, it will become necessary to evaluate the role of endogenous factors that might influence whether new phonetic categories have or have not been formed for L2 sounds. This includes probing for individual differences in auditory acuity, early-stage (precategorical) auditory processing, and auditory working memory.

1.3.2.10  L2 Speech Learning Milestones

The SLM focused on between-group differences whereas the SLM-r focuses on how individuals learn L2 sounds and how L2 learning influences their production and perception of L1 sounds. This fundamental change in orientation goes well beyond the choice of a particular statistical analysis technique. It requires new research designs and new ways to interpret the data patterns they yield (Iverson & Evans, 2007, p. 2843). The SLM-r focus on individual learners was prompted by evidence that individuals may bring somewhat different L1 categories to the task of L2 learning, and by the observation that focusing on groups may obscure differences between individuals (Hazan & Rosen, 1991, p. 197; Markham, 1999). Two practical considerations also prompted the decision to make individual learners the primary unit of analysis for research carried out within the SLM-r framework. First, it is often difficult or impossible to constitute groups differing in a single variable (e.g., “high-input” vs. “low-input” groups).
Second, it is sometimes impossible to draw meaningful conclusions from grouped data. For example, Escudero, Benders, and Lipski (2009) examined the use of spectral and temporal cues to the Dutch /a:/-/ɑ/ contrast by native Dutch (ND) and Spanish (NS) speakers. The authors reported that the NS group made significantly greater use of temporal cues than the ND group did but 14 (37 percent) of the NS speakers made greater use of spectral than temporal cues. Figure 1.3 provides evidence that convinces us, at least, of the need to focus on individual learners of an L2. This figure shows the mean VOT values produced in 60 /t/-initial English words by native Spanish (NS)
speakers who arrived in the United States at or after the age of 16 years. Some of the late learners in this sample produced English /t/ with Spanish-like short-lag VOT values. Others produced English /t/ with long-lag VOT values resembling those of native speakers, and still others produced English /t/ with mean VOT values that fell somewhere in between the values typical for Spanish and English. Appending labels to arbitrarily selected subsets of the NS late learners (e.g., “no learners,” “slow learners,” “superstars”) is tempting but cannot explain the intersubject variability. Working within the SLM-r framework requires obtaining enough data from each participant to permit treating each individual as a separate experiment. Meeting this condition makes it possible to determine if an individual has or has not achieved specific L2 speech learning “milestones.” Consider, for example, the identification of word-final English stops by native Russian (NR) speakers. For Russian monolinguals, closure voicing is a far more important cue for the identification of word-final stops as /k/ or /ɡ/ than is preceding vowel duration. Individual NR speakers living in the United States either do or do not make significant use of vowel duration when identifying final stops as /k/ or /ɡ/, and they either do or do not learn to weight vowel duration more highly than closure voicing (Dmitrieva, 2019). A statistically significant use of vowel duration as a perceptual cue, and a switch from closure voicing to vowel duration as the primary perceptual cue to stop voicing in English, are specific L2 milestones that might be assessed in L2 speech learning research.
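Whether an individual listener has reached the first of these milestones can be assessed directly from that listener’s identification data. The sketch below is illustrative only: the function name, the trial counts, and the simulated responses are our own assumptions, not data from any study cited here. It runs a simple permutation test asking whether one listener’s /ɡ/-versus-/k/ responses depend on the preceding vowel duration:

```python
import numpy as np

def uses_cue(cue_values, responses, n_perm=5000, seed=0):
    """Permutation test for one listener: do /g/ responses (coded 1)
    occur at systematically different cue values than /k/ responses
    (coded 0)?  Returns (observed mean difference, two-sided p-value)."""
    rng = np.random.default_rng(seed)
    cue = np.asarray(cue_values, dtype=float)
    resp = np.asarray(responses)
    observed = cue[resp == 1].mean() - cue[resp == 0].mean()
    exceed = 0
    for _ in range(n_perm):
        shuffled = rng.permutation(resp)  # break the cue-response link
        diff = cue[shuffled == 1].mean() - cue[shuffled == 0].mean()
        if abs(diff) >= abs(observed):
            exceed += 1
    return observed, (exceed + 1) / (n_perm + 1)  # add-one p-value

# Hypothetical listener whose /g/ responses cluster at long vowel
# durations (in ms): a clear case of "significant use of vowel duration."
durations = [50] * 20 + [150] * 20
labels = [0] * 20 + [1] * 20
diff_ms, p = uses_cue(durations, labels)
```

A listener for whom p falls below the chosen alpha level would be scored as having achieved the milestone; the same test applied to each yearly sample would date when the milestone was reached.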
Examples of other such milestones are, for Korean learners of English, a switch from primary use of temporal cues to primary use of spectral cues in the identification of English vowels as /ε/ or /æ/ (Flege et al., 1997; Kim et al., 2018), and the production of English /r/ by native Japanese speakers with or without overlap in the F2 and F3 values (Iverson et al., 2005; see also Chapter 2). The intersubject variability illustrated in Figure 1.3 highlights the need to understand why individual L2 learners sometimes differ substantially from one another. As we see it, this may require knowing how individuals specified the closest L1 phonetic category (presumably Spanish /p/, /t/, and /k/ in the current context) when they were first exposed to their L2 (here, English), how they mapped the target L2 sounds (/p t k/) onto L1 sounds at first exposure, how phonetically dissimilar the L2 sounds were judged to be from the relevant L1 sounds, and the quantity and quality of the L2 phonetic input that individuals received (see Flege & Wayland, 2019, for discussion). It is also important to know if the individuals under examination have or have not formed new phonetic categories for the target L2 sounds of interest (here, English /p/, /t/, and /k/) and whether the presence or absence of category formation was influenced by individual differences in auditory acuity, precategorical auditory processing, and auditory working memory.

1.3.2.11  Speech Learning Analyses

Observing when, or if, speech learning milestones have been achieved by individual L2 learners is the first step in data analysis. Consider, for example, the application of this approach to the learning of word-final English stops by native Russian (NR) speakers. Acoustic phonetic distinctions between /b/-/p/, /d/-/t/, and /k/-/ɡ/ in the final position of Russian words are incompletely neutralized (e.g., Dmitrieva, Jongman, & Sereno, 2010; Kharlamov, 2014). Dmitrieva (2019) tested NE monolinguals, NR monolinguals, and Russians learning English in the United States. Russian monolinguals made less perceptual use of vowel duration, and more use of closure voicing, to distinguish /ɡ/ from /k/ in the final position of Russian words than NE monolinguals did for English words. The NR learners of English showed an increased use of vowel duration as a perceptual cue and, in fact, some were found to closely resemble NE monolinguals in this regard. Imagine a similar study examining the categorization of word-final stops by 100 NE monolinguals and 100 NR immigrants to the United States who are tested twice, the first time when the Russians have lived for one year in the United States (time 1) and a second time three years later (time 2). If 98 of the 100 NE monolinguals are found to make significant use of vowel duration at both time 1 and time 2, and if a significantly larger number of the NR participants are found to do so at time 2 than at time 1 (say, 50 vs. 10), it would demonstrate phonetic learning by the Russians.
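Because the same participants are classified at both test times, the rise from 10 to 50 cue users could be tested with an exact McNemar test on the discordant pairs. The sketch below is a minimal illustration; the split of the 100 participants into “gained” and “lost” cells is invented for the example (only the 10 and 50 totals come from the hypothetical above):

```python
from math import comb

def mcnemar_exact(gained, lost):
    """Exact (binomial) McNemar test on discordant pairs.
    gained: participants using the cue at time 2 but not time 1.
    lost:   participants using the cue at time 1 but not time 2.
    Returns a two-sided p-value for a change between test times."""
    n = gained + lost
    if n == 0:
        return 1.0
    k = max(gained, lost)
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)

# Hypothetical cells consistent with 10 users at time 1 and 50 at time 2:
# 8 used the cue at both times, 42 gained it, 2 lost it, 48 never used it.
p = mcnemar_exact(gained=42, lost=2)
```

McNemar’s test is appropriate here because each participant serves as his or her own control, so only the participants who changed status between the two test times carry information about learning.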
Such an analysis would not tell us, however, why 10 NR participants already showed a significant use of vowel duration at time 1, nor why 50 had still not done so at time 2. A more comprehensive study would be needed to answer these crucial questions. Linear mixed-effects models (e.g., Magezi, 2015) could be developed to draw more general conclusions about how L2 speech is learned over time in a comprehensive longitudinal study. We will now illustrate the kind of research questions that might be addressed within the SLM-r framework by considering a hypothetical longitudinal study. The aim of the hypothetical study we have in mind would be to determine how NR speakers without special training or aptitude learn to
produce and perceive /b d ɡ p t k/ in the final position of English words. All participants would be 20 years or older when they arrived in the United States. Crucially, all would (1) be enrolled for the study within four months of their arrival in the United States, (2) have had similar formal education in English in Russia, and (3) have had little experience conversing in English before arriving. Their later experiences learning English in the United States and their auditory capacities, on the other hand, would be expected to vary. The first data sample (time “0”) in the hypothetical study would focus on Russian. An NR monolingual would test the NR speakers soon after their arrival in the United States. The NR participants would be asked to produce and perceive all six Russian stops in the final position of Russian words and provide information pertaining to prior experience in English. Individual differences in auditory acuity, early-stage (precategorical) auditory processing, and auditory working memory would also be evaluated. Finally, the participants would be asked to categorize productions of the six Russian stops in a six-alternative forced-choice task and rate the perceived phonetic dissimilarity of pairs of the corresponding English and Russian stops (36 pair types in all). The auditory capacity tests administered at time 0 would be readministered at yearly intervals, but the results are not expected to change over time. The data obtained at time 1 and in subsequent samples, which focus on English, would be elicited in English by an English monolingual. In each sample, the NR participants would be asked to produce the six English stop consonants in the final position of English words, categorize naturally produced tokens of English /b/, /d/, /ɡ/, /p/, /t/, and /k/, report how many hours per week overall they have used English to communicate verbally, and report how many of those hours per week they used English with NE speakers.
We anticipate that many NR participants would report increased use of English over the course of the longitudinal study but that important individual differences would be evident in English use, especially use of English with NE speakers. The predictor variables to be examined in the study would be the NR participants’ (1) use of vowel duration and closure voicing to categorize Russian stops at time 0; (2) the precision of the phonetic categories they have developed for all six Russian stops when assessed at time 0; (3) their auditory capacity at time 0; (4) estimated total hours of weekly use of English at time 1 and in subsequent samples; and (5) estimated hours of English use with English monolinguals at the same sampling intervals.
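A full analysis of such longitudinal data would use the linear mixed-effects models mentioned earlier in this section. As a minimal two-stage stand-in, the sketch below fits a per-participant slope of cue use over the yearly samples; those slopes could then be related to the predictor variables just listed (English use, auditory capacity, and so on). All names and numbers are invented for illustration:

```python
import numpy as np

def learning_slopes(times, scores_by_subject):
    """Fit each participant's ordinary-least-squares slope of cue use
    over the sampling times (a two-stage stand-in for a random-slopes
    mixed model).  Returns {subject: change in cue-use score per year}."""
    t = np.asarray(times, dtype=float)
    return {subj: float(np.polyfit(t, np.asarray(y, dtype=float), 1)[0])
            for subj, y in scores_by_subject.items()}

# Hypothetical yearly cue-use scores (proportion of trials classified
# using vowel duration) for two NR participants.
years = [1, 2, 3, 4]
scores = {"high_use": [0.1, 0.3, 0.5, 0.7],   # steady gain in cue use
          "low_use":  [0.1, 0.1, 0.1, 0.1]}   # no change over four years
slopes = learning_slopes(years, scores)
```

In a real analysis the per-participant slopes would not be treated as error-free; a mixed model estimates individual trajectories and their predictors simultaneously, which is why the chapter recommends that approach for the full longitudinal dataset.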
The dependent variables to be examined at time 1 and in subsequent samples would be the NR participants’ use of vowel duration and closure voicing in the categorization and production of all six word-final English stops. The accuracy with which these stops are produced would be assessed both acoustically and via listener judgments. The research design just outlined could be used to address a number of theoretically important research questions. The first research question that could be addressed is whether NR participants will show increasingly less use of closure voicing and increasingly more use of preceding vowel duration to distinguish /b/-/p/, /d/-/t/, and /k/-/ɡ/ as they gain experience in English. According to the SLM-r, L2 speech learning depends on the input distributions of L2 sounds to which learners have been exposed. The expectation here is that the NR participants will make increasing use of vowel duration and decreasing use of closure voicing when categorizing English stops as a result of the greater reliability of the former than the latter cue in the speech of NE speakers. Given the SLM-r hypothesis that production and perception coevolve, the model predicts the same trends in production. The amount of input needed to achieve various L2 speech learning milestones, and the vowel duration and closure voicing effect sizes evident in subsequent analyses, may be modulated by individual differences in auditory capacity. Specifically, the SLM-r predicts that individuals with relatively limited auditory capacities will need more native-speaker input to achieve the same milestones (or effect sizes) than individuals with superior auditory capacities. Evidence that some participants show no evidence of speech learning would seriously undermine the SLM-r if that evidence of “failure to learn” cannot be attributed to a paucity of English input, to inadequate auditory capacity, or to some combination of both.
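The reliability claim above can be operationalized. One simple index of a cue’s reliability is the separation between the two voicing categories on that cue, relative to their spread. The sketch below uses toy numbers of our own invention, not measured data:

```python
import numpy as np

def cue_reliability(voiced, voiceless):
    """Separation of one acoustic cue across the voicing categories:
    |difference of category means| / pooled standard deviation (a
    d'-like index).  Larger values = the cue more reliably signals
    the voicing contrast in the input."""
    v = np.asarray(voiced, dtype=float)
    u = np.asarray(voiceless, dtype=float)
    pooled_sd = np.sqrt((v.var(ddof=1) + u.var(ddof=1)) / 2.0)
    return abs(v.mean() - u.mean()) / pooled_sd

# Toy English tokens: preceding vowel duration (ms) separates final
# /g/ from /k/ well, while closure voicing (ms of voicing during the
# closure) overlaps heavily across the two categories.
dur_reliab = cue_reliability([150, 160, 170, 155], [90, 95, 100, 105])
voi_reliab = cue_reliability([30, 10, 40, 20], [15, 25, 5, 35])
```

On input with this structure, a learner who tracks input distributions should come to weight vowel duration more heavily than closure voicing, which is exactly the trajectory the SLM-r predicts for the NR participants.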
The second research question that might be addressed by the hypothetical study being outlined here is whether the NR participants will bring somewhat different Russian phonetic categories to the task of learning English, and whether they will differ in terms of how precisely their Russian categories are defined. We expect a positive answer to both questions. That being the case, the SLM-r predicts that (1) individual differences in L1 category specification (e.g., whether individual participants exploited vowel duration in Russian before arrival in the United States) and L1 precision (see Section 1.3.2.4) will influence how phonetically dissimilar the participants perceive corresponding pairs of Russian and English stops to be at time 0, and (2) degree of perceived cross-language phonetic dissimilarity at time 0 will
subsequently influence the extent to which individual NR participants approximate how most NE monolinguals produce and categorize English /b/, /d/, /ɡ/, /p/, /t/, and /k/. The third question that might be addressed by the research design sketched above is whether the NR participants will show greater evidence of learning for the voiced English stops (/b/, /d/, /ɡ/) than for the voiceless English stops (/p/, /t/, /k/). Will they perceive English /b/, /d/, and /ɡ/ to be phonetically more dissimilar from the corresponding “voiced” Russian stops than they perceive English /p/, /t/, and /k/ to be from the corresponding Russian stops? Will perceived cross-language dissimilarity predict the amount of learning evident for the six English stops? The fourth and final question regards category formation. A positive response to all three questions in the last paragraph would suggest the formation of new phonetic categories for English /b/, /d/, and /ɡ/ but not for English /p/, /t/, and /k/. This interpretation could be evaluated within the SLM-r framework by repeating the time 0 sample, which focused on Russian stops, once English data collection has been completed. The production and perception of Russian voiced stops are predicted to remain unchanged if new categories have been established for the corresponding English stops (/b/, /d/, /ɡ/), whereas the production and perception of Russian voiceless stops are predicted to change if “composite” Russian–English phonetic categories have developed in the absence of phonetic category formation for the corresponding English stops (/p/, /t/, /k/). This might be manifested, for example, by the more frequent production of audible release bursts in the voiceless Russian stops, the prolongation of stop closure intervals, or the production of stops with higher F1 offset frequency values than were evident at time 0.
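The predicted drift in the L1 could be checked with a paired comparison of each acoustic measure at time 0 versus the final sample. A sketch with invented closure-duration values for eight hypothetical participants, using scipy’s paired t-test (the numbers are illustrative, not measurements):

```python
from scipy.stats import ttest_rel

# Hypothetical Russian voiceless-stop closure durations (ms) for eight
# NR participants at time 0 and at the final sampling point.
time0 = [62, 58, 65, 60, 59, 63, 61, 57]
final = [68, 62, 71, 67, 63, 68, 66, 63]

# A positive statistic = longer closures at the end of the study,
# the direction consistent with a composite Russian-English category.
stat, p = ttest_rel(final, time0)
```

With small samples or skewed measures, a Wilcoxon signed-rank test would be a more cautious choice; the logic of pairing each participant’s time 0 value with his or her final value is the same.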
The hypothetical study we just outlined does not include a comparison of groups defined by age of arrival in the United States (say, 7–12 vs. 20–25 years of age). As we see it, the very different experiences that adult and child immigrants typically have when learning an L2 make such a comparison impractical. Compared to children, for example, adult immigrants usually have had far more formal instruction in English before arriving in the host country (usually from nonnative teachers), have larger vocabularies, and typically receive less native-speaker input after arriving than children typically do. This is because adults typically acculturate less rapidly than children (Cheung, Chudek, & Heine, 2011; Jia & Aaronson, 2003).
It would be valuable, however, to examine Russian children learning English in a separate or extended study. The SLM-r predicts that the pattern of findings that emerge from research examining adult and child L2 learners will be much the same, albeit extended over longer periods of time for adults than children. This expectation derives from the SLM-r hypothesis that adults and children learn L2 speech in the same way because they exploit the same capacities to learn speech.

1.4  Summary

Like its predecessor, the SLM-r focuses on how sequential bilinguals produce and perceive position-sensitive allophones of L2 vowels and consonants. Its aim is to account for how phonetic systems reorganize over the life-span in response to the phonetic input received during naturalistic L2 learning. The core tenets of the SLM-r can be summarized as follows:

1. L2 experience. The SLM focused on highly “experienced” L2 learners and the question of whether such learners will eventually “master” L2 sounds. The SLM-r has abandoned this approach because it now seems evident, at least to us, that L2 learners can never perfectly match monolingual native speakers of the target L2. This is because the phonetic elements making up the L1 and L2 phonetic subsystems of a bilingual necessarily interact, and because the phonetic input upon which new L2 phonetic categories are based cannot be identical to the input that native speakers receive.

2. Production and perception. The SLM hypothesized that the accuracy of perceptual representations for L2 sounds places an upper limit on the accuracy with which the L2 sounds can be produced. The SLM-r, on the other hand, proposes that segmental production and perception coevolve without precedence.

3. L2 category formation. Phonetic category formation is possible regardless of age of first exposure to an L2 and is crucial for phonetic organization and reorganization across the life-span. The creation of new phonetic categories for L2 sounds creates an important nonlinearity in the transformation of phonetic input into phonetic performance. When a new category is not formed for L2 sounds that differ phonetically from the closest L1 sound, a composite L1–L2 phonetic category will develop that is based on phonetic input from two languages.



The Revised Speech Learning Model (SLM-r)



 4. The full access hypothesis. According to the SLM “feature” hypothesis, a new phonetic category formed for an L2 sound might differ from the phonetic category formed for the same sound by native speakers if the L2 sound is defined, at least in part, by features not used in the learner’s L1. The SLM-r adopts the “full access” hypothesis (Flege, 2005b) according to which L2 learners can gain access to such non-L1 features. The SLM-r proposes that all processes and mechanisms used to develop L1 phonetic categories, without exception, remain intact and accessible for L2 learning.
 5. Cue weighting. The SLM-r proposes that both new L2 phonetic categories and composite L1–L2 phonetic categories are gradually shaped by the input distributions defining them and are driven by the adaptive need to ensure the rapid and accurate categorization of phonetic segments. By hypothesis, the weighting of multiple perceptual cues that define new L2 categories and composite L1–L2 categories is based on input distributions and so reflects the reliability with which cues are present.
 6. Phonetic factors. According to the SLM-r, the formation or nonformation of a new phonetic category for an L2 sound depends primarily on (1) the sound’s degree of perceived phonetic dissimilarity from the closest L1 sound, (2) the quantity and quality of L2 input obtained for the sound in meaningful conversations, and (3) the precision with which the closest L1 category is specified when L2 learning begins.
 7. L1 category precision. The “category precision” hypothesis of the SLM-r differs importantly from the earlier SLM “age” hypothesis, which it replaces. It predicts that individuals having relatively precise L1 phonetic categories will be better able to discern phonetic differences between an L2 sound and the closest L1 sound than individuals having relatively imprecise L1 categories. This, in turn, will increase their likelihood of forming new phonetic categories for L2 sounds.
L1 category precision generally increases through childhood and into early adolescence, but important individual differences exist at all ages. This means that variation in L1 category precision can be dissociated from putative age-related changes in neurocognitive plasticity at the time individuals are first exposed to an L2.
 8. L1 phonetic category differences. Individual speakers of a single L1 may bring somewhat different L1 phonetic categories to the task of learning an L2. Their L1 categories may differ in terms of cue weighting, which is thought to depend primarily on the input



James Emil Flege and Ocke-Schwen Bohn

received during L1 speech development, and also according to how precisely the L1 categories are defined.
 9. Endogenous factors. Phonetic category formation for an L2 sound depends on the discernment of cross-language phonetic differences, the creation of stable perceptual links between L1 and L2 sounds, the aggregation of “equivalence classes” of L2 sounds that are perceived to be distinct from the realizations of any L1 phonetic category and, finally, the sundering of previously established L1–L2 perceptual links. Individual differences in auditory acuity, early-stage (precategorical) auditory processing, and auditory working memory may modulate these phonetic processes by affecting how much L2 phonetic input is needed to pass from one stage to the next.
10. Intersubject variability. Individuals differ in terms of how accurately they produce and perceive L2 sounds. By hypothesis, intersubject phonetic variability can be explained, at least in part, by knowing how individual learners’ L1 phonetic categories were specified when they were first exposed to an L2, how they perceptually linked L2 sounds to L1 sounds via the mechanism of interlingual identification, how dissimilar they perceived an L2 sound to be from the closest L1 sound, and the quantity and quality of L2 phonetic input they have received.
11. Continuous learning. The phonetic categories and realization rules deployed in the L1 and L2 phonetic subsystems remain malleable across the life-span, responding to variation in the phonetic input that has been received, even recent input. An “end state” in learning can be said to exist only for individuals who are no longer exposed to phonetic input differing from what they were exposed to previously in life.

The SLM-r presented here provides a framework for research that may eventually permit an understanding of how speech is learned across the life-span and why individuals seemingly differ in their ability to learn L2 speech.
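Tenet 5 holds that perceptual cue weights come to reflect how reliably each cue separates categories in the input. As a purely illustrative sketch (not part of the SLM-r itself), the toy simulation below generates tokens of two hypothetical categories from two Gaussian cues, scores each cue with a d′-style separability measure, and normalizes the scores into weights; all means, spreads, and the weighting rule are assumptions made for the example.

```python
import math
import random
import statistics

random.seed(1)

# Tokens of two hypothetical categories, each carrying two acoustic cues.
# Cue A separates the categories reliably in the input; cue B barely does.
def sample(category, n=500):
    mean_a = 0.0 if category == 0 else 3.0   # reliable cue
    mean_b = 0.0 if category == 0 else 0.5   # unreliable cue
    return [(random.gauss(mean_a, 1.0), random.gauss(mean_b, 1.0))
            for _ in range(n)]

cat0, cat1 = sample(0), sample(1)

def separability(idx):
    """d'-like score: mean separation over average within-category spread."""
    x0 = [t[idx] for t in cat0]
    x1 = [t[idx] for t in cat1]
    spread = math.sqrt((statistics.pvariance(x0) + statistics.pvariance(x1)) / 2)
    return abs(statistics.mean(x0) - statistics.mean(x1)) / spread

d_a, d_b = separability(0), separability(1)
w_a, w_b = d_a / (d_a + d_b), d_b / (d_a + d_b)   # normalized cue weights
print(f"weight(cue A) = {w_a:.2f}, weight(cue B) = {w_b:.2f}")
```

On this construction the reliable cue ends up carrying most of the weight, which is the qualitative pattern the tenet predicts; a learner whose input made cue B more informative would, by the same rule, shift weight toward it.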
The model is based on the results of many published studies but will, of course, need to be evaluated in prospective research. We recognize the immensity of this task and realize that evaluating the model will require considerable resources as well as the development of improved methodologies and measurement techniques. With that in mind, we provide, in Chapter 3, suggestions regarding how best to obtain speech production data, how to assess the quantity and quality of L2 input, and how to test for category formation.
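The “category precision” hypothesis (tenet 7) can also be stated quantitatively: holding an L2 token and the nearest L1 category mean fixed, a more precise (lower-variance) L1 category renders the same token more atypical in standardized terms, and hence easier to discern as different from the L1 sound. The minimal sketch below uses arbitrary, hypothetical values chosen only to illustrate the relation.

```python
# Hypothetical illustration: the same L2 token measured against the closest
# L1 category of two listeners who differ only in category precision.
def atypicality(l2_token, l1_mean, l1_sd):
    """Standardized distance of an L2 token from an L1 category."""
    return abs(l2_token - l1_mean) / l1_sd

token, l1_mean = 1.5, 0.0                                    # arbitrary units
precise_listener = atypicality(token, l1_mean, l1_sd=0.5)    # narrow category
imprecise_listener = atypicality(token, l1_mean, l1_sd=1.5)  # broad category
print(precise_listener, imprecise_listener)
```

The precise listener registers a standardized distance three times larger for the identical token, which is the sense in which L1 precision should raise the likelihood of discerning a cross-language phonetic difference and, by hypothesis, of forming a new category.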






References

Allen, J. S., Miller, J. L., & DeSteno, D. (2003). Individual talker differences in voice-onset-time. Journal of the Acoustical Society of America, 113, 544–552.
Allen, J. S., & Miller, J. L. (2004). Listener sensitivity to individual talker differences in voice-onset-time. Journal of the Acoustical Society of America, 115(6), 3171–3183.
Anderson, J., Morgan, J., & White, K. (2003). A statistical basis for speech sound discrimination. Language and Speech, 46, 155–182.
Antetomaso, S., Miyazawa, K., Feldman, N., Elsner, M., Hitczenko, K., & Mazuka, R. (2017). Modeling phonetic category learning from natural acoustic data. In M. LaMendola & J. Scott (Eds.), Proceedings of the 41st annual Boston University Conference on Language Development (pp. 32–45). Somerville, MA: Cascadilla Press.
Aslin, R. (2014). Phonetic category learning and its influence on speech production. Ecological Psychology, 26(4), 4–15.
Baker, W., & Trofimovich, P. (2006). Perceptual paths to accurate production of L2 vowels: The role of individual differences. International Review of Applied Linguistics, 44, 231–259.
Baker, W., Trofimovich, P., Flege, J. E., Mack, M., & Halter, R. (2008). Child–adult differences in second-language phonological learning: The role of cross-language similarity. Language and Speech, 51(4), 317–342.
Benders, T., Escudero, P., & Sjerps, M. J. (2012). The interrelation between acoustic context effects and available response categories in speech sound categorization. Journal of the Acoustical Society of America, 131(4), 3079–3097.
Bent, T. (2014). Children’s perception of foreign-accented words. Journal of Child Language, 41(6), 1334–1355.
Bent, T. (2018). Development of unfamiliar accent comprehension continues through adolescence. Journal of Child Language, 45, 1400–1411.
Bent, T., & Holt, R. F. (2018). Shhh … I need quiet! Children’s understanding of American, British, and Japanese-accented English speakers. Language and Speech, 61(4), 657–673.
Best, C. T. (1995). A direct realist view of cross-language speech perception. In W. Strange (Ed.), Speech perception and linguistic experience: Issues in cross-language research (pp. 107–126). Baltimore, MD: York Press.
Best, C. T., & Tyler, M. D. (2007). Nonnative and second-language speech perception. In O.-S. Bohn & M. J. Munro (Eds.), Language experience in second language learning: In honor of James Emil Flege (pp. 13–44). Amsterdam: John Benjamins.
Bloomfield, L. (1933). Language. New York: Holt.
Bohn, O.-S. (2002). On phonetic similarity. In P. Burmeister, T. Piske, & A. Rohde (Eds.), An integrated view of language development: Papers in honor of Henning Wode (pp. 191–216). Trier, Germany: Wissenschaftlicher Verlag Trier.
Bohn, O.-S., & Flege, J. E. (1993). Perceptual switching in Spanish/English bilinguals. Journal of Phonetics, 21, 267–290.




Bohn, O.-S. (2020). Cross-language phonetic relationships account for most, but not all L2 speech learning problems: The role of universal phonetic biases and generalized sensitivities. In M. Wrembel, A. Kiełkiewicz-Janowiak, & P. Gąsiorowski (Eds.), Approaches to the study of sound structure and speech: Interdisciplinary work in honour of Katarzyna Dziubalska-Kołaczyk (pp. 171–184). Abingdon, England: Routledge.
Bohn, O.-S., & Bundgaard-Nielsen, R. L. (2009). Second language speech learning with diverse inputs. In T. Piske & M. Young-Scholten (Eds.), Input matters in SLA (pp. 207–218). Clevedon, England: Multilingual Matters.
Bohn, O.-S., & Ellegaard, A. A. (2019). Perceptual assimilation and graded discrimination as predictors of identification accuracy for learners differing in L2 experience: The case of Danish learners’ perception of English initial fricatives. In Proceedings of the 19th International Congress of Phonetic Sciences (pp. 2070–2074).
Bohn, O.-S., & Steinlen, A. K. (2003). Consonantal context affects cross-language perception of vowels. In Proceedings of the 15th International Congress of Phonetic Sciences (pp. 2289–2292).
Bosch, L., & Ramon-Casas, M. (2011). Variability in vowel production by bilingual speakers: Can input properties hinder the early stabilization of contrastive categories? Journal of Phonetics, 39, 514–526.
Bradlow, A., Akahane-Yamada, R., Pisoni, D., & Tohkura, Y. (1999). Training Japanese listeners to identify English /r/ and /l/: Long-term retention of learning in perception and production. Perception and Psychophysics, 61(5), 977–985.
Bradlow, A. R., & Bent, T. (2008). Perceptual adaptation to non-native speech. Cognition, 106(2), 707–729.
Brière, E. J. (1966). An investigation of phonological interferences. Language, 42(4), 768–796.
Broersma, M. (2005). Perception of familiar contrasts in unfamiliar positions. Journal of the Acoustical Society of America, 117(6), 3890–3901.
Buckler, H., Oczak-Arsic, S., Siddiqui, N., & Johnson, E. K. (2017). Input matters: Speed of word recognition in 2-year-olds exposed to multiple accents. Journal of Experimental Child Psychology, 164, 87–100.
Bundgaard-Nielsen, R. L., Best, C. T., & Tyler, M. D. (2011). Vocabulary size is associated with second-language vowel perception performance in adult learners. Studies in Second Language Acquisition, 33, 433–461.
Callan, D. E., Tajima, K., Callan, A. M., Kubo, R., Masaki, S., & Akahane-Yamada, R. (2003). Learning-induced neural plasticity associated with improved identification performance after training of a difficult second-language phonetic contrast. NeuroImage, 19, 113–124.
Callan, D. E., Jones, J. A., Callan, A. M., & Akahane-Yamada, R. (2004). Phonetic perceptual identification by native- and second-language speakers differentially activates brain regions involved with acoustic phonetic processing and those involved with articulatory-auditory/orosensory internal models. NeuroImage, 22, 1182–1194.






Casillas, J. V., & Simonet, M. (2018). Perceptual categorization and bilingual language modes: Assessing the double phonemic boundary in early and late bilinguals. Journal of Phonetics, 71, 51–64.
Cebrian, J. (2006). Experience and the use of non-native duration in L2 vowel categorization. Journal of Phonetics, 34, 372–387.
Chandrasekaran, B., Sampath, P., & Wong, P. C. M. (2010). Individual variability in cue-weighting and lexical tone learning. Journal of the Acoustical Society of America, 128(1), 456–465.
Chao, S.-C., Ochoa, D., & Daliri, A. (2019). Production variability and categorical perception of vowels are strongly linked. Frontiers in Human Neuroscience, 13. doi:10.3389/fnhum.2019.00096.
Cheung, B., Chudek, M., & Heine, S. (2011). Evidence for a sensitive period for acculturation: Younger immigrants report acculturating at a faster rate. Psychological Science, 22(2), 147–152.
Chládková, K., & Podlipský, V. J. (2011). Native dialect matters: Perceptual assimilation of Dutch vowels by Czech listeners. Journal of the Acoustical Society of America, 130(4), EL186–EL192.
Chodroff, E., & Wilson, C. (2017). Structure in talker-specific phonetic realization: Covariation of stop consonant VOT in American English. Journal of Phonetics, 61, 30–47.
Clarke, C., & Luce, P. (2005). Perceptual adaptation to speaker characteristics: VOT boundaries in stop voicing categorization. In Proceedings of the ISCA Workshop on Plasticity in Speech Perception (pp. 15–17).
Clayards, M. (2018). Differences in cue weights for speech perception are correlated for individuals within and across contrasts. Journal of the Acoustical Society of America, 144(3), EL172–EL177.
Darcy, I., & Krüger, F. (2012). Vowel perception and production in Turkish children acquiring L2 German. Journal of Phonetics, 40, 568–581.
DeKeyser, R., & Larson-Hall, J. (2005). What does the critical period really mean? In J. F. Kroll & A. M. B. de Groot (Eds.), Handbook of bilingualism: Psycholinguistic approaches (pp. 88–108). New York: Oxford University Press.
de Leeuw, E., & Celata, C. (2019). Plasticity of native phonetic and phonological domains in the context of bilingualism. Journal of Phonetics, 75, 88–93.
Díaz, B., Mitterer, H., Broersma, M., Escera, C., & Sebastián-Gallés, N. (2015). Variability in L2 phonemic learning originates from speech-specific capabilities: An MMN study on late bilinguals. Bilingualism: Language and Cognition, 19(5), 955–970.
Díaz, B., Mitterer, H., Broersma, M., & Sebastián-Gallés, N. (2012). Individual differences in late bilinguals’ L2 phonological processes: From acoustic-phonetic to lexical access. Learning and Individual Differences, 22, 680–689.
DiCanio, C., Nam, H., Amith, J. D., García, R. C., & Whalen, D. H. (2015). Vowel variability in elicited versus spontaneous speech: Evidence from Mixtec. Journal of Phonetics, 48, 45–59.




Dmitrieva, O. (2019). Transferring perceptual cue-weighting from second language into first language: Cues to voicing in Russian speakers of English. Journal of Phonetics, 73, 128–143.
Dmitrieva, O., Jongman, A., & Sereno, J. A. (2010). Phonological neutralization by native and non-native speakers: The case of Russian final devoicing. Journal of Phonetics, 38(3), 483–492.
Dmitrieva, O., Llanos, F., Shultz, A. A., & Francis, A. L. (2015). Phonological status, not voice onset time, determines the acoustic realization of onset f0 as a secondary voicing cue in Spanish and English. Journal of Phonetics, 49, 77–95.
Docherty, G. J., Watt, D., Llamas, C., Hall, D., & Nycz, J. (2011). Variation in voice onset time along the Scottish border. In Proceedings of the 17th International Congress of Phonetic Sciences (pp. 591–594).
Earle, F. S., & Myers, E. B. (2015). Overnight consolidation promotes generalization across talkers in the identification of nonnative speech sounds. Journal of the Acoustical Society of America, 137(1), EL91–EL97.
Eilers, R. E., & Oller, D. K. (1976). The role of speech discrimination in developmental sound substitutions. Journal of Child Language, 3(3), 319–329.
Elman, J. L., Diehl, R. L., & Buchwald, S. E. (1977). Perceptual switching in bilinguals. Journal of the Acoustical Society of America, 62(4), 971–974.
Escudero, P., Benders, T., & Lipski, S. (2009). Native, non-native and L2 perceptual cue weighting for Dutch vowels: The case of Dutch, German, and Spanish listeners. Journal of Phonetics, 37(4), 452–465.
Escudero, P., & Boersma, P. (2004). Bridging the gap between L2 speech perception research and phonological theory. Studies in Second Language Acquisition, 26(4), 551–585.
Escudero, P., Sisinni, B., & Grimaldi, M. (2014). The effect of vowel inventory and acoustic properties in Salento Italian learners of Southern British English vowels. Journal of the Acoustical Society of America, 135(3), 1577–1584.
Escudero, P., & Williams, D. (2012). Native dialect influences second-language vowel perception: Peruvian versus Iberian Spanish learners of Dutch. Journal of the Acoustical Society of America, 131(5), EL406–EL412.
Evans, B. G., & Iverson, P. (2004). Vowel normalization for accent: An investigation of best exemplar locations in northern and southern British English sentences. Journal of the Acoustical Society of America, 115(1), 352–361.
Evans, S., & Davis, M. H. (2015). Hierarchical organization of auditory and motor representations in speech perception: Evidence from searchlight similarity analysis. Cerebral Cortex, 25(12), 4772–4788. doi:10.1093/cercor/bhv136.
Feldman, N. H., Griffiths, T. L., Goldwater, S., & Morgan, J. L. (2013). A role for the developing lexicon in phonetic category acquisition. Psychological Review, 120(4), 751–778.






Feldman, N. H., Griffiths, T. L., & Morgan, J. L. (2009). The influence of categories on perception: Explaining the perceptual magnet effect as optimal statistical inference. Psychological Review, 116(4), 752–782.
Flege, J. E. (1984). The detection of French accent by American listeners. Journal of the Acoustical Society of America, 76(3), 692–707.
Flege, J. E. (1987). The production of “new” and “similar” phones in a foreign language: Evidence for the effect of equivalence classification. Journal of Phonetics, 15, 47–65.
Flege, J. E. (1988). Factors affecting degree of perceived foreign accent in English sentences. Journal of the Acoustical Society of America, 84(1), 70–79.
Flege, J. E. (1991). Age of learning affects the authenticity of voice-onset time (VOT) in stop consonants produced in a second language. Journal of the Acoustical Society of America, 89, 395–411.
Flege, J. E. (1992). The intelligibility of English vowels spoken by British and Dutch talkers. In R. D. Kent (Ed.), Intelligibility in speech disorders: Theory, measurement, and management (pp. 157–232). Amsterdam: John Benjamins.
Flege, J. E. (1995). Second-language speech learning: Theory, findings, and problems. In W. Strange (Ed.), Speech perception and linguistic experience: Issues in cross-language research (pp. 229–273). Timonium, MD: York Press.
Flege, J. E. (1999). Relation between L2 production and perception. In J. Ohala et al. (Eds.), Proceedings of the XIVth International Congress of Phonetic Sciences (pp. 1273–1276). Berkeley, CA: Department of Linguistics, University of California.
Flege, J. E. (2005a). Origins and development of the Speech Learning Model. Paper presented at the Acoustical Society of America Workshop in L2 speech learning, Simon Fraser University, Vancouver, BC. doi:10.13140/RG.2.2.10181.19681.
Flege, J. E. (2005b). Evidence for plasticity in studies examining second language speech acquisition. Paper presented at the ISCA Workshop on Plasticity in Speech Perception, University College London. doi:10.13140/RG.2.2.34539.80167.
Flege, J. E. (2007). Language contact in bilingualism: Phonetic system interactions. In J. Cole & J. Hualde (Eds.), Laboratory phonology (Vol. 9, pp. 353–380). Berlin: Mouton de Gruyter.
Flege, J. E. (2019). A non-critical period for second-language speech learning. In A. M. Nyvad, M. Hejná et al. (Eds.), A sound approach to language matters: In honor of Ocke-Schwen Bohn (pp. 501–541). Aarhus: Department of English, School of Communication & Culture, Aarhus University.
Flege, J. E., Bohn, O.-S., & Yang, S. (1997). Effects of experience on non-native speakers’ production and perception of English vowels. Journal of Phonetics, 25, 437–470.
Flege, J. E., & Davidian, R. (1984). Transfer and developmental processes in adult foreign language speech production. Applied Psycholinguistics, 5, 323–347.




Flege, J. E., & Eefting, W. (1986). Linguistic and developmental effects on the production and perception of stop consonants. Phonetica, 43, 155–171.
Flege, J. E., & Eefting, W. (1987). Production and perception of English stop consonants by native Spanish speakers. Journal of Phonetics, 15(1), 67–83.
Flege, J. E., & Eefting, W. (1988). Imitation of a VOT continuum by native speakers of Spanish and English: Evidence for phonetic category formation. Journal of the Acoustical Society of America, 83, 729–740.
Flege, J. E., Frieda, E. M., Walley, A. C., & Randazza, L. A. (1998). Lexical factors and segmental accuracy in second language speech production. Studies in Second Language Acquisition, 20(2), 155–187.
Flege, J. E., & Hammond, R. (1982). Mimicry of non-distinctive phonetic differences between language varieties. Studies in Second Language Acquisition, 5(1), 1–16.
Flege, J. E., & Liu, S. (2001). The effect of experience on adults’ acquisition of a second language. Studies in Second Language Acquisition, 23, 527–552.
Flege, J. E., & Munro, M. (1994). The word unit in second language speech production and perception. Studies in Second Language Acquisition, 16, 381–411.
Flege, J. E., Munro, M. J., & Fox, R. A. (1994). Auditory and categorical effects on cross-language vowel perception. Journal of the Acoustical Society of America, 95(6), 3623–3641.
Flege, J. E., Munro, M., & MacKay, I. R. A. (1995a). Factors affecting strength of perceived foreign accent in a second language. Journal of the Acoustical Society of America, 97(5), 3126–3134.
Flege, J. E., Munro, M. J., & MacKay, I. R. A. (1995b). Effects of age of second-language learning on the production of English consonants. Speech Communication, 16, 1–26.
Flege, J. E., Munro, M. J., & Skelton, L. (1992). Production of the word-final English /t/-/d/ contrast by native speakers of English, Mandarin, and Spanish. Journal of the Acoustical Society of America, 92(1), 128–143.
Flege, J. E., & Port, R. (1981). Cross-language phonetic interference: Arabic to English. Language and Speech, 24(2), 125–146.
Flege, J. E., Schirru, C., & MacKay, I. R. A. (2003). Interaction between the native and second language phonetic systems. Speech Communication, 40, 467–491.
Flege, J. E., Takagi, N., & Mann, V. (1995). Japanese adults can learn to produce English /ɹ/ and /l/ accurately. Language and Speech, 38, 25–55.
Flege, J. E., & Wang, C. (1989). Native-language phonotactic constraints affect how well Chinese subjects perceive the word-final English /t/-/d/ contrast. Journal of Phonetics, 17, 299–315.
Flege, J. E., & Wayland, R. (2019). The role of input in native Spanish late learners’ production and perception of English phonetic segments. Journal of Second Language Studies, 2(1), 1–45.
Francis, A. L., & Nusbaum, H. C. (2002). Selective attention and the acquisition of new phonetic categories. Journal of Experimental Psychology: Human Perception and Performance, 28(2), 349–366.






Francis, A. L., Kaganovich, N., & Driscoll-Huber, C. (2008). Cue-specific effects of categorization training on the relative weighting of acoustic cues to consonant voicing in English. Journal of the Acoustical Society of America, 124(2), 1234–1251.
Franken, M. K., Acheson, D. J., McQueen, J. M., Eisner, F., & Hagoort, P. (2017). Individual variability as a window on production-perception interactions in speech motor control. Journal of the Acoustical Society of America, 142(4), 2007–2018.
Frieda, E. M., Walley, A. C., Flege, J. E., & Sloane, M. E. (2000). Adults’ perception and production of the English vowel /i/. Journal of Speech, Language and Hearing Research, 43, 129–143.
Galbraith, G. C., Buranahirun, C. E., Kang, J., Ramos, O. V., & Lunde, S. E. (2000). Individual differences in autonomic activity affects brainstem auditory frequency-following response amplitude in humans. Neuroscience Letters, 283(3), 201–204.
Garcia Lecumberri, M. L., Cooke, M., & Cutler, A. (2011). Non-native speech perception in adverse conditions: A review. Speech Communication, 52, 864–886.
Garibaldi, C. L., & Bohn, O.-S. (2015). Phonetic similarity predicts ultimate attainment quite well: The case of Danish /i, y, u/ and /d, t/ for native speakers of English and of Spanish. Paper presented at the 18th International Congress of Phonetic Sciences, Glasgow.
Giannakopoulou, A., Uther, M., & Ylinen, S. (2013). Enhanced plasticity in spoken language acquisition for child learners: Evidence from phonetic training studies in child and adult learners of English. Child Language Teaching and Therapy, 29(2), 201–218.
Golestani, N. (2016). Neuroimaging of phonetic perception in bilinguals. Bilingualism: Language and Cognition, 19(4), 674–682.
Golestani, N., Molko, N., Dehaene, S., Le Bihan, D., & Pallier, C. (2007). Brain structure predicts the learning of foreign speech sounds. Cerebral Cortex, 17, 575–582.
Gottfried, T. L. (1984). Effects of consonant context on the perception of French vowels. Journal of Phonetics, 12, 91–114.
Grosjean, F. (1998). Studying bilinguals: Methodological and conceptual issues. Bilingualism: Language and Cognition, 1, 131–149.
Grosjean, F. (2001). The bilingual’s language modes. In J. Nicol (Ed.), One mind, two languages: Bilingual language processing (pp. 1–22). Oxford: Blackwell.
Guenther, F., Hampson, M., & Johnson, D. (1998). A theoretical investigation of reference frames for the planning of speech movements. Psychological Review, 105(4), 611–633.
Gupta, P., & Dell, G. S. (1999). The emergence of language from serial order and procedural memory. In B. MacWhinney (Ed.), The emergence of language (pp. 447–481). Mahwah, NJ: Lawrence Erlbaum.
Han, Z., & Odlin, T. (Eds.). (2006). Studies of fossilization in second language acquisition. Clevedon, England: Multilingual Matters.




Harrington, J., Palethorpe, S., & Watson, C. (2000). Monophthongal vowel changes in Received Pronunciation: An acoustic analysis of the Queen’s Christmas broadcasts. Journal of the International Phonetic Association, 30(1–2), 63–78.
Hazan, V., & Barrett, S. (2000). The development of phonemic categorization in children aged 6–12. Journal of Phonetics, 28, 377–396.
Hazan, V., & Kim, Y. H. (2010). Can we predict who will benefit from computer-based phonetic training? Paper presented at Interspeech 2010, Satellite Workshop on “Second Language Studies: Acquisition, Learning, Education and Technology,” Waseda University, Tokyo, Japan.
Hazan, V., & Rosen, S. (1991). Individual variability in the perception of cues to place contrasts in initial stops. Perception and Psychophysics, 49(2), 187–200.
Heald, S. L. M., & Nusbaum, H. (2015). Variability in vowel production within and between days. PLoS ONE, 10(9), e0136791. doi:10.1371/journal.pone.0136791.
Hillenbrand, J., Getty, L. A., Clark, M. J., & Wheeler, K. (1995). Acoustic characteristics of American English vowels. Journal of the Acoustical Society of America, 97, 3099–3111.
Hintzman, D. L. (1986). “Schema abstraction” in a multiple trace memory model. Psychological Review, 93(4), 411–428.
Hockett, C. F. (1958). A course in modern linguistics. New York: Macmillan.
Højen, A., & Flege, J. E. (2006). Early learners’ discrimination of second-language vowels. Journal of the Acoustical Society of America, 119(5), 3072–3084.
Holt, L., & Lotto, A. J. (2006). Cue weighting in auditory categorization: Implications for first and second language acquisition. Journal of the Acoustical Society of America, 119(5), 3059–3071.
Holt, L. L., & Lotto, A. J. (2010). Speech perception as categorization. Attention, Perception, and Psychophysics, 72(5), 1218–1227.
Hoormann, J., Falkenstein, M., Hohnsbein, J., & Blanke, L. (1992). The human frequency-following response (FFR): Normal variability and relation to the click-evoked brainstem response. Hearing Research, 59(2), 179–188.
Hopp, H., & Schmid, M. S. (2013). Perceived foreign accent in first language attrition and second language acquisition: The impact of age of acquisition and bilingualism. Applied Psycholinguistics, 34, 361–394.
Houde, J. F., & Jordan, M. I. (1998). Sensorimotor adaptation in speech production. Science, 279(5354), 1213–1216.
Houde, J. F., & Jordan, M. I. (2002). Sensorimotor adaptation of speech I: Compensation and adaptation. Journal of Speech, Language, and Hearing Research, 45, 295–310.
Hu, W., Mi, L., Yang, Z., Tao, S., Li, M., Wang, W., Dong, Q., & Liu, C. (2016). Shifting perceptual weights in L2 vowel identification after training. PLoS ONE, 11(9), e0162876. doi:10.1371/journal.pone.0162876.
Idemaru, K., & Holt, L. L. (2011). Word recognition reflects dimension-based statistical learning. Journal of Experimental Psychology: Human Perception and Performance, 37(6), 1939–1956.






Idemaru, K., & Holt, L. (2013). The developmental trajectory of children’s perception and production of English /r/-/l/. Journal of the Acoustical Society of America, 133(6), 4232–4246.
Idemaru, K., Holt, L. L., & Seltman, H. (2012). Individual differences in cue weights are stable across time: The case of Japanese stop lengths. Journal of the Acoustical Society of America, 132(6), 3950–3964.
Imai, S., Walley, A. C., & Flege, J. E. (2005). Lexical frequency and neighborhood density effects on the recognition of native and Spanish-accented words by native and Spanish listeners. Journal of the Acoustical Society of America, 117(2), 896–907.
Iverson, P., & Evans, B. G. (2007). Learning English vowels with different first-language vowel systems: Perception of formant targets, formant movement, and duration. Journal of the Acoustical Society of America, 122(5), 2842–2854.
Iverson, P., & Evans, B. (2009). Learning English vowels with different first-language vowel systems II: Auditory training for native Spanish and German speakers. Journal of the Acoustical Society of America, 126(2), 866–877.
Iverson, P., Hazan, V., & Bannister, K. (2005). Phonetic training with acoustic cue manipulations: A comparison of methods for teaching English /r/-/l/ to Japanese adults. Journal of the Acoustical Society of America, 118(5), 3267–3278.
Iverson, P., Wagner, A., & Rosen, S. (2016). Effects of language experience on pre-categorical perception: Distinguishing general from specialized processes in speech perception. Journal of the Acoustical Society of America, 139(4), 1799–1809.
Jia, G., & Aaronson, D. (2003). A longitudinal study of Chinese children and adolescents learning English in the United States. Applied Psycholinguistics, 24, 131–161.
Jia, G., Strange, W., Wu, Y., Collado, J., & Guan, Q. (2006). Perception and production of English vowels by Mandarin speakers: Age-related differences vary with amount of exposure. Journal of the Acoustical Society of America, 119(2), 1118–1130.
Johnson, K. (2000). Adaptive dispersion in vowel perception. Phonetica, 57, 181–188.
Johnson, K., Flemming, E., & Wright, R. (1993). The hyperspace effect: Phonetic targets are hyperarticulated. Language, 69, 505–528.
Jongman, A., & Wade, T. (2007). Acoustic variability and perceptual learning: The case of non-native accented speech. In O.-S. Bohn & M. J. Munro (Eds.), Language experience in second language learning: In honor of James Emil Flege (pp. 135–150). Amsterdam: John Benjamins.
Kachlicka, M., Saito, K., & Tierney, A. (2019). Successful second language learning is tied to robust domain-general auditory processing and stable neural representation of sound. Brain and Language, 192, 15–24.
Kartushina, N., & Frauenfelder, U. H. (2013). On the role of L1 speech production in L2 perception: Evidence from Spanish learners of French. Paper presented at Interspeech 2013, Lyon, France.




Kartushina, N., Hervais-Adelman, A., Frauenfelder, U. H., & Golestani, N. (2016). Mutual influences between native and non-native vowels in production: Evidence from short-term visual articulatory feedback training. Journal of Phonetics, 57, 21–39.
Kent, R. D., & Forner, L. L. (1980). Speech segment durations in sentence recitations by children and adults. Journal of Phonetics, 8, 157–168.
Kewley-Port, D. (2001). Vowel formant discrimination, II: Effects of stimulus uncertainty, consonantal context, and training. Journal of the Acoustical Society of America, 110(4), 2141–2155.
Kharlamov, V. (2014). Incomplete neutralization of the voicing contrast in word-final obstruents in Russian: Phonological, lexical, and methodological influences. Journal of Phonetics, 43(1), 47–56.
Kidd, G. R., Watson, C. S., & Gygi, B. (2007). Individual differences in auditory abilities. Journal of the Acoustical Society of America, 122(1), 418–435.
Kim, D., & Clayards, M. (2019). Individual differences in the link between perception and production and the mechanism of phonetic imitation. Language, Cognition, and Neuroscience, 34(6), 769–786.
Kim, D., Clayards, M., & Goad, H. (2018). A longitudinal study of individual differences in the acquisition of new vowel contrasts. Journal of Phonetics, 67, 1–20.
Kim, M. R. (2012). L1–L2 transfer in VOT and f0 production by Korean English learners: L1 sound change and L2 stop production. Phonetics and Speech Sciences, 4(3), 31–41.
Kleinschmidt, D. F., & Jaeger, T. F. (2015). Robust speech perception: Recognize the familiar, generalize to the similar, and adapt to the novel. Psychological Review, 122(2), 148–203.
Kluender, K. R., Lotto, A. J., Holt, L. L., & Bloedel, S. L. (1998). Role of experience for language-specific functional mappings of vowel sounds. Journal of the Acoustical Society of America, 104(6), 3568–3582.
Kohler, K. (1981). Contrastive phonology and the acquisition of phonetic skills. Phonetica, 38, 213–226.
Kong, E. J., & Edwards, J. (2015). Individual differences in L2 learners’ perceptual cue weighting patterns. Paper presented at the 18th International Congress of Phonetic Sciences, Glasgow.
Kong, E. J., & Edwards, J. (2016). Individual differences in categorical perception of speech: Cue weighting and executive function. Journal of Phonetics, 59, 40–57.
Kong, E. J., & Yoon, I. H. (2013). L2 proficiency effect on the acoustic cue-weighting pattern by Korean L2 learners of English. Journal of the Korean Society of Speech Sciences, 5(4), 81–90.
Kraljic, T., & Samuel, A. G. (2006). Generalization in perceptual learning for speech. Psychonomic Bulletin and Review, 13(2), 262–268.
Kuhl, P. (1983). Perception of auditory equivalence classes for speech in early infancy. Infant Behavior and Development, 6, 263–285.



The Revised Speech Learning Model (SLM-r)



Kuhl, P. (1991). Human adults and human infants show a “perceptual magnet effect” for the prototypes of speech categories, monkeys do not. Perception and Psychophysics, 50, 93–107.
Kuhl, P. (2000). A new view of language acquisition. Proceedings of the National Academy of Sciences, 97(22), 11850–11857.
Kuhl, P., Conboy, B. T., Coffey-Corina, S., Padden, D., Rivera-Gaxiola, M., & Nelson, T. (2008). Phonetic learning as a pathway to language: New data and native language magnet theory expanded (NLM-e). Philosophical Transactions of the Royal Society B, 363, 979–1000.
Kuhl, P., Conboy, B. T., Padden, D., Nelson, T., & Pruitt, J. (2005). Early speech perception and later language development: Implications for the “critical period.” Language Learning and Development, 1, 237–264.
Labov, W. (1994). Principles of linguistic change: Vol. 1. Internal factors. Oxford: Blackwell.
Lado, R. (1957). Linguistics across cultures: Applied linguistics for language teachers. Ann Arbor: University of Michigan Press.
Lee, H., & Jongman, A. (2018). Effects of sound change on the weighting of acoustic cues to the three-way laryngeal stop contrast in Korean: Diachronic and dialectal comparisons. Language and Speech, 63(3), 509–530.
Lee, S., Potamianos, A., & Narayanan, S. (1999). Acoustics of children’s speech: Developmental changes of temporal and spectral parameters. Journal of the Acoustical Society of America, 105(3), 1455–1468.
Lehet, M., & Holt, L. (2017). Dimension-based statistical learning affects both speech perception and production. Cognitive Science, 41, 885–912.
Lengeris, A. (2009). Individual differences in second-language vowel learning. Unpublished PhD thesis, University College London.
Lengeris, A., & Hazan, V. (2010). The effect of native vowel processing ability and frequency discrimination acuity on the phonetic training of English vowels for native speakers of Greek. Journal of the Acoustical Society of America, 128(6), 3757–3768.
Lenneberg, E. H. (1967). Biological foundations of language. New York: Wiley.
Lev-Ari, S., & Peperkamp, S. (2013). Low inhibitory skill leads to non-native perception and production in bilinguals’ native language. Journal of Phonetics, 41, 320–331.
Levy, E. S. (2009a). Language experience and consonantal context effects on perceptual assimilation of French vowels by American-English learners of French. Journal of the Acoustical Society of America, 125(2), 1138–1152.
Levy, E. S. (2009b). On the assimilation-discrimination relationship in American English adults’ French vowel learning. Journal of the Acoustical Society of America, 126(5), 2670–2682.
Levy, E. S., & Law, F. F., II. (2009). Production of French vowels by American-English learners of French: Language experience, consonantal context, and the perception-production relationship. Journal of the Acoustical Society of America, 128(3), 1290–1305.




Levy, E. S., & Strange, W. (2008). Perception of French vowels by American English adults with and without French language experience. Journal of Phonetics, 36, 141–157.
Lindblom, B. (1990). Explaining phonetic variation: A sketch of the H&H theory. In W. J. Hardcastle & A. Marchal (Eds.), Speech production and speech modeling (pp. 403–439). Dordrecht, Netherlands: Kluwer Academic.
MacKay, I. R. A., Flege, J. E., & Imai, S. (2006). Evaluating the effects of chronological age and sentence duration on degree of perceived foreign accent. Applied Psycholinguistics, 27, 157–183.
MacKay, I. R. A., Flege, J. E., Piske, T., & Schirru, C. (2001). Category restructuring during second-language acquisition. Journal of the Acoustical Society of America, 110, 516–528.
MacKay, I. R. A., Meador, D., & Flege, J. E. (2001). The identification of English consonants by native speakers of Italian. Phonetica, 58, 103–125.
Magezi, D. (2015). Linear mixed-effects models for within-participant psychology experiments: An introductory tutorial and free graphical user interface (LMMgui). Frontiers in Psychology, 6(2). doi:10.3389/fpsyg.2015.00002.
Markham, D. (1999). Phonetic imitation, accent, and the learner. Lund, Sweden: Lund University Press.
Markham, D., & Hazan, V. (2004). Acoustic-phonetic correlates of talker intelligibility for adults and children. Journal of the Acoustical Society of America, 116(5), 3108–3118.
Maye, J., Werker, J., & Gerken, L. (2002). Infant sensitivity to distributional information can affect phonetic discrimination. Cognition, 82, B101–B111.
Mayr, R., & Escudero, P. (2010). Explaining individual variation in L2 perception: Rounded vowels in English learners of German. Bilingualism: Language and Cognition, 13(3), 279–297.
McAllister, R., Flege, J. E., & Piske, T. (2003). The influence of the L1 on Swedish quantity by native speakers of Spanish, English and Estonian. Journal of Phonetics, 30, 229–258.
McQueen, J. M., Tyler, M. D., & Cutler, A. (2012). Lexical retuning of children’s speech perception: Evidence for knowledge about words’ component sounds. Language Learning and Development, 8, 317–339.
Miller, J. L. (1994). On the internal structure of phonetic categories. Cognition, 50, 271–285.
Mielke, J., Baker, A., & Archangeli, D. (2016). Individual-level contact limits phonological complexity: Evidence from bunched and retroflex /ɹ/. Language, 92(1), 101–140.
Mitterer, H., Reinisch, E., & McQueen, J. M. (2018). Allophones, not phonemes in spoken-word recognition. Journal of Memory and Language, 98, 77–92.
Miyawaki, K., Jenkins, J. J., Strange, W., Liberman, A. M., Verbrugge, R., & Fujimura, O. (1975). An effect of linguistic experience: The discrimination of [r] and [l] by native speakers of Japanese and English. Perception and Psychophysics, 18(5), 331–340.






Mochizuki, M. (1981). The identification of /r/ and /l/ in natural and synthesized speech. Journal of Phonetics, 9(3), 283–303.
Mora, J. C., & Mora-Plaza, I. (2019). Contributions of cognitive attention control to L2 speech learning. In A. M. Nyvad, M. Hejná et al. (Eds.), A sound approach to language matters – In honor of Ocke-Schwen Bohn (pp. 477–499). Aarhus: Dept. of English, School of Communication & Culture, Aarhus University.
Mora, J. C., Keidel, J. L., & Flege, J. E. (2010). Why are the Catalan contrasts between /e/-/ε/ and /o/-/ɔ/ so difficult for even early Spanish-Catalan bilinguals to perceive? In K. Dziubalska-Kolaczyk, M. Wrembel, & M. Jul (Eds.), New sounds 2010: Proceedings of the 6th International Symposium on the Acquisition of Second Language Speech (pp. 325–330).
Mora, J. C., Keidel, J. L., & Flege, J. E. (2015). Effects of Spanish use on the production of Catalan vowels by early Spanish-Catalan bilinguals. In J. Romero & M. Riera (Eds.), The phonetics–phonology interface: Representations and methodologies (pp. 33–53). Amsterdam: John Benjamins.
Morrongiello, B., Robson, R. C., Best, C. T., & Clifton, R. K. (1984). Trading relations in the perception of speech by 5-year-old children. Journal of Experimental Child Psychology, 37, 231–250.
Moyer, A. (2009). Input as a critical means to an end: Quantity and quality of experience in L2 phonological attainment. In T. Piske & M. Young-Scholten (Eds.), Input matters in SLA (pp. 159–174). Bristol, England: Multilingual Matters.
Nam, Y., & Polka, L. (2016). The phonetic landscape in infant consonant perception is an uneven terrain. Cognition, 155, 57–66.
Nasir, S. M., & Ostry, D. J. (2009). Auditory plasticity and speech motor learning. Proceedings of the National Academy of Sciences, 106(48), 20470–20475.
Nathan, L., Wells, B., & Donlan, C. (1998). Children’s comprehension of unfamiliar regional accents: A preliminary investigation. Journal of Child Language, 25, 343–365.
Neuman, A., & Hochberg, L. (1983). Children’s perception of speech in reverberation. Journal of the Acoustical Society of America, 73, 2145–2149.
Newman, R. S. (2003). Using links between speech perception and speech production to evaluate different acoustic metrics: A preliminary report. Journal of the Acoustical Society of America, 113(5), 2850–2860.
Newman, R. S., Clouse, S. A., & Burnham, J. L. (2001). The perceptual consequences of within-talker variability in fricative production. Journal of the Acoustical Society of America, 109, 1181–1196.
Newton, C., & Ridgway, S. (2015). Novel accent perception in typically-developing school-aged children. Child Language Teaching and Therapy, 32(1), 111–123.
Nielsen, K. (2011). Specificity and abstractness of VOT imitation. Journal of Phonetics, 39, 132–142.
Nittrouer, S. (2004). The role of temporal and dynamic signal components in the perception of syllable-final stop voicing by children and adults. Journal of the Acoustical Society of America, 115(4), 1777–1790.




Nosofsky, R. M. (1986). Attention, similarity, and the identification–categorization relationship. Journal of Experimental Psychology: General, 115, 39–57.
Nygaard, L. C., Sommers, M. S., & Pisoni, D. B. (1994). Speech perception as a talker-contingent process. Psychological Science, 5, 42–46.
Peperkamp, S., & Bouchon, C. (2011). The relation between perception and production in L2 phonological processing. Paper presented at Interspeech 2011, 12th Annual Conference of the International Speech Communication Association, Florence, Italy.
Perkell, J. S., Guenther, F. H., Lane, H., Matthies, M. L., Stockmann, E., Tiede, M., & Zandipour, M. (2004). The distinctness of speakers’ productions of vowel contrasts is related to their discrimination of the contrasts. Journal of the Acoustical Society of America, 116(4), 2338–2344.
Perkell, J. S., Matthies, M. L., Tiede, M., Lane, H., Zandipour, M., Marrone, N., … Guenther, F. H. (2004). The distinctness of speakers’ /s/-/ʃ/ contrast is related to their auditory discrimination and use of an articulatory saturation effect. Journal of Speech, Language, and Hearing Research, 47(6), 1259–1269.
Pisoni, D. B., Aslin, R. N., Perey, A. J., & Hennessy, B. L. (1982). Some effects of laboratory training on identification and discrimination of voicing contrasts in stop consonants. Journal of Experimental Psychology: Human Perception and Performance, 8, 297–314.
Pisoni, D., Lively, S., & Logan, J. (1994). Perceptual learning of nonnative speech contrasts: Implications for theories of speech perception. In J. Goodman & H. Nusbaum (Eds.), The development of speech perception: The transition from speech sounds to spoken words (pp. 121–166). Cambridge, MA: MIT Press.
Polka, L., & Bohn, O. S. (2003). Asymmetries in vowel perception. Speech Communication, 41(1), 221–231.
Polka, L., & Bohn, O. S. (2011). Natural Referent Vowel (NRV) framework: An emerging view of early phonetic development. Journal of Phonetics, 39(4), 467–478.
Reiterer, S. M., Hu, X., Sumathi, T. A., & Singh, N. C. (2013). Are you a good mimic? Neuro-acoustic signatures for speech imitation ability. Frontiers in Psychology, 1(3). doi:10.3389/fpsyg.2013.00782.
Remez, R. E., Fellowes, J. M., & Rubin, P. E. (1997). Talker identification based on phonetic information. Journal of Experimental Psychology: Human Perception and Performance, 23, 651–666.
Rochet, B. L. (1995). Perception and production of second-language speech sounds by adults. In W. Strange (Ed.), Speech perception and linguistic experience: Issues in cross-language research (pp. 229–273). Timonium, MD: York Press.
Rogers, C. L., Lister, J. L., Febo, D. M., Besing, J. M., & Abrams, H. B. (2006). Effects of bilingualism, noise, and reverberation on speech perception by listeners with normal hearing. Applied Psycholinguistics, 27(3), 465–485.






Saito, K., Sun, H., & Tierney, A. (2019). Explicit and implicit aptitude effects on second language speech learning: Scrutinizing segmental and suprasegmental sensitivity and performance via behavioral and neurophysiological measures. Bilingualism: Language and Cognition, 22(5), 1123–1140.
Samuel, A. (1981). Phonemic restoration: Insights from a new methodology. Journal of Experimental Psychology: General, 110(4), 474–494.
Sancier, M., & Fowler, C. A. (1997). Gestural drift in a bilingual speaker of Brazilian Portuguese and English. Journal of Phonetics, 25, 421–438.
Schertz, J., Cho, T., Lotto, A., & Warner, N. (2015). Individual differences in phonetic cue use in production and perception of a non-native sound contrast. Journal of Phonetics, 52, 183–204.
Schertz, J., Cho, T., Lotto, A., & Warner, N. (2016). Individual differences in perceptual adaptability of foreign sound categories. Attention, Perception, and Psychophysics, 78, 355–367.
Schmidtke, J. (2016). The bilingual disadvantage in speech understanding in noise is likely a frequency effect related to reduced language exposure. Frontiers in Psychology, 13(7). doi:10.3389/fpsyg.2016.00678.
Schulze, K., Vargha-Khadem, F., & Mishkin, M. (2012). Test of a motor theory of long-term auditory memory. Proceedings of the National Academy of Sciences, 109(18), 7121–7125.
Sheldon, A., & Strange, W. (1982). The acquisition of /r/ and /l/ by Japanese learners of English: Evidence that speech production can precede speech perception. Applied Psycholinguistics, 3(3), 243–261.
Shultz, A. A., Francis, A. L., & Llanos, F. (2012). Differential cue weighting in perception and production of consonant voicing. Journal of the Acoustical Society of America, 132(2), EL95–EL101.
Slevc, L. R., & Miyake, A. (2006). Individual differences in second-language proficiency. Psychological Science, 17(8), 675–681.
Smit, A., Hand, L., Freilinger, J., Bernthal, J., & Bird, A. (1990). The Iowa articulation norms project and its Nebraska replication. Journal of Speech and Hearing Disorders, 55, 779–798.
Smits, R., Sereno, J., & Jongman, A. (2006). Categorization of sounds. Journal of Experimental Psychology: Human Perception and Performance, 32(3), 733–754.
Smith, B. L. (1979). A phonetic analysis of consonantal devoicing in children’s speech. Journal of Child Language, 6(1), 19–28.
Snow, C., & Hoefnagel-Höhle, M. (1979). Individual differences in second-language ability: A factor-analytic study. Language and Speech, 22, 151–162.
Song, J., & Iverson, P. (2018). Listening effort during speech perception enhances auditory and lexical processing for non-native listeners and accents. Cognition, 179, 163–170.
Song, J. Y., Shattuck-Hufnagel, S., & Demuth, K. (2015). Development of phonetic variants (allophones) in 2-year-olds learning American English: A study of alveolar stops /t, d/ codas. Journal of Phonetics, 55, 152–169.




Strange, W. (1992). Learning non-native phoneme contrasts: Interactions among subject, stimulus, and task variables. In E. Tohkura, E. Vatikiotis-Bateson, & Y. Sagisaka (Eds.), Speech perception, production, and linguistic structure (pp. 197–219). Tokyo: Ohmsha.
Strange, W. (2007). Cross-language phonetic similarity of vowels. In O.-S. Bohn & M. J. Munro (Eds.), Language experience in second language speech learning: In honor of James Emil Flege (pp. 35–55). Amsterdam: John Benjamins.
Strange, W. (2011). Automatic selective perception (ASP) of first and second language speech: A working model. Journal of Phonetics, 39(4), 456–466.
Strange, W., Bohn, O.-S., Nishi, K., & Trent, S. A. (2005). Contextual variation in the acoustic and perceptual similarity of North German and American English vowels. Journal of the Acoustical Society of America, 118, 1751–1762.
Stevens, K. N., Liberman, A. M., Studdert-Kennedy, M., & Öhman, S. (1969). Cross-language study of vowel perception. Language and Speech, 12, 1–23.
Takagi, N. (1993). Perception of American English /r/ and /l/ by adult Japanese learners of English: A unified view. Unpublished PhD dissertation, University of California at Irvine.
Theodore, R. M., Miller, J. L., & DeSteno, D. (2009). Individual talker differences in voice-onset time: Contextual influences. Journal of the Acoustical Society of America, 125(6), 3974–3982.
Theodore, R. M., Monto, N. R., & Graham, S. (2020). Individual differences in distributional learning for speech: What’s ideal for ideal observers? Journal of Speech, Language, and Hearing Research, 63, 1–13.
Thorin, J., Sadakata, M., Desain, P., & McQueen, J. M. (2018). Perception and production in interaction during non-native speech category learning. Journal of the Acoustical Society of America, 144(1), 92–103.
Tourville, J. A., & Guenther, F. H. (2011). The DIVA model: A neural theory of speech acquisition and production. Language and Cognitive Processes, 26(7), 952–981.
Trubetzkoy, N. (1939). Principles of phonology (C. A. Baltaxe, Trans.). Berkeley: University of California Press.
Tulving, E. (1981). Similarity relations in recognition. Journal of Verbal Learning and Verbal Behavior, 20(5), 479–496.
Tyler, M. D. (2019). PAM-L2 and phonological category acquisition in the foreign language classroom. In A. M. Nyvad, M. Hejná et al. (Eds.), A sound approach to language matters – In honor of Ocke-Schwen Bohn (pp. 607–630). Aarhus: Department of English, School of Communication & Culture, Aarhus University.
Walley, A. C., & Flege, J. E. (1999). Effect of lexical status on children’s and adults’ perception of native and non-native vowels. Journal of Phonetics, 27(3), 307–332.
Weinreich, U. (1953). Languages in contact: Findings and problems. The Hague: Mouton.






Werker, J. F., & Byers-Heinlein, K. (2008). Bilingualism in infancy: First steps in perception and comprehension. Trends in Cognitive Sciences, 12(4), 144–150.
Werker, J. F., & Logan, J. (1985). Cross-language evidence for three factors in speech perception. Perception and Psychophysics, 37, 35–44.
Westbury, J. R., Hashi, M., & Lindstrom, M. J. (1998). Differences among speakers in lingual articulation for American English /r/. Speech Communication, 26(3), 203–226.
Whalen, D., Abramson, A. S., Lisker, L., & Mody, M. (1993). F0 gives voicing information even without unambiguous voice onset times. Journal of the Acoustical Society of America, 93(4), 2152–2159.
Williams, L. (1977). The perception of stop consonant voicing by Spanish-English bilinguals. Perception and Psychophysics, 21(4), 289–297.
Yeni-Komshian, G. H., Flege, J. E., & Liu, S. (2000). Pronunciation proficiency in the first and second languages of Korean-English bilinguals. Bilingualism: Language and Cognition, 3(2), 131–149.
Ylinen, S., Uther, M., Latvala, A., Vepsäläinen, S., Iverson, P., Akahane-Yamada, R., & Näätänen, R. (2010). Training the brain to weight speech cues differently: A study of Finnish second-language users of English. Journal of Cognitive Neuroscience, 22(6), 1319–1332.
Zhang, Y., Kuhl, P. K., Imada, T., Kotani, M., & Tohkura, Y. I. (2005). Effects of language experience: Neural commitment to language-specific auditory patterns. NeuroImage, 26(3), 703–720.
Zhang, Y., & Wang, Y. (2007). Neural plasticity in speech acquisition and learning. Bilingualism: Language and Cognition, 10(2), 147–160.

chapter 2

The Revised Speech Learning Model (SLM-r) Applied

James Emil Flege, Katsura Aoyama, and Ocke-Schwen Bohn

This chapter summarizes, from the perspective of the revised Speech Learning Model (SLM-r; Chapter 1), some of the numerous studies that have examined the learning of English liquids by native Japanese (NJ) speakers. The so-called Japanese /r/–/l/ problem is of special theoretical interest because the two liquids of English (symbolized here for convenience as “r” and “l”; see below) are “infamously difficult” for NJ speakers to produce (Lambacher, 1999, p. 142) and because a perfect perceptual mastery of English liquids seems to be impossible for NJ speakers even following “extensive natural exposure” to English (Takagi & Mann, 1995, p. 387).

Many attempts have been made to find a laboratory training procedure that might enable NJ speakers to perceive English liquids as accurately as native English (NE) speakers do. The techniques developed so far (e.g., Logan, Lively, & Pisoni, 1991; Bradlow et al., 1999; Iverson et al., 2005; Shinohara & Iverson, 2018) routinely yield important gains that generalize and persist over time, but the gains rarely exceed a 15 percent increase in correct identifications, and the trainees do not reach native-speaker levels.

These considerations led Bradlow (2008, p. 294) to conclude that examining how NJ speakers learn English /r/ and /l/ is a good way to evaluate “general principles of learning” as well as “claims about adult neural plasticity.” One such claim is that the Japanese /r/–/l/ problem results from a reduction of neural plasticity following the close of a critical period for second language (L2) speech learning (Lenneberg, 1967). However, an fMRI study by Callan et al. (2003; see also Callan et al., 2004) provided evidence of plasticity in the perceptual processing of English liquids by NJ university students following short-term perceptual training.
Increases in brain activity were evident in brain structures associated with both production and perception, in accordance with the SLM-r hypothesis that segmental production and perception accuracy coevolve. 






If the /r/–/l/ problem cannot be attributed to a lack of plasticity, what then is the basis of the widely observed native versus nonnative differences? A conclusion to be drawn from the present review is that, contrary to what many people think, NJ speakers can learn English /r/ and /l/. Importantly, however, the learning of English liquids takes time and a substantial amount of native-speaker input, just as it does for monolingual NE children, who only gradually learn to produce and perceive English /r/ and /l/. The crucial difference between L1-learning children and NJ speakers learning English as an L2 is that, for NJ speakers, learning proceeds differently for the two English liquids. The learning of English /r/ and of /l/ differ necessarily because of how phonetic systems reorganize across the life-span in response to new phonetic input.

English liquids are produced and perceived differently as a function of whether they occur as singletons or clusters in word-initial position (e.g., lead, read, breed, bleed), as intervocalic singletons, or as word-final singletons or clusters. Most published research has focused on the approximately 50 English minimal pairs that exist for singleton /r/ and /l/ tokens in word-initial position (Iverson et al., 2005, table A1). The SLM-r generates predictions regarding the learning of position-sensitive allophones, not phonemes, and so this review will focus on the word-initial singletons. This means that the conclusions drawn here may not generalize to other phonetic contexts.

According to the SLM-r, individuals who are exposed to an L2 map the sounds they hear in L2 words onto native language (L1) phonetic categories. Cross-language mapping patterns arise through the operation of interlingual identification, a cognitive process that operates subconsciously and automatically.
The research reviewed here indicates that when NJ speakers are first exposed to English (time “0” in Figure 2.1) they generally classify both English /r/ and /l/ productions as instances of the single liquid of Japanese, symbolized here for convenience as /R/ (see below).

Figure 2.1  Hypothetical cross-language mapping between a Japanese liquid consonant (designated here as “R”) and two English liquids (“r”, “l”) at four hypothetical stages of L2 development by native speakers of Japanese.




The phonetic-level L1–L2 perceptual links established via interlingual identification by NJ speakers may vary across individuals as a function of how precisely their Japanese /R/ category is specified when they are first exposed to English (Chapter 1), of the relative weighting of acoustic dimensions that individuals deploy when categorizing Japanese sounds as /R/ (e.g., Idemaru et al., 2012), and of individual differences in auditory acuity, early-stage (precategorical) auditory processing (e.g., Kachlika et al., 2019), or auditory working memory (e.g., MacKay et al., 2001). The interlingual identification of Japanese and English consonants has not been studied in detail or longitudinally, and so the time needed to establish stable L1–L2 mapping patterns is currently unknown, as are the factors that may influence which sounds are perceptually linked.

As illustrated in Figure 2.1, NJ speakers begin to discern that the two English liquids are not equally good instances of their Japanese /R/ categories as they gain conversational experience in English. Despite the symbols traditionally used for Japanese and English liquids, phonetic realizations of the English /r/ category will be perceived to be phonetically more dissimilar from the Japanese /R/ category than are realizations of English /l/.

According to the SLM-r, the detection of a difference in the “goodness-of-fit” of the two English liquids to Japanese /R/ (Figure 2.1, time 2) promotes the formation of a new phonetic category for English /r/. Once a difference in phonetic dissimilarity has been discerned, an “equivalence class” of English /r/ tokens will emerge based on the tokens that an individual NJ speaker sees and hears while speaking English.
Equivalence classes can be regarded as a kind of cognitive “container” for the assemblage of information regarding the phonetic properties of the English /r/ tokens that fall outside a “perceptual tolerance region” defining an individual’s Japanese /R/ category. By hypothesis, the equivalence class provides the basis for an individual learner’s formation of a new L2 phonetic category for English /r/ once the perceptual link between the English /r/ tokens and the Japanese /R/ category has been sundered (time 3 in the figure).

If the scenario just outlined accurately describes what happens when NJ speakers learn English, it means that NJ learners of English will make different use of the phonetic input they receive for English /r/ and /l/. The distribution of English /r/ tokens to which they have been exposed will define the new phonetic categories formed by NJ learners for English /r/. The English /l/ distributions will be combined with Japanese /R/ distributions because English /l/ will continue to be linked perceptually to Japanese /R/ via the process of interlingual identification. A composite L1–L2 phonetic category (diaphone) will develop that combines the properties of the English /l/ and Japanese /R/ tokens to which NJ learners have been exposed over the course of their lives (Figure 2.1, time 3). By hypothesis, this will lead to a modification in how Japanese /R/ is produced and perceived.

2.1  Cross-Language Phonetic Differences

Japanese has a single liquid that is symbolized here for convenience as “R.” The Japanese liquid is usually realized as an apico-alveolar tap, [ɾ], or as an alveolar lateral approximant, [l] (Arai, 2013; Vance, 2008). Japanese /R/ occurs in prevocalic and intervocalic positions but not word finally.

Both English liquids differ phonetically from Japanese /R/. The English /r/ is often symbolized as [ɹ], but this symbolization is not appropriate for all varieties of English. English /l/ is produced with “dark” and “light” variants. The dark /l/ tends to occur in postvocalic position, whereas the light variant tends to occur more often in prevocalic position (Mielke et al., 2016, p. 123).

The most important acoustic phonetic characteristic of word-initial English /r/ productions is third formant (F3) frequency, which starts out low, close to the F2 frequency, and then rises rapidly when the constriction is released. For English /l/, on the other hand, F3 frequency starts out much higher and so is more different from the F2 frequency than is the case for /r/.

Iverson et al. (2005) examined the production of minimally paired English words such as rock and lock by 12 native speakers of British English. The /r/ and /l/ tokens spoken by the NE speakers differed in terms of both F2 and F3 starting frequencies, but there was overlap in the normalized frequency values for F2 but not F3. Transition durations were longer for the /r/ than for the /l/ tokens, but here, too, there was overlap in the values obtained for the two liquids.

Important differences exist in how /r/ is produced in the languages of the world (Ladefoged & Maddieson, 1996). Not surprisingly, substantial interspeaker differences have been observed in the production of /r/ in word-initial position by NE monolinguals (e.g., Mielke et al., 2016; Westbury, Hashi, & Lindstrom, 1998).
In North America, /r/ can be produced as a retroflex approximant or postalveolar approximant with pharyngeal constriction and lip rounding (Ladefoged & Maddieson, 1996). Some NE speakers produce /r/ with a “bunched” tongue in which the tongue tip is not raised, with constriction along the hard palate as well as in the lower pharynx (Delattre & Freeman, 1968). Differences in the articulation of English /r/ seem to be less evident auditorily to NE-speaking listeners than differences between the dark and light variants of /l/ because substantial differences in the articulation of /r/ yield similar acoustic output.

2.2  L1 Acquisition of Liquids

It takes many years for children learning both Japanese and English to reach adult-like levels of performance for liquid consonants. Arai and Mugitani (2016) reviewed four studies examining the acquisition of Japanese /R/ by NJ children. The age at which 90 percent of the children could produce /R/ correctly varied from 4.0 to 6.0 years. According to Arai (2013), /R/ is difficult for NJ children owing to its articulatory nature and because /R/ is produced in diverse ways by Japanese adults.

We know of no study that has evaluated the possibility that NJ speakers bring different /R/ categories to the process of learning English liquids. If Japanese /R/ categories do indeed differ across individual NJ learners of English, the SLM-r predicts that this will influence their ability to discern cross-language phonetic differences and to establish stable L1–L2 mapping patterns, the cross-language dissimilarity they perceive, how long it may take for them to create a new phonetic category for English /r/, and the cue-weighting patterns that will be evident in their newly formed English /r/ categories.

Unlike children learning Japanese as their native language, children learning English must develop distinct categories for two liquid consonants. It takes NE children many years to reach adult-like levels of performance for both /r/ and /l/. Idemaru and Holt (2013) examined monolingual children’s perception of English liquids and provided detailed acoustic analyses of their productions of /r/ and /l/. This research suggested that attunement to the properties of English /r/ and /l/ continues until eight to nine years of age. Children in the United States learn to produce /l/ somewhat before /r/. Of the boys tested by Smit et al. (1990), 90 percent could produce /l/ “acceptably” by 6.0 years of age, but this milestone was not reached until two years later for /r/. Girls met both milestones somewhat earlier than boys.
Importantly, a few boys tested by Smit et al. (1990) had not yet learned to produce American English /r/ acceptably at the age of 9 years. The fact that monolingual NE children learn /l/ sooner than /r/ has two possible explanations. One is that /r/ is more complex articulatorily than /l/ (see McGowan et al., 2004). A second possibility is that children



The Revised Speech Learning Model (SLM-r) Applied



are exposed to a wider range of allophonic variants for /r/ than for /l/ and that this slows the development of the phonetic categories that specify the goals of language-specific realization rules (Mielke et al., 2016; Song et al., 2015). As well, the lip rounding often associated with /r/ productions may lead children to substitute /w/ for /r/.

As we will see later, NJ speakers usually make more progress in learning English /r/ than /l/, the opposite of the pattern seen in the acquisition of liquids by children learning English as an L1. The difference between NJ speakers who learn English as an L2 and L1-learning children can be attributed to a difference in the perceived cross-language phonetic dissimilarity of the two English liquids with respect to the single liquid of Japanese. This, in turn, will alter cross-language perceptual mapping patterns and eventually lead to the formation of a new phonetic category for English /r/ but not /l/.

2.3  Perceived Cross-Language Phonetic Dissimilarity

According to the SLM-r, the perceived phonetic dissimilarity of an L2 sound from the closest L1 sound is an important determinant of whether a new category will or will not be formed for it. To have predictive value within the SLM-r framework, perceived dissimilarity must be assessed separately for /r/ and /l/ for each participant at the time of first exposure to the L2. This has never been done in longitudinal research, as far as we know. However, research conducted using a variety of experimental techniques provides converging evidence that, as illustrated in Figure 2.1 (time 2), English /r/ comes to be perceived as phonetically more dissimilar from Japanese /R/ than English /l/ is.

Takagi (1993) recruited 10 NJ adults who had lived in the United States for fewer than two years. The stimuli used in perceptual testing consisted of English /r/ and /l/ tokens that had been produced by four NE speakers and tokens of Japanese /R/ and /w/ that had been produced by a single NJ speaker. The NJ-speaking participants were asked to perceptually evaluate attempts by NE speakers to produce Japanese /R/ using a scale ranging from 0 (does not sound at all like /R/) to 7 (perfect /R/). The ratings obtained for the Japanese /w/ and /R/ tokens averaged about 0 and 6, respectively. Intermediate ratings were obtained for the English approximants. Importantly, significantly lower ratings were obtained for the English /r/ stimuli than for the /l/ stimuli. This indicated that, of the two English liquids, /r/ was perceived to be phonetically more dissimilar from Japanese /R/ than /l/ was.



James Emil Flege, Katsura Aoyama, and Ocke-Schwen Bohn

Hattori (2009; Hattori & Iverson, 2009) used a bilingual identification procedure. Native-speaker productions of English /r/, English /l/, and Japanese /R/ were presented in a three-alternative forced-choice test to NJ participants who had lived briefly in London. The NJ participants correctly identified the English /r/ stimuli significantly more often (mean = 82 percent) than the English /l/ stimuli (mean = 58 percent correct). More importantly in the present context, the English /r/ tokens were incorrectly identified as Japanese /R/ significantly less often than the English /l/ tokens were (means = 2 vs. 19 percent).

Guion et al. (2000) examined the discrimination of Japanese and English consonants by NJ speakers who had lived briefly in the United States. The stimuli consisted of native-speaker productions of /r/, /l/, and /R/. The NJ speakers were able to discriminate English /r/ from Japanese /R/, but not English /l/ from Japanese /R/, at a significantly above-chance rate.

Cutler et al. (2006) used an eye-tracking paradigm to examine how NJ adults accessed English words with /r/ and /l/. When instructed to click on rocket (one of four possible pictorial choices), they tended to look at a picture of a locker, but when told to click on a picture of a locker, they seldom looked at the picture of a rocket. This asymmetry was interpreted to mean that /l/ was a better match to the participants' Japanese /R/ category than English /r/ was.

The results of Aoyama and Flege (2011) provided insight into how much English-language experience NJ speakers must receive before beginning to perceive a difference in goodness-of-fit between the two English liquids and Japanese /R/. These authors tested 50 NJ speakers who had arrived in the United States after the age of 18 years, had lived there for 0.1–24.6 years, and reported using English from 0 to 97 percent of the time.
Years of full-time equivalent (FTE) English input (the proportion of English use multiplied by LOR) averaged 2 years, ranging from 0 to 14.4 years. This suggests that, at the time of testing, some but not all of the NJ adults had received as much English input as monolingual NE children need to learn English /r/ and /l/. The naturally produced stimuli used by Aoyama and Flege (2011) consisted of consonant-vowel syllables beginning with English /r/, /l/, and /w/ and Japanese /R/ and /w/. In the one-talker condition, five productions by a single native speaker of the three English consonants and of the two Japanese consonants were presented. In the five-talker condition, on the other hand, each of the five phonetic categories was represented by a single token produced by each of five different native speakers.
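The full-time equivalent (FTE) measure used throughout this chapter is simply the proportion of L2 use multiplied by length of residence (LOR). A minimal sketch of the computation (our illustration; the function name and example values are hypothetical, not from the chapter):

```python
def fte_years(percent_l2_use: float, lor_years: float) -> float:
    """Full-time equivalent (FTE) years of L2 input:
    the proportion of L2 use multiplied by length of residence (LOR)."""
    return (percent_l2_use / 100.0) * lor_years

# A learner with a 10-year LOR who uses English 20 percent of the time
# has received the same FTE input as a 2-year full-time user.
print(fte_years(20.0, 10.0))  # 2.0
```

On this measure, a long LOR and little FTE input can coexist, which is why the 50 NJ speakers described above averaged only 2 FTE years despite LORs of up to 24.6 years.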






The NJ speakers were instructed in Japanese to rate each consonant for phonetic dissimilarity using a scale ranging from 1 (not similar to /R/ at all) to 7 (very similar to /R/). They then categorized the stimuli. Not surprisingly, the Japanese /R/ stimuli were categorized as /R/. Most of the English /r/ and /l/ tokens were also categorized as Japanese /R/ (means = 77 percent and 82 percent, respectively). The phonetic dissimilarity ratings obtained for the English liquids before categorization were, as expected, intermediate to the ratings obtained for Japanese /R/, on one hand, and English /w/ and Japanese /w/, on the other hand.

Aoyama and Flege (2011) reported the results for just one condition, but here we present the results for both conditions. The dissimilarity ratings obtained for English /r/ (with respect to /R/) correlated significantly with FTE years of English input in both the one-talker and five-talker conditions (rho = −0.375 and −0.332, respectively; Bonferroni-corrected p < 0.05), whereas the correlations for /l/ were both nonsignificant (rho = −0.282 and −0.233). This indicates that as years of FTE English input increased, the NJ learners who arrived in the United States as adults perceived English /r/, but not /l/, as increasingly dissimilar from Japanese /R/.

To estimate when the "split" between English /r/ and /l/ occurred (Figure 2.1, time 2), we assigned the NJ speakers to three groups of 16 each according to years of FTE English input. The FTE values ranged from 0.0 to 0.2 years (mean = 0.09) for the relatively low input group, from 0.4 to 1.8 years (mean = 1.01) for the mid-input group, and from 2.1 to 14.4 years (mean = 5.46) for the high-input group. Figure 2.2 shows the mean rated dissimilarity of English /r/ and /l/ with respect to /R/ in the two conditions.
For all six Group × Condition combinations, the English /r/ tokens were judged to be more dissimilar from /R/ than the English /l/ tokens were. For the low-input group, the ratings for /r/ and /l/ did not differ significantly in either condition (Friedman Q = 0.81 and 6.0, respectively). For the mid-input group, the /r/–/l/ rating difference reached significance at a Bonferroni-corrected 0.05 level for the one-talker condition (Q = 35.7) but not for the five-talker condition (Q = 6.0). For the high-input group, on the other hand, the /r/–/l/ rating difference was significant in both conditions (Q = 9.6 and 225.0, respectively; p < 0.05). These results suggest that somewhat more than two years of FTE English input may be needed by NJ adults to note the discrepancy between the two English liquids with respect to /R/. It would be valuable to replicate this finding with both NJ adults and children to evaluate the




Figure 2.2  The mean perceived dissimilarity of English /r/ and /l/ with respect to Japanese /R/ (1 = not similar at all to /R/, 7 = very similar) in (a) the single-talker condition and (b) the five-talker condition, for the low- (0.09), mid- (1.01), and high-input (5.46 mean years of FTE English input) groups.

role of the quality of the phonetic input that has been received. According to the SLM-r, how precisely the /R/ category is defined will impact the time needed by NJ speakers to note the discrepancy between the two English liquids. This hypothesis also needs to be tested.
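The correlational analysis reported above rests on Spearman's rank correlation (rho) between dissimilarity ratings and FTE years of input. A minimal, dependency-free sketch of the statistic (our illustration; the data in the usage example are invented, not the Aoyama and Flege values):

```python
import math

def _ranks(xs):
    """Average ranks (tied values share the mean of their rank positions)."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Spearman's rho = Pearson correlation of the rank-transformed data."""
    rx, ry = _ranks(x), _ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)

# Invented toy data: similarity-to-/R/ ratings that fall as FTE input
# rises yield a negative rho, the pattern reported for English /r/ above.
fte = [0.1, 0.5, 1.0, 2.5, 6.0, 14.0]
rating = [6.5, 6.0, 5.5, 4.0, 3.5, 2.0]
print(round(spearman_rho(fte, rating), 3))  # -1.0 (perfectly monotonic toy data)
```

Because rho is computed on ranks, it captures the monotonic decrease in perceived similarity without assuming a linear relation to FTE years.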

2.4  L2 Perception of /r/ and /l/

Perceptual attunement to the phonetic properties of sounds heard by infants in the surrounding environment leads to language-specific differences in perception within the first year of life. Kuhl et al. (2006) found that Japanese and American infants were equally able to discriminate English /r/ from /l/ at 7 months of age but that, at 11 months of age, American infants discriminated the two liquids more accurately than the Japanese infants did. It appears, however, that Japanese children who have never been exposed to English retain some sensitivity to the acoustic phonetic properties that differentiate English /r/ and /l/. Shimizu and Dantsuji (1983) administered an oddity discrimination test to NJ five-year-olds and adults. The NJ children, but not the adults, showed a peak in discrimination along a synthetic /ra/-to-/la/ continuum near the location of the /r/–/l/ phoneme boundary obtained for NE adults.

The first study examining the "Japanese r–l problem" in English was that of Goto (1971), who showed that NJ adults have difficulty






discriminating English /r/ from /l/. Later research examined both the identification and discrimination of English liquids by NJ adults and children differing in amount of English-language experience. Flege et al. (1996) summarized 12 prior studies examining NJ adults' identification of /r/ and /l/. In the studies reviewed, NJ speakers were reported to have correctly identified English liquids just 69 percent of the time on average, labeling /r/ as /l/, and vice versa. Importantly, however, just two prior studies included NJ speakers who had lived in the United States for more than 8 years. If NJ speakers use English 50 percent of the time, this means that most of the NJ speakers examined in prior research had received less English input than NE children need to learn /r/ and /l/.

Aoyama et al. (2004) examined the discrimination of /r/ and /l/ by NJ adults and children, comparing their performance to that of age-matched NE speakers. The NJ speakers had been living in the United States for an average of 0.5 years at time 1 and for 1.6 years when tested a second time (time 2). The Peabody Picture Vocabulary Test (PPVT) was administered twice. At time 1, the NJ adults had higher age-equivalent scores than the NJ children (means = 6.8 and 2.8 years) because they had studied English at school before coming to the United States. Both adults and children obtained significantly higher PPVT scores at time 2 than at time 1 (means = 9.1 and 6.0). This indicated that members of both age groups were actively learning English and that both NJ adults and children had smaller English vocabularies than age-matched NE speakers. The NJ adults and children obtained substantially lower /r/–/l/ discrimination scores than did the NE adults and children, respectively. At time 1, both the NJ adults and the NJ children obtained near-chance scores. The discrimination scores obtained for the NJ adults at times 1 and 2 did not differ significantly (means = 0.552 vs.
0.606), whereas those of the NJ children (means = 0.445 vs. 0.696) increased significantly from time 1 to time 2. Ingvalson et al. (2012) provided indirect evidence that the learning of English liquids may occur fairly rapidly for some NJ speakers. These authors recruited NJ adults who had lived in the United States for an average of 1.5 years for an /r/–/l/ perceptual training study. Not all participants recruited were retained for the training study. Participants were excluded if they failed to demonstrate what Susan Guion-Anderson termed “sufficient room to grow.” Specifically, 14 of the 42 NJ speakers were excluded (see also Shinohara & Iverson, 2018, p. 244) because their pretraining /r/–/l/ discrimination scores were sufficiently high that




potential improvements resulting from training would have been difficult to observe.

The results of MacKain et al. (1981) also pointed to rapid speech learning. This study evaluated NJ adults' ability to identify and discriminate the members of a synthetic /r/–/l/ continuum. The authors designated two groups of NJ speakers as "inexperienced" and "experienced" based on differences in length of residence (LOR) in the United States (means = 0.7 vs. 2.3 years) and self-reported percentage use of English (means = 29 vs. 55 percent). All NJ participants had begun to study English at school in Japan at about the age of 13 years. The aim of the study was to determine how many participants would show "categorical perception," that is, would demonstrate better discrimination for pairs of stimuli labeled "r" and "l" than for pairs of stimuli that were both labeled "r" or both labeled "l." Six of seven inexperienced NJ speakers showed near-chance performance, but all five experienced NJ speakers provided evidence of categorical perception.

The results of MacKain et al. (1981) suggested that NJ adult learners of English may be able to form new phonetic categories for at least one of the English liquids after having obtained relatively little conversational experience in English. However, an incidental finding of that study cast doubt on the role of input. After the authors had completed data analysis, they were able to test an additional NJ speaker who had just arrived in the United States and reported using English just 25 percent of the time. This new NJ participant provided evidence of categorical perception despite having had little conversational experience in English. One possible explanation for this surprising result is that the Japanese /R/ category of the last participant to be tested mapped onto English liquids in a way that encouraged processing of English /l/ but not /r/ in terms of the Japanese /R/ category.
Another possibility is that this participant had an especially great aptitude for speech learning (see Chapter 1 for discussion).

A training study carried out with young NJ adults in Tokyo provided evidence of important individual differences. J. Yamada (1991) administered one hour of identification training to 152 college students. Feedback was provided as the students identified the initial consonant in minimally paired nonwords beginning with /r/ and /l/. Several students could already identify /r/ and /l/ correctly in all four minimal pairs before training began. After training, an additional 6 percent of the students could correctly identify /r/ and /l/ in all four minimal pairs, 35 percent could do so for several minimal pairs, and 51






percent of the students remained at chance for all four minimal pairs. The inter-subject variability could not be attributed to differences in conversational experience in English and so might have derived from differences in the students' /R/ categories, their aptitude for speech learning, or both.

R. Yamada (1995) examined the identification of synthetic stimuli by 276 NJ speakers having a mean age of 20 years. More than half were tested in Tokyo after returning to Japan following a period of residence in the United States. The "returnees" were assigned to five groups based on LOR. Also included were NJ speakers who had never lived abroad and NE speakers residing in Japan. The NJ and NE speakers identified members of a synthetic continuum ranging from right to light in which both F2 and F3 values varied, but not independently (see also Yamada & Tohkura, 1992). The stimuli were identified as "r," "l," or "w." The /r/–/l/ identification functions obtained for the NJ returnees resembled those of the NE speakers more than those of NJ speakers who had never lived abroad. However, the returnees' identification functions exhibited more variation than those obtained for the NE speakers.

Yamada (1995) also examined the identification of synthetic stimuli in which F2 and F3 frequencies varied independently. NJ speakers who had never lived abroad gave far more "w" responses than the NE speakers did. They relied on F2 and made little use of F3 to distinguish English /r/ and /l/. Their responses might be considered typical for young NJ adults who have had little or no conversational experience in English. The returnees resembled the NE speakers to a greater extent, but some made far less use of F3 frequency than the NE speakers did. The returnees' use of F3 when identifying English liquids varied as a function of LOR in the United States. The longer the returnees had lived there, the more they used F3.
The percentage of correct identifications of /r/, /l/, and /w/ in naturally produced words also varied as a function of LOR, with percent correct scores increasing with LOR for all three English consonants. Yamada (1995) concluded that two factors influenced the returnees' perception of English /r/ and /l/. An endogenous (or "biological") factor was degree of plasticity, which decreased as the age of arrival in the United States increased. The other factor related to the overall amount of L2 input that had been received in the United States, as indexed by LOR.

Unfortunately, LOR and age of arrival in the United States were confounded in the Yamada (1995) study. Most participants who had lived in the United States for a relatively long period of time had arrived there




before the age of eight years. Given that L2 category precision tends to decrease as a function of age, albeit not in a linear fashion, this might have made it easier for the participants who had arrived in the United States as young children to discern Japanese–English phonetic differences than for those who arrived later in life. As well, the NJ speakers who lived in the United States as children may have received more native-speaker English input than those who lived there in adulthood.

To summarize so far, NJ speakers begin to perceive English /r/ as phonetically more dissimilar from Japanese /R/ than English /l/ as they gain experience using English to communicate. Some NJ speakers who were first immersed in English in adulthood are more successful than others in learning to perceive English liquids, but it is uncertain at present why this is so. F3 is not used to distinguish Japanese consonants but is crucial for the phonetic specification of English /r/. As predicted by the SLM-r, evidence exists that NJ adults can gain access to this dimension.

The research cited so far was subject to an important limitation that might help explain the inter-subject variability. With the exception of some participants in the Yamada (1995) study, the research examined NJ speakers having far less English input than monolingual NE children usually need to perceive English liquids in an adult-like fashion. That being the case, the research considered so far provided little insight into the extent to which NJ speakers might eventually learn English liquids, especially if they are exposed primarily to native-speaker input.

Flege, Takagi, and Mann (1996) recruited a group of NJ speakers who had substantially more conversational experience in English than those tested previously. A total of 24 NJ adults were recruited by a native Japanese research assistant and his wife outside a Japanese-owned food and cosmetics store in Irvine, California.
To mask the phonetic focus of the research, NJ speakers entering the store were invited to participate in research examining their knowledge of English words. Two gender-balanced groups of 12 each were selected based on LOR in the United States. Members of the group designated "experienced" had lived in the United States far longer than those in the "inexperienced" group (mean = 20.8 years, range = 12 to 29, vs. mean = 1.6 years, range = 0.7 to 3.0). The experienced and inexperienced groups also differed significantly in chronological age (means = 44 vs. 35 years), age of arrival in the United States (means = 23 vs. 34 years), and self-reported use of English (means = 5.1 vs. 3.4 on a 7-point scale). These differences may have contributed to between-group differences in






addition to, or instead of, LOR. No information was obtained regarding how the participants' Japanese /R/ categories were specified, nor was degree of L1 category precision, perceived cross-language phonetic dissimilarity, or L1–L2 perceptual assimilation mapping patterns evaluated.

The stimuli used by Flege et al. (1996) in a four-alternative forced-choice identification test were seven naturally produced English words each beginning with /w/ and /d/, 19 words each beginning with /r/ and /l/ (e.g., right, light), and four nonwords beginning with /r/ and /l/ (e.g., ruck, lun). As expected, the NJ speakers had difficulty correctly identifying /r/ and /l/ but not /w/ and /d/. A (2) Group × (2) Consonant ANOVA carried out for this chapter indicated that the members of the experienced group identified the two English liquids (/r/, /l/) correctly significantly more often than members of the relatively inexperienced group and that scores were significantly higher for /r/ than /l/ (p < 0.05).

The difference between groups differing in LOR in the Flege et al. (1996) study agreed with the results of Ingvalson, McClelland, and Holt (2011). These authors found that NJ speakers who had lived in the United States for more than 10 years correctly identified the liquids in natural productions of English words significantly more often than NJ speakers who had lived in the United States for fewer than two years (see also Yamada, 1995). It is important to note, however, that Ingvalson et al. (2011) did not examine /r/ and /l/ separately but instead combined the results obtained for both /r/ and /l/ across a wide range of phonetic contexts (e.g., rock-lock, crack-clack, array-allay, mire-mile, heart-halt).

The experienced NJ participants examined by Flege et al. (1996) performed well, but their percent correct identification scores for the English liquids, while high (mean = 92 percent correct), were not perfect, as was the case for members of the NE comparison group.
The few errors that were noted seemed to be the result of lexical bias. According to the SLM-r, the categorization of L2 sounds depends on how well acoustic properties of the L2 sounds conform to the phonetic categories used in lexical access and word recognition. However, the conscious judgments obtained in the overt identification task used by Flege et al. (1996) were subject to the influence of lexico-phonological codes (see Figure 1.1). Lexical biases can shift conscious judgments away from percepts arising at a phonetic category level for certain stimuli. For example, English monolinguals are more likely to judge a stimulus having an ambiguous VOT value as dash, a word in English, than as *dask, a nonword




(Ganong, 1980). Lexical bias has also been shown to affect NJ speakers' identification of English liquids (Yamada, Tohkura, & Kobayashi, 1997). For example, Yoshida and Hirasaka (1983) obtained significantly higher percent correct identification scores for liquids found in English words that are minimally paired in the English lexicon (e.g., rock and lock) than for liquids found in nonword pairs (e.g., *remp and *lemp) or in stimulus pairs consisting of a word and a nonword (e.g., run-*lun, *rike-like).

Flege et al. (1996) evaluated the influence of lexical bias on NJ speakers' identification of English words and nonwords beginning with /r/ and /l/. The NJ speakers heard and saw (in writing) 46 English stimuli one at a time after being informed that not all test items were real English words. They responded to each item using one of five response alternatives that were provided. The range of alternatives served to indicate individual participants' knowledge of the test stimuli. The alternatives included the correct definition of a test item (response alternative #1), an incorrect definition of the item (#2), and the definition of a word that was minimally paired with the test item (#3). Selection of alternatives #4 and #5 indicated that a participant was "not sure" of an item's meaning or had "never read or heard" the item. The NJ speakers were instructed to press button #1, #2, or #3 if they thought they knew an item, but to press button #4 or #5 if they did not. "Known" items were then rated for subjective familiarity using a scale ranging from 1 (never heard and said) to 7 (very often heard and said). A familiarity rating of 1 was assigned to items that were admittedly not known (button #4 or #5).

Figure 2.3 plots the mean subjective familiarity ratings obtained for the two NJ groups as a function of the mean ratings obtained for the NE group.
Members of the inexperienced but not the experienced group judged the words to be significantly less familiar to them than the NE speakers did. Importantly, however, the ratings for both NJ groups correlated significantly with the NE speakers' ratings (Spearman rho = 0.92 and 0.90, respectively). For both the NE and NJ participants, for example, the word room was rated as more familiar than the word rook, and look was rated as more familiar than loom.

The SLM-r is set within a generic three-level model of speech production and perception according to which information in lexical representations can influence lower-level phonetic judgments. To evaluate categorization at a phonetic level, Flege et al. (1996) compared the identification






Figure 2.3  The mean subjective familiarity ratings (1–7 scale) for English words obtained by Flege et al. (1996) for two groups of native Japanese speakers (experienced and inexperienced) plotted as a function of the mean ratings obtained for the same words for native English speakers.

of liquids in three subsets of the stimuli: a positive set (a more familiar /r/ than /l/ word, e.g., room, loom), a neutral set (equal familiarity, e.g., rate, late), and a negative set (a less familiar /r/ than /l/ word, e.g., rook, look). The percent correct identification scores for the three "relative familiarity" sets did not differ for the NE speakers. The NE speakers' identifications of English liquids were not affected by relative subjective word familiarity because the NE speakers were never uncertain whether the word and nonword stimuli began with /r/ or /l/. However, the NJ speakers were more likely to correctly identify an English liquid when it occurred in a word that was more familiar than its minimal pair.

A separate analysis was carried out to reduce or eliminate lexical bias effects. It focused on "balanced" pairs of real-word stimuli, that is, stimuli that were equally familiar on average to the NJ speakers. Flege et al. (1996) found that in this subset of stimuli, /r/ tokens were identified perfectly by nine of the 12 experienced and by six of the 12 inexperienced NJ speakers. For /l/, perfect scores were obtained by 4 of the 12 experienced NJ speakers and by none of the 12 inexperienced NJ speakers. The difference in the number of NJ speakers showing perfect identification of /r/ and /l/, 15 vs. 4, reached significance, χ²(1) = 4.7, p < 0.05. The results supported the hypothesis that once lexical effects have been neutralized, a better match existed between the naturally produced English stimuli and

the categories the NJ speakers used to identify /r/ than the categories they used to identify /l/.

Flege et al. (1996) carried out a follow-up experiment with new NE-speaking listeners to further investigate the role of lexical bias. The lexical status of stimuli was evident in the unedited condition (e.g., ripe vs. *lipe) but not in the edited condition, where lexical status differences were masked by removing the final consonants (ripe and *lipe became rye and lie, two real English words). As expected, the percent correct identification scores were higher for words than nonwords, but this difference disappeared in the edited condition, where differences in lexical status were not evident.

The authors offered an account of the lexical bias effects observed for the NJ but not the NE speakers that was derived from the Theory of Signal Detection. This account, which is illustrated in Figure 2.4, assumes that the composite English /l/–Japanese /R/ categories developed by the NJ speakers are more broadly tuned than their new English /r/ categories. This is because the new English /r/ categories were defined by the distribution of /r/ tokens alone rather than by the distribution of tokens defining two categories.

In summary, research examining the perception of English liquids supports the SLM-r hypothesis that NJ speakers will be more likely to create a new phonetic category for English /r/ than for /l/ because of a difference in perceived cross-language phonetic dissimilarity. It also supports the hypothesis that they will use a composite L1–L2 category based on the distribution of Japanese /R/ and English /l/ tokens when identifying English /l/ tokens.

Figure 2.4  An account of the effects of subjective lexical familiarity on native Japanese speakers’ identifications of /r/ and /l/ (Flege et al., 1996) that was inspired by the Theory of Signal Detection.
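The signal-detection account illustrated in Figure 2.4 can be sketched with a toy simulation (ours, not the authors' model): the narrowly tuned new /r/ category and the broadly tuned composite /l/–/R/ category are modeled as Gaussians on a single arbitrary perceptual dimension, and tokens are identified by maximum likelihood. All numerical values are invented for illustration.

```python
import random
from math import exp, pi, sqrt

def gauss_pdf(x, mu, sigma):
    """Density of a Gaussian category centered at mu with spread sigma."""
    return exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * sqrt(2 * pi))

# Invented tuning: the new /r/ category is narrow because it is based on
# /r/ tokens alone; the composite /l/-/R/ category is broad because it
# pools English /l/ and Japanese /R/ tokens.
CATEGORIES = {"r": (1.0, 0.3), "l": (3.0, 1.2)}  # (mean, sd) per category

def identify(x):
    """Maximum-likelihood identification of a token at position x."""
    return max(CATEGORIES, key=lambda c: gauss_pdf(x, *CATEGORIES[c]))

random.seed(1)
n = 10_000
r_correct = sum(identify(random.gauss(*CATEGORIES["r"])) == "r" for _ in range(n)) / n
l_correct = sum(identify(random.gauss(*CATEGORIES["l"])) == "l" for _ in range(n)) / n
# The broad composite category spills into /r/ territory, so /l/ tokens
# are misidentified more often than /r/ tokens.
print(r_correct > l_correct)  # True
```

Under these assumptions the narrow /r/ category yields fewer confusions than the broad composite /l/–/R/ category, mirroring the /r/-over-/l/ identification advantage reported above for the NJ speakers.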






2.5  Perceptual Use of F2 and F3

According to the SLM-r, L2 learners of all ages can gain perceptual access to the full range of acoustic cues needed to define L2 sounds, including dimensions not used to define L1 phonetic categories (Idemaru & Holt, 2011, 2014). As mentioned earlier, a low starting F3 frequency is the primary acoustic cue used by NE speakers to identify a liquid as /r/, but this dimension is not used as the primary perceptual cue for any Japanese phonetic category. The results of research summarized in this section support the SLM-r hypothesis that NJ speakers can gain perceptual access to the F3 dimension in English liquids.

Gordon, Keyes, and Yung (2001) tested 12 NJ adults having FTE years of English input that ranged from 0.03 (essentially none) to 4.0 years. We might consider all of these NJ participants "inexperienced" in that they had all received substantially less English input than NE children normally need to learn English liquids. The NJ participants identified naturally produced tokens of /r/ and /l/ as well as the members of a synthetic /r/–/l/ continuum. The authors found that the amount of F3 use in the identification of the synthetic stimuli predicted the combined correct identifications of natural productions of /r/ and /l/, which were not examined separately.

Iverson et al. (2003) created a grid of stimuli that varied orthogonally in F2 and F3 frequencies, with equal auditory distances between adjacent stimuli. The participants, 24 NE adults living in London and 24 NJ speakers in Tokyo, identified each stimulus in terms of their own native-language phonemes and rated whether the stimulus was a good exemplar of that category (1 = bad, 7 = good). The NE speakers made frequent use of both available L1 category labels ("r," "l"), whereas the NJ speakers labeled most stimuli as "R," rarely using the Japanese "w" label that was provided.
The authors also obtained ratings of the perceived dissimilarity of pairs of stimuli (1 = dissimilar, 7 = similar), which were submitted to a multidimensional scaling (MDS) analysis. The MDS analysis revealed that the NE speakers gave greater perceptual weight to the F3 dimension than might have been expected based on purely auditory differences between pairs of stimuli whereas the NJ speakers made less use of the F3 dimension and showed no evidence of perceptually “stretching” the F3 dimension as the NE-speaking participants did. Iverson et al. (2003) concluded that an insufficient use of F3 by the NJ speakers was not due to an auditory-level insensitivity (but see Kachlika et al., 2019). The authors interpreted their results to mean that the



James Emil Flege, Katsura Aoyama, and Ocke-Schwen Bohn

phonetic systems that NJ adults bring to English are “mistuned” for the learning of English /r/ and /l/, and that this mistuning may be “difficult to reverse” in later stages of L2 learning (2003, p. B55). The mistuning to which the authors referred was further clarified by Iverson, Hazan, and Bannister (2005). These authors noted that the NJ adults tested earlier were more sensitive to within-category variation in a discrimination task than were NE adults, and that they were more sensitive to F2 frequency values than NE adults. The authors suggested that the primary problem facing NJ adults may not be attending to F3 frequency values, but rather focusing too much attention on a dimension that is noncritical for NE adults, F2 frequency.

A generally lesser use of the F3 dimension by the NJ than NE speakers might be seen as an “L1 optimized” weighting pattern (Lotto, Sato, & Diehl, 2004) or as evidence of Automatic Selective Perception (Strange, 2011). Regardless of how one chooses to regard this phenomenon, it does not appear to be unmodifiable. Takagi and Mann (1995) examined the identification of an /r/-/l/ continuum by two groups of NJ speakers living in the United States. The NJ speakers generally used F3 less than the NE speakers did, but those with relatively long residence in the United States showed significantly greater use of F3 than those with a shorter residence. Shinohara and Iverson (2018) found that perceptual training augmented NJ adults’ sensitivity to the F3 dimension near the boundary between English /r/ and /l/.

Work by Hattori and Iverson (2009; see also Hattori, 2009) indicated that this may occur in the absence of formal training. These authors tested 36 NJ adults on an array of synthetic speech stimuli differing along three frequency dimensions (F1, F2, F3) and two temporal dimensions (closure and F1 transition durations). 
Custom software enabled participants to quickly search through the stimulus array to find the “best exemplars” of /r/ and /l/. Figure 2.5 replots data presented by Hattori (2009), showing the mean F3 values in the stimuli that were selected as the best exemplars of the English /r/ and /l/ categories. The values obtained for the NE and NJ participants in Figure 2.5 are surprisingly similar, indicating that the NJ speakers did gain perceptual access to the F3 dimension. The NE speakers’ preferred F3 values for /r/ and /l/ did not overlap whereas some overlap was evident for the NJ speakers. Most importantly, there was substantial overlap between the NJ speakers’ preferred F3 values for English /l/ and Japanese /R/.

What accounted for the inter-subject variability observed by Hattori and Iverson (2009)? The individual differences may have reflected differences in the English input. Four of the NJ participants tested in London






Figure 2.5  The preferred F3 values obtained from (a) native speakers of English and (b) native speakers of Japanese for English /r/ and /l/ (both groups) and Japanese /R/ (just the native Japanese speakers). The data are from Hattori (2009).

reported never using English and so had an FTE value of zero. They might be considered functionally equivalent to age-matched adults recruited in Japan. Only eight of the 36 NJ speakers had more than 1.0 FTE years of English input.

An alternative explanation pertains to the participants’ Japanese /R/ categories. Japanese /R/ can be articulated in diverse ways (Akamatsu, 1971; Arai, 2013; Best & Strange, 1992). Miyawaki et al. (1975, p. 332) reported that the onset frequency of F3 in Japanese /R/ varies “unsystematically” over a range of values that is “sufficient to distinguish” English /r/ from English /l/ in word-initial position. We suppose that the authors were referring to differences between individual NJ monolinguals, not to token-to-token variability in the speech of individuals, inasmuch as intrasubject variability is rarely examined. Lotto et al. (2004) noted that the F3 values in Japanese /R/ occur near the boundary between the distribution of values for English /r/ and /l/. Their summary statistics, intended to describe the Japanese language, also point to the existence of potentially important individual differences in the specification of the Japanese /R/ category. If so, the inter-subject variability seen in Figure 2.5 may simply reflect differences in the Japanese /R/ categories that individual NJ speakers brought to the task of learning English liquids.

Previous L2 speech research has seldom considered the possibility that individual differences in L1 phonetic categories might influence L2

speech learning. A study by Ingvalson et al. (2011) bears on the “L1 category” explanation regarding NJ speakers’ use of the F3 dimension. These authors noted a correlation between percent correct identification of English words with /r/ and /l/ and the use participants made of F3 frequency cues. However, F3 use did not increase significantly as a function of length of residence (LOR) in the United States. LOR is known to be a poor index of overall input in English, and so the lack of a correlation with LOR does not rule out an L2 input explanation (Chapter 1). The link between F3 cue weighting and identification may simply have reflected how readily individual NJ participants discerned the phonetic difference between English /r/ tokens and Japanese /R/ based on how perceptual cues were weighted in their Japanese /R/ categories or how precisely those categories were defined when the participants were first exposed to English.
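The multidimensional scaling (MDS) analyses cited in this section (e.g., Iverson et al., 2003) embed pairwise dissimilarity judgments in a low-dimensional perceptual space. The sketch below is a minimal classical (Torgerson) MDS in NumPy rather than the nonmetric routines such studies typically use; the four-stimulus dissimilarity matrix is hypothetical, and similarity ratings on a 1–7 scale would first be inverted so that larger numbers mean greater perceptual distance.

```python
import numpy as np

def classical_mds(D, k=2):
    """Classical (Torgerson) MDS: embed n points in k dimensions
    from an n x n symmetric dissimilarity matrix D."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n      # centering matrix
    B = -0.5 * J @ (D ** 2) @ J              # double-centered Gram matrix
    vals, vecs = np.linalg.eigh(B)           # eigenvalues in ascending order
    idx = np.argsort(vals)[::-1][:k]         # keep the top-k dimensions
    scale = np.sqrt(np.clip(vals[idx], 0.0, None))
    return vecs[:, idx] * scale              # n x k coordinates

# Hypothetical dissimilarities among four stimuli: two /r/-like and two
# /l/-like tokens, with small within-category and large between-category
# perceptual distances.
D = np.array([[0., 2., 6., 6.],
              [2., 0., 6., 6.],
              [6., 6., 0., 2.],
              [6., 6., 2., 0.]])
coords = classical_mds(D, k=2)  # 2-D "perceptual map" of the four stimuli
```

In a map recovered this way, listeners who perceptually “stretch” a dimension (as the NE listeners did with F3) show inflated distances along that axis relative to the purely auditory differences between stimuli.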

2.6  Production of /r/ and /l/

Cochrane (1980) was among the first to demonstrate the difficulty that NJ speakers, even children, are likely to experience when attempting to produce English liquids. She tested 54 NJ children ranging in age from 3 to 13 years and 24 NJ adults who had lived for about 2 years in the United States. The accuracy with which the NJ speakers produced word-initial and intervocalic English liquids in three production tasks was evaluated by NE-speaking listeners. Both the NJ adults and the NJ children were judged to have produced /r/ more accurately than /l/ in all three speaking tasks. The productions of some children, but no productions by adults, were judged to be native-like. As well, the adults were judged to have used a “Japanese” sound, probably /R/, more often than the children when producing English liquids.

Our inference that participants in the Cochrane (1980) study substituted Japanese /R/ for English liquids was supported by Riney, Takada, and Ota (2000). These authors explicitly evaluated the frequency of /R/-for-/r/ and /R/-for-/l/ substitutions in English words spoken by university students in Japan. These NJ adults were found to substitute /R/ for /l/ significantly more often than /R/ for /r/ (means = 25 vs. 15 percent). The less frequently the NJ adults used /R/ in English words, the milder was their overall degree of perceived foreign accent in English sentences, r = −0.867, p < 0.002, suggesting that a decreasing deployment of Japanese /R/ in the production of English words reflected a global improvement in English pronunciation.






Two studies showed that university students in Japan produced /r/ more accurately than /l/ in the initial position of English words. Students in a study by Riney and Flege (1998) read a list of 84 English words that included five words beginning with /r/ and five non-minimally paired words beginning with /l/ (e.g., rag, last). NE-speaking listeners identified which of the two liquids had been produced. The intelligibility scores were substantially higher for /r/ than /l/ (means = 84 vs. 55 percent). Words produced by university students in a study by Bradlow et al. (1997) were also evaluated by NE-speaking listeners. Intelligibility was significantly higher for /r/ than /l/ both before and after perceptual training on English liquids was administered to the NJ participants (means = 68 vs. 60 percent, 81 vs. 68 percent).

Shinohara and Iverson (2018) trained NJ speakers in the United Kingdom and Japan to both identify and discriminate English liquids. A total of 10 perceptual training sessions led both to improved perception and to improved production in the absence of specific training on production. Acoustic measurements revealed that the trainees lowered F3 frequency in /r/ productions and raised F3 values for /l/ productions. However, the post- versus pretraining change was said to be larger for /r/ than for /l/. This finding supported the SLM-r hypothesis that perceptual cues not exploited in the L1 can be learned, and it showed, indirectly, that category formation is more likely for /r/ than /l/ due to the greater perceived phonetic dissimilarity of /r/ than /l/ from Japanese /R/.

Aoyama et al. (2004) evaluated the production of /r/ and /l/ by NJ adults and children 0.5 years after their arrival in the United States (time 1) and again a year later (time 2). Data were also acquired twice from age-matched NE adults and children. As expected, the NE speakers’ productions were nearly always heard as intended by NE-speaking listeners. 
However, the productions of /l/ by both NJ groups were sometimes heard as /r/, and vice versa. The children’s intelligibility scores were found to increase significantly from time 1 to time 2 for /r/ but not /l/. For adults, neither /r/ nor /l/ productions became significantly more intelligible over the one-year study interval.

Two acoustic studies provided evidence of learning in relatively inexperienced NJ speakers of English. Aoyama, Flege, Akahane-Yamada, and Yamada (2019) carried out acoustic analyses of /r/ and /l/ productions by the participants in the Aoyama et al. (2004) study. For /r/, F3 frequency decreased from time 1 to time 2, thereby becoming more English-like in words spoken by both NJ children and adults. However, almost all acoustic parameters in words spoken by age-matched NJ and NE

speakers differed significantly at time 2. The authors concluded that the perceived improvement of the NJ children’s production of /r/ in the earlier study was probably due to a combination of changes in their productions of both /r/ and /l/.

Saito and Munro (2014) tested 60 NJ speakers who had come to Canada to study English. The study focused on the NJ speakers’ production of /r/ (/l/ was not examined). The young adult participants were assigned to groups according to length of Canadian residence (means = 1.0, 2.5, 5.2, and 10.1 months). The authors conducted acoustic analyses of word-initial tokens of /r/ elicited in three speaking tasks. Values for F2 frequency, F3 frequency, and the duration of F1 formant transitions became more native-like as LOR increased.

Flege, Takagi, and Mann (1995) tested the same NJ speakers whose perception was examined in the Flege et al. (1996) study summarized earlier in this chapter. The 12 participants each in the groups designated “experienced” and “inexperienced” differed in LOR in the United States (means = 20.8 vs. 1.6 years). The study was designed to obviate use by the NJ participants of conscious articulation strategies, something considered likely given widespread English-language instruction in Japan. Japanese students may be instructed, for example, to round the lips while saying a Japanese /w/ in order to produce English /r/ (Goto, 1971; Yamada & Tohkura, 1990; see also Lotto et al., 2004, for strategies that might be used by NJ speakers who are called upon to identify liquids as “r” or “l”). When recruited, the NJ speakers were asked to participate in a study focusing on their knowledge of English words. Production of words with /r/ and /l/ was elicited in three tasks only after the NJ speakers had been asked to define and rate the familiarity of English words. Perception was evaluated after the production tasks described here had been completed. 
In the first production task the NJ speakers heard Japanese translation equivalents of the intended English target words. They selected one of three aurally presented English definitions if they thought they knew the English test word that was meant to be elicited. If not, further hints were provided in Japanese by the NJ-speaking experimenter. The NJ speakers first produced the target word in isolation. Only when the experimenter nodded to confirm that the intended target word had been selected did participants say the word a second time at the end of a carrier phrase. In the second elicitation task the NJ speakers read the same target words from a list, and in the third, unscripted task, they inserted the target word into a phrase or sentence of their own choosing.






For the present chapter, we reexamined the results obtained in Experiment 4 of the Flege et al. (1995) study. CV stimuli were edited out of productions of four words each with /r/ (right, rock, read, rate) and four words with /l/ (light, lock, lead, late) by deleting the final consonants. The syllables beginning in /r/ and /l/ were presented in separate, counterbalanced blocks to 12 NE-speaking listeners, who always knew which of the two English liquids presented in a block had been intended by the speakers. The listeners rated each liquid production on a scale ranging from 1 (strong foreign accent) to 7 (no foreign accent).

We tested the statistical significance of differences in /r/ tokens produced by the NE and experienced NJ groups (NE vs. Experienced) and by the NE and inexperienced NJ groups (NE vs. Inexperienced). The /r/ and /l/ tokens elicited in the three speaking tasks were evaluated in a series of six t-tests. The NE-INEXP difference was significant for all three speaking tasks (Bonferroni-adjusted p < 0.05) whereas the NE-EXP difference never reached significance. This suggested that NJ speakers can learn to produce /r/ accurately even when their productions are evaluated using a highly sensitive auditory evaluation procedure.

A very different outcome was obtained when the same analysis procedure (3 × 2 = 6 independent t-tests with pooled variances) was applied to the /l/ productions. The NE-INEXP differences were significant for all three speaking tasks (Bonferroni p < 0.05) whereas the NE-EXP differences reached significance at the 0.05 level for the definition and word-list reading tasks but not the unscripted speech task. These results indicated that even the NJ speakers who had lived for many years in the United States were unable to produce /l/ accurately in a full range of speaking tasks. The Flege et al. 
(1995) research was carried out within the framework of the SLM (Flege, 1995), a model whose principal aim was to account for the effect of chronological age at the time of first exposure to an L2 on the performance of highly experienced L2 learners. The SLM emphasized the importance of input in the process of L2 speech learning. The more English-like production of /r/ by the relatively “experienced” than “inexperienced” NJ group is consistent with the SLM view that adults retain the ability to learn speech, and so their performance on some, but not all, L2 sounds will improve as a function of the quantity and quality of input received. The SLM-r (Flege & Bohn, 2020), on the other hand, is an individual differences model that aims to explain how phonetic systems (in monolinguals) and phonetic subsystems (in bilinguals) reorganize over the life-span in response to



Figure 2.6  The mean ratings of /r/ and /l/ productions that were obtained for the 12 participants in three groups. The “Exp” and “INexp” groups consisted of native Japanese adults differing primarily in length of residence in the United States.

variation in phonetic input. The individual learner is the primary unit of analysis, not groups. It is therefore worth considering the individual data obtained by Flege et al. (1995).

Figure 2.6 shows the ratings of /r/ and /l/ production accuracy for all 36 participants tested by Flege et al. (1995). The values shown here have been averaged over the three speaking tasks. One of the 12 NE speakers, indicated by an arrow, was a clear outlier for /r/ but there was no NE outlier for /l/. This is not an entirely unexpected finding given that monolingual NE children take longer to learn to produce /r/ accurately than /l/, and given that some NE boys do not yet produce English /r/ accurately by the age of nine years (Smit et al., 1990). More of the 24 NJ speakers obtained ratings that fell within the range of values obtained for the NE speakers for /r/ (outlier excluded) than /l/ [21 vs. 5, χ²(1) = 9.8, p < 0.001], confirming the SLM-r prediction that /r/ is more learnable than /l/ owing to its greater perceived dissimilarity from Japanese /R/.

Flege et al. (1995) also examined /r/ and /l/ productions acoustically. In a regression analysis, four acoustic measures (F2 and F3 starting frequencies, F2 and F3 frequencies at the end of rapid spectral change) accounted for 65 percent of the variance in listener ratings of /r/ production accuracy. The beta weight of the F3 onset frequency variable was substantially larger (−0.965) than the beta weights for the other variables (0.237 to −0.467), supporting the view that F3 onset frequency is the single most important acoustic phonetic dimension specifying English /r/ for






NE-speaking listeners. Figure 2.7 shows mean F3 onset frequency values as a function of NE listeners’ ratings of the /r/ tokens produced by all 36 participants. The lower, and thus more English-like, were the F3 onset frequencies for /r/, the higher were the listener ratings of production accuracy [r(34) = −0.83, p < 0.001]. The correlation remained significant, r = −0.41, p < 0.05, when just the 24 NJ speakers were considered. This also confirms the perceptual importance of F3 onset frequency. The NE outlier evident in Figure 2.7 (arrow) is the same individual whose /r/ productions were judged by NE-speaking listeners to have been produced poorly. Not all native speakers of English, it seems, manage to “master” the production of English /r/. This illustrates the peril of evaluating L2 learners’ success in learning L2 vowels and consonants by comparing the performance of individuals to that of a native-speaker comparison group, especially a small one that may or may not adequately represent the persons who provided models for the L2 learners.

A consideration of Figures 2.6 and 2.7 indicates that important differences existed both between and within the NJ groups. For the SLM-r, the most important question is not whether a significant difference existed between the two groups of 12 NJ speakers but why some NJ participants performed so much better than others. In Figure 2.6, for example, we see that two of the 12 so-called inexperienced NJ participants obtained values

Figure 2.7  The mean ratings of /r/ tokens produced by the members of three groups as a function of the F3 values in the rated tokens.




for /r/ that fell within the range of values observed for the “experienced” NJ participants and the NE speakers. Did their Japanese /R/ categories differ from those of less successful participants when they were first exposed to English? Did they receive more and/or better-quality English phonetic input? Did they possess some special ability that enabled them to make better use of the input they received?

In summary, research examining the production of English liquids by NJ children and adults yielded results like those obtained in the perceptual research reported earlier. NJ speakers were found to produce English /r/ more accurately than /l/. This, when taken together with the perception findings, supports the SLM-r hypothesis that segmental production and perception coevolve in L2 speech learning. The production findings also support the hypothesis that some NJ speakers develop new phonetic categories for English /r/ whereas the learning that takes place for /l/ consists of the development of a composite L1–L2 phonetic category that is used to guide the production of both Japanese /R/ and English /l/.
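The standardized regression (beta weight) analysis discussed above can be illustrated in code. This is a hedged sketch, not the original analysis: the formant and rating values below are invented, and the real study used four acoustic predictors and ratings pooled over 12 listeners. Beta weights are simply ordinary least-squares coefficients computed after z-scoring each predictor and the outcome.

```python
import numpy as np

def beta_weights(X, y):
    """Standardized regression coefficients: z-score each predictor and
    the outcome, then solve ordinary least squares."""
    Xz = (X - X.mean(axis=0)) / X.std(axis=0)
    yz = (y - y.mean()) / y.std()
    betas, *_ = np.linalg.lstsq(Xz, yz, rcond=None)
    return betas

# Hypothetical acoustic measures for eight /r/ tokens (values invented):
f3_onset = np.array([1600., 1800., 2100., 2400., 2700., 3000., 3200., 3400.])
f2_onset = np.array([1100., 1300., 1200., 1400., 1150., 1350., 1250., 1450.])
# Ratings constructed to fall as F3 onset rises, mirroring the negative
# beta weight reported for the F3 onset frequency variable.
ratings = 7.0 - 0.002 * (f3_onset - 1600.0)
X = np.column_stack([f3_onset, f2_onset])
b = beta_weights(X, ratings)  # b[0] is the F3 beta weight (negative here)
```

Because the predictors and outcome are standardized, the coefficients are comparable across dimensions measured in different units, which is what licenses the comparison of the F3 weight (−0.965) with the others.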

2.7  Summary

Although the research presented here spans a half century, we do not yet have a complete understanding of how native Japanese (NJ) speakers learn to produce and perceive English /r/ and /l/. However, the SLM-r provides a framework that researchers might use when designing future research aiming to provide such an understanding. If the SLM-r hypotheses are confirmed, the notorious “Japanese /r/-/l/ problem” will come to be seen as no problem at all but, rather, as evidence of a predictable and orderly reorganization of the L1 and L2 phonetic subsystems that occurs when individuals of any age are exposed to phonetic input differing from the input they received earlier in life.

More research is clearly needed. As we see it, an ideal future study would examine the learning of English liquids longitudinally, beginning soon after NJ children and adults have first been exposed to conversational English in a context in which they need to learn English for everyday use. Such a study would evaluate the production of English /r/, English /l/, and Japanese /R/ acoustically and via fine-grained listener judgments. The categorization of these and other consonants in the English and Japanese phonetic inventories would be assessed, as would individual participants’ selection of the “best examples” of English and Japanese consonants from rich stimulus arrays.






The first (time “0”) sample obtained for all individuals participating in a longitudinal study would include the following:

1. measures of auditory acuity, early-stage auditory processing, and auditory working memory;
2. a determination of which Japanese consonants are used to classify English /r/ and /l/, as well as other English consonants;
3. ratings of how dissimilar from Japanese consonants the two English liquids and other English consonants are perceived to be; and
4. assessments of how Japanese /R/ is produced, and how perceptual cues are weighted in the categorization of sounds as Japanese /R/.

The Japanese production data obtained at time 0 (#4) would be used to determine how precisely individual participants define their Japanese /R/ category. At time 1 (and subsequent samples) the production and perception of the three consonants of primary interest (/r/, /l/, /R/) would be evaluated along with cross-language mapping, the perception of cross-language phonetic dissimilarity, and the three “auditory” tests (#1 above). The quantity and quality of English phonetic input that each individual has received in each test interval would also be assessed.

The longitudinal study would make it possible to address a number of theoretically important research questions. The first research question to be addressed is whether the precision with which individual NJ participants specify Japanese /R/ differs at time 0. If so, are category precision differences maintained in subsequent data samples? Does /R/ category precision correlate with the participants’ chronological ages? According to the SLM-r, individual participants will be found to differ in /R/ category precision and these differences will persist over time. 
Somewhat greater precision is expected for adults than children but, in a sample including both adults and children, the age-precision correlation will be moderate or weak because both individual adults and children will be found to differ in terms of category precision.

It will be important to determine if the endogenous auditory measures (#1 above) change over time. Will they correlate with the time 0 estimates of /R/ category precision? The expectation here is that the endogenous factors will not change over time but will correlate with /R/ category precision in the first data sample (time 0). Cross-language mapping patterns have never been assessed longitudinally, as far as we know. That being the case, it will be valuable to determine how much phonetic input the NJ participants need before they establish stable cross-language mapping patterns.
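The age-precision correlation just discussed would presumably be a simple Pearson correlation between chronological age and a per-participant precision score. The sketch below uses hypothetical values; only the formula itself is standard.

```python
import math

def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length samples."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical values: chronological age (years) and an /R/ category
# precision score for eight participants.
age = [8, 10, 12, 19, 25, 31, 40, 52]
precision = [0.42, 0.55, 0.48, 0.61, 0.70, 0.58, 0.66, 0.73]
r = pearson_r(age, precision)  # positive but well below 1.0
```

A pattern like this, in which precision trends upward with age but with substantial scatter at every age, is what the SLM-r's prediction of a "moderate or weak" correlation would look like in practice.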



James Emil Flege, Katsura Aoyama, and Ocke-Schwen Bohn

We noted earlier in this chapter that NJ learners of English eventually begin to perceive English /r/ realizations to be more phonetically dissimilar from Japanese /R/ than English /l/ realizations are. A question of interest is how long this takes. Will the amount of input needed for this to happen vary as a function of the endogenous auditory factors? Preliminary evidence presented in this chapter suggests that at least one year of full-time equivalent English input is needed before a perceived difference between /r/ and /l/ (with respect to Japanese /R/) emerges. The SLM-r predicts that the amount of input needed by individuals to differentiate the two English liquids will depend on /R/ category precision. This is because, by hypothesis, relatively great /R/ category precision will facilitate the discernment of Japanese-English phonetic differences.

NJ learners of English eventually produce and perceive English /r/ more accurately than English /l/. It is of interest to learn how long this takes, and if this change is evident for all NJ learners of English. The SLM-r predicts that increasingly more NJ participants, both children and adults, will begin producing and perceiving English /r/ much like age-matched English monolinguals, but that such a change for /l/ will be evident for few, if any, participants.

Finally, it will be important to learn in detail how learning English affects the perception as well as the production of Japanese /R/. The SLM-r predicts that NJ speakers will form a new phonetic category for English /r/ but not for English /l/ because the phonetic distance between /l/ and /R/ is too small. The SLM-r predicts, however, that phonetic learning may nevertheless occur in the absence of category formation for /l/. 
Specifically, the model predicts that NJ learners of English will develop a composite English /l/-Japanese /R/ category (diaphone) based on phonetic input obtained in English and in Japanese (both before and after they began to learn English). By hypothesis, the new English /r/ category formed by an individual will be de-linked perceptually from his or her Japanese /R/ category whereas the distribution of English /l/ tokens to which an individual has been exposed will remain perceptually linked to Japanese /R/ via the cognitive mechanism of interlingual identification.

The SLM-r envisages two ways that learning English might affect how NJ speakers of English produce and perceive Japanese /R/. The Japanese /R/ category of an individual might dissimilate from a newly formed English /r/ category or it might assimilate, that is, come to resemble, English /l/. Both assimilation and dissimilation effects have been observed in L2 research (see Chapter 1). The relative strength of the two






effects has not yet been established, however, nor is it known if both effects can operate simultaneously to affect a single L1 sound. Given the likelihood that NJ speakers bring somewhat different /R/ categories to the task of learning English, it is essential that future work examining Japanese /R/ focus on individual participants.

The overall effect of learning English on Japanese /R/ might be evaluated using a paired comparison test that examines how individuals produced /R/ in the first data sample (time 0, soon after arriving in a predominantly English-speaking country) and when tested in subsequent sessions of the longitudinal study. The task of Japanese monolingual listeners would be to decide which member of a pair of stimuli presented in a trial (a token produced at time 0 and a token drawn from one of the subsequent sessions) was the “better” production of Japanese /R/. If the monolingual Japanese listeners select the time 0 tokens significantly more often than the tokens obtained in subsequent sessions, it would indicate that learning English has affected how the NJ participants produce Japanese /R/. The relative frequency of time 0 selections would indicate the overall magnitude of the L2-on-L1 effect.

Research reviewed in this chapter has pointed to both the underuse of F3 frequency and the overuse of F2 frequency in the production and perception of English liquids by NJ learners of English. Comparison of relevant acoustic phonetic dimensions (e.g., F2 and F3 onset frequencies and transition durations) in time 0 and subsequent /R/ productions might help to distinguish L2-on-L1 effects due to either assimilation or dissimilation.

We are mindful that prospective research like that just outlined will require the expenditure of considerable time and financial resources. Moreover, there is relatively little emigration from Japan to predominantly English-speaking countries at the present time. 
As a practical matter, then, the same or similar questions might be addressed more readily in research examining other L1–L2 pairs and L1 and L2 target sounds.
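The paired comparison evaluation proposed above lends itself to a simple exact binomial test: if learning English has not affected /R/ production, listeners should pick the time 0 token on about half the trials. The sketch below uses hypothetical trial counts and the Python standard library.

```python
from math import comb

def binom_p_at_least(k, n, p=0.5):
    """Exact one-tailed probability of observing k or more successes
    in n independent Bernoulli(p) trials."""
    return sum(comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k, n + 1))

# Hypothetical outcome: listeners preferred the time-0 token on 41 of
# 50 paired-comparison trials. Under "no L2-on-L1 effect" each trial is
# a fair coin flip (p = 0.5).
n_trials, n_time0 = 50, 41
p_value = binom_p_at_least(n_time0, n_trials)
significant = p_value < 0.05  # evidence that /R/ production has changed
```

The proportion of time 0 selections itself (here 41/50) would then serve as the effect-size estimate, that is, the “relative frequency of time 0 selections” indexing the magnitude of the L2-on-L1 effect.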

References

Akamatsu, T. (1971). The problem of the so-called “Japanese R.” Linguistische Berichte, 12, 31–39.
Aoyama, K., & Flege, J. E. (2011). Effects of L2 experience on perception of English /r/ and /l/ by native Japanese speakers. Journal of the Phonetic Society of Japan, 15(3), 5–13.
Aoyama, K., Flege, J. E., Akahane-Yamada, R., & Yamada, T. (2019). An acoustic analysis of American English liquids by adults and children: Native English speakers and native Japanese speakers of English. Journal of the Acoustical Society of America, 146(4), 2671–2681.
Aoyama, K., Flege, J. E., Guion, S. G., Akahane-Yamada, R., & Yamada, T. (2004). Effects of L2 experience on perception of English /r/ and /l/ by native Japanese speakers. Journal of Phonetics, 32, 233–250.
Arai, T. (2013). On why Japanese /r/ sounds are difficult for children to acquire. In F. Bimbot et al. (Eds.), 14th annual conference of the International Speech Communication Association (pp. 2445–2449).
Arai, T., & Mugitani, R. (2016). The acoustic environment and spoken language development by children. Journal of the Acoustical Society of Japan, 72(3), 129–136.
Best, C. T., & Strange, W. (1992). Effects of phonological and phonetic factors on cross-language perception of approximants. Journal of Phonetics, 20(3), 305–330.
Bradlow, A. (2008). Training non-native language sound patterns: Lessons from training Japanese adults on the English /r/-/l/ contrast. In J. Hansen Edwards & M. Zampini (Eds.), Phonology and second language acquisition (pp. 287–308). Amsterdam: John Benjamins.
Bradlow, A., Akahane-Yamada, R., Pisoni, D., & Tohkura, Y. (1997). Training Japanese listeners to identify English /r/ and /l/: IV. Some effects of perceptual learning on speech production. Journal of the Acoustical Society of America, 101(4), 2299–2310.
Bradlow, A., Akahane-Yamada, R., Pisoni, D., & Tohkura, Y. (1999). Training Japanese listeners to identify English /r/ and /l/: Long-term retention of learning in perception and production. Perception & Psychophysics, 61(5), 977–985.
Callan, D. E., Jones, J. A., Callan, A. M., & Akahane-Yamada, R. (2004). Phonetic perceptual identification by native- and second-language speakers differentially activates brain regions involved with acoustic phonetic processing and those involved with articulatory-auditory/orosensory internal models. NeuroImage, 22, 1182–1194.
Callan, D. E., Tajima, K., Callan, A. M., Kubo, R., Masaki, S., & Akahane-Yamada, R. (2003). Learning-induced neural plasticity associated with improved identification performance after training of a difficult second-language phonetic contrast. NeuroImage, 19, 113–124.
Cochrane, R. M. (1980). The acquisition of /r/ and /l/ by Japanese children and adults learning English as a second language. Journal of Multilingual and Multicultural Development, 1, 331–360.
Cutler, A., Weber, A., & Otake, T. (2006). Asymmetric mapping from phonetic to lexical representations in second-language listening. Journal of Phonetics, 34, 269–284.
Delattre, P., & Freeman, D. C. (1968). A dialect study of American English r’s by x-ray motion picture. Linguistics, 44, 28–69.
Flege, J. E., Takagi, N., & Mann, V. (1995). Japanese adults can learn to produce English /r/ and /l/ accurately. Language and Speech, 38, 25–56.



The Revised Speech Learning Model (SLM-r) Applied



Flege, J. E., Takagi, N., & Mann, V. (1996). Lexical familiarity and Englishlanguage experience affect Japanese adults’ perception of /r/ and /l/. Journal of the Acoustical Society of America, 99, 1161–1173. Ganong, W. F. (1980). Phonetic categorization in auditory word perception. Journal of Experimental Psychology, 6(1), 110–125. Gordon, P., Keyes, L., & Yung, Y.-F. (2001). Ability in perceiving nonnative contrasts: Performance on natural and synthetic speech. Perception & Psychophysics, 63(4), 746–758. Goto, H. (1971). Auditory perception by normal Japanese adults of the sounds “L” and “R.” Neuropsychologia, 9, 317–323. Guion, S., Flege, J. E., Yamada, R. A., & Pruitt, J. (2000). An investigation of current models of second language speech perception: The case of Japanese adults’ perception of English consonants. Journal of the Acoustical Society of America, 107(5), 2711–2744. Hattori, K. (2009). Perception and production of English /r/-/l/ by adult Japanese speakers. PhD dissertation, University College London. Hattori, K., & Iverson, P. (2009). English /r/-/l/ category assimilation by Japanese adults: Individual differences and the link to identification accuracy. Journal of the Acoustical Society of America, 125(1), 469–479. Idemaru, K., & Holt, L. (2011). Word recognition reflects dimension-based statistical learning. Journal of Experimental Psychology: Human Perception and Performance, 37(6), 1939–1956. Idemaru, K., & Holt, L. (2013). The developmental trajectory of children’s perception and production of English /r7-/l/. Journal of the Acoustical Society of America, 133(6), 4232–4246. Idemaru, K., & Holt, L. (2014). Specificity of dimension-based statistical learning in word recognition. Journal of Experimental Psychology: Human Perception and Performance, 40(3), 1009–1021. Idemaru, K., Holt, L. L., & Seltman, H. (2012). Individual differences in cue weights are stable across time: The case of Japanese stop lengths. 
Journal of the Acoustical Society of America, 132(6), 3950–3964. Ingvalson, E., Holt, L., & McClelland, J. (2012). Can native Japanese listeners learn to differentiate /r–l/ on the basis of F3 onset frequency? Bilingualism: Language and Cognition, 15(2), 255–274. Ingvalson, E., McClelland, J., & Holt, L. (2011). Predicting native Englishlike performance by native Japanese speakers. Journal of Phonetics, 39, 571–584. Iverson, P., Hazan, V., & Bannister, K. (2005). Phonetic training with acoustic cue manipulations: A comparison of methods for teaching English /r//l/ to Japanese adults. Journal of the Acoustical Society of America, 118(5), 3267–3278. Iverson, P., Kuhl, P., Akahane-Yamada, R., Diesch, E., Tohkura, Y., Kettermann, A., & Siebert, C. (2003). A perceptual interference account of acquisition difficulties for non-native phonemes. Cognition, 87, B47–B57.



James Emil Flege, Katsura Aoyama, and Ocke-Schwen Bohn

Iverson, P., Wagner, A., & Rosen, S. (2016). Effects of language experience on pre-categorical perception: Distinguishing general from specialized processes in speech perception. Journal of the Acoustical Society of America, 139(4), 1799–1809. Kachlika, M., Saito, K., & Tierney, A. (2019). Successful second language learning is tied to robust domain-general auditory processing and stable neural representations of sound. Brain and Language, 192, 15–24. Kuhl, P., Stevens, E., Hayashi, A., Deguchi, T., Kiritani, S., & Iverson, P. (2006). Infants show a facilitation effect for native language phonetic perception between 6 and 12 months. Developmental Science, 9(2), F13–F21. Ladefoged, P., & Maddieson, I. (1996). The sounds of the world’s languages. Hoboken, NJ: John Wiley. Lambacher, S. (1999). A CALL tool for improving second language acquisition of English consonants by Japanese learners. Computer Assisted Language Learning, 12, 137–156. Lenneberg, E. (1967). The biological foundations of language. New York: John Wiley. Logan, J., Lively, S., & Pisoni, D. (1991). Training Japanese listeners to identify English /r/ and /l/: A first report. Journal of the Acoustical Society of America, 89(2), 874–886. Lotto, A., Sato, M., & Diehl, R. (2004). Mapping the task for the second language learner: The case of Japanese acquisition of /r/ and /l/. In J. Sliftka et al. (Eds.), From sound to sense: 50+ years of discoveries in speech communication (pp. C181–C186). Cambridge, MA: Research Laboratory of Electronics at MIT. MacKain, K., Best, C., & Strange, W. (1981). Categorical perception of English /r/ and /l/ by Japanese bilinguals. Applied Psycholinguistics, 2, 369–390. MacKay, I. R. A., Meador, D., & Flege, J. E. (2001). The identification of English consonants by native speakers of Italian. Phonetica, 58, 103–125. McGowan, R. S., Nittrouer, S., & Manning, C. J. (2004). Development of [ɹ] in young, Midwestern, American children. 
Journal of the Acoustical Society of America, 115(2), 871–884. Mielke, J., Baker, A., & Archangeli, D. (2016). Individual-level contact limits phonological complexity: Evidence from bunched and retroflex /ɹ/. Language, 92(1), 101–140. Miyawaki, K., Jenkins, J., Strange, W., Liberman, A., Verbrugge, R., & Fujimura, O. (1975). An effect of linguistic experience: The discrimination of (r) and (l) by native speakers of Japanese and English. Perception & Psychophysics, 18(5), 331–340. Riney, T., & Flege, J. E. (1998). Changes over time in global foreign accent and liquid identifiability and accuracy. Studies in Second Language Acquisition, 20, 213–243. Riney, T., Takada, M., & Ota, M. (2000). Segmentals and global foreign accent: The Japanese flap in EFL. TESOL Quarterly, 34(4), 711–737.



The Revised Speech Learning Model (SLM-r) Applied



Saito, K., & Munro, M. (2014). The early phase of /ɹ/ production development in adult Japanese learners of English. Language and Speech, 57(4), 451–469. Shimizu, K., & Dantsuji, M. (1983). A study of the perception of /r/ and /l/ in natural and synthetic speech sounds. Studia Phonologica, 17, 1–14. Shinohara, Y., & Iverson, P. (2013). Perceptual training effects on production of English /r/- /l/ by Japanese speakers. Paper presented at the Phonetics Teaching and Learning Conference, London. Shinohara, Y., & Iverson, P. (2018). High variability identification and discrimination training for Japanese speakers learning English /r/-/l/. Journal of Phonetics, 66, 242–251. Smit, A., Hand, L., Freilinger, J., Bernthal, J., & Bird, A. (1990). The Iowa articulation norms project and its Nebraska replication. Journal of Speech and Hearing Disorders, 55, 779–798. Song, J. Y., Shattuck-Hufnagel, S., & Demuth, K. (2015). Development of phonetic variants (allophones) in 2-year-olds learning American English: A study of alveolar stop /t, d/ codas. Journal of Phonetics, 52, 152–169. Strange, W. (2011). Automatic selective perception (ASP) of first and second language speech: A working model. Journal of Phonetics, 39(4), 456–466. Takagi, N. (1993). Perception of American English /r/ and /l/ by adult Japanese learners of English: A Unified View. Unpublished PhD dissertation, University of California at Irvine. Takagi, N., & Mann, V. (1995). The limits of extended naturalistic exposure on the perceptual mastery of English /r/ and /l/ by adult Japanese learners of English. Applied Psycholinguistics, 16(4), 380–406. Vance, T. (2008). The sounds of Japanese. Cambridge: Cambridge University Press. Westbury, J., Hashi, M., & Lindstrom, M. J. (1998). Differences among speakers in lingual articulation for American English /ɹ/. Speech Communication, 26(3), 203–226. Yamada, J. (1991). The discrimination learning of the liquids/r/ and /l/ by Japanese speakers. 
Journal of Psycholinguistic Research, 20(1), 31–46. Yamada, R. A. (1995). Age and acquisition of second language speech sounds: Perception of American English /r/ and /l/ by native speakers of Japanese. In W. Strange (Ed.), Speech perception and linguistic experience: Issue in cross-language research (pp. 305–320). Timonium, MD: York Press. Yamada, R. A., & Tohkura, Y. (1990). Perception and production of syllableinitial English /r/ and /l/ by native speakers of Japanese. In ICSLP-1990, First International Conference on Spoken Language Processing (pp. 757–760). Yamada, R., & Tohkura, Y. (1992). The effects of experimental variables on the perception of American English /r/ and /l/ by Japanese listeners. Perception & Psychophysics, 52(4), 376–392.



James Emil Flege, Katsura Aoyama, and Ocke-Schwen Bohn

Yamada, R. A., Tohkura, Y., & Kobayashi, N. (1997). Effect of word familiarity on non-native phoneme perception: identification of English /r/, /l/, and /w/ by native speakers of Japanese. In A. James & J. Leather (Eds.), Second-language speech, structure and process (pp. 103–118). Berlin: Mouton de Gruyter. Yoshida, K., & Hirasaka, F. (1983). The lexicon in speech perception. Sophia Linguistica, 11, 105–116.

chapter 3

New Methods for Second Language (L2) Speech Research

James Emil Flege*

This chapter focuses on second-language (L2) speech research methodology, a topic that has received relatively little attention in recent years. The lack of attention to methodology, in my opinion, has slowed progress in the field and resulted in a heterogeneity of research findings that are difficult to interpret.

Consider, for example, how the distance between an L2 sound and the phonetically closest L1 sound is assessed. Phonetic distance has traditionally been assessed through listener judgments rather than acoustically. It is widely agreed that the greater the perceived phonetic distance of an L2 sound from the closest L1 sound, the more accurately the L2 sound is likely to be produced and perceived. However, Flege and Bohn (Chapter 1) cited research that failed to support the hypothesis that the magnitude of perceived cross-language phonetic distances is predictive of success in the learning of L2 sounds. Two explanations might be offered for the discordant findings: the hypothesis itself might be incorrect, or the method used to assess perceived cross-language phonetic dissimilarity may have been inadequate. The method used in the cited research involved obtaining two judgments per trial. Participants first classified an L2 sound in terms of an L1 category and then rated how good (or poor) the L2 token was as an instance of the L1 category. Flege and Bohn (Chapter 1) suggested that the paired comparison technique proposed by Flege (2005) may provide a more adequate measure of perceived cross-language phonetic dissimilarity. This method requires the presentation of a pair of L1 and L2 sounds in each trial for a rating that ranges from 1 (very similar) to 7 (very dissimilar).

* The work presented here was supported by grants from the National Institute on Deafness and Other Communication Disorders. I thank Ocke-Schwen Bohn and the many other active L2 researchers with whom I have corresponded in the past year for their comments and suggestions.
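To make the paired-comparison procedure concrete, here is a minimal Python sketch of how the full set of L1–L2 rating trials might be generated and a response validated. Everything beyond the 1–7 scale and the requirement that every L1–L2 token pairing be rated is my own illustration; the token names are hypothetical placeholders for sound files.

```python
import itertools
import random

# Hypothetical token inventories: several natural tokens per language.
L1_TOKENS = ["l1_tok_1", "l1_tok_2", "l1_tok_3", "l1_tok_4"]
L2_TOKENS = ["l2_tok_1", "l2_tok_2", "l2_tok_3", "l2_tok_4"]

def build_pairs(l1_tokens, l2_tokens, seed=0):
    """Return every L1-L2 token pairing, one rating trial each, in random order."""
    pairs = list(itertools.product(l1_tokens, l2_tokens))
    random.Random(seed).shuffle(pairs)
    return pairs

def record_rating(rating):
    """Validate a 1 (very similar) .. 7 (very dissimilar) response."""
    if not 1 <= rating <= 7:
        raise ValueError("rating must be an integer from 1 to 7")
    return rating

trials = build_pairs(L1_TOKENS, L2_TOKENS)
print(len(trials))  # 16 pairings from 4 x 4 tokens
```

The quadratic growth of `len(trials)` with inventory size is exactly why the chapter notes the method is time consuming: realistic token sets yield hundreds of pairings per listener.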






Alas, this method is more time consuming because of the large number of L1–L2 sound pairings, and apparently has never been used in L2 speech research for that reason.

The present chapter has two aims. The first is to identify other crucial methodological issues in L2 speech research. The second aim is to propose specific ways to improve existing methods and to suggest new methods that might be applied to issues of theoretical interest.

The first crucial methodological issue to be considered involves the elicitation of speech production samples. Many published studies have examined the production of L2 sounds, and a smaller number of studies have also examined the production of sounds in learners’ native language (L1). There is general agreement that the traditional word-list reading technique, which is simple to administer, is inadequate for a number of reasons, not the least of which is that differences between individuals in reading ability or familiarity with L2 orthography may influence how target vowels and consonants are produced. A more general problem is the lack of standardized speech elicitation technique(s). This has generated some uncertainty as to whether the L2 speech production samples presented in some published studies are representative of how learners typically produce target L2 speech sounds in everyday life.

The second crucial issue pertains to the assessment of L2 input. There is widespread agreement that the methods currently in use are inadequate. Many studies examining how immigrants learn an L2 in a predominantly L2-speaking environment have relied on length of residence (LOR) to index overall amount of L2 input. However, as discussed by Flege (2019), LOR provides a poor estimate of L2 input because it merely indicates an interval of time, not what occurred during the interval. It also fails to provide insight into the quality of L2 input that has been received by L2 learners. Flege and Bohn (Chapter 1) suggested that LOR be multiplied by the proportion of L2 use, yielding a variable designated “full-time equivalent” (FTE) years of L2 input. This may provide a slightly better estimate of L2 input than LOR alone, but it necessarily relies on learners’ self-estimates of percentage L2 use.

The third crucial issue is how best to assess not only overall percentage L2 use, but also how much of the input obtained by L2 learners has been provided by native speakers as opposed to other nonnative speakers. According to the revised Speech Learning Model, or SLM-r (Chapter 1), L2 speech learning, just like the learning of native-language (L1) speech, is data driven. The SLM-r proposes that the learning of position-sensitive L2 vowels and consonants (“sounds,” for short) by an individual L2 learner is based on the statistical properties of the input distributions to which the L2 learner has been exposed during meaningful conversations. Improved methods are provided here for estimating the overall percentage of L1 and L2 use. A new technique is also unveiled that may make it possible for the first time to quantify how much input learners of an L2 have received, and how much of that input is foreign-accented. This new technique can also be used to test the SLM-r hypothesis that new phonetic categories are based on the input distributions to which L2 learners have been exposed.

The fourth critical issue to be considered here is how to assess the perception of L2 speech sounds. Flege (2003) identified a number of limitations that argue against the use of forced-choice identification tests for assessing L2 vowel perception. These limitations include how to provide labels that L2 learners can use to identify vowels and the confusion and errors that may result from the use of too many labels. Flege (2003) proposed that the perception of L2 sounds be investigated using a three-interval oddity discrimination test. The stimuli used in the “Categorial Discrimination Test” (CDT) he proposed are natural productions by multiple native speakers. When hearing the three stimuli in each trial (always produced by different talkers), participants must identify the serial position of the odd item out, which occurs in 50 percent of the trials, or respond “no” when all three stimuli are instances of a single L2 category. Flege and Wayland (2019) recently noted, however, that the CDT may yield misleading results due to the use of an “X-not-X” decision strategy, which will be discussed later in this chapter. A problem for any kind of discrimination test is the focus on differences between contrastive phonetic categories that exist among L2 sounds rather than on the properties defining L2 sounds. A distinction is drawn here between identification and categorization.
The categorization of L2 speech sounds at a phonetic rather than a phonemic level of processing is deemed to be the most appropriate method for examining the perception of L2 speech sounds.
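The CDT trial structure just described (three intervals, always from three different talkers, with the odd item out present on half the trials) can be sketched as follows. This is an illustrative reconstruction, not the chapter’s own implementation; the category and talker names are hypothetical, and in a real test the tokens would be sound files.

```python
import random

def make_cdt_trial(cat_a_tokens, cat_b_tokens, change, rng):
    """Build one three-interval oddity trial for a Categorial Discrimination
    Test. Each interval comes from a different talker. On a 'change' trial
    one interval is replaced by the other category's token and the correct
    answer is its serial position (1-3); on a catch trial all three intervals
    belong to one category and the correct answer is 'no'."""
    talkers = rng.sample(sorted(cat_a_tokens), 3)
    stimuli = [cat_a_tokens[t] for t in talkers]
    if not change:
        return stimuli, "no"
    odd_pos = rng.randrange(3)
    stimuli[odd_pos] = cat_b_tokens[talkers[odd_pos]]  # odd item out
    return stimuli, odd_pos + 1

# Hypothetical natural tokens of two L2 categories from four talkers:
CAT_A = {"t1": "a_t1", "t2": "a_t2", "t3": "a_t3", "t4": "a_t4"}
CAT_B = {"t1": "b_t1", "t2": "b_t2", "t3": "b_t3", "t4": "b_t4"}

rng = random.Random(1)
# Odd item out on 50 percent of trials, as specified above:
trials = [make_cdt_trial(CAT_A, CAT_B, change, rng)
          for change in [True, False] * 8]
```

Because catch trials share the same "no" response, scoring such a test has to account for response bias, which is one route by which the "X-not-X" strategy mentioned above can distort results.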

3.1  Eliciting Speech Production Samples

The aim of eliciting L2 production samples is to determine how learners typically produce speech in their L2, usually with the intent to determine which aspects of the L2 phonetic system have been learned and which have not been learned or learned only partially. The motivation for eliciting L1 production samples, on the other hand, is to determine if, and to what extent, L2 learning has affected the production of speech sounds in L2 learners’ native language.

3.1.1  General Factors in Bilingual Speech Production Research

The target sounds examined in speech research are generally found in isolated words or words embedded in short phrases or sentences. Learners presumably know most L1 words that might be elicited, but this might not be equally true for L2 words. It is more appropriate to examine L2 words that are known to all participants in a study than words that are known by only some of the participants. This is because spelling-to-sound conversion routines may be invoked by participants who do not know certain words they have been asked to produce. If participants representing a wide range of L2 abilities are to be recruited, this may mean examining only common and/or frequently occurring L2 words.

Researchers have sometimes noted segmental production differences in speech samples elicited using different techniques or materials. This raises a question that is likely to be unanswerable, namely, which of several samples is most representative of how participants typically produce the target sounds of interest? Care must therefore be taken when selecting the speech materials to be elicited and when deciding which technique(s) to use for obtaining speech production samples.

L2 learners who use their L2 on a regular basis are, by definition, bilinguals. It is important to study how bilinguals alter speech production when switching between their two languages or when mixing their languages, and also to determine how and to what extent their production of speech sounds changes as a function of who they are speaking to and what the topic is. Questions like these, however, fall outside the scope of the SLM-r. Speech production research with bilinguals should focus on either the L2 or the L1, not both at the same time.
Eliciting both L1 and L2 production samples in a single session may augment both L2-on-L1 and L1-on-L2 effect sizes. When attempting to elicit samples that exemplify how L2 sounds are typically produced, it is important to ensure that participants are in an L2 “mode” with the L1 phonetic subsystem deactivated to the extent possible (see Grosjean, 1998, 2001, for discussion). The procedures used to achieve these goals may include using only L2 written and spoken materials, testing in a location where the L2 is usually or always used, having monolingual native speakers elicit speech production samples, and making no reference to participants’ L1 or their country of origin until all production data have been obtained (see Antoniou, Best, Tyler, & Kroos, 2010). The best approach for research examining both L1 and L2 phonetic performance is to administer similar protocols in separate sessions. If possible, steps should be taken to mask the fact that the two sessions are part of a single study. This may mean, for example, obtaining informed consent twice using translation-equivalent consent forms administered in the L1 and L2.

It is important to mask a focus on pronunciation. People tend to alter their speech when they know that their pronunciation is being scrutinized. One way to shift attention away from pronunciation is to invite participants to take part in research examining their “knowledge” of L2 words. How the sounds making up L2 words are pronounced is, of course, an aspect of word knowledge, but participants will likely assume that “word knowledge” refers primarily or exclusively to the semantic meaning of words.

Many L2 studies have focused on minimally paired words, but this should be avoided. Consider, for example, research examining how native Japanese (NJ) speakers produce English liquids in minimally paired words such as right and light (see Chapter 2). Students in Japan are drilled on such minimal pairs in their English classes, and so using such words will alert NJ speakers to the fact that they are participating in research focusing on their pronunciation of /r/ and /l/. This may evoke the use of articulation strategies that have been taught in school. It is just as easy to elicit near-minimal pairs beginning in /r/ and /l/, such as the words risk and list (or wrote and loaf). The initial liquids in words like these can be measured acoustically, and fine-grained listener judgments of how accurately the liquids were produced can be obtained after digitally removing the final consonants.
Variation in speaking rate must be controlled because such variation affects phonetic variables such as voice-onset time (VOT), and so may obscure or augment native versus nonnative differences. Consider, for example, a study by Birdsong (2003), which examined 21 native English (NE) participants who had arrived in France after the age of 18 years and lived there for an average of 11 years (range = 5–22 years). The NE participants and native French (NF) speakers read minimally paired French words that differed in the voicing feature of word-initial stops. The test words were randomly presented along with foils to deflect attention away from the stop voicing distinctions of interest.

The aim of Birdsong (2003) was to determine if NE speakers could shorten VOT sufficiently in French /p t k/ to resemble NF speakers. Of the 21 NE speakers tested, 14 produced French /t/ with mean VOT values that differed by less than 1.0 standard deviation (SD) from the mean VOT value evident for the NF speakers. When the data for French /p/, /t/, and /k/ were pooled, the 1-SD criterion was reached in 65 percent of instances. The author concluded that learning French after the end of a critical period for L2 speech learning did not prevent some NE speakers from producing French stops “authentically.”

The NE speakers’ production of French /p t k/ may actually have been more native-like than these results suggest. The NE participants tested by Birdsong (2003) spoke more slowly than the NF speakers did, producing following vowels that were 48 percent longer on average than the NF speakers’ vowels. The VOT values in English voiceless stops increase as speaking rates slow and the duration of the following vowel increases. Birdsong (2003) reported vowel duration values as well as VOT values for all 21 NE participants. I adjusted the French VOT values reported for individual NE participants using the VOT–vowel duration relationship reported for English monolinguals by Theodore, Miller, and DeSteno (2009). The adjustments shortened the French VOT values by an average of 8.7 ms. This left an average native versus nonnative VOT difference of just 2.1 ms, slightly less than the expected measurement error.

What is the best way to reduce or eliminate uncontrolled variation in speaking rate? In my opinion it is counterproductive to ask participants to “speak normally” or the like because this will augment their conscious attention to speech. The most effective way, as I see it, is to model the desired speaking rate, as will be exemplified later.

An important question to ask regarding the findings of Birdsong (2003) is why most but not all of the NE participants produced French /p t k/ with French-like short-lag VOT values.
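A rate adjustment of the kind just applied, subtracting the rate-predicted component of VOT before comparing talkers, can be sketched as follows. The linear slope relating VOT to following-vowel duration is purely illustrative; a real analysis would estimate it from rate-normalization data such as Theodore, Miller, and DeSteno (2009), and the function name is my own.

```python
def rate_adjusted_vot(vot_ms, vowel_ms, reference_vowel_ms, slope=0.05):
    """Normalize a VOT measurement to a reference speaking rate.

    slope: assumed change in VOT (ms) per ms of following-vowel duration.
    A talker whose vowels are longer than the reference (i.e., who speaks
    more slowly) has the rate-predicted VOT increment subtracted out.
    """
    return vot_ms - slope * (vowel_ms - reference_vowel_ms)

# A talker producing 200 ms vowels, normalized to a 150 ms reference:
print(rate_adjusted_vot(30.0, 200.0, 150.0))  # 27.5 ms
```

With per-talker vowel durations in hand, the same adjustment can be applied to each participant before computing native versus nonnative differences, as was done for the Birdsong (2003) data above.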
The author suggested that individual differences in motivation or prior phonetic training may have played a role, but two other possible explanations come to mind. First, the elicitation task might not have yielded L2 production samples that were typical of how the NE participants produced French /p t k/. Some of the participants were language teachers who had received phonetic training, and the study clearly focused on pronunciation. That being the case, some participants might have adopted a conscious articulation strategy for producing the minimally paired French words, for example, producing the short-lag French stops as if they were short-lag variants of English /b d ɡ/.

A second possible explanation for intersubject variability in the Birdsong (2003) study of late learners is suggested by the SLM-r (Chapter 1). The SLM-r posits that the L1 and L2 phonetic subsystems of bilinguals necessarily interact because they exist in a common phonetic space. Another SLM-r hypothesis is that when individual differences in the specification of L1 phonetic categories exist, these differences may affect the production and perception of L2 sounds. As will now be discussed, this may have been the case for some of the NE participants tested by Birdsong (2003).

As is well known, NE monolinguals differ in how they produce /b d ɡ/ in word-initial position. For example, of the 30 NE speakers tested by Dmitrieva, Llanos, Shultz, and Francis (2015), seven (23 percent) always produced English /b/ with short-lag VOT values, one always produced /b/ with lead VOT, and 22 produced both lead and short-lag VOT values. Flege (1982) examined laryngeal timing patterns in the production of English /b/. Two of the 10 NE speakers tested adducted the vocal folds when they released /b/ closure, and so only produced /b/ with short-lag VOT values. The other eight participants always adducted the vocal folds long before the release of /b/ closure, producing /b/ with varying amounts of lead VOT or with short-lag VOT values. The VOT variation in these eight participants was likely due to differences in vocal fold tensing and/or supraglottal cavity expansion, not to differences in laryngeal timing.

The intersubject variability seen in the Birdsong (2003) study may have derived from individual differences in how the NE participants specified their L1 categories for /b/, /d/, and /ɡ/. The NE speakers who did not produce French /p t k/ with French-like short-lag VOT may have been among the NE speakers who always adduct the vocal folds at the time of stop release and so always produce English /b d ɡ/ with short-lag VOT values. If so, their inaccurate production of French /p t k/ should not be regarded as a failure to learn this phonetic aspect of French.
Instead, it should be regarded as the operation of a universal phonetic constraint. A universal principle of phonetic system organization in monolinguals is that contrast must be maintained between the elements making up the L1 phonetic system. The SLM-r proposes that bilinguals strive to maintain contrast among the phonetic elements in the L1 and L2 phonetic subsystems, which interact with one another because they are found in a common phonetic space. That being the case, NE speakers who always time vocal fold adduction to coincide with stop release when producing English /b/, /d/, and /ɡ/, and so always produce these stops with short-lag VOT, will be prevented from using the same phonetic gestures when producing /p/, /t/, and /k/ in French. Evaluating this hypothesis will require an examination of both L1 (English) and L2 (French) production in future research.

3.1.2  Sentence Completion Test

This section illustrates a method that might be used to elicit L2 speech production. The Sentence Completion Test (SCT) would be appropriate, for example, for examining how native Italian (NI) speakers produce /t/ in the initial position of English words. The NI participants in a hypothetical study would be invited to take part in research examining their “knowledge of English words.” Prospective participants would be informed that their task will be to select “which of five English words” best completes various English sentences. Here are examples of four possible trials:

13. When your clothes fit tight you need to lose …
    sleep   rice   pens   weight   hope
19. When you get thirsty you want to …
    sleep   drink   sing   read   think
29. Each day at breakfast the man drinks …
    corn   fish   tea   spice   gas
43. If you have to wait forever you may get …
    hot   lost   tall   bald   bored

Each italicized sentence to be completed will appear on a computer screen along with five response alternatives. Only one word completes the sentence in a sensible way. When a sentence appears on the screen it will also be presented aurally via a loudspeaker. The aurally presented English sentences will have been produced at a constant moderate speaking rate by 10 NE monolinguals. Once the aural presentation of a sentence has been completed, the participants will select a response using a mouse. If an inappropriate response is selected, “Wrong” will appear on the computer screen, the five response alternatives will remain illuminated, and the sentence will again be presented aurally. When the correct word has been selected, “Right!” will appear on the screen, the incorrect response alternatives will be dimmed, and the participants will say aloud the word that best completes the sentence.
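The trial logic just described can be sketched in a few lines of Python. The audio playback and screen display are abstracted into a response callback, and the function name is my own; the sketch also flags items not answered correctly on the first presentation, since those would be dropped from the production analysis.

```python
def run_sct_trial(sentence, alternatives, correct, get_response):
    """One Sentence Completion Test trial: re-present the sentence until the
    participant selects the word that completes it, then return that word
    together with a flag indicating whether the first response was correct."""
    first_try = True
    while True:
        choice = get_response(sentence, alternatives)
        if choice == correct:
            return correct, first_try  # participant now says the word aloud
        first_try = False              # "Wrong": alternatives stay up, replay audio

# Simulated participant who errs once before choosing correctly:
responses = iter(["hope", "weight"])
word, keep = run_sct_trial(
    "When your clothes fit tight you need to lose ...",
    ["sleep", "rice", "pens", "weight", "hope"],
    "weight",
    lambda sentence, alternatives: next(responses),
)
print(word, keep)  # weight False
```

In a real implementation `get_response` would drive the mouse-based selection and the "Wrong"/"Right!" feedback; the control flow, however, is exactly the loop described above.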
The SCT ensures that all test words are produced in isolation at a single modeled speaking rate. It would be helpful, when using this method, to (1) examine each target L2 sound of interest in a separate test, (2) elicit the production of many sounds other than just the target sound, and (3) eliminate from analysis any target words that participants do not select on the first presentation.

3.1.3  Number Selection Task

This section presents another method for eliciting L2 production samples. A Number Selection Task (NST) such as the one illustrated here might be used to elicit numerous productions of word-initial /t/ and also provide detailed information regarding how often and with whom the L2 is used. The NST is illustrated using English as the L2 and Italian as the L1 of the participants. Questions in English that have been produced by 10 representative English monolinguals will be presented both aurally from sound files, and in writing on a computer screen. The native Italian (NI) participants will select one of five written response alternatives after hearing (and seeing) the English questions.

The NST illustrated here exploits the fact that four English numbers (two, ten, twelve, twenty) begin with /t/. Each trial of the NST will consist of a question regarding how frequently a NI participant uses English and Italian for specific kinds of verbal interaction or during particular events. The participants will give frequency estimates for four periods of time. The four successive blocks of the NST will focus on use in a typical day, week, month, and year. The five response alternatives offered for each question will be drawn from a fixed set (never, one, two, five, ten, twenty, fifty, one hundred, two hundred) and will vary according to the time interval. Here are examples of items that might be used to examine frequency of English use in a typical week:

9. How often do you use English when buying milk?
   (a) never (b) two times (c) five times (d) ten times (e) twenty times
18. How often do you listen in English when watching a film?
   (a) never (b) two times (c) five times (d) ten times (e) twenty times
22. How often do you use English with friends at church?
   (a) never (b) two times (c) five times (d) ten times (e) twenty times
39. How often do you use English with your neighbors?
   (a) never (b) two times (c) five times (d) ten times (e) twenty times



James Emil Flege

Here are examples of items that might be used to examine English use over a year:

11. How often do you use English while waiting for a doctor? (a) never (b) ten times (c) twenty times (d) one hundred times (e) two hundred times
18. How often do you use English when buying groceries? (a) never (b) ten times (c) twenty times (d) one hundred times (e) two hundred times
23. How often do you use English when visiting a good friend? (a) never (b) ten times (c) twenty times (d) one hundred times (e) two hundred times
38. How often do you use English when walking near your house? (a) never (b) ten times (c) twenty times (d) one hundred times (e) two hundred times

The written response alternatives will appear below the written version of the questions as they are presented aurally. If participants respond prematurely, "Please wait" will appear on the screen and the question will again be presented aurally. A further control for speaking rate is that all of the English questions will have been produced at a single, moderate speaking rate.

The specific conversational interactions and events to which the NST refers are meant to cover a wide swath of everyday life. Unfortunately, some uncertainty regarding coverage will likely remain. Assessing language use in the workplace is especially troublesome. For many but not all immigrants, the workplace is an important source of L2 input. But what response should a participant give if he or she is unemployed or retired when tested? What should a participant report if he or she has never worked outside the home? Perhaps a reader who decides to adopt this procedure will provide a solution to this vexing problem.

The NST described here will provide many productions of English /t/. If each of the four time intervals (day, week, month, year) includes 40 items, 160 productions of /t/ in the word times will be available for analysis, and nearly as many tokens of /t/ in the English numbers.
The use of exclusively English speech materials is meant to ensure that the NI participants are in an English mode with little or no activation of the Italian phonetic subsystem. Translation-equivalent Italian materials could be used to estimate Italian use. Unfortunately, an Italian version of the test could not be used to elicit production of /t/ in Italian because no Italian number begins with /t/.






If frequency of use responses were elicited in both Italian and English, it would be possible to estimate NI participants' overall percentage use of the L1 and L2. Each response would be coded with a value ranging from 0 (when "never," indicating no use, is the response) to 4 (maximum use), and the values would then be summed. The maximum possible score for each language would be 640 (maximum use of 4 × 4 time intervals × 40 items). If the scores obtained for Italian and English were 325 and 450, this would indicate 41.9 percent use of Italian [325/(325 + 450) * 100] and 58.1 percent use of English [450/(325 + 450) * 100].

3.1.4  Stress-Anxiety Test

This section illustrates a method that might be used to obtain word-initial productions of /d/ and /t/ in English and Spanish words from a large number of anonymous volunteers. The theoretical aim of the study will be to test a prediction generated by the SLM-r hypothesis that was mentioned earlier, namely, that variation in L1 phonetic categories influences L2 speech learning. Hartshorne, Tenenbaum, and Pinker (2018) demonstrated the possibility of obtaining data from large numbers of respondents if an activity presented on the internet via social media is sufficiently short and interesting.

The protocol outlined here could be presented as a psychological study that aims to develop a way to predict "stress and anxiety" levels based on "patterns of everyday activity." Potential respondents would be asked to respond vocally to questions in a quiet setting using a microphone with sufficient sensitivity. They would be informed that stress and anxiety levels may vary as a function of "age, gender, and cultural background," and so asked to indicate their age and gender, where they were born and currently live (city, state), what language(s) they learned as a child, what language(s) they learned "later in life," and what languages they use every day.
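The NST percentage-use computation described in the previous section can be sketched in a few lines of Python. This is a minimal illustration; the sums 325 and 450 are the hypothetical values given in the text.

```python
# Sketch of the NST percentage-use estimate. Each frequency response is coded
# 0 ("never") through 4 (maximum use) and the codes are summed per language;
# the maximum per language is 4 codes x 4 time intervals x 40 items = 640.
def percent_use(l1_sum, l2_sum):
    """Convert two summed frequency codes into percentage-use estimates."""
    total = l1_sum + l2_sum
    return 100 * l1_sum / total, 100 * l2_sum / total

italian, english = percent_use(325, 450)
print(round(italian, 1), round(english, 1))  # 41.9 58.1
```

The two estimates necessarily sum to 100 percent, which is why the text treats L1 and L2 use as complementary.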
Respondents will hear questions in English that have been produced by 10 NE speakers at a single, constant speaking rate. The questions will all concern the frequency of activities "in a typical 24-hour period" (to avoid modeling production of the word day). After hearing a question, respondents will click a response, then say it aloud. If a response is given before a question ends, "Please wait" will appear on the screen and the question will be repeated. If the recording volume is insufficient, or if there is too much background noise for later, off-line acoustic measurements, "Sorry, I can't hear you!" will appear and the question will be presented again. The first 5 of 50 items will have this form.




How often are you likely to brush your teeth? (a) never (b) one time a day (c) two times a day (d) five times a day (e) ten times a day
How often are you likely to speak on the phone? (a) never (b) one time a day (c) two times a day (d) five times a day (e) ten times a day

Voice recognition software will ensure that when response alternatives (b) to (e) are selected, productions of the words times and day are recognizable. To reduce the time needed to collect data, questions #6 to #50 will be abbreviated, as illustrated here:

Change your clothes? (a) never (b) one time (c) two times (d) five times (e) ten times
Check the weather? (a) never (b) one time (c) two times (d) five times (e) ten times

The other items will refer to daily activities that are potentially relevant to socio-economic status and psychological state. These may include items such as "entering a store, hearing a shout, climbing stairs, forgetting a name, asking for help, singing a song, laughing out loud, taking a shower, washing your hands, fixing your hair, petting a dog, calling your mom, counting your change, dropping something, holding a baby, drinking water."

Once data collection has been completed, the responses that have been recorded on the respondents' computers will be uploaded for VOT analysis of the /t/ in times and the /d/ in day. Data will be retained for analysis only for respondents who responded to all questions and whose IP address coincides with their stated current place of residence. VOT will be measured in the /t/ tokens, and the /d/ tokens will be examined to determine if closure voicing is present before stop release. Each respondent will be classified according to whether closure voicing was present in more than 15 of the 40 expected /d/ tokens, or was never present.

The data for three subgroups of respondents will be analyzed separately: (1) monolingual native speakers of English, (2) individuals who learned both English and Spanish as children, and (3) native English speakers who learned Spanish later in life. One test will evaluate the proportion of respondents in the three groups who produce /d/ without closure voicing, that is, who always produce short-lag stops. The expectation is that this proportion will be higher for the NE monolinguals (group 1) than for the NE late learners of Spanish (group 3), and higher for the late learners than for the Spanish-English simultaneous bilinguals (group 2). The same pattern of between-group differences is expected for the VOT in English /t/.

Testing the SLM-r hypothesis will require inviting the NE respondents who learned Spanish later in life to respond to a Spanish-language version of the instrument. They will be told that the purpose of the second session is to determine if stress-anxiety levels vary as a function of which of two languages, English or Spanish, is being used. Frequency of daily activities will be indicated as nunca (never) or X veces al día (X times a day). The SLM-r hypothesis is that individual differences in the specification of L1 phonetic categories may influence how certain L2 sounds will be produced. This hypothesis will be supported if NE speakers of Spanish who produce English /d/ only as a short-lag stop also produce Spanish /t/ with significantly longer VOT values than do NE speakers of Spanish who produce English /d/ with prevoicing.

3.1.5  English Comprehension Test

As suggested in the last section, use of the internet provides important new ways to collect data for L2 research. It will of course be important to develop adequate controls, for example, ensuring that anonymous volunteers whose L2 performance is being assessed can actually speak the L2, not just read it (Flege, 2019). The English Comprehension Test (ECT) outlined here provides a way to assess overall L2 competence while at the same time providing a well-controlled speech production sample.
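The respondent classification rule from the Stress-Anxiety Test in Section 3.1.4 can be sketched as follows. The rule itself (closure voicing in more than 15 of 40 expected /d/ tokens, versus never) comes from the text; the example token counts and the intermediate "mixed" label are assumptions added here for illustration.

```python
# Classify a respondent by how many of the 40 expected /d/ tokens show
# closure voicing before stop release. The text contrasts respondents with
# voicing in more than 15 tokens against those with voicing in none; the
# "mixed" label for intermediate cases is an assumption added here.
def classify_d(voiced_tokens, threshold=15):
    if voiced_tokens == 0:
        return "short-lag only"
    if voiced_tokens > threshold:
        return "prevoicing"
    return "mixed"

# Hypothetical respondents from the three subgroups described in the text.
print(classify_d(0))   # short-lag only (e.g., a NE monolingual)
print(classify_d(31))  # prevoicing (e.g., a simultaneous bilingual)
print(classify_d(9))   # mixed
```

The group-level test would then compare the proportion of "short-lag only" respondents across the three subgroups.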




Anonymous volunteer respondents would be invited to participate via the internet in a study examining their "knowledge of English words." The ECT could be administered to respondents from a variety of L1 backgrounds, but here we illustrate the test using Italian as the L1 and English as the L2. The ECT will consist of 50 items at each of three levels of difficulty. Each of the 150 items will consist of an aurally presented question followed by multiple written response alternatives, one of which is correct and is to be said aloud. The native Italian (NI) participants will select the correct response based on their general knowledge of the world. Three, four, and five response alternatives will be provided for the "Beginner," "Intermediate," and "Advanced" levels, respectively.

Here are some items that might be appropriate for the "Beginner" level:

 6. Which animal is largest? (a) a mouse (b) a dog (c) a cow
18. Which person is male? (a) a mom (b) a son (c) a sister
33. Which place has the most people? (a) a town (b) a house (c) a store
48. Which shape has four sides? (a) a triangle (b) a circle (c) a square

Here are examples from an "Intermediate" level:

56. Which of these can hold water? (a) a glass (b) a seat (c) a suit (d) a pen
63. What do you find in every house? (a) a rug (b) a phone (c) a cat (d) a door
72. Which of these things is round? (a) a rock (b) a tire (c) a square (d) a fence
81. What would you use to go from Paris to London? (a) a wall (b) a cup (c) a plane (d) a broom

Finally, here are examples from an "Advanced" level:

105. Who should you call when you smell a gas leak in the street? (a) a bishop (b) a baker (c) a fireman (d) a cashier (e) a shepherd
117. What do we call a person who plants wheat, corn, and alfalfa? (a) a farmer (b) a burglar (c) a baker (d) a broker (e) a liar
128. What would you use to sleep outdoors in the summer? (a) a book (b) a beard (c) a link (d) a phone (e) a tent






138. Who do you need when something really bothers you? (a) a pilot (b) a friend (c) a plumber (d) a diet (e) a cook

The italicized English questions, spoken at a constant moderate rate by 10 NE speakers, will be presented only aurally from sound files, not in writing. Once presentation of a sound file has been completed, the written response alternatives will appear on the computer screen and the participants will select a response. If the correct response has been selected, "Right" will appear on the screen and the incorrect response alternatives will be dimmed. The participants will then say the correct response alternative aloud and the next item will be presented. If the correct response has not been selected, "Try again" will appear on the screen and the item will be repeated. All correct responses will be recorded and then uploaded when the ECT has been completed.

A study focusing on the production of English /t/ might include 30 items such as #33, #72, and #128 for which the correct response is a phrase with /t/ (a town, a tire, a tent). The VOT in /t/ can be measured automatically from the uploaded samples, making it possible to obtain large production samples from many participants in a cost-effective way.

The 150-item ECT will generate scores indexing the NI respondents' aural comprehension of English. One question of interest for L2 speech research is whether comprehension scores like these will predict a significant amount of variance in the NI participants' production of VOT in English /t/ independently of variables such as their age of first exposure to English, years of English study, years of residence in a predominantly L2-speaking country, aptitude, motivation, hearing acuity, and so on. The aural comprehension scores are likely to covary with cumulative English input and so correlate with the VOT values produced in English /t/.
To evaluate the hypothesized relation between aural comprehension and L2 input, it would be valuable to administer the Number Selection Task described earlier following the ECT. In addition to providing a large sample of /t/ productions, the NST provides fine-grained estimates of English input.
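One simple way to probe whether ECT comprehension scores predict VOT independently of a covariate such as length of residence is a partial correlation: residualize both variables on the covariate, then correlate the residuals. The sketch below uses invented data; all scores, VOT values, and years of residence are hypothetical.

```python
# Illustration (invented data): does ECT comprehension predict VOT in English
# /t/ beyond years of residence? A simple partial correlation: residualize
# both variables on the covariate, then correlate the residuals.
def mean(xs):
    return sum(xs) / len(xs)

def residualize(y, x):
    """Residuals of y after an ordinary least-squares fit on one covariate x."""
    mx, my = mean(x), mean(y)
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / \
        sum((xi - mx) ** 2 for xi in x)
    a = my - b * mx
    return [yi - (a + b * xi) for xi, yi in zip(x, y)]

def pearson(x, y):
    rx, ry = mean(x), mean(y)
    num = sum((a - rx) * (b - ry) for a, b in zip(x, y))
    den = (sum((a - rx) ** 2 for a in x) * sum((b - ry) ** 2 for b in y)) ** 0.5
    return num / den

ect = [92, 110, 135, 88, 120, 140, 101, 128]  # hypothetical ECT scores (of 150)
vot = [45, 52, 68, 40, 60, 72, 50, 66]        # hypothetical mean VOT (ms)
lor = [3, 5, 12, 2, 8, 15, 4, 10]             # hypothetical years of residence
partial_r = pearson(residualize(vot, lor), residualize(ect, lor))
print(round(partial_r, 2))
```

In an actual study a multiple regression with the full set of covariates (age of first exposure, years of study, and so on) would replace this two-variable sketch.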

3.2  Estimating L2 Input

L2 speech learning is data driven because it depends on the quantity and quality of input that has been received. This section offers methods that might be used to provide better estimates of L2 input.



3.2.1  Cumulative Use Index (CUI)

Cumulative speech input refers to the input that learners of an L2 have obtained over a relatively long period of time. This input is used to specify the phonetic categories that are used to formulate word candidates in the process of word recognition and that define the goals of segmental production. Recent input, on the other hand, may influence how bilinguals adapt their perception to particular talkers or speaking situations, or how they deploy the phonetic categories in their L1 and L2 phonetic subsystems to achieve certain effects.

It is probably impossible to assess the input received from the time of first exposure to an L2 to the time of testing when examining highly experienced L2 learners. The suggestion made here is to examine cumulative input over the five years preceding the moment at which L2 learners are tested. This suggestion has two motivations. First, participants are more likely to accurately recall which languages they have been using, and how frequently, over the past five years than over longer intervals of time. Second, fewer major life changes are expected to have occurred in the preceding five years than over longer intervals; such changes would reduce the accuracy of the L2 use estimates yielded by the Cumulative Use Index (CUI) outlined here.

The CUI is designed for use with bilinguals. Monolinguals can only use their L1, but bilinguals can choose which of their two languages to use when speaking with other bilinguals who share their two languages. The CUI will be illustrated here for native Italian (NI) speakers of English, who must select one language, either Italian or English, as the base (primary) language of a conversation but are free to introduce material from their other language. NI speakers who respond to the CUI will be asked to indicate how often they use Italian and English as the base language when speaking to particular persons in particular social contexts. They will also estimate what percentage of time they use English when Italian has been selected as the base language, and vice versa.

The CUI will be administered to one participant at a time using special software. Both the experimenter and the participant will view the same computer screen as responses are collected. The experimenter will enter each participant's responses, item by item, as they are provided vocally by the participant. This ensures that the CUI resembles a one-on-one interview. The participants will provide four responses for each CUI item. Figure 3.1 illustrates a CUI item dealing with language use in a particular social






[Figure 3.1 appears here. The sample item asks: "How often do you typically use English and Italian when you... dine with friends in a local restaurant?" For each language, ten frequency alternatives are offered (never, 1-2x per year, 3-10x per year, 1-2x per month, 3-4x per month, 2-3x per week, 4-6x per week, 1-2x per day, 3-6x per day, even more), coded 0 to 9. Two further scales, ranging from 0 to 100 percent in steps of 5, ask: "When you speak mostly English, how often do you use Italian?" and "When you speak mostly Italian, how often do you use English?"]
Figure 3.1  Example of a test item from the Cumulative Use Index. The term “typically use” refers to the selection of English or Italian as the base language in particular social contexts or with particular persons. See text.

context, dining in a local restaurant with friends. The responses illustrated have been given by a hypothetical NI participant, a 52-year-old man who emigrated from Italy to Canada at the age of 21 years and has lived in Canada for 31 years.

As shown in Figure 3.1, the hypothetical participant is asked to indicate "how often" he uses English and Italian. Ten response alternatives indicating a gradually increasing frequency of use are offered for both languages. The alternatives will be coded with values that range from 0 (when "never" is the response) to 9 (which indicates use of a language more than six times per day). As discussed below, these values will be used to determine a weighting factor that is applied to the percentage use estimates.

Our hypothetical participant indicates that when dining in a local restaurant with friends he uses Italian as the base language 1 to 2 times per month and English as the base language somewhat less frequently, 3 to 10 times per year. These differing estimates trigger different weights, 3 versus 2, which will be applied to the percentage use estimates. The hypothetical NI participant indicates that when he uses Italian as the base language, he switches into English for 10 percent of a conversation. When English is the base language, on the other hand, he never switches into Italian (0 percent Italian use).




Responses like these are plausible in the context of how bilinguals use their two languages. They suggest that our hypothetical participant dines out more often with Italian-speaking than English-speaking friends. It is not unusual for bilinguals, especially late learners, to prefer using their L1 when they want to relax and enjoy themselves. When the participant dines with Italian-speaking friends and uses Italian as the base language, the conversation may briefly shift to English, for example, to discuss a work-related topic (assuming that English is the language used at work). However, when the participant dines with English-speaking friends, he never uses Italian, perhaps because these friends cannot speak Italian.

Figure 3.2 illustrates 40 items that might be included in an eventual 50-item instrument. The final choice of items will depend on factors such as the chronological ages of the participants at the time of testing, their ages of arrival in the host country, their national origin, the size of the L1-speaking community in which they live, and various other cultural and economic aspects of daily life. That being the case, a CUI must be pilot tested to ensure that most or all aspects of participants' daily lives are covered, and that few if any items will generate "never use" responses when applied to both the L1 and L2.

Weighted values will be calculated for each of the CUI items. These values are obtained by multiplying the two percentage estimates by the two weighting factors, which range in value from 0 to 9. For example, the weighted values for the dining in a local restaurant item (#28) are 3.0 for Italian (3 * 100/100) and 1.8 for English (2 * 90/100). The 40 weighted values obtained for both languages will then be summed, as illustrated in Figure 3.2. The summed values for our hypothetical participant's use of Italian and English are 81.4 and 94.5, respectively. Percentage Italian use is then calculated as 81.4/(81.4 + 94.5) * 100 = 46.3 percent, and English use as 94.5/(81.4 + 94.5) * 100 = 53.7 percent.

The CUI will yield valid estimates of percentage L1 and L2 use if the instrument provides complete coverage of individual participants' normal activities in the preceding five years. Frequency of L2 use often but not always increases as length of residence in a predominantly L2-speaking environment increases. That being the case, there is no guarantee that estimates obtained for the preceding five years will accurately reflect earlier language use patterns. Nor is it certain that phonetic category specification will depend primarily on the five most recent years of input.
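The CUI arithmetic just described can be sketched as follows; this is a minimal Python illustration using the hypothetical values from the text (item #28 and the 40-item weighted totals).

```python
# Weighted value for one CUI item: frequency weight (0-9) multiplied by the
# estimated percentage use of that language, divided by 100.
def weighted_value(freq_weight, pct_use):
    return freq_weight * pct_use / 100

print(weighted_value(3, 100))  # Italian, item #28: 3.0
print(weighted_value(2, 90))   # English, item #28: 1.8

# Summed weighted values over all 40 items (from the text), normalized to
# overall percentage use of each language.
it_total, en_total = 81.4, 94.5
pct_italian = 100 * it_total / (it_total + en_total)
pct_english = 100 * en_total / (it_total + en_total)
print(round(pct_italian, 1), round(pct_english, 1))  # 46.3 53.7
```

The normalization step is identical to the one used for the NST, so the two instruments yield directly comparable percentage-use estimates.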



[Figure 3.2 appears here. It lists 40 sample CUI items in two groups. "Speaking with persons" (items 1-13): mother, father, sister(s), brother(s), son(s), daughter(s), cousin(s), grandchild/grandchildren, best friend(s), pharmacist/optometrist, waiters/waitresses, on phone/via Skype, neighbor(s). "Social contexts" (items 14-40): primary job, secondary job, attending school, other educational activities, church services/activities, waiting for doctor/dentist, attending sports events, birthdays of family members, birthdays of friends, favorite hobby/pastime, additional hobby/pastime, traveling in Canada, weddings/funerals, banking/insurance, dining with friends in a local restaurant, barber/hairdresser, volunteer activity, cafe/bar, social/fraternal organization(s), hardware/home improvement store, asking for/giving directions, laundry/drycleaner, buying bread/pastries, buying groceries, buying clothes, buying shoes, buying meat/fish. For each item the figure shows the hypothetical participant's frequency codes (F-IT, F-EN), percentage use estimates (%IT, %EN), and weighted values (W-IT, W-EN). The weighted totals are 81.4 for Italian and 94.5 for English, corresponding to 46.3 percent and 53.7 percent use.]
Figure 3.2  Sample items from a Cumulative Use Index. F-IT and F-EN indicate frequency of use of Italian and English. %IT and %EN indicate estimated percentage use of Italian and English when the respective language is being used as the base language. W-IT and W-EN are weighted Italian and English use values. See text.




The CUI provides quantitative estimates of language use through a highly structured interview format. It is important to note, however, that experimenter skill may influence the results obtained. Ideally, a single experimenter will administer the instrument to all participants in a study and will begin each interview by asking general orienting questions. These questions might regard family structure (Are both of your parents still living? Do you have brothers and sisters? Children?), work history (Are you currently employed? How long have you had the same job?), and hobbies and pastimes.

Orienting questions help put participants at their ease and may help the experimenter adapt questions appropriately. For example, a participant might tell the experimenter that he has three pastimes: ballroom dancing, playing cards with a group of close friends, and stamp collecting. A skilled experimenter would ask if all three pastimes involve talking to people. If this holds true for ballroom dancing and playing cards, the two items dedicated to pastimes would refer to these and not to stamp collecting.

3.3  Measuring L2 Input

According to the SLM-r, the phonetic categories used to categorize the sounds making up words, as well as the specification of how those sounds are to be produced, are defined by the distributions of tokens to which learners have been exposed. Unfortunately, this fundamental SLM-r hypothesis cannot be evaluated unless and until input distributions can be measured. This section outlines a method for doing so.

Section 3.2 provided suggestions on how to obtain better estimates of percentage L1 and L2 use. If implemented, these procedures will likely provide more accurate estimates than the methods now routinely used in L2 research. However, even the methods proposed earlier are limited in two ways. First, they provide estimates, not measurements. Second, they provide no information regarding the quality of L2 input, which requires determining how much of the L2 input each learner has received was provided by native as opposed to nonnative speakers.

3.3.1  Background

As discussed by Flege and Bohn (Chapter 1), native Spanish (NS) speakers living in the United States show substantial intersubject variability when producing VOT in /t/-initial English words. Some NS late learners have been observed to produce English stops with English-like VOT values, some with Spanish-like VOT values, and others with VOT values falling somewhere in between the average values produced by groups of Spanish and English monolinguals.

Some intersubject variability of this kind may derive from variation in how much L2 input NS speakers have received from native speakers of English as opposed to fellow native speakers of Spanish. It is at least possible, for example, that only NS speakers who have been exposed to large amounts of input from NE monolinguals, and very little English input from other NS speakers, will form a new phonetic category for English /t/ and so produce English /t/ with VOT values that closely resemble the values typical for NE monolinguals. NS speakers who have obtained an equal quantity of English input from NE and NS speakers might be prevented from forming new categories or require more input to do so. The new phonetic category that a NS learner of English forms for English /t/ would be expected to reflect the input distribution to which he or she has been exposed. This leads to the SLM-r prediction (Chapter 1) that NS learners of English who have been exposed to a mix of native-speaker and foreign-accented input will produce English /t/ with shorter VOT values than most NE monolinguals.

Flege and Eefting (1988) examined VOT values in productions of English /t/ by NS adults who had begun learning English in childhood ("early" learners). These authors developed a technique for determining whether NS learners of English do or do not form a new phonetic category for English /t/. The members of a VOT continuum ranging from /d/ to /t/ were randomly presented to the early learners and to monolingual native speakers of Spanish and English. These participants were asked to mimic the stimuli as accurately as possible. In a separate experiment the same participants were also asked to label the members of the same VOT continuum as /d/ or /t/.
The participants tested by Flege and Eefting (1988) did not accurately track the VOT values in the stimuli. They showed nonlinearities reflecting how they labeled the stimuli in a two-alternative forced-choice identification task. The nonlinearities in the functions relating stimulus VOT values and imitation responses that were obtained from the NS monolinguals occurred near their Spanish "phoneme boundary," that is, near the upper limit of the short-lag VOT values typical of Spanish /t/. The nonlinearities evident for the NE monolinguals, on the other hand, occurred at longer VOT values and corresponded to their English phoneme boundary between short-lag and long-lag stops.

Both groups of monolinguals produced bimodal distributions of VOT values when imitating the VOT stimuli. The two groups of early learners, on the other hand, showed trimodal distributions. One of the nonlinearities seen for many early learners occurred at the boundary between lead and short-lag stops, the other at the boundary between short-lag and long-lag stops. Flege and Eefting (1988) interpreted this pattern of results to mean that the two groups of monolinguals had two phonemic categories linked to two phonetic categories, whereas members of the early learner groups had two phonemic categories for /t/ (one for Spanish, one for English) and three phonetic categories for the stops found in their two languages.

In addition to the identification and vocal imitation experiments just mentioned, Flege and Eefting (1988) also measured the production of English /t/ by the two groups of early learners. When taken together, the results suggested that the early learners had established new phonetic categories for English /t/. However, their phonetic categories differed from those established by most NE speakers living in the United States in that their new English phonetic categories specified shorter VOT values, a consequence of the fact that they were tested in Puerto Rico, where Spanish-accented English is the norm rather than the exception.

Flege (2009) proposed using the Experience Sampling Method (ESM; see, e.g., Larson & Csikszentmihalyi, 1983; Bos, Schoevers, & Rot, 2015; Heron et al., 2017) to obtain measurements of L2 input. The ESM, also known as Ecological Momentary Assessment, is predicated on the assumption that participants in behavioral research can respond more accurately to simple questions about the here and now (e.g., What language are you using?) than to broader questions (e.g., What percentage of the time do you use English?). Flege and Wayland (2019) proposed using smartphones to collect ESM data. This section outlines a hypothetical three-month longitudinal study that further develops this suggestion.
The study to be outlined could be used to measure the quantity and quality of input that NS learners receive when learning to produce and perceive word-initial English /t/ tokens. The length of the test interval, three months, derives from the finding by Sancier and Fowler (1997) that VOT production values can shift as a function of input obtained in the preceding two months.

3.3.2  Design

The participants in the hypothetical study will be NS speakers who have lived in the United States for at least one year and can speak English well enough to respond to questions and give informed consent in English.






Prospective participants will be told that the study examines how language use affects “knowledge of English words.” To be included, a potential participant must have a personal smartphone and permit a special application to be installed on it. Inclusion will also require: 1. willingness to respond to several short questions, and to record short number strings when notifications arrive on the smartphones at two randomly selected times per waking day for 180 consecutive days; and 2. willingness to indicate the native language of other persons who may be present when a notification arrives, and to ask those persons to record similar number strings. The final criterion for inclusion is an affirmation by each prospective participant that his/her language use patterns have not changed importantly in the recent past and are not likely to change importantly during the three-month study interval. Perspective participants will be assured that if a notification arrives at an inopportune time, for example, while the participant is driving, they can vocally respond “not now” to postpone the notification for a short, programmatically determined interval. In addition to responding to 360 notifications over 180 days, each participant will also take part in two sessions to be held on a university campus, one before and one after the 3-month study interval. In both on-campus sessions, the participants will take a test designed to estimate the size of their English vocabularies and respond to a version of the Cumulative Use Index described earlier. The participants will use their smartphones to respond to 15 notifications that arrive on their smartphone during the course of the 1.5-hour long on-campus sessions. The second and last on-campus session will conclude with the vocal mimicry test used by Flege and Eefting (1988), followed by debriefing. 
3.3.3 Procedures

Each of the 360 accepted notifications will begin with the visual presentation on the participants’ smartphones of number strings such as 5–1–10–6–8–3–7–2–4–9. Each number string will be unique, but all will contain the numbers two and ten (order counterbalanced) in the third and eighth positions. The number strings will be referred to as “test codes” that are used for administrative purposes. Only when participants are debriefed at the very end of the second on-campus session will they be informed that



James Emil Flege

the number strings were recorded for acoustic analyses of the /t/ tokens found in the numbers two and ten.

After recording the number string, participants will indicate if someone else is present. If the answer is “no,” the notification ends. If other persons are present, participants will select one of the following responses to indicate what language(s) were being used when the notification arrived: (1) only English, (2) only Spanish, (3) English more than Spanish, (4) Spanish more than English, or (5) the two languages were being used about equally. Participants will then indicate the number of other persons present with whom they have been vocally interacting. If three persons are present, for example, three new number strings will appear on the participant’s smartphone. The participant will indicate the L1 of the three persons (English, Spanish, or “other”) and ask each of these persons, in succession, to record the number strings as he or she has just done. If an interlocutor does not wish to participate anonymously by recording a number string, the participant will click “not willing” and move on to the next interlocutor. Once number strings have been recorded for all willing interlocutors, the responses given by the participant and the recordings will be uploaded for later off-line analyses and the notification will be terminated.

To maximize the number of interlocutors who record the number strings, the participants enrolled in the study will be paid a small sum for each interlocutor recording in addition to a set fee for completing the three-month study. All participants will, of course, give informed consent when enrolled in the study, but informed consent will not be obtained from the anonymous interlocutors because it will not be possible to identify them as individuals based on their production of the English number strings.

3.3.4 Dependent Variables

The following outcome measures will be derived for each participant:

1. Percentage use of English and Spanish. These values will be derived by determining
   a. How often only English was being used when another person(s) was present at the time of a notification (English = 1.0).
   b. How often only Spanish was being used (Spanish = 1.0).
   c. How often English was being used more than Spanish (English = 0.7, Spanish = 0.3).






   d. How often the reverse held true (English = 0.3, Spanish = 0.7).
   e. How often the two languages were being used equally (English = 0.5, Spanish = 0.5).
2. The percentage of interlocutors whose native language is English, Spanish, or “other.” The reports of L1 background will be verified by automatic classification of the language (as English, Spanish, or neither) based on the interlocutors’ pronunciation of the number strings.
3. The mean VOT values produced by each participant in the words two and ten during the three-month study interval.
4. The mean VOT in the /t/ tokens that each participant has recorded on his/her personal smartphone during the two on-campus sessions.
5. The mean VOT values produced by each participant during the three-month study interval when:
   a. No other person was present
   b. Only English was being used with NE speakers
   c. Only English was being used with both NE and NS speakers present
   d. Only Spanish was being used with only NS speakers present
   e. Only Spanish was being used with both NE and NS speakers present
   f. Both languages were being used with NS speakers present
   g. Both languages were being used with NE speakers present
6. Statistical properties of the VOT distributions in interlocutors’ productions of the words two and ten. Analyses will be undertaken to determine whether the distribution of /t/ VOT values to which each participant has been exposed over the three-month study interval is unimodal or bimodal, what the modal value of the one (or two) distributions is, and how frequently values fall within a 90 percent confidence interval of the one (or two) modal values.

3.3.5 Analyses

The data just described will be analyzed to answer a number of theoretically important research questions such as the following:

1. Will participants show significant changes in the production of VOT in /t/ during the course of the three-month study? In the second compared to the first on-campus session?




2. Will the VOT values produced by individual participants during the three-month study reflect what language(s) is (are) being used at the time of notifications? In particular, will participants produce English /t/ with shorter VOT values when they have just been speaking Spanish with fellow native speakers of Spanish than when they have just been speaking English with NE speakers?
3. Do both quantity of English input and quality of input (i.e., the proportion of Spanish-accented English to which each participant has been exposed) influence mean VOT production? If so, which variable is more important? Will quantity and/or quality of the input to which participants have been exposed during the three-month study interval account for more variance in their production of VOT than other variables that are commonly examined in L2 research, such as age of first exposure to English, years of residence in the United States, percentage L1 and L2 use, and so on?

The most important aim of the research, however, is to determine what kinds of input distributions promote the formation of a new phonetic category for an L2 sound. It is plausible that the NS participants will need to be exposed to a bimodal distribution of VOT values, one mode more English-like than the other, in order to form a new category for English /t/. In the second on-campus session the NS participants will be asked to mimic as accurately as possible the members of a VOT continuum that spans the range from lead to long-lag VOT. As mentioned earlier, Flege and Eefting (1988) found that many of the NS early learners they tested showed two nonlinearities between stimulus VOT values and imitations of the VOT stimuli, suggesting that they had formed new phonetic categories for English /t/.
An important theoretical question that might be addressed by the hypothetical study illustrated here is whether NS participants who show a nonlinearity in imitating short-lag and long-lag stops have been exposed to a bimodal distribution of VOT values in English /t/. A related question is whether the VOT values at which the nonlinearities occur in the VOT imitation experiment coincide with the VOT values demarcating the bimodal distribution of values to which participants have been exposed.

The hypothetical study outlined here, if carried out, will also make it possible to determine the effect on L2 speech learning of exposure to a broadly tuned, unimodal distribution of VOT values for English /t/ that spans the range from the short-lag VOT values typical for Spanish /t/ to






the very long-lag VOT values produced by some NE speakers. Will input like this block the formation of a new phonetic category for English /t/, leading instead to the formation of a composite Spanish-English phonetic category? If so, the NS participants who are exposed to such an input distribution are expected to produce Spanish /t/ with VOT values that exceed the short-lag VOT values produced by NS monolinguals. Another question of theoretical importance concerns the relative frequency of tokens over the VOT range. Will NS participants exposed to a unimodal distribution of VOT values produce /t/ with the modal VOT value of the distribution to which they have been exposed?
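One way the unimodal-versus-bimodal question might be operationalized is sketched below, using only the Python standard library: a participant's input VOT values are split into two clusters, and Ashman's D, a separation index for which values above roughly 2 indicate clear bimodality, is computed from the cluster means and spreads. The function name, the threshold, and the toy VOT values are illustrative assumptions, not the chapter's procedure.

```python
import statistics

def split_modes(vots, iters=20):
    """1-D two-means split of VOT values (ms); returns the two group
    means and Ashman's D (> ~2 suggests clear bimodality). A stdlib
    stand-in for proper mixture modeling; assumes both groups non-empty.
    """
    c1, c2 = min(vots), max(vots)
    for _ in range(iters):
        g1 = [v for v in vots if abs(v - c1) <= abs(v - c2)]
        g2 = [v for v in vots if abs(v - c1) > abs(v - c2)]
        c1, c2 = statistics.fmean(g1), statistics.fmean(g2)
    s1 = statistics.pstdev(g1) or 1e-9
    s2 = statistics.pstdev(g2) or 1e-9
    d = abs(c1 - c2) / ((s1 ** 2 + s2 ** 2) / 2) ** 0.5
    return c1, c2, d

# Hypothetical input: Spanish-accented short-lag /t/ tokens near 20 ms
# mixed with native English long-lag /t/ tokens near 80 ms.
vots = [18, 22, 25, 19, 21, 75, 82, 88, 79, 85]
m1, m2, d = split_modes(vots)
assert d > 2  # separation index flags a bimodal input distribution
```

With real data one would instead fit one- and two-component mixture models and compare their fit, but the cluster means returned here correspond to the "modal values" of dependent variable 6.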

3.4  Assessing L2 Speech Perception

The primary aim of L2 perception research is to determine if learners can consistently and accurately categorize L2 sounds. In early stages of learning this is rarely if ever possible for L2 sounds without a close counterpart in the L1. For example, native speakers of Spanish cannot distinguish the English vowel /æ/ from either the /ε/ or the /ɑ/ of English when they are first exposed to English (Flege & Wayland, 2019). Once L2 learners are able to categorize the realizations of an L2 category consistently and accurately, it then becomes necessary to determine what properties of the L2 sounds they use in the categorization process and what the relative importance (weighting) of those properties is.

Many L2 perception studies have made use of two-alternative forced-choice identification tests. Unfortunately, this methodology may provide little insight into how L2 sounds are categorized because L2 speech sounds can be correctly identified on the basis of a perceived difference between the realizations of an L2 sound and the realizations of an L1 category rather than on the basis of the phonetic properties defining the L2 sound.

Consider the results obtained by Bohn and Flege (1993). These authors presented naturally produced CV syllables to native English (NE) monolinguals and to native Spanish (NS) monolinguals who had recently arrived in the United States. Four sets of nine CV syllables were used as stimuli: Spanish /d/ tokens that were all produced with lead VOT (mean = −94 ms), Spanish /t/ tokens produced with short-lag VOT values (mean = 21 ms), English short-lag /d/s (mean VOT = 17 ms), and long-lag English /t/s (mean VOT = 84 ms). The NS and NE participants were asked to identify the stimuli as /d/ or /t/. As expected, the members of both groups consistently identified




the Spanish lead VOT stimuli as /d/. Somewhat surprisingly, the NS participants closely resembled the NE participants in consistently labeling the English long-lag VOT stimuli as /t/ even though these stimuli had substantially longer VOT values than the Spanish short-lag /t/ stimuli did (means = 84 vs. 21 ms).

The NS monolinguals demonstrated perceptual sensitivity to phonetic differences between English and Spanish stops. They labeled the short-lag VOT English /d/ stimuli as /t/ just 35 percent of the time even though these stimuli had somewhat shorter VOT values (mean = 17 ms) than the Spanish short-lag /t/ stimuli did (mean = 21 ms) and so, on the basis of VOT, should all have been labeled /t/. Given their perceptual sensitivity to cross-language phonetic differences, one might reasonably ask why the NS participants consistently labeled the long-lag English VOT stimuli as /t/.

The results obtained by Flege, Schmidt, and Wharton (1996) help answer this question. These authors asked NE monolinguals and NS “near-monolinguals” to use one of three labels to identify the members of two VOT continua ranging from /bi/ to /pi/. The NS participants had been living in the United States for an average of seven months but, when tested, could not yet carry on even a simple conversation in English. One of the two continua to which the participants responded simulated a relatively slow speaking rate, the other a faster speaking rate. One of the labels offered to the NE and NS participants was “b or Spanish p.” The NS participants were expected to use this label if they heard either of two native-language sounds: Spanish /b/, which is produced with lead VOT, or Spanish /p/, which is produced with short-lag VOT. The NE participants were expected to use the “b or Spanish p” label in response to stimuli having both lead and short-lag VOT values because English /b/ can be realized in either way.
NE speakers were expected to use the “English p” label to identify realizations of their English long-lag VOT /p/ category, and the “exaggerated or breathy” label when hearing stimuli having VOT values that exceeded the range of VOT values typically used in English at a particular speaking rate. One aim of the Flege et al. (1996) experiment was to determine if the NE monolinguals would apply the “English p” label to stimuli having VOT values that are typical for English long-lag /p/ stimuli, and the “exaggerated or breathy” label for stimuli having VOT values that are longer than those typically used in the production of English /p/.

As shown in Figure 3.3, the NE participants identified a well-defined midrange of the VOT stimuli as “English p.” The range of stimuli to which this label was applied shifted predictably as a function of speaking rate



[Figure 3.3: four panels (English monolinguals vs. Spanish near-monolinguals; slow vs. fast rate) plot the number of “b or Spanish p,” “English p,” and “exaggerated p” responses against stimulus VOT (0–300 ms).]

Figure 3.3  The classification of the members of a VOT continuum ranging from /bi/ to /pi/ by native English (NE) monolinguals and native Spanish (NS) “near-monolinguals” using one of three response labels (see text).

(simulated by varying the duration of the following vowel). The NS participants, on the other hand, used the “English p” label far less often than the NE participants did, and the portion of the VOT continuum they designated using this label was less sharply differentiated from the other two phonetic categories. This is what one would expect for NS speakers who have not yet formed a new phonetic category for the long-lag /p/ of English.

These results shed light on why the NS monolinguals tested by Bohn and Flege (1993) were able to consistently label long-lag English VOT stimuli as /t/. Figure 3.3 shows that NS speakers who had very little prior exposure to English could distinguish stops having long-lag VOT values typical for English, and other stops having even longer VOT values, from the Spanish short-lag VOT stimuli. This suggests that the NS speakers tested by Bohn and Flege (1993) applied an “X-not-X” decision strategy when labeling the members of a VOT continuum in a two-alternative forced-choice identification test. More specifically, they may have labeled the English long-lag VOT stimuli as /t/ not because the stimuli had long-lag VOT values but because they could not be Spanish stops.

L2 speech perception must be assessed using a technique that prevents participants from using an X-not-X labeling strategy. MacKay, Meador,




and Flege (2001) developed a method that accomplishes this aim. It examined phonetic-level categorization of English consonants rather than their identification. The authors tested three groups. One consisted of 18 NE speakers, and the other two consisted of 18 age-matched native Italian (NI) speakers each. Members of the two NI groups had arrived in Canada from Italy at an average age of 7 years and had lived there for an average of 40 years. The two NI groups differed in self-reported percentage use of Italian, averaging 8 percent for the “early-low” group and 32 percent for the “early-high” group.

The study examined the participants’ categorization of 18 word-initial and word-final English consonants. The consonants were produced in /Cɑdo/ and /hodɑC/ nonwords that were spoken by an adult male native speaker of English. Nonwords were examined to minimize lexical bias effects and possible differences in the use of contextual information by the NI participants. Some of the English consonants had close phonetic counterparts in Italian whereas others did not. For example, the English /p t k/ stimuli were produced with longer VOT values in word-initial position than is typical for Italian /p t k/. In word-final position the English /p t k/ stimuli were released, which is unusual for English but typical for word-final stops in Italian. The English word-initial /b d ɡ/ stimuli resembled productions of Italian /b d ɡ/ in that all were produced with lead VOT. The word-final English /b d ɡ/ stimuli, on the other hand, differed from their Italian counterparts in that they were partially devoiced.

The word-initial and word-final English consonant stimuli were presented in four successive blocks in which signal-to-noise (S/N) levels varied in a fixed order (12, 6, 0, and −6 dB). The NE and NI participants categorized the English consonants by selecting one of five written response alternatives. One was the correct response and the other four were incorrect (foils).
For example, the labels offered for the /s/ in /sɑdo/ were “S” (the correct response), “TH,” “SH,” “F,” and “T” (the foils). The labels offered for the /v/ in /vɑdo/ were “V” (correct), “TH,” “B,” “Z,” and “F” (foils). English spelling conventions were used for both the correct responses and the foils. The foils differed from the target consonants in place and/or manner of articulation. The four foils selected as response alternatives for each target sound were the consonants most likely to be confused with the target sound in previous perceptual confusion studies.

The crucial design feature of the methodology developed by MacKay et al. (2001) was providing five response alternatives for each English consonant. Doing so required participants to process the acoustic






phonetic information available in each target consonant. The testing format prevented them from selecting the correct response alternative by excluding it as a possible realization of an L1 consonant category.

Members of the early-high group were found to make significantly more errors than the NE participants when categorizing both word-initial and word-final consonants. Members of the early-low group, on the other hand, did not differ significantly from the NE speakers when categorizing either word-initial or word-final English consonants. MacKay et al. (2001) interpreted these findings to mean that L2 input exerts an important influence on how even early learners perceive L2 speech sounds.

This interpretation was supported by the results obtained by Meador, Flege, and MacKay (2000). This study examined the recognition of English words in sentences by members of the same three groups (NE, early-low, early-high). As shown in Figure 3.4, the members of all three groups showed much the same effect of the variation in signal-to-noise levels.

[Figure 3.4: mean number of words recognized (max = 50) is plotted against S/N ratio (−6, 0, and +6 dB) for the NE, early-low, and early-high groups.]

Figure 3.4  Mean number of English words that were correctly recognized by native English (NE) speakers and two groups of native Italian (NI) speakers who arrived in Canada at the mean age of seven years but differed in how frequently they used Italian. The English stimuli were presented at three different signal-to-noise (S/N) levels. The error bars bracket ±1 SEM.

Members of the early-high group recognized significantly fewer English words at all three S/N levels than members of the early-low group who, in turn, recognized significantly fewer words at all three S/N levels than the NE participants did (Bonferroni-corrected p < 0.05). When the data for the 36 NI early learners obtained by Meador et al. (2000) were pooled with data obtained from 36 additional NI participants who had arrived in Canada after the age of 12 years, the word-recognition scores showed moderate correlations with the categorization data reported by MacKay et al. (2001). This held true both for word-initial English consonants, r(70) = 0.59, p < 0.001, and for word-final English consonants, r(70) = 0.49, p < 0.001. This finding indicated that how the NI participants categorized English consonants was predictive of how well they managed to recognize English words.
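The five-alternative categorization format can be summarized in a small scoring sketch. The response alternatives for /s/ and /v/ below are the ones listed earlier from MacKay et al. (2001); the trial data and the `percent_correct` helper are hypothetical.

```python
# Target consonant -> (correct label, foils), following the
# MacKay et al. (2001) format described in the text.
ALTERNATIVES = {
    "s": ("S", ["TH", "SH", "F", "T"]),
    "v": ("V", ["TH", "B", "Z", "F"]),
}

def percent_correct(trials):
    """Score (target, chosen_label) pairs. With five alternatives chance
    is 20 percent, and the correct label cannot be reached simply by
    excluding responses that sound non-native (the X-not-X strategy)."""
    ok = sum(1 for tgt, chosen in trials if chosen == ALTERNATIVES[tgt][0])
    return 100 * ok / len(trials)

trials = [("s", "S"), ("s", "SH"), ("v", "V"), ("v", "V")]
assert percent_correct(trials) == 75.0
```

In the real test the foils were drawn from prior perceptual confusion data, so errors remain phonetically informative rather than random.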

3.5  A Category Formation Test

No standard method now exists for determining when and if learners form phonetic categories for L2 sounds. A new approach for doing so with L2 vowels is presented here. Once again, the new method will be illustrated with reference to the learning of English as an L2 by native Italian (NI) speakers.

3.5.1 Background

The point of departure for a hypothetical study that aims to determine if NI participants form a new category for an English vowel is the research reported by Flege and MacKay (2004). These authors tested NI university students who had come to Canada during the summer in order to study English at a university located in Ottawa. The NI students identified 11 English vowels in terms of the 7 vowels of standard Italian. They categorized naturally produced tokens of English /ɪ/ (in bit) as being Italian /i/ (65 percent of judgments) or as Italian /e/ (35 percent). The NI students also used a scale ranging from 1 (very different) to 5 (very similar) to rate the goodness-of-fit of each English vowel token they had just heard to the Italian vowel they had just used to categorize it. The NI students who categorized the English [ɪ] tokens as Italian /e/ rated the stimuli as better exemplars of Italian /e/ (mean = 4.0) than did the students who categorized the English [ɪ] tokens as Italian /i/ (mean = 2.9). Given that some varieties of Italian have just five vowels (e.g., only /ε/, not both /ε/ and /e/), and given that the NI students came from






several Italian regions, this finding may have derived from differences in the students’ L1 vowel categories.

Flege and MacKay (2004) also examined the NI students’ categorial discrimination of English [ɪ] and [i] tokens using a test having a chance level of A′ = 0.50. The NI students obtained significantly lower scores than age-matched NE speakers did, but their scores were nevertheless above chance.

The primary aim of Flege and MacKay (2004), however, was to examine the perception of English vowels by NI adults who had lived in Canada for decades. These participants were assigned to four groups differing orthogonally according to age of arrival in Canada (“early” vs. “late”) and frequency of Italian use (“high” vs. “low”). The members of all four groups obtained significantly higher /ɪ/-/i/ discrimination scores than the NI students did. Only one of the four NI groups, the group consisting of participants who had arrived in Canada relatively late in life and used Italian relatively often (and so English relatively seldom), obtained significantly lower discrimination scores than age-matched NE speakers did. Although members of the “late-high” group had lived for decades in Canada, their lower discrimination scores might nonetheless be attributed to inadequate L2 input rather than to an inability to form new phonetic categories. These NI participants had received substantially fewer years of full-time equivalent (FTE) English input, 13.6 years, than had members of the other three groups (range = 27.9 to 39.1 years).

The results of an error detection experiment by Flege and MacKay (2004) underscored the importance of the quality of input that L2 learners receive. The stimuli used in this experiment were short phrases edited out of spontaneous English speech samples produced by NI speakers living in Canada. The stimuli included English words in which the target vowel /ɪ/ had been produced correctly (as [ɪ]) or incorrectly (with an [i]-quality vowel).
The NI participants’ task was to determine if the target vowel in a phrase such as l*ttle boys had been produced correctly or incorrectly. (The target vowel was replaced by an asterisk to indicate which vowel was to be judged.) It is typically the case that early learners perform more accurately than late learners do. However, Flege and MacKay (2004) found that the late learners who seldom used Italian performed very much like the early learners who often used Italian. The absence of the expected early versus late difference between these groups was attributed to how often members of the two groups had heard the target vowel (English /ɪ/) produced incorrectly (as an [i]-quality vowel) by other NI speakers.
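The discrimination scores above are expressed as A′, a nonparametric sensitivity index with a chance level of 0.50 and a ceiling of 1.0. One standard formula, computed from hit and false-alarm rates, is sketched below; this is a general-purpose formula, not necessarily the exact computation used by Flege and MacKay (2004).

```python
def a_prime(hit_rate: float, fa_rate: float) -> float:
    """Nonparametric sensitivity index A' (chance = 0.5, perfect = 1.0),
    computed from the hit rate and the false-alarm rate."""
    h, f = hit_rate, fa_rate
    if h == f:
        return 0.5  # no sensitivity: performance at chance
    if h > f:
        return 0.5 + ((h - f) * (1 + h - f)) / (4 * h * (1 - f))
    # below-chance responding mirrors the above-chance case
    return 0.5 - ((f - h) * (1 + f - h)) / (4 * f * (1 - h))

assert a_prime(0.5, 0.5) == 0.5   # chance performance
assert a_prime(1.0, 0.0) == 1.0   # perfect discrimination
```

For a categorial discrimination test, "hits" are correct detections of change trials and "false alarms" are change responses on no-change trials.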



3.5.2 Design

The Category Formation Test (CFT) presented in this section employs a five-alternative forced-choice categorization procedure. Use of this perceptual test will be illustrated in a hypothetical experiment examining natural productions of the English vowels /i/, /ɪ/, /eɪ/, /ε/, and /u/ as produced by 20 NE males and 20 NE females in several phonetic contexts, including a /b_d/ context, at a single, moderate speaking rate. The vowel stimuli produced by the NE males and females will be examined in separate blocks. For now, we will focus attention on just the vowels produced by the NE males.

Ten productions of /ɪ/ by each of the 20 NE males (n = 200 tokens) and five productions each of /i/, /eɪ/, /ε/, and /u/ (n = 400 tokens) will be selected based on a pretest with NE-speaking listeners. Hillenbrand, Getty, Clark, and Wheeler (1995) found that NE-speaking listeners were able to identify American English vowels spoken by other NE speakers at average rates exceeding 95 percent correct. The correct identification rates for /i/ and /ɪ/ tokens exceeded 99 percent correct. That being the case, any vowel stimulus identified by fewer than nine of 10 NE-speaking listeners in the pretest will be replaced.

The vowel stimuli selected for analysis in the hypothetical experiment will surely exhibit overlap when plotted in an F1 × F2 acoustic space. Hillenbrand et al. (1995) found that natural productions of the five English vowels of interest (/i/, /ɪ/, /eɪ/, /ε/, /u/) were correctly classified in discriminant analyses at rates similar to listeners’ percent correct identification rates (94.8 percent correct) when F1 and F2 frequencies, duration, and formant movement measurements were available to the algorithm. A lower rate of correct classification, 68.2 percent, was evident when the algorithm had access only to steady-state F1 and F2 values in the vowel stimuli.
In the hypothetical experiment described here, the 200 /ɪ/ tokens will be randomly presented to NI participants two times each, whereas productions of the other four vowels will be presented a single time, yielding a total of 800 single-interval trials. Both NI and NE participants will classify the vowel stimuli using the English keywords shown here:

For the /i/ vowels: BEAD (lead, read, feed, seed, need) For the /ɪ/ vowels: BID (lid, rid, Sid, kid, mid) For the /eɪ/ vowels: BADE (laid, raid, fade, made, paid) For the /ε/ vowels: BED (led, red, fed, said, dead, Ned) For the /u/ vowels: BOOED (lewd, rude, food, sued, nude)
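The trial arithmetic implied by this design (200 /ɪ/ tokens × 2 presentations plus 4 × 100 other-vowel tokens × 1 presentation = 800 trials) can be sketched as a randomized presentation list; the token indexing below is illustrative.

```python
import random

# Hypothetical stimulus inventory for the male talkers: 200 /ɪ/ tokens
# presented twice, 100 tokens each of /i/, /eɪ/, /ε/, /u/ presented once.
tokens = ([("ɪ", i) for i in range(200)] * 2
          + [(v, i) for v in ("i", "eɪ", "ε", "u") for i in range(100)])
random.Random(0).shuffle(tokens)  # randomize presentation order
assert len(tokens) == 800         # 200 x 2 + 4 x 100 single-interval trials
```

Presenting each /ɪ/ token twice is what later permits the same-classification consistency check described in the analyses.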






3.5.3 Analyses

The first step in the analysis of each NI participant’s classifications is to determine how many of the 200 [ɪ] tokens have been classified the same way when presented twice. The subset of the [ɪ] stimuli meeting this criterion will be plotted in a three-dimensional acoustic phonetic space, with F1 and F2 frequencies at the vowel midpoint defining the x- and y-axes, and duration defining the z-axis. The question of interest is whether a contiguous set of stimuli in this acoustic phonetic space has been consistently identified as /ɪ/ and, if so, what the extension of this portion of the vowel space is.

Concluding that an NI participant has formed a phonetic category for /ɪ/ cannot depend on whether the participant’s consistent classifications of stimuli as /ɪ/ occupy the same portion of vowel space evident for many or most NE-speaking listeners. It is plausible that an NI participant has formed a category for English /ɪ/ that differs from native speakers’ categories. Nor can the extension of stimuli that have been labeled as /ɪ/ provide a criterion for deciding whether a new /ɪ/ category has or has not been formed. NE speakers who are tested are likely to have /ɪ/ categories showing differing extensions, that is, differing in L1 category “precision” (see Chapter 1 for discussion). Other criteria need to be invoked to conclude that a new category has been formed.

First, if a new category has been formed for English /ɪ/, the stimuli that are consistently classified as instances of this vowel should occupy much the same portion of vowel space when the English vowel stimuli produced by the NE females are examined, even though these stimuli differ in formant frequency values from the male-produced vowel stimuli. Second, much the same results seen for vowels spoken in the /b_d/ context should be evident for the same vowels when produced in other phonetic contexts that are known to systematically shift vowel formant frequencies.
Third, if a new /ɪ/ category has been formed, much the same results should be obtained for sets of vowel stimuli that have been produced at faster or slower speaking rates despite the systematic changes in formant frequency values resulting from the rate changes.

Discriminant analyses can be carried out to determine if the set of contiguous vowel stimuli consistently classified as /ɪ/ by an L2 learner yields a higher rate of correct classifications when the duration and/or formant movement values measured in the stimuli are made available to the algorithm, in addition to the midpoint F1 and F2 values, than when neither is made available. If the model that includes vowel




duration and/or formant movement patterns performs better than a purely F1-F2 model, it would suggest that the L2 learner was exploiting those dimensions when categorizing vowel stimuli as /ɪ/. That being the case, much the same use of vowel duration and/or formant movement patterns should be observed in the L2 learner’s productions of English /ɪ/.
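The comparison of an F1-F2-only model with one that adds duration can be illustrated with a deliberately simplified stand-in for discriminant analysis: nearest-centroid classification over different feature sets. The toy /i/-/ɪ/ measurements below are invented so that the two vowels overlap in formants but separate in duration.

```python
import math
import statistics

def centroid_accuracy(data, dims):
    """Nearest-centroid classification accuracy over the chosen feature
    dimensions -- a stdlib stand-in for the discriminant analyses
    described in the text, not the actual procedure."""
    cats = {c for c, _ in data}
    cent = {c: [statistics.fmean(x[d] for cc, x in data if cc == c)
                for d in dims]
            for c in cats}
    ok = 0
    for c, x in data:
        pred = min(cent, key=lambda k: math.dist(cent[k],
                                                 [x[d] for d in dims]))
        ok += pred == c
    return ok / len(data)

# Invented (F1 Hz, F2 Hz, duration ms) measurements: /i/ and /ɪ/
# overlap spectrally but /i/ is long and /ɪ/ is short.
data = [("i", (300, 2300, 240)), ("i", (310, 2250, 230)),
        ("ɪ", (305, 2280, 130)), ("ɪ", (315, 2240, 140))]
spectral = centroid_accuracy(data, dims=(0, 1))      # F1-F2 only
with_dur = centroid_accuracy(data, dims=(0, 1, 2))   # adds duration
assert with_dur > spectral  # duration carries the contrast here
```

If adding duration (or formant movement) raises classification accuracy for a learner's consistently labeled /ɪ/ stimuli, that dimension was plausibly being exploited in perception, and the same cue use should surface in the learner's productions.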

3.6 Summary

The present chapter has focused on a number of crucial issues in L2 speech research methodology. These included (1) how to elicit speech production samples that reflect how learners typically produce sounds in their L1 and L2; (2) how to determine overall percentage use of the L1 and L2; (3) how to determine whether the L2 input that learners receive comes from native speakers or from other nonnative speakers; (4) how to evaluate the perception of L2 sounds; and (5) how to measure L2 input distributions and determine what kinds of distributions promote the formation of new phonetic categories for L2 sounds. Improvements in existing research methods and procedures were suggested and completely new ones proposed, some of which are suitable for internet research.

Most of the methods presented here were illustrated with a Romance language as the L1 (Italian, Spanish) and English as the target L2. There is no justification for the focus on English as the L2 other than convenience. It is my hope that readers who find the methods outlined here to be useful will find ways to apply them to other L1-L2 pairs.

The methods suggested in this chapter will be time consuming, and so costly. This raises the issue of whether they can and should be adopted. My response is to observe that science, like politics, can be defined as “the art of the possible.” What is deemed to be an interesting theoretical question in a particular field of research depends, at least in part, on what practitioners are able to measure with the resources at their disposal. Most of the methods and procedures suggested here can eventually be streamlined as investigators gain experience using them. Fewer trials might yield the same results, but this must be determined empirically.

References

Antoniou, M., Best, C. T., Tyler, M. D., & Kroos, C. (2010). Language context elicits native-like stop voicing in early bilinguals’ productions in both L1 and L2. Journal of Phonetics, 38, 640–653.






Birdsong, D. (2003). Authenticité de prononciation en français L2 chez des apprenants tardifs anglophones: Analyses segmentales e globales. Acquisition et Interaction en Langue Étrangère, 18, 17–36. Bohn, O. S., & Flege, J. E. (1993). Perceptual switching in Spanish/English bilinguals. Journal of Phonetics, 21(3), 267–290. Bos, M., Schoevers, R. A., & ann het Rot, M. (2015). Experience sampling and ecological momentary assessment studies in psychopharmacology: A systematic review. European Neuropharmacology. doi:10.1016/j. euroneuro.2015.08.008. Dmitrieva, O., Llanos, F., Shultz, A. A., & Francis, A. L. (2015). Phonological status, not voice onset time, determines the acoustic realization of onset f0 as a secondary voicing cue in Spanish and English. Journal of Phonetics, 49, 77–95. Flege, J. E. (1982). Laryngeal timing and phonation onset in utterance-initial English stops. Journal of Phonetics, 10, 177–192. Flege, J. E. (2003). A method for assessing the perception of vowels in a second language. In E. Fava & A. Mioni (Eds.), Issues in clinical linguistics (pp. 19–44). Padvoa: Unipress. Flege, J. E. (2005). Origins and development of the Speech Learning Model. Paper presented at the 1st Acoustical Society of America Workshop in L2 speech Learning, Simon Fraser University, Vancouver, BC. doi:10.13140/ RG.2.2.10181.19681. Flege, J. (2009). Give input a chance! In T. Piske & M. Young-Scholten (Eds.), Input matters in SLA (pp. 175–190). Bristol, England: Multilingual Matters. Flege, J. E. (2019). A non-critical period for second-language speech learning. In A. M. Nyvad, M. Hejná et al. (Eds.), A sound approach to language matters – In honor of Ocke-Schwen Bohn (pp. 501–541). Aarhus: Department of English, School of Communication & Culture, Aarhus University. Flege, J. E., Aoyama, K., & Bohn, O.-S. (2020). The revised Speech Learning Model (SLM-r) applied. In R. Wayland (Ed.), Second language speech learning: Theoretical and empirical progress. 
Cambridge: Cambridge University Press. Flege, J. E., & Eefting, W. (1988). Imitation of a VOT continuum by native speakers of Spanish and English: Evidence for phonetic category formation. Journal of the Acoustical Society of America, 83, 729–740. Flege, J. E., & MacKay, I. R. (2004). Perceiving vowels in a second language. Studies in Second Language Acquisition, 26, 1–34. Flege, J. E., Schmidt, A. M., & Wharton, G. (1996). Age of learning affects ratedependent processing of stops in a second language. Phonetica, 53, 143–161. Flege, J. E., & Wayland, R. (2019). The role of input in native Spanish Late learners’ production and perception of English phonetic segments. Journal of Second Language Studies, 2(1), 1–45. Grosjean, F. (1998). Studying bilinguals: Methodological and conceptual issues. Bilingualism: Language and Cognition, 1, 131–149.



James Emil Flege

Grosjean, F. (2001). The bilingual’s language modes. In J. Nicol (Ed.), One mind, two languages: Bilingual language processing (pp. 1–22). Oxford, England: Blackwell. Hartshorne, J., Tenenbaum, J., & Pinker, S. (2018). A critical period for second language acquisition: Evidence from 2/3 million English speakers. Cognition, 177, 263–277. Heron, K. E., Everhart, R. S., McHale, S. M., & Smyth, J. M. (2017). Using mobile-technology-based ecological momentary assessment (EMA) methods with youth: A systematic review and recommendations. Journal of Pediatric Psychology, 42(10), 1087–1107. Hillenbrand, J., Getty, L. A., Clark, M. J., & Wheeler, K. (1995). Acoustic characteristics of American English vowels. Journal of the Acoustical Society of America, 97(5), 3099–3111. Larson, R., & Csikszentmihalyi, M. (1983). The experience sampling method. In. H. T. Resi (Ed.), Naturalistic approaches to studying social interactions: New directions for methodology of social and behavioral sciences (Vol. 15, pp. 41–56). San Francisco: Jossey-Bass. MacKay, I. A. R., Meador, D., & Flege, J. E. (2001). The identification of English consonants by native speakers of Italian. Phonetica, 58, 103–125. Meador, D., Flege, J. E., & MacKay, I. R. (2000). Factors affecting the recognition of words in a second language. Bilingualism: Language and Cognition, 3(1), 55–67. Sancier, M. L., & Fowler, C. A. (1997). Gestural drift in a bilingual speaker of Brazilian Portuguese and English. Journal of Phonetics, 25(4), 421–436. Theodore, R., Miller, J., & DeSteno, D. (2009). Individual talker differences in voice-onset-time: Contextual influences. Journal of the Acoustical Society of America, 125, 3974–3982.

Chapter 4

Phonetic and Phonological Influences on the Discrimination of Non-native Phones

Michael D. Tyler

4.1 Introduction

Attunement to the native language (L1), or languages, has a profound effect on the perception of speech segments (or phones). Infants’ attunement to the L1 has clear benefits for native listening as an adult, as it facilitates rapid and efficient L1 communication (for reviews, see Best, Goldstein, Nam, & Tyler, 2016; Cutler, 2012; Tyler, Best, Goldstein, & Antoniou, 2014). Many are interested in the effects of L1 attunement on the success of learning to communicate in a second language (L2), but the field of cross-language speech perception focuses on how L1 attunement affects perception itself. Cross-language speech perception is the focus of this chapter because L1 attunement determines the initial state of the L2 learner. It is crucial to understand how the L1 shapes perception in order to make theoretical and practical advances in second language speech learning. To probe how the L1 shapes perception, listeners are presented with phonologically contrasting phones (contrasts) from a never-before-heard non-native language. Discrimination accuracy for different contrasts is compared within a single listener group or, ideally, the performance of listeners from different language backgrounds is compared to demonstrate the differential effects of L1 attunement on perception. The L1 has been shown to limit discrimination accuracy of non-native contrasts across a range of listener backgrounds and stimulus languages. One situation that has captured the attention of the research community is the discrimination of English /r/ and /l/ by native speakers of Japanese (e.g. Best & Strange, 1992; Bradlow, Akahane-Yamada, Pisoni, & Tohkura, 1999; Goto, 1971; Jenkins, Strange, & Polka, 1995; MacKain, Best, & Strange, 1981; Miyawaki et al., 1975). Both children (Aoyama, Flege, Guion, Akahane-Yamada, & Yamada, 2004) and adults (Guion, Flege, Akahane-Yamada, & Pruitt, 2000) discriminate
English /r/ and /l/ less accurately than native English speakers. This can improve with training or immersion experience, but it rarely reaches the ceiling performance of a native listener. There is widespread agreement that prior attunement to Japanese is responsible for difficulties in discriminating English /r/-/l/, but there is debate about the precise mechanisms that underlie that difficulty (see e.g. Iverson et al., 2003). In addition to /r/-/l/ discrimination by Japanese native listeners, English native speakers have been shown to have difficulty discriminating dental versus retroflex stops (Hindi: Polka, 1991; Werker & Logan, 1985), plosive versus implosive stops (Ma’di: Antoniou, Best, & Tyler, 2013; Zulu: Best, McRoberts, & Sithole, 1988), and a range of vowel contrasts (Danish: Faris, Best, & Tyler, 2018; French: Levy, 2009; Norwegian: Tyler, Best, Faber, & Levitt, 2014). A large cross-language and cross-listener group comparison of the contrasting effects of native-language attunement on discrimination of non-native phones showed varying identification and discrimination of Malayalam, Marathi, and Oriya nasal stops by Malayalam, Marathi, Punjabi, Tamil, Oriya, Bengali, and American English listeners (Harnsberger, 2001). While the focus of cross-language speech perception research has been on consonants and vowels, there is also evidence that native-language attunement influences perception of lexical tone (for a recent review, see Best, 2019). Wayland and Guion (2004) showed that native Chinese and English listeners both discriminated Thai tones less accurately than native Thai listeners. The Chinese listeners were slightly more accurate than the English listeners, which Wayland and Guion suggested may be due to Chinese listeners using similarity to their own lexical tones to guide perception. 
This speculation was confirmed with a different listener group by So and Best (2010), who found that Cantonese native listeners categorized Mandarin Chinese tones in terms of their native Cantonese tonal categories, and by Reid et al. (2015), who found that Thai tones were discriminated differently by Mandarin Chinese and Cantonese native listeners. Interestingly, non-tone language speakers appear to perceive similarity between lexical tones and L1 intonation patterns, and this is related to their discrimination accuracy of lexical tone contrasts (Reid et al., 2015; So & Best, 2010, 2014). Most studies on cross-language speech perception have focused on cases where L1 attunement limits discrimination accuracy of non-native contrasts, but perceiving a non-native phone as consistent with an L1 phonological category may also facilitate discrimination. For example, English /ε/-/æ/ was discriminated well by Polish native listeners (Balas,






2018), who perceived each vowel as an instance of a different Polish vowel category, but it is notoriously difficult for Dutch native speakers who appear to perceive both as instances of their native /ε/ category (Escudero, Hayes-Harb, & Mitterer, 2008; Weber & Cutler, 2004). Greek native listeners accurately discriminated Ma’di /d/-/t/, as each was perceived as its corresponding Greek phoneme, whereas English native listeners perceived them both as instances of English /d/ and their discrimination was correspondingly poor (Antoniou et al., 2013). The important point to note here is that the Greek listeners can perceive the difference between /d/ and /t/ in Ma’di not because of any previous experience with that language, but because tuning in to the phonological properties of Greek happens to correspond to a phonological property of Ma’di. Thus, it appears that sensitivity to a phonetic difference that is perceived as phonologically contrastive in the L1 can facilitate discrimination of a non-native contrast. When listeners do not perceive an L1 phonological contrast between a pair of contrasting non-native phones, attunement to the L1 may still support discrimination if they are perceived as notably different instances of the same L1 phonological category. For example, discrimination accuracy for non-native vowels that English listeners perceived to differ in goodness-of-fit to a single native category was 20 per cent higher than it was for those who perceived them as equally good or poor instances of a single native category (Tyler, Best, Faber et al., 2014). This highlights that sensitivity to phonetic goodness-of-fit to a native phonological category can facilitate discrimination, that individuals may differ from each other in the degree to which they perceive those differences, and that there appears to be a corresponding effect of those individual assimilation patterns on discrimination. 
As infants are born with the potential to learn any phonological contrast from any language, it stands to reason that they are sensitive to phonetic differences independent from the language that they are in the process of acquiring (for a review, see Tyler, Best, Goldstein et al., 2014). Those language-independent phonetic differences may also be a source of information that adults could use to discriminate non-native phones. However, to acquire a phonological category, infants must not only tune in to the phonetic differences that signal a change in meaning (the principle of phonological distinctiveness), but they must also learn to treat as the same any within-category phonetic differences that do not signal a change in meaning (the principle of phonological constancy) (Best, 2015; Best, Tyler, Gooding, Orlando, & Quann, 2009). The vestiges of that




phonological attunement can be seen in the facilitatory and inhibitory influences of the L1 on adults’ discrimination of non-native contrasts. Adults might be sensitive to language-independent phonetic differences in regions of phonetic space that are not utilized by the L1 phonology, or under certain presentation conditions (e.g. a short interstimulus interval; Werker & Logan, 1985). Thus, while the circumstances under which adults may be sensitive to language-independent phonetic information are not entirely clear, it is certainly a possible source of information that they could use to discriminate non-native phones. Finally, there is evidence that some non-native phonemes are so unlike any phoneme in the native phonology that they do not sound like speech at all. Best, McRoberts, and Sithole (1988) found that English native listeners’ discrimination of Zulu clicks was fairly accurate. In a post-test questionnaire, all listeners indicated that they depended on non-speech aspects of the clicks for discrimination. For example, some indicated that they sounded like fingers snapping, water dripping, or mouth sounds. In a follow-up study, English listeners reported that !Xóõ clicks sounded like non-speech more often than Zulu or Sesotho listeners did (both are click languages) (Best, Traill, Carter, Harrison, & Faber, 2003). In fact, the clicks appear to have been perceptually assimilated to one of the three clicks in Zulu, or to the single click in Sesotho. Across two !Xóõ click contrasts, the English listeners generally outperformed both click-language groups, with the Sesotho listeners performing least accurately among the groups. Thus, it appears that the click-language speakers’ discrimination was influenced by native-language attunement, whereas the English listeners were freed from that influence. The differences that they perceived between the click consonants were not phonetic, because they did not perceive them as speech-like.
Rather, they appear to have been sensitive to perceptually salient aspects of a non-linguistic auditory difference.

4.1.1 Summary and Chapter Aims

To summarize, there are (at least) four sources of information that perceivers might be sensitive to when discriminating non-native contrasts:

1. a phonetic difference that is perceived as phonologically contrastive in the L1;
2. the phonetic goodness-of-fit to a native phonological category;
3. language-independent phonetic distance; and
4. the perceptual salience of a non-linguistic auditory difference.

To explain how attunement to the L1 shapes perception, models of cross-language speech perception need to account for the four sources of information that may be used by listeners for discrimination of non-native contrasts. The main aim of this chapter is to demonstrate how those four sources might be accounted for by an existing model of speech perception. As my own research is based on the Perceptual Assimilation Model (PAM: Best, 1994a, 1994b, 1995), it will form the basis of the illustration.¹ The chapter also provides the opportunity to incorporate new findings into a summary of PAM’s core principles. The second aim is to consider the methodological requirements for assessing which sources of information listeners use when discriminating non-native contrasts.

4.2 How the Perceptual Assimilation Model Accounts for the Four Sources of Information

The Perceptual Assimilation Model (Best, 1994a, 1994b, 1995) was designed to account for how attunement to the L1 shapes perception. PAM considers both the process by which infants attune to the L1 (Best, 1994a; Best et al., 2016; Best & McRoberts, 2003; Best et al., 1988; Best et al., 2009; Tyler, Best, Goldstein et al., 2014) and the vestiges of that attunement in adult speech perception (Antoniou et al., 2013; Best, 2015; Faris et al., 2018; Fenwick, Best, Davis, & Tyler, 2017; So & Best, 2014; Tyler, Best, Faber et al., 2014). It is through perceptual learning of dynamic invariant properties of speech that an infant transitions from language-independent phonetic perception to natively tuned perception (Tyler, Best, Goldstein et al., 2014). The infant learns the natural multidimensional phonetic variability that defines a native phonological category (phonological constancy), but also tunes in to the distinctive multidimensional phonetic features that set it apart from other phonological categories (phonological distinctiveness; Best, 2015; Best et al., 2009). Attention comes to be drawn automatically to those regions of phonetic space that serve linguistic functions in the L1, which is defined as the native phonological space. Within the phonological space, attention is drawn automatically to those phonetic features that optimize detection of

¹ As the focus of this chapter is on attunement to the L1, the review will not focus on PAM’s extension to L2 speech learning (PAM-L2: Best & Tyler, 2007), but a recent review focusing on PAM-L2 can be found in Tyler (2019).
L1 phonological contrast. One consequence of this L1 attunement is that non-native phones that fall within the native phonological space are perceptually assimilated. That is, the listener’s attention, which is optimally tuned to the distinctive phonetic features of the L1, is drawn to those same features of the non-native phones. This may result in facilitation or inhibition of discrimination depending on whether those same distinctive features happen to signal phonological distinctiveness or phonological constancy in the L1. To describe the various ways that non-native phones might be assimilated to the L1, PAM first considers whether or not individual non-native phones are perceived as L1 phonological categories. The assimilation types for individual phones are then combined to form contrast assimilation types, from which predictions can be made about the accuracy of non-native contrast discrimination. An individual non-native phone can be

1. assimilated to a native phonological category as a good, an acceptable but not ideal, or a notably deviant exemplar;
2. assimilated as speech, but as unlike any native phonological category (uncategorized-dispersed) (Faris, Best, & Tyler, 2016);
3. assimilated as speech, and weakly consistent with one (uncategorized-focalized) or more (uncategorized-clustered) native phonological categories (Faris et al., 2016); or
4. not assimilated to speech.

The model provides predictions for discrimination accuracy of pairs of contrasting non-native phones according to how each individual non-native phone is assimilated. These are known as contrast assimilation types. All pairwise combinations of the four assimilation possibilities are logically possible, but investigations using PAM as the theoretical basis have focused on six different contrast assimilation types:

1. Two-Category (TC): Each non-native phoneme in the contrast is perceived as a different native phonological category.
Discrimination should be excellent, as it would be for a native contrast, because the phonetic difference is one that would signal a phonological contrast in the L1.
2. Single-Category (SC): Both non-native phones are perceived as equally good or poor instances of the same L1 phonological category. Discrimination is predicted to be poor, although it may be above chance.
3. Category-Goodness (CG): Both non-native phones are perceived as instances of the same L1 phonological category and there is a
perceptible difference in goodness-of-fit to the L1 category. Discrimination accuracy is predicted to lie in between two-category and single-category assimilations.
4. Uncategorized-Categorized (UC): One non-native phone is perceived as a native phonological category and the other is perceived as speech but not as consistent with any one native category. Recently, Faris et al. (2018) have elaborated on the discrimination predictions for contrasts involving uncategorized phones. UC contrasts where the uncategorized phone is uncategorized-dispersed should be discriminated accurately (note that this prediction has not yet been tested empirically). If the uncategorized phone is uncategorized-clustered or uncategorized-focalized, then discrimination accuracy should vary depending on the perceived phonological overlap between the categorized and the uncategorized phones, such that the relative discrimination accuracy should be: non-overlapping (UC-N) > partially overlapping (UC-P) > completely overlapping (UC-C). For example, for the Australian English listeners in Faris et al. (2018), Danish /o/ was perceived as weakly consistent with four native vowel categories (/ʊ, ɔ, oː, əʉ/), whereas Danish /œ/ was consistent with only one L1 category (/ɜː/). The contrast /o/-/œ/ was, therefore, UC-N because there were no L1 categories in common between the two vowels. Danish /ø/ was perceived as weakly consistent with three native categories (/ʊ, ɜː, ʉː/), so /ø/-/œ/ was UC-P because both phones were perceived as consistent with English /ɜː/, but /ø/ was also weakly consistent with other native vowel categories. AXB discrimination of the UC-N contrast (85 per cent correct) was more accurate than of the UC-P contrast (57 per cent correct). Additional research is required to provide support for discrimination predictions about completely overlapping contrasts.
5. Uncategorized-Uncategorized (UU): Neither of the non-native phones is perceived as consistent with any one native category. For pairwise combinations of non-native phones involving uncategorized-focalized or uncategorized-clustered assimilations, discrimination should vary according to perceived phonological overlap: UU-N > UU-P > UU-C (see Faris et al., 2018, for results showing that UU-N > UU-P). When one of the non-native phones is uncategorized-dispersed and the other is clustered or focalized, discrimination should be very good. Finally, if both are dispersed, then discrimination should be moderate to excellent, depending on the perceived phonetic distance.
6. Non-assimilable (NA): The contrasting non-native phones are both perceived as non-speech. They should be discriminated well, but discrimination should vary according to the salience of the perceptual difference.

For PAM, it is the contrast assimilation type that determines which of the four sources of information the listener is sensitive to when discriminating non-native contrasts. If listeners detect L1 phonological information in non-native contrasts, then their attention will be drawn automatically to that information, even if sensitivity to language-independent phonetic distance would potentially facilitate more accurate discrimination. The model would predict that attention is drawn to speech information over non-speech information, and to L1 phonological information over language-independent phonetic information. Note that sensitivity to phonetic, acoustic, or gestural distinctions is not lost and may be observed if attention can be drawn to lower-order information (e.g. under specific task conditions; Strange, 2011). Listeners clearly detect a phonetic difference that signals an L1 phonological contrast in TC assimilations, and that facilitates their discrimination. Attention is automatically drawn to phonological information in SC assimilations as well, and there are two possible reasons why discrimination is poor. The non-native phones may share a set of critically distinctive features that set an L1 phonological category apart from every other L1 category. Alternatively, or additionally, the phonetic features that set the non-native phones apart may be phonologically constant in the L1. Discrimination of SC assimilations is often above chance, which may be due to detection of language-independent phonetic differences, or of a subtle difference in goodness-of-fit that is not detected using standard laboratory goodness rating tasks. For CG assimilations, listeners’ attention is drawn to L1 phonological information.
As in SC assimilations, the non-native phones share the phonetic features that critically distinguish phonemes in the L1, but listeners are able to detect a difference between them in terms of their phonetic goodness-of-fit to the same L1 category. When presented with a UC contrast involving an uncategorized-dispersed phone, the listeners’ attention would be drawn to the L1 phonological information in the categorized phone. There would be no L1 phonological information to capture attention in the uncategorized-dispersed phone, but their discrimination should be facilitated by sensitivity to the phonological distinction between what is and what is not a phonological category in the L1. For UC contrasts involving uncategorized-focalized






or -clustered phones, attention is drawn to the L1 phonological information in the categorized phone and to the range of L1 phonological similarity in the uncategorized phone. Discrimination is facilitated when different L1 phonological information is perceived in the non-native phones (i.e. non-overlapping) and inhibited when the same information is perceived (i.e. completely overlapping). Listeners would be sensitive to phonologically contrastive phonetic differences in the same way for UU focalized-focalized, focalized-clustered, and clustered-clustered assimilations. For UU focalized-dispersed and clustered-dispersed assimilations, listeners should be sensitive to the phonological difference between phones that have some phonological similarity to the L1 and those that have no phonological similarity. It is possible that attention is only weakly drawn to the L1 phonological information in uncategorized-focalized or -clustered phones. If so, then listeners may be more likely to be sensitive to language-independent phonetic distance in contrasts involving those phones than for other assimilation types. For UU dispersed-dispersed assimilations, listeners would be sensitive to language-independent phonetic distance, and for NA they would be sensitive to the perceptual salience of a non-linguistic auditory difference. Therefore, for PAM, detection of distinctive features that signal L1 phonological contrast is thought to underlie discrimination accuracy in the majority of cases. Sometimes attention directed towards those distinctive features facilitates discrimination (e.g. TC, UC-N) and sometimes it inhibits discrimination (e.g. UC-C, SC). The only contrast assimilation types where attunement to L1 phonological categories does not play a role are UU dispersed-dispersed and NA, which are likely to be rare across the languages of the world.
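For computational implementations of this kind of analysis, the overlap comparison described above reduces to a set comparison over the above-chance L1 labels elicited by each phone. The sketch below is not from the PAM literature, and reading ‘completely overlapping’ as identical label sets is an assumption made here for illustration; the example applies it to the Danish vowel data from Faris et al. (2018) summarized earlier:

```python
def overlap_type(labels_a, labels_b):
    """Classify the perceived phonological overlap between two phones,
    given the above-chance L1 category labels each phone elicited."""
    a, b = set(labels_a), set(labels_b)
    if not (a & b):
        return "non-overlapping"         # e.g. UC-N / UU-N
    if a == b:
        return "completely overlapping"  # e.g. UC-C / UU-C
    return "partially overlapping"       # e.g. UC-P / UU-P

# Danish vowels as perceived by Australian English listeners
# (label sets from the Faris et al., 2018, example in the text)
print(overlap_type(["ʊ", "ɔ", "oː", "əʉ"], ["ɜː"]))  # /o/-/œ/: non-overlapping
print(overlap_type(["ʊ", "ɜː", "ʉː"], ["ɜː"]))       # /ø/-/œ/: partially overlapping
```

On this rule, the predicted discrimination ordering is non-overlapping > partially overlapping > completely overlapping.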

4.3 Methodological Requirements

In this chapter, I have argued that it is important for models of cross-language speech perception to account for which sources of information listeners are sensitive to when discriminating non-native phones. While discrimination tasks may seem like an appropriate choice, they are not suited to answering the question of which information listeners attend to when discriminating non-native speech. Listeners may arrive at the same level of discrimination for different contrasts by attending to different sources of information in each. For example, they might assimilate one contrast as UC-P and discriminate it at 75 per cent accuracy due to their sensitivity to L1 phonological distinctions. Another might be assimilated
as CG and discriminated at 75 per cent accuracy due to their sensitivity to phonetic goodness-of-fit. Discrimination accuracy simply gauges how well listeners can detect differences between non-native phones. It cannot determine the phonological, phonetic, or acoustic/gestural differences that they detect to achieve a given degree of discrimination accuracy. This applies to any discrimination task (e.g. AXB, AX, ABX, 4IAX, Oddity, change detection; for a comparison, see Gerrits & Schouten, 2004; Strange & Shafer, 2008), and to similarity judgement tasks (e.g. Hattori & Iverson, 2009). These tasks can certainly establish whether listeners perceive a difference, or determine the magnitude of that perceived difference, but they are not well equipped to determine which of the four sources of information is responsible for the perceived differences. Studies in support of PAM assess perceptual assimilation of individual non-native phones to the native phonological system using a categorization task. Originally, this took the form of a post-test questionnaire where participants provided descriptions of non-native phones using their native orthography (see Best, McRoberts, & Goodell, 2001; Best et al., 1988). More recent studies have used a forced-choice task where participants first select a native phonological category label (which may include allophonic variants) and then rate the goodness-of-fit to that category using a Likert scale (e.g. Faris et al., 2016; Tyler, Best, Faber et al., 2014). If any label is chosen above a threshold level (usually 50 per cent or 70 per cent; see Bundgaard-Nielsen, Best, & Tyler, 2011) then it is deemed to be assimilated to a native phonological category (i.e. categorized), otherwise it is uncategorized.
To filter out responses that may be due to random responding, the categorization percentages for each label are compared to the level of responding that would be expected if participants were responding at chance, which is 1 divided by the number of options (i.e. if there were 10 labels to choose from, then chance responding would be 10 per cent). If the labelling response is statistically above chance it is retained, otherwise it is discarded. If only one category is selected above chance then the assimilation is focalized, if more than one category is selected above chance then it is clustered, and if no label is selected above chance then it is dispersed (Faris et al., 2016). Contrast assimilation types are determined by comparing the individual assimilation types for each non-native phone. The goodness ratings are only consulted when both non-native phones are assimilated to the same L1 phonological category. In those cases, if there is a significant difference in the goodness-of-fit to the same L1 category then it is a category-goodness assimilation, otherwise it is a single-category assimilation.
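The labelling analysis just described (chance level as 1 divided by the number of response options, discarding labels chosen at or below chance, then classifying the phone) can be sketched as follows. This is a simplification of the published procedure: the threshold value is a free parameter, and studies such as Faris et al. (2016) compare labelling rates to chance with a statistical test rather than the raw cutoff used here:

```python
def classify_assimilation(label_pcts, n_options, threshold=70.0):
    """Classify one non-native phone's individual assimilation type
    from categorization percentages (simplified sketch).

    label_pcts: per cent of trials on which each L1 label was chosen.
    n_options:  number of response options in the forced-choice task.
    """
    chance = 100.0 / n_options  # e.g. 10 labels -> 10 per cent
    # Retain only labels chosen above chance level
    above_chance = {lab: p for lab, p in label_pcts.items() if p > chance}
    if above_chance:
        best = max(above_chance, key=above_chance.get)
        if above_chance[best] >= threshold:
            return ("categorized", best)  # assimilated to one L1 category
    if len(above_chance) == 1:
        return ("uncategorized-focalized", sorted(above_chance))
    if len(above_chance) > 1:
        return ("uncategorized-clustered", sorted(above_chance))
    return ("uncategorized-dispersed", [])

# Hypothetical responses with 10 response options (chance = 10 per cent)
print(classify_assimilation({"i": 85.0, "e": 10.0}, 10))  # ('categorized', 'i')
print(classify_assimilation({"u": 40.0, "o": 35.0}, 10))  # ('uncategorized-clustered', ['o', 'u'])
```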






In terms of the four sources of information for discriminating non-native phones, the forced-choice categorization task with goodness ratings assesses listeners’ perception of phonological category membership using the labelling task and their perception of phonetic goodness-of-fit to that native category using the rating task. As PAM considers perception of L1 phonological information to be automatic, it is assumed that listeners are sensitive to language-independent phonetic distance only when the categorization task fails to show any similarity with a native phonological category. Recent studies testing PAM predictions have not focused on non-native phones that are likely to be perceived as non-speech. To evaluate whether listeners perceive non-speech information, the categorization task would need to include another response option (e.g. not speech). It is only by comparing the categorization results for contrasting non-native phones that inferences can be made about which information listeners may be using to detect phonetic differences between them. There are a number of limitations to using the forced-choice categorization task (see also Bohn, 2017). The forced-choice version was adopted because listeners’ open-ended responses were often difficult to interpret (see Best et al., 2001, table III), but the open-ended version had the advantage of not limiting the range of responses that listeners could provide. With a forced-choice task it is possible that the participant clearly perceives the non-native phone as a category, or a combination of categories, but none of the response options provided matches the percept. This is mitigated, in part, by ensuring that listeners are provided with the full range of L1 consonants or vowels as response options (Bundgaard-Nielsen et al., 2011; Faris et al., 2016, 2018). However, that does not allow for the possibility that some listeners may perceive a non-native phoneme as two native categories (e.g.
Spanish /ɲ/ could be perceived as English /nj/). While it could be argued that a ‘not there’ or ‘no fit’ option might improve the task, it is not clear whether listeners select that option because it clearly sounds like a category that is not provided, because it sounds like speech but unlike any category, or not like speech at all. It may also leave the task open to response bias, where certain participants may use that response by default when the non-native stimulus does not fit perfectly into an L1 category.

Categorization tasks provide a reliable indication of listeners’ perceptual assimilation when categorization consistency is high. If categorization of a given non-native phone is at 95 per cent then we can be fairly confident about how listeners perceive it relative to the L1 phonological system. The task would appear to be less reliable when categorization consistency is low. When averaging across listeners, the categorization percentage could be low for a given label because some participants chose it 100 per cent of the time and others did not select it at all. Alternatively, listeners’ perception of the non-native phone could be unstable, such that the same non-native phone is consistent with one L1 category on the first occasion and a different L1 category on another occasion. It may also reflect a compromise response when listeners perceive the non-native phone as weakly consistent with more than one phonological category, which is the assumption underlying uncategorized-clustered assimilations. If listeners do simultaneously perceive phonological similarity with more than one native category, then a task requiring listeners to choose a single category would appear to be poorly suited to the requirements of the model. In short, when the categorization percentage is high, the result is easily interpreted, but a low categorization percentage is open to a number of different interpretations.

It is also straightforward to interpret goodness ratings when the categorization consistency is high. If two non-native phones are categorized as the same L1 category at 95 per cent, one with a goodness rating of 3/7 and the other with 6/7, then listeners clearly perceive a difference in phonetic goodness-of-fit to the same native category. If label choice is consistently low across participants, then the rating was obtained only on those occasions when the participant selected that label as the best one. Furthermore, if the non-native phone is only perceived as weakly consistent with the L1 category, how reliable is the listener’s judgement of phonetic goodness-of-fit? This question is particularly pertinent to studies where categorization and goodness ratings are combined to form a single value. For example, Guion et al. (2000) created a fit index, where the goodness rating was scaled by the categorization percentage.
A label with a categorization percentage of 80 per cent and a goodness rating of 2.5 would have a fit index of 2. Creating a single measure is useful because it avoids the need to make inferences over separate analyses of categorization and goodness rating. When the categorization percentage is high, the fit index is easily interpretable. For example, if the categorization percentage is 100 per cent and the goodness rating is 2, then the fit index of 2 entirely reflects the perceived phonetic goodness-of-fit. A fit index of 2 would also be obtained with a categorization percentage of 40 per cent and a goodness rating of 5. By combining the two values, the assumption is that the perceived similarity to the L1 is qualitatively the same in all three cases where the fit index is 2. That may be true, but it is possible that the fit index conflates response patterns that are qualitatively different.
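The fit index is simple arithmetic, and the worry about conflation can be made concrete. The sketch below is my own illustration (the function name is mine; only the scaling rule and the three number pairs come from the text above):

```python
def fit_index(categorization_pct: float, goodness_rating: float) -> float:
    """Goodness rating scaled by the categorization percentage,
    following the description of Guion et al.'s (2000) measure."""
    return (categorization_pct / 100.0) * goodness_rating

# Three qualitatively different response patterns that collapse onto the
# same fit index of 2: frequent choice with a middling rating, unanimous
# choice with a poor rating, and rare choice with a good rating.
cases = [(80, 2.5), (100, 2.0), (40, 5.0)]
print([fit_index(pct, rating) for pct, rating in cases])  # [2.0, 2.0, 2.0]
```

Because all three patterns yield the same value, the index alone cannot distinguish a phone that everyone weakly categorizes from one that a minority strongly categorizes, which is precisely the interpretive problem noted above.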






The final problem with forced-choice categorization is that arbitrary thresholds determine whether a non-native phone is categorized. While studies in support of PAM have used 50 per cent or 70 per cent, Harnsberger (2001) advocated for 90 per cent, precisely to ensure that categorized phones did not overlap phonologically with other categories. Harnsberger raises a valid point. By setting an arbitrary threshold of 50 per cent, it is quite possible that an L1 category is selected more than 50 per cent of the time while another category is also selected above chance. If another non-native phone is assimilated to the secondary label, then this would be a partially overlapping two-category assimilation. Setting a higher threshold may render the assimilation uncategorized, but a 90 per cent threshold is no less arbitrary than a 50 per cent threshold.

The solution to this problem is not as simple as abandoning thresholds altogether. A categorization task requires a decision to be made about whether any value below 100 per cent constitutes recognition of phonological information that is consistent with the L1. It is possible that the typical analysis of a categorization task has led to a conception that categorization is gradient rather than absolute. Indeed, in my own papers, and in this chapter, I have referred to the notion of weak phonological similarity, which assumes a gradient level of categorization. One solution to the threshold problem might be to consider categorization as dichotomous rather than gradient. Until evidence suggests otherwise, I propose that the most parsimonious assumption is that phonological categories are perceived in an all-or-nothing fashion. Perception may be unstable when the non-native phone is phonetically distant from the native category, such that a non-native phone is perceived as an instance of a phonological category on some occasions but not others, but there are no degrees of phonological perception.
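The all-or-nothing proposal can be illustrated with a toy simulation (my own sketch, not from the chapter; the trial count and the probability of 0.6 are arbitrary assumptions). Even when every individual trial is strictly dichotomous, averaging responses produces an intermediate categorization percentage of the kind that invites a gradient reading:

```python
import random

def perceived_as_category(p_instance: float) -> bool:
    """One trial: the phone either is or is not perceived as an
    instance of the L1 category (all-or-nothing), with probability
    p_instance."""
    return random.random() < p_instance

random.seed(42)  # reproducible illustration
n_trials = 1000
# Unstable but strictly dichotomous perception across many trials:
pct = 100 * sum(perceived_as_category(0.6) for _ in range(n_trials)) / n_trials
print(pct)  # an intermediate value near 60, despite dichotomous trials
```

On this view, a categorization percentage of around 60 per cent need not reflect gradient phonological similarity; it may simply be the average of many binary perceptual outcomes.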
Overcoming these limitations requires a rethink of the categorization task. To ensure that any new task is fit for purpose, development needs to be done thoughtfully and any proposal needs to be supported by systematic methodological investigation demonstrating how it improves on existing methods. To establish which information is detected in discrimination of non-native phones, a test of perceptual assimilation needs to provide unambiguous answers to the following questions:

1. Does the listener perceive the stimulus as speech or non-speech?
2. Does the listener perceive the speech stimulus as consistent with one category, more than one category, or no category?
3. How good a version is the stimulus of each perceived category?
4. Do individuals differ from each other in their perception of the same non-native phones?
5. Is an individual’s perception of the non-native phone stable or variable across multiple presentations?

4.4  Summary and Conclusions

In this chapter, I have argued that there are four sources of information that a listener could use to detect differences between contrasting non-native phones. The Perceptual Assimilation Model (Best, 2015) infers which information is being attended to by comparing how pairs of individual non-native phones are assimilated to the L1 phonological space. Predictions about discrimination accuracy for non-native phones are based on the assumption that listeners’ attention is drawn automatically to phonological information that was attuned to during L1 acquisition. Listeners would only be sensitive to language-independent phonetic distance when no L1 phonological distinction is perceived, and to the perceptual salience of a nonlinguistic auditory difference when both stimuli are not perceived as speech.

Perceptual assimilation is typically assessed using a forced-choice categorization task with goodness rating. While I have identified some limitations with that task, it still appears to be the most suitable task available for assessing which information listeners attend to when perceiving non-native phones. For now, studies in support of PAM will continue to use a forced-choice categorization task, perhaps with the addition of some new features to address some of its limitations.

Finally, cross-language speech perception research is important for L2 speech learning because it establishes the influence of prior attunement to the L1 on perception of the L2 at the initial state of learning. Establishing which information learners attend to automatically at the initial stages of learning could help tailor learning experiences to maximize detection of phonological distinctions in the L2. I hope that framing non-native contrast discrimination in terms of the information detected by the listener will be useful for theoretical and methodological development in both cross-language speech perception and L2 speech learning.

References

Antoniou, M., Best, C. T., & Tyler, M. D. (2013). Focusing the lens of language experience: Perception of Ma’di stops by Greek and English bilinguals and monolinguals. Journal of the Acoustical Society of America, 133(4), 2397–2411.






Aoyama, K., Flege, J. E., Guion, S. G., Akahane-Yamada, R., & Yamada, T. (2004). Perceived phonetic dissimilarity and L2 speech learning: The case of Japanese /r/ and English /l/ and /r/. Journal of Phonetics, 32(2), 233–250.
Balas, A. (2018). English vowel perception by Polish advanced learners of English. Canadian Journal of Linguistics/Revue canadienne de linguistique, 63(3), 309–338.
Best, C. T. (1994a). The emergence of native-language phonological influences in infants: A perceptual assimilation model. In J. C. Goodman & H. C. Nusbaum (Eds.), The development of speech perception: The transition from speech sounds to spoken words (pp. 167–244). Cambridge, MA: MIT Press.
Best, C. T. (1994b). Learning to perceive the sound pattern of English. In C. Rovee-Collier & L. P. Lipsitt (Eds.), Advances in infancy research (Vol. 9, pp. 217–304). Norwood, NJ: Ablex.
Best, C. T. (1995). A direct realist view of cross-language speech perception. In W. Strange (Ed.), Speech perception and linguistic experience: Issues in cross-language research (pp. 171–204). Baltimore: York Press.
Best, C. T. (2015). Devil or angel in the details? Perceiving phonetic variation as information about phonological structure. In J. Romero & M. Riera (Eds.), Phonetics-phonology interface: Representations and methodologies (pp. 3–31). Amsterdam: John Benjamins.
Best, C. T. (2019). The diversity of tone languages and the roles of pitch variation in non-tone languages: Considerations for tone perception research. Frontiers in Psychology, 10, 364.
Best, C. T., Goldstein, L. M., Nam, H., & Tyler, M. D. (2016). Articulating what infants attune to in native speech. Ecological Psychology, 28(4), 216–261.
Best, C. T., & McRoberts, G. W. (2003). Infant perception of non-native consonant contrasts that adults assimilate in different ways. Language and Speech, 46(2–3), 183–216.
Best, C. T., McRoberts, G. W., & Goodell, E. (2001).
Discrimination of non-native consonant contrasts varying in perceptual assimilation to the listener’s native phonological system. Journal of the Acoustical Society of America, 109(2), 775–794.
Best, C. T., McRoberts, G. W., & Sithole, N. M. (1988). Examination of perceptual reorganization for nonnative speech contrasts: Zulu click discrimination by English-speaking adults and infants. Journal of Experimental Psychology: Human Perception and Performance, 14(3), 345–360.
Best, C. T., & Strange, W. (1992). Effects of phonological and phonetic factors on cross-language perception of approximants. Journal of Phonetics, 20(3), 305–330.
Best, C. T., Traill, A., Carter, A., Harrison, K. D., & Faber, A. (2003). !Xóõ click perception by English, Isizulu, and Sesotho listeners. In M. J. Solé, D. Recasens, & J. Romero (Eds.), Proceedings of the 15th International Congress of Phonetic Sciences (pp. 853–856). Barcelona: Causal Productions.



Michael D. Tyler

Best, C. T., & Tyler, M. D. (2007). Nonnative and second-language speech perception: Commonalities and complementarities. In M. J. Munro & O.-S. Bohn (Eds.), Language experience in second language speech learning: In honor of James Emil Flege (pp. 13–34). Amsterdam: John Benjamins.
Best, C. T., Tyler, M. D., Gooding, T. N., Orlando, C. B., & Quann, C. A. (2009). Development of phonological constancy: Toddlers’ perception of native- and Jamaican-accented words. Psychological Science, 20(5), 539–542.
Bohn, O.-S. (2017). Cross-language and second language speech perception. In E. M. Fernández & H. S. Cairns (Eds.), Handbook of psycholinguistics (pp. 213–239). Hoboken, NJ: Wiley.
Bradlow, A. R., Akahane-Yamada, R., Pisoni, D. B., & Tohkura, Y. (1999). Training Japanese listeners to identify English /r/ and /l/: Long-term retention of learning in perception and production. Perception & Psychophysics, 61, 977–985.
Bundgaard-Nielsen, R. L., Best, C. T., & Tyler, M. D. (2011). Vocabulary size is associated with second-language vowel perception performance in adult learners. Studies in Second Language Acquisition, 33, 433–461.
Cutler, A. (2012). Native listening. Cambridge, MA: MIT Press.
Escudero, P., Hayes-Harb, R., & Mitterer, H. (2008). Novel second-language words and asymmetric lexical access. Journal of Phonetics, 36, 345–360.
Faris, M. M., Best, C. T., & Tyler, M. D. (2016). An examination of the different ways that non-native phones may be perceptually assimilated as uncategorized. The Journal of the Acoustical Society of America, 139(1), EL1–EL5.
Faris, M. M., Best, C. T., & Tyler, M. D. (2018). Discrimination of uncategorised non-native vowel contrasts is modulated by perceived overlap with native phonological categories. Journal of Phonetics, 70, 1–19.
Fenwick, S. E., Best, C. T., Davis, C., & Tyler, M. D. (2017). The influence of auditory-visual speech and clear speech on cross-language perceptual assimilation. Speech Communication, 92, 114–124.
Gerrits, E., & Schouten, M. (2004). Categorical perception depends on the discrimination task. Perception & Psychophysics, 66(3), 363–376.
Goto, H. (1971). Auditory perception by normal Japanese adults of the sounds ‘l’ and ‘r’. Neuropsychologia, 9(3), 317–323.
Guion, S. G., Flege, J. E., Akahane-Yamada, R., & Pruitt, J. C. (2000). An investigation of current models of second language speech perception: The case of Japanese adults’ perception of English consonants. Journal of the Acoustical Society of America, 107, 2711–2724.
Harnsberger, J. D. (2001). On the relationship between identification and discrimination of non-native nasal consonants. Journal of the Acoustical Society of America, 110(1), 489–503.
Hattori, K., & Iverson, P. (2009). English /r/-/l/ category assimilation by Japanese adults: Individual differences and the link to identification accuracy. Journal of the Acoustical Society of America, 125, 469–479.






Iverson, P., Kuhl, P. K., Akahane-Yamada, R., Diesch, E., Tohkura, Y., Kettermann, A., & Siebert, C. (2003). A perceptual interference account of acquisition difficulties for non-native phonemes. Cognition, 87, B47–B57.
Jenkins, J. J., Strange, W., & Polka, L. (1995). Not everyone can tell a ‘rock’ from a ‘lock’: Assessing individual differences in speech perception. In D. J. Lubinski & R. V. Dawis (Eds.), Assessing individual differences in human behavior: New concepts, methods, and findings (pp. 297–325). Palo Alto, CA: Davies-Black.
Levy, E. S. (2009). On the assimilation-discrimination relationship in American English adults’ French vowel learning. The Journal of the Acoustical Society of America, 126(5), 2670–2682.
MacKain, K. S., Best, C. T., & Strange, W. (1981). Categorical perception of English /r/ and /l/ by Japanese bilinguals. Applied Psycholinguistics, 2, 369–390.
Miyawaki, K., Jenkins, J. J., Strange, W., Liberman, A. M., Verbrugge, R., & Fujimura, O. (1975). An effect of linguistic experience: The discrimination of [r] and [l] by native speakers of Japanese and English. Perception & Psychophysics, 18(5), 331–340.
Polka, L. (1991). Cross-language speech perception in adults: Phonemic, phonetic, and acoustic contributions. Journal of the Acoustical Society of America, 89, 2961–2977.
Reid, A., Burnham, D., Kasisopa, B., Reilly, R., Attina, V., Rattanasone, N. X., & Best, C. T. (2015). Perceptual assimilation of lexical tone: The roles of language experience and visual information. Attention, Perception & Psychophysics, 77(2), 571–591.
So, C. K., & Best, C. T. (2010). Cross-language perception of non-native tonal contrasts: Effects of native phonological and phonetic influences. Language and Speech, 53(2), 273–293.
So, C. K., & Best, C. T. (2014). Phonetic influences on English and French listeners’ assimilation of Mandarin tones to native prosodic categories. Studies in Second Language Acquisition, 36(2), 195–221.
Strange, W. (2011).
Automatic selective perception (ASP) of first and second language speech: A working model. Journal of Phonetics, 39, 456–466.
Strange, W., & Shafer, V. L. (2008). Speech perception in second language learners: The re-education of selective perception. In J. G. Hansen Edwards & M. L. Zampini (Eds.), Phonology and second language acquisition (pp. 159–198). Philadelphia: John Benjamins.
Tyler, M. D. (2019). PAM-L2 and phonological category acquisition in the foreign language classroom. In A. M. Nyvad, M. Hejná, A. Højen, A. B. Jespersen, & M. H. Sørensen (Eds.), A sound approach to language matters – In honor of Ocke-Schwen Bohn (pp. 607–630). Aarhus, Denmark: Department of English, School of Communication and Culture, Aarhus University.




Tyler, M. D., Best, C. T., Faber, A., & Levitt, A. G. (2014). Perceptual assimilation and discrimination of non-native vowel contrasts. Phonetica, 71(1), 4–21.
Tyler, M. D., Best, C. T., Goldstein, L. M., & Antoniou, M. (2014). Investigating the role of articulatory organs and perceptual assimilation in infants’ discrimination of native and non-native fricative place contrasts. Developmental Psychobiology, 56, 210–227.
Wayland, R. P., & Guion, S. G. (2004). Training English and Chinese listeners to perceive Thai tones: A preliminary report. Language Learning, 54(4), 681–712.
Weber, A., & Cutler, A. (2004). Lexical competition in non-native spoken-word recognition. Journal of Memory and Language, 50(1), 1–25.
Werker, J. F., & Logan, J. S. (1985). Cross-language evidence for three factors in speech perception. Perception & Psychophysics, 37, 35–44.

chapter 5

The Past, Present, and Future of Lexical Stress in Second Language Speech Production and Perception

Annie Tremblay

5.1 Introduction

Lexical stress poses difficulties to second/foreign-language (L2) learners in both the production and the perception of spoken words. These difficulties can make L2 learners’ pronunciation less intelligible and adversely affect L2 learners’ word comprehension. Much of the research on L2 learners’ production and perception of word stress has sought to explain cross-linguistic variability in L2 learners’ ability to develop target-like representations of stress and encode stress in lexical representations. In speech production, generative approaches examined the effect of the native-language (L1) phonological grammar on the generalizations that L2 learners make with respect to stress placement in the word; these approaches were later challenged by the research of Susan Guion and her colleagues, which investigated how statistical regularities shape the inferences that L2 learners (and native speakers) make when producing stress in novel words. In speech perception and spoken word recognition, phonological approaches such as Peperkamp and Dupoux’s (2002) Stress Parameter Model predicted whether or not L2 learners can encode stress in their lexical representations on the basis of whether or not stress is lexically contrastive in the L1; these approaches were later refined to highlight the importance of phonetic cues that distinguish words in the L1 for determining whether L2 learners can use stress in spoken word recognition.

The present chapter provides a critical review of the existing research on L2 learners’ production and perception of stress. It focuses on the degree to which the aforementioned approaches can explain L1-based variability in L2 learners’ ability to reach target-like generalizations in their stress placement, as reflected in their lexical stress production, and encode stress in their lexical representations, as reflected in their perception and processing of lexical stress (for a similar but briefer review, see Jongman & Tremblay, in press). Accordingly, the chapter is divided into two sections – one section on L2 learners’ stress placement in speech production and one on nonnative listeners’ perception and processing of lexical stress. The chapter concludes with directions for future research on L2 lexical stress.

5.2  L2 Learners’ Stress Placement in Speech Production

L2 learners’ stress placement in speech production was previously studied from a generative phonological perspective, focusing on whether L2 learners who speak different L1s stress the correct or anticipated syllable in real L2 words and in novel words intended to mimic real L2 words (e.g., Archibald, 1992, 1993, 1994, 1997; Kawagoe, 2003; Mairs, 1989; Ou & Ota, 2015; Pater, 1997; Tremblay & Owens, 2010; Van der Pas & Zonneveld, 2004; Youssef & Mazurkewich, 1998). The general prediction from this approach is that L2 learners are more likely to produce the targeted lexical stress pattern if the L1 and L2 have similar phonological generalizations for deriving stress placement.

Archibald (1992, 1993), for example, investigated Polish and Spanish speakers’ stress placement in English. He analyzed the stress systems of participants’ L1 and L2 using the parameters of Metrical Theory proposed by Dresher and Kaye (1990) and predicted L2 learners’ stress placement in the L2 on the basis of L1 parameters. Polish does not have lexically contrastive stress, with words being consistently stressed on the penultimate syllable; in contrast, Spanish and English have lexically contrastive stress, with (regular) words being stressed on the final syllable if it is heavy and otherwise on the penultimate syllable, but with the last syllable of English nouns being extrametrical and therefore ignored for the purpose of assigning stress to a syllable (Dresher & Kaye, 1990; see also Harris, 1983). Using a read-aloud task, Archibald (1992) showed that the incorrectly produced words of Polish-speaking L2 learners of English tended to be stressed on the penultimate syllable. He attributed these results to Polish stress not being related to syllable weight (unlike English stress) and to Polish words not ending in an extrametrical syllable (unlike English nouns).
In a similar task, Archibald (1993) found that the incorrectly produced words of Spanish-speaking L2 learners of English tended to be stressed either on the penultimate syllable or on the final syllable if the latter contained a diphthong or one or more coda consonants. Since most of the incorrectly stressed English words contained a derivational suffix, the author suggested that these results were due to derivational suffixes not being extrametrical in Spanish (for similar results, see Mairs, 1989). Thus, Archibald (1992, 1993) attributed L2 learners’ stress errors to the different generalizations that have been proposed to derive stress placement in the L1 and the L2.

One limitation of using real words to investigate the generalizations underlying L2 learners’ stress placement is that L2 learners’ correct stress productions are not informative – they can be lexicalized on a case-by-case basis and may not reflect the generalizations that L2 learners have made. Pater (1997) avoided this pitfall by instead eliciting the production of English nonwords from French Canadian L2 learners of English. Canadian French was analyzed as not having lexically contrastive stress, with words being stressed only on the final (nonreduced) syllable (Dresher & Kaye, 1990). The targeted nonwords were elicited as nouns in the subject position of a carrier sentence. Pater (1997) reported that French Canadian L2 learners of English most frequently stressed the first syllable of trisyllabic nouns independently of whether any of the syllables in the nonword was heavy, unlike native English speakers, who often stressed nonextrametrical heavy syllables.¹ Importantly, the L2 learners almost never stressed the last syllable of the trisyllabic nonwords, contrary to what Pater (1997) had predicted from his analysis of Canadian French using Dresher and Kaye’s (1990) parameters. These results suggest a lack of complete L1 transfer in these L2 learners’ stress production. Using a similar task with the same population of L2 learners, Tremblay and Owens (2010) also found that French Canadian L2 learners of English tended to stress the first syllable of disyllabic and trisyllabic nonce nouns, independently of syllable structure.
Unlike Archibald’s (1992, 1993) results, Pater’s (1997) and Tremblay and Owens’s (2010) results suggest that L2 learners’ production of lexical stress did not necessarily show strong evidence of L1 transfer. Tremblay and Owens (2010) proposed that L2 learners’ production of initial stress was due to the statistical frequency with which nouns are stressed on the first syllable in English (e.g., Clopper, 2002; Cutler & Carter, 1987); this statistical frequency led them to overgeneralize the word-initial stress pattern to contexts where stress should in fact not be word-initial (e.g., when trisyllabic nouns contained a heavy penultimate syllable). Some researchers, however, later questioned the psychological reality of the generative stress assignment rules proposed to explain stress patterns, at least in English, where stress placement is opaque and often unpredictable, making this framework less than ideal for explaining the generalizations that L2 learners of English make with respect to stress placement (e.g., Davis & Kelly, 1997; Guion, 2005; Guion, Harada, & Clark, 2004; Wayland, Landfair, Li, & Guion, 2006). Guion and colleagues sought to address this concern by investigating L2 learners’ production of lexical stress from a statistical perspective, focusing on whether L2 learners’ production of lexical stress reflects the statistical regularities of stress patterns in the target language. The seminal work of Guion and colleagues has had an important impact on the understanding of how statistical regularities in the input influence L2 learners’ stress placement generalizations.

¹ Pater (1997) also investigated L2 learners’ production of secondary stress in quadrisyllabic words. These results are not discussed here due to space limitations.

Guion et al. (2004) investigated the production of English nonwords by Spanish-speaking “early” L2 learners of English (who learned English at a mean age of 3.7 years) and Spanish-speaking “late” L2 learners of English (who learned English at a mean age of 21.5 years). Statistically, it is more likely that a disyllabic English word will be stressed on the first syllable if it is a noun than if it is a verb (Guion, Clark, Harada, & Wayland, 2003; Kelly & Bock, 1988; Sereno, 1986; Sereno & Jongman, 1995); it is more likely that a syllable in a disyllabic word will be stressed if it contains a diphthong than if it contains a lax vowel (Guion et al., 2003); and it is more likely that a disyllabic English verb will be stressed on the last syllable if it contains two coda consonants than if it contains only one (Guion et al., 2003). Guion et al. (2004) showed that the early L2 learners (and native English speakers) were more likely to stress the first syllable of disyllabic nonwords when the nonword was elicited as a noun than when it was elicited as a verb.
Furthermore, in disyllabic nonwords, the early L2 learners (and native speakers) were more likely to stress syllables that contained a diphthong than syllables that contained a lax vowel, and in disyllabic nonwords elicited as a verb, they were more likely to stress syllables that contained a complex coda than syllables that contained a simple coda. The similar patterns of results found in the early L2 learners and native speakers suggest that the early L2 learners had learned the statistical regularities of stress patterns in English (for similar results with L2 learners of English from a variety of L1 backgrounds and ages of acquisition, see Davis & Kelly, 1997). By contrast, Guion et al.’s (2004) late L2 learners showed an effect of lexical class only when the nonword contained both a lax vowel in the first syllable and a simple coda in the second syllable, and they were more likely to stress a syllable that contained a diphthong than a syllable that contained a lax vowel only when this vowel change occurred in the second syllable of the disyllabic nonwords. It was also the case that the late L2 learners more frequently stressed the first syllable of all words compared to the early L2 learners and native speakers, suggesting an overgeneralization of the word-initial stress pattern similar to those observed in Pater (1997) and Tremblay and Owens (2010). These results were interpreted as suggesting that age of acquisition had an important effect on L2 learners’ ability to extract and learn the statistical regularities of stress patterns from the input, particularly those regularities that relate to syllable weight.

Guion et al. (2004) also reported the results of regression analyses conducted on a subset of the nonwords that participants produced. These analyses examined whether lexical class, syllable structure, and phonological similarity to real English words significantly predicted L2 learners’ and native speakers’ stress patterns. They found that lexical class was the strongest predictor of early L2 learners’ and native speakers’ stress patterns, followed by phonological similarity and syllable structure, whereas phonological similarity was the strongest predictor of the late L2 learners’ results, followed by lexical class; syllable structure was not found to be a significant predictor of the late L2 learners’ results. From these results, Guion et al. (2004) concluded that late L2 learners may rely more on analogy to existing words than early L2 learners, and the statistical patterns related to syllable structures may be more difficult for them to extract and learn.

Guion (2005) conducted a replication of Guion et al.’s (2004) study but with early and late Korean-speaking L2 learners of English. Unlike English and Spanish, Korean does not have lexically contrastive stress.
Guion (2005) sought to determine whether Korean speakers could extract and learn the statistical regularities of stress patterns from the English input despite the lack of lexically contrastive stress in the L1. Her production results were similar to those of Guion et al. (2004), with late Korean-speaking L2 learners of English showing weaker effects of lexical class and vowel length compared to early Korean-speaking L2 learners of English (neither group of L2 learners showed an effect of coda consonant). The results of the regression analyses conducted on a subset of the produced nonwords indicated that syllable structure, lexical class, and phonological similarity to real English words predicted the early L2 learners’ stress productions, whereas phonological similarity to real English words and syllable structure predicted the late L2 learners’ stress productions. The author hypothesized that the nonsignificance of the lexical class predictor may be due to Korean listeners’ exposure to phrasal rather than lexical prosody in the L1, leading them to have difficulty making stress placement generalizations at a lexical level.

Wayland et al. (2006) additionally investigated whether Thai-speaking L2 learners of English could also show sensitivity to the distributional properties of English lexical stress. Unlike English and Spanish and like Korean, Thai does not have lexical stress; however, unlike English, Spanish, and Korean, Thai has lexical tones, with the distribution of lexical tones being conditioned by whether or not syllables contain a short or a long vowel. Wayland et al. (2006) aimed to establish whether the presence of lexical tones in Thai – in particular the relationship between lexical tones and vowel length – would enhance Thai speakers’ ability to extract and learn the relationship between syllable structure and stress in English. Thai-speaking late L2 learners of English completed the same experiments as in Guion et al. (2004) and Guion (2005). The results showed a weak effect of lexical class in the productions of Thai-speaking L2 learners of English. Importantly, unlike the Spanish-speaking and Korean-speaking late L2 learners of English in, respectively, Guion et al. (2004) and Guion (2005), Thai-speaking L2 learners of English were more likely to stress a syllable that contained a diphthong than a syllable that contained a lax vowel (however, they did not show an effect of coda consonant). This sensitivity to vowel length was attributed to the relationship between lexical tones and vowel length in Thai. Furthermore, unlike the results of Guion et al. (2004) and Guion (2005), the regression analyses conducted in Wayland et al. (2006) on a subset of the produced nonwords revealed that phonological similarity to real English words was the only predictor of Thai speakers’ stress placement.
The authors reconciled these results with those of the overall analysis by suggesting that the effect of phonological similarity to real English words on Thai speakers’ stress productions was much larger than that of syllable structure (the latter evidenced only in the analyses of all L2 learners’ productions). In summary, research on L2 learners’ production of lexical stress suggests that L2 learners can, but do not necessarily, transfer the generalizations underlying stress placement from the L1 to the L2. The research conducted by Guion and colleagues, in particular, has provided critical insights into input-related factors that can influence L2 learners’ generalizations with respect to stress placement. This research suggests that L2 learners are, to some degree, able to learn the statistical regularities that relate to stress placement in the L2, but (1) late L2 learners show weaker sensitivity to distributional properties of stress patterns compared to early



The Past, Present, and Future of Lexical Stress



L2 learners; (2) which property late L2 learners attend to appears to be contingent on the functional relevance of the corresponding property in the L1; and (3) late L2 learners’ generalizations are largely driven by phonological analogies to real words.

5.3 Nonnative Listeners’ Perception and Processing of Lexical Stress

Lexical stress poses difficulties to L2 learners not only for identifying which syllable in the word should be stressed but also for perceiving stress and using it in spoken word recognition. One approach that sought to explain L1 influences on adults’ perception and processing of lexical stress is the phonological approach. This approach stipulates that whether or not nonnative listeners can perceive and process lexical stress is determined by whether stress is represented phonologically as part of listeners’ L1 lexical representations; more precisely, nonnative listeners are more likely to show sensitivity to lexical stress in speech perception and spoken word recognition if stress is lexically contrastive in the L1 (i.e., if different L1 words have different stress patterns) than if it is not lexically contrastive (e.g., Dupoux, Pallier, Sebastián, & Mehler, 1997; Dupoux, Peperkamp, & Sebastián-Gallés, 2001; Dupoux, Sebastián-Gallés, Navarrete, & Peperkamp, 2008; C. Y. Lin, Wang, Idsardi, & Xu, 2014; Peperkamp, 2004; Peperkamp & Dupoux, 2002; Peperkamp, Vendelin, & Dupoux, 2010; Tremblay, 2008, 2009). This approach received support from a number of experimental studies. Dupoux and colleagues investigated the perception of lexical stress in Spanish nonwords by native Spanish listeners and native French listeners without knowledge of Spanish (Dupoux et al., 1997; Dupoux et al., 2001). Whereas Spanish words differ in their stress patterns (e.g., Harris, 1983), French words have their final syllable accented in phrase-final position (e.g., Jun & Fougeron, 2000, 2002).2 In AX perception and sequence recall tasks, Dupoux et al. (1997) and Dupoux et al. (2001) found that native French listeners were less accurate than native Spanish listeners when attempting to perceive stress in Spanish nonwords produced by different Spanish talkers.
French listeners’ “stress deafness” (to use Dupoux and colleagues’ terminology) was attributed to French

2. French does not have lexical stress; prosodic prominence is instead phrasal, with the last nonreduced syllable in the phrase being perceived as more prominent than the preceding syllables (e.g., Jun & Fougeron, 2000, 2002).




not having lexically contrastive stress (Dupoux et al., 1997; Dupoux et al., 2001). To explain these findings, Peperkamp and Dupoux (2002) proposed the Stress Parameter Model (see also Peperkamp, 2004). This model stipulates that listeners who are exposed to a language in which stress is lexically contrastive (e.g., Spanish) during the first two years of their life set the Stress Parameter to encode (i.e., represent) stress phonologically in their lexical representations; by contrast, listeners who are exposed to a language where stress is not lexically contrastive (e.g., French, Finnish) do not encode stress phonologically in their lexical representations. For listeners whose L1 does not have lexically contrastive stress, the model further predicts sensitivity to lexical stress as a function of whether the L1 prosodic system treats content and function words differently. To illustrate, the first syllable of the first content word in a phrase is “stressed” in Hungarian (e.g., Vago, 1980), and the penultimate syllable of every content word is stressed in Polish – with a number of exceptions (e.g., Comrie, 1967). The model predicts that speakers of these languages will experience a lower degree of stress deafness compared to speakers of languages where prosodic generalizations do not depend on the lexical status of words (e.g., French, Finnish; for such results, see Peperkamp et al., 2010).3 To determine whether French listeners’ ability to perceive lexical stress improves once they reach a higher level of proficiency in a language with lexical stress, Dupoux et al. (2008) examined the perception of stress in French-speaking L2 learners of Spanish at different levels of proficiency in Spanish.
In sequence recall and speeded lexical decision tasks, the authors found that French-speaking L2 learners of Spanish performed similarly to native French listeners without knowledge of Spanish and significantly worse than native Spanish listeners when recalling the stress pattern of the nonwords they heard, and they performed significantly worse than native Spanish listeners when making a lexical decision about nonwords that differed from Spanish words only in their stress patterns. Importantly, increased Spanish proficiency did not improve French listeners’ performance on the tasks. To establish whether age of acquisition is a critical factor in being able to perceive stress, Dupoux, Peperkamp, and Sebastián-Gallés (2010)

3. For electrophysiological evidence that Polish listeners are not “stress deaf” when they hear violations of both canonical and exceptional Polish stress patterns, see Domahs, Knaus, Orzechowska, and Wiese (2012).






tested French-Spanish simultaneous bilinguals whose dominant language was French or Spanish. Using the same tasks as in Dupoux et al. (2008), Dupoux et al. (2010) showed that French-dominant simultaneous bilinguals and French-speaking L2 learners of Spanish performed similarly, and both groups performed significantly worse than Spanish-dominant simultaneous bilinguals and monolingual Spanish listeners, who in turn performed similarly. In light of these findings, Dupoux et al. (2010) proposed that the Stress Parameter is set to encode stress phonologically in lexical representations only if the language with lexical stress is learned from birth and is the dominant language. Support for the phonological approach to the study of L2 stress perception was also provided by additional language pairings. For example, using a word-identification task, Tremblay (2008) showed that French Canadian L2 learners of English were significantly less accurate than native English listeners at selecting the continuation of a stressed or unstressed word fragment they heard. Likewise, in an AXB perception task, Tremblay (2009) found that French Canadian L2 learners of English were less successful than native English listeners at perceiving lexical stress in different recordings of English nonwords. C. Y. Lin et al. (2014) reported similar findings for Korean listeners. (Seoul) Korean is a language without lexical stress (e.g., Jun, 2005). In sequence recall and lexical decision tasks, Korean-speaking L2 learners of English had significantly more difficulty recalling nonwords that differed in stress compared to native English listeners. In contrast to Korean, (Standard) Mandarin has both lexical stress and lexical tones (e.g., Chao, 1968; Duanmu, 2007). C. Y. Lin et al. (2014) found that Mandarin-speaking L2 learners of English outperformed Korean-speaking L2 learners of English in their ability to recall sequences of nonwords that differed in stress.
These findings suggest that whether or not stress is lexically contrastive in the L1 has an important influence on L2 learners’ ability to perceive and process stress. Despite the support that the phonological approach has received, its predictions are somewhat coarse. A phonetic approach that focuses on the specific cues that distinguish words from one another in the L1 and in the L2 may offer a more refined account of L2 learners’ perception of stress and use of stress in spoken word recognition. Such an approach stipulates that adult listeners’ ability to perceive and learn stress is also influenced by the degree to which the acoustic cues that signal stress in the L1 (e.g., fundamental frequency [F0], duration, intensity, vowel quality) convey lexical contrasts in the L1 (e.g.,



Annie Tremblay

Chrabaszcz, Winn, Lin, & Idsardi, 2014; Cooper, Cutler, & Wales, 2002; C. Y. Lin et al., 2014; Ortega-Llebaria, Gu, & Fan, 2013; Qin, Chien, & Tremblay, 2017; Rahmani, Rietveld, & Gussenhoven, 2015; Zhang & Francis, 2010). An increasingly large number of studies have provided empirical support for this approach (for a similar approach to the use of prosodic cues in L2 speech segmentation, see Tremblay, Broersma, & Coughlin, 2018). Cooper et al. (2002), for example, investigated Dutch and English listeners’ use of lexical stress in spoken word recognition when stress was not cued by segmental information. English and Dutch both have lexically contrastive stress, but vowels in unstressed syllables are more reduced in English than in Dutch (e.g., Sluijter & van Heuven, 1996). Native English listeners and Dutch L2 learners of English heard stressed or unstressed English word fragments that did not differ in their vowel quality and chose one of two word continuations for each fragment (experiment 3). The results showed that Dutch listeners were more accurate than English listeners at selecting the right word continuation. Since the word fragments all contained full vowels, listeners were forced to rely on suprasegmental cues such as F0, duration, and intensity to decide whether the fragment was stressed and to select the correct continuation for that fragment. Dutch listeners’ higher accuracy on the task was attributed to their greater sensitivity to the suprasegmental cues to stress compared to English listeners (see also van Heuven & de Jonge, 2011). These results suggest that English and Dutch listeners’ reliance on suprasegmental cues to lexical stress differs, providing some support for a phonetic approach to the study of L2 learners’ perception and processing of lexical stress. C. Y. Lin et al. (2014), discussed earlier, also conducted a lexical decision experiment whose results can be interpreted within a phonetic approach.
More precisely, they showed that native English listeners, but not Korean- or Mandarin-speaking L2 learners of English, were more likely to reject incorrectly stressed English nonwords if the incorrect stress affected vowel quality. The authors attributed the Korean listeners’ results to the absence of vowel reduction in their L1 and the Mandarin listeners’ results to the fact that reduced vowels cannot occur in word-initial syllables in Mandarin (many of the stimuli in the lexical decision experiment had a vowel quality change in the first syllable). Zhang and Francis (2010) provided stronger evidence for a phonetic approach to the L2 perception and processing of lexical stress. In a word-identification task with auditory stimuli in which segmental and suprasegmental cues to lexical stress were independently manipulated, the authors






found that native English listeners and Mandarin-speaking L2 learners of English relied more on vowel quality than on F0, duration, or intensity when recognizing English words that differed in stress; however, Mandarin listeners showed a greater relative reliance on F0 than did English listeners. These results were attributed to the importance of lexical tones, cued primarily by F0, in Mandarin (e.g., Gandour, 1983; Howie, 1976). Ortega-Llebaria et al. (2013) also provided support for a phonetic approach to the study of L2 stress perception and processing. They tested native Spanish listeners and English L2 learners of Spanish on their ability to perceive lexical stress in a sentential context. Stressed syllables in prenuclear position in Spanish have a posttonic F0 rise (e.g., Hualde, 2005; Prieto, van Santen, & Hirschberg, 1995); L2 learners thus need to associate this F0 rise with the (stressed) syllable preceding it. The duration ratio of stressed to unstressed syllables is also larger in English than in Spanish (e.g., Delattre, 1966), due in part to the occurrence of vowel reduction in English but not in Spanish. Participants listened to declarative sentences and identified the stress pattern of a word in prenuclear position. The results showed that English L2 learners of Spanish perceived syllables with an F0 rise as being stressed, unlike native Spanish listeners, who perceived the syllable preceding the F0 rise as being stressed. The L2 learners also made greater use of duration cues than did the native listeners. English listeners thus transferred the use of F0 and duration cues from the perception of English lexical stress to the perception of Spanish lexical stress.
These results also provide strong evidence that the presence of lexical stress in both the L1 and the L2 does not guarantee that L2 learners will correctly perceive stress in the L2; L2 learners must also learn the acoustic cues associated with lexical stress in the L2 in order to perceive stress accurately. Further evidence that L2 learners’ perception of stress is contingent on the cues that signal stress in the L1 was provided by Chrabaszcz et al. (2014). Using a stress perception task with nonwords, the authors showed that native English listeners, Russian-speaking L2 learners of English, and Mandarin-speaking L2 learners of English differed in their reliance on suprasegmental cues to stress: English and Mandarin listeners weighted F0 cues more heavily than duration and intensity cues, whereas Russian listeners showed the opposite pattern of results. These results were attributed to the participants’ L1, with F0 not being a reliable cue to lexical stress in Russian, unlike English and Mandarin. Rahmani et al. (2015) also provided support for a cue-based approach to L2 learners’ perception of lexical stress by investigating Dutch, French,



Annie Tremblay

Indonesian, Japanese, and Persian listeners’ perception of stress in Spanish-like nonwords. Japanese does not have lexical stress, but it has lexical pitch accents, with words differing in their tonal (i.e., pitch) patterns; in contrast, Persian does not have lexical stress or lexical pitch accents, and neither does Indonesian (for more details on the prosodic systems of each of these languages, see Rahmani et al., 2015). Using a sequence recall task with nonwords that differed in their stress patterns, the authors found that Dutch and Japanese listeners outperformed Persian, Indonesian, and French listeners. The authors interpreted their results as suggesting that listeners can perceive lexical stress only if the L1 encodes prosodic markings at a lexical level. The authors’ explanation is more phonological than phonetic, but it yields the same predictions as a phonetic, cue-based approach, with Japanese listeners’ use of pitch to distinguish L1 words enabling them to perceive and process L2 stress. Likewise, Qin et al. (2017) demonstrated that the prosodic cues that signal lexical contrasts in the L1 modulate listeners’ processing of lexical stress in the L2. They investigated whether Standard Mandarin and Taiwan Mandarin L2 learners of English would differ in their ability to use F0 and duration cues in the perception of English nonwords that differed in their stress pattern. Standard Mandarin, the dialect of Mandarin spoken in Beijing, China, has been proposed to have lexical stress contrasts (e.g., Chao, 1968; Duanmu, 2007), whereas Taiwan Mandarin, the dialect of Mandarin spoken in Taiwan, has been proposed not to have such contrasts (e.g., Kubler, 1985; Swihart, 2003). In both varieties of Mandarin, F0 is the primary cue to lexical tones (e.g., Gandour, 1983; Howie, 1976), and in Standard Mandarin, duration is the primary cue to lexical stress (e.g., T. Lin, 1985).
In a sequence recall task, the authors found that Standard Mandarin L2 learners of English made greater use of duration cues when perceiving English lexical stress than did Taiwan Mandarin L2 learners of English, and both L2 groups made less use of these cues than did native English listeners. Crucially, when lexical stress was realized with conflicting F0 and duration cues, both L2 groups relied more on F0 cues than on duration cues when perceiving English lexical stress, unlike native English listeners, who showed similar reliance on both types of cues. These results were interpreted as suggesting that these listeners transfer the use of cues to lexical contrasts from the L1 to the L2, and do so even across phonological phenomena (i.e., from the perception of lexical tones in the L1 to the perception of lexical stress in the L2). These findings thus provide further support for a cue-based, phonetic approach to the perception of L2 lexical stress.
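Cue-weighting comparisons of this kind are typically quantified by regressing listeners’ binary stress judgments on the manipulated acoustic cues and comparing the fitted coefficients. The sketch below is purely illustrative and not drawn from any of the studies cited: it simulates a hypothetical F0-reliant listener (all weights and data are invented) and recovers the relative cue weights with a minimal hand-rolled logistic regression.

```python
# Illustrative sketch only: estimating a listener's relative reliance on two
# acoustic cues to stress (F0 and duration) via logistic regression.
# The simulated "listener" and all weights are hypothetical, not from the chapter.
import math
import random

random.seed(1)

def simulate_responses(n, w_f0, w_dur):
    """Simulate binary 'first syllable stressed' judgments for random cue values."""
    trials = []
    for _ in range(n):
        f0 = random.uniform(-1, 1)   # standardized F0 difference between syllables
        dur = random.uniform(-1, 1)  # standardized duration difference
        p = 1 / (1 + math.exp(-(w_f0 * f0 + w_dur * dur)))
        trials.append((f0, dur, 1 if random.random() < p else 0))
    return trials

def fit_logistic(trials, lr=0.5, epochs=200):
    """Fit intercept and cue weights by batch gradient ascent on the log-likelihood."""
    b0 = b_f0 = b_dur = 0.0
    n = len(trials)
    for _ in range(epochs):
        g0 = g1 = g2 = 0.0
        for f0, dur, y in trials:
            p = 1 / (1 + math.exp(-(b0 + b_f0 * f0 + b_dur * dur)))
            err = y - p
            g0 += err
            g1 += err * f0
            g2 += err * dur
        b0 += lr * g0 / n
        b_f0 += lr * g1 / n
        b_dur += lr * g2 / n
    return b_f0, b_dur

# A hypothetical F0-reliant listener (Mandarin-like cue weighting).
trials = simulate_responses(2000, w_f0=3.0, w_dur=1.0)
w_f0_hat, w_dur_hat = fit_logistic(trials)
print(round(w_f0_hat, 2), round(w_dur_hat, 2))  # recovered weights favor F0
```

For a listener who weighted duration more heavily, the recovered coefficients would reverse; actual studies estimate such weights over many listeners with mixed-effects models rather than this single-subject toy fit.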






Another study that shed crucial light on the importance of specific acoustic cues in investigating L2 learners’ processing of lexical stress is that of Connell et al. (2018). The authors investigated the use of lexical stress in spoken word recognition by native English listeners, Korean-speaking L2 learners of English, and Mandarin-speaking L2 learners of English. As mentioned earlier, (Seoul) Korean does not have lexical stress (e.g., Jun, 2005). However, because English stress is cued by vowel quality, Korean listeners may perceive the stressed and unstressed vowels in English words as different Korean vowels and thus be able to perceive stress with such cues. Connell et al. (2018) indeed demonstrated that Korean-speaking L2 learners of English could use stress to recognize English words when the target and competitor words differed in vowel quality cues to stress, but not when vowel quality cues to stress were absent.4 By contrast, Mandarin-speaking L2 learners of English could use stress to recognize English words when the target and competitor words did not differ in vowel quality cues, suggesting that Mandarin listeners could rely on suprasegmental cues alone in their perception of stress. The results of Connell et al. (2018) were attributed to listeners’ transfer of acoustic cues to lexical contrasts from the L1 to the L2. All in all, research on nonnative listeners’ perception and processing of lexical stress suggests that their success at perceiving stress is predicted not only by whether stress is lexically contrastive in the L1, but also by what prosodic cues signal lexical contrasts in the L1. Importantly, only the latter approach can explain why listeners from L1 backgrounds where tonal information is lexically contrastive (e.g., Japanese, Mandarin) can successfully perceive lexical stress in the L2 (e.g., Qin et al., 2017; Rahmani et al., 2015).
Further research that focuses on the transfer of specific acoustic cues from the L1 to the L2 is needed in order to refine the predictions of the phonetic approach.

5.4 Conclusions and Future Directions

Past (or earlier) studies on L2 learners’ production of stress and on nonnative listeners’ perception and processing of lexical stress have for the most part adopted phonological approaches, predicting target-like

4. Note that these results differ from those of C. Y. Lin et al.’s (2014) lexical decision task, which showed that Korean listeners were not more successful at rejecting incorrectly stressed English nonwords when the incorrect stress pattern led to vowel quality changes in the word.




stress placement generalizations and accurate stress perception and processing on the basis of whether or not the L1 and L2 had similar stress assignment rules and whether both languages had lexically contrastive stress (respectively). Present (or more recent) studies have shifted their focus from strictly phonological approaches to approaches that are more statistical (in stress production) and/or phonetic (in stress perception) in nature, examining whether L2 learners can learn particular associations between linguistic properties of words (e.g., word class, syllable structure) and stress placement (in production) and between acoustic cues in words and stress patterns (in perception). These approaches have shown that associations that are functionally relevant in the L1 (i.e., relevant for determining which syllable in a word is stressed and for distinguishing among different words to be perceived) have a strong influence on L2 learners’ production and perception of stress. Future research on L2 learners’ production and perception of lexical stress should seek to further refine these more recent approaches. For speech perception, in particular, it will be important to test the limits of listeners’ transfer of acoustic cues from the L1 to the perception of lexical stress in the L2. One example serves to illustrate this point. Connell et al. (2018) interpreted Korean listeners’ successful perception of lexical stress in the presence of vowel quality cues as suggesting that Korean listeners might have assimilated full and reduced English vowels to different Korean vowels, these segmental cues thus enabling Korean listeners to process English lexical stress. One question that arises is whether suprasegmental cues could also transfer from the perception of segmental contrasts in the L1 to the perception of lexical stress in the L2.
Seoul Korean has a three-way laryngeal stop contrast, with lenis and aspirated stops differing primarily in F0 rather than in voice-onset time (e.g., Kang, 2014). Korean listeners’ difficulty in the perception of suprasegmental cues to lexical stress (e.g., Connell et al., 2018; C. Y. Lin et al., 2014) may therefore be interpreted as evidence against a strong view of acoustic cue transfer from the L1 to the L2. However, Korean dialects vary in the cues that signal the three-way laryngeal stop contrast (e.g., Lee & Jongman, 2019), and studies on Korean listeners’ processing of lexical stress have generally not tightly controlled for Korean listeners’ native dialect. Further research should provide such control in order to test the limits of the cue transfer hypothesis and refine phonetic, cue-based approaches to nonnative listeners’ perception and processing of lexical stress.






References

Archibald, J. (1992). Transfer of L1 parameter settings: Some empirical evidence from Polish metrics. Canadian Journal of Linguistics, 37, 301–339.
Archibald, J. (1993). Learnability of English metrical parameters by adult Spanish speakers. International Review of Applied Linguistics, 31/32, 129–142.
Archibald, J. (1994). A formal model of learning L2 prosodic phonology. Second Language Research, 10, 215–240.
Archibald, J. (1997). The acquisition of English stress by speakers of nonaccentual languages: Lexical storage versus computation of stress. Linguistics, 35, 167–181.
Chao, Y.-R. (1968). A grammar of spoken Chinese. Oakland: University of California Press.
Chrabaszcz, A., Winn, M., Lin, C. Y., & Idsardi, W. J. (2014). Acoustic cues to perception of word stress by English, Mandarin, and Russian speakers. Journal of Speech, Language, and Hearing Research, 57, 1468–1479. doi:10.1044/2014_JSLHR-L-13-0279
Clopper, C. G. (2002). Frequency of stress patterns in English: A computational analysis. Indiana University Linguistics Club Working Papers Online, 2. Retrieved from www.indiana.edu/iulcwp
Comrie, B. (1967). Irregular stress in Polish and Macedonian. International Review of Slavic Linguistics, 1, 227–240.
Connell, K., Hüls, S., Martínez-García, M. T., Qin, Z., Shin, S., Yan, H., & Tremblay, A. (2018). English learners’ use of segmental and suprasegmental cues to stress in lexical access: An eye-tracking study. Language Learning, 68, 635–668.
Cooper, N., Cutler, A., & Wales, R. (2002). Constraints of lexical stress on lexical access in English: Evidence from native and non-native listeners. Language and Speech, 45, 207–228. doi:10.1177/00238309020450030101
Cutler, A., & Carter, D. M. (1987). The predominance of strong initial syllables in the English vocabulary. Computer Speech and Language, 2, 133–142. doi:10.1016/0885-2308(87)90004-0
Davis, S. M., & Kelly, M. H. (1997). Knowledge of the English noun-verb stress difference by native and nonnative speakers. Journal of Memory and Language, 36, 445–460. doi:10.1006/jmla.1996.2503
Delattre, P. (1966). A comparison of syllable length conditioning among languages. International Review of Applied Linguistics in Language Teaching, 4, 183–198.
Domahs, U., Knaus, J., Orzechowska, P., & Wiese, R. (2012). Stress “deafness” in a language with fixed word stress: An ERP study on Polish. Frontiers in Psychology, 3, 439. doi:10.3389/fpsyg.2012.00439
Dresher, B. E., & Kaye, J. D. (1990). A computational learning model for metrical phonology. Cognition, 34, 137–195. doi:10.1016/0010-0277(90)90042-i
Duanmu, S. (2007). The phonology of standard Chinese. Oxford: Oxford University Press.




Dupoux, E., Pallier, C., Sebastián, N., & Mehler, J. (1997). A destressing “deafness” in French? Journal of Memory and Language, 36, 406–421. doi:10.1006/jmla.1996.2500
Dupoux, E., Peperkamp, S., & Sebastián-Gallés, N. (2001). A robust method to study stress “deafness.” Journal of the Acoustical Society of America, 110, 1606–1618. doi:10.1121/1.1380437
Dupoux, E., Peperkamp, S., & Sebastián-Gallés, N. (2010). Limits on bilingualism revisited: Stress “deafness” in simultaneous French-Spanish bilinguals. Cognition, 114, 266–275. doi:10.1016/j.cognition.2009.10.001
Dupoux, E., Sebastián-Gallés, N., Navarrete, E., & Peperkamp, S. (2008). Persistent stress “deafness”: The case of French learners of Spanish. Cognition, 106, 682–706. doi:10.1016/j.cognition.2007.04.001
Gandour, J. T. (1983). Tone perception in Far Eastern languages. Journal of Phonetics, 11, 149–175.
Guion, S. G. (2005). Knowledge of English word stress patterns in early and late Korean-English bilinguals. Studies in Second Language Acquisition, 27, 503–533. doi:10.1017/s0272263105050230
Guion, S. G., Clark, J. J., Harada, T., & Wayland, R. P. (2003). Factors affecting stress placement for English non-words include syllabic structure, lexical class, and stress patterns of phonologically similar words. Language and Speech, 46, 403–427. doi:10.1177/00238309030460040301
Guion, S. G., Harada, T., & Clark, J. J. (2004). Early and late Spanish–English bilinguals’ acquisition of English word stress patterns. Bilingualism: Language and Cognition, 7, 207–226. doi:10.1017/s1366728904001592
Harris, J. W. (1983). Syllable structure and stress in Spanish: A nonlinear analysis. Cambridge: Cambridge University Press.
Howie, J. M. (1976). Acoustical studies of Mandarin vowels and tones. Cambridge: Cambridge University Press.
Hualde, J. I. (2005). The sounds of Spanish. Cambridge: Cambridge University Press.
Jongman, A., & Tremblay, A. (in press). Word prosody in L2. In A. Chen & C. Gussenhoven (Eds.), The Oxford handbook of language prosody. Oxford: Oxford University Press.
Jun, S.-A. (2005). Korean intonational phonology and prosodic transcription. In S.-A. Jun (Ed.), Prosodic typology: The phonology of intonation and phrasing (pp. 201–229). Oxford: Oxford University Press.
Jun, S.-A., & Fougeron, C. (2000). A phonological model of French intonation. In A. Botinis (Ed.), Intonation: Analysis, modeling and technology (pp. 209–242). Dordrecht: Kluwer Academic.
Jun, S.-A., & Fougeron, C. (2002). Realizations of accentual phrase in French intonation. Probus, 14, 147–172. doi:10.1515/prbs.2002.002
Kang, Y. (2014). Voice onset time merger and development of tonal contrast in Seoul Korean stops: A corpus study. Journal of Phonetics, 45, 76–90.
Kawagoe, I. (2003). Acquisition of English word stress by Japanese learners. In J. M. Liceras, H. Zobl, & H. Goodluck (Eds.), Proceedings of the 6th Generative Approaches to Second Language Acquisition Conference (GASLA 2002): L2 Links (pp. 161–167). Somerville, MA: Cascadilla Proceedings Project.
Kelly, M. H., & Bock, J. K. (1988). Stress in time. Journal of Experimental Psychology: Human Perception and Performance, 14, 389–403.
Kubler, C. (1985). The influence of Southern Min on the Mandarin of Taiwan. Anthropological Linguistics, 27, 156–176.
Lee, H., & Jongman, A. (2019). Effects of sound change on the weighting of acoustic cues to the three-way laryngeal stop contrast in Korean: Diachronic and dialectal comparisons. Language and Speech. doi:10.1177/0023830918786305
Lin, C. Y., Wang, M., Idsardi, W. J., & Xu, Y. (2014). Stress processing in Mandarin and Korean second language learners of English. Bilingualism: Language and Cognition, 17, 316–346. doi:10.1017/s1366728913000333
Lin, T. (1985). Tantao Beijinghua qingyin xingzhi de chubu shiyan [On neutral tone in Beijing Mandarin]. In S. Hu (Ed.), Beijing yuyin shiyanlu [Working papers in experimental phonetics] (pp. 1–26). Beijing: Peking University Press.
Mairs, J. L. (1989). Stress assignment in interlanguage phonology: An analysis of the stress system of Spanish speakers learning English. In S. M. Gass & J. Schachter (Eds.), Linguistic perspectives on second language acquisition (pp. 260–283). Cambridge: Cambridge University Press.
Ortega-Llebaria, M., Gu, H., & Fan, J. (2013). English speakers’ perception of Spanish lexical stress: Context-driven L2 stress perception. Journal of Phonetics, 41, 186–197. doi:10.1016/j.wocn.2013.01.006
Ou, S.-C., & Ota, M. (2015). Is second-language stress acquisition guided by metrical principles? Evidence from Mandarin-speaking learners of English. In Y. E. Hsiao & L.-H. Wee (Eds.), Capturing phonological shades within and across languages (pp. 389–413). Newcastle upon Tyne, England: Cambridge Scholars.
Pater, J. V. (1997). Metrical parameter missetting in second language acquisition. In S. J. Hannahs & M. Young-Scholten (Eds.), Focus on phonological acquisition (pp. 235–261). Amsterdam: John Benjamins.
Peperkamp, S. (2004). Lexical exceptions in stress systems: Arguments from early language acquisition and adult speech perception. Language, 80, 98–126.
Peperkamp, S., & Dupoux, E. (2002). A typological study of stress deafness. In C. Gussenhoven (Ed.), Proceedings of laboratory phonology 7 (pp. 203–240). Berlin: Mouton de Gruyter.
Peperkamp, S., Vendelin, I., & Dupoux, E. (2010). Perception of predictable stress: A cross-linguistic investigation. Journal of Phonetics, 38, 422–430. doi:10.1016/j.wocn.2010.04.001
Prieto, P., van Santen, J., & Hirschberg, J. (1995). Tonal alignment patterns in Spanish. Journal of Phonetics, 23, 429–451. doi:10.1006/jpho.1995.0032
Qin, Z., Chien, Y.-F., & Tremblay, A. (2017). Processing of word-level stress by Mandarin-speaking second language learners of English. Applied Psycholinguistics, 38, 541–570. doi:10.1017/s0142716416000321




Rahmani, H., Rietveld, T., & Gussenhoven, C. (2015). Stress “deafness” reveals absence of lexical marking of stress or tone in the adult grammar. PLoS One, 10(12), e0143968. doi:10.1371/journal.pone.0143968
Sereno, J. A. (1986). Stress pattern differentiation of form class in English. Journal of the Acoustical Society of America, 79, S36. doi:10.1121/1.2023191
Sereno, J. A., & Jongman, A. (1995). Acoustic correlates of grammatical class. Language and Speech, 38, 57–76.
Sluijter, A. M. C., & van Heuven, V. J. (1996). Spectral balance as an acoustic correlate of linguistic stress. Journal of the Acoustical Society of America, 100, 2471–2485.
Swihart, D. A. W. (2003). The two Mandarins: Putonghua and Guoyu. Journal of the Chinese Language Teachers Association, 38, 103–118.
Tremblay, A. (2008). Is second language lexical access prosodically constrained? Processing of word stress by French Canadian second language learners of English. Applied Psycholinguistics, 29, 553–584. doi:10.1017/s0142716408080247
Tremblay, A. (2009). Phonetic variability and the variable perception of L2 word stress by French Canadian listeners. International Journal of Bilingualism, 13, 35–62. doi:10.1177/1367006909103528
Tremblay, A., Broersma, M., & Coughlin, C. E. (2018). The functional weight of a prosodic cue in the native language predicts the learning of speech segmentation in a second language. Bilingualism: Language and Cognition, 21(3), 640–652. doi:10.1017/s136672891700030x
Tremblay, A., & Owens, N. (2010). The role of acoustic cues in the development of (non-)target-like second-language prosodic representations. Canadian Journal of Linguistics, 55, 84–114. doi:10.1353/cjl.0.0067
Vago, R. M. (1980). The sound pattern of Hungarian. Washington, DC: Georgetown University Press.
Van der Pas, B., & Zonneveld, W. (2004). L2 parameter resetting for metrical systems. Linguistic Review, 21, 125–170.
van Heuven, V. J., & de Jonge, M. (2011). Spectral and temporal reduction as stress cues in Dutch. Phonetica, 68(3), 120–132. doi:10.1159/000329900
Wayland, R., Landfair, D., Li, B., & Guion, S. G. (2006). Native Thai speakers’ acquisition of English word stress patterns. Journal of Psycholinguistic Research, 35(3), 285–304. doi:10.1007/s10936-006-9016-9
Youssef, A., & Mazurkewich, I. (1998). The acquisition of English metrical parameters and syllable structure by adult speakers of Egyptian Arabic (Cairene dialect). In S. Flynn, G. Martohardjono, & W. A. O’Neil (Eds.), The generative study of second language acquisition (pp. 303–332). Mahwah, NJ: Lawrence Erlbaum Associates.
Zhang, Y., & Francis, A. (2010). The weighting of vowel quality in native and non-native listeners’ perception of English lexical stress. Journal of Phonetics, 38, 260–271. doi:10.1016/j.wocn.2009.11.002

Part II

Segmental Acquisition

Chapter 6

English Obstruent Perception by Native Mandarin, Korean, and English Speakers

Yen-Chen Hao and Kenneth de Jong*

* We thank Dr. Mi-Hee Cho for her help in collecting the Korean data, and Dr. Janice Fon for her help in collecting the Mandarin data. This work was supported by the NSF (grant BCS-04406540).

6.1 Introduction

A large majority of studies in the phonetics of second language acquisition focus on the role of learners’ native language (L1) as the locus of divergence of their capabilities from those of native speakers of the language being learned (L2). Many phonetically and linguistically oriented studies of a speaker’s performance in a second language have been built around models of second language learning, such as the Speech Learning Model (SLM; Flege, 1987, 1995), or cross-language perception and learning, such as the Perceptual Assimilation Model (PAM; Best, 1995; Best & Tyler, 2007). It is not an overstatement to say that these models dominate the research field in second language sound acquisition. One striking similarity between these two models is that they seek to determine how first language experience impacts the learning of an L2. The SLM maintains that the learnability of L2 sounds depends on their similarity to L1 categories: the greater the dissimilarity between an L2 sound and its closest L1 category, the more likely it is for L2 learners to establish a new phonetic category for the L2 sound and attain native-like accuracy. As for L2 sounds that are perceived to be similar to L1 categories, learners are more likely to perceive and produce these L2 sounds as their L1 phones, which may be phonetically different from the L2 targets. On the other hand, the PAM was designed to generate hypotheses about the ease of distinguishing nonnative contrasts based on the assimilation of these contrasts to the existing ones in the L1. For instance, if two L2 sounds are mapped onto two different L1 categories, the discrimination of this L2 sound pair will be good; if two L2 sounds are mapped onto the same L1 category, the discrimination of this L2 sound pair will be poor. In practice, then, most research in L2 phonological






acquisition in these and similar frameworks focuses on aspects of the L1 as determining factors for L2 performance.

In contradistinction to these L1-targeted theoretical models, other researchers have proposed that L2 learners’ difficulty may be better explained by the inherent properties of the L2 sounds themselves. That is, L2 sounds that are difficult for L2 learners may have low acoustic salience or involve complex articulatory gestures and coordination (Calabrese, 1995; Harnsberger, 2001; Hume, 2011; Polka, 1991; Simon, 2009). For instance, voiced obstruents require more articulatory effort than voiceless obstruents, because as the oral pressure increases due to the oral closure, a greater subglottal pressure must be maintained to enable the upward movement of air through the glottis that sustains voicing (Ohala, 1983). This is particularly difficult in the word-final position, where airflow is reduced toward the end of an utterance. Perhaps as a result of such articulatory difficulty, voiced obstruents in the coda position are generally acquired later in L1 acquisition than voiceless obstruents, and are less common in the world’s languages than their voiceless counterparts (Blevins, 2006; Eckman, 1977; Greenberg, 1978; Jakobson, 1968; Maddieson, 1984; Major & Faudree, 1996). This may also explain L2 learners’ greater difficulty with voiced coda obstruents, pervasively reported in previous studies on speakers with a broad range of L1s. For example, both Flege, Munro, and Skelton (1992) and Flege (1993) found that L2 learners’ production of the English word-final /t/ was closer to the native norm than that of /d/, regardless of whether their L1 permits word-final obstruents (Spanish and Taiwanese) or not (Mandarin).
Moreover, L2 learners have often been found to devoice coda obstruents even when their L1 allows neither voiced nor voiceless obstruents (Broselow, Chen, & Wang, 1998; Eckman, 1981), or allows both voiced and voiceless obstruents (Altenberg & Vago, 1983; Eckman, 1984). These research findings suggest that some sounds may be inherently difficult because of their acoustic or articulatory structures, and often pose a challenge for L1 speakers as well as L2 learners with diverse L1 backgrounds.

In addition to the cause of difficulty in L2 consonant perception, another issue that is still under debate is the influence of prosodic context on L2 consonant perception. Some previous studies have observed that L2 learners’ perception of the same contrast varies across prosodic positions. One explanation for this is that the learners’ native language has a similar contrast in one prosodic position but not in another. For instance, Ingram and Park (1998) compared Japanese and Korean speakers’ identification of the English /r/-/l/ contrast in the initial,

clusters, and medial positions. Both Japanese and Korean have only one liquid phoneme, occurring in the initial and medial positions in Japanese and in the medial and final positions in Korean. In the medial position Korean has a singleton-geminate contrast in its liquid, which resembles the English /r/-/l/ contrast. The results of the /r/-/l/ identification task showed that the Korean group outperformed the Japanese group in the medial and cluster positions, the latter of which was probably perceived as word-medial, since inserting a vowel to break up L2 consonant clusters is a common strategy among L1 Korean speakers. Yet the two groups did not differ in the initial position, suggesting that the Koreans could not transfer their ability to distinguish the /r/-/l/ contrast from the medial position to the onset position. In contrast, some studies have found that L2 learners are able to transfer their ability to distinguish consonants in one prosodic position to another position. For instance, although most Chinese dialects contrast /t/ and /d/ in the word-initial position but not in the word-final position, Flege (1989) observed that Chinese speakers from a variety of dialect backgrounds could identify the English word-final /t/-/d/ contrast as accurately as native English speakers when a release burst was audible, the burst being the primary cue for the prevocalic voicing contrast in Chinese. Similarly, while in Dutch a voicing contrast is only maintained in the onset position and neutralized in the coda, Broersma (2005) showed that Dutch speakers could perceive the English voicing distinction in the word-final position as well as in the initial position, and their performance did not differ from that of native English speakers. Taken together, the existing literature has not reached a clear consensus on the effect of prosodic context on L2 perception.
To address these issues, the current study compared L1 Mandarin and L1 Korean speakers’ perception of English obstruents in the onset and coda positions. Mandarin and Korean are ideal for comparison because they differ from English and from each other in their obstruent inventories and in the allophonic variation conditioned by prosodic position. Therefore, L1 Mandarin and Korean speakers are expected to differ in their identification of English obstruents if the L1 crucially influences L2 performance. Native English speakers’ identification of the same target obstruents was also assessed, to serve as the reference point for native-like accuracy and to reveal the inherent difficulty of certain sounds. Finally, this study included a wide variety of consonants, encompassing voicing, manner, and place contrasts. By including more consonants, we gain an indication of the generality of L1 effects across a larger




portion of the phonological system. The main targets of analysis, then, are the patterns which arise across the consonant sets. Our research questions are as follows: (1) Do L1 Mandarin, Korean, and English speakers differ in their identification accuracy of English obstruents in the onset and coda positions? (2) What are the possible explanations for their differences?

6.1.1  Linguistic Background and Hypotheses

In English, almost all the consonants that can appear in the onset position can also appear in the coda position, with the exception of /h/. The present study focuses on eight obstruents /p, b, t, d, f, v, θ, ð/, which encompass three phonological contrasts: voicing (voiceless vs. voiced), manner (stop vs. fricative), and place (labial vs. coronal). Table 6.1 lists the English obstruents examined in this study and their closest counterparts in Mandarin and Korean in the onset and coda positions.

Table 6.1  English obstruents examined in the current study and the closest Mandarin and Korean obstruents in the onset and coda positions

                        English              Mandarin         Korean
  Word initial (onset)  /p b t d f v θ ð/    /pʰ p tʰ t f/    /pʰ p p’ tʰ t t’/
  Word final (coda)     /p b t d f v θ ð/    (none)           /p t/

Regarding the stop consonants in the onset position, both English and Mandarin have a two-way contrast for homorganic stops. Acoustic studies of English and Mandarin stop consonants have shown that the phonetic distinction between stops in syllable-initial position in both languages is long- versus short-lag VOT (Chao & Chen, 2008; Keating, 1984; Lisker, 1986; Rochet & Fei, 1991). Hence there is a good correspondence in onset stop categories in English and Mandarin. Korean, on the other hand, has a three-way laryngeal contrast in homorganic stops, typically called aspirated, lenis, and fortis. These three categories are distinguished primarily by VOT and the F0 of the following vowel, along with other secondary cues such as breathiness on release and closure duration (Cho & Keating, 2001; Cho, Jun, & Ladefoged, 2002; Kang & Guion, 2006). While the phonetic correspondence between English and Korean stops is not exact, Korean listeners have been found to perceptually assimilate English voiceless stops to Korean aspirated stops, and to assimilate English voiced stops to Korean lenis or fortis stops (Park & de Jong, 2008; Schmidt, 1996).

As for the fricatives in the onset position, English has a relatively rich array of anterior nonsibilant fricatives /f, v, θ, ð/, while Mandarin has only /f/ and Korean has none. In fact, Korean has a very small set of fricative categories, consisting only of /s’/ (fortis), /s/ (lenis), and /h/. As for Mandarin, it has /s/, /ʂ/, /ʐ/, /ɕ/, and /x/ in addition to /f/.

When it comes to obstruents in the coda position, these three languages differ not only in terms of permitted obstruent phonemes but also in the allophonic voicing variation conditioned by prosodic position. In English, all eight target obstruents can appear in the coda position. However, stops in the coda position may be unreleased, and the voicing distinction is heavily cued by the length of the preceding vowel (Klatt, 1976; Raphael, 1972; Wardrip-Fruin, 1982). In Mandarin, no obstruents are allowed in the coda position. In Korean, the three-way contrast of stop consonants in the initial position is neutralized into a homorganic lenis stop in the word-final position. In addition to stops, the underlying fricatives /s/ and /s’/ in the coda position also surface as the lenis stop /t/. In other words, there is laryngeal and manner neutralization in the coda position in Korean, and examinations of the acoustics and of listeners’ perception have both shown that this neutralization is complete (Kim & Jongman, 1996).

If the difference between L1 and L2 sounds crucially affects learners’ L2 acquisition, as suggested by most L2 speech learning theories, we expect the L1 Mandarin and L1 Korean groups to differ in two specific aspects.
First, these two groups are likely to perform differently on the target sound /f/ in the onset position, because Mandarin has a close equivalent while Korean does not. Second, these two groups may differ in their performance on the coda stops, because Mandarin does not allow any obstruents in the coda position, while Korean allows lenis stops, which are perceived to be closest to the English voiced stops (Park & de Jong, 2008; Schmidt, 1996). However, the manner and laryngeal neutralization in Korean codas may also affect learners’ perception of English codas. Aside from the L1 influence, the two learner groups are similarly expected to have more difficulty with voiced obstruents than with voiceless obstruents, particularly in the coda position, as the former have been found to be more challenging than the latter for L2 learners with diverse L1 backgrounds. In fact, it is possible that even the L1 English group may make more errors on voiced sounds than on voiceless sounds in the coda




position, since voiced codas have been found to be more challenging than voiceless codas in L1 acquisition as well.

6.2 Methods

6.2.1  Participants

The L1 Mandarin group consisted of 41 native Mandarin speakers (19 female and 22 male; mean age = 21.1 years) recruited from the undergraduate population at a university in northern Taiwan. All but three listeners reported Taiwanese as a third language of familiarity in addition to Mandarin and English, as is typical of the Taiwan Mandarin-speaking population. The L1 Korean group was composed of 40 native speakers of Korean (28 female and 12 male; mean age = 24.97 years) recruited from the undergraduate students at a university near Seoul in Korea. Both the Mandarin- and Korean-speaking participants had been studying English for more than seven years as a regular course in school, but none had lived in an English-speaking country for longer than two months prior to the experiment. While these participants had studied English for a long period of time, it should be noted that they were typically instructed by nonnative speakers of English who shared their native language. Moreover, the English classes that these participants took in Taiwan and Korea generally focused on reading, writing, and grammar. Therefore, the L2 learners in our study can be characterized as being more familiar with formal use of English but as having very limited experience with English spoken by native speakers.

Seventeen native English speakers (11 female and 6 male; mean age = 20 years) were recruited from a university in the United States to serve as the control group. They were all born and raised in the United States, and English was their only native language. Nine of them listed Spanish as their second language, one listed Japanese, one listed Latin, and one listed French, while the other five did not know a second language. None of them had resided in a non-English-speaking country for more than two months prior to the experiment.
6.2.2  Stimuli

The eight English obstruents /p, b, t, d, f, v, θ, ð/ were combined with the vowel /ɑ/ either in prevocalic position (onset) or postvocalic position

(coda). The stimuli were produced by four Midwestern American English speakers (two male, two female), who were cued with IPA prompts. Since the IPA symbols for the two interdental fricatives differ from the English orthographic symbols, they were explained with key words at the beginning of the recording session. The 64 productions (8 consonants × 2 prosodic positions × 4 talkers) were spliced and randomized into one block with an interstimulus interval of five seconds.

6.2.3  Task

The participants were seated in groups of 3 to 11 in a quiet room and were presented with the stimuli played by a PC through a loudspeaker. All the stimuli were presented once. The listeners were asked to identify the consonant in each stimulus by circling the appropriate Roman consonant symbol from a list of 15 alternatives presented on a paper response form. Along with each response alternative was a key word, chosen to exemplify each segment in the initial position. Listeners could also mark the consonant as “other” and write down a symbol other than one of the 15 alternatives. Before the experiment started, the experimenter made sure that the participants were familiar with all the orthographic probes, including the IPA symbols /θ/ and /ð/. A block of five items, randomly selected from the stimuli, was run first as practice. When there were no more questions about the procedure, the participants proceeded to the actual task.

6.2.4  Analysis

To compare the three L1 groups’ accuracy in English obstruent identification, each participant’s accuracy rate for every consonant in the onset and coda positions was computed. The accuracy rates were converted to Rationalized Arcsine Units (RAU) to make them more suitable for statistical analysis (Studebaker, 1985).
A repeated measures ANOVA was conducted on the RAU-transformed accuracy rates with one between-group factor, L1 (Mandarin, Korean, English), and four within-group factors: prosodic position (onset, coda), voicing (voiceless, voiced), manner (stop, fricative), and place (labial, coronal). To further investigate the hypothesis that the ability to identify L2 targets in one position can be transferred to another position, the Mandarin and Korean groups’ accuracy rates in the onset position were correlated with their rates in the coda position across participants.
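The RAU conversion is a simple closed-form transform. As an illustration (not the authors’ analysis script), a minimal implementation of Studebaker’s (1985) rationalized arcsine transform might look as follows; the function name and example trial counts are ours, the latter chosen to match the four tokens per consonant-position cell (4 talkers × 1 presentation) implied by the design described above.

```python
import math

def rau(correct, total):
    """Rationalized Arcsine Units (Studebaker, 1985).

    The two-term arcsine transform stabilizes the variance of
    proportion-correct scores, and the linear rescaling makes the
    result track percent correct in the mid-range, on a scale running
    from about -23 (0/total) to about +123 (total/total).
    """
    theta = (math.asin(math.sqrt(correct / (total + 1)))
             + math.asin(math.sqrt((correct + 1) / (total + 1))))
    return (146.0 / math.pi) * theta - 23.0

# Example: 2 correct out of 4 trials sits at the midpoint, 50 RAU;
# 0/4 and 4/4 map to roughly -1.5 and 101.5 rather than 0 and 100,
# which is what makes floor and ceiling scores better behaved in ANOVA.
midpoint = rau(2, 4)
```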




6.3 Results

The three L1 groups’ mean accuracy rates for the eight English obstruents in the coda position are plotted in Figure 6.1 against each group’s accuracy rates for the same consonants in the onset position. The English group is plotted in the left panel, the Korean group in the middle, and the Mandarin group in the right panel. A repeated measures ANOVA revealed that L1 [F(2,95) = 35.94, p < 0.001, ηp² = 0.43], prosodic position [F(1,95) = 351.93, p < 0.001, ηp² = 0.79], voicing [F(1,95) = 230.08, p < 0.001, ηp² = 0.71], manner [F(1,95) = 161.01, p < 0.001, ηp² = 0.63], and place [F(1,95) = 20.78, p < 0.001, ηp² = 0.18] all significantly affected the participants’ accuracy. Post hoc analysis of the main effects (Bonferroni adjusted) indicated that L1 English speakers were significantly more accurate than L1 Mandarin speakers (p < 0.001), who were in turn more accurate than L1 Korean speakers (p = 0.027). With regard to the within-subject factors, the participants were overall more accurate in the onset than in the coda position (p < 0.001), as is evident in all of the markers in Figure 6.1 lying below the diagonal. In addition, they were more accurate with voiceless obstruents than with voiced obstruents (p < 0.001), with stops than with fricatives (p < 0.001), and with labials than with coronals (p < 0.001).

To answer our research question regarding the effects of L1 and prosodic position, we conducted post hoc analyses of the significant interactions involving these two factors. The interaction between L1 and prosodic position was significant [F(2,95) = 13.61, p < 0.001, ηp² = 0.22]. Post hoc tests (Bonferroni adjusted) revealed that in the onset position, L1 English speakers performed significantly better than both the L1 Mandarin and Korean speakers (ps ≤ 0.007), evident in the markers lying further to the right in the left panel of Figure 6.1, while there was no difference between the Mandarin and Korean groups.
In the coda position, on the other hand, the L1 English group outperformed the Mandarin group (p < 0.001), which in turn outperformed the Korean group (p = 0.024), as is evident in the vertical positioning across the panels in Figure 6.1. The three-way interaction between L1, prosodic position, and manner was significant as well [F(2,95) = 5.70, p = 0.005, ηp² = 0.11]. Post hoc analysis showed that in the onset position, both the English and Mandarin groups performed better than the Korean group on stops (ps ≤ 0.001), while on fricatives the English group only outperformed the Mandarin group (p = 0.02). In the coda position, the English group outperformed both learner groups on stops (ps < 0.001), and the Mandarin group also outperformed the Korean group (p = 0.016). As for fricatives, the English group similarly outperformed the two learner groups (ps < 0.001), while there was no difference between the Mandarin and Korean groups.

Figure 6.1  Proportional accuracy for consonants in coda position plotted by proportional accuracy for the same consonant in onset position. Each panel indicates accuracies for groups with different L1s. Here, “T” = /θ/ and “D” = /ð/.

As can be expected from the complex patterns in Figure 6.1, the interaction between all five factors also reached significance [F(2,95) = 3.37, p = 0.038, ηp² = 0.07]. Post hoc analysis revealed that in the onset position, both the English and Mandarin groups outperformed the Korean group on /b/, /d/, and /f/ (ps ≤ 0.024). In addition, the English group outperformed the Korean group on /v/ (p = 0.048). In the coda position, L1 English speakers outperformed both learner groups on /p/, /b/, /t/, /d/, and /v/ (ps < 0.001). They also outperformed the Korean group on /f/ (p = 0.045). The Mandarin group outperformed the Korean group on /d/ (p = 0.001). No other between-group differences were identified.

To summarize, there are general trends for all the participants to have higher accuracy in the onset than in the coda position, and in identifying voiceless obstruents over voiced ones, stops over fricatives, and labials over coronals. L1 English speakers performed better than the two learner groups, while the relative accuracy of the L1 Mandarin and Korean

groups depended on the prosodic position and the type of contrast. Specifically, there was no overall accuracy difference between the Mandarin and Korean groups in the onset position, while in the coda position the Mandarin group outperformed the Korean group, particularly in the identification of stops. Regarding individual obstruents, the L1 English speakers outperformed the two L2 learner groups on most of the sounds except /θ/ and /ð/ in both prosodic positions. The Mandarin group outperformed the Korean group on /f/ in the onset position, as predicted. In addition, they also achieved higher accuracy on voiced stops in the onset position and on /d/ in the coda. This suggests on the surface that, while not having coda obstruents, as in Mandarin, is a significant impediment to perceptually identifying them in the L2, having a neutralized set in the coda position, as in Korean, is an even bigger impediment.

To test the hypothesis that L2 learners can transfer their ability to distinguish L2 contrasts from one prosodic position to another, we correlated each participant’s accuracy for a given target sound in the onset position with that in the coda position. If the hypothesis is correct, one would expect to see significant correlations across individuals on sounds that have a close counterpart in the learners’ L1, despite the fact that they appear in different prosodic positions. The results of the correlation analysis for the Mandarin group revealed that only the correlation for /p/ between the two prosodic positions reached significance (r = 0.331, p = 0.035). As for the Korean group, the only significant correlation was found for /d/ (r = 0.477, p = 0.002). In other words, our study does not provide strong support for the transferability of obstruent identification ability between prosodic contexts.
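The transfer test just described is, at bottom, a Pearson correlation computed across participants for each target sound. A minimal sketch, with hypothetical accuracy data (the study’s actual per-participant scores are not reproduced here):

```python
import math

def pearson_r(xs, ys):
    """Pearson product-moment correlation between two paired samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical proportion-correct scores for one target consonant,
# one value per participant, in onset vs. coda position. A reliably
# positive r would indicate that identification skill in one prosodic
# position carries over to the other.
onset_acc = [0.75, 0.50, 1.00, 0.25, 0.75, 0.50]
coda_acc = [0.50, 0.25, 0.75, 0.25, 0.50, 0.50]
r = pearson_r(onset_acc, coda_acc)   # ~0.89 for these made-up data
```

In the study itself, only /p/ (Mandarin group) and /d/ (Korean group) showed significant onset-coda correlations, so on the real data a sketch like this would mostly yield r values indistinguishable from zero.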

6.4 Discussion

This study investigated L1 Mandarin and L1 Korean speakers’ perception of English obstruents in the onset and coda positions, and compared their performance with each other’s and with that of native English speakers. The analysis of the participants’ accuracy rates showed that the English group was more accurate than both L2 groups on almost all the target sounds except /θ/ and /ð/. The lack of a difference on the two interdental targets seems to be due to a floor effect. It appears that interdental fricatives are challenging for native speakers and L2 learners alike, and are particularly difficult in the coda position. This difficulty has also been well documented both in L1 (Clark, 2009; Edwards, 2003; Ingram et al., 1980; Polka, Colantonio, & Sundara, 2001) and L2 acquisition




(Brannen, 2002; Hanulíková & Weber, 2010; Lombardi, 2003; Rau, Chang, & Tarone, 2009; Wester, Gilbers, & Lowie, 2007). For example, Edwards (2003, pp. 120–125) observed that English-speaking children acquire /θ/ at age seven and /ð/ at age eight, both later than other consonants such as /p/ and /r/, which are acquired at the ages of three and six years, respectively. Research on L2 acquisition has frequently reported that L2 learners tend to substitute /t/, /s/, or /f/ for the English interdental fricatives. The difficulty in acquiring these sounds has been attributed to their lack of perceptual distinctiveness from other sounds (Jongman et al., 2000; Miller & Nicely, 1955), and perhaps for the same reason, interdental fricatives are relatively uncommon cross-linguistically (Maddieson, 1984, 2005). It is thus not surprising that all three groups in our study had low accuracy rates in identifying these sounds, regardless of L1 background.

Other segmental variation, however, does seem to be related to the structure of the L1s. In the onset position, the L1 Mandarin group was significantly more accurate than the Korean group on /b/, /d/, and /f/. Given that Mandarin but not Korean has /f/, this result suggests that having a similar sound in the L1 gives Mandarin speakers an advantage in identifying the English /f/. As for Mandarin speakers’ higher accuracy on English /b/ and /d/ than the Korean speakers’, one possible explanation is that English voiced stops are mapped only onto voiceless unaspirated stops in Mandarin but onto both lenis and fortis stops in Korean. Park and de Jong (2008) have suggested that a one-to-two mapping between L2 and L1 categories often leads to lower accuracy in L2 perception than a one-to-one mapping, because each of the two L1 categories has some likelihood of being associated with other L2 segments.
Specifically, they found that English /p/, which was predominantly assimilated to Korean /ph/, was identified by Korean speakers at higher accuracy than English /b/, which was assimilated to both Korean /p/ and /p’/. They attributed the difference in accuracy to the fact that English /f/ was sometimes assimilated to Korean /p’/ as well. As a result, there was some confusion between English /b/ and /f/ because they were associated with the same L1 category, whereas English /p/ was exempt from confusion with other sounds. The same reasoning might account for Mandarin speakers’ higher accuracy on /b/ and /d/ than that of Korean speakers. Concerning the overall effect of prosodic position, there was a very robust reduction in accuracy with codas in all three groups as illustrated in Figure 6.1, the only exception being the English group’s accuracy on

/v/. This is contrary to some previous research findings which revealed no difference between L2 learners’ consonant perception in the onset and coda positions (Broersma, 2005). Our results suggest that the perception of English obstruents in the coda position is more difficult than that in the onset position. Added to this, the lack of significant correlations between the learners’ accuracy in the two prosodic contexts indicates that they do not transfer their ability to identify these sounds from the onset position to the coda; this difficulty with codas is thus not a minor modulating factor but an important aspect of perceptual acquisition.

With regard to the role of L1 influence, the L1 Mandarin group achieved overall higher accuracy rates than the Korean group in the coda position, and was significantly more accurate on /d/ than the Korean group. In this case, it seems that having lenis stops in the coda position does not give Korean speakers an advantage in coda obstruent perception over Mandarin speakers, whose L1 does not allow any obstruents. These findings seemingly lend support to one of the SLM’s propositions, that L2 learners are more likely to achieve native-like accuracy on L2 sounds dissimilar to L1 phones than on those similar to the L1 (Flege, 1987, 1995). Another possible cause of the Korean group’s lower accuracy in identifying word-final obstruents is its L1 coda neutralization rule, which merges the underlying contrasts. The effect of such a neutralization in the L1 might actively desensitize Korean speakers to distinctions in word-final position. While we have discussed some L1-specific differences, it is clear from our data that these three groups shared similar patterns in the identification of English obstruents as well.
For instance, they were more accurate in identifying voiceless targets than voiced ones, stops than fricatives, and labials than coronals. Most of these patterns are compatible with the relative “markedness” of speech sounds and resemble the order of acquisition in L1 speech (Dinnsen & Elbert, 1984; Ferguson, 1978; Greenberg, 1978; Hawkins, 1987). While labials are not generally considered to be less marked than coronals, the participants’ overall higher accuracy on labial targets probably resulted from their low accuracy on the two coronal interdentals /θ/ and /ð/, which are categorized as marked sounds in the world’s languages. These results suggest that second language acquisition work needs to proceed in light of the observation that all listeners, native listeners included, find some consonants more difficult to perceive than others, regardless of their L1. While native listeners are much more proficient at identification, this does not mean that all phonological structures have the same degree of difficulty. Many patterns found in studies of




second language learners may not be due to the L1, but may simply be more extreme versions of patterns found in the native population. What this, in turn, suggests is that models of second language phonetic and phonological learning need to take seriously components of the learning process other than the learner’s L1.

6.5 Conclusion

The present study suggests that both the intrinsic properties of L2 sounds and L1 influence are important factors in accounting for learners' L2 obstruent perception. Concerning the former, more marked sounds tend to be perceived less accurately than less marked sounds, as evidenced by all three L1 groups' generally lower accuracy on voiced targets than on voiceless targets, on fricatives than on stops, and on the two interdental fricatives /θ/ and /ð/ as compared to other obstruents. Regarding the latter, the relative performance on L2 sounds cannot always be satisfactorily explained by the presence or absence of similar counterparts in the L1. Furthermore, in word-final position the Korean group did not have an advantage over the Mandarin group, even though their L1 allows lenis stops in coda position. This could be because the Mandarin group had established new phonetic categories for the coda obstruents, or because the Korean speakers were negatively influenced by their L1 coda neutralization rule. In sum, the relationship between L2 performance and L2-L1 mapping appears to be rather complicated. Learners' relative difficulty with L2 sounds cannot always be predicted from a comparison of L2 and L1 phonological inventories. One final point to note is that several similarities and differences among the three L1 groups became apparent in the current study because it probes a variety of consonants and thus yields a larger picture of the phonological system. Future work with a broader range of structures is surely called for.

References

Altenberg, E. P., & Vago, R. M. (1983). Theoretical implications of an error analysis of second language phonology production. Language Learning, 33, 427–447.
Best, C. T. (1995). A direct realist view of cross-language speech perception. In W. Strange (Ed.), Speech perception and linguistic experience: Issues in cross-language research (pp. 171–204). Timonium, MD: York Press.

Best, C. T., & Tyler, M. D. (2007). Nonnative and second-language speech perception: Commonalities and complementarities. In O.-S. Bohn & M. J. Munro (Eds.), Language experience in second language speech learning: In honor of James Emil Flege (pp. 13–34). Amsterdam: John Benjamins.
Blevins, J. (2006). A theoretical synopsis of Evolutionary Phonology. Theoretical Linguistics, 32, 117–166.
Brannen, K. (2002). The role of perception in differential substitution. Canadian Journal of Linguistics/Revue canadienne de linguistique, 47, 1–46.
Broersma, M. (2005). Perception of familiar contrasts in unfamiliar positions. Journal of the Acoustical Society of America, 117, 3890–3901.
Broselow, E., Chen, S. I., & Wang, C. (1998). The emergence of the unmarked in second language phonology. Studies in Second Language Acquisition, 20, 261–280.
Calabrese, A. (1995). A constraint-based theory of phonological markedness and simplification procedures. Linguistic Inquiry, 26, 373–463.
Chao, K. Y., & Chen, L. M. (2008). A cross-linguistic study of voice onset time in stop consonant productions. Computational Linguistics and Chinese Language Processing, 13, 215–232.
Cho, T., & Keating, P. A. (2001). Articulatory and acoustic studies on domain-initial strengthening in Korean. Journal of Phonetics, 29, 155–190.
Cho, T., Jun, S. A., & Ladefoged, P. (2002). Acoustic and aerodynamic correlates of Korean stops and fricatives. Journal of Phonetics, 30, 193–228.
Clark, E. V. (2009). First language acquisition. Cambridge: Cambridge University Press.
Dinnsen, D. A., & Elbert, M. (1984). On the relationship between phonology and learning. ASHA Monographs, 22, 59–68.
Eckman, F. R. (1977). Markedness and the contrastive analysis hypothesis. Language Learning, 27, 315–330.
Eckman, F. R. (1981). On the naturalness of interlanguage phonological rules. Language Learning, 31, 195–216.
Eckman, F. R. (1984). Universals, typologies, and interlanguages. In W. E. Rutherford (Ed.), Language universals and second language acquisition (pp. 79–105). Amsterdam: John Benjamins.
Edwards, H. T. (2003). Applied phonetics: The sounds of American English (3rd ed.). Clifton Park, NY: Thomson-Delmar Learning.
Ferguson, C. A. (1978). Phonological processes. In J. Greenberg, C. Ferguson, & E. Moravcsik (Eds.), Universals of human language: Vol. 2. Phonology (pp. 403–442). Stanford, CA: Stanford University Press.
Flege, J. E. (1987). The production of "new" and "similar" phones in a foreign language: Evidence for the effect of equivalence classification. Journal of Phonetics, 15, 47–65.
Flege, J. E. (1989). Chinese subjects' perception of the word-final English /t/–/d/ contrast: Performance before and after training. Journal of the Acoustical Society of America, 86, 1684–1697.




Flege, J. E. (1993). Production and perception of a novel, second-language phonetic contrast. Journal of the Acoustical Society of America, 93, 1589–1608.
Flege, J. E. (1995). Second language speech learning: Theory, findings, and problems. In W. Strange (Ed.), Speech perception and linguistic experience: Issues in cross-language research (pp. 233–277). Timonium, MD: York Press.
Flege, J. E., Munro, M. J., & Skelton, L. (1992). Production of the word-final English /t/–/d/ contrast by native speakers of English, Mandarin, and Spanish. Journal of the Acoustical Society of America, 92, 128–143.
Greenberg, J. H. (1978). Some generalizations concerning initial and final consonant clusters. In J. Greenberg, C. Ferguson, & E. Moravcsik (Eds.), Universals of human language: Vol. 2. Phonology (pp. 243–279). Stanford, CA: Stanford University Press.
Hanulikova, A., & Weber, A. (2010). Production of English interdental fricatives by Dutch, German, and English speakers. In New Sounds 2010: Sixth International Symposium on the Acquisition of Second Language Speech (pp. 173–178). Poznan, Poland: Adam Mickiewicz University.
Harnsberger, J. D. (2001). On the relationship between identification and discrimination of non-native nasal consonants. Journal of the Acoustical Society of America, 110, 489–503.
Hawkins, J. A. (1987). Implicational universals as predictors of language acquisition. Linguistics, 25, 453–473.
Hume, E. (2011). Markedness. In M. van Oostendorp, C. Ewen, E. Hume, & K. Rice (Eds.), Companion to phonology (Vol. 1, pp. 79–106). Malden, MA: Blackwell.
Ingram, D., Christensen, L., Veach, S., & Webster, B. (1980). The acquisition of word-initial fricatives and affricates in English between 2 and 6 years. In G. Yeni-Komshian, J. Kavanagh, & C. Ferguson (Eds.), Child phonology (pp. 169–192). New York: Academic Press.
Ingram, J. C., & Park, S. G. (1998). Language, context, and speaker effects in the identification and discrimination of English /r/ and /l/ by Japanese and Korean listeners. Journal of the Acoustical Society of America, 103, 1161–1174.
Jakobson, R. (1968). Child language, aphasia and phonological universals. Paris: Mouton.
Jongman, A., Wayland, R., & Wong, S. (2000). Acoustic characteristics of English fricatives. Journal of the Acoustical Society of America, 108, 1252–1263.
Kang, K. H., & Guion, S. G. (2006). Phonological systems in bilinguals: Age of learning effects on the stop consonant systems of Korean-English bilinguals. Journal of the Acoustical Society of America, 119, 1672–1683.
Keating, P. A. (1984). Physiological effects on stop consonant voicing. UCLA Working Papers in Phonetics, 59, 29–34.
Kim, H., & Jongman, A. (1996). Acoustic and perceptual evidence for complete neutralization of manner of articulation in Korean. Journal of Phonetics, 24, 295–312.

Klatt, D. H. (1976). Linguistic uses of segmental duration in English: Acoustic and perceptual evidence. Journal of the Acoustical Society of America, 59, 1208–1221.
Lisker, L. (1986). "Voicing" in English: A catalogue of acoustic features signaling /b/ versus /p/ in trochees. Language and Speech, 29, 3–11.
Lombardi, L. (2003). Second language data and constraints on manner: Explaining substitutions for the English interdentals. Second Language Research, 19, 225–250.
Maddieson, I. (1984). Patterns of sounds. Cambridge: Cambridge University Press.
Maddieson, I. (2005). Presence of uncommon consonants. In M. Haspelmath, M. S. Dryer, D. Gil, & B. Comrie (Eds.), The world atlas of language structures (pp. 82–83). Oxford: Oxford University Press.
Major, R. C., & Faudree, M. C. (1996). Markedness universals and the acquisition of voicing contrasts by Korean speakers of English. Studies in Second Language Acquisition, 18, 69–90.
Miller, G. A., & Nicely, P. E. (1955). An analysis of perceptual confusions among some English consonants. Journal of the Acoustical Society of America, 27, 338–352.
Ohala, J. (1983). The origin of sound patterns in vocal tract constraints. In P. F. MacNeilage (Ed.), The production of speech (pp. 189–216). New York: Springer.
Park, H., & de Jong, K. J. (2008). Perceptual category mapping between English and Korean prevocalic obstruents: Evidence from mapping effects in second language identification skills. Journal of Phonetics, 36, 704–723.
Polka, L. (1991). Cross-language speech perception in adults: Phonemic, phonetic, and acoustic contributions. Journal of the Acoustical Society of America, 89, 2961–2977.
Polka, L., Colantonio, C., & Sundara, M. (2001). A cross-language comparison of /d/–/ð/ perception: Evidence for a new developmental pattern. Journal of the Acoustical Society of America, 109, 2190–2201.
Raphael, L. J. (1972). Preceding vowel duration as a cue to the perception of the voicing characteristic of word-final consonants in American English. Journal of the Acoustical Society of America, 51, 1296–1303.
Rau, D. V., Chang, H. H. A., & Tarone, E. E. (2009). Think or sink: Chinese learners' acquisition of the English voiceless interdental fricative. Language Learning, 59, 581–621.
Rochet, B. L., & Fei, Y. (1991). Effect of consonant and vowel context on Mandarin Chinese VOT: Production and perception. Canadian Acoustics, 19, 105–106.
Schmidt, A. M. (1996). Cross-language identification of consonants. Part 1. Korean perception of English. Journal of the Acoustical Society of America, 99, 3201–3211.




Simon, E. (2009). Acquiring a new second language contrast: An analysis of the English laryngeal system of native speakers of Dutch. Second Language Research, 25, 377–408.
Studebaker, G. A. (1985). A "rationalized" arcsine transform. Journal of Speech, Language, and Hearing Research, 28, 455–462.
Wardrip-Fruin, C. (1982). On the status of temporal cues to phonetic categories: Preceding vowel duration as a cue to voicing in final stop consonants. Journal of the Acoustical Society of America, 71, 187–195.
Wester, F., Gilbers, D., & Lowie, W. (2007). Substitution of dental fricatives in English by Dutch L2 speakers. Language Sciences, 29, 477–491.

chapter 7

Changes in the First Year of Immersion

An Acoustic Analysis of /s/ Produced by Japanese Adults and Children Katsura Aoyama*

7.1 Introduction

The acquisition of second language (L2) speech has been studied in many ways. Previous studies have investigated segmental aspects (vowels, e.g., Guion, 2003; consonants, e.g., Guion, Flege, Akahane-Yamada, & Pruitt, 2000a) and prosodic aspects (stress, e.g., Guion, 2005; tone, e.g., Wayland & Guion, 2003). Previous studies have also investigated factors that affect the degree of overall foreign accent in the L2 (e.g., Piske, MacKay, & Flege, 2001). Age of acquisition (AOA) (e.g., Guion, Flege, Liu, & Yeni-Komshian, 2000b), amount of first language (L1) use (e.g., Guion, Flege, & Loftin, 2000c), and length of residence (LOR) in an L2-speaking environment (e.g., Guion, 2005) are some of the commonly investigated factors that affect the degree of overall foreign accent. In addition, these aspects and factors can be studied through speech perception (e.g., Guion et al., 2000a) or production (e.g., Oh et al., 2011). Previous research also covers numerous L1-L2 combinations, including Japanese-English (Guion et al., 2000a), Korean-English (Lee, Guion, & Harada, 2006), Spanish-English (Guion, Harada, & Clark, 2004), and Quichua-Spanish (Guion, 2003). In short, multiple factors interact with one another when speakers learn a new language, and these interactions manifest in complex ways in L2 speakers' speech perception and production.

* This research was supported by a grant from the National Institutes of Health (DC00257) to James E. Flege and by two university internal grants from the University of North Texas. The Japanese Longitudinal Project was designed and implemented by James E. Flege and Susan Guion-Anderson at the University of Alabama at Birmingham. Susan was actively involved in data collection, analysis, and dissemination of the project until her death in 2011. She was a wonderful collaborator and a great mentor, and I am grateful that I had an opportunity to know her. I thank James E. Flege, Reiko Akahane-Yamada, and Tsuneo Yamada for their work on this project and Allard Jongman for his help with acoustic analysis. In addition, Linda Legault, Tina Boike, Sarah Hayes, and Karen McPhearson contributed to various aspects of this study.






This chapter reports a preliminary acoustic analysis of English /s/ produced by native Japanese-speaking adults and children. The production data are from the Japanese Longitudinal Project conducted at the University of Alabama at Birmingham (UAB). There are several published reports based on this project, including overall foreign accent (Aoyama, Guion, Flege, Yamada, & Akahane-Yamada, 2008), segmental perception and production (Aoyama, Flege, Guion, Akahane-Yamada, & Yamada, 2004; Aoyama et al., 2008; Oh et al., 2011), and prosodic aspects (Aoyama & Guion, 2007). Additional analyses have been conducted on the acoustical properties of liquids (Aoyama, Flege, Akahane-Yamada, & Yamada, 2019a). This chapter reports a preliminary analysis of /s/ from an acoustical analysis of voiceless fricatives (Aoyama, Flege, Yamada, & Akahane-Yamada, 2019b).

7.1.1  Voiceless Fricatives in American English and Japanese

There are many differences in fricative phonemes between American English and Japanese. American English has four voiceless fricatives: labiodental /f/, interdental /θ/, alveolar /s/, and palato-alveolar /ʃ/ (Ladefoged & Johnson, 2015). In addition, the English phoneme inventory includes /h/, although it is often excluded from acoustical analysis of fricatives (e.g., Jongman, Wayland, & Wong, 2000). This is because [h] is considered a voiceless transition into the neighboring vowel (Ladefoged & Johnson, 2015). In contrast, Japanese has an alveolar /s/ and a voiceless fricative that is farther back than [s] (Vance, 1987). Li (2012) uses /s/ and /ʃ/ for these two voiceless fricatives, and Vance (1987) uses /s/ and /ɕ/ for this contrast. Japanese has five vowel phonemes (/a/, /e/, /i/, /o/, and /u/ [ɯ]). Additional voiceless fricatives appear as allophones in some vowel contexts. The bilabial fricative [ɸ] appears in front of /u/ (fune [ɸɯne] "ship") and in loanwords (e.g., firumu [ɸiɾɯmɯ] "film"). In front of /i/, /s/ appears as [ɕ] (e.g., shi [ɕi] "poem").
It has also been reported that Japanese /s/ is laminal (Vance, 1987) compared to an apical /s/ in American English, and thus less sibilant than /s/ in English (Li, Edwards, & Beckman, 2009). Li et al. compared the acoustic characteristics of English and Japanese /s/ (as well as English /ʃ/ and Japanese /ɕ/). They found that acoustic characteristics of /s/ were slightly different between the two languages, and suggested that there may be some subtle differences in articulatory configurations for English and Japanese.
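The allophonic patterning just described can be sketched as a toy rule function. This is purely illustrative: the function name is mine, the rules are a deliberate simplification of Japanese phonology, and treating [ɸ] as the pre-/u/ realization of /h/ is a standard analysis that the text itself does not spell out.

```python
def surface_fricative(phoneme: str, next_vowel: str) -> str:
    """Toy allophony: map a Japanese fricative phoneme to its surface form."""
    # /s/ surfaces as [ɕ] before /i/, as in shi [ɕi] "poem"
    if phoneme == "s" and next_vowel == "i":
        return "ɕ"
    # [ɸ] appears before /u/, as in fune [ɸɯne] "ship"
    # (standardly analyzed as an allophone of /h/; an assumption here)
    if phoneme == "h" and next_vowel == "u":
        return "ɸ"
    return phoneme
```

For example, `surface_fricative("s", "i")` yields "ɕ", while `surface_fricative("s", "a")` leaves /s/ unchanged.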






7.1.2  Japanese Speakers' Perception and Production of American English Fricatives

Guion et al. (2000a) investigated how native Japanese speakers categorize English consonants in terms of Japanese consonants. The Japanese participants were also asked to rate English consonants in terms of goodness-of-fit to the closest Japanese consonant. The results showed that the Japanese speakers identified English /s/ as Japanese /s/ 87 percent of the time, and judged English /s/ as a good example of Japanese /s/. The Japanese speakers identified English /θ/ as Japanese /s/ or bilabial /ɸ/ (less than 40 percent of the time), and rated English /θ/ as a poor example of both Japanese /s/ and /ɸ/. These results showed that English /θ/ did not fit any Japanese consonant category, while English /s/ was categorized as an acceptable example of Japanese /s/. Guion et al. (2000a) also examined Japanese speakers' ability to discriminate some English consonant pairs (/ɹ/-/l/, /ɹ/-/w/, /s/-/θ/, and /b/-/v/). The discrimination scores for the /s/-/θ/ contrast were low, even among the experienced Japanese speakers with an average LOR of three years. The English /s/-/θ/ contrast would fall into the "categorized" (i.e., /s/) versus "uncategorized" (i.e., /θ/) type in the Perceptual Assimilation Model (Best, 1995; Best & Tyler, 2007), for which discrimination scores are predicted to be relatively high. The English /s/-/θ/ contrast was also examined in Aoyama et al. (2008). The Japanese adults' and children's discrimination scores for the English /s/-/θ/ contrast were also low in this study, and the scores did not improve significantly after one year of residence in an English-speaking country. Aoyama et al. (2008) also examined native Japanese (NJ) adults' and children's productions of English voiceless fricatives (/f/, /s/, and /θ/). Their productions were evaluated through native English listeners' perceptual judgments (i.e., intelligibility scores).
Sixteen listeners without phonetic training identified the NJ speakers' productions of /f/, /s/, and /θ/ as one of the following choices: f, s, th, sh, h, fw, hw, d, and t. The intelligibility scores indicated that the NJ adults' productions of /f/ were identified as f around 70 percent of the time. When the NJ speakers' /f/ productions were misidentified, they were identified as fw (around 23 percent of the time). The NJ adults' productions of /θ/ were identified as th 58 percent of the time at approximately six months after their arrival in the United States, and the intelligibility scores improved slightly to 64 percent a year later. The NJ adults' productions of /s/ were identified as s 89 percent of the time at six months after their arrival, but the intelligibility scores were lower (75 percent) after one year of residence in the United States. When the NJ adults' /s/ productions were misidentified, they were most often misidentified as th; when their /θ/ productions were misidentified, they were most often identified as s.

The NJ children's intelligibility scores showed different patterns from the NJ adults' in Aoyama et al. (2008). The NJ children's /f/ productions were identified as f only 55 percent of the time at the first testing, but the intelligibility scores improved to 70 percent one year later. When the NJ children's /f/ productions were misidentified, they were identified as either fw or th. The intelligibility scores were highest for /s/ (74 percent at the first testing and 85 percent one year later). When the NJ children's /s/ productions were misidentified, they were most often misidentified as th. Lastly, the Japanese children's /θ/ productions were identified as th only 39 percent of the time at six months after their arrival, but the mean intelligibility score improved to 69 percent one year later. Their /θ/ productions were misidentified as s and f. In sum, the intelligibility scores for English voiceless fricatives differed widely depending on the consonant, and the observed patterns were somewhat different between Japanese adults and children. Interestingly, the Japanese speakers' productions of /f/ were most often identified as intended, although [ɸ] appears only in front of /u/ and in loanwords (Vance, 1987). In addition, Japanese speakers might categorize English /s/ as Japanese /s/ perceptually (Guion et al., 2000a), but their productions of English /s/ may still differ from NE speakers' productions of /s/. Aoyama et al. (2019b) conducted an acoustic analysis of three fricative consonants (/f/, /s/, and /θ/) from the production experiment in Aoyama et al. (2008); its purpose was to examine the acoustic nature of the differences between Japanese L2 speakers and native English speakers observed in Aoyama et al. (2008). This chapter reports the preliminary results of the acoustic analysis of /s/.

7.2  The Acoustic Analysis of /s/

7.2.1 Method

7.2.1.1 Participants

Table 7.1 summarizes the participant demographics. The participants were 32 native Japanese (NJ) speakers and 32 native American English (NE) speakers. The data collection took place in the Birmingham, Alabama, and the Dallas and Houston, Texas, areas. All participants were






Table 7.1  Characteristics of the native English (NE) and native Japanese (NJ) participants

              Gender   Mean age (year)   Mean LOR (year)
                                         Time 1      Time 2
NE adults     7m/9f    40.3 (4.7)        –           –
NE children   10m/6f   10.6 (2.1)        –           –
NJ adults     8m/8f    39.9 (3.8)        0.5 (0.2)   1.6 (0.3)
NJ children   9m/7f    9.9 (2.4)         0.4 (0.2)   1.6 (0.3)

Note: LOR = length of residence in the United States in years. Age = chronological age in years at time 1. Standard deviations in parentheses.

tested twice (time 1 and time 2) with approximately one year between recordings. Data were collected twice to investigate the changes between time 1 and time 2 in the NJ adults and children. The NJ speakers had lived in the United States for approximately six months at time 1. There were 16 NJ adults (8 males and 8 females, mean age 39.11) and 16 NJ children (9 males and 7 females, mean age 9.11). The NJ adults had been exposed to English from an average age of 12.2 years. All but one of the NJ children had had no formal exposure to English prior to coming to the United States. The NE speakers included 16 adults (7 males and 9 females, mean age 40.4) and 16 children (10 males and 6 females, mean age 9.11). In both the NE and NJ groups, the adults were the parents of the children.

7.2.1.2  Data Collection

Data collection was conducted in a quiet room at either UAB or a participant's home. Several tasks were given to the participants at both time 1 and time 2, including vowel discrimination, consonant discrimination, English word production, and English sentence production. The data from the English word production task have been analyzed in various studies of the NJ speakers' production of English consonants (Aoyama et al., 2004, 2008, 2019a, 2019b) and vowels (Oh et al., 2011), as well as their overall foreign accent (Aoyama et al., 2008). The following 26 English words were elicited from each participant: book, bug, cage, dog, eat, egg, eight, feet, fish, food, foot, hug, leaf, light, neck, read, six, shoe, sock, thousand, think, vase, voice, watch, wing, write. These words were selected because they contained a variety of English vowels and




consonants, and they were all frequently occurring words. The initial /s/ from six was analyzed in this study.

7.2.1.3  Elicitation Procedure

The participants were recorded with a head-mounted Shure microphone (Model SM 10A) connected to a Sony digital audio tape recorder (Sony model TCD-D8). The 26 words were elicited three times each in random order. At the first elicitation, the participants saw a picture on the screen of a laptop computer and heard the corresponding word via a loudspeaker. An equivalent word in Japanese was displayed in Japanese orthography in addition to the picture. An auditory model was not provided to elicit the second and third tokens of the test words; the experimenter played the auditory model of a word only when the participant was unable to say the word in response to the corresponding picture. The recordings were digitized at 22.05 kHz with 16-bit amplitude resolution. The first and third productions from each speaker were acoustically analyzed. A total of 256 tokens of syllable-initial /s/ in six (16 participants × 4 groups × 2 tokens × 2 testing times) were acoustically analyzed and are reported here. These productions are the same tokens that were perceptually evaluated by native English listeners in the production experiment in Aoyama et al. (2008).

7.2.1.4  Acoustic Analyses

Fricative noise duration, normalized duration, center of gravity (CoG), and fricative noise amplitude were measured for each token of /s/. These four parameters are among the many durational, spectral, and amplitude parameters of English voiceless fricatives reported in previous studies (e.g., Jongman et al., 2000; Maniwa, Wade, & Jongman, 2009). Not all parameters in these previous studies were measured here because the purpose of this study was to investigate the differences between NE and NJ speakers, not to conduct a comprehensive analysis of the acoustical properties of English /s/.
All acoustic measurements were conducted using Praat (Boersma & Weenink, 2019). The /s/ was segmented by simultaneously examining the waveform and a wideband spectrogram. The onset of the /s/ was defined as the point at which high-frequency energy first appeared on the spectrogram. The offset of the fricative was defined as the point immediately preceding the onset of the following vowel. Fricative noise duration was measured from the onset to the offset of the fricative. Because






absolute duration may vary as a function of speaking rate, a normalized duration was also calculated. The duration of the vowel /ɪ/ in six was measured from the onset of voicing to the onset of the following consonant (i.e., /k/ in six). Normalized duration was then calculated as the ratio of the fricative noise duration to the duration of the following vowel. The center of gravity (CoG) was measured over the middle 40 ms of the fricative noise. The mean intensity value in dB over the entire noise portion was taken as the fricative noise amplitude.
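The four measurements just described can be approximated in a few lines of NumPy. This is a minimal sketch rather than the authors' Praat procedure: the function names are mine, and the intensity computation assumes the samples are calibrated in pascals so that dB values are expressed re 20 µPa, as in Praat.

```python
import numpy as np

def middle_window(samples, sr, ms=40.0):
    """Middle `ms` milliseconds of a fricative segment (cf. the 40-ms CoG window)."""
    n = int(sr * ms / 1000.0)
    start = (len(samples) - n) // 2
    return samples[start:start + n]

def spectral_cog(samples, sr):
    """Center of gravity: power-weighted mean frequency of the spectrum, in Hz."""
    windowed = samples * np.hanning(len(samples))
    power = np.abs(np.fft.rfft(windowed)) ** 2
    freqs = np.fft.rfftfreq(len(samples), d=1.0 / sr)
    return float(np.sum(freqs * power) / np.sum(power))

def mean_intensity_db(samples, ref=2e-5):
    """Mean intensity in dB re 20 µPa, assuming samples are calibrated in pascals."""
    rms = np.sqrt(np.mean(samples ** 2))
    return float(20.0 * np.log10(rms / ref))

def normalized_duration(noise_dur, vowel_dur):
    """Fricative noise duration as a proportion of the following vowel's duration."""
    return noise_dur / vowel_dur
```

For a 7 kHz tone sampled at 22.05 kHz, `spectral_cog` returns a value near 7000 Hz; for real /s/ tokens, the segment boundaries would first be located from the waveform and spectrogram as described above.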

7.3 Results

7.3.1 Duration

Table 7.2 shows mean fricative noise duration, vowel duration, and normalized duration in each group at time 1 and time 2. The data were statistically analyzed using a three-way Language (2) × Age (2) × Time (2) ANOVA with fricative noise duration as the dependent variable. This analysis yielded nonsignificant main effects of language, age, and time [F(1,120) = 0.75 to 2.21, p > 0.1]. The two-way interaction between language and age was significant [F(1,120) = 4.05, p < 0.05]. Post hoc tests showed that the NJ speakers' normalized durations of /s/ were longer than the NE speakers' (mean NJ: 2.11 vs. NE: 1.86) [F(1,62) = 4.70, p = 0.03].

7.3.2  Center of Gravity (CoG)

Table 7.3 shows mean CoG values in each group at time 1 and time 2. The data were statistically analyzed using a three-way Language (2) × Age (2) × Time (2) ANOVA with the CoG values as the dependent variable. This analysis yielded a significant main effect of language [F(1,120) = 11.72, p < 0.001]. The main effects of age and time were not significant [F(1,120) = 0.04 and 1.12, p > 0.1]. The two-way interactions and the three-way interaction were also not significant [F(1,120) = 0.05 to 1.17, p > 0.1].

Table 7.3  Mean center of gravity (CoG) values (in Hz) averaged across speakers in each group

              Time 1        Time 2
NE adults     6806 (1071)   6878 (1118)
NE children   6714 (1197)   6838 (1319)
NJ adults     6472 (1288)   5971 (1912)
NJ children   5577 (1959)   6120 (1347)

Note: Standard deviations are in parentheses.






Table 7.4  Mean noise amplitude (in dB) averaged across speakers in each group

              Time 1    Time 2
NE adults     65 (5)    68 (5)
NE children   64 (5)    66 (4)
NJ adults     63 (6)    63 (5)
NJ children   60 (4)    65 (4)

Note: Standard deviations are in parentheses.

Post hoc tests showed that the NE groups' CoG values for /s/ (mean 6809 Hz) were higher than the NJ groups' (mean 6022 Hz) [F(1,126) = 11.92, p < 0.001].

7.3.3  Fricative Noise Amplitude

Table 7.4 shows mean fricative noise amplitude in each group at time 1 and time 2. The data were statistically analyzed using a three-way Language (2) × Age (2) × Time (2) ANOVA with noise amplitude as the dependent variable. This analysis yielded significant main effects of language [F(1,120) = 18.19, p < 0.001] and time [F(1,120) = 9.69, p = 0.002]. The three-way interaction was also statistically significant [F(1,120) = 4.95, p = 0.028]. The main effect of age and the two-way interactions were not significant [F(1,120) = 0.04 to 2.26, p > 0.1]. Post hoc tests showed that the NJ children's noise amplitude was significantly higher at time 2 than at time 1 (65 vs. 60 dB) [F(1,31) = 16.97, p < 0.001]. There was no difference in noise amplitude between time 1 and time 2 in any of the other three groups [F(1,31) = 0.11 to 4.00, p > 0.05].
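The 2 × 2 × 2 analyses reported in this section can be sketched as follows for a balanced design. This is an illustrative reimplementation run on simulated data, not the authors' analysis code; for simplicity it treats all three factors as between-subjects, which with 16 scores per cell reproduces the error degrees of freedom (120) seen in the F ratios above.

```python
import numpy as np
from scipy.stats import f as f_dist

def anova_2x2x2(data):
    """Balanced three-way ANOVA. `data` has shape (2, 2, 2, n):
    two levels each of factors A, B, C, with n scores per cell."""
    a, b, c, n = data.shape
    gm = data.mean()
    ss = {}
    # Main effects from marginal means
    ss["A"] = b * c * n * ((data.mean(axis=(1, 2, 3)) - gm) ** 2).sum()
    ss["B"] = a * c * n * ((data.mean(axis=(0, 2, 3)) - gm) ** 2).sum()
    ss["C"] = a * b * n * ((data.mean(axis=(0, 1, 3)) - gm) ** 2).sum()
    # Two-way interactions: SS of two-factor cell means minus the main effects
    ss["AB"] = c * n * ((data.mean(axis=(2, 3)) - gm) ** 2).sum() - ss["A"] - ss["B"]
    ss["AC"] = b * n * ((data.mean(axis=(1, 3)) - gm) ** 2).sum() - ss["A"] - ss["C"]
    ss["BC"] = a * n * ((data.mean(axis=(0, 3)) - gm) ** 2).sum() - ss["B"] - ss["C"]
    cell = data.mean(axis=3)
    ss["ABC"] = (n * ((cell - gm) ** 2).sum()
                 - sum(ss[k] for k in ("A", "B", "C", "AB", "AC", "BC")))
    ss_err = ((data - cell[..., None]) ** 2).sum()
    df_err = data.size - a * b * c  # 128 - 8 = 120 for 16 scores per cell
    ms_err = ss_err / df_err
    # Every effect has 1 degree of freedom in a 2 x 2 x 2 design
    return {k: (s / ms_err, float(f_dist.sf(s / ms_err, 1, df_err)))
            for k, s in ss.items()}
```

Each entry of the returned dict is an (F, p) pair; a large effect of the first factor, for example, would surface as a large F and a small p for key "A".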

7.4 Discussion

The results of the present study revealed several differences in the acoustic characteristics of English /s/ between native English speakers and Japanese L2 speakers of English. In duration, the NJ adults' /s/ was longer than the NE adults' /s/ in both absolute noise duration and normalized duration (see Table 7.2). In normalized duration, the NJ adults' and children's /s/ productions were more than twice as long as the duration of the following vowel (/ɪ/), whereas the ratios between the /s/ and /ɪ/ were




smaller in the NE adults' and children's productions. The duration of /s/ was shortest in the NE adults' productions among the four groups, in both absolute duration and normalized duration. The CoG values were greater in the NE adults' and children's /s/ than in the NJ adults' and children's /s/. Since the CoG values correspond to the place of articulation (Jongman et al., 2000; Li, 2012), this finding suggests that the NJ speakers' /s/ was slightly more back than the NE speakers' /s/ productions. Lastly, noise amplitude was greater in the NE adults' and children's /s/ than in the NJ adults' and children's /s/, indicating that the native speakers' /s/ is more sibilant than the Japanese L2 speakers' /s/. Noise amplitude in /s/ was higher at time 2 than at time 1 in the NJ children's productions, suggesting improvement in their /s/ productions. Previous studies have reported differences in durational aspects between native speakers and L2 speakers (e.g., Guion, Flege, Liu, & Yeni-Komshian, 2000b). For instance, Guion et al. (2000b) reported that overall utterance durations are generally longer in L2 speakers' speech than in native speakers' speech. Moreover, the average utterance duration was longer as the AOA in an L2-speaking country increased. This effect was observed in two large immigrant groups (native Italian speakers in Canada and native Korean speakers in the United States) in Guion et al. (2000b). For the Japanese Longitudinal Project, both Oh et al. (2011) and Aoyama and Guion (2007) reported that some vowels and syllables were longer in the NJ adults' and children's utterances than in the NE adults' and children's utterances. These observed durational differences were not simply due to a slower speech rate or a lack of proficiency in speaking an L2. Guion, Flege, and Loftin (2000c) analyzed the durations of different types of segments (vowels, sonorants, and obstruents), and found that vowels showed a greater difference between native speakers and Korean L2 speakers of English. For the Japanese Longitudinal Project, Aoyama and Guion (2007) showed that function words (e.g., pronouns) may be proportionally longer in the NJ speakers' than in the NE speakers' utterances. In Oh et al. (2011), one vowel (i.e., /i/) was longer in the NJ adults' speech than in the NE adults' speech, whereas another (i.e., /ɑ/) was longer in the NE speakers' speech than in the NJ speakers' productions. The present study showed that the normalized duration of /s/ (i.e., the ratio between /s/ and /ɪ/ in six) was longer in the NJ speakers' than in the NE speakers' utterances. These findings indicate that the durational differences are not simply due to a slower rate of speech. The durational differences in some segments, and the proportional differences between segments, may instead reflect underlying



Changes in the First Year of Immersion



prosodic differences between the native speakers and L2 speakers of English. The CoG values correlate negatively with the length of the resonating cavity (Li et al., 2009), and they are typically higher than 6000 Hz for English /s/ (Jongman et al., 2000; Li, 2012). The CoG results in this study suggest that the NJ adults’ and children’s /s/ may be slightly more back, and thus closer to English /ʃ/, than the NE speakers’ /s/. This may be because CoG values in Japanese /s/ are slightly lower than those in English /s/ (Li, 2012). In addition, the /s/ analyzed in this study was followed by /ɪ/ in the word six, and this vowel context may have affected how the Japanese speakers produced the /s/. As mentioned in the introduction, /s/ appears as the allophone [ɕ] before /i/ in Japanese (Vance, 1987). The acoustic characteristics of the Japanese speakers’ English /s/ might have been different in another vowel context, such as in the word sock. The average noise amplitude in /s/ was greater in the NE groups than in the NJ groups. This difference in noise amplitude may correspond to the fact that English /s/ is apical and alveolar whereas Japanese /s/ is laminal (Li, 2012; Vance, 1987). The noise amplitude increased significantly from time 1 to time 2 in the NJ children’s productions, and it was comparable to the NE groups’ /s/ at time 2 (Table 7.4). This change in noise amplitude may be one of the reasons for the improved intelligibility scores (75 percent to 85 percent) in Aoyama et al. (2008). These results suggest that the NJ children noticed the subtle cross-linguistic differences in the phonetic realization of /s/ between English and Japanese. Unlike the NJ children’s, the NJ adults’ intelligibility scores for /s/ decreased from time 1 to time 2 (89 vs. 75 percent) in Aoyama et al. (2008).
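The centre of gravity discussed above is the power-weighted mean frequency of the frication noise spectrum (Praat, the program cited in this volume, uses this power-spectrum weighting by default). The following is a minimal NumPy sketch of the quantity itself, run on synthetic band-limited noise rather than any data from the study:

```python
import numpy as np

def centre_of_gravity(samples: np.ndarray, fs: int) -> float:
    """Spectral centre of gravity (Hz): power-weighted mean frequency."""
    spectrum = np.fft.rfft(samples * np.hanning(len(samples)))
    power = np.abs(spectrum) ** 2
    freqs = np.fft.rfftfreq(len(samples), d=1.0 / fs)
    return float(np.sum(freqs * power) / np.sum(power))

# Synthetic check: 100 ms of noise band-limited to 5.5-8.5 kHz should
# yield a CoG near 7000 Hz, in the range typical of English /s/.
rng = np.random.default_rng(0)
fs = 44100
noise = rng.standard_normal(fs // 10)
spec = np.fft.rfft(noise)
freqs = np.fft.rfftfreq(len(noise), d=1.0 / fs)
spec[(freqs < 5500) | (freqs > 8500)] = 0.0
band_noise = np.fft.irfft(spec, n=len(noise))
cog = centre_of_gravity(band_noise, fs)
```

A more back (i.e., more /ʃ/-like) constriction lengthens the front resonating cavity and shifts this weighted mean downward, which is why lower CoG values are read as a more retracted /s/.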
The results of the present study showed that their /s/ productions were indeed different from the NE adults’ /s/ productions, especially in the CoG values (see Table 7.3). The CoG values of /s/ in the NJ adults’ productions, however, did not differ statistically between time 1 and time 2. In fact, none of the time 1–time 2 comparisons reached statistical significance for the NJ adults’ productions of /s/. There are several possible reasons why changes were observed between time 1 and time 2 in one study but not in another. One factor may be how the productions were evaluated through listener judgments. The listener judgment experiment in Aoyama et al. (2008) was a forced-choice identification task, and the NE and NJ speakers’ productions were presented as consonant-vowel (CV) syllables (e.g., /sɪ/). The /s/ tokens were also presented with two other fricatives



Katsura Aoyama

(/f/ and /θ/). It is possible that changes in the other two sounds, /f/ and /θ/, indirectly affected the NE listeners’ judgments of the NJ adults’ /s/ productions. The acoustic analysis of /f/ and /θ/ will reveal whether the NJ adults’ and children’s /f/ and /θ/ also changed between time 1 and time 2. These results demonstrate the complex relationship between native speakers’ perception and the acoustic nature of L2 speakers’ production. A similar case was found in the NJ children’s productions of English /ɹ/ in Aoyama et al. (2004) and Aoyama et al. (2019a). Listener judgment scores showed that the NJ children’s productions of English /ɹ/, but not /l/, improved significantly from time 1 to time 2 (Aoyama et al., 2004). However, the acoustic analysis showed that the nature of this perceived change was less straightforward, because the NJ children’s productions of both /l/ and /ɹ/ changed in some respects (Aoyama et al., 2019a). It appears that the combination of changes in both sounds contributed to the higher intelligibility scores for the NJ children’s production of /ɹ/. Aoyama et al. (2008) also demonstrated that the NJ children’s overall foreign accent ratings improved significantly from time 1 to time 2. Some of the same word tokens (e.g., six, fish, right, light) were part of the foreign accent experiment (Aoyama et al., 2008), the vowel analysis (Oh et al., 2011), and the acoustic analyses (Aoyama et al., 2019a, 2019b). In the foreign accent experiment, the NE listeners were instructed to judge each stimulus based on the “overall degree of foreign accent,” and the stimuli contained strings of meaningful words instead of CV syllables. All acoustic analyses conducted thus far show only modest changes, if any, for the NJ children from time 1 to time 2 (Aoyama et al., 2019a, 2019b; Aoyama & Guion, 2007; Oh et al., 2011). These findings demonstrate the difficulty of measuring what listeners perceive as cues to foreign accent.
It appears that listeners are sensitive to subtle differences in many different segments and are able to make an overall judgment of the degree of foreign accent. That is, the NJ children’s improvement from time 1 to time 2 may have been modest and limited to some segments, but these small changes collectively contributed to perceived differences in foreign accent. In sum, the series of studies from the Japanese Longitudinal Project demonstrates the importance of examining multiple aspects of L2 acquisition. Acoustic analysis of L2 speech can uncover fine details in vowels and consonants, even when the sounds are identified as intended. Acoustic analysis can also show exactly which aspect of a sound is different, such as the CoG and noise amplitude of /s/ in the present






study. On the other hand, observed differences in the perceptual evaluation of a sound (e.g., /ɹ/ in the NJ children’s productions and /s/ in the NJ adults’) may not directly correspond to differences in acoustic measurements. The acoustic analysis of /s/ also showed differences in a category that exists in both the speakers’ L1 (Japanese) and L2 (English). Differences in the phonetic realization of /s/ between English and Japanese may be more difficult to discern and thus more difficult to learn (Flege, 1995). These differences between similar or shared sounds in the L1 and L2 may ultimately be the source of foreign accent in advanced speakers of an L2 (Flege, 1987). The Japanese Longitudinal Project did not investigate the effects of psychological and social factors on learning English. The two main factors considered in this project were AOA (by comparing adults and children) and LOR (by comparing time 1 and time 2). Although a language background questionnaire was given to the participants, factors such as language use, motivation, and attitude were not considered in the analysis. It would have been potentially interesting to consider social aspects in the Japanese Longitudinal Project, because the participants were families who came to the United States for a relatively short period of time. In this respect, the social circumstances behind learning an L2 were different from those in other studies (e.g., immigrants, Guion, Flege, Liu, & Yeni-Komshian, 2000b; speakers without any experience in the target language, Wayland & Guion, 2003). This chapter is a preliminary report of the acoustic analysis of fricatives (Aoyama et al., 2019b), and acoustic analyses of /f/ and /θ/ are currently under way. Future reports will include the difference between the noise amplitude of the fricative and the amplitude of the following vowel (/ɪ/) to normalize for speaker differences.
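The two normalization steps just mentioned reduce to simple arithmetic on per-segment measurements. A sketch with invented values (the durations and dB levels below are hypothetical, not data from the study):

```python
# Hypothetical measurements for one token of "six" (all values invented).
s_duration_ms = 180.0       # duration of the /s/ frication noise
vowel_duration_ms = 90.0    # duration of the following vowel
s_amplitude_db = 58.0       # mean noise amplitude of /s/
vowel_amplitude_db = 70.0   # mean amplitude of the following vowel

# Relative noise amplitude: fricative amplitude minus the following
# vowel's amplitude, normalizing for overall speaker loudness.
relative_amplitude_db = s_amplitude_db - vowel_amplitude_db

# Normalized duration as in the present study: the ratio of the /s/
# duration to the duration of the following vowel.
norm_duration = s_duration_ms / vowel_duration_ms

print(relative_amplitude_db, norm_duration)  # -12.0 2.0
```

Because both measures are ratios or differences within a single speaker’s token, speaker-level differences in overall loudness and speech rate largely cancel out.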
Normalized duration will also be calculated as the proportion of the fricative noise within a word, to be consistent with previous studies (e.g., Jongman et al., 2000). The acoustic analysis of /θ/ will reveal how the NJ adults and children produced a category that does not exist in Japanese (Aoyama et al., 2008). As the body of previous research richly demonstrates, there are many ways to study L2 speech acquisition. Acoustic analyses conducted recently (Aoyama et al., 2019a, 2019b) have provided new insights for interpreting the findings of our previous studies (Aoyama et al., 2004, 2008). Speech production data provide rich material for studying how people learn to speak a new language. It is a challenging yet fulfilling field of research, and I feel fortunate to be able to spend my career studying it.




References

Aoyama, K., Flege, J. E., Guion, S. G., Akahane-Yamada, R., & Yamada, T. (2004). Perceived phonetic dissimilarity and L2 speech learning: The case of Japanese /r/ and English /l/ and /r/. Journal of Phonetics, 32, 233–250.
Aoyama, K., Flege, J. E., Akahane-Yamada, R., & Yamada, T. (2019a). An acoustic analysis of American English liquids by adults and children: Native English speakers and native Japanese speakers of English. The Journal of the Acoustical Society of America, 146(4), 2671–2681.
Aoyama, K., Flege, J. E., Yamada, T., & Akahane-Yamada, R. (2019b). Acoustical analysis of English voiceless fricatives by native Japanese adults and children (/f θ s/). Paper presented at the 130th meeting of the Acoustical Society of America, Louisville, KY.
Aoyama, K., & Guion, S. G. (2007). Prosody in second language acquisition: An acoustic analysis on duration and F0 range. In O.-S. Bohn & M. J. Munro (Eds.), Language experience in second language speech learning: In honor of James Emil Flege (pp. 281–297). Amsterdam: John Benjamins.
Aoyama, K., Guion, S. G., Flege, J. E., Yamada, T., & Akahane-Yamada, R. (2008). The first years in an L2-speaking environment: A comparison of Japanese children and adults learning American English. International Review of Applied Linguistics in Language Teaching, 46, 61–90.
Best, C. T. (1995). A direct realist view of cross-language speech perception. In W. Strange (Ed.), Speech perception and linguistic experience: Issues in cross-language research (pp. 171–204). Timonium, MD: York Press.
Best, C. T., & Tyler, M. D. (2007). Nonnative and second-language speech perception: Commonalities and complementarities. In O.-S. Bohn & M. J. Munro (Eds.), Language experience in second language speech learning: In honor of James Emil Flege (pp. 13–34). Amsterdam: John Benjamins.
Boersma, P., & Weenink, D. (2019). Praat: Doing phonetics by computer (version 6.0.52) [Computer program]. Retrieved from www.praat.org/
Flege, J. E. (1987).
The production of “new” and “similar” phones in a foreign language: Evidence for the effect of equivalence classification. Journal of Phonetics, 15, 47–65.
Flege, J. E. (1995). Second language speech learning: Theory, findings, and problems. In W. Strange (Ed.), Speech perception and linguistic experience: Issues in cross-language research (pp. 233–277). Timonium, MD: York Press.
Guion, S. G. (2003). The vowel systems of Quichua-Spanish bilinguals: Age of acquisition effects on the mutual influence of the first and second languages. Phonetica, 60, 98–128.
Guion, S. G. (2005). Knowledge of English word stress patterns in early and late Korean-English bilinguals. Studies in Second Language Acquisition, 27, 503–533.
Guion, S. G., Flege, J. E., Akahane-Yamada, R., & Pruitt, J. C. (2000a). An investigation of current models of second language speech perception: The case of Japanese adults’ perception of English consonants. Journal of the Acoustical Society of America, 107, 2711–2724.






Guion, S. G., Flege, J. E., Liu, S. H., & Yeni-Komshian, G. H. (2000b). Age of learning effects on the duration of sentences produced in a second language. Applied Psycholinguistics, 21, 205–228.
Guion, S. G., Flege, J. E., & Loftin, J. D. (2000c). The effect of L1 use on pronunciation in Quichua-Spanish bilinguals. Journal of Phonetics, 28, 27–42.
Guion, S. G., Harada, T., & Clark, J. J. (2004). Early and late Spanish-English bilinguals’ acquisition of English word stress patterns. Bilingualism: Language and Cognition, 7, 207–226.
Jongman, A., Wayland, R., & Wong, S. (2000). Acoustic characteristics of English fricatives. Journal of the Acoustical Society of America, 108, 1252–1263.
Ladefoged, P., & Johnson, K. (2015). A course in phonetics (7th ed.). Stamford, CT: Cengage Learning.
Lee, B., Guion, S. G., & Harada, T. (2006). Acoustic analysis of the production of unstressed English vowels by early and late Korean and Japanese bilinguals. Studies in Second Language Acquisition, 28, 487–513.
Li, F. (2012). Language-specific developmental differences in speech production: A cross-language acoustic study. Child Development, 83, 1303–1315.
Li, F., Edwards, J., & Beckman, M. E. (2009). Contrast and covert contrast: The phonetic development of voiceless sibilant fricatives in English and Japanese toddlers. Journal of Phonetics, 37, 111–124.
Maniwa, K., Wade, T., & Jongman, A. (2009). Acoustic characteristics of clearly spoken English fricatives. Journal of the Acoustical Society of America, 125, 3962–3973.
Oh, G. E., Guion-Anderson, S., Aoyama, K., Flege, J. E., Akahane-Yamada, R., & Yamada, T. (2011). A one-year longitudinal study of English and Japanese vowel production by Japanese adults and children in an English-speaking setting. Journal of Phonetics, 39, 156–167.
Piske, T., MacKay, I. R. A., & Flege, J. E. (2001). Factors affecting degree of foreign accent in an L2: A review. Journal of Phonetics, 29, 191–215.
Vance, T. (1987).
An introduction to Japanese phonology. New York: State University of New York Press.
Wayland, R. P., & Guion, S. G. (2003). Perceptual discrimination of Thai tones by naïve and experienced learners of Thai. Applied Psycholinguistics, 24, 113–129.

chapter 8

Effects of the Postvocalic Nasal on the Perception of American English Vowels by Native Speakers of American English and Japanese Takeshi Nozawa and Ratree Wayland

8.1 Introduction

Researchers generally agree that one’s L1 phonology exerts a strong influence on nonnative speech perception and production. Some nonnative contrasts are easily distinguished without prior native experience, whereas others are challenging despite previous native exposure (e.g., Polka, 1992; Best, McRoberts, & Sithole, 1988). These and similar findings led to a hypothesis formulated by two models of cross-language phonological acquisition, namely, that difficulty in both perceiving and producing nonnative segmental contrasts depends not on the abstract, context-free relationship between the native and nonnative segments to be learned, but on their acoustic or articulatory phonetic similarities or dissimilarities (Best, 1994, 1995; Bohn & Flege, 1992, 1995). Evidence in support of this hypothesis has been found in several studies reporting how cross-language differences in context-induced phonetic realizations of segments affect their perception. Many of these studies examined the effects of consonantal contexts on vowel perception (e.g., Strange et al., 2001; Nozawa & Wayland, 2012), but only a few have reported on the perception of American English (AE) vowels in a nasal consonant context. Examining that context is the main goal of the current investigation. Specifically, this study examined how well native speakers of Japanese (NJ) could distinguish AE vowels produced before a final alveolar nasal /n/, and whether their perception was predictable from how they identify these vowels in terms of Japanese vowel categories.

8.1.1 Consonantal Effects on Vowel Perception

Studies examining the perception or production of vowels in any language typically consider vowels uttered in a single context unless the




effects of consonantal contexts are also being measured (see, e.g., Polka, 1995; Fox, Flege, & Munro, 1995; Flege & MacKay, 2004; Ingram & Park, 1997; Strange et al., 1998; Bundgaard-Nielsen et al., 2011; Boomershine, 2013). However, an acoustic analysis of American English vowels produced in /CVC/ syllables by Hillenbrand et al. (2001, p. 761) revealed effects of the onset and coda consonants on both vowel formant and duration values. Compared with coda consonants, onset consonants were shown to exert a larger effect on vowel formants. Moreover, vowels are longest in duration when the onset and coda consonants are both voiced, and shortest when both are voiceless. Despite the influence of the consonantal context, however, native listeners are able to accurately perceive the vowels, suggesting that they “internalize knowledge about the effects of context on vowel formants and invoke this knowledge in perception.” In contrast, the available evidence suggests that, unlike native listeners, nonnative listeners are not as successful at adjusting their perception to compensate for nonnative allophonic variation in vowel duration. For example, Morrison (2002) and Lengeris (2009) found that the English /i/-/ɪ/ distinction was poorly perceived before a voiceless consonant by Japanese listeners because both of these vowels were equated with the Japanese single-mora /i/. In contrast, the distinction between these two vowels was more accurately perceived in a voiced consonantal context because they were mapped to the Japanese two-mora /ii/ and the single-mora /i/, respectively. Effects of the consonantal context on the temporal assimilation pattern were also found by Strange et al. (2001). Specifically, that study showed that four AE short vowels [ɪ, ɛ, ʌ, ʊ] were assimilated to Japanese two-mora vowels less frequently in voiceless final consonantal contexts than in voiced consonantal contexts.
Similar findings were reported in our previous study (Nozawa & Wayland, 2012). Consonantal context has also been shown to influence cross-language spectral assimilation patterns. Strange et al. (2001), for example, reported that short AE vowels [ɪ, ɛ, ʌ] were perceptually assimilated to spectrally different Japanese vowels depending on the consonantal contexts in which they were produced and presented. By investigating the perceptual assimilation of French vowels by AE learners, Levy (2009) found that experienced and inexperienced English learners of French labeled the French vowel /y/ differently in a bilabial context, but that experience had no effect in an alveolar context. To our knowledge, the perception of




American English vowels in the prenasal position by native and nonnative speakers of English has yet to be studied. According to Ladefoged (2005), in producing a [CVn] syllable, the velum-lowering gesture occurs considerably before the tongue tip-raising gesture for the final alveolar nasal [n], causing the vowel to become nasalized. Acoustically, nasalized vowels exhibit overall lower amplitude, wider formant bandwidths, a weak F1, and a low-frequency nasal formant between F1 and F2 (Olive et al., 1993; Ladefoged, 2003). Perceptually, nasalized vowels tend to be less distinctive, and some vowel contrasts are lost or neutralized in the /VN/ context (Wright, 1986; Beddor, 1993). The reduced distinctiveness of nasalized vowels can be seen in the fact that “in a given language the (phonologically) nasal vowels may number the same as or fewer than, but no more than, the oral vowels” (Beddor, 1993, p. 186). Based on a survey of 75 languages that exhibit allophonic or morphophonemic nasal vowel raising or lowering, Beddor (1993) concluded that nasalization has a lowering effect on high vowels and a raising effect on low vowels. AE is among the 75 languages, with /æ/ raised when nasalized. Furthermore, Labov (2010) notes that the raising and fronting of prenasal /æ/ is observed in all North American English dialects. In addition, /ɛ/ and /ɪ/ merge before nasals in the Southeastern United States (Labov et al., 2005). The weaker F1 and the presence of a nasal formant are likely responsible for this effect. However, a previous study by Krakow et al. (1988) suggested that only noncontextually nasalized vowels (/bṽd/) result in a misperception of vowel height. Beddor and her colleagues (Beddor et al., 1986; Beddor, 1993) reached a similar conclusion: perceived vowel height is affected only when nasalization is phonetically or phonologically inappropriate.
This finding suggests that native speakers of American English can adjust their perceived height of nasalized vowels in contexts where vowels are usually nasalized. A synthesized /æ/-/ɛ/ continuum was used in these studies; thus, the lowering effects of nasalization on other American English vowels remain to be explored. Moreover, unlike native listeners, nonnative listeners’ perception of vowel height may be more strongly affected by nasalization because of their inability to compensate for this context-sensitive allophonic variation. According to Tanowitz and Beddor (1997), approximately 80 percent of the total vowel duration in the English /CVNC/ context is nasalized, with small variations across talkers and vowels. Therefore, it is plausible that nasalization will affect the spectral assimilation patterns of AE vowels among Japanese listeners.




8.2 The Current Study

This study examines how the coda nasal affects the identification and discrimination of American English vowels by native speakers of American English and Japanese by comparing the results with those of our previous study (Nozawa & Wayland, 2012). In that study, native speakers of American English and Japanese identified six American English vowels /i/, /ɪ/, /ɛ/, /æ/, /ɑ/, and /ʌ/,1 and discriminated six vowel pairs /i/-/ɪ/, /ɛ/-/ɪ/, /æ/-/ɛ/, /æ/-/ɑ/, /æ/-/ʌ/, and /ɑ/-/ʌ/ in six consonantal contexts /pVt/, /bVd/, /tVt/, /dVd/, /kVt/, and /ɡVd/. Native Japanese speakers also identified these six vowels in terms of Japanese vowel categories. Thus, the focus was on the effects of the place of articulation of the preceding consonant and of vowel duration.

8.2.1 Listeners

The listeners in this study were the same 10 native speakers of Japanese (NJ) and 12 native speakers of English (NE) who participated in our previous study (Nozawa & Wayland, 2012). The NJ group was recruited in Kobe and the surrounding area; they were undergraduate students at a Japanese university. The NE participants were recruited in Auburn, Alabama.2 At the time of the study, none of the NJ listeners had studied English outside of the Japanese school system, and they rarely had opportunities to use English in everyday life. Neither group contained participants with hearing problems. The listeners were asked to perform a vowel discrimination task and a vowel identification task, with task order counterbalanced across listeners. Those in the NJ group who completed both the discrimination and identification tasks were then asked to participate in the perceptual assimilation task.

8.2.2 Stimuli

The stimuli were produced by the same four female native speakers of American English who had participated as speakers in our previous study

1 We were unable to obtain /ʌ/ tokens in the /ɡVd/ context; thus, /ʌ/ was not included in the identification task in this context, and /æ/-/ʌ/ and /ɑ/-/ʌ/ were not included in the discrimination task in this context.
2 Six of the NE listeners were from parts of the United States where /ɛ/ and /ɪ/ do not merge in perception and production, and the other six were from parts of the United States where the /ɛ/-/ɪ/ merger is predominant in perception and production (Labov et al., 2006).




(Nozawa & Wayland, 2012). A printed word list was provided, and the speakers read aloud the words and nonwords on the list. These words and nonwords included /i/, /ɪ/, /ɛ/, /æ/, /ɑ/, and /ʌ/ in /pVn/, /bVn/, /tVn/, /dVn/, /kVn/, and /ɡVn/ frames. The utterances were digitally recorded in a recording booth at a sampling rate of 44,100 Hz and later edited in Cool Edit 2000 for use as stimuli. After editing, the stimuli were normalized for peak intensity. Figure 8.1 shows the mean F1 and F2 frequencies of the six vowels uttered by the four native speakers in preplosive and prenasal contexts, averaged across five consonantal contexts (/pV-/, /bV-/, /tV-/, /dV-/, and /kV-/). The /ɡVd/ and /ɡVn/ contexts were excluded from the analysis. Vowels marked with a tilde denote vowels uttered in a prenasal context. In the prenasal context, /ɪ/ and /ɛ/ are lowered, and /æ/ is raised and fronted. /ɑ/ is also higher in the prenasal context. In contrast, the positions of /i/ and /ʌ/ are less affected.

8.2.3 Perceptual Assimilation

Only the NJ group performed the perceptual assimilation task in addition to the discrimination and identification tasks. Based on the results of this task, the discrimination and identification accuracy of American English vowels by the NJ group was predicted.

Figure 8.1  Mean F1 and F2 frequencies of six vowels uttered by four native speakers, averaged across five consonantal contexts. Here, “ı̃, ɪ̃, ɛ̃, æ̃, ɑ̃, ʌ̃” denote /i/, /ɪ/, /ɛ/, /æ/, /ɑ/, /ʌ/ uttered in a prenasal context.




8.2.3.1 Procedure

The NJ listeners were provided with answer sheets with vowel choices given in katakana: “a, e, i, o, u, aa, ee, ii, oo, uu, ao, ea, ei, ia, ie, iu, oa, ya, yu, and yo.” They were instructed to circle the Japanese vowel that best represented the AE vowel they heard and to rate its category goodness on a five-point scale (1 = poor, 5 = good). They were allowed to hear the same stimulus as many times as they wished before responding, and clicked “NEXT” on the computer screen when ready for the next trial. In total, they listened to 48 trials (4 speakers × 6 vowels × 2 times) in each consonantal context.

8.2.3.2 Results

Table 8.1 shows the modal responses in percentages and mean category goodness ratings in parentheses. For reference, the results for the preplosive contexts, adapted from Nozawa and Wayland (2012), are also shown. The AE /i/ was perceived as the Japanese single-mora /i/ in the prenasal context. The mean duration of the prenasal /i/ is 159 ms, whereas in the voiceless and voiced stop contexts it is 100 ms and 184 ms, respectively. The AE /ɪ/ was also mapped most frequently to the Japanese single-mora /i/, except in /tVn/, where it was perceived as the Japanese /e/. When /ɪ/ was equated with the Japanese single-mora /i/, it was perceived as a less ideal exemplar of the Japanese /i/ than the AE /i/; thus, it can be assumed that the spectral differences between /i/ and /ɪ/ were discerned. The AE /ɛ/ was consistently perceived as the Japanese /e/. The AE /æ/ was heard as the Japanese /e/ in /pVn/ but as /ea/ or /ia/ in the other syllables, whereas in the preplosive contexts /æ/ was more often equated with the Japanese /a/. More interestingly, the AE /ɑ/ and /ʌ/ were both mapped to the Japanese /o/ in /pVn/ and /bVn/ syllables, but to the Japanese /a/ in /dVn/ and /kVn/ syllables.
Additionally, the AE /ɑ/ was heard equally often as the Japanese /o/ and the single-mora /a/ in the /tVn/ context, but as the Japanese two-mora /aa/ in /ɡVn/. The AE /ʌ/ was also perceived as the Japanese /o/ in /pVn/, /bVn/, and /tVn/, and as the Japanese /a/ in /dVn/, /kVn/, and /ɡVn/ syllables. It may be important to note that the AE /ɑ/ and /ʌ/ were almost always perceptually assimilated to the same Japanese vowel categories. In order to predict the discrimination of the six AE vowel pairs of interest (/i/-/ɪ/, /ɛ/-/ɪ/, /æ/-/ɛ/, /æ/-/ɑ/, /æ/-/ʌ/, and /ɑ/-/ʌ/), a classification overlap score (Flege & MacKay, 2004) was computed for each pair. For the classification overlap score of /i/-/ɪ/ in the /pVn/ context, for instance, the AE /i/ and /ɪ/ were classified as the Japanese single-mora /i/ in 68.75 percent and 41.25 percent of instances, respectively, which yielded a 41.25




Table 8.1  Perceptual assimilation: the most frequent responses in percentages and mean categorical goodness ratings

Context   /i/              /ɪ/              /ɛ/              /æ/               /ɑ/               /ʌ/
/pVn/     /i/ 68.8 (4.2)   /i/ 41.3 (3.3)   /e/ 67.5 (4.0)   /e/ 30.0 (3.5)    /o/ 38.7 (3.9)    /o/ 46.3 (4.2)
/pVt/     /i/ 57.5 (4.0)   /i/ 63.8 (3.0)   /e/ 66.3 (3.3)   /a/ 31.3 (3.3)    /a/ 47.5 (3.5)    /a/ 26.3 (3.0)
/bVn/     /i/ 61.3 (4.2)   /i/ 58.8 (3.3)   /e/ 68.3 (3.7)   /ea/ 33.8 (3.6)   /o/ 37.5 (4.0)    /o/ 33.8 (4.0)
/bVd/     /ii/ 52.5 (3.6)  /i/ 55.0 (2.9)   /e/ 56.3 (3.6)   /aa/ 21.3 (3.3)   /a/ 35.0 (3.9)    /a/ 45.0 (2.9)
/tVn/     /i/ 58.8 (4.0)   /e/ 46.3 (3.6)   /e/ 72.5 (3.8)   /ia/ 23.5 (3.7)   /o/ 22.5 (4.3)    /o/ 36.3 (3.6)
/tVt/     /i/ 63.8 (3.9)   /i/ 73.8 (3.1)   /e/ 48.8 (3.2)   /a/ 45.0 (3.3)    /a/ 53.8 (4.0)    /a/ 43.8 (3.3)
/dVn/     /i/ 60.0 (4.1)   /i/ 61.25 (3.7)  /e/ 41.25 (3.7)  /ea/ 28.3 (3.3)   /a/ 28.8 (3.9)    /a/ 58.3 (3.7)
/dVd/     /ii/ 58.8 (3.8)  /i/ 53.8 (3.4)   /e/ 62.5 (3.6)   /aa/ 31.3 (2.9)   /aa/ 43.8 (4.3)   /a/ 36.3 (3.3)
/kVn/     /i/ 47.5 (4.1)   /i/ 35.0 (3.5)   /e/ 56.25 (3.1)  /ia/ 33.8 (3.1)   /a/ 36.25 (3.9)   /a/ 56.25 (3.8)
/kVt/     /i/ 62.5 (4.3)   /i/ 50.0 (3.5)   /e/ 55.0 (4.0)   /ja/ 73.8 (3.9)   /a/ 71.3 (4.0)    /a/ 45.0 (3.1)

Note: The results of the preplosive contexts are adapted from Nozawa and Wayland (2012).

percent overlap. Likewise, the AE /i/ and /ɪ/ were classified as the Japanese two-mora /ii/ in 31.25 percent and 2.5 percent of instances, respectively, which gave rise to a 2.5 percent overlap. While /ɪ/ was also equated with other Japanese categories, /i/ was equated only with /i/ and /ii/; thus, the classification overlap score was 43.75 percent. The mean classification overlap scores of the six vowel pairs averaged across five consonantal contexts are shown




Figure 8.2  Classification overlap scores of six vowel pairs in preplosive and prenasal contexts.

in Figure 8.2. A larger overlap is observed for /ɛ/-/ɪ/ and /æ/-/ɛ/ in the prenasal context while the opposite is true for /æ/-/ɑ/ and /æ/-/ʌ/. The overlap remains the largest for /ɑ/-/ʌ/ in both contexts.
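The classification overlap score amounts to summing, category by category, the smaller of the two vowels’ response percentages. A sketch reproducing the /i/-/ɪ/ example worked through above (the function name is ours; “other” pools the remaining responses to /ɪ/):

```python
def classification_overlap(dist_a: dict, dist_b: dict) -> float:
    """Summed overlap (in percent) of two classification distributions."""
    categories = set(dist_a) | set(dist_b)
    return sum(min(dist_a.get(c, 0.0), dist_b.get(c, 0.0)) for c in categories)

# AE /i/ and /I/ in the /pVn/ context: percent of trials classified as
# each Japanese category (values from the worked example in the text).
ae_i = {"i": 68.75, "ii": 31.25}
ae_I = {"i": 41.25, "ii": 2.5, "other": 56.25}

score = classification_overlap(ae_i, ae_I)  # 41.25 + 2.5 = 43.75
```

The higher the score, the more often the two AE vowels are mapped to the same Japanese categories, and so the harder the pair is predicted to be to discriminate.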

8.3 Discrimination

8.3.1 Procedure

The AXB format was adopted. In this method, a listener heard three stimuli per trial and was asked to determine whether the second stimulus was categorically the same as the first (A) or the third (B). In other words, the task was to determine whether the listener heard AAB or ABB. Listeners responded by moving the cursor to “First” or “Last” on the computer screen and clicking on the appropriate option. The three stimuli in each trial were always produced by different speakers. Thus, to successfully perform this task, the listeners had to ignore within-category differences and rely on relevant between-category phonetic properties. The interstimulus interval (ISI) was 500 ms, and the intertrial interval (ITI) was 1000 ms. A listener was allowed to hear the same trial again if he or she waited 10 seconds before responding. The stimuli were blocked by the preceding consonantal context, and block order was counterbalanced across listeners. Listener sensitivity in discriminating each vowel pair was assessed based on eight trials. For instance, of the eight trials testing the /i/-/ɪ/ pair, a listener heard two different tokens




of /i/ and one token of /ɪ/ in four trials, and two different tokens of /ɪ/ and one token of /i/ in four additional trials. Thus, 48 trials (6 vowel pairs × 8 trials) were created for each consonantal context.

8.3.2 Results

The discrimination accuracy for both NE and NJ is reported in Figure 8.3. Overall, NE outperformed NJ on all vowel pairs in both consonantal contexts, and both NE and NJ performed less accurately in the prenasal context (NE: 89.1 vs. 86.3 percent; NJ: 72.5 vs. 71.9 percent). A mixed-design ANOVA with listener group as the between-subject factor, and consonantal context (preplosive vs. prenasal) and vowel pair as within-subject factors, yielded significant main effects of listener group [F(1,108) = 170.77, p < 0.001] and vowel pair [F(5,540) = 58.83, p < 0.001], but the main effect of consonantal context narrowly missed significance [F(1,108) = 3.79, p = 0.054]. The two-way interactions between vowel pair and listener group and between consonantal context and vowel pair were both significant at the p < 0.001 level, but the two-way interaction between listener group and consonantal context was not significant (p = 0.215). The three-way interaction of the three factors was significant (p < 0.001). Bonferroni-adjusted pairwise

Figure 8.3  Mean percentages and standard errors of English and Japanese listeners’ discrimination accuracy of the six AE vowel pairs in preplosive and prenasal contexts.

Effects of the Postvocalic Nasal on the Perception of American English



comparisons revealed that NJ discriminated /ɛ/-/ɪ/ and /æ/-/ɛ/ significantly less accurately in the prenasal context, but discriminated /æ/-/ɑ/ and /æ/-/ʌ/ significantly more accurately in the prenasal context than in the preplosive context. NE, on the other hand, discriminated /ɛ/-/ɪ/ less accurately in the prenasal context. Bonferroni-adjusted pairwise comparisons also revealed that in the prenasal context, NJ discriminated all the vowel pairs significantly less accurately than NE, while in the preplosive context NJ discriminated all the vowel pairs except /ɛ/-/ɪ/ significantly less accurately than NE.
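The Bonferroni adjustment behind these pairwise comparisons can be sketched in a few lines. The raw p-values below are hypothetical, chosen only to illustrate the multiplication-and-capping behavior; they are not values from the study.

```python
# Sketch of a Bonferroni adjustment for a family of pairwise comparisons.
# The p-values are hypothetical, not the study's actual values.

def bonferroni(p_values):
    """Multiply each raw p-value by the number of comparisons, capping at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

# e.g., six vowel-pair comparisons within one consonantal context
raw = [0.001, 0.004, 0.020, 0.150, 0.300, 0.800]
adjusted = bonferroni(raw)
significant = [p < 0.05 for p in adjusted]

print(adjusted)     # each raw p multiplied by 6, capped at 1.0
print(significant)  # only the comparisons surviving the correction
```

With six comparisons, a raw p-value must fall below roughly 0.0083 to survive the correction at the 0.05 level, which is why only the strongest contrasts remain significant.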

8.4 Identification

8.4.1 Procedure

The same stimuli used in the discrimination experiment were used in this part of the study. The listeners heard one stimulus per trial and responded by moving the cursor to the word that they heard. Six choices were given, and each choice was spelled in English orthography. The choices were aligned in the order /i, ɪ, ɛ, æ, ɑ, ʌ/ for each consonantal context. The listeners were given a word list and were instructed to match the spelling with the correct pronunciation. The listeners were also told that they would hear nonwords and therefore should not be concerned with the lexical meaning of each word. A listener’s accuracy in identifying each vowel in each consonantal context was assessed in eight trials (4 speakers × 2 repetitions). Thus, 48 trials were prepared for each consonantal context (4 speakers × 2 repetitions × 6 vowels). The ITI was 1000 ms, and a listener heard the next stimulus 1000 ms after responding to the previous one.

8.4.2 Results

The two listener groups’ mean percentages of identification accuracy and standard errors (SE) are shown in Figure 8.4. As in the preplosive context, NE outperformed NJ in the prenasal context (85.1 vs. 45.9 percent). A mixed-design ANOVA with listener group as the between-subjects factor, and consonantal context (preplosive vs. prenasal) and vowel as within-subjects factors yielded significant main effects of listener group [F(1, 108) = 413.12, p < 0.001], consonantal context [F(1, 108) = 91.66, p < 0.001], and vowel [F(5, 540) = 7.04, p < 0.001]. All the two-way interactions were significant at the p < 0.001 level, and




Figure 8.4  Mean percentages and standard errors of English and Japanese listeners’ identification accuracy of the six AE vowels in preplosive and prenasal contexts.

the three-way interaction of the three factors was also significant (p < 0.001). Bonferroni-adjusted pairwise comparisons revealed that NE identified /ɪ/, /ɛ/, and /ʌ/ less accurately in the prenasal context than in the preplosive context, but identified /i/ better in the prenasal context (p = 0.013). NJ, on the other hand, identified /i/, /ɪ/, and /æ/ significantly less accurately in the prenasal context, all at the p < 0.001 level, but identified /ɑ/ significantly better in the prenasal context (p < 0.001). Bonferroni-adjusted pairwise comparisons also revealed that NE identified all six vowels except /ɛ/ significantly better than NJ in the prenasal context (p < 0.001), while in the preplosive context NE’s identification accuracy of all six vowels was better than that of NJ (p < 0.001).
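Per-vowel identification accuracies of the kind plotted in Figure 8.4 can be tallied from a stimulus–response confusion matrix. The counts below are hypothetical (ASCII labels stand in for the six AE vowels), but the tally mirrors the procedure’s design of eight trials per vowel per context.

```python
# Sketch: per-vowel identification accuracy from a confusion matrix.
# Counts are hypothetical; as in the procedure, each vowel is tested
# in 8 trials per consonantal context (4 speakers x 2 repetitions).

VOWELS = ["i", "I", "E", "ae", "A", "V"]  # ASCII stand-ins for /i ɪ ɛ æ ɑ ʌ/

# confusion[stimulus][response] = number of responses
confusion = {
    "i":  {"i": 7, "I": 1},
    "I":  {"I": 5, "E": 3},
    "E":  {"E": 6, "I": 2},
    "ae": {"ae": 4, "E": 3, "A": 1},
    "A":  {"A": 7, "V": 1},
    "V":  {"V": 3, "A": 5},
}

def accuracy(confusion, vowel):
    """Percent of trials on which the stimulus vowel got its own label."""
    responses = confusion[vowel]
    total = sum(responses.values())
    return 100.0 * responses.get(vowel, 0) / total

for v in VOWELS:
    print(v, accuracy(confusion, v))
```

The off-diagonal cells are equally informative: in this toy matrix, /ɪ/ is mostly confused with /ɛ/ and /ʌ/ with /ɑ/, the same kinds of confusions discussed in the text.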

8.5 Discussion

Japanese listeners mapped six American English vowels produced in /CVn/ syllables to their Japanese vowels. All six vowels were mapped to more than one Japanese vowel with different degrees of fit. The classification overlap scores for vowel pairs varied considerably, suggesting that, according to PAM, these vowel pairs should differ in how successfully they are discriminated. The obtained discrimination scores largely supported this prediction. Specifically, vowel pairs with higher overlap




scores, namely, /i/-/ɪ/, /ɛ/-/ɪ/, /æ/-/ɛ/, and /ɑ/-/ʌ/, were discriminated significantly more poorly than those with lower overlap scores, namely, /æ/-/ɑ/ and /æ/-/ʌ/. However, the higher discrimination score for /i/-/ɪ/ compared to /ɛ/-/ɪ/ and /ɑ/-/ʌ/ was not predicted by their overlap scores. In the next sections, we discuss how the two listener groups’ discrimination and identification accuracy were affected by nasalization.

8.5.1  Effects of Nasalization on Vowel Discrimination

Overall, these results suggest that, compared with NE, NJ were more susceptible to the allophonic variation induced by nasalization. Specifically, /ɛ/-/ɪ/ is the only vowel pair that NE discriminated significantly less accurately in the prenasal context, whereas NJ had difficulty with both /ɛ/-/ɪ/ and /æ/-/ɛ/. The lower discrimination accuracy of these vowel pairs in the prenasal context can be attributed to the larger overlap of /ɪ/, /ɛ/, and /æ/ in the vowel space, as seen in Figure 8.1: /ɪ/ is produced with higher F1 frequencies, and /æ/ is raised and fronted in the prenasal context; thus, these vowels are spectrally closer and become more difficult to discriminate. An effect of this spectral proximity can be seen in the larger classification overlap scores in the prenasal context (Figure 8.2). On the contrary, NJ’s discrimination of /æ/-/ɑ/ and /æ/-/ʌ/ was more accurate in the prenasal context than in the preplosive context. In the preplosive context, /æ/, /ɑ/, and /ʌ/ were generally equated to the Japanese low vowel /a/, but in the prenasal context, because /æ/ is raised and fronted, it was more frequently equated with /e/ or /ea/, and the classification overlap scores of /æ/-/ɑ/ and /æ/-/ʌ/ were low in the prenasal context (7.75 percent each).
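One common way to quantify a classification overlap score of this kind is to sum, over all L1 response categories, the smaller of the two vowels’ classification percentages. The sketch below is a hedged illustration of that idea: the assimilation distributions are hypothetical, not the study’s actual data, though they mimic the reported pattern in which prenasal /æ/ shifts away from the low Japanese /a/.

```python
# Sketch of a classification overlap score: for two L2 vowels, sum the
# classification percentage mass they share across L1 response categories.
# The assimilation percentages below are hypothetical.

def overlap_score(dist_a, dist_b):
    """Shared classification mass between two vowels' L1-label distributions."""
    categories = set(dist_a) | set(dist_b)
    return sum(min(dist_a.get(c, 0.0), dist_b.get(c, 0.0)) for c in categories)

# Hypothetical percent classification of AE /ae/ and /A/ as Japanese vowels
ae_preplosive = {"a": 80.0, "aa": 15.0, "e": 5.0}
A_preplosive  = {"a": 70.0, "o": 25.0, "aa": 5.0}
ae_prenasal   = {"e": 50.0, "ea": 35.0, "a": 10.0, "ia": 5.0}
A_prenasal    = {"o": 60.0, "a": 35.0, "aa": 5.0}

print(overlap_score(ae_preplosive, A_preplosive))  # large shared /a/ mass
print(overlap_score(ae_prenasal, A_prenasal))      # small overlap once /ae/ is raised
```

Because the measure is a sum of pairwise minima, it is symmetric in the two vowels, and a pair whose distributions share no L1 category scores zero.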
To see whether classification overlap affects the discrimination accuracy of each vowel pair in each context, classification overlap scores and discrimination accuracy were submitted to Spearman’s correlation analysis, which yielded a moderate, significant negative correlation, ρ = −0.398 (p < 0.001). Thus, the larger the classification overlap, the less accurate the discrimination. For NE, /æ/-/ɛ/ was not particularly difficult to discriminate in the prenasal context; native speakers seem to be able to internalize, or adjust themselves to, this allophonic variation of /æ/ in the prenasal context. In some parts of the United States, /ɛ/-/ɪ/ mergers in perception and production in a prenasal context are observed; thus, these two vowels may be prone to confusion in a prenasal context.
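The rank correlation used here can be sketched in pure Python. For data without tied ranks, Spearman’s coefficient reduces to ρ = 1 − 6Σdᵢ²/(n(n² − 1)), where dᵢ is the difference between the two variables’ ranks for observation i. The overlap and accuracy values below are hypothetical and deliberately monotone, so the toy data yield ρ = −1; the chapter reports a moderate negative correlation on the real data.

```python
# Sketch of Spearman's rank correlation between classification overlap
# and discrimination accuracy. All data points are hypothetical.

def ranks(xs):
    """Rank each value from 1 (smallest) to n (largest); assumes no ties."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0] * len(xs)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

def spearman_rho(x, y):
    """rho = 1 - 6 * sum(d_i^2) / (n * (n^2 - 1)) for untied data."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

overlap_pct = [5.0, 7.75, 20.0, 35.0, 50.0, 62.0]   # hypothetical overlap scores
accuracy_pct = [95.0, 90.0, 85.0, 70.0, 66.0, 60.0]  # hypothetical discrimination

# Perfectly monotone decreasing toy data, so rho = -1.0 here.
print(spearman_rho(overlap_pct, accuracy_pct))
```

Note that a rank correlation only asks whether higher overlap goes with lower accuracy, not whether the relationship is linear, which is why it suits percentage data like these.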



8.5.2  Effects of Nasalization on Vowel Identification

NE’s inaccurate identification of /ɛ/ and /ɪ/ agreed with the discrimination results discussed above: nasalization makes the discrimination and identification of these two vowels more difficult for native English listeners. The fact that NE had trouble with /ɛ/ and /ɪ/ relative to /i/, /ʌ/, and /ɑ/ was also likely due to the previously discussed effect of nasalization; specifically, the distinction between /ɛ/ and /ɪ/ became less salient. The fact that /ɪ/ and /ɛ/ were each most frequently misidentified as the other suggests that in addition to a vowel raising effect, a vowel lowering effect of nasalization was also involved, causing /ɪ/ to be misheard as /ɛ/. NE’s poor identification of /ɪ/ (as /ɛ/) further supports this hypothesis. Furthermore, it is likely that the misidentification of these vowels led to the poor discrimination of the /ɛ/-/ɪ/ vowel pair. Interestingly, the lowering and raising effects appeared to affect only front vowels but not back vowels, as /ʌ/ and /ɑ/ were both identified with high accuracy (91.4 percent and 90.2 percent). Discrimination between /ɑ/ and /ʌ/ (85.1 percent) was also significantly better than discrimination between /ɛ/ and /ɪ/ and between /æ/ and /ɛ/ (68.2 percent and 76.9 percent, respectively). NE’s poorer identification of /ʌ/ in the prenasal context was largely due to their confusion of “ton” and “tun.” Although NE’s identification accuracy of /ʌ/ was lower in the prenasal context, the vowel was still identified with high accuracy; thus, it is not a particularly difficult vowel to identify in a prenasal context. The most interesting result was the finding that NJ found /ɛ/ to be the most identifiable of all six AE vowels, even though NE identified this vowel least accurately in the prenasal context. This is a typical example showing that nasalization does not affect vowel perception by native and nonnative speakers in the same way.
NJ had the most difficulty identifying the AE /i/ and /æ/, followed by the AE /ɪ/ and /ʌ/. The AE /ɛ/ was consistently mapped to the Japanese /e/. Larger classification overlaps with /ɪ/ and /æ/ made the identification of those two vowels, rather than of /ɛ/, difficult. In other words, /ɪ/ and /æ/ became perceptually closer to what Japanese speakers believe /ɛ/ sounds like, while /ɛ/ itself remained relatively unchanged as far as perceptual assimilation is concerned. NJ also had difficulty identifying /æ/ in the prenasal context. As shown above, /æ/ in the prenasal context is raised and fronted. This vowel is usually transcribed as /a/ in Japanese, and previous studies have shown that native Japanese listeners usually equate /æ/ with the Japanese /a/ or




/aa/ (Strange et al., 1998, 2001; Frieda & Nozawa, 2007; Nozawa & Wayland, 2012). Japanese speakers with limited exposure to English expect /æ/ to sound like the Japanese /a/ (Nozawa & Frieda, 2007) because /æ/ is commonly spelled as “a” as in bag or dad; in fact, however, /æ/ is a more distant exemplar of the Japanese /a/ than /ɑ/ is, as shown in its lower category goodness rating. In the prenasal context, the vowel was more frequently heard as /e/, /ia/, or /ea/ because of the prenasal raising or diphthongization of /æ/. /æ/ was identified less accurately because it does not sound like the Japanese /a/ in the prenasal context. NJ’s lower identification accuracy of /ɪ/ can be attributed to the apparent lowering effect of nasalization. Though /ɪ/ was most frequently equated to the Japanese /i/ in both preplosive and prenasal contexts (except in the /tVn/ context), a larger classification overlap score was observed between /ɛ/ and /ɪ/, and acoustic analysis revealed that /ɪ/ is in fact lower in the prenasal context. NJ identified /i/ less accurately in the prenasal context. As far as F1 and F2 frequencies are concerned, /i/ uttered in the prenasal context is not very different from /i/ in the preplosive context; thus, it is unlikely that /i/ was misidentified as lower vowels because it was perceived as lower. Previous studies demonstrated that native Japanese listeners identify /i/ most accurately when /i/ is equated with the Japanese two-mora /ii/ (Nozawa & Wayland, 2012; Nozawa, 2019), and Japanese listeners expect /i/ to sound like the Japanese /ii/ (Nozawa & Frieda, 2007; Nozawa, 2018). In the current study, however, NJ most frequently equated the AE /i/ with the Japanese single-mora /i/ in the prenasal context. In the preplosive context, /i/ was equated with the single-mora /i/ when preceded and followed by voiceless consonants (/pVt/, /tVt/, /kVt/) and with /ii/ when preceded and followed by voiced consonants (/bVd/, /dVd/).
Acoustic analysis revealed that the mean durations of /i/ were 100.8 ms in the voiceless consonant context, 184.3 ms in the voiced consonant context, and 159.5 ms in the prenasal context. The prenasal /i/ may not have been long enough to be heard as /ii/. Another possibility is that because the vowel is nasalized, NJ may have heard [n] earlier than the actual onset of the coda /n/, and as a result perceived the vowel as shorter. NJ identified the AE /ɑ/ slightly better in the prenasal context. This can be attributed to the fact that /ɑ/ was more frequently equated to the Japanese /o/ in the prenasal context. /ɑ/ is typically adapted as /o/ in Japanese because it is usually spelled as “o” as in “hot” or “pot,” even though the AE /ɑ/ is phonetically closer to the Japanese /a/ than to /o/.




Japanese speakers expect /ɑ/ to sound like /o/ (Nozawa & Frieda, 2007; Nozawa, 2018, 2019), and they identify /ɑ/ better when they perceive the vowel as an exemplar of the Japanese /o/ (Nozawa, 2019; Nozawa & Cheon, 2017). As shown in Figure 8.1, /ɑ/ is slightly higher in the prenasal context than in the preplosive context because of nasalization. /ʌ/ is the only AE vowel for which NJ’s identification accuracy did not reach 50 percent in either the preplosive or the prenasal context. Japanese speakers do not seem to have a clear image of this vowel (Nozawa, 2019). /ʌ/ is usually adapted as /a/ in Japanese, but because it is commonly spelled as “u” as in “cut” or “but,” it is difficult for inexperienced Japanese learners of English, like the NJ in the current study, to associate an English vowel that sounds like the Japanese /a/ with /ʌ/. Nasalization did not affect NJ’s identification accuracy of this vowel, and an acoustic analysis revealed that the F1 and F2 frequencies of /ʌ/ are not very different between the preplosive and prenasal contexts. These images of English vowels have been derived by Japanese speakers from the Japanese adaptation of English vowels. Limited exposure to authentic English and a huge influx of loanwords from English account for the strong influence of these “images” on native Japanese speakers’ identification of English vowels. Quackenbush (1974) commented that “most Japanese seldom or never hear spoken English; they do not attempt to pronounce English words, and they do not borrow English words. They simply use words of English origin that are borrowed for them by others, mainly writers. They pronounce them the way they hear them pronounced on radio and television, and they spell them the way they see them spelled in the popular press, that is, as fully assimilated Japanese words with a minimum of departures from the sounds and sound sequences and spelling principles that characterize native Japanese words” (p. 64). Kasahara et al.
(2012) demonstrated that Japanese children are more familiar with katakana English, i.e., loanwords from English transcribed in katakana, than they are with English words learned through English study and practice. In prenasal contexts, some of the AE vowels are “deviant” from the image of English vowels that Japanese speakers hold.

8.6 Conclusion

Nasalization led to a perceptual raising and lowering of /ɛ/ and /ɪ/, which made these two vowels difficult for native speakers of American English to discriminate and identify. The /ɛ/-/ɪ/ merger before a nasal consonant (or pin/pen merger) is one of the four widespread vowel mergers that




Labov and his colleagues have observed, the other three being the /i/-/ɪ/ and /ʊ/-/u/ mergers before /l/ and the /ɑ/-/ɔ/ merger. The /ɛ/-/ɪ/ merger before a nasal consonant is a common feature of English in the Southeastern United States, but the results of this study suggest that the vowel pair can be confusable even for native speakers of English from other parts of the United States. /ɛ/ was perceived as higher and /ɪ/ as lower, but the lower identification accuracy of /ɛ/ showed that /ɛ/ was perceived as closer to /ɪ/ rather than the other way around. This outcome may be where the lowering and raising effects of nasalization collide. A raised /æ/ before a nasal consonant seems to be a well-established allophone, and native listeners seem to be able to “internalize” the contextual effect, as suggested by Hillenbrand et al. (2001). Japanese listeners’ perception of American English vowels was also affected by nasalization, but differently. NJ heard the American English /i/ in the prenasal context as the Japanese single-mora /i/, which led to lower identification accuracy for this vowel. Both /i/ and /ɪ/ were heard as the Japanese single-mora /i/. However, nasalization also affected NJ’s perceived height of /ɪ/. The classification overlap scores of /ɛ/-/ɪ/ were larger in the prenasal context, leading to less accurate discrimination of the /ɛ/-/ɪ/ pair. The perceived height of /ɛ/ may have been affected by a coda nasal consonant, but because the vowel was labeled as the Japanese /e/ regardless of whether the following consonant was a plosive or a nasal, the identification of this vowel was not affected by nasalization. /æ/ has a prenasal allophone, which is shifted from its original or default position to the proximity of /ɛ/. This allophonic shift made /æ/-/ɛ/ more challenging in the prenasal context for NJ, but in turn, /æ/-/ɑ/ and /æ/-/ʌ/ were easier to discriminate in the prenasal context.
Thus, those vowels with large classification overlaps are generally difficult to differentiate.

References

Beddor, P. S. (1993). The perception of nasal vowels. In M. K. Huffman & R. A. Krakow (Eds.), Nasals, nasalization, and the velum (Phonetics and Phonology Vol. 5, pp. 171–196). New York: Academic Press.
Beddor, P. S., & Krakow, R. A. (1999). Perception of coarticulatory nasalization by speakers of English and Thai: Evidence for partial compensation. Journal of the Acoustical Society of America, 106, 2868–2887.
Beddor, P. S., Krakow, R. A., & Goldstein, L. M. (1986). Perceptual constraints and phonological change: A study of nasal vowel height. Phonology Yearbook, 3, 197–217.




Best, C. T. (1994). The emergence of language-specific phonemic influences in infant speech perception. In J. G. Goodman & H. C. Nusbaum (Eds.), The development of speech perception (pp. 167–224). Cambridge, MA: MIT Press.
Best, C. T. (1995). A direct realist view of cross-language speech perception. In W. Strange (Ed.), Speech perception and linguistic experience (pp. 171–204). Baltimore: York Press.
Best, C. T., McRoberts, G. W., & Sithole, N. M. (1988). Examination of perceptual reorganization for nonnative speech contrasts: Zulu click discrimination by English-speaking adults and infants. Journal of Experimental Psychology: Human Perception and Performance, 14(3), 345–360.
Bohn, O.-S., & Flege, J. (1992). The production of new and similar vowels by adult German learners of English. Studies in Second Language Acquisition, 14, 131–158.
Boomershine, A. (2013). The perception of English vowels by monolingual, bilingual, and heritage speakers of Spanish and English. In Selected proceedings of the 15th Hispanic Linguistics Symposium (pp. 103–118). Somerville, MA: Cascadilla Proceedings Project.
Bundgaard-Nielsen, R., Best, C. T., & Tyler, M. D. (2011). Vocabulary size matters: The assimilation of second-language Australian English vowels to first-language Japanese vowel categories. Applied Psycholinguistics, 32, 51–67.
Flege, J. E. (1995). Second language speech learning: Theory, findings, and problems. In W. Strange (Ed.), Speech perception and linguistic experience (pp. 223–277). Baltimore: York Press.
Flege, J. E., & MacKay, I. (2004). Perceiving vowels in a second language. Studies in Second Language Acquisition, 26, 1–34.
Fox, R. A., Flege, J. E., & Munro, M. J. (1995). The perception of English and Spanish vowels by native English and Spanish listeners: A multidimensional scaling analysis. Journal of the Acoustical Society of America, 97, 2540–2551.
Frieda, E., & Nozawa, T. (2007). You are what you eat phonetically: The effect of linguistic experience on the perception of foreign vowels. In O.-S. Bohn & M. J. Munro (Eds.), Language experience in second language speech learning: In honor of James Emil Flege (pp. 79–96). Amsterdam: John Benjamins.
Harnsberger, J. D. (2001). On the relationship between identification and discrimination of non-native nasal consonants. Journal of the Acoustical Society of America, 110, 489–503.
Hillenbrand, J. M., Clark, M. J., & Nearey, T. M. (2001). Effects of consonant environment on vowel formant patterns. Journal of the Acoustical Society of America, 109, 748–763.
Ingram, J. C. L., & Park, S.-G. (1997). Cross-language vowel perception and production by Japanese and Korean learners of English. Journal of Phonetics, 25, 343–370.
Johnson, K. (1997). Acoustic and auditory phonetics. Cambridge, MA: Blackwell.
Kasahara, K., Machida, N., Osada, E., Takahashi, T., & Yoshizawa, S. (2012). Vocabulary knowledge of 5th and 6th graders at elementary school: Connection between sound, meaning and spelling. JES Journal, 12, 90–101.




Krakow, R. A., Beddor, P. S., Goldstein, L. M., & Fowler, C. A. (1988). Coarticulatory influence on the perceived height of nasal vowels. Journal of the Acoustical Society of America, 83, 1146–1158.
Labov, W. (2010). Principles of linguistic change: Cognitive and cultural factors. West Sussex, England: Wiley-Blackwell.
Labov, W., Ash, S., & Boberg, C. (2005). Atlas of North American English: Phonetics, phonology and sound change. Berlin: Mouton de Gruyter.
Ladefoged, P. (2003). Phonetic data analysis: An introduction to fieldwork and instrumental techniques. Malden, MA: Blackwell.
Ladefoged, P. (2005). A course in phonetics (5th ed.). Boston: Thomson.
Lengeris, A. (2009). Perceptual assimilation and L2 learning: Evidence from the perception of southern British English vowels by native speakers of Greek and Japanese. Phonetica, 66, 169–187.
Levy, E. S. (2009). Language experience and consonantal context effects on perceptual assimilation of French vowels by American-English learners of French. Journal of the Acoustical Society of America, 125, 1138–1152.
Morrison, G. S. (2002). Japanese listeners’ use of duration cues in the identification of English high front vowels. In Proceedings of the 28th annual meeting of the Berkeley Linguistics Society (pp. 189–200).
Nozawa, T. (2018). How native speakers of Japanese and American English label vowels of each other’s L1 in terms of their L1 vowel categories. Research on Phonetic Language, 12, 39–54.
Nozawa, T. (2019). Effects of the manner of articulation of the syllable-final consonant on the perception of American English vowels by native Japanese speakers: Divergence between Japanese speakers’ image of English vowels and what English vowels really sound like to them. PhD dissertation, Osaka University.
Nozawa, T., & Cheon, S. Y. (2017). Identification of vowels of two different varieties of English by native speakers of Japanese and Korean. Journal of the Acoustical Society of America, 141, 3519 (abstract).
Nozawa, T., & Frieda, E. M. (2007). Perceptual similarity of American English and Japanese vowels for native speakers of American English and Japanese. Journal of the Acoustical Society of America, 122, 3529 (abstract).
Nozawa, T., & Wayland, R. (2012). Effects of consonantal contexts on the discrimination and identification of American English vowels by native speakers of Japanese. Journal of the Japan Society of Speech Sciences, 13, 19–39.
Olive, J. P., Greenwood, A., & Coleman, J. (1993). Acoustics of American English speech. New York: Springer.
Polka, L. (1992). Characterizing the influence of native language experience on adult speech perception. Perception and Psychophysics, 52, 37–52.
Polka, L. (1995). Linguistic influence in adult perception of non-native vowel contrasts. Journal of the Acoustical Society of America, 97, 1286–1296.
Quackenbush, E. (1974). How Japanese borrow English words. Linguistics – An Interdisciplinary Journal of the Language Sciences, 131, 59–85.
Roeder, R. (2010). Effects of consonantal context on the pronunciation of /æ/ in the English of speakers of Mexican heritage from South Central Michigan.




In D. R. Preston & N. Niedzielski (Eds.), A reader in sociophonetics (pp. 71–89). New York: De Gruyter Mouton.
Strange, W., Akahane-Yamada, R., Kubo, R., Trent, S. A., & Nishi, K. (2001). Effects of consonantal context on perceptual assimilation of American English vowels by Japanese listeners. Journal of the Acoustical Society of America, 109, 1691–1704.
Strange, W., Akahane-Yamada, R., Kubo, R., Trent, S. A., Nishi, K., & Jenkins, J. J. (1998). Perceptual assimilation of American English vowels by Japanese listeners. Journal of Phonetics, 26, 311–344.
Tanowitz, J., & Beddor, P. S. (1997). Temporal characteristics of coarticulatory vowel nasalization in English. Journal of the Acoustical Society of America, 101, 3194 (abstract).
Tyler, M. D., Best, C. T., Faber, A., & Levitt, A. G. (2014). Perceptual assimilation and discrimination of non-native vowel contrasts. Phonetica, 71, 4–21.
Wright, J. T. (1986). The behavior of nasalized vowels in the perceptual vowel space. In J. J. Ohala & J. J. Jaeger (Eds.), Experimental phonology (pp. 45–67). Orlando, FL: Academic Press.

Part III

Acquiring Suprasegmental Features

Chapter 9

Relating Production and Perception of L2 Tone

James Kirby and Đinh Lư Giang*

9.1 Introduction

The perception and production of second language (L2) speech have been widely studied in a variety of populations with a range of methods. One of the central questions in this line of research has been the degree to which perception guides production of L2 sound categories. According to Flege’s Speech Learning Model (SLM; Flege, 1995, 1999), the accuracy with which nonnative segments are perceived will limit how well they can be produced. The SLM posits that L2 ability is not simply a function of age, but rather depends on the nature of L2 exposure and usage as well as the structural similarities between the L1 and L2.1 The SLM attributes the often observed decrease in L2 production accuracy over the life-span to age-related changes in how the L1 and L2 systems interact: as perception becomes increasingly tuned to the L1, the likelihood of establishing new categories progressively decreases, because L2 sounds are increasingly perceived through the “filter” of the L1. Thus, although L2 perceptual ability is predicted to decrease with age, the SLM posits that this is due to perceptual attunement rather than to the effects of a critical acquisition period (Flege, 1999). In general, however, the SLM predicts that perception should precede production, and that perception and production abilities will converge over the course of learning. If this is the case,

* This project was funded in part by a Council of American Overseas Research Centers (CAORC) Senior Research Fellowship from the Center for Khmer Studies to J. Kirby. Thanks to Charles Nagle and audiences at the Institute of Phonetics and Speech Processing, LMU Munich; the Phonology Laboratory at the University of Chicago; and LabPhon 16 for thoughtful comments on earlier versions of this work. The authors are solely responsible for any errors of fact or interpretation.
We also extend our thanks to the People’s Committee of Giồng Riềng province, the clergy of the Cái Đuốc Giữa temple, and to all of the participants, without whom this work would not have been possible.

1 This basic premise is also shared by other models of L2 perception such as the Perceptual Assimilation Model (Best, 1995; Best & Tyler, 2007) and the Second Language Linguistic Perception Model (Escudero, 2005).






production and perception should generally be correlated, at least for novice and advanced speakers, and moderate correlations have been found in studies of both vowels and consonants in a variety of languages (Bettoni-Techio et al., 2007; Elvin et al., 2016; Flege, 1993; Flege et al., 1999; Levy & Law, 2010; Llisterri, 1995; Morrison, 2003).

However, there is also evidence suggesting that learning in production is not always dependent on perception developing first. In a longitudinal study of late L1 English learners of the L2 Spanish onset voicing contrast, Nagle (2018) found that production of the L2 contrast began to improve before learners’ ability to discriminate the contrast had reached native-like levels. In fact, there is some evidence that producing sounds during perceptual training may actually impede the formation of perceptual representations. Baese-Berk (2019) studied how L1 English speakers’ ability to produce a Spanish-like obstruent contrast was affected by training modality. She manipulated training modality (perception only or interleaved perception and production) while holding testing modality constant (all participants were tested for both production and discrimination). Participants who were trained in both perception and production showed substantial improvement in production accuracy, but their perceptual improvement lagged behind. In other words, they were more accurate at producing the contrast than perceiving it, suggesting that performance in production may be unrelated to performance in perception. Baese-Berk suggests this may be an effect of interleaving production and perception training, while Nagle raises the possibility that the production–perception link may be lagged or asynchronous. Studies like these provide evidence that perceptual ability does not always appear to be a necessary prerequisite for facility in production to improve (see Chapter 1).
There is also some evidence that perceptual difficulties may persist even after production is objectively “mastered” (Strange, 1995). An example is provided by Sheldon and Strange (1982), who tested L1 Japanese learners of L2 English on their ability to perceive and produce the /r/-/l/ contrast. The authors found that native English listeners were more accurate at distinguishing L2 productions of /r/ and /l/ than the Japanese listeners themselves were. This was interpreted as evidence that the production of an L2 contrast can be superior to the perception of that contrast, and thus that production and perception performance may be uncorrelated. The reasons underlying these apparent instances of “perceptuo-productive heteromorphism” (Bohn & Flege, 1997) – that correlations are sometimes observed and sometimes not – have been a source of ongoing investigation, potentially involving age limits on learning new forms of




articulation, the type of contrast being studied, and a diverse range of methodological differences such as the phonetic dimensions being measured to assess production (Flege, 1999), the interstimulus interval used in perception studies (Peperkamp & Bouchon, 2011; Wayland & Guion, 2003), and the tasks used to evaluate performance in each modality (Sakai & Moorman, 2018). All this work makes it clear that the relationship between production and perception is unlikely to be as straightforward as the classical models might suggest.

The production and perception of L2 tone has been studied for East Asian tone languages such as Thai (Gandour, 1983; Wayland & Guion, 2003), Vietnamese (Blodgett et al., 2008; Nguyen & Macken, 2008), and Mandarin Chinese (Wang et al., 2012; Yang, 2015). Much of this literature focuses on how properties of a learner’s L1, such as whether or not it is also a tone language, may affect their success at tone production and perception in the L2. In general, speakers of a tonal L1 are more accurate at identifying and discriminating tones in a tonal L2 compared to speakers whose L1 is nontonal (Francis et al., 2008; Hallé et al., 2004; Lee et al., 1996; Wayland & Guion, 2004), although even for tone language speakers, the specifics of the tone systems involved may play a nontrivial role (So & Best, 2010). Furthermore, listeners who speak a tonal L1 have been found to be more sensitive to pitch direction when perceiving L2 tones, while listeners with nontonal L1 backgrounds are more apt to attend to pitch height (Francis et al., 2008; Gandour, 1983; Guion & Pederson, 2007; Hallé et al., 2004).
In production, L2 learners whose L1 is nontonal often have a compressed pitch range compared to native tone language speakers (Chen, 1974), show interference with certain segments (Nguyen & Macken, 2008; Yang, 2012), and often have difficulty with accurately producing complex contour tones as well as determining the correct starting pitch height (Bauman et al., 2009; Blodgett et al., 2008).

Compared to the literature on segments, however, much less attention has been given to the production–perception relationship for L2 tone. The current, tentative consensus seems to be that, contra the predictions of L2 acquisition models, production leads perception, both in the sense of order of acquisition (production is mastered earlier) and facility (production ability is superior to perception). For example, in Yang’s (2012) study of American English learners of Mandarin Chinese, learners had considerable difficulty correctly identifying the rising tone /35/. However, this perceptual difficulty was not matched in production: learners’ productions of this tone were not any less likely to cause errors



James Kirby and Đinh Lư Giang

for native-speaker transcribers (but cf. Miracle, 1989; Ding et al., 2011). Yang (2012) suggests this may be because L2 tone production is primarily phonetic in nature, involving imitation and generalization of acoustic targets such as pitch heights, turning points, and perhaps durations. This same sensitivity to phonetic detail, however, works against learners in perception, because they lack robust phonological tone categories in the first place (see also Hallé et al., 2004). The perceptual advantage for L1 speakers of other tone languages would then be explained by their having phonological representations for tone categories that can be carried over from their L1.

As far as we are aware, almost all work explicitly addressing the production/perception relationship in L2 tone has focused on populations acquiring the L2 (usually Mandarin Chinese) in postsecondary instructional environments. This suggests another possible reason production has been found to lead perception, namely, the emphasis on repetition and assessment typical of this setting. In many scenarios, however, learners are receiving little or no formal training in the L2, but instead find themselves in immersion environments where the L2 is the medium of instruction. In these environments, learners are unlikely to be receiving targeted feedback on the phonetic realization of L2 tones (or segments, for that matter). The degree to which the L1 is used relative to the L2 would also presumably play a role (Flege et al., 1997), but as far as we know, this has not been studied for tone.

This study contributes to our understanding of production and perception of L2 tone by investigating how production and perception are realized at the level of individual speakers in a noninstructional setting. We consider how speakers of a nontonal language (Khmer) treat the tones of their L2 (Southern Vietnamese).
Because of the social and linguistic dynamics of southern Vietnam, this setting presents an interesting opportunity to study L2 tone acquisition “in the wild,” complementing studies of L2 tone acquisition in populations who have undertaken formal second language instruction, as well as those who have received explicit training specifically focused on improving tone production and/or perception. In an attempt to mitigate the methodological issue of selecting potentially arbitrary acoustic features, we opt to use global measures of curve similarity to measure the distance between tonal realizations. We consider how well L1 Khmer speakers of L2 Vietnamese distinguish Vietnamese tones in production by measuring their acoustic distances from native Vietnamese productions, but also by considering the extent to which the tones are acoustically distinctive in a speaker’s own




tone space. We also look at both native and nonnative listeners’ ability to discriminate these tones. By working with participants who have a broad range of ages and educational backgrounds, we can also gain some insight into how experience shapes the relationship between production and perception of L2 tone.

9.2  Language Background

9.2.1  Khmer Krom

Khmer is an Austroasiatic language spoken primarily in Cambodia, northeastern Thailand, and southern Vietnam.2 Khmer speakers have probably inhabited the Mekong Delta region since at least the seventh century CE. Today, there are around one million ethnic Khmers in Vietnam (General Statistics Office of Vietnam, 2010). Around 5 percent of speakers (mostly older) are monolingual in Khmer, while around 15 percent (mostly younger and/or of mixed Khmer-Vietnamese ethnicity) are monolingual in Vietnamese (Đinh Lư Giang, 2011).

The Khmer dialects spoken in present-day Vietnam are referred to variously as Southern Khmer or Khmer Krom (literally “Khmer from below”). Mutually intelligible with the Khmer varieties spoken in central Cambodia, they are often subsumed as part of the Central Khmer construct. That said, Khmer Krom varieties have at least some lexical and phonological features which differentiate them from Standard Khmer (Sochoeun, 2006, pp. 64–66), some of which are probably the result of contact (Đinh Lư Giang, 2011, 2015; Nguyễn Thị Huệ, 2010; Thạch Ngọc Minh, 1999). The Khmer varieties of Vietnam remain underdescribed.

Kiên Giang, one of Vietnam’s southernmost provinces, shares its northwestern border with Kampot province in Cambodia. Ethnic Khmer in Kiên Giang make up around 10 percent of the provincial population. The present study was conducted in the district of Giồng Riềng, where Khmers account for about 15 percent of the total population. In the hamlet of Ngọc Chúc, home to most of the participants in our study, nearly one-third of the population is Khmer.

Although Khmer is not a tone language, pitch does play a (very) limited contrastive role in at least some Khmer dialects, including the local variety spoken in Kiên Giang (Kirby, 2014; Kirby & Đinh Lư Giang, 2017; Thạch Ngọc Minh, 1999). Whether

2 This section is adapted from section 2 of Kirby and Đinh Lư Giang (2017); the reader is directed to that article for more detailed information on Kiên Giang Khmer.




or not this impacts the production and perception of their L2 Vietnamese tones is a question we return to in Section 9.5.

9.2.2  Southern Vietnamese

“Southern Vietnamese” refers to the relatively homogeneous language varieties of the Kinh (Vietnamese) people spoken in and south of Khánh Hoà province (Brunelle, 2015). Vietnamese dialects differ considerably in terms of phonetics, phonology, and lexicon, but with the exception of some central dialects, they maintain a high level of mutual intelligibility. The tone systems of the major Vietnamese dialects are well described (Brunelle, 2015; Hoàng Thị Châu, 1989; Phạm, 2003; Vũ Thanh Phương, 1982). Northern Vietnamese (NVN) has six tones that contrast in voice quality as well as pitch (Nguyễn Văn Lợi & Edmondson, 1998), while Southern Vietnamese (SVN) has five tones that are distinguished exclusively by differences in f0 height and excursion (see Table 9.1).

9.3  Methods and Materials

9.3.1  Participants

Eighteen adult speakers of Kiên Giang Khmer (18–47, 5 female; hereafter KG) and 10 monolingual native speakers of Southern Vietnamese (19–52, 7 female; hereafter VN) were recruited from the local population. The Khmer speakers also took part in a separate study (Kirby & Đinh Lư Giang, 2017). All Khmer participants completed a short questionnaire which asked their year of birth (age), their highest completed grade (education), as

Table 9.1  Production stimuli

Item      Tone        Orthography   Gloss
taː33     ngang       ta            ‘1sg (neutral, nonformal)’
taː21     huyền       tà            ‘dusk, twilight’
taː35     sắc         tá            ‘dozen’
taː214    hỏi-ngã a   tả            ‘describe’
taː212    nặng        tạ            ‘picul (100 kg)’

Note: Vietnamese names for tones are given for reference.
a The hỏi and ngã tones, which are distinct in Northern Vietnamese, are merged in Southern Vietnamese.




well as a self-reported assessment of what percentage of their daily language usage was Vietnamese as opposed to Khmer (vietnamese usage). We did not explicitly ask about age of first exposure to Vietnamese, although we surmise that for most participants it coincided with the onset of formal education (so between ages four and six). Participants’ ages ranged from 18 to 52 (mean 35). Education level ranged from no formal schooling of any kind to 12 years (completion of upper secondary education in the Vietnamese system), with the average being completion of grade 7. Self-assessment of the percentage of Vietnamese used in daily life ranged from 10 to 80 percent (mean 40 percent). All Khmer participants self-reported as native speakers of Khmer, and our impressions corroborated these self-assessments.

Khmer participants completed the production and perception studies at the Cái Đuốc Giữa temple in Ngọc Bình village, Ngọc Chúc hamlet, Giồng Riềng district, Kiên Giang province. Sessions with the Vietnamese participants took place at the Trung tâm Học tập Cộng đồng UBND xã Ngọc Chúc (Community Learning Center of the Ngọc Chúc People’s Committee). All data were collected in August 2011.

9.3.2  Production Study: Methods and Materials

Participants were recorded producing the syllable /taː/ three times with each of the five Southern Vietnamese tones in the carrier phrase Tôi nói ______ cho anh biết [toj33 noj35 ____ cɔ33 an33 biək45] “I say ____ for you.” This syllable was selected because it can be combined with all five tones to give commonly occurring lexical items (see Table 9.1). 24-bit, 44.1 kHz recordings were made using an omnidirectional headset condenser microphone and a portable solid-state recorder. Recordings were annotated in Praat (Boersma & Weenink, 2015) to indicate the onset and offset of phonation, and a Praat script was used to measure f0 at 11 equidistant points in the vowel.
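As a rough illustration of this last step, f0 can be resampled at 11 equidistant points between the annotated onset and offset by linear interpolation. The sketch below is a simplified stand-in for the Praat script; the function name and the f0 track are invented for illustration:

```python
def sample_f0(track, onset, offset, n_points=11):
    """Linearly interpolate an f0 track (sorted (time_s, f0_hz) pairs)
    at n_points equidistant times between onset and offset."""
    out = []
    for i in range(n_points):
        t = onset + (offset - onset) * i / (n_points - 1)
        t = min(max(t, track[0][0]), track[-1][0])  # clamp to track range
        for (t0, f0), (t1, f1) in zip(track, track[1:]):
            if t0 <= t <= t1:
                w = 0.0 if t1 == t0 else (t - t0) / (t1 - t0)
                out.append(f0 + w * (f1 - f0))
                break
    return out

# Invented falling contour: 180 Hz at vowel onset down to 120 Hz at offset
track = [(0.00, 180.0), (0.05, 170.0), (0.10, 155.0), (0.15, 135.0), (0.20, 120.0)]
points = sample_f0(track, onset=0.0, offset=0.2)
print(len(points), points[0], points[-1])  # → 11 180.0 120.0
```

Time-normalizing every token to the same number of points is what makes contours of different durations directly comparable in the distance measures introduced below.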
9.3.2.1  Measuring Production Accuracy

Typically, studies of L2 tone production measure accuracy either in terms of acoustic landmarks like pitch range, overall f0 change, timing of turning points, and so on, and/or in terms of native-speaker evaluations (e.g., Chen, 1974; Wang et al., 2003; Yang, 2012). In order to facilitate comparison to perception data, however, it can be useful to have a “one-number summary” of similarity, which potentially captures other aspects of the f0 contours, such as slope. For this, we considered two global



James Kirby and Đinh Lu, Giang

measures of trajectory comparison: the dynamic time warping (DTW) distance (Müller, 2007) and the Fréchet distance (Chambers et al., 2010). The DTW distance derives from an algorithm originally developed in the context of speech recognition to find the optimal alignment between two sequences of different lengths. The DTW distance between two sequences X and Y is the minimum of the sum of distances:

$$\mathrm{DTW}(X, Y) \;=\; c_{p^{*}}(X, Y) \;=\; \min_{p}\, c_{p}(X, Y), \qquad c_{p}(X, Y) := \sum_{l=1}^{L} c\left(x_{n_{l}}, y_{m_{l}}\right),$$

where the minimum is taken over all warping paths $p = (p_1, \ldots, p_L)$ with $p_l = (n_l, m_l)$, and $c$ is a local cost measure.
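For illustration, this minimum can be computed with the standard dynamic-programming recurrence for DTW. The following minimal pure-Python sketch (not the implementation used in this chapter) takes absolute f0 difference as the local cost measure c:

```python
def dtw_distance(x, y, cost=lambda a, b: abs(a - b)):
    """DTW distance between sequences x and y: the minimum summed local
    cost over all warping paths, computed by dynamic programming."""
    n, m = len(x), len(y)
    INF = float("inf")
    # d[i][j] = minimal cost of aligning x[:i] with y[:j]
    d = [[INF] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            c = cost(x[i - 1], y[j - 1])
            d[i][j] = c + min(d[i - 1][j],      # advance x only
                              d[i][j - 1],      # advance y only
                              d[i - 1][j - 1])  # advance both
    return d[n][m]

# Identical contours are at distance 0; a level contour vs. a rise is not.
print(dtw_distance([100, 100, 100], [100, 100, 100]))  # → 0.0
print(dtw_distance([100, 100, 100], [100, 110, 120]))  # → 30.0
```

The table is filled in O(nm) time; optimized library implementations additionally recover the optimal warping path p* itself.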

The Fréchet distance between two curves, sometimes also called the “dog-walking distance,” is “the minimum length of a leash required to connect a dog and its owner as they walk without backtracking along their respective curves from one endpoint to the other” (Chambers et al., 2010, p. 295). Because the Fréchet metric takes the shape of the curves into account, it can provide a more accurate similarity measure than alternative measures which first reduce the curves to a small number of points. It can be thought of as the minimum of the maximum distance between the curves. The Fréchet distance δ is given as

$$\delta(X, Y) \;=\; \min_{\alpha, \beta}\; \max_{t \in [0,1]}\; d\left(X(\alpha(t)),\, Y(\beta(t))\right).$$

This reads as: for every possible pair of functions α(t) and β(t), find the largest distance between the man and his dog as they walk along their respective paths, and keep the smallest distance found among these maximum distances.

9.3.3  Perception Study: Methods and Materials

Following their production session, each participant completed an AX discrimination task. Five syllables (/taː/ with each of the five Southern Vietnamese tones) were synthesized using the KlattSyn implementation in Praat 5.4.08 (Boersma & Weenink, 2015), based on pilot recordings taken from two native speakers of the local Southern Vietnamese dialect who did not otherwise participate in the study. A spectrogram of the stimulus and the synthesized f0 contours are shown in Figure 9.1. Stimuli were then arranged to form 30 AX pairs, 10 “same” pairs and 20


Figure 9.1  (top) Waveform and spectrogram of stimulus /taː33/. (bottom) f0 contours of synthesized perception stimuli.

“different” pairs, forming all possible permutations of both orders. Within a pair, stimuli were separated by a 300 ms interstimulus interval (ISI). Responses were recorded by pressing keys on a laptop keyboard (g for “same,” k for “different,” the first letters of the corresponding Vietnamese words). Five hundred milliseconds of silence followed each button press before the next stimulus pair was presented. A short ISI was selected as nonnative listeners are typically found to have better discrimination in short ISI conditions (Burnham & Francis, 1997; Wayland & Guion, 2003; Werker & Tees, 1984). Participants heard each pair five times, with presentation order randomized within block and participant. All participants completed a short pretest with 10 pairs (5 same, 5 different) to ensure they understood the nature of the experimental task. The entire experiment took most participants about 10–15 minutes to complete.
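Returning briefly to the distance measures of Section 9.3.2.1: the Fréchet distance also admits a simple discrete approximation for sampled contours, via the coupling recurrence of Eiter and Mannila (1994). A minimal sketch, again not the chapter's implementation, with absolute f0 difference as the pointwise distance d:

```python
from functools import lru_cache

def discrete_frechet(p, q, d=lambda a, b: abs(a - b)):
    """Discrete Fréchet distance between sampled curves p and q:
    the minimum over couplings of the maximum pointwise distance."""
    @lru_cache(maxsize=None)
    def c(i, j):
        dij = d(p[i], q[j])
        if i == 0 and j == 0:
            return dij
        if i == 0:                       # only q can advance
            return max(c(0, j - 1), dij)
        if j == 0:                       # only p can advance
            return max(c(i - 1, 0), dij)
        return max(min(c(i - 1, j), c(i - 1, j - 1), c(i, j - 1)), dij)
    return c(len(p) - 1, len(q) - 1)

print(discrete_frechet((100, 100, 100), (100, 100, 100)))  # → 0
print(discrete_frechet((100, 100, 100), (100, 110, 120)))  # → 20
```

For densely sampled curves such as the 11-point f0 contours used here, the discrete value closely approximates the continuous Fréchet distance.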

9.4  Results

For brevity and expositional clarity, and given the small sample size of the study, we focus here primarily on descriptive statistics and informative visual displays. The reader interested in more sophisticated statistical summaries should consult the data and code, available at https://doi.org/10.7488/ds/2635.

9.4.1  Production

Figure 9.2 plots the f0 contours for the five Southern Vietnamese tones averaged over VN (left) and KG (right) speakers. Among the KG speakers we observe pitch range compression, typical of both tonal (Chen, 1974) and nontonal (Mennen, 1998; Zimmerer et al., 2014) L2; deviation from native-speaker targets in terms of the timing of the turning points (Wang et al., 2003); and a possible merger/confusion between the two complex contour tones 212 and 214, perhaps unsurprising given that they are acoustically indistinguishable for at least the first 30 percent of their excursions.

Table 9.2 shows the mean global distances between the KG and VN productions of the Vietnamese tones. As the Fréchet and DTW distances are strongly correlated (ρ = 0.82), the remainder of the chapter will focus on the Fréchet distance.3 For the KG speakers, mean Fréchet distance correlates most strongly with speaker age (0.72), followed by education (−0.53) and to a lesser extent vietnamese usage (−0.35). age and

3 It is worth noting that the ranking is not perfectly matched, with the Fréchet distance penalizing the shallow slope of the KG realization of the /35/ sắc tone more heavily than DTW.




Table 9.2  Mean global Fréchet and DTW distances between KG and VN tone productions, from most to least similar

Tone           Fréchet   DTW
33   ngang     1.1        8.0
21   huyền     1.9       10.6
212  nặng      2.2       15.8
214  hỏi-ngã   2.9       14.1
35   sắc       3.1       13.7


Figure 9.2  Average f0 contours for Southern Vietnamese tones across speakers by L1. Shading ribbon, where present, indicates 95 percent confidence interval.

education are negatively correlated (−0.67), as are vietnamese usage and age (−0.5), while self-reported usage increases with education (0.62).

Although the averages in Figure 9.2 are broadly representative, there was also considerable individual variation among the Khmer (but not Vietnamese) participants. Figure 9.3 shows the tones produced by 6 of the 18 KG speakers, averaged over utterances (plots for all speakers can be found in the Supplementary Materials). In general, older speakers tended to group tones into two pitch registers, such as high and low (KM7, KF4) or high and rising (KF1). Interestingly, which tones were grouped together was not always consistent: for example, the 33 tone seems to be treated as part of a high register for KM7 and KF1, but as part of a low register by KF4.


(Figure 9.3 panels, by row: 20, 12, KM9; 24, 12, KM11; 24, 12, KF7; 46, 9, KM7; 48, 3, KF1; 51, 0, KF4.)

Figure 9.3  Tone productions for six KG participants, averaged over repetitions of each target syllable. The header for each panel shows age, highest grade completed (scale of 0–12), and subject code (KM = male, KF = female). Shading ribbon, where present, indicates 95 percent confidence interval.

9.4.2  Perception

The results of the AX discrimination task were converted into accuracy scores (1 = correct, 0 = incorrect) and are plotted in Figure 9.4, which shows just the “different” responses; including all responses does not meaningfully impact the results (a change of just 1.5 percent in the mean difference in accuracy across all participants). Results are collapsed across presentation order; that is, 33–21 and 21–33 are both treated as a single pair, 33/21.

Vietnamese participants had an overall mean accuracy of 89 percent, while mean accuracy for Khmer participants was 71 percent. Khmer listeners appeared to have the most difficulty with pairs involving overlapping pitch ranges, especially 21/212 (huyền/nặng) and 21/214 (huyền/hỏi-ngã). Of note is the fact that the 212/214 (nặng/hỏi-ngã) pair was difficult for both groups; this is likely due to the speeded nature of the AX task, combined with the fact that these stimuli are identical for nearly a third of their total excursions. Simple generalized linear mixed-effects logistic regressions predicting the correctness of each trial (correct/incorrect) on the basis of trial, tone pair, and language (with subject-specific intercepts) are consistent with the figure: a model with a predictor language provides a better fit than one with just trial,


Figure 9.4  Mean discrimination accuracy by tone pair, averaged over speakers and repetitions.

tone pair, and their interaction (χ2 = 7.26, df = 1, p = 0.007), and is further improved by the addition of a tone pair:language interaction (χ2 = 31.24, df = 10, p < 0.001), which better models the group-level differences in discrimination accuracy for pairs such as 21/212 and 21/214.

To get a sense of how the demographic variables (age, education, vietnamese usage) correspond to discrimination accuracy, we computed a mean discrimination accuracy for each Khmer listener and correlated it with each variable. Discounting the responses of one clear outlier (KM5, who appeared to have treated this as a dissimilarity task), mean accuracy was correlated most strongly with education (0.65) and to a lesser extent (inversely) with age (−0.35). The weakest correlation was with vietnamese usage (0.13).

9.4.3  Relating Production and Perception

Figure 9.5 shows the production patterns of two speakers, KM10 (male, age 19, completed seventh grade) and KF1 (female, age 48, completed third grade), with their mean pair-level discrimination accuracies given in Table 9.3. These two speakers illustrate two types of patterns in the data. First, accuracy in distinguishing one tone from another can be quite poor even when production of those tones is objectively native-like. For example, KM10 produces rather native-like tones /33/ and /212/ (Fréchet distances from VN productions of 1.25 and 1.22, respectively), but was at




Table 9.3  Mean discrimination accuracies for KM10 and KF1 by tone pair

Pair      KM10   KF1
33/21     0      0.3
33/35     0.9    0.3
33/212    0.56   0.6
33/214    0.6    0.5
21/35     0.75   0.7
21/212    0.38   0.7
21/214    0.57   0.4
35/212    1      0.6
35/214    0.63   0.3
212/214   0.43   0.8


Figure 9.5  Tone productions for KM10 and KF1. Shading ribbon, where present, indicates 95 percent confidence interval.

chance distinguishing them from one another (mean discrimination accuracy of 0.56). Similarly, his native-like tone /21/ production (δ = 1.52) did not seem to help him distinguish it from tone 33, which he failed to do on every trial. At the same time, these data suggest that listeners can be relatively good at discriminating two tones even when their productions are not native-like, so long as they are acoustically distinct. This is illustrated by

Relating Production and Perception of L2 Tone



KF1, whose productions of /214/ and /212/ are rather dissimilar to native targets (δ = 2.6 and 5.7 from VN), but who nevertheless is fairly accurate at discriminating these tones, perhaps because she keeps them distinct in her own productions. Conversely, her (non-native-like) production of /21/ (δ = 3.7 from VN) is virtually identical to her (rather more native-like) /214/ tone, and her discrimination accuracy on this tone pair is less than 50 percent.

Based on these observations, we explored two possible ways of relating production and perception of L2 tone, based on two different operationalizations of production accuracy: as a deviation from native norms (9.4.3.1), and as a within-speaker difference between tone pairs (9.4.3.2). In both cases, we operationalize perception as discrimination accuracy averaged over all pairs in which a tone occurs.

9.4.3.1  Correlation with Mean Discrimination Accuracy

First, for each tone T for each speaker, we compared the Fréchet distance between T and its VN exemplar with that speaker’s mean discrimination accuracy over all pairs containing T (a rough-and-ready measure of “perception accuracy”). For example, speaker KM10’s production of tone /35/ had a (fairly high) mean Fréchet distance from the VN target of 2.25, but a mean discrimination accuracy of (0.9 + 0.75 + 1 + 0.63)/4 = 0.82. The overall correlation was weak (ρ = −0.3), but in the expected direction: smaller Fréchet distances correlate with higher discrimination accuracies. We then fit a linear mixed model predicting discrimination accuracy from a linear combination of fréchet distance, age, education and vietnamese usage, with random intercepts for speaker and tone and by-speaker slopes for distance.
The coefficient estimate for distance was 0.7, with a standard error of 0.78 and a t value of 0.89; thus, even if this effect is robust (and given the small sample size, it is almost certainly anticonservative), it would mean that a fairly large one-unit change in Fréchet distance corresponds on average to less than a 1 percent difference in discrimination accuracy. None of the demographic predictors emerged as statistically significant (p-values from 0.06 to 0.33), and coefficient estimates were again very small, ranging from −0.5 to 1.6.

9.4.3.2  Correlation with Pairwise Discrimination Accuracy

Next, on the basis of the within-subject separations observed in Section 9.4.3, we correlated the Fréchet distance between a Khmer speaker’s own productions of a particular tone pair – regardless of their similarity to native-speaker productions – with their discrimination accuracy for that same tone pair. For example, KF1 has a large Fréchet distance between




her own productions of /21/ and /212/, since she (“incorrectly”) produces /21/ as a high level tone, but her discrimination accuracy on this pair is fairly high (0.7). As in Section 9.4.3.1, the overall strength of correlation was weak (ρ = 0.3) but in the expected direction: larger Fréchet distance correlates with higher discrimination accuracy. Here, in a linear mixed model predicting discrimination accuracy from a linear combination of distance, age, education and vietnamese usage, with random intercepts for speaker and tone pair and by-speaker slopes for distance, the distance predictor is statistically significant (β = 2.95, SE = 1.34, t = 2.20) but the effect size remains very small.
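As a concrete illustration of the perception measure used in these analyses, the pair-averaged accuracies can be reproduced directly from Table 9.3: averaging KM10's accuracy over all pairs containing /35/ recovers the 0.82 quoted in Section 9.4.3.1. A sketch using the Table 9.3 values:

```python
# KM10's mean discrimination accuracies by tone pair (Table 9.3)
km10 = {("33", "21"): 0.0,   ("33", "35"): 0.9,  ("33", "212"): 0.56,
        ("33", "214"): 0.6,  ("21", "35"): 0.75, ("21", "212"): 0.38,
        ("21", "214"): 0.57, ("35", "212"): 1.0, ("35", "214"): 0.63,
        ("212", "214"): 0.43}

def mean_accuracy(tone, pair_acc):
    """Mean discrimination accuracy over all pairs containing `tone`."""
    accs = [acc for pair, acc in pair_acc.items() if tone in pair]
    return sum(accs) / len(accs)

print(round(mean_accuracy("35", km10), 2))  # → 0.82
```

Representing pairs as tuples (rather than strings) matters here: tuple membership tests exact tone labels, so "21" is never spuriously matched inside "212" or "214".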

9.5  Discussion

In general, both Khmer and Vietnamese listeners were able to accurately discriminate most pairs of Vietnamese tones. While the native Vietnamese listeners had overall higher discrimination accuracies, the Khmer listeners were also fairly skilled at this task, and both groups had difficulty with the same pairs of tones. Production, conversely, was much more variable: some KG participants produced Vietnamese tones that were quite close to those of native speakers, while others produced realizations that would potentially confuse a native listener if heard in isolation.

In terms of the Fréchet distance between a given L2 production of a tone and its native-speaker exemplar, we found the largest raw correlation to be with speaker age: all else being equal, younger KG speakers were more likely to produce tones which were more similar to those of native speakers. Discrimination accuracy was best predicted by amount of education, which correlates strongly with age only for the oldest and youngest speakers in our sample. The tonal pairs which presented the most difficulty for KG listeners were those which shared aspects of phonetic realization such as pitch height and contour, although to some extent these proved challenging for the native listeners as well, probably due to the speeded nature of the discrimination task.

We also considered two approaches to relating tone production and discrimination. The first compared KG speakers’ tone productions to those of native speakers by measuring the acoustic distance between the f0 contours of KG speakers and VN exemplars. The second compared the acoustic distance between any two tones in a given speaker’s own tone productions with that speaker’s ability to discriminate between native-speaker productions of those same tones. Modest correlations were observed in both cases, but while the effect of speaker-internal distance




was significant in our second model, the size of the effect was extremely small after parceling out the variation due to individuals and tones.

All of our KG participants demonstrated high, if not completely native-like, perceptual discrimination performance, consistent with the prediction made by models like the SLM that perceptual facility precedes production ability. The productions, compared to native-speaker exemplars, were much more variable. The relative uniformity of perceptual accuracy and the high degree of variability in production mirror the findings of Baese-Berk (2019) and Nagle (2018), and underscore the finding that production accuracy is not necessarily promoted by having achieved a native-like perceptual facility. Although we do not have data on the time course of acquisition, it is clear that strong perceptual skills do not automatically transfer to production, a result which corroborates other L2 studies (e.g., Kartushina et al., 2015). This would appear to hold regardless of whether or not L2 perceptual abilities preceded production for all of our KG participants. In this respect, the present findings do not appear to support the prediction of models like PAM and SLM that perception and production will converge over the course of learning, but it is worth considering the possible reasons why.

One reason may have to do with the interaction of input and usage rates. Bohn & Flege (1997) suggest that experience affects production more than perception. They found that experienced L1 German learners of L2 English (designated as speakers who had lived in the United States for at least five years) were able to produce an /a-æ/ contrast not present in their L1 more accurately than inexperienced German learners of English. However, degree of experience had less of an impact on perception, consistent with the predictions of the SLM.
If perception is tuned fairly early in acquisition, the considerable, if passive, exposure to Vietnamese tones may explain the relatively good discrimination abilities of our KG participants. Conversely, as shown by Bohn & Flege, improving production at a later stage is possible, but requires a real difference in usage rate. While all of our KG participants grew up in an environment where Vietnamese would be heard, not all of them used it to the same extent, and crucially, these usage rates may have differed at particular time periods over the course of L2 acquisition.

The weak correlation we observe between acoustic separation in a speaker’s own L2 production repertoire and his or her ability to distinguish two tones in perception is especially intriguing. This finding seems consistent with work showing that the degree to which a speaker clearly differentiates two L1 categories in production correlates with facility to



James Kirby and Đinh Lu, Giang

discriminate those categories in perception (Byun & Tiede, 2017; Ghosh et al., 2010; Perkell et al., 2004). This type of production–perception correlation is predicted by models of speech production such as DIVA (Guenther & Perkell, 2004), in which planning goals are regions in a multidimensional, acoustic-auditory and somatosensory space. What is interesting in the present case is that this would seem to hold even when the acoustic-auditory input fails to match the production region. What seems more relevant for predicting discrimination accuracy in our study is not whether tones are well separated in the native acoustic space, but in the listener’s own production repertoire (with the important caveat that the correlation coefficient was rather small). This suggests that the relation between L2 production and perception may be mediated by the L2 acoustic targets, even if these are objectively non-native-like. That is, learners would have categories for each tone class, as abstractions over sets of lexical items, and would learn to associate native Vietnamese pitch contours with those classes. At the same time, they would be developing a separate set of production routines, also associated with those same tone classes/lexemes, but which may not bear any particular resemblance to the pitch targets learned from perception. If the production routines are co-activated when receiving acoustic input, having well-separated production targets for tones A and B would facilitate perception. This scenario supposes that, even in a setting which is supposed to target low-level, precategorical phonetic information, L2 discrimination is nevertheless mediated through some kind of intermediate representation. This may seem unexpected in the context of the current study, given that the very short (300 ms) ISI used is expected to discourage the use of phonological processing. 
However, as noted by Wayland & Guion (2003), while a short ISI can facilitate discrimination for inexperienced listeners, this does not necessarily rule out access to phonological information, especially for more experienced learners. We further note sporadic reports of language-specific effects in speeded AX discrimination elsewhere in the literature (e.g., Huang, 2007).

Our findings also lead us to ask how some speakers come to develop tonal production targets that are so divergent from the native-speaker exemplars. One possibility is that L2 Vietnamese tone perception is actually affected by the KG speakers’ L1 prosodic system. The tendency of older speakers to group tones into two registers is consistent with findings indicating that less proficient listeners are more likely to be sensitive primarily to tone height than to contour (Gandour, 1983; Hallé et al., 2004). It might also be related to the fact that KG Khmer has a nascent pitch-based

Relating Production and Perception of L2 Tone



contrast between level and rising f0 (Kirby & Đinh Lư Giang, 2017; Thạch Ngọc Minh, 1999). However, this quasi-tonal use of f0 is extremely limited in KG Khmer, distinctive only in items which have lost /r/ in onset position (e.g., Standard Khmer /krɑː/ > KG Khmer [kɑ̌ː] “poor,” SK /riən/ > KG [hǐən] “to learn”) and distinguishing perhaps 20 or 30 minimal pairs. Furthermore, there is no evidence that this use of f0 has spread or is spreading to any other contexts. As demonstrated by So & Best (2010), experience with L1 tones (or other prosodic suprasegmentals) does not necessarily facilitate L2 tone perception; rather, any benefit depends heavily on both the phonemic status of the contrast and the phonetic features of the tones themselves. For all practical purposes, we view KG Khmer as a nontonal language, and thus are more inclined to attribute the differences between speakers to properties of those individuals such as age, fluency, and degree of usage/exposure. Finally, it is worth bearing in mind that the statistical evidence of any production–perception link can be impacted by methodological as well as linguistic factors. As Nagle (2018) and Sakai & Moorman (2018) remind us, the type of task chosen in a given L2 study may considerably impact the results. On the production side, the present study utilized a simple reading task, using aural and orthographic prompts. However, we must recognize the possibility that some participants may simply have been confused about which item they were expected to produce. Despite prompting by a native speaker of Southern Vietnamese (the second author), this procedure did not guarantee imitation; if the participant misheard the cue, they may have been accurately producing the tone they thought they had been asked to produce. The desire to obtain a minimal tone set (where the syllable content did not vary) meant including items that were difficult to depict in a picture-naming task.
Similarly, we should be careful not to overinterpret the results of our AX discrimination experiment as a stand-in for “perception.” Recall that Yang (2015) determined that production abilities tended to be ahead of perception for L1 English late learners of L2 Mandarin. However, Yang’s perception study was a 4AFC lexical identification task, in which real lexical items in a meaningful carrier phrase were heard with a range of resynthesized f0 contours. This is clearly a very different kind of task from speeded AX discrimination, with the latter tapping primarily into auditory abilities rather than phonological or lexical knowledge. In short, while one can imagine a range of improvements to our experimental procedures, we simply point out that the present findings are likely heavily task-dependent and should be interpreted with appropriate caution.



James Kirby and Đinh Lư Giang

9.6 Summary

We compared the lexical tone productions by native speakers of Southern Vietnamese with those of speakers of Kiên Giang Khmer with L2 knowledge of Vietnamese, and also considered the discrimination of tones for the same L2 speakers. Production accuracy, as measured by the Fréchet distance between f0 contours, was most strongly predicted by age, while discrimination correlated best with the length of a listener’s education. The correlations observed between production and perception – one between discrimination accuracy and the acoustic distance from a native-speaker exemplar, and one between discrimination accuracy and the speaker-specific acoustic separation – were at best modest. Our results are broadly consistent with previous work indicating that L2 production can be independent of perception; however, for the purpose of understanding how production and perception are related, we suggest that the notion of “accuracy” in production may benefit from considering measures in addition to the degree to which a native-speaker target is approximated.
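The Fréchet distance used above as the production-accuracy measure can be computed for discretely sampled f0 contours with the standard dynamic-programming recursion for the discrete (coupling) variant of the distance. The sketch below is illustrative only: it is not the authors' implementation (their R code is in the supplementary materials), and it treats each contour as a plain 1-D list of f0 values assumed to be sampled at comparable normalized time points.

```python
def discrete_frechet(p, q):
    """Discrete Fréchet distance between two sampled f0 contours.

    p, q: sequences of f0 values (Hz or normalized), assumed sampled
    at comparable normalized time points. Uses the dynamic-programming
    recursion for the discrete coupling measure.
    """
    n, m = len(p), len(q)
    ca = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            d = abs(p[i] - q[j])  # pointwise distance between samples
            if i == 0 and j == 0:
                ca[i][j] = d
            elif i == 0:
                ca[i][j] = max(ca[i][j - 1], d)
            elif j == 0:
                ca[i][j] = max(ca[i - 1][j], d)
            else:
                ca[i][j] = max(min(ca[i - 1][j],
                                   ca[i - 1][j - 1],
                                   ca[i][j - 1]), d)
    return ca[n - 1][m - 1]

# An identical learner contour scores 0; divergence from the native
# exemplar increases the distance.
native = [220.0, 210.0, 200.0, 195.0]
print(discrete_frechet(native, native))  # 0.0
print(discrete_frechet(native, [220.0, 218.0, 214.0, 212.0]))
```

A speaker's production accuracy in the sense used here would then be the distance between that speaker's contour for a tone and the native-speaker exemplar, with lower values indicating a closer approximation.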

Supplementary Materials

The data and R code necessary to reproduce all figures and statistical results in this chapter, along with additional figures and analyses, are available at https://doi.org/10.7488/ds/2635.

References

Baese-Berk, M. M. (2019). Interactions between speech perception and production during learning of novel phonemic categories. Attention, Perception, and Psychophysics, 81(4), 981–1005.
Bauman, J., Blodgett, A., Rytting, C. A., & Shamoo, J. (2009). The ups and downs of Vietnamese tones: A description of native speaker and adult learner tone systems for Northern and Southern Vietnamese (Technical Report No. E.5.3 TTO 2118). College Park, MD: University of Maryland Center for Advanced Study of Language.
Best, C. T. (1995). A direct realist view of cross-language speech perception. In W. Strange (Ed.), Speech perception and linguistic experience: Issues in cross-language research (pp. 171–204). Timonium, MD: York Press.
Best, C. T., & Tyler, M. D. (2007). Nonnative and second-language speech perception: Commonalities and complementarities. In O.-S. Bohn & M. J. Munro (Eds.), Language experience in second language speech learning: In honor of James E. Flege (pp. 13–34). Amsterdam: John Benjamins.

Bettoni-Techio, M., Rauber, A. S., & Koerich, R. D. (2007). Perception and production of word-final alveolar stops by Brazilian Portuguese learners of English. In INTERSPEECH 2007 (pp. 2293–2296), Antwerp.
Blodgett, A., Bauman, J., Bowles, A., & Winn, M. B. (2008). A comparison of native speaker and American adult learner Vietnamese lexical tones. In Proceedings of Acoustics 08 (pp. 688–692), Paris.
Boersma, P., & Weenink, D. (2015). Praat: Doing phonetics by computer (Version 5.4.08).
Bohn, O.-S., & Flege, J. E. (1997). Perception and production of a new vowel category by second-language learners. In A. James & J. Leather (Eds.), Second-language speech: Structure and process (pp. 53–74). Berlin: Walter de Gruyter.
Brunelle, M. (2015). Vietnamese (Tiếng Việt). In M. Jenny & P. Sidwell (Eds.), The handbook of Austroasiatic languages (Vol. 2, pp. 909–953). Leiden: Brill.
Burnham, D., & Francis, E. (1997). The role of linguistic experience in the perception of Thai tones. In A. S. Abramson (Ed.), Southeast Asian linguistics studies in honor of Vichin Panupong (pp. 29–48). Bangkok: Chulalongkorn University Press.
Byun, T. M., & Tiede, M. (2017). Perception-production relations in later development of American English rhotics. PLoS ONE, 12(2), e0172022.
Chambers, E. W., Colin de Verdière, É., Erickson, J., Lazard, S., Lazarus, F., & Thite, S. (2010). Homotopic Fréchet distance between curves or, walking your dog in the woods in polynomial time. Computational Geometry, 43(3), 295–311.
Chen, G. (1974). The pitch range of English and Chinese speakers. Journal of Chinese Linguistics, 2(2), 159–171.
Ding, H., Hoffmann, R., & Jokisch, O. (2011). An investigation of tone perception and production in German learners of Mandarin. Archives of Acoustics, 36(3). doi:10.2478/v10168-011-0036-6
Đinh Lư Giang. (2011). Tình hình song ngữ Khmer-Việt tại đồng bằng sông Cửu Long: một số vấn đề lý thuyết và thực tiễn [Khmer-Vietnamese bilingualism in the Mekong Delta: Theoretical and practical issues]. PhD dissertation, Ho Chi Minh City University of Social Sciences and Humanities.
Đinh Lư Giang. (2015). Các đặc điểm chính của song ngữ Khmer-Việt vùng Nam Bộ [The main features of Khmer-Vietnamese bilingualism in the South]. Ngôn ngữ & Đời sống, 4(234), 81–88.
Elvin, J., Williams, D., & Escudero, P. (2016). The relationship between perception and production of Brazilian Portuguese vowels in European Spanish monolinguals. Loquens, 3(2), e031.
Escudero, P. R. (2005). Linguistic perception and second language acquisition: Explaining the attainment of optimal phonological categorization. Utrecht: LOT.
Flege, J. E. (1993). Production and perception of a novel, second-language phonetic contrast. Journal of the Acoustical Society of America, 93(3), 1589–1608.



Flege, J. E. (1995). Second language speech learning: Theory, findings, and problems. In W. Strange (Ed.), Speech perception and linguistic experience: Issues in cross-language research (pp. 233–277). Timonium, MD: York Press.
Flege, J. E. (1999). The relation between L2 production and perception. In Proceedings of the XIVth International Congress of Phonetic Sciences (pp. 1273–1276), Berkeley.
Flege, J. E., Frieda, E. M., & Nozawa, T. (1997). Amount of native-language (L1) use affects the pronunciation of an L2. Journal of Phonetics, 25(2), 169–186.
Flege, J. E., MacKay, I. R. A., & Meador, D. (1999). Native Italian speakers’ perception and production of English vowels. Journal of the Acoustical Society of America, 106(5), 2973–2987.
Francis, A. L., Ciocca, V., Ma, L., & Fenn, K. (2008). Perceptual learning of Cantonese lexical tones by tone and non-tone language speakers. Journal of Phonetics, 36(2), 268–294.
Gandour, J. T. (1983). Tone perception in Far Eastern languages. Journal of Phonetics, 11, 149–175.
General Statistics Office of Vietnam. (2010). The 2009 Vietnam population and housing census: Major findings. Hanoi: General Statistics Office of Vietnam. Retrieved from www.gso.gov.vn/default_en.aspx?tabid=515&idmid=5&ItemID=9813
Ghosh, S. S., Matthies, M. L., Maas, E., … Perkell, J. S. (2010). An investigation of the relation between sibilant production and somatosensory and auditory acuity. Journal of the Acoustical Society of America, 128(5), 3079–3087.
Guenther, F. H., & Perkell, J. S. (2004). A neural model of speech production and supporting experiments. In From sound to sense: 50+ years of discoveries in speech communication (pp. B98–B106). Cambridge, MA: MIT Press.
Guion, S. G., & Pederson, E. (2007). Investigating the role of attention in phonetic learning. In O.-S. Bohn & M. J. Munro (Eds.), Language experience in second language speech learning: In honor of James Emil Flege (pp. 57–77). Amsterdam: John Benjamins.
Hallé, P. A., Chang, Y.-C., & Best, C. T. (2004). Identification and discrimination of Mandarin Chinese tones by Mandarin Chinese vs. French listeners. Journal of Phonetics, 32(3), 395–421.
Hoàng Thị Châu. (1989). Tiếng Việt trên các miền đất nước: Phương ngữ học [Vietnamese in the various areas of the motherland: A dialectological study]. Hà Nội: NXB Khoa học Xã hội.
Huang, T. (2007). Perception of Mandarin tones by Chinese- and English-speaking listeners. In Proceedings of the 16th International Congress of Phonetic Sciences (pp. 1797–1800), Saarbrücken.
Kartushina, N., Hervais-Adelman, A., Frauenfelder, U. H., & Golestani, N. (2015). The effect of phonetic production training with visual feedback on the perception and production of foreign speech sounds. Journal of the Acoustical Society of America, 138(2), 817–832.
Kirby, J. (2014). Incipient tonogenesis in Phnom Penh Khmer: Acoustic and perceptual studies. Journal of Phonetics, 43, 69–85.

Kirby, J., & Đinh Lư Giang. (2017). On the r>h shift in Kiên Giang Khmer. Journal of the Southeast Asian Linguistics Society, 10(2), 66–85.
Lee, Y.-S., Vakoch, D. A., & Wurm, L. H. (1996). Tone perception in Cantonese and Mandarin: A cross-linguistic comparison. Journal of Psycholinguistic Research, 25(5), 527–542.
Levy, E. S., & Law, F. F. (2010). Production of French vowels by American-English learners of French: Language experience, consonantal context, and the perception-production relationship. Journal of the Acoustical Society of America, 128(3), 1290–1305.
Llisterri, J. (1995). Relationships between speech production and speech perception in a second language. In Proceedings of the 13th International Congress of Phonetic Sciences (Vol. 4, pp. 92–99), Stockholm.
Mennen, I. (1998). Can language learners ever acquire the intonation of a second language? In STiLL-1998 (pp. 17–20), Marholmen, Sweden.
Miracle, W. C. (1989). Tone production of American students of Chinese: A preliminary acoustic study. Journal of the Chinese Language Teachers Association, 24, 49–65.
Morrison, G. S. (2003). Perception and production of Spanish vowels by English speakers. In Proceedings of the 15th International Congress of Phonetic Sciences (pp. 1533–1536), Barcelona.
Müller, M. (2007). Dynamic time warping. In Information retrieval for music and motion (pp. 69–84). Berlin: Springer.
Nagle, C. L. (2018). Examining the temporal structure of the perception-production link in second language acquisition: A longitudinal study. Language Learning, 68(1), 234–270.
Nguyen, H. T., & Macken, M. A. (2008). Factors affecting the production of Vietnamese tones: A study of American learners. Studies in Second Language Acquisition, 30(1), 49–77.
Nguyễn Thị Huệ. (2010). Tiếp xúc ngôn ngữ giữa tiếng Khmer với tiếng Việt (trường hợp tỉnh Trà Vinh) [Language contact between Khmer and Vietnamese in Tra Vinh province]. PhD dissertation, Ho Chi Minh City University of Social Sciences and Humanities.
Nguyễn Văn Lợi & Edmondson, J. A. (1998). Tone and voice quality in modern northern Vietnamese: Instrumental case studies. Mon-Khmer Studies, 28, 1–18.
Peperkamp, S., & Bouchon, C. (2011). The relation between perception and production in L2 phonological processing. In INTERSPEECH (pp. 161–164), Florence.
Perkell, J. S., Guenther, F. H., Lane, H., … Zandipour, M. (2004). The distinctness of speakers’ productions of vowel contrasts is related to their discrimination of the contrasts. Journal of the Acoustical Society of America, 116(4), 2338–2344.
Phạm, A. H. (2003). Vietnamese tone: A new analysis. New York: Routledge.
Sakai, M., & Moorman, C. (2018). Can perception training improve the production of second language phonemes? A meta-analytic review of 25 years of perception training research. Applied Psycholinguistics, 39(1), 187–224.



Sheldon, A., & Strange, W. (1982). The acquisition of /r/ and /l/ by Japanese learners of English: Evidence that speech production can precede speech perception. Applied Psycholinguistics, 3(3), 243–261.
So, C. K., & Best, C. T. (2010). Cross-language perception of non-native tonal contrasts: Effects of native phonological and phonetic influences. Language and Speech, 53(2), 273–293.
Sochoeun, C. (2006). Khmer Krom migration and their identity. MA thesis, Royal University of Phnom Penh, Phnom Penh.
Strange, W. (1995). Phonetics of second-language acquisition: Past, present, future. In P. Branderud & K. Elenius (Eds.), Proceedings of the 13th International Congress of Phonetic Sciences (pp. 76–83), Stockholm.
Thạch Ngọc Minh. (1999). Monosyllabization in Kiengiang Khmer. Mon-Khmer Studies, 29, 81–95.
Vũ Thanh Phương. (1982). Phonetic properties of Vietnamese tones across dialects. In D. Bradley (Ed.), Papers in Southeast Asian linguistics: No. 8. Tonation (pp. 55–76). Canberra: Pacific Linguistics.
Wang, Y., Jongman, A., & Sereno, J. A. (2003). Acoustic and perceptual evaluation of Mandarin tone productions before and after perceptual training. Journal of the Acoustical Society of America, 113(2), 1033–1043.
Wang, Y., Sereno, J. A., & Jongman, A. (2012). L2 acquisition and processing of Mandarin tones. In P. Li, L. H. Tan, E. Bates, & O. J. L. Tzeng (Eds.), Handbook of East Asian psycholinguistics: Vol. 1. Chinese (pp. 250–256). Cambridge: Cambridge University Press.
Wayland, R. P., & Guion, S. (2003). Perceptual discrimination of Thai tones by naive and experienced learners of Thai. Applied Psycholinguistics, 24(1), 113–129.
Wayland, R. P., & Guion, S. G. (2004). Training English and Chinese listeners to perceive Thai tones: A preliminary report. Language Learning, 54(4), 681–712.
Werker, J. F., & Tees, R. C. (1984). Phonemic and phonetic factors in adult cross-language speech perception. Journal of the Acoustical Society of America, 75(6), 1866–1878.
Yang, B. (2012). The gap between the perception and production of tones by American learners of Mandarin – an intralingual perspective. Chinese as a Second Language Research, 1(1), 33–53.
Yang, B. (2015). Perception and production of Mandarin tones by native speakers and L2 learners. Berlin: Springer. Retrieved from http://link.springer.com/10.1007/978-3-662-44645-4
Zimmerer, F., Jügler, J., Andreeva, B., Möbius, B., & Trouvain, J. (2014). Too cautious to vary more? A comparison of pitch variation. In Proceedings of the 7th International Conference on Speech Prosody (pp. 1037–1041), Dublin.

Chapter 10

Production of Mandarin Tones by L1-Spanish Early Learners in a Classroom Setting Lucrecia Rallo Fabra, Xialin Liu, Si Chen, and Ratree Wayland*

10.1 Introduction

10.1.1 Mandarin Chinese

More than half of the world’s languages are tonal languages (Crystal, 1987; Yip, 2002), among them Mandarin Chinese. As opposed to nontonal languages, which use vowels and consonants to contrast words, tonal languages also employ tone to mark lexical differences (Gussenhoven, 2004). Acoustically, tone differences are signaled through F0 variations, as well as changes in amplitude, duration, and voice quality (Yip, 2002). According to Pike (1948), tonal languages can in turn be divided into those which have no pitch variation across the syllable (level or static tones) and those with pitch movement across the syllable (contour tones). The four lexical tones of Mandarin Chinese include a high-level tone (T1), a rising tone (T2), a dipping tone (T3), and a falling tone (T4). These four tones are minimally contrastive and mark lexical distinctions. For instance, the word tang produced with a high-level tone means soup, with a rising tone candy, with a dipping tone to lie down, and with a falling tone burning hot (Yang, 2015). Xu and Wang (2001) propose a theoretical framework that accounts for F0 variation in tonal languages such as Mandarin. They argue that “observed F0 targets are not linguistic units per se. Rather they are the surface realizations of linguistically functional units such as tone or pitch accent” (p. 321). These units are referred to as pitch targets and could be considered the suprasegmental equivalents of segmental phones. Pitch targets are classified as static or dynamic, depending on the absence or presence of a linear F0 movement. According to this theoretical model, Mandarin would then have two static pitch targets,

* This research was funded by grants FFI2017-84479-P and FFI2013-48640-C2-2-P AEI/FEDER, EU from the Spanish Ministry of Economy and Competitiveness. We are grateful to Clara Vega for her assistance with the data analysis.



namely, high and low, and two dynamic targets, rise and fall, corresponding to T1, T3, T2, and T4, respectively.

10.1.2 Mandarin Tone Production among Native Mandarin Children

According to the generally accepted account of F0-related laryngeal muscle activity, the cricothyroid (CT) muscle determines F0 rises, whereas the sternohyoid (SH) muscle controls F0 falls (Hallé, 1994). The vocalis (VOC), the sternothyroid (ST), and the lateral cricoarytenoid muscles may also contribute to F0 control (Atkinson, 1978). In addition, subglottal pressure (Ps) may play a secondary role in the control of F0 (Ohala, 1978; but see Atkinson, 1973, 1978). Based on patterns of laryngeal muscle activity, Mandarin tone 3 is the most articulatorily complex. To produce this tone, a high level of SH muscle activity is required at tonal onset to create a sharp fall in F0, followed by decreased SH activity and increased CT activity to produce the final rise. In contrast, Mandarin T4 requires the least laryngeal control, particularly for a high-pitched speaker, as only an increase in CT activity is needed to produce a high F0 at tonal onset, followed by a reduced level of CT activity to allow the F0 to fall toward the end of the tone. T1 requires a relatively higher level of control of CT activity because a high level of CT activity is needed to reach an initial high F0 level; in addition, this high level of CT activity has to be sustained to maintain the initial high F0 value through the end of the tone. For T2, SH muscle activation is required to achieve the initial fall, followed by increased CT activity to realize the sharp final rise. Thus, the order from the least to the most complex is T4, T1, T2, and T3. Laryngeal control complexity accounts for Mandarin-speaking children’s tone production accuracy quite well.
In general, the children found tones with low F0 targets (e.g., T2 and T3) more challenging to master than those with high F0 targets (e.g., T1 and T4). For example, Li and Thompson (1977) reported that Mandarin tones 2 and 3 were more frequently misarticulated than tones 1 and 4 by children aged 1;6 to 3;0. Similar results were reported for the three-year-old Mandarin-speaking children investigated by Wong, Schwartz, and Jenkins (2005). These children were asked to produce Mandarin lexical tones in monosyllabic words using a picture-naming task. The productions were low-pass filtered at 500 Hz and 400 Hz to eliminate segmental information, and 10 Mandarin-speaking adult judges identified the tones from the filtered speech. Children’s productions were significantly less accurately identified
compared to adult productions, and Mandarin tone 3 was the least accurately produced. Wong (2012a) acoustically examined monosyllabic Mandarin lexical tones produced by the 13 three-year-old children and four female adults reported in Wong et al. (2005). The results confirmed that tone production among these children is not adult-like. The order of tone production from the least to the most adult-like appeared to align with the complexity of the laryngeal muscle control required to produce these tones, with tone 4 being the most adult-like, followed by tone 1, tone 2, and tone 3, respectively. Wong (2012b) investigated Mandarin tone production in isolated words among three-year-old children growing up in Taiwan and the United States and found that this group of children did not produce adult-like tones either. Acoustic differences between the adult and child productions were found even for those of the children’s tones that were accurately categorized by adult listeners. Interestingly, Wong (2012b) also observed that children growing up in Taiwan made more errors on tone 2 and tone 4 than the Mandarin-speaking children growing up in the United States, while their tone 1 and tone 3 production accuracy was comparable. The author did not offer an explanation, but exposure to the English rising and falling intonation patterns among the children raised in the United States may have increased their sensitivity to the rising and falling pitch movements that distinguish Mandarin T2 from Mandarin T4. In contrast, the Taiwanese children’s difficulty with these two Mandarin tones may have resulted from their exposure to Taiwanese tones. The fact that their Mandarin T2 and T4 productions were misheard mostly as T3 suggests that they may have mapped these two Mandarin tones to Taiwanese tone 4 (low-rising). Xu et al. (2007) used an artificial neural network to classify Mandarin tones produced by 61 children between six and nine years of age.
The neural network classified the tone productions of the 61 child speakers with an accuracy rate of about 85.6 percent, compared to the 79.5 percent accuracy achieved by adult human listeners. A high degree of variability in the children’s productions was evident in the tone recognition rates of both the adult listeners and the neural network. Unfortunately, the study did not report recognition rates by tone. Wong (2008) examined Mandarin tone production accuracy in disyllabic words among five- to six-year-old Mandarin-speaking children and adults growing up in the United States. The results showed that the children’s tone production did not reach adult-like accuracy, and a directional asymmetry in production accuracy was observed. For example, production of the T1 and T2 combination was more difficult in the T1–T2 than in the T2–T1 direction because the difference between the F0 offset of the preceding T1 and the F0 onset of the following T2 was greater than the difference between the F0 offset of the preceding T2 and the F0 onset of the following T1. This is consistent with the view that tones with a greater motor demand are more difficult to produce (Wong, 2008).

10.1.3 L2 Learning of Mandarin Chinese Tones by Speakers of Nontonal Languages

Second language learners of Mandarin whose native language (L1) is nontonal face considerable challenges. They must learn to produce not only the consonants and vowels of this language but also its four lexical tones. This difficulty has been acknowledged by Cheng (2011), who found that, when asked about the major difficulties of learning Mandarin as a foreign language, students from different L1 backgrounds reported that lexical tones were the most difficult aspect of Mandarin pronunciation. Acoustic measurements showed that learners often confused tones 2 and 3. Spanish belongs to the group of intonational languages because it does not use lexical tones to signal phonemic differences. Instead, pitch movements provide cues to syntactic constituents, focus, differences between old and new information, and sentence types such as statements and questions. Pitch movements also serve to indicate paralinguistic information such as attitudes and emotions (Hualde & Prieto, 2015). Perception and production difficulty with lexical tones may depend partly on how they are processed. Specifically, available evidence points to native and nonnative listeners of tonal languages focusing their attention on different F0 dimensions in lexical tone perception.
While native speakers of tone languages such as Mandarin attend to linguistic tone dimensions such as pitch height as well as the direction and slope of the F0 contour, speakers of nontonal languages pay more attention to the nonlinguistic tone dimensions of average pitch and extreme endpoints (Wayland & Guion, 2004; So & Best, 2010). Compared to the extensive body of research examining acquisition of L2 segmental features, research examining production of Mandarin tones by learners from nontonal L1 backgrounds in formal learning settings is in its infancy, and it is mostly limited to adult populations with an L1-English background (Hao, 2012; Tsukada, Xu, & Rattanasone, 2015; Wang, 1995). For instance, Leather (1996) investigated the perception and production
of the lexical tone system of standard Beijing Chinese (Putonghua) by two groups of naive adult Dutch learners who received training in either perception or production of the four tones and were then tested on their ability to either produce or perceptually identify the target tones. The perceptual training procedure involved listening to digitized tone tokens, tone-labeling practice with response feedback on each trial, and a labeling test in which learners were asked to meet a “proficiency criterion” of 75–80 percent labeling success. After the training period, learners who met the proficiency criterion in the perceptual training phase produced the tonally distinctive minimal quadruplet of target Putonghua words /ȳ/, /ý/, /y̌/, and /ỳ/. The production training procedure provided visual feedback on the target tones via a laryngograph/Voiscope, which allowed learners to visualize the target tone contours and model them. Feedback messages on the screen were given to guide the learners’ productions. For instance, “too high” or “too low” indicated inaccurate production of the target tone at the start, mid-point, or end of a given contour, and the messages “too short” or “too long” indicated inadequate duration of the contour relative to the target. After this training phase, learners produced the target tones until they achieved two consecutive successful elicitations of each tone. When these production requirements were met, learners moved on to a perceptual task in which they were required to identify the word matching each of the target tones presented auditorily. Analysis of the production data from the learner group who received the perceptual training showed overall high consistency of tone production, except for tones 2 and 3. Tone 2 exemplars were too high at onset and too low at offset, and tone 3 did not dip sufficiently low.
Covariation of the perceptual and production abilities yielded a positive correlation between perception and production skills for the two learner groups. The contrast between tones 2 and 3 was also the most difficult to perceive for the learners who received production training. Kim et al. (2015) investigated the effects of study abroad on US students’ overall proficiency after spending a semester at a Chinese university. Proficiency was measured before and after the study-abroad semester by means of a battery of measures including fluency ratings, speech rate, filled and unfilled pauses, tone accuracy, and vocabulary development. A comparison of the pretest and posttest tonal accuracy revealed significant gains in tone production as rated by native Mandarin listeners. However, the authors did not report which tone patterns were most difficult for the American learners to produce.

10.2 The Present Study: Aims and Research Questions

The aim of this study is to explore the relative difficulty of Mandarin tone production for a group of L1-Spanish children in a classroom setting. Specifically, we intend to investigate whether L1-Spanish learners’ production of Mandarin tones follows the same trends as Mandarin tone production by speakers of other nontonal languages such as English. To this end, we addressed the following research questions:

RQ1: Which Mandarin tones pose more difficulties for Spanish learners of Mandarin? We hypothesized that tones 2 (rising tone) and 4 (falling tone) would be easier to produce, since these two contours correspond to the intonation patterns of yes/no questions and declaratives in Peninsular Spanish (Hualde & Prieto, 2015). However, it is also possible that, as for native Mandarin-speaking children, production accuracy may depend on the articulatory motor demand, in which case tones 2 and 3 would be relatively more difficult to master than tones 1 and 4.

RQ2: Which aspects of Mandarin tones are more difficult to master: pitch height or pitch contour? On the basis of previous findings on nonnative production of lexical tones (Leather, 1996; Wang, Jongman, & Sereno, 2003; Wayland & Guion, 2004), we hypothesized that the Spanish learners of Mandarin would model their native Chinese peers in the production of the four pitch contours, but that their pitch heights would differ significantly from those produced by the native Chinese peers.

10.3 Method

10.3.1 Participants

Twelve L1-Spanish students with a mean age of 9.6 years from the Agora International School in Palma de Mallorca participated in the study. They had two years of experience learning Mandarin at school, which placed them at the YCT2 level. In terms of the Common European Framework of Reference for Languages (CEFR), the YCT2 level corresponds to level A1 (beginner). Four native Mandarin children living in Qingdao, with a mean age of 10 years, participated as controls. The teaching methodology used to train the children to perceive and produce the target tones was based on comparisons
with the musical scale; for example, the musical note mi is used as a reference for tone 1. Whenever possible, comparison with Spanish intonation patterns was also used. For instance, tone 2 is somewhat similar to the rising intonation of a yes/no question in Spanish, and tone 4 has a falling pitch like that of an exclamation in Spanish. Additionally, the instructor used metaphorical gestures in the form of arm positions to convey the four Mandarin tones (see Appendix 10.A). The support of visual illustrations depicting the shape of the lexical tones has been found to play a facilitative role in helping nontonal learners produce and perceive the Mandarin tones (Liu et al., 2011; Morett & Chang, 2015; Baills et al., 2019).

10.3.2 Speech Materials and Procedure

The participants were recorded individually using an Olympus WS-320M digital recorder with a built-in microphone. The recording sessions with the Spanish learners of Mandarin took place in a quiet room on the school premises. The native Mandarin children were recorded in China under similar conditions. The children were given a list of the target words embedded in the carrier phrase shuō ___ bābiàn (I say ___ again). All the target words had a CV syllable structure and were printed in pinyin (see Appendix 10.B). We intended to include the same word for each of the four tones, but this was not always possible. In order to facilitate accurate production of the target Mandarin tones, the children’s Mandarin instructor produced each of the target sentences, and immediately afterward the children imitated the sentence they heard as closely as possible.

10.4 Results

10.4.1 Measurement and Normalization Methods

A total of 96 speech samples from the learner group (8 words × 12 speakers) and 32 samples from the control group (8 words × 4 speakers) were obtained for each of the four Mandarin tones. The vowel portion of each target word was manually segmented using the TextGrid annotation utility of the Praat software (Boersma & Weenink, 2019). The F0 values at 20 normalized time points from vowel onset to vowel offset were automatically extracted using the ProsodyPro script for Praat (Xu, 2013). This method allows continuous examination of the tone contours, thus overcoming the limitations of prior work that examined only a limited set of measurement points (see Yang, 2015, for a review).

Lucrecia Rallo Fabra, Xialin Liu, Si Chen, and Ratree Wayland

The F0 analysis range was set at a minimum of 75 Hz and a maximum of 600 Hz, and the sampling rate was set at 100 Hz. The obtained F0 values were then converted to logarithmic z-scores to normalize F0 variation across talkers (Rose, 1987; Zhu, 1999; Yang, 2015). As the formula below shows, a given z-score is calculated by subtracting the mean from the raw F0 value and then dividing by the standard deviation:

zi = (xi − m) / s,

where xi stands for an observed F0 value, m stands for the mean, and s is the standard deviation (SD). For example, if the observed F0 value for a given speaker is 220 Hz, and the overall mean and SD are 180 and 20 Hz, respectively, the normalized value for 220 Hz would be (220 − 180)/20 = 2.

10.4.2 Statistical Analysis

To compare the tone contours of the Spanish learners and the Mandarin natives, we used a statistical modeling method known as growth curve analysis (Mirman, Dixon, & Magnuson, 2008; Mirman, 2014, pp. 51–55). This method allowed comparison of the surface F0 contours of all four tones produced by the native Spanish speakers and the Mandarin controls, and it has been used in prior work modeling phonetic tone variation and tone distinction in Japanese and Chongming Chinese (Chen, Zhang, McCollum, & Wayland, 2017) and tone sandhi in Mandarin and Nanjing Chinese (Chen, Wiltshire, Li, & Wayland, 2019). We started with a simple model, as follows (Mirman, Dixon, & Magnuson, 2008):

Yij = (γ00 + ζ0i) + (γ10 + ζ1i) × Timeij + εij,

where i indexes the ith F0 contour and j the jth time point of the extracted F0 values, γ00 is the population mean of the intercept, ζ0i models the variability of an individual's intercept, γ10 is the population mean of the slope, ζ1i models the variability of an individual's slope, and εij are the error terms. Orthogonal polynomials were used to avoid any correlation between the linear and quadratic terms (Mirman, 2014, p. 52).
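The two normalization steps described above (resampling each contour to 20 proportional time points, then converting to z-scores) can be sketched as follows. This is a minimal illustration rather than the ProsodyPro implementation; the function names and sample values are hypothetical, and numpy is assumed.

```python
import numpy as np

def time_normalize(f0_hz, n_points=20):
    """Resample an F0 contour to n_points equally spaced
    (proportional-time) samples from vowel onset to offset."""
    src = np.linspace(0.0, 1.0, len(f0_hz))
    dst = np.linspace(0.0, 1.0, n_points)
    return np.interp(dst, src, np.asarray(f0_hz, dtype=float))

def z_normalize(f0, mean, sd):
    """z_i = (x_i - m) / s, using the talker's own mean and SD."""
    return (np.asarray(f0, dtype=float) - mean) / sd

# Worked example from the text: 220 Hz with talker mean 180 Hz
# and SD 20 Hz normalizes to (220 - 180) / 20 = 2.0
print(z_normalize([220.0], 180.0, 20.0)[0])  # 2.0
```

Resampling on a proportional time axis makes contours of different durations directly comparable, while the per-talker z-score removes differences in overall pitch range.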
The differences between the native and nonnative F0 contours for each tone were examined by comparing a model that treated the two groups as the same with a model that treated them as different, using a likelihood ratio test. A significant result in this model comparison indicates that the two surface F0 contours of a given pair differ.
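The model-comparison logic can be sketched in a few lines. This is a generic illustration under the usual assumption that twice the log-likelihood difference between nested models follows a chi-square distribution; the function name and input values are hypothetical (the chi-square(1) statistic of 7.34 echoes the tone 1 quadratic-term entry reported in Table 10.1).

```python
import math

def lrt_pvalue_df1(llf_reduced, llf_full):
    """Likelihood ratio test for nested models differing by one
    parameter: 2 * (llf_full - llf_reduced) ~ chi-square(1) under H0.
    The chi-square(1) survival function equals erfc(sqrt(x / 2))."""
    stat = 2.0 * (llf_full - llf_reduced)
    return stat, math.erfc(math.sqrt(stat / 2.0))

# A chi-square(1) statistic of 7.34 (as reported for the tone 1
# quadratic term in Table 10.1) corresponds to p of about 0.007:
print(round(math.erfc(math.sqrt(7.34 / 2.0)), 3))  # 0.007
```

For differences of more than one parameter, the same statistic would be referred to a chi-square distribution with the appropriate degrees of freedom (e.g., via scipy.stats.chi2.sf).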






10.4.3 Cross-Linguistic Comparisons of Pitch Contours and Pitch Height

The results of the growth curve analysis, shown in Table 10.1, reveal that, overall, the F0 contours of all four tones produced by the native Mandarin and the native Spanish speakers differed significantly. The tone 1 contour patterns produced by the Spanish learners and the native Mandarin children are shown in Figure 10.1. A first inspection of the pitch contours reveals clear cross-language differences between the two groups. The Spanish learners did not succeed in modeling their native Mandarin peers' contours and produced a dynamic rise-fall pattern, which does not match the more static contour produced by the native Mandarin speakers. The major difficulty involved the onset and offset

Table 10.1  Growth curve analysis results for the four Mandarin tones

        Quadratic term        Slope                 Intercept
Tone    χ²(1)    p-value      χ²(1)    p-value      χ²(1)    p-value
1       7.34     0.007**      1.45     0.22         2.27     0.13
2       0.04     0.83         28.40