
Lexical Representation

Phonology and Phonetics 17
Editor: Aditi Lahiri

De Gruyter Mouton

Lexical Representation
A Multidisciplinary Approach

edited by
Gareth Gaskell and Pienie Zwitserlood

De Gruyter Mouton

ISBN 978-3-11-022492-4
e-ISBN 978-3-11-022493-1
ISSN 1861-4191

Library of Congress Cataloging-in-Publication Data
Lexical representation : a multidisciplinary approach / edited by Gareth Gaskell, Pienie Zwitserlood.
p. cm. — (Phonology and phonetics; 17)
Includes bibliographical references and index.
ISBN 978-3-11-022492-4 (alk. paper)
1. Lexicology. 2. Psycholinguistics. 3. Grammar, Comparative and general — Word formation. 4. Grammar, Comparative and general — Morphology.
I. Gaskell, M. Gareth. II. Zwitserlood, Pienie.
P326.L386 2011
401'.9—dc22
2011013270

Bibliographic information published by the Deutsche Nationalbibliothek
The Deutsche Nationalbibliothek lists this publication in the Deutsche Nationalbibliografie; detailed bibliographic data are available on the Internet at http://dnb.d-nb.de.
© 2011 Walter de Gruyter GmbH & Co. KG, Berlin/New York
Printing: Hubert & Co. GmbH & Co. KG, Göttingen
∞ Printed on acid-free paper
Printed in Germany
www.degruyter.com

This book is dedicated to William Marslen-Wilson

Acknowledgements

The editors would like to thank the chapter authors first and foremost for their supreme generosity in terms of time and effort. We thank Aditi Lahiri for encouraging us to develop this volume, Helen Brown for her astute work in helping to put the volume together, and Lolly Tyler for keeping tight-lipped about the project in extreme circumstances.

Contents

Acknowledgements . . . vii

Lexical representation: A multidisciplinary approach
Pienie Zwitserlood and Gareth Gaskell . . . 1

Semantic constraints on morphological processing
Kathleen Rastle and Marjolein Merkx . . . 13

The lexicon and phonetic categories: Change is bad, change is necessary
Arthur G. Samuel . . . 33

Early links in the early lexicon: Semantically related word-pairs prime picture looking in the second year
Suzy Styles and Kim Plunkett . . . 51

Words: Discrete and discreet mental representations
Aditi Lahiri . . . 89

Neural systems underlying lexical competition in auditory word recognition and spoken word production: Evidence from aphasia and functional neuroimaging
Sheila E. Blumstein . . . 123

Connectionist perspectives on lexical representation
David C. Plaut . . . 149

Recognizing words from speech: The perception-action-memory loop
David Poeppel and William Idsardi . . . 171

Brain structures underlying lexical processing of speech: Evidence from brain imaging
Matthew H. Davis and Jennifer M. Rodd . . . 197

The neuronal infrastructure for unification at multiple levels
Peter Hagoort . . . 231

Subject index . . . 243
Colour plates . . . 246

Lexical representation: A multidisciplinary approach
Pienie Zwitserlood and Gareth Gaskell

The processing of spoken language is as fascinating as it is complex. While work on speech perception has quite a long tradition, psycholinguistics started to emerge as a discipline only in the 1950s, fed by its many roots in 19th-century neurology, experimental psychology, behaviorism, and linguistics. It took some twenty years until the recognition of words in their natural sentential habitat, and the nature of lexical representation, seriously entered the research arena. The last 40 years have yielded invaluable insights into lexical process and representation. This volume presents current views, theories and data on lexical representation—with an intended bias towards the work of William Marslen-Wilson.

The study of complex issues such as processing "from sound to meaning" profits from well-specified theories. One such theory is Marslen-Wilson's Cohort model. Inspired by both the logogen model of word recognition in reading (Morton 1969) and his own data from speech shadowing, William Marslen-Wilson sketched the outlines of his model during the 1970s. The Cohort model became one of the most influential theories of spoken-word recognition for decades to come—and it still is, as many of the chapters in this book vouch for. Like all good theories, the Cohort model had friends and foes. Like all good theories, it made testable predictions. As with far fewer good theories, it was adapted and modified when the empirical evidence dictated revision, eventually culminating in the distributed Cohort model by Gaskell and Marslen-Wilson (1997).

In due time, more and more became known about the nature of the representations involved in spoken-word processing. One line of research focused on representations of word form, on the goodness-of-fit between input and word forms, on the impact of sub-phonetic detail, and on the necessity of featural detail in the lexicon (Marslen-Wilson and Zwitserlood 1989; Marslen-Wilson and Warren 1994; Lahiri and Marslen-Wilson 1991; Gaskell and Marslen-Wilson 1998). Another major theme concerned word-internal structure, including prelexical morphemic parsing, morphological structure in the lexicon, and functional differences between regular and
irregular inflection—and, as such, on aspects of syntax (cf. Marslen-Wilson 2007, for an overview). Although methods and means have changed and developed, these are still the core themes in lexical processing, flanked on the one side by the processes and representations that provide the "input" for lexical processing, and on the other by the computations of global structure and meaning operating on the "output" of lexical processing. The use of input and output should not suggest sequentiality and independence of these processes: We are looking at an orchestrated piece of mental and neural music, with slow waves, continuous murmurs, fast beats, loud and muted tones that only properly resonate in each other's presence. The chapters in this book attest to and document various facets of the enormous progress made in the area of lexical process and representation. We briefly summarise each contribution, in the order of appearance in this volume, to identify where the field stands, and where it is heading.

The chapter by Rastle and Merkx focuses on the internal structure of words. Their core question is whether and how the proposal by Marslen-Wilson and colleagues (1994) that morphological complexity is coded in the lexicon only for semantically transparent words can be reconciled with the now ample evidence for early morphological decomposition. Data from masked visual priming show morphological parsing whenever the input contains parts that are identifiable as morphemes—even if the word is monomorphemic (as in corner or brother; note that Rastle and Merkx label these words as morphologically complex but semantically opaque). Weighing up the available evidence from masked-priming studies, Rastle and Merkx conclude that there is little evidence that semantic information influences the early parsing of potentially complex words. But true compositionality and semantic transparency do have an effect when primes are fully visible, or presented auditorily. This indicates that semantic transparency comes into play at a later point in time, after initial decomposition that is blind to semantics. There are two theoretical options to accommodate these findings. One is a two-level theory of morphological processing, with distinct functionality for early orthographically-based and later semantically-informed decomposition phases (cf. Rastle and Davis 2008). A second option is a class of models with a single level of morphological parsing, followed by an evaluation or licensing of ensuing decompositions in the light of morpho-syntactic and/or semantic constraints. On the basis of results from priming studies with multiple stimulus onset asynchronies, and
from neurophysiological (ERP) data, and backed up by considerations on the development of morphemic knowledge, Rastle and Merkx argue in favour of models in which early, form-driven decompositions are evaluated through semantic (and potentially morpho-syntactic) information. The chapter provides an interesting view on the experimental effort devoted to the role of morphemes in word recognition. The authors point to many gaps with respect to theory and data—a major one would be to find out whether the early parsing observed in reading also applies to spoken words.

Samuel's chapter provides a timely position statement on the dynamic nature of the spoken word recognition system. Pioneering work by Marslen-Wilson and colleagues has done much to define the nature of the spoken word recognition process, both in terms of selection of lexical items (Marslen-Wilson and Tyler 1980; Marslen-Wilson 1987), and the mechanisms that underlie perception of phonemes (e.g., Marslen-Wilson 1973; Marslen-Wilson and Warren 1994). This has led to a detailed understanding of the recognition system in its steady state, for "fully developed" adults. Samuel argues that it is now time to move beyond this assumption of a steady state to explore the ways in which even adult language users can adapt to new forms of input, such as previously unencountered accents or words. In these cases, there is a clear tension between the need to accommodate the idiosyncrasies of new speakers or new words that we encounter and the need to ensure that the recognition system is robust and existing knowledge is preserved. The system needs to be able to incorporate systematic and persistent novelty without being led astray by simple mispronunciations or non-linguistic distortions. To deal with these conflicting constraints, Samuel suggests that the language system follows a conservative adjustment/restructuring principle (CA/RP). This principle is fleshed out in Samuel's chapter with reference to recent studies on lexical and phonetic reorganisation. At the lexical level, Samuel discusses studies that address the time course of different aspects of novel word learning (e.g., Gaskell and Dumay 2003; Leach and Samuel 2007). The conservative aspect of the CA/RP principle is nicely illustrated by considering the potential for interference between new word representations and existing knowledge. In some cases, new word knowledge can be acquired without any deleterious consequences for the previously acquired knowledge (e.g., learning the phonetic components of a new word). This "lexical configuration" can be filled in as and when the information becomes available. On the other hand, other dynamic components of new word knowledge require linking of new and existing networks, and could
potentially interfere with normal language functioning if the new information is introduced too quickly (lexical engagement). Because of this, the CA/RP dictates that lexical engagement operate at a gentler pace, becoming established over the course of days or weeks. In a quite separate domain, much the same principles appear to be operating. Samuel describes a series of elegant studies (e.g., Norris, McQueen and Cutler 2003; Kraljic, Samuel and Brennan 2008) that explore the ability of listeners to use lexical knowledge to help retune phonetic boundaries in cases where ambiguities arise (for example, when the two editors of this book speak to each other using our different accents of English). Recent studies have shown that listeners can quickly adapt to these new circumstances, and this adaptation has long-term consequences (e.g., Eisner and McQueen 2006). This is a clear example of the dynamic nature of the language system: we can adapt our phonetic boundaries to maximise our understanding. Nonetheless, the adaptation is conservative; if there is a transient cause of the discrepant speech (e.g., the speaker has a pen in her mouth), adaptation would be disadvantageous and does not happen.

In their chapter on the early lexicon, Styles and Plunkett report data from an interesting and novel method to investigate the lexical-semantic network in toddlers. How do concepts link up with word forms? Where do effects of semantic relatedness reside in the developing lexical-semantic system, and does it reveal properties of the adult system from the very onset? Styles and Plunkett briefly review models of conceptual memory, and methods to investigate relations between words and concepts during development. Investigation of lexical-semantic networks in adults almost exclusively relies on some form of semantic priming, and with good cause. Styles and Plunkett discuss a number of important requirements for a toddler-friendly variant of priming: (partial) sequentiality of primes and targets, lack of material repetition, and natural tasks that allow for online behavioral measures, which can be informative about automatic and strategic processes. They put forward a primed preferential-looking paradigm for children 18-24 months of age, with two simultaneously presented pictures, preceded by related or unrelated spoken words. Dependent measures were the total time the eyes spent on each picture, as an index of overall preference, and first-fixation duration, as a measure of engagement and disengagement time. The spoken name of one of the pictured concepts (dog) preceded the picture pairs. This name could again be preceded by a semantically related word ("cat"), or an unrelated word ("boat"). In a final study,
the picture name was left out while the semantic prime remained ("cat" – dog). Styles and Plunkett showed that a spoken name indeed facilitated the processing of the relevant picture in an online manner, and led to faster disengagement when the other picture was fixated first. Moreover, semantic priming was evident in both measures. No effects were observed when the object's name was left out, showing that the semantic priming effects were mediated via lexical representations. Styles and Plunkett's novel method has a number of excellent properties for the study of the developing lexical-semantic memory network. The data reported in their chapter elegantly show that small children have a system of interlinked representations that is adultlike in some, but not all, aspects.

Lahiri's chapter focuses on the mental representation of the form of words. The theoretical core of this research is a set of seminal studies that began as a collaboration with Marslen-Wilson in the early 1990s (e.g., Lahiri and Marslen-Wilson 1991). This research strand is notable for drawing on and developing theory in phonology and psycholinguistics in equal measure. The current chapter is a fine example of this approach, and presents a theory of the representation of spoken words that is applied to data as diverse as diachronic language change and speech perception mechanisms. Lahiri argues that the representations of spoken words in the mental lexicon are both discrete and discreet. Discreteness ensures that spoken words are (on the whole) not confusable with each other, whereas discretion is a metaphor for the abstractness of lexical representations: according to Lahiri's theory, lexical representations should not code for information that is predictable by other means. This aspect of Lahiri's work relates to a broader debate in psycholinguistics and many other cognitive domains on the extent to which the mental lexicon stores detailed exemplars or more abstract prototypes. Lahiri makes cogent arguments for a particularly abstract form of representation, underspecification, for which asymmetry of representation is the norm. For example, words with nasal vowels are lexically specified with the nasal feature, whereas words with oral vowels are left unspecified. Another key feature of this approach is that representational asymmetries are largely universal, and do not depend on experience alone. The combination of these properties is encapsulated by the FUL model of lexical representation (Lahiri and Reetz 2010). As Lahiri points out, the evidence to date on some aspects of the FUL model remains scarce, and there is a particular need for more evidence on whether abstraction and asymmetry have innate roots. Nonetheless, the research programme that
began with Lahiri and Marslen-Wilson's (1991) cross-linguistic work on vowel nasalisation has provoked much fruitful work. In particular, the emerging evidence on the neural representation of form has thrown up some surprising asymmetries. We can expect the current theoretical work to be similarly valuable in engendering future studies of asymmetry in lexical representation.
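The asymmetry sketched above can be made concrete in a few lines of code. What follows is a toy illustration of our own construction, not the actual FUL implementation of Lahiri and Reetz (2010): a lexical entry stores only unpredictable features, and features extracted from the signal are scored against it three ways (match, mismatch, or tolerated no-mismatch). The feature inventory and the example entries are hypothetical and greatly simplified:

```python
# Toy sketch of asymmetric matching under underspecification (our own
# illustration, not the FUL implementation of Lahiri and Reetz 2010).
CONFLICTS = {"nasal": "oral", "oral": "nasal"}  # conflicting values (illustrative)

def ful_match(surface: set, lexical: set) -> str:
    """Score surface features against an (under)specified lexical entry."""
    for feature in surface:
        if CONFLICTS.get(feature) in lexical:
            return "mismatch"      # the signal conflicts with a stored feature
    if surface & lexical:
        return "match"             # a stored feature is confirmed by the signal
    return "no-mismatch"           # the entry is silent on this dimension: tolerated

# Hypothetical entries: a word with a contrastive nasal vowel is specified
# [nasal]; a word with an oral vowel stores nothing for nasality.
nasal_word = {"nasal"}
oral_word = set()

print(ful_match({"nasal"}, oral_word))   # no-mismatch: nasal input is tolerated
print(ful_match({"oral"}, nasal_word))   # mismatch: oral input rejects the nasal word
print(ful_match({"nasal"}, nasal_word))  # match
```

The asymmetry falls out of the storage decision: the unspecified entry can never conflict with nasal input, whereas the specified entry actively rejects conflicting input.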

The chapter by Blumstein provides a timely overview of, and a strong view on, lexical competition in the adult language system, focusing on the neural underpinnings of competition and selection. The leading premise is that multiple lexical candidates are simultaneously active in speaking and in language comprehension—for which there is ample evidence. The next, somewhat more controversial, tenet is that selection among these lexical candidates is accomplished by means of competition. A final assumption is that language comprehension and production use a common lexicon—an elegant and parsimonious view that not all models share. Blumstein argues that lexical processing is intrinsically interactive, and that combining data from neuroimaging studies and aphasic patients can help flesh out the functional architecture of lexical processing and representation both in comprehension and production. She proposes an extended brain network involved in lexical activation, competition, and selection, with the posterior part of the superior temporal gyrus and the supramarginal gyrus involved in mapping sounds onto the lexicon, the supramarginal and angular gyri activated during access to stored lexical representations (both form and meaning), and inferior frontal regions involved in executive control of lexical competition and selection. Drawing on evidence from a wealth of behavioral and imaging studies with healthy adults and aphasic patients, Blumstein argues that the inferior frontal regions are responsible for the selection of a target from amongst a set of lexical representations whose properties are stored in temporal areas. This speaks for an interactive process, and for the cascading of information within the lexical network at large. Given the clear overlap in activation patterns in the production and comprehension studies, Blumstein surmises that this brain network reflects a processing stream that is common to spoken-word recognition and production. The extent to which competitor effects percolate downstream in speaking is convincingly demonstrated by their presence in the precentral gyrus, an area involved in speech motor planning.

In his chapter on the contribution of connectionist modelling research to lexical representation, Plaut summarises the state of play after more than 20 years of debate. Although the issues relating to symbolic versus connectionist models, and localist versus distributed representation, are relevant to cognition generally, the key battleground has been language (e.g., Pinker and Ullman 2002; McClelland and Patterson 2002). In the current chapter, Plaut argues against the notion of a localist logogen-style account of lexical representation. Instead, he argues that the conception of lexical processing in a distributed connectionist network implies that there is no such thing as a lexical representation. This radical theme is developed in two ways. At a basic level, the elimination of a localist representation of words leaves a connectionist network able to exploit partial regularities and functional relationships between the different properties of words, such as semantic, phonological and grammatical knowledge. Plaut argues that a distributed approach to the morphological structure of words is ideally placed to exploit the partial systematicity that is observed in morphological structure. Localist representations can of course be hypothesised either at the word level or at the morpheme level, but the greyer areas of partial systematicity are harder to deal with in this way. Plaut makes a similar argument when the issue of lexical ambiguity is considered. Here, although there are varieties of ambiguity that can be pigeonholed as homonymy (e.g., bank) or polysemy (e.g., paper), the connectionist approach provides a simple means of capturing the continuum of meaning-relatedness that is found in natural language. Plaut uses this approach to model the word recognition process for ambiguous words in isolation. Although the models sketched out above can be thought of as having no lexical representation, an alternative description is that they have a lexical representation that is distributed across all the nodes representing meaning and form. However, Plaut argues against the standard view of a lexical representation in a more fundamental way in the second half of his chapter. He argues that when we come to consider the process of mapping language input onto the representations of sentence meanings, the concept of a lexical representation becomes even hazier. When sentence-level comprehension is viewed in connectionist terms, it becomes hard to conceive of words as having fixed meanings; rather, they act as cues to a broader sentence meaning, with no componential subdivision of this meaning into lexical units.

The chapter by Poeppel and Idsardi focuses on speech perception—the part of spoken-language comprehension dedicated to the operations that
take the noisy speech signal as input and generate "those representations (morphemic, lexical) that serve as the data structures for subsequent operations mediating comprehension" (p. 175). This part of spoken-language processing encompasses more than what is often thought to be the realm of speech perception: the sublexical properties and units of speech, with a heavy focus on categorical perception. It is also different from the nature of mental representations of lexical form that are the theme of the chapter by Lahiri. Finally, speech perception in their view does not include the processes of lexical access, competition, and word recognition—as reviewed in the chapters by Blumstein and by Davis and Rodd. Thus, Poeppel and Idsardi concentrate on modality-specific operations that transform acoustic waveforms into words. Their approach is clearly one of cognitive neuroscience. Drawing on a host of behavioral, EEG, MEG, fMRI and lesion data, Poeppel and Idsardi sketch a model that encompasses the neural architecture for the computations, operations, and representations involved in speech perception. They present a wealth of interesting suggestions as to how acoustic properties encoded in auditory cortex are coarse-coded into universal classes (such as plosive, fricative, nasal, approximant) and grouped, in parallel, into chunks of different time size (segments, syllables), dealing with (predictable) variation along the way. This part of speech perception, starting in primary auditory cortex, is subserved by what the authors, in analogy to visual perception, call the ventral stream. Processing is bilateral, with potential granularity-of-processing preferences for the two sides. The second major processing stream, the dorsal one, is decidedly left-dominant and dedicated to the mapping of sensory to motor representations. A major role is reserved for the Sylvian parieto-temporal area as sensorimotor interface, important for the loop between perception, memory and action (speech production, in this case). Note that the view proposed here, that speech perception has bilateral as well as specialized left-hemispheric components, is also part and parcel of the current view held by William Marslen-Wilson and colleagues for language comprehension at large (Marslen-Wilson and Tyler 1997).

The contribution by Davis and Rodd provides a timely and focused overview of recent brain imaging work on the neural processes underlying spoken-word recognition. Thematically, they are running a relay race, starting their leg where Poeppel and Idsardi handed over the baton. The leitmotif of their endeavor to review neuroscientific evidence for the lexical processing of speech is the distributed Cohort model (Gaskell and Marslen-
Wilson, 1997). Davis and Rodd provide a broad overview of brain areas involved in lexical processing at large, and of the functional pathways between these areas. Dorsal and ventral parts of the superior and middle temporal gyri play a key role, with pathways taking off in dorsal (auditory-to-motor mapping) and ventral streams. In some aspects, this sketch is still tentative, due to the ambiguous picture provided by the results from neuroimaging studies. By pointing to the pros and cons of imaging, and to the problems and pitfalls of experimental materials, manipulations, and tasks, Davis and Rodd come up with an intriguing view on what it all means. Whatever the manipulation of methods and materials (comparing words and pseudowords; manipulating speech intelligibility; using semantic or syntactic anomaly or ambiguity), they put their finger on the problem: a whole world of processing happens within a time frame of seconds (which is what fMRI gives us), so what can we deduce from activations and subtractions with respect to the fast and ephemeral processes of recognizing words and their meanings? At first sight, there seem to be conflicting results everywhere one looks, with brain areas purportedly dedicated to early speech processing activated when sentential meaning is manipulated, and with failures to dissociate lexical and semantic properties, semantic and decision effects, or the roles of anterior and posterior temporal areas. Nevertheless, in the course of their review, Davis and Rodd sketch a picture of the lexical system that does have distinct roles for the areas anterior and posterior of auditory cortex, for their relation to the ventral pathways, and for the left inferior frontal gyrus. Given that neuroimaging provides us with a wide temporal window on lexical processing, and acknowledging the interactive nature exemplified by the Cohort view, it is obvious that a host of bottom-up and top-down processes will leave their imprint on the neuroimaging data. No wonder that it is so hard to isolate lexical from semantic activation; imaging data have no say about what exactly happens after, say, 200 milliseconds of speech input. No wonder that Davis and Rodd see the future of the neurocognition of lexical processing in a combination of methods with voxel precision and methods with millisecond precision.

Hagoort's chapter develops similar issues to Plaut's, but with respect to studies of brain imaging rather than connectionism. His starting point is the body of research from Marslen-Wilson and colleagues implicating the left inferior frontal cortex (LIFC) in the process of morphological decomposition. Marslen-Wilson's group has used data from neuroimaging and behavioural studies of patients and normal participants to develop a leading theory of the neural infrastructure supporting language comprehension.
Here, Hagoort describes research that builds on such a model, with a particular focus on comprehension at the sentence rather than the word level. His approach emphasizes the domain-general, shared aspects of language processing: memory, unification and control. The unification process is the key one for current purposes, as it relates to how the language perceiver combines lexical syntactic, semantic and phonological information into a unified representation of an utterance. Thus, Hagoort diverges from Plaut's conception of sentence comprehension in that he assumes that lexical representations can provide the building blocks for unified comprehension. Hagoort discusses both key empirical papers and meta-analytic reviews in order to develop a consensus view of the brain regions that contribute to unification. Emerging from these data, Hagoort sketches out a network of frontal, temporal and parietal regions that contribute to unification in some way. As in Marslen-Wilson's work, though, the LIFC plays a crucial role. Hagoort's view is that this area carries out much of the unification operation, while drawing on memory structures located in temporal regions. The role of the LIFC is seen as multipurpose rather than specifically linguistic.

Taken together, the contributions to this volume provide a wealth of new and thought-provoking results and theoretical insights on lexical processing and representation. Obviously, lexical information is essential to all domains of language comprehension, whether one considers lexical representations as concrete building blocks, or as distributed clusters of a volatile nature. The route from sound to meaning starts with speech perception, a complex set of operations tailoring the noisy input to fit representations of lexical form. A major theme is the nature of word-form representations: how abstract and asymmetric, how specific, and how malleable are they, to accommodate redundancy, but also to allow adaptation to non-trivial changes in the input and the lexicon proper? How is the internal structure of words represented, and what kind of mechanisms serve to parse the input into relevant morphemic units—getting rid of erroneous parses along the way? Another question, as relevant now as it was 40 years ago, concerns word recognition proper. How do the many activated pieces of lexical information eventually converge on one solution—by means of competition for selection? What is the nature of the relation between word forms and their meaning, and how does the complex lexical-semantic network evolve during language development? Finally, how are semantic, syntactic and form aspects pertaining to words combined, or unified, to provide the desirable outcome that we understand what the speaker is talking about? All these issues and questions are addressed in the chapters of this book. The bottom
line is that a lot of progress has been made over recent decades in the understanding of lexical processing and representation. Some of the questions might be old, but with the advance of multidisciplinary approaches and a neuroscience perspective, methods, means and data have changed—and in important ways, so have the theories.

References

Eisner, Frank and James M. McQueen. 2006. Perceptual learning in speech: Stability over time. Journal of the Acoustical Society of America 119 (4): 1950-1953.
Gaskell, M. Gareth and William D. Marslen-Wilson. 1997. Integrating form and meaning: A distributed model of speech perception. Language and Cognitive Processes 12: 613-656.
Gaskell, M. Gareth and Nicolas Dumay. 2003. Lexical competition and the acquisition of novel words. Cognition 89: 105-132.
Gaskell, M. Gareth and William D. Marslen-Wilson. 1998. Mechanisms of phonological inference in speech perception. Journal of Experimental Psychology: Human Perception and Performance 24: 380-396.
Kraljic, Tanya, Arthur G. Samuel and Susan E. Brennan. 2008. First impressions and last resorts: How listeners adjust to speaker variability. Psychological Science 19: 332-338.
Lahiri, Aditi and William D. Marslen-Wilson. 1991. The mental representation of lexical form: A phonological approach to the recognition lexicon. Cognition 38: 245-294.
Lahiri, Aditi and Henning Reetz. 2010. Distinctive features: Phonological underspecification in representation and processing. Journal of Phonetics 38: 44-59.
Leach, Laura and Arthur G. Samuel. 2007. Lexical configuration and lexical engagement: When adults learn new words. Cognitive Psychology 55: 306-353.
Marslen-Wilson, William D. 1973. Linguistic structure and speech shadowing at very short latencies. Nature 244: 522-523.
Marslen-Wilson, William D. 1987. Functional parallelism in spoken word recognition. Cognition 25: 71-102.
Marslen-Wilson, William D. 2007. Morphological processes in language comprehension. In M. G. Gaskell (ed.), Oxford Handbook of Psycholinguistics, 175-193. Oxford: Oxford University Press.
Marslen-Wilson, William D. and Lorraine K. Tyler. 1980. The temporal structure of spoken language understanding. Cognition 8: 1-71.
Marslen-Wilson, William D. and Lorraine K. Tyler. 1997. Dissociating types of mental computation. Nature 387: 592-594.
Marslen-Wilson, William, Lorraine Tyler, Rachelle Waksler and Lianne Older. 1994. Morphology and meaning in the English mental lexicon. Psychological Review 101: 3-33.
Marslen-Wilson, William and Paul Warren. 1994. Levels of representation and process in lexical access. Psychological Review 101: 653-675.
Marslen-Wilson, William and Pienie Zwitserlood. 1989. Accessing spoken words: The importance of word onsets. Journal of Experimental Psychology: Human Perception and Performance 15 (3): 576-585.
McClelland, James L. and Karalyn Patterson. 2002. Rules or connections in past-tense inflections: What does the evidence rule out? Trends in Cognitive Sciences 6: 465-472.
Morton, John. 1969. The interaction of information in word recognition. Psychological Review 76: 165-178.
Norris, Dennis, James M. McQueen and Anne Cutler. 2003. Perceptual learning in speech. Cognitive Psychology 47: 204-238.
Pinker, Steven and Michael T. Ullman. 2002. The past and future of the past tense. Trends in Cognitive Sciences 6: 456-463.
Rastle, Kathy and Matt H. Davis. 2008. Morphological decomposition based on the analysis of orthography. Language and Cognitive Processes 23: 942-971.

Semantic constraints on morphological processing
Kathleen Rastle and Marjolein Merkx

Morphologically-complex words such as distrust, trusty, or trustworthy constitute a large proportion of words in most of the world's languages. They also constitute the primary means for lexical productivity, with up to 70% of new English words over the past 50 years comprising novel combinations of existing morphemes (e.g., arborist, therapise, bioweapon; Algeo 1991). It is therefore unsurprising that the delineation of mental processes involved in the recognition of morphologically-complex words has been one of the key topics in psycholinguistics over the past 30 years (e.g., Taft and Forster 1975). There is now a relatively broad consensus that morphologically-complex words are analysed in terms of their constituent morphemes (e.g., {dark} + {-ness}) during word recognition, though fundamental questions remain, such as how to define morphological complexity and whether decomposition constitutes an all-or-none phenomenon.

One of the most influential pieces of work to emerge over the past 30 years was written by Marslen-Wilson and his colleagues and published in Psychological Review (Marslen-Wilson, Tyler, Waksler, and Older 1994). Marslen-Wilson et al. promised a more unified framework for understanding the representation and access of morphologically-complex words, in a literature which they regarded as "conflicting and inconclusive" at that time. They conducted a series of cross-modal priming experiments in which participants made lexical decisions to visual targets preceded by auditory primes. Their chief rationale for using this task was that they believed it would permit inferences to be drawn regarding the structure of a 'modality-independent lexical entry', as opposed to some 'modality-specific access representation'. The key finding was that morphological priming was observed only when the relationship between prime and target was semantically transparent (e.g., hostess-host) but not when it was semantically opaque (e.g., missile-miss). This finding led to the important conclusion that morphologically-complex words are represented in a decomposed manner at the level of the lexical entry, but only in cases in which the meaning of the full form can be derived from the meanings of its constituents (i.e., when the full form is semantically transparent; e.g., darkness, kindly). In
cases in which the meaning of the full form cannot be derived from the meanings of its constituents (i.e., when the full form is semantically opaque; e.g., department, gingerly), Marslen-Wilson et al. proposed that the full form is represented at the level of the lexical entry in a nondecomposed manner. The work of Marslen-Wilson et al. (1994) had an immediate impact on research investigating the processes underlying the recognition of morphologically-complex words (e.g., Drews and Zwitserlood 1995; Frost, Forster, and Deutsch 1997; Rastle, Davis, and New 2000) and on research investigating the breakdown of these processes in patients with brain injury (e.g., Miceli 1994; Rastle, Tyler, and Marslen-Wilson 2006; Tyler and Ostrin 1994). It was also central to the development of a distributed-connectionist perspective on morphological processing (e.g., Plaut and Gonnerman 2000; Rueckl and Raveh 1999), in which morphology is seen as a characterisation of the mapping between form and meaning. These models posit that morphologically-complex words are represented componentially in the learned internal representations that mediate orthography (or phonology) and meaning (so that the representation of hunter overlaps that of hunt), but that these componential representations develop only to the extent that the morphologically-complex word is related in meaning to its stem. This means that morphologically-structured words that are semantically opaque (e.g., witness, department) should develop internal representations that are unlike their stems in these models. Despite the impact of Marslen-Wilson et al. (1994), the publication ten years later of two articles in French (Longtin, Segui, and Hallé 2003) and in English (Rastle, Davis, and New 2004) suggested that this seminal work had not told the right story, or at least had not told the whole story. These two groups of authors both used a visual masked priming paradigm to investigate semantic effects on morphological priming. In contrast to the results of Marslen-Wilson et al. (1994), these authors reported substantial priming for derived-stem pairs, which was not influenced by semantic transparency. Critically, semantically-transparent (e.g., hunter-hunt) and semantically-opaque (e.g., department-depart; corner-corn)1 pairs yielded significantly greater priming effects than prime-target pairs with an orthographic but no morphological relationship (e.g., brothel-broth), indicating that they could not be attributed to pure orthographic (or phonological) overlap. Further, the priming effects from semantically-opaque pairs were similar in magnitude to those from semantically-transparent pairs.

On the theory expressed by Marslen-Wilson et al. (1994, also Plaut and Gonnerman 2000; Rueckl and Raveh 1999), it was not clear how one would explain the robust priming effects seen in the semantically-opaque condition, since on that theory only semantically-transparent derived words share lexical representations with their stems. Marslen-Wilson et al. had originally proposed that the lexical representations that they were interested in might be accessed through a 'modality-specific input parser' – and one could claim that the masked priming effects observed by Longtin et al. (2003) and Rastle et al. (2004) reflected the operation of such a procedure. However, Marslen-Wilson et al. ultimately rejected the notion of this input parser precisely because of the problem of morphologically-structured words that are semantically opaque. They argued that without replicating syntactic information already stored in the lexical entry, the input parser would be unable to decide accurately that a word like vanity should be decomposed but that a word like vanish should not. Thus, they favoured a model in which the (orthographic or phonological) input is mapped directly onto modality-independent lexical representations, in which morphologically-complex words were proposed to be decomposed only if they were semantically transparent. Of course, such a model provided no explanation for the masked priming effects observed ten years later. The findings of Longtin et al. (2003) and Rastle et al. (2004) thus seemed to demand a radically different theory in which the decomposition of morphologically-structured stimuli (at least in visual word recognition) is based purely on the appearance of morphological structure and is unaffected by semantic information (Rastle et al. 2004; Rastle and Davis 2008).

This chapter considers the proposals of Marslen-Wilson et al. (1994) in the light of the evidence published over the past 15 years, focusing in particular on priming data (mostly in the visual domain). In the first section of the chapter we argue that the role of semantic information in the initial decomposition of morphologically-complex words is negligible. The second section of the chapter then considers whether there is any evidence that semantic information constrains morphological decomposition at later stages of analysis, or whether effects such as those reported by Marslen-Wilson et al. (1994) could be explained as a result of the close semantic and form relationship between morphological relatives (Gonnerman, Seidenberg, and Andersen 2007; Plaut and Gonnerman 2000). Having determined that semantic information does constrain morphological decomposition at some stage of analysis, the third section of the chapter then describes two ways in which semantic information might have an influence on decomposition.
The chapter closes by considering some questions about the acquisition of morphological information and the potential role of semantic information in this process.

1. Does semantic information constrain initial decomposition of morphologically-complex words?

The role of semantic information in early morphological processing has received considerable interest in recent years. The key question is whether semantic transparency is necessary for the initial decomposition of morphologically-complex words. The articles by Longtin et al. (2003) and Rastle et al. (2004), which showed derived-stem masked priming effects for semantically-opaque complex words, started the debate. They brought into doubt the proposal by Marslen-Wilson et al. (1994) that non-transparent morphologically-complex words are lexically represented as full forms. Since then, numerous studies have looked at this issue. Rastle and Davis (2008) reviewed nineteen masked priming studies in which derived-stem priming of opaque morphological pairs (e.g., corner-corn) was compared with derived-stem priming of transparent morphological pairs (e.g., hunter-hunt) and/or non-morphological form pairs (e.g., brothel-broth; all against an unrelated baseline). They reported average priming effects of 30 ms for transparent pairs, 23 ms for opaque pairs, and 2 ms for non-morphological form pairs. Rastle and Davis (2008) interpreted these data as being inconsistent with the theory proposed by Marslen-Wilson et al. (1994), because if non-transparent complex words are represented only as full forms, then recognition of targets like broth should be influenced to the same degree by morphologically-structured primes like brother and by non-morphological primes like brothel. Instead, they argued that the data implicate a rapid morphological segmentation procedure based on the appearance of morphological structure that operates without regard to semantic information.

Though it is clear from these data that semantic transparency is not necessary for decomposition, the proposal that the initial decomposition of morphologically-structured stimuli is truly blind to semantic information has since been challenged, leading to the secondary question of whether semantic information influences this decomposition. Feldman, O'Connor, and Moscoso del Prado Martín (2009) pointed out that the 7 ms transparency effect in Rastle and Davis' (2008) meta-analysis is significant if the grand means from the various studies are used as individual data points in
an analysis that compares priming effects in transparent and opaque conditions via a t-test.2 On the basis of this result and an additional experiment which yielded a robust transparency effect and no priming for opaque pairs at all, Feldman et al. (2009) claimed that semantic information plays a role in morphological decomposition right from the earliest time points of word recognition. However, in assessing the magnitude of priming effects across the transparent and opaque conditions in the analysis of Rastle and Davis (2008), it is important to recognize that the 7 ms significant effect is driven entirely by two experiments conducted by Diependaele, Sandra, and Grainger (2005), in which the transparent priming effect averaged around 24 ms and the opaque priming effect averaged around -5 ms. Critically, these two studies differed from the others in substantive ways: (a) in the first study, a backward mask was inserted after presentation of the 53 ms prime, taking the SOA outside of the range of the other studies; and (b) in the second study, stimuli were repeated three times over various SOAs during the experiment (13 ms, 40 ms, 67 ms), and participants were also exposed to all primes and targets during the practice session. Once these two studies are excluded, the transparency effect across studies becomes a non-significant 3 ms, with every remaining study showing significant opaque priming (mean 28 ms, range 18-51 ms). The validity of Feldman et al.'s (2009) own work comparing priming effects across transparent and opaque conditions is also questionable given that a large number of their opaque items were not morphologically structured (e.g., coyness-COIN; harness-HARP; saccade-SACK; see Davis and Rastle in press, for a discussion of these issues). Thus, we claim that there is at present no convincing evidence that semantic information influences the initial decomposition of morphologically-complex words.
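The statistical dispute here reduces to a simple computation: treat each study's grand mean priming effect as a single observation per condition, compare conditions with a paired t-test, and repeat the test with the methodologically atypical studies excluded. A minimal sketch, using placeholder numbers rather than the actual values from the nineteen studies, might look as follows:

```python
# Sketch of the study-level analysis described above; all numbers are
# illustrative placeholders, not values from Rastle and Davis (2008).
from scipy import stats

# (transparent priming, opaque priming) in ms, one pair per study.
studies = {
    "study_01": (31, 26),
    "study_02": (28, 22),
    "study_03": (35, 30),
    "study_04": (24, -5),   # a methodologically atypical study
    "study_05": (29, 27),
    "study_06": (33, 29),
}

def transparency_test(data):
    transparent = [t for t, _ in data.values()]
    opaque = [o for _, o in data.values()]
    effect = sum(transparent) / len(transparent) - sum(opaque) / len(opaque)
    t_stat, p_val = stats.ttest_rel(transparent, opaque)  # paired across studies
    return effect, t_stat, p_val

print(transparency_test(studies))  # all studies included
print(transparency_test({k: v for k, v in studies.items()
                         if k != "study_04"}))  # after exclusion
```

As the text notes, an effect that hinges on one or two atypical studies in such an analysis is fragile, which is exactly what the exclusion comparison makes visible.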
2. Does semantic information constrain morphological decomposition at any stage of analysis?

Having argued that the initial decomposition of morphologically-complex words is not governed by semantic transparency, a question then arises as to whether the decomposition of these words is constrained by semantic information at any stage of analysis. There have certainly been plenty of replications of the original findings of Marslen-Wilson et al. (1994) in which morphological priming effects are observed only in semantically-transparent cases. These replications have normally come from paradigms in which primes can be perceived consciously, such as cross-modal priming (e.g., Longtin et al. 2003), visual priming with fully-visible primes (e.g., Rastle et al. 2000), and long-lag priming (in which a number of items separate prime and target; Drews and Zwitserlood 1995; Rueckl and Aicher 2008). Together with the masked priming data, these findings have led some to suggest that written word identification might be characterized by two kinds of morphological decomposition, one orthographically based and one semantically based, which arise at different time points in the recognition process (Meunier and Longtin 2007; Rastle and Davis 2003; Rastle et al. 2004).

The problem with this evidence is that it has been difficult to demonstrate that the morphological effects being observed (e.g., departure priming depart) cannot be attributed to semantic overlap or, more likely, to a combination of form plus semantic overlap (Gonnerman et al. 2007). For example, though Rastle et al. (2000) reported a large transparency effect in their study of visual priming with fully-visible primes (SOA 230 ms), transparently-related morphological pairs in their study (e.g., departure-depart) produced statistically equivalent priming to semantically-related pairs (e.g., cello-violin) and to pairs that were semantically and orthographically similar (e.g., glisten-glass). Marslen-Wilson et al. (1994) encountered a similar problem: they reported no difference between transparent derived-stem priming (e.g., hunter-hunt) and pure semantic priming (e.g., idea-notion) in their cross-modal task. Thus, it is not clear from these studies whether the priming effects for transparent items reflected morphological processing, or whether they instead reflected semantic similarity or some combination of semantic and orthographic similarity. Rueckl and Aicher (2008) made some headway on this problem in showing that long-lag priming for semantically-transparent derivations could not be reduced to semantic similarity between prime and target (priming effects for teacher-teach pairs measured a robust 52 ms while priming effects for ocean-water pairs measured a non-significant 6 ms). However, it remains possible that the priming effects for transparent derivations arose due to some combination of orthographic plus semantic similarity. This possibility could be investigated by comparing effects for transparent derivations against effects produced by pairs such as screech-scream or glisten-glass as in Rastle et al. (2000).

Marslen-Wilson et al. (1994) originally noted that an account based on pure semantic overlap could explain their key findings (e.g., priming for
punishment-punish pairs but not for department-depart pairs). However, they ruled this out when no priming effects were observed for transparent pairs in which both prime and target were suffixed (e.g., saintly-sainthood) – a finding that they attributed to cohort-related competition between suffixed forms, in which a suffixed word not only activates the lexical entry for its stem but also inhibits every other suffixed form that shares the same stem. Clearly, an account of Marslen-Wilson et al.'s findings based on pure semantic priming would have predicted an effect here as well as in all of the other semantically-related conditions. However, for an effect on which so much theoretical weight rested, it is disappointing that Marslen-Wilson et al. (1994) did not attempt to replicate their suffix inhibition effect, especially since every other combination of derived-derived pairs yielded significant priming as long as they were transparent, including prefixed-prefixed pairs (e.g., unfasten-refasten), prefixed-suffixed pairs (e.g., distrust-trustful), and suffixed-prefixed pairs (e.g., judgment-misjudge). To make matters even worse, there is no indication of a suffix-suffix inhibition effect in masked visual priming or in visual priming with fully visible primes (Rastle et al. 2000). Particularly if this latter task is thought to reflect the same lexical representations as are tested in cross-modal priming, then similar competitive processes would have been expected despite the fact that primes were visual. Finally, the recent article by Gonnerman et al. (2007) reported robust cross-modal priming effects for suffixed-suffixed pairs, thus failing to replicate Marslen-Wilson et al. (1994). In light of all of these data, it is difficult to rule out an account of Marslen-Wilson et al.'s (1994) findings based on pure semantic overlap.

In our view, the best evidence that semantic information constrains morphological decomposition at some stage of analysis comes from work by Meunier and Longtin (2007) based on earlier work by Longtin and Meunier (2005). These studies looked at priming for French morphologically-structured nonword primes, comparing (using English examples) interpretable prime-target pairs like rapidify-rapid (to make rapid) to uninterpretable pairs like rapidion-rapid and non-morphological pairs like rapidilk-rapid. Using visual masked priming, Longtin and Meunier (2005) found that (a) morphologically-structured primes yielded robust priming effects that could not be reduced to pure orthographic overlap; and (b) these effects did not depend on the semantic or syntactic interpretability of the nonwords. Longtin and Meunier (2005) argued that these effects were consistent with the notion of a rapid morphological segmentation process blind to lexical and semantic information. Meunier and Longtin (2007) then used
the same stimuli in a cross-modal priming experiment, in which they found that only the interpretable morphologically-structured primes yielded significant priming effects. The uninterpretable primes yielded effects that could not be distinguished in magnitude from the non-morphological form primes. These cross-modal priming data show that semantic information constrains priming for these items at longer time courses, and they are difficult to explain on the basis of pure semantic overlap (or on the basis of some combination of semantic and form overlap) because nonwords have no semantic representations. It is only by virtue of the decomposition of a stimulus like 'rapidify' that its meaning 'to make rapid' becomes clear.

3. How might semantic information play a role in morphological decomposition?

The data thus suggest that there is an initial form of decomposition (at least in visual word recognition) that is unaffected by semantic information and that is based solely on the appearance of morphological structure. This orthographically-based decomposition then appears to give way to a type of decomposition that is semantically informed. Though some have argued that these orthographically-based and semantically-based forms of decomposition might arise in parallel during word recognition (Diependaele et al. 2005), it seems significant that the orthographic form of decomposition is observed primarily using masked priming techniques (but see Bozic et al. 2007, who observed orthographically-based decomposition in an fMRI study using long-lag priming), and that the semantic form of decomposition is observed primarily using techniques in which the prime is displayed for a longer period and is accessible to conscious analysis (e.g., cross-modal priming, long-lag priming, visual priming with fully visible primes). This body of results thus lends itself to a hierarchical model of word recognition in which the analysis of form is followed by the analysis of meaning over the time course of recognition (contra Feldman et al. 2009).

One way to instantiate this kind of hierarchical model with respect to morphology was described by Rastle et al. (2004; also Rastle and Davis 2003, 2008) and is illustrated in Figure 1. The general idea behind this theory is that the two forms of decomposition observed behaviourally reflect decomposed representations at two separate levels of the recognition system: (a) a level of representation that characterizes the earliest stages of word recognition, in which morphologically-complex words are decomposed
posed based on the appearance of morphological structure; and (b) a level of representation that characterizes later stages of processing in which morphologically-complex words are decomposed on the basis of semantic transparency. This theory is an extension of the distributed-connectionist theories described earlier (e.g., Davis, van Casteren, and Marslen-Wilson 2003; Plaut and Gonnerman 2000; Rueckl and Raveh 1999) that focus on the second (semantically-informed) type of decomposition. Rastle et al. (2004) proposed that these models could also explain the first (orthographically-based) type of decomposition if they assumed that orthographic representations are organized morphologically – a proposal that Rastle and Davis (2008) later fleshed out with three theories of how orthographic representations might become morphologically structured in the acquisition process (such that orthographic representations of corner come to overlap those of corn). This theory thus posits that distributed representations for hunter and hunt and for corner and corn overlap at the orthographic level, but that in the hidden units mediating form and meaning only those distributed representations for hunter and hunt overlap. How then might this two-level theory account for the observation of two functionally-distinct forms of decomposition? Rastle and Davis (2008) argued that this could be achieved as long as it were assumed that masked priming effects reflect orthographic levels of processing while longer-SOA paradigms reflect higher levels of processing (i.e., representations in learned internal units). In the case of briefly-presented masked primes (e.g., 40 ms), it seems reasonable to suggest that the units predominately activated are orthographic, as semantic priming effects in masked cases are extremely difficult to find.3 Conversely, in the case of auditory primes or visual primes presented for a longer period (e.g., 200 ms), it seems reasonable to suggest that the units predominately activated are semantic, as these are precisely the situations in which robust semantic priming effects are observed (e.g., Marslen-Wilson et al. 1994; Rastle et al. 2000) and in which orthographic priming effects are weak and unreliable (Forster and Davis 1984). Thus, we might predict that savings on target processing would arise in the orthographic units in the case of masked primes (thus yielding effects for corner-corn and hunter-hunt pairs), and that savings on target processing would arise at higher levels of processing in the case of unmasked primes (thus yielding effects only for the hunter-hunt pairs). However, it is clear that there are many gaps left to fill in this theory: the most immediate of these is whether or not distributed connectionist models of this kind can perform the lexical decision task in a manner consistent with

22

Kathleen Rastle and Marjolein Merkx

the way that human readers can perform the task (see Coltheart 2004; Rastle and Coltheart 2006, for discussion). Filling this gap is necessary of course if we are to delineate precisely how the primes in these various priming paradigms yield savings on target processing.

Figure 1. Illustration of the two-level theory of morphological decomposition (see also Rastle and Davis 2008). [The figure depicts a feedforward pathway from print input to Orthography, through Hidden Units, to Semantics.]
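The division of labour in this architecture can be made concrete with a small sketch. The following Python toy is purely illustrative: the bigram coding, the hand-built hidden vectors, and the use of cosine similarity as a stand-in for priming are all invented for exposition, and this is not an implementation of the Rastle and Davis model.

```python
# Toy illustration: if priming reflects representational overlap at the
# level a paradigm taps, corner-corn should prime only where the overlap
# is orthographic, while hunter-hunt should prime at both levels.
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical orthographic codes: a word marks the letter bigrams it
# contains, so 'corner' overlaps 'corn' whether or not 'corn' is a
# genuine morpheme of 'corner'.
BIGRAMS = ['co', 'or', 'rn', 'ne', 'er', 'hu', 'un', 'nt', 'te']
def ortho(word):
    return np.array([1.0 if g in word else 0.0 for g in BIGRAMS])

# Hypothetical hidden-unit codes (the learned form-meaning mediators):
# only the semantically transparent pair shares structure here.
hidden = {
    'hunt':   np.array([1., 1., 0., 0., 0., 0.]),
    'hunter': np.array([1., 1., 1., 0., 0., 0.]),  # contains 'hunt' structure
    'corn':   np.array([0., 0., 0., 1., 1., 0.]),
    'corner': np.array([0., 0., 0., 0., 0., 1.]),  # unrelated to 'corn'
}

for prime, target in [('hunter', 'hunt'), ('corner', 'corn')]:
    print('%s-%s | orthographic overlap (masked priming): %.2f | '
          'hidden overlap (long-SOA priming): %.2f'
          % (prime, target,
             cosine(ortho(prime), ortho(target)),
             cosine(hidden[prime], hidden[target])))
```

On this toy coding both pairs overlap equally at the orthographic layer, predicting masked priming for corner-corn and hunter-hunt alike, whereas only hunter-hunt overlap in the hidden units, predicting longer-SOA priming for the transparent pair only.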

It is also possible that the two forms of decomposition observed behaviourally are consistent with a model in which there is only a single level of processing at which representations are decomposed. This kind of single-level theory was described by Meunier and Longtin (2007) and involves an initial decomposition based on the appearance of morphological structure followed by a process of semantic and/or syntactic integration in which inappropriate decompositions (e.g., a corner being interpreted as 'someone who corns') are ruled out. Meunier and Longtin (2007) outlined two theoretical instantiations of this integration process, the first of which consisted of an extension of the model proposed by Taft (2004; also Taft 1994). On this theory, a novel pseudoword like quickify would first be decomposed into its morphemic constituents and, following activation of these morphemes at a lemma level of representation (proposed to reside between orthographic and semantic representations), semantic features of the components would be activated and combined to yield the concept {to make quicker}. This process of semantic integration would fail for semantically uninterpretable pseudowords like walkify and, according to Meunier and Longtin (2007), would thus result in "the loss of any morphemic activation" (and consequently no evidence for savings in the processing of the target walk). However, it should be clear that delineating precisely how morphemic activation is "lost" under these kinds of circumstances will be critical in respect of this theory's success in explaining why walkify doesn't prime walk under cross-modal presentation conditions.

The other single-level theory described by Meunier and Longtin (2007) was based on the notion of a 'licensing procedure' (as described e.g., by Schreuder and Baayen 1995) that checks the grammatical appropriateness of morphemic combinations rather than their semantic interpretability. On this theory, a novel pseudoword like walkify would first be decomposed into its morphemic constituents during an 'access' stage of processing, but would then fail at a licensing stage (–ify cannot attach to verbs). Only those decompositions for which the licensing process is successful go on to a 'combination' stage in which a lexical representation is computed from the syntactic and semantic properties of its morphemes. Presumably, masked primes would be expected to activate representations only at the access stage of processing (in which both corner and hunter would be decomposed), whereas primes exposed for a longer duration would be expected to activate representations at the combination stage of processing (in which lexical representations are computed only for those combinations that pass the licensing stage). However, one key question in respect of this theory concerns pseudo-derivations like whisker and number for which the licensing stage would be completed successfully (–er can attach to verbs and adjectives), but for which the computed lexical representation (whisker: 'someone who whisks'; number: 'more numb') is not the correct meaning of the word.

Though Meunier and Longtin's data could not speak to whether inappropriate decompositions are ruled out on the basis of semantic or syntactic information, one piece of data that favours their single-level semantic theory was reported by Rastle et al. (2000). They examined priming effects for transparent morphological (e.g., hunter-hunt), opaque morphological (e.g., department-depart), and non-morphological semantic (e.g., violin-cello) pairs at prime exposure durations of 43 ms, 72 ms, and 230 ms. Results revealed priming effects for the transparent morphological pairs that did not differ as a function of prime exposure duration. However, priming effects for the opaque morphological pairs and the non-morphological semantic pairs appeared to arise in direct opposition to one another (see Figure 2). Priming effects for opaque morphological pairs were robust at the shortest prime exposure duration, absent at the longest prime exposure duration and somewhere in between for the middle prime exposure duration.
In contrast, priming effects for the semantic pairs were absent at the shortest prime exposure duration, robust at the longest prime exposure duration and, again, somewhere in between for the middle prime exposure duration. These data seem consistent with the notion that the increased activation of semantic information through the time course of recognition serves to refine the form-based morphological segmentations that characterize earlier points in the recognition process.

Figure 2. Priming effects at three prime exposure durations observed by Rastle et al. (2000) for prime-target pairs with a transparent morphological relationship, an opaque morphological relationship, and a nonmorphological semantic relationship.
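The crossover in Figure 2 can be summarized with a toy descriptive model (ours, for exposition only, and not an analysis from Rastle et al. 2000; the decay constant and overlap values are arbitrary): a form-driven priming component that decays with prime exposure, plus a semantically driven component that grows over the same interval.

```python
# Toy descriptive model of the time-course crossover (illustrative
# parameter values; not fitted to the published data).
import math

def priming(form_overlap, sem_overlap, soa_ms, tau=80.0):
    """Form-based priming decays with prime exposure duration; the
    semantic contribution rises on the same time scale."""
    form_weight = math.exp(-soa_ms / tau)
    return form_weight * form_overlap + (1.0 - form_weight) * sem_overlap

conditions = {                          # (form overlap, semantic overlap)
    'transparent (hunter-hunt)':  (1.0, 1.0),
    'opaque (department-depart)': (1.0, 0.0),
    'semantic (violin-cello)':    (0.0, 1.0),
}
for soa in (43, 72, 230):
    print('SOA %3d ms:' % soa,
          '; '.join('%s = %.2f' % (name, priming(f, s, soa))
                    for name, (f, s) in conditions.items()))
```

Any monotonic decay of this kind reproduces the qualitative pattern: constant transparent priming, falling opaque priming, and rising semantic priming.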

Recent neurophysiological data reported by Lavric, Rastle, and Clapp (submitted) nicely complement these behavioural findings. Lavric et al. examined the neural correlates of morphological priming using ERPs – comparing transparent morphological (e.g., hunter-hunt), opaque morphological (e.g., corner-corn), and non-morphological form (e.g., brothel-broth) conditions in a visual priming paradigm with fully-visible primes (SOA = 227 ms). Unsurprisingly, they observed robust attenuation of the N400 component for targets preceded by related as opposed to unrelated primes. More interestingly, when they divided the N400 range that they analyzed in half (300–379 ms and 380–459 ms), they observed a significant three-way interaction between priming (related or unrelated), condition (transparent, opaque, or form), and time window (early or late). Specifically, in the earlier time window they observed robust priming effects of equivalent magnitude in the opaque and transparent conditions, both of which yielded greater attenuation of the N400 than was the case in the form condition. However, in the later time window they observed a reduction in N400 priming in the opaque condition such that it became indistinguishable from that in the form condition, while N400 priming in the transparent condition remained robust. These findings are again consistent with a theory in which decompositions are based initially on form and then are progressively refined through information that becomes available later in the time course of recognition.

4. Further directions

The theory that we have described – decomposition based entirely on form followed by a refinement process based on higher-level information – seems somewhat in conflict with the fact that morphemes are by definition units of meaning. Indeed, it is the semantic content of morphemes that places them at the heart of lexical productivity, so it is difficult to understand why semantic information seems to play such a limited role in the decomposition of morphologically-structured words. Thus, one question that would be interesting to investigate is whether semantic information is necessary for (or perhaps plays a significant role in) the acquisition of the form-based morphemic representations used to segment stimuli in the recognition process. Rastle and Davis (2008) proposed three theories concerning the discovery of morphemic knowledge from exposure to morphologically-complex words. The first two theories suggest that morphemic knowledge is acquired on the basis of form information alone – specifically, through sensitivity to (a) information about transitional letter probabilities within morphemes and across morphemic boundaries in morphologically-structured words (see also Rastle et al. 2004; Seidenberg 1987); and (b) information about the frequency with which particular letter clusters (e.g., affixes) arise in combination with other familiar letter sequences (e.g., stems). However, the third theory posits that higher-level regularities between form and meaning drive lower-level orthographic learning of morphemic units. On this theory, semantic knowledge of morphologically-complex words facilitates the acquisition of form-based morphemic knowledge by reinforcing
the preferred orthographic alignment of morphologically-complex words into their constituents. Thus, while it is clear that semantic information plays a limited (or negligible) role in the online decomposition of morphologically-structured stimuli, it could be that the strength of the morphemic representations used in this decomposition process is determined by the extent to which those morphemes occur in semantically-transparent, consistent contexts (e.g., in which a particular stem or affix is consistently associated with a particular meaning).

The acquisition of morphological knowledge has been studied in some detail in children and second language learners. Although the role of semantic information in morpheme learning has not been the focus of these studies, there is some evidence that semantic information does affect affix acquisition, with semantically transparent affixes being learned more quickly (e.g., Mithun 1989). These studies concern the acquisition of natural affixes (all of which have semantic representations) in uncontrolled settings, however, and therefore do not provide a direct way of comparing the theories described above. We have to turn to studies of word learning in adults to find experiments looking at the effect of semantics in acquisition in a more controlled manner. However, research in this field has been largely inconclusive. While a study by Leach and Samuel (2007) suggested that semantic information is important for the lexicalisation of newly learned words (see also Tamminen and Gaskell 2008), Dumay, Gaskell, and Feng (2004) found no additional benefit of including semantics in their word learning study. The study which comes closest to examining the role of semantics in the acquisition of morpheme-like units is by Rueckl and Dror (1994). They provided participants with 36 words containing six different endings which they paired either systematically or inconsistently with six semantic categories. They found that items trained systematically were not only learned more quickly but were also identified faster in an identification task in which the presentation duration was adjusted to 50 percent accuracy. This could suggest that form-meaning regularities are important in the acquisition of sublexical units.

Thus, the part semantic information plays in morpheme acquisition has remained largely unexplored. Recent work by Merkx, Rastle, and Davis (2008) suggests a novel way in which to examine semantic effects in morpheme learning in a controlled laboratory setting. Participants are trained on novel affixes (e.g., -nule) that appear in novel word contexts (e.g., sleepnule, buildnule), and then tested on a range of tasks reflecting online and offline morphological processing. Though this research is at a preliminary stage, we have shown that this paradigm can be used successfully to create morpheme-like representations and that these representations affect automatic morphological processing. Further research is underway to determine how these representations are shaped by the provision of consistent and inconsistent semantic information during learning.
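The logic of the consistency manipulation can be illustrated with a minimal sketch (the endings, categories, and learner below are hypothetical, and this is not the Merkx et al. procedure or analysis): a learner that merely tracks ending-category co-occurrences can recover a systematic mapping and generalize it to novel words, but gains nothing from inconsistent pairings.

```python
# Minimal sketch: novel affixes are useful to a learner only when they
# are paired consistently with a semantic category (hypothetical items).
import random
from collections import Counter, defaultdict
random.seed(0)

ENDINGS = ['nule', 'bost', 'mip']
CATEGORIES = ['animal', 'tool', 'place']

def make_words(consistent, n=30):
    words = []
    for i in range(n):
        end = ENDINGS[i % len(ENDINGS)]           # each ending recurs
        if consistent:
            cat = CATEGORIES[ENDINGS.index(end)]  # ending predicts meaning
        else:
            cat = random.choice(CATEGORIES)       # ending is uninformative
        words.append(('stem%d%s' % (i, end), end, cat))
    return words

def mapping_recovered(consistent):
    counts = defaultdict(Counter)
    for _, end, cat in make_words(consistent):
        counts[end][cat] += 1                     # track co-occurrences
    # Guess each ending's most frequent category for a novel word and
    # score the guesses against the systematic ending->category mapping.
    hits = [counts[end].most_common(1)[0][0] == CATEGORIES[i]
            for i, end in enumerate(ENDINGS)]
    return sum(hits) / len(hits)

print('consistent pairing:   %.2f' % mapping_recovered(True))
print('inconsistent pairing: %.2f' % mapping_recovered(False))
```

In the consistent condition the mapping is recovered perfectly; in the inconsistent condition performance hovers around chance (the exact value depends on the random seed).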

Notes

1. Longtin et al. (2003) tested for priming effects separately using semantically-opaque words with a historical morphological relationship (e.g., department-depart) and using pseudomorphological words with no relationship (e.g., corner-corn). They found that these kinds of prime-target pairs yielded robust priming effects that were indistinguishable. Therefore, throughout the rest of this chapter, the term "semantically opaque" will be used to refer to both of these types of prime-target pair.
2. We thank Davide Crepaldi for pointing out that by using the grand means of different experiments as individual data points, this analysis grossly underestimates the original error variance in the data. More informative would have been to use a measure of effect size such as Rosnow and Rosenthal's (1996) r-statistic weighted by the number of participants in each study (as in Rastle and Brysbaert 2006).
3. There have been various claims that semantic priming effects on lexical decision arise under masked conditions (e.g., Perea and Gotor 1997). However, these claims have typically involved prime presentation conditions in the range of partial visibility (e.g., 67 ms; Perea and Gotor 1997), the use of prime-target pairs that are associatively as well as semantically related (Marcel 1983), the use of highly restricted prime sets (e.g., numbers 1 through 9; Dehaene et al. 1998) in which the formation of stimulus-response mappings that bypass conceptual analysis is possible (Damian 2001), and the use of presentation paradigms in which primes are classified in visible form prior to the masked priming experiment (Draine and Greenwald 1998).

This work was supported by research grants from the Leverhulme Trust (F/07 537/AB) and the Economic and Social Research Council (RES-062-23-2268) awarded to Kathleen Rastle and Matt Davis. We thank Matt Davis for valuable discussion concerning the issues discussed in this chapter. Correspondence may be addressed to [email protected].

References

Algeo, J.
1991 Fifty years among the new words: A dictionary of neologisms 1941-1991. Cambridge: Cambridge University Press.
Bozic, M., W. D. Marslen-Wilson, E. A. Stamatakis, M. H. Davis, and L. K. Tyler
2007 Differentiating morphology, form, and meaning: Neural correlates of morphological complexity. Journal of Cognitive Neuroscience 19: 1464-1475.
Coltheart, M.
2004 Are there lexicons? Quarterly Journal of Experimental Psychology Section A 57: 1153-1171.
Damian, M. F.
2001 Congruity effects evoked by subliminally presented primes: Automaticity rather than semantic processing. Journal of Experimental Psychology: Human Perception and Performance 27: 154-165.
Davis, M. H., M. van Casteren, and W. D. Marslen-Wilson
2003 Frequency effects in processing inflected Dutch nouns: A distributed connectionist account. In Morphological Structure in Language Processing, R. H. Baayen and R. Schreuder (Eds.), 427-462. Berlin: Mouton de Gruyter.
Davis, M. H. and K. Rastle
in press Form and meaning in early morphological processing: Comment on Feldman, O'Connor, and Moscoso del Prado Martín. To appear in Psychonomic Bulletin & Review.
Dehaene, S., L. Naccache, G. Le Clec'H, E. Koechlin, M. Mueller, G. Dehaene-Lambertz, P-F. van de Moortele, and D. Le Bihan
1998 Imaging unconscious semantic priming. Nature 395: 597-600.
Diependaele, K., D. Sandra, and J. Grainger
2005 Masked cross-modal morphological priming: Unravelling morpho-orthographic and morpho-semantic influences in early word recognition. Language and Cognitive Processes 20: 75-114.
Draine, S. C. and A. G. Greenwald
1998 Replicable unconscious semantic priming. Journal of Experimental Psychology: General 127: 286-303.
Drews, E. and P. Zwitserlood
1995 Morphological and orthographic similarity in visual word recognition. Journal of Experimental Psychology: Human Perception and Performance 21: 1098-1116.
Dumay, N., M. G. Gaskell, and X. Feng
2004 A day in the life of a spoken word. In Proceedings of the 26th Annual Conference of the Cognitive Science Society.
Feldman, L. B., P. A. O'Connor, and F. Moscoso del Prado Martín
2009 Early morphological processing is morphosemantic and not simply morpho-orthographic: A violation of form-then-meaning accounts of word recognition. Psychonomic Bulletin & Review 16: 684-691.
Forster, K. I. and C. Davis
1984 Repetition priming and frequency attenuation in lexical access. Journal of Experimental Psychology: Learning, Memory, and Cognition 10: 680-698.
Frost, R., K. I. Forster, and A. Deutsch
1997 What can we learn from the morphology of Hebrew? A masked-priming investigation of morphological representation. Journal of Experimental Psychology: Learning, Memory, and Cognition 23: 829-856.
Gonnerman, L. M., M. S. Seidenberg, and E. S. Andersen
2007 Graded semantic and phonological similarity effects in priming: Evidence for a distributed connectionist approach to morphology. Journal of Experimental Psychology: General 136: 323-345.
Lavric, A., K. Rastle, and A. Clapp
submitted What do fully-visible primes and brain potentials reveal about morphological decomposition? Submitted to Psychophysiology.
Leach, L. and A. G. Samuel
2007 Lexical configuration and lexical engagement: When adults learn new words. Cognitive Psychology 55: 306-353.
Longtin, C-M. and F. Meunier
2005 Morphological decomposition in early visual word processing. Journal of Memory and Language 53: 26-41.
Longtin, C-M., J. Segui, and P. A. Hallé
2003 Morphological priming without morphological relationship. Language and Cognitive Processes 18: 313-334.
Marcel, A.
1983 Conscious and unconscious perception: Experiments in visual masking and word recognition. Cognitive Psychology 15: 197-237.
Marslen-Wilson, W., L. K. Tyler, R. Waksler, and L. Older
1994 Morphology and meaning in the English mental lexicon. Psychological Review 101: 3-33.
Merkx, M. M., K. Rastle, and M. Davis
2008 Learning morphemes: Insights from skilled readers. Abstracts of the Psychonomic Society 89: 3028. [Poster at the 49th Annual Meeting, Chicago.]
Meunier, F. and C-M. Longtin
2007 Morphological decomposition and semantic integration in word processing. Journal of Memory and Language 56: 457-471.
Miceli, G.
1994 Morphological errors and the representation of morphology in the lexical-semantic system. Philosophical Transactions of the Royal Society B: Biological Sciences 346: 79-87.
Mithun, M.
1989 The acquisition of polysynthesis. Journal of Child Language 16: 285-312.
Perea, M. and A. Gotor
1997 Associative and semantic priming effects occur at very short SOAs in lexical decision and naming. Cognition 62: 223-240.
Plaut, D. C. and L. M. Gonnerman
2000 Are non-semantic morphological effects incompatible with a distributed connectionist approach to lexical processing? Language and Cognitive Processes 15: 445-485.
Rastle, K. and M. Brysbaert
2006 Masked phonological priming effects: Are they real? Do they matter? Cognitive Psychology 53: 97-145.
Rastle, K. and M. Coltheart
2006 Is there serial processing in the reading system; and are there local representations? In From Inkmarks to Ideas: Current Issues in Lexical Processing, S. Andrews (Ed.). Hove: Psychology Press.
Rastle, K. and M. H. Davis
2003 Reading morphologically complex words: Some thoughts from masked priming. In Masked Priming: The State of the Art, S. Kinoshita and S. Lupker (Eds.), 279-305. New York: Psychology Press.
2008 Morphological decomposition based on the analysis of orthography. Language and Cognitive Processes 23: 942-971.
Rastle, K., M. H. Davis, W. D. Marslen-Wilson, and L. K. Tyler
2000 Morphological and semantic effects in visual word recognition: A time-course study. Language and Cognitive Processes 15: 507-537.
Rastle, K., M. H. Davis, and B. New
2004 The broth in my brother's brothel: Morpho-orthographic segmentation in visual word recognition. Psychonomic Bulletin & Review 11: 1090-1098.
Rastle, K., L. K. Tyler, and W. Marslen-Wilson
2006 New evidence for morphological errors in deep dyslexia. Brain & Language 97: 189-199.
Rosnow, R. L. and R. Rosenthal
1996 Computing contrasts, effect sizes, and counternulls on other people's published data: General procedures for research consumers. Psychological Methods 1: 331-340.
Rueckl, J. G. and K. A. Aicher
2008 Are CORNER and BROTHER morphologically complex? Not in the long term. Language and Cognitive Processes 23: 972-1001.
Rueckl, J. G. and I. Dror
1994 The effect of orthographic-semantic systematicity on the acquisition of new words. In Attention and Performance XV, C. Umiltà and M. Moscovitch (Eds.). Hillsdale, NJ: Erlbaum.
Rueckl, J. G. and M. Raveh
1999 The influence of morphological regularities on the dynamics of a connectionist network. Brain and Language 68: 110-117.
Schreuder, R. and R. H. Baayen
1995 Modelling morphological processing. In Morphological Aspects of Language Processing, L. B. Feldman (Ed.), 131-154. Hillsdale, NJ: Erlbaum.
Seidenberg, M. S.
1987 Sublexical structures in visual word recognition: Access units or orthographic redundancy? In Attention and Performance XII: The Psychology of Reading, M. Coltheart (Ed.), 245-263. Hove, UK: Lawrence Erlbaum Associates.
Taft, M.
1994 Interactive-activation as a framework for understanding morphological processing. Language and Cognitive Processes 9: 271-294.
2004 Morphological decomposition and the reverse base frequency effect. Quarterly Journal of Experimental Psychology Section A 57: 745-765.
Taft, M. and K. I. Forster
1975 Lexical storage and retrieval of prefixed words. Journal of Verbal Learning and Verbal Behavior 14: 638-647.
Tamminen, J. and M. G. Gaskell
2008 Novel words entering the mental lexicon: Is meaning necessary for integration? [Poster at the meeting of the Experimental Psychology Society, Cambridge, UK.]
Tyler, L. K. and R. K. Ostrin
1994 The processing of simple and complex words in an agrammatic patient: Evidence from priming. Neuropsychologia 32: 1001-1013.

The lexicon and phonetic categories: Change is bad, change is necessary

Arthur G. Samuel

1. The illusion of the adult language user as being in a steady state

Most psycholinguists would probably accept the following characterization of the development of speech perception: Babies are prepared to learn all possible phonological segments, and young infants show the ability to distinguish phonemic contrasts not only in the language spoken around them, but in other languages as well. Over the course of the first year of life, the system becomes more attuned to the language in the child's environment, with a complementary loss in the ability to distinguish speech sounds not found in this environment (Werker and Tees 1999). A two-year-old's perception of phonetic contrasts is quite similar to that of an adult from the same language community. The development of the mental lexicon is widely viewed as playing out on a much longer time scale, but as following a similar path. Over the course of the first several months babies begin to recognize a small number of words from their environment, and over the course of the first several years there is a rapid growth and development of the child's vocabulary (Ganger and Brent 2004). By the time the person is a young adult, the mental lexicon is fully developed, with modest changes after that.

Although these descriptions are valid at a macro level, they obscure the dynamic nature of speech perception and word recognition when these processes are examined in more detail. Adding new words to the lexicon of an adolescent or adult language user is a surprisingly frequent event. Nation and Waring (1997) estimate that people add about a thousand word forms (a word and its close morphological relatives) per year, up to an asymptote of about 20,000 word forms. This translates to about three word forms per day, every day of the year: The lexicon is constantly developing, and models of word recognition must account for how this development affects processing. As we shall see, the need to insert new entries into the lexicon actually has the potential to undermine the basic functioning of the system.

At the phonemic level, the apparent stability of the system is also only true at the macro level. There are a number of processes, operating on time scales that range between fractions of a second and months or years, that
produce changes in the way that acoustic phonetic information gets mapped in phonetic space. An example of a very short term adjustment is the contextual effect of one segment on an adjacent segment. For example, the phonetic boundary between /d/ and /g/ will shift as a function of the place of articulation of a preceding sound: a sound near the boundary will be heard as /d/ if it is preceded by /r/, but as /g/ if the preceding sound is /l/. This shift has been studied in the compensation for coarticulation literature (e.g., Elman and McClelland 1988; Mann and Repp 1981). A similar shift has been examined in studies of phonetic contrast (e.g., Diehl, Kluender, and Parker 1985; Simon and Studdert-Kennedy 1978), in which a sound near a phonetic boundary will be reported contrastively with a preceding sound (e.g., a sound near the /d/-/g/ boundary will be reported as /d/ if it follows a clear /g/, but as /g/ if it follows a clear /d/). These two contrastive effects operate on a very short time scale, on the order of seconds. A third contrastive effect, selective adaptation, lasts considerably longer, on the order of minutes (Harris 1980). Unlike compensation for coarticulation or contrast, adaptation is generated by the repeated presentation of a stimulus; test items are presented after the repetition of the adapting sound (Eimas and Corbit 1973; Samuel 1986). The adaptation effect grows with increasing numbers of presentations of the adaptor (Vroomen et al. 2007). All three effects show that phonetic category boundaries are not static – they shift as a function of recently heard speech sounds.

Other shifts in phonetic categorization operate on a much longer time scale. For example, when listeners are presented with a phonetic contrast that does not exist in their native language, modifying their phonetic category space may take months, if it occurs at all. A well-studied case is the discrimination of English /r/ and /l/ by native speakers of Japanese. This contrast does not exist in Japanese – there is a single liquid that is not a very good match to either /r/ or /l/, and native Japanese speakers are known to have difficulty with this distinction. Logan, Lively, and Pisoni (1991; Lively, Logan, and Pisoni 1993) trained native Japanese speakers (who were living in the US) on the /r/-/l/ contrast for three weeks. In most of the conditions the listeners heard English words with /r/ or /l/ produced by five different talkers, in a variety of different phonetic contexts (e.g., word-initial; in a cluster such as /pl/; word-final). The listeners were asked to decide whether a given stimulus included /r/ or /l/, and they received feedback on each trial. The listeners slowly improved their ability to distinguish /r/ versus /l/, demonstrating that it is possible, but difficult, to modify the phonetic categorization process for these kinds of contrasts. For our purposes, an important question is why these types of changes are so slow, and are only obtained with large amounts of intentional learning. We will soon
consider one answer to this question: Changing a working system may jeopardize the operation of the system more generally, and thus modifications must be made carefully. There is a recent group of studies that also examine changes in the phonetic categorization process. These studies are similar to the adaptation literature in the sense that they involve the presentation of a number of stimuli that induce changes in phonetic categorization, but the changes in this case are more long-lasting, like those in the intentional learning procedures for teaching the /r/-/l/ contrast. This type of phonetic adjustment process has been termed "perceptual learning" by some (e.g., Norris, McQueen and Cutler 2003), and "recalibration" by others (Bertelson, Vroomen, and de Gelder 2003). The experimental approach in these studies is to present listeners with phonetically ambiguous stimuli, with some source of contextual information that disambiguates the stimuli. Perceptual learning (or recalibration) is subsequently indexed by a shift in phonetic categorization toward the contextually-defined speech environment. As in adaptation studies, the measurement is one of category boundary location, assessed by having the listeners identify members of a continuum of speech sounds. After exposure to acoustically ambiguous speech sounds that are contextually disambiguated, listeners increase their report of sounds consistent with the context they received. Note that this effect is in the opposite direction to the shift found with adaptation. In adaptation, listeners reduce report of phonetic categories similar to the repeated sounds (as in the contrastive effect found in colour vision, in which extended exposure to a red stimulus produces an aftereffect biasing the system toward green, the perceptual opposite of red). In perceptual learning studies of speech, in which ambiguous speech sounds (rather than the unambiguous items repeated in adaptation studies) are disambiguated by context, the assimilative shifts could presumably help the listener to understand speech better in the prevailing input environment. In this domain, as in others, there is evidence that the perceptual system does not make adjustments casually, because such modifications might impair the overall performance of phonetic categorization. In fact, at both the phonetic level and at the lexical level, there is an inherent tension: The input to the system changes over time (new words appear; new speakers, with different ways of speaking, are encountered) and in order to operate optimally the system should adjust to these changes. However, the existing system is one that has been developed to operate well at the phonetic and lexical levels, and changes should only be made if the newly-encountered material is something that is likely to persist – a simple mispronunciation of “this” as “thizz” should neither generate a new lexical entry “thizz”, nor
should it drive changes in the phonetic category boundary for /s/. Therefore, an optimal perceptual system should generally have a relatively high threshold for modifying its representations at either level. We have called this the Conservative Adjustment/Restructuring Principle, or CA/RP for short (Samuel and Kraljic 2009). The following two sections present some of the reasons and the evidence for the CA/RP, at the lexical and at the phonetic levels.

2. Changes in the lexicon: Growing pains

In any data storage system, adding new information to the system has the potential to impair overall functioning. As was noted above, such additions are inescapable in spoken word recognition, at multiple levels: People learn new words all the time (Nation and Waring 1997), and listeners are constantly encountering new speakers with potentially different dialects, accents, or idiolects. The system clearly does accommodate all of these changes – we learn new words, and we adjust to the speech of new acquaintances. However, there are both theoretical and empirical reasons to believe that such learning has costs. French (1999) and McCloskey and Cohen (1989) have both discussed the possibility of creating "catastrophic interference" by adding new entries to an existing memory system if the information is fed into the existing network too rapidly. The basic notion is that if a memory system consists of a pattern of connections among many units, with the strength of these connections established by prior experience, inserting new patterns into the system rapidly will cause the changes in connections to propagate through the system in a way that will undermine the previously-established equilibrium. Although there are differing views on the exact form of representation of the mental lexicon (e.g., Gaskell and Marslen-Wilson 2002; Goldinger 1998; Grossberg 1980; Marslen-Wilson 1987), most models do posit some kind of network of lexical entries that would be subject to this type of catastrophic interference. Given the inescapable need to continuously add large numbers of new words to this network, how can such catastrophic interference be prevented? McClelland, McNaughton, and O'Reilly (1995) have noted that a way to solve this problem would be to buffer the new information so that its rate of introduction into the network is kept at a safe level. Specifically, McClelland et al. suggested that information is initially represented in the hippocampus, and that it is then slowly fed into the neocortex over time, particularly during sleep. The hippocampal representations do not need to be in a system with the long-term structure that is subject to catastrophic interference because less information is kept there, and for a comparatively short time. It is worth noting that if the system is in fact organized in this fashion, then this is a very nice example of the CA/RP. McClelland et al.'s model is a conservative restructuring process, in both senses of conservative: The system is conservative about adding new items to the network (because of the possible damage that could be done to the network), and the buffering is explicitly designed to conserve the network's integrity.

McClelland et al.'s (1995) hypothesis that complementary learning systems are needed to avoid catastrophic interference has two very interesting implications. First, the hypothesis predicts that there should be two quite different representational forms for additions to the lexicon (the form in the hippocampus, and the form in the neocortex). Second, if the neocortical component is really the system where words are represented over the long term, it is possible that probing the memory system before the words have been transferred to the neocortex may yield worse performance than after they have been connected to the rest of the lexicon. In other words, it may take some time (at least overnight, according to their hypothesis) for the information to be consolidated and allow the words to function as other lexical items do. In fact, recent studies that examine the development of new lexical entries provide support for all three aspects of this conception: (1) Recently learned words show evidence for having two qualitatively different forms of representation; (2) one of these representational forms seems to involve hippocampal function; and (3) newly learned words appear to undergo a consolidation process over a timescale of days. These properties can be seen as a way to operate under the constraints posited by the CA/RP – the system must change, but it must do so carefully/conservatively.
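The interference-and-buffering logic can be seen in a deliberately small simulation. This is our sketch of the general phenomenon, using a linear delta-rule associator rather than the networks discussed by French (1999) or McClelland et al. (1995); the dimensions, learning rate, and random data are all arbitrary.

```python
# Toy demonstration: rapidly inserting new pattern set B into a network
# trained on set A degrades performance on A; interleaving B with
# continued exposure to A (the effect of a slow, buffered feed)
# preserves A.
import numpy as np
rng = np.random.default_rng(0)

A_in, A_out = rng.normal(size=(5, 20)), rng.normal(size=(5, 20))
B_in, B_out = rng.normal(size=(5, 20)), rng.normal(size=(5, 20))

def train(W, pairs, epochs=200, lr=0.01):
    for _ in range(epochs):
        for x, y in pairs:
            W = W + lr * np.outer(y - W @ x, x)   # delta-rule update
    return W

def error_on(W, X, Y):
    return float(np.mean((Y - X @ W.T) ** 2))

W0 = train(np.zeros((20, 20)), list(zip(A_in, A_out)))
print('after learning A:           error on A = %.3f' % error_on(W0, A_in, A_out))

W_seq = train(W0, list(zip(B_in, B_out)))          # B alone, "too fast"
print('after B alone:              error on A = %.3f' % error_on(W_seq, A_in, A_out))

W_mix = train(W0, list(zip(A_in, A_out)) + list(zip(B_in, B_out)))
print('after interleaving A and B: error on A = %.3f' % error_on(W_mix, A_in, A_out))
```

A linear associator shows only partial interference; in the multilayer networks that motivated McClelland et al.'s proposal the disruption can be far more severe, which is precisely the reason for keeping the rate of new input at a safe level.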

Leach and Samuel (2007) introduced a distinction between "lexical configuration" and "lexical engagement" when a new word is added to the mental lexicon. Lexical configuration is the collection of information that the person has acquired about a word. Depending upon the learning circumstances, different bits of knowledge may be acquired at different times. For example, for young children, phonetic and semantic information will typically be the first parts of a lexical representation to be acquired, because children will hear (rather than read) words, and they will hear them in a particular (semantic) context. An adult, in contrast, may well have orthographic information without phonetic information if a word is encountered while reading, and initially the person might know more about the syntactic properties (e.g., if the new word came after "the" and before a verb) than the semantics. Over time, with additional encounters, the lexical configuration can be filled out. In contrast to this relatively static set of facts associated with a word, Leach and Samuel suggested that lexical engagement is a dynamic property that functional members of the lexicon have: Such items activate or compete with other lexical representations, and they can support the perception of their components during listening or during reading.

Leach and Samuel (2007) drew the distinction between lexical configuration and lexical engagement on logical grounds, but they were strongly influenced by Gaskell and Dumay's (2003) exploration of how newly acquired words begin to compete with words that were already present in the mental lexicon. Gaskell and Dumay created conditions designed to link new words to existing ones, to see if the acquisition of the new words would affect processing of the existing words. For example, for the word "cathedral", Gaskell and Dumay created the new word "cathedruke". Each day, for five days, participants were repeatedly exposed to such nonwords in the context of a phoneme monitoring task. The participants also completed a lexical decision task each day that included real words (e.g., "cathedral") that were similar to the newly learned nonwords (e.g., "cathedruke"). If and when a functional lexical entry for "cathedruke" developed, it should compete with (i.e., engage) the entry for "cathedral" in a lexical decision task, slowing responses to such similar words (compared to control words without new competitors). By the third day of training, Gaskell and Dumay found exactly this pattern, providing evidence for the emergence of lexical engagement. Note that the particular form of engagement was competition/inhibition, an example of the potential cost of adding new information to the existing lexicon. It is plausible that lexical engagement is a property of lexical representations that are established in the neocortex; the relatively short-term buffering of the new words in the hippocampus may not be structured in a way that supports these kinds of engagement. In fact, it may be that precisely the kinds of networks that are subject to "catastrophic interference" are those that allow for dynamic engagement between the lexical entries.
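A generic competition sketch shows what this form of engagement amounts to computationally. The update rule and all parameter values below are invented for illustration (this is not Gaskell and Dumay's model): once 'cathedruke' has a functional entry, it drains activation from 'cathedral', so the existing word takes longer to win.

```python
# Toy lexical competition: candidates gain bottom-up support and
# inhibit one another in proportion to their activations.
def settle(support, inhibition=0.25, steps=50, threshold=0.9):
    acts = {w: 0.1 for w in support}
    for step in range(1, steps + 1):
        total = sum(acts.values())
        acts = {w: max(0.0, a + support[w] - inhibition * (total - a) * a)
                for w, a in acts.items()}
        winner = max(acts, key=acts.get)
        if acts[winner] >= threshold:
            return winner, step          # word recognized at this step
    return winner, steps

# 'cathedral' matches the input best, so it gets the stronger support;
# the newly learned competitor nevertheless slows its settling time.
print('before learning:', settle({'cathedral': 0.15}))
print('after learning: ', settle({'cathedral': 0.15, 'cathedruke': 0.10}))
```

The slowed settling time for "cathedral" in the second call is the analogue of the slowed lexical decision responses that Gaskell and Dumay observed.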

Leach and Samuel (2007) began a systematic study of the development of lexical configuration and lexical engagement. The first step was to test several tasks that had the potential to tap either lexical configuration or lexical engagement. Based on these tests, one task was selected to measure configuration, and one to measure engagement. For configuration, on each trial a word or nonword was initially presented at a very poor signal-to-noise ratio, and was then presented at successively more favorable S/N ratios. Subjects were told to push a button when they were confident they knew what the utterance was, and they then wrote down the item. The measure of lexical engagement was the ability of a newly learned word to support perceptual learning. Recall that through perceptual learning, when listeners hear an odd pronunciation of a speech sound in a word, they shift their phonetic space to incorporate the odd sound; the shift does not occur when the critical sound is in a nonword (Norris et al. 2003). At various points during the learning of new words, subjects were presented with versions of those words with the kind of odd pronunciations that can trigger perceptual learning. If the new words have established fully functioning lexical representations, these representations can engage sublexical representations and restructure the sublexical codes, a test of lexical engagement.

Leach and Samuel (2007) used these two tasks to assess how lexical configuration and lexical engagement develop as a function of the learning regime. In one learning condition of this five-day experiment, subjects heard twelve new words (e.g., "bibershack", "figondalis") in the context of a phoneme monitoring task. A second group of subjects learned the twelve new words with a very different regime: Each word was paired with a colour picture of an unusual/unfamiliar object, and subjects learned which name went with each picture. On each training trial, two of the pictures were presented side-by-side on a monitor, while one of the words was played over headphones. The task was to push either the right or left button on a response pad, to indicate whether the word matched the picture on the right or the left of the screen. After the button push, the correct picture stayed on the screen for 1 sec, to provide feedback. In both learning regimes, each novel word was presented 24 times per day. Each day lexical configuration was assessed using the words-in-noise task, and lexical engagement was measured by looking for perceptual learning that was mediated by the new words. Both training regimes produced strong development of lexical configuration, as shown by the ability of listeners to recognize the words at higher and higher noise levels across training; accuracy improved similarly. The training regimes were, however, strikingly different in their ability to promote lexical engagement. With phoneme monitoring training, there were small and statistically weak perceptual learning effects, with no increase in these effects over the course of training. In contrast, learning words by associating them with pictures produced lexical representations that were fully capable of engaging sublexical codes, generating large and growing perceptual learning effects over training. This contrast suggested that lexical engagement develops when there are semantic associations available for the new words, with no such necessity for lexical configuration to develop. To test this idea, a new group of subjects learned the words over the course of five days by hearing each word several times in a story each day. This semantically rich regime, as predicted, led to lexical representations that produced significant perceptual learning effects.

Thus, Gaskell and Dumay (2003) established that a form of lexical engagement (lexical competition) can develop over the course of several days, and Leach and Samuel (2007) showed that the properties of the learning situation can differentially support the development of lexical engagement or just lexical configuration. Gaskell and his colleagues have conducted a comprehensive set of studies that have documented the role of overnight consolidation of the new information needed to support lexical engagement. Recall that such consolidation can be viewed as a consequence of the Conservative Adjustment/Restructuring Principle: New lexical information must be fed into the existing lexical network slowly to avoid catastrophic interference (French 1999; McClelland et al. 1995; McCloskey and Cohen 1989). Dumay and Gaskell's (2007) study was directly focused on the potential role of sleep in consolidating newly learned lexical items. The central manipulation was when the new words were taught to the subjects, with the critical contrast being between those who learned the novel words in the morning versus those who learned them at night. With the appropriate controls that were included in this study, this contrast provides a simple test of the role of sleep: The group who learned the words in the morning could be tested twelve hours later (at night) before they had slept, whereas those who learned the words at night would have slept before being tested twelve hours later (in the morning). If sleep plays an important role in transforming the initial representations into ones that have the ability to engage in true lexical activities, then the two groups should differ in the level of lexical competition that they display. In fact, that is what Dumay and Gaskell found: The subjects who learned the words at night produced significant lexical competition effects when they were tested twelve hours later, after sleep; the subjects who learned the words in the morning did not produce such lexical competition effects twelve hours later (without sleep). Given the various control conditions in the study, the results provide good evidence that sleep is important for lexical consolidation. Tamminen and Gaskell (2008) extended these results by testing subjects 1, 2, 3, 19, 20, and 33 weeks after they learned the novel words, and they confirmed that the lexical competition effects produced by items like "cathedruke" remain robust over all of these time frames; when tests were conducted on the day of training they generally showed no evidence for such competition.

There is thus good evidence for the existence of two different types of representations (in Leach and Samuel's (2007) framework, those that reflect lexical configuration versus those that support lexical engagement), with a
consolidation process needed to develop the more active form of representation. Recall that McClelland et al. (1995) hypothesized that the hippocampus is an important site for the initial (configural) representation, with the more robust representations developed through a slow transfer to the neocortex. There are recent fMRI results that support this idea of the hippocampus as an initial buffer for the newly acquired lexical information. For example, Breitenstein et al. (2005) taught people to associate novel words with pictures of objects, and measured hippocampal activation during the learning process. They found that as the associations strengthened, the activity level dropped; high hippocampal activation accompanied initial learning, but as the new "words" became established, the hippocampus was less engaged. Davis et al. (2009) have provided additional converging evidence that ties hippocampal involvement together with the consolidation process. Davis et al. constructed three sets of new words of the "cathedruke" type. Subjects were taught one of the sets initially, and were brought back the following day for additional testing. On the second day the subjects learned a second set of novel words. This design produced words with three hypothesized states of knowledge: One set that had consolidated overnight, one set that was learned but not consolidated, and one set that had not been learned. At this point the subjects were put in a scanner and asked to do a task that has been shown to be sensitive to lexical competition (pause detection; Mattys and Clark 2002). In addition to replicating the behavioral effects of consolidation, Davis et al. found reduced hippocampal activation for the words that had been learned but not yet consolidated compared to new words. They also reported a very intriguing and very specific result: For the words that were first heard in the scanner, there was a by-item correlation of hippocampal activity with performance on a recognition test that was given after the scanning was completed: Words that received more hippocampal processing were better remembered afterwards. These results are very interesting, and clearly call for additional work of this sort (e.g., examining whether lexical engagement, measured the following day, would also be correlated with the hippocampal activation pattern seen for the words that had been learned just prior to the scan). This is an area of research that is just developing, but the results that are emerging are quite consistent with the view that adding new words to the mental lexicon follows the pattern suggested by McClelland et al. (1995): The initial storage is mediated by the hippocampus, and the information is carefully fed into the neocortex over time (and sleep) in a way that avoids the catastrophic interference that such networks can suffer from as a result of overly rapid change. In this sense, new word learning can be seen as
being subject to the CA/RP: In order to conserve the highly functional mental lexicon, while still meeting the need to change (i.e., to add new words), a fairly elaborate process has developed that only allows these changes to occur slowly and carefully. The CA/RP seems to be a general property of cognition and perception. In fact, we (Samuel and Kraljic 2009) did not develop this hypothesis in the domain of lexical acquisition. Rather, this principle emerged from studies we conducted of language change at the phonetic level, not the lexical level. The next section describes how the phonetic categorization process shifts as a function of the input to the system. As at the lexical level, such change is necessary, and such change brings with it the potential for damage to the previously well-functioning system. Presumably for the same reasons, it appears that changes at the phonetic level are also done conservatively.

3. Changes in phonetic coding: Breaking up is hard to do (right)

As with lexical processing, the steady-state system for phonetic categorization is grounded in years of development and fine-tuning. The languages of the world collectively draw upon a relatively large inventory of consonants and vowels, but each language uses a much smaller subset of the possibilities. One of an infant's first linguistic jobs is to learn what subset is used in the prevailing environment, and to develop a set of boundaries that will divide the phonetic space into the correct categories for the language being spoken in the infant's environment (a fascinating variation of this problem, which will not be discussed here, is how an infant develops multiple phonetic spaces when the prevailing linguistic input includes more than one language). As noted in the introduction, classic studies of infant speech perception have shown that babies come ready to perceive all possible phonetic distinctions (though not necessarily via specifically linguistic mechanisms), and that over the course of their first year they have honed in on the correct subset, enhancing their ability to make these discriminations at the expense of being able to make those that are used in other languages (Werker and Tees 1999). The phonetic boundaries that emerge from this process presumably reflect the distribution of sounds that the child has encountered. This distribution, in turn, reflects the variability that characterizes speech – each token of a word will differ from all others due to differences in speakers, speech rate, phonetic contexts, and many other factors.

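As a cartoon of how a category boundary could fall out of the input distribution (our illustration only; the VOT values and the Gaussian assumption are invented, and no specific model is being attributed here), one can fit two categories to a bimodal set of voice-onset-time tokens and place the boundary where the category likelihoods cross.

```python
# Sketch of boundary-from-distribution learning over a single acoustic
# cue (voice onset time, in ms). Requires Python 3.8+ for NormalDist.
import random, statistics
random.seed(2)

vot_b = [random.gauss(10, 5) for _ in range(200)]   # /b/-like tokens
vot_p = [random.gauss(60, 8) for _ in range(200)]   # /p/-like tokens

dist_b = statistics.NormalDist(statistics.mean(vot_b), statistics.stdev(vot_b))
dist_p = statistics.NormalDist(statistics.mean(vot_p), statistics.stdev(vot_p))

# Scan between the two modes for the likelihood crossover point.
boundary = min((x * 0.1 for x in range(100, 600)),
               key=lambda x: abs(dist_b.pdf(x) - dist_p.pdf(x)))
print('learned /b/-/p/ boundary: %.1f ms VOT' % boundary)
```

A different mix of tokens yields a different boundary, which is the sense in which the emerging categories reflect the distribution of sounds that the child has encountered.
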
Even with a system that has been developed on the basis of such variation, there will still be new variants that are encountered, ones that do not map particularly well onto the existing category structure. Some of these cases were mentioned in the introduction, e.g., when a listener must learn a contrast that is not in the native language inventory (such as when Japanese listeners encounter the /r/-/l/ contrast). A complementary situation arises when listeners are trying to understand speech in their native language when this speech is produced with a heavy foreign accent; in this case, the speaker's native language categories intrude on the contrasts in the listener's native language and create large deviations from the established categories. Under these conditions, new phonetic mappings must be developed. In some cases this is very slow and effortful (Bradlow and Bent 2008), similar to the intentional learning of a foreign contrast, but under certain conditions the adjustment can be made after hearing just a few sentences (Clarke and Garrett 2004).

A likely factor in how quickly a listener can adapt to phonetic variation is the degree to which the adjustment is relatively local. If a change only applies to a limited set of cases (either in terms of the range of items or in terms of the time it will be in effect), then relatively quick shifts can occur. Some of these short-term situations were discussed in the introduction, such as phonetic contrast effects (e.g., Diehl et al. 1985; Simon and Studdert-Kennedy 1978) and compensation for coarticulation (e.g., Elman and McClelland 1988; Mann and Repp 1981). These effects only apply to perception of sounds heard in a very short window of time, and as such, they presumably do not involve any changes to the categorization boundaries themselves. Even somewhat longer-lasting and more general shifts generated by repeated presentation of a speech sound (selective adaptation; Eimas and Corbit 1973; Samuel 1986) are still relatively limited and again could be implemented without restructuring phonetic space in any lasting way. Thus, these cases should not be subject to the CA/RP, because they have limited potential to damage the operation of the phonetic categorization process. The situations that have been examined in studies of perceptual learning (Norris, McQueen, and Cutler 2003) are fundamentally different in the sense that these appear to be cases in which long term changes are made to the phonetic categorization process. Under these circumstances changes should be made more cautiously, as they have the potential to disrupt the well-functioning system that is in place, and it is in the context of these studies that we developed the CA/RP (Samuel and Kraljic 2009). In order to understand how the CA/RP applies in this domain, it is necessary to review some of the basic facts that have been established with respect to perceptual learning in speech.

As described in the introduction, perceptual learning or recalibration occurs when a listener encounters a relatively small number (the minimum has not yet been well specified) of acoustically ambiguous tokens of an existing phoneme. Critically, there is additional information available to the listener to disambiguate the phoneme. In Bertelson, Vroomen, and de Gelder's (2003) study of recalibration, the ambiguous phoneme was a sound midway between /b/ and /d/, and the disambiguation was provided by the mouth of the talker presented on a computer screen; the mouth clearly articulated either a /b/ or a /d/, and the subject thus heard the acoustically ambiguous sound in accord with the visual cues. In Norris et al.'s (2003) study of perceptual learning, the ambiguous sound was midway between /s/ and /f/, and the disambiguation was provided by lexical context. If the ambiguous sound came at the end of a word that can only end in /s/ (and not in /f/), then listeners heard it as /s/; conversely, the same ambiguous sound was heard as /f/ if the lexical context only allowed /f/. Previous work had already established, for both visual and lexical context, that the immediate perception of an acoustically ambiguous phonetic segment would be strongly influenced by the context. What the Bertelson et al. and the Norris et al. studies showed, however, was something new: As a result of encountering a small number of these situations, subjects shifted their phonetic category boundaries; probes given after the exposures, now without any context, showed that the phonetic categories had been expanded. Sounds that had been ambiguous now clearly belonged to one of the two categories, with the category determined by the earlier visual or lexical context. For example, if the ambiguous sound had been presented in words that must have /s/, then stimuli in the formerly ambiguous range were now in the /s/ category.
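The boundary-expansion result can be sketched with a one-dimensional caricature (ours, not the Norris et al. model; the cue values and the update rate are arbitrary): ambiguous tokens that the lexical context labels as /s/ pull the /s/ category toward them, moving the category boundary and leaving formerly ambiguous sounds inside the expanded /s/ category.

```python
# Toy recalibration: each category is summarized by the mean of an
# acoustic cue; the boundary sits midway between the category means.
def boundary(mean_s, mean_f):
    return (mean_s + mean_f) / 2.0

mean_s, mean_f = 2.0, 8.0              # arbitrary cue units
print('boundary before exposure:    %.2f' % boundary(mean_s, mean_f))

ambiguous_cue = 5.0                    # midway between /s/ and /f/
for _ in range(20):                    # lexically /s/-biased exposures
    mean_s += 0.1 * (ambiguous_cue - mean_s)   # drift toward the tokens
print('boundary after /s/ exposure: %.2f' % boundary(mean_s, mean_f))
# The boundary moves toward /f/, so a cue value of 5.0 now falls on the
# /s/ side: the expanded /s/ category reported in these studies.
```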

The lexicon and phonetic categories

45

someone with an odd pronunciation of /s/, hearing normally-produced /s/ sounds will reset the system). To test these possibilities, listeners were presented with odd pronunciations of either /s/ or /∫/ in a lexical decision task. One group was tested on /s/-/∫/ identification immediately afterward to measure the size of the perceptual learning effect for these conditions. Several other groups had the same identification test 25 minutes later, to see whether the system reset as a function of time or of later input. To test whether the effect dissipates over time, one group did a silent filler task for this period. To test whether resetting requires hearing more normal speech, other groups received speech input between the exposure phase and the labeling test, with the groups differing in the nature of the speech (whether it contained normal exemplars of /s/ and /∫/, and whether the voice matched or mismatched the voice used during the lexical decision exposure phase). The results were clear: Neither time nor later speech input returned the representations to their original settings – none of the treatment conditions was significantly different from the immediate test. Eisner and McQueen (2006) have extended the time window much further. They exposed listeners to ambiguous pronunciations of /s/ and /f/, and measured perceptual learning either immediately, or the following day. They found that the shifts were just as large a day later as they were initially. These results illustrate just how robust the perceptual learning effect is, and suggest that the system should not restructure itself without good cause. This is, of course, exactly what is entailed by the CA/RP.

Two recent papers (Kraljic, Brennan, and Samuel 2008; Kraljic, Samuel, and Brennan 2008) have provided additional results that support the CA/RP. The thrust of this research is that if there is some alternative explanation for phonetic variation, rather than it being something that signals a long-term phonetic change, then the system conserves its existing phonetic categorization – perceptual learning is blocked. Kraljic, Brennan, and Samuel (2008) took advantage of the fact that in certain dialects, including the Long Island region where the research was done, many speakers spontaneously produce /s/ as an ambiguous blending of /s/ and /∫/ when the /s/ is followed by /tr/. For example, "street" gets pronounced more like "shtreet" than "street". This dialectal feature provides a test of whether perceptual learning is blocked if listeners have this alternative attribution available for any odd pronunciations they hear. Two groups of subjects did a lexical decision task that included ambiguous tokens, followed by identification of items varying between /s/ and /∫/. Critically, one group heard ambiguous /s/ sounds in a wide range of phonetic contexts, as in previous studies; the other group only heard the mixtures (made in exactly the same way) immediately before /tr/. The usual perceptual
learning shift was found for the first group, but no perceptual learning occurred for the second group. The contrasting results are consistent with the CA/RP: When an odd pronunciation has a possible "explanation", perceptual learning is blocked. For the /str/-only exposure condition, the ambiguous pronunciation was treated as a dialectal variation, rather than as a new type of variation that should trigger phonetic restructuring. It is also possible that the manipulation used in this study (a more "sh"-like sound of the /s/ before /tr/) might have been perceived as a kind of assimilation of the /s/ to the place of articulation of the following sounds, again providing the system with a reason to block changes in the phonetic categorization boundaries. On either interpretation, the phonetic categorization process preserves its existing structure when there is some way to account for phonetic variation; the system is conservative.

Kraljic, Samuel, and Brennan's (2008) study provides two more findings that support the CA/RP. In one experiment, the exposure phase for the ambiguous /s/ or /∫/ stimuli was manipulated in a rather subtle way. The 200-item lexical decision task was divided into two blocks, with 50 words and 50 nonwords per block. In one block, there were 10 "odd" pronunciations of /s/ or of /∫/; in the other block, all of the pronunciations of these critical phonemes were normal. The key manipulation was the order of the two blocks: Half of the subjects heard the block with odd tokens first, followed by the block with all normal sounds; for the other half, the block order was reversed (the blocks were presented without any additional pauses, so that subjects did not know that there were actually two different blocks). If listeners establish a model of a talker based on the initial exposure, then hearing the odd tokens first should generate perceptual learning, and hearing later good tokens would not change this (as shown in Kraljic and Samuel's 2005 resetting study). In contrast, if only good tokens are encountered first, then any later odd ones from this source may be attributed to some (unknown) transient cause – a conservative system should not restructure. The results supported this "first impressions" idea, with the usual pattern of perceptual learning when the odd tokens were in the first block, and no shifts when they were preceded by a block of good tokens.

The second test in this study used audiovisual presentation. Odd versions of /s/ were presented from the outset in the lexical decision exposure phase, a procedure that has consistently generated robust perceptual learning. The critical manipulation was in the video of the speaker. The speaker was constantly fiddling with a pen, and sometimes put the pen in her mouth. For one group of subjects the pen was in the speaker's mouth on all of the critical trials (i.e., the ones with /s/ produced as a mixture of /s/ and /∫/). For the other subjects, the pen was never in her mouth on critical trials.
Even though the audio was exactly the same for the two groups, the pattern of perceptual learning was not: When the mispronunciation could be attributed to a temporary cause (the pen seen in the mouth), perceptual learning was blocked. When the system has information that can account for the odd pronunciation, it is conservative and does not restructure; when there is no other basis for the variation, perceptual learning takes place.

4. Change is bad, change is necessary

The subjective impression of most adult language users is that they have a steady-state language system that operates very effectively. As noted in the introduction, this impression is only half true: Most language use is indeed impressively effective, but the steady-state aspect is only true at a macro level. At both the lexical and the phonetic level there is much more change going on than people notice. Adults are learning a surprisingly large number of new words all the time, and their supposedly static phonetic categorization system is actually undergoing changes on multiple time scales whenever they are interacting with other speakers. Thus, because of the changing input that the language system is constantly receiving, change is necessary; without change, people could not understand all of the new words that confront them, and without change, the rampant variation in speech could not be processed.

Although change is thus unavoidable, change also necessarily brings with it the potential for damage. Exactly because the system is working so well, changing it could upset the smooth equilibrium that exists. For this reason, an optimal system is also a conservative system: Change should only occur when there is strong evidence that the new lexical or phonetic information is not some transient variation on an existing representation, and even then, changes to the existing system should be executed slowly and carefully. The consolidation effects for acquiring new words that have now been well documented (e.g., Dumay and Gaskell 2007; Tamminen and Gaskell 2008) can be seen as a consequence of the Conservative Adjustment/Restructuring Principle (Samuel and Kraljic 2009). There have also been some reports of such consolidation at the phonetic level (e.g., Fenn, Nusbaum, and Margoliash 2003). More generally, at the phonetic level, there are now multiple examples of the system blocking long-term adjustments (perceptual learning) when there are alternative attributions for the phonetic variation (Kraljic, Brennan, and Samuel 2008; Kraljic, Samuel, and Brennan 2008). Collectively, the results of recent studies support the
view that the language processing system is best viewed as one that, at multiple levels, maintains sets of representations that are operating in dynamic equilibrium, rather than steady state. This is an optimal adaptation in a world in which change is bad, but change is necessary.

References

Bertelson, Paul, Jean Vroomen, and Beatrice de Gelder
2003 Visual recalibration of auditory speech identification: A McGurk aftereffect. Psychological Science 14: 592-597.
Bradlow, Anne R., and Tessa Bent
2008 Perceptual adaptation to non-native speech. Cognition 106: 707-729.
Breitenstein, Caterina, Andreas Jansen, Michael Deppe, Ann-Freya Foerster, Jens Sommer, Thomas Walbers, and Stefan Knecht
2005 Hippocampus activity differentiates good from poor learners of a novel lexicon. Neuroimage 25: 958-968.
Clarke, Constance M. and Merrill Garrett
2004 Rapid adaptation to foreign-accented English. Journal of the Acoustical Society of America 116 (6): 3647-3658.
Davis, Matthew H., Anna Maria Di Betta, Mark J. E. MacDonald, and M. Gareth Gaskell
2009 Learning and consolidation of novel spoken words. Journal of Cognitive Neuroscience 21: 803-820.
Diehl, Randy L., Keith R. Kluender, and Ellen M. Parker
1985 Are selective adaptation and contrast effects really distinct? Journal of Experimental Psychology: Human Perception and Performance 11: 209-220.
Dumay, Nicolas and M. Gareth Gaskell
2007 Sleep-associated changes in the mental representation of spoken words. Psychological Science 18: 35-39.
Eimas, Peter D. and John D. Corbit
1973 Selective adaptation of linguistic feature detectors. Cognitive Psychology 4: 99-109.
Eisner, Frank and James M. McQueen
2006 Perceptual learning in speech: Stability over time. Journal of the Acoustical Society of America 119 (4): 1950-1953.
Elman, Jeff L. and James L. McClelland
1988 Cognitive penetration of the mechanisms of perception: Compensation for coarticulation of lexically restored phonemes. Journal of Memory and Language 27: 143-165.
Fenn, Kimberly M., Howard C. Nusbaum, and Daniel Margoliash
2003 Consolidation during sleep of perceptual learning of spoken language. Nature 425: 614-616.
French, Robert M.
1999 Catastrophic forgetting in connectionist networks. Trends in Cognitive Sciences 3: 128-135.
Ganger, Jennifer and Michael R. Brent
2004 Reexamining the vocabulary spurt. Developmental Psychology 40: 621-632.
Gaskell, M. Gareth and Nicolas Dumay
2003 Lexical competition and the acquisition of novel words. Cognition 89: 105-132.
Gaskell, M. Gareth and William D. Marslen-Wilson
2002 Representation and competition in the perception of spoken words. Cognitive Psychology 45: 220-266.
Goldinger, Stephen
1998 Echoes of echoes? An episodic theory of lexical access. Psychological Review 105: 251-279.
Grossberg, Stephen
1980 How does a brain build a cognitive code? Psychological Review 87: 1-51.
Harris, Laura B.
1980 An assessment of current hypotheses concerning selective adaptation using phonetic stimuli. Unpublished master's thesis, SUNY Binghamton.
Kraljic, Tanya, and Arthur G. Samuel
2005 Perceptual learning for speech: Is there a return to normal? Cognitive Psychology 51: 141-178.
Kraljic, Tanya, Susan E. Brennan and Arthur G. Samuel
2008 Accommodating variation: Dialects, idiolects, and speech processing. Cognition 107: 54-81.
Kraljic, Tanya, Arthur G. Samuel, and Susan E. Brennan
2008 First impressions and last resorts: How listeners adjust to speaker variability. Psychological Science 19: 332-338.
Leach, Laura and Arthur G. Samuel
2007 Lexical configuration and lexical engagement: When adults learn new words. Cognitive Psychology 55: 306-353.
Lively, Scott E., John S. Logan, and David B. Pisoni
1993 Training Japanese listeners to identify English /r/ and /l/ II: The role of phonetic environment and talker variability in learning new perceptual categories. Journal of the Acoustical Society of America 94: 1242-1255.
Logan, John S., Scott E. Lively, and David B. Pisoni
1991 Training Japanese listeners to identify English /r/ and /l/: A first report. Journal of the Acoustical Society of America 89 (2): 874-886.
Mann, Virginia A., and Bruno H. Repp
1981 Influence of preceding fricative on stop consonant perception. Journal of the Acoustical Society of America 69: 548-558.
Marslen-Wilson, William D.
1987 Functional parallelism in spoken word recognition. Cognition 25: 71-102.
Mattys, Sven L. and Jamie H. Clark
2002 Lexical activity in speech processing: Evidence from pause detection. Journal of Memory and Language 47: 343-359.
McClelland, James L., Bruce L. McNaughton, and Randall C. O'Reilly
1995 Why there are complementary learning systems in the hippocampus and neocortex: Insights from the successes and failures of connectionist models of learning and memory. Psychological Review 102: 419-457.
McCloskey, Michael and Neal J. Cohen
1989 Catastrophic interference in connectionist networks: The sequential learning problem. In The Psychology of Learning and Motivation, G. H. Bower (ed.), 109-165. New York: Academic Press.
Nation, Paul, and Robert Waring
1997 Vocabulary size, text coverage, and word lists. In Vocabulary: Description, Acquisition, Pedagogy, N. Schmitt and M. McCarthy (eds.), 6-19. New York: Cambridge University Press.
Norris, Dennis, James M. McQueen and Anne Cutler
2003 Perceptual learning in speech. Cognitive Psychology 47: 204-238.
Samuel, Arthur G.
1986 Red herring detectors and speech perception: In defense of selective adaptation. Cognitive Psychology 18: 452-499.
Samuel, Arthur G. and Tanya Kraljic
2009 Perceptual learning in speech perception. Attention, Perception & Psychophysics 71: 1207-1218.
Simon, Helen J., and Michael Studdert-Kennedy
1978 Selective anchoring and adaptation of phonetic and nonphonetic continua. Journal of the Acoustical Society of America 64: 1338-1357.
Tamminen, Jakke, and M. Gareth Gaskell
2008 Newly learned spoken words show long-term lexical competition effects. The Quarterly Journal of Experimental Psychology 61: 361-371.
Vroomen, Jean, Sabine van Linden, Beatrice de Gelder, and Paul Bertelson
2007 Visual recalibration and selective adaptation in auditory-visual speech perception: Contrasting build-up courses. Neuropsychologia 45 (3): 572-577.
Werker, Janet F., and Richard C. Tees
1999 Influences on infant speech processing: Toward a new synthesis. Annual Review of Psychology 50: 509-535.

Early links in the early lexicon: Semantically related word-pairs prime picture looking in the second year

Suzy Styles and Kim Plunkett

1. Introduction

With word meanings and word forms paired up like books in their jackets, the human lexicon is a vast and complex library. From its sparse beginnings in infancy, the lexicon incorporates thousands of words and concepts into an efficient processing system. Relationships between words provide an interconnected cross-referencing system, allowing the mature language-user to slip between the shelves with ease.

For decades, the technique of 'priming' has been used to probe organisational characteristics of the adult semantic system. In the priming method, semantic context is systematically manipulated to influence on-line language processing. When sequential activation of particular items alters task performance, inferences can be made about the psychological reality of the relationship between the items – and thereby, inferences about the nature of the system. Both visual and auditory primes are known to influence the speed of lexical access (Antos 1979; Meyer and Schvaneveldt 1971; Radeau 1983) and ambiguity resolution (Swinney 1979). Various types of semantic relationship have been demonstrated using the priming method, including word association (Moss et al. 1995; Nation and Snowling 1999), taxonomy (Meyer and Schvaneveldt 1971), shared semantic features (McRae et al. 2005; Moss, McCormick, and Tyler 1997), and instrumental relationships (Moss et al. 1995). Thus, facilitation in priming tasks can be understood as a spread of activation between related items in a semantic network (Anderson 1983; Collins and Loftus 1975; Meyer and Schvaneveldt 1976).

Yet little is known about the development of the semantic system from first words and concepts, to this complex adult system. Is a system encoding relationships between concepts in place from the early stages of word learning? Or does it arise after extensive experience? Are early relationships adult-like? Or does reorganisation occur? The goal of this chapter is to review recent evidence for adult-like relationships between words in the second year using a recently developed method for investigating the organisational properties of the infant lexicon.
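The spread-of-activation account just cited lends itself to a compact illustration. The following toy sketch in Python is ours, not the chapter's: the network, link weights, and decay parameter are invented purely to make the mechanism concrete.

# Toy spreading-activation network: nodes are words/concepts, weighted
# links are semantic relationships, and activation decays as it spreads.
# All names and numbers here are invented for illustration.
SEMANTIC_LINKS = {
    "cat": {"dog": 0.8, "whiskers": 0.6},
    "dog": {"cat": 0.8, "bone": 0.7},
    "plate": {"fork": 0.7},
}

def spread_activation(prime, decay=0.5, depth=2):
    """Propagate activation outward from a prime node; each traversed
    link passes on a decayed share of its source node's activation."""
    activation = {prime: 1.0}
    frontier = [prime]
    for _ in range(depth):
        next_frontier = []
        for node in frontier:
            for neighbour, weight in SEMANTIC_LINKS.get(node, {}).items():
                gain = activation[node] * weight * decay
                if gain > activation.get(neighbour, 0.0):
                    activation[neighbour] = gain
                    next_frontier.append(neighbour)
        frontier = next_frontier
    return activation

# Hearing "cat" leaves "dog" partially pre-activated, so a subsequent
# target "dog" is retrieved from a higher baseline than after "plate".
print(spread_activation("cat"))
# -> roughly {'cat': 1.0, 'dog': 0.4, 'whiskers': 0.3, 'bone': 0.14}

On this kind of account, the facilitation measured in priming experiments corresponds to the residual activation a related target inherits from the prime.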


1.1. Models of the lexicon: connectivity and development

Many traditional models of network organisation propose that each word or concept acts as a discrete node, with semantic relationships forming connections between nodes. Collins and Loftus' (1975) spreading activation model, for example, characterises each concept in semantic memory as an individual node, with nodes linked by relationships, through which activity can flow. In lexical network models of this kind, the flow of activity from a 'prime' word can result in partial pre-activation of a related test word, allowing more efficient retrieval. While such models successfully capture many adult lexical processing effects, they do not typically provide a description of how newly acquired words might 'link in' to an existing network, nor when such a system arises during lexicon growth. For example, are the earliest-learned words represented as individual 'semantic islands' allowing maximum discrimination between known concepts? Or does the early semantic system encode relatedness between similar concepts from the very beginning?

Some suggestions about the early stages of semantic network development come from the mathematical growth modelling of Steyvers and Tenenbaum (2005). In their model, networks derived from word association norms demonstrate a mathematical relationship between how long a word has been part of a normative network, and how densely it is connected to other words: Early learned words are more deeply integrated in the network than recently learned words. This finding suggests that adult network connectivity encodes the sequence of network development, including the early stages of word learning. Steyvers and Tenenbaum's (2005) statistical modelling of lexicon growth uncovered a 'scale-free small-world' structure, suggesting that some properties of lexicon structure are continuous throughout development. These statistical models of network growth suggest that adult-like semantic relationships may be evident very early in lexicon development, although they do not rule out the possibility of an earlier stage, during which individual words are effectively unconnected 'islands', followed by substantial lexical re-organisation.

Alternative, 'distributed' models of the adult lexicon propose a highly featured semantic space in which each concept is represented by a broad pattern of activation across numerous nodes, some of which are shared by other concepts. For example, cat and dog share nodes encoding features such as 'four legs' and 'is furry' (e.g., Cree and McRae 2003; McRae, de Sa, and Seidenberg 1997). Models of this kind encode semantic relatedness via overlap in feature space. Thus, activating one concept gives related
concepts a 'head start' in priming contexts, due to the ongoing activation of shared resources. As in McClelland and Rogers' (2003) parallel distributed processing account of semantic learning, similarity can be encoded in the overlap of shared resources. These distributed models of semantic learning thus accommodate semantic relatedness from the beginnings of concept formation. Support for shared resources in early semantic representations can be found in the extensive literature on children's concept formation (reviewed, for example, in Mandler 2000), and characteristic reports of young children's 'overextension' errors (e.g., Bowerman 1987; E. Clark 1973).

While theories of semantic development account for semantic relatedness early in conceptual development, there is little evidence to suggest how the conceptual system develops a fully functional interface with word forms. As conceptual representations are mapped onto words in the lexicon, the nature of word-to-word relationships in the lexicon is an important, yet relatively unexplored domain. From a word learning perspective, Hills et al.'s (2009) network modelling of normative toddler vocabulary growth suggests that semantic features such as 'has legs' or 'is furry' allow new words to link into a semantic network consisting of very few words. At different stages in lexicon growth, their model shows some words as isolated lexical islands, with others clustering together into proto-categories and theme-groups, suggesting that organisation may be variable in the early stages, but lead to systematic adult-like semantic organisation with scale. However, Hills et al.'s (2009) model takes its data from a normative toddler vocabulary, and is unable to assess whether the feature-based relationships are psychologically valid in the individual learner's lexicon. Are they adult-like? Would they influence performance in an on-line language-based task?

1.2. A methodological conundrum

Traditional sequential priming studies rely on adult reaction times (RTs) during behavioural tasks such as lexical-decision, categorisation, and word-reading – tasks typically employed over large sets of stimuli. Despite the ubiquity of priming methods in the adult semantic memory literature, no current behavioural method is widely accepted for investigating semantic relationships in the lexicons of children below the age of five. Barriers to the use of adult methodologies in early development include participants' limited attention spans, relatively small vocabularies, lack of explicit
metalinguistic knowledge (e.g., whether a string of sounds is a 'word'), and inability to follow complex instructions.

A small number of auditory priming methodologies have been extended for use with toddlers and school-aged children. Primed lexical decision, object decision and picture naming tasks have shown that semantic, thematic and associative relationships between words can influence reaction time for normally developing children in the early school years (Carr et al. 1982; Hashimoto, McGregor, and Graham 2007; Nation and Snowling 1999), and a primed verbal memory task has shown that associative relationships can affect recall accuracy for 3- to 4-year-olds (Krackow and Gordon 1998). These studies suggest that children as young as 3 years-of-age demonstrate aspects of adult-like semantic memory. However, the applicability of these methods to even younger populations remains limited by the complexity of the tasks employed. Priming for toddlers requires an intuitive experimental task without complex instructions, which can also accommodate the high temporal accuracy typically required to observe priming effects.

In an alternative approach, researchers in the field of electroencephalography have sought to replicate adult patterns of brain activity in studies designed for toddlers. The 'N400 component' in adult event-related potentials (ERPs) is understood to index semantic congruency (Kutas and Hillyard 1980), semantic relatedness (Federmeier and Kutas 1999) and category organisation (Heinze, Muente, and Kutas 1998). In recent studies seeking a toddler analogue of the N400 component, Friedrich and Friederici (2004, 2005) presented toddlers with nouns while they looked at a single picture on screen. Toddlers' ERPs following the onset of the spoken word included a component similar to the adult N400, which was large in the presence of a mismatch between the picture and the label, small in the presence of a match, and midway between these values for a mismatch drawn from the same semantic category. Thus toddlers looking at a picture of a cat were less sensitive to a mismatch when they heard a word like dog than they were when they heard a word like chair. The authors interpret their findings as a demonstration of adult-like 'graded semantic sensitivity' in a priming task.

However, certain features of the experimental design limit the applicability of these findings to the standard priming literature. Firstly, the Friedrich and Friederici task investigates neural responses to language while semantic information from the picture is still available. The response thus indexes a 'match' between the auditory string and the ongoing picture. Many studies in the sequential priming literature attempt to distinguish automatic priming effects (in which the prime pre-activates the target prior
to target activation) from strategic priming effects (which can include backwards priming from the target to the prime following target activation). The automaticity of a priming effect is thus more difficult to interpret when one of the stimuli remains available continuously.

Torkildsen and colleagues (2007) addressed the sequential aspect of lexical processing more directly in an auditory-only toddler ERP study. When 24-month-olds were presented with pairs of words in a sequential priming paradigm, the ERP following the onset of the target word differed according to whether the prime and target were members of the same category. This finding suggests that toddlers' on-line language processing is indeed influenced by preceding auditory context. However, the authors acknowledge that it is difficult to tell what cognitive correlate these ERP signatures might index in terms of the ease or speed of linguistic processing. In adult studies, it is possible to combine ERPs with behavioural responses, and thereby assess how cortical activity relates to cognitive outcomes. Without a behavioural correlate to the toddler N400 effect, it is unclear whether this component reflects adult-like semantic relatedness, or arises from a different process.

An additional complication for toddler ERP studies is the necessity to control physical differences between stimuli. Every participant is exposed to large numbers of stimuli, and all stimuli are included in the analysis of all participants. Thus, the ERP method makes it difficult to take into account whether individual toddlers understood, or were familiar with, individual words. Consequently, any appearance of graded sensitivity to a 'within category mismatch' (e.g., hearing dog while looking at a picture of a cat) may actually be the outcome of pooling data of different types. That is to say, toddlers who understood both dog and cat may have registered the word as a mismatch to the picture. Yet, toddlers who understood only one of the two words may have accepted the pairing as a 'match' through overextension. A mean drawn from these two populations would produce a value midway between 'match' and 'mismatch', giving the appearance of graded sensitivity.

It thus remains to be seen whether adult-like primed processing effects can be replicated in an on-line behavioural task for toddlers, in which item-level sensitivity is achievable. If toddlers' behaviour is affected by verbal context in sequential word presentation, it would provide strong support for a model of lexicon development which includes adult-like semantic organisation from a very early age. It is necessary, therefore, to employ a toddler-friendly behavioural task in which stimulus presentation can be systematically varied to produce a priming context, and which uses ease and speed of lexical comprehension as an index of online language processing.


The inter-modal preferential looking (IPL) task, first used by Golinkoff et al. (1987), is a free-looking task for toddlers, in which a pair of pictures is presented, and eye gaze monitored while auditory stimuli are introduced. When an image is labelled, toddlers' looking behaviour shows an increase in preference for the named image (Reznick 1990), and a tendency to initiate fixations away from pictures mismatching the label (Swingley and Fernald 2002). The free-looking task is a flexible framework which is sufficiently sensitive to index comprehension of individual words, as well as toddlers' sensitivity to minor manipulations in picture typicality (Meints, Plunkett, and Harris 1999; Meints et al. 2002), and phonological specificity (Fernald, Swingley, and Pinto 2001; Mani and Plunkett 2007; Swingley and Aslin 2007; White and Morgan 2008), to mention just a few of the factors that have been investigated. Contemporary implementations of IPL employ offline frame-by-frame analysis of video recordings, thus achieving high temporal accuracy. This infant method shares many similarities with the adult 'Visual World' paradigm, in which adults' eye movements to an array of pictures reflect the time course of online language processing (e.g., Huettig and Altmann 2005; Huettig and McQueen 2007; Kamide, Altmann, and Haywood 2001; Yee and Sedivy 2006).

In this chapter we describe a recent adaptation of the IPL method first described by Styles and Plunkett (2009a), which accommodates the sequential auditory presentation of adult priming studies, and employs a period of free-looking in the test phase of the trial. To create a priming context, words are presented prior to the onset of the picture pair. This ensures that lexical access begins in the absence of semantic information from the visual domain. The free-looking task involves implicit processing of the auditory stimuli, but has an overt behavioural outcome (eye-gaze). This words-before-pictures design allows us to ask a variety of questions: For example, does relatedness between known words influence the ease and speed of toddlers' lexical comprehension? Does the locus of the priming effect reside in the relationship between the prime word and the target word, or between the prime and the picture, or both? Are any observed priming effects excitatory or inhibitory?

1.3. The words-before-pictures task

In order to validate the sequential auditory priming approach, Experiment 1 presents an un-primed IPL task which acts as a baseline for the words-before-pictures stimulus organisation. As illustrated in Figure 1A, this experiment
assesses whether a word beginning prior to picture presentation generates a sufficiently stable internal representation for referent identification, in a fast-paced task. A second experiment introduces sequential auditory priming¹. During the auditory phase of the trial, lexical primes precede all target labels. In half of the trials, primes and targets are related by taxonomy and association; in half, they are unrelated. The primary hypothesis is that when the prime and target word are related, lexical access of the target word will be enhanced relative to the unrelated condition: If the semantic system has adult-like connectivity at this age, toddlers are predicted to show more interest in the named target picture (due to greater activation of the concept), and to identify the named target picture faster (due to pre-activation of the concept). As illustrated in Figure 1B, this priming effect may arise out of spreading activation/resource sharing at a lexical or conceptual level of representation, or some combination of both. However, it is possible to ascertain whether the predicted semantic relationships influence the ability to process a spoken word and to perform a picture-matching task. As illustrated in Figure 1C, a third experiment examines whether the prime word alone is sufficient to generate interest in the target, if it remains unnamed.


Figure 1. Schematic of lexical processing models for three experiments. A: In Experiment 1, the target label activates a representation allowing the target picture to be identified. B: In Experiment 2, two different kinds of relationship between the prime and the target could increase target discrimination – broad (overlapping) representations, or discrete (interlinked) representations. C: In Experiment 3, the overlapping representation model is assessed, by changing the stimulus organisation to include only the auditory prime, without the target label.
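The 'broad (overlapping) representations' route in Figure 1B can be made concrete by reading relatedness off feature overlap, as in the cat/dog example of section 1.1. A minimal Python sketch follows; the feature sets are invented for illustration, not drawn from feature norms.

# Invented feature sets in the spirit of distributed models
# (cf. Cree and McRae 2003); not normed data.
FEATURES = {
    "cat":   {"has four legs", "is furry", "has whiskers", "is a pet"},
    "dog":   {"has four legs", "is furry", "has a tail", "is a pet"},
    "plate": {"is round", "is found in kitchens"},
}

def feature_overlap(word_a, word_b):
    """Jaccard overlap of two feature sets: shared / total features."""
    a, b = FEATURES[word_a], FEATURES[word_b]
    return len(a & b) / len(a | b)

print(feature_overlap("cat", "dog"))    # 0.6 -> shared resources give "dog"
                                        #        a head start after "cat"
print(feature_overlap("plate", "dog"))  # 0.0 -> no pre-activation

On this reading, a related prime facilitates the target because the shared features are already active, whereas the interlinked-node alternative attributes the same facilitation to activation flowing along explicit links.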

2. Experiment 1

In order to validate responses to pictures in a primed context, it is critical to provide a baseline for toddlers' performance in an un-primed version of the proposed task. As noted above, the priming task has no explicit instructions, and relies on the tendency shown by both toddlers and adults to preferentially fixate a picture when it is named. However, the task employed here presents the target label shortly before the onset of the pictures, and there are a number of reasons to predict that the novel stimulus arrangement might generate a different outcome. First, given the age of participants (18- and 24-month-olds), the abrupt onset of the visual stimuli might
disrupt normal processing, causing the target word to be 'erased' by picture processing. Second, as standard implementations of IPL include a long picture familiarisation phase prior to word onset, toddlers might need to explore both pictures for some time before they are able to accumulate a clear preference for the named target. Third, although toddlers are able to demonstrate rejection of a mismatching picture by shifting their eyes away quickly (Swingley and Fernald 2002), this skill has only been demonstrated when toddlers were already fixating the picture at the onset of the target label. Toddlers may not have the same motivation to switch fixation if the word is presented before the picture.

Concerning the first possibility, previous picture-first studies have demonstrated that toddlers as young as 18 months-of-age can extract sufficient acoustic information from the first 300 ms of a word to correctly identify its referent, even if the rest of the word is omitted (Fernald et al. 2001). In the present experiment, a stimulus onset asynchrony of 400 ms from the onset of the target label to the onset of the picture-pair was selected to ensure that if toddlers' auditory processing is disrupted by picture onset, they would still hear enough of the target word to correctly identify a matching picture.

In order to address concerns about whether named pictures will attract preferential gaze in the short time available, and whether toddlers will be motivated to reject mismatching pictures quickly, two measures of eye-gaze will be used. The first assesses relative picture preference over the whole presentation period, and the second looks at the micro-structure of eye gaze within the trial, assessing reaction time following the first fixation. Using both large- and small-scale measures of eye-gaze provides a nuanced investigation of the time-course of lexical access in the novel stimulus arrangement, and the relative interest labels invoke in the as-yet-unseen pictures. If 18-month-olds' and 24-month-olds' comprehension of the target words is undisrupted and the timing appropriate, toddlers are predicted to accumulate more looking time to the named target picture than the unnamed distracter. Toddlers are also predicted to disengage from their first fixated picture faster if it mismatches the target label, than in the case of a match.

2.1. Method

Participants were recruited from a database of parents who had previously expressed an interest in participating in developmental studies. In the week before visiting the laboratory, primary caregivers of all participants were
sent the Oxford CDI (Hamilton, Plunkett, and Schafer 2000), a British adaptation of the MacArthur-Bates CDI (Fenson et al. 1994), normed on the local population. The Oxford CDI lists 416 words, and assesses both receptive and productive vocabulary. Parents brought their completed CDIs with them to the testing session. In a small number of cases, they completed the form on the day of testing, either during their visit, or returned it by post shortly after. Toddlers who visited the laboratory were given a small gift for their participation.

Twenty 18-month-olds and twenty 24-month-olds participated in the study. Three failed to complete the study, and were removed from analysis. Total receptive CDI scores were checked against previously collated norms, and toddlers' comprehension of test items was assessed. Following preliminary checks, one further eighteen-month-old was removed from analysis for an extremely low receptive CDI score (in the 5th percentile of previously collected CDIs for 18-month-olds). The final sample included seventeen 18-month-olds (10 males, mean age: 18.3 months; range: 17.0 to 18.9 months), and nineteen 24-month-olds (13 males, mean age: 24.1 months; range: 23.3 to 24.9 months).

2.1.1. Selection of test items

To select age-appropriate words, 548 previously collected Oxford CDIs were consulted. The younger age (18 months) was used as a baseline for lexical comprehension rates. From the 179 CDIs which fell in the age range 17.5 to 18.5 months-of-age, concrete nouns reported as 'understood' by more than 50% of 18-month-olds were considered as potential stimuli. Normative word associations were collated from the Birkbeck Word Association Norms (Moss and Older 1996), and 18 words were selected for use in the current study.

2.1.2. Stimulus preparation

Two stimulus lists were created, in which half of the items were named 'target' pictures, and half of the items were unnamed 'distracter' pictures. Pairs of pictures shared no taxonomic or associative relationship, and no phonological onset or rhyme (e.g., target: fish, distracter: aeroplane), and were yoked across lists. Items which acted as named targets in one list were
unnamed distracters in the other list. Stimulus lists are given in the Appendix.

Audio stimuli were created in a single recording session, in a sound-attenuating booth, on a Marantz solid state recorder sampling at 44.1 kHz. A minimum of three tokens of each auditory stimulus were produced by a female native speaker of British English, using high-affect, child-directed speech. The single best token of each stimulus was manually selected for clarity, typicality and affect, and edited to remove head and tail clicks. Visual stimuli were high quality digital photographs, judged as typical exemplars of test items by three native speakers of English. Pictures were presented on a 10% grey background.

2.1.3. Procedure

After a few minutes of 'settling in' in a dedicated play room, toddlers sat on their caregiver's lap facing a large flat-screen monitor in a purpose-built IPL booth. Caregivers were asked to wear headphones and to close their eyes during the procedure, which lasted approximately one and a half minutes. The experimenter moved to an adjacent control room. After drawing the toddler's attention to the screen area using a 'ding' sound from a centrally located loudspeaker, each trial was manually initiated when the toddler's attention was centred on the screen. While the screen was blank, an auditory attention phrase began (e.g., Ooh look!) followed by an inter-stimulus interval (ISI) of 200 ms, then the target word in isolation (e.g., Fish!). 400 ms after the onset of the target word, the picture pair appeared. Pictures remained onscreen for 2,500 ms. Trial time-course is illustrated in Figure 2. Each toddler saw 9 trials from a single stimulus list. Trial order was randomised on presentation. Target side was counterbalanced.

Toddlers sat approximately 90 cm from the screen, with a display area 87 cm wide. Each picture was 34 cm wide. Together they occupied a visual angle of approximately 52º, separated by a gap of 17 cm (12º).
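The visual angles reported above follow from standard trigonometry (visual angle = 2·arctan(size / (2·distance))); a quick check in Python, noting that the chapter's figures are rounded approximations:

import math

def visual_angle_deg(size_cm, distance_cm):
    """Visual angle subtended by a flat stimulus viewed head-on."""
    return math.degrees(2 * math.atan(size_cm / (2 * distance_cm)))

DISTANCE = 90  # cm, toddler to screen
print(round(visual_angle_deg(87, DISTANCE)))  # 52 -> whole display area
print(round(visual_angle_deg(34, DISTANCE)))  # 21 -> one picture
print(round(visual_angle_deg(17, DISTANCE)))  # 11 -> gap (reported above as ~12 degrees)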


[Figure 2 appears here as a timeline graphic. Recoverable content: Exp. 1 – attention phrase "Ooh look!" from about -2000 ms, target word "Fish!" at -400 ms, pictures from 0 ms to 2500 ms; Exp. 2 – priming phrase "Yesterday, I saw a cat!" from about -3200 ms, target word "Dog!" at -400 ms, pictures from 0 ms to 2500 ms; Exp. 3 – priming phrase "Yesterday, I saw a cat!" from about -3000 ms, pictures from 0 ms to 2500 ms, with no target label.]

Figure 2. Trial timing in three experiments. 0 ms indicates the onset of the test phase of the trial. The mean duration of the target words in Experiment 1 was 729 ms (SD = 96 ms), and in Experiment 2, 551 ms (SD = 20 ms), meaning that the majority of target labels continued after picture presentation had begun.
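The caption's observation that most target labels outlasted picture onset is simple arithmetic on the reported values; a sketch (numbers taken from the caption and Figure 2):

# Word onset precedes picture onset by the 400 ms SOA; adding the mean
# word durations shows how far each label ran into picture presentation.
SOA = -400  # target-word onset relative to picture onset (ms)
MEAN_DUR = {"Exp. 1": 729, "Exp. 2": 551}  # mean target-word durations (ms)

for exp, dur in MEAN_DUR.items():
    print(exp, SOA + dur)  # Exp. 1: 329, Exp. 2: 151 -> the average label
                           # ends well after the pictures have appeared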

2.1.4. Scoring and measures

Toddlers' eye movements were monitored by small cameras located above the two picture areas, and combined into a split-screen picture by a video mixer. Recordings were digitally captured during test. Blind manual coding was conducted offline, by an experienced coder, using frame-by-frame observations at a temporal accuracy of 40 ms. Ten percent of participants were re-coded by a second coder to ensure consistency (inter-coder reliability: r(36) = 0.97, p < 0.001). Looks to the left and right picture areas of the screen were coded from the onset of the pictures. Both large- and small-scale timing measures were calculated: a macro-level measure assessed the relative preference for the named target picture over the course of the whole trial, and two micro-level measures assessed the direction and duration of the toddlers' first fixation.

Macro-level. The proportion of target looking (PTL) is the total amount of time spent looking at the target (T) as a proportion of the total amount of time spent looking at both pictures (T + D). It can be represented as T / (T + D). This measure represents relative picture interest over the whole picture presentation period (2,500 ms), excluding time spent switching between pictures, blinking, or looking away from the screen area.

Micro-level. The percentage of target-first fixations is calculated from the number of trials in which the first responsive fixation was to the target
picture, out of all trials where anticipatory fixations had not occurred. The duration of the first fixation is a measure of reaction time which describes the amount of time taken to initiate a saccade away from the first fixated picture. In this words-before-pictures version of the IPL task, it is possible for toddlers to shift their gaze away from the centre of the screen prior to picture presentation. To ensure that fixations represent responses to the test stimuli, trials including anticipatory eye movements or external distractions are not included in analysis of micro-level measures.

The 'linking hypotheses' for using these measures are as follows: If toddlers distribute their fixations randomly between pictures for the duration of the trial, then relative measures assessing looking preference would even out across trials, creating means of similar value for target and distracter pictures. However, if the name of the picture induces a systematic visual preference for named targets over unnamed distracters, then the pattern of behaviour is consistent with toddlers mapping spoken words to the correct pictures. As argued by Aslin (1997), the duration of accumulated fixations is difficult to interpret (does more looking imply continuous interest, effortful processing, or blank staring?). For this reason, the micro-structure of the trial is also valuable, as the very first fixation following the appearance of a novel picture can serve as a measure of the ease and speed of semantic processing. Shorter first fixations to distracters than to targets would indicate how quickly toddlers are able to reject pictures mismatching the referent of the auditory label. Accuracy in the direction of the first fixation indicates how easily they are able to recognise features of novel pictures presented parafoveally.

Data exclusions. Recent IPL research has demonstrated that parents can pick out which words their children will be able to identify in a standard picture-finding task. Despite historic concerns about the accuracy of parental comprehension reports (e.g., Tomasello and Mervis 1994), recent research has shown that British parents are typically quite accurate in their assessment of comprehension. Styles and Plunkett (2009b) demonstrated that 18-month-olds showed the predicted preference for named items which their parent had marked as 'understood' in a vocabulary inventory, but not for words which remained unmarked. Differences between the design of this and previously reported tasks by Houston-Price, Mather and Sakkalou (2007) suggest that British parents are somewhat conservative in their judgements, and only mark words which will be identified in a relatively difficult task (i.e., a single presentation, paired with an easily confusable distracter from the same taxonomic category). As the primary interest here is in relationships between words which are integrated into the lexicon, a conservative exclusion criterion is employed: Only those trials in which
the target word was reported as understood are included in analysis of eye movements. In addition, trials in which toddlers fixated the left or the right picture area of the screen prior to picture presentation were excluded from analysis of first fixation measures. In analysis of reaction time, the visual inspection method described in detail by Canfield et al. (2007) was used to remove the small proportion of trials in which the first saccade was launched prior to the onset of visual stimuli (via expectation), or landed abnormally late. This ensured that all fixation durations included in participant means were from trials in which looking behaviour was typical.

2.2. Results

The mean receptive CDI score for 18-month-olds was 191 words (SD = 75) out of a possible 416, and for 24-month-olds, 350 words (SD = 57). Eighteen-month-olds were reported to understand a mean of seven of the nine words used as targets in their list (SD = 1.8). Twenty-four-month-olds were reported to understand a mean of nine (SD = 0.6). Twenty-four-month-olds thus had larger receptive vocabularies (U (17, 19) = 20, p < 0.001), and knew more of the target labels (U (17, 19) = 65.5, p < 0.001). The lexical exclusion criterion resulted in exclusion of 32 of the original 153 trials available for 18-month-olds (21%) and 8 of the original 171 trials available for 24-month-olds (5%). In a small number of trials, infants looked away from the screen prior to the onset of pictures, or recorded screen fixations shorter than 120 ms. A further 8 trials (3%) remained unanalysed according to this criterion.

The proportion of target looking (PTL) is illustrated in Figure 3. Mean PTL for 18-month-olds was 0.56 (SD = 0.11), and for 24-month-olds, 0.59 (SD = 0.11). Both age groups thus looked at the target significantly more than at the distracter (18m: t(16) = 2.12, p = 0.05, d = 0.51; 24m: t(18) = 3.91, p < 0.01, d = 0.90). This finding is consistent with the prediction that toddlers understood the target label, and correctly mapped it to the target picture. A two-way ANOVA of age group (18m, 24m) and stimulus list (list A, list B) revealed no main effects or interactions, indicating that lexical comprehension did not differ between participant groups or word lists.
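The PTL measure defined in section 2.1.4 reduces to a one-line computation; a minimal sketch with invented looking times (the function name is ours):

def ptl(target_ms, distracter_ms):
    """Proportion of target looking: T / (T + D)."""
    return target_ms / (target_ms + distracter_ms)

# Invented trial: 1,400 ms on the target, 1,000 ms on the distracter.
print(round(ptl(1400, 1000), 2))  # 0.58 -> above the 0.5 chance level,
                                  # in the range of the group means above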


Figure 3. Proportion of target looking (PTL) according to age group in Experiment 1. +/- one standard error. Single sample t-tests compare PTL to the chance value of 0.5. * p < 0.05. ** p < 0.01.

For 18-month-olds, trials in which the first fixation was to the target constituted 47% (SD = 24%) of all trials, and for 24-month-olds, 54% (SD = 16%). The percentage of first fixations to the target did not differ significantly from chance, indicating that toddlers were equally likely to fixate either picture in their first look. The average duration of first fixations is illustrated in Figure 4, according to whether the first fixated picture was the named target or the unnamed distracter. For 18-month-olds, the mean duration of target fixations was 1196 ms (SD = 283 ms), and distracter fixations, 858 ms (SD = 381 ms).


For 24-month-olds, target fixations were 1094 ms (SD = 382 ms) and distracter fixations, 554 ms (SD = 203 ms). The speed of disengaging from a distracter was thus faster than the speed of disengaging from a target for both age groups (18m: t(25) = 2.6, p < 0.05, d = 1.06; 24m: adjusted t(25.8) = 25.8, p < 0.01, d = 1.81). This finding is consistent with the prediction that reaction time indexes comprehension of the target word, as infants in both age groups looked away from pictures which mismatched the unfolding auditory label faster than they looked away from pictures which matched.

Figure 4. Duration of first fixation according to age group and look direction in Experiment 1. +/- one standard error. * p < 0.05. ** p < 0.01. Comparisons marked above a pair of bars indicate an independent samples t-test comparing responses to target and distracter pictures. The comparison at the top of the plot indicates a difference between age groups, for distracter fixations only.


When comparing reaction time across age groups, 24-month-olds were significantly faster to disengage from distracters than 18-month-olds (adjusted t(15.2) = 2.53, p < 0.05, d = 1.30). No significant age difference was evident in the speed of disengaging from targets. This finding shows an age-related improvement in the speed of rejecting a mismatching distracter picture, while named targets remained equally engaging.

2.3. Discussion

This study demonstrates that toddlers at 18 and 24 months-of-age showed behaviour consistent with recognition of known words in a fast-paced free-looking task, where the target label began shortly before the pictures appeared on screen. As illustrated in Figure 1A, hearing the target label activated an internal representation sufficiently detailed to facilitate recognition of a target picture which appeared on the screen after the target word had begun. Named targets accumulated more looking time than unnamed distracters, and the first fixated picture was disengaged from faster when it mismatched the target label than when it matched.

Novel features of the task have now been validated: Lexical access of the target word was not impaired by the onset of visual stimuli; the trial was sufficiently long for toddlers in both age groups to accumulate a preference for the named target; and the SOA of 400 ms between the target word and the onset of the pictures allowed target recognition. In addition, the measures used were sufficiently sensitive for a 'one-shot' task, in which toddlers were exposed to each stimulus only once.

The finding of speeded distracter rejection parallels earlier toddler eye-gaze studies investigating reaction time (e.g., Fernald et al. 1998; Swingley and Fernald 2002) or 'latency' (e.g., Mani and Plunkett 2007) to switch from mismatching pictures following the onset of a word. Indeed, the increase in the speed of distracter rejection between 18- and 24-month-olds' distracter fixations mirrors developmental improvements in lexical processing speed reported in picture-first looking tasks (Fernald et al. 1998). The finding of target identification for both age groups acts as a behavioural baseline against which primed adaptations of the task can be compared.


3. Experiment 2

The previous experiment demonstrated that 18- and 24-month-olds are able to identify named pictures in a free-looking task where the target word preceded picture presentation. Two features of the task piloted in Experiment 1 are critical for an IPL implementation of sequential auditory priming. Firstly, the words-before-pictures stimulus organisation means that both the prime and the target words can be presented prior to the onset of the pictures, before any interference from the visual domain. Secondly, presenting each stimulus only once avoids the possibility of teaching novel associations between items, or inducing memory effects. This is particularly important in a population where participants' vocabulary size and attention span preclude 'padding' repetitions with large numbers of intervening trials.

In adapting this free-looking task to a priming context, Experiment 2 retains the trial timing of Experiment 1, but replaces the auditory attention phrase with a priming phrase. In half of the trials the prime and the target are taxonomically and associatively related; in half they are not. Given that the priming task will include two conditions per participant, the number of trials is increased from nine to twelve. If the toddler lexicon is interconnected in an adult-like way, hearing a prime would be expected to affect the ease and speed of lexical access for related words either through overlapping representations, or via the flow of activation between discrete but interconnected representations (cf. Figure 1B). Toddlers' lexical access, as indexed by eye movements, would be expected to be enhanced in the related prime condition, relative to the unrelated prime condition.

3.1. Method

Thirty-six 18-month-olds and thirty-six 24-month-olds participated in the study. Following preliminary checks of receptive CDI scores, two eighteen-month-olds were removed from analysis for extremely low vocabularies (in the 5th percentile of previously collected CDIs for 18-month-olds), and one, when it was observed that they contributed valid trials to only one priming condition. In the final sample, 33 18-month-olds (15 males; mean age: 18.1 months; range: 17.5 to 18.8) and 35 24-month-olds (14 males; mean age: 24.0 months; range: 23.4 to 25.0) were available for analysis.


3.1.1. Materials

Auditory and visual stimuli were selected and prepared as in the previous experiment, using a DAT recorder sampling at 44.1 kHz. Two stimulus lists were created in which twelve words acted as auditory 'primes' and twelve words acted as auditory 'targets'. Each of the twelve target pictures appeared on screen alongside one of the twelve unnamed 'distracters' during the test phase of the trial. Picture pairs were yoked across lists. The two stimulus lists are given in the Appendix.

Across lists, each target occurred with two different primes, a 'related' prime in one list and an 'unrelated' prime in the other. Related word-pairs had an attested forward association in adult British English (Moss and Older 1996) and were basic-level taxonomic sisters (e.g., prime: cat; target: dog). The decision to include both association and taxonomy was made on the basis of the 'associative boost' evident when the relationships of taxonomy and association are combined in adult priming (Moss, Ostrin and Tyler 1995). Unrelated word-pairs shared no semantic or associative relationship, and no phonological onset or rhyme (e.g., prime: plate; target: dog). Similarly, distracters shared no phonological, semantic or associative relationship with prime or target (e.g., distracter: boat). Within a list, half of the primes were related and half unrelated, and no stimulus was repeated.

In the primed procedure, words acting as 'targets' were labelled for all participants, and never appeared as unnamed distracters. Given that inherent differences in the visual interest of pictures used as targets and as distracters could confound straightforward assessment of target preference, analysis is limited to the effect of the auditory priming condition. For each participant, priming was calculated as the difference between the related and the unrelated priming condition.

3.1.2. Procedure

The procedure was the same as in the previous experiment, with the exception that infants were tested in an adjacent booth, facing a rear-projection screen. Pictures were 32 cm wide. Together they occupied a visual angle of approximately 48º, separated by a gap of 15 cm (10º).

While the screen was blank, the priming phrase began (e.g., Yesterday, I saw a cat!), followed by an inter-stimulus interval (ISI) of 200 ms, then the target word in isolation (e.g., Dog!). A short prime-to-target ISI was employed to capture the early stages of automatic activation. Moss et al.'s (1995) ISI of 200 ms was selected, on the grounds that toddlers' phonological
Moss et al.'s (1995) ISI of 200 ms was selected, on the grounds that toddlers' phonological processing speed is similar to adults' (Swingley, Pinto, and Fernald 1999). To reduce potential sources of interference, no attention device appeared on the screen during the priming phase of the trial. The 400 ms SOA and 2500 ms picture duration were identical to Experiment 1, as illustrated in Figure 2. Each toddler saw 12 trials from a single stimulus list. Half of the infants saw list A, and half, list B. Trial order was randomised on presentation, and target side was counterbalanced within and between lists.

3.2. Results

The mean receptive CDI score for 18-month-olds was 181 words (SD = 64), and for 24-month-olds, 317 words (SD = 62). Eighteen-month-olds were reported to understand a mean of nine of the twelve prime words (SD = 2.7) and nine of the twelve target words (SD = 2.3); 24-month-olds, eleven primes (SD = 1.5) and eleven targets (SD = 1.4). Twenty-four-month-olds' receptive vocabularies were thus significantly larger than 18-month-olds' (U(33, 35) = 79.5, p < 0.001), and they knew significantly more test items than 18-month-olds (U(33, 35) = 172.5, p < 0.001). Only trials in which both the prime and the target were reported as understood were included in the analysis of priming. According to this criterion, 242 of the original 396 trials (61%) were available for analysis for 18-month-olds, and 378 of the original 420 trials (90%) for 24-month-olds. A further 27 trials (4%) were unanalysed as they contained no fixations longer than 120 ms. The mean PTL in each priming condition was calculated for each toddler. Priming, calculated as the difference between PTL for related and unrelated trials, is illustrated in Figure 5. Mean priming for 18-month-olds was 0.04 (SD = 0.18), and for 24-month-olds, 0.09 (SD = 0.13). For 24-month-olds, priming was significantly above chance (t(32) = 4.0, p < 0.001, d = 0.70). In a two-way ANOVA of age group (18m, 24m) and list (list A, list B), there was a main effect of age (F(1,55) = 4.83, p < 0.05, partial η² = 0.08), indicating that the priming effect was larger for the older age group, although the difference did not reach significance in an independent-samples t-test. A main effect of list was also observed (F(1,55) = 22.63, p < 0.001, partial η² = 0.29), indicating that priming was larger for one list of words than for the other. No significant interaction was evident.
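For concreteness, the sketch below shows how the macro-level measure and the priming score can be derived from trial records. The record format and function names are our own illustration, not the analysis code used in the study; the logic follows the exclusion criteria described above (both words understood; at least one fixation of 120 ms).

# Illustrative sketch: proportion of target looking (PTL) per condition
# and the priming score (related minus unrelated) for one participant.
# Assumes each condition retains at least one valid trial, as in the
# final sample after participant exclusions.

def ptl(trials):
    # mean proportion of looking time spent on the target picture
    props = [t["target_ms"] / (t["target_ms"] + t["distracter_ms"])
             for t in trials]
    return sum(props) / len(props)

def priming_score(trials):
    valid = [t for t in trials
             if t["prime_known"] and t["target_known"]
             and t["longest_fixation_ms"] >= 120]
    related = [t for t in valid if t["condition"] == "related"]
    unrelated = [t for t in valid if t["condition"] == "unrelated"]
    return ptl(related) - ptl(unrelated)

example = [
    {"condition": "related", "target_ms": 1400, "distracter_ms": 700,
     "prime_known": True, "target_known": True, "longest_fixation_ms": 850},
    {"condition": "unrelated", "target_ms": 1000, "distracter_ms": 1100,
     "prime_known": True, "target_known": True, "longest_fixation_ms": 620},
]
print(priming_score(example))   # positive values indicate facilitation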

[Figure 5: bar chart of PTL priming (y-axis from -0.20 to 0.20) for 18m and 24m groups in Exp. 2 and Exp. 3; the ** and n.s. markings are explained in the caption below.]
Figure 5. Priming effect for PTL (related prime minus unrelated prime) in two experiments. A single-sample t-test compares priming in each age group to the chance value of 0. An independent-samples t-test compares priming in the two age groups. Error bars: +/- one standard error. ** p < 0.01.

In 468 trials (78%), toddlers had not fixated either of the picture areas of the screen by the time the pictures were presented. 58% of first fixations were to the target for 18-month-olds, and 59% for 24-month-olds. A three-way repeated-measures ANOVA compared the influence of priming condition (related, unrelated), age group (18m, 24m) and stimulus list (list A, list B) on the accuracy of the first fixation. There was a main effect of prime condition (F(1, 58) = 5.735, p < 0.05, partial η² = 0.09), with no further effects or interactions. Paired-sample t-tests clarified that the proportion of first fixations to targets was greater in the related prime condition (related: M = 65%, SD = 27%; unrelated: M = 53%, SD = 30%; t(61) = 2.23, p < 0.05, d = 0.35). This finding indicates that the priming condition influenced processing before the first fixation had even landed. The mean duration of first fixations to targets and distracters was calculated separately for trials from each of the priming conditions.

When toddlers' first fixation was to the target, fixation priming was calculated as the difference between the duration of target-first fixations in the two priming conditions. The magnitude of target fixation priming was 38 ms (SD = 615 ms) for 18-month-olds and 40 ms (SD = 389 ms) for 24-month-olds. A two-way ANOVA comparing the influence of age group (18m, 24m) and list (list A, list B) on target fixation priming revealed no significant main effects or interactions. Target fixation priming (pooled across groups) was not significantly greater than 0 ms. Distracter fixation priming, illustrated in Figure 6, was calculated as the difference between the duration of distracter-first fixations in the two priming conditions. The magnitude of distracter fixation priming was 218 ms (SD = 553 ms) for 18-month-olds, and 123 ms (SD = 348 ms) for 24-month-olds. A two-way ANOVA comparing the influence of age group (18m, 24m) and list (list A, list B) on distracter fixation priming also revealed no significant main effects or interactions. Distracter fixation priming (pooled across groups) tended toward significance (t(25) = 0.18, p = 0.06, d = 0.04). These findings show that while the duration of the target fixation was unaffected by the priming condition, related primes tended to facilitate the rejection of mismatching pictures.

3.3. Discussion

Having demonstrated in Experiment 1 that toddlers at 18 and 24 months of age can identify target pictures in the fast-paced words-before-pictures task, and that two measures index comprehension of the named targets, the primary goal of this study was to investigate whether relationships between words influence the pattern of target looking. Novel features of the priming task included the use of auditory primes prior to target labels (ISI = 200 ms) in a words-before-pictures stimulus arrangement. A combination of taxonomy and association was selected to boost any priming effects observed. As predicted, looking behaviour was affected at both the macro-level and the micro-level of analysis. For older toddlers, there was greater interest in the target in the related condition (PTL). The age groups tended to differ, with older infants showing a greater priming effect. The prime tended to influence reaction time (albeit not significantly). This reaction-time trend did not differ between age groups, and suggested that toddlers in both age groups might be faster to disengage from unnamed distracters in the related prime condition.

[Figure 6: bar chart of distracter fixation priming (y-axis from -750 ms to 750 ms) for 18m and 24m groups in Expt. 2 and Expt. 3; the * and n.s. markings are explained in the caption below. Error bars: +/- 1 SE.]
Figure 6. Priming effect for duration of first fixation, in two experiments. A single-sample t-test compares priming (pooled across age groups) to the chance value of 0 ms. Error bars: +/- one standard error. * p < 0.05. The significance marker indicates the pooled value.

In addition, toddlers in both age groups showed an increase in the percentage of first fixations to targets in the related prime condition. Comparing this finding with the results from the previous experiment, toddlers in both age groups showed enhanced target recognition in the percentage of target-first trials when a related word preceded the target label, compared to when the target label occurred alone. This finding is especially interesting given that toddlers had not previously seen the test pictures, and did not know which picture would occur in which location. In both age groups, the higher percentage of target-first trials and the tendency towards faster distracter rejection suggest faster lexical access of the target word due to pre-activation.

In addition, for older toddlers, the effect of priming condition on PTL is consistent with more, and longer-lasting, activation of the target representation. However, the locus of the priming effect remains unclear. In order to assess whether the prime word alone was responsible for the priming effect, or whether the prime word influences picture looking via its influence on the target word, a third experiment was conducted.

4. Experiment 3

The previous experiment demonstrated a clear effect of the lexical prime on referent identification. The findings were consistent with word-to-word 'lexical' priming, with concept-to-picture 'semantic' priming, or with some combination of the two (cf. Figure 1.2). Removal of the target label (cf. Figure 1.3) allows these two possibilities to be teased apart to some degree: if the prime alone triggers target preference in the absence of the target label, it would suggest that the internal representation of 'cat' is sufficiently similar to the internal representation of the word 'dog' to trigger some form of 'best match' between the spoken word and the picture; if toddlers' looking is unsystematic, with both unlabelled pictures rejected as equally bad matches to the prime word, it would suggest that the prime word influences looking at the target picture only via enhanced activation of the target word.

4.1. Method

Thirty-six 18-month-olds and thirty-six 24-month-olds participated in the study. Following preliminary checks of receptive CDI scores, three additional 18-month-olds and one 24-month-old were removed from analysis when it was observed that they contributed analysable trials to only one priming condition. In the final sample, 32 18-month-olds (14 males; mean age: 18.3 months; range: 17.7 to 18.8) and 35 24-month-olds (18 males; mean age: 24.3 months; range: 23.5 to 24.9) were available for analysis. The stimuli were identical to Experiment 2, with the exception that all target labels were omitted. The inter-stimulus interval between the offset of the prime and the onset of the pictures was also shortened slightly, to avoid a prolonged silence. Timing is illustrated in Figure 2. Apart from these changes, the procedure was the same as previously described.

4.2. Results

The mean receptive CDI score for 18-month-olds was 260 words (SD = 76), and for 24-month-olds, 356 words (SD = 52). Eighteen-month-olds were reported to understand a mean of ten of the twelve prime words (SD = 1.6) and eleven of the twelve target words (SD = 1.0); 24-month-olds, twelve primes (SD = 1.3) and twelve targets (SD = 0.7). Twenty-four-month-olds' receptive vocabularies were thus significantly larger than 18-month-olds' (U(32, 33) = 689.5, p < 0.001), and they knew significantly more test items than 18-month-olds (U(32, 33) = 753.0, p < 0.001). For two 24-month-olds for whom CDI data were unavailable, the median score for each item in their age group was used for lexical exclusion. According to the lexical exclusion criterion employed in the previous study, 306 of the original 374 trials (80%) were available for analysis for 18-month-olds, and 396 of the original 419 trials (95%) for 24-month-olds. A further 27 trials (4%) remained unanalysed as they contained no fixations longer than 120 ms. Mean PTL in each priming condition was calculated for each toddler. Priming is illustrated in Figure 5. Mean priming for 18-month-olds was 0.02 (SD = 0.18), and for 24-month-olds, 0.04 (SD = 0.17). A two-way ANOVA comparing age group (18m, 24m) and list (list A, list B) revealed a significant main effect of list (F(1, 63) = 47.7, p < 0.001, partial η² = 0.43), indicating that priming was larger for participants who saw one set of stimuli than for participants who saw the other. There was no significant main effect of age, nor an interaction, and priming did not significantly differ from zero for either age group. In 517 trials (77%), toddlers had not fixated either of the picture areas of the screen by the time the pictures were presented. 53% of first fixations were to the target for 18-month-olds, and 56% for 24-month-olds. A three-way repeated-measures ANOVA comparing the influence of priming condition (related, unrelated), age group (18m, 24m), and stimulus list (list A, list B) revealed no main effects or interactions. The percentage of first fixations to the target did not differ from chance, a finding differing notably from the previous experiment. The mean duration of first fixations to targets and first fixations to distracters was calculated separately for trials from each of the priming conditions. The magnitude of target fixation priming was 4 ms (SD = 583 ms) for 18-month-olds and 60 ms (SD = 751 ms) for 24-month-olds. A two-way ANOVA comparing the influence of age group (18m, 24m) and list (list A, list B) on target fixation priming revealed no significant main effects or interactions, and target fixation priming (pooled) did not differ from zero.

The magnitude of distracter fixation priming was 179 ms (SD = 784 ms) for 18-month-olds and 6 ms (SD = 632 ms) for 24-month-olds. A two-way ANOVA comparing the influence of age group (18m, 24m) and list (list A, list B) on distracter fixation priming revealed no significant main effects or interactions, and distracter fixation priming (pooled) did not differ from zero. Thus, in the absence of an explicit target label, distracter pictures were rejected equally fast regardless of whether the picture was related to the auditory prime.

4.3. Discussion

The purpose of Experiment 3 was to establish whether the prime generated interest in the target picture via similar conceptual representations in the infant semantic system. Overlapping representations could result from shared abstract semantic features like 'is an animal' or 'lives in a house', or from similarities between an internally generated image of the prime and the visual form of the target. By using the same stimuli as the previous experiment, but omitting the target label, it was possible to assess whether the effect of the prime was mediated by activation of the target word. The relationship between the auditory prime word and the unnamed target picture was not found to influence looking behaviour in macro-level measures of target preference, nor in micro-level measures of first fixation direction and duration. Given the well-documented cases of overextension in children (e.g., Bowerman 1987; E. Clark 1973) and theoretical predictions about semantic feature overlap (cf. Hills et al. 2009), it is somewhat surprising that performance in the priming task differed so dramatically with the omission of target labels. However, these results indicate that toddlers in both age groups were not willing to accept a picture of a dog as a referent of the word 'cat' when they understood both words: they judged related targets and unrelated distracters to be equally uninteresting in the absence of the target label. This finding also sheds light on the nature of the priming effect observed in Experiment 2, an issue to which we will return in the General Discussion. As Experiment 3 demonstrated no difference between priming conditions, interpretation of this null result should be treated with some degree of caution. For example, the prime-to-pictures ISI of 400 ms may have been too short for a fully featured activation of the prime word to exhibit overlap with the target picture.

Conversely, this ISI may have been too long to observe the influence of a short-lived, feature-based overlap. Future studies will therefore be needed to clarify the time course of activation of different kinds of semantic relationship.

5. General Discussion

This series of studies was designed to address two main questions: are adult-like semantic relationships between words/concepts evident in toddlers' behavioural responses to spoken language? And if so, where does the locus of the priming effect reside? Priming effects were observed for toddlers in both age groups, and the observed effects were consistent with a model of semantic organisation in which related words have interlinked representations between which activation can 'flow' during online language processing. The primed IPL task was designed to mimic adult sequential priming tasks in a toddler-friendly context, by replacing lexical decision with a period of free looking. The first experiment piloted an adaptation of a traditional free-looking task for infants, in which the auditory target label began shortly before the onset of the picture pair, allowing the unfolding target word to begin prior to any disruption from the visual domain. Toddlers in both age groups demonstrated the ability to identify the referents of named pictures in both micro- and macro-level measures of eye gaze. Features of this design which are novel in toddler language research include the task's fast pace, lack of repetition, and use of reaction-time measures in a words-before-pictures scenario. Given the high speed of stimulus presentation, it is noteworthy that toddlers at both ages demonstrated reliable target identification in both macro- and micro-level measurements. This indicates that the current task and the measures employed are effective indices of lexical comprehension, even though the label was presented prior to the onset of the previously unseen pictures. In the second experiment, the auditory attention phrase was replaced with a priming phrase, concluding in a related word half of the time. Both age groups showed an improvement of almost 10% in the accuracy of the first fixation, and were more than 100 ms faster to reject unnamed distracters, in the related prime condition compared to the unrelated prime condition. For the older age group, a difference of almost 10% was also observed in overall target preference in the related prime condition. These findings demonstrate that the current design was sufficiently sensitive to capture differences in the ease and speed of lexical comprehension brought about by relationships between items integrated into the toddler lexicon.

The third experiment, in which the target label was omitted, clarified that the priming effect observed in Experiment 2 was produced by sequential activation of related prime and target words. When the prime alone was heard, it did not produce sufficient activation of the target concept to facilitate target preference or above-chance accuracy in the direction of the first glance – even though the timing of the prime word was consistent across both primed experiments. It is interesting to note how these findings relate to data collected from implicit processing tasks, such as toddler ERP studies. Torkildsen and colleagues (2007) report toddlers' sensitivity to semantic relationships between sequential auditory word-pairs, when assessing the 'N400 component'. Our results suggest that this component does indeed have a cognitive outcome, as sequential activation of related words influences looking behaviour in this age range.

Fast Pace. The fast-paced design included a relatively short period of free-looking time (2,500 ms), which was expected to eliminate strategic responding by giving toddlers little time to inspect both pictures. The finding of reliable target discrimination in both age groups (Experiment 1) demonstrated that only a short amount of time is required for toddlers to show lexical comprehension, using measures of fixation accuracy, fixation duration, and accumulated picture preference.

New Measures. The words-before-pictures stimulus arrangement in this experiment necessitated the introduction of novel reaction-time measures. In standard toddler free-looking tasks, an auditory 'decision point' during ongoing picture display becomes a reference point for measurements of 'latency' (e.g., Mani and Plunkett 2007) or RT to switch from a distracter (e.g., Fernald et al. 1998). In the primed IPL task, toddlers heard the start of the target word before they knew on which side the target would be located. The direction and duration of the first fixation were expected to index toddlers' responses to whether or not the first fixated picture matched the unfolding acoustic label. The measure of first fixation duration was sensitive to target discrimination, and yielded priming effects in both age groups. The measure of first fixation direction was not sensitive to general target discrimination, but was boosted by a related prime when the target was labelled.

Developmental Trajectory and Automaticity. In the micro-level measurements concerning the first fixation in the trial, toddlers in both age groups demonstrated priming effects. These priming effects occurred quickly, and influenced behaviour at the onset of the trial.

In the macro-level measure concerning accumulated fixations over the course of the whole trial, on the other hand, only older toddlers showed priming effects. This finding suggests that the accumulated fixations may be the outcome of a more advanced 'strategic' looking behaviour, which arises out of developments in short-term memory, or in interpreting the nature of the 'game'. The measures of first fixation direction and duration, by contrast, appear to be more automatic, as they are acquired earlier, and have their effect earlier, than the measure of general target preference. These contrasting measures could prove to be useful in future investigations of automatic versus strategic priming effects.

Primed Facilitation or Inhibition. All sequential priming effects can be characterised either as facilitation, with a related prime enhancing lexical access, or as interference in the lexicon, with an unrelated prime inhibiting lexical access. The priming effect observed in the macro-level measure could arise out of either process. For example, hearing 'cat' might make recognition of the word 'dog' easier than usual. Alternatively, hearing 'plate' before 'dog' might confuse toddlers, making recognition of 'dog' more difficult than usual. While the measure of primed target preference is consistent with either account, the finding of above-chance first-fixation accuracy only in the case where the prime is related and the target named strongly suggests facilitation of lexical comprehension. However, given that the comparison of label-only and prime-plus-label occurs between subject groups (Experiment 1 and Experiment 2), it is possible that individual differences may have contributed. Further studies are needed to clarify whether this facilitation account is also valid within the same group of participants.

Stimulus Controls. Two stimulus controls were considered critical for primed IPL. First, in order to avoid inadvertent memory effects which might interfere with priming, no stimuli were repeated within a testing session. Secondly, and perhaps most importantly, only those trials in which toddlers were reported to understand both the prime and the target were included in the analysis. This is an important control for studies attempting to establish the organisation of the developing lexicon. From an experimental perspective, the inclusion of unknown words generates a pattern of noise which is likely to mask genuine priming effects: if a prime or a target is not understood in a 'related' trial, that trial effectively becomes 'unrelated', as only one word then shares a relationship with the named target. Yet if a prime in an 'unrelated' trial is not understood, the trial remains 'unrelated' (with at most one word sharing a relationship with the target). This skew reduces the difference between the two trial types, potentially masking legitimate priming effects.
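The size of this skew is easy to put in rough numbers. In the sketch below the parameter values are invented for illustration; the point is only that, without the lexical exclusion criterion, a 'related' trial behaves as related only when both of its words are known, so the measured effect shrinks accordingly.

# Back-of-the-envelope dilution of a priming effect (illustrative values).
true_priming = 0.10                # hypothetical effect when both words are known
p_word_known = 0.75                # e.g. nine of twelve items, as at 18 months
p_both_known = p_word_known ** 2   # assuming the two words are known independently

measured = true_priming * p_both_known
print(measured)                    # 0.05625: barely half the true effect survives

At 24 months, with roughly eleven of twelve items known, the same arithmetic leaves over 80% of the effect intact, which is one reason the exclusion criterion matters most for the younger group.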

Eliminating this confound is particularly relevant for toddlers with small vocabularies, for whom this noise would be greatest. Pre-testing of items (e.g., in a standard IPL task) could have provided data about each toddler's comprehension, but would have violated the stipulation of non-repetition. Instead, lexical comprehension of test items was assessed via parental report. Despite valid concerns about the veracity of parental report of comprehension (Tomasello and Mervis 1994), recent research has demonstrated that parental report can reliably predict items which will attract lexical comprehension in IPL tasks at 18 months of age (Styles and Plunkett 2009b).

Association and Taxonomy. In stimulus preparation, we maximised the likelihood of an associative 'boost' (Moss et al. 1995) by selecting related prime-target pairs which shared both semantic and associative relationships. It was interesting to note that many taxonomic sisters in the pool of potential test items also exhibited normative word association. Given the small size of the toddler lexicon in the second year, the common combination of association with taxonomy may be a salient property of the developing lexicon, and is a pattern which fits well with the computational modelling of Steyvers and Tenenbaum (2005). However, further investigations are needed to establish which of these two kinds of relationship, taxonomy or association, is the more important source of sequential activation and of organisation in toddlers' knowledge structures.

List Effects. In Experiments 2 and 3, main effects of 'list' were detected in the primed measures of PTL, indicating that toddlers who saw one stimulus list showed a greater priming effect than toddlers who saw the other. Interpretation of list effects is difficult, because it is not possible to separate variance generated by different groups of participants from variance generated by physically different test items (H. H. Clark 1973). However, as the list effect was systematic across two different populations of toddlers (Experiment 2 and Experiment 3), it is likely to arise out of differences between the test stimuli. It might be expected that toddlers who saw the items in the list with stronger associations would show greater priming. However, toddlers who saw stimulus list B (with weaker associations) showed stronger priming than toddlers who saw list A. Further investigation revealed that differences in the visual interest of distracter pictures were the most likely cause: related targets in list A had to compete with distracter pictures depicting animals (lion, chicken), vehicles (boat, pushchair) and food (toast, cheese), whereas related trials in list B had animals (monkey, bear), clothing (sock, trousers), and inanimate household objects (bowl, table) for competition.

Toddlers who saw list B were better able to demonstrate enhanced target recognition because the distracters were less distracting. Raajmakers and colleagues (Raajmakers 2003; Raajmakers, Schrijnemakers, and Gremmen 1999) argue that where list effects do not interact with priming, the priming can be considered statistically robust. In the current series of experiments, no list effects were observed in the direction or duration of the first fixation, making these measures less sensitive to stimulus variation than the macro-level measure. This pattern further supports the interpretation that macro-level measures involve more strategic responses than micro-level measures of eye gaze. The micro-level measures were those in which priming effects were observed for both age groups, and which indexed high-speed responses to the sequential auditory stimuli.

6. Conclusions

Toddlers in their second year discriminated named target pictures from unnamed distracter pictures in a fast-paced looking task following sequential auditory word processing. Target identification was influenced by the combination of words heard prior to picture presentation: target detection was enhanced when the label was preceded by a related word, compared to an unrelated word. This contrast demonstrates adult-like organisation in toddlers' semantic system affecting the ease and speed of their lexico-semantic processing. This primed pattern of responding is consistent with a model of a developing lexicon in which sequential activation influences online language processing. The failure of the isolated prime word to influence target detection in the absence of the target label indicates that the priming effect is driven by relationships between words in the early lexicon. It remains to be seen whether one type of relationship (associative or taxonomic) provides organisational structure, or whether an adult-like mix of structures is encoded. Nonetheless, these findings demonstrate that priming effects can be observed in the online language processing of toddlers in the second half of the second year, and they establish stimulus arrangements and timing intervals for further investigation of the developing semantic system.

Appendix

Stimulus Lists, Experiment 1

List A
Attention phrase(a)   Target     Distracter
Hey Wow!              Book       Juice
Ooh look!             Baby       Phone
Look at this!         Fish       Aeroplane
Hey Wow!              Key        Milk
Ooh look!              Bath       Flower
Look at this!         Toast      Bin
Hey Wow!              Teddy      Frog
Ooh look!             TV         Butterfly
Look at this!         Monkey     Brush

List B
Attention phrase(a)   Target     Distracter
Ooh look!             Juice      Book
Look at this!         Phone      Baby
Hey Wow!              Aeroplane  Fish
Ooh look!             Milk       Key
Look at this!         Flower     Bath
Hey Wow!              Bin        Toast
Ooh look!             Frog       Teddy
Look at this!         Butterfly  TV
Hey Wow!              Brush      Monkey

(a) Attention phrases counterbalanced across presentations using a Latin-square order.

Stimulus Lists, Experiments 2-3

List A
Prime & Carrier               Target   Distracter   Prime Type   WA Strength(b)
Yesterday, I saw a cat        Dog      Boat         Related      66.7
Yesterday, I saw a sheep      Cow      Toast        Related      13.3
Yesterday, I ate an apple     Banana   Lion         Related       2.4
Yesterday, I bought a boot    Shoe     Cheese       Related      30.0
Yesterday, I bought a plate   Cup      Pushchair    Related      25.2
Yesterday, I bought a cot     Bed      Chicken      Related       9.5
Yesterday, I saw a train      Horse    Sock         Unrelated     0
Yesterday, I saw a lorry      Mouse    Table        Unrelated     0
Yesterday, I saw an elephant  Cake     Trousers     Unrelated     0
Yesterday, I bought a hat     Bus      Monkey       Unrelated     0
Yesterday, I saw a pig        Car      Bowl         Unrelated     0
Yesterday, I ate a biscuit    Coat     Bear         Unrelated     0

List B
Prime & Carrier               Target   Distracter   Prime Type   WA Strength(b)
Yesterday, I bought a plate   Dog      Boat         Unrelated     0
Yesterday, I bought a boot    Cow      Toast        Unrelated     0
Yesterday, I bought a cot     Banana   Lion         Unrelated     0
Yesterday, I saw a cat        Shoe     Cheese       Unrelated     0
Yesterday, I bought a sheep   Cup      Pushchair    Unrelated     0
Yesterday, I ate an apple     Bed      Chicken      Unrelated     0
Yesterday, I saw a pig        Horse    Sock         Related       2.1
Yesterday, I saw an elephant  Mouse    Table        Related       8.9
Yesterday, I ate a biscuit    Cake     Trousers     Related       4.8
Yesterday, I saw a train      Bus      Monkey       Related       6.2
Yesterday, I saw a lorry      Car      Bowl         Related       4.8
Yesterday, I bought a hat     Coat     Bear         Related      12.5

(b) Forward adult word-association strength, from Moss and Older (1996).

Notes

1. Experiment 2 has been previously presented in Styles and Plunkett (2009b). Here we present a re-analysis of the original experiment in a broader context, in order to examine the important question of the nature of the priming effects observed. This issue was not fully addressed in the original study.

References

Anderson, J. R. 1983. A spreading activation theory of memory. Journal of Verbal Learning and Verbal Behavior, 22: 261-295.
Antos, S. J. 1979. Processing facilitation in a lexical decision task. Journal of Experimental Psychology: Human Perception and Performance, 5 (3): 527-545.
Aslin, R. 2007. What's in a look? Developmental Science, 10 (1): 48-53.
Bowerman, M. 1987. The acquisition of word meaning. In Development of Communication: Social and Pragmatic Factors in Language Acquisition, N. Waterson and C. Snow (eds.). New York: Wiley.
Canfield, R. L., E. G. Smith, M. P. Brezsnyak, and K. L. Snow 1997. Information processing through the first year of life: A longitudinal study using the visual expectation paradigm. Monographs of the Society for Research in Child Development, 62 (2): 1-170.
Carr, T. H., C. McCauley, R. D. Sperber, and C. M. Parmelee 1982. Words, pictures, and priming: On semantic activation, conscious identification, and the automaticity of information processing. Journal of Experimental Psychology: Human Perception and Performance, 8 (6): 757-758.
Clark, E. 1973. What's in a word? On the child's acquisition of semantics in his first language. In Cognitive Development and the Acquisition of Language, T. E. Moore (ed.). New York: Academic Press.
Clark, H. H. 1973. The language-as-fixed-effect fallacy: A critique of language statistics in psychological research. Journal of Verbal Learning and Verbal Behavior, 12: 335-359.
Collins, A. M., and E. F. Loftus 1975. A spreading-activation theory of semantic processing. Psychological Review, 82 (6): 407-428.
Cree, G. S., and K. McRae 2003. Analyzing the factors underlying the structure and computation of the meaning of chipmunk, cherry, chisel, cheese and cello (and many other such concrete nouns). Journal of Experimental Psychology: General, 132 (2): 163-201.
Federmeier, K. D., and M. Kutas 1999. A rose by any other name: Long-term memory structure and sentence processing. Journal of Memory and Language, 41: 469-495.
Fenson, L., P. S. Dale, J. S. Reznick, E. Bates, D. J. Thal, and S. J. Pethick 1994. Variability in early communicative development. Monographs of the Society for Research in Child Development, 59 (5): 1-173.
Fernald, A., J. P. Pinto, A. Weinberg, and G. McRoberts 1998. Rapid gains in speed of verbal processing by infants in the 2nd year. Psychological Science, 9 (3): 228-231.
Fernald, A., D. Swingley, and J. P. Pinto 2001. When half a word is enough: Infants can recognize spoken words using partial phonetic information. Child Development, 72 (4): 1003-1015.
Friedrich, M., and A. D. Friederici 2004. N400-like semantic incongruity effect in 19-month-olds: Processing known words in picture contexts. Journal of Cognitive Neuroscience, 16 (8): 1465-1477.
Friedrich, M., and A. D. Friederici 2005. Lexical priming and semantic integration reflected in the event-related potential of 14-month-olds. Cognitive Neuroscience and Neuropsychology, 16 (6): 653-656.
Golinkoff, R. M., K. Hirsh-Pasek, K. M. Cauley, and L. Gordon 1987. The eyes have it: Lexical and syntactic comprehension in a new paradigm. Journal of Child Language, 14: 23-45.
Hamilton, A., K. Plunkett, and G. Schafer 2000. Infant vocabulary development assessed with a British communicative development inventory. Journal of Child Language, 27: 689-705.
Hashimoto, N., K. K. McGregor, and A. Graham 2007. Conceptual organization at 6 and 8 years of age: Evidence from the semantic priming of object decisions. Journal of Speech, Language, and Hearing Research, 50: 161-176.
Heinze, H. J., T. F. Muente, and M. Kutas 1998. Context effects in a category verification task as assessed by event-related brain potential (ERP) measures. Biological Psychology, 47 (2): 121-135.
Hills, T. T., M. Maouene, J. Maouene, A. Sheya, and L. Smith 2009. Categorical structure among shared features in networks of early-learned nouns. Cognition, 112: 381-396.
Houston-Price, C., E. Mather, and E. Sakkalou 2007. Discrepancy between parental report of infants' receptive vocabulary and infants' behaviour in a preferential looking task. Journal of Child Language, 34 (4): 701-724.
Huettig, F., and G. T. M. Altmann 2005. Word meaning and the control of eye fixation: Semantic competitor effects and the visual world paradigm. Cognition, 96: B23-B32.
Huettig, F., and J. M. McQueen 2007. The tug of war between phonological, semantic and shape information in language-mediated visual search. Journal of Memory and Language, 57: 460-482.
Kamide, Y., G. T. M. Altmann, and S. L. Haywood 2001. The time-course of prediction in incremental sentence processing: Evidence from anticipatory eye movements. Journal of Memory and Language, 40: 133-156.
Krackow, E., and P. Gordon 1998. Are lions and tigers substitutes or associates? Evidence against slot-filler accounts of children's early categorization. Child Development, 69 (2): 347-354.
Kutas, M., and S. A. Hillyard 1980. Event-related brain potentials to semantically inappropriate and surprisingly large words. Biological Psychology, 11 (2): 99-116.
Mandler, J. M. 2000. Perceptual and conceptual processes in infancy. Journal of Cognition and Development, 1: 3-36.
Mani, N., and K. Plunkett 2007. Phonological specificity of consonants and vowels in early lexical representations. Journal of Memory and Language, 57: 252-272.
McClelland, J. L., and T. T. Rogers 2003. The parallel distributed processing approach to semantic cognition. Nature Reviews Neuroscience, 4: 1-7.
McRae, K., G. S. Cree, M. S. Seidenberg, and C. McNorgan 2005. Semantic feature production norms for a large set of living and nonliving things. Behavior Research Methods, Instruments, and Computers, 37 (4): 547-559.
McRae, K., V. R. de Sa, and M. S. Seidenberg 1997. On the nature and scope of featural representations of word meaning. Journal of Experimental Psychology: General, 126 (2): 99-130.
Meints, K., K. Plunkett, and P. Harris 1999. When does an ostrich become a bird? The role of typicality in early word comprehension. Developmental Psychology, 35 (4): 1072-1078.
Meints, K., K. Plunkett, P. L. Harris, and D. Dimmock 2002. What is 'on' and 'under' for 15-, 18- and 24-month-olds? Typicality effects in early comprehension of spatial prepositions. British Journal of Developmental Psychology, 20: 113-130.
Meyer, D. E., and R. W. Schvaneveldt 1971. Facilitation in recognising pairs of words: Evidence of a dependence between retrieval operations. Journal of Experimental Psychology, 90: 227-234.
Meyer, D. E., and R. W. Schvaneveldt 1976. Meaning, memory structure, and mental processes: People's rapid reactions to words reveal how stored semantic information is retrieved. Science, 192: 27-33.
Moss, H. E., S. F. McCormick, and L. K. Tyler 1997. The time course of activation of semantic information during spoken word recognition. Language and Cognitive Processes, 12 (5/6): 695-731.
Moss, H. E., and L. Older 1996. Birkbeck Word Association Norms. Hove, UK: Psychology Press.
Moss, H. E., R. K. Ostrin, L. K. Tyler, and W. D. Marslen-Wilson 1995. Accessing different types of lexical semantic information: Evidence from priming. Journal of Experimental Psychology: Learning, Memory and Cognition, 21 (4): 863-883.
Nation, K., and M. J. Snowling 1999. Developmental differences in sensitivity to semantic relations among good and poor comprehenders: Evidence from semantic priming. Cognition, 70: B1-B13.
Raajmakers, J. G. W. 2003. A further look at the "language-as-a-fixed-effect fallacy". Canadian Journal of Psychology, 57 (3): 141-151.
Raajmakers, J. G. W., J. M. C. Schrijnemakers, and F. Gremmen 1999. How to deal with "the language-as-a-fixed-effect fallacy": Common misconceptions and alternative solutions. Journal of Memory and Language, 41: 416-426.
Radeau, M. 1983. Semantic priming between spoken words in adults and children. Canadian Journal of Psychology, 4: 547-556.
Reznick, J. S. 1990. Visual preference as a test of infant word comprehension. Applied Psycholinguistics, 11 (2): 145-166.
Steyvers, M., and J. B. Tenenbaum 2005. The large-scale structure of semantic networks: Statistical analyses and a model of semantic growth. Cognitive Science, 29: 41-78.
Styles, S. J., and K. Plunkett 2009a. How do infants build a semantic system? Language and Cognition, 1 (1): 1-24.
Styles, S. J., and K. Plunkett 2009b. What is 'word understanding' for the parent of a 1-year-old? Matching the difficulty of lexical comprehension tasks to parental CDI report. Journal of Child Language, 36: 895-908.
Swingley, D., and R. Aslin 2007. Lexical competition in young children's word learning. Cognitive Psychology, 54: 99-132.
Swingley, D., and A. Fernald 2002. Recognition of words referring to present and absent objects by 24-month-olds. Journal of Memory and Language, 46 (1): 39-56.
Swingley, D., J. P. Pinto, and A. Fernald 1999. Continuous processing in word recognition at 24 months. Cognition, 71 (2): 73-108.
Swinney, D. A. 1979. Lexical access during sentence comprehension: (Re)consideration of context effects. Journal of Verbal Learning and Verbal Behaviour, 18: 645-659.
Tomasello, M., and C. B. Mervis 1994. The instrument is great, but measuring comprehension is still a problem. Monographs of the Society for Research in Child Development, 59 (5): 174-179.
Torkildsen, J. V. K., G. Syversen, H. G. Simonsen, I. Moen, and M. Lindgren 2007. Electrophysiological correlates of auditory semantic priming in 24-month-olds. Journal of Neurolinguistics, 20: 332-351.
White, K. S., and J. L. Morgan 2008. Sub-segmental detail in early lexical representation. Journal of Memory and Language, 59 (1): 114-132.
Yee, E., and J. Sedivy 2006. Eye movements to pictures reveal transient semantic activation during spoken word recognition. Journal of Experimental Psychology: Learning, Memory and Cognition, 32 (1): 1-14.

Words: Discrete and discreet mental representations

Aditi Lahiri

1. Introduction

Words are the central unit of every language, and their representation in the mental lexicon happens to be both discrete and discreet. Although these two words are identical in their phonetic output, and may even have the same spelling in some dialects of English, they have different nuances: the latter suggests inconspicuous or understated, while the former implies distinct or separate. An uncontroversial assumption within phonology is that underlying lexical representations of words are on the one hand discreet enough not to code predictable information, while being discrete enough to keep lexical items apart. The problem is that although lexical phonological representations may be discrete and discreet, no two words, even when spoken by the same individual, are ever acoustically identical! Adult speech is produced with great speed and accuracy: words are selected from a mental store of tens of thousands (the mental vocabulary size of a university undergraduate is assumed to be in the range of 14,000 to 17,000 words) and articulated at an average rate of 160 words per minute. The fluency of speech production, however, disguises the complexity and variation of the speech output. Setting aside noisy environments and differences in vocal tract size, age, gender, and dialect, the sounds of words also undergo substantial variation in different utterance contexts. Surprisingly, adult native speakers are able to tolerate extreme variability in word pronunciation, to the extent that they will accept "incorrect" enunciations. This tolerance of the adult brain is to a large extent asymmetric, and it is germane to word recognition, not to sound discrimination. For instance, although we can readily discriminate between the sounds [n] and [m], we are willing to tolerate certain mispronunciations of words with [n] but not with [m]: *ha[m]bag is an acceptable variant of ha[nd]bag, but *A[n]track is never mistaken for A[m]track. The speaker versus listener scenario is also complex, given the somewhat incompatible demands of each. Ostensibly the speaker is in control: she applies her knowledge of grammatical principles and rules to produce a meaningful utterance after accessing the words in the mental lexicon. Nevertheless, irrespective of the speaker's efforts to control how she speaks, her pronunciation is bound to vary.

The listener has a different mission, since she has no control over the speech stream. Her task is to undo the variation that the speaker produces, access the phonological lexicon, and recognise the words. Despite the fact that speaker~listener requirements are different, the phonological representation of words in the mental lexicon is generally assumed to be the same. Why does the brain tolerate certain word asymmetries and not others? One possibility is that the listener's brain encodes detailed acoustic information about the individual tokens of the words she hears; accumulation of a sufficient number of acoustic variants permits the adult to identify the important dimensions of variation. A second possibility is that the variation permitted is context dependent and can be undone by the listener. A third alternative could be that the acoustic information of the speaker's intended sound is always available to the listener. A final scenario is that the mental lexicon of the adult stores words in a form which contains feature asymmetries, permitting only certain variations to be accepted, and thereby allowing some changes to take place and not others. Both experimental and theoretical research are still inconclusive in these matters. Furthermore, there have been some, though not many, rigorous attempts to combine the knowledge from theoretical, historical and experimental fields to ask questions about asymmetric word representations in the brain: experimental researchers are usually unacquainted with phonological asymmetries or with data from historical change involving word variation and asymmetries, and theoretical scholars are unaware of experimental possibilities. In this chapter, I would like first to elaborate on the types of variation a speaker usually produces which the listener has to resolve. The second part will establish that an abstract, sparse, discreet, underspecified representation is more viable and makes better predictions as to what variations may or may not occur and how they may be processed. Finally, a thorny and irksome aspect of words is that their structure, their complexity, and even the way they change, are not symmetric. Underspecified representations imply asymmetry, which has the consequence that the output of phonological processes applying to them is also asymmetric. What I would like to argue is that phonological representations not only govern speaking and listening, but also how the phonological grammar changes. Although this is not a historical paper, I will briefly illustrate how synchronic systems are also reflected in diachronic change. Finally, I would like to show that the discreetness of phonological representations cannot be accounted for by experience alone. Of course, experience counts and indeed it must.

But a level of abstraction based on universal principles, rather than on having experienced each particular variation, is also favoured by the native speaker. The basis for these universals must be entrenched in our cognitive system, on which I can only speculate. Before we delve into the nature of phonological representation, we first turn to the types of variation produced by the speaker that the listener has to cope with.

2. Types of variations and asymmetries

If we accept that word representations contain discrete elements (segments) which are not indivisible wholes but are made up of features, then the alternations of word shape (which can be dubbed variations) necessarily make use of features and segments. The segments are in a sense epiphenomena, since they are made up of features. Although one could provide an innumerable list of potential word variations in a variety of contexts (affixed and phrasal), both synchronic and diachronic, they can be confined to a tangible set involving deletion, insertion, spreading, lengthening, and shortening. Furthermore, there is habitually some asymmetry implicated – either in the feature/segment alternation or in the context, as we will see below.

2.1. Types of phonological variations and contexts

(1) Phonological variation possibilities

a. Feature deletion: eliminate [VOICE] when the following stop is not voiced; [bt] > [pt]
b. Feature insertion: add [ASPIRATION] to word-initial voiceless stops; [p]a > [pʰ]a
c. Feature spreading: spread [NASAL] from the following nasal to the vowel; [V]n > [Ṽ]n
d. Segment deletion: delete word-final [r]; CV[r] > CVØ
e. Segment insertion: insert [e] before word-initial [st]; [st]V > [est]V
f. Segment lengthening: vowels lengthen before final voiced consonants; VD > VːD
g. Segment shortening: vowels shorten before clusters; VːCC > VCC
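Two of these variation types can be given a concrete, if deliberately simplified, computational rendering. The sketch below is ours: segments are encoded as plain feature sets, with no claim about the actual feature geometry, which is taken up in section 3. It implements feature deletion (1a) and feature spreading (1c) as scans over a segment string, each sensitive to the following segment.

# Illustrative only: segments as feature sets; a rule scans the string
# and manipulates features in the context of the following segment.

def devoice_before_voiceless(segments):
    # (1a) Feature deletion: a stop loses [VOICE] before a voiceless stop.
    out = []
    for seg, nxt in zip(segments, list(segments[1:]) + [None]):
        feats = set(seg)
        if nxt is not None and "VOICE" in feats and "VOICE" not in nxt:
            feats.discard("VOICE")        # [bt] > [pt]
        out.append(frozenset(feats))
    return out

def nasalise_before_nasal(segments):
    # (1c) Feature spreading: a vowel acquires [NASAL] from a following nasal.
    out = []
    for seg, nxt in zip(segments, list(segments[1:]) + [None]):
        feats = set(seg)
        if "VOCALIC" in feats and nxt is not None and "NASAL" in nxt:
            feats.add("NASAL")            # [V]n > [Ṽ]n
        out.append(frozenset(feats))
    return out

b = frozenset({"STOP", "LABIAL", "VOICE"})
t = frozenset({"STOP", "CORONAL"})
print(devoice_before_voiceless([b, t]))   # the first stop surfaces without [VOICE]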

The term "segment" refers to both vowels and consonants. The examples above are all from attested languages. There is an asymmetric aspect to all of the above, which will be discussed in §2.3. The processes listed here are sensitive to either featural contexts or metrical contexts (syllable, foot, or phonological word). On rare occasions, a phonological process may also be context free. A famous example is Grimm's Law, which is composed of three context-free processes changing obstruents from Indo-European (IE) to Germanic. For example, all voiced stops in IE lost their [VOICE] feature in Germanic; cf. Sanskrit daša, English ten. Feature spreading (1c) is invariably dependent on the context of a contiguous feature and involves what is generally known as assimilation, whereas (1d) and (1e), viz. segment deletion and insertion, are generally governed by metrical constraints. We can also isolate two different domains in which any of the seven types of phonological alternation may change word shapes.

(2) Word and phrasal domains

a. Word-internal alternations – affixation domain
b. Across-word alternations – phrasal domain

2.1.1. Word-internal variation as a consequence of suffixation

The phonological shape of words can vary under affixation, where both root and affix can change depending on the context. In the following examples, the alternations involve vowels and consonants which change in phonological contexts after affixation. The number referring to the type of phonological process in (1) is given beside each example.

(3) Examples of phonological processes
keep ~ kept               [kiːp] ~ [kɛp-t]                  (1a, g)
ineligible ~ incoherent   [ɪn-ɛlɪdʒəbl] ~ [ɪŋ-kohɪərɛnt]    (1c)
twelve ~ twelfth          [twɛlv] ~ [twɛlf-θ]               (1a)

The vowel quantity and quality change in keep~kept and five~fifth, while the consonant preceding the suffix changes in critic~criticise and five~fifth. There are distinct phonological processes leading to these alternations. For keep~kept, the suffix [t] is added to the verb root, which obeys a trimoraic (3-mora) rhyme constraint; as a result the root vowel is shortened, leaving a bimoraic rhyme.

The root itself has a long vowel, /kiːp/, where the rhyme has two moras, since the final consonant is considered to be extrametrical.

(4) Vowel shortening and lowering (O = onset, R = rhyme, ⟨ ⟩ = extrametrical)
/kiːp/   R = μμ, final ⟨p⟩ extrametrical
kiːp-t   /p/ is no longer peripheral, so the rhyme would exceed the trimoraic limit
kɪp-t    the long vowel is shortened to one mora
kɛp-t    the shortened vowel is lowered
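One plausible way to spell out the mora counting in (4) is sketched below; the exact bookkeeping of extrametricality is our simplification, intended only to make the arithmetic explicit.

# One way to count the moras in (4); what counts as extrametrical is
# our simplification of the analysis sketched in the text.

MAX_RHYME_MORAS = 3

def rhyme_moras(vowel_long, coda, root_final_is_word_final):
    # long vowel = 2 moras, short = 1; each coda consonant = 1 mora;
    # the root-final consonant is skipped only while it is word-final.
    m = (2 if vowel_long else 1) + len(coda)
    return m - 1 if root_final_is_word_final else m

print(rhyme_moras(True, "p", True))     # bare /kiːp/: 2 -- within the limit
print(rhyme_moras(True, "pt", False))   # /kiːp-t/: 4 -- exceeds MAX_RHYME_MORAS
print(rhyme_moras(False, "pt", False))  # [kɛp-t]: 3 -- rescued by shortening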

The two consonantal processes in (1a, c) are feature changing and depend on the following consonant. For twelfth, the lack of voicing in the final consonant of the root is caused by the following voiceless suffix: [twɛlv-θ] > [twɛlf-θ]. In incoherent, the place of articulation of the [k] of coherent spreads to the preceding nasal of the prefix /in-/, changing it to [ŋ].

2.1.2. Variation in word-internal and phrasal domains

The process of [r]-deletion (1d) requires a word like fear to be articulated differently in different domains, even when followed by an identical vowel [ɪ].

(5) [r]-deletion: fear [fɪər]
a. (I fear) (illness makes one weak).            [fɪər] [ɪ]llness
b. (My fear is) (that the weather will change).  [fɪər ɪ]s
c. (I do fear) (if it gets cold).                [fɪər] [ɪ]f
d. (His fearing me) (does not help).             [fɪər-ɪ]ng

Usually, the word-final [r] is not pronounced in southern British English, although it does surface when a suffix is added, as in fearing [fɪərɪŋ] in (5d). In all the other sentences, fear is followed by a word beginning with the vowel [ɪ], which is identical to the first vowel of the suffix [ɪŋ]; however, in (5a) and (5c) the [r] is not pronounced, while in (5b) it re-appears, because fear and is group together as one phonological unit, behaving as if the sequence were word internal. The analysis we would prefer is that is encliticises to the preceding word, making (fear is) a single prosodic word (cf. Lahiri and Plank 2010).
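The domain condition can be stated very compactly. In the sketch below (our own encoding: a prosodic word is a single string of segments, and the vowel inventory is truncated for illustration), [r] surfaces only when a vowel follows within the same prosodic word, which groups (5b) and suffixed fearing together against (5a) and (5c).

# [r]-deletion as a prosodic-word-bounded rule (illustrative encoding).
VOWELS = set("aeiouɪəɛæ")

def r_deletion(prosodic_word):
    out = []
    for i, seg in enumerate(prosodic_word):
        nxt = prosodic_word[i + 1] if i + 1 < len(prosodic_word) else None
        if seg == "r" and (nxt is None or nxt not in VOWELS):
            continue   # word-final or preconsonantal [r] is not pronounced
        out.append(seg)
    return "".join(out)

print(r_deletion("fɪər") + " ɪf")   # (5c): fear, if in a separate prosodic word -> fɪə ɪf
print(r_deletion("fɪərɪz"))         # (5b): (fear is) as one prosodic word -> fɪərɪz
print(r_deletion("fɪərɪŋ"))         # (5d): suffixed fearing -> fɪərɪŋ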

The suffixed word fearing is of course a prosodic word. In addition to acoustic and phonological variation in context, the syntactic category of a word may vary depending on its usage: in (5a) fear is a verb, while it is a noun in (5b). Furthermore, endings may be added as appropriate to either the verb or the noun, leading to other words such as fearing, feared, fears, etc. Several other suffixes can be added to make more words, this time adjectives such as fearful, fearsome, fearless and so on, creating a morphological cohort. Irrespective of whether the usage is as a noun or a verb, however, [r]-deletion is governed by the presence of a following vowel if, and only if, that vowel is part of the same prosodic word. The examples above illustrate phonological alternations involving vowels and consonants, both of which are equally common in the languages of the world. In English, vowel alternations are usually governed by lexical suffixes, and may appear obscure or more difficult at first sight, as in keep~kept. However, alternations pertaining to vowels are no more convoluted than those involving consonants; cf. word-final devoicing, which leads to a consonantal alternation in German Ta[g]e ~ Ta[k] 'day-PL/SG', or vowel raising in the context of a following high vowel /i/ in Bengali /tʃen-/, [tʃen-e]~[tʃin-i] 'see-2P/1P'.

2.1.3. Propensity to change

In addition, words do not remain static over time. This is because the phonological rules that have applied may be reinterpreted, or the representations of the words they apply to may be reanalysed. Consider for example the word five and how it changed from the proto-language Indo-European in all the related languages (the sign † indicates reconstructed forms of an extinct stage of a language, deduced on the basis of comparative data). Classical Greek and Sanskrit are sister languages to Proto-Germanic, and their forms are listed as a basis for comparison. While Greek and Sanskrit have written records, Proto-Germanic is reconstructed. Bengali, which like many other Indo-Aryan languages is ultimately derived from Sanskrit, is also given for comparison. Although the historical development may be transparent to the philologist and the linguist, neither Bengali nor English speakers of the 21st century will easily recognise the word-forms of their ancestors.

(6)
Indo-European (IE)   †penque
Classical Greek      pente
Sanskrit             panca
Bengali              [pãtʃ]

Proto-Germanic       †fimfe
Old English          fif
Old Norse            fimm
Old High German      fünf, fümf
Old Saxon            fif

English              five
Norwegian            fem
German               fünf
Dutch                vijf

Language change naturally cannot and does not happen in a void: it is part and parcel of communication, language acquisition and language processing. Furthermore, there is always a system in how words change – random changes are rare. For example, IE †p became Proto-Germanic †[f] and has remained so in all its descendants, while it remained [p] in Sanskrit and all its descendants.1 The consequence for Germanic speakers was that there were no original †[p]s left, and words with original †[p] were reanalysed with [f]. The point to note is that through the centuries both language families have maintained these consonants, indicating the stability of these systems. This is, of course, not always the case. The development of IE †t, for instance, has not been so stable. It remained in Greek and Sanskrit (and Bengali), but it changed to Proto-Germanic †θ, which was retained only in English and became [d] in Dutch and German: cf. Bengali [tin], English three [θ], German drei, Dutch drie. Two other phonological changes are similar to the processes we have discussed above. From Sanskrit to Bengali, the vowel [a] was nasalised by the following nasal consonant (cf. 1c). The subsequent loss of the nasal consonant in Bengali led to a change in the underlying representation of vowels, introducing nasal vowel phonemes, as discussed in Lahiri and Marslen-Wilson (1991). Later generations of Bengali speakers did not realise that the vowel [ã] originally came from [an]. In English, however, the nasal disappeared but did not leave a nasal vowel. Instead, the vowel underwent compensatory lengthening to long [iː]. Later, in Middle English, all long vowels became one step higher and the high vowels were diphthongised, while short vowels remained unchanged; e.g., [eː, oː, ɛː, ɔː] became [iː, uː, eː, oː], while [iː, uː] became [aɪ, aʊ]. This process is known as the Great Vowel Shift. We can see its effect in our examples of keep~kept and five~fifth: OE cēp-, cēpte (both with long [e]) > ME kēp-, kepte (with an additional rule of vowel shortening), from which we get English k[iː]p (by vowel raising) and k[ɛ]pt (no change).

An example of high-vowel diphthongisation is seen in OE fīf > English f[aɪ]v. Thus, simple, straightforward phonological processes had major diachronic consequences. Non-Germanic languages kept IE †[p] where Germanic languages had a new sound, [f]. Bengali obtained a new set of vowel phonemes, namely underlying nasal vowels. English short~long vowel alternations were originally symmetric in quality: ME [ē]~[e]. But thanks to the Vowel Shift, the alternations in modern English reflect an alternation in both quantity and quality: [iː]~[ɛ] (keep~kept), [aɪ]~[ɪ] (five~fifth). Note that the diachronic process of the Great Vowel Shift involved vowel raising; in our analysis, it would be analysed as "feature insertion", whereby the vowel [e], which is not specified for height, acquires [HIGH]. The modern English synchronic phonological alternation, however, involves vowel lowering, and hence feature deletion, since the underlying representation of the verb to keep has now been reanalysed with a high vowel. Thus, the vowel shift caused all the long vowels to raise; but since ME kepte had a short vowel, modern English kept remained non-high, causing the phonological rule for the next generation to become vowel lowering once the main verb had been reanalysed. We now turn to the asymmetry of phonological rules (synchronic and diachronic) and of phonological representations.

2.2. Asymmetries

Most of the phonological alternation types in (1) have inbuilt asymmetries. For instance, segments may delete word initially, finally or medially; however, where vowels are concerned, final deletion (apocope) and medial deletion (syncope) are frequent while initial deletion is rare. The opposite holds for vowel insertion – vowels may be inserted word initially (prothesis) and medially (epenthesis), but rarely word finally. Phonological processes and phonological contexts can also be asymmetric. Vowels usually shorten in closed syllables and not in open syllables; place of articulation features usually spread from specified to unspecified positions. Voiced consonants lose their voicing word finally, while voiceless consonants do not gain voicing. Assimilations, which are phonological processes involving feature manipulations in the context of other features, are invariably asymmetric. Assimilation may be regressive, where features from a following segment have an effect, or progressive, where the features from the preceding segment dominate. I will discuss the asymmetries, as well as the types of assimilation, in §3, with the feature geometry at hand to make the discussion more concrete.


§3, with the feature geometry at hand, to make the discussion more concrete. Languages, like all cognitive systems, thrive on asymmetries. Consider here a range of surface phonological asymmetries between word pairs within and across languages. The synchronic asymmetries have usually arisen from diachronic phonological processes, which may themselves have been asymmetric.

(7) Word-asymmetries

a. Stress can differentiate word categories in English: an óbject ~ to objéct; a cóntract ~ to contráct. The point to note is that such stress pairs are not symmetric. Disyllabic noun~verb pairs in English allow the following stress patterns: (i) σ́σN ~ σ́σV a méasure ~ to méasure, (ii) σσ́N ~ σσ́V an attáck ~ to attáck, (iii) σ́σN ~ σσ́V a pérmit ~ to permít, but not (iv) *σσ́N ~ σ́σV.

b. In a monosyllabic word in English, vowels are approximately twice as long before voiced consonants [b, d, g] as before voiceless consonants [p, t, k]: the [æ]s in lab, mad, bag are longer than those in lap, mat, back. This is not so in Bengali, where there is no contrastive vowel length; all vowels in monosyllabic words are equally lengthened irrespective of the voicing of the final consonant: [ʃɑp]~[ɖɑb], [ɖɑk]~[dɑg] 'snake, green coconut, mail, mark'. The lengthening holds for all places of articulation.2 The lengthening process does not add a new contrast in either language, and is purely allophonic. Bengali has no vowel length contrast, and in English this lengthening affects both short and long vowels; e.g. the [iː] in bead is longer than that in beat, and [ɪ] is also longer in bid as compared to bit.

c. Word-final voiced consonants [b, d, g] become voiceless [p, t, k] in languages such as German, Dutch, and Russian, but not word initially: e.g. German Zug [tsuk]; but Gaum [gaum] is not *[kaum]. In contrast to voiced stops, voiceless [p, t, k] do not become voiced word initially or finally: German Hut remains [hut] and does not become *[hud], neither does Tür become [dyr].

d. In most languages, [n] changes to bilabial [m] when other bilabial sounds [b, p] follow, but [m] does not become alveolar [n] when alveolar [d, t] follow: English gunboat => gu[m]boat, but gumtree ≠> *gu[n]tree; it remains gu[m]tree. This is more apparent in word-medial sequences: [nb] sequences are invariably pronounced as [mb], as in rainbow > rai[mb]ow, but [md] sequences do not become [nd], as in humdrum ≠> *hu[nd]rum.


e. As mentioned above, the change in language systems through the ages is also asymmetric. Consider the high vowels [u, i, y=ü]3 in Old English, where the [ü] was derived from Proto-Germanic †[u] when the vowel [i] or [j] followed. This is an assimilation like the consonant place assimilation above; here the rounded vowel [u] deletes its DORSAL feature to match the [CORONAL] place feature of the following [i] or [j] and became [ü], which is thus a [CORONAL] rounded vowel. Both the short and long vowels were similarly affected, although the same script was used: [u] > [y], [uː] > [yː]. Old High German had the same phonological rule, and modern German has maintained these rounded front vowels, as in küssen. Unlike German, the [y] in OE cynn 'kin' later lost its rounding and merged with [ɪ]: the change went [y] > [ɪ], and not [ɪ] > [y]. Consequently, OE [y] and [ɪ] merged such that Modern English sit and kiss have identical vowels although their sources are respectively [ɪ] and [y].

f. Even the words we borrow reflect the asymmetric potential of representations. In English, monomorphemic words may never end with /nz/, although inflected words can easily end with such a sequence: cf. pence, dance, lance vs. pen-s, bun-s, plan-s etc. One exception is bronze. When words are borrowed, the tendency is to use [ns] as a final sequence, unless it is treated as a plural, as in lens. Rarely, if at all, are new nouns borrowed ending in [nz].

Once we acknowledge that word-forms are invariant neither within a fixed period of time nor across time, we need to understand how the human brain muddles through. To recognise a word, a human listener has to strip away the variation, grasp the essentials, and map the form onto a somewhat "idealised" mental representation. What is amazing is that the human brain handles the variation with ease. Furthermore, since words are not static, the representations in the brain must also change accordingly. Adult native speakers are able to tolerate extreme variability in word pronunciation, to the extent that they will accept "incorrect" enunciations. But, as we mentioned in the introduction, this tolerance is not only asymmetric but also specific to word recognition, not just sound identification. We can readily discriminate between the sounds [n] and [m]. Nevertheless, we are willing to tolerate *gu[m]boats as an acceptable variant of gu[n]boats, but *gu[n]boots is not mistaken for gu[m]boots. The systematic nature of adult sensitivity to the variability in word pronunciation suggests that it is well entrenched in their phonological knowledge.


A central question for our understanding of adult word knowledge is how this systematic tolerance of word variation is established in the brain. How are the phonological shapes of words stored, and how do we identify and recognise them? The crux of the proposal I have tried to lay out in the course of the last fifteen years, beginning with joint work with William Marslen-Wilson, is based on asymmetries: the mental lexicon must cope with the asymmetries in the speech signal. I approached the problem of WORD variation in four ways with a number of colleagues:4

(8) Addressing word variation

a. We carried out a theoretical synchronic study of the phonological systems of a variety of languages in order to understand the full nature of the constraints on the systems;
b. In addition, a historical study of the same languages was conducted to grasp what may change, the constraints on change, and, more importantly, why certain aspects never change;
c. We developed a model of speech perception and language comprehension called the Featurally Underspecified Lexicon (FUL), based on knowledge and principles gleaned from the theoretical studies, which makes claims about (i) how words are represented, and (ii) how the speech signal maps on to word-form representations;
d. A computational system was developed with Henning Reetz, based on FUL's principles and with a complete linguistic lexicon, without recourse to HMM models, to examine how far one could achieve a speaker-independent system.

Based on the intricacies and variability of word-forms, one could suggest two solutions to the problem of how the listener handles variation. One possibility is that the brain encodes detailed information about the individual tokens of the words she hears, which she then uses to evaluate new instances of the word. Accumulation of a sufficient number of examples permits one to identify the important dimensions of variation. A second solution is that we represent words without all possible surface detail; that is, we allow the word's phonological representation to be abstract. For instance, German words like Tag, Tag-e 'day, day-PL' could be represented as [tak] and [tage] if we do not incorporate final devoicing as an active rule operating on /tag/. The alternative is to assume that the representation of the noun is always /tag/ and that the perceptual system has a way of allowing the final [k] to be accepted as a version of represented /g/.


FUL's principles empower the mental word-lexicon to do much of the work. Since word-forms are so variant, examining all acoustic detail is time consuming and may result in mistakes. Instead, the idea is to make each word-entry in the mental lexicon abstract and sparse – distinct enough to keep it apart from other words, but abstract enough to tolerate variation. The algorithm which the listener uses first applies heuristic measures to extract rough acoustic cues, then turns them into approximate phonological features, and finally maps these features on to the lexicon following a three-way matching algorithm: full match, tolerated (no-mismatch), and conflict (mismatch).5 The tolerant no-mismatch principle allows for the acceptance of variation! I now turn to the actual feature geometry, which was proposed as a theoretical model but also as a model which governs feature representations for processing. The aim is to show how the proposed representations make predictions about the variations observed in languages.
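A minimal sketch of this front end, in Python, may make the division of labour clearer. It is an illustration only, not the Lahiri–Reetz implementation: the criterion "F1 below 300 Hz yields [HIGH]" is taken from note 5, whereas the remaining thresholds and the function name are invented placeholders, and real cue extraction is of course far noisier.

def vowel_features(f1_hz: float, f2_hz: float, nasal_murmur: bool) -> set:
    """Turn coarse vowel acoustics into a set of monovalent FUL features."""
    features = set()
    if f1_hz < 300:
        features.add("HIGH")        # heuristic cited in note 5
    elif f1_hz > 650:
        features.add("LOW")         # hypothetical threshold
    if f2_hz > 1800:
        features.add("CORONAL")     # hypothetical: front vowels
    elif f2_hz < 1100:
        features.add("DORSAL")      # hypothetical: back vowels
    if nasal_murmur:
        features.add("NASAL")       # nasality is extractable; orality is not
    return features                 # a feature, so nothing marks its absence

print(vowel_features(700, 1900, nasal_murmur=True))   # LOW, CORONAL, NASAL
print(vowel_features(250, 2200, nasal_murmur=False))  # HIGH, CORONAL

The crucial property of the sketch is that only privative features are ever returned: a non-nasal vowel yields no "oral" feature, which is exactly the asymmetry exploited below.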


3. Why is abstract representation better than non-abstract representation?

One particular issue about the nature of representations currently being prominently investigated from various disciplinary and theoretical angles, and one that will continue to receive much attention in the literature, is their UNDERSPECIFICATION. It is one facet of the more general theme of investigating how ABSTRACT or CONCRETE representations are. I will remind the reader of the feature geometry assumed in FUL, which will make it easier to understand what happens. A crucial point to remember is that features that are contrastive but underspecified in the representation may still be extracted by the auditory system and will play a crucial role in the activation or inhibition of lexical items. As we will see below, since features are monovalent, one does not have access to a "minus" feature. Features like [VOICE] or [NASAL], therefore, do not have a *[-VOICE] or *[-NASAL] counterpart. Consequently, no non-nasality or orality can be extracted from the signal. If a feature is contrastive and part of the language's system of contrasts, but is nevertheless underspecified, then it can still be extracted from the signal. This asymmetry is crucial in understanding how both processing and language change work. The hierarchical structure of the features of FUL is summarised below.

(9) Feature organisation in FUL (Lahiri and Reetz 2010)

ROOT: [CONSONANTAL]/[VOCALIC], [SONORANT]/[OBSTRUENT]
  [NASAL] [LATERAL] [STRIDENT] [RHOTIC]
  LARYNGEAL: [VOICE] [SPREAD GLOTTIS]
  CONSTRICTION: [PLOSIVE] [CONTINUANT]
  PLACE:
    ARTICULATOR: [LABIAL] [CORONAL] [DORSAL] [RADICAL]
    TONGUE HEIGHT: [HIGH] [LOW]
    TONGUE ROOT: [ATR] [RTR]

These features, all monovalent, are all that are required to express segmental contrasts in the languages of the world. Lahiri and Reetz (2010) discuss the features in some detail; here we only touch on the features that are relevant to our processing claims. There are two pairs of opposing binary features – CONSONANTAL or VOCALIC and SONORANT or OBSTRUENT – which are the major class features available in all languages. The members of each pair are conflicting – i.e., CONSONANTAL implies not VOCALIC and vice versa. There are other features, like ATR and RTR, which are mutually exclusive, but these are not binary: a consonant cannot be both ATR and RTR, but it may be neither. The truly binary features do not have this possibility: a segment must be either CONSONANTAL or VOCALIC, and SONORANT or OBSTRUENT. The only dependencies we assume are universal and must be listed: [NASAL] => [SONORANT], [STRIDENT] => [OBSTRUENT], and [CONSTRICTION] => [OBSTRUENT]. We assume that [HIGH] or [LOW] can differentiate the various coronal consonants (dental, palatoalveolar, retroflex etc.) instead of [±anterior], which is a dependent of [CORONAL]. A partial list of segment classification is given below.



(10) Features and segments

[LABIAL]    labial consonants; rounded vowels
[CORONAL]   front vowels; dental, palatal, palatoalveolar, retroflex consonants
[DORSAL]    back vowels; velar, uvular consonants
[RADICAL]   pharyngealized vowels; glottal, pharyngeal consonants
[HIGH]      high vowels; palatalized, retroflex, velar, palatal, pharyngeal consonants
[LOW]       low vowels; dental, uvular consonants
[ATR]       palatoalveolar consonants
[RTR]       retroflex consonants

The assumptions about abstract phonological representations are not controversial. The controversial aspect is that not all features are specified. However, phonological and processing theories have different interpretations of underspecification and abstractness. Within phonology, the standard assumption continues to be that predictable allophonic features are never specified. For instance, SPREAD GLOTTIS or aspiration is not specified in English because it is fully predictable. Thus, a certain level of underspecification is always assumed for non-contrastive features. The controversy arises when one questions whether all contrastive features are specified or not (cf. McCarthy and Taub 1992; Steriade 1995; response in Ghini 2001a). Lahiri and Reetz (2010) provide a phonological account of our take on phonological underspecification. In this chapter, I focus on the relevance of representations for processing.

(11) Assumptions concerning phonological representation
a. Each morpheme has a unique phonological representation. If the morpheme is phonologically predictable, then no variants, either morphophonological or postlexical, are stored.
b. The phonological representation is abstract, exploits contrasts and is featurally underspecified, leading to asymmetries.
c. Feature representation is constrained by universal properties and language-specific contrasts.
d. All features are monovalent – there are no "minus" features; they are either present or not. Contrasts are expressed by the presence of features.
e. Even if a feature is underspecified, its NODE may be present. If CORONAL is underspecified, the PLACE node will be available.
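Under these assumptions, a lexical entry is nothing more than a sequence of small monovalent feature sets. The Python sketch below is a toy illustration of what (10)-(11) amount to, not a worked-out fragment of English: the segmentations and feature choices are deliberately simplified, but note what is systematically absent from the entries.

# Toy underspecified lexical entries in the spirit of (10)-(11).
# Features are monovalent (no minus values); [CORONAL] is never stored;
# predictable features (aspiration, vowel nasality in English) are absent.
LEXICON = {
    # /k æ n/: /k/ stores [DORSAL]; /æ/ stores [LOW] but, being a front
    # (coronal) vowel, no ARTICULATOR feature; /n/ stores only [NASAL],
    # its coronal place remaining an empty slot under the PLACE node.
    "can": [{"DORSAL"}, {"LOW"}, {"NASAL"}],
    # /k æ t/: final /t/ is a voiceless coronal stop, so on this
    # simplification nothing at all needs to be stored for it.
    "cat": [{"DORSAL"}, {"LOW"}, set()],
}

# Nothing here says "-VOICE", "-NASAL" or "ORAL": absence of a feature is
# simply absence, which is what later licenses tolerant no-mismatches.
for word, segments in LEXICON.items():
    print(word, segments)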


Returning to assimilation: under the feature analysis I have outlined above, most of the variations sketched under (2) fall under the general rubric of "assimilation", which means feature manipulation (deletion, addition, insertion) due to the influence of a nearby segment. I will touch on four assimilation types: vowel nasalisation, voicing assimilation, vowel height assimilation, and place of articulation assimilation. The last two are both covered as PLACE assimilation, since the PLACE node governs ARTICULATOR (including tongue and lips) and TONGUE HEIGHT.

(12) Examples of assimilations

a. Vowel nasalisation (feature insertion, 1b)
This is usually regressive, although progressive assimilations exist. In English, this rule is allophonic, since it adds a feature contrast which did not exist in the lexicon. In Bengali, nasalisation is neutralising, because the contrast exists. For both languages, the process falls under our domain (2b), i.e. within words, within affixed words and, if the conditions are met, across words within phonological phrases. Recall that oral vowels have no feature marking [NASAL], and hence no orality can be extracted from the signal.

English (allophonic): the [NASAL] of a tautosyllabic consonant spreads leftwards onto the vowel, V C[NASAL]]σ > Ṽ C[NASAL]]σ:
/kæn/ > [kæ̃n] can (monomorphemic); /floʊ-n/ > [flõʊn] flown (suffixed form)

Bengali (neutralising): the same leftward spreading, but the output merges with underlyingly nasal vowels:
/tʃɑn/ > [tʃɑ̃n] 'bath' (monomorphemic); /tʃa-n/ > [tʃɑ̃n] 'want-FORMAL.3P.PRES' (suffixed); cf. /kɑ̃dh/ 'shoulder', with an underlying nasal vowel


b. Voice assimilation (feature deletion or spreading)
Unlike vowel nasalisation, voice assimilation can be both regressive and progressive, although regressive is probably more frequent. According to our model, [VOICE] can be specified if there is a contrast within obstruents, but voiceless consonants remain unspecified. Voicing assimilation itself is asymmetric. Generally, in Indo-European languages, monomorphemic word-medial consonant clusters share voicing properties – they are either both voiced or both voiceless: abdomen, apt, but not *[bt], *[kb] etc. Furthermore, voiceless sequences are far more common than voiced ones. If voicing assimilation occurs, then we find it most frequently within compounds and in affixed words. The Bengali examples below illustrate both patterns.

Bengali:
Feature spreading: the [VOICE] of the second consonant spreads leftwards onto a preceding consonant with no laryngeal specification:
/hɑt/ + /bæthɑ/ 'hand' + 'pain' > [hɑdbæthɑ]
Feature deletion: a specified [VOICE] is deleted before a following consonant with no laryngeal specification:
/rɑg/ + /kɔrɑ/ 'anger' + 'do' > [rɑkkɔrɑ] 'to be angry'

Since voicelessness is not a feature, the lack of voicing cannot spread. The rule says that the first consonant must share the laryngeal properties of the following one. Consequently, if the second consonant has the feature [VOICE], it spreads leftwards; if, on the other hand, the second consonant has no laryngeal feature specified but the first is specified for [VOICE], the latter specification is deleted.
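Stated this way, the rule is almost code already. The sketch below is an illustration under the simplification that only the laryngeal node is represented; the Bengali forms are those cited above.

# Regressive laryngeal assimilation, FUL-style: with no [-VOICE] available,
# the rule can only copy [VOICE] leftwards or delete it.
def laryngeal_assimilation(c1: set, c2: set) -> set:
    """Return C1 adjusted to the laryngeal specification of following C2."""
    out = set(c1)
    if "VOICE" in c2:
        out.add("VOICE")        # feature spreading
    elif "VOICE" in out:
        out.discard("VOICE")    # feature deletion before an unspecified C2
    return out

t, k = set(), set()             # voiceless: no laryngeal specification
b, g = {"VOICE"}, {"VOICE"}

# /hat/ + /bætha/ 'hand pain' > [hadbætha]: /t/ acquires [VOICE]
print(laryngeal_assimilation(t, b))   # {'VOICE'}
# /rag/ + /kɔra/ 'to be angry' > [rakkɔra]: /g/ loses its [VOICE]
print(laryngeal_assimilation(g, k))   # set()

The same two-way schema, spreading where the first consonant is unspecified and deletion where it is specified, extends directly to the ARTICULATOR assimilations discussed below.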


c. Vowel height assimilation (feature spreading or deletion)
The TONGUE HEIGHT features, [HIGH] and [LOW], can both be specified in a given language if there is a three-way contrast (Lahiri and Reetz 2010). If there is a two-way contrast, then only one of them is specified and the other remains unspecified. Briefly, the idea is that in the course of acquisition, [LOW] is specified first. If there is any evidence that a further contrast is necessary, then [HIGH] is specified, otherwise not. This is intended to be universal, but certainly more theoretical research is required. It would be good if it were universal, but then, there are always exceptions. Here, I will give an example from Bengali, but similar height assimilation is found in many languages. Usually vowels are raised or lowered depending on the feature of the following vowel. Again, regressive assimilation is more frequent than progressive assimilation. In Bengali, there is a three-way vowel height contrast, and the underlying vowel is raised if the following vowel is [HIGH]. The feature specifications of the relevant vowels are the following:

        i   e   æ   u   o   ɔ
HIGH    √           √
LOW             √           √

Feature spreading: an unspecified mid vowel acquires [HIGH] from the following vowel:
/tʃen/ + /i/ > [tʃini] 'recognise-1P.PRES'; /khol/ + /i/ > [khuli] 'open-1P.PRES'
Feature deletion: a [LOW] vowel loses its TONGUE HEIGHT feature before a [HIGH] vowel and surfaces as mid:
/phæl/ + /i/ > [pheli] 'throw-1P.PRES'; /kɔr/ + /i/ > [kori] 'do-1P.PRES'

d. ARTICULATOR assimilation (feature spreading, feature deletion)
Assimilation of ARTICULATOR features is commonly known as place assimilation, which implies vowels or consonants sharing place features. Although consonant place assimilation is more familiar, vowel place assimilation is not rare. The process of umlaut is an example of vowel place assimilation. Vowel harmony, when it refers to articulators, is also an instance of place assimilation. I provide an example of umlaut from Old English and Modern German, and of consonant place assimilation from English. In umlaut, a stressed vowel must match its ARTICULATOR feature to a following /j/ or /i/, both of which are CORONAL and HIGH. Underlying [CORONAL] is, of course, not specified. Consequently, if the preceding vowel is "back", i.e. DORSAL, it loses this feature. Umlaut is always deletion of a feature.


Umlaut (feature deletion):
German gütig [gytiç]: /gut/ + /ɪç/ > [gytiç]; the stressed vowel deletes its [DORSAL] before the [CORONAL, HIGH] vowel of the suffix.
Old English bēc 'book-PL': /bōc/ + /i/ > [bēci],6 again with deletion of the [DORSAL] of the root vowel before [CORONAL, HIGH] /i/.

Consonant ARTICULATOR assimilation (feature spreading):
English incapable [ɪŋkeɪpəbəl]: /ɪn/ + /kepəbəl/ > [ɪŋkepəbəl]; the [DORSAL] of /k/ spreads leftwards onto the place-unspecified nasal.
English rainbow [reɪmboʊ]: /reɪn/ + /bo/ > [reɪmbo]; the [LABIAL] of /b/ spreads leftwards onto the nasal.

Thus, ARTICULATOR assimilations appear to involve only feature spreading and not feature deletion. However, feature deletions do sometimes occur, but only if a feature spreading rule exists (cf. Mohanan 1993).7 Such a process would look like I'm going > [aɪŋgoɪŋ], where [m] assimilates to the following [g], which means that the [LABIAL] of [m] deletes and the nasal takes on the feature [DORSAL].

4. Variation and comprehension

In this paper, our focus is on phonological variation and processing. How would an assumption of a single underlying representation in the listener's mental lexicon cope with the variation we have seen above? As I mentioned before, one option is that all possible variants or exemplars are listed (Johnson, Pierrehumbert). Under this hypothesis, the variants may have weights attached to them, where one variant is more likely to be activated than another. A second assumption is to hypothesise that if a phonological rule has the potential to apply under a contextual influence, then the context will play a role in resolving the variation (Gaskell and Marslen-Wilson 1996, 1998, 2001; Zwitserlood and her colleagues). Although the individual experiments and details may differ, both scholars and their colleagues have focused on the assimilation of context.


Their models do not store all variants; since the compensation applies even to novel items, it is rather the correct context that enables the listener to activate the intended lexical representation. I am uncertain, though, how the asymmetric recognition of isolated words can be explained without storing them. For instance, *browm can be accepted in isolation as a variant of brown. A third option is that a phonological rule never quite "changes" a segment – there are always residues of the underlying segment which can be recovered (cf. Gow 2001, 2002, 2003). However, Snoeren, Gaskell, and Di Betta (2009) as well as Zimmerer, Reetz, and Lahiri (2009) show that complete assimilations are indeed possible in natural speech, such that the listener perceives that the segment has become altered. Finally, FUL assumes that universal and language-specific contrasts determine the representation. This means that phonological alternations may determine underlying representations, but abstract and underspecified representations may occur without alternations.

Let us consider the following examples: can [khæ̃n], cat [khæt] and Bengali [gɑ̃n] 'song', [gɑ̃] 'village', [gɑ] 'body'. The nasalised vowels in [khæ̃n] and [gɑ̃n] are context dependent; the nasalisation may be optional, and the degree of nasalisation will differ across speakers and words. Under hypothesis one, listeners have had experience of the nasalised vowels in this context and can easily undo the nasalisation. Under hypothesis two, all variants are stored, and hence both the non-nasal and nasal variants will be present in the mental lexicon. Under hypothesis three, the nasalisation is never complete; hence the orality of the vowel will always be present to help the listener. In FUL, all of these vowels other than that of the Bengali [gɑ̃] are underlyingly oral. But since these vowels are unspecified for nasality, nasal vowels of the same quality (i.e., PLACE) will be tolerated and accepted. Now suppose that listeners mishear the nasal vowel as oral, and the oral vowel as nasal: what can we expect? The table in (13) presents how different models would deal with such word variants. The second column gives surface forms which the phonology predicts on the basis of possible assimilations. The third column gives variants which are just mispronunciations and not contextually determined variants. The four models are as follows: Model 1 refers to exemplar-based approaches; Model 2 refers to Gaskell and his colleagues, who infer the correct form based on assimilation; Model 3 refers to models such as Gow's, where the signal is assumed to contain relevant information since assimilation is not complete; and Model 4 represents our model, FUL.


(13) Predictions of accepting or rejecting acoustic variations in four models

Lexical           Context-dependent   Non-assimilatory          Prediction of possible activation
representation    possible variants   mispronounced variants    of intended word
                                                                M-1   M-2   M-3   M-4
khæt  'cat'                           *khæ̃t                     x     x     x     ✓
khæn  'can'       *khæ̃n                                         ✓     ✓     ✓     ✓
                                      *thæ̃n                     x     x     x     x
                  *khæ̃m                                         ✓     ✓     ✓     ✓
pɑ    'leg'                           *pɑ̃                       x     x     x     ✓
gɑ̃    'village'                       *gɑ                       x     x     x     ✓
dɑm   'price'     *dɑ̃m                                          ✓     ✓     ✓     ✓
                                      *bɑm                      x     x     x     ✓
                                      *dɑn                      x     x     x     x

This table endeavours to provide a synopsis of how different models would accept or reject variation. The first two words are from English, and the next three are from Bengali. Both languages allow vowel nasalisation, and hence the corresponding nasalised vowels are in Column 2, which lists contextually permissible variants. A further permissible variant is the [m] from /n/ by the process of ARTICULATOR assimilation, viz. *khæ̃m, in phrases like It can be done. The idea is to indicate which models predict that listeners may or may not detect "correct" vowels and consonants. Before we discuss the different predictions, we need to list our claims regarding the extraction of features and the mapping from the signal to the lexicon.

(14) FUL's claims for the listener
a. The perceptual system analyses the signal for rough acoustic features, which are transformed into phonological features.
b. All features are extractable by the auditory system, even if they are underspecified in the representation.
c. Phonological features extracted from the signal map directly onto features specified in the lexicon – there is no intermediate representation.
d. A three-way matching procedure (match, mismatch, no-mismatch) determines the choice of candidates activated.
e. Features from the signal which conflict with the representation mismatch, and constrain the activation of candidates.
f. Along with the phonological information, morphological, syntactic, and semantic information is also activated.
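Since the matching procedure in (14d-e) does the decisive work, a minimal sketch may help fix ideas. The following Python fragment is an illustration, not the Lahiri–Reetz implementation: conflict is restricted to the mutually exclusive ARTICULATOR features, and the feature sets assumed for the segments of Bengali dɑm 'price' are simplified.

# A toy version of FUL's three-way matching (14d-e). Real FUL matches richer
# structures (cf. Lahiri and Reetz 2002); articulator conflict is all the
# table-(13) asymmetries need.
ARTICULATORS = {"LABIAL", "CORONAL", "DORSAL", "RADICAL"}

def match(extracted: set, stored: set) -> str:
    """Classify an extracted segment against a stored (underspecified) one."""
    ext_art = extracted & ARTICULATORS
    sto_art = stored & ARTICULATORS
    if ext_art and sto_art and ext_art != sto_art:
        return "mismatch"      # e.g. [CORONAL] from [n] against stored [LABIAL]
    if extracted == stored:
        return "match"
    return "no-mismatch"       # tolerated: nothing extracted conflicts

# Bengali dam 'price': /d/ is coronal, so its place is unspecified;
# /m/ stores [LABIAL]; the vowel stores no [NASAL].
dam = [{"VOICE"}, {"LOW"}, {"LABIAL", "NASAL"}]

print(match({"LABIAL", "VOICE"}, dam[0]))   # *[b]am: no-mismatch, tolerated
print(match({"CORONAL", "NASAL"}, dam[2]))  # *da[n]: mismatch, dam ruled out
print(match({"LOW", "NASAL"}, dam[1]))      # nasalised vowel: no-mismatch

The asymmetry in (13) falls out of underspecification alone: *[bɑm] is tolerated because coronal /d/ stores no place feature that [LABIAL] could contradict, whereas *[dɑn] fails on the [LABIAL] stored for /m/.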


Model 1 assumes that variants are accepted if they have been heard or experienced, and possibly if they are close enough to an accepted exemplar. Model 2 assumes that the context allows the listener to compensate for the change, whereby the variant can be accepted. Model 3 would argue that no context-dependent assimilation is complete, and hence the cues to the lexical representation are perceived and the correct word identified. Model 4 assumes that not only context-dependent assimilations, but any variant which does not conflict with the representation, will be accepted. An immediate question one may ask is: why should there be a variant that is not produced by phonological rules? Well, sometimes, due to noise in the surroundings (phone, car, background noise etc.), a sound may not be perceived correctly. FUL assumes that not all these variants will be discarded. Other models do not discuss this possibility. However, as things stand at the moment, all of them would reject such variants. Perhaps Model 1 may accept certain variants, if the sounds were "similar" to an existing stored exemplar; but then we need to define what is "similar". Likewise, Model 3 may argue that the acoustic signal still contains the original cues; if so, are there any directional differences? Model 4's strength (or perhaps unreasonable obstinacy) is that it predicts an asymmetry in the perception of these variants.

Let us discuss each example in turn. The word cat is usually never pronounced with a nasal vowel, so all models other than FUL would reject *[khæ̃t]. The reason why FUL would tolerate the variant is that oral vowels are not specified for orality, and hence if nasality is perceived, it is ignored. The word can, on the other hand, is usually pronounced with a nasal vowel, and so all models accept the variant. The variant *[khæ̃m] is also accepted by all models because it is a contextually possible variant. However, the variant *[thæ̃n] is not accepted by any, but for different reasons. FUL does not accept it because the [CORONAL] extracted from [t] conflicts with initial /k/. The other models reject the variant because it is not contextually viable. The Bengali variants *[pɑ̃] for pɑ and *[gɑ] for gɑ̃ will both be accepted by FUL but not by the other models, because they are not contextually possible variants. Although [NASAL] is specified in gɑ̃, there is no feature *[ORAL] which can be extracted; hence nothing is at odds with the specified value, and consequently both variants will be tolerated.


The word dam has three variants. The first, with a nasalised vowel, is contextually valid and accepted by all. The second variant, *[bɑm], is accepted by FUL but not by the others: the [LABIAL] extracted from [b] does not conflict with place-unspecified /d/. The other nonword, *[dɑn], is rejected as a possible variant of dam by all. FUL does not tolerate it because the [CORONAL] of final [n] conflicts with final /m/. If *[dɑn] were to be accepted as a variant of dam, FUL would have to argue that this is only possible if other mechanisms were in place, namely postlexical processing. We have seen in Friedrich, Eulitz, and Lahiri (2006) that in an EEG study a later peak develops even when the "regular" variant is realised as a nonword. Only FUL has the tolerance to accept certain contextually inappropriate variants, namely those where the features from the signal do not conflict with those in the lexicon. This depends on the level of specification, not on the context. Perhaps the nonword variant *[thæ̃m] for can is a good test example. Exemplar models would reject a variant like this if it were the first time the nonword is encountered; alternatively, it may be accepted because it is near enough to the real word. FUL, however, would never accept this variant, since the ARTICULATOR feature of the initial consonant conflicts with the lexical representation. With respect to vowel nasalisation in words like cat, Model 2, which assumes that nasalisation may be complete in the appropriate context, would not accept the nasalisation in *[khæ̃t]. Hence, if nasalisation is heard by mistake on an oral vowel which should never be nasal, the listener will not be able to get to the real word. Probably this is what the test has to be. One can obviously ask: do we need the level of abstractness that FUL predicts? FUL assumes that no orality as such is ever extracted from the signal. The acoustic information only gives the quality of the vowel and the presence of nasality, never its absence.

4.1. Word contexts and phrasal contexts

The variation literature in psycholinguistics has focused largely on assimilation, beginning with Lahiri and Marslen-Wilson (1991, 1992). But it has invariably concerned either variation across words – in phrasal contexts or in compounds – or, as in our earlier paper, word-internal predictable assimilation not governed by affixes. Word-internal variation which is governed by affixes has fallen within the rubric of "morphology"; cf. Clahsen and his colleagues


(Clahsen and Neubauer 2010), and Marslen-Wilson and his colleagues (Ford, Davis, and Marslen-Wilson 2010). Marslen-Wilson et al. (1994) claim at the end of their paper that phonological alternations in suffixed words, as in sane~sanity, pose no problems since the root vowel is stored in an abstract form. However, they do not go any further. Clahsen and his colleagues have usually treated their phonological variants as stored, although this was not the primary reason for storing them (Clahsen et al. 2001). For example, in the German strong verb present tense, mid and low vowel roots always have a raised vowel in the context of the coronal suffixes -st, -t: fahr-en~fähr-st [a, ɛ], geb-en~gib-st [e, i]. The authors assume that the raised form is stored and accessed differently from the weak verbs, where the alternation is not present. Again, I must emphasise that the reason for storing these forms was morphological rather than phonological. However, Scharinger, Reetz, and Lahiri (2009) argue that the regularity of the phonological process of vowel raising is such that the root vowel must be underspecified. I will not summarise this analysis, but take up the English example where Marslen-Wilson et al. (1994) claim that the roots must have an abstract representation. The issue was the vowel alternation usually known as Trisyllabic Shortening (TS) or Trisyllabic Laxing in pairs like sincere~sincerity, vain~vanity. In each pair, the stressed vowel is higher and longer in the adjective than in the affixed noun. The suffix {-ity} is disyllabic, and after affixation the root vowel becomes antepenultimate (two syllables away from the end). This triggers the shortening of the antepenultimate syllable. The so-called lowering is due to the Great Vowel Shift, which affected our previous example of keep~kept. The synchronic scenario is as follows:


(15) Shortening and lowering

                          sincere    sincerity    sane    sanity
underlying                sɪnsiər    sɪnsiər-ɪ    seɪn    seɪn-ɪ
Trisyllabic Shortening    sɪnsiər    sɪnsɪr-ɪ     seɪn    seɪn-ɪ
Lowering                  sɪnsiər    sɪnsɛr-ɪ     seɪn    sæn-ɪ

The diachronic scenario differs in that the original vowel was the lowered vowel in the monomorphemic word, viz. [sinsɛr] and [sæn]. Indeed, these words are all loans, and the words that were initially borrowed (as whole words) were what we call affixed words. That is, sanity was borrowed well before sane. The derivational relationship was established later (Lahiri and Fikkert 1999). The question we need to ask here is: if the root vowel is the same for both the affixed and the monomorphemic form, what is the nature of that vowel? Note that the alternations invariably maintain the same ARTICULATOR features; it is only the TONGUE HEIGHT and the length that vary. Within FUL, it does not really matter which vowel is assumed to be the underlying one. Let us assume that the SPE (Chomsky and Halle 1968) analysis is the right one. In that case the underlying vowels are the lowered vowels, viz. /sinsɛr/ and /sæn/. I will explain FUL's analysis by using only the vowel symbols.


(16) Root vowels in the TS alternations

Surface form of    Features extracted    Matching of                 Lexical
stressed vowel     (ART and TH)          ART and TH                  representation
[eɪ]  sane         [CORONAL], [—]        no conflict; no conflict    ART [—], TH [LOW]
[æ]   sanity       [CORONAL], [LOW]      no conflict; match          ART [—], TH [LOW]
[iə]  sincere      [CORONAL], [HIGH]     no conflict; no conflict    ART [—], TH [—]
[ɛ]   sincerity    [CORONAL], [—]        no conflict; no conflict    ART [—], TH [—]
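Assuming the toy three-way matcher sketched earlier (repeated here so the fragment runs on its own, with conflict again restricted to ARTICULATOR features), the matching in (16) can be replayed directly. The stored roots follow the SPE-style analysis just given.

ARTICULATORS = {"LABIAL", "CORONAL", "DORSAL", "RADICAL"}

def match(extracted: set, stored: set) -> str:
    e, s = extracted & ARTICULATORS, stored & ARTICULATORS
    if e and s and e != s:
        return "mismatch"
    return "match" if extracted == stored else "no-mismatch"

SANE_ROOT = {"LOW"}        # ART unspecified, TH [LOW]
SINCERE_ROOT = set()       # ART unspecified, TH unspecified

print(match({"CORONAL"}, SANE_ROOT))            # [eɪ] sane: no-mismatch
print(match({"CORONAL", "LOW"}, SANE_ROOT))     # [æ] sanity: no-mismatch
print(match({"CORONAL", "HIGH"}, SINCERE_ROOT)) # [iə] sincere: no-mismatch
print(match({"CORONAL"}, SINCERE_ROOT))         # [ɛ] sincerity: no-mismatch

All four surface forms map onto their root without conflict, which is why a single abstract entry suffices for both members of each pair.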

Since the coronality remains constant, and the height difference is only one step, acceptance is easy. One could consolidate the suggestion put forth in Marslen-Wilson et al. (1994) and propose a concrete solution to the representational issue.

4.2. Abstractness versus vagueness

Abstractness and underspecification are not vagueness. Of course the signal plays a role – one cannot extract features had they not been there. But precisely because the signal is noisy, it is perhaps not inappropriate to suggest that our lexical representations are less rigid. Contextual experience must count as well. After all, if we always encounter hambag, why should we store only handbag? Abstract storage is only necessary when there is variation, and since assimilation is optional and variable, some of the time it may be incomplete. To allow for both options, the relevant ARTICULATOR feature is left unspecified. However, in time, a reanalysis may occur.

4.3. Variation and production

If the assumption is that the lexical entry is the same for production and perception, does the listener undo all the rules that the speaker produces? Our claim is no. The lexical form representation may be the same, but the speaker has complete knowledge, and indeed control, of what she wishes to say and how she will say it.


The listener is entirely dependent on the speaker and has to cope with the variation as best she can. I will take the {-ity} example to show how the surface form is achieved.

(17) Root vowels in the TS alternations

Lexical representation of root    Feature and mora added/deleted    Surface form of stressed vowel
μμ, ART [—], TH [LOW]             add [CORONAL], delete [LOW]       [eɪ]  sane
μμ, ART [—], TH [LOW] + {ity}     delete μ, add [CORONAL]           [æ]   sanity
μμ, ART [—], TH [—]               add [CORONAL], add [HIGH]         [iə]  sincere
μμ, ART [—], TH [—] + {ity}       delete μ, add [CORONAL]           [ɛ]   sincerity
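Read as instructions to the speaker, the entries in (17) can be sketched as follows. The rule format and the feature-to-vowel spell-out are invented for exposition; only the feature and mora edits themselves come from the table.

# Production side of (17): lexically stored rules are feature and mora edits
# applied to the shared underspecified root vowel.
def realise(root_vowel: set, moras: int, rules: list) -> tuple:
    """Apply an entry's stored rules; return the resulting features and moras."""
    v = set(root_vowel)
    for rule in rules:
        op, feat = rule.split()
        if feat == "mora":
            moras += 1 if op == "add" else -1
        elif op == "add":
            v.add(feat)
        else:
            v.discard(feat)
    return v, moras

SANE_ROOT = ({"LOW"}, 2)   # bimoraic, ART unspecified, TH [LOW]

print(realise(*SANE_ROOT, ["add CORONAL", "delete LOW"]))   # sane: [eɪ]-like
print(realise(*SANE_ROOT, ["delete mora", "add CORONAL"]))  # sanity: [æ]-like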

Since these phonological rules are lexical and not meant for all lexical forms, they need to be coded within the lexical entry. That is, the words that require these production rules will have this information in the lexicon. The listener's duty is not to recreate these rules. The matching procedure outlined in (16) will access the correct form by the simple process of accepting or rejecting conflicting or non-conflicting information extracted from the signal. If we go back to the possible and impossible variants in (13), partially repeated below, the "regular rules" produce the "possible" variants, while the "mispronounced variants" are those that arise either as production mistakes or when the listener picks up a feature from the signal due to background noise. The table below recreates a scenario where possible and impossible variants are produced, the latter either as mistakes by the speaker or as perceived by the listener due to environmental noise.

(18) From lexicon to speech output

The nonword variant *[dɑ̃n] involves two steps: the nasalisation of the vowel is a normal, expected process, while the change of the ARTICULATOR feature from /m/ to [n] is not due to any existing rule. Its occurrence is attributable to a mistake by the speaker or to a noisy signal. Nevertheless, the listener has to decode both in order to identify the word.


Within Model 1, if a variant such as this is stored, then of course the word is accessible. In Model 2, if the change is incomplete, then too the underlying form is recognised. Under the other models, this is not the case. A listener normally does not experience a change from /m/ to [n] in any context, and consequently Model 3 will not store the variant. Under FUL, the coronality of [n] will conflict with the specified [LABIAL], and hence the real word will not normally be activated. Only due to sentential or semantic context will the word be recovered. A comparison of (13) and (18) illustrates the fairly dissimilar tasks of the speaker and the listener making use of an identical lexicon.

Lexical representation   Speaker's planned feature            Mistakes of speaker, or perceived
of word                  addition or deletion rules           incorrectly by listener
khæt  'cat'                                                   [NASAL] added to V: *khæ̃t
khæn  'can'              add [NASAL] to V: *khæ̃n              [DORSAL] deleted from initial C: *thæ̃n
                         add [LABIAL] to final C: *khæ̃m
pɑ    'leg'                                                   add [NASAL] to V: *pɑ̃
gɑ̃    'village'                                               delete [NASAL] from V: *gɑ
dɑm   'price'            add [NASAL] to V: *dɑ̃m               add [LABIAL] to initial C: *bɑ̃m
                                                              delete [LABIAL] from final C: *dɑ̃n

5. Conclusion

Words are not holistic in their structure. Their phonological representations are made up of discrete segments and features. Furthermore, the representations are not identical to the surface variants of each word; rather, the representation is discreet, abstract, and underspecified. No two pronunciations of a word are acoustically identical: a fingerprint may identify a person, but no word-print uniquely distinguishes an individual (Nolan 2005). Approaches to variants differ, and the literature has usually dealt with contextual variation. The representation I have advocated in this chapter is abstract and underspecified, and hence asymmetric in the way contrasts are expressed.


If the language has a two-way height contrast, only one height feature is required, since the other remains underspecified. This means that the representation is automatically asymmetric. The real claim is that asymmetric representation governs phonological processing and language change, because reanalysis is only possible if the native speaker has reanalysed the representation based on what she hears. Her production process will change because she will no longer require a rule to produce the alternation. For instance, all [o]s became [ö] when [i] followed in Old High German. If the alternation between [o] and [ö] is transparent, then the native speaker and listener will keep a single underlying representation (CöCi~CoCe). Once the alternation stops being transparent – for example, when the speaker introduces a new rule changing final unstressed [i] to [e] – alternations such as CöCi~CoCe in Middle High German become opaque. The listener does not know the source of the final [e] and cannot tell where the /ö/ comes from. Consequently, these words developed an underlying contrast between /ö/ and /o/. The listener who has introduced /ö/ into the lexicon has no need for a rule to change /o/ to /ö/, because it is now underlyingly /ö/. The result is that in the representation both [DORSAL] and [LABIAL] are required to be specified to express the contrast: /o/ [DORSAL, LABIAL], /ö/ [—, LABIAL], /e/ [—, —]. Before, the contrast was only between /e/ and /o/, and either a [LABIAL] or a [DORSAL] specification was enough.

The asymmetries in the representation need not always produce asymmetry in processing. This depends on the contrast. Asymmetries occur when there is more than a two-way contrast under one node, and always for [CORONAL]. Let us first look at non-ARTICULATOR features with two-way contrasts. For example, [VOICE] or [NASAL] always represent a two-way contrast, and their absence does not count as a feature to be extracted. Thus, if there is an alternation in voicing, as in many languages, only [VOICE] is specified and extracted. Voicelessness is not a property which is extracted, and it does not play a role in the conflict or non-conflict relationship. This means that voiceless consonants do not conflict with voiced ones, and vice versa. Of course, there is the issue of a better or less perfect match. Lahiri and Reetz (2002) formalised an algorithm for their computational modelling to assess the matching procedure, but this has not been put to the test in an experimental paradigm. As for ARTICULATOR features, the idea is that no matter what system the child is faced with, [CORONAL] is a must. No language occurs without a [CORONAL] segment.


Furthermore, no language exists without an ARTICULATOR contrast, although there may be no height contrast. Hence, there is always a [CORONAL]~non-[CORONAL] representational contrast. There may be a single vowel with no PLACE specification, and a consonantal contrast in [CORONAL]~non-[CORONAL]. If there is a second vowel, the ARTICULATOR contrast will step in before the height contrast (Ghini 2001b; Lahiri and Reetz 2010; Levelt 1995). We have focused on the features that are extracted. Only [CORONAL] is extracted from the signal even though it is never specified. This may sound like a contradiction, but it is not. Coronality is salient, with its high-frequency energy, and it is always present in a natural language. Hence, there is no need to specify it in the representation. Nevertheless, it is extracted from the signal and therefore plays an important role in the conflict~non-conflict relationship.

Returning to assimilation versus noise: features may alter in metrical contexts and at word boundaries, and the domain can be lexical, bound by affixes, or postlexical, which means bound by phonological phrases. Features can be inserted, deleted or spread. All of this is governed by a phonological system of rules that are handled by the speaker using the same lexicon as the listener. The listener's task is to get from the speaker's output to the lexicon. The output may have been produced by rules, or the listener may perceive features that were not intended by the speaker. The comprehension system works on a global basis and does not distinguish them. Consequently, the "undoing" of rules is feasible to the extent that the extracted feature is mapped straight onto the lexicon and judged as to whether it is accepted as a variant of an existing word or not. If it conflicts with all the lexical representations, then phonology cannot help any longer; the listener must resort to syntactic and semantic context. Lahiri and Marslen-Wilson (1991) began by comparing the effect of sparse phonological representations when listeners are faced with either allophonic predictable alternation or with neutralisation, comparing the same process of vowel nasalisation in two languages. The crucial point was that the oral vowel in English (e.g. bag), which ought to have been accepted only as an oral vowel, was mapped on to vowels in both oral and nasal contexts (bag, ban), thereby showing that the allophonic nasalisation in ban was not lexically specified. This paper incited a research programme assessing phonological representations in the mental lexicon. Recent work with brain-imaging techniques also supports asymmetric representations (Friedrich et al. 2006, 2008), irrespective of whether a representation is subject to a regular neutralising process or not. Thus, oral vowels will always be unspecified for nasality, irrespective of whether there is a nasal vowel contrast and even if there is a neutralising rule converting oral vowels to nasals in a nasal context.


Even tonal representations appear to be asymmetric (Felder 2010). Although we have made some progress since then, much more work needs to be done. Perhaps accepting that phonological representations are both discrete and discreet may assist rather than retard our understanding of how words are represented, accessed, and change.

Acknowledgements

The research programme outlined here would never have been possible without the inspiration of William Marslen-Wilson, to whom the editors of this book, as well as I, owe a deep sense of gratitude.

Notes
1. This change was discovered by, and is named after, Jacob Grimm, one of the brothers behind the famous Grimms' fairy tales.
2. When a syllabic suffix is added, one can see that the initial vowel is short: [ʃɑp-er], [ɖɑb-er] etc. 'snake-GEN, green coconut-GEN'.
3. The OE symbol was <y>, which is also the IPA symbol.
4. The research work was made possible particularly through the close cooperation of many young colleagues on several research grants and research prizes. Of special mention, in alphabetical order, are Carsten Eulitz, Verena Felder, Paula Fikkert, Claudia Friedrich, Astrid Kraehenmann, Henning Reetz, Allison Wetterlin, Linda Wheeldon and Frank Zimmerer.
5. For instance, if F1 < 300 Hz, the feature [HIGH] would be extracted.
6. Final [i] is deleted in a weak branch of a foot – High Vowel Deletion.
7. Although Mohanan argues against the underspecification of contrastive features, and puts forward a claim for "fields of attraction", he too states that coronal assimilation is the most frequent, and that other features may assimilate, but only if coronal assimilation also exists.
8. In Coenen, Zwitserlood, and Bölte (2001), results indicate that the unexpected variant /m/ to [n] is also accepted by the listener. I imagine that in this case this variant must be stored.


References

Chomsky, Noam and Morris Halle 1968 The sound pattern of English. New York: Harper and Row.
Clahsen, Harald, S. Eisenbeiss, M. Hadler, and I. Sonnenstuhl 2001 The mental representation of inflected words: An experimental study of adjectives and verbs in German. Language 77: 510-543.
Clahsen, Harald and K. Neubauer 2010 Morphology, frequency, and the processing of derived words in native and non-native speakers. Lingua 120: 2627–263.
Coenen, E., Pienie Zwitserlood, and Jens Bölte 2001 Variation and assimilation in German: Consequences of assimilation for word recognition and lexical representation. Language and Cognitive Processes 16: 535-564.
Ford, M. A., M. H. Davis, and William D. Marslen-Wilson 2010 Derivational morphology and base morpheme frequency. Journal of Memory and Language 63: 117-130.
Friedrich, Claudia, Carsten Eulitz, and Aditi Lahiri 2006 Not every pseudoword disrupts word recognition: An ERP study. Behavioral and Brain Functions 2: 1-36.
Friedrich, Claudia, Aditi Lahiri, and Carsten Eulitz 2008 Neurophysiological evidence for underspecified lexical representations: Asymmetries with word initial variations. Journal of Experimental Psychology: Human Perception and Performance 34: 1545-1559.
Gaskell, M. Gareth and William D. Marslen-Wilson 1996 Phonological variation and inference in lexical access. Journal of Experimental Psychology: Human Perception and Performance 22: 144-158.
Gaskell, M. Gareth and William D. Marslen-Wilson 1998 Mechanisms of phonological inference in speech perception. Journal of Experimental Psychology: Human Perception and Performance 24: 380-396.
Gaskell, M. Gareth and William D. Marslen-Wilson 2001 Lexical ambiguity and spoken word recognition: Bridging the gap. Journal of Memory and Language 44: 325-349.
Ghini, Mirco 2001a Asymmetries in the phonology of Miogliola. Berlin: Mouton.
Ghini, Mirco 2001b Place of articulation first. In Distinctive feature theory, T. Alan Hall (ed.), 71-146. Berlin: Mouton.
Gow, David W. 2001 Assimilation and anticipation in continuous spoken word recognition. Journal of Memory and Language 45: 133-159.
Gow, David W. 2002 Does English coronal place assimilation create lexical ambiguity? Journal of Experimental Psychology: Human Perception and Performance 28: 163-179.
Gow, David W. 2003 Feature parsing: Feature cue mapping in spoken word recognition. Perception and Psychophysics 63: 575-590.
Lahiri, Aditi and Paula Fikkert 1999 Trisyllabic Shortening in English: Past and present. English Language and Linguistics 3: 229-267.
Lahiri, Aditi and William D. Marslen-Wilson 1991 The mental representation of lexical form: A phonological approach to the recognition lexicon. Cognition 38: 245-294.
Lahiri, Aditi and William D. Marslen-Wilson 1992 Lexical processing and phonological representation. In Papers in laboratory phonology II, G. J. Docherty and D. R. Ladd (eds.), 229-254. Cambridge: Cambridge University Press.
Lahiri, Aditi and Frans Plank 2010 Phonological phrasing in Germanic: The judgement of history, confirmed through experiment. Transactions of the Philological Society 108(3): 370-398.
Lahiri, Aditi and Henning Reetz 2002 Underspecified recognition. In Labphon 7, Carlos Gussenhoven and Natasha Warner (eds.), 637-676. Berlin: Mouton.
Lahiri, Aditi and Henning Reetz 2010 Distinctive features: Phonological underspecification in representation and processing. Journal of Phonetics 38: 44-59.
Levelt, Clara 1995 Segmental structure of early words: Articulatory frames or phonological constraints. In The Proceedings of the Twenty-seventh Annual Child Language Research Forum, Eve V. Clark (ed.), 19-27. Stanford: CSLI.
Marslen-Wilson, William, Lorraine K. Tyler, Rachelle Waksler, and L. Older 1994 Morphology and meaning in the English mental lexicon. Psychological Review 101: 3-33.
McCarthy, John J. and A. Taub 1992 Review of Carole Paradis and Jean-François Prunet (eds.) (1991), The special status of coronals: Internal and external evidence. Phonology 9: 363-370.
Mohanan, K. P. 1993 Fields of attraction in phonology. In The last phonological rule: Reflections on constraints and derivations, John Goldsmith (ed.), 61-116. Chicago: University of Chicago Press.
Nolan, Francis 2005 Forensic speaker identification and the phonetic description of voice quality. In A Figure of Speech, W. Hardcastle and J. Beck (eds.), 385-411. Mahwah, New Jersey: Erlbaum.
Scharinger, Mathias, Henning Reetz and Aditi Lahiri 2009 Levels of regularity in inflected word form processing. The Mental Lexicon 4: 77-114.
Snoeren, Natalie D., Gareth M. Gaskell and A. M. Di Betta 2009 The perception of assimilation in newly learned novel words. Journal of Experimental Psychology: Learning, Memory and Cognition 35: 542-549.
Steriade, Donca 1995 Underspecification and markedness. In Handbook of phonological theory, John Goldsmith (ed.), 114-174. Oxford: Blackwell.
Zimmerer, Frank, Henning Reetz, and Aditi Lahiri 2009 Place assimilation across words in running speech: Corpus analysis and perception. Journal of the Acoustical Society of America 125: 2307-2322.

Neural systems underlying lexical competition in auditory word recognition and spoken word production: Evidence from aphasia and functional neuroimaging

Sheila E. Blumstein

1. Introduction

Language is in the service of communication. Whether we are communicating about feelings, wants, and desires, or about the state of the world, or about cognitive neuroscience, we engage a language processing system which typically works quickly and efficiently. Among other aspects of speaking and understanding, we need to select the appropriate words from our mental lexicon, a lexicon which contains thousands of words, many of which share sound-shape properties. The functional architecture of a number of models of word recognition and spoken word production, including the one adopted in this chapter, assumes that in the selection of a target word, a smaller set of candidates from the lexicon is ultimately activated based on these shared phonological properties (Dell 1986; Gaskell and Marslen-Wilson 1999; Marslen-Wilson 1987; see Goldrick 2006 for a recent review). These multiple candidates compete with each other, and, in the end, this competition must be resolved and the appropriate lexical candidate selected (cf. Levelt 1999, 2001 for a different functional architecture in spoken word production). It is the goal of this chapter to examine the neural systems underlying lexical competition in both word recognition and spoken word production. Such study provides a unique window into how the mind maps on to the brain and, as a consequence, gives insights into the nature of the mechanisms and processes underlying competition and its resolution. It may also provide confirming or challenging evidence for current models of word recognition and spoken word production. To this end, we will consider results from both studies of aphasic patients and neuroimaging studies which speak to whether the functional architecture of the lexical processing system is interactive or modular; whether similar or different neural systems are recruited when competition is overt in the contextual environment or when it is implicit and intrinsic to the properties of the stimulus; whether


there is a common lexicon for word recognition and spoken word production; and whether the resolution of lexical competition is domain-specific or domain-general. To presage the findings, we will argue that the evidence supports the view that the language processing system is interactive, as shown by the modulation of activation patterns throughout the lexical processing system as a function of lexical competition. We will provide evidence showing that similar neural systems are recruited under conditions of competition irrespective of the presence of the competitor in the environment. We will also propose that there is a common lexicon for word recognition and spoken word production, as evidenced by shared neural resources in accessing words under conditions of lexical competition. Finally, we will consider whether the inferior frontal gyrus (IFG) shares some neural resources in resolving competition across levels of the grammar (phonetics/phonology and semantics). As we will see, a neural system in the left hemisphere is recruited under conditions of competition in both auditory word recognition and spoken word production that includes regions in temporal, parietal, and frontal areas. As Figure 1 shows, these include the posterior portion of the left superior temporal gyrus (STG) in the temporal lobe, the supramarginal gyrus (SMG) and angular gyrus (AG) in the parietal lobe, and the inferior frontal gyrus (IFG) in the frontal lobe. Each of these areas appears to be part of the processing stream that ultimately contributes to the selection of a word for word recognition or for word production. It has been proposed that the posterior portion of the STG and the SMG are involved in the mapping of sound structure onto the lexicon (Hickok and Poeppel 2000, 2004; Scott and Wise 2004), that stored lexical representations are accessed in parietal areas including the SMG and AG (Binder and Price 2001; Geschwind 1965; Hickok and Poeppel 2000; Indefrey and Levelt 2004), that the IFG is involved in the executive control processes underlying selection among competing alternatives (Thompson-Schill et al. 1998; Thompson-Schill, D'Esposito, and Kan 1999), and that within the IFG, the posterior portion (Brodmann area (BA) 44) is recruited in phonological planning processes (Guenther 2006; Huang, Carr and Cao 2001; Indefrey and Levelt 2004; Poldrack et al. 1999). For the purposes of this chapter, we will focus exclusively on the role of left hemisphere structures in lexical processing. The neuropsychological literature clearly implicates the left hemisphere in this processing stream. Some neuroimaging studies show right hemisphere activation in various tasks and under various conditions; however, such findings tend to be less consistent across studies, tend to be ignored even when shown, and are generally not considered to play a major role in lexical processing.

Neural systems underlying lexical competition

125

Figure 1. Neural system recruited under conditions of competition in both auditory word recognition and spoken word production. IFG refers to the inferior frontal gyrus, STG, the superior temporal gyrus, SMG, the supramarginal gyrus, and AG, the angular gyrus. For colour version see separate plate.

1.1. Converging Evidence from Lesion Studies and Functional Neuroimaging

Both studies of aphasia and functional neuroimaging studies provide a rich set of findings about the neural systems underlying lexical competition. However, neither alone can give the whole picture. Lesions of aphasic patients tend to be large, making it difficult to determine which areas are contributing to the particular language function being explored. And while localization of activation patterns in functional neuroimaging is precise, typically multiple areas are activated, and it is impossible from the neuroimaging data to determine whether a specific area is necessary for the particular function under study (Price et al. 1999). Together, however, the study of both aphasic patients and neuroimaging data allows for the integration of aspects of the 'best of both'. Areas of activation in a neuroimaging study should show pathological performance in an aphasic patient, if that area is engaged in the process, function, or task under study. And if multiple areas appear to be recruited in a neuroimaging study, potential differences in the functional role of those areas may be revealed by showing different patterns of pathological performance in aphasic patients who have lesions in these areas.

Both lesion studies and neuroimaging experiments utilize a set of behavioral measures in investigating lexical competition. These methods are typically drawn from those used in the study of word recognition and spoken word production in the psycholinguistic literature. Nonetheless, in the best of all worlds, studies providing converging evidence from lesion studies and functional neuroimaging should share both stimulus materials and behavioral tasks. A common set of stimuli assures that similar parameters are controlled for and that the level of complexity or difficulty of processing the stimuli is equivalent. Using the same task is perhaps even more important. Different behavioral tasks can tap very different cognitive processes, and these cognitive processes may have very different neural sequelae. Asking a subject to do phoneme monitoring on a set of word stimuli is very different from asking the same subject, for example, to select the picture of a word from a pictorial stimulus array, even though both may tap lexical processing. And it has been shown that different task demands recruit different neural systems (Poeppel 1996). Comparing the performance of patients with neuroimaging findings in normals using different tasks makes it a challenge to unpack the neural systems recruited for a particular language function from the neural systems recruited for accomplishing the task. That said, the number of studies in which the stimuli and tasks are the same is relatively small. Thus, we will consider both evidence that uses the same stimuli and methods and evidence that does not as we examine the neural systems underlying lexical competition.

2. Lexical Competition in Auditory Word Recognition

2.1. Modulatory effects of lexical competition

Building on the seminal work of Marslen-Wilson (1987; Marslen-Wilson and Welsh 1978), most current models of auditory word recognition assume that auditory information is used online to help in the process of lexical selection. In this view, at word onset, all word candidates in the mental lexicon that share a target word's onset are activated. As the auditory input unfolds and more phonetic/phonological information becomes available, this information is used to pare down the potential set of word candidates until the word can be uniquely identified. Thus, there is activation of, and hence competition among, these multiple candidates up to the point where the sound structure disambiguates the auditory input and the target word can be selected (Luce and Pisoni 1998; Marslen-Wilson and Welsh 1978; McClelland and Elman 1986; Norris 1994).
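The winnowing logic these models share can be sketched in a few lines of code. The toy lexicon below is invented for illustration and letters stand in for phonemes; the sketch is not a claim about any particular model's implementation:

```python
# A minimal sketch of cohort-style competition: candidates compatible
# with the input heard so far. The toy lexicon and letter-for-phoneme
# transcription are illustrative assumptions, not the studies' stimuli.
LEXICON = ["hammock", "hammer", "hamster", "candle", "candy"]

def cohort(prefix, lexicon=LEXICON):
    """Return the word candidates that still match the unfolding input."""
    return [w for w in lexicon if w.startswith(prefix)]

target = "hammock"
for i in range(1, len(target) + 1):
    print(target[:i], "->", cohort(target[:i]))
# "ham"   -> ['hammock', 'hammer', 'hamster']  multiple active competitors
# "hammo" -> ['hammock']                       input disambiguated; target selected
```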

There is a rich behavioral literature that supports these assumptions. Using a range of behavioral methodologies including lexical decision (Marslen-Wilson 1989; Zwitserlood 1989), gating (Grosjean 1980), and word recognition latencies (Luce and Pisoni 1998), it has been shown that the selection of a word is affected by the presence of onset competitors. More recently, the visual world paradigm has been used to track online the effects of selecting a word that has an onset competitor (e.g., Allopenna, Magnuson, and Tanenhaus 1998; Tanenhaus et al. 1995). In this paradigm, eye movements are tracked as subjects are asked to select an auditory target from a visual array of objects. Results of behavioral studies show that given an auditory target, e.g., hammock, and a four-picture array including the target, an onset competitor, e.g., hammer, and two objects phonologically and semantically unrelated to the target or onset competitor, there are increased looks to the onset competitor compared to the unrelated items until the auditory input is disambiguated. After that, when the target word is uniquely identified, the subject's looks are to the target object (Allopenna et al. 1998; Dahan, Magnuson, and Tanenhaus 2001b; Dahan et al. 2001a; Tanenhaus et al. 1995). These results indicate that there is activation of both the target word and its onset competitor, and both remain active until the lexical competition is resolved.

In a recent study, the neural systems underlying word-onset lexical competition were examined using the visual world paradigm during functional neuroimaging (Righi et al. 2009). While in the scanner, subjects' eye movements were tracked as they were presented with a four-picture display in two conditions: a competitor condition in which the four-picture display included a target object, an onset competitor object, and two unrelated objects, and a no-competitor condition in which the same target object used in the competitor condition was presented with three unrelated objects. Subjects' task was to look at the appropriate picture corresponding to the auditorily presented target word. Behavioral results replicated the onset competitor effects shown previously, with more looks to the onset competitor object than to the unrelated objects. Neuroimaging results revealed a number of clusters of activation in which there was greater activation in the competitor compared to the no-competitor condition. These clusters were located in the left posterior STG and SMG, in the left IFG (BA44 and 45), and in the insula extending into BA47 (see Figure 2). The two IFG clusters that emerged, one in BA44 and the other in BA45, are consistent with other neuroimaging data suggesting that there is a functional subdivision of the IFG, with BA44 recruited in phonological processing and BA45 recruited in semantic processing (Buckner, Raichle, and Petersen 1995; Burton, Small, and Blumstein 2000; Fiez 1997; Poldrack et al. 1999) (see Section 4 on domain specificity for the resolution of competition). Thus, activation in both temporo-parietal and frontal structures was modulated by the presence of onset competitors.

There is nothing directly in the activation patterns that indicates the functional role these clusters play. However, a series of fMRI studies with normals and with aphasic patients provides some clues. They suggest that while both the STG/SMG and IFG are recruited under conditions of onset competition, their roles are indeed different, and together they form a processing stream engaging multiple neural areas in which information flow at one level of processing modulates and influences other stages of processing downstream from it.

In particular, results from neuroimaging studies suggest that mapping from sound to lexical form recruits the STG, the SMG, and also the angular gyrus (AG) in the parietal lobe (Hickok and Poeppel 2000). That there is activation in these areas as a function of onset competition is consistent with these data, as they suggest that the STG/SMG is sensitive to and modulated by the phonological properties of the lexicon. Indeed, activation in the STG/SMG is sensitive not only to onsets but also to lexical density, i.e., the number of words in the lexicon which share phonological properties across phonetic positions (Okada and Hickok 2006; Prabhakaran et al. 2006). Results of these studies show that there is increased activation in the SMG and posterior STG when subjects make lexical decisions on words from high-density neighborhoods, where there are many competitors, i.e., many words that share phonological properties with a target, compared to words from low-density neighborhoods, where there are few competitors, i.e., few words that share phonological properties with a target (cf. Luce and Pisoni 1998 for discussion of behavioral effects with normals). These results suggest that the greater the number of competitors, and hence the harder it is to select a lexical candidate, the more neural resources are required. And lesions in the posterior STG and SMG produce deficits in discriminating words that are phonologically similar and hence compete with each other (Caplan, Gow, and Makris 1995).
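Neighborhood density of this kind is straightforward to compute. The sketch below assumes the common one-segment substitution/insertion/deletion definition of a phonological neighbor and an invented letter-based toy lexicon; it is meant only to make the notion of high- versus low-density words concrete:

```python
# Count phonological neighbors: words differing by one substitution,
# insertion, or deletion. Toy letter-based lexicon; illustrative only.
ALPHABET = "abcdefghijklmnopqrstuvwxyz"

def neighbors(word, lexicon):
    out = set()
    for i in range(len(word) + 1):
        for c in ALPHABET:
            if i < len(word):
                out.add(word[:i] + c + word[i + 1:])  # substitution
            out.add(word[:i] + c + word[i:])          # insertion
        if i < len(word):
            out.add(word[:i] + word[i + 1:])          # deletion
    out.discard(word)
    return out & lexicon

LEXICON = {"cat", "bat", "hat", "cap", "can", "cast", "dog"}
print(len(neighbors("cat", LEXICON)))  # 5 -> a high-density word
print(len(neighbors("dog", LEXICON)))  # 0 -> a low-density word
```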

Figure 2. Clusters of activation in which there was greater activation in the competitor compared to the no-competitor condition. These clusters were located in the left posterior STG and SMG, in the left IFG (BA44 and 45) and in the insula extending into BA47. (A) Clusters in the LIFG showing greater activation for competitor trials compared to no-competitor trials. Sagittal slice shown at x = 35 and coronal cut shown at y = 15. (B) Cluster in the LIFG (BA 45). Axial slice shown at z = 11. (C) Cluster in the LIFG (BA 44/45). Axial slice shown at z = 17. (D) Cluster in the left temporo-parietal region showing greater activation for competitor trials compared to no-competitor trials. Sagittal slice shown at x = 50, coronal slice shown at y = 20. (reprinted with permission from Righi et al. 2009, Figure 3). For colour version see separate plate.

In contrast to the proposed functional role of temporal and parietal areas, evidence from neuroimaging suggests that the IFG is involved in domain-general "executive control functions" (e.g., Duncan 2001; Duncan and Owen 2000; Miller and Cohen 2001; Smith and Jonides 1999), and, in particular, in response selection among competing alternatives. In a series of studies using a variety of tasks, Thompson-Schill and colleagues have shown that the IFG is recruited in selecting among competing conceptual alternatives (Thompson-Schill, D'Esposito, and Kan 1999; Thompson-Schill et al. 1997, 1998; Snyder, Feigenson, and Thompson-Schill 2007). Aphasic data also suggest that the IFG is recruited in selecting among conceptual alternatives. For example, Broca's aphasics show impairments in resolving meanings of ambiguous words (Milberg, Blumstein, and Dworetzky 1987). That the Righi et al. (2009) visual-world fMRI study showed increased activation under conditions of lexical competition in the IFG suggests that this area is recruited not only in resolving competition among conceptual alternatives but also in resolving competition among phonologically similar lexical alternatives. Even though subjects ultimately must access a conceptual representation of a word in order to select the appropriate picture, it is the presence of lexical competition and the activation of multiple lexical candidates that modulates, i.e., influences, the activation patterns in the IFG (cf. also Gold and Buckner 2002).

Perhaps the strongest evidence that both posterior and anterior areas are recruited under conditions of competition and have different functional roles comes from a study using the same eyetracking paradigm discussed above to examine the effects of onset competition on auditory word recognition (Yee, Blumstein, and Sedivy 2008). In this study, both Wernicke's aphasics, with lesions including the posterior STG and typically extending into the SMG and angular gyrus (AG), and Broca's aphasics, with lesions including the IFG, showed pathological patterns. In particular, although both groups of patients were able to select the appropriate target word, Wernicke's aphasics showed a larger competitor effect than age-matched controls, whereas Broca's aphasics showed a weak (and non-significant) competitor effect. The fact that lesions in these areas, which showed activation in the neuroimaging study, resulted in pathological performance indicates that both areas play a functional role and are influenced by the phonological sound shape of lexical competitors. The fact that the behavioral patterns of the two groups systematically varied indicates that the functional role of these areas differs (cf. also Janse 2006, who showed similar results using a lexical decision paradigm). Taken together, these results suggest that the fronto-temporo-parietal areas are part of a common network that underlies the processes involved in auditory word recognition.

2.2. Lexical competition effects on access to meaning

One possible criticism of the onset competitor studies described above is that the target stimulus and the onset competitor both appeared in the stimulus array, and hence competition was overtly created in the response set. It is also possible that the subjects attempted to subvocally name the objects, thus activating both the target word and its onset competitor. Thus, it is not clear whether similar neural structures would be activated if phonological competition were implicit and not directly present in the stimulus set. The question is whether a word like hammock will implicitly activate its phonological competitor hammer. One way that this can be demonstrated is to examine whether the presentation of a target word can activate the semantic associate of the target word's phonological competitor, i.e., whether hammock would activate nail, the semantic associate of the phonological competitor hammer. Such a finding would provide strong evidence that lexical competition arises from properties intrinsic to the word recognition system. It would also be consistent with a functional architecture of the language system in which there is interaction between lexical (phonological) and semantic levels of processing (e.g., Dell et al. 1997; Marslen-Wilson 1987; Peterson and Savoy 1998; Zwitserlood 1989). In this view, the presentation of an auditory target word activates the lexical form of that word and its phonological competitors; this multiple set of competitors in turn activates its respective lexical-semantic networks.

To examine this question, Yee and Sedivy (2006) utilized the visual world paradigm. In the experimental condition, subjects were presented with a four-picture display consisting of a target object (hammock), a semantic associate of the phonological competitor of the target (nail), and two other pictures that were not semantically or phonologically related to the target or to the semantic associate of the phonological competitor. Results showed a mediated competitor effect; subjects looked more at the semantic associate of the target's phonological competitor than at the two unrelated items.
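The mediated effect involves two steps of spreading activation: from the heard word to its phonological cohort, and from each cohort member to its semantic associates. The sketch below makes this concrete; the words and weights are invented for illustration and are not taken from the Yee and Sedivy stimuli:

```python
# Hypothetical two-step spreading activation for the mediated competitor
# effect (hammock -> hammer -> nail). All weights are invented.
PHONOLOGICAL_COHORT = {"hammock": {"hammock": 1.0, "hammer": 0.6}}
SEMANTIC_ASSOCIATES = {"hammock": {"beach": 0.5}, "hammer": {"nail": 0.7}}

def spread(heard):
    act = dict(PHONOLOGICAL_COHORT[heard])            # step 1: cohort activation
    for word, a in list(act.items()):
        for assoc, w in SEMANTIC_ASSOCIATES.get(word, {}).items():
            act[assoc] = act.get(assoc, 0.0) + a * w  # step 2: semantic spread
    return act

print(spread("hammock"))
# {'hammock': 1.0, 'hammer': 0.6, 'beach': 0.5, 'nail': 0.42}
# 'nail' becomes active although it is neither heard nor related to
# 'hammock' itself -- the signature of a mediated competitor effect.
```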

Two studies explored the neural systems underlying this mediated effect, one using functional neuroimaging with normal subjects and the other using aphasic patients. The design of the experiments was analogous to that described earlier examining the neural systems underlying phonological onset competitor effects.

The fMRI study exploring effects of mediated competition (Righi 2009) showed that the neural systems driving this effect emerged in the parietal and frontal lobes and were similar to those that gave rise to the onset competitor effect. Some differences in the exact location within these areas also emerged. In the parietal lobe, clusters emerged in the inferior parietal lobule extending into the SMG, and in the frontal lobe, clusters emerged in the IFG as well as in the middle frontal gyrus (MFG) extending into the superior frontal region. That the clusters that emerged were not exactly the same as those found in the onset competitor study is not surprising, given that this task required the interaction of phonological and semantic properties whereas the onset competitor studies focused solely on the influence of phonological factors on activation patterns. In fact, neuroimaging experiments have shown activation of the inferior parietal lobule in phonological (McDermott et al. 2003) as well as semantic judgment tasks (Dapretto and Bookheimer 1999; Demonet et al. 1992; de Zubicaray et al. 2001). And activation has been shown in the middle frontal gyrus (MFG) as well as the IFG in semantic processing tasks (Kotz et al. 2002; McDermott et al. 2003; Raposo et al. 2006; Rissman, Eliassen, and Blumstein 2003). Of importance, a region of interest analysis showed sensitivity within the IFG (BA44, 45, 47) to the presence of a mediated competitor, consistent with the view that competition was induced both phonologically (between the mediated phonological competitor and the target) and semantically/conceptually (between the semantic associate of the phonological competitor and the target).

The results of the eyetracking study with the aphasic patients (Yee, Blumstein, and Sedivy 2008) once again showed pathological performance for both Broca's and Wernicke's aphasics, with differing patterns between the groups. Similar to the phonological onset competitor results, Wernicke's aphasics showed a larger mediated competitor effect than age-matched controls (this effect approached significance), and Broca's aphasics failed to show a mediated competitor effect. The failure of Broca's aphasics to show a mediated competitor effect is consistent with earlier studies showing the effects of implicit lexical competition on semantic priming in a lexical decision task. Results with normal subjects show that primes that are poorer phonetic exemplars produce significantly less semantic facilitation than primes that are good exemplars, whether the prime stimulus has a voiced lexical competitor (e.g., pear with the competitor bear) or not (e.g., pot has no lexical competitor 'bot') (Andruski, Blumstein, and Burton 1994). In contrast, although Broca's aphasics with anterior left hemisphere lesions show similar patterns of performance for prime stimuli without competitors, they lose semantic priming for words with lexical competitors (Misiurski et al. 2005; Utman, Blumstein, and Sullivan 2001). These findings suggest that frontal areas are recruited in word form selection, as all subjects need to do is determine whether the target stimulus is a word or not. They also show that activation in frontal areas is modulated by the presence of lexical competition even when competition is implicit.

Taken together, these findings indicate that the neural system responds to competition that is inherent to the structural properties of the lexicon, and that the modulation of competition is driven neither by the overt presence of the competitor in the stimulus array nor by selection requirements from among alternatives in the environmental context. Moreover, the results show that word recognition recruits a processing stream in which properties of the stored lexical representations recruited in the parietal lobe have a cascading effect and modulate activation in frontal areas where the target stimulus is ultimately selected.


3. Lexical Competition in Spoken Word Production

3.1. Shared neural resources with auditory word recognition

There is a rich behavioral literature showing that lexical competition affects spoken word production. Most studies have examined competition effects as a consequence of shared semantic properties of words (Lupker 1979; Rosinski 1977). However, such effects have also been shown under conditions of lexical, i.e., phonological, competition. For example, picture naming is affected by the presence of a phonological competitor. Using the picture-word interference (PWI) paradigm, Lupker (1982) showed that naming latencies are facilitated when subjects are asked to name a picture of an object and ignore a written word (distractor) placed across the picture that is phonologically similar to the name of the target. Similar facilitatory effects in picture naming emerge when the distractor word is presented auditorily. Not only do facilitatory effects emerge when words share onsets, but naming latencies for pictures are also faster when their names come from dense neighborhoods, i.e., when the words have many phonologically similar neighbors, than when their names come from sparse neighborhoods, i.e., when the words have few phonologically similar neighbors (Vitevitch 2002).

The basis of this effect is consistent with a functional architecture of the lexical processing system in which not only is the phonological form of the target word to be named activated but so is the phonological form of the distractor (e.g., Damian and Bowers 2009; Schriefers, Meyer, and Levelt 1990). Because the phonological form of the distractor overlaps with that of the target, the phonological representation of the target word is boosted, giving rise to faster naming latencies (facilitation) when words share phonological properties compared to naming latencies for unrelated target-distractor pairs.
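On this account, facilitation falls out of simple summation over shared segments. The sketch below is a deliberately schematic illustration of that arithmetic (letters stand in for segments; the activation values are invented):

```python
# Schematic sketch of distractor-to-target segment boosting in the PWI
# paradigm. Letters stand in for segments; activation values are invented.
def segment_activation(target, distractor, target_act=1.0, distractor_act=0.5):
    """Activation reaching each segment of the target's phonological form."""
    shared = set(target) & set(distractor)
    return {seg: target_act + (distractor_act if seg in shared else 0.0)
            for seg in dict.fromkeys(target)}  # keep segment order, drop repeats

print(segment_activation("hammock", "hammer"))
# {'h': 1.5, 'a': 1.5, 'm': 1.5, 'o': 1.0, 'c': 1.0, 'k': 1.0}
# The boosted shared segments let the target's phonological plan win
# out sooner, yielding faster naming than with an unrelated distractor.
```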

Under this view, lexical competition, i.e., the activation of multiple lexical candidates that share phonological properties, occurs in both spoken word production and auditory word recognition. However, the presence of competitors has different behavioral consequences owing to the different processing demands across these two domains (Dell and Gordon 2003). In perception, phonologically related words are strongly activated by the incoming acoustic signal and the listener must select the target from among these phonologically related words, giving rise to interference effects. In production, the overlap in the number of sound segments that the competitor shares with the target word increases the activation of these shared segments in relation to the other sound segments in the lexicon and hence facilitates the processes involved in phonological planning and articulatory implementation.

In a recent fMRI study, Righi (2009) examined the neural systems underlying phonological onset competition in picture naming. This study utilized the PWI paradigm and included a number of competitor conditions to explore phonological, semantic, and mediated competition effects on spoken word production. Of importance, the stimuli in the phonological onset competition condition were a subset of those used in the eyetracking experiment investigating onset competition in auditory word recognition described earlier in this chapter. For example, while in the scanner, subjects were presented with the picture of a word to be named aloud, such as hammock, with the onset competitor distractor word hammer written across it. Behavioral results in the scanner replicated results from the literature, showing faster naming latencies for picture names that shared phonological onsets with the distractor stimuli compared to naming latencies for pictures that were phonologically and semantically unrelated to the distractors. FMRI results showed a broad network of activation similar to the network identified in the previous section exploring lexical competition effects in auditory word recognition. In particular, the presence of phonological competition modulated the activation patterns for word naming, with increased activation in temporo-parietal areas including the SMG and AG and in frontal areas including the IFG (BA44 and 45) extending into the precentral gyrus.

A number of recent studies have also used interference naming paradigms to examine the neural systems underlying phonological competition using fMRI (Abel et al. 2009; Bles and Jansma 2008; de Zubicaray et al. 2002; de Zubicaray and McMahon 2009; Righi 2009; but see Schnur et al. 2009 for a failure to show phonological interference effects using a blocked naming paradigm). Abel et al. (2009) used a picture-word interference paradigm in which subjects were asked to name a picture presented 200 ms after the presentation of an auditory distractor. Results of the Abel study were similar to those of Righi (2009), showing a broad activated network that encompassed posterior areas including the SMG and STG and frontal areas including the IFG (BA44) and the postcentral gyrus. Each of the other fMRI studies (Bles and Jansma 2008; de Zubicaray et al. 2002; de Zubicaray and McMahon 2009) identified a subset of the areas shown in the Righi (2009) study, including the posterior STG, SMG, and/or the IFG. Methodological and stimulus differences could account for the differences in activation patterns across these studies. Nonetheless, of importance, the Righi (2009) results showed that the same stimuli activated similar neural structures in both spoken word production and auditory word recognition. These findings tentatively suggest that a common lexicon serves both spoken word production and auditory word recognition processes (cf. also Gumnior, Bölte, and Zwitserlood 2006 for similar claims and Hillis and Caramazza 1991, Figure 1, for an alternate view). More research is necessary to examine this question and to ascertain the extent to which these areas may or may not overlap with each other.


3.2. Cascading effects of lexical competition on articulatory processes

The results described above showed that phonological competition affected access to words that shared their sound shape. Phonological competition also has a cascading effect on word production processes and affects the phonetic output of the target word itself (Baese-Berk and Goldrick 2009; Goldrick and Blumstein 2006). In particular, there is a larger vowel space in the production of words with many phonological neighbors compared to the production of words with few phonological neighbors (Munson 2007; Munson and Solomon 2004; Scarborough in press; Wright 2004). And the presence of a voiced minimal competitor influences the voice-onset time (VOT) of an initial voiceless stop consonant (Baese-Berk and Goldrick 2009). In this case, the VOT is longer for words that have a voiced competitor (pear vs. bear) than for words that do not (pot vs. bot). These effects presumably arise because the presence of a competitor requires greater activation of the target word to override that of its competitor(s), resulting in a 'hyperarticulated' production.

Examination of the neural systems underlying this effect provides a window into the extent of this cascading effect. It is possible that, similar to the studies described above, there will be modulation of activation of the temporo-parietal-frontal network, i.e., the posterior STG, SMG, and IFG. Modulation of activation in the IFG would not be surprising, since it has been shown not only that this area is sensitive to phonological competition but also that the IFG provides the neural substrate for phonetic planning processes (Bookheimer et al. 1995; Guenther 2006; Huang, Carr, and Cao 2001). However, activation of areas involved in articulatory processes per se, such as the precentral gyrus, would provide strong evidence that activation at the lexical level cascades throughout the speech production system, modulating activation in those neural areas involved not only in lexical selection and phonological planning (IFG), but also in motor plans for production (precentral gyrus).

And these are the results of a recent study (Peramunage et al. 2010). In this study, subjects were asked to read words aloud while in the scanner, and their productions were recorded and later analyzed acoustically for VOT. Test stimuli consisted of words beginning with voiceless stop consonants, half of which had voiced minimal pairs and half of which did not. Filler words beginning with a variety of consonants were also included. Of importance, the voiced minimal pair competitor never appeared in the stimulus set. Thus, any effects of phonological competition on production processes and their neural substrates would reflect properties intrinsic to the lexicon and not to the response set.
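Stimulus construction of this kind reduces to a simple lexicon lookup. The sketch below uses an invented letter-based toy lexicon and the voiceless-voiced onset pairings p/b, t/d, k/g; it illustrates the selection criterion, not the study's actual stimulus list:

```python
# Does a voiceless-onset word have a voiced minimal pair in the lexicon?
# Toy letter-based lexicon; illustrative only.
VOICED_OF = {"p": "b", "t": "d", "k": "g"}

def has_voiced_minimal_pair(word, lexicon):
    voiced_onset = VOICED_OF.get(word[0])
    return voiced_onset is not None and voiced_onset + word[1:] in lexicon

LEXICON = {"pear", "bear", "pot", "time", "dime", "cap"}
print(has_voiced_minimal_pair("pear", LEXICON))  # True  ('bear' is a word)
print(has_voiced_minimal_pair("pot", LEXICON))   # False ('bot' is not)
print(has_voiced_minimal_pair("time", LEXICON))  # True  ('dime' is a word)
```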

Results showed modulation of activation for the production of words that had minimal pairs in a network including the left posterior STG, the SMG, the IFG, and the ventral precentral gyrus extending into the postcentral gyrus (see Figure 3). Consistent with the behavioral findings, which showed faster naming latencies for minimal-pair words compared to non-minimal-pair words, there was a reduction in activation for words that had voiced minimal pairs compared to words that did not. The emergence of a competitor effect in the absence of the overt presentation of the competitor indicated that the competition effects were implicit; they reflected the representational properties inherent in the mental lexicon and the extent to which a particular lexical candidate shared phonological properties with other words in the lexicon. This modulation of activation in the precentral gyrus as a function of the lexical properties of words (i.e., whether or not a target stimulus was from a minimal pair) indicates that information flow from those areas involved in lexical access (SMG) and lexical selection (IFG) is retained and cascades to those areas involved in articulatory planning (IFG) and articulatory implementation (precentral gyrus).


Figure 3. Modulation of activation for the production of words that had minimal pairs in a network including the left posterior STG, the SMG, IFG, and ventral precentral gyrus extending into the post-central gyrus (based on Peramunage et al. 2010). For colour version see separate plate.

4. Domain specificity for the resolution of competition

The selection of a word from among phonological competitors consistently recruits the IFG in both word recognition and spoken word production. Nonetheless, it is also the case that the resolution of competition among semantic and conceptual alternatives also activates the IFG. These findings raise the question of whether the ultimate resolution of competition and selection is domain-general, in the sense that it cuts across different levels of the grammar, or whether there is a functional subdivision of the IFG.

There has been considerable discussion in the literature on this question without any clear-cut resolution. The IFG can be divided cytoarchitectonically into three areas: the pars opercularis (BA44), the pars triangularis (BA45), and the frontal operculum (BA47) (see Figure 4). Some have proposed that these different areas service different linguistic domains, with BA44 recruited in phonological processing and BA45 recruited in semantic/conceptual processing (Buckner, Raichle, and Petersen 1995; Burton 2001; Burton, Small, and Blumstein 2000; Fiez 1997; Poldrack et al. 1999). Others have proposed that the IFG is divided into different functional processing streams, with the anterior portions of the IFG (BA47) recruited for maintaining multiple conceptual representations and the mid-portions of the IFG (BA45) recruited for selecting the task-relevant response (Badre and Wagner 2007).

The data from the studies examining the effects of lexical competition discussed in this chapter (Peramunage et al. 2010; Righi et al. 2009; Righi 2009) all showed activation in both BA44 and BA45. In contrast, those studies focusing solely on semantic/conceptual properties of words have shown activation only in BA45, not in BA44 (Thompson-Schill et al. 1997, 1998, 1999). Taken together, these studies suggest that there is a functional subdivision of the IFG along linguistic domains, with BA45 recruited in resolving semantic/conceptual competition and BA44 recruited in resolving phonological competition. That there is activation in both BA45 and BA44 in the lexical competition studies is not surprising given that these studies not only required the selection of a word that had phonological competitors but also required access to the conceptual representation of a word in response selection, by either requiring subjects to look at the appropriate picture from an array or requiring them to name the picture of an object.

Figure 4. Cytoarchitectonic division of the inferior frontal gyrus into Brodmann areas including the pars opercularis (BA44), the pars triangularis (BA45), and the frontal operculum (BA47). For colour version see separate plate.


5. Summary

The promise of neuroimaging and lesion studies is that they provide insights not only into the neural systems underlying language processing but also into the functional architecture of the language system. In this chapter, we have examined the neural systems underlying lexical competition in both word recognition and spoken word production. Results suggest a common processing stream for both word recognition and spoken word production involving temporo-parietal (posterior superior temporal gyrus, supramarginal gyrus, and angular gyrus) and frontal areas (the inferior frontal gyrus). This neural system serves different processing stages, including mapping sound properties to lexical representations (posterior superior temporal gyrus), access to and activation of multiple lexical representations that share phonological properties (supramarginal gyrus and angular gyrus), and executive control mechanisms for selecting the appropriate response from among multiple activated representations (inferior frontal gyrus).

That lexical (phonological) competition in posterior areas modulates activation in frontal areas is consistent with those models in which activation of the target word as well as multiple competing lexical representations influences processing stages downstream from it (Dell 1986; Gaskell and Marslen-Wilson 1999; see Goldrick 2006, for a recent review). Thus, activation of multiple competing lexical representations affects the degree of activation of those processes involved in the resolution of competition and ultimately in word selection. These cascading effects occur in pointing to an auditorily presented word during word recognition as well as in producing words to be named. Results for word recognition show that activation of multiple lexical (phonological) candidates has a cascading effect upstream on semantic/conceptual stages of processing, as shown by mediated priming effects in the behavioral data (hammock activates nail, the semantic associate of hammock's onset competitor hammer), and by modulatory effects in the posterior superior temporal gyrus, supramarginal gyrus, and inferior frontal gyrus in the functional neuroimaging data. Results for production show that activation of multiple lexical candidates has a cascading effect downstream on phonetic implementation stages of processing, as shown in the behavioral data (voice-onset time for words with voiced minimal pairs is longer than for words without voiced minimal pairs) and in the neuroimaging data (the precentral gyrus shows modulation of activation for words with voiced minimal pairs). Importantly, the presence of lexical competition modulates the neural system whether the competitor is present or absent in the stimulus array. These findings indicate that competition reflects the intrinsic properties of the lexicon, in which multiple phonologically similar lexical candidates are activated. Finally, the functional distinction between semantic and phonological processing appears to be realized in the resolution of competition in the inferior frontal gyrus. In particular, results suggest that BA44 resolves phonological competition whereas BA45 resolves semantic/conceptual competition.

Acknowledgements

This research was supported in part by NIH Grants R01 DC006220 and R01 DC00314 to Brown University from the National Institute on Deafness and Other Communication Disorders. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institute on Deafness and Other Communication Disorders or the National Institutes of Health.

References

Abel, Stephanie, Katherine Dressel, Ruth Bitzer, Dorothea Kümmerer, Irina Mader, Cornelius Weiller, et al.
2009 The separation of processing stages in a lexical interference fMRI paradigm. Neuroimage 44: 1113-1124.
Allopenna, Paul D., James S. Magnuson, and Michael K. Tanenhaus
1998 Tracking the time course of spoken word recognition using eye movements: Evidence for continuous mapping models. Journal of Memory and Language 38: 419-439.
Andruski, Jean E., Sheila E. Blumstein, and Martha W. Burton
1994 The effect of subphonetic differences on lexical access. Cognition 52: 163-187.
Badre, David and Anthony D. Wagner
2007 Left ventrolateral prefrontal cortex and the cognitive control of memory. Neuropsychologia 45: 2883-2901.
Baese-Berk, Melissa and Matthew Goldrick
2009 Mechanisms of interaction in speech production. Language and Cognitive Processes 24: 527-554.
Binder, Jeffrey R. and Cathy J. Price
2001 Functional neuroimaging of language. In Handbook of Functional Neuroimaging, R. Cabeza and A. Kingstone (eds.), 187-251. Cambridge: MIT Press.
Bles, Mart and Bernadette M. Jansma
2008 Phonological processing of ignored distractor pictures, an fMRI investigation. BMC Neuroscience 9: 20-29.
Bookheimer, Susan Y., Thomas A. Zeffiro, Teresa Blaxton, William Gaillard, and William Theodore
1995 Regional cerebral blood flow during object naming and word reading. Human Brain Mapping 3: 93-106.
Buckner, Randy L., Mark E. Raichle, and Steve E. Petersen
1995 Dissociation of human prefrontal cortical areas across different speech production tasks and gender groups. Journal of Neurophysiology 74: 2163-2173.
Burton, Martha W.
2001 The role of inferior frontal cortex in phonological processing. Cognitive Science 25: 695-709.
Burton, Martha W., Steven L. Small, and Sheila E. Blumstein
2000 The role of segmentation in phonological processing: An fMRI investigation. Journal of Cognitive Neuroscience 12: 679-690.
Caplan, David, David Gow, and Nicholas Makris
1995 Analysis of lesions by MRI in stroke patients with acoustic-phonetic processing deficits. Neurology 45: 293-298.
Dahan, Delphine, James S. Magnuson, Michael K. Tanenhaus, and Ellen M. Hogan
2001a Subcategorical mismatches and the time course of lexical access: Evidence for lexical competition. Language and Cognitive Processes 16: 507-534.
Dahan, Delphine, James S. Magnuson, and Michael K. Tanenhaus
2001b Time course of frequency effects in spoken-word recognition: Evidence from eye movements. Cognitive Psychology 42: 317-367.
Damian, Markus F. and Jeffrey S. Bowers
2009 Assessing the role of orthography in speech perception and production: Evidence from picture-word interference tasks. European Journal of Cognitive Psychology 21: 581-598.
Dapretto, Mirella and Susan Y. Bookheimer
1999 Form and content: Dissociating syntax and semantics in sentence comprehension. Neuron 24: 427-432.
Dell, Gary S.
1986 A spreading activation theory of retrieval in language production. Psychological Review 93: 283-321.
Dell, Gary S. and Jean Gordon
2003 Neighbors in the lexicon: Friends or foes? In Phonetics and Phonology in Language Comprehension and Production: Differences and Similarities, N. O. Schiller and A. S. Meyer (eds.), 9-37. New York: Mouton de Gruyter.
Dell, Gary S., Myrna F. Schwartz, Nadine Martin, Eleanor M. Saffran, and Deborah A. Gagnon
1997 Lexical access in aphasic and nonaphasic speakers. Psychological Review 104: 801-838.
Demonet, Jean-François, François Chollet, Stuart Ramsay, Dominique Cardebat, Jean-Luc Nespoulous, Richard Wise, André Rascol, and Richard Frackowiak
1992 The anatomy of phonological and semantic processing in normal subjects. Brain 115: 1753-1768.
de Zubicaray, Greig I. and Katie L. McMahon
2009 Auditory context effects in picture naming investigated with event-related fMRI. Cognitive, Affective, and Behavioral Neuroscience 9: 260-269.
de Zubicaray, Greig I., Katie L. McMahon, Matt M. Eastburn, and Stephen J. Wilson
2002 Orthographic/phonological facilitation of naming responses in the picture-word task: An event-related fMRI study using overt vocal responding. Neuroimage 16: 1084-1093.
de Zubicaray, Greig I., Stephen J. Wilson, Katie L. McMahon, and Santhi Muthia
2001 The semantic interference effect in the picture-word paradigm: An event-related fMRI study employing overt responses. Human Brain Mapping 14: 218-227.
Duncan, John
2001 An adaptive model of neural function in prefrontal cortex. Nature Reviews Neuroscience 2: 820-829.
Duncan, John and Adrian M. Owen
2000 Common regions of the human frontal lobe recruited by diverse cognitive demands. Trends in Neuroscience 23: 475-483.
Fiez, Julie A.
1997 Phonology, semantics, and the role of the left inferior prefrontal cortex. Human Brain Mapping 5: 79-83.
Gaskell, M. Gareth and William D. Marslen-Wilson
1999 Ambiguity, competition, and blending in spoken word recognition. Cognitive Science 23: 439-462.
Geschwind, Norman
1965 Disconnexion syndromes in animals and man. Brain 88: 237-294, 585-644.
Gold, Brian T. and Randy L. Buckner
2002 Common prefrontal regions coactivate with dissociable posterior regions during controlled semantic and phonological tasks. Neuron 35: 803-812.
Goldrick, Matthew
2006 Limited interaction in speech production: Chronometric, speech error, and neuropsychological evidence. Language and Cognitive Processes 21: 817-855.
Goldrick, Matthew and Sheila E. Blumstein
2006 Cascading activation from phonological planning to articulatory processes: Evidence from tongue twisters. Language and Cognitive Processes 21: 649-683.
Grosjean, François
1980 Spoken word recognition processes and the gating paradigm. Perception and Psychophysics 28: 267-283.
Guenther, Frank H.
2006 Cortical interactions underlying the production of speech sounds. Journal of Communication Disorders 39: 350-365.
Gumnior, Heidi, Jens Bölte, and Pienie Zwitserlood
2006 A chatterbox is a box: Morphology in German word production. Language and Cognitive Processes 21: 920-944.
Hickok, Gregory and David Poeppel
2000 Towards a functional neuroanatomy of speech perception. Trends in Cognitive Science 4: 131-138.
2004 Dorsal and ventral streams: A framework for understanding aspects of the functional anatomy of language. Cognition 92: 67-99.
Hillis, Argye E. and Alfonso Caramazza
1991 Mechanisms for accessing lexical representations for output: Evidence from a category-specific semantic deficit. Brain and Language 40: 106-144.
Huang, Jie, Thomas H. Carr, and Yue Cao
2001 Comparing cortical activations for silent and overt speech using event-related fMRI. Human Brain Mapping 15: 39-53.
Indefrey, Peter and Willem J. M. Levelt
2004 The spatial and temporal signatures of word production components. Cognition 92: 101-144.
Janse, Esther
2006 Lexical competition effects in aphasia: Deactivation of lexical candidates in spoken word processing. Brain and Language 97: 1-11.
Kotz, Sonja A., Stefano F. Cappa, D. Yves von Cramon, and Angela D. Friederici
2002 Modulation of the lexical-semantic network by auditory semantic priming: An event-related functional MRI study. Neuroimage 17: 1761-1772.
Levelt, Willem J. M.
1999 Models of word production. Trends in Cognitive Science 3: 223-232.
2001 Spoken word production: A theory of lexical access. Proceedings of the National Academy of Sciences 98: 13464-13471.
Luce, Paul A. and David B. Pisoni
1998 Recognizing spoken words: The neighborhood activation model. Ear and Hearing 19: 1-36.
Lupker, Stephen J.
1979 The semantic nature of response competition in the picture-word interference task. Memory and Cognition 7: 485-495.
1982 The role of phonetic and orthographic similarity in picture-word interference. Canadian Journal of Psychology 36: 349-376.
Marslen-Wilson, William
1987 Functional parallelism in spoken word-recognition. Cognition 25: 71-102.
1989 Access and integration: Projecting sound onto meaning. In Lexical Representation and Process, William Marslen-Wilson (ed.), 3-24. Cambridge, MA: MIT Press.
Marslen-Wilson, William and Alan Welsh
1978 Processing interactions and lexical access during word-recognition in continuous speech. Cognitive Psychology 10: 29-63.
McClelland, James L. and Jeffrey L. Elman
1986 The TRACE model of speech perception. Cognitive Psychology 18: 1-86.
McDermott, Kathleen B., Steven E. Petersen, Jason M. Watson, and Jeffrey G. Ojemann
2003 A procedure for identifying regions preferentially activated by attention to semantic and phonological relations using functional magnetic resonance imaging. Neuropsychologia 41: 293-303.
Milberg, William, Sheila E. Blumstein, and Barbara Dworetzky
1987 Processing of lexical ambiguities in aphasia. Brain and Language 31: 138-150.
Miller, Earl K. and Jonathan D. Cohen
2001 An integrative theory of prefrontal cortex function. Annual Review of Neuroscience 24: 167-202.
Misiurski, Cara, Sheila E. Blumstein, Jesse Rissman, and Daniel Berman
2005 The role of lexical competition and acoustic-phonetic structure in lexical processing: Evidence from normal subjects and aphasic patients. Brain and Language 93: 64-75.
Munson, Benjamin
2007 Lexical access, lexical representation, and vowel production. In Laboratory Phonology 9, J. S. Cole and J. I. Hualde (eds.), 201-228. New York: Mouton de Gruyter.
Munson, Benjamin and Nancy P. Solomon
2004 The effects of phonological neighborhood density on vowel articulation. Journal of Speech, Language, and Hearing Research 47: 1048-1058.
Norris, Dennis
1994 Shortlist: A connectionist model of continuous speech recognition. Cognition 52: 189-234.
Okada, Kayoko and Gregory Hickok
2006 Identification of lexical-phonological networks in the superior temporal sulcus using functional magnetic resonance imaging. Neuroreport 17: 1293-1296.
Peramunage, Dasun, Sheila E. Blumstein, Emily B. Myers, Matthew Goldrick, and Melissa Baese-Berk
2010 Phonological neighborhood effects in spoken word production: An fMRI study. Journal of Cognitive Neuroscience, accepted.
Peterson, Robert R. and Pamela Savoy
1998 Lexical selection and phonological encoding during language production: Evidence for cascaded processing. Journal of Experimental Psychology: Learning, Memory, and Cognition 24: 539-557.
Poeppel, David
1996 A critical review of PET studies of phonological processing. Brain and Language 55: 317-351.
Poldrack, Russell A., Anthony D. Wagner, Matthew W. Prull, John E. Desmond, Gary H. Glover, and John D. Gabrieli
1999 Functional specialization for semantic and phonological processing in the left inferior prefrontal cortex. Neuroimage 10: 15-35.
Prabhakaran, Ranjani, Sheila E. Blumstein, Emily B. Myers, Emmette Hutchinson, and Brendan Britton
2006 An event-related investigation of phonological-lexical competition. Neuropsychologia 44: 2209-2221.
Price, Cathy J., Catherine Mummery, Carolyn J. Moore, Richard S. Frackowiak, and Karl J. Friston
1999 Delineating necessary and sufficient neural systems with functional imaging studies of neuropsychological patients. Journal of Cognitive Neuroscience 11: 371-382.
Raposo, Ana, Helen E. Moss, Emmanuel A. Stamatakis, and Lorraine K. Tyler
2006 Repetition suppression and semantic enhancement: An investigation of the neural correlates of priming. Neuropsychologia 44: 2284-2295.
Righi, Giulia
2009 The Neural Basis of Competition in Auditory Word Recognition and Spoken Word Production. Unpublished doctoral dissertation, Brown University.
Righi, Giulia, Sheila E. Blumstein, John Mertus, and Michael S. Worden
2009 Neural systems underlying lexical competition: An eyetracking and fMRI study. Journal of Cognitive Neuroscience, epub ahead of print.
Rissman, Jesse, James C. Eliassen, and Sheila E. Blumstein
2003 An event-related fMRI investigation of implicit semantic priming. Journal of Cognitive Neuroscience 15: 1160-1175.
Rosinski, Richard R.
1977 Picture-word interference is semantically based. Child Development 48: 643-647.
Scarborough, Rebecca
in press Lexical and contextual predictability: Confluent effects on the production of vowels. In Papers in Laboratory Phonology 10, C. Fougeron and M. D'Imperio (eds.). Berlin: Mouton de Gruyter.
Schnur, Tatiana T., Myrna F. Schwartz, Daniel Y. Kimberg, Elizabeth Hirshorn, H. Branch Coslett, and Sharon L. Thompson-Schill
2009 Localizing interference during naming: Convergent neuroimaging and neuropsychological evidence for the function of Broca's area. Proceedings of the National Academy of Sciences of the United States of America 106: 322-327.
Schriefers, Herbert, Antje S. Meyer, and Willem J. M. Levelt
1990 Exploring the time course of lexical access in language production: Picture-word interference studies. Journal of Memory and Language 29: 86-102.
Scott, Sophie K. and Richard J. S. Wise
2004 The functional neuroanatomy of prelexical processing in speech perception. Cognition 92: 13-45.
Smith, Edward E. and John Jonides
1999 Storage and executive processes in the frontal lobes. Science 283: 1657-1661.
Snyder, Hannah R., Keith Feigenson, and Sharon L. Thompson-Schill
2007 Prefrontal cortical response to conflict during semantic and phonological tasks. Journal of Cognitive Neuroscience 19: 761-775.
Tanenhaus, Michael K., Michael J. Spivey-Knowlton, Katherine M. Eberhard, and Julie C. Sedivy
1995 Integration of visual and linguistic information in spoken language comprehension. Science 268: 632-634.
Thompson-Schill, Sharon L., Mark D'Esposito, Geoffrey K. Aguirre, and Martha J. Farah
1997 Role of the inferior prefrontal cortex in retrieval of semantic knowledge: A reevaluation. Proceedings of the National Academy of Sciences 94: 14792-14797.
Thompson-Schill, Sharon L., Diane Swick, Martha J. Farah, Mark D'Esposito, Irene P. Kan, and Robert T. Knight
1998 Verb generation in patients with focal frontal lesions: A neuropsychological test of neuroimaging findings. Proceedings of the National Academy of Sciences of the United States of America 95: 15855-15860.
Thompson-Schill, Sharon L., Mark D'Esposito, and Irene P. Kan
1999 Effects of repetition and competition on activity in left prefrontal cortex during word generation. Neuron 23: 513-522.
Utman, Jennifer A., Sheila E. Blumstein, and Kelly Sullivan
2001 Mapping from sound to meaning: Reduced lexical activation in Broca's aphasics. Brain and Language 79: 444-472.
Vitevitch, Michael S.
2002 The influence of phonological similarity neighborhoods on speech production. Journal of Experimental Psychology: Learning, Memory, and Cognition 28: 735-747.
Wright, Richard A.
2004 Factors of lexical competition in vowel articulation. In Laboratory Phonology 6, John J. Local, Richard Ogden, and Rosalind Temple (eds.), 26-50. Cambridge, UK: Cambridge University Press.
Yee, Eiling, Sheila E. Blumstein, and Julie C. Sedivy
2008 Lexical-semantic activation in Broca's and Wernicke's aphasia: Evidence from eye movements. Journal of Cognitive Neuroscience 20: 592-612.
Yee, Eiling and Julie C. Sedivy
2006 Eye movements reveal transient semantic activation during spoken word recognition. Journal of Experimental Psychology: Learning, Memory, and Cognition 32: 1-14.
Zwitserlood, Pienie
1989 The locus of the effects of sentential-semantic context in spoken-word processing. Cognition 32: 25-64.

Connectionist perspectives on lexical representation

David C. Plaut

1. Introduction

Words are often thought of as the building blocks of language, but the richness of their internal structure and the complexity of how they combine belies such a simple metaphor. Lexical knowledge encompasses multiple types of interrelated information – orthographic, phonological, semantic, and grammatical – with substantial degrees of ambiguity within each. It is perhaps not surprising, then, that despite the extensive efforts put into studying lexical processing across multiple disciplines, our understanding of the cognitive and neural bases of word representation remains piecemeal.

The standard way of thinking about lexical representation is that each word is coded by some type of separate, discrete data structure, such as a "logogen" (Morton 1969) or localist processing unit (McClelland and Rumelhart 1981). Each such representation has no internal structure of its own but serves as a "handle" that links together the various types of information that comprise knowledge of the word. One interesting implication of this view is that, although words can be similar orthographically, phonologically, semantically, or grammatically, there's no sense in which, independent of these other dimensions, words can be similar lexically. That is, whereas the representation of each aspect of lexical knowledge defines a similarity space within which words can be more or less related to each other, lexical representations per se are fundamentally different in that each word is coded independently of every other word. In essence, lexical representations themselves have no relevant properties – they exist solely to solve a particular computational problem: how to bind together specific orthographic, phonological, semantic, and grammatical information so that each aspect can evoke the others and together contribute coherently to language processing more generally.

Although the traditional theory of lexical representation has considerable intuitive appeal, it runs into some difficulties when confronting the complexities of the internal structure and external relationships of words. This chapter explores the possibility that a particular form of computational modeling – variously known as connectionist modeling, neural-network modeling, or the parallel distributed processing (PDP) approach – not only avoids these difficulties but, more fundamentally, provides a different solution to the problem that traditional lexical representations were created to solve in the first place. In particular, and as elaborated below, connectionist/PDP networks can learn the functional relationships among orthographic, phonological, semantic, and grammatical information even though no particular representation binds them all together in one place. In this way, connectionist/PDP modeling raises the radical possibility that, although there is certainly lexical knowledge and lexical processing, as traditionally construed there is no such thing as lexical representation.

2. Principles of connectionist representation

Connectionist models are composed of large groups of simple, neuron-like processing units that interact across positive- and negative-weighted connections. Typically, each unit has a real-valued activation level which is computed according to a non-linear (sigmoid) function of the weighted sum of the activations of other, connected units. Different groups of units code different types of information, with some units coding input to the system and others coding the system's output or response to that input. Knowledge of how inputs are related to outputs is encoded in the pattern of weights on the connections among units; learning involves modifying the weights in response to performance feedback.
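As a concrete illustration of these unit dynamics, the sketch below computes the activation of a single unit; the particular activations and weights are invented for illustration:

```python
import math

# One connectionist unit: a sigmoid of the weighted sum of the
# activations of the units that connect to it. Values are invented.
def unit_activation(sending_acts, weights, bias=0.0):
    net_input = sum(a * w for a, w in zip(sending_acts, weights)) + bias
    return 1.0 / (1.0 + math.exp(-net_input))  # non-linear (sigmoid) squashing

sending_acts = [1.0, 0.0, 0.5]   # activations of three sending units
weights = [2.0, -1.0, 0.8]       # positive- and negative-weighted connections
print(unit_activation(sending_acts, weights))  # ~0.917
```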


In thinking about how a group of units might represent entities in a domain, it is common to contrast two alternatives. The first is a localist representation, in which there is a one-to-one relationship between units and entities – that is, a single, dedicated unit corresponds to each entity. The second is a distributed representation, in which the relationship is many-to-many – that is, each entity is represented by a particular pattern of activity over the units, and each unit participates in representing multiple entities.1

The interactive activation (IA) model of letter and word perception (McClelland and Rumelhart 1981) provides a useful context for clarifying this distinction. The model consists of three layers of interacting units: letter feature units at the bottom (various strokes at each of four positions), letter units in the middle (one per letter at each position; e.g., B, L, U, and R), and word units at the top (one per word; e.g., BLUR). The IA model is usually thought of as localist because it contains single units that stand in one-to-one correspondence with words, but it is important to recognize that a representation is localist or distributed only relative to a specific set of entities. Thus, the word level of the IA model is localist relative to words, and the letter level is localist relative to (position-specific) letters. However, at the letter level, the presentation of a word results in the activation of multiple units (corresponding to its letters), and each of these units is activated by multiple words (i.e., words containing that letter in that position). Thus, the letter level in the IA model is localist relative to letters but distributed relative to words.

In practice, however, it can be difficult to distinguish localist from distributed representations on the basis of activity, because localist units typically become active not only for the entity to which they correspond but also for entities that are similar to it. For example, in the IA model, the input for BLUR activates its word unit strongly but also partially activates the word unit for BLUE (see Bowers 2009: 226). This off-item activation can be difficult to distinguish from the patterns that comprise distributed representations. Moreover, in most localist theories it is assumed that there are multiple redundant copies of each dedicated unit. Thus, in both localist and distributed representations, multiple units become active in processing a given entity, and each unit will become at least partially active for multiple entities.

A further consideration is that the number of active units in a representation – its sparseness – is a matter of degree. Localist representations constitute one extreme of sparseness, but distributed representations in which a very small percentage of units are active at any one time can be functionally quite similar, in that each pattern can have effects that are largely independent of the effects of other patterns. Even so, sparse distributed representations have a distinct advantage over strictly localist ones in that they provide far more efficient coding (O'Reilly and McClelland 1994).
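The contrast between localist and sparse distributed coding, and the notion of overlap between patterns, can be made concrete with a toy example (again a sketch under our own assumptions – 100 units, five words, roughly 10% sparsity – rather than code from any model in the literature):

```python
import numpy as np
rng = np.random.default_rng(0)

n_units, n_words = 100, 5

# Localist code: one dedicated unit per word (one-to-one).
localist = np.eye(n_units)[:n_words]

# Sparse distributed code: each word activates ~10% of the units,
# and each unit can participate in several word patterns.
distributed = (rng.random((n_words, n_units)) < 0.10).astype(float)

def mean_overlap(codes):
    """Average number of active units shared across all pattern pairs."""
    n = len(codes)
    pairs = [(i, j) for i in range(n) for j in range(n) if i < j]
    return float(np.mean([np.sum(codes[i] * codes[j]) for i, j in pairs]))

print(mean_overlap(localist))     # 0.0 - localist word codes share nothing
print(mean_overlap(distributed))  # typically > 0 - graded overlap between words
```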


Moreover, the degree of sparseness of learned internal representations within connectionist networks need not be stipulated a priori but arises as a consequence of the basic network mechanisms, the learning procedure, and the structure of the tasks to be learned. In general, systematic tasks – in which similar inputs map to similar outputs – yield denser activation to support generalization, whereas unsystematic tasks such as word recognition give rise to sparser activation to avoid interference (for discussion, see McClelland, McNaughton, and O'Reilly 1995; Plaut et al. 1996).

An alternative characterisation of the locality of a representation is in terms of knowledge rather than activity (Bowers 2009). That is, one can distinguish whether knowledge about an entity is encoded in the connections coming into or out of a particular unit or whether it is distributed across the connections of many units. For example, within the IA model, knowledge that the letter string BLUR is a word is coded only in the connections between the corresponding word unit and its letters; remove that single unit, and BLUR is no longer a word to the model. Although this form of localist theory is clearly distinct from the types of knowledge typically learned by connectionist/PDP networks, it runs into difficulties when confronted with the general issue of the appropriate granularity of localist units – in particular, whether units should be allocated to individually encountered instances of entities or to some equivalence class of instances (Plaut and McClelland 2010). The former case is problematic not only because it requires an unrealistic amount of storage but also because it doesn't explain how we recognize novel instances of familiar categories (e.g., a new car on the street, or this particular occurrence of the word BLUR). Assigning units to classes of instances is problematic because there will always be some further distinctions within the class that are important in some contexts but that are inaccessible because the instances are represented identically by the same localist unit. If both instance and class units are added, the knowledge about an entity is no longer localised to a single processing unit – that is, on this alternative formulation, the representation becomes distributed.

Although the issue of the granularity of localist representations is problematic in general, it could be argued that it is entirely straightforward in the specific case of words. That is, units should be allocated for each word, which corresponds to a class of instances (i.e., specific occurrences of that word). The reason this works is that words are symbolic – each instance of a word is exactly functionally equivalent to every other instance of the word, and so nothing is lost by representing them identically. Thus, even if localist representation is untenable in general, perhaps it is perfectly well-suited for lexical knowledge. Unfortunately, localist representations face another challenge in this domain – capturing the internal structure of words.

3. The challenge of internal structure: Morphology

The real building blocks of language, if there were such things, would be morphemes.


The traditional view of lexical representation is that words are composed of one or more morphemes, each of which contributes systematically to the meaning and grammatical role of the word as a whole (e.g., UNBREAKABLE = UN- + BREAK + -ABLE). If English morphology were perfectly systematic, lexical representation would have nothing to contribute beyond morphemic representation, and localist structures might be fully adequate for the latter. However, as is true of other linguistic domains, morphological systematicity is only partial. That is, the meaning of a word is not always transparently related to the meaning of its morphemes (e.g., a DRESSER is not someone who dresses but a piece of furniture containing clothes). Moreover, the meaning of a morpheme can depend on the word it occurs in (e.g., the affix -ER can be agentive [TEACHER], instrumental [MOWER], or comparative [FASTER], depending on the stem). In fact, some words decompose only partially (e.g., -ER is agentive in GROCER and instrumental in HAMMER, but what remains in each case [GROCE?, HAM?] is not a morpheme that contributes coherently to meaning). In short, the relationship of the meaning of a word to the meanings of its parts – to the extent it even has parts – is sometimes straightforward but can be exceedingly complex in general.

This complexity presents a formidable challenge to localist theories of lexical representation. First, the wealth of empirical data showing strong effects of morphological structure on the speed and accuracy of word recognition rules out a solution that involves units only for whole words. The fact that many words exhibit only partial semantic transparency also rules out having only morpheme units that contribute to meaning independently. The only viable approach would seem to be one in which both word and morpheme units are included, such that the word units compensate for any lack of transparency in the semantic contribution of individual morphemes (see, e.g., Taft 2006). Even setting aside concerns about how the system would determine what morphemes are contained in a word, allocate and connect the necessary units, and weight them relative to the word unit appropriately, the approach runs into problems because it forces morphological decomposition to be all-or-none. That is, a word either does or doesn't contain a morpheme, and if it does, the morpheme unit's contribution to meaning (as distinct from the word unit's contribution) is the same as in other words containing it. For instance, it seems clear that BOLDLY contains BOLD as a morpheme (in that it makes a transparent semantic contribution), whereas HARDLY doesn't contain HARD (and so HARDLY, despite the similarity in form, would not be decomposed). And, indeed, in a visually primed lexical decision experiment, BOLD primes BOLDLY but HARD doesn't prime HARDLY (relative to non-morphological orthographic and semantic controls; Gonnerman, Seidenberg, and Andersen 2007).


But what about LATE in LATELY? On the localist theory, LATELY should behave either like BOLDLY if it is decomposed, or like HARDLY if it's not, but empirically it exhibits an intermediate level of priming (Gonnerman et al. 2007). This finding is awkward for any theory that has no way to express intermediate degrees of morphological relatedness.

How might morphological structure be understood on a distributed connectionist approach? The first thing to point out is that morphemes, like word units, have no relevant internal structure but are posited to solve a particular problem: how to relate the surface forms of words to their meanings. We assume that (phonological) surface forms are coded by distributed patterns of activity over a group of units such that words with similar pronunciations are coded by similar patterns, and word meanings are coded over a separate group of units whose patterns capture semantic similarity. Mapping from one to the other is difficult precisely because, apart from morphological structure (and rare pockets of sound symbolism), phonological similarity is essentially unrelated to semantic similarity. This type of arbitrary mapping is particularly difficult for a connectionist network to learn, because units – due to their limited nonlinearity – are intrinsically biased to map similar inputs to similar outputs. In fact, when output similarity is very different from input similarity, the mapping cannot be implemented by direct connections between input and output units, and an additional layer of so-called hidden units is needed to mediate between the input and output. By modifying the input-to-hidden weights, the network can learn to re-represent the input patterns as a new set of patterns over the hidden units whose similarities are sufficiently close to those of the output patterns that the hidden-to-output weights can generate the correct outputs. In this way, networks learn hidden representations that have a similarity structure that is in some sense halfway between the structure of the inputs and the structure of the outputs. This can always be done with a large enough hidden layer, but sometimes it is more efficient to use a series of smaller hidden layers instead.
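As an illustration of why hidden units matter for arbitrary mappings, the following sketch trains a tiny back-propagation network on an XOR-like task in which similar inputs map to dissimilar outputs; the task, layer sizes, and learning rate are our own illustrative assumptions, not any simulation from the literature:

```python
import numpy as np
rng = np.random.default_rng(1)

# Arbitrary mapping: similar inputs require dissimilar outputs (XOR),
# so no direct input-to-output weights can implement it.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
Y = np.array([[0], [1], [1], [0]], dtype=float)

W1 = rng.normal(0, 1, (2, 8)); b1 = np.zeros(8)   # input -> hidden
W2 = rng.normal(0, 1, (8, 1)); b2 = np.zeros(1)   # hidden -> output
sig = lambda x: 1 / (1 + np.exp(-x))

for _ in range(5000):                  # simple batch gradient descent
    H = sig(X @ W1 + b1)               # hidden re-representation of the input
    O = sig(H @ W2 + b2)
    dO = (O - Y) * O * (1 - O)         # error signal at the output
    dH = (dO @ W2.T) * H * (1 - H)     # error propagated back to the hidden layer
    W2 -= 0.5 * H.T @ dO; b2 -= 0.5 * dO.sum(0)
    W1 -= 0.5 * X.T @ dH; b1 -= 0.5 * dH.sum(0)

print(np.round(O.ravel(), 2))  # should approach [0, 1, 1, 0]
```

The hidden layer lets the network re-represent the inputs so that the required outputs become attainable – the re-representation is what direct input-to-output connections cannot provide.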


Of course, spoken word comprehension is not a completely arbitrary mapping, precisely because many words have morphological structure. On a connectionist account, however, the nature of this structure is not stipulated in advance (e.g., that words are composed of discrete parts) but is something that manifests in the statistical relationship between inputs and outputs and thus is discovered by the network in the course of learning. Morphological structure introduces a degree of componentiality between inputs and outputs – that is, the degree to which parts of the input can be processed independently from the rest of the input. From a connectionist perspective, the notion of "morpheme" is an inherently graded concept, because the extent to which a particular part of the phonological input behaves independently of the rest of the input is always a matter of degree (Bybee 1985). Also note that the relevant parts of the input need not be contiguous, as prefixes and suffixes are in concatenative systems like English. Even noncontiguous subsets of the input, such as roots and word patterns in Hebrew, can function morphologically if they behave systematically with respect to meaning or syntax.

A network comes to exhibit degrees of componentiality in its behaviour because, on the basis of exposure to examples of inputs and outputs from a task, it must determine not only what aspects of each input are important for generating the correct output, but also what aspects are uninformative and should be ignored. This knowledge can then apply across large classes of items, only within small subclasses, or even be restricted to individual items. In this way, the network learns to map parts of the input to parts of the output in a way that is as independent as possible from how the remaining parts of the input are mapped. This provides a type of combinatorial generalisation by allowing novel recombinations of familiar parts to be processed effectively. In short, a network can develop mostly componential representations that handle the more systematic aspects of the task and that generalise to novel forms, while simultaneously developing less componential representations for handling the more idiosyncratic aspects of the task, as well as the full range of gradations in between.

The graded componential structure of hidden representations is illustrated in a clear way by a simulation of morphological priming carried out by Plaut and Gonnerman (2000). A three-layer network was trained to map from the surface forms of words to their meanings for either of two artificial vocabularies (see Figure 1a). In each, sets of two-syllable words were assigned semantic features such that they varied in their semantic transparency. Each syllable was assigned a particular set of semantic features, such that a transparent word's meaning was simply the union of the features of its component syllables. Such meanings are fully componential in that each semantic feature could be determined by one of the syllables without regard to the other. The meaning of an intermediate word was derived by determining the transparent meaning of its syllables and then changing a random third of its semantic features; for a distant word, two-thirds of the transparent features were changed.


These meanings are progressively less componential than transparent meanings because the changed semantic features can be determined only from both syllables together. Finally, the meaning of an opaque word was derived by regenerating an entirely new arbitrary set of semantic features that were unrelated to the transparent meanings of its syllables. Using these procedures for generating representations, two languages were created containing 1200 words each. In the morphologically rich language, the first 60 "stems" (first syllables), forming 720 words, were all transparent; in the impoverished language, they were all opaque. The remaining 480 words were identical across the two languages and were formed from 10 transparent stems, 10 intermediate stems, 10 distant stems, and 10 opaque stems. The simulation was designed to evaluate the degree of morphological priming among this shared set of words as a function of the nature of the remaining words in each of the two languages.

Figure 1b shows the amount of priming (difference in settling times following related vs. unrelated primes) as a function of level of morphological transparency and of language. The main relevant finding for present purposes is that, in both languages, morphological priming varies in a graded fashion as a function of semantic transparency, analogous to what was observed empirically by Gonnerman et al. (2007). The strong priming exhibited by transparent words suggests that the network's internal representations have learned the systematic relationship between the shared stem's surface form and its (transparent) meaning, and in this sense it seems natural to describe the stem as a "morpheme" that is shared by the prime and target. But the intermediate and distant words benefit from sharing a stem to less of an extent, due to the fact that their internal representations overlap less. In these cases, what the stem contributes to the representation of the prime is not contained in or part of the representation of the target; rather, there is some degree of overlap but also some degree of divergence between the stem's contribution in the two words. At best, what could be said is that the stem functions as a morpheme to some degree, and is contained by words to some degree; there is no discrete point at which words go from being fully componential to fully opaque. And based on the empirical findings, this characterization of graded morphological structure applies to human subjects as well as to the network.

In summary, unstructured or localist word representations can be augmented with similar morpheme representations to capture some aspects of the internal structure of words, but the processing of words with intermediate degrees of semantic transparency is awkward to explain.


By contrast, because distributed connectionist networks start with the assumption that entities such as words are represented by patterns of activity with rich internal structure, such networks can more naturally capture the graded relationships between the surface forms of words and their meanings.


Figure 1: (a) The network architecture used by Plaut and Gonnerman (2000). Numbers of units in each group are shown in parentheses, and large arrows represent full connectivity between groups. (b) Priming results produced by the network as a function of the degree of morphological transparency and whether the network was trained on a morphologically rich or impoverished artificial language. (Adapted from Plaut and Gonnerman 2000.)
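As an illustration only – not code from Plaut and Gonnerman (2000) – the logic of the meaning-generation procedure just described might be sketched as follows; the feature count and the exact way "changed" features are resampled are our assumptions, intended only to convey the transparency manipulation:

```python
import numpy as np
rng = np.random.default_rng(2)

N_FEATURES = 30  # semantic features per word (illustrative)

def syllable_features():
    """Assign a syllable its own random set of semantic features."""
    return (rng.random(N_FEATURES) < 0.5).astype(int)

def make_meaning(stem, suffix, transparency):
    """Compose a word meaning from its syllables, then resample a
    proportion of the features according to the transparency level."""
    meaning = stem | suffix  # union of the syllables' features
    proportion = {"transparent": 0.0, "intermediate": 1 / 3,
                  "distant": 2 / 3, "opaque": 1.0}[transparency]
    mask = rng.random(N_FEATURES) < proportion
    meaning[mask] = rng.integers(0, 2, mask.sum())  # regenerate those features
    return meaning

stem, suffix = syllable_features(), syllable_features()
for level in ["transparent", "intermediate", "distant", "opaque"]:
    m = make_meaning(stem, suffix, level)
    overlap = np.mean(m == (stem | suffix))
    print(level, round(float(overlap), 2))  # agreement with the componential meaning
```

The printed overlaps decrease from transparent to opaque, which is the graded componentiality that the network's hidden representations pick up on.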


4. The challenge of external context: Ambiguity

Capturing the internal structure of words is not the only challenge facing theories of lexical representation. Another, often neglected problem concerns ambiguity in the relationships among different aspects of lexical knowledge. As it turns out, addressing this issue requires coming to terms with how words contribute to, and are influenced by, higher levels of language processing.

Every aspect of lexical knowledge suffers from ambiguity when words are considered in isolation: semantics (e.g., BANK [river] vs. BANK [money]), syntax (e.g., [the] FLY vs. [to] FLY), phonology (e.g., WIND [air] vs. WIND [watch]), and even orthography (e.g., COLOUR vs. COLOR). Most computational models of lexical processing, including connectionist ones, either actively avoid this problem by adopting simplified vocabularies and representations that lack ambiguity (e.g., Kello and Plaut 2003) or include it only in phonology (e.g., Coltheart et al. 2001; Harm and Seidenberg 2004; Plaut et al. 1996). In simulations that include semantics, the presentation of a homophone like ATE/EIGHT in phonology, or a heterophonic homograph like WIND in orthography, typically gives rise to a blend of the semantic features of the relevant meanings, although such blending can be reduced by the introduction of disambiguating information, such as distinctive semantic features for homophones or phonological information for homographs (see, e.g., Harm and Seidenberg 2004). A similar situation arises in simulations that include semantic ambiguity, that is, in which a given surface form (e.g., BANK) corresponds to more than one semantic representation (e.g., Joordens and Besner 1994), although blends can be prevented for the most part by the use of an appropriate learning procedure (Movellan and McClelland 1993; Rodd, Gaskell, and Marslen-Wilson 2004). The selection of which of its multiple meanings a word produces on a given occasion is influenced by the relative frequency of the meanings but is otherwise the result of random processes within the network. This may suffice when accounting for data from words presented in isolation and in random order, but it does not generalize to the way in which ambiguous words are understood in context.

Armstrong and Plaut (2008) developed a simple simulation of the use of context to disambiguate semantically ambiguous words, including both homonymy (i.e., words such as BANK [river/money] with multiple distinct meanings) and polysemy (i.e., words such as PAPER [document/material] with multiple related senses sharing a common core meaning).


Although these relations are often dichotomized in experimental designs, the degree of pattern overlap among distributed semantic representations provides a natural means of capturing the full continuum of relatedness among word meanings. The target phenomena for the simulation were findings by Hino, Pexman, and Lupker (2006) that lexical decision typically produces only a polysemy advantage (i.e., faster responding to polysemous vs. unambiguous words) whereas semantic categorization produces only a homonymy disadvantage (i.e., slower responding to homonymous vs. unambiguous words). Armstrong and Plaut's goal was to account for these findings, not in terms of task differences, but in terms of the time-course of cooperative and competitive dynamics within a recurrent connectionist network.

The architecture of the network included 25 orthographic units connected to 150 hidden units, which in turn were bidirectionally connected to 100 semantic units. In addition, 75 "context" units provided additional input to the hidden units that served as the basis for disambiguating words. The training patterns consisted of 128 unambiguous words, 64 homonymous words, and 64 polysemous words. Artificial patterns were generated to approximate the relationship among written words and their meanings. Specifically, orthographic, context, and semantic representations were generated by probabilistically activating a randomly selected 15% of the units in a group (ensuring that all patterns differed by at least three units). Unambiguous words consisted of a single pairing of a randomly selected orthographic pattern, context pattern, and semantic pattern. Homonymous words were represented as two separate input patterns which shared the same orthographic pattern but were each associated with a different randomly selected context pattern and semantic pattern. Polysemous words were similar except that their two semantic patterns were originally derived by distorting the same "prototype" pattern, ensuring that they shared 60% of their features with each other. To instantiate the bottom-up salience of orthographic stimuli, context input was presented only after 10 unit updates with orthographic input alone. After training with a continuous version of recurrent back-propagation, the network was successful at activating the correct semantic features of each word given the appropriate context representation.

Figure 2 shows the number of semantic units in the model that were activated strongly (i.e., above 0.7) over the course of processing polysemous, homonymous, and unambiguous words. Early in semantic processing (time A), polysemous words show an advantage over both homonymous and unambiguous words (which do not differ much).


This advantage arises because the shared features among the overlapping meanings mutually support each other. In contrast, late in processing (time C), homonymous words show a disadvantage relative to both polysemous and unambiguous words (which do not differ). This disadvantage is due to competition among the non-overlapping features of the alternative unrelated meanings of homonymous words. Thus, the network exhibits the pattern of results observed by Hino et al. (2006), not because of task differences (as there are none in the model), but because of changes in the dynamics among sets of semantic features in the model. The model accounts for the empirical data if we assume that lexical decisions can be made relatively early in the course of semantic processing, whereas semantic categorization requires a more precise semantic representation that takes longer to activate.

Figure 2: The average number of semantic units in the Armstrong and Plaut (2008) model that were active above 0.7 for polysemous, unambiguous, and homonymous words. Note that these trajectories do not reflect presemantic visual and orthographic processing; the zero time-point reflects the onset of semantic processing only, and no semantic units were active above 0.7 before unit update 10. (Adapted from Armstrong and Plaut 2008.)
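The pattern-construction scheme described above can be sketched roughly as follows (a toy approximation of ours: the sparsity and layer sizes follow the description in the text, but the way the 60% overlap is enforced is a simplification, and the real simulation additionally ensured that all patterns differed by at least three units):

```python
import numpy as np
rng = np.random.default_rng(3)

def pattern(n):
    """Random binary pattern with roughly 15% of units active."""
    return (rng.random(n) < 0.15).astype(int)

def polysemous_senses(n_sem=100, shared=0.6):
    """Derive two senses from one prototype; each keeps a random 60% of
    the prototype's features, approximating the polysemy manipulation."""
    proto = pattern(n_sem)
    senses = []
    for _ in range(2):
        s = proto.copy()
        idx = rng.choice(n_sem, size=int(n_sem * (1 - shared)), replace=False)
        s[idx] = pattern(len(idx))  # resample the non-shared portion
        senses.append(s)
    return senses

orth = pattern(25)                               # one spelling...
contexts = [pattern(75), pattern(75)]            # ...two disambiguating contexts
homonym_meanings = [pattern(100), pattern(100)]  # unrelated meanings
polyseme_senses = polysemous_senses()            # related senses

print((homonym_meanings[0] & homonym_meanings[1]).sum())  # little feature overlap
print((polyseme_senses[0] & polyseme_senses[1]).sum())    # substantial overlap
```

The cooperative early dynamics and competitive late dynamics in the model follow directly from this difference in feature overlap between the two ambiguity types.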


On this account, it should be possible to shift from a polysemy advantage to a homonymy disadvantage within a single task solely by increasing difficulty (and thus the degree of semantic processing). Armstrong and Plaut (2008) tested and confirmed this prediction by varying the wordlikeness (summed bigram frequency) of nonword foils in a lexical decision task. Moreover, by using moderately wordlike nonwords, they confirmed the model's prediction that, with an intermediate amount of semantic processing (time B), both effects should be observed (see Figure 2).

The Armstrong and Plaut (2008) model illustrates – in admittedly highly oversimplified form – how context can serve to disambiguate words of varying degrees of ambiguity in a way that is consistent with at least some aspects of human comprehension processes (see also Gaskell and Marslen-Wilson 1997). But in many ways the model begs the question of where the appropriate context representations come from in the first place. One possible answer is that the network activation left behind by the previous word might serve as the relevant context. However, while some models have used this approach to model lexical semantic priming effectively (e.g., Plaut and Booth 2000), the meaning of a single word is insufficient in general to capture the richness and complexity of how previous (and even subsequent) linguistic input can serve to alter the meaning of a word. A full treatment of context effects on word comprehension requires embedding lexical processing within a broader framework for sentence understanding.

As an example of how sentence-level syntax and semantics must be used to determine word meanings, consider the following:

1. The pitcher threw the ball.

Here, every content word has multiple meanings in isolation but an unambiguous meaning in context. The same is true of vague or generic words, such as CONTAINER, which can refer to very different types of objects in different contexts, as in

2. The container held the apples.
3. The container held the cola.

Finally, at the extreme end of context dependence are implied constituents, which are not even mentioned in the sentence but nonetheless are an important aspect of its meaning. For example, from

4. The boy spread the jelly on the bread.

most people infer that the instrument was a knife.


To address how sentence context can inform word comprehension (among other issues), St. John and McClelland (1990; McClelland, St. John, and Taraban 1989) developed a connectionist model that instantiates sentence comprehension as a constraint-satisfaction process in which multiple sources of information from both syntax and semantics are simultaneously brought to bear in constructing the most plausible interpretation of a given utterance. The architecture of the model, in the form of a simple recurrent network, is shown in Figure 3. The task of the network was to take as input a single-clause sentence as a sequence of constituents (e.g., THE-BUSDRIVER ATE THE-STEAK WITH-A-KNIFE) and to derive an internal representation of the event described by the sentence, termed the Sentence Gestalt. Critically, this representation was not predefined but was learned from feedback on its ability to generate appropriate thematic role assignments for the event given either a role (e.g., Agent, Patient, Instrument) or a constituent that fills a role (e.g., busdriver, steak, knife) as a probe.

Events were organized around actions and had a probabilistic structure. Specifically, each of 14 actions had a specified set of thematic roles, each of which was filled probabilistically by one of the possible constituents. In this process, the selection of fillers for certain roles biased the selection for other roles. For example, for eating events, the busdriver most often ate steak whereas the teacher most often ate soup, although occasionally the reverse occurred. These probabilistic biases in the construction of events were intended to approximate the variable but non-random structure of real-world events: some things are more likely than others to play certain roles in certain activities. The choice of words in the construction of a sentence describing the event was also probabilistic. The event of a busdriver eating a steak with a knife might be rendered as THE-ADULT ATE THE-FOOD WITH-A-UTENSIL, THE-STEAK WAS-CONSUMED-BY THE-PERSON, SOMEONE ATE SOMETHING, and so on. This variability captures the fact that, in real life, the same event may be described in many different ways and yet understood similarly. Overall, given the probabilistic event structures and the lexical and syntactic options for describing events as sentences, there were a total of 120 different events (of which some were much more likely than others) and 22,645 different sentence-event pairs.
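The probabilistic event-and-sentence generation can be sketched in a few lines; the particular probabilities, fillers, and sentence templates below are invented stand-ins for the kind of environment described in the text, not the model's actual corpus:

```python
import random
random.seed(4)

# Role-filler biases for "eating" events (numbers are invented
# illustrations of the dependencies described in the text).
AGENT_P = {"busdriver": 0.5, "teacher": 0.5}
FOOD_P = {
    "busdriver": {"steak": 0.8, "soup": 0.2},  # busdriver usually eats steak
    "teacher":   {"steak": 0.2, "soup": 0.8},  # teacher usually eats soup
}
INSTRUMENT = {"steak": "knife", "soup": "spoon"}  # hypothetical assignments

def sample(dist):
    return random.choices(list(dist), weights=dist.values())[0]

def make_event():
    """Sample one eating event; earlier role choices bias later ones."""
    agent = sample(AGENT_P)
    patient = sample(FOOD_P[agent])
    return {"action": "ate", "agent": agent,
            "patient": patient, "instrument": INSTRUMENT[patient]}

def describe(event):
    """One of several possible surface renderings of the same event."""
    templates = [
        "THE-{agent} ATE THE-{patient} WITH-A-{instrument}",
        "THE-{patient} WAS-EATEN-BY THE-{agent}",
        "SOMEONE ATE SOMETHING",
    ]
    caps = {k: v.upper() for k, v in event.items()}
    return random.choice(templates).format(**caps)

event = make_event()
print(event)
print(describe(event))
```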


Figure 3: The architecture of the St. John and McClelland (1990) model of sentence comprehension. The number of units in each layer is shown in parentheses. The large arrows identify which layers receive input (incoming arrow) or produce output (outgoing arrow). The dashed arrow indicates a projection from "context" units (omitted for clarity) whose states are copied from the Sentence Gestalt layer for the previous time step. The indicated content of representations is midway through the sentence THE BUSDRIVER ATE THE STEAK WITH A KNIFE. (Adapted from St. John and McClelland 1990.)
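To fix ideas, here is a schematic of the model's processing loop – compressed relative to Figure 3, in that the hidden layers are folded into single weight matrices, the weights are random rather than trained, and all layer sizes are our own illustrative assumptions:

```python
import numpy as np
rng = np.random.default_rng(5)
sig = lambda x: 1 / (1 + np.exp(-x))

N_IN, N_SG, N_PROBE, N_OUT = 50, 100, 50, 60  # illustrative layer sizes

# Untrained random weights; the real model learned these from
# role/filler feedback after every constituent.
W_gestalt = rng.normal(0, 0.1, (N_IN + N_SG, N_SG))   # constituent + prior gestalt
W_answer = rng.normal(0, 0.1, (N_SG + N_PROBE, N_OUT))

def process_sentence(constituents):
    """Revise the Sentence Gestalt once per incoming constituent."""
    gestalt = np.zeros(N_SG)
    for c in constituents:
        gestalt = sig(np.concatenate([c, gestalt]) @ W_gestalt)
    return gestalt

def answer_probe(gestalt, probe):
    """Given a role (or filler) probe, produce the other member of the pair."""
    return sig(np.concatenate([gestalt, probe]) @ W_answer)

constituents = [rng.random(N_IN) for _ in range(4)]  # THE-BUSDRIVER ATE ... stand-ins
gestalt = process_sentence(constituents)
print(answer_probe(gestalt, rng.random(N_PROBE)).shape)  # (60,) role/filler units
```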

During training, sentence-event pairs were generated successively and the constituents of each sentence were presented one at a time over the Current Constituent units (see Figure 3). For each constituent, the network updated its Sentence Gestalt representation and then attempted to use this representation as input to generate the full set of role/filler pairs for the event. Specifically, with the Sentence Gestalt fixed and given either a role or a filler over the Probe units, the network had to generate the other element of the pair over the Role/Filler units. For example, after the presentation of THE-STEAK in the sentence THE-STEAK WAS-EATEN-BY THE-BUSDRIVER, the network was trained to output, among other things, the agent (busdriver), the patient (steak), the action (eating), and the instrument (fork). It was, of course, impossible for the network to do this with complete accuracy, as these role assignments depend on constituents that have yet to occur or are only implied.


Even so, the network could do better than chance; it could attempt to predict missing information based on its experience with the probabilistic dependencies in the event structures. More specifically, it could (and did) generate distributions of activity over roles and fillers that approximated their frequency of occurrence over all possible events described by sentences that start with THE-STEAK. Note that these distributions could, in many cases, be strongly biased towards the correct responses. For example, steaks typically fill the patient role in events about eating, and (in the environment of the network) steaks are most commonly eaten by busdrivers using a fork. In this way, the training procedure encouraged the network to extract as much information as possible as early as possible, in keeping with the principle of immediate update (Marslen-Wilson and Tyler 1980). Of course, the network also had to learn to revise the Sentence Gestalt appropriately in cases where its predictions were violated, as in THE-STEAK WAS-EATEN-BY THE-TEACHER. The network was trained on a total of 630,000 sentence-event pairs, in which some pairs occurred frequently and others – particularly those with atypical role assignments – were very rare. By the end of training, when tested on 55 randomly generated sentence-event pairs with unambiguous interpretations, the network was 99.4% correct.

St. John and McClelland (1990) carried out a number of specific analyses intended to establish that the network could handle more subtle aspects of sentence comprehension. In general, the network succeeded at using both semantic and syntactic context to (1) disambiguate word meanings (e.g., for THE-PITCHER HIT THE-BAT WITH-THE-BAT, assigning flying bat as patient and baseball bat as instrument); (2) instantiate vague words (e.g., for THE-TEACHER KISSED SOMEONE, activating a male of unknown age as patient); and (3) elaborate implied roles (e.g., for THE-TEACHER ATE THE-SOUP, activating spoon as the instrument; for THE-SCHOOLGIRL ATE, activating a range of foods as possible patients).

Disambiguation requires the competition and cooperation of constraints from both the word and its context. While the word itself cues two different interpretations, the context fits only one. In THE-PITCHER HIT THE-BAT WITH-THE-BAT, PITCHER cues both container and ball-player. The context cues both ball-player and busdriver, because the model has seen sentences involving both people hitting bats. All the constraints supporting ball-player combine, and together they win the competition for the interpretation of the sentence. In this way, even when several words of a sentence are ambiguous, the event which they support in common dominates the disparate events that they each support individually.


The processing of both instances of BAT works similarly: the word and the context mutually support the correct interpretation. Consequently, the final interpretation of each word fits together into a globally consistent understanding of an entire coherent event.

There is no question that the Sentence Gestalt model has important limitations in its theoretical scope and empirical adequacy. The model was trained on sentences restricted to single clauses without embeddings and pre-parsed into syntactic constituents, and the use of event structures composed of probabilistic assignment to fixed thematic roles was also highly simplified (although see Rohde 2002 for an extension of the model that addresses these limitations). Nonetheless, it is useful to consider the nature of word meanings, and lexical representations more generally, in light of the operation of the model. The first thing to note is that there is no real sense in which each word/constituent2 in the input is assigned a particular semantic representation – in the form of a pattern of activity over a group of units – even when disambiguated by context. Rather, the current word combines with the current context – coded in terms of the existing activation pattern within the network – to determine a new internal representation (over the hidden units) that then serves to revise the model's sentence interpretation (over the Sentence Gestalt layer). While it is true that the contribution of the current word is carried out via a relatively stable set of weights – those coming out of the unit (or units) coding it as input – the actual impact of this knowledge on active representations within the model is strongly dependent on context. This dependence can vary from introducing subtle shading (for polysemous words) to selection of an entirely distinct interpretation (for homonymous words), and everything in between. In this way, in the context of the model, it would be a mistake to think of words as "having" one or more meanings; rather, words serve as "cues" to sentence meaning – for some words, the resulting sentence meanings have considerable similarity whereas for others, they can be quite unrelated.

In the context of a typical psycholinguistic experiment, where words are presented in isolation and in a random order, the representation of "sentence context" is generally unrelated and unbiased relative to the contexts that a word typically occurs in, and so the resulting representation evoked over the Sentence Gestalt layer reflects the general implications of a word across all of its contexts – in some ways analogous to what happens in the model for the initial word/constituent of a sentence.


Such a pattern may be systematically related to other types of knowledge (e.g., pronunciation), but it wouldn't constitute a specific part of some larger lexical representation. In the model, and perhaps in the human language system as well, words are not assigned specific representations but solely serve as vehicles for influencing higher-level linguistic representations. It is in this sense that, as claimed at the outset of this chapter, distributed connectionist modelling gives rise to a view of language in which lexical knowledge and processing play a fundamental role in language understanding, without any explicit role for lexical representation per se.

5. Conclusions

Despite broad agreement on the critical roles that words play in language, there is very little clarity on the nature of word representations and how they interact with other levels of representation to support linguistic performance. Early theories of lexical representation used words as unstructured "handles" or pointers that simply linked together and provided access to phonological, orthographic, semantic, and grammatical knowledge. However, such a simple account is undermined by careful consideration both of the effects of the internal structure of words and of the subtleties in how words are influenced by the contexts in which they occur. Distributed connectionist modeling provides a way of learning the functional relationships among different types of information without having to posit an explicit, discrete data structure for each word (or morpheme). Rather, the similarity structure of activation patterns within and between each domain can capture various aspects of morphological relatedness, and an emerging sentence-level interpretation can modulate the contributions that words make to meaning. Indeed, if the goal of language processing is cast as the comprehension and production of larger-scale utterances, individual words can be seen as contributing to these processes in context-sensitive ways without themselves being represented explicitly. Although the resulting theory of language processing runs against strong intuitions about the primacy of lexical representation in language, it might nonetheless provide the best account of actual language performance.

Notes

1. The many-to-one case, where many units code one and only one entity, is essentially a redundant version of a localist code. The one-to-many case, where entities correspond to single units but a given unit represents multiple entities, is too ambiguous to be useful.

2. Although St. John and McClelland's (1990) Sentence Gestalt model took constituents rather than words as input (e.g., THE-BUSDRIVER), Rohde's (2002) extension of the model took sequences of individual words as input.

References

Armstrong, Blair C., and David C. Plaut
2008 Settling dynamics in distributed networks explain task differences in semantic ambiguity effects: Computational and behavioral evidence. Proceedings of the 30th Annual Conference of the Cognitive Science Society. Mahwah, NJ: Lawrence Erlbaum Associates.
Bowers, Jeffrey S.
2009 On the biological plausibility of grandmother cells: Implications for neural network theories in psychology and neuroscience. Psychological Review, 116: 220-251.
Bybee, Joan
1985 Morphology: A study of the relation between meaning and form. Philadelphia: Benjamins.
Coltheart, Max, Kathleen Rastle, Conrad Perry, Robyn Langdon, and Johannes Ziegler
2001 DRC: A dual route cascaded model of visual word recognition and reading aloud. Psychological Review, 108: 204-256.
Gaskell, M. Gareth, and William D. Marslen-Wilson
1997 Integrating form and meaning: A distributed model of speech perception. Language and Cognitive Processes, 12: 613-656.
Gonnerman, Laura M., Mark S. Seidenberg, and Elaine S. Andersen
2007 Graded semantic and phonological similarity effects in priming: Evidence for a distributed connectionist approach to morphology. Journal of Experimental Psychology: General, 136: 323-345.
Harm, Michael W., and Mark S. Seidenberg
2004 Computing the meanings of words in reading: Cooperative division of labor between visual and phonological processes. Psychological Review, 111: 662-720.
Hino, Yasushi, Penny M. Pexman, and Stephen J. Lupker
2006 Ambiguity and relatedness effects in semantic tasks: Are they due to semantic coding? Journal of Memory and Language, 55: 247-273.
Joordens, Steve, and Derek Besner
1994 When banking on meaning is not (yet) money in the bank: Explorations in connectionist modeling. Journal of Experimental Psychology: Learning, Memory, and Cognition, 20: 1051-1062.


Kello, Christopher T., and David C. Plaut
2003 Strategic control over rate of processing in word reading: A computational investigation. Journal of Memory and Language, 48: 207-232.
Marslen-Wilson, William D., and Lorraine K. Tyler
1980 The temporal structure of spoken language understanding. Cognition, 8: 1-71.
McClelland, James L., Brian L. McNaughton, and Randall C. O'Reilly
1995 Why there are complementary learning systems in the hippocampus and neocortex: Insights from the successes and failures of connectionist models of learning and memory. Psychological Review, 102: 419-457.
McClelland, James L., and David E. Rumelhart
1981 An interactive activation model of context effects in letter perception: Part 1. An account of basic findings. Psychological Review, 88: 375-407.
McClelland, James L., Mark St. John, and Roman Taraban
1989 Sentence comprehension: A parallel distributed processing approach. Language and Cognitive Processes, 4: 287-335.
Morton, John
1969 The interaction of information in word recognition. Psychological Review, 76: 165-170.
Movellan, Javier R., and James L. McClelland
1993 Learning continuous probability distributions with symmetric diffusion networks. Cognitive Science, 17: 463-496.
O'Reilly, Randall C., and James L. McClelland
1994 Hippocampal conjunctive encoding, storage, and recall: Avoiding a tradeoff. Hippocampus, 6: 661-682.
Plaut, David C., and James R. Booth
2000 Individual and developmental differences in semantic priming: Empirical and computational support for a single-mechanism account of lexical processing. Psychological Review, 107: 786-823.
Plaut, David C., and Laura M. Gonnerman
2000 Are non-semantic morphological effects incompatible with a distributed connectionist approach to lexical processing? Language and Cognitive Processes, 15: 445-485.
Plaut, David C., and James L. McClelland
2010 Locating object knowledge in the brain: A critique of Bowers's (2009) attempt to revive the grandmother cell hypothesis. Psychological Review, 117: 284-290.


Plaut, David C., James L. McClelland, Mark S. Seidenberg, and Karalyn Patterson
1996 Understanding normal and impaired word reading: Computational principles in quasi-regular domains. Psychological Review, 103: 56-115.
Rodd, Jennifer M., M. Gareth Gaskell, and William D. Marslen-Wilson
2004 Modelling the effects of semantic ambiguity in word recognition. Cognitive Science, 28: 89-104.
Rohde, Douglas L. T.
2002 A connectionist model of sentence comprehension and production. Ph.D. dissertation, Computer Science Department, Carnegie Mellon University, Pittsburgh, PA. Available as Technical Report CMU-CS-02-105.
St. John, Mark F., and James L. McClelland
1990 Learning and applying contextual constraints in sentence comprehension. Artificial Intelligence, 46: 217-257.
Taft, Marcus
2006 A localist-cum-distributed (LCD) framework for lexical processing. In From inkmarks to ideas: Current issues in lexical processing, Sally Andrews (ed.), 76-94. Hove, UK: Psychology Press.

Recognizing words from speech: The perception-action-memory loop

David Poeppel and William Idsardi

1. Conceptual preliminaries

1.1. Terminological

The failure to be sufficiently careful about terminological distinctions has resulted in some unnecessary confusion, especially when considering the neurobiological literature. For example, the term speech perception has unfortunately often been used interchangeably with language comprehension. We reserve the term language comprehension for the computational subroutines that occur subsequent to the initial perceptual analyses. In particular, language comprehension can be mediated by ear, eye, or touch. The linguistic system can be engaged by auditory input (speech), visual input (text or sign), and tactile input (Braille). In other words, the processes that underlie language comprehension build on sensorimotor input processes that appear to be, at least in part, independent. While this point may seem pedantic, the literature contains numerous reports that do not respect these distinctions and that conflate operations responsible for distinct aspects of perception and comprehension. We focus here on speech perception proper, the perceptual analysis of auditory input.

Importantly, further distinctions must be considered. There are at least three experimental approaches grouped under the rubric 'speech perception,' and because they differ in the structure of the input, the perceptual subroutines under investigation, and the putative endpoint of the computations, it is important to be cognizant of these distinctions, too.

(a) Most research on speech perception refers to experimentation on specific contrasts across individual speech sounds, i.e., sub-lexical/pre-lexical units of speech. Subjects may be presented with single vowels or single syllables and asked to execute particular tasks, such as discrimination or identification. In a typical study, subjects listen to consonant-vowel (CV) syllables drawn from an acoustic continuum – for example, series exemplifying the /ra/-/la/ tongue-shape contrast or the /bi/-/pi/ voicing contrast – and are asked upon presentation of a single token to identify the stimulus category.


This research strategy focuses on sub-lexical properties of speech and typically examines questions concerning the nature of categorical perception in speech (e.g., Liberman 1996), the phonemic inventory of speakers/listeners of different languages (e.g., Harnsberger 2000), perceptual magnet effects (e.g., Kuhl et al. 2007), the changes associated with first (e.g., Eimas et al. 1971) and second language learning (e.g., Flege and Hillenbrand 1986), phonotactic constraints (e.g., Dupoux et al. 1999; Kabak and Idsardi 2007), the role of distinctive features (e.g., Kingston 2003), and other issues productively addressed at the pre-lexical level of analysis. This work has been immensely productive in the behavioral literature and is now prominent in the cognitive neurosciences. For example, using fMRI, several teams have examined regionally specific hemodynamic effects when subjects execute judgments on categorically varying stimuli (Blumstein, Myers, and Rissman 2005; Liebenthal et al. 2005; Raizada and Poldrack 2007). These studies aim to show that there are regions responding differentially to signals that belong to different categories, or that are speech versus non-speech. Interestingly, no simple answer has resulted from even rather similar studies, with temporal, parietal, and frontal areas all implicated. Similarly, electrophysiological methods (EEG, MEG) have been used to probe the phonemic inventories of speakers of different languages. For example, Näätänen et al. (1997) were able to show subtle neurophysiological distinctions that characterize the vowel inventories of Finnish versus Estonian speakers. Kazanina, Phillips, and Idsardi (2006), discussed further below, used MEG data to illustrate how language-specific contrasts (Russian versus Korean), including allophonic distinctions, can be quantified neurophysiologically.

Despite its considerable influence, it must be acknowledged that this research program has noteworthy limitations. For example, a disproportionately large number of studies examine categorical perception as well as the notion of 'rapid temporal processing', all typically based on plosive contrasts (especially voice-onset time, VOT). While syllables with plosive onsets are admittedly fascinating in their acoustic complexity (and VOT is easily manipulated), a rich variety of other phenomena at the pre-lexical level have not been well explored. Moreover, these types of studies are 'maximally ecologically invalid': experimenters present single, sub-lexical pieces of speech in the context of experimental settings that require 'atypical' attention to particular features – and by and large engage no further linguistic processing, even designing studies with non-words so as to preclude as much as possible any interference from other linguistic levels of analysis. The results obtained are therefore in danger of masking or distorting the processes responsible for ecologically natural speech perception.


Speakers/listeners do not consciously attend to sub-lexical material, and therefore the interpretation of these results, especially in the context of neurobiological findings, requires a great deal of caution, especially since task effects are known to modulate normal reactivity in dramatic ways.

(b) A second line of research investigates speech perception through the lens of spoken word recognition. These studies have motivated a range of lexical access models (for instance, lexical access from spectra, Klatt 1979, 1989; instantiations of the cohort model, e.g., Gaskell and Marslen-Wilson 2002; the neighborhood activation model, Luce and Pisoni 1998; continuous mapping models, Allopenna et al. 1998; and others) and have yielded critical information regarding the structure of mental/neural representations of lexical material. Behavioral research has made many significant contributions to our understanding and has been extensively reviewed prior to the advent of cognitive neuroscience techniques (see, for example, influential edited volumes by Marslen-Wilson 1989 and Altmann 1990). Typical experimental manipulations include lexical decision, naming, gating, and priming. Recognizing single spoken words is considerably more natural than performing unusual tasks on sub-lexical material. Some models, such as the influential TRACE model (McClelland and Elman 1986), view featural and lexical access as fully integrated, while others argue for more cascaded operations. Some important cognitive neuroscience contributions in this domain have been made by Blumstein and colleagues, who have examined aspects of spoken word recognition using lesion and imaging data (e.g., Misiurski et al. 2005; Prabhakaran et al. 2006; Utman, Blumstein, and Sullivan 2001). The data support a model in which superior temporal areas mediate acoustic-phonetic analyses, temporo-parietal areas perform the mapping to phonological-lexical representations, and frontal areas (specifically the inferior frontal gyrus) play a role in resolving competition (i.e., deciding) between alternatives when listeners are confronted with noisy or underspecified input. The effect of lexical status on speech-sound categorization has been investigated extensively in the behavioral literature (typically in the context of evaluating top-down effects), and Blumstein and colleagues, using voicing continua with word or non-word endpoints, have recently extended this work using fMRI (Myers and Blumstein 2008). They demonstrate that fMRI data show dissociations between functionally 'earlier' effects in the temporal lobes (related to perceptual analyses) and putatively 'later,' downstream decision processes implicating frontal lobe structures. A behavioral task that has been used productively in studies of lexical representation is repetition priming, and Gagnepain et al. (2007) used word and non-word repetition priming to elucidate which cortical structures are specifically sensitive to the activation of lexical entries.


Bilateral superior temporal sulcus and superior temporal gyrus (STS, STG) are particularly prominent, suggesting that the mapping to lexical information occurs in cortical regions slightly more ventral than perceptual computations (and bilaterally; cf. Hickok and Poeppel 2000, 2004, 2007). Finally, subtle theoretical proposals about lexical representation have recently been tested in electrophysiological studies. Eulitz and colleagues (Friedrich, Eulitz, and Lahiri 2006), for example, have used lexical decision designs to support underspecification models of lexical representation.

(c) A third way in which speech perception is examined is in the context of recognizing spoken sentences and assessing their intelligibility. In these studies, participants are presented with sentences (sometimes containing acoustic manipulations) and are asked to provide an index of intelligibility, for example by reporting key words or providing other metrics that reflect performance. Understanding spoken sentences is, naturally, a critical goal because it is the perceptual task we most want to explain – but there is a big price to pay for using this type of ecologically natural material. In using sentential stimuli, it becomes exceedingly difficult to isolate input-related perceptual processes per se (imagine teasing out effects of activation, competition, and selection à la Marslen-Wilson), because presentation of sentences necessarily entails lexical processes, syntactic processes, and both lexical semantic and compositional semantic processes – and therefore engages numerous 'top-down' factors that demonstrably play a critical role in the overall analysis of spoken input. Cognitive neuroscience methodologies have been used to test intelligibility at the sentence level as well. In a series of PET and fMRI studies, for example, Scott and colleagues have shown that anterior temporal lobe structures, especially anterior STS, play a privileged role in mediating intelligibility (e.g., Scott et al. 2000). Electrophysiological techniques have also been used to study sentence-level speech intelligibility, and Luo and Poeppel (2007) have argued that phase information in a particular frequency band of the cortical signal, the theta band, is closely related to and modulated by the acoustics of sentences.

In summary, the locution 'speech perception' has been used in at least three differing ways. Important attributes of the neurocognitive system underlying speech and language have been discovered using all three approaches discussed. This brief outline serves to remind the reader that it is challenging to isolate the relevant perceptual computations. Undoubtedly, we need to turn to all types of experimental approaches to obtain a full characterization.


For example, to understand the nature of distinctive features for perception and representation, experimentation at the subphonemic, phonemic, and syllabic levels will be critical; to elucidate how words are represented and accessed, research on spoken-word recognition is essential; and it goes without saying that we cannot do without an understanding of the comprehension of spoken sentences. Here, we take speech perception to refer to a specific set of computational subroutines (discussed in more detail in section 1.3 below): speech perception comprises the set of operations that take as input continuously varying acoustic waveforms made available at the auditory periphery and that generate as output those representations (morphemic, lexical) that serve as the data structures for subsequent operations mediating comprehension. More colloquially, our view can be caricatured as the collection of operations that lead from vibrations in the periphery to abstractions in cortex (see Figure 1).

1.2. Methodological

Brain science needs gadgets, and practically every gadget usable on humans has been applied to speech and lexical access. There are two types of approaches that the consumer of the literature should know about: 'direct' techniques using electrical or magnetic measurement devices, and 'indirect' recording using hemodynamically based measurements as proxies for brain activity. The different methods are suited to address different kinds of questions about speech and language, and the careful alignment of research question with technique should be transparent.

The electrical and electromagnetic techniques directly measure different aspects of neuronal activity. Electrophysiological approaches applied to spoken-language recognition range from, on the one hand, very invasive studies with high spatial resolving power – single-unit recording in animals investigating the building blocks underpinning phonemic representation (Engineer et al. 2008; Rauschecker, Tian, and Hauser 1995; Schroeder et al. 2008; Steinschneider et al. 1994; Young 2008) and pre-surgical subdural grid recording in epilepsy patients (e.g., Boatman 2004; Crone et al. 2001) – to, on the other hand, noninvasive recording using electroencephalography (EEG/ERP) and magnetoencephalography (MEG). These methods share the high temporal resolution (on the order of milliseconds) appropriate for assessing perceptual processes as they unfold in real time, but they differ greatly in the extent to which one can identify localized processes. Insofar as one has mechanistic processing models/hypotheses that address how speech is represented and processed in neuronal tissue, electrophysiological techniques are critical. Spoken language unfolds quickly, with acoustic signal changes in the millisecond range having specific consequences for perceptual classification.

Moreover, although many aspects of speech cannot be addressed in animal models (for example, lexical representation), the single-unit and local-field-potential (LFP) animal work informs us about how single neurons and neuronal ensembles encode complex auditory signals. Thus, even though the perceptual endgame is not the same for ferrets and Francophones, some of the subroutines that constitute perception can be probed effectively using animal models.

The hemodynamic techniques, principally fMRI and PET, and more recently NIRS (near-infrared spectroscopy), have been used extensively since the late 1980s to study speech perception (Binder et al. 2000; Blumstein et al. 2005; Burton, Small, and Blumstein 2000; Meyer et al. 2005; Obleser et al. 2007; Raettig and Kotz 2008; Scott and Wise 2004). The major advantages – especially of fMRI – are its spatial resolution and, now, the ubiquitous availability of the machines. It is now possible to detect activity differentially at a spatial scale of a millimeter or better, and these noninvasive recordings are therefore approaching a scale familiar from animal studies, roughly the scale of cortical columns (Bandettini 2003; Logothetis 2008). However, the temporal resolution is limited, roughly to changes occurring over hundreds of milliseconds (i.e., about a word or so). The main contribution of these approaches is to our understanding of the functional anatomy (see Section 3). Note, also, that these techniques provide a ‘spatial answer’ – requiring as a hypothesis a ‘spatial question.’ While the contribution of hemodynamic imaging to anatomy is considerable, questions about representation – and especially about online processing – are difficult to address with such methods. Recent reviews of fMRI, in particular, emphasize the need to complement such data with electrophysiological recordings (Logothetis 2008). As the leading neuroscientist and imaging expert Nikos Logothetis puts it, “fMRI is a measure of mass action. You almost have to be a professional moron to think you’re saying something profound about the neural mechanisms. You’re nowhere close to explaining what’s happening, but you have a nice framework, an excellent starting point” (http://www.sciencenews.org/view/feature/id/50295/title/Trawling_the_brain).

Historically, neuropsychological data have been the most widely available; consequently, deficit-lesion correlation research forms the basis for the functional anatomy of speech sound processing as we conceive it here (see Section 3). In recent years, the reversible (in)activation of neuronal tissue using transcranial magnetic stimulation (TMS) has received much attention, although as yet few studies have investigated speech – and those that have done so have yielded equivocal results (e.g., D’Ausilio et al. 2009).

The careful dissection of deficits and associated lesions has played a major role in establishing some of the key insights of current models, including that speech perception is mediated more bilaterally than common textbook wisdom holds, and that frontal areas contribute to perceptual abilities under certain task configurations (see, e.g., work by Blumstein for elaboration). Neuropsychological data establish both (a) that speech processing clearly dissociates from language processing as well as from other parts of auditory cognition (Poeppel 2001) and (b) that the classical view that the left temporal lobe subsumes speech and language comprehension is dramatically underspecified.

While these school-marmish reminders regarding the benefits and limitations of techniques may seem irritating and perhaps even obvious, it is remarkable how often research is insensitive to crucial methodological limitations, thereby furthering interpretations that are untenable given the origin of the data. Insofar as we seek a theoretically sound, formally explicit, and neuronally realistic model of spoken language processing in the brain, a thoughtful consideration of which techniques answer which questions is essential.

1.3. ‘Function-o-logical’

The perspective summarized here has been developed in recent pieces (Poeppel and Hackl 2008; Poeppel, Idsardi, and van Wassenhove 2008; Poeppel and Monahan 2008). What we hope to provide is a serviceable definition for the cognitive neuroscience of speech perception that links various interrelated questions from acoustics to phonology to lexical access. Figure 1, from Poeppel et al. (2008), summarizes what we take to be the problem.

The starting point for the perceptual-computational system is the acoustic signal, a continuously varying waveform that encodes information on different timescales (Fig. 1a). For example, the amplitude envelope of the signal correlates well with properties of the syllabic structure of an utterance; the fine structure of the signal, in contrast, carries information over shorter timescales (including features and segments). This input array must ultimately be transformed into a series of discrete segments that constitute a morpheme/word. Because we believe the key goal to be the identification of words, specifying the format of lexical representation is necessary. Moreover, the morphemes/words must be stored in a format that permits them to enter into subsequent linguistic computation (including, e.g., the combinatoric operations that underlie language comprehension); identifying a word is not nearly enough – the listener must be able to connect it formally (i.e., in terms of representational specifications, such as noun, determiner, etc., in whatever neural code it is specified in) to its neighboring environment, e.g., to perform whatever phonological, morphological, syntactic, or semantic operation the situation demands.

Figure 1. From waveforms to words. Continuously varying acoustic signals (a) are analyzed in the afferent auditory pathway, ultimately to be represented as ‘neural versions’ of spectrograms in bilateral auditory cortex (b). Based on this high-resolution auditory representation, we hypothesize that a ‘primal sketch’ – based on multi-time resolution analysis – is constructed (c). The perceptual endgame is the identification of words, which we take to be represented in memory as sequences of segments that are themselves composed of bundles of distinctive features (d). From Poeppel et al. 2008.

We adopt the view, developed in linguistic research over the last half century – and implicit since the Phoenicians invented an alphabet – that such lexical representations are stored as a series of segments that are themselves made up of bundles of distinctive features (Fig. 1d; see Section 2 for more motivation); we will also explicitly allow other parallel representations, e.g., syllables.
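One minimal way to make this data structure concrete is sketched below; the feature labels and the two toy entries are our own illustrative assumptions (chosen to echo the ‘we’/‘you’ contrast taken up in section 2.2), not a worked-out phonology or an implementation from this chapter.

```python
# A sketch of lexical entries stored as sequences of segments, each
# segment a bundle of distinctive features (cf. Fig. 1d). Feature
# names and entries are illustrative only.

Segment = frozenset  # one bundle of distinctive features

LEXICON = {
    # 'we' /wi/: rounding synchronized with the initial glide
    "we": (Segment({"-cons", "+approx", "+round"}),
           Segment({"-cons", "+high", "-back"})),
    # 'you' /ju/: rounding synchronized with the vowel
    "you": (Segment({"-cons", "+approx", "-back"}),
            Segment({"-cons", "+high", "+back", "+round"})),
}

def matches(stored: tuple, observed: tuple) -> bool:
    """An underspecified stored form matches if every stored feature
    was detected in the corresponding input segment."""
    return len(stored) == len(observed) and all(
        s <= o for s, o in zip(stored, observed)
    )
```

Storing bundles as sets makes underspecification natural: a stored form may say less than the signal shows, and matching reduces to a subset test.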

The input waveform (representation R1) is analyzed by the auditory periphery and is presumably represented in auditory cortex by neurons with sophisticated spectro-temporal receptive field properties (STRFs). One can think of this as a neural version of a spectrogram, albeit one composed of numerous mini-spectrograms with specializations for certain spectro-temporal patterns (Fig. 1b), such as the characteristic convergent second and third formant trajectories near velars (Stevens 1998). This type of representation (R2) is most likely a property of neurons in the auditory cortex, and it does not differentiate between speech and non-speech signals. Moreover, given the highly conserved nature of mammalian auditory cortex, these representations are very likely shared with other species, and they can consequently be investigated using animal models and single-cell recording approaches.

Based on this initial (high-resolution) auditory cortical pattern, multiple representations on different scales are constructed in parallel. In this next step, ‘auditory primitives’ are built out of early auditory cortical elements, one key feature being the timescale of the new representations. This third type of representation (R3) must be of a granularity that permits mappings (linking operations) from the encoding of simple acoustic properties in early auditory cortical areas to speech primitives in more downstream areas (arguably including STG and STS). We conjecture that these intermediate representations encompass at least two subtypes (temporal primitives) commensurate with syllabic and segmental durations (Boemio et al. 2005; Giraud et al. 2007; Poeppel 2001, 2003; Poeppel et al. 2008). The initial cortical representation is thus fractionated into (at least) two streams, and concurrent multi-time resolution analysis then lies at the basis of subsequent processing. The specific nature of R3 is a critical research question, and we have characterized the challenge as arriving at a ‘primal sketch’ for speech perception (Fig. 1c), akin to Marr’s famous hypothesis about intermediate representations for object recognition; one possibility for the primal sketch is the PFNA coarse coding (plosive-fricative-nasal-approximant) discussed below.

The final, featurally specified representation (R4) constitutes the format that is both the endpoint of perception and the set of instructions for articulation. As discussed further below, the loop between perception, memory, and action is enabled because the representational format used for words in memory, distinctive features, allows the mapping both from input to words (identify the features) and from words to action (features are in motoric coordinates).
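The multi-time resolution idea can be sketched very simply: analyze one signal with two smoothing windows, one commensurate with segmental durations and one with syllabic durations. The code below is our own toy illustration of this point, not the authors' model; the window sizes are assumptions drawn from the duration ranges cited above.

```python
import numpy as np

def envelope(signal: np.ndarray, sr: int, window_ms: float) -> np.ndarray:
    """Rectify and smooth with a moving average of the given width."""
    n = max(1, int(sr * window_ms / 1000))
    return np.convolve(np.abs(signal), np.ones(n) / n, mode="same")

sr = 16000
x = np.random.randn(sr)  # 1 s of noise standing in for a speech waveform

# Two concurrent 'views' of the same input (cf. Fig. 1c):
segmental_view = envelope(x, sr, window_ms=25)   # fast, segment-scale
syllabic_view = envelope(x, sr, window_ms=200)   # slow, syllable-scale
```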

Obviously, a major goal now must be to look for a Hegelian synthesis of these various antitheses, i.e., of levels of representation with competing structures and affordances. In particular, how is it that we have so much solid evidence for both cohorts and neighborhoods, whose guiding assumptions seem irreconcilable? What kind of system is this that exhibits both phonetic specificity (a surface property of speech sounds) and phonological underspecification (a generalization following from a highly abstract code)? Again, we believe that in order to understand this panoply of confusing results we need to draw further distinctions, and we offer up a modest proposal in order to have our exemplar cake and eat it too.

Stealing a line from Cutler and Fay (1982), we agree that there is “one mental lexicon, phonologically arranged.” But the operative word here is “arranged”. We envision a three-step process that offers a place for each of these kinds of findings. The first step is a coarse coding of the signal into universal speech categories (akin if not identical to Stevens’ (1998) landmarks). For concreteness, let us say that this code is just the speech stream coded into four categories (PFNA: plosives, fricatives, nasals, and approximants). Preliminary modeling of English-like lexicons suggests that this coding yields pools of words of approximately the same size as the usual lexical neighborhoods, with fair overlap between the various pools and neighborhoods. Within these pools we can then conduct a directed left-to-right search using contextually defined featural definitions (i.e., the cues for [labial] within [nasal] are different from those within [plosive], and differ from language to language). Moreover, this search can be guided by the differences amongst the words in the active pool, using analysis-by-synthesis and Bayesian inference (see below). Finally, once the best available word-form has been selected, the contents of that lexical item are examined, compared to the memory trace of the incoming signal, and verified to be, in fact, the word we’re looking for. Since the lexical entry contains a great deal of information (morphology, syntax, semantics, pragmatics, usage), there is little harm or cost (and much benefit) in storing a detailed phonetic summary of the form’s pronunciation (though we would prefer a model-based statistical summary to an exemplar cloud). In sum, we get to the entry via a coarse-coded search with subsequent directed refinement, but the choice needs to be verified before it is accepted. Thus we expect (eventually) to see in the time course of word recognition early effects of coarse coding followed later by exemplar-like effects of lexical-item phonetic specificity, even if our current methods are perhaps too crude to pick up this distinction.
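As a toy illustration of the first, coarse-coding step, the sketch below assigns each phone of a word to one of the four PFNA classes and groups a miniature lexicon into pools by that signature. The phone inventory, class assignments, and mini-lexicon are our own assumptions; the preliminary modeling of English-like lexicons mentioned above is not reproduced here.

```python
from collections import defaultdict

# Assumed phone -> major-class map (P plosive, F fricative, N nasal,
# A approximant, with vowels and glides counted as approximants).
PFNA = {"p": "P", "b": "P", "t": "P", "d": "P", "k": "P", "g": "P",
        "f": "F", "v": "F", "s": "F", "z": "F",
        "m": "N", "n": "N",
        "l": "A", "r": "A", "w": "A", "j": "A",
        "a": "A", "e": "A", "i": "A", "o": "A", "u": "A"}

def coarse_code(phones: str) -> str:
    """Collapse a phone string into its PFNA signature."""
    return "".join(PFNA[p] for p in phones)

lexicon = ["bat", "pad", "dig", "man", "nil", "fas"]  # rough transcriptions

pools = defaultdict(list)
for word in lexicon:
    pools[coarse_code(word)].append(word)

print(dict(pools))
# {'PAP': ['bat', 'pad', 'dig'], 'NAN': ['man'], 'NAA': ['nil'], 'FAF': ['fas']}
```

Within a pool such as {'bat', 'pad', 'dig'}, the finer, contextually defined featural cues described above would then adjudicate among the candidates.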

One way to think about the challenge is to consider the analogy to visual object recognition. Research there has attempted to identify which intermediate representations can link the early cortical analyses over small spatial receptive fields (edges or Gabor patches, or other early visual primitives) with the representation of objects. There have been different approaches to intermediate representations, but every computational theory, either explicitly or implicitly, acknowledges the need for them. The more traditional hypothesis – a mapping from acoustic to phonetic to phonological representations – is no longer central to the problem as we define it (although the mapping from R1/R2 to R3 to R4 is reminiscent of similar challenges). The multiple levels of representation we envision are simultaneous representations on different timescales, corresponding to different linguistic ‘views’ of the speech material.

2. Linguistic bases of speech perception

2.1. Features

Because most modern societies are literate and often familiar with a language with an alphabetic script, there is a tendency to identify speech perception with the perception of whole, single speech segments (phones or phonemes) – the amount of speech generally captured by a single letter in an alphabetic script. However, segmental phonemes are not the smallest units of representation: they are composed of distinctive features, which connect articulatory goals with auditory patterns and provide a discrete, modality- and task-neutral representation suitable for storage in long-term memory (see Jakobson, Fant, and Halle 1952 for the original proposals, and Halle 2002 for a spirited defense of this position; see Mielke 2007 for a contrasting view). For example, the feature [+round] encodes a speech sound component that in articulation involves rounding the lips through the innervation of the orbicularis oris muscle, and on the auditory side a region of speech with a downward sweep of all of the formants (when formant transitions are available) or diffuse spectra (in stop bursts and fricatives). The features are thus the basis of the translation (coordinate transformation) between acoustic space and articulator space, and moreover are part of the long-term memory representations for the phonological content of morphemes, forming the first memory-action-perception loop.

Phonetic features come in two kinds: articulator-bound and articulator-free. The articulator-bound features (such as [+round]) can only be executed by a particular muscle group. In contrast, the articulator-free, or “manner,” features, which (simplifying somewhat) specify the degree of constriction at the narrowest point in the vocal tract, can be executed by any of several muscles along the vocal tract. Specifying the degree of constriction defines the sonority scale, and thus the major classes of segments: plosives (with complete constriction), fricatives (with constrictions sufficiently narrow to generate turbulent noise), sonorants (including nasals, with little constriction), and glides and vowels (i.e., approximants, with virtually no constriction). Moreover, as noted above, this division suggests a computational technique for calculating R2 and R3: build a set of major-class detectors from R1 representations (Stevens 2002; Juneja and Espy-Wilson 2008). To a crude first approximation, this consists of detectors for quasi-silent intervals (plosives), regions with significant amounts of nonperiodicity (fricatives), regions with only one significant resonance (nasals), and regions with general formant structure (approximants, which must then be sub-classified). These definitions are plausibly universal, and all of these detectors are also plausibly ecologically useful for non-speech tasks (such as predator or prey detection); they should thus be amenable to investigation with animal models, and they are good candidates for hard-wired circuits. Once the major class is detected, dedicated subroutines particular to the recovered class are invoked to identify the contemporaneous articulator-bound features. In this way, features such as [+round] may have context-sensitive acoustic definitions, such as diffuse falling spectra in stop bursts, a relatively low spectral zero in nasals, and lowered formants in vowels.
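A first-pass version of these detectors can be sketched with frame-level acoustic measures; the thresholds, features, and decision order below are our own illustrative assumptions, not the detectors of Stevens (2002) or Juneja and Espy-Wilson (2008).

```python
import numpy as np

def major_class(frame: np.ndarray) -> str:
    """Classify one short analysis frame as P/F/N/A using crude
    stand-ins: energy for closure, zero-crossing rate for frication
    noise, spectral peak count for resonance structure."""
    if float(np.mean(frame ** 2)) < 1e-4:   # quasi-silent interval
        return "P"                          # plosive closure
    zcr = np.mean(np.abs(np.diff(np.sign(frame)))) / 2
    if zcr > 0.25:                          # strongly aperiodic
        return "F"                          # fricative
    spectrum = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    peaks = sum(spectrum[i - 1] < spectrum[i] > spectrum[i + 1]
                and spectrum[i] > 0.2 * spectrum.max()
                for i in range(1, len(spectrum) - 1))
    return "N" if peaks <= 1 else "A"       # one resonance vs. formants
```

Run frame by frame and collapsed over runs of identical labels, such a classifier would yield exactly the PFNA strings used for pool formation in section 1.3.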

2.2. Groupings

Even though the individual features are each tracked as a separate stream (like instruments in an orchestra), identification of the streams of phonetic features is not by itself sufficient to capture the linguistically structured representations. The features must be temporally coordinated, akin to the control exerted by the conductor. Speech time is quantized into differently sized chunks, and two critically important chunk sizes seem universally instantiated in spoken languages: segments and syllables. Temporal coordination of distinctive features overlapping for relatively brief amounts of time (10-80 ms) comprises segments; longer coordinated movements (100-500 ms) constitute syllabic prosodies. For instance, “we” and “you” differ in the synchronization of [+round]: in “we,” rounding coincides with the initial glide; in “you,” the rounding is on the vowel; and in “wu,” rounding covers both segments. This first aggregation of features must somehow ignore various coarticulation and imprecise-articulation effects, which can lead to phantom (excrescent) segments, as can be seen in pronunciations of “else” that rhyme with “welts” (familiar to Tom Lehrer fans). At the syllable level, English displays alternating patterns of weak and strong syllables, a distinction that affects the pronunciation of the segments within the syllables, with weak syllables having reduced articulations along several dimensions.
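The synchronization claim can be made concrete by treating each feature as a timed interval and asking which segment slots it overlaps; all timing values below are invented for illustration.

```python
# Segments as time slots (ms); a word is distinguished by WHERE the
# [+round] gesture lands. Values are invented for illustration.
segments = {"seg1": (0, 80), "seg2": (80, 250)}

def overlaps(a: tuple, b: tuple) -> bool:
    return a[0] < b[1] and b[0] < a[1]

def rounded_segments(round_interval: tuple) -> set:
    return {name for name, span in segments.items()
            if overlaps(round_interval, span)}

print(rounded_segments((0, 80)))    # seg1 only: a 'we'-like percept
print(rounded_segments((80, 250)))  # seg2 only: a 'you'-like percept
print(rounded_segments((0, 250)))   # both segments: a 'wu'-like percept
```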

It is possible that groupings of other sizes (morae, feet) are also relevant; certainly, linguistic theories postulate menageries of such chunks. We believe that the syllable level may begin to be calculated from the major-class detectors outlined in the previous section; typologically, syllable structure seems to be almost exclusively characterized by sonority, with the articulator-bound features playing little role in defining the constitution of syllables. We hypothesize that the parallel sketch of major-class-based syllables and the elaboration of segments via the identification of articulator-bound features offers a potential model for the synthesis of the so-far irreconcilable findings for cohort and neighborhood models of lexical access.

2.3. Predictable changes in pronunciation: phonological processes

Speech is highly variable. One of the goals of distinctive feature theory is to identify higher-order invariants in the speech signal that correlate with the presence of particular features like the [+round] example above (Perkell and Klatt 1986). However, even if we had a perfect theory of phonetic distinctive features, there is variation in the pronunciation of features and segments due to surrounding context, starting with the coarticulation effects inherent in the inertial movements of the articulators in the mouth. The net result of these patterned variations in pronunciation is that we are willing to consider disparate pronunciations to be instances of “the same speech sound” because we can attribute the differences in pronunciation to the surrounding context of speech material. A particularly easy way to observe this phenomenon is to consider different forms of the same word which arise through morphological operations like prefixation and suffixation. The 't's in "atom" and "atomic" are not pronounced the same way: "atom" is homophonous with "Adam" for many speakers of American English, whereas "atomic" has a portion homophonous with the name "Tom". In technical parlance, the 't' in "atom" is flapped, whereas the 't' in "atomic" is aspirated. This is by no means an unusual case. Every known language has such contextually determined pronunciations (allophonic variation) that do not affect the meanings of words and that, for the purpose of recovering the words, appear to be extra noise for the listener. Even worse, languages pick and choose which features they employ for storing forms in memory. English, for example, considers the difference between [l] and [r], [±lateral], to be contrastive, so that "rip" and "lip" are different words, as are "more" and "mole".

Korean, on the other hand, treats this difference as allophonic, a predictable aspect of the position of the segment in the word; the word for water is "mul" but the term for freshwater is "muri choŭn". For Koreans, [l] and [r] are contextual pronunciations of the same sound – they use [r] before vowels and [l] before consonants and at the ends of words. Recent MEG studies (Kazanina et al. 2006) have confirmed that listeners do systematically ignore allophonic differences (Sapir 1933). Using a mismatch design, Kazanina and colleagues compared the behavioral and neural responses of Russian and Korean speakers to items containing “ta” or “da”. The difference in the feature [voice] between “t” and “d” is significant (contrastive) in Russian, as it serves to distinguish pairs of words such as “dom” (house) and “tom” (volume). In Korean, however, this difference is predictable, with “d” occurring only between sonorants, as can be seen in the word “totuk” 도둑, meaning ‘thief’, pronounced “toduk” (and spelled that way in the McCune-Reischauer romanization system). In this word, the second “t” is pronounced as a “d” because it is flanked by vowels (similar to the English flapping rule). Subjects listened to sequences of items in which one of the two types (for instance, “da”) was much more frequent (the “standard”); the other item (the “deviant”, here “ta”) occurred 13% of the time. Neural responses to the same items in their different roles were compared (i.e., “ta” as standard was compared with “ta” as deviant). Russian speakers showed a reliable difference in their responses to standards and deviants, indicating that they detected the deviant items in a stream of standards. Conversely, Korean speakers showed no differences, suggesting that they form a perceptual equivalence class for “t” and “d”, mapping these two sounds onto the same abstract representation.

Similar phonological processes can also change the pronunciation of otherwise contrastive speech sounds. For instance, in English “s” and “z” are contrastive, differing in the feature [voice], as can be seen from the minimal pair “seal” and “zeal”. However, the plural ending is pronounced either “s” or “z” depending on the adjacent sound: “cats” but “dogz”. English listeners are sensitive to this sequential patterning rule, showing longer reaction times and differences in neural responses when “s” and “z” are cross-spliced into incongruent positions, *“utz”, *“uds” (Hwang et al. submitted). Thus, in later morphological computations, contrastive sounds are organized into higher-level equivalence classes displaying functional identity (such as the plural ending).

Phonological perception thus requires the identification of major-class features and articulator-bound features, the coordination of these features into segment-sized units and larger chunks, and the identification of equivalence classes of features and segments at various levels of abstraction, with this stage of processing culminating in the identification of a stored word form, which can then be mapped back out by the motor system in pronunciation.
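The equivalence-class idea can be illustrated with a toy normalizer that undoes predictable allophony before lexical lookup; the rule below implements only the Korean [l]/[r] pattern described above, in a deliberately simplified form.

```python
def normalize_korean(phones: list) -> list:
    """Map surface [r] and [l] onto a single underlying liquid /l/.
    In Korean the choice is positional ([r] before vowels, [l]
    elsewhere), so the contrast carries no lexical weight."""
    return ["l" if p in ("r", "l") else p for p in phones]

# 'mul' (water) and its prevocalic variant surface differently but
# map onto the same underlying liquid:
print(normalize_korean(["m", "u", "l"]))        # ['m', 'u', 'l']
print(normalize_korean(["m", "u", "r", "i"]))   # ['m', 'u', 'l', 'i']
```

An English normalizer would leave [l] and [r] untouched, since [±lateral] is contrastive there; the same machinery with language-specific rules yields the perceptual equivalence classes that the Kazanina et al. (2006) results point to.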

3. Cortical basis of speech perception: fractionating information flow across a distributed functional anatomy

Historically, the reception of speech is most closely associated with the discoveries of the German neurologist Wernicke. Based on his work, popularized in virtually every textbook since the early 20th century, it was hypothesized that posterior aspects of the left hemisphere, in particular the left superior temporal gyrus (STG), were responsible for the analysis of the input. Perception (modality-specific) and comprehension (modality-independent) were not distinguished, and so the left temporal lobe became the canonical speech perception region. Because speech perception dissociates clearly from the auditory perception of non-speech, as well as from more central comprehension processes (see, for example, data from pure word deafness, reviewed in Poeppel 2001; Stefanatos, Gershkoff, and Madigan 2005), the search for a ‘speech perception area’ is, in principle, reasonable (and the search for such a specialized region has, of course, yielded a rich and controversial research program in the case of face recognition and the fusiform face area). However, data deriving from lesion studies, brain imaging, and intracranial studies have converged on a model in which a distributed functional anatomy is more likely.

The major ingredients in this model are two concurrent processing streams. Early auditory areas, bilaterally, are responsible for creating a high-resolution spectro-temporal representation. In the terminology developed in Section 1, the afferent auditory pathway from the periphery to cortex executes the set of transformations from R1 to R2. In superior temporal cortex, two parallel pathways originate. One pathway, the ventral stream, is primarily involved in the mapping from sound to lexical meaning. We hypothesize that lexical representations per se (the mappings from concepts to lexical-phonological entries) ‘reside’ in the middle temporal gyrus (MTG) (for a recent analysis of lexical processing and MTG based on neurophysiology, see Lau, Phillips, and Poeppel 2008), and that the cortical regions that are part of the ventral stream perform the operations that transform acoustic signals into a format that can make contact with these long-term representations.

One crucial cortical region involved in the mapping from sound structure to lexical representation is the superior temporal sulcus (STS). Neurons in STS appear to execute some of the essential computations that generate the speech primitives. Our examination of both lesion and imaging data suggests that the ventral pathway is bilateral. Importantly, the left and right contributions are overlapping but not identical; for example, the fractionation of the auditory signal into temporal primitives of different granularity (i.e., different timescales) occurs differentially on the two sides (Boemio et al. 2005; Giraud et al. 2007). In short, the ventral pathway can itself be subdivided into concurrent processing streams that deal preferentially with different ‘linguistic views’ of the input signal, ideally mapping directly onto the parallel, linguistically motivated representations for segments and syllables.

The second pathway is the dorsal stream. The areas comprising the dorsal stream perform the mapping from sensory (or perhaps phonological) representations to articulatory and motor representations (R4). Various parts of the dorsal stream lie in the frontal lobe, including premotor areas as well as the inferior frontal gyrus. One critical new region that is motivated – and now identified – by research on this model is area Spt (Sylvian parietal-temporal; see Hickok et al. 2003; Pa and Hickok 2008). Acoustic information is represented in a different coordinate system than articulatory information, and thus the mapping from acoustic to motor requires a coordinate transformation. Moreover, the Spt “sensorimotor interface” provides the substrate for dealing with working memory demands as well.

Because the distributed functional anatomy has been described at length elsewhere (Hickok and Poeppel 2000, 2004, 2007), we aim here simply to emphasize two core features of the model that have stimulated the formulation of new hypotheses about the organization of the speech perception system: first, there are two segregated processing streams, each of which has functionally relevant subdivisions; second, the organization of the ventral stream is bilateral, unlike the striking lateralization often observed in language processing. On our view, it is the dorsal stream, principally involved in production, that exhibits cerebral dominance. The ventral stream, on the other hand, is asymmetric but has important contributions from both hemispheres. The model highlights the distributed nature of the cortical fields underlying speech processing. Moreover, it illustrates the perception-memory-action loop that we have described. The loop at the basis of speech processing ‘works’ because of the shared currency that forms the basis of the representation. We contend that this distinctive featural representation is one that permits natural mappings from input to memory to output.

Figure 2. Dual-stream model (Hickok and Poeppel 2007). A functional anatomical model in which speech perception is executed by two concurrent processing streams. The green box is the starting point, the auditory input. The ventral stream (pink) mediates the mapping from sounds to lexical-semantic representations. The dorsal stream (blue) provides the neuronal infrastructure for the mapping from sound analysis to articulation. For discussion, see Hickok and Poeppel (2000, 2004, 2007). For colour version see separate plate.

4. Analysis by synthesis: an old algorithm, resuscitated

The recurrent theme here has been the memory-action-perception (MAP) loop. We pointed out that the need to both produce and perceive speech entails coordinate transformations between acoustic space and articulatory space, and economy considerations dictate that we look for a memory architecture that would enable and facilitate this.

Distinctive features, as developed by Jakobson, Fant, and Halle in the 1950s, have exactly these desirable characteristics, with both acoustic and articulatory definitions, and they provide a basis for long-term memory representations. These (static) representational considerations are mirrored in the computational algorithms for (dynamic) speech production and perception; the algorithms are similarly intertwined. Such a system was proposed at the very beginning of modern speech perception research, in the analysis-by-synthesis approach of MacKay (1951) and Halle and Stevens (1962). Bever and Poeppel (2010) and Poeppel and Monahan (2010) review this idea.

Here we have the MAP loop writ both large and small. At each level, speech perception takes the form of a guess at an analysis, the subsequent generation of a predicted output form, and a correction to that guess based on an error signal generated by comparing the predicted output form against the incoming signal. The initial hypotheses (‘guesses’) are generated on the basis of the current state and the smallest bit of input processed (say, a 30 ms sample of waveform), plus whatever additional information may have been used to predict the signal (the prior, in a Bayesian framework). The initial set of guesses that triggers synthesis will be large, but at each subsequent processing step the set of supported guesses gets smaller; therefore the set of synthesized representations gets smaller, and verification or rejection based on subsequent input gets quicker.

Figure 3. Analysis by synthesis. The processing stream (from left to right) is assumed to reflect detailed interactions between samples from the input, hypotheses (guesses), synthesized candidates, and error correction. There is a continuous alignment of bottom-up information and top-down regulated synthesis and candidate set narrowing until the target is acquired (Poeppel et al. 2008).
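The guess-synthesize-compare loop of Figure 3 can be sketched as iterative candidate pruning; the mean-squared-error score, the threshold, and the externally supplied synthesize function below are illustrative stand-ins, not a commitment to any particular synthesis model.

```python
import numpy as np

def analysis_by_synthesis(chunks, candidates, synthesize, threshold=1.0):
    """Prune lexical candidates as each input chunk (say, 30 ms of
    signal) arrives. `synthesize(word, i)` returns the predicted
    signal for chunk i; candidates whose prediction error exceeds
    `threshold` are rejected."""
    active = set(candidates)
    for i, chunk in enumerate(chunks):
        errors = {w: float(np.mean((synthesize(w, i) - chunk) ** 2))
                  for w in active}
        active = {w for w, e in errors.items() if e <= threshold}
        if len(active) <= 1:  # recognition point: one (or no) survivor
            break
    return active
```

Because the active set shrinks with every chunk, fewer forms need to be synthesized at each step, mirroring the narrowing shown in the figure.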

Although little empirical data exist to date that test these (old) ideas, recent studies on audiovisual speech perception support analysis by synthesis and illustrate how the amount of prior information modulates (in time) cortical responses to speech sounds (van Wassenhove, Grant, and Poeppel 2005).

The appeal of the analysis-by-synthesis view is threefold. First, it provides a link to motor theories of perception, albeit at a more abstract level. Motor theories in their most direct form are not well supported empirically (for a recent discussion, see Hickok 2010), but the hypothesis that some of the computations underlying motoric action play a perceptual role is worth exploring. There is an intriguing connection between perception and production, repeatedly observed in many areas of perception, and neither narrowly perceptual nor narrowly motoric theories seem to capture the observed phenomena. Second, analysis-by-synthesis for speech links to interesting new work in other domains of perception. For example, new research on visual object recognition supports the idea (Yuille and Kersten 2006), and the concept has been examined in depth in work on sentence comprehension by Bever (Townsend and Bever 2001). There is a close connection between the analysis-by-synthesis program and Bayesian models of perception (a point also made by Hinton and Nair 2005), and thereby also a link to more tenable accounts of what mirror neurons are (Kilner, Friston, and Frith 2007). Third, this approach provides a natural bridge to concepts gaining influence in systems neuroscience. The view that the massive top-down architectural connectivity in cortex forms the basis for generating and testing expectations (at every level of analysis) is gaining credibility, and predictive coding is widely observed using different techniques.

In our view, the type of computational infrastructure afforded by the analysis-by-synthesis research program provides a way to investigate speech perception in the service of lexical access that naturally links the computational, algorithmic, and implementational levels advocated by Marr. Research on the cognitive science and cognitive neuroscience of speech perception seems to us a productive approach to investigating questions about cognition and its neural basis more generally. The psychological models are increasingly detailed and well articulated, and they facilitate a principled investigation of how the brain computes with complex representations. For the development of these important models, we owe a debt of gratitude.

References

Allopenna, Paul D., James S. Magnuson, and Michael K. Tanenhaus
1998 Tracking the time course of spoken word recognition using eye movements: Evidence for continuous mapping models. Journal of Memory and Language 38: 419-439.
Altmann, Gerry T. M.
1990 Cognitive Models of Speech Processing. Cambridge, MA: MIT Press.
Bandettini, Peter A.
2003 Functional MRI. In Handbook of Neuropsychology, Jordan Grafman and Ian H. Robertson (eds.). The Netherlands: Elsevier.
Bever, Thomas and David Poeppel
2010 Analysis by synthesis: A (re-)emerging program of research for language and vision. Biolinguistics 4(2-3): 174-200.
Binder, Jeffrey R., Julie A. Frost, T. A. Hammeke, P. S. F. Bellgowan, J. A. Springer, J. N. Kaufman, and E. T. Possing
2000 Human temporal lobe activation by speech and nonspeech sounds. Cerebral Cortex 10(5): 512-28.
Blumstein, Sheila E., Emily B. Myers, and Jesse Rissman
2005 The perception of voice onset time: An fMRI investigation of phonetic category structure. Journal of Cognitive Neuroscience 17(9): 1353-66.
Boatman, Dana
2004 Cortical bases of speech perception: Evidence from functional lesion studies. Cognition 92(1-2): 47-65.
Boemio, Anthony, Stephen Fromm, Allen Braun, and David Poeppel
2005 Hierarchical and asymmetric temporal sensitivity in human auditory cortices. Nature Neuroscience 8(3): 389-95.
Burton, Martha W., Steven L. Small, and Sheila E. Blumstein
2000 The role of segmentation in phonological processing: An fMRI investigation. Journal of Cognitive Neuroscience 12(4): 679-90.
Crone, Nathan E., Dana Boatman, Barry Gordon, and Lei Hao
2001 Induced electrocorticographic gamma activity during auditory perception (Brazier Award-winning article, 2001). Clinical Neurophysiology 112(4): 565-82.
Cutler, Anne and David A. Fay
1982 One mental lexicon, phonologically arranged: Comments on Hurford's comments. Linguistic Inquiry 13: 107-113.
D'Ausilio, Alessandro, Friedemann Pulvermüller, Paola Salmas, Ilaria Bufalari, Chiara Begliomini, and Luciano Fadiga
2009 The motor somatotopy of speech perception. Current Biology 19(5): 381-5.

Dupoux, Emmanuel, Kazuhiko Kakehi, Yuki Hirose, Christophe Pallier, and Jacques Mehler
1999 Epenthetic vowels in Japanese: A perceptual illusion? Journal of Experimental Psychology: Human Perception and Performance 25: 1568-1578.
Eimas, Peter D., Einar R. Siqueland, Peter Jusczyk, and James Vigorito
1971 Speech perception in infants. Science 171: 303-306.
Engineer, Crystal T., Claudia A. Perez, Ye Ting H. Chen, Ryan S. Carraway, Amanda C. Reed, Jai A. Shetake, Vikram Jakkamsetti, Kevin Q. Chang, and Michael P. Kilgard
2008 Cortical activity patterns predict speech discrimination ability. Nature Neuroscience 11(5): 603-8.
Flege, James E. and James M. Hillenbrand
1986 Differential use of temporal cues to the /s/-/z/ contrast by native and non-native speakers of English. Journal of the Acoustical Society of America 79(2): 508-17.
Friedrich, Claudia K., Carsten Eulitz, and Aditi Lahiri
2006 Not every pseudoword disrupts word recognition: An ERP study. Behavioral and Brain Functions 2: 36.
Gagnepain, Pierre, Gaël Chételat, Brigitte Landeau, Jacques Dayan, Francis Eustache, and Karine Lebreton
2008 Spoken word memory traces within the human auditory cortex revealed by repetition priming and functional magnetic resonance imaging. Journal of Neuroscience 28(20): 5281-9.
Gaskell, Gareth and William D. Marslen-Wilson
2002 Representation and competition in the perception of spoken words. Cognitive Psychology 45(2): 220-66.
Giraud, Anne-Lise, Andreas Kleinschmidt, David Poeppel, Torben E. Lund, Richard S. J. Frackowiak, and Helmut Laufs
2007 Endogenous cortical rhythms determine cerebral specialization for speech perception and production. Neuron 56(6): 1127-34.
Halle, Morris
2002 From Memory to Speech and Back. Berlin: Mouton de Gruyter.
Halle, Morris and Kenneth N. Stevens
1962 Speech recognition: A model and a program for research. IRE Transactions on Information Theory 8(2): 155-159.
Harnsberger, James D.
2000 A cross-language study of the identification of non-native nasal consonants varying in place of articulation. Journal of the Acoustical Society of America 108(2): 764-783.
Hickok, Gregory
2010 The role of mirror neurons in speech perception and action word semantics. Language and Cognitive Processes.

Hickok, Gregory, Brad Buchsbaum, Colin Humphries, and Tugan Muftuler
2003 Auditory-motor interaction revealed by fMRI: Speech, music, and working memory in area Spt. Journal of Cognitive Neuroscience 15(5): 673-82.
Hickok, Gregory and David Poeppel
2000 Towards a functional neuroanatomy of speech perception. Trends in Cognitive Sciences 4(4): 131-138.
2004 Dorsal and ventral streams: A framework for understanding aspects of the functional anatomy of language. Cognition 92(1-2): 67-99.
2007 The cortical organization of speech processing. Nature Reviews Neuroscience 8(5): 393-402.
Hinton, Geoffrey and Vinod Nair
2005 Inferring motor programs from images of handwritten digits. Proceedings of NIPS 2005.
Hwang, So-One, Phillip J. Monahan, William Idsardi, and David Poeppel
submitted The perceptual consequences of voicing mismatch in obstruent consonant clusters.
Jakobson, Roman, Gunnar Fant, and Morris Halle
1952 Preliminaries to Speech Analysis. Cambridge, MA: MIT Press.
Juneja, Amit and Carol Espy-Wilson
2008 A probabilistic framework for landmark detection based on phonetic features for automatic speech recognition. Journal of the Acoustical Society of America 123(2): 1154-1168.
Kabak, Baris and William Idsardi
2007 Perceptual distortions in the adaptation of English consonant clusters: Syllable structure or consonantal contact constraints? Language and Speech 50: 23-52.
Kazanina, Nina, Colin Phillips, and William Idsardi
2006 The influence of meaning on the perception of speech sounds. Proceedings of the National Academy of Sciences USA 103(30): 11381-6.
Kilner, James M., Karl J. Friston, and Chris D. Frith
2007 The mirror-neuron system: A Bayesian perspective. NeuroReport 18(6): 619-623.
Kingston, John
2003 Learning foreign vowels. Language and Speech 46: 295-349.
Klatt, Dennis
1979 Speech perception: A model of acoustic phonetic analysis and lexical access. Journal of Phonetics 7: 279-342.
1989 Review of selected models of speech perception. In Lexical Representation and Process, William Marslen-Wilson (ed.), 169-226. Cambridge, MA: MIT Press.

Kuhl, Patricia, Barbara T. Conboy, Sharon Coffey-Corina, Denise Padden, Maritza Rivera-Gaxiola, and Tobey Nelson
2007 Phonetic learning as a pathway to language: New data and native language magnet theory expanded (NLM-e). Philosophical Transactions of the Royal Society B 363: 979-1000.
Lau, Ellen, Colin Phillips, and David Poeppel
2008 A cortical network for semantics: (De)constructing the N400. Nature Reviews Neuroscience 9: 920-933.
Liberman, Alvin M.
1996 Speech: A Special Code. Cambridge, MA: MIT Press.
Liebenthal, Einat, Jeffrey R. Binder, Stephanie M. Spitzer, Edward T. Possing, and David A. Medler
2005 Neural substrates of phonemic perception. Cerebral Cortex 15(10): 1621-31.
Logothetis, Nikos K.
2008 What we can do and what we cannot do with fMRI. Nature 453(7197): 869-78.
Luce, Paul A. and David B. Pisoni
1998 Recognizing spoken words: The neighborhood activation model. Ear and Hearing 19(1): 1-36.
Luo, Huan and David Poeppel
2007 Phase patterns of neuronal responses reliably discriminate speech in human auditory cortex. Neuron 54(6): 1001-10.
MacKay, Donald M.
1951 Mindlike behaviour in artefacts. British Journal for the Philosophy of Science 2: 105-121.
Marslen-Wilson, William (ed.)
1989 Lexical Representation and Process. Cambridge, MA: MIT Press.
McClelland, James L. and Jeffrey L. Elman
1986 The TRACE model of speech perception. Cognitive Psychology 18: 1-86.
Meyer, Martin, Stefan Zysset, D. Yves von Cramon, and Kai Alter
2005 Distinct fMRI responses to laughter, speech, and sounds along the human peri-sylvian cortex. Cognitive Brain Research 24(2): 291-306.
Mielke, Jeff
2007 The Emergence of Distinctive Features. Oxford: Oxford University Press.
Misiurski, Cara, Sheila E. Blumstein, Jesse Rissman, and Daniel Berman
2005 The role of lexical competition and acoustic-phonetic structure in lexical processing: Evidence from normal subjects and aphasic patients. Brain and Language 93(1): 64-78.

Myers, Emily B. and Sheila E. Blumstein
2008 The neural bases of the lexical effect: An fMRI investigation. Cerebral Cortex 18(2): 278-88.
Näätänen, Risto, Anne Lehtokoski, Mietta Lennes, Marie Cheour, Minna Huotilainen, Antti Iivonen, Martti Vainio, Paavo Alku, Risto J. Ilmoniemi, Aavo Luuk, Juri Allik, Janne Sinkkonen, and Kimmo Alho
1997 Language-specific phoneme representations revealed by electric and magnetic brain responses. Nature 385(6615): 432-4.
Obleser, Jonas, J. Zimmermann, John Van Meter, and Josef P. Rauschecker
2007 Multiple stages of auditory speech perception reflected in event-related fMRI. Cerebral Cortex 17(10): 2251-7.
Pa, Judy and Gregory Hickok
2008 A parietal-temporal sensory-motor integration area for the human vocal tract: Evidence from an fMRI study of skilled musicians. Neuropsychologia 46(1): 362-8.
Perkell, Joseph and Dennis Klatt (eds.)
1986 Invariance and Variability in Speech Processes. Hillsdale, NJ: Erlbaum.
Poeppel, David
2001 Pure word deafness and the bilateral processing of the speech code. Cognitive Science 25(5): 679-693.
2003 The analysis of speech in different temporal integration windows: Cerebral lateralization as ‘asymmetric sampling in time’. Speech Communication 41: 245-255.
Poeppel, David and Martin Hackl
2008 The architecture of speech perception. In Topics in Integrative Neuroscience: From Cells to Cognition, James Pomerantz (ed.). Cambridge: Cambridge University Press.
Poeppel, David and Phillip J. Monahan
2008 Speech perception: Cognitive foundations and cortical implementation. Current Directions in Psychological Science 17(2).
2010 Feedforward and feedback in speech perception: Revisiting analysis-by-synthesis. Language and Cognitive Processes.
Poeppel, David, William J. Idsardi, and Virginie van Wassenhove
2008 Speech perception at the interface of neurobiology and linguistics. Philosophical Transactions of the Royal Society of London B: Biological Sciences 363(1493): 1071-86.
Prabhakaran, Ranjani, Sheila E. Blumstein, Emily B. Myers, Emmette Hutchison, and Brendan Britton
2006 An event-related fMRI investigation of phonological-lexical competition. Neuropsychologia 44(12): 2209-21.

Raettig, Tim and Sonja A. Kotz
2008 Auditory processing of different types of pseudo-words: An event-related fMRI study. NeuroImage 39(3): 1420-8.
Raizada, Rajeev D. and Russell A. Poldrack
2007 Selective amplification of stimulus differences during categorical processing of speech. Neuron 56(4): 726-40.
Rauschecker, Josef P., Biao Tian, and Marc Hauser
1995 Processing of complex sounds in the macaque nonprimary auditory cortex. Science 268(5207): 111-4.
Sapir, Edward
1933 La réalité psychologique des phonèmes. Journal de Psychologie Normale et Pathologique. Reprinted as “The psychological reality of phonemes” in David Mandelbaum (ed.), Selected Writings in Language, Culture and Personality. Berkeley: University of California Press.
Schroeder, Charles E., Peter Lakatos, Yoshinao Kajikawa, Sarah Partan, and Aina Puce
2008 Neuronal oscillations and visual amplification of speech. Trends in Cognitive Sciences 12(3): 106-13.
Scott, Sophie K., C. Catrin Blank, Stuart Rosen, and Richard J. S. Wise
2000 Identification of a pathway for intelligible speech in the left temporal lobe. Brain 123(12): 2400-6.
Scott, Sophie K. and Richard J. S. Wise
2004 The functional neuroanatomy of prelexical processing in speech perception. Cognition 92(1-2): 13-45.
Stefanatos, Gerry A., Arthur Gershkoff, and Sean Madigan
2005 On pure word deafness, temporal processing, and the left hemisphere. Journal of the International Neuropsychological Society 11: 456-470.
Steinschneider, Mitchell, Charles E. Schroeder, Joseph C. Arezzo, and Herbert G. Vaughan Jr.
1994 Speech-evoked activity in primary auditory cortex: Effects of voice onset time. Electroencephalography and Clinical Neurophysiology 92(1): 30-43.
Stevens, Kenneth N.
1998 Acoustic Phonetics. Cambridge, MA: MIT Press.
2002 Toward a model for lexical access based on acoustic landmarks and distinctive features. Journal of the Acoustical Society of America 111(4): 1872-1891.
Townsend, David J. and Thomas G. Bever
2001 Sentence Comprehension: The Integration of Habits and Rules. Cambridge, MA: MIT Press.

Utman, Jennifer Aydelott, Sheila E. Blumstein, and Kelly Sullivan
2001 Mapping from sound to meaning: Reduced lexical activation in Broca's aphasics. Brain and Language 79(3): 444-72.
van Wassenhove, Virginie, Kenneth W. Grant, and David Poeppel
2005 Visual speech speeds up the neural processing of auditory speech. Proceedings of the National Academy of Sciences USA 102(4): 1181-6.
Young, Eric D.
2008 Neural representation of spectral and temporal information in speech. Philosophical Transactions of the Royal Society of London B: Biological Sciences 363(1493): 923-45.
Yuille, Alan and Daniel Kersten
2006 Vision as Bayesian inference: Analysis by synthesis? Trends in Cognitive Sciences 10(7): 301-8.

Brain structures underlying lexical processing of speech: Evidence from brain imaging

Matthew H. Davis and Jennifer M. Rodd

1. The neural foundations of lexical processing of speech

A mental lexicon that links the form of spoken words to their associated meanings and syntactic functions has long been seen as central to speech comprehension. As should be apparent from other papers in this volume, William Marslen-Wilson's experimental and theoretical contributions have been supremely influential in guiding research on the psychological and computational properties of the lexicon. With recent developments in functional brain imaging, methods now exist to map these processes onto neuroanatomical pathways in the human brain. In the current paper we will argue that theoretical proposals made in various iterations of the Cohort account remain just as relevant to this neuroscientific endeavour as they were for previous generations of researchers in the psychological and cognitive sciences. Here we will review recent brain imaging work on lexical processing in the light of these theoretical principles.

Accounts of the neural processes underlying spoken word recognition have converged on the proposal that brain regions centred on the superior temporal gyrus (STG) are critical for pre-lexical processing of spoken words. That is, this region is engaged in the acoustic-phonetic processes that provide the input for later lexical and semantic analysis of the speech signal. In the neuropsychological literature, cases of pure word deafness (an isolated impairment of speech perception in the presence of intact linguistic skills in other modalities and intact perception of non-speech sounds) are typically observed following bilateral lesions to these superior temporal regions (Saffran, Marin, and Komshian 1976; Stefanatos, Gershkoff, and Madigan 2005). Functional imaging studies that compare neural responses to speech sounds and acoustically matched sounds that do not evoke a speech percept reveal differential responses in superior temporal gyrus regions that surround but do not include primary auditory cortex (Davis and Johnsrude 2003; Scott et al. 2000; Uppenkamp et al. 2006; Vouloumanos et al. 2001).

Although activation for speech greater than non-speech is often seen bilaterally, those studies that focus on phonological aspects of speech processing (for instance, additional responses for syllable pairs that evoke different phonological categories; Jacquemot et al. 2003; Raizada and Poldrack 2007) show left-lateralised responses in the STG and adjacent inferior parietal regions. It is possible, however, that this reflects a particular mode of listening rather than something intrinsic to the stimuli used in these studies (Leech et al. 2009).

Convergent evidence localising pre-lexical perceptual processing of speech to the superior temporal gyrus comes from the imaging contrast of spoken pseudowords compared to real words. In the absence of higher-level, lexical representations, responses to pseudowords are elevated in superior temporal regions that contribute to sub-lexical processing (see Davis and Gaskell 2009 for a meta-analysis and Figure 1 for a summary of these results). These findings, in combination, suggest a specific role for the STG in sub-lexical phonological processing of the sounds of speech (Binder et al. 2000; Hickok and Poeppel 2000; Scott and Johnsrude 2003).

Despite agreement that sub-lexical speech processing engages peri-auditory regions of the superior temporal gyrus, the neural correlates of the later lexical processes that are critical for speech comprehension remain unclear. Primate neurophysiology suggests multiple processing pathways that project anteriorly and posteriorly along the STG (e.g., Rauschecker and Tian 2000), with onward projections into topographically organised regions of inferior parietal and frontal cortex (Petrides and Pandya 1988). However, while all current neural accounts of speech processing suggest that homologues of these dorsal and ventral pathways support human speech processing (Davis and Johnsrude 2007; Hickok and Poeppel 2007; Rauschecker and Scott 2009), the functional contribution and anatomical organisation of specific processing streams for speech remain undecided. In Figure 1 we display each of the major processing pathways proposed by these accounts.

There is some agreement between these accounts, specifically that the dorsal auditory pathway is involved in mapping heard speech from posterior regions of the STG (classically Wernicke's area) onto the inferior parietal, prefrontal and premotor cortices that are involved in the articulation of words and pseudowords. However, differences of opinion remain concerning the function of this auditory-motor pathway. Some authors argue that auditory-motor mappings supported by the dorsal stream provide the foundations for speech perception (Liberman and Whalen 2000; Pulvermüller 2005).
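The subtraction logic behind contrasts such as the words-versus-pseudowords comparison can be sketched in a few lines; the array shapes, condition labels, and simple one-sample t-test below are illustrative assumptions, not the analysis pipeline of any study summarised in Figure 1.

```python
import numpy as np
from scipy import stats

# Toy data: per-subject mean activation (betas) at each voxel,
# shape (n_subjects, n_voxels), one array per condition.
rng = np.random.default_rng(0)
betas_words = rng.normal(size=(16, 1000))
betas_pseudo = rng.normal(size=(16, 1000))

# Subtraction contrast: pseudowords minus words, tested across subjects.
t, p = stats.ttest_1samp(betas_pseudo - betas_words, popmean=0.0, axis=0)

# Voxels responding more strongly to pseudowords (uncorrected
# threshold, purely for illustration).
candidate_voxels = np.where((p < 0.001) & (t > 0))[0]
```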

Figure 1. Meta-analysis of activation differences between familiar spoken words and unfamiliar pseudowords adapted from Davis and Gaskell (2009). Activation of inferior temporal and fusiform regions for familiar words is shown with a dotted outline, indicating activation hidden from view in this lateral rendering. Overlaid arrows show approximate locations of hierarchical functional pathways linking sublexical and higher-level processing within the temporal lobe. Pathway (d) is the temporo-parietal portion of the dorsal auditory-motor pathway. Pathway (v1) involves auditory association areas in the anterior portions of the superior and middle temporal gyrus. Pathway (v2) links superior temporal regions to basal language areas in the posterior inferior temporal and fusiform gyri. A1 marks the lateral aspect of Heschl's gyrus, an anatomical marker of primary auditory cortex.

One motivation for this proposal is the observation that articulatory representations are automatically activated during speech perception (Fadiga et al. 2002; Pulvermüller et al. 2006; Wilson et al. 2004; Yuen et al. 2010). An alternative view, however, is that motor representations constrain and support the auditory processes involved in speech perception (Davis and Johnsrude 2007), as evidenced by motor involvement being more clearly observed under challenging listening conditions (Adank and Devlin 2009; Davis and Johnsrude 2003; Meister et al. 2007; Möttönen and Watkins 2009). A third, contrasting view proposes that this pathway for auditory-motor integration supports a rehearsal-based short-term memory system for verbal materials that is a mere adjunct to auditory processes in the temporal lobe, which are, alone, sufficient for successful speech perception and comprehension (Lotto, Hickok, and Holt 2009; Scott et al. 2009).

By comparison with these distinct, yet testable, accounts of dorsal stream contributions to speech perception, there are more fundamental disagreements concerning the neural structures and pathways that are critical for accessing the meaning of spoken words. Indeed, recent theoretical proposals concerning the neural basis of speech processing are disappointingly vague concerning the functional and anatomical organisation of the higher-level processes postulated to occur within the ventral speech processing stream, and concerning how these processes might contribute to the comprehension of spoken language (Hickok and Poeppel 2007; Rauschecker and Scott 2009). In the present paper we focus on the recognition of familiar words, and on the later stages by which contextually relevant meanings are accessed. These two computational processes are at the heart of comprehension: it is only by retrieving and then combining the semantic (and syntactic) representations of individual words that we can comprehend connected speech. However, there is no consensus concerning how the two 'ventral' processing pathways in the lateral temporal lobe shown in Figure 1 contribute to these critical processes for spoken language comprehension.

The ventral auditory pathways in the anterior superior/middle temporal gyrus and posterior inferior temporal regions are, by general agreement, critically involved in the recognition of auditory objects in other mammalian species. In monkeys, this “what” system is responsible for the recognition of conspecific calls and other familiar environmental sounds (Rauschecker and Scott 2009). By analogy, then, many authors have proposed that the homologous system in humans serves to identify spoken words and access their meanings (Hickok and Poeppel 2004, 2007; Scott and Johnsrude 2003; Tyler and Marslen-Wilson 2008). Yet, despite agreement that ventral brain regions are involved in these lexical aspects of comprehension, there is disagreement in the literature concerning exactly which neural pathways are involved.


neural pathways are involved. To take a simplified view of two salient proposals in the literature, Scott and Johnsrude (2003) propose that anterior portions of the superior and middle temporal gyrus, extending to the temporal pole, comprise auditory association cortices that are critical for the comprehension of spoken words (v1 in Figure 1). In contrast, Hickok and Poeppel (2004) propose that the ventral auditory stream projects posteriorly from peri-auditory regions into the middle and inferior temporal gyrus, and that it is thus the posterior inferior temporal gyrus that is critical for accessing the meaning of spoken words (v2 in Figure 1). While later revisions of both these accounts (Hickok and Poeppel 2007; Rauschecker and Scott 2009) have presented a less stark contrast, there remains significant disagreement concerning the neural structures that support lexical processing and other critical components of speech comprehension.

This dispute as to whether it is the anterior or posterior regions of the temporal lobe that are critical for processing word meanings is mirrored in the neuropsychological literature. In traditional neuropsychological accounts it is the posterior portions of the STG and adjacent parietal regions (Wernicke’s area) that are held to support speech comprehension. This view has received some support from recent studies showing that the comprehension deficits of stroke patients are associated with damage to the posterior portion of the left temporal lobe (Bates et al. 2003), though with a more inferior locus. In contrast, the semantic processing deficits of patients with semantic dementia have been linked with damage to the anterior temporal lobes (Mummery et al. 2000; Williams, Nestor, and Hodges 2005). These differences cannot be explained by the aetiology of the comprehension impairment (dementia vs. stroke), since evidence linking posterior temporal regions with comprehension can be found in dementia patients (Peelle et al. 2008), and an anterior locus of semantic representations has been suggested by a voxel-based analysis that includes stroke patients (Tyler, Marslen-Wilson, and Stamatakis 2005).

This disagreement naturally leads us to consider evidence from human functional imaging concerning the neural structures responsible for lexical and semantic processing of speech. We will therefore review one of the primary sources of evidence – subtraction analyses in functional brain imaging – that we believe has so far had limited success in revealing the functional neuroanatomy of speech comprehension. Throughout this review, two themes will emerge. First, for studies of both single words and sentences, it will become apparent that the relevant studies have failed to isolate a single neural system that supports lexical processing. While these results are difficult to interpret within the predominant multiple-pathway accounts of the neural basis of speech processing (one of which is typically proposed to be


“lexical”), we will argue that this outcome precisely reflects theoretical developments culminating in the Distributed Cohort Model (Gaskell and Marslen-Wilson 1997, 1999). Gaskell and Marslen-Wilson argue that lexical knowledge cannot be considered a single, unitary system, but rather is the result of a coalition of parallel mappings from sub-lexical representations of speech sequences onto semantic representations, phonological representations involved in articulation, and transient storage of acoustic-phonetic representations in short-term echoic memory. These multiple processes provide putative hypotheses for the functional contribution of speech processing pathways in the lateral temporal lobe, which can be incorporated into neural accounts of speech processing.

A second recurring theme in this review is the frequent finding that experimental manipulations (e.g., semantic priming, semantic anomalies) that aim to identify the brain regions involved in lexical and semantic processing often alter neural activity in peri-auditory regions of the STG associated with acoustic-phonetic processing of speech. However paradoxical such results may seem, we will argue that they are consistent with the processing interactions between lexical identification and sentential meaning that have long been proposed in classic cognitive accounts of speech comprehension developed by Marslen-Wilson and Tyler (Marslen-Wilson 1987; Marslen-Wilson and Tyler 1980). However, this still leaves open the question of how we might study these higher-level processes (and their interactions) using functional imaging methods. We will therefore illustrate an alternative method that uses functional MRI to assess additional neural activity that is recruited when specific cognitive processes required for comprehension are challenged.

In the final section of the chapter, we will move away from the use of haemodynamic brain imaging methods (fMRI/PET) and turn instead to the fast neurophysiological methods provided by electroencephalography (EEG) and magnetoencephalography (MEG). These methods allow unrivalled temporal resolution for the assessment of short-latency responses to critical events in spoken language and therefore have the potential to illustrate the temporal sequence of computations involved in speech comprehension with great precision. However, before these event-related methods can inform neurocognitive theorising, a significant methodological challenge must be overcome – we must specify when the critical cognitive events in speech (word onsets, word recognition points, etc.) occur in relation to a continuous acoustic signal. We will describe existing studies that, inspired by lexical processing measures derived from versions of the Cohort theory (Gaskell and Marslen-Wilson 2002; Marslen-Wilson and Tyler 1981), begin to provide a neural time course of spoken word identification.


2. Brain imaging studies of lexical processing of spoken words

2.1. Neural effects of lexicality: Words vs. nonwords

In the past decade, a wealth of functional imaging data has been brought to bear on the issue of how single words are processed in isolation. A key study came from Jeff Binder and colleagues, who used fMRI and a meta-analysis to identify regions that responded more strongly to individual familiar spoken words compared with unfamiliar pseudowords (Binder et al. 2000). The fMRI study highlighted regions of the posterior middle and inferior temporal gyrus and the angular gyrus. These posterior activations were replicated in the meta-analysis, which also revealed activation in the anterior portions of the middle temporal gyrus (MTG) and STG. Hence, Binder and colleagues proposed a hierarchical account of speech processing, in which these lateral temporal regions (both anterior and posterior) accomplish higher-level lexical and semantic processing of speech in a manner that is perhaps modality independent.

These results ushered in a period of great optimism in which it seemed likely that functional brain imaging would provide startling new insights into the anatomy of spoken language comprehension, and in particular would reveal the specific cognitive roles of these anterior and posterior temporal lobe regions. However, as we will illustrate in a selective review of single-word functional imaging studies, similar subtractive designs have had limited success in further refining our understanding of the functional neuroanatomy of speech perception and comprehension.

Several studies have followed directly from Binder et al. (2000) by conducting further comparisons of neural responses to familiar words and unfamiliar pseudowords, with the goal of isolating the critical brain region(s) that contribute to lexical processing. These studies have used a variety of tasks, including phonological or lexical decision, speeded repetition, and one-back or target-monitoring tasks. While task effects for the contrast of real words versus unfamiliar pseudowords are unclear at present, what is apparent from an Activation Likelihood Estimation (Turkeltaub et al. 2002) meta-analysis of 11 of these studies is that the simple comparison of responses to spoken words and pseudowords fails to deliver the result that any single brain region or processing pathway plays a dominant role in the recognition of familiar words (Davis and Gaskell 2010; see Figure 1 for a sketch summarising these results).


Instead, consistent with the meta-analysis presented by Binder and colleagues (2000), there are multiple regions of the lateral temporal lobe that show an elevated response to familiar words, as well as more distant regions of the medial/lateral parietal and frontal lobes. We can divide these brain regions into different processing pathways based on the anatomical organisation described previously, again depicted in Figure 1. For example, elevated responses to real words are seen in a region of the anterior middle temporal gyrus – squarely within the anterior-going pathway highlighted by Scott et al. (2000) in their study of sentence comprehension. Clusters of voxels in the posterior middle temporal, inferior temporal and fusiform gyri also produce a reliable increase in activation for familiar words. This finding is consistent with the proposal made by Hickok and Poeppel (2004, 2007) that this region serves as a “lexical interface”. Further confusion arises from a third set of regions, in the dorsal auditory-motor pathway, that also shows a lexicality effect. That is, the posterior superior/middle temporal gyrus and adjacent parietal regions (supramarginal and angular gyrus) also produce additional activation for spoken words compared to pseudowords. In many cases, homologous regions of the right hemisphere also show a comparable response elevation for familiar words.

In summary, we see that multiple brain regions show an elevated response to familiar words. It seems that the neuroimaging subtraction comparing responses to real words and pseudowords does not dissociate the functional contribution of the multiple temporal lobe pathways involved in speech perception. Nor is it the case that any one of these pathways better deserves the label of “lexical processing” than any other. Our favoured interpretation of this finding is that it reflects the multiple cognitive processes that differentiate spoken words from nonwords. Representations of familiar words are to be found in multiple cortical systems, including brain regions argued to contribute to auditory-motor integration, as well as the two possible “ventral streams” for speech comprehension. Such a conclusion sits naturally within computational accounts such as the Distributed Cohort model (Gaskell and Marslen-Wilson 1997, 1999), in which phonological pathways, activation of semantic representations, and internal acoustic-phonetic representations all encode information that to some extent differentiates familiar words from unfamiliar pseudowords. From a functional point of view, this is analogous to saying that words are familiar by virtue of a number of different properties, including their articulatory representations, sound patterns and evoked meanings. Neural correlates of all of these representations can be seen in the functional imaging contrast of words and pseudowords. It might be that future studies using artificial word learning could separate out these differential contributions by teaching participants novel words that have only a subset of these representations; however, such studies have not yet provided significant new information on the functional specialisation of specific temporal lobe pathways (Davis and Gaskell 2010).
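To make the logic of such coordinate-based meta-analyses concrete, the following sketch implements a stripped-down version of the ALE idea from Turkeltaub et al. (2002): each study's reported peaks are blurred with a Gaussian kernel to give a modelled-activation map, and the per-study maps are combined into a voxelwise activation likelihood. This is a minimal illustration only; the grid, kernel width and peak coordinates are invented, and the permutation-based thresholding used in real ALE analyses is omitted.

    # Minimal sketch of Activation Likelihood Estimation (ALE).
    # Grid size, kernel width and coordinates below are illustrative only.
    import numpy as np

    GRID = (20, 20, 20)          # toy 20 x 20 x 20 voxel grid, not MNI space
    SIGMA = 1.5                  # kernel width in voxels (assumption)

    def modelled_activation(peaks, grid=GRID, sigma=SIGMA):
        """One study's map: the max Gaussian over its reported peaks."""
        idx = np.indices(grid)                       # voxel coordinates
        ma = np.zeros(grid)
        for p in peaks:
            d2 = sum((idx[k] - p[k]) ** 2 for k in range(3))
            ma = np.maximum(ma, np.exp(-d2 / (2 * sigma ** 2)))
        return ma

    # Hypothetical peaks (voxels) from three words > pseudowords studies
    studies = [[(5, 10, 10), (15, 10, 10)],
               [(6, 11, 9)],
               [(14, 9, 11), (10, 4, 10)]]

    mas = [modelled_activation(s) for s in studies]
    # ALE: probability that at least one study "activates" each voxel
    ale = 1.0 - np.prod([1.0 - ma for ma in mas], axis=0)
    print(ale.max(), np.unravel_index(ale.argmax(), GRID))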


2.2. Neural priming effects for spoken words

One method that has been applied in combination with the lexicality manipulation is to use fMRI repetition suppression to highlight neural systems that are involved in representing familiar words. The assumption behind this work is that, since behavioural facilitation is more pronounced for familiar than for unfamiliar words, neural interactions of lexicality and repetition might similarly dissociate the specific neural systems involved in those key aspects of word recognition that are facilitated on second presentation. Similar methods have been used to highlight a fusiform region that is proposed to play a critical role in the recognition of familiar faces and written words (Fiebach, Gruber, and Supp 2005; Henson, Shallice, and Dolan 2000). However, despite behavioural evidence showing a similar dissociation of lexicality and repetition in long-term priming of spoken words, an fMRI study by Orfanidou, Marslen-Wilson and Davis (2006) failed to show differential neural priming (repetition suppression) for words and pseudowords. Rather, lateral and medial prefrontal regions involved in performing the lexical decision task (but not involved in lexical representation) showed response reductions predictive of behavioural priming for words and pseudowords alike.

A similar study by Gagnepain and colleagues (2008) excluded response-based repetition priming by changing the task performed on the first and second presentation of each word or pseudoword. In this case, lexicality-by-priming interactions arose for the magnitude and latency of the fMRI response in peri-auditory regions of the STG, perhaps because test items were acoustically distorted, leading to greater demands on auditory processing for unprimed items. A number of right-hemisphere regions also produced repetition suppression, though this may be associated with voice-specific memory traces rather than with long-term lexical representations (Belin and Zatorre 2003; Gonzalez and McLennan 2007; Von Kriegstein et al. 2003).
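The interaction logic that these lexicality-by-repetition designs rely on can be made explicit with a toy calculation. The condition means below are invented values for a single voxel; a real analysis would estimate the same contrast voxelwise within a general linear model.

    # Toy illustration of a lexicality-by-repetition interaction contrast.
    # The numbers are invented condition means (arbitrary BOLD units).
    import numpy as np

    # rows: first vs. repeated presentation; columns: word vs. pseudoword
    means = np.array([[2.0, 1.9],    # first presentation:    word, pseudoword
                      [1.4, 1.8]])   # repeated presentation: word, pseudoword

    suppression = means[0] - means[1]        # repetition suppression per item type
    interaction = suppression[0] - suppression[1]
    print(suppression)    # [0.6 0.1] -> stronger suppression for words
    print(interaction)    # 0.5 -> the lexicality-by-repetition interaction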


One exception to this pattern, however, was a region of the left posterior middle temporal gyrus which showed both a lexicality effect (word > pseudoword) and greater neural repetition suppression for familiar words. This finding aligns well with the proposal that the posterior middle/inferior temporal region plays an important role in supporting lexical identification of spoken words (Hickok and Poeppel 2007). Although this finding is encouraging, it is apparently contradicted by two fMRI studies that investigated short-term (immediate) rather than long-term (delayed) priming of spoken words. In paired-priming with overt presentation of repeated word pairs, Cohen et al. (2004) showed repetition suppression in a region of the anterior superior/middle temporal gyrus – a region they labelled the “auditory word form area”. In contrast, Kouider et al. (2010) showed a repetition priming effect for words but not pseudowords in primary auditory regions (Heschl's gyrus and planum polare) and in the bilateral insula when subliminal presentation of prime words was used. Thus, to date, results from repetition priming studies provide no clear indication of the contribution of any single functional pathway to the identification of spoken words.

An alternative priming method that can potentially provide more specific information about the functional roles of the ventral processing pathways is semantic priming – that is, facilitated identification of the second word in related pairs (dog-cat, bread-butter) compared to unrelated pairs (dog-butter, bread-cat). While the lexical manipulations described above would be predicted to show effects in the brain regions involved in all aspects of lexical processing, this contrast might more specifically reveal those regions involved in accessing word meanings. Several fMRI studies using spoken words have included this contrast, often in the context of participants making lexical decisions to the second word. Neural correlates of facilitated processing of related word pairs (i.e., the contrast unrelated > related) have been reported in bilateral auditory regions (Heschl's gyrus and planum temporale; Kotz et al. 2002) and the left STG (Rissman, Eliassen, and Blumstein 2003; Wible et al. 2006). However, none of the activated regions shows much, if any, overlap with the areas proposed to play a critical role in representing spoken words. If anything, priming effects localise to sub-lexical processes. Hence, these findings provide relatively little evidence to support any specific hypotheses concerning the neural systems involved in accessing the meaning of spoken words. One possible explanation of these uninformative results comes from considering other activation differences also associated with the behavioural priming effect. Many of these studies report priming effects in inferior


and middle frontal regions primarily associated with decision making and response generation (Kotz et al. 2002; Rissman et al. 2003; Wible et al. 2006). As in studies of repetition priming, greater fluency of semantic processing has downstream consequences for systems involved in making decisions on spoken words. Further evidence of these task effects was obtained in a study that directed subjects to make related/unrelated decisions on spoken word pairs (Ruff et al. 2008). Similar effects of semantic relatedness were obtained in this study, though these were more pronounced than in lexical decision, particularly for right-lateralised fronto-parietal regions, but also in left inferior frontal and superior temporal regions. In summary, then, it appears that the semantic priming manipulation does not uniquely reveal brain regions involved in lexical-semantic processing – it is as likely to modulate either pre-lexical processing of speech in the STG or post-lexical task areas. This makes it difficult to use priming to detect brain regions that are specifically involved in lexical-semantic processing. As we will see in section 3.2, similar concerns arise in using related and unrelated words within sentences.

2.3. Effects of lexical properties on neural responses

A third approach to the study of lexical processing has been to explore how neural responses to spoken words are modulated as a function of their lexical and semantic properties. For example, neural responses to highly imageable spoken words are elevated in middle and anterior portions of the left fusiform gyrus (Wise et al. 2000). Similarly, anterior inferior temporal regions are more strongly activated for words (nouns or verbs) that have strong sensory associations (Vigliocco et al. 2006). Such findings are consistent with a semantic role for these inferior portions of the anterior temporal lobe. Fiebach and colleagues (2003) report increased activity for late-acquired words in the bilateral STG, consistent with increased difficulty of identification for these items. Curiously, though, a study by Prabhakaran et al. (2006) showed additional activation of the middle temporal gyrus for high-frequency compared to low-frequency words – a contrast that is likely to be confounded with age of acquisition (high-frequency words tend to be acquired earlier). A further comparison in the same study was between words with few and many phonological neighbours (i.e., seeking the neural correlate of lexical competition for words with many neighbours). Prabhakaran and colleagues localised these activation increases to a region of the left supramarginal and angular gyrus.
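For readers unfamiliar with this measure, the sketch below shows one standard way of counting phonological neighbours: lexicon entries that differ from a target by a single phoneme substitution, addition or deletion. The toy lexicon uses one character per phoneme and is purely illustrative; real density counts are computed over a phonemically transcribed lexicon.

    # Minimal sketch of phonological neighbourhood density: neighbours are
    # entries differing from the target by one phoneme substitution,
    # addition or deletion. One character stands in for one phoneme.
    def neighbours(target, lexicon):
        out = set()
        for w in lexicon:
            if w == target:
                continue
            if len(w) == len(target):                       # substitution
                if sum(a != b for a, b in zip(w, target)) == 1:
                    out.add(w)
            elif abs(len(w) - len(target)) == 1:            # addition/deletion
                s, l = sorted((w, target), key=len)
                if any(l[:i] + l[i + 1:] == s for i in range(len(l))):
                    out.add(w)
        return out

    toy_lexicon = {"kat", "bat", "kap", "at", "kart", "dog"}
    print(neighbours("kat", toy_lexicon))   # {'bat', 'kap', 'at', 'kart'}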


However, essentially the same contrast of high- versus low-neighbourhood items produced additional activation of bilateral STG in a study by Okada and Hickok (2006), while Bozic and colleagues (2007) showed a similar increase in bilateral superior temporal activation for words with additional embeddings (e.g., “claim”). One possible explanation is that these temporal and parietal activations reflect delayed access to the lexical form of these harder-to-comprehend words. Raettig and Kotz (2008) similarly observed additional activation in these temporo-parietal regions for mispronounced words that are nonetheless recognised correctly.

In sum, then, it is difficult to know what to make of these findings. One might gloss these results as showing that semantic manipulations (particularly those that involve visual or sensory knowledge) engage anterior and posterior inferior temporal regions, whereas manipulations of lexical or phonological processing difficulty modulate superior temporal and inferior parietal regions. However, certain inconsistencies make interpretation difficult, and additional studies are needed to confirm this pattern. In addition, two methodological considerations can be raised at this point that could prove useful in directing future research. The first is that great care must be taken in drawing conclusions from these between-item designs in the absence of statistical comparisons which treat different words as a random effect (Bedny, Aguirre, and Thompson-Schill 2007; Clark 1973); only those results that are reliable over both participants and items provide a statistical foundation for theorising. None of the correlational or factorial studies described above uses by-items analyses or min-F′ to ensure generalisation beyond the sample of words tested. In studies of visual word recognition it has been shown that different conclusions would be reached if item variation were considered at the analysis stage (Bedny et al. 2007). In the absence of this statistical control, it might be that future work will reveal confounding factors that better explain differential activation of (for example) superior temporal regions, rather than the psycholinguistic factors that have been studied to date.
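For reference, min-F′ is straightforward to compute from the by-participants (F1) and by-items (F2) statistics: min-F′ = F1F2/(F1 + F2), with denominator degrees of freedom (F1 + F2)²/(F1²/n2 + F2²/n1), where n1 and n2 are the error degrees of freedom of F1 and F2 respectively (Clark 1973). A small helper, with invented F values as input:

    # min-F' (Clark 1973): a conservative joint test over participants (F1)
    # and items (F2). df1_err and df2_err are the error degrees of freedom
    # of the by-participants and by-items F tests.
    def min_f_prime(f1, f2, df1_err, df2_err):
        mfp = (f1 * f2) / (f1 + f2)
        # denominator df of min-F', usually rounded down in practice
        df = (f1 + f2) ** 2 / (f1 ** 2 / df2_err + f2 ** 2 / df1_err)
        return mfp, df

    # e.g., F1(1, 23) = 9.6, F2(1, 47) = 5.1 (made-up values)
    mfp, df = min_f_prime(9.6, 5.1, 23, 47)
    print(f"min-F'(1, {df:.0f}) = {mfp:.2f}")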


A further challenge in this work is to specify the computational principles by which additional activation in discrete processing stages can be linked to specific cognitive manipulations. There is a clear need for a contribution from computational modelling if researchers are to make detailed predictions as to the processing stages that will be modulated by any specific cognitive challenge to lexical processing. One illustration of this is the study of word frequency and age-of-acquisition effects described previously. Computational models have shown that these effects are more pronounced in arbitrary than in systematic mappings (e.g., for reading models, these effects are more robust in the “orthography to semantics” mapping than in the “orthography to phonology” mapping; Lambon Ralph and Ehsan 2006). Whilst we have the beginnings of an account of these different functional pathways for speech comprehension, it is by no means straightforward to infer from the models where and how differential responses to high- and low-frequency or early- and late-acquired words will be observed.

3. Brain imaging studies of lexical processing of spoken sentences

3.1. Effects of intelligibility

In addition to studies of lexical processing that use single words, functional imaging studies using spoken sentences have been used to assess the functional role of different components of the ventral temporal pathway. Evidence to suggest a critical role for the anterior temporal lobe in lexical aspects of speech comprehension came from a key PET study by Sophie Scott and colleagues (2000). The approach adopted by Scott and colleagues was to contrast neural responses to four forms of spoken sentence that were artificially processed to vary in the degree of acoustic degradation and the preservation (or otherwise) of intelligibility. This was achieved by assessing common activation to clearly spoken and noise-vocoded sentences (Shannon et al. 1995), both of which are intelligible, and contrasting this with neural responses to spectrally inverted, unintelligible versions of these stimuli (Blesser 1972). Regions differentially activated by intelligible versus unintelligible speech are unlikely to be responding to acoustic rather than linguistic information, since the acoustic properties of clear and noise-vocoded speech differ substantially (e.g., in the presence vs. absence of pitch, harmonic structure and rapid formant transitions). Anterior regions of the superior and middle temporal gyrus responded more strongly to spoken sentences that were intelligible, compared to acoustically matched unintelligible versions of these sentences. This region is directly anterior to the pre-lexical region (described above) that responded to the phonetic properties of the speech irrespective of intelligibility. These results highlighted a previously unconsidered contribution of the anterior STG to the comprehension of connected speech. However, from this study alone it is unclear how this anterior temporal region is functionally related to more posterior regions that respond equivalently to intelligible and unintelligible speech, and indeed whether some unforeseen common acoustic cue is responsible for driving intelligibility-related responses in anterior temporal regions.
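To give a concrete sense of what noise-vocoding involves, here is a minimal sketch in the spirit of Shannon et al. (1995): the signal is divided into frequency bands, each band's amplitude envelope is extracted, and the envelopes are used to modulate band-limited noise. The band edges, filter order and toy input are assumptions for illustration; real experimental stimuli use calibrated filter banks and smoothed envelopes.

    # Minimal noise-vocoder sketch (parameters illustrative, not calibrated).
    import numpy as np
    from scipy.signal import butter, sosfilt, hilbert

    def noise_vocode(speech, fs, edges=(100, 500, 1500, 4000)):
        out = np.zeros_like(speech)
        noise = np.random.randn(len(speech))
        for lo, hi in zip(edges[:-1], edges[1:]):
            sos = butter(4, [lo, hi], btype="bandpass", fs=fs, output="sos")
            band = sosfilt(sos, speech)
            env = np.abs(hilbert(band))          # amplitude envelope of the band
            out += env * sosfilt(sos, noise)     # envelope-modulated noise band
        return out

    fs = 16000
    t = np.arange(fs) / fs
    demo = np.sin(2 * np.pi * 300 * t) * (1 + np.sin(2 * np.pi * 3 * t))  # toy "speech"
    vocoded = noise_vocode(demo, fs)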


One study that helped to reveal the functional organisation of intelligibility-responsive regions of the lateral temporal lobe was reported by Davis and Johnsrude (2003). This study extended the method of Scott et al. (2000) by using three acoustically different forms of degradation: noise-vocoded speech as before, speech presented against continuous speech-spectrum background noise, and speech intermittently replaced by speech-envelope and spectrum noise. The second innovation was that in each of these speech conditions the severity of signal degradation was manipulated (using varying numbers of vocoder bands, signal-to-noise ratios, and duty cycles of clear speech/noise), so as to generate a range of intelligibility quantified by word report scores (20%–90% words correct). This design permitted a parametric analysis to detect regions showing a neural response correlated with the intelligibility of distorted speech. This method highlights intelligibility-responsive cortex in the bilateral superior and middle temporal gyrus – both anterior and posterior to primary auditory cortex (though posterior activity was seen only in the left hemisphere) – and in the left inferior frontal gyrus.

Within this more extended network, Davis and Johnsrude (2003) distinguished two functional profiles by testing whether these regions were also modulated by the acoustic form of the distortion. Those regions closest to primary auditory cortex showed an intelligibility effect and responded differently to the three forms of acoustic distortion. This response profile suggests involvement in processing intelligible speech at a stage that retains information concerning the auditory form of spoken language. In contrast, more distant regions of the superior and middle temporal gyrus (both anterior and posterior) responded to more intelligible speech in a manner that was insensitive to the acoustic differences between the three forms of distortion. This second response profile is diagnostic of neural processes that are sensitive to the abstract, linguistic content of speech and not its acoustic form. Moreover, the anatomical organisation of these lateral temporal regions is consistent with two distinct, hierarchically organised processing pathways running both anteriorly and posteriorly from peri-auditory regions of the STG. Thus, this study highlights multiple functional pathways within intelligibility-responsive regions.

Results of subsequent correlational studies largely confirm the fractionation of intelligibility-responsive cortex into multiple stages and multiple pathways, as presented by Davis and Johnsrude (2003). For instance, Scott and colleagues (2006) also showed correlations with speech intelligibility in anterior regions of the left temporal lobe.
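The logic of this parametric approach can be sketched in a few lines: each sentence's word-report score enters the analysis as a parametric regressor that is convolved with a haemodynamic response function, and voxels whose time series track the regressor are identified as intelligibility-responsive. All values below, including the gamma-based HRF, are illustrative and not taken from the study itself.

    # Schematic parametric-regressor construction for an intelligibility
    # analysis. TR, onsets, scores and the HRF shape are all assumptions.
    import numpy as np
    from scipy.stats import gamma

    TR = 2.0                                   # seconds per scan (assumption)
    n_scans = 120
    onsets = np.arange(6, 230, 12)             # sentence onsets in seconds
    report = np.random.uniform(0.2, 0.9, len(onsets))  # word-report scores

    # stimulus train weighted by (mean-centred) intelligibility
    stick = np.zeros(int(n_scans * TR))
    stick[onsets] = report - report.mean()

    hrf_t = np.arange(0, 30)
    hrf = gamma.pdf(hrf_t, 6) - 0.35 * gamma.pdf(hrf_t, 16)  # canonical-style HRF
    regressor = np.convolve(stick, hrf)[: len(stick)][:: int(TR)]

    # voxels whose time series correlate with `regressor` are
    # "intelligibility-responsive" in the parametric sense
    print(regressor.shape)   # one value per scan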


Obleser and colleagues (2007) observed an elevated response in posterior superior temporal and inferior parietal regions for sentences that were both predictable and presented at intermediate levels of intelligibility. This is consistent with processes operating on the combination of sentence content and the bottom-up speech signal. Obleser, Eisner, and Kotz (2008) showed that left and right hemisphere regions were differentially sensitive to slow-spectral (envelope) and fast-temporal (pitch) fluctuations in vocoded speech. This result helps to extend long-standing observations of differential right hemisphere contributions to pitch processing for intelligible speech. However, there have been few studies to date that use speech of varying intelligibility to provide evidence of functional specialisation in either anterior or posterior temporal regions.

3.2. Anomalous vs. normal speech

A second approach to studying the functional roles of different brain regions in sentence comprehension has been to compare responses to normal sentences with responses to sentences in which critical words violate semantic, syntactic or pragmatic constraints. By manipulating one specific linguistic aspect of the presented sentences, these studies aim to isolate the brain regions involved in particular aspects of comprehension and thereby assess the functional roles of the different speech-processing pathways. This approach has its roots in the psycholinguistic and ERP literatures, which have used a similar method to provide a wealth of information about the specific time-course of the different aspects of lexical processing (e.g., the temporal relationship between lexical selection and semantic integration; Van Petten et al. 1999).

A key fMRI study using this method was conducted by Kuperberg and colleagues (2000), who compared normal sentences (e.g., “the young man grabbed the guitar”) to ones in which the target word violated either a pragmatic, semantic or syntactic constraint (e.g., “the young man buried/drank/slept the guitar”). Contrasts between normal and anomalous sentences revealed activity in the left inferior temporal cortex and the fusiform gyrus. However, even within this area the reported effect was variable: some voxels showed additional activation for normal sentences, while others showed additional activity for pragmatic anomalies. These authors also found differences between the different types of anomalies within the STG and MTG, with greatest activation in response to pragmatic anomalies and least activation in response to syntactic anomalies. Based on this evidence, the authors emphasise the role of inferior temporal and fusiform


regions in constructing a higher-level representation of sentence meaning. This view is consistent with the claim that the posterior inferior temporal portion of the ventral stream is a critical pathway involved in processing word meanings (Hickok and Poeppel 2004). In contrast, a similar study by Friederici and colleagues (2003) observed activation along the length of the left STG in response to syntactic anomalies and in a more restricted area of mid STG (bilaterally) in response to semantic anomalies.

A fundamental assumption of this anomaly method is that by disrupting one specific linguistic property of a sentence it is possible to isolate the brain regions involved in that aspect of processing. We believe that this assumption is highly problematic. The ERP studies that use this paradigm record an electrical response with a degree of temporal sensitivity that allows us to be (relatively) certain that the observed effects reflect the earliest response to the anomaly, and that this response is an automatic online component of the normal comprehension process. In contrast, fMRI paradigms rely on a slow haemodynamic response that smears together in time participants' initial response to an anomaly and the subsequent processes triggered by its detection. This is particularly true for studies using a blocked design (e.g., Kuperberg et al. 2000), though the same problem remains for event-related studies. These post-lexical responses are likely to be highly variable (between sentences and individuals), and are left unspecified by most cognitive accounts of sentence comprehension. For example, when participants encounter a semantically anomalous word, they may engage in additional processes (compared with normal sentences) in order to try to make sense of the anomaly. They may check whether a consistent representation can be built by accessing an additional low-frequency meaning (e.g., “The dressmaker used a set of expensive frogs” – “frog” here referring to a decorative fastener) or a metaphorical meaning (e.g., “the politician was a shark”). On the other hand, participants may respond to some anomalies by “giving up” and abandoning any attempt to process the rest of the sentence, resulting in a reduction in activation for any aspect of normal sentence comprehension.

An additional concern with this anomaly paradigm relates to the observation that some of the activations seen in response to anomalous sentences are in regions proposed to contribute to low-level acoustic-phonetic processing of speech (STG; Friederici et al. 2003; Kuperberg et al. 2000). Whilst such findings are perplexing for subtractive interpretations (high-level anomaly produces additional activation in low-level speech processes), these results can be readily explained by accounts in which the


speed and ease of identification of words in connected speech are modulated by the presence of supporting sentential context. This is a long-standing theoretical commitment of the Cohort model and subsequent Cohort-inspired accounts of spoken language comprehension. Experimental data show that the time at which specific information is processed in the speech signal is intimately linked with the timing of relevant perceptual input (Marslen-Wilson 1984; Zwitserlood 1989) in combination with supporting higher-level semantic and syntactic content (Marslen-Wilson and Tyler 1980). The absence of supporting context in semantically anomalous sentences not only disrupts the construction of a higher-level sentence interpretation, but also places an increased demand on word-recognition processes that are ordinarily supported by sentence content. It is this disruption of lower-level speech processing that most likely explains observations of superior temporal activation in response to semantic and syntactic anomalies in spoken sentences.

Our claim that the results from anomaly studies cannot be confidently attributed to a specific aspect of speech processing is reinforced by a review of the (more numerous) studies that use this method with visual presentation. The most common pattern of results has been an increase in activation of the inferior frontal gyrus (often left-lateralised) in response to violations of semantic constraints (e.g., Hagoort et al. 2004), syntactic constraints (e.g., Kang et al. 1999), pragmatic constraints (e.g., Kuperberg et al. 2003) and real-world knowledge (e.g., Hagoort et al. 2004). However, this brain region has also shown reduced activation for syntactic violations (e.g., Kuperberg et al. 2003) and semantic violations (Zhu et al. 2009). This apparent conflict is consistent with our proposal that anomalies can cause participants either to “work harder” or to “give up”, thereby producing an increase or decrease in inferior frontal activity. In addition to the variation in the direction of these anomaly effects, there is also considerable variation in their location, with anomalies producing increased activation of the middle frontal gyri (Kang et al. 1999; Kuperberg et al. 2003), left posterior STS (Kuperberg et al. 2003), angular gyrus, supramarginal gyrus (Kang et al. 1999), parietal cortex (Kuperberg et al. 2003), and anterior cingulate (Kang et al. 1999), as well as decreases in left posterior STS (Kuperberg et al. 2003), parietal cortex (Kuperberg et al. 2003), anterior cingulate (Zhu et al. 2009), left anterior occipital sulcus (Kuperberg et al. 2003), left caudate, precentral gyrus (bilaterally), right cuneus, left lingual gyrus, and posterior cingulate (Zhu et al. 2009). This lack of consistency in the location of these effects supports our claim that a range of additional cognitive processes can be triggered by anomalies in sentences. Taken together, these results suggest that this “anomaly” method is unlikely to reliably locate brain regions involved in specific aspects of sentence comprehension.


3.3. Studies of semantic and syntactic ambiguity

The previous section focused on studies in which the sentence materials were designed to include anomalies that would cause sentence comprehension to fail. We argued that the cognitive consequences of comprehension failure are highly variable, and that the associated neural data are hence hard to interpret. We therefore suggest that a more productive method is to study comprehension under conditions in which the language system succeeds despite a processing challenge. Speech comprehension is made difficult by the presence of ambiguity in the signal at many levels, and thus the mechanisms that listeners use to deal with these challenges can provide a neural marker of successful comprehension. One illustration of this method was already reviewed in neuroimaging studies of distorted yet intelligible speech, such as those of Scott et al. (2000) and Davis and Johnsrude (2003). In the latter study, an additional response to distorted yet intelligible stimuli (compared to clear speech) was seen in superior temporal, inferior frontal and premotor regions. These neural correlates of listening effort illustrate how neuroimaging methods can be used to isolate additional processes engaged in response to specific challenges to comprehension. Here we review studies that apply this same method to localising higher-level semantic and syntactic processes.

Speech comprehension is made more difficult by the presence of ambiguous words. For example, to understand the phrase “the bark of the dog” a listener can use the syntactic properties of the word “the” to determine that “bark” is being used as a noun and not a verb. In addition, they can use the semantic properties of the word “dog” to work out that “bark” probably refers to the noise made by that animal and not the outer covering of a tree. These forms of ambiguity are ubiquitous in language. For example, at least 80% of the common words in a typical English dictionary have more than one dictionary definition (Parks, Ray, and Bland 1998; Rodd, Gaskell, and Marslen-Wilson 2002), and some words have a very large number of different meanings: there are 44 different definitions listed for the word “run” in the WordSmyth Dictionary (Parks et al. 1998; e.g., “an athlete runs a race”, “a river runs to the sea”, “a politician runs for office”).
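The ubiquity of lexical ambiguity is easy to confirm informally. The snippet below counts senses in WordNet via NLTK; note that WordNet synsets are a much finer grain than the WordSmyth definitions counted above, so the exact numbers will differ.

    # Informal check of sense counts using WordNet through NLTK.
    # Requires: pip install nltk, then nltk.download("wordnet") once.
    from nltk.corpus import wordnet as wn

    for word in ["run", "bark", "bank", "table"]:
        senses = wn.synsets(word)
        nouns = [s for s in senses if s.pos() == "n"]
        print(f"{word}: {len(senses)} senses ({len(nouns)} nominal)")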


Each time one of these ambiguous words is encountered, the listener must select the appropriate meaning on the basis of its sentence context. By studying the additional processing that takes place when a semantically ambiguous word is encountered, we can gain insights into the processes involved in activating and selecting word meanings while other lexical and pre-lexical variables are well controlled.

Studies of both semantic and syntactic ambiguity have emphasised the role of the left inferior frontal gyrus (LIFG) and the posterior temporal lobe. Rodd, Davis, and Johnsrude (2005) compared high-ambiguity sentences, which contained at least two ambiguous words, with well-matched low-ambiguity sentences, and reported a large cluster of LIFG activation with its peak within the pars opercularis. This is consistent with the results of subsequent studies in both the auditory (Davis et al. 2007) and visual domains (Mason and Just 2007; Zempleni et al. 2007). In addition, Rodd et al. (2010) have recently shown that overlapping regions of the LIFG are activated by semantic and syntactic ambiguities (within the same subjects). These results are consistent with the view that this region contributes to combinatorial aspects of language comprehension (Rodd et al. 2005; Willems and Hagoort 2009; Hagoort 2005). With respect to the temporal lobe, these studies provide consistent evidence for the involvement of its posterior portions. Increased activation has been observed in the left posterior inferior temporal gyrus and MTG for semantic ambiguities (Davis et al. 2007; Rodd et al. 2005) and in the left posterior MTG for syntactic ambiguities (Rodd et al. 2010).

One word of caution is needed with respect to this general approach. The use of different ambiguities does provide a way of increasing the processing load on specific aspects of speech comprehension, and it avoids many of the problems associated with the alternative approach of using different types of anomalies (see previous section). However, this approach does not completely avoid the issue of processing interactivity, since high-level ambiguities may also impact on lower-level perceptual processing. Indeed, Rodd et al. (2005) did report a small (but significant) ambiguity-related increase in a “pre-lexical” region of the left STG. However, unlike the anomaly literature, the ambiguity literature has (so far) provided more consistent results, and has highlighted a relatively small set of brain regions, all of which have previously been proposed as being critical for lexical processing. In particular, activation of the posterior (and not anterior) aspects of the temporal lobe provides clear support for the view that these regions form part of a ventral processing stream that is critical for accessing the meaning of spoken words (Hickok and Poeppel 2004). We are hopeful that extending this approach to the study of other forms of ambiguity (e.g., acoustic, phonological) would provide additional insights into the functional contributions of the different brain regions that are involved in lexical processing.


4. Looking forward in time: Fast electrophysiological measures of lexical processing

The work described so far has shown the challenges in using slow measures of haemodynamic activity to localise specific cognitive components of lexical and sentential processing. We have seen striking discrepancies between the nature of the cognitive manipulation employed in a specific study (e.g., higher-level semantic anomaly or priming) and the location of the primary difference in neural activity observed (often in superior temporal regions engaged in early perceptual processing of speech). One interpretation of this discrepancy is that it reflects the operation of top-down or interactive processes. Early phonological processing of speech is ordinarily supported by higher-level contextual information; manipulations that remove this higher-level support thus place an additional load on lower-level processes. Although this is both a cognitively and neurally plausible explanation of these observations, it is important that cognitive neuroscience devise methods of testing for these top-down modulations, since bottom-up accounts are also plausible. In order to assess the direction of information flow in these complex neural networks, we need measures that have both (1) the spatial resolution to localise activation in higher-level frontal and temporal “semantic” areas and in lower-level auditory regions of the STG, and (2) the temporal resolution to determine whether responses of higher-level regions lead or lag the activation of lower-level regions. It is only when higher-level responses precede lower-level responses that we can safely infer top-down, interactive processes. Modern fMRI methods can provide whole-brain activation measures on a second-by-second basis (Schwarzbauer et al. 2006), but this temporal resolution may be insufficient to detect the rapid neuronal interactions involved in sentence comprehension. We have described methods by which slow haemodynamic responses to semantic and acoustic challenges can be used to localise specific processing stages in speech comprehension. However, it is only by tracing the time-course of word recognition and ambiguity resolution that we can map out the functional organisation of the processing stages involved in speech comprehension.
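The lead/lag logic can be illustrated with a toy cross-correlation: if one region's response reliably precedes another's, the correlation between their time series peaks at a non-zero lag. The signals below are synthetic, and real analyses of directed influence between regions are considerably more involved; this sketch is only meant to make the inferential logic concrete.

    # Toy lead/lag estimation via cross-correlation of two synthetic
    # regional time series; `higher` is constructed to lead `lower`.
    import numpy as np

    rng = np.random.default_rng(0)
    n, true_lag = 1000, 15                     # samples; "higher" leads by 15
    higher = rng.standard_normal(n)
    lower = np.roll(higher, true_lag) + 0.5 * rng.standard_normal(n)  # circular shift

    lags = np.arange(-50, 51)
    xcorr = [np.corrcoef(higher[50 + k : n - 50 + k], lower[50 : n - 50])[0, 1]
             for k in lags]
    best = lags[int(np.argmax(xcorr))]
    print(f"peak at lag {best}: 'higher' leads 'lower' by {-best} samples")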


For these reasons, the cognitive neuroscience of speech comprehension needs measures of neural activation with high temporal resolution. Speech-evoked brain responses that can be measured with electro- and magnetoencephalography (EEG and MEG) provide an opportunity to trace lexical and semantic contributions to comprehension over the time-course of presentation of a single word (