This handbook presents detailed accounts of current research in all aspects of language prosody, written by leading experts.
English · 957 pages · 2021
Table of contents:
Cover
The Oxford Handbook of Language Prosody
Copyright
Contents
Acknowledgements
List of Figures
List of Tables
List of Maps
List of Abbreviations
About the Contributors
Chapter 1: Introduction
1.1 Introduction
1.2 Motivating our enterprise
1.3 Definitional and terminological issues
1.3.1 Tradition and innovation in defining language prosody
1.3.2 Some typological categories
1.3.3 Some terminological ambiguities
1.4 The structure of the handbook
1.5 Reflections and outlook
Part I: Fundamentals of Language Prosody
Chapter 2: Articulatory Measures of Prosody
2.1 Introduction
2.2 Experimental techniques
2.2.1 Laryngoscopy
2.2.2 Electroglottography
2.3 Aerodynamic and respiratory movement measures
2.4 Point-tracking techniques for articulatory movements
2.4.1 Ultrasound
2.4.2 Electropalatography
2.5 Summary of articulatory measurement techniques
2.6 Conclusion
Acknowledgements
Chapter 3: Fundamental Aspects in the Perception of f0
3.1 Introduction
3.2 A history of fundamental pitch perception research
3.2.1 Basic terminology
3.2.2 Theories of pitch perception
3.2.3 Critical bands and their importance for pitch perception theories
3.2.4 Which components are important?
3.3 Pitch perception in speech
3.3.1 Just noticeable differences and limitations in the perception of f0
3.3.2 Segmental influences on the perception of f0
3.3.3 Perceptual interplay between prosodic parameters
3.4 Conclusion
Part II: Prosody and Linguistic Structure
Chapter 4: Tone Systems
4.1 Introduction: What is tone?
4.1.1 Tone as toneme versus morphotoneme
4.1.2 Tone as pitch versus tone package
4.1.3 Tone-bearing unit versus tonal domain (mora, syllable, foot)
4.1.4 Tone versus accent
4.2 Phonological typology of tone by inventory
4.2.1 Number of tones
4.2.2 Contour tones
4.2.3 Downstep and floating tones
4.2.4 Underspecified tone and tonal markedness
4.2.5 Distributional constraints
4.3 Phonological typology of tone by process
4.3.1 Vertical assimilation
4.3.2 Horizontal assimilation
4.3.3 Contour simplification
4.3.4 Dissimilation and polarity
4.4 Grammatical tone
4.4.1 Lexical versus morphological tone
4.4.2 Tonal morphemes
4.4.3 Replacive tone
4.4.4 Inflectional tonology
4.4.5 Compounding
4.4.6 Phrase-level tonology
4.5 Further issues: Phonation and tone features
4.6 Conclusion
Chapter 5: Word-Stress Systems
5.1 Introduction
5.2 Evidence for stress
5.2.1 Phonetic exponents
5.2.2 Speaker intuitions and co-speech gestures
5.2.3 Segmental and metrical exponents of stress
5.2.4 Distributional characteristics of stress
5.3 Typology of stress
5.3.1 Lexical versus predictable stress
5.3.2 Quantity-insensitive stress
5.3.3 Quantity-sensitive stress
5.3.4 Bounded and unbounded stress
5.3.5 Secondary stress
5.3.6 Non-finality effects
5.4 Rhythmic stress and the foot
5.5 Outstanding issues in word stress
5.5.1 The diagnosis of stress
5.5.2 Stress and prosodic taxonomy
5.5.3 Stress typology and explanation
5.6 Conclusion
Additional reading
Chapter 6: The Autosegmental-Metrical Theory of Intonational Phonology
6.1 Introduction
6.2 AM phonology
6.2.1 AM essentials
6.2.2 Metrical structure and its relationship with the autosegmental tonal string
6.2.3 Secondary association of tones
6.2.4 The phonological composition of melodies
6.3 Phonetic implementation in AM
6.3.1 Tonal alignment
6.3.2 Tonal scaling
6.3.3 Interpolation and tonal crowding
6.4 Applications of AM
6.5 Advantages over other models
Chapter 7: Prosodic Morphology
7.1 Introduction
7.2 Prosodic structure
7.3 Reduplication
7.4 Root-and-pattern morphology
7.5 Truncation
7.6 Infixation
7.7 Summary
Chapter 8: Sign Language Prosody
8.1 The visible organization of sign languages
8.2 Prosodic constituency in signed languages
8.2.1 The syllable and the prosodic word
8.2.2 Intonational phrases
8.2.3 Phonological phrases
8.3 Defining properties of sign language intonation
8.4 Intonation and information structure
8.4.1 Topic/comment
8.4.2 Given/new information
8.4.3 Focus/background
8.5 Prosody versus syntax: evidence from wh-questions
8.6 Summary and conclusion
Acknowledgements
Part III: Prosody in Speech Production
Chapter 9: Phonetic Variation in Tone and Intonation Systems
9.1 Introduction
9.2 Tonal coarticulation
9.3 Timing of pitch movements
9.3.1 Segmentally induced variability in f0 target realization
9.3.2 Time pressure effects on f0 target realization
9.3.3 Truncation and compression
9.4 Scaling of pitch movements
9.4.1 Pitch range variability: basic characteristics
9.4.2 Paralanguage, pitch range (quasi-)universals, and grammaticalization
9.4.3 Downtrend
9.4.4 Perceptual constraints on tone scaling patterns
9.5 Contour shape
9.5.1 Peak shapes and movement curvatures
9.5.2 ‘Dipping’ Lows and local contrast
9.5.3 Integrality of f0 features
9.6 Non-f0 effects
9.7 Conclusion
Chapter 10: Phonetic Correlates of Word and Sentence Stress
10.1 Introduction
10.2 Acoustic correlates of word stress
10.2.1 Segment duration
10.2.2 Intensity
10.2.3 Spectral tilt
10.2.4 Spectral expansion
10.2.5 Resistance to coarticulation
10.2.6 Rank order
10.3 Acoustic correlates of sentence stress
10.4 Perceptual cues of word and sentence stress
10.5 Cross-linguistic differences in phonetic marking of stress
10.5.1 Contrastive versus demarcative stress
10.5.2 Functional load hypothesis
10.6 Conclusion
Appendix
Measuring correlates of stress using Praat speech processing software
Measuring duration
Measuring intensity
Measuring spectral tilt
Measuring formants F1, F2 (for vowels and sonorant consonants)
Measuring noise spectra (for fricatives, stops, and affricates)
Measuring pitch correlates
Chapter 11: Speech Rhythm and Timing
11.1 Introduction
11.1.1 Periodicity in surface timing
11.1.2 Contrastive rhythm
11.1.3 Hierarchical timing
11.1.4 Articulation rate
11.2 ‘Rhythm metrics’ and prosodic typology
11.2.1 Acoustically based metrics of speech rhythm: lessons and limitations
11.2.2 The fall of the rhythm class hypothesis
11.3 Models of prosodic speech timing
11.3.1 Localized approaches to prosodic timing
11.3.2 Coupled oscillator approaches to prosodic timing
11.4 Conclusions and prospects
Part IV: Prosody Across the World
Chapter 12: Sub-Saharan Africa
12.1 Introduction
12.2 Tone
12.2.1 Tonal inventories
12.2.2 The representation of tone
12.2.3 Phonological tone rules/constraints
12.2.4 Grammatical functions of tone
12.3 Word accent
12.4 Intonation
12.4.1 Pitch as marking sentence type or syntactic domain
12.4.2 Length marking prosodic boundaries
12.5 Conclusion
Chapter 13: North Africa and the Middle East
13.1 Introduction
13.2 Afro-Asiatic
13.2.1 Berber
13.2.2 Egyptian
13.2.3 Semitic
13.2.3.1 East Semitic
13.2.3.2 West Semitic: Modern South Arabian
13.2.3.3 West Semitic: Ethio-Semitic
13.2.3.4 Central Semitic: Sayhadic
13.2.3.5 Central Semitic: North West Semitic
13.2.3.6 Central Semitic: Arabian
13.2.4 Chadic
13.2.5 Cushitic
13.2.6 Omotic
13.3 Nilo-Saharan
13.3.1 Eastern Sudanic
13.3.2 Central Sudanic
13.3.3 Maban
13.3.4 Saharan
13.4 Discussion
Chapter 14: South West and Central Asia
14.1 Introduction
14.2 Turkic
14.2.1 Lexical prosody in Turkish: stress
14.2.2 Lexical prosody: vowel harmony in Turkish
14.2.3 Post-lexical prosody in Turkish
14.2.4 Focus in Turkish
14.3 Mongolian
14.3.1 Lexical prosody in Mongolic: stress
14.3.2 Lexical prosody: vowel harmony in Mongolian
14.3.3 Post-lexical prosody in Mongolian
14.3.4 Focus in Mongolian
14.4 Persian
14.4.1 Lexical prosody in Persian
14.4.2 Post-lexical prosody in Persian
14.4.3 Focus in Persian
14.5 Caucasian
14.5.1 Georgian
14.5.1.1 Lexical prosody in Georgian
14.5.1.2 Post-lexical prosody in Georgian
14.5.1.3 Focus in Georgian
14.5.2 Daghestanian
14.6 Communicative prosody: question intonation
14.7 Conclusion
Chapter 15: Central and Eastern Europe
15.1 Introduction
15.2 Word prosody
15.2.1 Quantity
15.2.1.1 Baltic
15.2.1.2 Finno-Ugric
15.2.1.3 Slavic
15.2.2 Word stress
15.2.2.1 Baltic
15.2.2.2 Finno-Ugric
15.2.2.3 Slavic
15.2.2.4 Romance
15.3 Sentence prosody
15.3.1 Baltic
15.3.2 Finno-Ugric
15.3.3 Slavic
15.3.4 Romance
15.4 Conclusion
Chapter 16: Southern Europe
16.1 Introduction
16.2 Prosodic structure
16.2.1 Stress
16.2.2 Rhythm
16.2.3 Prosodic constituency
16.3 Intonation
16.3.1 Inventories
16.3.2 Downstep
16.3.3 Copying, merging, and truncation
16.4 Conclusion
Chapter 17: Iberia
17.1 Introduction
17.2 Word prosody
17.2.1 Catalan, Spanish, and Portuguese
17.2.2 Basque
17.3 Prosodic phrasing
17.3.1 Prosodic constituents and tonal structure
17.3.2 Phrasal prominence
17.4 Intonation
17.4.1 Tonal events
17.4.2 Main sentence types and pragmatic meanings
17.5 Conclusion and perspectives
Acknowledgements
Chapter 18: Northwestern Europe
18.1 Introduction
18.2 Continental North Germanic
18.2.1 Word stress
18.2.2 Tone
18.2.2.1 Typology
18.2.3 Intonation
18.2.4 Notes on Danish
18.2.5 Prosodic domains
18.3 Continental West Germanic
18.3.1 Prosodic domains
18.3.2 Word stress
18.3.3 Intonation
18.3.4 Tone accents
18.4 Concluding remarks
Chapter 19: Intonation Systems Across Varieties of English
19.1 The role of English in intonation research
19.2 Scope of the chapter
19.3 Intonational systems of mainstream English varieties
19.3.1 Northern hemisphere
19.3.2 Non-mainstream varieties of American English
19.3.3 Non-mainstream British varieties
19.3.4 Southern hemisphere mainstream varieties
19.4 English intonation in contact
19.4.1 Hong Kong English
19.4.2 West African Englishes (Nigeria and Ghana)
19.4.3 Singapore English
19.4.4 Indian English
19.4.5 South Pacific Englishes (Niue, Fiji, and Norfolk Island)
19.4.6 East African Englishes (Kenya and Uganda)
19.4.7 Caribbean English
19.4.8 Black South African English
19.4.9 Maltese English
19.5 Uptalk
19.6 Conclusion
Chapter 20: The North Atlantic and the Arctic
20.1 Introduction
20.2 Celtic
20.2.1 Irish and Scottish Gaelic
20.2.2 Intonation
20.3 Insular Scandinavian
20.3.1 Stress in words and phrases
20.3.2 Intonation
20.4 Eskimo-Aleut
20.4.1 Inuit
20.4.2 Yupik
20.4.3 Aleut
20.5 Conclusion
Chapter 21: The Indian Subcontinent
21.1 Introduction
21.2 Quantity
21.3 Word stress
21.4 Tone
21.5 Intonation and intonational tunes
21.5.1 Declarative
21.5.2 Focus
21.5.3 Yes/no questions, with and without focus
21.6 Segmental rules and phrasing
21.7 Conclusion
Acknowledgements
Chapter 22: China and Siberia
22.1 Introduction
22.2 The syllable and tone inventories of Chinese languages
22.3 Tone sandhi in Chinese languages
22.4 Lexical and phrasal stress in Chinese languages
22.5 Intonation in Chinese languages
22.5.1 Focus
22.5.2 Interrogativity
22.6 The prosody of Siberian languages
22.7 Summary
Chapter 23: Mainland South East Asia
23.1 Scope of the chapter
23.2 Word-level prosody
23.2.1 Word shapes and stress
23.2.2 Tonation
23.2.2.1 Inventories
23.2.2.2 Tonal phonology, tone sandhi, and morphotonology
23.3 Phrasal prosody
23.3.1 Prosodic phrasing
23.3.2 Intonation
23.3.3 Information structure
23.4 Conclusion
Chapter 24: Asian Pacific Rim
24.1 Introduction
24.2 Japanese
24.2.1 Japanese word accent
24.2.2 Japanese intonation
24.3 Korean
24.3.1 Korean word prosody
24.3.2 Korean intonation: melodic aspects
24.3.3 Korean intonation: prosodic phrasing
24.4 Conclusion
Acknowledgement
Chapter 25: Austronesia
25.1 Introduction
25.2 Lexical tone
25.3 Lexical stress
25.4 Intonation
25.5 Prosodic integration of function words
25.6 Conclusion
Chapter 26: Australia and New Guinea
26.1 General metrical patterns in Australia
26.1.1 Quantity and peninitial stress
26.2 Intonation in Australian languages
26.3 Word prosody in New Guinea
26.3.1 Stress
26.3.2 Tone
26.4 Intonation in Papuan languages
26.5 Conclusion
Chapter 27: North America
27.1 Introduction
27.2 Stress in North American Indian languages
27.2.1 Typology of stress in North American Indian languages
27.2.2 Weight-sensitive stress
27.2.3 Iambic lengthening
27.2.4 Morphological stress
27.2.5 Phonetic exponents of stress in North America
27.3 Tone in North American Indian languages
27.3.1 Tonal inventories
27.3.2 Tonal processes
27.3.3 Stress and tone
27.3.4 Grammatical tone
27.3.5 Tonal innovations and multidimensionality of tone realization
27.4 Intonation and prosodic constituency
27.5 Prosodic morphology
27.6 Conclusions
Chapter 28: Mesoamerica
28.1 Introduction
28.2 Oto-Manguean languages
28.2.1 Lexical tone
28.2.2 Stress
28.2.3 Phonation type
28.2.4 Syllable structure and length
28.2.5 Intonation and prosody above the word
28.3 Mayan languages
28.3.1 Stress and metrical structure
28.3.2 Lexical tone
28.3.3 Phonation
28.3.4 Syllable structure
28.3.5 Intonation
28.4 Toto-Zoquean
28.4.1 Syllable structure, length, and phonation type
28.4.2 Stress and intonation
28.5 Conclusion
Acknowledgements
Chapter 29: South America
29.1 Introduction
29.2 Stress and metrical structure
29.2.1 Manifestation of prominence
29.2.2 Metrical feet and edges
29.2.3 Syllable, mora, and quantity
29.3 Tones
29.3.1 H, M, L
29.3.2 Underlying L and default H
29.3.3 Languages with underlying H and default L
29.3.4 Languages with underlying H and L
29.4 Sonority hierarchies, laryngeals, and nasality
29.5 Word prosody and morphology
29.5.1 Stress and morphology
29.5.2 Tones and morphology
29.6 Historical and comparative issues
29.6.1 Stress
29.6.2 Tones
29.7 Conclusion
Part V: Prosody in Communication
Chapter 30: Meanings of Tones and Tunes
30.1 Introduction
30.2 Basic concepts for the study of intonational meaning
30.3 Generalist and specialist theories of intonational meaning
30.3.1 Generalist theories
30.3.2 Specialist theories
30.4 Towards unifying generalist and specialist theories
30.5 Experimental work on intonational meaning
30.6 Conclusion
Acknowledgements
Chapter 31: Prosodic Encoding of Information Structure: A typological perspective
31.1 Introduction
31.2 Basic concepts of information structure
31.3 A typology of prosodic encoding of information structure
31.3.1 Stress- or pitch-accent-based cues
31.3.1.1 Types of nuclear accent or tune
31.3.2 Phrase-based cues
31.3.3 Register-based cues
31.4 Syntax–prosody interaction and non-prosodic-marking systems
31.5 Unified accounts
31.6 Evaluation and considerations for future research
Chapter 32: Prosody in Discourse and Speaker State
32.1 Introduction
32.2 Prosody in discourse
32.2.1 Prosody and turn-taking
32.2.2 Prosody and entrainment
32.3 Prosody and speaker state
32.3.1 Prosody and emotion
32.3.2 Prosody and deception
32.4 Conclusion
Chapter 33: Visual Prosody Across Cultures
33.1 Introduction
33.2 What is visual prosody?
33.2.1 How do auditory and visual prosodic cues relate to each other?
33.2.2 Is there cultural variability in (audio)visual prosody?
33.3 Three case studies
33.3.1 Cues to feeling-of-knowing in Dutch and Japanese
33.3.2 Correlates of winning and losing by Dutch and Pakistani children
33.3.3 Gestural cues to time in English and Chinese
33.4 Discussion and conclusion
Chapter 34: Pathological Prosody: Overview, assessment, and treatment
34.1 Introduction
34.2 Neural bases of pathological prosody
34.2.1 History of approaches to hemispheric specialization
34.2.2 Current proposals of hemispheric specialization of prosodic elements
34.2.3 Disturbances of pitch control and temporal cues
34.2.4 Subcortical involvement in speech prosody: basal ganglia and cerebellum
34.2.5 Prosody in autism
34.3 Evaluation of prosodic performance
34.4 Treatment for prosodic deficits
34.5 Concluding remarks
Part VI: Prosody and Language Processing
Chapter 35: Cortical and Subcortical Processing of Linguistic Pitch Patterns
35.1 Introduction
35.2 The basic functional anatomy of the human auditory system
35.3 Experimental methods in the neurophysiology of language processing
35.4 Hemispheric specialization
35.5 Neural evidence for mechanisms of linguistic pitch processing
35.6 Cortical plasticity of pitch processing
35.7 Subcortical pitch processing and its plasticity
35.8 Prosody and syntactic processing in the brain
35.9 Conclusion and future directions: bridging linguistic theory and brain models
Chapter 36: Prosody and Spoken-Word Recognition
36.1 Introduction
36.2 Defining prosody in spoken-word recognition
36.3 The Bayesian prosody recognizer: robustness under variability
36.3.1 Parallel uptake of information
36.3.1.1 Influences on processing segmental information
36.3.1.2 Influences on lexical segmentation
36.3.1.3 Influences on lexical selection
36.3.1.4 Influences on inferences about other structures
36.3.2 High contextual dependency
36.3.2.1 Left-context effects
36.3.2.2 Right-context effects
36.3.2.3 Syntagmatic representation of pitch
36.3.3 Adaptive processing
36.3.4 Phonological abstraction
36.4 Conclusions and future directions
Chapter 37: The Role of Phrase-Level Prosody in Speech Production Planning
37.1 Introduction
37.2 Modern theories of prosody
37.3 Evidence for the active use of prosodic structure in speech production planning
37.3.1 Rules and processes that are sensitive to prosodic constituent boundaries: selection of cues
37.3.2 Patterns of surface phonetic values in cues to distinctive segmental features, reflecting prosodic structure
37.3.3 Behavioural evidence for the role of prosody in speech planning
37.4 The role of prosody in models of speech production planning
37.5 Summary and related issues
Part VII: Prosody and Language Acquisition
Chapter 38: The Acquisition of Word Prosody
38.1 Introduction
38.2 The acquisition of lexical tone
38.2.1 Perception of lexical tones
38.2.2 The role of lexical tone in word learning
38.2.3 Production of lexical tones
38.2.4 Summary
38.3 The acquisition of pitch accent
38.3.1 Perception of pitch accent
38.3.2 The role of pitch accent in word recognition and learning
38.3.3 Production of lexical accent
38.3.4 Summary
38.4 The acquisition of word stress
38.4.1 The perception of word stress
38.4.2 The role of word stress in word recognition and word learning
38.4.3 The production of word stress
38.4.4 Summary
38.5 Discussion and conclusions
Chapter 39: Development of Phrase-Level Prosody from Infancy to Late Childhood
39.1 Introduction
39.2 Prosody in infancy
39.2.1 Infants’ perception of tonal patterns and prosodic phrasing
39.2.2 Prosody in pre-lexical vocalizations
39.3 Prosodic production in childhood
39.3.1 The acquisition of intonational contours
39.3.2 The acquisition of speech rhythm
39.4 Communicative uses of prosody in childhood: production and comprehension
39.4.1 Acquisition of prosody and information structure
39.4.2 Acquisition of prosody and sociopragmatic meanings
39.5 Future research
Chapter 40: Prosodic Bootstrapping
40.1 Introduction
40.2 Prosodic bootstrapping theory
40.3 Newborns’ sensitivity to prosody as a foundation for prosodic bootstrapping
40.4 How early sensitivity to prosody facilitates language learning
40.4.1 Prosodic grouping biases and the Iambic-Trochaic Law
40.4.2 How lexical stress helps infants to learn words
40.4.3 How prosody bootstraps basic word order
40.4.4 How prosody constrains syntactic analysis
40.5 Conclusion and perspectives
Chapter 41: Prosody in Infant- and Child-Directed Speech
41.1 Introduction
41.2 Primary prosodic characteristics of infant- and child-directed speech
41.3 Cross-cultural similarities and differences
41.4 Other sources of variation
41.5 Function of prosodic characteristics
41.6 Conclusion and future directions
Chapter 42: Prosody in Children with Atypical Development
42.1 Introduction
42.2 Autism spectrum disorder
42.2.1 Prosody production
42.2.2 Prosody perception
42.3 Developmental language disorder
42.3.1 Prosody production
42.3.2 Prosody perception
42.4 Cerebral palsy
42.4.1 Prosody production
42.5 Hearing loss
42.5.1 Prosody production
42.5.1.1 Prosody production in sentences
42.5.1.2 Emotional prosody production
42.5.2 Prosody perception
42.5.2.1 Prosody and sentence perception
42.5.2.2 Prosody and emotion perception
42.6 Clinical practice in developmental prosody disorders
42.6.1 Assessing prosody deficits
42.6.2 Treatment of prosody deficits
42.7 Conclusion
Chapter 43: Word Prosody in Second Language Acquisition
43.1 Introduction
43.2 Lexical stress
43.2.1 Second language word perception/recognition
43.2.2 Second language word production
43.3 Lexical tone
43.3.1 Second language perception/recognition of lexical tone
43.3.2 Second language production of lexical tone
43.4 Conclusions and future directions
Chapter 44: Sentence Prosody in a Second Language
44.1 Introduction
44.2 Intonational aspects of second language sentence prosody
44.2.1 Prosodic marking of information structure
44.2.2 Prosodic marking of questions
44.2.3 Prosodic phrasing
44.2.4 Phonetic implementation of pitch accents and boundary tones
44.2.5 Prosodic marking of non-linguistic aspects
44.3 Timing phenomena in second language sentence prosody
44.3.1 Rhythm
44.3.2 Tempo and pauses
44.3.3 Fluency
44.4 Perception of second language sentence prosody
44.4.1 Perception and interpretation
44.4.2 Perceived foreign accent and ease of understanding
44.5 Conclusions
Acknowledgements
Chapter 45: Prosody in Second Language Teaching: Methodologies and effectiveness
45.1 Introduction
45.2 The importance of prosody for L2 learners
45.3 Teaching prosody
45.3.1 Intonation
45.3.2 Rhythm
45.3.3 Word stress
45.4 The effectiveness of L2 pronunciation instruction applied to prosody
45.4.1 Awareness
45.4.2 Perception
45.4.3 Production
45.4.4 Multi-modality: visual and auditory input and feedback
45.5 Conclusion
Part VIII: Prosody in Technology and the Arts
Chapter 46: Prosody in Automatic Speech Processing
46.1 Introduction
46.2 A short history of prosody in automatic speech processing
46.2.1 Timeline
46.2.2 Phenomena and performance
46.3 Features and their importance
46.3.1 Power features
46.3.2 Leverage features
46.3.3 An illustration
46.4 Concluding remarks
Acknowledgements
Chapter 47: Automatic Prosody Labelling and Assessment
47.1 Introduction
47.2 Prosodic inventories
47.3 Two types of information about prosody
47.3.1 Information from syntax
47.3.2 Information from the acoustic signal
47.3.3 Fusion of syntactic and acoustic information
47.4 How AuToBI works
47.5 Assessment
47.5.1 Intrinsic assessment of automatic labelling
47.5.2 Extrinsic assessment of automatic labelling
47.5.3 Assessment of language learner prosody
47.6 Conclusion
Chapter 48: Stress, Meter, and Text-Setting
48.1 Introduction
48.2 Meter
48.3 Prominence, rhythm, and stress
48.4 Stress-based meters
48.5 Quantitative meters
48.6 Text-setting
Acknowledgements
Chapter 49: Tone–Melody Matching in Tone-Language Singing
49.1 Introduction
49.2 Defining and investigating tone–melody matching
49.3 Some examples
49.3.1 Cantonese pop music
49.3.2 Vietnamese tân nhạc
49.3.3 Contemporary Thai song
49.3.4 Traditional Dinka songs
49.4 Prospect
References
Index of Languages
Subject Index
OXFORD HANDBOOKS IN LINGUISTICS
Recently published
THE OXFORD HANDBOOK OF LANGUAGE POLICY AND PLANNING Edited by James W. Tollefson and Miguel Pérez-Milans
THE OXFORD HANDBOOK OF PERSIAN LINGUISTICS Edited by Anousha Sedighi and Pouneh Shabani-Jadidi
THE OXFORD HANDBOOK OF ENDANGERED LANGUAGES Edited by Kenneth L. Rehg and Lyle Campbell
THE OXFORD HANDBOOK OF ELLIPSIS Edited by Jeroen van Craenenbroeck and Tanja Temmerman
THE OXFORD HANDBOOK OF LYING Edited by Jörg Meibauer
THE OXFORD HANDBOOK OF TABOO WORDS AND LANGUAGE Edited by Keith Allan
THE OXFORD HANDBOOK OF MORPHOLOGICAL THEORY Edited by Jenny Audring and Francesca Masini
THE OXFORD HANDBOOK OF REFERENCE Edited by Jeanette Gundel and Barbara Abbott
THE OXFORD HANDBOOK OF EXPERIMENTAL SEMANTICS AND PRAGMATICS Edited by Chris Cummins and Napoleon Katsos
THE OXFORD HANDBOOK OF EVENT STRUCTURE Edited by Robert Truswell
THE OXFORD HANDBOOK OF LANGUAGE ATTRITION Edited by Monika S. Schmid and Barbara Köpke
THE OXFORD HANDBOOK OF LANGUAGE CONTACT Edited by Anthony P. Grant
THE OXFORD HANDBOOK OF NEUROLINGUISTICS Edited by Greig I. de Zubicaray and Niels O. Schiller
THE OXFORD HANDBOOK OF ENGLISH GRAMMAR Edited by Bas Aarts, Jill Bowie, and Gergana Popova
THE OXFORD HANDBOOK OF AFRICAN LANGUAGES Edited by Rainer Vossen and Gerrit J. Dimmendaal
THE OXFORD HANDBOOK OF NEGATION Edited by Viviane Déprez and M. Teresa Espinal
THE OXFORD HANDBOOK OF LANGUAGE PROSODY Edited by Carlos Gussenhoven and Aoju Chen
For a complete list of Oxford Handbooks in Linguistics please see pp. 893–896.
The Oxford Handbook of Language Prosody
Edited by Carlos Gussenhoven and Aoju Chen
Great Clarendon Street, Oxford, OX2 6DP, United Kingdom

Oxford University Press is a department of the University of Oxford. It furthers the University’s objective of excellence in research, scholarship, and education by publishing worldwide. Oxford is a registered trade mark of Oxford University Press in the UK and in certain other countries.

© editorial matter and organization Carlos Gussenhoven and Aoju Chen 2020
© the chapters their several contributors 2020

The moral rights of the authors have been asserted

First Edition published in 2020
Impression: 1

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, without the prior permission in writing of Oxford University Press, or as expressly permitted by law, by licence or under terms agreed with the appropriate reprographics rights organization. Enquiries concerning reproduction outside the scope of the above should be sent to the Rights Department, Oxford University Press, at the address above.

You must not circulate this work in any other form and you must impose this same condition on any acquirer.

Published in the United States of America by Oxford University Press, 198 Madison Avenue, New York, NY 10016, United States of America.

British Library Cataloguing in Publication Data
Data available

Library of Congress Control Number: 2020937413

ISBN 978–0–19–883223–2

Printed and bound by CPI Group (UK) Ltd, Croydon, CR0 4YY

Links to third party websites are provided by Oxford in good faith and for information only. Oxford disclaims any responsibility for the materials contained in any third party website referenced in this work.
Contents

Acknowledgements
List of Figures
List of Tables
List of Maps
List of Abbreviations
About the Contributors

1. Introduction
Carlos Gussenhoven and Aoju Chen

Part I: Fundamentals of Language Prosody

2. Articulatory measures of prosody
Taehong Cho and Doris Mücke

3. Fundamental aspects in the perception of f0
Oliver Niebuhr, Henning Reetz, Jonathan Barnes, and Alan C. L. Yu

Part II: Prosody and Linguistic Structure

4. Tone systems
Larry M. Hyman and William R. Leben

5. Word-stress systems
Matthew K. Gordon and Harry van der Hulst

6. The Autosegmental-Metrical theory of intonational phonology
Amalia Arvaniti and Janet Fletcher

7. Prosodic morphology
John J. McCarthy

8. Sign language prosody
Wendy Sandler, Diane Lillo-Martin, Svetlana Dachkovsky, and Ronice Müller de Quadros

Part III: Prosody in Speech Production

9. Phonetic variation in tone and intonation systems
Jonathan Barnes, Hansjörg Mixdorff, and Oliver Niebuhr

10. Phonetic correlates of word and sentence stress
Vincent J. van Heuven and Alice Turk

11. Speech rhythm and timing
Laurence White and Zofia Malisz

Part IV: Prosody Across the World

12. Sub-Saharan Africa
Larry M. Hyman, Hannah Sande, Florian Lionnet, Nicholas Rolle, and Emily Clem

13. North Africa and the Middle East
Sam Hellmuth and Mary Pearce

14. South West and Central Asia
Anastasia Karlsson, Güliz Güneş, Hamed Rahmani, and Sun-Ah Jun

15. Central and Eastern Europe
Maciej Karpiński, Bistra Andreeva, Eva Liina Asu, Anna Daugavet, Štefan Beňuš, and Katalin Mády

16. Southern Europe
Mariapaola D’Imperio, Barbara Gili Fivela, Mary Baltazani, Brechtje Post, and Alexandra Vella

17. Iberia
Sónia Frota, Pilar Prieto, and Gorka Elordieta

18. Northwestern Europe
Tomas Riad and Jörg Peters

19. Intonation systems across varieties of English
Martine Grice, James Sneed German, and Paul Warren

20. The North Atlantic and the Arctic
Kristján Árnason, Anja Arnhold, Ailbhe Ní Chasaide, Nicole Dehé, Amelie Dorn, and Osahito Miyaoka

21. The Indian subcontinent
Aditi Lahiri and Holly J. Kennard

22. China and Siberia
Jie Zhang, San Duanmu, and Yiya Chen

23. Mainland South East Asia
Marc Brunelle, James Kirby, Alexis Michaud, and Justin Watkins

24. Asian Pacific Rim
Sun-Ah Jun and Haruo Kubozono

25. Austronesia
Nikolaus P. Himmelmann and Daniel Kaufman

26. Australia and New Guinea
Brett Baker, Mark Donohue, and Janet Fletcher

27. North America
Gabriela Caballero and Matthew K. Gordon

28. Mesoamerica
Christian DiCanio and Ryan Bennett

29. South America
Thiago Costa Chacon and Fernando O. de Carvalho

Part V: Prosody in Communication

30. Meanings of tones and tunes
Matthijs Westera, Daniel Goodhue, and Carlos Gussenhoven

31. Prosodic encoding of information structure: A typological perspective
Frank Kügler and Sasha Calhoun

32. Prosody in discourse and speaker state
Julia Hirschberg, Štefan Beňuš, Agustín Gravano, and Rivka Levitan

33. Visual prosody across cultures
Marc Swerts and Emiel Krahmer

34. Pathological prosody: Overview, assessment, and treatment
Diana Van Lancker Sidtis and Seung-yun Yang

Part VI: Prosody and Language Processing

35. Cortical and subcortical processing of linguistic pitch patterns
Joseph C. Y. Lau, Zilong Xie, Bharath Chandrasekaran, and Patrick C. M. Wong

36. Prosody and spoken-word recognition
James M. McQueen and Laura Dilley

37. The role of phrase-level prosody in speech production planning
Stefanie Shattuck-Hufnagel

Part VII: Prosody and Language Acquisition

38. The acquisition of word prosody
Paula Fikkert, Liquan Liu, and Mitsuhiko Ota

39. Development of phrase-level prosody from infancy to late childhood
Aoju Chen, Núria Esteve-Gibert, Pilar Prieto, and Melissa A. Redford

40. Prosodic bootstrapping
Judit Gervain, Anne Christophe, and Reiko Mazuka

41. Prosody in infant- and child-directed speech
Melanie Soderstrom and Heather Bortfeld

42. Prosody in children with atypical development
Rhea Paul, Elizabeth Schoen Simmons, and James Mahshie

43. Word prosody in second language acquisition
Allard Jongman and Annie Tremblay

44. Sentence prosody in a second language
Jürgen Trouvain and Bettina Braun

45. Prosody in second language teaching: Methodologies and effectiveness
Dorothy M. Chun and John M. Levis

Part VIII: Prosody in Technology and the Arts

46. Prosody in automatic speech processing
Anton Batliner and Bernd Möbius

47. Automatic prosody labelling and assessment
Andrew Rosenberg and Mark Hasegawa-Johnson

48. Stress, meter, and text-setting
Paul Kiparsky

49. Tone–melody matching in tone-language singing
D. Robert Ladd and James Kirby

References
Index of Languages
Subject Index
Acknowledgements
We thank Julia Steer for her invitation to produce a handbook on language prosody and for the confidence she appeared to have in its completion even during the protracted gestation period that preceded our first moves. We are deeply grateful to all first authors for their efforts to coordinate the production of their chapters and for the congenial way in which we were able to negotiate the composition of their author teams with them. We also thank Emilia Barakova, Caroline Féry, David House, Hui-Chuan (Jennifer) Huang, Michael Krauss, Malcolm Ross, Louis ten Bosch, Leo Wetzels, Tony Woodbury, as well as many of our authors for suggesting names of potential contributors or providing information.

We thank Karlijn Blommers, Rachida Ganga, Megan Mackaaij, and Laura Smorenburg for the many ways in which they helped us move the project forward, as well as Karen Morgan and Vicki Sunter of Oxford University Press and their outsourced service providers Premkumar Ap, Kim Allen, and Hazel Bird for their cooperation and guidance. Ron Wunderink deserves special thanks for the production of Map 1.1. We would like to acknowledge the financial support from Utrecht University for editorial assistance and from the Centre of Language Studies of Radboud University for Map 1.1.

Our work on this handbook was crucially facilitated by the following reviewers of chapters: Aviad Albert, Kai Alter, Mark Antoniou, Meghan Armstrong-Abrami, Amalia Arvaniti, Stefan Baumann, Štefan Beňuš, Antonis Botinis, Bettina Braun, Amanda Brown, Gene Buckley, Daniel Büring, Gabriela Caballero, Michael Cahill, Francesco Cangemi, Thiago Chacon, Chun-Mei Chen, Laura Colantoni, Elisabeth de Boer, Fernando O. de Carvalho, Maria del Mar Vanrell, Volker Dellwo, Anne-Marie DePape, Christian DiCanio, Laura Dilley, Laura Downing, Núria Esteve-Gibert, Susan D. Fischer, Sónia Frota, Riccardo Fusaroli, Carolina Gonzalez, Matthew K. Gordon, Wentao Gu, Mark Hasegawa-Johnson, Kara Hawthorne, Bruce Hayes, Nikolaus P. Himmelmann, David House, Larry M. Hyman, Pavel Iosad, Haike Jacobs, Barış Kabak, Sahyang Kim, John Kingston, James Kirby, Michael Krauss, Gjert Kristoffersen, Jelena Krivokapić, Haruo Kubozono, Frank Kügler, Anja Kuschmann, D. Robert Ladd, Angelos Lengeris, Pärtel Lippus, Zenghui Liu, Jim Matisoff, Hansjörg Mixdorff, Peggy Mok, Christine Mooshammer, Claire Nance, Marta Ortega-Llebaria, Cédric Patin, Roland Pfau, Cristel Portes, Pilar Prieto, Anne Psycha, Melissa Redford, Bert Remijsen, Tomas Riad, Toni Rietveld, Anna Sara Romøren, Malcolm Ross, Katrin Schweitzer, Stefanie Shattuck-Hufnagel, Stavros Skopeteas, Louis ten Bosch, Annie Tremblay, Jürgen Trouvain, Hubert Truckenbrodt, Frank van de Velde, Harry van der Hulst, Vincent J. van Heuven, László Varga, Irene Vogel, Lei Wang, Natasha Warner, Leo Wetzels, Chris Wilde, Patrick C. M. Wong, Tony Woodbury, Seung-yun Yang, Sabine Zerbian, Bao Zhiming, and Marzena Żygis.
List of Figures
2.1 Waveforms corresponding to vocal fold vibrations in electroglottography (examples by Phil Hoole at IPS Munich) for different voice qualities. High values indicate increasing vocal fold contact.
2.2 Volume of the thoracic and abdominal cavities in a Respitrace inductive plethysmograph during sentence production, inhalation and exhalation phase.
2.3 Lip aperture in electromagnetic articulography. High values indicate that lips are open during vowel production. Trajectories are longer, faster, and more displaced in target words in contrastive focus (lighter grey lines) compared to out of focus.
2.4 Tongue shapes in ultrasound.
2.5 Contact profiles in electropalatography for different stops and fricatives. Black squares indicate the contact of the tongue surface with the palate (upper rows = alveolar articulation, lower row = velar articulation).
3.1 Enumeration ‘Computer, Tastatur und Bildschirm’ spoken by a female German speaker in three prosodic phrases (see also Phonetik Köln 2020).
3.2 Schematic representation of the two key hypotheses of the Theory of Optimal Tonal Perception of House (1990, 1996): (a) shows the assumed time course of information density or cognitive workload across a CVC syllable and (b) shows the resulting pitch percepts for differently aligned f0 falls.
3.3 Utterance in Stockholm bei der ICPhS, Kiel Corpus of Spontaneous Speech, female speaker g105a000. Arrows indicate segmental intonation in terms of a change in the spectral energy distribution (0–8 kHz) of the final [s], 281 ms.
3.4 Perceived prosodic parameters and their interaction.
5.1 Number of languages with different fixed-stress locations according to StressTyp2 (Goedemans et al. 2015).
5.2 Median percentages of words with differing numbers of syllables in languages with a single stress per word and those with rhythmic secondary stress in Stanton (2016).
6.1 Spectrograms and f0 contours illustrating the same English tune as realized on a monosyllabic utterance (a) and a longer utterance (b).
6.2 Spectrograms and f0 contours of the utterance [koˈlibise i ˈðimitra] with focus on [koˈlibise] ‘swam’ (a) and on [ˈðimitra] (b), translated as ‘Did Dimitra SWIM?’ and ‘[Was it] DIMITRA who swam?’ respectively.
6.3 The English intonation grammar of Pierrehumbert (1980); after Dainora (2006).
8.1 The monosyllabic sign SEND in ISL. The dominant hand moves in a path from the chest outward, and the fingers simultaneously change position from closed to open. The two simultaneous movements constitute a complex syllable nucleus.
8.2 ISL complex sentence, ‘The cake that I baked is tasty’, glossed: [[CAKE IX]PP [I BAKE]PP]IP [[TASTY]PP]IP. ‘IX’ stands for an indexical pointing sign.
8.3 Linguistic facial expressions for three types of constituent in ISL. (a) Yes/no questions are characterized by raised brows and head forward and down; (b) wh-questions are characterized by furrowed brow and head forward; and (c) squint signals retrieval of information shared between signer and addressee. These linguistic face and head positions are strictly aligned temporally with the signing hands across each prosodic constituent.
8.4 Simultaneous compositionality of intonation in ISL: raised brows of yes/no questions and squint of shared information, e.g. ‘Did you rent the apartment we saw last week?’.
8.5 Overriding linguistic intonation with affective intonation: (a) yes/no question, ‘Did he eat a bug?!’ with affective facial expression conveying fear/revulsion, instead of the neutral linguistic yes/no facial expression shown in Figure 8.3a. (b) wh-question, ‘Who gave you that Mercedes Benz as a gift?!’ Here, affective facial expression conveying amazement overrides the neutral linguistic wh-questions shown in Figure 8.3b.
8.6 Intonational marking of topics in (a) ISL and (b) ASL.
8.7 Different phonetic realizations of the low accessibility marker, squint, in (a) ISL and (b) ASL.
8.8 Non-manual markers in Libras accompanying (a) information focus (raised brows and head tilted back) and (b) contrastive focus (raised and furrowed brows, and head tilted to the side).
8.9 ASL alternative question, glossed: PU FLAVOUR CHOCOLATE VANILLA OR LAYER OR PU, translated roughly as ‘What flavour do you want, chocolate, vanilla or layer?’.
9.1 Carry-over coarticulation in Mandarin Chinese. See text for an explanation of the abbreviations.
9.2 Anticipatory coarticulation in Mandarin Chinese appears dissimilatory. See text for an explanation of the abbreviations.
9.3 Segmental anchoring: a schematic depicting the relative stability of alignment of f0 movements (solid line) with respect to the segmental string (here CVC) and the accompanying variation in shape (i.e. slope and duration) of the f0 movement.
9.4 Schematic representation of compressing and truncating approaches to f0 realization under time pressure.
9.5 An f0 peak realized over the monosyllabic English word Anne at seven different levels of emphasis. Peaks vary considerably, while the final low is more or less invariant.
9.6 f0 contours for 11 English sentences read by speaker KS. A general downward trend is clearly observed (§9.4.3), but the distance between the peaks and the baseline is also progressively reduced, due to the topline falling more rapidly than the baseline. S = sentence.
9.7 Waveform, spectrogram, and f0 contour of a Cantonese sentence, 媽媽擔憂娃娃, maa1 maa1 daam1 jau1 waa1 waa1, ‘Mother worries about the baby’, composed entirely of syllables bearing high, level Tone 1. Gradually lowering f0 levels over the course of the utterance could be attributed to declination.
9.8 Waveform, spectrogram, and f0 contour of a Cantonese sentence, 山岩遮攔花環, saan1 ngaam4 ze1 laan4 faa1 waan4, ‘A mountain rock obstructs the flower wreath’, in which high Tone 1 alternates with the low falling Tone 4, creating a HLHLHL pattern reminiscent of the terracing downstep typically described in African languages.
9.9 The realization of the H+L* versus H* contrast in German by means of variation in f0 peak alignment (top) or f0 peak shape (bottom). The word-initial accented CV syllables of Laden ‘store’, Wiese ‘meadow’, Name ‘name’, and Maler ‘painter’ are framed in grey. Unlike for the ‘aligner’ (LBO), the f0-peak maxima of the ‘shaper’ are timed close to the accented-vowel onset for both H+L* and H*.
9.10 A declarative German sentence produced once as a statement (left) and once as a question (right). The shapes of the prenuclear pitch accent peaks are different. The alignment of the pitch accent peaks is roughly the same (and certainly within the same phonological category) in both utterances (statement and question).
9.11 A sharp peak, and a plateau, realized over the English phrase ‘there’s luminary’.
9.12 Schematic depiction of how various f0 contour shape patterns affect the location of the Tonal Center of Gravity (TCoG) (Barnes et al. 2012b) and the concomitant effect on perceived pitch event alignment. The shapes on the left should predispose listeners to judgements of later ‘peak’ timing, while the mirror images (right) suggest earlier timing. Shapes that bias perception in the same direction are mutually enhancing and hence predicted to co-occur more frequently in tonal implementation.
9.13 f0-peak shift continuum and the corresponding psychometric function of H* identifications. The lighter lines refer to a repetition of the experiment but with a flatter intensity increase across the CV boundary.
10.1 Initial stress perceived (%) as a function of intensity difference between V1 and V2 (in dB) and of duration ratio V1 ÷ V2 in minimal stress pairs (a) in English, after Fry (1955), and (b) in Dutch, after van Heuven and Sluijter (1996).
14.1 Multiple φ’s in all-new context and with canonical SOV order.
14.2 Pitch track of Ali biliyor Aynurun buraya gelmeden önce nereye gitmiş olabileceğini ‘Ali knows where Aynur might have gone to before coming here’, illustrating multiple morphosyntactic words as a single ω, with focus for the subject Ali (Özge and Bozşahin 2010: 148).
14.3 Pitch track showing the division into α’s of all-new [[mʊʊr]α [nɔxɔint]α [parʲəgtəw]ip]ι ‘A cat was caught by a dog’, where underlined bold symbols correspond to the second mora in an α. -LH marks the beginning of the ip (Karlsson 2014: 194).
14.4 Pitch track of [[pit]α [marɢaʃα]ip [[xirɮʲəŋ]α [ɢɔɮig]α]ip [tʰʊʊɮəŋ]ip]ι ‘We will cross the Kherlen river tomorrow’ (Karlsson 2014: 196). -LH marks the beginning of an ip.
14.5 Pitch track and speech waveform illustrating final ←Hfoc marking focus on all the preceding constituents. The utterance is [[[manai aaw pɔɮ]α]ip [[[saixəŋʦantai]α]ip [[ʊxaɮəg]α]ip]foc [xuŋ]ip]ι ‘My father is nice and wise’.
14.6 f0 contours of 13a (a) and 13b (b).
14.7 Pitch track and speech waveform of Manana dzalian lamaz meomars bans, ‘Manana is washing the very beautiful soldier’. Each word forms an α with a rising contour, [L* Hα].
14.8 Pitch track of The soldier’s aunt is washing Manana. The complex NP subject [meomris mamida] forms an ip, marked with a H- boundary tone that is higher than the preceding Hα.
14.9 Pitch track of No, GELA is hiding behind the ship, where the subject noun is narrowly focused and the verb, instead of being deaccented, has a H+L phrase accent. The focused word and the verb together form one prosodic unit.
16.1 Surface variants predicted by Jun and Fougeron’s model (from Michelas and D’Imperio 2012b).
16.2 Pitch tracks for non-contrastive topic (T), partial topic (PT), and contrastive focus (CF) renditions of the sentence Milena lo vuole amaro ‘Milena drinks it black’, in Neapolitan (NP–VP boundary = vertical line).
16.3 In the utterance Le schéma du trois-mâts de Thomas devenait vraiment brouillon ‘Thomas’s sketch of a square-rigger became a real scribble’, the continuous line represents the reference pitch level for the first phrase.
16.4 Schematization involving a nuclear L* Hφ followed by a postnuclear L+H- Hι combination.
16.5 Tonal copy of the final tone of the matrix sentence Parce qu’il n’avait plus d’argent ‘Because he didn’t have any money’ onto the right-dislocated constituent Mercier (after Ladd 1996: 141–142).
17.1 Frequencies of stress patterns (%) in Catalan, Spanish, and Portuguese.
17.2 Left utterance: Amúmen liburúa emon nau (grandmother-gen book-abs give aux ‘(S)he has given me the grandmother’s book’); right utterance: Lagunen diruá emon nau (friend-gen money-abs give aux ‘(S)he has given me the friend’s money’).
17.3 f0 contour of the Catalan utterance La boliviana de Badalona rememorava la noia (‘The Bolivian woman from Badalona remembered the girl’).
17.4 f0 contour of the Spanish utterance La niña de Lugo miraba la mermelada ‘The girl from Lugo watched the marmalade’.
17.5 f0 contour of the Portuguese utterance A nora da mãe falava do namorado (‘The daughter-in-law of (my) mother talked about the boyfriend’).
17.6 f0 contour of an utterance from Northern Bizkaian Basque: ((Mirénen)AP (lagúnen)AP (liburúa)AP)ip erun dot (Miren-gen friend-gen book-abs give aux ‘I have taken Miren’s friends’ book’).
17.7 f0 contour of an utterance from Northern Bizkaian Basque: ((Imanolen alabien diruá)AP)ip erun dot (Imanol-gen daughter-gen money-abs give aux ‘I have taken Imanol’s daughter’s money’).
17.8 f0 contour of the broad-focus statement Les nenes volen melmelada, produced by a Catalan speaker (top), and Las niñas quieren mermelada, produced by a Spanish speaker, ‘The girls want jam’ (bottom).
17.9 f0 contour of the narrow contrastive focus statement Les nenes volen MELMELADA (‘The girls want JAM’), produced by a Catalan speaker.
17.10 f0 contour of the narrow contrastive-focus statement MELMELADA, quieren (JAM (they) want, ‘(They) want JAM’), produced by a Spanish speaker.
17.11 f0 contour of the rising yes/no question Quieren mermelada?, produced by a Spanish speaker (top), and Que volen melmelada, produced by a Catalan speaker (bottom), ‘Do (they) want jam?’
17.12 f0 contour of the broad-focus statement As meninas querem marmelada (‘The girls want jam’), produced by a European Portuguese speaker.
17.13 f0 contour of the narrow contrastive-focus statement As meninas querem marmelada (‘The girls want jam’) with a narrow focus on ‘MARMELADA’ (top) and on ‘AS MENINAS’ (bottom), in European Portuguese.
17.14 f0 contour of the yes/no question As meninas querem marmelada? (‘(Do) the girls want jam?’), produced by a Standard European Portuguese speaker.
17.15 f0 contours of the statement Allagá da laguna (arrive aux friend-abs ‘The friend has arrived’) and the yes/no question Allagá da laguna? (‘Has the friend arrived?’), uttered by the same speaker of Northern Bizkaian Basque.
17.16 f0 contour of the yes/no question Garágardoa edán du? (beer-abs drink aux, ‘Did he drink the beer?’) in Standard Basque.
18.1 The lexical tone and the partial identity of the accents in Central Swedish.
18.2 Post-lexical Accent 2, compound, uppmärksamhetssplittring ‘attention split’.
18.3 Swedish intonation: initiality accent (dåliga), deaccenting with plateau (gamla), word accent (lagningar), deaccented auxiliary (måste), and nuclear accent (åtgärdas).
18.4 Comparison of Stockholm and Copenhagen pitch accents in three different conditions. The extracted word is ˈKamma (name), which gets Accent 2 in Swedish and no-stød in Danish.
18.5 Nuclear pitch contours without accent modifications attested for Dutch, High German, Low German, and West Frisian.
18.6 Cologne Accent 1 and Accent 2 in nuclear position interacting with two contours, H*LL% and L* H-L%.
19.1 Tonal representations and stylized f0 contours for three stress patterns in a declarative context.
19.2 Tonal representations and stylized f0 contours for three stress patterns in a polar interrogative context.
19.3 Waveform, spectrogram, and f0 track for a sentence of read speech in Singapore English.
19.4 Intonation patterns for yes/no questions in Fijian and Standard English.
19.5 Fall-rise uptalk contour, Australian English.
19.6 Late rise uptalk contour, New Zealand English.
21.1 Declarative intonation in Bengali.
21.2 Standard Colloquial Assamese ‘Ram went to Ramen’s house’ (Twaha 2017: 57).
21.3 Nalbariya Variety of Assamese ‘Today I scolded him’ (Twaha 2017: 79).
21.4 Bengali intonation, focus on Nɔren.
21.5 Bengali intonation, focus on Runir.
21.6 Four prosodic structures for Bengali [mɑmɑ-r ʃɑli-r bie] ‘Mother’s brother’s wife’s sister’s wedding’. The neutral, broad-focus declarative (a); the declarative with focus on [mɑmɑ-r] ‘It is Mother’s brother’s wife’s sister’s wedding’ (b); the neutral, broad-focus yes/no question (c); the yes/no question with focus on [mɑmɑ-r] ‘Is it Mother’s brother’s wife’s sister’s wedding?’ (d). Only in (c) can r-coronal assimilation go through, since there are focus-marking phonological phrase boundaries after [mɑmɑr] in (b) and (d), and an optional, tone-marked phonological phrase boundary in (a).
23.1 Waveforms, spectrograms, and pitch tracks of the Wa words tɛɁ ‘land’ (clear register, left) and tɛ̤Ɂ ‘wager’ (breathy register, right). The clear register is characterized by sharper, more clearly defined formants; the breathy register has relatively more energy at very low frequencies.
24.1 Waveform and f0 track of example (10) produced as [{(jʌŋanɨn)(imoɾaŋ)(imobuɾaŋ)}ip {(jʌŋhwagwane)(kandejo)}ip]IP by a Seoul Korean speaker.
24.2 Waveform and f0 track of example (10) produced as [{(jʌŋanɨn)(imoɾaŋ)(imobuɾaŋ)}ip {(jʌŋhwagwane kandejo)}ip]IP by a Chonnam Korean speaker.
25.1 The Austronesian family tree (Blust 1999; Ross 2008).
25.2 f0 and edge tones for (3).
25.3 f0 and edge tones for (4).
25.4 f0 and edge tones for (6).
25.5 f0 and edge tones for (7).
25.6 f0 and edge tones for (8).
25.7 f0 and tonal targets for (9).
25.8 f0 and tonal targets for (10).
26.1 f0 contour for a wh-question in Mawng: ŋanti calŋalaŋaka werk ‘Who is the one that she sent first?’ with high initial peak on the question word: ŋanti ‘who’.
28.1 Tones in utterance non-final and utterance-final position in Ixcatec. The figures show f0 trajectories for high, mid, and low tones, averaged across four speakers.
31.1 Typical realizations of (1) and (4), showing how focus position affects prosodic realization. A schematic pitch realization is given, along with the prosodic phrasing, intonational tune, and text, where capitals indicate the pitch accented syllable. See text for further details.
31.2 Lebanese Arabic (a) and Egyptian Arabic (b) realization of narrow focus on the initial subject, from Chahal and Hellmuth (2014b). As can be seen, post-focal words are deaccented in Lebanese Arabic but not Egyptian Arabic.
31.3 Broad focus (a) and contrastive focus (b) in Sardinian.
31.4 Time-normalized pitch tracks in different focus conditions in Hindi, based on five measuring points per constituent, showing the mean across 20 speakers. SOV (a) and OSV word order (b). The comparisons of interest are subject focus (dotted line) and object focus (dashed line) with respect to broad focus (solid line).
33.1 Visual cues reflecting a positive (a) and a negative (b) way to produce the utterance ‘My boyfriend will spend the whole summer in Spain’.
34.1 (a) Schematized f0 curve of He’s going downtown today, produced by a healthy male speaker. (b) Schematized f0 curve of He’s going downtown today, produced by a patient with right hemisphere damage diagnosed with dysprosody, before treatment. The intonation contour rises at the end, yielding an unnatural prosody. (c) Schematized f0 curve of He’s going downtown today, produced by a dysprosodic patient with right hemisphere damage, after treatment. The intonation contour approaches the healthy speaker’s profile.
37.1 Two versions of the Nijmegen approach to modelling speech production planning. (a) shows the 1989 version of planning for connected speech, while (b) shows the 1999 version of planning for single-word (or single-PWd) utterances.
41.1 An example of mean f0 and f0 variability in CDS compared with ADS across six languages for both fathers (Fa) and mothers (Mo).
45.1 Visualization techniques for intonation contours. (a) depicts drawn, stylized intonation contours (e.g. also employed by Celce-Murcia et al. 1996/2010). (b) portrays a smoother, continuous contour (e.g. used in Gilbert 1984/2012). (c) shows a system consisting of dots representing the relative pitch heights of the syllables; the size of the dots indicates the salience level of the syllables; tonal movements are indicated by curled lines starting at stressed syllables (O’Connor and Arnold 1973). (d) represents pitch movement by placing the actual text at different vertical points (e.g. Bolinger 1986). (e) illustrates a notational system that uses arrows to indicate the direction of pitch movement and diacritics and capitalization to mark stress (similar to Bradford 1988, who used a combination of (c) and (d)). (f) represents a modification of the American transcription system ToBI, based on the autosegmental approach (e.g. Toivanen 2005; Estebas-Vilaplana 2013). The focal stress of the sentence is marked by the largest dot above the stressed syllable (c), capitalization of the stressed syllable and the diacritic ´ (e), and the L+H* notation marking the pitch accent (f).
45.2 The rhythms of some other languages (top) and English (bottom).
45.3 Representing pitch movement in a pronunciation course book.
45.4 Teaching word stress using rubber bands.
45.5 Waveforms and pitch curves of jìn lái ‘come in’ produced by a female native speaker (left) and a female student (right).
45.6 Different renditions of the question ‘You know why, don’t you?’.
45.7 A screenshot of the Streaming Speech software.
46.1 Effect of power features on performance: a few power features contribute strongly to performance (continuous line), whereas often there is no clear indication of which features contribute most (dashed line).
List of Tables
2.1 Advantages and disadvantages of articulatory measuring techniques 26
4.1 Tonal contrasts in Iau 45
4.2 Tonal contrasts in Vietnamese 46
4.3 Different contour simplifications of L-HL-H 55
4.4 H-tone stem patterns in Kikuria 58
4.5 Detransitivizing LH replacive tone in Kalabari 58
4.6 Possessive determiners in Kunama 59
4.7 Noni sg~pl alternations in noun class 9/10 59
4.8 Day completive/incompletive aspect alternations 59
4.9 Iau verb tones 60
4.10 Inflected subject markers in Gban 60
4.11 Tonal distributions in Itunyoso Trique 64
8.1 Non-manual marking used in different contexts in ASL and Libras 120
12.1 Grammatical tone in a language without a tone contrast in the verb stem (Luganda) and its absence in a language with such a tone contrast (Lulamogi) 187
12.2 Stem-initial prominence marked by distributional asymmetries 190
13.1 Stress assignment in different Arabic dialects 199
13.2 Pausal alternations observed in Classical Arabic (McCarthy 2012) 200
15.1 Available descriptions based on the autosegmental-metrical framework 230
16.1 Information-seeking yes/no questions: nuclear patterns in 16 Italian varieties (left table) and their stylization (right schemes); motifs indicate possible groupings on the basis of nuclear tunes; varieties are represented by abbreviations. For details see Gili Fivela and Nicora (2018), adapted and updated from Gili Fivela et al. (2015a) 242
16.2 Combinations of phrase accent and boundary tone and their pragmatic functions in Athenian Greek (from Arvaniti and Baltazani 2005) 244
17.1 Inventory of nuclear accents 261
17.2 Inventory of IP boundary tones 262
17.3 Nuclear configurations 262
18.1 Accent contrast and prominence levels in Central Swedish (Bruce 1977, 2007; Myrberg 2010; Myrberg and Riad 2015, 2016). The lexical tone is bolded and intonation tones are plain. The tone-bearing unit (TBU) is a stressed syllable; * indicates association to a TBU 273
22.1 Tonal inventories in three dialects of Chinese 333
28.1 Tonal complexity by Oto-Manguean language family 409
28.2 Ixpantepec Nieves Mixtec (Carroll 2015) 410
28.3 San Juan Quiahije Chatino tone sandhi (Cruz 2011) 411
28.4 Yoloxóchitl Mixtec tonal morphology (Palancar et al. 2016) 411
28.5 Stress pattern by Oto-Manguean language family 412
28.6 Controlled and ballistic syllables (marked with /ˊ/) in Lalana Chinantec (Mugele 1982: 9) 413
28.7 The distribution of Itunyoso Triqui tones in relation to glottal consonants 414
28.8 Permitted rime types and length contrasts by Oto-Manguean family 415
28.9 Syllable structure in Ayutla Mixe 424
28.10 Segment-based quantity-sensitive stress in Misantla Totonac nouns (Mackay 1999) 425
28.11 Lexical stress in Filomena Mata Totonac (McFarland 2009) 426
29.1 Summary of languages with stress and/or tone systems 428
29.2 Position of primary stress relative to word edges 430
29.3 Types and proportion of quantity-sensitive systems 432
30.1 Information-structural meanings of pitch accents (Steedman 2014) 447
34.1 Theoretical overview of hemispheric lateralization for speech prosody 488
37.1 Distribution of glottalized word-onset vowels in a sample of FM radio news speech, showing the preference for glottalization at the onset of a new intonational phrase and at the beginning of a pitch-accented word, as well as individual speaker variation. Stress level is indicated with +/−F for full versus reduced vowel, and +/−A for accented versus unaccented syllable 523
42.1 Recording form for judging prosodic production in spontaneous speech 592
46.1 Prototypical approaches in research on prosody in automatic speech processing over the past 40 years (1980–2020), with the year 2000 as a turning point from traditional topics to a new focus on paralinguistics 636
46.2 Phenomena and performance: a rough overview (qualitative performance terms appear in italics) 638
49.1 Similar, contrary, and oblique settings, as defined by the relation between the pitch direction in a sequence of two tones and the two corresponding musical notes 680
49.2 Expected frequencies of similar, oblique, and contrary settings 680
49.3 The six Cantonese tones classified in terms of overall level, for the purposes of defining pitch direction in a sequence of two tones 681
49.4 Frequencies of similar (bold), oblique (underlined), and contrary (italic) settings in a 2,500-bigram corpus from Cantonese pop songs, from Lo (2013) 682
49.5 The six Vietnamese tones classified in terms of overall level, for purposes of defining pitch direction in a sequence of two tones 682
49.6 Frequencies of similar (bold), oblique (underlined), and contrary (italic) settings in a corpus from Vietnamese ‘new music’ 683
49.7 Frequencies of similar (bold), oblique (underlined), and contrary (italic) settings in a bigram corpus from 30 Thai pop songs 684
49.8 Frequencies of similar (bold), oblique (underlined), and contrary (italic) settings in a 355-bigram pilot corpus from three Dinka songs 685
List of Maps
1.1 Areal groupings of languages explored in Part IV (see plate section)
12.1 Number of contrastive tone heights (see plate section)
12.2 Types of tonal contours (see plate section)
12.3 Types of downstepped tones (see plate section)
13.1 Geographical location of languages treated in this chapter, with indications of the presence of stress and of tone contrasts (1 = binary contrast; 2 = ternary contrast; 3+ = more complex system), produced with the ggplot2 R package (Wickham 2009) 196
24.1 Japanese and South Korean dialect areas 357
24.2 Six dialect areas of Korean spoken in South Korea 363
26.1 Locator map for tonal elaboration in New Guinea. Grey circle: language with no tone; dark grey square: language with a two-way contrast in tone; black circle: language with three or more tonal contrasts 392
List of Abbreviations
1pl.o 1st person exclusive object
A1 primary auditory cortex
a1pl 1st person plural absolutive
a1sg 1st person singular absolutive
AAE African American English
abil abilitative
abs absolutive
acc accusative
adj adjective
ADS adult-directed speech
aka also known as
AM acoustic model
AM autosegmental-metrical (theory of intonational phonology)
and andative
AP accentual phrase
AP Articulatory Phonology
appl applicative
ASC autism spectrum condition
ASD autism spectrum disorder
ASL American Sign Language
ASP automatic speech processing
ASR automatic speech recognition/recognizer
ATR advanced tongue root
AU action unit
AusE Australian English
AuToBI Automatic ToBI-labelling tool
aux auxiliary
av actor voice
AVEC Audio/Visual Emotion Challenge
BCMS Bosnian-Croatian-Montenegrin-Serbian
BlSAfE Black South African English
BOLD blood-oxygen-level-dependent
BPR Bayesian Prosody Recognizer
BURSC Boston University Radio Speech Corpus
C consonant
Ca Catalan
CAH Contrastive Analysis Hypothesis
CAPT computer-assisted pronunciation training
caus causative
CAY Central Alaskan Yupik
CC corpus callosum
CDS infant- and child-directed speech
CEV Contact English Variety
CF conversational filler
CI cochlear implant
cl classifier
clf cleft
CNG Continental North Germanic
com comitative
com.a3sg completive aspect, 3rd person singular absolutive
compl completive
conj conjunction
CP categorical perception
CP computational paralinguistics
CP cerebral palsy
C-Prom annotated corpus for French prominence studies
CQ closed quotient, aka contact quotient (glottis)
cs centisecond
CS current speaker
CWCI children with cochlear implants
CWG Continental West Germanic
CWTH children with typical hearing
DANVA Diagnostic Analysis of Nonverbal Behavior
dat dative
dB decibel
def definite
det determiner
DIRNDL Discourse Information Radio News Database for Linguistic Analysis
DLD developmental language disorder
dur durative
e1pl 1st person plural ergative
e3sg 3rd person ergative
EEG electroencephalography, electroencephalogram
EGG electroglottography, electroglottograph
EMA electromagnetic articulography, articulograph
emph emphatic
encl enclitic
EPG electropalatography
ERB equivalent rectangular bandwidth (unit)
ERP event-related potential
erg ergative
excl exclusive (1pl)
ez Ezafe marker
F falling tone
f0 fundamental frequency
F1 Formant 1
F1 harmonic mean of precision and recall or F-measure
F2 Formant 2
FFR frequency-following response
FLH Functional Load Hypothesis
fMRI functional MRI
fNIRS functional near-infrared spectroscopy
foc focus
FOK feeling-of-knowing
FP filled pause
Ft foot
fut future
G glide
G_ToBI German ToBI
gen genitive
GhanE Ghanaian English
GR_ToBI Greek ToBI
GT grammatical tone
H High tone
H heavy (syllable type)
HKE Hong Kong English
HL hearing loss
HTS High tone spreading
Hz hertz
i (subscript) intonational phrase
IE Indian English
IFG inferior frontal gyrus
IK intonational construction
incl inclusive (1pl)
inst instrumental
intr intransitive
INTSINT International Transcription System for Intonation
IP intonational phrase
ip intermediate phrase
IPB intonational phrase boundary
IPO Institute for Perception Research
IPP irregular pitch period
irr irrealis
IS information structure
ISL Israeli Sign Language
ITL Iambic-Trochaic Law
IViE Intonational Variation in English
JND just noticeable difference
K_ToBI Korean ToBI
kHz 1,000 Hz
L Low tone
L light (syllable type)
L1 first or native language
L2 second language
lat lative
LeaP corpus Learning Prosody in a Foreign Language corpus
LH left hemisphere
Libras Brazilian Sign Language
LIS Italian Sign Language
LLD low-level descriptor
LM language model
Ln natural logarithm
loc locative
LSF Langue des signes française (French Sign Language)
LSVT Lee Silverman Voice Treatment
LTAS long-term average spectrum
LTS Low tone spreading
M mid (tone)
m medium (stress type)
MAE Mainstream American English
MAE_ToBI Mainstream American English ToBI
MaltE Maltese English
MDS multi-dimensional scaling
MEG magnetoencephalography, magnetoencephalogram
MEV Mainstream English Variety
MFCC mel frequency cepstral coefficient
mira mirative
MIT Melodic Intonation Therapy
ML machine learning
MMN mismatch negativity
MOMEL Modelling Melody
MRI magnetic resonance imaging
ms millisecond
MSEA Mainland South East Asia
MTG middle temporal gyrus
Mword morphological word
N noun
NBB Northern Bizkaian Basque
NC Niger-Congo
neg negation
NGT Nederlandse Gebarentaal (Sign Language of the Netherlands)
NigE Nigerian English
NIRS near-infrared spectroscopy
nom nominative
NP noun phrase
NPN non-Pama-Nyungan
nPVI normalized pairwise variability index
NRU narrow rhythm unit
NVA Nalbariya Variety of Assamese
NZE New Zealand English
Ø distinctive absence of tone
obj object
obliq oblique
OCP Obligatory Contour Principle
OQ open quotient (glottis)
OSV object-subject-verb
OT Optimality Theory
p (subscript) phonological phrase
PA pitch accent
PAM Perceptual Assimilation Model
PENTA Parallel Encoding and Target Approximation
PEPS-C Profiling Elements of Prosody in Speech-Communication
perf perfect(ive)
PET positron emission tomography
PFC post-focus compression
pfx prefix
pl plural
PLVT Pitch Limiting Voice Treatment
png person-number-gender
POS parts-of-speech
poss possessive
pot potential
PP phonological phrase
pr present
prf perfective
prog progressive
ProP Prosody Profile
prs present
prtc participle
Ps subglottal pressure
PT Portuguese
pv patient voice
PVI pairwise variability index
PVSP Prosody-Voice Screening Protocol
Pword phonological word
Q1 short (quantity)
Q2 long (quantity)
Q3 overlong (quantity)
qm question marker
Qpart question particle
QUD question under discussion
quot quotative
R rising tone
RaP Rhythm and Pitch
red reduplicant
refl reflexive
RF random forest
RFR rise-fall-rise
RH right hemisphere
RIP Respitrace inductive plethysmograph
rls realis
RPT Rapid Prosodic Transcription
s second
S strong (stress type)
SA South America(n)
SAH Segmental Anchoring Hypothesis
SAOV subject-adverbial-object-verb
SCA Standard Colloquial Assamese
sg singular
SgE Singapore English
SJQC San Juan Quiahije Chatino
SLM Speech Learning Model
SLUSS simultaneous laryngoscopy and laryngeal ultrasound
SOV subject-object-verb
SP Spanish
sp species
SQ skewness quotient (glottis)
sqrt square root
ss status suffix
SSBE Southern Standard British English
ST semitone
StB Standard Basque
STG superior temporal gyrus
SVM support vector machine
syll/s syllables per second
TAM tense-aspect-mood
TBU tone-bearing unit
TCoG Tonal Center of Gravity
TD typically developing
tns tense
ToBI Tones and Break Indices
ToDI Transcription of Dutch Intonation
top topic marker
tr transitive
TRP transition-relevant place
tv transitive verb
U utterance (prosodic constituent)
UAR unweighted average recognition
UNB Urban Northern British
V vowel
V̂ vowel with falling tone
V̌ vowel with rising tone
V́ high toned vowel
V̀ low toned vowel
V̋ super-high toned vowel
VarcoC variation coefficient for consonantal intervals
VarcoV variation coefficient for vocalic intervals
veg vegetable noun class
vn verbal noun
VOS verb-object-subject
VOT voice onset time
VP verb phrase
VSAO verb-subject-adverbial-object
VSO verb-subject-object
W weak (stress type)
WA word accent
WALS World Atlas of Language Structures
WAR weighted average recognition
WER word error rate
wpm words per minute
WSAfE White South African English
XT/3C Extrinsic-Timing-Based Three-Component model
YM Yoloxóchitl Mixtec
α accentual phrase
ι intonational phrase
μ mora
σ syllable
υ utterance (prosodic constituent)
φ phonological phrase
ω phonological word, prosodic word
About The Contributors
Bistra Andreeva is Professor of Phonetics and Phonology in the Department of Language Science and Technology, Saarland University. Her major research interests include the phonetics and phonology of intonation and rhythm, cross-language and individual differences in the production and perception of syllabic prominence in various languages, the relation between intonation and information structure, and the interaction between information density and prosodic structure.

Kristján Árnason is Professor Emeritus of Icelandic Language and Linguistics at the University of Iceland. He obtained his PhD at the University of Edinburgh in 1977. Among his publications in English are Quantity in Historical Phonology: Icelandic and Related Cases (Cambridge University Press, 1980/2009), The Rhythms of Dróttkvætt and Other Old Icelandic Metres (Institute of Linguistics, University of Iceland, 1991/2000), and The Phonology of Icelandic and Faroese (Oxford University Press, 2011). Particular interests within phonology include intonation and prosody, the interface between morphosyntax and phonology, and Old Icelandic metrical rhythm and poetics. He has organized conferences and participated in research projects on phonological variation, metrics, and sociolinguistics.

Anja Arnhold is an Assistant Professor in the Department of Linguistics at the University of Alberta. She is an experimental phonologist who specializes in the prosodic marking of information structure and has worked on various languages, including Finnish, Greenlandic (Kalaallisut), Inuktitut, Mandarin, and Yakut (Sakha). Anja earned her MA from the University of Potsdam in 2007 and her PhD from Goethe University Frankfurt in 2013, moving on to positions as a postdoctoral fellow and contract instructor at the University of Alberta and as a postdoctoral researcher at the University of Konstanz.

Amalia Arvaniti is Professor of English Linguistics at Radboud University. She has worked extensively on prosody, particularly on rhythm and intonation; her research focuses on Greek, English, Romani, and Korean. She has previously held appointments at the University of Kent (2012–2020), the University of California, San Diego (2002–2012), and the University of Cyprus (1995–2001), and temporary appointments at Cambridge, Oxford, and the University of Edinburgh. She is currently the president of the Permanent Council for the Organisation of the International Congress of Phonetic Sciences and the vice-president of the International Phonetic Association.

Eva Liina Asu is an Associate Professor of Swedish and Phonetics at the University of Tartu. She obtained her PhD in Linguistics at the University of Cambridge in 2004. Her research focuses on various prosodic aspects of Estonian, including the phonetics and phonology of intonation, rhythm, stress, and quantity. She is also interested in prosodic and segmental features of Estonian Swedish in comparison with other varieties of Swedish.

Brett Baker is an Associate Professor in Linguistics at the School of Languages and Linguistics, University of Melbourne. His primary research areas are phonology and morphology, with a focus on Australian Indigenous languages including Kriol. He has worked on a number of languages of southeastern Arnhem Land, especially Ngalakgan and Nunggubuyu/Wubuy, through primary fieldwork since the mid-1990s. His current work takes an experimental approach to investigating the extent to which speakers of Wubuy have knowledge of the internal structure of polysynthetic words.

Mary Baltazani is a researcher at the Phonetics Laboratory, University of Oxford. Her research focuses on phonetics, phonology, and their interface, with special interests in intonation and pragmatics, Greek dialects, dialectology, and sociophonetics. She is currently investigating the diachronic development of intonation as it has been shaped by the historical contact of Greek with Italian and Turkish, in a project supported by the Economic and Social Research Council, UK.

Jonathan Barnes is an Associate Professor in the Boston University Department of Linguistics. He received his PhD from the University of California, Berkeley, in 2002, and specializes in the interface between phonetics and phonology, most particularly as this concerns the structures of tone and intonation systems. Much of his recent work involves dynamic interactions in perception between ostensibly distinct aspects of the acoustic signal, and the consequences of these interactions for our understanding of the content of phonological representations.

Anton Batliner is a Senior Research Fellow affiliated with the chair of Embedded Intelligence for Health Care and Wellbeing at the University of Augsburg. He obtained his PhD at LMU Munich in 1978. He has published widely on prosody and paralinguistics and co-authored Computational Paralinguistics (Wiley, 2014, with Björn Schuller), besides being an active editor and conference organizer. His earlier affiliations were with the Pattern Recognition Lab at the University of Erlangen-Nuremberg and the institutes for Nordic Languages and German Philology (both LMU Munich).

Ryan Bennett is an Associate Professor in the Department of Linguistics at the University of California, Santa Cruz. His primary research area is phonology, with a particular emphasis on prosody and the interfaces between phonology and other grammatical domains (phonetics, morphology, and syntax). His current research focuses on the phonetics and phonology of K’ichean-branch Mayan languages, particularly Kaqchikel and Uspanteko. This work involves ongoing, original fieldwork in Guatemala and draws on data from elicitation, experimentation, and corpora. He also has expertise in Celtic languages, especially Irish.

Štefan Beňuš is an Associate Professor in the Department of English and American Studies at Constantine the Philosopher University and a Senior Researcher in Speech Sciences at the Institute of Informatics of the Slovak Academy of Sciences in Bratislava. He holds a PhD in linguistics from New York University and postdoctoral qualifications from Columbia University and LMU Munich. His research centres on the relationship between (i) speech prosody and the pragmatic/discourse aspect of the message and (ii) phonetics and phonology, with a special interest in the articulatory characteristics of speech. He previously served as an associate editor of Laboratory Phonology and regularly presents at major conferences, such as Speech Prosody and Interspeech.

Heather Bortfeld is a Professor of Psychological Sciences at the University of California, Merced (UC Merced). She completed her PhD in experimental psychology at the State University of New York, Stony Brook, in 1998. Her postdoctoral training in cognitive and linguistic sciences at Brown University was supported by the National Institutes of Health. She was on the psychology faculty at Texas A&M University and the University of Connecticut prior to arriving at UC Merced in 2015. Her research focuses on how typically developing infants come to recognize words in fluent speech and the extent to which the perceptual abilities underlying this learning process are specific to language. She has more recently extended this focus to the influence of perceptual, cognitive, and social factors on language development in paediatric cochlear implant users.

Bettina Braun is Professor of General Linguistics and Phonetics at the University of Konstanz. Her research focuses on the question of how listeners process and interpret the continuous speech stream, with a special emphasis on speech prosody. Further research interests include first and second language acquisition of prosody, and the interaction between prosody and other aspects of language (word order, particles).

Marc Brunelle joined the Department of Linguistics at the University of Ottawa in 2006; he is now an Associate Professor there. He obtained his PhD at Cornell University in 2005. His research interests include phonology and phonetics, tone and phonation, prosody, language contact, South East Asian linguistics, and the linguistic history of South East Asia. His work focuses on Chamic languages and Vietnamese.

Gabriela Caballero is an Associate Professor in the Department of Linguistics at the University of California, San Diego. She received her BS from the University of Sonora in 2002 and her PhD from the University of California, Berkeley, in 2008. Her research focuses on the description and documentation of indigenous languages of the Americas (especially Uto-Aztecan languages), phonology, morphology, and their interaction. Her research interests have recently extended to the psycholinguistic investigation of phonological and morphological processing, in order to better understand patterns of morphological and phonological variation in morphologically complex languages and prosodic typology.

Sasha Calhoun is a Senior Lecturer in Linguistics at Victoria University of Wellington. Her research focuses on the functions of prosody and intonation, in particular information structure. Her PhD thesis, completed at the University of Edinburgh, looked at how prosody signals information structure in English from a probabilistic perspective. More recently, she has extended this work to look at how information structure, prosody, and syntax interact in other languages, including Samoan, te reo Māori, and Spanish. She has also been involved in work looking at intonation from an exemplar perspective in English and German.

Thiago Costa Chacon is an Assistant Professor of Linguistics at the University of Brasilia and a research associate at the Max Planck Institute for the Science of Human History. He received his PhD from the University of Hawaiʻi at Manoa in 2007 and has specialized in the native languages of the Amazon, with particular focus on phonology, typology, historical linguistics, and language documentation and conservation. He has done fieldwork in several languages, including Kubeo, Tukano, Desano, Wanano (Tukanoan family), Ninam (Yanomami family), and Arutani (a linguistic isolate).

Bharath Chandrasekaran is a Professor and Vice Chair for Research in Communication Science and Disorders at the University of Pittsburgh. His research uses a systems neuroscience approach to study the computations, maturational constraints, and plasticity underlying speech perception.

Aoju Chen is Professor of Language Development in Relation to Socialisation and Identity at Utrecht University. She has worked extensively on the production, perception, and processing of prosodic meaning and the acquisition of prosody in first and second languages from a cross-linguistic perspective. More recently, she has extended her work to research on the social impact of developing language abilities in a first or second language, with a focus on speech entrainment and the development of belonging. She is currently an associate editor of Laboratory Phonology and an elected board member of the ISCA Special Interest Group on Speech Prosody (SProSIG).

Yiya Chen is Professor of Phonetics at the Leiden University Centre for Linguistics and a Senior Researcher at the Leiden Institute for Brain and Cognition. Her research focuses on prosody and prosodic variation, with particular attention to tonal languages. The general goal of her research is to understand the cognitive mechanisms and linguistic structures that underlie communication and language. She obtained her PhD from Stony Brook University in 2003. She has worked as a postdoctoral researcher at the University of Edinburgh and Radboud University in Nijmegen. She currently serves on the editorial boards of the Journal of Phonetics and the Journal of the International Phonetic Association.

Taehong Cho is Professor of Phonetics in the Department of English Language and Literature and the Director of the Institute for Phonetics and Cognitive Sciences of Language at Hanyang University. He earned his PhD degree in phonetics at the University of California, Los Angeles, and subsequently worked at the Max Planck Institute for Psycholinguistics in Nijmegen. His main research interest is in the interplay between prosody, phonology, and phonetics in speech production and its perceptual effects in speech comprehension. He is currently serving as editor in chief of the Journal of Phonetics and is book series editor for Studies in Laboratory Phonology (Language Science Press).
Anne Christophe is a Centre National de la Recherche Scientifique (CNRS) researcher and the Director of the Laboratoire de Sciences Cognitives et Psycholinguistique at the École Normale Supérieure in Paris (part of PSL University). She received her PhD in cognitive psychology from the École des Hautes Études en Sciences Sociales (Paris) in 1993 and worked as a postdoctoral researcher at University College London prior to her CNRS research position. Her work focuses on early language acquisition and the role of phrasal prosody and function words in promoting early lexical and syntactic acquisition.

Dorothy M. Chun is Professor of Applied Linguistics and Education at the University of California, Santa Barbara (UCSB). Her research areas include second language phonology and intonation, second language reading and vocabulary acquisition, computer-assisted language learning, and telecollaboration for intercultural learning. She is the author of Discourse Intonation in L2: From Theory and Research to Practice (John Benjamins, 2002). She has been the editor in chief of the online journal Language Learning & Technology since 2000 and is the founding director of the PhD Emphasis in Applied Linguistics at UCSB.

Emily Clem is an Assistant Professor of Linguistics at the University of California, San Diego. She obtained her PhD at the University of California, Berkeley, in 2019. Her research focuses primarily on syntax and its interfaces and draws on data from her fieldwork on Amahuaca (a Panoan language of Peru) and Tswefap (a Grassfields Bantu language of Cameroon). Her work also examines the large-scale areal distribution of linguistic features, such as tone, using computational tools to illuminate the influence of inheritance and contact on distributional patterns.

Svetlana Dachkovsky is a researcher at the Sign Language Research Lab, University of Haifa, and a lecturer at Gordon Academic College. She obtained her PhD at the University of Haifa in 2018. Her research addresses topics in linguistic and non-linguistic aspects of prosody, information structure, and change in sign language grammar, as well as multimodal communication in signed and spoken modalities. Her work focuses on the grammaticalization of non-linguistic signals into linguistic intonation in sign language, and on the role of information structure in this process.

Anna Daugavet is an Assistant Professor at the Department of General Linguistics of Saint Petersburg State University, where she teaches courses on Lithuanian and Latvian dialectology and areal features of the Baltic languages. She completed her PhD in linguistics at Saint Petersburg University in 2009. Her research interests include syllable weight, tone, and stress.

Fernando O. de Carvalho is Assistant Professor of Linguistics at the Federal University of Amapá. He has been a visiting researcher at the Max Planck Institute for Evolutionary Anthropology in Leipzig and the Laboratoire Dynamique du Langage in Lyon. His primary research area is the historical linguistics of the indigenous languages of South America, in particular of the Arawak, Jê, and Tupi-Guarani language families. He has done fieldwork with a number of lowland South American languages, including Mebengokre (Jê), Kalapalo (Cariban), Wayuunaiki (Arawak), and Terena (Arawak).
Nicole Dehé has held postdoctoral research and teaching positions at University College London, the University of Leipzig, the University of Braunschweig, the Humboldt University of Berlin, and the Freie Universität Berlin. She obtained her PhD at the University of Leipzig in 2001. In 2010, she joined the Department of Linguistics at the University of Konstanz as Full Professor of Linguistics. She is an Adjunct Professor at the Faculty of Icelandic and Comparative Cultural Studies, University of Iceland. Her research focuses on prosody, intonation, syntax, and the syntax–prosody and prosody–pragmatics interfaces. She mainly works on Icelandic, English, and German.

Christian DiCanio is Assistant Professor at the University at Buffalo and a senior research scientist at Haskins Laboratories. He obtained his PhD at the University of California, Berkeley, in 2008. As a phonetician and a fieldworker, he focuses primarily on the phonetics, phonology, and morphology of tone in Oto-Manguean languages. He has documented the San Martín Itunyoso Triqui language, and applied corpus and laboratory methods to the analysis of various endangered languages of the Americas.

Laura Dilley is Associate Professor in the Department of Communicative Sciences and Disorders at Michigan State University. She received her BS in brain and cognitive sciences with a minor in linguistics in 1997 from MIT and obtained her PhD in the Harvard–MIT Program in Speech and Hearing Biosciences and Technology in 2005. She is the author of over 60 publications on prosody, word recognition, and other topics.

Mariapaola D’Imperio is currently Distinguished Professor in the Department of Linguistics and the Cognitive Science Center at Rutgers University and Head of the Speech and Prosody Lab. She obtained a PhD in linguistics from Ohio State University in 2000 and joined the Centre National de la Recherche Scientifique in 2001. She later became Professor of Phonetics, Phonology and Prosody in the Department of Linguistics at Aix-Marseille University, where she was Head of the Prosody Group at the Laboratoire Parole et Langage in Aix-en-Provence, France. She is currently associate editor of the Journal of Phonetics and president of the Association for Laboratory Phonology. Her research interests range from the intonational phonology of Romance languages to prosody production, perception, and processing.

Mark Donohue has worked linguistically in New Guinea since 1991, researching languages from both sides of the international border. He has published extensively on the prosodic systems of the languages of a number of different families in the region and beyond, as well as on grammatical description and historical linguistics. He works with the Living Tongues Institute for Endangered Languages.

Amelie Dorn obtained her PhD in linguistics from Trinity College Dublin, where she carried out research on the prosody and intonation of Donegal Irish, a northern variety of Irish. In 2015, she joined the Austrian Centre for Digital Humanities and Cultural Heritage of the Austrian Academy of Sciences in Vienna. She is also a postdoctoral researcher and lecturer in the Department of German Studies at the University of Vienna.
San Duanmu is Professor of Linguistics at the University of Michigan. He received his PhD in linguistics from MIT in 1990 and has held teaching posts at Fudan University (1981–1986) and the University of Michigan, Ann Arbor (1991–present). His research focuses on general properties of language, especially those in phonology. He is the author of The Phonology of Standard Chinese (2nd edition, Oxford University Press, 2007), Syllable Structure: The Limits of Variation (Oxford University Press, 2008), Foot and Stress (Beijing Language & Culture University Press, 2016), and A Theory of Phonological Features (Oxford University Press, 2016).

Gorka Elordieta is Professor of Linguistics in the Department of Linguistics and Basque Studies at the University of the Basque Country, Spain. He obtained his PhD at the University of Southern California in 1997. His main interests are the syntax–phonology interface (the derivation of prosodic structure from syntactic structure and the effect of prosodic markedness constraints on prosodic phrasing), intonation in general, the prosodic realization of information structure, and intonational issues in language and dialect contact situations.

Núria Esteve-Gibert is an Associate Professor in the Department of Psychology and Educational Sciences at the Universitat Oberta de Catalunya. She is mainly interested in first and second language development, and in particular in the interfaces between prosody, body gestures, and pragmatics. Her research has shown that prosodic cues in speech align with body movements, and that language learners use multimodal strategies to acquire language and communication.

Paula Fikkert is Professor of Linguistics specializing in child language acquisition at Radboud University in Nijmegen. She obtained her PhD from Leiden University for her award-winning dissertation On the Acquisition of Prosodic Structure (1994). She has been a (guest) researcher at various universities, among them the University of Konstanz, the University of Tromsø, and the University of Oxford. Her research concerns the acquisition of phonological representations in the lexicon and the role of these representations in perception and production. Most of her research is conducted at the Baby and Child Research Center in Nijmegen.

Janet Fletcher is Professor of Phonetics in the School of Languages and Linguistics at the University of Melbourne. Her research interests include phonetic theory, laboratory phonology, prosodic phonology, and articulatory and acoustic modelling of prosodic effects in various languages. She is currently working on phonetic variation, prosody, and intonation in Indigenous Australian languages, including Mawng, Bininj Gun-wok, and Pitjantjatjara, and has commenced projects on selected languages of Oceania.

Sónia Frota is Professor of Experimental Linguistics at the University of Lisbon. Her research seeks to understand the properties of prosodic systems (phrasing, intonation, and rhythm), the extent to which they vary across and within languages, and how they are acquired by infants and help to bootstrap the learning of language. She is the director of the Phonetics and Phonology Lab and the Baby Lab at the University of Lisbon, and is the editor in chief of the Journal of Portuguese Linguistics.
James Sneed German is an Associate Professor of Language Sciences at Aix-Marseille University and the Laboratoire Parole et Langage. His research focuses primarily on intonational meaning, and especially on how specific marking strategies are disrupted by conflicting constraints across different levels of representation. He has worked extensively on modelling the intonational phonology of understudied varieties, while more recent research explores how the linguistic system dynamically adapts to socio-indexical cues in situations of dialect contact. His research covers a range of language varieties including American English, Glaswegian English, Singapore English, Standard French, Corsican French, Singapore Malay, and Taiwan Mandarin.

Judit Gervain is a senior research scientist at the Centre National de la Recherche Scientifique (CNRS), working in the Laboratoire Psychologie de la Perception in Paris. She received her PhD from the International School for Advanced Studies, Trieste, in 2007 and worked as a postdoctoral research fellow at the University of British Columbia in Vancouver before taking up a CNRS researcher position in 2009. Her work focuses on speech perception and language acquisition during the prenatal and early postnatal periods.

Barbara Gili Fivela is Associate Professor at the University of Salento, where she is also vice-director of the Centro di Ricerca Interdisciplinare sul Linguaggio and director of the programme for Studies in Interlinguistic Mediation/Translation and Interpretation. Since 2019, she has been president of the Associazione Italiana Scienze della Voce, an ISCA Special Interest Group. Her main research interests are the phonology and phonetics of intonation, second language learning processes, and the kinematics of healthy and dysarthric speech.

Daniel Goodhue is a postdoctoral researcher at the University of Maryland, College Park. He completed his PhD at McGill University in 2018 with a thesis entitled ‘On asking and answering biased polar questions’. Using traditional and experimental methods, he researches the semantics and pragmatics of various phenomena, including intonation, questions, answers, and modality.

Matthew K. Gordon is Professor of Linguistics at the University of California, Santa Barbara. His research interests include prosody, typology, phonological theory, and the phonetic and phonological description of endangered languages. He is the author of Syllable Weight: Phonetics, Phonology, Typology (Routledge, 2006) and Phonological Typology (Oxford University Press, 2016).

Agustín Gravano is Professor in the Computer Science Department at the University of Buenos Aires, and a Researcher at CONICET (Argentina’s National Research Council). His main research topic is the coordination between participants in conversations, both at a temporal level and along other dimensions of speech. The ultimate goal is to incorporate this knowledge into spoken dialogue systems, with the aim of improving their naturalness.

Martine Grice is Professor of Phonetics at the University of Cologne. She has served as president of the Association for Laboratory Phonology and edits the series Studies in Laboratory Phonology. Her work on intonation theory includes the analysis of complex tonal structures and the consequences of tune–text negotiation. She has worked extensively on the intonation of Italian, English, and German and has addressed specific challenges in the analysis of Vietnamese, Tashlhiyt Berber, and Maltese. More recently she has been working on the use of intonation in attention orienting and the effect of perspective-taking abilities (e.g. in autism) on prosodic encoding and decoding.

Güliz Güneş is a postdoctoral researcher at the Leiden University Centre for Linguistics and a scientific researcher at Eberhard Karls University of Tübingen. She participates in a project funded by the Dutch Research Council on the prosody of ellipsis and deaccentuation. Her past research focused on the prosodic constituency of Turkish, the interactions between information structure and prosody, and syntax–prosody correspondence in Turkish at the morpheme, word, phrase, and sentence levels. She has also worked on how morphology mediates prosodic constituency formation in Turkish. More generally, her research seeks to understand what the prosodic traits of spoken language can tell us about the interactions between syntax, morphology, and discourse structure.

Carlos Gussenhoven obtained his PhD from Radboud University in 1984. He is currently Professor of Phonetics and Phonology at National Chiao Tung University, funded by the Ministry of Science and Technology, and Professor Emeritus at Radboud University, where he held a personal chair from 1996 to 2011. He was a visiting scholar at Edinburgh University (1981–1982), Stanford University (1985–1986), and the University of California, Berkeley (1995, Fulbright), and has held positions at the University of California, Berkeley (1991) and Queen Mary University of London (2004–2011), as well as guest professorships at the University of Konstanz and Nanjing University. Besides publishing his research on prosodic theory and the prosodic systems of a variety of languages in journals, edited books, and conference proceedings, he co-authored Understanding Phonology (4th edition, Routledge, 2017) and published The Phonology of Tone and Intonation (Cambridge University Press, 2004).

Mark Hasegawa-Johnson is Professor of Electrical and Computer Engineering at the University of Illinois (Urbana-Champaign) and a Fellow of the Acoustical Society of America. He is currently treasurer of the International Speech Communication Association, secretary of SProSIG, and a member of the Speech and Language Technical Committee of the Institute of Electrical and Electronics Engineers; he was general chair of Speech Prosody 2010. He received his PhD under Ken Stevens at MIT in 1996, was a postdoctoral researcher with Abeer Alwan at the University of California, Los Angeles, and has been a visiting professor with Jeff Bilmes in Seattle and with Tatsuya Kawahara in Kyoto. He is author or co-author of over 50 journal articles, book chapters, and patents, and of over 200 conference papers and published abstracts. His primary research areas are in the application of phonological concepts to audio and audiovisual speech recognition and synthesis (Landmark-Based and Prosody-Dependent Speech Recognition), in the application of semi-supervised and interactive machine learning methods to multimedia browsing and search (Multimedia Analytics), and in the use of probabilistic transcription to develop massively multilingual speech technology localized to under-resourced dialects and languages.
Sam Hellmuth is Senior Lecturer in Linguistics in the Department of Language and Linguistic Science at the University of York. Sam was director and principal investigator of the UK Economic and Social Research Council-funded project Intonational Variation in Arabic. Her research seeks to understand the scope of variation observed in the intonational systems of spoken Arabic dialects, and the interaction of intonation in these languages with segmental and metrical phonology, syntax, semantics, and information structure. Sam also works on second language acquisition of prosody, and the prosodic properties of regional dialects of British Englishes and World Englishes.

Nikolaus P. Himmelmann is Professor of General Linguistics at the Universität zu Köln. His specializations include typology and grammaticalization as well as language documentation and description. He has worked extensively on western Austronesian as well as Papuan languages.

Julia Hirschberg is Percy K. and Vida L. W. Hudson Professor of Computer Science at Columbia University. Her research focuses on prosody and discourse, including studies of speaker state (emotional, trustworthy, and deceptive speech), text-to-speech synthesis, detection of hedge words and phrases, spoken dialogue systems and entrainment in human–human and human–machine conversation, and linguistic code-switching (language mixing by bilinguals). Her previous research includes studies of charismatic speech, turn-taking behaviours, automatic detection of corrections and automatic speech recognition errors, cue phrases, and conversational implicature. She previously worked at Bell Labs and AT&T Labs in speech and human–computer interface research.

Larry M. Hyman has since 1988 been Professor of Linguistics at the University of California, Berkeley, where he chaired the Department of Linguistics from 1991 to 2002. He has worked extensively on phonological theory and other aspects of language structure, including publishing several books—such as Phonology: Theory and Analysis (Holt, Rinehart & Winston, 1975) and A Theory of Phonological Weight (Foris, 1985)—and numerous theoretical articles in such journals as Language, Linguistic Inquiry, Natural Language and Linguistic Theory, Phonology, Studies in African Linguistics, and the Journal of African Languages and Linguistics. His current interests centre around phonological typology, tone systems, and the descriptive, comparative, and historical study of Niger-Congo languages, especially Bantu. He is also currently executive director of the France-Berkeley Fund.

Allard Jongman is a Professor in the Linguistics Department at the University of Kansas. His research addresses the nature of phonetic representations and involves the study of the acoustic properties of speech sounds, the relation between phonetic structure and phonological representations, and the interpretation of the speech signal in perception across a wide range of languages. In addition to many journal articles, he is the co-author (with Henning Reetz) of Phonetics: Transcription, Production, Acoustics, and Perception (Wiley, 2020).

Sun-Ah Jun obtained her PhD from the Ohio State University in 1993. She is Professor of Linguistics at the University of California, Los Angeles. She also taught at the Linguistic Society of America Summer Institute in 2001 and 2015, and the Netherlands Graduate School of Linguistics Summer School in 2013. Her research focuses on intonational phonology, prosodic typology, the interface between prosody and syntax/semantics/sentence processing, and language acquisition. Her publications include The Phonetics and Phonology of Korean Prosody: Intonational Phonology and Prosodic Structure (Garland, 1996; Routledge, 2018) and the two edited volumes Prosodic Typology: The Phonology of Intonation and Phrasing (Oxford University Press, 2005, 2014).

Anastasia Karlsson is an affiliate Associate Professor of Phonetics at Lund University. Her research area is phonetics, with a main focus on prosodic typology. She has contributed empirical studies on a number of typologically different languages, such as Kammu; the Formosan languages Bunun, Puyuma, and Seediq; and Halh Mongolian.

Maciej Karpiński is a Professor in the Faculty of Modern Languages and Literatures of Adam Mickiewicz University, Poznań. He holds a PhD in general linguistics and a postdoctoral degree in applied linguistics. Focusing on pragma-phonetic and paralinguistic aspects of communication, he has recently investigated the contribution of prosody and gestures to the process of communicative accommodation and the influence of social factors, emotional states, and physical conditions on speech prosody. He has developed a number of linguistic corpora and contributed to projects on language documentation.

Daniel Kaufman is an Assistant Professor at Queens College and the Graduate Center of the City University of New York and co-director of the Endangered Language Alliance, a non-profit organization dedicated to the documentation and support of endangered languages spoken by immigrant communities in the New York area. He obtained his PhD at Cornell University in 2010. He specializes in the Austronesian languages of the Philippines and Indonesia, with a strong interest both in the synchronic analysis of their phonology, morphology, and syntax and in their typology and diachrony.

Holly J. Kennard completed her DPhil in linguistics at the University of Oxford. She subsequently held a British Academy Postdoctoral Fellowship, examining Breton phonology and morphophonology, and is now a Departmental Lecturer in Phonology at the University of Oxford. Her research focuses on phonology and morphophonology in Breton and a variety of other languages.

Paul Kiparsky, a native of Finland, is Professor of Linguistics at Stanford University. He works mainly on grammatical theory, language change, and verbal art. He collaborated with S. D. Joshi to develop a new understanding of the principles behind Pāṇini’s grammar. His book Pāṇini as a Variationist (MIT Press, 1979) uncovered an important dimension of the grammar that was not known even to the earliest commentators.

James Kirby received his PhD in linguistics from the University of Chicago in 2010. Since that time he has been a Lecturer (now Reader) in the Department of Linguistics and English Language at the University of Edinburgh. His research considers the phonetic and phonological underpinnings of sound change, with particular attention to the emergence of lexical tone and register systems.

Emiel Krahmer is Professor of Language, Cognition and Computation at the Tilburg School of Humanities and Digital Sciences. He received his PhD in computational linguistics in 1995, after which he worked as a postdoctoral researcher in the Institute for Perception Research at the Eindhoven University of Technology before moving to Tilburg University. His research focuses on how people communicate with each other, both verbally and non-verbally, with the aim of improving the way computers communicate with human users.

Haruo Kubozono completed his PhD at the University of Edinburgh in 1988. He taught phonetics and phonology at Nanzan University, Osaka University of Foreign Studies, and Kobe University (at the last of which he was a professor) before moving to the National Institute for Japanese Language and Linguistics as a professor and director in 2010. His research interests range from speech disfluencies to speech prosody (accent and intonation) and its interfaces with syntax and information structure. He recently edited The Handbook of Japanese Phonetics and Phonology (De Gruyter Mouton, 2015), The Phonetics and Phonology of Geminate Consonants (Oxford University Press, 2017), and Tonal Change and Neutralization (De Gruyter Mouton, 2018).

Frank Kügler is Professor of Linguistics (Phonology) at Goethe University Frankfurt. His PhD thesis (Potsdam University, 2005) compared the intonation of two German dialects. He received his postdoctoral degree (Habilitation) from Potsdam University for work on the prosodic expression of focus in typologically unrelated languages and obtained a Heisenberg Fellowship to do research at the Phonetics Institute of Cologne University. His research interests are in cross-linguistic and typological prosody, tone, intonation, recursivity of prosodic constituents, and the syntax–prosody interface. He has worked on the prosody of a number of typologically diverse languages.

D. Robert (Bob) Ladd is Emeritus Professor of Linguistics at the University of Edinburgh. Much of his research has dealt with intonation and prosody in one way or another; he was also a leading figure in the establishment of ‘laboratory phonology’ during the 1980s and 1990s. He received a BA in linguistics from Brown University in 1968 and a PhD from Cornell University in 1978. He undertook various research and teaching appointments from 1978 to 1985, and has been at the University of Edinburgh since 1985. He was appointed Professor of Linguistics in 1997 and became an Emeritus Professor in 2011. He became a Fellow of the British Academy in 2015 and a Member of Academia Europaea in 2016.

Aditi Lahiri holds PhDs from the University of Calcutta and Brown University. She has held a research appointment at the Max Planck Institute for Psycholinguistics and various teaching appointments at the University of California, Los Angeles and Santa Cruz. After 15 years as holder of the Lehrstuhl für Allgemeine Sprachwissenschaft at the University of Konstanz, she is currently the Statutory Professor of Linguistics at the University of Oxford. She specializes in phonology from various perspectives (synchronic, diachronic, and experimental) and has largely focused on Germanic languages and on Bengali.

Joseph C. Y. Lau is a postdoctoral scholar in the Department of Psychology and the Roxelyn and Richard Pepper Department of Communication Sciences and Disorders at Northwestern University. He received his PhD in linguistics at the Chinese University of Hong Kong. His work focuses on using neurophysiological and behavioural methods in consonance with machine learning techniques to understand long-term and online neuroplasticity in speech processing in neurotypical and clinical populations from infancy to adulthood.

William R. Leben is Professor Emeritus of Linguistics at Stanford University, his main affiliation since 1972. He has done fieldwork in Niger, Nigeria, Ghana, and Côte d’Ivoire on tone and intonation in Chadic and Kwa languages. He has co-authored textbooks on Hausa, the structure of English vocabulary, and the languages of the world.

John M. Levis has taught and researched pronunciation for many years. He is founding editor of the Journal of Second Language Pronunciation, the founder of the Pronunciation in Second Language Learning and Teaching Conference, and the co-developer of pronunciationforteachers.com. He is co-editor of several books, including the Phonetics and Phonology section of the Encyclopedia of Applied Linguistics (Wiley, 2012), Social Dynamics in Second Language Accent (de Gruyter, 2014), the Handbook of English Pronunciation (Wiley, 2015), Critical Concepts in Linguistics: Pronunciation (Routledge, 2017), and Intelligibility, Oral Communication, and the Teaching of Pronunciation (Cambridge University Press, 2018).

Rivka Levitan received her PhD from Columbia University in 2014. She is now an Assistant Professor in the Department of Computer and Information Science at Brooklyn College, and in the Computer Science and Linguistics Programs at the City University of New York Graduate Center. Her research focuses on the detection of paralinguistic and affective information from speech and language, with special interest in the information carried by dialogue and prosody.

Diane Lillo-Martin is a Board of Trustees Distinguished Professor of Linguistics at the University of Connecticut and a Senior Research Scientist at Haskins Laboratories. She is a Fellow of the Linguistic Society of America and currently serves as chair of the international Sign Language Linguistics Society. She received her PhD in linguistics from the University of California, San Diego. Her research areas include the morphosyntax of American Sign Language and the acquisition of both signed and spoken languages, including bimodal bilingualism.

Florian Lionnet is Assistant Professor of Linguistics at Princeton University. He obtained his PhD at the University of California, Berkeley, in 2016. His research focuses on phonology, typology, areal and historical linguistics, and language documentation and description, with a specific focus on African languages. He is currently involved in research on understudied and endangered languages in southern Chad. He has published on a range of topics, including the phonetics–phonology interface, tonal morphosyntax, the areal distribution of phonological features in northern sub-Saharan Africa, and the typology and grammaticalization of verbal demonstratives.

Liquan Liu is a Lecturer in the School of Psychology at Western Sydney University. He received his PhD from Utrecht University in 2014, and currently holds a Marie Skłodowska-Curie fellowship at the Center for Multilingualism in Society across the Lifespan, University of Oslo. He uses behavioural and electrophysiological techniques to measure infant and early childhood development, featuring speech perception and bilingualism from an interdisciplinary perspective.
xlviii About The Contributors Katalin Mády is a Senior Researcher at the Hungarian Research Institute for Linguistics, Budapest. After finishing her PhD on clinical phonetics at the Institute of Phonetics and Speech Processing (IPS), LMU Munich in 2004, she became Assistant Professor in German linguistics at the Pázmány Péter Catholic University. She returned to IPS LMU as a postdoctoral researcher in 2006 to do research in sociophonetics, laboratory phonology, and prosody. Her current work mainly focuses on prosodic typology, with a special interest in Uralic and other understudied languages in Central and Eastern Europe. James Mahshie is Professor in the Department of Speech, Language and Hearing Sciences at George Washington University. His research examines the perception and production of speech features by deaf children with cochlear implants. He has published numerous articles and book chapters on speech production and deafness, and co-authored a text on facilitating communication enhancement in deaf and hard-of-hearing children. Zofia Malisz is a researcher in speech technology at the Royal Institute of Technology in Stockholm. Her work focuses on modelling speech rhythm, timing, and prominence as well as improving prominence control in text-to-speech synthesis. Reiko Mazuka is a Team Leader for Laboratory for Language Development at the RIKEN Center for Brain Sciences. Her PhD dissertation was on developmental psychology (Cornell University, 1990). Before opening her lab at RIKEN in 2004, she worked in the Psychology Department at Duke University. She is interested in examining the role of language-specific phonological systems on phonological development. John J. McCarthy is Provost and Distinguished Professor at the University of Massachusetts Amherst. He is a fellow of the American Academy of Arts and Sciences, the American Association for the Advancement of Science, and the Linguistic Society of America. His books include Hidden Generalizations: Phonological Opacity in Optimality Theory (Equinox, 2007) and Doing Optimality Theory (Blackwell, 2008). With Joe Pater he edited Harmonic Grammar and Harmonic Serialism (Equinox, 2016). James M. McQueen is Professor of Speech and Learning at Radboud University. He studied experimental psychology at the University of Oxford and obtained his PhD from the University of Cambridge. He is a principal investigator at the Donders Institute for Brain, Cognition and Behaviour (Centre for Cognition) and is an affiliated researcher at the Max Planck Institute for Psycholinguistics. His research focuses on learning and processing in spoken language: How do listeners learn the sounds and words of their native and nonnative languages, and how do they recognize them? His research on speech learning concerns initial acquisition processes and ongoing processes of perceptual adaptation. His research on speech processing addresses core computational problems (such as the variability and segmentation problems). He has a multi-disciplinary perspective on psycholinguistics, combining insights from cognitive psychology, phonetics, linguistics, and neuroscience. Alexis Michaud received his PhD in phonetics from Sorbonne Nouvelle in 2005. He joined the Langues et Civilisations à Tradition Orale (LACITO) research centre at the Centre National de la Recherche Scientifique as a research scientist in 2006. His interests include
tone and phonation, prosody, language documentation and description, and historical linguistics. His work focuses on languages of the Naish subgroup of Sino-Tibetan (Na and Naxi) and of the Vietic subgroup of Austroasiatic.

Hansjörg Mixdorff is a professor at Beuth University of Applied Sciences Berlin. His main research interest is the production and perception of prosody in a cross-language perspective. He employs the Fujisaki model as a tool to synthesize fundamental frequency as well as to measure differences between native and second language speakers’ prosody, for instance. More recently, in studies on prominence and attitude, he has worked on the interface between non-verbal facial gestures and prosodic features.

Osahito Miyaoka took his MA degree and a PhD in linguistics at Kyoto University. For about 40 years, he carried out a great deal of fieldwork in Southwest Alaska (the Yupik area, including Bethel), in addition to his fieldwork on Yámana as spoken in Ukika on Navarino Island (Tierra del Fuego). In 2012, he published his monumental A Grammar of Central Alaskan Yupik (CAY) (Mouton de Gruyter). Before his retirement from Kyoto University in 2007, he taught at the Graduate School of Letters at Kyoto University. He has also taught at the University of Alaska (Fairbanks and Anchorage) and Hokkaido University.

Bernd Möbius is Professor of Phonetics and Phonology at Saarland University and was editor in chief of Speech Communication (2013–2018). He was a board member of the International Speech Communication Association (ISCA) from 2007 to 2015, a founding member and chair (2002–2005) of ISCA’s special interest group on speech synthesis, and has served on ISCA’s Advisory and Technical committees. A central theme of his research concerns the integration of phonetic knowledge in speech technology. Recent work has focused on experimental methods and computational simulations to study aspects of speech production, perception, and acquisition.

Doris Mücke is a Senior Researcher at the IfL Phonetics Lab at the University of Cologne, where she works with various methods of acoustic and articulatory analysis. She finished a PhD on vowel synthesis and perception in 2003 and in 2014 obtained her Habilitation degree for her research on dynamic modelling of articulation and prosodic structure. Her main research interest is the integration of phonetics and phonology with a special focus on the interplay of articulation and prosody. She investigates the coordination of tones and segments, prosodic strengthening, and kinematics and acoustics of syllable structure in various languages in typical and atypical speech. Currently, she is working on the relationship between brain modulation and speech motor control within a dynamical approach in healthy speakers as well as with patients with essential tremor and Parkinson’s disease.

Ronice Müller de Quadros has been a professor and researcher at the Federal University of Santa Catarina since 2002 and a researcher on sign languages at Conselho Nacional de Desenvolvimento Científico e Tecnológico since 2006. She holds an MA (1995) and a PhD (1999) in linguistics, both from Pontifícia Universidade Católica do Rio Grande do Sul. Her PhD project included an 18-month internship at the University of Connecticut (1997–1998). Her main research interests are the grammar of Brazilian Sign Language (Libras), bimodal
bilingual languages (Libras and Portuguese, and American Sign Language and English), sign language acquisition, and the Libras Corpus.

Ailbhe Ní Chasaide is Professor of Phonetics and Director of the Phonetics and Speech Laboratory at the School of Linguistic, Speech and Communication Sciences, Trinity College Dublin. She has directed over 20 funded research projects and published widely on a range of topics including the voice quality dimension of prosody, how voice quality and pitch interact in signalling both linguistic and affective information, and the prosodic and segmental structure of Irish dialects. She is the lead principal investigator on the ABAIR project, which is developing phonetic-linguistic resources and technology for the Irish language.

Oliver Niebuhr earned his doctorate (with distinction) in phonetics and digital speech processing from Kiel University and subsequently worked as a postdoctoral researcher at linguistic and psychological institutes in Aix-en-Provence and York as part of the interdisciplinary European Marie Curie Research Training Network ‘Sound to Sense’. In 2009, he was appointed Junior Professor of Spoken Language Analysis and returned to Kiel University. He is now Associate Professor of Communication and Innovation at the University of Southern Denmark. In 2017, he was appointed head of the CIE Acoustics Lab and founded the speech-technology startup AllGoodSpeakers ApS.

Mitsuhiko Ota is Professor of Language Development at the University of Edinburgh. His research addresses phonological development in both first and second languages, with a focus on the role of linguistic input and the interface between phonology and the lexicon. He has worked on children’s acquisition of prosodic systems, such as those of Japanese and Swedish. He is an associate editor of Language Acquisition.

Rhea Paul is Professor and Chair of Speech-Language Pathology at Sacred Heart University and the author of over 100 refereed journal articles, over 50 book chapters, and nine books. She holds a PhD and a Certificate of Clinical Competence in speech-language pathology. She received the Ritvo-Slifka Award for Innovative Clinical Research from the International Society for Autism Research in 2010 and Honors of the Association for lifetime achievement from the American Speech-Language-Hearing Association in 2014.

Mary Pearce obtained her PhD in linguistics at University College London in 2007 on the basis of her research on the phonology and phonetics of Chadic languages, with a particular interest in vowel harmony and tone. She has lived for a number of years in Chad, although she is now based back in the UK. Her publications include The Interaction of Tone with Voicing and Foot Structure: Evidence from Kera Phonetics and Phonology (Center for the Study of Language and Information, Stanford, 2013) and ‘The Interaction between Metrical Structure and Tone in Kera’ (Phonology, 2006). She is currently the International Linguistics Coordinator for SIL International.

Jörg Peters is Professor of Linguistics at Carl von Ossietzky University Oldenburg, where he teaches phonology, phonetics, sociolinguistics, and pragmatics. His research interests
are in segmental and prosodic variation of German, Low German, Dutch, and Saterland Frisian. His publications include Intonation deutscher Regionalsprachen (de Gruyter, 2006) and Intonation (Winter, 2014).

Brechtje Post is Professor of Phonetics and Phonology at the University of Cambridge. Her research focuses on speech prosody, which she explores from a number of different angles (phonology, phonetics, acquisition, and cognitive and neural speech processing).

Pilar Prieto is a Research Professor funded by the Catalan Institution for Research and Advanced Studies in the Department of Translation and Language Sciences at Universitat Pompeu Fabra. Her main research interests are how prosody and gesture work in language acquisition and how they interact with other types of linguistic knowledge (pragmatics and syntax). She has published numerous research articles that address these questions and co-edited a book titled Prosodic Development (John Benjamins, 2018).

Hamed Rahmani obtained his PhD from Radboud University in 2019. His research focuses on the word and sentence prosody of Persian and its interaction with morphosyntax and semantics. His other research interests include the relations between mathematics, music, and linguistic structures.

Melissa A. Redford is Professor of Linguistics at the University of Oregon. She received her PhD in psychology and postdoctoral training in computer science from the University of Texas at Austin. Her research investigates how language and non-language systems interact over developmental time to structure the speech plan and inform speech production processes in real time.

Henning Reetz holds an MSc in computer science and received his PhD from the University of Amsterdam in 1996. He worked at the Max Planck Institute for Psycholinguistics in Nijmegen and taught phonetics at the University of Konstanz. Currently, he is Professor of Phonetics and Phonology at the University of Frankfurt. His main research is on human and machine speech recognition with a focus on the mental representation of speech. Part of this work includes the processing of audio signals in the early neural pathway, where pitch perception plays a major role.

Tomas Riad has been Professor of Nordic languages at Stockholm University since 2005. His main research interests concern prosody in various ways: North Germanic pitch accent typology, the origin of lexical pitch accents, the relationship between grammar and verse metrics, and the relationship between morphology and prosody. He is the author of The Phonology of Swedish (Oxford University Press, 2014). He has been a member of the Swedish Academy since 2011.

Nicholas Rolle is a postdoctoral researcher at Leibniz-Zentrum Allgemeine Sprachwissenschaft (ZAS, Berlin). He received his PhD from the University of California, Berkeley in 2018, and was previously a Postdoctoral Research Associate at Princeton University. His specialization is phonology at its interface with morphology and syntax, including the grammatical use of tone, prosodic subcategorization, paradigm uniformity
effects, and allomorphy. His empirical focus is on African languages, involving fieldwork on the Edoid and Ijoid families of Nigeria.

Andrew Rosenberg has been a Research Staff Member at IBM Research since 2016. He received his PhD from Columbia University in 2009. He then taught and researched at Queens College, City University of New York (CUNY), until he joined IBM. From 2013 to 2016, he directed the CUNY Graduate Center Computational Linguistics Program. He has written over 70 journal and conference papers, primarily on automated analyses of prosody and the use of these on downstream spoken-language-processing tasks. He is the author and maintainer of AuToBI, an open-source tool for the automatic assignment of ToBI labels from speech. He is a National Science Foundation CAREER award winner.

Hannah Sande is an Assistant Professor of Linguistics at Georgetown University. She obtained her PhD at the University of California, Berkeley, in 2017. She carries out both documentary and theoretical linguistic research. Her theoretical work investigates the interaction of phonology with morphology and syntax, with original data primarily from African languages. She has spent many summers in West Africa working with speakers of Guébie, an otherwise undocumented Kru language spoken in Côte d’Ivoire. She also works locally with speakers of Amharic (Ethio-Semitic), Dafing (Mande), Nobiin (Nilotic), and Nouchi (contact language, Côte d’Ivoire). Her dissertation work focused on phonological processes and their interaction with morphosyntax, based on data from Guébie, where much of the morphology is non-affixal and rather involves root-internal changes such as tone shift or vowel alternations. She continues to investigate morphologically specific phonological alternations across African and other languages.

Wendy Sandler is Professor of Linguistics at the University of Haifa and Founding Director of the Sign Language Research Lab there. She has developed models of sign language phonology and prosody that exploit general linguistic principles to reveal both the similarities and the differences in natural languages in two modalities. More recently, her work has turned to the emergence of new sign languages and ways in which the body is recruited to manifest increasingly complex linguistic forms within a community of signers. Sandler has authored or co-authored three books on sign language: Phonological Representation of the Sign (Foris, 1989); A Language in Space: The Story of Israeli Sign Language, co-authored with Irit Meir (Hebrew version: University of Haifa Press, 2004; English version: Lawrence Erlbaum Associates/Taylor & Francis, 2008, 2017); and Sign Language and Linguistic Universals, co-authored with Diane Lillo-Martin (Cambridge University Press, 2006). She is currently conducting a multi-disciplinary research project, The Grammar of the Body, supported by the European Research Council.

Stefanie Shattuck-Hufnagel is a Principal Research Scientist in the Speech Communication Group at MIT. She received her PhD in psycholinguistics from MIT in 1974, taught in the Department of Psychology at Cornell University, and returned to MIT in 1979. Her research is focused on the cognitive processes and representations that underlie speech production planning, using behaviour such as speech errors, context-governed systematic variation in surface phonetic form, prosody, and co-speech gesture to test hypotheses about the planning
process and to derive constraints on models of that process. Additional interests include developmental and clinical aspects of speech production, and the role of individual acoustic cues to phonological features in speech perception. She is a proud founding member of the Zelma Long Society.

Elizabeth Schoen Simmons is an Assistant Professor at Sacred Heart University. She received her PhD in Cognitive Psychology from the University of Connecticut. Her research focuses on language development in both typical and clinical populations.

Melanie Soderstrom is Associate Professor of Psychology at the University of Manitoba. She received her PhD in psychological and brain sciences from Johns Hopkins University in 2002, and has been a National Institutes of Health-funded postdoctoral researcher in Brown University’s Department of Cognitive and Linguistic Sciences. Her early work examined infants’ responses to the grammatical and prosodic characteristics of speech. More recently, she has focused on the characteristics of child-directed speech in the home environment. She is currently active in the ManyBabies large-scale collaborative research initiative, and in a smaller collaborative project, Analyzing Child Language Environments Around the World (ACLEW).

Marc Swerts is a Professor in the School of Humanities and Digital Sciences at Tilburg University and currently also acts as the vice-dean of research in that same school. His scientific focus is on trying to get a better understanding of how speakers exploit non-verbal features to exchange information with their addressees, with a specific interest in the interplay between prosodic characteristics, facial expressions, and gestures to signal socially and linguistically relevant information. He has served on the editorial boards of three major journals in the field of language and speech research, and has served as editor in chief of Speech Communication. He was elected as one of the two distinguished lecturers of the International Speech Communication Association (ISCA) for the years 2007–2008, promoting speech research in various parts of the world, and was awarded an ISCA fellowship in 2015.

Annie Tremblay is a Professor of Linguistics at the University of Kansas. She completed her PhD in second language acquisition at the University of Hawaiʻi in 2007. She uses psycholinguistic techniques such as eye tracking, cross-modal priming, and artificial language segmentation to investigate speech processing and speech segmentation in non-native listeners, with a focus on the use of suprasegmental information in spoken-word recognition.

Jürgen Trouvain is a Researcher and Lecturer at the Department of Language Science and Technology at Saarland University. The focus of his PhD was tempo variation in speech production. His research interests include non-verbal vocalizations, non-native speech, stylistic variation of speech, and historical aspects of speech communication research. He has been a co-editor of publications on non-native prosody and phonetic learner corpora.

Alice Turk is Professor of Linguistic Phonetics at the University of Edinburgh, where she has been since 1995. Her work focuses on systematic timing patterns in speech as evidence for the structures and processes involved in speech production. Specific interests include prosodic structure and speech motor control.
Harry van der Hulst specializes in phonology, encompassing both the sound systems of spoken languages and the visual aspects of sign languages. He obtained his PhD at Leiden University in 1984. He has published four books, two textbooks, and over 170 articles, and has edited over 30 books and six journal theme issues in the above-mentioned areas. He has been editor in chief of The Linguistic Review since 1990 and is co-editor of the series Studies in Generative Grammar (Mouton de Gruyter). He has been Professor of Linguistics at the University of Connecticut since 2000.

Vincent J. van Heuven is Emeritus Professor of Experimental Linguistics and Phonetics at Leiden University and the University of Pannonia. He has honorary professorships at Nankai University and Groningen University, and is a guest researcher at the Fryske Akademy in Leeuwarden. He served as the director of the Holland Institute of Linguistics from 1999 to 2001 and of the Leiden University Centre for Linguistics from 2001 to 2006. He is a life member of the Royal Netherlands Academy of Arts and Sciences.

Diana Van Lancker Sidtis is Professor of Communicative Sciences and Disorders at New York University and Research Scientist at the Nathan Kline Institute for Psychiatric Research in Orangeburg, New York. She holds a PhD and a Certificate of Clinical Competence for Speech-Language Pathologists. Educated at the universities of Wisconsin and Chicago and Brown University, she performed predoctoral studies at the University of California, Los Angeles, and was awarded a National Institutes of Health postdoctoral fellowship at Northwestern University. Her peer-reviewed published research examines voice, aphasia, motor speech, prosody, and formulaic language. Her scholarly book Foundations of Voice Studies, co-authored with Dr Jody Kreiman (Wiley-Blackwell, 2011), won the 2011 Prose Award for Scholarly Excellence in Linguistics from the American Publishers Association.

Alexandra Vella completed her PhD in linguistics at the University of Edinburgh in 1995. She is Professor of Linguistics at the University of Malta, where she coordinates the Sound component of the Institute of Linguistics and Language Technology programme, teaching various courses in phonetics and phonology. Her main research focus is on prosody and intonation in Maltese and its dialects, as well as Maltese English, the English of speakers of Maltese in the rich and complex linguistic context of Malta. She leads a small team of researchers working on developing annotated corpora of spoken Maltese and its dialects as well as of Maltese English.

Paul Warren is Professor of Linguistics at Victoria University of Wellington, New Zealand. He teaches and researches in psycholinguistics (having published Introducing Psycholinguistics, Cambridge University Press, 2012) and in phonetics, especially the description of New Zealand English and of intonation (addressed in his Uptalk, Cambridge University Press, 2016). He is on the editorial boards of the Journal of the International Phonetic Association, Laboratory Phonology, and Te Reo (the journal of the Linguistics Society of New Zealand). He is a founding member of the Association for Laboratory Phonology.

Justin Watkins is Professor of Burmese and Linguistics at SOAS, University of London. His research focuses on Burmese and minority languages of Myanmar.
Matthijs Westera is an Assistant Professor in Humanities and AI at Leiden University. He obtained his PhD in 2017 from the University of Amsterdam with a dissertation on implicature and English intonational meaning, after which he held a postdoctoral position in computational linguistics at the Universitat Pompeu Fabra. His research is on the semantics–pragmatics interface and combines traditional methods with advances in computational linguistics and deep learning.

Laurence White is a Senior Lecturer in Speech and Language Sciences at Newcastle University. He completed his PhD in linguistics at Edinburgh University in 2002 and was a postdoctoral researcher at Bristol University and the International School for Advanced Studies, Trieste. He joined Plymouth University as a lecturer in 2011 and Newcastle University in 2018. His research explores speech perception, speech production, and their relationship, with a focus on prosody and its role in the segmentation of speech by listeners. He also works on infant language development and second language acquisition.

Patrick C. M. Wong holds the Stanley Ho Chair in Cognitive Neuroscience and is Professor of Linguistics and Otolaryngology and Founding Director of the Brain and Mind Institute at the Chinese University of Hong Kong. His research covers basic and translational issues concerning the neural basis and disorders of language and music. His work on language learning attempts to explain the sources of individual differences by focusing on neural and neurogenetic markers of learning in order to support methods to personalize learning. His work also explores questions concerning phonetic constancy and representation.

Zilong Xie is currently a postdoctoral researcher in the Department of Hearing and Speech Sciences at the University of Maryland, College Park. He received his PhD in communication sciences and disorders at the University of Texas at Austin. His research focuses on understanding the sensory and cognitive factors that contribute to individual differences in speech processing in typical as well as clinical (e.g. individuals with cochlear implants) populations, using behavioural and neuroimaging (e.g. electroencephalography) methods.

Seung-yun Yang is Assistant Professor in Communication, Arts, Sciences & Disorders at Brooklyn College, City University of New York. She is also a certified Speech-Language Pathologist and a member of the Brain and Behavior Laboratory at the Nathan Kline Institute for Psychiatric Research in Orangeburg, New York. Her research aims to better understand the neural bases and acquisition of nonliteral language: how people communicate nonliteral meanings in spoken language and how acquired brain damage affects these communicative functions and pragmatic skills.

Alan C. L. Yu is Professor of Linguistics and Director of the Phonology Laboratory at the University of Chicago. His research primarily addresses issues of individual variation in the study of language variation and change, particularly in how it informs the origins and actuation of sound change. He is the author of A Natural History of Infixation (Oxford University Press, 2007) and a (co-)editor of the Handbook of Phonological Theory (2nd edition, Wiley-Blackwell, 2011) and Origins of Sound Change: Approaches to Phonologization (Oxford University Press, 2013).
Jie Zhang completed his PhD in linguistics at the University of California, Los Angeles, in 2001 and is currently Professor in the Linguistics Department at the University of Kansas, where he teaches courses on phonology, introductory linguistics, and the structure of Chinese. He also served as a Lecturer in Linguistics at Harvard University from 2001 to 2003. His research uses experimental methods to investigate the representation and processing of tone and tonal alternation patterns, with a special focus on the productivity of tone sandhi in Chinese dialects.
Chapter 1
Introduction
Carlos Gussenhoven and Aoju Chen
1.1 Introduction

In this chapter we first offer a motivation for starting the project that has led to this handbook (§1.2). Next, after a discussion of some definitional and terminological issues in §1.3, we lay out the structure of the handbook in §1.4. Finally, in §1.5 we discuss our reflections on this project and the outlook for this handbook.
1.2 Motivating our enterprise

Surveys of language prosody tend to focus on specific aspects, such as tone, word stress, prosodic phrasing, and intonation. Surveys that attempt to cover all of these are less common. In part, this may be due to a perceived lack of uniformity in the field. Shifting conceptions and terminologies may indicate to newcomers a lack of consensus about basic issues, while a confrontation with the variety of approaches to the topic may well have the same effect. We believe, however, that the way today’s researchers conceptualize the word and sentence prosodic structures in the languages of the world and their place in discourse, processing, acquisition, language change, speech technology, and pathology shows more coherence than differences in terminology may suggest. Three developments in the field have been particularly helpful.

The first of these is the model of phonology and phonetics presented by Janet Pierrehumbert in her 1980 dissertation and its application to the intonation of English. Its relevance extends considerably beyond the topic of intonation, mainly because her work has shaped our conceptualization of the relation between phonological representations and phonetic implementation. This descriptive framework moreover increased the cross-linguistic comparability of prosodic accounts, and preserved one of the main achievements of Gösta Bruce’s 1977 dissertation, the integration of tone and intonation in the same grammar. Quite unlike what happened after the introduction of phonology as a separate discipline in the early twentieth century, the separation of phonetic implementation from phonological representations has had a fruitful effect on the integrated study of phonetics
and phonology. This is true despite the fact that it may be hard to decide whether specific intonational features in some languages have a phonological representation or arise in the phonetic implementation.

The second development is the expansion of the database. The wider typological perspective has led to a more realistic view of the linguistic diversity in prosodic systems. Hypotheses about the universality in the relation between prosodic prominence and focus, for instance, are now competing with a hypothesis taking post-focus compression, a common way of marking focus, to be an areal feature (e.g. Samek-Lodovici 2005; Xu et al. 2012). Further, the data increase will have helped to shift the objects of typology from languages (‘intonation language’, ‘tone language’) to linguistic properties (cf. Larry Hyman’s work on ‘property-driven typology’, 2006). Finally, the availability of large corpora annotated for various features has resulted in new insights into the use of prosody in everyday communication by speakers with different native languages and second languages.

The third development is the emergence of new lines of research, in part due to a rapid evolution of research methodologies and registration techniques, such as eye tracking and the registration of brain activity (e.g. electroencephalography (EEG), magnetoencephalography (MEG), and functional magnetic resonance imaging (fMRI)). Earlier, psycholinguistic research paradigms focused on the investigation of the role of prosodic cues in spoken-word recognition, on how listeners capitalize on prosodic cues to resolve temporary syntactic and semantic ambiguity in ‘garden path’ cases such as When he leaves the house is dark, and on how pragmatically appropriate prosody influences speech comprehension. More recently, neurological research has shown topographic and latency patterns in brain activity induced by a variety of linguistic stimuli. Accent location and type of accent have been investigated as cues to language processing in eye-tracking research, and electromagnetic registrations of articulation have provided information on how speakers integrate the phonetic implementation of prosodic and ‘textual’ parts of language.

Broadly, Parts I to IV deal with linguistic representations and their realization, while Parts V to VIII deal with a variety of fields in which language prosody plays an important role. In this second half of the book, preconceived theoretical notions might at times be less of a help and, accordingly, not all chapters relate to a theoretical framework.
1.3 Definitional and terminological issues

Much as we dislike pinning ourselves down to definitions of ‘language prosody’ (it is best to keep options open as long as our understanding is developing), it will be useful to provide some orientation on what we mean by that term. Definitions of scientific objects do more than circumscribe phenomena so as to shape the reader’s expectations: they also reflect the conceptualizations of the field. In §1.3.1, we briefly sketch the development in the conceptualization of language prosody, while §1.3.2 indicates the typological categories that this handbook is most directly concerned with and §1.3.3 discusses some terminological ambiguities.
1.3.1 Tradition and innovation in defining language prosody

Over the past decades, a factor of interest has been the balance between form and function. An early form-based conceptualization focused on the analysis of the speech signal into four phonetic variables: pitch, intensity, duration, and spectral pattern. The first three variables are ‘suprasegmental’ in the sense that they are ‘overlaid’ relative to features that inherently define or express phonetic segments (Lehiste 1970). Examples of inherent features are voicing, nasality, and a low second formant, or even, so Lehiste points out, duration and intensity inasmuch as these are required for a segment to be identifiable in the visually or auditorily perceived speech signal. Thus, duration is suprasegmental if it is used to create a long vowel or geminate consonant, and pitch is suprasegmental when it is a manifestation of a tonal or intonational pattern (p. 2). This division between ‘segmental’ and ‘suprasegmental’ features has shaped the conceptualization of the field up until today. One consequence is that the term ‘segmental’ typically still excludes tones, the segments of autosegmental phonology, whose realization relies largely on pitch. Its usual reference is only to vowels and consonants, segments whose realization relies largely on spectral variation.

Within this approach, a functional perspective will consider how each suprasegmental variable plays its role in creating communicative effects. Pitch is most easily perceived as a separate component of speech, and its acoustic correlate, fundamental frequency (f0), is readily extractable from the signal (see chapter 3). It thus invitingly exposes itself to linguistic generalizations, as in Xu’s (2005) function-based approach known as the PENTA model. In principle, this approach could also be applied to duration, which may reflect vowel or consonant quantity as well as effects of domain boundaries, tone-dependent lengthening, and effects of speech tempo, but it would be less suitable for intensity, which has a substantially reduced role to play in speech communication (cf. Watson et al. 2008a). In order to go beyond the decomposition of the speech signal into phonetic variables and the establishment of their role in language, it is useful to ask ourselves to what extent these variables express forms or functions.

More recently, there has been a stronger emphasis on the distinction between what Lehiste (1970) referred to as ‘paralanguage’ and ‘language’. Paralanguage is ‘not to be confused with language’, but is nevertheless used ‘in systematic association with language’ (p. 3). As a result, researchers have confronted the fact that phonetic variables can express meanings, as in paralanguage, and meaningless phonological forms, as in language. The first aspect is involved in the signalling of affect and emotion, as when we raise our pitch to express fear or surprise, and sometimes of more linguistic functions such as interrogativity and emphasis, through manipulations of pitch, intensity, and duration. The second concerns the phonetic expression of phonological constituents such as segments, syllables, feet, and larger phonological constituents including the phonological word and the phonological phrase.
Of these, segments have a different status in that the features they are composed of provide the phonological content of the phonological representation, while the other constituents provide a hierarchical prosodic phrasing structure, similar though not identical to the structure built by morphosyntactic constituents.
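To make the remark about the extractability of f0 concrete, here is a minimal sketch of how the three classic suprasegmental variables (f0, intensity, and duration) can be estimated from a recording with the open-source Python library librosa. The file name speech.wav and the 75–500 Hz search range are placeholder assumptions, not material from this handbook; phoneticians more commonly use dedicated tools such as Praat for these measurements.

```python
# Illustrative sketch only: estimating the three classic suprasegmental
# variables (f0, intensity, duration) from a recording. The file name and
# the f0 search range are placeholder assumptions.
import librosa
import numpy as np

y, sr = librosa.load("speech.wav", sr=None)  # keep the native sampling rate

# Fundamental frequency via the pYIN algorithm; unvoiced frames come back
# as NaN. 75-500 Hz is a rough search range for adult voices.
f0, voiced_flag, voiced_prob = librosa.pyin(y, fmin=75, fmax=500, sr=sr)

# Intensity approximated as frame-wise RMS energy, converted to dB.
rms = librosa.feature.rms(y=y)[0]
intensity_db = 20 * np.log10(np.maximum(rms, 1e-10))

# Duration of the whole recording in seconds.
duration = librosa.get_duration(y=y, sr=sr)

voiced_f0 = f0[~np.isnan(f0)]
print(f"duration: {duration:.2f} s")
print(f"mean f0 (voiced frames): {voiced_f0.mean():.1f} Hz")
print(f"mean intensity: {intensity_db.mean():.1f} dB (relative)")
```

Pitch trackers differ in their details, but all deliver a time series of f0 estimates of this kind, which is the raw material for the perceptual and tonal questions taken up in chapters 3 and 9.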
1.3.2 Some typological categories

To return to the issue of what this handbook should be about, we consider the core prosodic elements to be tone, stress, prosodic constituents, and intonation. First, as said above, tones are segments, following their analysis as autonomous segments in autosegmental phonology (Goldsmith 1976a). They have a variety of functions, as discussed in detail in chapters 4 and 6. Second, stress, which is not a segment, is a manifestation of a headed prosodic constituent, the foot. The intrinsic link between feet and word stress accounts for its ‘obligatoriness’ (Hyman 2006), on the assumption that while specific tones, vowels, and consonants may or may not appear in any one word of a language, prosodic parsing is obligatory (Selkirk 1984; Nespor and Vogel 1986; cf. the ‘Strict Layer Hypothesis’). For many languages, contrastive secondary stress will additionally require a headed phonological word, as in English, which differentiates between the stress patterns of ˈcatamaˌran and ˌhullabaˈloo, for example. Further increases in prominence can be created by the presence of intonational pitch accents on some stressed syllables. Many researchers assume that stress exists above the level of the word as a gradient property correlating with the rank of phonological constituents. In this view, increasing levels of stress, or of prominence, function as the heads of prosodic phrases and guide the assignment of pitch accents to word-stressed syllables (for recent positions, see e.g. Cole et al. 2019; Kratzer and Selkirk 2020; and references in both of these). Other analyses describe pitch accent assignment independently of any stress levels beyond the word, following Bolinger’s (1958: 111) conclusion that ‘pitch and stress are phonemically independent’ (cf. Lieberman 1965; Gussenhoven 2011). Chapter 5 explicitly keeps its options open and reports cases of phrasal stress for languages that may not have word stress. Also, chapter 10 makes the point that the phonetic properties of pitch-accented syllables in West Germanic languages are typically of the same kind as those that differentiate stressed from unstressed syllables at the word level, and appear to enhance the word stress (e.g. Beckman and Cohen 2000).

The third element, prosodic constituency, was already implicated in the above comments on stress. Prosodic constituents form an increasingly encompassing, hierarchical set. Not all of these are recognized by all researchers, but a full set ranges from morae (μ) to utterances (υ), with syllables (σ), feet (Ft), phonological words (ω), clitic groups (CG), accentual phrases (α, aka AP), phonological phrases (φ, aka PP), intermediate phrases (ip), and intonational phrases (ι, aka IP) in between. Timing variation conditioned by certain constituents in this prosodic hierarchy may, in part, be responsible for perceptions of language-specific rhythmicity (see chapter 11). Languages appear to skip these prosodic ranks as a matter of course, meaning that not all of these constituents are referred to by grammatical generalizations in all languages. Truth be told, it is more difficult to show that a constituent has no reality at all than to show it has some role to play; no convincing empirical case has been made for the absence of syllables in any language, for instance (cf. Hyman 2015). Languages with contrastive syllabification within the morphological word are usually claimed to have ω’s, where these constitute syllabification domains.
For example, English ˈX-Parts and ˈexperts respectively have an aspirated and an unaspirated [p]. The first item can on that basis be argued to have two ω’s, (X)ω (Parts)ω (contrasting with a single ω for the second item, cf. Nespor and Vogel 1986: 137). Further up the hierarchy, prosodic constituents often define the tonal structure, like α’s rejecting more than one pitch accent in their domain, or any constituent defining the distribution of tones, as in the case of the α-initial boundary tones
of Northern Bizkaian Basque or the ι-final ones in Bengali, Irish, Lusoga, Mawng, or Turkish. Because not all constituents may be in use by the grammar, researchers may find themselves free to pick a rank for the phonologically active constituent. For instance, the φ maximally has one pitch accent in many Indo-European languages spoken in India, but, since it would appear to be the only constituent between the ω and the ι, either α or ip might in principle have been used instead of φ. Besides such indeterminacy, rival accounts may exist for the same language. For instance, West Germanic languages have been described with as well as without ip’s, and only continued research can decide which description is to be preferred.

The fourth element, ‘intonation’, is a formal as well as a functional concept. Ladd’s (2008b: 4) definition identifies three properties:

The use of suprasegmental phonetic features to convey ‘postlexical’ or sentence-level pragmatic meanings in a linguistically structured way. (italics original)
The ‘form’ aspects here are the restriction to linguistically structured ways and the restriction to suprasegmental features. The first restriction says that the topic concerns phonologically encoded morphemes. The second restriction excludes morphemes that are exclusively encoded with the help of spectral features, i.e. with the help of vowels and consonants, such as question particles, focus-related morphemes, and modal adverbs with functions similar to those typically encoded by intonational melodies. As for the functional aspect, ‘sentence-level meanings’ are those that do not arise from the lexicon, thus excluding any form of lexically or morphologically encoded meaning. These emphatically include the intonational pitch accents of a language like English (even though Liberman 1975, quite appropriately, referred to them as an ‘intonational lexicon’ to bring out the fact that they form a collection of post-lexical tonal morphemes with discoursal meanings). The additional reference to ‘pragmatic’ meaning allows vowel quantity to be part of ‘intonation’. Final vowel lengthening in Shekgalagari signals listing intonation, while absence of penultimate vowel lengthening marks hortative and imperative sentences (Hyman and Monaka 2011). At the same time, the inclusion of ‘pragmatic’ may pre-judge the issue of the kinds of meaning that post-lexical tone structures can express. Because prosodic phrasing may affect pitch accent location, and because prosodic phrasing reflects syntactic phrasing, syntactic effects may arise from the way utterances are prosodically phrased (Nespor and Vogel 1986: 301). For instance, to borrow an example from chapter 19, the accent locations in CHInese GARden indicate that this is a noun phrase, but the accentuation of MAINland ChiNESE GARden Festival signals the phrasal status of Mainland Chinese, thus giving an interpretation whereby the festival features international garden design, as opposed to exclusively focusing on garden design in China at an exhibition on some mainland. More directly syntactic uses of tone are mentioned for Yanbian Korean (chapter 24) and Persian (chapter 14).
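As a schematic recap of the constituent hierarchy introduced at the start of this subsection, the following sketch encodes the ranks from mora to utterance and checks a toy parse against the strict version of the Strict Layer Hypothesis, under which every constituent immediately dominates only constituents of the next rank down. The rank list and the example parse are illustrative assumptions, not analyses from this handbook, and, as noted above, languages routinely relax the no-skipping requirement.

```python
# Illustrative sketch of the prosodic hierarchy discussed above. The rank
# ordering follows the chapter (mora to utterance); the example parse is a
# made-up toy, not an analysis from the handbook.
from dataclasses import dataclass, field

# Ranks from smallest to largest: mora, syllable, foot, phonological word,
# clitic group, accentual phrase, phonological phrase, intermediate phrase,
# intonational phrase, utterance.
RANKS = ["μ", "σ", "Ft", "ω", "CG", "α", "φ", "ip", "ι", "υ"]
RANK_OF = {r: i for i, r in enumerate(RANKS)}

@dataclass
class Constituent:
    rank: str
    children: list = field(default_factory=list)

def strict_layer_ok(node: Constituent) -> bool:
    """True if every constituent immediately dominates only constituents
    of the next rank down (the strict reading of the Strict Layer
    Hypothesis; languages often skip ranks, as the chapter notes)."""
    expected = RANK_OF[node.rank] - 1
    for child in node.children:
        if RANK_OF[child.rank] != expected or not strict_layer_ok(child):
            return False
    return True

# Toy parse: one foot of two syllables, each containing one mora.
foot = Constituent("Ft", [
    Constituent("σ", [Constituent("μ")]),
    Constituent("σ", [Constituent("μ")]),
])
print(strict_layer_ok(foot))  # True: no rank is skipped

# A word immediately dominating a syllable skips the foot rank,
# so it fails the strict check.
word = Constituent("ω", [Constituent("σ", [Constituent("μ")])])
print(strict_layer_ok(word))  # False
```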
1.3.3 Some terminological ambiguities

Partly because of shifts in conceptions, reports of prosodic research can be treacherous in their use of technical terms, notably ‘pitch accent’, ‘phrase accent’, and ‘compositionality’. To tackle the first of these, Pierrehumbert (1980) distinguished ‘boundary tones’, which align
with the boundaries of prosodic domains, from ‘central tones’, which are located inside those domains. Boundary tones are typically non-lexical, while central tones have either a lexical or an intonational function. Central tones are frequently referred to as ‘accents’ or ‘pitch accents’, but those terms have been used in other meanings too. Quite apart from its evidently different meaning of a type of pronunciation characteristic of a regional or social language variety, there are various prosodic meanings of ‘accent’, ranging from that of a label indicating the location of some tone, as in Goldsmith (1976a), to a phonetic or phonological feature that is uniquely found in a specific syllable of the word, such as duration, bimoraicity, or tone, as in van der Hulst (2014a). There would appear to be at least four meanings of the term ‘pitch accent’. One is the occurrence of a maximum of a single instance of distinctive tone per domain, as in Barasana (chapters 4 and 29), Tunebo (chapter 29), tonal varieties of Japanese, and Bengali, whereby the location of the contrast may be fixed, like the word-initial syllable in Tunebo. Second, ‘pitch accent’ may refer to a lexically distinctive tone contrast in a syllable with word stress, often restricted to word stress in a specific location in the word, like the first in South Slavic languages (chapter 15), a non-final one in Swedish (chapter 18), and the last in Ma’ya (chapter 25). Other cases are the Goizueta and Leitza varieties of Basque, Franconian German (including Limburgish), and Norwegian. The Lithuanian case is presented with some reservations in chapter 15. These two meanings are often collapsed in discussions, particularly if the tone contrasts are binary, as in all languages mentioned here except Ma’ya. Third, ‘pitch accent’ is used for the intonational tones that are inserted in the accented syllables in many Indo-European languages, whereby there may be more than one pitch accent in a domain. This usage derives from Bolinger’s (1958) article arguing for the phonological independence of tone and word stress and was boosted by Pierrehumbert’s (1980) adoption of the term. A factor favouring the use of ‘pitch accent’ in all three types of central tone mentioned so far is the existence of generalizations about their location. If they are deleted in some morphosyntactic or phonological contexts, descriptions may prefer stating that the accent is deleted to stating that the tones are deleted. In Tokyo Japanese, these two ways of putting things amount to the same thing, since only a single tone option is available. However, syllables that must be provided with tone in an English sentence, i.e. its accented syllables, each have a set of tones from which a choice is to be made, which makes a description in terms of tone insertion or deletion more cumbersome. Ladd (1980) coined the term ‘deaccenting’ to refer to the removal of the option of tonal insertion in a given syllable. Here, ‘accent’ is often used to refer to the location and ‘pitch accent’ to the tones. Finally, a fourth use of ‘pitch accent’ is a word prominence that is exclusively realized by f0, as in tonal varieties of Japanese. In this sense, Italian and English have ‘non-pitch accents’ (aka ‘stress accents’), because an accented stressed syllable will have durational and spectral features distinguishing it from unstressed syllables, besides f0 features distinguishing it from unaccented stressed syllables (Beckman 1986; see also chapter 10).
Of these different usages, the Bolinger–Pierrehumbert one has perhaps been most widely accepted. The term ‘phrase accent’ has different meanings, too. Pierrehumbert (1980) first used it to refer to a tone or tones between the last pitch accent and the boundary tone at the right edge of the ι. In Swedish, this description applies to a H tone that marks the end of the focused constituent (the ‘sentence accent’; Bruce 1977), which may be flanked by a lexical pitch accent on its left and a final boundary tone on its right. Pierrehumbert (1980) introduced L- and H- in this postnuclear position to provide greater melodic scope for the English
nuclear pitch accent as compared to prenuclear ones. Two new meanings arose from theoretical proposals about the status of these English tones. First, Beckman and Pierrehumbert (1986) reanalysed the L- and H- phrase accents as the final boundary tones of a new prosodic constituent, the ip, ranking immediately below the IP. The second redefinition is that of a postnuclear tone in any language that has an association with a tone-bearing unit, typically a stressed syllable, which configuration is otherwise reserved for the starred tone of a pitch accent (Grice et al. 2000). In this second sense, the phrase accent retains the property of being final in the phrase as well as that of an accent, because it will typically be associated with the last available stressed syllable. Since there is no guarantee that what have been analysed as boundary tones of the ip always have an association with a stressed syllable and vice versa, it is important to keep these meanings apart.

A final terminological comment concerns ‘compositionality’, which refers to the transparent contributions of individual morphemes to the meaning of a linguistic expression. In intonation the term has been used to refer to the more specific assumption that each of the morphemes must consist of a single phonological tone, a position frequently attributed to Pierrehumbert and Hirschberg (1990). In the latter interpretation, all alternative analyses are ‘non-compositional’; in the more general sense, that label only applies to Liberman and Sag’s (1974) proposal that the sentence-wide melodies are indivisible morphemes, like L*+H L* H- H% (to use Pierrehumbert’s later analysis), which they analyse as meaning ‘contradiction’.
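A reader new to the notation may find it helpful that the diacritics in such tone strings themselves encode the categories discussed in this subsection: in Pierrehumbert-style transcription, a star marks (a tone of) a pitch accent, a trailing hyphen a phrase accent, and a per cent sign a boundary tone. The sketch below applies this convention to the ‘contradiction’ melody cited above; it is a notational illustration only, not an analysis tool from the handbook.

```python
# Classify tones in a Pierrehumbert-style melody by their diacritics:
# "*" marks a pitch accent, "-" a phrase accent, "%" a boundary tone.
# Purely a notational illustration of the conventions discussed above.
def classify(tone: str) -> str:
    if "*" in tone:
        return "pitch accent"
    if tone.endswith("-"):
        return "phrase accent"
    if tone.endswith("%"):
        return "boundary tone"
    return "unknown"

# The 'contradiction' contour cited above, in Pierrehumbert's analysis.
melody = "L*+H L* H- H%"
for tone in melody.split():
    print(f"{tone}: {classify(tone)}")
# L*+H: pitch accent
# L*: pitch accent
# H-: phrase accent
# H%: boundary tone
```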
1.4 The structure of the handbook

Part I, ‘Fundamentals of Language Prosody’, lays out two fundamental prerequisites for language prosody research. Chapter 2 (Taehong Cho and Doris Mücke) sets out the available measurement techniques, while emphasizing the integrity of speech and the interaction between prosodic and supralaryngeal data. Chapter 3 (Oliver Niebuhr, Henning Reetz, Jonathan Barnes, and Alan C. L. Yu) surveys the mechanics of f0 perception in acoustic signals generally, as well as pitch perception in its dependence on aspects of speech, including a discussion of the relation between the visually observable f0 tracks in the speech signal and the pitch contours as perceived in speech.

Part II, ‘Prosody and Linguistic Structure’, contains five chapters devoted to structural aspects of prosody. Chapter 4 (Larry M. Hyman and William R. Leben) is a wide-ranging survey of lexical and grammatical tone systems, juxtaposing data from geographically widely dispersed areas. Chapter 5 (Matthew K. Gordon and Harry van der Hulst) is on stress systems and shows that languages with stress vary in the extent to which stress locations are rule governed as opposed to lexically listed; it also presents a typological survey of stress systems and their research issues. The autosegmental-metrical model, which describes sentence prosody in terms of tones and their locations in the prosodic structure, is discussed in chapter 6 (Amalia Arvaniti and Janet Fletcher), with explanatory references to the analysis of American English by Janet Pierrehumbert and Mary Beckman (MAE_ToBI) as well as other languages. Chapter 7 (John J. McCarthy) summarizes the way phonological constituents such as the syllable, the foot, and the phonological word may be the exponents of morphological categories. Finally, chapter 8 (Wendy Sandler, Diane Lillo-Martin, Svetlana Dachkovsky, and Ronice de Quadros) dives into the syntax versus prosody debate and
shows how non-manual markers of information structure and wh-questions in unrelated sign languages are prosodic in nature.

Part III, ‘Prosody in Speech Production’, contains three survey chapters on reflexes of prosodic structure in the speech signal, dealing with tones, stress, and rhythm. Chapter 9 (Jonathan Barnes, Hansjörg Mixdorff, and Oliver Niebuhr) tackles tonal phonetics covering both lexical and post-lexical tone. The chapter keeps a close eye on the difference between potentially speaker-serving strategies, such as tonal coarticulation, and potentially hearer-serving strategies aimed at sharpening up pitch contrasts. Chapter 10 (Vincent J. van Heuven and Alice Turk) surveys cues to word stress and sentence accents, which are not restricted to variation in the classic suprasegmental features (cf. §1.3.1) but appear in spectral properties, as derivable from hyperarticulation. Chapter 11 (Laurence White and Zofia Malisz) evaluates the claim that languages fall into rhythm groups based on the timing of morae, syllables, or stresses. As an alternative to temporal regularity of these lower-end hierarchical constituents, segmental properties, such as consonant clusters and reduced vowels, have been hypothesized to be the determinants of rhythm. Regardless of whether they are, these have figured in a number of ‘rhythm measures’ by which languages have been characterized.

Part IV, ‘Prosody across the World’, consists of 18 surveys of the prosodic facts and prosody research agendas for the languages of the world. We felt that a geographical approach would best accommodate the varying regional densities of language families and the varying intensities of research efforts across the regions of the world. As an added benefit, a geographical approach gives scope to identifying areal phenomena. Given our choice, the distribution of languages reflects the situation before the European expansion of the fifteenth and later centuries, which allowed us to include varieties of European languages spoken outside Europe in the chapters covering Europe. One way or another, our geographical approach led to the groupings shown in Map 1.1 (see plate section), which identifies geographical areas by chapter title and number. The core topics addressed in these chapters are word stress, lexical tone, and intonation, plus any interactions of these phenomena with other aspects of phonology, including voice quality, non-tonal segmental aspects, prosodic phrasing, or morphosyntax. For word stress, these notably involve rhyme structures. They may have effects on intonation, as in Finnish and Estonian (chapter 15, Maciej Karpiński, Bistra Andreeva, Eva Liina Asu, Anna Daugavet, Štefan Beňuš, and Katalin Mády), and more commonly on stress locations, as in the case of Maltese (chapter 16, Mariapaola D’Imperio, Barbara Gili Fivela, Mary Baltazani, Brechtje Post, and Alexandra Vella). Derivation-insensitive, morphemic effects on stress are pervasive in Australian languages, which tend to place the stress at the left edge of the root (chapter 26, Brett Baker, Mark Donohue, and Janet Fletcher). Many of the languages dealt with in chapter 14 (Anastasia Karlsson, Güliz Güneş, Hamed Rahmani, and Sun-Ah Jun) have no stress and the same goes for those in chapter 25 (Nikolaus P. Himmelmann and Daniel Kaufman), which goes some way towards correcting an impressionistic over-reporting of stress in the earlier literature.
Phonological constraints on the distribution of lexical tone may be imposed by segmental, metrical, syllabic, or phrasal structures. Interactions between tones and voice quality are notably present in South East Asia (chapter 23, by Marc Brunelle, James Kirby, Alexis Michaud, and Justin Watkins) and Mesoamerica (chapter 28, by Christian DiCanio and Ryan Bennett). In varieties of Chinese, tone contrasts are reduced by a coda glottal stop (chapter 22, by Jie Zhang, San Duanmu, and Yiya Chen), and in the Chadic
language Musgu, tones are targeted by depressor as well as raiser consonants (chapter 13, Sam Hellmuth and Mary Pearce). Swedish allows lexical tone in stressed syllables only, while in most German tonal dialects it is additionally restricted to syllable rhymes with two sonorant morae (chapter 18, Tomas Riad and Jörg Peters), as it is in Somali and Lithuanian. Besides the restriction on the number of distinctive tones within a phrase (see §1.3), phrase length may impose constraints on the number of boundary tones that are realized. Thus, in Seoul Korean one or two tones out of four will not be realized if the α has three syllables or fewer (chapter 24, by Sun-Ah Jun and Haruo Kubozono). Tones are frequent exponents of morphological categories (‘grammatical tone’) in the languages spoken in sub-Saharan Africa, as illustrated in chapter 12 (Larry M. Hyman, Hannah Sande, Florian Lionnet, Nicholas Rolle, and Emily Clem), and North America, as discussed in chapter 27 (Gabriela Caballero and Matthew K. Gordon), for instance. While all chapters discuss prosodic phrasing, the number of phrases involved in constructing the tone string varies from three in varieties of Basque (the accentual phrase, the intermediate phrase, and the intonational phrase) via two in Catalan and Spanish (which lack an accentual phrase) to one in Portuguese (which only uses an intonational phrase in the construction of the tonal representation) (chapter 17, by Sónia Frota, Pilar Prieto, and Gorka Elordieta). In chapter 20 (Kristján Árnason, Anja Arnhold, Ailbhe Ní Chasaide, Nicole Dehé, Amelie Dorn, and Osahito Miyaoka), varying degrees of integration of clitics into their host in Central Alaskan Yupik create different foot and syllable structures, with effects on stress and segments.

Finally, as expected, prosodic diversity varies considerably from chapter to chapter. South America is home to 53 language families and shows a variety of languages with stress, with tone, and with both stress and tone (chapter 29, by Thiago Costa Chacon and Fernando O. de Carvalho). Two chapters offer counterpoints to this situation. Chapter 19 (Martine Grice, James Sneed German, and Paul Warren) deals with English but presents a number of closely related varieties of this language under the heading of Mainstream English Varieties as well as a number of contact languages, known as ‘New Englishes’, which display a range of typologically different phenomena. And chapter 21 (Aditi Lahiri and Holly J. Kennard) shows how the two largest language families spoken on the Indian subcontinent, Dravidian and Indo-European, present very similar intonation systems despite their genetic divergence.

Part V, ‘Prosody in Communication’, starts with a survey of approaches to intonational meaning, contrasting detailed accounts of limited data with broader accounts of entire inventories of melodies, with progress being argued to follow from the integration of these approaches (chapter 30, by Matthijs Westera, Daniel Goodhue, and Carlos Gussenhoven). In chapter 31 (Frank Kügler and Sasha Calhoun), prosodic ways of marking focus are split into those that rely on the manipulation of prominent syllables (stress- or pitch-accent-based cues) and those that rely on phrase- or register-based cues, with discussion of the prosodic marking of a number of aspects of information structure, such as topic, comment, and givenness.
Chapter 32 (Julia Hirschberg, Štefan Beňuš, Agustín Gravano, and Rivka Levitan) addresses the importance of prosody beyond the utterance level. It surveys the role of prosody in characterizing the dynamics of interpersonal spoken interactions, such as turn-taking and entrainment, and in providing indications of each speaker’s state of mind, such as deception versus truthfulness. Chapter 33 (Marc Swerts and Emiel Krahmer) zooms in on how visual prosody relates to auditory prosody in communication and on how the combined use of visual and auditory prosody varies across cultures. Finally, chapter 34 (Diana Van Lancker Sidtis and Seung-yun Yang) reviews the underlying causes of
communication-related prosodic abnormalities in adults with brain damage and the challenges facing evaluation of prosodic abilities, and surveys treatments for certain prosodic deficiencies together with evidence for their efficacy. Part VI, ‘Prosody and Language Processing’, examines both the processing of prosodic information and the role of prosody in language processing. Chapter 35 (Joseph C.Y. Lau, Zilong Xie, Bharath Chandrasekaran, and Patrick C.M. Wong) reviews lesion, neuroimaging, and electrophysiological studies of the processing of linguistically relevant pitch patterns (e.g. lexical tones) and the influence of prosody on syntactic processing. It shows that what underlies linguistic pitch processing is not a specific lateralized area of the cerebral cortex, but an intricate neural network that spans the two hemispheres as well as the cortical and subcortical areas along the auditory pathway. Chapter 36 (James M. McQueen and Laura Dilley) discusses how the prosodic structure of an utterance constrains spoken-word recognition; the chapter also outlines a prosody-enriched Bayesian model of spoken-word recognition. Chapter 37 (Stefanie Shattuck-Hufnagel) summarizes evidence for the use of prosodic structure in speech production planning as part of a review of modern theories of prosody in the acoustic, articulatory, and psycholinguistic literature, and probes into the role of prosody in different models of speech production planning. Part VII, ‘Prosody and Language Acquisition’, reflects a growing body of research on prosodic development at both word and phrase level in first language (L1) acquisition. Chapter 38 (Paula Fikkert, Liquan Liu, and Mitsuhiko Ota) surveys the developmental stages during infancy and early childhood in the perception and production of lexical tone, Japanese pitch accent, and word stress, and summarizes work on the relationship between perception and production, the representation of word prosody, and the factors driving the development of word prosody. Chapter 39 (Aoju Chen, Núria Esteve-Gibert, Pilar Prieto, and Melissa A. Redford) presents developmental trajectories for the formal and functional properties of phrase-level prosody and reviews the factors that may explain why prosodic development is a gradual process across languages and why cross-linguistic differences nevertheless arise early. Chapter 40 (Judit Gervain, Anne Christophe, and Reiko Mazuka) summarizes empirical evidence for early sensitivity to prosody and discusses in depth how infants use prosodic information to bootstrap other aspects of language, in particular word segmentation, word order, syntactic structure, and word meaning. Chapter 41 (Melanie Soderstrom and Heather Bortfeld) reviews the primary prosodic characteristics of child-directed speech (CDS) (which is the input for early language development), considers sources of variation across culture and context, and examines the function of CDS for social and linguistic development. Focusing on pathological conditions, chapter 42 (Rhea Paul, Elizabeth Schoen Simmons, and James Mahshie) examines prosodic dysfunction in children within relatively common developmental disorders, such as autism spectrum disorder and developmental language disorder. It also outlines strategies for assessing and treating these prosodic deficits, many of which offer at least short-term improvements in both prosodic production and perception.
Moving on to research on the learning and teaching of prosody in a second language, chapter 43 (Allard Jongman and Annie Tremblay) surveys adult second language (L2) learners’ production, perception, and recognition of word prosody, and the role of L1 word prosody in L2 word recognition. Chapter 44 (Jürgen Trouvain and Bettina Braun) reviews current knowledge of adult L2 learners’ production and perception of phrasal-level prosody (i.e. intonation, phrasing, and timing) and their use of intonation for communicative
purposes, such as expressing information structure, illocutionary force, and affect. Chapter 45 (Dorothy M. Chun and John M. Levis) focuses on teaching prosody, mainly intonation, rhythm, and word stress, and the effectiveness of L2 instruction of prosody. Part VIII, ‘Prosody in Technology and the Arts’, begins with two chapters dealing with technological advances and best practices in automatic processing and labelling of prosody. Chapter 46 (Anton Batliner and Bernd Möbius) identifies changes in the role of prosody in automatic speech processing since the 1980s, focusing on two main aspects: power features and leverage features. Chapter 47 (Andrew Rosenberg and Mark Hasegawa-Johnson) examines major components of an automatic labelling system, illustrates the most important design decisions to be made on the basis of AuToBI, and discusses assessment of automatic prosody labelling and automatic assessment of human prosody. The third and fourth chapters are dedicated to art forms that interact with linguistic-prosodic structures. In chapter 48, Paul Kiparsky investigates the way metrical constraints define texts so as to fit the poetic forms of verse. He demonstrates their effects in various verse forms in a broad typological spectrum of languages and includes an account of the detailed constraints imposed by William Shakespeare’s meter. Chapter 49 (D. Robert Ladd and James Kirby) is the first typological treatment of the constraints musical melodies pose on linguistic tone. The authors show how both in Asia and in Africa constraints focus on transitions between notes and tones, rather than the notes or the defining pitches of the tones themselves.
1.5 Reflections and outlook This handbook needed to strike a balance between our desire to be comprehensive and the fact that, besides becoming available as an e-book, it was to be published as a single volume. We understood one side of this balance, comprehensiveness, not only in terms of topic coverage but also in the sense of representing the work of different authors and research groups from different disciplines. Moreover, in line with the usual objective of handbooks of this kind, we wanted to provide a research survey for specialists that would also be suitable as an introduction for beginning researchers in the field as well as for researchers with a non-prosodic background who wish to extend their research to prosody. That is, it should ideally present the state of the art and only secondarily provide a platform for new research results. The other side of the balance, size limitation, made us emphasize conciseness in our communications with authors in a way that frequently bordered on rampant unreasonableness, particularly when our suggestions came with requests for additional content. We are deeply grateful for the generosity and understanding with which authors responded to our suggestions for textual savings and changes and hope that this policy has not led to significant effacements of information or expository threads. The outline of the contents of the handbook in §1.4 passes over topics that have no dedicated chapter in this handbook. Some of these exclusions derive from our discussion of the definition of language prosody in §1.3. Here, we may think of laughter, filled pauses, more or less continuous articulatory settings characteristic of languages, nasal prosodies, syllable onset complexity, or focus-marking particles with no tone in them. All of these will interact with language prosody, but they are not focuses of this handbook. In other cases, various circumstances account for the absence of topics. Part II might have featured a chapter on
prosodic phrasing setting out its relation to syntactic structure, including the apparent motivations for deviating from phonological-morphosyntactic isomorphism. The relative lack of typological data here would have made it hard to discuss, say, the estimated frequencies with which prosodic constituents make their appearance in languages or the extent to which size constraints trump syntactic demands on phonological structure. The geographical approach in Part IV made us lose sight of creoles, languages that arose on the basis of mostly European languages in various parts of the world; a survey might well have addressed the debate on the status of creoles as a typological class (McWhorter 2018). Finally, a few topics do not feature in the handbook in view of recent overviews or volumes dedicated to them. In Part I, we decided to leave out a chapter on the physiological and anatomical aspects of speech production because of the existence of a number of surveys, notably Redford (2015). In Part VI, while the processing of prosody in language comprehension is briefly considered in chapter 35, we refer to Dahan (2015) for a more detailed discussion. Including a chapter on prosody in speech synthesis, a chapter on the role of prosody in social robotics, and a chapter on the link between music and speech prosody would have given Part VIII a larger coverage. Hirose and Tao (2015), Crumpton and Bethel (2016), and Heffner and Slevc (2015), however, provide welcome overviews of the state of the art on each of these topics. We hope that this survey will not only be useful for consultation on a wide range of information but also serve as a source of inspiration for tackling research questions that the 49 chapters have implicitly or explicitly highlighted through their multi-disciplinary lens.
Part I
Fundamentals of Language Prosody

Chapter 2
Articulatory Measures of Prosody

Taehong Cho and Doris Mücke
2.1 Introduction

Over the past few decades, theories of prosody have been developed that have broadened the topic in such a way that the term ‘prosody’ does not merely pertain to low-level realization of suprasegmental features such as f0, duration, and amplitude, but also concerns high-level prosodic structure (e.g. Beckman 1996; Shattuck-Hufnagel and Turk 1996; Keating 2006; Fletcher 2010; Cho 2016). Prosodic structure is assumed to have multiple functions, such as a delimitative function (e.g. a prosodic boundary marking), a culminative function (e.g. a prominence marking), and functions deriving from the distribution of tones at both lexical and post-lexical levels. It involves dynamic changes of articulation in the laryngeal and supralaryngeal system, often accompanied by prosodic strengthening—that is, hyperarticulation of phonetic segments to enhance paradigmatic contrasts by a more distinct articulation, and sonority expansion to enhance syntagmatic contrasts by increasing periodic energy radiated from the mouth (see Cho 2016 for a review). Under a broad definition of prosody, therefore, prosody research in speech production concerns the interplay between phonetics and prosodic structure (e.g. Mücke et al. 2014, 2017). It embraces issues related to how abstract prosodic structure influences the phonetic implementation by the laryngeal and supralaryngeal systems, and how higher-level prosodic structure may in turn be recoverable from or manifest in the variation in the phonetic realization. For example, a marking of a tonal event in the phonetic substance involves dynamic changes not only in the laryngeal system (regulating the vocal fold vibration to produce f0 contours) but also in the supralaryngeal system (regulating movements of articulators to produce consonants and vowels in the textual string). With the help of articulatory measuring techniques, the way these two systems are coordinated in the spatio-temporal dimension is directly observable, allowing various inferences about the role of the prosodic structure in this coordination to be made. This chapter introduces a number of modern articulatory measuring techniques, along with examples across languages indicating how each technique may be used or has been used on various aspects of prosody in the phonetics–prosody interplay.
These include (i) laryngoscopy and electroglottography (EGG) to study laryngeal events associated with vocal fold vibration; (ii) systems such as the magnetometer (electromagnetic articulography, EMA), electropalatography (EPG), and ultrasound systems for exploring supralaryngeal articulatory events; and (iii) aerodynamic measurement systems for recording oral/subglottal pressure and oral/nasal flow, and a device called the RIP (Respitrace inductive plethysmograph) for recording respiratory activities.
2.2 Experimental techniques

2.2.1 Laryngoscopy

Laryngoscopy allows a direct observation of the larynx. A fibreoptic nasal laryngoscopy system (Ladefoged 2003; Hirose 2010) contains a flexible tube with a bundle of optical fibres which may be inserted through the nose, while the lens at the end of the fibreoptic bundle is usually positioned near the tip of the epiglottis above the vocal folds. Before the insertion of the scope, surface anaesthesia may be applied to the nasal mucosa and to the epipharyngeal wall. The procedure is relatively invasive, requiring the presence of a physician during the experiment. A recent system for laryngoscopy provides high-speed motion pictures of the vibrating vocal folds with useful information about the laryngeal state and the glottal condition during phonation (e.g. Esling and Harris 2005; Edmondson and Esling 2006). The recording of laryngeal images by a laryngoscope is often made simultaneously with a recording of the electroglottographic and acoustic signals (see Hirose 2010: fig. 4.3). Because of its invasiveness and operating constraints, however, the use of laryngoscopy in phonetics research has been quite limited. In prosody research, a laryngoscope may be used to explore the laryngeal mechanisms for controlling f0 in connection with stress, tones, and phonation types. Lindblom (2009) discussed a fibrescopic study in Lindqvist-Gauffin (1972a, 1972b) that examined laryngeal behaviour during glottal stops and f0 changes for Swedish word accents. The fibrescopic data were interpreted as indicating that there may be three dimensions involved in controlling f0 and phonation types: glottal adduction-abduction, laryngealization (which involves the aryepiglottic folds), and activity of the vocalis muscle. With reference to more recent fibrescopic data (Edmondson and Esling 2006; Moisik and Esling 2007; Moisik 2008), Lindblom (2009) suggested that the glottal stop, creaky voice, and f0 lowering may involve the same kind of laryngealization to different degrees. Basing their argument on cross-linguistic fibrescopic data, Edmondson and Esling (2006) suggested that there are indeed different ‘valve’ mechanisms for controlling articulatory gestures that are responsible for cross-linguistic differences in tone, vocal register, and stress, and that languages may differ in choosing specific valve mechanisms. A fibreoptic laryngoscopic study that explores the interplay between phonetics and prosodic structure is found in Jun et al. (1998), who directly observed the changing glottal area in the case of disyllabic Korean words with different consonant types, with the aim of understanding the laryngeal states associated with vowel devoicing. A change in the glottal area was in fact shown to be conditioned by prosodic position (accentual phrase-initial vs. accentual phrase-medial), interpreted as gradient devoicing.
Further fibreoptic research might explore the interplay between phonetics and prosodic structure, in particular to study the connection between prosodic strengthening and laryngeal articulatory strengthening.
2.2.2 Electroglottography
The electroglottograph (EGG), also referred to as laryngograph, is a non-invasive device that allows for monitoring vocal fold vibration and the glottal condition during phonation (for more information see Laver 1980; Rothenberg and Mahshie 1988; Rothenberg 1992; Baken and Orlikoff 2000; d’Alessandro 2006; Hirose 2010; Mooshammer 2010). It estimates the contact area between the vocal folds during phonation by measuring changes in the transverse electrical impedance of the current between two electrodes across the larynx placed on the skin over both sides of the thyroid cartilage (Figure 2.1). Given that a glottis filled with air does not conduct electricity, the electrical impedance across the larynx is roughly negatively correlated with the contact area. EGG therefore not only provides an accurate estimation of f0 but also measures parameters related to the glottal condition, such as open quotient (OQ, the percentage of the open glottis interval relative to the duration of the full abduction–adduction cycle), contact or closed quotient (CQ, the percentage of the closed glottis interval relative to the duration of the full cycle), and skewness quotient (SQ, the ratio between the closing and opening durations). EGG signals are often obtained simultaneously with acoustic and airflow signals, so that the glottal condition can be holistically estimated. While readers are referred to d’Alessandro (2006) for a review of how voice source parameters (including those derived from EGG signals) may be used in prosody analysis, in what follows we will discuss a few cases in which EGG is used in exploring the interplay between phonetics and prosody.
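To make the quotient definitions above concrete, the following is a minimal sketch (not a routine from any published EGG toolbox) that computes OQ, CQ, and an SQ-like closing-to-opening ratio for a single sampled EGG cycle. The amplitude-criterion threshold of 0.35 and the landmark choices are illustrative assumptions; operational definitions differ across studies.

import numpy as np

def egg_quotients(cycle, criterion=0.35):
    """Quotients for one EGG cycle (higher values = more vocal fold contact).
    The glottis counts as 'closed' while the normalized contact signal
    exceeds the criterion level (an illustrative choice)."""
    x = (cycle - cycle.min()) / (cycle.max() - cycle.min())  # normalize to 0..1
    closed = x >= criterion
    cq = closed.mean()                              # closed (contact) quotient
    oq = 1.0 - cq                                   # open quotient (OQ + CQ = 1)
    i_on = np.argmax(closed)                        # first sample of contact
    i_off = len(x) - 1 - np.argmax(closed[::-1])    # last sample of contact
    i_peak = np.argmax(x)                           # maximum contact
    sq = (i_peak - i_on) / max(i_off - i_peak, 1)   # closing vs. opening time
    return oq, cq, sq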
Figure 2.1 Waveforms corresponding to vocal fold vibrations in electroglottography (examples by Phil Hoole at IPS Munich) for different voice qualities (normal, breathy, creaky, and loud voice). High values indicate increasing vocal fold contact. Photo taken at IfL Phonetics Lab, Cologne.
An EGG may be used to explore variation in voice quality strengthening as a function of prosodic structure. For example, Garellek (2014) explored the effects of pitch accent (phrase-level stress) and boundary strength on the voice quality of vowels in vowel-initial words in English and Spanish. Word-initial vowels under pitch accent were found to have an increase in EGG contact (as reflected in the CQ) in both English and Spanish, showing laryngealized voice quality. Garellek’s study of the phonetics–prosody interplay built on the assumption that fine-grained phonetic detail, this time in the articulatory dimension of the glottis, is modulated differently by different sources of prosodic strengthening (prominence vs. boundary). Interestingly, however, both languages showed a decrease in EGG contact at the beginning of a larger prosodic domain (e.g. intonational phrase-initial vs. word-initial). This runs counter to the general assumption that domain-initial segments are produced with a more forceful articulation (e.g. Fougeron 1999, 2001; Cho et al. 2014a) and that phrase-initial vowels are more frequently glottalized than phrase-medial ones (e.g. Dilley et al. 1996; Di Napoli 2015), which would result in an increase in EGG contact. Moreover, contrary to Garellek’s observation, Lancia et al. (2016) reported that vowel-initial words in German showed more EGG contact only when the initial syllable was unstressed, which indicates that more research is needed to understand this discrepancy from cross-linguistic perspectives. EGG was also used for investigating glottalization at phrase boundaries in Tuscan and Roman Italian, pointing to the fact that these glottal modifications are used as prosodic markers in a gradient fashion (Di Napoli 2015). Another EGG study that relates the glottal condition to prominence is Mooshammer (2010). It examined various parameters obtained from EGG signals in order to explore how word-level stress and sentence-level accent may be related to vocal effort in German. The author showed that a vowel produced with a global vocal effort (i.e. with increased loudness) was similar to a vowel with lexical stress at least in terms of two parameters, OQ and glottal pulse shape (obtained by applying a version of principal component analysis), independent of higher-level accent (due to focus). A focused vowel, on the other hand, was produced with a decrease in SQ compared to an unfocused vowel, showing a more symmetrical vocal pulse shape. To the extent that these results hold, lexical stress and accent in German may be marked by different glottal conditions. However, given that an accented vowel is in general produced with an increase in loudness, as has been found across languages (including German, e.g. Niebuhr 2010), further research is required to explore the exact relationship between vocal effort and accent, both of which apparently increase loudness.
2.3 Aerodynamic and respiratory movement measures

Aerodynamic devices most widely used for phonetic research use oral and nasal masks (often called Rothenberg masks, following Rothenberg 1973) through which the amount of oral/nasal flow can be obtained in a fairly non-invasive way. Intraoral pressure may be simultaneously obtained by inserting a small pressure tube between the lips inside the oral mask; this records the pressure of the air in the mouth (e.g. Ladefoged 2003; Demolin 2011).
One aerodynamic measure that is more directly related to prosody may be subglottal pressure, as sufficient subglottal pressure is required for initiating and maintaining vocal fold vibration (van den Berg 1958) and an increase in subglottal pressure is likely to result in an increase in loudness (sound pressure level) and f0 (e.g. Ladefoged and McKinney 1963; Lieberman 1966; Ladefoged 1967). It is not, however, easy to measure subglottal pressure directly; this involves either a tracheal puncture (i.e. inserting a pressure transducer needle in the trachea; see Ladefoged 2003: fig. 3) or inserting a rubber catheter with a small balloon through the nose and down into the oesophagus at the back of the trachea (e.g. Ladefoged and McKinney 1963). Non-invasive methods to estimate subglottal pressure have been developed by using intraoral pressure and volume flow (Rothenberg 1973; Smitheran and Hixon 1981; Löfqvist et al. 1982), but these have limited applicability in prosody research, because certain conditions (e.g. a CVCV context) must be met to obtain reliable data. Aerodynamic properties of speech sounds can be compared with respiratory activities such as lung volume, which may be obtained with a so-called RIP, or Respitrace inductive plethysmograph (e.g. Gelfer et al. 1987; Hixon and Hoit 2005; Fuchs et al. 2013, 2015). In this technique, subjects wear two elastic bands (approximately 10 cm wide vertically), one around the thoracic cavity (the rib cage) and one around the abdominal cavity (Figure 2.2). The bands expand and recoil as the volume of the thoracic and abdominal cavities changes during exhalation and inhalation, such that the electrical resistance of small wires attached to the bands (especially the upper band) is used to estimate the change in the lung volume during speech production.
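The mapping from band signals to an estimate of lung volume change is typically established by calibration. Below is a minimal sketch, assuming the common two-degrees-of-freedom model in which volume is approximated as a weighted sum of the rib-cage and abdomen signals, with weights fitted by least squares against a reference volume trace (e.g. from a spirometer) recorded during a calibration manoeuvre; all names and model details are illustrative.

import numpy as np

def calibrate_rip(rc, ab, v_ref):
    """Fit v_ref ~ a*rc + b*ab + c by ordinary least squares.
    rc, ab: rib-cage and abdomen band signals; v_ref: reference volume."""
    X = np.column_stack([rc, ab, np.ones_like(rc)])
    coeffs, *_ = np.linalg.lstsq(X, v_ref, rcond=None)
    return coeffs                          # the weights (a, b, c)

def estimate_volume(rc, ab, coeffs):
    """Apply the calibrated weights to band signals from speech recordings."""
    a, b, c = coeffs
    return a * rc + b * ab + c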
Figure 2.2 Volume of the thoracic and abdominal cavities in a Respitrace inductive plethysmograph during production of the sentence Er malt Tania, aber nicht Sonja (‘He paints Tanja, but not Sonja’), with inhalation and exhalation phases. (Photo by Susanne Fuchs at Leibniz Zentrum, ZAS, Berlin.)
Gelfer et al. (1987) used the subglottal pressure measurements (Ps) obtained directly from the trachea to examine the nature of global f0 declination (e.g. Pierrehumbert 1979; Cooper and Sorensen 1981). Based on comparison of Ps with f0 and estimated lung volume (obtained with a RIP), Gelfer et al. suggested that Ps is a controlled variable in sentence production, and f0 declination comes about as a consequence of controlling Ps. Most recently, however, Fuchs et al. (2015) assessed respiratory contributions to f0 declination in German by using the same RIP technique, and suggested that f0 declination may not stem entirely from physiological constraints on the respiratory system but may additionally be modulated by speech planning as well as by communicative constraints, as suggested in Fuchs et al. (2013). This finding is in line with Arvaniti and Ladd (2009) for Greek. Some researchers have measured airflow and intraoral pressure as an index of respiratory force, because they are closely correlated with subglottal pressure. For instance, oral flow (usually observed during a vowel or a continuant consonant) and oral pressure (usually observed during a consonant) are often interpreted as being correlated with the degree of prominence (e.g. Ladefoged 1967, 2003). Exploring boundary-related strengthening effects on the production of three-way contrastive stops in Korean (lenis, fortis, aspirated; e.g. Cho et al. 2002), for example, Cho and Jun (2000) observed systematic variation of airflow measured just after the release of the stop as a function of boundary strength. However, the detailed pattern was better understood as supporting the three-way obstruent contrast. This again implies that variation in the respiratory force as a function of prosodic structure is further modulated in a language-specific way, in this case by the segmental phonology of the language. In a similar vein, nasal flow has been investigated by researchers in an effort to understand how the amount of nasal flow produced with nasal sounds may be regulated by prosodic structure (e.g. Jun 1996; Fougeron and Keating 1997; Gordon 1997; Fougeron 2001). From an articulatory point of view, Fougeron (2001) hypothesized that the articulatory force associated with prosodic strengthening may have the effect of elevating the velum, resulting in a reduction of nasal flow. Results from French (Fougeron 2001), Estonian (Gordon 1997), and English (Fougeron and Keating 1997) indeed show that nasal flow tends to be reduced in domain-initial position, in line with Fougeron’s articulatory strengthening-based account. (See Cho et al. 2017 for a suggestion that reduced nasality for the nasal consonant may be interpreted in terms of paradigmatic vs. syntagmatic enhancement due to prominence and domain-initial strengthening, respectively.) These studies again indicate that an examination of nasal flow would provide useful data on how low-level segmental realization is conditioned by higher-order prosodic structural factors.
2.4 Point-tracking techniques for articulatory movements

Point-tracking techniques allow for the measuring of positions and movements of articulators over time by attaching small pellets (or sensors) to flesh points of individual articulators. The point-tracking systems that have been used in phonetic research include the magnetometer, the X-ray microbeam, and the Optotrak (an optoelectronic system).
The point-tracking systems track movements of individual articulators, including the upper and lower lips and the jaw as well as the tongue and velum, over time during speech production, although the optoelectronic system is limited to an external use (i.e. for tracking the lips and the jaw; see Stone 2010 for a brief summary of each of these systems). Among the point-tracking systems, the electromagnetic articulograph (EMA), also generally referred to as magnetometer, has been steadily developed and more widely used in recent years than the other two techniques, because it is less costly and more accessible, and provides a more rapid tracking rate than the X-ray microbeam system. The EMA system uses alternating electromagnetic fields that are generated from multiple transmitters placed around the subject’s head. The early two-dimensional magnetometer system (e.g. Perkell et al. 1992) used three transmitters, but Carstens’ most recent system (AG 501; see Hoole 2014) uses nine transmitters that provide multi-dimensional articulatory data. In this technique, a number of receiver coils (sensors, as small as 2 × 3 mm) are glued on the articulators (Figure 2.3), usually along the midsagittal plane, but capturing more dimensions is also possible in the three-dimensional systems. Note that the NDI Wave System (Berry 2011) uses sensors containing multiple coils (e.g. Tilsen 2017; Shaw and Kawahara 2018). The basic principle is that the strength of the electromagnetic field in a receiver (sensor) is inversely related to its distance from each of the transmitters around the head at different frequencies. Based on this principle, the system calculates the sensor’s voltages at different frequencies and obtains the distances of each sensor from the transmitters, allowing the positions of the sensors to be estimated in the two-dimensional XY or the three-dimensional XYZ coordinate plane plus two angular coordinates (see Zhang et al. 1999 for details of the technical aspects of two-dimensional EMA; Hoole and Zierdt 2010 and Hoole 2014 for EMA systems that use three or more dimensions; and Stone 2010 for a more general discussion of EMA). Given its high temporal and spatial resolution (at a sample rate of up to 1,250 Hz in a recent EMA system), an EMA is particularly useful in investigating dynamic aspects of overlapping vocal tract actions that are coactive over time (for quantitative analysis of EMA data see Danner et al. 2018; Tomaschek et al. 2018; Wieling 2018).
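The distance-from-field-strength principle can be illustrated with a deliberately simplified sketch. Real systems use calibrated, orientation-dependent field models and more transmitters; here we assume an idealized isotropic field whose amplitude falls off as 1/d^3 and recover one sensor's position by nonlinear least squares. The transmitter layout and all values are hypothetical.

import numpy as np
from scipy.optimize import least_squares

# Hypothetical transmitter positions around the head (metres)
transmitters = np.array([
    [0.3, 0.0, 0.0], [-0.3, 0.0, 0.0], [0.0, 0.3, 0.0],
    [0.0, -0.3, 0.0], [0.0, 0.0, 0.3], [0.0, 0.0, -0.3],
])

def predicted_amplitudes(pos):
    """Idealized field strength at position pos: amplitude ~ 1/distance^3."""
    d = np.linalg.norm(transmitters - pos, axis=1)
    return 1.0 / d**3

true_pos = np.array([0.02, -0.01, 0.05])      # e.g. a tongue sensor
measured = predicted_amplitudes(true_pos)     # what the system would record

# Solve for the position whose predicted amplitudes match the measurements
fit = least_squares(lambda p: predicted_amplitudes(p) - measured,
                    x0=np.zeros(3))
print(fit.x)                                  # close to [0.02, -0.01, 0.05]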
Figure 2.3 Lip aperture in electromagnetic articulography. High values indicate that lips are open during vowel production. Trajectories are longer, faster, and more displaced in target words in contrastive focus (lighter grey lines) compared to out of focus. (Photo by Fabian Stürtz at IfL Phonetics Lab, Cologne.)
In prosody research, an EMA is particularly useful in the exploration of the phonetics–prosody interplay in the case of tone–segment alignment and articulatory signatures of prosodic structure. A few examples are provided below. It is worth noting, however, that before the EMA system was widely used, prosodically conditioned dynamic aspects of articulation had been investigated by researchers using the X-ray microbeam (e.g. Browman and Goldstein 1995; de Jong 1995; Erickson 1995) and an optoelectronic device (e.g. Edwards et al. 1991; Beckman and Edwards 1994). An EMA may be used to investigate the timing relation of tonal gestures with supralaryngeal articulatory gestures (D’Imperio et al. 2007b). Tone gestures are defined as dynamic movements in f0 space that can be coordinated with articulatory actions of the oral tract within the task dynamics framework (Gao 2009; Mücke et al. 2014). For example, Katsika et al. (2014) showed that boundary tones in Greek are lawfully timed with the vocalic gesture of the pre-boundary (final) syllable, showing an anti-phase (sequential) coupling relationship between the tone and the vocalic gesture in interaction with stress distribution over the phrase-final word (see also Katsika 2016 for related data in Greek). As for the tone–gesture alignment associated with pitch accent, Mücke et al. (2012) reported that Catalan employs an in-phase coupling relation (i.e. roughly simultaneous initiation of gestures) between the tone and the vocalic gesture with a nuclear pitch accent LH. By contrast, the tone–gesture alignment in a language with a delayed nuclear LH rise, such as German, is more complex (e.g. L and H may compete to be in phase with the vocalic gesture, with in-phase L inducing a delayed peak). Moreover, quite a few studies have shown that tone–segment alignment may be captured better with articulatory gestural landmarks than with acoustic ones, in line with a gestural account of tone–segment alignment (e.g. Mücke et al. 2009, 2012; Niemann et al. 2014; see also Gao 2009 for lexical tones in Mandarin). More broadly, an EMA is a device that provides useful information about the nature of tone–segment coordination, allowing various assumptions of the segmental anchoring hypothesis to be tested (see chapter 6). EMA has also been extensively used to investigate supralaryngeal articulatory characteristics of prosodic strengthening in connection with prosodic structure. Results of EMA studies have shown that articulation is systematically modified by prominence, largely in such a way as to enhance paradigmatic contrast (e.g. Harrington et al. 2000; Cho 2005, 2006a; see de Jong 1995 for similar results obtained with the X-ray microbeam system). On a related point, Cho (2005) showed that strengthening of [i] by means of adjustments of the tongue position manifested different kinematic signatures of the dual function (boundary vs. prominence marking) of prosodic structure (see Tabain 2003 and Tabain and Perrier 2005 for relevant EMA data in French; Mücke and Grice 2014 in German; and Cho et al. 2016 in Korean). A series of EMA studies has also examined the nature of supralaryngeal articulation at prosodic junctures, especially in connection with phrase-final (pre-boundary) lengthening (Edwards et al. 1991; Byrd 2000; Byrd et al. 2000, 2006; Byrd and Saltzman 2003; Krivokapić 2007; Byrd and Riggs 2008; Krivokapić and Byrd 2012; Cho et al. 2014b; Katsika 2016).
These kinematic data have often been interpreted in terms of gesture types associated with different aspects of the prosodic structure, like ‘π-gestures’ for slowing down the local tempo at boundaries and ‘μ-gestures’ for temporal and spatial variations under stress and accent. Krivokapić et al. (2017) have extended the analysis of vocal tract actions to manual gestures, combining an EMA with motion capture, and have shown that manual gestures are tightly coordinated with pitch-accented syllables and boundaries. Both vocal tract actions and manual gestures (pointing gestures) undergo lengthening under prominence. Scarborough et al. (2009) is an example of the combined use of an EMA and an optoelectronic system to examine the relationship between articulatory and visual (facial) cues in signalling lexical and phrasal stress in English.
Together, these studies have provided insights into the theory of the phonetics–prosody interplay in general, and the dynamic approaches have improved our understanding of the human communicative sound system (e.g. Mücke et al. 2014; see also recent EMA studies on other aspects of speech dynamics, such as Hermes et al. 2017; Pastätter and Pouplier 2017).
2.4.1 Ultrasound

Ultrasound imaging (also referred to as (ultra)sonography) is a non-invasive technique used in phonetic research to produce dynamic images of the sagittal tongue shape, which allows for investigating vocal tract characteristics, tongue shape, and tongue motion over time (for reviews see Gick 2002; Stone 2010). It uses the reflective properties of ultra-high-frequency sound waves (which humans cannot hear) to create images of the inside of the body (Figure 2.4). The high-frequency sound wave penetrates through the soft tissues and fluids, but it bounces back off surfaces or tissues of a different density as well as air. The variation in reflected echoes is processed by computer software and displayed as a video image. It has some limitations, however (see Stone 2010). For example, due to its relatively low sampling rate (generally lower than 90 Hz), it does not allow for the investigation of sophisticated dynamic characteristics of tongue movements (unlike EMA, which provides a sample rate of up to 1,250 Hz). In addition, it is generally unable to capture the shape of the tongue tip and the area beyond tissue or air that reflects ultrasound. However, because ultrasound imaging allows researchers to examine real-time detailed lingual postures not easily captured by methods such as EPG and EMA (including the tongue groove and the tongue root; see Lulich et al. 2018), and because some systems are portable and inexpensive (Gick 2002), it has increasingly been used in various phonetic studies (Stone 2010; Carignan 2017; Ahn 2018; Strycharczuk and Sebregts 2018; Tabain and Beare 2018), and also in studies involving young children (Noiray et al. 2013). Lehnert-LeHouillier et al. (2010) used an ultrasound imaging system to investigate prosodic strengthening as shown in the tongue shape for mid vowels in domain-initial position with different prosodic boundary strengths in English. They found a cumulatively increasing magnitude of tongue lowering as the boundary strength increased for vowels in vowel-initial (VC) syllables, but not in consonant-initial (CVC) syllables.
Figure 2.4 Tongue shapes in ultrasound. (Photo by Aude Noiray at the Laboratory for Oral Language Acquisition.)
Based on these results, the authors suggested that boundary strengthening is localized in the initial segment, whether consonantal or vocalic. In a related ultrasound study that examined lingual shapes for initial vowels in French, Georgeton et al. (2016) showed that prosodic strengthening of domain-initial vowels is driven by the interaction between language factors (such as the phonetic distinctiveness in the perceptual vowel space) and physiological constraints imposed on the different tongue shapes across speakers. In an effort to explore the articulatory nature of the insertion of a schwa-like element into a consonantal sequence, Davidson and Stone (2003) used an ultrasound imaging technique to investigate how phonotactically illegal consonantal sequences may be repaired. Their tongue motion data on the production of /zC/ sequences in pseudo-Polish words suggested that native speakers of English employed different repair mechanisms for the illegal sequences, but in a gradient fashion in line with a gestural account—that is, as a result of the interpolation between the flanking consonants without showing a positional target in the articulatory space (Browman and Goldstein 1992a). It remains to be seen how such a gradient repair may be further conditioned by prosodic structure. An ultrasound imaging system may be used in conjunction with laryngoscopy. Moisik et al. (2014) employed simultaneous laryngoscopy and laryngeal ultrasound (SLLUS) to examine Mandarin tone production. In SLLUS (see also Esling and Moisik 2012), laryngoscopy is used to obtain real-time video images of the glottal condition, which provide information about laryngeal state, while laryngeal ultrasound is simultaneously used to record changes in larynx height. Results showed no positive correlation between larynx height and f0 in the production of Mandarin tones except for low f0 tone targets, which were found to be accompanied by larynx raising due to laryngeal constriction (as low tone often induces creakiness). This study implies that larynx height may be controlled to help facilitate f0 change, especially under circumstances in which f0 targets may not be fully accomplished (e.g. due to vocal fold inertia). Despite the invasiveness of laryngoscopy, the innovative technique was judged to be particularly useful in exploring the relation between f0 regulation and phonation type and their relevance to understanding the production of tones and tonal register targets.
2.4.2 Electropalatography

Electropalatography (EPG) is a technique that allows for monitoring linguo-palatal contact (i.e. contact between the tongue and the hard palate) and its dynamic change over time during articulation. The subject wears a custom-fabricated artificial palate, usually made of a thin acrylic, held in place by wrapping around the upper teeth (Figure 2.5). Several dozen electrodes are placed on the artificial palate, and the electrodes that are contacted by the tongue during articulation send signals to an external processing unit, indexing details of tongue activity during articulation. Unlike the EMA, which is usually used to track articulatory movements in the midsagittal plane, an EPG records the tongue contact anywhere on the entire palate (i.e. the target of the tongue movement). EPG is generally limited to investigating the production of consonants and high vowels, which involve tongue contact between the lateral margins of the tongue and the sides of the palate near the upper molars (Byrd et al. 1995; Gibbon and Nicolaidis 1999; Stone 2010; see Ünal-Logacev et al. 2018 for the use of EPG with an aerodynamic device).
Figure 2.5 Contact profiles in electropalatography for different stops and fricatives (/t/, /s/, /k/, /ç/). Black squares indicate the contact of the tongue surface with the palate (upper rows = alveolar articulation, lower row = velar articulation). (Photo taken at IfL Phonetics Lab, Cologne.)
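Contact profiles like these are commonly reduced to summary indices. The sketch below computes two such measures for a single binary 8 x 8 contact frame: the overall percentage of contacted electrodes and a front-weighted (anteriority-style) index. The grid layout, example frame, and weighting scheme are illustrative rather than any published standard.

import numpy as np

frame = np.zeros((8, 8), dtype=int)   # rows run from front (0) to back (7)
frame[0:2, 1:7] = 1                   # e.g. an alveolar /t/-like closure
frame[:, 0] = frame[:, 7] = 1         # lateral bracing along the margins

percent_contact = 100.0 * frame.mean()          # % of contacted electrodes

row_weights = np.arange(8, 0, -1)               # front rows weighted higher
row_contact = frame.sum(axis=1)
anteriority = (row_weights * row_contact).sum() / (8 * row_contact.sum())
print(f"{percent_contact:.1f}% contact, anteriority index {anteriority:.2f}")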
In prosodic research, EPG has been used to measure prosodic strengthening as evidenced by the production of lingual consonants across languages, in English (Fougeron and Keating 1997; Cho and Keating 2009), French (Fougeron 2001), Korean (Cho and Keating 2001), Taiwanese (Hsu and Jun 1998), and German (Bombien et al. 2010) (see Keating et al. 2003 for a comparison of four languages). Results of these studies have shown that consonants have a greater degree of linguo-palatal contact when they occur in initial position of a higher prosodic domain and when they occur in stressed and accented syllables. These studies have provided converging evidence that low-level articulatory realization of consonants is fine-tuned by prosodic structure, and as a consequence prosodic structure itself is expressed at least to some extent by systematic low-level articulatory variation of consonantal strengthening, roughly proportional to prosodic strengthening as stemming from the prosodic structure. EPG has also been used in exploring effects of syllable position and prosodic boundaries on the articulation of consonant clusters (Byrd 1996; Bombien et al. 2010). Byrd (1996), for example, showed that a consonant in English is spatially reduced (with less linguo-palatal contact) in the coda as compared to the same consonant in the onset, and that an onset cluster (e.g. /sk/) overlaps less than a coda or a heterosyllabic cluster of the same segmental make-up. In relation to this study, Bombien et al. (2010) examined the articulation of initial consonant clusters in German and reported that at least for some initial clusters (e.g. /kl/ and /kn/), boundary strength induced spatial and/or temporal expansion of the initial consonant, whereas stress caused temporal expansion of the second consonant, and both resulted in less overlap. These two studies together imply that coordination of consonantal gestures in clusters is affected by elements of prosodic structure, such as syllable structure, stress, and prosodic boundaries.
2.5 Summary of articulatory measurement techniques

Table 2.1 summarizes advantages and disadvantages of several articulatory measurement techniques that we have introduced in this chapter.
Table 2.1 Advantages and disadvantages of articulatory measuring techniques

Laryngoscopy (provides high-speed motion pictures of larynx activities)
Advantages: Direct observation of vocal fold vibrations; drawing inferences about different glottal states and activity of the vocalis muscle.
Disadvantages: Relatively invasive; not ideal for obtaining a large quantity of data from a large population.

EGG (monitors glottal states using electrical impedance)
Advantages: Non-invasive; relatively easy to handle for monitoring vocal fold vibration and different glottal states during phonation with a larger subject pool.
Disadvantages: Not as accurate as laryngoscopy; indirect observation of larynx activity estimated by a change in electrical impedance across the larynx.

RIP (estimates lung volume change during speech by measuring expansion and recoil of elastic bands wrapped around the thoracic and abdominal cavities)
Advantages: Non-invasive device; useful for testing the relationship between the respiratory process and global prosodic planning in conjunction with acoustic data.
Disadvantages: Difficult to capture fine detail at the segmental level; useful for testing prosodic effects associated with large prosodic constituents, such as an intonational phrase.

EMA (tracks positions and movements of sensors that are attached on several articulators (e.g. tongue, lips, jaw, velum) within an electromagnetic field)
Advantages: Data obtained at high sampling rates; useful for examining kinematics for both consonants and vowels; simultaneous observation of multiple articulators; high temporal and spatial resolution; applicable to manual and facial movements in prosody; three-dimensional devices available; used for recording a larger population with a recent device.
Disadvantages: Limited to point tracking; usually used to track movements of points along the midsagittal plane; difficult to capture the surface structure (e.g. the complete shape of the tongue); can only be used to observe the anterior vocal tract; quite invasive with sensors; possibly impedes natural articulation.

Ultrasound (provides imaging of tongue position and movement)
Advantages: Data obtained at relatively high sampling rates (though not as high as a recent EMA system); generally non-invasive; used for measuring the real-time lingual postures during vowel and consonant production; a portable device is available.
Disadvantages: Difficult to image the tongue tip (usually about 1 cm), some parts of the vocal tract (e.g. the palatal and pharyngeal wall), and the bony articulator (e.g. the jaw); because tongue contours are tracked relative to probe position, it is difficult to align them to a consistent reference, such as hard palate structure.

EPG (measures linguo-palatal contact patterns using individual artificial palates with incorporated touch-sensitive electrodes)
Advantages: Useful for examining linguo-palatal contact patterns along the parts of the palate over time, especially for coronal consonants; provides information about the width, length, and curvature of constriction made along the palatal region.
Disadvantages: Custom-made artificial palates needed for individual subjects; often impedes natural speech; restricted to complete tongue–palate contacts; no information about articulation beyond the linguo-palatal contact (e.g. non-high vowels, labials, velars).

Optoelectronic device (Optotrak) (provides three-dimensional motion data, using optical measurement techniques, with active markers placed on surfaces)
Advantages: Non-invasive; data with high temporal and spatial resolution; useful for capturing movements of some articulators (jaw, chin, lips) as well as all other visible gestures (e.g. face, head, hand).
Disadvantages: Limited to point tracking; limited to external use (i.e. to ‘visible’ movements); impossible to capture articulation inside the vocal tract; the system requires line of sight to track the markers.

Real-time MRI (provides real-time imaging data of complex structures and articulatory movements of the vocal tract using strong magnetic fields)
Advantages: Non-invasive with high spatial resolution; useful for rich anatomical data analysis as well as for analysis of articulatory movements of the entire vocal tract (including the pharyngeal area).
Disadvantages: Relatively poor time resolution; subjects are recorded in supine posture, which may influence speech; concurrent audio from production is very difficult to acquire because of scanner noise.
2.6 Conclusion

Prosody research has benefited from various experimental techniques over the past decades and has as a result extended in its scope, embracing investigations of various phonetic events that occur in both laryngeal and supralaryngeal speech events in the interplay between segmental phonetics and prosodic structure. Careful examination of fine-grained phonetic detail that arises in the phonetics–prosody interplay has no doubt advanced our understanding of the regulatory mechanisms and principles that underlie the dynamics of articulation and prosody. Prosodic features and segmental features are not implemented independently. The coordination between the two systems is modulated in part by universally driven physiological and biomechanical constraints imposed on them, but also in part by linguistic and communicative factors that must be internalized in the phonological and phonetic grammars in language-specific ways. Future prosody research with the help of articulatory measuring techniques of the type introduced in this chapter will undoubtedly continue to uncover the physiological and cognitive underpinnings of the articulation of prosody, illuminating the universal versus language-specific nature of prosody that underlies human speech.
Acknowledgements

This work was supported in part by the Global Research Network programme through the Ministry of Education of the Republic of Korea and the National Research Foundation of Korea (NRF2016S1A2A2912410), and in part by the German Research Foundation as an aspect of SFB 1252 ‘Prominence in Language’ in the project A04 ‘Dynamic Modelling of Prosodic Prominence’ at the University of Cologne.
Chapter 3
Fundamental Aspects in the Perception of f0

Oliver Niebuhr, Henning Reetz, Jonathan Barnes, and Alan C. L. Yu
3.1 Introduction

The mechanisms and cognitive processes that lead to the physiological sensation of pitch in acoustic signals are complex. On the one hand, the basic mechanisms and processes involved in the perception of pitch in a speech signal work on the same principles that apply to any other acoustic signal. On the other hand, pitch perception in speech is still a special case for two reasons. First, speech signals show variation and dynamic changes in many more time and frequency parameters than many other acoustic signals, in particular those that are used as psychoacoustic test stimuli. Second, speech signals convey meaning and as such give the listener’s brain more ‘top-down’ interpretation and prediction possibilities than other signals. Therefore, focusing on fundamental points, this chapter will address both the general principles in the creation of pitch and the more specific aspects of this creation in speech. The basic mechanisms and processes are dealt with in §3.2, embedded in their historical development. Next, §3.3 addresses speech-specific aspects of pitch perception, from the detection of changes in f0 to the influences of segments and prosodies. Finally, §3.4 concludes that special care needs to be taken by speech scientists when analysing visually represented f0 contours in terms of perceived pitch contours.
3.2 A history of fundamental pitch perception research

Pitch perception theories can roughly be divided into three categories:

1. Place (or tonotopic) theories assume that the pitch of a signal depends on the places of stimulation on the basilar membrane in the inner ear coding spectral properties of the acoustic signal.
2. Rate theories assume that the temporal distance between the firing of neurons determines the perceived pitch and that pitch is therefore coded by temporal properties of a signal.
3. Other theories assume a place coding for lower-frequency components and a rate coding for higher-frequency components.

We present the most important results of nearly 200 years of experimentation that have contributed to the development of current theories. For this discussion, it is essential to be aware of the differences between the periodicity frequency (rate), the fundamental frequency, and the pitch of a signal.
3.2.1 Basic terminology

A periodic signal can be represented by an infinite sum of sinusoids, according to the theorem of Fourier (see Oppenheim 1970). These sinusoids are uniquely determined by their frequency, amplitude, and phase. In speech perception, the phase is ignored, since our hearing system evaluates it only for directional hearing, not for speech perception per se (Moore 2013). The squared amplitude (i.e. the power) of a signal is usually displayed with a (power) spectrum, showing the individual frequency components of a signal. The development of a spectrum over time is represented with a spectrogram. Frequencies are given in hertz (Hz) and the power is given on a decibel (dB) scale. The transformation from the acoustic signal in the time domain to the spectrum in the frequency domain is usually performed with a Fourier transformation of short stretches of speech. Cutting speech into such stretches creates artefacts; windowing can reduce these segmentation effects (Harris 1978) but cannot remove them. A pure (sine) tone of, for example, 100 Hz has a period duration of 10 ms and it has only this frequency component in its spectrum. The f0 of a complex signal composed of several pure tones whose frequency components are whole multiples of its lowest sine frequency component (e.g. 100, 200, 300, and 400 Hz) is equivalent to this lowest component (here: 100 Hz); its multiples are called ‘harmonics’. All frequency components show up in the spectrum, and the period duration of this complex signal is that of the fundamental frequency, here: 10 ms. A complex signal can consist of several harmonics with some harmonics (even the fundamental) having no energy (e.g. 200, 300, and 400 Hz). This complex tone still has a period duration of 10 ms and the period frequency of the signal is 100 Hz, given that the period frequency is the inverse of the period duration. The fundamental frequency is therefore also 100 Hz, even though it has no energy, as a result of which it is usually said to be ‘missing’.
The debate in pitch perception (which is a physiological sensation, not a physical property) is whether the periodicity of a speech signal or its harmonic structure is most important for the perception of its pitch. There are two theories, both based on series of experiments, as discussed in the following sections.
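A minimal sketch of the ‘missing fundamental’ just described: a complex tone containing only the 200, 300, and 400 Hz harmonics still repeats every 10 ms, and a simple autocorrelation analysis recovers that period. The sampling rate and lag search range are arbitrary illustrative choices.

import numpy as np

fs = 16_000
t = np.arange(0, 0.5, 1 / fs)
# Harmonics 2-4 of a 100 Hz fundamental; the fundamental itself is absent
x = sum(np.sin(2 * np.pi * f * t) for f in (200.0, 300.0, 400.0))

ac = np.correlate(x, x, mode="full")[x.size - 1:]  # autocorrelation, lags >= 0
lo, hi = int(fs / 500), int(fs / 50)               # search lags of 2-20 ms
best_lag = lo + np.argmax(ac[lo:hi])
print(f"period = {1000 * best_lag / fs:.1f} ms -> "
      f"{fs / best_lag:.0f} Hz")                   # 10.0 ms -> 100 Hz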
3.2.2 Theories of pitch perception

It was long believed that the fundamental frequency (i.e. the first harmonic) of a signal must be present for it to be perceived as the pitch of a signal (see de Cheveigné 2005). This changed when Seebeck (1841) created tones with little energy at the fundamental frequency, using air sirens. He observed that perceived pitch was still that of the weak fundamental and argued that it was perceived on the basis of the rate of the periodic pattern of air puffs. Ohm (1843) objected to this explanation and instead assumed that listeners hear each harmonic in a complex tone according to mechanical resonance properties of the cochlea in the inner ear, whereby the fundamental is the strongest harmonic. Because this view makes the presence of the fundamental essential, he assumed that, in Seebeck’s experiments, non-linear distortions had somehow reintroduced the fundamental in the outer and middle ear. On this assumption, Helmholtz (1863) argued that the inner ear functions as a resonator and a ‘place coder’ for the fundamental frequency. Thus, the experimental evidence for the rate coding of pitch and the absence of the fundamental in Seebeck’s experiments were explained by the place theory via the assumption that the fundamental frequency is reintroduced in the outer or middle ear. The place theory remained the accepted theory and was supported by Békésy’s (1928) findings about the frequency-dependent elongation of different places along the basilar membrane in the inner ear. This theory was challenged by Schouten (1938, 1940a), who electronically generated complex tones with no energy at the fundamental. His findings paralleled Seebeck’s in that the perceived pitch was that of the missing fundamental. He also introduced pure tones close to the frequency of the eliminated fundamental that should lead to interference patterns in the form of a waxing and waning loudness perception, so-called beats (Moore 2013), if the fundamental was physically reintroduced as proposed by Ohm. Since no such beats were perceived, Schouten showed that there was no energy at the fundamental frequency in the listener’s ear. Additional proof of the nonexistence of the fundamental was provided by Licklider (1954), who presented stimuli in which the frequency of the fundamental was masked by noise, which did not prevent listeners from perceiving the pitch at that frequency. More advanced theories were needed to explain these experimental observations.
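The ‘beats’ Schouten used as a diagnostic are easy to reproduce: two pure tones a few hertz apart sum to a waveform whose amplitude waxes and wanes at their difference frequency, as in this minimal sketch.

import numpy as np

fs = 16_000
t = np.arange(0, 1.0, 1 / fs)
x = np.sin(2 * np.pi * 200 * t) + np.sin(2 * np.pi * 204 * t)
# By the sum-to-product identity, x = 2 sin(2*pi*202*t) cos(2*pi*2*t):
# a 202 Hz tone whose envelope |2 cos(2*pi*2*t)| peaks 4 times per second,
# i.e. the loudness beats at the 204 - 200 = 4 Hz difference frequency.
envelope = np.abs(2 * np.cos(2 * np.pi * 2 * t))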
3.2.3 Critical bands and their importance for pitch perception theories

Fletcher (1940) introduced the term ‘critical bands’ for the notion that two simultaneous frequencies must be sufficiently different to be perceived as separate tones rather than as
a single complex tone. (Think of flying insects in a bedroom at night; several insects can sound like a single bigger one unless their sinusoidal wing frequencies fall into different ‘critical bands’.) Several experiments with different methodologies have replicated this finding (see Moore 2013). The question of whether pitch perception follows a rate or a place principle was thus transformed into an issue relating to the widths of critical bands. These are smaller for low frequencies and larger for higher frequencies, while the harmonics of a complex tone are equidistant. Consequently, full separation of individual harmonics is possible for lower-frequency components, whose harmonics will rarely fall into the same critical band, while higher harmonics readily fall into the same critical band and are therefore not separable. At the same time, temporal selectivity in the higher frequency range is better than in the lower frequency range. The question of pitch perception as either rate or place thus became a question of resolving individual harmonics. If the resolvable lower harmonics are important for pitch perception, then the coding can be explained by the place along the basilar membrane. If pitch perception is guided by the higher harmonics, then coding must take place by means of the firing distances of the individual nerves, because the higher harmonics cannot be resolved by the places along the basilar membrane. To show that higher harmonics can determine pitch perception, Schouten (1940b) conducted experiments using signals with a non-trivial harmonic structure. The signals consisted of three equally spaced harmonics in a range where place coding is unlikely (e.g. 1,800, 2,000, and 2,200 Hz). The missing fundamental of these three harmonics (here: 200 Hz) was predictably perceived as pitch. He then shifted the frequency components by a constant (e.g. 40 Hz) and generated the inharmonic frequency components 1,840, 2,040, and 2,240 Hz. The question was: which frequency do subjects perceive? The pitch should correspond to 200 Hz if it is derived from the spacing between (in)harmonics. But, if pitch is something like the greatest common divisor of the harmonics, then the percept should correspond to a missing fundamental at 40 Hz, as induced by the three harmonics—that is, the 46th (1,840 Hz), 51st (2,040 Hz), and 56th (2,240 Hz). However, the subjects in Schouten’s experiment most often perceived a pitch of 204 Hz, sometimes 185 or 227 Hz. These experimental findings subsequently appeared to be explained by the firing rate of neurons, which operate as ‘peak pickers’, firing at signal maxima (Schouten et al. 1962). Concentrating on the fine structure of the waveform, de Boer (1956) measured the distances between the peaks of the amplitude signal, which showed that a distance of 4.9 ms occurred most often, representing a pitch of 204 Hz. Distances of 5.4 and 4.4 ms occurred less frequently, which can be perceived as 185 and 227 Hz, respectively. Walliser (1968, 1969) gave an explanation based on a spectral decomposition of the signal. He observed that 204.4 Hz is a subharmonic of 1,840 Hz, and 185.4 and 226.6 Hz are subharmonics of 2,040 Hz. He suggested that a rough power spectrum of the resolvable harmonics is first ‘computed’ in the listener’s brain and that a central processor subsequently uses pattern-recognition techniques to determine the pitch.
In this way, the virtual fundamental frequency with the best fit is selected as the perceived pitch frequency. Explaining pitch perception with only approximately matching quasi-fundamental frequencies is actually an improvement over exactly matching frequencies, because a speech signal never has a perfectly harmonic structure and the ear must deal with such imperfect structures.
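Walliser’s subharmonic observation can be reproduced with a few lines of arithmetic. The sketch below (our illustration of the reasoning, not Walliser’s procedure) lists, for each of Schouten’s shifted components, the subharmonics that fall near the expected pitch range:

```python
# Each component f supports the subharmonics f/n; candidates close to the
# expected ~200 Hz percept win out (our illustration, not Walliser's model).
components = [1840, 2040, 2240]

for f in components:
    cands = [f / n for n in range(1, 15) if 150 <= f / n <= 260]
    print(f, [round(c, 1) for c in cands])
# 1840 -> [230.0, 204.4, 184.0, 167.3, 153.3]
# 2040 -> [255.0, 226.7, 204.0, 185.5, 170.0, 156.9]
# 2240 -> [248.9, 224.0, 203.6, 186.7, 172.3, 160.0]
```

Every component supports a candidate near 204 Hz, the dominant percept, while the components also support candidates near 185 and 227 Hz, the occasionally reported alternatives.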
3.2.4 Which components are important?

Plomp (1967) and Ritsma (1967) conducted a number of experiments that led to the conclusion that a frequency band around the second, third, and fourth harmonics for fundamentals between 100 and 400 Hz was crucial for the perception of pitch, with periodicity of the signal being the important factor. In contrast, place proponents (e.g. Hartmann 1988; Gulick et al. 1989) argued that the resolved harmonics (i.e. the lowest multiples of f0) determine the pitch through their tonotopic distances. There was support for both theories (for detailed overviews see Plomp 1967; Moore 2013), so that the experiments gave no clear insight as to whether pitch is perceived in the temporal (rate) or the spectral (place) domain. Houtsma and Goldstein (1972) conducted an important experiment in which they presented complex tones composed of two pure tones (e.g. 1,800 and 2,000 Hz) to experienced listeners, who were able to identify the missing fundamental of 200 Hz. The pitch percept was even present when one harmonic was presented to the left and the other to the right ear. They assumed a central pitch processor in the listener’s brain that receives place-pitch information from each ear and then integrates this information into a single central pitch percept. In addition, Bilsen and Goldstein (1974) found that subjects can perceive pitch in white-noise signals presented to both ears but delayed in one by 2 ms or more. The pitch percept is weak, but similar to the percept when delayed and undelayed signals are presented to the same ear (Bilsen 1966), which again points to a central mechanism (Bilsen 1977). Goldstein (1973) proposed an ‘optimum processor’ theory, where a central estimator receives information about frequencies of resolvable simple tones. The estimator tries to interpret resolved tones as harmonics of some fundamental and looks for a best fit. This analysis can be based on either place or rate information. Wightman (1973) suggested a ‘pattern-transformation’ theory, a sort of cepstrum analysis that roughly represents a phase-insensitive autocorrelation function of the acoustic signal. Terhardt (1974) explained pitch perception in his ‘virtual pitch’ theory via a learning matrix of spectral pitch of pure tones and an analytic mode for later identifying the pitch, where harmonics leave ‘traces’ in the matrix. In a later version of his theory (Terhardt et al. 1982), he included the concept of analytical listening (‘outhearing’ individual harmonics of a signal) and holistic listening (perceiving the spectral shape as a whole). The close relationships between these three theories are brought out in de Boer (1977). All three in fact have peripheral stages where a rough power spectrum of the resolved harmonics is computed, in addition to a central processor that uses pattern-recognition techniques to extract the pitch of a signal. Although none of the theories offers a complete explanation of the computation of pitch from harmonic information only, the fact remains that binaural pitch perception and similar phenomena cannot be explained by a peripheral pitch processor alone. The central processing theories can explain a variety of experimental findings, but they still fail to explain the decreasing spectral resolution of higher harmonics due to the critical bands, which predicts good spectral coding for lower frequencies and better temporal coding for higher frequencies.
This led Licklider (1951), de Boer (1956), Moore and Glasberg (1986), and Houtsma and Smurzynski (1990) to argue for a dual mechanism of pitch perception: a frequency-selectivity mechanism for lower harmonics and a rate analysis for higher harmonics. They argue that this dual mechanism is more flexible, and thus more robust, than a single mechanism.
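Wightman’s ‘pattern transformation’ is described above as a sort of cepstrum analysis. The following generic cepstrum pitch estimate (a standard textbook method, offered only to make the idea concrete; it is not Wightman’s actual model) discards phase, takes the log magnitude spectrum, and finds the ‘quefrency’ at which that spectrum is itself periodic:

```python
# Cepstrum-style pitch estimate: the log magnitude spectrum of a harmonic
# signal is periodic in frequency, and a second transform exposes that
# periodicity at the f0 'quefrency'. Phase is discarded throughout.
import numpy as np

sr = 8000
t = np.arange(sr // 2) / sr                              # 0.5 s of samples
x = sum(np.sin(2 * np.pi * k * 200 * t) for k in range(1, 6))  # f0 = 200 Hz

log_mag = np.log(np.abs(np.fft.rfft(x)) + 1e-9)          # phase-insensitive
cep = np.abs(np.fft.irfft(log_mag))                      # the 'cepstrum'

lo, hi = sr // 400, sr // 80                             # search 80-400 Hz
q = lo + np.argmax(cep[lo:hi])                           # peak quefrency (samples)
print(f"estimated f0 = {sr / q:.0f} Hz")                 # estimated f0 = 200 Hz
```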
There is some evidence for inter-individual differences in the use of the two mechanisms. Ladd et al. (2013) adopted the terminology of ‘spectral listeners’, who use spectral patterns to perceive pitch (i.e. a place coding), and ‘f0 listeners’, whose pitch perception follows the signal’s periodicity (i.e. a rate coding). Some listeners seem to prefer place coding and some rate coding. No one, however, seems to be either a pure ‘spectral listener’ or a pure ‘f0 listener’. Moreover, the nature of the stimulus itself is also relevant. Many listeners switch between the place and rate mechanisms, with the number of ‘spectral listeners’ increasing for stimuli with lower-frequency components. Virtual pitch perception therefore appears to dominate in natural, unmanipulated speech signals and everyday conversation situations. Thus, the bottom line of §3.2 is that pitch perception in complex sound signals relies on multi-layer, signal-adaptive cognitive mechanisms in which f0 is neither required to be physically present nor directly translated into its psychoacoustic counterpart. Pitch is virtual, and this fact becomes even clearer when we shift the focus to speech signals. The following sections on pitch perception in speech begin with listeners’ sensitivity to f0 change (§3.3.1) and then successively move deeper into segments and prosodies by explaining how speech segments hinder and support perception of f0 change (§3.3.2) and how pitch perception is shaped by duration and intensity (§3.3.3).
3.3 Pitch perception in speech

3.3.1 Just noticeable differences and limitations in the perception of f0

Just noticeable differences (JNDs) play a crucial role in the analysis of speech melody. We need to know how fine grained a listener’s resolution of pitch differences is in order to be able to separate relevant from irrelevant pitch changes. However, psychoacoustic JNDs often considerably underestimate the actual JNDs of listeners in the perception of speech melody. This is probably because the limited cognitive processing capacity of the auditory system allows for more fine-grained resolution of pitch differences in simple, steady psychoacoustic stimuli than in complex and variable speech stimuli (Klatt 1973; Mack and Gold 1986; House 1990). For example, psychoacoustic studies suggest that the JND between two f0 levels is as low as 0.3–0.5%, hence only 1 Hz or even lower for typical speech f0 values (Flanagan and Saslow 1958). In contrast, studies based on real speech or speech-like stimuli have found that listeners only detect f0 changes larger than 4–5%, i.e. 5–10 Hz or 1 semitone (ST) (Isačenko and Schädlich 1970; Rossi and Chafcouloff 1972). Fry’s (1958) experiments on f0 cues to word stress also support this JND level. If reverberation in a room scrambles phase relationships between individual frequencies of the signal, then listeners’ sensitivity to pitch differences decreases further, to about 10% (Bernstein and Oxenham 2006). These 10%, or roughly 2 ST, are about 20 times higher than the JND specified in psychoacoustic research. That 2 ST nevertheless represents a realistic threshold for detection of pitch changes in everyday speech is reflected in many phonetic analyses. For instance, annotating intonation often involves deciding whether two sequential pitch accents are linked by a high plateau or
a sagging transition, in particular in languages that distinguish a ‘hat pattern’ from a ‘dip pattern’ (e.g. German). Ambrazaitis and Niebuhr (2008) found that, for German listeners, at least 2–3 ST difference is necessary before meaning-based judgements switch from hat- to dip-pattern identification. Similarly, communicatively relevant steps in stylized intonation contours such as the calling contour are at least 1–2 ST (Day-O’Connell 2013; Niebuhr 2015; Arvaniti et al. 2017; Huttenlauch et al. 2018). Furthermore, listeners seem equally insensitive to differences in f0 range—that is, to the magnitude of f0 movements. Several experiments on this topic suggest that the JND for f0 ranges is at least 1 ST, and perhaps much higher—that is, 2–3 ST (Pierrehumbert 1979; ’t Hart 1981). This becomes relevant in the minor f0 drop frequently occurring at the end of phrase-final rises, as in the audio examples for Kohler (2005) shown in IPDSP (2009: fig. 3). From a purely visual perspective, labelling such phrase endings with a downstepped high (or low) rather than a high boundary tone seems justifiable. However, these drops are typically less than 2–3 ST and hence not audible (though see Jongman et al. 2017 on language specificity). Figure 7(c) in the audio examples for Gussenhoven (2016) shows a rare case in which a short f0 drop at the end of a phrase-final rise is just above the JND and hence audible for listeners (358–305 Hz, 2.8 ST). Rossi (1971) made another important observation with respect to f0-range perception. Compared to measured f0 range, perceived pitch range can be considerably smaller, varying between 66% and 100% of the f0 movement, with steeper movements resulting in smaller perceived ranges. In Figure 3.1, for example, the continuation rises depicted for the first two words have roughly the same frequency range and end at similar levels (rise 1: 174–275 Hz; rise 2: 180–272 Hz). Yet, the second rise starts later, is steeper, and thus perceptually ends at a lower pitch level than the first. In this case, Rossi’s finding has no phonological consequence, but this could easily occur, for example, when the f0 contour suggests an upstep from a prenuclear to a nuclear pitch accent, where listeners might not perceive an upstep owing to the steepness of the nuclear rise. The correlation that Rossi discovered between the steepness of f0 slopes and the perceived range of the corresponding pitch movements could indirectly also explain the findings of Michalsky (2016). His original aim was to determine the transition point between question and statement identification in a slope continuum of phrase-final f0 movements in German, but he stumbled upon a variation in identification behaviour that was linked to speaking rate. This link suggests that the faster speakers are, the smaller the perceived pitch ranges of the f0 movements they produce—all else equal, including the steepness of the f0 slope. It is possible that a faster speaking rate makes f0 movements appear steeper, which then, in accordance with Rossi’s findings, causes the decrease in pitch range perception.
(1) Computer, Tastatur und Bildschirm.
Figure 3.1 Enumeration ‘Computer, Tastatur und Bildschirm’ spoken by a female German speaker in three prosodic phrases (see also Phonetik Köln 2020). (From the G_ToBI training material)
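The semitone values quoted in this section follow the standard conversion of a frequency ratio to semitones, ST = 12 · log2(f2/f1). A small helper (ours, for convenience) reproduces the chapter’s phrase-final example, where a 358–305 Hz drop comes out at about 2.8 ST and is hence just above a conservative 2–3 ST speech threshold:

```python
# Standard conversion between a frequency ratio and semitones; the numeric
# examples in the comments mirror values discussed in this section.
import math

def semitones(f1_hz, f2_hz):
    """Signed interval from f1 to f2 in semitones."""
    return 12 * math.log2(f2_hz / f1_hz)

print(round(semitones(358, 305), 1))   # -2.8: the just-audible drop in the text
print(round(semitones(100, 101), 2))   # 0.17: a ~1 Hz step, well below speech JNDs
```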
Additionally, not every f0 rise or fall is even perceived as a pitch movement. Dynamic pitch perception requires certain relations between the duration and range of an f0 movement. Below this ‘glissando threshold’, f0 movements appear to listeners as stationary tones. The shorter a frequency transition is, the steeper it must be to induce a dynamic percept (Sergeant and Harris 1962). Likewise, for a relatively flat movement to be perceived as dynamic, it must be fairly long, consistent with the results of Rossi (1971). ’t Hart et al. (1990) integrate these results on glissando perception into a formula which states that the minimum f0 slope (in ST/s) for a movement to yield the percept of a pitch movement corresponds to a constant factor divided by the movement’s duration squared. A short f0 movement (50 ms) must have a slope of at least 46 ST/s to yield a movement percept, approaching the limits of human production velocity, at least for rises (Xu and Sun 2002). For a 100 ms movement, the glissando threshold decreases to about 12 ST/s. One important practical implication of this is that short f0 movements framed by voiceless gaps are often not steep enough to induce perception of a pitch movement. The result is a steady pitch event whose level roughly corresponds to the mean f0 of the second half of the movement. For example, this applies to the short f0 fall on und [ʊnb̥ː] in Figure 3.1. Thus, while the contour on und Bildschirm visually suggests annotation with a boundary high, %H, annotators who also listen to the phrase will likely decide on %L instead. (See G_ToBI conventions in Grice and Baumann 2002.) The und example shows the relevance of the glissando threshold for upstep and downstep decisions, in particular between phrases (Truckenbrodt 2007). As for the discrimination of f0 slopes, Nábelek and Hirsh (1969) report, for psychoacoustic conditions, a minimum threshold of about 30%. For more speech-like stimuli, the JND again increases to between 60% and 130% (see also Klatt 1973 and ’t Hart et al. 1990, who, for most speech-like conditions, obtained a JND of about 100%). Accordingly, speakers who distinguish pitch accents using shape rather than alignment differences vary the steepness of their pitch accent slopes so as to exceed this JND (cf. Niebuhr et al. 2011a). Also, differences between monotonal and bitonal accents such as H* and L+H* or H*+L typically involve slope differences of 60% or more, leaving aside the controversial cross-linguistic discussion about the separate phonological status of these accent types (cf. Gussenhoven 2016). In summary, it is crucial for analyses of speech melody to take into account that not all observable changes in f0 result in differences in perceived pitch. Some may be insufficiently large, others not steep or fast enough (or too steep or fast), and, even if an f0 movement produces a dynamic percept, it is likely that the movement’s perceived range will be substantially smaller than f0 suggests. Still more complicated, there is increasing evidence that JNDs are not mere limitations of the auditory system but to some degree also acquired barriers that the brain sets up to filter out what it has learned to regard as irrelevant variability (cf. Gregory 1997). This idea fits well with the inter-individual differences (e.g. superior performance of musicians compared to non-musicians) and the training or learning effects that are reported in many JND studies (’t Hart 1981; Handel 1989; Micheyl et al.
2006). The implication is that during perception experiments participants could be screened for JND levels (perhaps just with biographical questions), and likewise acoustic f0 analyses should work with conservative JND estimates (see Mertens 2004).
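The glissando threshold formula of ’t Hart et al. (1990) can be stated compactly. In the sketch below the constant is set to about 0.115 ST·s so as to reproduce the two worked values given earlier (46 ST/s at 50 ms, about 12 ST/s at 100 ms); the exact constant should be treated as an assumption of this illustration, not a value taken from the source:

```python
# Glissando threshold: minimum slope C / T^2 (ST/s, duration T in seconds)
# for an f0 movement to be heard as dynamic. C ~ 0.115 ST*s is chosen here
# only to match the two worked values in the text (an assumption).
def glissando_threshold(duration_s, c=0.115):
    """Minimum slope (ST/s) for a movement of this duration to sound dynamic."""
    return c / duration_s ** 2

for ms in (50, 100, 200):
    print(ms, "ms ->", round(glissando_threshold(ms / 1000), 1), "ST/s")
# 50 ms -> 46.0 ST/s, 100 ms -> 11.5 ST/s, 200 ms -> 2.9 ST/s
```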
3.3.2 Segmental influences on the perception of f0

While §3.3.1 focused on the f0 contour itself, this section focuses on how segmental elements both limit and enhance the perception of pitch movements in speech. We begin with three examples of limitations. First, the speaker’s intended intonational f0 contour is typically interspersed with many f0 ‘micro-perturbations’ (Kohler 1990; Pape et al. 2005; Hanson 2009). Kirby and Ladd (2016a), for example, argue that there are two types of consonant-based f0 perturbation effect, one raising f0 following the release of a voiceless consonant, the other depressing f0 around the closure phase of a voiced obstruent. These micro-perturbations are actually not as small as the term ‘micro’ may imply. They can easily be 10–20 Hz in magnitude, or more than 10% of the f0 range typically employed in natural speech. Moreover, these ups and downs may extend for a considerable time into the vowel (Hombert 1978). As a result, they may appear visually quite salient in depictions of f0 contours. Despite this visual salience, however, f0 micro-perturbations have little effect on listeners’ perceptions of speakers’ intended intonational patterns (though they can significantly influence phoneme identification and duration perception; see §3.3.3). Pitch pattern annotators, in other words (both humans and machines), must learn how f0 perturbations manifest themselves for the consonants of the language under analysis (Jun 1996), and apply this knowledge when evaluating the perceptual relevance of consonant-adjacent f0 movements, particularly those smaller than 3 ST (Mertens 2004) and shorter than about 60 ms (Jun 1996). Special caution is required when annotating phonologically relevant high and low targets near consonants; f0 perturbations, especially local dips before or inside voiced consonants, ‘could easily be misinterpreted as [low] tonal targets’ (Braun 2005: 106; cf. Petrone and D’Imperio 2009). The same applies to local f0 maxima and their interpretation as high tonal targets after voiceless consonants. Several explanations exist for apparent listener ‘deafness’ to consonantal f0 micro-perturbations. One is that such movements are so abrupt that they fall below the glissando threshold, and/or the JNDs for change in pitch level and range (see §3.3.1). This cannot be the whole story, however, since micro-perturbations are clearly audible, insofar as they function as cues to segmental contrasts (e.g. stop voicing) (Kohler 1985; Terken 1995), sometimes even resulting in phonologization of tone contrasts that replace their consonantal precursors altogether (Haudricourt 1961; Hombert et al. 1979; House 1999; Mazaudon and Michaud 2008). Rosenvold (1981) approaches this paradox by assuming that f0 perturbations are integrated first as segmental properties in perception, thereby exempting them from additional parsing at the intonational level, a view also argued for by Kingston and Diehl (1994). Ultimately, though, unanswered questions remain. A second reason why not all observable f0 changes manifest themselves as differences in perceived pitch comes from House’s (1990, 1996) Theory of Optimal Tonal Perception, according to which continuous f0 movements are parsed into a sequence of either level pitches or pitch movements depending on the information density of the speech signal over the relevant span. At segment onsets, for example, and particularly consonant–vowel boundaries, listeners must process a great deal of new information (see Figure 3.2a).
The auditory system is thus fully engaged with processing spectral information for segmental purposes, causing it to reduce f0 movements in those regions to steady pitch events, approximately corresponding to the mean f0 of the movement’s final centiseconds
[Figure 3.2 appears here: panel (a), ‘Time course of cognitive workload across a CVC syllable’; panel (b), ‘Perception of f0 movements’, showing High, Fall, and Low percepts for f0 falls (100–150 Hz, 200 ms scale) aligned differently with a C–V–C sequence.]
Figure 3.2 Schematic representation of the two key hypotheses of the Theory of Optimal Tonal Perception of House (1990, 1996): (a) shows the assumed time course of information density or cognitive workload across a CVC syllable and (b) shows the resulting pitch percepts for differently aligned f0 falls.
(about 3–4 cs; see d’Alessandro et al. 1998 for an alternative pitch-level calculation). Only when enough cognitive resources are available can f0 movements be tracked faithfully as dynamic events, which requires a spectrally stable sonorant section in the speech signal of at least 100 ms, as typically provided by vowels (stressed and/or long vowels in particular). This is why only the middle f0 fall in Figure 3.2b across a sonorant consonant-initial (CVC) sequence will be perceived as a falling movement (F), whereas the other two falls create the impression of either High (H) or Low (L) tones. The Prosogram algorithm for displaying and analysing f0 contours (Mertens 2004) is the first software to incorporate both this and the glissando threshold (§3.3.1) into its analyses. A third way in which sound segments limit the perception of f0 patterns is related to House’s Theory of Optimal Tonal Perception and concerns the interplay between the segmental sonority of the speech signal and the salience of f0 events. In general, pitch percepts are more robust and salient when they originate from more sonorous segments. It is thus no accident (e.g. Gordon 2001a; Zhang 2001) that phonologically relevant f0 events occur primarily in the higher-sonority regions of the speech signal—for example, right after the onset of a vowel or towards its end (Kohler 1987; Xu 1998; Atterer and Ladd 2004). In a study investigating perceived scaling of plateau-shaped English pitch accents (see §9.5.1 on peak shape across languages), Barnes et al. (2011, 2014) observe that f0 plateaux that coincide with high-sonority segments (e.g. accented vowels) are judged higher than identical plateaux that partially overlap with less sonorous coda consonants. Accordingly, they include segmental sonority as one of the weighting factors determining each f0 sample’s influence on the holistic percept of the timing and scaling of f0 events instantiating pitch accents, a percept they call the Tonal Center of Gravity. Thus, a plateau-shaped accent that appears visually to exhibit typical f0 alignment for a L*+H (late-peak) accent in American English may sound instead like a L+H*, if the f0 plateau
extends largely over lower-sonority segmental material. Clearly, then, f0 contours cannot be properly analysed in isolation from the segmental context over which they are realized, and annotators must rely on their ears as much as, or even more than, on convenient electronic visualizations. These findings accord with Niebuhr and Kohler (2004), who suggest that categorical perception (CP) of intonational contrasts is modulated by the presence of sonority breaks, notably CV or VC boundaries. Niebuhr (2007c) demonstrated that the abruptness of this sonority break (taken as the slope of intensity change) may alter the clarity with which perception experiments appear to show CP in their results. Lessening a sonority break’s abruptness can make an otherwise categorical-seeming identification function appear gradual, while enhancing a break’s abruptness can do the opposite. In any case, the nature of a (categorical or gradual) perceptual boundary between prosodic entities like pitch accents is not determined entirely by the nature of the entities themselves, but also involves the segmental context in which they are perceived (see §9.5.1 for more on timing relations between the f0 and intensity contours). These findings call into question the usefulness of CP as a tool for identifying phonological contrast, echoing Prieto’s statement (2012: 531–532) that ‘the application of this [CP] paradigm to intonation research has met with mixed success and there is still a need to test the [...] adequacy of this particular method’. So far, we have dealt with segmental phenomena that reduce the richness of an utterance’s perceived pitch contour in comparison with its observable f0. However, at least two noteworthy instances of the opposite are attested as well. First, while consonantal f0 perturbations may add f0 patterns to the signal that are not (entirely) incorporated into listeners’ perceived pitch pattern, vowels may in fact add something to it that is not immediately visible in recorded f0. This effect, known as intrinsic pitch, is not to be confused with intrinsic f0, whereby high vowels increase vertical tension of the vocal folds, thus raising f0 compared to low vowels, which relax and thicken the vocal folds, hence lowering f0 (Whalen and Levitt 1995; Fowler and Brown 1997). Rather, the effect of intrinsic pitch runs counter to that of intrinsic f0: all else equal, low vowels such as [a] are perceived as higher in pitch than high vowels such as [i] and [u], even when their realized f0 is in fact identical (Hombert 1978; Fowler and Brown 1997; Pape et al. 2005). Thus, an f0 rise across the two vowels [ɑː] and [i] of hobby is (in addition to the facts described in §3.1) perceptually smaller than its f0 range suggests, whereas an analogous fall becomes perceptually larger. The opposite would apply to a word like jigsaw with the inverse ordering of vowel heights (Silverman 1987). Intrinsic-pitch differences are also big enough to create rising or falling pitch perceptions during visually flat f0 regions across vowel sequences or diphthongs (Niebuhr 2004). The intrinsic-pitch effect can thus lend a word like hi, which mainly consists of a diphthong, a falling intonation although f0 is completely flat. A further way in which segments enhance f0 patterns relies on the fact that pitch perception is not restricted to periodic signals alone. Noise signals are also capable of creating aperiodic pitch impressions.
These are particularly strongly influenced by the frequency of F2 but can, more generally, be modelled as a weighted combination of acoustic energy in different frequency bands (Traunmüller 1987), independently of a listener’s phonological background (Higashikawa and Minifie 1999). Thus, to the extent that speakers can control formants and the distribution of acoustic energy in the frequency spectrum, they can also control and actively vary the aperiodic pitch impressions evoked by the noise signals in their speech. As is well known, aperiodic pitch patterns can take over when f0 is not
available, as in whispered speech, where the aperiodic pitch contour allows listeners to reliably identify features of information structure, turn taking, sentence mode, or (in the case of tone languages) lexical meanings (cf. Meyer-Eppler 1957; Abramson 1972; Whalen and Xu 1992; Krull 2001; Nicholson and Teig 2003; Liu and Samuel 2004; Konno et al. 2006). Controlled, functional, and aperiodic pitch impressions were for a long time assumed to be characteristic of whispered speech only. Recent work shows that they are not. Fricatives such as [f s ʃ x] within normally voiced utterances vary in their spectral energy distribution such that the aperiodic pitch impression they create reflects the adjacent f0 level. That is, fricatives are ‘higher pitched’ in high f0 contexts and ‘lower pitched’ in low f0 contexts (‘segmental intonation’; Niebuhr 2009). Segmental intonation occurs at the end as well as in the middle of prosodic phrases and has been found for a number of different phonological f0 contexts in German (Niebuhr 2008, 2012, 2017; Niebuhr et al. 2011b; Ritter and Röttger 2014) and other languages, such as Polish (Żygis et al. 2014), Cantonese (Percival and Bamba 2017), French (Welby and Niebuhr 2019), and Dutch (Heeren 2015). Heeren also corroborates evidence in Mixdorff and Niebuhr (2013) and Welby and Niebuhr (2016) that the segmental intonation of fricatives is integrated into the listener’s overall perception of utterance intonation. Segmental intonation could thus be one reason why the intonation contours of utterances are ‘subjectively continuous’ (Jones 1909: 275) despite the fact that between 20% and 30% of an utterance is usually voiceless. To be sure, the segmental intonation of a fricative does not simply co-vary with the adjacent f0 context, since speakers may also produce spectral energy changes inside these fricatives such that a falling f0 movement (for example) is followed by (or continued in) a falling aperiodic pitch movement (Ritter and Röttger 2014). Furthermore, Percival and Bamba’s (2017) finding that segmental intonation is more pronounced in English than in Cantonese underscores the extrinsic nature of the phenomenon. The most important practical implication of segmental intonation probably concerns phrase-final f0 movements. There is evidence (e.g. Kohler 2011) suggesting that phrase-final f0 movements that are acoustically truncated by final voiceless fricatives can be continued in the aperiodic noise of that fricative, appearing to be perceptually less truncated as a result. Similarly, phrase-final low rises ended by a voiceless fricative might be perceived as high rises, while phrase-final high-to-mid f0 falls ended by a voiceless fricative might be perceived as ending in low pitch. Again, this means that the decision between L-H% and H-^H% in G_ToBI, for instance, cannot be based on the f0 movement alone in such cases, and the same applies to the decision between !H-% and L-%. Figure 3.3 (from Kohler 2011) shows a highly truncated f0 fall followed by [s] at the end of in Stockholm bei der ICPhS ‘in Stockholm at the ICPhS’. Despite the clear final f0 rise, this utterance is not perceived as an open question with a phrase-final rise (!H-%) but as a binding proposal whose falling phrase-final pitch movement (L-%) reaches as low as the phrase-internal one on Stockholm.
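The sonority-weighting idea behind the Tonal Center of Gravity discussed earlier in this section can be made concrete with a toy calculation. The weighting scheme below is our simplification, not Barnes et al.’s model: each f0 sample’s contribution to the perceived timing of an event is scaled by an assumed sonority weight, so that a plateau whose second half overlaps a coda consonant is ‘heard’ earlier than its acoustic midpoint:

```python
# Toy illustration (our simplification, not Barnes et al.'s model) of the
# Tonal Center of Gravity: the perceived timing of an f0 event as a weighted
# average of sample times, with weights reflecting segmental sonority.
def tonal_center_of_gravity(times_s, f0_hz, sonority):
    """Sonority- and f0-weighted mean time of an f0 event (seconds)."""
    weights = [s * f for s, f in zip(sonority, f0_hz)]
    return sum(w * t for w, t in zip(weights, times_s)) / sum(weights)

# A flat 200 Hz plateau from 0.10-0.20 s whose second half overlaps a coda
# nasal; the sonority weights 1.0 (vowel) and 0.4 (coda) are assumptions.
times = [0.10, 0.12, 0.14, 0.16, 0.18, 0.20]
f0 = [200] * 6
son = [1.0, 1.0, 1.0, 0.4, 0.4, 0.4]
print(round(tonal_center_of_gravity(times, f0, son), 3))
# 0.137 -> perceived earlier than the unweighted 0.15 s midpoint
```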
3.3.3 Perceptual interplay between prosodic parameters

Beyond the interplay of segments and prosodies reviewed above, other phonetic characteristics of the signal, such as duration and intensity, also interact with f0 in perception. This is schematized in Figure 3.4. There is, for example, a robust effect of pitch on the perceived durations of stimuli such as syllables, vowels, or even pauses. Lehiste (1976) concludes from
[Figure 3.3 appears here: f0 track (Pitch, Hz; 150–300 Hz) and energy contour (Energy, dB) over time (s) for the utterance In Stockholm bei der ICPhS.]
Figure 3.3 Utterance in Stockholm bei der ICPhS, Kiel Corpus of Spontaneous Speech, female speaker g105a000. Arrows indicate segmental intonation in terms of a change in the spectral energy distribution (0–8 kHz) of the final [s], 281 ms. (Adapted from Kohler 2011)
[Figure 3.4 appears here: a schematic with three nodes (perceived pitch, perceived duration, perceived loudness) linked by arrows labelled ‘tempo decreases’, ‘movements increase’, ‘negative correlation’, and ‘positive correlation’.]
Figure 3.4 Perceived prosodic parameters and their interaction.
perception experiments that a changing fundamental frequency pattern has a strong influence on the listener’s perception of duration (see also Kohler 1986). More specifically, compared to a flat pitch, pitch movements—and falling ones in particular—make a sound or a syllable appear longer to listeners. Van Dommelen (1995) and Cumming (2011a) report similar lengthening effects for Norwegian, German, and French. Brugos and Barnes (2012) show in addition that larger pitch changes across pauses make these pauses appear longer. The work of Yu (2010) extends these findings to the perception of level versus dynamic lexical tones in Cantonese, adding moreover to the picture that syllables with high pitch are perceived as longer than syllables with low pitch (see also Gussenhoven and Zhou 2013). The latter effect is either language or domain specific, though, as Kohler (1986) and Rietveld and Gussenhoven (1987) showed that higher pitch levels over a longer interval cause an increase in perceived speech rate. Perceived duration—in the form of the speaking rate—also has an effect on pitch in the opposite direction in that faster speaking rates narrow perceived pitch ranges. Various investigations point to a further effect of perceived (intensity-related) loudness on pitch (for
an overview, see Rossing and Houtsma 1986). While the direction and magnitude of this effect vary, the most frequent finding is that a decrease in loudness increases perceived pitch, potentially by as much as 1 ST. In addition, higher loudness levels lead to the perception of larger pitch intervals (Thomsen et al. 2012). In turn, loudness is affected by perceived duration (Lehiste 1970). Explanations of this interplay of prosodic parameters range from basic bottom-up reception mechanisms to expectation-based top-down perceptual processing that helps the listener ‘filter out the factors that influence duration and frequency in order to perceive the speaker’s intended’ parameter values (Handel 1989: 422). Lehiste (1970: 118) uses the term ‘correction factors’ in a similar context (see also Yu 2010). Whatever the explanation, the interaction of these parameters underscores the need for prosody to be viewed as a whole, with none of its various dimensions analysed and interpreted in isolation from the others. In this same context, note that Figure 3.4 implies the possibility of effect chains as well. For instance, a longer or shorter perceived duration lowers or raises perceived loudness, which, in turn, raises or lowers perceived pitch; alternatively, a pitch movement increases perceived duration, which, in turn, increases perceived loudness. Such potential interactions are particularly relevant in prosodically flatter portions of the signal, where in the absence of other activity the right combination of these factors might modulate the perceived prominence of syllables, or even cue the presence of less salient H* or L* pitch accents. Lastly, it is worth mentioning that perceived pitch can also be modulated significantly by voice quality or phonation type. Examples of interaction between voice quality and tone/intonation systems are presented in chapters 12, 23, and 29.
3.4 Conclusion

Today, f0 can be extracted from acoustic speech signals and displayed visually with relative ease, making useful and compelling information available on pitch contours during speech. Nonetheless, in this chapter we have urged researchers not to mistake visual salience for perceptual reality. Intonation models involving integration of f0 information over time, such as Prosogram (Mertens 2004), Tilt (Taylor 2000), Tonal Center of Gravity (Barnes et al. 2012b), or Contrast Theory (Niebuhr 2013; for a discussion see chapter 9), represent steps towards operationalizing this reality. Meanwhile, we hope to have convinced readers that a meaningful analysis of acoustic f0 in speech signals must take into account not just how it was produced but also how it is perceived.
Part II

PROSODY AND LINGUISTIC STRUCTURE
Chapter 4

Tone Systems

Larry M. Hyman and William R. Leben
4.1 Introduction: What is tone?

All languages use ‘tone’ if what is meant is either pitch or the f0 variations that are unavoidable in spoken language. However, this is not what is generally meant when the term is used by phonologists. Instead, there is a major typological split between those languages that use tone to distinguish morphemes and words versus those that do not. Found in large numbers of languages in sub-Saharan Africa, East and South East Asia, parts of New Guinea, Mexico, the Northwest Amazon, and elsewhere, tone can be used to distinguish lexical morphemes (e.g. noun and verb roots) or grammatical functions. Thus, in Table 4.1 the same eight-way tonal contrast has a lexical function on nouns, but marks the indicated grammatical distinctions on the verb /ba/ ‘come’ in Iau [Lakes Plain; West Papua] (Bateman 1990: 35–36).¹
Table 4.1 Tonal contrasts in Iau

Tone  Noun  Gloss            Verb  Gloss                     Inflectional meaning
H     bé    ‘father-in-law’  bá    ‘came’                    Totality of action punctual
M     bē    ‘fire’           bā    ‘has come’                Resultative durative
HS    bé˝   ‘snake’          bá˝   ‘might come’              Totality of action incompletive
LM    be᷅    ‘path’           ba᷅    ‘came to get’             Resultative punctual
HL    bê    ‘thorn’          bâ    ‘came to end point’       Telic punctual
HM    be᷇    ‘flower’         ba᷇    ‘still not at endpoint’   Telic incompletive
ML    be᷆    ‘small eel’      ba᷆    ‘come (process)’          Totality of action durative
HLM   bê̄    ‘tree fern’      bâ̄    ‘sticking, attached to’   Telic durative
As the above monosyllabic examples make clear, tone can be a crucial exponent of morphemes, which may be distinguished only by tone. While Iau tone is hence densely paradigmatic, at the other end of the spectrum tone can be quite sparse and syntagmatic. This is the case in Chimwiini [Bantu; Somalia], where a single H tone contrasts with zero (Ø), is strictly grammatical (nouns and verbs are underlyingly toneless), and can occur only on the final or penultimate syllable of a phonological phrase (Kisseberth and Abasheikh 2011: 1994):

(1) a. n-jileː n̪amá ‘I ate meat’        n-jile ma-tuːndá ‘I ate fruit’
    b. jileː n̪amá ‘you sg. ate meat’    jile ma-tuːndá ‘you sg. ate fruit’
    c. jileː n̪áma ‘s/he ate meat’       jile ma-túːnda ‘s/he ate fruit’

As seen from the above examples, the H tone will occur phrase-finally in the past tense if the subject prefix on the verb is either first or second person. Otherwise the phonological phrase will receive a default penultimate H.

¹ For tone, H = high (´), ꜜH = downstepped H (ꜜH), M = mid (¯), L = low (`), and S = superhigh (˝). HS thus represents a contour that rises from high to superhigh tone. Our segmental transcriptions of African language data follow standard practice in African linguistics, which sometimes conflicts with the International Phonetic Alphabet. For example, we use ‘y’ for IPA [j] and ‘c’ and ‘j’ for IPA [tʃ] and [dʒ]. In a few forms cited (not in phonetic brackets) we are unsure of what segment our source intended by a symbol.
4.1.1 Tone as toneme versus morphotoneme

Comparisons of Iau and Chimwiini, representing two extremes, reveal two different approaches to tonal contrasts, which are exemplified by two pioneers in the study of tone. For Kenneth Pike the presence of tone had to do with surface phonological contrasts. Hence, a language with tone is one ‘having significant, contrastive, but relative pitch on each syllable’ (K. Pike 1948: 3). For William E. Welmers, on the other hand, tone was seen to be an underlying property of morphemes. Hence, a tone language is one ‘in which both pitch phonemes [read: features] and segmental phonemes enter into the composition of at least some morphemes’ (Welmers 1959: 2; 1973: 80). Since Pike conceptualized tone as relatively concrete surface contrasts, he assumed that every output syllable carries a tone (or tones), as in Iau. Welmers, on the other hand, emphasized that a tone system could have toneless tone-bearing units (TBUs) as well as toneless morphemes (e.g. toneless noun and verb roots in Chimwiini). We here follow Welmers in defining a tone language as one in which both tonal and segmental features enter into the composition of at least some morphemes.
4.1.2 Tone as pitch versus tone package

As indicated, the bare minimum to be considered a ‘tone language’ is that pitch enters as a (contrastive) exponent of at least some morphemes. However, more than pitch can be involved in a tonal contrast. This is particularly clear in Chinese and Southeast Asian languages. As seen in Table 4.2, different phonation properties accompany the six-way tonal contrast in Hanoi Vietnamese (Kirby 2011: 386).
Table 4.2 Tonal contrasts in Vietnamese

Vietnamese term  Example  Gloss             Pitch level  Contour  Other features
Ngang            ma       ‘ghost’           high-mid     level    laxness
Huyền            mà       ‘but, yet’        mid          falling  laxness, breathiness
Sắc              má       ‘cheek’           high         rising   tenseness
Nặng             mạ       ‘rice seedling’   low          falling  glottalization or tenseness
Hỏi              mả       ‘tomb’            low          falling  tenseness
Ngã              mã       ‘code’            high         rising   glottalization
In ‘stopped’ syllables ending in /p, t, k/ the above six tones are neutralized to a binary contrast between a ‘checked’ rising versus low tone (mát ‘cool’, mạt ‘louse, bug’). In addition to glottalization, breathiness, and tense-laxness, different tones can have different durations. While falling and rising contour tones may be associated with greater duration (Zhang 2004a), tone-specific durational differences are not always predictable on the basis of universal phonetics. Thus, of the four tones of Standard Mandarin as spoken in isolation, level H Tone 1 and HL falling Tone 4 tend to be shorter than rising Tone 2, which is shorter than low-dipping Tone 3 (Xu 1997: 67).² Correlating with such complex phonetic realizations found in Chinese and South East Asia is the traditional view of areal specialists that contour tones should be interpreted as units and not sequences of individual level tones. The typological distinction seems therefore to be between tone as a ‘package’ of features (necessarily including pitch) versus tone as pitch alone (cf. Clements et al. 2010: 15). For examples from South East Asia, see §23.2.2, which refers to such tone packages as tonation, following Bradley (1982). Especially prevalent in languages with monosyllabic words, the worldwide distribution of these complexes of tone and other laryngeal features can be characterized as the Sinosphere versus the rest of the world. Outside the Sinosphere (Matisoff 1999), phonations and non-universal timing differences between tones are much rarer. Where they do occur (e.g. in the Americas), they are generally independent of the tones, and tonal contours are readily decomposable into sequences of level tones (cf. §4.2.2).
4.1.3 Tone-bearing unit versus tonal domain (mora, syllable, foot)

Another way tone systems can differ is in their choice of TBU and tonal domain. By TBU we mean the individual landing sites to which the tones anchor. Past literature has referred to vowels (or syllabic segments), to morae, or to syllables as the carriers of tone. Some languages count a bimoraic (heavy) syllable as two TBUs and a monomoraic (light) syllable as one. Such languages often allow HL or LH contours only on bimoraic syllables. Thus, in Jamsay [Dogon; Mali], bimoraic CV: syllables can be H, L, HL, or LH, while monomoraic CV syllables can only be H or L (Heath 2008: 81). Other languages are indifferent to syllable weight and treat all syllables the same with respect to tone. Another notion distinct from the TBU is the domain within which tones (or tonal melodies) are mapped. In Kukuya [Bantu; Republic of Congo], for example, the five tonal melodies /L, H, LH, HL, LHL/ are a property of the prosodic stem (Paulian 1975; Hyman 1987). Thus, in (2), the /LHL/ ‘melody’ stretches out over the maximally trimoraic stem.

(2) (ndὲ) bvɪ᷈      ‘(s/he) falls’
    (ndὲ) kàây    ‘(s/he) loses weight’   (underlying /kàɪ̂/)
    (ndὲ) pàlɪ̂    ‘(s/he) goes out’
    (ndὲ) bàámì   ‘(s/he) wakes up’
    (ndὲ) kàlə́gì  ‘(s/he) turns around’
² Xu adds that these findings match results from earlier studies, citing Lin (1988a). We thank Kristine Yu for locating these studies for us. See also Yu (2010: 152) and references cited therein.
For some this has meant that the prosodic stem is the TBU. However, it is important to keep distinct the carrier of tone (mora, syllable) versus the domain within which the tones or tonal sequences map. While the distinction is usually clear, Pearce’s (2013) study of Kera [Chadic; Chad, Cameroon] shows how the two notions can be confused. In this language, tones are mapped by feet. Since it will often be the case that a foot takes one or another tone (or tone pattern), it is tempting to refer to the foot as the TBU. A similar situation arises in Tamang [Tibeto-Burman; Nepal] (Mazaudon and Michaud 2008), where there are four word-tone patterns (with phonations) that map over words. In languages that place a single ‘culminative’ tone, typically H, within a prosodic domain, as in Chimila [Chibchan; Colombia] (Malone 2006: 34), the H is often described not only as a property of its TBU but also of its domain. However, the distinction between TBU and tonal domain is clearer in most languages, and it is useful to keep them separate.
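The Kukuya mappings in (2) are consistent with the classic autosegmental association convention: link tones to TBUs one to one, left to right, and dock any leftover tones on the final TBU. The sketch below implements that textbook convention (our illustration; it is not the chapter’s formal analysis, and it omits the complementary convention of spreading the last tone over extra TBUs):

```python
# One-to-one, left-to-right tone-to-TBU association; leftover tones dock on
# the final TBU (the standard textbook convention, offered as illustration).
def associate(melody, n_tbus):
    """Map a tone melody onto n TBUs; extra tones pile up on the last TBU."""
    links = [[] for _ in range(n_tbus)]
    for i, tone in enumerate(melody):
        links[min(i, n_tbus - 1)].append(tone)
    return links

for stem, morae in [("bvi", 1), ("kaai", 2), ("kalagi", 3)]:
    print(stem, associate("LHL", morae))
# bvi    [['L', 'H', 'L']]       rising-falling contour on one mora, cf. bvɪ᷈
# kaai   [['L'], ['H', 'L']]     fall on the second mora, cf. kàây
# kalagi [['L'], ['H'], ['L']]   one tone per mora, cf. kàlə́gì
```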
4.1.4 Tone versus accent

The example of a single H per domain brings up the question of whether the H is (only) a tone or whether it is also an ‘accent’. We saw such a case in Chimwiini, with only one H tone per phonological phrase. One way to look at this H is from the perspective of the domain, the phonological phrase. In this case, since there can be only one H, the temptation is to regard the H as an ‘accent’, as Kisseberth and Abasheikh (1974) refer to it. However, in (1), final H is a strictly tonal exponent of the first or second person subject prefix. In this sense it satisfies the definition of tone, which is our only concern here. Although there are other cases where the tone-versus-accent distinction becomes blurred, the goal is not to assign a name to the phenomenon, but rather to understand strictly tonal properties. While there are very clear cases of tone, such as Iau in Table 4.1, and of accent (e.g. stress in English), no language requires a third category called ‘pitch accent’ or ‘tonal accent’ (Hyman 2009). Instead, there are languages with a restricted, ultimately obligatory and culminative ‘one and only one H tone per domain’, as in Kinga [Bantu; Tanzania] (Schadeberg 1973) and Nubi [Arabic-lexified creole; Uganda] (Gussenhoven 2006). Between this and a system that freely combines Hs and Ls, languages place a wide range of restrictions on tonal distributions.
4.2 Phonological typology of tone by inventory

There are a number of ways to typologize tone systems by inventory. The first concerns the number of tones, which can be calculated in one of two ways: (i) the number of tone heights and (ii) the number of distinct tonal configurations, including level and contour tones. Tone systems can also differ in whether they allow register effects such as downstep, in the various constraints they place on the distribution of their tones, and in whether the lack of tone (Ø) can function as a contrastive value. We take up each of these in this section.
4.2.1 Number of tones

In order to satisfy the definition of a tone language in §4.1, there must minimally be a binary contrast in pitch. In most cases this will be a contrast between the two level tones /H/ and /L/, as in Upriver Halkomelem [Salish; Canada] /qwáːl/ ‘mosquito’ vs. /qwàːl/ ‘to speak’ (Galloway 1993: 3). Other languages contrast up to five tone heights, as in Shidong Kam [Tai-Kadai; China] (Edmondson and Gregerson 1992: 566). A tone system may distinguish fewer underlying contrastive tone heights than surface ones. A particularly dramatic case of this is Ngamambo [Bantoid; Cameroon], which, although analysed with /H, L/ (Hyman 1986a), presents a five-way H, M, ꜜM, L˚, L contrast on the surface. (Concerning L˚ see §4.2.2; concerning ꜜM see §4.2.3.) More than a simple distinction of contrasting heights is sometimes needed based on the phonological behaviour of the tones. While the most common three-height tonal contrast is /H, M, L/ (á, ā, à), where /M/ functions as quite distinct from /H/ and /L/, some tone systems instead distinguish H and extra H (a̋, á, à) or L and extra L (á, à, ȁ).
4.2.2 Contour tones

In addition to level tones, languages often have contour tones, where the pitch can be falling, rising, rising-falling, or falling-rising. Essentially what this means is that two or more tone heights are realized without interruption by a (supralaryngeal) consonant. Contours thus occur in all but the last Kukuya example in (2). As shown autosegmentally in (3a), the LHL sequence is realized on a single mora. In (3b) the LH sequence is realized one to one on the first two morae. Both (3a) and (3b) would be called contours in contrast with the L-to-H-to-L transitions in (3c), where each tone is linked to a CV mora.

(3) a. bvɪ᷈      L, H, and L all linked to a single mora
    b. bàámì   L and H linked one to one to the first two morae, L to the final mora
    c. kàlə́gì  L, H, and L linked one to one to the three CV morae
Thus, contours arise either when more than one tone links to the same TBU or when two or more tones link to successive vocalic morae. A third possible interpretation, often assumed in the study of Chinese and South East Asian languages, would treat the sequenced pitch gestures as a single unit, as mentioned in §4.1.2, such as ‘falling’ (see Yip 1989, 2002: 50–52). Note that the above refers to phonological contours. It is often the case that level tones also redundantly contour: it is very common for sequences of like tones to slightly rise or trail off in actual pronunciation. Even among closely related languages there can be differences. Within the Kuki-Chin branch of Tibeto-Burman, a prepausal L tone will abruptly fall in Kuki-Thaadow, e.g. /zààn/ ‘night’, realized with falling pitch. This is the most common realization of a L tone before pause. In closely related Hakha Lai, however, L tone is realized with level pitch, e.g. /kòòm/ ‘corn’. Other languages can have a surface contrast between falling versus level L. In most cases the level tone, represented with ˚, can be shown to be the result of the simplification of a final
rising tone or of the effect of a ‘floating’ H tone in the underlying representation, e.g. Bamileke-Dschang [Bantoid; Cameroon] /lə̀-tɔ̀ŋ´/ → lə̀tɔ̀ŋ˚ ‘navel’ vs. /lə̀-tɔ̀ŋ/ → lə̀tɔ̀ŋ ‘to reimburse’ (Hyman and Tadadjeu 1976: 91). Correspondingly, there are languages where /H/ (more rarely /M/) is realized as a falling contour before pause (and hence in isolation), e.g. Leggbó [Cross River; Nigeria] /dzɔ́/ → dzɔ̂ ‘ten’ (Hyman, personal notes) and Tangkhul Naga [Tibeto-Burman; North East India] /sām/ → sa᷆m ‘hair’ (Hyman, personal notes). In other cases contour tones arise by fusing unlike tones, either between words, as in the reduplication case in Etsako [Benue-Congo; Nigeria] (Elimelech 1978: 45) in (4a), or by affixation of a grammatical tone, as in Tanacross [Athabaskan; Alaska] in (4b), where the possessive H+glottal stop suffix also conditions voicing (Holton 2005: 254).

(4) a. ówà + ówà ‘house’ → ówǒwà ‘every house’
    b. š-tš’òx + ´ʔ → š-tš’ǒɣʔ ‘my quill’
In other cases input contours are simplified to level tones (see §4.3.2). In short, contours can be either underlying or derived.
4.2.3 Downstep and floating tones

In most two-height systems, alternating Hs and Ls in declarative utterances usually undergo ‘downdrift’, in which each H preceded by L is lowered, with each lowered H establishing a new terrace for further tones. Downdrift is absent in a few two-height languages, such as Haya [Bantu; Tanzania] (Hyman 1979a). Independent of whether a language has downdrift or not, it may also have non-automatic downsteps, marked by ꜜ. The most common downstepped tone is ꜜH, which usually contrasts with H only after another (ꜜ)H, as in Aghem [Bantoid; Cameroon]. As seen in (5a), the two nouns ‘leg’ and ‘hand’ are both realized H-H in isolation (Hyman 2003).

(5) a. kɨ́-fé   H H    ‘leg’             kɨ́-wó    H H    ‘hand’
    b. fé kɨ́n  H H    ‘this leg’        wó ꜜkɨ́n  H L H  ‘this hand’
    c. fé kɨ̂a  H L    ‘your sg. leg’    wó kɨ̀a   H L L  ‘your sg. hand’
However, as seen in (5b) and (5c), they have different effects on the tones that follow (the noun class prefix /kɨ́-/ drops when these nouns are modified). When followed by the /H/ tone /kɨ́n/ ‘this’ in (5b), ‘this leg’ is realized H-H, while ‘this hand’ is realized H-ꜜH. As indicated, lowering or downstepping of /kɨ́n/ is conditioned by an abstract floating L tone (which used to be on a lost second stem syllable; cf. Proto-Bantu *-bókò ‘hand’). While this L has no TBU to be pronounced on its own, it has effected a ‘register lowering’ on the H of the demonstrative. The same floating L tone blocks the H tone of /-wó `/ ‘hand’ from spreading onto the following L tone possessive pronoun /kɨ̀a/ in (5c).
Although first and most extensively documented in African tone systems, downstepped H is found in New Guinea, e.g. Kairi [Trans New Guinea; Papua New Guinea] (Newman and Petterson 1990); Asia, e.g. Kuki-Thaadow [Tibeto-Burman; Myanmar, North East India] (Hyman 2010a); Mexico, e.g. Coatzospan Mixtec [Mixtecan] (E. Pike and Small 1974); and South America, e.g. Tatuyo [Tukanoan; Colombia] (Gomez-Imbert 1980). The most common source is the loss of a L tone between Hs, whether by loss of the TBU (as in the Aghem example in (5b)), by simplification of a contour tone in a HL-H or H-LH sequence, or by assimilation of a /H-L-H/ sequence to either H-H-ꜜH or H-ꜜH-H, e.g. Igbo [Benue-Congo; Nigeria] /ócé/ ‘chair’ + /àtó̙/ ‘three’ → ócé ꜜátó̙ ‘three chairs’ (Welmers and Welmers 1969: 317). Another source is the downstepping of a H that directly follows another H, as when Shambala [Bantu; Tanzania] /nwáná/ ‘child’ + /dú/ ‘only’ is realized nwáná ꜜdú ‘only a child’ (Odden 1982: 187). While H is by far the most common downstepped tone, downstepped M and L also occur as contrastive tones, albeit rarely. An example is the five-way H, M, ꜜM, L˚, L contrast of Ngamambo [Bantoid; Cameroon] (Hyman 1986a: 123, 134–135). In some languages, the sequence H-ꜜH contrasts phonologically with H-M, even though the step from H to ꜜH can be identical (or nearly so) to the step from H to M. The difference is phonological: ꜜH establishes a new register, so that an immediately following H will be on the same level as ꜜH, while following H-M a H will go up a step in pitch (Hyman 1979a). While downstep is clearly established in the literature, there are also occasional mentions of an opposite ‘upstep’ phenomenon whereby H tones become successively higher. This is found particularly in Mexican tone languages. In Acatlán Mixtec (E. Pike and Wistrand 1974: 83), where H and upstepped H contrast, upstep appears to be the reverse of downstep. In Peñoles Mixtec (Daly and Hyman 2007: 182) a sequence of input Hs is realized level; however, if preceded by a L, the Hs will each go up in pitch, ultimately reaching the upper end of a speaker’s pitch range. Upstep has also been reported in some African languages, such as Krachi [Kwa; Ghana] (Snider 1990). A number of other languages have what has been called ‘upsweep’: a sequence of H tones begins quite low and reaches an ultimately H pitch level (Tucker 1981). One such language is Baule [Kwa; Ivory Coast] (Leben and Ahoua 1997). Connell (2011) and Leben (in press) survey downstepping and upstepping phenomena in some representative languages.
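The terracing behaviour of downdrift and downstep described in this section can be caricatured with a toy register model: every (ꜜ)H that follows a L, overt or floating, is realized a fixed ratio below the preceding H’s register. All parameter values below are illustrative assumptions, not measurements from any language:

```python
# A toy register model of downdrift/downstep (our sketch, not a claim about
# any particular language): each (ꜜ)H after a L -- overt or floating --
# lowers the register by a fixed ratio. h, l, and step are assumed values.
def realize(tones, h=120.0, l=90.0, step=0.85):
    """Map a string of 'H'/'L'/'!H' (for ꜜH) tones to rough f0 targets in Hz."""
    register, out, prev = 1.0, [], None
    for t in tones:
        if t == "!H" or (t == "H" and prev == "L"):
            register *= step                 # step down to a new terrace
        out.append(round((h if t != "L" else l) * register))
        prev = "L" if t == "L" else "H"
    return out

print(realize(["H", "L", "H", "L", "H"]))    # downdrift: [120, 90, 102, 76, 87]
print(realize(["H", "!H", "H"]))             # downstep: [120, 102, 102]
```

Note that in the second run the H after ꜜH surfaces at the ꜜH level, mirroring the phonological diagnostic mentioned above: ꜜH establishes a new register for the tones that follow it.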
4.2.4 Underspecified tone and tonal markedness

In many tone languages one of the contrastive tone heights is best represented as the absence of tone. The simplest and most common case is a two-height system with an underlying contrast between /H/ and Ø. TBUs that do not have an underlying /H/ may acquire H or L by rule, the latter most often as a default tone. The primary argument for zeroing out a tone is that it is not 'phonologically activated' in the sense of Clements (2001). In /H/ vs. Ø languages, phonological rules and distributions refer to H but not to L. Examples are Chichewa [Bantu; Malawi] (Myers 1998), Tinputz [Oceanic; Papua New Guinea] (Hostetler and Hostetler 1975), Blackfoot [Algonquian; Montana, Alberta] (Stacy 2004), and Iñapari [Arawakan; Peru] (Parker 1999). Some languages contrast /L/ vs. Ø, e.g. Malinke [Mande;
Mali] (Creissels and Grégoire 1993), Bora [Witotoan; Colombia, Peru] (Thiesen and Weber 2012: 56), and a number of Athabaskan languages (Rice and Hargus 2005: 11–17). The asymmetrical behaviour of different tones becomes even more evident in languages with multiple tone heights. A common analysis in a three-height tone system is to treat the M tone as Ø, as originally proposed for Yoruba [Benue-Congo; Nigeria] (Akinlabi 1985; Pulleyblank 1986), where only the H and L tones are activated. However, Campbell (2016) has shown that Zenzontepec Chatino [Oto-Manguean; Mexico] has a /H, M, Ø/ system, where Ø TBUs assimilate to a preceding (activated) H or M or otherwise receive the default L tone. Finally, there are /H, L, Ø/ systems where Ø does not represent a distinct tone height but rather a TBU with a third behaviour. Thus in Margi [Chadic; Nigeria], roots and suffixes can have a fixed /H/ or /L/, or can have a third tone (Ø) that varies between H and L depending on the neighbouring tone (Pulleyblank 1986: 69–70).
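The privative analysis lends itself to a simple default-filling procedure. As an illustration only (our sketch of the Zenzontepec Chatino description above, with None standing for a toneless TBU):

    # Sketch of the Zenzontepec Chatino pattern: a toneless TBU copies a
    # preceding activated H or M, and otherwise receives the default L.

    def fill_defaults(tbus):
        out, prev = [], None
        for t in tbus:
            if t is None:                   # phonologically inactive TBU
                t = prev if prev in ('H', 'M') else 'L'
            out.append(t)
            prev = t
        return out

    print(fill_defaults(['H', None, None]))  # ['H', 'H', 'H']
    print(fill_defaults([None, 'M', None]))  # ['L', 'M', 'M']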
4.2.5 Distributional constraints

We have already mentioned that tone is denser in some tone systems than in others (cf. Gussenhoven 2004: 35). At one end of the density scale are languages where all tonal contrasts are fully specified and realized in all positions. Assuming that the syllable is the TBU, a /H, L/ system would thus predict two possible contrasts on monosyllabic words, four contrasts on disyllabic words, eight contrasts on trisyllabic words, and so forth, as in Andoke [isolate; Colombia] (Landaburu 1979: 48). At the other extreme are systems such as Somali [Cushitic; Somalia], which, along with other restrictions, rarely allows more than one /H/ per word (Green and Morrison 2016). In between the extremes are systematic constraints, such as on the distribution of underlying tones or on their surface realization. In Tanimuka [Tukanoan; Colombia] (Keller 1999: 77), for example, disyllabic words are limited to H-H, L-H, and H-L, with *L-L non-occurring. The same requirement of at least one H is found on trisyllabic words (*L-L-L), but, in addition, there are no words of the shape *H-L-L. (There are no monosyllabic words.) As mentioned in §4.1.3, Kukuya allows the five prosodic stem melodies /L, H, LH, HL, LHL/. Significantly, it is missing the possibility of a /HLH/ melody. Where underlying /HLH/ sequences occur, they are frequently modified (Cahill 2007), such as to trisyllabic H-H-H, H-ꜜH-H, or H-H-ꜜH. Languages may also restrict some or all tonal contrasts to the stressed syllable, as in the 'Accent 1' versus 'Accent 2' contrast in Swedish and Norwegian (Riad 1998a inter alia). More dramatic is the Itunyoso Trique [Oto-Manguean; Mexico] nine-way tonal contrast (45, 4, 3, 2, 1, 43, 32, 31, 13), realized only on the word-final (stressed) syllable (DiCanio 2008). Both underlying and derived contour tones can also have strict distributional constraints. First, they can be restricted by syllable type. Heavy, especially long-vowel syllables support tonal contours better than syllables with shorter rimes or stop codas (Gordon 2001a; Zhang 2004a). In addition, tonal contours can be restricted to stressed syllables or to phrase-final or penultimate position. The following markedness scale is generally assumed (where R = rising, F = falling, and > means 'more marked than'): RF, FR > R > F > H, L (cf. Yip 2002: 27–30). Finally, contours can be restricted by what precedes or follows them: some languages require that a contour be preceded or followed by a like tone height
(e.g. L-LH, HL-L), while others prefer that the neighbouring tone height be opposite (e.g. H-LH, HL-H) (Hyman 2007: 14). Other languages restrict contour tones to final position (Clark 1983).
4.3 Phonological typology of tone by process

In some languages the input or underlying tonal contrasts are realized essentially unchanged in the surface phonology. Such is the case in Tangkhul Naga [Tibeto-Burman; North East India], which has no tonal morphology or tonal alternations; the /H/, /M/, and /L/ tones that contrast on lexical morphemes (páay 'defecate', pāay 'be cheap, able', pàay 'jump') do not change in the output. However, in many (perhaps most) tone systems, input tones can be modified by processes triggered by another tone, a boundary, or the grammar (morphology, syntax). In this section we consider the most common phonologically conditioned tone rules.
4.3.1 Vertical assimilation

Whenever tones of different heights occur in sequence, the pitch level of one or the other can be raised or lowered by what we call 'vertical assimilations'. In a two-tone system, the interval of a /L-H/ sequence generally compresses, while a /H-L/ interval expands. For example, a H is raised to a new extra-H level before a L in Engenni [Edoid; Nigeria] (Thomas 1978: 12). While only the H TBU immediately before the L is affected in Engenni, in other languages it can be a whole sequence of Hs, especially when the Hs are raised in anticipation of an upcoming downstep, which can be located several syllables away, as in the examples in (6) from Amo [Kainji; Nigeria] (Hyman 1979a: 25n).

(6) a. kìté úkɔ́ɔ́mí fínáwà   'the place of the bed of the animal'
    b. kìꜛté úkɔ́ɔ́mí fíkáꜜlé  'the place of the bed of the monkey'
By expanding the /L-H/ interval of kìꜛté 'place' in (6b), speakers create the tonal space for what can in principle be an unlimited number of ꜜH pitch levels (cf. §4.2.3). Such anticipatory pre-planning is extremely common, perhaps universal in languages with downstep (cf. Rialland 2001; Laniran and Clements 2003). Vertical assimilations can occur in multi-height tone systems as well. Thus, Jamieson (1977: 107) reports that all three non-low tones are raised before a low tone in four-height Chiquihuitlán Mazatec [Oto-Manguean; Mexico]. Less common are cases where the /H-L/ interval is compressed to [M-L] or [H-M], the latter occurring in the Kalenjin group of Nilotic [Kenya], e.g. Nandi áy-wà → áy-wā 'axe' (Creider 1981: 21) and Endo tány 'cow' + àkà 'another' → tány ākā 'another cow' (Zwarts 2004: 95). Finally, note that vertical assimilation can be conditioned by a boundary, as when one or more H TBUs are realized M before pause in Isoko and Urhobo [Edoid; Nigeria] (Elugbe 1977: 54–55).
4.3.2 Horizontal assimilation

Whereas vertical assimilations involve an upward or downward adjustment in pitch range, horizontal assimilations involve cases where a tone extends to a neighbouring TBU. Better known as 'tone spreading', the most common cases involve perseverative assimilations in which the first tone spreads into a following TBU. In horizontal assimilations there is a tendency for a tone to last too long rather than start too early (Hyman and Schuh 1974: 87–90), as schematized in (7).

(7) a. Natural (perseverative):     L-H → L-LH,  H-L → H-HL
    b. Less natural (anticipatory): L-H → LH-H,  H-L → HL-L
In (7a), L tone spreading (LTS) and H tone spreading (HTS) create a contour tone on the next syllable, which may, however, be simplified by subsequent processes, as in the case of HTS in Adioukrou [Kwa; Ivory Coast] in (8c) below (Hérault 1978: 11).

(8) a. /jɔ́w + à/   → jɔ́w â     'the woman'
    b. /tʃǎn + à/  → tʃǎn â    'the goat'
    c. /má + dʒěn/ → má dʒe᷉n   'type of pestle' (→ [má dʒéꜜń])

LTS occurs less frequently (and often with more restrictions) than HTS, e.g. applying in Nandi [Nilotic; Kenya] only if the H TBU has a long vowel: /là̙ːk-wé̙ːt/ → là̙ːk-wě̙ːt 'child' (Creider 1981: 21). Many languages have both HTS and LTS, which potentially interact with each other in longer utterances. In (9a) we see that HTS combined with LTS creates successive contour tones in Yoruba [Benue-Congo; Nigeria] (Laniran and Clements 2003: 207). In Kuki-Thaadow [Tibeto-Burman; North East India, Myanmar] the expected contours in (9b) are, however, simplified, since the language allows contours only on the final syllable (Hyman 2010a).

(9) a. /máyò̙ mí rà wé/ (H-L + H-L-H) → [máyô̙ mǐ râ wě]      'Mayomi bought books'
    b. /kà zóoŋ lìen thúm/ (L-H-L-H) → [kà zòoŋ líen thǔm]  'my three big monkeys'
In languages with a M tone, spreading most likely occurs between Hs and Ls, where the interval is greater than between M and either H or L. However, Gwari [Benue-Congo; Nigeria] (Hyman and Schuh 1974: 88–89) not only has HTS and LTS of the sort illustrated in (9), but also /M-L/ is realized M-ML, e.g. /ōzà/ → ōza᷆ 'person'. We thus expect tone spreading to follow a hierarchy of likely occurrence, HTS > LTS > MTS, where HTS is most common and MTS quite rare. While the above examples involve an interaction between H and L tones, spreading can also occur in privative systems. In /H/ vs. Ø systems, H tone often spreads over multiple TBUs, as in Yaqui [Uto-Aztecan; Mexico] /téeka/ → tééká 'sky', /tá-tase/ → tá-tásé 'is coughing' (Demers et al. 1999: 40). In both Bagiro [Central Sudanic; Democratic Republic of
Congo] (Boyeldieu 1995: 134) and Zenzontepec Chatino [Oto-Manguean; Mexico] (Campbell 2016: 147), a H tone will spread through any number of Ø (or L) TBUs until it reaches either pause or a H or M tone, which will be downstepped, as in the Zenzontepec Chatino example (10).

(10) ta tāká tzaka nkwítza (M-H on tāká, H on nkwítza, other TBUs toneless)
     → [tà tāká tzáká nkwítzá] 'there already was a child' (lit. 'already exist one child') (Campbell 2016: 148)

In other languages HTS can be restricted from applying onto a L (or Ø) TBU that is followed by another H. Thus, in Chibemba [Bantu; Zambia], /bá-la-kak-a/ 'they tie up' is realized bá-lá-kàk-à, with HTS from the subject prefix /bá-/ onto the following tense marker, while /bá-la-súm-a/ 'they bite' is realized bá-là-súm-á utterance-medially. As seen, the /H/ of /bá-/ cannot spread, because it would bump into the H of /-súm-/, a violation of the 'Obligatory Contour Principle', so named by Goldsmith (1976a, 1976b) for a prohibition against two identical autosegments adjacent in a melody. (The /H/ of the verb root /-súm-/ does, however, spread onto the final inflectional suffix /-a/.)
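The OCP effect can be stated as a blocking condition on the spreading rule. The following sketch models the Chibemba forms just cited under our simplifying assumptions (spreading onto a single following toneless TBU, with toneless TBUs otherwise surfacing as default L):

    # Sketch of bounded H-tone spreading with OCP blocking: a linked H
    # spreads onto one following toneless TBU (None) unless that TBU
    # itself directly precedes another H.

    def spread_h(tbus):
        out = list(tbus)
        for i, t in enumerate(tbus):
            if (t == 'H' and i + 1 < len(out) and out[i + 1] is None
                    and (i + 2 >= len(out) or out[i + 2] != 'H')):
                out[i + 1] = 'H'
        return out

    # /bá-la-kak-a/ 'they tie up': H spreads onto the tense marker.
    print(spread_h(['H', None, None, None]))  # ['H', 'H', None, None]
    # /bá-la-súm-a/ 'they bite': spreading from /bá-/ is blocked by the H
    # of /-súm-/, whose own H spreads onto the final suffix.
    print(spread_h(['H', None, 'H', None]))   # ['H', None, 'H', 'H']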
4.3.3 Contour simplification

We have already seen in some of the examples that a common process is contour tone simplification. As mentioned in §4.2.5, languages frequently limit the distribution of contour tones, requiring that their beginning or end points be preceded or followed by a like or unlike tone height. The Kuki-Thaadow example in (9b) shows another tendency, which is to restrict contour tones to the final syllable of a phrase or utterance. A major motivator of contour simplification is the general principle of minimizing the number of ups and downs, a potential problem that becomes particularly acute when a contour is surrounded by unlike tone heights. Table 4.3 lists various fates of the L-HL-H input sequence in the indicated Grassfields Bantu languages [Bantoid; Cameroon] (Hyman 2010b: 71).
Table 4.3 Different contour simplifications of L-HL-H

Language          Output    Process                   Source
Mankon            L-H-ꜛH    H-upstep                  Leroy (1979)
Babanki           L-M-H     HL-fusion                 Hyman (1979b)
Babadjou          L-H-ꜜH    H-downstep                Hyman, field notes
Yemba (Dschang)   L-ꜜH-H    HL-fusion + H-downstep    Hyman and Tadadjeu (1976)
Kom               L-M-M     H-lowering                Hyman (2005)
Aghem             L-H-H     L-deletion                Hyman (1986b)
As seen, contour simplifications can produce a new surface-contrastive tone, such as the M in Babanki in (11), which results from the simplification of the HL resulting from n-deletion (Akumbu 2016):
(11) a. kə̀-bán + ə̀-kɔ́m → kə̀-bāː kɔ́m   'my fufucorn'
     b. kə̀-ŋkón + ə̀-kɔ́m → kə̀-ŋkɔ̄ː kɔ́m  'my fool'
Similarly, to minimize ups and downs, an input H-LH-L sequence is subject to multiple modifications in the output. A second motivation for contour simplification is tone absorption (Hyman and Schuh 1974: 90), whereby the input sequences LH-H and HL-L are simplified to L-H and H-L, respectively. In these cases the endpoint of the contour has been masked by the following like tone height. Thus, in Lango [Nilotic; Uganda], a HL falling tone derived by HTS is simplified to H: /dɔ́g gwὲnò/ (mouth + chicken) → dɔ́g gwɛ̂nò → dɔ́g gwέnò 'chicken's mouth' (Noonan 1992: 51). Another common change is LH-L → L-HL, which occurs in Lango (p. 53), Isthmus Zapotec [Oto-Manguean; Mexico] (Mock 1988: 214), and elsewhere. In this case the more marked LH rising tone is avoided and a less marked HL falling tone results.
4.3.4 Dissimilation and polarity

The tone processes discussed above all either are assimilatory or represent simplifications of contours and other 'ups and downs'. As in segmental phonology, there are processes that are dissimilatory in nature. In Munduruku [Tupi; Brazil], a L tone becomes H after /L/, as in /è + dìŋ/ (tobacco + smoke) → è-díŋ 'tobacco smoke' (Picanço 2005: 312). Besides tone levels, contours dissimilate, as when a Hakha Lai [Tibeto-Burman; Myanmar, North East India] LH rising tone becomes falling HL after another LH rising tone, as in /ka kǒoy hrǒm/ → ka kǒoy hrôm 'my friend's throat' (Hyman and VanBik 2004: 832). Similar 'contour metatheses' occur in various Chinese dialects, e.g. Pingyao hai35 + bing35 → hai53 bing35 'become ill' (Chen 2000: 15). Even disyllabic sequences can dissimilate. Thus in Cuicateco [Oto-Manguean; Mexico], a sequence of /M-L/ + /M-L/ becomes L-M + M-L, as in /ntōʔò/ 'all' + /ʔīnù/ 'three' → ntòʔō ʔīnù 'all three' (Needham and Davis 1946: 145). In some cases where it is not desirable to start with an underlying tone, a morpheme may receive the opposite 'polar' tone to what precedes or follows. In Eastern Kayah Li (Karen) [Tibeto-Burman; Myanmar], which distinguishes /H, M, L/, prefixes contrast in tone before a M root: ʔì-lū 'the Kayah New Year festival' vs. ʔí-vī 'to whistle'. However, prefixes are H- before /L/ and L- before /H/, as in ʔí-lò 'to plant (seeds)' and ʔì-khré 'to winnow' (Solnit 2003: 625). In many analyses morphemes with polar tone are analysed as underlyingly toneless, receiving their tone by context. This is so for Margi [Chadic; Nigeria], discussed at considerable length by Pulleyblank (1986: 203–214), as well as Fuliiru [Bantu; Democratic Republic of Congo], which contrasts /H/, /L/, and /Ø/ verb roots, the last behaving like /H/ or /L/ verb roots in different parts of the paradigm (Van Otterloo 2014: 386).
4.4 Grammatical tone

In this section we consider grammatical functions of tone. While tone is (almost) completely lexical in many languages (e.g. most Chinese languages), there are other languages where tone is largely grammatical—for example, marking morphological classes,
morphological processes, and ultimately syntactic configurations as well as semantic and pragmatic functions such as negation and focus. For example, in the Igboid language Aboh [Benue-Congo; Nigeria], the difference between affirmative and negative can be solely tonal: ò jè kò 's/he is going' vs. ó jé kò 's/he is not going'. Grammatical functions of tone are as varied as grammar itself. From the above examples we see that tone can function alone as a morpheme. It follows therefore that if tone can be a morpheme, it can do everything that a (segmental) morpheme can do, such as mark singular/plural, case, person, tense, aspect, and of course negation (Hyman and Leben 2000: 588). On the other hand, tone vastly surpasses segmental phonology in encoding syntactically dependent prosodic domains (cf. §4.4.6).
4.4.1 Lexical versus morphological tone

It is clear that tone can be a property of either lexical morphemes (nouns, verb roots, etc.) or grammatical elements (pronouns, demonstratives, prefixes, suffixes, clitics, etc.). There may of course be generalizations concerning the distribution of tones by word class. For example, in Mpi [Tibeto-Burman; Thailand], nouns contrast /H, M, L/ (sí 'four', sī 'a colour', sì 'blood') while verbs contrast /MH, LM, HL/ (si᷄ 'to roll', si᷅ 'to be putrid', sî 'to die') (Matisoff 1978). However, the term 'grammatical tone' does not usually refer to tonal contrasts on segmental morphemes, such as the H and L tones of the subject pronouns à 'I', ò 'he', and á 'she' in Kalabari [Ijoid; Nigeria] (Jenewari 1977: 258–259). In such cases the tone is clearly linked to its TBU and not assigned by a grammatical process. In the following subsections, 'grammatical tone' will refer to cases either where tone is the sole exponent of morphology or where morphology introduces tonal exponents that are realized independently of any segmental morpheme that may accompany the tone.
4.4.2 Tonal morphemes

The most straightforward type of grammatical tone is where the tone is the only exponent of a morphological distinction. Typically called a 'tonal morpheme', its position can sometimes be established within a string of (segmental) morphemes. For example, the subject H tone of Yoruba [Benue-Congo; Nigeria] occurs exactly between the subject and verb: ō̙mō̙ + ´ + lō̙ → ō̙mó̙ lō̙ 'the child went' (Akinlabi and Liberman 2000: 35). Similarly, the H genitive ('associative') marker of Igbo [Benue-Congo; Nigeria], often translatable as 'of', can be located between the two nouns in /àlà/ 'land' + ´ + /ìgbò/ 'Igbo' → àlá ìgbò 'Igboland' (Emenanjo 1978: 36). Such tonal morphemes can have any shape (L, M, etc.) and can even occur in sequences. In other cases it is harder to analyse morphological tones as items to be arranged in a sequence with segmental morphemes. Instead, individual H or L tones may be assigned to various positions within a paradigm. In the /H/ vs. Ø language Kikuria [Bantu; Kenya], there is no lexical tone contrast on verb roots. Instead, different inflectional features assign a H tone to the first, second, third, or fourth mora of the verb stem. In the examples in Table 4.4, the stem is bracketed and the mora receiving the H is underlined. As also seen, this H then spreads to the penultimate vowel (Marlo et al. 2014: 279).
Table 4.4 H-tone stem patterns in Kikuria

μ1  ntoo-[kó̱óndókóra]     'indeed we have already uncovered'
μ2  ntooɣa-[koó̱ndókóóye]  'indeed we have been uncovering'
μ3  ntore-[koondó̱kóra]    'we will uncover (then)'
μ4  tora-[koondokó̱ra]     'we are about to uncover'

The inflectional categories involved are the untimed past anterior, the hodiernal past progressive, the anterior focused, the remote future focused, and the inceptive.
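Stated procedurally, the pattern in Table 4.4 amounts to 'assign H to the designated mora, then spread to the penult'. A minimal sketch under our simplifying assumptions (abstract moras, default L elsewhere; real Kikuria stems involve further complications):

    # Sketch of the Kikuria melodic H: the inflection targets the nth mora
    # of the stem, and the H spreads rightward up to the penultimate mora.

    def kuria_stem_tones(n_moras, target):
        tones = ['L'] * n_moras
        for i in range(target - 1, n_moras - 1):  # from target mora to penult
            tones[i] = 'H'
        return tones

    print(kuria_stem_tones(5, 1))  # ['H', 'H', 'H', 'H', 'L']
    print(kuria_stem_tones(5, 3))  # ['L', 'L', 'H', 'H', 'L']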
4.4.3 Replacive tone

In other cases a morphological process may assign a 'replacive' tone or tonal schema. Table 4.5 gives examples from Kalabari [Ijoid; Nigeria], where a LH 'melody' replaces the contrastive verb tones in deriving the corresponding intransitive verb (Harry and Hyman 2014: 650).
Table 4.5 Detransitivizing LH replacive tone in Kalabari

Transitive                               Intransitive
kán       H       'tear, demolish'       kàán      LH      'tear, be demolished'
kɔ̀n       L       'judge'                kɔ̀ɔ́n      LH      'be judged'
ányá      H-H     'spread'               ànyá      L-H     'be spread'
ɗìmà      L-L     'change'               ɗìmá      L-H     'change'
sáꜜkí     H-ꜜH    'begin'                sàkí      L-H     'begin'
kíkímà    H-H-L   'hide, cover'          kìkìmá    L-L-H   'be hidden, covered'
pákìrí    H-L-H   'answer'               pàkìrí    L-L-H   'be answered'
gbólóꜜmá  H-H-ꜜH  'join, mix up'         gbòlòmá   L-L-H   'be joined, mixed up'
As seen, the LH melody is realized as a LH rising tone (with vowel lengthening) on monosyllables, L-H on two syllables, and L-L-H on trisyllabic verbs. In (12), denominal adjectives are derived via replacive H tone in Chalcatongo Mixtec [Oto-Manguean; Mexico] (Hinton et al. 1991: 154; Macaulay 1996: 64), while deadjectival verbs are derived via replacive L in Lulubo [Central Sudanic; South Sudan] (Andersen 1987a: 51).

(12) a. Chalcatongo Mixtec
        bīkò 'cloud'    → bíkó 'cloudy'
        tānà 'medicine' → táná 'medicinal'
        sòʔò 'ear'      → sóʔó 'deaf'
        žūù 'rock'      → žúú 'solid, hard'
     b. Lulubo
        ōsú 'good'   → òsù 'become good'
        àkēlí 'red'  → àkèlì 'become red'
        álí 'deep'   → àlì 'become deep'
Replacive tones are found in Asia as well, such as in Southern Vietnamese [Mon-Khmer; Vietnam] (Thompson 1965) and White Hmong [Hmong-Mien; China] (Ratliff 1992). For cases of replacive tone conditioned by phrasal domains see §4.4.6.
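The Kalabari mapping in Table 4.5 is a textbook instance of melody-to-syllable association and can be sketched directly (our illustration; the compensatory vowel lengthening on monosyllables is not modelled):

    # Sketch of the detransitivizing LH melody: H docks on the final
    # syllable, L on all preceding syllables; a monosyllable receives
    # both tones as a LH rise.

    def map_lh(n_syllables):
        if n_syllables == 1:
            return ['LH']
        return ['L'] * (n_syllables - 1) + ['H']

    print(map_lh(1))  # ['LH']           cf. kàán 'tear, be demolished'
    print(map_lh(3))  # ['L', 'L', 'H']  cf. kìkìmá 'be hidden, covered'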
4.4.4 Inflectional tonology

The above examples show that tone can directly mark derivational processes (to which we return in §4.4.5). It may also mark inflectional morphology, specifically morphosyntactic features such as person, number, gender, tense, and aspect. Thus, in Ronga [Nilo-Saharan; Chad, Central African Republic] certain nouns mark their plural by assigning a H tone: tə̀ù 'flour' (pl. tə́ú), ndòbó 'meat' (pl. ndóbó) (Nougayrol 1989: 27). A similar H tone plural effect on possessive determiners occurs in Kunama [Nilo-Saharan; Eritrea] (Connell et al. 2000: 17), as shown in Table 4.6.
Table 4.6 Possessive determiners in Kunama

                           Singular   Plural
First person (exclusive)   -àaŋ       -áaŋ
Second person              -èy        -éy
Third person               -ìy        -íy
First person (inclusive)   –          -íŋ
As seen, the segmental morphs mark person, while the tones mark number (L for singular, H for plural). Similar alternations due to number are seen in Table 4.7 for noun class 9/10 in Noni [Bantoid; Cameroon] (Hyman 1981: 10).
Table 4.7 Noni SG~PL alternations in noun class 9/10

       Stem tone   Singular            Plural                            Alternation
(i)    /L/         / ` + jòm/ → jòm    / ´ + jòm/ → jo᷆m   'antelope'    L vs. ML
(ii)   /LH/        / ` + bìé/ → bìè    / ´ + bìé/ → bíé    'fish'        L vs. H
(iii)  /HL/        / ` + bíè/ → bìē    / ´ + bíè/ → bīē    'goat'        LM vs. M
(iv)   /H/         / ` + bwé/ → bwě    / ´ + bwé/ → bwé    'dog'         LH vs. H
As indicated, from a two-height system, Noni developed a H, M, L surface contrast, where most occurrences of *H became M. The main exception is the plural of 'fish': in this case the expected HLH sequence simplified to H. A similar situation arises in Day [Adamawa; Chad] in the marking of aspect (Nougayrol 1979: 161). Although the language contrasts surface H, M, and L, we again recognize inputs /H/ and /L/ in Table 4.8.
Table 4.8 Day completive/incompletive aspect alternations

                    /yúú/ 'put on, wear'   /yùù/ 'drink'
Completive (H-)     yúú                    yūū
Incompletive (L-)   yūū                    yùù
As seen, when the completive H- prefix combines with the H tone verb 'put on, wear', the result is a H tone. Similarly, when the incompletive L- prefix combines with the L tone verb 'drink', the result is a L tone. Both H+L and L+H result in M tone yūū, which can mean either 'put on, wear' (incompletive) or 'drink' (completive). While the above cases allow us to factor out the individual tonal contributions of each morpheme (an affix and a root), such a segmentation may be difficult or impossible in other cases. Recall from Table 4.1 the inflected verb forms that were seen in Iau, here summarized in Table 4.9.
Table 4.9 Iau verb tones

               Telic   Totality of action   Resultative
Punctual       HL      H                    LM
Durative       HLM     ML                   M
Incompletive   HM      HꜛH                  –
Although Iau verbs lend themselves to a paradigmatic display by plotting the above morphosyntactic features, the portmanteau tonal melodies do not appear to be further segmentable into single tones or features. Of course one can look for patterns of the sort that telic forms begin H and have a L, a M, or both tones after them, but these would not be predictive. Inflectional tonology may also produce scalar effects. In Gban [Mande; Ivory Coast], a language with four tone heights (4 = highest, 1 = lowest), there are systematic effects on inflected subject markers conditioned by person and tense (Zheltov 2005: 24). As seen in Table 4.10, first and second persons are one degree higher than third, and past tense is two degrees lower than present.
Table 4.10 Inflected subject markers in Gban

                 Present            Past
                 sg.     pl.        sg.     pl.
First person     ĩ2      u2         ĩ4      u4       [+upper]
Second person    ɛɛ2     aa2        ɛɛ4     aa4      [+upper]
Third person     ɛ1      ɔ1         ɛ3      ɔ3       [-upper]
                 [-raised]          [+raised]
Although there are different ways to implement such a paradigm, Table 4.10 shows how the tonal reflexes can be nicely modelled with the features [upper] and [raised] (Yip 2002). Such an analysis is not possible in contiguous Guébie [Kru; Ivory Coast], where (i) each of the tone heights 1–4 goes down one level in the imperfective; (ii) just in case the imperfective is already 1, the tone height of the preceding subject is raised by one level instead; and (iii) just in case the subject is already 4, the tone height is further raised to a super-high 5 level, the only such occurrence in the language (Sande 2018).
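The Guébie conditions (i)–(iii) translate directly into a small decision procedure, sketched here for illustration only (tone heights as integers, with 4 the highest lexical level and 5 the derived super-high):

    # Sketch of the Guébie imperfective: the verb lowers one level; a verb
    # already at 1 instead raises the subject one level; a subject already
    # at 4 is raised to the super-high level 5.

    def guebie_imperfective(subject, verb):
        if verb > 1:
            return subject, verb - 1
        if subject < 4:
            return subject + 1, verb
        return 5, verb

    print(guebie_imperfective(3, 2))  # (3, 1): the verb lowers
    print(guebie_imperfective(3, 1))  # (4, 1): the subject raises instead
    print(guebie_imperfective(4, 1))  # (5, 1): super-high 5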
4.4.5 Compounding

In §4.4.3 we observed a number of cases where tone was the only reflex of a derivational process, such as a change in word class. Languages that have (almost) no morphology may, however, show traces of earlier derivational processes, as in the following falling-tone nominalizations in Standard Mandarin: shán 'to fan' → shân 'fan' (n.), lǐan 'to connect' → lîan 'chain', shù(´) 'to count' → shû 'number' (Wang 1972: 489). Both in Chinese and in other tone languages, tones can be modified in compounding. Thus, in Shanghainese compounds all but the tone of the first word reduce to a default L, as in (13a) (Zee 1987; cf. Selkirk and Shen 1990).

(13) a. ɕɪŋ (HL) + vəŋ (LH) → ɕɪŋ vəŋ (H-L)   'news' (< ɕɪ̂ŋ 'new' (HL))
        ɕɪŋ (HL) + vəŋ (LH) + tɕia (MH) → ɕɪŋ vəŋ tɕia (H-L-L)   'news reporting circle'
        ɕɪŋ (HL) + ɕɪŋ (HL) + vəŋ (LH) + tɕi (MH) + tsɛ (MH) → ɕɪŋ ɕɪŋ vəŋ tɕi tsɛ (H-L-L-L-L)   'new news reporter'
     b. khʌʔ (MH) + sɤ (MH) → khʌʔ sɤ (M-H)   'to cough'
        khʌʔ (MH) + sɤ (MH) + dã (LH) → khʌʔ sɤ dã (M-H-L)   'cough drops'
        khʌʔ (MH) + sɤ (MH) + jʌʔ (LH) + sr̹ (MH) + bɪŋ (LH) → khʌʔ sɤ jʌʔ sr̹ bɪŋ (M-H-L-L-L)   'cough tonic bottle'
The examples in (13b) show that when the first tone is a contour, here MH, its tones map to the first two syllables, any remaining syllables receiving a default L. It is very common for elements of a compound to undergo tonal modifications. This happens also in Barasana [Tukanoan; Colombia], which contrasts H-H, H-L, L-H, and L-HL on disyllabic words. In (14), ~ marks nasality, a prosodic property of morphemes (Gomez-Imbert and Kenstowicz 2000: 433–434).

(14) a. H-L + H-L → H-L + L-L    ~újù ~kùbà 'kind of fish stew' (~kúbà 'stew')
        H-L + L-H → H-L + L-L    ~kíì jècè 'peccary (sp.)' (jècé 'peccary')
        H-L + L-HL → H-L + L-L   héè rìkà 'tree fruits (in ritual)' (rìká` 'fruits')
     b. H-H + H-L → H-H + H-H    ~ɨ́dé ~bídí 'bird (sp.)' (~bídì 'bird')
        H-H + L-H → H-H + H-H    ~kóbé cótɨ́ 'metal cooking pot' (còtɨ́ 'cooking pot')
        H-H + L-HL → H-H + H-H   héá ~gɨ́tá-á 'flint stone' (~gɨ̀tá-à 'stone-cl')
As seen in (14a), if the first member of the compound ends with L, the second member of the compound will be L-L. In (14b), however, where the first member ends with H, the second member is realized H-H. It is reasonable to assume that the tones of the second member have been deleted, followed by the spreading of the final H or L of the first member.
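The Shanghainese pattern in (13) can likewise be sketched as a simple left-to-right mapping (our illustration, on the assumption stated above that only the first word's melody survives):

    # Sketch of Shanghainese compound tonology: the first word's melody
    # maps one tone per syllable from the left; remaining syllables
    # receive the default L.

    def compound_tones(first_melody, n_syllables):
        tones = list(first_melody)[:n_syllables]
        return tones + ['L'] * (n_syllables - len(tones))

    print(compound_tones(['H', 'L'], 5))  # ['H', 'L', 'L', 'L', 'L']
    print(compound_tones(['M', 'H'], 3))  # ['M', 'H', 'L']  cf. khʌʔ sɤ dã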
Reduction or loss of tone on certain member(s) of a compound is quite widespread. In the Mande group of West Africa, such changes are very common. Known as compacité tonale in the French literature, the process applies to compounds and to certain derivational processes (Creissels 1978). The following Bambara [Mali] examples from Green (2013: 4) illustrate compacité tonale in compounds (15a), noun+adjective (15b), and noun+derivational suffix (15c) combinations.

(15) a. jàrá 'lion' + wòló 'skin' → jàrà-wóló      'lion skin'
        jàkúmá 'cat' + wòló 'skin' → jàkùmà-wóló  'cat skin'
     b. jàkúmá 'cat' + wárá 'wild' → jàkùmà-wárá  'feral cat'
     c. jìgí 'hope' + -ntan 'neg' → jìgì-ntán     'hopeless'
If the first word has a LH melody, the full compound is realized LH, with the H on the final constituent and the L on preceding ones, as first formulated to our knowledge by Woo (1969: 33–34), with an acknowledgement to Charles S. Bird for help with the data (see also Leben 1973: 128; Courtenay 1974: 311). This is seen particularly clearly in the more complex examples in (16) (Green 2013: 9).

(16) a. fàlí 'donkey' + bálá 'upon' + yὲlὲn 'climb' → fàlì-bàlà-yɛ́lɛ́n  'ride a donkey'
     b. nún 'nose' + kɔ̀rɔ́ 'under' + síí 'hair' → nún-kɔ́rɔ́-síí        'moustache'
4.4.6 Phrase-level tonology

It is commonly observed that tones have the potential for considerable mobility and mutual interaction at a distance. This is seen particularly dramatically in their behaviour at the phrase level. As an example, Giryama [Bantu; Kenya] contrasts /H/ with Ø. In (17a) all of the morphemes are toneless, and all of the TBUs are pronounced with default L pitch. In (17b), however, where only the subject prefix /á-/ differs, its /H/ is realized on the penultimate mora of the phrase (Volk 2011a: 17).

(17) a. All L tone:
        ni-na-maal-a                   'I want'
        ni-na-mal-a ku-guul-a          'I want to buy'
        ni-na-mal-a ku-gul-a ŋguuwo    'I want to buy clothes'
     b. H tone on the penultimate mora:
        a-na-maál-a                    'he/she wants'
        a-na-mal-a ku-guúl-a           'he/she wants to buy'
        a-na-mal-a ku-gul-a ŋguúwo     'he/she wants to buy clothes'
Outside of tone, no other phonological property is capable of such a long-distance effect. Even less dramatic tone rules applying between words are still much richer than what is found in segmental phonology. In the Chatino languages (Cruz 2011; Campbell 2014; McIntosh 2015; Sullivant 2015; Villard 2015) the tonal processes apply throughout the clause, blocked only by a sentence boundary or pause. In other languages they are subject to applying within certain prosodic domains that in turn are defined by the syntax. A good example of the latter occurs in Xiamen [Sinitic; Fujian province, China]. Known as the Southern Min Tone Circle, when followed by another tone, each of the five contrasting tones is replaced by a different paired tone according to the schema 24, 44 → 22 → 21 → 53 → 44. Thus, in (18) (Chen 1987: 113), only the last tone remains unchanged.
(18) # yi kiong-kiong kio gua ke k'uah puah tiam-tsing ku ts'eq #
       44 24 24 21 53 44 21 21 53 44 53 32
     → 22 22 22 53 44 22 53 53 44 22 44 32
     he by-force cause I more read half hour long book
     'he insisted that I read for another half an hour'
The above changes take place only within what Chen (1987) calls a 'tone group', a phrasal prosodic domain determined by the syntax (cf. Selkirk 1986, 2011). While most of the tonal processes discussed in §4.3 were shown to be natural phonological rules, Xiamen shows that such 'tone sandhi' can also involve quite arbitrary replacive tone. Cases involving tone and syntactically defined prosodic domains are common, early examples being Chimwiini (Kisseberth and Abasheikh 1974, 2011), Ewe [Niger-Congo; Ghana, Togo] (Clements 1978), and several additional languages described in Clements and Goldsmith (1984), Kaisse and Zwicky (1987), and Inkelas and Zec (1990). Many of these studies show that the left or right edge of a prosodic domain can be marked by a boundary tone. An example is the floating L tone in Japanese at the end of an accentual phrase (Poser 1984a; Pierrehumbert and Beckman 1988). More generally, the tonal elements illustrated here for word-level tonology, accent, and phrasal tonology play a key role in intonation. This is true in non-tone languages as well, as one can gather from Gussenhoven (2004) and chapter 4 of this volume. The behaviour of tones in lexical tone systems has provided inspiration for the analysis of intonation in tone languages and non-tone languages alike. This tradition reaches back at least as far as Liberman (1975) and Pierrehumbert (1980), as traced by Gussenhoven (2004) among others, and as evidenced in many current analyses of intonation in specific languages, including those compiled in Jun (2005a, 2014a) and Downing and Rialland (2017a).
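Because the Xiamen substitutions are arbitrary pairings rather than natural processes, the tone circle is most naturally modelled as a lookup table. A minimal sketch of (18) (ours; checked tones such as 32 simply pass through here):

    # Sketch of the Southern Min Tone Circle: within a tone group, every
    # tone except the last is replaced by its paired sandhi tone
    # (24, 44 -> 22 -> 21 -> 53 -> 44).

    SANDHI = {'24': '22', '44': '22', '22': '21', '21': '53', '53': '44'}

    def tone_group(tones):
        return [SANDHI.get(t, t) for t in tones[:-1]] + [tones[-1]]

    print(tone_group(['44', '24', '21', '53', '32']))
    # ['22', '22', '53', '44', '32']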
4.5 Further issues: Phonation and tone features

As noted in §4.1.2, pitch sometimes interacts with phonation types and syllable types. For example, in some languages phonological tones are accompanied by a laryngeal gesture, such as breathiness and glottalization. Even if analysed as final /-h/ and /-ʔ/, these laryngeal
gestures can affect the distribution of tones, as they do in Itunyoso Trique [Oto-Manguean; Mexico] (Table 4.11) (DiCanio 2016: 231).
Table 4.11 Tonal distributions in Itunyoso Trique

Tone    Open syllable          Coda /h/              Coda /ʔ/
/4/     yu᷉4 'earthquake'       ya᷉h4 'dirt'           niʔ4 'see.1dual'
/3/     yu᷉3 'palm leaf'        ya᷉h3 'paper'          tsiʔ3 'pulque'
/2/     u᷉2 'nine'              tah2 'delicious'      ttʃiʔ2 'ten'
/1/     yu᷉1 'loose'            ya᷉h1 'naked'          tsiʔ1 'sweet'
/45/    –                      toh45 'forehead'      –
/13/    yo13 'fast (adj.)'     toh13 'a little'      –
/43/    ra43 'want'            nna᷉h43 'mother!'      –
/32/    ra᷉32 'durable'         nna᷉h32 'cigarette'    –
/31/    ra᷉31 'lightning'       –                     –
As seen, the high rising tone /45/ only occurs on CVh syllables, while only the four level tones occur on CVʔ syllables. 'Stopped' syllables typically have fewer contrasts than 'smooth' syllables in Chinese and South East Asian languages in general. Tone and phonation also interact in many other languages around the world in a variety of ways: cf. Volk (2011a), Wolff (1987), and chapters 9 (§9.6), 12 (§12.2), 23 (§23.2.2), 27 (§27.3.5), 28 (§28.2.3, §28.3.3, and §28.4.1), and 29 (§29.4 and §29.6.2). Interactions between pitch, laryngeal gesture, and syllable type can pave the way to tonogenesis and subsequent tonal splits (Haudricourt 1961; Matisoff 1973; Kingston 2011). A common pattern is for one or both of the contrasting /H/ and /L/ tones to further 'bifurcate' into two distinct heights, each conditioned by the voicing of the onset consonant. This likely accounts for the four-way contrast in Gban in Table 4.10. Once a language has at least a binary H vs. L contrast, the tones themselves can also interact to produce further tone heights, such as the M of Babanki in (11). While a featural analysis was provided for Gban, whether (or to what extent) tone features are needed in the phonology of tone has been questioned (Clements et al. 2010; Hyman 2010b; but cf. McPherson 2017). This is a key issue in the analysis of tone systems that remains to be resolved.
4.6 Conclusion

This chapter has offered a definition of tone broad enough to cover its various functions, behaviours, and manifestations in the languages of the world while preserving the notion that tone is the same phonological entity in all the cases discussed. Our survey has attempted to cover the general properties of tone systems and some unusual ones as well. Tone, as seen, can interact in a variety of ways with other phonetic features as well as with the abstract feature accent. The basic phonological unit to which a tone is linked can differ from language to language and may include vowels (or syllabic segments), morae, and syllables.
A separate question is the domain across which a tone or tone melody is mapped. Different tone system typologies have been based on the number of tone heights or tone shapes (including potentially contours) in the phonological inventory and on several distributional properties, contrastiveness being most important for the phonologist. Another type of tonal typology differentiates the various types of assimilation and dissimilation that tones can undergo. Yet another aspect of tone is its function as a property of lexical morphemes, grammatical morphemes, or both, and its ability to function at the level of the syntactic or phonological phrase as well as in intonation.
We dedicate this chapter to the memory of our dear friend and colleague of over four decades, Russell G. Schuh, who loved tone as much as we do.
Chapter 5

Word-Stress Systems

Matthew K. Gordon and Harry van der Hulst
5.1 Introduction

The term 'stress' refers to increased prominence on one or more syllables in a word.1 Depending on the language, stress is diagnosed in different ways: through a combination of physical properties, speaker intuitions, and phonological properties such as segmental constraints and processes. For example, the first syllable, the stressed one, in the English word totem /ˈtoʊtəm/ is longer, louder, and realized with higher pitch than the unstressed second syllable. In addition, the /t/ in the stressed syllable is aspirated, while the unstressed vowel is reduced to schwa and the preceding /t/ is flapped. It is possible for a word to have one or more secondary stresses that are less prominent than the main (or primary) stress. For example, the word manatee /ˈmænəˌti/ has a primary stress on the first syllable and a secondary stress on the final syllable, as is evident from the non-flapped /t/ in the onset of the final syllable. In §5.2 we consider the ways in which stress is manifested phonetically and in its correlations with segments and syllables, as well as in speaker intuitions, while in §5.2.4 we discuss some of its distributional properties in languages generally. A summary of typological research is provided in §5.3, while §5.4 considers stress's relation to rhythm and foot structure. Finally, §5.5 deals with some outstanding issues.

1 Some researchers refer to 'accent' rather than 'stress'; see van der Hulst (2014a) for terminological matters.
5.2 Evidence for stress

5.2.1 Phonetic exponents

Acoustic correlates of stress include increased duration, higher fundamental frequency (pitch), greater overall intensity (loudness), and spectral attributes such as an increased weighting in favour of higher frequencies and a shift in vowel quality (see chapter 10 for a discussion). There is considerable cross-linguistic variation in the properties that mark
stress. In a 75-language survey, Gordon and Roettger (2017) found duration to be the most reliable correlate of stress, distinguishing stressed from unstressed syllables in over 90% of these languages. Other exponents of stress included in their survey (intensity, f0, vowel reduction, and spectral tilt) are also predictive of stress in the majority of studies. Acoustic evidence for secondary stress is more tenuous. In virtually all studies in Gordon and Roettger's survey, secondary stress was distinguished from primary stress and/or lack of stress through only a subset of properties, if any at all, that were used to distinguish primary stressed from unstressed syllables.
5.2.2 Speaker intuitions and co-speech gestures

Evidence for stress may also come from speaker intuitions, which may be accessed either directly through questioning or covertly through observation of co-speech gestures, such as beat gestures, tapping, or eyebrow movements, which tend to coincide with peaks in fundamental frequency (e.g. Tuite 1993; Cavé et al. 1996; Leonard and Cummins 2010). In the tapping task commonly employed by stress researchers, speakers are asked to simultaneously tap on a hard surface while pronouncing a word. When asked to tap once, speakers typically tap on the primary stress. Additional prompted taps characteristically coincide with secondary stresses. Tapping has been used to elicit judgements about stress not only for languages with lexically contrastive stress, such as noun–verb pairs in English (e.g. ˈimport vs. imˈport), but also for languages with predictable stress, such as Tohono O'odham [Uto-Aztecan; United States] (Fitzgerald 1997) and Banawá [Arawan; Brazil] (Ladefoged et al. 1997). The tapping diagnostic has its limitations, however, and is not successful for speakers of all languages.
5.2.3 Segmental and metrical exponents of stress

Stress also conditions various processes, many of which are phonetic or phonological manifestations of the strengthening and weakening effects discussed earlier. Stressed and unstressed vowels are often qualitatively different. Unstressed vowels are commonly centralized relative to their stressed counterparts, although unstressed high vowels are more peripheral in certain languages (see Crosswhite 2004 for the typology of vowel reduction). Unstressed vowels in English typically reduce to a centralized vowel, gradiently or categorically. Gradient reduction occurs in the first vowel in [ɛ]xplain/[ə]xplain. Such qualitative reduction is typically attributed to articulatory undershoot due to reduced duration, which precludes the attainment of canonical articulatory targets (Lindblom 1963). Categorical reduction in English can often be argued to have a derivational status, as in the case of the second vowel in ˈhum[ə]n in view of its stressed counterpart in huˈm[æ]nity, but underived reduced vowels are frequent, like those in the second syllables of totem and manatee mentioned in §5.1. Vowel devoicing is another by-product of undershoot in the context of voiceless consonants or right-edge prosodic boundaries, contexts that are characteristically associated with laryngeal fold abduction, which may overlap with a vowel, especially if unstressed. For example, in Tongan [Austronesian; Tonga] (Feldman 1978), an unstressed high vowel
devoices when it occurs after a voiceless consonant and either before another voiceless consonant or utterance-finally, as in /ˈtuk[i̥]/ 'strike', /ˈtaf[u̥]/ 'light a fire', /ˌpas[i̥]ˈpas[i̥]/ 'applaud' (see Gordon 1998 for the typology of devoicing). Deletion is an extreme manifestation of reduction. For example, the first vowel in t[ə]ˈmato and the middle vowel in ˈfam[ə]ly are often absent in rapid speech. In San'ani Arabic [Afro-Asiatic; Yemen], unstressed vowels optionally delete, e.g. /fiˈhimtiː/ ~ /ˈfhimtiː/ 'you f.sg understood', /kaˈtabt/ ~ /ˈktabt/ 'I wrote' (Watson 2007: 73). Vowel deletion often parallels devoicing in displaying gradience and optionality. Furthermore, deletion is often only a perceptual effect of shortening, as articulatory traces of inaudible vowels may remain (see Gick et al. 2012). A complementary effect to reduction is strengthening in stressed syllables (see Bye and de Lacy 2008 for an overview). For example, short vowels in stressed non-final open syllables in Chickasaw [Muskogean; United States] are substantially lengthened (Munro and Ulrich 1984; Gordon and Munro 2007), e.g. /ʧiˌpisaˌliˈtok/ → [ʧiˌpiːsaˌliːˈtok] 'I looked at you', /aˌsabiˌkaˈtok/ → [aˌsaːbiˌkaːˈtok] 'I was sick'. Stressed syllables may also be bolstered through consonant gemination, e.g. Delaware [Algonquian; United States] /nəˈmə.təmeː/ → [nəˈmət.təmeː] (Goddard 1979: xiii). Gemination in this case creates a closed and thus heavy syllable (see §5.3.3 on syllable weight). Gemination can also apply to a consonant in the onset of a stressed syllable, as in Tukang Besi [Austronesian; Indonesia] (Donohue 1999) and Urubú Kaapor [Tupian; Brazil] (Kakumasu 1986). Stress may also have phonological diagnostics extending beyond strengthening and weakening. In the Uto-Aztecan language Tohono O'odham (Fitzgerald 1998), traditional song meter is sensitive to stress. The basic stress pattern (subject to morphological complications not considered here) is for primary stress to fall on the first syllable and secondary stress to occur on subsequent odd-numbered syllables (Fitzgerald 2012; see §5.4 on rhythmic stress): /ˈwa-paiˌɺa-dag/ 'someone good at dancing', /ˈʧɨpoˌs-id-a-ˌkuɖ/ 'branding instrument'. Although lines in Tohono O'odham songs are highly variable in their number of syllables, they are subject to a restriction against stressed syllables in the second and final positions; these restrictions trigger syllable and vowel copying processes (Fitzgerald 1998). Stress may also be diagnosed through static phonotactic restrictions, such as the confinement of tonal contrasts to stressed syllables in Trique [Oto-Manguean; Mexico] (DiCanio 2008), the restriction of vowel length contrasts to stressed syllables in Estonian [Uralic; Estonia] (Harms 1997), or the occurrence of schwa in unstressed syllables in Dutch (van der Hulst 1984) in words where there is no evidence for an underlying full vowel.
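The Tohono O'odham pattern just described is a canonical binary alternation and can be stated in a few lines. A sketch for illustration only (ours; the morphological complications noted above are ignored):

    # Sketch of Tohono O'odham rhythm: primary stress on the first
    # syllable, secondary stress on subsequent odd-numbered syllables.

    def rhythmic_stress(n_syllables):
        marks = []
        for i in range(n_syllables):      # i is 0-based: syllable i + 1
            if i == 0:
                marks.append('ˈ')         # primary stress
            elif i % 2 == 0:
                marks.append('ˌ')         # third, fifth, ... syllables
            else:
                marks.append('')
        return marks

    # /ˈwa-paiˌɺa-dag/: stresses on syllables 1 and 3.
    print(rhythmic_stress(4))  # ['ˈ', '', 'ˌ', '']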
5.2.4 Distributional characteristics of stress

There are certain properties associated with 'canonical' stress systems (see Hyman 2006 for a summary). One of these is the specification of the syllable as the domain of stress, a property termed 'syllable integrity' by Hayes (1995). Syllable integrity precludes stress contrasts between the first and second halves of a long vowel or between a syllable nucleus and a coda. Syllable integrity differentiates stress from tone, which is often linked to a sub-constituent of the syllable, the mora. Another potentially definitional characteristic of stress is 'obligatoriness', the requirement that every word have at least one stressed syllable. Obligatoriness precludes a system in
which stress occurs on certain words but not others. Obligatoriness holds for phonological rather than morphological words; thus, a function word together with a content word, e.g. the man, constitutes a single phonological word. Unlike stress systems, canonical tone systems do not require every word to have tone. The complement of obligatoriness is 'culminativity', which requires that every word have at most one syllable with primary stress. Most, if not all, stress systems obey culminativity. Culminativity is not, however, definitional for stress since culminativity is a property of certain tone languages that only allow a single lexically marked tone per word, such as Japanese [Japonic; Japan]. These are often called 'restricted tone languages' (Voorhoeve 1973). Although syllable integrity, obligatoriness, and culminativity are characteristic of most stress systems, each of them has been challenged as a universal feature of stress systems. Certain Numic [Uto-Aztecan; United States] languages, such as Southern Paiute (Sapir 1930) and Tümpisa Shoshone (Dayley 1989), are described as imposing a rhythmic stress pattern sensitive to mora count, a system that allows for either the first or the second half of a long vowel to bear stress, a violation of syllable integrity. Some languages are described as violating obligatoriness in having stressless words, such as words lacking heavy syllables in Seneca [Iroquoian; United States] (Chafe 1977) and phrase-final and isolation words of the shape CVCV(C) in Central Alaskan Yupik [Eskimo-Aleut; United States] (Miyaoka 1985; Woodbury 1987; see chapter 20 for discussion of Yupik). Other languages are said to have multiple stresses per word none of which stands out as the primary stress, such as Central Alaskan Yupik (Woodbury 1987) and Tübatulabal [Uto-Aztecan; United States] (Voegelin 1935), a violation of culminativity. Hayes (1995) suggests that isolated violations of syllable integrity, obligatoriness, and culminativity are amenable to alternative analyses that preserve these three proposed universals of stress.
5.3 Typology of stress

The typology of stress systems has been extensively surveyed (e.g. Hyman 1977; Bailey 1995; Gordon 2002; Heinz 2007; van der Hulst and Goedemans 2009; van der Hulst et al. 2010; Goedemans et al. 2015). We summarize here some of the results of this research programme.
5.3.1 Lexical versus predictable stress

A division exists between languages in which stress is predictable from phonological properties such as syllable location and shape and those in which it varies as a function of morphology or lexical item. Finnish [Uralic; Finland] (Suomi et al. 2008), in which primary stress falls on the first syllable of every word, provides an example of phonologically predictable stress. At the other extreme, Tagalog [Austronesian; Philippines] (Schachter and Otanes 1972) words may differ solely on the basis of stress, e.g. /ˈpito/ 'whistle' vs. /piˈto/ 'seven'. In reality, degree of predictability of stress represents more of a continuum than a binary division, since most languages display elements of both contrastive and predictable stress. For example, although stress in Spanish is lexically distinctive, e.g. /ˈsabana/ 'bed sheet' vs. /saˈbana/ 'savannah', it is confined to a three-syllable window at the right edge of a
word with a strong statistical preference for the penultimate syllable (Roca 1999; Peperkamp et al. 2010). Similarly, stress-rejecting affixes in Turkish Kabardian [Northwest Caucasian; Turkey] (Gordon and Applebaum 2010) create deviations from the otherwise predictable stress pattern, such as the predictable penultimate stress in /ˈməʃɐ/ 'bear' vs. the final stress in /məˈʃɐ/ 'this milk' attributed to the stress-rejecting prefix mə- 'this'.
5.3.2 Quantity-insensitive stress

Phonologically predictable stress systems differ depending on their sensitivity to the internal structure of syllables. In languages with 'quantity-insensitive' or 'weight-insensitive' stress, stress falls on a syllable that occurs at or near the periphery of a word. For example, Macedonian [Indo-European; Macedonia] stresses the antepenultimate syllable of words (Lunt 1952; Franks 1987): /voˈdeniʧar/ 'miller', /vodeˈniʧari/ 'miller-pl', /vodeniˈʧarite/ 'miller-def.pl'. Surveys reveal five robustly attested locations of 'fixed stress': the initial, the second, the final, the penultimate, and the antepenultimate syllables. Third syllable is a more marginal pattern, reported for Ho-chunk [Siouan; United States] (but see discussion of alternative tonal analyses in Hayes 1995) and as the default pattern in certain languages with lexical stress, such as Azkoitia Basque [isolate; Spain] (Hualde 1998). Three stress locations (initial, penultimate, and final) statistically predominate, as illustrated in Figure 5.1, based on the StressTyp2 (Goedemans et al. 2015) database of 699 languages.
5.3.3 Quantity-sensitive stress

In many languages, stress is sensitive to the internal structure or 'weight' of syllables, where criteria for which syllables count as 'heavy' vary across languages (Hayes 1989a; Gordon 2006). For example, in Piuma Paiwan [Austronesian; Taiwan] (Chen 2009b), stress typically falls on the penultimate syllable of a word: /kuˈvuvu/ 'my grandparents', /səmuˈkava/
[Figure 5.1: bar chart; x-axis: initial, peninitial, antepenult, penult, final; y-axis: number of languages, 0–200]
Figure 5.1 Number of languages with different fixed-stress locations according to StressTyp2 (Goedemans et al. 2015).
'to take off clothes'. However, if the penult contains a light syllable, one containing a schwa, stress migrates rightward to the final syllable (even if it too contains schwa): /qapəˈdu/ 'gall', /ʎisəˈqəs/ 'nit'. The rejection of stress by schwa is part of a cross-linguistic weight continuum in which non-low central vowels are lighter in some languages than peripheral vowels. Among peripheral vowels, languages may treat low vowels as heavier than non-low vowels or non-high vowels as heavier than high vowels (Kenstowicz 1997; De Lacy 2004; Gordon 2006); see, however, Shih (2016, 2018) and Rasin (2016) for the paucity of compelling evidence for vowel-quality-based stress. It is more common for a weight-sensitive stress system to be sensitive to the structure of the syllable rime than to vowel quality (see Gordon 2006 for statistics). Many languages thus treat syllables with long vowels (CVV) as heavier than those with short vowels, while others preferentially treat both CVV and closed syllables (CVC) as heavy. For example, in Kabardian (Abitov et al. 1957; Colarusso 1992; Gordon and Applebaum 2010), stress falls on a final syllable if it is either CVV or CVC, otherwise on the penult: /sɐˈbən/ 'soap', /saːˈbiː/ 'baby', /ˈwənɐ/ 'house', /χɐrˈzənɐ/ 'good'. Tone may also condition stress in some languages, where higher tones are preferentially stressed over lower tones (de Lacy 2002). In some languages, weight is scalar (Hayes 1995; Gordon 2006), and in others, weight is sensitive to onset consonants (Gordon 2005b; Topintzi 2010; see §5.3.3). Pirahã [Mura-Pirahã; Brazil] (Everett and Everett 1984a; Everett 1998) observes a scalar weight hierarchy that simultaneously appeals to both onset and rimal weight: stress falls on the rightmost heaviest syllable within a three-syllable window at the right edge of a word. The Pirahã weight scale is KVV > GVV > VV > KV > GV, where K stands for a voiceless onset and G for a voiced onset. Onset-sensitive weight is rare compared to rime-sensitive weight. Of 136 languages with weight-sensitive stress in Gordon's (2006) survey, only four involve onset sensitivity (either presence vs. absence or type of onset). The primacy of rimal weight is mirrored language-internally: onset weight almost always implies rimal weight, and, where the two coexist, rimal weight takes priority over onset weight. This dependency is exemplified in Pirahã, where a heavier rime (one consisting of a long vowel) outweighs a heavier onset (one containing a voiceless consonant)—that is, GVV outweighs KV.
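The Kabardian rule is a convenient minimal example of weight-sensitive stress assignment. The following sketch is ours and rests on crude assumptions (doubled vowels stand in for length; the vowel inventory is truncated):

    # Sketch of the Kabardian pattern: stress the final syllable if it is
    # heavy (CVV or CVC), otherwise the penult.

    VOWELS = set('aeiouəɐ')

    def heavy(syl):
        # heavy if closed (final consonant) or if the vowel is doubled (long)
        return syl[-1] not in VOWELS or (len(syl) > 1 and syl[-2] == syl[-1])

    def kabardian_stress(syllables):
        """Return the index of the stressed syllable."""
        if len(syllables) == 1 or heavy(syllables[-1]):
            return len(syllables) - 1
        return len(syllables) - 2

    print(kabardian_stress(['sɐ', 'bən']))   # 1: /sɐˈbən/, final CVC
    print(kabardian_stress(['saa', 'bii']))  # 1: /saːˈbiː/, final CVV
    print(kabardian_stress(['wə', 'nɐ']))    # 0: /ˈwənɐ/, final light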
5.3.4 Bounded and unbounded stress

In the stress systems discussed thus far, stress is limited, or 'bounded', to a range of syllables at a word edge. For example, in Piuma Paiwan, which avoids stress on schwa (Chen 2009b; §5.3.3), stress falls on one of the last two syllables, even if there is a peripheral vowel to the left of the penult and the final two syllables both contain schwa. Stress windows are also observed at the left edge in some languages. In Capanahua [Panoan; Peru] (Loos 1969; Elías-Ulloa 2009), stress falls on the second syllable if it is closed, but on the first otherwise, as seen in /ˈmapo/ 'head', /ˈwaraman/ 'squash', /piʃˈkap/ 'small', /wiˈrankin/ 'he pushed it' (see van der Hulst 2010a for more on window effects for weight-sensitive stress). As the word /ˈwaraman/ indicates, stress is limited to the first two syllables even if these are light and a syllable later in the word is heavy. Lexical stress may also be bound to stress windows. For example, Choguita Rarámuri [Uto-Aztecan; Mexico] (Caballero and Carroll 2015) has lexically contrastive stress operative
within a three-syllable window at the left edge of a word, where the default stress location is the second syllable: /ˈhumisi/ 'run away pl' vs. /aˈsisi/ 'get up' vs. /biniˈhi/ 'accuse'. When a lexically stressed suffix attaches to a root with default second-syllable stress, stress is shifted to the suffix unless it falls outside the left-edge three-syllable window. For example, the conditional suffix /sa/ attracts stress in /ru-ˈsa/ 's/he is saying' and /ʧapi-ˈsa/ 's/he is grabbing', but not in /ruruˈwa-sa/ 's/he is throwing liquid'. Not all weight-sensitive or lexical stress systems are bounded. For example, stress in Yana [isolate; California] (Sapir and Swadesh 1960) is 'unbounded', falling on the leftmost heavy syllable (CVV or CVC) regardless of its position in a word. In words lacking a heavy syllable, stress defaults to the initial syllable. Languages such as Yana featuring unbounded stress may either have initial stress in the default case, as in Yana, or default final stress, as in Kʷak'ʷala [Wakashan; Canada] (Boas 1947; Bach 1975; Wilson 1986; Shaw 2009; Gordon et al. 2012). If, in languages with unbounded stress, several morphemes with inherent stress are combined into a complex word, the leftmost or rightmost among them will attract stress. This situation parallels unbounded weight-sensitive stress, if lexical stress is viewed as diacritic weight (van der Hulst 2010a). In both cases, stress defaults to the first or last syllable (or the peninitial or penult, if extrametricality/non-finality applies) if no heavy syllable is present. A case in point is Russian [Indo-European; Russia], in which primary stress falls on the rightmost syllable with diacritic weight and on the first syllable if there is no syllable with diacritic weight: /gospoˈʒa/ 'lady', /koˈrova/ 'cow' vs. /ˈzʲerkalo/ 'mirror', /ˈporox/ 'powder' (Halle 1973).
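Unbounded stress of the Yana type amounts to a single scan. A sketch (ours; syllable weights are supplied directly as 'H'/'L' to stay neutral about the weight criterion, and the same logic models the Russian case if diacritic weights are scanned from the right):

    # Sketch of unbounded stress: stress the leftmost heavy syllable,
    # defaulting to the initial syllable when no syllable is heavy.

    def leftmost_heavy(weights):
        for i, w in enumerate(weights):
            if w == 'H':
                return i
        return 0                          # default: initial syllable

    print(leftmost_heavy(['L', 'L', 'H', 'L']))  # 2
    print(leftmost_heavy(['L', 'L', 'L']))       # 0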
5.3.5 Secondary stress

In certain languages, longer words may have one or more secondary stresses. In some, there may be a single secondary stress at the opposite edge from the primary stress. For example, in Savosavo [Central Solomon Papuan; Solomon Islands] (Wegener 2012), primary stress typically falls on the penult with secondary stress on the initial syllable, as in /ˌsiˈnoqo/ ‘cork’, /ˌkenaˈɰuli/ ‘fishing hook’. In other languages, secondary stress rhythmically propagates either from the primary stress or from a secondary stress at the opposite edge from the primary stress. Rhythmic stress was exemplified earlier (see §5.2.3) for Tohono O’odham, in which primary stress falls on the first syllable and secondary stress falls on subsequent odd-numbered syllables: /ˈwa-paiˌɺa-dag/ ‘someone good at dancing’, /ˈʧɨpoˌs-id-a-ˌkuɖ/ ‘branding instrument’.

Languages with a fixed primary and a single fixed secondary stress are relatively rare compared to those with rhythmic stress. In Gordon’s (2002) survey of 262 quantity-insensitive languages, only 15 feature a single secondary stress compared to 42 with rhythmic secondary stress. Both, though, are considerably rarer than single fixed stress systems, which number 198 in Gordon’s survey, although it is conceivable that some languages for which only primary stress is described may turn out to have secondary stress.

Even rarer are hybrid ‘bidirectional’ systems in which one secondary stress ‘wave’ radiates from the primary stress with a single secondary stress occurring on the opposite edge of the word. For example, primary stress in South Conchucos Quechua [Quechuan; Peru] (Hintz 2006) falls on the penult, with secondary stress docking both on the initial syllable and on alternating syllables to the left of the penult, as in /ˌwaˌraːkaˌmunqaˈnaʧi̥/ ‘I crunch up my own (e.g. prey) with teeth’. The bidirectional nature of stress leads to adjacent stresses (i.e. stress clashes) in words with an odd number of syllables. In some bidirectional systems,
such as Garawa [Australian; Australia] (Furby 1974), rhythmic stress is suppressed where it would result in a stress clash.

Another rare system has stress on every third syllable. For example, primary stress in Cayuvava [isolate; Bolivia] (Key 1961, 1967) falls on the antepenultimate syllable and secondary stress falls on every third syllable to the left of the primary stress: /ikiˌtapareˈrepeha/ ‘the water is clean’, /ˌʧa.adiˌroboβuˈuruʧe/ ‘ninety-five (first digit)’. StressTyp2 (Goedemans et al. 2015) cites only two quantity-insensitive stress systems with stress on every third syllable, although there are a few quantity-sensitive stress languages (§5.3.3) in which ternary intervals occur in sequences of light syllables (see Hayes 1995).

Stanton’s (2016) survey of word length in 102 languages suggests that rhythmic stress (generalized over all subtypes) is especially prevalent in languages with longer words, whereas single stress systems are more common in languages with fewer long words. Figure 5.2 plots the median percentages of words ranging from one to four or more syllables for languages with a single stress per word (34 languages in Stanton’s database) and for those with rhythmic secondary stress (22 languages). Non-stress languages and those with other types of stress systems, such as those based on tone or those with one stress near each edge of the word, are excluded in Figure 5.2. The two sets of languages display virtually identical frequency patterns for words with two and three syllables, but differ in the relative frequency of monosyllabic words and words of at least four syllables. Monosyllables vastly outnumber (by nearly 30%) words with four or more syllables in the single stress languages, but are only marginally more numerous than long words in the languages with rhythmic stress. This asymmetry suggests that stress lapses are dispreferred and that when the morphology of a language creates longer words in sufficient frequency, speakers tend to impose rhythmic stress patterns, which may then generalize to shorter words. A more cynical view might attribute the link between word length and rhythmic stress to the perceptual transfer of rhythmic secondary stresses by researchers accustomed to hearing secondary stresses in their native language, a phenomenon that Tabain et al. (2014) term ‘stress ghosting’ in their study of Pitjantjatjara [Australian; Australia].
Figure 5.2 Median percentages of words with differing numbers of syllables in languages with a single stress per word and those with rhythmic secondary stress in Stanton (2016).
A recurring feature of languages with rhythmic secondary stress is that the primary stress serves as the starting point for the placement of secondary stresses (van der Hulst 1984). Thus, in a language with rightward propagation of secondary stress, such as Tohono O’odham, the primary stress is the leftmost stress, whereas in languages with leftward iteration of secondary stress, such as Émérillon [Tupian; French Guiana] (Gordon and Rose 2006) and Cayuvava [isolate; Bolivia] (Key 1961, 1967), the rightmost stress is the primary one. Systems in which the stress at the endpoint of the rhythmic train is the primary one are comparatively rare. Virtually all of the exceptions to this generalization involve cases of rightward propagation of stress and the rightmost stress being the primary one, a pattern that plausibly reflects phrasal pitch accent rather than word stress (van der Hulst 1997; Gordon 2014). Perhaps the only case in the literature of leftward stress assignment and promotion of the leftmost stress to the primary one is found in Malakmalak [Australian; Australia] (Birk 1976).
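The Cayuvava-type ternary pattern described above lends itself to a simple procedural statement; the sketch below, our own formulation rather than Key’s analysis, marks primary stress on the antepenult and secondary stress on every third syllable to its left.

```python
# A sketch of Cayuvava-style ternary stress: 1 = primary, 2 = secondary,
# 0 = unstressed. Words of fewer than three syllables are not handled.

def cayuvava_stresses(n_syllables):
    grid = [0] * n_syllables
    primary = n_syllables - 3          # antepenultimate syllable
    grid[primary] = 1
    for i in range(primary - 3, -1, -3):
        grid[i] = 2                    # every third syllable to the left
    return grid

# /ikiˌtapareˈrepeha/ ‘the water is clean’ (8 syllables):
print(cayuvava_stresses(8))  # [0, 0, 2, 0, 0, 1, 0, 0]
```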
5.3.6 Non-finality effects

Many stress systems exhibit a bias against (primary or secondary) stress on final syllables. Final stress avoidance has various manifestations. Some languages suppress or shift a rhythmic secondary stress that would be predicted to fall on a final syllable. An example of final stress suppression comes from Pite Saami [Uralic; Sweden] (Wilbur 2014), which has the same basic rhythmic stress pattern as Tohono O’odham except that final odd-numbered syllables are not stressed, e.g. /ˈsaːlpmaˌkirːje/ ‘psalm book nom.sg’, /ˈkuhkaˌjolkikijt/ ‘long-leg-nmlz-acc.pl’. Other languages may stress the second syllable of a word, but not if that stress would be final. For example, in Hopi [Uto-Aztecan; United States] (Jeanne 1982), stress falls on the second syllable of a word with more than two syllables if the first syllable is light, but in disyllabic words stress is initial regardless of the weight of the first syllable: /kɨˈjapi/ ‘dipper’, /laˈqana/ ‘squirrel’, /ˈkoho/ ‘wood’, /ˈmaqa/ ‘to give’. Another species of non-finality occurs in weight-sensitive systems in which final weight criteria are more stringent than in non-final syllables, a pattern termed ‘extrametricality’ (Hayes 1979). Thus, in Cairene Arabic [Afro-Asiatic; Egypt] (Mitchell 1960; McCarthy 1979a; Watson 2007), CVC attracts stress in the penult, as in /muˈdarris/ ‘teacher m.sg.’, but a final syllable containing a short vowel must have two coda consonants (CVCC) to attract stress, cf. /kaˈtabt/ ‘I wrote’ but /ˈasxan/ ‘hotter’ (see Rosenthall and van der Hulst 1999 for more on context-driven weight for stress).
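The Cairene pattern illustrates how a single weight criterion can be made more stringent finally; the toy predicate below captures just this asymmetry under our own simplifying assumptions and is not a full analysis of Cairene stress.

```python
# A sketch of context-driven weight: CVC counts as heavy in the penult,
# but a final syllable with a short vowel needs two codas (CVCC) to
# attract stress. This ignores everything else about Cairene stress.

def attracts_stress(shape, is_final):
    if is_final:
        return shape == "CVCC"
    return shape in ("CVV", "CVC")

assert attracts_stress("CVC", is_final=False)      # /muˈdarris/: heavy penult
assert not attracts_stress("CVC", is_final=True)   # /ˈasxan/: final CVC inert
assert attracts_stress("CVCC", is_final=True)      # /kaˈtabt/: final CVCC
```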
5.4 Rhythmic stress and the foot

Languages with rhythmic stress have provided the impetus for theories that assume the foot as a prosodic constituent below the word (e.g. Liberman and Prince 1977; Hayes 1980, 1995; Selkirk 1980; Halle and Vergnaud 1987; Halle and Idsardi 1995). In these theories, foot type is treated as a parameter with certain languages employing trochaic feet, which consist of strong–weak pairs of syllables, and others opting for iambic feet, consisting of weak–strong pairs. Tohono O’odham provides an example of trochaic footing where in words with odd syllables the final syllable constitutes a monosyllabic foot, as in /(ˈʧɨpo)(ˌsida)(ˌkuɖ)/
‘branding instrument’ (cf. /(ˈwapai)(ˌɺadag)/ ‘someone good at dancing’). The mirror-image trochaic system stresses even-numbered syllables counting from the right, as in Émérillon (excluding words with a final heavy syllable, which attract stress from the penult) (Gordon and Rose 2006): /(ˌmana)(ˈnito)/ ‘how’, /(ˌdeze)(ˌkasi)(ˈwaha)/ ‘your tattoo’. Osage [Siouan; United States] (Altshuler 2009), in which stress falls on even-numbered syllables counting from the left, exemplifies iambic stress: /(xoːˈʦo)(ðiːbˌrɑ̃)/ ‘smoke cedar’, /(ɑ̃ːˈwɑ̃)(lɑːˌxy)ɣe/ ‘I crunch up my own (e.g. prey) with teeth’. (The final syllable remains unfooted to avoid a stress clash with the preceding syllable.) Its mirror-image iambic pattern stresses odd-numbered syllables from the right, as in Urubú Kaapor (Kakumasu 1986).

Trochaic stress patterns predominate cross-linguistically. In StressTyp2, the Tohono O’odham-type trochaic pattern is found in 42 languages, while the Émérillon-type trochaic system is found in 40 languages. In contrast, their inverses, Osage iambic and Urubú Kaapor iambic systems, are observed in only 7 and 13 languages, respectively.

The alternative to a foot-based theory of stress represents stress only in terms of a prominence grid (e.g. Prince 1983; Selkirk 1984; Gordon 2002), in which stressed syllables project grid marks while unstressed ones do not. Differences in level of stress (e.g. primary vs. secondary stress) are captured in terms of differences in the number of grid marks dominating a syllable. Foot-based theories assume that the grid marks are grouped into (canonically) disyllabic constituents, although single syllables may be parsed into feet at the periphery of a word, as in Tohono O’odham. Foot-based and grid-based representations of stress are exemplified for Tohono O’odham in (1).
(1)  Foot-based:
       Level 1 (primary stress)     (  x               )
       Level 2 (secondary stress)   ( x  . )( x  . )( x )
                                     (ˈʧɨpo) (ˌsida) (ˌkuɖ)

     Grid-based:
       Level 1 (primary stress)      x
       Level 2 (secondary stress)    x       x       x
                                     ˈʧɨpo   ˌsida   ˌkuɖ
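For concreteness, the sketch below, our own formulation rather than any of the cited theories’ formal machinery, parses syllables into left-to-right trochaic feet of the Tohono O’odham type, leaving a final stray syllable as a monosyllabic foot and promoting the head of the leftmost foot to primary stress.

```python
# A sketch of left-to-right trochaic footing with leftmost primary stress.

def trochaic_parse(syllables):
    # Pair syllables into disyllabic feet; a final odd syllable becomes
    # a monosyllabic foot.
    feet = [tuple(syllables[i:i + 2]) for i in range(0, len(syllables), 2)]
    out = []
    for n, foot in enumerate(feet):
        head = ("ˈ" if n == 0 else "ˌ") + foot[0]  # trochee: head is foot-initial
        out.append("(" + "".join((head,) + foot[1:]) + ")")
    return "".join(out)

print(trochaic_parse(["ʧɨ", "po", "si", "da", "kuɖ"]))
# -> (ˈʧɨpo)(ˌsida)(ˌkuɖ), as in (1)
```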
Phonologists have long debated the role of the foot in the analysis of stress (see Hermans 2011). An asymmetry between trochaic and iambic feet in their sensitivity to weight provides one of the strongest pieces of evidence for the foot. Unlike quantity-insensitive rhythmic stress systems, which are biased towards trochees, quantity-sensitive rhythmic stress tends towards iambic groupings with an ideal profile consisting of a light–heavy sequence, an asymmetry termed the ‘iambic-trochaic law’. Chickasaw instantiates a prototypical iambic language in which stressed light (CV) syllables are lengthened non-finally (see §5.2.3) and all heavy (CVV, CVC) syllables are stressed: /(ʧiˌkaʃ)(ˈʃaʔ)/ ‘Chickasaw’, /(ˈnaːɬ)(toˌkaʔ)/ ‘policeman’, /ʧiˌpisaˌliˈtok/ → /(ʧiˌp[iː])(saˌl[iː])(ˈtok)/ ‘I looked at you’. In contrast to iambic feet, trochaic feet in some languages are subject to shortening of stressed vowels to produce a canonical light–light trochee, e.g. Fijian /m͡buːŋ͡gu/ → /(ˈm͡b[u].ŋ͡gu)/ ‘my grandmother’ (Schütz 1985).
5.5 Outstanding issues in word stress

5.5.1 The diagnosis of stress

Stress is easily identified in its prototypical instantiation in which phonetic and phonological exponents, speaker intuitions, and distributional characteristics converge. There are many languages, however, in which evidence for stress is more ambiguous. It is thus often difficult to
determine whether prominence should be attributed to stress rather than other properties, including tone, intonation, and the marking of prosodic boundaries (for discussion see Gordon 2014; Roettger and Gordon 2017). Raised pitch could thus potentially reflect a high tone in a tone language, reflect a phrase- or utterance-initial boundary tone, or be triggered by focus. Similarly, increased length could be attributed to a prosodic boundary rather than stress. Distributional restrictions on other phonological properties may be diagnostic of stress in lieu of obvious phonetic exponents or phonological alternations. For example, certain Bantu languages preferentially restrict high tone to a single syllable per word (Hyman 1989; Downing 2010), a distribution that is consistent with the property of culminativity holding of canonical stress systems (§5.2.4). There are also languages in which potential phonetic correlates of stress may not converge on the same syllable, as in Bantu languages with high tone on the antepenult but lengthening of the penult (Hyman 1989), or languages such as Belarusian [Indo-European; Belarus] (Dubina 2012; Borise 2015) and Welsh [Indo-European; United Kingdom] (Williams 1983, 1999) with cues to stress spread over the stressed and adjacent syllable. Non-convergence may be due to the existence of multiple prominence systems (e.g. intonation vs. word-level stress) or to a diffuse phonetic realization of stress (e.g. a delayed or premature f0 peak relative to the stress).
5.5.2 Stress and prosodic taxonomy

Stress is widespread in languages of the world. Of the 176 languages included in the 200-language World Atlas of Language Structures sample, approximately 80% (141 languages) are reported to have stress (Goedemans 2010: 649; see chapter 10 for a lower estimate). Phonemic tone and stress have traditionally been regarded as mutually exclusive. However, an increasing body of research has demonstrated cases of stress and tone co-existing in the same language, whether functioning orthogonally to each other, as in Thai [Tai-Kadai; Thailand] (Potisuk et al. 1996), Papiamentu [Portuguese Creole; Aruba] (Remijsen and van Heuven 2002), and Pirahã (Everett and Everett 1984a; Everett 1998); in a dependent relationship in which tone is predictive of stress, as in Ayutla Mixtec [Oto-Manguean; Mexico] (Pankratz and Pike 1967; de Lacy 2002); or where stress is predictive of tone, as in Trique (DiCanio 2008, 2010). On the other hand, there are several languages that have traditionally been regarded as stress languages but that are now generally considered languages in which prominence can be linked to phrasal pitch events rather than word-level stress (or tone), such as French [Indo-European; France] (Jun and Fougeron 1995), Korean [Koreanic; Korea] (Jun 1993), Indonesian [Austronesian; Indonesia] (van Zanten et al. 2003), Ambonese Malay [Austronesian; Indonesia] (Maskikit-Essed and Gussenhoven 2016), West Greenlandic [Eskimo-Aleut; Greenland] (Arnhold 2014), and Tashlhiyt [Afro-Asiatic; Morocco] (Roettger et al. 2015). These languages all have in common pitch events that occur near the edges of prosodic domains larger than the word, though they differ in the consistency of the timing of the pitch events.
5.5.3 Stress typology and explanation

A burgeoning area of research explores various perceptual and cognitive motivations behind stress patterns. For example, several scholars have developed phonetically driven
accounts of onset weight that appeal to auditory factors such as perceptual duration (Goedemans 1998), adaptation and recovery (Gordon 2005b), and perceptual p-centres (Ryan 2014). Gordon (2002) offers an account of rime-sensitive weight appealing to the non-linear mapping between acoustic intensity and perceptual loudness and to the temporal summation of energy in the perceptual domain. Non-finality effects have been linked to an avoidance of tonal crowding between the high pitch characteristic of stress and the default terminal pitch fall typically associated with the right edge of an utterance (Hyman 1977; Gordon 2000, 2014). Lunden (2010, 2013) offers an account of final extrametricality based on differences in the relative phonetic duration of syllables in final versus non-final position. Stanton (2016) hypothesizes that the absence of languages that orient stress towards the middle of the word rather than an edge, the ‘midpoint pathology’, is attributable to the difficulty of learning such a pattern, given the relative rarity of words of sufficient length to enable the learner to disambiguate midpoint stress from other potential analyses.
5.6 Conclusion

Although a combination of typological surveys of stress and detailed case studies of particular languages has revealed a number of robust typological generalizations governing stress, many questions remain. These include the abstract versus physical reality of stress, the relationship between word stress and prominence associated with higher-level prosodic units, and the role of functional and grammatical factors in explaining the behaviour of stress. The continued expansion of typological knowledge gleaned from phonological, phonetic, and psycholinguistic studies of stress will continue to shed light on these issues (but will undoubtedly raise more questions).
Additional reading

There are several overviews of typological and theoretical aspects of word stress that contain further references to particular topics, including Kager (2007), van der Hulst et al. (2010), Gordon (2011a, 2011b, 2015), Hammond (2011), Hermans (2011), Hyde (2011), and Gordon and Roettger (2017). Hyman (2006) is a recent discussion of definitional characteristics of stress as a prosodic class distinct from tone. The papers in van der Hulst (2014a, 2014b), Heinz et al. (2016), Goedemans et al. (2019), and Bogomolets and van der Hulst (in press) explore various contemporary descriptive and theoretical issues related to word stress.
Chapter 6

The Autosegmental-Metrical Theory of Intonational Phonology

Amalia Arvaniti and Janet Fletcher
6.1 Introduction

The autosegmental-metrical theory of intonational phonology (henceforth AM) is a widely adopted theory concerned with the phonological representation of intonation and its phonetic implementation. The term ‘intonation’ refers to the linguistically structured modulation of fundamental frequency (f0), which directly relates to the rate of vibration of the vocal folds and gives rise to the percept of pitch. Intonation is used in all languages and specified at the ‘post-lexical’ (phrasal) level by means of a complex interplay between metrical structure, prosodic phrasing, syntax, and pragmatics; these factors determine where f0 movements will occur and of what type they will be. Intonation serves two main functions: encoding pragmatic meaning and marking phrasal boundaries. In addition to intonation, f0 is used for lexical purposes, when it encodes tonal contrasts in languages traditionally described as having a ‘lexical pitch accent’, such as Swedish and Japanese, as well as languages with a more general distribution of ‘lexical tone’, such as Mandarin, Thai, and Igbo. Both types are modelled in AM together with tones that signal intonation (see e.g. Pierrehumbert and Beckman 1988 on Japanese). In addition to these linguistic uses, f0 is used to signal ‘paralinguistic’ information such as boredom, anger, emphasis, or excitement (on paralinguistic uses of f0, see Gussenhoven 2004: ch. 5; Ladd 2008b: ch. 1; see also chapter 30). Several models for specifying f0 contours are available today, such as Parallel Encoding and Target Approximation (PENTA) (Xu and Prom-On 2014, inter alia), the International Transcription System for Intonation (INTSINT) (Hirst and Di Cristo 1998), and the Fujisaki model (Fujisaki 1983, 2004). However, many aim at modelling f0 curves rather than defining the relation between f0 curves and the phonological structures that give rise to them. In contrast, AM makes a principled distinction between intonation as a subsystem of a language’s phonology and f0, its main phonetic exponent. The arguments for this distinction
are similar to those that apply to segmental aspects of speech organization. Consider the following analogy. In producing a word, it is axiomatic in linguistic theory that the word is not mapped directly onto the movements of the vocal organs. There is instead an intervening level of phonological structure: a word is represented in terms of abstract units of sounds known as ‘phonemes’ or articulatory ‘gestures’, which cause the vocal organs to move in an appropriate way. According to AM, the same applies in intonation. If a speaker wants to produce a meaning associated with a polar question (e.g. ‘Do you live in Melbourne?’), this meaning is not directly transduced as rising pitch. Instead, there is an intervening level of ‘abstract tones’ (which can, like phonemes, be represented symbolically); these tones specify a set of pitch targets that the speaker should produce if this particular melody is to be communicated. This relationship between abstract tones and phonetic realization also applies in languages that have lexically specified tone. In both types of language, only the abstract tones form part of the speaker’s cognitive-phonological plan in producing a melody, with the precise details of how pitch changes are to be realized being filled in by phonetic procedures. AM thus integrates the study of phonological representation and phonetic realization (for details, see §6.2 and §6.3 respectively).

The essential tenets of the model are largely based on Pierrehumbert’s (1980) dissertation (see also Bruce 1977), with additional refinements built on experimental research and formal analysis involving a large number of languages (see Ladd 2008b for a theoretical account; see Gussenhoven 2004 and Jun 2005a, 2014a for language surveys). The term ‘autosegmental-metrical’, which gave the theory its name, was coined by Ladd (1996) and reflects the connection between two subsystems of phonology: an autosegmental tier representing intonation’s melodic part as well as any lexical tones (if part of the system), and metrical structure representing prominence and phrasing. The connection reflects the fact that AM sees intonation as part of a language’s ‘prosody’, an umbrella term that encompasses interacting phenomena that include intonation, rhythm, prominence, and prosodic phrasing. The term ‘prosody’ is preferred over the older term ‘suprasegmentals’ (e.g. Lehiste 1977a; Ladd 2008b), so as to avoid the layering metaphor inherent in the latter (cf. Beckman and Venditti 2011): prosody is not a supplementary layer over vowels and consonants but an integral part of the phonological representation of speech.

Crucial to AM’s success has been the central role it gives to the underlying representation of tunes as a series of tones rather than contours. Specifically, AM analyses continuous (and often far from smooth) pitch contours as a series of abstract primitives. This is a challenging endeavour for two reasons. First, intonational primitives cannot be readily identified based on meaning (as tones can in tone languages, such as Cantonese, where distinct pitch patterns are associated with changes in lexical meaning). In contrast, the meaning of intonational primitives is largely pragmatic (Hirschberg 2004), so in languages like English choice of melody is not constrained by choice of words.
Second, f0 curves do not exhibit obvious changes that readily lead to positing distinct units; thus, breaking down the f0 curve into constituents is not as straightforward as identifying distinct patterns corresponding to segments in a spectrogram. This is all the more challenging, as a melody can spread across several words or be realized on a monosyllabic utterance. To illustrate this point, consider the pitch contours in Figure 6.1. The utterance in panel a is monosyllabic, while the one in panel b is eight syllables long. The f0 contours of the two utterances are similar but not identical: neither can be said to be a stretched or squeezed version of the other. Nevertheless, both contours are recognized by native speakers of English as realizations of the same melody, in terms of both form and pragmatic function, the aim of which is to signal incredulity (Ward and Hirschberg 1985; Hirschberg and Ward 1992).
[Figure 6.1 comprises two panels, each showing a spectrogram and f0 track (Hz over time): (a) the monosyllabic utterance ‘Lou?!’ and (b) ‘A ballet aficionado?!’, both annotated with the tones L*+H and L-H%.]
Figure 6.1 Spectrograms and f0 contours illustrating the same English tune as realized on a monosyllabic utterance (a) and a longer utterance (b).
The differences between the contours are not random. Rather, they exhibit what Arvaniti and Ladd (2009) have termed ‘lawful variability’, i.e. variation that is systematically related to variables such as the length of the utterance (as shown in Figure 6.1), the position of stressed syllables, and a host of other factors (see Arvaniti 2016 for a detailed discussion of additional sources of systematic variation in intonation). Besides understanding what contours like those in Figure 6.1 have in common and how they vary, a central endeavour of AM is to provide a phonological analysis that reflects this understanding.
6.2 AM phonology

6.2.1 AM essentials

In AM, intonation is phonologically represented as a string of Low (L) and High (H) tones and combinations thereof (Pierrehumbert 1980; Beckman and Pierrehumbert 1986; Ladd 2008b; cf. Leben 1973; Liberman 1975; Goldsmith 1981). Tones are considered
‘autosegments’: they are autonomous segments relative to the string of vowels and consonants. Ls and Hs are the abstract symbolic (i.e. phonological) primitives of intonation (much as they are the primitives in the representation of lexical tone). Their identity as Hs and Ls is defined in relative terms: H is used to represent tones deemed to be relatively high at some location in a melody relative to the pitch of the surrounding parts of the melody, while L is used to represent tones that are relatively low by the same criterion (cf. Pierrehumbert 1980: 68–75). Crucially, the aim of the string of tones is not to faithfully represent all the modulations that may be observed in f0 contours but rather to capture significant generalizations about contours perceived to be instances of the same melody (see Arvaniti and Ladd 2009 for a detailed presentation of this principle). Thus, AM phonological representations are underspecified in the sense that they do not account (and are not meant to account) for all pitch movements; rather they include only those elements needed to capture what is contrastive in a given intonational system. At the phonetic level as well, it is only the tones of the phonological representation that are realized as targets, with the rest of the f0 contour being derived by ‘interpolation’ (see §6.3.3 for a discussion of interpolation).
6.2.2 Metrical structure and its relationship with the autosegmental tonal string

The relationship between tones and segments (often referred to as ‘tune–text association’) is mediated by a metrical structure. This is a hierarchical structure that represents (i) the parsing of an utterance into a number of constituents and (ii) the prominence relations between them (e.g. the differences between stressed and unstressed syllables). The term ‘metrical structure’, as in the term ‘autosegmental-metrical theory’, is typically used when the representation of stress is at issue; when the emphasis is on phrasal structure, the term ‘prosodic structure’ is often used instead. Both relative prominence and phrasing can be captured by the same representation (see e.g. Pierrehumbert and Beckman 1988). An example is given in (1), which represents the prosodic structure of the utterance in Figure 6.1b, ‘a ballet aficionado?!’. As can be seen, syllables (σ) are grouped into feet (F), which in turn are grouped into prosodic words (ω); in this example, prosodic words are grouped into one intermediate phrase (ip), which is the only constituent of the utterance’s only intonational phrase (IP). Relative prominence is represented by marking constituents as strong (s) or weak (w).
(1)  [IP [ip(s)
        [ω(s) [σ(w) a] [F(s) [σ(s) bal] [σ(w) let]]]
        [ω(w) [σ(w) a] [F(w) [σ(s) fi] [σ(w) cio]] [F(s) [σ(s) na] [σ(w) do]]]]]

(IP = Intonational Phrase, ip = intermediate phrase, ω = Prosodic Word, F = Foot, σ = syllable; s = strong, w = weak. The original presents (1) as a tree.)
The prosodic structure in (1) is based on the model of Pierrehumbert and Beckman (1988), which has been implicitly adopted and informally used in many AM analyses. This model is similar to other well-known models (cf. Selkirk 1984; Nespor and Vogel 1986) but differs from them in some critical aspects. First, the number and nature of levels in the prosodic hierarchy are not fixed but language specific. For instance, Pierrehumbert and Beckman (1988) posit three main levels of phrasing for Tokyo Japanese: the accentual phrase (AP), the intermediate phrase (ip) and the Intonational Phrase (IP). However, they posit only two levels of phrasing for English: the ip and the IP, as illustrated in (1), since they found no evidence for an AP level of phrasing (Beckman and Pierrehumbert 1986). Further, the model assumes that it is possible to have headless constituents (i.e. constituents that do not include a strong element, or ‘head’). In the analysis of Pierrehumbert and Beckman (1988), this applies to Japanese AP’s that do not include a word with a lexical pitch accent; in such AP’s, there are no syllables, feet, or prosodic words that are strong. The same understanding applies by and large to several other languages that allow headless constituents, such as Korean (Jun 2005b), Chickasaw (Gordon 2005a), Mongolian (Karlsson 2014), Tamil (Keane 2014), and West Greenlandic (Arnhold 2014a); informally, we can say that these languages do not have stress. In addition, the model of Pierrehumbert and Beckman (1988) relies on n-ary branching trees (trees with more than two branches per node); grouping largely abides by the Strict Layer Hypothesis, according to which all constituents of a given level in the hierarchy are exhaustively parsed into constituents of the next level up (Selkirk 1984). However, Pierrehumbert and Beckman also accept limited extrametricality, such as syllables that are linked directly to a prosodic word node (Pierrehumbert and Beckman 1988: 147 ff.); this is illustrated in (1), where the indefinite article a and the unstressed syllable at the beginning of aficionado are linked directly to the relevant ω node. (For an alternative model of prosodic structure that allows limited recursiveness, see Ladd 2008b: ch. 8, incl. references.)

Independently of the particular version of prosodic structure adopted in an AM analysis, it is widely agreed that tones associate with phrasal boundaries or constituent heads (informally, stresses) or both (see §6.2.3 for details on secondary association of tones, and Gussenhoven 2018 for a detailed discussion of tone association). Tones that associate with stressed syllables are called ‘pitch accents’ and one of their roles is prominence enhancement; they are notated with a star (e.g. H*). The final accent in a phrase is called the ‘nuclear pitch accent’ or ‘nucleus’, and is usually deemed the most prominent. Pitch accents may consist of more than one tone and are often bitonal; examples include L*+H- and L-+H* (after Pierrehumbert’s 1980 original notation but also annotated as L*H or L*+H, and LH* or L+H* respectively). Pitch patterns have been analysed as reflexes of bitonal accents in a number of languages, including English (Ladd and Schepman 2003), German (Grice et al. 2005a), Catalan (Prieto 2014), Arabic (Chahal and Hellmuth 2014b), and Jamaican Creole (Gooden 2014). Grice (1995a) has also posited tritonal accents for English. (See also §6.2.3 on secondary association.)
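One way to make the hierarchical representation concrete is to encode a structure like (1) as nested nodes; the sketch below is our own encoding (the node format and layer values are assumptions, not the model’s formalism), and it checks Strict Layer Hypothesis-style layering while tolerating the limited extrametricality just discussed.

```python
# A sketch encoding (1): each node is (label, strength, children).
# LAYERS orders the hierarchy; strict layering means each node dominates
# nodes exactly one layer down, except that a bare syllable directly
# under a prosodic word is tolerated (limited extrametricality).

LAYERS = {"IP": 4, "ip": 3, "w": 2, "F": 1, "syll": 0}

def layering_ok(node):
    label, _strength, children = node
    for child in children:
        direct = LAYERS[label] - LAYERS[child[0]] == 1
        extrametrical = label == "w" and child[0] == "syll"
        if not (direct or extrametrical):
            return False
        if not layering_ok(child):
            return False
    return True

def syll(s, strength="w"):
    return ("syll", strength, [])

ballet = ("w", "s", [syll("a"), ("F", "s", [syll("bal", "s"), syll("let")])])
aficionado = ("w", "w", [syll("a"),
                         ("F", "w", [syll("fi", "s"), syll("cio")]),
                         ("F", "s", [syll("na", "s"), syll("do")])])
tree = ("IP", "s", [("ip", "s", [ballet, aficionado])])
assert layering_ok(tree)
```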
In Pierrehumbert (1980), the starred tone of a bitonal pitch accent is metrically stronger than the unstarred tone, and the only one that is phonologically associated (for details see §6.3.1); the unstarred weak tone that leads or trails the starred tone is ‘floating’ (i.e. it is a tone without an association). Research on a number of languages since Pierrehumbert (1980) indicates that additional types of relations are possible between the tones of bitonal accents. Arvaniti et al. (1998, 2000) have provided experimental evidence from Greek that
tones in bitonal accents can be independent of each other, in that neither tone exhibits the behaviour of an unstarred tone described by Pierrehumbert (1980). Frota (2002), on the other hand, reports data from Portuguese showing that the type of ‘loose’ bitonal accent found in Greek can coexist with pitch accents that show a closer connection between tones, akin to the accents described by Pierrehumbert (1980) for English.

Tones that associate with phrasal boundaries are collectively known as ‘edge tones’ and their main role is to demarcate the edges of the phrases they associate with. These may also be multitonal; for example, for Korean, Jun (2005b) posits boundary tones with up to five tones (e.g. LHLHL%), while Prieto (2014) posits a tritonal LHL% boundary tone for Catalan. Following Beckman and Pierrehumbert (1986), many analyses posit two types of edge tone, ‘phrase accents’ and ‘boundary tones’, notated with - and % respectively (e.g. H-, H%). Phrase accents demarcate ip boundaries and boundary tones demarcate IP boundaries. For example, Nick and Mel were late because they missed the train is likely to be uttered as two ip’s forming one IP: [[Nick and Mel were late]ip [because they missed the train]ip]IP; the boundary between the two ip’s is likely to be demarcated with a H- phrase accent. An illustration of the types of association between tones and prosodic structure used in AM is provided in (2), using the same utterance as in (1).
(2)  [IP [ip(s)
        [ω(s) [σ(w) a] [F(s) [σ(s) bal] [σ(w) let]]]
        [ω(w) [σ(w) a] [F(w) [σ(s) fi] [σ(w) cio]] [F(s) [σ(s) na] [σ(w) do]]]]]

(In (2), the pitch accent L*+H is linked, via its starred tone, to the strong syllable [bal], and the edge tones L-H% to the right edge of the ip and IP; the original shows these associations as lines from the tones to the text ‘a bal let a fi cio na do’.)

All of the languages investigated so far have edge tones that associate with right boundaries. Left-edge boundary tones have also been posited for several languages, including English (Pierrehumbert 1980; Gussenhoven 2004), Basque (Elordieta and Hualde 2014), Dalabon (Fletcher 2014), and Mongolian (Karlsson 2014). However, the specific proposal of Beckman and Pierrehumbert (1986) linking phrase accents to the ip and boundary tones to the IP has not been generally accepted, although it has been adopted by many analyses, including those for Greek (Arvaniti and Baltazani 2005), German (Grice et al. 2005a), Jamaican Creole (Gooden 2014), and Lebanese and Egyptian Arabic (Chahal and Hellmuth 2014b). Some AM analyses dispense altogether with phrase accents either for reasons of parsimony—positing only two types of primitives, pitch accents and boundary tones—or because they adopt a different conception of how the f0 contour is to be broken into constituent tones (see, among others, Gussenhoven 2004, 2005 on Dutch; Frota 2014 on Portuguese; Gussenhoven 2016 on English). Thus, phrase accents are not necessarily included in all AM analyses.
Following Pierrehumbert and Hirschberg (1990), pitch accents and edge tones are treated as intonational morphemes with pragmatic meaning that contribute compositionally to the pragmatic interpretation of an utterance (Pierrehumbert and Hirschberg 1990; Steedman 2014; see chapter 30 for a discussion of intonational meaning). Although this understanding of intonation as expressing pragmatic meaning is generally accepted, it may not apply to the same extent to all systems. For example, in languages like Korean and Japanese, in which intonation is used primarily to signal phrasing, tones express pragmatic meaning to a much lesser extent than in languages like English (Pierrehumbert and Beckman 1988 on Japanese; Jun 2005b on Korean).
6.2.3 Secondary association of tones

In addition to a tone’s association with a phrasal boundary or constituent head, AM provides a mechanism for ‘secondary association’. For instance, according to Grice (1995a: 215 ff.), leading tones of English bitonal accents, such as L in L+H*, associate with the syllable preceding the accented one (if one is available), while trailing tones, such as H in L*+H, occur a fixed interval in ‘normalized time’ after the starred tone. The former is a type of secondary association (for discussions of additional association patterns, see Barnes et al. 2010a on English; van de Ven and Gussenhoven 2011 on Dutch; Peters et al. 2015 on several Germanic varieties).

Although secondary association has been used for a variety of purposes, it has come to be strongly associated with phrase accents. Pierrehumbert and Beckman (1988) proposed the mechanism of secondary association to account for the fact that phrase accents often spread (see also Liberman 1979). Specifically, Pierrehumbert and Beckman (1988) proposed that edge tones may acquire additional links (i.e. secondary associations) either to a specific tone-bearing unit (TBU), such as a stressed syllable, or to another boundary. For example, they posited that English phrase accents are linked not only to the right edge of their ip (as advocated in Beckman and Pierrehumbert 1986) but also to the left edge of the word carrying the nuclear pitch accent. An example of such secondary association can be found in Figure 6.1b, in which the L- phrase accent is realized as a low f0 stretch. This stretch is due to the fact that the L- phrase accent associates both with the right ip boundary (and thus is realized as close as possible to the right edge of the phrase) and with the end of the accented word (and thus stretches as far as possible to the left). The general mechanism whereby edge tones have secondary associations has also been used by Gussenhoven (2000a) in his analysis of the intonation of Roermond Dutch, which assumes that boundary tones can be phonologically aligned both with the right edge of the phrase and with an additional leftmost position.

The analyses of Pierrehumbert and Beckman (1988) and Gussenhoven (2000a) were the basis for a wider use of secondary association for phrase accents developed in Grice et al. (2000), who argue that the need for positing phrase accents in a given intonation system is orthogonal to the need for the ip level of phrasing. Grice et al. (2000) examined putative phrase accents in a variety of languages (Cypriot Greek, Dutch, English, German, Hungarian, Romanian, and Standard Greek). They showed that the phrase accents they examined are realized either on a peripheral syllable, as expected of edge tones, or an earlier one, often one that is metrically strong; which of the two realizations prevails depends on whether the metrically strong syllable is already associated with another tone or not.
[Figure 6.2 comprises two panels, each showing a spectrogram and f0 track (Hz over time) of [koˈlibise i ˈðimitra]: in (a) the tune is annotated L* H-L%, with L* on [koˈlibise]; in (b) it is annotated L*+H L* H-L%, with L*+H on [koˈlibise] and L* on [ˈðimitra].]
Figure 6.2 Spectrograms and f0 contours of the utterance [koˈlibise i ˈðimitra] with focus on [koˈlibise] ‘swam’ (a) and on [ˈðimitra] (b), translated as ‘Did Dimitra SWIM?’ and ‘[Was it] DIMITRA who swam?’ respectively.
This type of variation is illustrated in Figure 6.2 with the Greek polar question tune L* H-L% (Grice et al. 2000; Arvaniti et al. 2006a). As can be seen in Figure 6.2, both contours have a pitch peak close to the end of the utterance. This peak co-occurs with the stressed antepenult of the final word in the question in Figure 6.2a ([ˈði] of [ˈðimitra]), but with the last vowel in the question in Figure 6.2b (the vowel [a] of [ˈðimitra]). (Note also that the stressed antepenult of [ˈðimitra] has low f0 in Figure 6.2b, as does the stressed syllable of [koˈlibise] in Figure 6.2a; both reflect an association with the L* pitch accent of this tune.) Grice et al. (2000) attribute this difference in the alignment of the pitch peak to secondary association: the peak is the reflex of a H- phrase accent associated with a phrasal boundary, but also has a secondary association to the last metrically strong syllable of the utterance. This association is phonetically realized when this metrically strong syllable is not associated with a pitch accent; this happens when the focus is on an earlier word, which then attracts the L* pitch accent. The phonological structures involved are shown in (3a) and (3b); (3a) shows the primary and secondary association of the H- phrase accent; (3b) shows that the secondary association of H- is not possible because [ˈði] is already associated with the L* pitch accent.
(3)  a. ‘Did Dimitra SWIM?’
            L*                H-  L%
        [[koˈlibise i ˈðimitra]ip]IP

     b. ‘[Was it] DIMITRA [who] swam?’
            L*+H        L*    H-  L%
        [[koˈlibise i ˈðimitra]ip]IP
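The docking logic behind (3) can be stated procedurally; the sketch below, with names and data structures of our own choosing, realizes the H- target on the last metrically strong syllable when that syllable is free, and at the right edge otherwise.

```python
# A sketch of the secondary-association logic of (3): H- surfaces on the
# last metrically strong syllable unless that syllable already bears a
# pitch accent, in which case the peak stays at the phrase edge.

def h_phrase_accent_site(n_syllables, stressed, accented):
    """`stressed`/`accented`: sets of indices of metrically strong
    syllables and of syllables already linked to a pitch accent."""
    last_strong = max(stressed)
    if last_strong not in accented:
        return last_strong            # secondary association is realized
    return n_syllables - 1            # default: right edge of the phrase

# [ko, li, bi, se, i, ði, mi, tra]: strong syllables [li] (1) and [ði] (5)
assert h_phrase_accent_site(8, {1, 5}, accented={1}) == 5     # (3a): peak on [ˈði]
assert h_phrase_accent_site(8, {1, 5}, accented={1, 5}) == 7  # (3b): peak on final [a]
```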
6.2.4 The phonological composition of melodies

In Pierrehumbert (1980), the grammar for generating English tunes is as shown in Figure 6.3. Primitives of the system can combine freely and, in the case of pitch accents, iteratively. With the exception of left-edge boundary tones, which are optional, all other elements are required. In other words, a well-formed tune must include at least one pitch accent followed by a phrase accent and a boundary tone (e.g. H* L-L%). The fact that elements combine freely is connected to Pierrehumbert’s position that there is no hierarchical structure for tunes (they are a linear string of autosegments, as illustrated in (2)). It follows that there are no qualitative differences between pitch accents, as in other models of intonation, and no elements are privileged in any way. This conceptualization of the tonal string also allows for the integration of lexically specified and post-lexical tones (i.e. intonation) into one tonal string.

Not everyone who works within AM shares this view. Gussenhoven (2004: ch. 15, 2005, 2016) provides analyses of English and Dutch intonation that rely on the notion of nuclear contours as units comprising what in other AM accounts is a sequence of a nuclear pitch accent followed by edge tones. Gussenhoven’s nuclear contours are akin to the nuclei of the British School. Gussenhoven (2016) additionally argues that a grammar along the lines of Figure 6.3 makes the wrong predictions, since not all possible combinations are grammatical in English, while the grammar results in both over- and under-analysis (capturing dubious distinctions while failing to capture genuine differences between tunes, respectively). Dainora (2001, 2006) also showed that some combinations of accents and edge tones are much more likely than others (though the frequencies she presents may be skewed, as they are based on a news-reading corpus).
[Figure 6.3 shows a finite-state diagram: an optional initial boundary tone (%H or %L), followed by one or more pitch accents drawn from H*, L*, L+H*, L*+H, H*+L, H+L*, and H*+H, then a phrase accent (H or L) and a final boundary tone (H% or L%).]
Figure 6.3 The English intonation grammar of Pierrehumbert (1980); after Dainora (2006).
Other corpus studies of English also find that there are preferred combinations of tones in spoken interaction (e.g. Fletcher and Stirling 2014). Overall, the evidence indicates that some combinations are preferred and standardized, possibly because they reflect frequently used pragmatic meanings. This is particularly salient in some languages in which tune choice is limited to a small number of distinctive patterns (e.g. see chapter 26 for a description of intonation patterns in Indigenous Australian languages), by contrast with languages such as Dutch or English, where a range of tonal combinations are available to speakers (e.g. Gussenhoven 2004, 2005).
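The free-combination grammar of Figure 6.3 is finite-state, so it can be expressed directly as a regular expression over tone labels; the encoding below is ours (the figure writes the phrase accents as plain H and L, here rendered H- and L- in line with the chapter’s notation).

```python
# A sketch of Figure 6.3 as a regular expression: optional initial
# boundary tone, one or more pitch accents, a phrase accent, and a
# final boundary tone. Tune strings are space-separated tone labels.
import re

TUNE = re.compile(
    r"^(%H |%L )?"                                       # optional %H / %L
    r"((H\*|L\*|L\+H\*|L\*\+H|H\*\+L|H\+L\*|H\*\+H) )+"  # pitch accent(s)
    r"(H-|L-) "                                          # phrase accent
    r"(H%|L%)$"                                          # boundary tone
)

assert TUNE.match("H* L- L%")           # the plain declarative tune
assert TUNE.match("%H L*+H H* H- H%")   # iterated accents, initial %H
assert not TUNE.match("H* L-")          # ill-formed: no boundary tone
```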
6.3 Phonetic implementation in AM

As noted in §6.1, AM provides a model for mapping abstract phonological representations to phonetic realization. Much of what we assume about this connection derives from Pierrehumbert (1980) and Bruce (1977). Phonetically, tones are said to be realized as ‘tonal targets’ (i.e. as specific points in the contour), while the rest of an f0 curve is derived by interpolation between these targets. That is, f0 contours are both phonologically and phonetically underspecified, in that only a few points of each contour are determined by tones and their targets (see Arvaniti and Ladd 2009 for empirical evidence of this point). Tonal targets are usually ‘turning points’, such as peaks, troughs, and elbows in the contour; they are defined by their ‘alignment’ and ‘scaling’ (see §6.3.1 and §6.3.2). Scaling refers to the value of the targets in terms of f0. Alignment is defined as the position of the tonal target relative to the specific TBU with which it is meant to co-occur (e.g. Arvaniti et al. 1998, 2006a, 2006b; Arvaniti and Ladd 2009). The identity of TBUs varies by language, depending on syllable structure, but we can equate TBUs with syllable nuclei (and in some instances with morae and coda consonants; see Pierrehumbert and Beckman 1988 on Japanese; Ladd et al. 2000 on Dutch; Gussenhoven 2012a on Limburgish). The TBUs with which tones phonetically co-occur are related to the metrical positions with which the tones associate in phonology: thus, pitch accents typically co-occur with stressed syllables (though not all stressed syllables are accented); edge tones are realized on peripheral TBUs, such as phrase-final vowels.
6.3.1 Tonal alignment

In AM, tonal alignment is a phonetic notion that refers specifically to the temporal alignment of tones with segmental and/or syllabic landmarks. Alignment can refer to the specific timing of a tone, but it may also reflect a phonological difference, in which case the timing of tones relative to the segmental string gives rise to a change in lexical or pragmatic meaning. For example, Bruce (1977) convincingly showed that the critical difference between the two lexical pitch accents of Swedish, Accent 1 and Accent 2, was due to the relative temporal alignment of a HL tonal sequence. For Accent 1, the H tone is aligned earlier with respect to the accented vowel than for Accent 2, a difference that Bruce (2005) encoded as a phonological difference between H+L* (Accent 1) and H*+L (Accent 2) for the Swedish East Prosodic dialect (see Bruce 2005 for a full overview of dialect-specific phonological
variation in Swedish). Pierrehumbert (1980) similarly proposed L+H* and L*+H in English to account for the difference between early versus late alignment, with the H of the L*+H being realized after the stressed TBU (see also the discussion of trailing and leading tones in §6.2.3).

While alignment differences are to be encoded in tonal representations when they are contrastive, in cases where variation in tonal alignment is not contrastive, a single representation, or ‘label’, is used. For instance, in Glasgow English, the alignment of the rising pitch accent, L*H in the analysis of Mayo et al. (1997), varies from early to late in the accented rhyme, without any apparent difference in meaning. In such cases, the tonal representation may allow for more options without essentially affecting the analysis. For instance, since the rising pitch accent of Glasgow English is variable, a simpler representation as H* instead of L*H may suffice, as nothing hinges on including a L tone in the accent’s representation or on starring one or the other tone (cf. Keane 2014 on Tamil; Fletcher et al. 2016 on Mawng; Arvaniti 2016 on Romani).

One point that has become very clear thanks to a wide range of research on tonal alignment is that the traditional autosegmental idea that phonological association necessarily entails phonetic co-occurrence between a tone and a TBU does not always hold (e.g. see Arvaniti et al. 1998 on Greek; D’Imperio 2001 on Neapolitan Italian). This applies particularly to pitch peaks. Indeed, one of the most consistent findings in the literature is that of ‘peak delay’, the finding that accentual pitch peaks regularly occur after the TBU they are phonologically associated with. Peak delay was first documented by Silverman and Pierrehumbert (1990), who examined the phonetic realization of prenuclear H* accents in American English. It has since been reported for (among many others) South American Spanish (Prieto et al. 1995), Kinyarwanda (Myers 2003), Bininj Gun-wok (Bishop and Fletcher 2005), Catalan (Prieto 2005), Irish (Dalton and Ní Chasaide 2007a), and Chickasaw (Gordon 2008). The extent of peak delay can vary across languages and pitch accents, but it remains stable within category (Prieto 2014). This stability is known as ‘segmental anchoring’. The idea of segmental anchoring is based on the alignment patterns observed by Arvaniti et al. (1998) for Greek prenuclear accents and further explored in subsequent work by Ladd and colleagues on other languages (e.g. Ladd et al. 2000 on Dutch; Ladd and Schepman 2003 and Ladd et al. 2009b on English; Atterer and Ladd 2004 on German). Segmental anchoring is the hypothesis that tonal targets anchor onto particular segments in phonetic realization. The idea of segmental anchoring spurred a great deal of research in a variety of languages that have largely supported it (e.g. D’Imperio 2001 on Neapolitan Italian; Prieto 2009 on Catalan; Arvaniti and Garding 2007 on American English; Gordon 2008 on Chickasaw; Myers 2003 on Kinyarwanda; Elordieta and Calleja 2005 on Basque Spanish; Dalton and Ní Chasaide 2007a on Irish). Finally, research on tonal alignment also supports the key assumption underpinning AM models in which tonal targets are levels rather than contours (i.e. rises or falls). This idea was put to the test in Arvaniti et al. (1998), who found that the L and H targets of Greek prenuclear accents each have their own alignment properties.
A consequence of this mode of alignment is that the rise defined by the L and H targets has no invariable properties (such as duration or slope), a finding used by Arvaniti et al. (1998) to argue in favour of levels as intonational primitives. Empirical evidence from tone perception in English (Dilley and Brown 2007) showing that listeners perceptually equate pitch movements with level tones supports this view (see also House 2003).
6.3.2 Tonal scaling

Since Ladd (1996) a distinction has been made between ‘pitch span’, which refers to the extent of the range of frequencies used by a speaker, and ‘pitch level’, which refers to whether these frequencies are overall high or low; together, level and span constitute a speaker’s ‘pitch range’. Thus, two speakers may have the same pitch span of 200 Hz but one may use a low level (e.g. 125–325 Hz) and the other a higher level (e.g. 175–375 Hz). A speaker’s pitch range may change for paralinguistic reasons, while, cross-linguistically, gender differences have also been observed (e.g. Daly and Warren 2002; Graham 2014).

Three main linguistic factors affect tonal scaling: declination, tonal context, and tonal identity. Declination is a systematic lowering of targets throughout the course of an utterance (’t Hart et al. 1990), though declination can be suspended (e.g. in questions) and is reset across phrasal boundaries (Ladd 1988; see also Truckenbrodt 2002). Listeners anticipate declination effects and adjust their processing of tonal targets accordingly (e.g. Yuen 2007). Within AM, the understanding of declination follows Pierrehumbert (1980): the scaling of tones is modelled with reference to a declining baseline that is invariant for each speaker (at a given time). The baseline is defined by its slope and a minimum value assumed to represent the bottom of the speaker’s range, which tends to be very stable for each speaker (Maeda 1976; Menn and Boyce 1982; Pierrehumbert and Beckman 1988). L and H tones (apart from terminal L%s) are scaled above the baseline and with reference to it.

Tonal context relates to the fact that the scaling of targets is related to the targets of preceding tones. For sequences of accentual H tones in particular, Liberman and Pierrehumbert (1984) have argued that every tone’s scaling is a fraction of the scaling of the preceding H. Tonal scaling is influenced by tonal context: for example, according to Pierrehumbert (1980: 136), the difference between the vocative chant H*+L- H- L% and a straightforward declarative, H* L- L%, is that the L% in the former melody remains above the baseline (and is realized as sustained level pitch), while the L% in the latter is realized as a fall to the baseline. In Pierrehumbert’s analysis, this difference is due to tonal context: in H*+L H-L%, L% is ‘upstepped’ (i.e. scaled higher) after a H- phrase accent; this context does not apply in H* L-L%, so L% reaches the baseline.

One exception to the view that each H tone’s scaling is calculated as a fraction of the preceding H is what Liberman and Pierrehumbert (1984) have called ‘final lowering’, the fact that the final peak in a series is scaled lower than what a linear relation between successive peaks would predict. It has been reported in several languages with very different prosodic systems, including Japanese (Pierrehumbert and Beckman 1988), Dutch (Gussenhoven and Rietveld 1988), Yoruba (Connell and Ladd 1990; Laniran and Clements 2003), Kipare (Herman 1996), Spanish (Prieto et al. 1996), and Greek (Arvaniti and Godjevac 2003); see Truckenbrodt (2004, 2016) for an alternative analysis of final lowering.

Tonal identity refers to different effects of a number of factors on the scaling of H and L tones. In English, for instance, L tones are said to be upstepped following H tones, while the reverse does not apply (Pierrehumbert 1980).
Further, changes in pitch range affect the scaling of H and L tones in different ways: L tones tend to get lower when pitch span expands, while H tones get higher (e.g. Pierrehumbert and Beckman 1988; Gussenhoven and Rietveld 2000). An aspect of tonal scaling that has attracted considerable attention is ‘downstep’, the lower-than-expected scaling of H tones. In Pierrehumbert (1980) and Beckman and Pierrehumbert (1986), the essential premise is that downstep is the outcome of contextual rules. Thus, Pierrehumbert (1980) posits that downstep applies to the second H tone in a
HLH sequence, as in the case of the second H in Pierrehumbert’s (1980) representation of the vocative chant H*+L- H- L% above. In Beckman and Pierrehumbert (1986), tonal identity is also a key factor: all bitonal pitch accents trigger downstep of a following H tone. This position has been relaxed somewhat as it has been found that bitonal accents do not always trigger downstep in spoken discourse in American English (Pierrehumbert 2000). Others have argued that downstepped accents differ in meaning from accents that are not downstepped and thus that downstep should be treated as an independent phonological feature to mark the contrast between downstepped and non-downstepped accents, such as !H* and H* respectively (Ladd 1983, 2008b). The issue of whether downstep is a matter of context-dependent phonetic scaling or represents a meaningful choice remains unresolved for English and has been a matter of debate more generally.

In some AM systems, additional notations are used to indicate differences in scaling. For example, 0% and % have been used to indicate an intermediate level of pitch that is neither high nor low within a given melody and is often said to reflect a return to a default mid pitch in the absence of a tonal specification (e.g. Grabe 1998a; Gussenhoven 2005; see Ladd 2008b: ch. 3 and Arvaniti 2016 for discussion). Still other systems incorporate additional variations in pitch, such as ‘upstep’, or higher-than-expected scaling. For instance, Grice et al. (2005a) use ^H% in the analysis of German intonation, and Fletcher (2014: 272) proposes ^ as ‘an upstepped or elevated pitch diacritic’ in her analysis of Dalabon. The use of symbols such as 0% reflects the awkwardness that mid-level tones pose for analysis, particularly if evidence suggests that such mid-level tones contrast with H and L tones proper, as has been argued for Greek (Arvaniti and Baltazani 2005), Maastricht Limburgish (Gussenhoven 2012b), Polish (Arvaniti et al. 2017), and German (Peters 2018). The use of diacritics more generally reflects the challenge of determining what is phonological and what is phonetic in a given intonational system, and thus what should be part of the phonological representation; see Jun and Fletcher (2014) and Arvaniti (2016) for a discussion of field methods that address these issues.

A reason why separating phonological elements from phonetic realization in intonation is such a challenge is the significant amount of variation attested in the realization of intonation. Even speakers of languages that have a number of different contrastive pitch accents may realize the same pitch accent category in different ways. Niebuhr et al. (2011a), for instance, report data from North German and Neapolitan Italian showing that some speakers signal a pitch accent category via f0 shape, whereas others manipulate tonal alignment (see also Grice et al. 2017 on individual variation in the realization and interpretation of pitch accents in German). Tonal alignment may also vary depending on dialect (e.g. Grabe et al. 2000 on varieties of English; Atterer and Ladd 2004 on varieties of German), and even the amount of voiced material available (Baltazani and Kainada 2015 on Ipiros Greek; Grice et al. 2015 on Tashlhiyt Berber).
There may also be variation in the degree of rising or falling around the tone target, or general pitch scaling differences depending on where the target occurs in an utterance (IP, ip, or AP), degree of speaker emphasis, dialect, or speaker-specific speaking style. Phrase accent and boundary tone targets also vary in terms of their phonetic realization even across typologically related varieties. The classic fall-rise tune of Australian English H* L-H% is often realized somewhat differently from the same phonological tune in Standard Southern British English. The L-H% represents a final terminal rise in both varieties but scaling of the final H% tone tends to be somewhat higher in Australian English and is often described as a component of ‘uptalk’ (Warren 2016; see Fletcher and
Stirling 2014 and chapter 19 for more detailed discussion). It follows from the preceding discussion that the actual phonetic realization of tonal elements can be gradient with respect to both alignment and scaling. Listeners may not necessarily interpret the different realizations as indicative of different pragmatic effects, suggesting that there is no need to posit additional contrastive categories to represent this variation. It is therefore important that a phonetic model for any language or language variety can account for this kind of realizational variation (for a proposal on how to do so, see Arvaniti 2019 and chapter 9).
6.3.3 Interpolation and tonal crowding

Interpolation is a critical component of AM, as it is directly linked to the important AM tenet that melodies are not fully specified either at the phonological or the phonetic level, in that the f0 on most syllables in an utterance is determined by the surrounding tones. The phonological level involves a small number of tones; at the phonetic level, it is only these tones that are planned as tonal targets, while the rest of the f0 contour is derived by interpolation between them. The advantages of modelling f0 in this manner were first illustrated with Tokyo Japanese data by Pierrehumbert and Beckman (1988: 13 ff.). They showed that the f0 contours of APs without an accented word could be modelled by positing only one H target, associated with the AP's second mora, and one L target at the beginning of the following AP; the f0 slope from the H to the L target depended on the number of morae between the two. This change in the f0 slope is difficult to model if every mora is specified for f0. Despite its importance, interpolation has not been investigated as extensively as alignment and scaling. The interpolation between targets is expected to be linear and can be conceived of as the equivalent of an articulator's trajectory between two constrictions (cf. Browman and Goldstein 1992b). An illustration of the mapping between phonological structure and f0 contour is provided in (4), where the open circles in the f0 contour at the bottom represent tonal targets for the four tones of the phonological representation. As mentioned in §6.2.3, the L- phrase accent in this melody shows a secondary association to the end of the word with the nuclear accent, here ‘ballet’, and thus has two targets, leading to its realization as a stretch of low f0 (for a detailed discussion, see Pierrehumbert and Beckman 1988: ch. 6).
(4) [Tree diagram: the prosodic structure of ‘a ballet aficionado’ (syllabified a-bal-let-a-fi-cio-na-do), with the IP dominating strong and weak intermediate phrases (ip), prosodic words (ω), feet (F), and syllables (σ), and with the tones L* +H and L- H% linked to their targets in the f0 contour.]
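The target-and-interpolation idea is easy to state procedurally. The sketch below is a minimal illustration, assuming a melody represented as a sparse list of (time, f0) targets; all numeric values are invented for the example and do not come from the chapter.

```python
# Illustrative sketch of AM-style phonetic implementation: a melody is a
# sparse list of (time, f0) tonal targets; f0 elsewhere is derived by
# linear interpolation. Target values here are invented for illustration.

def f0_contour(targets, times):
    """Linearly interpolate f0 at each requested time between targets."""
    contour = []
    for t in times:
        # Find the flanking targets and interpolate between them.
        for (t1, f1), (t2, f2) in zip(targets, targets[1:]):
            if t1 <= t <= t2:
                contour.append(f1 + (f2 - f1) * (t - t1) / (t2 - t1))
                break
        else:  # before the first or after the last target: hold its value
            contour.append(targets[0][1] if t < targets[0][0] else targets[-1][1])
    return contour

# Four targets for L* +H ... L- H% over a 1 s utterance (values invented):
targets = [(0.20, 110.0), (0.35, 180.0), (0.80, 100.0), (1.00, 160.0)]
print(f0_contour(targets, [0.1, 0.3, 0.5, 0.9]))
```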
One possible exception to linear interpolation is the ‘sagging’ interpolation between two H* pitch accents discussed in Pierrehumbert (1980, 1981) for English; sagging interpolation is said to give rise to an f0 dip between the two accentual peaks. It has always been seen as something of an anomaly, leading Ladd and Schepman (2003) to suggest that in English its presence is more plausibly analysed as the reflex of an L tone. Specifically, Ladd and Schepman (2003) proposed that the pitch accent of English represented in Pierrehumbert (1980) as H* should be represented instead as (L+H)*, a notation implying that in English both the L and H tone are associated with the stressed syllable. Independently of how sagging interpolation is phonologically analysed, non-linear interpolation is evident in the realization of some tonal events. For instance, L*+H and L+H* in English differ in terms of shape, the former being concave and the latter convex, a difference that is neither captured by their autosegmental representations nor anticipated by linear interpolation between the L and H tones (Barnes et al. 2010b). In order to account for this difference, Barnes et al. (2012a, 2012b) proposed a new measure, the Tonal Center of Gravity (see chapter 9). Further, although it is generally assumed that tones are realized as local peaks and troughs, evidence suggests this is not always the case. L tones may be realized as stretches of low f0, a realization that may account for the difference between convex L+H* (where the L tone is realized as a local trough) and concave L*+H (where the L tone is realized as a low f0 stretch). Similarly, H tones may be realized not as peaks but as plateaux. In some languages, plateaux are used interchangeably with peaks (e.g. Arvaniti 2016 on Romani), while in others the two are distinct, so that the use of peaks or plateaux may affect the interpretation of the tune (e.g. D'Imperio 2000 and D'Imperio et al. 2000 on Neapolitan Italian), the scaling of the tones involved (e.g. Knight and Nolan 2006 and Knight 2008 on British English), or both (Barnes et al. 2012a on American English). Data like these indicate that a phonetic model involving only targets as turning points and linear interpolation between them may be too simple to fully account for all phonetic detail pertaining to f0 curves or for its processing by native listeners. Nevertheless, the perceptual relevance of these additional details is at present far from clear. As noted above, the need for interpolation comes from the fact that the phonological representation of intonation is sparse; for example, ‘a ballet aficionado’ in (2) has eight syllables but the associated melody has a total of four tones. Nevertheless, it is also possible for the reverse to apply—that is, for an utterance to have more tones than TBUs; ‘Lou?’ uttered with the same L*+H L-H% tune (as in Figure 6.1b) is such an instance, as four tones must be realized on one syllable. In AM, this phenomenon is referred to as ‘tonal crowding’. Tonal crowding is phonetically resolved in a number of ways: (i) ‘truncation’, the elision of parts of the contour (Bruce 1977 on Swedish; Grice 1995a on English; Arvaniti et al. 1998 and Arvaniti and Ladd 2009 on Greek; Grabe 1998a on English and German); (ii) ‘undershoot’, the realization of all tones without them reaching their targets (Bruce 1977 on Swedish; Arvaniti et al.
1998, 2000, 2006a, 2006b on Greek; Prieto 2005 on Catalan); and (iii) temporal realignment of tones (Silverman and Pierrehumbert 1990 on American English; Arvaniti and Ladd 2009 on Greek). Undershoot and temporal realignment often work synergistically, giving rise to ‘compression’. Attempts have been made within AM to pin different resolutions of tonal crowding to specific languages (Grabe 1998a). Empirical evidence, however, indicates that the mechanism used is specific to elements in a tune, rather than to a language as a whole (for discussion see Ladd 2008b; Arvaniti and Ladd 2009; Arvaniti 2016).
Arvaniti et al. (2017) proposed using tonal crowding as a diagnostic of a putative tone's phonological status, as it allows us to distinguish optional tune elements (those that are truncated in tonal crowding) from required elements (those that are compressed under the same conditions).
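The contrast between the main resolutions of tonal crowding can be made concrete with a toy model. In the sketch below, a tune is a list of timed targets planned for some duration; truncation drops the targets that no longer fit, while compression rescales all target times into the available span (f0 undershoot is deliberately ignored, and all values are invented).

```python
# Toy sketch of two resolutions of tonal crowding on a short syllable.
# A tune is a list of (relative_time, f0) targets planned for a duration
# `planned`; the syllable actually offers only `available` seconds.

def compress(targets, planned, available):
    """Realize every tone, rescaling target times into the shorter span
    (undershoot of f0 values is ignored in this sketch)."""
    return [(t * available / planned, f0) for t, f0 in targets]

def truncate(targets, planned, available):
    """Elide the part of the contour that no longer fits."""
    return [(t, f0) for t, f0 in targets if t <= available]

tune = [(0.05, 110.0), (0.20, 180.0), (0.35, 100.0), (0.50, 170.0)]  # invented
print(compress(tune, planned=0.5, available=0.25))
print(truncate(tune, planned=0.5, available=0.25))  # the final rise is lost
```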
6.4 Applications of AM

The best-known application of AM is the family of ToBI systems. ToBI (Tones and Break Indices) was a tool originally developed for the prosodic annotation of American English corpora (Silverman et al. 1992; see also Beckman et al. 2005; Brugos et al. 2006). Since then several similar systems have been developed for a variety of languages (see e.g. Jun 2005a, 2014a for relevant surveys). In order to distinguish the system for American English from the general concept of ToBI, the term MAE_ToBI has been proposed for the former (where MAE stands for Mainstream American English; Beckman et al. 2005). ToBI was originally conceived as a tool for research and speech technology; for example, the MAE_ToBI annotated corpus can be searched for instances of an intonational event, such as the H* accent in English, so that a sample of its instantiations can be analysed and generalizations as to its realization reached. Such generalizations are useful not only for speech synthesis but also for phonological analysis and the understanding of variation (for additional uses and extensions see Jun 2005c). ToBI representations consist of a sound file, an associated spectrogram and a pitch track, and several tiers of annotation. The required tiers are the tonal tier (a representation of the tonal structure of the pitch contour) and the break index tier (in which the perceived strength of prosodic boundaries is annotated using numbers). In the MAE_ToBI system, [0] represents cohesion between items (such as flapping between words in American English), [1] represents the small degree of juncture expected between most words, [3] and [4] represent ip and IP boundaries respectively, and [2] is reserved for uncertainty (e.g. for cases where the annotator cannot find tonal cues for the presence of a phrasal boundary but does perceive a strong degree of juncture). A ToBI system may also include an orthographic tier and a miscellaneous tier for additional information, such as disfluencies. Brugos et al. (2018) suggest incorporating an ‘alternatives’ tier, which allows annotators to capture uncertainty in assigning a particular tonal category. The content of all tiers can be adapted to the prosodic features of the system under analysis, but also to particular research needs and theoretical positions of the developers. For instance, Korean ToBI (K_ToBI) includes both a phonological and a phonetic tier (Jun 2005b), while Greek ToBI (GR_ToBI) marks sandhi, which is widespread in Greek and thus of phonological interest (Arvaniti and Baltazani 2005). ToBI as a concept has often been misunderstood. Some have taken ToBI to be the equivalent of an IPA alphabet for intonation, a claim that the developers of ToBI have taken pains to refute (e.g. Beckman et al. 2005; Beckman and Venditti 2011). A ToBI annotation system presupposes and is meant to rely on a phonological analysis of the contrastive elements of the intonation and prosodic structure of the language or language variety in question. ToBI can, however, be used as an informal tool to kick-start such an analysis on the understanding that annotations will have to be revisited once the phonological analysis is complete (Jun and Fletcher 2014; Arvaniti 2016).
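Schematically, a ToBI annotation can be thought of as a set of time-aligned tiers over the same recording. The sketch below illustrates just that pairing of a tonal tier with a break-index tier; it is not the official ToBI file format, and all times and labels are invented.

```python
# Schematic sketch of ToBI-style tiers as time-aligned labels; this is
# not the official ToBI file format, just an illustration of the idea
# that an annotation pairs a tonal tier with a break-index tier.

annotation = {
    "words":  [(0.00, 0.10, "a"), (0.10, 0.55, "ballet"),
               (0.55, 1.40, "aficionado")],
    "tones":  [(0.30, "L*+H"), (1.20, "L-"), (1.40, "H%")],  # time-stamped events
    "breaks": [(0.10, 1), (0.55, 1), (1.40, 4)],             # 4 = IP boundary
}

def boundaries_with_index(annotation, index):
    """Find word boundaries annotated with a given break index."""
    return [t for t, bi in annotation["breaks"] if bi == index]

print(boundaries_with_index(annotation, 4))  # [1.4]
```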
6.5 Advantages over other models

AM offers a number of advantages, both theoretical and practical, relative to other models. A major feature that distinguishes AM is that as a phonological model it relies on the combined investigation of form and meaning, and the principled separation of phonological analysis and phonetic exponence. The former feature distinguishes AM from the Institute for Perception Research (IPO) model (’t Hart et al. 1990), which focuses on intonation patterns but strictly avoids the investigation of intonational meaning. The latter feature contrasts AM with systems developed for the modelling and analysis of f0—such as PENTA (e.g. Xu and Prom-on 2014), INTSINT (Hirst and Di Cristo 1998), or Fujisaki's model (e.g. Fujisaki 1983)—which do not provide an abstract phonological representation from which the contours they model are derived. As argued elsewhere in some detail (Arvaniti and Ladd 2009), the principled separation of phonetics and phonology in AM gives the theory predictive power and allows it to reach useful generalizations about intonation and its relation to the rest of prosody, while accounting for attested phonetic variation. In terms of phonetic implementation, the target-and-interpolation modelling of f0 allows for elegant and parsimonious analyses of complex f0 patterns, as shown by Pierrehumbert and Beckman (1988) for Japanese. AM can also accommodate non-linear interpolation, unlike the IPO model (’t Hart et al. 1990). In addition, although tonal crowding is extremely frequent cross-linguistically, AM is the only model of intonation that can successfully handle it and predict its outcomes (see e.g. Arvaniti and Ladd 2009 and Arvaniti and Ladd 2015 for comparisons of the treatment of tonal crowding in AM and PENTA; see also Xu et al. 2015). Further, by relying on the formal separation of metrical structure and the tonal string, AM has disentangled stress from intonation. This has been a significant development, in that the effects of stress and intonation on a number of acoustic parameters, particularly f0, have often been confused in the literature (see Gordon 2014 for a review and chapter 5). This confusion applies both to documenting the phonetics of stress and intonation, and to developing a better understanding of the role of intonation in focus and the encoding of information structure. Research within AM has shown that it is possible for words to provide new information in discourse without being accented, or to be accented without being discourse prominent (Prieto et al. 1995 on Spanish; Arvaniti and Baltazani 2005 on Greek; German et al. 2006, Beaver et al. 2007, and Calhoun 2010 on English; Arvaniti and Adamou 2011 on Romani; Chahal and Hellmuth 2014b on Egyptian Arabic). Finally, since AM reflects a general conceptualization of the relationship between tonal elements on the one hand and vowels and consonants on the other, it is sufficiently flexible to allow for the description of diverse prosodic systems—including systems that combine lexical and post-lexical uses of tone—and the development of related applications. In addition to the development of ToBI-based descriptive systems, as discussed in §6.4, such applications include modelling adult production and perception, developing automatic recognition and synthesis algorithms, and modelling child development, disorders, and variation across contexts, speakers, dialects, and languages (see e.g. Sproat 1998 on speech synthesis; Lowit and Kuschmann 2012 on intonation in motor speech disorders;
Thorson et al. 2014 on child development; Gravano et al. 2015 on prosodic entrainment and speaker engagement detection; Kainada and Lengeris 2015 on L2 intonation; Prom-on et al. 2016 on modelling intonation; see Cole and Shattuck-Hufnagel 2016 for a general discussion). In conclusion, AM is a flexible and adaptable theory that accounts for both tonal phonology and its relation to tonal phonetics across languages with different prosodic systems, and it can be a strong basis for developing a gamut of applications for linguistic description and speech technology.
Chapter 7

Prosodic Morphology

John J. McCarthy
7.1 Introduction

The phrase ‘prosodic morphology’ refers to a class of linguistic phenomena in which prosodic structure affects morphological form. In the Nicaraguan language Ulwa, for example, possessive morphemes are observed to occur after the main stress of the word, which always falls on one of the first two syllables in an iambic pattern, shown in (1) (Hale and Lacayo Blanco 1989; McCarthy and Prince 1990a).

(1) Ulwa possessives
    ˈsuːlu     ‘dog’       ˈsuː-ki-lu     ‘my dog’
                           ˈsuː-ma-lu     ‘your (sg.) dog’
                           ˈsuː-ka-lu     ‘his/her dog’
    ˈbas       ‘hair’      ˈbas-ka        ‘his/her hair’
    ˈasna      ‘clothes’   ˈas-ka-na      ‘his/her clothes’
    saˈna      ‘deer’      saˈna-ka       ‘his/her deer’
    siˈwanak   ‘root’      siˈwa-ka-nak   ‘his/her root’
    aˈrakbus   ‘gun’       aˈrak-ka-bus   ‘his/her gun’
In prosodic morphological terms, the possessive is suffixed to the main-stress metrical foot of the word: (siˈwa)-ka-nak. The possessive suffix subcategorizes for a prosodic constituent, the main-stress foot, rather than a morphological one, the stem. Ulwa is an example of infixation (§7.6), because the possessive suffix is internal to words with non-final stress. Other prosodic morphological phenomena to be discussed include reduplication (§7.3), root-and-pattern morphology (§7.4), and truncation (§7.5). First, though, a brief summary of the relevant assumptions about prosodic structure is necessary.
7.2 Prosodic structure

Word prosody is an area of lively research and consequent disagreement, but there are certain fairly standard assumptions that underlie much work on prosodic morphology (though for other views see Downing 2006: 35; Inkelas 1989a, 2014: 84). The constituents of word prosody
are the prosodic or phonological word (ω), the metrical or stress foot (Ft), the syllable (σ), and the mora (μ). The parsing of words into metrical feet is fundamental to most theories of word stress, with binary feet accounting for the typical rhythmic patterns of stress assignment: (ˌipe)(ˌcacu)(ˈana) (i.e. the English word ipecacuana). The mora is the unit of syllable weight. Generally, syllables ending in a short vowel (often referred to as CV syllables) are monomoraic and therefore light, while syllables ending in a long vowel, a diphthong, or a consonant (CVː, CVV, and CVC syllables) are bimoraic and therefore heavy. Some languages, called quantity-insensitive, do not make distinctions of syllable weight; in these languages, all syllables (or perhaps all CV and CVC syllables) are monomoraic (see chapter 5). These constituents are arranged into a prosodic hierarchy as in (2) (Selkirk 1981), in which every constituent of level n is obligatorily headed by a constituent of level n−1.

(2) Prosodic hierarchy
    ω
    |
    Ft
    |
    σ
    |
    μ

The head of a prosodic word is its main-stress foot, the head of a foot is the syllable that bears the stress, and the head of a syllable is the mora that contains the syllable nucleus. In addition to the headedness requirement, there are various principles of form that govern each level of the prosodic hierarchy. Of these, the one that is most important in studying prosodic morphology is foot binarity, the requirement that feet contain at least two syllables or morae. Many languages respect foot binarity absolutely; all languages, it would appear, avoid unary feet whenever it is possible to form a binary foot. Combining the headedness requirement of the prosodic hierarchy with foot binarity leads to the notion of a minimal word (Broselow 1982; McCarthy and Prince 1986/1996). If every word must contain some foot to serve as its head, and if every foot must contain at least two syllables or two morae, then the smallest or minimal word in a language that respects foot binarity will be a disyllable (if distinctions of syllable weight are not made) or a single heavy syllable (if distinctions of syllable weight are made). Thus, in the Australian language Diyari (Austin 1981; Poser 1989), which is quantity-insensitive, monosyllabic words are prohibited, while in Latin, which is quantity-sensitive, the smallest word is the smallest foot, a heavy monosyllable CVC, CVː, or CVV, as in (ˈsol) ‘sun’, (ˈmeː) ‘me’, or (ˈkui) ‘to whom’. (Though for other views of the minimal word = minimal foot equivalence see Hayes 1995: 86; Garrett 1999; Gordon 1999: 255.)
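The headedness-plus-binarity logic can be stated compactly. The following sketch is a toy illustration rather than a formal proposal: it encodes a word as a list of mora counts per syllable and checks whether the word is big enough to contain a binary foot.

```python
# Minimal sketch of the minimal-word consequence of foot binarity.
# Each syllable is represented by its mora count (1 = light, 2 = heavy).

def is_binary_foot(syllable_weights: list[int]) -> bool:
    """A foot is binary if it spans two syllables or two morae."""
    return len(syllable_weights) >= 2 or sum(syllable_weights) >= 2

def is_minimal_word(syllable_weights: list[int]) -> bool:
    """Every word must be headed by (at least) one binary foot."""
    return is_binary_foot(syllable_weights)

print(is_minimal_word([1, 1]))  # True: a disyllable, as in Diyari
print(is_minimal_word([2]))     # True: one heavy syllable, as in Latin sol
print(is_minimal_word([1]))     # False: a light monosyllable is too small
```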
7.3 Reduplication

Reduplicative morphology involves copying all or part of a word. From the standpoint of prosodic morphology, partial reduplication is more interesting because prosodic structure determines what is copied. The naive expectation is that reduplication identifies a prosodic
constituent in the stem and then copies it. This is not always the case, however, and is in fact atypical. Much more commonly, reduplication involves copying sufficient material to create a new prosodic constituent (Marantz 1982). The prosodic requirement is imposed on the copied material, not on the base from which it was copied. The example in (3) will clarify this important distinction. In the Philippine language Ilokano, the plural of nouns is formed by prefixing sufficient copied material to make a heavy syllable (Hayes and Abad 1989).

(3) Heavy syllable reduplication in Ilokano
    pusa      pus-pusa      ‘cat/pl.’
    kaldiŋ    kal-kaldiŋ    ‘goat/pl.’
    ʤanitor   ʤan-ʤanitor   ‘janitor/pl.’

A heavy syllable is the prosodic characterization of the prefixed reduplicative material: pus-, kal-, and ʤan- are all heavy syllables. It is clearly not the case, however, that a syllable, heavy or otherwise, is targeted in the stem and then copied. Although kal happens to be a syllable in the stem, pus and ʤan are not. Rather, these segmental sequences in the stem are split across two syllables: pu.sa, ʤa.ni.tor. Other examples are given in §25.3. The analysis of partial reduplication posits a special type of morpheme, called a prosodic template, that characterizes the shape of the reduplicated material. In the Ilokano example, this morpheme is a heavy syllable, [μμ]σ, that is devoid of segments. The heavy-syllable prefix borrows segments from the stem to which it is attached via a copying operation. The details of how copying is achieved are not directly relevant to the topic of this volume, but see Marantz (1982), McCarthy and Prince (1988, 1999), Steriade (1988), Raimy (2000), Inkelas and Zoll (2005: 25), and McCarthy et al. (2012) for various approaches. Like segmental morphemes, templatic reduplicative morphemes come in various forms. In addition to its heavy-syllable reduplicative prefix, Ilokano also has a light-syllable prefix with various meanings, shown in (4). When combined with the segmental prefix ʔagin-, it conveys the sense of pretending to do something.

(4) Light-syllable reduplication in Ilokano (Hayes and Abad 1989)
    ʤanitor   ‘janitor’    ʔagin-ʤa-ʤanitor   ‘pretend to be a janitor’
    trabaho   ‘to work’    ʔagin-tra-trabaho  ‘pretend to work’
    saŋit     ‘to cry’     ʔagin-sa-saŋit     ‘pretend to cry’

Observe that both simplex and complex onsets are copied: sa, tra. The light-syllable reduplicative template is satisfied by both CV and CCV, because onsets do not contribute to syllable weight. Whenever it is the case that the template does not limit copying, the segmental make-up of the base is duplicated exactly. Another reduplicative prosodic template, particularly common in the Australian and Austronesian languages, is the foot or minimal word. Recall that the minimal word in Diyari is a disyllabic foot. So is the reduplicative prefix (which has varied morphological functions), as shown in (5).
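A toy implementation helps make the point that the template constrains the copy rather than the base. The sketch below operates over flat segment strings with a crude vowel test (an assumption for illustration; real analyses operate on moraic and syllabic structure) and builds the Ilokano heavy- and light-syllable prefixes.

```python
# Toy sketch of templatic reduplication: the template constrains the
# *copy*, not the base. Assumes a flat segment string and a crude vowel
# test; real analyses use moraic/syllabic structure, not string slices.

VOWELS = set("aeiou")

def heavy_syllable_prefix(stem: str) -> str:
    """Copy enough material to build a heavy (bimoraic) syllable:
    the onset consonant(s), the first vowel, and one more segment."""
    i = 0
    while stem[i] not in VOWELS:   # copy the onset
        i += 1
    return stem[: i + 2]           # vowel plus one weight-bearing segment

def light_syllable_prefix(stem: str) -> str:
    """Copy the onset consonant(s) plus the first vowel (CV or CCV)."""
    i = 0
    while stem[i] not in VOWELS:
        i += 1
    return stem[: i + 1]

for stem in ["pusa", "kaldiŋ", "ʤanitor"]:
    print(heavy_syllable_prefix(stem) + "-" + stem)   # pus-pusa, kal-kaldiŋ, ʤan-ʤanitor
print("ʔagin-" + light_syllable_prefix("trabaho") + "-trabaho")  # ʔagin-tra-trabaho
```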
(5) Minimal word reduplication in Diyari (McCarthy and Prince 1994a)
    ˈwil̪a        ˈwil̪a-ˈwil̪a          ‘woman’
    ˈwakari      ˈwaka-ˈwakari        ‘to break’
    ˈnaŋkan̪t̪i    ˈnaŋka-ˈnaŋkan̪t̪i     ‘catfish’
    ˈtjilparku   ˈtjilpa-ˈtjilparku   ‘bird’

The reduplicative morpheme in Diyari is quite literally a prosodic word (Austin 1981): it has its own main stress impressionistically, its first syllable has segmental allophones that are diagnostic of main stress, and it must end in a vowel, like all other prosodic words of Diyari. Reduplicated words in Diyari are prosodically compound, consisting of a minimal prosodic word followed by one that is not necessarily minimal. Why is the reduplicative part minimal even though the stem part is not? In other words, how is Diyari's minimal word reduplication distinguished from total reduplication, like hypothetical ˈnaŋkan̪t̪i-ˈnaŋkan̪t̪i? McCarthy and Prince (1994a) argue that Diyari reduplication, and perhaps all forms of partial reduplication, are instances of what they call ‘emergence of the unmarked’. In Optimality Theory (OT), markedness constraints can be active even when they are ranked too low to compel violation of faithfulness constraints (Prince and Smolensky 1993/2004). Minimal word reduplication emerges when certain markedness constraints that are important in basic stress theory are active but dominated by faithfulness. Among these constraints is Parse-Syllable, which is violated by unfooted syllables (McCarthy and Prince 1993a). In a prosodic hierarchy of strict domination, there would be no Parse-Syllable violations, because every constituent of type n−1 would be immediately dominated by a constituent of type n. In OT, however, the force of Parse-Syllable is determined by its ranking. In Diyari, Parse-Syllable is ranked below the constraints requiring faithfulness to the underlying representation, so it is not able to force deletion of stem segments in an odd-numbered (and hence unfooted) final syllable. But Parse-Syllable is ranked above constraints requiring total copying of the stem into the reduplicative prefix (denoted here by the ad hoc constraint Copy). The effect of this ranking is shown somewhat informally in (6).

(6) Emergence of the unmarked
    Red-tjilparku                                  | Faith | Parse-Syll | Copy
    a. → [(ˈtjilpa)Ft]PWd-[(ˈtjilpar)Ftku]PWd      |       | *          | ***
    b.   [(ˈtjilpa)Ftku]PWd-[(ˈtjilpar)Ftku]PWd    |       | **         |
    c.   [(ˈtjilpa)Ft]PWd-[(ˈtjilpa)Ft]PWd         | **    |            |
The losing candidate in (6b) has copied the unfooted syllable ku, and necessarily left it unfooted, because Diyari does not permit monosyllabic feet. This candidate fails because
it has incurred two Parse-Syllable violations, while (6a) has only one. The losing candidate in (6c) has eliminated all Parse-Syllable violations by deleting ku from the stem, a fatal violation of faithfulness. The winning candidate in (6a) retains the stem's Parse-Syllable violation—unavoidable because of high-ranking faithfulness—but it avoids copying that violation, at the expense of only low-ranking Copy. The extent to which other reduplicative templates, like those of Ilokano, are reducible to emergence of the unmarked, like Diyari, is a topic of discussion. See, for example, Urbanczyk (2001), Blevins (2003), Kennedy (2008), and Haugen and Hicks Kennard (2011).
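The evaluation logic just described is mechanical enough to state in a few lines. The sketch below is a schematic illustration, not part of the chapter's analysis: the violation counts are entered by hand to mirror tableau (6) (the exact number of Copy violations depends on how that ad hoc constraint is counted), and the winner is the candidate whose violation vector is lexicographically smallest under the ranking.

```python
# Sketch of OT evaluation: candidates are compared constraint by
# constraint in ranking order. Violation counts are entered by hand
# from tableau (6) rather than computed from representations.

RANKING = ["Faith", "Parse-Syll", "Copy"]

candidates = {
    "a. tjilpa-tjilparku":   {"Faith": 0, "Parse-Syll": 1, "Copy": 3},
    "b. tjilpaku-tjilparku": {"Faith": 0, "Parse-Syll": 2, "Copy": 0},
    "c. tjilpa-tjilpa":      {"Faith": 2, "Parse-Syll": 0, "Copy": 0},
}

def optimal(candidates, ranking):
    """The winner has the fewest violations on the highest-ranked
    constraint that distinguishes the candidates (lexicographic order)."""
    return min(candidates, key=lambda c: [candidates[c][k] for k in ranking])

print(optimal(candidates, RANKING))  # a. tjilpa-tjilparku
```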
7.4 Root-and-pattern morphology

In root-and-pattern morphology, a prosodic template is the determinant of the form of an entire word, rather than just an affix, as is the case with reduplication. The prosodic template specifies the word pattern onto which segmental material (the root) is mapped (McCarthy 1981). A root-and-pattern system is arguably the fundamental organizing principle in the morphology of the Semitic languages (though see Watson 2006 for a review of divergent opinions). Some of the Classical Arabic prosodic templates are shown in (7).

(7) Classical Arabic prosodic templates based on the root ktb ‘write’
    Template   Word      Gloss               Function of template
    CaCaC      katab     ‘wrote’             basic verb form
    CaCːaC     kattab    ‘caused to write’   causative verb
    CaːCaC     kaːtab    ‘corresponded’      reciprocal verb
    CuCuC      kutub     ‘books’             plural
    CaːCiC     kaːtib    ‘writer’            agent
    maCCaC     maktab    ‘office’            place
    maCCuːC    maktuːb   ‘written’           passive participle

As usual in root-and-pattern systems, the effect of imposing a template is a fairly thorough remaking of a word's form, so it may initially seem unrecognizable. Observe, however, that the consonants of the root are constant throughout. The same can be found with other roots; the root ħkm, for example, can also be found in other words that deal with the general concept of ‘judgement’: ħakam ‘passed judgement’, ħakkam ‘chose as arbitrator’, ħaːkama ‘prosecuted’, ħakiːm ‘judicious’, ħaːkim ‘a judge’, maħkam-at ‘a court’, and so on. For further discussion, see McCarthy (1981, 1993) and McCarthy and Prince (1990a, 1990b).
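The root-to-template mapping can be sketched very directly if we idealize the template as a string of C slots interleaved with its vocalism, as below. This is a toy illustration of the general idea in McCarthy (1981), not his autosegmental formalization; gemination is kept as the length mark ː rather than consonant doubling.

```python
# Sketch of root-and-pattern mapping: root consonants fill the C slots
# of a template, and the template's own vowels (written directly into
# the pattern here) supply the rest.

def apply_template(root: str, template: str) -> str:
    """Replace each C in the template with the next root consonant."""
    consonants = iter(root)
    return "".join(next(consonants) if ch == "C" else ch for ch in template)

for template in ["CaCaC", "CaːCaC", "CuCuC", "CaːCiC", "maCCaC", "maCCuːC"]:
    print(template, "->", apply_template("ktb", template))
# CaCaC -> katab, CaːCaC -> kaːtab, CuCuC -> kutub,
# CaːCiC -> kaːtib, maCCaC -> maktab, maCCuːC -> maktuːb
```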
7.5 Truncation

In truncation, a portion of the stem is deleted to mark a morphological distinction (Doak 1990; Ito 1990; Mester 1990; Weeda 1992; Benua 1995; Féry 1997; Ito and Mester 1997; Bat-El 2002; Cohn 2005; Alber and Arndt-Lappe 2012). There are two ways in which prosodic structure
affects truncation: by specifying what remains or by specifying what is taken away. The former is a type of templatic morphology, closely resembling reduplicative and root-and-pattern morphology (§7.4). The latter is often referred to as subtractive morphology. In templatic truncation, a word is typically reduced to one or two syllables. This is particularly common in nicknames and terms of address, as in (8) and (9), though it can be found in other grammatical categories as well.

(8) Japanese templatic truncation (Poser 1984b, 1990; Mester 1990)
    Name      Truncated
    juːko     o-juː      Yuko
    ɾanko     o-ɾan      Ranko
    jukiko    o-juki     Yukiko
    midori    o-mido     Midori
    ʃinobu    o-ʃino     Shinobu

(9) Indonesian templatic truncation (Cohn 2005)
    Word      Truncated
    anak      nak        ‘child’
    bapak     pak        ‘father’
    Agus      Gus        personal name
    Lilik     Lik        personal name
    Glison    Son        personal name
    Mochtar   Tar        personal name

The analysis of templatic truncation is very similar to the analysis of reduplication in Diyari. The template is some version of the minimal word, a single foot. Mapping an existing word to this template shortens it to minimal size. Mapping can proceed from left to right, as in Japanese, or right to left, as in Indonesian. Mapping may also start with the stressed syllable, as in English Elizabeth/Liz, Alexander/Sandy, and Vanessa/Nessa. In subtractive truncation, the material with constant shape consists of what is removed rather than what remains. In Koasati, for example, there are processes of plural formation that truncate the final VC or Vː (10) or the final C (11) of the stem.

(10) Koasati VC subtractive truncation (Martin 1988)
    Singular        Plural
    pitaf-fi-n      pit-li-n      ‘slice up the middle’
    albitiː-li-n    albit-li-n    ‘to place on top of’
    akocofot-li-n   akocof-li-n   ‘to jump down’

(11) Koasati C subtractive truncation (Martin 1988)
    Singular       Plural
    bikot-li-n     bikoː-li-n    ‘to bend between the hands’
    asikop-li-n    asikoː-li-n   ‘to breathe’

What remains after truncation can be one, two, or three syllables long, depending on the length of the original stem. The constant, then, is that which is taken away rather than that which remains.
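Both directions of templatic mapping can be mimicked over toy representations. The sketch below assumes crude regular-expression parses that are adequate only for the items shown (an idealization; neither function comes from the literature): a left-to-right bimoraic mapping for the Japanese pattern and a right-to-left monosyllabic mapping for the Indonesian one.

```python
# Toy sketch of templatic truncation to a minimal-word-sized template.
# The parses below are rough idealizations, adequate only for these items.
import re

def first_two_morae(word: str) -> str:
    """Left-to-right mapping to a bimoraic template (Japanese style):
    a (C)V sequence is one mora; length (ː) or a coda nasal adds one."""
    morae = re.findall(r"n(?=[^aeiouː])|ː|[^aeiouː]*[aeiou]", word)
    return "".join(morae[:2])

def final_syllable(word: str) -> str:
    """Right-to-left mapping to a monosyllabic template (Indonesian
    style): keep the final onset + vowel + coda stretch."""
    return re.search(r"[^aeiou]?[aeiou][^aeiou]*$", word.lower()).group(0)

for name in ["juːko", "ɾanko", "jukiko", "midori", "ʃinobu"]:
    print("o-" + first_two_morae(name))   # o-juː, o-ɾan, o-juki, o-mido, o-ʃino
for word in ["anak", "bapak", "agus", "glison", "mochtar"]:
    print(final_syllable(word))           # nak, pak, gus, son, tar
```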
Subtractive truncation is not common, and often appears to be a historically secondary development in which erstwhile suffixes have been reanalysed as part of the base. As Alber and Arndt-Lappe (2012) note, there have been various efforts to analyse putative cases of subtractive truncation as something entirely different, such as phonological deletion. There have also been proposals within OT to introduce antifaithfulness constraints (i.e. constraints that require underlying and surface representations to differ from one another), and these constraints have been applied to the analysis of subtractive truncation (Horwood 1999; Alderete 2001a, 2001b; Bat-El 2002). There is a third type of truncation that cannot be readily classified as either templatic or subtractive. A class of vocatives in Southern Italian truncates everything after the stress, as illustrated in (12).

(12) Southern Italian vocatives (Maiden 1995)
    Word        Vocative
    avvoˈkatu   avvoˈka    ‘lawyer!’
    miˈkele     miˈke      ‘Michael!’
    doˈmeniko   doˈme      ‘Dominic!’

Similar phenomena can be found in other languages: Coeur d'Alene (Doak 1990; Thomason and Thomason 2004), English (Spradlin 2016), and Zuni (Newman 1965). The shape constant is that which remains after truncation—a word with final stress—but it is not a prosodic constituent such as a foot, because it is of arbitrary length. Phenomena such as this suggest that the identification of templates with prosodic constituents is insufficient. Generalized template theory (McCarthy and Prince 1993b, 1994b) allows templates to be defined by phonological constraints, much as we saw in (6). For further discussion, see Downing (2006), Flack (2007), Gouskova (2007), and Ito and Mester (1992/2003).
7.6 Infixation

Infixes are affixes that are positioned internal to the root. As the Ulwa example in (1) shows, infixes sometimes fall within the general scope of prosodic morphology because prosodic factors affect their position. In Ulwa, the possessive affixes subcategorize for the head foot of the word, to which they are suffixed. Expletive infixation in English is another example (McCarthy 1982). Expletive words, such as fuckin’ or bloody, can be inserted inside of other words, provided that they do not split metrical feet: (ˌabso)Ft-fuckin’-(ˈlutely)Ft, not *(ˌab-fuckin’-so)Ft(ˈlutely)Ft or *(ˌabso)Ft(ˈlute-fuckin’-ly)Ft. Prince and Smolensky (1993/2004) analyse Tagalog um-infixation, illustrated in (13), as prosodically conditioned. When a word begins with a single consonant, um is placed right after it. When a word begins with a consonant cluster, the infix variably falls within or after the cluster.
(13) Tagalog um-infixation
    sulat     s-um-ulat                   ‘to write’
    Ɂabot     Ɂ-um-abot                   ‘to reach for’
    gradwet   g-um-radwet ~ gr-um-adwet   ‘to graduate’
    preno     p-um-reno ~ pr-um-eno       ‘to brake’
Prince and Smolensky propose that infixed um is actually a prefix that is displaced from initial position because otherwise the word would begin with a vowel: *um-sulat. In OT's terms, um's prefixhood is determined by a ranked, violable constraint, Align-Left(um, word). This constraint is ranked below Onset, which requires syllables to begin with consonants. (For further details, see Klein 2002; Zuraw 2007.) McCarthy and Prince (1993b) discuss a case of reduplicative infixation in the Timugon Murut language of Borneo. In this language, a light-syllable reduplicative template is prefixed to words beginning with a consonant (14a), but it is infixed after the first syllable of a word beginning with a vowel (14b).

(14) Infixing reduplication in Timugon Murut (Prentice 1971)
    a. Copy initial CV
       bulud     ‘hill’       bu-bulud      ‘ridge’
       limo      ‘five’       li-limo       ‘about five’
    b. Skip initial V(C) and copy following CV
       ulampoj   (no gloss)   u-la-lampoj   (no gloss)
       abalan    ‘bathes’     a-ba-balan    ‘often bathes’
       ompodon   ‘flatter’    om-po-podon   ‘always flatter’

If the reduplicative prefix were not infixed, as in *u-ulampoj or *o-ompodon, the result would be adding an Onset violation. Abstractly, the analysis is the same as in Tagalog: Onset dominates the constraint requiring left-edge alignment of the reduplicative prefix. For further examples of prosodic morphology, see chapter 25, and for a comprehensive review of infixation phenomena and another point of view, see Yu (2007).
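The prosodic conditioning of um can be mimicked with a small function. The sketch below is a toy rendering of the Onset-over-alignment idea applied to the Tagalog data, not Prince and Smolensky's constraint-based analysis itself; it assumes a simple five-vowel test and returns both attested placements for cluster-initial words.

```python
# Toy sketch of prosodically conditioned infixation: -um- lodges after
# (part of) the initial onset so that the output syllable has an onset,
# mirroring the Onset >> Align-Left ranking described above.

VOWELS = set("aeiou")

def um_infix(word: str) -> list[str]:
    """Return the attested placements of -um- for a consonant-initial word."""
    onset_end = 0
    while word[onset_end] not in VOWELS:   # measure the initial onset
        onset_end += 1
    if onset_end <= 1:                      # single-consonant onset
        return [word[:1] + "-um-" + word[1:]]
    # cluster-initial: the infix may fall within or after the cluster
    return [word[:1] + "-um-" + word[1:],
            word[:onset_end] + "-um-" + word[onset_end:]]

print(um_infix("sulat"))    # ['s-um-ulat']
print(um_infix("Ɂabot"))    # ['Ɂ-um-abot'] (glottal stop counts as a consonant)
print(um_infix("gradwet"))  # ['g-um-radwet', 'gr-um-adwet']
print(um_infix("preno"))    # ['p-um-reno', 'pr-um-eno']
```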
7.7 Summary

This brief overview of prosodic morphology has introduced the principal phenomena—reduplication, root-and-pattern morphology, truncation, and infixation—and some of the proposals for how they should be analysed—prosodic templates and constraint interaction.
Chapter 8

Sign Language Prosody

Wendy Sandler, Diane Lillo-Martin, Svetlana Dachkovsky, and Ronice Müller de Quadros
8.1 The visible organization of sign languages

The first thing that the naive observer notices when deaf people communicate in sign language is rapid and precise motion of the hands. The second thing that strikes the observer is salient expressions on the face and motions of the head. And indeed, all of these channels are put to critical use in the organization of sign languages. But what is the division of labour between them? In the early years of sign language research, the central goal was to understand the structure of words and the syntax of their arrangement. This meant that attention was on the hands, which convey words (Battison 1978; Bellugi and Klima 1979; Stokoe 1960). Before long, however, researchers began to set their sights beyond the hands and to observe the face. Baker and Padden (1978) observed, for example, that blinks systematically marked phrasal boundaries in American Sign Language (ASL), and Liddell (1978, 1980) showed that systematic patterns of facial expression and head position characterize yes/no and content questions, as well as relative clauses and other complex structures in ASL. Liddell attributed these non-manual patterns to syntax, claiming that syntactic markers in ASL are non-manual. An advantage of this approach was that it was able to show that there are indeed complex sentences in ASL. This explicitly syntactic view of non-manual articulations established a tradition in sign language research (Wilbur and Patschke 1999; Neidle et al. 2000; Cecchetto et al. 2009). Other studies of complex sentences in ASL and in Israeli Sign Language (ISL), such as conditionals and relative clauses, likened the behaviour of facial expression and head position in these sentences to intonation (Reilly et al. 1990; Nespor and Sandler 1999; Wilbur 2000; Sandler and Lillo-Martin 2006; Dachkovsky and Sandler 2009; Dachkovsky et al. 2013). We adopt the intonation position here, because of the linguistic function and formal patterning
of such expressions, and support the existence of a prosodic component in the architecture of language structure more broadly. We review research showing that facial expression and head position correspond to intonation within a prosodic system that also includes (and is aligned with) timing and prominence, which in turn are signalled by the hands (Nespor and Sandler 1999). The existence of a prosodic component in the linguistic architecture is supported by evidence for units of timing, prominence, and intonational meaning that cannot be explained on the basis of other levels of structure, such as the lexical or syntactic levels. Like spoken languages, sign languages are characterized by the following hierarchy of prosodic constituents: phonological utterance > intonational phrase > phonological phrase > prosodic word > foot > syllable (Sandler 2010). §8.2.1 sets the stage with a brief overview of the syllable and the prosodic word, and we then ascend to higher levels of prosodic structure—intonational (§8.2.2) and phonological (§8.2.3) phrases. The nature and linguistic identity of non-manual elements—primarily facial expressions that are found at the higher levels of structure—are controversial, and we continue in §8.3 with an overview of the articulations that are at issue, and their coordination with manually conveyed signs. We support the claim that these signals are explicitly intonational, showing that information structure accounts for their occurrence and distribution. A discussion of three main categories of information structure—topic/comment, given/new, and focus/background—and their expression in sign languages follows in §8.4. These sections are based primarily on research on ASL and ISL, two well-studied but unrelated sign languages. To address the issue of the architecture of the grammar—specifically, to what extent the markers at issue belong to the syntactic or the prosodic component—we turn our attention in §8.5 to yes/no and wh-questions, around which the debate has centred, and include evidence from Brazilian Sign Language (Libras) as well. §8.6 is a summary and conclusion.
8.2 Prosodic constituency in signed languages

8.2.1 The syllable and the prosodic word

A well-formed sign in a sign language must include movement of the hand(s). This movement consists of (1) a movement path from one location to another, (2) hand-internal movement (by change of orientation or of finger position), or (3) both path and internal movement together. Figure 8.1 shows a sign with a complex syllable nucleus of this type. Coulter (1982) was the first to propose that ASL is a monosyllabic language. The idea is that any type of movement constitutes a syllable nucleus, and that the vast majority of signs have only one nucleus—one syllable. Many studies have attributed visual sonority to the movement component of signs (e.g. Sandler 1989, 1993, 1999; Brentari 1990, 1993, 1998; Perlmutter 1991;¹ Wilbur 1993).
¹ Perlmutter (1993) proposed that ASL also has a moraic level.
Figure 8.1 The monosyllabic sign SEND in ISL. The dominant hand moves in a path from the chest outward, and the fingers simultaneously change position from closed to open. The two simultaneous movements constitute a complex syllable nucleus.
Sign languages are known to be morphologically complex. How, then, is monosyllabicity preserved under morphological operations such as inflection and compounding? First, morphological complexity in both ASL and ISL is typically nonconcatenative (Sandler 1990), and thus it does not affect the monosyllabic structure of the base sign. Compounding—in which two morphosyntactic words combine to form a new word—is common in sign languages. While novel compounds can occur freely, resulting in disyllabic words (Brentari 1998), lexicalized compounds often reduce to one syllable in ASL (Liddell and Johnson 1986; Sandler 1989) and in ISL (Sandler 2012). Reduplicative processes in ASL copy only the final syllable of compounds (or the only syllable if they are reduced), providing support for the syllable unit in that language (Sandler 1989, 2017). It is not only concatenation of words in compounds that can reduce two morphosyntactic words to a single syllable. Cliticization of pronouns to hosts can do the same. Such phenomena suggest a broader generalization, compatible with Coulter's insight: the optimal ‘prosodic word’ in sign languages is monosyllabic (Sandler 1999).² In both reduced compounds and cliticized pronouns, two morphosyntactic words constitute a single prosodic word, much like cliticized and contracted forms in spoken languages, such as Sally's an unusual person, or I'm going to Greece tomorrow. The research sketched above demonstrates that prosodic and morphosyntactic constituents are not isomorphic, and we will see more evidence for this claim below. Our focus in the rest of this chapter is on higher prosodic levels: the intonational phrase (IP) and the phonological phrase (PP).

² Brentari (1998) does not dispute that most ASL signs are monosyllabic, but proposes that the maximal prosodic word in ASL is disyllabic.
8.2.2 Intonational phrases

Let's consider the sentence in Figure 8.2, from ISL, ‘The cake that I baked is tasty’.
Figure 8.2 ISL complex sentence, ‘The cake that I baked is tasty’, glossed: [[CAKE IX]PP [I BAKE]PP]IP [[TASTY]PP]IP. ‘IX’ stands for an indexical pointing sign.
The sentence consists of two intonational phrases, indicated by IPs after the relevant brackets in the caption of Figure 8.2. The first IP consists of two phonological phrases, indicated by PPs—[CAKE IX] and [I BAKE]—and the second consists of one phonological phrase, [TASTY]. We will deal with PPs in the next section. As for IPs, according to Nespor and Sandler (1999) and subsequent research on ISL, the final manual sign at the IP boundary is marked in one of three ways: larger sign, longer duration, or repetition. In this particular sentence, the IP-final signs BAKE and TASTY have more repetitions than the citation forms. Non-manual markers are equally important in characterizing IPs. The non-manual markers of facial expression and head position align temporally with the manual markers, and all markers change across the board between IPs. In the sentence in Figure 8.2, the entire first IP is marked by raised eyebrows, squint, and forward head movement.³ The form and functions of these non-manual elements will be further discussed in §8.3 and §8.4.

³ Certain head movements, such as side-to-side headshake for negation, may well directly perform syntactic functions and indicate their scope in some sign languages. See Pfau and Quer (2007) for an in-depth comparison of negation devices in Catalan Sign Language and German Sign Language.
8.2.3 Phonological phrases

Spoken language researchers have found evidence for the existence of prosodic phrases below the level of the IP, called phonological phrases (Nespor and Vogel 2007) or intermediate phrases (Beckman and Pierrehumbert 1986). Two kinds of evidence support the existence of this prosodic constituent in ISL. The first type is articulatory. Manual markers of PP boundaries—increased size or duration, or repetition—are similar to but more subtle than those marking IP boundaries. As for face/head intonation, there can be either a relaxation of non-manual markers at PP boundaries or a partial change. For example, in Figure 8.2, the final sign at the first PP boundary, IX, is repeated once quickly with minimum displacement, and the facial expression (here, eye gaze) and head position are slightly relaxed. The second kind of evidence, found for ISL in the Nespor and Sandler (1999) study, is a phonological process that occurs within but not across prosodic phrases. The non-dominant hand, which enters the signing space in the two-handed sign BAKE, spreads regressively to the left edge of the PP, as shown with circles in Figure 8.2. The shape and location of this hand in the sign BAKE are already present during the preceding, one-handed sign, I.⁴ In a study of a narrative in ASL, Brentari and Crossley (2002) support the co-occurrence of non-manual and manual signals with different prosodic domains, including non-dominant hand-spreading behaviour. The spread of the non-dominant hand in ISL is formally similar to external sandhi processes in spoken language that are also bounded by the PP, such as French liaison, and ‘raddoppiamento sintattico’, or syntactic doubling, in Italian (Nespor and Vogel 2007). In §8.3, we describe the form and function of intonational signals. We then provide a principled information-structure-based analysis of the role of these signals in §8.4.
8.3 Defining properties of sign language intonation

The non-manual intonational signals under discussion are not articulatorily similar to those of spoken language; however, they share important structural similarities. The current section will overview similarities and differences between spoken and signed language intonation both in terms of its structural properties and in terms of its meaning and usage. First, as the previous section demonstrated, facial intonation and head movements are temporally aligned with the timing of the words on the hands according to prosodic phrasing. By virtue of this neat alignment, intonational ‘tunes’ reinforce the prosodic constituency of signed utterances. An informative overview of non-manual signals and discussion of their prosodic and syntactic roles across sign languages is found in Pfau and Quer (2010).
⁴ The position of the non-dominant hand in the preceding PP, [CAKE IX], is the rest position for this signer.
Figure 8.3 Linguistic facial expressions for three types of constituent in ISL. (a) Yes/no questions are characterized by raised brows and head forward and down; (b) wh-questions are characterized by furrowed brow and head forward; and (c) squint signals retrieval of information shared between signer and addressee. These linguistic face and head positions are strictly aligned temporally with the signing hands across each prosodic constituent.
However, the realizations of compositionality differ in signed and spoken languages. In the former, the visual modality allows non-manual components to be simultaneously superimposed on one another, rather than sequentially concatenated like tones in spoken language melodies. Whereas Figure 8.3a illustrates the raised brows of yes/no
questions and Figure 8.3c shows the squint of shared information retrieval, a question such as ‘Did you rent the apartment we saw last week?’ is characterized by the raised brows of yes/no questions together with the shared information squint, as shown in Figure 8.4.
Figure 8.4 Simultaneous compositionality of intonation in ISL: raised brows of yes/no questions and squint of shared information, e.g. ‘Did you rent the apartment we saw last week?’.
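The superimposition described above lends itself to a simple formal picture. The sketch below is a toy model, not anything proposed by the authors: it treats each intonational ‘morpheme’ as a bundle of face/head features and combines co-occurring meanings by set union. The feature labels are informal stand-ins, not FACS action units.

```python
# Schematic sketch of simultaneous compositionality: each intonational
# 'morpheme' is a bundle of facial/head features, and co-occurring
# meanings superimpose by set union. Labels are informal, not FACS codes.

YES_NO      = frozenset({"brows_raised", "head_forward_down"})
SHARED_INFO = frozenset({"squint"})
WH          = frozenset({"brows_furrowed", "head_forward"})

def superimpose(*signals):
    """Combine simultaneous non-manual signals over one constituent."""
    combined = set()
    for s in signals:
        combined |= s
    return combined

# 'Did you rent the apartment we saw last week?' (cf. Figure 8.4):
print(superimpose(YES_NO, SHARED_INFO))
# -> brows_raised, head_forward_down, squint (set order may vary)
```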
Another characteristic property of intonation is its intricate interaction with non-linguistic, affective expressions. Although this area is under-investigated in sign languages, here we also find similarities and differences with spoken languages. In spoken languages the intonational expression of emotion (e.g. anger, surprise, happiness) is realized through gradient features of pitch range and register rather than through specific intonational contours (Ladd 1996; Chen et al. 2004a). In contrast, in the visual modality both linguistic and emotional functions are expressed through constellations of the same types of visual signal (facial expressions and head/torso movements) that characterize linguistic intonation (Baker-Shenk 1983; Dachkovsky 2005). This results in different patterns of interaction between the two. For example, Weast (2008) presents the first quantitative analysis of eyebrow actions in a study of six native Deaf participants producing yes/no questions, wh-questions, and statements, each in neutral, happy, sad, surprise, and angry states. The findings demonstrate that ASL maintains linguistic distinctions between questions and statements through eyebrow height regardless of emotional state, as shown in Figure 8.5a, where the emotional expression of disgust co-occurs with the raised brows of yes/no questions. On the other hand, De Vos et al. (2009) show on the basis of Sign Language of the Netherlands (NGT) that conflicting linguistic and emotional brow patterns can either be blended (as in Figure 8.5a) or supersede each other, as in Figure 8.5b, where linguistic lowered brows of content questions are superseded by raised brows of surprise (see similar examples in Libras in Table 8.1). Which factors determine the specific type of interaction between linguistic and emotional displays within a sign language is a question that requires further investigation.
Figure 8.5 Overriding linguistic intonation with affective intonation: (a) yes/no question, ‘Did he eat a bug?!’ with affective facial expression conveying fear/revulsion, instead of the neutral linguistic yes/no facial expression shown in Figure 8.3a. (b) wh-question, ‘Who gave you that Mercedes Benz as a gift?!’ Here, affective facial expression conveying amazement overrides the neutral linguistic wh-question expression shown in Figure 8.3b.
8.4 Intonation and information structure

One of the main functions of intonation is the realization of information structure (Gussenhoven 1983b, 2004; Ladd 1996; House 2006; see also chapter 31). Since sign languages are generally ‘discourse oriented’ (e.g. Friedman 1976; Brennan and Turner 1994), intonational signals of different aspects of information structure play an important role in the signed modality. However, after decades of research, there is still little consensus with regard to the basic terms and categories of information structure, or to how they interact with each other. We rely on the model of information structure presented by Krifka (2008) and Gundel and Fretheim (2008). The primary categories of information structure discussed by the authors are topic/comment, given/new information, and background/focus. Although these notions partially overlap (Vallduví and Engdahl 1996), they are independent of each other. For instance, as exemplified by Krifka (2008) and Féry and Krifka (2008), focus/background and givenness/newness cannot be reduced to just one opposition, because given expressions, such as pronouns, can be focused. Also, we do not claim that the notions of topic, givenness, and focus exhaust all that there is to say about information structure, and other effects connected to information flow relate to broader discourse structure (Chafe 1994). Here we will stay within the confines of the sentence (in a particular context), and we will illustrate some of the ways in which the information structure notions specified above are expressed intonationally in sign languages.
8.4.1 Topic/comment

One category of information structure is the opposition between topic and comment. This opposition involves a partition of the semantic/conceptual representation of a sentence into two complementary parts: identification of the topic, and providing information about it in a comment (Gundel and Fretheim 2008; Krifka 2008). A common assumption is that particular accent types or intonational patterns mark utterance topics in spoken languages (e.g. Jackendoff 1972; Ladd 1980; Gussenhoven 1983b; Vallduví and Engdahl 1996). However, there are very few empirical studies on the issue, and the results that have been offered involve different languages (e.g. English in Hedberg and Sosa 2008, and German in Braun 2006), making it difficult to arrive at generalizations. The variability of topic intonation might reflect the fact that topics vary on syntactic, semantic, and pragmatic grounds. In the following discussion, ‘topic’ refers to sentence topic, and not necessarily to topicalization or movement of a constituent. The visual nature of non-manual intonational signals and their articulatory independence from one another can be an advantage for researchers by providing a clearer form–meaning mapping of the intonational components. Several studies have demonstrated that topic–comment is a common organizing principle for sentences in sign languages (Fischer 1975 and Aarons 1994 for ASL; Crasborn et al. 2009 for NGT; Kimmelman 2012 for Russian Sign Language; Rosenstein 2001 for ISL; Sze 2009 for Hong Kong Sign Language), and that topics are typically marked with particular facial expressions and head positions, which, we argue, are comparable to prosodic marking in spoken languages for this information structuring role. Kimmelman and Pfau (2016) present a comprehensive overview of information structure and of the topic/comment distinction in sign languages. Early work on topic marking in ASL identified specific cues for topics (Fischer 1975; Friedman 1976; Ingram 1978; Liddell 1980), such as raised brows and raised/retracted head position. Topic–comment constructions, set off by specific and systematic non-manual marking, have also been reported in many other sign languages, such as Swedish Sign Language (Bergman 1984), British Sign Language (Deuchar 1983), Danish Sign Language (Engberg-Pedersen 1990), Finnish Sign Language (Jantunen 2007), NGT (Coerts 1992; Crasborn et al. 2009), and ISL (Meir and Sandler 2008). The most frequent non-manual marker of topics reported cross-linguistically is raised eyebrows. However, a comparative study of prosody in ASL and ISL, two unrelated sign languages, demonstrates language-particular differences in the marking and systematicity of information structure, which one would expect in natural languages (Dachkovsky et al. 2013). Comparison of the same sentences in the two languages across several signers revealed that topics are marked differently in ISL and ASL. ISL topics are marked by head movement that starts in a neutral or slightly head-up position and gradually moves forward, and often by squint (shared information) (see Figure 8.6a). In ASL, a static head-up position, together with raised brows (Figure 8.6b), usually marks the scope of the entire topic, as Liddell (1980) originally claimed. These findings highlight the importance of head position as an intonational component in sign language grammar, and show that topics are characterized by different head positions that can vary from language to language.
They demonstrate that information structure is systematically marked linguistically in sign languages generally, but that the specific marking is language specific and not universal, as is also the case in spoken languages.
Figure 8.6 Intonational marking of topics in (a) ISL and (b) ASL.
Similarly, this research demonstrated that the facial cues accompanying topics are also variable, both within and across sign languages. Variability in topic marking has surfaced in more recent studies on other sign languages as well (e.g. Hong Kong Sign Language; Sze 2009). Sze (2009) found that topics in Hong Kong Sign Language are not marked consistently by one particular non-manual cue, but rather by a variety of signals, and sometimes even by manual prosodic cues alone. The reason for this variability might lie in the very nature of topics as discourse entities: they can be linked to the preceding discourse in various ways and may include different types of information—that is, information that is more accessible or less accessible to the interlocutors (e.g. Krifka 2008). This brings us to another dimension of information structure—the given/new distinction.
8.4.2 Given/new information

Referential givenness/newness is defined as indicating whether, and to what degree, the denotation of an expression is present in the interlocutors’ common ground (Gundel and Fretheim 2008: 176; Krifka 2008). On the assumption that different types of mental effort or ‘cost’ are involved in the processing of referents, the information structure literature distinguishes a scale of information states, ranging from active (or given/highly accessible) to inactive (or new/inaccessible) (Chafe 1974; Ariel 1991; Lambrecht 1996). The general pattern that has emerged from the spoken language literature is that referents with a lower degree of activation, or accessibility, tend to be encoded with greater intonational prominence and/or with particular accent types, although these patterns very much depend on the language (e.g. Baumann and Grice 2006; Chen et al. 2007; Umbach 2001).

The given/new category has received much less attention in the sign language literature in comparison with the topic/comment category. Engberg-Pedersen (1990) demonstrated that squint in Danish Sign Language serves as an instruction to the addressee to retrieve information that is not given in the discourse, and might be accessible from prior knowledge or shared background. On the basis of a fine-grained analysis, Dachkovsky (2005) and Dachkovsky and Sandler (2009) argued for a comparable function of squint in ISL. Similar conclusions related to the function of squint as a marker of information with low accessibility have been reported for German Sign Language (Herrmann and Steinbach 2013; Herrmann 2015).
The study by Dachkovsky et al. (2013) mentioned earlier investigates the interaction between two categories of information structure—given/new and topics—in ISL and ASL. It demonstrates essential differences with regard to the intonational marking of low referent accessibility in topics in the two languages. First of all, coding of non-manual signals using the Facial Action Coding System (Ekman et al. 2002) reveals a cross-linguistic phonetic difference in the articulation of the low-accessibility signal, squint. To be precise, ‘squint’ is achieved by different muscle actions, or action units (AUs), in the two sign languages, in both cases serving to narrow the eye opening and to emphasize the ‘nasolabial triangle’ between nose and mouth. While in ISL the effect is produced by lower-lid tightening (AU7 in Figure 8.7) and deepening of the nasolabial furrow (AU11), in ASL, a similar appearance is achieved by raising the cheeks (AU6) and raising the upper lip (AU10), as shown in Figure 8.7.
Figure 8.7 Different phonetic realizations of the low accessibility marker, squint, in (a) ISL and (b) ASL.
The differences between ISL and ASL information status marking pertain not only to formal properties but also to the functional properties of information status. Dachkovsky et al. (2013) suggest that the ISL intonational system is more sensitive than that of ASL to the accessibility status of a constituent. Specifically, ASL tends to reserve squint for topic constituents with very low accessibility only, according to the motivated accessibility ranking of Ariel (1991), high > mid > low. Most other topics are marked by raised brows. On the other hand, the occurrence of squint in ISL topics is broader—it systematically co-occurs with mid- as well as low-accessibility topic constituents. These findings show that syntax does not determine the non-manual marking of information structure. Specifically, the presence of squint in ISL and ASL topics is related to different degrees of sensitivity to pragmatic considerations—the degree of accessibility—regardless of their syntactic role (i.e. whether they are adverbial phrases, object noun phrases, or subject noun phrases).
8.4.3 Focus/background

The third crucial category in the organization of information is focus/background. A focused part of an utterance is usually defined through the presence of alternatives that are relevant for the interpretation of the utterance (e.g. Rooth 1992; Krifka 2008). Information focus can be conceived of as the part of a sentence that answers a content question. Here we can also observe how the focus/background distinction interacts with other information structure categories, such as the given/new distinction. Specifically, contrastive and emphatic foci are used to negate or affirm information that was previously given (mentioned) in the discourse. In spoken languages, a focused constituent often receives prosodic prominence and particular types of accent (Ladd 1996; Gussenhoven 2004).

One study that investigated prosodic distinctions between different types of foci was conducted on NGT (Crasborn et al. 2009). The authors find that focused constituents are generally characterized by a range of non-manual signals, none of which is exclusive or obligatory. The markers of focus include intense gaze at the addressee, intensified mouthing, and enhanced manual characteristics of signs, such as size and duration. The study demonstrates that, although the prosodic marking of focus is sensitive to various semantic–pragmatic considerations and syntactic factors, the relation between these and prosody is not systematic.

Kimmelman’s (2014) study demonstrates that Russian Sign Language is very different from NGT: it hardly employs any non-manuals to mark either type of focus. The most common signals of focus in Russian Sign Language are manual cues, such as holds or size and speed modifications of a sign, along with some syntactic strategies, such as doubling and ellipsis. In German Sign Language, only contrastive focus seems to be systematically marked. Distinguishing information focus (i.e. new information) from contrastive focus, Waleschkowski (2009) and Herrmann (2015) demonstrate that, whereas the marking of information focus is not obligatory in German Sign Language, contrastive focus is consistently marked by specific manual and non-manual marking (focus particles)—mostly by head nods.

ASL seems to have more regular non-manual patterns characterizing different types of focus, with contrastive foci being distinguished by opposite body leans (Wilbur and Patschke 1998). Schlenker et al. (2016; see also Gökgöz et al. 2016) complement Wilbur and Patschke’s (1998) findings on ASL by comparing them to a similar data set in LSF (French Sign Language), and show that prosodic modulations of signs and non-manuals (brow raise and head movements) suffice to convey focus, with diverse semantic effects, ranging from contrastive to exhaustive, as in spoken language.

Distinct non-manual profiles were also observed for focus in Libras in a comparative study with ASL. Lillo-Martin and Quadros (2008) discuss three types of focus: information (non-contrastive) focus, contrastive focus, and emphatic focus (which like contrastive focus can be used to negate previous information, but unlike contrastive focus can also be used to affirm; Zubizarreta 1998). They report that different non-manual markers are associated with information focus (raised brows and head tilted back) and contrastive focus (brows that are furrowed with their inner parts raised, and head tilted to the side), as illustrated in Figure 8.8.
Figure 8.8 Non-manual markers in Libras accompanying (a) information focus (raised brows and head tilted back) and (b) contrastive focus (raised and furrowed brows, and head tilted to the side).
As Lillo-Martin and Quadros (2008) report, both information focus and contrastive focus can occur with elements that are at the beginning of a sentence or in sentence-internal position, as illustrated in examples (1) (information focus) and (2) (contrastive focus).5 These examples are grammatical in both ASL and Libras. The fact that the same non-manual marking is associated with elements in different syntactic positions in (1) can be seen as evidence in favour of the view that these elements are determined by pragmatic rather than strictly syntactic factors.

(1) Information focus
          wh
    S1:   WHAT YOU READ
          ‘What did you read?’
          I-focus
    S2:   BOOK CHOMSKY I READ
          I-focus
    S2:   I READ BOOK CHOMSKY
          ‘I read Chomsky’s book’

(2) Contrastive focus
          y/n
    S1:   YOU READ STOKOE BOOK
          ‘Did you read Stokoe’s book?’
          C-focus
    S2:   NO, BOOK CHOMSKY I READ
          ‘No, I read Chomsky’s book.’
5 Throughout this paper, signs are indicated by translation equivalent glosses in upper case. Here, PU indicates a sign made with palm up; this sign is used in ASL for general wh-questions as well as other functions. For a recent review and analysis of ‘palm up’ in gesture and in sign languages, see Cooperrider et al. (2018) and references cited therein. An abbreviation on a line above glosses indicates a non-manual marker co-occurring with the indicated signs, as follows:
    wh: wh-question
    y/n: yes/no question
    I-focus: information focus
    C-focus: contrastive focus
    hn: head nod
Unlike information focus and contrastive focus, in Libras and ASL emphatic focus is not associated with a particular type of non-manual marking but with a stressed version of the marking that would go with the non-emphatic reading of the sentence. Emphatic elements are generally expressed in the sentence-final position or in doubling constructions, as illustrated in (3), which again represents sentences that are acceptable in both languages, where the default position of modals such as CAN is pre-verbal.

(3)
          hn
    a.    JOHN CAN READ CAN
          hn
    b.    JOHN CAN READ CAN
          ‘John really CAN read.’
Even though the examples in (3) show isomorphism between the prosody and the syntactic structure, it is possible for this relationship to be broken. For example, in Libras, there is the possibility of having an extension of the head nod accompanying the emphatic focus element after the end of the manual sentence is produced, as shown in (4).

(4)
                        hn
    JOHN CAN READ CAN
In sum, we have seen that information structure is signalled by facial expression and head position in sign languages, performing much the same role as intonation in spoken languages. Along all familiar parameters of information structure—topic/comment, given/new, and focus/background—the literature shows non-trivial similarities between the distribution of these non-manual articulations in sign language and intonation in spoken language. While these information structure categories also often have syntactic characteristics, the relationship between prosody and syntax is indirect. Nevertheless, there is disagreement on this issue, with some researchers taking the position that familiar non-manual markers are associated directly with the syntactic component of the grammar, and, as such, reveal underlying syntactic structure.

We now address the dispute regarding whether observed non-manual markers are best understood as more closely related to the prosodic or the syntactic component of the grammar by turning our attention to illocutionary force—specifically, to interrogatives. In spoken languages, questions may be syntactically marked, and they are also apparently universally marked by intonation. What is the case in sign languages?
8.5 Prosody versus syntax: evidence from wh-questions

§8.3 showed that a particular set of non-manual markers accompanies questions in ISL. In fact, these markers are found in many sign languages, including ASL and Libras. One of the earliest descriptions of these markers, by Liddell (1980), proposed that the spread of the non-manual is determined by the structural configuration known as ‘command’ (a precursor to the contemporary notion of c-command), as illustrated by the tree in (5).
(5)
          S
         / \
        A   Y
            |
            B
If syntactic elements are located in a tree structure such as the one in (5), element A commands element B if the S node dominating A also dominates B (Y indicates a node between A and B). Liddell’s argument is that the non-manual markers are represented in tree structures and that their spread is determined by syntactic relationships. The analysis of non-manual marking for questions as determined by syntactic structure has been maintained by numerous scholars, including Petronio and Lillo-Martin (1997) and Neidle et al. (2000) for ASL, and Cecchetto et al. (2009) for LIS (Italian Sign Language). These researchers have used the spread of non-manuals, especially those associated with wh-questions, to help determine the syntactic structure involved. Cecchetto et al. furthermore claimed that in Italian Sign Language, the fact that wh-question scope marking is indicated by non-manuals allows for the default linear order of syntactic constituents typically found in spoken languages to be overridden specifically for sign languages.

However, as we will now show, the assumed relationship between non-manual marking of questions and syntactic structures can be violated in numerous ways (see Sandler and Lillo-Martin 2006 for evidence from ISL). This lack of a direct correspondence calls into question a syntactic analysis of non-manual marking, and, even more so, the use of non-manual marking spread to make inferences about syntactic structures (see Sandler 2010). Instead, in line with the conclusion put forward in the previous section, we take this common non-isomorphism to show that (at least in some sign languages) question non-manual marking behaves like intonation. It conveys pragmatic (illocutionary) information whose spread is determined by semantic–pragmatic organization. The constituents organized in this way are often correlated with syntactic phrasing and categories, but in many cases they are not (see Nespor and Sandler 1999; Nespor and Vogel 2007). For an approach that aims to reconcile syntactic distribution with prosodic spreading of non-manuals in numerous structures, see Pfau (2016).

In the following, the ‘wh’ label above text corresponds to the standard non-manual marker of wh-questions in ASL. In ASL, wh-questions are characterized by furrowed brow and forward head position, as illustrated in Table 8.1a (Liddell 1980). In a study that elicited wh-questions from six ASL signers, this facial expression and head position characterized 100% and 65% of wh-questions respectively (Dachkovsky et al. 2013); they are thus a systematic part of the linguistic system.

The first case to consider is indirect questions. If the spread of the wh-question non-manual marker is determined by the scope of the [+wh] element, as is often assumed, we should see the pattern illustrated in (6).6 This pattern should be found regardless of the matrix verb.7

6 Of course, syntactic accounts could make different predictions, as in the one proposed by Petronio and Lillo-Martin (1997). Here we simply address the most common syntactic proposal associating the non-manual marking with the [+wh] feature.
7 At this point we are abstracting away from questions about the position of the wh-phrase in both matrix and indirect questions. Because wh-phrases are frequently found in sentence-final position in ASL, some scholars (e.g. Neidle et al. 2000) have taken that to be the unmarked position (or the position of the specifier of CP, the highest ‘root’ sentence node). The grammaticality of some sentence-initial wh-phrases is disputed by these authors. However, for adjunct wh-phrases in indirect questions, the clause-initial position is generally claimed to be acceptable, so we use clause-initial adjuncts in the examples in (7). As far as we know, the same pattern is found no matter the position of the wh-phrase.
(6)
          wh
    a.    [Matrix wh-question [embedded clause] ]
                           wh
    b.    [Matrix clause [embedded wh-question] ]

However, the pattern listed in (6) does not hold. If the matrix verb is ASK, a pattern like this might be seen, as shown in (7a). However, if the matrix verb is WONDER, a puzzled expression, like that in Figure 8.8b, might appear across the whole sentence, as in (7b). Furthermore, if the matrix verb is KNOW, no furrowed brow wh-question expression is observed; instead, there might be a head nod, as in (7c).

(7)
                     wh
    a.    IX_1 ASK [ WHERE JOHN LIVE ]
          wh
    b.    IX_1 WONDER [ WHY JOHN LEAVE ]
          hn
    c.    IX_1 KNOW [ HOW SWIM ]

The distribution of wh-question non-manual markers observed in (7) is unexpected on an account by which the spread of this marker is determined by the syntactic scope of a wh-phrase. On the other hand, if the marker indicates a direct question to which a response is expected, the lack of any such marking in (7c) is completely expected. Furthermore, its presence in (7a) can be accounted for by assuming that this example represents a (quoted) direct question. In (7b), the expression has different characteristics: it is a puzzled expression, with eye gaze up rather than towards the addressee; this expression co-occurs with the matrix clause because the whole sentence expresses puzzlement.

It would be possible to maintain that the pattern of non-manual marking in indirect questions is not inconsistent with a syntactic account of their distribution, but only that the syntactic account must be more nuanced. However, even matrix wh-questions might not display typical wh-question non-manuals in various contexts. For example, a signer can produce a wh-question with affective facial expressions that completely replace the typical wh-question non-manuals, as illustrated in row c of Table 8.1 and Figure 8.5b, similar to the example from ISL in Figure 8.3b above. It is also possible for a non-question (i.e. without morphosyntactic question markers) to be used to seek information (row d of Table 8.1), in which case it will use a non-manual marker that has the same or similar properties to the typical wh-question non-manual marking, or for a non-wh-question to indicate puzzlement using a facial expression that has a brow furrow, like standard wh-questions (row a of Table 8.1 and Figure 8.5a). All of these examples are straightforwardly compared to intonational patterns in spoken languages.
Table 8.1 Non-manual marking used in different contexts in ASL and Libras (columns: Context | ASL | Libras; the ASL and Libras cells are photographs of the relevant facial expressions, not reproduced here)
    a. Direct WH-questions
    b. Indirect question (not all types have the same marking)
    c. WH-question with affect
    d. Non-question requesting information
    e. Non-question with puzzled affect
In Libras, while the distribution of non-manual markings is similar to that described for ASL, the actual facial expressions are different, as shown in Table 8.1. In row a, typical wh-questions are shown, with brows that are furrowed in both sign languages, and, in Libras, with the inner parts of the brows somewhat raised and the head tilted back. Row b illustrates two types of marking used with indirect questions, in different contexts in the two languages. Row c illustrates that wh-questions can have affective overlays such as playful doubt (ASL) or exasperation (Libras). Row d shows that sentences without a manual (morphosyntactic) wh-element can be produced with the brow expressions typically used in wh-questions, in order to seek information.⁸ Such examples suggest that it is the information-seeking function that determines the facial intonation, and not the syntactic structure. Similarly, row e represents an ASL non-question with furrowed brows and head tilt, conveying puzzlement.

Further research is needed to determine the degree of systematicity, additional factors determining the distribution, and associated pragmatic roles of the intonational displays shown in rows b–e of Table 8.1 and in example 8.3 above, from ISL. Our point is simply that these displays, like the systematic linguistic marking of standard wh-questions, are intonational.

Taken together, these observations reveal that the non-manual markers typically associated with questions serve pragmatic functions, seeking a response or (in the case of certain yes/no questions) confirmation. As such, they are used only when questions have this pragmatic function, and even when declaratives have such a function. Furthermore, the spread of the non-manuals is often consistent with syntactic structure: for example, a question or a declarative seeking a response might have some indication of the non-manual marking throughout the whole clause. However, we note that the syntactic constituent is often, but not always, isomorphic with the prosodic constituent. The various components of the non-manual marking (brows, eyelids, head position, eye gaze) can change over the course of the sentence, and may be most intense at the end of the signer’s turn, when the interlocutor’s response is expected (an observation that is given a syntactic account by Neidle et al. 2000 and others). In addition, other factors can interrupt the flow of the wh-question marker. For example, in an alternative question, the list intonation interrupts the production of the furrowed brow wh-question marker, as first observed for ISL by Meir and Sandler (2008) and illustrated in ASL in Figure 8.9.

⁸ We note that furrowed brows characterize wh-questions in ISL, ASL, and Libras, as in other sign languages, such as British Sign Language (Woll 1981) and Sign Language of the Netherlands (Coerts 1992; Zeshan 2004). This suggests that this aspect of facial expression may have a general non-linguistic source that is conventionalized in sign languages.
Figure 8.9 ASL alternative question, glossed: PU FLAVOUR CHOCOLATE VANILLA OR LAYER OR PU, translated roughly as ‘What flavour do you want, chocolate, vanilla or layer?’.
It remains to be seen whether the patterns observed for ASL, Libras, and ISL are replicated in other sign languages. There is extensive discussion of the patterns of wh-questions found in different sign languages in the contributions collected in Zeshan (2006), although these works do not address the central question here, which is whether the non-manual marking typically associated with questions represents an essentially prosodic phenomenon versus a syntactic one. However, any analysis of the structure of questions in sign languages should take into consideration the evidence that non-manual marking behaves as an intonational component, whose distribution and scope are determined more by pragmatic than by syntactic factors (Selkirk 1984; Nespor and Vogel 2007; Sandler and Lillo-Martin 2006; Sandler 2010).
8.6 Summary and conclusion

The systematic use of facial expression and head position in sign languages, and their alignment with manual cues of prominence and timing, offer a unique contribution to linguistic theory by literally making the intricate interaction of intonation with pragmatics and syntax clearly visible. In sign languages, both the occurrence and the scope of manual and non-manual signals, as well as their coordinated interaction, are there for all to see.

Nevertheless, there are disputes in this relatively new field of inquiry. We have pointed out that one of the characteristics of prosody, including intonation, is variation—due to semantic, pragmatic, and other factors, such as rate of speech/signing—and we encourage future researchers to engage in controlled and quantified studies across a number of signers in order to document and account for the data. We hope that future research, conducted at finer and finer resolutions across signers and sign languages, will further illuminate the nature of the system, allowing us to arrive at a detailed model of prosody in sign languages, and of its interaction with other linguistic components.

The evidence we have presented shows that the distribution and behaviour of these signals correspond closely to those of intonation and prosodic phrasing in spoken languages, suggesting that an independent prosodic component is a universal property of language, regardless of physical modality.
Acknowledgements

Portions of the research reported here have received funding from the European Research Council under the European Union’s Seventh Framework Programme, grant agreement No. 340140 (Principal Investigator: WS); Israel Science Foundation grants 553/04 (PI: WS) and 580/09 (PIs: WS and Irit Meir); the U.S. National Institutes of Health, NIDCD grants #DC00183 and #DC009263 (Principal Investigator: DLM); and the Brazilian National Council for Research (CNPq), grants #200031/2009-0 and #470111/2007-0 (Principal Investigator: RMQ).
Part III

PROSODY IN SPEECH PRODUCTION

Chapter 9

Phonetic Variation in Tone and Intonation Systems

Jonathan Barnes, Hansjörg Mixdorff, and Oliver Niebuhr
9.1 Introduction

In both tonal and segmental phonology, patterned variability in the realization of abstract sound categories is a classic object of empirical description as well as a long-standing target of theoretical inquiry. Among prosody researchers in particular, focus on this critical aspect of the phonetics–phonology interface has been constant and intensive, even during periods when intellectual contact between segmental phonologists and their phonetician counterparts notably ebbed. In this chapter, we review the most commonly cited phenomena affecting the phonetic realization of both lexical and intonational tone patterns. The chapter’s title purposefully invokes a broad range of interpretations, and we aim, if not for exhaustivity, then at least for inclusivity of coverage within that range.

Many of the phenomena we investigate here are examples of non-distinctive, within-category variability—realizational elasticity, in other words, within some phonetic dimension, in spite of or unrelated to the phonological contrasts being expressed. At the same time, however, we mean equally to review the ways in which particular phonetic dimensions of the signal may be modulated in support of the expression of contrasts (in the manner of Kingston and Diehl 1994). The thread that unifies it all is a broad concern with how tone and intonation patterns are implemented phonetically.

Throughout what follows, we return repeatedly to several issues we view as central to the development of the field. One such focus is on phonetic motivation for phonological patterns, with emphasis on both perception and production evidence. We also touch on cross-language variation in the distribution and implementation of the phenomena reviewed. Do they appear in both tone and intonation systems? Do they have both gradient and categorical manifestations, and, if so, how are these connected? Lastly, we urge researchers to consider the potential interaction of all the phenomena under discussion.
The emergence of higher-level regularities, such as those involving enhancement relations or perceptual cue integration, is only now being explored by prosody researchers in the way that it has been for decades in the study of segmental contrasts such as voicing or consonant place. This bears great promise for the future of the field.

We begin with a discussion of coarticulation patterns among adjacent tonal targets (§9.2), then turn to a more general consideration of patterns of tonal timing (§9.3) and of f0 scaling (§9.4). §9.5 reviews how aspects of global contour shape contribute to the realization and recognition of tone contrasts, while §9.6 does the same for non-f0 factors such as voice quality. A brief conclusion is offered in §9.7.
9.2 Tonal coarticulation

Contrasting pitch patterns in phonological inventories are commonly described in terms of their canonical static or dynamic f0 shapes when uttered in isolation. However, like their segmental counterparts, such patterns rarely occur entirely on their own. As the larynx strives to attain multiple sequenced targets in finite time, adjacent targets may influence one another in a manner at least partially reducible to physiological constraints on tone production. At the same time, this coarticulation is tempered by the need to maintain contrasts within the system, and thus may take on diverse shapes across languages (DiCanio 2014).

Much of the literature on tonal coarticulation focuses on lexical tone languages in East and South East Asia, with detailed descriptions for Standard Chinese (Ho 1976; Shih 1988; Shen 1990; Xu 1994, 1997, 1999, 2001), Cantonese (Gu and Lee 2009), Taiwanese (Cheng 1968; Lin 1988b; Peng 1997), Vietnamese (Han and Kim 1974; Brunelle 2003, 2009a), and Thai (Abramson 1979; Gandour et al. 1992a, 1992b, 1994; Potisuk et al. 1997). More recently, studies have also been devoted to African tone languages (e.g. Myers 2003 on Kinyarwanda; Connell and Ladd 1990, Laniran 1992, and others on Yoruba), and Central and South American tone languages (e.g. DiCanio 2014 on Triqui).

Laboratory studies of tonal coarticulation typically involve elicitation of specific tone sequences, focusing on effects of preceding and following context on the realization of each tone in terms of both f0 scaling and alignment with the segmental string. Focus patterns and speech rate are also commonly varied. Figures 9.1 and 9.2, from Xu (2001), investigating tonal coarticulation in Mandarin, are representative. In Figure 9.1, we see a high (H, top) and rising (R, bottom) tone preceded by low (L), high (H), rising (R), and falling (F) tones. The f0 contour in the second syllable is clearly influenced by the preceding tone, and the ultimate target is often only approximated towards the syllable’s end, yielding ‘carry-over’ or perseveratory coarticulation. By contrast, Figure 9.2 displays Mandarin high and rising tones in the first syllable followed by four different tone types in the second. When the first syllable precedes a Low, f0 trajectories are higher than before other second-syllable tone types. The second-syllable Low target thus influences preceding Highs in a pattern of ‘anticipatory’, dissimilative coarticulation that has been called ‘pre-Low raising’ (see below). A comparison of Figure 9.1 and Figure 9.2 also reveals that in Mandarin, carry-over coarticulation would appear to be more dramatic than anticipatory coarticulation.
Figure 9.1 Carry-over coarticulation in Mandarin Chinese: f0 (Hz) contours over normalized time, panels (a) and (b). See text for an explanation of the abbreviations. (Xu 2001)
Nearly all studies of individual languages document tonal coarticulation in both directions. Perhaps the strongest cross-linguistic generalization, however, is that both the magnitude and the duration of perseveratory coarticulation regularly greatly exceed those of anticipatory coarticulation (though Brunelle 2003, 2009a finds that for both Northern and Southern Vietnamese, anticipatory coarticulation, though weaker in magnitude, is longer lasting). The perseveratory pattern known as peak delay, whereby a high target, particularly in fast speech or on weaker or shorter syllables, has a tendency to reach its maximum during a following syllable, rather than during its phonological host, is one common reflection of this general tendency. The reason for the directional asymmetry is not immediately apparent, though see Flemming (2011) for a candidate explanation based on matching of tonal targets to regions of high segmental sonority. The pattern, in any case, is apparently not universal. In Kinyarwanda (Myers 2003), tonal coarticulation is primarily anticipatory (see also Chen et al. 2018). Perseveratory coarticulation is in all known cases assimilative in nature. Anticipatory coarticulation, however, may be either assimilative or dissimilative, depending on the language or even specific tones in a single language. Vietnamese has assimilative anticipatory coarticulation, while most studies on Thai have reported dissimilative anticipatory coarticulation, especially as Low tones affect high offsets of preceding targets (see Figure 9.2).
Figure 9.2 Anticipatory coarticulation in Mandarin Chinese appears dissimilatory: f0 (Hz) contours over normalized time, panels (a) and (b). See text for an explanation of the abbreviations. (Xu 2001)
For Taiwanese, Cheng (1968) and Peng (1997) find anticipatory effects to be primarily assimilative, but Peng (1997) also noticed the dissimilative effect in which high-level and mid-level tones are higher when the following tone has a low onset. For Standard Chinese, dissimilative raising of preceding high targets before a low has been noted by multiple studies (Shih 1986; Shen 1990; Xu 1994, 1997), but Shih (1986) and Shen (1990) also report assimilative tendencies in anticipatory coarticulation, such as raising of tone offsets before following high-onset tones (Shen 1990). A common theme, however, is that Low tones more often cause dissimilation of preceding Highs than Highs do of preceding Lows. This phenomenon, often called high tone (or pre-low) raising, has been taken to represent a form of syntagmatic contrast enhancement.1 In West African languages in particular, it is often mentioned in the context of the implementation of downstep, where its local effect would be to maximize the distinction between a phonological Low tone and a downstepped High, while a global effect might be the prophylactic expansion of the pitch range, to avoid the endangerment of tone contrasts under later downstep-driven compression (Laniran and Clements 2003). While primarily observed in lexical tone systems, high tone raising has also been reported in certain intonation systems (Féry and Kügler 2008; Kügler and Féry 2017).

Tonal coarticulation should be distinguished from potentially related processes known as tone sandhi (see chapter 22) and other phonological processes such as tone shift or tone spreading. Coarticulation in the clearest cases is assumed not to alter the phonological status of affected tones, and instead just to shape their acoustic realization. Likewise, tonal coarticulation is typically gradient and may vary across speech rates and styles, whereas sandhi is categorical, ideally not dependent on rate or style, and may result in the neutralization of underlying tone contrasts.

There are, however, resemblances between the coarticulatory patterns noted above and certain common phonological tone patterns. Tone spreading, for example, usually described as assimilation affecting all or part of a target tone (Manfredi 1993), is extremely common across languages, and proceeds overwhelmingly from left to right (Hyman and Schuh 1974; Hyman 2007). For example, in Yoruba, High tones spread to following syllables with Lows, creating a falling contour on the host syllable; Low tones similarly spread right to syllables with Highs, creating surface rises (Schuh 1978). Hyman (2007) notes the connection between the two patterns and suggests that tonal coarticulation is in fact a phonetic precursor to many commonly phonologized patterns in tone systems. Downstep of High tones following a Low in terraced-level tone systems is likewise often considered part of categorical phonology, to the extent that it involves lowering the ceiling of the pitch range for all further Highs in a domain in a manner that is independent of speech rate or effort. Such systems may, however, have phonetic roots in gradient perseveratory coarticulation.

At the same time, the distinction between phonetic and phonological patterns is not always clear. The Yoruba spreading pattern is typically treated as phonological, but in many analogous cases there is little evidence to decide the matter. In Standard Chinese trisyllables, Chao (1968) finds a perseveratory effect of an initial high or rising tone on a following rise, which apparently causes the rise to neutralize with underlying high tones, at least in fast speech. If the process is indeed neutralizing, this sounds like typical tone sandhi. Its rate dependence, by contrast, sounds phonetic and suggests coarticulation.

1 A connection might thus be drawn between high tone raising and ‘low tone dipping’, discussed in §9.5.2. In both instances, low tone targets appear to be phonetically enhanced, by either the addition or the exaggeration of adjacent high targets.
9.3 Timing of pitch movements

With the rise of autosegmental phonology (Leben 1973; Goldsmith 1976a; Pierrehumbert 1980; see also chapter 5), investigation of tonal implementation in terms of specified contour shapes was largely replaced by a focus on the locations in time and f0 space of phonetic tone-level targets thought to be projected by tonal autosegments (H, L, etc.) in the phonology. These targets, in turn, were commonly equated with observable turning points (maxima, minima, ‘peaks’, ‘valleys’, ‘elbows’) in the f0 contour. In this section, we review the literature on phenomena affecting the timing of tonal targets (though, in practice, many of these patterns involve target scaling as well).

The timing of phonetic tonal targets, relative either to segmental elements or other prosodic structures, is called ‘tonal alignment’ (cf. ‘tonal association’, a phonological relation rather than a physical one).
Figure 9.3 Segmental anchoring: a schematic depicting the relative stability of alignment of f0 movements (solid line) with respect to the segmental string (here CVC) and the accompanying variation in shape (i.e. slope and duration) of the f0 movement.
A key finding in the alignment literature is segmental anchoring (Arvaniti and Ladd 1995; Arvaniti et al. 1998; Ladd et al. 1999), the observation that under changes to the duration or number of segments spanned by tonal movements, movement shapes (e.g. durations or slopes) vary correspondingly (Figure 9.3). By contrast, the temporal relationship between pitch movement onsets/offsets and select segmental landmarks remains relatively constant. This finding comfortably echoes the ‘target-and-interpolation’ approach to tonal implementation suggested by Goldsmith (1976a: §3.2) in his dissertation, where f0 contours move from target to target, each with specified timing and scaling. Between targets, f0 interpolates along the shortest or articulatorily cheapest path. Distinctive variation in ‘underspecified’ regions is not predicted.
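The target-and-interpolation idea can be made concrete with a minimal sketch. In the Python fragment below, the function name, the target times, and the f0 values are invented for illustration, and plain linear interpolation stands in for whatever the ‘articulatorily cheapest path’ turns out to be; it is not an implementation of any cited model.

    def f0_contour(targets, times):
        """Piecewise-linear f0 (Hz) at each requested time point.

        targets: (time_s, f0_hz) pairs, sorted by time, one per phonetic
                 tonal target (e.g. as projected by H and L autosegments).
        times:   time points at which to evaluate the contour.
        """
        out = []
        for t in times:
            if t <= targets[0][0]:          # before the first target
                out.append(targets[0][1])
            elif t >= targets[-1][0]:       # after the last target
                out.append(targets[-1][1])
            else:
                for (t0, f0), (t1, f1) in zip(targets, targets[1:]):
                    if t0 <= t <= t1:       # interpolate between targets
                        out.append(f0 + (f1 - f0) * (t - t0) / (t1 - t0))
                        break
        return out

    # e.g. an LH rise whose targets are anchored at the edges of an
    # accented syllable lasting from 0.10 s to 0.32 s:
    rise = [(0.10, 110.0), (0.32, 180.0)]
    print(f0_contour(rise, [0.10, 0.21, 0.32]))   # [110.0, 145.0, 180.0]

On this view, stretching the host syllable changes the slope and duration of the rise but not the landmarks its endpoints attach to, which is exactly the segmental anchoring pattern sketched in Figure 9.3.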
9.3.1 Segmentally induced variability in f0 target realization

While segmental anchoring in some form is broadly accepted, questions remain concerning its details. Given a pitch rise (phonological LH), do we expect both tonal targets to be equally, independently ‘anchored’, or is there some asymmetry and/or interdependence between them? For example, the pitch accents of various European intonation systems show relatively stable timing of pitch movement onsets, while movement offset timing varies considerably relative to segmental hosts (Silverman and Pierrehumbert 1990; Caspers and van Heuven 1993; van Santen and Hirschberg 1994). One expression of this is the tendency for rising accent peaks to anchor relatively earlier in closed syllables than in open (D’Imperio 2000; Welby and Lœvenbruck 2006; Jilka and Möbius 2007; Prieto and Torreira 2007; Prieto 2009; Mücke et al. 2009), with rise onsets largely unaffected. Perhaps relatedly, peaks are also seen to align earlier in syllables with obstruent codas than in all-sonorant rhymes (van Santen and Hirschberg 1994; Rietveld and Gussenhoven 1995; Welby and Lœvenbruck 2006; Jilka and Möbius 2007). Both these patterns may reflect a tendency for critical f0 contour regions to avoid realization in less sonorous contexts (House 1990; Gordon 1999; Zhang 2001, 2004a; Prieto 2009; Dogil and Schweitzer 2011; Flemming 2011; Barnes et al. 2014). Some studies (e.g. Rietveld and Gussenhoven 1995) have shown an influence of syllable onset composition on alignment patterns as well (cf. Prieto and Torreira 2007).
9.3.2 Time pressure effects on f0 target realization

Non-segmental factors, such as temporal pressure, may also influence tonal alignment. Silverman and Pierrehumbert (1990) famously found earlier alignment of prenuclear H* accents in English induced by multiple facets of upcoming prosodic context, including word boundaries and pitch accented syllables. This phenomenon, known as ‘tonal crowding’, is observed in many languages, including Dutch (Caspers and van Heuven 1993), Spanish (Prieto et al. 1995), Greek (Arvaniti et al. 1998), and Chickasaw (Gordon 2008). Interestingly, in cases where both right-side prosodic context and speech rate were involved, only the former resulted in significant alignment adjustment (Ladd et al. 1999; cf. Cho 2010, whose speech rate manipulation study shows both segmental anchoring and some pressure towards constant pitch movement duration in English, Japanese, Korean, and Mandarin).

The similarity of pitch-movement onset/offset asymmetries to segmental inter-gestural coordination patterns, whereby onset consonant timing is more stable and more reliably coordinated with syllable nuclei (Browman and Goldstein 1988, 2000), may pose a challenge to the standard conception of segmental anchoring (Prieto and Torreira 2007; Gao 2008; Mücke et al. 2009; Prieto 2009). Other studies, however, argue that while ‘right context’ does condition alignment changes for pitch movement offsets in various languages, within a given structural condition, movement offsets are no more variable (and hence no less anchored) than movement onsets (e.g. Ladd et al. 2000; Dilley et al. 2005; Schepman et al. 2006; Ladd et al. 2009b). In both English and Dutch (Schepman et al. 2006; Ladd et al. 2009b), for example, pitch movement offsets align differently for syllables containing phonologically ‘short’ and ‘long’ vowels, but these differences are not determined by phonetic vowel duration, requiring reference to structural factors instead (e.g. syllabification). Some languages also exhibit more timing variability in pitch movement onsets than in offsets (e.g. Mandarin: Xu 1998).2

Much remains to be understood about cross-language variation here, both for timing patterns and for anchor types. Right or left syllable edges are often loosely invoked as anchors, but holistically construed entire syllables (Xu and Liu 2006), as well as various subsyllabic constituents (e.g. morae: Zsiga and Nitsaroj 2007), have also been proposed. Comparative studies likewise demonstrate subtle cross-language alignment differences for otherwise analogous tone patterns. Southern German speakers align prenuclear rises slightly later than Northern Germans, and both align these later than Dutch or English speakers (Atterer and Ladd 2004; Arvaniti and Garding 2007 for American English dialects; Ladd et al. 2009b on Scottish and Southern British English). Mennen (2004) shows not only that ‘comparable’ pitch accents in Greek and Dutch differ subtly in timing but also that Dutch non-native speakers of Greek display different timing patterns than both native Greek and native Dutch speakers. The phonological implications of such differences remain contested (e.g. Prieto et al. 2005).

2 In other words, carry-over coarticulation is stronger than anticipatory (§9.2).
9.3.3 Truncation and compression

Much of the crowding literature focuses on repair strategies for tone strings realized in temporally challenging circumstances. Two distinct strategies, called ‘truncation’ and ‘compression’, have been identified, with some suggesting that languages differ monolithically according to their preference between these two (Erikson and Alstermark 1972; Bannert and Bredvad-Jensen 1975, 1977; Grønnum 1989; Grabe 1998a; Grabe et al. 2000).
Figure 9.4 Schematic representation of compressing and truncating approaches to f0 realization under time pressure (f0 in Hz, for short vs. long host words). (Grabe et al. 2000)
The original distinction here is that, given a pitch movement under temporal pressure (e.g. HL in phrase-final position, or with a shorter host vowel, a complex or voiceless coda, etc.), a compressing language alters the timing of pitch targets—for example, by retracting the final Low. All targets remain realized, but in a compressed interval and therefore with steeper movements. A truncating language, by contrast, resolves the problem by undershooting targets. A fall’s slope might remain unchanged, for example, but its final target would not reach as low as usual (Figure 9.4).

Grabe (1998a) argues that English compresses while German truncates. Bannert and Bredvad-Jensen (1975, 1977) distinguish dialects of Swedish similarly, and Grabe et al. (2000) find that while many British dialects favour compression (e.g. Cambridge), others prefer truncation (e.g. Leeds). More recent work, however, calls into question the binarity of the distinction. Rathcke (2016), for example, shows that German and Russian, both ostensibly truncating, in fact employ a mixture of strategies (see also Hanssen et al. 2007 on Dutch).

Cho and Flemming (2015) further point out inconsistencies in usage surrounding these concepts. Some (e.g. Grice 1995a; Ladd 1996, 2008b) take compression to mean fitting a complete tone melody into a reduced temporal interval, potentially affecting both timing and scaling. Truncation, by contrast, refers not to phonetic undershoot but to deletion of some phonological specification altogether. As Ladd (2008b: 180–184) points out, it is often not clear whether a given instance represents phonological deletion or phonetic undershoot, making the distinction challenging to investigate empirically.

Lastly, it must be recognized that temporal pressure on f0 realization is not always remedied exclusively to the detriment of the tone melody. A range of alterations to segmental material, allowing it to better accommodate the timing of an intended tone string, have also been documented, including lengthening of final sonorous segments, blocking of vowel devoicing, and final vowel epenthesis (Hanssen 2017; Roettger and Grice 2019).
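The idealized compression/truncation contrast reduces to simple arithmetic. The following sketch (Python; the target values and the fixed slope are invented for illustration, and the strictly binary behaviour idealizes what, as noted above, is often a mixture of strategies) contrasts the two repairs for an HL fall as its available interval shrinks:

    # Intended high and low targets of an HL fall (illustrative values).
    H, L = 200.0, 120.0

    def compress(duration_s):
        """Compression: both targets are realized, so the drop is constant
        and the slope steepens as the interval shrinks."""
        return L, (L - H) / duration_s      # endpoint (Hz), slope (Hz/s)

    def truncate(duration_s, slope=-250.0):
        """Truncation: the slope is held constant, so the fall is cut off
        and the final target undershot when time runs out."""
        return max(L, H + slope * duration_s), slope

    for dur in (0.32, 0.16):                # long vs. short host rhyme
        print(dur, compress(dur), truncate(dur))
    # 0.32 s: both strategies reach 120 Hz (slope -250 Hz/s)
    # 0.16 s: compression still reaches 120 Hz but at -500 Hz/s;
    #         truncation keeps -250 Hz/s and stops at 160 Hz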
9.4 Scaling of pitch movements

9.4.1 Pitch range variability: basic characteristics

Unlike musical notes, linguistic pitch specifications cannot be invariantly associated with context-independent f0 values, but rather are massively, multi-dimensionally relative. Speaker size, age, gender, sexual orientation, sociolinguistic identity, and emotional state can all influence pitch range in a global fashion, as can environmental factors such as ‘telephone voice’ or the Lombard effect during speech in noise or with hearing impairment (Hirson et al. 1995; Gregory and Webster 1996; Junqua 1996; Schötz 2007; Pell et al. 2009; Pépiot 2014). Some variability is determined by physiological factors, such as the length and mass of the vocal folds, though this is clearly both modulated by sociocultural norms (e.g. van Bezooijen 1993, 1995; Biemans 2000; Mennen et al. 2012) and actively employed for identity construction and sociolinguistic signalling. Pitch range varies not just globally, however, from individual to individual, or utterance to utterance, but in a structured, often highly local fashion as well, and it is this variability that is usually associated with the encoding of linguistic meanings or functions (cf. §9.4.2).3

One basic observation about this variation is that it tends to affect higher f0 targets more saliently than lower.4 Rather than shifting globally upward or downward, pitch range seems instead to be compressed or expanded, primarily through raising or lowering of its topline or ceiling. The bottom, or ‘baseline’ (Maeda 1976), remains comparatively unperturbed (see Figures 9.5 and 9.6). (Cf. Ladd’s 1996 decomposition of pitch range into ‘pitch level’, the overall height of a speaker’s f0, and ‘pitch span’, or the distance between the lowest and highest values within that range.)
Figure 9.5 An f0 peak realized over the monosyllabic English word Anne at seven different levels of emphasis (f0 in Hz). Peaks vary considerably, while the final low is more or less invariant. (Liberman and Pierrehumbert 1984)
3 It is worth mentioning here that by ‘pitch range’, we do not usually mean the physiologically determined upper and lower limits on frequencies an individual can produce, but rather the continuously updating, contextually determined band of frequencies an individual is using in speech at a given moment.
4 Even when correcting for non-linearities in the perception of f0 measured in Hz.
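Footnote 4 alludes to the standard correction for this non-linearity: converting f0 from Hz to semitones relative to a reference frequency. A minimal illustration follows (the conversion formula is standard; the 100 Hz reference is an arbitrary choice):

    import math

    def hz_to_semitones(f0_hz, ref_hz=100.0):
        # 12 semitones per doubling of frequency
        return 12.0 * math.log2(f0_hz / ref_hz)

    # Equal 20 Hz steps shrink perceptually as f0 rises:
    print(round(hz_to_semitones(120) - hz_to_semitones(100), 2))  # 3.16 st
    print(round(hz_to_semitones(220) - hz_to_semitones(200), 2))  # 1.65 st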
Figure 9.6 f0 contours (Hz, over time in seconds) for 11 English sentences (S3, S4, S6, S7, S8, S9, S16, S19, S20, S22, S23) read by speaker KS. A general downward trend is clearly observed (§9.4.3), but the distance between the peaks and the baseline is also progressively reduced, due to the topline falling more rapidly than the baseline. S = sentence. (Maeda 1976)
Additionally, different tone types may not be treated identically under pitch range modifications. Pierrehumbert (1980) influentially asserts that a central difference between phonologically High and Low tones is that, under hyperarticulation, High targets are raised while Lows are lowered (resembling peripheralization of corner vowels). Gussenhoven and Rietveld (2000) and Grice et al. (2009) provide some supporting evidence (the former only perceptual, the latter highly variable). Contradictory findings, however, also abound. Chen and Gussenhoven (2008) find that hyperarticulatory lowering of Lows in Mandarin is subtle and variable at best. Tang et al. (2017) find that all tones increase in f0 under noise-induced (Lombard effect) hyperarticulation. Zhao and Jurafsky (2009) report raising of all Cantonese tones, including Low, in Lombard speech, and Kasisopa et al. (2014) report the same for Thai. Gu and Lee (2009) report raising of all tones under narrow focus for Cantonese, but note that higher targets are affected more dramatically than lower. Michaud et al. (2015) report raising of all tones in Naxi, including Low, in ‘impatient’ speech. Pierrehumbert (1980: 68) suggests that lowering of hyperarticulated Lows may be constrained physiologically by proximity to the pitch floor, and in some cases obscured by simultaneous pitch floor raising. Disentangling these possibilities empirically presents a challenge.
9.4.2 Paralanguage, pitch range (quasi-)universals, and grammaticalization

One major challenge in the study of pitch range is that, while canonical linguistic contrasts are categorical in nature (a given root bears a high tone or a low tone, a given utterance does or does not show wh-movement, etc.), linguistically significant pitch range variation often appears gradient (e.g. the different levels of emphasis in Figure 9.5; see also chapter 29).
A distinction between ‘linguistic’ and ‘paralinguistic’ uses of f0 is frequently invoked here, though in practice this boundary can be elusive (Ladd 2008b: 37). Furthermore, the Saussurean arbitrariness we expect to link sounds and meanings is not always obvious in intonation systems. Certain sound–meaning pairings appear with suspicious regularity across languages, sometimes in gradient, paralinguistic forms, other times categorical and grammaticalized.

Gussenhoven (2004: ch 5), building on work by Ohala (e.g. 1984), approaches these parallels in terms of what he calls ‘biological codes’. His ‘Effort Code’, for example, involves a communicatively exploitable link between greater expenditure of articulatory effort and higher degrees of emphasis. If greater effort results in larger f0 movements, then gradient pitch range expansion might be a paralinguistic expression of agitation or excitement, while global compression might signal the opposite. A linguistic codification of this pattern might be the broadly attested tendency across languages for focused elements to be realized with expanded pitch ranges, or for given information to be realized with compressed pitch range (e.g. post-focal compression: Xu et al. 2012, deaccenting, or dephrasing of post-focal material).5 In some cases, the link is gradient (Figure 9.5 again), while in others it is categorical (e.g. European Portuguese: Frota 2000; Gussenhoven 2004: 86), where two contrasting pitch accents—mid-falling H+L* and high-peaked H*+L—encode the difference between presentational and corrective focus. Note also, however, recent work attempting to unify accounts of ostensibly categorical and gradient patterns in a dynamic systems model (Ritter et al. 2019).
9.4.3 Downtrend

Perhaps the most exhaustively studied pattern of contextual pitch range variability is downtrend, a term spanning a range of common, if not universal, phenomena involving a tendency for pitch levels to fall over the course of an utterance, which has long been recognized (e.g. Pike 1945: 77). The nature of the patterns involved, however, and even the number of distinct phenomena to be recognized, evoke sustained disagreement among scholars. Even the basic terminology is contentious and frustratingly inconsistent across research traditions.

Our usage here is as follows: ‘declination’ refers to ‘a gradual tapering off of pitch as the utterance progresses’ (Cohen and ’t Hart 1965). Declination is putatively global, gradient, and time dependent. ‘Downstep’ refers to a variety of categorical (or at least abrupt) lowering patterns, usually, though not exclusively, affecting High tones. Some downstep patterns are phonologically conditioned, such as H-lowering after L in terraced-level tone systems (Welmers 1973; see also §9.2), or boundary-dependent lowering of sequential Highs in Japanese (Pierrehumbert and Beckman 1988) and Tswana (Zerbian and Kügler 2015). Other cases are lexically or constructionally specific, such as contrastively downstepped pitch accents in English and lexical downstep in Tiv (Arnott 1964) or Medumba (Voorhoeve 1971).6 Lastly, the term ‘final lowering’ (Liberman and Pierrehumbert 1984) refers to various probably distinct phenomena involving lowered pitch in domain-final contexts. Some applications appear gradient, possibly time dependent (e.g. Beckman and Pierrehumbert 1986 on Japanese), while others are heavily structure dependent and grammaticalized (e.g. Welmers 1973: 99 on Mano). Figures 9.7 and 9.8 show examples of apparent declination and terracing downstep in Cantonese.7

5 The paralinguistic expression of the Effort Code is (arguably) universal. Its linguistic manifestation, though, clearly varies. Some languages, such as Northern Sotho (Zerbian 2006) and Yucatek Maya (Kügler and Skopeteas 2006), are reported to lack prosodic encoding of information structure. Akan (Kügler and Genzel 2012) appears to express focus with a lowering of pitch register.
6 Some sources, following Stewart (1965), distinguish between ‘automatic’ and ‘non-automatic’ downstep, where the former refers to phonologically conditioned post-Low downstep, while the latter usually means morphosyntactic or lexical downstep (sometimes attributed to the presence of ‘floating’ Low tones in phonological representation). Other terms one encounters for different types of downstep include ‘catathesis’ (Beckman and Pierrehumbert 1986, referring to the Japanese pattern) and ‘downdrift’ (sometimes used to refer to phonologically conditioned terracing downstep, and sadly also sometimes used for declination). See Leben (in press) for an excellent recent overview from a phonological perspective, as well as chapter 4.
7 Cantonese is not normally mentioned among the languages that show terraced-level downstep. To confirm that this is in fact the correct characterization of the Cantonese pattern, we would need to verify that (i) the degree of lowering in the post-Low case is greater than would be expected from a background declination effect and (ii) that this is not merely an effect of perseveratory coarticulation (i.e. that the effect is not time dependent—for example, causing lowering of the pitch ceiling that persists, absent further modifications, for the remainder of the relevant prosodic constituent).
Figure 9.7 Waveform, spectrogram, and f0 contour of a Cantonese sentence, 媽媽擔憂娃娃, maa1 maa1 daam1 jau1 waa1 waa1, ‘Mother worries about the baby’, composed entirely of syllables bearing high, level Tone 1. Gradually lowering f0 levels over the course of the utterance could be attributed to declination. (Example courtesy of Di Liu)
Much of the controversy over downtrend centres on whether globally implemented f0 patterns such as declination are distinct from, say, downstep conditioned by local interactions between tonal targets or other phenomena. Pierrehumbert and Beckman (1988), for example, argue that much of what has been taken to represent gradual f0-target lowering in the past may simply be the result of phonetic interpolation from an early High target in an utterance to a late Low. Intervening targets would thus not be lowered but rather be absent altogether.
PHONETIC VARIATION IN TONE AND INTONATION SYSTEMS 137
Figure 9.8 Waveform, spectrogram, and f0 contour of a Cantonese sentence, 山岩遮攔花環, saan1 ngaam4 ze1 laan4 faa1 waan4, ‘A mountain rock obstructs the flower wreath’, in which high Tone 1 alternates with the low falling Tone 4, creating a HLHLHL pattern reminiscent of the terracing downstep typically described in African languages. (Example courtesy of Di Liu)
absent altogether. Assuming declination does exist, though, there is also controversy over whether it is ‘automatic’ (’t Hart et al. 1990: ch 5; Strik and Boves 1995). Lieberman (1966) may be the first suggestion of a causal link between some form of downtrend and falling levels of subglottal pressure over a domain he calls the ‘breath group’. If uncompensated changes in subglottal pressure result in directly proportional f0 changes (Ladefoged 1963), and if subglottal pressure falls gradually over the utterance with decreasing volume of air in the lungs, then perhaps declination is not strictly speaking linguistic, insofar as it is not ‘programmed’ or ‘voluntary’.8 (Lieberman actually seems to be focused on rapidly falling subglottal pressure domain-finally, yielding what is now called ‘final lowering’ (see Herman et al. 1996). It is also worth noting that Lieberman only finds this connection relevant to ‘unmarked breath groups’. Languages of course also implement ‘marked breath groups’, e.g. English interrogative f0 rises, during which no trace of this ‘automatic’ tendency should be observable.) A further challenge to the idea of ‘automatic’ declination is Maeda’s (1976) observation that longer utterances typically have shallower declination slopes than shorter ones. Assuming fixed initial and final f0, the magnitude of declination thus appears constant and time independent, which in turn seems to require pre-planning, if not specifically of f0 8 Much of the disagreement here seems to hinge on tacit assumptions about what ‘automatic’ means. Some seem to understand it as ‘physiologically uncontrolled and unavoidable’, while others (e.g. ’t Hart et al. 1990) may mean something weaker, such as ‘not explicitly specified syllable by syllable’, while admitting some form of linguistically informed global control or targeting. Additionally, it should be clear that the existence of an underlying biological motivation for a linguistic pattern hardly makes that pattern ‘automatic’, the literature being replete with instances of phonetically motivated patterns that have nonetheless been ‘grammaticalized’ (§9.4.2).
138 JONATHAN BARNES, HANSJöRG MIXDORFF, AND OLIVER NIEBUHR decay then at least of the rate of airflow from the lungs. Further evidence for pre-planning comes from anticipatory or ‘look-ahead’ raising (Rialland 2001). At least some speakers of some languages have been shown to exhibit higher initial f0 levels in longer utterances than in shorter (e.g. ’t Hart et al. 1990: 128 for Dutch; Shih 2000 for Mandarin; Prieto et al. 2006 for Catalan, Italian, Portuguese, and Spanish; Yuan and Liberman 2010 for English and Mandarin; Asu et al. 2016 for Estonian).9 Another problem relating to automaticity is that downtrend can apparently be context ually modulated or even turned off. This is the case both within morphosyntactic or discourse contexts (e.g. Thorsen 1980 on Danish; Lindau 1986 and Inkelas and Leben 1990 on Hausa; Myers 1996 on Chichewa) and for specific lexical tone categories (Connell and Ladd 1990 on Yoruba; Connell 1999 on Mambila). In Choguita Rarámuri, Garellek et al. (2015) find no evidence for declination at all. Extensive basic empirical work across languages is urgently required here. One last point in the automaticity debate concerns implementation of so-called declin ation ‘reset’ (Maeda 1976), the abandonment of a given interval of downtrend, and return of the pitch range ceiling to something like typical utterance-initial levels. Here the notion of ‘breath group’ as domain becomes problematic, in that resets frequently fail to correspond to inhalations on the part of the speaker (Maeda 1976; Cooper and Sorenson 1981). Instead, reset tends to occur at linguistically salient locations, such as syntactic boundaries, thereby serving as a cue to the structure of the utterance. Degree of reset furthermore sometimes correlates with the depth of the syntactic boundary in question, distinguishing hierarchical structures with different branching patterns (Ladd 1988, 1990; van den Berg et al. 1992; Féry and Truckenbrodt 2005). How reset interacts with other known cues to boundary size and placement (e.g. pitch movements, lengthening) is an active area of current research (e.g. Brugos 2015; Petrone et al. 2017). Concerning global versus local conditioning, Liberman and Pierrehumbert (1984) argued that downtrend in English is entirely a consequence of local scaling relations between adjacent targets. They famously modelled the heights of sequential accent peaks in downstepping lists (e.g. Blueberries, bayberries, raspberries, mulberries, and brambleberries . . .), such that each peak’s f0 is scaled as a constant fraction of the preceding one.10 The resulting pattern of exponential decay creates the appearance of global downtrend without actual global planning. A tendency towards higher initial f0 in longer lists, rem iniscent of ‘look-ahead raising’, was observed in this study but discounted as non- linguistic ‘soft pre-planning’. The constant-ratio approach has been applied successfully in various languages (e.g. Prieto et al. 1995 on Mexican Spanish). Beckman and Pierrehumbert (1986), however, found that assuming an additional global declining f0 trend improved their constant-ratio model of Japanese downtrend, suggesting coexistence of downstep and declination (Poser 1984a). The constant-ratio model of English also systematically underpredicted downstep of series-final pitch accents, leading Liberman and Pierrehumbert (1984) to posit the activity of an additional decrement, or ‘final lowering’.
9 Cf. Laniran and Clements (2003) on Yoruba and Connell (2004) on Mambila, neither of which languages appear to exhibit this tendency. 10 An idea they attribute to Anderson (1978), writing on terraced-level tone systems.
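Taken together, the downtrend components distinguished in this section lend themselves to a small numerical illustration. The following Python sketch is ours, not a model from the studies cited: all parameter values (reference level, decay ratio, declination slope, final-lowering decrement) are invented for exposition.

```python
import numpy as np

def downtrend_peaks(n, first=220.0, ref=100.0, ratio=0.8,
                    declination=5.0, final_lowering=10.0):
    """Toy combination of three downtrend components: constant-ratio
    downstep (each peak's excursion above a reference line is a fixed
    fraction of the preceding one, giving exponential decay), a global
    time-dependent declination trend, and an extra series-final
    decrement. All parameter values are illustrative, not empirical."""
    peaks = np.array([ref + (first - ref) * ratio ** i for i in range(n)])
    peaks -= declination * np.arange(n)   # global declining trend
    peaks[-1] -= final_lowering           # 'final lowering'
    return peaks

print(downtrend_peaks(5))   # approx. [220., 191., 166.8, 146.4, 119.2] Hz
```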
9.4.4 Perceptual constraints on tone scaling patterns

The extreme malleability of tone values in the frequency domain raises questions about how listeners map between realized f0 levels in an utterance and the linguistic tone categories they express. Pierrehumbert’s (1980: 68) purely paradigmatic definition of High and Low tone (a High tone is higher than a Low tone would have been in the same context) encapsulates this difficulty, opening up the counterintuitive possibility that High tones under the right circumstances (e.g. late in a downstepping pattern) might be realized lower than Low tones within the same utterance. How often Highs and Lows in fact cross over in this manner is still not entirely clear. Welmers (1973) distinguishes between discrete tone-level languages, in which contrasting level tones tend to avoid realization within the other tones’ characteristic ‘frequency bands’, and other languages, such as Hausa, in which cross-over may take place in extended terracing downstep. Yoruba has terracing downstep but may resist cross-over (Laniran and Clements 2003).11

Mapping from realized f0 to phonological tone categories is commonly thought to involve evaluating the heights of individual targets against some form of contextually updating reference level. Many studies equate perceived prominence with the magnitude of, for example, a High pitch accent’s excursion over a ‘reference line’ (Pierrehumbert 1980; Liberman and Pierrehumbert 1984; Rietveld and Gussenhoven 1985). Pierrehumbert (1979) showed that for two American English pitch accents in sequence to sound equal in scaling, the second must be lower than the first. If the two are equal, listeners perceive the second as higher.12 The reference line, whether ‘overt’ (e.g. extrapolated through low f0 valleys between prominences) or ‘implicit’ (Ladd 1993), appears to be constantly declining. Gussenhoven and Rietveld (1988) provide evidence that perceptual declination is global and time dependent. Gussenhoven et al. (1997) present additional evidence that realized f0 minima in an utterance do not determine perceived prominence of neighbouring peaks.13 While phrase-initial f0 levels do alter the course of the projected reference line, Gussenhoven and Rietveld (1998) show that global normalization factors, such as inferred speaker gender, also play a role (cf. Ladd 1996, 2008b on initializing vs. normalizing models of scaling perception).

11 It is tempting to relate the notion of ‘frequency banding’ to the principle of adaptive dispersion (Liljencrants and Lindblom 1972), though efforts to locate such parallels between, say, tone systems and vowel inventories have thus far yielded equivocal results (e.g. Alexander 2010).

12 This has been interpreted as a form of perceptual compensation for realized declination in production.

13 Interestingly, even utterance-final low f0, despite its demonstrated invariance within speakers, exerts no influence on judgements of prominence.
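The reference-line idea can likewise be sketched numerically. In the toy fragment below, perceived prominence is scaled as a peak’s excursion above a declining reference line; the starting level and slope are invented for illustration and are not estimates from the perception studies cited.

```python
def prominence(peak_hz, time_s, ref_start=120.0, slope=-10.0):
    """Excursion of an f0 peak above a declining reference line
    (toy values: the line starts at 120 Hz and falls 10 Hz/s)."""
    return peak_hz - (ref_start + slope * time_s)

# Two physically identical peaks: the later one exceeds the declining
# reference by more, so it should sound higher, mirroring Pierrehumbert's
# (1979) finding that the second of two equal peaks is perceived as higher.
print(prominence(180.0, 0.5))   # 65.0
print(prominence(180.0, 1.5))   # 75.0
```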
9.5 Contour shape

Autosegmental phonology promotes a view of tone and intonation with just two orthogonally variable representational dimensions: the levels of tonal autosegments (H, M, L, etc.), and the timing relationships between them, emerging from their alignments with segmental hosts. The target-and-interpolation view of tonal implementation, moreover, extends this picture directly into the phonetics, where research has been focused either on the timing of tonal targets or on their scaling in the f0 dimension. It is furthermore commonly assumed that those tonal targets can be operationalized more or less satisfactorily as f0 turning points reflecting target attainment in both domains. While turning points are surely related to phonological tone specifications, the directness and exhaustivity of this relationship are much less certain. Much recent research has been devoted to aspects of global f0 contour shape that are varied systematically by speakers, relied upon as cues by listeners, and yet difficult to characterize in terms of turning-point timing or scaling (Barnes et al. 2012a, 2012b, 2014, 2015; Niebuhr 2013; Petrone and Niebuhr 2014). Where the target-and-interpolation model sees empty spaces or transition zones, we are increasingly identifying aspects of tonal implementation that are no less important than the ‘targets’ themselves. The following sections focus on a few of these characteristics.
9.5.1 Peak shapes and movement curvatures

Peak shape and movement curvature are two aspects of contour shape whose perceptual relevance is increasingly recognized in both tone and intonation systems. This section makes three points. First, pitch accent peaks need not be symmetrical. Second, f0 movements towards and away from high tones need not be linear. Third, peak maxima need not be local events.

The first point is reflected in most autosegmental-metrical analyses. Pitch accents include leading or trailing tones that precede or follow ‘starred tones’ by what was originally thought to be a constant interval (cf. §9.3). This idea embodies the observation that some slopes related to pitch accents are characteristically steeper than others. For a H+L* accent, for instance, the falling slope to the Low is expected to be systematically steeper than the rise to the H, whereas for L*+H, a steep rise should be a defining feature. Perception experiments by Niebuhr (2007a) on German support these expectations and show that movement slopes furthermore interact with peak alignment: the less steep the fall of a H+L* accent, the earlier in time the entire falling pattern must occur to reliably convey its communicative function. L*+H accents are not identified at all by listeners without a steep rise.14 For H*, identification is best if both rise and fall are shallow. The shallower these slopes, in fact, the less important it is perceptually how the peak aligns with respect to the accented syllable. (Similarly, see Rathcke 2006 on Russian.)

Cross-linguistic research shows that this interplay of peak shape and alignment can be a source of inter-speaker variation (Niebuhr 2011). In both German and Italian, a continuum has been identified between two opposing strategies for pitch accent realization. Some speakers (‘aligners’) distinguish their pitch accent categories primarily using pitch movement timing, with peak shapes kept virtually constant. Other speakers (‘shapers’) produce contrasting pitch accents with more or less identical timing, but with strong differences in shape. Figure 9.9 illustrates this difference using data from two exemplary speakers of German. The corpus data suggest that pure shapers are fairly rare, the vast majority of speakers using both strategies to some degree, with alignment typically dominant.

14 Identification of L*+H is additionally enhanced by a steep fall. See Niebuhr and Zellers (2012) for the relevance of falling slope here, and a possible tritonal analysis.
Figure 9.9 The realization of the H+L* versus H* contrast in German by means of variation in f0 peak alignment (top) or f0 peak shape (bottom). The word-initial accented CV syllables of Laden ‘store’, Wiese ‘meadow’, Name ‘name’, and Maler ‘painter’ are framed in grey. Unlike for the ‘aligner’ (LBO), the f0-peak maxima of the ‘shaper’ are timed close to the accented-vowel onset for both H+L* and H*.
Dombrowski and Niebuhr (2005) discovered systematic variation in the curvature of phrase-final boundary rises in German. Concave rises, starting slow and subsequently accelerating, were mainly produced in turn-yielding contexts, whereas convex (but non-plateauing) rises were mainly produced at turn-internal phrase-final boundaries. A convex–concave distinction also appears at the ends of questions, where a convex shape signals ‘please just respond and let me continue speaking afterwards’ and a concave one ‘please feel free to take the turn and keep it’. Dombrowski and Niebuhr (2005) and Niebuhr and Dombrowski (2010) characterize the communicative function as ‘activating’ (convex) or ‘restricting’ (concave) the interlocutor. Asu (2006), Petrone and D’Imperio (2008), and Cangemi (2009) report similar convex–concave distinctions in varieties of Estonian and Italian. For the latter, a convex fall from a high target marks questions, while a concave fall marks statements. Petrone and Niebuhr (2014) showed that the same form–function link applies to final falls in German as well, and even extends here, in a perceptually relevant way, to the prenuclear accent peaks of questions and statements. That is, listeners infer from the shape of a prenuclear fall whether the utterance is going to be a question or a statement. Concave rises and/or convex falls are such powerful cues to sentence mode in German that they may sway listeners even in the absence of other morphosyntactic or prosodic interrogative markers, as shown in Figure 9.10.

Temporal instantiation of f0 peaks may be ‘sharp’, with rapidly rising and falling flanks, or flatter, with f0 lingering close enough to its maximum for no single moment within that lengthier high region to be identifiable as ‘the target’ (Figure 9.11).
Figure 9.10 A declarative German sentence produced once as a statement (left) and once as a question (right). The shapes of the prenuclear pitch accent peaks are different. The alignment of the pitch accent peaks is roughly the same (and certainly within the same phonological category) in both utterances (statement and question).
Figure 9.11 A sharp peak and a plateau realized over the English phrase ‘there’s luminary’.
In many languages, this variation remains largely unexplained (Knight 2008: 226). Its perceptual consequences, however, are increasingly clear. In the scaling domain, it is widely observed (e.g. D’Imperio 2000, citing a remark in ’t Hart 1991; Knight 2003, 2008) that plateau-shaped accentual peaks sound systematically higher to listeners than analogous sharp peaks with identical maximum f0. Köhnlein (2013) suggests that this higher perceived scaling may be the reason for the relative unmarkedness across languages of high-level pitch as the phonetic realization of a High tonal target. In Northern Frisian (Niebuhr and Hoekstra 2015), extended-duration peaks appear systematically in contexts where speakers of other languages would expand pitch range (e.g. contrastive focus). Turning to the composition of lexical tone inventories, it is tempting to see this as one factor making high-level tones common across languages relative to, say, sharp-peaked rising-falling tones. Cheng (1973), for example, in his survey of 736 Chinese tone inventories, finds 526 instances of high-level tones (identified by 55 or 44 transcriptions using the Chao tone numbers), against just 80 instances of convex (rising-falling) tones.15

Explanations of the higher perceived scaling of plateaux include greater salience of the f0 maximum, owing to longer exposure (Knight 2008), and the suggestion that scaling perception involves a form of f0 averaging over time (Barnes et al. 2012a, 2012b). That is, if a plateau-shaped pattern remains close to its maximum f0 over a longer time span, then listeners should perceive it as higher in pitch. This account (correctly) predicts perceived scaling differences for other shape variations as well (Barnes et al. 2010a; Mixdorff et al. 2018; see also Niebuhr et al. 2018 on peak shape variation in ‘charismatic’ speech).

The effect of the sharp peak versus plateau shapes has also been studied with respect to the perceived timing of f0 targets (D’Imperio 2000; Niebuhr 2007a, 2010; D’Imperio et al. 2010; Barnes et al. 2012a), uncovering significant variation across languages and even between different intonation contours in a single language. What is clear is that no single point within a high plateau can be identified in a stable manner with the f0 maximum of a sharp peak for the purposes of ‘target’ identification. It is also clear that attempts to study perception (or production) of tonal timing independently of tone scaling will inevitably miss key insights, insofar as any aspect of f0 contour shape that affects one of these dimensions likely affects the other as well, yielding perceptual interactions that we are only just beginning to explore.

15 One could of course also appeal to the greater structural complexity or production difficulty of convex tones in explaining this asymmetry.
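The averaging account lends itself to a quick numerical check. In the sketch below (all values invented for exposition), a sharp peak and a plateau share the same 200 Hz maximum, but the plateau’s mean f0 over the accented region is substantially higher, which on this account should make it sound higher in pitch.

```python
import numpy as np

t = np.linspace(0.0, 0.3, 301)   # a 300 ms accented region

# sharp peak: straight rise to 200 Hz at the midpoint, then a straight fall
peak = np.interp(t, [0.0, 0.15, 0.3], [120.0, 200.0, 120.0])

# plateau: same 200 Hz maximum, but held over the middle 100 ms
plateau = np.interp(t, [0.0, 0.1, 0.2, 0.3], [120.0, 200.0, 200.0, 120.0])

print(peak.max(), plateau.max())    # identical maxima: 200.0 200.0
print(peak.mean(), plateau.mean())  # ~160 Hz vs. ~173 Hz mean f0
```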
9.5.2 ‘Dipping’ Lows and local contrast

If level f0 around an f0 maximum is an efficacious way of implementing High tones, the same may not be true for Low targets, which are instead frequently buttressed, and especially preceded, by higher f0, creating a salient movement down towards the low target, or up away from it, a pattern Gussenhoven (2007, after Leben 1976) refers to as ‘dipping’. Gussenhoven cites allophonic concave realization of Mandarin Tone 3, as well as instances of phonological Lows enhanced by higher surrounding pitches in Stockholm Swedish (Bruce 1977; Riad 1998a), Northern European Portuguese (Vigário and Frota 2003), and Borgloon Dutch (Peters 2007). Ahn (2008) discusses High f0 flanking L* pitch accents in English yes/no questions in similar terms. While some of these patterns have been treated as phonological High tone insertion, others may be a matter of gradient phonetic enhancement of the Low target. (Gussenhoven 2007 presents Swedish as a case that has been analysed both ways by different scholars.) The commonplace description of late-peak (L*+H) pitch accents as ‘scooped rises’ suggests that a similar, if less dramatic, pattern may be standard in many languages. Again consulting Cheng’s (1973) Chinese tone inventory survey, we observe that while convex rise-fall patterns are relatively rare as citation forms (see §9.5.1), concave or fall-rise patterns are in fact quite common: 352 attestations in the survey of 736 systems, as against only 166 instances of tones described as low and level (Chao numerals 22 or 11).16

The connection between enhancement patterns such as Low dipping and the Obligatory Contour Principle (Leben 1973) should be clear and is explicitly invoked by both Gussenhoven (2007) and Ahn (2008). Analogous insertion of Low to separate underlying Highs has also been proposed (Gussenhoven 2012b on the Maastricht Fall-Rise). Nonetheless, it seems fair to observe that High tone analogues to the dipping Low pattern are substantially less common, leaving us to wonder why dynamicity so commonly complements Low targets, while stasis, in the form of plateaux, is so suited to Highs.17 Perhaps there is some sense in which High tones, with their particular relation to prominence of varying descriptions, are perceptually advantaged and perhaps less in need of support from syntagmatic enhancements than their Low counterparts (see Evans 2015 and references therein).

16 Cheng’s sample skews heavily towards Mandarin, especially Northern Mandarin, so some caution in interpreting this typology is advised. The fact that we are in most cases observing citation forms of tones in such surveys also merits caution.

17 It is possible, of course, that the corresponding pattern for Highs is hiding in plain sight: if the English L+H* pitch accent is correctly thought of as an emphatic or hyperarticulated H* (a big if, of course: Ladd and Morton 1997), then perhaps the leading Low tone is in fact precisely this sort of enhancing feature. Similarly, ’t Hart et al. (1990: 124) refer to ‘extra-low F0 values preceding the first prominence-lending rise’ in a phrase, called an ‘anticipatory dip’ by Cohen and ’t Hart (1967), and observed also by Maeda (1976).

9.5.3 Integrality of f0 features

The interaction of contour shape with f0 timing and scaling makes clear the need to view individual aspects of the f0 contour as part of a larger constellation of cues working together to realize the contrasting categories of phonological systems. Peak timing and scaling, for example, interact not just with contour shape but with one another. Gussenhoven (2004) documents a pattern across languages whereby later peak timing either co-varies with or substitutes for higher peak scaling. His explanation for this ‘later = higher’ pattern involves an inference by listeners, such that longer elapsed time implies greater distance covered in f0 space, and hence a higher target.18 Numerous instances of the opposite pattern are also documented, however, whereby earlier peak timing co-varies with higher peak scaling (e.g. Face 2006 and others on earlier, higher peaks in Spanish narrow-focus constructions, similarly Cangemi et al. 2016 on Egyptian Arabic, and Smiljanić and Hualde 2000 on Zagreb Serbo-Croatian). Gussenhoven (2004) treats such counterexamples as a distinct manifestation of his Effort Code. Barnes et al. (2015, 2019) suggest that both patterns may originate from language- and construction-specific manipulations of peak timing in order to maximize mean f0 differences during a particular syllable or interval.

Trading and enhancement relations between contour shape features are only now beginning to be explored. Work in connection with the Tonal Center of Gravity theory (TCoG) (Barnes et al. 2012b; chapter 3), for example, makes explicit predictions concerning which aspects of contour shape should be mutually reinforcing and hence likely to trade or co-occur (Figure 9.12), both with one another and with particular timing and scaling patterns. For example, for a rising-falling pitch accent, both a concave rise and a convex fall would shift the bulk, or TCoG, of the raised-f0 region later. They thus enhance one another and together promote the perception of a relatively later high pitch event. Their co-occurrence might therefore be preferred across languages (while mirror images would be avoided in late-timing contexts, insofar as they would counteract it). Bruggeman et al. (2018) use the notion of the TCoG to generalize across patterns of inter-speaker variability in the realization of pitch accents in Egyptian Arabic. Patterns of individual difference such as the ‘shapers’ and ‘aligners’ presented above (§9.5.1) may also be explained in this manner. Lastly, Barnes et al. (2015, 2019) develop the notion of the TCoG as a perceptual reference location for f0 events, both in the timing and the scaling dimensions, as a way of accounting for apparent perceptual interactions of the kind noted earlier in this section.

18 Also called the ‘Tau Effect’, a potentially domain-general phenomenon whereby increased separation of events in time causes overestimation of their separation in space (Helson 1930; Henry et al. 2009).
Figure 9.12 Schematic depiction of how various f0 contour shape patterns affect the location of the Tonal Center of Gravity (TCoG) (Barnes et al. 2012b) and the concomitant effect on perceived pitch event alignment. The shapes on the left should predispose listeners to judgements of later ‘peak’ timing, while the mirror images (right) suggest earlier timing. Shapes that bias perception in the same direction are mutually enhancing and hence predicted to co-occur more frequently in tonal implementation.
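The TCoG itself can be operationalized as an f0-weighted mean time over the raised-f0 region. The sketch below is one simple way of computing it; the choice of baseline, and the assumption that excursions above it serve directly as weights, are simplifications for illustration rather than the exact procedure of Barnes et al. (2012b).

```python
import numpy as np

def tonal_center_of_gravity(times, f0, baseline=0.0):
    """f0-weighted mean time of a raised-f0 region (after Barnes et al.
    2012b): excursions above the baseline serve as weights, so shapes
    that concentrate high f0 late (concave rise, convex fall) shift the
    TCoG later, biasing listeners towards 'later peak' judgements."""
    w = np.clip(np.asarray(f0, dtype=float) - baseline, 0.0, None)
    return float(np.sum(np.asarray(times, dtype=float) * w) / np.sum(w))
```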
Niebuhr’s (2007b, 2013) Contrast Theory likewise showcases the interplay of seemingly disparate aspects of the signal in perception, but with an emphasis on perceived prominence. Its basic assumption is that varying realization strategies serve to increase the perceived prominence of some f0 intervals over others, enhancing phonological contrasts. For instance, the final low section of H+L* and the central high section of H* in German should each achieve maximum prominence, assuming that T* is prominent.
One way to achieve this would be to exploit the prominence-lending effect of duration and create a plateau-shaped peak for H*. Another strategy would be to centre the relevant f0 stretches over the accented vowel, thereby exploiting its inherent prominence-lending energy level. This would yield, as typically attested, an earlier peak alignment for H+L* than for H*. Indeed, perception studies involving f0 peak alignment continua (Figure 9.13) locate the category boundary between these two accents in synchrony with the intensity increase corresponding to the transition from onset consonant to accented vowel. Moreover, the less abrupt this intensity increase is, the less abrupt the categorization shift from H+L* to H* in the peak timing continuum.

Figure 9.13 f0-peak shift continuum over the German utterance Sie war mal MAlerin ‘She was once a PAINter’, and the corresponding psychometric function of H* identifications. The lighter lines refer to a repetition of the experiment but with a flatter intensity increase across the CV boundary.

The conceptual core of the Contrast Theory relies on ideas developed by Kingston and Diehl (1994), whereby multiple acoustic parameters may be coordinated by speakers such that they form ‘arrays of mutually enhancing acoustic effects’ (p. 446), with each array representing ‘a single contrastive perceptual property’ (p. 442). Here, combinations of timing, shape, and slope properties of f0 movements would constitute the ‘arrays’, while the contrastive perceptual property would be prominence. Contrast Theory holds that speakers vary individual prominences to create a certain prominence Gestalt; together with the coinciding pitch Gestalt, this encodes communicative meanings and functions.

Contrast Theory and the TCoG both represent attempts to reconcile tension in the literature between ‘configuration’-based accounts of tone patterns and those based on level-tone targets (Bolinger 1951). Both approaches turn on the integration of acoustic f0 cues into higher-level perceptual variables. While often complementary, these approaches sometimes diverge in interesting ways, as in the case of accentual plateaux in German (Niebuhr 2011), where results run contrary to Barnes et al. (2012a) on English, in a way that appears to favour a contrast-based approach over one involving timing of the TCoG.
9.6 Non-f0 effects

The fact that Mandarin, for example, is highly intelligible when whispered, or resynthesized with flattened f0, attests to the salience of non-f0 cues to tonal contrasts in that language (Holbrook and Lu 1969; Liu and Samuel 2004; Patel et al. 2010). That listeners can discriminate above chance both whispered question–statement pairs and different prominence patterns in languages such as Dutch (Heeren and van Heuven 2014) speaks similarly regarding intonation. In addition to the interaction of intensity/sonority with f0 already discussed, chapter 3 discusses other non-f0 cues as well, such as duration and ‘segmental intonation’ (Niebuhr 2008, 2012), the pseudo-f0 present in (for example) obstruent noise during voiceless intervals. In what follows, we focus on one additional non-f0 factor: phonation type or voice quality.

We focus on creaky voice here, though a similar literature exists regarding breathiness (e.g. Hombert 1976; Hombert et al. 1979; Esposito 2012; Esposito and Khan 2012). For overviews on phonation type, see Gerratt and Kreiman (2001) and Gordon and Ladefoged (2001), and on interactions with tone in particular see Kuang (2013a). Though creak is a well-known cue to prosodic boundary placement and strength (Pierrehumbert and Talkin 1992; Dilley et al. 1996; Redi and Shattuck-Hufnagel 2001; Garellek 2014, 2015), here we discuss it solely in relation to tonal contrasts. In some languages, voice quality is a contrast-bearing feature essentially orthogonal to tone (e.g. Jalapa Mazatec: see Silverman et al. 1995; Garellek and Keating 2011; Dinka: see Andersen 1993). In other cases, the two may be linked in complex ways. In White Hmong (Garellek et al. 2013), for example, the association of breathy voice with an otherwise high-falling tone is sufficiently strong for breathiness alone to cue listener identifications. Low falling tone, by
contrast, though frequently creaky, depends primarily on duration and f0 for identification, with voice quality playing no discernible role.19

Both physiological and perceptual motivations for association patterns between voice qualities and tones have been proposed. Creaky voice, for example, frequently co-occurs with lower tones, both lexical and intonational. Welmers (1973: 109) notes that native speakers of Yoruba ‘have sometimes interpreted a habitually creaky voice in an American learner as signalling low tone even when the pitch relationships represented an adequate imitation of mid and high’. Yu and Lam (2014) show that creaky voice added to otherwise identical f0 contours is sufficient to shift Cantonese listener judgements from low-level Tone 6 to low falling Tone 4. In Green Hmong, Andruski and Ratliff (2000) show that three low falling tones with broadly similar f0 are distinguished primarily by voice quality (modal, creaky, and breathy). In some cases, the phonetic link between low f0 and creak appears quite direct, as in Mandarin, where Kuang (2013a, 2017) shows that, although creak is a strong cue for (Low) Tone 3, it actually occurs whenever context draws speakers to the lower extremes of their pitch range (e.g. some offsets of high-falling Tone 4). Likewise, Tone 3 creaks less when realized in a raised pitch range and more when pitch range is lowered.

Puzzlingly, creakiness is also frequently associated with very high f0 targets. The issue may be partly terminological (Keating et al. 2015). However, Kuang (2013a, 2017), building on work by Keating and Shue (2009) and Keating and Kuo (2012), demonstrates a connection between creaky or tense phonation and both high and low f0. English and Mandarin speakers producing rising and falling ‘tone sweeps’ exhibited a wedge-shaped relationship between f0 and voice quality, such that both extreme low and high f0 targets were realized with shallower spectral slopes (i.e. low H1–H2). Kuang hypothesizes that extreme f0 values at either end of the pitch range lead to increased vocal fold tension and thus non-modal phonation. For low f0 values, this becomes prototypical creak or vocal fry, often with irregular glottal pulsing added to shallow spectral slope. For high f0, it results instead in ‘tense’ or ‘pressed’ voice quality, sharing shallow spectral slope but lacking irregular pulsing.20 (Kingston 2005 reasons similarly regarding apparent tone reversals in Athabaskan languages related to glottalization.)

There is also, however, a psychoacoustic component to these associations. Many have suggested (e.g. Honorof and Whalen 2005) that voice quality provides cues to where f0 targets lie within a speaker’s pitch range, facilitating speaker normalization. For example, tense voice quality on high-f0 syllables might indicate the top end of the speaker’s range. Kuang et al. (2016) show that manipulation of spectral slope to include more higher-frequency energy (i.e. to create a ‘tenser’ voice quality) causes listeners to report higher pitches than when the same f0 is presented with steeper spectral slope. Moreover, Kuang and Liberman (2016a) showed that at least some listeners interpreted the same spectral slope differently in different pitch ranges. Shallower slope elicited lower pitch judgements when appearing to be low in a synthetic female pitch range, but higher when high in that same range (cf. the wedge-shaped production relationship above). Lastly, Kuang and Liberman (2016b) elicited lower f0 judgements to stimuli with synthetic vocal fry added (through ‘jittering’ pulse spacing) than to the same stimuli presented unjittered. The relationship between f0 and voice quality interactions in speech production and the integration of those features in perception represent rich ground for future research. Brunelle’s (2012) investigation of the integration of f0, F1 (Formant 1), and voice quality in the perception of ‘register’ in Cham dialects is exemplary here.

19 In some languages these features, along with vowel quality and duration, are so densely interwoven that the term ‘register’ (Henderson 1952) is substituted. See Brunelle and Kirby (2016) and chapter 23 on problems with this distinction.

20 Interestingly, some of Kuang’s Mandarin speakers, particularly when instructed not to creak during their tone sweeps, produced breathy voice at the low end of their pitch ranges instead, and Kuang cites Zheng (2006) for the observation that for some speakers at least, the dipping Mandarin Tone 3 may be allophonically breathy, rather than creaky, as it is usually described.
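The H1–H2 measure of spectral slope invoked above can be approximated directly from a short-time spectrum. The sketch below is a bare-bones, uncorrected estimate: it does not compensate for formant influences (as corrected H1*–H2* measures do), and the ±10% harmonic search band is an arbitrary assumption for illustration.

```python
import numpy as np

def h1_h2(frame, fs, f0):
    """Uncorrected H1-H2 (dB): amplitude of the first harmonic minus the
    second. Lower values indicate shallower spectral slope (tenser or
    creakier phonation); higher values are associated with breathier
    phonation."""
    spec = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    freqs = np.fft.rfftfreq(len(frame), 1.0 / fs)

    def harmonic_db(k):
        # peak magnitude within +/-10% of the k-th harmonic (assumed band)
        band = (freqs > 0.9 * k * f0) & (freqs < 1.1 * k * f0)
        return 20.0 * np.log10(spec[band].max() + 1e-12)

    return harmonic_db(1) - harmonic_db(2)
```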
9.7 Conclusion

In the foregoing, we have reviewed, to the extent possible, the main patterns of variation documented in the realization of f0 contours across tone and intonation systems. We have attempted to give some indication both of what these patterns are like descriptively and of what kinds of explanations researchers have offered for them, drawing especially on connections with the literature on perception. Beyond this, we have attempted to underscore the importance of viewing all these patterns in light of their mutual interactions. In general, we expect study of the integration of cues from all the dimensions of the contour discussed herein to be a rich source of progress in prosody research concerning the production and perception of tonal contrasts for years to come.
Chapter 10

Phonetic Correlates of Word and Sentence Stress

Vincent J. van Heuven and Alice Turk
10.1 Introduction

It has been estimated that about half of the languages of the world have stress (van Zanten and Goedemans 2007; van Heuven 2018). In such languages every prosodic domain has a stress (also called a prosodic head). In non-stress languages (e.g. tone languages) there are no head versus dependent relationships at the word level. Prosodic domains are hierarchically ordered such that each next-higher level in the hierarchy is composed of a sequence of elements at the lower level (e.g. Nespor and Vogel 1986). In a stress language one of these lower-level units is the prosodic head; the other units, if at all present, are the dependents. This chapter deals with the prosodic heads at the word and sentence levels, called the (primary) word stress and (primary) sentence stress, respectively. Sentence stresses, whether primary or secondary, typically involve the presence of a prominence-lending tone or tone complex (i.e. a pitch accent) in a syllable with word stress (e.g. Sluijter and van Heuven 1995), which may additionally have effects on other phonetic variables and is as such profitably discussed in combination with word stress. Stresses on dependents at each of these levels can be considered secondary, or even tertiary; secondary and tertiary stress will not be considered here.1

Word stress is generally seen as a lexical property. Its location is fixed for every word in the vocabulary, by one or more fairly simple regularities. In Finnish, Hungarian, and Estonian, for instance, the word stress is invariably on the first syllable, in Turkish it is on the last, and in Polish it is on the second-but-last syllable (except in some loanwords). In Dutch the location of the word stress is correctly predicted in about 85% of the vocabulary by half a dozen quantity-sensitive rules (Langeweg 1988). In some languages (e.g. Russian and Greek) the location of the stress is fixed for each individual word but apparently no generalizations can be formulated that predict this location; here stress has to be memorized for each word in the lexicon separately.

From a structural point of view, languages have a richer inventory of stressed than unstressed syllables (Carlson et al. 1985; van Heuven and Hagman 1988). Stressed syllables often allow more complex onsets and codas, as well as long vowels and diphthongs. Unstressed syllables often permit only single consonants in the onset and none in the coda, while distinctions in vowel length tend to be neutralized. Moreover, stressed syllables tend to resist deletions and assimilations to neighbouring unstressed syllables, whereas unstressed syllables tend to assimilate to adjacent stressed syllables and are susceptible to weakening processes and deletions.

The classical definition of stress equates it with the amount of effort a speaker spends on the production of a syllable. This implies that some extra effort is spent on each of the stages of speech production—that is, the pulmonary stage (more air is pushed out of the lungs per unit time), the phonatory stage (strong contraction of selected laryngeal muscles), and the articulatory stage (closer approximation to the articulatory targets of segments). It has been notoriously difficult, however, to find direct physiological or neural correlates of effort or stress in speech production, and we will not attempt to improve on this state of affairs in the present chapter. Instead, we will survey the acoustic correlates of primary stress, with emphasis on languages such as English and Dutch, at the word and sentence level.2 We will show that sentence stress is signalled by all the properties that are acoustic correlates of word stress but that some extra properties are added when the word receives sentence stress. We will also review the literature on the relative importance of the correlates of word and sentence stress. Acoustic markers assume a higher position in the rank order of stress correlates as they more reliably differentiate stressed syllables from their unstressed counterparts in automatic classification procedures. The review will also bring to light that the rank order of acoustic correlates does not correspond in a one-to-one fashion with the perceptual importance of the cues. The final part of this chapter will briefly consider the universality of the rank order of stress cues and consider the question: Is the relative importance of the acoustic correlates or of the perceptual cues the same across all languages that employ stress, or does it differ from one language to the next, and if so, what are the factors that influence the ranking of stress cues?

1 See Rietveld et al. (2004), and references therein, for duration differences between primary and secondary stress in Dutch.

2 For recent surveys of stress correlates in a wider range of languages we refer to, e.g., Hargus and Beavert (2005), Remijsen and van Heuven (2006), Gordon (2011b), and Gordon and Roettger (2017).
10.2 Acoustic correlates of word stress

In this section, we will consider how we can determine the acoustic correlates of primary word stress. The procedure is relatively straightforward if a language has minimal stress pairs, which are pairs of lexical items that contain the same phoneme sequence and differ only in the location of the stress. Such minimal stress pairs do not abound in the Germanic languages, but there are enough of them for research purposes. The most frequently used minimal stress pairs in research on English are the noun–verb pairs in words of Latin origin, such as (the) import versus (to) import. A single word (i.e. a one-word sentence, also
called the citation form) will always receive sentence stress. If we want to study the correlates of word stress, recorded materials should not have sentence stress on the target word(s), and multi-word sentences must therefore be constructed in which the sentence stress is shifted away from the target word by manipulating the information structure of the sentence. For instance, putting the time adjunct in focus would shift the sentence stress away from the target word onto the adverb in the answer part in (1), where bold small capitals denote word stress and large capitals represent sentence stress:

(1) Q. When did you say ‘the IMport’? A. I said ‘the import’ YESterday.
(2) Q. When did you say ‘to imPORT’? A. I said ‘to import’ YESterday.
In Germanic languages, the rule is to remove sentence stress from words (or larger constituents) that were introduced in the immediately preceding context. By introducing the target word in the precursor question, it typically no longer receives sentence stress in the ensuing answer. The acoustic correlates of word stress can now be examined by comparing the members of the minimal stress pair in (1A) and (2A). This is best done by comparing the same syllable in the same position with and without word stress, a procedure that is referred to as ‘paradigmatic comparison’. Syntagmatic comparison of the first and second syllables is problematic, since the comparison is between segmentally different syllables in different positions, and should be avoided. For (partial) solutions for syntagmatic comparisons, see van Heuven (2018). Different segments have inherently different durations, intensities, and resonance frequencies. The vowel in im- is shorter, has less intensity, and has different formant frequencies than the vowel in -port, differences which preclude a direct comparison of the effect of stress. Also, final syllables of words tend to be pronounced more slowly than non-final syllables, which adds to the difficulty of isolating the correlates of stress. Although it is possible, in principle, to correct for segment-inherent and position-dependent properties, this is not normally done in research on acoustic correlates of stress.

If a language has no minimal stress pairs—for instance, when the language has fixed stress—paradigmatic comparison is not possible, and stressed and unstressed syllables can only be compared syntagmatically. Phonetic differences between stressed and unstressed syllables in a fixed-stress language will, however, always be ambiguous, since the effects can be caused by a difference in stress, but also by a difference in the position of the syllable in the word.

In order to make the stressed syllable stand out from its environment, the talker makes an effort to pronounce this syllable more clearly. The result is that the stressed vowel and consonants approximate their ideal articulatory targets more closely, which in turn causes the segments to be lengthened and be produced with greater acoustic distinctiveness and intensity.
10.2.1 Segment duration

Segmentation is a somewhat artificial task because of widespread coarticulation of speech movements. However, the timing of events such as consonantal closure, consonantal release, and voice onset can often be reliably identified in the acoustic waveform and spectrogram, and segment durations can be measured on the basis of these intervals
(e.g. Turk et al. 2006). Findings for segment-related durations suggest that the lengthening effects of stress are strongest for vocalic intervals, which in English and Dutch will be approximately 40–50% longer when stressed (Fry 1955; Nooteboom 1972; Sluijter and van Heuven 1995, 1996a).3 Fry showed that the duration of the vocalic interval differentiated stressed from unstressed tokens in an automatic classification procedure with an accuracy of 98% (for details, see van Heuven 2018). In English, the lengthening caused by stress is found irrespective of the position of the syllable in the word (e.g. van Santen 1994). In Dutch, however, a phrase-final syllable that is already longer because of pre-boundary lengthening will not be lengthened further when it has sentence stress (Cambier-Langeveld and Turk 1999).

Fry (1955) suggested that consonants were less susceptible to lengthening by stress than vowels. Findings with respect to stress on consonantal intervals come primarily from comparisons of consonantal intervals in unstressed syllables with those in stressed syllables that bear both word and sentence stress (e.g. Lisker 1972; van Santen 1994). In English, the size of the stress effect depends on the type of consonant and its position in the word (van Santen 1994). Word-initial and word-final effects of stress in van Santen (1994) were no greater than 20 ms, but larger effects were observed in word-medial position for ˈVCV versus VˈCV comparisons, particularly for /s/ and /t/ (see also Klatt 1976; Turk 1992). Alveolar stops in ˈVCV position often lenite to taps in American English, as does /t/ to glottal stop in British English. In Dutch, Nooteboom (1972) found small but consistent effects of stress on consonants, both in onset and coda position, in trisyllabic CVCVCVC nonsense words (see also van Heuven 2018).

3 Fry (1955) and Nooteboom (1972) actually elicited their words in sentence contexts. However, Nooteboom made sure the targets were out of focus (no sentence stress), and Fry did not measure pitch effects. So it would be safe to use these data as correlates of word stress rather than sentence stress.
10.2.2 Intensity

In most studies on stress, the peak intensity of the vowel is measured as a correlate of stress. Intensity can be measured as the root-mean-square average of the amplitude of the sound wave in a relatively short time window that should include at least two periods of the glottal vibration. For a male voice the integration window should be set at 25 ms; for a female voice the window can be shorter. Instead of the peak intensity, some studies (also) report the mean intensity (the intensities per time window summed over the vocalic interval and divided by the number of time steps it contains). Consonant intensities are not normally reported as correlates of stress. Beckman (1986) proposed the intensity integral (i.e. the total area under the intensity curve of a vowel) as an optimal correlate of stress. It should be realized, however, that the intensity integral is essentially a two-dimensional measure, whose value is determined jointly by the vowel duration and the mean intensity. For this reason we prefer to report the values for these two dimensions separately.

Sound intensities are conventionally reported in decibels (dB). The decibel scale has an arbitrary zero-point, which is equal to the sound level that distinguishes audible sound from silence for an average human hearer. The decibel scale is logarithmic: every time we multiply the intensity of a sound by a factor of 10, we add 10 dB. The loudest intensity the human ear can tolerate (the threshold of pain) is a trillion times stronger than the threshold of hearing, so that the range of
intensities is between 0 and 120 dB. Vowel intensities are typically in the range of 55 to 75 dB. The effects of stress are small but consistent: a stressed vowel is roughly 5 dB stronger than its unstressed counterpart. Fry (1955) has shown that the stressed and unstressed realization of the same syllable in English minimal stress pairs can be discriminated from each other with 89% accuracy (see van Heuven 2018 for details).
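As a sketch of the measurement procedure just described, the fragment below computes peak RMS intensity over a sliding 25 ms window. It assumes the input is a NumPy array of sound pressure values calibrated in pascals; with uncalibrated samples the result is only a relative dB level.

```python
import numpy as np

def peak_intensity_db(x, fs, win=0.025, ref=2e-5):
    """Peak RMS intensity in dB over a sliding window.

    win=0.025 (25 ms) spans at least two glottal periods for a male
    voice; ref=20 uPa is the conventional threshold of hearing, the
    zero-point of the dB scale described above."""
    n = int(win * fs)
    hop = max(1, n // 2)
    rms = [np.sqrt(np.mean(x[i:i + n] ** 2))
           for i in range(0, len(x) - n + 1, hop)]
    return 20.0 * np.log10(max(rms) / ref)
```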
10.2.3 Spectral tilt

When we produce a sound with more vocal effort, the effect is not just that the overall intensity of the sound increases. The increased airflow through the glottis makes the vocal folds snap together more forcefully, which specifically boosts the intensity of the higher harmonics (above 500 Hz), thereby generating a flatter spectral tilt. This effect of vocal effort has been shown to be perceptually more noticeable than the increase in overall intensity (Sluijter and van Heuven 1996a; Sluijter et al. 1997). The spectral slope of a vowel can be estimated by computing its long-term average spectrum (LTAS) between 0 and 4,000 Hz, and then fitting a linear regression line through the intensities of the LTAS. Spectral tilt is the slope coefficient of the regression line and is expressed in dB/Hz. The spectral tilt is typically flatter for the stressed realization of a vowel than for an unstressed realization (all else being equal). See also Campbell and Beckman (1997), Hanson (1997), Fulop et al. (1998), Hanson and Chuang (1999), Heldner (2001), Traunmüller and Eriksson (2000), and Kochanski et al. (2005) for additional ways of measuring spectral tilt.
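A bare-bones version of this estimation procedure might look as follows: windowed power spectra are averaged to approximate the LTAS, and a regression line is fitted up to 4 kHz. The frame length and hop size are illustrative choices, not prescriptions from the studies cited.

```python
import numpy as np

def spectral_tilt(x, fs, fmax=4000.0, frame=0.03):
    """Slope (dB/Hz) of a regression line through the long-term average
    spectrum (LTAS) between 0 and fmax. Flatter (less negative) slopes
    are expected for stressed vowels."""
    n = int(frame * fs)
    win = np.hanning(n)
    frames = [x[i:i + n] * win for i in range(0, len(x) - n + 1, n // 2)]
    power = np.mean([np.abs(np.fft.rfft(f)) ** 2 for f in frames], axis=0)
    freqs = np.fft.rfftfreq(n, 1.0 / fs)
    keep = (freqs > 0) & (freqs <= fmax)
    ltas_db = 10.0 * np.log10(power[keep])
    slope, _intercept = np.polyfit(freqs[keep], ltas_db, 1)
    return slope
```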
10.2.4 Spectral expansion

Stressed sounds are articulated more clearly. For vowels, this means that the formant values will deviate further away from those of a neutral vowel (schwa). The acoustic vowel triangle with corner points for /i, a, u/ in an F1-by-F2 plot will be larger for vowels produced with stress than when produced without stress (‘spectral expansion’; a shrinking of the effective vowel triangle for unstressed vowels is commonly referred to as ‘spectral reduction’). Spectral expansion is expressed as the Euclidean distance of the vowel token in the F1-by-F2 vowel space from the centre of the vowel space, where the neutral vowel schwa is located (i.e. F1 = 500 Hz, F2 = 1500 Hz for a typical male voice; add 15% per formant for female voices). It is advised to apply perceptual scaling of the formant frequencies in order to abstract away from differences in sensitivity of the human ear for low and high formant frequencies, applying Bark conversion (also to the neutral reference vowel). Hertz-to-Bark conversion is done as in (3) by an empirically determined transformation (Traunmüller 1990).

(3) Bark = 7 × ln(hertz/650 + √((hertz/650)² + 1))

The Euclidean distance D of the vowel token (V) from the centre of the vowel space (schwa) is then computed by (4):

(4) D = √((F1V − F1schwa)² + (F2V − F2schwa)²)

Typically, the difference (in hertz or, better still, in Bark units) between the D of the stressed vowel token and that of its unstressed counterpart (in a paradigmatic comparison) is
positive, indicating that the stressed vowel is further away from the centre of the vowel space. Spectral expansion has been reported as a useful correlate of stress for Dutch (Koopmans-van Beinum 1980; van Bergem 1993; Sluijter and van Heuven 1996a) and for English (Fry 1965; Sluijter et al. 1995; Sluijter and van Heuven 1996b). Although automatic discrimination between stressed and unstressed (and therefore partially reduced) tokens of the same vowels was well above chance, the discriminatory power of spectral expansion is smaller than that of either duration or intensity.

A calculation of spectral expansion and reduction might also be attempted for frication noise—that is, for fricative sounds and release bursts of stops and affricates. Frication noise is not normally analysed in terms of resonances but characterized by the statistical properties of the entire energy distribution. The noise spectra do not normally extend to frequencies below 1 kHz. In order to be able to compare the noise spectra across voiced and voiceless sounds, it is expedient to confine the spectral analysis to a 1 to 10 kHz frequency band, thereby excluding most of the energy produced by vocal fold vibration in voiced sounds. Maniwa et al. (2009) propose that all four moments of the energy distribution be measured as correlates—that is, the spectral mean (also known as centre of gravity), the standard deviation, the skew, and the kurtosis. Their analysis of clearly articulated (stressed) fricatives in American English shows that the combination of spectral mean and standard deviation discriminates well between fricatives of different places of articulation. We are not aware, however, of any studies that have applied the concept of spectral moments to the effects of stress.
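Equations (3) and (4) translate directly into code. The sketch below follows the procedure in the text, including the schwa reference of F1 = 500 Hz and F2 = 1500 Hz for a male voice; the optional 15% female-voice adjustment and the example formant values are illustrative.

```python
import math

def hz_to_bark(hz):
    # Equation (3): Bark = 7 * ln(x + sqrt(x^2 + 1)), with x = hz/650
    x = hz / 650.0
    return 7.0 * math.log(x + math.sqrt(x * x + 1.0))

def spectral_expansion(f1, f2, female=False):
    """Equation (4): Euclidean distance (in Bark) of a vowel token from
    the neutral schwa reference (F1 = 500 Hz, F2 = 1500 Hz for a male
    voice; +15% per formant for a female voice)."""
    scale = 1.15 if female else 1.0
    d1 = hz_to_bark(f1) - hz_to_bark(500.0 * scale)
    d2 = hz_to_bark(f2) - hz_to_bark(1500.0 * scale)
    return math.sqrt(d1 * d1 + d2 * d2)

# a stressed /a/-like token should lie further from schwa than a reduced one
print(spectral_expansion(700.0, 1300.0))  # stressed (more peripheral): ~1.8
print(spectral_expansion(550.0, 1450.0))  # reduced (closer to schwa): ~0.5
```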
10.2.5 Resistance to coarticulation

One characteristic of a spectrally expanded stressed syllable is that its segments show little coarticulation with each other or with the abutting segments of preceding and following unstressed syllables. Unstressed syllables are, however, strongly influenced by an adjacent stressed syllable, in that properties of the stressed syllable are anticipated in a preceding unstressed syllable and perseverate into the following unstressed syllable (van Heuven and Dupuis 1991). Resistance to coarticulation was claimed to be an important articulatory correlate of stress in English by Browman and Goldstein (1992a) and in Lithuanian by Dogil and Williams (1999); see also Pakerys (1982, 1987). Acoustic correlates of coarticulatory resistance are most likely to be longer segment durations and larger frequency differences between onset and offset of CV and VC formant transitions in stressed syllables, when paradigmatically compared with their unstressed counterparts.
10.2.6 Rank order

The findings on acoustic correlates of word stress in English and Dutch are compatible with the suggestion traditionally made in the literature that duration of the syllable (or the vowel within it) is the most consistent and reliable correlate of stress, followed by intensity. The literature on other correlates is somewhat scarcer, but what emerges is that (flatter) spectral tilt and spectral expansion of vowels are at the bottom of the rank order, with no clear difference between them.
10.3 Acoustic correlates of sentence stress

Words that are the prosodic heads of constituents that are in focus (i.e. constituents that are presented by the speaker as introducing important information into the discourse) are pronounced with sentence stress on the syllable that carries the word stress; such words are often called ‘(nuclear pitch) accented’. Words that refer to concepts that were introduced in the (immediately) preceding context are typically pronounced without a sentence stress and have word stress only. Function words generally do not receive sentence stress.

When a word is pronounced with sentence stress (nuclear or prenuclear pitch accent), the stressed syllable in that word is associated with a prominence-lending change in the rate of vocal fold vibration causing a change in pitch, often called a ‘pitch accent’ (Vanderslice and Ladefoged 1972; Pierrehumbert 1980; Pierrehumbert and Hirschberg 1990; Beckman and Edwards 1994). The fundamental frequency (f0) change may be caused by a local rise in the frequency with which the vocal folds vibrate (causing f0 to go up), by a fall of the f0, or by a combination of rise and fall. These abrupt f0 changes are typically analysed as sequences of two f0 targets, H (for high f0) and L (for low f0), where one of the targets is considered to be prominence-lending, indicated with a star in phonological analyses (e.g. in the ToBI transcription system; Beckman et al. 2005). An H* configuration (assuming preceding low f0) would then represent a rise, whereas H*L would be a rise-fall configuration—in both cases with a prominence-lending H target.

Changes in f0 associated with sentence stress may differ in size and in their location in the syllable. Less than full-sized f0 changes are denoted by downstepped H targets (!H*). However, for accurate (automatic) classification of sentence stress, the f0 movement in normal human speech should be at least four semitones (a change in f0 of at least a major third, or some 25% in hertz). Normally, smaller f0 changes are not prominence-lending. Such small f0 perturbations (also called micro-intonation; Di Cristo and Hirst 1986; ’t Hart et al. 1990) are interpreted by the listener as involuntary (non-planned), automatic consequences of, among other things, the increase in transglottal pressure due to the sudden opening of the mouth during the production of a vowel. Sluijter et al. (1995) showed that the members of disyllabic English minimal stress pairs were differentiated automatically with 99% accuracy by the presence of an f0 change of at least four semitones within the confines of the stressed syllable. The f0 contrast between initial and final stress fell to chance level for words produced without sentence stress (i.e. without pitch accents), suggesting that f0 is not a correlate of word stress in English.

The temporal alignment of the prominence-lending f0 change is a defining property of the sentence stress. For instance, rises that occur early in a Dutch syllable are interpreted as a sentence stress, but when late in a phrase-final syllable they are perceived as a H% boundary tone (’t Hart et al. 1990: 73).4

4 It is assumed here that ’t Hart et al.’s boundary-marking rise ‘2’ refers to the same phenomenon as the H% boundary tone (cf. Gussenhoven 2005: 139).

The alignment of the f0 changes differs characteristically
OUP CORRECTED PROOF – FINAL, 06/12/20, SPi
PHONETIC CORRELATES OF WORD AND SENTENCE STRESS 157 between languages (e.g. Arvaniti et al. 1998 for Greek versus Ladd et al. 2000 for Dutch), and even between dialects of the same language (e.g. van Leyden and van Heuven 2006). A secondary correlate of sentence stress has been found in temporal organization. Dutch words with sentence stress are lengthened by some 10% to 15%. Van Heuven’s (1998) experiments that manipulated focus independently of sentence stress ([+focus, −sentence stress] vs. [+focus, +sentence stress]) showed that durational effects are due to sentence stress, rather than to focus; effects of focus on duration are indirect and occur via the mapping of focus to sentence stress (focus-to-accent principle; Gussenhoven 1983a; Selkirk 1984; Ladd 2008b). Early work (e.g. Eefting 1991; Sluijter and van Heuven 1995; Turk and Sawusch 1997; Turk and White 1999; Cambier-Langeveld and Turk 1999) showed that sentence stress affects more than just the syllable with word stress. For example, both syllables in bacon (English) and panda (Dutch) are longer in contrastively focused contexts where the word bears sentence stress than when it does not. Experiments that manipulated the location of sentence stress in two-word target phrases (e.g. bacon force vs. bake enforce) showed that effects were largely restricted to the word bearing sentence stress (with the exception of much smaller spill-over effects, which can occur on a syllable immediately following the stressed syllable across a word boundary). The occurrence of longer durations on both syllables in words such as bacon led to the question of whether the sentence stress targeted a continuous domain, perhaps corresponding to the whole word, or to part of a word (e.g. a foot). However, findings for longer words, such as Dutch geˈkakel ‘cackling’ (Eefting 1991) and English ˈpresidency and ˌcondeˈscending (Dimitrova and Turk 2012), show that sentence stress lengthens particular parts of words more than others, specifically stressed syllable(s), word-onset consonant closure intervals, and final syllable rhyme intervals. These findings suggest that words bearing sentence stress are marked durationally in two ways, (i) by lengthening their stressed syllables (primary and secondary) and (ii) by lengthening their edges, in a similar way to well-documented effects of phrase-initial and phrase-final lengthening (e.g. Wightman et al. 1992; Fougeron and Keating 1997;). Additional spill-over effects that are small in magnitude (10% or less) can be observed on syllables immediately adjacent to stressed syllables. Van Bergem (1993) showed that sentence stress can cause spectral expansion of full vowels in Dutch comparable in magnitude to the spectral expansion effect of word stress. See also Summers (1987), de Jong et al. (1993), Beckman and Edwards (1992, 1994), Cho (2005), and Aylett and Turk (2006) for spectral expansion, differences in articulatory magnitudes, and resistance to coarticulation for different stress categories in English. In Germanic languages, then, sentence stress is signalled acoustically by all the properties of word stress. In addition to these properties that word stress and sentence stress share, sentence stress has prominence-lending f0 changes and some lengthening of (parts of) the word containing the sentence stress. These additional properties in particular discredit the theory that word stress is simply a reduced version of sentence stress (e.g. Chomsky and Halle 1968). 
The experimental results reported also show that a large change in f0, when appropriately aligned with the segments making up the syllable, is a powerful correlate of (sentence) stress, even though there is no consistent f0 contour that accompanies all sentence stresses.
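Because the thresholds above are stated interchangeably in semitones and in percentages, a quick conversion is useful. The following Python sketch (an illustration added here, not part of the original chapter) implements the standard semitone formula and checks the two figures quoted above.

```python
import math

def semitones(f_from_hz: float, f_to_hz: float) -> float:
    """Size of an f0 change on the semitone scale (12 semitones per octave)."""
    return 12.0 * math.log2(f_to_hz / f_from_hz)

# A 25% rise in Hz is very close to the four-semitone (major third) criterion:
print(round(semitones(100.0, 125.0), 2))   # 3.86
# Fry's (1958) 97 -> 104 Hz change is roughly one semitone:
print(round(semitones(97.0, 104.0), 2))    # 1.21
```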
10.4 Perceptual cues of word and sentence stress

In §10.2 and §10.3, we have seen that word and sentence stress are acoustically marked by at least five different correlates—that is, longer duration, higher (peak or mean) intensity, flatter spectral tilt, more extreme formant values, and, in the case of sentence stress, a prominence-lending change in f0.

Studies on the perceptual cues for stress employ synthetic speech or resynthesized natural speech in which typically two stress-related parameters are varied systematically. The range of variation of each parameter is representative of what is found in stressed and unstressed tokens produced in natural speech. For each parameter the range is then subdivided into an equal number of steps (e.g. five or seven). The relative perceptual strength of a parameter is quantified as the magnitude of the cross-over from stress to non-stress and as the width or steepness of the psychometric function describing the cross-over.5 By running multiple experiments constructed along the same lines, a generalized rank order of perceptual stress cues will emerge.

For instance, Fry published a series of three experiments comparing the perceptual strength of vowel duration (as a baseline condition) with that of three other parameters: peak intensity (Fry 1955), f0 (Fry 1958), and vowel quality (Fry 1965).6 Fry (1955) varied the durations of V1 and V2 in synthesized tokens of English minimal stress pairs such as import–import in five steps. The target words were embedded in a fixed carrier sentence Where is the accent in . . ., with sentence stress on the target; the f0 was 120 Hz throughout the sentence. The duration steps were systematically combined with five intensity differences (by amplifying V1 and at the same time attenuating V2) such that the V1–V2 difference varied between +10 and −10 dB. Figure 10.1a presents perceived initial stress for the five duration steps (averaged over words and intensity steps) and for the five intensity steps (averaged over words and duration ratios). The results show a cross-over from stress perceived on the first syllable to the second syllable. The cross-over takes place between duration steps 2 and 3 and is both steep (within one stimulus step) and convincing (≥ 75% agreement on either side of the boundary). In contrast to this, the intensity difference is inconsequential: although there is a gentle trend for more initial stress to be perceived as V1 has more decibels than V2, the difference is limited to some 20 percentage points; the boundary width, which can only be estimated by extrapolation, would be some 15 times larger than for duration. This shows that duration outweighs intensity in Fry’s experiment roughly by a factor of 15. See Turk and Sawusch (1996) for similar findings.

Figure 10.1b shows the results of a similar experiment run by Sluijter et al. (1997) for a single Dutch minimal stress pair, the reiterant non-word nana. The results are the same as in English. Van Heuven (2014, 2018) showed that manipulating the duration of a consonant was largely inconsequential for stress perception in disyllabic Dutch reiterant non-words.

5 The width (or steepness) is ill-defined if the cross-over from one percept to the other is incomplete.
6 Because the cue value of vowel quality was very weak, Fry (1965) limited the range of duration variation severely relative to the earlier two experiments.
[Figure 10.1 near here. Both panels plot the percentage of initial stress perceived (0–100%) against the intensity difference V1−V2 (five steps from −10 to +10 dB in panel a; seven steps from −3 to +3 dB in panel b) and the duration ratio V1/V2 (0.25–2.25 in panel a; 0.47–1.35 in panel b).]

Figure 10.1 Initial stress perceived (%) as a function of intensity difference between V1 and V2 (in dB) and of duration ratio V1/V2 in minimal stress pairs (a) in English, after Fry (1955), and (b) in Dutch, after van Heuven and Sluijter (1996).
An almost complete cross-over from initial to final stress perception was achieved nevertheless by shortening or lengthening either the onset or the coda consonant in the first syllable by 50%, but only if the syllable contained a short vowel. Consonant changes in the second (final) syllable, or in syllables with long vowels, had no perceptual effects.

Sluijter et al. (1997) showed that intensity differences become (much) more important as perceptual stress cues if the energy differences are concentrated in the higher frequency bands (above 500 Hz), which is tantamount to saying that a flatter spectral slope cues stress. Under normal listening conditions vowel duration remained the stronger cue, but manipulating the spectral slope became almost as effective when stimuli were presented with a lot of reverberation (so that the segment durations were poorly defined).

Fry’s (1965) results indicate that spectral vowel reduction was only a weak stress cue in English noun–verb pairs (contract, digest, object, subject), where stress was less likely to be perceived on the syllable with reduced vowel quality; the tendency was somewhat stronger when the vowel quality was reduced in the F2 dimension (backness and rounding) than in the F1 dimension (height), and was strongest when both quality dimensions were affected simultaneously. The effect of vowel quality was small and did not yield a convincing cross-over: the percentage of initial-stress responses varied between 45% and 60%. The effect of vowel duration was clearly much stronger. Even with the smaller range of duration variation adopted in this experiment, a convincing cross-over was obtained, spanning more than 50 percentage points. Fry’s results were confirmed for Dutch in a more elaborate study by van Heuven and de Jonge (2011). Vowel reduction was taken as a cue for non-stress only when the duration ratio was ambiguous between initial and final stress.

Fry (1958) found that f0 changes were stronger perceptual stress cues than duration changes (see van Heuven 2018 for a detailed analysis of this experiment and a similar one by Bolinger 1958). For Dutch, a properly timed pitch change will always attract the perception of stress. This cue cannot therefore be neutralized by any combination of cues suggesting stress on a different syllable (van Katwijk 1974; see also van Heuven 2018). The upshot of the experiments on English and Dutch is that f0 provides a very strong cue to stress perception, overriding all (combinations of) other cues, provided the f0 change is properly aligned.
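To make the cross-over magnitude and boundary width of §10.4 concrete, the sketch below fits a logistic psychometric function with SciPy; the response proportions are invented for illustration and are not data from the studies cited.

```python
import numpy as np
from scipy.optimize import curve_fit

def logistic(x, x0, k):
    """Psychometric function: x0 = cross-over point, |k| = steepness."""
    return 1.0 / (1.0 + np.exp(-k * (x - x0)))

steps = np.arange(1.0, 6.0)                            # five stimulus steps
p_initial = np.array([0.95, 0.85, 0.40, 0.15, 0.05])   # invented proportions

(x0, k), _ = curve_fit(logistic, steps, p_initial, p0=[3.0, -1.0])
# Boundary width can be taken as inversely proportional to the slope |k|;
# the cross-over magnitude is the response range spanned on either side of x0.
print(f"cross-over at step {x0:.2f}, steepness {abs(k):.2f}")
```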
An appropriately aligned pitch change is the hallmark of sentence stress, and since it is exclusively a property of sentence stress (i.e. not of word stress), this explains why the effects of the f0 change cannot be counteracted by other cues.

The overall conclusion of this section is that the strength of acoustic correlates of stress and the strength of their perceptual cue values do not correlate. This is for two reasons. First, the location of an f0 change is a strong correlate of stress in speech production, but it can only yield reliable automatic detection if the f0 change exceeds a threshold of three to four semitones and if it is appropriately aligned with the segmental structure. When words do not receive sentence stress, the f0 change is no longer a reliable correlate. A smaller f0 change may still be effective as a cue for sentence stress as long as it is noticeably larger than the f0 movements associated with micro-intonation (see §10.3). A change from 97 to 104 Hz (roughly one semitone) was enough to evoke final-stress perception, while the reverse change of the same magnitude yielded initial stress (Fry 1958, experiment 1). Therefore, f0 change may be perceptually the strongest cue, but it is acoustically unreliable. Second, the human listener does not rely on uniform intensity differences between stressed and unstressed syllables. This probably makes intensity the weakest perceptual cue of all, even though it is acoustically quite reliable.7 Differences in vowel duration are both perceptually strong and acoustically highly reliable, for both word stress and sentence stress.

7 The relative unresponsiveness of the human hearing mechanism to differences in intensity in a linguistic context has been known for over a century. The first to comment on this phenomenon was Saran (1907). See also Mol and Uhlenbeck (1956: 205–213, 1957: 346) and Bolinger (1958: 114).
10.5 Cross-linguistic differences in phonetic marking of stress

There has been some speculation on the question of whether every language that uses the linguistic parameter of stress also uses the same acoustic correlates, with the same order of relative importance, and the same perceptual cues to stress. The general view is that different correlates (and different perceptual cues) are employed depending on the structure of the language under analysis. In this section we discuss two sets of differences between languages and their potential consequences for stress marking. The first concerns the type of stress system a language employs; the second is located in the relative exploitation within a language of stress parameters for other linguistic contrasts.
10.5.1 Contrastive versus demarcative stress

It seems reasonable to assume that languages with fixed stress have a smaller need for strongly marked stress positions than languages in which the position of the stressed syllable varies from word to word. In the latter type, the position of the stress within the word is a potentially contrastive property, whereas in the former type words are never distinguished by the position of the stress, which is the same for all the words in the language.8 We would predict, therefore, that the size of the f0 movements does not vary as a function of the type of word-stress system of the language, but that in fixed-stress languages the difference between stressed and unstressed syllables in non-focused words is less clearly marked along all the non-f0 parameters correlating with word stress.

There is some evidence that this basic prediction is correct. Dogil and Williams (1999) presented a comparative study of stress marking in Polish (fixed penultimate stress) and German (quantity-sensitive plus lexical stress), and concluded that stress position is less clearly marked in Polish. Similar results were found more recently in a strictly controlled cross-linguistic study of Spanish and Greek (with contrastive stress) versus Hungarian (fixed initial stress) and Turkish (fixed final stress) by Vogel et al. (2016). Their results show that the same set of acoustic stress parameters (applied in the same manner across the four languages) affords good to excellent automatic classification of stressed and unstressed syllables at the word level for the two contrastive-stress languages but not for the fixed-stress languages.
10.5.2 Functional load hypothesis

Berinstein (1979) was the first to formulate the Functional Load Hypothesis (FLH) of stress marking. The FLH predicts that stress parameters will drop to a lower rank in the hierarchy of stress cues when they are also employed elsewhere in the phonology of the language. For instance, if a language has a length contrast in the vowel system, vowel duration—which is normally a strong cue for stress—can no longer function effectively in the signalling of stress.

Berinstein (1979) is often quoted in support of the FLH (e.g. Cutler 2005). In fact, Berinstein’s claim is contradicted by her own results. Languages with long versus short vowels (English, K’ekchi) were found to exploit duration as a stress cue as (in)effectively as similar languages without a vowel length contrast (Spanish, Kaqchikel). For a detailed analysis of Berinstein’s results, see van Heuven (2018). Similarly, Vogel et al. (2016) compared the strength of correlates of word and sentence stress in Hungarian, Spanish, Greek, and Turkish. For Hungarian, which has a vowel length contrast that is lacking in the other three languages, the FLH predicts a lower rank of duration as a stress cue, which was not in fact found. This null result was confirmed by Lunden et al. (2017), who found no difference in the use of duration as a stress cue between languages with and without segmental length contrasts in their database of 140 languages. These results suggest that the FLH by itself does not determine the overall ranking of particular stress cues in individual languages.

There is some evidence, however, that functional load may be involved in determining the magnitudes of effects in some contexts in some languages, preventing the use of stress correlates in production from disrupting the use of the same acoustic correlates in signalling lexical contrast. For example, Swedish is a language with phonemic vowel length distinctions; according to the FLH this language would be expected to make little use of duration to signal stress. However, Swedish does use duration to signal sentence stress, but only for phonemically long vowels (Heldner and Strangert 2001). In this way, phonemic contrasts are maintained. The FLH would therefore need to be weakened to accommodate these findings.

Although there seems little support for the original, strong version of the FLH as it relates to stress versus segmental quantity correlates, the situation may well be different when stress-related parameters are in competition with f0 contrasts. What, for instance, if a language has both stress and lexical tone? In such cases, it might be more difficult for the listener to disentangle the various cues for the competing contrasts. Potisuk et al. (1996) investigated the acoustic correlates of sentence stress in Thai, a language with five different lexical tones and a vowel length contrast. Fundamental frequency should not be a high-ranking correlate of (sentence) stress, given that f0 is the primary correlate of lexical tone. Duration should not be an important stress cue since it is already implicated in the vowel length contrast. Automatic classification of syllables as stressed versus unstressed was largely unsuccessful when based on f0, while intensity was not significantly affected by stress. Duration proved the strongest stress correlate, yielding 99% correct stress decisions. These results, then, are in line with the idea developed above that stress parameters can be used simultaneously in segmental quantity and stress contrasts but not in simultaneous stress and tone contrasts. This was confirmed by Remijsen (2002) for Samate Ma’ya, a Papuan language that has both lexical tone and stress but no vowel length contrast. Acoustic correlates of stress were the f0 contour, vowel quality (expansion/reduction), loudness (intensity weighted by frequency band), and duration. Remijsen’s results reveal a perfect inverse relationship between the parameters’ positions in the rank orders of cues for stress and tone.9

The original FLH idea, as formulated by Berinstein (1979), Hayes (1995), and Potisuk et al. (1996), was that stress correlates cannot be effectively used if they are also involved in lexical contrasts, whether tonal or segmental in nature. This, then, would seem to be too strong a generalization.10 The FLH appears to make sense only where the same parameters are in competition between stress and lexical tone contrasts.

8 The assumption is that the word-boundary cue imparted by fixed stress is highly redundant and can be dispensed with even in noisy listening conditions, whereas contrastive stress provides vital information to the word recognition process (see Cutler 2005 for a detailed discussion of the role of stress in word recognition).
10.6 Conclusion

In this chapter we reviewed the acoustic correlates of word and sentence stress (§10.2 and §10.3, respectively), drawing mainly on studies of European languages. We concentrated on the marking of primary stress at both levels, leaving the marking of secondary and lower stress levels largely untouched. In the rank order of stress correlates that emerged, the effects of stress are most reliably seen in (relatively) longer vowel duration, followed by greater intensity, more spectral expansion, and flatter spectral tilt. A change in f0 does not reliably correlate with word stress but, if appropriately timed and larger than three to four semitones, is a unique and highly reliable marker of sentence stress.

In §10.4 we examined the perceptual cues for stress. It was found that an appropriately timed change in f0 is the strongest cue for stress, such that it can counteract any (combination of) other cues suggesting stress on a different syllable. We argue that this is because the f0 cue is the unique marker of sentence stress, and sentence stress outranks word stress. The rank order of acoustic correlates of stress is therefore not necessarily the same as the order of importance of perceptual cues. We interpret the findings as evidence suggesting that word and sentence stress are different phenomena with different communicative functions, rather than that word stresses are just lower degrees of sentence stress.

In §10.5 we asked whether stress is cued by the same acoustic parameters, in the same order of magnitude, in all languages. The available data suggest that, overall, stress is acoustically less clearly marked in languages with fixed stress than in languages in which the stress position varies between words. No cross-linguistic support was found for the claim that stress cues become less reliable or less salient when they are implicated in segmental length contrasts. However, a weaker version of this FLH may remain viable, since (sentence) stress and (lexical) tone do draw on partially shared prosodic parameters.

9 Since Mandarin Chinese is a tone language, the FLH predicts an avoidance of the use of f0 in signalling focus. However, Mandarin does use expanded f0 range as a correlate of sentence stress marking focus (Xu 1999). Fundamental frequency range and tone shape therefore seem to operate as independent parameters. Similarly, f0 is used in Germanic languages to signal sentence stress as well as boundary tone without competition because tone shape and alignment are separate parameters.
10 In a recent meta-analysis, Lunden et al. (2017) presented results from a database of reported stress correlates and use of contrastive duration for 140 languages, and found no support for the FLH.
Appendix: Measuring correlates of stress using the Praat speech processing software

The program Praat (Boersma and Weenink 1996) can be downloaded at no cost from www.praat.org. No scripting is assumed here. Results of measurements can be copy-pasted from the information window to a spreadsheet.
Measuring duration

D1. Read the sound file into Praat. Select the Sound object and in the editor drag the portion of the waveform that corresponds exactly to the target vowel or consonant.
D2. Click Query > Get selection length for target duration.
Measuring intensity

I1. As D1.
I2. Perform step P1 (under ‘Measuring pitch correlates’ below) to set the appropriate window size.
I3. Under Intensity, click Get intensity for mean intensity.
I4. Click Get maximum intensity for peak intensity.
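For readers who do want to script these measurements, the sketch below shows a possible equivalent of steps D1–D2 and I1–I4 using the third-party praat-parselmouth Python library (pip install praat-parselmouth); the file name and interval boundaries are hypothetical placeholders.

```python
import parselmouth
from parselmouth.praat import call

snd = parselmouth.Sound("utterance.wav")   # hypothetical sound file
t1, t2 = 0.42, 0.57                        # hypothetical vowel boundaries (s)

duration = t2 - t1                         # D2: selection length in seconds

intensity = snd.to_intensity(minimum_pitch=75.0)
mean_db = call(intensity, "Get mean", t1, t2, "energy")        # I3
peak_db = call(intensity, "Get maximum", t1, t2, "Parabolic")  # I4
print(duration, mean_db, peak_db)
```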
Measuring spectral tilt

T1. As D1, then under File click Extract selected sound (time from 0).
T2. In the Objects window click Analyse spectrum > To Ltas…, Bandwidth = 100 Hz, OK.
T3. In the Objects window click Compute trend line…, Frequency range from 50 to 4,000 Hz, OK.
T4. In the Objects window click Query > Report spectral tilt, 0 to 4,000 Hz, linear, robust, OK.
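A scripted shortcut, shown below, computes a band-energy proxy for spectral tilt with praat-parselmouth by comparing energy below and above 500 Hz, in the spirit of Sluijter et al.’s (1997) band analysis. Note that this is a substitute for, not a reproduction of, the Ltas trend-line procedure in steps T1–T4; the file and times are placeholders.

```python
import math
import parselmouth
from parselmouth.praat import call

# Extract the target vowel (placeholder times) and compute its spectrum.
vowel = parselmouth.Sound("utterance.wav").extract_part(from_time=0.42, to_time=0.57)
spectrum = vowel.to_spectrum()

low = call(spectrum, "Get band energy", 0.0, 500.0)      # energy below 500 Hz
high = call(spectrum, "Get band energy", 500.0, 4000.0)  # energy above 500 Hz
tilt_db = 10.0 * math.log10(high / low)  # less negative = flatter tilt
print(tilt_db)
```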
Measuring formants F1, F2 (for vowels and sonorant consonants)

F1. As D1. Then set the spectrogram window by checking Show spectrogram (under Spectrum); view range from 0 to 10,000 Hz, Window length = 0.005 s, Dynamic range = 40 dB, OK.
F2. In the spectrogram window, drag the (spectrally stable) portion of the waveform you want to analyse.
F3. Under Formant, check Show formants. Set parameters Maximum formant = 5,000 Hz, Number of formants = 5, Window length = 0.025 s, Dynamic range = 30 dB, Dot size = 1 mm, OK.
F4. Visually check that the formant tracks coincide with the energy bands in the spectrogram (if not, adjust the settings by changing Maximum formant and/or Number of formants).
F5. Under Formant click Get first formant (or press F1 on the keyboard) and Get second formant (or press F2).
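Steps F1–F5 can likewise be scripted; the following praat-parselmouth sketch queries F1 and F2 at the midpoint of the selected interval (file name and boundaries are again placeholders).

```python
import parselmouth

snd = parselmouth.Sound("utterance.wav")   # hypothetical sound file
t1, t2 = 0.42, 0.57                        # hypothetical vowel boundaries (s)

formants = snd.to_formant_burg(max_number_of_formants=5.0, maximum_formant=5000.0)
mid = 0.5 * (t1 + t2)
f1 = formants.get_value_at_time(1, mid)    # F1 in Hz at the interval midpoint
f2 = formants.get_value_at_time(2, mid)    # F2 in Hz at the interval midpoint
print(f1, f2)
```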
Measuring noise spectra (for fricatives, stops, and affricates)

N1. Perform T1 for the portion of the waveform that corresponds to the noise burst you want to analyse.
N2. In the Objects window click Filter…, From 1,000 Hz, To 10,000 Hz, Smoothing 100 Hz, OK.
N3. In the Objects window click Analyse spectrum > To Spectrum (fast), OK.
N4. Query Get centre of gravity…, Power = 2; Get standard deviation…, Power = 2; Get skewness…, Power = 2; Get kurtosis…, Power = 2.
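A possible scripted version of steps N1–N4 with praat-parselmouth is sketched below; the file, times, and filter band are placeholders.

```python
import parselmouth
from parselmouth.praat import call

# Extract the noise burst (placeholder times), band-pass filter (N2),
# compute the spectrum (N3), and query the four spectral moments (N4).
burst = parselmouth.Sound("utterance.wav").extract_part(from_time=0.80, to_time=0.86)
filtered = call(burst, "Filter (pass Hann band)", 1000.0, 10000.0, 100.0)
spectrum = filtered.to_spectrum(fast=True)

cog = call(spectrum, "Get centre of gravity", 2.0)
sd = call(spectrum, "Get standard deviation", 2.0)
skew = call(spectrum, "Get skewness", 2.0)
kurt = call(spectrum, "Get kurtosis", 2.0)
print(cog, sd, skew, kurt)
```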
Measuring pitch correlates

P1. Read the sound file into Praat. Select the sound file in the Praat objects list. Click View and Edit. Ask for a pitch display by checking the box Show pitch (under Pitch). Adjust settings to the speaker’s voice: click Pitch settings… > Pitch range = 75 Hz, 250 Hz (for a male voice; double these frequencies for a female voice), Unit = Hertz, Analysis method = Autocorrelation, Drawing method = Speckles.
P2. In the editor drag the time interval in which you wish to locate a pitch maximum and minimum. Click Query > List (or press F5).
P3. In the listing delete all lines except the ones with the f0 maximum and minimum. Copy and paste the time–frequency coordinates. The minimum precedes the maximum for an f0 rise, but follows the maximum for an f0 fall. Note: complex f0 changes (e.g. rise–fall) are analysed separately for the rise and fall portions; the time–frequency coordinates of the maximum will be the same for the rise and the fall. Hertz values can be converted offline to either semitones or (better still) equivalent rectangular bandwidth (ERB) units.11
P4. In the waveform, locate the vowel onset (or some other segmental landmark you want to use for your alignment analysis) of the target syllable. Query > Get cursor. Store the time coordinate of the segmental landmark (this will be needed later offline to measure the alignment of the pitch change).
11 The ERB scale is preferred when the f0 interval is the correlate of perceived prominence (Hermes and van Gestel 1991). The semitone scale is more appropriate for f0 as the correlate of lexical or boundary tones (Nolan 2003).
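Finally, steps P1–P3, together with the offline conversions mentioned in P3 and note 11, might be scripted as follows. This praat-parselmouth sketch uses placeholder file and times, and the ERB-rate function is Glasberg and Moore’s (1990) approximation.

```python
import math
import parselmouth
from parselmouth.praat import call

snd = parselmouth.Sound("utterance.wav")   # hypothetical sound file
t1, t2 = 0.42, 0.57                        # hypothetical target interval (s)

pitch = snd.to_pitch(pitch_floor=75.0, pitch_ceiling=250.0)  # male-voice range

f0_max = call(pitch, "Get maximum", t1, t2, "Hertz", "Parabolic")
t_max = call(pitch, "Get time of maximum", t1, t2, "Hertz", "Parabolic")
f0_min = call(pitch, "Get minimum", t1, t2, "Hertz", "Parabolic")
t_min = call(pitch, "Get time of minimum", t1, t2, "Hertz", "Parabolic")

def hz_to_semitones(f_from_hz, f_to_hz):
    """Interval between two frequencies on the semitone scale."""
    return 12.0 * math.log2(f_to_hz / f_from_hz)

def hz_to_erb_rate(f_hz):
    """ERB-rate (Glasberg and Moore 1990 approximation)."""
    return 21.4 * math.log10(0.00437 * f_hz + 1.0)

excursion_st = hz_to_semitones(f0_min, f0_max)
excursion_erb = hz_to_erb_rate(f0_max) - hz_to_erb_rate(f0_min)
# Minimum before maximum indicates a rise; the reverse indicates a fall (P3).
print(t_min, f0_min, t_max, f0_max, excursion_st, excursion_erb)
```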
Chapter 11: Speech Rhythm and Timing

Laurence White and Zofia Malisz
11.1 Introduction

Rhythm is a temporal phenomenon, but how far speech rhythm and speech timing are commensurable is a perennial debate. Distinct prosodic perspectives echo two Ancient Greek conceptions of time. First, chronos (χρόνος) signified time’s linear flow, measured in seconds, days, years, and so on. Temporal linearity implicitly informs much prosody research, wherein phonetic events are interpreted with respect to external clocks and surface timing patterns are expressible through quantitative measures such as milliseconds. Second, and by contrast, kairos (καιρός) was a more subjective notion of time as providing occasions for action, situating events in the context of their prompting circumstances. Kairos was invoked in Greek rhetoric: what is spoken must be appropriate to the particular moment and audience. Rhythmic approaches to speech that might be broadly classified as ‘dynamical’ reflect—to varying degrees—this view of timing as emerging from the intrinsic affordances occasioned by spoken interaction.

Interpretation of observable timing patterns is complicated by the fact that vowel and consonant durations are only approximate indicators of the temporal coordination of articulatory gestures, although there is evidence that speakers do manipulate local surface durations for communicative goals (e.g. signalling boundaries and phonological length; reviewed by Turk and Shattuck-Hufnagel 2014). Furthermore, perception of speech’s temporal flow is not wholly linear. For example, Morton et al. (1976) found that a syllable’s perceived moment of occurrence (‘P-centre’) is affected by the nature of its sub-constituents. Moreover, variation in speech rate can affect the perception of a syllable’s presence or absence (Dilley and Pitt 2010) and the placement of prosodic boundaries (Reinisch et al. 2011). Thus, surface timing patterns may have non-linear relationships both to underlying control structures and to listeners’ perceptions of prominence and grouping.

More generally, the term ‘speech rhythm’, without qualification, can cause potentially serious misunderstandings because ‘“rhythm” carries with it implicit assumptions about the way speech works, and about how (if at all) it involves periodicity’ (Turk and Shattuck-Hufnagel 2013: 93). Various definitions of rhythm applied to speech, and the timing thereof, are considered by Turk and Shattuck-Hufnagel: periodicity (surface, underlying, perceptual), phonological/metrical structure, and surface timing patterns. In this chapter we do not attempt a single definition of speech rhythm but review some of these diverse perspectives and consider whether it is appropriate to characterize the speech signal as rhythmical (for other definitions, see e.g. Allen 1975; Cummins and Port 1998; Gibbon 2006; Wagner 2010; Nolan and Jeon 2014).

With such caveats in mind, the remainder of this section reviews four aspects of speech that may influence perceptions of rhythmicity: periodicity, alternation between strong and weak elements, hierarchical coordination of timing, and articulation rate. §11.2 discusses attempts to derive quantitative indices of rhythm typology. §11.3 contrasts two approaches to speech timing, one based on linguistic structure and localized lengthening effects and the other on hierarchically coupled metrical units, and §11.4 considers the prospects for a synthesis of such approaches. We do not attempt a definitive summary of empirical work on speech rhythm and timing (for reviews, see e.g. Klatt 1976; Arvaniti 2009; Fletcher 2010; White 2014) but aim to highlight some key theoretical concepts and debates informing such research.
11.1.1 Periodicity in surface timing

Before technology made large-scale analyses of acoustic data tractable, descriptions of speech timing were often impressionistic, with terminology arrogated from traditional poetics. In particular, the assumption that metrical structure imposes global timing constraints has a long history (Steele 1779). A specific timing constraint that proved pervasively influential was ‘isochrony’, the periodic recurrence of equally timed metrical units such as syllables or stress-delimited feet. Classe (1939), while maintaining that isochrony is an underlying principle of English speech, concluded from his data that ‘normal speech [is] on the whole, rather irregular and arrhythmic’ (p. 89), due to variation in the syllable number and phonetic composition of stress-delimited phrasal groups, as well as to grammatical structure. Pike (1945) contrasted typical ‘stress-timed’ English rhythm with ‘syllable-timed’ Spanish rhythm, while asserting that stylistic variation could produce ‘syllable-timed’ rhythm in English. Abercrombie (1967) formalized ‘rhythm class’ typology, asserting that all languages were either syllable-timed (e.g. French, Telugu, Yoruba) or stress-timed (e.g. Arabic, English, Russian).

Isochronous mora-timing has been claimed for Japanese, among other languages (Ladefoged 1975). The mora is a subsyllabic constituent (e.g. consonant plus short vowel), with somewhat language-specific definitions, and is important in Japanese poetics (e.g. haiku comprise 17 morae), where syllables with long vowels or consonantal rhymes constitute two morae. Apparently by extension from poetry (cf. syllables in French and Spanish, stress feet in English and German), spoken Japanese morae were assumed to be isochronous (e.g. Bloch 1950). Some data suggested approximate mora-timing but with deviations due to the mora’s internal structure (Han 1962) and utterance position (longer morae phrase-finally; Kaiki and Sagisaka 1992). Warner and Arai’s (2001) review concluded that Japanese mora duration is not isochronous, and that relatively regular mora-timing—when observed—is due to contingent features such as syllable phonotactics.
The ‘rhythm class’ concept persisted despite much evidence (e.g. Bertinetto 1989; Eriksson 1991) demonstrating the lack of isochrony of syllables or stress-delimited feet in surface speech timing. In a proleptic challenge to the syllable-timing hypothesis, Gili Gaya (1940; cited in Pointon 1980) observed that Spanish syllable duration is strongly affected by structural complexity, stress, and utterance position. Pointon (1980), reviewing Spanish timing studies, concluded that syllable duration is determined bottom-up—what he called an ‘antirhythmic’ or ‘segment-timed’ pattern—and found further support in a study of six Spanish speakers (Pointon 1995; see also Hoequist 1983, contra Spanish syllable-timing). Roach (1982) found similar correlations between interstress interval duration and syllable counts in Abercrombie’s (1967) ‘stress-timed’ and ‘syllable-timed’ languages, with variance measures of syllable and interstress interval duration failing to support the categorical typology. Although the elementary design and single speaker per language limit interpretation of Roach’s study, it proved influential for the use of variance measures of interval duration, later adopted in ‘rhythm metrics’, and for challenging the rhythm class hypothesis.
11.1.2 Contrastive rhythm

Brown (1911) distinguished ‘temporal rhythm’—the regular recurrence of structural elements (here termed ‘periodicity’)—from ‘accentual rhythm’, the relative prominence of certain structural elements (for similar distinctions, see, inter alia, Allen 1975; Nolan and Jeon 2014; White 2014). As discussed above, speech usually lacks periodicity in surface timing, but many languages contrast stronger and weaker elements through lexical stress (relative within-word syllable prominence) and phrasal accent (relative within-phrase word prominence, also called ‘sentence stress’). Here we use the term ‘contrastive rhythm’ rather than ‘accentual rhythm’ (to avoid confusion about the nature of the contrast: lexical stress or phrasal accent).

Dauer (1983), in an influential discussion, elaborated upon Roach’s (1982) suggestion that cross-linguistic rhythmic differences may inhere in structural regularities such as vowel reduction and syllable complexity, and their relation with syllable stress. In particular, Dauer observed that the phonetic realization of stressed syllables and their (lexically or syntactically determined) distribution conspire to make (for example) English and Spanish seem rhythmically distinct. Most Spanish syllables have consonant–vowel (CV) structure, whereas the predominant English syllable structure is CVC, and up to three onset consonants and four coda consonants are permissible. Moreover, the association between lexical stress and syllable weight (related to coda cluster complexity) is stronger for English, and also for Arabic and Thai, than for Spanish. Additionally, unstressed syllables undergo minimal vowel reduction in Spanish, but most English unstressed syllables contain a reduced vowel, predominantly schwa (Dauer noted unstressed vowel reduction also for Swedish and Russian). All these patterns converge towards a high durational contrast between English strong and weak syllables. Furthermore, English stressed syllables tend to recur with relative regularity, particularly given the existence of secondary lexical stress, while long unstressed syllable sequences are more likely in Greek, Italian, and Spanish (Dauer 1983). Structural trends do not converge onto a high–low contrast gradient for all languages, however: for example, Polish has high syllable complexity but limited vowel reduction, while Catalan has low syllable complexity but significant vowel reduction (Nespor 1990).
In part due to the language’s recent status as a scientific lingua franca, analytical concepts derived from English linguistics have sometimes guided the characterization of other languages. Thus, much early comparative field linguistics had a guiding assumption that ‘stress’ was universally meaningful. In fact, English—particularly standard southern British English—seems a conspicuously ‘high-contrast’ language in terms of lexical stress and also phrasal accent. Comparisons of global timing properties between selected languages often show English to have the highest variation in vowel duration (e.g. Ramus et al. 1999; White and Mattys 2007a). In Nolan and Asu’s (2009) terminology, English has a markedly steep ‘prominence gradient’. Even other Germanic languages, such as Dutch, have sparser occurrence of reduced vowels in unstressed syllables (Cutler and van Donselaar 2001). However, stress is—manifestly—linguistically important in many languages lacking stress cues as marked as those of English: thus, Dauer (1983) observed that while stress-related duration contrasts are substantially greater in English than in Spanish, combinations of cues make stressed syllables in Spanish, Greek, or Italian salient to native listeners (in contrast with French, which lacks lexical stress). Indeed, Cumming (2011b) suggested that languages may appear less rhythmically distinct once prosodic perceptual integration is taken into account (see also Arvaniti 2009).

On the other hand, it is also becoming clear that many languages may lack metrically contrasting elements (e.g. Korean: Jun 2005b; Ambonese Malay: Maskikit-Essed and Gussenhoven 2016; see Nolan and Jeon 2014 for references questioning the status of stress in several languages). Tabain et al. (2014) suggested the term ‘stress ghosting’ to highlight how Germanic language speakers’ native intuitions may induce stress perception in languages unfamiliar to them. Stress ghosting arises due to misinterpretation of phonetic or structural patterns that would be associated with prominence in languages—such as Dutch, English, or German—with unambiguous lexical stress contrast (e.g. English ˈinsight vs. inˈcite). By contrast, native speakers of languages without variable stress placement as a cue to lexical identity have been characterized as having ‘stress deafness’ (Dupoux et al. 2001). Specifically, speakers of languages that either lack lexical stress (e.g. French) or have non-contrastive stress (e.g. Finnish or Hungarian fixed word-initial stress) do not appear to retain stress patterns of non-words in short-term memory, suggesting that their phonological representations do not include stress (Peperkamp and Dupoux 2002; see also Rahmani et al. 2015). Thus, the notion of contrastive rhythm, while pertinent for some widely studied linguistic systems, may be inapplicable for many languages (Nolan and Jeon 2014).
11.1.3 Hierarchical timing

Unlike typologies based on isochrony of morae, syllables, or stressed syllables (reviewed above), hierarchical timing approaches do not identify a single privileged unit strictly governing any language’s surface timing. They describe relative timing dependencies between at least two hierarchically nested constituents—for example, the syllable and the stress-delimited foot (e.g. O’Dell and Nieminen 1999). The syllable (or syllable-sized unit, e.g. the vowel-to-vowel interval) is regarded as a basic cyclic event in speech perception or production (Fowler 1983) and the smallest beat-induction speech unit (Morton et al. 1976). With regard to the stress-delimited foot, various definitions have been proposed, sometimes related to the metrical structure of particular languages, with a key distinction being whether or not the foot respects word boundaries (e.g. Eriksson 1991; Beckman 1992; Bouzon and Hirst 2004).

Analyses of timing relationships between hierarchically nested constituents were developed from Dauer’s (1983) findings that, in various languages, stress foot duration is neither independent of syllable number (the expectation based on foot isochrony) nor an additive function of syllable number (the expectation based on syllable isochrony). Eriksson (1991) further explored Dauer’s data on the positive relationship between total foot duration and syllable number. The durational effect of adding a syllable to the foot (i.e. the slope of the regression line) was similar for all five of Dauer’s (1983) languages. However, the intercept differed between putative ‘rhythm classes’ (‘syllable-timed’ Greek, Italian, Spanish: ~100 ms; ‘stress-timed’ English, Thai: ~200 ms). Eriksson claimed that the natural interpretation of the intercept variation was that the durational difference between stressed and unstressed syllables is greater in English and Thai than in Greek, Italian, or Spanish. However, as Eriksson observed (also O’Dell and Nieminen 1999), the positive intercept does not itself indicate where the durational variation takes place. Eriksson further noted an inverse relationship between the number of syllables in the foot and the average duration of those syllables. Similarly, Bouzon and Hirst (2004) found sub-additive relationships between several levels of structure in British English: syllables in a foot; phones in a syllable; feet in an intonational unit.

These linear relationships between foot duration and the number of sub-constituents, with positive slope and intercept coefficients of the linear function, can—as described in more detail in §11.3.2—be modelled as systems of coupled oscillators (e.g. O’Dell and Nieminen 1999, at the syllable and foot levels). Other approaches that relate surface timing to the coupled interaction of hierarchically nested constituents include work on the coordination of articulatory gestures within syllables and prosodic phrases (e.g. Byrd and Choi 2010) and within syllables and feet (Tilsen 2009).
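To make the slope-and-intercept reasoning concrete, the sketch below fits the linear foot-duration function with NumPy; the durations are invented values loosely echoing Eriksson’s pattern (similar slopes, intercepts of roughly 100 vs. 200 ms), not measurements from any cited study.

```python
import numpy as np

n_syll = np.array([1, 2, 3, 4, 5])
# Invented foot durations (ms): same slope, different intercepts.
stress_timed = np.array([310, 420, 530, 640, 750])      # intercept ~200 ms
syllable_timed = np.array([210, 320, 430, 540, 650])    # intercept ~100 ms

for label, dur in [("stress-timed", stress_timed),
                   ("syllable-timed", syllable_timed)]:
    slope, intercept = np.polyfit(n_syll, dur, 1)
    # Mean syllable duration shrinks as the foot gains syllables (sub-additivity).
    mean_syll = dur / n_syll
    print(label, round(slope), round(intercept), mean_syll.round(0))
```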
11.1.4 Articulation rate

Cross-linguistic variations in predominant syllable structures (Dauer 1983) are associated with systematic differences in ‘articulation rate’, defined as syllables per second excluding pauses (given that pause frequency and duration significantly affect overall speech rate; Goldman-Eisler 1956). Estimated rates vary between studies due to the spoken materials, the accents chosen for each language, and speaker idiosyncrasies. Stylistic and idiosyncratic effects notwithstanding, languages with predominantly simple syllable structures, such as Spanish, tend to be spoken at a higher syllables-per-second rate than languages with more complex syllable structures, such as English (White and Mattys 2007a; Pellegrino et al. 2011). Of course, such differences in syllable rates do not imply that Spanish speakers articulate more quickly than English speakers, rather that more syllables are produced per unit of time when those syllables contain fewer segments. Additionally, Pellegrino et al. (2011) pointed to an effect of information density on rate: for example, Mandarin Chinese has a lower syllables-per-second rate than Spanish, but informationally richer syllables when lexical tone is taken into account, hence the amount of information conveyed per unit time is roughly similar.
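A toy calculation illustrates the trade-off; the bits-per-syllable figures below are invented, and the syllable rates are only loosely inspired by the kind of values Pellegrino et al. report.

```python
# Fewer syllables per second but more information per syllable can yield
# a comparable information rate. All numbers are illustrative only.
spanish = {"syll_per_s": 7.8, "bits_per_syll": 4.0}
mandarin = {"syll_per_s": 5.2, "bits_per_syll": 6.0}

for name, lang in [("Spanish", spanish), ("Mandarin", mandarin)]:
    print(name, lang["syll_per_s"] * lang["bits_per_syll"], "bits/s")
```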
Listeners’ linguistic experience may strongly affect rate judgements, particularly with unfamiliar languages. Thus, when Japanese and German utterances were assessed by native speakers of both languages, the unfamiliar language’s rate was overestimated relative to the first language (Pfitzinger and Tamashima 2006). This has been described as the ‘gabbling foreigner illusion’ (Cutler 2012): when confronted with speech that we cannot understand, we tend to perceive it as spoken faster (see also Bosker and Reinisch 2017, regarding effects of second language proficiency). This illusion may, in part, be due to difficulties segmenting individual words in unfamiliar languages (Snijders et al. 2007). Conversely, when judging non-native accents, listeners generally interpret faster speech rate as evidence of more native-like production (e.g. White and Mattys 2007b; see Hayes-Harb 2014 for a review of rate influences on accentedness judgements).

Moreover, the perception of cross-linguistic ‘rhythm’ contrasts is influenced by structurally based rate differences (Dellwo 2010). For example, when hearing delexicalized Spanish and English sasasa stimuli (all vowels replaced by /a/, all consonants by /s/, but with the original segment durations preserved), English speakers were more likely to correctly classify faster Spanish but slower English utterances (White et al. 2012; Polyanskaya et al. 2017). Thus, some perceptions of linguistic differences typically described as ‘rhythmic’ may be associated with systematic variations in rate (Dellwo 2010).
11.2 ‘Rhythm metrics’ and prosodic typology

Informed, in particular, by Dauer’s (1983) re-evaluation of rhythmic typology, various studies under the ‘rhythm metrics’ umbrella have attempted to empirically capture cross-linguistic differences in ‘rhythm’ (often loosely defined: see §11.2.2 and Turk and Shattuck-Hufnagel 2013). These studies employed diverse metrics of durational variation (cf. Roach 1982), notably in vocalic and consonantal intervals. Some studies were premised on the validity of ‘rhythm class’ distinctions (e.g. Ramus et al. 1999), raising a potential circularity problem where the primary test of a metric’s worth is whether it evidences the hypothesized class distinctions (Arvaniti 2009), although studies of perceptual discrimination between languages (e.g. Nazzi and Ramus 2003) were sometimes cited as external corroboration. However, the accumulated evidence from speech production and perception—reviewed in §11.2.2—strongly questions the validity and usefulness of categorical rhythmic distinctions.

Some evaluative studies have highlighted empirical strengths and limitations of different rhythm metrics, observing that while certain metrics might provide data about cross-linguistic variation in the durational marking of stress contrast, they neglect much else that might be relevant to ‘rhythm’, notably distributional information (White and Mattys 2007a; Wiget et al. 2010). More trenchantly, other researchers have argued that the ‘rhythm metrics’ enterprise was compromised by a lack of consistency regarding which languages were distinguished (Loukina et al. 2011)—for example, when comparing read and spontaneous speech (Arvaniti 2012). Indeed, the term ‘rhythm metrics’ is a misnomer: aggregating surface timing features does not capture the essence of ‘speech rhythm’, however defined (e.g. Cummins 2002; Arvaniti 2009). We next consider some lessons from the ‘rhythm metrics’ approach.
11.2.1 Acoustically based metrics of speech rhythm: lessons and limitations

In the development of so-called rhythm metrics for typological studies, there was a threefold rationale for quantifying durational variation based on vowels and consonants, rather than syllables or stress feet. First, languages such as Spanish typically have less vowel reduction and less complex consonant clusters than, for example, English (Dauer 1983). Second, Mehler et al. (1996), assuming early sensitivity to vowel/consonant contrasts, proposed that young infants use variation in vowel duration and intensity to determine their native language ‘rhythm class’. Third, syllabification rules vary cross-linguistically and are not uncontroversial even within languages, while applying heuristics to identify vowel/consonant boundaries is (comparatively) straightforward (Low et al. 2000).

Thus, Ramus et al. (1999) proposed the standard deviation of vocalic and consonantal interval duration (‘ΔV’ and ‘ΔC’ respectively), along with the percentage of utterance duration that is vocalic rather than consonantal (%V). They found that a combination of ΔC and %V statistically reflected their predefined rhythm classification of, in increasing %V order: Dutch/English/Polish, Catalan/French/Italian/Spanish, and Japanese. Seeking to capture syntagmatic contrast within an utterance as well as global variation, pairwise variability indices (PVIs) average the durational differences between successive intervals—primarily vocalic/consonantal—over an utterance (see Nolan and Asu’s 2009 account of this development). PVI-based measures showed differences between a Singaporean and a British dialect of English that had been claimed to be rhythmically distinct (Low et al. 2000), as well as gradient variation between languages previously categorized as either ‘stress-timed’ or ‘syllable-timed’ (Grabe and Low 2002, based on one speaker per language). While PVIs were intended to capture sequential durational variation more directly than global measures, Gibbon (2006) noted that PVIs do not necessarily discriminate between alternating and geometrically increasing sequences (although the latter are implausible in speech: Nolan and Jeon 2014).

Variance-based measures of interval duration tend to show high correlation with speech rate: as overall intervals lengthen with slower rate, so—other things being equal—do standard deviations (Barry et al. 2003; Dellwo and Wagner 2003; White and Mattys 2007a). With normalized PVI (nPVI)-based metrics, interval durations were normalized to take account of speech rate variation (Low et al. 2000). With standard deviation measures (ΔV, ΔC), speech rate normalization was implemented through coefficients of variation for consonantal intervals (VarcoC: Dellwo and Wagner 2003) and vocalic intervals (VarcoV: Ferragne and Pellegrino 2004). In the case of consonants, however, VarcoC lacked discriminative power (White and Mattys 2007a): as noted by Grabe and Low (2002), mean consonantal interval duration varies substantially due to language-specific phonotactics, so using the mean as a normalizing denominator also eliminates linguistically relevant variation.

Comparing the power of various metrics, White and Mattys (2007a) suggested that rate-normalized metrics of vowel duration (VarcoV, nPVI-V) are more effective in capturing cross-linguistic variation, alongside %V to represent differences in consonant cluster complexity. (For broadly similar conclusions about the relative efficacy of the normalized vocalic metrics, see Loukina et al. 2011; Prieto et al. 2012b.) In contrast with Ramus et al. (1999), cross-linguistic studies employing such metrics often found variation in scores within hypothesized rhythm classes to be as great as that between classes (Grabe and Low 2002; White and Mattys 2007a; Arvaniti 2012). While conclusions about prosodic typology based only on rhythm metrics should be treated with circumspection, these data generally align with recent perceptual studies (White et al. 2012; Arvaniti and Rodriquez 2013; White et al. 2016) in refuting categorical notions of rhythm class.

Several studies emphasize the limitations of even the more reliable metrics for capturing language-specific durational characteristics, given their susceptibility to variation in utterance composition and idiosyncratic differences between speakers (e.g. Wiget et al. 2010; Loukina et al. 2011; Arvaniti 2012; Prieto et al. 2012b). Given that %V, for example, is designed to reflect variation in the preponderance of syllable structures between languages, it is unsurprising to find that sentences constructed to represent language-atypical structures elicit anomalous scores (Arvaniti 2012; Prieto et al. 2012b). Moreover, the sensitivity of rhythm metrics to speaker-specific variation, a potential problem for typological studies, has been exploited in forensic phonetics and speaker recognition (Leemann et al. 2014; Dellwo et al. 2015) and in discriminating motor speech disorders (Liss et al. 2009). It is clear, however, that large sample sizes and a variety of materials are needed to represent languages in typological studies, a major limitation given the laborious nature of manual measurement of segment duration (and the potential for unconscious language-specific biases in the application of acoustic segmentation criteria; Loukina et al. 2011). While automated approaches have potential (Wiget et al. 2010), data-trained models for recognition and forced alignment may not be available for many languages; furthermore, Loukina et al. (2011) indicated drawbacks with forced alignment that they addressed using purely acoustic-based automated segmentation.

Also problematic for ‘rhythm metrics’ is that relationships between sampled languages vary according to elicitation methods (for comparisons of read and spontaneous speech, see Barry et al. 2003; Arvaniti 2012) and that no small set of metrics, even the more reliable, consistently distinguishes all languages (Loukina et al. 2011). Furthermore, articulation rates should also be reported, as the more reliable metrics are rate-normalized (VarcoV and nPVI, although not %V), but perceptual evidence shows the importance for language discrimination of syllables-per-second rate differences (Dellwo 2010; White et al. 2012; Arvaniti and Rodriquez 2013). At best, metrics such as VarcoV and %V are approximate indicators of broad phonetic and phonotactic patterns. Questions about cross-linguistic timing differences—for example, comparing the durational marking of prominences and boundaries—could often be better addressed by more direct methods (Turk and Shattuck-Hufnagel 2013). Moreover, duration-based metrics neglect other perceptually important prosodic dimensions (Cumming 2011b).
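The interval metrics discussed in this section are easy to state explicitly. The sketch below is a minimal illustration, assuming vocalic and consonantal interval durations have already been segmented; the duration values are invented.

```python
import numpy as np

def percent_v(v, c):
    """%V: percentage of utterance duration that is vocalic (Ramus et al. 1999)."""
    return 100.0 * np.sum(v) / (np.sum(v) + np.sum(c))

def varco(d):
    """Rate-normalized variability, e.g. VarcoV: 100 * SD / mean."""
    d = np.asarray(d, dtype=float)
    return 100.0 * d.std() / d.mean()

def npvi(d):
    """Normalized pairwise variability index (Low et al. 2000)."""
    d = np.asarray(d, dtype=float)
    pairwise = np.abs(np.diff(d)) / ((d[:-1] + d[1:]) / 2.0)
    return 100.0 * pairwise.mean()

v = [0.06, 0.15, 0.05, 0.18, 0.07]   # invented vocalic intervals (s)
c = [0.09, 0.12, 0.16, 0.08, 0.11]   # invented consonantal intervals (s)
print(percent_v(v, c), varco(v), npvi(v))
```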
From a theoretical perspective, the need to declare one’s assumptions about the nature of speech rhythm is paramount (e.g. Wiget et al.’s 2010 specific use of the term ‘contrastive rhythm metrics’). Indeed, there is usually a more directly appropriate term for one’s object of phonetic or phonological study than ‘rhythm’ (Turk and Shattuck-Hufnagel 2013).
11.2.2 The fall of the rhythm class hypothesis

Rhythm classes based on isochronous units—syllable-timed, stress-timed, mora-timed—have long been undermined by durational evidence, as discussed above. The multi-faceted nature of prominence provides further counter-arguments to the rhythm class hypothesis. Two languages characterized as ‘syllable-timed’ are illustrative. Spanish and French both have limited consonant clustering, minimal prominence-related vowel reduction, and relatively transparent syllabification. As Pointon (1995) observed, however, French lacks lexical stress and has phrase-final prominence, while Spanish has predominantly word-penultimate stress but lexically contrastive exceptions, minimally distinguishing many word pairs (e.g. tomo ‘I take’ vs. tomó ‘she took’).

Despite such ‘within-class’ structural distinctions, some studies have suggested that initial speech processing depends upon speakers’ native ‘rhythm class’. Thus, French speakers were quicker to spot targets corresponding to exactly one syllable of a carrier word: for example, ba in ba.lance, bal in bal.con versus (slower) bal in ba.lance, ba in bal.con (Mehler et al. 1981). This ‘syllable effect’ was contrasted with metrical segmentation, wherein speakers of Germanic languages with predominant word-initial stress (e.g. Dutch and English) were held to infer word boundaries preceding stressed (full) syllables (Cutler and Norris 1988; although Mattys and Melhorn 2005 argued that stressed-syllable-based segmentation implies, additionally, a syllabic representation). These different segmentation strategies were explicitly associated with ‘rhythm class’ (Cutler 1990), which Cutler and Otake (1994) extended to Japanese, the ‘mora-timed’ archetype. Furthermore, the importance of early childhood experience was emphasized, suggesting that infants detect their native ‘rhythm class’ to select a (lifelong) segmentation strategy (Cutler and Mehler 1993).

It is questionable, however, whether Spanish and French speakers would share rhythmical segmentation strategies, given differences in prominence distribution and function. Indeed, the ‘syllable effect’ subsequently appeared elusive in Spanish, Catalan, and Italian, all with variable, lexically contrastive stress placement (Sebastián-Gallés et al. 1992; Tabossi et al. 2000). Moreover, Zwitserlood et al. (1993) found that speakers of (‘stress-timed’) Dutch showed syllabic matching effects comparable to those Mehler et al. (1981) reported for French (for syllabic effects in native English speakers, see Bruck et al. 1995; Mattys and Melhorn 2005). It appears that syllabic and metrical effects are heavily influenced by linguistic materials and task demands, rather than fixed by listeners’ earliest linguistic experiences (for a review, see White 2018).

Some perceptual studies have shown that listeners can distinguish two languages from distinct ‘rhythm classes’, but not two languages within a class. For example, American English-learning five-month-olds distinguished Japanese utterances from—separately—British English and Italian utterances, but did not distinguish Italian and Spanish, or Dutch and German (Nazzi et al. 2000a). Using monotone delexicalized sasasa speech preserving natural utterance timing (as described above), Ramus et al. (2003) found between-class, but not within-class, discrimination by French adult listeners (but postulated a fourth ‘rhythm class’ to account for discrimination of Polish from—separately—Catalan, Spanish, and English).
However, subsequent similar studies found discrimination within ‘rhythm classes’: for five-month-olds hearing intact speech (White et al. 2016) and for adults with delexicalized speech (White et al. 2012; Arvaniti and Rodriguez 2013). Discrimination patterns can be explained by cross-linguistic similarity on salient prosodic dimensions, including speech
rate and utterance-final lengthening, without requiring categorical distinctions (White et al. 2012).

In her influential paper ‘Isochrony Reconsidered’, Lehiste (1977b) argued that support for isochrony-based theories was primarily perceptual; indeed, data from perception studies have since been invoked to buttress the rhythm class hypothesis. It now seems clear, however, that responses to speech stimuli are not determined by listeners’ native ‘rhythm class’ (segmentation studies) or by categorical prosodic classes of language materials (discrimination studies). Languages clearly vary in their exploitation of temporal information to indicate speech structure, notably prominences and boundaries, but this variation is gradient and—integrating other prosodic features—multi-dimensional. There remain typological rhythm-based proposals, such as the ‘control versus compensation hypothesis’ (Bertinetto and Bertini 2008), but these assume gradient between-language variation in key parameters. The concept of categorical rhythm class seems superfluous, indeed misleading, for theories of speech production and perception.
11.3 Models of prosodic speech timing

Factors affecting speech duration patterns are diverse and not wholly predictable, including—beyond this linguistically oriented survey’s scope—word frequencies, emotional states, and performance idiosyncrasies. At the segmental level, voicing and place/manner of articulation influence consonant duration, while high vowels tend to be shorter than low vowels (for reviews see Klatt 1976; van Santen 1992). Some languages signal consonant or vowel identity by length distinctions, sometimes with concomitant quality contrasts (for a review see Ladefoged 1975). Connected speech structure also has durational consequences: for example, vowels are shorter preceding voiceless obstruents than voiced obstruents (Delattre 1962). This consonant–vowel duration trade-off (‘pre-fortis clipping’) is amplified phrase-finally (Klatt 1975, hinting at the importance of considering prosodic structure when interpreting durational data; e.g. White and Turk 2010).

Beyond segmental and intersegmental durational effects, an ongoing discussion concerns the nature of the higher-level structures that are important for describing speech timing, and the mechanisms through which these structures influence observed durational patterns. Here we review two of the many extant approaches to these problems (see also, inter alia, Byrd and Saltzman 2003; Aylett and Turk 2004; Barbosa 2007). §11.3.1 considers approaches based on localized lengthening effects associated with linguistic constituents. §11.3.2 considers dynamical systems models based on hierarchical coupling of oscillators. For each, we briefly highlight key features and consider their accounts of some observed timing effects.
11.3.1 Localized approaches to prosodic timing

The fundamental claim of ‘localized’ approaches to prosodic timing is that no speech units impose temporal constraints on their sub-constituents throughout the utterance (van Santen 1997). Timing is primarily determined bottom-up, based on segmental identity (echoing
Pointon’s 1980 description of Spanish as ‘segment-timed’) and processes of accommodation and coarticulation between neighbouring segments. Higher-level structure influences timing via localized lengthening effects at linguistically important positions (White 2002, 2014). The most well-attested lengthening effects are at prosodic domain edges and, for some languages, at prosodic heads (see Beckman 1992 regarding edge-effect universality versus head-effect language-specificity). Final (‘pre-boundary’) lengthening is widely observed at various levels of linguistic structure (e.g. English: Oller 1973; Dutch: Gussenhoven and Rietveld 1992; Hebrew: Berkovits 1994; Czech: Dankovičová 1997; see Fletcher 2010 for an extensive review). Lengthening (and gestural strengthening) of word-initial consonants is also reported cross-linguistically (e.g. Oller 1973; Cho et al. 2007), with greater lengthening after higher-level boundaries (e.g. Fougeron and Keating 1997). In many languages, lexically stressed syllables are lengthened relative to unstressed syllables (e.g. Crystal and House 1988), although the magnitude of lengthening varies (e.g. Dauer 1983; Hoequist 1983) and, as discussed in §11.1.2, some languages may lack lexical stress. Additionally, stressed and other syllables are lengthened in phrasally accented words (e.g. Sluijter and van Heuven 1995).

White’s (2002, 2014) prosodic timing framework proposed that lengthening is the durational means by which speakers signal structure for listeners. The distribution of lengthening depends on the particular (edge or head) structural influence: for example, the syllable onset is the locus of word-initial lengthening (Oller 1973), while the pre-boundary syllable rhyme is lengthened phrase-finally (as well as syllable rhymes preceding a final unstressed syllable; Turk and Shattuck-Hufnagel 2007). Thus, the distribution (‘locus’) of lengthening disambiguates the nature of the structural cue (e.g. Monaghan et al. 2013).

This emphasis on localized lengthening affords a reinterpretation of ‘compensatory’ timing processes: inverse relationships between constituent length and the duration of sub-constituents. For example, Lehiste (1972) reported ‘polysyllabic shortening’, an inverse relationship between a word’s syllable count and its primary stressed syllable’s duration. As observed by White and Turk (2010), however, many duration studies have only measured phrasally accented words, such as in fixed frame sentences (e.g. ‘Say WORD again’). The primary stressed syllables are lengthened in these phrasally accented words, as—to a lesser extent—are unstressed syllables; moreover, the greater the number of unstressed syllables, the smaller the accentual lengthening on the primary stressed syllable (Turk and White 1999). Hence, pitch accented words appear to demonstrate polysyllabic shortening (e.g. cap is progressively shorter in cap, captain, captaincy; likewise mend in mend, commend, recommend); however, in the absence of pitch accent, there is no consistent relationship between word length and stressed syllable duration (White 2002; White and Turk 2010). Similar arguments apply to apparent foot-level compression effects. Beckman (1992: 458) noted the difficulty in distinguishing ‘rhythmic compression of the stressed syllable in a polysyllabic foot from the absence of a final lengthening for the prosodic word’.
Likewise, Hirst (2009) considered the durational consequences of the length of the English ‘narrow rhythm unit’ (NRU) (or ‘within-word foot’, from a stressed syllable to a subsequent word boundary). He found the expected linear relationship between syllable number and NRU duration, not the negative acceleration expected for a cross-foot compression tendency (Nakatani et al. 1981; Beckman 1992). Furthermore, Hirst (2009) attributed the ‘residual’ extra duration within each NRU (the intercept of the regression line for NRU length vs. duration) to localized lengthening effects at the beginning and end of the NRU (cf. White 2002, 2014; White and Turk 2010). Similarly, Fant et al. (1991: 84), considering Swedish,
French, and English, suggested that the primary (but ‘marginal’) durational consequence of foot-level structure was in ‘the step from none to one following unstressed syllables in the foot’. This localized lengthening of the first of two successive stressed syllables (e.g. Fourakis and Monahan 1988; Rakerd et al. 1987; called ‘stress-adjacent lengthening’ by White 2002) may relate to accentual lengthening variation in cases of stress clash.

Generalizing from these observations, White (2002, 2014) reinterpreted apparent compensatory timing as being due to variation in the distribution of localized prosodic lengthening effects at domain heads and domain edges (e.g. phrasal accent lengthening or phrase-final lengthening). The localized lengthening framework further argues that, outside the loci of prosodic lengthening effects, there is little evidence for relationships between constituent length and sub-constituent duration (see also e.g. Suomi 2009; Windmann et al. 2015). Beyond such localized lengthening, the primary determiner of a syllable’s duration is its segmental composition (van Santen and Shih 2000).
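The regression logic that runs through these studies is easy to make concrete. The sketch below is our illustration only: the model duration(n) = a + b·n is the one fitted in Hirst (2009)-style analyses, but the durations and variable names are invented for demonstration.

```python
# Illustrative sketch (ours): fitting duration(n) = a + b*n, the linear
# model used in Hirst (2009)-style analyses of unit duration against
# syllable count. The durations below are invented for demonstration.
import numpy as np

n_syllables = np.array([1, 2, 3, 4])
durations_ms = np.array([310, 470, 640, 800])   # hypothetical mean durations

b, a = np.polyfit(n_syllables, durations_ms, 1)  # slope, intercept
print(f"slope b = {b:.1f} ms/syllable, intercept a = {a:.1f} ms")

# A large positive intercept is the 'residual' duration that localized
# accounts attribute to fixed lengthening at the unit's edges; a
# compression tendency would instead show up as negative acceleration,
# i.e. systematic curvature that the straight line fails to capture.
```

On this reading, localized approaches predict a well-fitting line with a substantial intercept, while compression accounts predict curvature that the linear fit would miss.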
11.3.2 Coupled oscillator approaches to prosodic timing

Coupled oscillator models posit at least two cyclical processes that are coupled—that is, influence each other’s evolution in time. O’Dell and Nieminen’s (1999, 2009) model of prosodic timing considers the hierarchically coupled syllable and stress-delimited foot oscillators (but see Malisz et al. 2016 regarding other units). Some models additionally include non-hierarchically coupled subsystems: Barbosa’s (2007) complex model also contains a coupled syntax and articulation module, the syntactic component being controlled by a probabilistic model as well as a coupled prosody-segmental interaction, and generates abstract vowel-to-vowel durations tested on a corpus of Brazilian Portuguese. (For overviews of dynamical approaches to speech, including timing, see Van Lieshout 2004; Tilsen 2009.)

Empirical support for coupled oscillator models on the surface timing level has been found in the linear relationship between the number of syllables in a foot and the foot’s duration, discussed in §11.3.1, with non-zero coefficients. This relationship naturally emerges from O’Dell and Nieminen’s (2009) mathematical modelling of foot and syllable oscillator coupling. Importantly, there is variable asymmetry in the coupling strengths of the two oscillators, between and within languages (see also Cummins 2002). If one process wholly dominated, isochrony of syllables or stress feet would be observed: in strict foot-level isochrony, foot duration would be independent of syllable count; in strict syllable-level isochrony, foot duration would be additively proportional to syllable count. That such invariance is rarely observed is, of course, not evidence against oscillator models. Surface regularity of temporal units is not a prerequisite; rather, it is the underlying cyclicity of coupled units that is assumed (for a discussion see Turk and Shattuck-Hufnagel 2013; Malisz et al. 2016). Indeed, periodic control mechanisms, if coupled, should not typically produce static surface isochrony on any subsystem level (e.g. syllable or foot): hierarchical coupling promotes variability in temporal units (Barbosa 2007; Malisz et al. 2016), and only under specific functional conditions is surface periodicity achieved.

Regression of unit duration against number of sub-constituents cannot, however, distinguish where local expansion or compression may take place. Indeed, coupled oscillator models are neutral about where durational effects are allocated within temporal domains (Malisz et al. 2016), ranging from extreme centralization to equal allocation throughout the
domain. By contrast, localized approaches (e.g. White 2014) suggest strict binding of lengthening effects to specific loci (e.g. syllable onset or rhyme) within the domain (e.g. a word), while other syllables are predicted to remain unaffected as they are added to the domain outside this locus. Effects on surface timing predicted by the coupled oscillator model are thus less constrained than those of localized approaches, which specifically argue for the absence of compression effects beyond the locus (see §11.3.1).

Dynamical models, such as O’Dell and Nieminen (2009), updated in Malisz et al. (2016), depend rather on evidence of hierarchical coupling, such as that provided by Cummins and Port (1998) in the specific case of speech-cycling tasks. While repeating a phrase to a uniformly varying metronome target beat, English speakers tended to phase-lock stressed syllables to the simple ratios (1:3, 1:2, 2:3) of the repetition cycle. Furthermore, other periodicities emerge at harmonic fractions of the phrase repetition cycle (Port 2003, who relates these observations to periodic attentional mechanisms (Jones and Boltz 1989; see also McAuley and Fromboluti 2014)). There is also suggestive empirical support for metrical influences on speech production in findings that speakers may prefer words and word combinations that maintain language-specific metrical structures (Lee and Gibbons 2007; Schlüter 2009; Temperley 2009; Shih 2014), although Temperley (2009) found that contextually driven variations from canonical form (e.g. stress-clash avoidance) actually increase interval irregularity in English.

In dynamical theories, coupling is evident within hierarchical speech structures, between speakers in dialogue, and within language communities (Port and Leary 2005; Cummins 2009). Periodic behaviour is understood to be one of the mechanisms of coordination within complex systems (Turvey 1990), mathematically modelled by oscillators. Furthermore, coupled oscillatory behaviour is a control mechanism that spontaneously arises in complex systems where at least two subsystems interact, without necessarily requiring a periodic referent, such as a regular beat (Cummins 2011). Whether the undoubted human ability to dynamically entrain our actions is mirrored in the entrainment of metrical speech units remains debatable, as discussed here. Evidence of the entrainment of endogenous neural oscillators (e.g. theta waves) to the amplitude envelope of speech (e.g. Peelle and Davis 2012) suggests a possible neural substrate for oscillator-based speech behaviour, potentially important in listeners’ generation of durational predictions based on speech rate (e.g. Dilley and Pitt 2010). Theories of neural entrainment need, however, to address the lack of surface periodicity in most speech, as well as the imprecise mapping between the amplitude envelope and linguistic units (Cummins 2012). More generally, oscillator models of timing may find a challenge in evidence that many languages lack levels of prominence, such as lexical stress, that were once thought universal (e.g. Jun 2005b; Maskikit-Essed and Gussenhoven 2016).
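To make the two isochrony limiting cases concrete, the sketch below uses one convenient closed form for entrained foot duration under hierarchical coupling, written in the spirit of O’Dell and Nieminen’s derivation; the exact parametrization, the function name foot_duration, and all numeric values are our illustrative assumptions rather than the published model.

```python
# Sketch (ours) of the linear foot-duration prediction under hierarchical
# coupling, in the spirit of O'Dell and Nieminen; the closed form and all
# parameter values are illustrative assumptions, not the published model.
def foot_duration(n: int, t_foot: float, t_syl: float, r: float) -> float:
    """Predicted duration of an n-syllable foot; r is the relative
    strength of the syllable oscillator's coupling to the foot oscillator."""
    return (t_foot + r * n * t_syl) / (1 + r)

for r in (0.0, 1.0, 100.0):
    row = [foot_duration(n, t_foot=500, t_syl=200, r=r) for n in (1, 2, 3, 4)]
    print(f"r = {r:>5}:", ", ".join(f"{d:.0f} ms" for d in row))

# r = 0   -> 500, 500, 500, 500 ms: strict foot isochrony (n irrelevant)
# r = 100 -> ~203, ~401, ~599, ~797 ms: near-strict syllable timing
# Intermediate r yields the typical empirical pattern: duration linear in n,
# with both a non-zero slope and a non-zero intercept.
```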
11.4 Conclusions and prospects

The hypothesis that speech is consistently characterized by isochrony has succumbed to the weight of counterevidence, and the associated hypothesis about categorical ‘rhythm class’ has, at best, scant support. The accumulated production and perception data do, however, support a continuing diversity of approaches to speech timing, varying in their balance
between chronos and kairos, notably the degree to which surface timing patterns or hierarchical control structures are emphasized.

Regarding the two approaches sketched here, there may appear to be superficial contrasts between dynamical timing models, emphasizing underlying coupling between hierarchically organized levels of metrical structure (e.g. Cummins and Port 1998; O’Dell and Nieminen 2009; Malisz et al. 2016), and localized approaches, emphasizing the irregularity of surface timing and the information about structure and phonology provided for listeners by this temporal unpredictability (e.g. Cauldwell 2002a; Nolan and Jeon 2014; White 2014). A synthesis of dynamical and localized models may, however, emerge from a deeper understanding of the complex interaction between the information transmission imperative in language and the affordance that speech offers for multi-level entrainment of interlocutors’ gestural, prosodic, linguistic, and social behaviour (Tilsen 2009; Pickering and Garrod 2013; Mücke et al. 2014; Fusaroli and Tylén 2016).

Some degree of broad predictability is a prerequisite for humans interacting in conversation or other joint action. More specifically, local unpredictability in speech timing cannot be interpreted as structurally or prosodically motivated unless listeners have a foundation on which to base temporal predictions and the ability to spot violations of predictions (e.g. Baese-Berk et al. 2014; Morrill et al. 2014b). Where mutual understanding confers predictability—for example, via a common social framework or foregoing linguistic context—then the surface timing of speech may be freer to vary unpredictably, towards maximizing encoding of information. When interlocutors lack shared ground and predictability is consequently elusive, then relative underlying periodicity may dominate, supporting mutual coordination and ease of processing, but with potential loss of redundancy in information encoding (see Wagner et al. 2013). This proposal, which we tentatively call the ‘periodicity modulation hypothesis’, lends itself to ecologically embedded studies of infant and adult spoken interactions and their relationship to neurophysiological indices of perception and understanding.
Part IV

PROSODY ACROSS THE WORLD
Chapter 12

Sub-Saharan Africa

Larry M. Hyman, Hannah Sande, Florian Lionnet, Nicholas Rolle, and Emily Clem
12.1 Introduction

In this chapter we survey the most important properties and issues that arise in the prosodic systems of sub-Saharan Africa. While our emphasis is on the vast Niger-Congo (NC) stock of approximately 1,500 languages, much of what is found in NC is replicated in Greenberg’s (1963) other major stocks: Nilo-Saharan, Khoisan, and the Chadic, Cushitic, and Omotic subgroups of Afro-Asiatic. As we shall point out, both the occurrence of tone and other properties that are found in the prosodic systems of sub-Saharan Africa show noteworthy areal distributions that cut across these groups. We start with a discussion of tone (§12.2), followed by word accent (§12.3) and then intonation (§12.4).
12.2 Tone

Tone is clearly an ancient feature across sub-Saharan Africa, with the exception of Afro-Asiatic (e.g. Chadic), which likely acquired tone through contact with NC and/or Nilo-Saharan (Wolff 1987: 196–197). It is generally assumed that Proto-NC, which existed somewhere between 7,000 and 10,000 years ago, already had tone, most likely with a contrast between two heights, H(igh) and L(ow) (Hyman 2017). First, almost all NC languages are tonal, including the controversial inclusions such as Mande, Dogon, and Ijoid. Second, non-tonal NC languages are geographically peripheral and have lost their tone via natural tone simplification processes (cf. Childs 1995) and/or influence from neighbouring non-tonal languages (cf. Hombert 1984: 154–155). This includes not only Swahili in the East but also Northern Atlantic (Fula, Seereer, Wolof, etc.), Koromfé (Northern Central Gur; Rennison 1997: 16), and (outside NC) Koyra Chiini (Songhay; Heath 1999: 48), which could be the effect of contact with Berber or Arabic, either directly or through Fula (Childs 1995: 20). The only non-peripheral cases concern a near-continuous band of Eastern Bantu languages, such as Sena [Mozambique], Tumbuka [Malawi] (Downing 2017: 368), and Nyakyusa [Tanzania], which are not likely to have lost their tones through external contact. Since
tonogenesis usually if not always produces a binary contrast, languages with multiple tone heights have undergone subsequent tonal splits conditioned either by laryngeal features, such as the obstruent voicing of so-called depressor consonants, as in Kru (see Singler 1984: 74 for Wobe), or by the tones themselves, such as raising of L to M(id) before a H tone that then subsequently drops out, as in Mbui [Grassfields Bantu; Cameroon] (Hyman and Schuh 1974: 86).
12.2.1 Tonal inventories

Numerous sub-Saharan languages still show a binary contrast in their tones, which may be analysed as equipollent /H/ vs. /L/, e.g. Ga [Kwa; Ghana] /lá/ ‘blood’, /là/ ‘fire’ (Kropp Dakubu 2002: 6), privative /H/ vs. Ø, e.g. Haya [Bantu; Tanzania] /-lí-/ ‘eat’ vs. /-si-/ ‘grind’ (Hyman and Byarushengo 1984: 61), or (more rarely) privative /L/ vs. Ø, e.g. Malinke de Kita [Mande; Mali] /nà/ ‘to come’ vs. /bo/ ‘to exit’ (Creissels 2006: 26). As pointed out by Clements and Rialland (2008: 72–73), languages with three, four, or even five contrastive pitch heights tend to cluster along a definable East–West Macro-Sudan Belt (Güldemann 2008) starting in Liberia and Ivory Coast, e.g. Dan (Vydrin 2008: 10), and ending in Ethiopia, where Bench also contrasts five tone heights: /ka̋r/ ‘clear’, /kárí/ ‘enset or banana leaf’, /kār/ ‘to circle’, /kàr/ ‘wasp’, and /kȁr/ ‘loincloth’ (Rapold 2006: 120). Most of those spoken south of the Macro-Sudan Belt (Güldemann 2008) contrast two tone heights (see Map 12.1 in the plate section).¹

¹ Note that in Map 12.1 tone heights are counted based on the number of contrastive pitch levels a language employs on the surface. Thus, if a language has a system consisting of L, H, and ꜜH, it will be counted as having three tone heights.

Another area of high tonal density is to be found in the Kalahari Basin region (Güldemann 2010) in southwestern Africa, where languages formerly subsumed under the label ‘Khoisan’ have up to four level tones: the Kx’a language ǂʼAmkoe (Gerlach 2016) and the Khoe-Kwadi languages Khwe (Kilian-Hatz 2008: 24–25), Gǀui (Nakagawa 2006: 32–60), and Ts’ixa (Fehn 2016: 46–58) have three tone heights (H, M, L), while Khoekhoe (Khoe-Kwadi: Haacke 1999: 53) and the Ju branch of Kx’a (formerly ‘Northern Khoisan’: Dickens 1994; König and Heine 2015: 44–48) have four (Super-H, H, L, Super-L). Only one Kalahari Basin language (the West dialect of Taa, aka ǃXóõ) has been analysed as opposing two tone heights (H vs. L: Naumann 2008).

Besides the number of tone heights, African tone system inventories differ in whether they permit contours or not, and, if permitting, which ones are present. Map 12.2 (see plate section) shows that R(ising) and F(alling) tonal contours tend more to appear in languages in the Macro-Sudan Belt. In terms of the number of contour tones, African languages have been reported with up to five falling tones, e.g. Yulu HM, HL, HꜜL, ML, MꜜL (Boyeldieu 1987: 140), and five rising tones, e.g. Wobe 31, 32, 41, 42, 43, where 1 is the highest tone (Bearth and Link 1980: 149).

Another difference in inventory concerns whether a language allows downstepped tones or not. Whereas some languages contrast the three tone heights /H/, /M/, and /L/, which in principle can combine to produce nine possibilities on two syllables and 27 possibilities on three, as in Yoruba (Pulleyblank 1986: 192–193), others contrast /H/, /L/, and a downstepped ꜜH, which usually is contrastive only after another (ꜜ)H. As seen in Map 12.3 (see plate section), a smaller number of languages have contrastive ꜜM and ꜜL. While in most African languages with downstep an underlying L tone between two H tones results in the second H surfacing as ꜜH, an input H-H sequence may also become H-ꜜH, as in Shambala (Odden 1982) and Supyire (Carlson 1983). A number of underlying /H, M, L/ languages lack ꜜH but have downstepped ꜜM, which results from a wedged L tone, e.g. Yoruba (Bamgbos̹e 1966), Gwari (Hyman and Madaji 1970: 16), and Gokana (Hyman 1985: 115). Downstepped L, on the other hand, is more likely to derive from a lost H tone, as in Bamileke-Dschang (Hyman and Tadadjeu 1976: 92). ꜜH is by far the most common downstepped tone, and a three-way H vs. ꜜH vs. L contrast is the most common downstep system. On the other hand, Ghotuo is reported to have both ꜜM and ꜜL, but no ꜜH (Elugbe 1986: 51). Yulu (Boyeldieu 2009), which is said to have an ‘infra-bas’ tone, may be best analysed as having ꜜL. Similarly, the contrastive L falling tone of Kalenjin may be best analysed as a LꜜL contour tone (Creider 1981). While ꜜH occurs throughout sub-Saharan Africa, ꜜM and ꜜL are more commonly found in the eastern part of the Macro-Sudan Belt (e.g. Nigeria, Cameroon, Central African Republic).
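As a rough illustration of how downstep is standardly conceptualized, the toy model below treats each downstep as an irreversible narrowing of the pitch register, so that successive (ꜜ)H tones form a descending terrace; the numeric scale and step sizes are arbitrary choices of ours, not values from any of the languages cited.

```python
# Toy model (ours) of downstep as register lowering: each downstep marker
# ('!') permanently narrows the register, so each later H is realized a
# step lower, producing the terraced pattern. Scale and step sizes are
# arbitrary illustrative choices.
def realize(tones):
    """tones: e.g. ['H', '!H', 'L']; returns schematic pitch levels."""
    ceiling = 10                   # current top of the register
    levels = []
    for t in tones:
        if t.startswith('!'):
            ceiling -= 2           # downstep lowers the register ceiling
            t = t[1:]
        levels.append(ceiling if t == 'H' else ceiling - 4)
    return levels

print(realize(['H', '!H', '!H']))  # [10, 8, 6]: terraced H levels
print(realize(['H', 'L', 'H']))    # [10, 6, 10]: no marked downstep here
```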
12.2.2 The representation of tone

The density of the tonal contrasts depends on whether a contrastive tone is required on every tone-bearing unit (TBU), instead of allowing some or many TBUs to be toneless. In the densest system, the number of contrastive tone patterns will equal the number of contrastive tones raised to the power of the number of TBUs. Thus, in a /H, L/ system there should be two patterns on a single TBU, four patterns on two TBUs, and so on (and perhaps more patterns if tonal contours are allowed). A sparse tonal situation tends to arise in languages that have longer words but have a more syntagmatic tone system. In these languages, single tones, typically privative H, can be assigned to a specific position in a lexically defined group of stems or words, as in Chichewa, where an inflected verb stem can be completely toneless or have a H on its penultimate or final syllable. It is such systems that lend themselves to a privative /H/ vs. Ø analysis.

In another type of system often referred to as melodic, the number of TBUs is irrelevant. In Kukuya, for instance, verb stems can have any of the shapes CV, CVV, CVCV, CVVCV, or CVCVCV, i.e. up to three morae over which five different tonal melodies are mapped: H, L, LH, HL, or LHL (Paulian 1975). Since a CV TBU can have any of the LH, HL, or LHL contours, analysed as sequences of tones, Kukuya unambiguously requires a /H/ vs. /L/ analysis. Other languages can reveal the need for a /H/ vs. /L/ analysis by the presence of floating tones (e.g. many Grassfields Bantu languages).
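In counting terms, a maximally dense two-tone system thus allows 2ⁿ patterns over n TBUs (and Yoruba’s three tones give the 9 and 27 possibilities noted above, i.e. 3² and 3³). A melodic system of the Kukuya type can likewise be made concrete with a toy version of the familiar autosegmental association conventions; the sketch below is our illustration, not the chapter’s formalism.

```python
# Toy sketch (ours) of the classic autosegmental association conventions:
# link tones to tone-bearing units (TBUs) one-to-one, left to right;
# leftover tones crowd onto the final TBU as a contour; leftover TBUs get
# a spread copy of the final tone.
def associate(melody, n_tbus):
    tones = list(melody)
    if n_tbus >= len(tones):
        return tones + [tones[-1]] * (n_tbus - len(tones))
    return tones[:n_tbus - 1] + [''.join(tones[n_tbus - 1:])]

# A Kukuya-style /LHL/ melody over stems of one to three moras:
for n in (1, 2, 3):
    print(n, associate('LHL', n))
# 1 ['LHL']         -> rise-fall contour crowded onto a single mora
# 2 ['L', 'HL']     -> L plus a HL fall
# 3 ['L', 'H', 'L'] -> one tone per mora
```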
12.2.3 Phonological tone rules/constraints

Almost all sub-Saharan languages modify one or another of their tones in specific tonal environments. By far the most common tone rule found in NC languages is perseverative tone spreading, which most commonly applies to H tones, then L tones, then M tones.
In languages that have privative /H/ vs. Ø, only H can spread, as in much of Eastern and Southern Bantu, and similarly regarding L in /L/ vs. Ø systems such as Ruwund (Nash 1992–1994). Such spreading can be either bounded, affecting one TBU to the right, or unbounded, targeting either the final or penultimate position within a word or phrase domain. In some languages both H and L tone spreading can result in contours being created on the following syllable, as in the Yoruba example /máyò̙ mí rà wé/ → [máyô̙ mǐ râ wě] ‘Mayomi bought books’ (Laniran and Clements 2003: 207). In languages that do not tolerate contours, the result is doubling of a tone to the next syllable. This is seen particularly clearly in privative H systems, e.g. Kikerewe [Bantu; Tanzania] /ku-bóh-elan-a/ → [ku.bó.hé.la.na] ‘to tie for each other’ (Odden 1998: 177). In some cases the original tone delinks from its TBU, in which case the result is tone shifting, as in Jita /ku-βón-er-an-a/ → [ku-βon-ér-an-a] ‘to get for each other’ (Downing 1990: 50). Tone anticipation is much less common but does occur, e.g. Totela /o-ku-hóh-a/ → [o-kú-hoh-a] ‘to pull’ (Crane 2014: 65). Other languages may raise or lower a tone, e.g. a /L-H/ sequence may be realized L-M as in Kom [Bantoid; Cameroon] (Hyman 2005: 315–316) or M-H as in Ik [Eastern Sudanic; Uganda] (Heine 1993: 18), while the H in a /H-L/ sequence may be raised to a super-high level, as in Engenni [Benue-Congo; Nigeria] /únwónì/ → [únwőnì] ‘mouth’ (Thomas 1974: 12). Finally, tone rules may simplify LH rising and HL falling tones to level tones in specific environments. For more on the nature of tone rules in African languages see Hyman (2007) and references cited therein.
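The difference between doubling and shifting amounts to whether the source tone stays linked. The sketch below (ours, purely schematic) treats a word as a list of syllable tones in a privative /H/ system and implements bounded rightward spread both ways.

```python
# Toy illustration (ours) of bounded perseverative H-spread in a privative
# /H/ system: each word is a list of syllable tones, 'H' or None.
def h_double(tones):
    """Spread each H one syllable rightward, keeping the source (doubling)."""
    out = list(tones)
    for i, t in enumerate(tones):
        if t == 'H' and i + 1 < len(out):
            out[i + 1] = 'H'
    return out

def h_shift(tones):
    """Move each H one syllable rightward, delinking the source (shifting)."""
    out = [None] * len(tones)
    for i, t in enumerate(tones):
        if t == 'H':
            out[min(i + 1, len(tones) - 1)] = 'H'
    return out

word = [None, 'H', None, None, None]   # schematic five-syllable input
print(h_double(word))  # [None, 'H', 'H', None, None]: Kikerewe-style doubling
print(h_shift(word))   # [None, None, 'H', None, None]: Jita-style shifting
```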
12.2.4 Grammatical functions of tone

One of the most striking aspects of tone across Africa is its frequent use to mark grammatical categories and grammatical relations. Three types of grammatical tone (GT) are illustrated below from Kalabari [Ijoid; Nigeria] (Harry and Hyman 2014). The first is morphological GT at the word level, which turns a transitive verb into an intransitive verb by replacing lexical tones with a LH tone melody. In this case, the only mark of the grammatical category is the GT, with no segmental exponence, illustrated in (1).

(1)  kán      H     ‘demolish’  →  kàán    LH   ‘be demolished’
     kíkíꜜmá  HHꜜH  ‘hide’      →  kìkìmá  LLH  ‘be hidden’
The second, syntactic type occurs at the phrase level. As shown in (2), when a noun is possessed by a possessive pronoun (e.g. /ìnà/ ‘their’), the lexical tones of the noun are replaced with a HLH melody, realized [HꜜH] on two syllables.

(2)  námá  HH  ‘animal’  →  ìnà náꜜmá  HꜜH  ‘their animal’
     bélè  HL  ‘light’   →  ìnà béꜜlé  HꜜH  ‘their light’
Unlike the first case, here GT only secondarily expones the grammatical category ‘possessive’, and must co-occur with a segmentally overt possessive pronoun. Both morphological and syntactic types are referred to as ‘replacive tone’ (Welmers 1973: 132–133). Finally, the third type is also phrase level, but is crucially different in that the GT does not replace lexical tone but rather co-occurs with it. For example, still in Kalabari the future auxiliary /ɓà/ assigns
/H/ to a preceding verb, which surfaces as [LH] if it has /L/: /sì/ ‘(be) bad’ → námá sìí ɓà ‘the animal will become bad’. While all tonal languages in Africa exhibit GT, typically robustly, we know of only one language with only lexical tone. GT usage cuts across other typological dimensions such as tone inventory, degree of analyticity/syntheticity, and the headedness parameter, and can express virtually all grammatical categories and many distinct types of grammatical relations, including derivation (valency, word class changes, and more), and all major inflectional categories such as number, case, tense, aspect, mood, subject agreement, and polarity, as in Aboh [Benue-Congo; Nigeria]: [ò jè kò] ‘s/he is going’, [ó jé kò] ‘s/he is not going’ (L. Hyman, personal notes).

One robust pattern found across Africa involves GT marking ‘associative’ (roughly, genitive and compound) constructions, e.g. in Mande (Creissels and Grégoire 1993; Green 2013), Kru (Marchese 1979: 77; Sande 2017: 40), much of Benue-Congo, and (outside of NC) Chadic (Schuh 2017: 141), the isolate Laal (Lionnet 2015), and many Khoe-Kwadi languages (Haacke 1999: 105–159; Nakagawa 2006: 60–80, among others).

In the verbal domain, tone often only has a grammatical function if verb roots lack a lexical tonal contrast. Table 12.1 illustrates this with the closely related Bantu languages Luganda and Lulamogi (Hyman 2014a). Both exhibit a lexical tonal contrast in the nominal domain, but only Luganda does so in the verbal domain.
Table 12.1 Grammatical tone in a language without a tone contrast in the verb stem (Luganda) and its absence in a language with such a tone contrast (Lulamogi)

           Nouns                       Verbs
Luganda    e-ki-zimbe ≠ e-ki-zîmba     o-ku-bal-a ≠ o-ku-bál-a
Lulamogi   é-ki-zimbé ≠ é-kí-zîmbá     ó-ku-bal-á = ó-ku-bal-á
           ‘building’   ‘boil, tumor’  ‘to count’   ‘to produce, bear fruit’
The lack of lexical tone contrasts in the verbal domain is common across African tonal languages, such as in Kisi [South Atlantic; Sierra Leone] (Childs 1995: 55), Konni [Gur; Ghana] (Cahill 2000), CiShingini [Kainji; Nigeria] (N. Rolle and G. Bacon, field notes), and Zande [Ubangi; Central African Republic] (Boyd 1995), not to mention many Bantu languages where tones are assigned by the inflectional morphology (tense-aspect-mood-negation), e.g. Lulamogi /a-tolók-a/ ‘s/he escapes’ vs. /á-tolok-é/ ‘may s/he escape!’. At least one language, Chimwiini [Bantu; Somalia], has only GT and no lexical tone in any domain (Kisseberth and Abasheikh 2011). Here, a single final or penultimate privative H tone is determined by the grammar, e.g. [ji:lé] ‘you sg. ate’, [jí:le] ‘s/he ate’ (Kisseberth and Abasheikh 2011: 1994), and although the above contrast derives from the inflectional morphology of the verb, it is realized phrasally: [jile ma-tu:ndá] ‘you sg. ate fruit’, [jile matú:nda] ‘s/he ate fruit’. Kisseberth and Abasheikh (2011) analyse phrase-penultimate H as the default, with the final H in these examples being conditioned by first (and second) person subject marking. Other final H tones are assigned by relative clauses, conditional clauses introduced by /ka-/, the negative imperative, and the conjunction /na/ ‘and’ (Kisseberth and Abasheikh 2011: 1990–1992).
The interaction of GT with lexical tone and other GTs is extremely rich and varied. One profitable way of illustrating complex GT interaction is through tonological paradigms showing which morphosyntactic features assign which tones. These assignments often conflict. It is profitable to view GT competition as ‘dominance effects’ (Kiparsky 1984; Inkelas 1998; Kawahara 2015). As implemented in Rolle (2018), dominant GT wins systematically in competition with lexical tone, resulting in tonal replacement, as was exemplified in the first two types above from Kalabari (1–2). In contrast, non-dominant GT does not systematically win over other tones, often resulting in tones from two distinct morphemes co-occurring, as in the third type in Kalabari shown above with the future auxiliary /ɓà/.

Dominant and non-dominant GT can be interleaved in morphologically complex words, resulting in ‘layers’ of GT. The following example comes from Hausa (Inkelas 1998: 132). In (3a), dominant affixes /-íí/ agent and /-ìyáá/ fem replace any tones in the base and assign a L and H melody, respectively. Non-dominant affixes /má-/ nominalizer and /-r/ ref either assign no tone, or assign a floating tone which docks to the edge but does not replace tones, as shown in (3b).

(3) a. [ [ [ má- [ káràntá -Líí ] ] -Hìyáá ] -Lr ]
         nml-   read     -agent    -fem     -ref
       ‘the reader (f.)’

    b. Dom      káràntá -Líí         →  kàràncíí
       Non-dom  má- kàràncí          →  mákàràncíí
       Dom      mákàràncíí -Hìyáá    →  mákáráncììyáá
       Non-dom  mákáráncììyáá -Lr    →  mákáráncììyâr
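The dominance logic in (3) can be summarized schematically: a dominant GT overwrites whatever tones the base has accumulated, while a non-dominant GT merely adds its tone at an edge. The sketch below is our simplification of this typology, not Rolle’s (2018) implementation.

```python
# Schematic sketch (ours, simplifying the dominance typology discussed
# above) of how grammatical tone (GT) composes with a base melody.
def apply_gt(base, gt, dominant):
    """base, gt: melody strings such as 'HL'. Dominant GT replaces the
    base's tones outright; non-dominant GT docks at the right edge."""
    return gt if dominant else base + gt

melody = 'HL'                          # hypothetical lexical melody
melody = apply_gt(melody, 'L', True)   # dominant affix: replacive -> 'L'
melody = apply_gt(melody, 'H', False)  # non-dominant affix: additive -> 'LH'
print(melody)                          # 'LH': both morphemes' tones co-occur
```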
GT interaction can be very complex and may involve intricate rules of resolution not easily captured as dominant versus non-dominant GT, as in the case of the grammatical H tones in Jita [Bantu; Tanzania] (Downing 2014). In addition, tone may exhibit allomorphic melodies conditioned by properties of the target. For example, in Tommo So [Dogon; Mali] (McPherson 2014) possessive pronouns assign a H melody to bimoraic targets but a HL melody to longer targets. Thus, /bàbé/ ‘uncle’ → [mí bábé] ‘my uncle’ vs. /tìrὲ-àn-ná/ ‘grandfather’ → [mí tírὲ-àn-nà] ‘my grandfather’.

Although virtually all sub-Saharan tonal languages exhibit both lexical tone and GT, the functional load of each can vary significantly. Many descriptions of African languages explicitly note the lack of tonal minimal pairs, as in the Chadic languages Makary Kotoko (Allison 2012: 38) and Goemai (Tabain and Hellwig 2015: 91); for Cushitic as a whole (Mous 2009), e.g. Awngi (Joswig 2010: 23–24); and in Eastern Bantu languages such as Luganda and Lulamogi (see Table 12.1). Other languages have more frequent minimal pairs, such as the oft-cited minimal quadruplet in Igbo: /àkwà/ ‘bed’, /àkwá/ ‘egg’, /ákwà/ ‘cloth’, /ákwá/ ‘crying’. The functional load of GT similarly varies across Africa.
12.3 Word accent

While we have a great understanding of tone in African languages, there has been considerably less clarity about the status of word stress. In this section we adopt word accent (WA) as a cover term to include word stress and other marking of one and only one most prominent
syllable per word. In the most comprehensive survey of WA in sub-Saharan Africa to date, the studies cited by Downing (2010) describe individual African languages with WA assigned at the left or right edge, on the basis of syllable weight, or by tone. However, many if not most authors either fail to report word stress or explicitly state that there is no stress, rather only tone. Thus in Lomongo, ‘stress is entirely eclipsed by the much more essential marking of tones’ (Hulstaert 1934: 79, our translation). Some of the relatively few non-tonal languages do appear to have WA, such as initial (~ second syllable) in Wolof (Ka 1988; Rialland and Robert 2001) and penultimate (~ antepenultimate) in Swahili (Vitale 1982). Other non-tonal languages appear not to have WA at all, but rather mark their prosody at the phrase level, such as by lengthening the vowel of the phrase-penultimate syllable and assigning a H tone to its first mora in Tumbuka (Downing 2017: 369–370). Kropp (1981) describes the stylistic highlighting (‘stress’) of different syllables within the pause group in Ga [Kwa; Ghana].

While descriptions of many tone languages do not mention stress or accent, we do find occasional attempts to predict WA from tone. In Kpelle, Welmers (1962: 86) shows that basic (unaffixed, single-stem) words can have one of five tone melodies H, M, HL, MHL, and L. He goes on to say that when these melodies are mapped on bisyllabic words as H-H, M-M, H-L, M-HL, and L-L, accent falls on the initial syllable if its tone is H or L, otherwise on the second syllable if its tone is HL. Words that are M-M are ‘characterized by lack of stress’ (Welmers 1962: 86). However, the fact that some words are accentless makes the analysis suspect, as obligatoriness is a definitional property of stress in non-tone languages. Since the MHL and M melodies derive from /LHL/ and /LH/, respectively (Hyman 2003: 160), Welmers’ accent would have to be a very low-level phenomenon, assigned after the derivation of LH → M. We suspect that other claims of WA based on tonal distinctions are equally due to the intrinsic properties of pitch and other factors unrelated to WA.

While in cases such as Kpelle there is a lack of additional phonetic or phonological evidence for WA, in a number of sub-Saharan African languages the stem- (or root-)initial syllable is an unambiguously ‘strong’ position licensing more consonant and vowel contrasts than pre- and post-radical positions, where ‘weaker’ realizations are also often observed. Perhaps the best-known case is Ibibio, whose consonant contrasts are distributed within the prosodic stem as in (4) (Urua 2000; Harris and Urua 2001; Akinlabi and Urua 2006; Harris 2004).

(4) a. prosodic stem structures:  CV, CVC, CVVC, CVCV, CVVCV, CVCCV
    b. stem-initial consonants:   b f m t d s n y ɲ k ŋ kp w
    c. coda consonants:           p m t n y k ŋ
    d. intervocalic VCV:          β m ɾ n ɣ ŋ
    e. intervocalic VCCV:         pp mm tt nn yy kk ŋŋ

As indicated, 13 consonants contrast stem-initially versus six or seven in the other positions. The intervocalic weakening of stops to [β, ɾ, ɣ] only between the two stem syllables (not between prefix and stem, for instance), as well as the realization of /i, u/ as [ɨ, ʌ] in stem-initial position, points to the formation of a prosodic stem with a strong-weak foot structure: /díp/ → [dɨ́p] ‘hide’, /díp-á/ → [dɨ́βé] ‘hide oneself’. In addition, although the first syllable can have any of the six vowels /i e u o ɔ a/, the second syllable is limited to a single vowel analysable as /a/, which assimilates to the vowel of the first syllable (cf. [tòβó] ‘make an order’, [dééβ-é] ‘not scratch’, [kɔ́ŋ-ɔ́] ‘be hung’).
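Distributional asymmetries of this kind are in effect a positional licensing table. Encoding (4) as a mapping from stem positions to permitted consonants (our illustration; the names are arbitrary) makes the stem-initial/elsewhere asymmetry directly queryable.

```python
# Our illustration: (4) restated as a positional licensing table, mapping
# stem positions to the consonants they admit.
LICENSED = {
    'stem-initial':     set('b f m t d s n y ɲ k ŋ kp w'.split()),
    'coda':             set('p m t n y k ŋ'.split()),
    'intervocalic VCV': set('β m ɾ n ɣ ŋ'.split()),
}

def licensed(consonant, position):
    return consonant in LICENSED[position]

print(len(LICENSED['stem-initial']))   # 13 contrasts stem-initially
print(licensed('d', 'stem-initial'))   # True
print(licensed('d', 'coda'))           # False: codas admit only 7 consonants
```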
Such distributional asymmetries are an important and widespread areal feature in West and Central Africa, in a zone extending from parts of Guinée, Côte d’Ivoire, and Burkina Faso in the West to Gabon and adjacent areas in the two Congos, partly overlapping with what Güldemann (2008) identifies as the core of the Macro-Sudan Belt. Most languages in this stem-initial prominence area are from the NC stock. However, the pattern whereby the initial syllable is favoured with consonant and vowel contrasts while the second is starved of them is an areal feature and cuts across families. It is strongest in the centre of the area (i.e. on both sides of the Nigeria–Cameroon border) and decreases towards the periphery. Most peripheral NC languages have very few such distributional asymmetries (e.g. North-Central Atlantic, Bantu south of the Congo), while it is present in Northwest Bantu, but not (or not as much) in the rest of Bantu. Finally, most non-NC languages with similar distributional asymmetries are found at the periphery of the area, where they are likely to have acquired stem-initial prominence through contact with NC languages, such as Chadic languages spoken on the Jos Plateau next to Benue-Congo languages with stem-initial prominence, including Goemai, which has a long history of contact with Jukun (cf. Hellwig 2011: 6). Similarly, the initial-prominent Chadic languages Ndam (Broß 1988) and Tumak (Caprile 1977) and the isolate Laal (Boyeldieu 1977; Lionnet, personal notes) are spoken in southern Chad next to Lua and Ba, two Adamawa languages with strong stem-initial prominence (Boyeldieu 1985; Lionnet, personal notes). Nilo-Saharan languages, most of which are spoken far from the stem-initial prominence area, do not seem to have similar distributional asymmetries. This is also true of Saharan or Bongo-Bagirmi languages, spoken relatively close to the area. Stem-initial prominence cued by segmental distributional asymmetries thus seems to be an areal feature within the Macro-Sudan Belt, affecting mostly NC languages (cf. Table 12.2 section a), but also neighbouring unrelated languages through contact (cf. Table 12.2 section b).

However, as in the case of multiple tone heights, the Kalahari Basin area acts as a southern counterpart to the Macro-Sudan Belt in being an area of strong stem-initial prominence. In most of the languages formerly known as ‘Khoisan’, lexical stems strictly conform to the phonotactic templates C(C)₁V₁C₂V₂, C(C)₁V₁V₂, and C(C)₁V₁N. The stem may start with virtually any consonant in the (sometimes very large) inventory, including clicks, and any of the attested clusters (only a few sonorants are not attested stem-initially), while only a handful of consonants, mostly sonorants, are attested in C₂ (cf. Table 12.2 section c).
Table 12.2 Stem-initial prominence marked by distributional asymmetries

a. Mande        Guro (Vydrin 2010), Mano (Khachaturyan 2015)
   Gur          Konni (Cahill 2007), Koromfe (Rennison 1997)
   Gbaya        Gbaya Kara Bodoe (Moñino and Roulon 1972)
   Adamawa      Lua (Boyeldieu 1985), Kim (Lafarge 1978), Day (Nougayrol 1979), Mundang (Elders 2000), Mambay (Anonby 2010), Mumuye (Shimizu 1983), Dii/Duru (Bohnhoff 2010)
   Plateau      Izere (Blench 2001; Hyman 2010c), Birom (Blench 2005; Hyman 2010c)
   Cross River  Ibibio (Urua 2000; Harris and Urua 2001; Akinbiyi and Urua 2002; Harris 2004), Gokana (Hyman 2011)
   NW Bantu     Kukuya (Paulian 1975; Hyman 1987), Tiene (Ellington 1977; Hyman 2010c), Basaa (Hyman 2008), Eton (Van de Velde 2008)
b. Chadic       Goemai (Hellwig 2011), Tumak (Caprile 1977), Ndam (Broß 1988)
   Isolate      Laal (Lionnet, personal notes)
c. ‘Khoisan’    (Beach 1938; Traill 1985; Miller-Ockhuizen 2001; Nakagawa 2010)
The initial stem syllable may also affect how tone rules apply (e.g. attracting a H tone to it, as in Giryama; Volk 2011b) or stopping the further spread of H, as in Lango, where Noonan (1992: 51) states that ‘primary stress in Lango is invariably on the root syllable’. Since stem-initial stress is common cross-linguistically, it is natural to identify such stem-initial effects with the broader concept of WA, despite the otherwise elusive nature of WA in sub-Saharan African languages. For further discussion see Hyman (2008: 324–334), Downing (2010: 408–409), Hyman et al. (2019), and references cited therein.
12.4 Intonation

Focusing on the prosodic features that mark sentence type or syntactic domain, we follow Ladd’s (2008b) definition: ‘Intonation, as I will use the term, refers to the use of suprasegmental phonetic features to convey “post-lexical” or sentence-level pragmatic meanings in a linguistically structured way’ (italics retained from Ladd). Following Ladd, we leave out a discussion of paralinguistic functions of intonation, such as enthusiasm or excitement, as achieved by tempo and pitch range modulations.

A number of African languages distinguish sentence types with intonational pitch contours, often in addition to the lexical tones, GTs, or WA in the language. Other prosodic features, such as length, are also used to mark phrasal boundaries. However, some highly tonal languages in Africa show little to no effect of intonation at all. Perhaps expectedly, there seems to be a correlation between high numbers of contrastive lexical and grammatical level tones and a lack of intonational contours to mark sentence type. For example, Connell (2017) describes the prosodic system of Mambila, a Bantoid language (Nigeria and Cameroon) with four contrastive tone heights in addition to GT marking, as having no consistent f0 contribution in indicating sentence type (i.e. declarative sentence vs. polar question). This section surveys intonational tendencies in polar questions, declarative sentences, and focus constructions across sub-Saharan African languages.
12.4.1 Pitch as marking sentence type or syntactic domain

One particularly salient property of intonation contours in a wide range of sub-Saharan African languages is the lack of a rising right-edge boundary tone in polar questions (Clements and Rialland 2008: 74–75; Rialland 2007, 2009). However, even languages that lack a H% in polar questions by and large show pitch raising in some respect, either through register raising (as in Hausa and Lusoga) or by a raising of a H before a final L%. In a sample of over 100 African languages, Clements and Rialland (2008) found that more than half lack an utterance-final high or rising contour in polar questions. A number of languages show no intonational difference between declarative sentences and polar questions. Others make use of utterance-final low or falling intonation in polar questions. Specifically, such marking of polar questions by a final boundary tone L% is found in most Gur languages, as well as a number of Mande, Kru, Kwa, and Edoid languages, suggesting that it is an areal feature of West Africa. Clements and Rialland (2008: 77) found no Bantu,
Afro-Asiatic, or Khoisan languages that mark polar questions with a final L%,² though see Rialland and Embanga Aborobongui (2017) on Embosi, a Bantu language with a HL% falling boundary tone in polar questions. Further east in Lusoga, another Bantu language, there is a right-edge H% tone in declaratives, but a L% in interrogatives and imperatives (Hyman 2018). All of the verb forms and the noun ‘farmers’ in (5a) are underlyingly toneless, while in (5b) ‘women’ has a H to L pitch drop from /ba-/ onto the first syllable of the noun root /kazi/.

(5) a. Declarative:    à-bál-á á-bá-límí   ‘s/he counts the farmers’
       Interrogative:  à-bàl-à à-bà-lìmì   ‘does s/he count the farmers?’
       Imperative:     bàl-à à-bà-lìmì     ‘count the farmers!’

    b. Declarative:    à-bál-á á-bá-kàzí   ‘s/he counts the women’
       Interrogative:  à-bál-á á-ba̋-kàzì   ‘does s/he count the women?’
       Imperative:     bàl-à à-ba̋-kàzì     ‘count the women!’

(a- ‘3sg noun class 1 subject’, -bal- ‘count’, -a ‘final inflectional vowel’, a- ‘augment determiner’, -ba- ‘class 2 noun prefix’, -limi ‘farmer’, -kàzi ‘woman’)

While the speaker typically raises the pitch register to produce the completely toneless interrogative utterance in (5a), the whole sequence trends down towards the final L%. In the interrogative in (5b), the phonological L that follows the H is realized on super-high pitch with subsequent TBUs progressively anticipating the level of the L%.³ This widespread L% question marking across sub-Saharan Africa is surprising from a cross-linguistic perspective, since a H% or rising intonation in polar questions has been noted across language families (Bolinger 1978: 147) and at one time was thought to be a near-universal linguistic property (Ohala 1984: 2).

² Hausa shows an optional low boundary tone in polar questions (Newman and Newman 1981); however, there is also clear register raising in polar questions (Inkelas and Leben 1990), which rules it out of Clements and Rialland’s list.
³ Concerning the imperative, it is also possible to say [bàl-à à-bá-kàzí] if the meaning is a suggestion, e.g. ‘what should I do?’, answer: ‘count the women!’. It is never possible to show a final rise in a question.

On the other hand, a large number of African languages show a right-edge L% in declarative sentences. Of the 12 African language prosodic systems described in Downing and Rialland (2017a), 10 display a L% in declaratives. The two exceptions are Basaa [Bantu; Cameroon] (Makasso et al. 2017) and Konni [Gur; Ghana] (Cahill 2017). See also the discussion above of Lusoga, which has a H% in declaratives.

Remarkably from a cross-linguistic perspective, not many African languages use prosody to mark focus constructions. While there are a number of distinct focus constructions found across African languages (Kalinowski 2015), intonation plays little to no role in focus marking. According to Kalinowski (2015: 159), ‘It is evident from the collection of examples from these 135 languages that focus encoding in African languages is largely morphosyntactic in nature. While prosodic cues of stress and intonation may also be involved, they are not the primary means of encoding focus.’ However, there are a few exceptions to the rule, where focused elements show a marked intonation contour: Hausa (Inkelas 1989b; Inkelas and Leben 1990), Chimwiini [Bantu; Somalia] (Kisseberth 2016),
Akan (Kügler 2016), Shingazidja [Bantu; Comoros] (Patin 2016), and Bemba [Bantu; Zambia] (Kula and Hamann 2017). In Hausa (Inkelas et al. 1986; Inkelas and Leben 1990), almost any word in an utterance can be emphasized (focused) by raising the first high tone in the emphasized word, which marks the beginning of an intonational domain. Phonological alternations that only apply within and not across intonational domains in Hausa (i.e. downdrift and raising of underlying low tones between two underlying high tones) do not apply between an emphasized word and the preceding word.

In languages with both complex tonal inventories and intonation, the two sometimes interact. In Embosi [Bantu; Congo-Brazzaville] (Rialland and Aborobongui 2017), intonational boundary tones are superimposed onto lexical tones, resulting in super-high and super-low edge tones. In (6), the final lexical L is produced with super-low pitch due to the utterance-final L%.

(6) [wáβaaɲibeabóowéȅ] (Rialland and Aborobongui 2017: 202)
    wa       áβaaɲi             bea           bá       (m)o-we
    3sg.pro  3sg.take.away.rec  cl8.property  cl8.gen  cl1-deceased
    ‘He took away the properties of the deceased.’

In Hausa, the final falling tone (HL%) in interrogatives neutralizes the difference between underlying H and underlying HL (Inkelas and Leben 1990). For example, word-final kai, ‘you’, with underlying H, and kâi, ‘head’, with underlying HL, are both produced with a HL fall as the final word in a question. In addition, downdrift is suspended both in questions and under emphasis in Hausa (Schachter 1965). In other languages with both tone and intonation, the two have very little effect on each other.

In a number of languages, coordination is often optionally marked with intonation alone. This is the case, for example, in Jamsay (Dogon: Heath 2008: 136–138), a two-tone language where coordinated NPs can simply be listed, the coordination being marked on every coordinated element only by what Heath terms ‘dying quail’ intonation, characterized by the ‘exaggerated prolongation of the final segment (vowel or sonorant), accompanied by a protracted, slow drop in pitch lasting up to one second’, e.g. /wó∴ kó∴/ → [wóōò kóōò] ‘he/she and it’. A similar phenomenon is attested in Laal (Lionnet, personal notes), which has three tones (H, M, and L) and where two intonational patterns marking emphatic conjunction are attested. In both cases, the conjoined NPs are juxtaposed, and the coordination is marked only by a specific word-final pitch contour. In the first case, illustrated in (7a), the last syllable of every coordinated member is realized with rising pitch, irrespective of the value of the last lexical tone. The second strategy is preferred when the list is very long. Here, the word-final rhyme is significantly lengthened and the rising pitch is followed by a fall to mid pitch, as shown in (7b).

(7) a. bààr↗       náár↗       í      tú   pār   →  [bàa̋r náa̋r…]
       his.father  his.mother  it.is  Bua  all
       ‘Both his father and his mother are Bua [ethnic group].’

    b. í      sèré↗↘  í      cáŋ↗↘  kə́w   kíínà  pār   →  [í sèrée̋ē í cáŋ̋ŋ̄…]
       it.is  S.Kaba  it.is  Sara   also  do.it  all
       ‘The Sara Kaba, the Sara, etc. everyone used to practise it too (slavery).’
12.4.2 Length marking prosodic boundaries

Pitch is not the only phonetic parameter used to demarcate utterance and phrasal boundaries. A number of Bantu languages display lengthening of the penultimate vowel of a particular syntactic domain or sentence type. For instance, in Shekgalagari, which contrasts /H/ vs. Ø, the penultimate vowel is lengthened in declarative utterances, creating differences between medial versus final forms of nouns, as shown in (8) (Hyman and Monaka 2011: 271–272). While nouns with /Ø-H/ and /H-Ø/ patterns simply show vowel lengthening, penultimate lengthening affects the tone of the last two syllables of the other two patterns. When the last two syllables are toneless, the lengthened penultimate vowel contours from a L to super-low tone. When the last two syllables are /H/, the final H is lost and the penultimate H contours from H to L.

(8)  Underlying  /Ø-Ø/    /Ø-H/    /H-Ø/    /H-H/
     Medial      nàmà     nàwá     lórì     nárí
     Final       nȁːmà    nàːwá    lóːrì    nâːrí
                 ‘meat’   ‘bean’   ‘lorry’  ‘buffalo’

Penultimate lengthening does not occur in interrogative or imperative sentence types, where the final tones are realized as in medial position: [à-bàl-à rì-nárí] ‘is s/he counting buffalos?’ (cf. [à-bàl-à rì-nâːrí] ‘s/he is counting buffalos’). See also Selkirk (2011) for clause-level penultimate lengthening in Xitsonga and Hyman (2013) for a survey of the status of penultimate lengthening in different Bantu languages.
12.5 Conclusion

The prosodic systems of sub-Saharan languages are quite varied. While tone is almost universal in the area, some languages have very dense tonal contrasts, some sparse; some languages make extensive grammatical use of tone, some less; and so forth. Word stress is less obvious in most languages of the area, with the question of whether stem-initial prominence should be equated with WA being unresolved. Finally, although intonation remains less studied, the recent flurry of intonational studies is very encouraging.
Chapter 13

North Africa and the Middle East

Sam Hellmuth and Mary Pearce
13.1 Introduction This chapter reviews the prosodic systems of languages spoken in North Africa and the Middle East, taking in the Horn of Africa, the Arabian Peninsula, and the Middle East. The area’s southern edge is formed by Mauretania, Mali, Niger, Chad, South Sudan, Ethiopia, and Somalia, as illustrated in Map 13.1, which indicates word-prosodic features by the greyscaled locator circles.1 We outline the scope of typological variation within and across the Afro-Asiatic and Nilo-Saharan language families in word prosody, prosodic phrasing, melodic structure, and prosodic expression of meaning (sentence modality, focus, and information structure). The survey is organized around language sub-families (§13.2 and §13.3). We close with a brief discussion in §13.4, where we also set out priorities for future research. In this chapter the term ‘stress’ denotes word-level or lexical prominence. We assume tone and stress are independent, with no intermediate accentual category (Hyman 2006). The term ‘pitch accent’ thus always denotes a post-lexical prominence or sentence accent, as used in the autosegmental-metrical framework (Gussenhoven 2004; Ladd 2008b).
13.2 Afro-Asiatic

13.2.1 Berber

The Berber—now known as Amazigh—languages are all non-tonal but appear to vary regarding the presence of stress. The Eastern varieties (in Tunisia, Libya, and Egypt) display word-level stress (Kossmann 2012), though without stress minimal pairs.
Relatively little is known about the word prosody of most Libyan dialects, such as Ghadames (Kossmann 2013), but in Zwara stress generally falls on the penult (Gussenhoven 2017a). In contrast, in the Northern varieties (in Morocco and Algeria), although it is possible to construe rules for stress assignment in citation forms, these do not hold in connected speech (Kossmann 2012). For example, in Tarifit, prosody marks both clause structure and discourse structure, but pitch and intensity do not routinely co-occur (McClelland 2000). Similarly, in Tuareg, although stress can be described for citation forms (lexically determined in nouns and verbs but on the antepenultimate otherwise), accentual phrasing overrides these citation-form stress patterns in ways that are as yet poorly understood and require further investigation (Heath 2011: 98). Experimental investigation has clarified this variable pattern in Tashlhiyt, which emerges as a non-tonal, non-stress language (without culminative stress). For example, in Tashlhiyt the intonational peak in polar questions varies probabilistically; sonorant segments tend to attract the pitch accent and tonal peaks are later in questions than in statements (Grice et al. 2015), and a similar pattern is found in wh-questions (Bruggeman et al. 2017). Intonational peaks in Tashlhiyt thus do not display the kind of consistent alignment that might indicate underlying association with a stressed syllable. In contrast, the intonation patterns of Zwara, which has word-level stress, are readily analysed in terms of intonational pitch accents and boundary tones (Gussenhoven 2017a).
In general, Amazigh languages make use of an initial or final question particle in polar questions and display wh-word fronting (Frajzyngier 2012). Focused elements are usually fronted but can also be right-dislocated, with associated prosodic effects; a topic is similarly placed clause-initially and marked by intonation (Frajzyngier 2012). Verb focus can be marked solely prosodically in most Amazigh varieties, with the exception of Kabyle, which requires the verb to be clefted (Kossmann 2012: 94).
13.2.2 Egyptian

The now extinct Egyptian language went through several stages (Old, Middle, Late, and Demotic) before evolving into Coptic. There is no indication that the language had contrastive tone at any stage. Egyptian had wh-in-situ, and it is assumed (Frajzyngier 2012) that at all stages a polar question could be realized on declarative syntax by changing the intonation contour. It had a set of stressed pronouns and a focus particle, and topicalization was realized through extraposition and a particle.

The Coptic language was spoken from the fourth to the fourteenth centuries, cohabiting with Arabic from the ninth century onwards, and survives only in the liturgy of the Coptic Orthodox Church. Reconstructing from Coptic, it is likely that stress in Egyptian fell on either the final or the penult syllable; it is reported to have been marked by ‘strong expiratory stress’ (Fecht 1960; cited in Loprieno and Müller 2012: 118). Questions in Coptic were marked by particles and ‘possibly also by suprasegmental features such as intonation’ (Loprieno and Müller 2012: 134).
13.2.3 Semitic

The Semitic languages are almost all non-tonal stress languages (exceptions are noted below).
13.2.3.1 East Semitic

Evidence from texts in the now extinct Akkadian language indicates that it did not have phonemic stress, but otherwise little is known about its prosody (Buccellati 1997). It displayed fronting of topics and right-dislocation with resumptive pronouns (Gragg and Hoberman 2012).
13.2.3.2 West Semitic: Modern South Arabian

In the western Modern South Arabian languages (Hobyot, Bathari, Harsusi, and Mehri), stress falls on the rightmost long syllable in the word, else on the initial syllable; in the eastern languages, Jibbali can have more than one prominent syllable per word, while in Soqotri stress falls towards the beginning of the word (Simeone-Senelle 1997, 2011). Polar questions are marked in the Modern South Arabian languages by means of intonation alone, and wh-words are either always initial (e.g. Soqotri) or always final (e.g. Mehri) (Simeone-Senelle 1997, 2011). A recent investigation of speech co-gestures in Mehri and Shehri (Jibbali) notes that intonation is used in Mehri to mark the scope of negation, though without explicitly describing the prosodic means used to achieve this effect (Watson and Wilson 2017).
13.2.3.3 West Semitic: Ethio-Semitic

Although Ge’ez is no longer spoken, tradition suggests that stress fell on the penult in verbs but was stem-final in nouns and pronouns, with some exceptions (Gragg 1997; Weninger 2011). Stress in Tigrinya has been described as shifting readily from one position to another and is not always marked in parallel by dynamic stress correlates (intensity/duration) and pitch. Kogan (1997: 439) suggests therefore that ‘sentence intonation is clearly predominant over the stress of an individual word’, resembling descriptions noted above for Amazigh varieties that lack stress. A similar pattern is reported for neighbouring Tigré (Raz 1997). In Amharic, stress is described as ‘not prominent’, falling primarily on stems but displaying some interaction with syllable structure, and requiring further research (Hudson 1997). In other Ethio-Semitic languages, descriptions tend to be limited to a statement that stress is not phonemic, without elaborating further (e.g. Wagner 1997 for Harari), or make no mention of stress at all (Watson 2000). Hetzron (1997a) suggests that there is variation among Outer South Ethiopic languages, with only the most ‘progressive’ (Inor) displaying discernible stress (typically on a final heavy syllable, else on the penult).

In Amharic, polar questions are marked by rising intonation, a clause-final question marker, or a verbal suffix (Hudson 1997); in wh-questions the wh-word occurs before the sentence-final verb (Frajzyngier 2012). Questions are formed by means of a question particle attached to the questioned constituent in Tigrinya (Kogan 1997), and by an optional sentence-final particle in the Outer South Ethiopic languages (Hetzron 1997a).
13.2.3.4 Central Semitic: Sayhadic

Little is known about the stress system or any other aspect of the prosody of the now extinct Sayhadic languages (Kogan and Korotayev 1997).
13.2.3.5 Central Semitic: North West Semitic

In Biblical Hebrew, stress was generally final, with exceptions (Edzard 2011); stress markings in the codified Masoretic text of the Hebrew Bible show that stress position was contrastive and that surface vowel length was governed by stress (Steiner 1997). Segmental sandhi, known as ‘pausal forms’, are observed at phrase boundaries (McCarthy 1979b, 2012), and prosodic rhythm rules applied in the Tiberian system (Dresher 1994). Stress in Modern Hebrew falls on the final or penult syllable, with some morphological exceptions (Berman 1997; Schwarzwald 2011), as it most likely did in early Aramaic. In the Eastern Neo-Aramaic languages, stress tended to fall on the penult, whereas in the West Aramaic languages the position of stress depends on syllable structure, as for Arabic (Jastrow 1997; Arnold 2011; Gragg and Hoberman 2012).
13.2.3.6 Central Semitic: Arabian

Little is known of the prosody of the extinct Ancient North Arabian languages. The other members of the Arabian family form five regional groups of spoken dialects across North Africa, Egypt and Sudan, the Levant, Mesopotamia, and the Arabian Peninsula (Watson 2011).
The position of primary stress in the word is in general predictable from syllable structure in Arabic dialects (as also in Maltese) and there is an extensive literature on micro-variation between dialects in stress placement, as illustrated in Table 13.1 (see summaries in van der Hulst and Hellmuth 2010; Watson 2011; Hellmuth 2013).
Table 13.1 Stress assignment in different Arabic dialects

                        ‘book’     ‘writer’   ‘library’    ‘he wrote’
  Standard Arabic       kiˈtaːb    ˈkaːtib    ˈmaktaba     ˈkatab
  Palestinian Arabic    kiˈtaːb    ˈkaːtib    ˈmaktaba     ˈkatab
  Lebanese Arabic       kiˈtaːb    ˈkaːtib    ˈmaktabe     ˈkatab
  Cairene Arabic        kiˈtaːb    ˈkaːtib    makˈtaba     ˈkatab
  Negev Bedouin         kiˈtaːb    ˈkaːtib    ˈmaktabah    kaˈtab

(Adapted from Hellmuth 2013: 60)
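The broad logic of weight-driven stress assignment can be made explicit, as in the following Python sketch (ours, not from the literature surveyed here). It implements a deliberately simplified rule of the general Palestinian type in Table 13.1 (stress a superheavy ultima, else a heavy penult, else the antepenult) and should not be read as any dialect's full grammar; Cairene makˈtaba, for instance, requires a different rule.

```python
# Illustrative weight-driven stress assigner of the general type found in
# many Arabic dialects (simplified). Syllable weights: 'L' light (CV),
# 'H' heavy (CVV/CVC), 'S' superheavy (CVVC/CVCC). A word-final CVC is
# assumed already downgraded to 'L' (final-consonant extrametricality),
# as in standard analyses of Arabic stress.

def assign_stress(weights):
    """Return the 0-based index of the stressed syllable."""
    if weights[-1] == "S":                      # superheavy ultima attracts stress
        return len(weights) - 1
    if len(weights) >= 2 and weights[-2] != "L":
        return len(weights) - 2                 # else a heavy penult
    return max(len(weights) - 3, 0)             # else the antepenult (or initial)

examples = {
    "ki.taab":   ["L", "S"],        # -> syllable 1: kiˈtaab 'book'
    "kaa.tib":   ["H", "L"],        # -> syllable 0: ˈkaatib 'writer'
    "mak.ta.ba": ["H", "L", "L"],   # -> syllable 0: ˈmaktaba 'library'
    "ka.tab":    ["L", "L"],        # -> syllable 0: ˈkatab 'he wrote'
}
for word, w in examples.items():
    print(word, "-> stress on syllable", assign_stress(w))
```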
Exceptions to the general rule of predictable stress in Arabic include Nubi (derived from an Arabic pidgin), which has an accentual system (Gussenhoven 2006), and Moroccan Arabic, in which the status of stress is disputed. Maas and Procházka (2012) argue that Moroccan Arabic and Moroccan Berber (including Tashlhiyt) form a Sprachbund, sharing a large number of features across all linguistic domains, including phonology. They thus argue that Moroccan Arabic—like Moroccan Berber (see §13.2.1)—has post-lexical phrasal accentuation only, and no stress. There have been differing views on Moroccan Arabic stress (Maas 2013), since a stress generalization can be formulated for citation forms that no longer holds in connected speech (Boudlal 2001). One suggestion is that Moroccan Arabic has stress but is an ‘edge-marking’ language with boundary tones only and no prominence-marking intonational pitch accents (Burdin et al. 2015). Indeed, the descriptive observation is that tonal peaks occurring on a phrase-final word display alignment with the syllable that would be stressed in citation form (Benkirane 1998), confirmed also in corpus data (Hellmuth et al. 2015). This suggests the peak is neither solely prominence marking nor edge marking, forcing analysis as an edge-aligned pitch accent, as proposed for French (Delais-Roussarie et al. 2015), or as a non-metrical pitch accent (Bruggeman 2018). A recent comparative study (Bruggeman 2018) shows that Moroccan Arabic and Moroccan Berber speakers both demonstrate perceptual insensitivity to lexical prominence asymmetries, of the type shown by speakers of other languages known to lack word-level stress, such as French or Farsi (Rahmani et al. 2015).

Standard Arabic is not acquired by contemporary speakers as a mother tongue but is instead learned in the context of formal religious or state education. It is possible to formulate a stress algorithm for Standard Arabic (Fischer 1997), and stress rules are described for learners of Arabic (Alhawary 2011), but Gragg and Hoberman (2012: 165) note that the Arab traditional grammarians did not describe the position of stress in Classical Arabic, and take this as evidence that Classical Arabic did not have stress and was ‘like modern Moroccan Arabic’. Retsö (2011) similarly suggests that the absence of stress–morphology interaction in Classical Arabic indicates that it had a system in which prominence was marked only by pitch. The prosody of Standard Arabic, as used today in broadcasting and other formal settings, most likely reflects the prosodic features of a speaker’s mother tongue spoken dialect
(cf. Retsö 2011). For stress this generates micro-variation in stress assignment patterns in Standard Arabic in different contexts, such as Cairene Classical Arabic versus Egyptian Radio Arabic (Hayes 1995). For intonation, some sharing of tonal alignment features between colloquial and Standard Arabic was found in a small study of Egyptian Arabic speakers (El Zarka and Hellmuth 2009). Prosodic juncture is marked in Standard Arabic by ‘pausal forms’, whereby grammatical case and other suffixes appear in a different form when phrase-final (Hoberman 2008; Abdelghany 2010; McCarthy 2012), as in Table 13.2. Accurate use of pausal forms is part of tajwīd rules for recitation of the Qur’ān (Al-Ali and Al-Zoubi 2009).
Table 13.2 Pausal alternations observed in Classical Arabic (McCarthy 2012)

                                        Phrase-internal   At pause
  Absence of suffix case vowel          ʔalkitaːb-u       ʔalkitaːb    the book (nom)
  Epenthesis of [h] after stem vowel    ʔiqtadi           ʔiqtadih     imitate (3ms.imp)
  Metathesis of suffix vowel            ʔal-bakr-u        ʔal-bakur    the young camel (nom)
  Absence of suffixal [n]               kitaːb-un         kitaːb       a book (nom)
  [ah] for suffix [at]                  kaːtib-at-un      kaːtib-ah    a writer (f.nom)
There are relatively few descriptions of cues to phrasing in spoken Arabic dialects (Hellmuth 2016), but it is likely that there is variation across dialects in the ‘default’ prosodic phrasing, similar to that seen in Romance languages: in Spanish, a phrase boundary is typically inserted after the subject in an SVO sentence, but not in Portuguese (Elordieta et al. 2005), and a similar pattern appears to differentiate Jordanian Arabic and Cairene Arabic (Hellmuth 2016). Segmental sandhi mark prosodic boundaries in some dialects: laryngealization in dialects of the Arabian Peninsula (Watson and Bellem 2011) and Tunisia (Hellmuth 2019), diphthongization of final vowels in the Levant, and nasalization in western Yemen (Watson 2011). Further research is needed to determine whether these cues mark syntactic structure or some other aspect of discourse structure, such as turn-finality. Focus and topic marking are achieved in spoken Arabic through a mixture of syntactic and prosodic means, including clefts or pseudo-clefts with associated prosodic effects. In most varieties a polar question can be realized through prosodic means alone; dialects vary with respect to wh-fronting versus wh-in-situ (Aoun et al. 2010). Focus can also be marked by prosodic means alone in many if not all dialects (see the literature review in Alzaidi et al. 2018).

There is a growing body of literature on the intonational phonology of Arabic dialects (see summaries in Chahal 2006; El Zarka 2017). So far, all Arabic dialects outside North Africa appear to display intonation systems comprising both pitch accents and boundary tones. Variation in the inventory of nuclear contours (nuclear accent + final boundary tone combinations), as reported in Chahal (2006), suggests dialectal variation in the inventory of boundary tones, at least, and further comparative work may reveal variation in pitch accent inventories. Retsö (2011) notes variation across dialects in the phonetic realization of stress, differentiating ‘expiratory accent’ in Levantine varieties from ‘tonal accent’ in Cairene; this observation has been reanalysed in the autosegmental-metrical framework as variation in the distribution of pitch accents, occurring on every prosodic word in Cairene but more sparsely distributed, at the phrasal level, in Levantine (Hellmuth 2007; Chahal and Hellmuth 2014a).
13.2.4 Chadic

Chadic languages are tonal. Many Chadic languages (e.g. Migaama, Mofu, and Mukulu) are open to analysis as ‘accentual languages’ in which there is at most one H tone per word, which is accompanied by other indicators of prominence (Pearce 2006), but others (e.g. Kera, Masa, and Podoko) have three tones and a variety of lexical tone melodies on nouns. A common explanation for this variety within the Chadic family is that a process of tonogenesis has generated a tonal split from a single tone system into two tones in some languages, and into three in others (Wolff 1987). A typical example, as illustrated for Musgu in (1), is a system where syllables with voiceless onsets usually carry a H tone and syllables with voiced onsets usually carry a L tone; sonorants and implosives may be associated with a third tone (M), or they might pattern with one of the other groups.

(1) Musgu depressor and raiser consonants (Wolff 1987)
    depressor: L  zìrì ‘align’    vìnì ‘take’
    raiser:    H  sírí ‘squash’   fíní ‘stay’
    neutral:   L  yìmì ‘trap’     H  yímí ‘be beautiful’

The wide variety of systems observed in Chadic suggests that tonogenesis probably occurred independently in separate languages rather than once in proto-Chadic (Jim Roberts, personal communication). Whatever the diachronic history, in the synchronic grammar the roles may become reversed: in Kera it is now tone that is phonemic, with laryngeal voice onset time cues serving as secondary enhancement to the tone cues (Pearce 2005). The function of tone in Chadic is lexical as well as grammatical (Frajzyngier 2012), and most languages appear to display little tone movement or spreading, and probably no downstep (Jim Roberts, personal communication); however, exceptions include Ga’anda, which has floating tones and associated downdrift (Ma Newman 1971), and Ngizim, which has tone spreading and downstep (Schuh 1971).

Hausa has two basic tones: H~L. Surface falling tones derive from adjacent underlying HL sequences (e.g. due to affixation) but can only be realized on a heavy syllable; in contrast, underlying LH sequences are truncated to a surface high tone (P. Newman 2000, 2009). A more complex case is Kera, which has three tones in rural speech communities, but in urban varieties (where there has been prolonged contact with French) the system reduces to two tones plus a voicing contrast in some contexts, and the change is sociolinguistically conditioned: among women in the capital, there is an almost complete loss of tone (Pearce 2013). Although Kera is cited as one of the few languages to exhibit long-distance voicing harmony between consonants (Odden 1994; Rose and Walker 2004), the facts can be accounted for by proposing tone spreading with voice onset time corresponding to the tone (Pearce 2006). Similarly, it has been claimed that Kera voiced (‘depressor’) consonants lower the tone of the following syllable (Ebert 1979; Wolff 1987; Pearce 1999), but acoustic analysis confirms that although there is surface consonant and tone interaction, it is the tones that are underlying and distinct (Pearce 2006). Mawa has three tones in surface transcription, which can probably be reduced to two underlying tones, M and L, with H as an allophone of L (Roberts 2013), and Roberts (2005) makes similar claims for Migaama.

In sum, the typical Chadic tonal system has a two-way contrast between /H/ and a non-high tone [M] or [L], depending on the preceding consonant, which in some languages has developed into a three-way /H, M, L/ contrast.
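The onset-conditioned logic behind this kind of tonogenesis can be sketched procedurally. The following Python sketch is ours, not Wolff's: the segment classes are illustrative assumptions rather than a faithful Musgu inventory, and the two-syllable split is hard-coded for the CVCV words of (1).

```python
# Sketch of the Musgu depressor/raiser pattern described in (1), after
# Wolff (1987): voiceless onsets tend to surface H, voiced obstruent
# ('depressor') onsets L, and sonorant/implosive onsets are neutral
# (lexically either tone). Segment classes below are illustrative only.

VOICELESS = set("ptkcsf")
VOICED_OBSTRUENTS = set("bdgzv")

def predicted_tone(onset):
    if onset in VOICELESS:
        return "H"
    if onset in VOICED_OBSTRUENTS:
        return "L"
    return "H or L"   # neutral onsets (sonorants, implosives)

for word in ["ziri", "vini", "siri", "fini", "yimi"]:   # cf. example (1)
    syllables = [word[:2], word[2:]]                    # CVCV split
    print(word, [predicted_tone(s[0]) for s in syllables])
```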
Turning to sentence prosody, in Central and some West Chadic languages, word-final vowel retention (i.e. blocking of word-final vowel deletion) marks the right edge of prosodic phrases, for example in Hausa and Gidar, with similar blocking of vowel raising at phrase edges in Mafa; Wandala does not permit consonants in phrase-final position (Frajzyngier and Shay 2012). Polar questions are typically marked with a particle, whereas focus and topic can be marked in different ways, including particles, extraposition, tense-aspect markers, or intonation (Green 2007; Frajzyngier 2012). Focus is not always prosodically marked, however, if marked at all (Hartmann and Zimmermann 2007a).
13.2.5 Cushitic

All languages in the Cushitic family appear to be tonal, and generally of the non-obligatory type (in which a tone is not observed on every syllable, or on every lexical item). In contrast to Chadic, in most Cushitic languages the function of tone is mostly grammatical, not lexical, marking categories such as negation, case, gender, or focus (Frajzyngier 2012; Mous 2012). Some languages in the family have a purely demarcative tonal system, such as K’abeena, whereas Awngi has demarcative phrasal stress as well as lexical tone (Hetzron 1997b). Somali has three surface word melodies, high LLH ~ falling LHL ~ low LLL (Saeed 1987), typically analysed as a privative system in which presence of a high tone (underlying /H/) alternates with absence of a high tone (underlying ‘Ø’) realized phonetically with low tone (Saeed 1999; Hyman 2009). Iraqw also has either surface H or L on the final syllable but all non-final syllables realized with mid or low tone (Nordbustad 1988; Mous 1993), and can also be analysed as /H/ ~ Ø (Hyman 2006). Beja has one culminative tone per word, whose position is contrastive, yielding minimal pairs such as [ˈhadhaab] ‘lions’ ~ [haˈdhab] ‘lion’ (Wedekind et al. 2005). Sidaama has at most one tone per word, whose position is contrastive but also subject to movement in connected speech (Kawachi 2007). Afar has an obligatory phrasal H tone on the first word in each accentual phrase, appearing on the accented syllable in lexically accented words, otherwise on the final syllable (Hayward 1991).

In some Cushitic languages (including Somali, Iraqw, and Alagwa), when a sentence-final H tone is added to a word to form a polar question, all and any preceding H tones in the word or phrase are deleted, resulting in a low-level contour with a final rise that is described as ‘a phonologized intonational pattern’ (Mous 2012: 352). More generally in Cushitic, polar questions are formed by a change to the prosodic pattern, such as a rise in pitch or a rise-fall (e.g. in Sidaama: Kawachi 2007), with the addition of further segmental material in some languages. In Iraqw this takes the form of a verbal suffix, whereas in K’abeena it is fully voiced (modal) rather than whispered realization of the utterance-final vowel (Crass 2005; cited in Mous 2012); in southern Oromo dialects, the final fall in pitch is realized on a ‘linker clitic’ [áa] (Stroomer 1987). Focus is marked in Cushitic by clefting and/or use of focus particles, and topicalization by means of extraposition and determiners (Frajzyngier 2012). Iraqw and Somali display topic-fronting with a following pause (Frascarelli and Puglielli 2009; Mous 2012). In Oromo, a fronted syllable attracts sentence stress, as does a focus particle (Stroomer 1987). In Iraqw, a polar question is realized by the addition of a sentence-final particle, together with a H tone on the penult syllable of the phrase, which is also lengthened
(Nordbustad 1988). In Beja, the shape and direction of the prosodic contour at phrase edges also serves to disambiguate the function of connective co-verbs in marking discourse structure (Vanhove 2008).
13.2.6 Omotic

All Omotic languages are reported to have contrastive tone, with an overall tendency in the group towards marking of grammatical function rather than lexical contrast (Frajzyngier 2012). In some languages a tone is observed on every syllable, but in others tones do not necessarily appear on every lexical item, nor on every syllable if a word does have tone. Overall, then, the tonal systems vary considerably across the languages in this putative group—which may contribute to doubts about the degree of their relatedness.2 The number of contrastive tones and their distribution range from just one tone per word in Wolaitta, in which H tones contrast with toneless syllables, up to the typologically large system of six level tones in Bench, with a tone on every syllable. Dizi and Sheko each have a system of four level tones, and Nayi and Yem have three. This wide degree of variation may be due to contact with Nilo-Saharan languages (Amha 2012). Hayward (2006) notes a constraint on tonal configurations in nominals in a subset of Omotic languages such that only one high tone may appear within the nominal phrase, with other syllables bearing low tone, yielding a LHL contour, which he calls the ‘OHO’ (one high only) constraint. He notes further that this constraint is confined to those languages that display consistent head-final syntax, and thus have post-modifiers in the noun phrase.

Polar questions are formed in Maale by means of a question particle, optionally accompanied by rising intonation, but in Zargulla a question is marked by a change in verbal inflection, without any accompanying prosodic marking (Amha 2012). Focus is generally achieved by means of extraposition, again with no mention of accompanying prosodic marking (Frajzyngier 2012).
13.3 Nilo-Saharan

The Nilo-Saharan languages3 are tonal, and most have two or three tonal categories with little tone spreading but some interaction of tone with voice quality and vowel harmony.
2 The North Omotic and South Omotic languages are now treated as independent (Hammarström et al. 2018) due to their lack of Afro-Asiatic features (Hayward 2003), despite earlier inclusion in Afroasiatic (Hayward 2000; Dimmendaal 2008). This section reviews the prosody of languages treated as members of the Omotic family at some point or spoken in south-western Ethiopia (within the geographical scope of Map 13.1) without taking a position on the affiliation of individual languages or sub-families to Afroasiatic.

3 The Nilo-Saharan languages are diverse and there is debate as to the integrity of the family (Bender 1996).
13.3.1 Eastern Sudanic

Tone is used to mark case in a number of East African languages, with case often marked exclusively by tone, as in Maa and Datooga (König 2009); in all cases where case is marked by tone, the language has a ‘marked nominative’ case system (König 2008). Hyman (2011, 2019) also notes tonal case-marking in Maasai. Similarly, the Ik language displays lexical tone (in verb and noun roots) realized differently according to grammatical context, with tonal changes that must accompany certain segmental morphemes (Schrock 2014, 2017); the underlying H and L tones each have surface ‘falling’ and downstepped variants and are also subject to downdrift and the effects of depressor consonants. Overall, the patterning of these tonal processes in Ik indicates a role for metrical feet, alongside distinct intonational contours marking indicative, interrogative, and ‘solicitive’ illocutionary force. In Ama, tone plays a part in several grammatical constructions and—in contrast to Ik—there are cases where tone is the only change, as shown in (2) (Norton 2011).

(2)                                                ‘searching’  ‘sleeping’  ‘washing’
    Imperfective third person present              sāŋ          túŋ         ágēl
    Imperfective first or second person present    sàŋ          tūŋ         āgèl
Dinka has a robustly demonstrated three-way vowel length contrast (Remijsen and Gilley 2008; Remijsen and Manyang 2009; Remijsen 2014), and appears to have developed from a vowel harmony type system into a contrast between breathy voice and creaky voice. Dinka is also rich in grammatical tone, for case marking and in verb derivations (Andersen 1995), with some dialectal variation in the number of tonal categories. Acoustic analysis has shown that Dinka contour tones contrast in the timing of the fall relative to the segmental content of the syllable, as well as in tonal height (Remijsen 2013). Contrastive alignment is also found in Shilluk (Remijsen and Ayoker 2014), thus challenging earlier typological claims that alignment in contour tones is never phonologically contrastive (Hyman 1988; Odden 1995). Shilluk has a complex tonal system involving three level tones and four contour tones (Remijsen et al. 2011). Tone has lexical function, marking morphemic status in verbs, but there is also some grammatical function (e.g. the possessive marker).

In Mursi, anticipatory ‘tonal polarity’ is observed at the end of any word ending in a toneless syllable, in anticipation of the tone on the following word (Muetze and Ahland, in press). As with the other Nilo-Saharan languages, there seems to be a limit of one syllable on tone spreading or displacement. Mursi appears to have a two-tone system plus a neutral ‘toneless’ option, and this may hint at a link between two- and three-tone languages in this family—that is, if a ‘toneless’ category developed into a mid M tone in some languages, or vice versa. Kunama also has stable tones that do not move or spread, though tonal suffixes may cross syntactic boundaries and replace tones. Kunama has three level tones (H, M, and L), three falls (HM, HL, and ML), and one rise (MH), with contours observed only on heavy syllables or on word-final short vowels (Connell et al. 2000; Yip 2002: 141–142).
13.3.2 Central Sudanic

Sudanic languages typically have three tones (Jim Roberts, personal communication). In Ngiti [Central Sudanic; Zaire], the tone is largely grammatical, such as marking tense or aspect on verbs (Kutsch Lojenga 1994). The Sara language also has three level tones, but little grammatical tone or tonal morphology (Jim Roberts, personal communication). In contrast to these three-tone languages, Boyeldieu (2000) describes Modo, Baka, and Bongo as having four melodies in disyllabic words and no contour tones on individual syllables, suggesting a classic two-tone system: a phonetic M tone is derived from adjacent /H/ tones, the second of which drops to [M]. In Bongo, tone marks perfective aspect, and lexical tone on verbs is affected by preceding subject pronouns (Nougayrol 2006; Hyman 2016a). Gor has three tones but could have originated from a two-tone system, as four melodies predominate: HM, LM, LL, and MM, with no ML pattern (Roberts 2003). However, Gor cannot now be analysed as a two-tone language because words with CV structure can carry any of the three melodies H, M, or L. Tonal person markers are found in noun suffixes: a H tone indicates possession, but the same segments with no H tone indicate the direct object.
13.3.3 Maban

In Masalit, tone has low functional load; the language has a 10-vowel system exhibiting advanced tongue root (ATR) harmony from suffix to root, though the [+ATR] close vowels are increasingly merging with their [−ATR] counterparts (Angela Prinz, personal communication). Weiss (2009) analyses Maba (which is a 10-vowel ATR system) as a pitch accent system that affects the intensity, distinctiveness, and quality of vowels; the position of the accent is determined by the presence of H tone, a long vowel, and the syllabic structure.
13.3.4 Saharan

In Zaghawa, there appear to be two tones instead of the usual Sudanic three, as well as ATR harmony, but it is too early to make major statements about the tonal system.
13.4 Discussion

This chapter yields a near-comprehensive picture for only one of the four aspects of prosody in our survey, namely word prosody. That is, we can identify what type of word prosody each language has: tone or stress, both, or neither. Frajzyngier (2012) points to a basic divide in word prosody across Afro-Asiatic languages, between tonal and non-tonal languages, and notes debate about the origin of such a divide. One view argues that if any members of the wider family have lexical tone, the common
ancestor must also have had it; thus, non-tonal languages must result from loss of tonal contrast over time, and this is argued to explain the large number of homophones found in Semitic. The competing view proposes tonogeneses of various kinds: due to laryngeal effects in Chadic, where tone more commonly has lexical function, or evolving from a predictable stress system coupled with segmental neutralization, and/or due to contact with robustly tonal languages from other language families. It is beyond the scope of this chapter to resolve this debate, but our survey confirms that the tonal versus non-tonal divide does not equate to a simple ‘stress versus tone’ dichotomy. Among tonal languages, there is wide variation in the number, distribution, and function of tonal contrasts, and it is now becoming clear that non-tonal languages do not all have stress. The non-binary nature of the stress versus tone distinction is well established in theoretical literature on tone (Hyman 2006) and is matched by more recent analyses of non-tonal but also non-stress languages as ‘edge-marking’ languages, in which tonal events associate with the edges of prosodic domains (only), within the autosegmental-metrical framework (Jun 2014b).

Our ability to document prosodic variation, with respect to prosodic phrasing, melodic structure, and prosodic expression of meaning, is limited by the availability of descriptions of these aspects of the languages under consideration. This is sometimes due to a general lack of description of a language but, more commonly, to a lack of description of post-lexical prosody in those descriptions that do exist (with notable exceptions). Frajzyngier (2012: 606) likewise notes, in a discussion of parataxis (marking of the relationship between clauses in complex sentences), that prosodic characteristics are ‘seldom indicated in grammars’, and our survey shows that this is still the norm. Some of these gaps will be artefacts of methodological choices and priorities, but others may be due to the practical difficulties, perceived or real, involved in performing post-lexical prosodic analysis. For example, Watson and Wilson (2017) highlight the importance of information about intonation patterns in contexts that are syntactically ambiguous in written transcription, but also note the ‘cumbersome’ nature of prosodic annotation, and thus argue for collection and sharing of audio (and audiovisual) recordings of less-described languages. There is so much scope for further research on the prosodic systems of North Africa and the Middle East, and particularly on post-lexical prosody, that the work of overcoming these obstacles is merited.
Chapter 14

South West and Central Asia

Anastasia Karlsson, Güliz Güneş, Hamed Rahmani, and Sun-Ah Jun
14.1 Introduction

This chapter offers a survey of prosodic features of languages across Southwestern, Central, and Northern Asia. In this rather large area we find a variety of language families. In §14.2, our focus is on Turkish, the standard variant spoken in Turkey (Turkic), while §14.3 deals with Halh (Khalkha) Mongolian, the standard variant spoken in Mongolia (Mongolic language family). In §14.4, the standard variant of Persian spoken in Iran (Indo-European) is treated. §14.5 deals with standard Georgian (Kartvelian). The Turkic and Mongolic groups are usually regarded as two of the three branches of the proposed Altaic language superfamily, the third being the Tungusic group. Georgian belongs to the South Caucasian language group. The term ‘Caucasian’ applies to the four linguistic families indigenous to the Caucasus: Kartvelian, Abkhaz-Adyghe, Daghestanian, and Nakh (Kodzasov 1999). Owing to the considerable lack of descriptions of the prosody of languages spoken in the Caucasus and Central Asia, Georgian is the only language in this group that can be given more than a cursory treatment here.
14.2 Turkic

The majority of Turkic languages lack contrastive lexical stress, and its status and realization in many of them are still debated, something that is characteristic of the Altaic language group generally. According to Özçelik (2014), most Turkic languages have finally prominent words, but the nature and function of this final prominence varies across them. For example, Kazakh has iambic feet, while Uyghur has footless intonational prominence, marked tonally by principles similar to those applying in Turkish. A counterexample to this general right-edged prominence is Chuvash [Turkic; western part of the Russian Federation], which marks words tonally on their left edge (Dobrovolsky 1999).
14.2.1 Lexical prosody in Turkish: stress

Turkish has long been analysed as a stress-accent language (Lees 1961; Kaisse 1985; Barker 1989; Inkelas and Orgun 1998; Inkelas 1999; Kabak and Vogel 2001; İpek and Jun 2013; İpek 2015; Kabak 2016). In this tradition, word stress is assigned to the word-final syllable (1), with some exceptions, such as place names (2c, 2d), some loanwords, some exceptionally stressed roots, or pre-stressing suffixes (e.g. Sezer 1983; Inkelas and Orgun 1998; Kabak and Vogel 2001). More recently, Turkish has been analysed as a lexical pitch accent language (Levi 2005; Kamali 2011), whereby words with exceptional stress, as in (2c, 2d), are lexically accented with a H*L pitch accent and words with the regular word-final stress, as in (1) and (2a, 2b), are lexically unaccented. Unaccented words acquire a H tone post-lexically, marking the right edge of the phonological word (ω) (Güneş 2015), providing a reliable cue for word segmentation in speech processing (Van Ommen 2016). An event-related potential study investigating the processing of Turkish stress by Domahs et al. (2013) demonstrates that native speakers of Turkish process these two types of stress/accent differently. Turkish participants showed the ‘stress deafness’ effect (Dupoux et al. 1997; Peperkamp and Dupoux 2002) only for the regular finally stressed or lexically unaccented words, and treated violations of stress/accent location as a lexical violation only for the exceptionally stressed or accented words.

(1) Word-final ‘stressed’/‘accentless’ words in Turkish
    a. taní ‘know’
    b. tanı-dík ‘acquaintance’
    c. tanı-dığ-ím ‘my acquaintance’

In lexically accented words, H*L occurs on roots and creates a lexical contrast between segmentally identical strings, as shown for bebek in (2a) and (2c). The word accent remains on the root as the morphosyntactic word (and ω) is extended, as seen in (2c) and (2d).

(2) Final stress (a, b) and exceptional lexical stress plus H*L accent (c, d) in Turkish
    a. bebék ‘baby’
    b. bebek-ler-ím ‘my babies’
    c. Bébek ‘Bebek’ (the name of a neighbourhood in Istanbul)
    d. Bébek-li-ler ‘Those who are from Bebek.’

The Turkish ω-final position is thus assigned a demarcative prominence, a lexical stress, or a post-lexically tone-bearing syllable, depending on the analysis.
14.2.2 Lexical prosody: vowel harmony in Turkish

In Altaic languages, vowel harmony interacts with prosodic constituent structure. Vowel harmony may involve backness, labiality (rounding), vowel height, and pharyngealization (van der Hulst and van de Weijer 1995). Many Turkic languages have backness and labial harmony, while pharyngeal and labial harmony occurs in Mongolian. In Turkish, front vowels must follow front vowels and back vowels must follow back vowels (3) (Clements and Sezer 1982; Charette and Göksel 1996) due to the backness harmony. In rounding
harmony, non-initial vowels in a word can be round only if preceded by another rounded vowel (4) (cf. Göksel and Kerslake 2005). Like Mongolian, Turkish is agglutinative and suffixes harmonize with the root.

(3) a. araba-lar-da ‘in the cars’
    b. kedi-ler-de ‘in the cats’

(4) a. üz-gün-üz ‘we are sad’
    b. kız-gın-ız ‘we are angry’
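The two harmony processes in (3) and (4) are regular enough to be stated as a small procedure. The following Python sketch is ours, not from the chapter: it picks a suffix vowel by copying backness, and, for high suffix vowels, rounding, from the last root vowel, and it deliberately ignores the non-harmonic roots and suffixes discussed below.

```python
# Sketch of Turkish suffix harmony as described in the text: a low suffix
# vowel alternates two ways (a ~ e, backness only), a high suffix vowel
# four ways (ı ~ i ~ u ~ ü, backness plus rounding). Names are ours.

FRONT = set("eiöü")
BACK = set("aıou")
ROUNDED = set("oöuü")

def last_vowel(stem):
    return [ch for ch in stem if ch in FRONT | BACK][-1]

def suffix_vowel(stem, height):
    v = last_vowel(stem)
    back, rnd = v in BACK, v in ROUNDED
    if height == "low":                      # 2-way alternation: a ~ e
        return "a" if back else "e"
    return {(True, False): "ı", (False, False): "i",    # 4-way: high vowels
            (True, True): "u", (False, True): "ü"}[(back, rnd)]

# The forms in (3) and (4):
print("araba-lar-d" + suffix_vowel("arabalar", "low"))    # araba-lar-da
print("kedi-ler-d" + suffix_vowel("kediler", "low"))      # kedi-ler-de
print("üz-gün-" + suffix_vowel("üzgün", "high") + "z")    # üz-gün-üz
print("kız-gın-" + suffix_vowel("kızgın", "high") + "z")  # kız-gın-ız
```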
The domain of Turkish vowel harmony is not always the ω (Kornfilt 1996). A single harmony domain may contain two ω’s (Göksel 2010), while multiple vowel harmonic domains may be parsed as a single ω (Güneş 2015). Turkish compounds, regardless of whether they are parsed as single ω’s (5a) or two ω’s (5b), are non-harmonic. Loanwords (5c) and certain suffixes, such as gen in (5d), are also non-harmonic.

(5) a. (çek-yát)ω ‘pullover sofa’
    b. (keçí)ω(boynuzú)ω ‘carob’
    c. kitap ‘book’
    d. altı-gen ‘hexagon’

14.2.3 Post-lexical prosody in Turkish

Unless pragmatically marked tunes dictate otherwise, sentence-internal prosodic constituency in Turkish can be traced to syntactic branching and relations between syntactic constituents. Root clauses are parsed as intonational phrases (ι) (Kan 2009; Güneş 2015). ι’s contain (a number of) phonological phrases (φ), which correspond to syntactic phrases (Kamali 2011) and contain maximally two ω’s (Güneş 2015). The prosodic hierarchy proposed in the intonational model of İpek and Jun (2013) and İpek (2015) is similar to this, but their intermediate phrase (ip), which corresponds to φ, can contain more than two prosodic words. Four major cues are employed to distinguish between intonational phrases (ι) and phonological phrases (φ) in Turkish. These are (i) boundary tones (H- for the right edges of non-final φ’s, and H% or L% for the right edges of ι’s), (ii) pauses (shorter across φ’s and longer across ι’s), (iii) head prominence, and (iv) final lengthening (shorter final syllable before φ boundaries and longer final syllable before ι boundaries). Figure 14.1 presents the prosodic phrasing of (6) with one ι and three φ’s.

(6) [((Nevriye)ω)φ ((araba-da)ω)φ ((yağmurluğ-u-nu)ω (ar-ıyor.)ω)φ]ι
    Nevriye car-loc raincoat-3poss-acc search-prog
    ‘Nevriye is looking for her raincoat in the car.’ (Güneş 2015: 110)

Figure 14.1 Multiple φ’s in all-new context and with canonical SOV order.

In Turkish, ι’s are right-prominent and φ’s are left-prominent (Kabak and Vogel 2001; Kan 2009). Prominence is marked with variation in pitch register and prosodic phrasing across the head and non-head part of φ’s. In φ’s with two ω’s, the head-ω (i.e. the leftmost ω in a φ) exhibits a higher f0 register and a final H, which is accompanied by a rise in non-final φ’s and a plateau in final φ’s (7). The head-ω of the final φ is also the head of its ι (i.e. the nucleus), yet its register is not higher than the heads of prenuclear φ’s. Any item that follows the nucleus receives low-level f0 and is prosodically integrated with the non-head part of the final φ. A schematic illustration of the prosodic and tonal structure of a declarative with unaccented words is given in (7).

(7)  %L  L H        H-    L H              H-    L H         L          L%
     [ ((pre-nucleus)ω)φ  ((head)ω (pre-nucleus)ω)φ  ((headN)ω (post-nucleus)ω)φ ]ι
          non-final φ          non-final φ                final φ

Regardless of its morphological and syntactic complexity, the postnuclear ω bears low-level, flat f0 (Özge and Bozşahin 2010), as illustrated in (8) and Figure 14.2.

(8) %L        H  L                                                       L%
    [(Ali)ω-N (biliyor Aynurun buraya gel-me-den önce nereye gitmiş ol-abil-eceği-ni)ω-post-N]ι
    Ali knows Aynur.gen to.here come-neg-abl before where gone be-abil-comp.3poss-acc
    ‘Ali knows where Aynur might have gone to before coming here.’

Figure 14.2 Pitch track of Ali biliyor Aynurun buraya gelmeden önce nereye gitmiş olabileceğini ‘Ali knows where Aynur might have gone to before coming here’, illustrating multiple morphosyntactic words as a single ω, with focus for the subject Ali (Özge and Bozşahin 2010: 148).

14.2.4 Focus in Turkish

In Turkish, prosodic phrasing is the main focus alignment strategy. In single focus contexts, focus is aligned as the head of an ι, the nucleus (9). Word order variation can also be indirectly related to focus alignment, in which case the focused constituent is realized in a default nuclear position (i.e. the immediately pre-verbal area) (10) (cf. Kennelly 1999; İşsever 2003; İpek 2011; Gürer 2015; but cf. İşsever 2006).

(9) (OFOCSV), focused object, not immediately pre-verbal but the nucleus (adapted from Özge and Bozşahin 2010: 139)

     %L  H                  L                  L%
     [((KAPIYI)ω-N/FOC (Ali kırdı)ω-Post-N)FINAL-φ]ι
     door.acc Ali broke
     ‘Ali broke the DOORFOC.’

(10) (O)(SFOCV), focused subject immediately pre-verbal and the nucleus

     %L  H-              L H            L       L%
     [(Kapıyı)φ-Pre-N ((ALİ)ω-N/FOC (kırdı)ω-Post-N)FINAL-φ]ι
     door.acc Ali broke
     ‘ALİFOC broke the door.’

In addition to prosodic phrasing and word order, focus in Turkish is marked by f0. Unlike intonation languages, where the pitch range of a focused word is expanded compared to that of the pre-focus words, the pitch range of a focused word in Turkish is reduced in comparison to pre-focus words. The syllable before the nuclear word has a higher f0 than an equivalent syllable at a default phrase boundary. The pitch range of post-focus words is, however, substantially compressed (İpek and Jun 2013; İpek 2015); see Figure 14.2. When words with non-final lexical accent are focused, a steep f0 fall is observed right after the accented syllable (Kamali 2011). In such cases, the non-final lexical pitch accent marks the prosodic head of the final φ, and hence is associated with focus if this head is aligned with a focused item. If words with non-final lexical accent occur in the post-verbal, postnuclear area, they get deaccented and bear low-level f0 (Güneş 2015; İpek 2015).
14.3 Mongolian

14.3.1 Lexical prosody in Mongolic: stress

There is no consensus among linguists on the status and realization of lexical stress in Mongolian and Mongolic in general (for an overview see Svantesson et al. 2005). Native speakers also disagree about the placement of lexical stress in judgement tasks and in some
cases do not perceive any stress at all (Gerasimovič 1970 for Halh Mongolian; Harnud 2003 for Chakhar [Standard Mongolian; China]). Analysis by Karlsson (2005) suggests that Mongolian has no lexical stress, and three potential correlates of stress (vowel quality, vowel duration, and tone) do not correlate in marking any single syllable as stressed. Moreover, vowels, even phonemically long, can be completely deleted in all positions in casual speech. Since the initial syllable governs vowel harmony in Mongolian, this position is often ascribed stress. However, this vowel is often elided in casual speech. Mongolian speakers often devoice and completely delete all vowels in a word, which leads to chains of words with very little or no voiced material. Neither does vowel epenthesis always occur as predicted by syllabification rules, as when underlying /oʃgɮ-ʧ/ ‘kick-converb’, which is pronounced [oʃəgɮəʧ] in formal speech, is pronounced [ʃxɮʧ] in casual speech, with failed epenthesis and deletion of the phonemic vowel (Karlsson and Svantesson 2016). Extreme reduction is frequent. For example, /gaxai/ is reduced to [qχ] in /хar ɢaхai хɔjr-iŋ/ хар гахай хоёрын ‘black pig two-gen’ realized as [χarq.χɔj.riŋ], with syllabification taking place across word boundaries.
14.3.2 Lexical prosody: vowel harmony in Mongolian

Pharyngeal harmony prevents pharyngeal /ʊ a ɔ/ and non-pharyngeal /u e o/ from co-occurring in the same word, with transparent /i/ occurring in either set. Harmony spreads from left to right in a morphological domain and the root word thus determines the vowel in affixes in this agglutinative language, as in the reflexive suffix -e (e.g. ug-e ‘word’, xoɮ-o ‘foot’, am-a ‘mouth’, mʊʊr-a ‘cat’, and ɔr-ɔ ‘place’). Non-initial /i/ is ignored by vowel harmony (e.g. the reflexive suffix in mʊʊr-a ‘cat’ does not change in mʊʊr-ig-a ‘cat-acc-rfl’). Rounding harmony applies in the same domain, with /i/ again being transparent and high back /ʊ u/ being opaque. The opaque vowels block rounding harmony, as in ɔr-ɔd ‘enter-perf’ (cf. ɔr-ʊɮ-ad ‘enter-caus-perf’) (Svantesson et al. 2005: 54).
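The interaction of pharyngeal harmony with rounding harmony and the opaque high round vowels can be stated compactly. The following Python sketch is ours, not from the chapter: the single-character vowel coding and the blocking behaviour of non-high unrounded vowels are illustrative assumptions under the description just given.

```python
# Sketch of Halh Mongolian suffix harmony as described above: a
# harmonizing non-high suffix takes its pharyngeal class from the root,
# and is rounded only if rounding spreads from a non-high round ɔ/o
# without being blocked by opaque high round ʊ/u. Transparent /i/ is
# skipped. Representation is illustrative, not the chapter's.

PHARYNGEAL = set("ʊaɔ")

def suffix_vowel(root_vowels):
    """Pick the surface vowel of a harmonizing non-high suffix."""
    harmonic = [v for v in root_vowels if v != "i"]   # /i/ is transparent
    pharyngeal = harmonic[0] in PHARYNGEAL
    rounded = False
    for v in harmonic:
        if v in "ɔo":
            rounded = True       # non-high round vowels trigger rounding
        else:
            rounded = False      # opaque ʊ/u (and other vowels) block it
    if pharyngeal:
        return "ɔ" if rounded else "a"
    return "o" if rounded else "e"

print("ug-" + suffix_vowel(["u"]))                 # ug-e 'word'
print("mʊʊr-" + suffix_vowel(["ʊ", "ʊ"]))          # mʊʊr-a 'cat'
print("ɔr-" + suffix_vowel(["ɔ"]))                 # ɔr-ɔ 'place'
print("ɔr-ʊɮ-" + suffix_vowel(["ɔ", "ʊ"]) + "d")   # ɔr-ʊɮ-ad 'enter-caus-perf'
```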
14.3.3 Post-lexical prosody in Mongolian

In read speech, major-class words have rising pitch, due to Lα being associated with the first mora and Hα with the second, as a result of which /mʊʊ.ra/ ‘cat’ has a pitch rise in its first syllable and /xo.ɮo/ ‘foot’ a pitch rise over its two syllables. The assignment of LαHα to the left edge of the accentual phrase (α) is post-lexical, as shown by its sensitivity to post-lexical syllabification. For instance, /nʊtʰgtʰai/ ‘homeland-com’ is either trisyllabic due to schwa epenthesis, [nʊ.tʰəg.tʰai], or disyllabic, [nʊtʰx.tʰai], with Hα appearing on [tʰəg] in the first and on [tʰai] in the second case. The domain of syllabification has been described as the ω and the domain of LαHα assignment as α. Post-positions always share the same ω (or α) as their left-edge host. Many word combinations that function as compounds (often written as two words) are realized as one α, such as the compound ɢaɮtʰ tʰirəg ‘train’ (literally: ‘fire vehicle’), pronounced with one LαHα in [ɢaɮtʰtʰirəg]α. In spontaneous speech, vowels are often deleted and words are syllabified across a word boundary. Several lexical words can thus be clustered
as one ω and marked as α, which will lead to a discrepancy between the morphological domain of vowel harmony and the phonological domain for prosodic parsing. LαHα mark the left edge of an accentual phrase (α) and by implication an ip in Mongolian. The ip corresponds to the syntactic phrase and often contains more than one α. As a consequence of α-phrasing, almost every major-class word in neutral declaratives in read speech begins with a lowering of f0 towards the phrase-initial Lα, as illustrated in Figure 14.3. The Hα tones in a series of LαHα boundary rises show a downtrend across the ip that is reset at the beginning of every ip except the last, which corresponds to a verb phrase and is pronounced with distinctly lower pitch on the last word. Figure 14.4 shows the downtrend on the second syllable of marɢaʃ and the reset on the second syllable of /ɢɔɮig/. Tonal marking with a right-edge ip boundary tone H- occurs optionally in subordination, coordination, and enumeration. Clauses are parsed as intonational phrases (ι), which come with a right-edge L% or H% and contain one or more ip’s. However, in spontaneous speech, units larger than root clauses can be marked as ι, something that is somehow connected to discourse structure. Moreover, L% is rare in spontaneous speech, where final rises due to H% are frequent. The intonation of other Mongolic languages has been described by Indjieva (2009) in her comprehensive account of prosody of the Houg Sar and Bain Hol varieties of Oirat, a Western Mongolic language spoken in the Xinjiang region of China. Oirat lacks lexical
Figure 14.3 Pitch track showing the division into α’s of all-new [[mʊʊr]α[nɔxɔint]α[parʲəgtəw]ip]ι ‘A cat was caught by a dog’, where underlined bold symbols correspond to the second mora in an α. -LH marks the beginning of the ip (Karlsson 2014: 194).
Figure 14.4 Pitch track of [[pit]α [marɢaʃα]ip [[xirɮʲəŋ]α [ɢɔɮig]α]ip [tʰʊʊɮəŋ]ip]ι ‘We will cross the Kherlen river tomorrow’ (Karlsson 2014: 196). -LH marks the beginning of an ip.
stress and nuclear pitch accents, and instead marks edges of prosodic units, the α, with its initial LαHα, and the ι. These features are very similar to those of Mongolian.
14.3.4 Focus in Mongolian

Mongolian is strictly a verb-final subject-object-verb (SOV) language. The pre-verbal position is sometimes claimed to be a focus position, but this has not been confirmed (Karlsson 2005). Focus is marked by strengthening the initial boundary of the ip that contains the focused word(s), resulting in an enhanced pitch reset. A similar pattern is found in Oirat (Indjieva 2009). Dephrasing does not occur except for a special marking of focal constituents by pitch lowering. This is only found for the ι-final position in read speech. Even in such cases, α-boundaries are often traceable. In spontaneous speech, focus is most often marked by Hfoc at the end of the focused phrase(s), as illustrated in Figure 14.5. Its scaling brings more evidence that it correlates with the new/given dichotomy: it is higher when new information coincides with the second part of the ι. To formally show the leftward spreading of Hfoc to the beginning of the ip that contains the focus constituent, an arrow is used: ←Hfoc.
Figure 14.5 Pitch track and speech waveform illustrating final ←Hfoc marking focus on all the preceding constituents. The utterance is [[[manai aaw pɔɮ]α]ip [[[saixəŋʦantai]α]ip [[ʊxaɮəg]α]ip]foc [xuŋ]ip]ι ‘My father is nice and wise’.
14.4 Persian

14.4.1 Lexical prosody in Persian

Persian word prominence has been described as having stress in nouns, adjectives, and most adverbs. Right-edge clitics, such as the indefinite [=i] and the possessive markers, are excluded from stress assignment, whereas verbs with inflectional prefixes take stress on the leftmost prefix, as illustrated in (11) (Ferguson 1957; Lazard 1992).

(11)
    a. pedár
       father
    b. pedár=am
       father=1sg
       ‘my father’
    c. mí-goft
       dur-said.3sg
       ‘s/he would say’
While some authors have attempted to show that Persian ‘stress’ is exclusively governed by prosodic phrasing (e.g. Kahnemuyipour 2003), recent research suggests that it is in fact a post-lexical tone that is assigned on the basis of the morphosyntactic label, independently of prosodic phrasing (Rahmani et al. 2015, 2018; Rahmani 2018, 2019). That analysis is in line with three recent experimental findings. First, the syllabic prominence at issue is created only by f0, suggesting that it is a tone or accent, rather than a metrical entity (Abolhasanizadeh et al. 2012; but see Sadeghi 2017 for a different view). Second, it is not obligatory on the surface in that it disappears in some sentential contexts (Rahmani et al. 2018), thus escaping a hallmark feature of stress as defined by Hyman (2006). Third, despite the high functional load of ‘stress’ location, for instance due to homophony between derivational suffixes and clitics ([xubí] ‘goodness’ vs. [xúbi] ‘good.2sg’), Persian listeners are ‘stress deaf’ in the sense of Dupoux et al. (2001), indicating that there is no word-prosodic information in the lexicon (Rahmani et al. 2015).

Phonologically, the Persian accent consists of a H tone. The syntactic motivation behind the location of accent is based on several observations, two of which are given here. First, a given word may receive accent on different syllables depending on the syntactic environment it appears in or the grammatical function it performs. Thus, nouns are accented on the initial syllable when appearing as vocatives as opposed to their default final accent (cf. [pédar] ‘father!’ vs. [pedár] ‘father’) (Ferguson 1957). Similarly, the position of accent on various grammatical words is sensitive to sentential polarity. Examples are the intensifier /xejli/ ‘very’ and the compound demonstrative /hamin/ ‘this same one’, which are accented on the first syllable in positive sentences (cf. [xéjli], [hámin]) but take a final accent in negative sentences (cf. [xejlí], [hamín]). Second, whenever an expression (including phrases or clauses) is used in such a way as though the entire group were syntactically a single noun, it follows the accentual pattern of nouns—that is, it is assigned one accent on its final syllable irrespective of its default phrasal accent pattern (Vahidian-Kamyar 2001). (12a) illustrates a clause in its default accentuation. As shown in (12b), when the same form is used as a head noun in a possessive construction to refer to a movie title, it is reanalysed as a noun by the accent rule—that is, the entire unit is assigned one accent on its final syllable.

(12)
    a. [bɒ́d mɒ́=rɒ xɒhád bord]
       wind 1sg=obj want.3sg carry
       ‘The wind will carry us.’
    b. [bɒd mɒ=rɒ xɒhad bórd]=e kiɒrostamí
       wind 1sg=obj want.3sg carry=ez Kiarostami
       ‘Kiarostami’s The wind will carry us’
Independently of their accentual pattern, Persian words have iambic feet, which serve as the domain for assimilation processes such as vowel harmony (Rahmani 2019). Mid vowels assimilate to following high vowels, but only if the two syllables are grouped into a single foot. Thus, while [o] normally raises to [u] in [ho.lu] ‘peach’, which is a disyllabic iamb, it cannot do so in [hol.gum] ‘pharynx’, which contains two monosyllabic iambs. In Ossetian [Indo-Iranian; Central Caucasus], accent becomes actualized only as a function of prosodic phrasing. Words do not have an individual stress but are organized in groups by a tonal accent (Abaev 1949).
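The foot-bounded character of the Persian raising process described above lends itself to a small procedural sketch. The following Python code is ours, not Rahmani's: foot structure is supplied by hand rather than parsed, and the segment inventory is illustrative.

```python
# Sketch of foot-conditioned mid-vowel raising in Persian (after the
# description of Rahmani 2019 above): a mid vowel raises before a high
# vowel only when the two syllables share a foot. Feet are given by hand.

RAISE = {"o": "u", "e": "i"}
HIGH = set("iu")
VOWELS = "aeiou"

def nucleus(syllable):
    """Return the first vowel of a syllable string."""
    return next(ch for ch in syllable if ch in VOWELS)

def apply_raising(feet):
    """feet: list of feet, each a list of syllable strings."""
    out = []
    for foot in feet:
        foot = list(foot)
        for i in range(len(foot) - 1):        # raising is foot-internal only
            v = nucleus(foot[i])
            if v in RAISE and nucleus(foot[i + 1]) in HIGH:
                foot[i] = foot[i].replace(v, RAISE[v])
        out.append(foot)
    return out

print(apply_raising([["ho", "lu"]]))       # [['hu', 'lu']]: one disyllabic iamb
print(apply_raising([["hol"], ["gum"]]))   # unchanged: two monosyllabic iambs
```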
14.4.2 Post-lexical prosody in Persian
The Persian prosodic hierarchy includes the φ and ι, in addition to the ω. ω is the domain of obligatory syllabification (Hosseini 2014). It roughly corresponds to a (simple or derived) stem plus inflectional affixes and clitics. φ and ι may be characterized by different degrees of pause length and pre-boundary lengthening (Mahjani 2003). Persian has a small tonal inventory. In addition to the syntactically driven accent H, there are two ι-final boundary tones, L% and H% (see §14.6). Some models of Persian intonation have assumed ‘focus accent’ and ‘phrase accent’ in the tonal inventory of the language (e.g. Scarborough 2007), for which there would appear to be insufficient supporting evidence (Rahmani et al. 2018). The two prosodic segmentations for each of the members of the minimal pair (13a, 13b) show the irrelevance of prosodic constituency to the distribution of accent. Their pitch tracks are presented in Figure 14.6.
(13) a. bɒd mɒ=rɒ xɒhad bórd
        wind 1sg=obj want.3sg carry
        ‘The wind will carry us’ (naming expression)
        [((bɒd)ω (mɒ=rɒ)ω)φ ((xɒhad)ω (bórd)ω)φ ]ι
        [((bɒd)ω)φ ((mɒ=rɒ)ω (xɒhad)ω (bórd)ω)φ ]ι
     b. bɒ́d mɒ́=rɒ xɒhád bord
        wind 1sg=obj want.3sg carry
        ‘The wind will carry us.’ (sentential expression)
        [((bɒ́d)ω (mɒ́=rɒ)ω)φ ((xɒhɒ́d)ω (bord)ω)φ ]ι
        [((bɒ́d)ω)φ ((mɒ́=rɒ)ω (xɒhɒ́d)ω (bord)ω)φ ]ι
The intonation systems of other Iranian languages are not well documented, an exception being Kurdish (Northern Kurmanji) (Hasan 2016).
14.4.3 Focus in Persian
Persian has SOV as the unmarked word order, with other orders available for pragmatic purposes (Sadat-Tehrani 2007). It is still unclear whether word order variations cue focus, intonation being the most reliable cue. Under broad focus, post-verbal words are obligatorily unaccented and all other words obligatorily accented. Thus, while in an SAOV utterance every word is accented, in VSAO only an accent on the verb remains. Under narrow focus, post-focal words are deaccented, irrespective of the position of the verb. Thus, SfocAOV will have one accent, on Sfoc. The prosodic expression of focus is syntactically restricted. While in sentences with the unmarked SOV word order, any word can be prosodically marked for focus, in sentences with pragmatically marked word order, post-verbal words cannot be focused if the unmarked position of these words is pre-verbal. Some clause types may deviate slightly from these patterns, such as those with non-specific objects, manner adverbials, or clauses with motion verbs, which are ignored here for lack of space. See Sadat-Tehrani (2007) and Kahnemuyipour (2009) for more information.
Figure 14.6 f0 contours of 13a (a) and 13b (b).
14.5 Caucasian
About 50 languages are spoken in the Caucasus, 37 of which are indigenous (Kodzasov 1999). Among these, Georgian, a member of the South Caucasian language group, is the most studied and is described in §14.5.1. Daghestanian, a member of the Northern Caucasian language group, is briefly described in §14.5.2.
14.5.1 Georgian
14.5.1.1 Lexical prosody in Georgian
Although the existence and location of lexical stress in Georgian are debated in the literature, a general consensus has been that stress is assigned word-initially (Robins and Waterson 1952; Aronson 1990). Some studies further claim that, for words longer than four syllables, both the initial and the antepenultimate syllables are stressed, with primary stress on the antepenult (Harris 1993). However, Vicenik and Jun (2014) showed that the domain of antepenult stress is not a word, but the α. Stress is not influenced by syllable weight (Zhgenti 1963) or vowel quality (Aronson 1990). The main phonetic correlate of Georgian stress was claimed to be high pitch by Robins and Waterson (1952), based on data from words produced in isolation, or to be related to a rhythmical-melodic structure by Zhgenti (1963; cited in Skopeteas et al. 2009). However, based on acoustic measurements of words in a carrier sentence with target vowels of the same quality, Vicenik and Jun (2014: 157) found that the word-initial syllable had significantly greater duration and intensity than all following syllables, while the antepenultimate syllable was not stronger than the syllable immediately preceding it. The f0 of the word-initial syllable was typically low, demarcating the beginning of a word (and an α) in declaratives with neutral focus, but was often high or rising in question sentences (see §14.6) or when the word was narrowly focused (see §14.5.1.3). That is, the pitch of the stressed syllable is determined post-lexically based on sentence type or focus, confirming the observations made in earlier studies (Zhgenti 1963; Tevdoradze 1978).
14.5.1.2 Post-lexical prosody in Georgian
There are only a few studies that have examined prosody at the post-lexical level in Georgian (Bush 1999; Jun et al. 2007; Skopeteas et al. 2009; Skopeteas and Féry 2010; Vicenik and Jun 2014; Skopeteas et al. 2018) (studies published in Russian and Georgian are not included here). These studies all agree that the intonation of simple declarative sentences typically consists of a sequence of rising f0 contours. Jun et al. (2007) and Vicenik and Jun (2014) showed that the domain of a rising f0 contour, an α, often contains one content word, though it can have more. They proposed that Georgian, like Mongolian, has three prosodic units above the word: an ι, an ip, and an α. The rising contour of the α is analysed as a L* pitch accent on the initial syllable, followed by a Hα boundary tone on the final syllable. However, when the α is part of an embedded syntactic constituent or occurs in a (wh or polar) interrogative sentence, it is often realized with a falling contour, i.e. initial H* and final Lα.
Figure 14.7 Pitch track and speech waveform of Manana dzalian lamaz meomars bans, ‘Manana is washing the very beautiful soldier’. Each word forms an α with a rising contour, [L* Hα].
(Vicenik and Jun 2014: fig. 6.1, redrawn in Praat)
Figure 14.7 shows the f0 of a simple declarative sentence where each word forms one α, and illustrates a downtrend of final Hα tones over the whole utterance. In Figure 14.7, the sentence-final syllable is marked with a low boundary tone, L%, a common ι boundary tone for a declarative sentence. This means that the whole sentence forms one ι and also one ip, which includes five α’s. A sequence of α’s can form an ip when the α’s are close together syntactically or semantically. This higher prosodic unit is marked by a High boundary tone, H-, which is higher than the High tone of the preceding α. Figure 14.8 shows an example pitch track of a declarative sentence, The soldier’s aunt is washing Manana, where a complex NP subject, [meomris mamida], forms an ip, marked with a H- boundary tone. The f0 height of H- breaks the downtrend of α-final H tones across the utterance, as in Figure 14.7. Finally, the Georgian α can have one more tonal property. When it exceeds four syllables, a falling tone occurs over the antepenult and penult, in addition to the α-initial pitch accent. In that case, the antepenult has a H tone and the penult a L tone, regardless of the location of a word boundary inside the α. Since this f0 fall is not a property of a word, it is categorized as a H+L phrase accent of an α. As shown in §14.6 and §14.5.1.3, this phrase accent occurs frequently in questions and as a marker of focus in Georgian.
Figure 14.8 Pitch track of The soldier’s aunt is washing Manana. The complex NP subject [meomris mamida] forms an ip, marked with a H- boundary tone that is higher than the preceding Hα. (Vicenik and Jun 2014: fig. 6.4, redrawn in Praat)
14.5.1.3 Focus in Georgian
Focus in Georgian is marked by word order and prosody. As in Turkish, a pre-verbal argument receives prominence in the neutral focus condition, showing that word order is sensitive to information structure. However, an infelicitous word order for focus may become felicitous by an appropriate prosodic structure, suggesting that prosodic constraints outrank syntactic constraints in encoding information structure (Skopeteas et al. 2009). In addition, Georgian shows different intonation patterns depending on the location of the word in a sentence. Skopeteas and Féry (2010), Vicenik and Jun (2014), and Skopeteas et al. (2018) show that a focused word is realized with high f0 (due to H*) sentence-initially, but with a low flat f0 (L*) sentence-finally. Sentence-final focused words are often preceded by a phrase break marked with a high boundary tone (H- in Vicenik and Jun 2014). Though a (L)H* pitch accent marks prominence of a focused word in Georgian, it is not always realized in an expanded pitch range, especially when the focused word is sentence-medial. However, there is nevertheless salient prominence for the focused word due to increased intensity and duration of its stressed syllable and a reduced pitch range of the post-focus words. Interestingly, Vicenik and Jun (2014) show that a focused word is often marked by an additional tone, a H+L phrase accent, on the antepenultimate syllable of the focused word itself or a larger phrase that consists of a focused word and the following word. Figure 14.9 shows an example where a pre-verbal argument (Gela) is focused but the H+L phrase accent is realized on the following word, a verb.
Figure 14.9 Pitch track of No, GELA is hiding behind the ship, where the subject noun is narrowly focused and the verb, instead of being deaccented, has a H+L phrase accent. The focused word and the verb together form one prosodic unit.
In addition to tonal prominence marked by pitch accent and phrase accent, prosodic phrasing may mark focus, too. In terms of Vicenik and Jun’s (2014) model, a focused word often begins an ip. Production data in fact suggest that focus can be expressed by word order, prosodic phrasing, and pitch accent; any of these can mark focus, but none of them seems obligatory.
14.5.2 Daghestanian
The majority of Daghestanian languages have no stress (Kodzasov 1999), being instead tonal languages (e.g. Andi, Akhvakh) or quasi-tonal languages (most languages of North Dagestan); stress languages also occur (most languages of Southern Dagestan). In the quasi-tonal languages, tone is probably connected to a stiffness/slackness contrast, whereby the articulatory transition from slackness to stiffness generates a rising f0. Thus, while the tonal contours of Andi words like hiri (LowLow) ‘red’ and mic’c’a (HighHigh) ‘honey’ are generated by lexical tones, the tonal contrasts in Chamalal /aː/ (RisingLow) ‘broth’, /aː/ (LowLow) ‘pus’, and /aː/ (HighHigh) ‘ear’ result from underlying stiffness/slackness contrasts. In Ingush, tones are grammatical, and some tonal suffixes have a rising-falling tone, as in lät-âr ‘fought (witnessed past)’ vs. lât-ar ‘used to fight (imperfect)’, where the opposition is marked by ablaut and tone shift (Nichols 2011). Tone in Ingush can occur only on one syllable per word. Chechen, a North East Caucasian language, mainly uses word order to signal focus (Komen 2007). Certain clitics and suffixes have an inherent high pitch (Nichols 1997),
which suggests the presence of lexical tone. Komen recognizes an ι and an α in Chechen, both marked by L at their left edge and followed by H*.
14.6 Communicative prosody: question intonation
In Turkish questions, the wh-word and the item that precedes the polar question particle are parsed as the nucleus (14). While in wh-questions right edges of ι’s are decorated with H%, polar questions end with %L (Göksel and Kerslake 2005; Göksel et al. 2009). Like focused items, the item preceding the Q-particle is aligned with the nucleus via prosodic phrasing (Shwayder 2015). Göksel et al. (2009) observe that the pre-wh-word area exhibits higher pitch than the pre-nuclear area in polar questions and declaratives.
(14) Prosodic phrasing of polar and wh-questions compared with declaratives, with the ι-boundary tone on the right:
     Declarative:    [----------- ((nucleus)ω (post-nucleus)ω)φ]ι  L%/H%
     wh-question:    [----------- ((wh-word)ω (post-nucleus)ω)φ]ι  H%
     Polar question: [----------- ((a constituent)ω (Q-particle+post-nucleus)ω)φ]ι  L%
A vocative proper name will exhibit a pitch fall (H*L%) (15a), which may convey surprise if spoken with an expanded pitch range (15b). Rising f0 in the same environment (i.e. LH*H%) conveys a question with the meaning of ‘Is it you?’ (15c) (Göksel and Pöchtrager 2013).
(15) Vocatives with various meanings
     a. Calling address:   H*L%   Aslı
     b. Surprise address:  H*L%   Aslı
     c. Is-it-you address: LH*H%  Aslı
Mongolian polar questions are marked by a final question particle. It typically also appears at the end of wh-questions, in which the wh-word is in situ, but it may be omitted in colloquial speech. While Mongolian interrogatives often have f0 shapes that are similar to declaratives, with final H% being used in both of these, in all-new interrogatives dephrasing and suspension of the downtrend (i.e. inclination) may occur. Persian polar questions have similar intonation contours to declaratives (Sadat-Tehrani 2011). The question particles are often omitted in colloquial speech, in which case a final H% distinguishes them from declaratives, which have L%. Additionally, questions are characterized by sentence-final syllable lengthening and wider pitch range. wh-questions are generally characterized by deaccentuation of the elements after the wh-word and a final L% boundary. Native listeners can easily differentiate wh-questions from their declarative counterparts on the basis of the part of the utterance before the wh-word (Shiamizadeh et al. 2017). In Georgian, both polar and wh-questions are marked by word order and prosody. The wh-word occurs sentence-initially and is immediately followed by a verb, with which it tends to form a single ip. This phrase is marked by the sequence H* H+L, where H* occurs on the wh-word and H+L on the antepenultimate and penultimate syllables of the verb if it
has four or more syllables, or only L appears on the penult if the verb is shorter than three syllables. Most commonly, a final H- appears on the final syllable of the ip, although L- is also possible. The end of a wh-question is often marked by H% or L%, less frequently HL%, without obvious differences in meaning. In polar questions, the verb occurs either sentence-initially or sentence-medially. A sentence-initial verb forms an ip by itself, with a H* L H- pattern. A sentence-medial verb either shows a H* L H- pattern by itself or appears together with a preceding subject in an ip marked by H* H+L H-, similar to the pattern in the wh-word + verb group described above. Polar questions, too, end in H%, L%, or HL%. Bush (1999) pointed out that HL% is characteristic of polite questions.
14.7 Conclusion
All the languages discussed in this chapter lack contrastive lexical stress and, more generally, they lack culminative stress, in Trubetzkoy’s (1939/1969) terms. That is, minimal word pairs like English éxport versus expórt do not occur, or are at best limited to a few cases, and stress is not morphologically conditioned. Moreover, pitch, intensity, and duration are not found to coincide in marking the prominent word-initial or word-final syllable, indicating that it is not metrically strong and instead is marked by tone. Interestingly, this seems to be true for most Altaic, Caucasian, and Indo-Iranian languages and may be the reason for the lack of consensus about the status, realization, and placement of lexical stress in these languages. Vowel harmony can be seen as signalling a word as an entity in speech. Baudouin de Courtenay (1876) and Kasevič (1986), for instance, suggested that this coherence-signalling function parallels Indo-European lexical stress. If vowel harmony has a demarcative function similar to lexical stress, this may explain the redundancy of stress in harmonic languages. The absence of contrastive stress is a common feature of many harmonic languages, as we reported here for Turkic. Other examples are a number of Uralic languages (among them Finnish and Hungarian), while stress is completely absent in Mongolian (as described in §14.3), Erzya [Finno-Ugric; Mordovia], and some Chukchi-Kamchatkan languages [Paleo-Asian] (Jarceva 1990). Some languages, such as Uzbek [Turkic; Uzbekistan] and Monguor [Mongolic; China, Qinghai, and Gansu provinces], have developed lexical stress after losing vowel harmony (Binnick 1980; Kasevič 1986). In Monguor and its dialects, final lexical stress has arisen and the first syllable, which governs vowel harmony in other Mongolic languages, is lost in some words; for example, Old Mongolian *Onteken ‘egg’ has become ontəg in Halh and ndige in Monguor. These correlations suggest that harmony has a demarcative function similar to lexical stress. Though the languages treated in this chapter share some structural features, such as SOV word order, agglutination, and some prosodic similarities, their tonal tunes are aurally rather different, due to (among other things) different interactions between lexical and post-lexical prosody (micro- and macro-rhythm; Jun 2014b) as well as the shapes and distribution of pitch accents and boundary tones.
chapter 15
Central and Eastern Europe
Maciej Karpiński, Bistra Andreeva, Eva Liina Asu, Anna Daugavet, Štefan Beňuš, and Katalin Mády
15.1 Introduction
The languages of Central and Eastern Europe form a typologically divergent collection that includes Baltic (Latvian, Lithuanian), Finno-Ugric (Estonian, Finnish, Hungarian), Slavic (Belarusian, Bulgarian, Czech, Macedonian, Polish, Russian, pluricentric Bosnian-Croatian-Montenegrin-Serbian (BCMS), Slovak, Slovenian, Ukrainian), and Romance (Romanian). Most of them have well-established positions as official state languages, but there are also a good many minority and regional languages varying in their history, status, and number of speakers (e.g. Sorbian, Latgalian, Kashubian, a number of Uralic languages, and groups of Romani dialects). Slavic and Baltic languages are assumed to have emerged from a hypothetical common ancestor—Proto-Balto-Slavic (also referred to as very late Proto-Indo-European; Comrie and Corbett 1993: 62)—and to have split some 2,000 years ago (Mallory and Adams 2006: 103–104). Slavic broke up into East, West, and South Slavic (Mallory and Adams 2006: 14, 26; Sussex and Cubberley 2006; Clackson 2007: 8, 19). Romanian (Eastern Romance) arose from the Romanization of Dacia in the first centuries AD and the later invasion of Goths (Du Nay 1996). Hungarian is considered to have emerged from the Ugric branch of Proto-Uralic, while Estonian and Finnish belong to the Finnic branch (Abondolo 1998). Beyond genetic relations, it was migration, language policy, and language contacts that shaped the present linguistic picture of Central and Eastern Europe, including many prosodic aspects. This chapter discusses the word prosody (§15.2) and sentence prosody (§15.3) of the major languages of the region.
15.2 Word prosody
15.2.1 Quantity
Quantity distinctions play an important role in the word prosody of the region and may involve consonants in addition to vowels. In the majority of cases, vowel quantity distinctions are accompanied by a difference in vowel quality (e.g. Kovács 2002; Podlipský et al. 2009; Skarnitzl and Volín 2012; Grigorjevs and Jaroslavienė 2015).
15.2.1.1 Baltic
Latvian and Lithuanian have a quantity contrast in vowels, and Latvian has additionally developed contrastive quantity in consonants. Some dialects have lost quantity in unstressed syllables. The durational proportion between short and long vowels, pronounced in isolation, has been shown to be 1:2.1 for both Latvian (Grigorjevs 2008) and Lithuanian (Jaroslavienė 2015). In Lithuanian, short open vowels are lengthened under stress in non-final syllables in the word, except in certain grammatical forms; see (1) below (Girdenis 1997). In Latvian, voiceless intervocalic obstruents are lengthened if preceded by a short stressed vowel, which has been attributed to Finnic influence (Daugavet 2013).
15.2.1.2 Finno-Ugric
Estonian has developed a three-way quantity system with short (Q1), long (Q2), and overlong (Q3) degrees, where duration closely interacts with stress and tone (Lehiste 1997). A decisive factor in determining the degree of quantity is the duration ratio of the first (stressed) syllable and the second (unstressed) syllable in a disyllabic foot (Lehiste 1960a), while pitch remains a vital cue for distinguishing the long and overlong quantity degrees (e.g. Lehiste 1975; Danforth and Lehiste 1977; Eek 1980a, 1980b; Lippus 2011). In disyllabic Q1 and Q2 feet, the f0 steps down between the two syllables, while in Q3 feet there is an f0 fall early in the first syllable. In Finnish, both consonant and vowel durations are contrastive, independent of each other and of word stress. That is, short and long vowels may occur before and after both short and long consonants in both stressed and unstressed syllables (Suomi et al. 2008: 39). As in Estonian, in Finnish the f0 contour may act as a secondary cue for distinguishing phonological quantities (Lehtonen 1970; O’Dell 2003; Järvikivi et al. 2007; Vainio et al. 2010). Additionally, Hungarian differentiates between short and long vowels and consonants, although the quantity contrast for consonants is less crucial, as various phonotactic constraints make consonant length predictable (Siptár and Törkenczy 2007).
15.2.1.3 Slavic
The historically widespread presence of vowel quantity in the area is now absent from Bulgarian, Macedonian, Polish, Ukrainian, Belarusian, and Russian, and it never existed in the only Romance language in the region, Romanian. It is preserved in Czech, Slovak, Slovenian, and pluricentric BCMS. It is found in stressed and unstressed syllables in Czech, Slovak, and BCMS, where long vowels are, however, excluded from a pre-stressed position. In Slovenian, phonological quantity is present only in final stressed syllables, stressed vowels being otherwise long and unstressed short.
Syllabic /l/ and /r/ occur in Czech and Slovak, and syllabic /r/ in South Slavic. Syllabic liquids participate in the quantity contrast in Slovak and BCMS, but in Slovenian the only syllabic liquid /r/ is always long. Duration ratios between short and long nuclei, relevant to rhythm, vary considerably, in part depending on style (laboratory speech vs. reading) (for Czech see Janota and Jančák 1970 and Palková 1994; for Slovak see Daržágín et al. 2005 and Beňuš and Mády 2010; for BCMS see Lehiste and Ivić 1986: 63 and Smiljanić 2004). The relevance of distinctions between long and short vowels has been called into question in Slovenian (Srebot-Rejec 1988) as well as in the Zagreb dialect of BCMS (Smiljanić 2004).
15.2.2 Word stress
The entire range of word stress patterns—mobile and fixed, left edge and right edge, based on various phonetic properties, interacting with other prosodic domains (van der Hulst 2014b)—is represented in the languages of the region as a result of both genetic and contact factors.
15.2.2.1 Baltic
Lithuanian retains the mobile stress of the Balto-Slavic system (Young 1991; Girdenis 1997; Stundžia 2014) and features a tonal contrast in the stressed syllable (Dogil 1999a: 878; see also Revithiadou 1999; Goedemans and van der Hulst 2012: 131). Latvian stress tends to fall on the initial syllable of the main word of the clitic group (e.g. /uz ˈjumta/ ‘on the roof’) (Kariņš 1996), which is sometimes attributed to Finnic influence (Rinkevičius 2015; cf. Hock 2015), although what may be seen as early stages of the tendency towards initial stress are also found in Lithuanian dialects, where there is no Finnic influence. Secondary stresses in both Latvian and Lithuanian occur at intervals of two or three syllables, but may also depend on syllable weight and morphological structure (Daugavet 2010; Girdenis 2014). A unique feature of Latvian, a weight-insensitive unbounded system (van der Hulst et al. 1999: 463), is the existence of distinctive patterns involving pitch and glottalization on both stressed and unstressed heavy syllables (Seržants 2003). Lithuanian orthography distinguishes three marks traditionally referred to as ‘accents’, which conflate stress and length. ‘Grave’ indicates a stressed light syllable, as in the final syllable of the instrumental case for ‘wheel’ in (1), while ‘acute’ and ‘circumflex’ indicate what is traditionally referred to as a tonal contrast on heavy syllables, as in (2a, 2b), which is now lost on long vowels. Phonetically, the role of f0 is secondary compared to the duration ratio between the first and second halves of the heavy rhyme (Dogil and Williams 1999: 278–284). In syllables with the acute accent, the first element of the diphthong and of short-vowel-plus-sonorant combinations is lengthened, while the second is short and presumably non-moraic; the circumflex accent indicates that the second element is lengthened, while the first is short and qualitatively reduced, indicating a possible loss of its mora (Daugavet 2015: 139). The circumflex is traditionally believed to be the accent of the short vowels that are lengthened under stress, as observed in §15.2.1.1, shown in (1). Stress on circumflex syllables may also shift to certain morphemes (‘stress mobility’).
(1) stressed-vowel lengthening
    rãtas [ˈrɑː.tas] ‘wheel’, cf. ratù [ra.ˈtʊ] inst.sg
(2) a. acute      áukštas [ˈɑˑʊk.ʃtas] ‘high’; táiką [ˈtɑˑɪ.kɑː] ‘aim; apply’ prs.prtc.nom.pl
    b. circumflex aũkštas [ˈɒuˑk.ʃtas] ‘storey of a building’; taĩką [ˈtəiˑ.kɑː] ‘peace’ acc.sg
15.2.2.2 Finno-Ugric
In Estonian, the primary stress in native words always falls on the first syllable, but it may occur elsewhere in recent loans (e.g. menüü [meˈnyː] ‘menu’). Secondary stresses normally occur on odd-numbered syllables; their placement is determined by the derivational and syllabic structure of the word (Viitso 2003). The foot is maximally trisyllabic; words of more than three syllables may consist of combinations of monosyllabic, disyllabic, and trisyllabic feet (Lehiste 1997). A tetrasyllabic word is generally made up of two disyllabic trochees. The main phonetic correlate of stress in Estonian is vowel duration in interaction with the three-way quantity system: in long (Q2) and overlong (Q3) quantity, the stressed vowels are longer than the unstressed ones, whereas in short quantity (Q1) it is the other way round (Lippus et al. 2014). Primary stress in Finnish always falls on the first syllable of the word (Iivonen 1998: 315). The placement of secondary stress depends on several factors, including the segmental structure of syllables and the morphology of the word (Karlsson 1983: 150–151; Iivonen 1998: 315; Karvonen 2005). Long words are formed of disyllabic feet. In compound words, the secondary stress falls on the first syllable of the second element, even if both elements are monosyllabic (e.g. puupää [ˈpuːˌpæː] ‘blockhead’). The main phonetic correlate of stress in Finnish is the duration of segments when they constitute the word’s first or second mora (relative to segment durations elsewhere in the first foot) (Suomi and Ylitalo 2004). There is virtually no reduction of vowel quality in unstressed syllables relative to stressed syllables (Iivonen and Harnud 2005: 65). In Hungarian too, primary stress is fixed to the word-initial syllable but, unless the word carries a pitch accent, is not marked by salient acoustic cues such as vowel quality, duration, or intensity (Fónagy 1958; Szalontai et al. 2016). The existence of secondary stress is disputed (Varga 2002).
15.2.2.3 Slavic
All modern West Slavic languages feature weight-insensitive word stress systems (van der Hulst et al. 1999: 436). Word stress is bound in different ways to the left or to the right edge of the word. It falls on the initial syllable in Czech and Slovak but mostly on the penultimate syllable in Polish (Jassem 1962; Steffen-Batóg 2000). In Czech, stress is achieved mainly by means of intensity with no systematic vowel reduction in unstressed conditions (Palková 1994). As an exception to the Polish penultimate syllable stress rule, stress may fall on the antepenultimate syllable in some loanwords (3a) or even on the preantepenultimate one in some verb forms (3b).
(3) a. matematyka [matɛˈmatɨka] ‘mathematics’ nom.sg
    b. pojechalibyśmy [pɔjɛˈxalʲibɨɕmɨ] ‘we would go’
The primary stress may also move to a different syllable in order to keep its penultimate position in inflectional forms.
(4) bałagan [baˈwaɡan] ‘mess’ nom.sg
    bałaganu [bawaˈɡanu] ‘of mess’ gen.sg
The nature of secondary stress in Polish is still under discussion (Rubach and Booij 1985; Newlin-Łukowicz 2012; Łukaszewicz 2018), with recent studies showing a lack of systematic acoustic evidence for it (Malisz and Żygis 2018). Similar doubts apply to Czech and Slovak. Czech and Slovak proclitics are integrated into the prosodic word (5a), while Polish word stress preserves its position except in the case of one-syllable pronominals (5b) (Dogil 1999b: 835).
(5) a. Czech/Slovak do domu [ˈdo domu] ‘(to, towards) home’ gen.sg.m
    b. Polish do mnie [ˈdɔ mɲɛ] ‘to me’ i.gen.sg.m
In the Eastern South Slavic group, Bulgarian has traditionally been described as having distinctive (non-predictable) dynamic word stress (Stojkov 1966). In Bulgarian, three of the six vowels are subject to stress-related phonological vowel reduction (Pettersson and Wood 1987; Andreeva et al. 2013). Macedonian is the only non-West Slavic language with fixed stress, which is antepenultimate in trisyllabic and longer words (Koneski 1976, 1983; Bethin 1998: 178; van der Hulst et al. 1999: 436). Unlike Bulgarian, Polish, and Slovenian, BCMS apply stress assignment rules to clitic groups (Nespor 1999: 145; Werle 2009). In Macedonian, for example, post-verbal clitics cause a stress shift to the antepenultimate syllable (Rudin et al. 1999: 551). Among Western South Slavic languages, Serbian and Croatian have a lexical high tone that spreads to the syllable to its left if there is one, with some exceptions specific to the region of Zagreb (e.g. Smiljanić 2004). Stress in Slovenian falls on the first syllable with a strong low tone or, if there is no tone, on the last syllable (van der Hulst 2010b: 455). Slovenian stress is independent of lexical low and high tones, which are obligatory in some dialects but optional in the standard language (Gvozdanović 1999; Jurgec 2007). East Slavic languages, Russian, Ukrainian, and Belarusian, have unbounded distinctive word stress systems where the stress may occupy any position in a word and differ across inflexional forms, for example as shown in (6). (6) Russian
борода [bərɐˈda] ‘beard’ nom.sg
бороды [ˈborədᵻ] ‘beards’ nom.pl
Russian is often characterized as having free-stress assignment (Danylenko and Vakulenko 1995; Hayes 1995; Lavitskaya 2015). Longer word forms may feature secondary stress, but rules for its location remain a matter of dispute. Duration and intensity, the latter being less significant, are the major acoustic correlates of word stress, while pitch may be important when duration- and intensity-based cues are inconclusive (Eek 1987: 21). Duration and intensity would also appear to be the major correlates of word stress for Ukrainian and Belarusian, but they may differ in terms of their weight (Nikolaeva 1977: 111–113, 127–130; Łukaszewicz and Mołczanow 2018). In Russian, vowels are systematically reduced in unstressed positions (Bethin 2012). In standard Belarusian, the contrast between non-high vowels is neutralized to [a] or a more lax [ɐ] in unstressed syllables, and vowel reduction is categorical (Czekman and Smułkowa 1988).
15.2.2.4 Romance
Romanian features weight-sensitive, right-edge word stress, influenced in both verbs and nouns by derivational affixes but not by inflexional ones (Chitoran 1996; Franzen and
Horne 1997). Vowel quality in Romanian does not change significantly across stressed and unstressed tokens. Empirical studies show greater vowel dispersion under stress and limited centralization in unstressed positions (Renwick 2014).
15.3 Sentence prosody
Intonational properties of the languages of the region have been studied to varying degrees employing both the more traditional contour-based methods and target-based descriptions such as the autosegmental-metrical (AM) framework (Table 15.1). Empirical studies of speech rhythm in these languages have contributed to the discussion on interval-based rhythm metrics.
Table 15.1 Available descriptions based on the autosegmental-metrical framework

Finno-Ugric
Estonian: Intonation phrase; pitch accents H*, L*, H*+L, ^H*+L, L*+H, H+L*, H+!H*; boundaries %, H%; source Asu (2004, 2006), Asu and Nolan (2007)
Finnish: Intonation phrase; pitch accents L+H*, L*+H; boundaries L%, H%; source Välimaa-Blum (1993)

Slavic
Czech: Intonation phrase; pitch accents H*, L*, H*L, L*H, and a flat contour S*; boundaries L%, H%, M%, 0%; source Duběda (2011, 2014)
Slovak: Accentual phrase, Intermediate phrase, Intonation phrase; pitch accents H*, L*, !H*; boundaries H-, L-, !H%, H%, L%; source Rusko et al. (2007), Reichel et al. (2015)
Polish: Intermediate phrase (aka Minor phrase), Major phrase, Intonation phrase; pitch accents L*, H*L, L*H, LH*, HL*, LH*L; boundaries H-, L-, %H, L%, H%; source Wagner (2006, 2008)
Russian: Phonological word, Intermediate phrase, Intonation phrase; pitch accents H*L, H*H, H*M, L*, L*H, ^HL*, H*M/(H)L*; boundaries %H, %M, %L, L%, 0%; source Odé (2008)
BCMS: Intermediate phrase, Intonation phrase; pitch accents H*+L, L*+H; boundaries %L, %H, LH-, L%, H%, HL%; source Godjevac (2000, 2005)
Bulgarian: Intermediate phrase, Intonation phrase; pitch accents H*, L*, L+H*, L*+H, H+!H*, H+L*; boundaries L-, H-, %H, L%, H%; source Andreeva (2007)
Romance
Romanian: Intermediate phrase; pitch accents H*, L*, L+H*, L+

Catalan cabell), and trochees are equivalent to Catalan monosyllabic words (Spanish and Portuguese caro ‘expensive’ > Catalan car). Final stress and monosyllabic words are more frequent in Portuguese than in Spanish, among other things due to the historical loss of intervocalic /l/ and /n/ (Spanish palo ‘stick’ > Portuguese pau, Catalan pal; Spanish artesana ‘craftswoman’ > Portuguese artesã, Catalan artesana). The different position of Spanish is shown in Figure 17.1. Antepenultimate stress is rare, particularly in Catalan and Portuguese. Although there are competing analyses of stress assignment in these languages, accounts of stress as a predictable phenomenon have relied on morphological information (e.g. lexical stem, suffixal morphemes, word category; for reviews, see Vigário 2003a; Hualde 2013). The extent to which syllabic quantity or weight determines stress location is debatable (Mateus and Andrade 2000; Wheeler 2005; Garcia 2017; Fuchs 2018). Besides a demarcative tendency, stress systems tend to show an alternation of prominences within the word, yielding patterns of secondary stresses. Catalan, Spanish, and some varieties of Portuguese, such as Brazilian Portuguese, may display the typically Romance alternating pattern of secondary stresses to the left of the primary stress (Frota and Vigário 2000; Hualde 2012). However, native speakers’ intuitions are less clear on the locations of secondary stress. Experimental work has often failed to find evidence for alternating patterns (Prieto and van Santen 1996; Díaz-Campos 2000). A different pattern occurs in European Portuguese, with alignment of secondary stress with the left edge of the word (Vigário 2003a). This pattern may also be found in Catalan and Spanish (Hualde 2010, 2012).
Figure 17.1 Frequencies of stress patterns (%) in Catalan, Spanish, and Portuguese. S = Strong; W = Weak. (Data from Frota et al. 2006; Prieto 2006; Vigário et al. 2006, 2010; Frota et al. 2010)
Pitch accents generally associate to syllables with primary stress (see §17.4.1). However, in Catalan and Spanish emphatic speech, pitch accents can additionally associate with secondary stresses located at the left edge of the prosodic word. Prominence in compound words in Ibero-Romance languages is right-headed; that is, the rightmost element of the compound bears the compound stress. Internal members of compounds typically keep their stress in Catalan and Portuguese (Prieto 2004; Vigário 2003a, 2010), whereas in Spanish the survival of internal stresses seems to depend on the morphosyntactic or lexicalized nature of the compound (Hualde 2006).
17.2.2 Basque
In contrast with Catalan, Portuguese, and Spanish, Basque displays a variety of word-prosodic systems. Most Basque dialects belong to the stress-accent type—that is, all lexical words have a syllable with main word stress (cf. Hualde 1997, 1999, 2003a), with differences in the (default) stress location. The most widespread system is deuterotonic stress—that is, stress on the second syllable from the left word edge (e.g. alába ‘daughter’, emákume ‘woman’, argálegi ‘too thin’). Some varieties do not allow for final stress, so in disyllabic words stress falls on the initial syllable. In many of these varieties, there is evidence that the domain for foot construction is the root rather than the whole word, as words such as lúrrentzako ‘for the lands’ and béltzari ‘to the black one’, which morphologically are composed of the monosyllabic roots lur and beltz followed by suffixes, have initial stress. Borrowings from Spanish (Sp.) or Latin (Lt.) show initial stress: jénde or jénte ‘people’ ( aɦ͂ári > ahái ‘ram’; neskáa > neská ‘the girl’). Secondary stress has been reported in Standard Basque on the final syllable of words that are four or more syllables long, without secondary stresses on alternate syllables (e.g. gizónarenà ‘the one of the man’, enbórrekìn ‘with the tree trunks’). Final syllable secondary
stress has also been found in Southern High Navarrese, which has initial stress (Hualde 1997, 1999, 2003a). The geographical distribution of this pattern is as yet unknown. The Northern Bizkaian Basque (NBB) varieties have been classified as pitch accent systems, due to the similarity of NBB’s lexical contrast between accented and unaccented words to that of Tokyo Japanese. In NBB, a subject-object-verb (SOV) language, unaccented words have a prominent word-final syllable when they occur finally in a sentence fragment, including citation pronunciations, and when they occur pre-verbally in a sentence. Unlike these unaccented words, accented words have a lexically or morphologically determined prominent syllable regardless of their sentential position. The prominence derives from a H*+L accent in all cases. In the sentence on the left in Figure 17.2, the accented words amúmen ‘of the grandmother’ and liburúa ‘book’ have word-level stress, with a falling accent on the pre-final syllable. However, in the sentence on the right, the lexically unaccented word lagunen ‘of the friend’ does not have word-level stress (i.e. it does not have a pitch accent). The word dirua ‘money’ is lexically unaccented, but it receives an accent on its final syllable because it precedes the verb (cf. e.g. Hualde et al. 1994, 2002; Elordieta 1997, 1998; Jun and Elordieta 1997; Hualde 1997, 1999, 2003a). Accented words receive an accent because they have one or more accented morphemes (including roots). In most of the local varieties of NBB, the falling accent is assigned to the syllable that precedes the morpheme. If there is more than one accented morpheme in the word, the leftmost one determines the position of the accented syllable. In eastern varieties of NBB, such as the well-documented Lekeitio variety, a fixed location for stress developed (the penultimate or antepenultimate syllable of the word, depending on the variety), regardless of the location of the leftmost accented root. Further to the east, the Central Basque varieties of Goizueta and Leitza (in Navarre) can also be classified as pitch accent systems, this time not because of their similarity to Japanese but to languages such as Serbian, Croatian, Swedish, and Norwegian, which have a lexical tone contrast in the syllable with word stress.
Figure 17.2 Left utterance: Amúmen liburúa emon nau (grandmother-gen book-abs give aux ‘(S)he has given me the grandmother’s book’); right utterance: Lagunen diruá emon nau (friend-gen money-abs give aux ‘(S)he has given me the friend’s money’).
Thus, in Goizueta and Leitza there is no lexical contrast between accented and unaccented morphemes and words, as all words have one syllable that stands out as more prominent prosodically; rather, there is a distinction in stress location as well as lexical pitch accent type. Four-way distinctions exist: words with stress on the initial syllable with rising or falling pitch accents, and words with stress on the second syllable with rising or falling pitch accents (Hualde 2007, 2012; Hualde et al. 2008). Although the pitch accent varieties of NBB and Goizueta/Leitza are different, there is a historical connection between them. In fact, Hualde (2003b, 2007, 2012), Elordieta (2011), and Egurtzegi and Elordieta (2013) argue that the NBB varieties are the remnants of a once general prosodic system with an accented/unaccented distinction, which changed into stress-accent systems in most areas of the Basque-speaking territory. Supporting evidence lies in the fact that accented morphemes in NBB are precisely those that introduce marked accentuation patterns (initial stress) in the stress-accent varieties with deuterotonic stress (cf. Hualde 2003b, 2007, 2012). Compounds show deuterotonic stress in Standard Basque. In stress-accent varieties there is considerable variation across and within local varieties. In most pitch accent varieties, compounds are generally pitch accented, even when the members in isolation are unaccented. The accent tends to occur on the last syllable of the first member of the compound, although there is variation among local varieties as well (Hualde 1997, 2003a).
17.3 Prosodic phrasing
The hierarchical structure of prosodic constituents is characterized by patterns of metrical prominence and may co-determine the tonal structure of the utterance. The prosodic systems of Iberian languages differ in the set of prosodic phrases that are intonationally relevant as well as in the patterns of phrasal prominence.
17.3.1 Prosodic constituents and tonal structure
Prosodic phrasing may be signalled by boundary tones. Across the Iberian languages and language varieties, up to three prosodic constituents have been defined at the phrasal level: the intonational phrase (IP), the intermediate phrase (ip), and the accentual phrase (AP). In Catalan and Spanish, the IP and the ip are intonationally relevant. They are characterized by pre-boundary lengthening (stronger at the IP level) and the presence of boundary tones after their final pitch accent, with the inventory of boundary tones for the ip being smaller than that for the IP (Prieto 2014; Hualde and Prieto 2015; Prieto et al. 2015). Figures 17.3 and 17.4 illustrate these prosodic phrases in Catalan and Spanish, respectively. Differently from Catalan and Spanish, in Portuguese only one prosodic constituent is intonationally relevant—the IP (Frota 2014; Frota et al. 2015; Frota and Moraes 2016; Moraes 2008). The IP is the domain for pre-boundary lengthening; it defines the position for pauses and it is the domain of the minimal tune, which in the European variety may consist only of the nuclear accent plus the final boundary tone. Prosodic phrases smaller than the intonational phrase do not exhibit tonal boundary marking. An illustration is provided in Figure 17.5.
Figure 17.3 f0 contour of the Catalan utterance La boliviana de Badalona rememorava la noia (‘The Bolivian woman from Badalona remembered the girl’).
Figure 17.4 f0 contour of the Spanish utterance La niña de Lugo miraba la mermelada ‘The girl from Lugo watched the marmalade’.
Many of the constructions reported in Catalan and Spanish to be signalled by ip boundaries, such as parentheticals, tags, and dislocated phrases, are signalled in Portuguese by IP boundaries (Vigário 2003b; Frota 2014), and the prosodic disambiguation of identical word strings by ip’s in Spanish and Catalan occurs at the IP level in Portuguese. While in Catalan, Spanish, and Portuguese there is no evidence for tonally marked prosodic constituents between the prosodic word and the ip/IP (with the exception of Northern Catalan due to contact with French; see Prieto and Cabré 2013), in many dialects of Basque, three constituents are relevant to tonal structure: the IP, the ip, and the AP.
Figure 17.5 f0 contour of the Portuguese utterance A nora da mãe falava do namorado (‘The daughter-in-law of (my) mother talked about the boyfriend’).
Figure 17.6 f0 contour of an utterance from Northern Bizkaian Basque: ((Mirénen)AP (lagúnen)AP (liburúa)AP )ip erun dot (Miren-gen friend-gen book-abs give aux ‘I have taken Miren’s friends’ book’).
In Basque, IP’s are signalled by different boundary tones depending on sentence modality and on whether IP’s are final or non-final in an utterance (Elordieta and Hualde 2014). While final IP’s may have low, rising, or falling contours, non-final IP’s are intonationally marked by rising contours in Basque, signalling continuation. In Standard Basque, ip’s are marked by rising boundary tones at their right edge, but in NBB they are not marked by any right-edge boundary tones. Rather, they are characterized as domains of downstep, where the H*+L pitch accents cause downstep on a following accent (Elordieta 1997, 1998, 2003, 2007a, 2007b, 2015; Jun and Elordieta 1997; Gussenhoven 2004; Elordieta and Hualde 2014; see Figure 17.6). The lower-level constituent is the AP. In NBB, AP’s are sequences of one or
more words marked by a rise in pitch at the left edge and a pitch fall at the right edge. The initial rise is a combination of an AP-initial L boundary tone and a phrasal H tone phonologically associated with the second syllable of the AP, and the pitch fall is a H*+L pitch accent. The H*+L accent may belong to an accented word or to a lexically unaccented word that occurs in immediate pre-verbal position (see §17.2). In all other contexts, lexically unaccented words do not carry an accent and are included in the same AP with any following word(s). Figures 17.6 and 17.7 illustrate the general patterns of phrasing into AP’s in NBB, respectively showing a sequence of three accented words and thus three AP’s, and a sequence of three unaccented words (which form one AP) before the verb. In Standard Basque, AP’s are not clearly identified at their left edge by a rising intonation. Rather, any word with an accent could constitute an AP (Elordieta 2015; Elordieta and Hualde 2014).
Figure 17.7 f0 contour of an utterance from Northern Bizkaian Basque: ((Imanolen alabien diruá)AP )ip erun dot (Imanol-gen daughter-gen money-abs give aux ‘I have taken Imanol’s daughter’s money’).
17.3.2 Phrasal prominence
Phrasal prominence refers to the main prosodic prominence within a prosodic constituent. It is frequently related to the expression of focus, in line with the tendency for languages to exploit prosodic structure for the marking of information status. In Catalan, Spanish, and Portuguese phrasal prominence is rightmost. The last prosodic word in the phrase gets nuclear stress and the nuclear pitch accent thus typically occurs close to the right edge of the intonational phrase. In a broad-focus statement such as ‘They want jam’, the main phrasal prominence is on the last word, as illustrated in (1). Similarly, in a narrow-focus statement such as ‘They want JAM (not butter)’, it is also final, as in (2), but the pitch accent used to express it is different. In all three languages, a particular pitch accent is commonly used to convey narrow (contrastive) focus (see also §17.4.2).
(1) a. (Volen melmelada)IP (Catalan) ‘(They) want jam’
       L* L%
    b. (Quieren mermelada)IP (Spanish)
       L* L%
    c. (Querem marmelada)IP (Portuguese)
       H+L* L%
(2) a. (Volen MELMELADA)IP (Catalan) ‘(They) want JAM (not butter)’
       L+H* L%
    b. (Quieren MERMELADA)IP (Spanish)
       L+H* L%
    c. (Querem MARMELADA)IP (Portuguese)
       H*+L L%
Non-final nuclear prominence is also possible, but here the three languages differ. Changes in the placement of the nuclear accent can be used as a strategy to convey narrow focus in Catalan, Spanish, and Portuguese, as shown in (3) for the statement ‘MARINA is coming tomorrow (not Paula)’, where nuclear prominence is found on ‘Marina’. However, Catalan and to a somewhat lesser extent Spanish are less flexible in shifting the main prominence to a non-phrase-final position than West Germanic languages (Vallduví 1992; Hualde and Prieto 2015). Instead, word order changes are generally used for focus marking in combination with prosodic prominence strategies (Vanrell and Fernández-Soriano 2013), as in (4). Although word order changes are also possible in some constructions in Portuguese, prosodic strategies like those exemplified in (2) and (3) are more widely used. For further details on phrasal prominence and focus in Catalan, Spanish, and Portuguese, see Frota (2000, 2014), Face (2002), Fernandes (2007), Vanrell et al. (2013), Prieto (2014), Frota et al. (2015), Prieto et al. (2015), and Frota and Moraes (2016).
(3) a. (La MARINA vendrà demà)IP (Catalan) ‘MARINA is coming tomorrow (not Paula)’
       L+H* L* L%
    b. (MARINA vendrá mañana)IP (Spanish)
       L+H* L* L%
    c. (A MARINA virá amanhã)IP (Portuguese)
       H*+L H+L* L%
(4) a. (MELMELADA)ip (volen)IP (Catalan) ‘(They) want JAM (not butter)’
       L+H* L- L* L%
    b. (MERMELADA)ip (quieren)IP (Spanish)
       L+H* L- L* L%
Differently from Catalan, Spanish, or Portuguese, the neutral word order in Basque declarative sentences is SOV, and the main prosodic prominence is assigned to the pre-verbal constituent—that is, the object (Hualde and Ortiz de Urbina 2003). In the sentence in (5), the direct object madari bát ‘a pear’ is interpreted as the constituent with main prominence.
(5) Mirének umiari madari bát emon dotzo.
    Miren-erg child-dat pear-abs one give aux
    ‘Miren has given a pear to the child’
Any word order that is not SOV necessarily indicates that the sentence has a constituent that is focalized and other constituents have to be understood as ‘given’ information. For instance, if the sentence in (5) were to have the order OSV, the subject would be interpreted as having narrow focus and the object would be ‘given’ information. In narrow-focus contexts, the focalized constituent must be immediately to the left of the verb. Narrow focus can also occur post-verbally, at the end of the clause (Hualde and Ortiz de Urbina 2003; Elordieta and Hualde 2014). It is not clear whether there is a difference between the realization of prosodic prominence in broad focus and in non-corrective narrow focus. In Central and Standard Basque, in both cases pitch accents are rising. Whereas pitch accents may have their peaks on the post-tonic syllable in broad focus, in narrow focus there is a tendency for such peaks to be realized within the tonic syllable (Elordieta 2003; Elordieta and Hualde 2014). In the specific type of narrow focus called ‘corrective’, the focalized constituent always has an accent with a peak in the tonic syllable, followed by a reduced pitch range on the following material. This holds for all varieties (Elordieta 2003, 2007a). Thus, Iberian languages show varying prosodic prominence effects of focus, with Basque displaying heavy syntactic constraints that are not found in the other languages and with Portuguese showing a more flexible use of prosodic prominence.
17.4 Intonation
In Iberian languages, the tonal structure of utterances comprises intonational pitch accents and boundary tones. Lexical pitch accents contribute to the tonal structure in Basque only (§17.2). Differences in the types, complexity, and distribution of pitch events, and resulting nuclear configurations, are found across languages. The division of labour between prosodic and morphosyntactic means to express sentence modality and other pragmatic meanings varies greatly too. Unless otherwise stated, the description below is mostly based on the varieties of each language whose intonation is best known: Central Catalan, Castilian Spanish, Standard European Portuguese, and Northern Bizkaian and Standard Basque.
17.4.1 Tonal events
All of the languages described in this chapter have pitch accents and IP boundary tones, but not all show ip or AP edge tones. Portuguese stands out for the absence of ip tonal boundaries, whereas Basque is unique in the central role played by the AP in its tonal structure. This is as expected, given the set of intonationally relevant prosodic constituents for each language described in §17.3.1. There are larger and often different sets of nuclear pitch accents than of prenuclear pitch accents in Catalan, Spanish, and Portuguese. While most pitch accents in the languages’
inventories are used only in nuclear position, it is not uncommon for a given pitch accent to occur only in prenuclear position, as do Catalan L*+H and Spanish L+
Figure 18.6 Cologne Accent 1 and Accent 2 in nuclear position interacting with two contours, H*L L% and L*H L%.
In some dialects, the opposition between Accent 1 and Accent 2 remains intact outside accented syllables. In Cologne and Maastricht, for instance, the contrast is maintained in postnuclear position by both durational and f0 cues (Gussenhoven and Peters 2004; Peters 2006b; Gussenhoven 2012b), while in Tongeren and Hasselt, Accent 2 affects primarily f0 (Heijmans 1999; Peters 2008). Venlo, Roermond, and Helden maintain the contrast only in intonationally accented syllables and IP-final syllables, where the lexical tone of IP-final Accent 2 interacts with the final boundary tone (Gussenhoven and van der Vliet 1999; Gussenhoven 2000c; Gussenhoven and van den Beuken 2012). Lexical tones associate either to the mora (Central Franconian and Dutch Limburgian dialects) or to the syllable (some Belgian Limburgian dialects) (Peters 2008). Where the TBU is the mora, the tone contrast is bound to bimoraic syllables, with (sonority) requirements on segmental structure that vary across dialect groups. In addition, regional variation is found in the lexical distribution of Accent 1 and Accent 2 (for overviews, see Schmidt 1986, 2002; Hermans 2013). Unlike the case in CNG, the presence of a lexical tone accent distinction in Franconian varieties does not seem to drastically restrict the richness of the intonational system (Gussenhoven 2004: 228 ff.). Figure 18.6 illustrates the interaction of Accent 1 and Accent 2 with two nuclear contours in Cologne German, where the lexical tone of Accent 2 assimilates to the starred tone (after Peters 2006b). More recently, alternative metrical analyses have been proposed that account for the opposition in terms of a contrastive foot structure (e.g. Hermans and Hinskens 2010; Hermans 2013; Köhnlein 2016; Kehrein 2017; for discussion see Gussenhoven and Peters 2019). Apart from the question of whether the tone accent distinction is better accounted for by a tonal or a metrical analysis, there is an unresolved controversy about the origin of the Franconian tone accents (cf. de Vaan 1999; Gussenhoven 2000c, 2018a; Schmidt 2002; Köhnlein 2011; Boersma 2017).
18.4 Concluding remarks
The prosody of varieties of Continental Germanic may be more similar than descriptive traditions suggest. For the stress system, the bigger divide is between English and the other Germanic languages, with the Norman Conquest usually taken to be responsible. Within Continental Germanic, there is a major divide between the languages that retain a surface-evident consonant quantity distinction and those that do not. That isogloss runs between the northern CNG languages on the one hand, and Danish, Standard German, Standard Dutch, English, and so on, on the other. At the same time there are several CWG varieties with a generally southern (Alemannic) spread that retain a consonant quantity distinction (Kraehenmann 2001).
The Germanic languages are all intonational languages, employing pitch accents and boundary tones whose distribution and use serve information-structural and pragmatic purposes. There are also two lexical tonal systems that are superimposed on the intonation system. In CNG, the tonal contrast must be assumed to have restricted the pragmatic variation as expressed by the intonation, unlike the case in CWG, where the intonation system is less drastically affected by the presence of a tonal contrast. This speaks to a basic difference between the CNG and CWG tonal contrasts: the Franconian tonal system is constituted more like the Latvian and Lithuanian ones than like the North Germanic one. This is also seen in the interaction with segmentals. The CNG system is relatively 'prosodic' in that it (i) does not care about the sonority of syllables, (ii) does not affect vowel quality, (iii) typically requires more than one syllable to be expressed (Accent 2), and (iv) assigns a version of the lexical tonal contour (Accent 2) to compounds, by a prosodic rule. By contrast, the CWG tone contrast (i) requires two sonorant morae if the TBU is the mora, (ii) may affect vowel quality, (iii) may occur within a single stressed syllable, and (iv) does not have a prosodic rule assigning a particular tonal contour to compounds.
Chapter 19
Intonation Systems Across Varieties of English
Martine Grice, James Sneed German, and Paul Warren
19.1 The role of English in intonation research
The mainstream standard varieties of English have played a major role in the development of models of intonation, with different traditions on either side of the Atlantic. The British School emphasized auditory training, including the production and perception of representative intonation patterns—akin to training in the cardinal vowel system (Jones 1967)—and transcription using tonetic stress marks. These diacritics indicate the position of stressed syllables and pitch movements (falling, rising, level, falling-rising, and rising-falling) across these and following syllables. The most important feature is the 'nuclear tone', the pitch movement over the stressed syllable of the most prominent word in the phrase¹ (nucleus) and any following syllables (tail). The British School has included didactic approaches (O'Connor and Arnold 1961), approaches focused on phonetic detail and broad coverage (Crystal 1969), and approaches considering the relationship with syntax and semantics/pragmatics (e.g. Halliday 1967; Tench 1996). Meanwhile, American Structuralism focused on phonological structure, separating stress-related intonation from intonation marking the edges of phrasal constituents (Pike 1945; Trager and Smith 1951). Together with the all-encompassing work by Bolinger (1958, 1986, 1989), who introduced the concept of pitch accent, and insights from work on Swedish (Bruce 1977), this set the stage for autosegmental-metrical (AM) approaches to English intonation (e.g. Liberman 1975; Leben 1976; Pierrehumbert 1980; Gussenhoven 1983b; Ladd 1983; Beckman and Pierrehumbert 1986).
In AM theory, the separation of prominence-cueing and edge-marking aspects of intonation is crucial: pitch accents are associated with stressed syllables, and edge tones with phrase edges. The AM equivalent to the British School nuclear tone is the combination of the last pitch accent and the following edge tones, referred to as the 'nuclear contour'. In what follows, we couch our discussion in AM theory, currently the most widespread approach to intonation, paying special attention to the nuclear contour, which is often the focus of comparison in the papers we have consulted. Although using AM representations facilitates comparison, we nonetheless have to exercise caution, since AM models can differ from each other in ways that do not always reflect differences in the varieties they describe. Each model can be seen as a prism through which to observe a variety; at the same time, a model developed to describe a variety is likely to have been shaped by that variety. We return to this problem when discussing mainstream varieties below.
¹ Whether the nuclear syllable is always the most prominent in the phrase has been contested. Instead it has been defined positionally as the last content word or the last argument of the verb.
19.2 Scope of the chapter
This chapter is concerned with the structure and systematicity underlying intonation. A historical emphasis on standardized varieties from the British Isles and North America means that our descriptive understanding of those varieties is particularly comprehensive, and they have also played a central role in the development of theoretical models and frameworks. Therefore, we treat those varieties, and closely related varieties from Australia, New Zealand, and South Africa, under the label 'Mainstream English Varieties' (MEVs). MEVs show relatively few differences in overall phonological organization, whereas more substantial variation arises in cases where MEVs were adopted and subsequently nativized by non-English-speaking populations. We therefore examine separately a selection of 'Contact English Varieties' (CEVs) that differ from MEVs in typologically interesting ways.² We explore the challenges posed by their diverging prosodic structures, in terms of both prominence and edge-marking, and observe that an account of the intonation of current-day 'Englishes' needs to cover a broad range of typological phenomena going well beyond what is present in the extensively researched mainstream varieties. This broad range in turn provides us with a chance to observe prosodic variation within one language in a way that is usually only available to cross-linguistic studies.
19.3 Intonational systems of Mainstream English Varieties
MEVs have many intonational properties in common, both in terms of the distribution of tones throughout the utterance and in terms of their local and global tonal configurations. Hence, although we discuss northern hemisphere and southern hemisphere varieties separately below, this is mainly for convenience in the context of this handbook, which groups languages according to their geographical location.
² This distinction is closely similar to that between Inner and Outer Circle varieties (Kachru 1985). A separate terminology is used here because the present review is concerned primarily with similarities and differences in the synchronic structural aspects of nativized varieties, as opposed to their sociolinguistic contexts.
19.3.1 Northern hemisphere
All MEVs have lexical word stress, in that specific syllables within a word are designated as prosodically privileged in the sense of Ladd (2008b) and Hyman (2014b). Not only do such syllables receive acoustic and articulatory enhancement but they are also possible locations for pitch accent realization. When present, pitch accents usually occur on the primary stressed syllable of a word, though, in particular contexts, secondary stressed or even unstressed syllables can be promoted to bear accents. This can be for rhythmic reasons—Chiˈnese has an initial accent in CHInese GARden to avoid a clash of pitch accents on adjacent syllables—or in metalinguistic corrections, such as 'I met roSA not roSIE'. Northern hemisphere MEVs also share a considerable reduction of unstressed syllables, leading to strong durational differences between stressed and unstressed syllables. This is not always the case in southern hemisphere MEVs (see §19.3.4).
Pitch accents are sparsely distributed in MEVs, and are more common on nouns than on verbs, and on content words than on function words. Each intonational phrase (or intermediate phrase if the model includes one) requires at least one pitch accent—the nuclear pitch accent (with exceptions for subordinate phrases and tags; see Crystal 1975: 25; Firbas 1980; Gussenhoven 2004: 291; Ladd 2008b: 238). Prenuclear pitch accents are not usually obligatory in MEVs, but they may serve information-structural functions (e.g. topic marking). They may also be optionally inserted for rhythmic purposes, usually near the beginning of a phrase.
MEVs use the placement of nuclear pitch accents to mark the difference between new or contrastive information and discourse-given information. The rules governing this relationship are complex and outside the scope of this chapter (see Wagner 2012a for a review). In broad terms, however, a focused constituent obligatorily contains a nuclear pitch accent, while discourse-given words following the focused constituent are unaccented. This contrasts with many contact varieties, for which accent placement does not appear to be used to mark information status or structure.
The intonation systems of MEVs have a rich paradigmatic range of pitch accents and edge tone combinations. The pitch accents proposed in a consensus system (Beckman et al. 2005) to describe Mainstream American English (MAE) are simple low (L*) and high (H*) tones, rises (L+H*), and scooped rises (L*+H); downstepped variants of each of these last three (!H*, L+!H*, L*+!H); and an early-peak falling accent (H+!H*). This system assumes two phrase types: the intonational phrase (IP) and the intermediate phrase (ip). The latter has a high (H-), low (L-), or downstepped high-mid (!H-) edge tone at its right edge. IP-finally, the ip edge tone is followed by a high (H%) or low (L%) IP edge tone (Pierrehumbert 1980; Beckman and Pierrehumbert 1986). The co-occurrence of ip and IP edge tones leads to complex tone sequences at IP edges: a phrase-final nuclear syllable—if combined with a bitonal pitch accent—can carry up to four tones. These pitch accents and edge tones are illustrated in online ToBI (Tones and Break Indices) training materials (Veilleux et al. 2006).
The edge tones of ip's are also referred to as 'phrase accents'. For some varieties it has been argued that they can be associated with a post-focal stressed syllable (Grice et al. 2000),³ lending them an accent-like quality. Although this does not promote the syllable sufficiently to allow it to bear a pitch accent (thus, it does not counteract deaccentuation), the syllable is rendered more prominent by virtue of bearing a tone.
³ This is particularly common in fall-plus-rise contours, referred to as a compound fall-plus-rise tune in the British School, e.g. 'My \mother came from /Sheffield', where Sheffield is given in the discourse context (O'Connor and Arnold 1961: 84).
This same inventory of pitch accents and edge tones could in principle be used to describe Southern Standard British English (SSBE) (Roach 1994; Ladd 2008b). However, there are a number of differences in the tonal inventories of AM models developed for SSBE, although these appear to reflect differences in the models themselves rather than differences in intonation between MAE and SSBE. For instance, some models include a pitch accent representing a nuclear fall H*+L (also written H*L; see Gussenhoven 1983b, 2004 on British English; Grabe et al. 2001), whereas others capture this configuration with a sequence of pitch accent and ip edge tone H* L-. AM models of MEVs also differ in their treatment of the movement (onglide) towards the starred tone. In MAE_ToBI it is captured with a leading tone (L+H*), whereas in some other models it is either non-distinctive (and therefore treated as phonetic detail; Grabe et al. 2001) or derived from an L tone from a previous accent (Gussenhoven 2004). Nonetheless, if the pitch movement is falling (i.e. there is high pitch before the accented syllable), there is a general consensus across the different models that this should be captured with an early-peak accent (H+!H*). The status of the onglide is certainly of theoretical import, but it does not capture differences across the intonation of these varieties (see Grice 1995a for a discussion). It may be, for instance, that a more frequent use of falling pitch phrase-medially leads to the conclusion that there must be an H*+L pitch accent (Estebas Vilaplana 2003), whereas if falls tend to occur phrase-finally, they might be more likely to be analysed as H* followed by an L- edge tone.
The danger of comparing different intonational systems using models that are based on different assumptions, even if they are all in the AM tradition, is discussed in Ladd (2008a), who argues that this can lead to comparisons that are not about particular languages or varieties but about the models themselves. Crucially, differences in the analysis of specific patterns should not be downplayed, as they have consequences for the overall organization of a model, including, for example, the need for an ip level of phrasing. Ladd (2008b) and Gussenhoven (2016), among others, provide valuable critiques of the MAE-based consensus model outlined above. However, they highlight that revisiting early theoretical choices of that model need not undermine the appropriateness of an AM approach to MEVs.
Although any combination of pitch accents and edge tones is in principle allowed, certain preferred combinations occur more frequently than others (Dainora 2006). Likewise, although individual pitch accents and edge tones have been assigned pragmatic functions (Pierrehumbert and Hirschberg 1990; Bartels 1999), it is often combinations of nuclear pitch accents and edge tones (i.e. nuclear contours) that are referred to when the meaning of intonation is discussed (see e.g. Crystal 1969; Brazil et al. 1980; Cruttenden 1986; Tench 1996). For instance, rise-fall and (rise-)fall-rise nuclear contours are said to convey meanings such as unexpectedness or disbelief respectively (see §19.5 for a discussion of rises).
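The combinatorial claim above can be made concrete in a few lines of code. The following is a minimal sketch of our own (not from the chapter) that enumerates the nuclear-contour space generated by the MAE_ToBI inventory just described, combining each nuclear pitch accent with each ip phrase accent and IP boundary tone:

```python
# Enumerate the phrase-final tonal combinations licensed in principle by the
# MAE_ToBI inventory described above; the variable names are our own.
from itertools import product

PITCH_ACCENTS = ["L*", "H*", "L+H*", "L*+H", "!H*", "L+!H*", "L*+!H", "H+!H*"]
PHRASE_ACCENTS = ["H-", "L-", "!H-"]   # ip edge tones
BOUNDARY_TONES = ["H%", "L%"]          # IP edge tones

nuclear_contours = [" ".join(combo) for combo in
                    product(PITCH_ACCENTS, PHRASE_ACCENTS, BOUNDARY_TONES)]

print(len(nuclear_contours))  # 48 combinations in principle
print(nuclear_contours[:3])   # ['L* H- H%', 'L* H- L%', 'L* L- H%']
```

As just noted, only some of these combinations occur with any frequency; the enumeration simply illustrates how paradigmatically rich the system is in principle.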
Differences between North American and British standard varieties can often be expressed in terms of which nuclear contour is preferred in which pragmatic context (Hirst 1998), rather than in terms of inventory differences. However, differences in usage can lead to misunderstandings: for instance, H* L-H% is routinely used for requests in SSBE but can sound condescending in MAE (Ladd 2008b), where a simple rise is preferred.
19.3.2 Non-mainstream varieties of American English
The limited research on the intonation of regional or ethnic varieties of American English suggests that varietal differences are relatively minor. Comparing rising pitch accents in speakers from Minnesota and Southern California, for example, Arvaniti and Garding (2007) found that the former group has later alignment of tonal targets and likely lacks a distinction between what are taken to be two distinct accents in the latter variety, H* and L+H*. In a comparison of Southern and Midland US varieties, Clopper and Smiljanić (2011) found distributional differences, namely in the relative frequencies of certain edge tone categories, which also varied across gender and texts.
African American English (AAE) has been relatively well studied (see Thomas 2007, 2015 for reviews). One widely observed feature is that certain words have initial stress where MAE has non-initial stress (e.g. ˈpo.lice, ˈho.tel; Fasold and Wolfram 1970). This is noteworthy since it affects tonal alignment. A few studies report final falling or level contours in AAE for polar questions (Tarone 1973; Loman 1975; Jun and Foreman 1996; Green 2002), whereas MAE speakers typically use a high rising contour (L* H-H%). Jun and Foreman (1996) also report post-focal pitch accents (i.e. no deaccenting) with early focus. Other differences involve f0 scaling, including a greater overall range (Tarone 1973; Hudson and Holbrook 1981, 1982; Jun and Foreman 1996) and a lower degree of declination (Wolfram and Thomas 2002; Cole et al. 2008).
Research on Chicano English in Hispanic communities remains entirely descriptive, and characteristics vary by region. Metcalf (1979) provides a comprehensive survey of Chicano English studies spanning 20 years, identifying five features as typical: (i) a tendency for noun compounds to have primary stress on the second constituent rather than the first (e.g. baby ˈsitter; Metcalf 1979; Penfield 1984), (ii) a less pronounced utterance-final fall for declaratives and wh-interrogatives (Castro-Gingras 1974; Metcalf 1979), (iii) greater use of non-final rising pitch prominences (Castro-Gingras 1974; Thomas and Ericson 2007), (iv) the use of final falling contours for polar questions (Castro-Gingras 1974; Fought 2003), and (v) the use of emphatic 'rising glides' with a peak very late in the stressed syllable (Penfield 1984; Penfield and Ornstein-Galicia 1985). Fought (2002) also notes additional lengthening of stressed syllables at the beginnings and ends of utterances.
Burdin (2016) presents a comprehensive examination of English intonation among Jewish and non-Jewish Americans in Ohio. While the differences mainly concern category usage (e.g. Jewish speakers use more rising pitch accents in listing contexts and more level contours in narratives), her findings suggest that contact with Yiddish has contributed to a more 'regular alternation between high and low tones within a prosodic phrase' (p. 12), or 'macro-rhythm' (Jun 2014b), in Jewish English.
In general, existing research on non-standard American English varieties is highly descriptive, rarely concerned with phonological structure, and mostly from older studies; this area of research clearly needs to be updated.
19.3.3 Non-mainstream British varieties
Across British varieties there is considerable regional and individual variation in intonational inventories (Cruttenden 1997; Grabe et al. 2000), as well as differences in preferred patterns in given pragmatic contexts. One striking aspect of many urban northern varieties (e.g. in Birmingham, Liverpool, Glasgow, Belfast, and Tyneside, which Cruttenden 1994 dubbed Urban Northern British (UNB)) is the routine use of rising contours in statements. Not only their form but also their usage as the default contour distinguishes them from uptalk (see §19.5). Cruttenden (1997) proposes four types of 'rise': a simple rise (preferred in Glasgow), two rises with plateaux (rise-plateau and rise-plateau-slump, both common in Birmingham, Liverpool, Belfast, and Tyneside), and a rise-fall, found less commonly in several UNB varieties as well as in Welsh English. An alternative AM transcription system has been developed specifically to capture this regional variation (Intonational Variation in English, or IViE; Grabe et al. 2001); it would transcribe these contours as L*+H (Grabe 2002, 2004) combined with different edge tone sequences. To capture the difference between a final rise, a level, and a slump, IViE incorporates a null boundary tone (0%) in addition to H% and L% (the latter taken to be low, not upstepped as is sometimes the case in MAE_ToBI). However, in a detailed description of Manchester English, Cruttenden (2001) argues that this addition is insufficient to capture the range of distinctive contours both here and in UNB varieties. Distinctions need to be made between slumps and slumps plus a mid-level stretch, and between simple falls and falls plus a low-level stretch. Cruttenden argues for a feature involving sustention, rather like spreading, which §19.4 will show is useful for contact varieties.
Irish English, referred to as Hiberno-English, is defined by a set of unofficial localized linguistic standards, setting it apart from British varieties. The intonation patterns in Hiberno-English varieties can be mapped onto corresponding regional Irish varieties (Dalton and Ní Chasaide 2007b; Dorn and Ní Chasaide 2016). There is a clear difference in intonation between the south and the north of the island comprising the Republic of Ireland and Northern Ireland. According to Kalaldeh et al. (2009), declarative statements in Dublin and Drogheda English tend to have prenuclear and nuclear falling contours, analysed as bitonal H*L pitch accents, whereas further north (e.g. in Donegal) they have prenuclear and nuclear rises similar to those in Belfast (analysed as L*H). O'Reilly et al. (2010) report post-focal deaccenting in Donegal English. Early focus, interestingly, leads to a gradual fall from the focal H tone over the post-focal domain, reaching low pitch phrase-finally, rather than the characteristic final rise of broad-focus or late-focus statements in this variety.
19.3.4 Southern hemisphere mainstream varieties
Many of the properties of northern hemisphere varieties apply to southern hemisphere varieties too. Analyses of Australian English (AusE) assume the same inventory of pitch accents, phrase accents, and boundary tones as the AM analyses of MAE (see e.g. Cox and Fletcher 2017), although earlier analyses (e.g. Fletcher et al. 2002) included H*+L and its downstepped variant, !H*+L, which are absent from MAE_ToBI.
Rising tunes have been widely studied in AusE (see also §19.5). Fletcher et al. (2002) found that low-onset high rises (L* H-H%) and expanded-range fall-rises (H*+L H-H%) are more frequent and have greater pitch movements than low-range rises (L* L-H%) and low-range fall-rises (H* L-H%). There are higher topline pitch values and more high-onset rises (H* H-H%) in instructions and information requests than in acknowledgements. For participants issuing instructions, expanded-range rises are more likely at the ends of turns than turn-medially.
An analysis of newsreader speech showed that New Zealand English (NZE) has comparatively fewer level nuclei and more complex nuclei (the latter potentially expressing greater emotional involvement) than British English, and that NZE tends to have more dynamic intonation patterns with a higher rate of change in pitch (Warren and Daly 2005), though this varies by region within New Zealand. Vermillion (2006) found that NZE is characterized by higher pitch, with higher boundary tones (H%) but smaller pitch falls between adjacent H* accents. Comparisons of AusE and NZE have chiefly focused on rising intonation patterns, particularly uptalk (Fletcher et al. 2005; Warren and Fletcher 2016a, 2016b). One point of comparison has been how the two varieties distinguish uptalk rises on declaratives from yes/no question rises, with more dramatic rises for statements than for questions (see §19.5) and more fall-rise patterns for uptalk in AusE.
Moving finally to South Africa, given that English is the most commonly spoken language in official and commercial public life and is the country's lingua franca, it is surprising how little has been written about the intonation of this variety. Indeed, reference descriptions of South African English (e.g. Lass 2002; Bowerman 2008) make no mention of prosodic features. Uptalk rises on declaratives have been reported for White South African English (WSAfE), and, as in NZE, these tend to have a later rise onset than question rises (Dorrington 2010a, 2010b). For Black South African English see §19.4.8.
19.4 English intonation in contact
This section considers varieties arising from second language (L2) use of English, with subsequent nativization. Contact with typologically diverse languages has resulted in intonation systems that differ, sometimes dramatically, from those of MEVs.
19.4.1 Hong Kong English
Hong Kong English (HKE), or Cantonese English, refers to a variety spoken by first language (L1) speakers of Cantonese, either as an L2 or as a nativized co-L1 in contexts such as Hong Kong, where both languages are spoken side by side. Luke (2000) provided the first account of HKE intonation in terms of tone assignment that is sensitive to lexically stressed syllables. Word stress patterns of SSBE are interpreted in terms of three Cantonese level tones: High (H), Mid (M), and Low (L). The resemblance between Cantonese level tones and the pitch patterns of HKE syllables is indeed striking, and most authors assume that the tone inventory of HKE at least partly originates from Cantonese (though see Wee 2016 for a discussion). Luke (2000) proposes the rules in (1) for tone assignment in HKE, where 'stress' corresponds to all primary stressed syllables and variably to secondary stressed syllables in British English. Other patterns result from concatenations of the word-level patterns, except that certain classes of function words (possessors, modals, and monosyllabic prepositions) are realized as M (Wee 2016).
(1) a. Stressed syllables are realized as H.
    b. Unstressed syllables before the first stressed syllable in a word are realized as M.
    c. H spreads rightward.
Luke (2000) assumes that all syllables have level pitch, though later studies show that the final syllable may be realized as a fall, resulting from a declarative L% boundary (Wee 2008; Cheung 2009). As Figures 19.1a and 19.1b show, the final syllable is realized as HL if it is stressed and as L otherwise. For sentences that end with more than one unstressed syllable in sequence, an interpolated fall stretches from the H of the last stressed syllable (Figure 19.1c) (Wee 2016). Luke (2000) and Gussenhoven (2017b, 2018b) also observe a sustained, high-level final pattern for final-stress utterances. Gussenhoven attributes this to an absence of boundary tone (Ø), which contrasts with L% and conveys a non-emphatic declarative meaning. Wee (2016), however, argues that this is a phrase-medial pattern (even when pronounced in isolation), ruling out a boundary tone.
The right edge of polar interrogatives is high rising H H% if the last syllable is stressed (Figure 19.2a) and falling-rising HLH% from the last stressed syllable otherwise (Figures 19.2b and 19.2c). According to Gussenhoven (2017b, 2018b), pre-final L in the latter case arises from an HL assigned to the last stressed syllable (as opposed to H for earlier stresses), while Wee (2016) assumes that LH% is a compound boundary tone.

Figure 19.1 Tonal representations and stylized f0 contours for three stress patterns (a. final stress: tea; b. penultimate stress: apple; c. antepenultimate stress: yesterday) in a declarative context. (Adapted from Wee 2016: ex. 15d)

Figure 19.2 Tonal representations and stylized f0 contours for the same three stress patterns in a polar interrogative context. (Adapted from Wee 2016: ex. 15g)
Existing accounts mostly agree on the surface phonological representations in HKE, and that deriving those representations begins with an assignment of H to lexically privileged syllables corresponding to stress in MEVs. They also agree that H spreads rightward within a word unless the word is IP-final, and M is assigned to unstressed syllables before the first stress in a word, as well as to certain functional categories.
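To make the word-level part of this derivation concrete, here is a minimal sketch of the rules in (1), together with the M assignment to certain function-word classes. It is our own illustration of the account as summarized above, not code from the literature; the class labels and example words are assumptions, and IP-final effects (the declarative L% fall and the blocking of rightward spreading) are deliberately left out.

```python
# A sketch of Luke's (2000) word-level tone assignment for HKE, as
# summarized in (1); function-word classes realized as M follow Wee (2016).
M_CLASSES = {"possessor", "modal", "monosyllabic_preposition"}

def hke_word_tones(stresses, word_class="lexical"):
    """stresses: one boolean per syllable (True = stressed)."""
    if word_class in M_CLASSES:
        return ["M"] * len(stresses)       # these function words are all-M
    tones = []
    seen_stress = False
    for stressed in stresses:
        if stressed:
            tones.append("H")              # (1a): stressed syllables -> H
            seen_stress = True
        elif not seen_stress:
            tones.append("M")              # (1b): pre-stress syllables -> M
        else:
            tones.append(None)             # post-stress: filled by spreading
    for i in range(1, len(tones)):         # (1c): H spreads rightward
        if tones[i] is None and tones[i - 1] == "H":
            tones[i] = "H"
    return tones

print(hke_word_tones([True, False, False]))  # 'yesterday' -> ['H', 'H', 'H']
print(hke_word_tones([False, True]))         # 'about'     -> ['M', 'H']
```

Concatenating such word-level patterns, and then overriding the IP-final stretch with the relevant boundary specification (L%, LH%, H%, or Ø), would reproduce the surface contours discussed above.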
19.4.2 West African Englishes (Nigeria and Ghana)
Nigerian English (NigE) intonation has been described as involving predominantly level tones (Wells 1982; Jowitt 1991; Gut 2005). This is not surprising given its emergence in contact with level-tone languages such as Hausa, Igbo, and Yoruba. Gut (2005) proposes that tones are assigned to syllables based on word stress and grammatical category: lexical words take H on all syllables except the first, which is L if unstressed and H otherwise. Certain functional categories (e.g. personal pronouns) also take H, while articles, prepositions, and conjunctions take L. Downtrending is widely observed (Jowitt 2000; Gut 2005). A perception study (Gussenhoven and Udofot 2010) suggests that this results from obligatory downstep between subsequent H tones, triggered by an L associated to the left edge of lexical words (in words with initial stress, L is floating). A further production study (Gussenhoven 2017b) indicated that, contra Gut (2005), H is assigned only to syllables that have primary or secondary stress in SSBE, and that intervening syllables have high f0 due to interpolation.
Word-level tone assignment appears obligatory and cannot be modified for the expression of contrast or information status (Jibril 1986; Jowitt 1991; Gut 2005). This may explain the impression that NigE has nuclear pitch on the final word. For declarative utterances, the final syllable is falling, which Gussenhoven (2017b) attributes to L%. Polar interrogatives generally end with a high or rising tune (Eka 1985; Jowitt 2000), suggesting that L% alternates with H%.
Ghanaian English (GhanE) is very similar to NigE with a few notable exceptions. Gussenhoven (2017b), building on observations by Criper (1971) and Criper-Friedman (1990), proposes that H is assigned only to syllables corresponding to primary stresses in British English. For lexical words and certain functional categories, a word-initial L spreads to all syllables before the first H, while H spreads rightward from that syllable. IP-finally, word-final unstressed syllables are low or falling, suggesting that H is blocked from spreading rightward in that position. As in NigE, most function words are assigned L. Downstep is triggered across subsequent LH-bearing words, while pitch reset occurs across IP boundaries. Polar interrogatives are not usually marked with a boundary tone, ending instead with the pitch level of the last H. Occasionally, the final syllable of a polar interrogative is low-rising; thus, when H% occurs, a polar tone rule changes the final H to L.
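Gut's (2005) word-level assignment for NigE is similarly rule-like and can be sketched in a few lines. As before, this is a hedged illustration of the rules as summarized above rather than an implementation from the literature; the category labels are our own, and downstep and the Gussenhoven (2017b) interpolation revision are not modelled.

```python
# A sketch of Gut's (2005) word-level tone assignment for Nigerian English:
# lexical words are (L)H...H, personal pronouns all-H, and articles,
# prepositions, and conjunctions all-L. Category labels are assumptions.
L_FUNCTION = {"article", "preposition", "conjunction"}
H_FUNCTION = {"personal_pronoun"}

def nige_word_tones(n_syllables, initial_stressed=True, category="lexical"):
    if category in L_FUNCTION:
        return ["L"] * n_syllables
    if category in H_FUNCTION:
        return ["H"] * n_syllables
    # Lexical words: H throughout, except that an unstressed first syllable
    # is L. (On the downstep account, an initially stressed word still has a
    # floating L at its left edge; that is not represented here.)
    first = "H" if initial_stressed else "L"
    return [first] + ["H"] * (n_syllables - 1)

print(nige_word_tones(3, initial_stressed=False))  # ['L', 'H', 'H']
print(nige_word_tones(2, initial_stressed=True))   # ['H', 'H']
print(nige_word_tones(1, category="article"))      # ['L']
```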
19.4.3 Singapore English
Singapore English (SgE) is a fully nativized L1 variety. Virtually all Singaporeans speak SgE by early school age, and it is the dominant home language for one third of the population (Leimgruber 2013). SgE has been influenced by contact with a wide range of language varieties, and even highly standardized uses of SgE differ markedly from MEVs in terms of prosody (Tay and Gupta 1983). SgE intonation has been described as a series of 'rising melodies', ending in a final rise-fall (Deterding 1994).

Figure 19.3 Waveform, spectrogram, and f0 track for a sentence of read speech ('animals were digging in the rubbish') in Singapore English. (D'Imperio and German 2015: fig. 1b)
The domain of rises is typically a single content word along with preceding function words (Chong 2013), though in certain syntactic contexts a rise may span two content words (Chong and German 2015, 2017). As Figure 19.3 illustrates, the utterance-initial rise has a large pitch excursion, while subsequent rises have smaller ranges (Deterding 1994; Low and Brown 2005). The repeating rises likely correspond to units of phrasing, as they reflect the primary domain of tune assignment and serve a grouping function. Chong (2013) proposes an AM model based on the accentual phrase (AP). The AP is marked at left and right edges by phrasal L and H tones respectively, as well as by an optional L* pitch accent aligned with the primary stressed syllable of the content word. Chong's phrasing analysis is further supported by the fact that these rises are associated with extra lengthening of the word-final syllable (Chong and German 2017).
Chong's (2013) model also includes two further levels of phrasing: the ip and IP. The ip is marked at the right edge by H- (IP-medially) or L- (IP-finally) and accounts for pitch reset in adjacent AP's. The IP is marked at the right edge by L% for declaratives and by H% for polar interrogatives.
F0 peaks are aligned close to the end of the word in non-final AP's. Thus, if contrastive word-level prominence is present in SgE, the link to f0 differs from that in MEVs. SgE lacks marked f0 changes between stressed and unstressed syllables, as well as any step-down in f0 from stressed to post-stress syllables within a word (Low and Grabe 1999). The proposed stress-sensitivity of L* was not corroborated by Chong and German (2015), who found no effect of stress on the alignment of low or high targets in initial AP's. Instead, the contour shape was similar across stress patterns, while words with initial stress were produced with a globally higher f0. The alignment of f0 peaks in IP non-initial positions needs systematic investigation.
Wee (2008) and Ng (2009, 2011) propose that three level tones (L, M, H) are assigned to syllables at the phonological word level. H is assigned to all final syllables, L to initial unstressed syllables, and M elsewhere, either as the default tone or through spreading. Some syllables may end up toneless, in which case interpolation applies. Further quantitative studies are needed to establish which aspects of SgE contours are better accounted for by level tones assigned to syllables versus phrase-level tones.
Existing research concentrates on ethnically Chinese speakers and read speech (though see Ng 2009). Tan (2002, 2010), however, describes tune differences across three ethnic groups of Singapore (Chinese, Malay, and Indian), which reflect characteristics of the respective 'mother tongue' languages. More research is needed to establish which aspects of SgE intonation are shared across communities and registers.
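The Wee/Ng word-level account described above is compact enough to state as a rule function. The sketch below is our own rendering under stated assumptions: stress information is taken as given, and the spreading derivation of M is collapsed into a simple default.

```python
# A sketch of the word-level level-tone account of SgE (Wee 2008; Ng 2009,
# 2011) as summarized above: H on the final syllable, L on an unstressed
# initial syllable, M elsewhere by default.
def sge_word_tones(n_syllables, initial_stressed):
    tones = ["M"] * n_syllables            # M as the default tone
    if n_syllables > 1 and not initial_stressed:
        tones[0] = "L"                     # L on initial unstressed syllables
    tones[-1] = "H"                        # H on all word-final syllables
    return tones

print(sge_word_tones(3, initial_stressed=False))  # ['L', 'M', 'H']
print(sge_word_tones(2, initial_stressed=True))   # ['M', 'H']
```

Comparing the outputs of such a word-level account with the AP-based phrasal L...H rises of Chong (2013) is one way to frame the open question noted above.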
19.4.4 Indian English
Indian English (IE) is widely considered to be a fully nativized variety (e.g. Kachru 1983). Although there is a high degree of variation across speakers, IE intonation appears to conform to a single basic phonological organization similar to that of MEVs, while varietal differences concern mainly word-level stress, tonal inventories, or meaning.
Word-level stress in IE is largely rule-governed rather than lexically specified, though the specific rules differ by L1 background, region, and individual (Wiltshire and Moon 2003; Fuchs and Maxwell 2015; Pandey 2015). For instance, stress assignment is quantity-sensitive for both Hindi and Punjabi speakers (Sethi 1980; Pandey 1985) but quantity-insensitive for Tamil speakers (Vijaykrishnan 1978). This poses a challenge for researchers, since it is difficult to anticipate stress patterns in materials. Nevertheless, most varieties involve prominent pitch movements on stressed syllables, wherever the latter occur (Pickering and Wiltshire 2000; Wiltshire and Moon 2003; Puri 2013).
Féry et al. (2016) analyse the narrow-focus patterns produced by Hindi background speakers in terms of phrasal tones. However, studies by Maxwell (see Maxwell 2014; Maxwell and Fletcher 2014) involving Bengali and Kannada speakers showed that for rising contours (i) peak delay relative to syllable landmarks did not vary with the number of post-accentual syllables and (ii) peak delay and L-to-H rise time were correlated with syllable duration. Since these results indicate peak alignment with stressed syllables and not the right edge of the word, they disfavour a phrase tone (i.e. edge tone) analysis. Studies nevertheless show that IE has at least two levels of phrasing (ip and IP, as in MEVs), with differential final lengthening (Puri 2013; Maxwell 2014).
Based on detailed timing data, Maxwell (2014) and Maxwell and Fletcher (2014) characterize the alignment characteristics of rising pitch accents in IE. For all speakers, L is consistently anchored to the onset consonant of the stressed syllable. For Kannada speakers, H aligns to the end of the accented vowel if it is long and to the onset of the post-accented syllable if the accented vowel is short. For Bengali speakers, H aligns to the post-accentual vowel in nuclear focal accents and variably to the accentual or post-accentual vowel in prenuclear accents. These results suggest that Kannada speakers have a single rising accent, while Bengali speakers use both L+H* and L*+H for prenuclear accents. ToBI analysis of read and spontaneous speech shows evidence of other pitch accent categories, including H* and L*, as well as downstepped variants of pitch accents and phrase accents. A study by Wiltshire and Harnsberger (2006) similarly found broad differences between Gujarati speakers, who use mostly rising pitch accents (L+H* or L*+H), and Telugu speakers, who additionally produce falling accents (H+L*, H*+L, H*).
In general, focused words are realized with some combination of greater duration, amplitude, and pitch excursion on the accented syllable (Moon 2002; Wiltshire and Harnsberger 2006; Maxwell 2010; Maxwell 2014). Studies also report compression in the post-focal region (Féry et al. 2016) or alternation between deaccenting and post-focal compression without deaccenting (Maxwell 2014). Focus may also be marked by the insertion of a phrase boundary before or after the focused constituent (Maxwell 2010; Maxwell 2014; Féry et al. 2016).
Many regional and L1 varieties of IE remain understudied; thus, more research is needed to corroborate the impression that IE varieties involve a similar phonological organization for intonation. Additionally, detailed phonetic evidence is needed to clarify many existing claims.
19.4.5 South Pacific Englishes (Niue, Fiji, and Norfolk Island)
The islands of the South Pacific are home to a range of contact varieties of English. Because of patterns of economic migration, many of these varieties have greater numbers of speakers outside the islands (e.g. Niuean English, which has more speakers in New Zealand than on Niue). In addition, the movement of indentured labourers from India to Fiji in the late nineteenth and early twentieth centuries has resulted in an Indo-Fijian English variety alongside Fijian English.
Starting with Fiji, since English is a second or third language for nearly all Fiji Islanders, there is L1 transfer in suprasegmental as well as segmental features (Tent and Mugler 2008). Stress patterns tend to be quite different from Standard English (i.e. in this case SSBE), such as [ˈkɒnˌsidɐret] for considerate or [ˌɛˈmikabɐl] for amicable. There is also a tendency for the nuclear pitch accent to occur on the verb, even in unmarked sentences, such as I am STAYing in Samabula. The most marked intonational feature, however, is an overall higher pitch level than in MEVs, especially a pattern on yes/no questions that starts high and ends with a rapid rise and sudden fall in pitch. This pattern (see Figure 19.4) is much closer to Fijian than to Standard English, and sometimes results in misunderstandings, where Standard English listeners have the impression that the speaker expects a positive response.

Figure 19.4 Intonation patterns for yes/no questions in Fijian Fiji English ('(Are) you ready to go?') and Standard English ('Are you ready to go?'). (Tent and Mugler 2008: 249)

Durational differences between stressed and unstressed syllables are weaker in Niuean English than in MEVs (Starks et al. 2007). This variety also exhibits the high rising terminal or uptalk intonation pattern found in NZE.
The status of Norfolk Island English ('Norfuk') as either a variety of English or a creole is uncertain (Ingram and Mühlhäusler 2008). Its intonation is characterized as having a wide pitch range with much (often exaggerated) variation in pitch and tempo.
19.4.6 East African Englishes (Kenya and Uganda)
The intonation of East African Englishes is understudied. Otundo's (2017a, 2017b) important groundwork reveals considerable differences in the English spoken by different L1 groups in Kenya, which she refers to as ethnically marked varieties. For declarative questions, Nandi speakers produce a rising pitch accent followed by a boundary rise (L*+H H%) whereas Bukusu speakers produce a high pitch accent followed by a fall (H* L%). Statements predominantly have either a rise-fall (L*+H L%) or a fall (H* L%) in both groups, depending on the presence or absence of prenuclear material, respectively. wh-questions have a nuclear pitch accent on the wh-word, unlike in MEVs, but they may have a terminal rise, in which case the nuclear accent is on a later word in the phrase.
Nassenstein (2016), in an extensive overview of Ugandan English, gives no detail on intonation but does identify an interesting use of an intonational rise accompanied by additional vowel lengthening to mark assertive focus, as in (2).
(2) And he went ↑far up to there.
    'And he went far (further than I had imagined).' (Nassenstein 2016: 400)
Both Kenyan English and Ugandan English show less durational reduction of unstressed syllables compared to most MEVs (Schmied 2006).
19.4.7 Caribbean English
This section must begin with a caveat: descriptions of Caribbean English do not always clarify whether they are describing Caribbean-accented varieties of English or English-based Caribbean creoles. A notable exception is Devonish and Harry (2008), who explicitly distinguish Jamaican Creole and Jamaican English, identifying the latter as an L2 acquired through formal education by speakers of Jamaican Creole. The two languages coexist, therefore, in a diglossic situation. The prosodic feature that these authors highlight is the prominence patterns on disyllabic words. For example, Jamaican Creole distinguishes between the kinship and priest meanings of faada ('father') by having a high tone on the first syllable⁴ for the first meaning but on the second syllable for the second meaning, also treated as a lexical stress contrast /ˈfaada/ vs. /faaˈda/ (Cassidy and LePage 1967/1980). In Jamaican English, these tonal patterns are maintained on /faaðo/, but the lexical stress is on the first syllable in both cases.
⁴ In other analyses of Caribbean creoles (e.g. Sutcliffe 2003), it is suggested that founder speakers of the creoles might have reinterpreted English word stress along lines of the tonal distinctions of West African languages, and realized stressed syllables with a high tone.
Wells (1982: 572–574) highlights several other features of the prosody of West Indian English. These include a reduced durational distinction between stressed and unstressed syllables as compared to MEVs (see also Childs and Wolfram 2008). At the same time, however, pitch ranges tend to be wider, which compensates somewhat for the reduction in the stress contrast by increasing the difference between accented and unaccented syllables. Wells also observes a tendency to shift stress rightwards, particularly in emphasis, giving /kɪˈʧɪn/ for kitchen (noted also for Eastern Caribbean creoles by Aceto 2008), which he suggests might in fact be a rise-fall nucleus associated with the initial syllable, in autosegmental terms: L* H- L%. In this case, rather than stress being shifted, the second syllable may simply have pitch prominence from the H- phrase accent.
A further intonational characteristic highlighted for Bahamian English (Wells 1982; Childs and Wolfram 2008) as well as Trinidadian and Tobagonian Creole (Youssef and James 2008) is the use of high rising contours with affirmative sentences. These rises appear to be inviting a response from the listener.
19.4.8 Black South African English
Given that English in South Africa is less common (at 9.6%) as a first language (L1) than IsiZulu (22.7%) or IsiXhosa (16.0%) and is the L1 for just 2.9% of Black South Africans (Statistics South Africa 2011), it is unsurprising that Black South African English (BlSAfE) is heavily influenced by its speakers' L1s. Swerts and Zerbian (2010) compared intonation in the English of intermediate and advanced speakers who had Zulu as their L1. Both groups used rising and falling intonation patterns common to both Zulu and English to mark the difference between non-final and final phrases respectively, but only the advanced speakers patterned with the native L1 speakers of English in using intonation to mark focus (Zerbian 2015).
Coetzee and Wissing (2007) report that compared to WSAfE and Afrikaans English, BlSAfE (in this case Tswana English) has less of a distinction in duration between stressed and unstressed syllables, and furthermore does not show phrase-final lengthening. This supports similar general observations for BlSAfE by van Rooy (2004), who also notes that—again largely in line with the speakers' L1—stress assignment is on the penultimate syllable (e.g. sevénty) unless the last syllable is superheavy. This author also observes 'more frequent occurrence of pragmatic emphasis, leading to a different intonation structure of spoken BlSAfE' (p. 178) and notes that IP's tend to be shorter than in WSAfE (cf. Gennrich-de Lisle 1985).
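Van Rooy's stress generalization is simple enough to state as a one-line rule. The toy sketch below is our own illustration of it; how syllable weight ('superheavy') is determined is assumed to be given, and lexical or dialectal exceptions are ignored.

```python
# A toy sketch of the BlSAfE stress generalization summarized above:
# penultimate stress unless the final syllable is superheavy.
def blsafe_stress(n_syllables, final_superheavy):
    """Return the 0-indexed position of primary stress."""
    if final_superheavy or n_syllables == 1:
        return n_syllables - 1     # stress the final syllable
    return n_syllables - 2         # default: penultimate stress

print(blsafe_stress(3, final_superheavy=False))  # 1, e.g. sevénty
print(blsafe_stress(2, final_superheavy=True))   # 1, superheavy final syllable
```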
19.4.9 Maltese English
Maltese English (MaltE), alongside Maltese and Maltese Sign Language, is an official language of Malta, with the great majority of its population being bilingual to various degrees. MaltE does not reduce unstressed syllables to the same extent as MEVs and does not have syllabic sonorants in unstressed syllables (Grech and Vella 2018; see also chapter 16). As in a number of other contact varieties, MaltE also differs from MEVs in stress assignment in compounds, such as fire ˈengine and wedding ˈpresent, with stress on the final rather than initial element, except in cases where the final element is monosyllabic, such as ˈfireman (Vella 1995).
However, like Maltese, MaltE has regular pitch accents associated to lexically stressed syllables and tones associated with the right edge of IP's. In wh-questions and some other modalities (e.g. vocatives and imperatives), tones can also be associated to the left edge of constituents (Vella 2003, 2012). The right-hand phrasal edge tone, a phrase accent in the sense of Grice et al. (2000), is particularly important in the tonal phonology of MaltE and is often associated with a lexically stressed syllable (Vella 1995, 2003), leading to a percept of strong post-focal prominence (Galea Cavallazzi 2004).
19.5 Uptalk
A frequently discussed feature of English intonation is uptalk, sometimes referred to as 'high rising terminal' intonation (but see discussion in Warren 2016 and references therein). This use of rising intonation at the end of a declarative utterance should not be confused with UNB rises, from which it is phonetically and functionally distinct. The term 'antipodean rise' reflects its possible provenance in Australia and/or New Zealand, where it is becoming an unmarked feature. It is, however, found in many English varieties, including those spoken in the United States, Canada, South Africa, and the United Kingdom (see e.g. Armstrong and Vanrell 2016; Arvaniti and Atkins 2016; Moritz 2016; Prechtel and Clopper 2016; Warren 2016; Wilhelm 2016).
Because of phonetic similarity to rises on yes/no and echo questions, uptalk is frequently interpreted by non-uptalkers as signalling that the speaker is questioning the content of their own utterances and is therefore uncertain or insecure. However, the distribution of uptalk and its interpretation by uptalkers indicate that it is used as an interactional device, to ensure the listener's engagement in the conversation. (For further discussion of the meanings of uptalk see Tyler and Burdin 2016; Warren 2016.)
In AM terms, uptalk has been labelled as L* H-H%, L* L-H%, and H* L-H% for Canadian English (Di Gioacchino and Jessop 2011; Shokeir 2007, 2008); L* L-H%, L* H-H%, and H* H-H% for American English (Hirschberg and Ward 1995; McLemore 1991; Ritchart and Arvaniti 2013); L* H-H%, H* H-H%, and the longer sequence H* L* H-H% for AusE and NZE (Fletcher and Harrington 2001; Fletcher 2005; Fletcher et al. 2005; McGregor and Palethorpe 2008); and H* L-H% or H*+L H-H% for British English (Bradford 1997). This mixture of labels indicates that there is variation both within and between varieties in terms of the shape of the uptalk rise, with many labels including a falling component before the rise, so that the rise is from either a low pitch accent or a low phrase accent. Moreover, researchers have identified phonetic distinctions between uptalk and question rises, including the relative lateness of the rise onset in uptalk (in NZE: Warren 2005; for WSAfE: Dorrington 2010a) and a lower rise onset in uptalk (especially in AusE: Fletcher and Harrington 2001); see Figures 19.5 and 19.6 respectively. The differences between uptalk and question rises can be perceived and interpreted appropriately by native speakers of the varieties (Fletcher and Loakes 2010; Warren 2005, 2014).
Figure 19.5 Fall-rise uptalk contour ('It's probably a little bit away from whispering pine'), Australian English. (Warren 2016: fig. 4.1)
Figure 19.6 Late rise uptalk contour ('and yer gonna go right round the bowling alley'), New Zealand English. (Warren 2016: fig. 4.2)
19.6 Conclusion
Our survey has revealed a diverse set of intonational systems across varieties of English. For some—especially mainstream—varieties, diversity is limited to relatively minor features in inventories or in correspondences between contours and meanings. However, differences in overall phonological organization are also observed. There are, alongside stress-accent varieties, those such as HKE, NigE, and GhanE that involve level tones assigned by both lexical specification and spreading rules. In terms of phrasing, MEVs and most other varieties involve a lowest level of phrasing that is relatively large, while SgE patterns with languages such as French and Korean in having smaller units of phrasing (i.e. AP's) that generally span only one or two content words. At the other end of the spectrum, HKE, NigE, and GhanE include only one large unit of phrasing (i.e. IP's) that contributes to the tonal specification. In light of issues raised by MEVs concerning the linking of tones to either pitch accents or edge tones (see §19.3.1), an important issue for future research is whether the need for specific phrasing levels can be supported by evidence that is independent of the edge tones they are purported to carry (e.g. progressive final lengthening).
All the English varieties we have covered include word stress in some form or another. This can involve a manifestation of greater acoustic prominence or else merely designate syllables as prosodically privileged. In the latter case, privileged syllables are singled out to receive specific tonal values without concomitant acoustic or articulatory enhancement, and in most cases these tones are lexically specified. The high functional load of pitch accents in mainstream varieties has most probably led to the location of these pitch accents (lexical stress) being reinterpreted as tone.
Across the different varieties, there is considerable variation in the assignment of word stress, differing from MEVs for certain words (e.g. Fiji English, where secondary stress and primary stress are sometimes exchanged, and AAE, where there is a preference for initial stress) or word types (e.g. the compound rule is different in Maltese and Hispanic English). In Indian English, stress is rule based, with quantity sensitivity in some L1 groups (e.g. Hindi and Punjabi), but overall the rules are highly variable. In BlSAfE, stress also appears to be rule based, with penultimate stress in most cases. Caution is required, however: sometimes it might appear that word stress is shifted when in fact the pitch accent is low and is followed by an H- phrase accent (Caribbean English, SgE), or when constituents in a compound are treated as separate word domains (NigE, GhanE). A similar issue applies to sentence-level prominence (and, by extension, word citation forms), since in varieties such as NigE and HKE a lack of postnuclear deaccenting, combined with rightward spreading of H tones, can give the impression that prominence is 'shifted' to a later word or syllable than in MEVs. It is therefore essential that the interplay of stress, accent, and tonal assignment be worked out separately for each variety and independently of what might give the impression of prominence in another variety.
This diversity highlights the important role played by the choice of framework and analytical tools when approaching any variety, whether unstudied or well studied. It would, for example, be clearly problematic to apply the notion of a nuclear contour to HKE, NigE, or GhanE, since, in those varieties, the tune at the end of a phrase is fully determined by lexical word stress and the choice of boundary tone. Apart from the latter, in other words, there is no optionality that could result in meaningful contrasts (see Gussenhoven 2017b for further discussion). Additionally, we need look no further than MEVs to recognize that aspects of the prenuclear contour can be informative in terms of both structure and meaning.
Thus, applying AM or British School categories developed for a well-studied variety to a variety whose phonological organization and intonational inventory have not yet been established is highly problematic, since it strongly biases the researcher towards an analysis in terms of pitch accents (or even nuclear pitch accents), thereby missing other typological possibilities. Moreover, even if the variety in question does pattern with MEVs in having regular pitch accents, the use of a pre-existing inventory runs the risk of (i) treating non-contrastive differences as contrastive and (ii) lumping existing contrasts into a single category. This issue is underscored by the numerous studies on UNB and American English varieties (Arvaniti and Garding 2007; Clopper and Smiljanić 2011) as well as IE (Maxwell 2014; Maxwell and Fletcher 2014). In these cases, detailed phonetic studies have revealed inventory differences across varieties that would have been missed had the authors been restricted to any of the inventories available for MEVs.
Besides the differences outlined above, our survey also revealed certain regularities. For example, the majority of non-MEV varieties either lack post-focal deaccenting or use it with less regularity. These include HKE, SgE, NigE, GhanE, BlSAfE, IE, MaltE, and AAE. This broad tendency suggests that post-focal deaccenting as a feature tends to resist transfer in contact situations, either because it is not compatible with certain types of phonological organization or because it is not sufficiently salient to L2 speakers during nativization. It is also interesting to note the striking similarity in the intonation systems of HKE and West African varieties, which resulted from contact with unrelated tone languages. As Gussenhoven (2017b: 22) notes, this can be 'explained by the way the pitch patterns of [British English] words were translated into tonal representations'. While some accounts suggest that a similar generalization applies to SgE, that variety has received substantial influence from intonation languages including Malay, Indic, and Dravidian languages, and even IE (Gupta 1998), which could explain why it patterns less closely with the above 'tonal' varieties.
In this chapter we have only been able to report on a selection of contact varieties of English. The typological diversity we have observed will no doubt be enriched once we take up the challenge of analysing newly emerging (Expanding Circle) varieties, such as those spoken in China and Korea.
Chapter 20
The North Atlantic and the Arctic
Kristján Árnason, Anja Arnhold, Ailbhe Ní Chasaide, Nicole Dehé, Amelie Dorn, and Osahito Miyaoka
20.1 Introduction

The languages described in this chapter belong to three families, Celtic, North Germanic, and Eskimo-Aleut. The grouping into one chapter is purely geographical and the different groups have very different histories and little in common structurally. There is also little evidence of Sprachbund or contact phenomena between these groups, although pre-aspiration and some tonal features have been noted as common between some Scandinavian varieties and Celtic.
20.2 Celtic

Irish and Scottish Gaelic are the indigenous languages of Ireland and Scotland, belonging to the Goidelic branch of Celtic together with Manx, while the Brittonic branch of Celtic comprises Breton, Welsh, and Cornish. Irish is, with English, one of the two official languages in the Republic of Ireland. Today, it is mostly spoken as a community language in pockets (Gaeltachtaí) that mainly stretch along the western Irish coast. There is no spoken standard, but there is standardization of written forms. The three main Irish dialects, Ulster, Connaught, and Munster Irish, differ at the phonological level, both segmental and prosodic, besides differences at the morphological, lexical, and syntactic levels. The strongest Scottish Gaelic-speaking areas today are also concentrated along the northwestern coastal areas and islands of Scotland. Welsh-speaking areas, on the other hand, extend across wider parts of Wales, with higher numbers of speakers particularly in the north and west. Breton is spoken in very limited areas, mainly in the western half of Brittany, France.
20.2.1 Irish and Scottish Gaelic

Irish is a verb-subject-object language with reduced vowels in unstressed syllables, a contrast between long and short vowels, and frequent consonant clusters, features that have been associated with ‘stress-timed’ languages (cf. Ó Siadhail 1999). Primary stress generally falls on the first syllable of words and, with few exceptions, this is also true for Scottish Gaelic (cf. Clement 1984; Bosch 2010). In Munster Irish, primary stress shifts to the second or third syllable of disyllabic and trisyllabic words if this is the first heavy syllable—that is, containing a long vowel or diphthong (cf. Blankenhorn 1981; Ó Sé 1989, 2019; Ó Siadhail 1999) or a rhyme /ax/, as in /baˈkax/ ‘lame’. Stress in Welsh (cf. Williams 1983; Hannahs 2013) and Breton (cf. Jackson 1967; Ternes 1992), on the other hand, is traditionally placed on the penultimate syllable in polysyllabic words, with exceptions being confined to loanwords or dialectal variation.

Syllable structure in the Irish dialects, described as similar to English, is complex (Carnie 1994; Green 1997; Ní Chiosáin 1999) and open to different interpretations by linguists (cf. de Bhaldraithe 1945/1966; de Búrca 1958/1970; Mhac an Fhailigh 1968; Ní Chiosáin et al. 2012). For Scottish Gaelic, a close link between syllabicity and pitch has been suggested (cf. Borgstrøm 1940; Oftedal 1956; Ternes 2006). Studies on Scottish Gaelic intonation have commented on its use of lexical tone or ‘word accent’ (Borgstrøm 1940; MacAulay 1979), a feature not observed in Irish and sometimes posited as a Viking influence. Phonetic studies have also addressed the different realizations of lexical tone (cf. Bosch and de Jong 1997; Ladefoged et al. 1998). In Welsh, on the other hand, stress and pitch prominence have been noted to occur independently of one another (cf. Thomas 1967; Oftedal 1969; Williams 1983, 1985; Bosch 1996).

There is a long tradition of segmental description of Irish dialects (e.g. Quiggin 1906; Sommerfelt 1922; de Bhaldraithe 1945/1966; Breatnach 1947; de Búrca 1958/1970; Ó Cuív 1968; Ní Chasaide 1999; Ó Sé 2019). The most striking feature at the phonological level (which Irish shares with Scottish Gaelic) is the systematic contrast of palatalized and velarized consonants (the term ‘velarization’ covers here secondary articulations in the velar, uvular, or upper pharyngeal regions). The palatalization or velarization of the consonant tends to give rise to diphthongal on- or offglides to or from an adjacent heterorganic vowel (i.e. when a palatalized consonant occurs with a back vowel or a velarized consonant with a front vowel). Thus, for example, the phoneme /i:/ may be realized as [i:] in bí /bʲiː/ ‘be’; [iˠi] in buí /bˠiː/ ‘yellow’; [iiˠ] in aol /i:lˠ/ ‘limestone’; [ˠiiˠ] in baol /bˠi:lˠ/ ‘danger’. This phonological contrast of palatalized–velarized pairs of consonants plays an important role in signalling grammatical information—for example, in marking the genitive case in óil /oːlʲ/ as compared to the nominative case ól /oːlˠ/ ‘drink’. In Scottish Gaelic, the realization of the stop voicing contrast entails pre-aspiration of the voiceless series (and post-aspiration in pre-vocalic contexts). There is little or no voicing of the voiced series. The extent of pre-aspiration varies across dialects, and similar realizations are attested for Irish (Shuken 1980; Ní Chasaide 1985).
20.2.2 Intonation

Until recently, relatively little was known about Irish intonation. Broad-ranging instrumental analyses of the northern and southern Irish dialects have been carried out in the Prosody
of Irish Dialects project (Dalton and Ní Chasaide 2003, 2005a, 2005b, 2007a, 2007b; Ní Chasaide 2003–2006; Ní Chasaide and Dalton 2006; Dalton 2008) and beyond (Dorn 2014; O’Reilly 2014). These studies reveal a major divide between the northern dialects of Donegal on the one hand and the southern dialects of Connemara, Aran Islands, Kerry, and Mayo on the other. The southern dialects have falling tonal contours for both declaratives and questions. The northern dialects are atypical in that the default tune in neutral declaratives is a rising contour, while this is also the contour used in wh- and polar questions (cf. Dalton and Ní Chasaide 2003; Dalton 2008; Dorn et al. 2011). Further accounts of Connemara Irish are found in Blankenhorn (1982) and Bondaruk (2004).

Irish has identifiable pitch accents, both monotonal and bitonal, and boundary tones. On the segmental level, the main tonal target typically aligns with the lexically stressed syllable. As mentioned above, a north–south difference emerges. For the northern varieties (of Donegal), neutral declaratives are typically characterized by sequences of (L*+H L*+H L*+H 0%), whereas for the southern dialects investigated so far (Connemara, Kerry, and Mayo), they most typically consist of a sequence of falls in all positions in the IP (H*+L H*+L H*+L 0%). Boundary tones can be high (H%), low (L%), or level (%, aka 0%, Ø). (For more detailed descriptions, see Dalton and Ní Chasaide 2003; Dalton 2008; Dorn 2014; O’Reilly 2014.)

For Scottish Gaelic, Welsh, and Breton, different accounts of intonation exist. MacAulay (1979) describes Scottish Gaelic intonation (Bernera dialect, Lewis) generally with phrase-final falls and phrase-final rises in questions, but without considering the interaction of lexical and intonational tones, something that was studied more recently by Nance (2013), who also looked at features of language change in young Gaelic speakers (Nance 2013, 2015). Breton intonation is described by Ternes (1992) with rising sentence questions and falling affirmative sentences, and several studies have addressed Welsh intonation using different transcription methods (cf. Thomas 1967; Pilch 1975; Rees 1977; Rhys 1984; Evans 1997). Detailed instrumental analyses of Welsh intonation (Anglesey variety) were carried out by Cooper (2015), who addressed both phonological and phonetic issues, concentrating on the alignment and scaling of tonal targets and on the intonational encoding of interrogativity.

In Irish, in addition to syntactic marking, interrogativity is marked prosodically. Although the tonal sequence for questions is the same as for the neutral declaratives, the former entail phonetic adjustments—more specifically to the relative pitch, regardless of tune, in the initial or final accent of the utterance. In Donegal Irish, the prenuclear accent in wh-questions is boosted by raising the pitch accent targets relative to statements and polar questions. For polar questions, the principal marker involves a similar boosting of the nuclear pitch accent compared to statements. Thus, wh-questions can overall be characterized by downdrift (i.e. a falling f0 slope), while polar questions more typically show a tendency towards upsweep (i.e. a rising f0 slope) compared to statements. In South Connaught Irish (Cois Fharraige and Inis Mór), polar questions typically have a higher pitch peak in initial prenuclear accents relative to statements as well as a raised pitch level (cf. Dorn et al.
2011; O’Reilly and Ní Chasaide 2015). Cross-dialect differences in peak alignment are found in the Irish dialects (cf. Dalton and Ní Chasaide 2005a, 2005b, 2007a; Ní Chasaide and Dalton 2006; O’Reilly and Ní Chasaide 2012). Although Donegal (Gaoth Dobhair) Irish and the Connaught dialect of Cois Fharraige differ greatly in terms of their tunes, both dialects tend towards fixed alignment in that for both prenuclear and nuclear accents, the tonal target tends to be anchored to a particular point in the accented syllable. However, a comparison of peak timing in two
southern varieties of Connaught Irish (Cois Fharraige and Inis Oírr) reveals striking differences. As mentioned, Cois Fharraige has fixed alignment, but Inis Oírr exhibits variable peak alignment: the peak of the first prenuclear accent is timed earlier if the anacrusis (the number of unaccented syllables preceding the initial prenuclear accent) is longer, while the peak of the nuclear accent is later when the tail (the postnuclear stretch) is longer. These alignment differences are all the more striking as they occur within what is considered to be a single dialect (Dalton and Ní Chasaide 2005a, 2007a).

Focus in Irish can be realized by means of prosodic prominence as well as by syntactic clefting and emphatic versions of, for example, pronouns (the emphatic version of the pronoun ‘me’ mise [mʲɪʃə] vs. the non-emphatic version mé [mʲe:], typically reduced to [mʲə]). Prosodic prominence involves highlighting the focal element by raising the f0 peak and extending the duration of the accented syllable as well as deaccentuating post-focal material, and possibly reducing the scaling of pre-focal accents (cf. O’Reilly et al. 2010; O’Reilly and Ní Chasaide 2016). The same pitch accent type is used across semantically different focus types (narrow, contrastive) in Donegal Irish (cf. Dorn and Ní Chasaide 2011).
20.3 Insular Scandinavian

Icelandic and Faroese form the western branch of the Nordic family of languages, sometimes referred to as Insular Scandinavian as opposed to the Continental group, Norwegian, Danish, and Swedish. They inherit initial word stress from Old Norse and have a similar quantity structure with stress-to-weight and a distinction between long (open-syllable) and short (closed-syllable) vowels as a result of a quantity shift that has occurred in both languages. Stress may be realized by tonal accents and lengthening of long vowels, and in closed syllables by post-vocalic consonants, producing for Icelandic so-called half length (Árnason 2011: 149–151, 189–195). An important difference is that Faroese has a set of segmentally restricted stressless syllables, which draw on a reduced vowel system, whereas in Icelandic all syllables are phonotactically equal. The modern languages do not have a tonal distinction in the manner of Swedish and Norwegian, although older Icelandic shows rudimental signs of such distinctions (Árnason and Þorgeirsson 2017). Both languages have post-vocalic pre-aspiration on fortis stops, correlating with stress (Árnason 2011: 216–233), but there are differences in realization and distribution (Hansson 2003), Icelandic pre-aspiration being more ‘segment like’, as if receiving half length after short vowels and closing the syllable as a coda.
20.3.1 Stress in words and phrases

Icelandic has regular initial stress with a rhythmically motivated secondary stress on odd-numbered syllables: forusta [ˈfɔːrʏsˌta] ‘leadership’. Morphological structure in compounds and derived words can affect the rhythmic stress contour, as stress on second components may be retained as secondary stress, disregarding alternating stress: höfðingja#vald [ˈhœvðiɲcaˌvalt] ‘aristocracy’; literally: ‘chieftains-gen#power’. However, adjacent stresses tend to be avoided, thus shifting the morphologically predicted secondary stress to the right
in words such as borð#plata [ˈpɔrðplaˌta] ‘table top’; literally: ‘table plate’. Normally, a prefix takes initial stress in Icelandic, but there are principled exceptions: Hann er hálf-leiðinlegur [haulvˈleiːðɪnlɛɣʏr̥] ‘He is rather boring’ (literally: ‘half-boring’) (Árnason 1987). The unstressed prefixes have a special modal function, and, although morphologically bound, they do not form phonological words with their anchors. Some loanwords and foreign names show arhythmic word-stress patterns: karbórator [ˈkʰarˑpouraˌtʰɔˑr̥] ‘carburettor’, suggesting karbóra- and -tor as separate pseudo-morphs; the plosive in -tor is aspirated in southern Icelandic, indicating that it is foot initial.

A majority of native Faroese words have their main stress on the first syllable: tómur [ˈtʰɔuːmʊɹ] ‘empty’, hestarnir [ˈhɛstanɩɹ] ‘the horses’, onga#staðni [ˈɔŋkanˌstɛanɩ] ‘nowhere (literally: no place)’ (Lockwood 1955/1977: 8). Secondary stress can be placed on the fourth syllable of a compound, as in tosingarlag [ˈtɔːsiŋkarˌlɛa] ‘mode of speaking’. Adjacent stresses occur, as in ˈtil#ˌbiðja ‘worship’ (literally: ‘to pray’). According to Dehé and Wetterlin (2013), the most prominent phonetic parameters related to Faroese secondary stress are vowel duration and voice onset time. Faroese prefixes such as aftur- ‘again’ commonly take initial stress, as in ˈafturtøka ‘repetition’, but words of three or more syllables, which take prefixes such as ó- ‘un’ or ser- ‘apart’, regularly take stress on the second morphological constituent: ser#stakliga(ni) [sɛɹˈstɛaːklijanɩ] ‘especially’. Compound prepositions and adverbs commonly have stress on the second part: afturum [aʰtəˈɹʊmː] ‘behind (literally: after about)’. Corresponding prepositions take initial stress in most varieties of Icelandic: ˈaftanvið ‘behind’. The stress pattern of many native Faroese compounds also seems to vacillate: burðar#vektir ‘birth weights (of infants)’, which in a careful style can have the main stress either on the first or the second component: [ˈpuɹaˌvɛktɩɹ] or [ˌpuɹaˈvɛktɩɹ]. Forms such as ítrótt [ˈʊiːtrɔʰt] ‘sports’, bláloft [ˈplɔɑːlɔft] ‘(blue) sky’ have two nonrestricted syllables, of which the first takes the word stress and the second is weak accordingly. Rhythmic effects seem to be common in Faroese, so that words such as ˈvið#víkjˌandi ‘concerning’ (literally: ‘to#applying’), benkinun [ˈpɔiɲʧɩˌnʊn] ‘the bench’ show alternation. Restricted syllables such as the last one in benkinun, which take the rhythmic type of secondary stress, have been classified as leves, whereas the fully weak or reduced ones are levissimi (Hagström 1967). Forms such as tíðliga [tʰʊiʎ.ja] ‘early’, where the vowel of the second syllable remains unpronounced, show that alternating rhythm is a traditional feature of the phonological system.

Many Faroese loanwords have non-initial stress: signal [sɩkˈnaːl] ‘signal’, radiatorur [ˌɹaˑtiaˈtʰoːɹʊɹ] / [ˌɹaˑtɩˈaːtoɹʊɹ] ‘radiator’. Icelandic also shows non-initial stress in loans: Þetta er dálítið extreme [ɛxsˈtriːm] ‘This is a bit extreme’, experimental notkun [ˌɛxspɛrɪˈmɛn̥talˈnɔtkʏn] ‘experimental use’. The final stress of extreme is likely to be moved to the first syllable in cases such as Þetta er extreme dæmi [ˈexstrimˈtaimɪ] ‘This is an extreme example’, avoiding a stress clash.
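The default rhythmic pattern introduced at the start of this section—primary stress on the first syllable and secondary stress on subsequent odd-numbered syllables—is simple enough to state procedurally. The following Python sketch is our own illustration, not an analysis from the chapter; it models only the default pattern and deliberately ignores the compound, prefix, and loanword complications just discussed.

```python
# Illustrative sketch (not from the chapter): default Icelandic rhythmic
# stress, i.e. primary stress on syllable 1 and secondary stress on later
# odd-numbered syllables, as in forusta. Compound structure, unstressed
# prefixes, and arhythmic loanwords are intentionally not modelled.

def default_icelandic_stress(syllables):
    """Return (stress, syllable) pairs; stress is 'primary', 'secondary', or None."""
    marked = []
    for i, syl in enumerate(syllables, start=1):
        if i == 1:
            marked.append(("primary", syl))
        elif i % 2 == 1:            # 3rd, 5th, ... syllables
            marked.append(("secondary", syl))
        else:
            marked.append((None, syl))
    return marked

print(default_icelandic_stress(["fo", "rus", "ta"]))
# [('primary', 'fo'), (None, 'rus'), ('secondary', 'ta')]
```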
In ‘emphatic re-phrasing’ (Árnason 2009: 289), words may be uttered as phrases, placing the accent on non-initial syllables, as in Hann er hrika-LEGUR [ˌr̥ɪˑkaˈleɛːɣʏr̥] ‘He is really terrible’ (Árnason 2011: 151). The normal pattern of phrasal accentuation in both languages is right strong: Nanna borðar HAFRAGRAUT [ˌnanˑaˌpɔrðarˈhavraˌkrœyˑt] ‘Nanna eats oatmeal’, Dávur spælar FÓTBOLT [ˌtɔaːvʊɹˌspɛalaɹˈfɔuː(t)pɔl̥t] ‘David plays football’, but two types of Icelandic exceptions to the unmarked pattern have been noted, one systematic and the other pragmatic (Árnason 2009). Some word classes, like verbs, are ‘stronger’ than others, like nouns, and may for that reason reject the phrasal accent in wide focus, as in Ég sá JÓN koma
(noun > verb) ‘I saw JOHN coming’. A phrasing with the accent on the verb koma would place the focus on the verb. A strength hierarchy of noun > verb > preposition > personal pronoun has been proposed (cf. Árnason 1994–1995, 1998, 2005: 446–447). Definite noun phrases are normally left-strong under broad focus: Ég gaf Jóni [GAMLA hestinn] ‘I gave John the old horse’, Ég gaf Jóni [gamlan HEST] ‘I gave John an old horse’, although semantically heavy nouns can retain their accent, as in Þarna er gamla PÓSTHÚSIÐ ‘There is the old post-office’ (Árnason 2005: 453–454; Ladd 2008b: 242). Faroese forms such as til tað [ˈtʰɪlta] ‘to that’, hjá honum [ˈʧɔːnʊn] ‘with him’, hjá mær [ˈʧɔmːɛaɹ] ‘with me’, where the pronouns tað ‘it, that’, honum ‘him’, and mær ‘me’ have been cliticized so as to form phonological words with the prepositions, must have their origin in phrases in which prepositions were stronger than pronouns. By contrast, compound prepositions have their main stress on the prepositional stem: afturum [aʰtəˈɹʊmː] ‘behind (literally: after about)’. As in Icelandic, ‘re-phrasing’ may turn parts of words into phrases. Faroese compounds such as Skálafjörður ‘a place name (literally: hall-gen#fjord)’ can be so split up, and individual syllables, as in í-TRÓTT-sögu [ʊiːˈtrɔʰtˌsœɵ] ‘sports history’, can also take contrastive stress for emphasis or as an instance of clear style of utterance (Árnason 2011: 292).

Segmental processes may serve as cues to phrasing. Glottal onsets commonly occur before vowel-initial stressed syllables in both languages: Jón kemur ALDREI [jouɲcɛmʏrˈɁaltrei] (Icelandic) ‘John NEVER comes (John comes never)’, Okkurt um árið 1908 [ˈʔɔoʰkʊɹ̥tumˈʔɔɑːɹəˈnʊiːʧɔntrʊˈʔɔʰta] (Faroese) ‘Around the year 1908’. Final devoicing is a clear signal of pause or the end of an utterance in Icelandic: Jón fór [fouːr̥] ‘John went’ (Helgason 1993; Dehé 2014). Dehé’s (2014) results show that devoicing is obligatory at the ends of utterances and optional within utterances, with its frequency of occurrence reflecting the rank of the prosodic boundary. A common phenomenon in Faroese, often co-occurring with final devoicing, is the deletion or truncation of vowels in utterance-final syllables: eftir ‘after’ [ɛʰtɹ̥] (instead of, say, [ɛʰtɩɹ] or [ɛʰtəɹ]); veit ikki ‘don’t know’ [vaiːʧʰ], instead of [vaiːʧɩ] or [vaiːtɩʧɩ]. In Icelandic, phrasal cohesion is shown by final vowel deletion in stronger constituents before weaker ones beginning in a vowel: Nonn(i) ætlar að far(a) á fund ‘Jonny is going to a meeting’ (Dehé 2008; Árnason 2011: 295, 299).
20.3.2 Intonation

The intonation of Icelandic has been analysed in terms of prenuclear and nuclear accents and boundary tones, as well as typical nuclear contours (Árnason 1998, 2005, 2011; Dehé 2009, 2010, 2018). Boundary tones are low (L%) or high (H%), with L% marking finality and H% indicating some sort of non-finality or special connotation. Dehé (2009, 2010) suggests that there are two phrase accents (L- and H-). Pitch accents are monotonal high (H*) or low (L*) or bitonal (rising L*+H: late rise; L+H*: early rise) or falling (H*+L). Based on tonal alignment data, however, it has been debated whether all bitonal pitch accent types are rising and the perceived fall at the end of an accented phrase is a low edge tone rather than due to a trailing tone of the pitch accent (Dehé 2010). A typical nuclear contour in neutral declaratives has one or more rising (L*+H) prenuclear accents and a (monotonal or bitonal) H* nuclear accent, terminated by L%.

The default melody in all Icelandic utterance types is a fall to L%, with a downtrend within the intonational phrase. This includes polar questions (Árnason 2005, 2011; Dehé
2018; Dehé and Braun 2020) and other-initiated repair initials such as Ha ‘huh’ and Hvað segirðu ‘What did you say’ (Dehé 2015), which are typically rising in related languages. According to Árnason (2011: 323), questions with rising intonation ‘have special connotations’ (e.g. impatience), while questions with falling intonation are neutral. According to Árnason (2011: 322–323), an intonational difference between Icelandic statements and polar questions could be the type of nuclear accent, an early rise (L+)H* in statements versus a late rise L*+H in polar questions, while wh-questions typically have H* (Árnason 2005: 467–477); rhetorical polar and wh-questions have mostly L+H* nuclear accents (Dehé and Braun 2020). Other intonational differences may lie in distinctions in the overall downtrend (Dehé 2009), the prenuclear region, and the overall pitch level, but all of this is subject to current and future research.

Systematic study of Faroese intonation has been even more limited than that of Icelandic, but scholars have observed more rising patterns (H%) than in Icelandic, especially in polar questions. There are also some anecdotal descriptions of intonational peculiarities of Faroese varieties in places such as Suðuroy and Vágar (see Árnason 2011: 324–326).

In both Icelandic and Faroese, focus is marked by pitch accents, especially high tonal targets, which are then exempted from the overall downtrend (for Icelandic see Árnason 1998, 2009; Dehé 2006, 2010). Focus may also soften the strength of the prosodic boundary at the right edge of the focused constituent (Dehé 2008), enhancing the application of final vowel deletion. The likelihood of the deletion of the final vowel on Jónína in Jónína ætlar að baka köku ‘Jónína is going to bake a cake’ increases when Jónína is focused; the subject ends up forming one constituent with the verb, allowing final vowel deletion to apply within that constituent. An interesting aspect of Icelandic intonation is that given information may resist deaccentuation (Nolan and Jónsdóttir 2001; see also Dehé 2009: 19–21).
20.4 Eskimo-Aleut

The Eskimo-Aleut language family consists of two branches, Eskimo (Inuit and Yupik) and Aleut. All languages in the family share central prosodic characteristics (e.g. phonemic vowel length) but differ in others (e.g. stress). The now extinct Sirenikski—originally classified as a Yupik language but now viewed as a separate branch of Eskimo (Krauss 1985a; Vakhtin 1991, 1998)—deviated by having an alternating stress pattern but no clear vowel length contrast, as well as showing vowel reduction in non-initial syllables (Krauss 1985c).
20.4.1 Inuit

Within the Inuit dialect continuum, stretching from Alaska to Eastern Greenland, Kalaallisut (West Greenlandic), the official language of Greenland with about 45,000 speakers (Fortescue 2004), has received the most attention. While both vowel and consonant length are phonemically contrastive in Inuit (for phonetic studies, see Mase and Rischel 1971; Massenet 1986; Nagano-Madsen 1988, 1992), there is no lexical specification of prominence or tone. Rischel (1974) was the first to suggest that the notion of stress is not
useful in analysing Kalaallisut, pointing out that native speakers as well as trained phoneticians could not agree on stress patterns. Fortescue (1984) agrees that there is neither contrastive nor demarcative stress, although the impression of prominence can arise when intonational pitch movements coincide with a heavy syllable. An experimental phonetic study by Jacobsen (2000) confirmed that duration and pitch are not correlated and that there is thus no evidence for stress in Kalaallisut. On the basis of auditory analysis, Fortescue (1983) states that stress is generally not part of the Inuit prosodic system. Massenet’s (1980) acoustic study supports this conclusion for speakers from Inukjuak, Quebec, living in Qausuittuq, Nunavut. Similarly, Pigott (2012) analysed f0, duration, and intensity and found no acoustic evidence for stress in Labrador Inuttut, nor did Arnhold et al. (in press) for South Baffin Island Inuktitut. A possible exception is Seward Peninsula Inupiaq, which according to Kaplan (1985, 2009) through contact with Yupik has adopted a system of consonant gradation in alternating syllables. However, Kaplan states that consonant gradation, in contrast to Yupik, is independent of stress, which appears on all non-final closed syllables and on all long vowels (for discussions of similar adjustments to syllable structure in other Inuit varieties, see Rischel 1974; Rose et al. 2012; Arnhold et al., in press).

Instead of pitch accents, tonal targets seem to be associated with prosodic domains in the varieties that have been studied. Research into prosodic domains and phrasing is still somewhat preliminary, and no more than two tonally marked domains have been suggested, although further levels may be relevant for pitch scaling; for example, Nagano-Madsen and Bredvad-Jensen’s (1995) study of phrasing and reset in Kalaallisut text reading showed some correspondence between syntactic and prosodic units, but also considerable variation between speakers.

As Inuit is polysynthetic, many words are complex and correspond to whole phrases when translated into languages such as English. Kalaallisut words, when uttered in isolation, consistently bear a final rising-falling-rising tonal movement (i.e. HLH tones associated with the last three vowel morae) (Mase 1973; Rischel 1974). In declarative utterances consisting of more than one word, words can have a final HLH, HL, or LH contour, as well as flat pitch, though HL and HLH are by far most frequent (Rischel 1974; Fortescue 1984; Nagano-Madsen 1993; Arnhold 2014a). Based on Rischel’s (1974) distinction between phrase-internal and phrase-final contours, Nagano-Madsen (1993) suggested a decomposition of the HLH melody into HL, which is associated with the word level, and the final H, associated with the phrase level. However, Arnhold (2007, 2014a) analysed all three tonal targets as associated with the word level to account for the high frequency of HLH realizations that did not coincide with phrase boundary markers, such as pauses and following f0 resets, while such markers may occur after words with HL contours. Both accounts agree that the final H tone in a larger unit, identified as the intonational phrase here, is often lowered, and words in utterance-final position are frequently reduced in intensity, devoiced, or ‘clipped’ by omitting the final segments (see also Rischel 1974; Fortescue 1984) (cf. Aleut in §20.4.3, and final devoicing and truncation in Insular Scandinavian in §20.3.1).
Arnhold (2014a) additionally proposes a L% boundary tone associated with the intonational phrase to account for the marking of sentence type. Whereas imperatives, exclamatives, and wh-questions end in high, though sometimes lowered, pitch like declaratives, polar interrogatives consistently end in low pitch (Rischel 1974; Fortescue 1984; Arnhold 2014a). This is true of the central dialect spoken in the Greenlandic capital Nuuk and south of it. In northern West Greenlandic, the same HLH contour as in declaratives appears in polar questions, but the last vowel of the word is lengthened so that the tones,
which are still associated to the last three morae, are ‘shifted’ one mora to the right from where they are in a declarative (Rischel 1974; Fortescue 1984).

Fortescue (1983) describes three major differences in the intonation of statements and polar questions across Inuit varieties. First, while the mora is the tone-bearing unit in eastern and western Greenland, as well as Labrador and Arctic Quebec, he observed syllable-based patterns for the rest of Canada, Alaska, and northern Greenland (on North Greenlandic, see also Jacobsen 1991). Second, he finds declaratives to have a fall on the final, the penultimate, or the antepenultimate syllable or mora (followed by a rise in some varieties, such as Kalaallisut). Third, several eastern and western varieties have falling pitch in interrogatives as in central Kalaallisut, while others have a final pitch rise, with or without lengthening of the last syllable.

Acoustic studies of intonation have been conducted for three varieties other than Kalaallisut. For Itivimuit,¹ Massenet (1980) describes final f0 falls with a H tone on the penultimate vocalic mora in statements, a H tone on the last vocalic mora in exclamatives, a H tone on the antepenultimate vocalic mora in questions, followed by one of three contour shapes, a simple fall or a fall-rise with either a doubled or a tripled last vocalic mora. For Labrador Inuttut, Pigott (2012) describes similar patterns, though with a less clear distinction between questions and statements. In addition to frequent phrase-final lengthening, he also found aspiration of final coda plosives. On non-final words, he observed final HL contours. For their corpus of South Baffin Inuktitut, Arnhold et al. (2018) found HL contours on all words, with the H realized early in the word and the L at its end. On final words, the resulting fall was followed by a plateau in most cases, indicating the presence of another L associated with the intonational phrase.

In addition to marking prosodic boundaries and distinguishing sentence types, Inuit prosody is influenced by pragmatics. Massenet’s three question contours distinguish ‘leading’ questions, where the speaker already knows part of the answer, from neutral polar questions and confirmation-seeking echo questions. For Kalaallisut, Fortescue (1984) describes a complex interplay between mood (e.g. indicative, interrogative, or causative), intonation (final rise vs. final fall), and context/speaker intent. Prosodic marking of information structure has only been investigated for Kalaallisut (Arnhold 2007, 2014a), where focused words are more often realized with HLH tones and an expanded pitch range, while given/backgrounded words have smaller ranges and more frequent HL realizations.
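As a rough illustration of the mora-based association just described, the Python sketch below (our own simplification for exposition, not a published algorithm) attaches the word-final H, L, and H targets to the last three vowel morae, and mimics the northern West Greenlandic polar-question pattern, where final-vowel lengthening adds a mora so that the same tones surface one mora further to the right.

```python
# Illustrative sketch: associate the Kalaallisut word-final HLH melody with
# the last three vowel morae (cf. Mase 1973; Rischel 1974). For northern
# West Greenlandic polar questions, the final vowel is lengthened by one
# mora before association, 'shifting' the tones rightwards.

def associate_hlh(vowel_morae, northern_polar_question=False):
    """vowel_morae: the word's vowel morae, left to right (>= 3 assumed)."""
    morae = list(vowel_morae)
    if northern_polar_question:
        morae.append(morae[-1])        # final-vowel lengthening adds a mora
    if len(morae) < 3:
        raise ValueError("sketch assumes at least three vowel morae")
    tones = [None] * len(morae)
    tones[-3], tones[-2], tones[-1] = "H", "L", "H"
    return list(zip(morae, tones))

print(associate_hlh(["a", "b", "c", "d"]))
# [('a', None), ('b', 'H'), ('c', 'L'), ('d', 'H')]
print(associate_hlh(["a", "b", "c", "d"], northern_polar_question=True))
# [('a', None), ('b', None), ('c', 'H'), ('d', 'L'), ('d', 'H')]
```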
20.4.2 Yupik

Yupik is a group of polysynthetic languages spoken in southwestern Alaska, the largest group being Central Alaskan Yupik (CAY). Mutually unintelligible St Lawrence Island Yupik is spoken to the north, while Alutiiq Yupik is spoken on the Alaskan peninsula to the south. CAY has approximately 10,000 speakers, many of whom speak mixed English–CAY varieties (‘half-speakers’, as they are known in the area). The description below deals with CAY and is based on Miyaoka (1971, 2002, 2012, 2015) and informed by Jacobson (1985), Leer (1985a), and Woodbury (1987). Some information on other varieties appears at the end of this section. The text below concentrates on feet and their dominating constituents.

1 According to Fortescue (1983) and Pigott (2012), Massenet’s findings are representative of the Arctic Quebec region the speakers left about 20 years before the recordings took place.
Words may be monomorphemic as well as highly synthetic through suffixation, creating the first constituent, or ‘articulus’, above the morpheme in the morphosyntactic hierarchy.² The next two levels up in the hierarchy are the enclitic bound phrase (indicated by {…}, with clitic boundaries indicated by ‘=’) and the non-enclitic bound phrase (with boundaries indicated by ‘≠’). There are four vowels, /i a u ə/, the first three of which also occur as long; /ə/ does not appear word-finally and has no long counterpart.

In the enclitic bound phrase, syllabification is continuous and quantity-sensitive iambs are built from the left edge. The last of the stressed syllables is the primary stress. Long vowels reject being in the weak position of the foot, as does a word-initial closed syllable. As shown in (1), the last syllable of the word is never accented, due to a ban on final feet, which is suspended if the word otherwise remains footless, as in nuna /(nu.ˈna)/ ‘land’. As a result, one or two syllables at the end of a longer enclitic bound phrase will be unfooted. Thus, an initial monosyllabic foot occurs in (1a, 1b) and a disyllabic one in (1c, 1d). Penultimate stress in (1a) is due to the long vowel (cf. initial primary stress in /aana=mi/ leading to /{(ˈaa).na.mi}/ ‘How about mother?’). In (1b), four word-internal closed syllables appear in weak positions. In addition, within the word, a sequence of heavy-light-light is parsed with a monosyllabic foot for the heavy syllable instead of a disyllabic one, as shown in (1d); compare /qayar-pag-mi=mi/ ‘How about in the big kayak’, which has a clitic boundary between the two light syllables, allowing the construction of a HL foot: /{(qa.ˌjaχ).(paɣ.ˈmi).mi}/.

(1) a. aaluuyaaq
(ˌaa).(ˈluu).jaaq
‘swing’
b. qusngirngalngur-tangqer-sugnarq-uq=llu=gguq
goat there.be probable indic.3sg encl quot
(ˌquz).(ŋiʁ.ˌŋal).(ŋuχ.ˌtaŋ).(qəχ.ˌsuɣ).(naχ.ˈquq).l̥u.xuq
‘they say there seems to be a goat also’
c. cagayag-yagar-mini
bear baby loc.4sg.sg/pl
(ca.ˌɣa).(ja.ˌja).(ɣa.ˈmi).ni
‘in his own baby bear(s)’
d. qayar-pag-mini
kayak big loc.4sg.sg/pl
(qa.ˌjaχ).(ˈpaɣ).mi.ni
‘in his own big kayak(s)’

The primary stress is the location of a rapid high-to-low pitch fall. Since stressed syllables are bimoraic, the three full vowels /i a u/ are subject to iambic lengthening in strong open syllables (not indicated in the current orthography and the transcriptions here). In addition to underlyingly closed syllables, open syllables may be closed by a geminate consonant in specific right-hand contexts, for which reason the stress on a derived closed syllable is referred to as ‘regressive’ (e.g. Miyaoka 2012).

2 In Miyaoka (2002, 2012, 2015), a structural hierarchy is assumed, the ‘bilateral articulation’. It does not follow the commonly assumed conception of the ‘double articulation’ of language, specifically the notion of potentially non-isomorphic morphosyntactic and phonological hierarchies. In an integrated model of morphosyntax and phonology, each constituent has a morphosyntactic plane (content) as well as a phonological plane (expression) (cf. (4)).
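To make the footing procedure concrete, here is a small Python sketch of our own, covering only the core pattern in (1a–c): left-to-right quantity-sensitive iambs, heavy syllables barred from weak position, and the ban on final feet with its suspension for otherwise footless words. The heavy-light-light refinement seen in (1d) and regressive gemination are deliberately left out.

```python
# Illustrative sketch of the core CAY footing pattern described above:
# quantity-sensitive iambs built left to right; a heavy syllable (long
# vowel, or word-initial closed syllable) forms a foot on its own; the
# final syllable stays unfooted unless the whole phrase would otherwise
# be footless. The heavy-light-light refinement of (1d) is omitted.

def build_feet(syllables):
    """syllables: list of (text, weight) pairs, weight 'H' or 'L'.
    Returns a list of feet, each a tuple of syllable texts (head = last)."""
    feet, i, n = [], 0, len(syllables)
    while i < n:
        text, weight = syllables[i]
        if weight == "H" or i + 1 == n:
            candidate = (text,)                      # monosyllabic foot
        else:
            candidate = (text, syllables[i + 1][0])  # disyllabic iamb
        if i + len(candidate) == n and feet:
            break            # ban on final feet: leave the rest unfooted
        feet.append(candidate)
        i += len(candidate)
    return feet

# (1c) cagayag-yagar-mini: all light syllables
print(build_feet([("ca", "L"), ("Ga", "L"), ("ja", "L"), ("ja", "L"),
                  ("Ga", "L"), ("mi", "L"), ("ni", "L")]))
# [('ca', 'Ga'), ('ja', 'ja'), ('Ga', 'mi')]  -- 'ni' is left unfooted
```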
In (2), three contexts are listed in which closed syllables arise in the enclitic bound phrase, underlined in the transcriptions.

(2) a. An open syllable before a sequence of a consonant-initial open syllable and an onsetless syllable acquires a coda through gemination of the following consonant: …V.C1V.V… → …VC1.C1V.V…, as in /a.ki.a.ni/ {(ˌak).(ki.ˈa).ni} ‘across it’ (aki ‘one across’, -ani loc.3sg) and /ang.ya.cu.ar.mi/ {(ˌaŋ).(ˌjat).(tɕu.ˈaʁ).mi} ‘in the small boat’ (angya ‘boat’, -cuar- ‘small’, -mi loc.sg).
b. The second of two identical onset consonants separated by /ə/ is geminated, as in /nang.te.qe.qapig.tuq/ {(ˌnaŋ).(te.ˌqəq).(qa.ˈpix).tuq} ‘he is very sick’ (nangteqe- ‘sick’, -qapigc- ‘very much’, -tuq ind.3sg). In the same context, /ə/ is deleted between non-identical consonants, as in /keme-ni/ {(ˈkəm).ni} ‘his own flesh’ (kəmə ‘flesh’, -ni abs.3sg.refl.sg/pl; cf. /keme-mi/ {(kə.ˈməm).mi} ‘of his own flesh’, keme ‘flesh’, -mi erg.3sg.refl.sg/pl).
c. A word consisting of a single open syllable acquires a coda through gemination of the onset consonant of a following morpheme, as in /ca=mi/ {(ˈcam).mi} ‘then what?’ (cf. /ca-mi/ {(ca.mí)} ‘in what?’; -mi loc). This particular ‘enclitic regression’ shows some dialect variation (cf. Miyaoka 2012).
20.4.2.1 The enclitic bound phrase

CAY has over a dozen monosyllabic enclitics, at least three of which may occur in succession in an enclitic bound phrase.

(3) a. nuna-ka=mi=gguq
land abs.1sg.sg how.about quot
{(nu.ˌna).(ka.ˈmi).xuq}
‘how about my land, it is said’
b. aana-ka=llu=ggur=am
mother abs.1sg.sg and quot emph
{(ˌaa).(na.ˌka).(l̥u.ˈxuʁ).ʁam}
‘tell him/them that my mother …!’

There are no stem compounds in the language, except for a few exceptional phrasal compounds, notably with inflections on both constituents. In (4), two of these are given.

(4) a. Teknonymic terms, which are very common, refer to parents by mentioning the name of their child, such as May’am+arnaan {(ˌmaj).(ja.ˈmaʁ).na.an} ‘(lit.) May’aq’s mother’ (May’aq ‘proper name’, -m erg.sg, arnar ‘woman’, -an erg.3sg), used to avoid the real name of the parent, in contrast with the syntactic phrase May’am arnaa(n) {(ˈmaj).jam}{(ˈaʁ).na.a(n)} ‘(of) May’aq’s woman’.
b. Complex verbs formed from nouns indicating location and verbs indicating existence have inflections on both constituents, such as /ang.ya.an+(e)tuq/ {(ˌaŋ).(ja.ˈa).nә.tuq} ‘she is in his boat’ (angya ‘boat’, -an contracted from -ani loc.3sg.sg, (e)t- ‘to exist’, -uq ind.3sg). By contrast, the two-phrase /angyaani etuq/ is retained in the Nunivak dialect as {(ˈaŋ).ya.an}{(ә.ˈtuq)}.
The non-enclitic bound phrase is a halfway, somewhat variable category between the enclitic bound phrase and the (free) syntactic phrase. Often, the boundary between the words in a non-enclitic bound phrase blocks the maximum onset principle as well as footing, as for an enclitic bound phrase boundary, but allows footing to continue up to the medial boundary. It is indicated by ≠, an asymmetrical boundary. Example (5) would have been parsed as /{(nu.ˌna).(kaˈta).ma.na}/ if it were a clitic bound phrase and as /{(nu.ˌna).ka}{(ta.ˈma).na}/ if there were two syntactic phrases. As it is, (5) allows /kat/ to be footed, with gemination of /t/ to satisfy bimoricity.

(5) nuna-ka ≠ tama-na
land abs.1sg.sg that abs.sg
{(nu.ˌna).(ˌkat) {(ta.ˈma).na}}
‘that land of mine’

In sum, the enclitic bound phrase is the domain of syllabification and right-edge iambic quantity-sensitive feet, whereby non-initial closed syllables count as light. The word has a minor effect in the way strings of V.CV.V are parsed, while the non-enclitic bound phrase has a final constituent that rejects inclusion in its host enclitic bound phrase but suspends the constraint on final feet. Woodbury (1987) reports various rules of expressive lengthening in varieties of CAY.

In Central Siberian Yupik, coda consonants do not affect stress and stress does not cause consonant gemination. Stress appears on all syllables with long vowels and on each syllable following an unstressed syllable, except in final position (Jacobson 1985). Stressed vowels lengthen in open syllables (except /ə/, which is always short). Long stressed vowels are additionally marked by falling pitch (Jacobson 1990; but see Krauss 1985b on the loss of some length distinctions in younger speakers). The stress system of Naukanski is similar, though with closed syllables attracting stress and some apparent flexibility in stress placement (Krauss 1985c; Dobrieva et al. 2004). A different foot structure is reported for Alutiiq Yupik Sugt’stun, which combines binary and ternary stress (Leer 1985a, 1985b, 1985c; Martínez-Paricio and Kager 2017; and references therein).
20.4.3 Aleut

Aleut (Unangam Tunuu) is a severely endangered language spoken on the western Alaska Peninsula, the Aleutian island chain, and the Pribilof and Commander Islands. In addition to the phonemic length contrast, vowel durations mark primary stress (Rozelle 1997; Taff et al. 2001), which is penultimate in eastern dialects, unless the penultimate syllable contains a short vowel and the ultimate a long vowel, in which case the ultimate is stressed (Taff 1992, 1999).³ This pattern is stable even if the last syllable is reduced or deleted, as happens frequently in phrase-final position (Taff 1999; cf. Kalaallisut in §20.4.1): when the ultima is deleted, the then final, but underlyingly penultimate, syllable is stressed (Rozelle 1997; Taff et al. 2001).

3 Acoustic investigations of prosody have not been conducted for Western dialects, which according to Bergsland (1994, 1997) prefer initial stress but have similar intonation to eastern dialects.
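Stated procedurally, the eastern dialect pattern amounts to a two-way check on the last two underlying syllables. The Python sketch below is our own illustration, not from the literature; computing stress on the underlying form is what makes the pattern stable under final-syllable reduction or deletion.

```python
# Illustrative sketch of eastern Aleut primary stress as described above:
# penultimate by default, final if the penult has a short vowel and the
# ultima a long one. Stress is computed on the underlying form, which is
# why it stays put when the final syllable is later reduced or deleted.

def eastern_aleut_stress(vowel_lengths):
    """vowel_lengths: per-syllable 'short'/'long' for the underlying form.
    Returns the 0-based index of the stressed syllable."""
    n = len(vowel_lengths)
    if n == 1:
        return 0
    if vowel_lengths[-2] == "short" and vowel_lengths[-1] == "long":
        return n - 1          # long ultima wins over a short-vowel penult
    return n - 2              # default: penultimate stress

assert eastern_aleut_stress(["short", "short", "short"]) == 1
assert eastern_aleut_stress(["long", "short", "long"]) == 2
```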
Intonation is very regular. Almost all content words bear a pitch rise at the beginning and a pitch fall near the end (Taff 1999; Taff et al. 2001; cf. South Baffin Inuktitut in §20.4.1), modelled by Taff (1999) with two phrase accent boundary tones, H and L, as movements are not associated with the stressed syllable. Within a sentence, H and L tones are successively lowered. Additionally, the first word in a sentence starts with lower pitch and has a later peak than subsequent words, which Taff (1999) models with a L% initial boundary tone. Moreover, pitch falls on sentence-final words are steeper than for preceding words, modelled with a final L% (which contrasts with H% for rare clause-final, sentence-internal cases with less clear falls). Intonation is very similar for declaratives and polar questions (Taff 1999; Taff et al. 2001). Prosodic marking of focus is optional and may employ suspension of downtrends between prosodic words, extra-long duration, and/or increased use of small peaks, followed by a fall, on the penultimate, which led Taff (1999) to propose a sparsely used H* accent.
20.5 Conclusion

We have discussed three genetically unrelated language groups with different basic structures. The Indo-European languages make a clear distinction between word constituency and phrasal constituency, whereas for the Eskimo-Aleut languages this distinction is not as clear. Devoicing and the presence of pre-aspiration in the North Atlantic region are worthy of interest, and in particular the final devoicing and truncation noted for Insular Scandinavian, Kalaallisut, and Aleut.
Chapter 21
The Indian Subcontinent
Aditi Lahiri and Holly J. Kennard
21.1 Introduction

The Indian subcontinent comprises Bangladesh, India,¹ and Pakistan, and its languages come from five major language families: Indo-Aryan, Nuristani branches of Indo-Iranian, Dravidian, branches of Austroasiatic, and numerous Tibeto-Burman languages, as well as language isolates (Grierson 1903/1922; Masica 1991; Krishnamurti 2003; Thurgood and LaPolla 2003; Abbi 2013; Dryer and Haspelmath 2013). In general, the term ‘prosody’ subsumes quantity contrasts, metrical structure, lexical tone, phrasing, and intonation, but, as far as is known at present, the only language of the Indian subcontinent to have lexical tone is Punjabi. In this chapter, we touch on all of these aspects of prosody but discuss quantity only insofar as it is linked to stress, phrasing, and intonation. It is not possible to cover the entire range of existing languages; rather, we give representative examples, largely from Assamese, Bengali (Kolkata), Bangladeshi Bengali (Dacca), Hindi, Malayalam, Tamil, and Telugu.
21.2 Quantity

From a prosodic perspective, quantity contrasts are pertinent since they are directly related to syllable weight and hence stress assignment. Vowel quantity is always relevant to stress in a quantity-sensitive system, but if geminates are truly moraic and add weight to the syllable (cf. Hayes 1989a), they too could attract stress.² For our purposes, it is worth noting that consonant quantity contrasts prevail in most Indo-Aryan languages, while true vowel quantity distinctions are less frequent, a possible example being educated standard Marathi, for which Le Grézause (2015) reports a quantity contrast for high vowels, with no geminates.

1 The Indian Nagaland is covered in chapter 23.
2 Davis (2011) argues that although generally geminate consonants have an underlying mora, it could be the case that other factors constrain the surface realization. Mohanan and Mohanan (1984) also suggest that geminates in Malayalam may not be truly moraic.
There is, however, a tense/lax distinction in vowels in most of the Indo-Aryan languages, and Masica (1991) provides the following details. Some languages have an eight-vowel system, as in Gujarati /i e ɛ ɑ ə ɔ o u/, while others are assumed to contain nine vowels, such as Dogri /iː ɪ e æ aː ə o uː ʊ/³ with three long vowels. Hindi has an additional vowel, but without the quantity distinctions: /i ɪ e æ a ə ɔ o u ʊ/, which is also the view held by Ohala (1999). In the Bengali vowel system, there is neither a lax/tense nor a quantity distinction: /i e æ ɑ ɔ o u/ (Lahiri 2000), but allophonic vowel lengthening occurs in monosyllabic words. Consequently, vowel quantity alternations can be observed in morphologically related (base vs. suffixed) words as well as across unrelated monomorphemic words, as in (1).

(1) Vowel quantity alternations in Bengali
[nɑːk] ‘nose’ ~ [nɑk-i] nose-adj, ‘nasal’
[kɑːn] ‘ear’ ~ [kɑnɑ] ‘blind’

As in most languages where consonantal quantity contrasts exist, singleton–geminate pairs are observable medially but not finally or initially. In most Indo-Aryan languages, almost all consonantal phonemes have geminate counterparts, but there may be language-specific constraints on their appearance. Hindi, Telugu, and Bengali allow geminates to appear only in medial position, and there are further segmental restrictions. In Hindi, /bʰ ɽ h ɦ/ do not geminate (Ohala 1999), while in Telugu /f ʂ ʃ h ɳ/ are always singletons (Bhaskararao and Ray 2017). Bengali does not allow /h/ to geminate; however, there are also constraints on singletons such that retroflex /ɖ ɖʰ/ are not permitted in word-medial position, where they are rhotacized. Examples of monomorphemic word pairs with a subset of obstruent and sonorant phonemes are given in (2) for Bengali, Hindi, and Telugu.

(2) Selected examples of singleton–geminate obstruents and sonorants
a. Bengali
[ɑʈɑ] ‘wheat’ [ɑʈːɑ] ‘eight o’clock’ voiceless unaspirated retroflex stop
[ʃobʰɑ] ‘beauty’ [ʃobʰːo] ‘civilized’ voiced aspirated labial stop
[ɔɡɑd] ‘plenty’ [ɔɡːæn] ‘faint’ voiced unaspirated velar stop
[bɑtʃʰɑ] ‘to isolate’ [bɑtʃʰːɑ] ‘child’ voiceless aspirated palatoalveolar affricate
[ʃodʒɑ] ‘straight’ [ʃodʒːɑ] ‘bedding’ voiced unaspirated palatoalveolar affricate
[kɑn̪ɑ] ‘blind’ [kɑn̪ːɑ] ‘tears’ dental nasal
*[ɖ] — [bɔɖːo] ‘too much’ voiced unaspirated retroflex stop
b. Hindi (Ohala 1999: 101)
[pət̪ɑ] ‘address’ [pət̪ːɑ] ‘leaf’ voiceless unaspirated dental stop
[kət̪ʰɑ] ‘narrative’ [kət̪ʰːɑ] ‘red powdered bark’ voiceless aspirated dental stop
[ɡəd̪ɑ] ‘mace’ [ɡəd̪ːɑ] ‘mattress’ voiced unaspirated dental stop

3 Masica writes the lax vowels /ɪ ʊ/ as capital /I, U/. Also, following traditional usage for our languages, p and i are used instead of φ and ι to indicate the phonological phrase and the intonational phrase.
[bətʃɑ] ‘save’ [bətʃːɑ] ‘child’ voiceless unaspirated palatoalveolar affricate
[pəkɑ] ‘to cook’ [pəkːɑ] ‘firm’ voiceless unaspirated velar stop
c. Telugu (Bhaskararao and Ray 2017: 234)
[ɡɐdi] ‘room’ [ɡɐdːi] ‘throne’ voiced unaspirated dentialveolar stop
[ɐʈu] ‘that side’ [ɐʈːu] ‘pancake’ voiceless unaspirated retroflex stop
[moɡɐ] ‘male’ [moɡːɐ] ‘bud’ voiced unaspirated velar stop
[kɐnu] ‘give birth to’ [kɐnːu] ‘eye’ alveolar nasal stop
[kɐlɐ] ‘dream’ [kɐlːɐ] ‘falsehood’ alveolar lateral approximant
[mɐri] ‘again’ [mɐrːi] ‘banyan tree’ alveolar trill
Concatenation of identical phonemes leads to geminates, but gemination as a phonological process occurs quite commonly within and across words and across morphemes via assimilation processes. Such assimilations are restricted to specific prosodic domains, as we will see in §21.3. The examples in (3) illustrate this point (a runnable sketch of the assimilation in (3b) follows the examples).

(3) Gemination
a. Concatenation
Bengali
/kʰel-l-ɑm/ > [kʰelːɑm] ‘play-simple past-1pl’
/ʃɑt̪ t̪ɔlɑ/ > [ʃɑt̪ːɔlɑ] ‘seven floors’
Marathi: glide gemination (Pandharipande 2003: 725)
/nəu-wadzta/ > [nəwwadzta] ‘at nine o’clock’
/nahi-jet/ > [nahjːet] ‘does not come’
b. Derived via r-coronal assimilation, whereby /r/ assimilates completely to the following dental, palatoalveolar, and retroflex consonants, leading to geminates.
Bengali
/kor-t̪-ɑm/ > [kot̪ːɑm] ‘do-habitual past-1pl’
/tʃʰord̪i/ > [tʃʰod̪ːi] ‘youngest older sister’

There is another form of gemination that relates to emphasis and is marked by the gemination of a consonant. This is very clearly observed in Bengali time adverbials, as in (4).

(4) Adverbs and emphatic gemination
Bengali⁴
[ækʰon] ‘now’ ~ [ekʰːuni] ‘immediately’
[t̪ɔkʰon] ‘then’ ~ [t̪okʰːuni] ‘right after that time’
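The r-coronal assimilation in (3b) is regular enough to state as a rewrite rule over segment strings. The following Python sketch is our own illustration, using romanized placeholder symbols rather than IPA, and assumes the rule has already been restricted to the relevant prosodic domain.

```python
# Illustrative sketch of the Bengali r-coronal assimilation in (3b):
# /r/ assimilates completely to a following dental, palatoalveolar, or
# retroflex consonant, yielding a geminate (e.g. /kor-t-am/ > [kottam]).
# Romanized placeholders stand in for IPA; the prosodic-domain condition
# discussed in the prose is assumed to have been checked already.

CORONALS = {"t", "th", "d", "dh", "c", "ch", "j", "jh", "T", "Th", "D", "Dh"}

def r_assimilation(segments):
    """segments: list of segment symbols; returns the assimilated list."""
    out = list(segments)
    for i in range(len(out) - 1):
        if out[i] == "r" and out[i + 1] in CORONALS:
            out[i] = out[i + 1]        # total assimilation -> geminate
    return out

print(r_assimilation(["k", "o", "r", "t", "a", "m"]))
# ['k', 'o', 't', 't', 'a', 'm']  -- cf. /kor-t-am/ > [kottam]
```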
Thus, geminates and gemination are typologically quite frequent in the Indo-Aryan languages (cf. Goswami 1966 for Assamese). However, although geminates may add weight to the preceding syllable, it does not necessarily follow that they play an active role in stress assignment, to which we turn in the next section.
4 Note that vowels are raised one step when a high vowel follows.
21.3 Word stress

Word stress is a contentious topic in the Indo-Aryan languages. In Bengali, for example, lexical prominence is on the first syllable of a word, but it is considered to be phonetically ‘weak’ and hardly perceptible (Chatterji 1926/1975; Hayes and Lahiri 1991; Masica 1991). Nevertheless, there are some clear diagnostics, largely post-lexical, for the location of the main prominence on a word (see also Khan 2016 for Dacca Bengali). For example, as mentioned in §21.2, there are seven oral vowels /i e æ ɑ ɔ o u/ in Bengali and all of them have a nasal counterpart. However, there are distributional constraints on the vowels based on stress. First, the vowels /ɔ/ and /æ/ only occur in word-initial position, as in [kɔt̪ʰɑ] ‘speech’, [bæt̪ʰɑ] ‘pain’. In contrast, plenty of examples exist with final /i e u o ɑ/, such as [pori] ‘fairy’, [bẽʈe] ‘short (in height)’, [d̪ʰɑt̪u] ‘metal’, [bɔɽo] ‘large, big’, [kɔlɑ] ‘banana’. Second, since all nasal vowel phonemes must be in stressed position, they are restricted to the first syllable of a word. Third, geminate consonants in monomorphemic words are also restricted to the stressed syllable: [ʃot̪ːi] ‘true’; *ˈCVCVCːV words are not permitted. However, since geminate suffixes exist, they are permitted in non-initial syllables in polymorphemic words: /ˈd̪ækʰ-ɑ-tʃʰː-e/ show-caus-prog-3sg, ‘(s)he is showing’; /ˈmɑrɑ-t̪ːok/ ‘deadly’. As we shall see in §21.5, the alignment of pitch accents in intonational tunes is another indicator of the syllable that carries the main prominence.

A further diagnostic is the adaptation of loans. From the seventeenth century onwards, and more so in the eighteenth and nineteenth centuries, numerous loans came into Bengali primarily from Portuguese and English, both of which can have words with non-initial and variable stress. Irrespective of the stress pattern of the donor language, Bengali has always borrowed words with main prominence on the first syllable. Portuguese estirár, ananás, alfinéte, espáda, bálde, janélla have been borrowed into Bengali as [ˈist̪iri] ‘iron (for ironing clothes)’, [ˈɑnɑrɔʃ] ‘pineapple’, [ˈɑlpin] ‘pin’, [ˈiʃpɑt̪] ‘steel’, [ˈbɑlt̪i] ‘bucket’, [ˈʤɑnlɑ] ‘window’ (Lahiri and Kennard 2019). Irrespective of which syllable bore stress in Portuguese, Bengali has firmly maintained word-initial stress. The same occurs for English words: exhibítion, inspéctor, América, cómpany are pronounced as [ˈegʤibiʃɑn], [ˈinʃpekʈɔr], [ˈæmerikɑ], [ˈkompɑni]. As Chatterji (1926/1975: 636) puts it, ‘the stress is according to the habits of Bengali’.⁵

Other dialects of Bengali and related sister languages such as Oriya do not necessarily have fixed word-initial stress. Masica (1991: 121) claims that stress is more evenly spaced and is weak (see also Lambert 1943; Pattanayak 1966). In Hindi, stress is probably quantity-sensitive and tends to fall more on the penultimate syllable if the syllable is closed or if the vowel is tense, else the antepenult, very similar to Latinate stress. Names can provide good comparative evidence. For example, the name Arundhati carries antepenultimate stress in Hindi [əˈrund̪ʰət̪i] but not in Bengali, where the stress is on the first syllable [ˈorund̪ʰot̪i]. However, there are language-specific differences as to which types of syllables contribute to weight and therefore attract stress. As Masica (1991: 121) states, ‘each language has its own peculiarities: Hindi gaˈrīb; nukˈsān “poor”; “loss” vs. Gujarati ˈgarib, ˈnuksān’.
Thus, for intonation purposes, it is important to consider whether the pitch accents are aligned to stressed syllables and whether there is more variation in some languages than others.

5 Cases of Bengali loans in English demonstrate that borrowed words in English conform to English stress rules: largely penultimate if heavy, otherwise antepenultimate. For instance, Darjeeling (a favourite Himalayan resort in West Bengal) is pronounced in Bengali as [ˈd̪ɑrʤiliŋ], while English has predictably borrowed it with stress on the penultimate syllable, Darjéeling.
21.4 Tone

Punjabi tone arose from a merged laryngeal contrast in consonants, and the tone contrast only appears in that segmental context and most clearly in stressed syllables (Evans et al. 2018). On the basis of this segmental origin, the steeply falling tone should be expected to be a low tone, because it appears after historically voiced aspirated plosives (i.e. murmured), while the other tone is low and flat but should be expected to be high, as it appears after the plosives that historically had no breathy voice (voiceless unaspirated, voiceless aspirated, and plain voiced) (cf. §3.3.2).⁶ The authors propose that this apparent anomaly is to be explained by a perceptual enhancement of an original low tone after murmured plosives by a fall before it (‘dipping’; cf. §9.5.2). The low pitch for the tone after the other laryngeal groups must subsequently have arisen by a contrast-enhancing lowering of an original high tone.
21.5 Intonation and intonational tunes

In this section we will discuss three nuclear tunes: the declarative (neutral), focus, and yes/no question contours. The standard tune in Indo-Aryan languages is overwhelmingly LH. It not only marks focus but is also the general prenuclear contour. Another general fact is that plateaux are not common in Indian languages, which must be related to the fact that the Obligatory Contour Principle (OCP) prohibits the occurrence of a sequence of identical tones in the intonational phrase (IP). Beyond that, there are various interactions between the three intonational tunes in these languages. Three specific issues arise from a brief survey:

1. Are the focus tune and the neutral tune identical, and, regardless of whether they are, do they differ in pitch range?
2. How reliably can one distinguish a non-focused yes/no question from a narrow-focused one?
3. If the tunes of non-focused and narrow-focused yes/no questions are neutralized, might there still be phrasal segmental processes that can distinguish them?

We next illustrate the various patterns, with examples from Bengali, Assamese, and Hindi. What these languages have in common is that focus is marked by a L*Hp contour, with L* associated to the earliest stressed syllable of the focused element. As for the neutral contour, Hayes and Lahiri (1991) and Lahiri and Fitzpatrick-Cole (1999) have claimed that it is marked by a different pitch accent, namely H*. Other researchers have made different claims (e.g. Twaha 2017 for Assamese), but the actual descriptions and pitch tracks seem to suggest that the sentence-level prosodic phrase bears a H tone. The prenuclear contour has been claimed to be also LH by most researchers, as mentioned above. We turn to this in more detail after discussing the individual tunes in Bengali.

6 The tone contrast must represent a fairly recent case of tonogenesis. The Gurmukhi script, which dates to the early sixteenth century, still indicates voiced aspirates in syllables with falling tone (Jonathan Evans, personal communication). Mikuteit and Reetz (2007) give a detailed description of the acoustics of voiced aspirates in Dacca Bengali, providing evidence that they are really voiced and aspirated.
21.5.1 Declarative

Bengali is strictly verb-final and the verb will form a nuclear phrase. However, not all sentences need to have a verb, as seen in (5a). More generally, the nuclear pitch accent H* is aligned to the first prosodic word of the last prosodic phrase of an IP in a declarative intonation. Importantly, Bengali obeys the OCP within IPs, disallowing sequences of H tones.

(5) Declarative neutral TUNE (surface)
a.  L* (HP)            H* LI
    (((ˈtʃʰele-ʈi)ω)φ ((ˈlɔmbɑ)ω)φ)I
    boy-CLASSIFIER     tall
    ‘The boy is tall.’
b.  L* (HP)                H*       LI
    (((d̪id̪i-r)ω (d̪æor)ω)φ ((rɑnːɑ)ω (kɔre)ω)φ)I
    elder sister’s brother-in-law cook.VN do.3sg.pres
    ‘My elder sister’s husband’s younger brother cooks.’

In (5a), tall is the last full prosodic element to carry stress and thereby align to H*. In (5b), /d̪id̪i-r d̪æor/ falls within a single phonological phrase and can undergo assimilation; thus, the phrase /d̪id̪i-r d̪æor/ ‘elder sister’s husband’s younger brother’ surfaces with gemination: [d̪id̪id̪ːæor]. Since the prenuclear accent is L*Hp, the OCP will delete one of the H tones. In the examples, we have indicated the deletion of the phrasal Hp (shown in parentheses), but we could equally well have deleted the second H; pitch tracks show that there is a H hovering around the edge of the phrase, as in Figure 21.1. In §21.5.3, evidence is presented that in yes/no questions, Hp is deleted when it is adjacent to a final HI.

Twaha (2017) provides detailed analyses of Standard Colloquial Assamese (SCA) and the Nalbariya Variety of Assamese (NVA). In both varieties, each phonological phrase has a L*HP tune, while the final verb bears a pitch accent H* in a declarative sentence, as seen in Figure 21.2.7 Twaha later (2017: 79) suggests that only NVA, not SCA, has a H* on the verb or verbal complex, apparently assuming an OCP deletion of H* occasioned by the preceding HP in SCA. The NVA example is given in Figure 21.3.

7 Twaha (2017: 57–58) claims that ‘in the third P-phrase ghɔrɔk, a plateau is observable after HP is manifested on the first mora of the final syllable rɔk. This plateau is caused because, unlike the preceding two P-phrases, the following P-phrase geisil bears a high pitch accent H* on its first syllable. Since H*, as per the proposal here, is aligned to the first mora of the initial syllable of geisil, the interpolation from Hp of ghɔrɔk to H* of geisil creates a plateau.’ However, the H* on the verb is not marked consistently; in some figures there is only a rising contour on the pre-verbal phrase followed by a sentence-final L boundary tone.
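The OCP deletion just described can be stated mechanically. The following is a minimal sketch in Python (our own illustration with toy tone labels, not the authors’ formalism): given the tonal string of an IP, a phrasal HP is removed when the next tone is also a H tone.

    # Minimal sketch (toy notation, not the authors' formalism) of the
    # OCP deletion described for (5): a phrasal HP is removed when the
    # next tone in the IP's tonal string is also a H tone.

    def apply_ocp(tones):
        """Delete any 'HP' that immediately precedes another H tone."""
        out = []
        for i, tone in enumerate(tones):
            nxt = tones[i + 1] if i + 1 < len(tones) else None
            if tone == "HP" and nxt is not None and nxt.startswith("H"):
                continue  # OCP: no adjacent H tones within the IP
            out.append(tone)
        return out

    # (5a): prenuclear L* HP plus nuclear H* LI -> the phrasal HP is deleted
    print(apply_ocp(["L*", "HP", "H*", "LI"]))  # ['L*', 'H*', 'LI']

As the text notes, deleting the second H instead would be an equally valid analysis; the sketch simply implements the option indicated in (5).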
Figure 21.1 Declarative intonation in Bengali. [Pitch track not reproduced: nɔren runir mɑlɑgulo nɑmɑlo (Nɔren Runi-GEN garlands bring.down-PST-3SG) ‘Nɔren brought down Runi’s garlands’, with L* Hp rises on the prenuclear phrases and H* followed by Li at the end.]
Figure 21.2 Standard Colloquial Assamese ‘Ram went to Ramen’s house’ (Twaha 2017: 57). [Pitch track (75–250 Hz) not reproduced: [[rɔmɛnɛ]P [ramɔr]P [ghɔrɔk]P [geisil]P]I, with L* HP on the non-final phrases and H* followed by a final L%/LI.]
Figure 21.3 Nalbariya Variety of Assamese ‘Today I scolded him’ (Twaha 2017: 79). [Pitch track (75–300 Hz) not reproduced: [[azi teok]P [dhɔmki]P [dilu]P]I, with L* HP on the non-final phrases and LP H* LI on the final verb.]
In sum, it appears that the H* on declaratives does show up in dialects of Assamese and even in the standard variety. With respect to the OCP disallowing sequences of H tones, Twaha (2017: 80) concludes: ‘However, the H* nuclear accent in NVA declarative utterances may not be always phonetically apparent due to the phonetic pressure created by the prosodic boundary tones preceding and following it (HP and LI respectively).’ Consequently, although we see a sequence of H tones in the figures, they do not in fact always surface.

Khan (2008, 2014) suggests that in Bangladeshi Bengali (spoken largely in Dacca), no sequence of identical tones is accepted, although no explicit assumption of the OCP is made. Following Selkirk’s (2007) analysis of Bengali, Khan argues that if an accentual phrase ends in a H tone, the following pitch accent is L, and vice versa. This general pattern is again similar to what we have seen above.

Féry (2010) argues that Hindi neutral sentences consist of repetitions of LH ending with HL and claims that the same sequences are maintained regardless of whether one chooses the canonical word order to be SOV or SVO or whether the elements are focused. That is, focus is also LH, be it initial or medial, and does not change the overall picture (see more in §21.5.2). Her coding of an SOV sentence, with focus on the subject, is given in example (6).

(6) Hindi: SOV focus (Féry 2010: 294)
    LP HP       LP HP      HP LI
    adhyapak ne moorti ko  banaaya
    teacher erg sculpture acc make.past

Note that Féry labels the accusative marker with a H tone followed by another H on the verb, although no obvious plateau is apparent from the figure. The same sort of picture is observable in the Dravidian languages Tamil and Malayalam; the general pattern continues to be LH-marked phrases ending with HL, as (7) and (8) show.

(7) Tamil (Féry 2010: 305)
    LP HP     LP HP      HP LI
    [[meeri]P [niRaiya]P [ceer vaank-in-aaL]P]I
    Mary      many       chairs buy-past-png
    ‘Mary bought many chairs.’

(8) Malayalam (Féry 2010: 307)
    LP HP     LP HP                   LP HP       HP LI
    [[Peter]P [oru [rasakaram-aya]P]P [pustakam]P [vaichu]P]I
    Peter     one  interesting        book        read
    ‘Peter read one interesting book.’

There would thus appear to be a general constraint that prohibits a sequence of identical tones in most Indian languages. For Bengali, Hayes and Lahiri (1991) and Lahiri and Fitzpatrick-Cole (1999) argue that the H* of a declarative final prosodic phrase clashes with the preceding prenuclear LH. The arguments that have been raised against the assumption of a pitch accent on the final verb (cf. Dutta and Hock 2006) are probably based on the intolerance of a sequence of H tones, which causes variation in the surface alignment of the remaining H, as in Bengali, Hindi, and Assamese.
21.5.2 Focus

Returning now to our example in (5), we saw that in a neutral declarative sentence, Bengali has an initial L*Hp prenuclear accent, followed by a final H*LI, as shown in (5) and (6). The OCP deletes one of the high tones, so this contour surfaces as an initial low followed by a rise to the last prosodic phrase (which need not be the verb). If, however, one element, such as [lɔmbɑ] ‘tall’, is focused, we see very different patterns. The focus tune is L*Hp followed by a final LI for the declarative. In (9a), the focus on [lɔmbɑ] is marked by L*Hp. Focus on pre-final constituents, such as [tʃeleʈi] in (9b), is shown by the deletion of post-focal pitch accents, a feature that Bengali shares with many Indo-European languages. Example (9c) shows a sequence of two phonological phrases, each consisting of two words, with focus on the second phonological phrase, the complex verb [rɑnːɑ kɔre] ‘cook-3sg’. In this analysis, the boundary tones are linked to edges of prosodic constituents, but their targets do not necessarily appear strictly at the edges. In contrast, the pitch accents L* and H*, which are always linked to the initial syllable of the first prosodic word, are more stably aligned to the stressed syllable.

(9) Bengali: focus intonation
a.  L* HP              L* HP LI
    (((ˈtʃʰele-ʈi)ω)φ ((ˈlɔmbɑ)ω)φ)I
    ‘The boy is (really) tall!’
b.  L* HP                       LI
    (((ˈtʃʰele-ʈi)ω)φ ((ˈlɔmbɑ)ω)φ)I
    ‘The boy is tall.’
c.  L* HP                  L*      HP LI
    (((d̪id̪i-r)ω (d̪æor)ω)φ ((rɑnːɑ)ω (kɔre)ω)φ)I   [d̪id̪id̪ːæor]
    ‘Sister’s brother-in-law cooks!’

We have taken a sentence with sonorants to illustrate the L*Hp contour for both the prenuclear and the focus tunes. In Figure 21.4, the first word Noren is focused and we see a clear L*Hp contour with a gradual fall ending with a Li at the end of the sentence; this should be compared with the example in Figure 21.1. Figure 21.5 illustrates the prenuclear as well as the focus contour on Runir; here the focus suggests that it is Runi’s garlands that were brought down and not someone else’s. The prenuclear L*Hp is lower than the focus contour, and again the intonation goes down to a final LI. The Bengali examples suggest that L*Hp is the focus tune and that the final non-focused tune is H*Lp, typically associated with the sentence-final verb or verbal cluster (a complex predicate or a noun-incorporating verb).
Figure 21.4 Bengali intonation, focus on Nɔren. [Pitch track not reproduced: nɔren runir mɑlɑgulo nɑmɑlo (Nɔren Runi-GEN garlands bring.down-PST-3SG) ‘Nɔren brought down Runi’s garlands’, with L* Hp on focused nɔren and a gradual fall to a final Li.]
Figure 21.5 Bengali intonation, focus on Runir. [Pitch track not reproduced: nɔren runir mɑlɑgulo nɑmɑlo ‘Nɔren brought down Runi’s garlands’, with a prenuclear L* Hp on nɔren, a higher focal L* Hp on runir, and a fall to a final Li.]
Assamese has a similar pattern. Twaha (2017) states that the focused element begins with a L* pitch accent and ends with a focus H boundary tone, which he labels as L*fHP. An example is given in (10).

(10) Assamese (Twaha 2017: 111)
     L* HP       L*        HP          L* fHP      LI
     [[rɔmɛn-ɛ]P [dɔrza-r]P [sabi-pat]P [milɔn-ɔk]P [di-lɛ]P]I
     Ramen-nom   door-gen   key-cl      Milan-acc   give-3sg.past
     ‘Ramen gave the door-key to Milan.’
Three further comments need to be added. First, if the focused constituent is longer than one prosodic word, the contour breaks up into L*H fHP. Second, if the focused constituent is towards the beginning of the sentence, the rest of the sentence ends with a general fall; thus, there is again post-focal deaccenting. Both of these are illustrated in example (11).
(11) Assamese (Twaha 2017: 106)
     L* HP      L*+H fHP
     [[madhɔb]P [kɔmɔla kha-bo-loi]P [khɔgɛn-ɔr ghɔr-ɔloi]P [go-isɛ]P]I   LI
     Madhab     orange eat-fut-dat   Khagen-gen house-dat   go-past.3sg
     ‘Madhab went to Khagen’s house to eat oranges.’

Third, if the final verb is focused, there will be a sharp drop to the end of the sentence, as shown in (12).

(12) Assamese (Twaha 2017: 105)
                                                  L* fHP      LI
     [[rɔmɛn-ɛ]P [dɔrza-r sabi-pat]P [milɔn-ɔk]P [di-lɛ]P]I
     Ramen-nom   door-gen key-cl     Milan-acc   give-past.3sg
     ‘Ramen gave the door-key to Milan.’
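The focus tunes in (9)–(12) lend themselves to a mechanical statement. The following is a minimal sketch in Python (our own illustration with toy tone labels and a hypothetical focus_tune function, not the authors’ formalism): prenuclear phrases carry an LH rise, the focused phrase carries the focus rise (Bengali L*Hp, Assamese L*fHP), post-focal phrases are deaccented, and the IP closes with LI.

    # Minimal sketch (illustrative, not the authors' formalism) of the
    # focus tunes in (9)-(12): prenuclear L* HP rises, a focus rise on the
    # focused phrase, post-focal deaccenting, and a final LI boundary tone.

    def focus_tune(phrases, focus=None, focus_rise=("L*", "fHP")):
        """Return (phrase, tones) pairs for a toy sentence with focus."""
        tune = []
        for i, phrase in enumerate(phrases):
            if focus is not None and i == focus:
                tones = list(focus_rise)
            elif focus is not None and i > focus:
                tones = []                      # post-focal deaccenting
            else:
                tones = ["L*", "HP"]            # prenuclear rise
            tune.append((phrase, tones))
        phrase, tones = tune[-1]
        tune[-1] = (phrase, tones + ["LI"])     # final boundary tone
        return tune

    # Early focus with post-focal deaccenting, cf. (11):
    for phrase, tones in focus_tune(
            ["madhɔb", "kɔmɔla kha-bo-loi", "khɔgɛn-ɔr ghɔr-ɔloi", "go-isɛ"],
            focus=1):
        print(phrase, tones)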
Except for the focusing of longer constituents, SCA and Bengali are very similar. In Bengali, a longish focused constituent such as runi-r malagulo would yield a smooth rise, while in SCA it would be broken up. For Hindi, Féry (2010: 294) argues that the focused contour is also the same LH, as ‘there was no change in the overall contour’. She reports Moore’s (1965) claim that a H boundary is placed after the focused constituent; this would be similar to Assamese and Bengali. According to Féry’s own work, however, the only real difference between focused and non-focused utterances is that if the focus falls on an early constituent, there is a greater phonetic pitch rise, while the post-focal constituent is lowered. Tamil is also claimed to have a LH pattern ending with HL on the verb in a neutral sentence. However, when the object is topicalized, the sentence ends with LH, as in (13).

(13) Tamil (Féry 2010: 306)
     LP HP        LP HP    LP HP     LP HI
     [[[niRaiya]P [ceer]P]I [[meeri]P [vaank-in-aaL]P]I
     many         chairs    Mary      buy-past-png
     ‘Mary bought many chairs.’

Similarly, Malayalam shows an identical pattern of LH ending with HL. Of course, many sentential and morphosyntactic options are available to mark focus. Nevertheless, from an intonation perspective, the Indo-Aryan and Dravidian languages appear to have very similar patterns.

Structural devices to mark focus differ across languages. West Germanic languages, for example, use pitch accents to indicate focused parts of sentences, while European Portuguese differentiates presentational focus from corrective focus by pitch accents, using H+L* for the former and H*+L for the latter. In Italian, narrow focus is not expressed through deaccenting within noun phrases. Japanese, in contrast, marks focus by preventing the lowering of the pitch range, which is otherwise automatic. Northern Bizkaian Basque can allow complex noun phrases with a single accent on the final word to have exclusive focus on a preceding word, as in [lagunen aMA] friend-gen.sg mother ‘The FRIEND’s mother’ (Elordieta and Hualde 2014). English marks focus for domains that may include unaccented words—for example, ‘Don’t move the dinner table, just move the kitchen table’ (focus ambiguous between kitchen and kitchen table). In contrast, Bengali can discriminate words even within compounds, as a consequence of the combination of phrasing and pitch accent marking. In a declarative sentence such as nirmal lalbari-r malik ‘Nirmal is the owner of the red house’, one could focus lalbari ‘the red house’ or just lal ‘red’, suggesting that Nirmal owns the red house (not the black one). The contour, given in (14), shows that when the compound lalbari is the focus, the pitch accent associated with [lɑl] is followed by a continuous rise to the beginning of malik, while when the focus is on lal, the L*H contour is only on that part of the compound.

(14) Bengali focus within compounds
a.          L*           HP
     nirmɔl lɑl bɑri-r mɑlik
     Nirmal redhouse-gen owner
     ‘Nirmal is the owner of the red house.’
b.          L*HP
     nirmɔl lɑl bɑri-r mɑlik
     Nirmal redhouse-gen owner
     ‘Nirmal is the owner of the red house [not the black one].’
21.5.3 Yes/no questions, with and without focus

In Bengali, as probably in most Indo-Aryan languages, the yes/no question tune carries a L* pitch accent and ends with a contour boundary tone HILI. The examples in (15) illustrate the contour.

(15) Yes/no contours
a.  L* HILI
    kɔfi?
    ‘Coffee?’
b.  L*            HILI
    ˈtʃʰeleʈi kʰub ˈlɔmbɑ?
    ‘Is the boy very tall?’
c.  L*                    HILI
    d̪id̪i-r d̪æor rɑnːɑ kɔre?   [d̪id̪id̪ːæor]
    ‘Does elder sister’s brother-in-law cook?’
As we can see, the generic contour in Bengali is LH, the important distinction being between H*, which signals a neutral declarative, and L*, which is used for focused declaratives as well as for yes/no questions. There now arises the question of how focused yes/no questions are prosodically distinct from broad-focus ones. If L* is also used to mark focus in interrogatives, and it is, how can a neutral yes/no question be distinguished from a focused one? Consider the orthographic string in the Bengali equivalent of the sentence ‘Has mother invited father’s friend?’ in (16) and the three tonal structures in (a), (b), and (c). Observe that in (16a) and (16b), the OCP has deleted the medial phrasal HP.

(16) Underlying versus surface patterns in yes/no questions
     mɑː-ki babɑ-r bondʰu-ke nemontonːo kor-e-tʃʰ-e
     mother-Qpart father-gen friend-obliq invitation do-perf-prog-3
     ‘Has mother invited father’s friend?’
a.   L*     HP                              HILI
     mɑː-ki bɑbɑ-r bondʰu-ke [nemontonːo]f kor-e-tʃʰ-e
b.   L*     HP                              HILI
     mɑ-ki [bɑbɑ-r bondʰu-ke]f nemontonːo kor-e-tʃʰ-e
c.   L*                                     HILI
     mɑː-ki [bɑbɑ-r bondʰu-ke nemontonːo kor-e-tʃʰ-e]P]I

Structure (16a) indicates a narrow focus on [nemontonːo] ‘invitation’—that is, the question is whether mother actually invited father’s friend (as opposed to the friend coming uninvited). In contrast, the focus on [bɑbɑ-r bondʰu-ke] ‘father’s friend’ in (16b) suggests that mother specifically invited this friend, and not someone else. The most generic reading is intended to be (16c), an answer to ‘What happened?’ Unfortunately, (16b) and (16c) are identical with respect to intonation. Given the OCP constraint, the HP of the focus phrase is deleted, an inevitable consequence in the context of the question tune.

The claim for Bengali has been that the focus tune is L*HP, which suggests that focus marking has three elements: (i) the L* pitch accent, (ii) a HP boundary tone, and (iii) a phrase boundary delimiting the focused part. The third element is relevant for the evaluation of an alternative account in which HP is interpreted as a prominence-marking tone. Selkirk (2007: 2) seeks to ‘eliminate focus-phrasing alignment constraints from the universal interface constraint repertoire and to reduce all nonmorphemic, phonological, reflexes of focus to reflexes of stress prominence’ (a typological treatment of prosodic focus is offered in chapter 31). Consequently, under this view there are no constraints like Align L/R(FocusP), where P is a prosodic constituent like φ. Since in our account the OCP and the phrase boundary requirement for the focus are independent phonological constraints, the prediction is that the OCP-triggered deletion of HP leaves the phrase boundary unaffected. In order to test this, we will look at phrasal assimilation rules: since these are bounded by phonological phrases, removal of the phrase boundary predicts an application of the assimilation rules within the merged phonological phrase. To illustrate the two scenarios, let us consider the four intonational structures provided for [mɑmɑ-r ʃɑli-r bie] ‘wedding of mother’s brother’s wife’s sister’ in (17), illustrated in Figure 21.6, focusing on the interaction between the intonation contour and the phonological phrase structure.
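The prediction just stated can be made concrete. The following is a minimal sketch in Python (a toy representation of our own, with a hypothetical question_surface function and rough romanizations of (16)): the focus HP is deleted before the final HILI, so the narrow-focus and broad-focus questions come out tonally identical, but the focus-marking phrase boundary is recorded separately and survives, remaining visible to segmental rules.

    # Minimal sketch (toy representation, not the authors' formalism) of
    # the point made with (16): OCP deletion of the focus HP before the
    # final HILI neutralizes the tunes of (16b) and (16c), but the
    # focus-marking phrase boundary is an independent object that survives.

    def question_surface(phrases, focus=None):
        """Return (surface tones, obligatory phrase-boundary positions)."""
        tones = ["L*"]
        boundaries = []
        if focus is not None:
            boundaries.append(focus)   # boundary closes off the focused phrase
            # its HP would sit next to the final HI, so the OCP deletes it
        tones += ["HI", "LI"]
        return tones, boundaries

    sentence = ["ma-ki", "babar bondhuke", "nemontonno koreche"]
    print(question_surface(sentence, focus=1))  # (['L*', 'HI', 'LI'], [1])
    print(question_surface(sentence))           # (['L*', 'HI', 'LI'], [])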
(17) mɑmɑ-r ʃɑli-r bie
     mother-brother-gen sister-wife-gen wedding
     ‘Mother’s brother’s wife’s sister’s wedding’
Figure 21.6a is the neutral declarative, as presented earlier in (5) and (6), with [bie] carrying the last H*Li pitch accent and [mɑmɑr], [ʃɑlir], and [bie] occurring in separate φs, whereby the Hp of [ʃɑlir] was deleted by the OCP. The failure of [rʃ] to be assimilated to [ːʃ] at the boundary between [mɑmɑr] and [ʃɑlir] shows that this assimilation rule cannot apply across a phonological phrase boundary. As expected, deaccenting [ʃɑlir] and [bie] to mark an early focus on [mɑmɑr], shown in panel b, cannot alter the status of that boundary, since the focus is ended by a phonological phrase boundary. While this boundary is here overtly marked by Hp, an early focus in an interrogative sentence will not be able to retain its Hp, because it will be deleted by the OCP, as triggered by the final HILI. Panel d shows the interrogative counterpart to panel b, and it does not show the assimilation, indicating the presence of a phonological phrase boundary. For the assimilation to apply, the sentence must have neutral focus and be spoken without the now optional boundaries, as in panel c.
Figure 21.6 Four prosodic structures for Bengali [mɑmɑ-r ʃɑli-r bie] ‘Mother’s brother’s wife’s sister’s wedding’. The neutral, broad-focus declarative (a); the declarative with focus on [mɑmɑ-r] ‘It is Mother’s brother’s wife’s sister’s wedding’ (b); the neutral, broad-focus yes/no question (c); the yes/no question with focus on [mɑmɑ-r] ‘Is it Mother’s brother’s wife’s sister’s wedding?’ (d). Only in (c) can r-coronal assimilation go through, since there are focus-marking phonological phrase boundaries after [mɑmɑr] in (b) and (d), and an optional, tone-marked phonological phrase boundary in (a). [Annotated pitch tracks not reproduced.]
Thus, while the surface tunes in panels c and d are identical (L*HILI), the phrasing is not: r-coronal assimilation potentially applies in cases such as panel c, φ-structure in neutral sentences being optional, whereas in panel d the phrase break before [ʃɑlir] is obligatory, since it comes between focused [mɑmɑ-r] and post-focal [ʃɑlir bie], blocking assimilation.8 This means that phonological phrases, while being restricted to maximally one pitch accent each, can be unaccented. The question of the discriminability of yes/no questions with broad and narrow focus, and the possible roles of r-coronal assimilation and pitch range in it, remains a topic for future research.
21.6 Segmental rules and phrasing

Despite their evidential potential, segmental processes are rarely appealed to for confirmation of prosodic phrasing. Hayes and Lahiri (1991) and Lahiri and Fitzpatrick-Cole (1999) provide evidence from assimilation rules in support of phonological words and phrases. These include voicing assimilation as well as the r-coronal assimilation discussed in §21.5.3. Other Indian languages have also been reported to have phonological processes constrained to apply only within phonological phrases. Twaha (2017) provides evidence of aspirate spirantization and flapping, in addition to voicing assimilation and /r/ assimilation. In NVA, r-coronal assimilation within a phonological phrase is triggered by a focus constituent, as shown in (21).

(21) Assamese (Twaha 2017: 136)
     rɔmɛnɛ (dɔrzar sabipat milɔnɔk dilak)
     Ramen-nom door-gen key-cl Milan-acc give-past.3sg
     ‘Ramen gave the door-key to Milan.’
Furthermore, intervocalic spirantization can apply across prosodic word boundaries, but it has to be phrase-internal. Thus, aspirate spirantization occurs in (22b) but not in (22a). Therefore, focus-governed phrasing can also constrain phonological rules in Assamese.

(22) Assamese (Twaha 2017: 137–138)
a. [[rɔmɛn-ɛ]P [makhɔn-ɔr ghɔr-ɔt]P [kɔmla]P [kʰa-ba ge-isi]P]I    spirantization does not apply
b. [[rɔmɛn-ɛ]P [makhɔn-ɔr ghɔr-ɔt]P [kɔmla kʰa-ba ge-isi]P]I      spirantization applies: [kʰ] > [x]
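The logic of a phrase-bounded rule can be sketched directly. The following minimal sketch in Python (our own illustration with hypothetical, simplified strings; ‘#’ marks prosodic-word boundaries) applies the spirantization only when its intervocalic context is met inside a single phonological phrase, so the phrasing alone decides whether the rule goes through.

    # Minimal sketch (hypothetical strings, simplified segments) of a
    # phrase-bounded rule like (22): aspirate spirantization kʰ > x applies
    # between vowels across prosodic-word boundaries ('#'), but only inside
    # a single phonological phrase.

    import re

    def spirantize(phrases):
        """Apply kʰ > x when intervocalic within a phonological phrase."""
        rule = re.compile(r"(?<=[aeiouɔ])(#?)kʰ(?=[aeiouɔ])")
        return [rule.sub(r"\1x", phrase) for phrase in phrases]

    # (22b): kɔmla and kʰa-ba are phrased together, so the rule applies:
    print(spirantize(["rɔmɛnɛ", "makhɔnɔr#ghɔrɔt", "kɔmla#kʰaba#geisi"]))
    # (22a): kʰa-ba heads its own phrase; no phrase-internal V_V context:
    print(spirantize(["rɔmɛnɛ", "makhɔnɔr#ghɔrɔt", "kɔmla", "kʰaba#geisi"]))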
8 To test this, in a pilot study we examined all occurrences of /r/ plus coronal obstruent sequences in focused questions of the type shown in panel d of Figure 21.6 in the oral rendition of the book Epar Bangla Opar Bangla (Bengal on this side and the other) (Fitzpatrick-Cole and Lahiri 1997). There was a total of 11,205 tokens with various coronal obstruents (stops, [tʃ], [ʃ]). Since r-coronal assimilation leads to complete gemination, it was easy to measure the duration of the closure, or of the frication for the stridents. Our results were as follows: 87% had a complete sonorant /r/, in 5% it was difficult to tell whether the /r/ was there or not, and 8% had complete assimilation. Thus, even though there was no H boundary tone, the phrase boundary was there and blocked the assimilation.
21.7 Conclusion

The Indian subcontinent is vast. Not only are there many languages, but there are also at least four language families. Unlike in Germanic, stress does not play a very important role in most of these languages in terms of marking lexical contrast, and minimal pairs such as tórment (N) and tormént (V) do not occur. Nevertheless, stress does play a role in pitch accent association as well as in distributional constraints on segments; for example, nasal and oral vowels contrast in Bengali only in stressed position, and hence only word-initially. Tonal contrasts are very rare; Punjabi is the only language that has been claimed to have developed tone. The intonational systems are very comparable. In general, the basic contour appears to be LH, for prenuclear as well as focus tunes. Unsurprisingly, the pitch accents are aligned to the most prominent syllable, which may not be word-initial (e.g. Hindi; Khan 2016). Recent cross-linguistic studies confirm that, despite variation, the data are compatible with accentual phrases (also equated with phonological phrases) that begin with a L pitch accent and end with a H (Khan 2016, based on ‘The North Wind and the Sun’ in six languages; Deo and Tonhauser 2018, based on data from Chodri, Marathi, and Gujarati). Nevertheless, Khan argues that this is clearer in the Indo-Aryan languages than in Dravidian. The end of the focus seems generally to be demarcated by a H boundary tone. Moreover, sequences of identical tones appear to be generally disallowed. Finally, we have seen that segmental processes are also bound by the accentual or phonological phrase in many languages.
Acknowledgements

We are grateful to Shakuntala Mahanta, who very kindly provided a copy of Twaha (2017), and to Gadakgar Manjiri for providing references. The research was partially supported by the European Research Council Advanced Grant MORPHON 695481, PI Aditi Lahiri.
Chapter 22: China and Siberia
Jie Zhang, San Duanmu, and Yiya Chen
22.1 Introduction

This chapter provides a summary of the prosodic systems of languages in Northern Asia, including varieties of Chinese spoken in mainland China and Taiwan as well as languages in Siberia, in particular Ket. A common theme in the prosody of these languages is their ability to use pitch to cue lexical meaning differences—that is, they are tone languages. The well-known quadruplet ma55/ma35/ma214/ma51 ‘mother/hemp/horse/to scold’1 in Standard Chinese is an exemplification of the tonal nature of the languages in this area. We start with a brief discussion of the typology of syllable and tonal inventories in Chinese languages (§22.2). These typological properties lead to three unique aspects of prosody in these languages: the prevalence of complex tonal alternations, also known as ‘tone sandhi’ (§22.3); the interaction between tone and word- and phrase-level stress (§22.4); and the interaction between tone and intonation (§22.5). The prosodic properties of Ket are discussed briefly in §22.6. The last section provides a summary (§22.7).
22.2 The syllable and tone inventories of Chinese languages

The maximal syllable structure of Chinese languages is CGVV or CGVC (C = consonant; G = glide; VV = long vowel or diphthong) (Duanmu 2008: 72). The syllabic position of the prenuclear glide is controversial: it has been analysed as part of the onset (Duanmu 2007, 2008, 2017), as part of the rime (Wang and Chang 2001), as occupying a position of its own (van de Weijer and Zhang 2008), or as variably belonging to the onset or the rime depending on the language, the phonotactic constraints within a language, and the speaker (Bao 1990; Wan 2002; Yip 2003). Yip (2003) specifically used the ambiguous status of the prenuclear glide as an argument against the subsyllabic onset-rime constituency.

1 Tones are transcribed in Chao numbers (Chao 1948, 1968), where ‘5’ and ‘1’ indicate the highest and lowest pitches in the speaker’s pitch range, respectively. Juxtaposed numbers represent contour tones; for example, ‘51’ indicates a falling tone from the highest pitch to the lowest pitch.
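The Chao-number convention in footnote 1 has a straightforward computational reading. The following is a minimal sketch in Python (our own illustration, not part of the chapter), encoding the §22.1 quadruplet as tuples of pitch levels from 1 (lowest) to 5 (highest) and classifying the resulting contours crudely.

    # Minimal sketch (illustrative, not from the chapter): Chao tone
    # numbers as tuples of pitch levels, 1 = lowest and 5 = highest in
    # the speaker's range; lexicon is the Standard Chinese quadruplet.

    CHAO = {
        "ma55":  ("mother",   (5, 5)),     # high level
        "ma35":  ("hemp",     (3, 5)),     # rising
        "ma214": ("horse",    (2, 1, 4)),  # dipping (falling-rising)
        "ma51":  ("to scold", (5, 1)),     # falling
    }

    def describe(levels):
        """Crude contour classification from the pitch-level sequence."""
        if len(set(levels)) == 1:
            return "level"
        if list(levels) == sorted(levels):
            return "rising"
        if list(levels) == sorted(levels, reverse=True):
            return "falling"
        return "contour (e.g. dipping)"

    for syllable, (gloss, levels) in CHAO.items():
        print(syllable, gloss, describe(levels))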
The coda inventory is reduced to different degrees, from Northern dialects, in which only nasals and occasionally [ɻ] are legal, to southern dialects (e.g. Wu, Min, Yue, Hakka), where the stops [p, t, k, ʔ] may also appear in addition to nasals. Syllables closed by a stop are often referred to as ‘checked syllables’ (ru sheng) in Chinese phonology, and they are considerably shorter than non-checked (open or sonorant-closed) syllables. There are typically three to six contrastive tones on non-checked syllables in Chinese dialects. On checked syllables, the tonal inventory is reduced: one or two tones are common, and three tones are occasionally attested. Table 22.1 illustrates the tonal inventories on non-checked and checked syllables in Shanghai (Wu), Fuzhou (Min), and Cantonese (Yue).
Table 22.1 Tonal inventories in three dialects of Chinese

    Dialect                             Non-checked syllables     Checked syllables
    Cantonese (Matthews and Yip 1994)   55, 33, 22, 35, 21, 23    5, 3, 2
    Shanghai (Zhu 2006)                 52, 34, 14                4, 24
    Fuzhou (Liang and Feng 1996)        44, 53, 32, 212, 242      5, 23
22.3 Tone sandhi in Chinese languages

A prominent aspect of the prosody of Chinese languages is that they often have a complex system of ‘tone sandhi’, whereby tones alternate depending on the adjacent tones or on the prosodic/morphosyntactic environment in which they appear (Chen 2000; Zhang 2014). Two examples of tone sandhi, from Standard Chinese and Xiamen (Min), are given in (1). In Standard Chinese, T3 214 becomes T2 35 before another T3;2 in Xiamen, tones undergo regular changes whenever they appear in non-final positions in a syntactically defined tone sandhi domain (Chen 1987; Lin 1994).

(1) Tone sandhi examples
a. Tonally induced tone sandhi in Standard Chinese
   214 → 35 / ___ 214
b. Positionally induced tone sandhi on non-checked syllables in Xiamen (a circular chain shift), in non-final positions of the tone sandhi domain:
   24 → 22 → 21 → 53 → 44 → 22

2 This is a vast simplification. While in identification tasks T2 is indistinguishable from the sandhi tone for T3 (e.g. Wang and Li 1967; Peng 2000), recent phonetic, psycholinguistic, and neurolinguistic evidence indicates that the sandhi tone for T3 is neither acoustically identical to T2 (e.g. Peng 2000; Yuan and Chen 2014) nor processed the same way as T2 in online spoken word processing (e.g. Li and Chen 2015; Nixon et al. 2015).
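The two rule types in (1) can be modelled directly. The following is a minimal sketch in Python (our own illustration, not from the chapter): the Standard Chinese third-tone rule as a context-sensitive change, and the Xiamen circle as a lookup applied to every non-final syllable of the sandhi domain.

    # Minimal sketch (illustrative, not from the chapter) of the two rule
    # types in (1), with tones written as Chao-number strings.

    XIAMEN_CIRCLE = {"24": "22", "22": "21", "21": "53", "53": "44", "44": "22"}

    def mandarin_t3(tones):
        """Apply 214 -> 35 before another 214 (scanning right to left)."""
        out = list(tones)
        for i in range(len(out) - 2, -1, -1):
            if out[i] == "214" and out[i + 1] == "214":
                out[i] = "35"
        return out

    def xiamen(tones):
        """Replace every non-final tone by its sandhi form; keep the last."""
        return [XIAMEN_CIRCLE.get(t, t) for t in tones[:-1]] + tones[-1:]

    print(mandarin_t3(["214", "214"]))  # ['35', '214']
    print(xiamen(["53", "24", "44"]))   # ['44', '22', '44']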
Tone sandhi patterns can generally be classified as ‘left-dominant’ or ‘right-dominant’. Right-dominant sandhi, found in most Southern Wu, Min, and Northern dialects, preserves the
base tone on the final syllable in a sandhi domain and changes the tones on non-final syllables; left-dominant sandhi, typified by Northern Wu dialects, preserves the tone on the initial syllable (Yue-Hashimoto 1987; Chen 2000; Zhang 2007, 2014). It has been argued that there is a directional asymmetry in sandhi behaviour: right-dominant sandhi tends to involve local or paradigmatic tone change, while left-dominant sandhi tends to involve the extension of the initial tone rightward (Yue-Hashimoto 1987; Duanmu 1993; Zhang 2007). We have seen in (1) that the tone sandhi patterns in both Standard Chinese and Xiamen are right-dominant and involve local paradigmatic tone change. In the left-dominant Shanghai tone sandhi pattern in (2), however, the tone on the first syllable is spread across the disyllabic word, neutralizing the tone on the second syllable (Zhu 2006).

(2) Shanghai tone sandhi for non-checked tones
    52-X → 55-31
    34-X → 33-44
    14-X → 11-14

Zhang (2007) argued that the typological asymmetry is due to two phonetic effects. One is that the prominent positions in the two types of dialect have different phonetic properties: the final position in right-dominant systems has longer duration and can maintain the contrastive tonal contour locally; the initial position in left-dominant systems has shorter duration and therefore needs to allocate the tonal contour over a longer stretch in the sandhi domain. The other is the directionality effect of tonal coarticulation, which tends to be perseverative and assimilatory; the phonologization of this type of coarticulatory effect could then potentially lead to a directional asymmetry in tone sandhi. Duanmu (1993, 1994, 1999, 2007), on the other hand, argued that the difference stems from the syllable structure, and hence stress pattern, difference between the two types of languages, as discussed in §22.4.

Despite these typological tendencies, phonetically arbitrary tone sandhi patterns abound in Chinese dialects. For instance, the circular chain shift in the Xiamen pattern (1b) has no phonotactic, and hence no phonetic, motivation, as the base tone itself is not phonotactically illegal in the sandhi position. Left-dominant sandhi, likewise, often has phonetic changes that cannot be predicted by a straightforward tone-mapping mechanism. In Wuxi (Wu), for example, the tone on the initial syllable of a word needs to be first replaced with another tone before it spreads rightward (Chan and Ren 1989), and Yan and Zhang (2016) argued that the tone substitution involves a circular chain shift, as in (3).

(3) Wuxi tone sandhi for non-checked tones with voiceless initials
    53-X  → 43-34   (Falling → Dipping)
    323-X → 33-44   (Dipping → Rising)
    34-X  → 55-31   (Rising → Falling)
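The two-step character of the Wuxi pattern can be made explicit. The following minimal sketch in Python (our own illustration, assuming—as in our reading of (3)—that the substitution cycle is Falling 53 → Dipping 323 → Rising 34 → Falling 53) factors the process into a circular substitution step and a spreading step; the Shanghai pattern in (2) needs only the spreading step, and in both dialects the second syllable’s base tone is irrelevant.

    # Minimal sketch (illustrative, not the authors' formalism) of the
    # left-dominant patterns in (2) and (3): the output of a disyllable
    # depends only on the first syllable's base tone; Wuxi adds a
    # circular substitution step before spreading.

    SHANGHAI_SPREAD = {"52": "55-31", "34": "33-44", "14": "11-14"}  # from (2)

    WUXI_SUBSTITUTE = {"53": "323", "323": "34", "34": "53"}  # circular shift
    WUXI_SPREAD = {"323": "43-34", "34": "33-44", "53": "55-31"}  # from (3)

    def shanghai(first_tone, second_tone):
        """Second syllable's base tone is neutralized; only the first matters."""
        return SHANGHAI_SPREAD[first_tone]

    def wuxi(first_tone, second_tone):
        return WUXI_SPREAD[WUXI_SUBSTITUTE[first_tone]]

    print(shanghai("34", "52"))  # 33-44, whatever the second base tone is
    print(wuxi("53", "34"))      # 43-34: 53 -> 323 (substitution), then spread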
The phonetic arbitrariness and complexity of the synchronic tone sandhi patterns raise the question of whether all of these patterns are equally productive and learnable for speakers. This question has been investigated using ‘wug’ tests in a long series of work since the 1970s. For instance, Hsieh (1970, 1975, 1976), Wang (1993), and Zhang et al. (2011a) have shown that the circular chain shift in Taiwanese Southern Min (a very similar pattern to Xiamen in (1b)) is not fully productive. Yan and Zhang (2016) and Zhang and Meng (2016) provided comparisons between Shanghai and Wuxi tone sandhi and showed that the Shanghai pattern is
generally productive; for Wuxi, the spreading aspect of the tone sandhi is likewise productive, but the substitution aspect of the sandhi is unproductive, owing to its circular-chain-shift nature. The relevance of phonetic naturalness to tone sandhi productivity has also been investigated in non-chain-shift patterns. Zhang and Lai (2010), for instance, tested the productivity difference between the phonetically less natural third-tone sandhi and the more natural half-third sandhi in Standard Chinese and showed that, although both apply consistently to novel words, the former involves incomplete application of the sandhi phonetically and is thus less productive. In general, the productivity studies of tone sandhi demonstrate that, to understand how native speakers internalize the complex sandhi patterns of their language, we need to look beyond the sandhi patterns manifested in the lexicon and consider methods that more directly tap into speakers’ tacit generalizations. In our current understanding, the synchronic grammar of tone sandhi likely includes both productive derivations from the base tone to the sandhi tone and listed sandhi-tone allomorphs, depending on the nature of the sandhi.
22.4 Lexical and phrasal stress in Chinese languages

We begin with word stress. All monosyllabic content words in Chinese occur in heavy syllables, are long and stressed, and have a lexical tone, such as lian214 脸 ‘face’. Function words can have stress and carry a lexical tone too, but they often occur in light syllables, are short and unstressed, and have no lexical tone, such as the aspect marker le. The pattern is captured by the generalizations in (4) and (5).

(4) Metrical structure in monosyllables (Hayes 1995)
    A heavy syllable has two morae, forms a moraic foot, and is stressed.
    A light syllable has one mora, cannot form a foot, and has no stress.

(5) The Tone-Stress Principle (Liberman 1975; Goldsmith 1981; Duanmu 2007)
    A stressed syllable can be assigned a lexical tone.
    An unstressed syllable is not assigned a lexical tone.
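Generalizations (4) and (5) chain together: weight determines footing and stress, and stress in turn licenses a lexical tone. The following is a minimal sketch in Python (our own illustration, assuming a toy triple representation of a syllable), in which a tone on an unstressed syllable is flagged as a Tone-Stress Principle violation.

    # Minimal sketch (our illustration) of (4) and (5): bimoraic syllables
    # form a moraic foot and are stressed; only stressed syllables may
    # carry a lexical tone.

    def analyse(syllable):
        """syllable: (segments, weight_in_morae, lexical_tone_or_None)."""
        segments, morae, tone = syllable
        stressed = morae >= 2            # (4): bimoraic = moraic foot = stress
        if tone is not None and not stressed:
            raise ValueError(f"{segments}: lexical tone on an unstressed "
                             "syllable violates the Tone-Stress Principle (5)")
        return {"syllable": segments, "foot": stressed,
                "stress": stressed, "tone": tone}

    print(analyse(("lian", 2, "214")))   # heavy content syllable with tone
    print(analyse(("le", 1, None)))      # light, toneless aspect marker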
In two-syllable words or compounds, stress patterns are more complicated. Three degrees of stress can be distinguished, represented in (6) and (7) as S (strong), M (medium), and L (light or unstressed). Tones are omitted, since they differ from dialect to dialect.

(6) Stress patterns in final positions

    Variety    Stress type   Example in Pinyin   Gloss
    Beijing    MS (67%)      da-xue 大学         ‘university’
               SM (17%)      bao-dao 报道        ‘report’
               SL (14%)      ma-ma 妈妈          ‘mom’
    Chengdu    SM            da-xue 大学         ‘university’
               SL            ma-ma 妈妈          ‘mom’
    Shanghai   SL            da-xue 大学         ‘university’
(7) Stress patterns in non-final positions

    Variety    Stress type   Example in Pinyin   Gloss
    Beijing    SM            da-xue 大学         ‘university’
               SL            ma-ma 妈妈          ‘mom’
    Chengdu    SM            da-xue 大学         ‘university’
               SL            ma-ma 妈妈          ‘mom’
    Shanghai   SL            da-xue 大学         ‘university’
Stress in Beijing is analysed based on Yin (1982). Stress in Chengdu has a robust phonetic realization in syllable duration (Ran 2011). Stress in Shanghai is realized in both syllable duration (Zhu 1995) and tone sandhi (Xu et al. 1988). In Chengdu and Shanghai, the stress patterns remain the same whether the position is final or non-final. In Beijing, however, MS is found in final position only, and it changes to SM in non-final positions. For example, da-XUE ‘university’ is MS when final but SM when non-final, as in DA-xue jiao-SHI ‘university teacher’ (uppercase indicates S). The patterns raise three questions, given in (8).

(8) Three questions to explain
a. Out of nine possible combinations (SS, SM, SL, MS, MM, ML, LS, LM, and LL), why are only SM and SL found in non-final positions?
b. Why is MS found in the final position only?
c. How do we account for dialectal differences?

(8a) is explained by (9), which allows (SM) and (SL) but not *MM, *ML, *LM, *LL (no main stress), or *SS (two main stresses).

(9) Constraint on stress patterns in non-final positions
    Chinese has the syllabic trochee.

(8b) is explained by (10), where 0 is an empty beat, realized as either a pause or lengthening of the preceding syllable.

(10) Stress shift
     (SM) → M(S0) / __ #

(8c) is related to the complexity of syllable rimes. As shown in (11), Beijing has the most complex rimes and Shanghai the simplest.

(11) Rime complexity
    Variety    Diphthongs   [-n, -ŋ] contrast
    Beijing    Yes          Yes
    Chengdu    Yes          No
    Shanghai   No           No
Rime complexity can explain differences in stress patterns: (12a) explains why stress shift occurs in Beijing but not in Chengdu or Shanghai. (12b) explains why Shanghai has S and L but no M.
(12) Stress and rime complexity
a. Stress shift occurs in languages that have both diphthongs and contrastive codas.
b. Languages without diphthongs or contrastive codas have no inherent heavy syllables.

There is a common view that a language can only choose one foot type (Hayes 1995), which seems to contradict our assumption that Chinese has both moraic trochees and syllabic trochees. However, a standard assumption in metrical phonology is that multiple tiers of metrical constituents are needed, as in the analysis of main and secondary word stress in English. The foot type of a language is simply the foot type at the lowest level of the metrical structure. Our analysis suggests that the lowest metrical tier in Chinese is the moraic foot. Monomorphemic words longer than two syllables are mostly foreign names, in which binary feet are built from left to right; see the sketch after (13). Some examples from Shanghai are shown in (13), transcribed in Pinyin, where uppercase indicates stress.

(13) Stress in polysyllabic foreign names in Shanghai
     ZI-jia-ge 芝加哥 ‘Chicago’
     DE-ke-SA-si 德克萨斯 ‘Texas’
     JIA-li-FO-ni-ya 加利福尼亚 ‘California’
     JE-ke-SI-luo-FA-ke 捷克斯洛伐克 ‘Czechoslovakia’
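The left-to-right footing behind (13) is a simple pairing procedure. The following minimal sketch in Python (our own illustration; the function name foot_left_to_right is hypothetical) pairs syllables into trochees from the left and leaves a leftover final syllable unfooted and unstressed.

    # Minimal sketch (our illustration) of the left-to-right binary
    # footing in (13): even-indexed syllables head a syllabic trochee
    # when a complete binary foot can be built; a leftover odd syllable
    # at the end remains unfooted.

    def foot_left_to_right(syllables):
        """Uppercase the head of each complete left-to-right binary foot."""
        out = []
        for i, syll in enumerate(syllables):
            heads_foot = (i % 2 == 0) and (i + 1 < len(syllables))
            out.append(syll.upper() if heads_foot else syll)
        return "-".join(out)

    print(foot_left_to_right(["zi", "jia", "ge"]))              # ZI-jia-ge
    print(foot_left_to_right(["jia", "li", "fo", "ni", "ya"]))  # JIA-li-FO-ni-ya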
Let us now consider phrasal stress. Chomsky and Halle (1968) proposed two cyclic rules for English, shown in (14).

(14) Phrasal stress (Chomsky and Halle 1968)
     Nuclear Stress Rule: In a phrase [A B], assign stress to B.
     Compound Stress Rule: In a compound [A B], assign stress to B if it is branching; otherwise assign stress to A.

The rules have been reinterpreted as a single rule and extended to other languages (Gussenhoven 1983a, 1983c; Duanmu 1990, 2007; Cinque 1993; Truckenbrodt 1995; Zubizarreta 1998). In (15), X is a syntactic head and XP a syntactic phrase.

(15) Stress-XP (Truckenbrodt 1995)
     In a syntactic unit [X XP] or [XP X], XP is assigned phrasal stress.
Stress-XP can be noncyclic. A comparison of the cyclic Compound Stress Rule (CSR) and noncyclic Stress-XP is shown in (16), with English compounds.

(16) A comparison of cyclic CSR and noncyclic Stress-XP

                     whale-oil lamp        law-school language-exam
     CSR, cycle 2    x                            x
     CSR, cycle 1    x                     x      x
     Stress-XP       x                     x      x
                     [[XP X] X]            [[XP X][XP X]]
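The one-step character of Stress-XP can be made concrete. The following is a minimal sketch in Python (our own toy encoding of the bracketed structures in (16); the function stress_xp is hypothetical): a single walk over the structure assigns one grid mark per XP, with no cycles.

    # Minimal sketch (our illustration) of noncyclic Stress-XP (15):
    # structures are nested tuples, with leaves ('XP', word) or
    # ('X', word); every XP leaf gets one grid mark in a single pass.

    def stress_xp(node, grid=None):
        """Return [(word, marks)] with one mark per XP, as Stress-XP assigns."""
        if grid is None:
            grid = []
        label = node[0]
        if label in ("X", "XP"):
            grid.append([node[1], 1 if label == "XP" else 0])
        else:
            stress_xp(node[0], grid)
            stress_xp(node[1], grid)
        return grid

    # whale-oil lamp = [[XP X] X]; law-school language-exam = [[XP X][XP X]]
    print(stress_xp(((("XP", "whale"), ("X", "oil")), ("X", "lamp"))))
    # [['whale', 1], ['oil', 0], ['lamp', 0]]
    print(stress_xp(((("XP", "law"), ("X", "school")),
                     (("XP", "language"), ("X", "exam")))))
    # [['law', 1], ['school', 0], ['language', 1], ['exam', 0]]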
On cycle 1, CSR assigns stress to the left in whale-oil, law-school, and language-exam. On cycle 2, CSR assigns stress to whale-oil (because lamp is not branching) and to language-exam (because it is branching). In contrast, Stress-XP assigns stress to each XP in one step. There are three differences between the analyses. First, as Gussenhoven (1983a, 1983b) notes, Stress-XP achieves the result in one step, while CSR cannot. Second, CSR produces many stress levels, while Stress-XP produces far fewer, in support of Gussenhoven (1991). Third, in law-school language-exam, CSR assigns more stress to language, while Stress-XP assigns equal stress to law and language. In what follows, we shall consider Stress-XP only, since it is the simpler theory and can account for all the Chinese data.

Now consider a compound and a phrase in Shanghai Wu (Xu et al. 1988; Duanmu 1999), shown in (17). The foot/weight tier shows foot boundaries and syllable weight (H for heavy, L for light, and 0 for an empty beat). On the tone tiers, H means high and L means low.