Second Language Prosody and Computer Modeling (Routledge Studies in Applied Linguistics) [1 ed.] 0367901129, 9780367901127



English Pages 188 [189] Year 2021


Table of contents :
Half Title
Series Information
Title Page
Copyright Page
Table of Contents
Organization of the Book
Part I Linguistic Foundations of Prosody
1 Overview of Prosody
1.1 What Is Prosody?
1.2 The Role of Prosody in Discourse
1.3 History of Prosodic Approaches
1.4 The British Tradition
1.5 The American Tradition
1.6 Summary
2 Frameworks of Prosody
2.1 Two Prosodic Frameworks
2.2 David Brazil’s Framework
Tone Unit
Context of Interaction
Key and Termination
2.3 Janet Pierrehumbert’s and Julia Hirschberg’s Prosodic Framework
2.4 Summary
3 Prosodic Analyses of Natural Speech
3.1 Second Language (L2) Prosody
3.2 Segmental Properties in Discourse
3.3 Measuring Segmental Properties
3.3.1 Measuring Segmental Accuracy
3.3.2 Measuring Vowel Space
3.3.3 Measuring Vowel Duration
3.3.4 Measuring Voice Onset Time
3.4 Fluency in Discourse
3.5 Measuring Fluency
3.6 Word Stress in Discourse
3.7 Measuring Word Stress
3.8 Sentence Prominence in Discourse
3.9 Measuring Sentence Prominence
3.10 Pitch and Intonation in Discourse
3.11 Measuring Pitch and Intonation
3.12 Proficiency and Intelligibility
3.13 Summary
Part II Computer Applications of Prosody
4 Computerized Systems for Syllabification
4.1 Syllables and Automatic Syllabification
4.2 Machine Learning
4.3 Acoustic Algorithms for Syllabification
4.4 Phonetic Algorithms for Syllabification
4.4.1 Rule-Based Phonetic Algorithms
4.4.2 Data-Driven Phonetic Algorithms
4.5 Data-Driven Phonetic Syllabification Algorithm Implementations
4.5.1 Corpora
TIMIT Corpus
Boston University Radio News Corpus (BURNC)
4.5.2 Converting Audio Files to Noisy Phonetic Sequences
4.5.3 Syllable Alignment Error
4.5.4 Syllabification-By-Grouping
4.5.5 Sonority Scale
4.5.6 Syllabification By HMM
4.5.7 Syllabification By K-Means Clustering
4.5.8 Syllabification By Genetic Algorithm
4.5.9 Comparison of Syllabification Algorithms
4.6 Summary
5 Computerized Systems for Measuring Suprasegmental Features
5.1 Prominent Syllables
5.2 Pitch Contour Models
5.2.1 TILT Pitch Contour Model
5.2.2 Bézier Pitch Contour Model
5.2.3 Quantized Contour Model (QCM) Pitch Contour Model
5.2.4 4-Point Pitch Contour Model
5.3 Algorithms for Detecting Suprasegmental Features of the ToBI Model
5.3.1 ToBI (Tones and Break Indices) Labeling Scheme
5.3.2 Supervised Machine Learning Algorithms
5.3.3 Unsupervised Machine Learning Algorithms
5.3.4 Summary of Algorithms for Detecting Suprasegmental Features of the ToBI Model
5.4 Algorithms for Detecting Suprasegmental Features Motivated By Brazil’s Model
5.4.1 Algorithms for Detecting Prominent Syllables
5.4.2 Algorithms for Detecting Tone Choice
5.4.3 Algorithms for Detecting Tone Unit
5.4.4 Algorithms for Detecting Relative Pitch
5.4.5 Summary of Algorithms for Detecting Suprasegmental Features of Brazil’s Model
5.5 Algorithms for Calculating Suprasegmental Measures
5.6 Summary
6 Computer Models for Predicting Oral Proficiency and Intelligibility
6.1 Kang and Johnson Computer Model for Automatically Scoring Oral Proficiency
6.1.1 Cambridge English Language Assessment (CELA) Corpus
6.1.2 Step 1: Translate the Sound Recording Into Phones and Silent Pauses
6.1.3 Step 2: Partition the Phones and Silent Pauses Into Tone Units
6.1.4 Step 3: Syllabify the Phones
6.1.5 Step 4: Locate the Filled Pauses
6.1.6 Step 5: Identify the Prominent Syllables
6.1.7 Step 6: Determine the Tone Choice
6.1.8 Step 7: Calculate the Relative Pitch
6.1.9 Step 8: Compute Suprasegmental Measures
6.1.10 Step 9: Estimate Oral Proficiency Score
6.2 Zechner et al.’s (2009) Multiple-Regression Model for Automatically Scoring Oral Proficiency
6.3 Zechner et al.’s (2009) Classification and Regression Trees (CART) Model for Automatically Scoring Oral Proficiency
6.4 Linear Regression Model for Automatically Scoring Oral Proficiency
6.5 Automated Evaluation of Non-Native English Pronunciation Quality
6.6 Johnson and Kang Computer Model for Automatically Scoring Intelligibility
6.6.1 World Englishes Speech Corpus
6.6.2 Computer Model for Predicting Intelligibility Scores
6.7 Comparison of Feature Selection Methods for Automated Speech Analysis Applications
6.7.1 Corpus
6.7.2 Feature Sets
6.7.3 Proficiency Score Predictions
6.8 Summary
Part III The Future of Prosody Models
7 Future Research and Applications
7.1 Future Research and Directions
7.2 Critical Issues in ASR-Based Applications
7.3 Future Applications of Prosodic Models
7.4 Summary
Useful Resources

Citation preview


Second Language Prosody and Computer Modeling

This volume presents an interdisciplinary approach to the study of second language prosody and computer modeling. It addresses the importance of prosody’s role in communication, bridging the gap between applied linguistics and computer science. The book illustrates the growing importance of the relationship between automated speech recognition systems and language learning assessment in light of new technologies and showcases how the study of prosody in this context in particular can offer innovative insights into the computerized process of natural discourse. The book offers detailed accounts of different methods of analysis and computer models used and demonstrates how these models can be applied to L2 discourse analysis toward predicting real-world language use. Kang, Johnson, and Kermad also use these frameworks as a jumping-off point from which to propose new models of second language prosody and future directions for prosodic computer modeling more generally. Making the case for the use of naturalistic data for real-world applications in empirical research, this volume will foster interdisciplinary dialogues across students and researchers in applied linguistics, speech communication, speech science, and computer engineering.

Okim Kang is a Professor of Applied Linguistics and Director of the Applied Linguistics Speech Lab at Northern Arizona University, Flagstaff, AZ. Her research interests include speech production and perception, L2 pronunciation and intelligibility, L2 oral assessment and testing, automated scoring and speech recognition, World Englishes, and language attitude.

David O. Johnson is an Associate Teaching Professor of Electrical Engineering and Computer Science at the University of Kansas. His research interests are in artificial intelligence, machine learning, natural language processing, and human–robot interaction.

Alyssa Kermad is an Assistant Professor of Applied Linguistics and TESOL at California State Polytechnic University, Pomona. Her research interests are in second language speech and pronunciation, speech perception, prosody and pragmatics, second language acquisition, individual differences, and speech assessment.


Routledge Studies in Applied Linguistics

Autoethnographies in ELT: Transnational Identities, Pedagogies, and Practices. Edited by Bedrettin Yazan, Suresh Canagarajah, and Rashi Jain
Researching Interpretive Talk Around Literary Narrative Texts: Shared Novel Reading. John Gordon
Analyzing Discourses in Teacher Observation Feedback Conferences. Fiona Copland and Helen Donaghue
Learning-Oriented Language Assessment: Putting Theory into Practice. Edited by Atta Gebril
Language, Mobility and Study Abroad in the Contemporary European Context. Edited by Rosamond Mitchell and Henry Tyne
Intonation in L2 Discourse: Research Insights. María Dolores Ramirez-Verdugo
Contexts of Co-Constructed Discourse: Interaction, Pragmatics, and Second Language Applications. Edited by Lori Czerwionka, Rachel Showstack, and Judith Liskin-Gasparro
Second Language Prosody and Computer Modeling. Okim Kang, David O. Johnson, Alyssa Kermad

For more information about this series, please visit: Routledge-Studies-in-Applied-Linguistics/book-series/RSAL


Second Language Prosody and Computer Modeling Okim Kang, David O. Johnson, Alyssa Kermad


First published 2022
by Routledge
605 Third Avenue, New York, NY 10158
and by Routledge
2 Park Square, Milton Park, Abingdon, Oxon OX14 4RN

Routledge is an imprint of the Taylor & Francis Group, an informa business

© 2022 Taylor & Francis

The right of Okim Kang, David O. Johnson, Alyssa Kermad to be identified as authors of this work has been asserted by them in accordance with sections 77 and 78 of the Copyright, Designs and Patents Act 1988.

All rights reserved. No part of this book may be reprinted or reproduced or utilised in any form or by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying and recording, or in any information storage or retrieval system, without permission in writing from the publishers.

Trademark notice: Product or corporate names may be trademarks or registered trademarks, and are used only for identification and explanation without intent to infringe.

Library of Congress Cataloging-in-Publication Data
A catalog record for this title has been requested

ISBN: 978-0-367-90112-7 (hbk)
ISBN: 978-1-032-07033-9 (pbk)
ISBN: 978-1-003-02269-5 (ebk)
DOI: 10.4324/9781003022695

Typeset in Sabon by Newgen Publishing UK



Contents

List of Figures
List of Tables
Introduction

Part I
Linguistic Foundations of Prosody
1 Overview of Prosody
2 Frameworks of Prosody
3 Prosodic Analyses of Natural Speech

Part II
Computer Applications of Prosody
4 Computerized Systems for Syllabification
5 Computerized Systems for Measuring Suprasegmental Features
6 Computer Models for Predicting Oral Proficiency and Intelligibility

Part III
The Future of Prosody Models
7 Future Research and Applications

Useful Resources
References
Index



List of Figures

1.1 Bolinger’s (1986) Pitch Profiles
2.1 Brazil’s Four Prosodic Systems
2.2 Brazil’s Speaker–Hearer Convergence
3.1 Correct vs. Misplaced Word Stress on “Visiting”
3.2 Measuring Sentence Prominence
3.3 Illustration of a Tone Unit
3.4 Graphic Illustration of Tonic, Key, and Termination Syllables
3.5 Illustration of a Tone Unit and Tonic Syllables
3.6 Illustration of Tone Choice and Pitch Range
4.1 Illustration of Machine Learning
4.2 Cross-Validation
4.3 Confusion Matrix
4.4 Recognizing the Word and Then Dividing it into Syllables
4.5 Converting Audio Files to Noisy Phonetic Sequences
4.6 Example of Alignment between Ground-Truth and Detected Syllables
4.7 Example of State Transitions Force-Aligned with the Ground-Truth Syllable Boundaries
4.8 Syllabification by k-Means Clustering
5.1 Parameters of the TILT Model of a Pitch Contour
5.2 Example of the Bézier Curve Fitting Stylization from Escudero-Mancebo and Cardeñoso-Payo (2007)
5.3 Quantized Contour Model (QCM) with n=4 value and m=5 Time Bins
5.4 4-Point Model Sub-Models: Rise-fall-rise (left) and Fall-rise-fall (right)
5.5 Brazil’s Five Tone Choices
5.6 Examples of How the Significance of the Rises and Falls Determines the Tone Choice
5.7 Relative Pitch Calculation Example
6.1 Phones and Silent Pauses Partitioned into Tone Units
6.2 Phones Divided into Syllables
6.3 Filled Pauses Located
6.4 Prominent Syllables Identified
6.5 Tone Choice Determined
6.6 Relative Pitch Calculated




List of Tables

3.1 Select Measurements of Segmental Accuracy
4.1 Machine Learning Models
4.2 Summary of Data-Driven Phonetic Syllabification Algorithms
4.3 144 BURNC Paragraphs Used by Johnson and Kang (2017a)
4.4 Sonority Scale
4.5 Example Rulebook for Syllabification by Genetic Algorithm
4.6 Example Rules Derived from the Force-Aligned Syllable Boundaries Depicted in Figure 4.7
4.7 TIMIT and BURNC Syllabification Results
5.1 Comparison of ToBI Labeling Accuracy for Four Algorithms
5.2 Best Prominent Syllable Detectors Depending on Validation Method and Metric Utilized
5.3 Truth Table for All Possible Combinations of Significant and Insignificant Rise and Falls
5.4 Distribution of Tone Choices
5.5 Four Best Tone Choice Detectors Depending on Validation Method and Metric Utilized
5.6 Calculation of Suprasegmental Measures
5.7 Calculation of Suprasegmental Measures
6.1 Gender, Subject, and Duration of the 120 CELA Speech Files
6.2 Step 1 Example of the Data File of Time-Aligned Phones and Silent Pauses
6.3 Corpus of Speech Files
6.4 Speaking Section Tasks of TOEFL Junior Comprehensive Assessment
6.5 Pearson Correlation of Model versus Inter-Human
6.6 Normalized Intelligibility Scores
6.7 Suprasegmental Measures Used for Each Intelligibility Score
6.8 Proficiency Score Correlations by Different Feature Sets




Introduction

DOI: 10.4324/9781003022695-1

Today, English is a global language. Out of the approximately 7.5 billion inhabitants on Earth, about 20% (or 1.5 billion) speak the English language; however, only about 360 million of these English speakers are native speakers (Lyons, 2017). In this era of globalization and technologization, it is essential to develop robust automatic speech recognition systems that can process English produced by both first language (L1) and second language (L2) speakers. Such automated systems must therefore be informed by thorough linguistic analyses of the discourse produced by all groups of English speakers. Further, a robust description of L2 English varieties is even more essential as the basis for other technological applications of linguistics, such as tools for the cyber-learning of English.

Specifically, automatic speech recognition (ASR) systems enable feedback on language learning and assessment. The application of automatic speech scoring has been considered one of the most promising areas for computer-assisted language learning (CALL) and automated language assessment (Franco et al., 2010). In the settings of L2 learning and testing, human teachers or raters lack the time to provide detailed pronunciation feedback to individual students. At the same time, involving human raters in high-stakes tests creates ongoing concerns due to high costs of administration and the subjectivity in human judgments (Kang & Rubin, 2009). Improving one’s speech requires frequent feedback from an external, objective source other than the language learner’s own perceptions, which makes automated speech assessment/CALL a suitable arena for a tireless computer.

One especially important linguistic component of any automated system is prosody, which conveys crucial meaning in speech. Prosody reflects various features of the speaker as well as the utterance, including the emotional state of the speaker, the presence of irony or sarcasm, contrast, focus, or other elements of language that may not be encoded by grammar or by choice of vocabulary. Prosody extends over more than a single sound segment in an utterance through speech properties such as pitch, tone, duration, intensity, and voice quality (Chun, 2002). The choice of tone on the focus word, for example, can affect both perceived information structure and social cues in discourse. In the field of speech science, accurate and meaningful applications of such prosodic cues in computer systems have long been desired.

Over time, various intonation frameworks have been introduced and developed. However, there has been little research on the extent to which prosodic properties can predict variation (e.g., intelligibility, oral fluency, communicative success, and language development) in language use in actual discourse contexts, issues that are directly relevant for the L2 English speech community. In fact, it is important to describe the applications of different analytical frameworks for their ability to capture the important patterns of prosodic variation in natural discourse. These patterns include different developmental stages for L2 learners of English, as well as evaluations of intelligibility, oral fluency, and communicative success for all speakers of English. Basic linguistic research of this type is an obvious prerequisite for speech science research designed to develop automated systems for speech processing, production, and cyber-learning. It is clearly beneficial to base such automated systems on the linguistic framework that proves to be the most effective for predicting patterns of real-world language use. Thus, the current book aims to provide detailed accounts of prosody-focused speech analyses and to demonstrate applications of prosody-based computer models for L2 analyses of discourse.

Moreover, thus far communicative success has been mostly evaluated by humans, but humans are costly and slow in evaluating; even under the best situations, they lack consistency and objectivity. The traditional paradigm for automatic assessment suffers from a shortcoming in which speech data are typically captured in the context of controlled human–machine interactions. It is clear that “talking to machines” is not the same as “talking with humans.” Speech data in the conventional programs tend to be single-sided or unidimensional, not taking the interlocutors into account. In this respect, they do not depict the diversity of typical daily human-to-human interactions, including dialogs or conversations that are rich in their exchange of thoughts/ideas, context development, and agreement/disagreement. However, resources are limited regarding the computerized process of natural discourse from the perspective of discourse analysis. The current book intends to offer examples of human discourse features (e.g., tone choice and use of prominence) which are also applicable to machine learning and testing/evaluation.

To these ends, the current book aims to address the meaning of prosody, why prosody is important in communication, how prosodic properties can be analyzed, and how prosody-based computer modeling can be developed. These questions are answered through a discussion of the following themes:

• The historical development of prosodic frameworks;
• The role of prosody in discourse according to different frameworks;
• Prosodic features connected to human communication (and prosodic interpretation in natural discourse);
• Descriptive accounts of segmental and suprasegmental analyses;
• Computer models for the syllabification of phones to identify syllable boundaries;
• Differences between using word-recognition and phone-recognition computer models for identifying the syllables of speech;
• Computer models for predicting L2 oral proficiency and intelligibility from suprasegmental measures;
• Future applications of prosodic computer modeling.

Essentially, by bridging the gap between two fields (computer science and applied linguistics), we have attempted to provide a guide and reference for objective speech assessment which will ultimately allow the audience to engage in interdisciplinary work. The overall take-away message for our readers is that empirical research, such as our own carried out in this book, can be based on naturalistic data and used for real-world applications. Through this message, we encourage our readers to use and replicate our work in their own environments for their own needs.

Organization of the Book

Second Language Prosody and Computer Modeling leads the reader through a bottom-up understanding of prosody, prosodic frameworks, human analyses of prosody, computerized systems, computer models, and applications of prosody and computer models to real-world scenarios. Our book is separated into three parts.

Part I, “Linguistic Foundations of Prosody,” encompasses Chapters 1–3 and provides the linguistic foundation for computer modeling. Chapter 1, “Overview of Prosody,” traces the historical development of prosodic frameworks, setting the context for how the frameworks on which we draw fit into the larger timeline of prosodic research. Chapter 2, “Frameworks of Prosody,” provides a more in-depth analysis of two major prosodic frameworks representing two major geographical traditions. Then, in Chapter 3, “Prosodic Analyses of Natural Speech,” descriptive accounts of both segmental (consonant/vowel) and suprasegmental (prosodic) analyses and measurements are provided. For each, a discussion of their role in discourse is provided and comparisons are made between these features in first and second language speech. This chapter additionally offers a detailed explanation of how each of the speech properties can be measured and analyzed, with examples and step-by-step procedures. The chapter ends with an explanation of the differences between measuring proficiency and intelligibility. The intent of Chapter 3 is to present the applied linguistic background required for the reader to comprehend the computer models for predicting oral proficiency and intelligibility discussed in later chapters.

Part II, “Computer Applications of Prosody,” which includes Chapters 4–6, takes the foundational knowledge from Chapters 1–3 and applies it to computer modeling processes. In Chapter 4, “Computerized Systems for Syllabification,” the process of breaking continuous human speech into syllables automatically with a computer is discussed. This chapter covers the basis for computer analyses of prosody and also provides detailed descriptions of various computer algorithms for detecting syllable boundaries within continuous English speech. Both word- and phone-based techniques are explored. Chapter 5, “Computerized Systems for Measuring Suprasegmental Features,” moves on to the next step in building computer models by discussing how to derive underlying prosodic or suprasegmental properties. This chapter provides a history of computer algorithms to detect suprasegmental features. The next chapter, Chapter 6, “Computer Models for Predicting Oral Proficiency and Intelligibility,” compares several computer models for automatically scoring oral proficiency and intelligibility from suprasegmental measures of speech.

Part III, “The Future of Prosody Models,” is composed of the last chapter, Chapter 7, on “Future Research and Applications.” This chapter explores how prosody models can be used in future research in addition to the future applications of prosody models. We discuss the applicability of prosody models to student research, language teaching, language learning, speech assessment, and computer science. Because this book primarily deals with the computer analysis of prosody in monologic speech, this chapter also examines future research in using segmental, suprasegmental, lexical, and grammatical measures for computerized analysis of dialogic speech. In addition, this chapter addresses possible directions for future research, including the increasingly important addition of prosody in discourse contexts comprising World Englishes and English as a lingua franca. This chapter is important in underscoring the applications of our book to students, professors, researchers, language teachers, practitioners, language evaluators, and computer scientists from a wide range of backgrounds.


Part I

Linguistic Foundations of Prosody




1  Overview of Prosody

PROMINENT POINTS
This chapter presents an overview of the following:

1.1 What Is Prosody?
1.2 The Role of Prosody in Discourse
1.3 History of Prosodic Approaches
1.4 The British Tradition
1.5 The American Tradition
1.6 Summary

1.1  What Is Prosody?

DOI: 10.4324/9781003022695-2

The process of human communication, specifically communication through speech, is a complex, multi-faceted phenomenon. At any given time, when an utterance is made by a speaker, there are numerous linguistic processes simultaneously at play. First of all, a speaker must construct an utterance by putting words together in a certain order which conforms to the target language syntax in what is typically described as the “grammar” of the language. These constructions are made up of appropriate words which convey meaning, or the “semantics” of the language. The words themselves are made up of sounds placed into a particular order which connect with other sounds in what is known as the “phonology” of the language. Each individual sound is uniquely articulated and co-articulated depending on the environment, and this process involves the “phonetics” of the language. This hierarchy of linguistic building blocks of speech encompasses several of the processes at play in communication (syntax, semantics, phonology, phonetics). There is another linguistic process which is spread out over all of these processes, and that is prosody. Crystal (2008) defines prosody in the following way:

    A term used in suprasegmental phonetics and phonology to refer collectively to variations in pitch, loudness, tempo and rhythm. Sometimes it is used loosely as a synonym for “suprasegmental”, but in a narrower sense it refers only to the above variables, the remaining suprasegmental features being labelled paralinguistic. (p. 393)

Prosody, often interchangeable with the term “suprasegmentals,” is an ensemble of speech properties including intonation, volume, pitch, timing, duration, pausing, and speech rate. Combinations of these properties (in particular, higher pitch, more volume/intensity, and longer syllables) lead to other prosodic functions including lexical (word) stress and sentence prominence. Pitch is also multi-functioning, for intonation as we know it is the structure of pitch over a given utterance. Prosody “is applied to patterns of sound that range more or less freely and independently over individual sounds and individual words” (Bolinger, 1986, p. 37). In other words, prosody is a fairly independent system which is spread over a given sequence of words; it is neither dependent on nor determined by the individual sounds which make up the individual words, nor by the syntax of the language.

Bolinger (1986) states that a speaker must first make a decision about the choice of words, then the speaker organizes the sequence of those words, and then the speaker codes those words through the assembly of sound for that particular language. Bolinger describes all of those steps as “computational” (p. viii), whereas the use of prosody is more artistic. Void of prosody, any linguistic utterance would simply be a lifeless string of words arranged in a particular order. Prosody adds the “human element” (Ward, 2019, p. 1) of language that enables speakers to signal a range of functions which cannot be encoded through grammar or vocabulary.

Prosody is both systematic and probabilistic: predictions can be made based on overall patterns that are common in a language, yet at the same time, the system is controlled by the speaker, and predictions about speaker choices and listener interpretations are not always definitive (Pickering, 2018). For these reasons, prosody is one of the more mysterious linguistic phenomena since it is more free-flowing. Yet, because it is also systematic, it can be explicitly taught, learned, and understood, even to the extent where computers can be trained based on what we know about human speech.

The collection of prosodic properties contrasts with another group of speech properties called “segmentals” (the term used to describe the segments which make up speech, i.e., consonants and vowels). Consonants and vowels are the building blocks of words, providing the template on which prosody can work. Segmentals also provide the structure of a syllable, which is important for many functions of prosody and computer modeling.
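The observation in this section that higher pitch, greater intensity, and longer duration combine to signal stress and prominence can be illustrated with a small sketch. This is not a method taken from this book: the Syllable record, the idea of averaging z-scores, and the acoustic values for “visiting” are all assumptions made purely for illustration.

```python
from dataclasses import dataclass
from statistics import mean, pstdev
from typing import List


@dataclass
class Syllable:
    """Hypothetical record of three acoustic measurements for one syllable."""
    text: str
    pitch_hz: float
    intensity_db: float
    duration_s: float


def z_scores(values: List[float]) -> List[float]:
    """Standardize values so different units (Hz, dB, s) become comparable."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def prominence_scores(syllables: List[Syllable]) -> List[float]:
    """Average the standardized pitch, intensity, and duration of each syllable.

    Equal weighting is an assumption of this sketch, not an empirical claim.
    """
    pitch = z_scores([s.pitch_hz for s in syllables])
    intensity = z_scores([s.intensity_db for s in syllables])
    duration = z_scores([s.duration_s for s in syllables])
    return [(p + i + d) / 3 for p, i, d in zip(pitch, intensity, duration)]


def most_prominent(syllables: List[Syllable]) -> Syllable:
    """Return the syllable with the highest combined score."""
    scores = prominence_scores(syllables)
    return syllables[scores.index(max(scores))]


# Invented measurements for the word "visiting", stressed on the first syllable.
utterance = [
    Syllable("vi", 220.0, 72.0, 0.21),
    Syllable("si", 180.0, 65.0, 0.12),
    Syllable("ting", 170.0, 63.0, 0.15),
]
stressed = most_prominent(utterance)
```

Standardizing each property before combining them keeps any single unit from dominating the score; a real system would need measured data and empirically chosen weights.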


1.2  The Role of Prosody in Discourse

The nature of prosody in discourse is tacit: every competent speaker knows how to use prosody, yet a speaker often uses prosody under the level of conscious awareness (Pickering, 2018). Indeed, a large set of prosodic resources is available to a speaker, and a speaker uses these resources to make selections between prosodic form and meaning in real time to fit the interactive context and goals (Ward, 2019). The same principle applies to listeners: listeners rely on a range of prosodic cues for meaning in discourse, but this reliance is typically under the level of listener consciousness. This is not to say that speakers and listeners cannot and do not notice and use prosody purposefully; they do and will. In fact, listeners especially take notice of prosody when it is used differently or inappropriately from what is expected, and this can occur often when a given language is not one’s first language. Listeners can walk away from a conversation thinking that one sounded “impolite” or “insincere.” Listeners may attribute these characteristics to the personality of the speaker, yet listeners are likely completely unaware that their impressions were made based on the speaker’s use of prosody.

Pickering (2018) describes a classic example of the real-life consequences of non-target-like prosody which refers back to a cross-cultural communicative context described in Gumperz’ (1982) work on Discourse Strategies. In this example, Indian and Pakistani women working in a British airport were perceived as impolite by the British interlocutors with whom they interacted, especially due to the way they used intonation on the word “gravy.” Whereas British English speakers would typically ask someone if they wanted gravy through the use of rising intonation on the word “gravy,” these speakers would ask the question using falling intonation on the word “gravy.” For British English speakers, these patterns sounded less like an offer for gravy and more like a command, whereas for the Indian speakers, this was a completely normal way of asking the question. Considering how this cross-cultural miscommunication arose out of the use of intonation patterns in one very specific context, one can only imagine the consequences that can occur across discourse contexts when the speaker and hearer follow different conventions for prosodic use.

1.3  History of Prosodic Approaches Much of the existing research on prosody has focused centrally on intonation. While frameworks of intonation agree that pitch is the underlying form of intonation, there are differences on how best to describe the structure, function, and underpinnings of the intonation system, e.g., whether it is made of discrete or variable components, or whether it is composed of tonal contours or tonal sequences (Chun, 2002; Pickering, 2018). Over the years, numerous frameworks have been proposed for


10  Linguistic Foundations of Prosody conceptualizing, measuring, and understanding prosody. Chun (2002) discusses some of these approaches: for example, the generative approach relies on prosodic form; the discourse approach focuses on prosody in interaction; the auditory approach makes use of impressionistic listening; and the acoustic approach analyzes physical or acoustic data. According to Wennerstrom (2001), there are three schools of thought which dominate the literature on intonation: (1) Halliday’s (1967a, 1967b) research on British English, (2) Pierrehumbert’s (1980) dissertation and coauthored work (Pierrehumbert & Hirschberg, 1990) on intonation, and 3) Brazil’s (1975, 1978, 1997) discourse-​based approach of (British) English intonation. These three schools of thought, along with other well-​ established approaches and frameworks, have set into place two overarching traditions: the British-​led tradition and the North American-​led tradition (hereafter referred to as the American tradition). The British-led tradition tends to make use of contour analyses of intonation where intonation shapes take on the form of tone units (otherwise called tone groups, sense groups, etc.), which then are further described in terms of their head or nucleus which carry pitch patterns of their own (Chun, 2002). The American tradition tends to rely more on levels analyses wherein pitch phonemes and morphemes are described by sequences of tones and levels of pitch (Chun, 2002). One exception to the American tradition is Dwight Bolinger’s (1951) theory of pitch accent which tends to be more in line with the British prosodic tradition, although different in its primary focus on prominence (Chun, 2002). Some frameworks that have widely influenced prosody models are David Brazil’s framework, which emerged from the British tradition, and Janet Pierrehumbert and Julia Hirschberg’s framework which emerged from the American tradition. 
We discuss Brazil’s framework and Pierrehumbert and Hirschberg’s framework in detail in Chapter 2. For the current chapter, an overview of the different traditions and approaches provides a general understanding of the history of intonation and how prosodic frameworks have emerged and evolved over the years.

1.4  The British Tradition

Crystal (1969) describes John Hart’s work dating back to the mid-1500s as a generally agreed upon starting place, documenting the earliest discussion of melody. Following Hart, there are no major works documenting intonation until work by Steele (1775) and Walker (1787). Steele’s work was largely inspired by music, and he developed the first systematic method for transcribing and notating prosodic features related to length, stress, and pitch. Walker’s work tended to be more pedagogical in nature with a purpose of teaching people to speak and read well; his work showed a concrete understanding of tonal contrasts.


More recent approaches associated with the British English tradition tend to fall within two major categories: (1) the tune analysis, or the whole tune approach, and (2) the tonetic analysis, or the nuclear approach (Chun, 2002). Both of these approaches operate around a tone group, although the composition of the tone group differs with each approach. The tune analysis, which dates back to work by Jones (1909, 1918) and Armstrong and Ward (1926), describes intonation in terms of contrastive tunes (or tones) set within the bounds of one or several words or thoughts called sense groups (Chun, 2002). In each of these sense groups, the nucleus is given the most prominence and is marked by higher pitch, greater intensity, and longer duration. The tune pattern of each sense group is associated with the pitch as it begins on the last prominent syllable and extends through the rest of the sense group. The method of tune analysis also takes into consideration other speech properties, including pitch range and pitch height (Chun, 2002). Henry Sweet’s tonal analysis set the stage for future work embedded in this British tradition and likely exerted some influence on David Brazil’s (1997) conceptualization of intonation which is now heavily drawn upon in applied linguistics (see Chapter 2). While Sweet’s model of intonation is attitudinal and David Brazil’s pragmatic, both describe intonation with five tones: level, rising, falling, fall-rising, and rise-falling.

The tonetic analysis is similar to the tune analysis, but it takes into consideration smaller components of tunes. Dating back to work by Palmer (1922), the tone group (similar to the sense group) has three structural components: the nucleus, or the strongest syllable; the head, or the part of the tone group before the nucleus; and the tail, or the part of the tone group which follows the nucleus (Pickering, 2018).
In this model, both the head of the tone group and the nucleus have their patterns of pitch movement which function independently: the head could be described as inferior, superior, or scandent, while the nucleus could be described with one of four main tones (Crystal, 1969). O’Connor and Arnold’s (1961, 1973) approach to intonation of conversational English aimed to move away from a grammatical function toward a focus on the attitudinal function (Chun, 2002; Crystal, 1969; Pickering, 2018). In their approach, tone groups were composed of groupings of tones which expressed similar speaker attitudes. For all sense groups, no matter the length nor the number of prominent words, the tune of the last prominent word is assigned one of six patterns (low-fall, high-fall, rise-fall, low-rise, high-rise, and fall-rise) (Chun, 2002).

In its time, Halliday’s (1967a) work on prosody was perhaps the most significant and far-reaching contribution from the British tradition and one which operated within a grammatical framework (Pickering, 2018). Halliday was the founder of systemic-functional linguistics and developed a functional approach to intonation which identified five intonation contours with holistic interpretations (Wennerstrom, 2001). Halliday’s primary focus was information structure (old vs. new information) (Pickering, 2018), and his approach to intonation had a far-reaching influence on subsequent work treating the topic of intonation (e.g., Bolinger, 1986, 1989; Gussenhoven, 1984; Ladd, 1980; Tench, 1996).

Finally, only a brief discussion of David Brazil’s work on prosody is provided here; detailed attention is given to his framework in the next chapter (Chapter 2). For the present, it suffices to say that Brazil’s framework is built around his emphasis on the nature of intonation in discourse. What makes Brazil’s framework widely drawn upon in applied linguistics and language pedagogy is the focus on naturally occurring discourse. Every utterance has a function within a larger speaker–hearer context of interaction which renders a pragmatic nature to prosody (Pickering, 2018).

1.5  The American Tradition

The American tradition of intonation has been set against the backdrop of two major strands of linguistics, structural and generative linguistics, which have influenced the way its prosodic systems are described. In structural linguistics, the individual parts of language which make up the whole are of central focus. Therefore, the theories associated with structural linguistics are inclusive of the sound system. Leonard Bloomfield was a leader of the structural linguistics movement, and in his seminal work Language, published in 1933, he presented a phonemic analysis of pitch. For Bloomfield, intonation and stress are both secondary phonemes, largely because of their acoustic variability (Crystal, 1969). Bloomfield accounted for five different types of pitch phonemes (pitch movements) in sentence-final positions, including the fall at the end of a statement, the rise for yes/no questions, the lesser rise for supplement (wh-) questions, the exclamatory pitch for distortions of the pitch scheme, and the continuative before a pause (Chun, 2002; Crystal, 1969). In addition to these five pitch phonemes, he also distinguished between three stress phonemes (highest, ordinary, and low stress), which were also considered secondary phonemes (Crystal, 1969). Following Bloomfield was the comprehensive attention afforded to intonation by Kenneth Pike in 1945 through his pitch phoneme theory which dominated the American tradition for the next two decades after Bloomfield (Chun, 2002). Pike recognized the importance of a range of prosodic properties which function in tandem with intonation. Chun (2002, p.
25) recognizes the major contributions of Pike’s theory, including the following: “(1) its use of pitch heights or pitch phonemes as the basic elements for characterizing intonation contours; (2) its use of a relatively systematic set of functions pertaining to speaker attitude; and (3) its recognition of the interdependent systems that coexist and influence intonation, namely stress, quantity, tempo, rhythm, and voice quality.” Noteworthy in Pike’s work is the attitudinal function of intonation; intonation provides a temporary attitudinal meaning on top of what is conveyed lexically. According to Pike, intonation is not grammatically determined. Intonation contours were created through the sequencing of four relative pitch phonemes, or tone levels, represented through numbers, including 1 (extra high), 2 (high), 3 (mid), and 4 (low) (Chun, 2002). Eunice Pike (1985) describes how these contours begin on the prominent syllable and extend through that rhythm group, representing different speaker attitudes. For example, the high-low contour expresses no emotion, but the extra-high-low contour expresses surprise, excitement, etc. A detached speaker attitude or displeasure, disappointment, or formality is projected through a mid-low contour. A high-mid contour is without emotional projections, but the extra-high-mid contour expresses surprise, excitement, and so on. The mid-high contour is used to signal lists, sequences, or additional information, and to add surprise or excitement, it becomes the mid-extra-high contour. The four pitch phonemes form many additional contours, some used for more specific meanings than others; Pike (1985) provides a thorough overview of their attitudinal functions.

Dwight Bolinger’s (1951, 1958, 1986, 1989) approach, while originating from the American tradition, more closely resembled the British tradition, although it has been considered separately due to its major focus on prominence (Chun, 2002). Bolinger’s (1951) theory of pitch accent focused on the configurations of pitch as opposed to the level sequences of pitch (Chun, 2002). In this theory and when applied to English, pitch serves two functions: accent and intonation. Accent (i.e., prominence) refers to the abrupt movements in speech (as a result of pitch, loudness, or length) which cause syllables to stand out. Bolinger (1986) compared accent to “bumps on a landscape that may otherwise be simply level or inclined” (p. 10).
When an important word is made an accent, there is an associated change in pitch which causes this “bump” to occur. As these accents occur, the baseline can remain steady (or level) while the accents take wider upward jumps, or the baseline can also experience fluctuation such as rising or falling. Bolinger (1986) refers to this patterning as the overall “landscape,” and this landscape is what describes intonation: “Though strictly speaking the term intonation includes the mere fact of there being one or more accents, it is generally used to refer to the overall landscape, the wider ups and downs that show greater or lesser degrees of excitement, boredom, curiosity, positiveness, etc.” (p. 11). Intonation is described as the overall melody of a sentence which varies depending on length and complexity. Bolinger’s 1958 work and the 1986 work set forth three major profiles of pitch accents, or “shapes determined by how the pitch jump cuing the accent is realized” (Bolinger, 1986, p. 139). Profile A describes “a relatively high pitch followed by a quick drop” (1986, p. 141); Profile B “starts higher than a preceding pitch and does not fall” (p. 141); Profile C is the mirror image of Profile A and “is approached from above, and does not fall” (p. 141). Then there are combinations of these profiles. Profile CA “starts at a relatively low pitch, goes up, and abruptly comes down again” (Bolinger, 1986, p. 141). Profile AC is similar to Profile A, but with an additional rise at the end. Finally, Profile CAC (the least frequent) has the shape of a tilde. Oversimplified graphical illustrations of these pitch profiles are provided in Figure 1.1. These are not intended to correspond one-to-one with the profiles or to capture their acoustic complexity perfectly. Instead, they are meant to give the reader a general illustration of the profiles. Variations of the profiles and treatments of the syllables are provided in great detail in Bolinger (1986).

Figure 1.1  Bolinger’s (1986) Pitch Profiles.

Following the approaches associated with structural linguistics came the rise of the formal theory of generative linguistics, focusing on the form of the language. Generative phonology, in particular, is “a branch of phonology in which the sound system of a language is considered to be composed of internalized abstract elements from which actual speech sounds are ‘generated’ by the interaction of phonological and phonetic principles and constraints” (Wennerstrom, 2001, p. 273). Setting the stage for subsequent intonation frameworks was Liberman and Prince’s (1977) theory of metrical phonology (Chun, 2002). Their system made use of metrical trees to define stress in terms of strong and weak branches.


Finally, one of the most well-known approaches to come out of the generative movement was Pierrehumbert’s (1980) and Pierrehumbert and Hirschberg’s (1990) tone-based model of intonation (described in great detail in Chapter 2). This model is largely drawn upon in speech synthesis and speech recognition and has led to broad applications in speech-to-text technology and computational linguistics. For these reasons, we describe this framework in more detail in the following chapter in an effort to bridge the gap between discourse prosody and computer modeling.

1.6  Summary

This chapter has provided a description of the meaning of prosody, its function in discourse, and major frameworks from two larger geographical traditions that have evolved throughout the years to describe prosody, mainly intonation. The history of work on intonation is important in situating the intonation models which will be reviewed in the next chapter: Brazil’s discourse-based framework and Pierrehumbert / Pierrehumbert and Hirschberg’s tone-based framework. The purpose of reviewing these two frameworks together is to describe the different analytical frameworks well-practiced in the fields; i.e., both of these frameworks have informed computer models of prosody.



2  Frameworks of Prosody

PROMINENT POINTS
This chapter presents an overview of the following:
2.1 Two Prosodic Frameworks
2.2 David Brazil’s Framework
2.3 Janet Pierrehumbert’s and Julia Hirschberg’s Prosodic Framework
2.4 Summary

2.1  Two Prosodic Frameworks

In this section, we primarily introduce two frameworks (of David Brazil and of Janet Pierrehumbert and Julia Hirschberg) as examples because some of the computer modeling algorithms introduced and exemplified in Chapters 5 and 6 of this book are derived from both of these. The major difference in these two frameworks, generally speaking, is their use of intonation terminologies and pedagogical practices. Brazil’s model has often been drawn upon in various fields for purposes of cross-cultural and pedagogical applications related to discourse. It has been applied to a number of different varieties of English, including those of second language learners (Pickering, 2018). The latter model is known as an Autosegmental-Metrical approach (Pierrehumbert & Hirschberg, 1990). It is a system widely drawn upon in the field of linguistics and speech science, resulting in the popular intonational transcription system called the Tones and Break Indices (ToBI), which was developed in the early 1990s and has significantly impacted intonation research. Pierrehumbert’s (1980) and Pierrehumbert and Hirschberg’s (1990) framework has been supported by computational linguists to model speech-to-text synthesis. However, this is not to say that this framework has not been used in applied language studies. For example, Wennerstrom (1997) drew on Pierrehumbert and Hirschberg’s (1990) interpretational model of intonation for non-native speech which allowed for meaningful analyses below the level of the intonation contours, and Wennerstrom expanded the model to include deaccent and paratone. Another example is from The Handbook of Pragmatics (2004), in which Hirschberg wrote on “Pragmatics and Intonation,” drawing upon the ToBI model to give interpretational accounts of intonation. Therefore, depending on the user’s approach to conceptualizing and measuring intonation, either model can be used with the user’s justification. The descriptive accounts of each framework in this chapter will not treat these two as competing frameworks, but they will generally account for their premises. The intent of this chapter is to provide the linguistic background as a means towards a more thorough conceptual understanding of these systems and applications, which will ultimately inform the computerized systems covered in Chapters 5 and 6.

DOI: 10.4324/9781003022695-3

2.2  David Brazil’s Framework

David Brazil (1925–1995) was born in Worcestershire, England and is perhaps best associated with the University of Birmingham where he pursued work in discourse analysis and collaborated with John Sinclair and Malcolm Coulthard. Brazil dedicated his research to discourse analysis and intonation, and some of his most well-known works which resulted from this research are Pronunciation for Advanced Learners of English (Cambridge University Press, 1994) and A Grammar of Speech (Oxford University Press, 1995). Brazil’s now seminal book, The Communicative Value of Intonation in English (Cambridge University Press, 1997), was first published as an English Language Research Monograph at the University of Birmingham. In this book, Brazil presents in elaborate detail a discourse-based framework which interprets the meaning and function of intonation within the speaker–hearer interaction, shared/unshared knowledge, and conversational control. This work now represents the British tradition of intonation and has been well-recognized by students, teachers, researchers, and academics from the field of applied linguistics. One of the major appeals of Brazil’s framework is the manageability of the systems, which are straightforward and accessible for teachers and researchers.

In this section, the four systems which make up Brazil’s framework are presented. These are the systems of (1) prominence, (2) tone, (3) key, and (4) termination (Brazil, 1997; Pickering, 2018). Within each system are possible choices, as illustrated in Figure 2.1. A syllable is either prominent or not. There are five choices of tone: fall, rise-fall, rise, fall-rise, and level. Pitch levels on both the key and termination syllable can be high, mid, or low.
Each of these systems (prominence, tone, key, and termination) and choices are discussed in detail below; however, because these systems function within the confines of a tone unit and are understood in the interactive setting between the speaker and the hearer, a tone unit and the context of interaction are the logical starting points of discussion for this framework.


Prominence: prominent syllable, non-prominent syllable
Tone: fall, rise-fall, rise, fall-rise, level
Key: high, mid, low
Termination: high, mid, low

Figure 2.1 Brazil’s Four Prosodic Systems.
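For readers who think in terms of data structures, the four systems in Figure 2.1 can be sketched as a small set of closed choices attached to a tone unit. The following is only an illustrative sketch, not a representation used by Brazil or by the systems discussed later in this book: the class and field names (`Tone`, `PitchLevel`, `ToneUnit`) are hypothetical, the model records at most an onset and a tonic syllable, and the key and termination values in the example are invented for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class Tone(Enum):
    """The five tone choices listed in Figure 2.1."""
    FALL = "fall"
    RISE_FALL = "rise-fall"
    RISE = "rise"
    FALL_RISE = "fall-rise"
    LEVEL = "level"


class PitchLevel(Enum):
    """The three relative pitch levels available for key and termination."""
    HIGH = "high"
    MID = "mid"
    LOW = "low"


@dataclass
class ToneUnit:
    """Hypothetical record of one tone unit (names are assumptions)."""
    text: str                          # transcription of the tone unit
    tonic: str                         # last prominent syllable; carries the tone
    tone: Tone                         # tone choice on the tonic syllable
    termination: PitchLevel            # pitch level on the tonic syllable
    onset: Optional[str] = None        # first prominent syllable, if any
    key: Optional[PitchLevel] = None   # pitch level on the onset, if any

    def prominent_syllables(self) -> List[str]:
        """Return the prominent syllables recorded for this unit, in order."""
        return [s for s in (self.onset, self.tonic) if s is not None]


# Illustrative values only: the pitch levels below are not taken from the source.
example = ToneUnit(
    text="the FIGure on the LEFT",
    tonic="LEFT",
    tone=Tone.FALL_RISE,
    termination=PitchLevel.MID,
    onset="FIG",
    key=PitchLevel.MID,
)
print(example.prominent_syllables())
```

Modeling each system as an enumeration mirrors the idea that a speaker selects one option from a fixed set within each system; a fuller model would need to allow for tone units with more than two prominent syllables.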

Tone Unit

In technical terms, a tone unit refers to the “stretch of language that carries the systematically-opposed features of intonation” (Brazil, 1997, p. 3). In other words, a tone unit is the product of the decisions a speaker makes when stretches of language are prosodically different from each other. A tone unit then is a language unit, or chunk, which is characterized by a difference in tone choice from preceding and following units. The boundaries of one tone unit to the next are made in real time. Tone units are often segmented by pauses; however, the pause is not absolutely necessary. Tone units tend to contain one or two prominent syllables, or syllables which have more emphasis acoustically and auditorily; they can, however, contain more than two prominent syllables. Tone units can contain at minimum one word (e.g., “Hello”), but they can also be longer stretches of meaningful chunks of language (e.g., “I wanted to stop by to say hello”). The hearer is often able to decode a particular tone unit as a whole, or a semantically related chunk of language. In previous frameworks and in English as L2 textbooks, other terms have been used to describe tone units, including sense groups, breath groups, tone groups, thought groups, and so on. The point of operation of the entire tone unit is the tonic syllable, or the last prominent syllable in the tone unit, because this is the syllable which carries the tone choice for the whole tone unit.

Context of Interaction

Another core concept central to understanding Brazil’s framework is what he calls the “context of interaction” (1997, p. 25). The context of interaction is the unique interactional setting between the speaker and the hearer. Decisions about prosody are made in real time during this interactive setting. If a given interaction is viewed as a continuum in a temporarily shared time and space between the speaker and the hearer, as the speaker/hearer move along the continuum, there is a shared awareness of the discourse which leads up to any given present moment at the time of an utterance in the interaction. Therefore, what has previously been said in an interaction determines what is shared, what is new, and what is no longer relevant. The context of interaction follows pragmatic conventions which are appropriate for a particular language, and it can also follow rules of a particular speech community. More specifically, though, the context of interaction can be one that is unique to the speaker/hearer (i.e., a speaker/hearer can have their own particular way of speaking together).

Prominence

Prominent syllables are core to identifying further systems in Brazil’s framework because prominence determines where key, termination, and tone choices are assigned. That is, prominence determines the onset (where the key is located), if any, and the tonic syllable (where termination and tone choices are located). Once a syllable of a word is determined to be prominent, for purposes of simplicity, that word is considered to be a prominent word. For example, if the syllable “ver” in “uniVERsity” is considered to be prominent, that entire word “university” is considered to be a prominent word. Prominence has also commonly been referred to as sentence prominence or sentence stress. Acoustically speaking, prominence is the result of three properties: (1) pitch (measured in hertz at the onset of a vowel); (2) loudness (measured as the amplitude of the wave); and (3) length (measured as the duration of the vowel). Auditorily speaking, one tends to hear prominent words stand out more than other words in an utterance. Both prominence and word stress involve the features of pitch, loudness, and vowel duration.
Word stress largely follows a rule-​based system because there is a relatively expected pattern of stress within a word although this pattern can be complex itself. Prominence is part of a discourse-​based system as any word in a sentence can carry prominence, even function words (e.g., prepositions, articles, conjunctions). Prominence, therefore, is affected by decisions that a speaker makes in discourse in real time (e.g., which information to emphasize). Using Brazil’s framework as support, Pickering (2018) summarizes the four major functions of prominence which are often taught in L2 pedagogical contexts. These include using prominence to signal new information, to show contrast, to make a contradiction, or to show enthusiastic agreement (pp. 36–​37). Below, we provide examples of new information and contrastive information. We use capital letters to illustrate the prominent syllable and underlining/​bold to show the tonic syllable (the last prominent syllable of the tone unit).
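The capital-letter convention can be read mechanically. The sketch below is a minimal illustration under stated assumptions, not a procedure from Brazil or Pickering: it assumes that a prominent syllable is written as a run of two or more capital letters (so that a sentence-initial capital or the pronoun “I” is not mistaken for prominence), that the input string is a single tone unit, and that the function names `prominent_words` and `onset_and_tonic` are hypothetical. Following the description above, the last prominent word is treated as carrying the tonic syllable, and the first prominent word as carrying the onset when there is more than one.

```python
import re
from typing import List, Optional, Tuple

# Assumption: prominence is marked by at least two consecutive capitals,
# e.g. "DOing" or "HIKE"; single capitals ("What", "I") are ignored.
PROMINENT = re.compile(r"[A-Z]{2,}")


def prominent_words(tone_unit: str) -> List[str]:
    """Return the words of a tone unit that contain a capitalized syllable."""
    words = [w.strip(".,?!;:") for w in tone_unit.split()]
    return [w for w in words if PROMINENT.search(w)]


def onset_and_tonic(tone_unit: str) -> Tuple[Optional[str], Optional[str]]:
    """Return (onset word, tonic word); either may be None if absent."""
    words = prominent_words(tone_unit)
    if not words:
        return None, None
    tonic = words[-1]                               # last prominent word
    onset = words[0] if len(words) > 1 else None    # first one, if distinct
    return onset, tonic


print(onset_and_tonic("What are you DOing this weekend?"))  # (None, 'DOing')
print(onset_and_tonic("the FIGure on the LEFT"))            # ('FIGure', 'LEFT')
```

A heuristic of this kind works only on hand-annotated transcripts; it does not detect prominence from the acoustic signal, and a prominent syllable written with a single capital letter would be missed.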


Example 2.1
Speaker 1: What are you DOing this weekend?
Speaker 2: I’m taking a 5-mile HIKE.

In Example 2.1, Speaker 1 gives prominence to the word “doing” because the response to it will supply the unknown information. We can assume that Speaker 1 is also projecting that “this weekend” is shared information based on the situational context (e.g., the speakers are having the conversation on a Thursday). Speaker 2 responds and gives prominence to the word “hike” as this is the new information in response to the question from Speaker 1. Speaker 2 is conveying that they are taking a hike as opposed to going shopping, studying, or working in the yard. In Example 2.2, Speaker 1 asks Speaker 2 about their availability to meet on Friday.

Example 2.2
Speaker 1: Are you available to meet this FRIday?
Speaker 2: I can’t meet THIS friday but I can meet NEXT friday.

At the time of the utterance, we will assume that “Friday” is new information, or a newly proposed day from Speaker 1 to hold a meeting. In Speaker 2’s response, prominence is no longer provided on the word “Friday” as this is now old information because it is already in play in the conversation. Speaker 2 is then able to make a contrast between “this” Friday and “next” Friday, which are the prominent words in that turn.

Tone

Brazil’s system of intonation is composed of five tone choices identified by the pitch movement on the tonic syllable in each tone unit. These tone choices are the following and are illustrated with arrows.

Fall: ↘
Rise-fall: ↗↘
Rise: ↗
Fall-rise: ↘↗
Level: →

At any given tone unit, a speaker is inclined to choose among these five choices. These tone choices are not predetermined, but they depend on the context of interaction between the interlocutors. Brazil assigns a “communicative value” (1997, p. 67) for each tone choice which applies to any occurrence of that tone. In Brazil’s model, tone choice directly affects pragmatic outcomes because each tone is associated with a particular communicative value in the context of interaction. Brazil underscores the importance of considering the consequences that a particular tone choice brings to communication. Because pragmatic meaning is communicated through tone, when choosing a particular tone, the speaker expects the hearer to understand the value of that choice. If the hearer does not understand the value, the speaker must re-evaluate the communicative situation and reconsider his/her tone choice.

This system of tone works within a subsystem of opposition wherein a speaker makes a choice, and there is a consequence for choosing one tone over another. This subsystem is called the “Proclaiming/Referring opposition” or “P/R opposition” (Brazil, 1997, p. 68) and is best understood together with the concept of the “speaker hearer convergence” (p. 70). The P/R opposition refers to the choice that a speaker has in using “proclaiming” or “referring” tones (p. 70). It should be noted at this point that the fifth tone (the level tone) does not operate within the P/R opposition and will be discussed separately. Proclaiming tones include the fall and the rise-fall, while referring tones include both the rise and the fall-rise. These tones function within the notion of the speaker/hearer convergence which suggests that, at any given interaction, the speaker and the hearer both have their own worldviews of which the other party does not know. At the same time, there can be an intersecting portion of these worldviews of which both the speaker and the hearer have shared knowledge. Brazil’s (1997, p. 70) concept of the speaker/hearer convergence is presented in Figure 2.2 through an illustration inspired by this concept. The overlapping area of the word bubbles illustrates the common ground between the speaker and the hearer, or the knowledge and the experiences that they share.
The areas outside of this common ground are individually unique to the speaker and the hearer.

Figure 2.2 Brazil’s Speaker–Hearer Convergence. (The overlapping area is labeled “Common ground.”)

Generally speaking, when a speaker uses a referring tone (rise or fall-rise), they signal information that is somehow already in play in the conversation; i.e., it already exists within the common ground and involves information of which the hearer has previous knowledge. On the other hand, when a speaker uses a proclaiming tone, they are signaling that what they are bringing to the conversation is not yet in play in the conversation. This adds to the overlapping area between the speaker and the hearer and consequently expands, or widens, the state of convergence. Example 2.3 illustrates this distinction through the two most commonly used tones: fall (proclaiming) and fall-rise (referring). In all examples throughout this book, the arrows are placed before the tonic syllable and represent the intonation pattern for that tone unit; this method of notation is also used in Pickering (2018). The double slashes represent the tone unit boundaries. Chapter 3 goes into more detail about pause units (=runs) and tone units.
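Because the notation is regular (tone unit boundaries marked with `//`, and an arrow placed before the tonic syllable), a transcript written this way can be split into tone units and each tone labeled as proclaiming or referring. The sketch below is a minimal illustration under stated assumptions, not the algorithm used in any system described in this book: the names `parse_transcript` and `ParsedUnit` are hypothetical, one tone arrow per tone unit is assumed, and the level tone is labeled as falling outside the P/R opposition.

```python
import re
from typing import List, NamedTuple

# Arrow notation for the five tone choices.
TONES = {"↘↗": "fall-rise", "↗↘": "rise-fall", "↘": "fall", "↗": "rise", "→": "level"}

# Proclaiming/Referring opposition; the level tone sits outside it.
ORIENTATION = {
    "fall": "proclaiming",
    "rise-fall": "proclaiming",
    "rise": "referring",
    "fall-rise": "referring",
    "level": "outside P/R",
}

# Two-character arrows are listed first so they are not read as single arrows.
TONE_MARK = re.compile(r"(↘↗|↗↘|↘|↗|→)\s*(\S+)")


class ParsedUnit(NamedTuple):
    text: str         # the tone unit as transcribed
    tone: str         # name of the tone choice
    orientation: str  # proclaiming, referring, or outside P/R
    tonic_word: str   # the word immediately following the arrow


def parse_transcript(transcript: str) -> List[ParsedUnit]:
    """Split a transcript on '//' and label the tone of each tone unit."""
    units = []
    # Zero-width spaces sometimes survive copy-and-paste; drop them first.
    for chunk in transcript.replace("\u200b", "").split("//"):
        chunk = chunk.strip()
        if not chunk:
            continue
        match = TONE_MARK.search(chunk)
        if match is None:
            continue  # assumption: units without a tone arrow are skipped
        tone = TONES[match.group(1)]
        units.append(ParsedUnit(chunk, tone, ORIENTATION[tone], match.group(2)))
    return units


for unit in parse_transcript("// ↘↗ SPAnish // is my favorite ↘ COURSE //"):
    print(unit.tonic_word, unit.tone, unit.orientation)
# SPAnish fall-rise referring
# COURSE fall proclaiming
```

The labels describe only the tone choice on each unit; interpreting what is already in play or newly introduced still depends on the context of interaction.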

Example 2.3
// ↘↗ SPAnish // is my favorite ↘ COURSE //

Example 2.4
// ↘ SPAnish // is my favorite ↘↗ COURSE //

In Example 2.3, it can be assumed that the topic of the Spanish language is already in play in the conversation, and the utterance here is introducing new information about a course of which the hearer is not yet aware. In Example 2.4, it can be assumed the hearer is aware that the speaker is taking college courses, but the hearer does not know that the speaker is specifically taking a Spanish course, which is the new information. By using the falling tone on Spanish, the state of convergence between the speaker and the hearer is now broadened, i.e., there is more common ground between the interlocutors. The proclaiming tones add information which changes the worldview of what the hearer knows about the speaker.

In Brazil’s framework, two additional tones, the rise-fall and the rise, can be used with a special pragmatic function of dominance or control. The use of these tones is determined by social relationships or roles. The “dominant” speaker, or one who is in control of the discourse at any point in an interaction, has access to these dominant tones (1997, p. 85). Some asymmetric relationships can predetermine this dominant role (e.g., teacher/student; doctor/patient; parent/child; lawyer/client; etc.), but the use of dominant tones is not specific to asymmetric relationships, as they can be defined in real time as speakers take over the development of the discourse. The speaker in the dominant position has the option of using both rise and fall-rise to refer and both fall and rise-fall to proclaim, whereas the speaker in the non-dominant position should use either the fall or the fall-rise. We present Brazil’s own examples to illustrate this point:


Example 2.5 (Brazil, 1997, p. 85)
// the FIGure on the ↘↗ LEFT // is a ↘ TRIangle //

Example 2.6 (Brazil, 1997, p. 85)
// the FIGure on the ↗ LEFT // is a ↘ TRIangle //

In Example 2.5, one can imagine that in response to a teacher’s question “What is the figure on the left?” a student would be inclined to respond in the first manner with the fall-rise tone indicating their response to the question which has yet to be affirmed by the teacher. However, it would be unexpected for a student to respond with a rising tone. On the other hand, in the course of an instructive explanation, such as in Example 2.6, it would be normal for a teacher to be pointing to a triangle on the board and use a rising tone. Due to their role, the teacher could use either tone (fall-rise or rise). Further examples from Brazil (1997, p. 87; see also Pickering, 2018, p. 72) are illustrated below.

Example 2.7 (Brazil, 1997, p. 87)
// WHEN i’ve finished what i’m ↘↗ DOing // i’ll ↘ HELP you //

Example 2.8 (Brazil, 1997, p. 87)
// WHEN i’ve finished what i’m ↗ DOing // i’ll ↘ HELP you //

In Example 2.7, Brazil paraphrases the statement with the non-dominant tone as, “If you wait a minute, I’ll help you” (p. 87). In Example 2.8, the paraphrased statement of the dominant tone is, “If you want me to help you, you’ll have to wait” (p. 87). One could imagine a child using the first example to offer help to an adult, and an adult using either tone, rise as a regulatory choice or fall-rise as an accommodating choice. There is also the question of dominance and politeness which can be determined by the real-time stance a speaker decides to take in an interactive setting. In asking a question, a speaker projects that they will determine what happens next in the interaction by projecting the content to which the hearer will attend; however, who the question benefits is a consideration. If the answer to the question will mainly benefit the speaker, it is more polite to use a non-dominant tone, such as the fall-rising tone in Example 2.9 below.

Example 2.9 (Brazil, 1997, p. 94) //​is THIS the ↘↗ SHEFfield train //​


The fall-rising (non-dominant) tone in Example 2.9 sounds more like a polite request for the hearer to provide information to the speaker as a favor. On the other hand, in Example 2.10, the rise (dominant) sounds more like a demand for the information.

Example 2.10 (Brazil, 1997, p. 94) //is THIS the ↗ SHEFfield train //

However, when a question is meant to bring benefit to the hearer, such as in the examples below, the dominant rise in Example 2.11 is actually perceived as a warmer offer than the non-dominant referring tone (fall-rise) in Example 2.12 would be.

Example 2.11 (Brazil, 1997, p. 95) //​CAN i ↗ HELP you //​

Example 2.12 (Brazil, 1997, p. 95) //CAN i ↘↗ HELP you //

A question such as “Can I help you?” is meant to bring some benefit (i.e., assistance) to the hearer. By the speaker taking control of the response with a dominant tone (rise), the hearer sees the offer as one more inclined to be accepted. If a non-dominant tone is used (fall-rise), the hearer can perceive this as a routine request or even a show of disinterest on the part of the speaker. Therefore, a speaker can choose to ask this type of question with a dominant tone (rise) which can be seen as more polite and warm. In general, the rise is a more socially acceptable way of asking the question; on the other hand, exerting dominance in the question can signal frustration or exasperation.

The fifth tone, the level tone, does not function within the speaker/hearer convergence, as it does not contribute new information or signal information already in play in the conversation. In other words, the level tone is a neutral tone which does not project any assumption about the speaker/hearer convergence. Pickering (2018) discusses how Brazil’s level tone is used to temporarily suspend the interactive context, which gives the speaker time to withdraw from marking information as known or unknown. As such, it can also be used strategically by a speaker to withhold their opinion in uncomfortable situations, such as when one is asked to comment on something but is hesitant to voice a truthful opinion so as not to hurt or offend the hearer (Pickering, 2018). The level tone is also used to signal incomplete tone units and filled pauses in the act of verbal planning. Filled pauses, in particular, tend to be “dummy” tone carriers because they do not contribute any communicative value to the discourse and are used in verbal planning or holding the floor in discourse (Brazil, 1997, p. 139). Linguistic planning is also a common use of level tone. That is, when the speaker has some problems with the content of their discourse, they can temporarily pull away from the state of convergence to focus instead on linguistic properties related to the message itself (Pickering, 2018). Level tones are also characteristic of liturgical discourse or routinized discourse. Brazil gives the example of teacher talk (p. 138):

Example 2.13 (Brazil, 1997, p. 138) //STOP → WRITing //PUT your pens → DOWN //LOOK this → WAY //

When a teacher gives a directive such as that in Example 2.13, the teacher is focusing on the routine of the class and refraining from the focus of common ground between teacher and students. Students will also recognize such discourse as routine talk. This use of level tone can also accompany rote explanations in teacher talk (like repeating formulas or definitions). Particularly in the language classroom, the level tone can accompany kinds of scaffolding, such as in Example 2.14 below:

Example 2.14 (Brazil, 1997, p. 138)
Teacher: //he BOUGHT it → ON //
Student: //↘ THURSday //
Teacher: //he BOUGHT it on ↘ THURSday // ↘ YES //

In the first turn, the teacher is scaffolding the response they are looking for. The student catches on and provides that response, “Thursday.” More information about the pragmatic functions and interpretations of these tone choices can be found in Pickering (2018, p. 47).

Key and Termination

Key and termination are both prosodic properties dealing with pitch height. In Brazil’s framework, there are three possible relative pitch heights: high (H), mid (M), and low (L). Again, the key is associated with the onset syllable, or the first prominent syllable in the tone unit. Just like intonation, levels of pitch height also carry communicative value. Pickering (2018, pp. 56–59) summarizes the choices associated with high, mid, and low key. Generally speaking, a high key communicates information that is contrastive, a mid key communicates information that is additive, and a low key communicates expected, assumed, or known information. The termination is associated with the tonic syllable, or the last prominent syllable in the tone unit, and pitch height here carries similar communicative values as the key. In other words, the high termination carries contrastive value, the mid termination carries additive value, and the low termination carries reformulative value (Pickering, 2018).

Brazil’s concept of pitch concord can explain how a speaker achieves a communicative function by using a specific tone. Pitch concord deals with how speakers match (or do not match) pitch with interlocutors. When two speakers (Speaker A and Speaker B) take turns during interaction, the ending pitch of Speaker A’s turn should be in line with the beginning pitch of Speaker B’s turn. Successful communicators are expected to match their pitch with their interlocutor’s pitch.
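As a rough illustration of key, termination, and pitch concord, the following Python sketch encodes the three relative pitch heights and checks whether Speaker B’s key matches Speaker A’s termination. The names `KEY_VALUE`, `TERMINATION_VALUE`, and `shows_pitch_concord` are hypothetical and are used here only for illustration; treating concord as an exact match of levels is a simplifying assumption, not a claim about how Brazil operationalizes it.

```python
# Communicative values of relative pitch height, as summarized in the text.
KEY_VALUE = {
    "H": "contrastive",
    "M": "additive",
    "L": "expected, assumed, or known",
}

TERMINATION_VALUE = {
    "H": "contrastive",
    "M": "additive",
    "L": "reformulative",
}


def shows_pitch_concord(termination_a: str, key_b: str) -> bool:
    """Return True when Speaker B's key matches Speaker A's termination.

    Assumption: concord is modeled as an exact match between the two
    relative pitch levels ("H", "M", or "L").
    """
    for level in (termination_a, key_b):
        if level not in KEY_VALUE:
            raise ValueError(f"unknown pitch level: {level!r}")
    return termination_a == key_b
```

For instance, `shows_pitch_concord("M", "M")` reports concord, whereas a mid termination followed by a high key does not.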

2.3  Janet Pierrehumbert’s and Julia Hirschberg’s Prosodic Framework

We briefly introduce Pierrehumbert’s (1980) and Pierrehumbert and Hirschberg’s (1990) model in this section to provide background on another important intonation framework so that the concepts and applications discussed in later chapters of this book can be better understood. Janet Pierrehumbert completed her dissertation, “The Phonology and Phonetics of English Intonation,” at the Massachusetts Institute of Technology in 1980; it presented a tone-based model of intonational analysis. Pierrehumbert states the two main goals of the model as follows:

One main aim […] is to develop an abstract representation for English intonation which makes it possible to characterize what different patterns a given text can have, and how the same pattern is implemented on texts with different stress patterns. The second aim is to investigate the rules which map these phonological representations into phonetic representations. These two aims go hand in hand, since we seek the simplest possible underlying representation by determining what properties of the surface representation can be explained by rules applying during the derivation instead of being marked in the underlying form. (1980, p. 10)

As part of the American generative linguistic tradition, and influenced by work of Liberman (1975) and Bolinger (1958), the model presupposes an underlying structure of language which can provide explanation to surface structure and form through phonetic representations (Wennerstrom, 2001). Pierrehumbert (1980, p. 10) describes the phonological characteristics of intonation as having three components: (1) the grammar of allowable phrasal tunes described in sequences of L and H tones; (2) metrical representations of texts (influenced by Liberman, 1975; Liberman & Prince, 1977) which are predicted by metrical grids of stressed/unstressed syllables and their relationships wherein the nuclear stress (strongest in the phrase) is of primary importance; and (3) tune–text alignment in which the text lines up with the tune. Together, these three components form intonation’s phonological representation, or “a metrical representation of the text with tones lined up in accordance with the rules” (p. 11). In English, the rules in question are those that assign phonetic values to tones, creating the overall pitch contour between tones. The model depends on representations of F0, or fundamental frequency, which is the most common way of quantifying pitch and intonation. Tunes and texts co-occur because the tune must be aligned over a particular linguistic utterance through a language’s rules; however, tunes are independent of the text itself.

The “tune” (Pierrehumbert, 1980, p. 19), or the melody of an intonation phrase, is composed of three systems: pitch accents, phrase accents, and boundary tones. According to Pierrehumbert, “The well-formed tunes for an intonation phrase are comprised of one or more pitch accents, followed by a phrase accent and then a boundary tone” (p. 22). Intonation can therefore be described through four processes dealing with the intonational phrases. These intonational phrases are made up of one or more intermediate phrases (Pierrehumbert & Hirschberg, 1990). Both intermediate and intonational phrases are often identified through pauses, phrase-final lengthening, and the melody associated with the phrase accent and boundary tone (Pierrehumbert & Hirschberg, 1990).

The first process (Pierrehumbert, 1980, p. 15) is to locate the pitch accents, or the tonal patterns assigned to metrically strong syllables of a word. These pitch accents can be described by one or up to two (bitonal) tones on the stressed syllables. Notation in Pierrehumbert’s framework (influenced by Goldsmith, 1976) is made up of a series of combinations of L (low) and H (high) notations determined by the shape of the pitch movement. The tone associated with the accented (stressed) syllable is marked with an asterisk (*). The tone which occurs before or after the accented syllable (the starred tone) is marked by a raised hyphen (¯); there are two options for the low starred tone (L*+H¯ or H¯+L*) and two options for the high starred tone (H*+L¯ or L¯+H*). Taken together, there are six possible pitch accents (H*, L*, L*+H, L+H*, H*+L, H+L*); pitch accents consisting of two identical tones do not exist (Pierrehumbert & Hirschberg, 1990). This framework draws on the system of metrical feet (Hayes, 1980; Liberman & Prince, 1977; Selkirk, 1980) in which each word can have several levels of stress. The bitonal elements are similar to bisyllabic feet in which one is stronger than the other; i.e., the starred tone is stronger than the weaker tone. Single tones are similar to monosyllabic feet and can be compared to a starred tone. Discussed within the concept of metrical grids, pitch accents are located on metrical feet based on the overall metrical structure of the intonational phrase. There are typically around two to three pitch accents within an intonational phrase; more than five is extremely rare.

The second process (Pierrehumbert, 1980, pp. 15–16) is to describe the pitch pattern associated with the end of the intonation phrase. The pitch pattern at the end of the phrase is different from the patterns associated with other pitch accents because it does not necessarily have to line up with metrically strong syllables. It extends past the pitch accent until the end of the phrase boundary. There are two additional tones following the pitch accent on the nuclear stress of the phrase, the phrase accent and the boundary tone, both made up of only one tone (they cannot be bitonal, e.g., the only options for a boundary tone are L% and H%). The boundary tone is the last tone of the intonational phrase and is found at the phrase boundary. A phrase boundary (indicated in this framework by %) is usually marked by a silent pause, by a point where a pause could be inserted, or by a lengthening of the last syllable in the phrase. The phrase accent comes shortly after the nuclear accent (i.e., the pitch accent on the main stress of the phrase) but before the boundary tone, regardless of how close the phrase boundary occurs. In addition to marking unstarred tones of pitch accents, the raised hyphen is also used to mark phrase accents. Example 2.15 illustrates these transcriptions with the word “Nancy.” The first syllable is stressed and the second syllable is unstressed.
The H* represents the starred pitch accent for the word “Nancy,” while the L¯ is the fall in pitch which follows the syllable “Nan.” The “H” is one more jump in pitch which extends through the end of the phrase, and the % marks the end of the phrase.

Example 2.15

Nancy H* L¯ H %
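The well-formedness condition quoted above (one or more pitch accents, followed by a phrase accent and then a boundary tone) can be sketched as a simple check over a sequence of tone labels. This is a minimal illustration only. The function name `is_well_formed_tune` is hypothetical, and the phrase accent is written with an ASCII hyphen (`H-`, `L-`) in place of the raised hyphen, which is an assumption made for ease of typing.

```python
from typing import Sequence

# The six pitch accents listed in the text.
PITCH_ACCENTS = {"H*", "L*", "L*+H", "L+H*", "H*+L", "H+L*"}
# Phrase accents and boundary tones consist of a single tone each.
# Assumption: "-" stands in for the raised hyphen used in the text.
PHRASE_ACCENTS = {"H-", "L-"}
BOUNDARY_TONES = {"H%", "L%"}


def is_well_formed_tune(labels: Sequence[str]) -> bool:
    """Check: one or more pitch accents, a phrase accent, a boundary tone."""
    if len(labels) < 3:
        return False
    *accents, phrase_accent, boundary_tone = labels
    return (
        all(accent in PITCH_ACCENTS for accent in accents)
        and phrase_accent in PHRASE_ACCENTS
        and boundary_tone in BOUNDARY_TONES
    )
```

Under these assumptions, the transcription in Example 2.15 corresponds to the label sequence `["H*", "L-", "H%"]`, which the check accepts, while a sequence lacking any pitch accent, such as `["L-", "H%"]`, is rejected.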

The third process (Pierrehumbert, 1980, p. 17) involved in describing the intonation phrase is to describe the patterns of pitch between tones, which take three forms. The first is a direct route between the tones, such as a steady rise between two tones. The next possibility is when pitch remains level and then rises at the last moment. The third pattern is when pitch makes a fall between two higher level tones. Phonetic rules are used to determine these inter-tonal patterns. Finally, the fourth process (Pierrehumbert, 1980, pp. 17–18) is to explain the relationship between a speaker’s tonal patterns and their pitch range. The same intonation pattern can be spoken in a number of different pitch ranges. The pitch range is therefore the difference between the highest point of fundamental frequency and the baseline, or the lowest point in fundamental frequency over the utterance (Pierrehumbert & Hirschberg, 1990).

In later years, Pierrehumbert and Hirschberg applied this model to discourse. The model was adapted into the ToBI (Tones and Break Indices) labeling system which was created in a sequence of collaborative workshops (Silverman et al., 1992; Beckman & Ayers, 1994; Pitrelli et al., 1994). Using Pierrehumbert’s prosodic framework as a reference, the ToBI model describes Pierrehumbert’s prosodic features in three tiers: break indices, boundary tones, and pitch accents. Each tier makes use of time-aligned symbols which correspond to the prosodic events within an utterance. The application of the ToBI model for computer modeling is described in more detail in Chapter 5.

In Pierrehumbert and Hirschberg’s (1990) work on the function of intonation in discourse, “a speaker (S) chooses a particular tune to convey a particular relationship between an utterance, currently perceived beliefs of a hearer or hearers (H), and anticipated contributions of subsequent utterances” (p. 271). The tunes that convey information at the discourse level work together structurally as the combination of pitch accents, phrase accents, and boundary tones to convey meaning. Rather than characterizing intonation by attitudinal (e.g., politeness, surprise, etc.) and emotional (e.g., anger, joy, etc.) functions, the framework treats these as derived from tune meaning and interpreted relative to a context.
Meaning is instead conveyed by the “propositional content” and the shared belief of the participants involved in the interaction (p. 285). Each component works together to provide meaning. Pitch accents are used to bring salience to arguments, referents, or modifiers in discourse. Phrase accents convey discoursal information which works to structure the interpretation of meaning relative to the meaning conveyed in other intermediate phrases. Boundary tones communicate larger meaning spread over the intonational phrase as a whole. This meaning is inherently connected to other intonational phrases which affect its interpretation. Together, pitch accents, phrasal tones, and boundary tones provide overall intonational meaning which can be interpreted structurally.

Hirschberg (2004) discusses the discourse phenomenon of intonational variation. For instance, while pronouns are function words (as opposed to content words) and thus typically receive no prominence, they may receive prominence depending on the discourse context in which they appear. Pitch range and pausing are important in establishing a topic structure, while amplitude (loudness) can signal a topic shift. Prominence can distinguish discourse markers from their adverbial or structural role in a sentence. Finally, intonation can distinguish speech act types (e.g., direct or indirect speech act), as well as question types (e.g., yes/no or wh-questions).
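The definition of pitch range given earlier in this section (the highest point of fundamental frequency minus the baseline, or lowest point, over the utterance) can be sketched in a few lines of Python. The function name `pitch_range_hz` is hypothetical, and the convention that unvoiced frames are stored as 0 Hz is an assumption about the input, not something specified by the framework.

```python
from typing import Sequence


def pitch_range_hz(f0_values: Sequence[float]) -> float:
    """Return max F0 minus min F0 (the baseline) over an utterance, in Hz.

    Assumption: unvoiced frames are coded as 0 and are ignored.
    """
    voiced = [f0 for f0 in f0_values if f0 > 0]
    if not voiced:
        raise ValueError("no voiced frames in the F0 track")
    return max(voiced) - min(voiced)


# Hypothetical F0 track (Hz) with two unvoiced frames.
example_track = [0.0, 110.0, 180.0, 0.0, 95.0]
example_range = pitch_range_hz(example_track)  # 180.0 - 95.0
```

Because the same intonation pattern can be spoken in different pitch ranges, a measure of this kind is better read as a property of a speaker’s delivery than of the tune itself.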

2.4  Summary

While Brazil’s framework and Pierrehumbert and Hirschberg’s framework have overall similarities, differences are also notable. Despite these differences, however, both models explicate the critical prosodic features of human discourse. Accordingly, language practitioners, researchers, and program developers can choose whichever framework meets their scientific and pedagogical purposes and needs.



3  Prosodic Analyses of Natural Speech

PROMINENT POINTS
This chapter presents an overview of the following topics:
3.1 Second Language (L2) Prosody
3.2 Segmental Properties in Discourse
3.3 Measuring Segmental Properties
3.4 Fluency in Discourse
3.5 Measuring Fluency
3.6 Word Stress in Discourse
3.7 Measuring Word Stress
3.8 Sentence Prominence in Discourse
3.9 Measuring Sentence Prominence
3.10 Pitch and Intonation in Discourse
3.11 Measuring Pitch and Intonation
3.12 Proficiency and Intelligibility
3.13 Summary

DOI: 10.4324/9781003022695-4

3.1  Second Language (L2) Prosody

Prosodic acquisition in one’s native language happens naturally, easily, and below the level of consciousness. In fact, at around eight months, infants’ perceptual skills are already attuned to differences in rising and falling intonation, and between 27 and 30 months of age, toddlers can use intonation to ask questions, even without having mastered syntactic structure (Levey, 2019). Research has suggested that in utero, fetuses in the third trimester take in prosodic stimuli from their native languages which influence their cry pattern after birth (Mampe et al., 2009; Hardach, 2020). It is not until one learns a second/additional language, typically as an adult learner, that one becomes explicitly aware of the process, and even then, pronunciation instruction for L2 learners has not been a primary focus when compared to other linguistic skills (grammar, writing, reading, etc.); attention to it has emerged relatively recently. Pronunciation instruction usually refers to the ensemble of segmental (consonant/vowel) and suprasegmental (intonation, volume, pitch, timing, duration, pausing, and speech rate) instruction. This means that oftentimes, unless explicitly instructed, language learners may not necessarily know that they are using prosody differently from the target speech community. The differences in segmental and suprasegmental production are what result in an accent in the L2.

With the growing number of L2 English users, speech scientists and applied linguists are encouraged to explore features of native/non-native speech in a more systematic and reliable way. As a result, different sets of quantitative speech properties have evolved in this line of study and are therefore presented in this chapter. In spoken communication, the production of speech properties (segmentals and suprasegmentals) is highly influential on how a message is perceived and understood by the listener. These speech properties are therefore critical in measuring the proficiency and intelligibility of an L2 learner’s overall speech. Computer models built for the purpose of assessing L2 speech should be trained in quantifying speech properties using processes similar to those used manually in human analyses.

Due to pronunciation’s critical role in communication, speech properties spanning segmentals and suprasegmentals have been analyzed in previous research with respect to how they relate to larger communicative outcomes. Suprasegmental properties of non-native speech have been shown to account for 50% of the variance in listeners’ perception of L2 speakers’ oral proficiency and comprehensibility (Kang, Rubin, & Pickering, 2010). Suprasegmental features have also been noted to independently predict listeners’ ratings of the accentedness and comprehensibility of non-native speech (Kang, 2010).
Segmental features are also highly influential on how well speech is understood (Bent, Bradlow, & Smith, 2007; Fayer & Krasinski, 1987).

The following sections provide an introductory presentation of speech analyses for segmentals, fluency, word stress, sentence prominence, and pitch and intonation. For each, a discussion of the function in discourse is presented, followed by a presentation on how to measure the speech property. This overview includes a description of both segmentals and suprasegmentals since, together, these two cover the range of the speech stream, and suprasegmental properties cannot be discussed without the foundation on which they are superimposed. The major focus of this chapter, however, is on suprasegmentals, or prosody, as it is the primary interest of the current book. The prosodic features discussed here will inform the development of the computer models described in Chapters 4, 5, and 6.

Manual (human) quantification of speech properties provides the basis for training computer models. That is, even though computer extraction methods may not involve human coding processes directly and analysis processes may differ, human coding is the first step in creating computer models and later verifying their reliability. In order to understand how computer models are created and trained to analyze speech automatically, it is useful to understand the speech quantifications which humans have been calculating manually.

3.2  Segmental Properties in Discourse

While segmental (consonant and vowel) properties are not prosodic, they are important in any discussion of speech analysis. First of all, target-like segmentals predict a speaker’s extent of accent (accentedness), ease of comprehension (comprehensibility), and clarity of speech (intelligibility). Segmentals also provide the core structure of the syllable on which prominence is superimposed. (See Chapter 4 for a discussion on how computer systems recognize speech at the level of the phone and the syllable.) Catford (1987) describes the sound system of English as a hierarchy of ranks in which phonemes and syllables make up the other systems; that is, segmentals lay the foundation and syllable structure for other systems, such as prosody, to function over discourse.

In the larger context of L2 proficiency and communicative success between the speaker and the hearer, segmental accuracy is deemed to be of high importance. Because consonants and vowels are the building blocks of speech, even a slight deviation in the way sounds are produced can be detected by a native speaker of that language. Accuracy of articulation is likely one of the most common ways of measuring segmentals, and oftentimes it is one of the most straightforward. For this reason, language teachers have often focused on segmentals more than suprasegmentals in ESL/EFL contexts because segmentals can be easier to isolate and describe. In fact, English teachers are more comfortable teaching segmentals (Kang, Vo, & Moran, 2016).

Research has examined the effect of segmental accuracy on listener perception and comprehension. Ultimately, this line of research can explain the importance of segmentals to overall communication. Much available research on the topic is in the area of speech perception, or how listeners perceive non-native speech. Overall, segmental inaccuracy has been found to negatively affect how listeners perceive and understand speech.
First, the more segmental inaccuracy, the more accented speakers sound (Brennan & Brennan, 1981; Munro & Derwing, 1999; Trofimovich & Isaacs, 2012). On the other hand, accurate segmental production has been linked to improved judgments of accentedness and comprehensibility (Crowther et al., 2015; Munro & Derwing, 1995, 2006; Saito, 2011; Saito et al., 2016), and its effect on intelligibility and proficiency has been verified empirically (Bent et al., 2007; Fayer & Krasinski, 1987; Kang & Moran, 2014; Kang, Thomson, & Moran, 2020).

In addition to segmental accuracy, some other segmental analyses include measurements of vowel space, vowel duration, and voice onset time (VOT). Research has investigated the role of vowel space in relation to a speaker’s intelligibility, illustrating that clearer speech is in line with more extreme formant frequencies for peripheral vowels (Chen, Evanini, & Sun, 2010). In fact, the space between the F1 and the F2 formants associated with acoustic properties of vowels has been shown to be an important factor in signaling non-nativeness in English speech (Bianchi, 2007; Chen et al., 2010; Tsukada, 2001). A larger vowel space is indicative of clearer articulations marked by greater tongue height (F1 dimension) and tongue advancement (F2 dimension) (Neel, 2008). Chen et al. (2010) suggest that incorporating acoustic information of vowel space in automatic speech recognition systems can be important in providing feedback to language learners on ways they can adjust their speech to sound more target-like. In a similar vein, Bond and Moore (1994) studied five male speakers and found that unintelligible speech was marked by both shorter vowel duration and smaller vowel space. As for vowel duration, Bianchi (2007) looked at vowel duration of monolingual English speakers, early Spanish/English bilingual speakers, and late Spanish/English bilingual speakers and found that monolinguals and early bilinguals were more similar to each other in vowel duration production than either group was to late bilinguals. Late bilinguals differentiated less between long and short vowels, causing their speech to be less clear. VOT duration has been a particularly salient marker of L2 accentedness.

3.3  Measuring Segmental Properties

Several methods have been undertaken to measure segmental production, including human auditory analyses, acoustic analyses, and automated computer analyses. In this section we will discuss ways of analyzing segmental accuracy, vowel space, vowel duration, and voice onset time. Here, we only provide some examples of ways to quantify segmentals; we fully acknowledge that there are a myriad of ways to do so.

3.3.1  Measuring Segmental Accuracy

To begin the discussion of quantifying segmental accuracy, one must consider how segmental performance affects overall intelligibility in order to prioritize critical segmentals. Catford (1987) distinguishes between the frequency and the functional load of a phoneme or phonemic opposition: frequency is “the number of times that it occurs per thousand words in text” while functional load “is represented by the number of words in which it occurs in the lexicon, or in the case of a phonemic contrast, the number of pairs of words in the lexicon that it serves to keep distinct” (p. 88). Catford provides a list of phonemic contrasts in word initial, word medial, and word final position along with their corresponding functional load noted as a percentage. Catford’s suggested cut-off point for considering some contrasts to be of a low functional load is at 30% and under.

Two studies in particular (Kang & Moran, 2014; Munro & Derwing, 2006) have empirically measured segmentals through an approach which considers the functional load. Kang and Moran (2014) used auditory analyses to quantify “segmental deviations,” or the instances when segmentals deviated from Standard American English. Rather than label segmental inaccuracy as segmental errors, the term “deviations” is more in line with the intelligibility principle (see Levis, 2005) which prioritizes clear speech over one’s extent of accent. In their study, deviations included instances when speakers added, deleted, or substituted consonant or vowel sounds and were coded by noting with the International Phonetic Alphabet (IPA) the sound that was produced, along with its targeted or intended sound. Also supporting the intelligibility principle, accepted alternatives (such as the absence of a postvocalic retroflex r or the presence of a trilled r) were not considered instances of segmental inaccuracy. The total number of segmental deviations was divided by the total number of syllables produced in the speech sample. These analyses were conducted auditorily by two human coders who established agreement for 10% of the speech data. Kang and Moran (2014) further analyzed these deviations through Catford’s (1987) functional load approach, considering those of a high functional load to be over 50% and those of a low functional load to be under 50%. Cut-off points therefore vary depending on the purposes and intent of the analysis.

Munro and Derwing (2006) specifically analyzed consonant errors of L2 English speech as they related to listeners’ impressions of accentedness (degree of accent) and comprehensibility (ease/difficulty of understanding). Using Brown’s (1991) and Catford’s (1987) functional load approach, frequent consonantal substitutions were identified auditorily and then tallied and categorized according to high/low functional load categories, resulting in seven consonant substitution types.
The cut-off value of 51–100% was used for high functional load errors and 1–50% for low functional load errors.

Other examples of measuring segmental accuracy, but which do not take a functional load approach, come from Trofimovich and Isaacs (2012) and Zielinski (2008). The former quantified segmental errors as “the number of phonemic (e.g., think spoken as tink) substitutions divided by the total number of segments articulated” (p. 4). These authors additionally calculated syllable structure errors which were “the total number of vowel and consonant epenthesis (insertion) and elision (deletion) errors (e.g., holiday spoken without the initial /h/) over the total number of syllables articulated” (p. 4). Zielinski (2008) used auditory and transcription methods to represent non-native segments as their closest English approximants. Table 3.1 summarizes these select studies and their measurements of segmental accuracy.

Table 3.1 Select Measurements of Segmental Accuracy

Authors: Kang & Moran (2014)
Method: Auditory Analysis + Functional Load Approach
Measurement: Segmental Deviations
• Vowel or consonant substitutions: dat instead of that
• Simplifications of consonant clusters: expore for explore
• Linking errors: Ifi twas instead of if it was
• Vowel or consonant epenthesis (insertion): e.g., besta for best
• Vowel or consonant elision (deletion): e.g., irst instead of first
• Absence of syllable: sev for seven
• Dark /l/: candows instead of candles
• Incorrect word stress: isiting instead of VISiting
• Grammatical/semantic: me mother instead of my mother
• Unsure target sounds
• Multiple errors: too many errors in a small chunk of speech made it too difficult to accurately assess the deviations
• Transcription difference: instances when the transcript differed from the sound heard

Authors: Munro & Derwing (2006)
Method: Auditory Analysis + Functional Load Approach
Measurement: Consonant Substitutions
• High Functional Load: l → n; ʃ → s; n → l; s → ʃ; d → z
• Low Functional Load: ð → d; θ → f

Authors: Trofimovich & Isaacs (2012)
Method: Auditory Analysis
Measurement: Segmental Errors
• total number of phonemic substitutions divided by total number of segments (ex: think → tink)
Measurement: Syllable Structure Errors
• total number of segmental insertion/deletion errors divided by the total number of syllables (ex: holiday → oliday)

Authors: Zielinski (2008)
Method: Auditory Analysis + Transcription Approach
Measurement: Segments Produced at Sites of Reduced Intelligibility
• non-target-like segmentals were represented as the closest English equivalent (ex: unaspirated /p/ in word initial position would be transcribed as /b/)

3.3.2  Measuring Vowel Space

In addition to auditory analyses used for measuring segmental accuracy, there are other methods for measuring segmental qualities. Using acoustic
analyses through speech visualization software (Praat (Boersma & Weenink, 2014), Computerized Speech Lab, etc.) a common way to visualize vowels is through their amplitude peaks, or formants, at different frequency levels of the vocal tract. The most important formants for this type of measurement are the first two or three formants (F1, F2, and F3) respectively. Each vowel is associated with a particular pattern in the frequency levels associated with each formant, thus presenting different patterns of acoustic energy and space between the formants. Non-​native formant patterns can be compared to native-​like formant patterns in the analyses of vowel formants. Acoustic analyses of vowel space has been associated with F1 and F2 formant frequencies within a two-​dimensional space. This process can be illustrated through work by Chen, Evanini, and Sun (2010). They calculated vowel space for three extreme vowels ([i]‌as is “seek”, [ɑ] as in “sock”, and [oʊ] as in “soak”), and through several measures including range, area, overall dispersion, within-​ category dispersion, and F2–​F1 distance. Some of the more straightforward calculations are those of the vowel space range and F2–​F1. To calculate the vowel space range, the overall minimum value should be subtracted from the maximum value for both F1 and F2. That is, the lowest F1 recorded should be subtracted from the highest F1 recorded, for the F1 range, and the lowest F2 recorded should be subtracted from the highest F2 recorded for the F2 range. In their study of three vowels, this ended up looking like F1Range = MaxF1(ɑ) –​ MinF1(i) and F2Range = MaxF2(i) –​ MinF2(oʊ). Then, for any individual vowel, the F2–​F1 distance can be calculated for that vowel to measure the extreme points in the vowel space, but the distance will depend on the vowel in question. For example, for the vowel [i], this distance is usually the largest, while for the vowel [ɑ], the distance is usually the smallest. 
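The range and F2–F1 calculations just described can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the procedure used by Chen, Evanini, and Sun (2010): the formant values are invented for the example, the function names are hypothetical, and averaging the F2–F1 distance over a vowel's tokens is our own simplification.

```python
# Hypothetical (F1, F2) measurements in Hz for several tokens of each vowel.
# The values are illustrative only and are not taken from any study.
tokens = {
    "i": [(280, 2250), (310, 2300)],
    "ɑ": [(710, 1100), (750, 1150)],
    "oʊ": [(450, 900), (480, 950)],
}

def vowel_space_range(tokens):
    """Return (F1 range, F2 range): highest minus lowest value over all tokens."""
    f1_values = [f1 for measurements in tokens.values() for f1, _ in measurements]
    f2_values = [f2 for measurements in tokens.values() for _, f2 in measurements]
    return max(f1_values) - min(f1_values), max(f2_values) - min(f2_values)

def f2_f1_distance(tokens, vowel):
    """Mean F2 - F1 distance over the tokens of one vowel (averaging is an assumption)."""
    distances = [f2 - f1 for f1, f2 in tokens[vowel]]
    return sum(distances) / len(distances)

f1_range, f2_range = vowel_space_range(tokens)
print(f1_range, f2_range)            # 470 1400
print(f2_f1_distance(tokens, "i"))   # 1980.0
print(f2_f1_distance(tokens, "ɑ"))   # 395.0
```

With these invented values, the highest F1 comes from [ɑ] and the lowest from [i], while the highest F2 comes from [i] and the lowest from [oʊ], which mirrors the pattern reported for the three-vowel study.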
In their study, more peripheral vowels were related to more intelligible pronunciation; i.e., greater distances for [i] and smaller distances for [ɑ]. Formant analyses for these peripheral vowels of non-native speakers (NNSs) were compared to human-assigned scores for overall pronunciation proficiency, and all vowel space features were significantly correlated with proficiency scores. Native speaker formant measurements were taken for the same vowels as a baseline comparison.

3.3.3  Measuring Vowel Duration

Another way of measuring vowels quantitatively is through an analysis of vowel duration. Vowel duration is one property which distinguishes between tense and lax vowels (tense vowels are longer, e.g., eat, while lax vowels are shorter, e.g., it). Set ranges of vowel duration have been proposed for English monophthongs and diphthongs (Ladefoged, 2006). Vowel duration plays a role in determining word stress, as the stressed syllable of a word is longer than the other syllables. Vowel duration is also important in signaling sentence stress for the same reason, but at a higher level of discourse related to information management. Furthermore, vowel duration is influenced by the segments which come before and after the vowel in question. Overall, vowel duration is critical in the production of clear speech and in non-native speech production (see Graham, Caines, & Buttery, 2015; Sun & Evanini, 2011), although it is perhaps not as commonly researched as vowel space. To calculate vowel duration for a single vowel, one can isolate the vowel and measure its length (beginning where the previous segment is inaudible and measuring the time until the segment after the vowel begins). This can be done acoustically through speech visualization software to measure the time associated with the vowel. In Bianchi’s (2007) study, vowel duration was measured by two analysts who examined waveforms and located the beginning and the end of the vowel through specific waveform and spectrogram criteria. Chen, Zechner, and Xi (2009) measured vowel duration through automated computer analyses created for the TOEFL practice online assessment.
They developed two acoustic models, one used for recognition of non-native speech and one trained on native or near-native speech which was used for pronunciation scoring. Chen et al. (2009) added vowel duration (i.e., vowel duration shifts) to their model as an important feature associated with non-native speaker proficiency, and these were compared to standard norms of native English speech.

3.3.4  Measuring Voice Onset Time

An additional segmental measure applicable to consonant analyses is voice onset time (VOT). VOT measures the time between the release of a stop closure and the beginning of vocal fold vibration. There are three different types of VOT: positive, zero, and negative. A positive VOT (usually occurring with voiceless aspirated stops) is where there is a delay between the release of the stop and the start of vocal fold vibration, whereas negative VOT, occurring on rare occasions with some voiced consonants, is when vocal fold vibration occurs before the release of the stop. Zero VOT, the most common case for voiced stops, is when the release of the stop coincides with vocal fold vibration. Studies have shown that listeners often perceive longer (positive) VOT durations as more native-like (e.g., Major, 1987; Riney & Takagi, 1999). VOT has also been used in automated oral assessment models (Kazemzadeh et al., 2006; Henry, Sonderegger, & Keshet, 2012). Kazemzadeh et al. (2006) analyzed speech data from a corpus of child L2 English reading assessment to determine classes of phones based on VOT. Henry et al.’s (2012) computer algorithms were developed to handle both positive and negative VOTs. In L2 research, VOT has been studied in comparisons of consonant production across speakers with various languages, e.g., Dutch as a native language and English as a second language (Flege & Eefting, 1987). VOT has also been a feature used to classify non-native English accents (Das & Hansen, 2004; Hansen, Gray, & Kim, 2010) by showing VOT duration and its potential correlation with global foreign accent ratings.
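The three VOT types follow directly from the sign of the interval between stop release and voicing onset, which a short Python sketch can make concrete. The function name, the time stamps, and the optional `tolerance` parameter are hypothetical and introduced only for illustration; they do not come from any of the studies cited above.

```python
def classify_vot(release_time_s, voicing_onset_s, tolerance_s=0.0):
    """Return (VOT in seconds, category) for one stop consonant.

    VOT = time of voicing onset minus time of stop release.
    tolerance_s is a hypothetical margin within which VOT is treated as zero.
    """
    vot = voicing_onset_s - release_time_s
    if vot > tolerance_s:
        category = "positive"   # voicing starts after the release
    elif vot < -tolerance_s:
        category = "negative"   # voicing starts before the release
    else:
        category = "zero"       # voicing coincides with the release
    return vot, category

# Invented time stamps (in seconds) read off a waveform or spectrogram.
print(classify_vot(0.250, 0.310)[1])  # positive
print(classify_vot(0.250, 0.250)[1])  # zero
print(classify_vot(0.250, 0.170)[1])  # negative
```

In practice an analyst would probably set a small non-zero tolerance, since release and voicing onset are rarely measured as exactly simultaneous; the appropriate value is a measurement decision that this sketch does not settle.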

3.4  Fluency in Discourse

In L2 learning, fluency has been defined quite broadly, but oftentimes in two main ways: the first as a measure of overall speaking competence, and the second as a temporal phenomenon which is just one aspect of speaking competence (Lennon, 1990, 2000). In fact, assessing one’s fluency as part of an overall score in language assessment scenarios is quite common. Lennon (1990) defines fluency as follows: “fluency reflects the speaker’s ability to focus the listener’s attention on his/her message by presenting a finished product, rather than inviting the listener to focus on the working of the production mechanisms” (pp. 391–392). When considered in a discussion of prosody, the more common way to measure fluency is as a temporal phenomenon quantified through properties relating to speech rate, silent pauses, and filled pauses. This corresponds with Lennon’s (2000) more recent definition of fluency: “the rapid, smooth, accurate, lucid, and efficient translation of thought or communicative intention into language under the temporal constraints of on-line processing” (p. 26).

Fluency-related speech properties are commonly explored in L2 speech, and research has shown them to be strong predictors of speaking proficiency. Because of their strong relationship with oral proficiency, they have been integrated into well-known automated scoring systems. For example, Zechner et al.’s (2009) SpeechRater℠ software automatically scores the oral proficiency of unconstrained non-native speech by computing articulation rate, unique words per second, and mean words in a chunk as three of its five main linguistic measures (the others include grammatical accuracy and segmental accuracy). Similarly, fluency properties have also played a major role in Evanini and Wang’s (2013) automated speech system. All five suprasegmental features that they included were measures of fluency (rate of speech, number of words per chunk (defined the same as Zechner et al., 2009), and mean duration between stressed syllables). Black et al.’s (2015) model also included fluency measures (silent pauses and speaking rate) for predicting the degree of nativeness. Kang and Johnson’s (2018a, 2018b) suprasegmental measures also include a range of fluency properties, including the number of syllables (including coughs, laughs, etc.) and filled pauses, the number of runs, and the duration of an utterance including silent and filled pauses.

Overall, research has shown that non-native speakers speak with a slower speech rate than native speakers (Guion et al., 2000; Lennon, 1990; Munro & Derwing, 1995b, 1998). This decreased rate of speech could be due to developing syntactic and morphological knowledge, lexical retrieval issues, or non-native-like articulatory challenges (Munro & Derwing, 2001). Too fast a speech rate for NNSs can actually be disadvantageous and less preferred by native listeners (Anderson-Hsieh & Koehler, 1988; Derwing & Munro, 1997; Munro & Derwing, 1998). Therefore, the relationship between speech rate and listeners’ judgments of accentedness and comprehensibility has been proposed to be curvilinear (Munro & Derwing, 1998), as too fast or too slow a speech rate is dispreferred. This curvilinear relationship was tested empirically in Munro and Derwing (2001) and found to be significant for both accentedness and comprehensibility. Listeners give the most positive ratings of NNSs’ speech when it is faster than the speech rate generally used by NNSs, yet still free of unusually fast or slow extremes (Munro & Derwing, 1998; 2001).
Statistical projections in Munro and Derwing (2001) showed that the optimal speech rate for the best accentedness scores is 4.76 syllables per second, and the optimal rate for comprehensibility is 4.23 syllables per second, yet in this same study, non-native speech was mostly slower than these values (mean of 3.24 syllables per second). While some studies have shown that speech rate has no effect on accentedness ratings (Anderson-Hsieh & Koehler, 1988; Flege, 1988), others have found an effect on perception of accentedness (Munro & Derwing, 1998; Trofimovich & Baker, 2006), on comprehensibility (Kang, 2010), and on both accentedness and comprehensibility (Munro & Derwing, 1998; Munro & Derwing, 2001). In addition to affecting listeners’ perception of comprehensibility and accentedness, rate measures (including speech rate, mean length of run, and phonation time ratio) have also been found to be the best predictors of fluency (Kormos & Dénes, 2004).

Non-native pausing has also been the focus of many fluency analyses which pertain to listener judgments of accentedness and comprehensibility. Researchers have measured the length of silent/filled pauses, the number of silent/filled pauses, and even the location of silent/filled pauses. Findings thus far have been insightful, implying that, as a whole, pause measures have a significant impact on listener perception of non-native speech. Trofimovich and Baker (2006) found that pause duration for Korean learners of English was the most significant predictor (among stress timing, peak alignment, speech rate, and pause frequency) of accentedness ratings, accounting for about 37% of the variance. Mean length of pauses has been significantly associated with fluency judgments (Kormos & Dénes, 2004) and has been a significant predictor of accentedness (Kang, 2010). In Kang (2008), the number of silent pauses significantly predicted the oral proficiency judgments of international teaching assistants. Taken together, a combination of quantitative measurements of fluency is essential for any human or computer analysis of prosodic speech patterns.

3.5  Measuring Fluency

A number of working calculations are available to quantify speech rate and pausing. In the following discussion, we will introduce some examples of ways to quantify fluency. Our discussion is influenced by definitions in Kormos and Dénes (2004), and our own examples are provided to illustrate the calculations. We adopt procedures from Kang et al. (2010) for pause cut-off values. This method is also applied in Chapters 4, 5, and 6 of this book. The following section will therefore detail quantitative calculations for syllables per second, articulation rate, mean length of run, phonation time ratio, total number of silent pauses, mean length of silent pauses, total time of silent pauses, total number of filled pauses, mean length of filled pauses, and total time of filled pauses.

In this book, the cut-off point of pauses is operationalized as .1 second (i.e., 100 milliseconds). Researchers have debated the cut-off point of pause length (Towell, Hawkins, & Bazergui, 1996) as it determines the operational quantity of fluency variables. In most pausological literature, articulatory pauses, which are less than .2 seconds, are often omitted from fluency calculations (e.g., Zeches & Yorkston, 1995). However, in our measurement approach, we use .1 second (rather than .2 seconds) as the cut-off point because such articulatory pauses have shown some meaningful differences, especially in L2 speech (Kang, 2010; Kang et al., 2010). Some other studies have used such a method as well (e.g., Anderson-Hsieh & Venkatagiri, 1994; Griffiths, 1991; Swerts & Geluykens, 1994).

Syllables per second is the total number of syllables in any given speech sample divided by the total speaking time, including all pauses, expressed in seconds. To project syllables per minute (if the total speaking time is less than one minute), the syllables-per-second figure should be multiplied by sixty. The process of dividing speech into syllables is known as syllabification. Example 3.1 demonstrates syllabification and the calculation of syllables per second.



Example 3.1
Hello, everyone! Today we are going to talk about the benefits of recycling. (total speaking time 3.5 seconds)

First, one would want to remove all punctuation and syllabify the discourse as it was spoken.

Hel.lo eve.ry one To.day we are go.ing to talk a.bout the ben.e.fits of re.cyc.ling (total speaking time 3.5 seconds)

We end up with 23 syllables. We can imagine that this string of discourse took the speaker about 3.5 seconds to produce. To calculate syllables per second, we would divide 23 by 3.5 to get 6.57 (rounded). Therefore, in this string of discourse, the speaker’s speech rate was 6.57 syllables per second.

$$\frac{23\ \text{syllables}}{3.5\ \text{seconds}} = 6.57\ \text{syllables per second}$$
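The calculation in Example 3.1 can be sketched in Python as follows. This is only an illustration: it assumes the transcript has already been syllabified by hand, with periods marking syllable boundaries inside words, and the function name is hypothetical. It does not perform automatic syllabification.

```python
def count_syllables(syllabified_transcript):
    """Count syllables in a transcript whose word-internal syllable
    boundaries are marked with periods (e.g., 'Hel.lo')."""
    return sum(len(word.split(".")) for word in syllabified_transcript.split())

# Hand-syllabified version of the sentence in Example 3.1.
transcript = ("Hel.lo eve.ry one To.day we are go.ing to talk "
              "a.bout the ben.e.fits of re.cyc.ling")
total_time_s = 3.5  # total speaking time in seconds, pauses included

n_syllables = count_syllables(transcript)
syllables_per_second = n_syllables / total_time_s

print(n_syllables)                     # 23
print(round(syllables_per_second, 2))  # 6.57
```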

We can then project how many syllables the speaker would produce in a minute by multiplying 6.57 by 60 (seconds): 394.2 syllables per minute.

Articulation rate is calculated similarly to speech rate, but this variable excludes pause time from the calculation. In other words, articulation rate is the total number of syllables in any given speech sample divided by the number of seconds taken to produce the sample minus the pause time. To use our example from above, consider that the brackets in Example 3.2 below represent a pause of .5 seconds. As described above, the cut-off value for a pause is a minimum of .1 seconds.

Example 3.2
Hel.lo eve.ry one [.50] To.day we are go.ing to talk a.bout the ben.e.fits of re.cyc.ling (total speaking time 3.5 seconds)

The pause of .50 seconds would then be excluded from the total speaking time (3.5 seconds – .5 seconds = 3 seconds), and articulation rate would be calculated by dividing the 23 syllables by 3 seconds, to get 7.67 (rounded). Because pause time is removed from the denominator, articulation rate is never lower than syllables per second; the two are equal only when the sample contains no pauses.

$$\frac{23\ \text{syllables}}{3\ \text{seconds}} = 7.67\ \text{syllables per second (articulation rate)}$$


Again, articulation rate can be projected to one minute by multiplying 7.67 by 60 (seconds): 460.2 syllables per minute.

Other ways of calculating speech rate include mean length of run and phonation time ratio. Before mean length of run can be quantified, the number of runs must first be calculated. Kang et al. (2010) define runs as “stretches of speech bounded by pauses of 100 milliseconds or longer” (p. 558). Another term for runs is pause-based units or pause units. To calculate mean length of run, one should divide the number of syllables by the number of runs. Returning to our example sentence, presented in Example 3.3, there are two stretches of speech surrounding the single pause of .5 seconds; therefore, there are two runs. Runs are set in between double slashes (//).

Example 3.3
// Hel.lo eve.ry one // [.50] // To.day we are go.ing to talk a.bout the ben.e.fits of re.cyc.ling // (total speaking time 3.5 seconds)

The mean length of run would then be the number of syllables (23) divided by the number of runs (2), which equals an average of 11.5 syllables per run.

$$\frac{23\ \text{syllables}}{2\ \text{runs}} = 11.5\ \text{syllables per run (mean length of run)}$$

Phonation time ratio is the percentage of time spent speaking (including filled pauses like uh, er, um, etc.) after removing silent pauses. It is calculated by dividing the total length of the time spent speaking (excluding silent pause time) by the total length of the speech sample. This means the phonation time ratio for the example that we have been working with (presented below in Example 3.4) would be 86% (3 seconds of time spent speaking divided by 3.5 seconds, which is the total length of the speech sample).

Example 3.4
// Hel.lo eve.ry one // [.50] // To.day we are go.ing to talk a.bout the ben.e.fits of re.cyc.ling // (total speaking time 3.5 seconds)

$$\frac{3\ \text{seconds}}{3.5\ \text{seconds}} = .857\ (86\%)\ \text{phonation time ratio}$$
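The three pause-sensitive rate measures worked through in Examples 3.2 to 3.4 can be sketched together in Python. The function and variable names are hypothetical, and the sketch rests on two simplifying assumptions that we state openly: silent pauses shorter than the .1 second cut-off are ignored (so their time counts as speaking time), and every retained pause falls between two runs, so the number of runs is the number of pauses plus one.

```python
PAUSE_CUTOFF_S = 0.1  # minimum silent pause length, following the cut-off used in the text

def rate_measures(n_syllables, total_time_s, silent_pauses_s):
    """Return articulation rate, mean length of run, and phonation time ratio."""
    pauses = [p for p in silent_pauses_s if p >= PAUSE_CUTOFF_S]
    speaking_time_s = total_time_s - sum(pauses)
    n_runs = len(pauses) + 1  # assumption: pauses occur only between runs
    return {
        "articulation_rate": n_syllables / speaking_time_s,
        "mean_length_of_run": n_syllables / n_runs,
        "phonation_time_ratio": speaking_time_s / total_time_s,
    }

# Values from Examples 3.2 to 3.4: 23 syllables, 3.5 seconds, one .5-second pause.
measures = rate_measures(23, 3.5, [0.5])
print(round(measures["articulation_rate"], 2))     # 7.67
print(measures["mean_length_of_run"])              # 11.5
print(round(measures["phonation_time_ratio"], 3))  # 0.857
```

A sample that begins or ends with a silent pause would break the "pauses plus one" assumption, so a fuller implementation would count runs directly from a time-aligned annotation.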


Measuring silent pauses is relatively straightforward. While there are certainly more ways to measure pauses than we could discuss in this chapter, some common measures from research (e.g., Kang et al., 2010; Kormos & Dénes, 2004) include the total number of silent pauses, the total time of silent pauses, the mean length of silent pauses, the total number of filled pauses, the total time of filled pauses, and the mean length of filled pauses. Silent pauses are breaks in discourse when there is no speech (usually when taking a breath, thinking, etc.). Filled pauses, on the other hand, are defined in Kang et al. (2010) as non-lexical fillers such as um, uh, er, etc. To illustrate the calculations of these pause variables, we will begin with Example 3.5. Square brackets still indicate silent pauses, as they did above, and the wavy brackets accompanying the filled pauses indicate the actual length in seconds of the filled pauses.

Example 3.5
// Um {.1} // [.20] // I would say // [.20] // uh {.1} // [.25] // it’s sort of a bluish color // [.15] // but more of a blue green // (total speaking time 4 seconds)

Based on this example, we can code the following:

• the total number of silent pauses = 4
calculated by counting all silent pauses: [.20], [.20], [.25], and [.15] = 4 silent pauses

• the total time of silent pauses = .80
calculated by summing up the time of all silent pauses: .20 + .20 + .25 + .15 = .80 seconds of silent pause time

• the mean length of silent pauses = .80 / 4 = .20
calculated by taking the total time of silent pauses and dividing by the number of silent pauses

$$\frac{.80\ \text{seconds of silent pause time}}{4\ \text{pauses}} = .20\ \text{seconds (mean length of silent pauses)}$$

• the total number of filled pauses = 2
calculated by counting all filled pauses: um and uh = 2 filled pauses

• the total time of filled pauses = .2
calculated by summing up the time of all filled pauses: um (.1) + uh (.1) = .2 seconds of filled pause time

• the mean length of filled pauses = .2 / 2 = .1
calculated by taking the total time of filled pauses and dividing by the number of filled pauses

$$\frac{.2\ \text{seconds of filled pause time}}{2\ \text{filled pauses}} = .1\ \text{seconds (mean length of filled pauses)}$$
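The six pause measures coded from Example 3.5 reduce to a count, a sum, and a mean over two lists of durations. The sketch below assumes the pause durations have already been measured by hand; the function name and the `min_duration_s` parameter are hypothetical.

```python
def summarize_pauses(durations_s, min_duration_s=0.0):
    """Return (count, total time, mean length) for a list of pause durations.

    Pauses shorter than min_duration_s are dropped before counting.
    """
    kept = [d for d in durations_s if d >= min_duration_s]
    count = len(kept)
    total = sum(kept)
    mean = total / count if count else 0.0
    return count, round(total, 2), round(mean, 2)

# Durations taken from Example 3.5.
silent_pauses = [0.20, 0.20, 0.25, 0.15]
filled_pauses = [0.1, 0.1]  # um, uh

print(summarize_pauses(silent_pauses, min_duration_s=0.1))  # (4, 0.8, 0.2)
print(summarize_pauses(filled_pauses))                      # (2, 0.2, 0.1)
```

The .1 second cut-off is applied only to silent pauses here, since the text states the cut-off for pauses between runs and gives no separate threshold for fillers.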

3.6  Word Stress in Discourse

Word stress, synonymous with lexical stress, refers to the syllable of a word which is louder, longer, and higher in pitch than the other syllables of that word (e.g., uniVERsity, STUdent, proFESsor, etc.). These can be assumed to be the way a word would be pronounced in citation form or when looking a word up in a dictionary. Common function words (e.g., a, an, the, from, to, etc.) are typically unstressed in actual discourse unless a speaker’s specific intention to emphasize the word is involved. There are also monosyllabic words which only have stress on the one syllable (e.g., DOG, DESK, SKY, etc.).

Listeners have been known to rely on word stress for comprehension of meaning. Indeed, misplacement of word stress can be detrimental to a listener’s global understanding. In Zielinski’s (2008) study, three native English-speaking listeners consistently relied on non-native English speakers’ stress patterns and segments in the discourse to make decisions about intelligibility. In fact, misplaced syllable stress and non-target-like segments caused listeners to incorrectly transcribe words. In addition to intelligibility, Trofimovich and Isaacs (2012) found word stress errors to be significantly correlated with accentedness and comprehensibility; that is, errors with word stress were related to more accented speech that was also more difficult to understand. Kang (2013b) also found that stress (and pitch) made the strongest contribution in predicting non-native proficiency. In past research, Cutler and Clifton (1984) found that native English listeners had major issues with intelligibility when non-native speakers misallocated word stress to the right and also changed the quality of the vowel in doing so. Field (2005) extended Cutler and Clifton’s (1984) work by carefully investigating the role of word stress on intelligibility for both English native and non-native (of mixed L1s) listeners.
Native and non-native listeners transcribed the words they heard when these factors of stress and vowel quality were manipulated. Field provides an example with the word “second” (2005, p. 405). It can be said in its standard form (SEcond), with the stress shifted to the right (seCOND), or with the stress shifted to the right and the vowel quality changed (seCAND). Some words, in which the vowel quality would not change, underwent only the stress shift manipulation. Field found that intelligibility was severely compromised when stress shifted without a change in vowel quality, such as in folLOW or listEN, but less so when the shift was accompanied by a vowel change (2005, pp. 414–415). Another major finding that emerged was that misallocations in word stress to the right were more severe than misallocations to the left. Overall, misplaced stress caused intelligibility to decrease almost 20% for native listeners and about 21% for non-native listeners.

3.7  Measuring Word Stress

Just like with other speech properties, there are different ways to measure word stress, most commonly centering around word stress errors. Word stress analyses can be performed auditorily as well as acoustically through speech visualization software, which allows analysts to confirm their auditory assessment through acoustic patterns of pitch, intensity, and length. A combination of these three properties on a syllable will indicate stress and can therefore confirm auditory perceptions of misplaced word stress. Examples of word stress analyses can be drawn from studies previously discussed in this chapter (i.e., Kang & Moran, 2014; Trofimovich & Isaacs, 2012). For example, in Kang and Moran (2014), word stress was calculated quantitatively by analyzing the instances of incorrect lexical stress placement on polysyllabic words. They provide the example of misplaced stress on visITing instead of VISiting. Word stress errors can be calculated as the total number of instances of misplaced or missing primary stress in polysyllabic words divided by the total number of polysyllabic words produced. Therefore, if there were 10 stress-misplaced polysyllabic words out of 20 polysyllabic words, the stress inaccuracy score would be 0.5 (10 divided by 20) for that utterance.

Figure 3.1 illustrates correct versus misplaced word stress on the word “visiting.” In the first illustration to the left, “visiting” is spoken with stress on the first syllable, as predicted in English, and then to the right, “visiting” is spoken with misplaced stress on the second syllable. We can see that pitch (the lower, darker line), intensity (the upper, lighter line), and vowel length are more pronounced at the location where the syllable receives the stress. In yet another example, Zielinski (2008) analyzed stress as part of the rhythmic performance of L2 speech.
At each site of reduced intelligibility (a pause group which presented some word that was not identifiable to the listeners), three levels of stressed syllables were coded perceptually and then compared to graphic representations: “S” was for the strongest syllable in the site, “s” was for a strong (but not the strongest) syllable, and “w” was for a weak syllable.

Figure 3.1 Correct vs. Misplaced Word Stress on “Visiting.”
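The word stress error score described earlier in this section (misplaced or missing primary stress divided by the number of polysyllabic words) can be sketched as follows. The data representation is our own assumption: each polysyllabic word is coded by hand as a pair of syllable positions, and the function name is hypothetical.

```python
def stress_error_rate(words):
    """Proportion of polysyllabic words with misplaced or missing primary stress.

    `words` is a list of (target_index, produced_index) pairs, one per
    polysyllabic word; produced_index is None when no primary stress was heard.
    """
    if not words:
        raise ValueError("at least one polysyllabic word is required")
    errors = sum(1 for target, produced in words if produced != target)
    return errors / len(words)

# "visiting": target stress on syllable 0 (VISiting); (0, 1) codes visITing.
# Ten misplaced and ten correct words reproduce the 10-out-of-20 case in the text.
sample = [(0, 1)] * 10 + [(0, 0)] * 10
print(stress_error_rate(sample))  # 0.5
```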

3.8  Sentence Prominence in Discourse

Sentence prominence (also called sentence stress) at the discourse level is different from word stress at the dictionary or lexical level because sentence prominence is selective, functional, systematic, and based on what the speaker highlights as important. Prominence, in particular, is used in a strategic and meaningful way in English; in fact, if used inappropriately, it can create a loss of intelligibility (Hahn, 2004) or even cause L2 speakers of English to sound more accented (Kang, 2010). The stress-timed nature of English coupled with its rhythmic patterns has proven challenging for many NNSs (Hahn, 2004). In Kang’s (2010) research, a prominence-related variable (the proportion of stressed words to the total number of words) positively and significantly predicted listeners’ accent judgments of international teaching assistants’ (ITAs’) speech. In other words, the more stressed words ITAs produced, the more accented listeners found the ITAs’ lectures. Low-fluency speakers are inclined to give relatively equal pitch to each word regardless of its role in the discourse structure (Wennerstrom, 2000), which leads to many sequential high-pitched words (i.e., stressed words in this study). In Kang’s (2010, p. 310) study, one of the distinctive characteristics of ITAs’ speech patterns was that low-proficiency ITAs placed stress on many function words or articles such as “be,” “the,” “that,” and “this is.” Raters also commented that ITAs did not show emphasis appropriately.


In Brazil’s (1997) framework, the use of prominence depends on the context of the speaker–listener interaction (see Chapter 2). Speakers will signal information as prominent when it is not already known or recoverable from the immediate situational context. Prominence is used sparingly per tone unit in order to provide clear, unambiguous cues to the listener. Acquiring the English system of prominence can be challenging for English learners. Pickering (2018) notes some particular patterns associated with the beginning and intermediate stages of English learning. First, speakers tend to break their speech into much shorter tone units. Then, these tone units have multiple prominences which can cause difficulty for the listener in deciphering important foci, including key information associated with onset and tonic syllables. These patterns are similar to ITAs’ speech patterns which led to more accented speech in Kang (2010).

3.9  Measuring Sentence Prominence

As mentioned above, prominence is one or two syllables that a hearer can recognize as being in some sense more emphatic than the others (Brazil, 1997, p. 7). In Figure 3.2, prominent syllables appear to be darker with more intense energy (i.e., GROUP, BAD, and GROUP). In Brazil’s model, an individual tone unit has one or two prominent syllables, which are the ones that carry communicative significance. Importantly, the allocation of prominence in each tone unit is the consequence of a speaker’s decision with regard to a binary prominent/non-prominent choice. Tone units (introduced in Chapter 2) are semantically related language chunks characterized by differences in tone choice from preceding and following units, often separated by pauses.

Figure 3.2 Measuring Sentence Prominence.


We will consider Example 3.6 and its corresponding illustration (Figure 3.2) to demonstrate these calculations. The example is extracted from a non-native speaker of English with an intermediate proficiency level. Runs, which are equivalent to pause units, are indicated with double slashes (//). Tone units are only those runs that contain tonic syllables, which were set in bold, underlined capital letters in the original and are marked here by the tone arrow that precedes them. In this example, there are three tonic syllables (i.e., BAD, GROUP, and indiVIDually), which make three tone units, while there are seven runs. Example 3.6 has six prominent syllables (i.e., GROUP, BAD, GROUP, HE, SHE, and indiVIDually). Silent pauses are indicated with square brackets; the associated pitch of each prominent syllable in hertz (Hz) is given in parentheses after the word containing it.

Example 3.6
// GROUP (130.1 Hz) thinks is a →BAD (106.7 Hz) decision that // [.19] // individual tend to make in a ↘GROUP (97.5 Hz) // [.31] // that // [.27] // the // [.14] // that // [.13] // HE (136.7 Hz) or SHE (117.6 Hz) will not make in // [.2] // ↘indiVIDually (105.3 Hz) // [1.09]

Kang et al. (2010) measure sentence prominence in three different ways: pace, space, and prominence characteristics. Pace and space are the prominence features also used in Kormos and Dénes (2004). Pace is the average number of prominent syllables per run. Space is the proportion of prominent words to the total number of words. Prominence characteristics are the percentage of tone units, containing final prominence or termination (tonic) syllables, out of the total number of runs. More details about how to identify tone units, tonic syllables, and tone choices (which are indicated as arrows right next to the capital letters in Example 3.6) will be explained in the following sections. Prominence characteristics can be calculated by dividing the number of tone units that contain a final prominence/termination syllable by the total number of runs. Using Example 3.6 above, we can calculate the following:

• Pace = 0.86

Calculated by counting the total number of prominent syllables and dividing by the total number of runs 6 ( prominent syllables ) 7 ( runs )

= 0.86


50  Linguistic Foundations of Prosody

• Space = .43

Calculated by dividing the number of prominent words (i.e., words containing prominent syllables) by the total number of words 6 ( prominent words ) = .24 ( space ) 25 ( words )

• Total number of runs = 7

calculated by counting the number of all pause units

• Total number of tone units that contain tonic syllables = 3

Calculated by counting the number of all tone units that have a termination

• Prominence characteristics = (3/​7) × 100 = .43 × 100 = 43%

Calculated as the percentage of tone units, containing a final prominence or termination, out of the total number of runs: 3 (tone units) / 7 (runs) × 100 = 43% (prominence characteristics)
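As a minimal sketch, the three calculations above can be reproduced in a few lines of Python. The counts are read off Example 3.6; the function and variable names are illustrative assumptions and are not part of Kang et al.'s (2010) procedure.

```python
# Counts read off Example 3.6 (names are illustrative, not from the source).
prominent_syllables = 6  # GROUP, BAD, GROUP, HE, SHE, indiVIDually
prominent_words = 6      # words containing a prominent syllable
total_words = 25
runs = 7                 # pause units
tone_units = 3           # runs that contain a tonic syllable


def prominence_measures(prominent_syllables, prominent_words,
                        total_words, runs, tone_units):
    """Return pace, space, and prominence characteristics (as a percentage)."""
    pace = prominent_syllables / runs
    space = prominent_words / total_words
    prominence_characteristics = tone_units / runs * 100
    return pace, space, prominence_characteristics


pace, space, prom_char = prominence_measures(
    prominent_syllables, prominent_words, total_words, runs, tone_units
)
print(f"pace = {pace:.2f}")                               # pace = 0.86
print(f"space = {space:.2f}")                             # space = 0.24
print(f"prominence characteristics = {prom_char:.0f}%")   # 43%
```

Keeping the raw counts separate from the ratios makes it easy to recompute the measures when a coder revises the number of runs or tonic syllables.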

3.10  Pitch and Intonation in Discourse

English intonation has commonly been defined as the use of vocal pitch level and pitch movement to produce meaning throughout one's phrasal discourse (Kang et al., 2010). Acoustic analyses of pitch measure fundamental frequency (F0) in hertz (Hz). Chapter 2 discusses the pragmatic functions of intonation but, to summarize, native speakers often signal new information with falling tones and given information with rising tones. Level tones function differently because they do not make projections about the state of convergence; in fact, they are a temporary withdrawal from this state. Differences in non-native use of intonation extend to tone choices and pitch range. NNSs have shown differences from NSs by using more falling tones (Hewings, 1995; Wennerstrom, 1994) or failing to show involvement (Pickering, 2001). In nurse–patient interactions, international nurses also used more level tones for showing empathy, whereas U.S. nurses would use falling tones (Staples, 2015). This could be problematic, as research has shown that level tones are often used for routine phrases or asides (Cheng, Greaves, & Warren, 2008; Pickering, 1999; Wennerstrom, 1997; Wennerstrom & Siegel, 2003). Using level tones for expressing empathy can seem insincere, especially when accompanied by a narrow pitch range (Staples, 2015). In Kang, Thomson, and Moran (2020), prosody which enhanced listeners' comprehension of English speech included the appropriate use of rising tones. For pitch range, Pickering (2004) found that native-speaking teaching assistants had a much wider pitch range (50–250 Hz) than international teaching assistants (100–200 Hz). Staples (2015) also found that U.S. nurses had a broader pitch range than international nurses, in addition to the fact that IENs had more prominent words.

Pitch and intonation patterns have an effect on listeners' judgments. Most of the associations have been established between accentedness and comprehensibility (Kang, 2010; Kang et al., 2010), although recent research has also uncovered the close connection between intelligibility and intonation. For example, Kang et al. (2018) evaluated the effect of phonological variables (including intonation) and found that measures of tone choice significantly contributed to intelligibility judgments on several tasks. Kang (2012) also found that intonation was capable of leading to a listener's misunderstanding of a speaker's intention. Pickering (2001) described the ability of intonation to lead to communication breakdown between ITAs and their students. Pitch has been an important element of suprasegmental features which have predicted 50% of the variance in proficiency and comprehensibility judgments of non-native speech (Kang et al., 2010). Additionally, pitch operationalized as pitch range has been one of the best predictors of accent ratings of international teaching assistants' speech (Kang, 2010).

3.11  Measuring Pitch and Intonation

In Brazil's framework, traces of pitch (fundamental frequency) measured in hertz (Hz) are the foundation for the underlying systems of pitch pattern (tone choice), pitch height, and pitch range. To understand how these systems work, certain labels of the tone unit must first be reviewed (see Chapter 2 for detailed definitions and applications of these terms); these include the tonic syllable, and the key and termination syllables. In Brazil's (1997) model, a tone unit must first be established. The pitch contour of a tone unit is often followed by a pause which can mark the boundary of the tone unit. Other factors that delimit the tone unit include lengthening of the final syllable of the tone unit, and faster pace at the beginning of a tone unit (Wagner & Watson, 2010). That is, the designation of the boundary of a tone unit is based on a number of factors, not just a pause boundary. For example, there should be a clear, sustained pitch contour on termination syllables. The second factor to consider is pausing, which plays a role in identifying the boundary of a tone unit. Figure 3.3 illustrates a tone unit which is followed by a short pause, reflected visually by the absence of a pitch contour and little movement in the wavelengths on the spectrogram. Pausing is not a necessary factor for designating tone units, and tone units are not always bound by pauses. Also, the boundary happens to occur at a syntactic break, but



Figure 3.3 Illustration of a Tone Unit.

Figure 3.4 Graphic Illustration of Tonic, Key, and Termination Syllables.

this is not considered as a factor for the tone unit analysis. Please note that the identification of tone units can be subjective; therefore, coders’ consistency (i.e., reliability) is critical in this process. The process for computerized identifications of a tone unit can be found in Chapter 5. The tonic syllable is the last prominent syllable in a tone unit, and this is the syllable which carries the tone choice for the entire tone unit. Key and termination refer to the levels of pitch on the first prominent syllable (key) and the tonic syllable (termination) in a given tone unit and can range from low to mid to high. Figure 3.4 illustrates these systems. However, it should be noted that when we analyze non-​native speakers’ speech, this tone unit protocol may not apply in the same way; i.e., prominent syllables can be determined, but tonic syllables may not be identified. In other words, as Brazil (1997, p. 14) describes, stretches of speech bound by pauses may have prominent syllables but contain no tonic syllables. Such units are regarded as incomplete. As seen in Example 3.6 above, “..//​ HE or SHE will not make in //​..”, although this unit has prominent syllables of “HE and SHE”, they are not tonic syllables. Therefore, this is not considered a tone unit.
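The relationship between prominent syllables, the key, and the termination can be sketched as a small data structure. This is only an illustration under stated assumptions: the `Syllable` class and the `key_and_termination` function are hypothetical names, each entry stands for a whole word rather than a single syllable, and the tonic flag is supplied by a human coder rather than detected automatically.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Syllable:
    text: str
    prominent: bool = False
    tonic: bool = False  # assigned by the coder; not inferred here


def key_and_termination(run: List[Syllable]) -> Optional[Tuple[Syllable, Syllable]]:
    """Return (key, termination) for a run, or None if the run has no tonic syllable."""
    tonic = [s for s in run if s.tonic]
    if not tonic:
        return None  # incomplete unit: prominent syllables but no tone unit
    prominent = [s for s in run if s.prominent or s.tonic]
    key = prominent[0]        # first prominent syllable
    termination = tonic[-1]   # tonic syllable (last prominent syllable)
    return key, termination


# Two runs from Example 3.6 (word-level simplification).
run1 = [Syllable("GROUP", prominent=True), Syllable("thinks"), Syllable("is"),
        Syllable("a"), Syllable("BAD", prominent=True, tonic=True),
        Syllable("decision"), Syllable("that")]
run6 = [Syllable("HE", prominent=True), Syllable("or"),
        Syllable("SHE", prominent=True), Syllable("will"), Syllable("not"),
        Syllable("make"), Syllable("in")]

key, termination = key_and_termination(run1)
print(key.text, termination.text)   # GROUP BAD
print(key_and_termination(run6))    # None
```

Returning `None` for a run without a tonic syllable mirrors the treatment of incomplete units described above: the run is kept in the count of runs but is not counted as a tone unit.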


The tonic syllable should have one of Brazil's five possible tone choices: falling, rise-falling, rising, fall-rising, or level. These are indicated below with their corresponding arrows.

Falling: ↘
Rise-falling: ↗↘
Rising: ↗
Fall-rising: ↘↗
Level: →

As discussed in Chapter 2, pitch height (H = high, M = mid, and L = low) is also part of Brazil's (1997) framework and is measured on the key and termination syllables in a tone unit. We can illustrate some different ways to locate and describe pitch and tone choices with Example 3.7 below and its graphic illustration. In this transcription of a highly proficient English speaker's speech sample, tone units are indicated by tonic syllables in bold, underlined capital letters, pause units/runs are indicated with double slashes (//), and pause lengths are shown within brackets. In Example 3.7, we can see that there are two tone units, indicated by the presence of two tonic syllables (underlined capital letters in bold). Prominent syllables are indicated with capital letters. There are three prominent syllables within the entire speech sample: "next," "learn," and "Spanish." In the first tone unit, because "NEXT" is the first prominent syllable, or the onset, it is also the key. "LEARN" is the last prominent syllable, or the tonic syllable, and is therefore the termination. In the second tone unit, "SPAnish" is both the key and the termination. In the examples provided, the tone choice should therefore be analyzed on the tonic syllable. Arrows (↘ → ↗) indicate tone choices, and numbers under the capital letters represent fundamental frequency (F0) values in hertz (Hz). Bolding/underline is used to represent the tonic syllable. For each tone unit, one tonic syllable should be identified. We can see in the graphic example, in Figure 3.5, the line representing pitch contour. There are two tonic syllables in the sample, which result in two tone units.
The first tonic syllable (LEARN) has a fall-​rising intonation contour, and this is represented with the fall and rise arrows (↘↗) in the transcription below. The second tonic syllable has a falling tone movement with a fall arrow (↘). The pitch height should be measured on the key and termination for each tone unit. We can indicate the level of pitch associated with the key and termination in parentheses (high, medium, or low). Analyzing the relative pitch of “next,” we would identify it as mid (M). The relative pitch of “LEARN” is low (L) and the pitch height of “Spanish” is high (H). Accordingly, tone choices for two tonic syllables are a low fall-​rising tone for “LEARN” and a high falling tone for “SPAnish”.



Figure 3.5 Illustration of a Tone Unit and Tonic Syllables.

Example 3.7
//the NEXT language I'd like to ↘↗LEARN (191 Hz) //[.50] //is ↘SPAnish//.
Pitch and pitch height of the prominent syllables, in order: NEXT 235.5 Hz (M); LEARN 179.3 Hz (L); SPAnish 360.9 Hz (H).

We present another transcription and graphic illustration in Example 3.8 of discourse composed of three tone units (presence of three underlined capital letters in bold). In the first tone unit, there is only one prominent/tonic syllable ("PASSED"); therefore, this tonic syllable carries both the key and the termination for that tone unit. The second tone unit has a separate key and termination syllable, but the tonic syllable is the termination ("GOOD"). The last tone unit has one prominent/tonic syllable just like the first tone unit, which means that "exHAUSted" is both the key and the termination. The tonic syllable of the first tone unit has a low-rising tone, the second has a mid-level tone, and the third has a mid-falling tone.

Example 3.8
//I ↗PASSED //[.70] //and my GRADES were pretty →GOOD //[.79] //but I'm ↘exHAUSted.//
Pitch and pitch height of the prominent syllables, in order: PASSED 200.5 Hz (L); GRADES 258.9 Hz (M); GOOD 260.5 Hz (M); exHAUSted 238.8 Hz (M).

Common ways of quantifying tone patterns are to count each type of tone choice within a given speech sample or to measure the



Figure 3.6  Illustration of Tone Choice and Pitch Range.

percentage of one tone choice out of the total number of tone choices. Using Figure 3.6 as an example, we have three total tone choices (one per tone unit). Of these three, there is one rising tone, one level one, and one falling tone. As a result, in the entire speech sample presented, there are 33.3% falling tones, 33.3% rising tones, and 33.3% level tones.

• Falling: 1 (falling tone) / 3 (total tones) = 33.3% falling tones

• Rising: 1 (rising tone) / 3 (total tones) = 33.3% rising tones

• Level: 1 (level tone) / 3 (total tones) = 33.3% level tones

Different properties of pitch can also be measured quantitatively. For example, pitch range is calculated by finding the highest F0 on the vowel of every prominent syllable within an entire speech sample. Then, the lowest recorded pitch level is subtracted from the highest recorded pitch level to calculate a speaker's total range of pitch (Kang et al., 2010). Drawing on Example 3.8 and Figure 3.6, we have isolated the pitch associated with the highest prominent syllable. The recorded pitch is 260.5 Hz (rounded). We would collect this pitch recording for every prominent syllable; i.e., others include "passed" at 200.5 Hz, "grades" at 258.9 Hz, and "exhausted" at 238.8 Hz. The pitch range for this speech sample is then calculated in the following way:

260.5 Hz (highest recorded pitch) − 200.5 Hz (lowest recorded pitch) = 60 Hz (pitch range)

Other measures of pitch from Kang et al. (2010) include pitch prominent syllable, pitch non-prominent syllable, pitch new information lexical item, and pitch given information lexical item. A pitch prominent syllable is calculated by measuring the pitch of five prominent syllables and calculating the average pitch value, while a pitch non-prominent syllable is calculated by measuring the pitch of five non-prominent syllables and calculating the average pitch. Pitch new/given information lexical item is calculated by measuring the pitch of a lexical item which appears as new information initially and then becomes given information later in the discourse. In their study, when possible, five lexical items were used to calculate the average pitch for each category. Also dealing with pitch is the concept of paratone. Paratone deals with the pitch level of an overall speech paragraph, analogous to a written paragraph. The extra high pitch at the beginning of a speech paragraph is often called the reset, and this turns into a gradual fall in pitch level to a low termination at the end of the speech paragraph. Measures in Kang et al. (2010) dealt with taking the average pitch level of paratone-initial pitch choices (usually high pitch) and the average pitch level of paratone-termination choices (usually low pitch).
These measures included number of low termination tones (calculated by counting the total number of low terminations followed by high-key resets), average height of onset pitch (calculated by averaging the pitch of high-key onsets), average height of terminating pitch (calculated by averaging the pitch of low terminations), and average paratone pause length (calculated by averaging the length of pauses at paratone boundaries). For more details about these additional intonation analyses and interpretations, please see Kang et al. (2010) or Pickering (2018). Overall, combinations of tone and pitch were calculated in Kang et al. (2010) using Brazil's (1997) framework and included the identification of the following tones: high-rising; high-level; high-falling; mid-rising; mid-level; mid-falling; low-rising; low-level; and low-falling. Later on in Chapter 5 we introduce an additional six combinations which are programmed into our computer model; these include high-fall-rising, mid-fall-rising, low-fall-rising, high-rise-falling, mid-rise-falling, and low-rise-falling. These additional features are not directly drawn from Brazil's tone choices, but simply created for the consistency of our computer algorithm development.
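The tone-choice percentages and the pitch range worked out for Example 3.8 can be sketched in Python as follows. This is a minimal illustration: the tuple layout and the function names are assumptions made for the example, and the F0 values are those given in Example 3.8.

```python
from collections import Counter

# Prominent syllables of Example 3.8: (syllable, F0 in Hz, tone choice).
# Only tonic syllables carry a tone choice; GRADES is prominent but not tonic.
prominent = [
    ("PASSED", 200.5, "rising"),
    ("GRADES", 258.9, None),
    ("GOOD", 260.5, "level"),
    ("exHAUSted", 238.8, "falling"),
]


def tone_percentages(syllables):
    """Percentage of each tone choice out of all tone choices (one per tone unit)."""
    tones = [tone for _, _, tone in syllables if tone is not None]
    counts = Counter(tones)
    return {tone: 100 * n / len(tones) for tone, n in counts.items()}


def pitch_range(syllables):
    """Highest minus lowest F0 over all prominent syllables, in Hz."""
    pitches = [f0 for _, f0, _ in syllables]
    return max(pitches) - min(pitches)


for tone, pct in tone_percentages(prominent).items():
    print(f"{tone}: {pct:.1f}%")        # each tone: 33.3%
print(pitch_range(prominent), "Hz")     # 60.0 Hz
```

The same list could also feed the averaged measures described above (for example, the mean pitch of prominent syllables), since each entry already pairs a syllable with its F0 value.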



3.12  Proficiency and Intelligibility

Depending on the oral outcome to be assessed, computer models can be designed to measure speech properties differently. Ultimately, the selection of features to be automatically assessed should be informed by research in conjunction with the operationalization of a construct. Two global constructs of speech performance have been noted as important with respect to non-native speech performance: proficiency and intelligibility. Proficiency is defined as the ability to use a language for some purpose (Carroll, 1980), that is, knowledge, competence, or ability in the use of a language (Bachman, 1990). On the other hand, intelligibility deals with the clarity of speech and how well it is understood from a listener's perspective. Kang and Johnson's ASR model (Kang & Johnson, 2018a, 2018b; Johnson, Kang, & Ghanem, 2016a, 2016b) is based on the computer's assessment of non-native speech in a nine-step algorithm. The model translates the sound recording into a file of time-aligned phones and silent pauses, partitions the phones and silent pauses into tone units, syllabifies the phones, locates the filled pauses, identifies the prominent syllables, determines the tone choice of each tone unit, calculates the relative pitch associated with each tone unit, computes suprasegmental measures, and utilizes the suprasegmental measures to estimate an oral proficiency score. Step-by-step details of this computer model are provided in Chapter 6.
Johnson, Kang, and Ghanem (2016a) found 19 best suprasegmental predictors of oral proficiency, including the following: (1) prominent characteristics, (2) rising tone choice and low relative pitch ratio, (3) neutral tone choice and low relative pitch ratio, (4) neutral tone choice and high relative pitch ratio, (5) falling tone choice and mid relative pitch ratio, (6) fall-​rise tone choice and low relative pitch ratio, (7) fall-​rise tone choice and high relative pitch ratio, (8) rise-​fall tone choice and low relative pitch ratio, (9) rise-​fall tone choice and high relative pitch ratio, (10) prominent syllable pitch range, (11) non-​prominent syllable pitch, (12) prominent syllable pitch, (13) syllables per second, (14) articulation rate, (15) filled pause rate, (16) filled pause length, (17) paratone boundary onset pitch height, (18) paratone boundary termination pitch height, and (19) paratone boundary pause duration. Zechner et al.’s (2009) Automatic Speech Recognizer model used a multiple regression formula to predict an oral proficiency score by using three suprasegmental measures (articulation rate, types divided by uttsegdur, and mean words in a chunk), a grammatical accuracy measure, and a pronunciation measure based on the computer’s feedback of the entire utterance. In Zechner et al.’s (2009) study using classification and regression trees (CART), they added six additional speech properties including mean deviation of chunks in words, duration of silences per word, mean of silence duration, mean duration of long pauses, frequency


of longer pauses divided by number of words, and unique words per second. Another model used for scoring oral proficiency is Evanini and Wang's (2013) linear regression model, which included suprasegmental, grammatical, and lexical measures. Because intelligibility is usually measured through transcription methods, computer models which calculate intelligibility scores are not as common as those which provide overall proficiency scores. Johnson and Kang (2017b) proposed an initial model for automatically assessing the intelligibility of non-native speech in World English speech data. Depending on the type of intelligibility score, different suprasegmental features were drawn upon from properties of tone, pitch, silent/filled pauses, new lexical item pitch, pace, paratone boundary termination pitch height, phonation time ratio, prominent characteristics, prominent syllable pitch range, silent pause rate, space, and syllables per second. Chapters 4–6 provide more details regarding the applications of these properties to computer-based analyses.

3.13  Summary

This chapter has led readers through a discussion which highlights the importance of speech properties in discourse, with separate discussions on segmental and suprasegmental properties. When applicable, native speaker patterns of the speech property were discussed and compared to non-native speech patterns. For each speech property, we provided step-by-step instructions on how to quantify the property manually. Finally, the chapter closed with a discussion of how many of these speech properties play a role in predicting L2 proficiency and intelligibility, in addition to how they can be used in computer applications.


Part II

Computer Applications of Prosody




4  Computerized Systems for Syllabification

PROMINENT POINTS

This chapter presents an overview of the following:

4.1 Syllables and Automatic Syllabification
4.2 Machine Learning
4.3 Acoustic Algorithms for Syllabification
4.4 Phonetic Algorithms for Syllabification
4.5 Data-Driven Phonetic Syllabification Algorithm Implementations
4.6 Summary

4.1  Syllables and Automatic Syllabification

Syllables play an important role in the prosodic analysis of English speech. Thus, breaking continuous human speech into syllables automatically with a computer is the basis for a computer analysis of prosody. Syllabification is the process of dividing speech into syllables. This chapter provides detailed descriptions of various computer algorithms for detecting syllable boundaries within continuous English speech. Both word- and phone-based techniques for syllabification will be explored. Vowels and consonants can be thought of as the segments of which speech is composed. Together they form the syllables that make up utterances. A syllable is a grouping of speech sounds (i.e., phones). Usually it consists of a syllable nucleus (more often than not a vowel) which may or may not be bounded by consonants before and/or after. Speech can be segmented into a whole number of syllables. Syllables are regarded as the phonological building blocks of words. They shape the prosody and stress patterns of speech.

DOI: 10.4324/9781003022695-5


Automatic syllabification algorithms fall into one of two major categories: acoustic or phonetic. Acoustic algorithms rely on the fact that the acoustical energy (defined as the absolute value of the amplitude of the audio signal) of the vowel is characteristically higher than that of the surrounding consonants. Acoustic algorithms utilize variations in acoustical energy to identify low-energy syllable boundaries between high-energy vowels. On the other hand, phonetic algorithms first identify phones, and then detect syllables by locating vowels and surrounding consonants. After we briefly go over "machine learning," we will look at these two types of algorithms in more detail.
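The acoustic idea described above (low-energy boundaries between high-energy vowels) can be illustrated with a toy sketch. This is not one of the published algorithms reviewed later in the chapter: the frame length, the fixed energy threshold, the synthetic signal, and all function names are assumptions chosen only to make the principle concrete.

```python
import math


def frame_energy(samples, frame_len):
    """Mean absolute amplitude of each non-overlapping frame."""
    return [
        sum(abs(x) for x in samples[i:i + frame_len]) / frame_len
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]


def syllable_boundaries(energy, threshold):
    """Place one boundary at the lowest-energy frame between two high-energy regions."""
    boundaries = []
    in_nucleus = False
    valley_start = None
    for i, e in enumerate(energy):
        if e >= threshold:
            if not in_nucleus and valley_start is not None:
                valley = range(valley_start, i)
                boundaries.append(min(valley, key=lambda j: energy[j]))
            in_nucleus = True
        else:
            if in_nucleus:
                valley_start = i  # energy just dropped below the threshold
            in_nucleus = False
    return boundaries


# Synthetic signal: two loud (vowel-like) stretches separated by a quiet one.
envelope = [0.1] * 400 + [1.0] * 800 + [0.2] * 400 + [0.9] * 800 + [0.1] * 400
signal = [a * math.sin(2 * math.pi * 150 * n / 8000) for n, a in enumerate(envelope)]

energy = frame_energy(signal, frame_len=200)
boundaries = syllable_boundaries(energy, threshold=0.3)
print(boundaries)  # a single frame index inside the quiet stretch (frames 6-7)
```

Real speech would need smoothing and an adaptive threshold, since a fixed cut-off is sensitive to recording level; the sketch only shows why energy valleys are natural candidates for syllable boundaries.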

4.2  Machine Learning

Before we start our discussion on the computer applications of prosody, we would like to briefly review the concept of "machine learning," which is the basis for all the computer models discussed in Chapters 4, 5, and 6. Artificial intelligence (AI) is a branch of computer science that aims to create computer software that emulates human intelligence. John McCarthy, who coined the term in 1955, defines it as "the science and engineering of making intelligent machines" (McCarthy, n.d.). AI software emulates many aspects of human intelligence, such as reasoning, knowledge, planning, learning, communication, perception, and the ability to move and manipulate objects. There are a number of tools that AI uses to emulate these areas of human intelligence. This section covers a set of these tools loosely called "machine learning." Early in the development of AI, machine learning was called pattern recognition. As illustrated in Figure 4.1, the machine learning model predicts the class of an object based on features (e.g., physical properties) of the object. The machine learning model is "trained" with examples of the objects to recognize the class of the object by the pattern of its features, in much the same way humans learn to classify things. For example, a 3-year-old child can recognize the difference between a truck and a car by their distinguishing features. Younger children learn to understand language by recognizing the different sounds of words. In computer applications of prosody, the object is generally something extracted from a speech file, such as a syllable or word. Features of the object (i.e., input to the model), such as pitch, duration, and intensity, are extracted automatically using techniques similar to the manual ones


Figure 4.1 Illustration of Machine Learning. [Diagram: Input (features of object) → Machine learning model → Output (class of object)]


described in Chapter 3. The machine learning model then utilizes the features to classify the object, for example, identifying a prominent syllable vs. a non-prominent syllable. In the computerized systems for syllabification described in the rest of Chapter 4, the input is acoustic information (e.g., intensity, pitch) or phonetic information (e.g., sonority, vowel, consonant) and the output is the location of the syllable boundaries. In computerized systems for measuring suprasegmental features explained in Chapter 5, the input is the intensity, pitch, duration, and pitch contour and the output is suprasegmental features such as prominence, tone choice, relative pitch, or the ToBI attributes. Concerning the computer models for predicting oral proficiency and intelligibility detailed in Chapter 6, the input is suprasegmental measures derived from the suprasegmental features and the output is oral proficiency, degree of nativeness, or intelligibility. There are three types of machine learning: supervised, unsupervised, and reinforcement learning. In supervised learning, the classes of the objects are known and the machine learning model predicts the class of an object based on features of the object. Examples include distinguishing trucks from cars, or recognizing the word "Alexa" vs. "Siri." With unsupervised learning, the classes of the objects are not known and the goal is to group similar objects into clusters. An example would be market segmentation. The third type of machine learning, reinforcement learning, includes algorithms that learn to react to the environment, for example, a robot trying to get out of a maze. None of the algorithms in the remaining chapters of the book uses reinforcement learning. Most of them use supervised learning and a few use unsupervised learning in conjunction with supervised learning. Regression, which comes from statistics, is the simplest and oldest form of machine learning.
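To make the supervised case concrete, the following sketch classifies a syllable as prominent or non-prominent with a one-nearest-neighbour rule, a simple form of instance-based learning. Everything in it is an assumption for illustration: the feature values are invented, the function names are hypothetical, and the models discussed in later chapters are considerably more elaborate.

```python
import math

# Hypothetical training data: (pitch in Hz, duration in s, intensity in dB) -> class.
training = [
    ((220.0, 0.30, 72.0), "prominent"),
    ((210.0, 0.28, 70.0), "prominent"),
    ((150.0, 0.12, 60.0), "non-prominent"),
    ((140.0, 0.10, 58.0), "non-prominent"),
]


def fit_ranges(samples):
    """Minimum and maximum of each feature, used to rescale features to 0-1."""
    columns = list(zip(*(features for features, _ in samples)))
    return [(min(c), max(c)) for c in columns]


def scale(features, ranges):
    return tuple((f - lo) / (hi - lo) for f, (lo, hi) in zip(features, ranges))


def predict(features, samples, ranges):
    """Return the class of the closest training sample (1-nearest neighbour)."""
    x = scale(features, ranges)
    nearest = min(samples, key=lambda s: math.dist(x, scale(s[0], ranges)))
    return nearest[1]


ranges = fit_ranges(training)
print(predict((205.0, 0.25, 69.0), training, ranges))  # prominent
print(predict((145.0, 0.11, 61.0), training, ranges))  # non-prominent
```

Rescaling matters here because pitch in hertz would otherwise dominate the distance over duration in seconds; this is a design choice of the sketch, not a requirement stated in the text.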
Table 4.1 lists the 16 machine learning models used in the computer applications of prosody discussed in this book, with a brief description of how they classify data (supervised learning) or cluster data (unsupervised learning). All of the models accomplish the same goal, just in different ways. Some are better at classifying or clustering one kind of data while others are better at other kinds of data. Developing a computer model using machine learning may not be linear, but it has a number of well-known steps: (1) define the problem, (2) acquire a data set (sometimes called a corpus) that represents what the application will encounter in the real world, (3) evaluate several combinations of input features and machine learning models and select the best combination for the application, (4) fine-tune the selected model to get the best results, and (5) train the final model with the corpus and put the application into production. Cross-validation is a machine learning model validation technique for assessing how the results of a machine learning model will generalize to an independent data set. Cross-validation involves partitioning a corpus of data into subsets (training and test data), training the model on one subset (called the training data), and validating the model on the other


Table 4.1 Machine Learning Models

| Model | Learning type | Description |
| --- | --- | --- |
| Bagging | Supervised | Bagging is another name for Ensemble Learning which is defined below. |
| Boosting | Supervised | Boosting is an Ensemble Learning method that uses a number of weak learners, i.e., the learners are not very good at predicting classes. |
| Classification and regression tree (CART) | Supervised | Decision models where the target variable can take a discrete set of values are called classification trees; in these tree structures, leaves represent class labels and branches represent conjunctions of features that lead to those class labels. Decision trees where the target variable can take continuous values (typically real numbers) are called regression trees. |
| Conditional random fields (CRF) | Supervised | While a classifier predicts a label for a single sample without considering "neighboring" samples, a CRF can take context into account. To do so, the prediction is modeled as a graphical model, which implements dependencies between the predictions. What kind of graph is used depends on the application. For example, in natural language processing, linear chain CRFs are popular, which implement sequential dependencies in the predictions. |
| Decision trees | Supervised | This model uses a decision tree as a predictive model to go from observations about an item represented in the branches to conclusions about the item's target value represented in the leaves. |
| Ensemble learning | Supervised | Ensemble learning is the aggregation of multiple classifiers to improve performance; outputs of classifiers are combined, e.g., majority voting; usually the classifiers are the same kind, but sometimes an ensemble may include more than one type of machine learning model. |
| Expectation-maximization (EM) | Supervised | EM is an iterative method to find (local) maximum likelihood or maximum a posteriori (MAP) estimates of parameters in statistical models, where the model depends on unobserved latent variables. The EM iteration alternates between performing an expectation (E) step, which creates a function for the expectation of the log-likelihood evaluated using the current estimate for the parameters, and a maximization (M) step, which computes parameters maximizing the expected log-likelihood found on the E step. These parameter-estimates are then used to determine the distribution of the latent variables in the next E step. |
| Finite-state-machine (FSM) | Supervised | An FSM is an abstract machine that can be in exactly one of a finite number of states at any given time. The FSM can change from one state to another in response to some inputs; the change from one state to another is called a transition. An FSM is defined by a list of its states, its initial state, and the inputs that trigger each transition. |
| Genetic algorithm | Supervised | A genetic algorithm is used to generate high-quality solutions to optimization and search problems by relying on biologically inspired operators such as mutation, crossover and selection. |
| Hidden Markov model (HMM) | Supervised | An HMM is a statistical model in which the system being modeled is assumed to be a Markov process with unobservable ("hidden") states, referred to as X. HMM assumes that there is another process, referred to as Y, whose behavior "depends" on the model with the hidden states. The goal is to learn about X by observing Y with conditional probabilities. |
| Instance-based learning | Supervised | Instance-based learning is a family of classifiers that classify an unknown sample by finding a similar known sample and using its class. |
| Maximum entropy | Supervised | Maximum entropy is another name for multinomial logistic regression. |
| Maximum likelihood | Supervised | Maximum likelihood is a method of estimating the parameters of a probability distribution by maximizing a likelihood function, so that under the assumed statistical model the observed data is most probable. |
| Neural network | Supervised | A neural network is an interconnected group of artificial neurons inspired by human brain cells that uses a computational model for information processing based on a connectionist approach to computation. |
| Support vector machine (SVM) | Supervised | An SVM is a representation of the known samples as points in space, mapped so that the samples of the different classes are separated by an obvious gap that is as wide as possible. An unknown sample is then mapped into that same space and predicted to belong to a class based on the side of the gap on which it lands. |
| Gaussian mixture models (GMM) | Unsupervised | A GMM is a probabilistic model for characterizing the existence of subpopulations within an unlabeled data set. |
| K-means clustering | Unsupervised | K-means clustering partitions n samples into k clusters in which each observation belongs to the cluster with the nearest mean (cluster centers or cluster centroid), serving as a prototype of the cluster. |


Figure 4.2 Cross-Validation. [Diagram: in each of Iterations 1, 2, 3, …, k, all data are partitioned into test data and training data.]

subset (called the test data). To reduce variability, multiple iterations of cross-validation are performed using different partitions, and the validation results are combined (e.g., averaged) over the iterations to give an estimate of the model's predictive performance, as illustrated in Figure 4.2. There are two ways to partition data into training and test data: (1) leave-one-out cross-validation (LOOCV) and (2) k-fold cross-validation. With LOOCV, the test data consist of one sample and the training data include the rest of the samples. Studies have shown that LOOCV leads to overly optimistic evaluations of machine learning models because of "overfitting." In statistics, overfitting is "the production of an analysis that corresponds too closely or exactly to a particular set of data, and may therefore fail to fit additional data or predict future observations reliably" (Oxford English and Spanish Dictionary, n.d.). In machine learning, overfitting refers to a model that represents the training data too well. Overfitting happens when a model learns the detail and noise in the training data to the extent that it negatively impacts the performance of the model on new data. K-fold cross-validation divides the data into k subsets. The test data are one subset and the training data are the other k−1 subsets. For example, let k = 10 and the number of samples be 100. Then, the test data would be one subset with 100/10 = 10 samples. The training data would be k−1 subsets, which is (10−1) · (10 samples) = 90 samples. (Note: LOOCV is k-fold cross-validation where k = the number of samples.) The larger the k, the more likely that overfitting will occur. 2-fold cross-validation provides the best protection from overfitting. A value of k = 10 is given as typical in many references. This is because (1) data sets used in design are often small and (2) when comparing machine learning models, high k values produce better performance numbers.
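The partitioning arithmetic above can be sketched as a short generator. This is a minimal illustration with a hypothetical function name; it splits the samples in their original order, whereas in practice the data would usually be shuffled first.

```python
def k_fold_splits(n_samples, k):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


# k = 10 with 100 samples: each iteration tests on 10 and trains on 90.
splits = list(k_fold_splits(100, 10))
print(len(splits), len(splits[0][0]), len(splits[0][1]))  # 10 90 10

# LOOCV is the special case where k equals the number of samples.
loocv = list(k_fold_splits(5, 5))
print([test for _, test in loocv])  # [[0], [1], [2], [3], [4]]
```

A model would be trained and scored once per yielded pair, and the k scores then averaged to give the combined estimate described above.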
A machine learning application will work better in the real world if k = 2, i.e., use half of the data to train the machine learning model and half to test it. Selecting the best machine learning model and set of features is based on the validation results in terms of evaluation metrics, which are combined (e.g., averaged) over the iterations to give an estimate of the model’s predictive performance. Evaluation metrics quantify the gap between desired performance and current performance, and measure progress over time.

The typical evaluation metrics employed are based on the standard confusion matrix utilized in statistical analysis, depicted in Figure 4.3. A confusion matrix, also known as an error matrix, is a specific table layout that allows visualization of the performance of an algorithm, typically a supervised learning one. Each row of the matrix represents the instances in a predicted class while each column represents the instances in an actual class. The name stems from the fact that it makes it easy to see if the system is confusing two classes (i.e., frequently mislabeling one as another). The names of the cells (i.e., TP, FP, FN, and TN) come from medical diagnostic tests where Positive or Negative indicate whether the test was positive or negative and where True and False indicate whether the test was a true diagnosis or a false diagnosis. In machine learning, TP and TN mean the machine learning model correctly classified the object, while FP and FN mean the model incorrectly classified the object.

                      Actual POSITIVE        Actual NEGATIVE
Predicted POSITIVE    True-Positive (TP)     False-Positive (FP)
Predicted NEGATIVE    False-Negative (FN)    True-Negative (TN)

Figure 4.3 Confusion Matrix.

The most common metric for evaluating the performance of a machine learning model and its features is Accuracy (ACC) followed by Precision (P), Recall (R), and F-score (F1), calculated as shown in Equations 1–4 below. Although not commonly utilized in most machine learning applications, Pearson’s correlation (r) is frequently used in computer applications of prosody.
\[ \text{Accuracy (ACC)} = \frac{TP + TN}{TP + TN + FP + FN} \tag{1} \]

\[ \text{Precision (P)} = \frac{TP}{TP + FP} \tag{2} \]

\[ \text{Recall (R)} = \frac{TP}{TP + FN} \tag{3} \]

\[ \text{F-score (F1)} = \frac{2 \cdot P \cdot R}{P + R} \tag{4} \]
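As a quick check on Equations 1–4, the following Python sketch computes the four metrics from the cells of a confusion matrix. The function name and the example counts are made up for illustration, and returning 0.0 when a denominator is zero is an assumption of this sketch rather than something the equations specify.

```python
def evaluation_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)            # Equation 1
    # Assumption: report 0.0 when a denominator would be zero.
    precision = tp / (tp + fp) if (tp + fp) else 0.0      # Equation 2
    recall = tp / (tp + fn) if (tp + fn) else 0.0         # Equation 3
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)  # Equation 4
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical counts, chosen only to make the arithmetic easy to follow.
metrics = evaluation_metrics(tp=40, fp=10, fn=20, tn=30)
```

For these counts, accuracy is 70/100 = 0.7, precision is 40/50 = 0.8, recall is 40/60 (about 0.667), and F1 is 80/110 (about 0.727).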


Overall, we intend to build on these machine learning concepts and their use in computer applications of prosody in this chapter and Chapters 5 and 6.

4.3 Acoustic Algorithms for Syllabification

A fairly simple and typical acoustic method of syllabification, described below, has been found to be effective in distinguishing the speech patterns of children with autism, language delay, and typical development (Oller et al., 2010). With Oller et al.’s (2010) method, utterances were separated by low acoustic energy periods of greater than 300 ms. Utterances were then subdivided into syllables by low acoustic energy periods greater than 50 ms, but less than 300 ms. High acoustic energy periods were defined as the acoustic energy level rising to 90% above baseline for at least 50 ms and ending when it fell to less than 10% above baseline. The duration of 300 ms was selected because empirical studies have shown it to be near the top of the distribution for low acoustic energy periods (corresponding typically to silences or consonantal closures) occurring within an utterance (Oller & Lynch, 1992; Rochester, 1973). Oller et al. (2010) noted that human listeners counted noticeably more syllables in utterances than this method of syllabification did. They thought this resulted from the low energy threshold between syllables causing some consonants (e.g., glides and nasals) to be considered as within-syllable acoustic events instead of syllable boundary markers.
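The gap-based grouping in this method can be sketched as follows. This is not Oller et al.’s implementation: it assumes the high acoustic energy periods have already been detected and are given as (start, end) pairs in milliseconds, the function name `segment` and the example intervals are hypothetical, and treating a gap of exactly 300 ms as a syllable boundary is a choice made here because the description leaves that case open.

```python
def segment(high_energy, syllable_gap_ms=50, utterance_gap_ms=300):
    """Group time-sorted high-energy intervals into utterances of syllables.

    Returns a list of utterances; each utterance is a list of
    [start_ms, end_ms] syllables.
    """
    utterances = []
    for start, end in high_energy:
        if not utterances:
            utterances.append([[start, end]])
            continue
        last_syllable = utterances[-1][-1]
        gap = start - last_syllable[1]          # low-energy period length
        if gap > utterance_gap_ms:              # long pause: new utterance
            utterances.append([[start, end]])
        elif gap > syllable_gap_ms:             # medium pause: new syllable
            utterances[-1].append([start, end])
        else:                                   # short dip: same syllable
            last_syllable[1] = end
    return utterances


# Hypothetical high-energy periods (ms): gaps of 30, 80 and 420 ms.
intervals = [(0, 120), (150, 260), (340, 480), (900, 1000)]
result = segment(intervals)
```

Here the 30 ms dip is absorbed into one syllable, the 80 ms gap separates two syllables of the same utterance, and the 420 ms gap starts a new utterance.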

4.4 Phonetic Algorithms for Syllabification

Phonetic algorithms can be categorized as rule-based or data-driven (Marchand, Adsett, & Damper, 2009). Rule-based algorithms operate on linguistic syllabification principles. Rule-based algorithms are sometimes not effective because there are cases where accepted syllabification disobeys one of the principles. This has been fixed with a collection of data-driven algorithms that are based on one or more of the principles, but permit exceptions to the principles that exist in a corpus of correctly syllabified words. Data-driven algorithms are based on unsupervised and supervised machine learning techniques. Marchand, Adsett, and Damper (2009) and Pearson et al. (2000) found that data-driven methods outperformed rule-based ones. The two types of phonetic algorithms (rule-based and data-driven) are described in detail below.

4.4.1 Rule-Based Phonetic Algorithms

Rule-based algorithms are usually based on one of three linguistic principles: the sonority sequencing principle (Clements, 1990; Selkirk, 1984), the legality principle (Hooper, 1972; Kahn, 1976; Pulgram, 1970; Vennemann, 2011), or the maximal onset principle (Kahn, 1976).


A sonority scale is a ranking of speech phones by intensity. The sonority sequencing principle defines the makeup of a syllable based on the sonority value of its phones (Clements, 1990; Hooper, 1972; Selkirk, 1984). The principle says that the middle of a syllable, called the syllable nucleus, which is a vowel or syllabic consonant, is a sonority apex that may have sequences of non-syllabic consonants before (onset) or after (coda) with diminishing sonority values in the direction of both boundaries of the syllable. For example, the monosyllabic word “trust” begins with an onset consisting of t (stop, lowest on the sonority scale) and r (liquid, more sonorous); next is the nucleus, u (vowel, sonority peak); it ends with a coda of s (fricative, less sonorous) and t (stop).

In the strictest sense, the legality principle states that syllable onsets and codas are restricted to those phonotactically possible at word-initial or word-final positions; however, a less rigorous interpretation of the legality principle has been suggested by others (Hooper, 1972; Kahn, 1976; Pulgram, 1970; Vennemann, 2011), which says syllable onsets and codas are legal if they appear in one of the words from a corpus of example words. Words that support the legality principle are “play” for onset and “help” for coda because the vowel is the most sonorous. “Six” and “sixth” are cases which violate a strict interpretation of the principle for codas because the codas are more sonorous than the vowels. “Skew” is an instance of a word that follows a more lenient reading of the principle for onsets because the onset is more sonorous than the vowel.

If more than one legal split of a consonant cluster is permitted, the maximal onset principle chooses the one with the longest onset. For instance, the word “diploma” can be syllabified as di-plo-ma or dip-lo-ma, but di-plo-ma is the one that conforms to the maximal onset principle.
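To make the interaction of legality and maximal onset concrete, here is a toy Python sketch for the “diploma” example. Everything in it is an illustrative assumption: the phone symbols, the tiny vowel set, and the hand-written set of legal onsets stand in for what a real system would take from a phone inventory and a corpus, and the function name `syllabify` is hypothetical. Only word-internal clusters are checked.

```python
# Toy inventories for this example only (assumptions, not a real lexicon).
VOWELS = {"ih", "ow", "ax"}
LEGAL_ONSETS = {("d",), ("p",), ("l",), ("m",), ("p", "l")}


def syllabify(phones):
    """Split phones so each word-internal onset is the longest legal one."""
    nuclei = [i for i, p in enumerate(phones) if p in VOWELS]
    if not nuclei:
        return [list(phones)]
    starts = [0]                                  # start index of each syllable
    for prev, nxt in zip(nuclei, nuclei[1:]):
        cluster = phones[prev + 1:nxt]            # consonants between two vowels
        # Try the longest onset first; an empty onset is always allowed.
        split = len(cluster)
        for candidate in range(len(cluster)):
            if tuple(cluster[candidate:]) in LEGAL_ONSETS:
                split = candidate
                break
        starts.append(prev + 1 + split)
    starts.append(len(phones))
    return [list(phones[a:b]) for a, b in zip(starts, starts[1:])]


diploma = ["d", "ih", "p", "l", "ow", "m", "ax"]
syllables = syllabify(diploma)
```

Because the cluster “p l” is in the legal onset set, it is kept together as the onset of the second syllable, giving di-plo-ma rather than dip-lo-ma. Removing `("p", "l")` from `LEGAL_ONSETS` would yield dip-lo-ma instead.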
The US National Institute of Standards and Technology implemented a tool called tsylb2 (Fisher, 1996) based on the maximal onset principle, which incorporates Kahn’s (1976) set of rules for dividing a sequence of phones into syllables. Shriberg et al. (2005) syllabified the word output of an ASR using tsylb2 (Fisher, 1996) and phone level timing data from the ASR.

4.4.2 Data-Driven Phonetic Algorithms

In reality, rule-based algorithms are at times ineffective for the reason that there are instances where standard syllabification violates one of the principles. This has been ameliorated with an assortment of data-driven algorithms that are founded on one or more of the principles, but which allow exceptions to the principles based on examples from a corpus of correctly syllabified words. The following is a history of the many data-driven algorithms. Daelemans and van den Bosch (1992) proposed a neural network-based syllabification method for Dutch, based on maximal onset and sonority principles. The determination of whether a phone is the first phone of a new syllable is arrived at based on its position in the word


and a window of one to three phones on either side. Daelemans, van den Bosch, and Weijters (1997) investigated applying instance-based learning to syllabification. Instance-based learning usually involves a database search to find the most frequent syllabification for a word. The database is created from all the instances in a correctly syllabified corpus. The most similar word is used for syllabification when the actual word is not in the database. Zhang and Hamilton (1997) described the Learning English Syllabification Rules system, where each grapheme in a word is mapped into a C-S-CL format. C is a consonant; S is a syllabic grapheme; and CL is a consonant cluster. Syllabification rules then specified where breaks occur. Where the breaks occur is learned by analyzing a corpus of correctly syllabified examples. The authors brought together a statistical technique and a symbolic pattern recognition methodology and computed the probability of each break to determine the syllabification rules. Krenn (1997) dealt with syllabification as a tagging job. Her algorithm mechanically creates tags for each phone from a catalog of correctly syllabified phoneme strings. A second-order HMM then calculates the succession of tags. Finally, the syllable edges are retrieved from the tags. Hammond (1997) utilized Optimality Theory in a syllabification algorithm. In this algorithm, all likely syllabifications of a word are assessed with regard to a set of constraints which are ordered according to how likely they are to be broken. Kiraz and Möbius (1998) applied a weighted finite-state machine (FSM) to language-independent syllabification. In this method, finite-state machines are created for onsets, nuclei, and codas by tallying incidences in a corpus of example data. A weighted finite-state machine which receives sequences of one or more syllables as input is then built with these automatons.
Ouellet and Dumouchel (2001) created a syllabification method that gives a cost to each consonant in a group of consonants and then separates the group at the point where the cost is at a minimum. Müller (2001) proposed a combination of rule-​ based and data-​ driven approaches. In later work, Müller (2006) calculated the syllable boundaries of a new phoneme sequence by selecting its most probable groupings of phonemes. The probability of the groupings was estimated with the English words from the CELEX pronunciation dictionary (Baayen, Piepenbrock, & Van Rijn 1993). The rule-​based part is a hand-​ built context-​free grammar of potential syllables. This grammar is then rendered data-​ driven with probabilities estimated from a corpus of example words. Ananthakrishnan (2004) posed the syllabification question as finding the most probable syllable grouping given a phoneme sequence. The syllabification probabilities were calculated with a combination of supervised and unsupervised machine learning techniques. Goldwater and Johnson (2005) described a language-​independent rule-​based procedure that uses


the expectation-maximization algorithm to select a suitable set of rules for a grammar which describes the syllable structure. They investigated two grammar models: positional and bigram. Rogova, Demuynck, and Van Compernolle (2013), in a method similar to Ananthakrishnan (2004), syllabified words by utilizing segmental conditional random fields to combine features based on legality, sonority, and maximal onset with those based on the bigram probabilities of the training corpus. Demberg (2006) utilized a fourth-order HMM as a syllabification module in a larger German text-to-speech system. Schmid, Möbius, and Weidenkaff (2007) enhanced Demberg’s algorithm by using a statistical scheme for separating words into syllables grounded on a joint n-gram exemplar. Their program tagged each phone in a word with either a tag signifying a syllable boundary after the phone (B) or one signifying no syllable boundary (N). The program made use of a fifth-order HMM that looks at both the prior tags and their resultant phonemes to calculate the most likely sequence of tags. Marchand, Adsett, and Damper (2007) evaluated a syllabification-by-analogy tactic which functions like pronunciation-by-analogy. A word with unknown syllabification is matched with words in a database of words with known syllabification based on a set of criteria. Kockmann and Burget (2008) brought into play both acoustic features and phonemes for syllabification. The tokens from an ASR were grouped into three categories: silence, consonant, and vowel. Next, the algorithm evenly split every speech fragment flanked by two silences according to the number of vowels in the fragment. Lastly, to continue successive pitch contours (e.g., from a vowel to a voiced consonant), the algorithm moved the evenly split syllable boundary in the middle of two vowels based on the actual pitch at the boundary.
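Several of the methods above (e.g., Krenn, 1997; Schmid, Möbius, & Weidenkaff, 2007) treat syllabification as tagging and then read the syllables back from the tags. The short Python sketch below illustrates only that final read-back step for the B/N tag scheme. The tag sequence is written by hand rather than predicted by an HMM, and the function name and phone symbols are hypothetical.

```python
def tags_to_syllables(phones, tags):
    """Rebuild syllables from per-phone tags.

    "B" marks a syllable boundary after the phone, "N" marks no boundary.
    """
    if len(phones) != len(tags):
        raise ValueError("need exactly one tag per phone")
    syllables, current = [], []
    for phone, tag in zip(phones, tags):
        current.append(phone)
        if tag == "B":
            syllables.append(current)
            current = []
    if current:                      # the word end closes the last syllable
        syllables.append(current)
    return syllables


# Hand-written tags for an illustrative phone string of "diploma".
word = ["d", "ih", "p", "l", "ow", "m", "ax"]
tags = ["N", "B", "N", "N", "B", "N", "N"]
decoded = tags_to_syllables(word, tags)
```

In a full tagger, the tag sequence would come from the trained model, and this function would be the last stage that turns it into syllables.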
Bartlett, Kondrak, and Cherry (2009) proposed an algorithm that joins a Support Vector Machine (SVM) with a HMM to syllabify words. The multi-​class SVM classifies each phoneme in relation to its location in a syllable. The HMM solves the dilemma of assuming every phoneme in a word is independent of every other phoneme in the word. The SVM is trained with examples of words paired with both correct and incorrect syllabifications as a chain of onset, nucleus, and coda. Mayer (2010) advocated a straightforward statistical method which enumerated the ground-​truth syllables to decide the optimum segmentation of consonant clusters in the middle of a word. Johnson and Kang (2017a) introduced four data-​ driven phonetic algorithms that are hybrids of the rule-​based and decision-​driven approaches. The first one, syllabification-​by-​grouping, is a combination of the algorithms proposed by Ouellet and Dumouchel (2001), Kockmann and Burget (2008), and Mayer (2010), which groups consonants around vowels based on how far in time they are from the vowel. The other three algorithms are all based on the sonority principle (Clements, 1990; Selkirk, 1984). Syllabification-​by-​HMM and syllabification-​by-​k-​means are based


on an HMM, which others have employed (Bartlett, Kondrak, & Cherry, 2009; Demberg, 2006; Krenn, 1997; Schmid, Möbius, & Weidenkaff, 2007) and which is a typical machine learning technique employed with time-series data such as phonetic sequences. The final one, syllabification-by-genetic-algorithm, does not appear to have been utilized by other researchers, but is roughly based on the legality principle (Hooper, 1972; Kahn, 1976; Pulgram, 1970; Vennemann, 2011) and employs a dictionary of syllabification rules, which is automatically created by a genetic algorithm. A key difference between these algorithms and some of the ones described above is that they handle “noisy” phonetic sequences generated from audio files by an ASR. Phone sequences from an ASR are referred to as “noisy” because an ASR cannot recognize the phones with 100% accuracy. This follows the signal processing definition of noise, a general term for unwanted and unknown modifications that a signal may suffer during capture, storage, transmission, processing, or conversion (Tuzlukov, 2002).

Table 4.2 summarizes the various data-driven phonetic algorithms. The columns indicate which principle or principles were employed and the rows show the machine learning technique utilized. Twelve of the syllabification programs are founded on the legality principle where the legal syllabifications were derived from a large corpus of correctly syllabified words by the machine learning algorithm. Four are based on the maximal onset principle where the boundary between vowels was determined statistically. Three augmented the legality principle by simplifying the syllabification rules by making use of the sonority principle to translate phones into sonority levels. Similarly, one made the maximal onset principle less difficult with the sonority principle. And finally, one of the algorithms applied all three principles. HMMs and statistical methods were the most common machine learning techniques applied.

4.5 Data-Driven Phonetic Syllabification Algorithm Implementations

This section describes the implementation of four syllabification algorithms developed by Johnson and Kang (2017a). Although the details are specific to Johnson and Kang (2017a), the algorithms are representative of all of the data-driven algorithms described above. First, we will describe converting audio files to noisy phonetic sequences, which is unique to Johnson and Kang (2017a) and Kockmann and Burget (2008). Johnson and Kang (2017a) employed an ASR to identify individual phones, while Kockmann and Burget (2008) utilized an ASR to recognize only three different acoustic events: silence, consonant, and vowel. Next, we will explain a metric called syllable alignment error with which Johnson and Kang (2017a) compared their four algorithms. Then, we will give the details of Johnson and Kang’s (2017a) syllabification-by-grouping algorithm which is very similar to those used by the other algorithms based on



Table 4.2 Summary of Data-Driven Phonetic Syllabification Algorithms

Columns give the linguistic principle or principles: L = legality principle; MO = maximal onset principle; S+L = sonority and legality principles; S+MO = sonority and maximal onset principles; S+L+MO = sonority, legality, and maximal onset principles. A dash marks an empty cell.

Machine learning model                 L          MO            S+L       S+MO   S+L+MO
Neural network                         -          -             -         20     -
Instance-based learning                1          -             -         -      -
Statistical                            2, 3, 4    13, 14, 15    -         -      -
HMM                                    5, 6, 7    -             17, 18    -      -
Optimality Theory                      8          -             -         -      -
Weighted finite-state machine          9          -             -         -      -
Minimum cost                           -          16            -         -      -
Expectation-maximization algorithm     10         -             -         -      -
Segmental conditional random fields    -          -             -         -      21
Pattern matching                       11         -             -         -      -
SVM/HMM                                12         -             -         -      -
Genetic algorithm                      -          -             19        -      -

Note. 1. Daelemans, van den Bosch, and Weijters (1997); 2. Zhang and Hamilton (1997); 3. Müller (2006); 4. Ananthakrishnan (2004); 5. Krenn (1997); 6. Demberg (2006); 7. Schmid, Möbius, and Weidenkaff (2007); 8. Hammond (1997); 9. Kiraz and Möbius (1998); 10. Goldwater and Johnson (2005); 11. Marchand, Adsett, and Damper (2007); 12. Bartlett, Kondrak, and Cherry (2009); 13. Mayer (2010); 14. Kockmann and Burget (2008); 15. Johnson and Kang (2017a), syllabification-by-grouping; 16. Ouellet and Dumouchel (2001); 17. Johnson and Kang (2017a), syllabification-by-HMM; 18. Johnson and Kang (2017a), syllabification-by-k-means; 19. Johnson and Kang (2017a), syllabification-by-genetic-algorithm; 20. Daelemans and van den Bosch (1992); 21. Rogova, Demuynck, and Van Compernolle (2013)


the maximal onset principle (Mayer, 2010; Kockmann & Burget, 2008; Ouellet & Dumouchel, 2001). Next, the sonority scale that Johnson and Kang (2017a) made use of to simplify syllabification is explained. A similar sonority scale was also utilized by Daelemans and van den Bosch (1992), and Rogova, Demuynck, and Van Compernolle (2013). Then, we describe the details of the other Johnson and Kang (2017a) algorithms: syllabification-by-HMM, syllabification-by-k-means-clustering, and syllabification-by-genetic-algorithm. Each of these follows the legality principle by using the data in a corpus to derive syllabification rules based on sonority values. Similar techniques were applied by the other algorithms based on the legality principle: Daelemans, van den Bosch, and Weijters (1997), Zhang and Hamilton (1997), Müller (2006), Ananthakrishnan (2004), Krenn (1997), Demberg (2006), Schmid, Möbius, and Weidenkaff (2007), Hammond (1997), Kiraz and Möbius (1998), Goldwater and Johnson (2005), Marchand, Adsett, and Damper (2007), and Bartlett, Kondrak, and Cherry (2009).

4.5.1 Corpora

The two corpora discussed in the explanation of Johnson and Kang’s (2017a) four algorithms are described below. Both of these corpora are used extensively by other computer science researchers.

TIMIT Corpus

The DARPA TIMIT Acoustic-Phonetic Continuous Speech Corpus (TIMIT) of read speech is comprised of 6,300 utterances, ten utterances spoken by each of 630 speakers from eight main dialect areas of the United States of America (Garofolo et al., 1993). The read speech is made up of dialect sentences, phonetically-diverse sentences, and phonetically-compact sentences. The two dialect sentences, which were read by all 630 speakers, were designed to identify the dialect of the speakers. The 1,890 phonetically-diverse sentences were picked to include a wide variety of allophonic contexts in the corpus.
Every one of these sentences was read by only one speaker and each speaker read three of them. The 450 phonetically-​compact sentences were intended to include most common phone pairs plus additional phonetic contexts that were either of specific concern or problematic. Five of the 450 phonetically-​compact sentences were spoken by each speaker and seven different speakers read each of them. The alphabetic notations for the 60 phones defined for the TIMIT corpus have become a de facto standard used by other corpora. The corpus also includes manually rectified beginning and ending times for the phones, phonemes, syllables, words, and pauses. The 6,300 utterances are subdivided into a suggested training set of 4,620 utterances and a test set of 1,680 utterances. No speaker appears


in both the training and testing sets. At least one male and one female speaker from each dialect are included in both sets. The extent of sentence overlap between the two sets is minimal. All the phonemes are contained in the test set and occur one or more times in dissimilar situations. TIMIT defines 60 symbols to represent 52 phones, six closures, and two silences. Many authors have reduced that number of TIMIT phones by combining symbols to arrive at a smaller set of 48 or 39 symbols. These authors argue that the smaller sets are more phonetically plausible. For example, they point out that the TIMIT closure symbol pcl relates more to an acoustic event than to a specific phoneme.

Boston University Radio News Corpus (BURNC)

The BURNC is a corpus of professionally read radio news data which includes audio files of the speech and annotations for portions of the corpus (Ostendorf, Price, & Shattuck-Hufnagel, 1995). The corpus is comprised of over seven hours of speech recorded from four male and three female radio announcers. Each story read by an announcer is divided into paragraph-size pieces, which typically include several sentences. The paragraphs are annotated with an orthographic transcription, phonetic alignments, part-of-speech tags, and prosodic labels. The phonetic alignments were produced automatically using automatic speech recognition for the subset of data considered clean (vs. noisy) and then hand-corrected (Ostendorf, Price, & Shattuck-Hufnagel, 1995). Johnson and Kang (2017a) used a subset of 144 utterances. The subset met the following criteria: equal number of males and females (3), equal number of clean paragraphs (vs. noisy) for each speaker (24), and the phonetic alignments needed to be available. Table 4.3 gives the breakdown of the 144 utterances. The naming conventions for the speaker IDs and paragraphs are those used by Ostendorf, Price, and Shattuck-Hufnagel (1995).
Johnson and Kang (2017a) determined the syllable boundaries automatically using the dictionary provided with the corpus. When the phones in the dictionary did not match the phones in the phonetic alignments, the dictionary was manually corrected to match the phonetic alignments.

4.5.2 Converting Audio Files to Noisy Phonetic Sequences

Some of the algorithms discussed in Section 4.4 first recognized the word and then divided it into syllables. This is illustrated in Figure 4.4. One of the issues with this approach is the out-of-vocabulary problem, where the ASR system necessarily replaces a word not in its vocabulary with some other sequence of (wrong) words, thus making the syllabification task less well-defined. Recognizing syllables instead of words would ameliorate this problem by reducing the number of items for the ASR to


Table 4.3 144 BURNC Paragraphs Used by Johnson and Kang (2017a)

Speaker ID Gender














s01p1 s01p2 s01p3 s01p4 s02p1 s02p2 s02p3 s03p1 s03p2 s03p3 s03p4 s03p5 s04p1 s04p2 s04p3 s04p4 s04p5 s04p6 s04p7 s05p1 s05p2 s05p3 s05p4 s05p5 s03p1 s03p2 s03p3 s03p4 s03p5 s05p1 s05p2 s05p3 s05p4 s06p1 s06p2 s06p3 s06p4 s09p1 s09p2 s09p3 s09p4 s09p5 s10p1 s10p2 s10p3 s10p4 s10p5 s10p6 s01p1 s02p1 s03p1 s04p1 s05p1 s07p1 s08p1 s09p1 s09p2 s10p1 rrlp1 rrlp2 rrlp3 rrlp4 rrlp5 rrlp6 rrlp7 trlp1 trlp2 trlp3 trlp4 trlp5 trlp6 trlp7 s01p1 s01p2 s01p3 s01p4 s01p5 s09p1 s09p2 s09p3 s10p1 s10p2 s10p3 s10p4 s10p5 s10p6 s03p3 s03p4 s03p5 s03p6 s02p1 s02p2 s02p3 s02p4 s02p5 s02p7 s01p3 s01p4 s01p7 s01p8 s01p9 s02p3 s02p4 s02p5 s02p6 s02p7 s02p8 s02p9 s02pa s04p2 s04p3 s04p4 s04p5 s03p1 s03p2 s03p3 s03p4 s03p5 s03p6 s03p7 jrlp1 jrlp2 jrlp3 jrlp4 jrlp5 jrlp6 prlp1 prlp2 prlp3 prlp4 rrlp1 rrlp2 rrlp3 rrlp4 rrlp5 rrlp6 rrlp7 trlp1 trlp2 trlp3 trlp4 trlp5 trlp6 trlp7

Figure 4.4 Recognizing the Word and Then Dividing it into Syllables. [Diagram labels: TIMIT Training; Audio file; output “the quick brown fox jumped o.ver the la.zy dog.”]

recognize (i.e., the number of words in the English language vs. the number of syllables in the English language). As depicted in Figure 4.5, Johnson and Kang (2017a) took it one step further by utilizing an ASR that recognized phones instead of words, which further reduces the number of items to recognize, to only the phones of which all the syllables and words in the English language are composed. The ASR that Johnson and Kang (2017a) employed is a derivative of the KALDI speech recognition engine (Povey et al., 2011), which is based on finite-​state transducers (Johnson, Kang, & Ghanem, 2016a, 2016b). The 4,620 utterances in the TIMIT training set were employed to train the ASR. The ASR was trained with TIMIT because of its variety of phonetically-​diverse and phonetically-​compact sentences.


[Diagram labels: TIMIT Training; Audio file; output phone table as follows]

Start time   Stop time   Duration   Phone
0.00         0.52        0.52       sil
0.52         0.68        0.16       sh
0.68         0.78        0.10       iy
0.78         0.84        0.06       vcl
0.84         0.89        0.05       b
0.89         1.04        0.15       ao
1.04         1.07        0.03       ax
1.07         1.26        0.19       w
1.26         1.45        0.19       aa
1.45         1.50        0.05       r
1.50         1.66        0.16       m
1.66         2.01        0.35       sil
2.01         2.06        0.05       dh
2.06         2.10        0.04       l

Figure 4.5 Converting Audio Files to Noisy Phonetic Sequences.
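One convenient way to hold the ASR output of Figure 4.5 in a program is as a list of timed phone records. The sketch below is only an assumed representation (the `Phone` class and variable names are hypothetical, not part of Johnson and Kang’s system). The start times, stop times, and phone labels are copied from the figure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Phone:
    start: float   # seconds
    stop: float    # seconds
    label: str     # TIMIT phone symbol reported by the ASR

    @property
    def duration(self) -> float:
        return round(self.stop - self.start, 2)


# Rows copied from Figure 4.5: (start time, stop time, phone).
ROWS = [
    (0.00, 0.52, "sil"), (0.52, 0.68, "sh"), (0.68, 0.78, "iy"),
    (0.78, 0.84, "vcl"), (0.84, 0.89, "b"), (0.89, 1.04, "ao"),
    (1.04, 1.07, "ax"), (1.07, 1.26, "w"), (1.26, 1.45, "aa"),
    (1.45, 1.50, "r"), (1.50, 1.66, "m"), (1.66, 2.01, "sil"),
    (2.01, 2.06, "dh"), (2.06, 2.10, "l"),
]
phones = [Phone(*row) for row in ROWS]

# Each phone starts where the previous one stops.
contiguous = all(a.stop == b.start for a, b in zip(phones, phones[1:]))

# Dropping the silences leaves the labels a syllabifier has to group.
speech_labels = [p.label for p in phones if p.label != "sil"]
```

The `duration` property recomputes the Duration column from the start and stop times, so the three timing columns of the figure never need to be stored separately.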

4.5.3 Syllable Alignment Error

Johnson and Kang (2017a) compared the performance of the four syllabification algorithms with a metric called syllable alignment error. Syllable alignment error (e) is the percentage of the utterance time when the computer-detected syllable boundaries differ from the ground-truth syllable boundaries. The syllable alignment error (e) is computed in this fashion:

\[
\begin{aligned}
X &= \{\text{utterances}\} \\
Y_x &= \{\text{ground-truth syllables in utterance } x\} \\
Z_x &= \{\text{computer-detected syllables in utterance } x\} \\
x &\in \{1, \ldots, X\}, \quad y \in \{1, \ldots, Y_x\}, \quad z \in \{1, \ldots, Z_x\} \\
\mathrm{begin}_{x,u} &= \text{begin time of syllable } u \text{ in utterance } x \\
\mathrm{end}_{x,v} &= \text{end time of syllable } v \text{ in utterance } x \\
m_{x,y,z} &= \min_{u \in \{y,z\},\, v \in \{y,z\}} \left[\, \left| \mathrm{end}_{x,v} - \mathrm{begin}_{x,u} \right| \,\right] \\
m_{x,z} &= \max_{y} m_{x,y,z} \\
m_x &= \sum_{z=1}^{Z_x} m_{x,z} \\
m_x &= \text{time ground-truth and computer-detected syllable boundaries match in utterance } x \\
d_x &= \text{duration of utterance } x \\
d_x &\ge m_x \\
e &= 1 - \sum_{x=1}^{X} \frac{m_x}{d_x}
\end{aligned}
\]

\[ 0 \le e \le 1 \]

For example, suppose, as depicted in Figure 4.6, an utterance is made up of four ground-truth syllables which we will represent as {y1, y2, y3, y4} (shown as 1, 2, 3, and 4 of the y column in Figure 4.6) with begin and end times of {(0.01, 0.03), (0.03, 0.05), (0.05, 0.06), (0.06, 0.09)}. Now suppose a syllabification algorithm detected only three syllables which we will represent as {z1, z2, z3} (shown as 1, 2, and 3 of the z column in Figure 4.6) with begin and end times of {(0.02, 0.04), (0.04, 0.06), (0.06, 0.10)}. The alignment between the detected syllables and the ground-truth syllables can be represented by the matrix:

\[
\begin{pmatrix}
0.01 & 0.00 & 0.00 \\
0.01 & 0.01 & 0.00 \\
0.00 & 0.01 & 0.00 \\
0.00 & 0.00 & 0.03
\end{pmatrix}
\]

where the column labels are the three detected syllables and the row labels are the four ground-truth syllables and the elements are the alignment between the detected and ground-truth syllable (e.g., the alignment between y1 and z1 is 0.01 and the alignment between y4 and z3 is 0.03). The cells of the matrix are calculated with the above equation: $m_{x,y,z} = \min_{u \in \{y,z\},\, v \in \{y,z\}} [\,|\mathrm{end}_{x,v} - \mathrm{begin}_{x,u}|\,]$. The sum of the maximums of each column (0.01 + 0.01 + 0.03 = 0.05) then represents the time the ground-truth and detected syllables



Figure 4.6 Example of Alignment between Ground-Truth and Detected Syllables. [Timeline from 0.01 to 0.10 in steps of 0.01, with the ground-truth syllables in column y and the detected syllables in column z.]


are aligned ($m_x$). The syllable alignment is $m_x$ divided by the duration of the utterance, $d_x$ (0.09 – 0.01 = 0.08), i.e., 0.05/0.08 = 0.625. And, the syllable alignment error (e) is $1 - m_x/d_x$, or 1 – 0.625 = 0.375.

4.5.4 Syllabification-by-Grouping

For the syllabification-by-grouping method, syllables are made by grouping non-syllabic consonant phones with the vowel or syllabic consonant phone nearest to them with respect to time. The set of vowel and syllabic consonant phones consists of: aa, ae, ah, ao, aw, ax, ax-h, axr, ay, eh, el, em, en, er, ey, ih, iy, ix, ow, oy, uh, uw, and ux. The closest vowel or syllabic consonant is ascertained by calculating the time from the non-syllabic consonant phone to the vowel or syllabic consonant preceding ($t_p$) and following it ($t_f$), where $t_p$ and $t_f$ are the times of the center of the phone. A positive number factor (b) is utilized to group a non-syllabic consonant more frequently with the next vowel or syllabic consonant (i.e., b realizes the maximal onset principle). A non-syllabic consonant is grouped with the previous vowel or syllabic consonant if $b \cdot t_p < t_f$. The factor b was calculated as 1.643 by finding the value in an exhaustive search of possible values from 1.000 to 2.000 in 0.001 increments that resulted in the smallest syllable alignment error (e) when syllabifying the noisy phonetic sequences of the 1,680 TIMIT test utterances.

4.5.5 Sonority Scale

The last three syllabification methods (syllabification-by-HMM, syllabification-by-k-means-clustering, and syllabification-by-genetic-algorithm) make use of a sonority scale in their algorithms. Table 4.4

Table 4.4 Sonority Scale

Sonority value   Nucleus or not   Phones
13               Always           aa ae ah ao aw ax ax-h axr ay eh ey ow oy
12               Always           er
11               Always           ih ix iy uh uw ux
10               Always           el
9                Always           em en eng
8                Sometimes        w y
7                Never            hh hv r
6                Never            l
5                Never            epi m n ng nx
4                Never            v z zh
3                Never            f s sh th
2                Never            b bcl ch d dcl dh dx g gcl jh vcl
1                Never            cl k kcl p pcl q tcl t
0                Never            sil


gives the sonority scale employed in the last three syllabification algorithms. Sonority values were assigned to phones based on the ranking of types from highest to lowest with respect to intensity: vowels, approximants (glides and liquids), nasals, fricatives, affricates, and stops. Vowels are identified by having sonority values of nine or higher, consonants by seven or lower, and semi-vowels equal to eight.

4.5.6 Syllabification by HMM

In this algorithm, an HMM determines where the breaks between syllables occur. A Hidden Markov Model (HMM) is a statistical Markov model in which the system being modeled is assumed to be a Markov process. A Markov model is a stochastic model describing a sequence of possible events in which the probability of each event depends only on the state attained in the previous event. For syllabification, the states of the HMM are break and no-break. The state break signals a syllable boundary before the phone while the state no-break means there is no syllable boundary between phones. Transition to break from either no-break or break indicates a syllable boundary. The sequence of events is the sonority values of the phones. The HMM is trained with a noisy phone sequence of the 1,680 TIMIT test utterances and state transitions force-aligned with the ground-truth syllable boundaries as illustrated in Figure 4.7. The diagram on the left shows the ground-truth syllable boundaries of the phones (designated by opening and closing brackets) and the ground-truth syllable start and stop times. For example, the syllable [sh iy] begins at time

TIMIT ground-truth syllable alignment (left panel)

Syllable    Start time    Stop time
[sh iy]     0.54          0.79
[w ao]      0.79          1.06
[w ao m]
[f l iy]

Noisy phone sequence (right panel)

Phone    Start time    Stop time    Sonority    State
sh       0.52          0.68         3           break
iy       0.68          0.78         11          no-break
vcl      0.78          0.84         2           break
b        0.84          0.89         2           no-break
ao       0.89          1.04         13          no-break
ax       1.04          1.07         13          no-break
w        1.07          1.26         8           break
aa       1.26          1.45         13          no-break
r        1.45          1.50         7           no-break
m        1.50          1.66         5           no-break
dh       2.01          2.06         2           break
l        2.06          2.10         6           no-break
iy       2.10          2.18         11          no-break

Figure 4.7 Example of State Transitions Force-Aligned with the Ground-Truth Syllable Boundaries.


0.54 and ends at time 0.79. The diagram on the right shows the phone start and stop times of the speech segment as detected by the ASR. Note that the phones and times differ from the ground truth because of the errors inherent in the ASR. For instance, the syllable [w ao], with a beginning time of 0.79 and an ending time of 1.06, was detected by the ASR as [vcl b ao ax], beginning at 0.78 and ending at 1.07. The computer translated the phones [vcl b ao ax] to the sonority values [2 2 13 13], and the HMM detected a syllable break after the second 13. The syllable boundaries of the noisy phone sequence are force-aligned to the nearest ground-truth syllable boundary and marked with a state transition to break.

4.5.7  Syllabification by k-Means Clustering

This syllabification algorithm builds on the syllabification-by-HMM algorithm, which utilized a single HMM, by using multiple HMMs tuned to different aspects of the phone sequence. It is sometimes expedient to arrange multiple HMMs in specific structures to facilitate learning and generalization (Fine, Singer, & Tishby, 1998). Even though a fully connected HMM could always be applied, it is often effective to constrain the model by not allowing arbitrary state transitions and to employ multiple constrained HMMs in a larger structure. As illustrated in Figure 4.8, this algorithm identifies the breaks between syllables with three HMMs, each tuned to specific features of a phone sequence. K-means clustering determines which HMM is used to syllabify a tone unit, based on the features of the phones in that tone unit. In this model, tone units are delimited by silent pauses (sil) which are either longer than 200

Input phone sequence, divided into tone units at silent pauses (sil):
Tone unit A: sh iy hh ae vcl d y er vcl d
Tone unit B: aa r cl k s uw cl
Tone unit C: ix ng vcl g r iy s iy w ao sh epi
Tone unit D: w ao dx er ao l y ih er

Processing of each tone unit: feature extraction, then sorting with k-means clustering, then syllabification with the selected HMM.

Syllabified output:
Tone unit A: [sh iy] [hh ae vcl d] [y er vcl d]
Tone unit B: [aa r cl k] [s uw cl]
Tone unit C: [ix ng] [vcl g r iy s iy] [w ao sh epi]
Tone unit D: [w ao dx er] [ao l] [y ih er]

Figure 4.8 Syllabification by k-Means Clustering.


ms or between 150 ms and 200 ms and followed by either post-boundary lengthening or a pitch reset. Note, however, that Brazil's tone units, which primarily motivated the current model, are coded manually by human analysts and are operationalized differently. That is, semantic and pragmatic cues and interpretations are applied only to a limited extent in this automatic process because of the systematic nature of computer programs. See Chapters 2–3 for more details about Brazil's tone unit identification and Chapter 7 for limitations and critical issues related to current ASR applications.

Post-boundary lengthening is specified as the duration of the three phones after the silent pause being longer than the normal duration of those three phones. The normal duration is calculated by summing the average durations of the three phones. The average duration of a phone is calculated over all the phones in the utterance. A pitch reset occurs when the relative pitch of the three phones preceding the silent pause is low and the relative pitch of the three phones following it is high, or the opposite (i.e., high before the pause and low after it).

During training, all the tone units in the noisy phone sequence of the 1,680 TIMIT test utterances are sorted, or divided, into k=3 clusters with k-means clustering based on a set of phone features described in the next paragraph. Each of the three HMMs is then trained separately with one of the three clusters of tone units. During syllabification, the features of a tone unit are calculated. The tone unit is then syllabified with the HMM whose k-means cluster centroid is closest, as measured by the Euclidean distance in the three-dimensional (3D) feature space, to the three phone features of the tone unit. To ensure optimum clustering, k-means clustering is performed 1,000 times, and the clusters with the smallest reconstruction error are used.
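The cluster-selection step described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in the study: the function names, the toy feature vectors, and the plain k-means loop are all hypothetical, and the number of restarts is left as a parameter (the text reports 1,000).

```python
import math
import random

def euclidean(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest(point, centroids):
    """Index of the centroid closest to `point`."""
    return min(range(len(centroids)), key=lambda i: euclidean(point, centroids[i]))

def kmeans_once(points, k, rng, iterations=100):
    """One k-means run from a random start; returns (centroids, error)."""
    centroids = rng.sample(points, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest(p, centroids)].append(p)
        # Recompute each centroid as the mean of its cluster;
        # an empty cluster keeps its previous centroid.
        updated = [
            tuple(sum(dim) / len(c) for dim in zip(*c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        if updated == centroids:
            break
        centroids = updated
    # Reconstruction error: summed squared distance to the nearest centroid.
    error = sum(euclidean(p, centroids[nearest(p, centroids)]) ** 2 for p in points)
    return centroids, error

def best_kmeans(points, k=3, restarts=1000, seed=0):
    """Repeat k-means and keep the run with the smallest reconstruction error."""
    rng = random.Random(seed)
    return min((kmeans_once(points, k, rng) for _ in range(restarts)),
               key=lambda run: run[1])

# Hypothetical tone-unit feature vectors (illustrative values only):
# (average phones between vowels, 50% and 90% most frequent sonority value).
TONE_UNIT_FEATURES = [
    (1.0, 5.0, 5.0), (1.2, 5.0, 6.0), (1.1, 6.0, 5.0),
    (3.0, 2.0, 3.0), (3.2, 2.0, 2.0), (2.9, 3.0, 3.0),
    (5.0, 11.0, 13.0), (5.1, 13.0, 13.0), (4.8, 11.0, 12.0),
]
```

In this sketch, calling `best_kmeans(TONE_UNIT_FEATURES)` returns the centroids of the best run, and `nearest(features, centroids)` gives the index of the cluster, and hence of the HMM, that would be used for a new tone unit.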
The three tone unit phone features that determine which HMM to use are: (1) the average number of phones between vowels, (2) the 50% most frequent sonority value in the tone unit, and (3) the 90% most frequent sonority value in the tone unit. These three features were selected from a set of 12 tone unit features: (1) the average number of phones between vowels (where a vowel is defined as a phone with a sonority value greater than or equal to eight); (2) the least frequent sonority value in the tone unit; (3) to (11) the 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, and 90% most frequent sonority values in the tone unit; and (12) the most frequent sonority value in the tone unit. The selection was done with an exhaustive search of the 220 combinations of the 12 features taken three at a time. The combination selected (i.e., average number of phones between vowels, 50%, and 90% most frequent sonority value) was the one that gave rise to the smallest syllable alignment error (e) when the noisy phonetic sequences of the 1,680 TIMIT test utterances were syllabified using the corresponding three HMMs.

The least frequent sonority value in the tone unit, the 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, and 90% most frequent sonority values in the tone unit, and the most frequent sonority value in the tone unit are calculated as follows: (1) count the number of occurrences of each sonority value in a tone unit; (2) sort the counts from most to least and, where counts are equal, sort them from highest to lowest sonority value; (3) the sonority value with the highest count is designated the "most frequent sonority value in the tone unit"; the sonority value with the lowest count is designated the "least frequent sonority value in the tone unit"; and the 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, and 90% most frequent sonority values in the tone unit are the ones with the second, third, fifth, sixth, seventh, eighth, ninth, eleventh, and twelfth lowest counts, respectively.

4.5.8  Syllabification by Genetic Algorithm

This syllabification approach utilizes a set of rules, or rulebook, to divide the phones into syllables. The rules in the rulebook are decided using a genetic algorithm as explained below. The output of the genetic algorithm is a set of syllabification rules referred to as a rulebook. The syllabification rules describe how a specific sequence of phones is divided into syllables. Table 4.5 gives an example of four syllabification rules. As depicted in Table 4.5, there are two parts to each rule: the pattern and the syllabifying instruction. The pattern is a sequence of sonority values which begins and ends with a vowel and contains zero or more consonants between the vowels. In this algorithm, a vowel is considered to be a phone with a sonority value of nine or above, and a consonant is a phone with a sonority value of eight or below.
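The two-part rule structure can be illustrated with a short Python sketch. It is only one possible representation, assumed for the example: the function name `derive_rules`, the tuple encoding of patterns and instructions, and the Boolean break flags are hypothetical. The input is the sonority sequence and break/no-break states of the noisy phone sequence in Figure 4.7.

```python
VOWEL_THRESHOLD = 9  # in this algorithm, sonority >= 9 counts as a vowel

def derive_rules(sonorities, breaks, vowel_threshold=VOWEL_THRESHOLD):
    """Return (pattern, instruction) pairs for every vowel-to-vowel span.

    sonorities: sonority value of each phone, in order.
    breaks: True where a syllable boundary falls before that phone.
    """
    vowels = [i for i, s in enumerate(sonorities) if s >= vowel_threshold]
    rules = []
    for first, last in zip(vowels, vowels[1:]):
        # The pattern runs from one vowel to the next, inclusive.
        pattern = tuple(sonorities[first:last + 1])
        syllables, current = [], [sonorities[first]]
        for i in range(first + 1, last + 1):
            if breaks[i]:
                # A boundary before phone i closes the current group.
                syllables.append(tuple(current))
                current = []
            current.append(sonorities[i])
        syllables.append(tuple(current))
        rules.append((pattern, tuple(syllables)))
    return rules

# Sonority values and states of the noisy phone sequence in Figure 4.7.
FIG_4_7_SONORITIES = [3, 11, 2, 2, 13, 13, 8, 13, 7, 5, 2, 6, 11]
FIG_4_7_BREAKS = [True, False, True, False, False, False, True,
                  False, False, False, True, False, False]

for pattern, instruction in derive_rules(FIG_4_7_SONORITIES, FIG_4_7_BREAKS):
    print(list(pattern), "->", [list(s) for s in instruction])
```

With this input the sketch yields four rules, whose patterns are [11 2 2 13], [13 13], [13 8 13], and [13 7 5 2 6 11]. Note that the phone with sonority 8 is treated as a consonant here, so it sits inside a pattern rather than bounding one.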
This algorithm defines a vowel as a phone with a sonority value of nine or above, rather than eight or above as in the syllabification-by-k-means-clustering algorithm, because phones with sonority values of nine and above are always the nucleus of a syllable, whereas phones with a sonority value of eight are only sometimes the nucleus of a syllable (see Table 4.4). The syllabifying instruction specifies how to syllabify the sequence of phones. If the rulebook does not contain the exact pattern for the phonetic sequence, the sequence is syllabified with the rule whose pattern is closest, as measured by the Euclidean distance in the n-dimensional space of the n sonority values in the pattern. The rules in the rulebook are chosen with a genetic algorithm. The initial population for the genetic algorithm is first determined by finding all

Table 4.5 Example Rulebook for Syllabification by Genetic Algorithm

Pattern                        Syllabifying Instruction
[13 12]                        [13] [12]
[13 5 10]                      [13] [5 10]
[12 6 9]                       [12 6] [9]
[12 7 6 5 4 5 8 7 6 4 11]      [12 7 6 5] [4 5 8 7] [6 4 11]


Table 4.6 Example Rules Derived from the Force-Aligned Syllable Boundaries Depicted in Figure 4.7

Pattern                Syllabifying Instruction
[11 2 2 13]            [11] [2 2 13]
[13 13]                [13 13]
[13 8 13]              [13] [8 13]
[13 7 5 2 6 11]        [13 7 5] [2 6 11]

of the unique rules represented in the noisy phonetic sequences of the 1,680 TIMIT test utterances using the force-aligned ground-truth syllable boundaries. For example, Table 4.6 shows rules derived from the syllable boundaries depicted in Figure 4.7. These are only some of the rules that can be derived from Figure 4.7. The total number of rules is very large because rules result from grouping adjacent phones into spans of different lengths (from two phones up to a maximum). The unique rules are then divided equally into five rulebooks. The first rulebook contains the top 20% most frequently occurring unique rules in the noisy phonetic sequences of the 1,680 TIMIT test utterances; the second rulebook is composed of the next 20% most frequently occurring ones; and so forth, with the fifth rulebook holding the 20% least frequently occurring ones. These five rulebooks become the initial population for the first generation.

Each generation of the genetic algorithm consists of an initial population of five rulebooks, the recombination of those rulebooks, and the mutation of those rulebooks. The recombination is performed by taking the union and intersection of each pair of rulebooks, resulting in at most 20 rulebooks. (Rulebooks that do not contain at least one rule for each pattern represented in the noisy phonetic sequences of the 1,680 TIMIT test utterances are not assessed, which reduces the number below 20.) The mutation is carried out by randomly changing, deleting, or adding one of the unique rules in each of the original five rulebooks, the ten union rulebooks, and the ten intersection rulebooks, leading to possibly 25 mutated rulebooks. Changing is accomplished by randomly substituting one of the rules in the rulebook with a randomly selected rule from the full set of unique rules.
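One generation of this procedure can be sketched as follows. The sketch is a hedged illustration rather than the authors' code: rulebooks are assumed to be frozensets of (pattern, instruction) tuples, the names `recombine`, `mutate`, `covers`, and `next_generation` are hypothetical, and `toy_error` is an invented stand-in for the syllable alignment error (e), which would in practice be computed by syllabifying the training utterances.

```python
import itertools
import random

def recombine(rulebooks):
    """Union and intersection of every pair: 5 rulebooks give 10 + 10 = 20."""
    out = []
    for a, b in itertools.combinations(rulebooks, 2):
        out.append(a | b)
        out.append(a & b)
    return out

def mutate(rulebook, all_rules, rng):
    """Randomly change, delete, or add one rule."""
    rules = set(rulebook)
    op = rng.choice(["change", "delete", "add"])
    if op in ("change", "delete") and rules:
        rules.discard(rng.choice(sorted(rules)))
    if op in ("change", "add"):
        rules.add(rng.choice(sorted(all_rules)))
    return frozenset(rules)

def covers(rulebook, patterns):
    """True if the rulebook has at least one rule for every pattern."""
    return patterns <= {pattern for pattern, _ in rulebook}

def next_generation(population, all_rules, error, rng, keep=5):
    """Recombine, mutate, drop non-covering rulebooks, keep the best."""
    unmutated = list(population) + recombine(population)   # up to 25
    mutated = [mutate(rb, all_rules, rng) for rb in unmutated]  # up to 25
    patterns = {pattern for pattern, _ in all_rules}
    candidates = [rb for rb in unmutated + mutated if covers(rb, patterns)]
    return sorted(candidates, key=error)[:keep]

# Toy rules (hypothetical): two candidate instructions for each of two patterns.
R1 = ((13, 5, 10), ((13,), (5, 10)))
R2 = ((13, 5, 10), ((13, 5), (10,)))
R3 = ((12, 6, 9), ((12, 6), (9,)))
R4 = ((12, 6, 9), ((12,), (6, 9)))
ALL_RULES = frozenset({R1, R2, R3, R4})
TARGET = frozenset({R1, R3})  # invented "correct" rulebook for the toy error

def toy_error(rulebook):
    """Stand-in fitness: number of rules that differ from TARGET."""
    return len(rulebook ^ TARGET)

POPULATION = [frozenset(s) for s in
              ({R1, R4}, {R2, R3}, {R2, R4}, {R1, R2, R4}, {R2, R3, R4})]
```

Because the surviving members of the original population are always among the candidates, the best error in this sketch cannot get worse from one generation to the next. Repeating `next_generation` for a fixed number of generations mirrors the loop the chapter goes on to describe.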
The fitness of each generation is assessed with the syllable alignment error (e) resulting from using the rulebooks (possibly 50) to syllabify the noisy phonetic sequences of the 1,680 TIMIT test utterances. The five rulebooks with the lowest e become the initial population for the next generation. The genetic algorithm is executed for 50 generations. The rulebook with the lowest e in the last generation becomes the syllabification rulebook. To allow for corpora which might have patterns larger than the largest pattern in TIMIT, the syllabifying instruction for patterns not in the rulebook is to divide the pattern into two syllables before the first smallest sonority value. For example, the syllabifying instruction for the pattern [13 6 0 5 0 12] is [13 6] [0 5 0 12]. Each pattern begins and ends with a vowel, which guarantees that segmentation will not result in a syllable that does not contain a vowel.

Genetic algorithms have a tendency to converge towards local optima rather than the global optimum of the problem. This shortcoming can be alleviated by increasing the rate of mutation and by using selection techniques that maintain a diverse population of solutions (Taherdangkoo et al., 2013). The rate of mutation is increased by randomly changing, deleting, or adding one of the unique rules in each of the original five rulebooks, the ten union rulebooks, and the ten intersection rulebooks, leading to possibly 25 mutated rulebooks, the same number as unmutated rulebooks. A diverse population is maintained by randomly substituting one of the rules in the rulebook with a randomly selected rule from the full set of unique rules.

4.5.9  Comparison of Syllabification Algorithms

Johnson and Kang (2017a) compared the performance of the syllabification algorithms by training them with one corpus and testing them with another (i.e., TIMIT and BURNC). This provides a better comparison of algorithms than held-out comparisons such as k-fold cross-validation, because algorithms generally perform better when they are tested on a held-out portion of the same data set they were trained on. Table 4.7 shows the syllable alignment error (e) when the syllabification algorithms were tested on TIMIT (the same corpus they were trained on) and on the Boston University Radio News Corpus (BURNC). As would be expected, Table 4.7 shows that e is lower (i.e., better) for all four methods when they are assessed with the same corpus they were trained on.
Table 4.7 also shows that syllabification-by-genetic-algorithm is the best one (lowest e) when evaluated with another, unseen corpus (BURNC), followed by syllabification-by-k-means, by-HMM, and by-grouping.

Table 4.7 TIMIT and BURNC Syllabification Results

Methods compared: syllabification by genetic algorithm, by k-means, by HMM, and by grouping.
Syllable alignment error (e), TIMIT (training set): 0.170 0.177 0.201
Syllable alignment error (e), BURNC (test set): 0.253 0.258 0.277



4.6  Summary

In summary, researchers have employed both acoustic-based and phonetic-based algorithms to syllabify sequences of phones. The phonetic-based algorithms combined machine learning and algorithmic data-driven methods with the maximal onset, sonority, and legality principles. The machine learning techniques consisted of both supervised and unsupervised techniques, including genetic algorithms, k-means clustering, neural networks, instance-based learning, HMMs, weighted finite state transducers, cost minimization, expectation-maximization, statistical methods, SVMs, and segmental conditional random fields. The machine learning methods and algorithms took a variety of inputs derived from corpora of ground-truth data: the phone's position in the word, a window of one to three phones on either side, instances of correctly syllabified words, syllabification rules specifying where breaks occur in a sequence of phones, probabilities of syllable boundaries occurring before and after a phone, a cost given to each consonant, and n-gram statistical phone models. Some of the phonetic-based algorithms depended on noise-free phone detection while others worked with noisy phone detection.

Syllabification is usually part of a larger application such as automatic English proficiency assessment or autism diagnosis. Research shows that the best syllabification algorithm depends on the application. For example, Oller et al. (2010) noted that human listeners counted noticeably more syllables in utterances than their acoustic-based syllabification algorithm did, implying that the performance level of the algorithm was relatively poor. However, it was more than adequate to differentiate the speech patterns of children with autism, language delay, and typical development. On the other hand, although Johnson and Kang (2017a) found syllabification-by-genetic-algorithm to be the best when assessed with syllable alignment error (e), they found syllabification-by-grouping to produce the best results in their application, English speaking proficiency scoring.



5  Computerized Systems for Measuring Suprasegmental Features

PROMINENT POINTS

This chapter presents an overview of the following:
5.1 Prominent Syllables
5.2 Pitch Contour Models
5.3 Algorithms for Detecting Suprasegmental Features of the ToBI Model
5.4 Algorithms for Detecting Suprasegmental Features Motivated by Brazil's Model
5.5 Algorithms for Calculating Suprasegmental Measures
5.6 Summary

5.1  Prominent Syllables

Various prosody models use the term "prominent syllable."¹ Here, Brazil's (1997) definition of a prominent syllable is briefly described for the purpose of our current model description. Brazil states that prominence is a property of the syllable and not the word (also see Chapter 2 for an introduction to prominence in Brazil's framework). He provides examples of words with more than one prominent syllable and words whose prominent syllable varies depending on the intonational meaning the speaker is imparting. Brazil argues that prominent syllables are recognized by the hearer as having more emphasis than other syllables. Brazil further notes in his description of prominent syllables that prominence should be contrasted with word or lexical stress. Lexical stress concerns the syllable within a content word that is stressed. Prominence, however, concerns the use of stress to distinguish those words that carry more meaning, more emphasis, more contrast, etc. in utterances. Thus, a syllable within a word that normally receives lexical stress may receive additional pitch, length, or loudness to distinguish meaning at the level of prominence. Alternatively, a syllable that does not usually receive stress at the word level (such as a function word) may receive prominence for communicative purposes.

DOI: 10.4324/9781003022695-6

5.2  Pitch Contour Models

In linguistics, speech synthesis, speech recognition, and music, the pitch contour of a sound is a curve or function that follows the perceived pitch of the sound over time. It is fundamental to the linguistic notion of tone, where the pitch or change in pitch of a speech segment over time affects the semantic meaning of a sound. It also indicates intonation in pitch accent languages. Pitch is the fundamental frequency (Hz) of the sound. The fundamental frequency is defined as the lowest-frequency sinusoid in the sum, or superposition, of sinusoids that make up a sound. Since the fundamental frequency, which is referred to as F0, is the lowest frequency, it is also generally perceived as the loudest. The ear identifies it as the specific pitch of the musical tone or speech intonation. The other, higher-frequency sinusoids are not heard separately but are blended together by the ear into a single tone. The fundamental frequency of speech can vary from 40 Hz for low-pitched male voices to 600 Hz for children or high-pitched female voices.

A pitch detection algorithm is an algorithm designed to estimate the pitch or fundamental frequency of a digital recording of speech. These algorithms are very complex and are beyond the scope of this book. It should be noted that their outputs are estimates of F0. These estimates are quite sensitive to the pitch detection algorithm employed and even to the computer on which it is run. Johnson and Kang (2016a) found that pitch contours generated by the Multi-Speech and Computerized Speech Laboratory (CSL) Software (KayPENTAX, 2008) and those generated by Praat (Boersma & Weenink, 2014) were significantly different. Further differences in pitch contours were found even between various versions of Praat, and more differences were identified between the same versions of Praat running on different computers. Maryn et al. (2009) also reported this difference between Multi-Speech and CSL Software and Praat.
They declared that pitch and intensity values were not comparable. Amir, Wolf, and Amir (2009) also noted this discrepancy and added that the findings from Multi-Speech and CSL Software and Praat should not be combined. Because of these differences, comparisons of results obtained with different tools should be interpreted carefully.

A number of models have been developed to represent the pitch contour of speech. Four of the more popular models used in computer algorithms (i.e., TILT, Bézier, Quantized Contour Model (QCM), and 4-point) are discussed below in detail.

5.2.1  TILT Pitch Contour Model

TILT is one of the more popular models for parameterizing pitch contours (Taylor, 2000). The model was developed to automatically analyze and synthesize speech intonation. In the model, intonation is represented as a sequence of events, which are characterized by parameters representing amplitude, duration, and tilt. Tilt is a measure of the shape of the event, or pitch contour. A popular public domain text-to-speech system, Festival (Black et al., 1998), applies this model to synthesize speech intonation. The model is illustrated in Figure 5.1.

Figure 5.1 Parameters of the TILT Model of a Pitch Contour. [Plot of pitch (Hz) against duration (sec), with the peak marked.]

Three points are defined: the start of the event, the peak (the highest point), and the end of the event. Each event is characterized by five RFC (rise/fall/connection) parameters: rise amplitude (difference in pitch between the pitch value at the peak and at the start, which is always greater than or equal to zero), rise duration (distance in time from the start of the event to the peak), fall amplitude (pitch distance from the end to the peak, which is always less than or equal to zero), fall duration (distance in time from the peak to the end), and vowel position (distance in time from the start of the pitch contour to the start of the vowel). The TILT representation transforms four of the RFC parameters into three TILT parameters: duration (sum of the rise and fall durations), amplitude (sum of the absolute values of the rise and fall amplitudes), and tilt (a dimensionless number which expresses the overall shape of the event). The TILT parameters are calculated as follows:

$s$ = start of event (1)
$p$ = peak (the highest point) (2)
$e$ = end of event (3)
$a_{rise}$ = difference in pitch between the pitch value at the peak ($p$) and at the start ($s$), $a_{rise} \geq 0$ (4)
$d_{rise}$ = distance in time from the start ($s$) of the event to the peak ($p$) (5)
$a_{fall}$ = pitch distance from the end ($e$) to the peak ($p$),