Methods in Empirical Prosody Research

Language, Context, and Cognition
Edited by Anita Steube

Volume 3

Walter de Gruyter · Berlin · New York

Methods in Empirical Prosody Research

Edited by Stefan Sudhoff, Denisa Lenertová, Roland Meyer, Sandra Pappert, Petra Augurzky, Ina Mleinek, Nicole Richter, Johannes Schließer

Walter de Gruyter · Berlin · New York

∞ Printed on acid-free paper which falls within the guidelines of the ANSI to ensure permanence and durability.

Library of Congress Cataloging-in-Publication Data

Methods in empirical prosody research / edited by Stefan Sudhoff ... [et al.].
p. cm. — (Language, context, and cognition ; v. 3)
Includes index.
ISBN-13: 978-3-11-018856-1 (alk. paper)
ISBN-10: 3-11-018856-2 (alk. paper)
1. Prosodic analysis (Linguistics) - Research - Methodology. I. Sudhoff, Stefan, 1977- II. Series.
P224.M48 2006
414'.6—dc22
2006015632

ISBN-13: 978-3-11-018856-1
ISBN-10: 3-11-018856-2

Bibliographic information published by Die Deutsche Bibliothek

Die Deutsche Bibliothek lists this publication in the Deutsche Nationalbibliografie; detailed bibliographic data are available on the Internet.

© Copyright 2006 by Walter de Gruyter GmbH & Co. KG, D-10785 Berlin

All rights reserved, including those of translation into foreign languages. No part of this book may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopy, recording or any information storage and retrieval system, without permission in writing from the publisher.

Printed in Germany
Cover design: Christopher Schneider, Berlin
Printing and binding: Hubert & Co., Göttingen

Contents

Preface ... VII

Acoustic Segment Durations in Prosodic Research: A Practical Guide
Alice Turk, Satsuki Nakai & Mariko Sugahara ... 1

Stylization of Pitch Contours
Dik J. Hermes ... 29

Voice Source Parameters and Prosodic Analysis
Christophe d'Alessandro ... 63

Prosody beyond Fundamental Frequency
Greg Kochanski ... 89

Paradigms in Experimental Prosodic Analysis: From Measurement to Function
Klaus J. Kohler ... 123

Information Structure and Prosody: Linguistic Categories for Spoken Language Annotation
Stefan Baumann ... 153

Time Types and Time Trees: Prosodic Mining and Alignment of Temporally Annotated Data
Dafydd Gibbon ... 181

Probing the Dynamics of Speech Production
Fred Cummins ... 211

Using Interactive Tasks to Elicit Natural Dialogue
Kiwako Ito & Shari R. Speer ... 229

Online Methods for the Investigation of Prosody
Duane G. Watson, Christine A. Gunlogson & Michael K. Tanenhaus ... 259

How to Obtain and Process Perceptual Judgements of Intonational Meaning
Toni Rietveld & Aoju Chen ... 283

Experimental Approaches to Establishing Discreteness of Intonational Contrasts
Carlos Gussenhoven ... 321

Phonetic Grounding of Prosodic Categories
Katrin Schneider, Britta Lintfert, Grzegorz Dogil & Bernd Möbius ... 335

Portraits of the Authors ... 363
Author Index ... 367
Subject Index ... 375

Preface

Research in prosody has a relatively rich empirical tradition compared with other linguistic disciplines. Located at the intersection of theoretical linguistics, psycholinguistics, and phonetics, it can draw from a great variety of methods, ranging from the systematic observation of naturally occurring data to controlled laboratory experiments. While this allows the researcher to use multiple paradigms in testing the reliability of data patterns, the criteria for choosing the adequate method(s) to investigate a given research question are often unclear. Furthermore, the gathering, treatment, and analysis of prosodic data presuppose the thorough consideration of relevant methodological aspects.

With these issues in mind, members of the PhD program "Universality and Diversity: Linguistic Structures and Processes" and of the DFG research group "Linguistic Foundations of Cognitive Science: Linguistic and Conceptual Knowledge" (FOR 349) organized the workshop "Experimental Prosody Research", which was held at the University of Leipzig in October 2004. This workshop was conceived as a series of tutorials to be presented by experts in the field. The present volume can be seen as a follow-up publication of the workshop, covering a broadened array of topics. Researchers with different backgrounds were asked to contribute state-of-the-art papers on the choice and measurement of prosodic parameters, the establishment of prosodic categories, annotation structures for spoken-language data, and experimental methods for production and perception studies (including the construction of materials, modes of presentation, online vs. offline tasks, judgement scales, data processing, and statistical evaluation). The goal of the volume is to enable researchers in linguistics and related fields to make more informed decisions concerning their empirical prosody work.

The contributions can roughly be divided into three thematic sections: The first group of papers deals with acoustic parameters and their phonological correlates. They are followed by two articles on annotation structures and their application. The third section contains contributions on selected aspects of experimental design.

The initial contribution of the first thematic section deals with strategies of speech segmentation relevant for durational analyses. In their paper "Acoustic Segment Durations in Prosodic Research: A Practical Guide", Alice Turk, Satsuki Nakai and Mariko Sugahara introduce a method based on identifying clearly recognizable acoustic landmarks, so-called oral consonantal constriction events, and use numerous examples from English, Dutch, Norwegian, Finnish, and Japanese to illustrate how this primarily spectrogram-based technique can be applied in practice. The authors present a range of segment types with constrictions accurately measurable in VCV and cluster contexts, and discuss how to handle problematic cases, such as the identification of laterals. They propose guidelines for carrying out tightly controlled prosodic experiments while minimizing the influences of confounding variables such as inconsistencies in speech rate. The proposed segmentation method is argued to be valuable for optimizing the automatic segmentation tools used in the investigation of durational parameters in speech corpora.

Dik J. Hermes' article "Stylization of Pitch Contours" discusses paradigms for the reduction of pitch contours to their perceptually essential properties by eliminating irrelevant details such as microprosodic fluctuations and interruptions due to unvoiced speech segments. While containing as little information as possible, the resulting representation must correspond to the mental image of a perceived pitch contour, i.e., a resynthesis based on the stylized contour should be perceptually equivalent to a resynthesis on the basis of the original contour. Hermes describes two different stylization methods. In the first, a pitch contour is reduced to as small a number of continuous straight lines (or curves) as possible without changing the perceived intonation, resulting in a "close-copy" stylization. Resting on the assumption that the perception of pitch contours depends, above all, on the pitch of the syllabic nuclei, the second method displays pitch contours as sequences of tones aligned with the syllable structure. The two approaches to the stylization of pitch contours are discussed with respect to their theoretical foundations, practical characteristics, possibilities and limitations.

Christophe d'Alessandro's chapter "Voice Source Parameters and Prosodic Analysis" focuses on the various sound parameters reflected in voice quality and discusses the difficulties related to their acoustic analysis. The author approaches voice source phenomena from the perspective of their linguistic (phonological), expressive, and phonostylistic functions. While voice source parameters are assumed to be of secondary importance compared to F0 in read speech and other laboratory speech conditions, they play a significant role in expressive communication situations, according to d'Alessandro. The range of parameters discussed in the paper includes the degree of periodicity (e.g. in unvoiced, whispered or voiced speech), the voice open quotient (e.g. for strangled tones or lax voice), and the voice spectral tilt. The author points out that their investigation requires specific, and sometimes intricate, signal analysis techniques. After presenting a summary of phonation types and voice quality dimensions in addition to a detailed review of voice source models, the author describes techniques for the analysis of aperiodicities, voice pressure, vocal effort, and voice registers.

In his article "Prosody beyond Fundamental Frequency", Greg Kochanski argues that a description of the prosodic ability of the human voice must include a number of acoustic properties other than F0 and duration, such as loudness, slope of the speech spectrum, timbre, and degree of voicing. His point is based on information-theoretical estimates for the amount of information that must be carried by the prosodic properties of the sound and for the capacity of the individual acoustic channels. Kochanski discusses the implications of this account both for the linguistic theory of prosody and for experimental techniques and algorithms for the relevant acoustic measurements. The experimental methods evaluated by the author relate to the questions of (i) how much information can be transmitted (channel capacity), (ii) how a linguistic feature is encoded, and (iii) how much and which information is actually transmitted.

Klaus J. Kohler's paper "Paradigms in Experimental Prosodic Analysis: From Measurement to Function" evaluates present-day experimental prosody research, contrasting the prevalent Autosegmental-Metrical approach with the framework of Function-Oriented Experimental Phonetics. In Kohler's opinion, the former has neglected crucial ingredients of prosodic phonology such as time, communicative function, and the listener. Several perception experiments on the timing of peaks and valleys are discussed, the results of which are interpreted as evidence for a contour rather than a tone sequence model of intonation. As for the second factor, Kohler argues against the subordination of function to form and against the restriction to linguistic (as opposed to paralinguistic) function. All prosodic categories must moreover be perceptually confirmed. The author introduces the Kiel Intonation Model (KIM), which rests on these assumptions, as an alternative model of prosodic phonology.

The next chapter, entitled "Information Structure and Prosody: Linguistic Categories for Spoken Language Annotation", shifts the focus to annotation structures and their application. Stefan Baumann deals with the annotation of prosody and information structure in West Germanic languages. While a spoken-language annotation system for prosody has previously been established (GToBI), a system for the annotation of information structure is still a desideratum. After a discussion of information structural dimensions such as theme/rheme, given/accessible/new, and focus/background, the paper introduces a multi-layer annotation system developed in the MULI (Multilingual Information Structure) project. The MULI system combines an annotation according to the information structural dimensions with an annotation of prosody, as amply exemplified by data from German.

The second contribution of this thematic section, "Time Types and Time Trees: Prosodic Mining and Alignment of Temporally Annotated Data" by Dafydd Gibbon, focuses on a methodology for deriving hierarchical models of timing from large amounts of annotated speech. The author argues for a data-driven approach based on high-quality corpora of naturally occurring phonetic material, demonstrating how a representation of timing at the prosodic level can be induced automatically. A tree-building algorithm is used to scan the corpus for local differences in syllable duration and to construct an elaborate Time Tree. The latter is compared to a syntactic bracketing of the same material, resulting in a numeric Tree Similarity Index. The close match between iambic patterns in the Time Tree and syntactic constituents points to the distinction between closed-class and open-class items. Gibbon provides an in-depth discussion of the theoretical underpinnings of Time Tree induction, ranging from time in phonetics to local and global linear models of rhythm.

The third thematic section, related to aspects of experimental design, is introduced by Fred Cummins' article "Probing the Dynamics of Speech Production". The chapter concentrates on two novel experimental approaches to the study of macroscopic timing phenomena in speech, Speech Cycling and Synchronous Speech. They are designed to unveil internal characteristics of the speech production system by intervening in the speaking process, more specifically through external forcing and mutual entrainment. In the first method, speech production is influenced by means of periodic auditory signals cueing phrase onsets or stress positions. The second technique involves two speakers reading a prepared text in synchrony with one another, which leads to a considerable reduction of interspeaker variability by suppressing unpredictable idiosyncratic and expressive features. Cummins argues that these methods are especially valuable for the examination of dynamic properties of speech, such as rhythmic organization, phrasing, and pausing.

In the next paper, "Using Interactive Tasks to Elicit Natural Dialogue", Kiwako Ito and Shari R. Speer focus on the use of scripted vs. spontaneous speech in prosodic experiments. Three studies examining PP attachment ambiguities are reviewed with respect to the varying degree of naturalness of the dialogue and the extent to which disambiguating prosodic cues are produced. The authors show that experiments in which the speakers plan their utterances themselves yield the best results for both factors. They demonstrate how problems that arise with highly natural dialogue as well as with highly scripted speech can be avoided by using interactive speech tasks. Experiments on the interaction between information status and tonal behavior within adjective-noun sequences serve to illustrate the tree decoration task. This real-world object manipulation task developed by the authors makes it possible to constrain the set of target phrases and to guide the speakers to produce certain contrasts. Due to the naturalness of the elicited data, this experimental design not only provides reliable prosodic cues, but can also evoke prosodic patterns which have not been observed in reading experiments.

Concentrating on the distinction between presentational (H*) and contrastive (L+H*) pitch accents in English, Duane G. Watson, Christine A. Gunlogson and Michael K. Tanenhaus demonstrate that the eye-tracking method can be adopted as a useful paradigm in prosody research. In their chapter, entitled "Online Methods for the Investigation of Prosody", they illustrate how empirical research can shed light on theoretically grounded controversies concerning the categorical distinction between distinct accent types and their discourse function. By analyzing eye fixations in a restricted visual context, the authors show that the accent types have different distributions: While the presentational accent is compatible with a broad range of interpretations, the use of contrastive accents is much more restricted. According to the authors, eye-tracking experiments indicate that prosodic information can be used as a predictor for subsequent structures in real-time language comprehension. Possible extensions of this technique for use in related issues, such as prosodic boundary placement, are discussed in the paper.

Toni Rietveld and Aoju Chen's contribution, "How to Obtain and Process Perceptual Judgements of Intonational Meaning", focuses on perceptual studies of intonational meaning in Germanic languages. Intonation is used to convey linguistic and paralinguistic meaning, which is often assumed to correspond to the phonological representation of the pitch contour (expressing discrete form-function relations) and its phonetic implementation (signaling gradient meaning differences), respectively. The authors address the question of how to obtain and process perceptual judgements of contrasts of the latter kind. In the first part of the paper, four unidimensional scaling methods - Equal Appearing Interval Scale (EAI), Paired Comparisons (PC), Direct Magnitude Estimation (DME), and Visual Analogue Scale (VAS) - as well as a multidimensional scaling procedure are discussed. In the second part, the suitability of the EAI, DME, and VAS is evaluated on the basis of two experimental studies of the perception of friendliness and pitch register in Dutch. The results suggest that the VAS is the most sensitive of the scales considered in capturing perceived paralinguistic meaning differences, but that the EAI scale might be more suitable for dealing with the perception of properties that are more directly related to pitch, such as register.

A closely related issue is the topic of Carlos Gussenhoven's paper, "Experimental Approaches to Establishing Discreteness of Intonational Contrasts". The author discusses methods that allow dissociating discrete from gradual differences between intonational pitch contours. The former are attributed to distinct phonological categories, the latter to paralinguistically meaningful variation between different pronunciations of the same phonological contour. A number of experimental techniques are reviewed. The author suggests that it is better for participants to judge the appropriateness of a given contour as an imitation of another contour (passable-imitation task) than to assign the contour to abstract categories (categorical perception task). The discussion of the consequences of alternative paradigms (the semantic difference task, the semantic scaling task, and the imitation task) leads to the conclusion that tasks should more explicitly engage listeners' intuitions about phonological identity in order to minimize the interference of purely phonetic factors in responses to phonologically equivalent contours.

In the final contribution, "Phonetic Grounding of Prosodic Categories", Katrin Schneider, Britta Lintfert, Grzegorz Dogil, and Bernd Möbius argue that prosodic categories emerge from probability distributions, which correspond to regions in the parametric phonetic space. According to the authors, their emergence results from the application of a speaker-internal analysis-by-synthesis process. Two methodological points relevant to this theoretical reasoning are discussed: the testing of the categorical status of prosodic events, and the mapping of prosodic categories onto the continuous acoustic parameters of the speech signal. With respect to the first issue, the authors demonstrate the design, stimulus generation, and evaluation of two experimental paradigms - categorical perception and the perceptual magnet effect - using the perception of boundary tones in German as an example. A computational model for the emergence of prosodic categories is introduced in connection with the second issue. This model is based on the probability distributions of phonetic parameters, the relevance of which is illustrated by a statistical frequency evaluation of a speech corpus.

As mentioned earlier, the idea for the present compendium grew out of the Leipzig workshop on "Experimental Prosody Research". We would like to thank the numerous participants of the workshop for their fruitful discussion and, above all, the presenters for their contributions and tutorials: Stefan Baumann, Aoju Chen, Carlos Gussenhoven, D. Robert Ladd, and Bert Remijsen. The workshop would not have been possible without generous funding by the DFG research group "Linguistic Foundations of Cognitive Science: Linguistic and Conceptual Knowledge" (FOR 349) and the DFG PhD program "Universality and Diversity: Linguistic Structures and Processes", both at the University of Leipzig. Their support is gratefully acknowledged. We would also like to thank Anita Steube, head of the research group and editor of "Language, Context, and Cognition", for her support and for the inclusion of this volume into the series. Our appreciation also goes out to Sebastian Hellmann for his technical assistance, and to David Dichelle for his proofreading work.

Leipzig, April 2006
The editors

Alice Turk, Satsuki Nakai (Edinburgh) & Mariko Sugahara (Kyoto)*

Acoustic Segment Durations in Prosodic Research: A Practical Guide

1 Introduction

Carefully designed durational experiments are promising tools for testing and formulating theories of prosodic structure, its relationship with grammar, and its phonetic implementation. If properly designed, they allow for tight control of prosodic variables of interest, and can yield reliable durational measurements. Results from these tightly controlled experiments can then be used to form hypotheses about the way segment durations vary in more natural speech situations, or can be used to test hypotheses based on observations of natural speech corpora. In this paper, we discuss methodological issues relating to such studies.

In the first part of the paper, we outline principles of reliable and accurate acoustic speech segmentation that allow us to make inferences about the durations of consonantal constrictions and surrounding, mostly vocalic, intervals. In doing so, we discuss the relative segmentability of a range of segment types, in the hope that this will help researchers to design materials with the maximum likelihood of accurate segmentation. In the second part of the paper, we discuss additional methodological issues relating to the design of durational experiments. These include ways of designing materials to control for sources of known durational variability, and methods for eliciting prosodic contrasts.

* We thank Matthew Aylett, Simon King, Peter Ladefoged, Jim Scobbie, Laurence White, Ivan Yuen, and especially Stefanie Shattuck-Hufnagel and Jim Sawusch for discussion of ideas presented here, and Bert Remijsen for detailed comments on a pre-final version of this chapter. We are also grateful to two anonymous reviewers for their useful comments, to Sari Kunnari for help in collecting the Finnish data, and to Kari Suomi and Richard Ogden for helpful information regarding Finnish phonology and phonetics. This work was supported by Leverhulme Trust and British Academy grants to the first and second authors, and an AHRC grant to the first and third authors.


2 Principles of acoustic speech segmentation

Segmenting the speech signal into phone-sized units is somewhat of an artificial task, since the gestures used to produce successive speech sounds overlap to a great degree, as illustrated in Browman and Goldstein (1990) and elsewhere. For example, the tongue closing gesture for /g/ in the phrase Say guide walls (Figure 1) begins before the end of the preceding vowel /e/, as evidenced by the rising F2 formant transition for this vowel.1 This situation of articulatory overlap makes it difficult to determine the point in the acoustic signal where the vowel ends and the consonant begins. Nevertheless, there are often salient acoustic landmarks that correspond straightforwardly to recognisable articulatory events (Stevens, 2002). In particular, although we know that movement towards consonantal constriction begins earlier, abrupt spectral changes coincide with the onsets and releases of oral consonantal constrictions for the production of stops, fricatives, and affricates, as illustrated in Figure 1 (/s, g, d, z/) and Figure 2 (/s, p/).

Figure 1: Say guide walls, spoken by a female Scottish English speaker

1 /e/ is monophthongal in Scottish English.


Figure 2: A fragment (underlined) from MINUSTA "san" sopii kohtaan tuhatkaksisataa 'I THINK "san" fits [#] 1200', spoken by a female Northern Finnish speaker. San is a nonsense word. V in the second label tier indicates the offset of voicing for [s] and [p].

We propose that acoustic segment durations should be determined by the intervals that these oral consonantal constriction events define. Oral constriction criteria are preferable to criteria based on the onset or offset of voicing, since oral constriction criteria can be used comparably for many different classes of speech sounds, including voiced and voiceless oral stops, fricatives, affricates, and nasal stops. Although oral constriction and voicing criteria might be thought to be interchangeable in some cases, e.g. at the onsets of voiceless obstruent constrictions in vowel-voiceless obstruent sequences, as between word-medial [o] and [p] in sopii, voicing often persists after the onset of the phonologically voiceless constriction (e.g. [s] and [p] in Figure 2). In situations of this type, the oral constriction onset criterion is clearly preferable. As can be clearly seen in Figure 2, oral constriction criteria can yield very different segment durations than criteria based on voicing. Similar discrepancies are observed in situations where aspirated voiceless stop offsets are measured. These potential differences also make a strong case for being explicit about segmentation criteria in reports, and above all for the application of consistent segmentation criteria.

The duration of an interval between a C1 constriction release landmark and a following C2 constriction onset landmark in a C1VC2 sequence (e.g. the [aɪ] interval in Figure 1) is often described as the duration of a "vowel". We follow this convention here. However, this interval is not exclusively vocalic, since the so-called "vowel" duration includes formant transitions and burst noise that cue the identity of the surrounding consonants, in addition to any aspiration from preceding voiceless aspirated stops. This point should be kept in mind when interpreting labels for such intervals.

We argue that the judicious choice of experimental materials can yield reliable, accurate durational measurements, if the materials contain alternations of salient oral consonantal constrictions and sonorant segments such as vowels. In particular, constriction onsets and releases are relatively easy to identify in: (1) stop consonants, e.g. [p, t, k, b, d, g], sibilants, e.g. [s, ʃ, z, ʒ], and affricates, e.g. [tʃ, dʒ], in VCV contexts, and (2) non-homorganic clusters (clusters containing consonants of different places of articulation) differing in manner. We will discuss segmentation criteria for these sequences below.

2.1 Relative segmentability

There is a clear relationship between segmentation reliability and the strength of conclusions that can be drawn from experiments that use segmentation as part of their methodology. In order to ensure confidence in the results of durational experiments, we recommend that materials be designed with the highest possible number of target segments whose durations can be reliably and accurately estimated. In the following sections, we present detailed segmentation criteria for the sequences of segments shown in Table 1. These criteria derive from the theory of the relationship between articulation and acoustics (see Stevens, 2002), and from our experience in segmenting American English, Standard Scottish English, and, to a lesser extent, Southern Standard British English, Standard Dutch, Northern Finnish and Standard Japanese. Although the criteria are grounded in general acoustic theory as it relates to speech production, there may be language-specific factors, such as allophonic variation, assimilation or coarticulation patterns, that make some of these specific criteria less applicable for particular languages or language varieties.

Table 1 includes 1) phones that we have found to be reliably segmentable in most contexts, 2) phones which we have found to be reliably segmentable in restricted contexts, 3) phones which we have found to be less reliably segmentable in most contexts, and 4) others which are to be avoided whenever possible. The phone classes mentioned in Table 1 conform to definitions given in Ladefoged (2001); we have defined additional terms not explicitly mentioned there. Not all phone types are included; we only discuss cases that we have had sufficient experience with to describe with confidence. It should be noted that nasal stops are the most appropriate class of segments for experiments where both duration and F0 are of interest. Obstruents are known to raise or lower F0 in adjacent pitch periods depending on their voicing specification, and are therefore less appropriate for F0 analyses.
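To make the interval definitions concrete, the following minimal sketch computes closure and "vowel" durations from manually placed constriction landmarks. It is an illustration only, not part of the method described in this chapter: the landmark times and label names are invented for a hypothetical token of guide (cf. Figure 1), and in practice the times would be read from annotation files produced by whatever labelling tool is in use.

    # Durations from constriction landmarks (times in seconds are invented
    # for illustration). Labels mark constriction onsets and releases.
    landmarks = {
        "g_onset": 0.212,    # onset of /g/ closure (offset of F2 in /e/)
        "g_release": 0.290,  # first /g/ release burst
        "d_onset": 0.455,    # onset of /d/ closure
        "d_release": 0.520,  # /d/ release burst
    }

    # Closure duration: constriction onset to release of the same consonant.
    g_closure = landmarks["g_release"] - landmarks["g_onset"]

    # The so-called "vowel": C1 release to C2 onset. As noted above, this
    # interval also contains burst noise and formant transitions.
    vowel = landmarks["d_onset"] - landmarks["g_release"]

    print(f"/g/ closure: {g_closure * 1000:.1f} ms")
    print(f'"vowel" [ai] interval: {vowel * 1000:.1f} ms')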

Reliably segmented in most contexts:
- Boundary between consonant and vowel in CV or VC sequences, where consonants are: oral stops, e.g. [p, b, t, d, k, g]; sibilants, e.g. [s, ʃ, z, ʒ]; affricates, e.g. [tʃ, dʒ]
- Boundary between two members of a consonant cluster, where the phones in clusters are oral stops, nasal stops, sibilants, and affricates in the following sequences, when these differ in place and manner of articulation: sonorant consonant*-oral stop; sonorant consonant-sibilant; sonorant consonant-affricate; oral stop-sonorant consonant; sibilant-sonorant consonant; affricate-sonorant consonant; sibilant-oral stop; oral stop-sibilant; nasal stop-sibilant; sibilant-nasal stop (*sonorant consonant = approximants or nasal stops, e.g. [l, j, w, m])

Reliably segmented in some contexts:
- Nasal stops, e.g. [n, m]

Less reliably segmented:
- Weak voiceless fricatives, e.g. [f, θ]
- Nasal or voiceless stops in homorganic nasal-stop or stop-nasal clusters, e.g. [mp, pm]

To be avoided:
- Central and lateral approximants, e.g. [w, l]; [h]
- Weak voiced fricatives, e.g. [v, ð]
- Voiceless and voiced consonants in homorganic clusters, e.g. [st], [mb]
- Consonants in clusters sharing manner of articulation, e.g. [pk], [bt], [mn], [sʃ]
- Stop-affricate clusters

Table 1: Relative segmentability of consonants in VCV and cluster contexts

2.2 Segmentation criteria

The detailed segmentation criteria that we present in the following sections are all based on the more general strategy of finding constriction onsets and releases, as described above.

Most of the criteria we discuss are based on spectral characteristics most easily seen in spectrograms. Waveforms can also be useful for segmentation since they show dips and rises in amplitude, which often correspond to the onsets of constrictions and their release. However, amplitude dips can sometimes be gradual on waveform displays, particularly when constrictions are voiced, and some types of frication noise can be difficult to distinguish from aspiration noise on waveform displays. For these reasons, we prefer to rely primarily on spectrograms for first-pass segmentation decisions within an accuracy of 5-10 ms, and on waveforms for more fine-grained segmentation decisions, once general boundary regions have been defined.

Note that segmentation accuracy will necessarily depend on factors other than segmentation criteria, namely 1) the sampling rate used in digitising the signal, 2) the spectrogram analysis window size, assuming that spectrograms are used for segmentation, and 3) the degree to which each successive analysis window overlaps (frame shift). For example, a sampling rate of 16,000 Hz will yield accuracy within .0625 ms if segmenting on the waveform, but a spectrogram analysis window size of 5 ms (for 200 Hz wideband analysis) will limit the accuracy of spectrogram-based criteria to within this 5 ms window. The reduced accuracy of spectrogram-based criteria as compared to those based on the waveform supports the use of the waveform for final fine-grained segmentation once segment boundaries have already been determined within 5-10 ms.

When using visual displays for segmentation, it is easier to see gross spectral changes when these are zoomed out, or contain longer stretches of speech. We recommend more zoomed out spectrogram displays to determine general boundary regions, and more zoomed in waveform displays for determining exact boundary locations. In the following sections, we discuss segmentation criteria in rough order of relative segmentability, as organised in Table 1.

2.2.1 Consonants in VCV contexts

Oral stops

In our experience, canonical variants of oral stops are generally easy to segment (see [g, d] in Figure 1, [p] in Figure 2). The onset of stop closures in VCV contexts is associated with 1) a decrease in overall amplitude, and 2) cessation of all but the lowest formant and harmonic energy. Although some stop closures are also accompanied by the cessation of voicing, many voiced stops and even some phonemically voiceless stops have voicing that continues through part or all of the stop closure (see [t] and the second [p] in Figure 3). In addition, for some vowel-voiceless stop sequences, voicing can stop earlier than the oral closure, resulting in pre-aspiration (see also pre-aspiration before a fricative in Figure 6). These phenomena (cf. Lucero, 1999) highlight the importance of non-F0 based criteria when identifying stop closures. In many cases the offset of F2 energy coinciding with an overall dip in amplitude is the best criterion, because F2 and higher frequency energy is often critically damped when the oral tract is closed. Using F2 to identify stop closure is preferred over using F1, because F1 energy is less often critically damped and is often confusable with F0. It should be noted that for some speakers under very sensitive recording conditions, even F2 energy can continue into closure. In these cases, the offset of F3 and higher frequency energy coinciding with a drop in overall amplitude would provide a better criterion than F2.

In syllable-final position, vowels before English voiceless stops are often glottalised, as shown in Figure 3 ([a] and [e]); in these cases, the end of formant (e.g. F2) energy can still be used as a criterion for finding oral closure onset if a full glottal stop has not occurred before the oral constriction has been made. The absence of a full glottal stop before oral closure can be diagnosed by 1) voicing that continues after the end of formant energy, and/or 2) formant values at closure that are appropriate for the place of articulation of the stop. In Figure 3, although the [e] before the second [p] in paper is glottalised, there is no evidence that a glottal stop precedes the closure: Voicing continues after the end of formant energy, and the F2 value at the onset of [p] closure looks appropriate for a bilabial stop. Evidence for the lack of a glottal stop before the onset of [k] closure in tax comes from low amplitude voicing during the closure. Note that closure onset criteria cannot be applied where glottal stop has fully replaced an oral closure, as for glottal stop variants of /t/ in English. In Figure 3, weak high frequency frication noise occurs at the beginning of the first /p/ constriction, and throughout the second /p/ constriction. This frication noise is evidence of incomplete closure.

Stop releases can be easily identified in the presence of a release burst, whose onset can be taken as the release (end of [t], [k] and both [p]'s in Figure 3). Velar stops are often accompanied by multiple bursts (e.g. both [k]'s in Figure 4, at times .2 and .8). Any of the multiple bursts (e.g. the first, last, or most salient burst) could potentially be used to mark the offset of the stop, as long as the choice is used consistently where measurements are to be compared, but the first burst arguably conforms best to our criterion of marking the constriction release, since subsequent bursts are produced through the (uncontrolled) Bernoulli effect.2

2 Thanks to Mark Jones for pointing this out on the Phonet discussion list.
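The amplitude-dip and formant-offset criteria above are applied by eye, but a rough numerical pass can suggest candidate closure regions for subsequent visual checking. The sketch below is one possible implementation under stated assumptions, not a method endorsed in this chapter: it assumes numpy/scipy, a mono 16 kHz WAV file at an invented path, band-passes the signal above the F1 region (since F0 and F1 often persist into closure), and flags frames where high-band RMS energy drops below an illustrative threshold.

    import numpy as np
    from scipy.io import wavfile
    from scipy.signal import butter, filtfilt

    fs, x = wavfile.read("tax_paper.wav")  # hypothetical mono recording
    x = x.astype(np.float64)

    # Keep energy above ~1 kHz: F2 and higher frequencies are usually
    # critically damped during closure, while F0/F1 may continue.
    b, a = butter(4, [1000, 7000], btype="bandpass", fs=fs)
    hi = filtfilt(b, a, x)

    # 5 ms frames, matching the wideband spectrogram resolution above.
    frame = int(0.005 * fs)
    n = len(hi) // frame
    rms = np.sqrt(np.mean(hi[:n * frame].reshape(n, frame) ** 2, axis=1))

    # Frames where high-band energy falls below 10% of the utterance
    # maximum are candidate constriction regions (threshold illustrative).
    low = rms < 0.1 * rms.max()
    onsets = np.flatnonzero(low[1:] & ~low[:-1]) + 1
    for i in onsets:
        print(f"candidate closure onset near {i * frame / fs:.3f} s")

The output is only a first-pass suggestion; exact boundaries should still be placed on the waveform, as recommended above.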


Figure 3: Tax paper, spoken by a female Scottish English speaker. The boundaries for the offsets of /a/ and /e/ are placed on the last glottal pulse peak in the intervals delimited by continuous F2.
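Spectrogram-plus-waveform displays of the kind described in the Figure 3 caption can be generated with standard signal-processing libraries, supporting the two-pass workflow recommended above (zoomed-out wideband spectrogram first, zoomed-in waveform for final placement). A minimal matplotlib/scipy sketch, again assuming a hypothetical mono WAV file:

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.io import wavfile
    from scipy.signal import spectrogram

    fs, x = wavfile.read("tax_paper.wav")  # hypothetical mono recording
    x = x.astype(np.float64)
    t = np.arange(len(x)) / fs

    # Wideband analysis: a 5 ms window corresponds to roughly 200 Hz
    # bandwidth, the setting assumed in the accuracy discussion above.
    nwin = int(0.005 * fs)
    f, ts, S = spectrogram(x, fs=fs, nperseg=nwin, noverlap=nwin // 2)

    fig, (ax1, ax2) = plt.subplots(2, 1, sharex=True)
    ax1.plot(t, x, linewidth=0.4)  # zoom in here for final boundaries
    ax1.set_ylabel("amplitude")
    ax2.pcolormesh(ts, f, 10 * np.log10(S + 1e-12), shading="auto")
    ax2.set_ylabel("frequency (Hz)")
    ax2.set_xlabel("time (s)")
    plt.show()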


Figure 4: Concord, spoken by a female Scottish English speaker. V in the second label tier indicates the offset of voicing for [k].

Bursts are sometimes not evident on the spectrogram for voiced stops that are [+anterior], i.e. produced at or in front of the alveolar ridge (e.g. [b, d], as well as the American English flap allophone of /t, d/); in these cases stop release can be considered to occur near the point of F2 onset.

The releases of aspirated voiceless stops in VCV contexts are followed by voiceless aspiration which ends at the onset of voicing (VOT) of the following vowel (see VOT for [t] and [p] in Figure 3). One might wonder whether to include this interval as a part of the duration of the so-called "stop", where the following "vowel" would begin at VOT. In these cases, it is useful to note that segmentation criteria for VOT are based on acoustic correlates of the onset of vocal fold vibration, and differ qualitatively from the acoustic correlates of oral constriction onsets and releases. It is our view that if segmentation of voiceless stops is to be comparable with that of fricatives and voiced stops, what we measure as voiceless stop durations should also correspond to their oral constriction durations, and should therefore end at oral release. On this view, vowels following voiceless stops would begin at consonantal release, rather than at VOT. In our studies, we are careful to apply consonantal constriction criteria consistently across segment types, but do often measure intervals from consonantal release to VOT for other purposes. Note in this regard that waveforms are often more useful than spectrograms for determining voicing onset or offset.

One drawback in using stops in experimental materials is that they can exhibit considerable variability in their allophonic and phonetic realisations. English /t/ is a notorious example in this regard. Many varieties of English frequently use glottal stop in syllable-final position, e.g. wha[ʔ] (what), as well as in words like city, where American English uses a tap. In addition, glottal constrictions before final stops, e.g. before some renditions of [k] in tack, can make the onset of oral closure difficult to identify. And finally, /g/ closures in American English and Japanese can be realised as voiced fricatives or approximants (see different phonetic realisations of Japanese /g/ in Figure 5). In American English, these non-canonical /g/ closures generally occur in intervocalic contexts where the second vowel is unstressed, e.g. ogre.


Figure 5: Japanese /aga/ where /g/ is realised as an approximant (left panel; spoken by a female Standard Japanese speaker), and as a fricative (right panel; spoken by a male Standard Japanese speaker). /aga/ is a fragment from Sei-wa "gansani"-o totemo yorokobu 'Sei is very pleased with "gansani"'. Wa is a topic marker; gansani is a nonsense word.

Sibilants

In general, we have found sibilant fricatives to be particularly useful for cross-linguistic studies of segment durations, since they appear to show little allophonic or other phonetic variation within a single language. In our experience, the onset and offset of frication energy appropriate for a sibilant of its place of articulation is the most useful criterion for sibilant constriction segmentation in VCV contexts (see [z] in Figure 1 and [s] in Figure 2). The presence of fricative noise is unambiguous evidence for an oral constriction, but the onset or offset of frication noise can sometimes appear gradual, or can be confused with breathiness or aspiration noise, making accurate segmentation challenging.3 The offset/onset of vowel formant energy (e.g. F2, see above discussion of oral stops) can sometimes also be used to identify fricative constriction boundaries, but is often less useful than the fricative noise criterion, since formant energy can often be seen in the presence of frication (see [se] in Figure 1). A small silent gap can sometimes precede or follow frication noise, particularly in high vowel or sonorant consonant contexts; these silent gaps are often due to the change in noise source from just behind the constriction to the vocal folds, or vice versa (Stevens, 1998).

Relatively long periods of aspiration noise (equivalent to a partially voiceless vowel) or breathiness can sometimes occur before or after the onset of voiceless frication, e.g. the 'asp' intervals before [ʃ] in tosh in Figure 6, and after [s] in Figure 7 (see also Gordeeva and Scobbie, 2004 for a discussion of this phenomenon in Scottish English). This aspiration noise is spectrally different from the adjacent fricative noise, and often contains voiceless formant energy. In these cases it is particularly important not to rely on the cessation of voicing (F0) as a cue to the onset or offset of the fricative. However, it should be noted that at times it may be difficult to find a clear spectral discontinuity between the aspiration noise and the fricative noise; see Section 2.3 for a discussion of segmentation uncertainties. In our experience, American English does not appear to have heavily pre-aspirated voiceless consonants, whereas many British varieties do, at least for some speakers.

Phonologically voiced fricatives in intervocalic position can also be problematic, if voicing continues throughout the frication and frication amplitude is relatively low, due to e.g. reduced airflow across the glottis during voicing. Even though the spectral changes due to the onset or offset of frication may be visible on the spectrogram, precise segmentation points are often difficult to determine on a waveform when low amplitude frication noise occurs simultaneously with voicing.

3 Some palatalised fricatives in high, front vowel contexts, e.g. Japanese /zi/ (realised as [ʒi]), can be difficult to segment, as frication can occur simultaneously with the high vowel articulation.


Figure 6: Tosh, spoken by a male Southern Standard British English speaker


Figure 7: A fragment from Buun-sensei ICHI-BAN-ga "saasaa"-tte ittakedo 'Mr. Boone said NUMBER ONE is "saasaa"', spoken by a female Standard Japanese speaker. Saasaa is a nonsense word.

Affricates

Since affricates can be considered as sequences of a stop + fricative, criteria for identifying the onsets of affricates will be identical to those for identifying the onsets of stops. Similarly, criteria for identifying their offsets will be the same as those for identifying the offsets of fricatives.

Nasal stops

Although oral stops and sibilants provide some of the most salient acoustic cues to constriction onsets and releases, other types of segments can also be reliably segmented in some contexts. The oral closures associated with nasal stops are often accompanied by abrupt spectral changes at closure onset and release, as illustrated by [m] in Scottish English max tapes (Figure 8) and [n] in Finnish san (Figure 9). In addition, these abrupt spectral changes often coincide with a brief v-like dip-followed-by-a-rise in the waveform, as shown in these figures. In contrast to syllable-initial nasals, nasals in syllable-final or word-final position can be difficult to segment in many languages, since the onset of oral closure is often obscured by heavy nasalization on the preceding vowel, and in some cases oral closure can be absent altogether (see [n] in Figure 10). However, there may be some cross-linguistic differences in this regard: Unlike English and Japanese, Finnish is reported to have little anticipatory nasalization of coda nasal stops in word-final positions and consequently often has clear onsets of oral constrictions for word-final nasal stops (Lehiste, 1964; see also Figure 9). We find this observation to be generally true, although some Finnish speakers do nasalise vowels before coda nasal stops in phrase-medial positions.


Figure 8: Max tapes, spoken by a female Scottish English speaker. The boundaries for the offsets of /a/ and /e/ are placed on the last glottal pulse peak in the intervals delimited by continuous F2.


Figure 9: A fragment from MINUSTA "san" sopii kohtaan tuhatkaksisataa 'I THINK "san" fits [#] 1200', spoken by a female Northern Finnish speaker. San is a nonsense word.


Figure 10: A fragment from Toujou-sensei-ni kii-tara ICHI-BAN-ga "saansa" 'According to Mr. Tojo, NUMBER ONE is "saansa"', spoken by a male Standard Japanese speaker. Saansa is a nonsense word.

Weak fricatives

Weak voiceless fricative constrictions (e.g. for [f] and [θ]) can often be identified by the onset and offset of frication noise, the offset and onset of surrounding vowels' F2, and corresponding dips and rises in overall amplitude. However, at times their frication noise is too weak for reliable identification, e.g. they can sometimes be difficult to distinguish from pause in phrase-final position. Weak voiced fricatives can be even less reliably segmented, due to the difficulty of detecting the onset of low amplitude frication in the presence of voicing on a waveform, and to their phonetic variability. For example, English [ð] realisations can sometimes show frication energy similar to [θ], but in other cases this segment can appear very similar to a coronal stop on a spectrogram, while being heard as a clear rendition of [ð] (Zue, 1985).

R-sounds

R-sounds are known to exhibit great variability in their realisation both across and within languages (Ladefoged and Maddieson, 1996). Some of the non-approximant tap or fricative variants can be segmented using criteria similar to those described for stops and sibilant fricatives above. However, variable realisations of /r/ in many languages, including Finnish, Japanese, Dutch, and Scottish English, can make /r/ less useful than other segments for durational experiments. For example, in Scottish English, /r/ can be realised as a tap, as an approximant, or can be absent, depending on sociolinguistic and contextual factors (Romaine, 1978). Segmentation of tap variants in this variety is often straightforward, but approximant segmentation is prohibitively difficult, as discussed below.

Central approximants and [h]

The onsets and offsets of constrictions for central approximants (glides, [ɹ]) and [h] are notoriously difficult to identify, as shown by [wɔ] in Figure 1. Some researchers suggest using the midpoint of transitions from a preceding vowel to the glide, and the midpoint of transitions from the glide to a following vowel, as criteria for segmentation. We find that these criteria are difficult to implement in many contexts, e.g. where vowels lack reliable steady states. In addition, it is not entirely clear to which articulatory events these transition midpoints correspond, since they do not correspond clearly to the points of constriction onset and release that serve to define stop and sibilant constrictions.


Laterals

Although laterals such as [l] can sometimes be associated with clear spectral discontinuity at constriction onset and release, these discontinuities can often be absent. In our experience, there is enough speaker and contextual variability for us to be wary of relying on their segmentation. Walls in Figure 1 contains an example of a syllable-final velarised lateral whose oral constriction is extremely difficult to identify.

2.2.2 Clusters

The boundary between consonants in non-homorganic clusters can be identified using criteria similar to those described above. For instance, sibilant-stop and stop-sibilant sequences which involve differences in place of articulation, e.g. [ks], [sp], are relatively easy to segment into fricative constriction vs. oral stop closure intervals; see [ksp] in Figure 3 tax paper. Homorganic clusters, on the other hand, present many segmentation difficulties, in spite of the fact that many contain rather marked differences in manner of articulation. For example, for homorganic nasal-stop clusters, e.g. [ŋk] in Concord (Figure 4), our principle of identifying oral constrictions and releases for each segment cannot be used, since these two phonological segments are produced with a single oral constriction. For these homorganic cases, e.g. [ŋk], other acoustic landmarks can sometimes be identified, e.g. the boundary between voicing and voicelessness in the [ŋk] cluster in Figure 4, and/or the acoustic correlates of velic closure. If these landmarks are to be used to infer segmental durations, it should be noted that their articulatory origin is different from that of other acoustic landmarks relating to oral constrictions. Clusters that are completely voiced are more difficult to segment, e.g. [ŋg] in Figure 11.

We have in the past recorded materials containing [st] clusters, hoping to be able to identify the boundary between frication noise associated with [s] and complete closure associated with the stop [t]. But in many of these cases, speakers produce incomplete closures for the oral stops (see [st] in max tapes, Figure 8). Although there is often evidence of a change in degree of stricture, this change can appear gradual, making it difficult to reliably segment these phones. This case contrasts with the [sp] cluster in Figure 3 (tax paper), where the onset of weak frication was clearly related to the onset of constriction for /p/.

Note that word or phrase boundaries can sometimes intervene between two members of a cluster. In these cases there is a probability of a pause at the boundary, which should be taken into consideration when segmenting these phones. For example, if a voiceless stop occurs phrase-initially, its voiceless closure is acoustically indistinguishable from pause.


Figure 11: A fragment from Buun-sensei ICHI-BAN-ga "saasaa"-tte ittakedo 'Mr. Boone said NUMBER ONE is "saasaa"', spoken by a female Standard Japanese speaker. Ban 'number' is a suffix; ga is a nominative particle.

2.3 Segmentation uncertainties

It is inevitable that there will be cases where boundaries cannot be found with certainty, even for speech materials whose phones have been carefully chosen to give the highest probability of reliable segmentation. These cases can be grouped into different classes: 1) cases where it is clear that the boundaries occur somewhere within a short window of uncertainty (roughly the duration of a single pitch period, i.e. 5-10 ms), 2) cases where boundaries are known to exist within a wider window of uncertainty, and 3) cases where boundaries are completely obscured. One way of dealing with cases like 1) is to annotate them (e.g. with ?), and to segment them according to a chosen policy of either "when in doubt, place the boundary earlier" or "when in doubt, place the boundary later", to be applied throughout the dataset. Researchers can then choose whether to include these measurements in their analyses, depending on the expected size of effects. Although measurements from cases like 2) and 3) should never be included in data analysis, there may be other ways of salvaging the data for these segments. These may include 1) applying other types of segmentation criteria that might be more reliable in particular cases, e.g. voicing (or laryngeal) criteria, while keeping the implications of these choices in mind, and 2) grouping segments together to yield reliable durations for segment sequences, e.g. as in Figure 4.

4 The ideas in this section were developed in collaboration with Stefanie Shattuck-Hufnagel.
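One way to operationalise the annotation policy just described is to carry the uncertainty window through to the analysis stage and resolve it by a single dataset-wide rule. The sketch below uses invented label conventions (a trailing "?" for class-1 boundaries) purely for illustration; it is not a standard annotation format.

    # Each boundary: (earliest plausible time, latest plausible time, label).
    # Certain boundaries have identical times; class-1 uncertain boundaries
    # ("?" labels) carry a window of roughly one pitch period (5-10 ms).
    boundaries = [
        (0.310, 0.310, "p_onset"),     # certain
        (0.402, 0.409, "p_release?"),  # uncertain within a short window
    ]

    POLICY = "early"  # apply ONE policy ("early" or "late") to the dataset

    for t_early, t_late, label in boundaries:
        t = t_early if POLICY == "early" else t_late
        flagged = label.endswith("?")
        # Flagged measurements can later be included or excluded from the
        # analysis, depending on the expected size of effects.
        print(f"{label:12s} {t:.3f} s{'  (flagged)' if flagged else ''}")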

2.4 Using criteria consistently

Using criteria consistently across materials to be compared is one of the most important principles of acoustic speech segmentation. Target materials and carrier sentences should be chosen carefully with this principle in mind. In particular, segments which are known to have different allophonic variants should be avoided in compared conditions. There are some cases where it is impossible to keep phonetic context constant across conditions, e.g. in comparisons of phrase-final vs. phrase-medial materials, where it is likely that a pause will occur after a phrase-final word. In these cases, the choice of segmentation criteria may have drastic implications for conclusions about the presence and magnitude of prosodic effects.

Our study on Japanese and Finnish utterance-final lengthening provides interesting examples in this regard. In Japanese, utterance-final vowels often end in creaky phonation. In an extreme case shown in Figure 12, the utterance ends with widely spaced glottal pulses that give the auditory impression of [a], although they lack continuous formant structure. In cases like this, a segmentation criterion based on continuous F2 yields a much shorter vowel than one based on laryngeal activity. In this particular example, where the vocalic interval based on the laryngeal criterion is labelled as 'a max' and the vocalic interval based on continuous F2 is labelled as 'a', the choice of segmentation criterion makes a difference of 227 ms in the estimated duration of the final vowel.


Figure 12: A fragment from Toujou-sensei-ni kii-tara ICHI-BAN-ga "saansa" 'According to Mr. Tojo, NUMBER ONE is "saansa"', spoken by a male Standard Japanese speaker. Saansa is a nonsense word. The boundary for the offset of /a/ is placed on the last glottal pulse peak in the interval delimited by continuous F2.


In Finnish, one of the most striking characteristics of many utterance-final words is their breathy ending (cf. Figure 13, see also Ogden, 2004). The breathy phonation at the end of an utterance often results in voiceless formant structure, which would be included in the vowel if the vowel were judged to end with the apparent end of formant structure. If the end of the vowel were judged to coincide with the end of voicing, the voiceless formant structure would be excluded. In this example, the choice of segmentation criterion makes a difference of 47 ms in the estimated duration of the final vowel.


Figure 13: A fragment from Kohtaan SEITSEMÄNSATAA voisi vastata "sasa" 'For [#] SEVEN-HUNDRED [you] could answer "sasa"', spoken by a female Northern Finnish speaker. Sasa is a nonsense word. V in the second label tier indicates the offset of voicing for [a].

3 Experimental design

One of our main goals in discussing relative segmentability in the first part of the paper was to facilitate the design of experimental materials with the maximum likelihood of reliable segmentation. In the following sections, we discuss additional methodological issues for prosodic studies of duration, including ways of controlling for durational effects unrelated to prosodic structure, elicitation methods, and analysis tools.


3.1 Controlling for the influence of multiple factors on duration: Corpus studies vs. controlled experiments

It is well known that segment durations are influenced by a variety of factors including talker, intrinsic segmental properties, segmental context, prosodic context, and global rate of speech (see Klatt, 1976 for a review). These multiple sources of durational variability make it especially challenging to make inferences about the influence of individual prosodic factors on segment durations. There are currently two main approaches to this problem: 1) to study prosodic effects on duration in very large corpora, so that the effects of non-prosodic factors can be modelled statistically to allow for segment duration normalisation (e.g. Campbell and Isard, 1991; Wightman, Shattuck-Hufnagel, Ostendorf, and Price, 1992), and 2) to study prosodic effects in tightly controlled experiments. The main advantage of the corpus approach is that prosodic effects can be studied in natural speech situations. However, this advantage is offset to some extent by the following: First, segment duration estimates are dependent on the accuracy of the automatic segmentation algorithms whose use is inevitable given the volume of data (see discussion of automatic segmentation in Section 4). Also, some of the more subtle prosodic effects have the potential to be obscured, primarily because of difficulties in establishing accurate normalisation procedures. Finally, some prosodic effects cannot be easily isolated in natural corpora, due to the fact that variables affecting duration are often correlated (the "Data Sparsity" problem, van Santen, 1994). For example, Campbell (1992) examines Kaiki, Takeda, and Sagisaka's (1990) corpus and demonstrates that their surprising finding of sentence-final vowel shortening in Japanese could be explained by the correlation of sentence-finality with the occurrence of the past-tense marker -ta in their corpus. Campbell argues that the 'sentence-final shortening' is likely to be a result of this correlation, and not a prosodic effect. In other words, [a] in the past-tense marker may be short for various other reasons, e.g. informational redundancy. Kaiki et al.'s corpus therefore lacked enough variation in sentence-final vowels for appropriate comparisons with sentence-medial vowels.

Although data from tightly controlled laboratory studies cannot compare in naturalness with data from spontaneous speech corpora, experimental design can be used very effectively to control for the effects of confounding factors such as intrinsic segment durations, segmental context effects, rate of speech, and inter-talker variability. In addition, problems of accurate segmentation can be dealt with through careful experimental design, as discussed at length above. In controlled experiments, materials are designed to be as alike as possible across conditions while varying the predictor (independent) variables of interest, e.g. phrasal stress or constituent boundary placement. In these experiments, talkers are typically asked to read required materials embedded in carrier sentences. In order to ensure that findings can be generalised, it is advisable to include several (6-12) test words or phrases in each experimental condition. Especially if subtle prosodic effects are expected, word frequency and other factors such as morphological structure and orthographic representation (number of letters for each sound) should be controlled, or appropriately varied (Walsh and Parker, 1983; Warner, Jongman, Sereno, and Kemps, 2004). To a large extent, numbers of talkers and repetitions tend to be dictated by practical issues, but results will inevitably be more reliable in studies where more talkers are studied. When planning a study, researchers would do well to estimate realistically the time it takes to conduct acoustic segmentation. In our experience, only approximately 6-15 disyllabic words can be segmented per hour, depending on the number of segments in each word and on the speed of the segmenter. In addition, it is important to plan for reliability checks when more than one person segments materials from a single experiment. In the following sections, we discuss ways of controlling known sources of durational variability in more detail.

3.1.1 Control of intrinsic segment durations and segmental context

Different segment types are known to have different intrinsic durations, e.g. low vowels tend to be longer than high vowels (Peterson and Lehiste, 1960; Klatt, 1976). For this reason, and because coarticulation from surrounding segments or gestures has the potential to affect target segment durations, prosodic effects should ideally be tested on identical target words. The choice of materials adjacent to the target word is also important, particularly if word-initial or word-final segments are of interest. For example, if word-final vowel durations are to be compared with word-medial vowel durations, the word following the target word should be chosen so that phonetic context is identical in both cases, e.g. [i] in beef arm vs. bee farm. In these situations, it is particularly important to avoid eliciting pauses between the target and a following word.

3.1.2 Control of rate of speech

In controlled studies, initial practice sessions and randomisation of test materials are used to control for the rate-of-speech increases that are a frequent consequence of talker experience as s/he progresses through the experimental session. If test materials are presented in blocks, it is advisable to randomise presentation of materials within blocks. If these are blocked by experimental condition, the order of block presentation should be counterbalanced across talkers, as sketched below.
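One way of implementing this randomises trials within condition blocks and rotates block order across talkers; the condition names, item labels, and seeding scheme in the sketch below are invented for illustration.

```python
# Sketch: within-block randomisation with block order counterbalanced
# across talkers by simple rotation. All names are hypothetical.
import random

CONDITIONS = ["phrase-final", "phrase-medial"]
ITEMS = [f"item{i:02d}" for i in range(1, 7)]      # e.g. 6 test words

def presentation_order(talker_index):
    rng = random.Random(talker_index)              # reproducible per talker
    k = talker_index % len(CONDITIONS)
    block_order = CONDITIONS[k:] + CONDITIONS[:k]  # rotate across talkers
    trials = []
    for condition in block_order:
        block = [(condition, item) for item in ITEMS]
        rng.shuffle(block)                         # randomise within block
        trials.extend(block)
    return trials

for talker in range(2):
    print(talker, presentation_order(talker)[:3])
```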


As a check that experimental control of rate of speech has succeeded, it is important to include a control sequence across the experimental conditions for comparison. For example, in a carrier sentence such as Say the word again, an appropriate, segmentable control might be aythewor, that is, the stretch of speech from the release of the [s] to the onset of closure for the [d].

3.1.3 Control of inter-talker variability

Variation in segment durations between talkers can be quite large, due in large part to inter-talker differences in rate of speech. Where within-subject differences are of primary interest (as is often the case), these effects of inter-talker variability can be dealt with effectively using within-subjects experimental designs and appropriate statistical analyses (e.g. single-subject analyses, paired t-tests, or repeated-measures analyses of variance; Loftus and Loftus, 1988). Readers are referred to Raaijmakers, Schrijnemakers, and Gremmen (1999) and Baayen (2004) for current debate about how best to statistically analyse designs involving both multiple subjects and multiple items. A minimal example of such a within-subjects comparison is sketched below.
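The sketch compares one mean duration per talker per condition with a paired t-test; all numbers are fabricated purely to make the example runnable.

```python
# Paired comparison of per-talker condition means (fabricated data, ms).
from scipy import stats

final  = [112.0, 98.5, 131.2, 104.7, 119.3, 95.8]   # phrase-final means
medial = [ 88.4, 80.1, 115.6,  91.2, 101.9, 84.3]   # phrase-medial means

t, p = stats.ttest_rel(final, medial)               # within-subjects test
print(f"paired t({len(final) - 1}) = {t:.2f}, p = {p:.4f}")
```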


3.2 Elicitation of prosodic contrasts

Elicitation of prosodic contrasts can be challenging, especially since talkers normally have many options for the prosodic realisation of a single utterance (see discussion in Shattuck-Hufnagel and Turk, 1996). Reliable methods for eliciting desired contrasts, and for knowing when the desired contrasts have been achieved, are crucial if durational patterns are to be directly related to these contrasts. Desired phrasal stress patterns, intonation contours, and prosodic phrasing can be encouraged in several ways through the use of 1) syntactic manipulations, 2) orthographic manipulations, 3) precursor sentences, and 4) explicit instructions, as detailed below.

3.2.1 Varying the syntactic structure of read materials

Although prosodic constituent boundaries correspond only indirectly to the boundaries of syntactic constituents, it is very often the case that different syntactic structures are produced with different prosodic structures (e.g. Lehiste, 1973), and in many cases prosodic constituent boundaries are aligned with either the left or right edge of particular types of syntactic constituents (Selkirk, 1986; Truckenbrodt, 1999). Varying the syntactic structure of read materials has been used successfully to encourage systematic variation in prosodic constituent structure in many types of experimental study (e.g. Lehiste, 1973; Scott, 1982; Cambier-Langeveld, 1997; Fougeron and Keating, 1997). In most studies, every effort is made for compared sentences to contain the same, or at least similar, segmental material, and similar numbers of syllables within each utterance. For example, Cambier-Langeveld (1997) used the sentences shown in (1) to elicit a range of prosodic boundary strengths after the word rododendron, which we have typed in bold (see Cambier-Langeveld, 1997 for a definition of each type of prosodic boundary):

(1)

Prosodic-Word boundary: Piet wil die rare rododendronplanten, gek als hij is.
'Piet wants those strange rhododendron plants, crazy as he is.'

Phonological-Phrase boundary: Piet wil die rare rododendron planten, gek als hij is.
'Piet wants to plant that strange rhododendron, crazy as he is.'

Intonational-Phrase boundary: Piet wil die rare rododendron, plantengek als hij is.
'Piet wants that strange rhododendron, plant-crazy as he is.'

Utterance boundary: Plantengek als hij is wil Piet die rare rododendron.
'Plant-crazy as he is, Piet wants that strange rhododendron.'

One factor to be wary of in syntactic category manipulations is that in some languages there are morpho-phonetic/phonological alternations associated with words of particular grammatical categories. For example, in Finnish, word-initial consonants are long following words of certain grammatical categories that end in vowels, e.g. infinitives and the second person imperative. Thus, in Haluan tulla sinne ('I want to come there.'), the /s/ in 'sinne' is longer than in other, 'non-lengthening' contexts (see Sulkala and Karjalainen, 1992).

3.2.2 Orthographic manipulations

Punctuation can be used successfully to signal intended syntactic structures, e.g. Cambier-Langeveld's presentation of rododendron and planten with and without an intervening space and comma in the prosodic conditions shown above. In addition, capital letters are sometimes used to indicate the presence of contrastive phrasal stress, e.g. I said BUY cakes, not MY cakes (Turk and Sawusch, 1997). However, capitals should be used with caution because they can encourage talkers to put unnatural degrees of emphasis on target words.

3.2.3 Precursor sentences

Our current preferred way of controlling the placement of phrasal stress is to use precursor sentences that suggest the likely location of new or unpredictable information, on the assumption that unpredictable words are likely to bear phrasal stress (e.g. Aylett and Turk, 2004). For example, Sugahara and Turk


(in prep.) elicited phrasal stress on baking and the lack of phrasal stress on pan through the use of the sentence set shown in (2): (2)

A pan used in the kitchen. Say "baking pan" for me.

3.2.4 Explicit instructions

We recommend avoiding the use of explicit instructions to elicit desired prosodic structures, where these instructions could have the undesired effect of getting talkers to exaggerate, or to produce patterns or contrasts that they would not normally produce. However, in some cases, explicit instructions seem the most effective way of eliciting the prosodic structure in question. For instance, we often ask talkers to avoid pausing within a target utterance to discourage the insertion of phrase boundaries before or after target words, as is often tempting in Say-X-for-me-type carrier sentences.

3.3 Influencing the expected magnitude of prosodic effects

The hierarchical level of prosodic and/or morpho-syntactic constituents is expected to have an influence on the magnitude of the durational effects that signal them. For example, in many languages, final lengthening is known to be greater in magnitude for Full Intonational Phrases than for smaller phrases (e.g. Wightman et al., 1992). In languages of this type, durational evidence for constituents near the bottom of the hierarchy, e.g. feet or words, is consequently expected to be subtle at best. In designing experiments to test for these constituents, it may be advisable to use contexts and elicitation methods that are likely to yield the largest possible effects. A growing body of experimental evidence suggests that phrasal stress, rate of speech, and, in some languages, the number of syllables in a word can be manipulated to yield larger or smaller durational differences between prosodic conditions of interest. For example, Turk and Shattuck-Hufnagel (2000) found that durational differences between segments and syllables in English, e.g. in tune #acquire vs. tuna #choir sequences, were more marked when one of the test words bore phrasal stress (e.g. phrasal stress on [tun(ə)] or [kwaɪɚ]). In addition, Beckman and Edwards (1990) and Sugahara and Turk (in prep.) found that durational differences related to word and morphological structure are magnified at slow rates of speech. And thirdly, Cambier-Langeveld (2000) and White (2002) found that effects of phrasal stress in Dutch and English were proportionally larger on monosyllabic than on polysyllabic words.


4 Summary and conclusion

In this paper, we discussed types of acoustic landmarks that define intervals whose durations can be straightforwardly correlated with the durations of recognisable articulatory events, namely oral constrictions and the vocalic intervals between them. We presented a list of segment types whose constrictions can be accurately and reliably measured in VCV and cluster contexts, and proposed that experimental materials should be designed as far as possible to include these measurable segment sequences. In addition, we discussed issues of experimental design and analysis, such as encouraging the elicitation of desired prosodic conditions and controlling for non-prosodic influences on duration. While the issues we discuss are far from exhaustive, it is our hope that other researchers will benefit from these practical considerations derived from our collective experience using acoustic segment durations in prosodic research. In particular, it is hoped that these issues might be taken into consideration in the design of automatic segmentation tools that would allow researchers to interactively specify criteria along the lines presented here. The main advantages of automatic over manual techniques are that automatic techniques are more objective and consistent, and can be used in a fraction of the time required for manual segmentation. These tools often involve alignment of a transcription with an acoustic signal using Hidden Markov Model techniques, and can involve subsequent boundary correction using, e.g., spectral discontinuity detection to improve accuracy to within 20 ms for all segment types (cf. Kim and Conkie, 2002). Currently, these types of auto-segmentation tools are being perfected for use in creating inventories for optimal unit-selection text-to-speech synthesis systems, and not specifically for phonetic investigations into segmental durations. However, they do appear to have the potential to detect landmarks of the type discussed in this paper, and would be especially useful if their accuracy were close to 5-10 ms for segments used in durational research. Their accuracy in segmenting a variety of experimental materials will need to be thoroughly evaluated before they can usefully be improved and customised to replace manual segmentation in cross-linguistic phonetic studies of duration.
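To make the idea of spectral boundary correction concrete, the sketch below moves an initial boundary estimate (e.g. from an HMM alignment) to the largest frame-to-frame spectral change within a small search window. This is only a schematic rendering in the spirit of Kim and Conkie (2002), not their algorithm; the frame step, window size, and data are invented.

```python
import numpy as np

def refine_boundary(spectra, t0_frame, search=4):
    """spectra: (n_frames, n_bins) log-magnitude spectra; t0_frame: initial
    boundary frame; search: number of frames to inspect on either side."""
    lo = max(1, t0_frame - search)
    hi = min(len(spectra) - 1, t0_frame + search)
    # Euclidean distance between successive frames within the window
    dist = np.linalg.norm(np.diff(spectra[lo - 1:hi + 1], axis=0), axis=1)
    return lo + int(np.argmax(dist))        # frame of maximal discontinuity

rng = np.random.default_rng(0)
fake = np.concatenate([rng.normal(0, 1, (50, 40)),    # 'segment A'
                       rng.normal(3, 1, (50, 40))])   # 'segment B'
print(refine_boundary(fake, t0_frame=47))             # lands near frame 50
```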

References

Aylett, M. and A. Turk (2004): The smooth signal redundancy hypothesis: A functional explanation for relationships between redundancy, prosodic prominence, and duration in spontaneous speech. Language and Speech 47(1), 31-56.

Baayen, R. H. (2004): Statistics in Psycholinguistics: A critique of some current gold standards. Mental Lexicon Working Papers 1, Edmonton, 1-45.


Beckman, M. E. and J. Edwards (1990): Lengthening and shortening and the nature of prosodic constituency. In: J. Kingston and M. E. Beckman (eds.): Laboratory Phonology I. Cambridge: Cambridge University Press, 152-178.

Boersma, P. and D. Weenink (2005): Praat: doing phonetics by computer (version 4.3). http://www.praat.org.

Browman, C. and L. Goldstein (1990): Tiers in articulatory phonology, with some implications for casual speech. In: J. Kingston and M. E. Beckman (eds.): Papers in Laboratory Phonology I: Between the Grammar and Physics of Speech. Cambridge: Cambridge University Press, 341-376.

Cambier-Langeveld, T. (1997): The domain of final lengthening in the production of Dutch. In: H. de Hoop and J. Coerts (eds.): Linguistics in the Netherlands. Amsterdam: John Benjamins, 13-24.

Cambier-Langeveld, T. (2000): Temporal Marking of Accents and Boundaries. PhD dissertation, Holland Institute of Generative Linguistics; Netherlands Graduate School of Linguistics. The Hague: Holland Academic Graphics.

Campbell, N. (1992): Segmental elasticity and timing in Japanese speech. In: Y. Tohkura, E. Vatikiotis-Bateson, and Y. Sagisaka (eds.): Speech Perception, Production and Linguistic Structure. Amsterdam: IOS Press, 403-418.

Campbell, W. N. and S. D. Isard (1991): Segment durations in a syllable frame. Journal of Phonetics 19, 37-47.

Fougeron, C. and P. Keating (1997): Articulatory strengthening at edges of prosodic domains. Journal of the Acoustical Society of America 101, 3728-3740.

Gordeeva, G. and J. M. Scobbie (2004): Non-normative preaspiration of voiceless fricatives in Scottish English: A comparison with Swedish preaspiration. Colloquium of the British Association of Academic Phoneticians, University of Cambridge. http://www.qmuc.ac.uk/sls/pg/ogordeeva/BAAP2004_GORDEEVA_SCOBBIE.ppt

Kaiki, N., K. Takeda and Y. Sagisaka (1990): The control of segmental duration in speech synthesis using linguistic properties. In: Proceedings of the 1st ESCA Workshop on Speech Synthesis, Autrans, France, 18-22.

Kim, Y.-J. and A. Conkie (2002): Automatic segmentation combining an HMM-based approach and spectral boundary correction. In: Proceedings of the 7th International Conference on Spoken Language Processing (ICSLP), Denver, 145-148.

Klatt, D. H. (1976): Linguistic uses of segmental duration in English: Acoustic and perceptual evidence. Journal of the Acoustical Society of America 59(5), 1208-1220.

Ladefoged, P. (2001): A course in phonetics. Fourth edition. Fort Worth: Harcourt College Publishers.

Ladefoged, P. and I. Maddieson (1996): The sounds of the world's languages. Malden, MA: Blackwell.


Lehiste, I. (1964): Juncture. In: E. Zwirner and W. Bethge (eds.): Proceedings of the Fifth International Congress of Phonetic Sciences, Münster, 172-100.

Lehiste, I. (1973): Rhythmic units and syntactic units in production and perception. Journal of the Acoustical Society of America 54, 1228-1234.

Loftus, G. R. and E. F. Loftus (1988): Essence of Statistics. Second edition. New York: Alfred A. Knopf.

Lucero, J. C. (1999): A theoretical study of the hysteresis phenomenon at vocal fold oscillation onset-offset. Journal of the Acoustical Society of America 105(1), 423-431.

Ogden, R. (2004): Non-modal voice quality and turn-taking in Finnish. In: E. Couper-Kuhlen and C. E. Ford (eds.): Sound Patterns in Interaction. Amsterdam: Benjamins, 29-62.

Peterson, G. E. and I. Lehiste (1960): Duration of syllable nuclei in English. Journal of the Acoustical Society of America 32, 693-703.

Raaijmakers, J., J. Schrijnemakers and A. Gremmen (1999): How to deal with "the language-as-fixed-effect fallacy": Common misconceptions and alternative solutions. Journal of Memory and Language 41, 416-426.

Remijsen, B. (2005): http://www.ling.ed.ac.uk/~bert/praatscripts.html

Romaine, S. (1978): Post-vocalic /r/ in Scottish English: Sound change in progress? In: P. Trudgill (ed.): Sociolinguistic Patterns in British English. London: Edward Arnold, 144-158.

Scott, D. R. (1982): Duration as a cue to the perception of a phrase boundary. Journal of the Acoustical Society of America 71(4), 996-1007.

Selkirk, E. O. (1986): On derived domains in sentence phonology. Phonology Yearbook 3, 371-405.

Shattuck-Hufnagel, S. and A. E. Turk (1996): A prosody tutorial for investigators of auditory sentence processing. Journal of Psycholinguistic Research 25(2), 193-247.

Stevens, K. N. (1998): Acoustic Phonetics. Cambridge, Massachusetts: MIT Press.

Stevens, K. N. (2002): Toward a model for lexical access based on acoustic landmarks and distinctive features. Journal of the Acoustical Society of America 111(4), 1872-1891.

Sugahara, M. and A. Turk (in prep.): The influence of morphological structure on acoustic segment durations. Manuscript.

Sulkala, H. and M. Karjalainen (1992): Finnish. London: Routledge.

Truckenbrodt, H. (1999): On the relation between syntactic phrases and phonological phrases. Linguistic Inquiry 30(2), 219-255.

Turk, A. E. and J. R. Sawusch (1997): The domain of accentual lengthening in American English. Journal of Phonetics 25, 25-41.

Turk, A. E. and S. Shattuck-Hufnagel (2000): Word-boundary-related duration patterns in English. Journal of Phonetics 28, 397-440.


van Santen, J. P. H. (1994): Assignment of segmental duration in text-to-speech synthesis. Computer Speech and Language 8, 95-128.

Walsh, T. and F. Parker (1983): The duration of morphemic and non-morphemic /s/ in English. Journal of Phonetics 11, 201-206.

Warner, N., A. Jongman, J. Sereno and R. Kemps (2004): Incomplete neutralization and other sub-phonemic durational differences in production and perception: Evidence from Dutch. Journal of Phonetics 32, 251-276.

White, L. (2002): English speech timing: a domain and locus approach. Unpublished PhD dissertation, University of Edinburgh, Edinburgh, UK.

Wightman, C. W., S. Shattuck-Hufnagel, M. Ostendorf and P. J. Price (1992): Segmental durations in the vicinity of prosodic phrase boundaries. Journal of the Acoustical Society of America 91(3), 1707-1717.

Zue, V. (1985): Notes on spectrogram reading. In: Speech Spectrogram Reading: An Acoustic Study of English Words and Sentences. Special Summer Course, Lecture Notes and Spectrograms. Massachusetts Institute of Technology, 1988, N1-N392.

Dik J. Hermes (Eindhoven)*

* I thank Piet Mertens for calculating and supplying me with the prosogram presented in Figure 3, and making it available for this paper.

Stylization of Pitch Contours

1 Introduction

Various procedures have been proposed to describe pitch contours of spoken utterances in terms of a small number of straight-line elements. The result is then referred to as a stylized pitch contour, which ideally has as small a number of straight lines as possible without affecting any of the perceptually relevant properties of the pitch contour. The basic idea underlying this procedure is that, by eliminating all details of the pitch contour that play no communicative role, those perceptual properties of the pitch contour become apparent that are the essential constituents of the intonation patterns of the utterances. In this way, by eliminating what is not essential and keeping the essential properties of the pitch contour of the utterance unaffected, one aims at discovering the structure of intonation in natural speech. But, before going on, we will define three important concepts in stylization, namely stress, accent, and prominence, as they are used in this paper.

Stress is a binary, linguistic property of a syllable. In general, native speakers of a stress language can easily indicate which syllable of a word has stress. Moreover, in some languages, e.g., Dutch and English, some words can have two stressed syllables, one syllable with primary stress and another with secondary stress. For instance, the word 'organization' has primary stress on the syllable 'za' and secondary stress on the syllable 'or'. In normal conditions, stressed syllables are spoken with longer durations than unstressed syllables of the same phonetic content and, probably more importantly, the spectral balance in stressed syllables is shifted to higher frequencies than in unstressed syllables (Sluijter and Van Heuven, 1996; Sluijter, Van Heuven, and Pacilly, 1997). This shows that stressed syllables are spoken with more effort than unstressed syllables.

Accent is a binary feature of a spoken, generally stressed, syllable, lent to the syllable by the speaker. It is a binary property in the sense that the syllable is either accented or not accented. Only in extra-linguistic cases, e.g., 'I did not say DEfine, I said REfine', can unstressed syllables carry an accent. In general, the percept of accentuation is induced by the presence of a pitch movement which can precede the accented syllable or start during the syllabic nucleus of the accented syllable (Hermes, 1997; Section 4.2). This will be referred to, if necessary, as a pitch accent. Only in exceptional cases, for instance - but not only - in whispered speech, can an accent be lent by producing the syllable with more effort. Pitch movements, if well positioned and large enough, are a very strong cue for accentuation. They generally overrule the effects of longer durations or intensity increments (Fry, 1958).

While carrying an accent is an all-or-none property of a spoken syllable, prominence is a gradual property of a spoken syllable. It is defined by Portele and Heuft (1997, p. 63) as "a quantitative parameter of a syllable [...] that describes markedness relative to surrounding syllables". Hence, an accented syllable can be more or less prominent depending, e.g., on the excursion size of the pitch movement inducing the accent. For a more elaborate discussion of the concept of prominence, I refer to Portele and Heuft (1997) and Terken and Hermes (2000).

In this paper we will discuss two different methods of pitch contour stylization. To begin with, we will describe how pitch contours can be reduced to a small number of continuous segments. These segments consist of straight lines in the IPO method described in this paper, as well as in the methods presented by Scheffers (1988) and Bagshaw (1993). In the method proposed by Taylor (1994, 2000), some of the segments consist of quadratic polynomials. In the method described by Hirst, Di Cristo, and Espesser (2000), the segments consist of quadratic splines. Utterances resynthesized with the original and with the stylized contours should not sound different. A stylization will be called a close-copy stylization if the number of segments is at a minimum, in the sense that the number of segments can no longer be reduced without audibly affecting the intonation of the utterance. The second method we will present is based on models of tonal perception, and results in separate stylized contours for every syllable of an utterance, representing syllabic tones. In preparation for discussing this method, we will describe the model of tonal perception on which it is based. Then we will describe this method as presented by d'Alessandro and Mertens (1995) and Mertens (2004a; b). The assumptions underlying these methods and their theoretical implications will be discussed briefly. Lastly, we will discuss how these two different approaches may be reconciled.

2 Points of departure

2.1 Pitch measurements and voiced-unvoiced detection

The first step in automatic stylization procedures consists of estimating which segments of the speech signal have pitch and what frequency corresponds with this pitch. The successive speech segments for which the pitch frequency is estimated and stored, possibly together with some other quantities, will be referred to as frames. For a review of pitch-determination algorithms, mostly aimed at recovering the period of the voiced speech segments, I refer to Hess (1983). A discussion of pitch-estimation algorithms which include knowledge about human hearing is presented in Hermes (1993). The problems involved in estimating the pitch of voiced fricatives, creaky speech segments, and speech segments with very strong harmonics at frequencies close to the formant frequencies are, moreover, discussed in detail in Hermes (1993).

2.2 Continuity of pitch contours

In the IPO theory of intonation ('t Hart et al., 1990), pitch contours are presented as series of continuous straight-line segments. It is presumed that the unvoiced speech segments are filled in by the perceptual process. Scheffers (1988, p. 982), who was the first to publish an automatic stylization algorithm, also writes that "listeners won't perceive sentence melodies to be interrupted by unvoiced speech sounds". In other representations as well (Taylor, 1994; 2000; Hirst et al., 2000), pitch contours are presumed to be continuous. While we will adopt this assumption at first, we will discuss it further later on.

Pitch contours must start and end somewhere. In general, the intonation of stretches of normal speech is partitioned into intonation phrases. Hence, it seems most natural to start and end stylization procedures at intonation-phrase boundaries. The ToBI system (Silverman et al., 1992) also recognizes 'intermediate phrases', but this will not be discussed here (for a discussion of this topic see, e.g., Ladd, 1996, pp. 92-94). For the moment we will assume that a pitch contour covers an intonation phrase and is continuous between the start and the end of this phrase. This leaves us with the problem of how to detect the beginning and end of an intonation phrase automatically. In the ToBI system, four different categories of pauses are recognized, the break indices, numbered 1 to 4. The break between two intonation phrases is indicated with 4. It may seem natural to identify a boundary between two intonation phrases with a silent interval in the speech signal longer than, say, 100 ms. Not all silent intervals, even longer ones, are necessarily phrase boundaries, however. For instance, in an utterance like 'I didn't say a BAD day, I said a BLACK day', the accented words 'BAD' and 'BLACK' can be preceded and followed by significant silent intervals, which can be much longer than a possible pause at the comma between 'day' and 'I'. A second cue that may play a role in the signaling of phrase boundaries is the declination reset, corresponding with an increase in the speaker's pitch range at the start of an intonation phrase. The most reliable cue that signals the presence of an intonation-phrase boundary, however, appears to be final lengthening or pre-boundary lengthening.


An algorithm to detect phrase boundaries automatically was published by Campbell (1993). It is based on the relative log-transformed durations of the segments of a syllable. Let μᵢ and σᵢ be the averages and standard deviations of the log-transformed durations of the n segments that constitute the syllable. These averages and standard deviations are calculated from all tokens of a segment in a large spoken database. The perceived duration of the syllable is then estimated by calculating the factor k in the equation for the actual total duration D of the syllable,

D = Σᵢ exp(μᵢ + k·σᵢ).

If k turns out to be smaller than 0, the perceived duration of the syllable is relatively short; if it turns out to be larger than 0, the perceived duration is relatively long. Campbell distinguishes pre-boundary lengthening from lengthening related to syllable stress: in stressed syllables the onset of the syllable gets longer while the coda gets shorter, whereas in pre-boundary syllables the onsets are kept short and the codas are lengthened.
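Since the right-hand side of this equation is monotonically increasing in k, the factor can be found with a one-dimensional root search. The sketch below assumes per-segment means and standard deviations of log duration (in log seconds) taken from a database; the values are invented for illustration.

```python
import numpy as np
from scipy.optimize import brentq

mu    = np.array([-3.0, -2.2, -2.6])   # log-duration means of the segments
sigma = np.array([0.30, 0.25, 0.35])   # corresponding standard deviations
D     = 0.28                           # observed syllable duration (s)

# D = sum_i exp(mu_i + k * sigma_i) is monotone in k, so one root exists
k = brentq(lambda k: np.exp(mu + k * sigma).sum() - D, -10.0, 10.0)
print(f"k = {k:.2f}  ({'lengthened' if k > 0 else 'shortened'})")
```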

We conclude, for the moment, that continuous speech can be partitioned into intonation phrases, and that within phrase boundaries pitch contours are perceived as continuous.

2.3 The perceptual irrelevance of microintonation

One of the main reasons for stylization is that the unprocessed series of pitch estimations contains many fluctuations which are supposed to be an irrelevant part of the information conveyed by the intonation patterns of speech. These 'perceptually irrelevant' pitch fluctuations, which can be the size of accent-lending pitch movements, often correspond with glottal stops, releases after plosives, and other source-filter interactions occurring at instances at which the air pressure in the vocal tract changes rapidly. They mostly occur with the production of consonants and arise during syllable onsets and just after vowel onsets. An explanation for the perceptual insignificance of microintonation may be found in the work by House (1990). He showed that in speech segments characterized by rapid spectral transitions, pitch changes are irrelevant for the categorization of tonal movements. This might explain why listeners who focus their attention on intonation do not notice the changes introduced by the removal of the microfluctuations from the pitch contour. It is often astonishing how difficult it can be to hear the difference between an utterance resynthesized with the original pitch contour and an utterance resynthesized with the stylized pitch contour. Nevertheless, differences can sometimes be heard if the listener concentrates hard, but then they have the character of changes in the segmental content of a speech segment, rather than of changes in the pitch contour. For instance, a glottal stop may become much less audible. In other cases, the listener has to direct his or her attention to a specific part of the contour in order to notice the difference.

2.4 Straight lines vs. curves

Most studies on pitch-contour stylization (e.g., 't Hart et al., 1990; Scheffers, 1988) present stylizations as a sequence of continuous straight-line segments. In Hirst et al. (2000), however, the pitch contour is reduced to a series of pitch-target points. Instead of straight lines, the contour between the target points consists of quadratic splines that represent the 'macroprosodic components' (p. 60) of the utterance. The pitch-target points are not identified with a specified phonetic segment or syllable, but are derived by a procedure in which candidate target points are first defined for every speech frame. Then, in an iterative procedure, the candidates are partitioned and reduced in number in such a way that, ideally, the number of target points is at a minimum, while the resynthesized contour is indistinguishable from the original. In the models of intonation developed by Taylor (1994; 2000), perceptually significant rises and falls are described by quadratic polynomials, parts of parabolas, but the connections are represented by straight lines. 't Hart (1991) showed that, from a perceptual point of view, the audible difference is negligible. Hirst et al. (2000) argue that quadratic splines approximate the original contour more closely and, hence, can be more economical than straight-line approximations. The main difference in the visual representation of straight lines versus curves lies in the clear visibility of the turning points of the stylizations. In stylizations consisting of straight lines, our attention is mainly drawn to these turning points, whether they occur in unvoiced or in voiced parts. The important question here is whether the turning point, or its timing, as such conveys information to the listeners. If so, turning points should be clearly represented in the time-frequency visualization of the contour. If not, they can best be obscured by representing the contour by curves in which both the curves themselves and their time derivative are continuous, as is the case with quadratic splines. This will be discussed later on. The two representations are contrasted in the sketch below.
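The contrast can be made concrete in a few lines: the same target points rendered once as straight-line segments and once as a quadratic spline, in which both the curve and its time derivative are continuous. The target times and pitch values below are invented.

```python
import numpy as np
from scipy.interpolate import interp1d, make_interp_spline

t_targets = np.array([0.0, 0.35, 0.55, 1.10, 1.40])   # target times (s)
p_targets = np.array([88.0, 96.0, 90.0, 91.5, 84.0])  # target pitches (Hz)

t = np.linspace(0.0, 1.4, 200)
straight = interp1d(t_targets, p_targets)(t)              # piecewise linear
spline = make_interp_spline(t_targets, p_targets, k=2)(t) # quadratic spline

# The linear version has visible slope breaks at the targets; the spline
# hides the turning points, as discussed above.
print(straight[70], spline[70])
```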

3 Stylization of pitch contours: A movement approach

3.1 Manual stylization of pitch contours

Manual stylization of pitch contours started at the IPO/Institute for Perception Research in Eindhoven ('t Hart, 1976; 1979; Pijper, 1979). In the IPO approach, close-copy stylization is a first step in the phonetic analysis of pitch contours. A first straight-line approximation is made by the experimenter by inspecting the series of estimated pitches. The utterance is then resynthesized with the straight-line approximation as pitch contour, using LPC or PSOLA techniques (Hamon, Moulines, and Charpentier, 1989), and then compared with the utterance resynthesized with the original pitch contour. On a trial-and-error basis, turning points can then be deleted, included, and shifted, up to a level at which the number of straight-line segments is at a minimum, while the intonation of the approximation shows no audible difference from that of the original. In this way, one reduces the pitch contour to a sequence of perceptually relevant pitch movements, the building blocks of the IPO model of intonation ('t Hart et al., 1990). By definition, an utterance resynthesized with a stylized contour should be perceptually equal to the utterance resynthesized with its original contour. By reducing the number of straight-line segments as much as possible without affecting the perceptual equality, a close-copy stylization is obtained, containing all, and nothing but, the perceptually relevant pitch movements. Note the subtle difference between being perceptually equal and being auditorily indistinguishable. Perceptual equality here means that the intonation of the two utterances is the same but, as explained in Section 2.3 on microintonation, this may mean that, especially for trained listeners, some segmental differences remain audible.

3.2 Automatic stylization: The Scheffers method

The first automatic stylization procedure was described by Scheffers (1988) and, as it covers all the essential aspects of automatic stylization, we will describe it in some detail. In this section, we will also briefly describe the method presented by Bagshaw (1993).

Scheffers' method starts with voiced-unvoiced detection and with estimating the pitches at 10-ms-spaced frames of the speech signal. Scheffers then stores the results, on a logarithmic frequency scale, together with the short-term energy. We will comment on the use of the logarithmic frequency scale in Section 6.5. Next, voicing errors, which are found to be very disruptive in listening tests, are corrected manually. After this preparatory work, the proper stylization procedure starts by searching for turning points, defined as points in the pitch contour where the slope of the contour changes "clearly" and "consistently". Scheffers defines two parameters, MAXVAR and MINDUR. MAXVAR is the maximum tolerated deviation of the actual pitch values from the regression lines that make up the stylized contour; MINDUR is the minimum duration a regression line is allowed to deviate from the actual pitch values before a new candidate turning point is defined. Scheffers' algorithm then starts with the calculation of a linear regression line through the estimated pitch values of the first three voiced frames. The pitch value of the fourth frame is predicted on the basis of this regression. If the deviation of the actual pitch value from this prediction does not exceed MAXVAR, an estimate of the fifth pitch value is calculated through the linear regression of the first four pitch values. If, again, the deviation does not exceed MAXVAR, the sixth pitch value is predicted, etc. This goes on until the deviations exceed MAXVAR and the duration of this excess is at least MINDUR; when this happens, the point at which the predicted contour started to deviate from the actual pitch values becomes a candidate turning point of the stylization. Then, the procedure restarts with the established turning point as the new starting point.

An important aspect of this procedure is that an estimate of the short-term energy is used as a weight factor in calculating the regression lines. Scheffers uses the zeroth autocorrelation coefficient from the LPC analysis of the speech signal low-pass filtered at 5 kHz. As a consequence, the contribution of the more intense speech segments, generally the sonorant speech parts, is larger than that of the less intense speech segments. This in fact strongly reduces the disturbing factor of microintonation, which generally is insignificant in the vowel nucleus, where the sound intensity is largest. In other words, the pitch values during vowels play a more significant role in estimating the position of the turning points and in calculating the regression lines. Next, for each candidate turning point, except the first and the last, the location is optimized by considering adjacent frames as alternative positions. For each of these frames, the sum of the squared errors of the two regression lines crossing at that frame is calculated. The position at which the error sum reaches a local minimum is then chosen as the definite turning point. If in this process the interval between two turning points falls below 30 ms, one is removed. Finally, the stylized contour is determined by recalculating the regression lines between the established turning points, for which the signal intensity is again used as a weight factor. (Note that this last aspect of the procedure does not ensure that the stylized contour is fully continuous at the previously defined turning points.) The resulting automatically obtained stylized contours were compared with contours stylized manually. This was done for various values of the two free parameters of the procedure, MAXVAR and MINDUR. Scheffers (1988, p. 987) then concludes that "variations in F0 of less than 1.5 semitone and with a duration of less than 100 ms can hardly be heard in running speech". The core of this incremental search is sketched below.
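The following is a much-simplified sketch of the incremental-regression search, not Scheffers' implementation: a line is fitted over a growing stretch of frames, and a candidate turning point is fixed once the prediction error exceeds MAXVAR for at least MINDUR. Frame spacing, units, and the test signal are invented.

```python
import numpy as np

FRAME = 0.010           # 10-ms frames
MAXVAR = 1.5            # semitones
MINDUR = 0.100          # seconds

def turning_points(pitch_st, weights):
    """pitch_st: pitch per voiced frame (semitones); weights: e.g. energy."""
    n = len(pitch_st)
    points, start = [0], 0
    while start < n - 3:
        end = start + 3                        # begin with three frames
        while end < n:
            t = np.arange(start, end) * FRAME
            a1, a0 = np.polyfit(t, pitch_st[start:end], 1,
                                w=weights[start:end])
            # look ahead over a MINDUR window and predict from the line
            future = np.arange(end, min(n, end + int(MINDUR / FRAME)))
            pred = a0 + a1 * future * FRAME
            deviating = np.abs(pred - pitch_st[future]) > MAXVAR
            if deviating.size and deviating.all():   # persists for MINDUR
                break                                # candidate turning point
            end += 1
        points.append(min(end, n - 1))
        start = points[-1]
    return points

rng = np.random.default_rng(1)
f0 = np.concatenate([np.linspace(0.0, 6.0, 40),      # rise
                     np.linspace(6.0, 1.0, 60)])     # fall
print(turning_points(f0 + rng.normal(0.0, 0.2, 100), np.ones(100)))
```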


In a study of the acoustic events related to stress and accents in English, Bagshaw (1993) adopts the method described by Scheffers with some minor modifications. For instance, he uses 1 semitone as the permitted variation between F0 and the predicted contour. The maximum interval over which the discrepancy may occur is 100 ms, as with Scheffers. Furthermore, Bagshaw ensures that discontinuities only occur at unvoiced segments. Bagshaw starts with a syllabification of the utterance. This syllabification is based on sonorant energy, energy within the frequency range between 50 Hz and 2 kHz, calculated from the amplitude spectra of 20-ms frames. The minima in this sonorant-energy contour are selected as candidates for syllable boundaries, while the maxima are identified as possible candidates for the centres of the syllabic nuclei. By checking, comparing, and adjusting these extrema with an automatic segmentation of the utterance, the utterance is divided into syllables and syllabic consonants defined by maxima in the sonorant-energy contour, and bounded by a segmental boundary close to a minimum in the energy contour. Then Bagshaw selects only those segments of the stylized pitch contour that cross these syllable nuclei. The fundamental importance of this will be discussed later on.

3.3 Automatic stylization: The IPO method

Hermes developed a stylization algorithm used for the teaching of intonation to prelingually deaf children (Spaai, Storm, and Hermes, 1993; Spaai, Derksen, Hermes, and Kaufholz, 1996). This algorithm was built into an intonation meter which displayed the pitch contours of the utterances in real time, calculated the stylizations, and displayed them on a computer screen. Besides the stylized contours, the intonation meter displayed the vowel onsets (Hermes, 1990), so that the alignment of the pitch movements with the syllable structure was clearly visible. The pitch estimations for the data presented in this paper were obtained by subharmonic summation (SHS) (Hermes, 1988), coupled with a dynamic-programming routine that produces continuous pitch contours running through an optimal path of the pitch candidates produced by SHS. Furthermore, a quantity is determined which plays the same role as the sonorant energy in Scheffers' method, the vowel strength, a quantity calculated in the process of vowel-onset detection. For this, it is enough to know that vowel strength is a measure of the harmonic energy that contributes to the first two formants of the vowel. (For details see Hermes, 1990.) Next, in order to reduce microintonation, this contour is smoothed by a second-order filter with a weak resonance frequency of 3.5 Hz. The logarithm of the vowel strength is used as a weight factor in the smoothing process. The weak resonance prevents the maxima and minima preceding and following a steep rise or a steep fall from shifting in time away from that rise or fall. This smoothing removes a lot of the microintonation and produces a smooth, continuous pitch curve. The result is shown in Figure 1a for the utterance 'Fred can go, Susan can't go, and Linda's uncertain' from the TIMIT database. It shows the original contour as a thin line and the smoothed contour as a thick line.

The following stylization procedure is then used. As in Scheffers' method, it starts by calculating a regression line through the pitch values of the first few frames. Vowel strength is again used as a weight factor. The formulae used are the following. We want to calculate the regression line p'ₙ = a₀ + a₁tₙ, in which tₙ are the times of the frames, a₀ and a₁ are the regression coefficients, and p'ₙ the calculated pitch values of the frames on the regression line. Then, if wₙ are the weight factors and

D = Σₙ wₙ · Σₙ wₙtₙ² − (Σₙ wₙtₙ)²,

the regression coefficients a₀ and a₁ are given by

a₁ = (Σₙ wₙ · Σₙ wₙtₙpₙ − Σₙ wₙtₙ · Σₙ wₙpₙ) / D,

a₀ = (Σₙ wₙtₙ² · Σₙ wₙpₙ − Σₙ wₙtₙ · Σₙ wₙtₙpₙ) / D,

where pₙ are the measured pitch values and the sums run over the frames considered.
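These formulae translate directly into code; the sketch below evaluates them on invented frame times, pitch values, and weights, and should recover the slope and intercept of the underlying line.

```python
import numpy as np

t = np.arange(10) * 0.01                 # frame times (s)
p = 90.0 + 25.0 * t + np.random.default_rng(2).normal(0, 0.3, 10)
w = np.ones(10)                          # vowel strength as weight factor

D = w.sum() * (w * t**2).sum() - (w * t).sum() ** 2
a1 = (w.sum() * (w * t * p).sum() - (w * t).sum() * (w * p).sum()) / D
a0 = ((w * t**2).sum() * (w * p).sum() - (w * t).sum() * (w * t * p).sum()) / D
print(a0, a1)                            # roughly 90 and 25
```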

Figure 1: Three representations in the time-frequency domain of the pitch contour of the utterance 'Fred can go, Susan can't go, and Linda's uncertain'. In (a) the smoothed pitch contour is shown as a thick line over the original. In (b) the automatically obtained close-copy is presented. In (c) the vowel onsets are added to illustrate the alignment of the contour with the rhythmic structure.

This is done for an ever-increasing number of frames until the root-mean-square (rms) error between the regression line and the smoothed pitch curve exceeds a criterion which depends on the number of frames; it is set to 0.2 E (about 1 semitone on the ERB-rate scale). (As the number of frames increases, the criteria for this process become more stringent.) The value of this regression line at its first frame is fixed, thus providing the starting point of the stylization. The last point of this first regression line is a candidate for the first turning point and is fixed for the moment. From this point a new regression procedure starts, under the condition that the regression line passes through this first candidate turning point. This is calculated for an ever-increasing number of frames until, again, the rms error criterion is exceeded. The last point of this regression line is then fixed and becomes a candidate for the second turning point of the stylized contour. Hence, we now have the starting point of the pitch contour, a candidate for the first, and a candidate for the second turning point. The first definite turning point is then found by calculating two crossing regression lines, one starting at the starting point, the other ending at the frame of the second candidate turning point. These two regression lines must cross at a frame intermediate between the starting point and the second candidate turning point. This is done for all frames more than 50 ms after the starting point and more than 50 ms before the second candidate turning point. That intermediate frame for which the rms error attains a minimum is the first definite turning point of the contour. The formulae used for calculating the two regression lines that must cross at t₀ are the following.