Sensory evaluation of sound 9781498751360, 1498751369


English Pages 538 [581] Year 2019

Sensory Evaluation of Sound


Edited by

Nick Zacharov

MATLAB® is a trademark of The MathWorks, Inc. and is used with permission. The MathWorks does not warrant the accuracy of the text or exercises in this book. This book’s use or discussion of MATLAB® software or related products does not constitute endorsement or sponsorship by The MathWorks of a particular pedagogical approach or particular use of the MATLAB® software.

CRC Press
Taylor & Francis Group
6000 Broken Sound Parkway NW, Suite 300
Boca Raton, FL 33487-2742

© 2019 by Taylor & Francis Group, LLC
CRC Press is an imprint of Taylor & Francis Group, an Informa business

No claim to original U.S. Government works

Printed on acid-free paper
Version Date: 20181103

International Standard Book Number-13: 978-1-4987-5136-0 (Hardback)

This book contains information obtained from authentic and highly regarded sources. Reasonable efforts have been made to publish reliable data and information, but the author and publisher cannot assume responsibility for the validity of all materials or the consequences of their use. The authors and publishers have attempted to trace the copyright holders of all material reproduced in this publication and apologize to copyright holders if permission to publish in this form has not been obtained. If any copyright material has not been acknowledged please write and let us know so we may rectify in any future reprint.

Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying, microfilming, and recording, or in any information storage or retrieval system, without written permission from the publishers.

For permission to photocopy or use material electronically from this work, please access www.copyright.com or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization that provides licenses and registration for a variety of users. For organizations that have been granted a photocopy license by the CCC, a separate system of payment has been arranged.

Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used only for identification and explanation without intent to infringe.
Visit the Taylor & Francis Web site at http://www.taylorandfrancis.com and the CRC Press Web site at http://www.crcpress.com

To my loved ones: Paula, Manu, Misha (1929–2002) and Micheline (1926–2017)

Contents

Foreword
Preface
Acknowledgements
Nomenclature and Abbreviations
Contributors

Section I Background

Chapter 1 Introduction
Nick Zacharov

1.1 WHAT IS SENSORY EVALUATION?
1.2 WHY DO WE NEED SENSORY EVALUATION?
1.3 WHEN TO APPLY SENSORY EVALUATION?
1.4 WHAT CAN WE LEARN FROM SENSORY EVALUATION?
1.5 CONCEPTS AND TERMINOLOGY
    1.5.1 Percepts, attributes, and attribute names
    1.5.2 The relationship between attributes and dimensionality
    1.5.3 The number of response variables and dimensions
1.6 BEFORE MOVING ON

Chapter 2 Why Sound Matters
Julian Treasure

Chapter 3 Sound, Hearing and Perception
Ville Pulkki

3.1 BASICS OF SOUND AND AUDIO
    3.1.1 Vibrations as source of sound
    3.1.2 Sound sources
    3.1.3 Sound waves and sound pressure
    3.1.4 Sound pressure level
    3.1.5 Frequency-weighted measurements for sound level
    3.1.6 Capturing sound signals
    3.1.7 Digital presentation of signals
    3.1.8 Further reading
3.2 HEARING – MECHANISMS AND BASIC PROPERTIES
    3.2.1 Anatomy of the ear
    3.2.2 Structure of the cochlea
    3.2.3 Passive frequency selectivity of the cochlea
    3.2.4 Active feedback loop in the cochlea
    3.2.5 Mechanical to neural transduction
    3.2.6 Non-linearities in the cochlea
    3.2.7 Temporal response of the cochlea
    3.2.8 Frequency selectivity of hearing
    3.2.9 Filterbank model of the cochlea
    3.2.10 Masking effects
3.3 PSYCHOACOUSTIC ATTRIBUTES
    3.3.1 Loudness
    3.3.2 Pitch
    3.3.3 Other psychoacoustic quantities
3.4 SPATIAL HEARING
    3.4.1 Head-related acoustics
    3.4.2 Localisation cues
    3.4.3 Binaural cues
    3.4.4 Monaural cues
    3.4.5 Dynamic cues
3.5 PSYCHOPHYSICS AND SOUND QUALITY
3.6 MUSIC AND SPEECH
3.7 HEARING IMPAIRMENT
3.8 SUMMARY

Section II Theory and Practice

Chapter 4 Sensory Evaluation in Practice
Torben Holm Pedersen and Nick Zacharov

4.1 INTRODUCTION
    4.1.1 Can sensory evaluation be scientific?
        4.1.1.1 Perceptual/descriptive scaling
        4.1.1.2 Affective/subjective scaling
        4.1.1.3 The filter model
4.2 DESIGNING A LISTENING TEST
    4.2.1 Defining the purpose of a test
    4.2.2 Test method selection
        4.2.2.1 Quantification of impression
    4.2.3 Experimental variables
        4.2.3.1 Input variables
        4.2.3.2 Response variables
        4.2.3.3 Other experimental variables
        4.2.3.4 Design of experiment
    4.2.4 Test duration
    4.2.5 Test samples
    4.2.6 Usage scenarios
    4.2.7 Perceptual constancy
4.3 LISTENING TEST INGREDIENTS
    4.3.1 Stimuli
        4.3.1.1 Product testing
        4.3.1.2 Product testing - By recording
        4.3.1.3 Algorithm testing
        4.3.1.4 Brand sound and music
    4.3.2 Attributes
        4.3.2.1 Requirements for good consensus attributes
        4.3.2.2 Lexicons and sound wheels
        4.3.2.3 Attribute development for novel applications
        4.3.2.4 Attribute validation
        4.3.2.5 Language issues
    4.3.3 Assessors and panels
        4.3.3.1 Assessor types
        4.3.3.2 Panel types
        4.3.3.3 Assessor selection
        4.3.3.4 Assessor performance evaluation
        4.3.3.5 Panel training
        4.3.3.6 Panel maintenance
        4.3.3.7 Ethics
    4.3.4 Listening facilities
        4.3.4.1 Listening rooms
        4.3.4.2 Listening booths
        4.3.4.3 Field testing
    4.3.5 Equipment
        4.3.5.1 Electroacoustic equipment
        4.3.5.2 Systems for test administration
4.4 TEST ADMINISTRATION
    4.4.1 Calibration and equalisation
    4.4.2 Instructions
    4.4.3 Familiarisation
    4.4.4 Controlling bias
    4.4.5 Anchoring
4.5 GOOD PRACTICES IN REPORTING LISTENING TESTS
    4.5.1 Front page
    4.5.2 Introduction
    4.5.3 Method
    4.5.4 Measuring objects
    4.5.5 Equipment
    4.5.6 Location
    4.5.7 Assessors
    4.5.8 Physical measurements
    4.5.9 Test administration
    4.5.10 Analysis and discussion
    4.5.11 Results
4.6 SUMMARY CHECKLIST

Chapter 5 Sensory Evaluation Methods for Sound
Jesper Ramsgaard, Thierry Worch, and Nick Zacharov

5.1 TEST METHOD FAMILIES
    5.1.1 Discrimination methods
    5.1.2 Integrative methods
    5.1.3 Descriptive analysis methods
    5.1.4 Mixed methods
    5.1.5 Temporal methods
    5.1.6 Performance methods
5.2 DISCRIMINATION METHODS
    5.2.1 Comparing pair(s) of systems discrimination tests
    5.2.2 Paired comparison of multiple systems based on a sensory attribute
5.3 INTEGRATIVE METHODS
    5.3.1 Preference and affective methods
    5.3.2 Recommendation ITU-T P.800 (ACR, CCR, DCR)
    5.3.3 Recommendation ITU-R BS.1534 (MUSHRA)
    5.3.4 Recommendation ITU-R BS.1116
5.4 CONSENSUS VOCABULARY METHODS
    5.4.1 Quantitative descriptive analysis
    5.4.2 Semantic differential
    5.4.3 Recommendations ITU-T P.806 and ITU-T P.835
5.5 INDIVIDUAL VOCABULARY METHODS
5.6 MIXED METHODS - EXPLAINING PREFERENCE
    5.6.1 Audio descriptive analysis and mapping
    5.6.2 Multiple stimulus - Ideal profile method
5.7 INDIRECT ELICITATION METHODS
    5.7.1 Multi-dimensional scaling
    5.7.2 Free sorting
    5.7.3 Projective mapping and napping
5.8 ASSOCIATIVE METHODS
    5.8.1 Free association
    5.8.2 Check-all-that-apply (CATA)
    5.8.3 Rate-all-that-apply (RATA)
5.9 CLOSING WORDS

Chapter 6 Applied Univariate Statistics
Per Bruun Brockhoff and Federica Belmonte

6.1 INTRODUCTION
6.2 SOME BASICS OF STATISTICAL REASONING
6.3 THE GENERAL EXPERIMENTAL AND ANALYSIS APPROACHES
    6.3.1 Simple versus elaborate analysis approaches
    6.3.2 Multiple comparison corrections
6.4 MIXED MODELS FOR SENSORY AND CONSUMER DATA
    6.4.1 Headphone data - Structure and exploratory analysis
    6.4.2 Overview of mixed model tools
    6.4.3 From t-tests to multi-factorials and multi-blockings
        6.4.3.1 Extension 1: K > 2 levels of the “treatment” factor
        6.4.3.2 Extension 2: Replications - Two blocking factors
        6.4.3.3 Extension 3: More than a single treatment factor
        6.4.3.4 Extension 4: Even more complex structures
    6.4.4 Headphone data analysis, modelling by mixed models
        6.4.4.1 Post hoc analysis
    6.4.5 Analysis of and correction for scaling effects
    6.4.6 SensMixed-package: Multi-attribute automated analysis
6.5 MODEL VALIDATION AND PERSPECTIVES: A VIEW TOWARDS NON-NORMAL AND NON-LINEAR MIXED MODELS WITHIN SENSORY EVALUATION
    6.5.1 Model validation
    6.5.2 Non-normal data - What to do generally?
        6.5.2.1 Ignore it!
        6.5.2.2 Use non-parametric (rank-based) methods
        6.5.2.3 Use transformation to obtain (near) normality
        6.5.2.4 Identify the “correct” distribution and do statistics based on that
        6.5.2.5 Use simulation-based methods:
        6.5.2.6 sensR - A brief on discrimination and similarity testing
        6.5.2.7 Ordinal - A brief on ordinal data analysis
6.6 RELATION TO MULTIVARIATE ANALYSIS

Chapter 7 Applied Multivariate Statistics
Sébastien Lê

7.1 A MAIEUTIC DEFINITION OF MULTIDIMENSIONAL ANALYSIS
7.2 WHEN A BIVARIATE QUESTION TURNS INTO A MULTIDIMENSIONAL ISSUE
    7.2.1 Does occupation impact the music genre people listen to?
    7.2.2 To what extent do farmers compare to executives and pensioners?
    7.2.3 In summary
7.3 ONE OF THE MAIN ISSUES IN MULTIDIMENSIONAL ANALYSIS: CHOOSING THE PROPER DISTANCE
    7.3.1 How would you describe a company with such a sound logo?
    7.3.2 Do you think this logo is too serious compared to the brand’s values?
    7.3.3 In summary
7.4 SOME CONSIDERATIONS DEPENDING ON THE DATA TYPE
    7.4.1 Some considerations for understanding MCA
    7.4.2 Some considerations to better understand PCA
    7.4.3 In summary

Section III Application

Chapter 8 Telecommunications Applications
Alexander Raake, Janto Skowronek, and Michał Sołoducha

8.1 INTRODUCTION
8.2 PRINCIPLES OF SPEECH QUALITY
    8.2.1 Historical background
    8.2.2 Quality, elements and features
    8.2.3 From traditional test methods to new developments
8.3 SPEECH QUALITY VS. INTELLIGIBILITY
    8.3.1 General considerations
    8.3.2 Quality and intelligibility for packet loss degradations in VoIP
8.4 ASSESSING SPEECH QUALITY WITH TERMINAL DEVICES
8.5 SPEECH QUALITY DIMENSIONS
    8.5.1 Listening tests
        8.5.1.1 Test methods
        8.5.1.2 Insights
    8.5.2 New trends: Conversation tests
8.6 SPEECH QUALITY AND DELAY
8.7 MULTIPARTY QUALITY TESTS
    8.7.1 Standardised method for telemeeting assessment
    8.7.2 Spatial audio meeting assessment
    8.7.3 Assessment of asymmetric conditions
8.8 SUMMARY AND OUTLOOK

Chapter 9 Hearing Aids
Lars Bramsløw and Søren Vase Legarth

9.1 INTRODUCTION
9.2 HEARING AID FUNCTION
9.3 TECHNOLOGY AND HISTORY
9.4 FACTORS AFFECTING SOUND QUALITY
9.5 HEARING IMPAIRED TEST PANELS
    9.5.1 Hearing loss
    9.5.2 Experts versus consumers
    9.5.3 Maintaining test panels
9.6 LISTENING SCENARIOS
9.7 OUTCOME MEASURES
    9.7.1 Speech intelligibility
    9.7.2 Sound quality
    9.7.3 New types of outcome measurements
9.8 HEARING AID ALGORITHM ASSESSMENT
9.9 ASSESSMENT OF A NOISE REDUCTION ALGORITHM
    9.9.1 Test method and design
    9.9.2 Summary
9.10 QUALIFYING A HEARING IMPAIRED PANEL
    9.10.1 Audiometric screening tests
    9.10.2 Sensory screening tests
    9.10.3 Brand preference test
    9.10.4 Results
    9.10.5 Summary
9.11 THE MULTIPLE STIMULUS IDEAL PROFILE METHOD FOR OPTIMISATION OF HEARING AID SOUND QUALITY
    9.11.1 Selection of samples and recordings of systems
    9.11.2 Identifying attributes for the test
    9.11.3 Data collection
    9.11.4 Results
    9.11.5 Summary
9.12 TESTING SMALL DIFFERENCES USING PAIRED COMPARISONS
    9.12.1 Perceptual effects of delay
    9.12.2 Test setup, listeners, hearing aids and signals
    9.12.3 Paired comparison of delay and high-pass cut-off
    9.12.4 Results
    9.12.5 Discussion
    9.12.6 Summary
9.13 CONCLUSION

Chapter 10 Car Audio
Patrick Hegarty and Neo Kaplanis

10.1 DESCRIPTIVE SENSORY ANALYSIS OF REPRODUCED SOUND IN CARS
    10.1.1 Case study - A listening test system for car audio
    10.1.2 Methods
        10.1.2.1 The test car and associated equipment
        10.1.2.2 Assessors selection
        10.1.2.3 Independent variables
        10.1.2.4 Dependent variables
        10.1.2.5 Experimental design
    10.1.3 Data analysis
    10.1.4 Results
    10.1.5 Conclusion
10.2 FLASH PROFILE FOR AUTOMOTIVE AUDIO ASSESSMENT
    10.2.1 Flash profile - A brief
    10.2.2 Why does flash profile suit automotive audio assessment?
        10.2.2.1 Flash profile and the evaluation of acoustical fields
    10.2.3 Feasibility study
    10.2.4 Case study: The influence of a car’s interior on the perceived sound quality
        10.2.4.1 Background
    10.2.5 Methods
    10.2.6 Materials and apparatus
        10.2.6.1 Acquisition of sound fields
        10.2.6.2 Presentation of sound fields
    10.2.7 Assessors
    10.2.8 Experimental procedure
    10.2.9 Data analysis
        10.2.9.1 Ordination using multiple factor analysis
        10.2.9.2 Influence of programme and acoustical conditions
        10.2.9.3 Generalising results
        10.2.9.4 Clustering of individually elicited attributes
        10.2.9.5 Clustering validation
        10.2.9.6 Constructing the sensory profile
    10.2.10 Results
        10.2.10.1 The effect of equalisation - Validation
        10.2.10.2 Ceiling effect
        10.2.10.3 Absorption effect
        10.2.10.4 Front side window effect
        10.2.10.5 Windscreen effect
    10.2.11 Discussion
    10.2.12 Considerations of using flash profile in audio evaluation
        10.2.12.1 Future work

Chapter 11 Binaural Spatial Reproduction
Brian F.G. Katz and Rozenn Nicol

11.1 INTRODUCTION
11.2 SPATIAL HEARING: BINAURAL VERSUS LOUDSPEAKERS
11.3 PERCEPTUAL DIMENSIONS
    11.3.1 Multidimensional perception
    11.3.2 Quality metrics
11.4 ASSESSMENT METHODS
11.5 CASE STUDIES
    11.5.1 Source localisation
        11.5.1.1 Egocentric vs. exocentric pointing
        11.5.1.2 Egocentric pointing variations
        11.5.1.3 Spatial coordinates and front/back confusions
        11.5.1.4 Lateralisation reporting
        11.5.1.5 Paired comparison of compression degradations
    11.5.2 Scene description
        11.5.2.1 Scene drawing
        11.5.2.2 Scene reconstruction
        11.5.2.3 Environment reconstruction
        11.5.2.4 Scene stability
    11.5.3 Overall quality
        11.5.3.1 Subjective global quality ranking
        11.5.3.2 Repeatability of subjective rankings
        11.5.3.3 Preference judgement
    11.5.4 Perceptual attributes
        11.5.4.1 Multiple comparisons with reference and anchor
        11.5.4.2 Multi-dimensional scaling & HRTF differences
        11.5.4.3 Verbal elicitation & HRTF differences
11.6 CONCLUSION

Chapter 12 Concert Hall Acoustics
Antti Kuusinen and Tapio Lokki

12.1 INTRODUCTION
12.2 CHARACTERISTICS OF SOUND FIELD IN CONCERT HALLS
12.3 PERCEPTION OF CONCERT HALL ACOUSTICS - A BRIEF HISTORY
12.4 DESCRIPTIVE PROFILING OF CONCERT HALL ACOUSTICS WITH VIRTUAL ACOUSTICS
    12.4.1 Auralising concert hall acoustics with the loudspeaker orchestra
    12.4.2 Selection of stimuli: Concert halls and receiver positions
    12.4.3 Three studies of individual vocabulary profiling
    12.4.4 Analysing IVP data
    12.4.5 Summary of main findings
12.5 FUTURE DIRECTIONS

Chapter 13 Emotions, Associations, and Sound
Jesper Ramsgaard

13.1 WHAT ARE EMOTIONS?
    13.1.1 Categorical models of emotions
    13.1.2 Dimensional models of emotions
    13.1.3 Direct assessment and self-report of emotions
    13.1.4 Physiological measurements of emotions
    13.1.5 Neuroimaging techniques
13.2 EMOTIONAL RESPONSE TO SOUND AND MUSIC
    13.2.1 Basic emotions and music
    13.2.2 Music emotion schemas
    13.2.3 Dimensional models of emotions and sound/music
13.3 ASSOCIATIONS AND MEANING
    13.3.1 Semantic distance - Sentences and music
    13.3.2 Free associations
13.4 CONCLUSION

Chapter 14 Audiovisual Interaction
Dominik Strohmeier and Satu Jumisko-Pyykkö

14.1 INTRODUCTION
14.2 MULTIMODAL PERCEPTION OF QUALITY
14.3 OPEN PROFILING OF QUALITY (OPQ): A STRUCTURED, MIXED METHODS APPROACH TO STUDY AUDIOVISUAL QUALITY
14.4 STUDYING AUDIOVISUAL QUALITY FOR MOBILE 3D VIDEO IN LABORATORY AND FIELD CONDITIONS
    14.4.1 Introduction
    14.4.2 Research method
        14.4.2.1 Test participants
        14.4.2.2 Test setup
        14.4.2.3 Test procedure
        14.4.2.4 Method of analysis
    14.4.3 Results
        14.4.3.1 Preference evaluation
        14.4.3.2 Sensory evaluation and external preference mapping
        14.4.3.3 Comparison of results for laboratory and the actual context of use
    14.4.4 Discussion and conclusion
14.5 PERCEPTUAL DIFFERENCES BETWEEN ONLY-GOOD AND ONLY-BAD QUALITY STIMULI
    14.5.1 Theories of liking and disliking
    14.5.2 Research method
        14.5.2.1 Test participants
        14.5.2.2 Test setup
        14.5.2.3 Test procedure
        14.5.2.4 Method of analysis
    14.5.3 Results
        14.5.3.1 Items of only-bad experienced quality
        14.5.3.2 Items of only-good experienced quality
        14.5.3.3 Interviews
    14.5.4 Discussion and conclusion
14.6 EXPERIENCED QUALITY OF AUDIOVISUAL DEPTH: PROFILING AUDIOVISUAL QUALITY PERCEPTION ACROSS ASSESSORS
    14.6.1 Research method
        14.6.1.1 Test participants
        14.6.1.2 Test setup
    14.6.2 Test procedure
    14.6.3 Sensory evaluation results
    14.6.4 Interpretation of perceptual spaces between assessors
    14.6.5 Discussion and conclusion
14.7 SUMMARY AND CONCLUSIONS

Section IV Annexes

Appendix A Description of Data Sets
A.1 HEADPHONE DATA SETS
    A.1.1 Recorded stimuli data sets
A.2 CONCERT HALL DATA SETS
    A.2.1 Data set summary
A.3 SONIC BRANDING DATA SET
A.4 TRAM NOISE DATA SET

Bibliography

Index

Foreword

There has been a general consensus in the audio industry, for as long as I can remember, that technical measurements alone are insufficient to determine whether or not an audio system will sound good. This raises the important question of what we mean by “good”. In the minds of many the only requirement of a good audio system is fidelity, in other words what you get out is as close as possible to what you put in. That assumes that the primary function of an audio device is to capture, store, transmit, or reproduce an original sound signal as faithfully as possible. Such is indeed the primary function of a lot of audio devices, such as microphones, amplifiers, loudspeakers and recording devices, and it could be said that if they have ruler-flat frequency response across whatever we consider to be the audio band, and no distortion or noise of any kind, then they would fulfil the requirement of fidelity. Usually this is not the case, but it is becoming more likely as the years go by in the best equipment. At the other end of the quality spectrum the fidelity of low cost devices could be said to be getting worse.

We still need to use our ears and our brains to judge whether a system meets the aim of fidelity, because we still do not know exactly how people react to different forms of distortion and noise. (Confusingly, it seems that some people turn out to like various forms of distortion, otherwise how can we explain a liking for vinyl records or MP3-encoded songs?) The idea of fidelity, however, assumes a model of sound quality where there is some ultimate reference against which to compare one’s judgement of sound. In the traditional world of recording and reproduction of acoustic signals we usually assume that either the listener has some inherent, learned knowledge about what is “correct”, or we play them something that we say is correct, against which they will compare various sounds.
Interestingly this idea of sensory evaluation does not really apply in some other domains such as food and drink. It could reasonably be said that there is no reference butter sample against which all others should be compared, for example, and therefore there is no realistic concept of fidelity in the food and drink world. The challenge, then, when it comes to sound evaluation is to decide for what purpose the evaluation is being done. Are we trying to discover whether an audio device is “transparent” to audio signals in terms of technical fidelity, or are we trying to decide whether we like it or not? Perhaps we are trying to describe how something sounds in a variety of ways that are meaningful, or perhaps we are trying to find out how it affects our emotions.

If one is trying to evaluate a synthesiser, a sound logo or a virtual reality game, for example, then what is the reference against which one might compare the sounds that it makes? The concept of fidelity largely goes out of the window in such cases, because one is dealing with artificial sounds that never existed in natural life, and we might be more concerned with whether the sound matches some intended application, context or other sensory mode (such as vision or touch). It may also be that one wants to find out what sounds people like, or what associations they have, rather than whether they are “correct” in some way, and then the question arises about whose opinion should be taken into account and for what purpose.

While expert sound assessors may have been trained to believe that certain aspects of sound are important for them to rate it highly, the same does not always seem to apply to untrained listeners. My former research group found that this was the case for some spatial aspects of sound reproduction, for example, with trained listeners thinking that stereo imaging of sound source locations was very important, while naïve listeners hardly seemed to notice this aspect of a system’s performance when judging whether it sounded good or not. In this case training or experience had led some people both to be able to hear a particular feature and also to believe it mattered. Perhaps, then, it really does matter who you ask, when, why and in what context.

Nick Zacharov as editor has pulled together a remarkable book, with numerous expert authors from a diverse range of backgrounds, to explore some of these questions. The reader will find answers as to how one might go about doing sensory evaluation of sound in a scientific manner. One of the things I like about the way it has been done is its open-mindedness to techniques gleaned from a wide range of other fields. There is a bravery in that. Classical psychophysics does not have the answers to all aspects of sound evaluation, particularly when the sounds concerned are complex and exist in a variety of ecological contexts.

I still maintain that there is a sort of uncertainty principle operating in this field - the more you tie down the experimental variables and control the stimuli the more precisely you may know something that you find, but what you know will not necessarily be very rich or meaningful. Allow a greater freedom or ecological realism in the stimuli and experimental context and you may be less certain or specific about what you find, or perhaps the results will be less generalisable, but you may know something richer and more relevant to the human experience. The approach you take rather depends on whether you are trying to find out something very precise about low-level perceptual or cognitive processes, or whether you are trying to find out something at a higher level of cognitive relevance.

Similarly, I like to think it has something to do with which end of the telescope you are looking down. Looking through one end you are using sound stimuli to find out something about the way that the hearing process works (perhaps this is a description of classical psychophysics), while looking through the other you are using the human listener to find out something about sounds (you could call that sensory evaluation of sound).

Francis Rumsey
Chair of the AES Technical Council; Consultant Technical Writer and Editor of the Journal of the Audio Engineering Society

I started working in sensory sciences in the late 1970s, and since then I have spent the whole of my career in either food based research and consultancy, or working in the fragrance & flavour industry. As a new graduate, ‘sensory’ was hardly a recognised discipline and career choice at that time, and it certainly did not have the global reach it has today. Most of the research was conducted at an academic level, and there was only limited practical application in the food industry in product development. However, even in the early days we knew that a consumer’s response to products and brands must be multi-sensorial, and we recognised that all our senses interact when we experience products. We were taught that sensory evaluation is “a scientific method used to measure and interpret the responses to products as perceived by the five senses of sight, smell, sound, taste and touch”, and yet, surprisingly, we sensory scientists have tended to study and work with these senses relatively independently. Or at least, Sound has perhaps had less attention from a research perspective than the other senses.

The Pangborn Sensory Symposium is the premier global conference attended by over 1000 sensory & consumer scientists worldwide where advances in the field of sensory science are presented and discussed every two years. At the 2017 conference, at which several hundred researchers and practitioners showcased their current work to their peers, perhaps only a handful of these papers were focussed to any extent on the sensory evaluation of sound. Of course I have generalised and there are also notable exceptions. The Crossmodal Laboratory at Oxford University led by Charles Spence, for example, has published widely on the importance that the sounds of consumption have on our perception and enjoyment of food, and he comments that “sound is a forgotten flavour sense”.
We know as well that certain chefs have been pioneering multi-sensorial appreciation of food, most notably Heston Blumenthal in the UK and elsewhere. On the fragrance side, for example, we could also consider what cues we take from the overall product experience in judging that fragrance. Would the sound of a soft, gentle flow from the nozzle of an aerosol can as the button is depressed help to reinforce the concept of a caring, feminine fragrance, or a strong “woosh” convey better the concept of a powerful, impactful and performing masculine fragrance? Would the same masculine fragrance presented in a soft and gentle flow have the same impact? What cues are we taking from the sound and transferring into our perception and belief of the impact of that product?

Implicit methods of testing with consumers, rather than explicit methods that have been traditionally used over the years, are increasingly being explored to help us to get closer to understanding and measuring emotions and the importance to brand connection. This is emerging as a key focus for research. The availability of new, practical ‘virtual reality’ technology that can help us to induce context in product testing, and hopefully improve the relevance of our data, is partly fuelling the interest that we must test in ways that are more realistic to the real situation. Understanding the role of context in product testing is an important element of the future research in this area. Context is multi-sensorial, and one cannot induce realistic context without also considering sound. What I learn from this book about sound quality and individual perception of sound helps me to understand that the contribution of sound to induced context in product testing in the food and other industries is not so easily satisfied by, for example, a simple playback of a recording. The publication of this new book on Sensory Evaluation of Sound is therefore very timely.
Nick mentions that techniques for the sensory evaluation of sound have been applied sporadically for many years, and that in the last 20 years there has been an acceleration in this research. In the field of Sensory Evaluation we have some iconic reference books and texts to which those in the field would constantly refer. If I look back through the most often cited, there is very little mentioned specifically about the measurement of sound. I myself have a handful of well-thumbed books written by masters in sensory science and in those I can often find something new to stimulate my thoughts or to give me reassurance that I have got my facts and assumptions correct. This excellent book is a welcome addition to the bookshelf of any sensory scientist, and I confidently predict that it will in itself become an icon over time. To my knowledge, no similar book exists, and if it does I have yet to find it.

Apart from his own undoubted experience as a leader in the sensory evaluation of sound, the Editor has gathered together an impressive group of experts to contribute their knowledge and passion for this subject of Sound, and as a result they have produced a highly readable and informative book that creates a new milestone for the discipline. With these few comments, I let the book speak for itself.

David H Lyon
Director, Global Sensory Perfumery & Flavors Firmenich UK Ltd

Preface

Sensory evaluation techniques have been sporadically applied to sound for many decades. In the last ∼20 years, there has been an acceleration in the use of sensory evaluation methods applied to sound. The cause for this increase in interest is not known, but we can observe some interesting trends. We have seen a major shift in telecommunication from landline telephone usage at work and at home, to mass mobile telecommunication within the last 3 decades. In 2005 it is estimated that over 800 million mobile phones were delivered and a decade later approximately 1.8 billion smartphones were produced. With this shift from landline to mobile phone to smartphone there has also been significant development in the device audio functionality and end-user expectation. Whilst in the early 90s we were happy to use a phone in our living rooms and offices, by the mid 00s millions of people were taking calls whilst on the move. This development has affected the world as a whole with over a third of the global adult population having internet access and over 30% of adults in emerging and developing economies reporting smartphone usage. Furthermore, our smartphones allow us to function effectively anywhere and also on the move, whether participating in a multiparty teleconference, attending online courses, enjoying the latest streamed movie/music releases or playing multiplayer AR/VR games.

As a result, there has been a rapid and significant growth in the headphone and headset business. In 2005 about 66 million headphone units were sold and by 2017 sales were estimated to grow to ∼300 million units. Furthermore, headphone technology has evolved to keep pace with this usage, including wireless functionality, active noise control technology, etc. The latest trend in so-called hearables is already bringing more content and information to our ears as well as narrowing the gap between technologies (headphones, smartphones, hearing aids, etc.).
The evolution of technology and its usage shows no sign of slowing. Presently the trend is towards advanced sound systems bringing next generation home theatre experiences into the home or developing virtual, augmented and mixed reality technologies both for consumer and professional applications.

Additionally, with our continuously increasing urbanisation, sound and noise are forever present in many places. Noise is a less desirable by-product from industry, personal and public transport and other sources, e.g. machines, neighbours, etc. Noise is a familiar part of the urban landscape and noise reduction technologies are becoming more prevalent in headphones, mobile phones, etc. There is also an increase in hearing loss and growing usage of hearing aids by people with hearing loss.

As we can see, there is a wide range of sound sources and audio devices in our world, with new applications and needs evolving continuously. Associated with each of these technologies, there is a continuous evolution in performance, whether from narrowband to fullband telephony or the inclusion of active noise control technology in headphones. At every stage of these technology developments and usage evolutions there have been, and will be, questions about the desired and required sound quality including:

• Do consumers like the sound of my product?
• Is the sound quality good enough?

• How competitive is my product and how can I further improve it?
• What technology should I select for the best sound quality?
• How can I demonstrate that my product performs best in its category?
• What needs to be changed to optimise my algorithm for better sound quality?
• Why do consumers appear to have different preference opinions?
• Why do consumers prefer the competitor’s products?
• What are the salient characteristics of my product that I should emphasise or avoid?
• How is the sound quality of my product characterised?
• How can I predict or model how consumers will evaluate my new product?

The sensory evaluation toolset can assist you to answer these questions.

The book Perceptual Audio Evaluation (Bech and Zacharov, 2006) provided the fundamental principles of perceptual audio evaluation with a focus on global measures and univariate analysis of audio quality. It gave a firm grounding regarding fundamental techniques and good practices, used primarily with a single response variable (e.g. Basic Audio Quality (BAQ), etc.). This book builds upon the earlier work with an emphasis on novel, advanced and multivariate techniques applied to sound. The techniques presented here will allow you to answer many of the above questions, through study of global measures of quality (Preference, BAQ, etc.) as well as more in-depth interpretation through characterisation using sound quality attributes (e.g. clarity, punch, scene depth, etc.), with associated experimental and analysis techniques.

The generic methods of sensory evaluation are well reported in textbooks and journals, with predominant applications to food, beverage and other consumer products, many of which are referenced throughout this text. However, an article presented by Goldman (2015) at the Pangborn Sensory Science Symposium pondered why sensory science has not been adopted by other fields beyond food sciences and consumer products.
I hope that through this book, we can share our knowledge with both the sensory science and audio community, providing best practices of sensory evaluation of sound with many application examples. In order to achieve this, I have collected together a group of acknowledged experts with extensive experience in the applied sensory evaluation of sound. Their expertise spans from sensory evaluation techniques, statistical analysis through to domain-specific engineering and research applications within sound and audio. Many of the contributors have learnt sensory evaluation by doing as part of research or engineering activities, as sensory evaluation of sound is not a widely taught topic. From our joint experience we have collected together a very broad set of methods and practical examples of the beneficial application of sensory evaluation of sound. Our hope is to inspire you to apply and develop these techniques in your work leading to advances in sensory science, even better products and technical solutions. I believe it is fair to say that most of you, either as consumers or professionals, have experienced a product or service that has been improved using the sensory evaluation of sound, whether a mobile phone, hearing aid, electric car or concert hall.


READERSHIP

Who is this book written for? It is primarily intended for, but not limited to, the following three groups.

Firstly, scholars of sound and audio engineering. Whether you are an eager undergraduate, master’s student, Ph.D. candidate or postdoctoral researcher, there may be a time when you need to evaluate the sound quality characteristics of your research work (product, service, algorithm, etc.).

A second group is the professional engineering and research community who are constantly developing and refining their audio/sound products and technologies. For you, sensory evaluation may provide a beneficial additional toolbox to help you engineer your product to be more competitive or superior.

Thirdly, engineering managers, CTOs, marketeers, clinical managers, product/program and project managers may also have an interest in knowing the potential of sensory evaluation techniques. You may benefit from understanding the nature and characteristics of sound quality or from knowing how these techniques can be used to benchmark the sound quality performance of your and your competitors’ products for marketing claims substantiation.

And of course this book is also for anyone else who wants to take these methods and apply them in new fields, beyond the scope of what we are discussing here. After all, much of the inspiration leading to the content of this book has come from the application and adaptation of traditional sensory evaluation techniques to the field of sound and audio.

ORGANISATION OF THIS BOOK

Part I. Background. The first part of our book sets the stage by providing some background and basic information about sound and sensory evaluation.

Chapter 1: Introduction. This short introductory chapter will provide you with an overview of sensory evaluation in general with a short background on the techniques, their roots and applications.

Chapter 2: Why Sound Matters. This chapter provides the vital motivation for how and why sound is so valuable to us all and how it impacts our lives - both positively and negatively.

Chapter 3: Sound, Hearing and Perception. For those of you without a background in sound and audio, this chapter will provide you with a concise overview of the physics of sound, the hearing mechanism and the fundamentals of perception.

Part II. Theory and Practice. The second part will provide you with an overview of the family of sensory evaluation methods, associated statistical analysis techniques and practical guidelines on their usage.

Chapter 4: Sensory Evaluation in Practice. In this chapter we address many of the practical issues related to performing sensory evaluation studies. This ranges from descriptions of perceptual and affective measurement, attribute selection or development, through to best practices in panel development and assessor monitoring.

Chapter 5: Sensory Evaluation Methods for Sound. This chapter provides an overview of the core set of sensory evaluation methods, giving you the basics to understand the applications described in Part III.

Chapter 6: Applied Univariate Statistics. A thorough, detailed and highly accessible summary of univariate statistics is presented here, for cases when a single response variable/attribute is to be analysed.

Chapter 7: Applied Multivariate Statistics. This chapter provides a walk through the complex field of multivariate statistics in a most approachable manner, with concrete examples and associated analysis scripts.

Part III. Application. The third part of the book is practical in nature and provides a broad set of real-life application examples from a wide range of fields and domains. Combined with Parts I and II we hope to provide you with insight into the best practices in the sensory evaluation of sound.

Chapter 8: Telecommunication Applications. A thorough overview of the application of sensory evaluation in telecommunications, with an emphasis on standards practices. The chapter covers traditional techniques through to the latest trends in conversational tests.

Chapter 9: Hearing Aids. The chapter provides a history of audiology and hearing aids to the present day. A number of application examples are provided including discussion of hearing impaired panels, intelligibility tests and sound quality evaluation.

Chapter 10: Car Audio. This chapter focuses upon sound reproduction in the car, also using individual attribute-based techniques such as flash profiling.

Chapter 11: Binaural Spatial Reproduction. The authors walk you through the fundamentals of binaural spatial sound reproduction and the latest approach to the assessment of these technologies.

Chapter 12: Concert Hall Acoustics. Concert hall acoustics is one of the first applications of sensory evaluation of sound. Following a brief introduction to the domain, the authors provide detailed examples of their work to successfully quantify and qualify concert hall acoustics with specially developed sensory evaluation techniques suited to this field.

Chapter 13: Emotions, Associations and Sound. The chapter provides a different perspective on the use of sound in application to audio branding, where matters of emotion and other associations are of importance.

Chapter 14: Audio Visual Interaction. The last chapter in this section provides a view of audio visual interaction and introduces the Open Profiling of Quality method.

Annexes

Our annexes provide useful additional information for your work and research.

Annex A: Description of Datasets. This annex provides a detailed description of data sets available with this book and used in Chapters 6, 7 and 12. Datasets are available from https://www.crcpress.com/Sensory-Evaluation-of-Sound/Zacharov/p/book/9781498751360.

The following organisations and associated trademarks are mentioned in this book:

• Bluetooth is a registered trademark of the Bluetooth SIG.
• Google and Hangouts are registered trademarks of Google LLC.
• © 2018 The MathWorks, Inc. MATLAB and Simulink are registered trademarks of The MathWorks, Inc. See mathworks.com/trademarks for a list of additional trademarks. Other product or brand names may be trademarks or registered trademarks of their respective holders.

• Microsoft, Microsoft Excel and Skype are registered trademarks of Microsoft. Sensory Evaluation of Sound is an independent publication and is neither affiliated with, nor authorized, sponsored, or approved by, Microsoft Corporation.
• Python is a trademark or registered trademark of the Python Software Foundation.
• QDA is a registered trademark of Curion LLC.
• XLSTAT is a registered trademark of Addinsoft SARL.

Acknowledgements

I would like to thank all the contributors of this book who took the time and made a big effort. I really think it was worth all the hard work. Thomas Bech Hansen, Peter Hofmann Holsøe and Søren Vase Legarth of FORCE Technology are thanked for their support in all shapes and forms during this long process. From Taylor & Francis, I would like to thank Nora’s team (Kyra, Laurie, Vanessa, Michele, Shashi and Karen) for their great effort, technical support and flexibility. The whole of the SenseLab team is thanked for their support and patience. Brian Katz gets a grateful mention for all the BibTeX support and genius. Peder Hjulmand Søby - thank you for creating such a lovely cover - it’s spot on. I am also grateful to Anders Toftgaard of The Royal Library, Department of Manuscripts and Rare Books, Copenhagen, Denmark, for his authoritative historical knowledge of early medical research. The Oticon Foundation is thanked for providing financial support towards the preparation and editing of this text. And to my family - thanks for the patience, love and support.

Nick Zacharov
Iisalmi, Finland.


Nomenclature and Abbreviations

NOMENCLATURE

Acoustic metrics: Numbers assigned to physical characteristics of the sound by measurements and/or calculations. For example, sound pressure level, frequency-weighted sound pressure level.

Affective (or hedonic) measurement: Subjective measurements of preference, liking, annoyance or of connotative attributes (ideally made with individuals from the relevant consumer segment or target group). Often associated with personal opinion in the form of a subjective assessment.

Assessor, subject, listener, judge, participant: A person participating in a sensory evaluation or listening test.

Attribute: A property that can be perceived (perceptual, affective or connotative).

Attribute name, descriptor or descriptive term: A word or phrase that describes, identifies or labels an attribute or a characteristic. A clear distinction between the attribute and its descriptor is not always made.

Condition, test item, sample, processed sample: An original sound sample, processed using the technology under study.

Component, principal component, dimension: Label associated with the axes resulting from a multivariate analysis.

Consumer, naïve assessor: A user of the technology under study who does not fulfill any particular performance criterion.

Expert, expert assessor, experienced assessor: Assessor with a degree of expertise either related to a listening test (discrimination skills, reliability, etc.) and/or audio domain expertise.

Lexicon: Lexicons are standardised vocabularies that objectively describe the sensory properties of consumer products.

Measurement: Assigning numbers to objects in a relational way, e.g. by comparison with a standardised quantity of the same dimension (a unit). Specific instruments and/or a panel of expert listeners are needed.

Metric: A measure of a physical or perceptual property.

Panel, assessor panel, jury: A collection of assessors used in sensory evaluation studies.

Perceptual measurement: An objective quantification of the sensory strength of individual sensory descriptors of a perceived stimulus. Perceptual tests are measurements where humans (e.g. expert assessors) are used as “measuring instruments”.

Profile: A set of parameter values (e.g. attributes, sensory descriptors and/or metrics) that describes the (character of the) sound.

Rating, score, response: An assessor’s evaluation using the response variable and scale provided for an experiment.

Response variable, dependent variable: The combination of a scale and attribute the assessor employs to evaluate the technology under study.

Sample, item, programme material, footage: An original, unprocessed sound sample.

Stimuli: A stimulus may be anything that evokes a response from an assessor when it is presented to them. Such stimuli may stimulate one or many of the senses, e.g. hearing, vision, touch, olfaction or taste. Testing of sound reproduction systems is special in the sense that these devices need sound samples for the reproduction. This means that the stimulus is the combination of the sound system and the sound sample.

Trial, screen, block: A condition or collection of conditions to be evaluated and/or compared to each other. Typically presented sequentially using a graphical user interface (GUI).

GLOSSARY

2-AFC: Two-alternative forced choice
3-AFC: Three-alternative forced choice
A/D: Analogue-to-digital converter
AAC: Advanced audio coding
ACR: Absolute category rating
ADAM: Audio descriptive analysis and mapping
ADC: Analogue-to-digital converter
AES: Audio Engineering Society
AFB: Anti-feedback
AHC: Agglomerative hierarchical clustering
AIC: Akaike information criterion
ANCOVA: Analysis of covariance
ANOVA: Analysis of variance
ANS: Autonomic nervous system

ANSI: American National Standards Institute
APA: American Psychological Association
AR: Augmented reality
ASA: Acoustical Society of America
ASTM: American Society for Testing and Materials
ASW: Apparent source width
BAQ: Basic audio quality
BCa: Bias-corrected and accelerated
BEM: Boundary element method
BIBD: Balanced-incomplete-block design
BIC: Bayesian information criterion
BLE: Bluetooth low energy
BTE: Behind the ear
BTL: Bradley-Terry-Luce model
CA: Correspondence analysis
CAIC: Consistent Akaike information criterion
CATA: Check-all-that-apply
CB: Critical bandwidth
CCITT: International Telegraph and Telephone Consultative Committee
CCR: Comparison category rating
CDMA: Code-division multiple access
CE: Confidence ellipse
CETVSQ: Continuous evaluation of time varying speech quality
CI: Confidence interval
CIC: Complete in the canal
CIPIC: Center for image processing and integrated computing
CLM: Cumulative link models
CLT: Central limit theorem
CMOS: Comparison mean opinion score
CNS: Central nervous system

CQS: Continuous quality scale
CRT: Cathode ray tube
CSSA: Conversational surface structure analysis
CT: Conversation test
CV: Consensus vocabulary
CVA: Canonical variate analysis
CVC: Consonant-vowel-consonant
CVP: Consensus vocabulary profile
D/A: Digital-to-analogue converter
DA: Descriptive analysis
DCR: Degradation category rating
DECT: Digital enhanced cordless telecommunications
DF: Degrees of freedom
DF: Diffuse field
DirAC: Directional audio coding
DL: Difference limen
DMOS: Degradation mean opinion score
DoE: Design of experiments
DRP: Eardrum reference point
DTX: Discontinuous transmission
EBU: European Broadcasting Union
EC: Echo canceller
EEG: Electroencephalography
EM: Expectation maximisation
EMS: Expected mean square
ERB: Equivalent rectangular bandwidth
ERP: Ear reference point
ETSI: European Telecommunications Standards Institute
FA: Factor analysis
FB: Fullband
FCP: Free-choice profiling

fMRI: Functional magnetic resonance imaging
FP: Flash profile
GAL: Graphical assessment language
GEMS: Geneva emotional music scale
GLM: Generalised linear model
GLS: Generalised listeners selection
GPA: Generalized Procrustes analysis
GPL: General public license
GSM: Global system for mobile communication
GUI: Graphical user interface
HATS: Head and torso simulator
HCPC: Hierarchical clustering on principal components
HHS: Hybrid hedonic scale
HI: Hearing impaired
HINT: Hearing in noise test
HMFA: Hierarchical multiple factor analysis
HOA: Higher order Ambisonics
HRIR: Head-related impulse responses
HRTF: Head-related transfer function
HSD: (Tukey’s) honestly significant difference
IACC: Interaural cross-correlation
IAPS: International affective picture system
IG: Insertion gain
ILD: Interaural level difference
INDSCAL: Individual differences scaling
IOI-HA: International outcome inventory for hearing aids
IP: Internet protocol
IPM: Ideal profile method
IRS: Intermediate reference system
IRT: Item response theory
iSCT: Interactive short conversation test

ISO: International Organization for Standardization
ISTS: International speech test signal
ITC: In the canal
ITD: Interaural time difference
ITU-R: International Telecommunication Union, Radiocommunication Sector
ITU-T: International Telecommunication Union, Telecommunication Standardization Sector
ITU: International Telecommunication Union
IV: Individual vocabulary
IVP: Individual vocabulary profile
IVPPPJ: Individual vocabulary profile with preferences by pairwise judgements
JAR: Just about right
JND: Just noticeable difference
LAM: Labeled affective magnitude scale
LFE: Low frequency effects
LHS: Labeled hedonic scale
LMS: Labeled magnitude scale
LOT: Listening only test
LSD: (Fisher’s) least significant difference
LSDI: Large screen digital imagery
LSO: Loudspeaker orchestra
MAF: Minimum audible field
MAP: Minimum audible pressure
MAPD: Minimum audible pressure at the eardrum
MCA: Multiple correspondence analysis
MCL: Most comfortable level
MDS: Multi-dimensional scaling
MDU: Multidimensional unfolding
MFA: Multiple factor analysis
MIDI: Musical instrument digital interface
mIRS: Modified intermediate reference system
MNRU: Modulated noise reference unit

MOA: Method of adjustment
MoCa: Montreal cognitive assessment
MOS-CQE: MOS estimated conversational quality
MOS-CQO: MOS objective conversational quality
MOS-CQS: MOS subjective conversational quality
MOS-LQE: MOS estimated listening quality
MOS-LQO: MOS objective listening quality
MOS-LQS: MOS subjective listening quality
MOS: Mean opinion score
MOSle: Listening effort scale
MOSlp: Loudness preference scale
MOSc: Arithmetic mean of any collection of conversation opinion scores
MPEG: Moving Picture Experts Group
MRI: Magnetic resonance imaging
MS-IPM: Multiple stimulus - ideal profile method
MUSHRA: Multi stimulus test with hidden reference and anchor
n-AFC: n-alternative forced choice
NAL-RP: National Acoustic Laboratories - revised, profound
NB: Narrowband
NDA: Non-disclosure agreement
NH: Normal hearing
ODG: Objective difference grade
OLSA: Oldenburger Satztest
OPQ: Open profiling of quality
PARAFAC: Parallel factor analysis
PC: Paired comparison
PCA: Principal component analysis
PCR: Principal component regression
PET: Positron emission tomography
PL: Packet loss
PLC: Packet loss concealment

PLS: Partial least squares
PLS-R: Partial least squares regression
PNS: Peripheral nervous system
POTS: Plain old telephony service
PR: Paired rating
PrEMo: Product emotion measurement instrument
PSA: Perceptual structure analysis
PSTN: Public switched telephone network
QDA®: Quantitative descriptive analysis
QoE: Quality of experience
QoS: Quality of service
RAU: Rationalized arcsine units
RGT: Repertory grid technique
RIC: Receiver in concha
RIE: Receiver in ear
RITE: Receiver in the ear
RMS: Root mean square
RSM: Response surface method
RT: Reverberation time
SAM: Self-assessment manikin
SAQI: Spatial audio quality inventory
SAS: Sequential agglomerative sorting
SCT: Short conversation test
SCAS: Swedish core affect scales
SD: Semantic differentials
SDG: Subjective difference grade
SDM: Spatial decomposition method
SDSCE: Simultaneous double stimulus for continuous evaluation
SDT: Signal detection theory
SE: Standard error
SI: Speech intelligibility

SIRR: Spatial impulse response rendering
SNR: Signal-to-noise ratio
SPL: Sound pressure level
SPLRT: Speech packet loss reception threshold
SPPR: Similarity picking with permutation of references
SRT: Speech reception threshold
SS: Single stimulus
SS: Sum of squares
SSCQE: Single stimulus continuous quality evaluation
SSQ: Speech, spatial and qualities of hearing scale questionnaire
SSR: Single stimulus rating
STD: Standard deviation
STQ: Speech and multimedia transmission quality
SWB: Super wideband
TAFC: Two-alternative forced choice
TCATA: Temporal check-all-that-apply
TCU: Turn-construction units
TDS: Temporal dominance of sensations
TI: Time intensity
TRP: Transition-relevance place
UI: User interface
USB: Universal serial bus
UFP: Ultra flash profiling
VAD: Voice activity detection
VQEG: Video quality experts group
VIR: Vehicle impulse response
VoIP: Voice over internet protocol
VPA: Verbal protocol analysis
VR: Virtual reality
WB: Wideband

Contributors

Editor:
Dr Nick Zacharov, FORCE Technology, SenseLab, Hørsholm, Denmark

Ms Federica Belmonte, Danish Technical University, Lyngby, Denmark
Dr Lars Bramsløw, Eriksholm Research Centre, Snekkersten, Denmark
Prof. Per Bruun Brockhoff, Danish Technical University, Lyngby, Denmark
Mr Patrick Hegarty, Dynaudio, Skanderborg, Denmark
Dr Satu Jumisko-Pyykkö, Vincit, Tampere, Finland
Dr Neofytos Kaplanis, Bang & Olufsen, Struer, Denmark
Dr Brian F.G. Katz, Sorbonne Universités, UPMC Univ. Paris 06, CNRS, Institut d’Alembert, Paris, France
Dr Antti Kuusinen, Aalto University, Espoo, Finland
Prof. Sébastien Lê, AgroCampus Ouest, Rennes, France
Mr Søren Vase Legarth, FORCE Technology, SenseLab, Hørsholm, Denmark
Prof. Tapio Lokki, Aalto University, Espoo, Finland
Dr Rozenn Nicol, Orange, Lannion, France
Mr Torben Holm Pedersen, FORCE Technology, SenseLab, Hørsholm, Denmark
Prof. Ville Pulkki, Aalto University, Espoo, Finland
Prof. Alexander Raake, Technical University of Ilmenau, Ilmenau, Germany
Mr Jesper Ramsgaard, Google, Mountain View, CA, USA
Dr Janto Skowronek, Technical University of Ilmenau, Ilmenau, Germany
Mr Michał Sołoducha, Technical University of Ilmenau, Ilmenau, Germany
Dr Dominik Strohmeier, Mozilla, Berlin, Germany
Mr Julian Treasure, The Sound Agency, Chertsey, UK
Dr Thierry Worch, Qi Statistics, West Malling, UK

I Background

CHAPTER 1

Introduction
Nick Zacharov
FORCE Technology, SenseLab, Hørsholm, Denmark

CONTENTS
1.1 What is sensory evaluation?
1.2 Why do we need sensory evaluation?
1.3 When to apply sensory evaluation?
1.4 What can we learn from sensory evaluation?
1.5 Concepts and terminology
    1.5.1 Percepts, attributes, and attribute names
    1.5.2 The relationship between attributes and dimensionality
    1.5.3 The number of response variables and dimensions
1.6 Before moving on

In this introduction I will describe the basic motivations and the guiding principles of sensory evaluation, applied to the domain of sound perception. We will also consider when and why to apply sensory evaluation, and some common terminology and concepts will be introduced to aid further reading.

Sensory evaluation has evolved rapidly from its origins in the food and consumer product industries since the 1950s, with the development of methods to evaluate consumer preference using hedonic scales. Today sensory science continues to be strongly developed in these fields for product development, production testing and validation, both in terms of sensory evaluation methodologies and statistical analysis techniques, also referred to as sensometrics. The fundamental techniques of sensory evaluation are extensively described in the works of Lawless and Heymann (2010); Meilgaard et al. (2006); Stone et al. (2012), with the most recent techniques discussed in Varela and Ares (2014) and associated statistical analysis techniques in Næs et al. (2010); Lê and Worch (2014).

The application of sensory evaluation to sound is deeply rooted in the history of psychoacoustics. Helmholtz (1863, 1875) was one of the early scientists to study the boundary between the physics of sound and the psychological acoustics of how a sound is perceived. His work commenced with a discussion “on the sensation of sound in general” and the “quality of tones”. Later, Wallace Clement Sabine studied the nature of sound in concert halls and what characteristics made them sound acceptable. Sabine researched the topic for many years and identified three key characteristics: loudness; distortion of complex sounds (interference and resonance); and confusion (reverberation, echo and extraneous sounds) (Sabine, 1900). By the early 1930s psychoacoustics was being further studied by Stevens and his collaborators with regard to the attributes of tones such as perceptual tone (i.e. pitch) and the perception of sound pressure (i.e. loudness), amongst other characteristics (Stevens, 1934, 1936; Stevens et al., 1937). The application of sensory evaluation to research the nature of sound and the way we perceive it has continued ever since, and the work of Bech and Zacharov (2006) provides a thorough audio-focused introduction to perceptual evaluation techniques.

Before going any further, it is valuable to clarify what we mean by sensory evaluation and when and why it should be applied.

1.1 WHAT IS SENSORY EVALUATION?

Sensory evaluation is the collective term applied to techniques that allow us to evaluate how we perceive the nature of stimuli. More specifically, Stone et al. (2012) define sensory evaluation as a scientific method used to evoke, measure, analyse and interpret those responses to products as perceived through the senses of sight, smell, touch, taste and hearing.

In practice, this means that an assessor is presented with a stimulus, e.g. a sound, which evokes a response in that assessor. We then measure their response, for example using a rating scale. Once we have collected the data from several assessors, we are then able to analyse the data, typically with some form of statistical analysis, to seek out any meaningful patterns in the data. Lastly, we interpret the analysed data, with the aim of answering our original research questions and hopefully leading to informative findings or conclusions.

Throughout this book we are going to be talking about sensory evaluation methods of various types. The concept of a sensory evaluation method is important and will be discussed in more detail in the introduction to Chapter 5. We are going to define a sensory evaluation method as a means to evoke a response from assessors, which can be measured such that the data can be (statistically) analysed and interpreted. In this way we have the four key steps of sensory evaluation captured as part of a method. The concept of a method should be equally applicable to sensory evaluation of sound, other senses, food and other products.

As the title of this book suggests, we have collected together a body of knowledge regarding the application of sensory evaluation methods to sound. However, many of the techniques presented have their roots in other domains of application. As a result, you may occasionally see reference to non-sound/audio applications, as at the most generic level, sensory evaluation methods are domain agnostic.
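To make the measure and analyse steps concrete, the following is a minimal sketch in Python. It assumes a simple case that is not taken from this book: five assessors rate two conditions on a 0-100 scale, and the analysis is limited to the mean and its standard error per condition. The condition names and all ratings are hypothetical, invented purely for illustration; the statistical techniques actually recommended are those presented in Chapters 6 and 7.

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical ratings (0-100 scale), one value per assessor.
# These numbers are invented for illustration only.
ratings = {
    "headphone_A": [72, 65, 80, 70, 68],
    "headphone_B": [55, 60, 48, 62, 50],
}


def summarise(scores):
    """Return the mean rating and the standard error of that mean."""
    n = len(scores)
    m = mean(scores)
    se = stdev(scores) / sqrt(n)  # sample standard deviation / sqrt(n)
    return m, se


# Analyse: one summary per condition.
summary = {cond: summarise(scores) for cond, scores in ratings.items()}

# Interpreting the numbers is left to the experimenter.
for cond, (m, se) in summary.items():
    print(f"{cond}: mean = {m:.1f}, standard error = {se:.2f}")
```

A summary of this kind is only a starting point. Whether a difference between conditions is meaningful depends on the experimental design and on a suitable statistical test.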

1.2 WHY DO WE NEED SENSORY EVALUATION?

Our perception of sound, as with all senses, is complex and dependent upon many contributing factors. For example, our perception of speech for the primary purpose of communication is very different from our perception of music for entertainment. Furthermore, our needs as humans change and adapt constantly depending on our situation and expectations. If we are having a telephone conversation with a friend, we are hopefully focused on the conversation, enabled through recognition of the person and ease of communication. By comparison, if we are working in an open-plan office, focusing upon completing a report, the role and importance of speech around us is different. In this context, the speech around us may be considered as a noisy disturbance to be minimised. Depending on what we are doing, our moods and a multitude of other factors, our expectations and needs may change, also in terms of sound quality and characteristics.

The perception of sound quality, or what sounds good, is very context dependent, as illustrated by our character in Figure 1.1. In this cartoon our character seems to be seeking peace and quiet. As this figure illustrates, there are many different sound sources in our everyday lives. The sound of the church bell may be characterised as sharp and loud, the train brakes as screechy, or the singing of a railway track as shrill. A live band may sound dynamic, bassy and punchy for a member of the audience, or else boomy, loud and annoying for a neighbour of the festival. In this sense, depending on the situation of the listener, listening to the live band may be considered either enjoyable/desirable or annoying/unpleasant. From these examples, you can hopefully see that not only can the nature of sound be very clearly described in a specific way using adjectives or attributes, but also that the suitability or desirability of sound and its associated quality is very context dependent.

1.3 WHEN TO APPLY SENSORY EVALUATION?

Sound quality can be evaluated in many ways, and it is beneficial to understand the different approaches and where each is applicable. This is a topic we will start to discuss here and will be expanded upon in Chapter 4 with the introduction of the filter model (see Figure 4.1).

Traditionally in science and engineering, we are able to measure the physical properties of sound using a microphone with the associated measurement equipment, e.g. an audio analyser, as discussed in more detail in Section 3.1. This is a very robust and repeatable way to quantify and qualify the physical characteristics of the sound waves and can provide a lot of valuable information, such as sound pressure level, frequency and phase response, impulse responses, measurements of distortion, etc. Using such instrumentation or physical measurement methods is core to sound and audio engineering for science, research and product development.

The hearing mechanism is beautiful and complex, allowing humans to perceive sound. Compared to a measurement microphone and its instrumentation, the ear and hearing mechanism capture and process sound waves in a specific manner, as introduced in Section 3.2. For example, the frequency response of a microphone can be designed to be flat as a function of frequency, to provide a neutral physical measurement of the sound pressure. However, the frequency response of the ear has frequency dependent characteristics, as illustrated later in Figure 3.1. Due to the differences between the physical measurements and the way humans hear sound, the physical measurements do not always tell us how a sound will be perceived. A common example is the difference between how sound pressure level, in decibels (dB), is physically measured and the way we hear and perceive this characteristic, commonly referred to as loudness.
Loudness is the attribute of auditory sensation in terms of which sounds can be ordered on a scale extending from quiet to loud. Loudness can be evaluated either using assessors or by using so-called psychoacoustic metrics in conjunction with physical measurement of the sound field. The concept of loudness and other psychoacoustic metrics will be presented in Section 3.3, providing insight into the nature of sound perception. Such psychoacoustic metrics are useful to gain a better understanding of how we perceive sound and are also easy, fast and robust to employ. However, the hearing system is a complex structure and we have not yet developed a complete set of psychoacoustic metrics that can characterise all aspects of our auditory perception. We can easily measure sound pressure levels and also estimate perceived loudness, but commonly accepted psychoacoustic metrics do not yet exist for attributes such as envelopment, scene depth, etc.
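To make the distinction between a physical level and a perceived quantity concrete, the following is a minimal sketch rather than a method taken from this book. It assumes the usual reference pressure of 20 µPa for sound in air and the common rule of thumb that loudness in sones doubles for every 10 phon above about 40 phon. The function names `spl_db` and `sones_from_phons` are illustrative only; the psychoacoustic metrics of Section 3.3 are considerably more elaborate.

```python
import math

# Assumed reference pressure for sound in air: 20 micropascals.
P_REF = 20e-6


def spl_db(p_rms: float) -> float:
    """Sound pressure level in dB re 20 uPa for an RMS pressure in pascals."""
    return 20.0 * math.log10(p_rms / P_REF)


def sones_from_phons(phons: float) -> float:
    """Rule-of-thumb loudness in sones from loudness level in phons.

    Only a rough approximation, and only above about 40 phon.
    """
    return 2.0 ** ((phons - 40.0) / 10.0)


if __name__ == "__main__":
    # Physical side: doubling the sound pressure adds about 6 dB.
    print(f"1 Pa RMS            -> {spl_db(1.0):.2f} dB SPL")
    print(f"doubling pressure   -> +{spl_db(2.0) - spl_db(1.0):.2f} dB")
    # Perceptual side: it takes roughly 10 phon to double the loudness.
    for level in (40, 50, 60, 70):
        print(f"{level} phon -> {sones_from_phons(level):.0f} sone")
```

Under these assumptions, a doubling of sound pressure raises the level by about 6 dB, whereas a doubling of perceived loudness needs roughly 10 phon. The physical and perceptual scales are therefore related but not interchangeable.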


Figure 1.1: Our character is in search of somewhere that sounds good. Reproduced by permission of Petteri Tikkanen.

When researching sound quality or engineering products, we sometimes get to the point where the physical instrumentation metrics are insufficient to explain our perception. When we get to this stage, we often seek available psychoacoustic metrics to provide greater insight. The sensory evaluation toolkit becomes valuable when you wish to study human perception beyond the scope of existing physical and psychoacoustic metrics. You may be interested to study whether a difference can be perceived between your implementation of an audio algorithm and a reference implementation. Or perhaps you need to establish which technology solution is most preferred or has the best quality for your consumers, to guide your technology selection process. Alternatively, you could be seeking to better understand how consumers characterise the sound quality they experience and how these characteristics contribute to consumer preferences. Sensory evaluation methods can provide robust means to test and answer such questions, giving valuable data that complements the instrumentation and psychoacoustic measurement data.

The wide range of sensory evaluation methods is introduced in Section 5.1, with an overview of methods provided in Figure 5.1. As you can see, a multitude of methods exist which address different and specific research questions. Of these, four important method families deserve mention already here, and their usage is discussed next:

• Discrimination methods;
• Integrative methods;
• Descriptive methods;
• Mixed methods.

1.4 WHAT CAN WE LEARN FROM SENSORY EVALUATION?

These four method families are particularly worth mentioning here, as they are targeted towards answering specific and common types of research questions. The family of discrimination methods, discussed in more detail in Sections 5.1.1 and 5.2, is focused upon establishing and testing whether any perceptual difference can be found between stimuli or products. Efficient methods such as the ABX, duo-trio or 2-alternative forced choice (2-AFC) tests are well suited to this purpose and are not aimed at answering other exploratory or diagnostic research questions, e.g. what is preferred or how the preference is characterised.

Integrative methods, which are discussed in Sections 5.1.2 and 5.3, are well suited to finding the best performing products within a test. Two sub-categories of integrative methods are presented, namely affective methods (see Section 5.3.1) and audio quality methods (see Sections 5.3.2 to 5.3.4). Affective methods, using for example the 9-point hedonic scale, are best suited to addressing questions of overall consumer preference or liking. By comparison, audio quality methods tend to focus more upon the matter of overall quality or quality degradation relating to speech or audio (for example basic audio quality (BAQ)). Such methods are targeted towards finding winning solutions from a test, either in terms of overall preference or the overall audio or speech quality, and as a result are widely used in technology selection in the corporate, academic and standardisation domains. However, integrative methods, by their very nature, do not allow the researcher to delve much deeper to understand and diagnose the nature of the preferred/winning (or losing) technologies.

If you are interested in understanding the perceptual nature of the stimuli or technologies you are testing, descriptive methods become valuable. Descriptive methods, including descriptive analysis, semantic differential technique, etc., allow you to establish the perceptual attributes and dimensions that characterise the sound stimuli. Such in-depth methods also allow you to understand the relative perceptual importance of each of these attributes and dimensions. However, the drawback of these techniques is that it is difficult to establish the preferred stimuli within a test.

In order to establish which are the winning or preferred stimuli in a test and how their performance is characterised, we can consider looking towards mixed methods, which aim to provide the answer to both questions. To do this, mixed methods, discussed further in Sections 5.1.4 and 5.6, combine integrative and descriptive data collection with multivariate data analysis techniques (see Chapter 7). Of course, the advantages of mixed methods come at the cost of more experimental and analysis effort.

Beyond these four core families of methods lie many more, such as performance methods, e.g. intelligibility, localisation, etc., some of which will be introduced in Sections 5.1.5 and 5.1.6 and described in detail in Sections 5.7 to 5.8.3.
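As an illustration of how data from a discrimination test might be analysed, the sketch below applies a one-sided exact binomial test to a forced-choice result. This is a common approach and not necessarily the procedure recommended in Section 5.2. It assumes independent trials and a chance probability of 0.5, as in ABX, duo-trio and 2-AFC tests. The function name `binomial_p_value` and the session data are hypothetical.

```python
from math import comb


def binomial_p_value(correct: int, trials: int, p_chance: float = 0.5) -> float:
    """Probability of at least `correct` successes in `trials` by guessing alone."""
    return sum(
        comb(trials, k) * p_chance**k * (1.0 - p_chance) ** (trials - k)
        for k in range(correct, trials + 1)
    )


if __name__ == "__main__":
    # Hypothetical ABX session: 12 correct identifications out of 16 trials.
    p = binomial_p_value(correct=12, trials=16)
    print(f"p = {p:.4f}")
```

A small p-value suggests that the assessor was not merely guessing, i.e. that a perceptual difference was detected. It says nothing about which stimulus is preferred or why, which is the province of the other method families.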

1.5 CONCEPTS AND TERMINOLOGY

The field of sensory science has primarily developed in the fields of food, beverage and other consumer products and is well matured, with standardised practices and terminology. As introduced earlier in this chapter, sensory evaluation of sound has also been employed for well over a century, with the earliest applications in psychoacoustics and concert hall acoustics. Nowadays, as you will see in Part III, applications for sound and audio range from telecommunications and audiology to car audio and concert halls. Each of these quite diverse domains has developed in its own manner, with dedicated ways of working and sometimes even specific terminology.

At the beginning of this project, I had hoped to try to unify the terminology within our book. However, as time passed it became clear that this would be an impossible undertaking, as many application domains within sound and audio are very mature (e.g. telecommunications, audiology, etc.) and have their own common practices, standards, etc. The net result is that you will observe that different terminology is used across the chapters. Whilst somewhat problematic, this is the current state of our field and perhaps something that we could clearly improve, unify and perhaps even standardise. To help with a few terms that are largely synonymous and commonly employed some nomenclature is provided in .

For this introduction, I want to focus upon a few key concepts that are common throughout this book and the field of sensory science. These relate to the concepts of attributes and dimensions and the usage of univariate and multivariate statistics.

1.5.1 Percepts, attributes, and attribute names

A lot of the techniques in this book consider how to characterise stimuli such that we can better understand the perceptual nature of the stimuli. In order to do this we often refer to so-called attributes, with associated labels or attribute names, as a means for assessors to report the nature of their perception. As discussed by Lawless and Heymann (2010), such terminology is considered part of scientific language, with specific meanings compared to their usage in everyday or lexical (dictionary) language, and as such it is beneficial to define and understand these terms.

An attribute is defined in ISO 5492 (2008) as a “perceptible characteristic”, which is also very similar to the dictionary definition of a percept, i.e. “something that is perceived”. A concrete example is when we look at a cloudless daytime sky: we may perceive a certain colour. However, this percept is difficult to communicate alone. In order to be able to communicate the nature of what we perceive, we need to label our percepts. The label associated with an attribute is referred to as the attribute name and is defined as “. . . the descriptive name given to the sensory attribute . . . ” (World Coffee Research, 2013). Continuing with our example of a cloudless summer sky, we might label our perceived attribute of colour with the attribute name: blue. The definition for each attribute is “. . . a definition that clarifies and describes what the attribute name means” (World Coffee Research, 2013). For our attribute blue, the dictionary definition is a good example, i.e. “Sense relating to the colour. Of a colour of the spectrum intermediate between green and violet, as of the sky or deep sea on a clear day”. The combination of an attribute name and associated definition allows for the communication and understanding of a perceived attribute, all of which are discussed in greater detail in Chapter 4.

1.5.2 The relationship between attributes and dimensionality

The concepts of attributes and dimensions/components are commonly encountered in sensory science and are often loosely used interchangeably, even though they have quite specific meanings. While the attribute and its associated name are used to refer to the percept, a dimension or component more commonly refers to a dimensional axis resulting from a multivariate statistical analysis. One such technique is Principal Component Analysis (PCA), a data-reduction technique that converts the original data into a set of orthogonal axes, called principal components. Other analysis techniques, such as Multiple Factor Analysis (MFA) or Multi-Dimensional Scaling (MDS), refer to the axes as dimensions.

Figure 1.2 illustrates the relationship between dimensions/components and attributes. In this case 5 attributes have been rated by assessors. Conceptually, we might consider there is likely to be some correlation between certain attributes, for example spaciousness, clarity and source width. A multivariate analysis will often yield several dimensions to explain the variance within a data set, and in this illustration 3 dimensions are needed to do this. In this example each attribute contributes to the multidimensional space, with the red arrows showing the ratings for each attribute averaged for all assessors. The solid grey lines illustrate the dominating attribute loading a given dimension, with the second most important attribute illustrated with a dotted grey line. For example, both spaciousness and clarity load dimension 1 with different weights. The dominating attribute for dimension 1 is spaciousness. However, this does not mean dimension 1 is solely defined by spaciousness, as mathematically several attributes contribute to dimension 1, with specific weightings. Depending on the type of multivariate analysis performed, each dimension will be the mathematical combination of several attributes and not uniquely correlated to one sole attribute. It is in this way that mathematical dimensions and attributes differ.
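The idea that each dimension is a weighted combination of several attributes can be sketched with a small PCA computed via singular value decomposition. This is an illustration under stated assumptions, not the analysis behind Figure 1.2. The ratings matrix is entirely hypothetical, the helper name `pca` is illustrative, and the data are only column-centred (not scaled) before decomposition.

```python
import numpy as np

attributes = ["spaciousness", "clarity", "source width", "distortion", "noise"]

# Hypothetical mean ratings (rows: 6 stimuli, columns: the 5 attributes above).
ratings = np.array([
    [7.5, 6.8, 7.1, 2.0, 1.5],
    [6.9, 6.1, 6.5, 2.8, 2.2],
    [4.2, 5.0, 4.0, 3.1, 5.9],
    [3.8, 3.5, 3.9, 6.5, 3.0],
    [2.5, 2.9, 2.2, 7.2, 6.4],
    [5.5, 6.0, 5.1, 4.0, 2.7],
])


def pca(data: np.ndarray):
    """Return (loadings, explained_variance_ratio) for column-centred data."""
    centred = data - data.mean(axis=0)
    _, singular_values, vt = np.linalg.svd(centred, full_matrices=False)
    explained = singular_values**2 / np.sum(singular_values**2)
    return vt, explained  # each row of vt is one principal component


loadings, explained = pca(ratings)

if __name__ == "__main__":
    for i, (component, share) in enumerate(zip(loadings, explained), start=1):
        # The dominant attribute is the one with the largest absolute weight.
        dominant = attributes[int(np.argmax(np.abs(component)))]
        print(f"Dimension {i}: {share:.1%} of variance, dominant attribute: {dominant}")
        for name, weight in zip(attributes, component):
            print(f"    {name:>13}: {weight:+.2f}")
```

Each printed dimension lists a weight for every attribute. Even where one attribute dominates, the others generally carry non-zero weights, which is the distinction between a perceptual attribute and a mathematical dimension drawn above.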

1.5.3 The number of response variables and dimensions

Figure 1.2: Illustration of the relationship between (perceptual) attributes and (mathematical) dimensions. The plot shows the attributes clarity, spaciousness, source width, distortion and noise in a space spanned by Dimension 1, Dimension 2 and Dimension 3.

When we collect data from assessors using sensory evaluation methods we often use numerical rating scales (e.g. categorical or interval) for each attribute or response variable. The data collected for each attribute or response variable is referred to as a variate, which we can then statistically analyse. Depending on the nature of the experiment we perform, we may employ one or more attributes or response variables (variates) to collect data. For example, a common situation is to evaluate audio quality in a global manner using a single response variable, e.g. the Basic Audio Quality (BAQ) scale. In other methods, we may collect data from assessors using multiple attributes or response variables, such as envelopment, spatial clarity and scene width.

The term univariate is commonly used in statistics when a single variable quantity is involved, e.g. when collecting data using a single attribute or response variable. In such a case we may perform a univariate statistical analysis, as discussed further in Chapter 6, studying the structure of the data for this single response variable. Multivariate statistical analysis techniques, as discussed further in Chapter 7, are generally applied when there are multiple attributes or response variables.

It is important to distinguish between the nature and number of variates and dimensions, as both have specific meanings and are not always correlated. The number of response variables does not necessarily tell us much about the dimensionality of the data itself. Furthermore, the usage of univariate analysis does not imply that the data has a unidimensional structure. It is through the statistical analysis of our data that we can establish whether the structure of our data is unidimensional or multidimensional in nature.
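The point that the number of response variables need not match the dimensionality of the data can be shown with a deliberately extreme, hypothetical case: three response variables that are all linear functions of a single underlying percept. The sketch below assumes a PCA-style variance decomposition of column-centred data; the data and the helper name `explained_variance_ratio` are illustrative only.

```python
import numpy as np

# Hypothetical scores of 6 stimuli on one underlying percept.
base = np.array([1.0, 2.5, 4.0, 5.5, 7.0, 8.5])

# Three response variables that are all linear functions of the same percept.
ratings = np.column_stack([base, 2.0 * base + 1.0, 10.0 - base])


def explained_variance_ratio(data: np.ndarray) -> np.ndarray:
    """Share of total variance carried by each principal component."""
    centred = data - data.mean(axis=0)
    singular_values = np.linalg.svd(centred, compute_uv=False)
    return singular_values**2 / np.sum(singular_values**2)


ratio = explained_variance_ratio(ratings)

if __name__ == "__main__":
    print("response variables:", ratings.shape[1])
    print("variance share per dimension:", np.round(ratio, 3))
```

Here three variates collapse onto essentially one dimension. Real rating data are noisier, and the number of dimensions worth retaining is a judgement made during the multivariate analysis (see Chapter 7).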

1.6 BEFORE MOVING ON

Before moving on, it is important to say that sensory evaluation methods are merely tools to evaluate the human perception of stimuli, i.e. sound in our case. This is not the only tool set that can help you do this and you might also consider using instrumentation methods, psychoacoustic metrics, predictive models, etc., to provide vital complementary measurements and data. If you do need the sensory evaluation toolset, ensure you choose the most suitable method(s) for your research. This is important to ensure that you are able to answer your research question(s) in an efficient, reliable and robust manner. We hope that this book will provide you with the needed guidance on these matters.

CHAPTER 2

Why Sound Matters

Julian Treasure
The Sound Agency, Chertsey, UK

In this chapter we examine the nature of sound and its effects on human beings, both positive and negative. Human voices, music and the sounds of nature may have pleasing and desirable, even therapeutic, effects; however, unwanted sound (noise) can adversely affect wellbeing, effectiveness and happiness. By considering a range of examples of these positive and negative impacts of sound, we gain some sense of the vital importance of sound, hearing and listening to humanity, revealing why sound matters far more than most people think, and why it pays to spend the time to understand its impact in our world and on every one of us.

In this electromechanical age, sound surrounds humanity like never before. Many people have never experienced silence, and even quiet is becoming a rarity for those who live in urban environments. That is already more than half of humanity, growing to two out of three people by 2050 according to United Nations forecasts (UN, 2014).

The renowned nature recordist Bernie Krause proposes three classes of sound: geophony (the sounds of the planet, such as weather, water and earth movement); biophony (the sounds of organisms in their habitat); and anthropophony (the sounds caused by humanity) (Krause, 2012). Each has numerous dimensions that characterise its nature, for example from quiet to loud, from peaceful to violent, and from a human perspective from pleasant to unpleasant.

Geophony has a huge range, from gentle rain on leaves and bubbling brooks to the loudest sounds ever made on earth: the eruption of Krakatoa on August 27th 1883 was heard in Perth, Australia, 2,800 miles away; estimated at 170 decibels (dB) at a range of 100 miles, it shattered the eardrums of crew members on the British ship Norham Castle, 40 miles distant (Winchester, 2003).

Biophony encompasses what Krause calls the Great Animal Orchestra.
Most animals make sound, from the ultrasonic, inaudible (for humans, but not for cats) cries of mice to the deep bass of elephants; the frequency range of biophony is 10 to 100 kHz. Natural habitats vary in the richness of their biophony of course. Many parts of the seas are alive with surprising amounts of sound from snapping shrimps, dolphins, whales and many more species; others are very quiet. Densely populated habitats like jungles and rainforests teem with sound, each species occupying its own section of the frequency spectrum, while deserts and high mountains are almost bereft of biophony. In human-inhabited locations, the most prevalent and significant form of biophony is usually birdsong, a sound which researchers are now discovering is actually therapeutic and beneficial for human beings (Ratcliffe et al., 2013).

Sadly, Krause and other nature recordists are tracking the decline and even disappearance of biophony due to human activity. In some places, we cannot see the damage done to the environment, but we can hear it as a reduction over the years of the range and quantity of biophonic sound (Krause, 2013). Industrial use of chemicals, especially neonicotinoids, by agro-business decimates insect populations, and the birds that feed on them decline or disappear, along with their song (van der Sluijs et al., 2015).

The gentler geophonic and biophonic sounds of wind, water and birds are almost universally considered pleasing by human beings, and I suspect continued research will show that they are all actually good for us. In cities they are notably absent, which I suspect may be a significant factor in a range of urban health and social issues like alienation, depression, vandalism and even violence.

Anthropophony is even more varied than the other two classes of sound. We humans create sounds like no other species. We have voices and language. We make music. We create millions of machines: cars, trucks, trains, ships and aircraft; phones and computers; construction and manufacturing equipment; domestic appliances and of course TVs. We come together in huge groups to make noise for religious, sporting or celebratory events.

Most of the anthropophony that human beings encounter is noise, which I define simply as “unwanted sound”. Transport, construction, neighbour and electro-mechanical noise are the exhaust gases of pulsing economic and social activity: not designed by anyone, these sounds are at best annoying, inappropriate and inconvenient and at worst, as is becoming increasingly clear, they are massively damaging to health, wellbeing, effectiveness and happiness.
The World Health Organisation (WHO) estimates that more than one million disability-adjusted life years (DALYs) are lost every year in Western Europe to traffic noise alone; this makes noise the second-largest environmental health risk in the region, just behind air pollution (WHO, 2011). The mechanism for most of this health damage is sleep deprivation due to traffic noise that exceeds the WHO night-time guidelines. The excess does not need to be large: research indicates that the biggest issue is not rare, very loud events, but frequent noise events slightly above the threshold of hearing that create multiple sleep disturbances throughout the night (Health Council of the Netherlands, 2004). In densely populated areas like Belgium and Luxembourg, the problem is acute as well as chronic, with 75 % of the population exposed to road noise above the WHO recommendation (European Environmental Agency, 2014). In 2003, over a quarter of the entire population of the Netherlands were found to have been highly disturbed by noise during sleep (WHO, 2009).

As the pyramid graphic shows (see Figure 2.1; Babisch, 2002), chronic noise exposure transcends mere annoyance, feeding through stress reactions into elevated risks, particularly of coronary disease, resulting in substantial loss of both quality and quantity of life. The academic evidence linking noise with major health issues is growing. Professor Stephen Stansfeld, a leading expert on noise and health, produced a summary of recent reviews of the available work (see Table 2.1). Recent studies (WHO, 2011) contain evidence of additional diseases and adverse health effects of noise.

Not surprisingly, the economic cost of noise is vast. The UK Government states (Department for Environment, Food & Rural Affairs, 2013): It is estimated that the annual social cost of urban road noise in England is £7 to 10 billion.
This places it at a similar magnitude to road accidents (£9 billion) and significantly greater than the impact of climate change (£1 to 4 billion). The European Commission estimated in 1996 that the cost of noise may be up to two percent of Europe’s GDP (European Commission, 1996), which in Europe, at the time of writing, amounts to a staggering €300 billion. Remember, this is road noise only.

Table 2.1: Classification of the evidence related to health symptoms of noise (European Commission, 2013). Data compiled from Passchier-Vermeer and Passchier (2000); Babisch (2004, 2006, 2009); WHO (2009); European Environmental Agency (2010). The table classifies the evidence as sufficient or limited, by source, for each of the following symptoms or characteristics: annoyance; hypertension; cardiovascular disease (inc. ischemic heart disease and myocardial infarction); self-reported sleep disturbance; awakening; sleep (arousal, motility, sleep quality); heart rate and body movements during sleep; hormonal changes during sleep; performance and fatigue next day; stress hormones; learning, memory and performance; immune effect; birth weight; wellbeing.

Figure 2.1: Illustration of noise severity and its impact on the population (Babisch, 2002).

While awareness of these risks is growing, little preventative action is being taken by governments. Even in Europe, the only continent to have carried out widespread noise mapping, the Environmental Noise Directive (END) obliges member states only to measure, not to act: sadly there are no votes in noise reduction. This is because most people are unconscious of sound and of its effects on them. Surrounded by cacophony, they have fallen into the habit of suppressing awareness of sound; they bellow over traffic noise and pretend it is not there. If they could smell it or see it, they would never tolerate it, but because it is sound, they just shout. The human ability to habituate, adapting to cope with long term exposure to even the harshest environmental factors, does not negate the harmful health effects of chronic noise exposure.

Just as stochastic resonance turns noise to advantage by adding it to a system in order to make a weak signal more measurable, auditory noise can sometimes have positive effects. Dithering adds noise to 16-bit digital sound to reduce distortion. Comfort noise can improve phone conversations where voice activity detection is used, creating a user experience that is less choppy and startling. Masking noise systems in offices can improve privacy by making speech from colleagues less intelligible - though research is needed to establish the long term effects of such systems, which I suspect may include some fatigue and stress.

Human beings have four communication channels: reading, writing, speaking and listening. About 30 % of our brain neurons are used for visual processing and about 2 % of that for hearing - clearly a visual bias. Two send and two receive; two are for the eyes and two are for the ears.
This sounds equal but we do not weight them that way in the modern world: for example, only two of them are taught in most schools. It would be a scandal in any country if children were leaving school unable to read or write, but from my experience and communication with teachers around the world I am confident in asserting that most school systems do not teach listening or speaking in any depth, and therefore that most of the world’s children leave their education without having been taught to speak effectively, and even less to listen consciously or well.

It is no accident, then, that many of the major innovations in communication in the last 40 years have been text-based or visual: email, SMS, instant messaging and social media all employ our eyes and our fingers, not our voices or our ears. Over the same period, the advances arising from digital sound have been largely about data compression (audio codecs such as MP3, AAC, etc.) and the democratisation of recording and disseminating music, with debatable outcomes for quality - though it is also true that VoIP has offered a significant upgrade from traditional low quality phone calls, and the hearing-impaired community has benefited from some transformational improvements in technology.

In general, we do not often notice or complain about sound unless it is particularly disturbing. Nevertheless, despite the lack of consciousness of it, sound affects human beings profoundly - hence the title of this chapter. Sound matters because it changes us all, even if we are not aware of it. When I set up The Sound Agency in 2003 I undertook a major research project to understand the ways sound affects people. Many books and papers later, I settled on four primary effects, and in the subsequent years I have never found the need to change them.

The first effect is physiological. All human beings (and other animals) interpret sudden sounds as threats: it is the safest way to be.
Even though a car backfiring or a plate smashing may be completely without danger to us, millions of years of evolution mean that these sounds will still cause the autonomic nervous system to react with an instant release of the hormones adrenaline, noradrenaline and cortisol: these instantly accelerate heart rate and breathing, increase blood pressure and blood sugar, constrict blood vessels and shut down digestion, among many other changes. This is the fight/flight response and it is instinctive. It is also one reason why noise is bad for the health: over-frequent doses of cortisol create many health issues, from decreased immunity to digestive issues and muscle wastage. Sound can equally be physiologically positive. Gentle surf at about 8–12 cycles per minute is an excellent soporific, mimicking the breathing of a sleeping human and entraining the heart and breathing to slow down and the whole body to relax. Sound, particularly music, has been proven to be effective as a treatment (or an aspect of a treatment regime) for pain relief, asthma, dementia, stress, depression, stroke and autism, and there are now academic journals dedicated to researching these effects. The second effect is psychological. Music is the most obvious case here, since most of it is specifically created to convey emotion of some kind. Despite many books on the subject, I am not aware that anyone has explained exactly how music can take a composer’s emotion and transfer it so powerfully to the listener. The combination of pitch, movement, harmony, melody, rhythm, tempo, density, timbres and often voices and words is a rich and complex brew with so many interrelationships that we may never understand exactly how it works. Nevertheless, it is a profound part of human experience: every human society, no matter how cut off from the rest, has independently developed music. To be human is to be musical, and humans are adept at using music to enhance or to counteract emotions. 
It can transport dancers to ecstatic trance, amplify romantic love, coalesce thousands of sports fans into one chanting mass, prepare athletes for challenges and propel soldiers to war, among many other uses.

It is not only musical sound that produces psychological effects: voices, natural sound and noises can all alter or induce emotional states. Much of the emotional impact of sound comes from association. The ticking of a grandfather clock can transport one person back to a childhood home and generate a feeling of warmth; the sound of a domineering boss’s ringtone can inspire fear in another. Birdsong makes many people feel safe and awake, because over hundreds of thousands of years we have learned that when the birds are happily singing there is usually nothing to worry about - and because the birds are nature’s alarm clock, associated with daytime and getting things done.

The third effect is cognitive. Our brains may have massive storage capacity, but we have finite bandwidth for auditory processing. The well-known cocktail party effect illustrates our ability to focus attention on certain conversations and ignore others, which is needed because almost nobody can understand two people talking simultaneously. We can understand speech and recognise music in high noise conditions, but only with high cognitive load. This is why our productivity can be slashed when we are disturbed by noise: effective knowledge working (for example working with words or numbers or manipulating complex concepts) usually depends on listening to our inner voice, which becomes very hard when other people’s conversation intrudes. We can adapt, or habituate, to constant drones like the hum or hiss of air conditioning, even though I believe this takes mental energy and leaves us tired at the end of the day. But many of the sounds in modern open plan offices are not constant, and they dramatically affect individual effectiveness. Many studies have shown this effect, estimating the drop in productivity at up to a shocking 66 % (Banbury and Berry, 1998). In order, the most disturbing sounds in modern offices are other people’s conversation (we are programmed to decode language and we have no earlids to shut out the irritating chatter of the person at the next desk), ringing telephones (a call to action that, if unanswered, is often very irritating), any other irregular noise, and information-rich sound like music.
An open plan is excellent for collaboration, but it has unfortunately become the default office floor-plan, despite the fact that we need to facilitate several other work modes. Alongside collaboration, Professor Jeremy Myerson distinguishes concentration and contemplation as work modes that need their own type of space. Noise is the number one complaint in numerous surveys of office workers, probably because open plan is largely devoid of rules. As Myerson said when I interviewed him for my radio documentary “The Curse Of Open Plan”, the postman would never enter your house and throw your letters on the living room floor - but that is exactly how many people behave in the open plan office. It seems that some new etiquette is required. The fourth and final effect of sound is behavioural. Sound changes what we do. Any sound with a tempo below that of the resting heart beat (around 60–80 beats per minute (bpm) for most people) will tend to slow our bodies down, causing us to move more slowly; conversely, tempos over 100 bpm tend to speed us up. This has been dramatically shown in studies of music and shoppers or diners. In one study (Milliman, 1982), slow-paced music in a grocery store caused people to walk significantly more slowly around the store, and the result was that they spent 38 % more, presumably as they noticed more things they liked. Other research has established that fast-paced music causes people to chew faster and thus finish their meals more quickly (Roballey et al., 1985) - which would be why fast food restaurants play nothing but up-tempo pop. Quality matters as well. It may be that we linger somewhere because of the beauty of the sound: many piazzas in Italy have fountains at their centre, not just to look at but to listen to.
Equally, people will tend to move away from unpleasant sound if they can, creating a very important effect in retail, where the cacophonous soundscape in many shops and malls creates stress, fatigue and the unconscious desire to leave. The result is a reduction in what the retailers call dwell time - and dwell time is directly related to sales. If we swap senses for a moment, the proposition becomes obvious: a shop with a terrible smell would clearly

not do very well. But because architects and interior designers focus almost exclusively on visual design, the sound in many retail spaces is poor, and even hostile. Improving sound in the built environment has been a major part of my work over the last 10 years. When we carry out a Sound Audit on a space, be it an office, an airport, a school, a hospital, a shopping centre or a shop, we assess four key aspects: acoustics, noise, sound system and content. When considering builds, acoustics are crucial: it is very hard to create a pleasing sound in a room with bad acoustics. The modern trend to deploy nothing but hard materials - mainly stone, metal, glass and plaster - creates highly reverberant rooms with surfaces that bounce every sound back and thus amplify noise. Acousticians are at least partly to blame, focusing on equations and engineering instead of selling the outcomes of their work on the people in the building. Maybe because sound is invisible, sound-related budgets are often the first things to be “value engineered” down or out altogether when quantity surveyors are seeking to reduce construction costs; as a result, slightly more expensive acoustically absorbent or noise-attenuating materials are replaced with cheaper alternatives that bounce sound back (increasing reverberation) or allow it to pass through (destroying privacy) or both. Lack of attention to acoustics is a major issue that has significant social consequences, because it affects some very important buildings to the point of rendering them unfit for purpose - notably schools. It is shocking to learn that, due to bad design combining low signal-to-noise ratios with high reverberation times, speech intelligibility in many classrooms can be as low as 30 % for children with perfectly normal hearing (Crandell and Smaldino, 2000).
Or that, due to poor acoustics and group work, the average noise level in German classrooms is 65 dB (Oberdörster and Tiesler, 2005) - well above the point at which long-term exposure increases risks of coronary disease (Ising and Kruppa, 2004). It may be that teachers are shortening their lives by working in such noisy environments day after day. Acousticians tend to be brought in when the damage is done, tasked with solving serious acoustic problems in built spaces, with grudging budgets eked out for the work. How much more inspiring to be in at the start of the process, helping to create a space that sounds as good as it looks! Architects, please note. Noise, in the sense of unwanted sound, is best and most cost-effectively ameliorated at the design stage. Many domestic white goods like dishwashers and washing machines and many cars now show noise ratings as part of their key specifications. This is not so for the commercial equivalents. Very few supply contracts for heating, ventilating and air conditioning (HVAC) systems, escalators, catering equipment or commercial vehicles mention noise output at all. As a result, the noise floor (the level of sound of the unoccupied space) in many commercial buildings is way too high, and the quality of the soundscape is poor before anyone sets foot in the space. Aside from the obvious examples of performance cars and motorbikes, most mechanical noise is unpleasant even when equipment is well serviced. If things go awry it gets much worse: we have all suffered the pain of squeals, squeaks, clanks and bangs when something like an escalator gear or AC bearing is out of alignment. Even humble trolleys can generate piercing squeaks that pollute an entire supermarket. When auditing, we have often found water features to be destructive if the pressure is too high or the quantity of water too large.
While gentle water is a lovely sound to most people, powerful waterfalls and dancing fountains are wearing, and they inhibit communication dramatically because they mask the sibilants of speech. We understand one another largely by discerning consonants, not vowels, so those Ss and Ts are crucial for effective conversation. Another critical environment where noise is destroying effectiveness is healthcare. A 2007 Johns Hopkins study found that average daytime noise levels in the hospital were running

at 12 times the WHO’s recommended maximum, and night-time levels eight times too high (Busch-Vishniac et al., 2005)! Again, the damage is being done via sleep deprivation; we repair ourselves when we sleep, and when you imagine the cacophony of beeps, hisses, buzzes and bangs in the average intensive care unit, it seems a miracle that anyone can get well in that environment. In most cases, noise can be significantly reduced by fixing things that are not working as they should, or changing simple behaviours. For example, one hospital managed to reduce noise levels by 20 dB by changing staff behaviours, including using soft footwear, having all phones on vibrate and sensitising staff to patient noise so that they stopped having conversations in earshot. Sound systems are often even more victimised than acoustic treatments when it comes to cutting building costs. No doubt the reader can recall numerous experiences where the sound in some commercial built space has been actively unpleasant due to a poor-quality sound system. Often this is because the sound system was only installed for safety, and is capable of handling emergency voice and alarm bells but nothing more - and then some bright spark notices all those loudspeakers and decides to play music through them. The result usually sounds like a broken AM radio in a sock, and serves only to degrade the experience of the people in the building. The ocular obsession even extends to spaces where sound really should be the critical consideration. We speak about videoconferencing - but actually the key element of the conference is sound. If the video fails, you can continue perfectly happily in sound only. If the sound fails, the conference is over. It is strange then that so many conference rooms boast huge, lavish video screens but offer one puny microphone cluster in the centre of a large table, in a room that almost certainly has poor acoustics, to try and convey audio.
This is not a recipe for a successful meeting. The last element is content. Music is greatly abused in this context. Almost all music is made to be listened to; it is not a veneer to be overlaid on every public space. Where music works well is in situations where people are bored by carrying out repetitive tasks, like a factory or workshop. In that situation, music is welcome stimulation and diversion. However, when you play something that wants to be listened to in a space where people are trying to do something that requires cognition, you create an immediate conflict of interest, because the music is taking up valuable mental bandwidth and restricts people’s ability to think, talk, work or make decisions. Contrary to the glowing research sponsored by the music industry, which would love us to be bombarded by music in every space¹ and thus attempts to persuade us that we all love music everywhere, the independent figures, for example those from the Royal National Institute for the Deaf (now Action on Hearing Loss), show that roughly a third of the population with normal hearing like piped music, a third do not care, and a third hate it (RNID, 1998). The numbers are far more negative for those with hearing loss, who find it very hard to understand speech when background music is playing. One third is a lot of customers to upset! This is why pressure groups like the UK’s Pipedown have such avid followings.

¹ Not surprising: along with live concerts, publicly licensed music is the only music industry revenue source that is now increasing.

The Elizabethans described conversation as “decorating silence”. That is a great way to think about designing an effective soundscape in any space. Start from silence and decorate it, making sure the sound is appropriate for the acoustics, noises and sound systems, is congruent with the brand or values behind the space, and adds value by supporting people in what they are wanting to do. In between silence and noise lies a wide palette of sounds, natural or designed. Generative players can weave sounds into organic, slowing streams that never repeat themselves. There is so much richness to choose from, so many creative options to use! The results show the importance of designing for the ears as well as the eyes. Our own developments with generative soundscapes have increased sales in shopping malls and airport retail outlets by up to 10 %, and have improved customer satisfaction ratings by up to 50 % (Treasure, 2011). This shows what can be achieved by designing sonic environments that are pleasing and appropriate, which simply involves optimising (as far as is realistic) the four pillars of good sound in spaces: acoustics, noise, sound system and content. Good sound may be good business for anyone, but for many sound is their business. The global music industry is making a modest comeback with digital music now accounting for more than half of its $15 billion revenue, creating 127,000 jobs in Germany and 117,000 in the UK (IFPI, 2016). These jobs comprise composers, musicians, engineers, producers and then a whole supply chain, which today involves very little physical product or retail. At the time of writing, Virtual Reality (VR), binaural technology and the new breed of hearables are gaining traction, all bringing new opportunities for sound design and creating new dimensions to the concept of schizophonia, R. Murray Schafer’s term for the separation of original sounds from their recorded versions. Ever since the Walkman’s invention, what we see and what we hear have not necessarily been related. In the future, they may not be real either. Digital music has changed listening habits dramatically by being more available, flexible and portable than analogue ever was.
Originally this came at a significant cost in quality as 64 kbps MP3s and cheap ear buds swept the world, but now HD music in 24-bit and 96 or even 192 kHz is offered by streaming services like Qobuz and Tidal. Where once people sat still and consciously listened to entire albums from start to finish, the new norm is to consume playlists of tracks, often mobile and while doing something else. The democratisation of the recording process due to ProTools, Logic, Live and other digital audio workstations means that the sheer quantity of recorded music is overwhelming. This may be why radio is holding its audience share very well in the face of global media fragmentation: people evidently value more than ever having a trusted guide to direct their music listening. Music is not the only way people earn their living working with or on sound. There are sonic artists and sound designers; academics, from musicologists and music psychiatrists to acoustic ecologists and physicists; professionals, from acousticians and pro-audio specialists to audiologists and music therapists; and hardware and software technicians developing audio technology for applications such as entertainment, communication or hearing enhancement. Recently marketers have joined the list. Brands are starting to realise they are experienced in five senses, not just visually, and a whole audio branding industry has arisen, complete with its own trade association called the Academy for Applied Sound which has over 100 members from all over the world. Of course broadcast advertising has used sound (music, sound effects and voices) ever since this has been possible. Where the film industry innovated, advertising followed. 
Sound accounts for a major part of the emotional impact of any feature film, as anyone who has ever watched with the sound switched off can verify, and the research on sound in advertising confirms that it too contributes significantly to recall - and, more importantly, to our emotional responses (Alpert and Alpert, 1989). The more recent developments in audio branding are very wide-ranging, to support the multi-sensory brand image required by modern companies and organisations. In my work over the last 15 years I have identified eight possible expressions of a brand in sound:

sonic logos, brand music, advertising sound, branded audio, brand voice, telephone sound, product sound and soundscapes. Some brands are making use of all of them, and many are exploring several of them. For example, the French railway operator SNCF developed a four-note sonic logo that acts as DNA for all of its sonic touch-points. Versions of it presage voice announcements on the trains and in the stations, play on the website and complete TV and radio commercials, and form the basis of music that is played on-hold and at events. Pink Floyd’s David Gilmour liked it so much on a French train trip that he wrote the title track of his recent album Rattle That Lock around those four notes - with permission, of course. It is clear that audio branding is set to develop rapidly, which is a good thing because most of the noise we discussed at the start of this chapter is made by companies and organisations who have simply not been listening: their sound output has been a hidden social cost that never shows on the balance sheet. Perhaps in time noise cost, like carbon cost, will become explicit and organisations will become responsible for it, and with it. And perhaps the imminent radical improvements in speech recognition and voice synthesis such as Viv will help all of us to start listening more and typing less. Maybe schools will then, at last, start teaching conscious listening skills. This is critical, because conscious listening is the key to everyone - individuals as well as organisations - taking responsibility for the sound they create, and for the sound they consume. Conscious listening is also the doorway to understanding, and thus a keystone of democracy, which relies on civilised disagreement to exist at all.
As was very evident in the tumultuous and often vicious political events of 2016, people’s tendency to select only the news they agree with is virtually eliminating compassionate listening, polarising populations and crystallising only a feeling of righteous anger. Listening is the key to creating a world of sonic harmony rather than noise and conflict. We can but hope that the vital role of sound and listening will be recognised before it is too late.

CHAPTER 3

An Introduction to Sound, Hearing and Perception

Ville Pulkki
Aalto University, Espoo, Finland

CONTENTS

3.1 Basics of sound and audio
    3.1.1 Vibrations as source of sound
    3.1.2 Sound sources
    3.1.3 Sound waves and sound pressure
    3.1.4 Sound pressure level
    3.1.5 Frequency-weighted measurements for sound level
    3.1.6 Capturing sound signals
    3.1.7 Digital presentation of signals
    3.1.8 Further reading
3.2 Hearing – Mechanisms and basic properties
    3.2.1 Anatomy of the ear
    3.2.2 Structure of the cochlea
    3.2.3 Passive frequency selectivity of the cochlea
    3.2.4 Active feedback loop in the cochlea
    3.2.5 Mechanical to neural transduction
    3.2.6 Non-linearities in the cochlea
    3.2.7 Temporal response of the cochlea
    3.2.8 Frequency selectivity of hearing
    3.2.9 Filterbank model of the cochlea
    3.2.10 Masking effects
3.3 Psychoacoustic attributes
    3.3.1 Loudness
    3.3.2 Pitch
    3.3.3 Other psychoacoustic quantities
3.4 Spatial hearing
    3.4.1 Head-related acoustics
    3.4.2 Localisation cues
    3.4.3 Binaural cues
    3.4.4 Monaural cues
    3.4.5 Dynamic cues
3.5 Psychophysics and sound quality
3.6 Music and speech
3.7 Hearing impairment
3.8 Summary

During evolution the senses “developed” as vehicles to gather information regarding the surrounding physical world. As one of the results we have the sense of hearing, which consists of two ears and a considerable amount of dedicated neural processing in the brain. The hearing mechanism makes its best efforts to understand “what” caused a sound, and “where” the physical object(s) that caused the sound are located. For example, hearing allows animals even in the dark to localise and listen for their prey, companions and natural events, and to avoid predators. In addition, humans have also developed special uses for hearing that enable efficient human-to-human communication by voice and sound. The main purpose of the human voice is to communicate between humans using language and other utterances, through which information about concepts and state of mind can be delivered. Humans also have developed music, which is a form of art that is composed of sound and silence. Music can be used to communicate emotions to the listener; for example, the background music of a movie is often used to affect the perceived mood of scenes. There are thus many tasks where hearing is used, and the brain does a magnificent job of extracting all obtainable information from sound signals arriving at the ear canals.

When compared with vision, the hearing system has very different characteristics. The eye has an angular discrimination accuracy at least 100 times finer than our directional hearing, although vision is restricted to a relatively limited set of forward directions. Conversely, the hearing system receives sound generally from all directions, but the directional accuracy is limited to a few degrees at best, and the presence of other sounds (reflected sound or incoherent sounds) from any direction largely degrades the accuracy of directional hearing. In contrast, the eye has three types of photoreceptor cells, which are tuned to different wavelengths of light (Starr et al., 2010), allowing the eye/brain to resolve over 10 million colours. This is very different from the number of auditory frequency channels that we can individually listen to, which is about 40 (Moore, 1995). The temporal resolution of hearing is also acute: details in the envelope of a sound signal are resolved to an accuracy of ∼1 ms.

Overall, the hearing system can be said to be an omnidirectional system used for communication, surveillance of our environment and alerting us of potential dangers around us. It is capable of identifying the nature and location of sound sources with marvellous accuracy. Hearing makes us conscious of our surrounding world, enables us to communicate verbally and also to enjoy the wonders of music.

This book considers sensory evaluation of sounds that humans hear in their environment, often produced by man-made technologies. This chapter introduces some background regarding the hearing system, sound and audio that may be beneficial when conducting sensory evaluation studies of sound. We will review some basics of acoustics and audio engineering that may be helpful when implementing and testing sound technology. Even more important is to understand the general working principles of hearing when studying its function in the context of complex sounds. For this, the acoustics, neural mechanisms and the resolution of hearing are also discussed. The short introduction chapter merely scratches the surface of the subject matter, as the acoustic chain from source via room to receiver is physically a very complex system, and furthermore, hearing is an immensely intricate system. To get a broader and more complete introduction to the field, the reader is referred to (Pulkki and Karjalainen, 2015; Rossing et al., 2001; Moore, 2013).

3.1 BASICS OF SOUND AND AUDIO

This section covers some basic knowledge of how sound is generated and radiated, how it can be captured by technical means, and processed and stored in digital form.

3.1.1 Vibrations as source of sound

In most cases sound is caused by a vibrating mechanical object, or by vibrations of flowing air. In either case the oscillations occur in the range of audible frequencies, from about 20 Hz to about 20 kHz. The vibrations then cause changes in sound pressure and particle velocity in the air next to the object or to the flow, which causes sound waves to radiate from it, as discussed in the next section. Very common sounds are impact sounds occurring when two objects collide with each other. The collision makes the objects vibrate with a slower or faster decay, which causes radiation of sound. Simple examples with fast decay are hand clapping and footsteps. In such cases the vibration is non-periodic, and the magnitude spectrum of the radiated sound is relatively flat, without spectral peaks. An object may have resonances in its physical structure, which largely affect the spectrum of the vibrations. A resonance means that a specific frequency decays more slowly than the others. Such resonances occur also with air flow vibrations; for example, the howling sound when blowing on an empty bottle is caused by a resonance. Often the sound source, such as a musical instrument or human voice, exhibits a large number of resonances at the same time. In such cases the resulting sound can be called a combination tone, and it consists of a set of sinusoids called partials, each caused by a resonance. Each partial has a specific frequency, amplitude and phase. For example, very strong resonances are found in vibrating elements in musical instruments (e.g., strings or pipes), or in acoustical resonances of rooms. If the waveform radiated from the source repeats itself in time, it is said to be periodic, and consequently the magnitude spectrum of the signal consists of partials that have a harmonic relationship, and the corresponding sound is said to be harmonic. The lowest frequency of the partials is called the fundamental frequency, often denoted $f_0$. The partials of a harmonic sound are also called harmonics, and their frequencies follow integer multiples of the fundamental, $f_n = n f_0$. Most musical instruments in Western music generate harmonic or almost harmonic signals. If the vibration of the sound source is non-periodic, it may consist of discrete frequencies that are not in harmonic relationship, or of a continuous distribution of partial frequencies. For example, bells often produce non-periodic sounds and typically have a relatively sparse distribution of frequencies with non-harmonic relationships. In addition, noise-like sounds such as wind noise or ventilation noise are non-periodic, and they show energy at each frequency without strong peaks in the magnitude spectrum.
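The link between harmonic partials and periodicity can be illustrated numerically. The following Python sketch builds a signal from four partials at integer multiples of a fundamental and checks that it repeats after one fundamental period. The sample rate, fundamental frequency, amplitudes and phases are arbitrary assumptions chosen for illustration, and the function name `harmonic_tone` is hypothetical.

```python
import math

def harmonic_tone(f0, amplitudes, phases, fs, duration):
    """Return samples of a sum of partials at frequencies f_n = n * f0."""
    num_samples = int(round(fs * duration))
    samples = []
    for k in range(num_samples):
        t = k / fs
        value = sum(
            a * math.sin(2 * math.pi * n * f0 * t + phi)
            for n, (a, phi) in enumerate(zip(amplitudes, phases), start=1)
        )
        samples.append(value)
    return samples

fs = 48000                             # sample rate in Hz (assumed)
f0 = 200.0                             # fundamental frequency in Hz (assumed)
amplitudes = [1.0, 0.5, 0.25, 0.125]   # one amplitude per partial (assumed)
phases = [0.0, 0.3, 1.1, 2.0]          # one phase per partial in radians (assumed)

x = harmonic_tone(f0, amplitudes, phases, fs, duration=0.05)

# Frequencies of the partials: integer multiples of the fundamental.
partial_freqs = [n * f0 for n in range(1, len(amplitudes) + 1)]

# A harmonic signal repeats after one fundamental period of 1 / f0 seconds.
period = int(round(fs / f0))           # 240 samples with these values
max_deviation = max(abs(a - b) for a, b in zip(x, x[period:]))
```

With these values `max_deviation` stays at the level of floating-point rounding error. Replacing one partial frequency with a non-integer multiple of `f0` would break the repetition, which corresponds to the non-harmonic case described for bells.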

3.1.2 Sound sources

The vibration of an object causes movement or displacement in its surface, which displaces the molecules of the surrounding medium. This causes a disturbance in the sound field, which then travels through the medium with the speed of sound (Beranek and Mellow, 2012). The speed depends on characteristics of the medium such as the density, temperature and moisture. The speed of sound can be approximated to be 340 m/s in air at normal room temperature. Besides natural sound sources, humans have built electroacoustic sources, where electric signals can be transformed into acoustic sound. A loudspeaker (Colloms, 1997; Borwick, 2001) consists of one or more driver elements, which are typically built into an enclosure box. When the loudspeaker cone moves outwards, it pushes the air particles closer to each other, resulting in compression and higher pressure, and when it moves inwards, the air particles are separated more from each other, causing rarefaction and lower pressure. An amplifier is used to supply the necessary electric power to a loudspeaker, and typically only a fraction of this energy is converted into sound. An important acoustic property is the directivity pattern of a sound source. The pattern describes the level of sound radiating in each direction from the source. With typical sound sources, such as voice or musical instruments, frequency-dependent patterns are found. Typically the pattern is omnidirectional at frequencies where the source is small when compared with the wavelength. At higher frequencies the sources typically have more complex directional patterns, which can focus on one or several directions. In normal rooms with wall reflections and reverberation, the sound radiated in each direction bounces off the walls and arrives at the listener after some delay. Such delayed components may be detrimental in some cases; for example, they lower the speech intelligibility in public address. Thus it is often desirable to use a loudspeaker that radiates sound more towards the listeners than towards the walls. The loudspeaker is said to have some directivity, which means that it radiates more sound along the main axis than in other directions, which reduces the contribution of reflected and reverberated sound at the listening position. Conversely, in anechoic chambers, the reflections and reverberation are heavily attenuated, and a listener will only receive the sound emanated from the source in the direction of the listener. This is also often the case in open outdoor situations.
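To get a feel for when a source is "small when compared with wavelength", the wavelength can be computed from the speed of sound. The sketch below uses the standard relation wavelength = c / f, which is not stated explicitly in the text, together with the 340 m/s approximation given above. The source dimension `source_size` is a hypothetical value chosen only for illustration.

```python
SPEED_OF_SOUND = 340.0   # m/s, the approximation quoted for air at room temperature

def wavelength(frequency_hz, c=SPEED_OF_SOUND):
    """Wavelength in metres from the standard relation lambda = c / f."""
    return c / frequency_hz

source_size = 0.17       # metres; hypothetical source dimension for illustration

# Ratio of source size to wavelength at a few audible frequencies.
size_ratios = {f: source_size / wavelength(f) for f in (20, 100, 1000, 20000)}
```

Under these assumptions the source spans about a hundredth of a wavelength at 20 Hz and about ten wavelengths at 20 kHz, which is consistent with the description of a near-omnidirectional pattern at low frequencies and more complex patterns at high frequencies. No specific threshold ratio is implied.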

3.1.3 Sound waves and sound pressure

The waves of compressions and rarefactions then travel through the air as a sound wave. Generally, the propagation speed is the same for all frequencies and amplitudes within the audible range of sounds, and the sound signal does not change in spectral content during the travel. This means that when a microphone is placed at any distance from the source within the range of a few tens of metres, the signal originating from the source will be the same but stronger at closer distances, and weaker at farther distances, assuming no reflections are present, as found in an anechoic chamber. The presence of reflections may have a strong impact on the spectrum measured at the microphone position. Additionally, certain non-linear effects in the air absorb more high frequencies than low frequencies, and after several hundreds of metres of travelling the high frequencies of a signal are significantly attenuated compared to the low frequencies. This leads to, for example, the sound of lightning being perceived like a “crack” at short distances, and a “rumbling” when heard from several kilometres (Ingaard, 1953). One of the most important quantities in acoustics is the sound pressure. Pressure in general describes force per area in pascal units, [Pa] = [N/m²]. Sound pressure in turn is the deviation of pressure from static pressure in a medium, most often in the air, due to sound waves, at a specific point in space. The time-dependent instantaneous pressure measured at the point is $p(t)$, and its absolute value $|p(t)|$ is the instantaneous amplitude. However, usually the term “pressure” refers to the root mean square or RMS value of the

sound pressure signal $p(t)$. It is defined as

$$p_{\mathrm{rms}} = \sqrt{\frac{1}{t_2 - t_1} \int_{t_1}^{t_2} p(t)^2 \, dt}, \tag{3.1}$$

where the time range of integration can be one period for a periodic signal, or a long enough (ideally infinite) time span for non-periodic signals. For a pure tone (sinusoidal signal), the peak value is $\hat{p} = \sqrt{2}\, p_{\mathrm{rms}}$.

A homogeneously vibrating object that is small in dimensions compared to the wavelength can be approximated as a spherical, or point, source. The spherical wave propagates from the source with sound velocity c. As the spherical wave propagates further, the energy that it carries is distributed over a larger area, which is proportional to the square of the radius of the sphere, so that the energy per unit area is proportional to the inverse square of the radius. Since sound pressure is proportional to the square root of the energy per unit area, the pressure carried by the wave is inversely proportional to the distance r from the midpoint (symmetry point) of the source. This means that at low frequencies such sources as a typical loudspeaker or a human speaker are practically spherical wave sources, and that the measured pressure is halved by doubling the measurement distance, i.e. a 6 dB reduction in sound pressure level. When two microphones are at different distances $r_1$ and $r_2$ from a spherical source, and the measured root-mean-square pressures are $x_1$ and $x_2$, it holds that

$$x_1 r_1 = x_2 r_2. \tag{3.2}$$

This is commonly referred to as the inverse square law and is valid only for free field cases, i.e. where no reflections are present. The plane wave is another important special case, which is caused by a large and homogeneously vibrating plane. In a lossless medium the plane wavefront preserves its waveform and propagates without attenuation due to increasing distance.

When there are N sources in an anechoic space, the total pressure signal is obtained as

$$p_{\mathrm{tot}}(t) = \sum_{n=1}^{N} p_n(t), \tag{3.3}$$

where $p_n(t)$ are the instantaneous pressure signals from the different sources. The sound pressure signals of each wave arriving at a specific position are thus simply added together; this holds except for sound pressure level (SPL) values well above the threshold of pain.
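The RMS definition in Equation (3.1) and the distance law in Equation (3.2) can be checked numerically. The following is a minimal sketch in Python, assuming a discrete approximation of the integral by a mean over equally spaced samples; the function names `rms` and `pressure_at_distance` are illustrative and do not come from the text.

```python
import math

def rms(samples):
    """Root-mean-square value of a sequence of pressure samples [Pa].
    Discrete counterpart of Equation (3.1) for equally spaced samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def pressure_at_distance(p_ref, r_ref, r):
    """RMS pressure at distance r from a point source in the free field,
    given the RMS pressure p_ref measured at distance r_ref (Equation 3.2)."""
    return p_ref * r_ref / r

# One full period of a pure tone with a peak value of 1 Pa.
n = 1000
peak = 1.0
tone = [peak * math.sin(2 * math.pi * k / n) for k in range(n)]
p_rms = rms(tone)
print(peak / p_rms)  # ratio of peak to RMS, close to sqrt(2)

# Doubling the distance halves the pressure, a drop of about 6 dB.
p1 = pressure_at_distance(1.0, 1.0, 1.0)
p2 = pressure_at_distance(1.0, 1.0, 2.0)
level_drop_db = 20 * math.log10(p1 / p2)
print(level_drop_db)
```

The sketch applies only to the free-field case described above; with reflections present the simple distance law no longer holds.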

3.1.4

Sound pressure level

The sound pressures that human hearing can perceive and tolerate are within the range of 20 · 10⁻⁶ … 50 Pa. In contrast, the static pressure of the atmosphere has a value near 100 kPa, which is much higher than the sound pressures that humans can tolerate. The decibel is used in a special way in acoustics: a reference sound pressure $p_0 = 20 \cdot 10^{-6}$ Pa is used in the decibel calculation. This value of $p_0$ is selected to match the pressure near the threshold of hearing, i.e., the weakest sound at 1 kHz frequency that can just be perceived. The sound pressure level (SPL) $L_p$ [dB] is thus obtained from

$$L_p = 20 \log_{10}\left(\frac{p}{p_0}\right). \tag{3.4}$$

Human hearing is able to hear and tolerate sounds in the range of 0–130 dB, from the threshold of hearing to the threshold of pain. SPL values in dB are more convenient to employ than sound

Figure 3.1: The equal loudness contours and the comfortable listening area of human hearing on the frequency–SPL plane (Fletcher, 1995). The dark-grey oval presents the typical SPLs of comfortable every-day speech, and light-grey oval presents the same region for music. Axes: frequency [Hz] against sound pressure level [dB]; the contours are labelled with loudness level [phon], and the lowest curve is the hearing threshold (MAF).

pressure values in Pascal units, since a change of ∼1 dB in SPL leads to a noticeable change in loudness with both weak and strong sounds. If the SPLs $L_{p1}$ and $L_{p2}$ created by sources 1 and 2, respectively, have been measured, it is sometimes handy to estimate the SPL value in the case where both sources are active at the same time. In such a case, the corresponding RMS pressure values $p_1$ and $p_2$ are first computed from the level quantities using Equation (3.4). If the signals are incoherent, i.e., they originate from independent sources, the resulting RMS pressure value can be computed as the square root of the sum of the squared individual pressure values

$$p_{\mathrm{tot,rms}} = \sqrt{\sum_n p_n^2}. \tag{3.5}$$

This means that if two incoherent sources each cause a level of X dB at a certain position, the resulting level will be X + 3 dB. In the rare case when the signals are coherent, for example if two loudspeakers are producing identical signals (differences in level are allowed), and the measurement position is at equal distance from the sources, the signals “add up in phase”, and it holds that

$$p_{\mathrm{tot,rms}} = \sum_n p_n. \tag{3.6}$$
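The level arithmetic of Equations (3.4) to (3.6) can be sketched in a few lines of Python. The function names below are illustrative assumptions, not part of the text; the reference pressure is the value of $p_0$ given above.

```python
import math

P0 = 20e-6  # reference sound pressure p0 [Pa]

def spl_from_pressure(p_rms):
    """Sound pressure level [dB] from RMS pressure [Pa], Equation (3.4)."""
    return 20 * math.log10(p_rms / P0)

def pressure_from_spl(level_db):
    """RMS pressure [Pa] from sound pressure level [dB], inverse of Equation (3.4)."""
    return P0 * 10 ** (level_db / 20)

def sum_incoherent(levels_db):
    """Total SPL of incoherent sources: squared pressures add, Equation (3.5)."""
    p_tot = math.sqrt(sum(pressure_from_spl(level) ** 2 for level in levels_db))
    return spl_from_pressure(p_tot)

def sum_coherent(levels_db):
    """Total SPL of coherent, in-phase sources: pressures add, Equation (3.6)."""
    p_tot = sum(pressure_from_spl(level) for level in levels_db)
    return spl_from_pressure(p_tot)

print(sum_incoherent([70, 70]))  # about 73 dB, i.e. X + 3 dB
print(sum_coherent([70, 70]))    # about 76 dB, i.e. X + 6 dB
```

For two equal sources the coherent sum doubles the pressure, which corresponds to an increase of about 6 dB, compared with about 3 dB for the incoherent case.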

Figure 3.2: Weighting curves A, B, C, and D for sound level measurement. Axes: frequency [Hz] against relative sound pressure level [dB].

3.1.5

Frequency-weighted measurements for sound level

In most cases when a sound is amplified to have a higher sound pressure level (SPL), the perceived loudness will be higher. Loudness is a quantity that reflects the listener's perception of the strength of a sound on a scale from silent to loud, and it is discussed further in Section 3.3. However, the SPL is generally not a good estimate of the perceived loudness, for many reasons, one being that subjective level perception is strongly frequency-dependent, as can easily be seen from the equal loudness contours of the hearing system (ISO 226, 2003) shown in Figure 3.1. These contours were measured using subjects who compared a 1 kHz reference tone at a given SPL with a test tone of another frequency, adjusting the latter to have the same perceived loudness. As can be seen in the figure, the contours show a prominent dependence on frequency. For example, in the range 1–4 kHz, around the resonance of the ear canal, sounds need a lower SPL to have equal loudness than sounds at other frequencies. In general, at high and low frequencies a higher sound pressure level is needed than at midrange frequencies to achieve equal loudness. Note also that the shapes of the contours at low frequencies depend on SPL: with lower SPLs the curves are steeper at low frequencies. As a consequence, sounds played at a relatively low level seem to lack some of the low frequencies, or bass.

As the SPL in decibels is defined equally for all frequencies in Equation (3.4), it does not account for such complex frequency and level dependencies. To obtain a simple, yet better, measure of the perceived loudness of a sound, the concept of sound level, a.k.a. frequency-weighted sound pressure level, is defined. It is a measure with a frequency-dependent weighting which roughly approximates the frequency sensitivity of hearing shown in Figure 3.1. Four different weighting curves have been defined: the A, B, C, and D curves, as shown in Figure 3.2 (IEC 61672-1, 2013).

The A-weighted sound level is often used in noise measurements to characterise the perceived loudness, and also the risk of hearing loss. It slightly emphasises the levels at mid frequencies and attenuates them at low and high frequencies. A-weighting is just a very rough estimate, as can be seen when the weighting curve is compared with the inverted equal-loudness curves. Technical simplicity and extensive use in practice are its advantages. The other curves (B and C) are meant to be used with mid-range and high SPLs, respectively, and the D curve was originally developed for noise at air fields. However, they are rarely used. The unit of all sound levels is the decibel [dB], and it is common to indicate the weighting curve as well, for example dB(A) for the A-weighting curve, which corresponds to the equal loudness contour at a level of ∼40 phon. Some exemplary sound level values in everyday environments are shown in Figure 3.3.
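The A-weighting curve can be evaluated with a closed-form expression. The sketch below uses the commonly quoted analytic form with rounded pole frequencies (20.6, 107.7, 737.9 and 12194 Hz) and a +2.00 dB normalisation so that the weighting is 0 dB at 1 kHz; these constants are not given in the text, and the normative values should be taken from IEC 61672-1. The function name `a_weighting_db` is illustrative.

```python
import math

def a_weighting_db(f_hz):
    """Approximate A-weighting gain [dB] at frequency f_hz [Hz].
    Uses the commonly quoted rounded constants (an assumption of this sketch)."""
    f2 = f_hz * f_hz
    numerator = (12194.0 ** 2) * f2 * f2
    denominator = ((f2 + 20.6 ** 2)
                   * math.sqrt((f2 + 107.7 ** 2) * (f2 + 737.9 ** 2))
                   * (f2 + 12194.0 ** 2))
    # The +2.00 dB term normalises the curve to 0 dB at 1 kHz.
    return 20 * math.log10(numerator / denominator) + 2.00

for f in (31.5, 100, 1000, 4000, 16000):
    print(f, round(a_weighting_db(f), 1))
```

An A-weighted level is then obtained by applying this gain to each frequency component before the squared pressures are summed, in the same way as for incoherent components.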

3.1.6

Capturing sound signals

A microphone transforms a sound signal (typically pressure) propagating in the air into a corresponding electric signal (voltage) (Eargle, 2004). Optimally, it should convert a wide range of frequencies, say from 20 Hz to 20 kHz, with a flat frequency response, low distortion and noise, and have a desired and known directional pattern. The microphone is a fundamental tool to capture and measure sound pressure. The most important microphone type nowadays is the condenser microphone and its variants. The instantaneous variations of sound pressure make one electrode of the condenser move while the other is kept fixed. When a constant electric charge is maintained between the electrodes, a change in the distance between them changes the voltage between them. When this weak voltage signal is amplified by an amplifier close to the microphone unit, the signal can be fed through a cable to audio equipment over reasonable distances. A variant of this is the electret microphone, which is based on an electret material that holds a permanent electric charge between the electrodes, so that no external voltage is needed. The built-in amplifier is required here as well. Electret microphones have found widespread use, especially in portable devices.

Another important property of microphones is the sensitivity to sound from different directions, also known as the directional pattern. A pressure microphone has an omnidirectional pattern, which means that it is equally sensitive to all directions. If the directional pattern is a dipole that reacts along the main axis from the front and back but not from the sides, the microphone is sensitive to the velocity of particles in air. A cardioid microphone, which is a useful compromise, is maximally sensitive from the front and minimally sensitive from the rear. It is also possible to construct highly directional microphones that are sensitive mainly to sound coming from a narrow spatial angle.
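The three basic directional patterns mentioned above can be described with the idealised first-order form a + (1 − a) cos θ, where θ is the angle from the main axis. This formula is not given in the text; it is a standard textbook idealisation used here as an assumption, and the function name `first_order_pattern` is illustrative.

```python
import math

def first_order_pattern(theta_rad, a):
    """Sensitivity of an idealised first-order microphone towards angle theta.
    a = 1 gives an omnidirectional pattern, a = 0 a dipole, a = 0.5 a cardioid.
    A negative value means the output polarity is inverted."""
    return a + (1.0 - a) * math.cos(theta_rad)

PATTERNS = {"omnidirectional": 1.0, "cardioid": 0.5, "dipole": 0.0}

for name, a in PATTERNS.items():
    gains = [first_order_pattern(math.radians(deg), a) for deg in (0, 90, 180)]
    print(name, [round(g, 3) for g in gains])
```

Real microphones deviate from these ideal shapes, particularly at high frequencies where the dimensions of the capsule are no longer small compared to the wavelength.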
Different types of microphones are needed depending on the application. The choice of microphone and directional pattern depends on the source that is to be recorded, on background noise, on the acoustics of the recording room and on the application where the sound is needed. For example, if a voice is to be recorded in the way that humans typically receive it, a quiet room with relatively few reflections should be selected, and a microphone with a flat response and a cardioid or similar directivity should be used at a distance of about 1 meter from the source; this captures a signal similar to the one that a human listener receives. Even higher similarity to human ear canal signals can be obtained with binaural mannequins, which are replicas of human head and torso geometry with microphones mounted at the ends of the ear canals, such as those shown in Figure 3.4. Conversely, if the question is “how much sound does a source radiate into a specific direction?”, then an anechoic chamber should be used, and a microphone with a very flat response should be employed to measure the directional characteristics of the source. When recording outdoors, in the presence of competing sound sources, or in very strong reverberation, it would be better to use as directional a microphone as possible; however, the downside of this

Figure 3.3: Typical environmental sound levels, i.e., sound pressure levels weighted with A-weighting curve. Courtesy of FORCE Technology, SenseLab. The chart places everyday sounds on a sound level scale from 0 to 120 dB(A); entries include the threshold of hearing, a quiet bedroom, a whisper at 0.3 m, normal speech at 1 m, a vacuum cleaner at 1 m, the noise limit at workplaces (85 dB(A), 8 hour average) and the threshold of pain.

Figure 3.4: Example of two binaural microphones, a.k.a. dummy head or head and torso simulator. The ear canals of the mannequins are equipped with high-quality microphones. The ear canal signals recorded with this device mimic the ear canal signals heard by a real human listener.

is that directional microphones have a less flat frequency response and also higher self-noise at low frequencies. More details of microphones can be found in (Eargle, 2004).

3.1.7

Digital presentation of signals

In digital signal processing (DSP), electric signals from microphones are converted into sequences of numbers, processed in the digital domain, and finally converted back into electric signals (Oppenheim et al., 1983; Mitra and Kaiser, 1993; Strawn, 1985). The conversion into the digital domain is performed by analog-to-digital conversion (A/D-conversion). Conversely, a digital signal (a number sequence) is converted back to continuous-time form by digital-to-analog conversion (D/A-conversion). Digital signal processing has many advantages compared to techniques based on dedicated analog electric circuits. Sound signals can be stored easily in digital form; the result is always predictable. For example, real-time spectrum analysis can be implemented using the FFT.

When converting between digital and analog signals, the sampling theorem (or the Nyquist theorem) states that the sampling rate must be at least twice as high as the highest frequency component of the signal. If this is not met, aliasing will occur. In aliasing, those signal components with a frequency higher than the Nyquist frequency, which equals half of the sampling rate, are mirrored to below the Nyquist frequency. Such aliasing thus produces added frequency components that distort the signal. An A/D-converter normally includes a low-pass filter that yields enough attenuation above the Nyquist frequency to avoid aliasing. D/A-conversion also typically includes a low-pass filter (reconstruction filter) to make the

Figure 3.5: PCM representation and 4-bit quantisation of a sinusoidal signal (16 levels): (a) quantisation curve (input against output), (b) sinusoidal analog input, and (c) quantised signal waveform.

output continuous in time and free from frequencies above the Nyquist frequency. Thus, in practice digital signal processing deals with band-limited signals. Sampling rates commonly found in audio technology are 44.1 kHz (Compact Disc), 48 kHz (professional audio), 32 kHz (less demanding audio), and for very demanding audio 96 kHz or even 192 kHz.

The numeric stream originating from A/D-conversion may be coded in different ways to be stored in the memory of a computer. The most straightforward representation is Pulse-Code Modulation (PCM) coding, where each sample is quantised into a binary number whose number of bits determines the precision of the result. The principle of such a conversion using four bits is shown in Figure 3.5. In practice, the sample values of an analog signal are mapped onto binary numbers, which results in $2^n$ discrete levels, where n is the number of bits. Quantisation unavoidably generates quantisation noise. The signal-to-noise ratio (SNR) is an attribute that states the level of the signal compared to the level of the noise. For quantisation noise, the SNR improves by about 6 dB for each additional bit. For example, with n = 16, which is often used in audio, a maximum SNR of about 96 dB is obtained. If the total dynamic range of hearing (about 130 dB) is targeted, even more than 22 bits may be needed. However, very often a reduced range of 16 bits is more than enough, since very high and very low SPLs are not needed.

PCM coding produces a relatively high data rate. For example, a 44.1 kHz sample rate with 16-bit quantisation of a two-channel stereophonic input produces 1.4 Mbit/s, which results in large files and high demands on communication channels. In the field of lossy audio coding the aim is to find methods that reduce the bitrate in such a way that only those components of sound that are audible to the listener are stored, and all inaudible components are disregarded.
There exist many standardised methods such as MPEG-1 Layer-3 (MP3, ISO/IEC (1993)) and MPEG-2 Advanced Audio Coding (AAC, ISO/IEC (1997); Bosi et al. (1997)), where the main principle of reducing the bitrate is to transform the signals into the time-frequency domain and to optimise the quantisation of the time-frequency samples using a computational model of perceptual masking curves. In most cases the sound quality obtained with such coding methods is very good or excellent; however, small audible differences often still remain after coding. In sensory evaluation of

Figure 3.6: The block diagram of an audio encoder and decoder based on perceptual masking. Adapted from Brandenburg (1999). Encoder blocks: audio in, analysis filterbank, quantization & coding, encoding of bitstream, with a perceptual model. Decoder blocks: bitstream in, decoding of bitstream, inverse quantization, synthesis filterbank, audio out.

sound, such coding to reduce storage or transmission requirements is generally not needed, as computers come with large amounts of digital storage, and data transmission is not a major obstacle. Furthermore, if the sound signals studied in listening tests were coded with some lossy audio coding system, it would not be possible to know afterwards whether measured perceptual differences between sounds were caused by the coding algorithm or by the phenomenon under study.
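The sampling and quantisation figures quoted in this section can be reproduced with a short Python sketch. It assumes a uniform quantiser over the range [−1, 1] and the rule of thumb of 20 log10(2^n) dB for the quantisation SNR, which gives the "about 6 dB per bit" used above; all function names are illustrative and not taken from the text.

```python
import math

def alias_frequency(f_hz, fs_hz):
    """Frequency [Hz] at which a component at f_hz appears when sampled at
    fs_hz without an anti-aliasing filter (mirrored around the Nyquist frequency)."""
    f_mod = f_hz % fs_hz
    return f_mod if f_mod <= fs_hz / 2 else fs_hz - f_mod

def quantise(x, n_bits):
    """Map x in [-1, 1] onto the nearest of 2**n_bits uniformly spaced levels."""
    levels = 2 ** n_bits
    step = 2.0 / levels
    index = min(levels - 1, int((x + 1.0) / step))
    return -1.0 + (index + 0.5) * step

def quantisation_snr_db(n_bits):
    """Rule-of-thumb quantisation SNR [dB]: about 6 dB per bit."""
    return 20 * math.log10(2 ** n_bits)

def pcm_bitrate(fs_hz, n_bits, channels):
    """Bitrate [bit/s] of uncompressed PCM audio."""
    return fs_hz * n_bits * channels

# 4-bit quantisation of one period of a sinusoid, as in Figure 3.5.
sine = [math.sin(2 * math.pi * k / 1000) for k in range(1000)]
quantised = [quantise(s, 4) for s in sine]
levels_used = sorted(set(quantised))
max_error = max(abs(q - s) for q, s in zip(quantised, sine))

print(len(levels_used))                   # number of distinct levels used
print(alias_frequency(30000, 48000))      # a 30 kHz tone sampled at 48 kHz
print(round(quantisation_snr_db(16), 1))  # about 96 dB for 16 bits
bits_for_130_db = math.ceil(130 / quantisation_snr_db(1))
print(bits_for_130_db)
print(pcm_bitrate(44100, 16, 2))          # CD-quality stereo, about 1.4 Mbit/s
```

The quantisation error of this quantiser never exceeds half a step, which is the origin of the quantisation noise discussed above.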

3.1.8

Further reading

To better understand the general fundamentals of acoustics, the reader is referred to such books as (Morse and Ingard, 1968; Beranek and Mellow, 2012). Room, building and architectural acoustics, including concert hall acoustics, are covered in (Barron, 2009; Beranek, 2004; Ando, 2012). The acoustics of musical instruments is discussed in (Fletcher and Rossing, 1998) and the singing voice is discussed in (Sundberg, 1977). Acoustic measurement techniques are introduced in (Beranek, 1988) and in (Crocker, 1997, part XVII). An introduction to loudspeakers and microphones can be found in (Colloms, 1997; Borwick, 2001; Eargle, 2004). The measurement and equalisation of acoustical and audio systems are discussed in (Borwick, 2001; Klippel, 2006).

3.2 HEARING – MECHANISMS AND BASIC PROPERTIES

The purpose of hearing is to capture sound arriving at the ear and analyse the content of the signal to deliver information about the acoustical surroundings to the higher levels in the brain. The auditory system divides ear-canal signals into several neural signals with narrow frequency content, and then conducts a sophisticated analysis on the signals in the frequency bands.

3.2.1

Anatomy of the ear

The ear consists of the external ear for capturing sound waves traveling in the air, the middle ear for mechanical conduction of the vibrations of the eardrum, and the inner ear for mechanical-to-neural transduction. The cross-section of the ear is shown in Figure 3.7, where the external (outer), the middle

Figure 3.7: The cross-section of one ear. Labelled parts: pinna, concha, ear canal, eardrum, malleus, incus, stapes and oval window, semicircular canals, auditory nerve, cochlea, round window and Eustachian tube; the regions marked are the external, middle and inner ear.

ear, and part of the inner ear are depicted. The main role of the pinna is to filter the sound at high frequencies depending on direction. The spectral cues that the pinna introduces to the sound are then used in directional hearing.

The ear canal is a relatively hard-walled tube that is approximately 22.5 mm long with a diameter of about 7.5 mm, so that its volume is about 1 cm³. Acoustically, the canal is a short transmission line where sound waves propagate from the external environment to the eardrum. The sound waves arriving at the pinna are fed into the ear canal largely independent of the direction of arrival, since the diameter is small compared to the wavelengths of most audible frequencies. The ear canal acts as a quarter-wavelength resonator, since one end is open and the other is sealed by the eardrum and middle ear. This emphasises signals at frequencies around 3–4 kHz by about 10 dB, attenuates them at around 7–8 kHz, and shows the next (weak) resonance above 10 kHz (see Figure 3.1). The eardrum is a membrane that moves with incoming sound waves and thus converts sound waves into mechanical vibrations and passes them to the middle ear.

As shown in Figure 3.7, the middle ear is located in a small air-filled cavity between the eardrum and the inner ear. It is a transmission system of mechanical vibration from the eardrum through little bones called ossicles to the oval window, which is a membrane-covered opening leading to the fluid-filled cochlea of the inner ear. The middle ear is connected to the oral cavity by a narrow channel called the Eustachian tube. The function of this tube is to balance the static air pressure between the middle ear and the environment. If the Eustachian tube is blocked and the static pressure is higher or lower in the middle ear than in the environment, this displaces the eardrum and reduces the sensitivity of hearing, and may even cause pain.
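The ear canal dimensions quoted above can be used for a rough check of the resonance frequencies and the volume. The sketch below assumes an ideal rigid-walled cylinder closed at one end and a speed of sound of 343 m/s (a value not given in the text); the function names are illustrative.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed value for air at about 20 degrees Celsius

def quarter_wave_resonances(length_m, count=2, c=SPEED_OF_SOUND):
    """Resonance frequencies [Hz] of an ideal tube open at one end and closed
    at the other: odd multiples of c / (4 * length)."""
    return [(2 * k - 1) * c / (4 * length_m) for k in range(1, count + 1)]

def cylinder_volume_cm3(length_m, diameter_m):
    """Volume [cm^3] of a cylinder with the given length and diameter [m]."""
    return math.pi * (diameter_m / 2) ** 2 * length_m * 1e6

resonances = quarter_wave_resonances(0.0225)
volume = cylinder_volume_cm3(0.0225, 0.0075)
print([round(f) for f in resonances])  # first resonance near 3.8 kHz, next above 10 kHz
print(round(volume, 2))                # close to 1 cm^3
```

The idealised values agree with the 3–4 kHz emphasis and the weak resonance above 10 kHz described above; a real ear canal is neither straight nor rigidly terminated, so the actual resonances are broader and vary between individuals.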

3.2.2

Structure of the cochlea

The conversion from acoustical vibration into neural pulses happens in the cochlea, which is a spiral-shaped and liquid-filled tube of about 2.7 turns having a length of 35 mm, as shown in Figure 3.7 (Plack, 2013). The cochlea has two membrane-covered openings to

Figure 3.8: The linearised structure of the cochlea. Labelled parts: stapes, oval window, round window, bony shelf, Reissner’s membrane, basilar membrane, helicotrema, base (basal end) and apex (apical end).

the middle ear called the oval window and the round window. The oval window routes the movements of the stapes to the liquid medium inside the cochlea. The basilar membrane is the first frequency-sensitive part in the auditory pathway. It is illustrated in the linearised model shown in Figure 3.8. The helicotrema is an opening at the apical end of the basilar membrane, which connects the liquid on both sides of the membrane. The membrane is relatively narrow, as it reaches only about 0.5 mm at its widest point, which is at the apical end (Plack, 2013).

The cochlea is depicted in Figure 3.9 as a cross-section, showing other structural elements inside the tube. The basilar membrane and the bony shelf divide the tube into two halves. The anatomical names of the different sections are shown in the figure. The bony wall of the scala media has a structure that leaks potassium ions (K+) into the liquid in it. As a result, the ion concentrations of the liquids in the scala media and in the other tunnels are different, and a small electric potential difference exists across the basilar membrane. Figure 3.9 also shows in detail the structure of the basilar membrane and the vibration-sensitive neural structures attached to it. There exist receptors called hair cells, which are specialised cells that convert the movements of the basilar membrane into neural impulses, which are routed into the auditory nerve fibres. The cells are tiny: their length is about 12 µm at the basal end and about 90 µm at the apical end. The inner hair cells sense the motion of the basilar membrane and send pulses towards higher stages. The function of the outer hair cells is not to sense the motion, but to control the mechanical vibration state of the system as part of an active feedback loop.

3.2.3

Passive frequency selectivity of the cochlea

The main role of cochlear processing is to route vibrations with different frequencies to different positions along the basilar membrane. Many passive properties of the membrane already implement this. The membrane is narrower, stiffer, and less massive at the basal end (oval window side) and gradually becomes wider, more loosely moving, and more massive towards the opposite (apical) end. An inherent property of such a system is that each point along the membrane has a specific frequency of resonance.

3.2.4

Active feedback loop in the cochlea

The passive selectivity to frequency of the basilar membrane does not explain the great sensitivity to different frequencies that listeners have. It has been found that the collaboration of the basilar membrane and the hair cells makes a complex signal processing system

Figure 3.9: Cross-sectional structure of the cochlea. Labelled parts: scala vestibuli, scala media, scala tympani, Reissner's membrane, tectorial membrane, basilar membrane, bony shelf, outer and inner hair cells, and auditory nerve.

where the cells themselves insert motion into the basilar membrane, enhancing the selectivity. The outer hair cells can actively amplify the mechanical motion. The amplification is shown in Figure 3.11(a) as a function of input frequency. The amplification is high and highly frequency-selective for low sound pressure levels at frequencies near 9 kHz (Ruggero et al., 1997). With higher levels, the best frequency is lower, and the response is less selective in frequency. The system thus evidently has non-linear positive feedback, whereby very weak signals are strongly emphasised in amplitude and in frequency-selectivity, and the cochlear resonance is thereby sharpened. The mechanism behind this has been found to be the action of the outer hair cells, which modulate their length when they are excited by motions of the basilar membrane or by signals arriving from higher stages of brain processing (Schnupp et al., 2011).

Since the system is a real biological organ, it can move only with limited displacement and velocity. When the level of the stimulus increases, the active amplification decreases, up to levels of 60–80 dB, above which the signal is no longer amplified. This is characterised in Figure 3.11(b) as an input-output relation with the input level on the x-axis. A consequence of the active feedback loop in the cochlea is that the system cannot be assumed to be a simple linear filter bank. This manifests as non-linear features in the response as a function of frequency, time and level.

3.2.5

Mechanical to neural transduction

The vibrations of the basilar membrane are converted into neural pulses by the inner hair cells. The vibrations lead to the bending of the hair cells, which in turn causes the small openings on their tops to open for K+ ions, as shown in Figure 3.10. The K+ ions increase the potential difference across the membrane of the cell compared to the potential in the auditory nerves. The high potential difference triggers the firing of the neurons of the auditory nerve. Each inner hair cell is connected to about 20 neurons, and each neuron receives input from one inner hair cell only. As each hair cell has a certain resonant frequency, the fibres of the auditory nerve inherit the same frequency selectivity. The selectivity is preserved throughout the neural

Figure 3.10: Top: A schematic structure of the hair cells (inner hair cell and outer hair cell). The endings of auditory nerve in the bottom of a cell route the firings of the cells towards upper stages in hearing if shaded light-grey and the dark-grey endings bring impulses from upper stages to the cell. Below: When the stereocilia is bent right the channels are opened for K+ ions.

processing stages up to the auditory cortex. Also, the auditory neurons in the brain are often spatially ordered by the centre frequency of their frequency-selectivity.

3.2.6

Non-linearities in the cochlea

There are many deviations from linear and time-invariant behaviour in the neural responses. Perhaps the two most prominent cochlear non-linearities are two-tone suppression and combination tones. In two-tone suppression, the addition of a second input tone at another frequency suppresses the activation caused by the test tone at its characteristic frequency (Delgutte, 1990). This is illustrated in Figure 3.12 in the form of a tuning curve, where the shaded areas represent suppression of the probe tone by the secondary tone. This phenomenon can be interpreted as masking, where stronger signal components dominate neighbouring weaker signal components because of the non-linear cochlear processes.

When two sinusoids are used as sound signals, strong combination tones also exist in the cochlea, which is a phenomenon common to all non-linear systems. The difference tone $f_{\mathrm{diff}} = f_2 - f_1$ can be perceived relatively easily. For example, frequencies $f_1 = 2.0$ kHz and $f_2 = 2.1$ kHz produce the perception of a faint low-frequency tone at 100 Hz. Other combination tones exist as well. An interesting example is the cubic difference tone with frequency $f_{\mathrm{cubic}} = 2f_1 - f_2$. For example, frequencies $f_1 = 2.0$ kHz and $f_2 = 2.1$ kHz create $f_{\mathrm{cubic}} = (2 \cdot 2.0 - 2.1)$ kHz $= 1.9$ kHz. This tone is generated already with low-level stimuli, in contrast to the general behaviour of non-linear distortion, which requires that the system is driven to the limits of its capabilities. The cochlea thus cannot be seen as a simple Fourier analyser of ear canal signals.
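The appearance of a cubic difference tone in a non-linear system can be illustrated numerically. The sketch below passes the two-tone example from the text through a memoryless cubic non-linearity, y = x + 0.1 x³, and measures the component at 2f1 − f2. The non-linearity and its coefficient are arbitrary assumptions chosen for illustration and are not a model of the cochlea; the function name `tone_amplitude` is likewise illustrative.

```python
import cmath
import math

def tone_amplitude(signal, freq_hz, fs_hz):
    """Amplitude of the sinusoidal component at freq_hz, from a single DFT bin.
    Exact only when freq_hz * len(signal) / fs_hz is an integer."""
    n = len(signal)
    acc = sum(s * cmath.exp(-2j * math.pi * freq_hz * k / fs_hz)
              for k, s in enumerate(signal))
    return 2 * abs(acc) / n

fs = 16000              # sampling rate [Hz], assumed for this sketch
n = 1600                # 0.1 s of signal, so that DFT bins are 10 Hz apart
f1, f2 = 2000.0, 2100.0

# Linear two-tone input and the same signal after a weak cubic distortion.
x = [math.sin(2 * math.pi * f1 * k / fs) + math.sin(2 * math.pi * f2 * k / fs)
     for k in range(n)]
y = [s + 0.1 * s ** 3 for s in x]

f_diff = f2 - f1
f_cubic = 2 * f1 - f2
print(f_diff, f_cubic)
print(tone_amplitude(x, f_cubic, fs))  # practically zero: absent from the input
print(tone_amplitude(y, f_cubic, fs))  # clearly non-zero after the distortion
```

A purely cubic term produces the 2f1 − f2 component but not the simple difference tone f2 − f1, which would require an even-order (for example quadratic) term in the non-linearity.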

Figure 3.11: Measurement and characterisation of the non-linear gain in the cochlea. (a) The velocity of cat basilar membrane recorded at a single point. Sinusoids of different frequencies and levels have been used as input signals. The gain of the active cochlea is measured as velocity of the membrane divided by stimulus pressure, and it is shown as a function of stimulus frequency. After (Ruggero et al., 1997), with permission from Acoustical Society of America. Axes: stimulus frequency [kHz] against velocity/pressure gain [mm/s/Pa], with curves for stimulus levels from 5 to 80 dB. (b) A schematic plot of the gain function at a single position in the cochlea showing the dependence on the stimulus level. Axes: input SPL [dB] against velocity [dB].

3.2.7

Temporal response of the cochlea

As already discussed, each auditory nerve fibre connects to a single hair cell, and a neural impulse, or “firing”, is seen in it when the voltage of the hair cell exceeds a certain limit. The connection between the hair cell and the auditory nerve fibre can be either weaker or stronger. As a consequence, impulses enter the more strongly connected nerve fibres at a lower threshold, whereas other fibres need a higher potential in the hair cell before an impulse is transmitted. In many cases the neurons “phase lock” to a periodic input signal, i.e., they synchronise their firing with a certain phase of the signal. The neural activity waveform is approximately a half-wave rectified (limited to positive values) form of the mechanical excitation waveform present in each partition of the basilar membrane. The synchronisation occurs at repetition frequencies below 1500 Hz. It degrades at frequencies above 1500 Hz, and vanishes above 4 kHz. If we consider a single neuron, the firings do not reflect much of the characteristics of the input signal. However, the processing is based on a large number of neurons working together. In the best case this makes the output of such neural networks very precise and robust.

3.2.8

Frequency selectivity of hearing

Let us first consider a simple case with two tonal components delivered to a listener. The non-linear masking effects shown in Figure 3.12 imply that a frequency component affects the perception of other sounds in nearby frequencies. If the tones have different enough frequencies, our hearing resolves them into two separate auditory events. However, if they are located sufficiently close in frequency, a single auditory event is perceived. In other words, with two frequency-separated tones, two “whistling” sounds are perceived, but if

Figure 3.12: A secondary tone having a frequency and level falling in the shaded area will suppress the response to the probe tone. Axes: frequency [kHz] against threshold [dB], with the probe tone marked. Adapted from (Arthur et al., 1971).

they are close enough to each other, a single sound is perceived, which can be described as “fluctuating”, “modulated”, or “rough”. The ability to segregate such frequency-separated sounds is called frequency resolution. Frequency resolution and selectivity are important properties in basic hearing sciences, and also in audio technology and quality applications. Frequency selectivity stems from cochlear processing. Each hair cell has a frequency to which it responds best, but it also responds to nearby frequencies. The frequency region of input signals to which a hair cell responds is called the critical band or the auditory filter. The frequency-dependent width of this band is of great interest in sound and voice technologies. A classic method to measure the width of critical bands is based on the perceived loudness of signals as a function of their bandwidth. A narrowband noise is used as the reference sound, to which the subject compares a test sound consisting of noise with a broader bandwidth. The sound pressure level and centre frequency of the test sound are, however, equal to those of the reference sound. The change in perceived loudness is measured. Interestingly, the loudness is constant up to a certain value of bandwidth (Fastl and Zwicker, 2007). At a centre frequency of 1 kHz this bandwidth is 160 Hz, beyond which the loudness starts to increase. The knee point where the loudness starts to increase is thought to be the position where the spectrum of the test sound spreads over more than a single critical band. The value of the bandwidth at the knee point is regarded as the measure of the bandwidth of the critical band. The critical bandwidths measured using this approach are called Bark bandwidths. The bandwidths $\Delta f_{\mathrm{Bark}}$ [Hz] have been measured with listening tests, and they can be estimated as

$$\Delta f_{\mathrm{Bark}} = 25 + 75\left[1 + 1.4\,(f_c/1000)^2\right]^{0.69}, \qquad (3.7)$$

where $f_c$ is the centre frequency in Hz. The widths are shown in Figure 3.13(a) as a function of the centre frequency $f_c$.
The bandwidth is 100 Hz at low frequencies, and above 500 Hz the width increases with frequency. Near the highest audible frequencies the width is several kHz. The bandwidth of the auditory filters has also been measured using the concept of ERB (Equivalent Rectangular Bandwidth) bands. In the method the bandwidths of the filters have been estimated using the unrealistic but convenient simplification of estimating the auditory filters as rectangular band-pass filters. The frequency widths of filters can be

Figure 3.13: Bark and ERB bands presented as a function of the centre frequency fc. (a) Estimates of the critical bandwidths ∆f [Hz] against frequency [Hz]. (b) Corresponding Q-values, Q = fc/∆f, against frequency [Hz], shown together with 1/2-, 1/3- and 1/6-octave bandwidths.

measured with a listening test, where bands of masking noise are applied above and below the test tone, which prevents the listener from detecting the tone through parts of the cochlea that respond outside the frequency region being tested. The detection threshold of the tone is measured as a function of the width of the notch 2∆f, which is assumed to give an estimate of the width of the critical band. The widths obtained in listening tests are typically 11–17% of the value of the centre frequency fc (Glasberg and Moore, 1990), and can be estimated as

$$\Delta f_{\mathrm{ERB}} = 24.7 + 0.108\, f_c. \qquad (3.8)$$

The bandwidth follows a logarithmic relationship with the centre frequency over a larger range than the Bark bandwidth. This is seen from the fact that the Q-value shown in Figure 3.13(b) changes less with frequency. The bandwidth of hearing also changes with level: the higher the level, the poorer the frequency resolution. In technical acoustics the critical bandwidth is often approximated simply by 1/3-octave or 1/6-octave bandwidths. An octave is a frequency range which covers the frequencies $f \ldots 2f$, a third-octave covers the range $f \ldots \sqrt[3]{2}\,f$, and a sixth-octave covers the range $f \ldots \sqrt[6]{2}\,f$. Figure 3.13(b) shows these bandwidths as Q-values together with the Q-values of the estimated ERB and Bark critical bands.
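As a rough numerical check of Equations (3.7) and (3.8), the following Python sketch evaluates both bandwidth estimates and the corresponding Q-values, Q = fc/∆f. The function names are illustrative, and the Q-value of a fractional-octave band is computed here under the assumption that the centre frequency is the geometric mean of the band edges, which the text does not specify.

```python
import math


def bark_bandwidth(fc):
    """Critical (Bark) bandwidth in Hz from Eq. (3.7); fc in Hz."""
    return 25.0 + 75.0 * (1.0 + 1.4 * (fc / 1000.0) ** 2) ** 0.69


def erb_bandwidth(fc):
    """Equivalent rectangular bandwidth in Hz from Eq. (3.8); fc in Hz."""
    return 24.7 + 0.108 * fc


def fractional_octave_q(n):
    """Q of a 1/n-octave band with edges f and 2**(1/n) * f.

    Assumption: the centre frequency is the geometric mean of the edges.
    """
    ratio = 2.0 ** (1.0 / n)
    return math.sqrt(ratio) / (ratio - 1.0)


for fc in (100.0, 1000.0, 10000.0):
    bark = bark_bandwidth(fc)
    erb = erb_bandwidth(fc)
    print(f"fc = {fc:7.0f} Hz: Bark {bark:7.1f} Hz (Q = {fc / bark:4.1f}), "
          f"ERB {erb:7.1f} Hz (Q = {fc / erb:4.1f})")

print(f"Q of 1/3 octave: {fractional_octave_q(3):.2f}")
print(f"Q of 1/6 octave: {fractional_octave_q(6):.2f}")
```

At 1 kHz the Bark estimate comes out close to the 160 Hz quoted above, and the ERB estimate is about 13 % of the centre frequency, within the 11–17 % range reported from listening tests.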

3.2.9

Filterbank model of the cochlea

The functioning of the cochlea can be understood by studying a simple auditory model as shown in Figure 3.14. The figure illustrates how an auditory spectrum can be derived with a model employing a filter bank to emulate the time-frequency resolution. The band-pass filters implement roughly the active and passive mechanical frequency selectivity of the cochlea and resulting narrowband signals are further processed by signal envelope detection using half-wave rectification and low-pass filtering corresponding to the monaural time resolution. The temporal effects that occur in first processing stages in hearing are implemented simply by filters, which simulate in this case short-term adaptation and temporal integration in temporal masking.
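The envelope-detection stage of one channel in such a model can be sketched as follows. This is a minimal illustration, not the model of Figure 3.14 itself: the band-pass stage is assumed to have been applied already (a pure tone stands in for the narrowband channel signal), the low-pass filter is a simple one-pole smoother chosen for brevity, and the adaptation and temporal integration stages are omitted. The sampling rate, tone frequency and function names are assumptions made for the example.

```python
import math


def half_wave_rectify(signal):
    """Keep positive values and set negative values to zero."""
    return [max(v, 0.0) for v in signal]


def one_pole_lowpass(signal, cutoff_hz, fs):
    """Smooth a signal with a one-pole low-pass filter (unity gain at DC)."""
    a = math.exp(-2.0 * math.pi * cutoff_hz / fs)
    out, state = [], 0.0
    for v in signal:
        state = (1.0 - a) * v + a * state
        out.append(state)
    return out


def channel_envelope(signal, fs, cutoff_hz=1000.0):
    """Half-wave rectification followed by low-pass smoothing."""
    return one_pole_lowpass(half_wave_rectify(signal), cutoff_hz, fs)


def tone(amplitude, freq_hz, fs, n_samples):
    """Sinusoid standing in for the output of one band-pass channel."""
    return [amplitude * math.sin(2.0 * math.pi * freq_hz * i / fs)
            for i in range(n_samples)]


FS = 48000   # assumed sampling rate in Hz
N = 4800     # 100 ms of signal
loud = channel_envelope(tone(1.0, 4000.0, FS, N), FS)
soft = channel_envelope(tone(0.5, 4000.0, FS, N), FS)

# Average over the last 10 ms, well after the initial transient.
mean_loud = sum(loud[-480:]) / 480
mean_soft = sum(soft[-480:]) / 480
print(f"mean channel output: loud {mean_loud:.3f}, soft {mean_soft:.3f}")
```

After the initial transient, the smoothed output settles around a level proportional to the tone amplitude (close to A/π for a sinusoid of amplitude A), so halving the input amplitude halves the channel output.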

3.2.10

Masking effects

Louder sounds make softer sounds inaudible, which is very common in everyday listening scenarios.

Figure 3.14: An auditory model implemented with a filter bank. The input is split into Bark or ERB bands by band-pass filters; each band is passed through an envelope detector, a low-pass filter (~1 kHz), adaptation, and temporal integration and temporal masking stages, producing the auditory spectrum.

Spectral masking occurs when a continuous sound makes the detection of another continuous sound with different spectral content harder, even though the spectra do not necessarily overlap (Fastl and Zwicker, 2007). Spectral masking is best described by plotting the masking threshold as a function of frequency, as illustrated here for the very simple case of a narrowband noise masking a tonal signal. The masking thresholds then have the form shown in Figure 3.15. Consider first the noise masker with a 160 Hz wide spectrum, a centre frequency of 1 kHz and a sound pressure level of 60 dB. This masker results in a slightly asymmetric curve that peaks a few decibels below the level of the masker at 1 kHz. The other noise maskers produce similar masking threshold curves, as shown in the figure.

The masking effect also depends on the SPL of the masker. The masking curves for a single centre frequency are shown in Figure 3.16. The shape of the masking threshold changes prominently when the level of the masker is increased. With higher levels the masking curve is shallower at frequencies above the centre frequency of the masker. However, the slope of the masking threshold is not affected below the centre frequency of the masker.

Only spectral masking effects have been discussed so far. Subsequent and temporally non-overlapping sounds mask each other as well, that is, a preceding sound affects the audibility of a following sound, and vice versa. Temporal masking is illustrated in Figure 3.17 both for a sound occurring before the masker, called backward masking or premasking, and for a sound occurring after the masker, called forward masking or postmasking. The thresholds of audibility under such conditions are measured using psychoacoustic tests. In the tests a sufficiently long (> 200 ms) masker sound is presented to the listener, and a brief test sound burst is presented after or before the masking sound.
Backward masking seems to appear only with inexperienced listeners, and even then the effect is mild, making backward masking only marginally interesting in the context of audio technologies. A much stronger effect is obtained with forward masking. The masker has an effect for about 150–200 ms after its offset, and relatively high-level sounds are also masked. The masking effect has also been studied with very short sounds repeated periodically (Duifhuis, 2005). In these studies the masker is an impulse, and the masking threshold of temporally nearby impulses is measured. The masking impulse most strongly affects impulses within 1 ms of its position, where the masking threshold is about −10 to −5 dB. The threshold decreases fast, and

Figure 3.15: Masking threshold curves caused by narrowband noise with centre frequencies 250 Hz, 1 kHz and 4 kHz having a sound pressure level of 60 dB. Axes: test tone SPL [dB] against frequency on the ERB scale, with the corresponding frequency [kHz] on the upper axis. Adapted from (Pulkki and Karjalainen, 2015), original data (Fastl and Zwicker, 2007).

if the test impulse is separated by 5 ms from the masker impulse, the masking threshold is already about −40 dB. Even a relatively weak impulse is thus perceived as an individual auditory event if it is not immediately preceded or followed by another sound. If, on the other hand, the impulses occur within a time window of about 1–2 ms, they are perceived as a single auditory event. The shortest time resolution of the hearing system is thus about 1–2 ms, at which impulses are just perceived as separate sounds.

3.3 PSYCHOACOUSTIC ATTRIBUTES

When a listener hears a sound, an auditory object is formed in the brain. Research on the attributes of auditory objects is relatively challenging, since they exist only in the consciousness of the listener. Information about auditory objects can thus be accessed only by introspection and verbalisation. For example, the psychoacoustic quantities loudness and pitch are related to SPL and frequency, respectively. However, they are psychoacoustic quantities referring to the perception of sound, that is, to attributes of an auditory object. The relation between SPL and loudness is not straightforward, and frequency has a clear relation to pitch only when a single sinusoid is presented to a listener. A common research approach to quantifying the attributes of auditory objects is based on comparison with other auditory objects, or on verbal or numerical scaling. When the comparisons are made systematically, it is possible to estimate some psychoacoustic attributes from the resulting data. For example, the relation between the SPL of a sinusoid and perceived loudness can be estimated relatively accurately when the frequency of the sinusoid is known. This section reviews a few of the most common psychoacoustic attributes; the reader is referred to (Fastl and Zwicker, 2007; Pulkki and Karjalainen, 2015) for a broader introduction to the field.

3.3.1

Loudness

Loudness is “that attribute of auditory sensation in terms of which sounds can be ordered on a scale extending from quiet to loud” (ANSI S1.1, 2013). The equal loudness curves

Figure 3.16: Spectral masking thresholds caused by narrowband noise with centre frequency 1 kHz at different sound pressure levels LCB (20, 40, 60, 80 and 100 dB). Axes: test tone SPL [dB] against frequency on the ERB scale, with the corresponding frequency [kHz] on the upper axis. Adapted from (Pulkki and Karjalainen, 2015), original data (Fastl and Zwicker, 2007).

Figure 3.17: The temporal masking effects in hearing. Axes: test tone level [dB] against time [ms]; the regions of backward masking and forward masking (postmasking) and the threshold of the test tone without masking are marked.

presented in Figure 3.1 show that the same perception of loudness is obtained with very different SPLs depending on the frequency of sinusoidal signals, a dependence which cannot easily be expressed in mathematical terms. Nevertheless, the curves can be used to define the loudness level. An equal loudness level is obtained with all SPL-frequency pairs located on one curve. Furthermore, a set of reference values is selected, located at 1 kHz with 10 dB spacing in sound pressure level. The unit of loudness level is termed the phon. It is defined so that at 1 kHz the sound pressure level in dB and the loudness level in phon have the same magnitude. As the loudness level $L_L$ [phon] is tied to a logarithmic quantity, SPL, it has been found useful to define also a linear counterpart of loudness level, loudness, with the unit sone (Fastl and Zwicker, 2007; Pulkki and Karjalainen, 2015). The (perceived) loudness $N$ [sone] of a sinusoidal tone with a sound pressure of $p$ [Pa] can be estimated with the equation

$$N = k \cdot (p - p_0)^{0.6}, \qquad (3.9)$$

where $p_0$ is a pressure near the hearing threshold, and the constant $k$ must be selected so that the loudness at a sound pressure level of 40 dB matches 1 sone (Fastl and Zwicker, 2007). Furthermore, the relation between loudness level $L_L$ [phon] and loudness has been derived as

$$L_L = 40 + 10 \log_2(N), \qquad (3.10)$$

where $N$ is the loudness in sones and $L_L$ is the loudness level of a tone at 1 kHz. Note that at 1 kHz $L_L = L_p$ by definition, where $L_p$ is the sound pressure level in dB. These results are also based on formal listening tests, where the subjects were asked to adjust the loudness of a tone until it had “double” the loudness of a reference tone (Fastl and Zwicker, 2007). One of the results was that an increase in sound pressure level of 10 dB is needed to produce double loudness.
This means that the sound pressure has to be amplified by a factor of about three before listeners on average agree that the sound is “twice as loud” as the reference sound. The perception of loudness requires a stimulus of considerable temporal length to build up (Fastl and Zwicker, 2007). When the duration of the sound stimulus is shortened below 100 ms, the loudness level decreases by 10 phons when the length is reduced to 10 % of the original length. When the length of the stimulus is increased beyond, say, 200 ms, the perceived loudness of the stimulus no longer changes. This means that temporally very short sounds are perceived as quieter than longer sounds of equal amplitude. The process is called temporal integration, and in computational modelling of hearing it is often implemented with an integrating filter with a time constant of the order of 100 ms.
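Equation (3.10) and its inverse are simple to evaluate numerically. The sketch below, with illustrative function names, converts between loudness in sones and loudness level in phons, and computes the sound pressure ratio that corresponds to a given increase in level, using the standard relation that a level change of ∆L dB corresponds to a pressure ratio of 10^(∆L/20).

```python
import math


def sone_to_phon(loudness_sone):
    """Loudness level in phons from loudness in sones, Eq. (3.10)."""
    return 40.0 + 10.0 * math.log2(loudness_sone)


def phon_to_sone(loudness_level_phon):
    """Inverse of Eq. (3.10): loudness in sones from loudness level in phons."""
    return 2.0 ** ((loudness_level_phon - 40.0) / 10.0)


def pressure_ratio(level_change_db):
    """Sound pressure ratio corresponding to a level change in dB."""
    return 10.0 ** (level_change_db / 20.0)


for sone in (1.0, 2.0, 4.0):
    print(f"{sone:.0f} sone corresponds to {sone_to_phon(sone):.0f} phon")

print(f"a 10 dB increase multiplies sound pressure by {pressure_ratio(10.0):.2f}")
```

Each doubling of loudness in sones adds 10 phons, and the pressure ratio for 10 dB is about 3.2, which is the “factor of about three” mentioned above.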

3.3.2

Pitch

Pitch is defined by the American National Standards Institute to be “that auditory attribute of sound according to which sounds can be ordered on a scale from low to high” (ANSI S1.1, 2013). Many types of sounds produce the perception of pitch, such as sinusoids, vocals, musical instruments and filtered noise (Fastl and Zwicker, 2007; Hartmann, 1996). The saliency of pitch varies: whistling, humming and car tyre noise, for example, differ in the “strength” of the pitch perceived. Moreover, some sounds, such as church bells, have multiple pitches. The nearest counterpart of pitch in the physical world is the frequency of repetition in a

periodic signal, even though pitch depends on some other parameters as well. The perceived pitch thus often matches the fundamental frequency of a harmonic tone complex, even when the fundamental component itself has been filtered out (Plack et al., 2005). Especially with noisy signals, the perceived pitch may match a prominent spectral peak.

3.3.3

Other psychoacoustic quantities

There also exist two other “basic” psychoacoustic attributes¹, namely duration and timbre (Fastl and Zwicker, 2007). While duration is relatively clear in its definition and in its relation to duration in the physical world, timbre is defined in a less intuitive way. When two sounds have the same duration, pitch, and loudness, timbre is the characteristic that differentiates musical sounds from one another. For example, the same musical notes played by a cello and a harp are easily distinguished by listeners. Physically, timbre is formed by the sound spectrum and its variation with time. Timbre can be further divided into subcategories, such as sharpness, roughness, fluctuation strength and tonality, which have their own definitions, scales and units (Fastl and Zwicker, 2007; Pulkki and Karjalainen, 2015). These quantities are listed in the following with brief descriptions.

• Sharpness is an attribute which can be measured with subjective tests most coherently using continuous sounds with a temporally smooth envelope. When the centroid of the spectrum of the sound is high, the associated sharpness is also perceived by listeners as high. High sharpness is perceived with noise high-passed at 10 kHz, and very low sharpness is perceived with sinusoids below 100 Hz.
• Fluctuation strength is the degree of perceived temporal changes in sound, either as modulations in amplitude or in frequency. The modulation is typically best perceived at modulation frequencies of about 3–5 Hz. Below 0.5 Hz the hearing system does not perceive the modulation as prominent, since short-term auditory memory has a less accurate reference of the past. With modulation frequencies higher than about 15 Hz, the temporal integration in the auditory system limits the resolution, and the fluctuations can not be followed accurately.
• Roughness occurs when amplitude modulation reaches frequencies higher than about 15 Hz, and the maximum roughness is perceived around 50–100 Hz, depending on the spectral content of the carrier. With such modulation frequencies, at least narrowband carrier signals are generally perceived as unpleasant, whereas lower modulation frequencies are perceived as “beating” with high fluctuation strength, and with higher modulation frequencies the spectral components caused by the modulation are perceived as separate and no interaction happens.
• Tonality is low with noisy sounds and high with sounds with strong spectral peaks. Sounds with high tonality include, for example, whistling, church bell sounds and most musical instrument sounds. Sounds with low tonality include, for example, ventilation noise, babble noise and the fricative /s/. Tonality is of importance in audio coding, where the tonality of a sound affects how audible its spectral components are.

The “basic” psychoacoustic attributes are often used for noise applications and soundscapes (ISO 12913-1, 2014). Due to the nuanced nature of sound reproduction and other applications, such as concert halls, these attributes find limited application in those domains. Relatively simple models exist to compute these attributes, and they can in principle be applied to any sound signal (Fastl and Zwicker, 2007). This would be important in the automatic estimation of product sound quality. For example, in noise acoustics there is a need to estimate the annoyance created by any kind of sound with more elaborate models than sound level measurements. The motivation would be to combine the measured values of roughness, sharpness, fluctuation strength and tonality into an estimate of annoyance. Unfortunately, such measures have not yet been able to provide a robust estimate of annoyance (Waye and Öhrström, 2002).

¹ As will be seen later in this book, there are a large number of sound quality attributes and characteristics beyond the “basic” psychoacoustic attributes.

3.4 SPATIAL HEARING

Spatial hearing seeks to localise sound sources, and also to perceive the attributes of the room where the listener is located. As humans, or any other animals, do not have acoustical reflectors or lenses, the only information available has to be analysed from the ear canal signals. Our spatial hearing indeed relies heavily on signal analysis of the ear canal signals conducted in the brain. The mechanism can be characterised as accurate only in the case of a single broadband plane wave arriving at the listener in the free field. When multiple spectrally overlapping signals exist, or strong responses from the environment are present, the accuracy is much lower. This section presents the basic knowledge of spatial hearing needed to understand the functioning and the resolution of the system.

3.4.1

Head-related acoustics

The head-related impulse response (HRIR) represents the direction- and distance-dependent impulse response from a sound source to the ear canal. The same response in the complex frequency domain is called the head-related transfer function (HRTF)². The complex frequency-domain $H_{\mathrm{HRTF}}$ is defined as

$$H_{\mathrm{HRTF}}(\omega) = \frac{H_{\mathrm{ec}}(\omega)}{H_{\mathrm{ff}}(\omega)}, \qquad (3.11)$$

where $H_{\mathrm{ec}}$ is the response from a sound source at a specified position to the ear canal and $H_{\mathrm{ff}}$ is the corresponding response measured at the position of the centre of the head when the head is absent. The responses are typically measured in free-field conditions with the loudspeaker at a considerable distance, say 2 metres, from a subject who has miniature microphones in their ear canals. To capture the direction-dependent effect, the source is positioned in successive measurements at each azimuth and elevation angle of interest (Møller, 1992). The measurements can be performed in anechoic chambers, where room reflections are not present, but if such a space is not available, the reflections from walls can also be excluded by windowing the time-domain response. A set of horizontal HRIRs, and corresponding HRTFs, measured with a real subject is shown in Figure 3.18. As the source moves around the head, the time of arrival of the direct sound varies, and consequently the peaks produced by the direct sound appear earlier or later on the time axis depending on the angle of arrival in Figure 3.18(a). This is because the ear canal is not at the centre of the circular movement of the source, but off centre by about 8 cm. Thus, the source comes closer to the ear when it is located on the same side as the ear (ipsilateral side). When the source is located on the opposite side (contralateral side), the peaks of the responses have lower amplitudes and also appear later in time.

² Note that the term HRTF is commonly and incorrectly used in reference to head-related impulse responses.
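Equation (3.11) can be illustrated with a toy computation. The sketch below is only an illustration with made-up impulse responses, not measured data: it computes the discrete Fourier transforms of a hypothetical ear canal response and a hypothetical free-field reference response, and divides them bin by bin. The function names and the example responses are assumptions, and a practical implementation would use an FFT library and guard against near-zero values in the denominator.

```python
import cmath
import math


def dft(x):
    """Plain discrete Fourier transform of a short real sequence."""
    n = len(x)
    return [
        sum(x[k] * cmath.exp(-2j * math.pi * m * k / n) for k in range(n))
        for m in range(n)
    ]


def hrtf(h_ec, h_ff):
    """Bin-wise ratio H_ec / H_ff, following Eq. (3.11)."""
    return [ec / ff for ec, ff in zip(dft(h_ec), dft(h_ff))]


# Hypothetical responses: the reference is a unit impulse, and the ear canal
# response is the same impulse attenuated to 0.5 and delayed by two samples.
h_ff = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
h_ec = [0.0, 0.0, 0.5, 0.0, 0.0, 0.0, 0.0, 0.0]

H = hrtf(h_ec, h_ff)
magnitudes_db = [20.0 * math.log10(abs(v)) for v in H]
print([round(m, 2) for m in magnitudes_db])
```

In this toy case the magnitude of the ratio is the same in every frequency bin (an attenuation of about 6 dB), and the two-sample delay appears only in the phase. Measured HRTFs, by contrast, have the strongly frequency- and direction-dependent magnitudes seen in Figure 3.18(b).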

The corresponding horizontal-plane HRTFs in Figure 3.18(b) show some clear phenomena. When the source is on the same side as the receiving ear (ipsilateral side, angles 0° … 180°), the response is relatively flat below about 1 kHz. In the range of 1–3 kHz, the reflection from the head causes an increase in the response. Furthermore, at higher frequencies some direction-dependent effects are seen, which are caused by direction-dependent filtering by the pinna. When the source is on the contralateral side (angles −180° … 0°), there are irregular changes in the response. The high frequencies above 3 kHz are clearly attenuated due to head shadowing. When the source is on the contralateral side, in directions from −100° to −110°, the so-called “bright spot” phenomenon appears at frequencies between 800 Hz and 3 kHz. This is caused by the fact that all the sound paths bending around the front, top and back of the head meet at the ear canal with approximately the same propagation delay from the source to the ear. The waves are summed in equal or modified phase, which causes an amplified or attenuated response compared to adjacent source directions.

The effect of source elevation on the HRIR in the median plane (the plane dividing the head into left and right parts) is shown in Figure 3.19(a). At all elevations the peaks caused by the direct sound are located at 0.3–0.5 ms on the time axis. This means that the ear canal is relatively precisely at the centre of the circular rotation path of the source. The figure shows faint arcs or half-arcs which coincide at 0.5 ms with source directions −45° and 225°. The arc with its apex at 1.5 ms is evidently caused by the shoulder reflection: when the source is above, the sound reaching the shoulders is reflected back to the ears. The reflected sound travels an extra distance of about 30 cm, which causes the 1 ms delay seen in the figure.
With sources lower in elevation the extra travel time for shoulder reflection is shorter, and the contribution from it arrives earlier. The other arcs are caused by some other reflections, but their true causes are not evident. They can come from the measurement devices, the chair, or from the feet of the sitting subject. The frequency-domain HRTF is also affected when the elevation changes as shown in Figure 3.19(b). Especially above 1 kHz elevation-dependent changes are seen. For example, the frequency of the dip around 8 kHz varies systematically with the angle of elevation. The reflections from the subject cause the arcs with the apex pointing left at frequencies below 4 kHz. The responses are quite irregular at frequencies above 4 kHz. The irregularities are due to the complex spectral filtering by the pinna, which is a prominent localisation cue.
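The delay quoted above for the shoulder reflection follows directly from the extra path length. As a quick check, assuming a speed of sound of 343 m/s (a typical room-temperature value that is not stated in the text), the extra travel time is the extra distance divided by the speed of sound. The function name is illustrative.

```python
SPEED_OF_SOUND = 343.0  # m/s, assumed value for air at about 20 degrees C


def path_delay_ms(extra_distance_m, c=SPEED_OF_SOUND):
    """Travel time in milliseconds for an extra path length in metres."""
    return 1000.0 * extra_distance_m / c


shoulder_delay = path_delay_ms(0.30)  # extra path of the shoulder reflection
print(f"30 cm of extra path adds about {shoulder_delay:.2f} ms")
```

With these assumptions a 30 cm detour corresponds to a little under 0.9 ms, which is consistent with the roughly 1 ms separation between the direct sound and the apex of the arc in the figure.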

3.4.2

Localisation cues

Listeners subconsciously analyse the ear canal signals, which results in localisation cues (Grantham, 1995) that define or guide the perceived direction and distance of the sound source. Note that the term “sound source” refers to the physical entity producing sound, and “auditory source” refers to its perceived counterpart, which exists only in the mind of the listener. There exist a number of attributes in the ear canal signals that can be used in the localisation of sources. They are commonly divided into binaural cues, monaural cues and dynamic cues. The binaural cues are the differences between the ear canal signals, and monaural cues are the spectral properties of a single ear canal signal used in localisation. The dynamic cues occur during head movement and result in changes of binaural and monaural cues, providing relevant information about the source location. The combined effect of all cues then forms the spatial attributes of subjective auditory events.

Figure 3.18: Head-related impulse responses (HRIRs) and the magnitude spectra of corresponding head-related transfer functions (HRTFs) in the horizontal plane, left ear. (a) Each horizontal line presents a HRIR measured with a loudspeaker in the azimuth direction tabulated on the y-axis; axes: azimuth [degrees] against time [ms]. (b) The magnitude spectra [dB] of each azimuth direction HRTF are shown in horizontal lines of the figure; axes: azimuth [degrees] against frequency [kHz]. The HRTF data originate from (Gómez Bolaños and Pulkki, 2012). Adapted from (Pulkki and Karjalainen, 2015).

3.4.3

Binaural cues

This section discusses the differences between the ear canal signals that are used in the localisation of sources. The differences are mainly temporal, i.e., the interaural time difference (ITD), or differences in level, i.e., the interaural level difference (ILD). The main mechanism for temporal differences is easy to understand: the propagation path to the ipsilateral ear is slightly shorter than to the contralateral ear, which causes a short delay, i.e., ITD, between the ear canal signals. The ITD varies from about −700 µs to approximately 700 µs when the azimuth angle varies from −90° to 90°. When the source is in the median plane, the value of the ITD is 0 µs, since due to symmetry the propagation paths have equal lengths. The dependency of the ITD on frequency and on direction is shown in Figure 3.20(a) as a surface computed using a binaural auditory model (Pulkki and Karjalainen, 2015). The value of the ITD is relatively constant with frequency. Only at low frequencies below about 700 Hz does the ITD surface increase to slightly higher values. The propagation of the wave from the source to the contralateral side has multiple paths, and the interference of the arriving waves makes the surface irregular above 2 kHz. The hearing mechanism is sensitive to the ITD. As already discussed, the cochlea can be thought to produce a narrowband time-domain signal for each critical band. For signals with frequency below about 1.5 kHz, the auditory system is sensitive to the phase difference between the waveforms of the critical bands. The spatial separation between the ears is finite, and for frequencies above about 800 Hz the distance between the ears is larger than half the wavelength of the sound. This makes phase differences at higher frequencies ambiguous as localisation cues. Nevertheless, the phase differences are decoded at frequencies up to about 1.6 kHz (Blauert, 1996). The hearing mechanism is also sensitive to ITDs at higher frequencies.
The ITD is detected from the delays between temporal envelopes of the signals. This is of course only possible if the source signal has a temporal envelope, such as speech, or impulsive sounds. High-pitched mosquito flying sounds are an example of signals that do not have strong temporal envelopes, which causes low localisability.
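The order of magnitude of the ITD values quoted above can be checked with a simple geometric model. The sketch below uses the classical Woodworth spherical-head approximation, ITD = (a/c)(θ + sin θ), which is not the binaural auditory model used for Figure 3.20 and is swapped in here only for illustration. The head radius (8.75 cm), the speed of sound (343 m/s) and the effective interaural path length used for the half-wavelength estimate (21 cm) are all assumed values.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed
HEAD_RADIUS = 0.0875    # m, assumed spherical-head radius


def woodworth_itd_us(azimuth_deg, a=HEAD_RADIUS, c=SPEED_OF_SOUND):
    """ITD in microseconds for a distant source, Woodworth approximation.

    Intended for azimuths between -90 and 90 degrees, measured from the
    median plane.
    """
    theta = math.radians(azimuth_deg)
    return 1e6 * (a / c) * (theta + math.sin(theta))


def half_wavelength_frequency(distance_m, c=SPEED_OF_SOUND):
    """Frequency in Hz at which the given distance equals half a wavelength."""
    return c / (2.0 * distance_m)


for azimuth in (-90, -30, 0, 30, 90):
    print(f"azimuth {azimuth:4d} deg: ITD {woodworth_itd_us(azimuth):7.1f} us")

# 0.21 m is an assumed effective path between the ears, not a measured value.
print(f"half-wavelength frequency: {half_wavelength_frequency(0.21):.0f} Hz")
```

With these assumed values the model gives an ITD of roughly 650 µs at 90° azimuth, in line with the range of about ±700 µs given in the text, and a half-wavelength frequency of roughly 800 Hz.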

Figure 3.19: 72 HRIRs and HRTFs measured from the left ear of the subject in the median plane. (a) Each horizontal line presents a HRIR measured with a loudspeaker in the median plane with the elevation tabulated on the y-axis; axes: elevation δcc [degrees] against time [ms]. (b) The magnitude spectra [dB] of each measured median plane HRTF are shown in horizontal lines of the figure; axes: elevation δcc [degrees] against frequency [kHz]. The HRTF data originate from (Algazi et al., 2001b). To emphasise the structure of the HRIR responses after 1 ms, the colour code saturates at ±0.05, although the maximum value of the HRIRs has been scaled to unity. Adapted from (Pulkki and Karjalainen, 2015).

Figure 3.20: ITD and ILD values computed with a binaural auditory model. (a) Interaural time difference (ITD) functions of real sources in different azimuth directions in free field; axes: ITD [ms] against azimuth [degrees] and frequency [kHz]. (b) Interaural level difference (ILD) functions in the corresponding situation; axes: ILD [phon] against azimuth [degrees] and frequency [kHz]. Adapted from (Pulkki and Karjalainen, 2015).

The hearing system is also sensitive to the level difference between the ear canal signals, and the corresponding binaural cue is called the interaural level difference, ILD. When a plane wave arrives at a listener, the ILD is produced by the scattering effect of the head: in practice the pressure level increases at the ipsilateral ear due to reflection and decreases at the contralateral ear. The effect is very frequency dependent. Plane waves produce an ILD only at frequencies where the head is large enough compared to the wavelength. There is little effect at frequencies below about 400 Hz, and the effect increases for higher frequencies. As shown in Figure 3.18(b), effects larger than 20 dB exist at frequencies above 4 kHz. The strong effect at high frequencies makes ILD a strong directional cue for distant sources. The ILD surface in Figure 3.20(b) shows the frequency dependency clearly. Although plane waves cause ILDs only at high frequencies, it is known that humans are equally sensitive to ILDs at all frequencies. There are at least two reasons why humans are sensitive to ILD also at low frequencies. First, low-frequency ILDs are encountered in reality for nearby sources. If the source is, e.g., at a distance of 5 cm from one ear, the travel path to the other ear is more than 30 cm. The distance attenuation creates a considerable ILD, also at low frequencies, which makes it a salient cue for detecting the distance of the sound source. Second, the ILD cue also carries information about the acoustical characteristics of the listener’s environment. The greater the degree of reflections and room effects, the lower the proportion of direct sound, resulting in larger ILD changes over time. It is also known that listeners are sensitive to rapid changes of ILD with a time resolution of a few milliseconds.
Large instantaneous fluctuations in ILD have been shown to be present when the ear canal signals are incoherent, as happens with dominantly diffuse room reverberation (Goupell and Hartmann, 2007). The binaural cues alone are not enough to decode the direction of a sound source unequivocally. In fact, they only define the angle between the interaural axis and the source direction. The source to be localised can thus be at any position on the surface of a cone, and in principle equal ITD and ILD cues are produced. This cone is called the cone of confusion. Monaural and dynamic cues are used to detect the direction unambiguously.
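The near-source example given earlier in this section (5 cm to one ear, more than 30 cm to the other) can be checked with a rough calculation. The sketch below assumes a point source in a free field obeying the inverse-distance law and ignores head scattering entirely, so it estimates only the distance-attenuation part of the ILD. The function name is illustrative and not taken from the text.

```python
import math


def distance_ild_db(near_m: float, far_m: float) -> float:
    """Level difference in dB caused by 1/r spreading alone.

    Assumption: ideal point source in free field; head shadow is ignored.
    """
    return 20.0 * math.log10(far_m / near_m)


# Distances from the example in the text: 5 cm and 30 cm.
ild = distance_ild_db(0.05, 0.30)
print(f"Distance-based ILD: {ild:.1f} dB")
```

Because the far path is stated as "more than 30 cm", this value is best read as a lower bound under the stated assumptions. It is independent of frequency, which is consistent with the point that nearby sources produce an ILD also at low frequencies.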

3.4.4 Monaural cues

Monaural cues are cues that are available from the input of a single ear. The main monaural cues for source localisation are spectral cues. In principle, the temporal structure of the incoming sound could also be utilised in localisation. However, monaural temporal cues have not been found important in localisation (Blauert, 1996). Most of the detail in HRIRs is located within a 1 ms time window, as shown in Figure 3.19(a), and since the time resolution of monaural hearing is about 1–2 ms, these details cannot have a notable effect on hearing. The monaural cues are due to the complex acoustic effect of the outer ear and to reflections from the torso. These effects make the spectrum of the ear canal signals dependent on the direction of arrival, and the resulting cues are denoted spectral cues. Spectral cues are important when localising sound sources within the cone of confusion. The magnitude HRTFs are plotted in Figure 3.19(b) for elevations in the median plane, where a complex pattern of spectral changes can be observed as a function of elevation. The frequencies of the dips and peaks change as a function of elevation. It has been shown that such details in the monaural spectrum define the perceived direction of a sound source (Blauert, 1996).


3.4.5 Dynamic cues

When the orientation of the listener’s head changes, the binaural cues change accordingly. If a continuous sound source is located in front of a listener, upon rotation of the head to the right, the propagation distance from the source to the left ear becomes shorter, and correspondingly the distance to the right ear becomes longer. This causes left-ear dominance in both the ITD and ILD cues. If the same source is behind the listener, the same movement would result in right-ear dominance. It has been shown that such dynamic cues are used in localisation (Blauert, 1996). The localisation mechanisms implicitly assume that the sound sources do not move in synchrony with head movements, and the effect of movement on binaural and monaural cues allows us to resolve front/back and up/down directions. However, if the cues do not change with head movements, the best explanation is that the source is either above the head or perceived inside the listener’s head, such that in-head localisation occurs. This happens, e.g., when localising the listener’s own voice or sounds of eating.
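The front/back argument above can be illustrated numerically. The following sketch uses a deliberately simple sine-law ITD model, in which the ITD is proportional to the sine of the source azimuth relative to the head. The ear spacing, the speed of sound, the sign convention (positive means the right ear leads) and the function name are all assumptions made for illustration and are not taken from the text.

```python
import math

EAR_SPACING_M = 0.18      # assumed effective distance between the ears
SPEED_OF_SOUND = 343.0    # m/s, assumed


def itd_seconds(source_az_deg: float, head_yaw_deg: float = 0.0) -> float:
    """Sine-law ITD; azimuth is clockwise from straight ahead.

    Positive result: right ear leads. Negative result: left ear leads.
    """
    relative_az = math.radians(source_az_deg - head_yaw_deg)
    return EAR_SPACING_M / SPEED_OF_SOUND * math.sin(relative_az)


# Head facing forward: a front (0 deg) and a back (180 deg) source both
# give an ITD of practically zero, so they cannot be told apart.
front_before = itd_seconds(0.0)
back_before = itd_seconds(180.0)

# Head turned 20 degrees to the right: the two cases now differ in sign.
front_after = itd_seconds(0.0, head_yaw_deg=20.0)
back_after = itd_seconds(180.0, head_yaw_deg=20.0)

for label, value in [("front", front_after), ("back", back_after)]:
    leader = "left" if value < 0 else "right"
    print(f"{label}: ITD = {value * 1e6:+.0f} us, {leader} ear leads")
```

Under these assumptions the front source yields left-ear dominance and the rear source right-ear dominance after the same head turn, which matches the description above.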

3.5 PSYCHOPHYSICS AND SOUND QUALITY

This book covers multiple methods for quantifying the listening experience. To start this discussion, we define the difference between physical phenomena related to sound and their counterparts in human perception. In common psychoacoustical literature, the physical side is referred to by such terms as “sound source” and “sound event”. A sound source produces sound, typically by mechanical vibrations, and a specific occurrence when a source produces sound is called a sound event. Their perceptual counterparts are the auditory source and the auditory event (Blauert, 1996; Pulkki and Karjalainen, 2015). Sound events have certain physical attributes, such as pressure, spectrum and so forth. The attributes of auditory events depend on the attributes of sound events, though in a non-linear way with many factors involved. The simplest dependency is seen with the equal loudness contours in Figure 3.1, which include the average lower limit of SPL at which a sound event produces any kind of auditory event, i.e., the threshold of hearing. The formation of auditory events is also affected by such factors as sounds heard before, state of mind, and expectations. Classical psychoacoustics studies the connection between the attributes of sound events containing simple sounds (such as tones, noise, or impulses) and the attributes of the auditory events they generate. In such experiments, sound events are presented and the listener is asked a question that they should answer based on the properties of the auditory event possibly created by the sound event. A common measurement is the relation between the SPL and the presence of an auditory event, i.e., “do you hear a sound”, resulting in the threshold of hearing. A more interesting topic in this book is the psychophysical function, which describes the mapping from the attributes of a sound event to an attribute of an auditory event.
Typically the functions are not simple, and they may be hard to express with exact mathematical equations. For example, “loudness” is one attribute of an auditory event. Loudness depends primarily on the SPL of the sound event, but also on other factors such as the frequency content and duration of the sound event, previous sound events, and the state of mind of the listener. The auditory scale is an important concept in this book. A scale quantifies the psychophysical function. For example, the dependence of loudness on SPL can be quantified by a long series of listening tests. The scale represents as numbers the average perception of the attribute as a function of one or several attributes of the sound events, and the scales often have some anchor points normalised with some reference values. For example, the loudness scale expresses in “sones” how loud a sound is perceived to be, and the unity value of the scale is normalised to be the loudness of a 1 kHz sinusoid at 40 dB SPL. Well-defined auditory scales exist at least for pitch, loudness, and duration, which are more or less related to the physical quantities of repetition frequency, level, and duration (Fastl and Zwicker, 2007), see Section 3.3. The scales treat the human as a simple metering device, which has great value when researching the basic properties of hearing.

This book addresses the human listener from a wider perspective, as it discusses the perception and quality evaluation of sounds that humans encounter in their everyday life. We experience the sounds of the world as more or less desirable, valuable, positive, appealing, or useful. This happens without deep concentration; for example, some bird sounds are rated to be more “beautiful” than others. The discussion of sound quality is important when designing or modifying the sounds made by products. The conceptions and rankings of sound quality help us to set goals for development and to find more appealing solutions for products. Quality of experience (QoE; Le Callet et al., 2012) is a widely used term in this field, which denotes the overall acceptability of an application or service as perceived subjectively by the end-user. When such outside-the-lab conditions are discussed, cognitive factors can no longer be disregarded as was done in basic psychoacoustic research on auditory scales. The “quality of sound” is defined as the suitability of a sound to a specific situation, and thus it cannot be judged without cognitive functions. This inherently leads to situations where the same sound may yield different sound qualities in different contexts.
A typical example of this is that higher intelligibility of speech improves the sound quality in telecommunication, but decreases the ability of a worker to concentrate in an open-plan office. The high value on an auditory scale of “intelligibility” can thus have either negative or positive effects on sound quality.
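The loudness scale mentioned above is anchored at 1 sone for a 1 kHz sinusoid at 40 dB SPL, i.e. a loudness level of 40 phon. As a hedged illustration of what such a scale looks like in practice, the sketch below uses the common textbook approximation that loudness doubles for every 10 phon increase. This rule is not stated in the text and is an assumption here; it is usually considered reasonable only for loudness levels of about 40 phon and above.

```python
def sones_from_phons(loudness_level_phon: float) -> float:
    """Approximate loudness in sones from loudness level in phons.

    Assumption: loudness doubles per 10 phon, anchored at 40 phon = 1 sone.
    Not intended for loudness levels well below 40 phon.
    """
    return 2.0 ** ((loudness_level_phon - 40.0) / 10.0)


for phon in (40, 50, 60, 70):
    print(f"{phon} phon -> {sones_from_phons(phon):.0f} sone")
```

The example shows why a scale is more informative than the physical level alone: under this approximation a 20 dB increase at 1 kHz corresponds to a fourfold increase in perceived loudness, not a hundredfold one.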

3.6 MUSIC AND SPEECH

The primary acoustic communication mode specific to human beings is speech. It is the original type of linguistic communication and is substantially important for everyday life. A speech signal is the acoustic output from the speech production organs. It consists of vowels, consonants, and other sounds produced using the speech production organs: lungs, vocal folds, vocal and nasal tracts, tongue, and lips. Depending on their positions, and on the method of sound production in the organs, the spectral characteristics of the sounds vary a lot, as seen in Figure 3.21. For example, the resonances in the vocal and nasal tracts are called formants, which are visible during vowel sounds as peaks in the spectrogram at frequencies of around 800 Hz to 4 kHz. Each consonant and vowel has distinct characteristics in the time-frequency domain, and the linguistic information in speech is coded into them. In addition, speech also contains so-called prosodic features. Such features are, e.g., intonation, stress, rhythm and timing, which may carry such information as indicating emotions and attitudes or signalling the difference between statements and questions. Speech production and related technologies have been researched a lot, and introductions can be found in (O’Shaughnessy, 1987; Flanagan, 1972; Titze, 1994; Rabiner and Schafer, 1978). The radiated signal itself has very special perceptual characteristics: speech sounds are quite seldom misinterpreted as any other sounds. The common frequency-dependent SPLs found in speech signals are characterised in the same graph as the equal loudness contours in Figure 3.1. The distribution of the frequency content of everyday speech and music is conceptually shown as a region. The most salient frequencies of speech range from about

Figure 3.21: A Finnish word kaksi uttered by a male talker in a moderately reverberant room. Top: waveform and corresponding phones ([k][a] # [k][s][i]). Bottom: Spectrogram presentation of the utterance (time in ms, frequency in Hz). The colour code shows the SPL of each time-frequency position normalised with the highest value in the spectrogram.

300 Hz to 4 kHz, and SPLs from 40 to 90 dB. Speech also contains other frequencies, but they are not very prominent and contribute less to the coding of the message, though they have an effect on sound quality. Speech signals contain large fluctuations in frequency bands, with the spectral level varying at a relatively fast pace, as shown in Figure 3.21. The level fluctuations of the bands are often interpreted as modulation. In classic instrumental evaluation of speech intelligibility, the transfer function of the modulation spectrum in the range of 2 Hz to 32 Hz is of interest (Steeneken and Houtgast, 1985; Houtgast and Steeneken, 1985). The lower the magnitude of the modulation transfer function, the lower the speech intelligibility.

Music is a form of art that is composed of sound and silence. Music communicates less linguistic and conceptual content than speech; instead, it mostly evokes an aesthetic and emotional experience. The way music communicates emotions can be seen to have a large similarity with prosodic features in speech, where the emotional state of the talker may be clearly audible; for example, it is quite easy to hear whether the talker is angry, bored, or excited. In audiovisual content production, music is used to affect the emotions of the viewer. For example, in movies the background music very effectively gives a certain mood to the scene. In principle any sound can be used in the composition of music (Fletcher and Rossing, 1998). For example, if a white noise burst is repeated with a distinctive rhythmic pattern, some listeners may already accept the sound as music. However, most typical musical sounds are produced by musical instruments that emanate a more or less harmonic spectrum producing a clear perception of pitch. Musical melodies consist of successive instrument tones with varying pitch, and a musical chord is composed of multiple tones with different pitches played at the same time. The sound signals of music can also have very different forms, ranging from signals with a broad spectrum and impulsive or noisy characteristics to signals with a highly repetitive temporal structure and a single-frequency or strictly harmonic spectrum. Music and speech are not mutually exclusive; for example, singing and rapping have features from both fields. In the typical frequency and SPL ranges shown in Figure 3.1, music spans a wider range in all directions, though the difference is largest at lower frequencies.

The sounds used in music have diverse dimensions, and many of these dimensions are related to psychoacoustics. Some of the phenomena can be directly explained with results obtained from psychoacoustics. For example, the level of consonance and dissonance formed by two concurrent tones has been measured with listening tests for two sinusoids as a function of the difference in frequency (Rossing et al., 2001). When the frequencies are the same, consonance has the maximum value 1.0. When the frequency difference is about a quarter of the critical band, the consonance is at its minimum. After this, consonance asymptotically approaches the value 1.0 as the frequency difference is increased. Tones with a frequency difference of a quarter of a critical band result in interference that is perceived to have high roughness, which is generally not considered to be pleasant (Fastl and Zwicker, 2007). This has also been generalised to notes with a harmonic spectrum.
When two notes with different pitches are played at the same time, it has also been shown that the perceived level of dissonance can be explained by an integrated effect of the partials of the first sound causing roughness when interfering with the partials of the second sound.

In the coding of speech and music, a priori knowledge of the typical features of the signals is used heavily when the technologies are developed (Kahrs and Brandenburg, 1998; Herre and Disch, 2014; Bäckström, 2017). Most music and speech is composed of a limited set of types and structures of sounds, where certain basic properties hold. For example, plucked string instruments or drums form distinguishable families of musical instruments, where the principle of sound generation and the structure of the sound signal are similar. However, lower bitrates are obtained in speech coding than in the coding of music. This is easy to understand because the speech signal is ultimately more limited in complexity than music signals are: in music multiple pitches exist in the signal at the same time, whereas only one pitch is present at any one time in the speech of a single talker.
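The consonance behaviour of two sinusoids described earlier in this section (a maximum of 1.0 at zero frequency difference, a minimum at about a quarter of a critical band, and an asymptotic return towards 1.0) can be mimicked with a simple curve. The function below is an illustrative shape chosen only to reproduce those three properties; it is not the measured listening-test data, and the minimum value of 0.0 is an assumption. The name `consonance` and the use of critical bandwidths as the unit of the argument are likewise assumptions.

```python
import math

# Frequency difference of minimum consonance, in critical bandwidths
# (the "quarter of the critical band" mentioned in the text).
QUARTER_BAND = 0.25


def consonance(delta_cb: float) -> float:
    """Illustrative consonance of two sinusoids, between 0 and 1.

    delta_cb: frequency difference expressed in critical bandwidths.
    """
    x = abs(delta_cb) / QUARTER_BAND
    dissonance = x * math.exp(1.0 - x)  # 0 at x = 0, peaks at 1 when x = 1
    return 1.0 - dissonance


for delta in (0.0, 0.25, 1.0, 2.0):
    print(f"{delta:4.2f} CB -> consonance {consonance(delta):.3f}")
```

For notes with harmonic spectra, a model in the spirit of the text would apply such a curve to every pair of partials from the two notes and accumulate the resulting dissonance values.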

3.7 HEARING IMPAIRMENT

Lastly, a few words about the important topic of hearing loss and impairment. According to an estimate by the World Health Organization, 360 million people worldwide have disabling hearing loss, which corresponds to approximately 5 % of the population (WHO, 2018). Disabling hearing loss is defined as a hearing threshold shift of 40 dB in the better hearing ear for adults and 30 dB for children. Furthermore, ageing also impairs hearing, and as the average age and the number of elderly people in the population increase, the number of individuals with hearing loss increases. There exist multiple types of impairment of differing degree. Speech reception is a fundamental ability for humans, and therefore speech intelligibility is often used as the basic criterion of the functionality of hearing. However, the perception of details in sound is already affected by milder impairments; for example, nuances in music are harder to hear with impaired ears. In the context of quality evaluation, the consequences of a mild hearing impairment will probably receive even more attention in future. Hearing impairments can be classified medically by the location of the impairment. This classification can be done, for example, as follows (Martin and Clark, 2006).

1. Conductive impairments refer to impairments in the conductive path of the auditory system, that is, the outer ear and the middle ear. Examples are a broken chain of ossicles or a middle ear that is full of liquid.
2. Sensorineural impairments refer to impairments in the inner ear and in the auditory nerve. Sensorineural impairments can be further divided into cochlear and retrocochlear impairments, based on whether they are located in the inner ear or in the auditory nerve. For example, excessively high sound pressure levels or ageing of the cochlea cause lower sensitivity in mechanical-to-neural conversion.
3. Central impairments refer to impairments in the central auditory nervous system. Decreased speech intelligibility may be an indicator of a central hearing impairment if the peripheral auditory system has been found to function normally.
4. Psychic impairments are a group of hearing impairments where no organic cause can be found.

The impairments cause different kinds of hearing deficits. The most common is the hearing threshold shift, which renders the quietest signals inaudible. In addition, strong distortion components and echoes may occur, which may decrease speech intelligibility and make listening to music uncomfortable. When the cochlea is damaged, i.e., in sensorineural hearing impairments, the resolution of hearing is reduced in multiple ways. This easily leads to problems in understanding speech in the presence of background noise or in strong reverberation. Different stages of hair cell damage and the corresponding effect on neural tuning curves are shown in Figure 3.22.
The tuning curves show the amplitude of a sinusoid needed to provide equal output from a single neuron at different frequencies. The deeper and narrower the dip in the curve, the better the frequency resolution of the neuron. Normal hearing is shown with a dashed line in the tuning curve graphs. In Figure 3.22(a) the sensitivity has decreased since the inner hair cells are damaged, but the outer hair cells function normally, keeping the frequency selectivity normal, as shown in (b). Conductive hearing impairment could result in a similar tuning curve. In Figure 3.22(c), the outer hair cells are largely damaged, and the corresponding tuning curve in (d) is much shallower than with normal hearing. In Figure 3.22(e), the outer hair cells are completely destroyed, and the corresponding tuning curve in (f) is flat, with frequency selectivity largely lost when compared to normal hearing. In this case the resulting tuning curve depicts the passive amplification characteristics of the cochlea without the active effect of the outer hair cells that sharpens the frequency resolution. This makes the auditory filters in the cochlea broader, with a lower centre frequency and more overlap with other bands. More energy is added to a band from the same and from other sources, and thus each band receives input from a wider frequency range. This results in stronger masking effects from other sounds, and thus degraded speech intelligibility in noise.

Figure 3.22: Sensorineural hearing impairments. (a,c,e) The damage of hair cells in the cochlea. (b,d,f) Neural tuning curves, plotted as threshold level [dB] against frequency [kHz]: the solid line shows the corresponding neural tuning curve compared to normal hearing shown with dashed line. Adapted from (Liberman and Dodds, 1984).


3.8 SUMMARY

When human perception is measured in systematic experimentation for sensory evaluation of sound, the experimenter has to be aware of a number of phenomena involved. A sound source produces audible sounds by vibrations, which then cause disturbances in the sound field surrounding the source. The waves travel to the listener directly and/or via reflections and reverberation of the listening room. Often in sensory evaluation different sounds are presented to the listener, and ideally their perceived loudness should be equal. A simple method is to estimate the loudness with frequency-weighted sound level measurements. Natural sound signals to be utilised in psychoacoustic experiments have to be recorded using a microphone, and the properties of the microphone should be carefully selected to match the targeted application. The hearing of mammals has developed to perceive what happens and where, and the capabilities to utilise sound in communication via speech and music have evolved later. Hearing mechanisms are a very complex set of acoustic, mechanical and neural processing steps that are targeted to encode the sound waves into neural form in a robust and accurate way. Various spectral and temporal masking features, and information on spectral and temporal resolution, are reviewed in this chapter. Furthermore, the basic mechanisms in spatial hearing are reviewed. The basic psychoacoustic quantities that can be systematically measured from human listeners are loudness, pitch, duration and timbre, where timbre has several subcategories, such as sharpness, roughness, and fluctuation strength. Speech and music sources are important in the typical acoustic environment of a listener, and both of them show certain representative characteristics in the signal content. When conducting listening tests for sensory evaluation of sound, a large number of listeners typically have to be used. The acuity of hearing can be affected by different impairments, caused by ageing, diseases or other reasons. The experimenter should be aware of the possibility of such impairments, which might lead to fundamentally different responses from the subjects.
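The summary mentions frequency-weighted sound level measurements as a simple way to approximate loudness when aligning stimuli. The text does not name a particular weighting; the sketch below assumes the widely used A-weighting, with the analogue approximation formula and constants as commonly quoted from IEC 61672. The band levels in the example are hypothetical, and the function names are illustrative.

```python
import math


def a_weighting_db(freq_hz: float) -> float:
    """A-weighting gain in dB at a single frequency (about 0 dB at 1 kHz)."""
    f2 = freq_hz ** 2
    ra = (12194.0 ** 2 * f2 ** 2) / (
        (f2 + 20.6 ** 2)
        * math.sqrt((f2 + 107.7 ** 2) * (f2 + 737.9 ** 2))
        * (f2 + 12194.0 ** 2)
    )
    return 20.0 * math.log10(ra) + 2.00


def a_weighted_level(band_levels_db: dict) -> float:
    """Combine band SPLs (centre frequency in Hz -> dB SPL) into one dB(A) value."""
    total_power = sum(
        10.0 ** ((level + a_weighting_db(freq)) / 10.0)
        for freq, level in band_levels_db.items()
    )
    return 10.0 * math.log10(total_power)


# Hypothetical octave-band levels of two stimuli with equal 1 kHz content.
bass_heavy = {125.0: 70.0, 1000.0: 60.0, 4000.0: 50.0}
treble_heavy = {125.0: 50.0, 1000.0: 60.0, 4000.0: 62.0}

print(f"bass-heavy:   {a_weighted_level(bass_heavy):.1f} dB(A)")
print(f"treble-heavy: {a_weighted_level(treble_heavy):.1f} dB(A)")
```

A weighted level of this kind is only a coarse proxy for loudness; it does not account for masking, duration or level-dependent changes in the equal loudness contours.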

Part II Theory and Practice

CHAPTER 4

Sensory Evaluation in Practice

Torben Holm Pedersen, FORCE Technology, SenseLab, Hørsholm, Denmark

Nick Zacharov, FORCE Technology, SenseLab, Hørsholm, Denmark

CONTENTS

4.1 Introduction
  4.1.1 Can sensory evaluation be scientific?
    4.1.1.1 Perceptual/descriptive scaling
    4.1.1.2 Affective/subjective scaling
    4.1.1.3 The filter model
4.2 Designing a listening test
  4.2.1 Defining the purpose of a test
  4.2.2 Test method selection
    4.2.2.1 Quantification of impression
  4.2.3 Experimental variables
    4.2.3.1 Input variables
    4.2.3.2 Response variables
    4.2.3.3 Other experimental variables
    4.2.3.4 Design of experiment
  4.2.4 Test duration
  4.2.5 Test samples
  4.2.6 Usage scenarios
  4.2.7 Perceptual constancy
4.3 Listening test ingredients
  4.3.1 Stimuli
    4.3.1.1 Product testing
    4.3.1.2 Product testing - By recording
    4.3.1.3 Algorithm testing
    4.3.1.4 Brand sound and music
  4.3.2 Attributes
    4.3.2.1 Requirements for good consensus attributes
    4.3.2.2 Lexicons and sound wheels
    4.3.2.3 Attribute development for novel applications
    4.3.2.4 Attribute validation
    4.3.2.5 Language issues
  4.3.3 Assessors and panels
    4.3.3.1 Assessor types
    4.3.3.2 Panel types
    4.3.3.3 Assessor selection
    4.3.3.4 Assessor performance evaluation
    4.3.3.5 Panel training
    4.3.3.6 Panel maintenance
    4.3.3.7 Ethics
  4.3.4 Listening facilities
    4.3.4.1 Listening rooms
    4.3.4.2 Listening booths
    4.3.4.3 Field testing
  4.3.5 Equipment
    4.3.5.1 Electroacoustic equipment
    4.3.5.2 Systems for test administration
4.4 Test administration
  4.4.1 Calibration and equalisation
  4.4.2 Instructions
  4.4.3 Familiarisation
  4.4.4 Controlling bias
  4.4.5 Anchoring
4.5 Good practices in reporting listening tests
  4.5.1 Front page
  4.5.2 Introduction
  4.5.3 Method
  4.5.4 Measuring objects
  4.5.5 Equipment
  4.5.6 Location
  4.5.7 Assessors
  4.5.8 Physical measurements
  4.5.9 Test administration
  4.5.10 Analysis and discussion
  4.5.11 Results
4.6 Summary checklist

This chapter considers the common practical aspects associated with performing any kind of sensory evaluation, irrespective of the particular test methodology or application. It is written in a practical and accessible manner, so that all practitioners can understand the basic concepts. The key steps to planning, designing and executing any sensory evaluation experiment are outlined here. Chapter 5 provides more detail regarding the characteristics of different test methods. Chapter 6 and Chapter 7 provide a thorough grounding in the fundamentals of applied univariate and multivariate statistical analysis, respectively. Lastly, Part III provides a wide range of sensory evaluation application examples to guide and inspire you to identify the best fitting approach for your needs.

4.1 INTRODUCTION

In order to ensure that sensory evaluation of sound is performed in a rigorous and robust manner that will lead to repeatable scientific or engineering results, it is important to design and plan experiments thoroughly. The process of data collection using sensory evaluation is relatively slow and arduous compared to physical forms of measurement. As a result, there is a need to plan and design experiments well in order to maximise the signal we are trying to measure and to minimise or control any disturbing variables which might otherwise result in noise within the data [1]. The primary purpose of this chapter is to ensure that best practices are applied to all steps of performing a sensory evaluation experiment. When we perform any sensory evaluation experiment, the primary aim is to answer some kind of research question. Such questions may be about whether two technologies are perceptually identical, which alert signal is most audible and least annoying, or how different car tyre size/compound combinations are perceived by consumers. Irrespective of the question to be addressed, the practical steps to evaluating this question are the same for all sensory evaluations. Furthermore, the steps needed to select an appropriate test method and design an efficient and robust test are similar irrespective of the research question to be addressed.

[1] In the field of sound and sensory evaluation we sometimes talk about the signal-to-noise ratio (SNR). In sound and audio, the SNR relates the signal of interest (e.g. speech, music, etc.) to the noise, i.e. background or unwanted sounds (e.g. traffic noise, electronic hiss, etc.). When we discuss data from sensory evaluations, we may also consider a signal and noise within the data. In a similar way to sound and audio signals, our data set also has a signal component, often comprising the mean scores of the systems we wish to evaluate, and noise components, i.e. those related to the uncontrolled or random variables within the test. There is thus also a signal-to-noise ratio equivalent for any data set, which differs from the SNR of the sound signals.

4.1.1 Can sensory evaluation be scientific?

Results from listening tests are sometimes considered less objective than instrumental measurements of physical characteristics. There are various reasons for this view, such as the fact that individuals have personal opinions or that they may change their opinions over time. However, results from either perceptual or affective listening tests can be valuable and scientific in nature, i.e. consisting of systematic observation, testing, measurement, and experiment, and the formulation, testing, and modification of hypotheses. In perceptual, descriptive or objective listening tests, stable and accurate results may be achieved with a small group of trained assessors. In order to obtain a stable population view of consumers in an affective test, e.g. of consumer preference(s), a larger number of representative consumers may be needed. It may even be necessary to select the consumers from specified market segments to get representative results. It is therefore necessary for you as the experimenter to understand and consider the differences between perceptual and affective testing.

Before going further, let’s clarify some terminology that is sometimes encountered in our field. Listening tests are sometimes called subjective tests because assessors or subjects are employed to provide the results, as opposed to results obtained with measuring instruments. However, the term subjective, by definition, refers to results “based on, or influenced by, personal feelings, tastes, or opinions”. In order to avoid any confusion, we generally refer to sensory evaluation using listening tests, whether they are objective/perceptual/descriptive or affective/subjective in nature. Tests where assessors use their senses to provide assessments are called sensory evaluations. Generally, we consider two types of sensory evaluation tests:

Perceptual/descriptive: An objective quantification of the sensory strength of auditory attributes of the perceived stimulus.

Affective (or hedonic) measurement: The subjective preference, degree of liking, annoyance or other connotative attributes. Often associated with personal opinion in the form of a subjective assessment.

These two approaches should be carefully understood, as they relate to the nature of human perception and have a significant impact on the type of test method we select, the nature of the collected data and the degree to which the data may be analysed. The roots of these two approaches will be discussed later in this section.

4.1.1.1

Perceptual/descriptive scaling

Perceptual/descriptive evaluation, often called descriptive analysis (DA), can be considered as an objective quantification of the sensory strength of auditory attributes of the perceived stimulus. The fundamental concept behind this is that certain perceptual characteristics are common to all humans, in such a way that perceptual evaluation of different sounds would yield similar ratings from all assessors. For example, if we present three speech samples to assessors at different reproduction levels (say 55, 65, 75 dB) and we ask assessors to order these stimuli according to the attribute loudness, it is very likely that all assessors will rank the stimuli similarly. Such an assessment of loudness is likely to be consistent not only between assessors but also at a population level.

Of course, loudness is not the only attribute relating to sound characteristics. A vast range of sound characteristics and attributes may be considered (Pedersen, 2008; Lindau et al., 2014; Pedersen and Zacharov, 2015; Zacharov et al., 2016b; Kuusinen and Lokki, 2017). Such methods, sometimes referred to as descriptive analysis, sensory profiling, attribute-based testing, or perceptual testing, are well established in sensory science (Lawless and Heymann, 2010; Næs and Risvik, 1996; Varela and Ares, 2014; Bech and Zacharov, 2006) and form a major part of the methods and applications presented in this book. To obtain objective quantifications of the sensory strength of the attributes of the perceived stimulus, it is important that the attributes are well defined and that the assessors have the same understanding of the meaning of the attributes. Therefore trained assessors are most often used for this kind of assessment.
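The degree to which assessors rank stimuli similarly, as in the loudness example, can be quantified. The following is a minimal sketch using Kendall's coefficient of concordance (W), a standard statistic for agreement between rankings that is not prescribed by this chapter. The panel data and the function name are hypothetical and chosen purely for illustration; tied ranks are not handled.

```python
def kendalls_w(rankings):
    """Kendall's coefficient of concordance W for untied rankings.

    rankings: one list per assessor, giving the rank (1..n) that the
    assessor assigned to each of the n stimuli, in a fixed stimulus order.
    Returns 1.0 for perfect agreement and 0.0 when all rank sums are equal.
    """
    m = len(rankings)        # number of assessors
    n = len(rankings[0])     # number of stimuli
    rank_sums = [sum(r[j] for r in rankings) for j in range(n)]
    mean_rank_sum = m * (n + 1) / 2
    s = sum((total - mean_rank_sum) ** 2 for total in rank_sums)
    return 12 * s / (m ** 2 * (n ** 3 - n))


# Hypothetical panel: four assessors rank the 55, 65 and 75 dB samples
# by loudness (1 = quietest). One assessor swaps the two quieter samples.
panel = [[1, 2, 3], [1, 2, 3], [2, 1, 3], [1, 2, 3]]
print(kendalls_w(panel))
```

A value close to 1 would support the assumption that the attribute is understood and used in the same way across the panel.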

4.1.1.2

Affective/subjective scaling

Subjective quantification is often regarded as a volatile matter: we often hear engineers say that data collected from such subjective experiments is too variable to be considered meaningful, as the results cannot be replicated. This is a valid concern, and thankfully it is easily addressed. At an individual level, we can expect a degree of volatility, which can be caused by a large number of factors such as mood, personal taste, opinion, duration of exposure, recent experiences and other biasing factors. Certainly, this is problematic. However, when we analyse affective evaluation studies, we are typically looking at the population data, i.e. averaged across the panel of assessors, rather than at the individual level. This is where the important difference occurs. For such studies, it is typical to employ a moderately large panel of assessors and to analyse the population patterns. It is at this level of analysis that we expect to see stability and replicability in the data, i.e. that x % of the population has this view and y % has a different view. If we can replicate such data, then the study has revealed a valuable pattern that can inform our engineers’ choices.

Typically, such affective or subjective assessments, also known as consumer studies, are performed with consumers of the product or naïve assessors. Furthermore, by comparison to descriptive tasks, assessors are commonly asked to evaluate their overall, high-level or global perception of a product in terms of preference, acceptance, overall impression, basic audio quality, etc. At present, the similarity between such scales is not fully understood, but some studies have illustrated a high degree of similarity between preference and quality scales (Zacharov et al., 2017).

Figure 4.1 The Filter Model – Sound, Perception and Preference. A revision of the filter model which illustrates the relations between physical (instrumentation) measurements and sensory (objective and subjective) measurements (Pedersen and Fog, 1998). [Diagram: the physical stimulus passes through Filter 1 (the senses: sensory sensitivity and selectivity) to give the perceived stimulus, which passes through Filter 2 (non-acoustical factors: mood, context, emotion, background, expectation) to give likes/dislikes. Measuring points: M1 (physical measurement), M2 (perceptual measurement), M3 (affective measurement). Further labels: Objective, Instrumental, Subjective, Sensory evaluation.]

4.1.1.3

The filter model

The original filter model, developed by Pedersen and Fog (1998), is shown in Figure 4.1 and illustrates the relations between physical (instrumental) measurements and the sensory (descriptive and affective) measurements. The physical stimulus, the sound, is perceived through hearing, and the hearing sense acts like a filter (filter 1) by modifying the nature of the sound through the hearing mechanism (see Section 3.2 and Figure 3.1). An example is the difference between physical sound pressure level and perceived loudness. After the first filter, we have the perceived sound.

The only way to measure the characteristics of this is to ask listeners about their perception. The perceived sound passes through a second, non-acoustical filter (filter 2). This filter represents individual factors such as mood, context, personal background and expectations. After this filter, we can ask about individual opinions: how preferable or suitable a sound is, whether the quality is satisfactory or how annoying a sound is.

Measuring points M1–M3 are shown in the model. Measurements at each of these points may be made independently.

M1 represents the physical/instrumental measurements, i.e. sound pressure levels, frequency response functions, impulse responses, etc. Many of these measures are used continuously for the evaluation of audio systems and can also be employed as input for predictive models of the perceptual characteristics of the stimuli, e.g. to estimate loudness, sharpness, roughness, etc.

M2 represents the perceptual measurements, which are objective quantifications of the sensory strength of the attributes of the perceived stimulus. These may also be threshold determinations. The main purpose is to give information about the character of the sound as perceived by the hearing sense. The characteristics of the perceived stimulus are rated in objective terms without asking the assessors for preferences or annoyance. The tests are usually made with a panel of trained assessors who understand the attributes and how to rate them. From these measurements, the perceived characteristics are found, and a perceptual sound profile can be made. These tests are reproducible and usually yield small confidence intervals.

M3 represents the affective measurements, which are subjective measurements of preference, annoyance or connotative attributes. They are normally performed with a group of naïve (untrained and without experience in listening tests) test persons who are representative of the relevant target group (consumers, users, average citizens, etc.). The main purpose is to give information about reactions to the sound in a certain context. The context will usually have a major influence on the result, together with cultural and individual factors. The test results depend on who is asked, and subgroups or clusters with different preferences may appear.

To interpret the results of the physical measurements in M1, the perceptual measurements in M2 are often a valuable tool. It may also be easier to make predictions of the reactions in M3 from the perceptual measurements (M2) than from the physical measurements (M1). Relations between physical and perceptual measurements are called perceptual models, and relations between perceptual and affective measurements are called preference mapping.

Since the conception of the filter model, scientists have continued to study our brain and perception. The dual processing theory discussed in Sun (2001) and Kahneman (2011) presents the concept of two systems within the brain, comprising:

System 1 (implicit): Unconscious reasoning (fast, automatic, frequent, emotional, stereotypic, subconscious), which handles tasks such as localising sounds;

System 2 (explicit): Conscious reasoning (slow, effortful, infrequent, logical, calculating, conscious), which performs tasks such as pointing your attention towards someone at a loud party.

In the dual processing theory the brain processes information in parallel, in contrast to the serial processing proposed in the filter model.


4.2 DESIGNING A LISTENING TEST

Designing a listening test may be subdivided into the following steps, which are dealt with in this chapter:
1. Define the purpose of the test;
2. Select test method;
3. Collect the ingredients;
4. Plan and perform the test;
5. Collect data and check data quality;
6. Perform the analysis, interpretation and report.

4.2.1

Defining the purpose of a test

Two types of test exist, namely confirmatory and exploratory. Typically, we perform confirmatory tests, where we wish to demonstrate something, e.g. that System A is preferred over System B. To do this we would use a listening test and test this hypothesis using statistical analysis. Alternatively, we might just be interested in exploring the nature of a perceptual space and establishing what factors contribute to our perception and which elements are dominant or irrelevant. This is considered a more exploratory approach. Either way, we aim to select a test method that performs the task as efficiently as possible.
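As an illustration of a confirmatory test, consider a paired preference (2-AFC) comparison of System A and System B. A minimal sketch of an exact two-sided binomial test against the null hypothesis of no preference (p = 0.5) is given below. The counts are hypothetical, the function name is our own, and the choice of this particular test is an assumption for illustration; the appropriate analysis depends on the test method and design.

```python
from math import comb


def binomial_two_sided_p(k, n, p0=0.5):
    """Exact two-sided binomial p-value.

    Sums the probabilities of all outcomes that are no more likely,
    under the null hypothesis, than the observed count k out of n.
    """
    probs = [comb(n, i) * p0 ** i * (1 - p0) ** (n - i) for i in range(n + 1)]
    observed = probs[k]
    # Small tolerance so that equally likely outcomes are included
    # despite floating-point rounding.
    total = sum(p for p in probs if p <= observed * (1 + 1e-9))
    return min(1.0, total)


# Hypothetical result: 15 of 20 assessors prefer System A.
p_value = binomial_two_sided_p(15, 20)
print(f"p = {p_value:.3f}")
```

With these invented counts the p-value falls just below 0.05, so the null hypothesis of no preference would be rejected at the 5 % level.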

4.2.2

Test method selection

Once you have established the purpose of your sensory evaluation and have a clear idea of the research question and hypothesis, it is time to select your test method. This is not a trivial step, and once you have made your choice you are limited to what that method yields. The next chapter presents a large number of test methods, both from sensory science and specific to sensory evaluation of sound, with a summary overview of the methods illustrated in Figure 5.1. Test methods, outlined in Section 5.1 and expanded in detail in Sections 5.2 to 5.8, can be categorised as follows:
• Discrimination methods²
• Integrative methods
– Affective methods²
– Audio quality methods
• Descriptive analysis methods²
– Consensus vocabulary methods
– Individual vocabulary methods
– Indirect vocabulary methods
– Associative methods
• Mixed methods
• Temporal methods
• Performance methods

² Test methods classified according to Lawless and Heymann (2010, Table 1.1).

Figure 4.2 Example graphical user interface for a 2-AFC (preference) discrimination test using SenseLabOnline. Courtesy of FORCE Technology, SenseLab.

Examples of user interfaces from three method families include 2-alternative forced choice (2-AFC) discrimination tests (Figure 4.2), affective preference tests (Figure 4.3) and an attribute-based descriptive test (Figure 4.4). All of the above methods can be applied to, but are not limited to, answering questions such as:
1. Threshold or audibility determination;
2. Overall quality, mean opinion, preference, suitability or acceptability assessment;
3. Annoyance measurements;
4. Detection and scaling of impairments to reproduced sound;
5. Profiling of sound or systems for reproduced sound;
6. Mapping between two domains, e.g. affective and perceptual or perceptual and instrumental;
7. Intelligibility, localisability, etc.

Selection of an appropriate method is vital to ensuring you are able to answer your research question(s) properly and in an efficient manner. However, there is no absolute and unique way to select a suitable method. Chapter 5 is written to help you in this process, with guidance on the nature and common applications of different methods, and Chapters 8 to 14 provide a wide range of application examples to further inspire you.


Figure 4.3 Example of a graphical user interface for a 9-point hedonic scale preference test using SenseLabOnline. Courtesy of FORCE Technology, SenseLab.

Figure 4.4 Example graphical user interface for an attribute-based descriptive test for clarity, using SenseLabOnline. Courtesy of FORCE Technology, SenseLab.

Before we get to discussing individual methods in detail, we will provide generic guidance on common aspects of all methods, including how we quantify an assessor's impression and structure the experimental variables to provide a structured sensory evaluation experiment.

4.2.2.1

Quantification of impression

As introduced in Chapter 1, when performing any kind of sensory evaluation, we are using sound or audio stimuli to evoke a response from an assessor which we can then measure. This measurement is a quantification of the assessor's impression of the sound. Using measurements from multiple assessors, we are able to obtain a perspective from a sample population of assessors (sometimes called a panel), which we then analyse statistically for subsequent interpretation. To quantify impression we need to define the response attribute (or variable) that the assessor is to consider and the format of their response.

As we see from the filter model in Figure 4.1, in sensory evaluation the response variable, discussed further in Section 4.2.3.2, can be either perceptual/descriptive or affective/hedonic in nature. Details regarding the definition and usage of perceptual attributes are discussed in Section 4.3.2, and affective methods are elaborated in Section 5.3.1.

The response format is the means by which the assessors express their impression, usually via a graphical user interface (GUI). In the previous section, a range of response formats was presented. The nature of your research and the attributes you use may influence the type of response format. Often the response format is based on a line scale or else on discrete choices. Line scales can be used for collecting perceptual/descriptive or affective/hedonic data, and the associated methods are discussed further in Section 5.4 and Section 5.3.1 respectively. Discrimination methods provide assessors with discrete choices for their responses. As with all discrimination test types (see Section 5.1.1 and Section 5.2), the 2-alternative forced choice illustrated in Figure 4.2 is a response format that can be applied to both perceptual/descriptive and affective/hedonic questions. In addition to line and discrete choice response formats, specific response formats exist for specific methods. For example, when using performance methods for the evaluation of sound source localisation, the responses will be in the form of spatial location using pointing or drawing techniques (see Section 5.1.6). To a large degree, the combination of the response format and the response variable forms the basis of the test method, as discussed further in Chapter 5.

4.2.3

Experimental variables

For any kind of sensory evaluation, you as the experimenter will have an idea of the technology or stimuli you wish to evaluate, with the consideration that different stimuli may be perceived differently from each other by assessors. This is then used to formulate a research question and test hypothesis. The experimental variables are used as a means to bring structure and control to an experiment, in order to enable you to establish the causal effects within the experiment and in so doing either test a hypothesis or explore any causal effects. Input variables are controlled so that we can study their impact on the assessors' perception and responses using statistical analysis, and in so doing test our hypothesis.


4.2.3.1

Input variables

The input variables, which are also sometimes known as the independent variables, are the parameters controlled by you the experimenter and can be either categorical or quantitative in nature. For a sensory evaluation of sound we commonly evaluate the sound reproduction of devices, e.g. a binaural processing technology for headphones. To set up such an experiment we may wish to compare eight such technologies, which we will define as our first input variable and call the System factor, with 8 levels. To be able to perform the comparison, sound samples need to be reproduced through each system. You then decide on a set of 10 appropriate sounds, either critical in nature or representative of common listening patterns, which will be reproduced through each of the Systems. This will be our second input variable, which we will call the Sample factor, having 10 levels (i.e. 10 sound samples). In this example, we might also want to include some quantitative input variables, for example the level of reproduction, to evaluate performance at low, nominal and high sound reproduction levels specified in dB. We might call this input variable Level. We can continue to include other input variables such as:

Assessor: A panel of expert or naïve assessors performs the experiment. An Assessor factor is useful to be able to study differences in the data from each assessor;

Replication: A partial or complete replication of an experiment (labelled Replicate) may be of interest to you, if you wish to study how stable your results are, either for the experiment as a whole or per assessor;

Condition: Other Conditions, such as background noise level, bitrate, talker gender, etc., can be included as additional variables, as you find necessary.
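The size of the stimulus set implied by these input variables can be enumerated directly. The sketch below crosses the System, Sample and Level factors from the example above; the factor level names and the three dB values are hypothetical placeholders.

```python
from itertools import product

# Factor levels for the binaural headphone example in the text.
# Names and dB values are placeholders chosen for illustration.
systems = [f"System{i}" for i in range(1, 9)]     # 8 levels
samples = [f"Sample{i}" for i in range(1, 11)]    # 10 levels
levels_db = [55, 65, 75]                          # low, nominal, high


def cross_factors(*factors):
    """Return every combination of the given factor levels."""
    return list(product(*factors))


stimuli = cross_factors(systems, samples, levels_db)
print(len(stimuli))  # 8 * 10 * 3 = 240 stimuli per assessor and replicate
```

Adding a further factor multiplies the count by its number of levels, which is why the list of input variables should be reviewed against the available test time before it is finalised.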

4.2.3.2

Response variables

The response variable(s), sometimes also referred to as the dependent variable(s), are the assessors’ evaluations of the presented stimuli defined by the input variables. Depending on the nature of the experiment, there may be one response variable (e.g. basic audio quality) or multiple response variables (e.g. envelopment, timbral balance, scene depth, distortion, etc.). Using the response variables we collect assessor data that is then used for statistical analysis. When setting up the experiment and our hypothesis, our primary focus is to evaluate differences in means or medians between different Systems, Samples, etc. Using appropriate statistical tests, we can then test our experimental hypotheses and draw conclusions regarding the relationship between the input variables and the assessor responses.
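One common way to store such data, assumed here for illustration rather than prescribed by the text, is one record per rating, holding the levels of the input variables alongside the response. Means per level of any factor are then simple to compute. All values below are invented to show the layout.

```python
from collections import defaultdict
from statistics import mean

# One record per rating: input variables plus the response variable.
# The scores are invented purely to illustrate the data layout.
ratings = [
    {"assessor": "A1", "system": "S1", "sample": "speech", "score": 72},
    {"assessor": "A2", "system": "S1", "sample": "speech", "score": 68},
    {"assessor": "A1", "system": "S2", "sample": "speech", "score": 40},
    {"assessor": "A2", "system": "S2", "sample": "speech", "score": 50},
]


def mean_by(records, factor, response="score"):
    """Mean response for each level of one input variable."""
    groups = defaultdict(list)
    for record in records:
        groups[record[factor]].append(record[response])
    return {level: mean(values) for level, values in groups.items()}


print(mean_by(ratings, "system"))
```

The same function can be called with "assessor" or "sample" as the factor, which is a quick first look at the data before the formal analysis.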

4.2.3.3

Other experimental variables

Uncontrolled variables: These are variables that may have an influence on the results, but are not included in the input/independent variables. Examples might include the presentation order of stimuli, which can be controlled through the design of experiment using randomisation or blocking. By using different presentation orders for assessors, we may reduce any systematic impact of presentation order, which can be tested statistically. If left uncontrolled, these variables can lead to bias in the data that may be of importance, yet overlooked (e.g. impact of talker gender);

Disturbing variables: These are unknown nuisance variables that are uncontrolled and may affect or confound results. For example, the age of assessors might have an impact on results. By collecting demographic data, it is possible to study whether age has an impact on the data, by using age as a covariate during statistical analysis.
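Randomisation of presentation order is straightforward to script. The following sketch shows two simple options: an independent random order per assessor, and a cyclic Latin square in which every stimulus occupies every presentation position once. Function names, the seed and the stimulus labels are assumptions for illustration.

```python
import random


def random_orders(stimuli, assessors, seed=2019):
    """Give each assessor an independently shuffled presentation order."""
    rng = random.Random(seed)  # fixed seed so the plan can be reproduced
    orders = {}
    for assessor in assessors:
        order = list(stimuli)
        rng.shuffle(order)
        orders[assessor] = order
    return orders


def cyclic_latin_square(stimuli):
    """Rows are rotations of the stimulus list, so every stimulus appears
    exactly once in every presentation position across the rows."""
    n = len(stimuli)
    return [[stimuli[(row + col) % n] for col in range(n)] for row in range(n)]


stimuli = ["A", "B", "C", "D"]
orders = random_orders(stimuli, ["assessor1", "assessor2", "assessor3"])
square = cyclic_latin_square(stimuli)
for row in square:
    print(row)
```

Note that a cyclic square balances position only: each stimulus is always preceded by the same neighbour, so it does not balance carry-over effects between successive stimuli.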

4.2.3.4

Design of experiment

The design of experiment (DoE), also sometimes referred to as the statistical experimental design, is the manner in which we organise all of the experimental variables for an experiment, with the aim of minimising the effect of disturbing variables to ensure that we can establish, through statistical analysis, the impact and importance of the input/independent variables on our assessors’ responses (dependent/response variables). Broadly speaking, we are aiming to maximise the signal-to-noise ratio of the data from an experiment, such that we can test our hypotheses with confidence and hopefully answer our research questions. Design of experiment is an involved topic and we will only mention it briefly here. When necessary, you are referred to more detailed texts on this topic such as Hunter (1996) and Montgomery (2012), with audio-related examples in Bech and Zacharov (2006, Section 6.1). The DoE comprises the following elements:

Treatment design takes into account how the experiment is structured in terms of the experimental variables, for example which combinations of experimental variables will be included in the experiment:
– Full factorial is the most commonly used treatment design in our field, whereby all combinations of the independent variables are included in the experiment;
– Fractional factorial is a generic and broad term to describe designs with a subset of the combinations of the independent variables in the experiment.

Allocation of stimulus design relates to how the stimuli are presented to each assessor, for example presentation order, randomisation or blocking of the stimuli and response variables, and consists of two categories:
– Within-subjects design, comprising experiments in which all assessors are presented with all combinations of the experiment;
– Between-subjects design, used to present a subset of all stimuli to a subgroup of assessors or to individual assessors.

Combinations of treatment and allocation of stimuli designs can be made. It is very common in our field to employ full factorial, within-subjects designs, due to their simplicity and known robustness. However, when the number of independent variables becomes very large there may be a desire to consider a fractional factorial design, to reduce the duration of the total test. This type of shortcut should be well understood in terms of how it may affect the quality of the data and the power of the test. In certain situations, there may be geographical limitations associated with available labs and assessors in those labs, which could lead to considering a between-subjects design, whereby a subset of assessors perform tests in one specific laboratory. This is sometimes also referred to as a nested design. The combination of a fractional factorial and a between-subjects design is also possible, with all its complexities and details. Such experiments can be very beneficial when a large full factorial experiment becomes unmanageable. However, it should be noted that the design of such experiments is involved and should be performed with full awareness of the implications of testing a subset of the full factorial design (see Montgomery (2012)). Not only may such designs deliver different levels of statistical power compared to a full factorial design, but the analysis complexity may also be far greater. While you are designing your experiment, it is advised that you also plan your statistical analysis approach to take into account the specific nature of your design of experiment, e.g. if a nested design is employed.

The design of experiment aims to provide a means to perform a test with balanced presentation of stimuli (e.g. the same number of ratings per assessor or per system), such that robust, repeatable and reliable statistics can be performed. The combination of the input/independent variables, the response/dependent variables and the design of experiment is used as the input data set for the statistical analysis discussed in Chapter 6 and Chapter 7. Through the statistical analysis we are able to explore the factors contributing to the variance in the collected data and/or to test the defined hypotheses.
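The difference between within-subjects and between-subjects allocation, and the notion of balance, can be made concrete with a small script. The sketch below is one simple way to split samples into blocks across assessors; the block scheme, function names and factor sizes are assumptions for illustration and not a recommended design.

```python
from collections import Counter
from itertools import product


def within_subjects(assessors, systems, samples):
    """Every assessor rates every system/sample combination."""
    return list(product(assessors, systems, samples))


def between_subjects(assessors, systems, samples, n_blocks=2):
    """Each assessor rates all systems, but only one block of samples.
    Sample blocks are assigned to assessors in rotation."""
    blocks = [samples[i::n_blocks] for i in range(n_blocks)]
    plan = []
    for index, assessor in enumerate(assessors):
        block = blocks[index % n_blocks]
        plan.extend((assessor, sy, sa) for sy, sa in product(systems, block))
    return plan


def is_balanced(plan, position):
    """True if every level of the factor at `position` is rated equally often."""
    counts = Counter(row[position] for row in plan)
    return len(set(counts.values())) == 1


assessors = ["A1", "A2", "A3", "A4"]
systems = ["S1", "S2", "S3"]
samples = ["x1", "x2", "x3", "x4"]

full = within_subjects(assessors, systems, samples)
split = between_subjects(assessors, systems, samples)
print(len(full), len(split))
print([is_balanced(split, p) for p in range(3)])
```

In this toy case the between-subjects plan halves the number of ratings per assessor while remaining balanced per assessor, system and sample. Balance depends on the number of assessors being a multiple of the number of blocks, which is worth checking before recruiting a panel.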

4.2.4

Test duration

To avoid listener fatigue and loss of concentration, the duration of a listening test session should be limited. A rule of thumb would be not to exceed a 1 to 2 hour session, with pauses every 20–30 minutes or whenever the assessor is in need of a break. Practically, listening tests with consumers tend to become challenging to complete if they require more than one session, as consumers may not return for a second session. This should be taken into account when designing the test. Even with expert assessors, very long listening tests, i.e. > 4 hours per assessor, even if spread over multiple sessions/days, risk incompletion due to the length of the task.
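These rules of thumb can be turned into a quick feasibility check while the design is still on paper. The sketch below estimates the number of sessions and breaks from the number of trials. The time per trial, the session limit and the break interval are assumptions to be replaced with figures from your own pilot tests.

```python
import math


def estimate_sessions(n_trials, seconds_per_trial,
                      max_session_min=60.0, break_interval_min=25.0):
    """Rough session plan for one assessor.

    max_session_min and break_interval_min default to values inside the
    1 to 2 hour and 20-30 minute ranges quoted in the text.
    """
    listening_min = n_trials * seconds_per_trial / 60.0
    sessions = max(1, math.ceil(listening_min / max_session_min))
    per_session_min = listening_min / sessions
    breaks = max(0, math.ceil(per_session_min / break_interval_min) - 1)
    return {
        "listening_min": listening_min,
        "sessions": sessions,
        "breaks_per_session": breaks,
    }


# Hypothetical case: 240 single-stimulus trials at about 30 s each.
plan = estimate_sessions(n_trials=240, seconds_per_trial=30)
print(plan)
```

If the estimate requires more than one session and the panel consists of consumers, this is an early signal to reduce the number of factor levels or to consider a between-subjects or fractional design.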

4.2.5

Test samples

When selecting the original test samples, you should consider the purpose of the device and what type of sound samples would be appropriate. For example, if testing a telecommunication device, a range of male and female speech samples may be most relevant. However, it might also be of interest to evaluate how a small set of music samples performs, for example to represent on-hold music scenarios. It is up to you as the experimenter to consider which scenarios are relevant. By comparison, if we are evaluating active noise control (ANC) technology, we might want to test a balance of music and speech listening conditions. This technology is, however, far more complex, and thus we also need to think about relevant background noise conditions to test the technology meaningfully, a topic we will discuss in the next section.

Lastly, we have the matter of sample criticality. This is a term commonly used in our field, but not particularly well defined. ITU-R Rec. BS.1116-3 (2015) states that “. . . critical material is to be used in order to reveal differences among systems under test. Critical material is that which stresses the systems under test”. Typically, critical test samples are found specifically for a given test or category of technology under test. It is a laborious process of expert listening to find sound samples that highlight differences between technologies or lead to audible characteristics/artefacts. The nature of a critical sound sample is very much dependent on the type of product or algorithm to be evaluated. For example, if testing hearing aids or telecommunication products, it may be of importance to test primarily speech signals with different voice characteristics and well-defined background noise conditions (steady state noise, babble, competing speech, time varying conditions, etc.). By comparison, when evaluating audio codecs, a wide range of music genres or well-known critical samples (see EBU Tech. 3253 (2008)) would be most appropriate.

Whilst critical test samples are the best way to show differences in performance between technologies, it should be noted that they are not necessarily representative of the audio content that will be used with the product technology. If you are interested in establishing the general level of performance of the products under test, you may wish to consider a more ecologically motivated approach to test sample selection. This may comprise selecting test samples that would be commonly encountered in everyday product usage. For example, if we are considering wireless loudspeakers, then a range of pop, jazz, classical, radio play, podcast, newscast and sport content might be considered a relevant spectrum of test samples. However, you should be aware that taking this approach will most often lead to non-critical samples being selected that will not necessarily exercise or test the weaknesses of the technology. As a result, it may be more difficult to show differences between the products under test, and in the worst case you may have to accept the null hypothesis for the test, i.e. that no statistically significant differences can be found between the systems under test. You should therefore make a conscious decision regarding the use of, or balance between, critical and non-critical (ecologically motivated) test samples in your test and the potential impact on the results.

Last, but not least, ensure the test samples are well prepared and of sufficient quality in terms of low background noise, proper use of dynamic range, avoidance of clipping, suitable and sufficient bandwidth, and correct calibration level where needed. If speech or music samples are looped during presentation, they should be carefully edited to ensure the loop is smooth and does not introduce distracting artefacts due to the editing/loop point.
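Some of these preparation checks can be automated before any listening takes place. The sketch below reports the peak level and counts samples at or near full scale. It assumes the audio has already been loaded as floating-point samples normalised to the range −1 to 1; the function names and the clipping threshold are our own choices, and the check does not replace listening to the samples.

```python
import math


def peak_dbfs(samples):
    """Peak level in dB relative to full scale (1.0)."""
    peak = max(abs(x) for x in samples)
    return -math.inf if peak == 0 else 20 * math.log10(peak)


def count_clipped(samples, threshold=0.999):
    """Number of samples at or above the threshold magnitude,
    a crude indicator of possible clipping."""
    return sum(1 for x in samples if abs(x) >= threshold)


# Synthetic check signal: 1 kHz sine at half of full scale, 48 kHz rate.
sample_rate = 48000
sine = [0.5 * math.sin(2 * math.pi * 1000 * n / sample_rate)
        for n in range(sample_rate)]

# The same signal with 12 dB of excess gain, hard limited to full scale.
overdriven = [max(-1.0, min(1.0, 4 * x)) for x in sine]

print(round(peak_dbfs(sine), 2), count_clipped(sine))
print(round(peak_dbfs(overdriven), 2), count_clipped(overdriven) > 0)
```

A non-zero clipped count is only a warning flag: deliberately limited programme material may also reach full scale, so flagged files should be auditioned rather than rejected automatically.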

4.2.6

Usage scenarios

When performing any sensory evaluation, it is important that the product, technology or service is evaluated in a meaningful and pertinent manner. This can mean a multitude of things. It is of interest to test the product in relevant usage conditions and across a range of conditions that exemplify common usage, such that from the listening tests we can estimate the performance in real-life usage. Whilst it may be of academic interest to test devices in silent anechoic conditions, this may be very far from the common device usage. As a result, it may be of great importance to consider the common and important usage scenarios for a device and to reproduce these conditions.

Continuing from our earlier ANC device test, we might learn that the devices are typically used in a range of situations with different and well-described background noise conditions, e.g. listening to music on a busy city street, or watching movies on a plane, in a train, in a bus, etc. This is valuable information that could shape the usage scenarios you include in your test, based upon noise type, noise characteristics (steady-state, time varying, etc.), signal-to-noise ratio and other properties of the usage environment.

Of course, if performing tests in the lab (e.g. listening room, etc.) then we are probably far from the actual usage condition (e.g. sitting in a plane, watching a movie or going to a concert hall for a performance). To improve the degree of authenticity, we have a lot of options today. We may simulate the usage scenario with different degrees of accuracy, ranging from reproducing the environmental sound field and/or projecting the video of the original scenario, all the way through to a complete immersive virtual reality (VR) representation of the original usage scenario. The latter is not yet commonplace, but the beginnings are already seen in some sensory evaluations.

It is highly likely that not all scenarios and test conditions can be tested. However, it does make sense to consider and select relevant usage scenarios that will exercise the performance of the technology.


4.2.7

Perceptual constancy

The idea of perceptual constancy relates to how consistent the sample is over time, i.e. whether it is perceptually similar throughout or varies significantly in its characteristics over time. Certainly this is a matter for consideration. Ideally, we select samples that have different, pertinent characteristics that are either critical or ecologically motivated. In many experiments we may consider the sample as an independent variable, such that we can statistically study any difference in system performance for different samples. In this case we might select samples that have a consistent and constant characteristic, for example stationary car noise at 80 km/h, babble with ambient music or male speech in a quiet studio. ITU-R Rec. BS.1534-3 (2015) proposes the use of short samples, i.e. approximately 10–12 s, to ensure samples are consistent. In general, the stimuli should be long enough to evoke a stable and consistent response, but not so long as to lead to wide differences in perception during the sample. The length may depend on the type of stimuli and the purpose; stimuli of 10–30 seconds are common and a good starting point.

More complex samples may have time-varying characteristics that could be pertinent to a use case, but may also lead to different evaluations by assessors depending on where in the sample they focus their attention. For example, it may be of interest to study the performance of an audio codec during a transition from silence to music or vice versa. Depending on the nature of your research question, you may choose perceptually constant samples and/or more complex time-varying stimuli.
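Perceptual constancy must ultimately be judged by listening, but a crude instrumental screen can flag candidate samples whose level changes substantially over time. The sketch below computes the RMS level in consecutive windows and reports the range. Level is only one of many characteristics that may vary, and the window length, the level floor and the function names are assumptions for illustration.

```python
import math


def rms_db(block, floor=1e-12):
    """RMS level of a block in dB re full scale, floored at -120 dB."""
    mean_square = sum(x * x for x in block) / len(block)
    return 10 * math.log10(max(mean_square, floor))


def level_range_db(samples, sample_rate, window_s=1.0):
    """Difference between the loudest and quietest window, in dB.
    Any incomplete window at the end of the signal is ignored."""
    n = int(window_s * sample_rate)
    levels = [rms_db(samples[i:i + n])
              for i in range(0, len(samples) - n + 1, n)]
    return max(levels) - min(levels)


# Synthetic example at a low sample rate to keep the run time short:
# 2 s of a 50 Hz tone followed by 2 s of the same tone 20 dB lower.
rate = 1000
tone = [math.sin(2 * math.pi * 50 * n / rate) for n in range(2 * rate)]
steady = tone + tone
stepped = tone + [0.1 * x for x in tone]

print(round(level_range_db(steady, rate), 2))
print(round(level_range_db(stepped, rate), 2))
```

A large range does not make a sample unsuitable; it indicates that the sample should be treated as time-varying when the test is designed and the results are interpreted.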

4.3 LISTENING TEST INGREDIENTS

In the following sections, we will introduce many practical matters where you, as the experimenter, need to make informed decisions in order to build pertinent and sensitive tests.

4.3.1 Stimuli

Stimuli are the sounds used in a test to evoke a response from an assessor. The stimuli are the subject of the tests, the core of what is to be tested. The stimuli may originate from a number of sources, for example:
• Product testing;
  – Electroacoustic devices for reproduced sound or telecommunication;
  – Product sound: Automotive, household, etc. devices;
  – Sound signalling devices;
  – Room acoustics;
  – External noise sources and environmental noise;
• Product testing by recording;
• Algorithm testing;
• Brand sounds and music.


4.3.1.1 Product testing

There are five main groups of products of relevance:
1. Electroacoustic devices: Loudspeakers, headphones, mobile phones, hearing aids, etc. which typically reproduce or amplify music and speech;
2. Product sound: Including devices not primarily intended for sound reproduction, such as automotive products, household appliances, environmental sound sources and other products where the sounds are a by-product of their function;
3. Sound signalling devices: For example, alarms, user interface sounds, pedestrian traffic signals, where the sound is intentionally used for a specific purpose;
4. Room acoustics: In homes, schools, call-centres, shopping centres, auditoria, sports arenas, where the acoustics have an impact on the function of the space;
5. External noise sources: The assessment of product sounds (e.g. car, trains, traffic, industrial facilities, air conditioning, etc.) in the context of noise annoyance and soundscapes.

Testing of electroacoustic devices is particular in the sense that we cannot listen to the device alone, but only to sound samples reproduced through it. This means that the stimulus is the combination of the sound system and the sound sample. It is therefore important that the sound samples possess characteristics that are adequate for testing the systems under test (see Section 4.2.5). Particular to loudspeaker testing, it is vital to take into consideration the complex acoustic interactions of the speaker with the test room, which have a vast impact on the sound quality, as discussed extensively in Bech (1994) and Toole (2017). The purpose of testing electroacoustic devices is often to establish a level of quality or preference, or to characterise the sound reproduction. Examples of this type of testing are provided in Chapters 8 to 10.

Product sound quality often considers the adequacy of the sound from a product, e.g. a vacuum cleaner or sports car. Product sound quality is not absolute, but takes into consideration the fitness for purpose of the product from the end-user perspective. Therefore, listeners for such tests should be chosen from the relevant consumer segment.

For tests on sound signalling devices, the purpose is often to test whether the signals are easily understood, whether they are audible in a specified background noise and perhaps how annoying they are for persons for whom they are not intended. There may also be some branding issues for signal sounds, for example in cars or other consumer products.

Listening tests on room acoustics are special in the sense that they often require the type of sound source that the room is intended for, whether it is an orchestra, a singer, a speaker or a teacher. Due to the physical nature of rooms, it is very hard to compare them directly. As a result, some special approaches for comparing concert halls and cars are considered in Chapter 12 and Chapter 10, respectively.

For assessment of external noise sources, e.g. traffic, wind turbines, industrial sound sources, etc., the purpose is often related to the annoyance of the sound, e.g. to establish which characteristics are more annoying than others. The true annoyance of people cannot be directly measured in a listening test but should be measured by a socio-acoustic measurement according to ISO 15666 (2003).


4.3.1.2 Product testing - By recording

Real device tests should ideally be performed directly with the actual products. However, in some cases, it is more practical and sufficient to use recordings of the real devices for the listening tests. With recordings it is easier to run tests without visual bias and to switch between stimuli for comparative assessments, which may give better discrimination between stimuli. Computer-based stimulus presentation (i.e. presentation of the recorded sound files) and test management make it faster and more efficient to perform such tests. The downside is that not all details of the real devices are represented accurately in a reproduction of a recorded real device. Recording techniques should be chosen according to the intended reproduction mode. Binaural recording techniques are widely used, using dummy heads or head and torso simulators (HATS). This gives a realistic impression of the recorded sounds both in the timbral and, to some extent, in the spatial domain. It is a rather easy way of making recordings but it requires a special mannequin: an artificial head and torso with microphones in carefully designed artificial ears. With a good artificial head, the diffraction of the incoming sound waves is similar to that for a real person placed in the sound field. The microphones in the ears will register the same sound waves as a person would hear. When reproducing the recording on a good pair of headphones, a very realistic impression of the real sound may be obtained. Depending on the nature of the dummy head and the location of the microphones (at the ear canal entrance or at the eardrum position), a correction filter may need to be applied to ensure correct reproduction calibration for the headphones.
One disadvantage of dummy heads compared to other recording techniques relates to the fact that the HATS pinna is always different from the assessor’s ear, leading to perceived differences in the reproduced sound both spatially and timbrally. A comparison of physical device tests with HATS recordings of the same devices was made by Lorho et al. (2010b). The comparison found that the results from both methods were very comparable for timbral dimensions. However, higher dimensions relating to spatial characteristics showed differences between the data collected with the physical devices and with HATS. The same recording technique, using artificial heads with ear canals, may be successfully used for recordings of phones, earplugs and hearing aids (see for example Chapter 9). The spatial decomposition method (SDM) is a novel means to capture the spatial characteristics of a space when excited by a transducer, e.g. a loudspeaker. This method was originally developed by Tervo et al. (2013) and applied to capture sound fields in concert halls (see Chapter 12). More recently, it has been applied to the capture of speaker/car cabin acoustic interactions, as discussed in Chapter 10. For product sound quality, environmental sound sources, sound signalling devices and perhaps other devices, it may be relevant to include background noise. It is worth considering whether the background noise should be part of the recording or recorded separately. For these devices, the context in which they are normally heard may have a considerable influence on the assessment. It should therefore be carefully considered whether a representation or simulation of the context will improve the value and relevance of the test results. Lastly, certain products can be tested by recording in simulated sound environments. For example, hearing aids, mobile phones or headsets are often used in a multitude of environments with different noise characteristics.
A common approach today is to capture noise fields using microphone arrays. Using, for example, first-order or higher-order ambisonics (HOA), a sound field can be reproduced with a multichannel loudspeaker system in an anechoic chamber or listening room. The system under test can then be placed on a HATS and recordings made for a wide range of noise and signal-to-noise ratio conditions for subsequent evaluation over headphones (see Section 9.11). In all of these cases, it is important to consider what degree of calibration is needed to reproduce the sound in a repeatable and meaningful manner during the listening test.

4.3.1.3 Algorithm testing

Audio algorithms are as prevalent as hardware solutions in the digitally oriented world. Audio enhancement algorithms are thus an important category to evaluate for most device families, whether hearing aids, mobile phones, headsets, broadcast systems, music streaming services, etc. The algorithms have a wide range of purposes, from wind noise reduction to audio/speech compression (i.e. codecs), noise suppression and bandwidth expansion, as well as algorithms for other purposes that may affect the audio quality, e.g. watermarking. In all cases the algorithms need to be tested with some sound stimuli, i.e. sound samples, reproduced with loudspeakers or headphones. The selection of suitable or critical samples is once again important, as is ensuring that the loudspeakers or headphones are fit for the purpose. Furthermore, when testing certain types of algorithms, you should be aware of existing common protocols for testing. For example, audio codecs are often evaluated following ITU-R Rec. BS.1116-3 (2015) or ITU-R Rec. BS.1534-3 (2015), whilst speech coding technology often follows ITU-T Rec. P.800 (1996) and noise suppressors ITU-T Rec. P.835 (2003).

4.3.1.4 Brand sound and music

Sound logos, on-hold music and other sounds for user interfaces are typically created by sound designers, composers and musicians. As they are designed with a specific purpose, they should ideally be tested with that context in mind, i.e. using a relevant device for the application. If, for example, the sounds are intended as waiting sounds on a telephone line, band-pass filtering to the telephone frequency range may be appropriate. Some examples in relation to sound branding are provided in Chapter 13.
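To illustrate the band-pass filtering mentioned above, the following sketch applies a single second-order band-pass filter designed with the bilinear transform. It rests on several assumptions that are not taken from this chapter: the pass band of 300 Hz to 3400 Hz is the commonly cited narrowband telephone range, the function names are hypothetical, and one second-order section has far gentler slopes than a real telephone channel or a standardised telephone-band filter.

```python
import math


def bandpass_coefficients(fs, f_low=300.0, f_high=3400.0):
    """Second-order band-pass with unity gain at the centre frequency.

    The band edges are assumed values; the -3 dB bandwidth is approximate.
    """
    f0 = math.sqrt(f_low * f_high)      # geometric centre frequency
    q = f0 / (f_high - f_low)           # quality factor from the bandwidth
    w0 = 2 * math.pi * f0 / fs
    alpha = math.sin(w0) / (2 * q)
    a0 = 1 + alpha
    b = (alpha / a0, 0.0, -alpha / a0)
    a = (1.0, -2 * math.cos(w0) / a0, (1 - alpha) / a0)
    return b, a


def filter_samples(samples, b, a):
    """Apply the filter as a direct form I difference equation."""
    x1 = x2 = y1 = y2 = 0.0
    out = []
    for x in samples:
        y = b[0] * x + b[1] * x1 + b[2] * x2 - a[1] * y1 - a[2] * y2
        out.append(y)
        x2, x1 = x1, x
        y2, y1 = y1, y
    return out


def steady_state_gain(freq, fs, b, a):
    """Measure the gain for a 1 s sine, skipping the first half as transient."""
    n = int(fs)
    tone = [math.sin(2 * math.pi * freq * i / fs) for i in range(n)]
    filtered = filter_samples(tone, b, a)
    half = n // 2
    rms_in = math.sqrt(sum(v * v for v in tone[half:]) / (n - half))
    rms_out = math.sqrt(sum(v * v for v in filtered[half:]) / (n - half))
    return rms_out / rms_in


fs = 16000
b, a = bandpass_coefficients(fs)
for freq in (50, 1000, 7000):
    gain_db = 20 * math.log10(steady_state_gain(freq, fs, b, a))
    print(f"{freq:>5} Hz: {gain_db:6.1f} dB")
```

For an actual test, the filter should match the target transmission chain, which may also involve a codec and level limiting in addition to band limiting.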

4.3.2 Attributes

An attribute name is a label associated with something we can perceive (i.e. an attribute). In this section we will discuss the nature, selection, development and usage of attributes for sensory evaluation. Attributes generally are either consensus or individual in nature. The use of a so-called consensus vocabulary (CV) method is a common way to evaluate the perceptual characteristics of products. In this case the attributes and attribute names are common to all assessors. The sensory strength for each attribute can then be objectively evaluated for each product, providing a characterisation of how each product is perceived. Alternatively, for so-called individual vocabulary profiling (IVP) approaches, the working assumption is that each assessor will use their own attribute names, which are related to common perceptual characteristics that can be found during statistical analysis. This chapter focuses upon CV methods, with further guidance on IVP methods to be found in Chapters 10, 12 and 14. When performing any sensory evaluation employing consensus attributes, the selection or definition of the attributes can occur in one of the following ways:
• Experimenter defined;
• Selected from an existing lexicon;
• Developed with a panel.

Experimenter-defined attributes are generally discouraged, to avoid the bias and personal influence of the experimenter. A more neutral and generally accepted approach is to either select suitable attributes from an existing lexicon or else develop attributes with a panel of trained or expert assessors. Many lexicons of attributes exist for a wide range of applications and you are advised to consider their suitability for your research, as discussed further in Section 4.3.2.2. If you cannot find a suitable lexicon, you will need to develop a specific set of attributes with a panel for your new application, which is discussed further in Section 4.3.2.3.

4.3.2.1 Requirements for good consensus attributes

One of the purposes of labelling and defining consensus attributes is to be able to communicate the meaning of perceived sound characteristics in an objective and consistent manner. Furthermore, the attributes should be suitable for reliable assessments of perceptual product characteristics within a panel. This leads to the following list of desired attribute characteristics, inspired by Lawless (2013), which can be found in ITU-R Rep. BS.2399 (2017):
1. Powerful: Good discrimination power among stimuli (i.e. a large span of mean values and small confidence intervals);
2. Good for individual usage: Assessors can give reproducible and discriminative assessments;
3. Good consensus characteristics: Good agreement among assessors, unambiguous attributes to all subjects;
4. Independent: Low redundancy and correlation with other attributes, little or no overlap with other terms;
5. Specifiable attribute scales: Using text labels and reference sound samples to illustrate the meaning and polarity of the attribute scale;
6. Relation to reality: Attributes should be practically usable to assess differences between products;
7. Relation to preference: Should be related to concepts that influence consumers’ preference, through further analysis (e.g. preference mapping);
8. Relates to metrics: Relates to physical measures defining the stimuli. Such metrics may not yet exist.

Points 1–5 should be fulfilled for good consensus attributes. The degree of fulfilment can be demonstrated by results from listening tests, as discussed further in Section 4.3.2.4. Point 3 is of particular importance: if an attribute is ambiguous, it may be used differently by each assessor, causing disagreement. This implies that attributes should in general be unidimensional in nature. Qualified attributes may either be developed by a word elicitation process with assessors for novel applications or selected from lexicons of predefined attributes. You are advised to avoid pulling words out of thin air and calling them attributes.
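Two of the requirements above lend themselves to simple numerical checks: discrimination power (point 1, span of means and width of confidence intervals) and independence (point 4, correlation between attributes). The sketch below is illustrative only. All ratings are hypothetical, the function names are invented for this example, and the confidence interval uses a normal approximation (1.96 standard errors), which is optimistic for small panels where a t-based interval would be wider.

```python
import math
from statistics import mean, stdev


def mean_ci95(ratings):
    """Mean and approximate 95% confidence half-width (normal approximation)."""
    m = mean(ratings)
    half_width = 1.96 * stdev(ratings) / math.sqrt(len(ratings))
    return m, half_width


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((p - mx) * (q - my) for p, q in zip(x, y))
    sxx = sum((p - mx) ** 2 for p in x)
    syy = sum((q - my) ** 2 for q in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical ratings (0-100 scale) of one attribute by six assessors.
ratings_by_system = {
    "system A": [12, 15, 10, 14, 13, 11],
    "system B": [55, 60, 52, 58, 57, 54],
}
for system, ratings in ratings_by_system.items():
    m, hw = mean_ci95(ratings)
    print(f"{system}: mean {m:.1f}, 95% CI +/- {hw:.1f}")

# Hypothetical panel means of two attributes over four systems.
bass_strength = [20, 45, 70, 85]
dark_bright = [80, 60, 35, 15]
r = pearson(bass_strength, dark_bright)
print(f"correlation between the two attributes: {r:.2f}")
```

A large span of means with narrow intervals points to a discriminating attribute, whereas a correlation close to +1 or -1 between two attributes suggests that they may be redundant for the stimuli in question.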


4.3.2.2 Lexicons and sound wheels

As already introduced in Chapter 1, attributes are useful to characterise our perception in general and can be applied to sound quality assessment using descriptive analysis or other consensus vocabulary methods. Lexicons are standardised vocabularies that objectively describe the sensory properties of consumer products (Lawless and Heymann, 2010). Laboratories making repetitive listening tests within the same product category may have internal lists of descriptors which their assessors are familiar with and trained on. Generally accepted or maybe even formally standardised lists of attributes constitute a lexicon. Besides the name, the attributes should have scale labels for the sensory assessment scales and a definition that clarifies what the attribute name means. With a lexicon at hand, part or all of the word elicitation process can be reduced to selecting the relevant attributes from the lexicon. The attributes may be selected based on a pilot test, e.g. finding the attributes that best characterise the differences between the systems. Alternatively, the test leader or the customer may choose a set of attributes of interest for the listening test. The attributes of a lexicon may be organised in a sound wheel. Wheels have been common practice since some of the earliest known lexicons: in early sixteenth-century medical diagnostics, descriptions of the colour, smell and taste characteristics of urine were presented as a lexicon on a urine wheel (see Figure 4.5). A wheel is often used as a hierarchical visual representation of a perceptual attribute lexicon. Attributes with similarities are placed in the same category. Related categories are grouped near each other and placed on a sunburst graph. The result is a visual representation of the attribute relationships. Such wheels may be used to explain and understand descriptive sensory analysis and may be used by panellists during training.
One of the most advanced lexicons we are aware of is that of the coffee industry, which we introduce as an example of a thoroughly developed lexicon. Their extensive lexicon has been developed over many years across their industry and has resulted in The Coffee Taster’s Flavor Wheel (see Figure 4.6), with a very clearly defined lexicon (World Coffee Research, 2013). An example extract is shown in Figure 4.7, comprising the attribute name and definition with associated physical references and intensity scores for coffee. Furthermore, this lexicon has been translated into numerous languages. In our field of sound and audio, we are presently in the process of formulating our lexicons. Many attribute sets have been developed for specific experiments and purposes. Wider efforts to collect attributes together can be found for sound generically in Pedersen (2008) and for vehicle sounds in Altinsoy and Jekosch (2012). A number of efforts have tried to collect together common attributes into lexicons, including Lindau et al. (2014); Kaplanis et al. (2014); Pedersen and Zacharov (2015); Pearce et al. (2016); Zacharov et al. (2016b); Kuusinen and Lokki (2017), but a common lexicon across our domain is yet to be developed and agreed upon. The original sound wheel for reproduced sound is shown in Figure 4.8 (Pedersen and Zacharov, 2015; ITU-R Rep. BS.2399, 2017), with additional spatial sound attributes in Zacharov and Pedersen (2015). Complete definitions of each attribute are provided in Figure 4.9. Creating a lexicon is a necessary first step to understanding what causes systems (e.g. sound reproduction systems) to sound the way they do. The goal of a lexicon is to provide a toolset of attributes that can be used to evaluate and measure the perceptual characteristics of systems in a reliable, repeatable and communicable manner. In the ideal case, as illustrated with the Coffee Taster’s Flavor Wheel, a lexicon may be developed within an industry, to become a common standard, enabling a common language to measure and characterise the perception of products.

Figure 4.5 A hand-drawn urine wheel from the beginning of the sixteenth century from a composite volume of illustrated medical texts in the collection of the Royal Danish Library (manuscript NKS 84 b folio). The urine wheel and the accompanying text are partly based upon the printed book by Johannes de Ketham: Fasciculus Medicine, printed in Venice in 1491. The Royal Danish Library is acknowledged for providing this image.

Figure 4.6 The SCAA and WCR Coffee Taster’s Flavor Wheel. The Coffee Taster’s Flavor Wheel is copyright of Speciality Coffee Association of America (SCAA) and the World Coffee Research (WCR) 2016. Reproduced by permission of Speciality Coffee Association of America (SCAA).

Figure 4.7 World Coffee Research sensory lexicon: Example of an attribute name, definition with associated physical references and intensity scores for the burnt coffee attribute (World Coffee Research, 2013). Copyright of World Coffee Research (WCR) 2016. Reproduced by permission of World Coffee Research.

Text of the extract shown in Figure 4.7:

Partial entry at the top of the extract
  Reference: Paper ashes. Intensity: Aroma: 4.0. Preparation: Obtain ashes from burned white paper and place in 2-ounce glass jars with screw-on type lids. Fill jars approximately 1/3 full. This may be prepared several days in advance and stored at room temperature, tightly sealed. Prepare one jar for every three panelists.

BURNT: The dark brown impression of an over-cooked or over-roasted product that can be sharp, bitter, and sour.
  Reference: Benzyl disulfide. Intensity: Aroma: 4.5. Preparation: Place 0.1 gram of benzyl disulfide in a medium snifter. Cover.
  Reference: Raw peanuts, over-roasted/burnt. Intensity: Flavor: 7.5. Preparation: Preheat oven to 425°F. Place raw, blanched peanuts in a single layer on a baking sheet lined with parchment paper. Roast for 20 minutes. Peanuts will be burnt. Serve in a 1-ounce cup. Cover with a plastic lid.
  Reference: Alf’s Natural Nutrition Red Wheat Cereal. Intensity: Aroma: 8.0. Preparation: Serve 1 tablespoon of cereal in a medium snifter. Cover.
  Reference: Alf’s Natural Nutrition Red Wheat Cereal. Intensity: Flavor: 3.0. Preparation: Place 1 tablespoon of cereal in a 1-ounce cup. Cover with a plastic lid. Cereal should be tasted two at a time.

SMOKY: An acute, pungent aromatic that is a product of the combustion of wood, leaves, or a non-natural product.
  Reference: Benzyl disulfide. Intensity: Aroma: 3.5. Preparation: Place 0.1 gram of benzyl disulfide in a medium snifter. Cover.
  Reference: Diamond Smoked Almonds. Intensity: Aroma: 6.0. Preparation: Place 5 almonds in a medium snifter. Cover.
  Reference: Diamond Smoked Almonds. Intensity: Flavor: 5.0. Preparation: Place 1 tablespoon of almonds in a 3.25 ounce cup. Cover with a plastic lid.
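The ingredients of a lexicon entry described in this section (attribute name, definition and scale labels) and the hierarchy of a wheel (main group, category, attribute) can be sketched as a small data structure. This is a hypothetical illustration only: the class and method names are invented, the definitions are abbreviated from Figure 4.9, and the grouping under "Timbre" and the scale labels "A little"/"A lot" are assumptions that may not match the published sound wheel exactly.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Attribute:
    name: str
    definition: str
    scale_labels: tuple  # (low anchor, high anchor); assumed labels


@dataclass
class Lexicon:
    # main group -> category -> list of attributes
    wheel: dict = field(default_factory=dict)

    def add(self, group, category, attribute):
        """Place an attribute under a category within a main group."""
        self.wheel.setdefault(group, {}).setdefault(category, []).append(attribute)

    def find(self, name):
        """Return (group, category, attribute) for an attribute name."""
        for group, categories in self.wheel.items():
            for category, attributes in categories.items():
                for attribute in attributes:
                    if attribute.name.lower() == name.lower():
                        return group, category, attribute
        raise KeyError(name)

    def rings(self):
        """Inner, middle and outer ring of the wheel as lists of labels."""
        groups = list(self.wheel)
        categories = [c for cats in self.wheel.values() for c in cats]
        names = [a.name for cats in self.wheel.values()
                 for attrs in cats.values() for a in attrs]
        return groups, categories, names


lexicon = Lexicon()
labels = ("A little", "A lot")  # assumed scale anchors
lexicon.add("Timbre", "Bass", Attribute(
    "Boomy", "Resonances in the low bass, as sound in a large barrel.", labels))
lexicon.add("Timbre", "Bass", Attribute(
    "Boxy", "A hollow sound, as if played inside a small box.", labels))
lexicon.add("Timbre", "Midrange", Attribute(
    "Nasal", "A closed sound with pronounced midrange.", labels))

group, category, attribute = lexicon.find("boxy")
print(f"{attribute.name}: {group} > {category}")
print(lexicon.rings())
```

Keeping the definition and scale labels together with the name mirrors the point that an attribute name alone does not constitute a lexicon entry.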

4.3.2.3 Attribute development for novel applications

In the case of a new domain of application for which attribute-based descriptive analysis has not been applied, pertinent attributes may not yet have been defined. In this case you will have the more arduous task of going through a consensus vocabulary development process before the assessors rate the attributes. As presented in Bech and Zacharov (2006); Lorho (2010), the protocol for consensus attribute development comprises the following key steps:

Assessor panel selection: A panel of expert assessors (6–10 recommended) is to be selected for this task;

Selection of stimuli: Select stimuli and samples that are representative of the family of technology to be evaluated, but also stimuli that will excite a pertinent and wide range of characteristics. In our field, it is common to create very large sets of stimuli, but it is advised to keep this within reason (30–100 representative samples), i.e. a subset of the stimuli for the test may be selected for this purpose. The breadth and relevance of the developed attributes will depend on how well the stimuli excite the characteristics of the systems under test;

Familiarisation with stimuli: Each assessor will spend time freely listening to all stimuli and start to explore how they are perceived and the nature of the differences between the systems under test;

Figure 4.8 The sound wheel for reproduced sound. The inner ring represents the main groups, the middle ring shows the categories and the outer ring the attribute names. See Pedersen and Zacharov (2015); ITU-R Rep. BS.2399 (2017). For more information see Tech Document no. 7 on http://senselab.madebydelta.com/about/publications/

Individual attribute elicitation: Assessors now perform individual word elicitation by listening to the stimuli and providing words to describe how they perceive the sound quality and the differences between the stimuli. Assessors can provide as many words as they feel necessary to describe their perception of the sound and quality. This stage often takes 1 or 2 sessions per assessor.

Consensus panel discussions: With you as the panel leader, group discussions are set up to start a process of developing attributes that can be agreed upon within the panel. As a panel leader your work is to moderate the discussion toward finding the most pertinent and relevant attribute that will allow the systems under test to

Attribute definitions (Figure 4.9):

Loudness: How loud the sound is perceived.

Attack: Transient response. Specifies whether the drum beats and percussion, etc. are accurate and clear, i.e. if you can hear the actual strokes from the drumstick, the plucking of the strings, etc. It is also expressed as the ability to reproduce each audio source's transients cleanly and separated from the rest of the sound image. Imprecise Attack is understood as an unclear or muted impact.

Bass precision: Are instrument impacts from the bass drum and bass precise, crisp and without distortion, are the impacts tight and well defined? Bass precision may be defined as Attack in the bass region. Imprecise means that the attack is spread in time and the peak of the impact is softened.

Punch: Specifies whether the strokes on drums and bass are reproduced with clout, almost as if you can feel the blow. The ability to effortlessly handle large volume excursions without compression (compression is heard as level variations that are smaller than one would expect from the perceived original sound).

Powerful: The ability to handle high sound levels, especially when striking the drums and bass. Indicates whether the Punch, Attack and Bass precision are maintained at high volume.

Dark-Bright: Denotes the balance between bass and treble. -Dark: Excessive bass. Either loud bass or weak treble. -Neutral: Bass and treble are perceived equally loud, there is a balance in the reproduction. This also applies if both bass and treble are equally weak or if the bass and treble are both too loud. If it leads to prominent or soft midrange this is assessed by the Midrange strength. -Bright: Excessive treble. Either loud treble or weak bass. The cause for the sound being dark or light can be deduced from the assessments of Bass strength and Treble strength.

Full: If both low and high frequencies are well represented with good extension the sound is Full.

Homogeneous: Denotes to which degree the different frequency ranges (bass, midrange and treble) are coherent, continuous, and balanced without gaps between them. That there are seamless transitions between the tone ranges.

Bass strength: The relative level of bass, i.e. the low frequencies, for example male voices, bass guitar, bass drum, timpani and tuba. Should not be confused with Bass depth, which indicates the low frequency bass extension.

Bass depth: Denotes how far the bass extends downwards. If it goes down in the low end of the spectrum, there is great depth. Should not be confused with Bass strength, which indicates the strength of the bass, or Boomy, which relates to resonances in the lower bass region.

Boomy: Resonances in the low bass, as sound in a large barrel, which gives a prominent bass resounding (reverberating) when bass and bass drums are heard. The representation tends to become muddy and imprecise.

Boxy: Boxy denotes a hollow sound, as if the sound was played inside a small box. Represents resonances in the upper bass frequency range.

Midrange strength: The relative level of the midrange, i.e. the middle frequencies, e.g. sopranos, trumpets, violins and xylophones. Not to be confused with Canny, which represents prominent narrow frequency ranges (resonances) in the midrange.

Nasal: A closed sound with pronounced midrange. Gives the impression corresponding to vocalists singing through the nose (nasal).

Canny: The music sounds like it is being played in a can or tube. The sound is characterized by prominent and narrowband resonances in the midrange.

Treble strength: The relative strength of the treble or high frequencies. -Weak: Covered, unsharp. -A little under neutral: A soft sound without being dull. -Neutral: In the middle of the scale, where you can clearly distinguish instruments. -A lot: Treble raised. Sharp, hard sound.

Brilliance: Treble or high frequency extension. -A little: As if you hear music through a door, muffled, blurred or dull. -A lot: Crystal-clear reproduction, extended treble range with airy and open treble. Lightness, purity and clarity with space for instruments. Clarity in the upper frequencies without being sharp or shrill and without distortion.

Tinny: Resonances or narrowband frequency prominence in the treble or high frequencies.

Width: The width of the sound image (expressed as the perceived angle). The width of the sound sources' positions (soundscape width). The width of any reverberation should not be included in the assessment.

Depth: The depth of the sound image (i.e., in the direction away from the listener). Not to be confused with Distance.

Distance: The perceived distance between the listener and the main sound sources (instruments / singer). Does it sound as if the music is close or far away?

Presence: Does it sound as if the sound sources are present and not distant or absent?

Balance: Is the soundstage skewed to one side (left or right) or is it centred in the middle?

Precise: Can the individual instruments and voices be clearly placed and separated in the spatial sound image? How precise are the individual sound sources positioned in the room? If the individual sound sources are inadvertently spread or broadened out the precision is low.

Externalization: When listening on headphones: To what extent do you perceive the sound sources outside of your head?

Envelopment: Are you surrounded by the reproduced sound and does it give a sense of space around you?

Clean: It is easy to listen into the music, which is clear and distinct. Instruments and vocals are reproduced accurately and distinctly. The opposite of clean: dull, muddy.

Further labels appearing in this part of the figure: Sound image; Localization; Transparency; Detailed; Natural; Shrill; Timbral balance; Bass; Midrange; Treble; Dynamics; Noise (Hissing, Humming, Bubbling); Artefacts – Signal related (Fluctuating/Intermittent, Compressed, Distorted, Clipped, Buzzing, Rubbing, Rough).

Sensory Evaluation in Practice  83

A well-resolved sound rich in detail. Instruments, voices etc. can easily be separated. The music has many details, details that cannot be measured, details that give the music "soul". It may be small audible nuances: Breathing from a singer, fingers wandering across the guitar strings, the flaps from the clarinet, embouchure sound of the saxophone, the impact from the piano's hammers when they hit the strings. Sounds reproduced with high fidelity. Acoustic instruments, voices and sounds, sounds like in reality. The sound is similar to the listener's expectation to the original sound without any timbral or spatial coloration or distortion, "Nothing added - nothing missing." The soundstage is clear in space and brings you close to the perceived original sound experience. Treble Distortion. Very sharp s-sounds, cymbals etc. As the sound of something scraping on a (rough) surface. A hoarse off-sound unintentionally accompanying the reproduced sound. Bass distortion. A zzz-like, undesireable sound typically in the low and midrange frequencies. The harmonics are to pronounced and sharp. Additional and undesired sounds that add a sharpness to the reproduction. Limited dynamic range leading to a lack of natural peaks. Dymanic compression may be heard as a pumping effect.

Noise with varying loudness and or pauses.

Sound or noise with fast ( 80 is the guideline provided for difference tests in Lawless and Heymann (2010), although estimation of the required sample size is also discussed, based on the size of the difference between the products and on statistical power calculations. Lastly, for marketing claims substantiation, a larger consumer population (N = 250–400) is recommended in ASTM E1958 (2016), depending upon the type of test and the nature of the claim to be substantiated. Figure 4.14 compares the results for an identical small listening test of intermediate audio quality of audio codecs using ITU-R Rec. BS.1534-3 (2015), with three different panels, to compare patterns in ratings. The data from all three panels is largely centred and has different levels of contraction bias (see Zieliński (2016) for details on centering and contraction bias). It can be seen that the naïve panel uses a narrow middle part of the scale, followed by the experienced assessors and then the expert assessors, who span most of the scale. From this we can see that not all panels are equal in terms of experience in usage of scale and method, nor in terms of their discrimination skills. All of these elements will have an influence on the interpretation of the results. The small expert panel may have the experience to discriminate complex stimuli repeatably, using the scale with confidence. This data is perhaps representative of the most discriminating assessors in our population. This compares with the consumer panel, who may be less discriminating or reliable, but are representative of untrained consumers. Consumer and expert panels are clearly different and provide different benefits. The fast and easy access to an established expert panel can be beneficial for a lab performing routine and/or standards-based tests that require a level of expertise or discrimination skill. Employing a consumer panel may be more appropriate in other cases, when the consumer perspective is of greater importance than the discrimination skills of the assessors. This is a matter of choice for any given experiment.
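The power calculations mentioned above can be illustrated with a minimal sketch for a triangle test, where the chance of a correct answer is 1/3. It assumes a one-sided exact binomial test and one response per assessor. The function names, the significance level, the target power and the proportion of true discriminators are all illustrative assumptions and are not taken from Lawless and Heymann (2010).

```python
from math import comb

def binom_sf(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def critical_count(n, p_chance, alpha=0.05):
    """Smallest number of correct answers that is significant at level alpha."""
    for k in range(n + 1):
        if binom_sf(k, n, p_chance) <= alpha:
            return k
    return n + 1  # no outcome can reach significance with so few assessors

def power(n, p_chance, p_true, alpha=0.05):
    """Probability of a significant result when the true proportion correct is p_true."""
    k = critical_count(n, p_chance, alpha)
    return binom_sf(k, n, p_true) if k <= n else 0.0

def required_assessors(p_chance, p_true, alpha=0.05, target_power=0.8, n_max=2000):
    """First panel size reaching the target power (exact power is not strictly monotonic in n)."""
    for n in range(1, n_max + 1):
        if power(n, p_chance, p_true, alpha) >= target_power:
            return n
    raise ValueError("target power not reached within n_max")

# Hypothetical scenario: triangle test, 30 % of assessors truly hear the difference.
p_chance = 1 / 3
p_discriminators = 0.30
p_true = p_discriminators + (1 - p_discriminators) * p_chance
n_needed = required_assessors(p_chance, p_true)
print(n_needed)
```

The smaller the assumed proportion of discriminators, the larger the required panel, which is consistent with the larger sample sizes recommended for consumer tests.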

4.3.3.3 Assessor selection

Depending on the method and nature of a test, different criteria may be considered for assessor and panel selection. The generic process of taking sensory assessors from recruitment through training to the level of expert sensory assessor is an involved one, outlined in Figure 4.15. If naïve assessors or consumers are to be selected, it is important to define the appropriate demographic in terms of age, gender, nationality and product usage, e.g. brand affiliation, usage of a product family, etc. If expert assessors are to be selected, then a more complex recruitment and selection process may be needed, depending on the test method to be applied. For example, when performing experiments according to ITU-R Rec. BS.1534-3 (2015) or ITU-R Rec. BS.1116-3 (2015), you should follow the recommended practices for pre- and post-screening and assessor selection. When performing descriptive- or attribute-based tests, assessors are selected and then trained and monitored as outlined in ISO 8586 (2012) and illustrated in Figure 4.15. It is not expected that all assessors can be successfully trained to become expert sensory assessors. Furthermore, as assessor expertise is domain specific, it should not be assumed that an expert assessor in the domain of spatial audio coding will necessarily be an expert in watermark technology assessment. Assessor expertise is best determined based on data and statistical analysis for each application domain. Several different approaches have been reported for the selection of expert assessors. In traditional sensory evaluation of food, general recommended practices are provided in ISO 8586 (2012); ISO 13300-1 (2006); ISO 13300-2 (2006), with guidance on the recruitment and selection of panel leaders as well as the selection of assessors.
In the field of sound and audio, a number of different procedures have been reported, for example in Wickelmaier and Choisel (2005) for multichannel audio applications, Legarth and Zacharov (2009) for multimedia applications and Legarth et al. (2012) for hearing impaired assessors, with an overview provided by Bech and Zacharov (2006). These approaches commonly select assessors based on:
• Demographic fit;
• Hearing acuity;
• Sensory ability;
• Availability.
The demographic requirements are defined prior to recruiting and then evaluated by questionnaire or interview to establish the assessor's suitability. Hearing acuity may be evaluated in various ways; often hearing is tested with an audiogram to select assessors with normal hearing or a known level of hearing loss, depending on the panel requirements. The sensory ability of assessors can be studied through small listening tests, which

[Figure: flowchart reproduced from ISO 8586:2012(E), Figure 1, "Entire process". Recoverable stage labels: Recruitment; Naïve sensory assessors; Familiarization; Initiated sensory assessors; Selection; Selected sensory assessors; Final choice of panels for particular methods; Selected assessors (difference, ranking, rating); Assessors selected in order to become expert sensory assessors; Training; Monitoring and testing of performance; Expert sensory assessors. The stages are annotated "According to Clause 4" through "According to Clause 8".]

Procedure overview of the ISO 8586 (2012); DS/EN ISO 8586 (2014) standard for the selection, training and monitoring of selected assessors and expert sensory assessors. Reproduced with the permission of the Danish Standards Foundation. Copyright Danish Standards. Figure 4.15

evaluate the fundamental ability of the assessor prior to training. Examples of this type of screening test can be found in Wickelmaier and Choisel (2005); Legarth and Zacharov (2009); Legarth et al. (2012); Dziedzic and Kleczkowski (2014). Specific to individual vocabulary profiling, where assessors use their own personal attributes for evaluation, Kuusinen et al. (2010) provided guidance for assessor selection. The final selection criterion relates to the availability of assessors, which is vital to collecting data. Even if an assessor is highly acute and able, data cannot be collected if they are not available. For most expert panel selections, assessors need to fulfil all four criteria to become panel members.

4.3.3.4 Assessor performance evaluation

When employing naïve assessors or consumers, there is usually no requirement for, nor evaluation of, the assessor's performance. However, when experienced expert assessors are to be employed, we usually have certain expectations regarding their nature and performance, as discussed in Sections 4.3.3.1 to 4.3.3.3. Different methods have different definitions and requirements for assessor performance, in order that high quality data is collected. Assessor performance can be considered either for a specific experiment, as a post-screening following a test, or as a general continuous assessment of an assessor's skill within a panel. Either way, we are generally interested in whether assessors can discriminate between the stimuli and do this in a reliable manner. The approach taken in ITU-R Rec. BS.1116-3 (2015) and ITU-R Rec. BS.1534-3 (2015) is to post-screen assessors in terms of how reliably they can detect the hidden reference samples. This type of test evaluates the assessor's discrimination skill, based on correct identification of the hidden reference stimuli. In the case of small impairment tests using ITU-R Rec. BS.1116-3 (2015), if the system(s) under test are very similar to the reference


(e.g. transparent), correct identification of the hidden reference becomes very difficult even for the most skilled assessors. In such cases, you may find that many assessors are rejected based on this post-screening criterion, as there is little or no difference between the reference and the systems under test. For descriptive- or attribute-based testing using consensus attributes, different requirements are placed on assessors in terms of their ability to discriminate between the stimuli or systems under test. Furthermore, this should be done in a reliable and repeatable manner, employing attributes in a similar manner to other members of the panel. The topic of assessor performance evaluation is detailed and you are referred to Bech and Zacharov (2006); Zacharov and Lorho (2006) for more details. For such applications, assessor expertise can be evaluated based on three main performance metrics:
Discrimination: A measure of the ability to perceive differences between test items;
Reliability: A measure of the closeness of repeated ratings of the same test item;
Agreement: A measure of the similarity of ratings between a listener and the panel.
Several methods exist for evaluation of such characteristics by statistical analysis, which are sometimes based on repeated measures of the same stimuli for each assessor (i.e. a partial or complete replication of ratings). An expertise gauge (eGauge, Lorho et al. (2010a); ITU-R Rep. BS.2300 (2014))⁴ was developed for applications in audio and provides a rapid means of evaluating assessor performance in terms of assessor discrimination, reliability and agreement. Additionally, the method benefits from an objective acceptance or rejection criterion for assessor performance based on statistical permutation tests, providing a measurable means of classifying assessors within a test as expert or not.
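The three metrics defined above can be made concrete with a minimal sketch. The statistics chosen here (a one-way ANOVA F-ratio for discrimination, the spread of replicates for reliability, and a correlation with the rest of the panel for agreement) are simple stand-ins selected for illustration; they are not the definitions used by eGauge. The data layout, function names and ratings are hypothetical.

```python
from statistics import mean

def f_ratio(ratings):
    """Discrimination: one-way ANOVA F over systems. ratings maps system -> replicate ratings."""
    groups = list(ratings.values())
    grand = mean(x for g in groups for x in g)
    k, n = len(groups), sum(len(g) for g in groups)
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    if ss_within == 0:
        return float("inf") if ss_between > 0 else 0.0
    return (ss_between / (k - 1)) / (ss_within / (n - k))

def replicate_rmse(ratings):
    """Reliability: RMS deviation of replicates from their system mean (lower is better)."""
    devs = [(x - mean(g)) ** 2 for g in ratings.values() for x in g]
    return (sum(devs) / len(devs)) ** 0.5

def pearson(x, y):
    """Pearson correlation; undefined (division by zero) if either input is constant."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def agreement(panel, assessor):
    """Agreement: correlation of one assessor's system means with the mean of the others."""
    systems = sorted(panel[assessor])
    own = [mean(panel[assessor][s]) for s in systems]
    others = [mean(mean(panel[a][s]) for a in panel if a != assessor) for s in systems]
    return pearson(own, others)

# Hypothetical ratings on a 0-100 scale, two replicates per system.
panel = {
    "A1": {"ref": [95, 92], "codec1": [60, 64], "codec2": [30, 35]},
    "A2": {"ref": [90, 94], "codec1": [58, 62], "codec2": [28, 30]},
    "A3": {"ref": [20, 25], "codec1": [55, 50], "codec2": [90, 85]},  # inverted scale usage
    "A4": {"ref": [88, 91], "codec1": [65, 61], "codec2": [35, 33]},
}
for name in panel:
    print(name, f_ratio(panel[name]), replicate_rmse(panel[name]), agreement(panel, name))
```

In this made-up panel, assessor A3 discriminates and replicates well but correlates negatively with the others, which is the pattern expected from inverted scale usage.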
Both discrimination and reliability metrics provide the experimenter with a way of evaluating whether the assessor is bringing meaningful data to the experiment. Example output from eGauge is shown in Figure 4.16 for 24 assessors with a wide range of performance. The grey lines indicate the 95 % permutation test levels for the reliability and discrimination axes. Assessors who cannot discriminate and who provide unreliable data fall into the bottom-left quadrant. Assessors who are reliably able to provide similar scores across the test, but cannot discriminate between the systems under test, fall below the horizontal grey line. Assessors who cannot provide reliable scores for the systems under test lie to the left of the vertical grey line. Lastly, the 18 assessors who are able both to discriminate and to provide similar replicated scores for each system under test are located in the top-right quadrant. Such assessors can be considered as discriminating and reliable and thus categorised as expert assessors for this test. Additionally, PanelCheck (www.panelcheck.com) is another user-friendly stand-alone tool for panel performance assessment and its application to audio is discussed in Section 6.4. In addition to providing fundamental statistical data analysis, PanelCheck is also well suited to studying assessor discrimination, reliability and agreement. For example, the so-called Tucker1 analysis presented in Section 4.3.2.4 allows you to easily evaluate whether assessors are using an attribute in a consensual manner or not. Figure 4.10(a) illustrates a high degree of assessor agreement for the distortion attribute, with the majority of assessors clustered together on the axis PC1. However, assessor 4 is not in agreement and appears to have inverted her scale usage, which can be quickly identified through this analysis.
For the attribute dark-light in this experiment (see Figure 4.10(b)), assessors are less in agreement, with their scores spread over the two principal components PC1 and PC2.

⁴ R-code for running eGauge can be downloaded from the ITU-R web site associated with ITU-R Rep. BS.2300 (2014).


Overview of assessor performance using the expertise gauge (eGauge) found in Lorho et al. (2010a); ITU-R Rep. BS.2300 (2014). The grey lines indicate the 95 % permutation test levels for reliability and discrimination axes. Figure 4.16
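The idea behind a 95 % permutation test level can be sketched as follows. One assessor's ratings are repeatedly shuffled across systems, the discrimination statistic is recomputed each time, and the 95th percentile of the resulting null distribution serves as the threshold. This is a generic permutation test written for illustration, assuming an F-ratio as the discrimination statistic; it is not the eGauge implementation, and all names and data are hypothetical.

```python
import random
from statistics import mean

def f_ratio(groups):
    """One-way ANOVA F statistic for a list of groups (one group of replicates per system)."""
    grand = mean(x for g in groups for x in g)
    k, n = len(groups), sum(len(g) for g in groups)
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    if ss_within == 0:
        return float("inf") if ss_between > 0 else 0.0
    return (ss_between / (k - 1)) / (ss_within / (n - k))

def permutation_threshold(groups, n_perm=2000, level=0.95, seed=0):
    """Quantile of the F statistic when ratings are randomly reassigned to systems."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    sizes = [len(g) for g in groups]
    pooled = [x for g in groups for x in g]
    null = []
    for _ in range(n_perm):
        rng.shuffle(pooled)
        it = iter(pooled)
        null.append(f_ratio([[next(it) for _ in range(s)] for s in sizes]))
    null.sort()
    return null[min(int(level * n_perm), n_perm - 1)]

# Hypothetical assessors: one separates the systems clearly, one does not.
discriminating = [[95, 92, 90], [60, 64, 62], [30, 35, 33]]
flat = [[50, 70, 60], [60, 50, 70], [70, 60, 50]]

for label, data in (("discriminating", discriminating), ("flat", flat)):
    observed = f_ratio(data)
    threshold = permutation_threshold(data)
    print(label, "passes" if observed > threshold else "fails")
```

The same approach can be applied to a reliability statistic to obtain the second threshold line.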

4.3.3.5 Panel training

Once assessors have been selected to be members of an expert panel, they have usually shown some fundamental skills and characteristics for performing listening tests. However, membership of an expert listening panel does not necessarily mean that all assessors are deemed expert assessors for all types of test methodologies, nor for all types of stimuli. Training of assessors for one specific family of tests and stimuli, or more generically for a broad range of methods and stimuli, are options to be considered. Training on test methods and stimuli can be achieved easily using small training experiments for a specific application, e.g. an ABX test for watermark technology, ITU-R Rec. BS.1116-3 (2015) for small codec impairments, or an ITU-T Rec. P.835 (2003) test for noise-suppressor technology. By preparing tests that allow you to evaluate assessor performance, e.g. for reference detection or measures of discrimination and reliability as described in Section 4.3.3.4, you can establish when assessors have become familiar with the test method and the family of test stimuli, and are also discriminating each attribute reliably. An additional way to train assessors is to encourage them to improve their listening skills. A number of software-based solutions exist for ear training, many of which focus upon modifications in timbral (peaks, notches, harmonics, etc.), dynamic (attack, decay, etc.) and other characteristics. Please refer to Corey (2016) for a detailed guide to technical ear training with associated training software.
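As an illustration of evaluating performance in a training experiment, the following minimal sketch scores an ABX session with a one-sided exact binomial test against guessing. The significance level and the pass rule are assumptions made for the example and are not prescribed by any of the standards cited above.

```python
from math import comb

def abx_p_value(correct, trials, chance=0.5):
    """Probability of at least `correct` right answers out of `trials` by guessing alone."""
    return sum(
        comb(trials, i) * chance**i * (1 - chance) ** (trials - i)
        for i in range(correct, trials + 1)
    )

def detects_difference(correct, trials, alpha=0.05):
    """Hypothetical pass rule: the result is unlikely (p <= alpha) under pure guessing."""
    return abx_p_value(correct, trials) <= alpha

# Hypothetical training session of 16 ABX trials.
print(abx_p_value(14, 16), detects_difference(14, 16))
print(abx_p_value(8, 16), detects_difference(8, 16))
```

Tracking this value over successive training sessions gives a simple indication of when an assessor has become familiar with the method and the stimuli.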

In the context of ISO 8586 (2012) as shown in Figure 4.15, an expert assessor can only be qualified for a specific application after training, familiarisation and monitoring.

4.3.3.6 Panel maintenance

As you have already seen, establishing a permanent panel of expert assessors requires significant effort, which can only be justified when there is sufficient testing to be performed. This needs to be carefully considered before embarking on such an effort and cannot always be justified. If you decide to establish a panel, you will then also need to maintain it. Our experience is that panel size decreases with time, as assessors change job, graduate, become otherwise occupied or lose interest. Without maintenance, very often after 2–3 years the panel may be less than half the size you started with. Panel motivation is important for keeping assessors interested and involved. Assessors have different motivations for participation in a panel, as discussed extensively in Kapparis et al. (2008). Often, assessors are interested in technology, have an interest in audio and want to contribute to technology development. There is also the element of remuneration, indicating to the assessors the importance of their work. In order to maintain motivation, the assessors and panel need to be actively engaged, whether performing experiments and/or training tasks. It is also important that assessors get an idea of how their work has contributed to the world and how they are performing on the whole. Feedback is thus of great value. Before and during a test, it is good practice to provide only a minimum amount of information to the assessor regarding the nature of the test and the products to be evaluated, so as to minimise any potentially biasing effects and ensure the test is as blind as possible. When a test is completed it is beneficial to give feedback to assessors, whether on a one-to-one basis or as part of a panel team-building event. Irrespective of how well motivated and paid your panel is, there will come a time when you need to renew a panel due to diminishing membership.
It is advised to consider continually maintaining a panel, as this is easier than starting from scratch every few years. In this way, new assessors can be recruited, trained and brought up to speed with the existing panel, also allowing for continuity in the panel's work.

4.3.3.7 Ethics

When assessors are involved in sensory evaluations, ethical considerations should be taken into account. The American Psychological Association (APA) provides an extensive guideline in APA (2010), covering matters such as avoiding harm, privacy and confidentiality, informed consent, etc. Depending on the region you are working in, there may be regional practices for ethical review and approval. This may also depend on whether you are applying methods according to a national or international standard or recommendation, whether the test is performed with prototype or commercially available hardware, etc. It is a good idea to become aware of the regional practices, which may be different in industry and academia.

4.3.4 Listening facilities

The type of listening facility needed depends on the type of test and the stimuli. As both the manner and context in which the stimuli are presented have an influence on the test result, we will start by considering the stimuli. As discussed in Section 4.3.1, a wide range of different stimuli can be considered for sensory evaluation and may require different test environments, including:
• Listening rooms;
• Listening booths;
• Field testing.
It may also be relevant to consider whether the most relevant results are obtained in a field test or in a laboratory, with or without a simulated (acoustic and/or visual) context. Today, context can be brought into the laboratory with relative ease using virtual reality equipment.

4.3.4.1 Listening rooms

Two situations of reproduced sound are relevant to consider:
1. A real device test where the sound systems (e.g. passive, active, wireless, surround and other types of loudspeakers) are the test object;
2. A test where recorded sounds or algorithms are the test objects and the sound system is a tool for presentation of the stimuli that are the test objects.
In both cases a neutral listening room with low background noise is an advantage or even a requirement. The reason is that it is the test objects that are to be assessed and not the influence of the listening room. The reverberation time must be short and lie within specified limits. In particular, the reverberation time must be the same at all frequencies (however a weak increase is permitted at low frequencies). Next, the background noise must be low, so that unintended sound does not interfere with the listening test. Ideally, the standards EBU Tech. 3276 (1998) or ITU-R Rec. BS.1116-3 (2015) for listening tests of multichannel loudspeaker systems should be complied with. In a real device test an acoustically transparent curtain or screen should hide the devices under test (e.g. loudspeakers) to avoid visual bias. The screen might be used for projecting the test interface to the assessors, so that no undesirable sound reflections or similar occur from a computer screen in front of the listener. For context simulations the screen may be used for showing pictures or videos illustrating the context. When loudspeakers are the devices under test, the room–speaker interaction should be taken into account, as the interaction can be very significant. If the different loudspeakers are positioned in different locations in the room, the sound will be coloured differently, depending on their positioning (see Bech (1994); Toole (2017)). An experimental design that takes into account the position of each speaker should be considered, but can be complex and time consuming to perform.
An alternative is to ensure that speakers are always evaluated in identical positions, using a speaker shuffler or spinner that moves the speaker to the correct position. Examples of such devices can be found in Olive et al. (1998); Bech and Zacharov (2006); McMullin et al. (2015); Volk et al. (2017) and one is shown in Figure 4.17. Consideration should also be given to the general appearance, lighting and climate in the listening room. A nice looking and tidy room, with no more cables and technical installations than needed, may make the assessors or consumers feel more comfortable and welcome. A more in-depth treatment of the topic of listening rooms can be found in Bech and Zacharov (2006); Toole (2017).
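The requirement that the reverberation time lies within specified limits can be checked with a minimal sketch such as the one below. It assumes the nominal mid-band reverberation time Tm = 0.25 (V/V0)^(1/3) s with V0 = 100 m³ and a tolerance of ±0.05 s between 200 Hz and 4 kHz, as commonly quoted from ITU-R Rec. BS.1116. These figures and the wider tolerances outside that band should be verified against the current text of the recommendation; the function names and measured values are hypothetical.

```python
def nominal_rt(volume_m3, v0=100.0):
    """Nominal mid-band reverberation time in seconds (assumed form: 0.25 * (V/V0)^(1/3))."""
    return 0.25 * (volume_m3 / v0) ** (1 / 3)

def rt_within_limits(measured, volume_m3, tol=0.05, f_lo=200, f_hi=4000):
    """Check each measured band between f_lo and f_hi against the nominal value +/- tol."""
    tm = nominal_rt(volume_m3)
    return {f: abs(t - tm) <= tol for f, t in measured.items() if f_lo <= f <= f_hi}

# Hypothetical octave-band measurements (frequency in Hz -> reverberation time in s)
# for a room of 100 cubic metres.
measured = {125: 0.40, 250: 0.27, 500: 0.25, 1000: 0.24, 2000: 0.26, 4000: 0.35}
print(nominal_rt(100.0))
print(rt_within_limits(measured, 100.0))
```

Bands outside the 200 Hz to 4 kHz range are deliberately left out of the result, since different tolerances apply there.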


Example of a loudspeaker spinner used to compare loudspeakers in identical positions with respect to the listening room. Courtesy of SenseLab, Force Technology. Figure 4.17

4.3.4.2 Listening booths

If the test is to be carried out using headphones, it is sufficient that the background noise is below the required limit. Typically, in these situations listening booths are used, providing a compact, performant and cost-effective solution. Good lighting and ventilation are necessary and, with the ventilation on, the background noise inside the booth should be as low as in the listening room situation, i.e. NR10 to NR15.
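A background noise measurement can be compared with a noise rating (NR) limit using the sketch below. It assumes the usual linear form of the NR curves, L = a + b·NR per octave band, with the constants as commonly tabulated for ISO/R 1996; please verify the constants against an authoritative table before relying on them. The function names and the example measurement are hypothetical.

```python
# Octave-band centre frequency (Hz) -> (a, b); assumed constants, to be verified.
NR_CONSTANTS = {
    31.5: (55.4, 0.681), 63: (35.5, 0.790), 125: (22.0, 0.870),
    250: (12.0, 0.930), 500: (4.8, 0.974), 1000: (0.0, 1.000),
    2000: (-3.5, 1.015), 4000: (-6.1, 1.025), 8000: (-8.0, 1.030),
}

def nr_limit(band_hz, nr):
    """Permitted octave-band sound pressure level (dB) for a given NR curve."""
    a, b = NR_CONSTANTS[band_hz]
    return a + b * nr

def nr_rating(levels):
    """NR value of the lowest curve not exceeded in any measured band."""
    return max((level - NR_CONSTANTS[f][0]) / NR_CONSTANTS[f][1] for f, level in levels.items())

# Hypothetical octave-band background noise levels (dB) measured in a booth.
booth = {63: 40.0, 125: 30.0, 250: 22.0, 500: 15.0, 1000: 10.0, 2000: 8.0, 4000: 6.0}
rating = nr_rating(booth)
print(rating, rating <= 15)
```

The rating is governed by the single band that comes closest to its limit, so one noisy band (for example ventilation rumble) is enough to fail the requirement.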

4.3.4.3 Field testing

For some products, e.g. automotive products and public address loudspeaker systems, the most realistic tests are made outside the laboratory. The advantage of field testing is that the test may be performed in the relevant context. For annoyance assessments, the context and the actual persons exposed to the noise are essential factors for the results, and such assessments should preferably be made in the field. For other types of tests it might be a disadvantage that different products cannot be directly compared. In such cases it might be worth considering using recordings to perform the tests later in the laboratory. As an example, a test on alternative power steering systems for a car model was to be made. The systems could be tested and compared out of the car, but the interior cabin sound with engine noise was an important context factor. It took several hours to change the steering systems in the actual car, so binaural recordings were made in the car for later direct comparisons with headphones in the laboratory. In another example, the characteristics of the sound systems in different concert venues were sought. A panel of trained assessors was brought to the venues. For some of the venues the same band with the same sound engineer performed the concerts. The intervals between the concerts were one or more weeks. The field results showed that other factors, e.g. another band playing at the same venue, were at least as important as the differences between the venues. When the same assessors made a listening test with binaural recordings listened to on headphones combined with a subwoofer, the differences between the venues were significant in most of the tested attributes. When considering field tests, it is important to pay attention to (i.e. observe and document) variables beyond experimental control that may influence or bias the results (e.g. weather, traffic conditions, etc.).

4.3.5 Equipment

4.3.5.1 Electroacoustic equipment

When performing sensory evaluation of stimuli (i.e. recorded, processed or simulated) that need to be reproduced via loudspeakers or headphones, it is vital to be aware of the quality and performance of the transducers and how this may impact your results. When the focus of your experiment is the evaluation of the performance of a set of audio algorithms, the performance of the speaker or headphone should be as neutral as possible to avoid any bias of results. However, in other experiments you may wish to evaluate an algorithm using a range of consumer headphones from a defined category and price range, and in such a case you might consider including the headphone or speaker type as part of your treatment design (Section 4.2.3.4). Where needed, you may consult an acoustician regarding the objective measurement of loudspeakers and headphones to ensure they have suitable and sufficient performance in terms of frequency response characteristics, distortion, output level, etc. These topics are discussed in further detail in ITU-R Rec. BS.1116-3 (2015); Toole (2017). The location of loudspeakers in the listening room can also have a large influence on the sound quality. ITU-R Rec. BS.2051-1 (2017) provides guidance on the physical location of loudspeakers with respect to the listening positions and ITU-R Rec. BS.1116-3 (2015) provides guidance on the location of the loudspeakers within the listening room. You are advised to consider the appropriate location and equalisation of loudspeakers.

4.3.5.2 Systems for test administration

When performing any sensory evaluation of sound, assessors are presented with sound or audio stimuli, which are used to evoke an impression that we then measure using a scale or other response variable. Today, it is very rare to perform a listening test with manual control of sound stimuli and paper-based data collection, as this is prone to a wide range of errors that could result in incorrect interpretation and conclusions. Generally, the entire administration of a listening test is handled in software, which controls all of the experimental variables and ensures the robust collection of data for subsequent export, data analysis and interpretation. Ideally, such software, whether open source, commercial or coded for the purpose, should handle the following aspects of the test administration:
• Graphical user interface (GUI);
• Presentation and control of audio stimuli (robust playback, looping, cross-fading);
• Response variable (attribute, scale, etc.);
• Control of the design of experiment (presentation order of stimuli within and between trials and assessors);
• Data collection;
• Data export.
For classical sensory evaluation of products, generic commercial software is available from Compusense, Biosystems and Logic8, which focus upon flexible data collection. Software tools that administrate tests and provide robust means to present audio or audio/visual stimuli are available, for example, from Audio Research Labs (STEP) or FORCE Technology/SenseLab (SenseLabOnline), as illustrated in Figures 4.2 to 4.4. Additionally, a wide range of software tools for different kinds of experiments and methods can be found on the internet. For example psylab, developed at the Oldenburg University of Applied Sciences, is a collection of open-source MATLAB® scripts for the performance of psycho-acoustical detection- and discrimination-experiments.
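The data collection and data export aspects listed above can be sketched as follows for a purpose-coded tool. The record layout is a hypothetical example of the minimum information worth storing per rating, and does not describe the data format of any of the software products named in this section.

```python
import csv
import io
from dataclasses import dataclass, asdict, fields

@dataclass
class Rating:
    """One response from one assessor (hypothetical record layout)."""
    assessor: str
    trial: int
    system: str
    attribute: str
    score: float

def export_csv(ratings):
    """Serialise ratings to CSV text, one row per response, with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=[f.name for f in fields(Rating)])
    writer.writeheader()
    for rating in ratings:
        writer.writerow(asdict(rating))
    return buffer.getvalue()

# Hypothetical session data.
session = [
    Rating("A01", 1, "codec1", "Basic audio quality", 62.0),
    Rating("A01", 1, "reference", "Basic audio quality", 98.0),
]
print(export_csv(session))
```

Storing one row per response, together with the trial number, keeps the presentation order recoverable for later analysis.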

4.4 TEST ADMINISTRATION

4.4.1 Calibration and equalisation

The loudness of the presented stimuli may have a major influence on the results of a listening test. For product sounds it is often the most important variable, and for reproduced sound it also has a major influence for at least three reasons:
• The frequency-dependent loudness characteristics of the ear will change the perceived timbre of the stimuli at different levels (see also Figure 3.1);
• Distortion components at higher frequencies may be masked by low frequency components, and the masking depends on the level;
• The physical properties of systems under test, e.g. distortion, bandwidth, equalisation, etc., may change with the level.
Some types of tests are to be performed at the most comfortable level. As the level is adjusted according to the wishes of the assessors, no calibration is needed, but it may be relevant to measure and record the levels for analysis or reporting. For reproduction of pre-recorded signals, e.g. product sounds, the recordings should contain a calibration signal, so that the reproduction can be adjusted to the same level as the original or to another well-defined level. If the systems under test are meant for sound reproduction, e.g. loudspeakers and headphones, it is important that the systems under test are aligned to the same level. It is a common experience that if two loudspeakers with relatively similar characteristics are included in a test, the louder of the two will be preferred. In such cases loudness alignment should be carefully considered. Loudness alignment may be either perceptual or instrumental. Perceptual alignment is made by several assessors who carefully adjust the levels of all systems under test, so that they sound equally loud. The programme material intended for the test should be used. This may require some patience, as different sound samples may introduce different level perceptions from each of the test objects. Alternatively, a number of instrumental level and loudness alignment methods exist. Some are based on RMS measurements of test signals, e.g. pink noise, with different frequency and time weighting functions (see Figure 3.2 and IEC 61672-1 (2013); EBU R128 (2014); ITU-R Rec. BS.1770-4 (2015)), and others are based on psycho-acoustic loudness metrics (ISO 532-1, 2017; ISO 532-2, 2017). Whilst loudness alignment is often advised when comparing the systems under test, it may not always be justified. For example, it may be of interest to evaluate the loudness and sensitivity of a hearing aid for a given input signal as part of an overall evaluation of the device's performance. Furthermore, due to the frequency-, level- and temporal-dependent nature of loudness, at times it can become complex to perform loudness normalisation between systems with very different characteristics (e.g. width, equalisation and distortion). A pragmatic and effortless way to start is to adjust the systems to the same A-weighted sound pressure level in the listening position with a pink noise signal and progress from there.
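The pragmatic starting point described above, aligning systems to the same A-weighted level, can be sketched as follows. The A-weighting expression is the standard analytic form associated with IEC 61672-1. The alignment helper, its choice of the quietest system as the default target and the measured levels are illustrative assumptions.

```python
from math import log10, sqrt

def a_weighting_db(f):
    """A-weighting in dB at frequency f (Hz), standard analytic form."""
    f2 = f * f
    ra = (12194.0**2 * f2**2) / (
        (f2 + 20.6**2)
        * sqrt((f2 + 107.7**2) * (f2 + 737.9**2))
        * (f2 + 12194.0**2)
    )
    return 20 * log10(ra) + 2.00

def alignment_gains(measured_dba, target_dba=None):
    """Gain in dB to apply to each system so that all reach the target level.

    By default the quietest system is the target, so that no system needs
    positive gain (an assumption made here to avoid clipping).
    """
    if target_dba is None:
        target_dba = min(measured_dba.values())
    return {name: target_dba - level for name, level in measured_dba.items()}

# Hypothetical A-weighted levels (dB) measured at the listening position with pink noise.
measured = {"speaker_A": 72.0, "speaker_B": 75.5, "speaker_C": 73.2}
print(a_weighting_db(1000.0), a_weighting_db(100.0))
print(alignment_gains(measured))
```

After applying the gains, it is prudent to re-measure each system and to check the result by ear with the intended programme material.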

4.4.2 Instructions

All assessors should be provided with written and verbal instruction by the experimenter to ensure that all assessors have the same understanding of the task, how to use the interface, scales and attributes. Ideally once this instruction has been given, there should be time for the assessor to ask questions and clarify any open issues regarding how to operate the user interface, understanding of scale, attribute meanings, etc.

4.4.3 Familiarisation

The purpose of familiarisation is to make the assessors well-acquainted with the test method, user interface and the stimuli they will be exposed to, before they make any assessments. This is to ensure that assessors are not taken by surprise with regard to the quality levels and characteristics presented during the experiment and to allow them to become familiar with the nature of the content. This will ensure the assessors are more confident with the method and scale usage, and should result in good data quality. A specific form of familiarisation is training with direct feedback regarding which details to focus on during the assessment. The balance between information and training versus introducing bias should be carefully considered.

4.4.4 Controlling bias

All sensory evaluation is prone to bias and as such we should always consider how to minimise and control bias. In the old school science of psycho-acoustics, anything other than the acoustic stimulus was regarded as unwanted bias. Nowadays we acknowledge that for some types of tests we want to include and evaluate the influence of non-acoustical factors (see Figure 4.1), e.g. the influence of context, appearance of a product, audiovisual interaction, talker gender, etc. As already discussed in Section 4.2.3, we can design our experiment to include a range of experimental variables, and then subsequently evaluate the degree of their importance on the assessor rating during the analysis phase. Furthermore, we can try to minimise the effect of potentially biasing uncontrolled variables, e.g. presentation order, by using a design of experiment or randomisation of the stimulus presentation order. Lastly, the experimenter can also bias the experiment during the instruction of the assessors and this should be carefully handled. The experimenter should provide the assessor with clear and neutral instruction about the task of the experiment. However, the experimenter should also avoid providing excessive information that might bias the experiment, e.g. brand names, etc.
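One way of controlling presentation-order bias through a design of experiment is a balanced Latin square (Williams design), sketched below for an even number of systems. In such a design every system appears once in each presentation position and every system is immediately preceded by every other system exactly once. The construction is a standard one; the function names and the cyclic assignment of assessors to orders are illustrative assumptions.

```python
def williams_square(n):
    """Balanced Latin square for an even number n of systems (labelled 0..n-1)."""
    if n < 2 or n % 2:
        raise ValueError("this construction requires an even n >= 2")
    first, lo, hi, take_lo = [0], 1, n - 1, True
    while len(first) < n:  # builds 0, 1, n-1, 2, n-2, ...
        if take_lo:
            first.append(lo)
            lo += 1
        else:
            first.append(hi)
            hi -= 1
        take_lo = not take_lo
    # Each further row shifts every label by one (modulo n).
    return [[(x + shift) % n for x in first] for shift in range(n)]

def presentation_order(assessor_index, systems):
    """Assign assessors to the rows of the square cyclically."""
    square = williams_square(len(systems))
    row = square[assessor_index % len(systems)]
    return [systems[i] for i in row]

# Hypothetical systems under test.
systems = ["reference", "codec1", "codec2", "anchor"]
for assessor in range(4):
    print(assessor, presentation_order(assessor, systems))
```

With a panel size that is a multiple of the number of systems, each order is used equally often; otherwise simple randomisation per assessor may be the more practical choice.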

4.4.5 Anchoring

The results of a listening test are affected by the setting and context in which it is carried out. If the test concerns loudspeakers in mobile phones for listening to music and if these are compared to a small loudspeaker intended for music, the latter will be assessed as having a high bass strength. If the same small speaker is assessed together with floor standing Hi-Fi speakers, it will probably be assessed as having a low bass strength. In order to compare the results from different listening tests, it is necessary that there are some recurring systems which can be used as fixed reference points, so-called anchor systems. Two or three anchor systems may be chosen within the actual product category, in such a way that their characteristics are assessed as high, mid and low respectively for most attributes. The anchor systems are assessed on an equal basis with the other systems. The anchor systems serve two purposes, namely:
• To set the same reference frame for the test and encourage assessors to use the scales in a similar way to each other;
• To enable greater comparability between tests, across time or laboratory, etc.
Anchoring is also a mandatory component of certain recommendations for assessment of audio technology, e.g. ITU-R Rec. BS.1534-3 (2015), and a review of the impact of anchors and other sources of bias in listening tests is provided in Zieliński et al. (2008); Zieliński (2016).

4.5 GOOD PRACTICES IN REPORTING LISTENING TESTS

In this section, topics that should be included in a full report on a listening test are listed; they complement the checklist provided for your convenience in Section 4.6. A full report implies a report that gives credibility and confidence in the procedures used and the results obtained, with a level of information and detail such that other specialists from the field could duplicate the test. A full report may not always be needed, and the desired content and amount of reporting should be targeted to the purpose and the receiver. The report should contain results and information to such an extent and with such quality and accuracy, that it is adequate for the purpose. The list of topics in this section may be useful as an outline or starting point for selection of relevant topics. It might be preferable to present the overall experimental results at the start of the report, while detailed analyses could be in appendices. Observe that some standards have specific requirements for reporting (e.g. ITU-R Rec. BS.1534-3 (2015); ITU-R Rec. BS.1116-3 (2015)).

4.5.1

Front page

The first page of the report should provide formal information:

• Laboratory name and address;
• Document identification;
• Test type (and method);
• Summary of the purpose, the main results and conclusion;
• Signature and date;
• Terms of distribution of the report.

4.5.2

Introduction

The background for the test, the objectives and purpose and the involved parties shall be mentioned. Where relevant the current state of knowledge in the field, hypotheses, predictions or expected outcomes of the study may be mentioned.

4.5.3

Method

The method(s) used for the listening test should be described by reference to the standard (if any) or other references where the method is described. The independent/experimental variables, dependent/response variables, design of experiment and data collection procedures should be presented in detail.

4.5.4

Measuring objects

The test objects should be described in a clear manner, so they can be unambiguously identified. For tests with reproduced sound the programme/test material or sound samples shall be stated with origin (recording, producer, artist, year, etc.) and technical information (number of channels, sample rate, bit depth, etc.). Any equalisation applied should be stated.

4.5.5

Equipment

The instrumentation and software used for recording (if any) and reproduction (headphones, loudspeakers, sound cards, amplification systems) should be stated by type, make and model. The calibration procedure of the system should be stated. For accredited measurements, the last calibration date for traceable calibration of the equipment shall be stated. Special equipment may be described in an appendix.

4.5.6

Location

The location of the listening test should be mentioned, and the acoustical characteristics should be stated in relevant terms. For reproduced sound the properties of the listening room or listening booths could be stated in an appendix (e.g. by size, reverberation time, background noise level and fulfilment of standards). For field testing (concert venues, product sound, noise annoyance, etc.) appropriate descriptions and parameters must be given.

4.5.7

Assessors

The criteria for selection of test persons should be stated. The training of assessors (if any) should be mentioned. The type of assessors should be stated according to a suitable standard or recommendation. The number of assessors, their gender and age (average and standard deviation) should be mentioned. Any exclusion of assessors by post screening of the results shall be stated.


4.5.8

Physical measurements

The procedures and methods for any physical sound/noise measurements performed (e.g. sound pressure levels, frequency characteristics, distortion) or psycho-acoustic metrics (loudness, sharpness, roughness, fluctuation strength, etc.) on or related to the stimuli (background noise, reverberation time, etc.) should be stated. There may also be a need to report the loudspeaker (or headphone) performance characteristics (raw or equalised frequency responses, impulse responses) and their performance as located within the listening room.

4.5.9

Test administration

Time, dates and duration of the listening sessions should be stated. The amount of information given to the test persons about the purpose of the test, sound sources/situations of use, etc. may be relevant. Instructions to test persons, graphical user interfaces and questionnaires (if any) may be shown in an appendix.

4.5.10

Analysis and discussion

The types of analysis performed, statistical methods employed and discussions (if any) on the path from raw data to end results and conclusions should be described. Any correlation between physical measurement results and the results of the listening test may be analysed and discussed.

4.5.11

Results

Results should be given in a short and conclusive manner. Graphs should be clear with axis titles and units and appropriate text should be added to graphs and tables. Uncertainties (e.g. in the form of 95 % confidence intervals) on the results should be indicated where relevant. In interpretation of the results the uncertainties should be considered.
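As an illustration of the 95 % confidence intervals mentioned above, the following Python sketch computes the mean rating for one system together with a t-based interval. It assumes one independent rating per assessor and approximately normally distributed data; the ratings and the function name are hypothetical, and the table of critical values covers up to 30 degrees of freedom, with a normal approximation assumed beyond that.

```python
import math
import statistics

# Two-sided 95 % critical values of Student's t for 1..30 degrees of freedom.
T_CRIT_95 = [
    12.706, 4.303, 3.182, 2.776, 2.571, 2.447, 2.365, 2.306, 2.262, 2.228,
    2.201, 2.179, 2.160, 2.145, 2.131, 2.120, 2.110, 2.101, 2.093, 2.086,
    2.080, 2.074, 2.069, 2.064, 2.060, 2.056, 2.052, 2.048, 2.045, 2.042,
]

def mean_ci95(ratings):
    """Return (mean, lower, upper) of a t-based 95 % confidence interval."""
    n = len(ratings)
    if n < 2:
        raise ValueError("at least two ratings are needed")
    mean = statistics.mean(ratings)
    sem = statistics.stdev(ratings) / math.sqrt(n)  # standard error of the mean
    df = n - 1
    # Assumption: fall back to the normal value 1.96 beyond the table.
    t_crit = T_CRIT_95[df - 1] if df <= 30 else 1.96
    half_width = t_crit * sem
    return mean, mean - half_width, mean + half_width

# Hypothetical ratings of one system by eight assessors (0-100 scale).
ratings = [62, 70, 55, 68, 74, 60, 66, 71]
mean, lower, upper = mean_ci95(ratings)
print(f"mean = {mean:.2f}, 95 % CI = [{lower:.2f}, {upper:.2f}]")
```

When assessors rate several systems or repeat their ratings, the observations are no longer independent and an analysis that accounts for this (see Chapter 6) would be more appropriate than this simple interval.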


4.6 SUMMARY CHECKLIST

• Definition of research question
  - Test hypothesis
  - Null and alternative hypothesis definitions
• Test methodology selection
• Specification of input/independent variables
  - Test system selection
  - Sound sample selection
  - Other independent variables
• Specification of response/dependent variables
  - Attribute definition
  - Attribute selection
• Design of experiment
  - Treatment design
  - Allocation of stimulus
  - Analysis approach
• Assessor panel
  - Definition of assessor type (consumer/naïve, experienced, expert)
  - Assessor screening criteria (availability, demographics)
  - Assessor selection or recruitment
  - Assessor training
  - Consent forms
  - Payment
• Test location
  - Listening room, booth
  - Electroacoustic equipment
  - Calibration
• Test administration
  - Stimulus presentation software
  - Data collection software or questionnaire
  - Assessor instruction (spoken, written)
  - Assessor familiarisation
• Analysis and reporting
  - Documentation of experimental procedure
  - Statistical analysis and interpretation
  - Final report

CHAPTER 5

Sensory Evaluation Methods for Sound

Jesper Ramsgaard
Google, Mountain View, CA, USA

Thierry Worch
Qi Statistics, West Malling, UK

Nick Zacharov
FORCE Technology, Senselab, Hørsholm, Denmark

CONTENTS

5.1 Test method families
  5.1.1 Discrimination methods
  5.1.2 Integrative methods
  5.1.3 Descriptive analysis methods
  5.1.4 Mixed methods
  5.1.5 Temporal methods
  5.1.6 Performance methods
5.2 Discrimination methods
  5.2.1 Comparing pair(s) of systems discrimination tests
  5.2.2 Paired comparison of multiple systems based on a sensory attribute
5.3 Integrative methods
  5.3.1 Preference and affective methods
  5.3.2 Recommendation ITU-T P.800 (ACR, CCR, DCR)
  5.3.3 Recommendation ITU-R BS.1534 (MUSHRA)
  5.3.4 Recommendation ITU-R BS.1116
5.4 Consensus vocabulary methods
  5.4.1 Quantitative descriptive analysis
  5.4.2 Semantic differential
  5.4.3 Recommendations ITU-T P.806 and ITU-T P.835
5.5 Individual vocabulary methods
5.6 Mixed methods - explaining preference
  5.6.1 Audio descriptive analysis and mapping
  5.6.2 Multiple stimulus - Ideal profile method
5.7 Indirect elicitation methods
  5.7.1 Multi-dimensional scaling
  5.7.2 Free sorting
  5.7.3 Projective mapping and napping
5.8 Associative methods
  5.8.1 Free association
  5.8.2 Check-all-that-apply (CATA)
  5.8.3 Rate-all-that-apply (RATA)
5.9 Closing words

As researchers, sensory evaluation methods are our tools. Selecting the best-suited method will ensure that we can study our research question or test a hypothesis, and do this in a time/cost efficient manner. In this chapter we will give an overview of commonly encountered methods from sensory evaluation of sound. It will not be a complete overview of each method, nor of the complete selection of methods that are available, but it will be a pragmatic introduction that can help guide in choosing the right method to (efficiently) answer the questions of a study.

The concept of a method is generally not well defined, and often it is tempting to think of a method in terms of how we collect data for analysis, i.e. how are my scales oriented and how do I present my test question to the assessor? But what is a method really? Is it a statistical analysis approach? A way of collecting data, tied to a specific standard or recommendation? The way we present test stimuli to our subjects? Or something else?

For clarity we are going to define a sensory evaluation method as a means to evoke a response from assessors, which can be measured such that the data can be (statistically) analysed and interpreted. This definition captures the four key steps of sensory evaluation, as defined in Lawless (2013). From a practical perspective we understand these elements of a sensory evaluation method as:

Evoke: Presenting stimuli to an assessor. This includes considering context, calibration levels, number of independent variables, randomisation/blocking of stimuli, etc.;

Measure: Collect a response, typically a rating, using a response format like a discrete choice, ranking of samples, scaling on an attribute, etc. This may include the User Interface (UI) used for data collection;

Analyse: Apply appropriate statistical analysis, respecting the nature of the collected data for hypothesis testing or exploratory analysis;

Interpret: Interpret the data and analysis for the purpose of answering the original research question.

Within this definition a large variety of sensory methodologies exist, each with its own primary purpose, characteristics and properties. Some methodologies lead to experiments that answer simple research questions and are easy and fast to set up, while others are more tedious and time consuming, but provide more information about the systems under test. Some methodologies are less dependent upon the assessor type to be employed, while others require particular assessors, whether they need to be a priori trained or expert or simply selected consumers. In this chapter, we will provide a general overview of test method families and present a selection of methodologies based on their objectives, typical characteristics, and practical implementation, with many examples for sound and audio applications.

5.1 TEST METHOD FAMILIES

In this section we provide a short overview of families of methods with different purposes. The aim here is merely to familiarise you with these families and illustrate the key differences. Thereafter we will provide an in-depth introduction to many of the key methods and further detailed examples are presented throughout Part III.

The comparison of methods has been quite widespread in the sensory evaluation of food, but less so in our field. Several authors and standards have provided a systematic structure illustrating various methods that can assist in method selection, including Lorho (2010, Figure 2.5), Volk (2016, Figure 2), Bech and Zacharov (2006, Figure 4.12), Varela and Ares (2014, Table 14.1), Dehlholm (2012, Figure 1) and ITU-R Rec. BS.1283-1 (2003). Each of these tables and flow diagrams considers the method categorisation from a slightly different perspective, highlighting the complex nature of method selection.

To guide you in method selection, we have collected together many of the methods relevant to the sensory evaluation of sound in Figure 5.1. This high-level figure illustrates primarily Listening Only Test (LOT) types and does not cover conversational or multiparty test types, as discussed in Chapter 8, nor the specifics of localisation or speech intelligibility tests, as discussed in Part III. Temporal methods are mentioned in this figure, although not covered within this book, as they have not yet been widely used in our field - an overview of generic temporal methods can be found in Dehlholm (2012). Based on this structure we will present a short overview of each category and their applications. Many of the methods are presented in greater detail in Sections 5.1.1 to 5.8.3.

5.1.1

Discrimination methods

Discrimination methods are primarily employed to establish whether there is a perceived difference between products and to answer this question efficiently and with confidence. A wide range of discrimination techniques exist, from the paired comparison (ISO 5495, 2005), ABX (Munson and Gardner, 1950) / duo-trio (ISO 10399, 2017), 2-AFC (Fechner, 1889), to tetrad tests (Ennis and Jesionka, 2011) and so forth. Common to all of these methods is the fact that discrete choices are performed by each assessor. This results in an ordinal data set, which requires a specific type of statistical analysis, as discussed further in Section 6.5.2.7. The primary use of a discrimination test is simply to evaluate similarity and differences between stimuli, e.g. systems under test. In most cases, this is based on certain sensory attributes or overall quality, or focuses upon a hedonic question, such as preference. The paired comparison is most commonly used for preference type questions, whereas the triangle test and tetrad method typically consider similarity.
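To illustrate the kind of analysis such discrete choices call for, the following Python sketch computes a one-sided exact binomial p-value for a discrimination test, i.e. the probability of obtaining at least the observed number of correct responses if every assessor were guessing. The guessing probability is 1/3 for a triangle test and 1/2 for ABX, duo-trio or a paired comparison. The function name and the example counts are hypothetical, and this is only one possible analysis, not necessarily the one presented in Section 6.5.2.7.

```python
from math import comb

def binomial_p_value(correct, trials, p_chance):
    """One-sided exact p-value: P(X >= correct) under pure guessing."""
    return sum(
        comb(trials, k) * p_chance**k * (1 - p_chance) ** (trials - k)
        for k in range(correct, trials + 1)
    )

# Hypothetical triangle test: 14 of 24 assessors identify the odd sample.
p = binomial_p_value(correct=14, trials=24, p_chance=1 / 3)
print(f"p = {p:.4f}")
```

A small p-value suggests that the number of correct responses is unlikely to have arisen from guessing alone; showing that two systems are similar requires a different formulation of the hypotheses and usually more assessors.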

5.1.2

Integrative methods

Integrative methods cover the family of techniques whereby assessors are asked overall or global questions regarding their perception of a product. This could be in terms of quality, e.g. using the Basic Audio Quality (BAQ, ITU-R Rec. BS.1534-3 (2015)) or impairment (ITU-R Rec. BS.1116-3, 2015) scale, or in terms of preference, using for example the 9-point hedonic scale (Jones et al., 1955). Common to this family of methods is the idea of primarily employing a single scale or dependent variable. In this manner the collected data may be easily and quickly analysed using univariate statistical analysis methods (see Chapter 6)¹. This family of methods is common both in the food and audio industries for a number of reasons including:

• Well-established methods and practices (standards and recommendations);
• Moderately rapid data collection, due to single dependent variable;
• Finds “winners”: highest quality, most liked, most transparent, etc.

Figure 5.1: An overview of primary sensory evaluation methods, focusing upon listening only test (LOT) methods. The diagram also includes both audio specific and generic sensory evaluation methods.

Due to the well-established practices in the fields of both sensory evaluation and the audio industry, we divide the methods into two categories. Common to both categories is the usage of a single dependent variable, scale or attribute to assess overall performance, whether related to basic audio quality or overall preference or acceptance. Comparison of scales is often performed (see for example Villanueva and Da Silva (2009)), and although the relationship between affective/hedonic scales and audio quality scales is not well understood, these are often used interchangeably. For a review of the relationship between audio quality scales and affective scales see Zacharov et al. (2017).

Affective methods: The affective methods are commonly found across all fields of sensory evaluation and generally ask assessors to evaluate the overall preference or acceptance of a product in a global, integrative and overall manner, i.e. taking into consideration all perceived aspects of the product. Based on the 9-point hedonic scale (Jones et al., 1955), a large number of scale variants have evolved over the years, including the Labelled Affective Magnitude scale (LAM, Schutz and Cardello (2001)), Labeled Hedonic Scale (LHS, Lim et al. (2009)), Labeled Magnitude Scale (LMS, Green et al. (1993)) and Hybrid Hedonic Scale (HHS, Villanueva and Da Silva (2009)).

Audio quality methods: In the field of audio and telecommunication a number of integrative methods exist, many of which are standardised in ITU Recommendations.
For telecommunications, the ITU-T provides the ITU-T P.800 recommendation (ITU-T Rec. P.800, 1996), which includes several Listening Only Tests (LOT): Absolute Category Rating (ACR), Degradation Category Rating (DCR) and the Comparison Category Rating (CCR) scale and methods. The three methods are widely employed and vital in telecommunication technology development. For audio related applications, a number of recommendations exist within the ITU-R BS-series, with two key recommendations being the most commonly employed. The ITU-R Rec. BS.1534-3 (2015), also referred to as MUSHRA (MUlti-Stimulus test with Hidden Reference and Anchor), is used to evaluate primarily Basic Audio Quality and is applied to cases where the quality is not completely transparent, i.e. artefacts are present and clearly audible. The ITU-R Rec. BS.1116-3 (2015) recommendation is a detailed method applied to the assessment of small-impairment situations, when the technology under evaluation is close to transparent. Guidance on the usage and selection of ITU-R methods is provided in ITU-R Rec. BS.1283-1 (2003).

¹ It should be noted that certain data collected using a single response variable, e.g. a similarity scale, may also be analysed using multivariate techniques to obtain a perspective on any latent multidimensional space. Multi-Dimensional Scaling (MDS), as discussed in Section 5.7.1, is one example and more generically this concept of multivariate analysis of a single response variable is referred to as Internal Preference Mapping (IPM). Principal Components Analysis (PCA) is another way to perform IPM on univariate data.


5.1.3

Descriptive analysis methods

To a large extent this book is anchored around the wide range of Descriptive Analysis (DA) methods, in their various forms. This broad family of methods seeks to address research questions that go beyond merely answering questions of preference or best quality. When employing descriptive methods to study the perception of product and sound quality, some of the following research questions can be addressed:

• How is the sound quality described?
• What are the salient perceptual attributes/characteristics?
• What is the relative contribution of each attribute/characteristic to our perception?

Answering such questions can be handled using a range of different techniques that have different benefits and constraints.

Consensus vocabulary methods refer to Descriptive Analysis (DA) methods where assessors employ a so-called Consensus Vocabulary (CV) of attributes to evaluate the perceptual difference between the systems under test in an objective manner. Attributes that characterise the nature of the sound quality are defined by a trained/expert panel, using a well-defined process (see Section 4.3.2). The assumption is that assessors will generally perceive products in a similar manner, allowing them to establish a common terminology to describe these perceptual characteristics and differences using a consensus vocabulary of attributes. Using such attributes is beneficial for evaluating product perception and differences, as well as providing a means of communication across the panel and other parts of an organisation, e.g. engineering and management. A panel is employed to define or select pertinent attributes and in so doing avoids or reduces experimenter bias. The panel rates the systems under test using the defined attributes. The results are then submitted to a univariate analysis to obtain an overview of the contributing dependent variables for each attribute. A multivariate analysis can be applied to all attributes to uncover the consensus nature of the underlying latent multidimensional structure. Examples of consensus elicitation methods include descriptive analysis techniques such as Quantitative Descriptive Analysis (QDA, Stone et al. (1974)) and Semantic Differential (Osgood et al., 1957). Attribute methods have also started to become standardised for specific applications, e.g. ITU-T Rec. P.835 (2003) or ITU-T Rec. P.806 (2014).

Individual vocabulary methods also aim to characterise the perceptual nature of the product differences using attribute scaling by assessors. However, compared to the significant task of developing and training a panel to develop and use consensus attributes, the individual methods allow each assessor to employ their own Individual Vocabulary (IV) of attributes. Furthermore, the assessors are even able to define different numbers of attributes. The assumption behind this approach is that assessors will fundamentally perceive the stimuli in a similar manner, but may use different adjectives to express their perception. Compared to CV methods, each assessor will assign their own terminology for each attribute. For example, assessors might use terms such as tinny, shrill, emphasised or sharp to describe high-frequency emphasis. The common underlying multidimensional space is then identified using advanced multivariate techniques, based on these assumptions. Using the individual’s attribute names, the experimenter is then able to ascertain the common nature of the dominating perceptual characteristics and associated explained variance. The origins of individual elicitation methods lie with Free-Choice Profiling (FCP, Williams and Langron (1984)) and Flash Profile (FP, Dairou and Sieffermann (2002)), developed and applied for non-sound applications. Individual Vocabulary Profiling (IVP) has been developed and applied for audio applications by Lorho (2010). The Repertory Grid Technique (RGT, Kelly (1955a)), with its origins in the field of psychology, has also been successfully applied to audio applications by Berg and Rumsey (2006). These methods are sometimes considered more rapid than CV methods, as a trained panel can be avoided, as can the consensus attribute effort. However, the increased number of assessors sometimes reported as necessary to obtain repeatable and viable results, plus the complexity of analysis/interpretation, should also be considered when comparing the consensus and individual vocabulary approaches.

Indirect vocabulary methods: Both consensus and individual vocabulary methods use elicited attributes as the response variables. By comparison, indirect methods do not ask the assessor to provide their ratings for a specific attribute but rather derive the latent multidimensional perceptual space directly from a multivariate analysis, as introduced in Chapter 7. Multi-Dimensional Scaling (MDS, Kruskal (1964)) is one of the oldest indirect methods, whereby assessors are asked to compare pairs of stimuli and rate their similarity or dissimilarity on a line scale. Using one of the MDS algorithms such as INDSCAL or ALSCAL, a multivariate analysis can be performed to yield the latent multidimensional space. Projective Mapping (Risvik et al., 1994) or Napping® (Pagès, 2005) allows assessors to sort the stimuli/products in a 2-dimensional space according to their own perception and criteria.
In a manner of speaking this is a 2D version of MDS. Once sorted, the relative spatial locations of the stimuli are then used as input to a multivariate analysis using Multiple Factor Analysis (MFA, Escofier and Pagès (1994)), yielding a view of the common latent multidimensional space. Free Sorting methods are similar to Napping; however, assessors are asked to group stimuli/products that are perceived as similar. These methods and associated analysis techniques are presented in Cadoret et al. (2009); Qannari et al. (2010); Cadoret et al. (2011); Courcoux et al. (2014) and specifically Sequential Agglomerative Sorting (SAS) is discussed in Section 7.4.1.

In general these methods are interesting, as there is little potential bias associated with the use of attributes. Assessors evaluate, compare and sort the stimuli/products according to what they perceive. This neutrality may be considered beneficial. However, as there is no knowledge prior to analysis of whether there is any commonality across the assessors, a relatively large number of assessors should be considered for the task to ensure a stable and repeatable analysis and interpretation.

One of the challenges with certain indirect methods is to establish the meaning of the dimensions or components following the multivariate analysis. In these cases, the meaning of latent dimensions has to be ‘interpreted’ by the experimenter, introducing their own opinion at this stage. Certain indirect methods, such as Perceptual Structure Analysis (PSA, Choisel and Wickelmaier (2006); Wickelmaier and Ellermeier (2007)), require assessors to provide input regarding their grouping or choices, which is beneficial in guiding the experimenter during the interpretation of dimensions.

Associative methods: Associative methods is a novel term introduced here to describe methods that give the assessor freedom to report what they perceive. Free association is one such method, discussed in Nielsen and Ingwersen (1999) with more detailed examples in Ramsgaard (2009); Sester et al. (2013) and discussed in Section 13.3.2. The method allows assessors to freely express their perceptions using their own words. Following a verbal clustering, a count-based statistical analysis may be performed. Associative methods based on fixed lists, such as Check-All-That-Apply (CATA, Adams et al. (2007)) and Rate-All-That-Apply (RATA, Ares et al. (2014)), provide assessors with lists of words (sensory attributes, emotional descriptors, etc.) from which the assessors can freely select the most pertinent associations and descriptions. RATA is a hybrid technique comprising pre-selected attributes of which assessors only rate those that they find applicable to describing the test stimuli. These methods are gaining increased interest, as they allow the assessor to express their perception with less bias than other techniques. This is potentially of great interest, as consumers may also be employed for such tasks. However, due to the less formal and unbiased structure of associative methods, the requirements for ensuring good data quality are not yet fully known, i.e. how many assessors are needed to ensure reliable and repeatable results.
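As a small illustration of the count-based starting point for CATA-type data described above, the Python sketch below tallies, for each stimulus, how many assessors checked each term. The records, term list and function name are hypothetical; the resulting table of citation counts would then be passed on to whichever count-based statistical analysis is chosen.

```python
from collections import Counter

# Hypothetical CATA records: (assessor, stimulus, set of checked terms).
responses = [
    ("a1", "system_A", {"bright", "clear"}),
    ("a2", "system_A", {"bright"}),
    ("a3", "system_A", {"clear", "wide"}),
    ("a1", "system_B", {"boomy", "muffled"}),
    ("a2", "system_B", {"muffled"}),
    ("a3", "system_B", {"boomy", "wide"}),
]

def citation_counts(records):
    """Count, per stimulus, how many assessors checked each term."""
    counts = {}
    for _assessor, stimulus, terms in records:
        counts.setdefault(stimulus, Counter()).update(terms)
    return counts

counts = citation_counts(responses)
for stimulus, counter in counts.items():
    print(stimulus, dict(counter))
```

Storing the checked terms as a set per trial ensures that a term is counted at most once per assessor and stimulus.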

5.1.4

Mixed methods

Mixed methods is the generic term applied to statistical analysis techniques that combine data collected using different experimental methods for a more in-depth analysis. In the context of explaining drivers of liking, mixed methods are often referred to as preference mapping (Næs et al., 2010). A number of mixed methods, some specific to sound, have been developed including: Audio Descriptive Analysis and Mapping (ADAM, Zacharov and Koivuniemi (2001a)), Just About Right (JAR, Moskowitz (1972); Rothman and Parker (2009)) scales, Ideal Profile Method (IPM, Worch (2012)), Multiple Stimulus - Ideal Profile Method (MS-IPM, Legarth et al. (2014)), Open Profiling of Quality (OPQ, Strohmeier et al. (2010)) and Individual Vocabulary Profile with Preferences by Pairwise Judgements (IVPPPJ, see Chapter 12). For each of these approaches both descriptive attribute-based ratings as well as global affective ratings are collected and then analysed in combination. The approaches for each of these methods differs specifically in terms of how data is collected and analysed, i.e. using consensus or individual attributes, with rating scales or pair comparisons and so forth. However, common to all of these mixed methods is the concept of explaining the global affective data from the descriptive data. External Preference Mapping (EMP) is the common name for regression of the affective data onto descriptive attribute data, explained in Næs et al. (2010) and Section 7.4. An alternative approach, Landscape Segmentation Analysis (LSA, Rousseau (2009)) performs a similar analysis, however, projecting the descriptive attribute data onto the latent multidimensional space of the global affective data, following an internal preference mapping. Whilst both EMP and LSA are very different in nature, their primary aims are similar, i.e. to establish the drivers of liking - a major step beyond the separate analysis of the

affective or descriptive/attribute data. LSA additionally estimates an ideal point for each assessor, which is subsequently projected into the space. Lastly, in the mixed method category we have the Ideal Profile Method (IPM, Worch (2012); Worch et al. (2013a, 2014)), which was originally developed for consumer evaluation of food and consumer products. The Multiple Stimulus - Ideal Profile Method (MS-IPM, Legarth et al. (2014)) is an evolution of IPM for use in the field of audio with an expert assessor panel. Compared to other mixed methods, IPM and MS-IPM collect some additional data from assessors, comprising so-called ideal ratings for each descriptive attribute. Using this additional data, the experimenter has the chance to study the following questions:
• Which products/stimuli are preferred or have the highest quality?
• How are the products/stimuli characterised perceptually?
• What are the perceptual characteristics contributing to the preference or quality?
• How well does the preferred or highest quality relate to a projected assessor ideal rating?
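As a rough illustration of how ideal ratings might be used, the sketch below computes the mean deviation of each product from the assessors' ideal on each attribute. It is a simplified illustration only, not the IPM or MS-IPM analysis itself; the assessor names, attributes, 0 to 100 scale and all values are hypothetical.

```python
from statistics import mean

# Hypothetical data: ratings[assessor][product][attribute] and
# ideal[assessor][attribute], all on the same (assumed) 0-100 scale.
ratings = {
    "assessor_1": {"A": {"bass": 60, "treble": 40}, "B": {"bass": 30, "treble": 70}},
    "assessor_2": {"A": {"bass": 50, "treble": 50}, "B": {"bass": 20, "treble": 80}},
}
ideal = {
    "assessor_1": {"bass": 55, "treble": 50},
    "assessor_2": {"bass": 45, "treble": 60},
}

def mean_deviation_from_ideal(ratings, ideal):
    """Mean (perceived - ideal) per product and attribute across assessors."""
    products = sorted({p for per_assessor in ratings.values() for p in per_assessor})
    attributes = sorted(next(iter(ideal.values())))
    return {
        p: {
            a: mean(ratings[s][p][a] - ideal[s][a] for s in ratings)
            for a in attributes
        }
        for p in products
    }

deviation = mean_deviation_from_ideal(ratings, ideal)
print(deviation)
```

A positive value indicates that a product is perceived as having more of an attribute than the assessors' ideal, a negative value less. Here product A sits closer to the ideal than product B on both attributes.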

5.1.5

Temporal methods

The category of temporal methods focuses upon the assessment of stimuli or products over a period of time, whether considering affective or descriptive characteristics. The origins of temporal methods relate to the Time Intensity (TI) method developed by Neilson (1957) in the domain of consumer products. More recently the Temporal Dominance of Sensations (TDS, Pineau et al. (2009)) and the Temporal Check-All-That-Apply (TCATA, Castura et al. (2016); Ares et al. (2016)) approaches have been developed to track the temporal evolution of descriptive/attribute characteristics. The TDS method has been applied to products such as chewing gum to study how the flavour and texture of products evolve over a period of time. In the field of video quality the Single Stimulus Continuous Quality Evaluation (SSCQE) method and the Simultaneous Double Stimulus for Continuous Evaluation (SDSCE) method have been standardised in ITU-R Rec. BT.500-13 (2012) for the assessment of overall performance characteristics, such as quality or degradation of image quality as a function of time. For speech, the Continuous Evaluation of Time-Varying Speech Quality method (CETVSQ, ITU-T Rec. P.880 (2004)) was developed specifically for the temporal evaluation of overall perceived speech quality. The usage of such methods in sound and audio has so far been limited. However, in future their application may be of interest in cases such as adaptive technologies, where transitions in conditions may affect the sound quality.

5.1.6

Performance methods

The last category of methods we present is performance methods. So far we have covered methods that evaluate whether there are quality, preference or specific attribute differences. However, in sound we also have other characteristics of interest for specific applications. Broadly speaking, two categories of performance metrics exist:
• Self-reported performance metrics;
• Neurophysiological metrics.

Speech intelligibility, localisation, etc., fall into the category of self-reported performance metrics, whilst neurophysiological metrics are indirect measures of human performance.

Localisation: For many years localisation has been a very common means to evaluate the performance of spatial technology, i.e. binaural or spatial sound reproduction systems. As found in the literature (Lindau et al., 2014; Pedersen and Zacharov, 2015; Zacharov and Pedersen, 2015; Zacharov et al., 2016a; Francombe et al., 2017a,b), there are many spatial sound quality attributes, of which only a few relate to localisation. Localisation methods provide very specific and accurate information regarding how well sound sources are localised. More advanced drawing (Ford, 2004; Ford et al., 2003) and pointing methods are an extension of localisation methods, enabling a study of sound source trajectories, the extent of sound objects and so forth. Examples of localisation and pointing techniques are discussed in Section 11.5.1 and Section 11.5.2.1.

Speech intelligibility: Speech Intelligibility (SI) is a vital measure of speech communication systems, particularly in poor signal-to-noise ratio conditions, whether due to the nature of the noise environment or the hearing mechanism. Speech intelligibility is vital for the assessment of telecommunication and hearing aid instrument performance. Today these methods are complemented by other measures of speech quality and other performance metrics, e.g. cognitive load. When using a mobile phone in very poor signal-to-noise ratio conditions, the speech quality may be evaluated as very poor or bad, yet there is still a measurable degree of speech intelligibility. This illustrates that performance measures are valuable in themselves, providing specific performance information that is complementary to other measures of quality. The majority of SI methods evaluate the percentage of lost consonants of speech or other related metrics, which are typically application- and language-task specific. Some examples of intelligibility methods include the triple digit test (Smits et al., 2013), the Hearing-In-Noise Test (HINT, Nilsson et al. (1994)), and other so-called matrix tests, such as the Dantale II (Wagener et al., 2003) or the Oldenburger Satztest (OLSA, Wagener et al. (1999a,b,c)). A number of SI methods are discussed further in Section 8.3 and Section 9.7.1.

Neurophysiological metrics: This is a family of methods that measure a neurophysiological characteristic. Whilst not new in nature, only recently have such methods been gaining interest as successful indicators of human performance. Whilst today we are not yet able to measure sound quality directly in this way, other important characteristics can be estimated using this toolset. Examples of such techniques include Galvanic Skin Response (GSR), electroencephalography (EEG) and pupillometry, which are discussed further in Sections 9.7.3 and 13.1.5. Pupillometry is presently finding applications in audiology as a measure of cognitive load, valuable for the evaluation of the performance of new speech enhancement technologies, and complementing SI and other measures of speech/sound quality (Zekveld et al., 2010).

The remainder of this chapter is dedicated to the in-depth description of each method, starting with discrimination methods.


5.2 DISCRIMINATION METHODS

Discrimination (or difference) testing methods are a family of methods suitable for testing whether there are perceived differences between systems under test. This is one of the simplest questions to ask when looking at a set of stimuli. The reverse question, typically answered with the same methods, is: Are these products similar? Discrimination methodologies are essential for sensory science, and are taught in every curriculum in sensory evaluation (see e.g. Lawless (2013); Meilgaard et al. (2006)). It is hence no surprise that this is one of the most popular and most used families of methods, particularly within sensory evaluation of food, where such methods are still being developed and improved (Ennis et al., 2014). Examples from the food sector where discrimination tests are applied include quality control, recipe modifications (fat or salt reduction related to stricter legal norms), cost reduction (ingredient provider change), etc. In these situations, the question could be either: Is my new sample (recipe improvement) different from my current/competitor product? Or: Is my new product (fat/salt reduced) perceptually similar to my current product? Appropriate applications in the audio industry could be: Does a change in components change the perceived sound? Is a codec implementation perceived as transparent compared to the original reference? Overall, this family of methods is well suited for formal hypothesis testing and not for exploratory purposes.

5.2.1

Comparing pair(s) of systems discrimination tests

As we can see from the examples above, the motivations for applying discrimination tests vary, but typically boil down to a basic comparison of two products. This does not mean that we cannot test more than two products in an experiment (assuming we have enough time), but that we should consider each paired comparison a separate test. The paired comparison method is just one of several discussed in this section, as illustrated in Figure 5.2.

Independent variables: One discrimination test comprises comparing one pair of systems under test (A and B). Multiple pairs of systems can be tested within the same experiment if time limits allow. Discrimination tests are appropriate for comparing systems that are perceptually very similar.

Assessors: In discrimination tests we can employ either consumers, naïve/untrained listeners, or trained and expert assessors, depending on the purpose of the research. The methods are simple in terms of test instructions and are easy to carry out from an assessor perspective. The selection of assessors will influence how the test results may be interpreted and extrapolated. Testing with expert assessors is not representative of the perception of the general consumer population, but represents the critical case in terms of exposing potential weaknesses (Ishii et al., 2007). On the other hand, testing with consumers will not give us an indication of what a highly critical sub-population will perceive. The number of assessors needed for an experiment depends on the differences between the systems under test, the level of training of the assessors, and the intended purpose of the test. This topic is discussed further below.

Presentation of stimuli: Discrimination tests are typically performed using sound files presented over test software. Real device testing is possible but sets strict requirements for blinding test subjects. In order to avoid loudness bias (unless loudness differences are an inherent part of the test), calibration procedures should be followed.


Figure 5.2: Example discrimination test user interfaces for collecting data: (a) 2-AFC, (b) triangle test, (c) duo-trio test, (d) tetrad test. The grey text (i.e. A, B) is not visible to the subject, but is included to show the “randomisation” for the given presentation. Note the simple nature of the test instructions.

When comparing pairs of products, two types of overall methods can be applied:
• Indirect/unspecified tests: The comparison of products is made based on the whole sample without specifying the nature of the difference. From an assessor perspective the indirect/unspecified test hence gives no instructions regarding what perceptual aspects to focus upon.
• Directional/direct/specified tests: The comparison of products is made based on a pre-defined attribute (e.g. treble extent, or coding artefacts). In this case the assessor is given instructions regarding which characteristic(s) to focus upon during the evaluation.

Amongst the indirect/unspecified tests, the duo-trio (ISO 10399, 2017) and the triangle tests (ISO 4120, 2004) are the most commonly encountered methodologies. However, since the late 2000s, a lot of interest has been shown in the unspecified tetrad test, which became popular due to its advantages from a statistical perspective (see e.g. Ennis et al. (2014)). To describe the duo-trio, triangle and tetrad tests, let us assume that two products A and B are being assessed.

In the duo-trio test (see Figure 5.2(c)), each assessor is presented on each trial with an identified reference, e.g. sample A, alongside two blinded products (A and B, or B and A), one of which matches the reference sample. The assessor is asked to indicate which blinded sample matches the reference. The duo-trio test is similar to what we in the audio industry know as the ABX test.

In the triangle test (see Figure 5.2(b)), each assessor is presented with a set of three blinded products on each trial, two of them being identical and different from the third one (e.g. AAB, ABA, BAA, ABB, BAB, BBA). The assessor is asked to indicate the odd sample amongst the three.

In the tetrad test (see Figure 5.2(d)), the assessor is presented with a set of four presentations, corresponding to a pair of each of the products evaluated (e.g. AABB, ABAB, ABBA, BBAA, BABA, BAAB). The assessor is then asked to sort the samples into two groups of two similar products.

Amongst direct/specified tests, we can mention the two-alternative forced choice (2-AFC, see Figure 5.2(a)), the 3-AFC, or more generally the m-AFC. Again, direct/specified tests imply that the assessment of the products is based on a known characteristic. The m-AFC consists of presenting to each assessor a set of m products, (m − 1) of them being identical (A, say), the other one being different (B) on a known attribute (e.g. treble). Since the m-AFC is a directional test, the question should be adapted to the products presented. Let A have more treble than B. In a test in which (m − 1) products A are presented with one sample B, the assessor is asked to detect the sample with the lowest level of treble amongst the m products. Conversely, in a situation in which (m − 1) products B are presented with one sample A, the assessor is asked to detect the sample with the highest level of treble amongst the m products.

As seen above, the discrimination test methods require comparing two systems, noted here as A and B. However, depending on the method used, some of these systems may be ‘duplicated’ within a trial. In the 2-AFC, only two products are assessed simultaneously; in the triangle test, three products are assessed simultaneously, one of the samples (A, say) being duplicated whereas the other sample is presented only once (B, say). In the tetrad test, four products are assessed simultaneously, each sample being presented twice. In order to fully understand the nature of discrimination tests, we will describe the experimental design and statistics already in this chapter.
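The presentation orders listed above for the triangle and tetrad tests can be enumerated programmatically, which is convenient when preparing a test plan. The sketch below is illustrative only. The balanced assignment of orders to assessors is an assumption about common practice rather than something prescribed in the text, and the function names are hypothetical.

```python
from itertools import permutations
import random

def distinct_orders(samples):
    """All distinct presentation orders of a multiset of samples."""
    return sorted({"".join(p) for p in permutations(samples)})

# Triangle test: two identical samples plus one odd sample.
triangle_orders = distinct_orders("AAB") + distinct_orders("ABB")
# Tetrad test: each of the two products presented twice.
tetrad_orders = distinct_orders("AABB")

def assign_orders(orders, n_assessors, seed=0):
    """Cycle through the orders so each is used about equally often,
    then shuffle which assessor receives which order (assumed design)."""
    rng = random.Random(seed)
    plan = [orders[i % len(orders)] for i in range(n_assessors)]
    rng.shuffle(plan)
    return plan

print(triangle_orders)
print(tetrad_orders)
print(assign_orders(triangle_orders, 12))
```

Both protocols yield six distinct orders, matching the examples given in the text.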
To set up a discrimination test correctly, five parameters need to be properly defined. These five parameters are inter-connected, meaning that setting four of them automatically defines the fifth. The five parameters for consumer sensory discrimination are (Rousseau, 2015):
• The method used (triangle test, 2-AFC, etc.);
• The size of the difference of interest ($P_{dR}$ or $\delta_R$);
• The assessor sample size $N$;
• The type-I error $\alpha$ [2];
• The type-II error $\beta$ [3], indirectly used through the power [4] $(1 - \beta)$ of the test.

In most situations, the method used and the size of the difference are defined by the users based on habits, knowledge, and requirements. Also, the type-I error α for the test is usually set by the experimenter, 5 % being used most commonly. Two parameters remain to be defined through software: the power (knowing N) and the sample size N (knowing the power) [5]. In a triangle test in which we consider a difference to detect of, α at 5 %, and β at 20 % (i.e. power of 80 %), the required sample size N is 202 assessors, if each assessor is only assessing one combination of products (e.g. AAB). In the same settings, the tetrad test would require a total sample size of 61 assessors, whereas the 3-AFC would only require a total sample size of 16 assessors. A triangle test with, α = 5 %, and N = 20 assessors has a power of 16.6 %. In the same settings, the tetrad test has a power of 39.0 %, whereas the 3-AFC has a power of 84.4 %, as a guide for method selection (Ennis, 2012).

This is only the case if we present each assessor with only one trial. From an audio perspective we can typically present each assessor with multiple trials. This means that, despite the strengths of the tetrad test, the triangle test is typically a valid choice as the experimental design is less sensitive to the number of test subjects. We substitute assessors by adding more trials for each assessor. Generally in audio, repeated testing with the same assessor is not practised. However, in such cases please refer to Bi and Ennis (1998) for details regarding appropriate analysis.

[2] A type-I error, also called a false positive, is related to incorrect rejection of a true null hypothesis.
[3] A type-II error, also known as a false negative, is related to a failure to reject a false null hypothesis.
[4] The power of a test is the probability of observing a difference that is there: statistically, it corresponds to the probability of rejecting H0 when H1 is true.
[5] Most software only provides two options: computing the power of the test, or defining the sample size N based on the setup of the entire test (protocol used, difference to detect, α, and N or power).

For each of these tests, whether they are specified or unspecified, the answers provided by the assessors are either correct (e.g. they find the odd sample in the triangle test) or incorrect. The result of such a test is the total number of correct answers $x$ out of the total number of answers $n$, often expressed as the proportion of correct answers $P_c$:

$P_c = x / n$  (5.1)
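The relationship between sample size, α and power discussed earlier in this section can be explored with an exact binomial calculation. The sketch below is a minimal illustration using only the Python standard library. The function names are hypothetical, and the true proportion of correct answers of 0.418 used in the example is an assumption (roughly the triangle-test value for a Thurstonian difference of 1), since the size of the difference is not stated at this point in the text.

```python
from math import comb

def binom_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

def critical_value(n, p_guess, alpha=0.05):
    """Smallest number of correct answers that is significant at level alpha."""
    for k in range(n + 2):
        if binom_tail(k, n, p_guess) <= alpha:
            return k

def power(n, p_guess, p_correct, alpha=0.05):
    """Probability of reaching the critical value when the true
    proportion of correct answers is p_correct."""
    return binom_tail(critical_value(n, p_guess, alpha), n, p_correct)

# Triangle test (guessing probability 1/3), 20 assessors, one trial each.
# p_correct = 0.418 is an assumed value, not taken from the text.
print(critical_value(20, 1 / 3))
print(round(power(20, 1 / 3, 0.418), 3))
```

Under these assumptions at least 11 correct answers out of 20 are needed for significance, and the resulting power is low, in line with the figure quoted for the triangle test. The same `binom_tail` function gives the p-value of a difference test when called with the observed number of correct answers and the guessing probability. A required sample size can be found by increasing `n` until the target power is reached, keeping in mind that exact binomial power does not increase smoothly with `n`.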

Under the assumption of independent observations, the sampling distribution of $P_c$ is a binomial distribution:

$n P_c = x \sim \mathrm{Binomial}(n, p_c)$  (5.2)

with $p_c$ the probability of correct answers [6], which is defined as the maximum between $P_c$ and $P_G$. Since assessors are forced to provide an answer, they may guess it right even without detecting the difference between the samples. For this reason, we associate with each protocol a guessing probability (noted $P_G$). It can easily be proven that for the 2-AFC and the duo-trio, the guessing probability is $1/2$; for the 3-AFC, the triangle test, and the tetrad test, the guessing probability is $1/3$; and more generally, for the m-AFC, the guessing probability is $1/m$.

The difference between the two samples evaluated can be measured from two points of view: that of the assessors and that of the products. These two points of view define the core of the guessing model and the Thurstonian model, respectively. Let us assume that the panel of assessors used to perform the test is divided into two distinct groups, one of which clearly distinguishes between the products, whereas the other cannot detect the differences and is simply guessing the answer. In this case, the sensory distance between the products is expressed in terms of the proportion of assessors who detect the difference. This proportion is often referred to as the proportion of true discriminators and is defined as:

$P_D = \dfrac{P_c - P_G}{1 - P_G}$  (5.3)
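Equation 5.3 is straightforward to apply once the guessing probability of the protocol is known. The following is a minimal sketch; the dictionary and function names are hypothetical, and the clamping of the observed proportion at the guessing probability follows the definition of $p_c$ given above.

```python
# Guessing probabilities for the protocols discussed in this section.
GUESSING_PROBABILITY = {
    "2-AFC": 1 / 2,
    "duo-trio": 1 / 2,
    "3-AFC": 1 / 3,
    "triangle": 1 / 3,
    "tetrad": 1 / 3,
}

def proportion_discriminators(n_correct, n_total, protocol):
    """Proportion of true discriminators P_D from Equation 5.3."""
    p_guess = GUESSING_PROBABILITY[protocol]
    # The proportion correct is not allowed to fall below chance level.
    p_correct = max(n_correct / n_total, p_guess)
    return (p_correct - p_guess) / (1 - p_guess)

# Hypothetical triangle test: 42 correct answers out of 100.
print(round(proportion_discriminators(42, 100, "triangle"), 2))
```

With 42 correct answers out of 100 in a triangle test, the estimated proportion of discriminators is 0.13. A result at or below chance level gives a proportion of zero.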

As can be seen, $P_c$ and $P_D$ are linearly linked, meaning that results from different protocols cannot be directly compared to each other, since only the guessing probability is used here. This approach is also known as the guessing model.

[6] Mathematically, $P_c$ is defined between 0 and 1, whereas $p_c$ is defined between $P_G$ and 1.

Another way to measure the sensory distance between the samples is through a parameter taken from signal detection theory called d-prime ($d'$), which is an estimate of the Thurstonian $\delta$. This measure is a direct measure of the perceptual distance between samples, and is expressed as a signal-to-noise ratio for the data set. Here, the proportion of correct answers $P_c$ is directly linked to $d'$ through psychometric functions, each protocol being associated with its own psychometric function. It has recently been shown that $d'$ can be obtained through a generalised linear model (Brockhoff and Christensen, 2010). As opposed to the guessing model, the Thurstonian model considers the protocols and their cognitive aspects, meaning that the $d'$ values obtained through different protocols (involving the same samples) can be directly compared (O’Mahony et al., 1994; O’Mahony and Rousseau, 2003). In other words, a $d'$ of 1 means the same regardless of the protocol used. This is not true for the proportion of discriminators, as some tests may appear to be simpler to perform than others (hence reaching higher proportions of correct answers). But since both $P_D$ and $d'$ depend on the same parameter $P_c$, one can easily convert from one to the other, and show that a $d'$ of 1 for a triangle test corresponds to $P_c = 0.42$ and $P_D = 0.13$, whereas it corresponds to $P_c = 0.49$ and $P_D = 0.24$ for the tetrad test and $P_c = 0.63$ and $P_D = 0.45$ for the 3-AFC (Worch and Delcher, 2013).

Finally, we can express the sensory difference test through the following hypothesis test (Haubo and Christensen, 2011):

$H_0: \begin{cases} p_c \le p_{c0} \\ P_D \le P_{D0} \\ d' \le d'_0 \end{cases} \quad \text{vs.} \quad H_1: \begin{cases} p_c > p_{c0} \\ P_D > P_{D0} \\ d' > d'_0 \end{cases}$  (5.4)

In its usual form, $p_{c0} = P_G$, $P_{D0} = 0$, and $d'_0 = 0$. The p-value of the difference test is the probability of observing a number of correct answers that is as large as or larger than that observed, under the null hypothesis defined through $p_{c0}$. Similarly, the sensory similarity test is defined through the following hypothesis test:

$H_0: \begin{cases} p_c \ge p_{c0} \\ P_D \ge P_{D0} \\ d' \ge d'_0 \end{cases} \quad \text{vs.} \quad H_1: \begin{cases} p_c < p_{c0} \\ P_D < P_{D0} \\ d' < d'_0 \end{cases}$  (5.5)

Here, $p_{c0}$, $P_{D0}$, and $d'_0$ should be defined appropriately based on the user’s expertise. The p-value of the similarity test is the probability of observing a number of correct answers that is as small as or smaller than that observed, under the null hypothesis defined through $p_{c0}$.

Discrimination tests are fast and easy to perform - especially from an assessor perspective. The analysis is simple and, in most cases, the interpretation of the results is straightforward. However, the method only allows a few products to be evaluated at the same time, and interpreting the reasons why products differ (particularly with unspecified tests) may be difficult. Additionally, for robust hypothesis testing, a relatively large number of assessors is needed.
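The link between $d'$, $P_c$ and $P_D$ described in this section can be checked numerically. The sketch below assumes the standard equal-variance Thurstonian psychometric functions for the m-AFC and triangle protocols, which are not given in the text, and evaluates them with a simple midpoint rule. The function names are hypothetical, and the tetrad function is omitted for brevity.

```python
from math import erf, exp, pi, sqrt

def phi(z):
    """Standard normal density."""
    return exp(-0.5 * z * z) / sqrt(2 * pi)

def Phi(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1 + erf(z / sqrt(2)))

def integrate(f, lo, hi, steps=4000):
    """Composite midpoint rule on [lo, hi]."""
    h = (hi - lo) / steps
    return h * sum(f(lo + (i + 0.5) * h) for i in range(steps))

def pc_m_afc(d_prime, m):
    """Assumed m-AFC psychometric function: the single stronger sample
    must exceed the other m - 1 samples."""
    return integrate(lambda z: phi(z - d_prime) * Phi(z) ** (m - 1),
                     -10, 10 + d_prime)

def pc_triangle(d_prime):
    """Assumed triangle-test psychometric function."""
    a = sqrt(3)
    b = d_prime * sqrt(2 / 3)
    return 2 * integrate(
        lambda z: (Phi(-a * z + b) + Phi(-a * z - b)) * phi(z), 0, 10)

def discriminators(pc, p_guess):
    """Proportion of discriminators, Equation 5.3."""
    return (pc - p_guess) / (1 - p_guess)

for name, pc in [("triangle", pc_triangle(1.0)), ("3-AFC", pc_m_afc(1.0, 3))]:
    print(name, round(pc, 2), round(discriminators(pc, 1 / 3), 2))
```

For a $d'$ of 1 this reproduces, to two decimals, the triangle and 3-AFC values quoted from Worch and Delcher (2013), and both functions return the guessing probability of 1/3 when $d'$ is 0.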

5.2.2

Paired comparison of multiple systems based on a sensory attribute

One of the more common test methods encountered in the audio industry is the paired/multiple comparison test between two or more systems. This test type is typically applied to answer questions such as: How do my products compare on a given sensory attribute? Which one is the most/least preferred? These questions can be answered in two different ways: either by considering products in pairs (multiple paired comparisons), or by considering the entire set of products at once (ranking analysis). When having two or more products this is an intuitive approach, where the assessor either makes a discrete choice in a paired comparison, or ranks the products according to a given attribute. Rather than asking assessors to score the products on the attribute of interest, the assessors are asked to select the sample with the most/least intensity, or to rank the entire set of products based on the sensory attribute of interest.

Independent variables: The main constraint for the experimental design for paired comparison or ranking of systems is experiment size. Paired comparison tests quickly increase in size as more independent variables are included. Time constraint/experiment size is the main limiting factor, but multiple pairs of systems can be tested within the same experiment if time limits allow. Paired comparisons and ranking are appropriate for testing real devices, and are often applied in field tests or consumer test settings.

Assessors: Paired comparisons and ranking tests can be applied with consumers, naïve/untrained listeners, as well as trained and expert assessors. The methods are simple in terms of test instructions and are easy to carry out from an assessor perspective. The number of assessors needed for an experiment depends on the differences between the systems under test, the level of training of the assessors, and the intended purpose of the test. Since ranking is an easy task, it can be performed by any type of assessor. However, as experts are more sensitive, the data collected may be of better quality, meaning that a smaller panel size may be sufficient. When performed with untrained assessors, the sample size may depend on the number and complexity of the products tested, yet a sample size of about 60 assessors is often sufficient (ACTIA, 2001).

Presentation of stimuli: Discrimination tests are typically performed using sound files presented over test software. Real device testing is possible but sets strict requirements for blinding test subjects. In order to avoid loudness bias (unless loudness is a part of the test), calibration procedures should be followed.
When comparing multiple systems in one test we can apply one of two approaches:

Comparison in pairs or multiple paired comparison, as presented in Figure 5.3(a), consists of presenting two systems simultaneously from the total set of systems to each subject and asking them to pick the one which has the most of a certain attribute (or in affective tests, the one they prefer). Although the individual test itself involves only two systems at a time, the overall procedure may involve more than two systems. In that case, all the possible pairs should be presented to the panel. As the number of products increases, the number of potential pairs increases: for a test involving 10 systems, 45 paired comparisons of systems exist. In this case, it may not be possible to ask the assessors to evaluate all the existing pairs. The sample size is directly linked to the number of products evaluated: when many pairs need to be evaluated, a fractional factorial or incomplete design may be used, taking care that there remains sufficient statistical power in the experiment to test your hypotheses. In situations where paired comparisons are performed based on preference, the assessors may have the option not to choose a sample and to consider them as equal. These equal answers (or no-preference answers) then need to be treated in a specific way, as will be explained below.

Multiple comparisons/comparison of the entire set of systems at once (ranking analysis) is illustrated in Figure 5.3(b). In a ranking task, each assessor is presented with the entire set of systems under test, and asked to rank them on the dependent variable of interest. The dependent variable can either be a sensory attribute or liking/preference [7].

[7] Care should be taken with ranking to ensure that the polarity of the scale is correct, i.e. does a rank of 1 relate to the highest sensory strength (e.g. loud) or vice versa. This should be carefully considered for liking/preference based ranking during the experimental design, preparation of assessor instruction and analysis/interpretation.

Figure 5.3: Example of two multiple comparison approaches: (a) multiple paired comparison, (b) multiple comparison ranking. The grey text is not shown to the subject, but indicates the systems under test.

Due to the procedure of the test, the entire set of products is provided simultaneously to the assessors. For this reason, the test should be limited to the number of products the subject is comfortable with and capable of handling, especially since this method does not accept incomplete designs. Note that in this section we only describe multiple comparisons where data is collected on an ordinal scale of measurement, meaning a discrete choice, ranking or sorting of the presented products.

The analysis of a paired comparison is straightforward: for each pair of products, count the number of times each sample “won the duel” and test this proportion against a probability of 0.5 using a binomial test. In the case of multiple paired comparison, the analysis can be performed on each pair separately. However, one could consider a more specific model such as the Bradley-Terry-Luce (BTL) model (Causeur and Husson, 2005). This model is based on a logistic regression, and aims to define, for each sample, a score called ability that reflects the relative distance between products: the higher the coefficient, the more this sample tends to be selected over the other sample (Brard and Lê, 2016). For paired preference tests involving the no-preference option, two alternative analyses can be performed. The no-preference answers may be discarded, and the analysis is then performed on the subset of the panel that expressed a preference. Although this solution works, it alters the data, as the no-preference answers are completely ignored and data is discarded. When more than two products are involved, the analysis will provide a score for each of the products (through the regression coefficients called abilities) that could be used in further analysis. However, tests involving many products may become difficult to run, as a very large number of pairs needs to be tested.

In rank analysis, the primary statistics of relevance are the sum of ranks, or the mean rank. The significance of the differences between products in terms of rank is then obtained through the Friedman test, which is a non-parametric alternative to the one-way ANOVA with repeated measures.

The (multiple) paired comparison is a method that is fast and easy to perform, and that quickly answers simple questions such as: Which sample is the most intense for a certain attribute, or in terms of preference? The task is easy and fast to perform, especially if the number of attributes is limited.
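The rank analysis described earlier in this section can be sketched as follows: rank sums are computed per product and combined into the Friedman statistic. This is a minimal illustration assuming complete rankings without ties. The function name and the example data are hypothetical.

```python
def friedman_statistic(ranks):
    """Rank sums and Friedman statistic.

    ranks: one list per assessor holding the ranks 1..k given to the
    k products (complete rankings, no ties assumed).
    """
    n = len(ranks)        # number of assessors
    k = len(ranks[0])     # number of products
    rank_sums = [sum(r[j] for r in ranks) for j in range(k)]
    q = 12 / (n * k * (k + 1)) * sum(s * s for s in rank_sums) - 3 * n * (k + 1)
    return rank_sums, q

# Hypothetical ranking of three systems by four assessors.
example = [[1, 2, 3], [2, 1, 3], [1, 3, 2], [1, 2, 3]]
rank_sums, q = friedman_statistic(example)
print(rank_sums, q)
```

The statistic is then compared with a chi-square distribution with k − 1 degrees of freedom. For three products the 5 % critical value is approximately 5.99, so the example value of 4.5 would not be significant. With so few assessors the chi-square approximation is itself crude, which is one more reason to treat this as a sketch.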

5.3 INTEGRATIVE METHODS

Methods measuring preference or affect are extremely important in finding the “best” system under test. The notion of preference is easily understood when faced with a stimulus - either liking or disliking it - and it is well accepted that virtually all objects can be assessed with a certain valence (Osgood et al., 1957). Ultimately affect or preference is of interest because we assume a direct cause and effect relationship between preference and (purchase) behaviour. This also bridges preference into the affective domain, where brand perception/attitude, packaging, design, product expectations etc. can be extremely important independent variables of an experiment. For most of this section we assume that the test is performed in a single or double blind paradigm. However, we do encourage future research efforts to include factors like brand, packaging, and design as independent variables to get a stronger understanding of overall consumer purchase behaviour, employing un-blinded tests where relevant. To what degree these factors should be included in an experiment is of course dependent on the intent of the study. Selecting between codecs for a wireless audio transfer is potentially less complicated than understanding the consumer perception of sound quality for a well-known and branded headphone.

We often see that decisions in research and development are guided by preference for one product over another. This can be seen when bench-marking against competitors or selecting between different vendors’ technical solutions. For this purpose, audio industry related recommendations such as ITU-R Rec. BS.1534-3 (2015) (MUSHRA) or ITU-T Rec. P.800 (1996) are often favoured, with a focus upon global measures of audio quality. As with any test, the potential of preference/affective methods for extrapolation to a larger population is limited by the sampling we use in the data collection. Additionally, any researcher should strive toward increasing a test’s construct validity by focusing on (1) increasing the likelihood that the measured dependent variable carries some value for future predictions about the stimuli, and (2) ensuring that test items are from a relevant (similar) context or category (Cronbach and Meehl, 1955).

5.3.1

Preference and affective methods

In the domain of sensory evaluation, affective scaling, or scaling of preference, has long been of interest. One of the earliest and most influential preference scales, the 9-point hedonic scale (Jones et al., 1955), is still applied, even after more than half a century. It has since been supplemented with other scales that each have their strengths and weaknesses. In this section we will consider and describe these multiple scales as one method family. As previously noted, changing a scale does not necessarily mean that we are changing method, and the elements shared between the most commonly encountered preference scales in sensory evaluation far surpass the visual and perceptual differences between scales. Common preference or affective methods are applied when we as researchers are interested in the consumer’s preference for one system, or tendency to favour one system under test over other systems under test. These methods provide the right tool for answering which system under test is most preferred by the consumers under the given test conditions.

Independent variables: Since we typically only apply one dependent variable in a preference experiment, it may be possible to include multiple independent variables in an experiment, with the systems under test typically being of most interest. As mentioned in the introduction to the chapter, one can additionally decide to look at potentially purchase-driving variables such as brand and visual design, as well as more traditional audio-relevant variables like SNR, source material etc.

Assessors: Because affective methods typically denote a general interest in a subpopulation’s preference, it is recommended that representative consumers are sampled when these common affective methods are applied. It is commonly accepted that at least 60 consumers from the representative user population are needed to obtain repeatable results (see ACTIA (2001)).

Presentation of stimuli: For affective methods both real device testing and sound file-based experiments are applicable. Since factors such as loudness, etc. have been shown to bias preference/quality ratings, it is recommended to take this into consideration when setting up the experiment. In addition, similar transducer types and models should be used when presenting sound files, to prevent slight timbre differences, or the fit/comfort of e.g. the headphone, from influencing ratings. As we can see in Figure 5.4, the scales applied for collecting ratings show similar characteristics, allowing for interval scaling of preference according to specific verbal labels.
From a practical perspective we can ask the subject either to focus on rating the relative difference between the systems under test (appropriate for the Hybrid Hedonic Scale and 9-point hedonic scale), or to rate the stimuli according to a more absolute point of reference, as applied with the LAM and 9-point hedonic scale. Deciding between a relative or absolute focus might also influence whether we apply a multiple stimulus comparison presentation or a single stimulus presentation. It should be noted that the continuum between relative and absolute ratings is influenced by more than stimulus presentation (see e.g. Zwislocki (1983)). Selecting the appropriate scale for your experiment depends on the purpose of the experiment, as well as the range of quality of the systems under test. Running a pilot study and exploring the scale usage may provide the needed feedback for selecting the best scale for the purpose. Below is a short description of some of the more commonly encountered affective scales (see also Figure 5.4):

9-point hedonic scale: The 9-point hedonic scale is one of the first attempts to collect liking data in a structured manner. It was originally proposed by Jones et al. (1955), and is still widely applied in consumer and sensory science. It applies a 9-point categorical scale (typically presented using radio buttons), ranging from Dislike extremely to Like extremely.

Labelled Affective Magnitude (LAM) scale: The LAM scale (Schutz and Cardello, 2001) differs from the 9-point hedonic scale in that it applies absolute verbal endpoints as well as a quasi-logarithmic spacing of the verbal labels, approximating the true semantic distance between the labels and approaching ratio-scale characteristics. One of the disadvantages of the LAM is that it may not be appropriate for “zooming” in on systems under test that span a small range, as it may not provide sufficient power to separate them statistically.


Figure 5.4 Overview of common affective scales applied in sensory evaluation: (a) the 9-point hedonic scale (Jones et al., 1955) and (b) the Labelled Hedonic Scale (LHS, Schutz and Cardello (2001)).

Labelled Hedonic Scale (LHS): The LHS to a large degree resembles the LAM scale, but is a continuous scale, rather than the categorical LAM scale. Visually, the spacing of the scale labels varies a bit, as do the endpoint labels. The LHS provides ratio-scale data and a meaningful semantic context for the ratings (Lim et al., 2009).

Hybrid Hedonic Scale (HHS): The Hybrid Hedonic Scale applies only three verbal labels, similar to the end and mid-point(s) of the 9-point hedonic scale. The scale is continuous with equidistant markings. This structure gives a more relative scale that has been shown to give good resolution in testing (Villanueva and Da Silva, 2009).

Data from the affective scales presented in Figure 5.4 generally allow for parametric statistical analysis, including Analysis of Variance (ANOVA) and other appropriate statistical tests. For more guidance refer to Chapter 6. It is important to remember that data from these common affective methods may also serve as preference data to be used as input to mixed methods for preference mapping, etc.
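The summary statistics that typically accompany such a parametric analysis (a mean and a 95 % confidence interval per system) can be sketched as follows. This is a minimal illustration only: the ratings, the system names and the function name `mean_with_ci` are hypothetical, the data are shortened for readability, and the interval uses a normal approximation, which is only reasonable for larger panels; a t-based interval or an ANOVA as discussed in Chapter 6 would be the more rigorous choice.

```python
from statistics import NormalDist, mean, stdev


def mean_with_ci(ratings, confidence=0.95):
    """Return (mean, half_width) of a confidence interval.

    Assumption: a normal approximation is used for the interval, which
    is a simplification chosen for this sketch.
    """
    n = len(ratings)
    if n < 2:
        raise ValueError("need at least two ratings")
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    half_width = z * stdev(ratings) / n ** 0.5
    return mean(ratings), half_width


# Hypothetical 9-point hedonic ratings
# (1 = Dislike extremely, 9 = Like extremely).
ratings_by_system = {
    "system_A": [7, 8, 6, 7, 9, 7, 8, 6],
    "system_B": [4, 5, 3, 5, 4, 6, 4, 5],
}

summary = {name: mean_with_ci(r) for name, r in ratings_by_system.items()}
for name, (m, hw) in summary.items():
    print(f"{name}: mean = {m:.2f} +/- {hw:.2f}")
```

Non-overlapping intervals suggest, but do not formally establish, a difference between two systems; a proper test should follow the experimental design.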

5.3.2 Recommendation ITU-T P.800 (ACR, CCR, DCR)

The ITU-T Rec. P.800 (1996) methodologies originate from the telecommunications industry, and have a key role in bench-marking mainly transmission technologies (ITU-T, 2011). The recommendation describes several methods, including: the Absolute Category Rating (ACR), Comparison Category Rating (CCR), and Degradation Category Rating (DCR). Because the ITU-T P.800 was developed for applications in telecommunications, it is not that commonly applied to the evaluation of other audio technologies. However, there is nothing inherently exclusive about the methods, and they can be used more widely when needed.

The methods are used for answering test questions related to the overall “quality” performance of systems under test, and are often applied for technology evaluation or selection within standards bodies and industry. For a comprehensive introduction to the methods we refer to ITU-T Rec. P.800 (1996) or ITU-T (2011). While ITU-T P.800 includes descriptions of “conversational tests”, we will only give an introduction to the listening-only tests in the recommendation, and refer to Chapter 8 for examples of conversational tests.

Independent variables: Typical independent variables included in an ITU-T P.800 test are: the conditions under test with additional reference conditions (e.g. Intermediate Reference System (IRS, ITU-T Rec. P.48 (1988)), modified IRS Annex D (ITU-T Rec. P.830, 1996) or Modulated Noise Reference Unit (MNRU, ITU-T Rec. P.810 (1996)) conditions), language, talkers, and bandwidth. Typically 36 or 48 conditions (see footnote 8) are employed.

Assessors: ITU-T P.800 specifically requires the use of naïve listeners. This means that the test subjects must not have participated in any subjective test in the 6 months leading up to the experiment, nor be experts in the technological implementations under test. They should also be familiar with using telephones. Whilst not specifically mentioned in the recommendation, 32 listeners are typically employed in ITU-T P.800 tests.

Presentation of stimuli: ITU-T P.800 tests are performed on recordings of stimuli, often comprising 3 male and 3 female speech samples. Procedures for sound recording and reproduction are specified in the recommendation. Sound levels should be measured during the experiment and documented. The recommendation proposes a fixed sequence of stimulus presentation, without allowing the assessors to switch freely between stimuli at will. Research by Villegas et al. (2016) has shown that free-switching can reduce the duration of the test.
The three Listening Only Test (LOT) methods described in ITU-T P.800 vary greatly, and one could argue that the DCR variant is not specifically an affective test. In order to provide an easy overview of all variants, they are all included in this section.

The ACR test is the simplest and most commonly applied variant of ITU-T P.800. In the ACR method one sound is presented at a time (serial monadic presentation), and rated on either listening quality, listening effort, or loudness preference, with listening quality being the most prevalent. The categorical scale ranges from 1–5 and uses verbal labels ranging from Bad to Excellent (see Figure 5.5(a)). Average ratings collected using ACR on listening quality are referred to as Mean Opinion Scores (MOS).

A CCR test compares the systems under test to a reference sample, as a pseudo paired comparison, and asks the subject to rate the quality of the first (A) sample compared to the second (B). Randomisation is applied so that A and B may contain either the reference case or the system(s) under test. It uses a bi-directional categorical scale (see Figure 5.5(c)) ranging from -3 to 3, allowing the system under test to be either better or worse than the reference sample. This is an important feature of the CCR test, because it allows us to test systems that can outperform previous reference systems. The average panel ratings from a CCR test are referred to as the Comparison Mean Opinion Score (CMOS).

DCR applies a pseudo paired comparison paradigm, where the reference condition is disclosed and the subject is asked to rate the degradation relative to the reference. This assumes the test conditions perform similarly to or worse than the reference condition. In a DCR test the listener is asked to scale the level of degradation of the systems under test compared to the reference, using a 5-point degradation scale (see Figure 5.5(b)). The DCR is suitable for measuring small degradations of systems under test (compared to the reference), and average panel ratings are termed the Degradation Mean Opinion Score (DMOS).

For both the ACR and DCR, conventional parametric statistical analysis may be applied. However, it should be observed that many ITU-T P.800 LOTs are performed without the use of a factorial independent variable design. Many such tests are performed with 36 or 48 conditions that may be widely varying in characteristics. This can be well justified, as it allows for the very efficient evaluation of many characteristics, but it does not allow for the full range of parametric statistical analyses that depend on a factorial experimental design (e.g. analysis of variance). In such cases, the analysis may comprise the calculation of means and 95 % confidence intervals (averaged over the talker samples), with t-tests to establish the significance of differences between means. Guidance on common practices is outlined in ITU-T (2011). The CCR method, on the other hand, does require careful data conversion, since the scale is bi-directional relative to the hidden reference. Once the data are transformed according to the randomisation, parametric tests may be applied.

Figure 5.5 Overview of ITU-T P.800 LOT scales: (a) Absolute Category Rating (ACR), (b) Degradation Category Rating (DCR) and (c) Comparison Category Rating (CCR).

8 Particular to the ITU-T Recommendations, conditions refer to combinations of test parameters. The MNRU conditions are used as a frame of reference and to stabilise the scale usage across experiments. Very often these are not a factorial (full or fractional) combination of independent variables and levels. Conditions may comprise codec, bitrates, DTX, and other technical parameters.
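The score calculations described above can be illustrated with a short sketch. It is a simplified illustration under stated assumptions, not the procedure of the recommendation: the votes are invented, the function name `to_test_vs_reference` and the flag `test_in_rated_position` are hypothetical, and the CCR conversion simply negates a vote whenever the randomisation placed the reference, rather than the system under test, in the rated position.

```python
from statistics import mean

# Hypothetical ACR listening-quality votes (1 = Bad ... 5 = Excellent).
acr_votes = [4, 5, 3, 4, 4]
mos = mean(acr_votes)  # Mean Opinion Score


def to_test_vs_reference(score, test_in_rated_position):
    """Express a CCR vote (-3..+3) as 'system under test vs reference'.

    Assumption: when the reference occupied the rated position, the sign
    of the vote is flipped so that positive always means 'test is better'.
    """
    return score if test_in_rated_position else -score


# Hypothetical CCR votes: (raw score, was the test system in the rated position?)
ccr_votes = [(2, True), (-1, False), (1, True), (-2, False), (0, True)]
converted = [to_test_vs_reference(s, flag) for s, flag in ccr_votes]
cmos = mean(converted)  # Comparison Mean Opinion Score

print("MOS:", mos)
print("Converted CCR votes:", converted)
print("CMOS:", cmos)
```

Only after this conversion do the CCR votes share a common direction, which is why the transformation has to follow the randomisation log of the experiment.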

5.3.3 Recommendation ITU-R BS.1534 (MUSHRA)

ITU-R BS.1534 is a recommendation published by the International Telecommunication Union - Radiocommunication Sector (ITU-R Rec. BS.1534-3, 2015) and is one of the most employed methods in our field. Originally introduced in EBU Tech. 3286 (1997) and Soulodre and Lavoie (1999), and published as an ITU-R recommendation in 2001, it is in its third revision at the time of this publication. The method is commonly known as MUSHRA (Multiple Stimuli with Hidden Reference and Anchors), and as the name indicates it applies reference and anchor systems to anchor scale usage between tests. Whilst many variations of this paradigm are seen in practical applications, it is important to note that the name ITU-R BS.1534-3/MUSHRA should only be used when the specifications in the recommendation are followed closely.

MUSHRA is a double-blind multiple stimulus test method that includes a shown and a hidden standard reference system (full bandwidth PCM file) and at least two anchor systems: a mid anchor (7 kHz low-pass filtered) and a low anchor (3.5 kHz low-pass filtered). Additional anchors may be included. The third revision makes provision for testing advanced sound systems, i.e. as defined in ITU-R Rec. BS.2051-1 (2017), and gives additional guidance on both parametric and non-parametric statistical analysis. This test method is typically applied to audio codecs and similar technology, in cases where these systems are considered to be of intermediate quality. For very high quality audio systems (e.g. high bit rate audio codecs, watermarking technology, etc.), with close to transparent characteristics, ITU-R Rec. BS.1116-3 (2015) is recommended.

Independent variables: It is recommended to keep the number of systems under test ≤ 12, as each trial also includes a hidden reference and two anchors, totalling a maximum of 15 stimuli per trial presented for rating (see Figure 5.6). If this number is exceeded, a well-documented blocked design should be applied. To ensure that systems under test are tested with an appropriate selection of samples/source material, it is recommended to apply a selection of typical and critical test material (i.e. critical broadcast material for testing broadcast technologies). The number of samples for a test should be around 1.5 times the number of systems under test.

Assessors: ITU-R Rec. BS.1534-3 (2015) prescribes ‘experienced’ listeners as assessors.
Assessors should have normal hearing and are preferably selected based on both pre- and post-screening of their ability to discriminate between systems under test as well as to rate these systems reliably. For more information regarding assessor screening for the MUSHRA test we refer to the recommendation and to ITU-R Rep. BS.2300 (2014). When experimental conditions are highly controlled, and assessors are pre- and post-screened, no more than 20 listeners are often sufficient.

Presentation of stimuli: Overall, the sound reproduction described in ITU-R BS.1116-3 should be used. This includes using reference monitor loudspeakers or headphones, with all assessors using the same type of transducer throughout the test. Individual adjustment of sound level should only be allowed within ± 4 dB relative to the reference level. It is recommended that a MUSHRA test is preceded by a training phase. The purpose of the training phase is to familiarise the assessors with the systems and samples under evaluation (the full range of impairments), as well as with the method/tool for collecting ratings. As already mentioned, MUSHRA is a double-blind multiple stimulus method. This, of course, means that neither the researcher nor the assessor is aware of how the randomisation is assigned for each test.

In a MUSHRA test the assessors are asked to rate the stimuli on the Continuous Quality Scale (CQS). The CQS consists of identical graphical scales (typically 10 cm long or more), ranging from 0–100, which are divided into five equal intervals with the verbal anchor labels placed as shown in Figure 5.6. The verbal labels range from Bad to Excellent. The full bandwidth reference is presented on the left side of the GUI. In addition to the declared reference, a hidden reference as well as two band-limited hidden anchor systems are presented and rated in each trial. The presentation order of the hidden reference and anchors is randomised along with the systems under test for each trial.

The dependent variable, or attribute, can vary in MUSHRA, but should always include Basic Audio Quality (BAQ). In practice BAQ is typically the only dependent variable applied. However, it is worth noting that the recommendation offers other possibilities, presented below:

Basic Audio Quality: This single, global attribute is used to judge any and all detected differences between the reference and the object.

Stereophonic image quality: This attribute is related to differences between the reference and the object in terms of sound image locations and sensations of depth and reality of the audio event. Although some studies have shown that stereophonic image quality can be impaired, sufficient research has not yet been done to indicate whether a separate rating for stereophonic image quality as distinct from Basic Audio Quality is warranted.

Timbral quality: This attribute has been found to be of particular significance. The attribute of timbral quality may be described by two sets of properties:
• The first set of timbral properties is related to the sound colour, e.g. brightness, tone colour, coloration, clarity, hardness, equalisation, or richness;
• The second set of timbral properties is related to the sound homogeneity, e.g. stability, sharpness, realism, fidelity and dynamics.
These properties may be descriptive of the timbre of the sound, but may also be descriptive of other characteristics of the sound.

Localisation quality: This attribute is related to the localisation of all directional sound sources. It includes stereophonic image quality and loss of definition. This attribute can be separated into horizontal localisation quality, vertical localisation quality and distant localisation quality. In the case of a test with an accompanying picture, these attributes can also be separated into localisation quality on the display and localisation quality around the listener.
As we can see, all attributes refer to quality, allowing for scaling using the CQS, in accordance with the general integrative nature of the method. For an overview of recommended analysis approaches for the MUSHRA methodology we refer to the recommendation (ITU-R Rec. BS.1534-3, 2015), which provides good guidance; more general guidance can be found in Chapter 6.
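The trial structure described in this section (up to 12 systems under test plus a hidden reference and two anchors, presented in randomised order) can be sketched as follows. This is a hypothetical illustration rather than an implementation of the recommendation: the function names `build_trial` and `recommended_sample_count`, the stimulus labels and the seed are assumptions, and rounding the "around 1.5 times" guideline upwards is a choice made for this sketch.

```python
import math
import random

MAX_SYSTEMS = 12  # upper limit on systems under test per trial, as stated in the text
HIDDEN_ITEMS = ["hidden_reference", "mid_anchor_7kHz", "low_anchor_3.5kHz"]


def build_trial(systems, rng):
    """Return the randomised list of stimuli to be rated in one trial."""
    if len(systems) > MAX_SYSTEMS:
        raise ValueError("more than 12 systems: consider a blocked design")
    stimuli = list(systems) + HIDDEN_ITEMS
    rng.shuffle(stimuli)  # order is randomised for each trial
    return stimuli


def recommended_sample_count(n_systems):
    """Number of samples at roughly 1.5 times the number of systems."""
    return math.ceil(1.5 * n_systems)


systems = [f"codec_{i}" for i in range(1, 13)]  # hypothetical system labels
trial = build_trial(systems, random.Random(1))
print(len(trial), "stimuli in this trial")
print(recommended_sample_count(len(systems)), "samples suggested")
```

In a double-blind setup the mapping from on-screen slider to stimulus would be logged by the test software rather than shown to the experimenter or the assessor.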

5.3.4 Recommendation ITU-R BS.1116

ITU-R Rec. BS.1116-3 (2015) is a method intended for assessing systems with impairments so small that they are undetectable without rigorous control of the experimental setup and thorough statistical analysis. Typical applications in the audio industry are the testing of close to transparent codecs (high bitrate), watermarking technologies, and other technologies that introduce small audible impairments to the audio signal in comparison to a reference. ITU-R BS.1116 was originally put into effect in 1994 and is currently in its third revision, published in 2015. ITU-R BS.1116-3 is commonly recognised as one of the most sensitive methods in the audio industry and it is recommended specifically for the evaluation of small impairments. In cases where clearly audible impairments exist, the method is over-sensitive and ITU-R BS.1534 should be considered as a less laborious and potentially more appropriate method.

Independent variables: Because ITU-R BS.1116-3 is typically applied to close to transparent systems, the actual process of grading can be time consuming and exhausting for the listeners. This means that care should be taken to keep the experiment within a reasonable size, which will be a limiting factor as to how many systems


Figure 5.6 A typical GUI for collecting data for an ITU-R BS.1534 (MUSHRA) test, using SenseLabOnline. Courtesy of FORCE Technology, SenseLab.

can be included in a test. The original samples applied in the test, also known as programme material in the recommendation, are a crucial element of the method. Only samples that stress the systems under test, revealing the artefacts they typically display, should be applied. These are referred to as critical material, which the recommendation defines as “. . . typical broadcast programme . . . stresses the systems under test”. In addition, this should be considered for each of the systems under test, ensuring equally critical samples across technologies. Music as well as other broadcast-type material is permitted, while synthetic signals are not. The recommendation suggests that the number of samples in a test is 1.5 times the number of systems under test, with a minimum of 5. In practice, more samples are often used.

Assessors: ITU-R BS.1116-3 strictly uses “expert listeners”, which means that the listeners should have expertise in detecting the specific impairments that characterise the systems under test. The requirements for the listeners increase as the transparency, and hence the difficulty of the test, increases. Pre-screening can be based on knowledge about the listeners’ ability to detect system differences from previous tests, or on analysis of pre-tests and training. Additionally, strict post-screening procedures for subjects are described in the recommendation and should be performed before including a listener’s data in the analysis and reporting of results. The post-screening is based upon the correct identification of the system under test - a criterion that becomes increasingly strict with transparent technologies. The qualification is assessed by performing a one-sided t-test on the algebraic difference between the two ratings for each trial (incl. all samples in the experiment).
If the listener has correctly identified the system under test, the t-test will reject the null hypothesis that the listener, across samples, was not able to detect the system(s) under test. Twenty post-screened listeners provide enough data for this method, but as some may be excluded during the post-screening, more listeners might be needed to reach an adequate number.

Presentation of stimuli: The sound playback specifications described in ITU-R BS.1116-3 are very detailed, and can be applied in specifying critical listening rooms. From room size, reverberation characteristics and calibration to speaker setup, the recommendation can be seen as an attempt to provide a best practice setup for critical audio testing. In addition to testing over speakers, the method can also be applied with high quality headphones if relevant for the systems under test. We recommend reading the recommendation for further details regarding the sound reproduction setup. A summary of listening room characteristics is provided in Bech and Zacharov (2006), and the configuration of advanced sound systems can be found in ITU-R Rec. BS.2051-1 (2017).

ITU-R BS.1116-3 applies a double blind triple stimulus with hidden reference paradigm, as illustrated in Figure 5.7. During a test the listener is in complete control of the sound playback and grading pace. For each trial the subject is presented with 3 sounds: one being an identified reference, one being a hidden reference, and one the system under test. The hidden reference and the system under test are randomly assigned to one of two scales, and the task is to identify, and rate, the system under test. For each trial, the sample identified as the hidden reference must be given a rating of 5.0 and the other sample must be given a score below 5.0.

The dependent variable, or attribute, can vary in an ITU-R BS.1116-3 test, but should always include Basic Audio Quality (BAQ) - the single global attribute used to judge any and all detected differences between the reference and the system under test. In addition to BAQ, other attributes may be included (see examples in the recommendation).
ITU-R BS.1116-3 employs a continuous 5-point impairment scale, typically at a resolution of one decimal place. The impairment scale focuses on annoyance from 1 to 4, while a rating between 4 and 5 relates to the detection of degradation in the sample. The analysis may be performed on the absolute ratings or on the Difference Grades (DG). The difference grade is the algebraic difference between the hidden reference score and the system score for each trial. The recommendation advises common parametric analysis such as Analysis of Variance (ANOVA) and t-tests, and a summary of results using means with 95 % confidence intervals. The recommendation also provides guidance on data transformation to take into account differences in scale usage - although this approach is not often applied within the industry.
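The difference grade and the one-sided t-test used for post-screening can be sketched for a single listener as follows. This is a minimal sketch under stated assumptions: the grades are invented, the function names are hypothetical, the sign convention (system grade minus hidden reference grade, so that a correct identification gives a negative value) is an assumption of this example, and the critical value is taken from a standard t-table for 9 degrees of freedom at the one-sided 5 % level. The exact criteria should be taken from the recommendation.

```python
from statistics import mean, stdev


def difference_grades(trials):
    """trials: list of (hidden_reference_grade, system_grade) pairs.

    Assumed convention: DG = system grade - hidden reference grade.
    """
    return [system - hidden_ref for hidden_ref, system in trials]


def t_statistic(values):
    """One-sample t statistic of the mean against zero."""
    n = len(values)
    return mean(values) / (stdev(values) / n ** 0.5)


# Hypothetical grades for one listener over ten trials. In two trials the
# listener mistook the hidden reference for the system under test.
trials = [(5.0, 4.2), (5.0, 3.8), (4.6, 5.0), (5.0, 4.0), (5.0, 4.5),
          (5.0, 3.9), (5.0, 4.4), (4.8, 5.0), (5.0, 4.1), (5.0, 3.7)]

dgs = difference_grades(trials)
t_stat = t_statistic(dgs)

T_CRIT = 1.833  # t-table value: one-sided, alpha = 0.05, 9 degrees of freedom
passes_post_screening = t_stat < -T_CRIT

print("mean DG:", round(mean(dgs), 2))
print("t statistic:", round(t_stat, 2))
print("passes post-screening:", passes_post_screening)
```

A listener whose mean difference grade is not significantly below zero has not demonstrated detection of the systems under test, and would be a candidate for exclusion.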

5.4 CONSENSUS VOCABULARY METHODS

Consensus Vocabulary (CV) methods, also known as Descriptive Analysis (DA) methods (see footnote 9), are motivated by the aim of characterising the perceptual nature of systems under test, typically obtaining a magnitude estimate of these characteristics, commonly referred to as attributes. This can be useful to characterise the salient perceptual characteristics of a technology family and to provide a profile of each product under study. Furthermore, during product development, tracking the evolution of product characteristics (e.g. scene depth, localisation accuracy and timbral colouration) may be beneficial when striving towards an optimal solution in terms of perceptual characteristics and other technical constraints (e.g. code complexity, budget, etc.).

9 Also less formally referred to as sensory profiling or attribute-based testing.


Figure 5.7 Example of a typical user interface in an ITU-R BS.1116-3 test, using SenseLabOnline. Courtesy of FORCE Technology, SenseLab. The reference is the unprocessed, full bandwidth sample. Each trial includes a hidden reference and the system under test. The listener is asked to identify and rate the system under test compared to the reference. The sample identified as the hidden reference is rated at 5.0, and the other sample must be rated below 5.0.

Both experienced and inexperienced assessors have been applied.

Assessor type and number

N >15

Experienced assessors. N >16

Multiple stimulus - single attribute paradigm. Unipolar scales.

Applied with both inexperienced and experienced assessors.

Rating on bipolar constructs/attributes.

Single stimulus - multiple attribute paradigm.

Ranking on unipolar scales. N >8 (experienced) N >20 (inexperienced)

Multiple stimulus - single attribute paradigm.

Unipolar scales.

Single stimulus - multiple attribute paradigm.

Presentation and scaling paradigm

HMFA, PCA.

Cluster analysis, PCA.

Non-parametric factor analysis, PCA, GPA.

GPA, HMFA.

Analysis approach

Table 5.2 Overview of four selected individual vocabulary methods, focusing on the main differences in attribute elicitation, scaling, assessor type, and appropriate analysis.

Individual Vocabulary Profiling (Lorho, 2010)

Repertory Grid Technique Kelly (1955a)

Triadic presentation of stimulus driving the attribute development by asking the assessor to describe how two of the stimuli vary from the third (see e.g. Berg and Rumsey (2006)). Can also be used with pairwise presentation of stimulus (e.g. Choisel and Wickelmaier (2006)). Attribute elicitation performed under direct comparison of systems under test. In addition to attributes, attribute definitions are developed, which are useful for later semantic interpretation of the results. In addition to vocabulary development, a training phase can be implemented.

Flash Profile Same as Free-Choice Profiling. (Dairou and Sieffermann, 2002)

Attribute elicitation

Method


vertically (across attributes) as with consensus attribute data, since (1) the same attributes are not consistently used across assessors, and (2) there is no semantic consensus, meaning that even if two assessors are using the same terms, they may not share exactly the same definition of them.

The size of the sample set is similar to other Descriptive Analysis methods. Ideally, for the attribute development stage the stimuli should span a broad range of characteristics - but in practice this may be limited to the products of interest within one study. As a result, for IVP the number of products and stimuli can range from just a few to many tens of conditions. This is particularly true for FCP, since products are rated in monadic sequence, following an experimental design. Since Flash Profile requires ranking the products, the totality of the sample set is usually presented to the assessors: in that case, the number of products may be limited to 8–10 (depending on the type of products) due to fatigue through repetitions. Since neither FCP nor Flash Profile requires training, these methodologies can also be performed with consumers, in which case it is recommended to increase the number of assessors to at least 20. For a comprehensive comparison between FCP, FP and RGT, see Lorho (2010).

Typically, FCP, IVP and Flash Profile data are analysed from a multivariate point of view using analyses that can handle multi-table data. Amongst these analyses, we can mention Generalized Procrustes Analysis (GPA) and (Hierarchical) Multiple Factor Analysis ((H)MFA). Through this analysis, different criteria can be examined:
1. The product space, which highlights the (dis)similarities between products;
2. The semantic interpretation through the variable representation: the correlation between semantically similar attributes is a measure of the consensus between assessors.
FP and FCP are considered fast methods, since no prior training of the panellists is required, while still providing a detailed quantitative profile of the products. However, the specifics of the test (and the data gathered) do not allow the usual hypothesis tests that are used with conventional quantitative tests. IVP is also a relatively fast technique; its additional training and attribute refinement stage should improve the attribute definitions, but takes more time.

5.6 MIXED METHODS - EXPLAINING PREFERENCE

Integrative methods, described earlier in this chapter, provide a means to collect data from an overall or integrative perspective, either from a qualitative (e.g. BAQ) or affective (e.g. preference) perspective. In so doing we are able to establish the highest quality or most preferred products - however, without in-depth analysis possibilities. In comparison, descriptive methods provide a means to quantify and characterise how products are perceived. However, these methods do not provide the integrative perspective that allows us to select winning products. Mixed methods allow you to combine data collected by both integrative and descriptive methods in order to explain one data set from the other - i.e. explain preference. As such they are not purely methodologies, but a hybrid combination of methods for evoking and collecting data, with specific analysis techniques for interpretation. As illustrated in our overview of methods (Figure 5.1), the five mixed methods are in a category of their own, based on the collection of either affective or integrative audio quality data. Of those five methods, three employ consensus vocabulary descriptive data and two use individual vocabulary approaches. Two recent mixed methods, which both employ individual vocabulary techniques, are the Open Profiling of Quality (OPQ, Strohmeier et al. (2010)) and Individual Vocabulary Profile with Preferences by Pairwise Judgements (IVPPPJ). OPQ is described in detail in Chapter 14 in application to audiovisual sensory evaluation. The IVPPPJ method has been developed around the paired comparison approach and is described in detail in Chapter 12 in application to concert hall studies. Lastly, the Multiple Stimulus - Ideal Profile Method (MS-IPM) is described, which uses three data sets (affective, descriptive and ideal ratings) to gain deeper consumer insight.

5.6.1 Audio descriptive analysis and mapping

Audio Descriptive Analysis and Mapping (ADAM) is presented in detail in Zacharov and Koivuniemi (2001b); Koivuniemi and Zacharov (2001); Zacharov and Koivuniemi (2001c) and summarised in Zacharov and Koivuniemi (2001a). It is one of the first applications in audio of a more structured consensus attribute development process, as well as of the multivariate integration of descriptive data with preference ratings as a mixed method. It includes a detailed process running from consensus attribute development and panel training through to External Preference Mapping (EPM) using Partial Least Squares Regression (PLS-R). Originally applied to multichannel spatial sound reproduction, it is a general method applicable to other audio domains as well.

Independent variables: First and foremost the systems under test are of relevance. While one should take care not to include too many systems under test, it is recommended to select a product range with a relevant and wide enough perceptual span to explore the product space, to ensure the generic applicability of the developed consensus attributes. As with any other audio testing methodology, the products should be tested with relevant and critical test samples that span a range of characteristics, where relevant. This may result in rather large experiments, requiring the experiment to be split into multiple test sessions.

Assessors: Naïve assessors (N = 16) were employed for the collection of preference data, to ensure a pure integrative consumer perspective. These assessors were then trained to become expert assessors (N = 12), subsequently employed for the attribute development and rating phases.

Presentation of stimuli: The preference judgements were collected in pseudo paired comparison presentations, where ratings are made relative to a defined reference on a ±10-point rating scale ranging from Extremely prefer the reference (-10), through Prefer neither (0), to Extremely prefer A (+10). Attribute ratings were collected in a single stimulus paradigm with pairs of attributes presented for each trial. All ratings were performed using a double blind presentation.

The analysis of the data collected can be performed in a number of ways due to the large and mixed data set, including common parametric analysis of the preference/overall quality data or multivariate analysis (e.g. PCA) of the attribute data, as discussed in Chapters 6 and 7. However, as with all mixed methods, the primary aim is to understand the drivers of preference (or overall quality). To achieve this, a partial least squares regression approach was used to develop an exploratory and predictive model for spatial sound reproduction based on the experiment. This type of model provides information regarding which salient attributes contribute to the preference scores, and with what weightings. More in-depth analysis of the impact of the different sound samples on the preference for different sound systems is also possible. In the ideal case, the developed model can be used to estimate preference scores from attribute ratings.
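The idea of attribute weightings in a preference model can be illustrated with a minimal sketch. The Python code below computes only the first component of a single-response PLS regression (PLS1) on mean-centred data. The attribute names, ratings and preference scores are invented for illustration and are not taken from the ADAM experiments; a real analysis would normally use several components, cross-validation and a dedicated statistics package.

```python
import math

# Hypothetical mean attribute ratings for five systems (rows) on three
# attributes (columns). All names and numbers are illustrative only.
attributes = ["envelopment", "clarity", "noisiness"]
X = [
    [2.0, 5.0, 1.0],
    [4.0, 5.0, 2.0],
    [6.0, 4.0, 3.0],
    [8.0, 6.0, 2.0],
    [10.0, 5.0, 2.0],
]
# Hypothetical mean preference scores on the -10..+10 scale.
y = [-6.0, -3.0, 0.0, 3.0, 6.0]


def centre_columns(rows):
    """Subtract the column mean from every column."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[v - m for v, m in zip(row, means)] for row in rows]


def pls1_first_component(X, y):
    """Return weights, scores and y-loading of the first PLS1 component."""
    Xc = centre_columns(X)
    y_mean = sum(y) / len(y)
    yc = [v - y_mean for v in y]
    # Weight vector: proportional to the covariance between each
    # attribute and the preference scores, scaled to unit length.
    w = [sum(row[j] * yi for row, yi in zip(Xc, yc)) for j in range(len(Xc[0]))]
    norm = math.sqrt(sum(v * v for v in w))
    w = [v / norm for v in w]
    # Scores: projection of each system onto the weight vector.
    t = [sum(xj * wj for xj, wj in zip(row, w)) for row in Xc]
    # y-loading: regression of the centred preference on the scores.
    q = sum(ti * yi for ti, yi in zip(t, yc)) / sum(ti * ti for ti in t)
    return w, t, q


weights, scores, q = pls1_first_component(X, y)
y_mean = sum(y) / len(y)
# One-component estimate of preference from the attribute ratings.
predicted = [y_mean + q * t for t in scores]
for name, w in zip(attributes, weights):
    print(f"{name}: weight {w:+.3f}")
```

In this invented data set the preference scores follow the first attribute almost exactly, so that attribute receives by far the largest weight. With real data the weights are interpreted alongside model validation statistics rather than in isolation.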

5.6.2 Multiple stimulus - Ideal profile method

The Multiple Stimulus - Ideal Profile Method (MS-IPM) is an evolution of the consumer-based Ideal Profile Method (IPM, Worch (2012); Worch et al. (2013a, 2014)). In addition to collecting integrative (affective or qualitative) and descriptive attribute ratings, a third type of rating is collected, namely the ideal ratings, sometimes referred to as an ideal point. Legarth et al. (2014) developed the approach for application to hearing aids using a panel of expert hearing impaired assessors, details of which are presented in Section 9.11. Other application examples, for the assessment of advanced sound systems (sometimes known as Next Generation Audio (NGA) systems), can be found in Zacharov et al. (2016b); EBU TR 043 (2018); Liebl et al. (2018).

The original IPM (and hence MS-IPM) was inspired by the Just-About-Right (JAR, Moskowitz (1972); Rothman and Parker (2009)) scale method, in which consumers are asked to characterise samples relative to their own ideal. In the JAR case, the notion of the ideal or optimal point is implicit within the 5-point scale for each attribute: much too little (1), too little (2), just about right (3), too much (4), or much too much (5). For both the IPM and MS-IPM methods, the assessors are explicitly asked to provide their ideal rating on a provided scale for each attribute. Using these data, the ideal profile provides an overview of the assessors’ average ideal ratings for each attribute.

In addition to the evaluation of overall quality and the characterisation of the products using attributes, the MS-IPM approach also establishes whether any product is similar to the assessors’ ideal perception or whether there is potentially room for improvement, independent of the current technology and of any technological or physical limitations. Ideally, from the developer’s perspective, it would be desirable to show that the profile of a technology under evaluation is statistically similar to the ideal profile. If this cannot be shown, then the ideal profile provides a route to a detailed analysis of which attribute characteristics to prioritise, ignore, maintain, increase or decrease. This can be of value in understanding the assessors’ quality ratings and/or in guiding the product development process. In this way the JAR, IPM and MS-IPM methods can be used as diagnostic tools with the aim of understanding the strengths or weaknesses of the products under study, either for pure assessment purposes or for product development. This can be done by establishing which attributes need to be improved, estimating their impact on the overall quality scores (also called mean drop or penalty) using, for example, penalty analysis (see Varela and Ares (2014); Pagès et al. (2014)).

Independent variables: In addition to the systems under test, multiple samples may be included in the test. Furthermore, replications of some of the attributes, for certain samples, may be of value as a means of evaluating assessor and attribute performance (e.g. using ITU-R Rep. BS.2300 (2014)).

Assessors: The method employs more than 20 trained or expert assessors, who are familiar with the attributes and the type of stimuli to be assessed.

Presentation of stimuli: A multiple stimulus presentation paradigm is used and stimuli are rated first on a 100-point overall quality scale, labelled bad - poor - fair - good - excellent (see Figure 5.10(a)). In applications where no reference exists, for example, advanced sound system renders, no reference is provided.
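The mean drop used in penalty analysis can be sketched for JAR-type data as follows. This is a simplified illustration under stated assumptions: the ratings are invented, the attribute name is hypothetical, and the two "too little" categories (1 and 2) and the two "too much" categories (4 and 5) are pooled. For IPM or MS-IPM data, the deviation of each rating from the assessor's ideal rating would take the place of the JAR category.

```python
from statistics import mean

# Hypothetical data for one product and one attribute ("bass"):
# JAR ratings on the 5-point scale and overall quality on a 100-point scale.
jar_ratings = [3, 3, 1, 2, 4, 5, 3, 2]
quality = [80, 70, 40, 50, 60, 50, 90, 60]


def penalty_analysis(jar, scores):
    """Mean drop in overall score for the 'too little' and 'too much' groups."""
    groups = {"too_little": [], "jar": [], "too_much": []}
    for rating, score in zip(jar, scores):
        if rating < 3:
            groups["too_little"].append(score)
        elif rating == 3:
            groups["jar"].append(score)
        else:
            groups["too_much"].append(score)
    # Assumes at least one assessor chose "just about right".
    jar_mean = mean(groups["jar"])
    result = {}
    for side in ("too_little", "too_much"):
        if groups[side]:
            result[side] = {
                # Proportion of assessors in this group.
                "share": len(groups[side]) / len(jar),
                # Mean overall score of the JAR group minus that of this group.
                "mean_drop": jar_mean - mean(groups[side]),
            }
    return result


penalties = penalty_analysis(jar_ratings, quality)
for side, stats in penalties.items():
    print(side, stats)
```

A large mean drop combined with a large share of assessors points to an attribute worth prioritising; a large drop observed for only a few assessors is usually given less weight.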

Figure 5.10 Example of a Multiple Stimulus - Ideal Profile Method (MS-IPM) test user interface, using SenseLabOnline: (a) overall subjective quality; (b) attribute and ideal (*) rating phase for Clarity. Courtesy of FORCE Technology, SenseLab.

Pertinent attributes are selected for the characterisation of the system performance by assessors from a lexicon, where possible. It is generally found in audio that 6 to 10 attributes are sufficient to allow assessors to express their perceptions of audio products. A multiple stimulus presentation is also used for each attribute to compare all systems, for a single sample per screen/trial, as illustrated in Figure 5.10(b). The last rating scale in the user interface, labelled with an *, allows the assessor to provide their ideal rating of each attribute.

The data set from an MS-IPM test provides a range of options for analysis. Fundamentally, the analysis of the MS-IPM is similar to that of most mixed methods with consensus attribute data, as presented in Section 5.6.1. The overall quality data can be evaluated using common univariate analysis techniques (means, 95 % confidence intervals, t-tests, ANOVA, etc.). Similarly, univariate analysis can also be applied to the individual attribute data, to understand the experimental factors contributing to the variance in the data. Traditional multivariate analysis (e.g. PCA, MFA, etc.) can be applied to the attribute data to obtain an overview of the salient perceptual characteristics of the products. Additionally, the ideal profile data can be used as supplementary data during the multivariate analysis of the attributes or during the External Preference Mapping, in order to understand the underlying relationship between attributes and the overall quality ratings. The 95 % confidence ellipses (see Husson et al. (2005); Cadoret and Husson (2013); Altman (1978)) are used to establish whether or not products are similar to each other or to the ideal in the multivariate space.

Mixed methods in general represent some of the more involved sensory evaluation methods. However, the effort involved in applying them does come with the reward of being able to deeply analyse and interpret the performance of evaluated products. These methods allow you to answer multiple questions including: What product(s) are preferred or have the highest quality? How is this performance characterised and how far is this from a hypothetical user ideal? For those experimenters interested in studying such questions, the family of mixed methods may be beneficial.
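The question of how far a product lies from the ideal can be sketched numerically. The Python example below compares the mean attribute profile of each system with the mean ideal profile. The system names, attributes and ratings are hypothetical, and the sketch reports raw differences only; whether a difference is statistically meaningful has to be established separately, for example with the confidence ellipses mentioned above.

```python
from statistics import mean

# Hypothetical ratings (0-100) from four assessors on two attributes.
# All names and numbers are illustrative only.
ratings = {
    "System A": {"clarity": [60, 55, 70, 65], "bass strength": [40, 45, 50, 45]},
    "System B": {"clarity": [80, 75, 85, 80], "bass strength": [70, 75, 65, 70]},
}
# Ideal ratings given by the same assessors on the same scales.
ideal = {"clarity": [85, 80, 90, 85], "bass strength": [60, 55, 65, 60]}

# Ideal profile: average ideal rating per attribute.
ideal_profile = {attr: mean(vals) for attr, vals in ideal.items()}


def deviation_from_ideal(product_ratings, ideal_profile):
    """Mean product rating minus mean ideal rating, per attribute."""
    return {
        attr: mean(vals) - ideal_profile[attr]
        for attr, vals in product_ratings.items()
    }


for system, product_ratings in ratings.items():
    print(system, deviation_from_ideal(product_ratings, ideal_profile))
```

A negative deviation indicates that the product is rated below the ideal on that attribute, a positive deviation that it is rated above it.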

5.7 INDIRECT ELICITATION METHODS

The family of indirect descriptive methods can be considered as a way of “seeing the product as a whole”, without any pre-conceptions. Indirect elicitation methods differ from the direct elicitation methods in that they do not rely on scaling intensities using consensus or individual attributes for the systems under test. Instead, indirect approaches employ ranking, sorting or scaling of, e.g., similarities/differences between products. This means that we are not explicitly asking an assessor to rate his/her perception on a given attribute, but instead allowing assessors to express their perception without any pre-conceptions beyond their own experience. Furthermore, this allows the assessor to explore their perception of the product as a whole. The working assumption with all of these methods is that some degree of commonality in assessor perception exists and that this latent multidimensional space can be ascertained through a multivariate analysis11. These methods are generally better applied for exploratory purposes than for formal hypothesis testing.

From a practical perspective this means that the methods can be applied to answer questions such as: What characterises the main differences between my selection of products? What are the main dimensions of interest when looking at a selection of products? What are the main perceptual dimensions describing the differences between my products?

11 And as with all analysis, we can explore whether assessors are generally in agreement or whether subpopulations exist, with different perceptions.

Indirect methods are typically seen as moderately fast to perform from a data collection perspective, because no attributes need to be developed/defined, and subjects can perform the data collection task(s) with ease - typically within one short session. However, the interpretation of the meaning of dimensions is more complex and may introduce a degree of experimenter bias. The indirect approach has also been described as “holistic”12, in reference to a more open approach where the product, as a whole, is being evaluated in the test - i.e. the point of interest being how the assessors perceive the whole product and not individual dimensions or attribute intensities.

In this section we will cover the most commonly encountered indirect descriptive methodologies: Multi-Dimensional Scaling (MDS), Free Sorting and Projective Mapping or Napping. Although products are often described quantitatively (e.g. using descriptive analysis such as QDA), it is also important to describe them qualitatively13, particularly at the exploratory phase of research. Indeed, the qualitative description of products allows us to understand how they compare to each other overall. In other words, it answers the following question: How similar/different are my products compared to the others, and in what sense? To answer this question, products can be evaluated in their entirety, without decomposing them using attributes. This introduces the notion of holistic approaches, in which assessors evaluate the products in their entirety using their own criteria. Besides providing a qualitative description of the products, these approaches also provide valuable information regarding assessors’ insights, as they highlight which product characteristics are the most relevant to them.

These techniques benefit from simplicity and ease of use, and some are even entertaining for the assessors, who may see them as a game. The techniques are generally very easy to grasp and perform, and so are well suited to consumers without dedicated training in sensory methods. However, as a result, moderately large amounts of data need to be collected to obtain clean and repeatable results. Furthermore, these techniques do not always allow for a deep understanding of the products themselves, but can be very valuable at the early, fast, exploratory phase. Multi-dimensional scaling allows assessors to evaluate the similarity of sample pairs on a simple similarity line scale. Both Free Sorting and Projective Mapping/Napping allow assessors to group similar products, the latter methods using a 2-dimensional surface on which the products are organised, with the resulting positions used as input for analysis.

5.7.1 Multi-dimensional scaling

Multi-Dimensional Scaling (MDS) is one of the early techniques developed for the indirect evaluation of latent multidimensional characteristics. Assessors merely evaluate the similarity (or dissimilarity) of pairs of stimuli, the interpretation of which is presented in a low-dimensional space (Borg and Groenen, 2005). Whilst MDS is a general statistical approach for reducing the dimensionality of select types of data, it can also be seen as a method to establish the underlying latent multidimensional structure of our assessors’ perception, in a manner that can be easily interpreted. We will focus on the practical application of the method in collecting and analysing direct proximities from similarity ratings of pairs of systems under test.

12 The term holistic is derived from the Greek word holos which means “whole”. Holism is the idea that all the properties of a given system can not be determined or explained by its component parts alone. Instead, the system as a whole determines in an important way how the parts behave (source: http://en.wikipedia.org/wiki/Holism).
13 The term qualitative is derived from the Latin word qualis which means “as such”.

Independent variables: Classical MDS is limited to the comparison of products/systems. In the case of INdividual Differences SCALing (INDSCAL, Carroll and Chang (1970)), this can be extended to include the assessor as an independent variable, allowing for analysis of the assessors’ impact on the multi-dimensional space (see for example Martens and Zacharov (2000)).

Assessors: The task of expressing perceived difference/similarity between a pair of stimuli is straightforward, and both consumers and trained assessors can perform the task. The required number of assessors is somewhat dependent on the number of stimuli being evaluated and also on the expected dimensionality of the analysis - a matter discussed in Rodgers (1991), where N > 15 is suggested.

Presentation of stimuli: A paired comparison presentation is performed. As a result, the size of the MDS experiment increases drastically with the number of systems under test, and we recommend limiting the number of systems under test to a maximum of 15. Applications of this method in audio have typically used recordings of stimuli to make for a manageable test setup (e.g. Martens and Zacharov (2000); Mattila (2001b)), but the method can also be scaled to real, physical devices. Typically, the paired comparison matrix will include combinations of the independent variable for one sample. When multiple samples or other conditions are to be tested, you may need to consider these as individual sub-experiments. Cross modal matrices have been successfully applied to study the perceptual proximity of sound samples with textual descriptions, and are discussed in Chapter 13. Each stimulus pair is rated on the dependent variable, “similarity/dissimilarity”, typically using a continuous line scale ranging from similar to dissimilar. All possible combinations of pairs of systems under test are presented to each assessor.

The data collection itself is simple; we present the assessor with pairs of systems under test, and ask them to rate their level of perceived similarity/difference, typically on a 9-point line scale ranging from highly similar to highly dissimilar. In general the MDS method is used to avoid bias, as far as possible, in order to establish the most salient perceptual characteristics for each assessor. As a result, the assessor is typically asked to evaluate the products and compare them, without any further guidance that could cause bias.

A number of variants of MDS analysis algorithms exist, such as classical MDS (CMDS), INDSCAL, ALSCAL, etc. Depending on the data structure of your MDS experiment, you will need to select a suitable MDS algorithm. For example, if you have data from multiple assessors, the INDSCAL algorithm will allow you to analyse each assessor matrix and interpret both the difference between the products (stimulus space) and the difference in scoring between assessors (subject space). Care is needed in establishing the correct dimensionality for the analysis, as this choice may influence the interpretation of results.
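How quickly the experiment grows with the number of systems, and how one assessor's ratings form the input to an MDS algorithm, can be sketched as follows. The system labels and ratings are hypothetical, and the sketch stops at the dissimilarity matrix; the MDS analysis itself is left to a statistics package.

```python
from itertools import combinations


def paired_comparisons(systems):
    """All unordered pairs of distinct systems under test."""
    return list(combinations(systems, 2))


def dissimilarity_matrix(systems, ratings):
    """Symmetric matrix built from one assessor's pairwise ratings."""
    index = {name: i for i, name in enumerate(systems)}
    n = len(systems)
    # A system is taken to have zero dissimilarity with itself.
    matrix = [[0.0] * n for _ in range(n)]
    for (a, b), value in ratings.items():
        matrix[index[a]][index[b]] = value
        matrix[index[b]][index[a]] = value
    return matrix


# Number of pairs each assessor has to rate for a given number of systems.
for n in (5, 10, 15, 20):
    labels = [f"S{i}" for i in range(1, n + 1)]
    print(n, "systems ->", len(paired_comparisons(labels)), "pairs per assessor")

# Hypothetical dissimilarity ratings from one assessor for three systems.
systems = ["A", "B", "C"]
ratings = {("A", "B"): 2.0, ("A", "C"): 8.0, ("B", "C"): 7.0}
matrix = dissimilarity_matrix(systems, ratings)
```

The number of pairs is n(n - 1)/2, which is why the number of systems under test has to be limited in practice. For INDSCAL-type analyses, one such matrix is kept per assessor rather than averaged.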

5.7.2 Free sorting

The Free Sorting task is a methodology that allows assessors to group products. This is done by separating products that are “similar” from products that are “dissimilar”, forming multiple groups of similar products. What constitutes “similarity” is left to the assessor to describe through their grouping. The collection of the different groups produced by each assessor is called a sorting or a partition.

Independent variables: Typically with sorting, the systems/products under test are considered per experiment. If different samples are to be considered as well, then these may constitute sub-experiments. A minimum of 5 systems under test is required. In addition to the independent variables we normally see in audio research, this method allows the experimenter to study the impact of other consumer-oriented factors such as form factor, design, brand, etc.

Assessors: A minimum of 10 assessors is required for testing. Free Sorting tasks are easily understood by consumers and very appropriate for consumer testing in general.

Presentation of stimuli: Free Sorting can be applied to sound files (employing a GUI), or may be more appropriate for real device testing. In either case all stimuli should be presented simultaneously to enable sorting. To avoid obvious solutions, assessors are asked to create at least 2 groups (all the products cannot be placed in one single group) and no more than P − 1 groups (at least two products should be grouped together), P being the number of products.

In practice, the individual partition generated by each assessor is based on their own criteria. Once the individual partitions are generated, the assessors are asked to qualitatively describe the products by describing the criteria used. This can be done by asking the assessors to name their different groups. Such information is particularly relevant for two reasons:

• From a product point of view, it provides information about why products are considered as similar or different;
• From an assessor point of view, it illustrates which product characteristics are the more relevant.

In a Sorting Task, the assessors are asked to group products based on similarity. The partitions thus obtained at the individual level represent the dependent qualitative variables. The variables are aggregated into a square and symmetric contingency table crossing the products in rows and columns, and counting how many times each pair of products has been grouped together across the partitions. By construction, the diagonal of this similarity matrix is equal to the sample size used for the test (the number of assessors). Outside the diagonal, the frequency $f_{ij}$ ($i \neq j$) represents the number of times sample i has been grouped together with sample j (the larger $f_{ij}$, the more similar samples i and j). Such tables can be analysed using MDS algorithms such as INDSCAL or ALSCAL, resulting in a sample space highlighting the relative positions of the products, two products being close if they were perceived as similar. However, this solution ignores the individual assessments of the samples, as the table used for the analysis is an aggregation across assessors.

Another way to analyse Free Sorting data is to use a factorial approach. In this case, each qualitative variable (column) corresponds to one assessor’s partition of the products, and the rows of the table correspond to the products. This table (P rows × J columns, P being the number of products and J the sample size, i.e. the number of assessors) is then analysed by Multiple Correspondence Analysis (MCA, see Section 7.4.1). Like MDS, the MCA provides a product space highlighting the (dis)similarities between products through their proximities in the overall space. Additionally, the MCA also provides the group representation, which highlights the assessors’ agreement in terms of partition. The individual evaluations can also be compared numerically using the Rand index (Qannari et al., 2014). Examples of advanced Free Sorting, in the form of Sequential Agglomerative Sorting (SAS), are presented in Section 7.4.1.
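The aggregation of individual partitions into a similarity table, and the comparison of two assessors' partitions, can be sketched in a few lines of Python. The products and partitions below are invented for illustration, and the index shown is the plain (unadjusted) Rand index, which may differ from the exact variant used in the cited work.

```python
from itertools import combinations

products = ["A", "B", "C", "D", "E"]

# Hypothetical partitions: each assessor maps every product to a group label.
partitions = [
    {"A": 1, "B": 1, "C": 2, "D": 2, "E": 3},
    {"A": 1, "B": 1, "C": 1, "D": 2, "E": 2},
    {"A": 1, "B": 2, "C": 2, "D": 3, "E": 3},
]


def is_valid_partition(partition, products):
    """At least 2 groups and at most P - 1 groups, P = number of products."""
    n_groups = len(set(partition[p] for p in products))
    return 2 <= n_groups <= len(products) - 1


def co_occurrence(partitions, products):
    """Count how often each pair of products was placed in the same group."""
    n = len(products)
    table = [[0] * n for _ in range(n)]
    for partition in partitions:
        for i, a in enumerate(products):
            for j, b in enumerate(products):
                if partition[a] == partition[b]:
                    table[i][j] += 1
    return table


def rand_index(p1, p2, products):
    """Share of product pairs on which two partitions agree."""
    pairs = list(combinations(products, 2))
    # A pair counts as agreement if both assessors grouped the two products
    # together, or both kept them apart.
    agree = sum((p1[a] == p1[b]) == (p2[a] == p2[b]) for a, b in pairs)
    return agree / len(pairs)


table = co_occurrence(partitions, products)
for name, row in zip(products, table):
    print(name, row)
print("Rand index, assessors 1 and 2:",
      rand_index(partitions[0], partitions[1], products))
```

The diagonal of the resulting table equals the number of assessors, as described above, and the off-diagonal counts can be passed to an MDS routine after conversion to dissimilarities.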

5.7.3 Projective mapping and napping

Projective mapping, which has more recently evolved into the similar so-called Napping® method, is closely related to Free Sorting in that the assessor is given the task of organising the systems under test according to their similarities/differences, without any specification of which dimension to focus on during the assessment. Instead of being sorted into groups, the systems under test are distributed and placed on a real or virtual surface (e.g. a piece of paper or a table cloth14), using the space to indicate the level of difference according to the assessor’s own perception of the main differences/similarities. Napping can either be applied to the “product as a whole”, or focused on one sensory modality at a time. When instructions are given to focus on only one sensory modality at a time, the method is also known as Partial Napping (e.g. Ramsgaard (2009); Dehlholm (2012)).

14 The term “Napping” is derived from the French word “nappe” meaning a table cloth, in relation to the rectangular area used for groups of products under test.

Independent variables: It is recommended that 5 to 15 systems be included in a test. Because the method is relatively fast, it is possible to include multiple samples in a 2-way design.

Assessors: Napping is a method highly suited for consumers (as well as trained assessors). The method is intuitive and easy to conduct from the subject’s perspective (Perrin et al., 2008). A minimum of 20 assessors is advised and, according to Vidal et al. (2014), 50 consumers should be used to obtain a stable interpretation.

Presentation of stimuli: Systems under test can be presented as either real devices or recordings of devices. If real devices are used, the experiment will of course be influenced by design/brand bias.

In Napping® tasks, assessors are instructed to position the products on the surface in such a way that two products placed close to each other are similar, whereas two products placed far from each other are different. The assessor is asked to organise the products on the surface according to their own criteria. Since the assessors are using their own criteria, the assessment of the individual responses provides a way to study the overall importance of the product characteristics. The analysis aims to find the common denominator characteristics used by assessors in organising their products.

Figure 5.11 An example of a Napping task. The assessor is presented with the shapes (left) and asked to organise these according to the perceived similarities/differences, using the whole surface to express these differences. In this case the assessor has used colour hue, size and number of edges to arrange the shapes (right). Each assessor is allowed to sort according to his/her criteria of which similarities/differences are the most important.

Napping data is digitised for each sample by tabulating the X and Y co-ordinates from the sheet/nappe. For practical reasons, the origin (co-ordinate 0,0) is usually set at the bottom left corner. For each assessor, a set of 2 dependent variables (X and Y co-ordinates) is generated. Hence, the complete data set consists of J blocks of P rows × 2 variables (J being the number of assessors and P the number of products). In its original form, the analysis of this multiple table structure obtained from Projective Mapping was performed using Generalized Procrustes Analysis (GPA). Such analysis generates the consensus space, the average sample configuration across assessors. However, the individual differences are omitted.

For Napping the data set is identical to the one obtained from Projective Mapping. However, the analysis matches the particularity of the data collection: in this particular case, Napping data can be analysed using Multiple Factor Analysis (MFA) based on Principal Component Analysis (PCA) performed on the covariance matrix, as it respects the rectangular structure of the sheet of paper. Indeed, the covariance PCA does not standardise the data to unit variance, hence respecting the relative importance of the dimensions of variability used by the assessors when separating the products. Such information is of utmost importance when analysing Napping data, as it highlights the differences in importance of the criteria used by the panellists, and enables comparison of individual configurations.

For both Free Sorting and Projective Mapping/Napping, the labels provided by the assessors for each group are used to describe the product space and improve the interpretation of results. This can be done by generating a frequency table crossing the products in rows and the set of words in columns; in this contingency table, the intersection between row i and column j represents the frequency $F_{ij}$, the number of times word j has been used to characterise sample i. This table can be projected as supplementary in the space obtained by MDS, MCA, GPA or MFA, depending on the type of data and the analysis considered. A sample will be located close to a word it has often been associated with.
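The structure of the digitised Napping data, and the reason for not standardising it, can be made concrete with a small sketch. The assessors, products and co-ordinates below are invented (in cm from the bottom-left corner of a hypothetical sheet); the code only assembles the table and reports the raw variance along each axis, and does not perform the MFA itself.

```python
from statistics import pvariance

products = ["A", "B", "C", "D"]

# Hypothetical digitised nappes: one block per assessor, one (x, y) pair per
# product, measured in cm from the bottom-left corner of the sheet.
nappes = {
    "assessor 1": {"A": (5, 20), "B": (15, 22), "C": (45, 18), "D": (55, 20)},
    "assessor 2": {"A": (30, 5), "B": (28, 12), "C": (32, 30), "D": (30, 37)},
}


def to_table(nappes, products):
    """P rows x (2 * J) columns: the X and Y columns of each assessor's block."""
    return [
        [coord for block in nappes.values() for coord in block[p]]
        for p in products
    ]


def axis_variances(block, products):
    """Variance of the X and Y co-ordinates, without scaling to unit variance."""
    xs = [block[p][0] for p in products]
    ys = [block[p][1] for p in products]
    return pvariance(xs), pvariance(ys)


table = to_table(nappes, products)
for name, block in nappes.items():
    var_x, var_y = axis_variances(block, products)
    print(name, "var X:", var_x, "var Y:", var_y)
```

In this invented example the first assessor separates the products almost entirely along the horizontal axis and the second along the vertical axis. Scaling each column to unit variance would hide this difference, which is the motivation for the covariance-based analysis described above.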

5.8 ASSOCIATIVE METHODS

In the previous section we showed that indirect descriptive methodologies are a selection of methods that can be applied as various measurements of the similarities and dissimilarities between products. This generally allows us to extrapolate the main perceptual dimensions between the systems under test and, with great care, perhaps to describe these dimensions, either by collecting additional data from our assessors or by interpreting the dimensions based on product spread and known characteristics. But what if we are more interested in a qualitative description of the products? What would be the main characteristics/descriptors that assessors would associate with the different products?

To answer these questions, different approaches can be considered. The simplest and most logical one consists in directly asking assessors to describe the products using their own words. The advantage of such practice is that the assessors’ answers are free, spontaneous, and fast. Depending on the context, such a description can be more structured (e.g. asking assessors to describe products based on certain characteristics), limited (e.g. “use 5 words to describe this system/product”), or completely unstructured. The more unstructured the question, the more associative methods can be considered a holistic approach. In this section we will describe a selection of methods that may be considered associative, that is, methods that rely on semantic association between either free elicitation or word lists and the systems under test.

5.8.1 Free association

In sensory and consumer science, different methods have been used by researchers to measure consumers’ liking or attitudes towards a set of products. In this context, open-ended affective type questions have been used to uncover the consumers’ perception and/or reasons for liking. Examples could be: “In few words, describe what you like about this sample?” or “In few words, describe what you dislike about this sample?” More generally, it could also be asked as: “Is there anything else you would like to share regarding this sample?” In some other situations, such as a Sorting Task or Napping, the word association is not directly related to liking, but to the perceived characteristics of the products. Yet, the approach is similar, and assessors are asked to characterise products or groups of products using their own criteria and words. In its rawest form the Free Association method aims to understand the assessor’s product associations through individually expressed words (see e.g. Chapter 13 or Nielsen and Ingwersen (1999) for a more thorough review). Free association tasks have also been used as supplementary data in more structured data collection (e.g. Gabrielsson and Sjögren (1979b)).

Independent variables: Free association methods typically need clear differences between the systems under test to achieve meaningful differences in the measurement. Because the method is word based, the number of factors in the experiment should be limited to a 2-way factorial design. Due to the holistic nature of the method, it is appropriate for studying semantic associations related to e.g. brand/design interactions.

Assessors: Whilst expert assessors may be used, it is more meaningful to use naïve assessors or consumers. Depending on the objective of the experiment, more or fewer subjects may be used; < 20 for a more qualitative approach, > 20 allows for a more structured analysis.

Presentation of stimuli: The method can be used for both real devices and sound recordings. Best practice for calibration and sample presentation should be followed. Each sample is evaluated in a monadic sequence, based on an experimental design that controls for first-order and carry-over effects.

Because the assessors are allowed to freely express themselves, Free association methods often generate large textual data sets. Assessors may provide a large number of words, some of them having the same meaning yet used differently, being misspelled, or being “stop words” (irrelevant or meaningless in the study, such as “and”, “I”, “or”, etc.). Lemmatization is then performed: this step consists in standardising each word by grouping the ones with the same meaning together. The stop words and the words with very low frequencies may be removed from the analysis, as they may influence the analysis negatively. Depending on the amount of text collected, such a procedure of cleaning the data can be quite time-consuming if performed manually. However, today a number of software-based tools are available for performing such tasks at least semi-automatically (e.g. RapidMiner, Tropes, etc.). Ideally, such tasks should be done by multiple experimenters independently, to make sure that the same groupings are performed (when disagreement between experimenters is observed, it is commonly agreed that the words involved should not be removed or grouped).

The core of the data analysis in free association methods is to understand the association between words and products. Such association can be made through a simple frequency table crossing the products in rows and the words used in columns. To visualise the association between the products and the words, Correspondence Analysis (CA) is performed on the table of frequencies.
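The cleaning steps and the resulting frequency table can be sketched as follows. The responses, the stop-word list and the mapping of variants to a standard form are all hypothetical and hand-made for this illustration; a real study would rely on a proper lemmatiser or on one of the tools mentioned above, and the Correspondence Analysis itself is not shown.

```python
from collections import Counter

# Hypothetical free-association responses per product.
responses = {
    "Headphone A": ["Warm and boomy", "boomy bass", "I find it warm"],
    "Headphone B": ["Bright and clear", "clear", "brite or sharp"],
}

# Illustrative stop-word list and hand-made mapping to a standard form.
STOP_WORDS = {"and", "i", "or", "it", "find"}
STANDARD_FORM = {"brite": "bright"}


def clean(response):
    """Lower-case, split, drop stop words and map variants to one form."""
    words = response.lower().split()
    return [STANDARD_FORM.get(w, w) for w in words if w not in STOP_WORDS]


def frequency_table(responses, min_count=1):
    """Products in rows, words in columns, cell = number of mentions."""
    counts = {
        product: Counter(w for r in texts for w in clean(r))
        for product, texts in responses.items()
    }
    totals = Counter()
    for product_counts in counts.values():
        totals.update(product_counts)
    # Optionally drop words mentioned fewer than min_count times overall.
    words = sorted(w for w, n in totals.items() if n >= min_count)
    table = {p: [counts[p][w] for w in words] for p in responses}
    return words, table


words, table = frequency_table(responses)
print(words)
for product, row in table.items():
    print(product, row)
```

Setting `min_count` above 1 removes rarely used words, in line with the recommendation above to exclude very low-frequency words from the analysis.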

5.8.2 Check-all-that-apply (CATA)

Check-All-That-Apply (CATA, Adams et al. (2007)) is an associative method that uses a pre-selected list of attributes to explore the relationship between these attributes and the systems under test. In a CATA experiment each assessor is given a list of 10 to 40 words (which may cover different topics such as sensory characteristics, attitudes and habits, liking, emotions, etc., depending on the objective of the test) and asked to select all the words from this list that they find characterise the presented stimuli (see Figure 7.4).

Independent variables: CATA tests are suitable for assessing larger sets of stimuli, because data collection is fast. Between 4 and 15 systems under test may be evaluated. Both 2- and 3-way experiments may be applied if necessary.

Assessors: Both expert and naïve assessors/consumers may be used. The method is more suitable for consumers, due to the easy instructions and data collection. Typically around 50–100 assessors are used in an experiment (Vidal et al., 2014).

Presentation of stimuli: The method can be used for both real devices and sound recordings. Best practice for calibration and sample presentation should be followed. Each sample is evaluated in a monadic sequence, based on an experimental design that controls for first-order and carry-over effects.

This methodology is easy and quick to execute, as it does not engage the assessors in deep processing. As a drawback, previous studies have shown that assessors tend to select the terms that easily catch their attention and/or that can easily be found within the list. Also, as further products are tested, the assessors tend to limit themselves to selecting only the words at the top-left of the list. For those reasons, it is recommended to randomise the words within and between assessors’ evaluations.
This reduces any presentation bias relating to the word order, but can be frustrating for assessors if they need to search for specific terms in a large set that is randomly sorted for each trial. Furthermore, the checklist can be replaced by a yes/no question for each term, encouraging the assessor to actively read and respond to each CATA term. This can be a burden when many CATA terms are presented. Most of the applicable analyses rely on the frequency of association of a word across products: with a small sample size, this association may not be strong enough to show reliable differences between products. CATA is a relatively novel tool and can be used as a very fast and efficient way to obtain data on the most perceptually relevant characteristics that assessors identify. It can also


Figure 5.12: Example user interface for a Check-All-That-Apply (CATA) test, using SenseLabOnline. Courtesy of FORCE Technology, SenseLab.

be used as a rapid way of identifying the most pertinent attributes, from a larger lexicon of attributes, for use in descriptive analysis experiments.

Free association and CATA tasks are easy and quick to perform. In short, the free association method provides a wide range of information regarding the products, but its analysis is quite limited since it requires heavy pre-processing of the data and does not allow hypothesis testing. CATA does not suffer from these drawbacks since a fixed list of words is used. However, in both situations the outcome of the analysis is a measure of association, not a measure of intensity as we are used to. And although higher intensity may lead to higher association rates, this is not necessarily the case, and association rates cannot be interpreted as intensities.
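The recommendation to randomise the order of the CATA terms within and between assessors can be implemented in many ways. The following is a minimal sketch, assuming a hypothetical term list and a hypothetical helper `cata_order`; seeding by assessor and trial is a design choice made here so that the presented order can be reproduced at analysis time.

```python
import random

# Hypothetical CATA term list, invented for illustration.
TERMS = ["bright", "boomy", "clear", "distorted", "warm", "harsh"]


def cata_order(terms, assessor_id, trial, base_seed=2019):
    """Return a reproducible random ordering of the terms for one assessor and trial."""
    # A string seed makes the order depend on assessor and trial,
    # while staying identical across program runs.
    rng = random.Random(f"{base_seed}-{assessor_id}-{trial}")
    order = list(terms)  # copy, so the master list is left untouched
    rng.shuffle(order)
    return order


for assessor in (1, 2):
    for trial in (1, 2):
        print(assessor, trial, cata_order(TERMS, assessor, trial))
```

Dropping `trial` from the seed would give each assessor one fixed order across all trials, which may ease the search burden mentioned above at the cost of less randomisation within assessors.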

5.8.3 Rate-all-that-apply (RATA)

Rate-All-That-Apply (RATA) is a relatively new extension of the CATA method (see Ares et al. (2014)). In a RATA experiment the assessor is first asked to assess whether an attribute is applicable in describing the presented stimuli. In the next step the assessor is asked to rate attribute intensity and “applicability” on separate scales. Any attributes that the assessor finds non-applicable are ignored. Compared to CATA, this provides a direct measurement of intensity per attribute.

The key advantage of RATA is that it allows assessors to rate only those attributes of relevance for a given trial, which is not possible with other consensus vocabulary or descriptive analysis methods. This can be beneficial in experiments where certain attribute characteristics are only excited in some cases, i.e. for certain samples but not for others. In an example where one sample causes midrange distortion and another sample causes phase distortion, it would be inefficient to force all assessors to evaluate these two attributes for samples not involving such distortions.

Compared to a consensus vocabulary or descriptive analysis data set, which has the same number of attribute ratings for each assessor and each product, the RATA data structure is more complex. Each assessor has a different number of ratings for each product, depending on which attributes they found pertinent. The analysis can be performed similarly to an IVP data set, using GPA, MFA or HMFA. For a thorough introduction to the analysis of RATA data please refer to Meyners et al. (2016).
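The ragged RATA data structure described above can be illustrated with a small sketch. The records, attribute names and the helper `to_matrix` are hypothetical. Filling non-applicable attributes with an intensity of 0 is one possible convention assumed here for illustration only; it is not prescribed by the text, and `fill=None` keeps them as missing instead.

```python
# Hypothetical RATA records: (assessor, product) -> {attribute: intensity}.
# Each assessor only rates the attributes found applicable for that product.
ratings = {
    ("s1", "A"): {"midrange distortion": 4, "bass strength": 2},
    ("s1", "B"): {"phase distortion": 3},
    ("s2", "A"): {"midrange distortion": 5},
    ("s2", "B"): {"phase distortion": 2, "bass strength": 1},
}


def to_matrix(ratings, fill=0):
    """Expand ragged ratings into one row per (assessor, product).

    Attributes that were not rated receive `fill`. Using 0 is an assumption
    made for this sketch; pass fill=None to keep them as missing values.
    """
    attributes = sorted({a for rated in ratings.values() for a in rated})
    rows = {
        key: [rated.get(a, fill) for a in attributes]
        for key, rated in sorted(ratings.items())
    }
    return attributes, rows


attributes, rows = to_matrix(ratings)
print(attributes)
for key, row in rows.items():
    print(key, row)
```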

5.9 CLOSING WORDS

This chapter introduced a wide range of sensory evaluation methodologies, many of which are used in the field of sound/audio. However, methods are constantly evolving to address new research questions and applications, and thus it is impossible to cover all methods even here. The focus of this chapter has been on Listening Only Tests (LOT); conversational and multiparty tests were not covered, but are discussed in Chapter 8. Free sorting and Just About Right (JAR) scales are presented with examples in Chapter 7. Further in-depth guidance on JAR methods is available in Rothman and Parker (2009). The Individual Vocabulary Profile with Preferences by Pairwise Judgements (IVPPPJ) is presented for concert hall acoustics applications in Chapter 12, and the Open Profiling of Quality (OPQ) method is introduced in detail for audiovisual applications in Chapter 14. Additionally, Part III of our book provides many applied examples of these methods and more, whilst further inspiration can be found in classic and modern sensory evaluation textbooks including Lawless and Heymann (2010); Lê and Worch (2014); Næs et al. (2010); Kemp et al. (2018).

CHAPTER 6

Applied Univariate Statistics

Per Bruun Brockhoff, Danish Technical University, Lyngby, Denmark

Federica Belmonte, Danish Technical University, Lyngby, Denmark

CONTENTS

6.1 Introduction
6.2 Some basics of statistical reasoning
6.3 The general experimental and analysis approaches
    6.3.1 Simple versus elaborate analysis approaches
    6.3.2 Multiple comparison corrections
6.4 Mixed models for sensory and consumer data
    6.4.1 Headphone data - Structure and exploratory analysis
    6.4.2 Overview of mixed model tools
    6.4.3 From t-tests to multi-factorials and multi-blockings
        6.4.3.1 Extension 1: K > 2 levels of the “treatment” factor
        6.4.3.2 Extension 2: Replications - Two blocking factors
        6.4.3.3 Extension 3: More than a single treatment factor
        6.4.3.4 Extension 4: Even more complex structures
    6.4.4 Headphone data analysis, modelling by mixed models
        6.4.4.1 Post hoc analysis
    6.4.5 Analysis of and correction for scaling effects
    6.4.6 SensMixed-package: Multi-attribute automated analysis
6.5 Model validation and Perspectives: A view towards non-normal and non-linear mixed models within sensory evaluation
    6.5.1 Model validation
    6.5.2 Non-normal data - What to do generally?
        6.5.2.1 Ignore it!
        6.5.2.2 Use non-parametric (rank-based) methods
        6.5.2.3 Use transformation to obtain (near) normality
        6.5.2.4 Identify the “correct” distribution and do statistics based on that
        6.5.2.5 Use simulation-based methods
        6.5.2.6 sensR - A brief on discrimination and similarity testing
        6.5.2.7 Ordinal - A brief on ordinal data analysis
6.6 Relation to Multivariate Analysis


This chapter provides an introduction to and overview of the basic statistical approaches for the sensory evaluation of sound. Our focus is on the detailed univariate analysis of single-attribute data, but we also include some thoughts on the role of univariate analysis when we in fact face a multivariate/multi-attribute data situation. The main topic is analysis of variance methodologies and the so-called repeated measures and mixed model extensions of these, applied to single-attribute, multi-subject and multi-system data. Following some basics of statistical reasoning, the majority of the chapter covers the newest developments within so-called mixed models for sensory and consumer data. Our focus is upon linear mixed models and their analysis using general-purpose and sensory-specific R packages and other open source tools. Thereafter, relevant non-linear and non-normal conditions are considered and finally, several links to multivariate methods are described to bridge to the applied multivariate statistics of Chapter 7.

Mixed models is the generic statistical term for models in which multiple error/noise components are part of the model. With human perceptual data of any kind, be it from expert or consumer panels, we essentially always have repeated data in the sense that each subject provides more than a single evaluation. Hence, we always face a situation with at least two error components, coming from within- and between-individual variability respectively. The ability of a researcher to handle mixed model analysis is at the core of being able to properly analyse perceptual data of sound, as with any kind of sensory and consumer data. A complete review of applied linear mixed models can be found in Brockhoff (2015). The term mixed models is used here rather generically and includes concepts such as repeated measures, a term that may be better known, and also the “usual” multi-factorial ANOVA and/or regression and ANCOVA models.

The overall purpose of this chapter is, first, to convey the message that there is insight and value to be gained from moving from simple attribute-by-attribute one- and two-sample averaging procedures, potentially combined with similarly simple error bars, to the more elaborate mixed model analysis procedures. Secondly, it is to point to user-friendly and dedicated open source software tools to make the analysis actually happen.

6.1 INTRODUCTION

The basic statistical concepts of hypothesis testing and confidence intervals are typically introduced in the one- and two-independent-sample settings. A typical introductory statistics course might also manage to include the multi-group mean comparison setting, the basic one-way analysis of variance (ANOVA). Some courses will also include the blocked samples case of the multi-group comparison, where the “treatments” are applied to each block, leading to a two-way ANOVA setting, see e.g. Brockhoff et al. (2015a). The common and basic design of using a consumer panel to evaluate a set of K ≥ 2 products exactly once is an example of the basic randomised block design.

The typical introductory statistics textbook would never mention the concept of mixed models or random effects for such an analysis, despite the fact that it is generally recommended to treat block effects as random, since the differences between blocks really represent random variability. The reason why this apparent contradiction is not a real problem is simple. The main purpose of such a study is to test for and compare product mean differences. The tools for this purpose become mathematically exactly the same whether the block (subject) effects are considered fixed or random: product means are compared “within subjects” in the sense that the subject-to-subject variability is held aside when products are compared, as it should be. However, this friendliness of the statistical theory only holds for this purpose in this particular idealised, complete and simple data setting.

Most realistic data settings and/or analysis purposes in business, science or society will not fit into this introductory statistics class framework. More generally, a complete mixed model analysis can handle all sorts of different purposes, and the answer will generally differ from that of an analysis in which certain effects are considered fixed that ideally should have been considered random in order to provide a pertinent answer to a relevant question. This has long been acknowledged in what has received the name “repeated measures analysis”, a much used term for a collection of different techniques and models. The mixed model approach presented here can be viewed as including repeated measures analysis, but with a focus on correlation and variability structures across and within individuals related to various design factors. Classical repeated measures analysis focuses more on time-induced correlations.

There will still be room for purely fixed ANOVA analysis for various purposes, and in a way the purely fixed ANOVAs are embedded within the mixed model approach. Sometimes all the structures can be investigated by a fixed ANOVA; sometimes the relevant tests are indeed what you would find in a fully fixed ANOVA. Sometimes there are no critical interactions, or the difference between a fully fixed ANOVA and the seemingly more correct mixed model is minor. But sometimes variabilities due to various interactions will change the level of uncertainty in a major way, so if ignored, as they will be in a fully fixed ANOVA, the conclusions of the analysis can change drastically between the two approaches.
There is no doubt that due to the complexity of this, many examples can likely be found in the literature, where the best possible analysis was not done. One of the major points we want to make in this chapter is to provide tools that make it possible for non-expert users to access these more complex tools to make sure that better analyses and decision making are made in the future. In Brockhoff (2015) the mixed models are motivated and introduced in the settings of randomised block experiments, and an important message is that the mixed models are not only there to make sure that we do not make too optimistic conclusions based on data that are much more correlated than we perhaps realise. They are also there to help us extract the real product information by filtering out and weighting according to complicated noise structures that otherwise might blur our product conclusions. And additionally, they provide the insights into the complicated noise structures. An example of this could be when different audio systems are tested using different stimuli samples, then potential perceptual interactions between subjects on one side and systems and samples on the other can be due to either systems only or samples only or maybe a combination. The approach described below will automatically detect which components are of major importance, report them and take them into account in the analysis of the systems and samples differences.

6.2 SOME BASICS OF STATISTICAL REASONING

The scientific field of statistics has historically been characterised by fundamental disagreements, see e.g. Efron (1998). In Pawitan (2001) it is expressed as: “Statisticians are probably unique among scientists with constant ponderings of the foundation of their subject”. One fundamental conflict exists between the so-called frequentist and Bayesian perspectives. Another, within the frequentist area, is between the Fisherian and the Neyman-Pearson perspectives. Most introductory statistics classes given globally to university students are founded on the frequentist statistics perspective. In most of these a pragmatic compromise

between the two frequentist directions is chosen. On the one hand, hypothesis test p-values are interpreted in the Fisherian way, allowing the size of the p-value to quantify the degree of (internal data) evidence against a null hypothesis. On the other hand, the more Neyman-Pearson inspired long-term risk considerations (Type I and Type II errors) of a yes/no hypothesis test are used as an approach to handle the important concept of power. This is especially important when planning studies and experiments, to be able to identify relevant effects for the economic and human/animal ethical investment made.

For most people lacking a background in the history and philosophy of statistical reasoning, this discussion may appear to be something to be left for statistics experts to argue about. This is certainly our opinion, as we would rather focus on solving real challenges and answering real questions. However, we are faced with these problems on a daily basis, particularly with the current digital data revolution. Statistics and statistical reasoning are unfortunately not well understood globally, even among the researchers who use statistical tools to produce knowledge. This has been documented in numerous studies by experimental psychologists studying statistical cognition. A number of studies showed that most people struggle with the formal understanding of the concept of p-values and Type I/II errors (Oakes, 1986; Haller and Krauss, 2002; Nickerson, 2000; Lecoutre et al., 2003). Furthermore, similar studies showed that the understanding of the confidence interval concept suffers from exactly the same problems, see e.g. Hoekstra et al. (2014). The psychologist and winner of the Nobel prize in economics Daniel Kahneman (2011) describes in his bestseller how the challenges in understanding probabilistic statistical reasoning were originally part of the motivation for his team’s and colleagues’ research.
No doubt these challenges in fundamentally understanding the concepts, strengths and limitations of statistical reasoning tools are part of the reason why statistics, in some fields, has developed into the mindless application of a p-value computation recipe. The debate about hypothesis testing and p-values has been around during the entire history of statistics. In Gigerenzer (2004) an entertaining Freudian analysis of the three main fundamental ideas of statistics (with a view towards applications in social science) is given together with the recommendation: “To stop the ritual, we also need more guts and nerves. We need some pounds of courage to cease playing along in this embarrassing game. This may cause friction with editors and colleagues, but it will in the end help them to enter the dawn of statistical thinking.”

In March 2016 the American Statistical Association (ASA), for the first time in its 177-year history, made a public press release expressing its views on the use of a specific statistical method: the p-value from hypothesis testing (Wasserstein and Lazar, 2016). The message to extract from this is a general warning against the “mindless” use of p-values. One of the incidents leading to this press release was that in 2015 the editors of the scientific journal Basic and Applied Social Psychology banned all use of p-values and frequentist-based confidence intervals in the journal because of the misleading conclusions drawn from poor interpretation (Trafimow and Marks, 2015). There is no general agreement about the meaningfulness of this rather strong move, which was not generally supported by the ASA statement (Wasserstein and Lazar, 2016), and at the time of writing no other journals had followed this path. The scientific discussion about this is continuing. In a 72-author paper in Nature Human Behaviour, Benjamin et al. (2017) suggest moving to a general significance threshold of 0.005 instead of 0.05.
At the time of writing, disagreeing multi-author responses to this suggestion were also on the way. With the digital big data revolution we now face, the use and role of statistics and statistical reasoning are challenged even more. And the view of how to use and think of

statistical and/or data analysis in business, science and society is probably more diverse than ever. It spans from highly regulated and formal simple-endpoint confirmatory applications within e.g. the pharmaceutical industry, over more pragmatic industrial developmental data collection and analysis, to big data machine learning applications with a focus on complex black-box nonlinear multivariate prediction models. For the former, hypothesis testing and p-values still play an important role. For the latter, the parsimony modelling principles of Occam’s razor would typically be enforced, together with all sorts of cross-validation techniques to protect against over-fitting, instead of p-values and hypothesis testing. Misunderstandings and misuses of statistics and modelling no doubt occur at both ends of the scale. Recently, O’Neil (2016) warned strongly against the consequences of certain prediction models when (mis-)used at a large societal scale.

We start this chapter with situations of pragmatic analysis approaches. Using classical statistical reasoning with p-values and confidence intervals, parsimony modelling principles will be used and recommended to get the best out of the data at hand. Domain-specific expertise, common sense and data visualisations are recommended as close companions in everything you do in applying statistics and modelling. Used with care, p-values can provide you with a valuable means to quantify evidence against various null hypotheses (H0). However, whenever possible one should move towards error bars, confidence limits and meta-thinking (Cumming, 2012). Later we will see examples of new developments in sensory and consumer data analysis aligned with such ideas.
Also in line with Stigler (2016), we believe that the seven basic historic pillars of statistical reasoning (aggregation, information measurement, likelihood, inter-comparison, regression, experimental design, residual), independent of your “statistical religion”, continue to be valuable even in the new age of big data. And as a final statement on the approach of this chapter: yes, we were schooled as frequentists, so we are going to share only frequentist-based tools with you. If you want to do similar work based on a Bayesian approach, that is possible, but beyond the scope of this chapter/book.

6.3 THE GENERAL EXPERIMENTAL AND ANALYSIS APPROACHES

The focus of this chapter is on statistical analysis procedures, that is, the way to extract the information from the experimental data at hand in the best possible way. Before the data was in your hands, some initial research and/or business questions motivated hypotheses that led someone to design this particular experiment, with its particular choices of design (independent) variables and measurement (dependent) variables. The analysis should reflect these meta settings such that the statistical analysis matches the formal design used for the experiment. We do not discuss the important topic of design of experiments (DoE) in this chapter. For a brief discussion in light of sensory and consumer experiments, see Næs et al. (2010), with further guidance in Montgomery (2012).

The analysis procedures described below are fairly generic and do not rigidly tie any specific design and analysis together. At least for the typical audio sensory experiments discussed here, the procedures will “cover all bases” in the sense that all possible research questions and all possible noise and variability sources are investigated. Generally, a statistical analysis will include the following steps, and you may have to carry out several iterations:

1. Exploratory analysis (descriptive statistics). Getting an initial idea of the important structures within the data and clarifying the basic (meta) settings of the data at hand with respect to the (independent) variables and the scales (dependent variables) used. Simple summary/descriptive statistics, and one- and two-variable graphics such as scatter plots and box plots.

2. Modelling. Formal identification of the overall important structures in the data set through statistical modelling, including hypothesis testing: noise (random) effects as well as systematic (fixed) effects, modelled through the design factors and covariates of the experiment. Answering overall questions such as whether synergistic and/or antagonistic effects, jointly known as interaction effects, play a significant and important role for the information to be extracted.

3. Validation and assumptions. Checking the standard assumptions for the models used, including e.g. assumptions of normality, homogeneity of variance and independence, to the extent that they are relevant. Investigating potential outlying and influential observations.

4. Conclusions, summary and final visualisation of results. Summarising the important findings in tables and visualisations. Post hoc analysis in ANOVA settings. Including model estimates, hypothesis test results, uncertainties and confidence intervals from the resulting models.

6.3.1 Simple versus elaborate analysis approaches

A first approach to summarising experimental data and associated findings may often include simple averaging and some simple variability and uncertainty judgements. For example, subset and/or pooled data can be used to calculate simple one- or two-sample-based confidence intervals (CI). In the context of the exploratory part of the analysis this is perfectly fine. If you limit yourself to this kind of analysis, however, you may miss important details: simple means and CIs cannot reveal all important signals and variabilities. In particular:

1. Overall interactions or functional structures might be missed.

2. Interpretation of simple error bars continues to be a challenge.

3. Uncertainties extracted like this may be far from the real uncertainties considering all the relevant variabilities.

4. In more complex settings, simple means are generally not the optimal estimates of the unknown features of the system investigated, and sometimes they can even be off target if not corrected for certain covariate effects.

Let us elaborate a bit on each point.

1. Functional relations, e.g. how certain sensory properties depend on the actual values of the signal-to-noise ratio of the stimuli, clearly require regression type models to be quantified. Interactions may exist such that sound system differences depend on the type of stimuli. Alternatively, there may be human-system interactions: one person’s experience of a system clearly differs from another person’s. Potentially, such effects could be related back to between-subject factors, such as gender, age, etc.

2. As discussed in Cumming et al. (2007); Krzywinski and Altman (2013); Cumming and Finch (2005), the interpretation of error bars given in plots of e.g. simple means continues to challenge the average reader of scientific papers. Many readers would tend to interpret the potential overlap between two CI and/or standard error (SE) bars as equivalent to a non-significant difference between the two means depicted. This is wrong for both types of error bars, for the simple reason that errors are not additive on the standard deviation scale but on the variance scale! And this is a basic point made for independent samples data. For paired data the comparison of two separate means by two separate error bars becomes plainly meaningless, as the variability used for these would be completely wrong: individual means use the between-subject variability, whereas the relevant variability to use would be the within-subject variability, that is, the variability of differences. In surveys, a high number of readers also missed this major issue, see e.g. Hoekstra et al. (2014).

3. For mixed models in general, and indeed for the types to be used with audio sensory data, there may be even more complex “pairing” and “blocking” going on at several levels. As a result, simple error bars based on data sub-setting and/or pooling are almost impossible to verify as meaningful measures of uncertainty in a given situation.

4. Finally, even the simple average itself will generally not amount to the optimal estimate of the true relevant unknown parameter. For completely balanced designs this is generally not an issue: in such situations simple averages estimate the relevant and expected features. However, for e.g. incomplete block designs, partially controlled design factors or non-balanced covariates, the formal mixed model is needed to obtain the right estimates. In many almost complete data experiments, though, this particular issue may not have a major impact on the results.

The cases that we will cover below are sometimes termed repeated measures settings.
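The second point, on error bars, can be made concrete with a small numerical sketch. The ratings below are invented for illustration and the helper `standard_error` is hypothetical. The sketch contrasts three quantities: the sum of the two separate standard errors (what a reader implicitly uses when judging overlap), the standard error of a difference between independent samples (variances add), and the standard error of the paired differences (within-subject variability).

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical ratings of two systems by the same six assessors (invented data).
system_a = [3.0, 5.0, 7.0, 4.0, 6.0, 8.0]
system_b = [4.0, 6.0, 8.5, 5.0, 7.5, 9.0]


def standard_error(values):
    """Standard error of the mean of a sample."""
    return stdev(values) / sqrt(len(values))


se_a = standard_error(system_a)
se_b = standard_error(system_b)

# Independent samples: errors add on the variance scale, not the SD scale,
# so the SE of the difference is smaller than se_a + se_b.
se_diff_independent = sqrt(se_a**2 + se_b**2)

# Paired data: the relevant variability is that of the within-subject differences.
differences = [b - a for a, b in zip(system_a, system_b)]
mean_diff = mean(differences)
se_diff_paired = standard_error(differences)

print(f"sum of separate SEs:       {se_a + se_b:.3f}")
print(f"SE of difference (indep.): {se_diff_independent:.3f}")
print(f"SE of difference (paired): {se_diff_paired:.3f}")
print(f"mean difference:           {mean_diff:.3f}")
```

In this made-up data set the assessors differ considerably from one another but agree closely on the difference between the systems, so the paired standard error is far smaller than anything the two separate error bars would suggest.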
The classical methods used for such settings are either a univariate approach with a correction factor for the potential lack of so-called sphericity (such as the Greenhouse-Geisser (Greenhouse and Geisser, 1959) or Huynh-Feldt (Huynh and Feldt, 1976) corrections) or a completely unrestricted multivariate approach, see e.g. Keselman et al. (2001). The approach we propose in this chapter is closer to Kowalchuk et al. (2004). With its novel automated approach it can be considered a hybrid between the two classical approaches: it will identify a suitable version of the correlations in the data, similar to the idea of the correction methods, but additionally investigate further detailed structures. Therefore the identified solution will lie in between the often oversimplified correction methods and the often too general unrestricted multivariate method. Additionally, this approach will allow for further interpretation of the correlation structures given by certain variance components, for both main and interaction effects.

6.3.2 Multiple comparison corrections

Below we will use post hoc multiple mean comparisons within mixed models, and show univariate hypothesis test results collected across multiple attributes. For both types of results it is important to protect against “significance by chance” when interpreting such multiple test results. For post hoc analysis within mixed models, you can use any of the approaches used with classical linear models. Thus classical correction methods such as Dunnett’s, Tukey’s, Scheffé’s and Bonferroni’s methods apply equally to the analysis of fixed effects within mixed models. We will detail a few of these methods here. They are generally available in R (see R Core Team (2016)) through various post hoc packages, see e.g. Lenth (2016). R is an open source language and environment for statistical computing and graphics. It is the most widely used development language for statistical computing and, together with Python®, similarly so for machine learning and data science. The number of add-on packages to base R has been increasing exponentially since 2002, and passed 10,000 in the

164  Sensory Evaluation of Sound beginning of 2017. It is beyond the scope of this chapter to introduce the basic use of R, and such introductions are numerously found online, e.g. Brockhoff et al. (2015a). The Bonferroni technique is, due to its simplicity and broad applicability, worth recapping, as discussed in more detail in Brockhoff et al. (2015a). The Bonferroni-correction amounts to using α/M as the level in each individual test or CI instead of α, when M multiple tests are carried out. Imagine that we performed an ANOVA in a situation with k = 15 groups. And then we do all the M = 15 · 14/2 = 105 possible pairwise hypothesis tests. Assume for a moment that the overall null hypothesis is true, that is, there really are no mean differences between any of the 15 groups. Now consider what would happen if we still performed all the 105 tests with α = 0.05! How many significant results would we expect among the 105 hypothesis tests? The answer is that we expect α · 105 = 0.05 · 105 = 5.25, that is, approximately 5 significant tests are expected. And what would the probability be of getting at least one significant test out of the 105? The answer to this question can be found using the binomial distribution: P (At least one significant result out of 105 independent tests) = 1 − 0.95105 = 0.9954

(6.1)

When performing a single test, we have a probability of α = 0.05 of getting a significant result when we should not. With 105 tests we now have an overall Type I error risk, that is, a probability of seeing at least one significant result when we should not, of 0.9954! This is an extreme overall Type I risk, sometimes also called the “family-wise” Type I risk. In other words, with k = 15 we will basically always see at least one significant pairwise difference if we use α = 0.05. This is why we recommend a correction method when doing multiple testing like this. Using the Bonferroni correction (i.e. α_Bonferroni = 0.05/105) for each of the 105 tests gives in this case the family-wise Type I risk:

$$P(\text{at least one significant result out of 105 independent tests}) = 1 - (1 - 0.05/105)^{105} = 0.049 \tag{6.2}$$

The Bonferroni correction can also be used across attributes: if you see a single p-value below 5 % across 15 to 20 attributes, that should not be considered as evidence of significance. Even when the output does not show any p-value corrections, you can easily apply the α/M correction when studying multiple results: simply use this stricter criterion when learning from the multiple p-values. So in spite of not being the optimal choice for certain post hoc applications within ANOVA, this simplicity is worth sharing with you. In the PanelCheck software mentioned below, the Bonferroni method is also implemented for the mixed models used there. In the core mixed model R-packages mentioned below there are no built-in corrections of this type, and you would have to either apply some of the additional dedicated packages for this or use Bonferroni interpretations “on the go”.
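The numbers in (6.1) and (6.2) are easy to reproduce. The following is a minimal sketch in base R; the helper name `familywise_risk` and the vector `p_raw` of raw p-values are made up for illustration and are not part of the headphone data.

```r
# Family-wise Type I risk for M independent tests, each at level alpha
familywise_risk <- function(alpha, M) 1 - (1 - alpha)^M

k <- 15                      # number of groups
M <- choose(k, 2)            # number of pairwise comparisons: 15 * 14 / 2
alpha <- 0.05

expected_false_positives <- alpha * M              # expected number under H0
risk_uncorrected <- familywise_risk(alpha, M)      # cf. equation (6.1)
risk_bonferroni  <- familywise_risk(alpha / M, M)  # cf. equation (6.2)

# Bonferroni adjustment of a set of p-values (hypothetical values, e.g. one
# p-value per attribute): each p-value is multiplied by the number of tests
# and capped at 1, which is equivalent to comparing the raw p-values to alpha/M
p_raw <- c(0.001, 0.012, 0.030, 0.049, 0.200)
p_adj <- p.adjust(p_raw, method = "bonferroni")

print(round(c(uncorrected = risk_uncorrected, bonferroni = risk_bonferroni), 4))
print(p_adj)
```

Comparing `p_adj` to α gives the same decisions as comparing `p_raw` to α/M, which is the “on the go” interpretation described above.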

6.4 MIXED MODELS FOR SENSORY AND CONSUMER DATA

6.4.1 Headphone data - Structure and exploratory analysis

Within this chapter a headphone data set, described further in Appendix A.1, is used for illustration. It consists of 8 attributes evaluated for 8 systems (SYS) with 4 sample stimuli (SAM) in 2 replicates (REP) by 18 subjects (SUB). For a single attribute we have 8 × 4 × 18 × 2 = 1152 observations. As part of the initial exploratory analysis a principal component analysis (PCA) of the raw 1152 × 8 data matrix was performed. We emphasise that even though the chapter focuses on univariate, by-attribute analysis, we must not forget to consider the individual attribute information in light of everything else going on when we have a truly multivariate data set such as this. In Figure 6.1 the biplot, where both loadings and scores are shown simultaneously, illustrates the main effects of SYS: each point in the plot (score values) represents one of the 1152 observations and is colour coded according to the 8 different systems. The arrows (loadings) represent the attributes. See the following chapter for more on multivariate analysis like this.

Figure 6.1 Principal component analysis with main effects of SYSTEMS highlighted (biplot of PC1, 57.5 % explained variance, against PC2, 13.3 % explained variance; loadings shown for Att1 to Att8; groups 0 to 7). For each system we see all the 4 × 18 × 2 = 144 combinations of SAM, SUB and REP in the plot and an ellipse (based on a 68 % normal probability, that is, similar to plus minus 1 standard deviation) visualising the structure of the point cloud.

We looked at the same plot with the other three main effect colour codings (SAM, SUB, REP), and all of these showed almost completely overlapping and similarly sized point clouds. Up front this indicates that the major main effects in these data, across attributes, come from the SYS differences (71 % of the variation in these individual level data is explained by the first two components). The differences between the within-SYS point clouds indicate that some variability, heterogeneity or maybe interaction effects are present in the data. In particular the reference condition (group = 0) point cloud is very different from the rest. The cause of these differences can be investigated by a further decomposition of these variabilities, either on the raw attributes or potentially on the extracted principal components, as desired. Put differently: we need to carry out some formal analyses of variance to better understand the data, and also to distinguish signal from noise. We can already make a number of relevant exploratory interpretations from the PCA plot related to SYS average differences and how the attributes are correlated with each other. As this is part of a PCA exercise, we leave out the details here; PCA is discussed further in Chapter 7.

With regard to design of experiment (DoE), structurally we use a completely balanced full factorial design with the four factors SYS, SAM, SUB and REP, comprising exactly one observation per combination of the four factors. Table 6.1 shows the complete and unique ANOVA decomposition table for attribute 2. This is based on a classical fixed effect model, where all effects are modelled as fixed effects. Such a decomposition of the total variability can be a quite illustrative part of the exploratory analysis of these data. The proper inferential part of the analysis (hypothesis tests, p-values and confidence bands of relevant effects) should be based on the relevant mixed model, see below.
Hence, the p-values from this analysis are not provided, as most of them would not be the right p-values within the more relevant mixed models to be discussed below. Similarly, the F-test statistics shown in the table would generally not be the F-statistics from the proper choice of a mixed model. The exception is the F-tests for the highest order interactions in the table, the four 3-factor interactions: these would be valid tests also in the mixed model, as the relevant error term is the residual error term also used by the fixed model. However, all the remaining F-tests and p-values in the table are not expected to be valid in any way. Admittedly, if none of the four 3-factor interactions are important, then the tests at the next level, the 2-factor interaction tests, would be valid, and similarly for the level after that. This tedious approach will most often fail because several of these interactions are indeed important, and it is not recommended. Instead, we recommend applying the proper mixed model, which will automatically take all the potentially important interactions into account and ignore the remainder.

However, as pointed out in a sensory context by Brockhoff et al. (2016), these F-statistics can still be used as quite informative measures of the effect size of each effect when properly transformed. The “delta-tilde” ($\tilde{\delta}$) effect size measure is closely related to the well-known Cohen’s f, or the “Ψ-measure”, also discussed in Brockhoff et al. (2016). It expresses the average pairwise difference between levels of a factor relative to the basic (residual) noise level in the data. The delta-tilde, just as Cohen’s f, is a multi-level version of Cohen’s d, which is a fundamental estimate of the signal-to-noise ratio µ/σ, Cohen (1988), that is, the relative effect size. In the well-known work by Cohen, effect sizes are often used for the purpose of obtaining sufficiently powered studies. In Brockhoff et al. (2016) the delta-tilde is put forward as a way to interpret the importance of various effects in both fixed and mixed model multi-factorial ANOVA. It can also be explicitly linked to the psychometric concept of a signal-to-noise measure, δ or d-prime. In Table 6.1, instead of the fixed ANOVA p-values, we have provided these delta-tilde values for the purpose of exploring which of the 14 different types of effects appear to be the important ones. At the time of writing there is no R-package producing this particular table, but the simple formula for the delta-tilde values for a balanced and complete experiment like this is given in the table legend and provided in Brockhoff et al. (2016).

Effect         DF      Sum Sq     Mean Sq  F value    K    n  deltatilde
SYS             7  1417026.87   202432.41   621.67    8  144        2.94
SAM             3    11575.41     3858.47    11.85    4  288        0.27
SUB            17   114335.66     6725.63    20.65   18   64        0.78
REP             1     8184.54     8184.54    25.13    2  576        0.29
SYS:SAM        21    20633.65      982.55     3.02   32   36        0.28
SYS:SUB       119   161724.07     1359.03     4.17  144    8        0.81
SAM:SUB        51    17838.43      349.77     1.07   72   16        0.08
SYS:REP         7     3249.01      464.14     1.43   16   72        0.07
SAM:REP         3      341.09      113.70     0.35    8  144        0.00
SUB:REP        17    12577.34      739.84     2.27   36   32        0.20
SYS:SAM:SUB   357   126806.37      355.20     1.09  576    2        0.24
SYS:SAM:REP    21     4296.73      204.61     0.63   64   18        0.00
SYS:SUB:REP   119    54659.64      459.32     1.41  288    4        0.29
SAM:SUB:REP    51    22102.78      433.39     1.33  144    8        0.17

Table 6.1 The full fixed factor ANOVA table for attribute 2. The last column presents “delta-tilde” instead of p-values as a measure of effect size, where $\tilde{\delta} = \sqrt{\frac{2}{n}}\,\sqrt{\frac{DF}{K-1}}\,\sqrt{F-1}$.

For attribute 2, the residual standard deviation is 13.8. So the average pairwise difference between two systems (there are 28 such pairs for the 8 systems) is around 3 (2.94) times this value. This effect is clearly the largest among all possible effects. Figure 6.2 shows a grouped bar plot of the effect sizes for all attributes. This plot is a novelty in this context. Although the concept of effect sizes in ANOVA is well known in the statistical literature, effect sizes are not often used for exploratory analysis purposes in multifactorial ANOVA settings such as this. Quite a few insights regarding the effect sources can be gained from this plot. This particular plot for the full fixed model version is not included in a designated R-package, but is a quite straightforward use of the basic bar plotting features of R via the ggplot2 package, Wickham (2009). In the SensMixed package (Kuznetsova et al., 2014) described below, similar plots are provided based on the mixed model.

Figure 6.2 Average relative effect sizes (“delta-tilde”) for all effects and all attributes (grouped bar plot; x-axis: Effect, y-axis: deltatilde, bars grouped by attribute att1 to att8).

To explore the actual effects (not only their size) in a univariate attribute-by-attribute perspective, box plots of main effects and interaction plots could be utilised. A few of the main ones are depicted in Figure 6.3. In the user-friendly stand-alone open source tool PanelCheck (www.panelcheck.com), many more univariate and multivariate exploratory analysis tools are provided, with a focus upon subject performance. In order to fit the simplicity of the PanelCheck framework you might typically make a system-by-stimuli identification variable, considering the situation as corresponding to 8 × 4 = 32 products, without any other distinction in the data handling of systems and stimuli than the chosen names for the 32 combinations to be used in various plots. Using this trick, you could then fit the 4-way headphone data set into the 3-way PanelCheck environment using a PRODUCT-by-SUB-by-REP structure for each attribute, where PRODUCT = SYS-by-SAM.

This finishes the discussion of the exploratory analysis of such data. In practice there are no limitations to how creative you can be with advanced data visualisation using modern computer graphics tools, which can truly help you interpret and communicate the meaning in your data. Now moving on to the more formal modelling part of the analysis, we need to discuss further the concept of mixed models and which effects we should consider as fixed or random.
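The delta-tilde formula from the legend of Table 6.1 can be sketched in a few lines of base R. The function name `delta_tilde` and the helper `term_K` are illustrative names, not functions from any package, and setting the value to zero when F < 1 is an assumption made here because the table reports 0.00 in those rows.

```r
# Delta-tilde effect size for a balanced, complete design, following the
# legend of Table 6.1: sqrt(2/n) * sqrt(DF/(K-1)) * sqrt(F - 1).
# Assumption: the value is set to 0 when F < 1, as the table suggests.
delta_tilde <- function(Fstat, DF, K, n) {
  sqrt(2 / n) * sqrt(DF / (K - 1)) * sqrt(pmax(Fstat - 1, 0))
}

# Four rows copied from Table 6.1 (attribute 2)
tab <- data.frame(
  effect = c("SYS", "SUB", "SYS:SUB", "SAM:REP"),
  Fstat  = c(621.67, 20.65, 4.17, 0.35),
  DF     = c(7, 17, 119, 3),
  K      = c(8, 18, 144, 8),
  n      = c(144, 64, 8, 144)
)
tab$deltatilde <- round(delta_tilde(tab$Fstat, tab$DF, tab$K, tab$n), 2)
print(tab)

# K and n follow from the design: K is the number of level combinations of
# the factors in a term, and n = N / K is the number of observations per
# combination.
levels_per_factor <- c(SYS = 8, SAM = 4, SUB = 18, REP = 2)
N <- prod(levels_per_factor)
term_K <- function(term) {
  prod(levels_per_factor[strsplit(term, ":", fixed = TRUE)[[1]]])
}
K_sys_sub <- term_K("SYS:SUB")
n_sys_sub <- N / K_sys_sub
```

Applied to all 14 rows and all 8 attributes, the same function would give the values behind a plot like Figure 6.2.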

6.4.2 Overview of mixed model tools

Table 6.2 provides an overview of the open source tools for mixed modelling of sensory and consumer data touched on in this chapter. They span from a generic mixed model R-package (lme4, Bates et al. (2015b)), used with the script based R language as its interface, to a number of user friendly GUI-based and semi-automated tools designed to be used by the sensory practitioner with a limited degree of user control. For the former it helps if you are familiar with R in more basic applications; it allows for maximum flexibility but requires significant statistical skills. The latter group requires no statistical skills, but there will be limitations regarding the type of data or analysis that can be handled. Either way, the proper interpretation of the output will certainly still require some insight into the basics of statistical thinking. The focus of the current section is the SensMixed package (see Kuznetsova et al. (2015a); Kuznetsova (2015)), which is positioned between the two tool groups. It offers a nice selection of basic and advanced methods for sensory and consumer data, comprising automated multi-attribute analysis and visualisation including the newest specialised research in the field. However, the number of choices is limited, so that the default choices, together with the short tutorial given in this chapter, make it accessible for most people. Furthermore, the package has been implemented in parallel both as a standard script based version and as an R Shiny Application version, Chang et al. (2016), which runs as GUI based software in your preferred web browser and requires no R experience from the user.

Figure 6.3 Univariate exploratory plots of Attribute 2. Main effects of SYS and SAM. Interaction effect between SYS and SAM. Interaction effect between SYS and SUB.

Table 6.2 Overview of the open source tools for mixed modelling of sensory and consumer data mentioned in this chapter (PanelCheck and ConsumerCheck also have many other features beyond mixed models).

Tool: PanelCheck (www.panelcheck.com)
Description: Multi-attribute visualisation of sensory data; single factor product structure only; accounting for subject and replication effects; stand-alone Graphical User Interface (GUI)
References: Næs et al. (2010)

Tool: ConsumerCheck (www.consumercheck.co)
Description: Multifactorial analysis of consumer data; conjoint analysis; accounting for consumer and replication effects; automated model options and analyses; stand-alone GUI
References: Næs et al. (2010)

Tool: SensMixed
Description: Multi-attribute visualisation of sensory data; multifactorial analysis of sensory and consumer data; automated model selections; correction of scaling effects; R-package: standard and Shiny App GUI version
References: Kuznetsova (2015); Kuznetsova et al. (2015b); Brockhoff et al. (2016); Brockhoff et al. (2015b); Kuznetsova et al. (2014)

Tool: lmerTest
Description: P-values for mixed models in R; step-wise model building and selection; basic post hoc; R package - only script based
References: Kuznetsova et al. (2015b); Kuznetsova et al. (2017b); Kuznetsova et al. (2017a)

Tool: lme4
Description: A linear mixed model package in R; R package - only script based
References: Bates et al. (2015b); Bates et al. (2015a)

6.4.3 From t-tests to multi-factorials and multi-blockings

Let us zoom in on a subset of the headphone data: consider only the first replication (REP = 1) of the first sample (SAM = 1) for the two systems 0 (reference) and 1. Hence we have 18 scores, one from each subject (SUB), for each of the 2 systems (SYS). The mean system difference is 50.1. This corresponds nicely to the difference between the medians of these two systems across all the observations, as can be seen in the top left of Figure 6.3. This leads to considering a paired t-test setting. The standard deviation of the 18 system differences is 25.214, yielding the t-test statistic

$$t = \frac{50.0889}{25.214/\sqrt{18}} = 8.4282$$

which gives a p-value of $1.8 \cdot 10^{-7}$ using the Student’s t-distribution with df = 17 degrees of freedom. This result could be found in R as follows, including the raw output from the given R-call:

    > t.test(Att2 ~ SYS, data = headsubset011, paired = TRUE)

            Paired t-test

    data:  Att2 by SYS
    t = 8.4282, df = 17, p-value = 1.782e-07
    alternative hypothesis: true difference in means is not equal to 0
    95 percent confidence interval:
     37.55019 62.62759
    sample estimates:
    mean of the differences
                   50.08889

Here the data set headsubset011 consists of 36 rows corresponding to the combinations of 18 subjects and two systems, and includes the variables Att2 (the actual observed data) and SYS (the two-level factor coding for the systems). This design can also be seen as a randomised block design with subjects as blocks and systems as treatments. You could analyse the data from this design using either a standard fixed effect two-way ANOVA or the random block effect mixed model ANOVA, both mathematically equivalent to the paired t-test analysis just carried out. The basic R-script and raw output for the fixed ANOVA are:

    > anova(lm(Att2 ~ SYS + SUB, data = headsubset011))
    Analysis of Variance Table

    Response: Att2
              Df  Sum Sq Mean Sq F value    Pr(>F)
    SYS        1 22580.1 22580.1 71.0341 1.782e-07 ***
    SUB       17  9146.3   538.0  1.6925    0.1439
    Residuals 17  5403.9   317.9
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Note how the p-value is exactly the same and that the F-statistic equals the square of the paired t-statistic: $8.4282^2 = 71.03456$ (only rounding errors make it not exact here). The functions t.test, anova and lm (for the “linear model”) are part of base R (R Core Team, 2016). The mixed model version of the same analysis is run by a basic call as follows (again including raw output):

    > lmer1 <- lmer(Att2 ~ SYS + (1|SUB), data = headsubset011)
    > library(lmerTest)
    > anova(lmer1)
    Analysis of Variance Table of type III with Satterthwaite
    approximation for degrees of freedom
        Sum Sq Mean Sq NumDF DenDF F.value    Pr(>F)
    SYS  22580   22580     1    17  71.034 1.782e-07 ***
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

The function lmer is the core linear mixed model function of the widely used lme4 package (Bates et al., 2015a,b). The lm and lmer calls illustrate the basic model syntax of R. In lm, Att2 is modelled as a function of the factors SYS and SUB in an additive manner. From this script we cannot see that these two variables in the (sub)data set headsubset011 are stored as text strings. In lmer, the special syntax (1|SUB) is the way to instruct the lmer function that the SUB factor is to be considered as a random effect. Effects written as in lm are considered fixed effects. The call library(lmerTest) loads the lmerTest package (Kuznetsova et al., 2017a,b), and the last line then applies the mixed model ANOVA function implemented in lmerTest. This is one of the nice features of the R environment: many generic functions such as summary, plot, anova, etc., have been implemented for many different types of result objects, and R identifies which version is relevant for a certain result object. The consequence of using the anova from lmerTest in this way is that a p-value using a so-called Satterthwaite-based denominator degrees of freedom is provided. The Satterthwaite approach is computationally faster than the other often used approach of Kenward-Roger, and works just fine for data like this, see Kuznetsova et al. (2017a,b). Using only the lme4 package, no p-value would be provided. The results of the paired t-test were already given as the output of the R-call shared above.
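The equivalence of the three analyses can be checked on any balanced two-treatment block data. The sketch below uses simulated data, since headsubset011 is not reproduced here; the effect sizes and noise levels are arbitrary assumptions. The random block analysis is run in base R through an error stratum in aov, as a stand-in for the lmer call, and the two samples are passed to t.test as separate vectors rather than through the formula interface.

```r
set.seed(1)
n_sub <- 18

# Simulated stand-in for headsubset011: one score per subject and system
dat <- data.frame(
  SUB = factor(rep(seq_len(n_sub), times = 2)),
  SYS = factor(rep(c("0", "1"), each = n_sub))
)
sub_effect <- rnorm(n_sub, sd = 15)                 # subject (block) effects
dat$y <- 50 + 40 * (dat$SYS == "0") +
  sub_effect[as.integer(dat$SUB)] + rnorm(nrow(dat), sd = 12)

# 1) Paired t-test on the within-subject differences
y0 <- dat$y[dat$SYS == "0"]
y1 <- dat$y[dat$SYS == "1"]
tt <- t.test(y0, y1, paired = TRUE)

# 2) Fixed effect two-way ANOVA (randomised block design)
fixed_tab <- anova(lm(y ~ SYS + SUB, data = dat))

# 3) Random block ANOVA: subjects define an error stratum
block_sum  <- summary(aov(y ~ SYS + Error(SUB), data = dat))
within_tab <- block_sum[["Error: Within"]][[1]]

t_stat   <- unname(tt$statistic)
F_fixed  <- fixed_tab["SYS", "F value"]
F_block  <- within_tab[1, "F value"]
p_from_F <- 1 - pf(F_fixed, 1, n_sub - 1)           # upper tail of F(1, 17)

print(c(t_squared = t_stat^2, F_fixed = F_fixed, F_block = F_block))
print(c(p_ttest = tt$p.value, p_from_F = p_from_F))
```

All three F-values and both p-values agree up to numerical precision, which mirrors the identical p-value 1.782e-07 in the three outputs shown above.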
Both ANOVAs provide an F-test statistic mathematically equal to the square of the paired t-test statistic, $8.4282^2 = 71.03$, and exactly the same p-value, resulting from the F(1, 17)-distribution. More specifically, the probability that a random variable following the F(1, 17)-distribution is larger than or equal to 71.03 is $1.8 \cdot 10^{-7}$. This could also be found in R as 1-pf(71.03, 1, 17). As a post hoc analysis we could utilise the 95 % confidence interval for the SYS difference: (37.6, 62.6). Details of how to obtain this with R are given below.

This is perhaps a good place to recap how we should look at the p-value. The following is taken from Brockhoff et al. (2015a): First of all, the general definition of a p-value is: “the probability of obtaining a test statistic that is at least as extreme as the test statistic that was actually observed”. This probability is calculated under the assumption that the null hypothesis is true. The p-value is used as a general measure of evidence against a null hypothesis (using the given data). The smaller the p-value, the stronger the evidence against the null hypothesis H0. A typical strength-of-evidence scale is given in Table 6.3.

p < 0.001           Very strong evidence against H0
0.001 ≤ p < 0.01    Strong evidence against H0
0.01 ≤ p < 0.05     Some evidence against H0
0.05 ≤ p < 0.1      Weak evidence against H0
p ≥ 0.1             Little or no evidence against H0

Table 6.3 A way to interpret the strength-of-evidence given by the p-value.

The following design extensions illustrate each additional individual complexity that we need to master to do the analysis for most real-life settings:

1. K > 2 levels of the “treatment” factor.
2. Replications, leading to more than a single “block” factor.
3. More than a single treatment factor.
4. Even more complex structures, e.g. incomplete data in various ways in either the treatment (fixed) part or in the block (random) part.

For this general discussion of the various extended design settings we will use the generic names A, B, C and D for the (fixed effect) design factors (“treatment” factors). We will discuss up to a four-factor full factorial setting of the “product” investigated. The name “product” is used for the combination levels of all these factors. In the example headphone data we would then have fixed effect design factors A and B corresponding to systems and samples. In the small subset data analysed above we would have the single (2-level) factor A (systems). We need these generic one-letter names to be able to illustrate the complexity of this. All the products in an experiment are then evaluated by all subjects (S) either once or more than once (in replicates (R)).

The general approach is as follows: if we want the conclusions of our study to be generalisable for the products (i.e. fixed effects A, B, C and D) under study, we should consider effects related to subjects (S) and replication effects (R) as random. In this way we correct for both effects and use the relevant levels of noise and variabilities in the conclusions we make. The natural consequence of this is that all potential interaction effects between one or both of S and R on one side and one or several of A, B, C and D on the other side should also be considered as random effects. This has a potentially major, and appropriate, impact on the analysis compared to a fully fixed effect analysis.
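The rule that every interaction involving S and/or R is random can be turned into a small enumeration. The sketch below lists the random effects of a full model for a given set of product factors; the function name `random_effects` is made up for illustration, and the highest-order term involving S, R and all product factors is left out when replicates are modelled, on the assumption that it is the residual in a design with one observation per cell.

```r
# Enumerate the random effects of the full model: every term that involves
# S and/or R, optionally crossed with any subset of the product factors.
random_effects <- function(products, replicates = FALSE) {
  # all subsets of the product factors, including the empty set
  subsets <- list(character(0))
  for (m in seq_along(products)) {
    subsets <- c(subsets, combn(products, m, simplify = FALSE))
  }
  blocks <- if (replicates) list("S", "R", c("S", "R")) else list("S")
  terms <- character(0)
  for (b in blocks) {
    for (s in subsets) {
      terms <- c(terms, paste(c(b, s), collapse = ""))
    }
  }
  if (replicates) {
    # assumption: the S x R x (all products) term is the residual
    residual <- paste(c("S", "R", products), collapse = "")
    terms <- setdiff(terms, residual)
  }
  terms
}

print(random_effects("A"))                         # subjects only
print(random_effects("A", replicates = TRUE))      # subjects and replicates
print(length(random_effects(c("A", "B", "C", "D"), replicates = TRUE)))
```

With p product factors this gives 2^p random effects when only S is modelled and 3 · 2^p − 1 when both S and R are modelled, so the number of candidate random effects grows quickly with the number of product factors.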

6.4.3.1 Extension 1: K > 2 levels of the “treatment” factor

The first extension step is to allow the single treatment factor to have more than two levels, still evaluated only once by each subject. In sensory evaluation applications this is the regular setting for a complete consumer study with no replicates, i.e. each subject (expert or consumer) evaluates the different levels of the treatment only once. In R this standard randomised block analysis would run as exact copies of the ANOVA script lines shown above for the mixed model analysis. It is also the analysis performed by the first of the three menu tabs in the open source PanelCheck software.
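As a minimal sketch of this extension, the following base R code analyses simulated data with K = 4 treatment levels and 18 subjects; the data, effect sizes and variable names are assumptions for illustration only. The random block analysis is again run through an error stratum in aov, and the Bonferroni-adjusted paired comparisons at the end are a simple stand-in for the post hoc tools mentioned earlier (they use pair-specific rather than pooled error terms).

```r
set.seed(2)
n_sub <- 18
K <- 4

# Balanced randomised block design: every subject scores every level once;
# SUB varies fastest, so rows are ordered by subject within each level of A
dat <- expand.grid(SUB = factor(seq_len(n_sub)), A = factor(seq_len(K)))
a_effect <- c(0, 5, 10, 20)                 # assumed treatment means
s_effect <- rnorm(n_sub, sd = 10)           # assumed subject (block) effects
dat$y <- 50 + a_effect[as.integer(dat$A)] + s_effect[as.integer(dat$SUB)] +
  rnorm(nrow(dat), sd = 8)

# Same script lines as for two levels: fixed two-way ANOVA ...
fixed_tab <- anova(lm(y ~ A + SUB, data = dat))
# ... and the random block version with subjects as error stratum
block_sum  <- summary(aov(y ~ A + Error(SUB), data = dat))
within_tab <- block_sum[["Error: Within"]][[1]]

F_fixed <- fixed_tab["A", "F value"]
F_block <- within_tab[1, "F value"]
df_num  <- fixed_tab["A", "Df"]             # K - 1
df_den  <- fixed_tab["Residuals", "Df"]     # (K - 1) * (n_sub - 1)

# Bonferroni-adjusted paired comparisons between the K levels
pw <- pairwise.t.test(dat$y, dat$A, paired = TRUE,
                      p.adjust.method = "bonferroni")
print(c(F_fixed = F_fixed, F_block = F_block))
print(pw$p.value)
```

The only change from the two-level case is the numerator degrees of freedom, K − 1 instead of 1, so the F-test no longer reduces to a single t-test.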

6.4.3.2 Extension 2: Replications - Two blocking factors

The second extension step is the handling of replications. It is highly relevant, as this will often be the case for sensory panel data. From a design perspective, “replication” can be performed in different ways, i.e. in the same session or in separate sessions. In food sensory applications, the concept and handling of “replicated food samples” is challenging due to sample preparation differences, satiation and other temporal-related effects, discussed further in Næs et al. (2010). This is less of an issue in the sensory evaluation of sound, where it is possible to perfectly replicate stimuli. Thus the concept of “replication” does not have a unique meaning. In this chapter the concept is considered a generic term for all types of replicated design structures.

Considering a single product factor A, which could comprise different codec bitrates of several sound samples, the replicated data situation could be analysed by a simple 2-way ANOVA if e.g. the replications are allocated in a completely randomised way. This would be a 2-way mixed model ANOVA with no specific replication effects included in the model, only including the effects S, A and the interaction SA (top row of Table 6.4), with S and SA as random effects. Under other experimental conditions, e.g. where the replications are allocated and tested in separate sessions, you should analyse it as a 3-way mixed effect model, with S, R, SR, SA and RA as random effects (second row of Table 6.4). These two analyses are available through the last two menu tab options in the PanelCheck software. In practice the two-way analysis amounts to analysing the product effects (F-tests and post hoc analysis) using the Subject-by-Product (= S-by-A) interaction mean square term $MS_{SA}$ as the error term, e.g. that leading to the F-test for product differences:

$$F = \frac{MS_A}{MS_{SA}} \tag{6.3}$$

This analysis is most often recommended in sensory science textbooks, see e.g. Lawless and Heymann (2010). Apart from being a mathematical consequence of assuming the mixed model described, it reflects the following intuitions:

• The analysis is performed on the subject level, avoiding the correlation problem of repeated measures data.
• The same F-test would be found from analysing the averaged replicated data and using the “no replication” analysis mentioned above.
• The differences across subjects’ perception of the products are used as the noise/error term.
• The potential main effects across subjects (individual scale shifts) are not part of the error variability.

The 3-way analysis would be the more correct approach e.g. in the case where the replication structure results from separate test sessions, where all products are evaluated once in each session. The five random effects have specific interpretations, listed in Table 6.5, that may or may not actually be seen in the data. Of these, the first three (S, R and SR) are not related at all to the individual products but allow for different scale usage or “scale shifts” between judges and sessions across all the products. The last two, SA and RA, are the important ones for investigating product differences. The 3-way mixed model analysis leads to an error term which is a combination of these two terms, e.g. that leading to the F-test for product differences:

$$F = \frac{MS_A}{MS_{SA} + MS_{RA} - MS_E} \tag{6.4}$$

The 3-way analysis can be considered a bulletproof approach for all replicated cases, whichever way the replications are organised: if the design does not include a replicate effect, the effects related to R (R, SR and RA) will reflect random variation only, and the replicate part of the 3-way analysis will not affect the results in any way. In this case the 3-way analysis will essentially amount to the 2-way analysis. To see this, try a 2-way analysis and compare it to the 3-way analysis. If they are more or less the same, then you can base your results on the 2-way analysis. However, if they are clearly different, then you should rely only on the 3-way results, as the 2-way analysis then clearly misses important effects.

| Product factors | S | R | Full model (fixed, random) | k1 | Always keep | k2 |
|---|---|---|---|---|---|---|
| A | S | | A, S, SA (PanelCheck 2-way) | 2 | S, SA | 2 |
| A | S | R | A, S, SA, R, RA, SR (PanelCheck 3-way) | 5 | S, SA, R, RA | 4 |
| A, B | S | | A, B, AB, S, SA, SB, SAB | 4 | S, SAB | 2 |
| A, B | S | R | A, B, AB, S, SA, SB, SAB, R, RA, RB, RAB, SR, SRA, SRB | 11 | S, SAB, R, RAB | 4 |
| A, B, C | S | | A, B, C, AB, AC, BC, ABC, S, SA, SB, SC, SAB, SAC, SBC, SABC | 8 | S, SABC | 2 |
| A, B, C | S | R | A, B, C, AB, AC, BC, ABC, S, SA, SB, SC, SAB, SAC, SBC, SABC, R, RA, RB, RC, RAB, RAC, RBC, RABC, SR, SRA, SRB, SRC, SRAB, SRAC, SRBC | 23 | S, R, SABC, RABC | 4 |
| A, B, C, D | S | | A, B, C, D, AB, AC, AD, BC, BD, CD, ABC, ABD, ACD, BCD, ABCD, S, SA, SB, SC, SD, SAB, SAC, SAD, SBC, SBD, SCD, SABC, SABD, SACD, SBCD, SABCD | 16 | S, SABCD | 2 |
| A, B, C, D | S | R | A, B, C, D, AB, AC, AD, BC, BD, CD, ABC, ABD, ACD, BCD, ABCD, 16+16+15 combinations of random effects | 47 | S, SABCD, R, RABCD | 4 |

Table 6.4 Overview of the various full model situations with one, two, three or four product factors (A, B, C and D) and with either only Subject (S) effect or both Subject (S) and Replication (R). The number k1 expresses the number of random effects in the full models. The number k2 expresses the minimum number of random effects in the selected final models using the default keep-option.

| Effect | Interpretation |
|---|---|
| S | Each subject has his/her own average score |
| R | Each replicate session has its own average score |
| SR | Each subject-replicate combination has its own average score |
| SA | Subjects have an average score per product |
| RA | Each product has an average score per replicate |

Table 6.5 The five possible random effects in replicated sensory evaluations.

As a practical recommendation to PanelCheck users: always use the 3-way analysis as the default if you have 3-way data. If the data set does not contain a variable holding the replication information, then the 3-way analysis obviously cannot be done. Our experience is that the order of evaluation and/or the session number is typically saved in a variable, called the R-factor. In PanelCheck, such a column is required in your data set. The analysis of such replicated data performed automatically by the PanelCheck software is a pragmatic way to bypass the sometimes challenging discussion of the proper analysis for specific data sets. One may protect against some of the rare cases of “negative variances”, e.g. when $MS_{RA}$ is smaller than $MS_{E}$, in various ways. In PanelCheck, the two-way error $MS_{SA}$ is used whenever $MS_{RA}$ is smaller than $MS_{E}$. It can be discussed which effects should be handled as random and which as fixed, see e.g. Næs et al. (2010); Næs and Langsrud (1998) for more details. Different choices lead to different types of conclusions from the data analysis. In brief, if we want the conclusions of our study to be generalisable for the products (i.e. systems and samples) under study, we should consider effects related to subjects and replications as random. In this way we correct for both effects and use the relevant levels of noise and variability in the conclusions we make.
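The F-ratio of Eq. (6.4), together with the fallback to the two-way error described above, can be sketched in base R. This is only an illustration under stated assumptions: the data are simulated, the design sizes (8 subjects, 4 products, 2 replicates) and all object names are hypothetical, and the code is not the PanelCheck implementation.

```r
# Hypothetical balanced design: 8 subjects (S) x 4 products (A) x 2 replicates (R),
# one observation per cell. The data are simulated for illustration only.
set.seed(42)
d <- expand.grid(S = factor(1:8), A = factor(1:4), R = factor(1:2))
subj_shift <- rnorm(8)[d$S]               # individual scale shifts (S)
prod_mean  <- c(0, 0.5, 1, 1.5)[d$A]      # fixed product effect (A)
d$y <- 5 + prod_mean + subj_shift + rnorm(nrow(d), sd = 0.7)

# All main effects and 2-way interactions; the 3-way term is the residual (E).
fit <- aov(y ~ (A + S + R)^2, data = d)
tab <- summary(fit)[[1]]
ms  <- setNames(tab[["Mean Sq"]], trimws(rownames(tab)))

# Denominator of Eq. (6.4); when MS_RA < MS_E the two-way error MS_SA is used
# instead (the rule the text attributes to PanelCheck).
den <- if (ms[["A:R"]] < ms[["Residuals"]]) {
  ms[["A:S"]]
} else {
  ms[["A:S"]] + ms[["A:R"]] - ms[["Residuals"]]
}
F_A <- ms[["A"]] / den
F_A
```

No p-value is computed here, because the degrees of freedom of the combined denominator require an approximation that the text does not specify.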

6.4.3.3 Extension 3: More than a single treatment factor

The next important extension, of particular value in audio sensory studies, allows the handling of more than a single “product”/“treatment” (fixed) effect factor. Often, audio systems will be evaluated in combination with a stimulus/sample, and often more than a single stimulus is used, leading to a multilevel stimulus-by-system (A-by-B) situation. On the one hand, the approaches discussed so far, including the use of PanelCheck, can always be used by simply treating any multi-factorial setting as a one-product-factor setting, where each available factor combination is considered a different product. This technique is commonly referred to as unfolding. This approach provides an analysis with no distinction between the fixed effect factors which were part of the DoE, nor any information regarding potential fixed effect interactions. Interactions between stimulus and system, i.e. a system difference that depends on the stimulus/sample, are common and could be particularly interesting, but would be inaccessible. Furthermore, all factors would be treated as fixed, which may be an inappropriate assumption for certain random effect factors, e.g. a replicate factor. As a result the important noise structures are not at the “single unfolded factor” level, but more related to certain specific factors. What is called for is an analysis in which everything is investigated based on the actual factors from the DoE, and in which effects are identified properly as main effects and interactions. The headphone data introduced earlier in Section 6.4.1 belongs to this category with two treatment factors. This corresponds to row 4 in Table 6.4. Row 3 would be the same product situation but without replications. As you can see from the headphone discussion above, this type of data can be decomposed into 14 different effects: 4 main effects, 6 2-factor interactions, and 4 3-factor interactions. The four-way combination is exactly the observational level of the study: we have one observation for each combination of A, B, S and R (corresponding respectively to systems (SYS), samples (SAM), subjects (SUB) and replications (REP)). Having accepted that the S and R effects should be random effects, and that in the “unfolded single factor” view we should include the five random effects from Section 6.4.3.2, it should be clear that all the expanded effects including S and R should also be random. This also reflects the common principle of relevant model choices in mixed models. Thus, only the three effects A, B and AB are fixed and the remaining 11 effects should be considered as random effects noise components. The total number of random effects for each setting is stated as k1 in Table 6.4. The table has been extended with the explicit descriptions of the full 3- and 4-way product factorials, showing how rapidly the size of the problem increases. With a 4-way product design, and with replicated data, the number of different possible random effects is 47. On the one hand, it is recommended to run “the full model” to avoid missing important effects. On the other hand, the “full model” may be challenging to identify and to run. Some of the software tools described below can handle the full model with all the random effects. However, this may not be the best model to identify the most relevant information, partially due to the number of poorly determined factors with few degrees of freedom.
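The k1 counts of Table 6.4 can be reproduced by enumerating model terms. The base R sketch below uses a hypothetical helper, random_effects(), written only for this illustration: every term that involves S or R is counted as random and, in the replicated case, the highest-order term is left out because it is the observational level. Terms are labelled in R's colon notation (e.g. "A:S" for SA).

```r
# Enumerate the random effects of the full model for given product factors.
# random_effects() is a hypothetical helper, not part of any package.
random_effects <- function(product, replicated = TRUE) {
  facs <- c(product, "S", if (replicated) "R")
  # every non-empty subset of the factors is a candidate model term
  subsets <- unlist(
    lapply(seq_along(facs), function(k) combn(facs, k, simplify = FALSE)),
    recursive = FALSE
  )
  # a term is random as soon as it involves S or R
  is_random <- vapply(subsets, function(s) any(s %in% c("S", "R")), logical(1))
  labels <- vapply(subsets[is_random], paste, character(1), collapse = ":")
  # with replicates, the highest-order term is the observational level
  if (replicated) labels <- setdiff(labels, paste(facs, collapse = ":"))
  labels
}

designs <- list("A", c("A", "B"), c("A", "B", "C"), c("A", "B", "C", "D"))
k1 <- sapply(designs, function(p) {
  c(no_rep = length(random_effects(p, FALSE)),
    rep    = length(random_effects(p, TRUE)))
})
k1   # rows: without / with replication; columns: 1 to 4 product factors
```

For the replicated two-factor headphone situation, random_effects(c("A", "B")) lists the 11 random terms discussed above.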
We will allow the data to guide us towards the best model choice, whether a “single unfolded factor” approach or the full model approach. A specific design structure would normally call for a specific statistical analysis. The formal design phase, and the subsequent identification of which formal statistical model matches the design used, is where most people without substantial formal statistical training would be challenged. Expert statistical knowledge might be needed to accomplish this; however, there is a lack of expert statisticians globally to handle all such analyses. In order to avoid the need for further support, we suggest that if you have collected the data, you should be capable of analysing it. The basic idea is to make sure that initially we have identified ALL possible structures that could even (mathematically) potentially be part of the model, and then let the data speak for itself to tell us which are really important. The point is that potentially irrelevant random effects, e.g. related to replication structures, would generally not show up as important in the data, and since we will remove certain non-important random effects automatically (Kuznetsova et al., 2015b), the analysis will roughly match the more formally correctly chosen analysis. Statisticians from the more formal end of the scale could be concerned about the “inflation of p-values” as a consequence of selection procedures like this. However, when used for the random part of the model, the classical “selection bias” concern is not so straightforward. Such an approach is, as described in the introduction above, akin to classical repeated measures correction techniques where some empirical estimation of the correlation structures is used to achieve better information about the fixed effects that are of interest, see e.g. Keselman et al. (2001).
In practice the full model analysis is rarely performed in complex settings because of its complexity, and a researcher might rely on a simplified analysis instead. Compared to an overly simplified analysis, an analysis based on automated investigation of complex structures could, rather than leading to “too optimistic results” (the usual concern about inflated p-values), lead to “less significant results”, as potentially important correlation structures that would otherwise be ignored are identified and taken into account. Furthermore, the multi-attribute setting provides us with the additional possibility to judge the importance of certain random noise effects across attributes. We suggest basic restrictions on the automated random effect model search to be used in such multi-factorial sensory experiments to avoid certain selection issues. The default restriction, defined in the Shiny App GUI version of the SensMixed package (Kuznetsova et al., 2014), retains the main effects of S and R in the models together with the highest order random interaction term for both. What that means for each setting is stated explicitly in the “Always keep” column of Table 6.4. In the replicated setting, it can also be seen as always keeping four of the five random effects listed in Section 6.4.3.2 when viewed as an “unfolded single factor” setting. The automated random effect procedure can hence be considered as a random model validation process: is it good practice to make this very simplified assumption of the random noise structure, or do the data call for a more elaborate error model? If a more elaborate error model is needed, it will be detected and used for the subsequent fixed effect analysis. The basic automated functionality is provided by the step function in the lmerTest package, which generically also offers a keep option, where the user can specify which effects the procedure should never try to eliminate from the model. In the SensMixed package further model choice options are given together with the described default choice for the keep option.

6.4.3.4 Extension 4: Even more complex structures

The approach and tools presented here and implemented in SensMixed, ConsumerCheck and lmerTest can also handle more complex data and model structures including missing data, product incompleteness by design and/or practical necessity, order and carry-over effects and complex blocking structures, as e.g. in Bavay et al. (2014). Furthermore, so-called external preference mapping, where consumer preferences are modelled as regression functions of sensory profiles principal component scores, leading to the mixed models of the random coefficient type (Kuznetsova et al., 2017b, 2014), can be handled by these approaches and tools.

6.4.4 Headphone data analysis, modelling by mixed models

The headphone experiment requires extensions 1, 2 and 3 to be handled properly. The entire analysis as described above is carried out by the R-scripts given here:

library(lmerTest)
## Running the full mixed model:
lmerAtt2 [...]

Applied Multivariate Statistics

> res.chi [...]
> names(res.chi)
[1] "statistic" "parameter" "p.value"   "method"    "data.name" "observed"
[7] "expected"  "residuals" "stdres"
> res.chi$residuals
                    French songs Pop music Classical World  Rock  Jazz
Farmer                      2.79     -0.24     -1.43 -1.08 -1.17 -2.15
Artisan, shopkeeper        -0.52     -0.71     -0.55  3.44 -0.95  0.96
Executive                  -3.52     -2.17      4.41  1.79  1.49  2.70
Middle class               -0.38     -0.10      0.30 -1.91  1.08  1.60
Employee                   -1.12      2.38     -1.21  0.00  0.82 -0.56
Skilled worker              0.37      1.95     -2.24 -0.57  0.62 -1.24
Unskilled worker            0.18      2.51     -2.22 -0.42  0.40 -1.72
Pensioner                   2.19     -3.79      4.24 -2.61 -2.72  0.76
Other non working          -0.04      0.51     -1.47  1.03  0.57 -0.37

The residuals correspond to the Pearson residuals (R Core Team, 2016; Agresti, 2007), denoted $r_{ij}$. By definition, they are calculated using the following formula:

$$r_{ij} = \frac{O_{ij} - T_{ij}}{\sqrt{T_{ij}}}. \qquad (7.3)$$

Although these numbers seem quite simple to understand and to interpret, they do not provide a straightforward indicator of the discrepancy in terms of independence (Agresti, 2007). By definition, under the null hypothesis, i.e. under the hypothesis of independence, the probability of having a certain occupation and listening to a specific music genre is the product of these two probabilities. In terms of estimated probabilities, if $f_{ij} = n_{ij}/n$, $f_{i.} = n_{i.}/n$, and $f_{.j} = n_{.j}/n$, the null hypothesis can be expressed the following way:

$$f_{ij} = f_{i.} \times f_{.j}, \qquad (7.4)$$

or equivalently

$$\frac{f_{ij}}{f_{i.} \times f_{.j}} - 1 = 0. \qquad (7.5)$$

The differences

$$\left\{ \frac{f_{ij}}{f_{i.} \times f_{.j}} - 1,\ i \in I,\ j \in J \right\} \qquad (7.6)$$

define the discrepancy matrix from the situation of independence. To obtain such a matrix, I will first calculate the two vectors comprising the $f_{i.}$ and the $f_{.j}$, using the very important apply function.

> fi [...]
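As a minimal sketch of these calculations, the base R code below computes the marginal frequencies with apply, then the Pearson residuals of Eq. (7.3) and the discrepancy matrix of Eq. (7.6). The 3 x 4 table and all object names are hypothetical (this is not the music table), and the counts expected under independence are assumed to be what the text denotes by $T_{ij}$.

```r
# Hypothetical 3 x 4 contingency table (NOT the music data from the text).
tab <- matrix(c(40, 25, 10, 25,
                20, 30, 35, 15,
                15, 45, 20, 20),
              nrow = 3, byrow = TRUE,
              dimnames = list(c("row1", "row2", "row3"),
                              c("col1", "col2", "col3", "col4")))
n <- sum(tab)

# Marginal relative frequencies f_i. and f_.j, computed with apply
fi <- apply(tab, 1, sum) / n
fj <- apply(tab, 2, sum) / n

# Counts expected under independence: n * f_i. * f_.j
expected <- n * outer(fi, fj)

# Pearson residuals, Eq. (7.3)
pearson <- (tab - expected) / sqrt(expected)

# Discrepancy matrix from the situation of independence, Eq. (7.6)
discrepancy <- (tab / n) / outer(fi, fj) - 1

# The hand-made residuals agree with those returned by chisq.test
res.chi <- chisq.test(tab)
all.equal(pearson, res.chi$residuals)
round(discrepancy, 2)
```

A positive entry of the discrepancy matrix means that the cell is over-represented relative to independence, a negative entry that it is under-represented.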

These numbers define a profile of the occupations on the one hand and of the music genres on the other hand. These multivariate profiles (as vectors of $\mathbb{R}^J$ and $\mathbb{R}^I$) can be directly interpreted in terms of difference from the independence model. For instance, people who belong to the Farmer class tend to listen to French songs (with a positive discrepancy value of 0.32), but they are not inclined to listen to Classical, World, Rock, or Jazz music (with negative values ranging from -0.28 to -0.60). Similarly, people who belong to the Executive class tend to listen to Jazz and Classical (with positive values of 0.90), but they are not inclined to listen to French songs or Pop music (with negative values of -0.43 and -0.37, respectively). Looking at these profiles will allow us to see whether people with certain occupations listen to specific music genres. Moreover, we want to identify the proximities or the distances between certain occupations that may be noteworthy: this notable piece of information will constitute a feature of my data set, a characteristic line of my data set, in other words a dimension of my data set. This is how a bivariate question turns into a multidimensional issue. Naturally, if you had to compare two occupations in terms of difference from the independence model, you would calculate a distance based on the differences for each music genre. As the different music genres have different weights relative to each other, you would naturally take this information into account, and calculate a distance weighted by the relative importance of each music genre. In other words, we are going to consider the following distance between two occupations $i$ and $i'$:

$$d^2(i,i') = \sum_{j\in J} f_{.j}\left(\left(\frac{f_{ij}}{f_{i.}\times f_{.j}}-1\right)-\left(\frac{f_{i'j}}{f_{i'.}\times f_{.j}}-1\right)\right)^2, \qquad (7.7)$$

which can also be written

$$d^2(i,i') = \sum_{j\in J} \frac{1}{f_{.j}}\left(\frac{f_{ij}}{f_{i.}}-\frac{f_{i'j}}{f_{i'.}}\right)^2, \qquad (7.8)$$

or

$$d^2(i,i') = \sum_{j\in J} \frac{n}{n_{.j}}\left(\frac{n_{ij}}{n_{i.}}-\frac{n_{i'j}}{n_{i'.}}\right)^2. \qquad (7.9)$$
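The equivalence of the three expressions (7.7), (7.8) and (7.9) can be checked numerically. The sketch below uses a hypothetical contingency table (not the music data) and compares its first two rows; the index k stands for $i'$.

```r
# Hypothetical 3 x 4 contingency table (NOT the music data from the text).
tab <- matrix(c(40, 25, 10, 25,
                20, 30, 35, 15,
                15, 45, 20, 20), nrow = 3, byrow = TRUE)
n  <- sum(tab)
f  <- tab / n
fi <- rowSums(f)
fj <- colSums(f)
i <- 1; k <- 2            # the two rows to compare (k plays the role of i')

# Eq. (7.7): weighted squared difference between rows of the discrepancy matrix
disc  <- f / outer(fi, fj) - 1
d2_77 <- sum(fj * (disc[i, ] - disc[k, ])^2)

# Eq. (7.8): the same distance written with the row profiles f_ij / f_i.
d2_78 <- sum((f[i, ] / fi[i] - f[k, ] / fi[k])^2 / fj)

# Eq. (7.9): the same distance written with the raw counts
ni <- rowSums(tab)
nj <- colSums(tab)
d2_79 <- sum(n / nj * (tab[i, ] / ni[i] - tab[k, ] / ni[k])^2)

c(d2_77, d2_78, d2_79)    # three identical values
```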

This last formula calls for several comments. The distance between $i$ and $i'$ (in terms of difference from the independence model) amounts to comparing the profile of $i$ with the profile of $i'$, music genre by music genre. The relative weight of each music genre impacts the distance. This impact is inversely proportional to the relative weight of the music genre, which makes sense, as it means that, for a given music genre, a difference between two statuses is stressed when the music genre is really linked to status. With a distance and a system of masses, we can now focus on the very important concept of inertia. By definition, as recalled in our introduction, the inertia of our ensemble of occupations is obtained by the following calculation:

$$I(N_I) = \sum_{i\in I} f_{i.}\, d^2(i, O), \qquad (7.10)$$

where $I(N_I)$ is the inertia of the rows, $N_I$ denotes the scatter plot of the rows in the $\mathbb{R}^J$ space, and $O$ denotes the centre of gravity of $N_I$. One can easily show that

$$I(N_I) = \frac{D}{n}, \qquad (7.11)$$

hence the name of the distance between two row profiles, which is called the chi-square distance:

$$d^2(i,i') = d^2_{\chi^2}(i,i') = \sum_{j\in J} \frac{n}{n_{.j}}\left(\frac{n_{ij}}{n_{i.}}-\frac{n_{i'j}}{n_{i'.}}\right)^2. \qquad (7.12)$$
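Eq. (7.11) can be illustrated numerically under the assumption that $D$ denotes the chi-square statistic of the contingency table. The sketch below uses a hypothetical table (not the music data), computes the inertia of the row profiles from Eq. (7.10) and compares it with the chi-square statistic divided by $n$.

```r
# Hypothetical 3 x 4 contingency table (NOT the music data from the text).
tab <- matrix(c(40, 25, 10, 25,
                20, 30, 35, 15,
                15, 45, 20, 20), nrow = 3, byrow = TRUE)
n  <- sum(tab)
fi <- rowSums(tab) / n                  # masses of the rows, f_i.
fj <- colSums(tab) / n                  # weights of the columns, f_.j
profiles <- tab / rowSums(tab)          # row profiles n_ij / n_i.

# Centre of gravity O: the weighted mean of the row profiles, equal to f_.j
O <- colSums(fi * profiles)

# Squared chi-square distance between each row profile and O
d2_O <- rowSums(sweep(sweep(profiles, 2, O, "-")^2, 2, fj, "/"))

# Inertia of the rows, Eq. (7.10)
inertia <- sum(fi * d2_O)

# Chi-square statistic (assumed to be the D of Eq. (7.11))
chi2 <- unname(chisq.test(tab)$statistic)
c(inertia = inertia, chi2_over_n = chi2 / n)
```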

The multidimensional analysis of the $\left\{ \frac{f_{ij}}{f_{i.} \times f_{.j}} - 1,\ i \in I,\ j \in J \right\}$ matrix consists in finding the dimensions that maximise the inertia of the orthogonal projection of the rows on the dimensions. In other words, these are the dimensions for which the categories of the variable occupation and the categories of the variable music genre are the most dependent: these are the dimensions for which the correspondence between the categories of the variable occupation and the categories of the variable music genre is the highest. These dimensions are obtained by applying a so-called correspondence analysis (CA) to the contingency table we named music in our earlier R session (see Page 197). As you have hopefully understood, correspondence analysis is the multidimensional method dedicated to the analysis of the dependence between two categorical variables, from the point of view of their categories. In other words, CA is dedicated to the analysis of the correspondence between the categories of one categorical variable and the categories of another. To apply this method to our contingency table, I will use the CA function of the FactoMineR package, then I will use the plot.CA function in order to represent the rows of our data set, i.e. the occupations.

> res.ca <- CA(music, graph=FALSE)
> plot.CA(res.ca,invisible="col")

As you can see in the code, the use of the CA function is pretty straightforward. In this example, the main input is the matrix music, and I have deliberately asked R not to show any graphical outputs (graph=FALSE). Then, the numerical results have been stored in an R object named res.ca. This is the object I will use as input for the plot.CA function, in order to get customised graphical outputs. In other words, for the plot.CA function, I have used two arguments: the first one indicates the name of the object in which I have saved the results of my CA, the second one indicates that the columns of my contingency table should be invisible (i.e. should not be plotted). Figure 7.1 is the result of the two lines of code. From Figure 7.1 you can see that the first dimension differentiates the two categories Executive and Pensioner from the rest of the categories, and in particular from the categories Unskilled worker and Skilled worker. The second dimension differentiates the categories Farmer and Pensioner from the category Executive. These oppositions are due to the differences in terms of correspondence between people of these categories and the music genres they listen to. At this point, we still cannot tell how occupations and music genres correspond with each other. To do this, we have to understand the dimensions in terms of music genres as well. Using the plot.CA function, I will plot the music genres on the dimensions provided by my CA applied to my contingency table music.

> plot.CA(res.ca,invisible="row")

Remark: By default, the first two main dimensions of variability are plotted. This can be changed using the argument axes. For instance, to plot the second and the third axes, you have to set this argument the following way axes=c(2,3). In general the number of dimensions to be analysed is something for you to carefully consider (Borg et al., 2012; Greenacre, 2017).

Figure 7.1 Representation of the statuses on the first two dimensions (Dim 1: 55.39%, Dim 2: 31.41%) resulting from a CA on the contingency table music, using the plot.CA function.

Figure 7.2 Representation of the music genres on the first two dimensions (Dim 1: 55.39%, Dim 2: 31.41%) resulting from a CA on the contingency table music, using the plot.CA function.

As you can see in Figure 7.2, the first dimension differentiates the two categories Jazz and Classical from the remaining categories, and in particular from the category Pop music. The second dimension differentiates the category French songs from the category World. As you can easily guess, the analysis of our contingency table through its rows on the one hand, and its columns on the other hand, leads to graphical outputs with dimensions that have the same explained variances (shown in the axis labels), and which should be interpreted conjointly. This conjoint interpretation can be done through a very important concept in multidimensional analysis, namely the concept of the transition formulae. In a CA framework, this concept can be written the following way:

$$F_s(i) = \frac{1}{\sqrt{\lambda_s}} \sum_{j=1}^{J} \frac{n_{ij}}{n_{i.}}\, G_s(j), \qquad (7.13)$$

where $F_s(i)$ is the coordinate of row $i$ on the dimension of rank $s$, $G_s(j)$ is the coordinate of column $j$ on the dimension of rank $s$, and $\lambda_s$ is the variance of the dimension of rank $s$. As you can see, this formula expresses the fact that the position of a row on a dimension depends on the position of the columns on the same dimension. The position of a row on a dimension is a “linear combination” of the position of the columns; in other words, to obtain the coordinate of a row we calculate some kind of weighted average of the coordinates of the columns. The weights depend on the way the row and the columns are associated: the higher $n_{ij}/n_{i.}$, i.e. the more strongly the row is associated with that column, the closer row $i$ will be to column $j$. We could expect this, as the dimensions are based on the discrepancy matrix from the situation of independence. Thanks to this property, often named the barycentric property, it is possible to plot one row and its correspondences with the columns. A contingency table is intrinsically a symmetric entity. The considerations I made with the rows could also be made with the columns. In this case the barycentric property illustrated previously can be written the following way:

$$G_s(j) = \frac{1}{\sqrt{\lambda_s}} \sum_{i=1}^{I} \frac{n_{ij}}{n_{.j}}\, F_s(i). \qquad (7.14)$$
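The transition formulae (7.13) and (7.14) can be verified numerically. The sketch below does not use FactoMineR: as an assumption, it computes the CA coordinates with base R from the singular value decomposition of the standardised residual matrix, which is a standard construction of CA, and then rebuilds the row coordinates from the column coordinates and vice versa. The table and all object names are hypothetical.

```r
# Hypothetical 4 x 3 contingency table (NOT the music data from the text).
N <- matrix(c(30, 10,  5,
              12, 25,  8,
               6, 14, 28,
              20, 18, 16), nrow = 4, byrow = TRUE)
P  <- N / sum(N)
r  <- rowSums(P)            # row masses f_i.
cm <- colSums(P)            # column masses f_.j

# Standardised residuals and their SVD (assumed construction of the CA,
# not taken from the FactoMineR source code)
S  <- diag(1 / sqrt(r)) %*% (P - outer(r, cm)) %*% diag(1 / sqrt(cm))
sv <- svd(S)
K  <- min(dim(N)) - 1       # number of non-trivial dimensions
sing   <- sv$d[1:K]
lambda <- sing^2            # variance (eigenvalue) of each dimension

# Principal coordinates of the rows, F_s(i), and of the columns, G_s(j)
Fr <- diag(1 / sqrt(r))  %*% sv$u[, 1:K] %*% diag(sing, K)
Gc <- diag(1 / sqrt(cm)) %*% sv$v[, 1:K] %*% diag(sing, K)

# Eq. (7.13): rows as weighted averages of the columns, scaled by 1/sqrt(lambda_s)
F_from_G <- (N / rowSums(N)) %*% Gc %*% diag(1 / sing, K)
# Eq. (7.14): columns as weighted averages of the rows, scaled by 1/sqrt(lambda_s)
G_from_F <- (t(N) / colSums(N)) %*% Fr %*% diag(1 / sing, K)

all.equal(Fr, F_from_G)
all.equal(Gc, G_from_F)
```

The factor $1/\sqrt{\lambda_s}$ is larger than 1, so each row lies further from the origin than the plain weighted average of the columns would place it.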

Remark: This is precisely how the graphical outputs provided by CA should be interpreted in terms of correspondences: one row amongst the columns, or one column amongst the rows. To do this, I will use the plot.CA function, specifying the selectRow argument for the row I want to use. In the following example, I am going to illustrate the correspondences between the category Executive and the music genres.

> plot.CA(res.ca,selectRow="Executive",title="Correspondence between Executive and the musics")

Figure 7.3 Representation of the correspondence between Executive and the music genres on the first two dimensions (Dim 1: 55.39%, Dim 2: 31.41%) resulting from a CA on the contingency table music, using the plot.CA function.

As in Figure 7.2, the first dimension of Figure 7.3 shows an opposition between the categories Jazz and Classical on the one hand, and Pop music and French songs on the other, in terms of correspondences with the category Executive. As you can see, this opposition quite successfully illustrates the discrepancy matrix calculated previously on Page 199, i.e. the fact that people who belong to the Executive class tend to listen to Jazz and Classical music, as these three categories are located in the same region of the figure. However, they are not inclined to listen to Pop music or French songs, as Executive has a high positive score on the first dimension, whereas Pop music and French songs have high negative scores on the same dimension. On the second dimension, we can see that people who belong to the Executive class tend to listen to Rock and World music (but to a lower extent compared to the first dimension), as these three categories have positive scores on the second dimension. Although it is only a graphical approximation of the discrepancy matrix (it explains 86.8 % of the information contained in this matrix), this visualisation is invaluable as it shows how the dimensions of the matrix are structured. With a similar aim, we can get an automatic description of the contingency table in terms of significant correspondences between the rows and the columns. To do this, I will use the descfreq function of the FactoMineR package. The results provided by the function consist of a list of the categories that are statistically significantly related to our category of interest. The list is provided by a test based on the comparison between two percentages, a first one calculated within the category of interest, a second one calculated over the observed population. This comparison is done using the hypergeometric distribution and is described in detail in Husson et al. (2017). Let’s apply the descfreq function to the contingency table music found on Page 197, and let’s have a look at the category Executive. In order to do this, I will first store the results in a list arbitrarily named res.descfreq and then I will have a look at the element associated with the category of interest.

> res.descfreq <- descfreq(music)
> res.descfreq$Executive
             Intern %    glob % Intern freq Glob freq      p.value    v.test
Classical    28.57143 15.067466          46       201 3.534254e-06  4.637045
Jazz         10.55901  5.547226          17        74 9.918812e-03  2.578647
Pop.music    13.04348 20.839580          21       278 9.521877e-03 -2.592725
French.songs 24.22360 42.278861          39       564 5.957766e-07 -4.992581

In this example, we can see, from the v.test column, that the category Executive is significantly related to the categories Classical and Jazz on the one hand, and inversely related to Pop music and French songs on the other hand, as shown by the opposite signs of the v.test values. With a p-value of 3.53e-06, we can see that $n_{ij}/n_{i.}$ is significantly higher than $n_{.j}/n$, where $i$ corresponds to the category Executive and $j$ corresponds to the category Classical. In other words, within the category Executive, the percentage of people who listen to Classical music (here, 28.57 %) is significantly higher than the percentage of people who listen to Classical music in the whole population (here, 15.07 %). Similarly, with a p-value of 5.96e-07, we can see that within the category Executive, the percentage of people who listen to French songs (here, 24.22 %) is significantly lower than the percentage of people who listen to French songs in the whole population (here, 42.28 %). People who belong to the Executive class tend to listen to Jazz and Classical music, but they are not inclined to listen to Pop music and French songs. This information, although it is very operational, cannot replace the unique and invaluable information contained in the graphical outputs provided by CA in terms of dimensions and structure, in other words the multidimensional information inherent¹ to the contingency table.

¹ Inherent in the sense of the German word eigen, as used in the words eigenvalue and eigenvector.
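The idea behind the test can be sketched with base R for the pair Executive and Classical, using only the counts shown in the output above. This is an illustration and not the descfreq source code: the doubling of the tail probability and the conversion to a normal quantile are assumptions about the convention used, so the values obtained need not match the p.value and v.test columns exactly.

```r
# Counts read from the descfreq output above (Executive x Classical)
n_ij <- 46                        # Executives listening to Classical ("Intern freq")
n_j  <- 201                       # all Classical listeners ("Glob freq")
n_i  <- round(46 / 0.2857143)     # Executive total implied by "Intern %": 161
n    <- round(201 / 0.15067466)   # grand total implied by "glob %": 1334

# Under independence, the number of Classical listeners among n_i people drawn
# from the n respondents follows a hypergeometric distribution.
p_upper <- phyper(n_ij - 1, n_j, n - n_j, n_i, lower.tail = FALSE)  # P(X >= 46)

# ASSUMED convention: two-sided p-value by doubling the tail, then the
# corresponding quantile of the standard normal distribution as test value.
p_two  <- min(1, 2 * p_upper)
v_test <- qnorm(p_two / 2, lower.tail = FALSE)
c(p_upper = p_upper, p_two = p_two, v_test = v_test)
```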

7.2.3 In summary

This section was the beginning of our journey into the multidimensional world. I tried to illustrate the fact that although bivariate does not seem “very multivariate”, it can indeed be multidimensional in nature. From this perspective, the section provides the general concept of multidimensionality. In addition, I tried to present the main principles behind the visualisation of multidimensional data. These principles are based on the concept of dimension reduction, and therefore in search of dimensions onto which data may be projected and represented according to certain criteria. I decided to start from this example, and therefore from correspondence analysis, for many reasons. It is a very familiar model to start with, as everyone has at least once encountered a situation in which (in)dependence between two categorical variables has to be assessed. The mere fact that it is a familiar model justifies its introduction at this early stage of the chapter. But I should say that this choice is debatable as CA is not usually presented at such an early stage when introducing multidimensional analysis. This is due to the metric used in CA, where rows and columns are weighted, which adds some complexity. The example is easy to understand and is somehow universal as it answers a common sociological question that may concern everyone: “tell me who you are and I will tell you what you are listening to”. Actually, this example is historical and in the spirit of the French sociologist and philosopher, Pierre Bourdieu I wanted to present CA in its original context (Bourdieu, 1984; Benzécri, 2002). Finally, the method is a cornerstone of what in France is called Exploratory Multivariate Analysis or also more correctly (according to me) Multidimensional Data Analysis. 
Correspondence analysis can be used in an infinity of situations but is often misused either because of its versatility or because of its interpretability, and that is also why I decided to present it at the very beginning of the chapter (Escofier, 2003; Govaert, 2013).

7.3 ONE OF THE MAIN ISSUES IN MULTIDIMENSIONAL ANALYSIS: CHOOSING THE PROPER DISTANCE

Having presented the principles of multidimensional analysis in the previous section, I will now present the very important concept of active variables, or more generally of active columns. In a very general way, the so-called active columns are the columns that cause the variability amongst the rows. In other words, these are the columns that are used to calculate a distance between the rows. Alternatively, I could say that these columns define the perspective based on my research question. Therefore, the choice of these columns is not only mandatory, but must also be justified in relation to the research question. Once these columns have been chosen, we can add as many additional illustrative columns as desired. This second type of information (also named supplementary information), as indicated by its name, is used to illustrate (or to supplement) the visualisation of the rows and their relative positioning caused by the active columns.

This choice should be made prior to data collection, as part of the experimental design discussed further in Chapter 4. Once research questions are defined, it is important to define how they will be addressed. In other words, it is important to draw up a list of measures or variables that, it is hoped, will provide relevant information. The “core” of these measures constitutes our set of active variables (or active columns), the “rest” constitutes our set of illustrative variables.

| Instrument | Pitch | Tempo |
|---|---|---|
| Piano | Octave 1 | 80 |
| Violin | Octave 3 | 120 |
| Accordion | Octave 5 | 160 |

Table 7.1 For the variations, we used all the possible combinations between the three levels of the factors, i.e. $3^3 = 27$.

To illustrate my point, I will use data collected to understand the concept of sound logos (also discussed in Chapter 13). These data were collected by three of my former students (Julien Brécheteau, Kévin Guillamet, and Anaïs Lebastard) during a project I supervised in 2012 (see Appendix A.3). For this project, we wanted to develop a methodology that would help in choosing the best version of a sound logo. The selected logo should fit the expectations of a client, in the sense that it should be associated with some intrinsic values of the client’s brand, and should touch the consumer in accordance with these values (Brécheteau et al., 2012). In other words, our objectives were to:
• Understand the values associated with a sound logo.
• Assess how well the sound logos’ emotional impact aligns with the brand values.

In practice, using a sound logo melody, we created 27 variations based on three factors with three levels each: the instrument, the pitch, and the tempo (see Table 7.1). The data were obtained in a two-step procedure.

First, the participants had to answer the following question: “How would you describe a company with such a sound logo?” To answer this question, they had at their disposal the following list of adjectives: handcrafted, complex, convivial, premium, innovative, international, free, recreational, original, passionate, performance, popular, related, serious, simple, sober, technological, traditional. After listening to a sound logo, participants had to check all the adjectives that applied to a company using that sound logo. These data can be considered as check-all-that-apply (CATA) data (see also Chapter 5).

The participants then attended a presentation where the client’s brand and values were presented. After the presentation, the participants had to assess the sound logos in terms of how well the sound logos’ emotional impact aligns with the brand values. To do this, they had a list of adjectives: happy, surprising, exciting, energetic, sensual, sad, warm, serious; and they had to use a just about right (JAR) scale as illustrated in the following generic question: “In your opinion do you think this logo is Far too adjective ... Not at all adjective with respect to Brand’s values?” These data can be considered as just about right data (see also Chapter 5).

From this, we constructed a data table comprising 27 rows, i.e. the logos, and 62 columns, i.e. the 19 adjectives, the 8 emotions crossed with the 5 points of the JAR scale (8 × 5 = 40), and the 3 factors, making a total of 19 + 40 + 3 = 62 columns (see Appendix A.3 for details). The table contains:
• At the intersection of one row and one CATA column, the number of times a given adjective has been checked when listening to a sound logo.
• At the intersection of one row and one JAR column, the number of times a given combination of emotion×JAR category has been mentioned when listening to a sound logo.
• At the intersection of one row and one of the factors, the category that characterises the sound logo for the factor considered.
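As a minimal sketch, the full factorial design behind the 27 variations can be enumerated in base R. The level names and the Instrument_Tempo_Pitch label format used here are assumptions read off the labels plotted in Figure 7.4 (e.g. A_160_3); they are not taken from the sonic.csv file itself.

```r
# Hypothetical level names, inferred from the labels plotted in Figure 7.4
instrument <- c("A", "P", "V")   # accordion, piano, violin
tempo      <- c(80, 120, 160)
pitch      <- c(1, 3, 5)

# Full factorial design: every combination of the three levels of each factor
design <- expand.grid(Instrument = instrument, Tempo = tempo, Pitch = pitch,
                      stringsAsFactors = FALSE)

# One label per sound logo, e.g. "A_160_3"
design$Logo <- paste(design$Instrument, design$Tempo, design$Pitch, sep = "_")

nrow(design)   # 27 variations
```

Each row of design corresponds to one row of the data table described above, and the three factor columns play the role of the illustrative categorical variables.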

7.3.1 How would you describe a company with such a sound logo?

To understand the sound logos in terms of the values conveyed by a brand, the distance to be considered should be based on the CATA data, in our case the first 19 columns of the data table². We expect that the structure induced by these active columns may be explained by the intrinsic factors of variability amongst the sound logos, considered here as illustrative variables.

To plot the results, after importing the data set into my R session with the very important read.table function, I first performed a correspondence analysis on the data table named sonic, specifying that the JAR data as well as the factors used for the variations should be considered as illustrative information (by default, all the columns are considered as active). To do this, in the CA function, I have to specify the position of the supplementary columns (col.sup=20:56), as well as the position of the supplementary categorical variables (quali.sup=57:59). The results of the CA were saved in an R object named res.ca.cata. In order to avoid visual overload with the plot.CA function, I decided to plot only the 15 rows and the 10 columns with the best quality of representation (Blasius and Greenacre, 2014) on the plane generated by the first two dimensions (selectRow="cos2 15" and selectCol="cos2 10"). Similarly, as the supplementary information is also plotted by default, I decided not to plot the supplementary columns associated with the JAR data (invisible="col.sup"). The following code yields the graphical output shown in Figure 7.4, which illustrates the benefits of having supplementary information.

> sonic <- read.table("sonic.csv", ...)
> res.ca.cata <- CA(sonic, col.sup=20:56, quali.sup=57:59)
> plot.CA(res.ca.cata,invisible="col.sup",selectRow="cos2 15",selectCol="cos2 10",title="Analysis of the CATA (before)")

On the first dimension of Figure 7.4, we can see that the levels of the pitch factor are ordered from pitch 5, with a negative score, to pitch 1, with a positive score, with pitch 3 located close to the origin. The first dimension can be interpreted as a gradient of pitch, from high to low. It is worth mentioning that the variability on the first dimension, although based on the CATA items, is essentially explained by a single intrinsic factor, namely pitch. This gradient of pitch will be extremely useful to explain the differences between the sound logos that have been perceived as technological and innovative, and those that have been perceived as popular and traditional. On the second dimension we can see a differentiation between the accordion on the one hand (with positive scores), and the piano and the violin on the other hand (with negative scores): this dimension could be interpreted as some gradient of sophistication.

2 The sonic.csv data set is described further in Appendix A.3 and can be downloaded from https://www.crcpress.com/Sensory-Evaluation-of-Sound/Zacharov/p/book/9781498751360

Figure 7.4: Plot of the sound logos (in blue) by considering the CATA items (in red) as active and their intrinsic factors of variability (in pink) as illustrative. Before refers to the first step of the experiment. [Plot title: Analysis of the CATA (before); axes: Dim 1 (43.55%), Dim 2 (23.85%).]

The technical corner: The concept of quality of representation. As explained previously, visualisation and dimension reduction are based on the concept of projection. As the rows (and the columns) are projected onto a plane, there is no guarantee that they will be well represented: by definition, the projection operator decreases and thus distorts the distances between points. To assess how well a row is projected on a dimension, we calculate a numerical indicator for this dimension: the squared cosine of the angle between the line joining the barycentre of the rows to the row itself and the line joining the barycentre to the projection of the row on the dimension. This number lies between 0 and 1, and measures the quality of representation of the row projected on the dimension. Summed over all the dimensions, it is equal to 1: a row (or a column) is perfectly represented in the space spanned by all the dimensions. The closer its quality of representation on a plane is to 1, the better a row is projected on that plane. The information on the quality of representation of the rows can be found in the object res.ca.cata$row$cos2.
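To make the definition concrete, here is a minimal sketch in base R. It uses a plain centred singular value decomposition of a small made-up matrix rather than a CA, so the distances are ordinary Euclidean distances instead of chi-square distances; the matrix toy and all variable names are hypothetical. The squared cosine on a dimension is the squared coordinate on that dimension divided by the squared distance of the row to the barycentre.

```r
set.seed(123)
toy <- matrix(rnorm(6 * 4), nrow = 6,
              dimnames = list(paste0("row", 1:6), paste0("v", 1:4)))

# Centre the columns so that the barycentre of the rows is the origin
centred <- scale(toy, center = TRUE, scale = FALSE)

# Coordinates of the rows on the orthogonal dimensions (U times diag(d))
dec    <- svd(centred)
coords <- sweep(dec$u, 2, dec$d, "*")
rownames(coords) <- rownames(toy)

# Squared distance of each row to the barycentre
dist2 <- rowSums(centred^2)

# Quality of representation: squared cosine on each dimension
cos2 <- coords^2 / dist2

round(cos2[, 1:2], 2)    # quality on dimensions 1 and 2
rowSums(cos2[, 1:2])     # quality on the first plane
rowSums(cos2)            # equal to 1 over all the dimensions
```

A row whose value of rowSums(cos2[, 1:2]) is close to 1 lies almost entirely in the first plane; a row with a low value should be interpreted with caution on that plane.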

7.3.2 Do you think this logo is too serious compared to the brand’s values?

If I now want to understand the variability between the logos in terms of how they affect the participants regarding the values conveyed by the client’s brand, I have to consider the JAR data as the active information, and the rest as the illustrative information. This can be done by using the following R code:

> res.ca.jar <- CA(sonic, col.sup=1:19, quali.sup=57:59)
> names(sonic[,20:59])
 [1] "Happy JAR"                    "Happy Not enough"
 [3] "Happy Too much"               "Happy Not enough at all"
 [5] "Happy Far too much"           "Surprising JAR"
 [7] "Surprising Not enough"        "Surprising Too much"
 [9] "Surprising Not enough at all" "Surprising Far too much"
[11] "Exciting JAR"                 "Exciting Not enough"
[13] "Exciting Too much"            "Exciting Not enough at all"
[15] "Exciting Far too much"        "Energetic JAR"
[17] "Energetic Not enough"         "Energetic Too much"
[19] "Energetic Not enough at all"  "Energetic Far too much"
[21] "Sensual JAR"                  "Sensual Not enough"
[23] "Sensual Too much"             "Sensual Not enough at all"
[25] "Sad Not enough"               "Sad Too much"
[27] "Sad Not enough at all"        "Sad Far too much"
[29] "Warm JAR"                     "Warm Not enough"
[31] "Warm Too much"                "Warm Not enough at all"
[33] "Serious JAR"                  "Serious Not enough"
[35] "Serious Too much"             "Serious Not enough at all"
[37] "Serious Far too much"         "Instrument"
[39] "Tempo"                        "Pitch"
> plot.CA(res.ca.jar,invisible=c("col.sup","quali.sup"),selectCol=c(1,6,11,16,21,29,33),selectRow="contrib 10",title="Analysis of the JAR (after)")

As you can see from the code, I also decided to select and to plot only the emotions associated with the just about right level (selectCol=c(1,6,11,16,21,29,33)). Interestingly, we can see in Figure 7.5 that all the JAR categories are located in the same area of the graphical output. In other words, (1) the JAR categories have similar multivariate profiles in terms of how they were associated with the sound logos, and (2) the sound logos that are located in the same area are the ones that affect the participants appropriately regarding the values conveyed by the brand of the client. Considering the brand and its values, these logos are just about right for the adjectives happy, surprising, exciting, energetic, sensual, warm, serious. From this we can learn that there are “good” combinations of the factor categories that will enable the sound logo’s characteristics to be well aligned with the brand and its associated values.

The technical corner: You may have noticed that the number of columns associated with the JAR data is not exactly 40. Actually, I decided to get rid of a few columns: the ones that have been elicited a very small number of times. In practice, these are the columns $j$ for which $n_{.j} = \sum_i n_{ij} \leq 3$.
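The filtering rule of the technical corner can be sketched as follows in base R. The small table of counts below is made up for illustration (it is not the sonic data); only the threshold of 3 comes from the text.

```r
# Hypothetical counts: 4 logos (rows) x 4 JAR columns
counts <- matrix(c(5, 0, 2, 7,
                   3, 1, 0, 6,
                   4, 0, 1, 8,
                   6, 1, 0, 5),
                 nrow = 4, byrow = TRUE,
                 dimnames = list(paste0("logo", 1:4),
                                 c("Happy JAR", "Happy Far too much",
                                   "Sad JAR", "Warm JAR")))

# Column margins n_.j: number of times each column was elicited
col_totals <- colSums(counts)

# Keep only the columns elicited more than 3 times
kept <- counts[, col_totals > 3, drop = FALSE]
colnames(kept)
```

Removing such rarely used columns avoids giving a few isolated answers a disproportionate influence on the chi-square distances used by CA.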

7.3.3 In summary

In this second section, I had to deal with a more complex data set than the one in the first section. Consequently, regarding my multidimensional analysis, I had to face two issues, one related to the input, and another related to the output.

The input basically concerns the choice of the status of the columns of my data set: depending on the question of interest, and therefore on the point of view I want to use to represent the variability amongst my rows (respectively my columns), I have to choose the proper distance, in other words the columns (respectively the rows) that I want to consider as active in the calculation of the distance between the rows (respectively the columns). This is what I have illustrated with two very different research questions.

Figure 7.5: Plot of the logos and some emotions at the JAR level on the first two dimensions resulting from a CA on the JAR data, using the plot.CA function. After refers to the second step of the experiment. [Plot title: Analysis of the JAR (after); axes: Dim 1 (44.48%), Dim 2 (29.20%).]

The output concerns the choice of the rows and the columns of my data set I want to visualise. Once again, this choice depends on the question I want to answer. As the data get more complex, this choice is of greater importance, as it affects what can be read from the output of the dimension reduction technique employed. As illustrated, this choice can be made according to the importance of the objects to be visualised, to their quality of representation, or based on their contribution (Husson et al., 2017).

In order to further explore this complex data set, two additional analyses can be performed. For the first analysis, one can observe that the JAR data are naturally structured into groups, one group corresponding to one JAR variable. One approach would be to visualise the sound logos based on their optimal JAR levels. For the second analysis, one may want to visualise the sound logos by combining the two kinds of information contained in the CATA and the JAR data, i.e. the perception of the logos without knowing anything about the brand, and the perception of the logos in terms of emotions once the name of the brand has been revealed. These two questions can be handled with dimension reduction methods that take into account the concept of multiple data sets, and multiple factor analysis (MFA; Escofier and Pagès, 1994) is one of them. Once again, the concept of active or illustrative variables is crucial: it has to be determined in terms of groups of columns rather than single columns as illustrated previously. To illustrate how to perform an MFA with the FactoMineR package (Pagès, 2016), I will present the following code:

> sonic_MFA <- ...
> res.mfa <- MFA(sonic_MFA, ...)
> plot.MFA(res.mfa,choix="ind")
> plot.MFA(res.mfa,choix="group")
> plot.MFA(res.mfa,choix="axes")
> plot.MFA(res.mfa,choix="ind",partial=c("A_160_5","V_80_1","P_160_5"),xlim=c(-7.5,10),ylim=c(-5,6.5))

These lines of code correspond to the following actions: I built a data set based on the CATA data and on the JAR data, then I performed an MFA on this data set by considering 9 contingency tables (Bécue-Bertaut and Pagès, 2004; Kostov et al., 2013), one based on the CATA data and 8 others based on the JAR data, one for each JAR variable. As I wanted to understand the sound logos in terms of the JAR variables, with the role of each JAR variable being balanced within a global analysis, I decided to consider the CATA contingency table as an illustrative data table, while the other 8 tables are considered as active data tables, in the sense that the distance amongst the logos is based on these 8 tables.

Remark: If you want to consider the second point of view, you can use the following code:

> sonic_MFA <- ...
> res.mfa <- MFA(sonic_MFA, ...)

> sounds_sas <- ...
> res.mca <- MCA(sounds_sas, ...)
> plot.MCA(res.mca,axes=c(1,2),choix="ind",invisible="var",title="Representation of the sounds of trams")

In Figure 7.6, based on the first dimension, you can discern a differentiation between sounds 25, 36, 43, for instance (located on the negative side), and others located on the positive side, e.g. 27, 34, 46. On the second dimension, you can also discern a differentiation between sounds 27, 34, 46, for instance (located on the negative side), and sounds 3, 21, 56 (located on the positive side). This figure is very difficult to interpret without further information regarding the sounds. Even with expertise in sound, such a visualisation is impossible to interpret without the descriptions of the groups of sounds provided by the participants (in our case, the names of the groups in Table 7.2). As mentioned previously, this visualisation of the rows can be supplemented by a visualisation of the columns that inherits the same properties as CA on a contingency table and, in particular, benefits from the transition properties:

$$F_s(i) = \frac{1}{\sqrt{\lambda_s}} \sum_{k} \frac{x_{ik}}{J}\, G_s(k), \tag{7.17}$$

$$G_s(k) = \frac{1}{\sqrt{\lambda_s}} \sum_{i} \frac{x_{ik}}{I_k}\, F_s(i). \tag{7.18}$$

4 The tram.csv data set is described further in Appendix A.4 and can be downloaded from https://www.crcpress.com/Sensory-Evaluation-of-Sound/Zacharov/p/book/9781498751360

Here, $F_s(i)$ denotes the coordinate of the statistical individual $i$ (in our example, $i$ is a tram sound) on the dimension of rank $s$, $G_s(k)$ denotes the coordinate of the category $k$ on the dimension of rank $s$, $x_{ik}$ is either 0 or 1, $I_k$ denotes the number of sounds described by the category $k$, $\lambda_s$ denotes the variance of the dimension of rank $s$, and $J$ denotes the number of participants. As mentioned previously, a tram sound is located at the barycentre of the categories used to characterise that sound.

Once the coordinates of the columns of the complete disjunctive table have been obtained, the interpretation of the plane (see Figure 7.6) can be performed from a (latent) variable perspective, in other words from a dimensional perspective, using the dimdesc function of the FactoMineR package (Husson et al., 2017). This function provides an automatic description of the dimensions based on the variables of the data set that are analysed. It is used in the following way:

> res.dimdesc <- dimdesc(res.mca, ...)
> res.dimdesc$`Dim 1`$category
                                         Estimate      p.value
Part4_strident,metal,industrial         1.1433175 1.982080e-07
Part1_strident,metal,industrial         1.0897839 1.735708e-06
Part13_rhythmic                         0.7045784 2.063323e-06
Part30_merry-go-round,danger            0.9011023 4.991323e-06
Part3_strident,metal,industrial         1.1211957 1.262711e-05
Part2_strident,metal,industrial         1.1957332 2.046932e-05
...
Part23_usual,fluid,subway              -0.6011040 1.150456e-04
Part22_fluid,usual                     -0.6772685 8.062565e-05
Part21_fluid,usual                     -0.5637844 6.447303e-05
Part15_fluid                           -0.7965626 6.096617e-05
Part14_fluid,homogeneous,consistentity -0.8217955 6.096617e-05
Part24_fluid,usual                     -0.5464487 2.131711e-05
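Returning to the transition formulas (7.17) and (7.18), they can be checked numerically on a toy example. The sketch below, in base R, performs an MCA by hand as a correspondence analysis of the complete disjunctive table; the three hypothetical participants and their group labels are made up for illustration, and the code is not the FactoMineR implementation.

```r
# Hypothetical sorting data: 8 sounds (rows) grouped by 3 participants
sorting <- data.frame(
  Part1 = c("fluid", "fluid", "strident", "strident",
            "fluid", "strident", "fluid", "strident"),
  Part2 = c("usual", "usual", "metal", "metal",
            "rhythmic", "rhythmic", "usual", "metal"),
  Part3 = c("soft", "loud", "loud", "loud",
            "soft", "loud", "soft", "soft")
)

# Complete disjunctive table: one 0/1 column per category
indicator <- function(f) {
  f <- factor(f)
  sapply(levels(f), function(l) as.numeric(f == l))
}
Z  <- do.call(cbind, lapply(sorting, indicator))
J  <- ncol(sorting)   # number of participants
Ik <- colSums(Z)      # number of sounds described by each category

# CA of Z: standardised residuals, then singular value decomposition
P  <- Z / sum(Z)
r  <- rowSums(P)      # row masses
cm <- colSums(P)      # column masses
S  <- (P - r %o% cm) / sqrt(r %o% cm)
dec    <- svd(S)
keep   <- dec$d > 1e-8        # drop the null dimensions
lambda <- dec$d[keep]^2       # variance of each dimension

# Principal coordinates of the sounds (Fs) and of the categories (Gs)
Fs <- sweep(dec$u[, keep, drop = FALSE] / sqrt(r),  2, dec$d[keep], "*")
Gs <- sweep(dec$v[, keep, drop = FALSE] / sqrt(cm), 2, dec$d[keep], "*")

# Right-hand sides of (7.17) and (7.18)
Fs_transition <- sweep((Z / J) %*% Gs,     2, sqrt(lambda), "/")
Gs_transition <- sweep((t(Z) / Ik) %*% Fs, 2, sqrt(lambda), "/")

all.equal(Fs, Fs_transition, check.attributes = FALSE)
all.equal(Gs, Gs_transition, check.attributes = FALSE)
```

Up to the factor $1/\sqrt{\lambda_s}$, each sound sits at the average position of the J categories it received, and each category sits at the average position of the sounds it describes.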

For each dimension, the dimdesc function provides a list of categories that are significantly linked to the dimension considered. It is now up to the user to interpret these dimensions. In our example, we can see an opposition between sounds that are described as rather fluid and sounds that are described as rather strident and rhythmic. The main axis of variability amongst the sounds of trams may be interpreted as an axis of smoothness.

The interpretation of the plane can also be done from a statistical unit perspective, by considering homogeneous clusters of sounds. In order to do this, let’s apply the hierarchical clustering on principal components (HCPC) function of the FactoMineR package. This function performs a hierarchical ascendant classification on the dimensions resulting from a multidimensional analysis carried out with the FactoMineR package: in our case, the input to this function is simply an object of class MCA, i.e. the set of results provided by the MCA function. In the following example, 30 dimensions were taken into account for the classification, and 3 clusters were retained. The HCPC function provides a description of the clusters based on the variables of the data set that is analysed: this description is stored in a list named res.hcpc$desc.var, where res.hcpc is the name of the object in which I have stored the results of the HCPC function (Husson et al., 2017). To obtain a graphical output of the clusters (Figure 7.7), I will use the following code:

> res.mca <- MCA(sounds_sas, ncp=30, ...)
> res.hcpc <- HCPC(res.mca, nb.clust=3, ...)
> plot.HCPC(res.hcpc,ind.names=F,choice="map",draw.tree=F,title="Graphical representation of the sounds from the hierarchical clustering")

If I am interested in the description of the first cluster of sounds, for instance, I will write the following code:

> round(res.hcpc$desc.var$category$`1`,2)
                                Cla/Mod Mod/Cla Global p.value v.test
Part25_deep,screech,stressful    100.00   61.54   40.0    0.00   3.93
Part24_fluid,usual                94.74   69.23   47.5    0.00   3.75
Part21_fluid,usual                90.48   73.08   52.5    0.00   3.48
Part52_pleasant                  100.00   50.00   32.5    0.00   3.33
Part13_fluid,homogeneous         100.00   50.00   32.5    0.00   3.33
Part23_usual,fluid,subway         90.00   69.23   50.0    0.00   3.25
Part38_rhythmic,screech,rumble     0.00    0.00   15.0    0.00  -3.36
Part4_strident,metal,industrial    0.00    0.00   15.0    0.00  -3.36
Part1_strident,metal,industrial    0.00    0.00   15.0    0.00  -3.36
Part47_rhythmic,machine           11.11    3.85   22.5    0.00  -3.62
Part13_rhythmic                   10.00    3.85   25.0    0.00  -4.00

This description of the clusters is essential to interpret the plane and can also be used to illustrate it. This description is separated into two parts: the categories that are significantly over-represented in the cluster (positive v.test) and the ones that are significantly under-represented in the cluster (negative v.test) (Lê et al., 2008). In our example, apart from participant 25, the sounds of this first cluster were considered as rather fluid and were not considered as rhythmic. One possible way of using this information is to plot, for instance, 2 characteristic categories for each cluster, as shown in Figure 7.8. To do so, I use the following code:

> plot.MCA(res.mca,choix="ind",selectMod=c("Part52_pleasant","Part24_fluid,usual","Part15_rhythmic","Part5_agressive,rattle,danger","Part6_strident,rattle,agressive","Part21_scrap.metal,regular,locomotive"),label="var",title="Representation of the categories that best characterise the clusters")
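To make the columns of the cluster description concrete, the sketch below recomputes Cla/Mod, Mod/Cla and Global in base R from counts. The counts (40 sounds, 26 in the cluster, 19 carrying the category, 18 in both) are not given in the text: they are inferred here from the percentages reported for Part24_fluid,usual, and should be read as an assumption. The hypergeometric tail probability is a simplified stand-in for the test behind p.value and v.test, not the exact FactoMineR computation.

```r
n_total    <- 40   # sounds in the study
n_cluster  <- 26   # sounds in cluster 1 (inferred from the percentages)
n_category <- 19   # sounds placed in the category (inferred)
n_both     <- 18   # sounds in the cluster and in the category (inferred)

cla_mod <- 100 * n_both / n_category    # % of the category found in the cluster
mod_cla <- 100 * n_both / n_cluster     # % of the cluster carrying the category
global  <- 100 * n_category / n_total   # % of all sounds carrying the category

round(c("Cla/Mod" = cla_mod, "Mod/Cla" = mod_cla, Global = global), 2)

# Probability of observing at least n_both such sounds in a cluster of this
# size if the cluster had been drawn at random (one-sided hypergeometric tail)
p_upper <- phyper(n_both - 1, n_category, n_total - n_category, n_cluster,
                  lower.tail = FALSE)
p_upper
```

A category is over-represented in a cluster when Mod/Cla is clearly larger than Global, which is the case here (69.23 against 47.5).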

In addition to these data, we also have at our disposal measures of annoyance provided by courtesy of Arnaud Trollé. In a previous study, the 40 sounds were assessed in terms of annoyance by 40 other participants (Trollé et al., 2014).

7.4.2 Some considerations to better understand PCA

Figure 7.7: Graphical representation of the sounds from the hierarchical ascending classification of the 30 first dimensions from MCA, using the plot.HCPC function. [Plot title: Representation of the sounds from the hierarchical clustering; axes: Dim 1 (10.82%), Dim 2 (8.86%); legend: cluster 1, cluster 2, cluster 3.]

Let’s now explore the annoyance data. As mentioned previously, these data (as well as the stimuli used for the SAS task) were provided by courtesy of Arnaud Trollé (Trollé et al., 2014). They consist of a 40×40 matrix denoted X, where the rows (i.e. the statistical units) are the 40 sounds, and the columns (i.e. the variables) are the 40 participants who listened to the sounds and provided annoyance scores for each sound on a continuous scale ranging from 0 to 10. Here, it is not my intention to go into the details of principal component analysis, but rather to illustrate how the method works using perceptual data such as annoyance data.

If I want to obtain clusters of participants according to their annoyance profiles, I have to consider the transpose of X, where the rows are the participants and the columns are the annoyance scores given for each sound. A naïve version of an annoyance profile would be the vector of annoyance scores given by a participant: homogeneous clusters would then be comprised of participants who were similarly annoyed by each sound. Thus, to obtain clusters of participants from this perspective, I should first run a PCA on the transpose of X, and then run a hierarchical ascendant classification on the dimensions (or a subset of the dimensions) resulting from the PCA. Figure 7.9 is the output of the following code:

> res.pca <- PCA(t(X), ncp=30, ...)
> res.hcpc <- HCPC(res.pca, nb.clust=3, ...)
> plot.HCPC(res.hcpc,ind.names=F,choice="map",draw.tree=F,title="Plot of the sounds from the hierarchical clustering")

These results were expected with such data: they are the expression of the assessor effect you have seen in the previous chapter, whereby each assessor has their own average level. Hence, the interpretation of the clusters is straightforward: the first cluster comprises participants who are barely annoyed by any sound, the second cluster comprises participants who are moderately annoyed by the sounds, and finally the third cluster comprises participants who are extremely annoyed by all sounds. In PCA, a very structural first dimension embodies what is termed a size effect: by definition, it happens when the variables are all positively correlated, as is the case here. As the variables factor map in Figure 7.9 is an optimal representation of the correlation matrix (or more generally of the variance-covariance matrix) provided by PCA, the fact that all the variables are pointing in the same direction reflects the fact that the correlation coefficients between all pairs of variables are positive. This result is not only expected, it is also devoid of interest.

Figure 7.9: Plot of the participants from the hierarchical ascending classification of the 30 first dimensions from PCA, using the plot.HCPC function. [Panels: Variables factor map (PCA); Representation of the sounds from the hierarchical clustering (legend: cluster 1, cluster 2, cluster 3); axes: Dim 1 (57.66%), Dim 2 (6.98%).]

Another profile of annoyance could be considered: homogeneous clusters would be formed of participants who rank the sounds similarly according to their annoyance. To obtain such clusters, the data should be centred by participant. In other words, for each participant, I am going to subtract their mean score from each of their scores. To do this, I am using the following code:

> m <- ...
> tXc <- ...
> res.pca.centred <- PCA(tXc, ...)
> par(mfrow=c(1,2))
> plot.PCA(res.pca.centred,choix="var",title="Plot of the sounds without Assessor effect")
> plot.PCA(res.pca.centred,choix="ind",title="Plot of the assessors without Assessor effect")
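As a minimal sketch of this centring step in base R, a small made-up matrix stands in for the transpose of X (participants in rows, sounds in columns). The object names scores and scores_centred are hypothetical, and the code is not necessarily the one used to produce the figures.

```r
# Hypothetical annoyance scores: 3 participants (rows) x 4 sounds (columns)
scores <- matrix(c(1,  2, 0, 1,    # barely annoyed by anything
                   5,  6, 4, 5,    # moderately annoyed
                   9, 10, 8, 9),   # strongly annoyed
                 nrow = 3, byrow = TRUE,
                 dimnames = list(c("J1", "J2", "J3"),
                                 c("s1", "s2", "s3", "s4")))

# Mean annoyance of each participant (the assessor effect)
participant_mean <- rowMeans(scores)

# Subtract each participant's mean from all of their scores
scores_centred <- sweep(scores, 1, participant_mean, "-")

rowMeans(scores_centred)   # every participant now has a mean of 0
scores_centred             # the three profiles become identical
```

Once centred, these three participants, who differ only in their average level, share the same profile: only the way they order the sounds remains.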

As you can see in Figure 7.10, the first dimension of the analysis does not reflect a size effect anymore, as the variables are pointing in all directions. For example, the first dimension differentiates two very different participants: participant J35, who seems to be really annoyed by sounds 34, 54, and 28, and not at all by sounds 36 or 38; and participant J1, who seems to be really annoyed by sounds 36 and 38, and not at all by sounds 34, 54 or 28. By centring by rows, i.e. by subtracting the mean of each row, I actually got rid of the main information that was contained in the transpose of X. In other words, it is as if I were working in a subspace of the transpose of X that is orthogonal to the first dimension caused by the assessor effect. If you remember the principles I have set out concerning the search for optimal dimensions, in the sense of projected inertia, you may recall that the decomposition of the inertia comes down to looking for a sequence of orthogonal axes that maximise the inertia (or variance) of the projected scatter plot on these axes. Therefore, you should come to the conclusion that the main plane obtained from a PCA on the transpose of X centred by row should look very similar to the plane generated by dimensions 2 and 3 obtained from a PCA on the transpose of X. This should be true for both the visualisation of the individuals and that of the variables.
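This conclusion can be checked on simulated data. The sketch below uses base R's prcomp (unscaled) instead of the FactoMineR PCA function, and entirely made-up scores built from a strong assessor effect plus two weaker contrasts between sounds; all names and numerical settings are assumptions chosen for illustration.

```r
set.seed(2019)
n_part  <- 40
n_sound <- 40

# Three mutually orthogonal, zero-mean participant effects
Q <- qr.Q(qr(cbind(1, matrix(rnorm(n_part * 3), n_part, 3))))[, 2:4]
level  <- 3 * sqrt(n_part) * Q[, 1]   # assessor effect (average level)
taste1 <- sqrt(n_part) * Q[, 2]       # first contrast between sounds
taste2 <- sqrt(n_part) * Q[, 3]       # second, weaker contrast

# Two sound patterns that sum to zero and are orthogonal to each other
u <- rep(c(1, -1), times = n_sound / 2)
v <- rep(c(0.5, 0.5, -0.5, -0.5), times = n_sound / 4)

# Participants in rows, sounds in columns (a stand-in for the transpose of X)
tX <- level %o% rep(1, n_sound) + taste1 %o% u + taste2 %o% v +
  matrix(rnorm(n_part * n_sound, sd = 0.3), n_part, n_sound)

# Same data without the assessor effect (centred by participant)
tXc <- sweep(tX, 1, rowMeans(tX), "-")

pca_raw     <- prcomp(tX)    # dimension 1 carries the size effect
pca_centred <- prcomp(tXc)

# Dimensions 2 and 3 of the first analysis against
# dimensions 1 and 2 of the second analysis
abs(cor(pca_raw$x[, 2], pca_centred$x[, 1]))
abs(cor(pca_raw$x[, 3], pca_centred$x[, 2]))
```

The absolute value is taken because the sign of a dimension is arbitrary: a dimension may be inverted from one analysis to the other without changing its interpretation.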

Figure 7.10: Illustration of the sounds and the assessors issued from the PCA on the data without the assessor effect. [Panels: Plot of the sounds without Assessor effect; Plot of the assessors without Assessor effect; axes: Dim 1 (16.76%), Dim 2 (10.58%).]

Figure 7.11: Representation of the participants according to the way they ranked the sounds in terms of annoyance: by shifting from the main plane to the plane obtained from dimensions 2 and 3 (on the right); by subtracting directly the assessor effect from the data (on the left). [Panels: On the orthogonal of the Judge effect by centring (axes: Dim 1 (16.76%), Dim 2 (10.58%)); On the orthogonal of the Judge effect by shifting (axes: Dim 2 (6.98%), Dim 3 (4.76%)).]

Figure 7.12: Illustration of the dimensions (on the right) and of the groups of dimensions (on the left) issued from the separate analyses of the transpose of X, on the one hand, and of the transpose of X without the assessor effect, on the other hand. [Panels: Representation of the 2 groups of dimensions (groups: Shifted, Centred); Representation of the dimensions from both analyses; axes: Dim 1 (36.51%), Dim 2 (23.85%).]

As you can see in Figure 7.11, the plane generated by dimensions 2 and 3 obtained from a PCA on the transpose of X actually looks very similar to the main plane (i.e. dimensions 1 and 2) obtained from a PCA on the transpose of X centred by row, shown in Figure 7.10. The comparison of these plots is not totally straightforward, as a dimension in one analysis may be inverted compared to the other analysis. This is the case for the 3rd dimension of the PCA on the transpose of X and the 2nd dimension of the PCA on the transpose of X centred. To compare these visualisations more efficiently, I would recommend the use of MFA (Escofier and Pagès, 1994), as we are in a multiple data set context. In practice, I would save the coordinates of the participants on the dimensions for each analysis. For the first PCA, the one on the transpose of X, I would save the coordinates from the 2nd to the 6th dimension; for the second PCA, the one on the transpose of X centred, I would save the coordinates from the 1st to the 5th dimension. Then I would run an MFA on these two groups of coordinates, taking care not to scale the dimensions to unit variance during the calculation of the weights of the variables in MFA (type=c("c","c"), where "c" stands for centring the data only, as opposed to scaling them). In terms of code, this is what it would look like:

> a <- res.pca$ind$coord[,2:6]
> b <- res.pca.centred$ind$coord[,1:5]
> c <- cbind(a,b)
> res.mfa <- MFA(c, group=c(5,5), type=c("c","c"), ...)
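The weighting that MFA applies to balance the two groups can be sketched in base R. This is a simplified illustration of the principle for centred, unscaled quantitative groups (each group is divided by the square root of the first eigenvalue of its own PCA before a global PCA), run on simulated stand-ins for the two sets of coordinates; it is not the FactoMineR MFA code, and the objects shifted and centred are hypothetical.

```r
set.seed(7)
n <- 40

# Hypothetical stand-ins for the two groups of 5 coordinates per participant
shifted <- matrix(rnorm(n * 5), n, 5) %*% diag(c(3, 2, 1.5, 1, 0.5))
centred <- shifted + matrix(rnorm(n * 5, sd = 0.3), n, 5)

# Centre a group without scaling it to unit variance
centre_only <- function(g) scale(g, center = TRUE, scale = FALSE)

# First eigenvalue of the PCA of a centred, unscaled group
first_eigenvalue <- function(g) svd(centre_only(g))$d[1]^2 / nrow(g)

# MFA weighting: divide each group by the square root of its first eigenvalue
weighted_shifted <- centre_only(shifted) / sqrt(first_eigenvalue(shifted))
weighted_centred <- centre_only(centred) / sqrt(first_eigenvalue(centred))

# After weighting, the first eigenvalue of each group is 1
first_eigenvalue(weighted_shifted)
first_eigenvalue(weighted_centred)

# Global PCA on the juxtaposed weighted groups
global     <- cbind(weighted_shifted, weighted_centred)
global_eig <- svd(global)$d^2 / n
global_eig[1]   # lies between 1 and 2 (the number of groups)
```

A first global eigenvalue close to the number of groups indicates that the first dimension is common to both groups, which is what one would hope for when the two analyses tell the same story.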