Testing and Assessment of Interpreting: Recent Developments in China (New Frontiers in Translation Studies) 9811585539, 9789811585531

This book highlights reliable, valid and practical testing and assessment of interpreting, presenting important developments in China.


Table of contents :
Contents
Editors and Contributors
Testing and Assessment of Interpreting in China: An Overview
1 Testing and Assessment of Interpreting in China
1.1 Testing and Assessment Practice
1.2 Testing and Assessment Research
2 Purpose and Structure of the Current Volume
3 An Overview of Individual Chapters
3.1 Part I: Rater-Mediated Assessment
3.2 Part II: Automatic Assessment
4 Conclusion
References
Rater-mediated assessment
Introducing China’s Standards of English Language Ability (CSE)—Interpreting Competence Scales
1 Introduction
2 Modelling Interpreting Competence
2.1 Interpreting Competence
2.2 Interpreting Quality Assessment
2.3 The Model of Interpreting Competence in the CSE Project
3 Design and Development of the CSE-Interpreting Competence Scales
3.1 Scale Design: Operational Descriptive Scheme
3.2 Scale Development: Collection and Analysis of Scalar Descriptors
3.3 Validation of the CSE-Interpreting Competence Scales
4 The Scale Levels in the CSE-Interpreting Competence Scales
4.1 Scale Levels
4.2 Salient Characteristics of Each Level
5 Potential Application of the CSE-Interpreting Competence Scales
5.1 Interpreting Teaching
5.2 Interpreting Learning
5.3 Interpreting Assessment
5.4 Academic Research and Inter-Sector Collaboration
6 Conclusion
Appendix 1. The CSE-Interpreting Scales
References
Developing a Weighting Scheme for Assessing Chinese-to-English Interpreting: Evidence from Native English-Speaking Raters
1 Introduction
2 Literature Review
2.1 Rater-Mediated Assessment of Language Performance
2.2 Rater-Mediated Assessment of Translation and Interpreting
2.3 Quality Criteria in Interpreting Assessment
2.4 Quality Criteria and Their Weighting in Interpreting Assessment
3 Method
3.1 Participants
3.2 The Source Speech
3.3 Procedures
3.4 Rater Selection and Rater Training
3.5 Rating Criteria
3.6 Data Analysis
4 Results
4.1 Rating Reliability
4.2 Correlation Analysis
4.3 Multiple Regression Analysis
5 Discussion
6 Conclusion
Appendix 1 The Transcript of the Chinese Source Speech
Appendix 2 English Translation of the Source Speech
Appendix 3 Band Descriptions
Appendix 4 Sample Marking Sheet
References
Rubric-Based Self-Assessment of Chinese-English Interpreting
1 Introduction
2 Literature Review
2.1 Rubrics in Self-Assessment
2.2 Student’ Experiences of Rubric Use
2.3 Rubric Use in Interpreting Learning
3 Methodology
3.1 Participants
3.2 Interpreting Rubric
3.3 Test Material
3.4 Data Collection and Analysis
4 Results and Discussion
4.1 Rubric Use
4.2 Interpreting Performance
4.3 Rubric Perception
4.4 The Role of the Strategy-Based Rubrics in Interpreting: Awareness, Assessment, Acquisition
5 Conclusion
References
Detecting and Measuring Rater Effects in Interpreting Assessment: A Methodological Comparison of Classical Test Theory, Generalizability Theory, and Many-Facet Rasch Measurement
1 Introduction
2 Literature Review
2.1 Rater-Mediated Assessment of Spoken-Language Interpreting
2.2 Psychometric Approaches to Examining the Rater Effects
3 Research Purposes
4 Method
4.1 Data Source
4.2 Assessment Criteria and Rating Scale
4.3 Raters and Rater Training
4.4 Operational Scoring
4.5 Data Analysis
5 Results
5.1 Results from the CTT Analysis
5.2 Results from the G-Theory Analysis
5.3 Results from the Rasch Analysis
6 Discussion
6.1 Rater Effects in Assessment of Interpreting
6.2 Methodological Discussion
6.3 Methodological Implications for Interpreting Assessment
7 Conclusion
Appendix 1 The Analytic Rating Scale
References
Automatic assessment
Quantitative Correlates as Predictors of Judged Fluency in Consecutive Interpreting: Implications for Automatic Assessment and Pedagogy
1 Introduction
2 Quantitative Assessment of Fluency
3 Applying Quantitative Assessment of Fluency to CI
4 Method
4.1 Participants
4.2 Material
4.3 Procedures
4.4 Statistical Analysis
5 Results
5.1 Auditory Ratings of Fluency and Accuracy
5.2 Acoustic Measures of Fluency
5.3 Correlations Between the Quantitative Prosodic Measures and the Fluency Ratings
5.4 Quantitative Prosodic Measures as Predictors of Judged Fluency
5.5 A Closer Examination of the Melodic Parameters
6 Discussion
7 Conclusion
References
Chasing the Unicorn? The Feasibility of Automatic Assessment of Interpreting Fluency
1 Introduction
2 Method
2.1 Context and Participants
2.2 Data Processing and Analysis
3 Results and Discussion
4 Pedagogical Implications
5 Conclusion
References
Exploring a Corpus-Based Approach to Assessing Interpreting Quality
1 Introduction
2 Interpreting Quality Assessment
2.1 Assessment Practices
2.2 Common Ground
3 Corpus-Based Profiling of Interpreted Texts
4 Method
4.1 Sample Selection
4.2 Data Collection
4.3 Data Analysis
5 Results
5.1 Data Grooming
5.2 Accuracy of the Proposed Method
5.3 Visualization of the Assessment Results
6 Discussion
7 Conclusion
References
Coh-Metrix Model-Based Automatic Assessment of Interpreting Quality
1 Interpreting Quality Assessment
2 Coh-Metrix: A Brief Introduction
3 Research Questions
4 Method
4.1 Interpreted Texts
4.2 Procedures
5 Results
5.1 The Coh-Metrix Analysis for the Three Levels
5.2 Statistical Modeling of the Coh-Metrix Indices
6 Discussion
7 Conclusion
Appendix 9.1 Detailed Description of Coh-Metrix Measures
References


New Frontiers in Translation Studies

Jing Chen · Chao Han, Editors

Testing and Assessment of Interpreting: Recent Developments in China

New Frontiers in Translation Studies
Series Editor: Defeng Li, Center for Studies of Translation, Interpreting and Cognition, University of Macau, Macao SAR, China

Translation Studies as a discipline has witnessed the fastest growth in the last 40 years. With translation becoming increasingly more important in today’s glocalized world, some have even observed a general translational turn in humanities in recent years. The New Frontiers in Translation Studies aims to capture the newest developments in translation studies, with a focus on: • Translation Studies research methodology, an area of growing interest amongst translation students and teachers; • Data-based empirical translation studies, a strong point of growth for the discipline because of the scientific nature of the quantitative and/or qualitative methods adopted in the investigations; and • Asian translation thoughts and theories, to complement the current Eurocentric translation studies. Submission and Peer Review: The editor welcomes book proposals from experienced scholars as well as young aspiring researchers. Please send a short description of 500 words to the editor Prof. Defeng Li at [email protected] and Springer Senior Publishing Editor Rebecca Zhu: [email protected]. All proposals will undergo peer review to permit an initial evaluation. If accepted, the final manuscript will be peer reviewed internally by the series editor as well as externally (single blind) by Springer ahead of acceptance and publication.

More information about this series at http://www.springer.com/series/11894

Jing Chen · Chao Han Editors

Testing and Assessment of Interpreting: Recent Developments in China

Editors Jing Chen Research Institute of Interpreting Studies College of Foreign Languages and Cultures Xiamen University Xiamen, China

Chao Han Research Institute of Interpreting Studies College of Foreign Languages and Cultures Xiamen University Xiamen, China

ISSN 2197-8689 ISSN 2197-8697 (electronic) New Frontiers in Translation Studies ISBN 978-981-15-8553-1 ISBN 978-981-15-8554-8 (eBook) https://doi.org/10.1007/978-981-15-8554-8 © Springer Nature Singapore Pte Ltd. 2021 This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. This Springer imprint is published by the registered company Springer Nature Singapore Pte Ltd. The registered company address is: 152 Beach Road, #21-01/04 Gateway East, Singapore 189721, Singapore

Contents

Testing and Assessment of Interpreting in China: An Overview
Jing Chen and Chao Han

Rater-mediated assessment

Introducing China's Standards of English Language Ability (CSE)—Interpreting Competence Scales
Weiwei Wang

Developing a Weighting Scheme for Assessing Chinese-to-English Interpreting: Evidence from Native English-Speaking Raters
Xiaoqi Shang

Rubric-Based Self-Assessment of Chinese-English Interpreting
Wei Su

Detecting and Measuring Rater Effects in Interpreting Assessment: A Methodological Comparison of Classical Test Theory, Generalizability Theory, and Many-Facet Rasch Measurement
Chao Han

Automatic assessment

Quantitative Correlates as Predictors of Judged Fluency in Consecutive Interpreting: Implications for Automatic Assessment and Pedagogy
Wenting Yu and Vincent J. van Heuven

Chasing the Unicorn? The Feasibility of Automatic Assessment of Interpreting Fluency
Zhiwei Wu

Exploring a Corpus-Based Approach to Assessing Interpreting Quality
Yanmeng Liu

Coh-Metrix Model-Based Automatic Assessment of Interpreting Quality
Lingwei Ouyang, Qianxi Lv, and Junying Liang

Editors and Contributors

About the Editors Jing Chen is a full professor in the College of Foreign Languages and Cultures at Xiamen University, China. Her research interests include interpreting quality assessment and interpreting pedagogy. She has published widely in peer-reviewed journals in English and Chinese (e.g., Interpreter and Translator Trainer, Language Assessment Quarterly, Chinese Translators Journal), and she has led several large-scale research projects funded by the European Union (i.e., Asia Link - the EU-Asia Interpreting Studies) and the China National Social Sciences Foundation. She is serving as the Deputy Director of the National Interpreting Committee of the Translators Association of China. Chao Han is a full professor in the College of Foreign Languages and Cultures at Xiamen University, China. He conducted his PhD research at the Department of Linguistics at Macquarie University (Sydney), focusing on interpreter certification testing. He is interested in testing and assessment issues in translation and interpreting (T&I) and methodological aspects of T&I studies. His recent publications have appeared in journals such as Interpreting, The Translator, Language Testing, and Language Assessment Quarterly. He is currently a member of the Advisory Board of the International Journal of Research and Practice in Interpreting.

Contributors

Jing Chen Research Institute of Interpreting Studies, College of Foreign Languages and Cultures, Xiamen University, Xiamen, China
Chao Han Research Institute of Interpreting Studies, College of Foreign Languages and Cultures, Xiamen University, Xiamen, China
Junying Liang Zhejiang University, Hangzhou, China


Yanmeng Liu School of Languages and Cultures, The University of Sydney, Sydney, NSW, Australia
Qianxi Lv Shanghai Jiao Tong University, Shanghai, China
Lingwei Ouyang Zhejiang University, Hangzhou, China; University of California, San Francisco, CA, USA
Xiaoqi Shang School of Foreign Languages, Shenzhen University, Shenzhen, China
Wei Su Research Institute of Interpreting Studies, College of Foreign Languages and Cultures, Xiamen University, Xiamen, China
Vincent J. van Heuven University of Pannonia, Veszprém, Hungary; Leiden University, Leiden, Netherlands
Weiwei Wang School of Interpreting and Translation Studies, Guangdong University of Foreign Studies, Guangzhou, China
Zhiwei Wu Department of Chinese and Bilingual Studies, The Hong Kong Polytechnic University, Hong Kong, China
Wenting Yu Shanghai International Studies University, Shanghai, China

Testing and Assessment of Interpreting in China: An Overview Jing Chen and Chao Han

Abstract Testing and assessment of interpreting has been gaining traction, as is evidenced by a growing number of relevant publications over the past years. However, much of the previous literature pertains to practice and research in Europe and the United States. Little is known about recent developments in interpreting testing and assessment (ITA) in China, a country where interpreter training and professional certification have grown exponentially over the past 15 years. In this opening chapter, we first aim to provide an overview of the practice of and research on ITA in China, describing two main drivers of testing and assessment, namely, interpreter training and education, and professional certification. We then provide a concise review of theoretical discussion and empirical research conducted by Chinese scholars and researchers. Such description provides an important background against which we briefly introduce major topics and issues examined in the remainder of the edited volume. We hope that the introduction in the first chapter will help readers better navigate through the book.

Keywords Interpreting testing and assessment · Interpreting · Rater-mediated assessment · Automatic assessment · Interpreting studies

1 Testing and Assessment of Interpreting in China

Over the past 15 years, interpreting testing and assessment (hereafter referred to as ITA) has emerged as an important area of research in which educators and researchers investigate substantive issues related to design, development and validation of tests, tasks, and scoring methods involved in assessment of both spoken- and signed-language interpreting. The emergence of ITA is attested by an increasing number of scholarly publications on ITA-related topics. For example, there have been special issues published in mainstream translation and interpreting journals (Melby 2013; Koby and Lacruz 2017), monographs (Sawyer 2004; Setton and Dawrant 2016), edited books (Angelelli and Jacobson 2009; Tsagari and van Deemter 2013; Huertas-Barros et al. 2018), doctoral dissertations (e.g., Collados Aís 1998; Clifford 2004; Wu 2010; Han 2015a; Xiao 2019), and research reports (e.g., Roat 2006; ALTA Language Services 2007; Hale et al. 2012). Whereas much of the existing literature describes how testing and assessment of interpreting has been conducted in different countries around the world, relatively little is known about the recent practice and research in China. In this introductory chapter, we therefore provide a brief historical review of how ITA has been practiced and researched in China in order to set the scene for research reported in subsequent chapters.

1.1 Testing and Assessment Practice

One of the earliest practices of systematic testing and assessment of interpreting in China can be traced back to the recruitment of Chinese language interpreters for the United Nations (UN). In 1979, the UN and the Chinese government co-established the landmark UN Training Program for Interpreters and Translators in Beijing (see Yao 2019). To select the best candidates for the program, the UN would send a panel of examiners for entrance and final examinations. According to Dawrant et al. (Forthcoming), 25 out of 536 applicants were admitted in the first year of the program through competitive national exams.

It was in the 2000s that testing and assessment practice in China received a major boost. During this period, interpreter training and professional certification programs were created and expanded rapidly. In 2005, the Bachelor of Translators and Interpreters (BTI) program was established by the State Council Academic Degrees Committee, followed by the launch of the Master of Translators and Interpreters (MTI) program in 2007. The creation of the two new programs has attracted an increasing number of educational institutions to enroll BTI and MTI students. As of April 2020, a total of 281 and 259 universities and colleges had launched the BTI and the MTI program respectively,1 enrolling tens of thousands of students each year. The extraordinary scale-up of BTI and MTI education has created a massive demand for educational testing and assessment. For instance, admission/aptitude testing is administered to select the best candidates (e.g., Tang and Li 2013; Xing 2017); formative assessment (e.g., self- and peer assessment) is implemented to generate feedback for teaching and learning (Han 2018a; Li 2018; Su 2019b); and summative assessment is conducted at the end of the program/course to gauge students' achievement (Feng 2005).

1 For detailed information, please visit: https://cnti.gdufs.edu.cn/info/1017/1955.htm; https://cnbti.gdufs.edu.cn/info/1006/1595.htm.

The other important area where testing and assessment constitutes an essential component is professional certification. Huang and Liu (2017) provide an informative review of translator and interpreter certification testing programs developed in China from 1995 to 2015. According to them, professional certification testing in China has experienced four stages of development: (a) the embryonic period (i.e., 1995–2000), during which China's first testing program, the Shanghai Interpretation Accreditation (SIA), was piloted in 1995; (b) the coming-of-age period (i.e., 2001–2003), which witnessed the blossoming of various testing programs, including the National Accreditation Examinations for Translators and Interpreters (NAETI), the China Accreditation Tests for Translators and Interpreters (CATTI), the English Interpreting Certificate (EIC) provided by Xiamen University, and the Business Interpreting Accreditation Test (BIAT) offered by the Shanghai Human Resources Bureau; (c) the period of prosperity (i.e., 2004–2014), during which the existing tests expanded to incorporate different language pairs and cater to diverse market demands; and (d) the period of adjustment (i.e., 2013–2015), during which some certification testing programs (i.e., NAETI) were phased out, whereas others reformed their testing procedures (i.e., CATTI). It is estimated that almost two million people took the certification tests during these years, and that about 10% of them were certified as various types of translators and/or interpreters (see Huang and Liu 2017).

Apart from classroom-based assessment and certification testing, we also want to direct some attention to a special type of performance-based educational assessment, organized in the format of regional and national interpreting contests in China. Large-scale interpreting contests for the English-Chinese language pair began to emerge in 2009, when Xiamen University launched the first Cross-Straits Interpreting Contest. Another influential contest is the All China Interpreting Contest, launched in 2010 by the Translators Association of China and the China Translation and Publishing Corporation. In these contests, trainee interpreters from different universities and colleges compete with their peers, interpreting live on stage and being assessed in real time by a panel of experts (Wang 2011). While this type of performance assessment usually serves several purposes, one of them is to motivate students' learning and to improve interpreting teaching. Such contests can be high-stakes, largely because they are sometimes live-streamed on major social media platforms and draw considerable public attention. As a result, relevant stakeholders, including contestants, coaches, and directors of interpreting programs, may be heavily invested in the contests.

1.2 Testing and Assessment Research In the 2000s, much of the interpreting literature contributed by Chinese scholars pertains to theoretical discussion of how testing and assessment of interpreting could be conducted in various contexts (e.g., summative assessment, professional certification). For example, Liu (2001) contributes a fair share of her monograph to discussing the design of admission testing and summative assessment for interpreter training programs in China. Also focusing on educational interpreting assessment, Feng (2005) discusses a number of common problems in testing and assessment systems, including selection of improper test materials, lack of inter-rater reliability, and

inconsistency between test administrations. Alternatively, Huang (2005) analyzes four large-scale interpreter certification tests in China, casting light on test content, test materials, testing procedure and quality assessment criteria, and proposes a standardized approach to certification testing. Lastly, Wang (2011) discusses the rationale behind the design of an assessment rubric for an interpreting contest. Among this body of literature, two scholars’ research appears to be particularly important. First, a series of journal articles published by Chen (2002, 2003, 2009, 2011) introduces communicative language testing theory (i.e., Bachman 1990; Bachman and Palmer 1996) to Chinese scholars and researchers interested in testing and assessment of interpreting. Chen also presided over a research project, Theoretical system and operational model of professional interpreter accreditation testing (2009–2012) funded by the National Social Science Foundation, which channels theories and methods in second language testing literature to ITA. Chen’s research has motivated new generations of ITA researchers to seek cross-disciplinary pollination between the fields of interpreting studies and language testing (e.g., Han 2015a, Xing 2017; Xiao 2019). Second, a monograph authored by Cai (2007) provides a comprehensive description of how interpreting can be assessed. The systematicity of the book is unparalleled, covering a wide range of important topics related to ITA such as quality assessment by different groups of users, assessment of fidelity, target language quality and fluency of delivery, and analysis of different assessment methods. The theoretical discussion lays a solid groundwork for subsequent empirical exploration, and also provides new perspectives to investigate issues associated with ITA. Over the past couple of years, a group of Chinese researchers have initiated empirical investigations into a diverse range of topics in ITA. For example, Wang et al. (2020) developed a suite of interpreting competence scales, based on empirical analysis of a large sample of descriptors. In addition, Zhao and Dong (2013), Han (2015b, 2017), and Wen (2019) applied many-facet Rasch measurement to analyze ratings from human raters in interpreting assessment, casting light on rater severity, rater self-consistency, and psychometric properties of rating scales. Han (2016, 2019) also used generalizability theory to examine score reliability in high-stakes assessment of interpreting. Regarding formative assessment of students’ interpreting, Han and Riazi (2018), Han (2018a), and Li (2018) investigated the validity of self-assessment, while Han (2018b), Su (2019b), and Han and Zhao (2020) examined the accuracy of peer assessment. Furthermore, Xing (2017) studied the extent to which personality hardiness could predict interpreting performance in aptitude testing, and Xiao (2019) explored the use of propositional analysis to assess fidelity in consecutive interpreting. These studies are a testimony to the momentum of ITA-related research.

2 Purpose and Structure of the Current Volume Against the background outlined above, we feel that it is germane to present recent research developments in China to a broader audience, and to share our experience with colleagues in different parts of the world. More importantly, we want to

showcase that testing and assessment of interpreting is an important field of practice and a promising area of research. Overall, there are two parts in the volume. Part I consists of four chapters (i.e., Chapters 2, 3, 4, and 5), all concentrating on rater-mediated assessment (i.e., interpreting assessed by human raters); Part II also includes four chapters (i.e., Chapters 6, 7, 8, and 9), all of which explore the possibility of automatic assessment.

In Part I, we highlight rater-mediated, rating scale-based assessment of interpreting. That is, human raters use a rubric-referenced or descriptor-based rating scale to assess multiple quality dimensions of interpreting performance (e.g., fidelity, fluency, target language quality). We start with a descriptive piece of research on the design, development and validation of rating scales (Chapter 2), and then proceed to an investigation into the allotment of weightings to different assessment criteria in an analytic scale (Chapter 3). Next, we present a pedagogical case study in which a rubric-referenced scale is used by students to self-assess strategy use in interpreting (Chapter 4). Last, we discuss the issue of rater effects in interpreting assessment and compare different psychometric approaches to detecting and measuring the rater effects (Chapter 5).

In Part II, we focus on automatic assessment of interpreting, in which researchers rely on statistical modeling of quantifiable features of interpretations to estimate and predict interpreting quality. Specifically, Chapters 6 and 7 examine the possibility of using acoustically measured temporal variables of utterance fluency (automatically calculated by Praat) to predict human raters' perceived or judged fluency. Chapter 8 explores a corpus-based approach to quality assessment, based on a set of (para-)linguistic indices, and evaluates the accuracy of the new assessment approach. Finally, Chapter 9 conducts statistical modeling of a diverse array of linguistic indices generated by the computational linguistic tool, Coh-Metrix, to predict human scoring of interpreting performance in an interpreting contest.

3 An Overview of Individual Chapters

3.1 Part I: Rater-Mediated Assessment

The first chapter of Part I (Chapter 2), contributed by Weiwei Wang, describes the design, development and validation of the Interpreting Competence Scales, which is part of a larger research project known as China's Standards of English Language Ability (CSE). In parallel to the Common European Framework of Reference for Languages (CEFR) (Council of Europe 2001), the CSE is an ambitious research project that aims to demarcate English ability levels for Chinese learners/users of English, and to describe relevant characteristics at each CSE level (Ministry of Education and National Language Commission 2018). The unveiling of the CSE—Interpreting Competence Scales represents an important milestone in interpreting teaching and assessment, because it intends to offer a consistent and standardized frame of reference for educators and testers.
In Chapter 2, Weiwei Wang, who led the research team on the CSE—Interpreting Competence Scales, provides a descriptive account of how the relevant scales and associated scalar descriptors were designed and validated, and also illustrates how the scales could be used in formative assessment. Importantly, we would like to remind our audience that although the CSE—Interpreting Competence Scales have been published (see Ministry of Education and National Language Commission 2018), the evaluation of their utility is on-going. Particularly, interpreting trainers, educators, and testers are encouraged to apply, modify, and fine-tune the scales in their local training and assessment contexts.

The second chapter of Part I (Chapter 3), authored by Xiaoqi Shang, explores how a weighting scheme could be developed for assessing Chinese-to-English consecutive interpreting. The issue of weighting becomes relevant when an analytic rating scale is used to assess interpreting. Typically, an analytic rating scale consists of multiple sub-scales, with each focusing on a particular aspect of interpreting performance quality. In other words, when there are multiple assessment criteria, one needs to consider which criterion should be given more weighting. In the previous literature, such weightings have often been allotted arbitrarily, based on personal preference or theoretical reasoning. For instance, in her study to assess Korean-to-English interpreting, Lee (2008) assigned 40% of the weight to accuracy, 40% to target language quality, and 20% to delivery. In Chapter 3, Shang argues for an empirically based approach to deriving weightings for assessment criteria. In addition, Shang observes that when developing a weighting scheme, researchers and testers need to consider the directionality of interpreting and raters' language background (see also Su 2019a). This is because raters with different language backgrounds may assign different weightings to a particular assessment criterion. In Shang's exploratory study, he recruited two native English-speaking raters to assess 50 Chinese-to-English interpretations produced by trainee interpreters, based on both holistic and analytic rating scales. He then conducted regression analysis with the analytic scores as the independent variables and the holistic scores as the dependent variable. He proposed using the regression beta coefficients as a reference to allot weightings to different assessment criteria. Educators and researchers who are interested in developing a weighting scheme for interpreting assessment will find Shang's study interesting and informative.
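
To make this regression-based weighting logic concrete, here is a minimal sketch (not Shang's actual data, criteria, or model): holistic scores are regressed on standardized analytic sub-scores, and the resulting beta coefficients are rescaled into percentage weights. The file name and the criterion labels (fidelity, fluency, expression) are hypothetical placeholders.

```python
# Minimal sketch: derive criterion weights from a holistic-on-analytic regression.
import pandas as pd
import statsmodels.api as sm

# Assumed layout: one row per interpretation, three analytic sub-scores plus one
# holistic score. Column names are illustrative, not Shang's actual criteria.
scores = pd.read_csv("ratings.csv")          # fidelity, fluency, expression, holistic
criteria = ["fidelity", "fluency", "expression"]

z = (scores - scores.mean()) / scores.std()  # standardize so betas are comparable
model = sm.OLS(z["holistic"], sm.add_constant(z[criteria])).fit()

betas = model.params[criteria].clip(lower=0) # standardized beta coefficients
weights = betas / betas.sum()                # rescale so the weights sum to 1
print(weights.round(2))                      # candidate weighting scheme
```

In practice, one would of course inspect model fit and rater agreement before treating such weights as anything more than a starting point for discussion.
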
The third chapter of Part I (Chapter 4), contributed by Wei Su, describes an interesting pilot study in which a group of interpreting students self-assessed their strategy use in consecutive interpreting over a period of six weeks. The assessment instrument is a rubric designed to evaluate the extent to which three particular strategies (i.e., lexical conversion, syntactic conversion, and discourse expansion) are used by students in consecutive interpreting. Although rubric-referenced rating scales represent a popular type of instrument in interpreting assessment, the majority of the rubrics described in the existing literature are intended to evaluate the quality of interpreting performance (e.g., Tiselius 2009; Han 2015b; Lee 2015). The rubric of strategy use in Su's study is therefore one of the first rubrics that focus on the strategic/cognitive processes involved in interpreting. In addition, as our readers will find out, Su's study is based on classroom assessment conducted as part of a longitudinal pedagogical intervention. Given that classroom-based assessment takes place frequently in any type of interpreting program/course, readers who are interested in carrying out classroom-based assessment research will find Su's study relevant and helpful.

The fourth chapter of Part I (Chapter 5), written by Chao Han, describes possible rater effects (e.g., rater inconsistency, rater severity, rater bias) that may contribute systematic and/or random measurement error to a given assessment system. Importantly, Han compares three psychometric approaches to detecting and measuring these rater effects, namely, classical test theory, generalizability theory, and many-facet Rasch measurement, based on the empirical assessment data from a previous study (i.e., Han 2015b). As has been discussed above, in rater-mediated assessment of interpreting performance, it is human raters who play the central role in performance evaluation. Therefore, scores, marks, and grades produced by human raters can serve as a valuable window onto rater behavior. Certain rating patterns (e.g., excessively lenient assessments toward a particular group of test takers) are likely to indicate a problematic rating process and decision-making. Although much of the previous research draws on classical test theory to investigate inter-rater reliability, recent years have seen the increasing application of more sophisticated approaches such as generalizability theory (Han 2016, 2019) and Rasch analysis (Zhao and Dong 2013; Han 2015b, 2017; Shang and Xie 2020). Interested readers will find Han's study useful in helping them develop a better understanding of these analytic approaches.
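
For readers new to these psychometric approaches, the fragment below illustrates only the classical-test-theory end of the toolkit on a simulated examinee-by-rater score matrix: rater severity is glimpsed from deviations of rater means from the grand mean, and rater consistency from pairwise correlations. It is not Han's analysis, and the simulated rater offsets are invented; generalizability theory and many-facet Rasch measurement require dedicated estimation (e.g., variance-component or FACETS-style software) that a few lines cannot reproduce.

```python
# Classical-test-theory glance at rater effects on a simulated score matrix.
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
true_ability = rng.normal(7, 1.2, size=40)          # 40 simulated examinees
offsets = {"rater_A": -0.8, "rater_B": 0.0, "rater_C": 0.3, "rater_D": 0.9}

# Each rater = examinee ability + a severity/leniency shift + random noise.
scores = pd.DataFrame({name: true_ability + shift + rng.normal(0, 0.7, size=40)
                       for name, shift in offsets.items()})

severity = scores.mean() - scores.values.mean()     # negative: relatively severe rater
consistency = scores.corr(method="pearson")         # inter-rater correlations

print(severity.round(2))
print(consistency.round(2))
```
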

3.2 Part II: Automatic Assessment

The four chapters in Part II pertain to automatic assessment of interpreting, with the first two chapters (Chapters 6 and 7) focusing on assessment of interpreting fluency and the remaining two chapters (Chapters 8 and 9) drawing on corpus and computational linguistics to predict interpreting quality. Specifically, Chapter 6, contributed by Wenting Yu and Vincent J. van Heuven, explores the relationship between a group of temporal variables (that could be measured automatically by Praat) and raters' judged fluency. As in previous research (Yu and van Heuven 2017; Han et al. 2020), the motivation behind the study is that automatic assessment of interpreting fluency becomes possible if raters' judged/perceived fluency can be predicted by temporal variables that are measured automatically. The authors calculated 12 temporal measures (e.g., effective speech rate) and two melodic measures (i.e., pitch level and pitch range), based on the Praat analysis of interpreted samples from 12 trainee interpreters. Statistical analysis shows that a number of temporal variables may be good predictors of raters' judged fluency, indicating the possibility of automatic assessment of interpreting fluency.
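
As a flavour of how such temporal measures can be extracted programmatically, the sketch below uses praat-parselmouth, a Python interface to Praat, to derive a crude silent-pause profile from one recording by thresholding the intensity contour. It only approximates what Chapters 6 and 7 do with Praat itself, and the settings (25 dB below peak intensity, 0.25 s minimum pause) and the file name are illustrative assumptions, not the authors' parameters.

```python
# Rough silent-pause profile of one interpretation, via praat-parselmouth.
import numpy as np
import parselmouth  # pip install praat-parselmouth

def pause_profile(wav_path, db_below_peak=25.0, min_pause=0.25, time_step=0.01):
    snd = parselmouth.Sound(wav_path)
    intensity = snd.to_intensity(time_step=time_step)
    db = intensity.values[0]                    # intensity contour in dB
    silent = db < (db.max() - db_below_peak)    # frames treated as silence

    # Group contiguous silent frames into pauses; keep those >= min_pause.
    pauses, frames = [], 0
    for is_silent in np.append(silent, False):
        if is_silent:
            frames += 1
        else:
            if frames * time_step >= min_pause:
                pauses.append(frames * time_step)
            frames = 0

    total = snd.values.shape[1] / snd.sampling_frequency   # duration in seconds
    speaking = total - sum(pauses)
    runs = len(pauses) + 1                      # speech stretches between pauses
    return {
        "n_silent_pauses": len(pauses),
        "mean_pause_duration": float(np.mean(pauses)) if pauses else 0.0,
        "phonation_time_ratio": speaking / total,
        "mean_length_of_run_s": speaking / runs,
    }

print(pause_profile("student_01.wav"))          # hypothetical file name
```

Chapter 7, summarized next, can be read as asking how sensitive exactly these kinds of measures are to such threshold and intensity settings.
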

Chapter 7, authored by Zhiwei Wu, can be considered an extension of Chapter 6, as Wu investigates the feasibility of using Praat for automatic assessment of fluency in English-to-Chinese interpreting. Based on a sample of 140 audio recordings of interpreting performance, Wu examined the relationship between measures of speed and breakdown fluency (e.g., speech rate, mean silence duration, mean length of run) and raters' perceived fluency. In particular, in order to use Praat to automatically detect silent/unfilled pauses, Wu explored four conditions that may affect their automatic detection and measurement:

Condition A: the audio intensity was optimized for each audio file, and inter-segmental silences/noises were edited and removed;
Condition B: the intensity was optimized for each audio file, but inter-segmental silences/noises were left unedited;
Condition C: the intensity was set uniformly for all audio files, and inter-segmental silences/noises were edited and removed;
Condition D: the intensity was set uniformly for all audio files, and inter-segmental silences/noises were left unedited.

Statistical analysis of the acoustically measured variables and the raters' perceived fluency ratings in the different conditions indicates that several key parameters in Praat (e.g., intensity level, pause threshold) could be fine-tuned to produce better automatic assessments.

Chapter 8, authored by Yanmeng Liu, explores a corpus-based approach to interpreting quality assessment. Liu sampled 64 English-to-Chinese interpreting renditions of four different levels of quality (i.e., excellent, good, pass, and fail) from the Parallel Corpus of Chinese EFL Learners, and profiled (para-)linguistic measures of the interpreted texts. These (para-)linguistic features can be roughly categorized into three major assessment criteria, namely, information accuracy, output fluency, and audience acceptability. Statistical analysis (e.g., principal component analysis, decision tree analysis) was run to examine whether the selected (para-)linguistic features could predict the correct classification of a given interpreted rendition into a certain level of quality. Liu's study represents an interesting exploration in which statistical techniques from machine learning (e.g., decision tree analysis) were leveraged to analyze interpreting corpora and to categorize interpreted renditions based on their quality.

Chapter 9, contributed by Lingwei Ouyang, Qianxi Lv, and Junying Liang, describes an interesting application of the computational linguistic tool Coh-Metrix to quantify multidimensional linguistic features of interpreted renditions collected from the All China Interpreting Contest. Based on the Coh-Metrix indices, the authors built statistical models to predict human raters' scores of interpreting quality. It is found that a set of linguistic indices (e.g., word count) could account for 60% of the variance in the human raters' scores. This preliminary finding suggests a potential area of research in which different computational linguistic tools could be harnessed to describe and quantify a diverse array of linguistic features of interpreted renditions, and in which statistical models could be built to predict human scoring. This line of research can also shed light on human raters' perception of interpreting quality from a linguistic point of view.
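
The modeling step in Chapters 8 and 9 can be pictured in generic terms: once each rendition has been reduced to a row of numeric indices (whether from a corpus profile or from a Coh-Metrix export), predicting human judgments is an ordinary regression or classification task. The sketch below is illustrative only; the file, the feature names, and the models are placeholders rather than the authors' actual indices, analyses, or results.

```python
# Generic sketch: predict human judgments from per-rendition linguistic indices.
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

data = pd.read_csv("renditions.csv")       # hypothetical table: one row per rendition
features = ["word_count", "type_token_ratio", "mean_sentence_length"]  # placeholders
X = data[features]

# In the spirit of Chapter 9: how much score variance do the indices explain?
reg = LinearRegression().fit(X, data["human_score"])
print(reg.score(X, data["human_score"]))   # in-sample R-squared

# In the spirit of Chapter 8: can the indices recover each rendition's quality level?
tree = DecisionTreeClassifier(max_depth=3, random_state=0)
print(cross_val_score(tree, X, data["quality_level"], cv=5).mean())
```
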


4 Conclusion

In the opening chapter of the book, we first provide an overview of how testing and assessment of interpreting has been conducted in China. Specifically, we highlight two major areas, namely, interpreter training and professional certification, in which testing and assessment plays a critical role in producing evidence for relevant decision-making (e.g., certification). We also review the testing and assessment research conducted by Chinese scholars, outlining the important theoretical groundwork and the recent empirical investigation. Next, we describe the structure of the current volume and introduce the major topics discussed and examined in the chapters. We divide the volume into two parts, with the first part focusing on rater-mediated assessment and the second part on automatic assessment. Despite our intention to incorporate a diverse range of research topics, we are aware that not all research developments have been covered in the current volume (e.g., the development of a certification test for sign language interpreters in China). Nonetheless, we remain confident that the volume offers insights into interpreting testing and assessment in China, and will inspire more evidence-based investigation into rater-mediated and/or automatic assessment of interpreting.

Acknowledgments This work was supported by the National Social Science Foundation (grant number: 18AYY004).

References ALTA Language Services. 2007. Study of California’s court interpreter certification and registration testing. http://www.courts.ca.gov/documents/altafinalreport.pdf. Accessed 10 June 2015. Angelelli, Claudia, and Holly Jacobson. 2009. Testing and assessment in translation and interpreting studies: A call for dialogue between research and practice. Amsterdam: John Benjamins. Bachman, Lyle. 1990. Fundamental considerations in language testing. Oxford: Oxford University Press. Bachman, Lyle, and Adrian Palmer. 1996. Language testing in practice: Designing and developing useful language tests. Oxford, UK: Oxford University Press. Cai, Xiaohong. 2007. Interpretation and evaluation. Beijing: China Translation and Publishing Corporation. Chen, Jing. 2002. Fundamental considerations in interpreting testing. Chinese Translators’ Journal 23: 51–53. Chen, Jing. 2003. On communicative approach to interpreting testing. Chinese Translators’ Journal 25: 67–71. Chen, Jing. 2009. Authenticity in accreditation tests for interpreters in China. The Interpreter and Translator Trainer 3 (2): 257–273. Chen, Jing. 2011. Language assessment: Its development and future—An interview with Lyle F. Bachman. Language Assessment Quarterly 8 (3): 277–290. Clifford, Andrew. 2004. A preliminary investigation into discursive models of interpreting as a means of enhancing construct validity in interpreter certification. https://ruor.uottawa.ca/handle/ 10393/29086. Accessed 15 Sept 2019.


Collados Aís, Ángela. 1998. La evaluación de la calidad en interpretación simultánea: La importancia de la comunicación no verbal. Granada: Comares. Council of Europe. 2001. Common European framework of reference for languages: Learning, teaching, assessment. Cambridge: Cambridge University Press. Dawrant, Andrew, Binhua Wang, and Hong Jiang. Forthcoming. Conference interpreting in China. In Routledge handbook of conference interpreting (Forthcoming), ed. Elisabet Tiselius and Michaela Albl-Mikasa. London: Routledge. Feng, Jianzhong. 2005. Towards the standardization of interpretation testing. Foreign Languages Research 89: 54–58. Hale, Sandra Beatriz, Ignacio Garcia, Jim Hlavac, Mira Kim, Miranda Lai, Barry Turner, and Helen Slatyer. (2012). Development of a conceptual overview for a new model for NAATI standards, testing and assessment. http://www.naati.com.au/PDF/INT/INTFinalReport.pdf. Accessed 22 May 2015. Han, Chao. 2015a. Building the validity foundation for interpreter certification performance testing. The Interpreter and Translator Trainer 10 (2): 248–249. Han, Chao. 2015b. Investigating rater severity/leniency in interpreter performance testing: A multifaceted Rasch measurement approach. Interpreting 17 (2): 255–283. Han, Chao. 2016. Investigating score dependability in English/Chinese interpreter certification performance testing: A generalizability theory approach. Language Assessment Quarterly 13 (3): 186–201. Han, Chao. 2017. Using analytic rating scales to assess English–Chinese bi-directional interpreting: A longitudinal Rasch analysis of scale utility and rater behavior. Linguistica Antverpiensia, New Series—Themes in Translation Studies 16: 196–215. Han, Chao. 2018a. A longitudinal quantitative investigation into the concurrent validity of self and peer assessment applied to English-Chinese bi-directional interpretation in an undergraduate interpreting course. Studies in Educational Evaluation 58: 187–196. Han, Chao. 2018b. Latent trait modelling of rater accuracy in formative peer assessment of EnglishChinese consecutive interpreting. Assessment and Evaluation in Higher Education 43 (6): 979– 994. Han, Chao. 2019. A generalizability theory study of optimal measurement design for a summative assessment of English/Chinese consecutive interpreting. Language Testing 36: 419–438. Han, Chao, and Mehdi Riazi. 2018. The accuracy of student self-assessments of English-Chinese bidirectional interpretation: A longitudinal quantitative study. Assessment and Evaluation in Higher Education 43: 386–398. Han, Chao, Sijia Chen, Rongbo Fu, and Qian Fan. 2020. Modeling the relationship between utterance fluency and raters’ perceived fluency of consecutive interpreting. Interpreting 22 (2): 211–237. Han, Chao, and Xiao Zhao. 2020. Accuracy of peer ratings on the quality of spoken-language interpreting. Assessment and Evaluation in Higher Education. https://doi.org/10.1080/02602938. 2020.1855624. Huang, Min. 2005. Toward a more standardized large-scale accreditation test for interpreters. Chinese Translators’ Journal 6: 62–65. Huang, Min, and Junping Liu. 2017. The 20-year development of China’s translation and interpreting accreditation examinations: Retrospect, problems and prospect. Technology Enhanced Foreign Language Education 173: 49–54. Huertas-Barros, Elsa, Sonia Vandepitte, and Emilia Iglesias-Fernández. 2018. Quality assurance and assessment practices in translation and interpreting. Hershey, PA: IGI Global. Koby, Geoffrey S., and Isabel Lacruz. 2017. 
Translator quality-Translation quality: Empirical approaches to assessment and evaluation. Linguistica Antverpiensia New Series—Themes in Translation Studies 17: 1–229. Lee, Jieun. 2008. Rating scales for interpreting performance assessment. The Interpreter and Translator Trainer 2 (2): 165–184. Lee, Sang-Bin. 2015. Developing an analytic scale for assessing undergraduate students’ consecutive interpreting performances. Interpreting 17: 226–254.


Li, Xiangdong. 2018. Self-assessment as ‘assessment as learning’ in translator and interpreter education: Validity and washback. The Interpreter and Translator Trainer 12 (1): 48–67. Liu, Heping. 2001. Interpreting skills: 思维科学与口译推理教学法. Beijing: China Translation and Publishing Corporation. Melby, Alan K. 2013. Certification. Translation and Interpreting 5 (1): 1–236. Ministry of Education and National Language Commission. 2018. China’s standards of English language ability. http://cse.neea.edu.cn/res/ceedu/1811/6bdc26c323d188948fca8048833f151a. pdf. Accessed 20 Oct 2020. Roat, Cynthia E. 2006. Certification of health care interpreters in the United States: A primer, a status report and considerations for national certification. Los Angeles, CA. http://www.calendow.org/ uploadedFiles/certification_of_health_care_interpretors.pdf. Accessed 22 May 2015. Sawyer, David. 2004. Fundamental aspects of interpreter education: Curriculum and assessment. Amsterdam: John Benjamins. Setton, Robin, and Andrew Dawrant. 2016. Conference interpreting: A trainer’s guide. Amsterdam: John Benjamins. Shang, Xiaoqi, and Guixia Xie. 2020. Aptitude for interpreting revisited: predictive validity of recall across languages. The Interpreter and Translator Trainer 14 (3): 344–361. Su, Wei. 2019a. Exploring native English teachers’ and native Chinese teachers’ assessment of interpretin. Language and Education. https://doi.org/10.1080/09500782.2019.1596121. Su, Wei. 2019b. Interpreting quality as evaluated by peer students. The Interpreter and Translator Trainer. https://doi.org/10.1080/1750399X.2018.1564192. Tang, Fang, and Dechao Li. 2013. Interpreting aptitude testing and its research. Journal of Xi’an International Studies University 21 (2): 103–107. Tiselius, Elisabet. 2009. Revisiting Carroll’s scales. In Testing and assessment in translation and interpreting studies, ed. Claudia Angelelli and Holly Jacobson, 95–121. Amsterdam: John Benjamins. Tsagari, Dina, and Roelof van Deemter. 2013. Assessment issues in language translation and interpreting. Frankfurt: Peter Lang. Wang, Binhua. 2011. Exploration of the assessment model and test design of interpreting competence. Foreign Language World 1: 66–71. Wang, Weiwei, Yi Xu, Binhua Wang, and Lei Mu. 2020. Developing interpreting competence scales in China. Frontiers in Psychology. https://doi.org/10.3389/fpsyg.2020.00481. Wen, Qian. 2019. A many-facet Rasch model validation study on business negotiation interpreting test. Foreign Languages in China 16 (3): 73–82. Wu, Shao-Chuan. 2010. Assessing simultaneous interpreting: A study on test reliability and examiners’ assessment behavior. https://theses.ncl.ac.uk/jspui/handle/10443/1122. Accessed 15 Apr 2019. Xiao, Rui. 2019. A propositional analysis approach to the assessment of information fidelity in English-to-Chinese consecutive interpreting. https://etd.xmu.edu.cn/detail.asp?serial=CB4 3B0B2-9CAB-4CE1-8007-75C94C96EBA2. Accessed 20 Oct 2020. Xing, Xing. 2017. The effects of personality hardiness on interpreting performance: Implications for aptitude testing for interpreting. https://etd.xmu.edu.cn/detail.asp?serial=D5EFE288-9F1643E6-8D04-988748E5FDF7. Accessed 12 Mar 2019. Yao, Bin. 2019. The origins and early developments of the UN Training Program for Interpreters and Translators in Beijing. Babel 65 (3): 445–464. Yu, Wenting, and Vincent J. van Heuven. 2017. Predicting judged fluency of consecutive interpreting from acoustic measures: Potential for automatic assessment and pedagogic implications. 
Interpreting 19 (1): 47–68. Zhao, Nan, and Yanping Dong. 2013. Validation of a consecutive interpreting test based on multifaceted Rasch model. Journal of PLA University of Foreign Languages 36 (1): 86–90.


Jing Chen is a full professor in the College of Foreign Languages and Cultures at Xiamen University, China. Her research interests include interpreting quality assessment and interpreting pedagogy. She has published widely in peer-reviewed journals in English and Chinese (e.g., Interpreter and Translator Trainer, Language Assessment Quarterly, Chinese Translators Journal), and she has led several large-scale research projects funded by the European Union (i.e., Asia Link—the EU-Asia Interpreting Studies) and the China National Social Sciences Foundation. She is serving as the Deputy Director of the National Interpreting Committee of the Translators Association of China. Chao Han is a full professor in the College of Foreign Languages and Cultures at Xiamen University, China. He conducted his Ph.D. research at the Department of Linguistics at Macquarie University (Sydney), focusing on interpreter certification testing. He is interested in testing and assessment issues in translation and interpreting (T&I) and methodological aspects of T&I studies. His recent publications have appeared in such journals as Interpreting, The Translator, Language Testing, and Language Assessment Quarterly. He is currently a member of the Advisory Board of the International Journal of Research and Practice in Interpreting.

Rater-mediated assessment

Introducing China’s Standards of English Language Ability (CSE)—Interpreting Competence Scales Weiwei Wang

Abstract With more than 500 undergraduate and postgraduate degree programs in translation and interpreting (T&I) being launched over the past decade, interpreter training and education has been developing rapidly in China. This creates a huge demand for testing and assessment of interpreting in the educational context. To provide reliable measurement of interpreting competence, a group of researchers have worked together since 2014 to develop a suite of interpreting competence scales as part of the national project entitled China’s Standards of English Language Ability (CSE) initiated by the Ministry of Education in China. In this chapter, we aim to introduce the CSE project, focusing on the development and validation of the interpreting competence scales. Specifically, we first conceptualize and define the construct of interpreting competence, drawing on both T&I and language testing literature. We then describe the design, development and validation of the CSE-Interpreting Competence Scales. Next, we elaborate on two scales, the Overall Interpreting Competence Scale and the Self-Assessment Scale, to help relevant stakeholders gain a better understanding of characteristics of the scales. Finally, we discuss possible applications of the scales in interpreting teaching, learning and assessment, and call for more empirical research on scale validation. Keywords China’s standards of English · Interpreting Competence Scales · Design · Validation

1 Introduction

In 2014, the National Education Examinations Authority, entrusted by the Ministry of Education of the People's Republic of China, initiated the China's Standards of English Language Ability (CSE) Project with the aim of enhancing English education in China (Jin et al. 2017). In the following four years (i.e., 2014–2017), the eight research teams
under the auspices of the CSE Project (i.e., listening, reading, speaking, writing, translation, interpreting, pragmatics, and organizational knowledge) worked collaboratively to achieve the common goal of researching and describing English competence for teaching, learning, and assessment. Adopting a use-oriented approach to the description of language ability, the CSE defines language ability as the ability of language learners to comprehend and express information when applying language knowledge, world knowledge, and strategies to perform language use tasks in a variety of contexts, as can be seen in Fig. 1 (Liu and Han 2018). Notably, in addition to the four major language skills (i.e., listening, reading, speaking, and writing), translation and interpreting, viewed as interlingual mediation, are situated at the upper-intermediate to advanced levels of the CSE framework. As shown in Fig. 1, the two-headed arrow connecting "Interpreting/Translation" to "Functions/Texts" means that interpreting learners need to operate through two channels, namely, the source and the target language, in different contexts with various communicative functions (Dunlea et al. 2019).

Fig. 1 The structure and the components of the CSE (see Liu and Han 2018)

With over 500 undergraduate and postgraduate translation and interpreting (T&I) degree programs being launched over the past decade in China, the lack of common T&I competence standards has become an urgent issue in the training and assessment of interpreters. To address the issue, we developed the CSE-Interpreting Competence Scales with the aim of creating a national framework of interpreting competence to support T&I students' professional development, and to provide guidance for interpreting training and assessment in China. To the best of our knowledge, the research on the CSE-Interpreting Competence Scales has been the most intensive effort and the largest empirical study to generate and validate scalar descriptors of interpreting competence, not only in China but also across the world. As will be seen in the description below, the comprehensiveness and the enormity of the research (e.g., the sheer number of scalar descriptors generated and the number of participants involved) are unparalleled. As such, replication of our research may not be easy.

In this chapter, we first review definitions of interpreting competence so as to bring clarity to this construct. Particularly, we describe the model of interpreting competence used for the development of the CSE-Interpreting Competence Scales. Second, we provide a descriptive account of the design, development, and validation of the CSE-Interpreting Competence Scales. Third, we elaborate on two particular scales (see Appendix 1), the Overall Interpreting Competence Scale and the Self-Assessment Scale, focusing on salient characteristics of each level of the scales. Fourth, we discuss how the scales could be applied in teaching, learning, and assessment.

2 Modelling Interpreting Competence 2.1 Interpreting Competence Previous research on interpreting competence has different foci, “most notably [on] cognitive processes, [on] education (including curriculum design, aptitude testing, and pedagogy) and [on] certification programs” (Pöchhacker 2015). For instance, Kalina (2000) defines interpreting competence, from a psycholinguistic perspective, as the ability to process texts in a bilingual or multilingual communication environment. Zhong (2003) proposes that interpreting competence ought to incorporate linguistic knowledge, encyclopedic knowledge, and skills related to both professional interpreting and artistic presentation. Wang (2007, 2012) defines interpreting competence as the fundamental framework of knowledge and skills required to accomplish the task of interpreting, including professional and physio-psychological attributes. According to Setton and Dawrant (2016), interpreting competence is composed of four core elements: bilingual language proficiency, knowledge, skills, and professionalism. Most of the interpreting literature that discusses interpreting competence examines the composition of interpreting competence. For instance, Pöchhacker (2000) proposes a multidimensional model of interpreting competence that highlights language and cultural skills, translational skills, and subject-matter knowledge. In


In this model, linguistic transfer competence is regarded as a core element, complemented by cultural competence and interaction management skills. These elements are supported by professional performance skills and ethical behavior (Pöchhacker 2015). Albl-Mikasa (2013) draws on Kalina (2002) and Kutz (2010) to propose a detailed model of interpreting competence that comprises five skill sets, each with a subset of skills: (a) pre-process (language proficiency, terminology management, and preparation); (b) in-process (comprehension, transfer, and production); (c) peri-process (teamwork and ability to handle stress); (d) post-process (terminology work and quality control); and (e) para-process (business acumen, customer relations, and meta-reflection). Han (2015) defines interpreting ability from an interactionalist perspective as consisting of knowledge of languages, interpreting strategies, topical knowledge, and metacognitive processes, all of which in turn interact with the external contexts where interpreting takes place. Dong (2018) conducts a longitudinal investigation into the development of students’ interpreting competence and proposes a complex dynamic system to account for how self-organization among different key parameters (e.g., L2 proficiency, working memory, psychological factors) could foster and cultivate interpreting competence. The definitions and models of interpreting competence we have reviewed indicate that: (a) researchers agree that interpreting competence goes beyond bilingual competence and should focus on cross-cultural communication; (b) although there is no universally accepted model of interpreting competence, the previous discussion sheds light on important components of interpreting competence; and (c) little attention has been paid to the developmental stages of interpreting competence, and to the different competence requirements for specific interpreting tasks.

2.2 Interpreting Quality Assessment

When it comes to the assessment of interpreting competence, research on interpreting quality is most pertinent (e.g., Barik 1975; Berk-Seligson 1988; Kurz 1993; Moser-Mercer 1996; Collados Aís 1998; Campbell and Hale 2003; Kalina 2005a, b; Liu 2008; Gile 2011). In terms of quality assessment, numerous researchers have proposed such criteria as completeness, accuracy, intonation, voice projection, language use, and logical cohesion (e.g., Bühler 1986; Kurz 1993; Moser-Mercer 1996; Riccardi 2002; Garzone 2003; Bartłomiejczyk 2007). In particular, Pöchhacker (2001) suggests four overarching criteria: accurate rendition, adequate target-language expression, equivalent intended effect, and successful communicative interaction. As Pöchhacker (2015, p. 334) puts it, “on a superficial level, quality relates to something that is good or useful, or to behavior that is sanctioned or expected.” The past decades have seen a particular strand of investigation into interpreting quality assessment, especially in the educational context (Yeh and Liu 2006; Lee 2008; Postigo 2008; Tiselius 2009; Liu 2013; Lee 2015; Han 2016, 2017, 2018a, 2019; Han and Riazi 2018; Lee 2018, 2019).


Among these studies, the design and application of rating scales have been of great interest to interpreting researchers and trainers. Drawing on the previous literature, interpreting quality constructs and parameters have been operationalized into a plethora of rubric-referenced rating scales. For instance, Liu (2013) introduced the development of the rating scheme for Taiwan’s interpretation certification exam, with two rating domains (i.e., accuracy and delivery). Lee (2015) described the process of developing an analytic rating scale for assessing undergraduate students’ consecutive interpreting at a Korean university, which includes three rating categories (i.e., content, form, and delivery) and five bands (0–4). Han (2017, 2018a) examined the validity of rating scales used to assess students’ English-Chinese consecutive interpreting. While research on assessment rubrics and rating scales has been a popular topic in interpreting studies, the construct and measurement of the progressive development of interpreting competence have received limited scholarly attention. An exception is the interpreting scale developed by the Interagency Language Roundtable (ILR). The ILR characterizes interpreting performance at three levels: (a) Professional Performance (Levels 3–5), (b) Limited Performance (Levels 2 and 2+), and (c) Minimal Performance (Levels 1 and 1+). In the broader field of language testing and assessment, several language proficiency scales have been developed in Europe, Canada, Australia, and other countries and regions. These include the American Council on the Teaching of Foreign Languages Proficiency Guidelines, the International Second Language Proficiency Ratings, and the Canadian Language Benchmarks. The Common European Framework of Reference for Languages (CEFR) scale, jointly developed by more than forty member countries of the Council of Europe, is widely used across the world (Council of Europe 2001). However, none of these scales covers interpreting. This leaves considerable room for T&I educators and researchers to design and develop scales that describe the different developmental stages of T&I competence.

2.3 The Model of Interpreting Competence in the CSE Project

Based on the literature review, we designed a construct model of interpreting competence for the CSE Project (see Fig. 2). In the model, bilingual competence, which refers to linguistic ability in the source and target languages, is the foundation and prerequisite for (the development of) interpreting competence. Cognitive abilities serve as the core of interpreting competence and include several sets of abilities: identifying and retrieving information (i.e., identifying logic in the source language, memorizing, and recognizing notes); summarizing and analyzing (i.e., identifying logic, identifying primary and secondary information, selecting information, inferencing, filtering, and summarizing); restructuring (i.e., reorganizing information and changing word order); and producing (e.g., language structuring and expressing).


Fig. 2 The construct model of interpreting competence (Wang et al. 2020)

Bilingual competence, interpreting strategies (i.e., planning, execution, appraisal, and compensation), and knowledge for interpreting (such as encyclopedic knowledge, knowledge about interpreting techniques, and codes of conduct) contribute to the overall development of interpreting competence. Moreover, interpreting competence is supported by professionalism (e.g., psychological competence and professional ethics) and can be assessed on the basis of an interpreter’s performance in a range of typical interpreting tasks/activities.

3 Design and Development of the CSE-Interpreting Competence Scales

3.1 Scale Design: Operational Descriptive Scheme

A preliminary operational descriptive scheme (Fig. 3) for the CSE-Interpreting Competence Scales was proposed (Wang et al. 2018; Wang et al. 2020), based on the construct model of interpreting competence (Fig. 2) and the overall framework of the CSE (Liu 2015; Liu and Han 2018).


Fig. 3 The operational descriptive scheme (Wang et al. 2020)

It is worth noting that bilingual competence is not included in the scheme because the listening, speaking, reading, and writing scales of the CSE already describe English language proficiency in those skills. As explained by Wang et al. (2020), cognitive abilities are the key components of interpreting competence and are described via interpreting tasks in the scales of overall interpreting performance and of self-assessment, and in the subscales for typical interpreting activities. Evidence of learners’ interpreting competence can be identified and collected from their participation in various interpreting tasks/activities. One of the cognitive abilities, for instance, is described as “Can understand the content of an interview while analyzing the logical relationships in the source-language information during SI for a media interview.” Typical interpreting activities include business negotiations, training, lectures, interviews, and conferences. For instance, the following statements describe two different settings: (a) “Can interpret important information, such as research objectives, methodology, and conclusions, during SI for an academic talk” and (b) “Can follow target-language norms to reflect source-language register and style during Consecutive Interpreting (CI) with note-taking for a foreign affairs meeting”. Interpreting strategies are described in four aspects: planning, execution, appraisal, and compensation. The strategy scales mainly concern the skills, methods, and actions needed to fulfill interpreting tasks and solve problems. For instance, the following descriptor pertains to the strategy of anticipation: “Can use contextual information to anticipate upcoming content and information actively”.


Interpreting knowledge and professionalism are addressed by the knowledge scales, which are still under construction (Mu et al. 2020). The knowledge scales in the CSE-Interpreting Competence Scales will include descriptors about encyclopedic knowledge, codes of conduct, and relevant theories of interpreting.

3.2 Scale Development: Collection and Analysis of Scalar Descriptors

The development of the CSE-Interpreting Competence Scales consists mainly of two phases (Wang et al. 2020): (a) collection of relevant descriptors and (b) follow-up analysis. In the first phase, the research team collected descriptors of interpreting competence at all possible levels through documentary research, exemplar generation, and expert evaluation. The “can-do” principle of the CEFR was adopted (see North 2014). Under this principle, descriptors of the interpreting competence scales need to clarify three aspects of interpreting (as also shown in Table 1): (a) performance, i.e., the interpreting task (e.g., “interpreting a speech consecutively”); (b) criterion, i.e., intrinsic characteristics of the performance, involving a range of cognitive efforts or interpreting skills (e.g., “actively anticipating speech information, with note-taking”); and (c) condition, i.e., any extrinsic constraints or conditions defining the performance (e.g., “moderate speech rate, high information density, and no accent”).

Table 1 The three-element “can-do” principle governing the generation of the descriptors

Descriptor: Can accurately, completely, and fluently interpret speeches on subjects of a wide range of categories, including technical speeches relating to unfamiliar fields. (Overall Interpreting Ability Scale, Level 9)
Performance: Can interpret speeches
Criterion: accurately, completely, and fluently
Condition: Subjects of a wide range of categories, including technical speeches relating to unfamiliar fields

Descriptor: Can interpret specialized discourse (e.g., academic lectures, court hearings) fluently and professionally. (Overall Interpreting Ability Scale, Level 8)
Performance: Can interpret discourse
Criterion: fluently and professionally
Condition: Specialized fields (e.g., academic lectures, court hearings)


As can be seen in Table 1, the “can-do” principle specifies the expected form of scalar descriptors of interpreting competence. In other words, a descriptor needs to specify which interpreting mode an interpreter performs (e.g., simultaneous interpreting, consecutive interpreting, liaison interpreting), using which cognitive strategies (e.g., summarization, anticipation, and identifying logic), in which settings/activities (e.g., press conferences, business negotiations), and to which quality standard the performance is expected to be delivered (e.g., accuracy, completeness, and fluency).

Specifically, we used three methods to collect and/or generate scalar descriptors. First, we relied mainly on documentary research to identify and collect descriptors from the existing literature, including existing scales, curriculum standards, syllabi, examination guidelines, textbooks, and industry standards. Second, we used exemplar generation. That is, we recruited interpreting students at different learning stages with high, medium, and low levels of interpreting competence and asked them to perform a variety of interpreting tasks. We video-recorded their performance and interviewed them immediately after they completed the tasks. We then assembled a team of researchers to generate and write descriptors based on the students’ self-reflections and the characteristics of their interpreting performance. Third, for expert evaluation, 69 interpreting professionals and trainers were invited to write descriptors based on the videos and on their professional and teaching experience. Using these three methods, we initially collected or generated a total of 9,208 descriptors, 8,937 of them obtained from the documentary research and the remaining 271 from the exemplar generation and the expert evaluation.

In the second phase, the research team analyzed the collected descriptors. We first categorized the descriptors according to the operational descriptive scheme shown in Fig. 3 and assigned each descriptor to a provisional level on the scale (Levels 5–9; see Fig. 4). Then, we held three rounds of intra- and inter-group reviews and analyses (Wang et al. 2020), during which experts and teachers of interpreting were invited to make preliminary evaluations of the descriptors. During the intra-group reviews, each descriptor had to be reviewed and verified by more than three members of the research team. If the reviewers could not reach a consensus on the clarity, relevance, and utility of a given descriptor, it was rewritten or deleted after further discussion. During the inter-group reviews, the descriptors were cross-checked by several research teams in the CSE Project (e.g., the interpreting team and the translation team). Descriptors deemed inappropriate or controversial were then revised or deleted. Moreover, we carried out two rounds of revision, eliciting views and comments from external experts, teachers, and interpreting professionals via a combination of online surveys and interactive workshops (Wang et al. 2020). At the end of this repeated analysis and evaluation, we produced the first draft of the scales with 548 descriptors.
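To make the three-element structure concrete, a collected descriptor and its provisional level can be represented as a simple record during categorization. The following Python sketch is purely illustrative: the Descriptor class and its field names are our own, not the project's internal data format; only the example text is taken from Table 1.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Descriptor:
    """One "can-do" descriptor, decomposed into the three elements of Table 1."""
    text: str                                # the full can-do statement
    performance: str                         # the interpreting task performed
    criterion: str                           # intrinsic quality characteristics
    condition: str                           # extrinsic constraints on the performance
    provisional_level: Optional[int] = None  # level 5-9 assigned before validation

example = Descriptor(
    text=("Can accurately, completely, and fluently interpret speeches on subjects "
          "of a wide range of categories, including technical speeches relating to "
          "unfamiliar fields."),
    performance="Can interpret speeches",
    criterion="accurately, completely, and fluently",
    condition="subjects of a wide range of categories, including unfamiliar technical fields",
    provisional_level=9,
)
print(example.performance, "|", example.criterion)
```

Structuring descriptors this way makes it straightforward to check, during the review rounds, that every statement specifies a performance, a criterion, and a condition.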


Fig. 4 Interpreting mode and typical activities at each level of the Self-Assessment Scale

3.3 Validation of the CSE-Interpreting Competence Scales

To validate the scales and the descriptors, we used both quantitative and qualitative methods. As reported by Wang et al. (2020), two rounds of validation were carried out from 2015 to 2017.

First, we conducted quantitative validation. We designed a series of questionnaires to survey teachers, students, and potential users of the scales in different provinces and regions of China in order to verify the rationality of the scaling of the descriptors. A total of 42 questionnaires containing the first draft of the 548 descriptors were created and distributed. We sampled the language behaviors of interpreting students nationwide and collected relevant data to match the descriptors with the students’ language behaviors. A total of 5,787 teachers from 259 colleges and universities responded by rating the descriptors against their students’ actual Chinese-English interpreting performance, while 30,682 students from 215 colleges and universities participated by evaluating their own interpreting competence from June 20 to July 15, 2016. The questionnaire data were analyzed using Rasch modeling (Zhu 2016; Wang et al. 2020) to scale the descriptors and to test the representativeness of each descriptor and the appropriateness of its assigned scale level.

Second, we conducted qualitative validation. Expert judgment and focus group interviews with English teachers, interpreting trainers, and interpreters were carried out to corroborate and contextualize the findings from the quantitative analysis. A total of 260 participants from various groups, including high school instructors, English teachers, and university T&I trainers, took part in 26 focus group interviews in six regions and provinces from March to July 2017. The CSE-Interpreting Competence Scales were also linked with other scales and finalized through expert reviews.
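To illustrate the logic of Rasch-based scaling without reproducing the project's actual analysis (which was based on the full teacher and student questionnaire data and the Rasch analyses reported in Zhu 2016 and Wang et al. 2020), the sketch below estimates descriptor difficulties from a hypothetical dichotomous response matrix, where 1 means a rater judged that a student can do what a descriptor states. The function name, the toy data, and the simple gradient-based estimation routine are all our own simplifications.

```python
import numpy as np

def fit_rasch(responses, n_iter=200, lr=0.1):
    """Estimate person abilities (theta) and descriptor difficulties (b) for a
    dichotomous Rasch model, P(X=1) = 1 / (1 + exp(-(theta - b))), by scaled
    gradient ascent on the joint log-likelihood. A didactic simplification only."""
    n_persons, n_items = responses.shape
    theta = np.zeros(n_persons)
    b = np.zeros(n_items)
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(theta[:, None] - b[None, :])))
        residual = responses - p            # observed minus expected response
        theta += lr * residual.sum(axis=1) / n_items
        b -= lr * residual.sum(axis=0) / n_persons
        b -= b.mean()                       # anchor the scale: mean difficulty = 0
    return theta, b

# Toy data: 6 raters x 4 descriptors. In a real analysis, raters or descriptors
# with all-0 or all-1 response patterns would be screened out first.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(6, 4)).astype(float)
abilities, difficulties = fit_rasch(X)
print("Descriptors ordered from easiest to hardest:", np.argsort(difficulties))
```

In the project's validation, Rasch analyses of the questionnaire responses served the same purpose at scale: placing every descriptor on a common difficulty continuum so that its assigned level could be checked against the data.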


After these analyses and reviews, we finalized the CSE-Interpreting Competence Scales, which consist of 12 sub-scales and 369 descriptors: (a) the Overall Interpreting Performance scale (1 scale, 16 descriptors), (b) the Interpreting Competence in Typical Interpreting Activities scales (6 scales, 220 descriptors), (c) the Interpreting Strategy scales (4 scales, 105 descriptors), and (d) the Self-Assessment for Interpreting Competence scale (1 scale, 28 descriptors).

4 The Scale Levels in the CSE-Interpreting Competence Scales

4.1 Scale Levels

The CSE-Interpreting Competence Scales consist of five levels categorized into three stages (see Fig. 4). Here, we take the Self-Assessment Scale as an example to explain the scale structure. As shown in Fig. 4, different interpreting modes and tasks are associated with different developmental stages of interpreting competence, from the basic stage (Levels 5 and 6) to the intermediate stage (Levels 7 and 8) and finally to the advanced stage (Levels 8 and 9). Levels 5 and 6 constitute the basic stage, at which one should be able to perform liaison interpreting for assignments such as shopping, airport transfers, tours, and small commodity fairs. The intermediate stage encompasses Levels 7 and 8. Student interpreters at these levels ought to be able to complete interpreting tasks with longer segments and in more formal settings, such as business visits and lectures. It is worth noting that Level 8 involves both consecutive and simultaneous interpreting; that is, Level 8 represents the transition from intermediate to advanced learners. Level 9 is associated with the most difficult tasks (e.g., simultaneous interpreting for political speeches or diplomatic negotiations), performed by the most competent and highly skilled conference interpreters. It may therefore take years of professional practice to progress from Level 8 to Level 9.

4.2 Salient Characteristics of Each Level

Each level in the scales is characterized by five salient aspects: (a) interpreting modes, (b) interpreting conditions, (c) typical interpreting activities, (d) cognitive strategies involved, and (e) expected interpreting quality. We explain and discuss these aspects with reference to the Overall Interpreting Competence Scale, with the relevant descriptors presented in Table 2.

First, regarding interpreting modes, interpreters with a basic level of interpreting competence (Levels 5 and 6) can perform informal liaison interpreting and informal, short consecutive interpreting on familiar topics; interpreters at Level 7 can perform short consecutive interpreting with notes; interpreters at Level 8 can perform formal, long consecutive interpreting and simultaneous interpreting on familiar topics; and those with the highest level of competence (Level 9) are highly skilled expert interpreters who can handle all interpreting modes.


Table 2 The salient characteristics of the Overall Interpreting Competence Scale

Level 9
Interpreting mode: Simultaneous interpreting; consecutive interpreting
Condition: Unfamiliar domains; technical; high-level, formal, and informal settings
Typical interpreting activity: Various topics
Cognitive strategy: Comprehensive use of strategies
Interpreting quality: The interpretation is accurate, complete, and fluent, shows linguistic flexibility, and follows target-language norms; the target-language register and style are consistent with those of the original

Level 8
Interpreting mode: Simultaneous interpreting
Condition: Moderate information density, normal speech rate, no obvious accent
Typical interpreting activity: Product launches, industry forums, etc.
Cognitive strategy: Anticipation, segmentation, waiting, and adjustment
Interpreting quality: The interpretation is relatively accurate, complete, and fluent, and follows target-language norms

Interpreting mode: Consecutive interpreting
Condition: Relatively high information density, high speech rate, long segments, and a high level of specialization
Typical interpreting activity: Foreign affairs meetings, media reports, academic lectures, court trials, etc.
Cognitive strategy: Able to adjust the target language in real time according to the situation on the spot
Interpreting quality: Accurate rendition of the intended meaning, complete information, fluent delivery, correct use of the target language; the target-language register and style are as close as possible to those of the original

Level 7
Interpreting mode: Consecutive interpreting
Condition: Moderate information density, normal speech rate, short segments
Typical interpreting activity: Business negotiations, training workshops, etc.
Cognitive strategy: Additions, deletion and reduction, explicitation, noticing errors, corrections or complements
Interpreting quality: Accurate and complete; able to capture important information and key details; the utterance is appropriate, fluent, and logically coherent

Level 6
Interpreting mode: Consecutive interpreting (short, without notes)
Condition: Familiar topics, short segments
Typical interpreting activity: Daily business, trade fairs, etc.
Cognitive strategy: Anticipation, monitoring interpreting quality, and correcting errors
Interpreting quality: Accurate and complete interpretation

Level 5
Interpreting mode: Liaison interpreting
Condition: Familiar topics
Typical interpreting activity: Protocol routine, accompanied shopping, etc.
Cognitive strategy: Is aware of obvious errors and can correct them in a timely manner
Interpreting quality: Able to grasp important information; the sense of the interpretation is generally consistent with that of the original

It should be noted that the interpreting mode at each level mainly refers to the easiest interpreting task(s) that should be performed at that level. In other words, an interpreter with a higher level of competence is assumed to be competent in performing all interpreting modes at lower levels. In general, Level 9 represents the highest, expert level of interpreting competence, which requires extensive professional experience and cannot be acquired through school-based training alone.

Second, regarding interpreting conditions, factors related to a given condition usually affect the difficulty of interpreting activities, including, for example, learners’ familiarity with a given topic, information density, speech rate, length of speech segments, and accent. In general, there is a positive relationship between the difficulty of interpreting tasks and the level of interpreting competence. That is, more competent interpreters are capable of performing more difficult tasks.


For example, most students at Levels 5 and 6 have just begun to learn interpreting. They can only perform interpreting on “familiar topics” or for “relatively short segments”. Students at Level 7 are at an intermediate stage; they may have learnt note-taking and other related skills, but their overall interpreting competence is still under-developed. They are most likely to be able to perform interpreting tasks characterized by “moderate information density and a normal speech rate”.

Third, the issue of typical interpreting activities essentially pertains to the topics involved in interpreting, which, together with the mode of interpreting, may reflect the cognitive complexity of an interpreting task. In the Overall Interpreting Competence Scale, interpreting topics mainly cover cultural exchanges, foreign affairs meetings, business visits, public services, and lectures. As can be seen in Table 2, the basic stage (Levels 5 and 6) mainly involves informal and common topics such as airport pick-up and daily business. From the intermediate stage (Level 7), the typical interpreting activities become increasingly formal, such as business negotiations and training workshops with “moderate information density and a normal speech rate.” At Level 8, interpreting scenarios include foreign affairs meetings and court trials involving consecutive interpreting, and media events and industry forums requiring simultaneous interpreting. Finally, Level 9 involves a wide range of technical and specialist topics.

Fourth, regarding cognitive strategies, efficient strategy use reflects overall interpreting competence. Taking consecutive interpreting as an example, at the basic stage (Levels 5 and 6) the most common strategies may include anticipation, monitoring, and compensation. At Level 7, the intermediate stage, descriptors include more strategic behaviors such as “additions, deletion and reduction, explicitation, noticing errors, corrections or complements.” Because consecutive interpreting at Level 8 belongs to the advanced stage, where the interpreter has relatively flexible cognitive abilities, the description includes phrases such as “be able to adjust the target language according to the situation on the spot.” At the expert level (Level 9), expressions such as “comprehensive use of strategies” indicate mastery of the relevant strategies. As interpreters’ competence improves, they become increasingly capable and flexible in using a diverse range of strategies.

Finally, interpreting quality is described in the scale along multiple dimensions such as content, delivery, and communicative effect. Take the (informational) completeness of the target text as an example. As shown in Table 2, the descriptors at Level 5 include “[being] able to grasp important information of the original.” At Level 5, interpreting activities mainly involve liaison interpreting, which requires interpreters to convey the general message of the source speech. At Level 6, the descriptors mention the concept of “completeness”, meaning that interpreters need to monitor and strive for informational completeness in their target-language renditions. As note-taking becomes part of the requirements at Level 7, the quality descriptors emphasize that “important information and key details in the source language” should be reproduced in the target language. For the advanced stages at Levels 8 and 9, more stringent requirements on informational completeness are described. Essentially, higher levels of competence are associated with higher standards of interpreting quality.


5 Potential Application of the CSE-Interpreting Competence Scales

In general, we envisage that the CSE-Interpreting Competence Scales could be used to enhance interpreting teaching, learning, and assessment, and to promote further research in related fields.

5.1 Interpreting Teaching

Currently, the interpreting market in China is complex, multi-dimensional, and fast-evolving, calling for a diverse range of interpreting services to meet the needs and expectations of different users and clients. While top-notch conference interpreters are urgently needed for high-level or high-profile international conferences, other types of interpreters, such as liaison interpreters and community interpreters, also play an important role in facilitating cultural and business exchanges. Arguably, the latter constitute the main workforce of the interpreting profession in China. These multi-dimensional needs are also described and discussed in Setton and Dawrant (2016). According to a recent report by the National Steering Committee of Master of Translation and Interpreting (Ping 2019), many interpreting teachers in most T&I programs may not have the competences required for Level 8 and above as described in the CSE-Interpreting Competence Scales. It is therefore not recommended that simultaneous interpreting be included as a mandatory course or training component in all interpreting programs, as doing so blurs the boundary of conference interpreting proper (Dawrant et al. 2021). As such, teaching plans, course objectives, pedagogy, and practice materials in local interpreting programs should be determined by actual learner needs, the teaching capacity of a given institution, and local markets. We therefore propose that interpreting trainers make use of the levels and corresponding descriptors of our scales when planning and structuring their teaching, based on a specific needs analysis. For instance, an intensive course on business interpreting (e.g., 90 min per week for 16 weeks) may involve the interpreting modes of short liaison consecutive and full consecutive with notes. It would be useful to transform the following descriptors at Levels 6 and 7 into classroom tasks, in accordance with the actual capabilities of students (see Table 3). As shown in Table 3, trainers could first consult the descriptors at the relevant levels and decide which tasks students will need to perform at the end of the course. Once the target competences have been identified, the relevant descriptors could be used to design pedagogic activities that reflect real-world tasks and to develop or select teaching materials of appropriate difficulty. This can help relate learning objectives to real-world needs, which in turn can guide and orient the process of teaching and learning (North 2014). Moreover, the progressive description of the scales across the levels may help trainers sequence teaching objectives in their syllabi.


Table 3 The use of the CSE-Interpreting Scales in teaching design

Descriptor: Can consecutively interpret moderately information-dense speech (e.g., as in business negotiations, training activities) in which segments are comparatively short and delivered at a regular speed, with note-taking. (Overall Interpreting Ability Scale, Level 7)
Interpreting mode: Consecutive with notes
Classroom activity: Business negotiations, HR or supplier training, product launch events, etc.
Material difficulty: Moderate information density; segments are comparatively short and delivered at a regular speed

Descriptor: I can draw support from my notes to interpret moderately information-dense speech in which speech segments are comparatively short and delivered at a regular speed, as seen in activities such as business visits, popular science lectures, or guided tours. (Self-Assessment Scale, Level 7)
Interpreting mode: Short consecutive with notes
Classroom activity: Company or product introductions, plant tours, business visits, etc.
Material difficulty: Moderately information-dense speech in which speech segments are comparatively short and delivered at a regular speed

Descriptor: I can perform consecutive interpreting without note-taking for a familiar topic in which speech segments are comparatively short, such as when receiving businesspeople or accompanying people on a tour. (Self-Assessment Scale, Level 6)
Interpreting mode: Short consecutive
Classroom activity: Business receptions, airport transfers, etc.
Material difficulty: Familiar topic in which speech segments are comparatively short
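As a small illustration of how descriptors such as those in Table 3 could be filtered when planning a course, the sketch below selects descriptors by level and interpreting mode. The descriptor pool, its keys, and the selection function are hypothetical; only the descriptor texts are paraphrased from the scales.

```python
# Hypothetical descriptor pool; "level", "mode", and "text" are illustrative keys.
DESCRIPTOR_POOL = [
    {"level": 7, "mode": "consecutive with notes",
     "text": "Can consecutively interpret moderately information-dense speech "
             "(e.g., business negotiations, training activities) in comparatively "
             "short segments delivered at a regular speed, with note-taking."},
    {"level": 6, "mode": "short consecutive",
     "text": "Can consecutively interpret comparatively short speech on a familiar "
             "topic (e.g., business receptions, guided tours) without notes."},
    {"level": 9, "mode": "simultaneous",
     "text": "Can accurately, completely, and fluently interpret speech on a wide "
             "range of subjects, including unfamiliar technical fields."},
]

def select_for_course(pool, target_levels, modes):
    """Keep only descriptors matching the course's target levels and modes."""
    return [d for d in pool if d["level"] in target_levels and d["mode"] in modes]

course = select_for_course(DESCRIPTOR_POOL, target_levels={6, 7},
                           modes={"consecutive with notes", "short consecutive"})
for d in course:
    print(f"Level {d['level']} ({d['mode']}): {d['text']}")
```

A selection like this could then be paired with classroom activities and practice materials of matching difficulty, as exemplified in Table 3.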

5.2 Interpreting Learning

Although it is acknowledged that students’ self-study contributes to the development of interpreting competence, little guidance is available for students to select appropriate practice materials and to self-assess their performance (Wang 2015).


In addition, due to the lack of useful assessment instruments, interpreting trainers may not be able to identify subtle but significant weaknesses in students’ interpretations. For instance, students sometimes focus disproportionately on their voice or pronunciation, ignoring other more important aspects such as coherence and cohesion. At other times, students may tend to award themselves higher-than-warranted scores, or over-estimate their competence in self-assessment (Han and Riazi 2018; Han and Fan 2020). Moreover, both teachers and students need guidance in selecting appropriate assessment tools (Setton and Dawrant 2016; Li 2018). The Self-Assessment Scale for Interpreting Competence can therefore be used by students to self-diagnose and evaluate their learning outcomes.

Here, we briefly describe our application of the Self-Assessment Scale in a Massive Online Open Course (MOOC) in consecutive interpreting, the first course of its kind in China (for more details, see Ouyang et al. 2020; the course is available at http://www.icourse163.org/course/GDUFS-1002493010). In this course, the students were introduced to the Self-Assessment Scale for Interpreting Competence in the form of a questionnaire (see Table 4). The students used the descriptors to identify their current level of interpreting competence and to set specific goals for further improvement. In addition, the quality criteria (i.e., content, delivery, and communicative effect) were also introduced. Communicating the quality criteria for interpreting at different levels can help learners know which aspects of performance they need to focus on, and thus facilitates peer assessment and self-assessment.

Moreover, self-directed learning will potentially be enhanced if students can accurately self-diagnose their strengths and weaknesses. The accuracy of self-assessment tends to be higher when learners are trained to reflect on their progress with the help of relevant descriptors (North 2014; Han and Fan 2020). The descriptors could therefore also be exploited for signposting in self- and/or peer-assessment checklists. As shown in Table 5, the students enrolled in our MOOC compared the original speech with the recordings of their interpreting, transcribed their performance, and marked the places where original information was not rendered (struck through in the source text), where their renditions were wrong (underlined in both source and target texts), and where other types of errors occurred. In addition, in the peer assessment, the students added quality parameters such as “Language”, “Completeness”, “Fluency”, and “Accuracy” to the assessment checklist. After reading the evaluation report, teachers may further discuss with the students the causes of and solutions to the identified problems, and encourage them to think about other parameters such as “Logical Cohesion”, “Pronunciation and Intonation”, and “Voice Quality”. Alternatively, teachers could develop similar checklists to discuss progress with individual learners, to explain to learners the rationale behind the teaching plan, to motivate learners during the course of the MOOC, to determine whether the class as a whole or an individual student has achieved the desired level, to fine-tune the course accordingly, and to document achievement at the end of the course.


Table 4 The sample questionnaire for self-assessment in the intermediate stage (Levels 7 and 8)

Suppose you are appointed to a consecutive interpreting task for a product launch ceremony. Indicate your estimated performance by putting a cross in the appropriate box (0-4) for each statement.

Rating scale:
0 Cannot do it at all: Unable to execute the task in any circumstances. My proficiency is obviously much lower than this level.
1 Can do it with much help: Can execute the task in favorable circumstances. My proficiency is a bit lower than this level.
2 Can do it: Can execute the task independently in normal circumstances. My proficiency is at this level.
3 Can do it well: Can execute the task even in difficult circumstances. My proficiency is a bit higher than this level.
4 Can do it easily: Can execute the task easily in any conditions. My proficiency is clearly much higher than this level.

Statements (each rated 0-4):
1. Prior to interpreting, I can familiarize myself with event-related specialized vocabulary, background information, and development trends
2. During consecutive interpreting in which speech segments are comparatively long, I can draw support from my notes to interpret information-dense, relatively specialized speech that is delivered at a regular speed with a certain degree of accent
3. During consecutive interpreting in which speech segments are comparatively long, I can use context and related background knowledge to analyze the speaker's logic
4. During consecutive interpreting in which speech segments are comparatively long, I can use devices such as addition, omission, explication, and word reordering to ensure the target language is produced accurately, completely, and fluently
5. I can avoid committing obvious grammar mistakes and deliver the target language fluently. I can also use non-verbal methods to ensure effective communication
6. During interpreting, I can monitor in real time the accuracy, fluency, and cohesion of target-language information
7. I can assess and correct errors that occur in the delivery of source-language information
8. I can draw support from my notes to interpret moderately information-dense speech in which speech segments are comparatively short and delivered at a regular speed
9. Prior to interpreting, I can use information about the subject and the speaker's background to create a glossary, gain pertinent knowledge, and actively anticipate source-language information
10. During interpreting, I can understand, analyze, and remember, using my notes and memory, the speaker's entire message
11. I can use relatively fluent target language to deliver source-language information
12. I can use materials prepared prior to interpreting or my accumulated experience to help when I encounter difficulties

As demonstrated, the can-do descriptors can involve learners in monitoring their own achievement and serve as a tool to facilitate communication between teachers and learners. They can also help students realize that learning interpreting is a process of repeated and deliberate practice, accompanied by constant self-reflection and external feedback.
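As a minimal sketch of how responses to the Table 4 questionnaire could be summarized for a learner, the snippet below averages the 0-4 self-ratings and flags statements rated below 2 ("Can do it") as priorities for further practice. The ratings and the flagging threshold are our own illustrative conventions, not part of the CSE scales.

```python
# Hypothetical self-ratings for the twelve statements in Table 4 (0-4 scale).
ratings = {
    1: 3, 2: 1, 3: 2, 4: 1, 5: 3, 6: 2,
    7: 2, 8: 3, 9: 2, 10: 1, 11: 3, 12: 2,
}

average = sum(ratings.values()) / len(ratings)
weak_spots = sorted(n for n, r in ratings.items() if r < 2)

print(f"Average self-rating: {average:.2f} / 4")
print(f"Statements to prioritise in practice: {weak_spots}")
```

Teachers could compare such summaries with their own observations or with peer-assessment results to check whether students over- or under-estimate their competence.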

5.3 Interpreting Assessment

The scales are an embodiment of the diverse range of interpreting sub-competences; they also specify levels of baseline performance. We hope that the scales will promote evidence-based assessment and guide both teachers and testers to think beyond the use of a single summative test and to embrace a broader assessment model that includes different forms of assessment, at different points in time, for different purposes (e.g., self-evaluation, peer evaluation, and classroom observation).


Table 5 An example of the peer assessment checklist (Wang 2017)

Original text: Altogether, I have translated five works by Mo Yan, four novels and a collection of short stories. No.6 is in the work, No.7 才40万字, probably I'll get a chance to read soon and we'll see. That is the good news. The bad news is, in my country, translated fiction does not do very well. It is getting harder and harder for us to work there. It is getting harder and harder for novels to be published there. It is getting harder and harder for literature from China and elsewhere to be read in the United States. Translated literature has never been terribly well received. Somehow American think it is not quite genuine. It loses its authority as soon as it gets into our hands.

Transcript: 我一共翻译了莫言的五部作品,其中四部是小说,还有一部是短篇小说集。第六部也在翻译,第七部是只有四十万字。应该很快将要面世,你们能够读到这些作品。这是非常好的一个消息。但是比较坏的一个消息就是在美国,翻译的文学作品并不是很受欢迎。在美国这些作品越来越难出版,而美国人也越来越不欢迎中国以及其他国家的翻译来的文学作品。他们认为翻译的文学作品并不是真正的、原始的文学作品,已经失去权威性。

Peers' comments by criterion:
Language: 这里“novel”是与“short story”对比,所以建议译成“长篇小说”。(I suggest interpreting "novel" as "长篇小说", since it is used in contrast with "short story".)
Completeness: 原文删除线的部分漏译。(Information loss was identified with strikethroughs.)
Fluency: 语速不够平稳,有时较快,有时有些停顿。(Unsteady pace, sometimes relatively fast, sometimes with pauses.)
Accuracy: 下划线部分译的不准确。比如原文讲的是在美国越来越难读到翻译小说,不受欢迎也许是一个导致翻译小说出版少的原因,但是这样译会有一些加工过度。(Inaccurate rendition in the underlined parts. For instance, the speaker said that "It is getting harder and harder for literature from China and elsewhere to be read in the United States." The interpreter over-translated it as "American readers are becoming less and less welcoming of literary works translated from China and other countries.")
Other criteria on the checklist, left blank in this example: Logical Cohesion, Pronunciation and Intonation, Voice Quality


The scales may also promote further exploration of assessment methods (e.g., Han 2018a) and encourage dynamic interaction among the different stakeholders involved in interpreting education (e.g., institutions, teachers, students, testers, and users). In terms of reporting assessment outcomes, the scales may serve as a reference for decision-making on cut-off scores for interpreter certification performance tests such as the China Accreditation Tests for Translators and Interpreters (CATTI) and the United Nations Language Professionals Training Programme (UNLPP) tests in China. Developers of analytic scales for interpreting assessment (e.g., Lee 2015; Han 2018b) could leverage our descriptors to better contextualize and explain the meaning of the numeric scores assigned to a given interpretation. In addition, alignment between proficiency tests at different T&I institutions and accreditation tests (e.g., CATTI, NAATI, and other tests) could also be carried out to achieve greater consistency and coherence in interpreting education.
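To show what referencing a test result against the scales might look like in an assessment outcome report, the sketch below maps a numeric test score to an indicative CSE level. The cut-off scores are invented purely for illustration; real alignment with CATTI, UNLPP, or institutional tests would have to be established through proper standard-setting studies, as the chapter suggests.

```python
# Hypothetical cut-offs on a 0-100 interpreting-test score; illustrative values only.
HYPOTHETICAL_CUTOFFS = [
    (90, "CSE 9"),
    (75, "CSE 8"),
    (60, "CSE 7"),
    (45, "CSE 6"),
    (30, "CSE 5"),
]

def report_level(score):
    """Map a numeric interpreting-test score to an indicative CSE level band."""
    for cutoff, level in HYPOTHETICAL_CUTOFFS:
        if score >= cutoff:
            return level
    return "below CSE 5"

print(report_level(68))  # prints: CSE 7
```

Reporting a level band alongside a numeric score, together with the corresponding can-do descriptors, would make test results more interpretable for students, teachers, and employers.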

5.4 Academic Research and Inter-Sector Collaboration

The scales’ central focus on interpreting competence in different contexts may encourage interpreting trainers and researchers to examine and validate the appropriateness of the descriptors, the typical interpreting activities, and their assigned levels. We hope that our scales will give relevant researchers and testers some impetus to focus more on the measurement of interpreting competence development and teaching outcomes. In addition, we envisage that the scales could be used to compare students’ competence levels across different interpreting programs and institutions in China. With their detailed description of interpreting competence, the scales could contribute to a common understanding among different sectors of the language services industry. Wide application of and reference to the scales could promote social recognition and understanding of interpreting as an important form of language service, and serve as a frame of reference for providers of language services, particularly T&I educators and practitioners.

6 Conclusion

In this chapter, we have introduced the CSE-Interpreting Competence Scales by conceptualizing the construct of interpreting competence, by describing the design, development, and validation of the scales, and by explaining the salient features of the scales and their possible applications. One important feature of the scales is their flexibility and customizability, which enable T&I educational institutions to adapt the scales to their specific needs.


The scales are a potentially effective tool for measuring the development of interpreting competence, especially in T&I programs in China. The CSE-Interpreting Competence Scales have undergone multiple phases of quantitative and qualitative validation and describe interpreting performance at different levels in the form of “can-do” statements. Nonetheless, the validation and improvement of the scales are an ongoing process. We suggest that the scales be further tested with a diverse range of T&I programs and institutions in China. As North (2000) states, an ideal scale of language proficiency (or interpreting competence) needs to be independent of any specific situation, so that it can be extrapolated to different domains, while at the same time providing useful and interpretable information for local assessment practice. The CSE-Interpreting Competence Scales have been designed and developed to provide a specific description of interpreting performance, interpreting strategies, and relevant knowledge. While much work remains to be done, we hope that the scales will be applied by relevant stakeholders to enhance interpreting education in China.

Acknowledgements The author would like to thank all the students, teachers, and interpreters, especially Dr. XU Yi, who contributed to this project. We are grateful to the reviewers and editors for their insightful comments. This research was supported by the National Social Science Fund of China (19CYY053).

Appendix 1. The CSE-Interpreting Scales

(a) The Overall Interpreting Ability Scale

As interpreting competence is built upon bilingual competence, the overall scale for interpreting competence starts at Level 5.

CSE 9
• Can accurately, completely, and fluently interpret speech on subjects of a wide range of categories, including technical speech relating to unfamiliar fields
• Can interpret in high-level, formal, and informal speech contexts, ensuring the register and style of the target-language speech are consistent with those of the source language
• Can use and integrate consecutive or simultaneous interpreting skills and strategies, and deliver adaptable, idiomatic language


CSE 8
(Consecutive Interpreting)
• Can interpret information-dense speech (e.g., foreign affairs meetings, media reports) in which segments are comparatively long and delivered at a relatively high speech rate, with the aid of note-taking or other methods
• Can interpret specialized discourse (e.g., academic lectures, court hearings) fluently and professionally
• Can make timely adjustments to target-language speech based on on-site circumstances in order to ensure accurate and complete delivery, keeping as close as possible to source-language register and style
(Simultaneous Interpreting)
• Can interpret moderately information-dense speech (e.g., as in news conferences, industry forums) delivered at a regular speed with little to no accent interference
• Can coordinate efforts in simultaneous interpreting for anticipation, chunking, decalage, and adjustment, and ensure relatively accurate, complete, fluent, and idiomatic delivery

CSE 7
• Can consecutively interpret moderately information-dense speech (e.g., as in business negotiations, training activities) in which segments are comparatively short and delivered at a regular speed, with note-taking
• Can use methods such as addition, deletion, and explication to interpret important source-language information and key details, ensuring logical, coherent, appropriate, and fluent delivery
• Can promptly discover interpreting errors such as misinterpretation or omission of important information, and take immediate corrective action during follow-up interpreting

CSE 6
• Can consecutively interpret comparatively short speech on a prepared topic (e.g., as in everyday interactions with visitors, trade fairs) without taking notes
• Can actively anticipate speech information, monitor target-language accuracy and completeness, and correct mistakes

CSE 5
• Can perform liaison interpreting on a familiar topic (e.g., greetings and goodbyes, escorted shopping assignments)
• Can interpret important source-language information based on communicative settings and contextual knowledge, ensuring overall accuracy of meaning
• Can be aware of and try to correct obvious errors that occur during interpreting

Notes: The “speed” mentioned in this table is defined as follows: fast (in English): approximately 140–180 words/min; moderate speed (in English): approximately 100–140 words/min; fast (in Chinese): approximately 160–220 Chinese characters/min; moderate speed (in Chinese): approximately 120–160 Chinese characters/min

(b) The Self-Assessment Scale for Interpreting Competence


CSE 9
1. Prior to interpreting, I can familiarize myself with the speaker's speech habits, such as pronunciation and intonation, rate of speech, and wording preference
2. During difficult, specialized consecutive interpreting, such as a heads-of-government meeting or a court hearing for an important case, I can accurately, completely, and fluently interpret source-language information while ensuring target-language delivery is consistent with the speaker's register and style
3. During consecutive interpreting, I can use and integrate consecutive interpreting skills and strategies to deliver adaptable, idiomatic language, with a natural and relaxed bearing and clear enunciation
4. During simultaneous interpreting for a government press conference or a report of a major emergency, I can interpret source-language information accurately, completely, and fluently
5. During simultaneous interpreting, I can use and integrate simultaneous interpreting skills and strategies to deliver appropriate, adaptable, idiomatic language with clear enunciation and a pleasant voice
6. During interpreting, I can identify and handle appropriately any significant errors made by the speaker

CSE 8
7. Prior to interpreting, I can familiarize myself with industry-related specialized vocabulary, background information, and development trends
8. During consecutive interpreting in which speech segments are comparatively long, such as news conferences, academic talks, or business negotiations, I can draw support from my notes to interpret information-dense, relatively specialized speech that is delivered at a regular speed with a certain degree of accent
9. During consecutive interpreting in which speech segments are comparatively long, I can use context and related background knowledge to analyze the speaker's logic
10. During consecutive interpreting in which speech segments are comparatively long, I can use devices such as addition, omission, explication, and word reordering to ensure the target language is produced accurately, completely, and fluently
11. I can avoid committing obvious grammar mistakes and deliver the target language fluently. I can also use non-verbal methods to ensure effective communication
12. During simultaneous interpreting, such as for a political leader's speech, a live broadcast of a sporting competition, or an expert review meeting, I can interpret low-information-density speech delivered at a regular speed and without any obvious grammar mistakes
13. During simultaneous interpreting, I can use strategies such as anticipation, omission, and summarization to divide my attention. I can simultaneously listen and analyze information, using suitable sense groups to chunk source-language sentences to interpret the speaker's message relatively accurately, completely, and fluently
14. During interpreting, I can monitor in real time the accuracy, fluency, and cohesion of target-language information
15. I can assess and correct errors that occur in the delivery of source-language information


CSE 7
16. I can draw support from my notes to interpret moderately information-dense speech in which speech segments are comparatively short and delivered at a regular speed, as seen in activities such as business visits, popular science lectures, or guided tours
17. Prior to interpreting, I can use information about the subject and the speaker's background to create a glossary, gain pertinent knowledge, and actively anticipate source-language information
18. During interpreting, I can understand, analyze, and remember, using my notes and memory, the speaker's entire message
19. I can use relatively fluent target language to deliver source-language information
20. I can use materials prepared prior to interpreting or my accumulated experience to help when I encounter difficulties

CSE 6
21. I can perform consecutive interpreting without note-taking for a familiar topic in which speech segments are comparatively short, such as when receiving business people or accompanying people on a tour
22. I can use a variety of channels to make preparations, such as contacting event organizers or using the internet, to collect material pertinent to an interpreting assignment and a speaker's background
23. I can understand the speaker's main intent, remember the principal information of a speech, and use the target language to deliver source-language information relatively accurately
24. I can monitor target-language accuracy and completeness and promptly correct mistakes
25. I can request help from the speaker or the audience when I encounter difficulties
26. After the interpreting assignment, I can reflect on my performance and on the reasons for any difficulties I encountered

CSE 5
27. I can integrate subject and pertinent background knowledge to interpret important information present in a dialogue while interpreting for a familiar topic, such as airport pick-up and drop-off or a shopping assignment
28. Prior to interpreting, I can make pertinent preparations, such as familiarizing myself with an itinerary, subject, and details of an activity
29. I can recognize and promptly correct obvious errors that occur during interpreting

References

Albl-Mikasa, Michaela. 2013. Developing and cultivating expert interpreter competence. The Interpreters' Newsletter 18: 17–34.
Barik, Henri C. 1975. Simultaneous interpretation: Qualitative and linguistic data. Language and Speech 18 (3): 272–297.
Bartłomiejczyk, Magdalena. 2007. Interpreting quality as perceived by trainee interpreters: Self-evaluation. The Interpreter and Translator Trainer 1 (2): 247–267.
Berk-Seligson, Susan. 1988. The impact of politeness in witness testimony: The influence of the court interpreter. Multilingua 7: 411–439.
Bühler, Hildegund. 1986. Linguistic (semantic) and extralinguistic (pragmatic) criteria for the evaluation of conference interpretation and interpreters. Multilingua 5 (4): 231–235.
Campbell, Stuart, and Sandra Hale. 2003. Translation and interpreting assessment in the context of educational measurement. In Translation today: Trends and perspectives, ed. Gunilla Anderman and Margaret Rogers, 205–227. Clevedon: Multilingual Matters.


Collados Aís, Ángela. 1998. La evaluación de la calidad en interpretación simultánea: La importancia de la comunicación verbal (in Spanish). Granada: Comares.
Council of Europe. 2001. Common European Framework of Reference for Languages: Learning, teaching, assessment. Cambridge: Cambridge University Press.
Dawrant, Andrew, Binhua Wang, and Hong Jiang. 2021. Conference interpreting in China. In Routledge handbook of conference interpreting, ed. Elisabet Tiselius and Michaela Albl-Mikasa. London: Routledge (forthcoming).
Dong, Yanping. 2018. Complex dynamic systems in students of interpreting training. Translation and Interpreting Studies 13 (2): 185–207.
Garzone, Giuliana Elena. 2003. Reliability of quality criteria evaluation in survey research. In La evaluación de la calidad en interpretación, ed. Ángela Collados Aís, María Manuela Fernández Sánchez, and Daniel Gile, 23–30. Granada: Editorial Comares.
Gile, Daniel. 2011. Errors, omissions, and infelicities in broadcast interpreting: Preliminary findings from a case study. In Methods and strategies of process research: Integrative approaches in translation studies, ed. Cecilia Alvstad, Adelina Hild, and Elisabet Tiselius, 201–218. Amsterdam: John Benjamins.
Grbić, Nadja. 2008. Constructing interpreting quality. Interpreting 10: 232–257.
Han, Chao. 2015. Building the validity foundation for interpreter certification performance testing. Unpublished PhD thesis, Macquarie University.
Han, Chao. 2016. Investigating score dependability in English/Chinese interpreter certification performance testing: A generalizability theory approach. Language Assessment Quarterly 13 (3): 186–201.
Han, Chao. 2017. Using analytic rating scales to assess English–Chinese bi-directional interpreting: A longitudinal Rasch analysis of scale utility and rater behavior. Linguistica Antverpiensia, New Series – Themes in Translation Studies 16: 196–215.
Han, Chao. 2018a. Using rating scales to assess interpretation: Practices, problems, and prospects. Interpreting 20 (2): 59–95.
Han, Chao. 2018b. Latent trait modelling of rater accuracy in formative peer assessment of English-Chinese consecutive interpreting. Assessment and Evaluation in Higher Education 43 (6): 979–994.
Han, Chao. 2019. A generalizability theory study of optimal measurement design for a summative assessment of English/Chinese consecutive interpreting. Language Testing 36: 419–438.
Han, Chao, and Mehdi Riazi. 2018. The accuracy of student self-assessments of English-Chinese bidirectional interpretation: A longitudinal quantitative study. Assessment and Evaluation in Higher Education 43: 386–398.
Han, Chao, and Qin Fan. 2020. Using self-assessment as a formative assessment tool in an English-Chinese interpreting course: Student views and perceptions of its utility. Perspectives 28 (1): 109–125.
Interagency Language Roundtable. 2002. About the Interagency Language Roundtable. https://www.govtilr.org/IRL%20History.htm. Accessed 3 Nov 2019.
Dunlea, Jamie, Richard Spiby, Sha Wu, Jie Zhang, and Mengmeng Cheng. 2019. China's Standards of English Language Ability (CSE): Linking UK exams to the CSE. https://www.britishcouncil.org/exam/aptis/research/publications/validation/china-standards-english-cse-linking-ukexams-cse. Accessed 3 Feb 2020.
Jiang, Gang. 2016. Implementing suggestions on deepening the reform of examination and enrolment system and steadily advancing the national assessment system of foreign language proficiency. China Examinations 1: 3–6.
Jin, Yan, Wu Zunmin, Charles Alderson, and Weiwei Song. 2017. Developing the China Standards of English: Challenges at macro political and micropolitical levels. Language Testing in Asia. https://doi.org/10.1186/s40468-017-0032-5. Kalina, Sylvia. 2000. Interpretation competences as a basis and a goal for teaching. The Interpreters’ Newsletter 10: 3–32.

42

W. Wang

Kalina, Sylvia. 2002. Quality in interpreting and its prerequisites—A framework for a comprehensive view. In Interpreting in the 21st century. Challenges and opportunities, ed. Giuliana Garzone and Maurizio Viezzi, 121–130. Amsterdam: John Benjamins. Kalina, Sylvia. 2005a. Quality in the interpreting process: What can be measured and how? Communication and Cognition. Monographies 38 (1–2): 27–46. Kalina, Sylvia. 2005b. Quality assurance for interpreting processes. Meta 50: 769–784. Kurz, Ingrid. 1993. Conference interpretation: expectations of different user groups. The Interpreters’ Newsletter 5: 13–21. Kutz, W. 2010. Dolmetschkompetenz: Was muss der Dolmetscher wissen und können? (in German). München: European University Press. Lee, Jieun. 2008. Rating scales for interpreting performance assessment. The Interpreter and Translator Trainer 2 (2): 165–184. Lee, Sang-Bin. 2015. Developing an analytic scale for assessing undergraduate students’ consecutive interpreting performances. Interpreting 17: 226–254. Lee, Sang-Bin. 2018. Exploring a relationship between students’ interpreting self-efficacy and performance: Triangulating data on interpreter performance assessment. The Interpreter and Translator Trainer 12: 166–187. Lee, Sang-Bin. 2019. Holistic assessment of consecutive interpretation: How interpreter trainers rate student performances. Interpreting 21: 245–268. Li, Xiangdong. 2018. Self-assessment as ‘assessment as learning’ in translator and interpreter education: Validity and washback. The Interpreter and Translator Trainer 12 (1): 48–67. Liu, Minhua. 2008. How do experts interpret? Implications from research in interpreting studies and cognitive science. In Efforts and models in interpreting and translation research: A tribute to Daniel Gile, ed. Gyde Hansen, Andrew Chesterman, and Heidrun. Gerzymisch-Arbogast, 159–178. Amsterdam: John Benjamins. Liu, Minhua. 2013. Design and analysis of Taiwan’s interpretation certification examination. In Assessment issues in language translation and interpreting, ed. Dina Tsagari and Roelof van Deemter, 163–178. Frankfurt: Peter Lang. Liu, Jianda. 2015. Some thoughts on developing China common framework for English Language proficiency. China Examinations 1: 7–11 + 15. Liu, Jianda, and Baocheng Han. 2018. Theoretical basis for the construction of China’s applicationoriented scale for English proficiency. Modern Foreign Languages 41 (1): 78–90 + 146. Moser-Mercer, Barbara. 1996. Quality in interpreting: Some methodological issues. The Interpreters Newsletter 7: 43–55. Mu, Lei, Weiwei Wang, and Yi Xu. 2020. Research on CSE-Interpreting Scale. Higher Education Press. Napier, Jemina. 2004a. Sign language interpreter training, testing, and accreditation: An international comparison. American Annals of the Deaf 149: 350–359. Napier, Jemina. 2004b. Interpreting omissions: A new perspective. Interpreting 6: 117–142. North, Brian. 2000. The development of a common framework scale of language proficiency. New York: Peter Lang. North, Brian. 2014. The CEFR in practice, vol. 4. Cambridge: Cambridge University Press. Ouyang, Qianhua, Yu. Yi, and Fu Ai. 2020. Building disciplinary knowledge through multimodal presentation: A case study on China’s first interpreting Massive Online Open Course (MOOC). Babel 66 (4/5): 655–673. Ping, Hong. 2019. Quality development of MTI education. Translation Journal 40 (1): 76–82. Pöchhacker, Franz. 2000. Dolmetschen: Konzeptuelle grundlagen und deskriptive untersuchungen (in German). Tübingen: Stauffenburg. 
Tübingen: Stauffenburg-Verlag. Pöchhacker, Franz. 2001. Quality assessment in conference and community interpreting. Meta 46: 410–425. Pöchhacker, Franz. 2015. Routledge encyclopedia of interpreting studies. London: Routledge. Postigo Pinazo, Encarnación. 2008. Self-assessment in teaching interpreting. TTR: Traduction, Terminologie, Rédaction 21: 173–209.

Introducing China’s Standards of English Language Ability (CSE) …

43

Pym, Anthony, Claudio Sfreddo, François Grin, and Andy Chan. 2013. The status of the translation profession in the European Union. London: Anthem Press. Riccardi, Alessandra. 2002. Evaluation in interpretation: Macrocriteria and microcriteria. In Teaching translation and interpreting 4: Building bridges, ed. Eva Hung, 126–226. Amsterdam: John Benjamins. Setton, Robin, and Andrew Dawrant. 2016. Conference interpreting: A complete course and trainer’s guide. Amsterdam: John Benjamins. Tiselius, Elisabet. 2009. Revisiting Carroll’s scales, In Testing and assessment in translation and interpreting studies, ed. Claudia. V. Angelelli and Holly C. Jacobson, 95–121. Amsterdam: John Benjamins. Wang, Binhua. 2007. From interpreting competence to interpreter competence—A tentative model for objective assessment of interpreting (in Chinese). Foreign Language World 120: 75–78. Wang, Binhua. 2012. From interpreting competence to interpreter competence: Exploring the conceptual foundation of professional interpreting training (in Chinese). Foreign Languages and their Teaching 267: 75–78. Wang, Binhua. 2015. Bridging the gap between interpreting classrooms and real-world interpreting. International Journal of Interpreter Education 7: 65–73. Wang, Jihong, Jemina Napier, Della Goswell, and Andy Carmichael. 2015. The design and application of rubrics to assess signed language interpreting performance. The Interpreter and Translator Trainer 9: 83–103. Wang, Weiwei. 2017. Applications of CSE-interpreting competence scales. Foreign Language World 183: 2–10. Wang, Weiwei, Xu Yi, and Mu Lei. 2018. Interpreting competence in China’s Standards of English (CSE). Modern Foreign Languages 41 (01): 111–121 + 147. Wang, Weiwei, Xu Yi, Binhua Wang, and Mu Lei. 2020. Developing interpreting competence scales in China. Frontiers in Psychology. https://doi.org/10.3389/fpsyg.2020.00481. Yeh, Shu-pai, and Minhua Liu. 2006. A more objective approach to interpretation evaluation: Exploring the use of scoring rubrics. Journal of the National Institute for Compilation and Translation 34: 57–78. Zhang, Qiang, and Yiming Yang. 2013. On language ability and its enhancement. Linguistic Sciences 6: 566–578. Zhong, Weihe. 2003. Memory training in interpreting. Translation Journal 7 (3): 1–9. http://transl ationjournal.net/journal/25interpret.htm Accessed 12 Nov 2013. Zhengcai, Zhu. 2016. A validation framework for the national English proficiency scale of China. China Examinations 8: 3–13.

Weiwei Wang is an associate professor in the School of Interpreting and Translation Studies at Guangdong University of Foreign Studies. Her research focuses on interpreting quality assessment. She is particularly interested in understanding developmental patterns of interpreting competence and causal factors in competence development. Her research has been published in such journals as Frontiers in Psychology, Chinese Translators Journal, and Foreign Language World. She has led several research projects funded by the Ministry of Education, the National Social Science Foundation, and the British Council. She is serving as the Deputy Secretary-General of the National Interpreting Committee of the Translation Association of China.

Developing a Weighting Scheme for Assessing Chinese-to-English Interpreting: Evidence from Native English-Speaking Raters

Xiaoqi Shang

Abstract Analytic rating scales have emerged as a viable instrument for assessing interpreting in recent years. However, little research has been conducted to examine what aspects of interpreting need to be assessed and how weighting could be assigned to each assessment criterion. Previously, weighting schemes were largely derived from analysis of raters' assessments of interpreting from English into other languages. Only a few studies have focused on assessment of interpreting into English, and even fewer studies have involved native English-speaking raters. In addition, little research has taken directionality into consideration by investigating the behavior of native English speakers as raters when they assess interpreting into English. To bridge this gap, our study seeks to explore a weighting scheme for assessing interpreting. We recruited two professional conference interpreters who were native speakers of English to assess Chinese-to-English interpreting. Data analysis indicates that the two raters seemed to weight the three assessment criteria (i.e., fidelity, language, and delivery) differently. More specifically, the criterion of language was weighted the most heavily (β2 = 0.396), followed by delivery (β3 = 0.343) and fidelity (β1 = 0.300). Retrospective interviews were conducted with the two raters to obtain insights into their rating process. Theoretical and pedagogical implications of the research findings are also discussed.

Keywords Interpreting assessment · Weighting scheme · Chinese-to-English interpreting · Native English-speaking raters

1 Introduction

The role of assessment in the broader context of educational measurement has been well documented in the literature (e.g., Bachman and Palmer 1996; Weigle 2002; Campbell and Hale 2003). Assessment serves multiple purposes such as making decisions about candidates' readiness for a program, diagnosing their strengths and weaknesses, tracking the progress of learning and instruction, and granting certification (Bachman 1990). In the field of interpreter education, assessment mainly involves three areas: diagnostic testing at the time of program admission, formative testing during the process of learning, and summative testing at the end of the program (Sawyer 2004). Reliable and valid assessment plays an important role in regulating access to and maintaining high standards for interpreter training, and thus serves as a gatekeeper for the interpreting profession. However, current practice in interpreting testing is characterized by "a lack of systematic methods and consistent standards" with regard to test development, administration and assessment (see Liu 2015, p. 20), which hinders "reproducible, defensible and accountable decisions" (Setton and Dawrant 2016, p. 374).

As a viable tool used in performance assessment, analytic rating scales can help raters generate reliable and valid measurements, and provide detailed information about a test taker's performance (Weigle 2002; Green and Hawkey 2010). As such, interpreting researchers, trainers, and practitioners have invested much time and effort in designing, developing, and using analytic rating scales in interpreting assessment (e.g., Lee 2008; Han 2015, 2016; Lee 2015). A few interpreting researchers have further sought to explore how to assign weighting to different assessment criteria (Roberts 2000; Setton and Motta 2007; Lee 2008; Choi 2013; Skaaden 2013; Wu 2013; Han 2016; Lee 2015). However, divergent views exist with respect to weighting assignment.

Weighting schemes in previous research were largely based on the analysis of raters' assessments of interpreting from English into other languages (e.g., Lee 2008; Choi 2013; Lee 2015; Han 2016). To the best of the author's knowledge, virtually no empirical research has been conducted on the design of a specific weighting scheme based on analysis of assessments of interpreting into English by native English-speaking raters with a background in interpreting. Even in current assessment practices where assessing interpreting into English is involved, raters are predominantly non-native speakers of English. As non-native English-speaking (NNES) and native English-speaking (NES) raters display systematically divergent patterns when assessing a candidate's English language performance (Hyland and Anan 2006; Zhang and Elder 2011), it would therefore be of interest to investigate how NES raters with an interpreting background would assess interpreting into English. Research findings could be used to confirm or invalidate previous weighting schemes, and also inform test designers about the recruitment of raters with a proper language background.

Against this backdrop, our study adopts a data-driven approach to exploring a weighting scheme for assessing Chinese-to-English (C-to-E) interpreting performance. Two native English speakers who were professional conference interpreters were asked to evaluate a total of 50 C-to-E consecutive interpretations produced by interpreter trainees at the postgraduate level from four Chinese universities. We aim to answer the following two research questions (RQ):

RQ1: How would the NES raters weight the assessment criteria of fidelity, language and delivery when assessing C-to-E consecutive interpreting?

RQ2: What are the possible factors that have influenced the raters' perceptions of weightings?

2 Literature Review

2.1 Rater-Mediated Assessment of Language Performance

There has been a significant body of literature examining the differences between native speaking (NS) and non-native speaking (NNS) raters when assessing spoken or written language performance. Studies have primarily focused on the effect of raters' linguistic background (e.g., Sheorey 1986; Fayer and Krasinski 1987; Shi 2001; Kim 2009) or of social and professional differences (e.g., Hadden 1991; Chalhoub-Deville 1995; Chalhoub-Deville and Wigglesworth 2005) on their assessments. In general, teachers and non-native speakers were found to be more severe than non-teachers and native speakers. However, mixed results were also reported when specific criteria were used to assess language performance. For example, Kim (2009) compared NNS and NS teachers' assessments of oral communication skills and found that the NS teachers made more detailed and elaborate comments than their NNS counterparts in terms of pronunciation, grammar use, and accuracy of transferred information. In contrast, the study by Fayer and Krasinski (1987) suggests that the NNS raters were harsher about "linguistic form" than the NS raters when assessing Puerto Rican English learners' oral performance.

2.2 Rater-Mediated Assessment of Translation and Interpreting

In contrast to the abundant literature in language assessment, there have been very few studies exploring the differences between NNES and NES raters when assessing translation and interpreting performance, with the exception of a few researchers (Tang 2017; Su 2019; Su and Shang 2019). Tang (2017), for example, argued that in assessing C-to-E translation, the assessments from the NNES and the NES raters could be complementary, because the NNES raters can help students with critical thinking skills, whereas the NES raters can serve as quality control specialists. Su and Shang's (2019) recent case study on NNES and NES teachers' co-teaching of an interpreting class found that the NNES teacher diagnosed the students' translation pitfalls, while the NES teacher monitored the students' English output and served as a cultural advisor.


2.3 Quality Criteria in Interpreting Assessment

Quality assessment has been a central concern in all domains related to interpreting (Pöchhacker 2001). However, there has been no universal definition of quality in the interpreting literature, since interpreting is a dynamic concept involving a host of "environmental variables" and "socially constituted norms" (Pöchhacker 2015, p. 338). In terms of interpreting quality assessment, it is therefore important to consider different settings and modes of interpreting, which might involve different expectations from multiple stakeholders such as interpreters, service users, employers, colleagues, and delegates.

Interpreting scholars have examined the topic of quality criteria through various approaches ranging from theoretical exploration (Schjoldager 1995; Riccardi 2002; Roberts 2000; Pöchhacker 2001; Skaaden 2013; Setton and Dawrant 2016) to questionnaire-based surveys (Bühler 1986; Kurz 1989, 1993; Gile 1991; Ng 1992), and to experimental studies (Lee 2008; Choi 2013; Liu 2013; Wu 2013; Lee 2015). The number of assessment criteria varies considerably, depending on different research methodologies, interpreting domains, and interdisciplinary traditions.

Questionnaires have been the most common means used to elicit user expectations of interpreting quality (Gile 1991; Pöchhacker 2001). Bühler (1986) was among the first few pioneers who used questionnaires to explore quality criteria for conference interpreting. She asked AIIC interpreters to rate the relative importance of 15 linguistic and extra-linguistic criteria. Riccardi (2002) went further to incorporate as many as 17 micro-criteria in her evaluation scheme, which were further categorized into four broad macro-criteria of equivalence, accuracy, appropriateness, and usability.

Apart from the use of questionnaires, quite a few interpreting scholars have theorized and conceptualized interpreting quality (e.g., Roberts 2000; Clifford 2001; Pöchhacker 2001). Based on an extensive review of the studies on interpreting quality, Pöchhacker (2001) grouped assessment criteria into four major categories: accurate rendition, adequate target language expression, equivalent intended effect, and successful communicative interaction. Drawing on discourse theory, Clifford (2001) developed a performance-based assessment rubric for interpreting which features the three broad categories of deixis, modality, and speech acts.

Furthermore, a few interpreting researchers have empirically investigated the criteria suitable for interpreting assessment (e.g., Lee 2008; Choi 2013; Liu 2013; Wu 2013; Lee 2015). The study conducted by Lee (2015), for example, sought to design an analytic rating scale for assessing undergraduate students' interpreting performance and proposed three major assessment criteria: content, form, and delivery, with each criterion consisting of seven sub-criteria.

Taken together, previous scholars and researchers have reached a general consensus on three essential criteria for assessing interpreting, namely, fidelity, target language quality, and delivery. However, very few studies have explored how different assessment criteria could be weighted based on empirical assessment data. It is therefore important to explore and investigate weighting schemes for interpreting assessment.

2.4 Quality Criteria and Their Weighting in Interpreting Assessment

While there is much literature on quality criteria for interpreting assessment, research on weighting assignment remains scant. A few interpreting researchers argue for assigning differential weightings to assessment criteria based on intuition and conceptual analysis (e.g., Lee 2008; Roberts 2000; Choi 2013; Skaaden 2013). For instance, in evaluating Korean-to-English consecutive interpreting, Choi (2013) used a total weight value of 20, assigning 10 to accuracy (50%), 6 to expression (30%) and 4 to presentation (20%). Similarly, in assessing students' aptitude for interpreting in a variety of languages, Skaaden (2013) accorded 50% of the total weight to the content-related criteria, while attributing 25% each to pronunciation and to grammar/phrasing.

Other researchers have sought to address the issue of weighting assignment based on empirical data (Wu 2013; Lee 2015). Lee (2015) conducted a study to develop an analytic rating scale for assessing undergraduate students' Korean-to-English consecutive interpreting. Data analysis suggests that the criterion of content be assigned a weight of 2 (50%), whereas the other criteria, form and delivery, each be given a weight of 1 (25%). The empirical study by Wu (2013), however, identified five categories of assessment criteria used by the raters: (a) fidelity and completeness (FC), (b) audience point of view (APV), (c) interpreting skills and strategies (ISS), (d) presentation and delivery (PD), and (e) foundation abilities for interpreting (FAI). The criteria of FC and PD were found to account for 86% of the 300 assessment decisions made, indicating their predominant importance among the five assessment criteria.

The review of the above literature reveals a few research gaps. First, despite their consensus on fidelity as the major criterion for assessing interpreting, researchers still have divergent views on the weightings of different assessment criteria. More research is thus needed to provide further empirical evidence. Second, given that most of the weighting schemes described in the literature were developed based on researchers' personal experience and conceptual analysis, future research can be conducted to explore an empirically based weighting scheme for interpreting assessment. Third, most previous studies have focused on assessment of interpreting from English while neglecting the possible influence of interpreting directionality on weighting assignment (e.g., interpreting into English). In the very few studies that involve assessment of interpreting into English, the raters recruited were predominantly non-native speakers of English. More research is therefore needed to examine whether native speakers of English weight assessment criteria differentially when assessing interpreting into English. Fourth, rater training is seldom provided (or described) in most of the previous studies, with a few exceptions such as Lee (2008) and Han (2015). Given that rater bias is a potential source of measurement error, rater training should be provided to enhance scoring reliability (Lumley and McNamara 1995).

Given these gaps, we set out to empirically explore a weighting scheme for assessing C-to-E consecutive interpreting. In particular, we recruited bilingual raters who speak English as their mother tongue (L1) and Mandarin Chinese as their second language (L2) to assess C-to-E interpreting. This design marks a difference from previous studies, in which native Chinese speakers are usually used as raters to assess C-to-E interpreting.

3 Method

3.1 Participants

A total of 50 students majoring in interpreting at the postgraduate level from four Chinese universities were recruited to participate in the study. Forty-three of them were female and seven were male, all aged between 23 and 25 years. They were all native speakers of Chinese (L1) with English as their second language (L2). At the time of the study, all of them had received one year of formal interpreter training and were capable of interpreting on a diverse array of non-technical topics.

3.2 The Source Speech

The material to be interpreted was a five-minute speech on tobacco control in China (see Appendices 1 and 2), delivered in Chinese by a native Chinese speaker and video-recorded. The speaker had been trained as a conference interpreter and was a staff interpreter at a governmental agency in Shanghai. The rate of speech was 175 syllables/characters per minute.

3.3 Procedures

The C-to-E interpreting test was held at the four universities mentioned above. A total of 50 student interpreters participated in the test. During each test, the participants were required to interpret the speech (divided into three segments) in the consecutive mode; note-taking was allowed. Their interpretations were audio-recorded and assigned a unique code.


3.4 Rater Selection and Rater Training

Two raters were invited to evaluate the 50 audio recordings of the interpretations. Both of them were native speakers of English, with Chinese as their L2. They both held a master's degree in conference interpreting from a leading interpreter training institution and had been active freelance interpreters. Both had more than six years of professional interpreting experience.

Prior to the formal rating, the raters were asked to evaluate a total of 10 C-to-E interpretations. The training session lasted about three hours. It consisted of such procedures as an orientation and overview of the rating task, familiarization with the scoring rubrics, review of benchmark performances, and an anchoring and practice session (Johnson et al. 2009; Setton and Dawrant 2016). Many-facet Rasch measurement (MFRM) was conducted on the raters' scores to examine their severity levels and internal consistency. Inter-rater reliability was also analyzed using SPSS 24.

3.5 Rating Criteria

Each interpreting performance was evaluated on the three criteria of fidelity, language quality, and delivery, based on a modified version of Setton and Dawrant's (2016, pp. 436–437) analytic rating scale, and also on a revised holistic scale sourced from Schjoldager (1995, pp. 191–192). Specifically, the criterion of fidelity refers to the full, faithful and accurate rendering of all the message elements in the source speech without unjustified additions, changes or omissions. The criterion of language refers to target language quality, which mainly concerns a strong and impressive command of the target language, including terminology, word choice, register, style, and appropriateness. The criterion of delivery focuses on the comprehensibility and communicability of the output, such as liveliness and expressiveness, and freedom from backtracking, fillers, and pauses.

For each recording, the raters were required to assign a numerical value to each of the three criteria on the 10-point analytic rating scale and to the overall performance on the 10-point holistic rating scale (see Appendices 3 and 4). Both the analytic and the holistic scores on each interpretation were collected. In addition, the raters were required to comment on the strengths and weaknesses of each candidate's performance to generate qualitative data about the rating process.
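To make the data capture concrete, here is a minimal sketch of how one completed marking sheet (cf. Appendices 3 and 4) could be recorded for later analysis; the Python class and field names are hypothetical illustrations rather than part of the chapter's actual instruments.

```python
# A toy record for one candidate's scores on the 10-point analytic and
# holistic scales; purely illustrative, not the chapter's instrument.
from dataclasses import dataclass

@dataclass
class MarkingSheet:
    examinee_id: str
    fidelity: int   # 1-10 analytic score
    language: int   # 1-10 analytic score
    delivery: int   # 1-10 analytic score
    holistic: int   # 1-10 holistic score
    comments: str = ""

    def __post_init__(self) -> None:
        # Enforce the 10-point scale used for all four scores.
        for name in ("fidelity", "language", "delivery", "holistic"):
            value = getattr(self, name)
            if not 1 <= value <= 10:
                raise ValueError(f"{name} must be between 1 and 10, got {value}")

# Example mirroring the sample marking sheet in Appendix 4 (Examinee No. 4).
sheet = MarkingSheet("4", fidelity=8, language=7, delivery=6, holistic=7,
                     comments="Generally accurate meaning; some awkward expressions.")
```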

3.6 Data Analysis

During the rater training, a many-facet Rasch measurement (MFRM) analysis (Linacre 1989) was conducted to investigate the raters' severity levels and self-consistency. The data were analyzed using the computer program Facets 3.57.0 (Linacre 2005). After the rater training, the formal rating was conducted.


To answer RQ1 (i.e., the weightings the raters would assign to the three assessment criteria), Pearson's correlation analysis and multiple linear regression were performed. Correlation coefficients were first computed, using SPSS 24, to investigate the interrelationships between the scores on the three assessment criteria and the overall interpreting score. Subsequently, multiple linear regression, based on the default "enter" method, was conducted to examine to what extent each of the three assessment criteria (i.e., the analytic scores) could predict the holistic scores of candidates' interpreting performance.

To answer RQ2, the retrospective interview data were transcribed independently by the author and a second researcher. The transcripts were double-checked with the two raters to ensure the consistency and accuracy of the data.
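For readers who wish to replicate this two-step analysis outside SPSS, the sketch below runs Pearson correlations and an "enter"-method regression on standardized scores in Python; the file name and column names are hypothetical, and pandas/statsmodels merely stand in for SPSS 24.

```python
# A minimal replication sketch, assuming the averaged ratings sit in a
# hypothetical "ratings.csv" with one row per candidate (n = 50) and
# columns holistic, fidelity, language, delivery.
import pandas as pd
import statsmodels.api as sm

cols = ["holistic", "fidelity", "language", "delivery"]
scores = pd.read_csv("ratings.csv")[cols]

# Step 1: Pearson correlations between the analytic scores and the holistic score.
print(scores.corr(method="pearson"))

# Step 2: "enter"-method regression - all three predictors entered at once.
# Standardizing first makes the OLS coefficients comparable to the beta
# weights reported in Table 4.
z = (scores - scores.mean()) / scores.std(ddof=1)
X = sm.add_constant(z[["fidelity", "language", "delivery"]])
model = sm.OLS(z["holistic"], X).fit()
print(model.summary())   # adjusted R-squared, coefficients, t values, p values
```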

4 Results

4.1 Rating Reliability

As stated in previous sections, the MFRM analysis was conducted to investigate the variability of the scores provided by the two raters. As an extension of item response theory, the MFRM model has been widely used to examine assessment systems involving human raters (Eckes 2015). As seen in Table 1, the two raters were similar in severity levels, with Rater 1 being slightly more severe than Rater 2. The rater report also shows that the infit statistics for Raters 1 and 2 were 0.90 and 0.84, respectively, both of which are within the acceptable range of 0.7–1.3 (Bonk and Ockey 2003). This result indicates that each rater was largely self-consistent.

In addition, Cronbach's alpha was computed to measure to what extent the two raters agreed with each other as a group. The results show that the values of Cronbach's alpha for the analytic ratings (i.e., fidelity, language, and delivery) were 0.752, 0.669, and 0.748, and that for the holistic ratings it was 0.812. As a Cronbach's alpha of 0.7 or larger is generally considered acceptable for low-stakes assessment (Shohamy 1985, p. 70), we believe that the level of rater consistency in the study was acceptable on the whole, except for the criterion of language (Cronbach's alpha = 0.669). Because of the results from the analysis of the pilot ratings, we provided further retraining to the raters, particularly guiding them in assessing the target language quality of the interpretations.

Table 1 The MFRM-based statistics on the rater facet based on the pilot ratings

Raters/descriptive statistics   Rater severity (logits)   Standard error   Infit (mean square)
Rater 1                         0.20                      0.13             0.90
Rater 2                         0.13                      0.13             0.84
Mean                            0.00                      0.13             0.87
Standard deviation              0.23                      0.00             0.13
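As an illustration of the agreement check reported above, the following sketch computes Cronbach's alpha for one criterion from the two raters' raw scores; the CSV and column names are hypothetical, and the standard k-item formula is used in place of SPSS's reliability procedure.

```python
# Cronbach's alpha treating the two raters as "items"; illustrative only.
import pandas as pd

def cronbach_alpha(item_scores: pd.DataFrame) -> float:
    """Columns = raters (items), rows = candidates."""
    k = item_scores.shape[1]
    item_variances = item_scores.var(axis=0, ddof=1)      # variance per rater
    total_variance = item_scores.sum(axis=1).var(ddof=1)  # variance of summed scores
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

ratings = pd.read_csv("ratings_by_rater.csv")   # hypothetical file
alpha = cronbach_alpha(ratings[["rater1_language", "rater2_language"]])
print(f"Cronbach's alpha (language): {alpha:.3f}")   # reported above as 0.669
```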

4.2 Correlation Analysis

To answer RQ1, Pearson's correlation analysis was conducted to look into the correlations between the candidates' scores on each of the three assessment criteria and the scores on the overall interpreting performance. Table 2 summarizes the descriptive statistics for the analytic scores and the holistic scores. Table 3 displays the correlation coefficients between the analytic scores and the holistic scores. As seen in Table 3, the correlation coefficients between the scores of fidelity, language and delivery and the holistic scores were r1 = 0.914, r2 = 0.932 and r3 = 0.927, respectively. As a correlation coefficient above 0.7 indicates a strong correlation between variables (Gravetter and Wallnau 2013, p. 514), it seems that the analytic scores were strongly correlated with the holistic scores.

Table 2 The descriptive statistics of the correlation analysis

Measures             Mean   Standard deviation
Holistic             5.74   1.54
Analytic: fidelity   6.42   1.28
Analytic: language   5.70   1.38
Analytic: delivery   5.17   1.69

Table 3 The correlation matrix involving the three analytic scores and the holistic scores

Measures             Analytic: fidelity   Analytic: language   Analytic: delivery
Holistic             0.914*               0.932*               0.927*
Analytic: fidelity                        0.826*               0.835*
Analytic: language                                             0.841*

*p < 0.05


4.3 Multiple Regression Analysis

Multiple regression analysis, using the default "enter" method, was performed to investigate to what extent the scores on the three assessment criteria could predict the candidates' overall interpreting performance. In order to perform linear regression analysis, certain preconditions had to be met. As a "tolerance" value lower than 0.2 or a VIF (1/tolerance) greater than 10 usually indicates a potential problem of collinearity among independent variables (Field 2018, p. 534), the statistical evidence suggests no collinearity among the three independent variables in the study, as shown in Table 4. The three sets of analytic scores (corresponding to the three assessment criteria) were able to explain a total of 95.9% of the variance in the holistic scores in C-to-E interpreting (adjusted R2 = 0.959). More specifically, the scores on fidelity, language, and delivery accounted for 30.0% (β1 = 0.300), 39.6% (β2 = 0.396), and 34.3% (β3 = 0.343) of the variance of the holistic scores, respectively (p < 0.05).

Table 4 Statistical results from the multiple regression analysis (n = 50)

Measure              Adjusted R2   F         beta    t        p       Tolerance
Holistic scores      0.959         373.712           −2.833   0.004
Analytic: fidelity                           0.300   5.120    0.000   0.250
Analytic: language                           0.396   6.660    0.000   0.242
Analytic: delivery                           0.343   5.623    0.000   0.231
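The collinearity screen described above can be reproduced with statsmodels' variance inflation factor, reporting tolerance as 1/VIF to mirror Table 4; again, the data file and column names are hypothetical.

```python
# Collinearity diagnostics for the three analytic predictors; illustrative only.
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

scores = pd.read_csv("ratings.csv")              # hypothetical file, as before
X = sm.add_constant(scores[["fidelity", "language", "delivery"]])

for i, name in enumerate(X.columns):
    if name == "const":
        continue
    vif = variance_inflation_factor(X.values, i)
    # Field (2018): tolerance < 0.2 or VIF > 10 would flag a collinearity problem.
    print(f"{name}: VIF = {vif:.2f}, tolerance = {1 / vif:.3f}")
```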

5 Discussion

The results of the correlation analysis show that the scores of fidelity, language, and delivery were significantly correlated with the candidates' overall interpreting performance (r1 = 0.914, r2 = 0.932 and r3 = 0.927). Multiple linear regression analysis suggests that the three assessment criteria could predict the candidates' overall interpreting performance to varying degrees. Among them, the criterion of language was the most powerful predictor (β2 = 0.396), followed by the criteria of delivery (β3 = 0.343) and of fidelity (β1 = 0.300).

The finding that the criterion of language turned out to be a more powerful predictor of the overall performance than the criterion of fidelity might be explained by the fact that, when interpreting into their L2, the student interpreters could enjoy a "comprehension bonus" but were challenged by a "production deficit" (Setton and Dawrant 2016, p. 119). Thus, when assessing Chinese-to-English interpreting, the raters could focus on the criterion of language quality so as to distinguish the higher-ability interpreters from the lower-ability ones. Accordingly, the raters might have attended more to the criterion of language quality than to that of fidelity. The primary importance of the criterion of language quality was also supported by the two raters in the retrospective interview. Both raters generally agreed that the criterion of language quality should be prioritized in the assessment of interpreting into L2. It is worth noting that Rater 1 argued that a larger weighting should be assigned to fidelity than to delivery. We provide the following quotations from the raters to illustrate this point.

Rater 1: "I think they should be weighed differently. I'd put them in this order: Language > Fidelity > Delivery. Although in two distinctive cases, I remember that one student had great fidelity but not so great delivery, which made his/her overall score lower than it could be, and another student who had wonderful delivery also had glaring meaning errors, so his overall score suffered".

Rater 2: "I mostly agree with the findings - students who demonstrated excellent language control were more often than not better interpreters than their peers who struggled with grammar, pronunciation, etc."

Nevertheless, the greater importance that the raters in our study attached to the criterion of language quality is at odds with previous findings in the language testing literature (Galloway 1980; Fayer and Krasinski 1987). Previous literature shows that NNS and NES raters exhibit highly divergent behaviors when assessing a candidate's English spoken performance: NNS raters focus more on grammatical form (Galloway 1980) and linguistic form (Fayer and Krasinski 1987) than NES raters do. On the contrary, according to the data from our study, the two English-speaking raters gave more priority to the criterion of language quality than to that of fidelity when assessing interpreting into English. Such discrepancies may further suggest that assessment of interpreting differs from that of general spoken language performance.

The finding that the raters assigned the second largest proportion of the total weighting to the criterion of delivery conforms to the results reported in a number of previous studies on interpreting assessment (e.g., Gile 1991; Kurz 2001; Pöchhacker 2001). In a survey of conference interpreters and delegates, Kurz (2001) found that fluency was ranked very high among a list of quality criteria, only after sense consistency and terminology. Similarly, Pöchhacker (2001) elicited opinions from 704 AIIC interpreters on the importance of the assessment criteria for interpreting and found that fluency was ranked in third place after sense consistency and logical cohesion. In the current study, the raters assigned 34.3% of the total weighting to delivery, right after the criterion of language quality. The critical role of delivery was also supported by the raters. Below are some illustrative remarks from the two raters.

Rater 1: "I agree with the importance of delivery in assessing interpreting. Hesitations, excessive occurrence of fillers and repairs and backtracking can be really irritating so as to obscure raters' assessment of the other two criteria (fidelity and language). However, I think it's interesting how fidelity ranked last! I would have thought that fidelity would be rated more importantly."

Rater 2: "As fluency of delivery directly affects the listeners' comprehension of the interpretation, the role of delivery cannot be over-estimated when assessing interpreting, regardless of language direction. Disfluencies in delivery tend to make the interpretation untrustworthy, thus affecting the credibility of the interpreter even when most of the source message is rendered by the interpreter."

As seen from the above remarks, both raters argued that the criterion of delivery played a critical role in their assessment of interpreting and could sometimes overshadow the interpreter's performance on the other two assessment criteria, namely fidelity and language. Yet it is worth noting that the remarks from Rater 1 suggest that she had not expected fidelity to carry the least weight. She went on to contend in the interview that slight adjustments should be made to foreground the importance of fidelity.

The finding that the criterion of fidelity was assigned the least weighting in our study runs counter to the generally accepted view in the interpreting literature that fidelity-related criteria are more important than others (Gile 1991; Roberts 2000; Kurz 2001; Lee 2008; Choi 2013; Skaaden 2013; Lee 2015; Setton and Dawrant 2016). Setton and Dawrant (2016, p. 434) argued that, regardless of language direction, fidelity should always be considered a primary factor as long as performances in expression and delivery are acceptable. However, in the current study, the analytic ratings on fidelity explained 30% of the score variance (β1 = 0.300) in the candidates' overall interpreting performance, which is lower than that of language quality (β2 = 0.396) and of delivery (β3 = 0.343). One possible explanation for these findings might relate to the influence of directionality. When interpreting into her/his L1, the interpreter is faced with a "comprehension deficit" but enjoys a "production bonus" (Setton and Dawrant 2016, p. 119). Raters might therefore tend to pay more attention to those aspects of interpreting performance that may be affected by the interpreter's comprehension (e.g., the criterion of fidelity), and so assign more weighting to fidelity. When raters assess interpreting into L2, by contrast, they tend to heed the quality of the target language. Finally, it is also interesting to note that our finding about the weighting of fidelity does not concur with what was reported in Lee (2015), in which native speakers of Korean (i.e., the raters) assessed English-to-Korean consecutive interpreting. It was found that the criterion of content was assigned 50% of the total weighting, whereas the criteria of form and of delivery were each assigned 25%. This disparity calls for further investigation into the potential differences involved in assessing interpreting in different directions.

These research findings may have both theoretical and pedagogical implications for interpreting studies and interpreter training. First, the differently perceived importance of the assessment criteria, as manifested in the tentative weighting scheme in this study, may spark further research on rater-mediated assessment of interpreting. Our study suggests that the raters attached greater importance to the criterion of language quality than to that of fidelity when assessing interpreting into English. Given that little research has been conducted to examine raters' behavior in assessing interpreting into both their L1 and L2, it would be of interest to investigate the potential influence of directionality on raters' perceived importance of different assessment criteria. Second, whereas our study shows that the criterion of language quality should be given more importance when raters assess interpreting into L2, findings from previous research emphasize the importance of fidelity. This seemingly contradictory finding may suggest the need to develop differentiated assessment rubrics for assessing interpreting into L1 and into L2, as raters attach importance to different aspects of interpreting performance in different directions. Third, in our study the criterion of delivery was given a much higher weighting compared with that reported in previous research; it was assigned 34.3% (β3 = 0.343) of the total weighting by the two raters. This may point to the role of delivery features in shaping raters' perception of overall interpreting quality. According to Rennert (2010), how an interpreted rendition is delivered affects intelligibility and user perception. It is therefore important for researchers to pay attention to the dimensions of "listener orientation" and "target text comprehensibility" (Pöchhacker 2001) or "comprehensibility" and "plausibility" (Schjoldager 1995). However, we also need to guard against the tendency to pursue smooth and fluent delivery at the expense of fidelity. Delivery may create a "false impression of high quality" when the original message is actually distorted or fabricated (see Kurz 2001). Similarly, Gile (1991) warned against the likelihood of the "packaging effect" of smooth delivery obscuring the assessment of fidelity. This is indeed a balancing act.

6 Conclusion

This study set out to explore a weighting scheme for assessing Chinese-to-English interpreting, based on the ratings provided by two native English-speaking raters who were also conference interpreters. The quantitative findings of the study suggest that the criterion of language quality should be given the largest weighting, followed by delivery and fidelity. More specifically, according to the research results, language quality, delivery, and fidelity should be assigned 39.6% (β2 = 0.396), 34.3% (β3 = 0.343) and 30.0% (β1 = 0.300) of the total weighting, respectively. This is generally consistent with the data generated from the interviews with the two raters.

Despite these preliminary findings, the study is not without limitations. First, the sample size of raters was very small, because it is highly difficult to recruit raters who are native speakers of English and have also been trained in conference interpreting in China. The findings of the present study thus remain tentative and need further validation; future research could involve a much larger number of qualified raters. Second, there was no control group in the study. Future research could produce more convincing results by comparing the assessments of two groups of raters: (a) native Chinese-speaking interpreters and (b) native English-speaking interpreters.
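One possible way to operationalize this tentative weighting scheme is as a weighted composite of the three analytic scores, sketched below; the chapter reports the beta weights but does not prescribe such a formula, so this is one reading of the findings rather than the author's scoring method.

```python
# Illustrative weighted composite based on the reported beta weights,
# normalized so the weights sum to 1; not the chapter's own formula.
RAW_WEIGHTS = {"language": 0.396, "delivery": 0.343, "fidelity": 0.300}
TOTAL = sum(RAW_WEIGHTS.values())
WEIGHTS = {criterion: w / TOTAL for criterion, w in RAW_WEIGHTS.items()}

def composite_score(fidelity: float, language: float, delivery: float) -> float:
    """Weighted composite on the same 10-point scale as the analytic ratings."""
    return (WEIGHTS["fidelity"] * fidelity
            + WEIGHTS["language"] * language
            + WEIGHTS["delivery"] * delivery)

# Example using the scores from the sample marking sheet in Appendix 4.
print(round(composite_score(fidelity=8, language=7, delivery=6), 2))  # ~6.96
```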


The concept of interpreting quality is "essentially relative and multi-dimensional" (Pöchhacker 2004, p. 153). Multiple factors, such as the context and the user perspective, should therefore be taken into consideration in interpreting assessment. Qualified raters with the desired characteristics (e.g., language background) play an important role in ensuring scoring reliability. We hope that the current study will spark more investigation into raters' assessment of interpreting quality in different directions.

Appendix 1 The Transcript of the Chinese Source Speech 大家好,非常高兴能够出席今天的媒体见面会。那么,酝酿许久的烟草行业改革 呢,终于经过国务院审批,将于今年五月份的时候正式启动。看到今天有这么多 的媒体记者朋友出席今天的见面会呢,我也可以了解到大家都是对这个非常感 兴趣,那么今天呢就由我就本次改革的相关背景、内容和其影响给大家做一个 通报。我看到今天在场的还有很多外国的记者朋友,所以我觉得还是有必要先 对中国的这个烟草行业的概况给大家做一个简要的介绍。中国的烟草行业呢 是有其特殊性的。一方面,吸烟有害健康,每年中国有很多人死于由吸烟引发的 肺癌等疾病,这是对社会的和谐和稳定有不利的影响。另外一方面呢,烟草行业 又是国民经济的一个重要的组成部分,它呢为国家贡献了相当一部分的财政收 入,所以呢,烟草行业在中国是属于国家垄断行业。那么,整个行业呢分工业和商 业两大环节。工业环节呢包括烟叶的采购与加工,还有卷烟的制造;商业环节呢 包括卷烟的批发和零售。那么除了卷烟零售商和物流企业之外,所有涉及烟草 的行业、企业基本上都是属于国有企业,隶属于央企—中烟总公司。那么以上 呢就是我对烟草行业的一个简要的介绍。 接下来我再讲一讲这次改革的背景。一方面呢,由于这个国家宏观经济放缓 和结构性减税措施的出台,我们现在的这个财政收入呢有所减少,使得财政收 支不平衡进一步加大。另外一方面呢,整个烟草行业也在面临越来越多的来自 政府、舆论、公众和控烟组织的压力。那么控烟呢是我们中国必须要走的道 路,但同时,烟草行业的税率比较高,而且企业呢又是国有企业的本质,所以呢整 个行业通过税收和国有企业资本利润的这个上缴呢为国家财政贡献了百分之 七的收入。那么怎么样在不影响中国经济稳定发展的前提下实现控烟的目的, 是我们政府面临的一个非常重要也非常困难的课题。经过各方反复地磋商,我 们呢终于找到了一条路径,也就是此次的这次改革。具体的来说呢就是一次对 烟草价、税的调整。 价税调整顾名思义既调价又调税。价格方面呢,卷烟厂卖给批发企业的出厂 价维持不变,批发商卖给零售商的统一批发价呢在所有的烟草制品品类上上浮 百分之六。地方烟草局给零售商的指导价格,也就是我们所说的建议零售价呢, 根据这个统一批发价顺提。为什么提建议零售价呢,就是因为零售商他可以根 据本地的这个供求关系,在这个建议零售价的基础上进行调整,出一个实际的 零售价格。那么在税收方面呢,我们是在批发环节,对每箱香烟征收两百五十元 的批发税。如果此次改革能够按照我们预期得以落实的话,2015年我们可以实 现香烟百分之,香烟销售量,八十万到九十万箱的减少。但是由于价格,综合价格 上浮和利润,和税收的这个上调等因素呢,我们整个行业对国家财政的贡献呢 还能够实现一千亿的增收。如果要是这样能够成功的话,就说明我们是找到了


一条在稳定经济为前提,以稳定经济为前提下实现控烟的路径,而且也为未来 后续的这个控烟努力和进一步的改革奠定了基础。以上就是我的通报,谢谢。

Appendix 2 English Translation of the Source Speech

Good morning everyone! It is a pleasure to meet the press today. The reform of the tobacco industry, which was long in the making, has finally been approved by the State Council and will be launched officially this May. Seeing so many friends from the media at today's press conference, I can presume that you are all very interested in this topic, so today I will brief all of you on this reform in terms of its background, contents, and influences. Since there are also many journalists from other countries, I think it would be necessary to give you a brief introduction to China's tobacco industry, which has its peculiarities as a matter of fact. On the one hand, as is known to all, smoking is harmful to health, and every year a large number of people in China die from smoking, which hampers social harmony and stability. On the other hand, however, as an important component of the national economy, the tobacco industry contributes quite a large proportion of the country's financial revenue, and therefore this industry in China is considered to be a state-approved monopoly that can be divided into two major parts, namely, industrial and commercial. The industrial part involves the purchasing and processing of tobacco leaves as well as the production of cigarettes, while the commercial part includes the wholesale and retail of cigarettes. Also, all businesses and enterprises related to tobacco, except cigarette retailers and logistics companies, are state-owned enterprises that are subordinated to China National Tobacco Corporation, a central enterprise. This is what I would like to introduce about the tobacco industry.

Next, I will talk about the background of this reform. For one, due to the slowdown in the macro-economy and the introduction of structural tax reductions, we are witnessing decreases in financial revenue, which further exacerbates the imbalance between financial revenue and expenditure. For another, the whole tobacco industry is also faced with more and more pressure from the government, the public, and organizations of tobacco control. China, for sure, must control smoking within the country, but at the same time, because of the high tax rate of the tobacco industry and the fact that related enterprises are state-owned in nature, the whole industry contributes 7% of the national finance through tax revenue and capital profits from the state-owned businesses. So, to control smoking without affecting China's stable economic development is a very important yet very difficult task in front of the government. After consultations of all parties, we finally found our solution, which is this reform. Specifically speaking, it will be a reform of adjustments to both the price and the tax of tobacco.

In terms of the price, the factory price at which the cigarette factories sell to the wholesale enterprises will remain the same, while the unified wholesale price at which the wholesalers sell all tobacco products to the retailers will be lifted by 6%. The guiding price offered by the local tobacco board to retailers, which is known to us as the recommended retail price, will be lifted accordingly with the unified wholesale price. The reason for that is that retailers will be able to offer an actual retail price by making adjustments to the guiding price according to the local relation between demand and supply. In terms of adjustments of the tax, we require a 250 RMB wholesale tax on each carton of cigarettes. If this reform is implemented as we expect, by 2015 we will be able to cut the volume of sales of cigarettes by 800,000–900,000 cartons, while at the same time, due to price increases, profits, and the uplifted tax, the whole industry will increase its contribution to the national finance by 100,000,000,000 RMB, and such success will prove that we have actually found a way to control smoking while maintaining economic stability. This will also lay a foundation for more efforts to control smoking and for future reforms. This is all of my briefing. Thank you.

Appendix 3 Band Descriptions

1. The band descriptions below are provided for the reference of jury members at the Exam.
2. They are intended to provide a structured framework for grading the performances of candidates in each test, taking into account:
   (a) Fidelity.
   (b) Target language quality (Language).
   (c) Delivery.
3. After each candidate has completed each test, jury members are invited to fill in the marking sheet provided, as follows:
   (a) assign a score in each of these three areas, having regard to the band descriptions provided in this document for Distinction, Good, Fair, Weak and Poor respectively;
   (b) assign an overall grade of Distinction, Pass, Discussion or Fail;
   (c) make detailed comments on the strengths and weaknesses of the interpretation.

Fidelity

Distinction (9–10): Full, faithful and accurate rendering of all message elements in the passage, including all or nearly all details, nuances, mood and tone.
Good (7–8): Faithful and accurate rendering of all important message elements and most details in the passage, with no significant meaning errors.
Fair (5–6): Despite generally clear rendering of all important message elements and most details, there exist isolated and infrequent minor meaning errors on details (but NOT on key messages) that will not fundamentally mislead the audience or embarrass the speaker.
Weak (3–4): There exist more serious isolated meaning errors that might mislead listeners, or a pattern of minor distortions; non-trivial omission or incompleteness.
Poor (1–2): Serious misinterpretation of important message elements, resulting in major meaning errors that would mislead the audience or embarrass the speaker; serious omission of important message elements.

Language

Distinction (9–10): Strong and impressive command of the target language, including register, terminology, word choice and style.
Good (7–8): Solid command of the target language at the required standard for A or B, with appropriate register and terminology.
Fair (5–6): Appropriate, acceptable; for the most part, idiomatic and clear. Occasional problems with register and idiomatic usage (but not with basic grammar and pronunciation).
Weak (3–4): Inadequate command of register, technical terms not rendered accurately. Output is clearly understandable, but contains too many distracting errors of grammar, usage or pronunciation.
Poor (1–2): Inadequate language skills: e.g., pattern of awkward, faulty expression, strong foreign accent, poor grammar and usage, inadequate vocabulary.

Delivery

Distinction (9–10): Very clear, with expressive and lively delivery. The candidate is very communicative, as though giving his or her own speech with momentum and conviction.
Good (7–8): Fluent and effective delivery: minimum hesitations or voiced pausing (um-er), intelligent prosody.
Fair (5–6): Some recurrent delivery problems, such as hesitation, backtracking, voiced pausing (uh, um), tolerable for the audience but not quite as polished as expected in a trained interpreter.
Weak (3–4): Delivery exhibits patterns of hesitation and backtracking.
Poor (1–2): Stammering, halting delivery.


Appendix 4 Sample Marking Sheet

Examinee No.: 4

Criterion        Score (Distinction 9–10, Good 7–8, Fair 5–6, Weak 3–4, Poor 1–2)
Holistic score   7
Fidelity         8
Language         7
Delivery         6

Please tick the one that best represents your view of the candidate's performance.

Overall performance: Distinction / Pass / Discussion / Fail

Strengths: Generally accurate meaning. Successfully used generalization to circumvent vocabulary issues/meaning loss. Voice is calm, no fillers. Accurate numbers, though switched sales for consumption, which is a minor inaccuracy.

Weaknesses: Awkward expressions: "to hold this media press"? "Audition"? "Ordinary people"? Misuse of/missing conjunctions and prepositions. Backtracking. Very long pauses during interpretation that show gaps in memory. The interpreter sacrificed the whole picture for details, for example the first sentence of segment 3.

References

Bachman, Lyle. 1990. Fundamental considerations in language testing. Oxford: Oxford University Press.
Bachman, Lyle, and Adrian S. Palmer. 1996. Language testing in practice. Oxford: Oxford University Press.
Bonk, William, and Gary Ockey. 2003. A many-facet Rasch analysis of the second language group oral discussion task. Language Testing 20 (1): 89–110.
Bühler, Hildegund. 1986. Linguistic (semantic) and extralinguistic (pragmatic) criteria for the evaluation of conference interpretation and interpreters. Multilingua 5 (4): 231–235.
Campbell, Stuart, and Sandra Hale. 2003. Translation and interpreting assessment in the context of educational measurement. In Translation today: Trends and perspectives, ed. Gunilla Anderman and Margaret Rogers, 205–224. Clevedon: Multilingual Matters.
Chalhoub-Deville, Micheline. 1995. Deriving oral assessment scales across different tests and rater groups. Language Testing 12 (1): 16–33.


Chalhoub-Deville, Micheline, and Gillian Wigglesworth. 2005. Rater judgment and English language speaking proficiency. World Englishes 24 (3): 383–391.
Choi, Jungyoon. 2013. Assessing the impact of text length on consecutive interpreting. In Assessment issues in language translation and interpreting, ed. Dina Tsagari and Roelof van Deemter, 85–96. Frankfurt am Main: Peter Lang.
Clifford, Andrew. 2001. Discourse theory and performance-based assessment: Two tools for professional interpreting. Meta 46 (2): 365–378.
Eckes, Thomas. 2015. Introduction to many-facet Rasch measurement: Analyzing and evaluating rater-mediated assessment. Frankfurt, New York, Oxford, Warszawa, and Wien: Peter Lang.
Fayer, Joan M., and Emily Krasinski. 1987. Native and nonnative judgments of intelligibility and irritation. Language Learning 37 (3): 313–327.
Field, Andy. 2018. Discovering statistics using IBM SPSS statistics, 5th ed. Los Angeles: Sage.
Galloway, Vicki. 1980. Perceptions of the communicative efforts of American students of Spanish. Modern Language Journal 64 (4): 428–433.
Gile, Daniel. 1991. A communication-oriented analysis of quality in nonliterary translation and interpretation. In Translation: Theory and practice: Tension and interdependence, ed. Mildred L. Larson, 188–200. Binghamton, NY: SUNY.
Gravetter, Frederick, and Larry B. Wallnau. 2013. Statistics for the behavioral sciences, 9th ed. Belmont, CA: Wadsworth.
Green, Anthony, and Roger Hawkey. 2010. Marking assessments: Rating scales and rubrics. In The Cambridge guide to second language assessment, ed. Christine Coombe, Peter Davidson, Barry O'Sullivan, and Stephen Stoynoff, 299–306. New York: Cambridge University Press.
Hadden, Betsy L. 1991. Teacher and nonteacher perceptions of second-language communication. Language Learning 41 (1): 1–24.
Han, Chao. 2015. Investigating rater severity/leniency in interpreter performance testing: A multifaceted Rasch measurement approach. Interpreting 17 (2): 255–283.
Han, Chao. 2016. Investigating score dependability in English/Chinese interpreter certification performance testing: A generalizability theory approach. Language Assessment Quarterly 13 (3): 186–201.
Hyland, Ken, and Eri Anan. 2006. Teachers' perceptions of error: The effects of first language and experience. System 34 (4): 509–519.
Johnson, Robert, James A. Penny, and Belita Gordon. 2009. Assessing performance: Designing, scoring and validating performance tasks. New York: The Guilford Press.
Kim, Youn-Hee. 2009. An investigation into native and non-native teachers' judgments of oral English performance: A mixed-methods approach. Language Testing 26 (2): 187–217.
Kurz, Ingrid. 1989. Conference interpreting—User expectations. In Coming of Age: Proceedings of the 30th Annual Conference of the American Translators Association, ed. Diana L. Hammond, 143–148. Medford: Learned Information.
Kurz, Ingrid. 1993. Conference interpretation: Expectations of different user groups. The Interpreters' Newsletter 5: 13–21.
Kurz, Ingrid. 2001. Conference interpreting: Quality in the ears of the user. Meta 46 (2): 394–409.
Lee, Jieun. 2008. Rating scales for interpreting performance assessment. The Interpreter and Translator Trainer 2 (2): 165–184.
Lee, Sang-Bin. 2015. Developing an analytic scale for assessing undergraduate students' consecutive interpreting performances. Interpreting 17 (2): 226–254.
Linacre, John. 1989. Many-facet Rasch measurement. Chicago: MESA Press.
Linacre, John. 2005. A user's guide to facets: Rasch model computer programs [Computer software and manual]. Retrieved April 10, 2005, from www.winsteps.com.
Liu, Minhua. 2013. Design and analysis of Taiwan's interpretation certification examination. In Assessment issues in language translation and interpreting, ed. Dina Tsagari and Roelof van Deemter, 163–178. Frankfurt: Peter Lang.
Liu, Minhua. 2015. Assessment. In Routledge encyclopedia of interpreting studies, ed. Franz Pöchhacker, 20–23. London: Routledge.

Lumley, Tom, and Timothy F. McNamara. 1995. Rater characteristics and rater bias: Implications for training. Language Testing 12 (1): 54–71. Ng, B.Chin. 1992. End users’ subjective reaction to the performance of student interpreters. The Interpreters’ Newsletter 1: 35–41. Pöchhacker, Franz. 2004. Introducing interpreting studies. London and New York: Routledge. Pöchhacker, Franz. 2001. Quality assessment in conference and community interpreting. Meta 46 (2): 410–425. Pöchhacker, Franz. 2015. Routledge encyclopedia of interpreting studies. London: Routledge. Rennert, Sylvi. 2010. The impact of fluency on the subjective assessment of interpreting quality. The Interpreters’ Newsletter 15: 101–115. Riccardi, Alessandra. 2002. Evaluation in interpretation: Macro-criteria and micro-criteria. In Teaching translation and interpreting 4: Building bridges, ed. Eva Hung, 115–126. Amsterdam: John Benjamins. Roberts, Roda. 2000. Interpreter assessment tools for different settings. In Critical link 2, ed. Roda P. Roberts, Silvana E. Carr, Diana Abraham, and Aideen Dufour, 103–130. Amsterdam: John Benjamins. Sawyer, David. 2004. Fundamental aspects of interpreter education: Curriculum and assessment. Amsterdam: John Benjamins. Schjoldager, Anne. 1995. Assessment of simultaneous interpreting. In Teaching translation and interpreting 3: New horizons, ed. Cay Dollerup and Vibeke Appel, 187–195. Amsterdam: John Benjamins. Setton, Robin, and Manuela Motta. 2007. Syntacrobatics: Quality and reformulation in simultaneous-with-text. Interpreting 9 (2): 199–230. Setton, Robin, and Andrew Dawrant. 2016. Conference interpreting: A trainer’s guide. Amsterdam: John Benjamins. Sheorey, Ravi. 1986. Error perceptions of native speaking and non-native speaking teachers of ESL. ELT Journal 40 (4): 306–312. Shi, Ling. 2001. Native and nonnative-speaking EFL teachers’ evaluation of Chinese students’ English writing. Language Testing 18 (3): 303–325. Shohamy, Elana. 1985. A practical handbook in language testing. Tel Aviv: Tel Aviv University. Skaaden, Hanne. 2013. Assessing interpreter aptitude in a variety of languages. In Assessment issues in language translation and interpreting, ed. Dina Tsagari and Roelof van Deemter, 35–50. Frankfurt am Main: Peter Lang. Su, Wei. 2019. Exploring native English teachers’ and native Chinese teachers’ assessment of interpreting. Language and Education 33 (6): 577–594. Su, Wei, and Xiaoqi Shang. 2019. NNS and NES teachers’ co-teaching of interpretation class: A case study. The Asia Pacific Education Researcher 29 (4): 353–364. Tang, Jun. 2017. Translating into English as a non-native language: A translator trainer’s perspective. The Translator 23 (4): 388–403. Weigle, Sara. 2002. Assessing writing. Cambridge: Cambridge University Press. Wu, Fred S. 2013. How do we assess students in the interpreting examinations? In Assessment issues in language translation and interpreting, ed. Dina Tsagari and Roelof van Deemter, 15–33. Frankfurt am Main: Peter Lang. Zhang, Ying, and Catherine Elder. 2011. Judgments of oral proficiency by non-native and native English speaking teacher raters: Competing or complementary constructs? Language Testing 28 (1): 31–50.

Xiaoqi Shang is an assistant professor in the School of Foreign Languages at Shenzhen University, China. He received his Ph.D. in interpreting studies at the Graduate Institute of Interpretation and Translation at Shanghai International Studies University. His research interest lies in testing

and assessment of interpreting. His recent publications have appeared in such journals as Perspectives, The Interpreter and Translator Trainer, Asia Pacific Education Researcher, and Chinese Translators Journal.

Rubric-Based Self-Assessment of Chinese-English Interpreting Wei Su

Abstract Although previous research suggests that rubrics can potentially facilitate self-assessment in traditional language tasks (e.g., writing, speaking), relatively little is known about rubric use in interpreter training. To address this gap, the current study tracked a group of interpreting learners (n = 32) over six weeks on how they used a rubric to self-assess their Chinese-English interpreting. The strategy-based rubric consisted of three criteria: (a) lexical conversion, (b) syntactic conversion, and (c) discourse expansion. The study found that, thanks to the rubric, the students could detect their strengths and weaknesses in strategy use more effectively, leading to considerably more assessment points. In addition, their interpreting performance as measured by error-free information segments had improved moderately, indicating that the acquisition of strategy use was a slower and more flexible process than strategy awareness. Students’ self-reports also revealed that their gains in strategy awareness and acquisition related mainly to local level strategies rather than to global, discourse-level strategies. Based on the results, we provided some suggestions on using a strategy-based rubric in the teaching of interpreting. Keywords Self-assessment · Rubrics · Rubric training · Strategy · Interpreting

1 Introduction A rubric is a document that articulates the expectations for an assignment by listing the criteria, or what counts, and by describing levels of quality for the criteria (Andrade 2000). It has been widely seen as an important instructional tool in various educational contexts: it can help students set goals and monitor progress in attaining them (Stevens and Levi 2013), boost students' confidence in marking and assessing (Jones et al. 2017), and facilitate their assessment practices (Patri 2002; Wang 2014). The facilitative function is worth noting, not only because self- or peer assessment relieves instructors' burden of marking, but also because it helps to cultivate a learning

community which engages students in higher levels of self-reflection (Tsui and Ng 2000; Camp and Bolstad 2011; Novakovich 2016). Despite these assumed benefits, the debate on whether students can be competent rubric users has spawned a number of quantitative (e.g., Saito 2008; Babaii 2016) and qualitative studies (e.g., Cheng and Warren 2005; Fujii and Mackey 2009), resulting in inconsistent, if not conflicting, research findings. Some researchers find that, thanks to rubrics, (student) assessors can become independent learners (e.g., Kissling and O'Donnell 2015), whereas others argue that rubric use constrains assessors in the process of learning management. For example, it could limit students' thinking and imagination in their writing assignments (Li and Lindsey 2015), or constrain their discussion about writing quality levels (Li and Lorenzo-Dus 2014). A partial solution to the debate might be to explore the extent to which rubrics can actually improve students' learning, and to identify the areas where interventions are needed to complement rubric use. Interpreting researchers have explored some key areas in rubric use, for example, the design of rubric criteria (e.g., Bartłomiejczyk 2007), students' perception of rubrics (e.g., Lee 2017), and rubric training (e.g., Su 2020; Su 2021). The present study aims to explore how rubric use in the context of Chinese-English interpreting training could influence students' learning. Our focus on the specific pedagogical context of interpreter training can be justified for two reasons. First, from a theoretical point of view, students' use of an interpreting rubric could provide a valuable window through which teachers can obtain rich information about student self-learning. Students' L1 background has been consistently shown to affect their L2 performance (e.g., Horiba 2012), yet their awareness of such L1 influences and their ability to overcome them have been under-researched. An interpreting rubric could guide students to compare their rendition with the source text, identify traces of L1 interference, and manage their learning accordingly. Results of rubric research in interpreting are expected to inform interpreting training and self-regulated learning. Second, in practical terms, an investigation of student interpreters' rubric use can potentially benefit interpreting teaching. As of 2019, 249 schools in China had been offering the Master of Translation and Interpreting (MTI) program and had enrolled nearly 60,000 students (Zhong 2019). The growing number of interpreting students warrants a better understanding of their needs and difficulties in using rubrics in interpreting learning.

2 Literature Review 2.1 Rubrics in Self-Assessment Self-assessment has been considered an epitome of formative assessment (Brown 2004; Taras 2010), as it prompts students to reflect on their own learning process and helps them heed performance quality rather than completion of tasks. To fulfil

the formative function of self-assessment, a rubric is often used for affective and cognitive reasons. Affectively, providing students with a rubric in their assessment could stimulate their motivation in assessment and learning (e.g., Covill 2012). When students are given the specific description of what success looks like, they could become more confident in their abilities to succeed and more motivated to work harder. With boosted confidence, they are more willing to pinpoint weaknesses, and to identify the gap between their performance and expected standards. In addition, by aligning themselves with performance standards, students could constantly adjust their learning strategies and modify their learning goals, transforming themselves into self-responsible and self-regulated learners (Kissling and O’Donnell 2015). Cognitively, a rubric can relieve student assessors’ working memory load and speed up their decision-making process (Su 2020). During self-assessment, students look for evidence of good/bad performance, focus on important features, retain such evidence in memory, and finally make informed judgments. The use of rubrics could reduce the amount of information that must be held in working memory, and thus free up cognitive resources for more effective problem-solving and decision-making (Holliway and McCutchen 2004; McCutchen 2006; Min 2016).

2.2 Student’ Experiences of Rubric Use Based on the theoretical explanation of rubric scaffolding, a number of studies have tapped how students actually use a rubric in assessment (e.g. Reddy and Andrade 2010; Brown et al. 2014; Kissling and O’Donnell 2015; Huang 2016). Huang (2016) studied how 50 EFL students at a college in Taiwan used a rubric to self-assess their speaking performance. She found that assessment criteria like content and delivery could guide student assessors to count their error types and frequencies, and produce meaningful and abundant self-feedback. Similarly, Kissling and O’Donnell (2015) tracked 13 students’ self-assessment of oral Spanish over one semester. They found that with the aid of the rubric, the learners could align their perceptions of oral abilities more closely with the rubric, and could provide a balanced and clear account of their strengths and weaknesses. Moreover, the students claimed that they had made progress in oral proficiency over the three assessment sessions, though such progress was self-claimed and was not validated with regard to other measures. Students’ experiences of rubric use are not always positive. Li and Lindsey (2015) examined 98 university students’ experience of rubric use in writing and their related attitude. It is found that the students’ attitudes towards the role of the rubrics in learning were divided: while some students claimed that the rubrics made their writing more manageable, others were worried that the rubrics could limit their thinking and restrain their imagination in writing. Similar reservations about the rubric use were also reported in Cucchiarini et al. (2002) and Li and Lorenzo-Dus (2014).

In sum, the reported gains of rubric use in speaking and writing seem to be largely impressionistic and inconsistent. Although students appear to produce somewhat better speaking/writing products as measured by some macro criteria, it is difficult to quantify how much better they have become with regard to a given criterion (e.g., content) over a certain period of time. To address this problem, we use the task of language interpreting as a better lens to observe students' progress. Through the practice of interpreting, we can measure how much source-language content each student is able to reproduce each time, and we can also obtain a more accurate understanding of students' progress over time.

2.3 Rubric Use in Interpreting Learning Traditionally, rubrics in interpreting have been developed and used to certify professional interpreters (e.g., Clifford 2001; Liu 2013). The assessments with such rubrics are mostly summative as they test whether the product of interpreting meets professional quality standards. As general interpreting education has been extended from postgraduate-level programs to undergraduate-level courses, particularly in China, interpreting is increasingly used as an important task to foster and train certain skills in foreign language learning. As a result, more process-oriented rubrics are developed to track undergraduate students' interpreting learning (Zannirato 2008; Bale 2013; Setton and Dawrant 2016). One common type of rubric used by interpreting teachers and learners is strategy-based. An interpreting strategy is usually defined as an intentional and goal-oriented procedure to solve problems during interpreting (Bartlomiejczyk 2006; Gile 2009). A strategy-based rubric can thus guide assessors to judge whether interpreting performance is indicative of such a problem-solving ability. With a strategy-based rubric, teachers could design exercises for strategy teaching, and students could self-assess the degree of strategy acquisition. Li (2013) found that the teaching of specific strategies is positively related to students' strategy use, though it is less certain whether strategy teaching is effective. Later, Li (2015) summarized four groups of strategies: (a) language-based, (b) meaning-based, (c) delivery-based, and (d) knowledge-based strategies. For example, language-based strategies like syntactic transformation can be used to cope with dissimilar grammatical structures between the source and the target languages. Thus, students could use such strategy rubrics to self-assess their degree of success in addressing language-based problems. According to Meuleman and Van Besien (2009), when a source speech consists of complex syntactic structures or dense information, the use of a segmentation strategy would help students retain essential information. Using a strategy-based rubric could potentially assist assessors in judging whether a certain strategy has been applied. Specifically, interpreting from a high-context and implicit source language (e.g., Chinese) to a low-context and explicit target language (e.g., English) often requires the deliberate and conscious application of such strategies as expansion (i.e., adding more words) and conversion (i.e., shifting sequences or parts of speech) (Wu 2001,

Chen 2011; Su 2019b). Setton and Dawrant (2016) observed, based on their extensive teaching experience, that Chinese EFL students often omitted source content in their interpretations due to poor command of these strategies, and that they did not reflect on strategy use in self-assessment. Despite the theoretical discussion above, little evidence has been available to give us a clear understanding of whether and how students' use of strategy-based rubrics during their learning could help them prevent information loss and retain a maximal amount of source-language information. In sum, whereas interpreting rubrics are generally perceived as motivating for student learning, their scaffolding effects have been under-examined. Although a few studies claim that students can make progress because of rubric use, such progress has not been measured and substantiated. To address this gap, I aim to track students' progress as they use a strategy-based, formative, learning-oriented interpreting rubric in their self-assessment over the course of a six-week workshop on Chinese-English interpreting. Drawing on suggestions by previous studies (e.g., Gile 2009; Su 2019a), I chose information recovery ratio (i.e., how much of the source-language content has been rendered) as a window to measure students' learning at different times. I also investigated students' use of strategies and their perceptions of the rubric after six weeks' self-assessment practice. Specifically, the research questions (RQ) are as follows:

RQ1: How did the students use a strategy-based rubric in self-assessment during the six-week workshop?
RQ2: To what extent did the students make progress, as measured by information recovery ratio, after the six weeks' practice?
RQ3: How did they perceive their rubric use during the six-week workshop?

3 Methodology 3.1 Participants The study took place in an interpreting workshop the author coordinated in a Chinese university in 2019. Members of the workshop met every Thursday afternoon for one hour to help those students who were interested in interpreting to enhance their mastery of interpreting strategies through deliberate practice and self-assessments. A total of 32 undergraduates registered to participate in the workshop. The workshop included three parts. The first six weeks focused on the strategies for information recovery (i.e., how students could use various strategies to achieve informational correspondence between the source-language speech and the target-language interpretation), followed by another three weeks in which both information recovery and language accuracy (e.g., tense, plural forms) were assessed. Finally, the last three weeks incorporated the paralinguistic aspects of interpreting (e.g., fluency, pronunciation) in the assessment. It was hoped that, through such an incremental procedure,

the students could make progress on one dimension of performance at a time and develop a repertoire of strategies over time. The data collected from Weeks 1 to 6 were used for the current study. At week 1, the author introduced the students to the purpose and the procedure of the research, and 32 students consented to take part in the study. All were native Chinese students, aged 21–22 years, and had learned English for more than eight years. Prior to the workshop, they had finished an 18-week course of interpreting and their end-of-semester scores were 81–87 (the full score was 100). Therefore, they could be regarded as a homogeneous group of upper-intermediate English learners. All data collected were anonymized and kept confidential, in line with the research ethics requirements of the university.

3.2 Interpreting Rubric During the first six weeks' training that focused on information recovery, the author introduced the students to a strategy-based rubric. The rubric was a modified version of Chen (2010), which was originally designed for strategy assessment in interpreting in a specific teaching context (i.e., Chinese students learning interpreting). Its three criteria included lexical conversion, syntactic conversion, and discourse expansion, as illustrated in Table 1. It should be noted that potential strategies for information recovery far outnumbered the listed three. However, given the limited time of the workshop and the scope of the rubric, we had to choose the more frequent and user-friendly strategies to be imparted to the undergraduates (see also Han and Chen 2016; Wu and Liao 2018). Before the data collection, it was also important to provide a transparent definition of self-assessment. Self-assessment in this study was restricted to students' comments and reflections about their success and failure in using each strategy. By checking their interpretations against the source text, and by comparing their renditions with the exemplars, student assessors were expected to identify the instances where they had applied relevant strategies and where they had not. The total number of the instances each student identified was used in the study as a proxy measure of the degree of student engagement with the self-assessment practice. Table 2 shows an excerpt of the self-assessment form completed by Student 12 at week 4. As is shown in Table 2, Student 12 identified her intentional use of lexical conversion in one instance (i.e., where a verb was converted to an adjective). In addition, after she compared her version with that of a professional interpreter (i.e., the exemplar), she further identified three instances where she should have, but had not, applied relevant strategies. In total, her self-assessment of this segment had four instances of strategy use, resulting in four points. Here lies an important distinction in rubric use between interpreting and other language tasks: in interpreting, students not only analyze their actual performance, but also envisage potential performance by comparing it with exemplars. Appropriate use of the rubric and the exemplars by the students not only helps to reveal gaps between their current performance and

Table 1 The strategy-based rubric

Lexical conversion (Grammar-based)
Definition: changing the part of speech of a lexical item without altering meaning
Example:
Source text: 中国经济本身也在转型 (Gloss: China's economy itself is transforming)
Target text: China's economic transition is under way (verb converted to noun)

Syntactic conversion (Grammar-based)
Definition: segmenting/re-sequencing/restructuring sentences
Example:
Source text: 中国新的动能正在生成, 而且超出我们的预期 (Gloss: China's new drivers are forming, and they are beyond our expectation)
Target text: New drivers are taking shape in a way that beats expectations. (join the two sentences into one)

Discourse expansion (Meaning-based)
Definition: adding elements to make the discourse explicit and coherent
Example:
Source text: 去年是世界经济6年来增速最低, 我们还是实现了7%的增长 (Gloss: Last year is world economy six-year low, we still realize 7% growth)
Target text: Last year, despite a six-year low in world growth, we managed to realize 7% growth (add a connector)

Table 2 An excerpt of the rubric-based self-assessment form

Chunk 1
Source text: 中国经济本身也在转型, 一些矛盾长期积累, 不断凸显, 所以说下行的压力确实在持续加大。
My interpretation: Chinese economy itself is now changing from, eh, it has many contra, many problems, they accumulated long time, eh, so they have a larger and larger downward pressure
Exemplar: China's economic transition is under way and its deep-rooted problems are emerging. The downward pressure indeed is increasing
My assessment points:
Strong points:
1. Lexical conversion: verb/adj large
Weak points:
1. Lexical conversion: verb/n transition
2. Lexical conversion: verb/adj deep-rooted
3. Syntactic conversion: its problems are emerging
Total points: 4

the ideal performance, but also helps to identify feasible solutions (i.e., strategies) to bridge the performance gap. This process corresponds to what Vygotsky (1980) describes as the Zone of Proximal Development, in which deliberate efforts need to be made to bridge the gap between what learners can do independently and what they can do with external assistance and further scaffolding. Potentially, a strategy-based interpreting rubric could therefore promote and assist students' learning. At week 1, the teacher/author used 20 min to explain the rubric criteria, the assessment forms and the workshop procedures. For the rest of the time (including the subsequent five weeks), the teacher instituted a three-step model of "interpreting + assessment + critique". First, the students listened to a two-minute Chinese speech excerpt. The speech was then paused every one or two sentences, so that the students could interpret each segment into English. The length of each segment was manipulated so that each segment formed a meaningful, functional interpretation unit, but did not impose too much burden on working memory. Second, in self-assessment, each student was given an assessment form filled with the source text transcript and the exemplar interpretation (produced by a professional interpreter). The students were then asked to listen to the recording of their own interpretation, transcribe it, and fill it in on the form. They then compared their renditions with the exemplar and the source text, and recorded the relevant strategies identified in the self-assessment. Third, the teacher/author randomly selected one assessment form, evaluated its completeness and accuracy on-site, and explained how the teacher himself would assess. This type of live demonstration of self-analysis and self-critique is also called rubric training or teacher intervention (cf. Patri 2002; Min 2016). In addition to each week's workshop, the students were encouraged to conduct more interpreting practice and to rely on the strategy-based rubric in their self-assessment.
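To make the tallying procedure concrete, the following minimal sketch (in Python; the function and variable names are illustrative and not part of the study's materials) mirrors how a completed form such as Table 2 yields a per-segment point count:

    # Hypothetical tally of self-assessment points for one student.
    # For each segment, the student lists the strategy instances marked as
    # strong (strategies actually applied) and weak (strategies that should
    # have been applied); the sum of both kinds is the assessment-point
    # count used in the study as a proxy for engagement with self-assessment.

    def segment_points(strong, weak):
        """Assessment points for one segment = strong + weak instances."""
        return len(strong) + len(weak)

    # The Table 2 excerpt: one strong point and three weak points -> 4 points
    chunk1_strong = ["lexical conversion: verb/adj large"]
    chunk1_weak = [
        "lexical conversion: verb/n transition",
        "lexical conversion: verb/adj deep-rooted",
        "syntactic conversion: its problems are emerging",
    ]
    print(segment_points(chunk1_strong, chunk1_weak))  # prints 4

    # A session total is simply the sum of segment_points over all segments.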

3.3 Test Material The test material the teacher used was based on the Chinese Premier's responses to reporters at the 2016 press conference, and the official interpreter's rendition was used as the benchmark or the exemplar for self-assessment. Based on the authentic speech data, the author selected three excerpts, each about two minutes long and of comparable length and difficulty. Prior to the interpreting practice, the author provided a brief introduction to all technical terms and unfamiliar words that might cause difficulty in interpreting. As such, the students could focus on the use of the three strategies in their interpreting. The excerpts had a similar number of segments (i.e., 18, 20, 20 segments, respectively) and the exemplar interpretations had a similar frequency of the three strategies (i.e., 32, 28, 29, respectively). Importantly, the frequency of strategy use was identified by the author based on Table 1. To ensure the reliability of the judgment, the author invited another interpreting teacher to evaluate the excerpts and calculate the occurrence of strategies based on the rubric. Through their discussion, they revised the strategy occurrence in the last excerpt from 28 to 31, and both deemed the three excerpts comparable for this study.

To examine whether the students improved their interpreting performance over time, we used the number of error-free segments to measure information recovery. If the meaning of one segment was not distorted or omitted by the students, it was regarded as an error-free segment. This measure was meaning-based and excluded students’ grammatical mistakes to be in line with the objective of the research.
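As a rough illustration of this measure (not the author's actual scoring tool), the error-free segment count and the corresponding information recovery ratio could be computed from per-segment meaning judgements as follows; the data layout and names are assumptions made for this sketch:

    # Each entry records whether a segment's meaning was preserved, i.e.,
    # neither distorted nor omitted; grammar-only slips are ignored, in
    # line with the meaning-based measure described above.

    def error_free_count(judgements):
        """Number of error-free segments in one student's rendition."""
        return sum(judgements)

    def information_recovery_ratio(judgements):
        """Share of source segments recovered without meaning errors."""
        return error_free_count(judgements) / len(judgements)

    # Illustrative 20-segment excerpt with 11 error-free segments
    judgements = [True] * 11 + [False] * 9
    print(error_free_count(judgements))                      # 11
    print(round(information_recovery_ratio(judgements), 2))  # 0.55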

3.4 Data Collection and Analysis To address the first research question (RQ1), the author chose three self-assessment sessions (week 1, week 4, and week 6) to explore students' rubric use. During each session, the author asked the students to interpret one excerpt, collected each student's self-assessment sheet, and calculated the assessment points. A paired-samples t-test between Sessions 1 and 3 was performed to detect a possible statistically significant difference in rubric use. To address RQ2, at each of the three sessions the author counted the error-free segments produced by each student. There were a few cases where a student's rendition was complete but contained grammatical mistakes (e.g., a wrong tense) that could have altered the meaning of the rendition. For these cases, the author discussed the renditions with the students to confirm that they did not intentionally use the tense to convey a different meaning. Ultimately, these segments were still considered error-free. To address RQ3, by the end of Session 3 (week 6) each student submitted a self-reflective report detailing their perceptions of the rubric and how the rubric use influenced their interpreting performance over time. The author collected and analyzed the 32 reports using a coding scheme consistent with the rubric criteria and their descriptors (see Table 1). Specifically, any comments involving one strategy or its related descriptors were grouped under that criterion in Table 1. A comment could be coded and categorized into multiple criteria. For example, a comment comparing discourse expansion with lexical conversion was coded as such. This method of analysis captured almost all learners' comments, though a few were too vague to fit in any category (e.g., "still not satisfied with my own performance") and thus were excluded from the analysis.
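The chapter does not state which statistical package was used for these tests; as a minimal sketch, the paired comparison could be run with SciPy as below, where the two arrays are synthetic placeholders for the 32 students' assessment points at Sessions 1 and 3, not the study's data:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(7)
    # Placeholder scores for 32 students; in the study these would be the
    # per-student assessment-point totals from Session 1 and Session 3.
    session1 = np.round(rng.normal(11.7, 3.3, 32))
    session3 = np.round(rng.normal(17.7, 3.6, 32))

    t, p = stats.ttest_rel(session3, session1)
    print(f"t({session1.size - 1}) = {t:.2f}, p = {p:.3f}")

    # Analogous calls apply to the strong points, weak points, and the
    # error-free segment counts; where three tests are run on the same
    # data set, the chapter uses a Bonferroni-adjusted threshold of
    # 0.05 / 3 (about 0.017).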

4 Results and Discussion 4.1 Rubric Use To address RQ1, Table 3 provides the descriptive statistics for the students’ assessment points in three sessions. As is shown in Table 3, the student assessors tended to produce more assessment points, as their assessment experience grew over time. Session 3 generated more assessment points (M = 17.7, SD = 3.6) than Session 1

Table 3 The descriptive statistics concerning the assessment points

                     Maximum   Minimum   Mean   Standard deviation
Session 1 (week 1)   17        9         11.7   3.3
Session 2 (week 4)   23        11        16.2   3.7
Session 3 (week 6)   24        12        17.7   3.6

Table 4 The paired-samples t-test of the strong and weak points between Sessions 1 and 3

                Session 1                  Session 3
                Max   Min   M     SD       Max   Min   M      SD       t      df   p
Strong points   6     3     4.8   1.5      9     4     6.0    1.5      2.61   31   0.15
Weak points     11    6     8.1   1.9      15    9     11.6   2.1      3.50   31   0.01

Notes: Max = maximum, Min = minimum, M = mean, SD = standard deviation

(M = 11.7, SD = 3.3). The paired-samples t-test between Session 3 and Session 1 shows that the difference was statistically significant, t(31) = 3.70, p = 0.004, and represented a medium-sized effect, d = 0.46. This means that after six weeks' rubric use, the students were able to locate more instances of strategy use in their self-assessment. Such a change could be accounted for by two explanations. On the one hand, the students may have applied strategies more frequently in their interpreting after six weeks of training, and consequently were able to identify more strategies used in interpreting. On the other hand, the students may have applied strategies as frequently in Session 3 as in Session 1. However, after the six weeks' rubric use, they became more sensitive to strategy use in their assessment, quicker to identify more of their performance weaknesses when comparing their version with the exemplar, and thus generated more assessment points at Session 3. To provide an insight into the change in the students' assessment points, Table 4 lists the number of their self-identified weaknesses (i.e., weak points) and strengths (i.e., strong points) in Sessions 1 and 3. Table 4 further displays that both the strong and the weak points increased over time, with their mean values up from 4.8 to 6.0 and from 8.1 to 11.6, respectively. However, the paired-samples t-test shows that only the weak points gained a statistically significant increase, t(31) = 3.50, p = 0.01. Even when we used the Bonferroni correction to reduce the critical p-value threshold from 0.05 to approximately 0.017 (because we performed significance tests on the same dataset three times), the result was still significant. We can thus state that the primary reason why the students generated more assessment points over time was that they were able to identify more of the weaknesses in their performance. Although their ability to use strategies seemed to remain unchanged (as indicated by the relatively stable level of the strong points), they became more sensitive to the gap between their current performance and the exemplar interpretation, more flexible with strategy use to bridge the gap, and more aware of the instances where strategies were used in the exemplar rendition. This demonstrated the connection and the distinction between skill awareness and acquisition. As DeKeyser (2014) explained, skills may

be acquired through perceptive observation (awareness) and assessment of others engaged in skilled behaviour. Students’ keen awareness of skills may not immediately translate into their acquisition, but such awareness and assessment can potentially lead to skill acquisition. The strategy-based rubric in the current study seemed to have helped the students cultivate such an awareness, paving the way for a slow but steady mastery of interpreting competence.
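To retrace the Bonferroni logic mentioned above against the reported p-values (0.004 for total points, 0.15 for strong points, 0.01 for weak points), a brief check looks like this; the dictionary layout is ours, for illustration only:

    # Bonferroni correction: divide the nominal alpha by the number of
    # significance tests run on the same data set (three in this analysis).
    alpha = 0.05
    adjusted_alpha = alpha / 3  # approximately 0.017
    reported_p = {"total points": 0.004, "strong points": 0.15, "weak points": 0.01}
    for measure, p in reported_p.items():
        verdict = "significant" if p < adjusted_alpha else "not significant"
        print(f"{measure}: p = {p} -> {verdict} at the adjusted threshold")
    # Total and weak points remain significant; strong points do not.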

4.2 Interpreting Performance To address RQ2 and to tap the potential of the rubric in promoting interpreting learning, we also compared the students' performances across the three sessions using the error-free segment measure. Table 5 shows that, compared with Session 1, the students generated a larger number of error-free segments at Sessions 2 and 3, with the mean value up from 8.9 to 11.7 and 11.2, respectively. This means that, after using the rubric for four or six weeks, the students could reproduce/recover a higher portion of the source-language content, and improved both their interpreting competence (as indexed by the error-free segments) and their self-assessment competence (as indexed by the total assessment points). Additionally, the paired-samples t-test conducted on the number of error-free segments between Sessions 3 and 1 indicates that the difference was not statistically significant: t(31) = 2.04, p = 0.10. In other words, although the students' performance seemed to improve considerably, the gain was not significant from a statistical point of view. Indeed, the number of error-free segments slightly decreased from Session 2 to Session 3, as indexed by the mean value (i.e., 11.7 down to 11.2) and the maximum value (i.e., 18 down to 17). To some extent, the trend we observed here reflected the natural course of the development of interpreting competence. Similar to the acquisition of other skills, interpreting learning is an incremental, non-linear process. At some stages, the competence may wax and wane, and also plateau for a period of time, though in the long run it is characterized by an overall progressive trajectory. Specifically, we would argue that assessing strategies is not the same as acquiring them. An awareness of a strategy is declarative knowledge, whereas its mastery is procedural and thus requires longer, sustained practice (Setton and Dawrant 2016). The limited improvement of interpreting performance also echoes the above finding that the students' ability to use relevant strategies (indicated by the number of strong points) improved moderately over time.

Table 5 The error-free segment measure across the three sessions

Error-free segments   Maximum   Minimum   Mean   Standard deviation
Session 1 (week 1)    11        5         8.9    2.1
Session 2 (week 4)    18        6         11.7   2.7
Session 3 (week 6)    17        6         11.2   2.5

4.3 Rubric Perception To address RQ3, the self-reflective reports were analyzed. The analysis shows that more than half of the comments centered on the strategy of conversion (at both lexical and syntactic levels), and that many of them pertained to both levels interchangeably. Therefore, we address the student perceptions from the two aspects: conversion and discourse expansion.

4.3.1 The Criterion of Conversion Strategy

The most common response to this criterion was that it was easy to understand and apply in self-assessment. The students in this study were well acquainted with the key concepts in the conversion strategy, such as part of speech and subject-verb-object (SVO) structure, because in their high school years they had undergone numerous grammar drills and had become very sensitive to these terms. Comments like "I am aware of the components (of conversion strategy)" and "I can automatically identify these elements" indicated their prior knowledge and sound understanding of the criterion. Furthermore, the criterion of conversion seemed to raise their awareness of some long-neglected features of their native language (i.e., Mandarin Chinese), as is shown in their comments, for example, "[the criterion] help to see clearly the syntactic differences [between Chinese and English]", and "I now realize how my mother tongue influences my English". Additionally, the student assessors seemed to be very familiar with word- and sentence-level conversion. As the students often checked their renditions word by word or sentence by sentence, they were sensitive to local areas in the text that required the conversion strategy. Also, given that they were EFL learners at an intermediate level, they were more attuned to surface linguistic features (e.g., Alptekin and Erçetin 2011). As a result, they tended to identify quickly the instances of strategy use at the lexical and/or sentential levels. Comments like "it's much easier to assess conversion" and "conversion is very salient" all reflected their preference. Regarding the impact of this criterion on the students' interpreting learning, the most frequently mentioned point was their enhanced awareness and use of this strategy in interpreting. Comments such as "I'm now paying more attention to the strategy of conversion" and "the use of the conversion strategy facilitates my interpreting process" evidenced the ease of acquiring this strategy, and comments such as "by using conversion, I can render more source-language messages" also pointed to the potential connection between strategy use and interpreting quality. The students' positive response concerning the impact of strategy use may be attributed to their measurable improvement over the three training sessions. As their acquisition of the conversion strategy became increasingly automatic, they were no longer deterred by seemingly complex and long source-language sentences. Their delivery

speed accelerated, renditions became more informative and accurate, and information recovery as indexed by the error-free segments also improved, although in a statistically non-significant manner (p > 0.05).

4.3.2 The Criterion of Discourse Expansion Strategy

Compared to the criterion of conversion, the criterion of the discourse expansion strategy received mixed responses. The students did mention their (improved) knowledge about this criterion. Comments like "expansion is not difficult to comprehend" and "with the teacher's demonstration and the exemplar, I can understand (this criterion)" showed that the students acquired a basic understanding of what discourse expansion was and how it could be used. However, when it came to assessing discourse expansion, their opinions were divided. Some students said that by comparing the exemplar rendition with their own, they could easily identify where they had omitted discourse devices and could conduct a fair evaluation of their strategy use. By contrast, other students contended that their employment of relevant strategies (e.g., using discourse devices) was different from that manifested in the exemplar rendition. As they lacked the confidence to judge the appropriateness of their strategy use, they had difficulty in generating convincing assessment points. Comments like "expansion is not as easy as conversion to judge" epitomized the assessors' uncertainty about the expansion strategy. The students' mixed attitudes about the discourse strategy indicated a higher level of difficulty when performing discourse-level transfer than word-level transfer in interpreting. This finding is consistent with previous studies that report higher difficulty in assessing discourse features (e.g., Su 2019a). Discourse assessment requires a greater sensitivity to underlying logical connections, more time to compare and contrast adjacent sentences, and a larger memory capacity to store preceding arguments, to such an extent that even teacher assessors would cut back their comments on students' discourse features. Consequently, we were not surprised to find less encouraging comments from the students, for example, "progress not obvious" and "need more practice". Their attitude towards this strategy seemed to lend some credence to the result of their modest progress in interpreting performance.

4.4 The Role of the Strategy-Based Rubrics in Interpreting: Awareness, Assessment, Acquisition Based on the three data sources, it can be said that the strategy-based rubric plays three main roles. First, the strategy-based rubric enhances students' awareness of potential problems during the interpreting process and of the possible tactics to solve them (i.e., definition of strategies). Second, the rubric facilitates students' assessment

of interpreting performance, reveals to them gaps between a reference level (exemplar renditions) and their performance level, and suggests ways to bridge them (i.e., judgment of strategies). Third, the rubric promotes students' acquisition of interpreting competence, as they could track their performance across comparable tasks, modify their learning priorities, and focus on their weaknesses and their improvement (i.e., use of strategies). In addition, two features need to be highlighted regarding the rubric's triple roles. First, assessment is a central medium linking students' awareness of strategies to acquisition of strategies. Only by examining exemplars and their own renditions during assessment can students really develop strategy awareness. Likewise, only through assessment can students reflect on and consolidate their mastery of strategies. Second, the role of awareness-raising often precedes that of acquisition. In other words, before students are able to apply the strategy of discourse expansion skilfully, they need to develop sound awareness and an accurate understanding of this strategy. For example, if students' understanding of discourse coherence is ambiguous, its assessment would be problematic and its acquisition process would be slow. It should also be noted that the benefits of such a rubric are not uniform across the criteria. The present study finds that the students generated more complete assessment points concerning the micro-level strategies (i.e., conversion) than the macro-level strategy (i.e., expansion). Their attitude to the former strategies was more favourable as they could see more substantial progress. However, the usefulness of the latter strategy cannot be downplayed. Being learning-oriented, the strategy-based rubric is by nature prospective rather than retrospective. It aims to help students reflect on strengths and weaknesses with a view to improving their future learning rather than collecting evidence of their achievement of learning outcomes. Indeed, student comments like "need more exercises" imply that such a difficult-to-assess criterion can help to modify their learning priorities and contribute to a slow yet steady acquisition. When relating the interpreting task to other language tasks, the study identifies two potential areas to complement language teaching. First, a rubric in interpreting can stimulate students' reflections on their L1's influences on their L2. The potential of raising students' awareness can be harnessed to counter fossilization in language learning. For example, adult learners often suffer from L1 negative transfer in learning a foreign language, leading to so-called fossilization (e.g., Montrul 2010). A possible remedy is to direct their attention to their L1 traces through interpreting/translation tasks, and to focus on specific cases of L1 negative transfer so as to promote deliberate and efficient self-correction and improvement. Second, the use of interpreting strategies could accelerate L2 acquisition. Yamashita and Jiang (2010) found that when L1 collocations were incongruent (dissimilar) with L2, students found it difficult to acquire them even after a considerable amount of exposure to L2. Our study contends that exposure to L2 alone may not be enough in language learning. Students need to contrast L2 structures with L1 structures and intentionally use cross-language conversion strategies before they can develop a high degree of sensitivity to and acquisition of these incongruent features.
Our study shows that the word-level (lexical) and the syntactic conversion can clearly speed up

their acquisition process. This strategy-based rubric therefore has great potential for L2 acquisition, especially when there is considerable syntactic dissimilarity between L1 and L2.

5 Conclusion This study investigated the use of a strategy-based rubric in self-assessment of Chinese-English interpreting over three learning sessions. It was found that the students could generate an increasing number of assessment points across three comparable interpreting tasks. In particular, the number of self-assessed weak points gained a statistically significant increase, suggesting that they could locate and identify more of their weaknesses in strategy use. Their progress in interpreting performance, as indexed by the number of error-free segments, was relatively moderate, indicating that the acquisition of strategies was a slower process than strategy awareness. The students' self-reports also revealed that their gain in awareness and acquisition pertained mainly to the local-level strategies rather than to the global strategy of discourse expansion. One implication from this study is that the inclusion of strategy use (especially local-level strategies) in a self-assessment rubric could have an immediate and clear effect on strategy awareness and acquisition. Instructors could reformulate their course objectives, incorporating the development of strategy awareness and use, and design a strategy-based, stage-by-stage rubric for students' self-assessment. Such a rubric does not measure the end product of students' performance. Rather, it taps the degree of their awareness and acquisition of strategies. Students could easily see their gain in strategy knowledge and mastery, and teachers could also track the progress and modify the sequence of strategy teaching so as to reach the course objectives. The use of the rubric and the relevant intervention are thus not limited to interpreting or language programs, but can be replicated in wider educational contexts. Another implication is for MTI programs in China. As the growth of interpreting students outpaces that of teachers, self-assessment using rubrics can significantly ease teachers' workload and foster students' learning autonomy. Various assessment formats can be applied to cater to students at different proficiency levels. For lower achievers, teachers could invite them to jointly determine a strategy repertoire that they find easier to understand and apply. Such mutual development of rubrics between teachers and students has shown great instructional success in previous research (e.g., Skillings and Ferrell 2000; Becker 2016). This study is not without limitations. It only investigated three assessment sessions in one particular instructional context. Moreover, it did not include a control group in the longitudinal design. The lack of a control group makes it less convincing to ascribe the improvement to the self-assessment practice over the six weeks. However, given the limited research on self-assessment of interpreting, we hope that our exploration of the strategy-based interpreting rubric could serve as a starting point to advance inquiry into this under-explored yet promising area.

Acknowledgements This work was supported by Humanity and Social Science Fund of Ministry of Education of China (17YJC740074) and the Fundamental Research Funds for the Central Universities of China (20720181002).

References Andrade, Heidi. 2000. Using rubrics to promote thinking and learning. Educational Leadership 57 (5): 13–19. Bale, Richard. 2013. Undergraduate consecutive interpreting and lexical knowledge. The Interpreter and Translator Trainer 7 (1): 27–50. Bartlomiejczyk, Magdalena. 2006. Strategies of simultaneous interpreting and directionality. Interpreting 8 (2): 149–174. Bartłomiejczyk, Magdalena. 2007. Interpreting quality as perceived by trainee interpreters. The Interpreter and Translator Trainer 1 (2): 247–267. Becker, Anthony. 2016. Student-generated scoring rubrics: Examining their formative value for improving ESL students’ writing performance. Assessing Writing 29: 15–24. Brown, Douglas. 2004. Language assessment: Principles and classroom practices. Upper Saddle River, NJ: Allyn and Bacon. Brown, Anthony, Dan Dewey, and Troy Cox. 2014. Assessing the validity of can-do statements in retrospective (Then-Now) self-assessment. Foreign Language Annals 47 (2): 261–285. Camp, Heather, and Teresa Bolstad. 2011. Preparing for meaningful involvement in learning community work in the composition classroom. Teaching English in the Two-Year College 38 (3): 259–270. Chen, Jing. 2010. Sight translation. Shanghai: Shanghai Foreign language Education Press. Cheng, Winnie, and Martin Warren. 2005. Peer assessment of language proficiency. Language Testing 22 (1): 93–121. Clifford, Andrew. 2001. Discourse theory and performance-based assessment: Two tools for professional interpreting. Meta 46 (2): 365–378. Covill, Amy. 2012. College students’ use of a writing rubric: Effect on quality of writing selfEfficacy, and writing practices. Journal of Writing Assessment 5 (1): 1–20. Cucchiarini, Catia, Helmer Strik, and Lou Boves. 2002. Quantitative assessment of second language Learners’ fluency: Comparisons between read and spontaneous speech. The Journal of the Acoustical Society of America 111: 2862–2873. DeKeyser, Robert. 2014. Skill acquisition theory. In Theories in second language acquisition: An introduction, ed. Bill Vanpatten, 94–112. London and New York: Routledge. Gile, Daniel. 2009. Basic concepts and models for interpreter and translator training, rev ed. Philadelphia: John Benjamins. Han, Chao, and Sijia Chen. 2016. Strategy use in English-to-Chinese simultaneous interpreting. Forum 14 (2): 173–193. Holliway, David, and Deborah McCutchen. 2004. Audience perspective in young writers’ composing and revising: Reading as the reader. In Revision: Cognitive and instructional processes, ed. Linda Allal, Lucile Chanquoy, and Pierre Largy, 87–101. Boston, MA: Kluwer Academic Publishers. Horiba, Yukie. 2012. Word knowledge and its relation to text Comprehension: A comparative study of Chinese and Korea speaking L2 learners and L1 speakers of Japanese. The Modern Language Journal 96 (1): 108–121. Huang, Shuchen. 2016. Understanding learners’ self-assessment and self-feedback on their foreign language speaking performance. Assessment and Evaluation in Higher Education 41 (6): 803– 820.

Jones, Lorraine, Bill Allen, Peter Dunn, and Lesley Brooker. 2017. Demystifying the rubric: A five-step pedagogy to improve student understanding and utilisation of marking criteria. Higher Education Research & Development 36 (1): 129–142. Kissling, Elizabeth, and Mary O’Donnell. 2015. Increasing language awareness and self-efficacy of FL students using self-assessment and the ACTFL proficiency guidelines. Language Awareness 24 (4): 283–302. Lee, Sang-Bin. 2017. University students’ experience of “scale-referenced” peer assessment for a consecutive interpreting examination. Assessment and Evaluation in Higher Education 42 (7): 1015–1029. Li, Hui, and Nuria Lorenzo-Dus. 2014. Investigating how vocabulary is assessed in a narrative task through raters’ verbal protocols. System 46 (1): 1–13 Li, Jinrong, and Peggy Lindsey. 2015. Understanding variations between student and teacher application of rubrics. Assessing Writing 26: 67–79. Li, Xiangdong. 2013. Are interpreting strategies teachable? Correlating trainees’ strategy use with trainers’ training in the consecutive interpreting classroom. Interpreters’ Newsletter 18: 105–128. Li, Xiangdong. 2015. Putting interpreting strategies in their place: Justifications for teaching strategies in interpreter training. Babel 61 (2): 170–192. Liu, Minhua. 2013. Design and analysis of Taiwan’s interpretation certification examination. In Assessment issues in language translation and interpreting, ed. Dina Tsagari and Roelof van Deemter, 163–178. Frankfurt am Main: Peter Lang. McCutchen, Deborah. 2006. Cognitive factors in the development of children’s writing. In Handbook of writing research, ed. Charles MacArthur, Steve Graham, and Jill Fitzgerald, 115–134. New York, NY: Guilford Press. Meuleman, Chris, and Fred Van Besien. 2009. Coping with extreme speech conditions in simultaneous interpreting. Interpreting 11 (1): 20–34. Min, Huitzu. 2016. Effect of teacher modeling and feedback on EFL students’ peer review skills in peer review training. Journal of Second Language Writing 31: 43–57. Montrul, Silvina. 2010. Dominant language transfer in adult second language learners and heritage speakers. Second Language Research 26 (3): 293–327. Novakovich, Jeanette. 2016. Fostering critical thinking and reflection through blog-mediated peer feedback. Journal of Computer Assisted Learning 32 (1): 16–30. Patri, Mrudula. 2002. The influence of peer feedback on self-and peer-assessment of oral skills. Language Testing 19 (2): 109–131. Reddy, Malini, and Heidi Andrade. 2010. A review of rubric use in higher education. Assessment and Evaluation in Higher Education 35 (4): 435–448. Setton, Robin, and Andrew Dawrant. 2016. Conference interpreting: A trainer’s guide. Amsterdam: John Benjamin. Skillings, Mary Jo, and Robbin Ferrell. 2000. Student-generated rubrics: Bringing students into the assessment process. the Reading Teacher 53: 452–455. Stevens, Dannelle, and Antonai Levi. 2013. Introduction to rubrics: An assessment tool to save grading time, convey effective feedback and promote student learning. Virginia: Stylus. Su, Wei. 2019a. Interpreting quality as evaluated by peer students. The Interpreter and Translator Trainer 13 (2): 177–189. Su, Wei. 2019b. Exploring native English teachers’ and native Chinese teachers’ assessment of interpreting. Language and Education 33 (6): 577–594. Su, Wei. 2020. Exploring how rubric training influences students’ assessment and awareness of interpreting. Language Awareness 29 (2): 178–196. Su, Wei. 2021. 
Understanding rubric use in peer assessment of translation, Perspectives. https://doi. org/10.1080/0907676X.2020.1862260. (in press). Taras, Maddalena. 2010. Student self-assessment: Processes and consequences. Teaching in Higher Education 15 (2): 199–209. Tsui, Amy, and Marina Ng. 2000. Do secondary L2 writers benefit from peer comments? Journal of Second Language Writing 9: 147–170.

Vygotsky, Lev. 1980. Mind in society: The development of higher psychological processes. Cambridge, MA: Harvard University Press. Wang, Weiqiang. 2014. Students’ perceptions of rubric-referenced peer feedback on EFL writing: A longitudinal inquiry. Assessing Writing 19: 80–96. Wu, Michelle. 2001. The importance of being strategic. Studies of Translation and Interpretation 6: 79–92. Wu, Yinyin, and Posen Liao. 2018. Re-conceptualising interpreting strategies for teaching interpretation into a B language. The Interpreter and Translator Trainer 12 (2): 188–206. Yamashita, Junko, and Nan Jiang. 2010. L1 influence on the acquisition of L2 collocations: Japanese ESL users and EFL learners acquiring English collocations. TESOL Quarterly 44 (4): 647–668. Zannirato, Alessandro. 2008. Teaching interpreting and interpreting teaching: A conference interpreter’s overview of second language acquisition. In Translator and interpreter training: Issues, methods and debates, ed. Kearns John, 19–38. London and New York: Continuum. Zhong, Weihe. 2019. China’s translation education in the past four decades: Problems, challenges and prospects. Chinese Translators Journal 1: 68–75.

Wei Su is associate professor in the College of Foreign Languages and Cultures at Xiamen University, China. He earned his PhD degree in 2011 at Shanghai International Studies University in China. His research interests include feedback and assessment in interpreting education, and coteaching by native English speaking and non-native English speaking teachers. His recent publications have appeared in peer-reviewed journals such as The Interpreter and Translator Trainer, Language and Education, and Language Awareness.

Detecting and Measuring Rater Effects in Interpreting Assessment: A Methodological Comparison of Classical Test Theory, Generalizability Theory, and Many-Facet Rasch Measurement

Chao Han

Abstract  Human raters play an important role in interpreting assessment. Their evaluative judgment of the quality of interpreted renditions produces quantitative measures (e.g., scores, ratings, marks, ranks) that form the basis of relevant decision-making (e.g., program admission, professional certification). Previous research in the field of language testing finds that human raters may be inconsistent over time, excessively harsh or lenient, and biased against a particular group of test candidates, a certain type of task or a given assessment criterion. Such undesirable phenomena (i.e., rater inconsistency, rater severity/leniency, and rater bias) are collectively known as rater effects. Their presence could lead to unreliable, invalid, and unfair assessments. It is therefore important to investigate possible rater effects in interpreting assessment. Although a number of statistical indices can be computed to measure rater effects, there has been no systematic attempt to compare their applicability and utility. Against this background, the current study aims to compare three psychometric approaches, namely, classical test theory, generalizability theory, and many-facet Rasch measurement, to detecting and measuring rater effects. Our analysis is based on the data from a previous assessment of English-to-Chinese simultaneous interpreting in which a total of nine raters were involved. Through our comparison, we hope that interpreting researchers and testers could obtain an in-depth understanding of the statistical information generated by these approaches, and be able to make informed decisions in selecting an analytic approach commensurate with their local assessment needs.

Keywords  Rater effects · Interpreting assessment · Classical test theory · Generalizability theory · Many-facet Rasch measurement

C. Han (B): Research Institute of Interpreting Studies, College of Foreign Languages and Cultures, Xiamen University, Xiamen, China. e-mail: [email protected]
© Springer Nature Singapore Pte Ltd. 2021. J. Chen and C. Han (eds.), Testing and Assessment of Interpreting, New Frontiers in Translation Studies, https://doi.org/10.1007/978-981-15-8554-8_5


1 Introduction

In all types of performance-based assessment in which human raters play a crucial role in making evaluative judgments and assigning scores, there is a common concern of rater effects. Definitionally, rater effects encompass a wide range of undesirable phenomena, including rater disagreement/rater inconsistency, rater severity/leniency, and rater bias, which may contribute systematic yet unwanted variability to the measurement process. These effects may ultimately mislead the interpretation of assessment results. Therefore, it is of great importance to detect and measure rater effects in rater-mediated performance assessment. There are at least three potential benefits of doing so. One benefit is that testers and researchers could understand rater behavior better (e.g., rating inconsistency, bias) and evaluate the quality of assessment results better (e.g., to what extent test scores have been influenced by erratic rating behavior). Another potential benefit relating to the detection of rater effects is that relevant statistical information could help testers adjust and correct for biased test scores caused by certain raters. An additional benefit is that insights into rater behavior could inform the design and development of future rater training.

When it comes to testing and assessment of spoken-language interpreting, raters may disagree with one another in terms of both qualitative and quantitative decision-making. Specifically, in evaluating interpreting performance, raters not only need to pay attention to multiple quality dimensions (e.g., fidelity, fluency, language use) of target-language interpretation, but also must attend to source-language input to determine the informational and pragmatic equivalence of the target-language rendition. This multi-tasking process of evaluating cross-linguistic/cultural mediation is linguistically challenging and cognitively taxing (Gile 1995; Wu 2010; Han 2015, 2018c), which may give rise to the rater effects. Left unexamined, the rater effects could pose a potential risk to the reliability and validity of measurement results, and ultimately to the credibility of a given testing procedure/program. Such a risk could be further magnified in high-stakes testing scenarios such as certification testing and professional qualification examination.

To understand and investigate the rater effects in interpreting assessment, testers and researchers have trialed and utilized a number of psychometric approaches (e.g., Lee 2008; Wu 2010; Liu 2013; Lee 2015; Han 2015, 2016). Particularly, three analytical approaches merit special attention, namely, classical test theory, generalizability theory, and many-facet Rasch analysis. As can be seen in the literature review below, although classical test theory has long been used in analyzing the rater effects in interpreting assessment (e.g., Gile 1995), the application of generalizability theory and Rasch analysis seems to be a recent trend, initially proposed by researchers such as Clifford (2004) and Wu (2010), and piloted by Zhao and Dong (2013) and Han (2015, 2016, 2019). Despite the individual attempts to investigate the rater effects based on different approaches, there has been little analysis in the field of interpreting testing and assessment to explore and examine the commonalities and discrepancies of the three approaches. This chapter, therefore, sets out to compare the three approaches, based on the authentic assessment data from a


previous study (i.e., Han 2015). By doing so, we want to demonstrate how each approach can be applied to examine the rater effects in spoken-language interpreting assessment, and to conduct a methodological comparison of their strengths and weaknesses. Hopefully, our demonstration and analysis could provide useful information to interpreting testers and researchers to assist their selection and application of these analytical approaches in future assessment and research.

2 Literature Review

In this section, we will first provide an overview of rater-mediated assessment of spoken-language interpreting, focusing on raters, rating scales, and the rater effects. We then describe three major psychometric approaches to investigating the rater effects by explaining the conceptual and theoretical basis of each approach, and we also review previous research in the interpreting literature that documents their applications. The review thus paves the way for the data-based methodological comparison of the three approaches.

2.1 Rater-Mediated Assessment of Spoken-Language Interpreting

Over the past decade, testing and assessment of spoken-language interpreting has been gaining traction (Han and Slatyer 2016). In parallel to this trend, descriptor-based rating scales have also been increasingly used to evaluate interpreting performance in interpreter education (e.g., Wu 2010; Lee 2015; Setton and Dawrant 2016; Wang et al. 2020), professional certification (e.g., Liu 2013; Han and Slatyer 2016; National Accreditation Authority for Translators and Interpreters [NAATI] 2019), and interpreting research (e.g., Tiselius 2009; Han 2015; Han and Riazi 2017; Shang and Xie 2020). For instance, Wang et al. (2020) reported their design and development of the Interpreting Competence Scale, based on a corpus of relevant performance descriptors, which can be used to guide testing and assessment in interpreter training and education. Additionally, in terms of professional certification, NAATI (2019) released their descriptor-based analytic rating scales to be used by raters to evaluate performance for different modes of interpreting. Lastly, when it comes to interpreting research, a number of researchers have applied rating scales to produce metrics that are in turn used as (in)dependent variables in the investigation of translational phenomena (e.g., Hale and Ozolins 2014; Han and Riazi 2017). In the above assessment practices, one of the most important factors is human raters. They play a critical role in analyzing features of interpreted renditions, and determining and assigning scores on multiple quality dimensions of interpreting performance. In other words, the assessment is rater-mediated. On the one hand,


the rater-mediated nature of interpreting assessment gives raters enormous power in producing assessment results; on the other hand, it also makes the assessment practice susceptible to potential erratic and idiosyncratic rating behaviors. Therefore, it is important to analyze and detect problematic rating behavior.

2.2 Psychometric Approaches to Examining the Rater Effects

There have been three major approaches used by interpreting testers and researchers to investigate the rater effects, including classical test theory, generalizability theory, and many-facet Rasch measurement. In this section, we intend to offer a brief review of the conceptual and theoretical basis of each approach and highlight relevant studies that apply these approaches to analyze the rater effects.

2.2.1 Classical Test Theory

The central tenet in classical test theory (CTT) is that an observed score (denoted by X) can be decomposed into a true score (T) plus a random error component (E) (e.g., Lord et al. 1968; Crocker and Algina 1986), which is expressed mathematically in Eq. 1:

X = T + E    (1)

In other words, for a test candidate, the raw score comprises a true score that reflects the test construct of interest (most often a latent variable) and an error caused by sources of variability. Specifically, a true score for the test candidate is assumed to be the average observed score over an infinite number of independent test administrations (Traub and Rowley 1991; Shultz and Whitney 2005). As for the error term in the CTT, it is an undifferentiated random error, uncorrelated with errors of other observed scores and with the true score (e.g., Kline 2005; DeVellis 2006). Therefore, the less random error there is, the more the observed score reflects the true score. The theory of true and error scores developed over multiple samplings of the same test candidates is applicable to a single administration of a test to a group of test takers (Kline 2005). However, in the latter scenario, of interest is the variance of the observed scores, of the true scores, and of the random error across a group of test takers. According to the CTT, the total variance of observed scores (denoted by σ²_X) can be partitioned into variance of true scores (σ²_T) plus variance of random error (σ²_E), which is expressed mathematically in Eq. 2:

σ²_X = σ²_T + σ²_E    (2)

As a theoretical quantity, reliability can be defined as the amount of variation in the observed scores that is due to the true score variance. Technically speaking,


reliability (denoted by ρ) is the ratio of the true score variance to the observed score variance, which is expressed in Eq. 3:

ρ = σ²_T / σ²_X = σ²_T / (σ²_T + σ²_E)    (3)

Conceptually, to the extent that the true score variance accounts for a large proportion of the observed score variance, reliability is high, and vice versa. The only problem is that, as a "metaphysical platonic" concept, a true score (T) and its variance (σ²_T) are unobservable directly and are never known for sure. Because of these unknown properties, an estimate for reliability defined in this signal-to-noise manner cannot be calculated. A more practical way to estimate reliability is through correlation between two repeated measurements (Bachman 1990; Traub and Rowley 1991), or what Brennan (2001a) calls "replication" of a measurement procedure. More importantly, before calculating reliability estimates via correlation, it is necessary to consider what type of measurement error needs to be examined (Bachman 1990; Shultz and Whitney 2005). That is, relevant analysis should be conducted to identify and distinguish potential sources of variability that contribute to error in a given measurement procedure, as different sources of error are linked to different methods of reliability estimation. Typically, there are four sources of measurement error under the CTT framework, and the CTT fares well in modeling one source of random error at a time (Bachman 1990; Shultz and Whitney 2005). Table 1 presents four methods of estimating reliability, with each addressing a different source of measurement error.

Table 1 The CTT approach to reliability estimation
Source of error | Type of reliability | Reliability estimate | Estimation methods
Changes of test takers or testing conditions | Stability | Test-retest | Pearson's r12
Differences of alternate test forms | Equivalence | Parallel-form | Pearson's rxx'
Heterogeneity of items/tasks sampled in a test | Internal consistency | Split-half; Cronbach's alpha | Spearman-Brown's; Cronbach's α, etc.
Rater effect | Rater consistency | Intra- and inter-rater | Pearson's r, Spearman's ρ, etc.

As can be seen in the table, all reliability estimation hinges on the notion of correlation or replication. That is, reliability indices are operationally calculated by correlating two sets of test scores derived from replications of a measurement procedure in one way (e.g., parallel tests) or another (e.g., split-halves). In addition, caused by different sources, each type of measurement error is associated with a particular type of reliability and, accordingly, a certain reliability estimate. For example, stability concerns the extent to which occasions influence observed scores. Errors associated with this type of reliability address how changes in test takers or testing conditions impact measurement


results. Usually, Pearson’s product-moment correlation is used to estimate reliability between two sets of scores derived from two administrations of the same test over a proper interval. It is worth reiterating that the CTT only investigates one source of random measurement error at a time. That is, the CTT cannot examine different sources of error simultaneously, and it is only interested in random error. As a result, when the CTT is used, there can only be one source of random error under scrutiny. In interpreting studies, researchers commonly use the CTT approach to examine rater reliability in rater-mediated, scale-based assessment of interpreting performance (Lee 2008; Tiselius 2009; Lee 2015; Han 2018a). For instance, Lee (2008) calculated Cronbach’s alpha for a group of raters in the assessment of English-to-Korean consecutive interpreting. In addition, Han (2018a) reported Pearson’s correlation coefficients as a measure of rater reliability in student self- and peer assessment of English–Chinese consecutive interpreting.
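To make these two CTT estimates concrete, the following minimal sketch computes pairwise Pearson correlations between raters and Cronbach's alpha across raters in Python with NumPy. The 6 × 3 score matrix is invented for illustration only; it is not the chapter's data, which were analyzed in SPSS.

    import numpy as np

    # Hypothetical scores: rows = 6 candidates, columns = 3 raters (1-8 scale).
    scores = np.array([
        [7, 6, 7],
        [5, 5, 6],
        [3, 4, 3],
        [8, 7, 8],
        [4, 4, 5],
        [6, 5, 6],
    ], dtype=float)

    n_candidates, n_raters = scores.shape

    # Inter-rater consistency: Pearson's r for every rater pair.
    for a in range(n_raters):
        for b in range(a + 1, n_raters):
            r = np.corrcoef(scores[:, a], scores[:, b])[0, 1]
            print(f"Rater {a + 1} vs Rater {b + 1}: r = {r:.2f}")

    # Cronbach's alpha, treating the raters as "items" of one rater panel.
    item_vars = scores.var(axis=0, ddof=1).sum()
    total_var = scores.sum(axis=1).var(ddof=1)
    alpha = (n_raters / (n_raters - 1)) * (1 - item_vars / total_var)
    print(f"Cronbach's alpha across raters = {alpha:.2f}")

Because each index correlates only one pair of score sets at a time, the sketch also makes visible the limitation discussed above: only one source of random error is modeled per estimate.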

2.2.2 Generalizability Theory

Pioneered by Cronbach et al. (1972), generalizability theory (i.e., G-theory) is a statistical theory for evaluating the reliability or the dependability of behavioral measurements. It has been developed and enriched by researchers, notably, Brennan (2001b), to design, assess, and improve the reliability of measurement results. Overall, the G-theory and its developments can be regarded as a useful extension of the CTT in a number of important ways (Shavelson and Webb 1991; Traub and Rowley 1991). First, while the CTT considers error to be a single undifferentiated entity which is random in its influence, the G-theory recognizes that measurement error is multifaceted and can be systematic. The G-theory is able to pinpoint multiple sources of error in measurements, disentangles them, and simultaneously estimates each source of error and their interactions. Second, whereas the CTT only allows a simple measurement design to assess reliability, the G-theory permits a much more complicated design to estimate multiple measurement errors in a phase known as generalizability (G) study. Third, whereas the CTT does not provide sufficient information to inform better measurement design for higher reliability, the G-theory explicitly guides the development and optimization of measurement procedures in a process called decision (D) study. Fourth, while the CTT produces reliability coefficients for relative (rank-ordering) decision-making, the G-theory generates two reliabilitylike coefficients for norm-referenced (i.e., relative) and criterion-referenced (i.e., absolute) score interpretations, respectively. Last, the CTT only works with one dependent measurement variable (e.g., only a score for a measurement object); the G-theory can be applied in multivariate cases (e.g., a profile of scores). Because of these strengths, the G-theory has become a useful technique in psychological, educational, and language testing and assessment. Ideally, the most accurate, thus least error-prone, measurement for a student/examinee/test taker is based on an expected score across all possible tasks administered in all possible occasions, evaluated by all possible raters. This conceptually indefinite pool encompassing all possible tasks, occasions, and raters defines


a universe of admissible observations (UAO). The expected score is known as a universe score, equivalent to a true score in the CTT; and the test taker is recognized as the object of measurement. But in real-life measurement, it is impractical, if not impossible, to implement multiple repeated measurements. Typically, an observed score for the test taker is based on a limited number of tasks, occasions, and raters sampled from the UAO. Due to sampling variability concerning such facets as tasks, occasions, and raters, sources of systematic error are introduced to a given measurement procedure. Conceptually, reliability (or dependability) in the G-theory refers to the accuracy of generalizing from a person's observed score on a particular test to the average score the person would have received over all possible test conditions that the test user would be equally willing to accept (Shavelson and Webb 1991). In addition, under the framework of the G-theory, the undifferentiated random error in the CTT can thus be disentangled into three major components, including (a) identified sources of systematic error (e.g., tasks, occasions, raters), (b) unidentified systematic error (e.g., other facets not specified), and (c) random error (e.g., ephemeral and unsystematic error). Similar to the CTT, the total score variance among a group of test takers can be decomposed into several variance components (VC), including a universe-score variance, differentiated error variances, and unidentified random error variance. For a one-facet measurement design featuring a rater (R) facet, the mathematical model for partitioning total variance of test scores can be expressed in Eq. 4:

σ²_Total = σ²_P + σ²_R + σ²_PR,E    (4)

As can be seen in the equation, total score variance is partitioned into variance attributed to systematic differences among persons (P) (e.g., test takers), variation due to systematic variability among raters (R) plus a residual term inclusive of variance ascribed to person-by-rater interaction and variance attributable to unidentified systematic and/or random error. Mathematically, reliability or dependability can also be defined as a proportion of total score variance that is attributed to universe-score variance. That is, holding all else constant, the lower the unwanted error variance is, the more precise and reliable a measurement is. In the G-study phase, provided that the UAO is well-defined and measurement design is proper, each variance component can be estimated separately using analysis of variance (ANOVA) methods or maximum likelihood estimation (MLE) (Webb and Shavelson 2005). In addition, relative magnitudes of estimated variance components can be calculated, which provides information about the potential sources of error influencing a measurement. In the D-study phase, decision-makers need to define a universe that they wish to generalize to, known as the universe of generalization (UG) which is characterized by some or all facets in the universe of admissible observations (UAO). For instance, whereas a UAO could be defined to have three facets including test items, occasions, and raters, a UG could be the same size of the UAO by incorporating all possible items, occasions, and raters, or narrower than that of the UAO by incorporating only two or one of the three facets (e.g., all possible items and


occasions). In addition, variance component estimates produced by the G-study are used in the D-study to optimize measurement design in order to achieve a desirable level of reliability for measurements, taking into consideration theoretical and logistical constraints. Furthermore, three reliability-like coefficients can be calculated to estimate reliability for two different types of decision-making. An index of generalizability (ρ 2 ), an analog to the CTT reliability indices, is calculated to allow norm-referenced score interpretations or relative decision-making that focuses on the rank-ordering of persons. In addition, an index of dependability (Φ) is computed for absolute decision-making that emphasizes absolute level of performance, regardless of rank-ordering. Moreover, as an extension of Φ, Brennan (2001b) introduced a “coefficient of criterion-referenced measurement”, known as Phi or Φ(λ), to address cut-off score applications. That is, the proposed index indicates how reliably an instrument can locate individual scores with respect to a threshold score set at point λ on the measurement scale. Apart from reliability-like coefficients, standard error of measurement (SEM), which is deemed as a piece of critical information on measurement quality (e.g., Brennan 2001a; Cardinet et al. 2010), can also be computed to evaluate measurement precision. SEM informs decision-makers of the size of error affecting measurement results. Overall, the G-theory has been applied to investigate rater reliability/variability in different assessment contexts, including van Weeren and Theunissen (1987) in testing foreign language pronunciation, and Sudweeks et al. (2005) in writing assessment, and Bachman et al. (1995) and Lynch and McNamara (1998) in performance testing of foreign language speaking. In terms of interpreting assessment, Clifford (2004) proposed the application of the G-theory to investigate potential sources of measurement errors in interpreter certification examination, which was later echoed by Wu (2010). Han (2016, 2019) first used the G-theory to examine rater-related variances in the test scores and explored the dependability of the assessment results produced in the context of professional interpreter certification and of educational summative assessment.
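A minimal sketch of this G-study/D-study logic for the one-facet person-by-rater design of Eq. 4 is given below. The score matrix is invented, and the expected-mean-square equations are the textbook ones for a fully crossed random design; the chapter's own analyses were run in EduG, not with this code.

    import numpy as np

    # Hypothetical fully crossed design: 6 persons (rows) x 3 raters (columns).
    X = np.array([
        [7, 6, 7],
        [5, 5, 6],
        [3, 4, 3],
        [8, 7, 8],
        [4, 4, 5],
        [6, 5, 6],
    ], dtype=float)
    n_p, n_r = X.shape

    grand = X.mean()
    p_means = X.mean(axis=1)      # person means
    r_means = X.mean(axis=0)      # rater means

    # Mean squares for persons, raters, and the residual (p x r interaction plus error).
    ms_p = n_r * ((p_means - grand) ** 2).sum() / (n_p - 1)
    ms_r = n_p * ((r_means - grand) ** 2).sum() / (n_r - 1)
    resid = X - p_means[:, None] - r_means[None, :] + grand
    ms_pr = (resid ** 2).sum() / ((n_p - 1) * (n_r - 1))

    # G-study: variance components from the expected mean squares.
    var_pr_e = ms_pr
    var_p = max((ms_p - ms_pr) / n_r, 0.0)
    var_r = max((ms_r - ms_pr) / n_p, 0.0)

    # D-study: relative (rho^2) and absolute (Phi) coefficients for a design with n_raters raters.
    n_raters = 3
    rel_error = var_pr_e / n_raters
    abs_error = (var_r + var_pr_e) / n_raters
    rho2 = var_p / (var_p + rel_error)
    phi = var_p / (var_p + abs_error)
    print(f"sigma2_p = {var_p:.3f}, sigma2_r = {var_r:.3f}, sigma2_pr,e = {var_pr_e:.3f}")
    print(f"rho^2 = {rho2:.2f}, Phi = {phi:.2f}")

Changing n_raters in the D-study step shows how adding or removing raters would alter the two coefficients, which is exactly the kind of design optimization described above.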

2.2.3 Many-Facet Rasch Measurement

Many-facet Rasch measurement (MFRM) could be viewed as an extension of the one-parameter item response theory (IRT) model, in that MFRM takes into account multiple assessment facets that potentially affect the expected performance of students/examinees/test takers (for details, see Linacre 1989; McNamara 1996; Bond and Fox 2015; Eckes 2015). To investigate rater effects, the MFRM analysis can produce calibrated estimates for each rater in a common equal-interval metric (i.e., logit). Several statistics can be computed to indicate potential rater variability, including the homogeneity statistic, the separation index, and separation reliability (McNamara 1996; Eckes 2015). In addition, both infit and outfit statistics can be used to indicate the overall self-consistency of each rater. Moreover, bias analysis is capable of examining interactions involving raters and other assessment facets (e.g., tasks,


assessment criteria, students/examinees/test takers), pinpointing significant interactions case by case using standardized Z scores. In the field of second/foreign language testing, numerous empirical studies have been conducted to gain an in-depth understanding of the rater effects in language performance testing (e.g., Wigglesworth 1993; Lumley and McNamara 1995; Weigle 1998; Kondo-Brown 2002; Bonk and Ockey 2003; Eckes 2005, 2008; Schaefer 2008). In the field of interpreting assessment, Zhao and Dong (2013), and Han (2015, 2017) applied MFRM to understand how raters used rating scales to assess English-Chinese interpreting. Han (2018b) and Han and Zhao (2020) also used a Rasch-based rater accuracy model to investigate the accuracy of student peer ratings on three quality dimensions (i.e., fidelity, fluency, and target language quality) of English-Chinese consecutive interpreting.
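As an illustration of how rater self-consistency is quantified in this framework, the sketch below computes infit and outfit mean squares for one rater from observed ratings, model-expected ratings, and model variances. All numbers are invented; in practice the expected values and variances come from a fitted facets model (e.g., FACETS output), and the formulas are the standard Rasch fit definitions rather than any program-specific variant.

    import numpy as np

    # Hypothetical data for one rater: observed ratings, model-expected ratings, and
    # model variances per observation (the latter two would come from a fitted Rasch model).
    observed = np.array([6, 5, 7, 4, 6, 3, 5, 7], dtype=float)
    expected = np.array([5.6, 5.1, 6.4, 4.5, 5.9, 3.8, 5.2, 6.6])
    variance = np.array([0.9, 1.1, 0.8, 1.2, 0.9, 1.3, 1.0, 0.8])

    residual = observed - expected
    z2 = residual ** 2 / variance                        # squared standardized residuals

    outfit_ms = z2.mean()                                # unweighted (outlier-sensitive) fit
    infit_ms = (residual ** 2).sum() / variance.sum()    # information-weighted fit
    print(f"infit MS = {infit_ms:.2f}, outfit MS = {outfit_ms:.2f}")
    # Values near 1.0 mean the ratings vary about as much as the model predicts;
    # values well above roughly 1.3 suggest erratic ratings, well below roughly 0.7 overly predictable ones.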

3 Research Purposes

As described in the literature review above, despite the growing use of the three approaches in interpreting assessment, there have been no attempts to compare these psychometric approaches in a single study, based on a common set of assessment data, so as to understand what type of information these approaches could provide to detect and measure possible rater effects. Against this background, we intend to apply the CTT, the G-theory and the MFRM to analyze the authentic data from a previous assessment of English-to-Chinese simultaneous interpreting. By doing so, we aim to achieve two goals. First, it is hoped that the demonstration of relevant analyses, based on the assessment data, would help interpreting testers and researchers gain insights into the type and the amount of statistical information (concerning rater effects) generated by each approach. Second, it is expected that the methodological comparison would provide useful information about the relative strengths and weaknesses of each approach, so that testers and researchers could be better informed about the appropriateness of the psychometric approaches in their local assessment contexts.

4 Method

4.1 Data Source

The raw interpreting data were derived from a previous experiment (i.e., Han 2015), in which 32 practicing interpreters performed simultaneous interpreting (SI) for four 10-minute English speeches, based on a Latin square design (with the aim of avoiding potential order effects). The four source-language speeches were four SI tasks, denoted as TaskA, TaskB, TaskC and TaskD. All SI performances were audio-recorded and stored as mp3 files for later evaluation.
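A Latin square of this kind can be generated mechanically. The sketch below builds a cyclic 4 × 4 square over the four SI tasks, so that each task appears in each serial position exactly once across four presentation orders; it illustrates the counterbalancing idea only and is not the specific rotation used in Han (2015).

    tasks = ["TaskA", "TaskB", "TaskC", "TaskD"]

    # A cyclic 4 x 4 Latin square: row g gives the task order for interpreter group g.
    latin_square = [[tasks[(g + p) % 4] for p in range(4)] for g in range(4)]

    for g, order in enumerate(latin_square, start=1):
        print(f"Group {g}: {' -> '.join(order)}")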


4.2 Assessment Criteria and Rating Scale

Three assessment criteria were used to evaluate the SI performance: (1) information completeness (InfoCom: i.e., to what extent source-language information is successfully translated), (2) fluency of delivery (FluDel: i.e., to what extent disfluencies, such as (un)filled pauses, long silence, fillers, and/or excessive repairs are present in target-language output) and (3) target language quality (TLQual: i.e., to what extent target-language expressions are natural to a native Chinese speaker). As such, an analytic rating scale was created with three sub-scales focusing on the three quality dimensions. As can be seen in Appendix 1, each sub-scale can be divided into four two-point bands with relevant descriptors provided for each band. When using the rating scale, a rater could give a maximum score of eight, and a minimum score of one.

4.3 Raters and Rater Training

A total of nine raters assessed the SI performance, using the eight-point analytic rating scale. The raters were postgraduate trainee interpreters specializing in English–Chinese conference interpreting, and they all had experience of rating audio-recorded interpreting performance for a regional certification test in China. Before the official scoring session, the raters attended a five-hour training program to help them understand the scoring procedure. Specifically, the raters participated in a familiarization session in which they familiarized themselves with the assessment criteria, the rating scale, and the rating sheet. The raters also rated the anchored interpreting performances, compared their scores with one another, and discussed why they provided a particular score. Moreover, the raters participated in a pilot scoring session in which a bundle of four SI recordings was assessed before a short break.

4.4 Operational Scoring

One day after the training, all raters gathered together in a quiet room and evaluated all recorded performances independently. A fully-crossed measurement design was implemented, in which each rater evaluated all interpreting performances in the four SI tasks using the three sub-scales. This design produced a total of 3456 data points (i.e., 32 interpreters × 4 tasks × 9 raters × 3 criteria), which is considered an optimal arrangement from the measurement perspective (because such a design could generate the maximal amount of data for analysis).
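As a quick check on the size of this fully crossed design, the short sketch below (with hypothetical interpreter and rater labels) enumerates the interpreter × task × rater × criterion combinations and confirms the figure of 3,456 data points.

    from itertools import product

    interpreters = [f"I{i:02d}" for i in range(1, 33)]   # 32 interpreters
    tasks = ["TaskA", "TaskB", "TaskC", "TaskD"]         # 4 SI tasks
    raters = [f"R{r:02d}" for r in range(1, 10)]         # 9 raters
    criteria = ["InfoCom", "FluDel", "TLQual"]           # 3 sub-scales

    design = list(product(interpreters, tasks, raters, criteria))
    print(len(design))   # 32 * 4 * 9 * 3 = 3456 scores to be collected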


4.5 Data Analysis

As mentioned previously, we used the three analytic approaches to analyze the assessment data: (a) the CTT, (b) the G-theory, and (c) the MFRM. First, based on the CTT approach, we computed consistency estimates of inter-rater reliability using Pearson's r for all possible rater pairs in each task-by-criterion condition, using SPSS 18.0. Second, we used the G-theory to analyze the assessment data. A two-facet, crossed random design was operationalized. Table 2 describes the decomposition of the total score variance (for the scores of InfoCom) into different variance components.

Table 2 The decomposition of the total variance into variance components
Source of variability | Description of variance component | Notation
1. Interpreters (i) | Universe-score variance (object of measurement) | σ²_i
2. SI tasks (t) | Main effect for all interpreters due to their performance inconsistency from one task to another | σ²_t
3. Raters (r) | Main effect for all interpreters due to rater severity | σ²_r
4. Interpreter × Task (i × t) | Interaction effect: inconsistency from one task to another in a particular interpreter's performance | σ²_it
5. Rater × Interpreter (r × i) | Interaction effect: inconsistent rater severity towards a particular interpreter | σ²_ri
6. Rater × Task (r × t) | Interaction effect: inconsistent rater severity from one task to another for all interpreters | σ²_rt
7. Interpreter × Task × Rater (i × t × r), e | Residual: unique three-way i × t × r interaction, unidentified and unmeasured facets that potentially affect the measurement, and/or random errors | σ²_itr,e

For the dependent variable of InfoCom, this two-facet, crossed, and random design had seven sources of variability, which represented the total variation of the scores. The first source of variability, attributable to the object of measurement (i.e., the interpreters), arose from the systematic individual differences among the interpreters in terms of their performance, which is also known as universe-score variability. The remaining six sources of variability were associated with the two measurement facets, and they introduced inaccuracies or errors into sample-to-universe generalization. For example, overall inter-rater inconsistency would increase uncertainty when generalizing from the scores given by the particular group of raters to scores provided by a universe of admissible raters. That is, generalization from sample to universe would be hazardous due to rater-engendered errors. In addition, interaction


among the interpreters, the raters, and the tasks could pose potential threats to generalization. For instance, inconsistency in rater severity may arise for a particular interpreter. That is, one rater may provide consistently lower-than-warranted scores to a particular interpreter, while another rater may rate that interpreter consistently leniently. Such interaction involving only some raters and some interpreters introduces additional error to the generalization process. Furthermore, in the residual, given only one observation (i.e., one score on InfoCom) per cell of the interpreter-by-task-by-rater matrix in Table 2, the three-way interaction (i.e., i × t × r) was confounded with unidentified and/or random errors. We used EduG (Cardinet et al. 2010) to estimate variance components and compute relevant generalizability coefficients.

Third, we ran a series of MFRM analyses on the four-facet, fully-crossed design (i.e., interpreters, tasks, raters, and criteria). Specifically, a partial credit model (PCM) (Masters 1982) was selected for analysis, because each sub-scale was assumed to have a distinctive structure. The PCM consisted of the following four assessment facets: interpreters (i), raters (r), tasks (t) and criteria (c). The mathematical model could be expressed as follows:

log(P_irtcx / P_irtcx−1) = θ_i − α_r − δ_t − β_c − τ_xc

where:
P_irtcx = probability of interpreter i being awarded a rating x on task t on criterion c by rater r,
P_irtcx−1 = probability of interpreter i being awarded a rating x−1 on task t on criterion c by rater r,
θ_i = ability of interpreter i,
α_r = severity of rater r,
δ_t = difficulty of task t,
β_c = difficulty of criterion c,
τ_xc = threshold difficulty of being rated in category x relative to x−1 on criterion c.

All analyses were implemented via FACETS 3.71.0 (Linacre 2013) to estimate raters' internal self-consistency and severity/leniency measures, and to detect any possible biased interactions with other assessment facets. By convention, the mean logit measures of the rater, the task and the criterion facets were arbitrarily centered at zero, while the facet of interpreters was made the non-centered facet, with its mean logit measure varying according to the samples analyzed.
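To show how the facet parameters in this model combine into rating probabilities, the following sketch computes the category probabilities implied by the PCM above for a single interpreter-rater-task-criterion combination. All parameter values are invented for illustration; the operational estimates were obtained with FACETS, not with this code.

    import numpy as np

    def pcm_probabilities(theta, alpha, delta, beta, tau):
        """Category probabilities for ratings 1..m under the facets PCM above.

        theta: interpreter ability; alpha: rater severity; delta: task difficulty;
        beta: criterion difficulty; tau: thresholds tau_2..tau_m (in logits).
        """
        eta = theta - alpha - delta - beta
        # Cumulative sums of (eta - tau_x) give the log-numerator of each category.
        steps = [0.0] + [eta - t for t in tau]
        log_num = np.cumsum(steps)
        prob = np.exp(log_num - log_num.max())   # subtract max for numerical stability
        return prob / prob.sum()

    # Hypothetical parameters on an 8-point sub-scale (7 thresholds).
    tau = [-2.1, -1.3, -0.6, 0.1, 0.8, 1.5, 2.3]
    p = pcm_probabilities(theta=1.2, alpha=0.39, delta=-0.1, beta=0.0, tau=tau)
    for category, prob in enumerate(p, start=1):
        print(f"P(rating = {category}) = {prob:.3f}")
    print("Expected rating:", round(sum((k + 1) * pk for k, pk in enumerate(p)), 2))

Raising the rater severity alpha while holding everything else constant shifts probability mass toward lower categories, which is the mechanism by which the model separates rater severity from interpreter ability.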


5 Results

5.1 Results from the CTT Analysis

Pearson's r coefficients were calculated for each rater pair (i.e., with nine raters, there were 36 rater pairs in total) in each of 12 task-by-criterion conditions (e.g., the TaskA-by-InfoCom condition). Table 3 shows maximum, minimum, and mean Pearson's correlation coefficients, and standard deviations (SD), for each condition. As can be seen in the table, none of the mean Pearson's r coefficients were above the acceptable level of 0.70. This suggests that, averaged across all the rater pairs, the raters seemed to rank-order the interpreters in a moderately consistent manner for each condition. Furthermore, mean Pearson's r coefficients were also computed for each rater pair across the three assessment criteria and the four SI tasks. Figure 1 shows the distribution of mean Pearson's correlation coefficients for each rater pair. As can be seen in the figure, the Pearson's correlation coefficient averaged across all the raters, the assessment criteria and the SI tasks was 0.55, an indicator of moderate strength of correlation. In addition, it is worth mentioning that the maximum value of mean Pearson's r was 0.81, concerning Rater 4 (R04) and Rater 5 (R05), which seemed to be an outlier. This is because the value of 0.81 did not fall within the interval spanning the mean plus/minus three SDs, as can be seen in Fig. 1. A further inspection of all correlation coefficients associated with R04 and R05 revealed that they consistently achieved higher correlation across the criteria and the tasks, with the minimum Pearson's r coefficient (i.e., 0.71 in the TaskA-by-TLQual condition) being greater than the acceptable level of 0.70. A follow-up interview with the raters found that the seating arrangement, with R04 and R05 being in close proximity, may have contributed to possible collusion between R04 and R05 from time to time. During the operational scoring, R04 and R05 shared a table, while all other raters were separated spatially.

Table 3 Descriptive statistics on Pearson's correlation coefficients
Assessment criteria

TaskA

InfoCom

TaskB

TaskC

TaskD

[0.92, 0.56]; 0.69, [0.85, 0.55]; (0.08) 0.69, (0.08)

[0.93, 0.42]; 0.59, (0.13)

[0.90, 0.27]; 0.64, (0.11)

FluDel

[0.80, 0.29]; 0.52, [0.84, 0.29]; (0.16) 0.56, (0.14)

[0.78, 0.25]; 0.48, (0.12)

[0.73, 0.31]; 0.49, (0.12)

TLQual

[0.75; 0.21]; 0.54, (0.14)

[0.72, 0.14]; 0.42, (0.16)

[0.77, 0.03]; 0.36, (0.15)

[0.79, 0.31]; 0.48, (0.11)

Notes Data in each cell of the table were structured as [maximum, minimum]; mean Pearson’s r, (standard deviation). InfoCom = Information completeness, FluDel = Fluency of delivery, TLQual = target language quality


[Fig. 1 Distribution of mean Pearson's r coefficients for each rater pair. Y-axis: Pearson correlation coefficient (r), 0.3–0.9; x-axis: rater pair (n = 1–36). Annotations: mean r = 0.55 for all rater pairs, SD = 0.07; mean + 3SD = 0.76; mean − 3SD = 0.34; maximum: R04 & R05, r = 0.81 (averaged across all criteria and all tasks); minimum: R07 & R09, r = 0.44.]
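The screening rule applied here (flagging any rater-pair mean that falls outside the mean ± 3SD interval) is easy to automate. In the sketch below the 36 rater-pair means are simulated stand-ins, not the values actually plotted in Fig. 1.

    import numpy as np
    from itertools import combinations

    rng = np.random.default_rng(7)
    raters = [f"R{r:02d}" for r in range(1, 10)]
    # Hypothetical mean correlations per rater pair (in the study these are means
    # over the 12 task-by-criterion conditions).
    pair_means = {pair: rng.normal(0.55, 0.07) for pair in combinations(raters, 2)}

    values = np.array(list(pair_means.values()))
    mean, sd = values.mean(), values.std(ddof=1)
    lower, upper = mean - 3 * sd, mean + 3 * sd

    for pair, r in pair_means.items():
        if not (lower <= r <= upper):
            print(f"Potential outlier: {pair[0]}-{pair[1]}, mean r = {r:.2f}")
    print(f"mean = {mean:.2f}, SD = {sd:.2f}, screening interval = [{lower:.2f}, {upper:.2f}]")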

5.2 Results from the G-Theory Analysis

5.2.1 Variance Components

The univariate G-study estimated all seven variance components (VCs). Table 4 presents the estimation results for the seven sources of variability.

Table 4 Estimated variance components for InfoCom, FluDel, and TLQual
Source of variability | InfoCom | FluDel | TLQual
Interpreter (i) | 59.42, 1.52, (45.6) | 28.64, 0.71, (32.4) | 27.23, 0.68, (33.8)
Task (t) | 81.58, 0.27, (8.0) | 58.51, 0.19, (8.9) | 10.24, 0.03, (1.3)
Rater (r) | 23.45, 0.16, (4.7) | 34.46, 0.25, (11.6) | 25.17, 0.18, (9.0)
i × t | 3.63, 0.32, (9.5) | 2.28, 0.18, (8.4) | 1.98, 0.14, (6.9)
i × r | 1.77, 0.25, (7.4) | 1.45, 0.21, (9.4) | 1.61, 0.22, (10.8)
r × t | 2.25, 0.05, (1.4) | 1.13, 0.02, (0.7) | 1.30, 0.02, (0.9)
i × t × r, e | 0.78, 0.78, (23.4) | 0.62, 0.62, (28.5) | 0.75, 0.75, (37.3)
Note: Data in each cell are structured as MS, estimated VC, (% of total variance). MS = mean square; VC = variance component; InfoCom = information completeness; FluDel = fluency of delivery; TLQual = target language quality

As shown in the table, in terms of the InfoCom scores, the largest VC was that for the interpreters (σ̂²_i = 1.52), accounting for 45.6% of the total score variance. This indicated that, averaging over the tasks and the raters, the interpreters differed systematically in the InfoCom scores. Given that the interpreters were the object of measurement, this fairly large variation (i.e., true-score variance) is desirable and expected, because it

contributes to reliability of measurement. For the FluDel scores, a similar pattern was observed, with the interpreter VC being the largest one (σ̂²_i = 0.71), albeit accounting for a smaller proportion of the total variance (i.e., 32.4%). For the TLQual scores, although the interpreter VC represented a relatively large portion of the total variance (i.e., 33.8%), it was the second largest one, after the residual VC (37.3%). This means that the magnitude of the unmeasured sources of variation was considerable. Overall, of the three criteria, the FluDel scores were the most error-prone, because 67.6% of the total variance (i.e., (1 − 0.324) × 100%) was attributable to measurement error, a larger proportion than for either of the other two criteria.

In terms of the task-related VCs, the mildly large VC (σ̂²_t = 0.27) for the InfoCom scores made up 8% of the total variance. This means that, summarized over the interpreters and the raters, the difficulty of the SI tasks varied moderately. For the FluDel scores, a similar pattern was found. However, for the TLQual scores, the trivial magnitude of the task VC (σ̂²_t = 0.03) indicates that the average task difficulty remained almost stable across the tasks and did not contribute much to measurement error. Regarding the rater-related VCs, the non-negligible magnitude for the InfoCom scores (σ̂²_r = 0.16) represented 4.7% of the total variation. It suggests that the raters differed somewhat in terms of severity/stringency, after controlling for the effects of the interpreters and the tasks. It is also worth noting that the total variability due to the tasks (σ̂²_t = 0.27) was almost twice as large as that due to the raters (σ̂²_r = 0.16). This indicates that, for the InfoCom scores, the SI tasks differed much more in average level of difficulty than the raters did in average level of severity. For the FluDel scores, the rater VC was also non-trivial, and larger than that of the tasks, a different pattern from that of the InfoCom scores. For the TLQual scores, the rater VC (σ̂²_r = 0.18) was six times as large as the task VC (σ̂²_t = 0.03), indicating much greater heterogeneity of the average rater severity than of the average task difficulty.

Regarding the interaction VCs, for the InfoCom scores, the relatively large interpreter-by-task VC (i.e., σ̂²_it = 0.32, 9.5% of total variance) suggests that the relative standing of the interpreters varied somewhat from one task to another. This pattern generally held true for both the FluDel and the TLQual scores. With respect to the interpreter-by-rater VCs, the non-trivial VC for the InfoCom scores (i.e., σ̂²_ir = 0.25, 7.4% of total variance) means that different raters disagreed somewhat with one another in rank-ordering the interpreters. It is also worth noting that the interpreter-by-task VC was larger than the interpreter-by-rater VC. In conjunction with the earlier findings, it can be said that the SI tasks were a greater source of variability in the interpreters' InfoCom scores than were the raters. For the FluDel and the TLQual scores, the opposite pattern emerged: the interpreter-by-rater VCs were greater than the interpreter-by-task VCs. As a result, the SI tasks could be said to contribute less variability to the observed scores than the raters, for both the FluDel and the TLQual scores. In addition, for the three assessment criteria, all rater-by-task VCs were negligible, accounting for approximately 1% of the total variance, respectively.
This suggests that the raters used the rating scale consistently across the tasks for all the assessment criteria.


Table 5 Generalizability coefficients for InfoCom, FluDel and TLQual
G coefficient | InfoCom | FluDel | TLQual
G coefficient (relative), ρ² | 0.92 | 0.89 | 0.90
G coefficient (absolute), Φ | 0.88 | 0.81 | 0.86
Note: G coefficient = generalizability coefficient; InfoCom = information completeness; FluDel = fluency of delivery; TLQual = target language quality

Last, the residual VCs for the three respective assessment criteria show that a substantial proportion of variance was attributable to the triple-order interaction between the interpreters, the tasks and the raters, and/or other systematic or random sources of variation that were not measured.
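As a simple check on how the percentages in Table 4 are derived, the sketch below recomputes the proportion of total variance for the InfoCom components using the estimated VCs reported in the table; small discrepancies with the published percentages are due to rounding of the VCs.

    # Estimated variance components for InfoCom (values copied from Table 4).
    vc_infocom = {
        "interpreter (i)": 1.52,
        "task (t)": 0.27,
        "rater (r)": 0.16,
        "i x t": 0.32,
        "i x r": 0.25,
        "r x t": 0.05,
        "i x t x r, e": 0.78,
    }

    total = sum(vc_infocom.values())
    for source, vc in vc_infocom.items():
        print(f"{source:15s} {vc:5.2f}  {100 * vc / total:5.1f}% of total variance")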

5.2.2 Generalizability Indices

Table 5 shows two generalizability coefficients, ρ² and Φ, for relative and absolute decision-making, respectively. As can be seen in the table, for the InfoCom, the FluDel and the TLQual scores, both ρ² and Φ were desirable, that is, above the conventionally accepted minimum value of 0.80 (Cardinet et al. 2010). Note that these coefficients were calculated based on the measurement design of nine raters and four tasks.
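Using the InfoCom variance components in Table 4 together with the operational design of four tasks and nine raters, the two coefficients in Table 5 can be approximately reproduced with the standard two-facet random-design formulas. This is a worked check under those assumptions, not the EduG computation itself.

    # InfoCom variance components (Table 4) and the operational design.
    var_i, var_t, var_r = 1.52, 0.27, 0.16
    var_it, var_ir, var_rt, var_itr_e = 0.32, 0.25, 0.05, 0.78
    n_t, n_r = 4, 9

    relative_error = var_it / n_t + var_ir / n_r + var_itr_e / (n_t * n_r)
    absolute_error = relative_error + var_t / n_t + var_r / n_r + var_rt / (n_t * n_r)

    rho2 = var_i / (var_i + relative_error)
    phi = var_i / (var_i + absolute_error)
    print(f"relative G coefficient (rho^2) = {rho2:.2f}")      # about 0.92
    print(f"absolute index of dependability (Phi) = {phi:.2f}")  # about 0.88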

5.3 Results from the Rasch Analysis

Table 6 presents logit estimates of overall rater severity for the nine raters in descending order. As can be seen in the table, the most stringent rater was R12, while the least severe was R06, representing a total span of 0.74 logits. In addition, the homogeneity statistic shows that the raters were not equally severe (after controlling for the effects of the interpreters, the tasks and the criteria), given the statistically significant chi-square value (χ² = 307.6, df = 8) at ρ < 0.01. Next, a high separation index of 6.13 indicates that the rater severity differences were approximately six times larger than the estimated error. Finally, a high separation reliability index (0.97) points to a heterogeneous degree of overall rater severity averaged over the other assessment facets. However, despite the statistically significant differences in rater severity, each individual rater was generally self-consistent. As shown in Table 6, the mean-square infit statistics suggest that most of the raters had good internal consistency, because their infit values were within a tight fit control range between 0.7 and 1.3 (McNamara 1996; Bond and Fox 2015). R08, with an infit value of 1.96, was a misfit who displayed a high degree of inconsistency in his/her ratings (i.e., excessive variation), while R06 (infit value = 0.63) could be regarded

Table 6 Logit estimates of overall rater severity
Rater ID | Severity measure (in logit) | Model error | Infit mean square
R12 | 0.39 | 0.05 | 1.03
R05 | 0.32 | 0.05 | 0.80
R04 | 0.19 | 0.05 | 0.71
R07 | 0.14 | 0.05 | 1.14
R03 | 0.00 | 0.05 | 1.09
R01 | −0.09 | 0.05 | 0.84
R09 | −0.27 | 0.05 | 0.83
R08 | −0.34 | 0.05 | 1.96
R06 | −0.35 | 0.05 | 0.63
Homogeneity statistic: Chi-square (χ²) = 307.6*, df = 8, *ρ < 0.01; Separation index = 6.13; Separation reliability = 0.97

Appendix 1

Band/scoring criteria: Information completeness (InfoCom); Fluency of delivery (FluDel); Target language quality (TLQual)

Band 4 (Score range: 7–8)
InfoCom: … (i.e., > 90%), with a few deviations, inaccuracies, and minor/major omissions
FluDel: Delivery on the whole fluent, containing a few disfluencies such as (un)filled pauses, long silence, fillers and/or excessive repairs
TLQual: Target language idiomatic and on the whole correct, with only a few instances of unnatural expressions and grammatical errors

Band 3 (Score range: 5–6)
InfoCom: Majority of original messages delivered (i.e., 60–70%), with a small number of deviations, inaccuracies, and minor/major omissions
FluDel: Delivery on the whole generally fluent, containing a small number of disfluencies
TLQual: Target language generally idiomatic and on the whole mostly correct, with a small number of unnatural expressions and grammatical errors

Band 2 (Score range: 3–4)
InfoCom: About half of original messages delivered (i.e., 40–50%), with many instances of deviations, inaccuracies, and minor/major omissions
FluDel: Delivery rather fluent. Acceptable, but with regular disfluencies
TLQual: Target language to a certain degree both idiomatic and correct. Acceptable, but contains many instances of unnatural expressions and grammatical errors

Band 1 (Score range: 1–2)
InfoCom: A small portion of original messages delivered (i.e., < 30%), with frequent occurrences of deviations, inaccuracies, and minor/major omissions, to such a degree that listeners may doubt the integrity of renditions
FluDel: Delivery lacks fluency. It is frequently hampered by disfluencies, to such a degree that they may impede comprehension
TLQual: Target language stilted, lacking in idiomaticity, and containing frequent grammatical errors, to such a degree that it may impede comprehension
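If this rubric is to be used in a scoring spreadsheet or a simple assessment application, the band boundaries can be encoded directly. The sketch below assumes, consistent with the eight-point sub-scales described in Sect. 4.2, that the four two-point bands correspond to score ranges 1–2, 3–4, 5–6 and 7–8; the ratings in the example are hypothetical.

    def band_for_score(score: int) -> int:
        """Map a sub-scale score (1-8) to its band (1-4) on the analytic rating scale."""
        if not 1 <= score <= 8:
            raise ValueError("Each sub-scale score must be between 1 and 8.")
        return (score + 1) // 2   # 1-2 -> Band 1, 3-4 -> Band 2, 5-6 -> Band 3, 7-8 -> Band 4

    ratings = {"InfoCom": 6, "FluDel": 7, "TLQual": 5}   # hypothetical ratings by one rater
    for criterion, score in ratings.items():
        print(f"{criterion}: score {score} falls in Band {band_for_score(score)}")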

References Bachman, Lyle, Brian Lynch, and Maureen Mason. 1995. Investigating variability in tasks and rater judgements in a performance test of foreign language speaking. Language Testing 12 (2): 238–257. Bachman, Lyle. 1990. Fundamental considerations in language testing. Oxford: Oxford University Press. Bond, Trevor, and Christine Fox. 2015. Applying the Rasch model: Fundamental measurement in the human sciences, 3rd ed. New York: Routledge. Bonk, William, and Gary Ockey. 2003. A many-facet Rasch analysis of the second language group oral discussion task. Language Testing 20 (1): 89–110. Brennan, Robert. 2001a. An essay on the history and future of reliability from the perspective of replications. Journal of Educational Assessment 38 (4): 295–317. Brennan, Robert. 2001b. Generalizability theory. New York: Springer. Cardinet, Jean, Sandra Johnson, and Gianreto Pini. 2010. Applying generalizability theory using EduG. New York, NY: Routledge. Clifford, Andrew. 2004. A preliminary investigation into discursive models of interpreting as a means of enhancing construct validity in interpreter certification. https://ruor.uottawa.ca/handle/ 10393/29086. Accessed 7 May 2019. Crocker, Linda, and James Algina. 1986. Introduction to classical and modem test theory. Toronto: Holt, Rinehart and Winston. Cronbach, Lee, Goldine Gleser, Harinder Nanda, and Nageswari Rajaratnam. 1972. The dependability of behavioral measurements. New York: Wiley. DeVellis, Robert. 2006. Classical test theory. Medical Care 44 (1): 55–59.


Eckes, Thomas. 2005. Examining rater effects in TestDaF writing and speaking performance assessments: A many-facet Rasch analysis. Language Assessment Quarterly 2 (3): 197–221. Eckes, Thomas. 2008. Rater types in writing performance assessments: A classification approach to rater variability. Language Testing 25 (2): 155–185. Eckes, Thomas. 2015. Introduction to many-facet Rasch measurement: Analyzing and evaluating rater-mediated assessments, revised ed. Frankfurt am Main: Peter Lang. Fan, Xitao, and Shaojing Sun. 2014. Generalizability theory as a unifying framework of measurement reliability in adolescent research. The Journal of Early Adolescence 34 (1): 38–65. Gile, Daniel. 1995. Fidelity assessment in consecutive interpretation: An experiment. Target 7 (1): 151–164. Hale, Sandra, and Uldis Ozolins. 2014. Monolingual short courses for language-specific accreditation: Can they work? A Sydney experience. The Interpreter and Translator Trainer 8 (2): 1–23. Han, Chao, and Helen Slatyer. 2016. Test validation in interpreter certification performance testing: An argument-based approach. Interpreting 18 (2): 231–258. Han, Chao, and Mehdi Riazi. 2017. Investigating the effects of speech rate and accent on simultaneous interpretation: A mixed-methods approach. Across Languages and Cultures 18 (2): 237–259. Han, Chao, and Xiao Zhao. 2020. Accuracy of peer ratings on the quality of spoken-language interpreting. Assessment and Evaluation in Higher Education 46: 1–15. https://doi.org/10.1080/ 02602938.2020.1855624. Han, Chao. 2015. Investigating rater severity/leniency in interpreter performance testing: A multifaceted Rasch measurement approach. Interpreting 17 (2): 255–283. Han, Chao. 2016. Investigating score dependability in English/Chinese interpreter certification performance testing: A generalizability theory approach. Language Assessment Quarterly 13 (3): 186–201. Han, Chao. 2017. Using analytic rating scales to assess English–Chinese bi-directional interpreting: A longitudinal Rasch analysis of scale utility and rater behaviour. Linguistica Antverpiensia, New Series: Themes in Translation Studies 16: 196–215. Han, Chao. 2018a. A longitudinal quantitative investigation into the concurrent validity of self and peer assessment applied to English–Chinese bi-directional interpretation in an undergraduate interpreting course. Studies in Educational Evaluation 58: 187–196. Han, Chao. 2018b. Latent trait modelling of rater accuracy in formative peer assessment of English– Chinese consecutive interpreting. Assessment and Evaluation in Higher Education 43 (6): 979– 994. Han, Chao. 2018c. Using rating scales to assess interpretation: Practices, problems and prospects. Interpreting 20 (1): 59–95. Han, Chao. 2019. A generalizability theory study of optimal measurement design for a summative assessment of English/Chinese consecutive interpreting. Language Testing 36 (3): 419–438. Kline, Theresa. 2005. Psychological testing: A practical approach to design and evaluation. Thousand Oaks, CA: Sage. Kondo-Brown, Kimi. 2002. A FACETS analysis of rater bias in measuring Japanese second language writing performance. Language Testing 19 (1): 3–31. Lee, Jieun. 2008. Rating scales for interpreting performance assessment. The Interpreter and Translator Trainer 2 (2): 165–184. Lee, Sang-Bin. 2015. Developing an analytic scale for assessing undergraduate students’ consecutive interpreting performances. Interpreting 17 (2): 226–254. Linacre, John. 1989. FACETS: Computer program for many-facets Rasch measurement. 
Chicago: MESA Press. Linacre, John. 2013. A user’s guide to FACETS: Program manual 3.71.2. http://www.winsteps. com/a/facets-manual.pdf. Accessed 21 Oct 2019.


Liu, Minhua. 2013. Design and analysis of Taiwan’s interpretation certification examination. In Assessment issues in language translation and interpreting, ed. Dina Tsagari and Roelof van Deemter, 163–178. Frankfurt: Peter Lang. Lord, Frederic, Melvin Novick, and Allan Birnbaum. 1968. Statistical theories of mental test scores. Reading, MA: Addison-Wesley. Lumley, Tom, and Tim McNamara. 1995. Rater characteristics and rater bias: Implications for training. Language Testing 12 (1): 54–71. Lynch, Brian, and Tim McNamara. 1998. Using G-theory and many-facet Rasch measurement in the development of performance assessments of the ESL speaking skills of immigrants. Language Testing 15 (2): 158–180. Marcoulides, George, and Zvi Drezner. 1993. A procedure for transforming points in multidimensional space to a two-dimensional representation. Educational and Psychological Measurement 53 (4): 933–940. Masters, Geoff. 1982. A Rasch model for partial credit scoring. Psychometrika 47 (2): 149–174. McGraw, Kenneth O., and S.P. Wong. 1996. Forming inferences about some intraclass correlation coefficients. Psychological Methods 1 (1): 30–46. McNamara, Tim. 1996. Measuring second language performance. London: Longman. NAATI. 2019. Certified conference interpreter test assessment rubrics. https://www.naati.com.au/ media/2357/cci_spoken_assessment_rubrics.pdf. Accessed 20 Mar 2020. Schaefer, Edward. 2008. Rater bias pattern in an EFL writing assessment. Language Testing 25 (4): 465–493. Setton, Robin, and Andrew Dawrant. 2016. Conference interpreting: A trainer’s guide. Amsterdam: John Benjamins. Shang, Xiaoqi, and Guixia Xie. 2020. Aptitude for interpreting revisited: Predictive validity of recall across languages. The Interpreter and Translator Trainer 14 (3): 344–361. Shavelson, Richard, and Noreen M. Webb. 1991. Generalizability theory: A primer. Newbury Park, CA: Sage. Shrout, Patrick, and Jeseph Fleiss. 1979. Intraclass correlations: Uses in assessing rater reliability. Psychological Bulletin 86 (2): 420–428. Shultz, Kenneth, and David Whitney. 2005. Measurement theory in action: Case studies and exercises. Thousand Oaks, CA: Sage. Sudweeks, Richard, Suzanne Reeve, and William S. Bradshaw. 2005. A comparison of generalizability theory and many-facet Rasch measurement in an analysis of college sophomore writing. Assessing Writing 9 (3): 239–261. Tiselius, Elisabet. 2009. Revisiting Carroll’s scales. In Testing and assessment in translation and interpreting studies, ed. Claudia V. Angelelli and Holly E. Jacobson, 95–121. Amsterdam: John Benjamins. Traub, Ross, and Glenn L. Rowley. 1991. An NCME instructional module: Understanding reliability. Educational Measurement: Issues and Practices 10 (1): 37–45. van Weeren, J., and T.J.J.M. Theunissen. 1987. Testing pronunciation: An application of generalizability theory. Language Learning 37 (1): 109–122. Wang, Weiwei, Xu Yi, Wang Binghua, and Mu Lei. 2020. Developing interpreting competence scales in China. Frontiers in Psychology 11: 481. https://doi.org/10.3389/fpsyg.2020.00481. Webb, Noreen, and Richard J. Shavelson. 2005. Generalizability theory: Overview. In Encyclopedia of Statistics in Behavioral Science, ed. S. Everitt Brian and David C. Howell, 717–719. Chichester: Wiley. Weigle, Sara. 1998. Using FACETS to model rater training effects. Language Testing 15 (2): 263– 287. Wen, Qian. 2019. A many-facet Rasch model validation study on business negotiation interpreting test. Foreign Languages in China 16 (3): 73–82. Wigglesworth, Gillian. 1993. 
Exploring bias analysis as a tool for improving rater consistency in assessing oral interaction. Language Testing 10 (3): 305–319.


Wu, Shao-Chuan. 2010. Assessing simultaneous interpreting: A study on test reliability and examiners’ assessment behavior. https://theses.ncl.ac.uk/jspui/handle/10443/1122. Accessed 15 Apr 2019. Zhao, Nan, and Yanping Dong. 2013. Validation of a consecutive interpreting test based on multifaceted Rasch model. Journal of PLA University of Foreign Languages 36 (1): 86–90.

Chao Han is a full professor in the College of Foreign Languages and Cultures at Xiamen University, China. He conducted his Ph.D. research at the Department of Linguistics at Macquarie University (Sydney), focusing on interpreter certification testing. He is interested in testing and assessment issues in translation and interpreting (T&I) and methodological aspects of T&I studies. His recent publications have appeared in such journals as Interpreting, The Translator, Language Testing, and Language Assessment Quarterly. He is currently a member of the Advisory Board of the International Journal of Research and Practice in Interpreting.

Automatic assessment

Quantitative Correlates as Predictors of Judged Fluency in Consecutive Interpreting: Implications for Automatic Assessment and Pedagogy

Wenting Yu and Vincent J. van Heuven

Abstract This chapter presents an experimental study of consecutive interpreting which investigates whether judged fluency can be predicted from computer-based quantitative prosodic measures, including temporal and melodic measures. Ten raters judged six criteria of accuracy and fluency in two consecutive interpretations of the same recorded source speech, from Chinese 'A' into English 'B', by twelve trainee interpreters (seven undergraduates, five postgraduates). The recorded interpretations were examined with the speech analysis tool Praat. From a computerized count of the pauses thus detected, together with disfluencies identified by raters, twelve temporal measures of fluency were calculated. In addition, two melodic measures, i.e., pitch level and pitch range, were automatically generated; these two measures are often considered to be associated with speaking confidence and competence. Statistical analysis shows: (a) strong correlations between judged fluency and temporal variables of fluency; (b) no correlation between pitch range and judged fluency, but a moderate (negative) correlation between pitch level and judged fluency; and (c) the usefulness of effective speech rate (number of syllables, excluding disfluencies, divided by the total duration of speech production and pauses) as a predictor of judged fluency. Other important determinants of judged fluency were the number of filled pauses, articulation rate, and mean length of pause. The potential for developing automatic fluency assessment in consecutive interpreting is discussed, as are implications for informing the design of rubrics for fluency assessment and facilitating formative assessment in interpreting education.

Keywords Fluency · Quantitative correlates · Consecutive interpreting · Automatic assessment

This chapter is an extended and rewritten version of Yu and van Heuven (2017), originally published in the International Journal of Research and Practice in Interpreting, Volume 19, Issue 1, pp. 47–68 (https://www.benjamins.com/catalog/intp).

W. Yu, Shanghai International Studies University, Shanghai, China (e-mail: [email protected])
V. J. van Heuven, University of Pannonia, Veszprém, Hungary, and Leiden University, Leiden, Netherlands (e-mail: [email protected])

1 Introduction

Fluency is an important aspect of quality assessment of natural speech and also of interpreting. Though fluency alone provides no guarantee of the interpreter's reliability unless content too is taken into account, it is widely recognized as a feature of successful interpretation (Mead 2005). Since the 1980s, fluency has been studied as one out of many aspects of quality in interpreting, and its role in interpreting quality assessment has gained an increasing amount of recognition. It is nevertheless only recently that interpreting scholars have begun to foreground fluency as a specific component of quality, with the aim of defining it in practical terms and determining to what extent it might affect intelligibility and user perception (Rennert 2010). So far, a number of empirical studies have indicated that: (a) there is a contrast between user expectations and user perception regarding the importance of fluency as a quality criterion in interpreting; and (b) fluency is such an important element in perceived quality that its experimental manipulation affects user perception of other criteria (Pradas Macías 2003, 2007; Collados Aís et al. 2007; Rennert 2010). These studies argue that the importance of fluency had long been underestimated, given that previous surveys of quality perceptions or expectations among interpreters and habitual listeners showed a tendency to prioritize accuracy or faithfulness over form-related features like fluency (Bühler 1986; Kurz 1993, 2001, 2003; Moser 1996; Chiaro and Nocella 2004).

Fluency is a polysemous and somewhat elusive concept, though it is frequently applied to describe oral language performance. Since there has been no generally agreed definition of fluency, and since it has been used to characterize a wide range of different skills and speech characteristics, such as the temporal aspects of speech, the degree of coherence and semantic density in speech, and the appropriacy of speech content in different contexts (Leeson 1975; Fillmore 1979; Brumfit 1984; Lennon 1990; Schmidt 1992; Chambers 1997), assessment of fluency in interpreting is considered a difficult task. In everyday language use, fluency is often considered a synonym of general language proficiency (Lennon 1990; Chambers 1997). Accordingly, fluency is often associated with the overall quality of interpreting. In light of a more restricted definition of fluency that lays emphasis on "native-like rapidity" (Lennon 1990, p. 390), assessors of interpreting may be more concerned with how an interpretation is delivered and with the way an interpreter controls the use of silent pauses, hesitations, filled pauses, self-corrections, repetitions, false starts and the like. Thus, the various definitions of fluency contribute substantially to the variability in assessment standards and the lack of consensus in fluency assessment.

Another difficulty of fluency assessment in interpreting is that assessors' judgement is very likely to be shaped by the "impressions" formed after listening to the interpretation, because an interpreter's work is ephemeral in nature and even the same assessor might not be consistent at different times. It is also very likely that assessors find it difficult to dissociate fluency from other quality measures when their "impressions" of certain indices (e.g., fidelity, information completeness) become dominant. In response to these problems, Mead (2005) proposed that objective and quantitative measurement of temporal aspects of fluency could be more transparent and reliable than assessment based on content-related parameters such as completeness and correctness, which is often subject to differing opinions of what is right, wrong or missing and is ultimately more difficult to pin down. Admittedly, it is important to recognize that fluency, as a key form-related feature of successful interpretation, may not reveal the quality of interpreting without taking content into account. However, a quantitative perspective on different features of interpreting such as fluency can contribute to the overall assessment of quality (Mead 2005).

Though the quantitative approach to assessment has only recently been introduced to interpreting studies, it has long been practiced in more established disciplines like language testing and educational assessment. Studies on L1 and L2 speech have attempted to define fluency in terms of objective speech properties (Goldman-Eisler 1968; Grosjean and Deschamps 1975; Lennon 1990; Riggenbach 1991; Freed 1995; Cucchiarini et al. 2000, 2002). According to Cucchiarini et al. (2002), these studies have adopted a dual approach in which fluency is evaluated by listeners and quantified by temporal measures, and in which the subjective evaluations are then related to the objective measures. This type of approach, particularly useful for gaining insight into the acoustic features underlying listeners' evaluations, has a long tradition in phonetic research (Cucchiarini et al. 2002). It has been observed that, for speech tasks entailing different degrees of cognitive effort, there will be corresponding differences in fluency ratings (Grosjean 1980; Bortfeld et al. 1999) and in objective and acoustic predictors of fluency (Cucchiarini et al. 2002). It is therefore reasonable to expect that, for the cognitively complex task of interpreting, relations between perceived fluency and quantitative measures will differ from those in everyday L1 and L2 speech. In other words, the quantitative measures that are highly predictive of judged oral fluency in unconstrained L1 and L2 production might not retain their predictive power for interpreted renditions.

In Yu and van Heuven (2017), we applied this dual approach to gain insights into the temporal aspects of fluency as indicators of perceived fluency in consecutive interpreting (CI) from Chinese 'A' into English 'B'. That study was innovative in two important respects. First, it extended the existing research on the effect of task complexity on the fluency of linguistic output. Second, it investigated the relation between temporal parameters of fluency and judged fluency in CI, probing which temporal measure(s) of fluency were most indicative of fluency in CI.
In this chapter, based on the same data set generated in our previous experiment (Yu and van Heuven 2017), we extend the study by extracting two additional quantitative prosodic measures, namely median vocal pitch and the size of the effective pitch range, which we hypothesized would predict fluency in CI in much the same way as the temporal parameters we had already identified (Yu and van Heuven 2017). As explained above, the importance of this kind of research is two-fold. It not only generates a better understanding of how perceived fluency relates to quantitative characteristics of delivery in CI, but also carries implications for the future development of quantitative testing instruments. Ultimately, such research may contribute to greater testing efficiency and better learning results in interpreter training.

2 Quantitative Assessment of Fluency

Fluency is commonly used to describe oral language performance, and sometimes written performance (e.g., Lennon 1990). A review of the relevant literature reveals that fluency has been used to characterize different skills and speech properties, but there has been no consensus on its definition in different contexts (e.g., Leeson 1975; Fillmore 1979; Brumfit 1984; Lennon 1990, 2000; Schmidt 1992; Chambers 1997). Cucchiarini et al. (2000) distinguish native language fluency from fluency in the context of foreign language teaching and testing. Native language fluency is used to characterize the performance of a speaker and is often considered a synonym of "overall language proficiency" (Lennon 1990; Chambers 1997), but it does not really constitute an evaluation criterion (Cucchiarini et al. 2000). By contrast, fluency in foreign language teaching and testing tends to be formally used as a criterion of evaluation by which non-native performance can be judged (Riggenbach 1991; Freed 1995), despite the variability and vagueness of its definitions. In a more restricted sense, fluency can refer to temporal aspects of oral proficiency (Nation 1989; Lennon 1990; Riggenbach 1991; Schmidt 1992; Freed 1995; Towell et al. 1996), in line with Lennon's (1990) assumption that the goal in foreign language learning consists in producing "speech at the tempo of native speakers, unimpeded by silent pauses and hesitations, filled pauses, […] self-corrections, repetitions, false starts and the like".

The identification of temporal features of speech is a prerequisite for quantitative studies of fluency in different contexts: previous research shows that perceived fluency can be correlated with different quantitative measures, depending on the language and the specific speech task, e.g., L1 vs. L2 speech, or read vs. spontaneous speech (Möhle 1984; Towell et al. 1996; Cucchiarini et al. 2002; Kormos and Dénes 2004). The focus of the present study is the relationship between judged fluency and quantitative measures of fluency in the cognitively demanding speech task of CI from Chinese 'A' into English 'B' (or from L1 to L2). For our trainee interpreters, who are native Chinese speakers, there is always a major gap between A (Chinese) and B (English) language proficiency. Accordingly, oral proficiency in the B language often receives more attention in the interpreting syllabus. It is therefore necessary to review the existing studies on fluency in two types of L2 speech tasks (i.e., read speech and spontaneous speech), in which listeners' evaluations of speech are examined in relation to temporal and prosodic measures calculated for the same speech samples.

Research on fluency in L2 speech mostly aims to gain insight into the factors that underpin listeners' evaluations (Cucchiarini et al. 2002), and/or to help develop objective tests of L2 fluency that might lead to automatic assessment (Townshend et al. 1998; Cucchiarini et al. 2002). A number of researchers, including Lennon (1990), Riggenbach (1991), Freed (1995), Cucchiarini et al. (2002), Kormos and Dénes (2004), and Pinget et al. (2014), have carried out studies in which samples of spontaneous speech produced by non-native speakers of English were judged for fluency by experts and were then analyzed in terms of quantitative variables such as speech rate, phonation time ratio (the percentage of speaking time used for actual speech production), mean length of run, and the number and length of pauses. In Cucchiarini et al.'s (2002) study, the relationship between objective fluency measures and perceived fluency in L2 Dutch read and spontaneous speech was investigated in two separate experiments. They found that the fluency ratings in both cases were closely related to speech rate, phonation time ratio, number of silent pauses per minute, duration of silent pauses per minute, and mean length of run. While articulation rate showed almost no relationship with the perceived fluency ratings in spontaneous L2 speech, the two were closely correlated in read L2 speech production. The authors' tentative explanation for this finding was that, since pauses tended to occur much more frequently in spontaneous speech, articulation rate (which takes no account of pauses) may in practice be relegated to a position of irrelevance. Kormos and Dénes (2004) conducted a study on L2 Hungarian spontaneous speech fluency ratings and temporal measurements, and reported that speech rate, mean length of utterance, phonation time ratio and the number of stressed words produced per time unit were the best predictors of the fluency scores. Like Cucchiarini et al. (2002), they did not find that articulation rate, the number of filled and unfilled pauses, or other disfluency phenomena were good predictors of fluency ratings. A more recent study by Pinget et al. (2014) investigated which acoustic measures of fluency can predict perceived fluency in L2 Dutch spontaneous speech. Although their acoustic measures (calculated on the basis of syllable length and pause length/frequency) differed from those used in previous research, these parameters showed high predictive power for much of the variance in the fluency ratings, while two measures of repair fluency (the number of corrections and the number of repetitions) showed a certain, albeit limited, degree of predictability compared with other studies (cf. Cucchiarini et al. 2002; Kormos and Dénes 2004).
These studies, which scrutinize different L2 languages, show that: (a) fluency ratings are mainly affected by temporal variables related to speed fluency (i.e., the speed at which speech is delivered) and breakdown fluency (i.e., the number and length of pauses); and (b) the relationship between fluency ratings and temporal variables in spontaneous speech may be rather complex, since the former can also be affected by non-temporal language features such as grammar, vocabulary and accentedness (Lennon 1990; Riggenbach 1991; Freed 1995).

To sum up, a number of studies have shown that quantitative assessment can be used to identify objective measures that are predictive of subjective fluency ratings in L2 speech. The general consensus is that temporal measures related to speed fluency and breakdown fluency are far more predictive than repair fluency. However, the predictive power of objective measures differs for different L2 speech tasks depending on the cognitive effort involved.

3 Applying Quantitative Assessment of Fluency to CI

Interpreting is presumably more complex than L2 oral tasks in terms of the cognitive processes involved. What distinguishes CI from everyday spoken language activity is readily appreciated by comparing the two through the models often used to analyze them: Levelt's (1989) speech production model and Gile's (1995) Effort Model. The main difference between the two is that the speech production model has an initial conceptualization stage, whereas CI starts with the perception and comprehension of the source language, with parallel storage, processing, and retrieval of information through note-taking, memory functions, and the coordination of all these efforts. As a result, more attentional resources are almost certainly required in CI than in spontaneous speech production.

Assessment of interpreting quality is necessary for both professional practice and interpreter training, and it is generally based on two broad aspects: (a) content-related features such as accuracy and completeness of information; and (b) form-related features represented by fluency of delivery, accentedness, intonation, and voice quality (e.g., Bühler 1986; Zwischenberger and Pöchhacker 2010). Fluency is among the most important formal criteria and contributes to the overall quality of interpreting, as testified by the fact that it forms part of numerous rating scales in interpreting tests (Han et al. 2020). However, it seems to have attracted little attention in the teaching and training of interpreting. Recent research has examined how interpreting service users' expectations can differ from their actual assessments of interpreting fluency, and the results suggest that limited fluency may impact negatively on the overall judged quality of an interpretation (e.g., Collados Aís 1998; Pradas Macías 2003; Rennert 2010). In a large-scale global survey on conference interpreting quality involving 704 AIIC interpreters worldwide (AIIC 2002; Pöchhacker 2012), fluency was perceived as very important by 71% of the participants and ranked third out of eleven quality criteria (behind sense consistency and logical cohesion). In Yu and van Heuven (2017), judged fluency in CI was significantly correlated with judged accuracy. These studies indicate that the importance of fluency as one of the key quality indicators of interpreting performance is attracting a growing amount of attention.

Against this backdrop, research on quantitative approaches to fluency in interpreting seems both indispensable and practical. It is indispensable because, among the problems related to assessment method, judging by impression is the one most likely to result from the evanescent nature of interpreting, thus rendering the assessment subjective. Even for experienced assessors, a lack of consistency between assessments might occur, which indicates considerable variability in standards and priorities from one assessor to another (Mead 2005). It is practical because a number of acoustic parameters that underpin fluency (e.g., speech rate, the number and length of un/filled pauses, the number of false starts, and self-repairs) make it possible to examine (and potentially predict) fluency through objective measurement.

Compared with the abundance of empirical studies on quantitative assessment of fluency in L2 speech, relatively little research has explored the quantitative temporal parameters that potentially underlie fluency in interpreting, with a few notable exceptions (Mead 2005; Yang 2015; Yu and van Heuven 2017; Han et al. 2020). In his pioneering work elaborating a conceptual approach to quantitative assessment of fluency in interpreting, Mead (2005) analyzed five temporal parameters and suggested that speech rate, pause duration, and length of fluent run can be taken as the most relevant parameters in assessing interpreting fluency. Yang (2015) attempted to relate temporal measures of utterance fluency to perceived fluency ratings in an exploratory study in which 18 postgraduate student interpreters consecutively interpreted from Chinese to English. Her results indicated that overall speech rate (syllables per minute), articulation rate (pruned, syllables per minute), phonation time ratio, and mean length of silent pauses (with pause cutoffs set at 0.3, 0.4 and 0.5 s) were closely related to the subjective fluency ratings. In a similar experimental study on the correlations between judged fluency and temporal predictors of fluency in CI (Chinese to English), involving twelve trainee interpreters (seven undergraduate and five postgraduate students), Yu and van Heuven (2017) also included repair disfluencies such as false starts, restarts, corrections, and repetitions, and identified effective speech rate (number of syllables, excluding disfluencies, divided by total duration of speech production and pauses) as the most predictive of twelve temporal fluency measures. Most recently, Han et al. (2020) conducted an extended study with 41 undergraduate interpreting students as participants, aiming to model the relationship between utterance fluency and raters' perceived fluency in CI. They reported that mean length of unfilled pauses, phonation time ratio, mean length of run and speech rate had fairly strong correlations with perceived fluency ratings in both interpreting directions (English to Chinese and Chinese to English) and across rater types.

In addition, melodic features such as intonation and accentedness may potentially play a role in influencing perceived fluency. Melodic features and temporal features together constitute the two broad categories of prosodic phenomena (van Heuven 1994, 2017). Like temporal features, melodic features are part of the quality criteria in interpreting, and they are usually deemed less important than content-related criteria such as accuracy and logical cohesion (Collados Aís et al. 2007; Pöchhacker 2012).
However, empirical studies have shown that melodic features in interpreting delivery seem to influence users' understanding of the speech content (Déjean Le Féal 1990; Shlesinger 1994; Holub and Rennert 2011), their perception of overall interpreting quality (Cheung 2013), and their perception of fluency (Christodoulides and Lenglet 2014). Most interestingly, experimental research has found that listeners asked to rate "fluency" seem to unify temporal features with intonation (e.g., pitch variation) (Collados Aís et al. 2007). It therefore seems appropriate to incorporate quantifiable melodic parameters in the present study.

This experimental study addresses two related research questions: (a) What are the correlations between raters' subjective assessments of consecutive interpreters' fluency and quantitative measurements of both temporal and melodic parameters? (b) Which quantitative prosodic measure(s) can best predict judged fluency in CI?

4 Method

4.1 Participants

Twelve students from Shanghai International Studies University participated in this study: seven third-year BA translation and interpretation majors, with a mean age of 20, and five second-year MA students, with a mean age of 25, from the Graduate Institute of Interpretation and Translation. The BA students were still working on the development of basic interpreting skills, while the MA students were already interpreting part-time and were working towards their professional qualification. By the time of the experiment, the BA students had completed three semesters of CI training courses; the MA students had undergone three semesters of intensive and advanced interpreter training (at least three hours a day, covering both CI and simultaneous interpreting). All participants had Chinese A and English B.

4.2 Material

A source audio clip in Chinese (3.5 min in duration, with a total of 501 Chinese characters comprising six paragraphs) was prepared from recordings of the press conference (2.5 h) held by the former Chinese Premier Wen Jiabao during the National People's Congress in 2009. The audio clip was played to the student participants and they interpreted it consecutively into English. Of the six interpreted paragraphs, two (paragraphs 4 and 5) were selected for perceptual rating and acoustic analysis. The reason for focusing on an extended extract, rather than the whole interpretation, was that the rating task had to be manageable so as to avoid rater fatigue.


4.3 Procedures

4.3.1 Experiment

The experiment was originally designed and run to test an earlier hypothesis regarding the improvement of judged fluency when exactly the same speech is interpreted a number of times in quick succession (Yu and van Heuven 2013). The experiment took place in conference rooms equipped with booths for simultaneous interpreting. The stimulus speech and the participants’ rendition were digitally recorded on separate tracks to maintain time differences. One of the authors monitored the interpreters over headphones and ensured that all of them had finished interpreting one paragraph before the next was played to them. The participants were instructed to interpret the same source speech three times (deliveries 1, 2 and 3), paragraph by paragraph, with a break of two minutes between deliveries. Delivery 1 and delivery 3 were selected for both auditory rating and acoustic analysis. Delivery 2 was excluded, because previous studies (e.g., Zhou 2006) suggest that the third delivery is often the most proficient during oral task repetition.

4.3.2 Fluency Ratings

The online survey software Qualtrics was used for the rating procedure. Twenty-four clips (12 interpreters × 2 interpretations each) were rated on six measures related to accuracy and fluency: (a) accuracy of information; (b) grammatical correctness; (c) speed of delivery; (d) control of pauses (both silent and filled); (e) control of other disfluencies (unnecessary repetitions, false starts, inappropriate lengthening of syllables, self-corrections); and (f) overall fluency, on a scale of 1–10. The order of the 24 clips was randomized for each rater, to prevent the potential order effect. Ten raters (five male, five female) with a background of studying or teaching at Leiden University participated in the online rating: three were native English speakers (two from the UK and one from the US) and seven had near-native English proficiency (six Dutch L1 and one Portuguese L1, all being either members of the academic staff in the English section or linguistics Ph.D. candidates). The raters were informed that the entire rating session would last an hour and advised to take a ten-minute break after rating twelve clips to avoid fatigue. They were then asked to complete a background questionnaire, after which they carefully read through an English translation of the two paragraphs so that they understood the messages to be interpreted. Subsequently, the audio clips of two specimen interpretations were played to them: one very good, the other less so. These were recordings of interpreters who were not actually included in the experimental sample. The ten raters scored all the twelve participants across two deliveries, as explained above. Means were calculated to obtain one score for every single delivery on each rating measure. A total of 24 ratings was thus obtained for each of the six rating measures.

4.3.3 Acoustic Correlates of Fluency

The following 14 measures were selected for investigation:

1. articulation rate: number of syllables, including disfluencies, divided by total duration of speech apart from all (silent and filled) pauses longer than 0.25 s;
2. speech rate: as for articulation rate, but including all pauses in the total speech duration;
3. effective speech rate: as for speech rate, but excluding disfluencies from the syllable count;
4. number of silent pauses above 0.25 s in duration;
5. mean length of silent pauses longer than 0.25 s;
6. number of filled pauses (uh, er, mm, etc.);
7. mean length of filled pauses;
8. number of pauses: the sum of (4) and (6);
9. mean length of pauses: mean of (5) and (7), weighted by their respective frequencies (items 4 and 6);
10. number of other disfluencies (repetitions, restarts, false starts, corrections);
11. mean length of fluent runs: mean number of syllables produced between silent pauses longer than 0.25 s;
12. percentage of articulation time, calculated on the basis of items 4 and 6, as a percentage of overall speech time: (total duration of speech without pauses, divided by total duration of speech including pauses) × 100;
13. median pitch: the median value of the fundamental frequency (in hertz, Hz);
14. size of effective pitch range: the difference between the 10th and 90th percentile of the pitch distribution (in semitones).
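To make these definitions concrete, the following minimal sketch shows how most of the temporal measures might be computed once pause and disfluency intervals have been annotated and syllables counted, as described below. It is our illustration rather than the scripts actually used in the study, and all function and field names are ours.

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class Interval:
    """One annotated interval: 'sp' = silent pause, 'fp' = filled pause,
    'dis' = other disfluency, 'speech' = fluent speech."""
    label: str
    start: float  # seconds
    end: float    # seconds

    @property
    def dur(self) -> float:
        return self.end - self.start

def temporal_measures(intervals: List[Interval],
                      n_syllables: int,
                      n_disfluent_syllables: int,
                      threshold: float = 0.25) -> Dict[str, float]:
    """Compute measures (1)-(9) and (12) from the list above for one interpretation."""
    total = sum(iv.dur for iv in intervals)  # total speech time, pauses included
    silent = [iv.dur for iv in intervals if iv.label == "sp" and iv.dur > threshold]
    filled = [iv.dur for iv in intervals if iv.label == "fp" and iv.dur > threshold]
    pause_time = sum(silent) + sum(filled)
    articulation_time = total - pause_time
    n_pauses = len(silent) + len(filled)

    return {
        "articulation_rate": n_syllables / articulation_time,                    # (1)
        "speech_rate": n_syllables / total,                                      # (2)
        "effective_speech_rate": (n_syllables - n_disfluent_syllables) / total,  # (3)
        "n_silent_pauses": len(silent),                                          # (4)
        "mean_silent_pause": sum(silent) / len(silent) if silent else 0.0,       # (5)
        "n_filled_pauses": len(filled),                                          # (6)
        "mean_filled_pause": sum(filled) / len(filled) if filled else 0.0,       # (7)
        "n_pauses": n_pauses,                                                    # (8)
        "mean_pause": pause_time / n_pauses if n_pauses else 0.0,                # (9)
        "pct_articulation_time": 100 * articulation_time / total,                # (12)
    }
```

Mean length of fluent runs (11) additionally requires per-run syllable counts from the transcription, and the melodic measures (13) and (14) come from the pitch analysis described later in this subsection.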

A threshold of 0.25 s for silent pauses is used to distinguish hesitation in speech (Towell et al. 1996) from pauses that are part of normal articulation for some combinations of sounds or may be classified as micro-pauses (Riggenbach 1991; Pinget et al. 2014). According to Tavakoli and Skehan's (2005) perspective on the temporal properties of fluency, the above acoustic measures are predictive of speed fluency (1, 2), breakdown fluency (2, 4, 5, 6, 7, 8, 9, 11 and 12), repair fluency (10), or all three categories (3). In the present study, speech length is measured in syllables. This is in line with Pöchhacker's (1993) observation that the use of syllables as a standard international unit of measurement obviates the practical drawback caused by the considerable variability in word length across different languages. Calculation of some temporal measures (e.g., percentage of articulation time) requires a count of the transcribed syllables: in our study, this was done manually by the first author and checked by a student assistant. For all 24 clips, the transcription was made by a graduate student assistant and checked by the first author. The transcriber was instructed to listen very carefully, noting any apparently unpronounced syllables for the purpose of the subsequent syllable count. No syllables had been omitted. Detailed phonetic transcription and a manual syllable count would be laborious for longer speech samples. An automatic syllable count can be obtained by running a script in Praat developed by de Jong and Wempe (2009), with the prospect of further improvements for future use in research such as this. Automatic phonetic transcription will also become applicable as automatic speech recognition technology evolves. Currently, however, we argue that manual calculation offers the best guarantee of accuracy.

The transcription of the 24 clips included filled pauses and all types of disfluencies. Silent pauses were detected by running the MarkInterval script in Praat (written by Jos J. A. Pacilly, speech software specialist at the Leiden University Centre for Linguistics). The software converts an acoustic signal into an oscillogram and/or spectrogram, visualizing sounds as a continuous wave pattern in which any segment can be matched with the corresponding recording. At a sampling frequency of 44.1 kHz, the duration of different speech features can be measured in milliseconds. Together with the oscillogram, two annotation levels ('tiers' in a 'textgrid') were created for the transcribed texts and the labelled disfluencies (see Fig. 1 for an example). The length of each silent and filled pause detected, the total duration and number of all pauses, and the number of disfluencies were automatically extracted. Silences at the very beginning and end of every delivery were discarded.

Fig. 1 Visualization of a selected sound wave, with two annotation tiers in a text grid, in Praat (CH4_1st = Delivery 1 by the participant interpreting in Booth No. 4; fp = filled pause; sp = silent pause; rn = repetition)

The selection of the variables in this study differs slightly from Cucchiarini et al.'s (2002) choice of nine temporal variables related to fluency in L2 spontaneous speech. First, a distinction was made between effective speech rate (a variable proposed by us) and speech rate in general. Effective speech rate is calculated after excluding syllables identified as disfluencies (e.g., involuntary repetitions), because these are very likely to be more frequent in the cognitively demanding speech task of interpreting than in unconstrained speech. Second, syllables were used as the units of measurement. Third, mean length of filled pauses was added. The rationale for this decision was that interpreting is probably more conducive than spontaneous speech to hesitation pauses, as the interpreter takes time to analyze incoming information while also planning and retrieving the components of target language production. Finally, the number and mean length of pauses were also added so as to measure overall pausing.

Before we set out to generate the pitch information for each of the 24 audio clips using Praat, we set an analysis range (the upper and lower boundaries of pitch) between 125 and 400 Hz for most participants, with the exception of two female participants and one male participant. As the clips of one of the two female participants had patches containing creaky voice (extremely low pitch), we used a routine procedure to unvoice the creaky parts and adjusted the analysis range to 150–350 Hz. The other female participant had a rather low-pitched voice for a woman, so we set her pitch range between 75 and 250 Hz, which is the same analysis range we applied to the clips of the male participant. In addition, we also unvoiced the parts where some participants produced jump-ups of the pitch when pronouncing affricate and fricative sounds such as /tʃ/ and /ʃ/. The effective size of the pitch range was calculated in semitones as the difference between the 90th and the 10th percentiles of the speaker's pitch distribution. The median pitch was the exact value in hertz (Hz) of the 50th percentile of the pitch distribution. The quantitative measures described above were calculated separately for every interpretation (i.e., two values per participant for each parameter). A total of 24 values was thus obtained for each quantitative prosodic measure. Correlations between these values and the judged fluency ratings were then analyzed.
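For readers who wish to reproduce this kind of analysis programmatically, the sketch below uses the parselmouth library (a Python interface to Praat) to approximate the two steps just described: silence detection and pitch extraction. It relies on Praat's built-in "To TextGrid (silences)" command rather than the MarkInterval script used in the study, and the file name, silence thresholds and analysis range shown are illustrative assumptions only.

```python
import numpy as np
import parselmouth
from parselmouth.praat import call

snd = parselmouth.Sound("delivery1_booth4.wav")   # hypothetical file name

# 1. Annotate silent vs. sounding intervals (a stand-in for the MarkInterval script).
#    Arguments: min pitch (Hz), time step (0 = auto), silence threshold (dB),
#    min silent interval (s), min sounding interval (s), interval labels.
tg = call(snd, "To TextGrid (silences)", 100, 0.0, -25.0, 0.25, 0.1, "silent", "sounding")
silent_pauses = []
for i in range(1, call(tg, "Get number of intervals", 1) + 1):
    if call(tg, "Get label of interval", 1, i) == "silent":
        t0 = call(tg, "Get starting point", 1, i)
        t1 = call(tg, "Get end point", 1, i)
        silent_pauses.append(t1 - t0)

# 2. Pitch analysis within a speaker-specific range (125-400 Hz for most speakers here).
pitch = snd.to_pitch(pitch_floor=125.0, pitch_ceiling=400.0)
f0 = pitch.selected_array["frequency"]
f0 = f0[f0 > 0]                                   # drop unvoiced frames

median_pitch_hz = float(np.percentile(f0, 50))    # measure (13)
p10, p90 = np.percentile(f0, [10, 90])
pitch_range_semitones = 12 * np.log2(p90 / p10)   # measure (14)
```

The semitone conversion in the last line simply expresses the 10th-to-90th percentile ratio on a logarithmic musical scale, which is how the effective pitch range is defined above.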

4.4 Statistical Analysis

The ten raters were highly consistent in their ratings on the six measures related to accuracy and fluency, with a mean Cronbach's α of 0.96. The highest α was 0.97, for grammatical correctness and speed of delivery; the lowest α was 0.94, for accuracy of information. Table 1 shows Cronbach's α as a measure of internal consistency for each of the six rating items.

Table 1 Internal consistency (Cronbach's α) of the ten raters for each rating item

Rating item                α
Accuracy of information    0.94
Grammatical correctness    0.97
Speed of delivery          0.97
Control of pauses          0.95
Control of disfluencies    0.96
Overall fluency            0.95


For the statistical analyses of the rating results and the quantitative measures, Pearson’s r and (multiple) linear regression were performed.
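A brief sketch of these two computations is given below. It assumes that the ratings for one rating item are available as a 24 × 10 array (clips by raters); the array and variable names are ours, not the study's analysis scripts.

```python
import numpy as np
from scipy import stats

def cronbach_alpha(scores: np.ndarray) -> float:
    """Internal consistency of the raters for one rating item.
    `scores` has shape (n_clips, n_raters): here, 24 clips x 10 raters."""
    k = scores.shape[1]
    rater_variances = scores.var(axis=0, ddof=1).sum()  # variance of each rater's scores
    total_variance = scores.sum(axis=1).var(ddof=1)     # variance of the summed scores
    return k / (k - 1) * (1 - rater_variances / total_variance)

# Pearson's r between a column of mean ratings and one acoustic measure (24 pairs):
# r, p = stats.pearsonr(mean_overall_fluency, effective_speech_rate)
```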

5 Results

The results of the accuracy and fluency ratings assigned by the ten raters are presented first, followed by those for the quantitative acoustic measures of fluency. Finally, the relationship between the two sets of measures is examined.

5.1 Auditory Ratings of Fluency and Accuracy

We present the mean scores on the six ratings for the accuracy and fluency parameters (see Table 2). The ratings differed for the beginner and advanced students, as well as for deliveries 1 and 3. The analysis in our earlier study (Yu and van Heuven 2013) showed significant main effects of both repetition (delivery) and proficiency on the perceived accuracy and fluency. This means that the advanced students were judged as being significantly more accurate and fluent than the beginners, while all students were judged as being significantly more accurate and fluent in delivery 3 than in delivery 1. In addition, the ratings for accuracy and fluency-related measures were found to correlate strongly in Yu and van Heuven's (2017) study (see Table 3).

Table 2 Mean ratings for the six measures related to accuracy and fluency (left-hand block: delivery 1; right-hand block: delivery 3)

Proficiency  ID  |  AI   GC   SD   CP   CD   OF  |  AI   GC   SD   CP   CD   OF
BA            1  | 5.7  5.7  5.7  4.8  4.8  5.3  | 6.1  6.4  6.7  6.1  5.4  6.1
BA            2  | 5.6  5.2  5.5  4.9  5.0  5.6  | 7.5  6.6  6.5  5.5  5.6  6.3
BA            3  | 4.5  4.6  5.0  4.7  4.3  4.5  | 6.4  6.0  6.2  5.9  5.9  6.2
BA            4  | 5.2  4.9  4.4  4.0  3.6  4.1  | 5.1  5.2  5.9  5.6  5.2  5.7
BA            5  | 5.6  5.3  5.2  5.0  4.8  5.1  | 6.8  6.6  6.1  6.4  6.0  6.2
BA            6  | 4.7  5.2  4.9  4.6  4.4  4.8  | 6.7  5.9  5.5  5.6  6.0  5.9
BA            7  | 5.5  5.7  5.0  3.9  4.4  4.4  | 5.7  6.0  6.3  5.7  5.1  5.7
Mean (BA)        | 5.3  5.2  5.1  4.6  4.5  4.8  | 6.3  6.1  6.2  5.8  5.6  6.0
MA            8  | 7.0  6.6  7.0  6.4  6.1  6.8  | 7.5  6.5  7.3  6.9  6.5  7.2
MA            9  | 7.6  6.8  7.7  6.7  6.4  7.1  | 8.1  7.6  7.9  7.6  7.4  7.6
MA           10  | 7.8  6.9  7.6  6.9  6.8  7.3  | 8.2  7.7  8.0  7.2  7.2  7.8
MA           11  | 7.1  5.7  6.3  5.7  5.7  6.1  | 7.4  6.0  6.8  6.6  6.2  6.6
MA           12  | 5.6  5.5  5.6  4.8  4.7  5.3  | 6.2  6.7  6.7  5.8  5.2  5.8
Mean (MA)        | 7.0  6.3  6.8  6.1  5.9  6.5  | 7.5  6.9  7.3  6.8  6.5  7.0
Mean (grand)     | 6.0  5.7  5.8  5.2  5.1  5.5  | 6.8  6.4  6.7  6.2  6.0  6.4

Notes ID = identification number of participants, AI = accuracy of information, GC = grammatical correctness, SD = speed of delivery, CP = control of pauses, CD = control of other disfluencies, OF = overall fluency

Table 3 Pearson's r correlations between the accuracy-related and the fluency-related ratings

                          Speed of delivery   Control of pauses   Control of disfluencies   Overall fluency
Accuracy of information        0.88**              0.86**                 0.92**                0.92**
Accuracy of grammar            0.90**              0.85**                 0.86**                0.88**
Note **p < 0.01 (two-tailed)

5.2 Acoustic Measures of Fluency

In this section, the fourteen temporal and melodic variables are presented (see Table 4). Table 4 also shows the values of the different temporal and melodic variables for delivery 1 vs. delivery 3. The D3/D1 ratio is the mean of each acoustic variable for delivery 3, divided by that for delivery 1. For ten of the fourteen acoustic variables, scores changed in accordance with a more favorable fluency rating; the remaining four parameters were unaffected by the repeated delivery. These exceptions are mean length of filled pauses, percentage of articulation time, median pitch and size of effective pitch range, which hardly changed. The number of filled pauses and the number of disfluencies were halved in delivery 3. Overall, the temporal measures of fluency were consistent with the trends for the fluency ratings in our earlier study, where both beginner and advanced students achieved significantly higher scores in delivery 3 (Yu and van Heuven 2013).

Table 4 Descriptive statistics of the 14 acoustic fluency measures for the 12 participants in the two consecutive interpretations

5.3 Correlations Between the Quantitative Prosodic Measures and the Fluency Ratings

The quantitative prosodic measures were compared with the fluency ratings to determine how and to what extent the two were related. Pearson's r values are shown in Table 5. Twelve out of the fourteen quantitative prosodic measures were closely correlated with judged fluency. Effective speech rate had the highest correlation (r = 0.84, p < 0.01), followed by mean length of fluent runs (r = 0.78, p < 0.01) and percentage of articulation time (r = 0.78, p < 0.01).

Table 5 Pearson's correlations between the fluency ratings and the acoustic fluency measures for the twelve participants in the two consecutive interpretations

                                      AI      GC      SD      CP      CD      OF
Articulation rate (syll/s)           0.27    0.34    0.50    0.42    0.33    0.36
Speech rate (syll/s)                 0.65    0.66    0.83    0.84    0.74    0.77
Effective speech rate (syll/s)       0.72    0.70    0.86    0.90    0.82    0.84
N silent pauses                     −0.50   −0.54   −0.58   −0.68   −0.64   −0.60
Length of silent pauses (s)         −0.56   −0.42   −0.62   −0.64   −0.51   −0.62
N filled pauses                     −0.33   −0.37   −0.43   −0.55   −0.50   −0.42
Length of filled pauses (s)         −0.44   −0.52   −0.56   −0.50   −0.52   −0.52
N pauses                            −0.44   −0.49   −0.53   −0.65   −0.60   −0.54
Length of pauses (s)                −0.54   −0.43   −0.62   −0.59   −0.46   −0.58
N disfluencies                      −0.57   −0.56   −0.62   −0.71   −0.73   −0.69
Length of fluent runs (syll)         0.66    0.68    0.73    0.82    0.82    0.78
Percentage of articulation time      0.68    0.63    0.72    0.83    0.77    0.78
Median pitch (Hz)                   −0.43   −0.22   −0.45   −0.37   −0.34   −0.41
Effective pitch range (semitones)    0.39    0.22    0.37    0.42    0.39    0.34

Note AI = accuracy of information, GC = grammatical correctness, SD = speed of delivery, CP = control of pauses, CD = control of disfluencies, OF = overall fluency
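As an illustration of how a Table 5-style correlation matrix can be produced, the short sketch below assumes a data frame with one row per interpretation (24 rows) holding the six mean ratings and the fourteen acoustic measures; the input file and column names are hypothetical, not taken from the study's materials.

```python
import pandas as pd

RATINGS = ["accuracy_info", "grammar", "speed", "control_pauses",
           "control_disfluencies", "overall_fluency"]
MEASURES = ["articulation_rate", "speech_rate", "effective_speech_rate",
            "n_silent_pauses", "mean_silent_pause", "n_filled_pauses",
            "mean_filled_pause", "n_pauses", "mean_pause", "n_disfluencies",
            "mean_length_of_run", "pct_articulation_time",
            "median_pitch_hz", "pitch_range_st"]

df = pd.read_csv("fluency_data.csv")  # hypothetical input: 24 rows, one per interpretation
# Pearson correlations of every acoustic measure with every rating criterion
table5 = df[MEASURES + RATINGS].corr(method="pearson").loc[MEASURES, RATINGS]
print(table5.round(2))
```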


5.4 Quantitative Prosodic Measures as Predictors of Judged Fluency

Several linear regression models were built in stepwise mode, using SPSS 22, to investigate to what extent the fourteen quantitative prosodic measures could explain the variance in the fluency ratings. Tables 6, 7, 8 and 9 show the (adjusted) proportion of variance explained (R2) by these models, and thus the incremental predictive power of each acoustic parameter. First, model 1 (Table 6) evaluates all fourteen quantitative prosodic measures as candidate predictors of the ratings on speed of delivery: the adjusted R2 shows that 78.9% of the variance in these ratings may be explained on the basis of two temporal measures, i.e., effective speech rate (R2 = 72.1%) and number of filled pauses (R2 = 6.8%). Effective speech rate appeared to be the best indicator of the ratings on speed of delivery. Second, 88.2% of the variance in the ratings for control of pauses (Table 7) may be explained on the basis of three temporal measures, i.e., effective speech rate (R2 = 79.6%), articulation rate (R2 = 4.5%) and number of filled pauses (R2 = 4.1%). Again, effective speech rate appeared to be the best indicator. Third, model 3 (Table 8) shows that 87.6% of the variance in the ratings on control of disfluencies may be explained on the basis of four temporal measures, i.e., effective speech rate (R2 = 66.1%), speech rate (R2 = 9.1%), number of filled pauses (R2 = 6.6%) and mean length of fluent run (R2 = 5.8%). Here, too, effective speech rate appeared to be the best indicator. Finally, model 4 (Table 9) shows that 90.2% of the variance in the overall fluency ratings may be explained on the basis of four temporal measures, i.e., effective speech rate (R2 = 68.7%), number of filled pauses (R2 = 6.0%), articulation rate (R2 = 11.8%) and mean length of pause (R2 = 3.7%). Once again, effective speech rate appeared to be the best predictor of judged overall fluency.

Table 6 Model 1 (dependent variable: judged speed of delivery)

Predictors                R2      Adj R2   Increment   SE of estimate
Effective speech rate     0.734   0.721    –           0.53479
Number of filled pauses   0.807   0.789    0.068       0.46557

Table 7 Model 2 (dependent variable: judged control of pauses)

Predictors                R2      Adj R2   Increment   SE of estimate
Effective speech rate     0.805   0.796    –           0.45052
Articulation rate         0.855   0.841    0.045       0.39763
Number of filled pauses   0.897   0.882    0.041       0.34262

Table 8 Model 3 (dependent variable: judged control of disfluencies)

Predictors                   R2      Adj R2   Increment   SE of estimate
Effective speech rate        0.676   0.661    –           0.55945
Speech rate                  0.773   0.752    0.091       0.47875
Number of filled pauses      0.842   0.818    0.066       0.40961
Mean length of fluent runs   0.897   0.876    0.058       0.33850

Table 9 Model 4 (dependent variable: judged overall fluency)

Predictors                R2      Adj R2   Increment   SE of estimate
Effective speech rate     0.701   0.687    –           0.56339
Number of filled pauses   0.769   0.747    0.060       0.50647
Articulation rate         0.883   0.865    0.118       0.36983
Mean length of pauses     0.919   0.902    0.037       0.31471
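A simple forward-selection sketch in the spirit of these models is shown below. It approximates, but does not reproduce exactly, the SPSS stepwise procedure (it only adds predictors and uses a fixed entry criterion), and the data frame and column names are hypothetical.

```python
import pandas as pd
import statsmodels.api as sm

def forward_select(df: pd.DataFrame, predictors: list, outcome: str,
                   alpha_enter: float = 0.05):
    """Repeatedly add the predictor with the smallest entry p-value,
    as long as that p-value stays below alpha_enter."""
    selected, remaining = [], list(predictors)
    while remaining:
        entry_p = {}
        for cand in remaining:
            X = sm.add_constant(df[selected + [cand]])
            entry_p[cand] = sm.OLS(df[outcome], X).fit().pvalues[cand]
        best = min(entry_p, key=entry_p.get)
        if entry_p[best] >= alpha_enter:
            break
        selected.append(best)
        remaining.remove(best)
    final = sm.OLS(df[outcome], sm.add_constant(df[selected])).fit()
    return selected, final

# Hypothetical usage: df holds one row per interpretation with the 14 measures and
# the mean ratings; MEASURES is the list of acoustic measure column names.
# order, model = forward_select(df, MEASURES, "overall_fluency")
# print(order, round(model.rsquared_adj, 3))
```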

5.5 A Closer Examination of the Melodic Parameters

As the two melodic parameters were not found to correlate as closely with the judged fluency ratings as the twelve temporal measures, we examined the relationship between judged fluency and each of the two melodic parameters separately. Figure 2 plots judged overall fluency as a function of the effective pitch range (in semitones, panel A) and of the median pitch (in Hz, panel B) for our twelve participants. In each panel, the data points are plotted separately for the first (circles) and the third (triangles) delivery by each participant.

Fig. 2 Judged overall fluency as a function of effective pitch range (semitones, panel A) and of median pitch (Hz, panel B) for the twelve participants, broken down by delivery

Figure 2A shows that there was a weak (almost negligible) correlation between the effective pitch range and judged fluency during the first delivery. A stronger correlation was seen in the third (and better) delivery, suggesting that the raters tended to associate larger pitch movements with speaker confidence and competence, which may in turn have affected their perception of fluency in interpreting. However, the effective pitch range did not change systematically with delivery, t(11) = 0.449, p = 0.331 (one-tailed). But the interpreters with a larger effective pitch range were judged to be more fluent when the fluency judgments and the ranges were averaged across the two deliveries, r = 0.416, p = 0.089 (one-tailed). Figure 2B shows that there was a moderate and negative correlation between the median pitch and judged overall fluency, indicating that the low-pitched interpreters were perceived as more confident and competent, which, in turn, might also have affected the raters' perception of fluency in interpreting. Similarly, the pitch level did not change systematically with delivery, t(11) = 1.552, p = 0.075 (one-tailed). The overall trend to judge the interpreters with a lower median pitch as being more fluent was significant, r = −0.509, p = 0.049 (one-tailed). The interpreters with the highest fluency ratings were typically MA students. One of the five MA participants was a male speaker, who naturally has a lower median pitch than the eleven female speakers. Crucially, however, when we omitted the single male speaker from the analysis, the correlational strength between the median pitch and judged fluency of interpreting was not affected, and in fact increased somewhat, r = −0.536, p = 0.044 (one-tailed), which indicates that the association between low (median) pitch and judged fluency (mediated by the attribution of competence) was not confounded by a potential gender effect.
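The delivery-effect and trend tests reported above can be reproduced with standard routines, as in the sketch below; the helper functions and input arrays (one value per interpreter) are our own naming, not the study's analysis code.

```python
import numpy as np
from scipy import stats

def delivery_effect(d1: np.ndarray, d3: np.ndarray):
    """Paired t-test across the twelve interpreters, one-tailed p for d3 > d1."""
    t, p_two = stats.ttest_rel(d3, d1)
    return t, p_two / 2 if t > 0 else 1 - p_two / 2

def one_tailed_r(x: np.ndarray, y: np.ndarray):
    """Pearson r with a one-tailed p-value for a directional hypothesis (r > 0)."""
    r, p_two = stats.pearsonr(x, y)
    return r, p_two / 2 if r > 0 else 1 - p_two / 2

# Usage (arrays of per-interpreter values would come from the acoustic analysis
# and the averaged fluency ratings; the variable names are illustrative):
# t, p = delivery_effect(pitch_range_d1, pitch_range_d3)
# r, p = one_tailed_r(mean_pitch_range, mean_judged_fluency)
```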

6 Discussion

Similar to what we found earlier (Yu and van Heuven 2017), all four judged fluency criteria (speed of delivery, control of pauses, control of disfluencies, overall fluency) correlated significantly with almost all of the twelve temporal measures of fluency, although the judged control of disfluencies and the judged overall fluency did not correlate significantly with articulation rate (see Table 5). In particular, the study indicates that effective speech rate, percentage of articulation time, length of fluent runs, and speech rate had fairly strong correlations with the perceived fluency ratings (r ≥ 0.77, p < 0.01), which generally corroborates the findings of other similar studies (Yang 2015; Han et al. 2020). However, the across-the-board association observed in this study between the perceived fluency ratings and temporal measures of all three types (speed fluency, breakdown fluency and repair fluency) was not found in previous research (e.g., Yang 2015; Han et al. 2020), in which only measures of speed fluency, such as speech rate, and measures of pausing, such as mean length of silent pauses, emerged as useful correlates of perceived fluency. Further empirical studies are therefore needed to provide more evidence.

The two newly added melodic parameters were mainly found to correlate weakly or moderately with the four judged fluency criteria: the median pitch correlated significantly (but moderately) with the judged speed of delivery and the judged overall fluency; the size of the effective pitch range correlated significantly (but moderately) with the judged control of pauses (Table 5). These results resonate with findings of previous research (Collados Aís et al. 2007; Christodoulides and Lenglet 2014), suggesting that interpreting-specific prosodic features are associated with the perception of fluency. It is also worth noting that, unlike the twelve temporal measures of fluency, the two melodic parameters did not change systematically with delivery, meaning that pitch level and effective pitch range may not improve with repetition in interpreting.

The study also identified which of the fourteen quantitative prosodic measures could best predict the fluency ratings. The results of the linear regression models, in terms of useful predictors of judged fluency, are largely consistent with those found in our previous study (Yu and van Heuven 2017). Effective speech rate (i.e., the number of syllables, excluding disfluencies, divided by the total duration of speech production and pauses) appeared to be the best predictor of all four judged fluency criteria in consecutive interpreting (Tables 6, 7, 8 and 9), which might be explained by the fact that effective speech rate incorporates all three aspects of fluency (speed fluency, breakdown fluency, and repair fluency). The other temporal measures that related (albeit less closely) to the fluency ratings were number of filled pauses, articulation rate, and mean length of pause. However, the two melodic parameters, i.e., median pitch and size of effective pitch range, were not among the predictors of judged fluency in this study, although previous studies suggest that confident/dominant speakers have a lower (mean) pitch (Ohala 1983) and execute larger pitch movements (Gussenhoven 2004). This might be attributed to one of the following reasons: (a) compared with temporal measures of fluency, which are sensitive to the varying cognitive effort involved in complex speech tasks such as consecutive interpreting, melodic features are relatively stable as a result of long-term training and thus less prone to alteration, even when the cognitive load is reduced through repetition; (b) the effects of the melodic parameters on judged fluency were only measured with global parameters in this study, so we cannot rule out the possibility that specific local pitch differences correlate with perceived competence and/or fluency of interpreting.

The exploration of quantitative parameters of judged fluency in interpreting offers insights into which features of an interpretation potentially contribute to judged fluency. The results of this experimental study have implications for the automatic assessment of fluency in interpreting performance. The advent of artificial intelligence for scoring spoken language (Litman et al. 2018) opens up new possibilities for fluency assessment in interpreting to be delivered with precision, consistency and objectivity. Given the availability of technologies that can easily detect syllables, pauses, and disfluencies by running relevant scripts in the speech analysis software package Praat, and that can very quickly calculate these objective measures, it is likely that the labor-intensive and impressionistic rating of consecutive interpreting exams could be facilitated by quantitative measurement of effective speech rate or a combination of the good fluency predictors identified. Building on the empirical results of this study, together with those of other studies (e.g., Han et al. 2020), automatic scoring systems such as SpeechRater (for scoring spontaneous non-native speech in the context of the TOEFL iBT Practice Online) and Pearson's Ordinate technology (for scoring the spoken portion of PTE Academic) may eventually be applied to interpreting assessment. This could be quite efficient, at least in terms of screening out candidates who do not score satisfactorily on the major objective fluency measures useful in predicting human ratings. Our study should thus serve as an initial step towards the development of an automatic quantitative assessment tool for fluency in interpreting.

The empirical results may also inform and improve the design of rubrics for interpreting fluency assessment. For raters who are gauging fluency, the rubrics should perhaps emphasize effective speech rate, number of filled pauses, articulation rate, and mean length of pauses over the other objective fluency measures that did not contribute to explaining variance in the ratings and were therefore considered poor empirical correlates in our preliminary study. However, further research on a larger scale, drawing on state-of-the-art knowledge from language testing and other disciplines, will be needed to identify the most suitable scalar descriptors for the evaluation of interpreting fluency.

Furthermore, the findings of this study may help facilitate formative assessment in interpreting training. More specifically, interpreting teachers may need to guide trainees to develop a more comprehensive understanding of fluency, incorporating not only speed fluency but also breakdown and repair fluency. Our experiment shows that the number of filled pauses and the number of disfluencies were halved in delivery 3 compared with delivery 1 (Table 4). Two aspects of fluency (i.e., breakdown fluency and repair fluency) thus seem to have a more important effect on fluency ratings in consecutive interpreting than was the case in previous research on L2 read and spontaneous speech (Cucchiarini et al. 2000, 2002; Kormos and Dénes 2004; Pinget et al. 2014). In addition, interpreting trainees need to be oriented and instructed in the melodic aspects of delivery, which proved to be significantly, though not strongly, correlated with judged fluency. As such, the objective prosodic information (temporal and melodic) generated (semi-)automatically may serve as evidence of trainees' learning progress, to be fed back to them by trainers, peers or the trainees themselves.
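To illustrate the kind of screening application envisaged here, the sketch below applies a linear prediction rule to a candidate's objective fluency measures and flags low predicted fluency. The weights, intercept and cut score are hypothetical placeholders for illustration only, not the regression estimates reported in this chapter.

```python
# Illustrative screening rule of the kind the chapter envisages.
# All numeric values below are hypothetical placeholders, NOT the study's estimates.
WEIGHTS = {"effective_speech_rate": 1.2, "n_filled_pauses": -0.05,
           "articulation_rate": 0.4, "mean_pause_length": -0.8}
INTERCEPT = 1.0     # hypothetical
CUT_SCORE = 5.0     # hypothetical pass/fail screen on the 1-10 rating scale

def predicted_fluency(measures: dict) -> float:
    """Linear prediction of a human fluency rating from objective measures."""
    return INTERCEPT + sum(WEIGHTS[k] * measures[k] for k in WEIGHTS)

candidate = {"effective_speech_rate": 2.9, "n_filled_pauses": 8,
             "articulation_rate": 4.6, "mean_pause_length": 0.55}  # example input
print("screen out" if predicted_fluency(candidate) < CUT_SCORE else "retain")
```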

7 Conclusion This experimental study probed into the quantitative assessment of fluency in consecutive interpreting by examining potential correlations between judged fluency and automatically quantified temporal and melodic measures. Effective speech rate, percentage of articulation time, length of fluent runs, and speech rate were found to correlate fairly strongly with the perceived fluency ratings. The regression models showed that effective speech rate was the best predictor of perceived fluency, followed by number of filled pauses, articulation rate, and mean length of pause. The melodic measures (median pitch and size of effective pitch range) were not found to contribute to the prediction of the fluency ratings.

The study has several limitations. First, there was a potential concern of multicollinearity among both the judged fluency criteria and the acoustic temporal measures. In other words, the four judged fluency criteria might not be independent and mutually exclusive, and the same might apply to the twelve acoustic fluency measures. For example, it is possible that speed of delivery is associated with both control of pauses and control of disfluencies. The overall fluency rating might be connected with the three partial criteria of fluency, while the mean length of fluent runs might be related to the number of filled pauses and the number of disfluencies. Therefore, the results concerning the relative contribution of the objective measures to explaining variance in the fluency ratings should be interpreted with caution. Future research may need to circumvent the problem of multicollinearity by choosing variables that are, in theory and in practice, not too highly interrelated. The second limitation lies in our overly conservative approach to detecting and categorizing silent/filled pauses and repetitions as disfluencies, since interpreters might deliberately use these for clarity and emphasis. Drawing on work from applied linguistics, psycholinguistics, discourse analysis and sociolinguistics, future studies should therefore develop a more sophisticated way of conceptualizing disfluencies (de Jong 2018) and make an effort to distinguish (non-planned) disruptive disfluencies from planned speech pauses, which have been found to enhance communication (Scharpff and van Heuven 1988; van Heuven and Scharpff 1991; Scharpff 1994). Last, the study lacked ecological validity, since no listeners or audiences were present. Such a design does not reflect the situational dynamics of consecutive interpreting in real-life practice. More ecologically valid data are needed in future studies to verify the findings obtained here.

To sum up, the present experimental study investigated the relations between subjective ratings of fluency and objectively quantified prosodic measures in consecutive interpreting. The most powerful predictor(s) of perceived fluency might ultimately lend themselves to the development of automatic assessment of trainee interpreters' fluency in examination settings, inform and improve the design of rubrics for interpreting fluency assessment, and facilitate formative assessment. We call for future research that overcomes the limitations described above to further investigate the relationship between acoustically measured variables of fluency and listeners' perceived fluency in interpreting.

Acknowledgements This research was supported by a Grant for the "Prosodic Aspects in Chinese to English and English to Chinese Consecutive Interpreting Training" project sponsored by Shanghai International Studies University.

References AIIC. 2002. Regulation governing admissions and language classification. Geneva: AIIC. Bortfeld, Heather, Silvia D. Leon, Jonathan E. Bloom, Michael F. Schober, and Susan E. Brennan. 1999. Which speakers are most disfluent in conversation and when? Proceedings ICPhS99 Satellite Meeting on Disfluency in Spontaneous Speech: 7–10.


Brumfit, Christopher. 1984. Communicative methodology in language teaching: The roles of fluency and accuracy. Cambridge: Cambridge University Press. Bühler, Hildegund. 1986. Linguistic (semantic) and extra-linguistic (pragmatic) criteria for the evaluation of conference interpretation and interpreters. Multilingua 5 (4): 231–235. Chambers, Francine. 1997. What do we mean by fluency? System 25 (4): 535–544. Cheung, Andrew K. F. 2013. Non-native accents and simultaneous interpreting quality perceptions. Interpreting 15 (1): 25–47. Chiaro, Delia, and Giuseppe Nocella. 2004. Interpreters’ perception of linguistic and non-linguistic factors affecting quality: A survey through the World Wide Web. Meta 49 (2): 278–293. Christodoulides, George, and Cédric Lenglet. 2014. Prosodic correlates of perceived quality and fluency in simultaneous interpreting. Proceedings 7th International Conference on Speech Prosody 2014, 1002–1006. Collados Aís, Ángela. 1998/2002. Quality assessment in simultaneous interpreting: The importance of nonverbal communication. In The interpreting studies reader, ed. Franz Pöchhacker and Miriam Shlesinger, 327–336. London and New York: Routledge. Collados Aís, Ángela, Macarena E. Pradas Macías, Stévaux Elisabeth Macarena, and Olalla García Becerra. 2007. La evaluación de la calidad en interpretación simultánea: Parámetros de incidencia. Granada: Comares. Cucchiarini, Catia, Helmer Strik, and Louis Boves. 2000. Quantitative assessment of second language learners’ fluency by means of automatic speech recognition technology. Journal of the Acoustical Society of America 107 (2): 989–999. Cucchiarini, Catia, Helmer Strik, and Louis Boves. 2002. Quantitative assessment of second language learners’ fluency: Comparisons between read and spontaneous speech. Journal of the Acoustical Society of America 111 (6): 2862–2873. de Jong, Nivja H. 2018. Fluency in second language testing: Insights from different disciplines. Language Assessment Quarterly, 15 (3): 237–254. de Jong, Nivja H., and Ton Wempe. 2009. Praat script to detect syllable nuclei and measure speech rate automatically. Behavior Research Methods 41 (2): 385–390. Déjean Le Féal, Karla. 1990. Some thoughts on the evaluation of simultaneous interpretation. In Interpreting—Yesterday, today and tomorrow, ed. David Bowen, and Margareta Bowen, 154–160. Binghamton: SUNY Fillmore, Charles J. 1979. On fluency. In Individual differences in language ability and language behaviour, ed. Charles J. Fillmore, Daniel Kempler, and William S.-Y. Wang, 85–101. New York: Academic Press. Freed, Barbara F. 1995. What makes us think that students who study abroad become fluent? In Second language acquisition in a study-abroad context, ed. Barbara F. Freed, 123–148. Amsterdam: John Benjamins. Gile, Daniel. 1995. Basic concepts and models for interpreter and translator training. Amsterdam: John Benjamins. Goldman-Eisler, Frieda. 1968. Psycholinguistics: Experiments in spontaneous speech. New York: Academic Press. Grosjean, François. 1980. Temporal variables within and between languages. In Towards a crosslinguistic assessment of speech production, ed. Hans W. Dechert and Manfred Raupach, 39–53. Frankfurt: Lang. Grosjean, François, and Alain Deschamps. 1975. Analyse contrastive des variables temporelles de l’Anglais et du Francais: Vitesse de parole et variables composantes, phénomènes d’hésitation. Phonetica 31 (3/4): 144–184. Gussenhoven, Carlos. 2004. The phonology of tone and intonation. New York: Cambridge University Press. 
Han, Chao, Sijia Chen, Rongbo Fu, and Qin Fan. 2020. Modeling the relationship between utterance fluency and raters' perceived fluency of consecutive interpreting. Interpreting 22 (2): 211–237.
Holub, Elisabeth, and Sylvi Rennert. 2011. Fluency and intonation as quality indicators. Paper presented at the Second International Conference on Interpreting Quality. Spain: Casa de la Cultura de Almuñécar. Kormos, Judit, and Mariann Dénes. 2004. Exploring measures and perceptions of fluency in the speech of second language learners. System 32 (2): 145–164. Kurz, Ingrid. 1993. Conference interpretation: Expectations of different user groups. the Interpreters’ Newsletter 5: 3–16. Kurz, Ingrid. 2001. Conference interpreting: Quality in the ears of the user. Meta 46 (2): 394–409. Kurz, Ingrid. 2003. Quality from the user perspective. In La evaluación de la calidad en interpretación: Investigación, ed. Ángela Collados Aís, Manuela Fernández Sanchez, and Daniel Gile, 3–22. Granada: Comares. Leeson, Richard. 1975. Fluency and language teaching. London: Longman. Lennon, Paul. 1990. Investigating fluency in EFL: A quantitative approach. Language Learning 40 (3): 387–417. Lennon, Paul. 2000. The lexical element in spoken second language fluency. In Perspectives on fluency, ed. Heidi Riggenbach, 25–42. Ann Arbor, MI: The University of Michigan Press. Levelt, Willem J. M. 1989. Speaking: From intention to articulation. Cambridge, MA: MIT Press. Litman, Diane, Helmer Strik, and Gad S. Lim. 2018. Speech technologies and the assessment of second language speaking: Approaches, challenges, and opportunities. Language Assessment Quarterly 15 (3): 294–309. Mead, Peter. 2005. Methodological issues in the study of interpreters’ fluency. The Interpreters’ Newsletter 13: 39–63. Möhle, Dorothea. 1984. A comparison of the second language speech production of different native speakers. In Second language productions, ed. Hans W. Dechert, Dorothea Möhle, and Manfred Raupach, 26–49. Tübingen: Gunter Narr. Moser, Peter. 1996. Expectations of users of conference interpretation. Interpreting 1 (2): 145–178. Nation, Paul. 1989. Improving speaking fluency. System 17 (3): 377–384. Ohala, John J. 1983. Cross-language use of pitch: An ethological view. Phonetica 40 (1): 1–18. Pinget, A.-F., Hans Rutger Bosker, Hugo Quené, and Nivja H. De Jong. 2014. Native speakers’ perceptions of fluency and accent in L2 speech. Language Testing 31 (3): 349–365. Pradas Macías, E. Macarena. 2003. Repercusión del intraparámetro pausas silenciosas en la fluidez: Influencia en las expectativas y en la evaluación de la calidad en interpretación simultánea. Ph.D. dissertation, University of Granada. Pradas Macías, E. Macarena. 2007. La incidencia del parámetro fluidez. In La evaluación de la calidad en interpretación simultánea: Parámetros de incidencia, ed. Ángela Collados Aís, E. Macarena Pradas Macías, Elisabeth Stévaux, and Olalla García Becerra, 53–70. Granada: Comares. Pöchhacker, Franz. 1993. On the science of interpretation. The Interpreters’ Newsletter 5: 52–59. Pöchhacker, Franz. 2012. Interpreting quality: Global professional standards? In Interpreting in the age of globalization: Proceedings of the 8th National Conference and International Forum on Interpreting, ed. Wen Ren, 305–318. Beijing: Foreign Language Teaching and Research Press. Rennert, Sylvi. 2010. The impact of fluency on the subjective assessment of interpreting quality. The Interpreters’ Newsletter 15 (3/4): 101–115. Riggenbach, Heidi. 1991. Toward an understanding of fluency: A microanalysis of non-native speaker conversations. Discourse Processes 14 (4): 423–441. Scharpff, Peter J. 1994. 
The effect of speech pauses on the recognition of words in read-out sentences. PhD dissertation, Leiden University. Scharpff, Peter J., and Vincent J. van Heuven. 1988. Effects of pause insertion on the intelligibility of low quality speech. In Proceedings of the 7th FASE/Speech-88 Symposium, ed. W.A. Ainsworth and J.N. Holmes, 261–268. Edinburgh: The Institute of Acoustics. https://hdl.handle.net/1887/2591. Accessed 26 February 2021. Schmidt, Richard. 1992. Psychological mechanisms underlying second language fluency. Studies in Second Language Acquisition 14 (4): 357–385.
Shlesinger, Miriam. 1994. Intonation in the production and perception of simultaneous interpretation. In Bridging the Gap: Empirical Research in Simultaneous Interpretation, ed. Sylvie Lambert and Barbara Moser-Mercer, 225–236. Amsterdam and Philadelphia: John Benjamins. Tavakoli, Parvaneh, and Peter Skehan. 2005. Strategic planning, task structure and performance testing. In Planning and task performance in a second language, ed. Rod Ellis, 239–273. Amsterdam: John Benjamins. Towell, Richard, Roger Hawkins, and Nives Bazergui. 1996. The development of fluency in advanced learners of French. Applied Linguistics 17 (1): 84–119. Townshend, Brent, Jared Bernstein, Ognjen Todic, and Eryk Warren. 1998. Estimation of spoken language proficiency. Proceedings of the ESCA Workshop Speech Technology in Language Learning (STiLL 98): 179–182. van Heuven, Vincent J., and Peter J. Scharpff. 1991. Acceptability of several speech pausing strategies in low quality speech synthesis: interaction with intelligibility. Proceedings of the 12th International Congress of Phonetic Sciences, Aix-en-Provence: 458–461. van Heuven, Vincent J. 1994. Introducing prosodic phonetics. In Experimental studies of Indonesian prosody, ed. Cecilia Odé and Vincent J. van Heuven, Semaian 9, 1–26. Leiden: Vakgroep Talen en Culturen van Zuidoost-Azië en Oceanië, Leiden University. https://core.ac.uk/download/pdf/ 15590627.pdf. Accessed 25 February 2021. van Heuven, Vincent J. 2017. Prosody and sentence type in Dutch. Nederlandse Taalkunde 22 (1): 3–29, 44–46. Yang, Liuyan. 2015. An exploratory study of fluency in English output of Chinese consecutive interpreting learners. Journal of Zhejiang International Studies University 1: 60–68. Yu, Wenting, and Vincent J. van Heuven. 2013. Effects of immediate repetition at different stages of consecutive interpreting: An experimental study. In Linguistics in the Netherlands 2013, eds. Suzanne Aalberse, and Anita Auer, 201–213. Amsterdam: John Benjamins. Yu, Wenting, and Vincent J. van Heuven. 2017. Predicting judged fluency of consecutive interpreting from acoustic measures. Potential for automatic assessment and pedagogic implications. Interpreting 19 (1): 47–69. Zhou, Dandan. 2006. A study on the effects of input frequency and output frequency. Modern Foreign Languages 29 (2): 154–163. Zwischenberger, Cornelia, and Pöchhacker, Franz. 2010. Survey on quality and role: Conference interpreters’ expectations and self-perceptions. Communicate! https://www.aiic.net/ViewPage. cfm/article2510.htm. Accessed 21 January 2013.

Wenting Yu teaches interpreting for the translation and interpretation majors at Shanghai International Studies University (SISU). She obtained her Ph.D. degree in Linguistics from SISU in 2012. She was a visiting post-doctoral researcher at Leiden University Center for Linguistics (2012–2013). She specializes in interdisciplinary studies of interpreting, prosody, psycholinguistics and second language acquisition. Her publications include work on cognitive processing in interpreting, and interpreter assessment and training. Vincent J. van Heuven is emeritus professor of Experimental Linguistics and Phonetics at Leiden University, and at the University of Pannonia (Veszprém, Hungary). He served as director of the Holland Institute of Linguistics and the Leiden University Centre for Linguistics, chair of the Netherlands Graduate School in Linguistics, and vice-president/secretary of the Permanent Council of the International Phonetics Association. A former (associate) editor of several international book series and professional journals, he is a member of the Royal Netherlands Academy of Arts and Sciences.

Chasing the Unicorn? The Feasibility of Automatic Assessment of Interpreting Fluency Zhiwei Wu

Abstract This chapter examines the feasibility of using Praat for automatic assessment of interpreting fluency (i.e., speed and breakdown fluency). A total of 140 audio recordings were collected in an English-to-Chinese consecutive interpreting exam. Two raters assessed the fluency of the interpreting performance using a four-point scale, and their mean scores were used as the perceived fluency scores. Two versions of the audio files, unedited and edited (with inter-segmental silences and noises removed), were subjected to Praat analysis in two conditions (intensity-adjusted versus intensity-unadjusted). In the intensity-adjusted condition, the intensity threshold was set differently for individual recordings, depending on the maximum intensity and the 99% quantile intensity. In the intensity-unadjusted condition, the threshold was set uniformly for all the audio files. The correlations between Speed Fluency, Breakdown Fluency, and Perceived Fluency were examined across the four conditions (i.e., edited and intensity-adjusted, edited and intensity-unadjusted, unedited and intensity-adjusted, unedited and intensity-unadjusted). Statistical analyses showed that: (a) removing inter-segmental silences and noises slightly improved the correlation coefficients; (b) adjusting the intensity threshold significantly improved the correlation coefficients; (c) a silent pause threshold of 0.25 s produced the best correlation coefficients; and (d) Mean Silence Duration had the strongest correlation with judged fluency. Based on these findings, a flowchart is designed to guide teachers' decision-making in fluency assessment, ranging from pre-processing of audio files, to configuration of parameters in Praat, and finally to relating acoustic measures to assessment goals. Keywords Automatic assessment · Breakdown fluency · Interpreting performance · Praat · Speed fluency

Z. Wu (B) Department of Chinese and Bilingual Studies, The Hong Kong Polytechnic University, Hong Kong, China e-mail: [email protected] © Springer Nature Singapore Pte Ltd. 2021 J. Chen and C. Han (eds.), Testing and Assessment of Interpreting, New Frontiers in Translation Studies, https://doi.org/10.1007/978-981-15-8554-8_7


1 Introduction Fluency, as laypeople understand it, is synonymous with proficiency: “She is fluent in English.” While this broad sense of fluency has been popular in laypeople’s discourse, researchers are more interested in a narrow sense of fluency, defined as the “rapid, smooth, accurate, lucid, and efficient translation of thought or communicative intention under the temporal constraints of on-line processing” (Lennon 2000, p. 26). Segalowitz (2010) further theorizes speech fluency into three aspects: cognitive fluency, utterance fluency, and perceived fluency. Studies on first and second language (L1 and L2) acquisition are particularly interested in the relationship between utterance fluency and perceived fluency. This is because acoustic features of utterances (e.g., speech rate, length of pauses, and frequency of repairs) are possible predictors of listeners’ subjective perception of speakers’ fluency. Skehan (2003) proposes three categories of measures to objectively assess utterance fluency: (a) speed fluency, (b) breakdown fluency, and (c) repair fluency. Conventional measures of speed fluency include Speech Rate (number of syllables divided by total response time), Phonation Time Ratio (speaking time divided by total response time), and Mean Length of Run (mean number of syllables between silent pauses) (Tavakoli and Skehan 2005; Baker-Smemoe et al. 2014). Breakdown fluency is frequently measured by the number and the length of silent pauses. Repair fluency is usually assessed by the number of repetitions, false starts, and self-corrections. In the field of interpreting studies, fluency and dysfluency have received sustained scholarly attention over the past decades (e.g., Tissi 2000; Cecot 2001; Mead 2000, 2005; Gósy 2005; Rennert 2010). A recent trend to extend this line of research is to relate acoustic measures to perceived fluency in interpreting, but mixed results have been reported. For instance, Han (2015) found that Speech Rate (SR), Phonation Time Ratio (PTR), and Mean Length of Run (MLR) were significantly correlated with the raters’ perceived fluency in English-to-Chinese simultaneous interpreting. Recently, Han et al. (2020) further found that SR, PTR, MLR and Mean Length of Unfilled Pauses (MLUP) were strongly correlated with the raters’ judged fluency, with MLR and MLUP being the most significant predictors. However, Zhang and Song (2019) found a different set of acoustic features that were correlated with the raters’ fluency ratings: the number of (un)filled pauses and the number of repetitions. Furthermore, Yu and van Heuven (2017) reported that all the acoustic measures in their study correlated moderately or strongly with the raters’ judged fluency, but Effective Speech Rate (number of syllables, excluding false starts, repetitions and repairs, divided by total speaking time) had the highest correlation. These findings offer meaningful empirical evidence for the potentiality of automatic assessment of interpreting fluency because some of the acoustic measures can be automatically annotated by speech analysis programs. This approach of relating automatically assessed/assessable acoustic measures to human raters’ fluency judgement has received sustained attention in L2 fluency literature over the past two


decades (e.g., Cucchiarini et al. 2000; Kang and Johnson 2018). However, the feasibility of this automatic approach remains unclear in the context of assessing interpreting fluency for at least three reasons. First, except Han (2015) and Han et al. (2020), the existing studies are based on relatively small samples of students (12 students in Zhang and Song [2019]; 24 short excerpts by 12 students in Yu and van Heuven [2017]). The extent to which these findings could be generalized to other populations and longer interpreting performances remains unclear. Second, given that the research results are largely inconsistent across the studies, teachers are uncertain about which acoustic feature(s) should be examined when submitting students’ audio files to automatic analysis. Han’s (2015) results seem to highlight the importance of the number of syllables uttered per second, while Zhang and Song’s results (2019) accentuate the role of pauses in fluency judgement. Third, these studies operationalize silent pauses differently. Han et al. (2020) and Yu and van Heuven (2017) used 0.25 s (seconds) as the cut-off point, but Han (2015) and Zhang and Song (2019) set the pause threshold at 0.5 s. It should be noted that previous studies on L1 and L2 oral performance usually set the threshold at 0.25–0.4 s (Kormos and Dénes 2004; Iwashita et al. 2008; Rossiter 2009), although some studies use a cutoff point as low as 0.1 s (e.g., Riazantseva 2001) or as high as 1 s (e.g., Mehnert 1998). In the literature on interpreting fluency, thresholds ranging from 0.2 to 2 s have been reported (Han 2015). Han and An (2020) compared a range of thresholds from 0.25 to 1 s with an interval of 0.05 s. After examining the correlation between acoustic measures with perceived fluency scores, they “advise selecting pause thresholds of 0.35 s to 0.50 s for English-to-Chinese direction, and 0.25 s to 0.35 s for the Chinese-to-English direction” (p. 14). As Han and An’s (2020) study was based on manual annotation of speech samples, it would be interesting to know how different silent pause thresholds in automatic annotation would influence the correlations with human raters’ fluency ratings. In light of the aforementioned issues, this study seeks to examine the feasibility of automatic fluency assessment by drawing on a relatively large sample (i.e., 140 recordings of interpreting performance, with the duration of each recording being about five minutes). Five acoustic features that index speed fluency and breakdown fluency are examined because: (a) they have been found to correlate with human raters’ fluency perception in the literature (see Sect. 2.2 for details); and (b) they can be automatically annotated in Praat, a speech analysis program (Boersma and Weenink 2019). De Jong and Wempe (2009, 2010) offer very useful Praat scripts to automatically count syllables and silent pauses in speech samples. In the scripts, two parameters (i.e., silent pause threshold and intensity threshold) call for special attention because Praat uses them to “recognize” silences. The intensity threshold “determines the maximum silence intensity value in dB with respect to the maximum intensity” of an audio file (Boersma and Weenink 2019). For instance, we may set the intensity threshold at −35 dB and silent pause threshold at 0.25 s. If the maximum intensity of a sound file is 75 dB, any sound portion longer than 0.25 s and below 40 dB (75 minus 35) will be automatically annotated as “silence” by Praat. 
As intensity levels and lengths of silent pauses vary greatly within and between speech samples, different thresholds will influence the precision of automatic annotation.
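The detection rule just described can be sketched as follows; the code imitates the logic of the two parameters rather than reproducing De Jong and Wempe's actual Praat script, and the frame data are invented.

def detect_silences(frames, max_intensity_db, intensity_threshold_db=-35.0, min_pause_s=0.25):
    # frames: list of (time_s, intensity_db) pairs at a fixed analysis step.
    # An interval counts as silence if it stays below the ceiling for at least min_pause_s.
    ceiling = max_intensity_db + intensity_threshold_db   # e.g. 75 dB - 35 dB = 40 dB
    silences, start = [], None
    for t, db in frames:
        if db < ceiling:
            start = t if start is None else start
        else:
            if start is not None and t - start >= min_pause_s:
                silences.append((start, t))
            start = None
    return silences

if __name__ == "__main__":
    # 0.6 s of toy frames: speech (55 dB), a 0.3 s dip to 30 dB, then speech again.
    frames = ([(i * 0.01, 55.0) for i in range(15)]
              + [(0.15 + i * 0.01, 30.0) for i in range(30)]
              + [(0.45 + i * 0.01, 55.0) for i in range(15)])
    print(detect_silences(frames, max_intensity_db=75.0))   # roughly [(0.15, 0.45)]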


In interpreting studies, although Han and An's (2020) study provides useful findings about selecting silent pause thresholds, there has been no systematic comparison of intensity thresholds, which makes it difficult to configure Praat parameters for automatic assessment. Also absent in the literature is the examination of the necessity of pre-processing audio files before Praat analysis. This procedure may affect the precision of automatic assessment because inter-segmental silences and noises are usually recorded in audio files of students' consecutive interpreting (see Sect. 2.1 for details). To examine the feasibility of using Praat for automatic assessment of interpreting fluency, this study addresses the following research questions (RQ):
RQ1: Is it necessary to pre-process audio files before subjecting them to Praat analysis/annotation?
RQ2: To what extent do different intensity thresholds in Praat affect the automatic assessment results?
RQ3: To what extent do different silent pause thresholds in Praat affect the automatic assessment results?
RQ4: Which of the five acoustic measures (i.e., Speech Rate, Phonation Time Ratio, Mean Length of Run, Mean Silence Duration, and Number of Silent Pauses) are correlated with raters' perceived fluency?

2 Method 2.1 Context and Participants A total of 140 recordings of interpreting students (104 female and 36 male) were collected from a final exam for the course of English-to-Chinese Consecutive Interpreting. All the students were the fourth-year undergraduates majoring in English. They had Chinese as their A language and English their B language. The source text to be interpreted was a coherent and logical speech on four stages of developing a skill. The duration of the source speech was four and a half minutes, divided into nine segments. The students’ interpreting performance was recorded in a language lab equipped with a system called Lan Ge. Functionally, the system is similar to other popular language (interpreting) training systems in China. For instance, an individual student’s interpreting performance can be saved as individual segments or merged into one audio file. In this study, the latter option was chosen for the ease of processing and managing files. The original speech was not recorded as part of the recording. However, fixed periods of time were inserted between segments in the original speech, and the students needed to finish interpreting a segment within the given time limit. This artificial intervention created inter-segmental silences and noises when some students finished their interpreting and waited for the next segment. Although the inter-segmental silences and noises might not have negative impacts on human raters’ quality assessment, they could influence the precision of automatic annotation in Praat. As a result, it is debatable whether we need to pre-process the audio files before automatic analysis (more on this in the next subsection).


2.2 Data Processing and Analysis The processing and analytical procedures were divided into four stages (see also Fig. 1). In the first stage, two research assistants, both native Chinese speakers, were hired to rate the fluency of the students' interpreting performance. One assistant had an M.A. degree in linguistics and the other had an M.A. degree in translation and interpreting. The raters were given a week to rate the 140 interpretations. To ensure rating consistency, the source text was divided into 29 assessment units, each of which was rated on a four-point scale adapted from Han et al. (2020). If the performance had no or few dysfluent features, three points were awarded. If the performance had some dysfluent features that did not impede comprehension, two points were given. If some dysfluent features were present and hampered comprehension, one point was given. If dysfluent features severely affected comprehension, no point was given. The inter-rater reliability was measured by the Pearson correlation coefficient (r = 0.86), which showed that the ratings were consistent between the two raters. The raters' scores were averaged as the perceived fluency score for each interpreting performance.

In the second stage, four conditions were created based on our editing of the recordings (i.e., edited versus unedited) and the Praat configuration (i.e., intensity-adjusted versus intensity-unadjusted). A research assistant was hired to edit out all the inter-segmental silences and noises and replace them with silences below 250 ms, a commonly used lower cut-off point for measuring pauses in the literature (Bosker et al. 2013; De Jong and Bosker 2013; Préfontaine et al. 2016). This step produced 140

Fig. 1 Processing and analytical procedures (Notes The bolded parts indicate the range of particular Praat parameters. For instance, under the “Edited recordings” box, the second “Praat parameters” box indicates that the silent pause threshold was uniformly set at 0.25 s and the intensity threshold was set at −15 dB, −16 dB, −17 dB, and up to −42 dB, respectively)


edited recordings vis-à-vis 140 original/unedited recordings. These recordings were then subjected to analysis in Praat, using a script adapted from De Jong and Wempe (2009, 2010). At this point, an intensity threshold and a silent pause threshold needed to be determined for Praat to "recognize" silences. When the edited audio files were used, the silent pause threshold was set at 0.25 s (note that in the fourth stage we also experimented with a range of pause thresholds). When the unedited audio files were used, we defined silent intervals as those longer than 0.25 s and shorter than 5 s. The upper limit was set at 5 s because we examined the edited audio files and found that the pauses were longer than 5 s in only four instances. In other words, the pauses longer than 5 s in the unedited sound files were almost always inter-segmental silences produced by the students waiting for the next segment, and thus were not included in the calculation. With respect to the intensity threshold, this study adapted De Jong and Bosker's (2013) method and compared the effects of a range of intensity thresholds on the correlation between automatic assessment results and judged fluency. In Praat, the intensity threshold can be adjusted based on individual audio files or left unadjusted. In the unadjusted condition, a single intensity threshold X was set uniformly for all the audio files. In the adjusted condition, the threshold was calculated as follows: X − (Maximum intensity − 99% quantile of intensity) (De Jong and Wempe 2009, 2010). To illustrate the difference between the two conditions, consider a hypothetical Recording A, with a maximum intensity of 85 dB and a 99% quantile intensity of 78 dB. If we set the intensity constant X at −20 dB and the silent pause threshold at 0.25 s, Praat will produce different annotation results for the two conditions. In the unadjusted condition, any sound portion longer than 0.25 s with an intensity value lower than 65 dB (85 − 20) in Recording A will be annotated as silence. However, in the adjusted condition, any sound portion longer than 0.25 s with an intensity value lower than 58 dB [85 − 20 − (85 − 78)] will be annotated as silence. In De Jong and Wempe's (2010) script, X is an intensity constant set at −20 or −25 dB. However, in this study, as we attempted to experiment with a range of thresholds, X was set from −1 to −35 dB in the adjusted condition. In the unadjusted condition, the intensity constants ranged from −15 to −42 dB. These ranges were selected because Praat failed to recognize any silence in one third of the recordings when X was set outside the range of −1 to −35 dB (the adjusted condition) or −15 to −42 dB (the unadjusted condition). In other words, if the intensity constants were set outside these ranges, Praat would return more null data and severely distort the results.

In the third stage, we used Praat to automatically annotate the audio files in the 2 × 2 conditions (edited-adjusted, edited-unadjusted, unedited-adjusted, and unedited-unadjusted) and produced a set of measurements. Five acoustic measures were selected as the independent variables because (a) they could be automatically measured in Praat; and (b) they were frequently found to be correlated with perceived fluency in previous studies. Table 1 summarizes the independent variables, their definitions, and relevant studies in which they are investigated. Correlation analyses were conducted between the independent variables and the dependent variable across the 2 × 2 conditions.
As the acoustic data were not normally distributed, Spearman's correlation coefficients (rs) were calculated.
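To make the two threshold conditions and the correlation step concrete, the sketch below reproduces the hypothetical Recording A example (maximum intensity 85 dB, 99% quantile 78 dB) and computes a Spearman correlation with SciPy; the measure and rating values are invented and serve only to show the procedure.

from scipy.stats import spearmanr

def silence_ceiling(max_db, quantile99_db, x_db, adjusted=True):
    # Maximum intensity (in dB) below which a sufficiently long interval counts as silence.
    threshold = x_db - (max_db - quantile99_db) if adjusted else x_db
    return max_db + threshold

if __name__ == "__main__":
    print(silence_ceiling(85.0, 78.0, -20.0, adjusted=False))   # 65.0 dB (unadjusted)
    print(silence_ceiling(85.0, 78.0, -20.0, adjusted=True))    # 58.0 dB (adjusted)

    mean_silence_duration = [0.55, 0.92, 0.61, 0.80, 0.48, 0.73]   # per recording
    judged_fluency = [68, 41, 55, 63, 72, 50]
    rho, p = spearmanr(mean_silence_duration, judged_fluency)
    print(round(float(rho), 2), round(float(p), 3))                # a negative rho is expected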

Table 1 A summary of the acoustic features (treated as the independent variables)

Aspect | Acoustic measure | Definition | Relevant study
Speed fluency | Speech Rate (SR) | Number of syllables divided by total response time (i.e., including silent pauses) in seconds | Kormos and Dénes (2004), Han (2015), Yu and van Heuven (2017)
Speed fluency | Phonation Time Ratio (PTR) | Speaking time divided by total response time | Kormos and Dénes (2004), Han (2015), Yu and van Heuven (2017)
Speed fluency | Mean Length of Run (MLR) | Number of syllables divided by number of fluent intervals (i.e., intervals between silent pauses) | Isaacs and Trofimovich (2012), Han (2015), Préfontaine et al. (2016)
Breakdown fluency | Mean Silence Duration (MSD) | Total silent pause time divided by number of silent pauses | Kormos and Dénes (2004), Yu and van Heuven (2017), Han et al. (2020)
Breakdown fluency | Number of Silent Pauses per second (NSP) | Number of silent pauses divided by total response time | Iwashita et al. (2008), Yu and van Heuven (2017)
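As a concrete (and simplified) illustration of the definitions in Table 1, the five measures can be derived from a pause-annotated recording along the following lines; the interval format and values are assumptions for demonstration, not the script used in the study.

def fluency_measures(intervals):
    # intervals: (duration_s, n_syllables) pairs; n_syllables == 0 marks a silent pause.
    total_time = sum(d for d, _ in intervals)
    speaking_time = sum(d for d, s in intervals if s > 0)
    syllables = sum(s for _, s in intervals)
    pauses = [d for d, s in intervals if s == 0]
    runs = [s for _, s in intervals if s > 0]
    return {
        "SR": syllables / total_time,               # syllables / total response time
        "PTR": speaking_time / total_time,          # speaking time / total response time
        "MLR": syllables / max(len(runs), 1),       # syllables per fluent run
        "MSD": sum(pauses) / max(len(pauses), 1),   # mean silent pause duration
        "NSP": len(pauses) / total_time,            # silent pauses per second
    }

if __name__ == "__main__":
    toy = [(2.1, 9), (0.4, 0), (3.0, 14), (0.9, 0), (1.8, 7)]
    for name, value in fluency_measures(toy).items():
        print(f"{name}: {value:.2f}")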

In the final stage, we inspected 126 (35 + 28 + 35 + 28) correlation matrices across the four conditions and found that the variable having the highest correlation with the judged fluency was Mean Silence Duration, calculated in the edited-adjusted condition with the intensity constant set at −18 dB. Using this acoustic measure and this intensity configuration, we experimented with a range of silent pause thresholds (from 0.25 to 2 s with an interval of 0.25 s) to examine their influence on the correlations with the judged fluency. All the statistical analyses were performed in SPSS 21.
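Although the reported analyses were run in SPSS, the final-stage sweep can be approximated in a few lines of Python; the per-recording pause durations and ratings below are simulated, so only the procedure (not the numbers) reflects the study.

import numpy as np
from scipy.stats import spearmanr

def msd_at_threshold(pauses, threshold_s):
    # Mean Silence Duration counting only pauses at or above the chosen threshold.
    kept = [p for p in pauses if p >= threshold_s]
    return float(np.mean(kept)) if kept else 0.0

def sweep_pause_thresholds(pause_lists, judged, thresholds):
    results = {}
    for t in thresholds:
        msd = [msd_at_threshold(p, t) for p in pause_lists]
        rho, _ = spearmanr(msd, judged)
        results[t] = round(float(rho), 2)
    return results

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    pause_lists = [list(rng.exponential(0.5, size=40)) for _ in range(20)]
    judged = [60 - 25 * msd_at_threshold(ps, 0.25) + rng.normal(0, 2) for ps in pause_lists]
    print(sweep_pause_thresholds(pause_lists, judged, [0.25, 0.5, 0.75, 1.0, 1.5, 2.0]))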

3 Results and Discussion RQ1 and RQ2 ask whether editing the audio files and adjusting the intensity thresholds could have an impact on the correlations between the automatically assessed acoustic measures and the human raters' judged fluency. Due to the space limit, we use Mean Silence Duration (i.e., the acoustic measure having the highest correlation with the human raters' judged fluency) to elucidate how various Praat configurations could affect the automatic assessment results. As can be seen from Figs. 2(a) and 2(b), the highest correlation with the judged fluency was found in the edited-adjusted condition (rs = −0.61, when the intensity constant was set at −18 dB), followed by the unedited-adjusted condition (rs = −0.58, when the intensity constant was set at −17 dB). In the edited-unadjusted condition, the highest correlation was found when the intensity threshold was set at −30 dB (rs = −0.56). The highest correlation coefficient in the unedited-unadjusted condition was −0.42, when the intensity threshold was set at −30 dB. Comparison of these values demonstrates that pre-processing the audio files by editing out the inter-segmental silences and noises could slightly improve the correlation coefficients (−0.61 versus −0.58), while adjusting the intensity threshold could significantly improve the correlation coefficients, especially when unedited audio files were used (−0.58 versus −0.42). Taken together, the correlational strengths between Mean Silence Duration and the judged fluency decreased in the following order: edited-adjusted > unedited-adjusted > edited-unadjusted > unedited-unadjusted. This result indicates the need to adjust the intensity threshold for individual audio files, so that automatic assessment results can correlate with those of human raters more closely.

RQ3 asks whether different silent pause thresholds would affect the automatic assessment results. The correlations between Mean Silence Duration and the judged fluency were calculated in the edited-adjusted condition, with the intensity constant being set at −18 dB (i.e., the configuration whereby the highest correlation was observed in 126 correlation matrices). Figure 2(c) shows that as the threshold increased from 0.25 to 2 s, the strength of the correlation generally declined from −0.61 to −0.26. This result indicates that the threshold of 0.25 s appears to be an optimal cut-off point for silent pauses and for better precision of automatic assessment. Although this cut-off point is outside the 0.35–0.5 s range recommended by Han and An (2020), it is worth noting that the current study used the threshold for automatic annotation, as compared to manual annotation in Han and An (2020).

Fig. 2 The correlations between Mean Silence Duration and the judged fluency

Table 2 Descriptive statistics of the dependent and the independent variables

Variables | Mean | Standard deviation
Speech Rate | 2.94 | 0.69
Phonation Time Ratio | 0.66 | 0.10
Mean Length of Run | 6.24 | 2.44
Mean Silence Duration | 0.66 | 0.14
Number of Silent Pauses | 0.52 | 0.12
Judged fluency ratings | 55.70 | 8.95

Table 3 Correlation matrix of all the variables in the edited-adjusted condition with intensity constant being set at −18 dB

Measures | PTR | MLR | MSD | NSP | Judged fluency
SR | 0.60*** | 0.76*** | −0.58*** | −0.30*** | 0.53***
PTR | | 0.79*** | −0.63*** | −0.71*** | 0.33***
MLR | | | −0.30*** | −0.79*** | 0.25**
MSD | | | | −0.01 | −0.61***
NSP | | | | | 0.07

Notes *** p < 0.001; ** p < 0.01; SR = Speech Rate, PTR = Phonation Time Ratio, MLR = Mean Length of Run, MSD = Mean Silence Duration, NSP = Number of Silent Pauses

Despite this difference, both studies find that if the silent pause threshold is set larger than 0.5 s, the results will be undesirable. RQ4 asks which acoustic features correlated with the judged fluency. Correlation analyses were conducted in the edited-adjusted condition with the intensity constant being set at −18 dB. Table 2 presents the descriptive statistics for all the variables and Table 3 presents their correlations. As is evident in Table 3, of the five acoustic variables, Mean Silence Duration was found to have the highest correlation with the judged fluency (rs = −0.61, p < 0.001), followed by Speech Rate (rs = 0.53, p < 0.001). Additionally, Phonation Time Ratio and Mean Length of Run had weaker correlations with the judged fluency ratings, while the correlation between Number of Silent Pauses and the fluency ratings was close to zero. Some of these results are corroborated by previous studies in that the judged fluency has been found to strongly correlate with Mean Silence Duration (Yu and van Heuven 2017; Han et al. 2020) and Speech Rate (Han 2015; Yu and van Heuven 2017; Han et al. 2020), but not with Number of Silent Pauses (Han et al. 2020). However, in this study, Phonation Time Ratio and Mean Length of Run were weakly correlated with the human raters' fluency ratings, as compared to the stronger correlations observed in Han et al. (2020). Bosker et al. (2013, p. 162) pointed out that "differences in sensitivity to specific speech phenomena may account for differences in correlations between acoustic measures and fluency ratings." As the raters in this study were asked to rate each interpreting performance in 29 assessment units, they might not be very sensitive to the overall rapidity of the performance (indicated by
Phonation Time Ratio). Instead, they might be more sensitive to the number and the length of silent pauses within each assessment unit, which explained why Mean Silence Duration had the highest correlation with their ratings. With respect to Mean Length of Run, its relatively weaker correlation with the judged fluency might be caused by the compounded inaccuracy effect when Praat automatically calculated syllables and silent pauses. For instance, hypothetically, if the accuracy rate of the two aforementioned measures was 80%, the accuracy rate of Mean Length of Run might be 64%, as compared to human annotation, because the calculation of Mean Length of Run depends on the measures concerning syllables and silent pauses. Future studies are warranted to ascertain the extent to which the compounded inaccuracy effect may affect the automatic assessment of interpreting fluency.

4 Pedagogical Implications The purpose of this study is to examine the feasibility of automatic fluency assessment in interpreter training. Based on the statistical analyses on a relatively large sample size (n = 140), a flowchart is designed to summarize the major issues pertinent to using Praat as an automatic assessment tool. The following paragraphs discuss the decision-making process and offer some pedagogical suggestions. The first practical issue is whether it is necessary to edit audio files (see Part 1 in Fig. 3). There is no simple yes or no answer to this question. Teachers need to base their decisions on assessment goals and time available to them. The results in this study show that editing audio files can moderately improve the correlation between utterance fluency and perceived fluency. If the assessment is of low-stakes, the correlation coefficient (r s > 0.50) may be good enough when we use unedited audio files and appropriately configure parameters in Praat. If the assessment is summative (i.e., final exams and certifying exams), editing work is arguably needed to produce the optimal correlation coefficient. In addition to the assessment goals, the availability of time is another practical concern. In this study, a research assistant spent approximately seven hours editing out the inter-segmental silences and noises in the 140 recordings. The sheer amount of editing work could be a deterrent to the pre-processing of audio files, particularly when there are a relatively large number of recorded interpretations. The second issue pertains to the configuration of parameters in Praat (see Part 2 in Fig. 3). The intensity threshold is an important variable, because intensity levels within audio recordings vary greatly. In this study, the maximum intensity levels in the 140 recordings ranged from 52.68 to 87.52 dB. The range and the diversity of intensity levels might be a potential issue, because Praat uses the maximum intensity and the intensity constant we specify to determine the maximum silence intensity (De Jong and Wempe 2010). If the intensity of an interval is below this value, it will be automatically annotated as “silence” by Praat. Therefore, it is a good idea to adjust the intensity threshold to improve the precision of annotation. As explained in Sect. 2.2, De Jong and Wempe (2010) suggest the following formula: X dB −


Fig. 3 The flowchart to facilitate the decision-making process in automatic fluency assessment

(Maximum intensity − 99% quantile intensity), with X being the intensity constant we specify. This then begs a question: what value of X should we use? This study shows that when X was set at −18, the highest correlation coefficient was observed. However, replication studies should be conducted to verify its appropriateness. In fact, −20 and −25 dB are two default constants in De Jong and Wempe’s (2010) Praat script. Interestingly, as shown in Fig. 2(a), the five highest correlations for the edited recordings were obtained when the constants were set from −16 to −20 dB, while the best range for the unedited recordings was from −15 to −19 dB. If teachers aim for “good enough” automatic results, setting the constant between −15 and − 20 dB seems to be reasonable. If teachers aim for higher precision, one possible way is to have one third of the data rated by human raters and then find out the best range for the threshold (for details about splitting a sample into three parts, see Shu 2020). Figure 2(d) shows the correlations between Mean Silence Duration and the judged fluency with various intensity constants, using one third and all of the edited audio


files. It would seem that the best range was from −17 to −21 dB for the one third of the samples. As such, using one of the default intensity constants in the Praat script (i.e., −20 dB) seems to be appropriate. With respect to the silent pause threshold, the configuration is more straightforward. Setting the cut-off point at 0.25 s is well supported by studies on L1 and L2 oral performances (Bosker et al. 2013; De Jong and Bosker 2013; Préfontaine, Kormos and Johnson 2016). Some studies on interpreters’ fluency also adopted this threshold (Cecot 2001; Han et al. 2020; Mead 2005; Tissi 2000). The comparison presented in Fig. 2(c) demonstrates that setting the threshold at 0.25 s is optimal for automatic fluency assessment. In line with Han and An (2020), if the threshold is set larger than 0.5 s, the results will be undesirable. However, it should be noted that if unedited audio files are used, due to the inclusion of inter-segmental silences, it is necessary to set an upper limit for silent intervals. In this study, the upper cut-off point was set at 5 s after we examined the silences in the edited audio files; and the results showed that it produced a “good” enough correlation coefficient (r s = −0.58) when the intensity threshold was properly adjusted in Praat. The next concern is which acoustic measures can be used as indicators of the judged fluency (see Part 3 in Fig. 3). While both Speech Rate and Mean Silence Duration had strong correlations with the raters’ scores in this study, we caution teachers not to use Speech Rate as a default indicator of how fast/fluently students render their interpretation. As Praat cannot automatically distinguish between filled and unfilled pauses, filled pauses (such as “uhm” and “er”) are counted as syllables when calculating Speech Rate. In this sense, students’ speed fluency may be overestimated because they may use filled pauses to buy processing time (Hilton 2009). Notwithstanding this potential issue, Speech Rate is a fair measure of speed fluency for those who do not habitually or heavily fill their speech with “uhms” or “ers.” By comparison, Mean Silence Duration seems to be a robust indicator of fluency in students’ interpreting performance for at least two reasons. First, Mean Length of Unfilled Pauses (equivalent to Mean Silence Duration in this study) was identified as one of the two significant predictors of the fluency ratings in Han et al.’s (2020) study, which relied on human annotators and human raters. In the present study, Mean Silence Duration was automatically calculated in Praat and it had the highest correlation with the raters’ fluency ratings. Therefore, both studies have provided strong empirical evidence to support the use of Mean Silence Duration in automatic assessment of interpreting fluency. Second, monitoring and controlling (unduly long and frequent) silences are important for students to develop their interpreting competence. It would be inaccurate and inefficient for teachers and students to rely on their intuition to decide whether silences are long or short. With Mean Silence Duration being a quantifiable measure, we have a reliable and practical method to gauge silence duration and compare how different lengths of silent pauses can facilitate or impede listeners’ comprehension. The final issue in the decision flow chart concerns assessment goals (see Part 4 in Fig. 3). 
Notwithstanding the promising results from this study, it is still premature to claim that automatic assessment can replace human evaluation, especially in high-stakes summative assessment. More work needs to be done to validate the feasibility
of automatically assessing interpreting fluency. However, as this study suggests, it is useful and practical for teachers to use automatically generated measures as complementary indicators in formative assessment, and for students to use them to facilitate autonomous learning. As interpreter training is gaining momentum around the world and particularly in China over the past decade, teachers are faced with a daunting challenge to assess a large number of students’ interpretations in a reliable and timely manner. Previous studies have shown that students are overly reliant on teachers’ feedback and susceptible to demotivation due to delayed (and sometimes unavailable) feedback on their performance (Wu 2012, 2016a). For the purpose of formative assessment, feedback should be offered in a timely manner to inform students’ ongoing development (Abedi 2010). A possible solution is to use Praat as a tool for self- and/or peer assessment. Students can monitor their own and peers’ performance (Wu 2016b) and engage in deliberate practice (Ericsson 2000) to improve their fluency (e.g., increasing speech rate and reducing the number and length of silent pauses). They can also document and submit automatic assessment results in learning logs to be formatively evaluated by teachers. Based on the log files, teachers can design fluency enhancing activities, such as a 4/3/2 task in which students interpret the same speech within 4, 3, and 2 min (for more details, see De Jong and Perfetti 2011). These activities can potentially enable students to become mindful and strategic users of silences.
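One modest way of operationalizing such learning logs (our illustration, not a tool described in this chapter) is to record the automatically derived measures for each practice run, for example across a 4/3/2 task, and report the change between runs:

def progress_report(sessions):
    # sessions: list of dicts with 'label', 'speech_rate' and 'mean_silence_duration'.
    lines = []
    for prev, curr in zip(sessions, sessions[1:]):
        sr_change = curr["speech_rate"] - prev["speech_rate"]
        msd_change = curr["mean_silence_duration"] - prev["mean_silence_duration"]
        lines.append(f"{prev['label']} -> {curr['label']}: "
                     f"speech rate {sr_change:+.2f} syll/s, mean silence {msd_change:+.2f} s")
    return "\n".join(lines)

if __name__ == "__main__":
    log = [
        {"label": "4-min run", "speech_rate": 2.6, "mean_silence_duration": 0.82},
        {"label": "3-min run", "speech_rate": 2.9, "mean_silence_duration": 0.71},
        {"label": "2-min run", "speech_rate": 3.2, "mean_silence_duration": 0.63},
    ]
    print(progress_report(log))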

5 Conclusion This study adds to the emerging discussion about automatic assessment of interpreting fluency (Yu and van Heuven 2017; Han et al. 2020). Based on the 140 speech samples of interpreting performance, it was found that Speech Rate and Mean Silence Duration strongly correlated with the fluency ratings. A flowchart has also been designed to facilitate teachers' decision-making process regarding the pre-processing of audio files, the configuration of intensity and silent pause thresholds in Praat, and the association of automatic assessment results with teaching and learning goals. When it comes to the feasibility of automatic fluency assessment, we need to consider two issues: (a) the feasibility of relating automatically assessed features to human raters' judgements, and (b) the feasibility of using automatic assessment results to inform and improve interpreter training. These two considerations correspond to the important distinction between assessment of learning and assessment for learning (Stiggins 2006). The current study focused more on automatic assessment of learning by examining which acoustic features in which Praat configurations can best approximate human raters' fluency judgements. Still, there is a need to understand how teachers and students make sense of automatic assessment results and use them to inform their training. Assessment "with high diagnostic value will tell us not only whether students are performing well but also why students are performing at certain levels and what to do about it" (Herman and Baker 2005, p. 5). Although Praat represents a possible solution for assessing interpreting fluency quickly and objectively,
students' immediate access to assessment results does not mean that the diagnostic information will be acted upon. Users of automatic assessment tools may become too obsessed with the assessed variables to heed qualitatively important but less quantifiable aspects of interpreting performance. As such, it will be necessary for future studies to examine teachers' and students' emic accounts of using automatic assessment tools to advance interpreter training/learning. The title of this chapter asks whether automatic assessment of interpreting fluency is a unicorn. The research results seem to suggest that automatic assessment is possible, if audio files, Praat configurations, and acoustic measures are properly prepared and defined. Nevertheless, more evidence is certainly needed to demonstrate that automatic assessment of interpreting fluency is not wishful thinking, and that it actually contributes to substantial development of interpreting competence. However technologically advanced an assessment tool may be, its application would not be feasible if it does not facilitate teaching and promote learning. We therefore need to embark on a quest to chase the unicorn before we can conclude whether it exists or not. This chapter is one of the early steps in the quest.

References Abedi, Jamal. 2010. Research and recommendations for formative assessment with English language learners. In Handbook of formative assessment, ed. Heidi L. Andrade and Gregory J. Cizek, 181–197. New York: Routledge. Baker-Smemoe, Wendy, Dan P. Dewey, Jennifer Bown, and Rob A. Martinsen. 2014. Does measuring L2 utterance fluency equal measuring overall L2 proficiency? Evidence from five languages. Foreign Language Annals 47 (4): 707–728. Boersma, Paul, and David Weenink. 2019. PRAAT. https://www.praat.org. Assessed 30 Aug 2019. Bosker, Hans Rutger, Anne-France Pinget, Hugo Quené, Ted Sanders, and Nivja de Jong. 2013. What makes speech sound fluent? The contributions of pauses, speed and repairs. Language Testing 30 (2): 159–175. Cecot, Michela. 2001. Pauses in simultaneous interpretation: A contrastive analysis of professional interpreters’ performances. The Interpreters’ Newsletter 11: 63–85. Cucchiarini, Catia, Helmer Strik, and Lou Boves. 2000. Quantitative assessment of second language learners’ fluency by means of automatic speech recognition technology. Journal of the Acoustical Society of America 107: 989–999. de Jong, Nivja, and Charles Perfetti. 2011. Fluency training in the ESL classroom: An experimental study of fluency development and proceduralization. Language Learning 61 (2): 533–568. de Jong, Nivja, and Hans Rutger Bosker. 2013. Choosing a threshold for silent pauses to measure second language fluency. In Proceedings of disfluency in spontaneous speech diss 2013, ed. Robert Eklund, 17–20. Stockholm: Royal Institute of Technology (KTH). de Jong, Nivja, and Ton Wempe. 2009. Praat script to detect syllable nuclei and measure speech rate automatically. Behavior Research Methods 41 (2): 385–390. de Jong, Nivja, and Ton Wempe. 2010. Praat Script Syllable Nuclei v2. https://sites.google.com/ site/speechrate/Home/praat-script-syllable-nuclei-v2 Assessed 30 Aug 2019. Ericsson, Anders. 2000. Expertise in interpreting: An expert-performance perspective. Interpreting 5 (2): 187–220. Gósy, Mária. 2005. Pszicholoingvisztika. Budapest: Osiris Kiadó.


Han, Chao. 2015. (Para)linguistic correlates of perceived fluency in English-to-Chinese simultaneous interpretation. International Journal of Comparative Literature and Translation Studies 3 (4): 32–37.
Han, Chao, and Kerui An. 2020. Using unfilled pauses to measure (dis)fluency in English-Chinese consecutive interpreting: In search of an optimal pause threshold(s). Perspectives. https://doi.org/10.1080/0907676X.2020.1852293.
Han, Chao, Sijia Chen, Rongbo Fu, and Qian Fan. 2020. Modeling the relationship between utterance fluency and raters’ perceived fluency of consecutive interpreting. Interpreting 22 (2): 211–237.
Herman, Joan, and Eva Baker. 2005. Making benchmark testing work. Educational Leadership 63 (3): 48–54.
Hilton, Heather. 2009. Annotation and analyses of temporal aspects of spoken fluency. CALICO Journal 26 (3): 644–661.
Isaacs, Talia, and Pavel Trofimovich. 2012. Deconstructing comprehensibility: Identifying the linguistic influences on listeners’ L2 comprehensibility ratings. Studies in Second Language Acquisition 34 (3): 475–505.
Iwashita, Noriko, Annie Brown, Tim McNamara, and Sally O’Hagan. 2008. Assessed levels of second language speaking proficiency: How distinct? Applied Linguistics 29 (1): 24–49.
Kang, Okim, and David Johnson. 2018. The roles of suprasegmental features in predicting English oral proficiency with an automated system. Language Assessment Quarterly 15 (2): 150–168.
Kormos, Judit, and Mariann Dénes. 2004. Exploring measures and perceptions of fluency in the speech of second language learners. System 32 (2): 145–164.
Lennon, Paul. 2000. The lexical element in spoken second language fluency. In Perspectives on fluency, ed. Heidi Riggenbach, 25–42. Ann Arbor: University of Michigan Press.
Mead, Peter. 2000. Control of pauses by trainee interpreters in their A and B languages. The Interpreters’ Newsletter 10: 89–102.
Mead, Peter. 2005. Methodological issues in the study of interpreters’ fluency. The Interpreters’ Newsletter 13: 39–63.
Mehnert, Uta. 1998. The effects of different lengths of time for planning on second language performance. Studies in Second Language Acquisition 20 (1): 83–108.
Préfontaine, Yvonne, Judit Kormos, and Daniel Ezra Johnson. 2016. How do utterance measures predict raters’ perceptions of fluency in French as a second language? Language Testing 33 (1): 53–73.
Rennert, Sylvi. 2010. The impact of fluency on the subjective assessment of interpreting quality. The Interpreters’ Newsletter 15: 101–115.
Riazantseva, Anastasia. 2001. Second language proficiency and pausing: A study of Russian speakers of English. Studies in Second Language Acquisition 23: 497–526.
Rossiter, Marian. 2009. Perceptions of L2 fluency by native and non-native speakers of English. Canadian Modern Language Review 65: 395–412.
Segalowitz, Norman. 2010. Cognitive bases of second language fluency. Routledge.
Shu, Xiaoling. 2020. Knowledge discovery in the social sciences: A data mining approach. Oakland: University of California Press.
Skehan, Peter. 2003. Task-based instruction. Language Teaching 36: 1–14.
Stiggins, Rick. 2006. Assessment FOR learning: A key to student motivation and achievement. Phi Delta Kappan EDGE 2 (2): 3–19.
Tavakoli, Parvaneh, and Peter Skehan. 2005. Strategic planning, task structure, and performance testing. In Planning and task performance in a second language, ed. Rod Ellis, 239–276. Amsterdam: John Benjamins.
Tissi, Benedetta. 2000. Silent pauses and disfluencies in simultaneous interpretation: A descriptive analysis. The Interpreters’ Newsletter 10: 103–127.
Wu, Zhiwei. 2012. Towards a better assessment tool for undergraduate interpreting courses: A case study in Guangdong University of Foreign Studies and beyond. In Translation and intercultural communication: Impacts and perspectives, ed. Zaixi Tan and Gengshen Hu, 240–265. Shanghai: Shanghai Foreign Language Education Press.
Wu, Zhiwei. 2016a. Towards understanding trainee interpreters’ (de)motivation: An exploratory study. Translation & Interpreting 8: 13–25.
Wu, Zhiwei. 2016b. An in-class peer-review model for interpreting course. In Interpreting studies: The way forward, ed. Jing Chen and Liuyan Yang, 83–91. Beijing: Foreign Language Teaching and Research Press.
Yu, Wenting, and Vincent van Heuven. 2017. Predicting judged fluency of consecutive interpreting from acoustic measures: Potential for automatic assessment and pedagogic implications. Interpreting 19 (1): 47–68.
Zhang, Weiwei, and Zhongwei Song. 2019. The effect of self-repair on judged quality of consecutive interpreting: Attending to content, form and delivery. International Journal of Interpreter Education 11 (1): 4–19.

Zhiwei Wu is an assistant professor in the Department of Chinese and Bilingual Studies at The Hong Kong Polytechnic University. He was an academic visitor at Lancaster University (2014) and a visiting scholar at The Pennsylvania State University (2016–2017). His research interests include interpreter assessment, computer-assisted language learning, multiliteracies, and fansubbing communities. His publications have appeared in Interpreting, New Voices in Translation Studies, and Translation & Interpreting. He is working on a project to examine the feasibility of automatic assessment of interpreting fluency.

Exploring a Corpus-Based Approach to Assessing Interpreting Quality

Yanmeng Liu

Abstract Interpreting quality assessment (IQA) is a challenging task, due to the possible bias of human raters and the transient nature of interpreting. To improve the consistency of IQA, researchers can draw on corpus linguistics to analyze machine-readable interpreting corpora. A corpus-based approach can potentially provide a more consistent way to evaluate interpreting quality. In this chapter, the author describes such an approach to evaluating the quality of students’ interpreting performance. A total of 64 English-to-Chinese interpreting renditions were sourced from the Parallel Corpus of Chinese EFL Learners (PACCEL) and then profiled based on linguistic features extracted from the corpus. These linguistic features were grouped into three quality parameters: information accuracy, output fluency, and audience acceptability. Principal component analysis and decision tree analysis were conducted on the multidimensional linguistic data to evaluate the appropriateness of the proposed assessment indicators and to verify assessment accuracy. Finally, the assessment results were visualized via kernel principal component analysis. The results indicate that the proposed approach was capable of distinguishing students’ interpretations of different quality levels. The study also shows that corpus linguistics has the potential to contribute to the development of IQA research.

Keywords Interpreting quality assessment · Corpus linguistics · Corpus-based profiling · Linguistic data

1 Introduction

Quality is important to every aspect of our lives. The quality of language interpreting is likewise an important topic in the field of interpreting studies, attracting much attention from translation and interpreting scholars, researchers, and educators.


However, interpreting quality assessment (IQA) has long been a challenge for different stakeholders, particularly interpreting researchers and testers (Campbell and Hale 2003; Sawyer 2004; Hatim and Mason 2005; Huertas-Barros et al. 2019). The challenging nature of IQA stems from, among other factors, the subjectivity of human judgement and the transience of interpreting.

First, given the complexity of interpreting, human raters may have different understandings of what constitutes an excellent interpretation, which results in different sets of assessment criteria in IQA (Pöchhacker 2001). Sometimes, human raters cannot accurately distinguish between an outstanding and a modest performance (Kalina 2000). With IQA being heavily dependent on human judgement (Sawyer 2004), the perceived quality of an interpretation can vary from one person to another (Drugan 2013). An associated problem is that the assessment of interpreting places high cognitive demands on raters (Paravattil and Wilby 2019); research on speaking assessment shows that rating accuracy and consistency decline as raters work for longer periods (Ling et al. 2014).

Second, the transience of interpreting makes it more difficult to assess consistently and reliably. In the real world, interpreting is accessible to recipients for a matter of seconds, after which it is “irrevocably gone” (Gumul 2008, 193), thereby depriving the audience of the possibility of close listening and repeated analysis. Because interpreting is transient, global scoring is usually practiced based on evaluators’ general impression and intuitive marking (Fulcher 2015). According to Feng (2005), the assessment criteria used in such impressionistic scoring are far from robust, and the reliability and validity of the assessment outcomes are problematic.

Given these two major drawbacks of the traditional approach to IQA, there is a need to explore alternative means of assessment that can increase the consistency and reliability of IQA. In this chapter, the author explores the feasibility of using a corpus-based approach to assessing the quality of spoken-language interpreting. There are several reasons why a corpus-based approach is more favorable than traditional impressionistic scoring. One reason is that a corpus usually contains a large quantity of written/spoken samples (e.g., in the present case, interpreted renditions and their transcripts), which allows close analysis of linguistic information. Assisted by relevant computer technology, corpus-based analysis can generate new insights into linguistic data. The other reason is that a corpus-based approach can reduce the human bias caused by evaluative judgment and perception, which can potentially improve IQA consistency and reliability.

Corpus-based translation studies (CTS) can be traced back to the early 1990s, when Baker (1993) argued for leveraging the strength of corpus linguistics to describe and analyze translated texts. A few years later, Shlesinger (1998) proposed corpus-based interpreting studies (CIS) as an offshoot of CTS, ushering in a new era of using corpora to investigate features of interpreted renditions. A succession of interpreting researchers, such as Straniero Sergio and Falbo (2012), have promoted and contributed to the development of CIS. A recent example is Russo et al. (2018), in which new frontiers and the latest developments of CIS are described.


In addition, Hu et al. (2018) and Ji and Oakes (2019) observe that corpus-based research has evolved from purely descriptive micro-analyses of short texts to the statistical mining of large volumes of mono- and multilingual data, in order to formulate new hypotheses, confirm previous research assertions, and improve the generalizability of research findings.

Given the benefits and capabilities of corpus-based analysis outlined above, the author conducts the current exploratory study to trial a corpus-based approach to the quality assessment of English-to-Chinese interpreting. The study addresses the following two research questions: (a) What measures and indicators can be included in a corpus-based analysis of interpreting quality? (b) How can corpus-based statistics be used to assess interpreting quality? The analysis is based on 64 English-to-Chinese interpreting renditions sampled from the Parallel Corpus of Chinese EFL Learners (PACCEL).

In the remainder of the chapter, the author first provides an overview of different approaches to IQA. Then, the author examines a common set of assessment parameters/criteria and associates each of them with linguistic features that can be extracted from the corpus. Finally, the author evaluates the accuracy and efficiency of the proposed corpus-based approach to IQA.

2 Interpreting Quality Assessment

2.1 Assessment Practices

Theoretical discussion about IQA focuses on the question of what makes a “good” interpretation. This is mainly pursued through the critique of theoretical arguments based on relevant principles and hypotheses. For example, in China, influenced by Yan Fu’s concepts of faithfulness, expressiveness, and elegance, Li (1999) developed such assessment criteria as fidelity, fluency, and quickness of interpreting. Wang (2007) argued that speed should be viewed as an important criterion for evaluating an interpretation. Bao (2011) described good interpretations as complete, accurate, and fluent.

Outside China, Grbić (2008) summarized three important models of interpreting quality. The first relates to the traditional notion of quality advocated by early interpreting practitioners, who regarded quality as being exclusive to highly skilled professionals. The second model posits that perfect interpreting is possible given the right conditions; this kind of zero-defect (impeccable) performance is pursued mostly in training settings, according to Moser-Mercer (1996). The third model defines quality in interpreting as fitness for purpose; as such, a user expectation survey is needed to ascertain the fitness of an interpretation in a given context. Overall, these descriptions have provided theoretical guidance for IQA.

Another line of research accentuates the expectations of different parties, such as audiences, employers, and interpreters themselves.


Usually, questionnaires are designed and distributed to relevant parties to collect their opinions on interpreting quality. Bühler (1986) surveyed 47 conference interpreters and found that assessment criteria such as the completeness of the original meaning, correct grammar in the target language, consistency in language style, and interpreting fluency were most valued by the interpreters. In a survey of 65 interpreters who were asked to evaluate two interpretations, Hearn et al. (as cited in Pöchhacker 2001) determined that knowledge of both source and target languages, socio-communicative skills, objectivity, reliability, and honesty were the most important assessment criteria. Similarly, in a large-scale survey conducted by Mesa in 1997 to collect client feedback on interpreting services, behaviors such as “a full understanding of the client’s language”, “ensures confidentiality”, “refrains from judgement”, and “translates faithfully” were cited as the most important interpreter attributes (cited in Pöchhacker 2001). However, a major downside of survey-based IQA is that respondents usually hold inconsistent (sometimes even contradictory) opinions, priorities, and quality perceptions, which does not easily allow for generalization and objective evaluation.

In addition, one group of researchers follows what Goulden (1992) calls the atomistic method, in which raters focus on points of content in an interpretation and/or its (para)linguistic features, including items such as omissions, errors, and pauses (for details, see Han 2018). The frequency of these features serves as an indicator of interpreting quality (Barik 1973; Pio 2003). However, this method tends to be reductionist (Beeby 2000) and may be difficult to apply in large-scale interpreting quality evaluation (Han 2018). Instead of taking a reductionist assessment method, some scholars have tried to design comprehensive checklists to evaluate interpreting quality, given that interpreting can be influenced by many factors (e.g., Chen 2002; Riccardi 2002; Lee 2015). However, the variables that may influence quality are numerous and sometimes arbitrary, which calls for some level of consistency in evaluation practice.

Rubric-referenced rating scales have recently been used to assess interpreting quality. In this scale-based method, instead of leaving raters to interpret quality criteria on their own, scale designers write specific descriptions of each criterion to help raters gain a better understanding. For instance, Lee (2008) designed a rating scale for assessing interpreting performance, based on a review of the IQA literature, and conducted an experiment to evaluate the scale’s utility. In Han’s (2015) study, a four-band descriptor-based scale was used to assess the fidelity, fluency, and target-language quality of interpreting, and a scoring guide shared among the raters was found to be conducive to assessment consistency. However, the mechanical description of the relevant criteria may be vague and less informative for human raters (Han 2018), as labels such as “Excellent”, “Good”, or “Poor” appear frequently in the descriptors.

Although researchers hold different opinions regarding interpreting quality and assessment methods, it is commonly acknowledged that IQA will never be entirely free from human judgment. As Strong and Rudser (1985) observe, a sound objective assessment should not replace subjective rating completely, because both are useful in IQA. What researchers have been doing is to explore new approaches that improve assessment consistency and reliability.


2.2 Common Ground

Based on the above review, three key quality criteria are highlighted that are commonly used in previous IQA practice: (a) information accuracy, (b) output fluency, and (c) audience acceptability. Given their important role in IQA, these quality criteria are also used in the current study. In what follows, the author discusses each assessment criterion and describes how it is operationalized in this study.

2.2.1 Information Accuracy

The first criterion is information accuracy. According to the Interpretive Theory of the Paris School, interpretation represents the transfer of meaning rather than of language symbols (Lederer 2015). In other words, consistency in sense between the source text and the interpretation is fundamental for IQA (Kurz 2002). This implies that not every word in the source text needs to find a corresponding equivalent, especially between syntactically asymmetrical languages such as English and Chinese (Xiao and Hu 2015). However, based on the experience of analyzing interpreted renditions, the author observes that certain individual words in source texts, which are information-intensive, are usually interpreted literally to maintain information integrity. Most often, the renditions of these information-intensive words are error-prone in students’ interpretations. This is supported by the finding that the accuracy of interpreted texts is compromised by increasing information density in the source text, as higher information density imposes a greater cognitive load on interpreters (Dillinger 1994; Tommola and Helevä 1998). Interpreting products may therefore be differentiated from one another based on these information-intensive words, making them potential indicators of information accuracy in IQA.

The current study sampled representative, information-intensive words from the source text and analyzed their renditions to estimate the overall accuracy of information transfer. The selected words fall into three types. First, key words were selected as indicators of information accuracy: in source texts, key words are usually information-intensive and must be interpreted in order to maintain sense consistency, so key words in the source text were used to evaluate whether students’ interpretations accurately transferred the meaning of the source text. Second, conjunctions were also selected as indicators of information accuracy, because successful meaning transfer relies on a clear and transparent recreation of logical relations in the target text (Huang 2010); logical conjunctions (e.g., because, however, but) were therefore selected from the source text. Third, a more specific problem for the language pair involved in this study (i.e., English-to-Chinese) is that the interpretation of numerical information is complicated by differences between Chinese and English numbering units (Liu 2002; Huang 2006). Numerical information in the source text was therefore selected as a further indicator of the accuracy of information transfer.
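To make this operationalization concrete, the following minimal Python sketch checks whether acceptable renditions of selected source-text items appear in a student’s Chinese output. The glossary of key words, conjunctions, and numerals (and their acceptable renditions) is entirely hypothetical; in the actual study, such items would be drawn from the PACCEL source speech and its verified translations.

```python
# A minimal sketch of the information-accuracy indicators described above.
# The glossary below is hypothetical: it maps selected source-text items
# (key words, logical conjunctions, numerals) to acceptable Chinese renditions.

ACCURACY_INDICATORS = {
    "key_words":    {"economy": ["经济"], "policy": ["政策"]},
    "conjunctions": {"however": ["但是", "然而"], "because": ["因为"]},
    "numerals":     {"1.5 million": ["150万", "一百五十万"]},
}

def accuracy_profile(rendition: str) -> dict:
    """Return, per indicator type, the proportion of source items whose
    acceptable rendition appears in the interpreted Chinese text."""
    profile = {}
    for indicator_type, items in ACCURACY_INDICATORS.items():
        hits = sum(
            any(variant in rendition for variant in variants)
            for variants in items.values()
        )
        profile[indicator_type] = hits / len(items)
    return profile

# Example: a fabricated fragment of a student's rendition.
print(accuracy_profile("然而经济增长放缓，政府调整了政策。"))
# -> {'key_words': 1.0, 'conjunctions': 0.5, 'numerals': 0.0}
```

Simple substring matching is used here only for illustration; a fuller implementation would need to handle paraphrases and near-synonymous renditions.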

2.2.2 Output Fluency

Fluency is defined as “proficiency that indicates the degree to which speech is articulated smoothly and continuously without any ‘unnatural’ breakdowns in flow” (Ejzenberg 2000, p. 287). It has been studied as an important aspect of interpreting quality since the 1980s, and it affects the intelligibility and user perception of interpreting (Rennert 2010). Output fluency is therefore used as a criterion in the corpus-based analysis of interpreting quality.

Unfortunately, there is no general consensus on measurable parameters of speaking/interpreting fluency. Previous scholars mention factors such as pauses and speech rate as metrics of fluency (Kurz and Pöchhacker 1995; Tavakoli and Skehan 2005). Definitions of what constitutes a pause vary (Kumpulainen 2015), with the cut-off (i.e., the minimum length of silence, measured in seconds) ranging from 0.01 s (Immonen 2006; Immonen and Mäkisalo 2010) to as long as 5 s (Jakobsen 2003; Dragsted and Hansen 2009). In the present study, a pause is defined as a silence in speech that lasts 0.25 s or longer; that is, a threshold of 0.25 s is adopted for identifying pauses, and the number of pauses is used as one indicator of interpreting fluency. As for speech rate, even though rates of 250–260 characters/syllables per minute are normal for native Chinese speakers (Xie 2002), Feng (2002) argued that such a speed is unattainable when interpreting into Chinese. Li (2010) therefore considered 150–180 characters per minute a standard for interpreting into Chinese. In this study, a speech rate of 150 characters per minute is used as the benchmark against which fast and slow interpreting is determined.
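As a rough illustration of these two fluency indicators, the sketch below counts silent pauses at or above the 0.25-s threshold and computes speech rate in characters per minute. The time-aligned segment format (start time, end time, text) is an assumption for illustration; in practice, PACCEL transcripts would first need to be aligned with the audio.

```python
# A minimal sketch of the two fluency indicators adopted above: the number of
# silent pauses at or above the 0.25 s threshold, and the speech rate in
# Chinese characters per minute (with 150 characters/min as the benchmark).

PAUSE_THRESHOLD = 0.25   # seconds
RATE_BENCHMARK = 150     # characters per minute

def fluency_profile(segments):
    """segments: list of (start_sec, end_sec, text) tuples in temporal order."""
    pauses = sum(
        1
        for (_, prev_end, _), (next_start, _, _) in zip(segments, segments[1:])
        if next_start - prev_end >= PAUSE_THRESHOLD
    )
    total_chars = sum(len(text) for _, _, text in segments)
    duration_min = (segments[-1][1] - segments[0][0]) / 60
    rate = total_chars / duration_min
    return {"pauses": pauses,
            "chars_per_min": rate,
            "below_benchmark": rate < RATE_BENCHMARK}

# Example with two fabricated time-aligned segments.
segments = [(0.0, 2.1, "今天的主题是气候变化"), (2.6, 4.8, "我们必须立即采取行动")]
print(fluency_profile(segments))   # one 0.5 s pause; 250 characters/min
```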

2.2.3 Audience Acceptability

Introduced by Toury (cited in Baker and Saldanha 2009), the concept of acceptability refers to the extent to which a translation or interpretation is acceptable to the target audience. To ensure that interpreter-mediated communication is smooth, it is of great importance for interpreters to comply with the relevant norms of the target-audience culture (Pöchhacker 2002). In practice, however, collecting feedback from a particular group of audience members is not always practical or possible. As such, the current study used linguistic data extracted from the corpus to profile the linguistic patterns of interpretations. Specifically, the selected linguistic features pertain to the lexical, syntactic, and grammatical levels of a given language. At the lexical level, the numbers of tokens and types, the type/token ratio, average word length, and frequently used words were selected, as they can represent lexical density and richness (Wu 2016); here, frequently used words are those that appear more than once in an interpretation. At the syntactic level, measures such as average sentence length and the numbers of pronouns and conjunctions were computed to evaluate sentence complexity and cohesion (Xu 2010). Lastly, at the grammatical level, the number of function words (e.g., prepositions, adverbs, auxiliaries) was counted, because function words can reflect the language style of interpretations (Wang and Hu 2008).
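The following sketch illustrates how these lexical, syntactic, and grammatical indicators could be computed from a Chinese rendition that has already been segmented and part-of-speech tagged (e.g., with a tool such as jieba). The (word, tag) input format and the tag sets are assumptions for illustration, not PACCEL’s actual annotation scheme.

```python
# A minimal sketch of the acceptability indicators listed above, computed from
# a pre-segmented, POS-tagged Chinese rendition. The tag sets are assumptions.
from collections import Counter

PRONOUN_TAGS = {"r"}              # pronouns
CONJUNCTION_TAGS = {"c"}          # conjunctions
FUNCTION_TAGS = {"p", "d", "u"}   # prepositions, adverbs, auxiliaries

def acceptability_profile(tagged_tokens, sentence_count):
    words = [w for w, _ in tagged_tokens]
    freq = Counter(words)
    return {
        "tokens": len(words),
        "types": len(freq),
        "type_token_ratio": len(freq) / len(words),
        "avg_word_length": sum(len(w) for w in words) / len(words),
        "frequent_words": sum(1 for c in freq.values() if c > 1),
        "avg_sentence_length": len(words) / sentence_count,
        "pronouns": sum(1 for _, t in tagged_tokens if t in PRONOUN_TAGS),
        "conjunctions": sum(1 for _, t in tagged_tokens if t in CONJUNCTION_TAGS),
        "function_words": sum(1 for _, t in tagged_tokens if t in FUNCTION_TAGS),
    }

# Example with a fabricated tagged rendition of two short sentences.
tagged = [("我们", "r"), ("必须", "d"), ("保护", "v"), ("环境", "n"), ("因为", "c"),
          ("环境", "n"), ("是", "v"), ("我们", "r"), ("的", "u"), ("家园", "n")]
print(acceptability_profile(tagged, sentence_count=2))
```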


3 Corpus-Based Profiling of Interpreted Texts

This section describes the analysis of the linguistic features of interpreted renditions, using the proposed concept of corpus-based profiling. The intention is to integrate corpus-based interpreting studies with IQA. Profiling is originally defined as “the activity of collecting information about someone to describe them” (Cambridge English Dictionary). In the present research context, each interpreting product (i.e., an interpretation or interpreted rendition) can likewise be profiled with specific statistical indicators. This means that each interpretation is represented by a statistical expression, with statistics extracted from a given corpus describing the three quality dimensions (i.e., information accuracy, output fluency, and audience acceptability) (see also Fig. 1). Under each dimension, specific statistical indicators are examined. With these indicators at hand, interpreting products can be compared and classified into different quality levels much more easily, such as the four levels of interpreting quality used in the present study (i.e., Excellent, Good, Pass, and Fail).

In the process of profiling, the specific indicators should be distinctive enough to represent the features of interpretations at different quality levels, so as to discriminate among interpretations and classify them into their respective quality categories (e.g., Excellent, Pass). Empirical evaluation is thus needed to verify the distinctiveness and efficiency of the statistical indicators. In this approach, corpus-based profiling concretizes the complex and abstract concept of interpreting quality and simplifies IQA by using relevant statistical indicators. Presumably, less rater bias is involved in the evaluative judgment of specific indicators of interpreting quality. Based on the three-dimension IQA framework, statistics for the specific indicators are extracted by software from a given corpus, as sketched below.

Fig. 1 The corpus-based profiling of an interpreted rendition
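To illustrate the idea of a “statistical expression” for each rendition, the sketch below combines the three hypothetical dimension profiles defined earlier into a single flat feature vector. The profile functions reuse the illustrative sketches above and are not part of any existing toolkit.

```python
# A minimal sketch: one rendition's "statistical expression" is a flat vector
# built from the three dimension profiles sketched earlier (accuracy, fluency,
# acceptability). These functions are the illustrative ones defined above.

def profile_rendition(rendition_text, segments, tagged_tokens, sentence_count):
    profile = {}
    profile.update({f"accuracy_{k}": v
                    for k, v in accuracy_profile(rendition_text).items()})
    profile.update({f"fluency_{k}": float(v)
                    for k, v in fluency_profile(segments).items()})
    profile.update({f"acceptability_{k}": v
                    for k, v in acceptability_profile(tagged_tokens, sentence_count).items()})
    return profile   # e.g., {'accuracy_key_words': 1.0, 'fluency_pauses': 1.0, ...}
```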


Assessment that has traditionally been dominated by human raters thus takes the form of a multi-dimensional statistical comparison, which provides empirical evidence to inform IQA.
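As a sketch of the analytic pipeline described in this chapter (profiling each rendition as a vector of indicators, reducing the dimensionality with principal component analysis, and classifying renditions into expert-assigned quality levels with a decision tree), the following code uses randomly generated placeholder data; it illustrates the workflow only, not the study’s actual features or results.

```python
# A minimal sketch of the analytic pipeline: indicator vectors -> PCA ->
# decision-tree classification against expert quality levels.
# The feature matrix and labels are random placeholders, not PACCEL data.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.random((64, 12))            # 64 renditions x 12 hypothetical indicators
y = rng.integers(0, 4, size=64)     # 0=Fail, 1=Pass, 2=Good, 3=Excellent

# Retain the principal components that explain 95% of the indicator variance.
X_reduced = PCA(n_components=0.95).fit_transform(X)

# Check how well the reduced profiles separate the expert quality levels.
tree = DecisionTreeClassifier(max_depth=3, random_state=0)
print(cross_val_score(tree, X_reduced, y, cv=5).mean())
```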

4 Method

4.1 Sample Selection

A total of 64 English-to-Chinese interpreting samples were selected from the Parallel Corpus of Chinese EFL Learners (PACCEL). This corpus comprises over two million words of Chinese-English translation and interpreting data produced by learners. Experts scored the interpreting samples on a 100-point scale, which also corresponds to four levels/categories: Excellent (100–80), Good (79–70), Pass (69–60), and Fail (