Quantitative Data Analysis for Language Assessment Volume I: Fundamental Techniques [1st ed.] 1138733121, 9781138733121

Table of contents :
Quantitative Data Analysis for Language Assessment Volume I
Contents
Figures
Tables
Preface
Editor and contributor biographies
Introduction
Section I Test development, reliability, and generalizability
1 Item analysis in language assessment
2 Univariate generalizability theory in language assessment
3 Multivariate generalizability theory in language assessment
Section II Unidimensional Rasch measurement
4 Applying Rasch measurement in language assessment
5 The Rasch measurement approach to differential item functioning (DIF) analysis in language assessment research
6 Application of the rating scale model and the partial credit model in language assessment research
7 Many-facet Rasch measurement
Section III Univariate and multivariate statistical analysis
8 Analysis of differences between groups
9 Application of ANCOVA and MANCOVA in language assessment research
10 Application of linear regression in language assessment
11 Application of exploratory factor analysis in language assessment
Index


Quantitative Data Analysis for Language Assessment Volume I

Quantitative Data Analysis for Language Assessment Volume I: Fundamental Techniques is a resource book that presents the most fundamental techniques of quantitative data analysis in the field of language assessment. Each chapter provides an accessible explanation of the selected technique, a review of language assessment studies that have used the technique, and finally, an example of an authentic study that uses the technique. Readers also get a taste of how to apply each technique through the help of supplementary online resources that include sample datasets and guided instructions. Language assessment students, test designers, and researchers should find this a unique reference, as it consolidates theory and application of quantitative data analysis in language assessment.

Vahid Aryadoust is an Assistant Professor of language assessment literacy at the National Institute of Education of Nanyang Technological University, Singapore. He has led a number of language assessment research projects funded by, for example, the Ministry of Education (Singapore), Michigan Language Assessment (USA), Pearson Education (UK), and Paragon Testing Enterprises (Canada), and published his research in, for example, Language Testing, Language Assessment Quarterly, Assessing Writing, Educational Assessment, Educational Psychology, and Computer Assisted Language Learning. He has also (co)authored a number of book chapters and books that have been published by Routledge, Cambridge University Press, Springer, Cambridge Scholar Publishing, Wiley Blackwell, etc. He is a member of the Advisory Board of multiple international journals including Language Testing, Language Assessment Quarterly, Educational Assessment, Educational Psychology, and Asia Pacific Journal of Education. In addition, he has been awarded the Intercontinental Academia Fellowship (2018–2019), an advanced research program launched by the University-Based Institutes for Advanced Studies. Vahid's areas of interest include theory-building and quantitative data analysis in language assessment, neuroimaging in language comprehension, and eye-tracking research.

Michelle Raquel is a Senior Lecturer at the Centre of Applied English Studies, University of Hong Kong, where she teaches language testing and assessment to postgraduate students. She has extensive assessment development and management experience in the Hong Kong education and government sector. In particular, she has either led or been part of a group that designed and administered large-scale computer-based language proficiency and diagnostic assessments such as the Diagnostic English Language Tracking Assessment (DELTA). She specializes in data analysis, specifically Rasch measurement, and has published several articles in international journals on this topic as well as academic English, diagnostic assessment, dynamic assessment of English second-language dramatic skills, and English for specific purposes (ESP) testing. Michelle's research areas are classroom-based assessment, diagnostic assessment, and workplace assessment.

Routledge Research in Language Education

The Routledge Research in Language Education series provides a platform for established and emerging scholars to present their latest research and discuss key issues in Language Education. This series welcomes books on all areas of language teaching and learning, including but not limited to language education policy and politics, multilingualism, literacy, L1, L2 or foreign language acquisition, curriculum, classroom practice, pedagogy, teaching materials, and language teacher education and development. Books in the series are not limited to the discussion of the teaching and learning of English only.

Books in the series include:

Interdisciplinary Research Approaches to Multilingual Education
Edited by Vasilia Kourtis-Kazoullis, Themistoklis Aravossitas, Eleni Skourtou and Peter Pericles Trifonas

From Language Skills to Literacy
Broadening the scope of English language education through media literacy
Csilla Weninger

Addressing Difficult Situations in Foreign-Language Learning
Confusion, Impoliteness, and Hostility
Gerrard Mugford

Translanguaging in EFL Contexts
A Call for Change
Michael Rabbidge

Quantitative Data Analysis for Language Assessment Volume I
Fundamental Techniques
Edited by Vahid Aryadoust and Michelle Raquel

For more information about the series, please visit www.routledge.com/RoutledgeResearch-in-Language-Education/book-series/RRLE

Quantitative Data Analysis for Language Assessment Volume I Fundamental Techniques

Edited by Vahid Aryadoust and Michelle Raquel

First published 2019 by Routledge
2 Park Square, Milton Park, Abingdon, Oxon OX14 4RN
and by Routledge
52 Vanderbilt Avenue, New York, NY 10017

Routledge is an imprint of the Taylor & Francis Group, an informa business

© 2019 selection and editorial matter, Vahid Aryadoust and Michelle Raquel; individual chapters, the contributors

The right of Vahid Aryadoust and Michelle Raquel to be identified as the authors of the editorial material, and of the authors for their individual chapters, has been asserted in accordance with sections 77 and 78 of the Copyright, Designs and Patents Act 1988.

All rights reserved. No part of this book may be reprinted or reproduced or utilised in any form or by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying and recording, or in any information storage or retrieval system, without permission in writing from the publishers.

Trademark notice: Product or corporate names may be trademarks or registered trademarks, and are used only for identification and explanation without intent to infringe.

British Library Cataloguing-in-Publication Data
A catalogue record for this book is available from the British Library

Library of Congress Cataloging-in-Publication Data
A catalog record for this book has been requested

ISBN: 978-1-138-73312-1 (hbk)
ISBN: 978-1-315-18781-5 (ebk)

Typeset in Galliard by Apex CoVantage, LLC

Visit the eResources: www.routledge.com/9781138733121

Contents

List of figures
List of tables
Preface
Editor and contributor biographies

Introduction
Vahid Aryadoust and Michelle Raquel

SECTION I
Test development, reliability, and generalizability

1 Item analysis in language assessment
Rita Green

2 Univariate generalizability theory in language assessment
Yasuyo Sawaki and Xiaoming Xi

3 Multivariate generalizability theory in language assessment
Kirby C. Grabowski and Rongchan Lin

SECTION II
Unidimensional Rasch measurement

4 Applying Rasch measurement in language assessment: unidimensionality and local independence
Jason Fan and Trevor Bond

5 The Rasch measurement approach to differential item functioning (DIF) analysis in language assessment research
Michelle Raquel

6 Application of the rating scale model and the partial credit model in language assessment research
Ikkyu Choi

7 Many-facet Rasch measurement: implications for rater-mediated language assessment
Thomas Eckes

SECTION III
Univariate and multivariate statistical analysis

8 Analysis of differences between groups: the t-test and the analysis of variance (ANOVA) in language assessment
Tuğba Elif Toprak

9 Application of ANCOVA and MANCOVA in language assessment research
Zhi Li and Michelle Y. Chen

10 Application of linear regression in language assessment
Daeryong Seo and Husein Taherbhai

11 Application of exploratory factor analysis in language assessment
Limei Zhang and Wenshu Luo

Index

Figures

1.1 Facility values and distracter analysis
1.2 Discrimination indices
1.3 Facility values, discrimination, and internal consistency (reliability)
1.4 Task statistics
1.5 Distracter problems
2.1 A one-facet crossed design example
2.2 A two-facet crossed design example
2.3 A two-facet partially nested design example
3.1 Observed-score variance as conceptualized through CTT
3.2 Observed-score variance as conceptualized through G theory
4.1 Wright map presenting item and person measures
4.2 Standardized residual first contrast plot
5.1 ICC of an item with uniform DIF
5.2 ICC of an item with non-uniform DIF
5.3 Standardized residual plot of 1st contrast
5.4 ETS DIF categorization of DIF items based on DIF size and statistical significance
5.5 Sample ICC of item with uniform DIF (positive DIF contrast)
5.6 Sample ICC of item with uniform DIF (negative DIF contrast)
5.7 Macau high-ability students (M2) vs. HK high-ability students (H2) sample ICCs of an item with NUDIF (positive DIF contrast)
5.8 Macau high-ability students (M2) vs. HK high-ability students (H2) sample ICCs of an item with NUDIF (negative DIF contrast)
5.9 Plot diagram of person measures with vs. without DIF items
6.1 Illustration of the RSM assumption
6.2 Distributions of item responses
6.3 Estimated response probabilities for Items 1, 2, and 3 from the RSM (dotted lines) and the PCM (solid lines)
6.4 Estimated standard errors for person parameters and test information from the RSM (dotted lines) and the PCM (solid lines)
6.5 Estimated response probabilities for Items 6, 7 and 8 from the RSM (dotted lines) and the PCM (solid lines), with observed response proportions (unfilled circles)
7.1 The basic structure of rater-mediated assessments
7.2 Fictitious dichotomous data: Responses of seven test takers to five items scored as correct (1) or incorrect (0)
7.3 Illustration of a two-facet dichotomous Rasch model (log odds form)
7.4 Fictitious polytomous data: Responses of seven test takers evaluated by three raters on five criteria using a five-category rating scale
7.5 Illustration of a three-facet rating scale measurement model (log odds form)
7.6 Studying facet interrelations within a MFRM framework
7.7 Wright map for the three-facet rating scale analysis of the sample data (FACETS output, Table 6.0: All facet vertical "rulers")
7.8 Illustration of the MFRM score adjustment procedure
9.1 An example of boxplots
9.2 Temporal distribution of ANCOVA/MANCOVA-based publications in four language assessment journals
9.3 A matrix of scatter plots
10.1 Plot of regression line graphed on a two-dimensional chart representing X and Y axes
10.2 Plot of residuals vs. predicted Y scores where the assumption of linearity holds for the distribution of random errors
10.3 Plot of residuals vs. predicted Y scores where the assumption of linearity does not hold
10.4 Plot of standardized residuals vs. predicted values of the dependent variable that depicts a violation of homoscedasticity
10.5 Histogram of residuals
10.6 Plot of predicted values vs. residuals
11.1 Steps in running EFA
11.2 Scatter plots to illustrate relationships between variables
11.3 Scree plot for the ReTSUQ data

Tables

2.1 Key Steps for Conducting a G Theory Analysis
2.2 Data Setup for the p × i Study Example With 30 Items (n = 35)
2.3 Expected Mean Square (EMS) Equations (the p × i Study Design)
2.4 G-study Results (the p × i Study Design)
2.5 D-study Results (the p × I Study Design)
2.6 Rating Design for the Sample Application
2.7 G- and D-study Variance Component Estimates for the p × r′ Design (Rating Method)
2.8 G- and D-study Variance Component Estimates for the p × r Design (Subdividing Method)
3.1 Areas of Investigation and Associated Research Questions
3.2 Research Questions and Relevant Output to Examine
3.3 Variance Component Estimates for the Four Subscales (p• × T• × R• Design; 2 Tasks and 2 Raters)
3.4 Variance and Covariance Component Estimates for the Four Subscales (p• × T• × R• Design; 2 Tasks and 2 Raters)
3.5 G Coefficients for the Four Subscales (p• × T• × R• Design)
3.6 Universe-Score Correlations Between the Four Subscales (p• × T• × R• Design)
3.7 Effective Weights of Each Subscale to the Composite Universe-Score Variance (p• × T• × R• Design)
3.8 Generalizability Coefficients for the Subscales When Varying the Number of Tasks (p• × T• × R• Design)
4.1 Structure of the FET Listening Test
4.2 Summary Statistics for the Rasch Analysis
4.3 Rasch Item Measures and Fit Statistics (N = 106)
4.4 Standardized Residual Variance
4.5 Largest Standardized Residual Correlations
5.1 Selected Language-Related DIF Studies
5.2 Listening Sub-skills in the DELTA Listening Test
5.3 Rasch Analysis Summary Statistics (N = 2,524)
5.4 Principal Component Analysis of Residuals
5.5 Approximate Relationships Between the Person Measures in PCAR Analysis
5.6 Items With Uniform DIF
5.7 Number of NUDIF Items
5.8 Texts Identified by Expert Panel to Potentially Disadvantage Macau Students
6.1 Item Threshold Parameter Estimates (With Standard Errors in the Parentheses) From the RSM and the PCM
6.2 Person Parameter Estimates and Test Information From the RSM and the PCM
6.3 Infit and Outfit Mean Square Values From the RSM and the PCM
7.1 Excerpt From the FACETS Rater Measurement Report
7.2 Excerpt From the FACETS Test Taker Measurement Report
7.3 Excerpt From the FACETS Criterion Measurement Report
7.4 Separation Statistics and Facet-Specific Interpretations
8.1 Application of the t-Test in the Field of Language Assessment
8.2 Application of ANOVA in Language Testing and Assessment
8.3 Descriptive Statistics for Groups' Performances on the Reading Test
9.1 Summary of Assumptions of ANCOVA and MANCOVA
9.2 Descriptive Statistics of the Selected Sample
9.3 Descriptive Statistics of Overall Reading Performance and Attitude by Sex-Group
9.4 ANCOVA Summary Table
9.5 Estimated Marginal Means
9.6 Descriptive Statistics of Reading Performance and Attitude by Sex-Group
9.7 MANCOVA Summary Table
9.8 Summary Table for ANCOVAs of Each Reading Subscale
10.1 Internal and External Factors Affecting ELLs' Language Proficiency
10.2 Correlation Matrix of the Dependent Variable and the Independent Variables
10.3 Summary of Stepwise Selection
10.4 Analysis of Variance Output for Regression, Including All the Variables
10.5 Parameter Estimates With Speaking Included
10.6 Analysis of Variance Without Including Speaking
10.7 Parameter Estimates of the Four Predictive Variables, Together With Their Variation Inflation Function
10.8 Partial Output as an Example of Residuals and Predicted Values (a Model Without Speaking)
11.1 Categories and Numbers of Items in the ReTSUQ
11.2 Part of the Result of Unrotated Principal Component Analysis
11.3 The Rotated Pattern Matrix of the ReTSUQ (n = 650)

Preface

The two volumes of Quantitative Data Analysis for Language Assessment (Fundamental Techniques and Advanced Methods), together with the companion website, were motivated by the growing need for a comprehensive sourcebook of quantitative data analysis for the language assessment community. As the focus on developing valid and useful assessments continues to intensify in different parts of the world, a robust and sound knowledge of quantitative methods has become an increasingly essential requirement. This is particularly important given that one of the community's responsibilities is to develop language assessments that have evidence of validity, fairness, and reliability. We believe this is achieved primarily by leveraging quantitative data analysis in test development and validation efforts.

It has been the contributors' intention to write the chapters with an eye toward what professors, graduate students, and test-development companies would need. The chapters progress gradually from fundamental concepts to advanced topics, making the volumes suitable reference books for professors who teach quantitative methods. If the content of the volumes is too heavy for a single course, we suggest that professors consider using them across two semesters, or alternatively choose the chapters that fit the focus and scope of their courses. For graduate students who have just embarked on their studies or are writing dissertations or theses, the two volumes serve as a cogent and accessible introduction to the methods that are often used in assessment development and validation research. For organizations in the test-development business, the volumes provide unique topic coverage and examples of applications of the methods in the small- and large-scale language tests that such organizations often deal with.

We would like to thank all of the authors who contributed their expertise in language assessment and quantitative methods. This collaboration has allowed us to emphasize the growing interdisciplinarity of language assessment, which draws knowledge and information from many different fields. We wish to acknowledge that, in addition to editorial reviews, each chapter has been subjected to rigorous double-blind peer review. We extend a special note of thanks to a number of colleagues who helped us during the review process:

Beth Ann O'Brien, National Institute of Education, Singapore
Christian Spoden, The German Institute for Adult Education, Leibniz Centre for Lifelong Learning, Germany
Tuğba Elif Toprak, Izmir Bakircay University, Turkey
Guangwei Hu, Hong Kong Polytechnic University, Hong Kong
Hamdollah Ravand, Vali-e-Asr University of Rafsanjan, Iran
Ikkyu Choi, Educational Testing Service, USA
Kirby C. Grabowski, Teachers College, Columbia University, USA
Mehdi Riazi, Macquarie University, Australia
Moritz Heene, Ludwig-Maximilians-Universität München, Germany
Purya Baghaei, Islamic Azad University of Mashad, Iran
Shane Phillipson, Monash University, Australia
Shangchao Min, Zhejiang University, China
Thomas Eckes, Gesellschaft für Akademische Studienvorbereitung und Testentwicklung e. V. c/o TestDaF-Institut, Ruhr-Universität Bochum, Germany
Trevor Bond, James Cook University, Australia
Wenshu Luo, National Institute of Education, Singapore
Yan Zi, The Education University of Hong Kong, Hong Kong
Yasuyo Sawaki, Waseda University, Japan
Yo In'nami, Chuo University, Japan
Zhang Jie, Shanghai University of Finance and Economics, China

We hope that readers will find the volumes useful in their research and pedagogy.

Vahid Aryadoust and Michelle Raquel
Editors
April 2019

Editor and contributor biographies

Vahid Aryadoust is Assistant Professor of language assessment literacy at the National Institute of Education of Nanyang Technological University, Singapore. He has led a number of language assessment research projects funded by, for example, the Ministry of Education (Singapore), Michigan Language Assessment (USA), Pearson Education (UK), and Paragon Testing Enterprises (Canada), and published his research in Language Testing, Language Assessment Quarterly, Assessing Writing, Educational Assessment, Educational Psychology, and Computer Assisted Language Learning. He has also (co)authored a number of book chapters and books that have been published by Routledge, Cambridge University Press, Springer, Cambridge Scholar Publishing, Wiley Blackwell, etc.

Trevor Bond is an Adjunct Professor in the College of Arts, Society and Education at James Cook University Australia and the senior author of the book Applying the Rasch Model: Fundamental Measurement in the Human Sciences. He consults with language assessment researchers in Hong Kong and Japan and with high-stakes testing teams in the US, Malaysia, and the UK. In 2005, he instigated the Pacific Rim Objective Measurement Symposia (PROMS), now held annually across East Asia. He is a regular keynote speaker at international measurement conferences, runs Rasch measurement workshops, and serves as a specialist reviewer for academic journals.

Michelle Y. Chen is a research psychometrician at Paragon Testing Enterprises. She received her Ph.D. in measurement, evaluation, and research methodology from the University of British Columbia (UBC). She is interested in research that allows her to collaborate and apply psychometric and statistical techniques. Her research focuses on applied psychometrics, validation, and language testing.

Ikkyu Choi is a Research Scientist in the Center for English Language Learning and Assessment at Educational Testing Service. He received his Ph.D. in applied linguistics from the University of California, Los Angeles in 2013, with a specialization in language assessment. His research interests include second-language development profiles, test-taking processes, scoring of constructed responses, and quantitative research methods for language assessment data.

Thomas Eckes is Head of the Psychometrics and Language Testing Research Department, TestDaF Institute, University of Bochum, Germany. His research focuses on psychometric modeling of language competencies, rater effects in large-scale assessments, and the development and validation of web-based language placement tests. He is on the editorial boards of the journals Language Testing and Assessing Writing. His book Introduction to Many-Facet Rasch Measurement (Peter Lang) appeared in 2015 in a second, expanded edition. He was also guest editor of a special issue on advances in IRT modeling of rater effects (Psychological Test and Assessment Modeling, Parts I & II, 2017, 2018).

Jason Fan is a Research Fellow at the Language Testing Research Centre (LTRC) at the University of Melbourne, and before that, an Associate Professor at College of Foreign Languages and Literature, Fudan University. His research interests include the validation of language assessments and quantitative research methods. He is the author of Development and Validation of Standards in Language Testing (Shanghai: Fudan University Press, 2018) and the co-author (with Tim McNamara and Ute Knoch) of Fairness and Justice in Language Assessment: The Role of Measurement (Oxford: Oxford University Press, 2019, in press).

Kirby C. Grabowski is Adjunct Assistant Professor of Applied Linguistics and TESOL at Teachers College, Columbia University, where she teaches courses on second-language assessment, performance assessment, generalizability theory, pragmatics assessment, research methods, and linguistics. Dr. Grabowski is currently on the editorial advisory board of Language Assessment Quarterly and formerly served on the Board of the International Language Testing Association as Member-at-Large. Dr. Grabowski was a Spaan Fellow for the English Language Institute at the University of Michigan, and she received the 2011 Jacqueline Ross TOEFL Dissertation Award for outstanding doctoral dissertation in second/foreign language testing from Educational Testing Service.

Rita Green is a Visiting Teaching Fellow at Lancaster University, UK. She is an expert in the field of language testing and has trained test development teams for more than 30 years in numerous projects around the world including those in the fields of education, diplomacy, air traffic control, and the military. She is the author of Statistical Analyses for Language Testers (2013) and Designing Listening Tests: A Practical Approach (2017), both published by Palgrave Macmillan.

Zhi Li is an Assistant Professor in the Department of Linguistics at the University of Saskatchewan (UoS), Canada. Before joining UoS, he worked as a language assessment specialist at Paragon Testing Enterprises, Canada, and a sessional instructor in the Department of Adult Learning at the University of the Fraser Valley, Canada. Zhi Li holds a doctoral degree in applied linguistics and technology from Iowa State University, USA. His research interests include language assessment, technology-supported language teaching and learning, corpus linguistics, and computational linguistics. His research papers have been published in System, CALICO Journal, and Language Learning & Technology.

Rongchan Lin is a Lecturer at National Institute of Education, Nanyang Technological University, Singapore. She has received awards and scholarships such as the 2017 Asian Association for Language Assessment Best Student Paper Award, the 2016 and 2017 Confucius China Studies Program Joint Research Ph.D. Fellowship, the 2014 Tan Ean Kiam Postgraduate Scholarship (Humanities), and the 2012 Tan Kah Kee Postgraduate Scholarship. She was named the 2016 Joan Findlay Dunham Annual Fund Scholar by Teachers College, Columbia University. Her research interests include integrated language assessment and rubric design.

Wenshu Luo is an Assistant Professor at National Institute of Education (NIE), Nanyang Technological University, Singapore. She obtained her Ph.D. in educational psychology from the University of Hong Kong. She teaches quantitative research methods and educational assessment across a number of programs for in-service teachers in NIE. She is an active researcher in student motivation and engagement and has published a number of papers in top journals in this area. She is also enthusiastic to find out how cultural and contextual factors influence students' learning, such as school culture, leadership practices, classroom practices, and parenting practices.

Michelle Raquel is a Senior Lecturer at the Centre of Applied English Studies, University of Hong Kong, where she teaches language testing and assessment to postgraduate students. She has worked in several tertiary institutions in Hong Kong as an assessment developer and has either led or been part of a group that designed and administered large-scale diagnostic and language proficiency assessments. She has published several articles in international journals on academic English diagnostic assessment, ESL testing of reading and writing, dynamic assessment of second-language dramatic skills, and English for specific purposes (ESP) testing.

Yasuyo Sawaki is a Professor of Applied Linguistics at the School of Education, Waseda University in Tokyo, Japan. Sawaki is interested in a variety of research topics in language assessment ranging from the validation of large-scale international English language assessments to the role of assessment in classroom English language instruction. Her current research topics include examining summary writing performance of university-level Japanese learners of English and English-language demands in undergraduate- and graduate-level content courses at universities in Japan.

Daeryong Seo is a Senior Research Scientist at Pearson. He has led various state assessments and brings international psychometric experience through his work with the Australian NAPLAN and Global Scale of English. He has published several studies in international journals and presented numerous psychometric issues at international conferences, such as the American Educational Research Association (AERA). He also served as a Program Chair of the Rasch special-interest group, AERA. In 2013, he and Dr. Taherbhai received an outstanding paper award from the California Educational Research Association. Their paper is titled "What Makes High School Asian English Learners Tick?"

Husein Taherbhai is a retired Principal Research Scientist who led large-scale assessments in the U.S. for states, such as Arizona, Washington, New York, Maryland, Virginia, Tennessee, etc., and for the National Physical Therapists' Association's licensure examination. Internationally, Dr. Taherbhai led the Educational Quality and Assessment Office in Ontario, Canada, and worked for the Central Board of Secondary Education's Assessment in India. He has published in various scientific journals and has reviewed and presented at the NCME, AERA, and Florida State conferences with papers relating to language learners, rater effects, and students' equity and growth in education.

Tuğba Elif Toprak is an Assistant Professor of Applied Linguistics/ELT at Izmir Bakircay University, Izmir, Turkey. Her primary research interests are implementing cognitive diagnostic assessment by using contemporary item response theory models and blending cognition with language assessment in her research. Dr. Toprak has been collaborating with international researchers on several research projects that are largely situated in the fields of language assessment, psycholinguistics, and the learning sciences. Her current research interest includes intelligent real-time assessment systems, in which she combines techniques from several areas such as the learning sciences, cognitive science, and psychometrics.

Xiaoming Xi is Executive Director of Global Education and Workforce at ETS. Her research spans broad areas of theory and practice, including validity, fairness, test validation methods, approaches to defining test constructs, validity frameworks for automated scoring, automated scoring of speech, the role of technology in language assessment and learning, and test design, rater, and scoring issues. She is co-editor of the Routledge book series Innovations in Language Learning and Assessment and is on the Editorial Boards of Language Testing and Language Assessment Quarterly. She received her Ph.D. in language assessment from UCLA.

Limei Zhang is a Lecturer at the Singapore Centre for Chinese Language, Nanyang Technological University. She obtained her Ph.D. in applied linguistics with emphasis on language assessment from National Institute of Education, Nanyang Technological University. Her research interests include language assessment literacy, reading and writing assessment, and learners' metacognition. She has published papers in journals including The Asia-Pacific Education Researcher, Language Assessment Quarterly, and Language Testing. Her most recent book is Metacognitive and Cognitive Strategy Use and EFL Reading Test Performance: A Study of Chinese College Students (Springer).

Introduction
Vahid Aryadoust and Michelle Raquel

Quantitative techniques are mainstream components in most of the published literature in language assessment, as they are essential in test development and validation research (Chapelle, Enright, & Jamieson, 2008). Three families of quantitative methods are adopted in language assessment research: measurement models, statistical methods, and data mining (although, admittedly, drawing a definite boundary between these classes of methods is not always feasible).

Borsboom (2005) proposes that measurement models, the first family of quantitative methods in language assessment, fall within the paradigms of classical test theory (CTT), Rasch measurement, or item response theory (IRT). The common feature of the three measurement techniques is that they are intended to predict outcomes of cognitive, educational, and psychological testing. However, they have significant differences in their underlying assumptions and applications. CTT is founded on true scores, which are estimated from observed scores and the error of measurement. Internal consistency reliability and generalizability theory are also formulated on CTT premises. Rasch measurement and IRT, on the other hand, are probabilistic models that are used for the measurement of latent variables – attributes that are not directly observed. There are a number of unidimensional Rasch and IRT models, which assume that the attribute underlying test performance comprises only one measurable dimension. There are also multidimensional models, which postulate that a test measures several separable latent variables. Determining whether a test is unidimensional or multidimensional requires theoretical grounding, the application of sophisticated quantitative methods, and an evaluation of the test context. For example, multidimensional tests can be used to provide fine-grained diagnostic information to stakeholders, and thus a multidimensional IRT model can be used to derive useful diagnostic information from test scores. In the current two volumes, CTT and unidimensional Rasch models are discussed in Volume I, and multidimensional techniques are covered in Volume II.

The second family of methods is statistical and consists of methods commonly used in language assessment such as t-tests, analysis of variance (ANOVA), analysis of covariance (ANCOVA), multivariate analysis of covariance (MANCOVA), regression models, and factor analysis, which are covered in Volume I. In addition, multilevel modeling and structural equation modeling (SEM) are presented in Volume II. The research questions that these techniques aim to address range from comparing the average performances of test takers to prediction and data reduction.

The third family of models falls under the umbrella of data mining techniques, which we believe remain relatively under-researched and underutilized in language assessment. Volume II presents two data mining methods: classification and regression trees (CART) and evolutionary algorithm-based symbolic regression, both of which are used for prediction and classification. These methods detect the relationship between dependent and independent variables in the form of mathematical functions, and the postulated relationships can then be confirmed across separate datasets. This feature of the two data mining techniques improves the precision and generalizability of the detected relationships. We provide an overview of the two volumes in the next sections.
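As a compact illustration of the contrast sketched above (added here for orientation, not drawn from the chapters), the CTT score decomposition and the dichotomous Rasch model can be written as:

```latex
% Classical test theory: an observed score X is a true score T plus error E,
% and reliability is the share of observed-score variance due to true scores.
X = T + E, \qquad \rho_{XX'} = \frac{\sigma^2_T}{\sigma^2_T + \sigma^2_E}

% Dichotomous Rasch model: the probability that person n answers item i correctly
% depends only on the difference between ability \theta_n and item difficulty b_i.
P(X_{ni} = 1 \mid \theta_n, b_i) = \frac{\exp(\theta_n - b_i)}{1 + \exp(\theta_n - b_i)}
```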

Quantitative Data Analysis for Language Assessment Volume I: Fundamental Techniques

This volume comprises 11 chapters contributed by experts in the field of language assessment and quantitative data analysis. The aim of the volume is to revisit the fundamental quantitative topics that have been used in the language assessment literature and shed light on their rationales and assumptions. This is achieved by delineating the technique covered in each chapter, providing a brief review of its application in previous language assessment research, and giving a theory-driven example of the application of the technique. The chapters in Volume I are grouped into three main sections, which are discussed below.

Section I. Test development, reliability, and generalizability

Chapter 1: Item analysis in language assessment (Rita Green)

This chapter deals with a fundamental but, as Rita Green notes, often-delayed step in language test development. Item analysis is a quantitative method that allows test developers to examine the quality of test items, i.e., which items are working well (assessing the construct they were written to assess) and which items should be revised or dropped to improve overall test reliability. Unfortunately, as the author notes, this step is commonly carried out only after a test has been administered rather than when items have just been developed. The chapter starts with an explanation of the importance of this method at the test-development stage. Several language testing studies are then reviewed that have utilized this method to investigate test validity and reliability, to improve standard-setting sessions, and to investigate the impact of test format and different testing conditions on test taker performance. The author further emphasizes the need for language testing professionals to learn this method, and its link to language assessment research, by suggesting five research questions in item analysis. The use of the method is demonstrated through an analysis of a multiple-choice grammar and vocabulary test. The author concludes the chapter by demonstrating how the analysis can answer the five research questions proposed, as well as by offering suggestions on how to improve the test.
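For a concrete sense of the two core item statistics this chapter works with, the minimal sketch below (not taken from the chapter; the data and column names are invented) computes facility values and corrected point-biserial discrimination indices for a dichotomously scored test:

```python
import numpy as np
import pandas as pd

# Illustrative 0/1 scored responses: rows = test takers, columns = items.
rng = np.random.default_rng(0)
scores = pd.DataFrame(rng.integers(0, 2, size=(35, 30)),
                      columns=[f"item_{i + 1}" for i in range(30)])

# Facility value: the proportion of test takers answering each item correctly.
facility = scores.mean()

# Corrected point-biserial discrimination: correlation between an item
# and the total score computed from the remaining items.
total = scores.sum(axis=1)
discrimination = {
    item: np.corrcoef(scores[item], total - scores[item])[0, 1]
    for item in scores.columns
}

report = pd.DataFrame({"facility": facility,
                       "discrimination": pd.Series(discrimination)})
print(report.round(2).head())
```

Items with extreme facility values or with discrimination indices near zero (or negative) are the usual candidates for revision or removal.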

Chapter 2: Univariate generalizability theory in language assessment (Yasuyo Sawaki and Xiaoming Xi)

In addition to item analysis, investigating reliability and generalizability is a fundamental consideration in test development. Chapter 2 presents and extends the framework for investigating reliability within the paradigm of classical test theory (CTT). Generalizability theory (G theory) is a powerful method for investigating the extent to which scores are reliable, as it is able to account for different sources of variability and their interactions in one analysis. The chapter provides an overview of the key concepts in this method, outlines the steps in the analyses, and presents an important caveat in the application of this method, i.e., the conceptualization of an appropriate rating design that fits the context. A sample study demonstrates the use of this method to investigate the dependability of ratings given on an English as a foreign language (EFL) summary writing task. The authors compare the results of two G theory analyses, the rating method and the block method, to demonstrate to readers the impact of rating design on the results of the analysis. The chapter concludes with a discussion of the strengths of the analysis compared with other CTT-based reliability indices, the value of this method for investigating rater behavior, and suggested references should readers wish to extend their knowledge of this technique.
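As a pointer to the kind of quantities a univariate G theory analysis yields, the one-facet person-by-item (p × i) design can be summarized as follows (a standard textbook formulation offered for orientation, not an excerpt from the chapter):

```latex
% One-facet crossed p x i design: observed-score variance decomposes into
% person, item, and residual (person-by-item plus error) components.
\sigma^2(X_{pi}) = \sigma^2_p + \sigma^2_i + \sigma^2_{pi,e}

% D-study coefficients for n'_i items: the generalizability (relative) coefficient
% and the dependability (absolute) coefficient.
E\rho^2 = \frac{\sigma^2_p}{\sigma^2_p + \sigma^2_{pi,e}/n'_i}
\qquad
\Phi = \frac{\sigma^2_p}{\sigma^2_p + \left(\sigma^2_i + \sigma^2_{pi,e}\right)/n'_i}
```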

Chapter 3: Multivariate generalizability theory in language assessment (Kirby C. Grabowski and Rongchan Lin)

In performance assessments, multiple factors contribute to a test taker's overall score, such as the task type, the rating scale structure, and the rater, meaning that scores are influenced by multiple sources of variance. Although univariate G theory analysis is able to determine the reliability of scores, it is limited in that it does not consider the impact of these sources of variance simultaneously. Multivariate G theory is a powerful extension: in addition to the results generated by a univariate G theory analysis, it produces a composite reliability index that accounts for all of these factors in one analysis, and it can also take the subscales of a rating scale into account. The authors begin the chapter with an overview of the basic concepts of multivariate G theory. Next, they illustrate an application of the method through an analysis of a listening-speaking test, in which they make clear links between the research questions and the results of the analysis. The chapter concludes with caveats on the use of the method and suggested references for readers who wish to complement their multivariate G theory analyses with other methods.
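To give a flavor of what the multivariate extension adds, the composite universe-score variance for a set of weighted subscales can be written as follows (a standard formulation given here for orientation, not an excerpt from the chapter):

```latex
% Composite universe-score variance for subscales v and v' with weights w_v:
% the off-diagonal terms are universe-score covariances between subscales,
% which a univariate analysis ignores.
\sigma^2_C(\tau) = \sum_{v} \sum_{v'} w_v \, w_{v'} \, \sigma_\tau(v, v')
```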

Section II. Unidimensional Rasch measurement

Chapter 4: Applying Rasch measurement in language assessment: unidimensionality and local independence (Jason Fan and Trevor Bond)

This chapter discusses two fundamental concepts in the application of Rasch measurement in language assessment research, i.e., unidimensionality and local independence, and provides an accessible discussion of these concepts in the context of language assessment. The authors first explain how the two concepts should be understood from a measurement perspective. This is followed by a brief explanation of the Rasch model, a description of how these two measurement properties are investigated through Rasch residuals, and a review of Rasch-based studies in language assessment that report evidence of these properties to strengthen test validity claims. The authors demonstrate the investigation of these properties through an analysis of items in a listening test using the partial credit Rasch model. The results of the study reveal that the listening test is unidimensional and that the principal component analysis of residuals provides evidence of the local independence of items. The chapter concludes with a discussion of practical considerations and suggestions on the steps to take should test developers encounter situations in which these measurement properties are violated.
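Formally, the two properties can be stated together (a standard formulation, not an excerpt from the chapter): once the single latent trait is conditioned on, the item responses carry no further shared information:

```latex
% Local independence under a unidimensional model: conditional on \theta_n,
% the joint probability of a response pattern factors into item-level probabilities.
P(X_{n1}, X_{n2}, \dots, X_{nI} \mid \theta_n) = \prod_{i=1}^{I} P(X_{ni} \mid \theta_n)
```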

Chapter 5: The Rasch measurement approach to differential item functioning (DIF) analysis in language assessment research (Michelle Raquel)

This chapter continues the discussion of test measurement properties. Differential item functioning (DIF) is the statistical term used to describe items that inadvertently yield different item estimates for different subgroups because they are affected by characteristics of the test takers such as gender, age group, or ethnicity. The author first explains the concept of DIF and then provides a brief overview of different DIF detection methods used in language assessment research. A review of DIF studies in language testing follows, which includes a summary of current DIF studies, the DIF method(s) used, and whether the studies investigated the causes of DIF. The chapter then illustrates one of the most commonly used DIF detection methods, Rasch-based DIF analysis. The sample study investigates the presence of DIF in a diagnostic English listening test in which students were classified according to the English language curriculum they had taken (Hong Kong vs. Macau). The results of the study reveal that although a substantial number of items were flagged for DIF, overall test results did not seem to be affected.
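In Rasch-based uniform DIF analysis of this kind, the item is typically calibrated separately for the two groups and the difference in difficulty is tested against its standard error; one common formulation (offered for orientation, not quoted from the chapter) is:

```latex
% Uniform DIF contrast for item i between two groups A and B,
% with an approximate t statistic based on the two calibration standard errors.
\text{DIF}_i = b_i^{(A)} - b_i^{(B)}, \qquad
t_i = \frac{b_i^{(A)} - b_i^{(B)}}
          {\sqrt{SE\!\left(b_i^{(A)}\right)^2 + SE\!\left(b_i^{(B)}\right)^2}}
```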

Chapter 6: Application of the rating scale model and the partial credit model in language assessment research (Ikkyu Choi)

This chapter introduces two Rasch models that are used to analyze the polytomous data typically generated by performance assessments (e.g., speaking or writing tests) and by questionnaires used in language assessment studies. First, Ikkyu Choi explains the relationship between the rating scale model (RSM) and the partial credit model (PCM) through a gentle review of their algebraic representations. This is followed by a discussion of the differences between the models and a review of studies that have utilized them. The author notes in his review that researchers rarely provide a rationale for their choice of model, nor do they compare models. In the sample study, which investigates the scale of a motivation questionnaire, the author provides a thorough, graphically supported comparison and evaluation of the RSM and the PCM and their impact on the scale structure of the questionnaire. The chapter concludes with a justification of why the PCM was more appropriate for this context, the limitations of the parameter estimation method used in the sample study, and a list of suggested topics for readers who wish to extend their knowledge.
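The relationship between the two models can be summarized compactly (a standard formulation, not an excerpt from the chapter): both model the probability of a score of x on a polytomous item, and the RSM is the special case in which all items share one set of thresholds:

```latex
% Partial credit model (PCM): item i has its own difficulty b_i and thresholds \tau_{ik}.
P(X_{ni} = x \mid \theta_n) =
  \frac{\exp \sum_{k=0}^{x} (\theta_n - b_i - \tau_{ik})}
       {\sum_{h=0}^{m_i} \exp \sum_{k=0}^{h} (\theta_n - b_i - \tau_{ik})},
  \qquad \tau_{i0} \equiv 0

% Rating scale model (RSM): the thresholds are constrained to be equal across items.
\tau_{ik} = \tau_k \quad \text{for all items } i
```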

Chapter 7: Many-facet Rasch measurement: implications for rater-mediated language assessment (Thomas Eckes)

This chapter discusses one of the most popular item response theory (IRT)-based methods for analyzing rater-mediated assessments. A common problem in speaking and writing tests is that marks or grades depend on human raters who, despite training, are likely to have their own conceptions of how to mark, and this affects test reliability. Many-facet Rasch measurement (MFRM) provides a solution to this problem in that the analysis simultaneously includes multiple facets such as raters, assessment criteria, test format, or the time when a test is taken. The author first provides an overview of rater-mediated assessments and MFRM concepts. The application of the method is illustrated through an analysis of a writing assessment in which the author demonstrates how to determine rater severity and the consistency of ratings, and how to generate test scores after adjusting for differences in ratings. The chapter concludes with a discussion of advances in MFRM research and of controversial issues related to the method.
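A three-facet rating scale formulation of MFRM of the kind analyzed in this chapter can be written in log-odds form as follows (a standard formulation, not an excerpt from the chapter):

```latex
% Log odds of test taker n receiving category k rather than k-1 from rater j
% on criterion i: ability minus rater severity, criterion difficulty, and threshold.
\ln \frac{P_{nijk}}{P_{nij(k-1)}} = \theta_n - \alpha_j - \beta_i - \tau_k
```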

Section III. Univariate and multivariate statistical analysis

Chapter 8: Analysis of differences between groups: the t-test and the analysis of variance (ANOVA) in language assessment (Tuğba Elif Toprak)

The third section of this volume starts with a discussion of two of the most fundamental and commonly used statistical techniques for comparing test score results and determining whether differences between groups are due to chance. Language testers often find themselves comparing two or more groups of test takers, or comparing pre-test and post-test scores. The chapter starts with an overview of the t-test and the analysis of variance (ANOVA) and the assumptions that must be met before embarking on these analyses. The literature review provides summary tables of recent studies that have employed each method. The application of the t-test is demonstrated through a sample study that investigated the impact of English songs on students' pronunciation development, in which the author divided the students into two groups (experimental vs. control) and then compared the groups' results on a pronunciation test. The second study utilized ANOVA to determine whether students' academic reading proficiency differed across college years (freshmen, sophomores, juniors, seniors) and, if so, which group differed significantly from the others.
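The following minimal sketch (invented data, not taken from the chapter) shows how the two comparisons described above might look with SciPy:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Illustrative pronunciation scores for an experimental and a control group.
experimental = rng.normal(72, 8, 30)
control = rng.normal(65, 8, 30)

# Independent-samples t-test (Welch's version, which does not assume equal variances).
t_stat, p_val = stats.ttest_ind(experimental, control, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_val:.3f}")

# One-way ANOVA comparing reading scores across four college years.
freshmen, sophomores = rng.normal(60, 10, 40), rng.normal(63, 10, 40)
juniors, seniors = rng.normal(67, 10, 40), rng.normal(70, 10, 40)
f_stat, p_val = stats.f_oneway(freshmen, sophomores, juniors, seniors)
print(f"F = {f_stat:.2f}, p = {p_val:.3f}")
```

A significant ANOVA result is usually followed by post-hoc pairwise comparisons to identify which groups differ.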

Chapter 9: Application of ANCOVA and MANCOVA in language assessment research (Zhi Li and Michelle Y. Chen)

This chapter extends the discussion of methods used to compare test results. Analysis of covariance (ANCOVA) and multivariate analysis of covariance (MANCOVA) compare groups while controlling for the effect of one or more variables (covariates) that co-vary with the dependent variables. ANCOVA is used when there is a single dependent variable, while MANCOVA is used when two or more dependent variables are included in the comparison. The chapter begins with a brief discussion of the two methods, the situations in which they should be used, the assumptions that must be fulfilled before analysis can begin, and a brief discussion of how results should be reported. The authors present the results of their meta-analysis of studies that have utilized these methods and outline issues related to the reporting of results in these studies. The application of the methods is demonstrated in analyses of the Programme for International Student Assessment (PISA) 2009 reading test results of Canadian students.
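As a rough illustration of an ANCOVA of the kind described here, the sketch below (invented data and variable names, not the chapter's PISA analysis) fits a linear model with a group factor and a covariate using statsmodels:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
n = 200

# Illustrative data: reading score modeled with sex as the grouping factor
# and reading attitude as the covariate.
df = pd.DataFrame({
    "sex": rng.choice(["female", "male"], size=n),
    "attitude": rng.normal(0, 1, n),
})
df["reading"] = (500 + 10 * (df["sex"] == "female")
                 + 25 * df["attitude"] + rng.normal(0, 30, n))

# ANCOVA as a linear model: the group effect is adjusted for the covariate.
model = smf.ols("reading ~ C(sex) + attitude", data=df).fit()
print(sm.stats.anova_lm(model, typ=2))   # Type II sums of squares
print(model.params)                      # adjusted group difference and covariate slope
```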

Chapter 10: Application of linear regression in language assessment (Daeryong Seo and Husein Taherbhai)

There are cases in which language testers need to determine the impact of one variable on another, for example, whether someone's first language affects the learning of a second language. Linear regression is the appropriate statistical technique when the aim is to determine the extent to which one or more independent variables linearly affect a dependent variable. This chapter opens with a brief discussion of the difference between simple and multiple linear regression and a full discussion of the assumptions that must be fulfilled before commencing the analysis. Next, the authors present a brief literature review of factors that affect English language proficiency, as these determine which variables should be included in the statistical model. The sample study illustrates the application of linear regression by predicting students' results on an English language arts examination from their performance on English proficiency tests of reading, listening, speaking, and writing. The chapter concludes with a checklist of concepts to consider before conducting a regression analysis.
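A minimal multiple regression sketch along these lines (simulated data and illustrative variable names, not the chapter's dataset) might look as follows with statsmodels:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
n = 300

# Illustrative predictors: four English proficiency domain scores.
df = pd.DataFrame(rng.normal(50, 10, size=(n, 4)),
                  columns=["reading", "listening", "speaking", "writing"])
# Illustrative outcome: an English language arts (ELA) examination score.
df["ela"] = (0.4 * df["reading"] + 0.3 * df["listening"]
             + 0.1 * df["speaking"] + 0.2 * df["writing"]
             + rng.normal(0, 5, n))

# Multiple linear regression of the ELA score on the four domain scores.
model = smf.ols("ela ~ reading + listening + speaking + writing", data=df).fit()
print(model.params)      # intercept and slope estimates
print(model.rsquared)    # proportion of variance explained
```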

Chapter 11: Application of exploratory factor analysis in language assessment (Limei Zhang and Wenshu Luo)

A standard procedure in test and survey development is to check whether a test or questionnaire measures one underlying construct or dimension. Ideally, test and questionnaire items are constructed to measure a latent construct (e.g., 20 items to measure listening comprehension), but each item is designed to tap a different aspect of that construct (e.g., items that measure the ability to listen for details, the ability to listen for main ideas, etc.). Exploratory factor analysis (EFA) is a statistical technique that examines how items group together into factors and ultimately measure the latent trait. The chapter commences with an overview of EFA, the different methods of extracting factors from the data, and an outline of the steps in conducting an EFA. This is followed by a literature review that highlights the different ways the method has been applied in language testing research, with a specific focus on studies that examine the factor structure of tests and questionnaires. The sample study demonstrates this by analyzing the factor structure of the Reading Test Strategy Use Questionnaire (ReTSUQ), which is used to determine the types of reading strategies that Chinese students use as they complete reading comprehension tests.
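For orientation, the sketch below runs a small EFA with the third-party factor_analyzer package on simulated two-factor questionnaire data (the data, item names, and factor count are invented and are not taken from the ReTSUQ study):

```python
import numpy as np
import pandas as pd
from factor_analyzer import FactorAnalyzer  # third-party package

rng = np.random.default_rng(4)
n = 650

# Simulated Likert-type responses driven by two latent strategy factors.
latent = rng.normal(size=(n, 2))
loadings = np.array([[0.8, 0.1]] * 5 + [[0.1, 0.7]] * 5)    # 10 items
items = latent @ loadings.T + rng.normal(0, 0.5, size=(n, 10))
df = pd.DataFrame(items, columns=[f"q{i + 1}" for i in range(10)])

# Extract two factors with an oblique (promax) rotation and inspect the loadings.
efa = FactorAnalyzer(n_factors=2, rotation="promax")
efa.fit(df)
print(pd.DataFrame(efa.loadings_, index=df.columns).round(2))
print(efa.get_factor_variance())   # variance explained per factor
```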

Quantitative Data Analysis for Language Assessment Volume II: Advanced Methods

Volume II comprises three major categories of quantitative methods in language testing research: advanced IRT, advanced statistical methods, and nature-inspired data mining methods. We provide an overview of the sections and chapters below.

Section I. Advanced Item Response Theory (IRT) models in language assessment

Chapter 1: Mixed Rasch modeling in assessing reading comprehension (Purya Baghaei, Christoph Kemper, Samuel Greif, and Monique Reichert)

In this chapter, the authors discuss the application of the mixed Rasch model (MRM) in assessing reading comprehension. The MRM is an advanced psychometric approach for detecting latent class differential item functioning (DIF) that combines the Rasch model with latent class analysis: it relaxes some of the requirements of conventional Rasch measurement while preserving most of the fundamental features of the method, and it classifies test takers into mutually exclusive classes with qualitatively different features. Baghaei et al. apply the model to a high-stakes reading comprehension test in English as a foreign language and detect two latent classes of test takers for whom the difficulty levels of the test items differ. They discuss the differentiating features of the classes and conclude that the MRM can be applied to identify sources of multidimensionality.

Chapter 2: Multidimensional Rasch models in first language listening tests (Christian Spoden and Jens Fleischer)

Since the introduction of Rasch measurement to language assessment, a group of scholars has contended that language is not a unidimensional phenomenon and that, accordingly, unidimensional modeling of language assessment data (e.g., through the unidimensional Rasch model) conceals the role of many linguistic features that are integral to language performance. The multidimensional Rasch model can be viewed as a response to these concerns. In this chapter, the authors provide a didactic presentation of the multidimensional Rasch model and apply it to a listening assessment. They discuss the advantages of adopting the model in language assessment research, specifically the improvement in the estimation of reliability that results from the incorporation of dimension correlations, and explain how model comparison can be carried out, while elaborating on multidimensionality in listening comprehension assessments. They conclude the chapter with a brief summary of other multidimensional Rasch models and their value in language assessment research.

Chapter 3: The Log-Linear Cognitive Diagnosis Modeling (LCDM) in second language listening assessment (Elif Toprak, Vahid Aryadoust, and Christine Goh)

Another group of multidimensional models, called cognitive diagnostic models (CDMs), combines psychometrics and psychology. One of the differences between CDMs and multidimensional Rasch models is that the former estimate test takers' mastery of individual sub-skills, whereas the latter provide a general ability estimate for each sub-skill. In this chapter, the authors introduce the log-linear cognitive diagnosis model (LCDM), a flexible CDM technique for modeling assessment data. They apply the model to a high-stakes, norm-referenced listening test (a practice known as retrofitting) to determine whether diagnostic information about test takers' weaknesses and strengths can be derived from it. Toprak et al. argue that although norm-referenced assessments do not usually provide such diagnostic information about the language abilities of test takers, providing it is practical, as it helps language learners who wish to use this information to improve their language skills. They provide guidelines on the estimation and fitting of the LCDM that are also applicable to other CDM techniques.

Chapter 4: Hierarchical diagnostic classification models in assessing reading comprehension (Hamdollah Ravand)

In this chapter, the author presents another group of CDM techniques, including the deterministic inputs, noisy "and" gate (DINA) model and the generalized DINA (G-DINA) model, which are attracting increasing attention in language assessment research. Ravand begins the chapter by providing step-by-step guidelines for model selection, development, and evaluation, elaborating on fit statistics and other relevant concepts in CDM analysis. Like Toprak et al., who present the LCDM in Chapter 3, Ravand argues for retrofitting CDMs to norm-referenced language assessments and provides an illustrative example of the application of CDMs to a non-diagnostic, high-stakes test of reading. He further explains how to use and interpret fit statistics (i.e., relative and absolute fit indices) to select the optimal model among the available CDMs.
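For orientation, the DINA model mentioned above can be written as follows (a standard formulation, not an excerpt from the chapter):

```latex
% DINA: \eta_{nj} indicates whether test taker n has mastered all skills required
% by item j (per the Q-matrix entries q_{jk}); s_j and g_j are the item's
% slip and guessing parameters.
\eta_{nj} = \prod_{k=1}^{K} \alpha_{nk}^{\,q_{jk}}, \qquad
P(X_{nj} = 1 \mid \boldsymbol{\alpha}_n) = (1 - s_j)^{\eta_{nj}} \, g_j^{\,1 - \eta_{nj}}
```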

Section II. Advanced statistical methods in language assessment

Chapter 5: Structural equation modeling in language assessment (Xuelian Zhu, Michelle Raquel, and Vahid Aryadoust)

This chapter discusses one of the most commonly used techniques in the field, whose application in assessment research goes back to at least the 1990s. Instead of modeling a single linear relationship between variables, structural equation modeling (SEM) is used to model direct and indirect relationships between variables concurrently. The authors first provide a review of SEM in language assessment research and propose a framework for model development, specification, and validation. They discuss requirements concerning sample size, fit, and model respecification, and apply SEM to confirm the use of a diagnostic test in predicting the proficiency level of test takers, as well as a possible mediating role for some demographic factors in the model tested. While SEM can be applied to both dichotomous and polytomous data, the authors focus on the latter, stressing that the principles and guidelines spelled out are directly applicable to dichotomous data. They also mention other applications of SEM, such as multigroup modeling and SEM of dichotomous data.
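In its general form, SEM pairs a measurement model with a structural model (a standard formulation given here for orientation, not an excerpt from the chapter):

```latex
% Measurement model: observed indicators y load on latent variables \eta.
\mathbf{y} = \boldsymbol{\Lambda}\,\boldsymbol{\eta} + \boldsymbol{\varepsilon}

% Structural model: directed (direct and indirect) relations among the latent variables.
\boldsymbol{\eta} = \mathbf{B}\,\boldsymbol{\eta} + \boldsymbol{\zeta}
```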

Chapter 6: Growth modeling using growth percentiles for longitudinal studies (Daeryong Seo and Husein Taherbhai)
This chapter presents the student growth percentile (SGP), a method for modeling growth in longitudinal data that is estimated using quantile regression. A distinctive feature of SGP is that it compares test takers with those who have the same history of test performance and achievement. This means that even when the current test scores are the same for two test takers with different assessment histories, their actual SGP scores on the current test can differ. Another feature that differentiates SGP from similar techniques such as multilevel modeling (MLM) and latent growth curve models is that SGP does not require test equating, which can in itself be a time-consuming process. Oftentimes, researchers and language teachers wish to determine whether a particular test taker has a chance of achieving a pre-determined cut score, but a quick glance at the available literature shows that the quantitative tools available do not provide such information. Seo and Taherbhai show that through the quantile regression method, one can estimate the propensity of test takers to achieve an SGP score required to reach the

cut score. The technique lends itself to investigation of change in four language modalities, i.e., reading, writing, listening, and speaking.

Chapter 7: Multilevel modeling to examine sources of variability in second language test scores (Yo In’nami and Khaled Barkaoui)
Multilevel modeling (MLM) is based on the premise that test takers’ performance is a function of their measured abilities as well as another level of variation, such as the classrooms, schools, or cities the test takers come from. According to the authors, MLM is particularly useful when test takers come from pre-specified homogeneous subgroups, such as classrooms, which have different characteristics from the subgroups other test takers belong to. The between-subgroup heterogeneity combined with the within-subgroup homogeneity yields a source of variance in the data which, if ignored, can inflate the chances of a Type I error (i.e., rejection of a true null hypothesis). The authors provide guidelines and advice on using MLM and showcase the application of the technique to a second-language vocabulary test.

Chapter 8: Longitudinal multilevel modeling to examine changes in second language test scores (Khaled Barkaoui and Yo In’nami)
In this chapter, the authors propose that the flexibility of MLM renders it well suited for modeling growth and investigating the sensitivity of test scores to change over time. They argue that MLM is a hierarchical alternative to linear methods such as analysis of variance (ANOVA) and linear regression, and they present an example of second-language longitudinal data. They encourage MLM users to consider and control for variability across test forms, which can confound assessments over time, before using test scores for MLM analysis, both to ensure test equity and to maximize the validity of the uses and interpretations of the test scores.

Section III. Nature-inspired data mining methods in language assessment
Chapter 9: Classification and regression trees in predicting listening item difficulty (Vahid Aryadoust and Christine Goh)
The first data mining method in this section is classification and regression trees (CART), which is presented by Aryadoust and Goh. CART is used in much the same way as linear regression or classification techniques are used in prediction and classification research. CART, however, relaxes the normality and other assumptions that are necessary for parametric models such as regression analysis. Aryadoust and Goh review the literature on the application of CART in language

assessment and propose a multi-stage framework for CART modeling that starts with the establishment of theoretical frameworks and ends in cross-validation. The authors apply CART to 321 listening test items and generate a number of IF-THEN rules which link item difficulty to the linguistic features of the items in a non-linear way. The chapter also stresses the role of cross-validation in CART modeling and the features of two cross-validation methods (n-fold cross-validation and train-test cross-validation).

Chapter 10: Evolutionary algorithm-based symbolic regression to determine the relationship of reading and lexico-grammatical knowledge (Vahid Aryadoust)
Aryadoust introduces the evolutionary algorithm-based (EA-based) symbolic regression method and showcases its application in reading assessment. Like CART, EA-based symbolic regression is a non-linear data analysis method that comprises a training and a cross-validation stage. The technique is inspired by the principles of Darwinian evolution; accordingly, concepts such as survival of the fittest, offspring, breeding, chromosomes, and cross-over are incorporated into the mathematical modeling procedures. The non-parametric nature and cross-validation capabilities of EA-based symbolic regression render it a powerful classification and prediction model in language assessment. Aryadoust presents a prediction study in which he adopts lexico-grammatical abilities as independent variables to predict the reading ability of English learners. He compares the prediction power of the method with that of a linear regression model and shows that the technique renders more precise solutions.

Conclusion
In sum, Volumes I and II present 23 fundamental and advanced quantitative methods and their applications in language testing research. An important factor to consider in choosing among these fundamental and advanced methods is the role of theory and the nature of the research questions. Although some may be drawn to advanced methods, as they might provide stronger evidence to support validity and reliability claims, in some cases less complex methods might better serve researchers’ needs. Nevertheless, oversimplifying research problems could result in overlooking significant sources of variation in data and drawing possibly wrong or naïve inferences. The authors of the chapters have, therefore, emphasized that the first step in choosing a method is the postulation of theoretical frameworks to specify the nature of relationships among variables and the processes and mechanisms of the attributes under investigation. Only after establishing the theoretical framework should one proceed to select quantitative methods to test the hypotheses of the study. To this end, the chapters in the volumes provide step-by-step guidelines to achieve accuracy and

precision in choosing and conducting the relevant quantitative techniques. We are confident that the joint effort of the authors has emphasized the research rigor required in the field and highlighted strengths and weaknesses of the data analysis techniques.


Section I

Test development, reliability, and generalizability

1

Item analysis in language assessment
Rita Green

Introduction
Item analysis is arguably one of the most important statistical procedures that test developers need to carry out. This is because when developing test items, item writers can only speculate on how difficult a particular item might be and how valid it is in terms of testing the targeted construct (for example, the gist, the speaker’s attitude toward x, and so on). Test developers need the data resulting from a field trial to help them corroborate or refute those initial expectations. This can be accomplished by entering the data into a spreadsheet such as IBM® SPSS® Statistics software (SPSS)1 at the item or response level (in the case of questionnaires). Where multiple-choice or multiple-matching items have been used in the field trial, the particular option chosen by the test taker should be entered so that distracter analysis can take place. Once all the data have been entered, it is crucial that the data file is checked to ensure that no errors have inadvertently crept in. Item analysis can be divided into three stages. The first allows the researcher to check the facility values or difficulty levels of the items in order to ascertain whether the test population found them easy or difficult. These findings should be compared with the test developer’s expectations and any significant differences earmarked for further investigation. The performance of any distracters used in multiple-choice and multiple-matching tasks should also be analyzed at this stage in order to ascertain the degree to which they have worked. In other words, whether any have been ignored and/or whether a distracter has worked too strongly, that is, has been chosen by more test takers than the actual key. The second stage of item analysis offers insights into the levels of discrimination each item provides – in other words, the extent to which the items are separating the stronger test takers from the weaker ones and in which direction this is happening. Where the discrimination is positive, this implies that the stronger test takers are answering the item correctly and the less able ones are answering it incorrectly; where an item shows a negative or weak discrimination index, it suggests the opposite may be happening and would indicate that the item needs to be reviewed (Popham, 2000, p. 322, refers to this phenomenon as a “red flag”). Finally, item analysis helps to identify the amount of internal consistency (also referred to as internal consistency reliability or alpha) that exists between the items. In other words, the extent to which the items are targeting the same

construct or, put more simply, the degree to which the items “stick” together. For example, in a listening test, it would be important to confirm that all the items focus on the various types of listening behavior the test developer wishes to target and that they do not require the test taker to exhibit other types of ability, such as a knowledge of mathematics or history, for instance. Item analysis is able to detect this type of problem when it analyzes the test takers’ responses to the individual items by comparing this with their overall performance on the task (see Salkind, 2006, p. 273). To expand, let us take a group of test takers who have been performing well on the listening test but suddenly fail to answer an easy item. This could be due to the fact that the item is testing more than just listening and as a consequence the item is answered incorrectly. Based on the listeners’ overall scores, it is expected that the listeners would answer the easy item correctly, and when they fail to do so, this is flagged in the item analysis results either in terms of a weak/negative discrimination index and/or a problematic internal reliability index related to that particular item. In order to get a clear picture of the degree to which the items are working together, there must be a certain number of items. Indeed, the more items there are, the stronger the picture should become. These three analyses are fundamental in helping the test developer decide whether the trialed items can be banked for use in a future test, should undergo revision and be re-trialed, or must be dropped.

Use of item analysis procedures in language assessment research
Item analysis is a commonly used procedure in a number of language assessment research areas. Probably its most frequent use is during the validity and reliability studies that are carried out in the initial development stages of many high-stakes tests. For example, this approach was followed in the Standardisierte Reifeprüfung project (SRP), which involved the development and administration of language tests in English, French, Spanish, and Italian for 12th graders in Austria, aimed at the CEFR B1 and B2 levels. Data resulting from those field trials were systematically analyzed to check the items’ facility values, discrimination, and internal consistency properties before determining whether they could be used in live test administrations or whether they needed to be revised and re-trialed or simply dropped (Green & Spoettl, 2009, 2011). Other examples can be found in the English Language Proficiency for Aeronautical Communication (ELPAC) test, which was developed by EUROCONTROL (Alderson, 2007, 2010; Green, 2005), and in the DIFA Proficiency Test for Thai Diplomats (Bhumichitr, Gardner, & Green, 2013), to name but a few. Another common need for item analyses is in the preparation of the materials for standard-setting sessions. This is where banked items are subjected to a final review by a group of informed stakeholders. One of the criteria for choosing appropriate tasks for such sessions is that they should exhibit appropriate psychometric properties in terms of their facility values, discrimination indices, and

internal consistency. Presenting tasks which do not have these appropriate characteristics would be a costly waste of time and resources, as they are more than likely to be rejected by the judges who review them. In addition, some standard-setting procedures (for example, the modified Angoff method) often advocate that some information regarding the difficulty level of the items be shared with the judges, usually after the first round of anonymous judgments has been discussed. This is due to the fact that such statistics can provide insights into how the items have performed on a representative test population and should thus be of help in those cases in which there is some indecision as to which level of difficulty should be assigned to a particular item (see Alderson & Huhta, 2005; Cizek & Bunch, 2007; Zieky, Perie, & Livingston, 2008; Ilc & Stopar, 2015; Papageorgiou, 2016, for further discussions of the role of statistics in standard-setting sessions). The use of item analysis is also evident in other research involving studies which have investigated the impact of the test format on test taker performance. Examples include studies exploring the difference between multiple-choice items having three or four options (Shizuka, Takeuchi, Yashima, & Yoshizawa, 2006), studies examining the effect of using multiple-choice questions versus constructed-response items in measuring test takers’ knowledge of language structure (Currie & Chiramanee, 2010), and studies examining the use of sequencing as an item type in reading tests (Alderson, Percsich, & Szabo, 2000). Item analyses have also been found useful when exploring the impact of different testing conditions on test taker performance. For example, this has been the case in studies which have examined the number of times test takers are allowed to listen to a text (Fortune, 2004) as well as those which have looked at the effect of having access to non-verbal information presented in video files as opposed to simply having only auditory input from sound files and how this impacts on ESL listening test taker performance (Wagner, 2008, 2010). Finally, it has also played a role in research which has investigated the effects of rhetorical organization of a text and the response method on reading comprehension (Kobayashi, 2002). Other studies which have benefited from item analysis include those which have focused on determining lexical difficulty (Campfield, 2017) and those investigating the meaning of a text (Sarig, 1989). Item analysis has also been found useful when exploring the impact of working without test specifications (Jafarpur, 2003), developing multiple forms of a test (LaFlair et al., 2015), and analyzing questionnaire data to ascertain the degree to which the “items” target the same construct (Hsu, 2016). Other researchers have used this type of analysis to check the degree of internal consistency between items prior to carrying out a factor analysis when examining item dimensionality (Culligan, 2015), to research the impact of diagnostic feedback (Jang, Dunlop, Park, & van der Boom, 2015), and to investigate construct validity (Anderson, Bachman, Perkins, & Cohen, 1991).

Gap(s) in theory and/or practice
Personal experience of working on many test development projects over the past 30 years has shown that there are still a large number of testing bodies who

do not field trial their tasks and therefore do not run statistical analyses at the pilot stage of test development. It obviously follows that as there are no data to inform the users of their psychometric properties, these tasks are used “blind” with consequential effects for many of the stakeholders and, in particular, the test takers. Other test providers wait until the live test administration before carrying out any kind of analysis, but even then, many of these analyses are only run at the global level. In other words, such organizations are purely interested in reporting the results at the school, provincial, or national levels and therefore concern themselves with mainly providing such statistics as the mean, the standard deviation, and the minimum and maximum test scores. Where internal consistency reliability statistics are reported, they tend to be at the test level rather than at the skill, task, or item level. Personal experience again has shown that this is frequently due to the fact that data are not collected at the individual item level (for example, where a, b, c, or d entries appear in the dataset) but only at the task or test level. One of the major drawbacks of this approach to statistical analyses is that much valuable information for stakeholders, such as the test takers, teachers, test developers, and educational authorities, is lost. The test takers are not able to identify their strengths and weaknesses; the teachers are not able to see which skills (sub-skills) require more or less focus in their lessons; the test developers are unable to determine which test items and/or tasks, test methods and so on are performing well and which need further review, and whether the test specifications need to be modified; and educational authorities are unable to invest this useful information back into the reviews of the curriculum and syllabus design cycle. In addition, it is important that test development teams themselves be empowered to understand the crucial role that item analyses can play in their work and that this is not simply left in the hands of those who have no contact with the actual development of the test items. This can be easily accomplished through simple training based on raw data (at the “abcd” or “zero-one” level) and, if at all possible, based on tasks the test developers have developed themselves and which have been field trialed. Such training would enable test developers to improve their own item writing skills by identifying which items are working and which are not and how they might be modified. To help them in this endeavor, they should ask themselves the following five research questions:

RQ1: Do the items appear to be at the appropriate level of difficulty?
RQ2: How well do the distracters fulfill their function?
RQ3: How well do the individual items discriminate?
RQ4: Does every item contribute to the internal consistency of the task?
RQ5: Should any of the items be dropped?

Based on a sample dataset, these questions will be explored and the findings discussed in the following section.


Sample study: Item analysis of a contextualized grammar and vocabulary test
The data used for this item analysis exercise come from the field trial of a language in use (LiU) task aimed at measuring a test taker’s contextualized knowledge of grammar and vocabulary through a 150-word text and 11 multiple-choice questions. The task formed part of a test targeting the Common European Framework of Reference (CEFR) at the B1 level. The test takers were asked to read the text and then complete the items by selecting the correct answer from the options (a, b, c, or d) provided for each of the 11 items. The data can be downloaded through the Companion website. The resulting data (n = 102) were entered into an IBM SPSS spreadsheet under the following variables and values:

ttid gender item01 to item11 item01a to item11a totscore

test taker identification number (1–102) 1 = female, 2 = male items 1–11; a, b, c, d, or x (no answer) items 1–11; 0 (incorrect), 1 (correct) values 0 through 11 (actual 1–10)
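Purely as an illustration of this data layout, the following Python/pandas sketch scores the a/b/c/d/x responses against an answer key to produce the 0/1 variables and the total score. The file name (liu_trial.csv) and the key itself are hypothetical placeholders, since the chapter does not publish the key; the real dataset is distributed via the Companion website.

import pandas as pd

# Hypothetical file name; the real data are available on the Companion website.
df = pd.read_csv("liu_trial.csv")

# Hypothetical answer key for the 11 items (not given in the chapter).
key = {f"item{i:02d}": k for i, k in enumerate("bcadbadcbda", start=1)}

# Score each a/b/c/d/x response: 1 = correct, 0 = incorrect (x, no answer, scores 0).
for item, correct in key.items():
    df[item + "a"] = (df[item] == correct).astype(int)

# Total score over the 11 scored items (possible range 0-11).
df["totscore"] = df[[item + "a" for item in key]].sum(axis=1)
print(df[["ttid", "gender", "totscore"]].head())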

Methodology
Stage 1: Review the task as a test taker and evaluator
Before looking at the data resulting from any field trial, it is crucial that the person who is going to analyze the data should physically do the task unless s/he has been involved in the actual task development itself and is therefore already very familiar with its strengths and weaknesses. Doing the task beforehand will make interpreting the data much easier; however, it must be carried out under the same conditions (including time allowed) as the test takers. It is also important that the researcher is engaged with the task and not simply giving the task a superficial glance. By doing the task this way, s/he will become acquainted with the items and be able to determine what the original task developer was attempting to target in terms of the construct and the extent to which they might have been successful. The researcher should also pay careful attention to the task instructions to ensure there are no ambiguities and that the language used is of an appropriate level. S/he should review the example if provided, determine whether it is likely to help the test takers (its presence or absence can have a deleterious effect on test taker performance which may be reflected in the data), and consider whether the test method used is likely to be familiar to the target test population. In addition, the options provided (multiple choice, matching tasks) and/or the number of words the test taker is allowed to use (short-answer questions) should be analyzed for any possible difficulties which might emerge in the data. Where options are provided, the researcher should determine whether they are likely to distract or not. S/he should also check for overlap between items (in terms of

what is being tested) and that they are not interdependent. The layout of the task should be evaluated in terms of accessibility (for example, page turning, visibility, etc.). Where a text has had words removed, the researcher should check that there is still sufficient context to allow the test taker to complete the item. The key should also be reviewed to ensure that it is correct, that there is only one possible answer (multiple choice) or, in the case of short-answer items, there are not too many possible answers which might lead to guessing. In addition, it should be noted whether any of the items can be answered based on general knowledge rather than on comprehension of the text or sound file in the case of a listening task. Where the researcher has not chosen the correct answer or has produced a new but seemingly viable alternative, a note should be made, as again this may help during the interpretation of the data. Finally, it may be useful to predict in advance which items might be easy, difficult, and/or problematic. Such predictions help when it comes to looking at and interpreting the output of the item analysis (for further details regarding this stage, see Green, 2017, chapter 4). Just before looking at the data itself, it is useful to review the available information about the trial population itself in terms of age, L1, level of ability, location, and time of the trial, as these factors may well have a bearing on how the statistics can be interpreted.

Stage 2: Run the item analysis using SPSS
Once the data entry has been checked for any infelicities (see Green, 2013, chapter 2 for advice on how to do this), there are three main procedures which should be run in order to carry out an item analysis. The first is known as Frequencies (in IBM SPSS v24) and should be run on the abcdx data, that is, item01 through item11 (not item01a through item11a). The Frequencies analysis enables the researcher to see how the individual distracters and the key in the multiple-choice items are performing. It also provides information about how many test takers have left any of the items unanswered (indicated in this study by x). Once the Frequency output has been analyzed and conclusions and recommendations drawn, the second and third procedures (discrimination and internal consistency) should be run using the zero/one variables, that is, item01a to item11a. IBM SPSS runs these two procedures simultaneously through the SCALE command. The output consists of a number of tables which include information about the task’s overall internal consistency reliability (reported here as Cronbach’s alpha), the discrimination power of each item, and the degree of consistency between the items and what each item contributes to the task as a whole. It also provides information about the mean, variance, and standard deviation of the task as a whole.
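The chapter carries out these steps through the SPSS menus. For readers working outside SPSS, the sketch below reproduces the Frequencies stage (option endorsement rates and facility values) in Python/pandas, under the same hypothetical file and column names as the earlier sketch; sketches for the discrimination and internal consistency (SCALE) statistics follow in the Discussion section.

import pandas as pd

df = pd.read_csv("liu_trial.csv")                      # hypothetical file name
letter_items = [f"item{i:02d}" for i in range(1, 12)]  # a/b/c/d/x responses
scored_items = [item + "a" for item in letter_items]   # 0/1 scored variables

# Distracter analysis (the Frequencies step): percentage of test takers
# endorsing each option (a, b, c, d) or leaving the item blank (x).
for item in letter_items:
    pct = df[item].value_counts(normalize=True).mul(100).round(1)
    print(item, pct.to_dict())

# Facility values: percentage of test takers answering each item correctly.
facility = df[scored_items].mean().mul(100).round(1)
print(facility)

# Flag items falling outside the 30%-70% band often quoted for proficiency tests.
print(facility[(facility < 30) | (facility > 70)])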

Results
The first thing which the researcher should check when going through the output generated by an item analysis is that the total number of reported cases is equal to the number they had expected to see and that none of the cases has been excluded

for some reason. For example, in the current data sample, the number of cases should be 102. Having confirmed this is correct, the facility value for each of the items should be reviewed to see if these are in line with the expectations/predictions made when reviewing or developing the task. The facility values tell the researcher how many test takers answered the item correctly and how many did not. Based on this information, the difficulty level of the item for this test population may be extrapolated. The decision regarding whether this is then perceived to be easy or difficult needs to be measured against the purpose of the test, for example, whether it is an achievement or a proficiency test, and the nature of the target test population. The frequency output on the LiU items revealed facility values ranging from 29.4% to 96.1%, as shown in Figure 1.1, together with information about how each distracter worked:

Figure 1.1  Facility values and distracter analysis. Source: Reprint Courtesy of International Business Machines Corporation, © International Business Machines Corporation.

The discrimination indices tell us the extent to which the items separate the stronger test takers from the weaker ones in a positive or negative way. In IBM SPSS these indices are referred to as Corrected Item-Total Correlation (CITC), and the results for the 11 LiU items show that the discrimination indices ranged from −.222 (item06) to .500 (item07) as shown in Figure 1.2.


Figure 1.2  Discrimination indices. Source: Reprint Courtesy of International Business Machines Corporation, © International Business Machines Corporation.

Together with the CITC findings, IBM SPSS also reports information about the degree of internal consistency between the items based on the way the test takers have responded to those items. To help us investigate this aspect of the test items, two statistics are reported. The first indicates the task-level alpha, which in this case is based on the 11 items that made up the LiU task, while the second indicates the amount of internal consistency each item contributes to that task-level alpha by revealing the impact which its individual removal would have on that statistic. The internal consistency reliability figure for LiU at the task level (here reported as Cronbach’s Alpha) was .628. The impact which the removal of an individual item would have on that task alpha can be seen in the column entitled Cronbach’s Alpha if Item Deleted (CAID) in Figure 1.3. The final part of the item analysis output in IBM SPSS provides information about the overall mean score on the task as well as the variance and the standard deviation based on the 11 items, as shown in Figure 1.4. Although concerned with the test level rather than the item level, it does provide insights into the overall difficulty level of the task itself and may be of use when making decisions such as whether an item should be dropped or not.

Discussion
Facility values (frequencies)
The findings displayed in Figure 1.1 revealed that the facility values for the LiU task ranged from 29.4% to 96.1%. One of the first things which should strike you about these results is the extensive range of facility values given that the task was supposed to be targeting one level (CEFR B1). This finding suggests that either


Figure 1.3  Facility values, discrimination, and internal consistency (reliability). Source: Reprint Courtesy of International Business Machines Corporation, © International Business Machines Corporation.

Figure 1.4  Task statistics. Source: Reprint Courtesy of International Business Machines Corporation, © International Business Machines Corporation.

some of the items were not at the same difficulty level and/or the test population was heterogeneous in nature. What facility values should you expect to find? This to some extent depends on the purpose of the test. For example, in an achievement test, one might expect the facility values to be higher, possibly in the range of 80% to 90%, as this would suggest that the students had understood what they had been taught. In a proficiency test, on the other hand, a facility value of around 50% is optimal (Popham, 2000), as this could be interpreted to suggest that the item is separating the stronger test takers from the weaker ones, though of course this would need to be confirmed by checking the discrimination index of the item. In general, facility values which fall between 30% and 70% are often quoted (see Bachman, 2004) as being acceptable in language proficiency testing, though those which fall between 20% and 80% can also be useful (see Green, 2013) provided the items discriminate and contribute to the internal consistency of the task. In designing multiple-choice items, a test developer’s aim is for every distracter to work (i.e., each option should attract a certain percentage of the test population).

Where this does not happen, that is, where one or two of the distracters in a particular item attract either very few or no test takers, this can result in making the item easier than intended. For example, in a four-option item (abcd) where two of the options receive zero or only a few endorsements, the item becomes a true–false one with a 50:50 chance of the test taker answering the item through guessing. Prior to analyzing the trial results, it is important that the researcher has decided upon a threshold of acceptance for the distracters in the task and that s/he then reviews each of them in turn to see if they have complied with this level. In general, we would want a distracter in a four-option multiple-choice item to attract a minimum of 7% of the test population in a proficiency test and possibly more in a high-stakes test where the consequences of the test score results carry more weight (Green, 2013). In an achievement test where the expected facility values are much higher, a lower threshold level must be established. Taking 7% as a minimum threshold, it can be observed from the information provided in Figure 1.5 that the distracters in some of the LiU items appear to be working well. For example, in items 02, 03, 04 (distracter b would be rounded up to 7%), 09, and 10, all the distracters have been chosen by at least 7% of the test takers. However, it can also be seen that some of the distracters in items 01, 05, 07, and 11 were chosen by less than the desired percentage. Indeed, in the case of item01, none of the distracters could be classified as “working”, while in item07, only options “a” and “b” (the key) are functioning. In addition, it will be noted that in two of the items (06 and 08), one of the distracters is attracting more test takers than the actual key. Such items must be investigated in more depth in order to determine whether it is the weaker test takers who are selecting the distracter and the stronger ones who are endorsing the key or vice versa. This information can be obtained in IBM SPSS by using the Select Cases command and defining the conditions (a, b, c, or d) and then comparing the test takers’ choices with their scores on the task as a whole.

Figure 1.5  Distracter problems. Source: Reprint Courtesy of International Business Machines Corporation, © International Business Machines Corporation.
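The Select Cases check described above can also be approximated directly; the sketch below (same hypothetical file and column names as before) groups test takers by the option they chose on a flagged item and compares the groups’ mean total scores.

import pandas as pd

df = pd.read_csv("liu_trial.csv")            # hypothetical file name

# For items whose distracter outdraws the key (items 06 and 08 here), report
# how many test takers chose each option and their mean total score.
for item in ["item06", "item08"]:
    summary = (df.groupby(item)["totscore"]
                 .agg(n="count", mean_total="mean")
                 .round(2))
    print(item)
    print(summary)

# If those endorsing a distracter score as high as (or higher than) those
# endorsing the key, the item is flagged for revision.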

Discrimination
The output in Figure 1.2 revealed that the CITCs ranged from −.222 (item06) to .500 (item07). Discrimination is measured on a scale of −1 to +1, and in general we would expect figures of +.3 and above, though in some circumstances +.25 might be deemed acceptable (Henning, 1987). Where a discrimination figure is lower than this, it suggests that the item is not discriminating in the desired way. In other words, some of the weaker test takers may have answered the item

correctly and/or some of the stronger ones may have completed it incorrectly. Items with weak or negative discrimination indices must be reviewed in order to determine why this is the case. Looking down the list of discrimination indices, it is immediately obvious that a number of items have CITCs below +.3, that is, items 01, 02, 04, and 08. The items and their options should be analyzed to determine why this has occurred. For example, it was noted above that item01 had three weak distracters, while item08 had a distracter which worked more strongly than the actual key. The weak discrimination index on item08 suggests that distracter b will need to be revised, as the statistic suggests that some of the good test takers chose this option. This should be confirmed by running the Select Cases command as mentioned above. In addition, item06 is discriminating negatively. An analysis of distracter b’s performance above showed that it was attracting more test takers than the key and this negative discrimination implies that some of the better test takers chose this option (this can be confirmed in SPSS by asking the program to select those test takers who chose “b” and comparing the findings with the test takers’ total scores). There are also a number of items which are discriminating well. For example, items 03, 05, 07, 09, 10, and 11 all have indices above +.3. The item which is discriminating the strongest is item07 (.500).
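To make the CITC concrete, the sketch below computes it from the scored items, correlating each item with the total of the remaining items so that the item does not contribute to its own criterion. The file and column names are the same assumptions as in the earlier sketches.

import pandas as pd

df = pd.read_csv("liu_trial.csv")            # hypothetical file name
scored = df[[f"item{i:02d}a" for i in range(1, 12)]]

# Corrected item-total correlation: each item against the sum of the *other* items.
citc = {}
for item in scored.columns:
    rest_total = scored.drop(columns=item).sum(axis=1)
    citc[item] = scored[item].corr(rest_total)

print(pd.Series(citc).round(3).sort_values())
# Indices of about +.3 and above are usually acceptable; negative values
# (item06 in this dataset) flag an item for review.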

Internal consistency reliability
The output in Figure 1.3 showed that the task had an internal consistency reliability of .628. This is based on a calculation which takes the total number of items, the variance of the test scores, and the variance of the test items into account (see Green, 2013, p. 29). In general, one would expect the task-level alpha to be at least .7 (Nunnally, 1978) and preferably .8 (see Pallant, 2007; Salkind, 2006). Where this statistic is lower, it could be interpreted to suggest that one or more of the items is targeting something different in terms of the construct. In the case of the LiU task, the figure is only .628, which is lower than one would expect for a task with 11 items. In general, the fewer the items, the more difficult it may be to reach .7 (Green, 2013), but it is still possible to get .8 with as few as five items. (Note: Where a test paper is made up of a number of tasks, the internal reliability analysis should be run on each task separately.) The second part of the output in Figure 1.3 revealed what impact the removal of each individual item would have on the task’s internal consistency. Looking across these figures, it can be seen that the task’s alpha of .628 would vary from .556 to .698 on the removal of one of the task’s items. For example, in the case of item01, the task alpha would drop from .628 to .621, or by .007 of a point. The removal of item02 would see a drop in the task alpha from .628 to .618, item03 from .628 to .576, and so on. It will be clear that the removal of certain items would impact more heavily (in other words, make the alpha decrease more dramatically) on the task alpha than others. To summarize, the lower the figure in the CAID column, the greater the impact the item’s removal would have and the stronger the argument for retaining that

item in the test, as it appears to be contributing more to the internal consistency of the task. Looking at the output, it will be noted that item07 contributes the most to the internal consistency statistic; its removal would result in the task-level alpha dropping to .556. Conversely, the elimination of item06 would result in an increase in the task alpha, making it .698, which, given the item’s weak discrimination index, is a further argument for its removal. It should be remembered, however, that the removal of one item from a task will have an immediate impact on the CITC values of the other items, as these will now be based on 10 items and not 11. It is therefore strongly advised that the researcher should re-run the SCALE analysis to see what impact that removal has on the remaining items, as it may have a destabilizing effect on the revised discrimination figures. Where a number of weak items are identified, care should also be taken to check the impact their collective removal might have on the construct validity of the task.
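The task-level alpha and the CAID column can likewise be computed directly from the standard formula α = (k / (k − 1)) × (1 − Σ item variances / variance of the total score). The sketch below does this under the same hypothetical file-name assumption; it is an illustration rather than the SPSS computation itself.

import pandas as pd

def cronbach_alpha(items: pd.DataFrame) -> float:
    """Cronbach's alpha for a set of scored (0/1) item columns."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1).sum()
    total_var = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_vars / total_var)

df = pd.read_csv("liu_trial.csv")            # hypothetical file name
scored = df[[f"item{i:02d}a" for i in range(1, 12)]]

print("task-level alpha:", round(cronbach_alpha(scored), 3))

# Cronbach's Alpha if Item Deleted (CAID): recompute alpha with each item removed.
caid = {item: round(cronbach_alpha(scored.drop(columns=item)), 3)
        for item in scored.columns}
print(pd.Series(caid))
# An item whose removal raises alpha (item06 here) is a candidate for dropping;
# after any removal, the CITCs should be recomputed on the remaining items.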

Scale statistics
The output in Figure 1.4 provided information about the general statistics relating to the task as a whole, that is, the mean, the variance, and the standard deviation. It will be observed that the average test score was around 6 out of a possible 11 (in fact no test taker scored either 0 or 11), revealing that the average test taker did not score very highly on the test. Both the variance and the standard deviation provide information about the degree to which the test takers are spread out below and above the mean (the standard deviation is the square root of the variance). Where the standard deviation is small, it indicates that the mean provides a clear reflection of the scores, as the scores are clustered around the mean; where it is large, the mean may be hiding the true spread (see Field, 2009).
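For completeness, the same scale statistics can be reproduced from the scored items, as in the short sketch below (same hypothetical file name); note that the standard deviation is simply the square root of the variance.

import pandas as pd

df = pd.read_csv("liu_trial.csv")            # hypothetical file name
total = df[[f"item{i:02d}a" for i in range(1, 12)]].sum(axis=1)

print("mean:", round(total.mean(), 2))
print("variance:", round(total.var(ddof=1), 2))
print("std dev:", round(total.std(ddof=1), 2))   # square root of the variance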

Answers to research questions
Having now discussed the findings of the item analysis on the 11 items, let us see how these can be applied to the research questions (RQs).

RQ1: Do the items appear to be at the appropriate level of difficulty?
The wide range of facility values (29.4% to 96.1%) suggests that this task was not targeting just one particular level of difficulty. If the test population is truly representative in terms of the targeted level of ability, this finding would imply that some of the items were not at that level and should be reviewed. This is particularly the case for item01, which appears far too easy for this test population, and items 06 and 08, which appear to be rather on the difficult side.

RQ2: How well do the distracters fulfill their function?
It is clear from Figure 1.1 that not all of the distracters are working. Using 7% as a minimum working threshold, there are distracters in a number of items (01,

05, 07, and 11) which would need revising. Items 05 and 11 only have one weak distracter, and the test developer would need to decide whether this finding is likely to improve with a larger test population given that there are only 102 in the current dataset. In circumstances where it is impossible to come up with any other viable option which will target the same construct and still be at the appropriate difficulty level, the test developer would either have to hope that the distracter will attract more test takers in a larger test administration or develop a completely new item. In addition, the distracters in items 06 and 08, which attracted more test takers than the key, would need to be reviewed and revised.

RQ3: How well do the individual items discriminate?
A number of items (03, 05, 07, 09–11) discriminate above the minimum threshold of +.3, but several others (01, 02, 04, 06, and 08) do not. Of these, item06 is particularly problematic in that it has a negative discrimination. All the latter mentioned items would need to be reviewed, revised, and/or dropped before the task could be re-trialed.

RQ4: Does every item contribute to the internal consistency of the task?
With the exception of item06, every item does contribute to the internal consistency, though as discussed above, some more than others. Where the test developer is satisfied with the other properties (the facility values and discrimination), s/he may decide to accept the smaller contributions made by some of the items.

RQ5: Should any of the items be dropped?
It is clear that distracter b in item06 must be revised; if this proves to be impossible, the item would need to be dropped. If this finding (that is to say: negative discrimination, negative contribution to internal consistency) occurred in a live test administration, then the mark this item carries should be dropped from the final calculation of the test scores in order to ensure it is fair for all test takers.

Conclusion
By carrying out this simple item analysis, the test developer has gained valuable insights into the task. S/he has ascertained that the items cover a wide range of facility values and that some items fall outside the oft-quoted facility value parameters of 30%–70% applied to proficiency tests. In addition, it has been observed that some distracters have not been endorsed by a large enough percentage of the test population, while two other distracters have proved to be more attractive than the key. The discrimination indices have revealed one item with a negative index and three others with weak indices (below +.3). Some of these findings are related

to weak distracter performance and suggest the items will need to be reviewed and re-trialed. The internal consistency reliability figures for the task as a whole proved to be weak (.628), though the removal of the negatively discriminating item improves this to .698, or .7 when rounded up, thus meeting the minimal requirement threshold for a set of items. The overall statistics (Scale) revealed that the test population did not do very well on this task (mean was 6 out of a possible 11). This could be interpreted to suggest that either the test population is not at an appropriate level for this test or that some of the items are not at an appropriate level of difficulty.

Note
1 SPSS Inc. was acquired by IBM in October 2009.

References
Alderson, J. C. (2007). Final report on the ELPAC validation study. Retrieved from www.elpac.info/
Alderson, J. C. (2010). A survey of aviation English tests. Language Testing, 27(1), 51–72.
Alderson, J. C., & Huhta, A. (2005). The development of a suite of computer-based diagnostic tests based on the Common European Framework. Language Testing, 22(3), 301–320.
Alderson, J. C., Percsich, R., & Szabo, G. (2000). Sequencing as an item type. Language Testing, 17(4), 423–447.
Anderson, N. J., Bachman, L., Perkins, K., & Cohen, A. (1991). An exploratory study into the construct validity of a reading comprehension test: Triangulation of data sources. Language Testing, 8(1), 41–66.
Bachman, L. F. (2004). Statistical analyses for language assessment. Cambridge: Cambridge University Press.
Bhumichitr, D., Gardner, D., & Green, R. (2013, July). Developing a test for diplomats: Challenges, impact and accountability. Paper presented at the Language Testing Research Colloquium, Seoul, South Korea.
Campfield, D. E. (2017). Lexical difficulty: Using elicited imitation to study child L2. Language Testing, 34(2), 197–221.
Cizek, G. J., & Bunch, M. B. (2007). Standard setting: A guide to establishing and evaluating performance standards on tests. Thousand Oaks, CA: SAGE Publications.
Culligan, B. (2015). A comparison of three test formats to assess word difficulty. Language Testing, 32(4), 503–520.
Currie, M., & Chiramanee, T. (2010). The effect of the multiple-choice item format on the measurement of knowledge of language structure. Language Testing, 27(4), 471–491.
Field, A. (2009). Discovering statistics using SPSS. London: SAGE Publications.
Fortune, A. (2004). Testing listening comprehension in a foreign language: Does the number of times a text is heard affect performance. Unpublished MA dissertation, Lancaster University, United Kingdom.
Green, R. (2005). English Language Proficiency for Aeronautical Communication – ELPAC. Paper presented at the Language Testing Forum (LTF), University of Cambridge.

Green, R. (2013). Statistical analyses for language testers. New York, NY: Palgrave Macmillan.
Green, R. (2017). Designing listening tests: A practical approach. London: Palgrave Macmillan.
Green, R., & Spoettl, C. (2009, June). Going national, standardized and live in Austria: Challenges and tensions. Paper presented at the EALTA Conference, Turku, Finland. Retrieved from www.ealta.eu.org/conference/2009/programme.htm
Green, R., & Spoettl, C. (2011, May). Building up a pool of standard setting judges: Problems, solutions and insights. Paper presented at the EALTA Conference, Siena, Italy. Retrieved from www.ealta.eu.org/conference/2011/programme.html
Henning, G. (1987). A guide to language testing: Development, evaluation, research. Cambridge: Newbury House Publishers.
Hsu, T. H. L. (2016). Removing bias towards World Englishes: The development of a rater attitude instrument using Indian English as a stimulus. Language Testing, 33(3), 367–389.
Ilc, G., & Stopar, A. (2015). Validating the Slovenian national alignment to CEFR: The case of the B2 reading comprehension examination in English. Language Testing, 32(4), 443–462.
Jafarpur, A. (2003). Is the test constructor a facet? Language Testing, 20(1), 57–87.
Jang, E. E., Dunlop, M., Park, G., & van der Boom, E. H. (2015). How do young students with different profiles of reading skill mastery, perceived ability, and goal orientation respond to holistic diagnostic feedback? Language Testing, 32(3), 359–383.
Kobayashi, M. (2002). Method effects on reading comprehension test performance: Text organization and response format. Language Testing, 19(2), 193–220.
LaFlair, G. T., Isbell, D., May, L. N., Gutierrez Arvizu, M. N., & Jamieson, J. (2015). Equating in small-scale language testing programs. Language Testing, 34(1), 127–144.
Nunnally, J. C. (1978). Psychometric theory (2nd ed.). New York, NY: McGraw-Hill.
Pallant, J. (2007). SPSS survival manual: A step by step guide to data analysis using SPSS for Windows (3rd ed.). New York, NY: Open University Press.
Papageorgiou, S. (2016). Aligning language assessments to standards and frameworks. In D. Tsagari & J. Banerjee (Eds.), Handbook of second language assessment (pp. 327–340). Berlin: Walter de Gruyter Inc.
Popham, W. J. (2000). Modern educational measurement: Practical guidelines for educational leaders. Boston, MA: Allyn & Bacon.
Salkind, N. J. (2006). Tests & measurement for people who (think they) hate tests & measurement. Thousand Oaks, CA: SAGE Publications.
Sarig, G. (1989). Testing meaning construction: Can we do it fairly? Language Testing, 6(1), 77–94.
Shizuka, T., Takeuchi, O., Yashima, T., & Yoshizawa, K. (2006). A comparison of three- and four-option English tests for university entrance selection purposes in Japan. Language Testing, 23(1), 35–57.
Wagner, E. (2008). Video listening tests: What are they measuring? Language Assessment Quarterly, 5(3), 218–243.
Wagner, E. (2010). The effect of the use of video texts on ESL listening test-taker performance. Language Testing, 27(4), 493–513.
Zieky, M. J., Perie, M., & Livingston, S. A. (2008). Cut-scores: A manual for setting standards of performance on educational and occupational tests. Princeton, NJ: Educational Testing Service.

2

Univariate generalizability theory in language assessment
Yasuyo Sawaki and Xiaoming Xi

Introduction
Suppose that you administer a speaking test to place students into different levels of speaking courses in your English language program. Suppose further that the test involves multiple speaking tasks and that the same two instructors in your program score all students’ responses to all tasks. You would hope that the scores you obtain from the test offer reliable information about individual students’ speaking ability. However, test scores are influenced by something besides actual speaking ability, namely, measurement error. Various factors can contribute to measurement error. Some factors, such as variation in the difficulty of tasks and in the severity of raters, lead to systematic errors in the sense that they impact the test scores across the entire sample of test takers. However, such systematic sources of measurement error can be identified, modeled, and controlled. Others, such as students’ level of concentration during the test session, affect the test scores of individual test takers only randomly; these are considered sources of random measurement error and cannot be identified or modeled. In an assessment situation like the above example, it is necessary to identify and control various systematic sources of measurement error adequately to ensure that decisions about candidates can be made consistently based on assessment results. Generalizability theory (G theory; Brennan, 2001; Cronbach, Gleser, Nanda, & Rajaratnam, 1972; Shavelson & Webb, 1991) is a powerful framework that can assist in analyses of test data based on constructed-response items such as the performance-based assessment tasks above as well as those based on selected-response items (e.g., multiple-choice reading comprehension test items). G theory builds on the idea of reliability conceptualized in classical test theory (henceforth, CTT), but offers some flexibility in its extension of CTT. We begin with a brief review of the literature on previous G theory applications in second language (L2) assessment.

Applications of G theory in L2 assessment
G theory has been applied to L2 assessments for examining measurement consistency since the 1980s. While this analytic framework has been employed to

examine standardized assessments of reading, listening, vocabulary, and grammar comprising selected-response items (Brown, 1999; Sawaki & Sinharay, 2013; Zhang, 2006), previous research has focused mainly on the analysis of L2 performance assessment data of speaking and writing based on constructed-response items. The steady increase in G theory applications in the field since then has attested to its usefulness in examining various measurement issues such as rating consistency in existing L2 performance assessments for norm-referenced and criterion-referenced score interpretations (e.g., Bachman, Lynch, & Mason, 1995; Brown & Bailey, 1984; Lynch & McNamara, 1998; Sawaki, 2007; Schoonen, 2005; Xi, 2007) and the effects of rater training on rater performance consistency (Xi & Mollaun, 2011). The usefulness of G theory in developing new L2 performance assessments has also been demonstrated in studies such as a series of G theory analyses of prototype speaking and writing test forms of the TOEFL iBT® test for determining the numbers of ratings and tasks required to achieve a sufficient level of measurement consistency for high-stakes use (Lee, 2006; Lee & Kantor, 2005). Previous G theory applications in the field have advanced our understanding of specific issues surrounding L2 performance assessment. A notable contribution of previous G theory applications is the identification of the relatively large share of L2 performance score variability explained by the examinee-by-task interaction compared to other sources of variability such as rater effects when raters are trained (e.g., Xi & Mollaun, 2011; see In’nami & Koizumi, 2016, for a recent meta-analysis). In other words, the rank-ordering of examinees is likely to vary across task(s) that are employed, which in turn affects test score interpretations. This suggests the importance of employing multiple tasks to ensure that generalizable assessment results could be obtained.

Classical test theory (CTT)
Since the early 20th century, CTT has been used widely to examine reliability, or the degree of consistency of measurement. The key concepts of CTT are fairly simple to explain, beginning with three equations:

X = T + E (2.1)

σ²x = σ²t + σ²e (2.2)

rxx′ = σ²t / σ²x = σ²t / (σ²t + σ²e) (2.3)

First, in Equation 2.1, X denotes an observed score (the test score for a given examinee that we actually report), while T denotes the true score. The true score is technically defined as the mean of the theoretical distribution of score X across repeated independent testing of the same person on the same test. The difference between an observed score and the true score is called the measurement error (E), which represents random score variation due to reasons unrelated to what is being measured. As this formula shows, an observed test score is conceptualized

in CTT as a combination of the examinee’s true score and measurement error. Second, Equation 2.2 concerns the observed score variance (σ²x), which is a measure of the degree to which the observed scores of individuals are spread across the scale. This equation states that the observed score variance is a combination of the score variance due to the differences in examinees’ true scores (σ²t) and the score variance due to measurement error (σ²e). Third, the reliability of a measurement X (rxx′) is theoretically defined in CTT as the proportion of observed score variance that can be explained by the true score variance, as shown in the middle term of Equation 2.3 (σ²t/σ²x). Note that the right-most term of the equation (σ²t/(σ²t + σ²e)) depicts how the denominator can be transformed to a combination of the true score variance and the error variance based on Equation 2.2 as well. However, the CTT reliability formula cannot be put directly into practice because both a given examinee’s true score (T) and the true score variance (σ²t) are unknown. For this reason, an operational definition of CTT reliability is in order. CTT reliability is defined operationally as the correlation between observed scores on two parallel measures. Tests or measures can be described as parallel when they have been designed and shown to measure the same attribute or ability; technically, this is demonstrated when they have the same observed mean and variance as well as the same correlation to a third measure. Building on this operational definition, various reliability indices have been developed in CTT. Among such indices is Cronbach’s α (Cronbach, 1951), an internal consistency reliability index that has had widespread applications in behavioral measurement (for a recent overview of CTT reliability, see Sawaki, 2013).
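As an informal illustration of Equations 2.1–2.3 (not taken from the chapter), the short simulation below generates observed scores as true score plus random error and shows that the correlation between two parallel measures recovers the theoretical reliability σ²t/(σ²t + σ²e); all numerical values are made up for illustration.

import numpy as np

rng = np.random.default_rng(0)
n = 10_000
true_var, error_var = 9.0, 3.0                    # arbitrary values for sigma^2_t and sigma^2_e

T = rng.normal(50, np.sqrt(true_var), n)          # true scores
X1 = T + rng.normal(0, np.sqrt(error_var), n)     # two parallel measures: same true scores,
X2 = T + rng.normal(0, np.sqrt(error_var), n)     # independent errors

print("theoretical reliability:", true_var / (true_var + error_var))         # 0.75
print("parallel-forms correlation:", round(np.corrcoef(X1, X2)[0, 1], 3))    # approx. 0.75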

Limitations of CTT
Although the transparency of the logic in the above definitions of reliability is a notable advantage, CTT is not without limitations. Among various issues raised by previous authors, three issues discussed by Bachman (2004) and Shavelson and Webb (1991) are particularly relevant to this chapter. First, the measurement error in CTT in the above definition is a single term. This means that it does not distinguish different types of error contributing to measurement inconsistency from one another. On the one hand, test scores may be influenced by nonsystematic (random) errors that affect different test takers to varying degrees (e.g., the level of concentration during the test session). On the other hand, test scores may be affected by systematic sources of error that are assumed to have the same distorting effects on all scores (e.g., task difficulty, rater severity, task types). Thus, a major drawback of the CTT conceptualization of measurement error is that different sources of error – both systematic and nonsystematic – are lumped together as a single number. This is not ideal in analyzing language assessment data, particularly when the purpose is to identify important systematic sources of measurement error that need to be controlled so as to improve measurement consistency.

Second, the CTT framework does not allow modeling multiple sources of error at once. For instance, test-retest reliability and parallel-forms reliability provide estimates of the consistency of measurement across two testing occasions and two test forms, respectively, while Cronbach's α focuses on the degree to which individual items in a test consistently assess the same construct of interest. Meanwhile, a correlation coefficient might be reported as a measure of the degree to which scores are assigned consistently by a single rater across different occasions (intra-rater reliability) or by multiple raters on a single occasion (inter-rater reliability). While these indices are helpful in examining a specific type of measurement consistency, they do not allow the investigator to examine how multiple sources of error and their interactions affect measurement consistency within a single study (e.g., rating consistency across occasions and raters as well as their interactions). In reality, potential sources of error operate simultaneously, and it would be helpful to understand their relative impact on the reliability of test scores. Third, the conceptualization of reliability in CTT assumes norm-referenced test score interpretation (henceforth, NRT), focusing on the extent to which examinees are rank-ordered consistently across items, raters, occasions, forms, and so on. CTT reliability estimates are not appropriate when test scores are used for criterion-referenced test score interpretation (henceforth, CRT), where the focus is on the degree to which examinees are classified consistently as satisfying (or not satisfying) a pre-determined performance criterion. Accordingly, separate procedures are required for examining measurement consistency for CRT (see Brown & Hudson, 2002, for a comprehensive discussion of measurement consistency in CRT). G theory, however, provides a framework to address all the aforementioned limitations of CTT. It allows the analysis of measurement consistency for both NRT and CRT by modeling multiple sources of error simultaneously and distinguishing specific systematic sources of error and their interactions.

Key G theory concepts

Dependability and universe score

A central concept representing measurement consistency in G theory is dependability, which is defined as "the accuracy of generalizing from a person's observed scores on a measure to the average score that person would have received under all the possible conditions that the test user would be equally willing to accept" (Shavelson & Webb, 1991, p. 1). This average across all scores is the person's universe score, which is the G theory analogue of the true score in CTT or the domain score in CRT. Thus, dependability concerns the degree to which one can reliably generalize from a given candidate's observed test score to his/her universe score. Note that, while similar to reliability in CTT, the notion of dependability in G theory applies not only to rank-ordering consistency in NRT (relative decisions) but also to classification consistency in CRT (absolute decisions).

Defining the study design for conducting a G theory analysis

In a G theory analysis, the investigator defines the study design in a series of steps, as summarized in Table 2.1.

Step 1: Specifying the universe of admissible observations

A first component to specify is the universe of admissible observations, which is a set of measurement procedures that the investigator accepts as interchangeable with one another in estimating a candidate's universe score. Returning to the speaking placement test example above, suppose that your primary focus is on incoming students' spoken English ability for academic purposes. In this case, you may consider using two speaking tasks in which a candidate has to summarize the main points of an academic lecture, one on history and the other on business. Furthermore, you may decide that candidates' responses to these two tasks must be scored by raters who have completed a rater training program. In this case, you are treating scores assigned by trained raters on the two oral summary tasks as samples of observations drawn from the universe of admissible observations, namely, the populations of different tasks and raters that you are willing to treat as interchangeable with one another. Meanwhile, scores based on non-academic speaking tasks (e.g., a role-play task at an airline ticket counter) or scores assigned by raters who have missed the rater training session may not be treated as admissible observations. In G theory, such observations that the investigator considers as coming from the universe of admissible observations, and thus as interchangeable with one another, are called randomly parallel measures. Note that randomly parallel measures in G theory highlight the conceptual parallelism of measures, not the statistical parallelism required for parallel measures in CTT.

Table 2.1  Key Steps for Conducting a G Theory Analysis

Step 1. Specify the universe of admissible observations: Define the type of measurements that can be treated as interchangeable with one another. Examples: tasks that require examinees to listen to an academic lecture and summarize it orally; raters who have completed the same training program.

Step 2. Identify different sources of score variability: Define all important sources of score variability that need to be included in a G-theory analysis. Examples: academic English speaking ability of test takers as the object of measurement; tasks and raters as the facets of measurement.

Step 3. Determine how to specify each facet of measurement: Determine whether each facet of measurement is treated as random or fixed. Examples: tasks and raters as random facets; task types as a fixed facet.

Step 4. Define the study design: Specify the relationships among the object of measurement and all facets of measurement as crossed or nested. Example: All test takers complete the same two tasks in a speaking assessment, and all their responses are scored by the same three raters. Test takers, tasks, and raters are crossed with one another (a two-facet crossed design).

Step 2: Identifying different sources of score variability

The investigator needs to identify different sources of score variability in a given measurement design. G theory distinguishes two major types of sources. One is the object of measurement, or the target construct of interest. The other major source of variability is measurement error. As noted above, measurement error comes in various types, both systematic and nonsystematic. In G theory, systematic sources of error that affect the accuracy of generalization from a candidate's test score to his/her universe score are called facets of measurement. In the case of the speaking placement test example above, examinees' academic English speaking ability is the object of measurement, while tasks and raters can be specified as the facets of measurement. In a reading assessment comprising multiple sets of comprehension items based on different texts, reading ability is the object of measurement, while texts and items can be treated as facets of measurement. If the same assessment is repeated twice, assessment occasion would be an important facet of measurement as well. Adequately identifying the important facets involved in a given measurement design is key to a well-designed G theory analysis that yields meaningful and interpretable results.

Step 3: Determining how to specify each facet of measurement

The investigator needs to specify whether each facet of measurement is random or fixed. A facet can be described as a random facet when (a) the size of the sample employed in a given assessment condition is much smaller than the size of the universe (namely, the population) and (b) the sample is drawn randomly from the universe or considered to be interchangeable with another sample drawn from the same universe. Raters, tasks, items, texts, and occasions mentioned in the above examples can often be conceptualized as random facets because there are a large number of options that the investigator can choose from for each facet. In contrast, a facet of measurement is considered a fixed facet when the number of conditions exhausts all possibilities in the universe of admissible observations. Typical instances of a fixed facet include analytic rating scales and task types, which are selected on purpose and thus cannot be replaced with others. In the speaking placement test above, for example, you may design analytic rating scales to tap into three specific aspects: pronunciation, fluency, and linguistic resources. None of them is interchangeable with the others or with any other rating scale that taps into another aspect (e.g., content). Another example of a fixed facet is the case in which the same small number of raters always score examinee responses. This may happen in a small language program where the number of individuals who can serve as raters is limited (see an overview of multilevel modeling in Chapter 7, Volume II, for a related discussion of random effects and fixed effects).

Step 4: Defining the study design

It is important for the investigator to define the study design for a G theory analysis by specifying the relationships among the object of measurement and the facets of measurement, either as crossed or nested. When two sources of score variability are crossed with each other, all possible combinations of the conditions in one source and those in the other are observed in the measurement design. Thus, when examinee responses to all tasks are scored by all raters, for instance, raters are crossed with tasks. Likewise, when all examinees complete all items, the object of measurement is crossed with the facet of measurement for items. In contrast, when two or more conditions of one source of score variability appear under one and only one condition of the other, the former is nested within the latter. A typical example is a reading or listening comprehension assessment based on multiple texts, where a specific group of items is associated only with one specific text. In this case, items are nested within texts.
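To make the crossed/nested distinction concrete, here is a small, purely hypothetical sketch of how the two data layouts might be arranged in long format with pandas; the labels and column names are our own illustrations, not conventions of any G theory software.

```python
import pandas as pd

# Fully crossed p x i design: every person responds to every item.
crossed = pd.DataFrame(
    [(p, i) for p in ["P1", "P2", "P3"] for i in ["I1", "I2", "I3", "I4"]],
    columns=["person", "item"],
)

# Items nested within texts (i:t): each item belongs to exactly one text,
# while persons remain crossed with the item-text combinations.
text_items = {"T1": ["I1", "I2"], "T2": ["I3", "I4"]}
nested = pd.DataFrame(
    [(p, t, i) for p in ["P1", "P2", "P3"] for t, items in text_items.items() for i in items],
    columns=["person", "text", "item"],
)

print(crossed.head())
print(nested.head())
```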

Variance components

Once the study design is identified, the investigator can conduct a G theory analysis to obtain a decomposition of the observed score variance into different parts, or variance components (σ²). A variance component refers to the part of score variance that can be explained by the object of measurement or a facet(s) of measurement. How the observed score variance is decomposed into variance components can be explained by using a Venn diagram, an example of which is presented in Figure 2.1. For instance, let us consider a scenario where all 35 students complete all items in a 30-item vocabulary test. With students (persons) as the object of measurement and items as a random facet of measurement, this measurement design can be specified as a one-facet crossed study design (termed a p × i design, where p and i denote persons and items, respectively, and × is read as "crossed with").

Figure 2.1  A one-facet crossed design example.

The three areas in Figure 2.1 represent different effects involved in this example: the main effect of the object of measurement, the main effect of the facet of measurement, and their interaction. Corresponding to this, the observed score variance, denoted as σ²(Xpi), is divided into three types of variance components in this design:

σ²(Xpi) = σ²p + σ²i + σ²pi,e    (2.4)

The three variance components are interpreted as follows:

i    Variance component for the object of measurement (σ²p): score variance due to overall examinee differences in the universe score
ii   Variance component for the facet of measurement (σ²i): score variance due to mean differences across items (item difficulty differences)
iii  Variance component for the interaction of the two and error (σ²pi,e): a combination of score variance due to rank-ordering differences of examinees across items and score variance due to undifferentiated error (various nonsystematic sources of variance and other potential systematic sources of error not being modeled). Note that the variance component for the highest-order interaction in a given measurement design (the person-by-item interaction in this case) also includes the error component because they cannot be distinguished from each other. Thus, the notation for this variance component includes an "e" for the measurement error.

As a second example, consider a measurement design for a speaking placement test (Figure 2.2). Involving two important random facets of measurement (tasks and raters), this design can be depicted as a two-facet crossed design.

Figure 2.2  A two-facet crossed design example.

This design defines examinee speaking ability as the object of measurement and tasks and raters as the random facets of measurement. Here, examinees respond to both tasks, while all their responses are scored by all raters, so the persons, tasks, and raters are crossed with each other (denoted as a p × t × r design). The observed score variance for this two-facet crossed design can be decomposed into seven parts:

σ²(Xptr) = σ²p + σ²t + σ²r + σ²pt + σ²pr + σ²tr + σ²ptr,e    (2.5)

A notable difference from the one-facet crossed design is that there are now three variance components representing the three main effects (σ²p, σ²t, and σ²r), three variance components attributable to their two-way interactions (σ²pt, σ²pr, and σ²tr), and the variance component due to the three-way interaction among them confounded with error (σ²ptr,e). Finally, Figure 2.3 illustrates the decomposition of the observed score variance into different types of variance components for a situation such as a reading comprehension test comprising different sets of comprehension items associated with different texts (denoted as a two-facet partially nested design, or the p × (i:t) design, where the ":" is read "nested within"). In the case of this two-facet partially nested design, the observed score variance is decomposed into five parts:

σ²(Xpit) = σ²p + σ²t + σ²i,it + σ²pt + σ²pi,pit,e    (2.6)

A comparison of Figure 2.2 with Figure 2.3 clearly shows the effect of nesting.

Figure 2.3  A two-facet partially nested design example.

In a situation like this reading comprehension test example, neither the variance component for items (σ²i) nor that for its interaction with persons (σ²pi) can be obtained independently of the text facet, because a specific set of items is always associated with a specific text. For this reason, only five types of variance components can be obtained for this design. It should be noted that all the measurement designs described so far are univariate designs (measurement designs with one dependent variable) involving random facets of measurement only. When an assessment also involves a fixed facet (e.g., analytic scores), it is possible to conduct univariate analyses separately for the different conditions of the fixed facet. However, this approach results in the loss of important information concerning the relationships among different conditions in a fixed facet. Therefore, a preferred alternative is to employ multivariate G theory, a powerful framework that allows a comprehensive analysis of measurement designs involving a fixed facet. For further details about multivariate G theory, see Chapter 3, Volume I.

The importance of appropriate rating designs

As noted above, the quality of a G theory analysis is fundamentally determined by the degree to which facets of measurement involved in a given measurement design are adequately identified and modeled. When raters are used to score examinee responses, it is particularly important to carefully determine the rating design, or how to assign examinee responses to different raters for scoring, so that all important effects can be teased apart for proper analyses. However, rating designs are usually constrained by the contexts in which ratings are conducted. In this section, we discuss rating design considerations in two rating contexts, research studies and operational scoring for large-scale tests. We provide practical guidance on how to set up rating designs that can lead to meaningful investigations into the dependability of assessment scores assigned by raters.

In a study that aims to understand the effects associated with different facets, we have the luxury of supporting a fully crossed design in which all raters score all tasks from all examinees, so that we can model various effects related to raters and tasks. When this scenario is too resource intensive, another more practical and feasible design is to use random rater pairs to score blocks of responses from each task. This is a nested design with persons crossed with tasks and raters nested within persons and tasks, which can be denoted as the r:(p × t) design. However, because of the nesting, this design does not allow estimation of independent variance components involving raters, thus yielding limited information about different sources of error associated with raters. An alternative design that allows all rater-related errors to be estimated independently is to treat the data assigned to a specific rater pair as a block and model it as a fully crossed design to estimate the variance components separately on each block. Note that this design could become complicated as the number of rater pairs increases because the G theory analysis for the p × t design needs to be repeated as many times as the number of rater pairs.

However, more stable variance components obtained by averaging the variance components across blocks can then be used to estimate the impact of varying study designs on the dependability of an assessment (Brennan, Gao, & Colton, 1995; Chiu, 2001; Chiu & Wolfe, 2002). The separate analysis for each rater pair provides information on the rater effects and other effects associated with a particular pair, whereas averaging various components across rater pairs allows us to examine overall rater and task effects and their interactions.

In operational scoring of large-scale tests, scoring is designed to avoid systematic scoring errors and to minimize random scoring errors. When responses are double scored, two raters randomly selected from a rater pool are typically assigned to each response (e.g., essay). In addition, if each test taker provides multiple responses, it is not advisable to use the same rater pair for all responses. Instead, a different rater pair is assigned to each response so as to randomize the impact of individual rater errors. This random assignment of rater pairs, however, makes it challenging to model rater effects systematically because it might not be possible to identify blocks of responses by rater pairs. In this case, it is advisable not to model raters but to model ratings as a random facet (Lee, 2006). However, this changes the interpretation of the rater (now rating) facet. Although this kind of design may yield estimates of rating effects that are similar to those using raters as a random facet, information related to specific rater pairs or raters is not available to facilitate monitoring of rater performance, as in the sample study at the end of this chapter.

Measurement model underlying univariate G theory analyses

As mentioned above, a key step in a univariate G theory analysis is the decomposition of the observed score variance into different types of variance components for a given measurement design. Because there is no way to obtain variance components directly, they need to be estimated. A common measurement model employed to do just that is random-effects repeated-measures analysis of variance (ANOVA), while other measurement models are also available (e.g., see Schoonen, 2005, for an example). Variance component estimates required for a G theory analysis can be obtained by means of statistical software such as SPSS as well as freely available software developed specifically for G theory analyses, including EduG (Cardinet, Johnson, & Pini, 2009) and the GENOVA Suite of Programs available from the University of Iowa website (https://education.uiowa.edu/centers/center-advanced-studies-measurement-and-assessment/computer-programs), which comprises programs designed for univariate analyses (GENOVA and urGENOVA) as well as mGENOVA, a program for multivariate analyses to be discussed in Chapter 3, Volume I. We employed GENOVA for all analyses to be described in subsequent sections. It should be noted that published rules of thumb on sample size requirements for conducting a G theory analysis are scarce. In general, conducting a G theory analysis with a large sample allows stable variance component estimation, which is critical for generating consistent and trustworthy G theory analysis results.

A recent simulation study conducted by Atilgan (2013) recommends a sample size of 50 to 300 for obtaining stable G theory analysis results. However, Atilgan's study focused on a specific study design for a multivariate G theory analysis of selected item response data. Therefore, further studies are required to examine the extent to which similar results can be obtained for other study designs and for different types of assessment data. The random-effects ANOVA application to a univariate G theory analysis is similar to regular ANOVA analyses for group comparisons in that, in both cases, a test taker's score is decomposed into parts explained by different effects. For instance, in a G theory analysis of the vocabulary test depicted in Figure 2.1, a given learner's score is decomposed into several parts which are explained by the object of measurement (differences in test-takers' ability), the facet of measurement (e.g., differences in item difficulty), and the residual (a combination of the interaction between person ability and item difficulty as well as undifferentiated error). Note, however, that the random-effects ANOVA application to a univariate G theory analysis differs from a regular ANOVA for group comparisons. While conducting a statistical test (F test) is often the primary goal of an ANOVA for a group comparison, the main interest of using ANOVA for a G theory analysis lies in calculating the mean squares for the different effects involved in a measurement design, which are required for estimating the corresponding variance components. We will show you the use of ANOVA in G theory through an illustrative example in the following section.
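As a preview of that example, the sketch below shows how the mean squares that feed the EMS equations can be computed for a fully crossed p × i score matrix; the tiny data matrix is invented, and the code only mirrors the random-effects ANOVA logic described above (the chapter's own analyses were run in GENOVA).

```python
import numpy as np

def pxi_mean_squares(x: np.ndarray) -> dict:
    """Mean squares for a random-effects two-way ANOVA on a persons-by-items matrix
    (one observation per cell, so the p x i interaction is confounded with error)."""
    n_p, n_i = x.shape
    grand = x.mean()
    person_means = x.mean(axis=1)
    item_means = x.mean(axis=0)

    ss_p = n_i * ((person_means - grand) ** 2).sum()
    ss_i = n_p * ((item_means - grand) ** 2).sum()
    ss_total = ((x - grand) ** 2).sum()
    ss_res = ss_total - ss_p - ss_i          # person-by-item interaction + error

    return {
        "MS_p": ss_p / (n_p - 1),
        "MS_i": ss_i / (n_i - 1),
        "MS_pi,e": ss_res / ((n_p - 1) * (n_i - 1)),
    }

# Invented 4-person, 3-item example for illustration only.
x = np.array([[1, 0, 1],
              [1, 1, 1],
              [0, 1, 0],
              [1, 1, 0]], dtype=float)
print(pxi_mean_squares(x))
```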

Key steps in a G theory data analysis

The estimation of variance components in a G theory analysis proceeds in two key steps. The first is a generalizability study (G study), which aims to examine the relative effects of different sources of variance on score variability for a hypothetical scenario, where a test containing only one item drawn from the universe of admissible items is administered. The purpose of a G study is to obtain baseline data against which the dependability of other measurement designs could be compared. To do so, variance component estimates obtained for a random-effects ANOVA model are used. The second step of the analysis is called the decision study (D study), where the relative effects of different sources of variance on score variability are calculated for particular measurement designs of interest based on G-study variance component estimates. The results are then used to estimate the level of score dependability of the specific measurement designs for relative decisions in NRT and/or absolute decisions in CRT. As an illustrative example, the discussion in this section is based on the p × i study example for the hypothetical vocabulary test comprising 30 selected-response items (k = 30) taken by 35 test takers (n = 35) presented earlier in Figure 2.1. The data layout for this example is shown in Table 2.2 ("1" = correct response; "0" = incorrect response). The item mean scores and person mean scores are also presented in the bottom and right margins, respectively. The value in the bottom-right cell is the grand mean (0.90762).

Table 2.2  Data Setup for the p × i Study Example With 30 Items (n = 35)

Student   Item 1    Item 2    …   Item 29   Item 30   Mean
1         1         0         …   0         1         0.86667
2         0         1         …   1         0         0.93333
3         1         0         …   1         1         0.96667
…         …         …         …   …         …         …
35        1         1         …   0         1         0.90000
Mean      0.97143   0.77143   …   0.82857   0.94286   0.90762

Table 2.3  Expected Mean Square (EMS) Equations (the p × i Study Design)

Source       Expected mean squares         Variance component estimates
Person (p)   EMSp = σ²pi,e + niσ²p         0.56880 = 0.03683 + 30σ²p → σ²p = 0.01773
Items (i)    EMSi = σ²pi,e + npσ²i         1.11662 = 0.03683 + 35σ²i → σ²i = 0.03084
pi,e         EMSpi,e = σ²pi,e              σ²pi,e = 0.03683

Note: p = persons, i = items, e = error, ni = sample size for items, np = sample size for persons, σ²p = variance component for person main effect, σ²i = variance component for item main effect, σ²pi,e = variance component for person-by-item interaction confounded with error.

Step 1: Generalizability study (G study)

As noted above, a G study is conducted for a hypothetical scenario of a single observation. In this p × i study example, for instance, this refers to a case in which only one item is included in the vocabulary test. In this analysis, first, mean square values are obtained from a random-effects ANOVA model for specific effects and are plugged into a set of expected mean square (EMS) equations for the specific G-study design. The set of EMS equations is then solved to obtain the specific variance component estimates of interest. As an example, Table 2.3 presents the EMS equations for the p × i study design along with the calculations of the three associated variance component estimates. The reason why the mean square values obtained from the ANOVA model are entered as EMS values in the calculation is that EMSs are not directly observable and therefore must be estimated. EMS equations specific to common G-study designs can be found in various G theory references (e.g., Brennan, 2001; Shavelson & Webb, 1991).

Table 2.4 shows the results of the G study. The variance component estimates for the p × i study design are presented along with the proportion of the total observed variance accounted for by the three sources of variation.

Table 2.4  G-study Results (the p × i Study Design)

Source of variation   σ²        % total
Persons (p)           0.01773   20.8%
Items (i)             0.03084   36.1%
pi,e                  0.03683   43.1%
Total                 0.08540   100.0%

Note: p = persons, i = items, e = error.

One way to interpret these results is to focus on the relative sizes of the variance component estimates. Because the purpose of the measurement is to assess students' vocabulary knowledge, the goal is to maximize the proportion of variance explained by the object of measurement (σ²p). The results show, however, that only 20.8% of the total score variance is attributable to the object of measurement. As noted above, this is because a G study is conducted on a hypothetical scenario based on only one item (ni = 1) drawn randomly from the universe of admissible observations in order to obtain baseline information for conducting D studies on different measurement designs. For the same reason, the other variance components for the facets of measurement are quite sizable, with 36.1% of the score variance accounted for by differences in difficulty across items (σ²i), and 43.1% by the person rank-ordering differences across items confounded with error (σ²pi,e).
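For readers who wish to verify the arithmetic, the following sketch solves the EMS equations in Table 2.3 with the mean square values reported there; any difference in the last decimal place from the published estimates simply reflects rounding of the reported mean squares.

```python
# Mean squares reported in Table 2.3 for the p x i design (n_p = 35, n_i = 30).
ms_p, ms_i, ms_pie = 0.56880, 1.11662, 0.03683
n_p, n_i = 35, 30

var_pie = ms_pie                   # sigma^2_pi,e = MS_pi,e
var_p = (ms_p - ms_pie) / n_i      # sigma^2_p = (MS_p - MS_pi,e) / n_i
var_i = (ms_i - ms_pie) / n_p      # sigma^2_i = (MS_i - MS_pi,e) / n_p

print(f"sigma^2_p    = {var_p:.5f}")    # about 0.01773, as in Table 2.3
print(f"sigma^2_i    = {var_i:.5f}")    # about 0.03084 (last digit may differ due to rounding)
print(f"sigma^2_pi,e = {var_pie:.5f}")  # 0.03683
```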

Step 2: Decision study (D study)

In this step, variance component estimates are obtained for different D studies based on the results of the G study by changing the sample size for any facets of measurement of interest. D studies are conducted to estimate the score dependability of different measurement designs, such as the actual measurement design in use and alternative measurement designs for consideration that could optimize score dependability. For the vocabulary test example above, an issue of interest is whether the current test design with 30 items provides a sufficient level of score dependability. Thus, running a D study for the case of 30 items is reasonable. In addition, the investigator might want to know whether reducing the number of items to 20 is still sufficient to ensure score dependability (to reduce testing time) and whether increasing the items to 40 would be beneficial to enhance score dependability (if the stakes of the assessment are high). Table 2.5 presents the results of the D studies for the p × i study design with 20, 30, and 40 items. By convention, uppercase letters are used to represent the facet whose sample size is manipulated in the D study, and the D-study sample size for that facet is noted with a prime (′). Thus, the item facet and its sample size are denoted as I and ni′, respectively, in Table 2.5.

Table 2.5  D-study Results (the p × I Study Design)

Source of variation    ni′ = 20               ni′ = 30               ni′ = 40
                       σ²        % total      σ²        % total      σ²        % total
Persons (p)            0.01773   84.0         0.01773   88.7         0.01773   91.3
Items (I)              0.00154   7.3          0.00103   5.1          0.00077   4.0
pI,e                   0.00184   8.7          0.00123   6.1          0.00092   4.7
Total                  0.02111   100.0        0.01999   100.0        0.01942   100.0

σ²Rel                  0.00184                0.00123                0.00092
σ²Abs                  0.00338                0.00226                0.00169
Eρ²Rel                 0.90598                0.93513                0.95067
Φ                      0.83989                0.88694                0.91298

Note: p = persons, I = items, e = error, ni′ = D-study sample size for items, σ² = variance component, σ²Rel = relative error variance, σ²Abs = absolute error variance, Eρ²Rel = generalizability coefficient, Φ = index of dependability.

As can be seen in Table 2.5, the D-study variance component estimate for the object of measurement is identical to the G-study person variance component (see Table 2.4), while the D-study variance component estimates for the facets of measurement are obtained by dividing the corresponding G-study variance component estimates by the D-study sample size(s) of the facet(s) involved. For instance, the D-study variance component estimates for the facets of measurement with 30 items in Table 2.5 were obtained as σ²I/ni′ = 0.03084/30 = 0.00103 and σ²pI,e/ni′ = 0.03683/30 = 0.00123. The proportions of variance explained by the different sources of variation for the three D studies in Table 2.5 show that with 20 or more items, at least 84.0% of the total score variance can be explained by the object of measurement, while the proportions accounted for by the facets of measurement are fairly small. Once the D-study variance component estimates are obtained, they are used to model score dependability for specific measurement designs. First, an estimated error variance is calculated for the specific type of decision of interest: the relative error variance (σ²Rel) for a relative decision in NRT and the absolute error variance (σ²Abs) for an absolute decision in CRT. Only the facet of measurement involving interactions with the object of measurement contributes to the relative error variance, because the focus of relative decisions is person rank-ordering consistency. In contrast, all variance component estimates involving the facets of measurement contribute to the absolute error variance because not only rank-ordering consistency but also performance-level consistency is of interest in making an absolute decision. Thus, for the D study with 30 items in Table 2.5, σ²Rel = σ²pI,e = 0.00123, and σ²Abs = σ²I + σ²pI,e = 0.00103 + 0.00123 = 0.00226.
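The same arithmetic can be scripted. This minimal sketch takes the G-study estimates from Table 2.4 and reproduces the D-study variance components and error variances in Table 2.5 for 20, 30, and 40 items; agreement with the table is up to rounding of the reported G-study estimates.

```python
# G-study variance component estimates from Table 2.4 (p x i design).
var_p, var_i, var_pie = 0.01773, 0.03084, 0.03683

for n_items in (20, 30, 40):
    var_I = var_i / n_items          # D-study item component
    var_pIe = var_pie / n_items      # D-study person-by-item + error component
    rel_error = var_pIe              # relative error variance (relative decisions, NRT)
    abs_error = var_I + var_pIe      # absolute error variance (absolute decisions, CRT)
    print(f"n_i' = {n_items}: sigma^2_I = {var_I:.5f}, sigma^2_pI,e = {var_pIe:.5f}, "
          f"sigma^2_Rel = {rel_error:.5f}, sigma^2_Abs = {abs_error:.5f}")
```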

The square roots of the relative and absolute error variances are the standard errors of measurement (SEM) for relative and absolute decisions, respectively, which can be used to construct confidence intervals for estimating universe scores from observed scores (Brennan, 2001, pp. 35–36). For example, for a relative decision for the D study with 30 items, the universe score of a student who has earned a raw score of 15 lies in the range of 15 ± 1.96√0.00123, or from 14.931 to 15.069, at the 95% confidence level. (Note that the SEM is multiplied by 1.96 here because 95% of the scores fall between 1.96 SEM units below and above the mean score when the score distribution is normal.) Second, these error variances can be used to calculate indices of score dependability for the two types of decisions, which are analogous to a CTT reliability index: the generalizability coefficient (G coefficient, or Eρ²Rel) for relative decisions for NRTs and the index of dependability (Φ) for absolute decisions for CRTs. The formulae for the calculation of these indices are presented as Equations 2.7 and 2.8:

Eρ²Rel = σ²p / (σ²p + σ²Rel)    (2.7)

Φ = σ²p / (σ²p + σ²Abs)    (2.8)

Hence, for the D study with 30 items, Eρ²Rel = 0.01773/(0.01773 + 0.00123) = 0.93513, and Φ = 0.01773/(0.01773 + 0.00226) = 0.88694 (see Table 2.5). These indices range from 0 to 1 and can be interpreted in the same way as CTT reliability indices such as Cronbach's α. There are various rules of thumb for interpreting such reliability coefficients (e.g., .80 and above suggesting good reliability and between .70 and .80 suggesting marginal reliability; Hoyt, 2010, p. 152). A few issues should be kept in mind in using them, however. First, the required level of measurement consistency depends on various factors such as the purpose and stakes of assessment (Nunnally, 1978). Second, in the context of G theory, a dependability coefficient (Φ) offers a more stringent estimate of score dependability than a G coefficient (Eρ²Rel) obtained from the same D-study design. This is because the absolute error variance used for the calculation of Φ is always equal to or greater than that for Eρ²Rel, as can be seen in the discussion above. Table 2.5 shows that the dependability of measurement for 20 items is acceptable for both relative decisions (Eρ²Rel = 0.91) and absolute decisions (Φ = 0.84). Thus, unless the test is used for very high-stakes decisions, it is deemed sufficient to administer only 20 items from the perspective of measurement consistency. When a cut score(s) is employed for classifying candidates into different levels of performance for making absolute decisions, the Φλ index (pronounced "phi-lambda"), which indicates the degree of consistency of classification decisions at the cut score (λ), can also be calculated (for further details about the phi-lambda index, see Brennan, 1983).
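Continuing the sketch above, Equations 2.7 and 2.8 and the confidence interval for the worked example can be computed directly; the printed values agree with those reported in the text up to rounding of the variance components used as inputs.

```python
import math

var_p = 0.01773                           # G-study person variance component
rel_error, abs_error = 0.00123, 0.00226   # 30-item D-study error variances (Table 2.5)

g_coefficient = var_p / (var_p + rel_error)   # Equation 2.7: generalizability coefficient
phi = var_p / (var_p + abs_error)             # Equation 2.8: index of dependability
print(f"Erho^2_Rel = {g_coefficient:.5f}, Phi = {phi:.5f}")  # about 0.935 and 0.887

# 95% confidence interval for the universe score of a student with a raw score of 15
# (relative decision, 30-item D study).
sem_rel = math.sqrt(rel_error)
lower, upper = 15 - 1.96 * sem_rel, 15 + 1.96 * sem_rel
print(f"95% CI: {lower:.3f} to {upper:.3f}")  # about 14.931 to 15.069
```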

Sample study: Dependability of an EFL summarization writing assessment

It was discussed earlier that the results obtained from a G theory analysis can change depending on how various facets of measurement that are deemed to affect score dependability in a given measurement context are specified.

In order to elaborate on this point further, this section presents results of a sample G theory application to a dependability analysis of ratings based on a rating scale designed for low-stakes, criterion-referenced assessment of learners' summarization skills as part of L2 academic writing instruction at a university in Japan. The dataset analyzed in this section was obtained from a pilot test of the draft rating scale. Given the intended use of the rating scale in writing courses, a primary purpose of this investigation was to examine the degree to which practitioners could score learner-produced summaries consistently based on the rating scale. In this particular assessment context, multiple raters scored L2 learners' writing samples. However, because it was infeasible to have the same two raters score all examinee responses for double-scoring, learner-produced writing samples were divided into blocks, which were then assigned to different rater pairs. In the subsequent analysis, the results of a common design in which a rating facet is modeled without distinguishing the individual raters involved in the first and the second ratings will be reported first. The results will then be compared against another G theory analysis run based on the block-level analysis of different rater pairs discussed earlier (see "The importance of appropriate rating designs"). These two methods were also described by Lin (2017) as the rating method and the "subdividing" method, respectively. The latter method was also employed in Xi's (2007) dependability analysis of the TOEFL iBT speaking data. This comparison exemplifies how these two methods differ in the extent to which they yield information that informs training and monitoring of individual raters.

Methodology

Participants

Participants in this study were 74 Japanese undergraduate students majoring in English at a private university in Japan, who were recruited as part of a larger study. Among the participants, 13 students were enrolled in an elective TOEFL test preparation course, while the remaining 61 were enrolled in six different sections of an academic writing course required for second-year English majors. Prior to their study participation, the students had received instruction from their respective course instructors on how to write summaries in English as part of the curriculum. As for the English writing ability level of the participants, the mean score on an expository TOEFL essay in the Criterion® Writing Program (ETS) used for course placement was 4.12 (SD = 1.01). The raters were experienced instructors of English writing courses in the undergraduate program and a doctoral student in applied linguistics with extensive L2 writing teaching and research experience. For rater training, the raters familiarized themselves with the writing task and the rating scale, scored student responses in a practice set, and discussed their rating results in a three-hour session.

Instrument, scoring, and design

All students wrote a summary in English (approximately 60 words) of a source text written in English, which was a 345-word problem-solution text on a new environmental technology.

The description below concerns a four-point rating scale (0–3) of essay content, which focuses on the degree to which the relationships among main points of the text are represented accurately and sufficiently in the learner-produced summary. All 74 summaries were scored by two independent raters. The summaries were randomly divided into two blocks, each containing 37 summaries, and the two blocks were assigned to different rater pairs (Raters A & B scoring Block 1, and Raters C & D scoring Block 2; see Table 2.6 for the rating design). For each rater, the order of presentation of the summaries was randomized. The measurement design was specified in two different ways for running G and D studies. First, a G study was conducted for a univariate p × r′ study design for the entire dataset (n = 74), where persons (p) were fully crossed with ratings (r′). A D study with two ratings was then conducted. Second, for a comparison with the first analysis, another G study was conducted for each block separately (n = 37) for a univariate p × r design, where persons were fully crossed with raters (r). The obtained G study variance component estimates were then averaged across the two blocks, and a D study with two raters was conducted using the averaged variance component estimates. Based on the D-study results, the indices of dependability for absolute decisions were obtained for the current measurement designs (two ratings or two raters). The computer program GENOVA Version 3.1 (Crick & Brennan, 2001) was used for all analyses.

Table 2.6  Rating Design for the Sample Application

Block   Response   Rating 1   Rating 2
1       1 … 37     Rater A    Rater B
2       38 … 74    Rater C    Rater D

Results

Table 2.7 presents G-study and D-study results for the p × r′ design. The right-hand columns present results of the D study with two ratings, which reflects the actual test design. The D-study results presented in Table 2.7 show that the proportion of variance accounted for by person ability differences is 64.4%. The variance component for ratings was virtually nil, suggesting that the first and second ratings were comparable in terms of the overall rating severity. The person-by-rating interaction and random error variance component explained 35.6% of the total variance.

Table 2.7  G- and D-study Variance Component Estimates for the p × r′ Design (Rating Method)

Source of variation    G study                D study (nr′ = 2)
                       σ²        % total      σ²        % total
Persons (p)            0.23528   47.5         0.23528   64.4
Ratings (r′)           0.00000   0.0          0.00000   0.0
pr′,e                  0.25990   52.5         0.12995   35.6
Total                  0.49518   100.0        0.36523   100.0
Φ                                             0.64420

Note: p = persons, r′ = ratings, e = error, nr′ = D-study sample size for ratings, σ² = variance component, Φ = index of dependability.

While this might mean that the student rank-ordering differed between the first and second ratings, it cannot be determined that it was the primary contributor to this large variance component because the person-by-rating interaction effect on the variance component cannot be separated from the undifferentiated error in the p × r′ study design. The index of dependability was 0.644. This value is not satisfactory even for this low-stakes assessment context. The fairly low dependability of the ratings above suggests that some sources of error not taken into account in the D-study design might be at play. One possibility is differences among the individual raters, on which the above p × r′ design does not offer any information. However, by conducting a block-level analysis for the p × r design described above, some insights into the rater effects associated with a specific rater pair can be obtained (see Note 1). The top and middle sections of Table 2.8 present the G-study and D-study variance component estimates and the index of dependability obtained separately for each block, and the bottom section presents those averaged across the two blocks. The index of dependability for the first rater pair (Φ = 0.497) is low compared to that for the second rater pair (Φ = 0.708), and the variance component estimates provide some reasons for the difference. In both blocks, the contribution of the rater variance component (σ²r) to the total score variance was very small (1.4% and 0.5% of the total score variances explained by the rater facet in Blocks 1 and 2, respectively), suggesting that the first and second ratings were comparable in overall severity. However, the two blocks differed in the degrees of contributions of the other two variance components. In Block 2, the contribution of person ability differences (σ²p) was more than twice as large as person rank-ordering differences confounded with error (σ²pr,e), explaining 70.8% and 29.2% of the total score variance, respectively. In contrast, in Block 1, both sources of variability contributed almost equally, explaining approximately 49% of the total score variance each. It is worth noting that the person variance component estimate (σ²p) for Block 1 is quite small in size (0.12012). This suggests that the lack of variability in learner ability, which is independent of rater performance, is partly responsible for the fairly low index of dependability for Block 1.

Table 2.8  G- and D-study Variance Component Estimates for the p × r Design (Subdividing Method)

Batch                       Source of variation    G study               D study (nr′ = 2)     Φ
                                                   σ²        % total     σ²        % total
1 (Raters A & B; N = 37)    Persons (p)            0.12012   33.1        0.12012   49.7
                            Raters (r)             0.00676   1.9         0.00338   1.4
                            pr,e                   0.23649   65.1        0.11824   48.9
                            Total                  0.36337   100.0       0.24174   100.0       0.49689
2 (Raters C & D; N = 37)    Persons (p)            0.32883   54.8        0.32883   70.8
                            Raters (r)             0.00000   0.0         0.00000   0.0
                            pr,e                   0.27177   45.2        0.13589   29.2
                            Total                  0.60060   100.0       0.46472   100.0       0.70759
Mean                        Persons (p)            0.22448   46.6        0.22448   63.5
                            Raters (r)             0.00338   0.7         0.00169   0.5
                            pr,e                   0.25413   52.7        0.12707   36.0
                            Total                  0.48199   100.0       0.35323   100.0       0.63549

Note: p = persons, r = raters, e = error, nr′ = D-study sample size for raters, σ² = variance component, Φ = index of dependability.

However, another possibility is that Raters A and B tended to disagree more in rank-ordering students than Raters C and D despite their similarity in overall severity. The effect of rank-ordering differences between the raters on rating consistency cannot be isolated from that of the confounded error in this D-study design. Meanwhile, the results suggest the need to explore potential factors that may contribute to rank-ordering differences. In this pilot study, we also analyzed raters' written explanations for their rating decisions that they provided along with their ratings. This pointed to discrepancies between Raters A and B in the interpretation of the rating scale and aspects of learner responses to which they paid attention. This would warrant re-examining the level descriptors for the rating scale and the content and procedures for rater training. The D-study variance components and the index of dependability for the two raters based on the G-study variance components, which are averaged across the blocks in the bottom section of Table 2.8, are similar to those for the p × r′ design presented in Table 2.7. However, reporting the results for the block-level analysis for the p × r design might be preferable, because Lin's (2017) recent simulation study shows that the variance component estimates for the block-level analysis are statistically less biased than those for the p × r′ design.
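As a check on the two approaches, the following sketch recomputes the indices of dependability in Tables 2.7 and 2.8 from the reported G-study variance components, using the standard absolute-error formula for a persons-crossed-with-raters (or ratings) design with two ratings; differences from the tables in the last decimal place reflect rounding of the reported components.

```python
def phi_p_by_r(var_p: float, var_r: float, var_pre: float, n_r: int = 2) -> float:
    """Index of dependability for a p x r (or p x r') design with n_r ratings per person."""
    abs_error = var_r / n_r + var_pre / n_r
    return var_p / (var_p + abs_error)

# Rating method (Table 2.7): p x r' with two ratings.
print(f"Rating method:   Phi = {phi_p_by_r(0.23528, 0.00000, 0.25990):.5f}")  # about 0.64420

# Subdividing method (Table 2.8): each block separately, then the averaged components.
blocks = {
    "Block 1 (A & B)": (0.12012, 0.00676, 0.23649),
    "Block 2 (C & D)": (0.32883, 0.00000, 0.27177),
    "Averaged":        (0.22448, 0.00338, 0.25413),
}
for label, (vp, vr, vpre) in blocks.items():
    print(f"{label}: Phi = {phi_p_by_r(vp, vr, vpre):.5f}")  # about 0.497, 0.708, 0.635
```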

Discussion and conclusion

This chapter provided an introduction to generalizability theory (G theory), a flexible framework for simultaneous modeling of multiple sources of measurement error that could be applied to validating existing assessments, modifying assessments for future use, and designing new ones for both norm-referenced and criterion-referenced score interpretations.

First, the chapter described basic concepts of G theory as well as key steps for conducting a G theory analysis, focusing specifically on univariate G theory analyses. Then, in the illustrative example presented, two study designs for a G theory analysis of an EFL summarization writing assessment for low-stakes use were compared with each other. We demonstrated how the results and interpretation of a G theory analysis could change depending on how various facets of measurement involved in an assessment are specified and modeled. Two issues discussed in this chapter are particularly important in obtaining meaningful results from a G theory analysis. First, the investigator should keep in mind the strengths of G theory over the traditional conceptualization of score reliability in classical test theory (CTT) and take advantage of them to address specific research questions. As discussed in this chapter, one strength of G theory is its capability for a simultaneous analysis of multiple systematic sources of measurement error and their interactions. Because the measurement designs for performance assessments are often complex, G theory might be seen as an analytic framework that is best suited for analyzing performance assessment data. G theory is also applicable to traditional forms of assessment based on selected-response and limited-production responses because they often involve multiple systematic sources of measurement error. For instance, a typical measurement design for a reading or listening comprehension assessment is to employ groups of items associated with different stimulus texts, such as the example depicted in Figure 2.3. While a common approach to a reliability analysis in this case is to report a CTT reliability index (e.g., Cronbach's α), G theory-based indices of score dependability would yield an estimate of measurement consistency that better reflects the actual measurement design and assessment purpose than do CTT-based indices. For the above example, both stimulus texts and items could be specified as random facets of measurement in calculating a G coefficient for relative decisions or an index of dependability for absolute decisions. The investigator would also benefit from this G theory analysis because it offers additional information on the extent to which stimulus texts and items affect score dependability, which could be utilized to optimize the measurement design in future test administrations. A second issue an investigator should keep in mind is that the quality of a G theory application hinges on adequately modeling different sources of score variability which are associated with test contexts. All facets of measurement involved in a measurement design should be identified and modeled appropriately. This demands a careful specification of a rating design in advance and paying special attention to how raters have been assigned to different examinee responses during the scoring stage. The information that can be obtained from a study design for a G theory analysis involving nesting is limited compared to the information obtained from crossed designs.
For example, if the primary purpose of an investigation is to examine rating consistency, study designs where the rating or rater facet is crossed with other sources of score variability are called for in order to estimate variance components involving the rating or rater facet independently of other sources.

This again suggests the critical importance of a rating design that is compatible with the specific research question a study aims to address. Another issue closely related to the above is the difference between the rating method and Lin's (2017) subdividing method described in the sample G theory application to a score dependability analysis of an L2 writing assessment presented at the end of this chapter. Examining rater behaviors is an important research topic in L2 performance assessment, and both methods have been employed in previous G theory applications in the field (In'nami & Koizumi, 2016), shedding light on different aspects of rating consistency. On the one hand, modeling a rating facet might be useful when the focus of the investigation is the overall rating consistency after averaging out individual rater effects. On the other hand, modeling a rater facet is better suited to examining the degree to which scores are assigned consistently within and across different individuals. It should be noted that, while the subdividing method above yields some information that could inform rater training and monitoring, another analytic approach would be required when the study purpose is to conduct a fine-grained analysis of individual raters' behaviors. One such approach is many-facet Rasch measurement, which generates diagnostic information about each rater's behavior, such as severity estimates and reports on rating bias that he/she tends to exhibit when scoring responses on a certain task or those of specific examinees. Previous L2 performance assessment research suggests that G theory and many-facet Rasch measurement offer complementary measurement insights (e.g., Bachman et al., 1995; Lynch & McNamara, 1998). It is advisable that the reader employ whichever of these two analytic approaches suits the purpose and scope of a rater behavior investigation. While this chapter discussed key issues for conducting univariate G theory analyses for examining score dependability of L2 assessments, the coverage is far from comprehensive. One important issue that we could only touch upon briefly in this chapter is the consistency of classification decisions made at a given cut score (e.g., pass/fail decisions) in absolute decision making, using the Φλ index. Interested readers can refer to studies such as Bachman et al. (1995), Sawaki (2007), and Sawaki and Sinharay (2013) for applications in language assessment that address this topic. To conclude, G theory is a flexible framework that offers rich information about score dependability of assessments. A merit of employing G theory is that it requires the investigator to examine a given measurement design thoroughly through identifying and modeling different sources of score variability. It is our hope that this chapter will facilitate appropriate applications of G theory to score dependability investigations for enhancing the quality of measurement in L2 assessment and facilitating meaningful assessment score interpretation and use.

Acknowledgment

This work was partially supported by the Japan Society for the Promotion of Science (JSPS) Grant-in-Aid for Scientific Research (C) (No. 16K02983) awarded to the first author.


Note

1 For gaining further insights into raters' performance consistency, a many-facet Rasch measurement analysis can be conducted to obtain estimates of leniency/harshness for individual raters (see Chapter 7, Volume I).

References

Atilgan, H. (2013). Sample size for estimation of g and phi coefficients in generalizability theory. Eurasian Journal of Educational Research, 51, 215–228.
Bachman, L. F. (2004). Statistical analyses for language assessment. Cambridge: Cambridge University Press.
Bachman, L. F., Lynch, B. K., & Mason, M. (1995). Investigating variability in tasks and rater judgment in a performance test of foreign language speaking. Language Testing, 12(2), 238–257.
Brennan, R. L. (1983). Elements of generalizability theory. Iowa City, IA: American College Testing.
Brennan, R. L. (2001). Generalizability theory. New York, NY: Springer-Verlag, Inc.
Brennan, R. L., Gao, X., & Colton, D. A. (1995). Generalizability analyses of Work Keys listening and writing tests. Educational and Psychological Measurement, 55(2), 157–176.
Brown, J. D. (1999). The relative importance of persons, items, subtests and languages to TOEFL test variance. Language Testing, 16(2), 217–238.
Brown, J. D., & Bailey, K. M. (1984). A categorical instrument for scoring second language writing skills. Language Learning, 34(4), 21–38.
Brown, J. D., & Hudson, T. (2002). Criterion-referenced language testing. Cambridge: Cambridge University Press.
Cardinet, J., Johnson, S., & Pini, G. (2009). Applying generalizability theory using EduG. New York, NY: Routledge.
Chiu, C. W. T. (2001). Scoring performance assessments based on judgments: Generalizability theory. Boston, MA: Kluwer Academic.
Chiu, C. W. T., & Wolfe, E. W. (2002). A method for analyzing sparse data matrices in the generalizability theory framework. Applied Psychological Measurement, 26(3), 321–338.
Crick, J. E., & Brennan, R. L. (2001). GENOVA (Version 3.1) [Computer program].
Cronbach, L. J. (1951). Coefficient alpha and the internal structure of tests. Psychometrika, 16, 297–334.
Cronbach, L. J., Gleser, G. C., Nanda, H., & Rajaratnam, N. (1972). The dependability of behavioral measurements: Theory of generalizability for scores and profiles. New York, NY: Wiley.
Hoyt, W. T. (2010). Inter-rater reliability and agreement. In G. R. Hancock & R. O. Mueller (Eds.), The reviewer's guide to quantitative methods in the social sciences (pp. 141–154). New York, NY: Routledge.
In'nami, Y., & Koizumi, R. (2016). Task and rater effects in L2 speaking and writing: A synthesis of generalizability studies. Language Testing, 33(3), 341–366.
Lee, Y.-W. (2006). Dependability of scores for a new ESL speaking assessment consisting of integrated and independent tasks. Language Testing, 23(2), 131–166.
Lee, Y.-W., & Kantor, R. (2005). Dependability of new ESL writing test scores: Evaluating prototype tasks and alternative rating schemes. TOEFL Monograph Series No. 31. Princeton, NJ: Educational Testing Service.

Lin, C.-K. (2017). Working with sparse data in rated language tests: Generalizability theory applications. Language Testing, 34(2), 271–289.
Lynch, B. K., & McNamara, T. F. (1998). Using G-theory and many-facet Rasch measurement in the development of performance of ESL speaking skills of immigrants. Language Testing, 15(2), 158–180.
Nunnally, J. C. (1978). Psychometric theory (2nd ed.). New York, NY: McGraw-Hill.
Sawaki, Y. (2007). Construct validation of analytic rating scales in a speaking assessment: Reporting a score profile. Language Testing, 24(3), 355–390.
Sawaki, Y. (2013). Classical test theory. In A. Kunnan (Ed.), The companion to language assessment (pp. 1147–1164). New York, NY: Wiley.
Sawaki, Y., & Sinharay, S. (2013). The value of reporting TOEFL iBT subscores. TOEFL iBT Research Report No. TOEFLiBT-21. Princeton, NJ: Educational Testing Service.
Schoonen, R. (2005). Generalizability of writing scores: An application of structural equation modeling. Language Testing, 22(1), 1–30.
Shavelson, R. J., & Webb, N. M. (1991). Generalizability theory: A primer. Newbury Park, CA: SAGE Publications.
Xi, X. (2007). Evaluating analytic scores for the TOEFL® Academic Speaking Test (TAST) for operational use. Language Testing, 24(2), 251–286.
Xi, X., & Mollaun, P. (2011). Using raters from India to score a large-scale speaking test. Language Learning, 61(4), 1222–1255.
Zhang, S. (2006). Investigating the relative effects of persons, items, sections, and languages on TOEIC score dependability. Language Testing, 23(3), 351–369.

3

Multivariate generalizability theory in language assessment

Kirby C. Grabowski and Rongchan Lin

Introduction

Since the early 1990s, language assessment has seen a period of change, in which “performance assessment, where learners have to demonstrate practical command of skills acquired, is rapidly replacing more traditional test formats such as paper-and-pencil tests involving multiple choice questions” (McNamara, 1996, p. 1). The performance-based assessment format can be distinguished from traditional assessments by the presence of two factors: an observed performance sample by the candidate (e.g., an essay or speech sample) and scoring by human judges using an agreed-upon rating process (McNamara, 1996). Oftentimes, the scoring of these tests also involves analytic rating scales, where test takers receive a profile of scores based on a complex and multidimensional underlying test construct consisting of components that may have hypothesized relationships among them. Thus, performance assessments, by their very nature, introduce additional aspects into the measurement design that may have an unexpected and systematic effect on test scores, such as raters (and their relative severity), tasks (and their relative difficulty), potential interactions between those facets, and other random factors. Adding additional complexity is the fact that these effects may present differently depending on the rating process that is adopted. As a result, test method facets and their interactions become important considerations in the analysis and interpretation of test data, and language assessment researchers need statistical procedures that can account for these potential sources of construct (ir)relevant variance and/or measurement error (Brennan, 1992). While the univariate approach to generalizability (G) theory focuses on the above factors when examining the dependability of a single measure or score, multivariate generalizability (MG) theory, as an extension of the univariate model, is a more complex statistical theory that can account for these factors when multiple measures are part of the measurement context. In language assessment research, multiple measures usually come in the form of a profile of scores, such as those derived from analytic scales as in a writing or speaking assessment, or a series of subtest scores given as part of a proficiency test. More often than not, subscores such as these are also then aggregated into some type of composite score. As such, language assessment researchers have used MG theory to examine a variety of issues, such as analytic scoring (e.g., Sawaki, 2007; Xi, 2007; Xi & Mollaun,

2014), score dependability by task type (e.g., Lee, 2006), effective weights of rating criteria (Sato, 2011), the construct validity of tests (e.g., Grabowski, 2009), and optimal test design (e.g., Brennan, Gao, & Colton, 1995; Lee & Kantor, 2007; Liao, 2016). Ultimately, the benefit of MG theory is that it allows researchers to statistically model multiple measures simultaneously. The purpose of this chapter is to introduce MG theory and illustrate its application to language assessment data by answering a variety of research questions as they pertain to scores from a Chinese listening-speaking test. This particular test was chosen because the context includes facets of measurement that are common in performance assessment (i.e., analytic scoring across multiple tasks and raters), all of which make it suitable for MG theory analyses. The strengths and limitations of MG theory in the context of the illustrative study, as well as complementary approaches for investigating analytic scoring, will also be outlined. MG theory can obviously be applied to a variety of multivariate designs; however, our hope is that the illustrative study and accompanying discussion are accessible and comprehensive enough to allow for a logical and practical extension of the analyses presented to other types of data (the comprehensive introduction to univariate G theory in Chapter 2, Volume I should be consulted before proceeding further).

Overview of multivariate generalizability theory

In order to fully understand MG theory, it is important to first present the fundamental concepts related to G theory on the whole and the advantages it offers over classical test theory (CTT). Following is a brief overview of the conceptual framework of G theory, with an explicit emphasis on the multivariate extension of the model. A full exemplification of how each of the concepts outlined below might be utilized in second language (L2) assessment research is included in the illustrative study that follows.

The effect of multiple sources of variation on total score variability

Although estimates of reliability from a CTT perspective (as seen in Figure 3.1) may provide useful information about measurement error when analyzing test data, in this approach, error terms are often overgeneralized, resulting in estimates that may, in reality, be much more complex than is represented.1

Figure 3.1  Observed-score variance as conceptualized through CTT (partitioned into true score variance and error variance).


Figure 3.2  Observed-score variance as conceptualized through G theory (universe score variance for the object of measurement, persons (p); the facets tasks (t) and raters (r); the interactions p×t, p×r, and r×t; the residual ptr,e; and other sources of variation).

G theory (Brennan, 1983; Cronbach, Gleser, Nanda, & Rajaratnam, 1972; Shavelson & Webb, 1991) is a statistical theory that aims to disentangle the effect of multiple sources of variation on test score variability. As a complement to CTT, this approach models the relative contribution of person ability (p, the object of measurement) and other method factors, such as tasks (t) and raters (r), on scores. Other sources of variation in the model could include the interactions that these facets may have with each other, plus undifferentiated error (e), as seen in Figure 3.2. The effect of background variables (e.g., first language [L1], age, biological sex, etc.) on scores is something that can also be modeled in addition to a variety of variables, such as text genre, prompt type, or occasion, to name a few. (For a discussion on G theory and CTT, consult Brennan, 2011.)

An examination of test method facets purported to have the greatest impact on scores is critical in the investigation of construct validity (Bachman, 1990, 2004, 2005) and should always be considered by language assessment researchers in performance assessment contexts, since, ultimately, score variability relates to how reliable, or dependable, the measures from an assessment are. G theory, as a multifacet measurement model, allows for an investigation into the effect of test method facets on scores, as well as ability-level differences, and systematic interaction effects not accounted for in CTT.

By extension, the benefit of MG theory is that it can be used to model multiple measures simultaneously as part of an investigation into the relative contribution of multiple sources of variation to the total score variability, particularly when a profile of observed scores (e.g., analytic ratings or subtest scores) and/or a composite score are given. MG analyses yield results unique to this type of simultaneous modeling, including indices of composite score dependability, statistics on the relationships between components in a composite (in the form of covariances and universe-score correlations), effective weights of the components in a composite, and the statistical effect of alternative measurement designs on the composite. Each is discussed below.
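To summarize the partitioning sketched in Figure 3.2 in symbols (a standard G theory identity for a fully crossed two-facet random design, restated here for reference rather than drawn from the illustrative study), the variance of an observed score for person p on task t rated by rater r decomposes into seven components:

```latex
\sigma^2(X_{ptr}) = \sigma^2_{p} + \sigma^2_{t} + \sigma^2_{r}
                  + \sigma^2_{pt} + \sigma^2_{pr} + \sigma^2_{tr} + \sigma^2_{ptr,e}
```

The person component is the universe-score variance; the remaining components are the facet main effects, their interactions, and the residual that CTT would lump together as undifferentiated error.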

Score dependability

Reliability, or dependability in G theory terminology, refers specifically to the extent to which the test takers' observed scores and their theoretical average scores, or universe score, compare across all possible test conditions (e.g., scores on all possible similar tasks given by all possible similar raters). A universe score is a much more powerful notion than the concept of observed score since it accounts for a larger domain of possible measures than can be represented in a single test (Bachman, 2004), which may contain only one or two tasks that are scored by few raters. Since it is common in performance assessment contexts for test takers to receive a profile of scores based on ratings from an analytic rubric, MG theory is most often used to investigate score dependability at the individual subscale level as well as at the composite score level. However, the same type of analysis can be performed on data from contexts in which composite scores are made up of other types of components (e.g., reading, writing, listening, and speaking subtests in a proficiency exam). No matter which components form the composite, the concept of dependability is applicable to both norm-referenced and criterion-referenced score interpretations for making relative and absolute decisions, respectively.

Relationships between components in a composite

Since subscales in an analytic rubric or subtests in a proficiency exam often represent the operational definition of the ability being measured (Alderson, 1981; Bachman & Palmer, 1982; Davies et al., 1999), there are often hypothesized relationships between or among the underlying variables represented by the components in a composite (e.g., subscales in a rubric or subtests in a proficiency exam). While the concept of univariate G theory can provide information about total score variability for each component in a composite separately, MG theory extends the measurement model to allow for an investigation into the interrelationships among the multiple dimensions represented in the composite (Shavelson & Webb, 1991; Webb, Shavelson, & Maddahian, 1983) through covariances and universe-score correlations. Covariance component estimates between analytic subscales, for instance, reveal directional relationships (i.e., whether the variables move or vary together in the same or opposite direction) across multiple sources of variation, while the universe-score correlations are scaled indicators of the magnitude of the relationships between the subscales for the object of measurement – usually person ability. Universe-score correlations are disattenuated correlations (like true-score correlations in CTT) in that they represent the relationships among the abilities the subscales tap into after removing the effects of the sources of error modeled in a given G theory analysis. Covariance component estimates and universe-score correlations produced through MG analyses are useful in that they illuminate dependencies between the variables across the multiple sources of variation that can be used as validity evidence for the hypothesized construct underlying the assessment.
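As a formal sketch in standard notation (not output specific to this chapter), the universe-score correlation between two components v and v′ is the person-level covariance rescaled by the person-level standard deviations:

```latex
\rho(v, v') = \frac{\sigma_{p}(v, v')}{\sqrt{\sigma^2_{p}(v)\,\sigma^2_{p}(v')}}
```

Because only person (universe-score) variances and covariances enter the formula, the correlation is disattenuated: the error sources modeled in the design have already been removed.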

Effective weights of components in a composite

MG theory can also reveal the extent to which certain components in a composite (e.g., subscales in a rubric or subtests in a proficiency exam) may be contributing more or less to the composite universe-score variance. In other words, certain components may be more important than others to the overall picture of score dependability, even if the nominal, or a priori, weight they are given is equal in the calculation of raw scores. Effective weights differ from nominal weights in that the former reveals the empirical contributions (based on variance within scales and covariance across scales) of the components to the composite, while the latter reflects the weighting assigned to the components by the test developer. (See Sawaki, 2007; Wang & Stanley, 1970 regarding nominal and effective weights.) By revealing the effective weights of individual components (e.g., how much one subscale may be contributing statistically over others), researchers can gather additional information about how the individual components are functioning as part of a whole.
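Stated compactly (a standard formulation assuming nominal weights w_v chosen by the test developer, not a result quoted from the chapter), the effective weight of component v is its share of the composite universe-score variance:

```latex
EW_v = \frac{w_v \sum_{v'} w_{v'}\, \sigma_{p}(v, v')}
            {\sum_{u}\sum_{u'} w_{u} w_{u'}\, \sigma_{p}(u, u')}
```

A component can therefore carry an effective weight quite different from its nominal weight when its universe-score variance, or its covariances with the other components, are unusually large or small.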

Alternative measurement designs

One of the most commonly cited drawbacks of performance assessments is the relative lack of practicality (Bachman & Palmer, 1996) that often goes along with them. Specifically, the resources needed for complex assessments involving human ratings over multiple tasks often outweigh the resources available. As a result, there may be a need to modify an assessment (e.g., reduce the number of tasks or raters) in order to maximize practicality while still preserving an acceptable level of score generalizability. As one of its most useful applications, G theory can be used to investigate alternative measurement designs in order to explore the often disparate and competing principles of practicality and score dependability. Known as decision (D) studies, these analyses involve varying test facets, such as raters or tasks, in number (e.g., increasing test length from two to four tasks in an effort to increase dependability, or decreasing the test length from four to two tasks in an effort to decrease test administration time). D studies in G theory are most commonly used to explore an operational measurement design, but they are also frequently used to explore the dependability of hypothetical, alternative measurement designs to help determine the minimum number of tasks and/or raters needed for optimizing the dependability of scores on a test. Therefore, D studies are especially valuable since tasks and raters are often associated with the greatest sources of variance (Weigle, 2004), and practicality is a serious consideration in nearly all performance tests. As a result, D studies can provide a useful analytic framework for informing test design, analysis, the impact of alternative measurement designs on score dependability, and, in the MG framework specifically, composite score dependability. Table 3.1 outlines a number of general research questions that can be answered in alignment with the specific areas of investigation discussed in the Overview of Multivariate Generalizability Theory above.2

Table 3.1  Areas of Investigation and Associated Research Questions

Area of investigation: The effect of multiple sources of variation on total score variability
Research question: 1 What is the relative contribution of multiple sources of variation (e.g., test takers, tasks, and raters) to the total score variability for each of the components in the composite?

Area of investigation: Score dependability
Research question: 2 How dependable are the scores for each of the individual components in the composite and at the composite score level?

Area of investigation: Relationships between components in a composite
Research question: 3 To what extent are the components in the composite related?

Area of investigation: Effective weights of components in a composite
Research question: 4 To what degree does each component in the composite contribute to the composite universe-score variance?

Area of investigation: Alternative measurement designs
Research question: 5 What is the effect of alternative measurement designs on score dependability?

Although the research questions outlined above reflect general areas of investigation, researchers should adapt and tailor these questions to the specific measurement context in which they are working.

Sample study: Investigating the rating scale of a listening-speaking test

The purpose of the illustrative study that follows is to exemplify some of the most common and meaningful applications of MG theory, as outlined above. While MG theory can be used to investigate data from a variety of contexts in which a profile of scores is given, the illustrative study below involves a specific language assessment context in which analytic scores form the composite. Specifically, the data are derived from a listening-speaking test that consisted of two retell tasks, the responses from which were rated by two raters across four analytic subscales. This assessment was chosen not only because its data is conducive to MG analyses but also because the context includes facets of the measurement design (i.e., multiple tasks and multiple raters) that are very common in L2 performance assessments. The study first includes a description of the Chinese listening-speaking test context, followed by a detailed explanation of the MG theory analysis procedures used to answer the research questions. The results of the study are then discussed with reference to the general types of information that can be derived from an MG analysis such as this.

Participants

Test takers

When conducting MG analyses, it is best to have a sample size of at least 50 in order to have robust estimations (Atilgan, 2013). In the illustrative study, 71 adult non-native speakers of Chinese studying in a university in China took the listening-speaking test. They varied in their duration of learning Chinese and living in China, were enrolled in different programs, and came from diverse fields.

Raters

G theory can be a useful analytic framework in test contexts involving multiple raters since raters are often a source of systematic and/or unsystematic variance in the scores. In the illustrative study, two raters, both applied linguists, rated the performances.

Instruments

The test

The assessment administered for the illustrative study was designed as a proficiency test, from which relative interpretations about test takers in terms of their scores (i.e., rank-ordering) were made. As a performance-based assessment of Chinese listening-speaking ability, the test consists of two retell tasks situated in an academic scenario. In the given scenario, the test takers are asked to collaborate with a simulated classmate to deliver a presentation on the pros and cons of using social media. For Retell Task 1, the test takers are asked to watch an authentic news clip related to a positive aspect of social media, after which they have to inform their classmate that they found a clip useful for their presentation, retell the content of the news, and then suggest to their classmate that they integrate the news content into their presentation. The requirements for Retell Task 2 are the same, except that the news clip is about a negative aspect of social media. As the test context includes multiple facets of measurement (i.e., two tasks and two raters), the resulting scores are well suited to G theory analyses.

Analytic rubric

As discussed above, MG theory becomes advantageous over univariate G theory when a profile of scores, and, thus, multiple measures, are part of the measurement context. It is only through a multivariate framework that scores of this nature can be modeled simultaneously. In the illustrative study, the analytic rubric was constructed to reflect listening-speaking ability as operationalized in the retell tasks. It is important to note that listening-speaking ability is a form of integrated speaking, and its nature is different from that of independent speaking. As a result, the ability to integrate content from the input was emphasized in the rubric. Specifically, the analytic rubric consisted of four subscales, namely Content Integration, Organization, Delivery, and Language Control, and each subscale had a score band range from 1 to 5.

• Content Integration describes the extent to which the test taker addressed all aspects of the prompt, conveyed the relevant key points of the news clip accurately, and distinguished the main idea from details. The scale has a task-specific approach by explicitly stating the number of key points expected for each score band.

• Organization is concerned with the overall progression of ideas and the effective and appropriate use of transitional elements and cohesive devices.
• Delivery focuses on fluency, pronunciation, pace, intonation, and overall intelligibility.
• Language Control describes the extent to which the test taker demonstrated control of basic and complex grammatical structures, displayed effective use of a wide range of vocabulary, and paraphrased the input.

The analytic rubric was mainly adapted from the holistic speaking rubrics used in existing large-scale speaking exams (e.g., Advanced Placement [AP®] Chinese Language and Culture Exam and Test of English as a Foreign Language Internet-based Test [TOEFL iBT®]) and included some of the relevant aspects from both Plakans (2013) and Frost, Elder, and Wigglesworth (2011).

Data collection, rater norming and scoring procedures

Data collection

The participants took the test individually in a quiet room. They recorded their responses into the computer for each retell task.

Norming and scoring procedures

The two raters first went through a norming process, involving transcripts and audio recordings, to maximize consistency in scoring. Subsequently, they proceeded to score all responses for Retell Task 1. They could replay the audio recordings during scoring if necessary. They then repeated the same procedures described above for Retell Task 2.

MG theory analysis stages and design considerations

As outlined in Chapter 2, Volume I, the first stage of a generalizability analysis is the G study, which provides estimates of the relative effect of a single observation of test facets, such as one task and one rater, on a hypothesized dependent variable, or total score. The estimates of these relative effects are related to variance components, which represent a variety of facets and their interactions, such as variance related to the test takers' ability, task difficulty, or rater severity separately, or perhaps the variance related to the interaction between task difficulty and rater severity. Some of the most common components and interactions in performance assessment contexts, which often involve multiple tasks and raters, are as follows:

1 the object of measurement, which is usually defined as person ability or test taker ability (p), since the focus of the measurement is on the systematic differences among the test takers with respect to their ability;

2 individual facets of measurement, such as tasks (t) or raters (r), since they often have a systematic or unexpected effect on scores;
3 interactions between the object of measurement and each individual facet, such as persons systematically interacting with tasks (p x t), or persons systematically interacting with raters (p x r), or interactions between any individual test facets, such as tasks systematically interacting with raters (t x r);
4 multifaceted interaction effects such as interactions among persons, tasks, and raters together (ptr);
5 undifferentiated error (e), or random error, which is combined with the highest-order interaction in the study design, such as persons, tasks, and raters together (ptr,e).3

The variance component estimates are derived from expected mean squares (i.e., a theoretical reflection of group differences) in an analysis of variance (ANOVA) of the object of measurement (p) and the test effects in question. The percentage of total variance contributed is calculated for each variance component in order to determine their relative contribution to the total score variability. For example, the percentages would reflect how much of the total score variability can be attributed to true differences in the test takers' ability (the higher the better), as opposed to other (typically) less desirable variance components such as differences in rater severity (r) or differences in how task difficulty was perceived by the raters (t x r). In the multivariate framework, similar calculations are made; however, the relationships between variables across the components in the composite can be modeled simultaneously. As a result, variances and covariances then serve as building blocks for variance and covariance component estimates, universe-score correlations, effective weights, and composite score dependability, all of which contribute complementary information unavailable in the univariate framework.
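As a minimal illustration of the estimation step just described (a sketch in Python rather than the mGENOVA routine used in the chapter; the array name `scores` and the function are hypothetical), the G study variance components for one subscale in a fully crossed p x t x r random-effects design can be recovered from ANOVA mean squares as follows. A full MG analysis would repeat this for each subscale and add the corresponding covariance components, which this univariate sketch omits.

```python
# Sketch only: scores[p, t, r] holds the rating of person p on task t by rater r.
import numpy as np

def g_study_variance_components(scores):
    n_p, n_t, n_r = scores.shape
    grand = scores.mean()

    # Marginal means for each facet and each pair of facets.
    m_p, m_t, m_r = scores.mean(axis=(1, 2)), scores.mean(axis=(0, 2)), scores.mean(axis=(0, 1))
    m_pt, m_pr, m_tr = scores.mean(axis=2), scores.mean(axis=1), scores.mean(axis=0)

    # Mean squares for each effect (sums of squares divided by degrees of freedom).
    ms = {
        "p": n_t * n_r * np.sum((m_p - grand) ** 2) / (n_p - 1),
        "t": n_p * n_r * np.sum((m_t - grand) ** 2) / (n_t - 1),
        "r": n_p * n_t * np.sum((m_r - grand) ** 2) / (n_r - 1),
        "pt": n_r * np.sum((m_pt - m_p[:, None] - m_t[None, :] + grand) ** 2)
              / ((n_p - 1) * (n_t - 1)),
        "pr": n_t * np.sum((m_pr - m_p[:, None] - m_r[None, :] + grand) ** 2)
              / ((n_p - 1) * (n_r - 1)),
        "tr": n_p * np.sum((m_tr - m_t[:, None] - m_r[None, :] + grand) ** 2)
              / ((n_t - 1) * (n_r - 1)),
    }
    resid = (scores - m_pt[:, :, None] - m_pr[:, None, :] - m_tr[None, :, :]
             + m_p[:, None, None] + m_t[None, :, None] + m_r[None, None, :] - grand)
    ms["ptr,e"] = np.sum(resid ** 2) / ((n_p - 1) * (n_t - 1) * (n_r - 1))

    # Solve the expected-mean-square equations; negative estimates are set to zero.
    var = {"ptr,e": ms["ptr,e"]}
    var["pt"] = max((ms["pt"] - ms["ptr,e"]) / n_r, 0.0)
    var["pr"] = max((ms["pr"] - ms["ptr,e"]) / n_t, 0.0)
    var["tr"] = max((ms["tr"] - ms["ptr,e"]) / n_p, 0.0)
    var["p"] = max((ms["p"] - ms["pt"] - ms["pr"] + ms["ptr,e"]) / (n_t * n_r), 0.0)
    var["t"] = max((ms["t"] - ms["pt"] - ms["tr"] + ms["ptr,e"]) / (n_p * n_r), 0.0)
    var["r"] = max((ms["r"] - ms["pr"] - ms["tr"] + ms["ptr,e"]) / (n_p * n_t), 0.0)
    return var

# The relative contribution of each effect is then typically reported as a
# percentage of the summed variance components.
```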

As stated above, the information from a G study, while serving a baseline function, is limited to only one observation per facet (one task and one rater, for instance). Thus, the interpretability of the results is limited against the backdrop of operational test conditions, which often include more than one task and more than one rater. Therefore, more often than not, it is more important for researchers to focus on the results of a secondary analysis, the D study, and the same is true of the current study. As such, G study variance component estimates for single observations are not often relevant when asking research questions about operational testing conditions; however, it is important to note that D study variance component estimates are derived directly from the G study variance component estimates for single observations (each error component is divided by the corresponding D study sample size). When it is most relevant to gather information about the dependability of scores from a test as it is currently designed, the main focus of an analysis should be on the D study results. It is through D studies that the dependability of scores from the operational measurement design as well as alternative measurement designs can be scrutinized. In terms of alternative measurement designs, the number of conditions of any of the facets of the measurement design can be varied, either consecutively or concurrently. To illustrate with the current study, a multivariate D study was first conducted for the operational test design (i.e., two tasks and two raters), followed by additional multivariate D studies investigating the dependability of scores for alternative measurement designs where only the tasks were varied in number. As stated in the Introduction, D studies are especially important due to the fact that practicality is often a major consideration in performance assessments, which may incorporate multiple tasks and multiple raters. For example, there are two tasks and two raters in the illustrative study, but if the results from the D study indicate a need for more tasks in order to maximize score generalizability, it is possible that the test could prove impractical given the resources available. By contrast, assume that a particular test has 30 items. If the D study results indicate that test scores remain acceptably generalizable with even fewer items, then the test can be more easily administered and scored.4 Output from a multivariate D study also includes information about the dependability of scores from each of the individual subscales as well as information about composite score dependability, universe-score correlations between the analytic subscales, and effective weights of the analytic subscales. For a discussion on additional statistics available through MG theory, consult Brennan (2001a). Table 3.2 shows a summary of the research questions that the illustrative study seeks to address as well as the corresponding output needed to answer each question. Research questions of this nature are common in studies using MG theory (cf. Table 3.1).

Table 3.2  Research Questions and Relevant Output to Examine

Research question 1: What is the relative contribution of multiple sources of variation (i.e., test takers, tasks, and raters) to the total score variability in the listening-speaking test for each of the four subscales in the analytic rubric?
Relevant output to examine: Multivariate D study variance and covariance component estimates for the multiple sources of variation.

Research question 2: How dependable are the scores at the individual subscale level (i.e., Content Integration, Organization, Delivery, and Language Control) and at the composite score level in the listening-speaking test?
Relevant output to examine: Generalizability coefficients (for relative decisions) for each subscale and the composite.

Research question 3: To what extent are the components of listening-speaking ability (i.e., Content Integration, Organization, Delivery, and Language Control) related in the test?
Relevant output to examine: Universe-score correlations between the four subscales.

Research question 4: To what degree does each analytic subscale in the listening-speaking test contribute to the composite universe-score variance?
Relevant output to examine: Effective weights of the subscales.

Research question 5: How many tasks would be optimal for reliably measuring listening-speaking ability in the test?
Relevant output to examine: Generalizability coefficients for the four subscales and the composite for the multivariate D studies conducted for alternative designs.

As mentioned before, the main focus of the illustrative study is on the multivariate D study results, and the output based on the operational test design (i.e., the default D study) can be used to answer the first four research questions. Output from multivariate D studies for alternative measurement designs was used to answer the last research question. The analyses outlined in the chapter were conducted using the computer software mGENOVA (Version 2.1) (Brennan, 2001b). For step-by-step guidelines on how to perform the analyses with the software, please refer to the tutorial on the Companion website.

Multivariate generalizability analyses in detail

Research question 1

In order to answer research question 1, an MG study would be conducted, focusing on the multivariate D study output for the listening-speaking test's operational design (i.e., two tasks and two raters). As a result, the study's seven variance components include:

1 persons in terms of ability (p);
2 tasks in terms of difficulty (t);
3 raters in terms of severity (r);
4 the person-by-task interaction (i.e., rank-ordering differences of persons in terms of their ability by the tasks in terms of their difficulty) (p x t);
5 the person-by-rater interaction (i.e., rank-ordering differences of persons in terms of their ability by the raters in terms of their severity) (p x r);
6 the task-by-rater interaction (i.e., rank-ordering differences of tasks in terms of their difficulty by the raters in terms of their severity) (t x r); and
7 the three-way interaction plus undifferentiated or random error (ptr,e).

In MG theory, the above-mentioned facets of measurement are treated as random or fixed. Random test facets are posited to have conditions (i.e., constituent elements) that are interchangeable with any other set of elements from the universe of admissible observations (Lynch & McNamara, 1998). In the illustrative study, task was treated as a random facet because each task in the test can theoretically be substituted with any other task from the universe of similar tasks (i.e., those tapping into the same abilities as the two tasks in the illustrative study). Similarly, the raters were treated as a random facet because they can theoretically be substituted with other raters from the universe of similar raters (i.e., those with similar experience and expertise as the two raters who participated in the illustrative study). In the case of random facets, the average of a test taker's score across all combinations of possible test conditions represents the universe score (Bachman, 2004). In other words, the concept of universe score allows for the generalization of a test taker's ability beyond a single instance, or measure, of performance.

In contrast, fixed facets are those whose conditions (i.e., constituent elements) are necessarily limited in some way. According to Shavelson and Webb (1991), fixed facets can be derived in two ways:

1 either through the decision maker's purposeful selection of certain conditions of a facet to which the results are intended to be generalized (e.g., qualitatively choosing specific subscales over others for the listening-speaking test's analytic rubric);
2 or through the facet itself being limited in conditions (e.g., a “course level” facet only including a certain number of intact groups in a given school).

In the illustrative study, the construct of listening-speaking ability, as defined by the test developer and represented in the analytic rubric, is a fixed facet because the four constituent scales are not interchangeable with any other dimensions from the universe of possibilities. They were selected deliberately and purposefully and are, thus, treated as the conditions of the fixed facet. Therefore, the multivariate D study employed to represent the operational measurement design in the listening-speaking test is a random-effects model with a two-facet (task and rater) fully crossed design in order to examine test taker ability with respect to the four language knowledge dimensions in the analytic rubric. In this instance, a fully crossed design refers to a design where all test takers completed both tasks and were, in turn, rated by both raters across all four analytic subscales. Determining what types of facets are part of the measurement design is paramount when conducting analyses using G theory. If there is a fixed facet involved (e.g., the analytic subscales in the listening-speaking test), MG theory should be employed, since univariate designs presume a random model and, thus, the dependability of scores from each individual subscale would need to be modeled separately. So, while a univariate analysis can provide some useful information about individual subscales, the utility of MG theory is that analytic subscales can be treated as the fixed facet and, thus, can be modeled simultaneously. Of primary importance in an MG analysis, then, would be the dependability of the analytic scores and the composite in light of the relative contribution of variance and covariance by the other test facets, their interactions, and undifferentiated error. This is a much more powerful representation of the measurement design than the univariate model, since it takes into account all of the score variance and covariance across the entire test in one analysis. Given its multivariate orientation, the notation for representing the relationships among the variables being investigated through MG analyses is slightly different from that for the univariate model. The multivariate G study design for the illustrative study would be formally written as

p• × t• × r•

where p, t, and r represent test takers, tasks, and raters, respectively. The superscript filled circle (•) designates that the object of measurement (p) and the facets

of measurement (t and r) were each crossed with the fixed-effect multivariate variables (i.e., the four analytic subscales). In a fully crossed design such as this, not only are variance components estimated for the seven sources of variation associated with each subscale as they would be in a univariate analysis, but covariance components are also estimated for all seven sources of variation for each subscale. This is a strength of a multivariate study design whereby random facets can be crossed with fixed facets. In the above G study design, test takers (p) were considered the object of measurement since the focus of the analysis was on the systematic differences among the test takers with respect to their ability on the four language knowledge dimensions. Therefore, in theory, a significant percentage of variance and covariance in test taker scores across the scales would ideally be accounted for by the test takers in terms of ability. These results would indicate that a correspondingly low proportion of the score variance and covariance is attributable to differences in tasks in terms of difficulty, raters in terms of severity, their interactions, and measurement error, thereby showing minimal evidence of construct-irrelevant variance. When multivariate D study results are presented, the notation is slightly different. In the case of the listening-speaking test, it would be formally written as follows:

p• × T• × R•

where the uppercase T and R indicate that the D study results were calculated using mean scores over the two tasks and two raters, respectively, instead of single observations (as would be done in a G study).5

Research question 2

Although the focus of G theory tends to be on the interpretation of the variance component estimates, researchers may also be interested in an estimate of the overall dependability of test scores, which carries a similar interpretation to a CTT reliability estimate. There are two possible indicators that can be calculated to answer research question 2 depending on the type of decisions that will be made about test takers in terms of their ability: a Generalizability (G) Coefficient (Eρ²) for relative decisions and an Index of Dependability (Φ or phi) for absolute decisions. As stated earlier, the listening-speaking test was perceived as a proficiency test that was meant to distinguish high-ability students from low-ability ones and, thus, was associated with decisions related to relative interpretations of test takers' scores (i.e., the rank ordering of test takers). Therefore, the respective G coefficient for each of the four subscales and the composite was most relevant for interpretation. The Index of Dependability, by contrast, is related to decisions associated with absolute interpretations of test takers' scores (i.e., to indicate an individual's performance regardless of how other test takers perform, such as for criterion-referenced test score interpretations or in classroom-based assessment contexts). Generally speaking, the G Coefficient, which ranges from 0 to 1, will

be higher than the Index of Dependability, as absolute error variance is typically greater than relative error variance.
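As a hedged numerical sketch (the standard definitions of the two indices, computed by hand rather than taken from mGENOVA output), the D study variance components for Content Integration reported later in Table 3.3 reproduce the published G coefficient:

```python
# D study variance components for Content Integration (2 tasks, 2 raters), from Table 3.3.
content_integration = {"p": 0.85433, "T": 0.00342, "R": 0.00065,
                       "pT": 0.34869, "pR": 0.00639, "TR": 0.00000,
                       "pTR,e": 0.02674}

def g_coefficient(v):                       # E(rho^2), for relative decisions
    rel_error = v["pT"] + v["pR"] + v["pTR,e"]
    return v["p"] / (v["p"] + rel_error)

def phi(v):                                 # Index of Dependability, for absolute decisions
    abs_error = v["T"] + v["R"] + v["pT"] + v["pR"] + v["TR"] + v["pTR,e"]
    return v["p"] / (v["p"] + abs_error)

print(round(g_coefficient(content_integration), 2))  # 0.69, as in Table 3.5
print(round(phi(content_integration), 2))            # 0.69 here; never exceeds E(rho^2)
```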

Research question 3

In order to answer research question 3, universe-score correlations (which are based on measures for the person effect) would be calculated. Since MG theory treats analytic subscales as multiple dependent variables, relevant universe-score correlations can reveal dependencies among the language knowledge dimensions, which can then be interpreted based on the hypothesized operational definition of language ability being investigated. Similar to Pearson product-moment correlations, universe-score correlations can reveal differences in the relatedness and/or distinctiveness of subscales in the scoring rubric, with the additional benefit that they also take into account various sources of measurement error (as would a true score). Therefore, MG analysis findings can be used to support claims of validity of an underlying test construct, where relationships between underlying language knowledge dimensions may be hypothesized. In the illustrative study, the analytic subscales (i.e., Content Integration, Organization, Delivery, and Language Control) were treated as the multiple dependent variables in the MG analysis. As such, universe-score correlations were calculated for each pair of variables.
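A small numerical illustration (the standard disattenuated-correlation formula, applied by hand to the person-effect estimates reported later in Table 3.4) shows how one such universe-score correlation is obtained:

```python
import math

var_p_content = 0.85433       # person variance, Content Integration (Table 3.4)
var_p_organization = 0.88984  # person variance, Organization (Table 3.4)
cov_p = 0.80025               # person covariance between the two subscales (Table 3.4)

rho = cov_p / math.sqrt(var_p_content * var_p_organization)
print(round(rho, 2))  # 0.92, matching Table 3.6
```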

Research question 4

In order to answer research question 4, effective weights of the subscales can be obtained. Effective weights reveal the relative empirical contribution of each subscale to the composite universe-score variance. Although the a priori weights of subscales are represented as equal in a scoring rubric (e.g., 25% each out of 100%), the extent to which the variance and covariance in the scales statistically contribute to the composite universe-score variance may be different. This information tells researchers if one scale is effectively more important in differentiating test takers in terms of their ability, and if so, by how much. If one subscale is contributing relatively little variance, the scoring bands, scale descriptors, and/or construct itself might need to be reconsidered. For example, if one subscale is found to contribute little to no variance to the composite universe-score variance compared with the other subscales, that subscale would need to be further examined in terms of its qualitative importance to the overall test construct and how it was operationalized in the rubric.
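The calculation can be sketched as follows (a hedged illustration assuming equal nominal weights, so that the weights cancel; the matrix is the person-effect block reported later in Table 3.4 rather than new data):

```python
import numpy as np

# Person (universe-score) variance-covariance matrix, ordered as
# Content Integration, Organization, Delivery, Language Control (Table 3.4).
person_block = np.array([
    [0.85433, 0.80025, 0.67716, 0.72203],
    [0.80025, 0.88984, 0.76901, 0.82289],
    [0.67716, 0.76901, 0.73773, 0.76957],
    [0.72203, 0.82289, 0.76957, 0.80875],
])

# Each subscale's effective weight is its row sum divided by the grand sum.
effective_weights = person_block.sum(axis=1) / person_block.sum()
print(np.round(100 * effective_weights))  # approximately [25. 26. 24. 25.], cf. Table 3.7
```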

Research question 5

In order to answer research question 5, separate multivariate D studies would need to be conducted, one for each of the alternative measurement designs under investigation. The results from this analysis have the capacity to highlight the dependability of one alternative measurement design over another for each subscale separately and the composite and, thus, may have implications for practicality

as well. In the illustrative study, this would be accomplished through a series of four multivariate D studies (including the default D study), each varying the number of conditions of the task facet from two to five tasks. A step-by-step guide on how to conduct all of the analyses described above using mGENOVA is provided on the Companion website.
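As a rough sketch of what these D studies do (the standard scaling rule applied by hand rather than via mGENOVA; the single-observation values are back-calculated from the Content Integration estimates in Table 3.3, which reflect means over two tasks and two raters), varying the number of tasks simply rescales the error components:

```python
# Single-observation G study components, back-calculated from the D study values.
single_obs = {"p": 0.85433,
              "pt": 0.34869 * 2,      # person x task, single task
              "pr": 0.00639 * 2,      # person x rater, single rater
              "ptr,e": 0.02674 * 4}   # residual, single task and single rater

def relative_g(var, n_tasks, n_raters):
    rel_error = (var["pt"] / n_tasks + var["pr"] / n_raters
                 + var["ptr,e"] / (n_tasks * n_raters))
    return var["p"] / (var["p"] + rel_error)

for n_tasks in (2, 3):
    print(n_tasks, round(relative_g(single_obs, n_tasks, 2), 2))
# Prints 0.69 for two tasks and 0.77 for three, the gain for Content Integration
# discussed under research question 5 in the Results.
```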

Results

Research question 1

In order to answer research question 1, regarding the relative contributions of the multiple sources of variation to the total score variability for each subscale in the analytic rubric, the multivariate D study variance and covariance component estimates (i.e., in this case, using the operational measurement design of two tasks and two raters) computed by mGENOVA were examined. First, variance component estimates are typically presented both in terms of total amount of variance (σ²) explained as well as percentage of total variance that can be attributed to the individual variance components, their interactions, and the combination of the three-way interaction and undifferentiated error for each analytic subscale in the rubric. In the illustrative study, seven sources of variation were identified, as seen in the left-hand column of Table 3.3. Three of the sources of variation are the main effects for persons (p), tasks (T), and raters (R). The other four sources of variation are the interaction between persons and tasks (p × T), persons and raters (p × R), tasks and raters (T × R), and the three-way interaction including undifferentiated error (pTR,e).

Exactly what percentage of variance is defensible for each variance component and the interaction effects is entirely dependent on the context of the assessment, whether or not the variance is expected (i.e., construct relevant) or unexpected (i.e., likely construct irrelevant), and the stakes of the decisions being made based on the scores. Obviously, large variance component estimates (i.e., larger than 75%, say, for low-stakes decisions and larger than 90% for high-stakes decisions) for the person effect are desirable because they indicate that a correspondingly large proportion of score variance can be attributed to true differences in test taker ability – the object of measurement. By contrast, near-zero variance component estimates (i.e., less than 2%, say) are generally considered desirable for the task facet, indicating that differences in task difficulty had little to no effect on scores. This is similarly true of the rater facet, where, ideally, differences in raters' severity would have a minimal effect on scores. For the two-way interactions involving raters (i.e., person-by-rater and task-by-rater), small variance component estimates (i.e., close to zero) are desirable. Otherwise, it could be an indication that the test takers were rank-ordered differently in terms of their ability by the raters or that the tasks were rank-ordered differently in terms of their difficulty by the raters. Not surprisingly, a relatively small variance component estimate for the three-way interaction effect and undifferentiated error (ptr,e) is also preferable.

Table 3.3  Variance Component Estimates for the Four Subscales (p• × T• × R• Design; 2 Tasks and 2 Raters)

Source of     Content Integration     Organization            Delivery                Language Control
variation     σ² (% of variance)      σ² (% of variance)      σ² (% of variance)      σ² (% of variance)
p             0.85433 (69)            0.88984 (85)            0.73773 (83)            0.80875 (83)
T             0.00342 (0)             0.02510 (2)             0.01776 (2)             0.02631 (3)
R             0.00065 (0)             0.00065 (0)             0.00000 (0)             0.00241 (0)
p × T         0.34869 (28)            0.07701 (7)             0.09140 (10)            0.07228 (7)
p × R         0.00639 (1)             0.00639 (1)             0.00357 (0)             0.02575 (3)
T × R         0.00000 (0)             0.00000 (0)             0.00023 (0)             0.00075 (0)
pTR,e         0.02674 (2)             0.04459 (4)             0.04027 (5)             0.03446 (4)
Total         1.24022 (100)           1.04358 (100*)          0.89096 (100)           0.97071 (100)

Note: p = persons, T = tasks, R = raters, σ² = variance.
* Total adjusted for rounding error.

As stated earlier, there are no concrete rules for what proportion of variance is acceptable for any given variance component; thus, researchers need to examine the patterns revealed in the proportions of variance as well as the expectations of the testing context when interpreting their data. Ultimately, sizable estimates from variance components other than the object of measurement have the potential to adversely affect score dependability indices (to varying degrees depending on whether relative or absolute decisions are relevant). For example, if the only sizeable variance component estimate other than the person effect (p) is the three-way interaction with undifferentiated error (pTR,e) at 12%, the resulting dependability index would be around .88, which would likely be considered acceptable in most testing contexts. However, if an additional 10% of the variance can be attributed to the task facet (T) and 12% can be attributed to pTR,e, the resulting dependability at .78 could be problematic. Therefore, researchers need to bear in mind a multiplicity of factors when interpreting variance component estimates.

One final interaction effect, the person-by-task interaction, merits its own discussion. A moderately large variance component estimate is not uncommon in performance assessments involving tasks, especially if the tasks are meant to be tapping into different aspects of the same underlying trait. A relatively large variance component estimate is an indication that the rank-ordering of test takers by ability is different for each task. In this way, the tasks would not necessarily be different in terms of average difficulty (which would be reflected in the task main effect) but rather differentially difficult for the test takers (i.e., certain tasks were systematically more or less difficult for certain test takers). While it is possible that tasks in an assessment may tap into different dimensions of a certain ability (e.g., delivery, organization, pronunciation, grammatical accuracy within speaking ability), and this may contribute to variability in the scores, it would be unexpected for scores on the different tasks to show systematically opposing patterns across these dimensions. As a result, any sizable interaction would negatively affect the dependability of scores, since true differences in test taker ability would be contributing proportionally less variance. (See In'nami & Koizumi, 2015 for a meta-analysis of contextual features related to p × t interactions in writing and speaking performance.) However, in some assessment contexts (e.g., the listening-speaking test in the illustrative study) task may, in fact, be considered construct relevant within certain subscales on the rubric since test takers' ability may necessarily be dependent upon the task in which it is used – in other words, their ability is interacting with the context in which it is being elicited. Therefore, a relatively large person-by-task interaction is not necessarily undesirable in the illustrative study given the centrality of content (i.e., the pros and cons of social media) to task performance. What is of interest is the relative contribution of rank-ordering differences in persons across tasks depending on the subscale considered and what that reveals about the stability of task-specific measures on a test such as the content-driven listening-speaking test.
As presented in Table 3.3, all four subscales shared a similar pattern in terms of the top three sources of variation that contributed most to the total score

variability, namely persons in terms of ability (i.e., p), the person-by-task interaction (i.e., p x T), and the person-by-task-by-rater interaction with undifferentiated error (i.e., pTR,e). However, a few interesting observations can be made at a more fine-grained level. Specifically, while the object of measurement (i.e., p) contributed the most to the total score variability across the four subscales, it is also worth noting that the amount of variance associated with test taker ability differences was lowest for Content Integration. In fact, the percentage associated with it (i.e., 69%) was quite a bit lower than for the other three subscales, which all yielded more than 80%. This discrepancy is due to the percentage of total variance accounted for by the person-by-task interaction for Content Integration (i.e., 28%), which was much higher than it was for the other three subscales. The relatively large person-by-task interaction effect for Content Integration indicates that the test takers were rank-ordered differently depending on which task they responded to. This finding suggests that although the object of measurement was the highest contributor to universe-score variance within each individual subscale, test taker performance on Content Integration was quite highly task specific when looking across the two tasks. In other words, test takers' performance on the two tasks was much more dependent on their knowledge of the topic than was their performance on any of the other subscales. Ultimately, this means that test takers' ability with respect to Content Integration is less generalizable across tasks than it is for the other abilities tested. However, as stated above, it could be argued that the person-by-task interaction in this case represents construct-relevant variance, as it would seem reasonable to assume that test takers' ability to perform well on a retell task would be somewhat dependent on their familiarity and comfort with defending the pro and/or con position with respect to social media. This finding illustrates the power of MG theory in disentangling main effects and construct-relevant interactions within individual subscales that would otherwise be considered sources of undifferentiated measurement error in CTT. Lastly, the percentages of variance contributed by the remaining sources of variation across the four subscales were similar and were not high enough to be of concern in the broader discussion of score generalizability.

Covariances were also estimated for all seven sources of variation for each analytic subscale. Usually examined alongside the variance component estimates, covariance component estimates reveal the extent to which the sources of variation associated with each subscale covaried with those of other subscales. In the case of the listening-speaking test, the covariance component estimates illuminate the contribution of true differences in test taker ability, rater severity, task difficulty, the rank-ordering differences associated with their interactions, and error to the observed score covariance. For instance, the covariance component estimate for the person effect between the Content and Organization subscales reveals the proportion of the observed score covariance that can be attributed to true differences in test taker ability.
As seen in Table 3.4, the variance component estimates for each subscale in the listening-speaking test, shown on the diagonal of each block of the table, are identical to those presented in Table 3.3. The covariance component estimates appear on the off-diagonals.

Table 3.4  Variance and Covariance Component Estimates* for the Four Subscales (p• × T• × R• Design; 2 Tasks and 2 Raters)

Subscales: (1) CONT = Content Integration, (2) ORG = Organization, (3) DEL = Delivery, (4) LANG = Language Control.

Persons (p)
            (1)        (2)        (3)        (4)
CONT (1)    0.85433    0.80025    0.67716    0.72203
ORG (2)                0.88984    0.76901    0.82289
DEL (3)                           0.73773    0.76957
LANG (4)                                     0.80875

Tasks (T), D study sample size 2
CONT (1)    0.00342    0.01318    0.01190    0.01439
ORG (2)                0.02510    0.02231    0.02699
DEL (3)                           0.01776    0.02221
LANG (4)                                     0.02631

Raters (R), D study sample size 2
CONT (1)    0.00065    −0.00091   −0.00065   −0.00153
ORG (2)                0.00065    0.00075    0.00169
DEL (3)                           0.00000    0.00060
LANG (4)                                     0.00241

p × T, D study sample size 2
CONT (1)    0.34869    0.11358    0.08493    0.09477
ORG (2)                0.07701    0.04987    0.04520
DEL (3)                           0.09140    0.04293
LANG (4)                                     0.07228

p × R, D study sample size 2
CONT (1)    0.00639    0.00443    0.00418    −0.00023
ORG (2)                0.00639    0.01685    0.02472
DEL (3)                           0.00357    0.01348
LANG (4)                                     0.02575

T × R, D study sample size 4
CONT (1)    0.00000    0.00008    −0.00019   −0.00029
ORG (2)                0.00000    −0.00025   −0.00029
DEL (3)                           0.00023    0.00079
LANG (4)                                     0.00075

pTR,e, D study sample size 4
CONT (1)    0.02674    −0.00184   −0.00069   0.00293
ORG (2)                0.04459    0.00377    0.00293
DEL (3)                           0.04027    0.01417
LANG (4)                                     0.03446

* Variance component estimates are on the diagonal; covariance component estimates are on the off-diagonals.

Similar to the pattern in the variance component estimates, true differences in test taker ability accounted for the largest proportion of covariance for all subscales, indicating that test takers who received high scores on a given subscale had also received high scores on another. The covariances between each set of subscales for the person-by-task interaction were the second largest, also mirroring the pattern seen in the variance component estimates. The person-by-task interaction in this case reveals the proportion of observed score covariance between the

two subscales that can be attributed to differences in test takers' rank-orderings across the tasks. The task effect, rater effect, task-by-rater interaction, person-by-rater interaction, and three-way interaction and measurement error were generally small or close to zero, indicating that there was relatively little contribution of these subscale covariances to the total observed score covariance. Taken together, these findings further support the dependability of the analytic subscales, since the proportions of variance and covariance accounted for by the person effect in the multivariate analysis were the largest.

Research question 2

In order to answer research question 2, the generalizability (G) coefficients for the subscales and the composite were computed as seen in Table 3.5.

Table 3.5  G Coefficients for the Four Subscales (p• × T• × R• Design)

                Content Integration   Organization   Delivery   Language Control   Composite score
G coefficient   0.69                  0.87           0.85       0.86               0.87

The G coefficients were similarly high for Organization, Delivery, and Language Control (i.e., ranging from 0.85 to 0.87), while Content Integration was the least generalizable (i.e., 0.69). This outcome is unsurprising given that Content Integration yielded a sizable person-by-task interaction effect as mentioned earlier, which would result in a decrease in the G coefficient. Overall, the G coefficients revealed that the four subscales operationalized in the scoring rubric were dependably measuring true differences in test taker ability, especially when taking into account the construct-relevant nature of the person-by-task interaction in Content Integration. Nonetheless, at the composite score level, the G coefficient was 0.87 across the test, indicating that the composite score is dependable despite the sizable person-by-task interaction observed for Content Integration. That said, an investigation into alternative measurement designs where tasks are varied in number (which will be performed as part of research question 5) could reveal that generalizability can be increased to an even more satisfactory level for that subscale.

Research question 3

To answer research question 3, the universe-score correlations between the four subscales were examined. As seen in Table 3.6, the universe-score correlations indicate that the subscales were moderately highly to highly correlated. In general, among all the subscales, the universe-score correlations, which range from −1 to 1, revealed that test takers who received a high score on one subscale also received a high, or at least relatively high, score on another.

Table 3.6  Universe-Score Correlations Between the Four Subscales (p• × T• × R• Design)

                      Content Integration   Organization   Delivery   Language Control
Content Integration   1.0
Organization          0.92                  1.0
Delivery              0.85                  0.95           1.0
Language Control      0.87                  0.97           0.99¹      1.0

¹ Correlation of .996 rounded to .99

Given the moderate to high magnitudes of the correlations, there is some measure of distinctness in the subscales, providing evidence that they were indeed measuring different dimensions of the same listening-speaking construct. It is also worth noting that the association of Content Integration to the other subscales is comparatively less strong, perhaps due to its interaction with task/topic, as discussed above. However, in some cases, the high magnitudes of the correlations reveal that particular subscales were very highly related. This suggests that further investigation may be warranted for gathering evidence for discriminant validity for some of the subscales.

Research question 4

In order to answer research question 4, the effective weights of the four subscales in the listening-speaking test were examined, as shown in Table 3.7. The contributions of the four subscales to the composite universe-score variance were very similar, ranging from 24% for Delivery to 26% for Organization. The results suggest that the four subscales are equally important in assessing listening-speaking ability as operationalized in the test. In other words, no one dimension of listening-speaking ability is “outweighing” the others in terms of its importance. The results are a bit surprising given the relatively large variance component estimate for the person-by-task interaction for Content Integration, reflecting the centrality of content and the integrated nature of the tasks in the test; nonetheless, the results are consistent with our expectations in terms of the nominal, or a priori, weight assigned to each rating scale (i.e., 25% each), and there is evidence that the current weighting scheme is appropriate for the test.

Table 3.7  Effective Weights of Each Subscale to the Composite Universe-Score Variance (p• × T• × R• Design)

                                     Content        Organization   Delivery   Language
                                     Integration                              Control
Composite universe-score variance    25%            26%            24%        25%

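
The effective weights reported in Table 3.7 follow from the nominal weights and the person variance-covariance components. The sketch below shows the computation under an assumed (invented) variance-covariance matrix; it is meant only to clarify the formula, not to reproduce the mGENOVA estimates.

```python
import numpy as np


def effective_weights(nominal_weights, person_cov_matrix):
    """Share of composite universe-score variance contributed by each subscale,
    given nominal (a priori) weights and the person variance-covariance matrix."""
    w = np.asarray(nominal_weights, dtype=float)
    sigma = np.asarray(person_cov_matrix, dtype=float)
    contributions = w * (sigma @ w)   # w_v * sum over v' of w_v' * sigma(v, v')
    return contributions / contributions.sum()


# Invented person variance-covariance matrix for four subscales, equal nominal weights.
sigma_p = np.array([[0.50, 0.38, 0.33, 0.34],
                    [0.38, 0.45, 0.40, 0.41],
                    [0.33, 0.40, 0.42, 0.40],
                    [0.34, 0.41, 0.40, 0.44]])
print(effective_weights([0.25, 0.25, 0.25, 0.25], sigma_p).round(2))
```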

Research question 5

In order to answer research question 5, which asked whether the generalizability of the test scores would increase or decrease based on the number of tasks in the test, a series of four multivariate D studies was conducted. Although the current measurement design of two tasks (and two raters) was practical for the purpose and context of the test, and the G coefficients revealed that the current measurement design produced acceptable generalizability in terms of scores for Organization, Delivery, and Language Control, the potential for a substantial gain (or inconsequential loss) in score generalizability on the Content Integration scale may make an alternative measurement design more attractive. Table 3.8 provides the G coefficients for the alternative measurement designs of three, four, and five hypothetically similar tasks, in addition to the estimates for the current design (i.e., two tasks) for each subscale and the composite. The number of raters (two) was kept constant. Not unexpectedly, there is a qualitatively meaningful gain in score dependability when increasing the measurement design from two to three tasks. Beyond that, there is an indefensible return on generalizability for Organization, Delivery, and Language Control when considering the increased impracticality of including more tasks on the listening-speaking test. However, given that the generalizability increases from .69 to .77 for Content Integration when going from two to three tasks, there could be a reasonable argument to increase the test length by one task, since a primary purpose of the test is to have the test takers listen in order to speak about a topic. Of course, issues of generalizability always need to be balanced with issues of practicality, so the choice of an optimal design would also be dependent on the resources available and the context and stakes of the test. Ultimately, if a decision were to be made about the current test, it would likely make the most sense to keep the test at two tasks, particularly given that the current tasks represent two opposing arguments (i.e., pro and con) for the topic at hand.

Table 3.8  Generalizability Coefficients for the Subscales When Varying the Number of Tasks (p• × T• × R• Design)

Number     Number      Content        Organization   Delivery   Language   Composite
of tasks   of raters   Integration                              Control    score
2          2           0.69           0.87           0.85       0.86       0.87
3          2           0.77           0.91           0.89       0.89       0.91
4          2           0.81           0.93           0.91       0.91       0.93
5          2           0.85           0.94           0.93       0.92       0.94
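
The D-study logic behind Table 3.8 can be sketched by holding the variance component estimates fixed and recomputing the G coefficient as the number of tasks varies. The components below are again illustrative, so the printed values will not match Table 3.8 exactly.

```python
def g_coefficient(var_p, var_pt, var_pr, var_ptr_e, n_tasks, n_raters):
    """G coefficient for one subscale in a p x T x R design (see the earlier sketch)."""
    error = var_pt / n_tasks + var_pr / n_raters + var_ptr_e / (n_tasks * n_raters)
    return var_p / (var_p + error)


# Invented variance components for one subscale; the number of raters is held at two.
components = dict(var_p=0.50, var_pt=0.18, var_pr=0.02, var_ptr_e=0.10, n_raters=2)
for n_tasks in (2, 3, 4, 5):
    print(n_tasks, round(g_coefficient(n_tasks=n_tasks, **components), 2))
```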

Discussion

The purpose of this chapter has been to illustrate common applications of MG theory in the context of the listening-speaking test study presented above. Based on the analyses conducted on the listening-speaking test, we wish to highlight a number of points to further explicate and evaluate the use of MG theory for language assessment research.

First, the illustrative study used in the chapter employed a fully crossed, two-facet design (i.e., p• × T• × R•). A crossed design is deemed beneficial, as it allows us to maximize the number of design structures for subsequent D studies (Brennan, 2001a). Tasks and raters were chosen as facets in the context of the test, as it is expected that they would have the greatest likelihood of impacting the test takers’ scores. Nonetheless, in their own contexts, researchers would have to make an informed decision as to whether there is a need to include other facets (e.g., occasions) that could potentially impact test score variability. The exclusion of important facets could produce results that oversimplify the phenomenon under study, since their contribution would not be disentangled from the undifferentiated error variance component.

Second, the MG theory analyses illustrated in this chapter are necessarily limited by the nature of the data. Sample sizes for a G study should be as large as possible to obtain stable variance component estimates (Brennan, 2001a). Related to this is that MG theory requires a fully crossed dataset with no missing data (see Chiu & Wolfe, 2002 for alternate methods with sparse data matrices). Furthermore, unbalanced designs (e.g., items vary in number across batteries of tests, or different numbers of persons are grouped within class levels) can be handled through nested designs in MG theory (with some reduction in the variance component information derived; Brennan, 1992), but carrying out studies such as this can be complex in analysis and somewhat opaque in interpretation (see Zhang, 2006, for an example of a nested design). Researchers should approach these types of designs well informed and with caution.

Third, in the context of the illustrative study, the analytic subscales were considered as the four conditions of the fixed facet. In other words, they made up the construct of the listening-speaking ability under study and, therefore, the data only contained information with respect to test takers’ performance on these four dimensions. The conditions of the fixed facet are essentially decided by the researcher, based on theory and/or existing empirical studies. Therefore, how the construct of integrated speaking is defined will impact the design of the analytic rubric and eventually affect and potentially limit the validity of the claims made. The onus is on the researcher to provide theoretical and empirical evidence that the construct is adequately and appropriately defined for meaningful and defensible interpretations.

Complementary research methods

While MG theory allows us to address the limitations of CTT, it also has its own weaknesses. For example, though MG theory provides a macro view of the data, it does not offer insights into individual conditions of each facet (e.g., statistics on each rater in terms of relative severity).

Hence, statistical methods such as many-facet Rasch measurement (MFRM) can provide complementary information in further illuminating data–model fit, as well as supplementary information on rating scale functionality (see Chapter 7, Volume I). MFRM enables researchers to investigate various issues such as the spread of the test takers in terms of ability, the relative difficulty level of the individual tasks, and bias terms for interactions, to name a few. Using the same data as the current chapter, Lin (2017) employed MFRM and discovered that Content Integration was the most difficult component. She also found that unexpected-response patterns mainly pertained to Content Integration. Findings such as these can complement information obtained from MG analyses and shed light on the analytic subscales at a more fine-grained level. Readers who are interested in understanding the contrasting nature and complementary use of G theory and MFRM in language assessment research are encouraged to refer to Bachman, Lynch, and Mason (1995), Lynch and McNamara (1998), and Grabowski (2009).6 Other statistical approaches used to provide validity evidence, specifically in the form of construct validity, can also be employed to complement the use of MG theory in examining the underlying trait structure of a test (see Sawaki, 2007, for an application of confirmatory factor analysis in conjunction with MG theory).

Conclusion

This chapter illustrated the basic functions of MG theory in examining multivariate data from a listening-speaking test. It outlined the strengths and weaknesses of MG theory and suggested complementary research methods to fully uncover the phenomenon under study. While this chapter attempts to explain some of the most common concepts in MG theory, it is necessarily limited by space constraints and the data from the research study presented herein. Readers are encouraged to read Brennan (2001a) for a more in-depth and technical discussion on G theory.

Acknowledgments

The test data used in this illustrative study originate from Rongchan Lin’s research study, which was funded by the Confucius Institute Headquarters (Hanban) and completed while she was undertaking the Confucius China Studies Program Joint Research Ph.D. Fellowship at Peking University, Beijing. She would like to thank Professor James Purpura, Professor Yuanman Liu, and Mr. Mingming Shao for their guidance and support on that project.

Notes

1 Readers are encouraged to consult Bachman (2004) on the limitations of CTT.
2 There are many other applications of MG theory to data and associated research questions. Brennan (2001a) provides a more comprehensive treatment of additional theoretical and practical considerations within the MG theory framework.

3 If there is only one test facet being modeled in addition to the object of measurement, there will be fewer variance components in the study design; if there are more than two test facets being modeled, there will be more variance components – one for each individual source of variance and one for each possible combination of interactions, plus undifferentiated error.
4 Also of consideration in this process are qualitative judgments associated with shortening a test, such as matters of domain coverage (e.g., eliminating tasks or items may limit content representation). Ultimately, it is up to the test developer, administrator, score-users, and other stakeholders to decide what is acceptable when trying to balance practicality and generalizability in light of the resources available and the context and stakes of the decisions being made based on test scores.
5 For the D study output generated by the software, the sources of variation are in lower case as seen in the accompanying tutorial. Additionally, though the last source of variation is labeled as ptr in the output, readers are reminded that it also consists of the undifferentiated error.
6 For a comparison of G theory and Rasch measurement, please also see www.rasch.org/rmt/rmt151s.htm (Linacre, 2001).

References

Alderson, J. C. (1981). Report of the discussion on the testing of English for specific purposes. In J. C. Alderson & A. Hughes (Eds.), Issues in language testing. ELT Documents No. 111 (pp. 187–194). London: British Council.
Atilgan, H. (2013). Sample size for estimation of G and phi coefficients in generalizability theory. Eurasian Journal of Educational Research, 51, 215–227.
Bachman, L. F. (1990). Fundamental considerations in language testing. Oxford: Oxford University Press.
Bachman, L. F. (2004). Statistical analyses for language assessment. Cambridge: Cambridge University Press.
Bachman, L. F. (2005). Building and supporting a case for test use. Language Assessment Quarterly, 2, 1–34.
Bachman, L. F., Lynch, B. K., & Mason, M. (1995). Investigating variability in tasks and rater judgment in a performance test of foreign language speaking. Language Testing, 12(2), 238–257.
Bachman, L. F., & Palmer, A. (1982). The construct validation of some components of communicative proficiency. TESOL Quarterly, 16, 446–465.
Bachman, L. F., & Palmer, A. (1996). Language testing in practice. Oxford: Oxford University Press.
Brennan, R. L. (1983). Elements of generalizability theory. Iowa City, IA: ACT, Inc.
Brennan, R. L. (1992). Generalizability theory. Educational Measurement: Issues and Practice, 11(4), 27–34.
Brennan, R. L. (2001a). Generalizability theory. New York, NY: Springer-Verlag.
Brennan, R. L. (2001b). mGENOVA (Version 2.1) [Computer software]. Iowa City, IA: The University of Iowa. Retrieved from https://education.uiowa.edu/centers/center-advanced-studies-measurement-and-assessment/computer-programs
Brennan, R. L. (2011). Generalizability theory and classical test theory. Applied Measurement in Education, 24(1), 1–21. doi:10.1080/08957347.2011.532417
Brennan, R. L., Gao, X., & Colton, D. A. (1995). Generalizability analyses of work keys listening and writing tests. Educational and Psychological Measurement, 55(2), 157–176.

Chiu, C., & Wolfe, E. (2002). A method for analyzing sparse data matrices in the generalizability theory framework. Applied Psychological Measurement, 26(3), 321–338.
Cronbach, L. J., Gleser, G. C., Nanda, H., & Rajaratnam, N. (1972). The dependability of behavioral measurement: Theory of generalizability for scores and profiles. New York, NY: Wiley.
Davies, A., Brown, A., Elder, E., Hill, K., Lumley, T., & McNamara, T. (1999). Dictionary of language testing. Studies in Language Testing, 7. Cambridge: Cambridge University Press.
Frost, K., Elder, C., & Wigglesworth, G. (2011). Investigating the validity of an integrated listening-speaking task: A discourse-based analysis of test takers’ oral performances. Language Testing, 29(3), 345–369.
Grabowski, K. C. (2009). Investigating the construct validity of a test designed to measure grammatical and pragmatic knowledge in the context of speaking. Unpublished doctoral dissertation, Teachers College, Columbia University.
In’nami, Y., & Koizumi, R. (2015). Task and rater effects in L2 speaking and writing: A synthesis of generalizability studies. Language Testing, 33(3), 341–366.
Lee, Y.-W. (2006). Dependability of scores for a new ESL speaking assessment consisting of integrated and independent tasks. Language Testing, 23(2), 131–166.
Lee, Y.-W., & Kantor, R. (2007). Evaluating prototype tasks and alternative rating schemes for a new ESL writing test through g-theory. International Journal of Testing, 7(4), 353–385.
Liao, Y.-F. (2016). Investigating the score dependability and decision dependability of the GEPT listening test: A multivariate generalizability theory approach. English Teaching and Learning, 40(1), 79–111.
Lin, R. (2017, June). Operationalizing content integration in analytic scoring: Assessing listening-speaking ability in a scenario-based assessment. Paper presented at the 4th Annual International Conference of the Asian Association for Language Assessment (AALA), Taipei.
Linacre, M. (2001). Generalizability theory and Rasch measurement. Rasch Measurement Transactions, 15(1), 806–807.
Lynch, B. K., & McNamara, T. (1998). Using G-theory and many-facet Rasch measurement in the development of performance assessments of the ESL speaking skills of migrants. Language Testing, 15(2), 158–180.
McNamara, T. (1996). Measuring second language test performance. New York, NY: Longman.
Plakans, L. (2013). Assessment of integrated skills. In C. A. Chapelle (Ed.), The encyclopedia of applied linguistics (pp. 205–212). Malden, MA: Blackwell.
Sato, T. (2011). The contribution of test-takers’ speech content to scores on an English oral proficiency test. Language Testing, 29(2), 223–241.
Sawaki, Y. (2007). Construct validation of analytic rating scales in a speaking assessment: Reporting a score profile and a composite. Language Testing, 24(3), 355–390.
Shavelson, R., & Webb, N. (1991). Generalizability theory: A primer. Newbury Park, CA: SAGE Publications.
Wang, M. W., & Stanley, J. C. (1970). Differential weighting: A review of methods and empirical studies. Review of Educational Research, 40(5), 663–705.
Webb, N. M., Shavelson, R. J., & Maddahian, E. (1983). Multivariate generalizability theory. In L. J. Fyans (Ed.), New directions in testing and measurement: Generalizability theory (pp. 67–82). San Francisco, CA: Jossey-Bass.

Weigle, S. (2004). Integrating reading and writing in a competency test for non-native speakers of English. Assessing Writing, 9, 27–55.
Xi, X. (2007). Evaluating analytic scoring for the TOEFL® Academic Speaking Test (TAST) for operational use. Language Testing, 24(2), 251–286.
Xi, X., & Mollaun, P. (2014). Investigating the utility of analytic scoring for the TOEFL Academic Speaking Test (TAST). ETS Research Report Series, 2006(1), i–71.
Zhang, S. (2006). Investigating the relative effects of persons, items, sections, and language on TOEIC score dependability. Language Testing, 23(3), 351–369.

Section II

Unidimensional Rasch measurement

4 Applying Rasch measurement in language assessment: Unidimensionality and local independence

Jason Fan and Trevor Bond

Introduction

The application of Rasch measurement in language assessment1 has expanded exponentially in the past two decades, as evidenced by the rapidly growing number of studies published in the leading journals in the field (e.g., Bachman, 2000; McNamara & Knoch, 2012). However, the growing popularity and extensive acceptance of the Rasch model among mainstream language assessment researchers have not been without controversy. The debates over the application of the Rasch model in language assessment research were described by McNamara and Knoch (2012) as “the Rasch wars”, which were fought on several fronts for a lengthy period of time, up to the 1990s (see also McNamara, 1996).

One of the most heated debates surrounding the application of the Rasch model in language assessment was whether Rasch theory was appropriate for the analysis of language test data. Those who voiced their objections argued that the concept of unidimensionality in Rasch theory could not hold for language test constructs (e.g., Hamp-Lyons, 1989; Skehan, 1989). They argued that from an applied linguistic perspective, any language test inevitably entailed the assessment of multiple dimensions rather than one single dimension of language ability. For example, an academic English listening test, as the critics of the Rasch model would reason, taps into several different aspects of a test candidate’s listening proficiency or academic listening skills (e.g., Buck, 2001), thereby making the Rasch model, a priori, inappropriate for investigating the psychometric features of such tests. Those vigorous objections notwithstanding, McNamara (1996) argued that the Rasch model provides a powerful means of examining whether multiple dimensions actually exist in any dataset; consequently, objections to and reservations against the use of the Rasch model were ascribed primarily to a misunderstanding of the empirical notion of unidimensionality in Rasch measurement (McNamara, 1996).

Local independence of test items is another Rasch measurement requirement that should be addressed in analysis. This principle requires that test candidates’ responses to one item should not be affected by or dependent on their responses to other items in the test. The Rasch model also provides effective means for ascertaining the extent to which such a principle holds true for any particular dataset.

In this chapter, we review the concepts of unidimensionality and local independence and show how to address them empirically. Then we review studies in the field of language assessment that have included examination of these Rasch measurement properties. Finally, we use a dataset from a local English listening test and the Rasch software Winsteps (Linacre, 2017a) to illustrate how unidimensionality and local independence should be investigated. Rasch measurement can be considered as comprising a family of models (McNamara, 1996). The focus of this chapter is Rasch analysis of dichotomous and polytomous data, including the basic Rasch model, the rating scale model, and the partial credit model.

Unidimensionality and local independence

Interpretations of the “dimension” concept seem to vary across research fields. In mathematics, for example, dimension refers to “measure in one direction” (Merriam-Webster Dictionary).2 Following this definition, a line has one dimension (length); a square has two dimensions (i.e., length and width); and a cube has three dimensions (i.e., length, width, and height). In Rasch measurement, dimension refers to any one single underlying attribute that is not directly observable (i.e., is latent). Dimension might be used interchangeably with other terms such as “latent trait” or “construct”. Examples of such dimensions include listening ability and essay writing ability in language assessment, or extroversion, anxiety, and cognitive development in psychological research. The principle of unidimensionality in Rasch measurement requires that each individual human attribute be measured one at a time (Bond & Fox, 2015).

An example from McNamara (1996) assists in elucidating the concept of unidimensionality. Suppose a group of students, including both native and non-native speakers of English, take a mathematics test in which questions are presented in English. Although the stated purpose of the test is to assess students’ mathematical ability, it is likely that their English language proficiency level will have a differential impact on their performance on the test items which are not presented exclusively as mathematical symbols. Some non-native English speakers, for example, might struggle with understanding a “word problem” presented in English due to their low English proficiency level, even though they have the required mathematical ability to construct and solve the relevant mathematical computations. In this scenario, it is reasonable to argue the test is not strictly unidimensional for this sample because it taps both students’ ability to solve mathematical problems and their ability to understand English.

The controversy over the concept of unidimensionality has persisted for quite a long time in the field of language assessment. Issues concerning dimensionality largely underpinned the “Rasch wars”, and impeded, to a considerable extent, the broader application of Rasch measurement to language assessment research before the 1990s (McNamara & Knoch, 2012). Language ability, according to applied linguists, was always a complex construct that could not be captured in any one dimension (see McNamara, 1996; McNamara & Knoch, 2012, for details of the debates). Take second language (L2) academic listening ability as an example.

This construct entails both L2 listening proficiency and a repertoire of hypothesized enabling skills (e.g., decoding aural input, inferencing, understanding unfamiliar words based on context, constructing macro-propositions) (see Buck, 2001). Therefore, it was regarded as rather self-evident that one’s L2 academic listening ability could not be explained by a single underlying dimension.

However, as argued by McNamara (1996), the notion of “unidimensionality” has two interpretations. In psychology, unidimensionality is used to refer to a single underlying construct or trait; and in measurement, it means “a single underlying measurement dimension” or “a single underlying pattern of scores in the data matrix” (McNamara, 1996, p. 271). The distinction between psychological (conceptual) and psychometric (empirical) unidimensionality is crucial for understanding this notion in Rasch measurement, as well as articulating why Rasch measurement can be applied fruitfully to analyzing language assessment data. Fulfillment of the requirement of unidimensionality is a matter of degree, not just a matter of kind, as Bejar (1983, p. 31) pointed out:

    Unidimensionality does not imply that performance on items is due to a single psychological process. In fact, a variety of psychological processes are involved in responding to a set of test items. However, as long as they function in unison – that is, performance on each item is affected by the same process and in the same form – unidimensionality will hold.

The Rasch model, as will be demonstrated in this chapter, provides a powerful means of analyzing psychometric dimensionality in the data.

In addition, local independence requires that test candidates’ responses to any item should not be affected by their responses (i.e., success or failure) to other items in that test. In language assessment, however, such a principle might often be violated. For example, it is not unusual for language test developers to use several items that share the same prompt in listening or reading comprehension tests. Such practice can result in local dependence of items; having solved one item could predispose the candidate to success on other items based on the same passage. Previous research has shown that violation of local independence might lead to problems in parameter estimation, generating inaccurate person ability or item difficulty measures (e.g., Eckes, 2011; Min & He, 2014). Hence, local independence should be investigated as one of the prerequisites to applying the Rasch model.

It should be noted that unidimensionality and local independence are both relative concepts. Perfect unidimensionality is a theoretical mathematical concept that can only ever be approximated in empirical data. Similarly, no two items in any test can be completely independent of each other. While these two Rasch measurement properties should be examined as an integral component of any Rasch analysis, the extent to which these principles should hold will be determined by the nature and consequences of the decisions made about the test candidates (Bond & Fox, 2015). Meanwhile, the two properties tend to be interrelated. If the principle of local independence is violated for a test, it is likely that additional sub-dimensions also exist in the data matrix, making the principle of unidimensionality untenable.

Both properties, as explained below, can be investigated through assessing the extent to which the test data fit the model. The benefits of the Rasch model, including interval-level person and item measurements, apply only to the extent to which the test data fit the model (Bond & Fox, 2015).

Addressing unidimensionality and local independence empirically

In Rasch measurement, residual-based statistics are used routinely to address data-to-model match, including unidimensionality and local independence. Residuals are the differences between the actual observed scores and the mathematically expected scores generated from the Rasch model. The Rasch model is an idealized mathematical model, and the expected item and person parameters for any one test are constructed on the basis of an empirical data matrix. Observed test data collected from real life can never attain that mathematical ideal. Therefore, there are always discrepancies between the observed data and the values the Rasch model expects. Such discrepancies are known as residuals.

If a test is unidimensional, no pattern (e.g., clusters, identifiable or meaningful patterns) should exist among the residuals; that is, residuals should be distributed at random. This means that no meaningful sub-dimensions would emerge from the residuals; the measure which is accounted for by the Rasch model is the only dimension observable in the dataset. On the other hand, if patterns other than the Rasch dimension emerge from the residuals, this would constitute prima facie evidence that the test is not sufficiently unidimensional. This would be a concern to the extent to which the residual patterns are also substantively interpretable. The item/person residuals might also be used as indicators for investigating whether pairs of items are sufficiently independent of each other. After the extraction of the Rasch measure, the item residuals should not be correlated with each other if the items are substantially independent of each other. If the items are related, higher correlations between item residuals, along with muted fit statistics, are the consequence (see e.g., Bond & Fox, 2015; Eckes, 2011; Marais, 2013).

To investigate whether meaningful dimensions exist in the Rasch residuals, principal component analysis of the residuals (PCAR) should be performed. Indeed, it is available by default in Rasch computer programs such as Winsteps for dimensionality analysis. The PCAR procedure involves (1) standard Rasch analysis, followed by (2) principal component analysis of the residuals that remain after the linear Rasch measure has been extracted, i.e., removed from the data matrix. PCAR enjoys several salient advantages over other traditional dimensionality analysis methods such as exploratory factor analysis (EFA). First, factors extracted by EFA might not be factors; items of similar difficulty can cluster together, though they are not meaningful factors; the classic paper by Ferguson (1941) is quite clear on this point. Second, results of EFA are often obscured by ordinality of variables and high correlations among factors. In comparison, PCA of the linearized Rasch residuals does not generate those illusory factors. Finally, missing data can bias EFA solutions, whereas Rasch measurement is resilient to missing data (Aryadoust, Goh, & Kim, 2011).

Prior to PCAR, Rasch measurement dimensionality analysis should firstly involve the examination of point-measure (PTMEA) biserial correlations and fit statistics (Linacre, 1998). Negative PTMEA biserial correlations indicate that the items function in opposite ways to the primary underlying trait explained by the Rasch model, suggesting the existence of competing dimensions. Furthermore, high or erratic fit statistics, i.e., underfit to the Rasch model, cause legitimate concerns over the unidimensionality of the test data. On the other hand, low or muted fit statistics, i.e., indicating overfit to the model, suggest a lack of item independence.

Although several techniques have been reported to investigate local independence in IRT research (e.g., Chen & Thissen, 1997; Yen, 1984), they are not used on a routine basis in Rasch measurement. Yen’s (1984, 1993) Q3 index, for example, can be used to identify local dependence within a Rasch context (e.g., Chou & Wang, 2010; Christensen, Makransky, & Horton, 2017). This index represents the correlation between the raw residuals of pairs of items, so substantial correlations between pairs of item residuals could be suggestive of lack of local independence. Christensen et al. (2017) provide a definitive and well-referenced account of the use of Yen’s Q3 index and critique the practice of using a single correlation value as indicative of local dependence. Critical values of Q3 from 0.1 to 0.3 are often used, with much higher values also in use. The authors contend, “(T)here are two fundamental problems with this use of standard critical values: (a) there is limited evidence of their validity and often no reference of where values come from, and (b) they are not sensitive to specific characteristics of the data” (Christensen et al., 2017, p. 181). The authors then conducted a number of simulation studies and concluded that the difference between the maximum correlation (maximum Q3 value) and the average correlation (mean Q3 value), suggested by Marais (2013), provided the most stable indicator for detecting local dependence.

That said, the Q3 approach is not common among Rasch practitioners generally, and, currently, no single specific statistical indicator is routinely used to detect local independence in Rasch analysis. It is recommended that in Rasch measurement, while mindful of substantial correlations between item residuals, analysts should look for “muted” fit statistics, i.e., overfit, as a prima facie indicator of the violation of the item independence requirement (Bond & Fox, 2015). If items are related to each other, chances are that one or more of them will be empirically or theoretically redundant, in that whether a test candidate can successfully answer an item depends, to some extent, upon that candidate’s responses to preceding items; hence the muted item fit statistics (i.e., responses to these items are more predictable than expected by the Rasch model).
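
Although programs such as Winsteps report PCAR output directly, the core computation can be sketched with general-purpose tools. The snippet below assumes that a matrix of standardized Rasch residuals (persons in rows, items in columns) has already been exported; the random matrix at the end is only a stand-in for such data.

```python
import numpy as np


def residual_contrast_eigenvalues(std_residuals):
    """PCA of standardized Rasch residuals (PCAR).

    std_residuals -- 2-D array of standardized residuals (observed minus
                     Rasch-expected scores divided by their model SD),
                     persons in rows, items in columns.
    Returns the eigenvalues of the inter-item residual correlation matrix in
    descending order; the first value is the strength of the first contrast
    in eigenvalue (item) units.
    """
    corr = np.corrcoef(std_residuals, rowvar=False)   # item-by-item correlations
    return np.linalg.eigvalsh(corr)[::-1]             # largest eigenvalue first


# Placeholder data: random residuals for 106 persons and 25 items. With noise alone,
# the first eigenvalue for a matrix of this size is typically around 2, which is one
# way to see why small first contrasts are not treated as secondary dimensions.
rng = np.random.default_rng(1)
print(round(residual_contrast_eigenvalues(rng.standard_normal((106, 25)))[0], 2))
```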

Unidimensionality and local independence: A review of studies in language assessment

Though unidimensionality and local independence are prerequisites for Rasch analysis, they have rarely been addressed in reports of Rasch analyses in the field of language assessment. A review of the research articles published in the two leading journals in language assessment, i.e., Language Testing (LT) and Language Assessment Quarterly (LAQ), reveals that a total of 48 articles were published between 2010 and 2016 which either reported using the Rasch model as the primary or secondary data analysis method or discussed the use of the Rasch model in the field.

Despite this increasing number of publications, only a small fraction of those articles include analysis of unidimensionality and/or local independence. This finding is corroborated by Ferne and Rupp (2007), whose survey of differential item functioning (DIF) analysis in language assessment showed that only 8 out of 27 studies provided evidence in relation to unidimensionality. The following section highlights recent studies in language assessment that used the Rasch model with discussions on unidimensionality.

Beglar (2010) used the Rasch model as his primary data analysis method in his validation of the Vocabulary Size Test (VST). The VST was developed to provide a reliable, accurate, and comprehensive measure of L2 English learners’ written receptive vocabulary size from the first 1,000 to the 14th 1,000-word families of English. The unidimensionality analysis of the VST was aimed at providing evidence regarding the structural aspect of its construct validity, based on Messick’s theory of test validity (Messick, 1989). The unidimensionality analysis was conducted using PCAR. Results indicated that the VST was essentially unidimensional because the Rasch measure accounted for an impressive 85.6% of the total variance; each of the first four contrasts explained only 0.4% to 0.6% (i.e., small amounts) of the variance in the data, and the total variance explained by the first four contrasts was 1.9%. Beglar (2010) therefore concluded, appropriately, that no meaningful secondary dimension was found and that the VST displayed a high degree of psychometric unidimensionality. Note that “contrast” is a term used by Linacre (2017c) in PCAR whose interpretation differs from that of “component” used in principal component analysis (PCA). Whereas a component indicates a linear pattern in a dataset which might be substantively interpretable (Field, 2009), a contrast might be completely accidental and have no substantive meaning.

Pae, Greenberg, and Morris (2012) investigated the construct validity of the Peabody Picture Vocabulary Test (PPVT) – III Form A, using the Rasch model as the primary data analysis method. The PPVT has been widely used in the United States as a receptive vocabulary or verbal ability measure. Similar to Beglar’s (2010) methodology, unidimensionality in this study was assessed through PCAR. Results of the standardized residual variance analysis indicated that the Rasch dimension accounted for a mere 26.7% of the variance. The first contrast showed an eigenvalue of 4.1, suggesting the possibility of a competing secondary dimension. Eigenvalue is a statistic indicating how much of the variation in the original set of variables is accounted for by a particular component (or contrast). In conventional PCA, a component with an eigenvalue more than 1 is usually considered as having substantive importance (Field, 2009). Despite the quite high eigenvalue of the first contrast, an examination of the item/person residual plot did not reveal any meaningful clusters of items. Fit analysis, on the other hand, revealed that some items had either overfitting or underfitting response patterns, with the overfitting items failing to contribute to the unidimensionality of this test with the given sample.

The item–person map indicated that participants’ responses to these items were not aligned well with their abilities (i.e., poorly targeted). Based on these analyses, it was concluded that further investigations of these PPVT items were warranted.

In addition to validation research, the Rasch model has also been utilized to examine measurement invariance of language assessments. Aryadoust et al. (2011), for example, investigated DIF (failure of item parameter invariance) in the Michigan English Language Assessment Battery (MELAB) listening test. In that study, the Rasch model was used as the primary data analysis method. Before performing DIF analysis, the researchers examined unidimensionality and local independence in the test data as prerequisites for DIF analysis through PCAR and Pearson correlation analysis of Rasch residuals, respectively. Furthermore, fit statistics were examined because erratic fit (underfit) would hint at lack of unidimensionality. Results showed that the Rasch dimension explained 31% of the observed variance, and the first contrast in the residuals explained only 2.5%. In addition, the item and person residuals did not form distinguishable clusters, indicating that they did not reveal any substantive structure. The authors argued that the Pearson correlations of the Rasch residuals indicated that the requirement of local independence held for the dataset, because no substantial intercorrelations emerged. These analyses lent support to the unidimensionality and local independence of MELAB items and further justified the use of Rasch-based DIF analysis.

McNamara and Knoch (2012), in their review of the studies using the Rasch model in language assessment, concluded that the controversies surrounding the application of the Rasch model gradually subsided from the 1990s onward, and the period from 1990 to 2010 witnessed a substantial increase in the use of the Rasch model in language assessment research. The present review of the articles in LT and LAQ published from 2010 to 2016 further supports this conclusion. Nevertheless, as noted earlier, only a few studies have reported evidence in relation to unidimensionality and/or local independence of the dataset analyzed. A plausible explanation is the lack of awareness among language assessment researchers of the importance of addressing these two concepts when applying the Rasch model. Compared with traditional data analysis methods based on classical test theory, the Rasch model offers numerous advantages (e.g., Bond & Fox, 2015). Yet it should be reiterated that the benefits of the Rasch model apply only to the extent that datasets fit the model, including the requirements for unidimensionality and local independence. Certainly, additional analyses might be required, and findings of those analyses must be interpreted carefully. Given that researchers can address these two measurement properties in Rasch programs, it is straightforward to advocate that relevant evidence regarding these two principles be incorporated into research reports. However, assessment situations requiring the many-facets Rasch model, such as those involving rater-mediated speaking or writing performances, are far more complicated to deal with appropriately.


Sample study: Unidimensionality and local independence of items in a listening test

Research context and data

The data in this demonstration were selected from the Fudan English Test (FET), a university-based English proficiency test. The FET was developed by the College English Centre at Fudan University, one of the first-tier research universities in China. Drawing on recent models of communicative competence (e.g., Bachman & Palmer, 1996, 2010), the FET was designed to assess students’ English language ability and skills in the four modalities of listening, reading, writing, and speaking. Since its inception in 2010, the FET has been subjected to a number of validation studies (e.g., Fan & Ji, 2014; Fan, Ji, & Song, 2014; Fan & Yan, 2017) whose results fundamentally support its validity argument.

The data used in this demonstration are students’ item-level test scores on one of the FET listening tests. This random sample of participants, taken from a much larger dataset, includes 106 students with 51 males (48.1%) and 55 females (51.9%), aged between 19 and 23 (Mean = 20.52, SD = 0.82). The listening component of the FET consists of three sections: spot dictation (DIC), listening to conversations (CON), and listening to academic lectures (LEC). The test format in the listening component is presented in Table 4.1. As shown in this table, spot dictation has eight items (DIC1–8) and requires students to fill in eight blanks after listening twice to a short passage with approximately 200 words. Trained raters are subsequently recruited to evaluate students’ written responses. The first four DIC items (i.e., DIC1–4) are dichotomous, whereas a partial-credit scoring scheme is applied to DIC5–8, where students can be awarded up to 2 points (i.e., 0, 0.5, 1, 1.5, and 2), depending on the quality of their responses. Four-option multiple-choice questions (MCQ) are used in each of the other two sections of the listening component, i.e., listening to conversations and academic lectures (see Table 4.1). A dichotomous scoring scheme is applied: 1 for a correct response, 0 for incorrect.

Table 4.1  Structure of the FET Listening Test

Section            No of items   Format        Scoring
Spot dictation     4 (DIC1–4)    Gap-filling   Dichotomous
                   4 (DIC5–8)    Gap-filling   Partial credit
Conversation       8 (CON1–8)    MCQ           Dichotomous
Academic lecture   9 (LEC1–9)    MCQ           Dichotomous

Note: DIC = spot dictation, CON = conversation, LEC = academic lecture, MCQ = multiple-choice question.

Research questions

The focus of this chapter is to examine the extent to which the principles of unidimensionality and local independence hold for the FET listening test. Although, according to the FET Test Syllabus (FDU Testing Team, 2014), the FET listening test was designed in accordance with the skills model of listening ability (e.g., Taylor & Geranpayeh, 2011), it is reasonable to hypothesize that a general listening ability trait would account for students’ performance on the listening test. Therefore, dimensionality analysis of the listening test would verify whether such an assumption holds in practice, thereby strengthening (or weakening) the validity argument of the listening test (Chapelle, Enright, & Jamieson, 2008).

In the FET listening test, the questions in each section share the same prompt material. For example, the eight questions in the Spot Dictation section share one passage (DIC1–8); although the first conversation is different in that it has only one question following it (CON1), the other three conversations have at least two questions each (CON2–8). In addition, each of the two academic lectures is followed by either four or five questions (LEC1–9). Therefore, it is meaningful to investigate whether each of the items in the listening test is sufficiently independent of all others, thus the following two research questions:

i  Does the FET listening test measure a psychometrically unidimensional listening ability?
ii  Are the items in the FET listening test locally independent?

Data analysis

Since the items in the listening test have different scale structures, the partial credit Rasch model (PCM; Masters, 1982) was adopted to analyze the data. The mathematical expression of this model is presented below (McNamara, 1996, p. 285):

P = Bn − Di − Fik    (4.1)

where P is the probability of achieving a score within a particular score category on a particular item; Bn is the ability (B) of a particular person (n); Di is the difficulty (D) of a particular item (i); and Fik is the difficulty of scoring within a particular score category (k) on a particular item (i).

To address the first research question, point-measure (PTMEA) biserial correlations and fit statistics were examined and PCAR was implemented. Following Linacre (1998), items with negative or near-zero PTMEA correlations should be investigated closely because they hint at multidimensionality. Furthermore, items with underfitting patterns should be removed because they impinge on parameter estimation and are suggestive of the existence of additional dimensions. To address the second research question, we assumed that overfitting items would indicate possible item dependence or item redundancy. In addition, substantial correlations between item residuals are considered to constitute evidence in favor of local dependence between pairs of items (e.g., Aryadoust et al., 2011).

Data analysis was implemented using the Rasch computer program, Winsteps version 3.93 (Linacre, 2017a).
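
To make the partial credit model in Equation 4.1 more concrete, the sketch below computes the category probabilities for one person on one polytomous item, writing the model in its usual log-odds form. The ability, difficulty, and step values are invented for illustration and are not estimates from the FET data.

```python
import math


def pcm_category_probabilities(ability, difficulty, thresholds):
    """Partial credit model probabilities of scoring in categories 0..m
    for one person on one item (log-odds form of Equation 4.1).

    ability    -- person ability Bn in logits
    difficulty -- item difficulty Di in logits
    thresholds -- step difficulties Fik for k = 1..m, in logits
    """
    steps = [ability - difficulty - f for f in thresholds]
    # Category k has numerator exp(sum of the first k steps); category 0 has exp(0) = 1.
    numerators = [1.0] + [math.exp(sum(steps[:k + 1])) for k in range(len(steps))]
    total = sum(numerators)
    return [n / total for n in numerators]


# Invented values: one person, one four-category item with three step difficulties.
print([round(p, 3) for p in
       pcm_category_probabilities(ability=0.5, difficulty=0.2,
                                  thresholds=[-0.8, 0.1, 0.9])])
```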

Results and discussion

Summary statistics

The Rasch analysis results are presented in Table 4.2. As shown in this table, the mean ability measure for persons was 1.03 logits (SD = 1.04), whereas the mean item difficulty measure was set by default at 0 logits (SD = 0.88). In Rasch measurement theory, person ability and item difficulty are calibrated on the same interval-level measurement scale, and consequently, ability and difficulty can be compared directly. The higher mean measure for persons indicates that on average person ability is higher than item difficulty; or put differently, the items in the FET listening test are generally easy for this group of test candidates. This can also be evidenced in the Wright map (Figure 4.1), where the main bulk of the distribution of persons (left side) is located relatively higher than the distribution of items (right side). Specifically, few items are located toward the upper end of the figure, where they would target the high-ability test candidates; on the other hand, there is a preponderance of items which cluster at the lower end of the map, targeting few test candidates. One consequence is that the separation index for persons is rather low (1.80) and the standard error is relatively large (0.48) (see Table 4.2). The person separation index (GP) is an index for estimating the spread of persons on the measured variable. It is estimated as the adjusted person standard deviation (SAP) divided by the average measurement error (SEP), where measurement error is defined as that part of the total variance that is not accounted for by the Rasch model: GP = SAP/SEP (Bond & Fox, 2015, p. 354).

Both separation and strata indicators reflect the measurably distinct groups of items or persons from the Rasch analysis. The formula that converts the (person or item) separation index (G) into a strata index is Strata = (4G − 1)/3 (Bond & Fox, 2015, p. 354). According to Linacre (2017b, p. 327), the difference between the two concepts lies in the tails of the measure distribution. If the distribution is “probably due to accidental circumstances”, the separation index should be used; if it is believed to be caused by “very high and very low abilities”, the strata index should be preferred.

Table 4.2  Summary Statistics for the Rasch Analysis

           Mean (SD)      SE     Range         R      G      Strata
Persons    1.03 (1.04)    0.48   −3.07/5.24    0.76   1.80   2.07
Items      0.00 (0.88)    0.24   −1.41/1.85    0.93   3.54   4.39

Note: R = separation reliability, G = separation index, Strata = (4G − 1)/3 (Bond & Fox, 2015, p. 354).

Applying Rasch measurement  93 MEASURE 3


Figure 4.1  Wright map presenting item and person measures.

In this case, we used the separation index. A plausible explanation for low separation in this analysis is that the FET listening test does not have enough test items to measure these students’ listening ability. The FET developer should heed the results of this analysis in future test revisions.
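
The separation statistics in Table 4.2 can be reproduced, in principle, from two quantities: the error-adjusted ("true") standard deviation of the measures and their root mean square standard error. The sketch below uses inputs chosen only to give a person separation index of about 1.80, so that the derived reliability and strata can be checked against the person row of Table 4.2; in practice Winsteps computes these inputs from the estimated measures.

```python
def separation_statistics(adjusted_sd, rms_error):
    """Separation index (G), separation reliability (R), and strata for persons
    or items, from the error-adjusted SD of the measures and their RMS error."""
    g = adjusted_sd / rms_error
    reliability = g ** 2 / (1 + g ** 2)
    strata = (4 * g - 1) / 3   # as given in the note to Table 4.2
    return g, reliability, strata


# The SD and SE below are chosen only so that G = 1.80, matching the person row of
# Table 4.2; the derived R and Strata can then be compared with that row.
g, r, strata = separation_statistics(adjusted_sd=0.90, rms_error=0.50)
print(round(g, 2), round(r, 2), round(strata, 2))
```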

Item fit analysis

The item measure and fit statistics for the 25 items in the FET listening test are presented in Table 4.3. In Rasch measurement, infit and outfit mean square (MnSq) statistics, along with their standardized forms, are used routinely to examine whether test item performances fit the expectations of the Rasch model. Infit and outfit statistics reflect slightly different techniques in assessing the fit of an item to the Rasch model. The infit MnSq is weighted to give relatively more weight to the performances of those with ability measures closer to the item parameter value, whereas the outfit MnSq is not weighted and therefore remains relatively more sensitive to the influence of outlying scores.

Table 4.3  Rasch Item Measures and Fit Statistics (N = 106)

Item   Mean   SD     Measure   Error   Infit   Infit    Outfit   Outfit   PTMEA
                                        MnSq    ZStd     MnSq     ZStd     Correlation
DIC1   0.89   0.32   −1.36     0.32    0.84    −0.6     0.59b    −1.1     0.44
DIC2   0.86   0.35   −1.07     0.30    0.95    −0.2     0.84     −0.4     0.35
DIC3   0.72   0.45   −0.08     0.23    1.11     1.1     1.25a     1.4     0.22
DIC4   0.66   0.48    0.23     0.22    0.82    −2.2b    0.72b    −2.1b    0.53
DIC5   2.76   0.63   −1.41     0.19    0.76b   −1.0     0.80     −0.2     0.49
DIC6   1.24   0.59    0.18     0.18    1.13     1.0     1.13      1.0     0.35
DIC7   2.18   1.26    0.91     0.10    0.98    −0.1     0.96     −0.2     0.64
DIC8   1.59   0.92    0.89     0.13    0.76b   −2.1b    0.74b    −2.1b    0.69
CON1   0.72   0.45   −0.08     0.23    1.15     1.4     1.18      1.0     0.21
CON2   0.81   0.39   −0.68     0.26    1.05     0.4     1.23a     0.9     0.24
CON3   0.72   0.45   −0.08     0.23    1.01     0.1     0.96     −0.2     0.35
CON4   0.53   0.50    0.88     0.21    1.09     1.3     1.10      0.9     0.30
CON5   0.69   0.47    0.08     0.23    1.05     0.5     1.08      0.5     0.31
CON6   0.58   0.50    0.66     0.21    1.03     0.5     1.40a     2.9a    0.31
CON7   0.87   0.34   −1.16     0.30    1.08     0.5     1.53a     1.5     0.22
CON8   0.48   0.50    1.10     0.21    1.08     1.1     1.08      0.7     0.31
LEC1   0.72   0.45   −0.08     0.23    1.02     0.2     0.94     −0.3     0.35
LEC2   0.33   0.47    1.85     0.22    1.20     2.0     1.25a     1.4     0.20
LEC3   0.35   0.48    1.75     0.22    0.90    −1.2     0.88     −0.7     0.47
LEC4   0.68   0.47    0.13     0.22    0.95    −0.5     0.91     −0.6     0.40
LEC5   0.69   0.47    0.08     0.23    0.92    −0.8     0.84     −1.0     0.44
LEC6   0.79   0.41   −0.55     0.26    0.94    −0.4     0.91     −0.3     0.38
LEC7   0.84   0.37   −0.91     0.28    1.00     0.0     0.97      0.0     0.31
LEC8   0.82   0.39   −0.75     0.27    1.00     0.1     0.90     −0.3     0.33
LEC9   0.79   0.41   −0.55     0.26    0.99     0.0     1.11      0.5     0.31
Mean   1.72   0.96    0.00     0.23    0.99     0.0     1.01      0.1     n/a

Note: DIC = spot dictation, CON = conversation, LEC = academic lecture, SD = standard deviation, MnSq = mean square, ZStd = standardized Z, n/a = not applicable, a = underfitting, b = overfitting.

Both the infit and outfit MnSq statistics have an expected value of 1. In general, although MnSq results of 1.0 do occur in practice, it is usual that items either underfit (i.e., with infit and outfit MnSqs over 1, meaning erratic) or overfit (i.e., with infit and outfit MnSqs below 1, meaning predictable) the model. It is incumbent upon the practitioner to decide how much misfit can be tolerated and still leave the measurement findings as useful for their particular purpose. For example, in a high-stakes test, Bond and Fox (2015) adopted item fit cutoffs as follows: underfitting when MnSq indices are greater than 1.2 and overfitting with values less than 0.8. Moreover, there are many recommended upper- and lower-limit definitions discussed in the Rasch literature (e.g., McNamara, 1996). In addition to MnSq statistics, standardized Z (ZStd) values are also used to assess item fit. ZStd values ranging from −2 to +2 are regarded as indicating acceptable fit to the model. MnSq fit statistics indicate the amount of misfit observed, and the ZStd statistics reflect the probability that the misfit is statistically significant.

Table 4.3 reports that all items have positive PTMEA correlations, indicating that they all function in the same direction. Arguably, these results provide preliminary support to the unidimensionality of the listening test. An examination of the infit MnSq values indicates that except for two items (DIC5 and DIC8), the other items fit into the range of 0.8 to 1.2. However, the outfit MnSq values of eight items do not fit into that range, including DIC1, DIC3, DIC4, DIC8, CON2, CON6, CON7, and LEC2. The standardized Z values indicate that among those eight items, only three of these are likely to be problematic (i.e., DIC4, DIC8, and CON6). According to the fit statistics, DIC4 and DIC8 overfit the model, whereas CON6 underfits the model. Underfitting items are considered more problematic, as responses to those items are less predictable, i.e., more erratic than expected by the Rasch model. Underfitting items are a threat to unidimensional measurement. Taken together, an on-balance evaluation of the fit statistics indicates that, generally, the items fit the model well enough for their current use. That said, overfitting items could constitute a concern for test developers. Although they are not a primary threat to unidimensionality, they do not contribute as much additional meaningful information to measurement as do acceptably fitting items. The results of the PCAR can be considered with a view to strengthening the unidimensionality finding by determining whether competing meaningful dimensions emerge in the Rasch residuals.
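
As a rough illustration of how the infit and outfit MnSq statistics in Table 4.3 are defined, the sketch below computes both from observed responses, model-expected scores, and model variances. These arrays stand for quantities that Winsteps derives internally; the toy numbers are not taken from the FET data.

```python
import numpy as np


def item_fit_mnsq(observed, expected, variance):
    """Infit and outfit mean square statistics for one item.

    observed -- observed responses of the persons to the item
    expected -- Rasch-model expected scores for those persons
    variance -- model variances of the responses (the information per person)
    """
    residuals = observed - expected
    z_squared = residuals ** 2 / variance              # squared standardized residuals
    outfit = z_squared.mean()                          # unweighted, outlier-sensitive
    infit = (residuals ** 2).sum() / variance.sum()    # information-weighted
    return infit, outfit


# Toy dichotomous example: three persons with model success probabilities p and scores x.
p = np.array([0.9, 0.5, 0.2])
x = np.array([1.0, 1.0, 0.0])
print(item_fit_mnsq(x, p, p * (1 - p)))
```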

PCA of Rasch residuals (PCAR)

Table 4.4 reports the standardized residual variance in eigenvalue units. Overall, 38.4% of the total raw variance is explained by Rasch measures. According to Linacre (2017c), the amount of variance explained by Rasch measures depends on the spread of the persons and items. If there is a wide range of person abilities and item difficulties, the variance explained will be larger. Conversely, if the range of person abilities and item difficulties is narrow, the variance explained will be smaller. In this example, the person strata index is 2.07 (see Table 4.2), suggesting that there are just two measurably distinct groups of students in this sample. In other words, as the range of person abilities in this sample is rather narrow, a relatively small amount of variance (i.e., 38.4%) was explained by the Rasch measure.

Linacre (2017c) recommended that a first contrast with an eigenvalue less than 2 should not cause a concern. In PCAR, as mentioned earlier, “contrast” is distinct from “component”, because the interpretation of the components in Rasch residuals is different from the usual interpretation of PCA components. A contrast might not be substantive enough to constitute a dimension (e.g., a random effect in the data), whereas a component in PCA is often substantively interpretable. There are 25 items in the listening test. If the items are completely independent of each other, there should be 25 eigenvalues (sub-dimensions), each with the strength of 1. As demonstrated in Table 4.4, the first detected contrast has an eigenvalue of 1.91 units, which is not substantial (i.e., the absolute strength of the first contrast is 1.91 items out of a total of 25 items), while the Rasch model explains the common variance in 15.58 items. Given that the eigenvalue of the first contrast is rather small, it provides negligible evidence of multidimensionality. To better interpret the meaning of the first contrast, the standardized residual contrast 1 plot in Figure 4.2 illustrates the extent to which items’ residuals might cluster together to represent the first contrast.

Table 4.4  Standardized Residual Variance

                                         Eigenvalue   Observed   Expected
Total raw variance in observations       40.58        100%       100%
Raw variance explained by measures       15.58        38.4%      36.8%
Raw variance explained by persons         7.11        17.5%      16.8%
Raw variance explained by items           8.47        20.9%      20.0%
Raw unexplained variance (total)         25.00        61.6%      63.2%
Unexplained variance in 1st contrast      1.91         4.7%       7.6%

Figure 4.2  Standardized residual first contrast plot.

In Figure 4.2, the X-axis represents the difficulty measures of the items, with most of the items concentrated in the difficulty range between −1 and +1 logits. This assists the analyst in interpreting where the sub-dimension (if it exists) is located along the range of item difficulty. The Y-axis represents the loadings on the conceptual component in the unexplained variance (i.e., the contrast). As pointed out by Linacre (2017c), the largest correlation (loading) with the latent sub-dimension is shown as positive. In this analysis, Item A (LEC2) has the highest loading (around 0.5). Note that on the vertical axis, Items a (DIC7) and b (CON4) have negative loadings of about −0.4. In conventional PCA, items with loadings > 0.4 would be closely examined (e.g., Stevens, 2002). However, in PCAR, the clustering of items is more important than the mere magnitude of component loading. In this example, two items, i.e., Items A (LEC2) and B (LEC4), have loadings of over 0.4. An examination of the content of these two items, however, revealed that they did not form a dimension with a reasonable substantive interpretation. Therefore, the results of PCAR indicate that the test is essentially unidimensional.
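The sketch below illustrates the logic behind these PCAR figures with simulated unidimensional data. It approximates, rather than reproduces, the Winsteps computation, and all names and values are illustrative assumptions rather than results from the FET analysis.

```python
import numpy as np

rng = np.random.default_rng(1)
theta = rng.normal(0, 1, 500)
delta = rng.uniform(-2, 2, 25)
p = 1 / (1 + np.exp(-(theta[:, None] - delta[None, :])))
x = rng.binomial(1, p)

# Standardized residuals: (observed - expected) / model SD.
z = (x - p) / np.sqrt(p * (1 - p))

# Rough split of observed variance into the part explained by the Rasch
# expectations and the residual (unexplained) part.
var_measures = np.var(p, axis=0).sum()
var_resid = np.var(x - p, axis=0).sum()
print(f"Variance explained by measures: {var_measures / (var_measures + var_resid):.1%}")

# PCA of standardized residuals: eigenvalues of the inter-item correlation matrix.
eigvals = np.linalg.eigvalsh(np.corrcoef(z, rowvar=False))[::-1]
print("Eigenvalue of first contrast:", round(eigvals[0], 2))
# With 25 items, a first contrast well below 2 (i.e., fewer than two items'
# worth of shared residual variance) gives no evidence of a secondary dimension.
```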

Local independence

Although some IRT models adopt a number of indicators of item dependence, no single specific statistical test is used to address the principle of local independence in Rasch measurement practice. Empirical indicators, essentially "muted" fit statistics and substantial correlations between item residuals (Yen's Q3), provide evidence of the potential violation of Rasch's item independence requirement. In this example, only one item, i.e., DIC8 (0.89, SE = 0.13), shows marginal overfit on all four fit statistics (Infit MnSq = 0.76, Infit ZStd = −2.1, Outfit MnSq = 0.74, Outfit ZStd = −2.1; see Table 4.2). This suggests that success/failure on DIC8 might not be completely independent of success/failure on other items in the FET, but an analysis of the content and function of this item and the other items in that section (i.e., Spot Dictation) did not indicate that it is related to other items. So the merely marginal overfit of this item, per se, does not provide sufficient evidence of a violation of the principle of local independence. Table 4.5 presents the largest standardized residual correlations. If the correlation between any pair of item residuals (Q3) is substantial, this would also hint that the principle of local independence might be untenable for that pair. As shown in Table 4.5, no substantially high correlations are observed, with the highest correlation reported between LEC1 and LEC3 (r = 0.32). Remember that common variance = r², so items LEC1 and LEC3 share 0.32², or only about 10%, of the variance in their residuals in common; put differently, about 90% of each of their residual variances differ. Practitioners might find that the process of summarizing the Q3s across all items with PCAR makes any dependency pattern clearer. In view of the results from both the item fit analysis and the correlation analysis between item residuals, it is reasonable to conclude that the principle of local independence of items holds for the FET listening test.
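A minimal sketch of Yen's Q3 follows, again using simulated data; the item indices and values printed are illustrative, not those of the FET items.

```python
import numpy as np

rng = np.random.default_rng(7)
theta = rng.normal(0, 1, 500)
delta = rng.uniform(-2, 2, 25)
p = 1 / (1 + np.exp(-(theta[:, None] - delta[None, :])))
x = rng.binomial(1, p)

# Yen's Q3: correlations between item residuals after removing the Rasch expectation.
resid = x - p
q3 = np.corrcoef(resid, rowvar=False)
np.fill_diagonal(q3, 0)        # ignore the trivial self-correlations

# The largest |Q3| hints at the pair most likely to violate local independence.
i, j = np.unravel_index(np.argmax(np.abs(q3)), q3.shape)
print(f"Largest residual correlation: items {i} and {j}, Q3 = {q3[i, j]:.2f}")
print(f"Shared residual variance (Q3 squared): {q3[i, j]**2:.1%}")
```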

Table 4.5  Largest Standardized Residual Correlations

Item    Item    Correlation
LEC1    LEC3      0.32
DIC2    DIC6      0.23
DIC2    DIC3      0.22
CON3    CON4      0.20
CON8    LEC2      0.20
CON1    LEC2     −0.26
DIC8    CON3     −0.27
DIC5    CON6     −0.25
DIC7    LEC2     −0.24
DIC4    CON8     −0.23

Conclusion

This chapter examined two essential principles of Rasch measurement, i.e., unidimensionality and local independence. Given that the many benefits of the Rasch model apply only to the extent that the data fit the model, these two measurement properties should be routinely addressed when using Rasch analysis with a dataset. Both concepts can be addressed through the examination of (mis)fit indicators and the Rasch residuals, i.e., the discrepancies between the observed data and the data expected by the Rasch model. Unidimensionality can be assessed through point-measure correlations and fit statistics, augmented by the principal component analysis of the Rasch item residuals (PCAR); local independence can be investigated through item overfit to the model, augmented by the Pearson correlations of pairs of item residuals. A variety of indicators from this analysis support the contention that the FET listening test is essentially unidimensional. First, no negative PTMEA correlation was observed, and the fit statistics for most items fell within the range for productive measurement. Second, the PCAR showed that the Rasch measure had an eigenvalue of 15.58, whereas the first contrast had the strength of only 1.91 items. It should be noted, however, that the Rasch measures explained only 38.4% of the total variance. This finding might be attributable, at least in part, to the insufficient number of test items in the FET listening test as well as the restricted range of item difficulties. Finally, a close examination of the first contrast plot does not reveal any meaningful clustering of items, suggesting that no secondary dimension is contained in the Rasch residuals. Taken together, the results suggest that the listening test is essentially unidimensional. The finding lends empirical support to the theoretical assumption underpinning the design of the FET listening test, i.e., that the test is intended to measure students' general listening ability. It also strengthens the validity argument for the test and facilitates the interpretation and use of test scores. However, a few items were found to overfit the Rasch model, suggesting a lack of item independence. Although overfitting items do not

constitute as much of a threat to the validity of the test as do underfitting items (McNamara, 1996), it should be reiterated that they do not contribute unique additional psychometric information to effective measurement, and they therefore warrant further examination in future test revisions. While violation of the Rasch requirement for local independence can be a focus for statistical theorists, when restricted to a small number of items, it rarely has practical implications for testing because its empirical effect is the slight and localized expansion of the relative placement of items and persons along the underlying latent variable (Linacre, personal communication, May 2017). Correlations of the Rasch item residuals (Q3) might also suggest the possibility of local dependence; but, in this case, no spuriously high correlations were found, suggesting that the item pairs are not related to each other. But of much more importance than any pairs of correlated items are clusters of related items that appear in the standardized residual contrast plot derived from the PCAR. Note that the slightly overfitting item DIC8 does not appear in the table indicating between-item correlations but appears in the PCAR plot (Figure 4.2) with CON6 (F), LEC3 (G), and CON8 (H), so the overfit does suggest some among-item dependence. In other words, DIC8 seems to summarize some ability already detected by other items; it picks up some of the ability detected by CON6, LEC3, and CON8, but the converse is not true. Although it is a measurement truism that the only test that can be purely unidimensional is the one-item test, from that rather impractical perspective, DIC8 might be considered the single best item in the FET rather than being seen as merely dependent and thereby potentially redundant. If dimensionality analysis reveals a cluster of items in the contrast plot, it is incumbent upon researchers to conduct a detailed content analysis of those item clusters to ascertain whether they form meaningful sub-dimensions. Items tapping into a sub-dimension might be revised for subsequent administrations of the test, be removed from the test, or be administered separately to students. Alternatively, a multidimensional Rasch model could be applied to the dataset that might effectively accommodate the existence of multiple Rasch-like dimensions (e.g., Bond & Fox, 2015). On the other hand, if items are found to be locally dependent (e.g., cloze items are often related to each other when one prompt is used for choosing the cloze responses), then those items should be properly addressed, for example, by combining the locally dependent dichotomous cloze items into polytomous super-items (e.g., Thissen, Steinberg, & Mooney, 1989; Zhang, 2010). The two Rasch measurement requirements of unidimensionality and local independence tend to have been largely ignored in the language assessment research literature. Clearly, they should be addressed routinely as an important and integral component of any Rasch analysis. Furthermore, relevant results should be incorporated in the research report and made transparent to readers. If the requirements of these measurement principles are not met, reflective remedial actions should be taken. For example, test scores should be reported in accordance with the results of the dimensionality analysis (e.g., Sawaki, Stricker, & Oranje, 2009).
Our recommendation is that researchers using Rasch analysis in language assessment should routinely report on the unidimensionality evidence rehearsed

in our argument and analysis above, which includes point-measure correlations, all four fit statistics, and PCAR. Moreover, any evidence suggesting item dependence should be addressed in the test development phase, so any practical measurement impacts will be minimized.

Notes

1 We concur with Bachman and Palmer (2010) that it is not necessary to make fine distinctions between "assessment" and "test". These two terms are therefore used interchangeably in this chapter.
2 www.merriam-webster.com/dictionary/dimension

References Aryadoust, V., Goh, C. C., & Kim, L. O. (2011). An investigation of differential item functioning in the MELAB listening test. Language Assessment Quarterly, 8(4), 361–385. Bachman, L. F. (2000). Modern language testing at the turn of the century: Assuring that what we count counts. Language Testing, 17(1), 1–42. Bachman, L. F., & Palmer, A. S. (1996). Language assessment in practice: Designing and developing useful language tests. Oxford: Oxford University Press. Bachman, L. F., & Palmer, A. S. (2010). Language assessment in practice: Developing language assessments and justifying their use in the real world. Oxford: Oxford University Press. Beglar, D. (2010). A Rasch-based validation of the Vocabulary Size Test. Language Testing, 27(1), 101–118. Bejar, I. I. (1983). Achievement testing: Recent advances. Beverly Hills, CA: SAGE Publications. Bond, T., & Fox, C. M. (2015). Applying the Rasch model: Fundamental measurement in the human sciences. New York, NY: Routledge. Buck, G. (2001). Assessing listening. Cambridge: Cambridge University Press. Chapelle, C. A., Enright, M. K., & Jamieson, J. M. (2008). Building a validity argument for the test of English as a foreign language. New York, NY and London: Routledge, Taylor & Francis Group. Chen, W.-H., & Thissen, D. (1997). Local dependence indexes for item pairs using item response theory. Journal of Educational and Behavioral Statistics, 22(3), 265–289. Chou, Y. T., & Wang, W. C. (2010). Checking dimensionality in item response models with principal component analysis on standardized residuals. Educational and Psychological Measurement, 70(5), 717–731. Christensen, K. B., Makransky, G., & Horton, M. (2017). Critical values for Yen’s Q3: Identification of local dependence in the Rasch model using residual correlations. Applied Psychological Measurement, 41(3), 178–194. Eckes, T. (2011). Introduction to many-facet Rasch measurement. Frankfurt am Main: Peter Lang. Fan, J., & Ji, P. (2014). Test candidates’ attitudes and their test performance: The case of the Fudan English Test. University of Sydney Papers in TESOL, 9, 1–35. Fan, J., Ji, P., & Song, X. (2014). Washback of university-based English language tests on students’ learning: A case study. The Asian Journal of Applied Linguistics, 1(2), 178–192.

Applying Rasch measurement  101 Fan, J., & Yan, X. (2017). From test performance to language use: Using self-­ assessment to validate a high-stakes English proficiency test. The Asia-Pacific Education Researcher, 26(1–2), 61–73. FDU Testing Team. (2014). The Fudan English test syllabus. Shanghai: Fudan University Press. Ferguson, G. A. (1941). The factorial interpretation of test difficulty. Psychometrika, 6(5), 323–330. Ferne, T., & Rupp, A. A. (2007). A synthesis of 15 years of research on DIF in language testing: Methodological advances, challenges, and recommendations. Language Assessment Quarterly, 4(2), 113–148. Field, A. (2009). Discover statistics using SPSS. London: SAGE Publications. Hamp-Lyons, L. (1989). Applying the partial credit method of Rasch analysis: Language testing and accountability. Language Testing, 6(1), 109–118. Linacre, J. M. (1998). Detecting multidimensionality: Which residual data-type works best? Journal of Outcome Measurement, 2, 266–283. Linacre, J. M. (2017a). Winsteps® (Version 3.93.0) [Computer software]. Beaverton, OR: Winsteps.com. Retrieved January 1, 2017 from www.winsteps.com. Linacre, J. M. (2017b). Facets computer program for many-facet Rasch measurement, version 3.80.0 user’s guide. Beaverton, OR: Winsteps.com. Linacre, J. M. (2017c). Winsteps® Rasch measurement computer program user’s guide. Beaverton, OR: Winsteps.com. Linacre, J. M. (May, 2017). Personal Communication. Marais, I. (2013). Local dependence. In K. B. Christensen, S. Kreiner, & M. Mesbah (Eds.), Rasch models in health (pp. 111–130). London: John Wiley & Sons Ltd. Masters, G. N. (1982). A Rasch model for partial credit scoring. Psychometrika, 47(2), 149–174. McNamara, T. (1996). Measuring second language proficiency. London: Longman Publishing Group. McNamara, T., & Knoch, U. (2012). The Rasch wars: The emergence of Rasch measurement in language testing. Language Testing, 29(4), 553–574. Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 13–103). New York, NY: American Council on Education/Macmillan Publishing Company. Min, S., & He, L. (2014). Applying unidimensional and multidimensional item response theory models in testlet-based reading assessment. Language Testing, 31(4), 453–477. Pae, H. K., Greenberg, D., & Morris, R. D. (2012). Construct validity and measurement invariance of the Peabody Picture Vocabulary Test – III Form A. Language Assessment Quarterly, 9(2), 152–171. Sawaki, Y., Stricker, L. J., & Oranje, A. H. (2009). Factor structure of the TOEFL Internet-based test. Language Testing, 26(1), 5–30. Skehan, P. (1989). Language testing part II. Language Teaching, 22(1), 1–13. Stevens, J. (2002). Applied multivariate statistics for the social sciences. Mahwah, NJ: Lawrence Erlbaum Associates, Inc. Taylor, L., & Geranpayeh, A. (2011). Assessing listening for academic purposes: Defining and operationalising the test construct. Journal of English for Academic Purposes, 10(2), 89–101. Thissen, D., Steinberg, L., & Mooney, J. A. (1989). Trace lines for testlets: A use of multiple-categorical-response models. Journal of Educational Measurement, 26(3), 247–260.

102  Jason Fan and Trevor Bond Yen, W. M. (1984). Effects of local item dependence on the fit and equating performance of the three-parameter logistic model. Applied Psychological Measurement, 8(2), 125–145. Yen, W. M. (1993). Scaling performance assessments: Strategies for managing local item dependence. Journal of Educational Measurement, 30(3), 187–213. Zhang, B. (2010). Assessing the accuracy and consistency of language proficiency classification under competing measurement models. Language Testing, 27(1), 119–140.

5

The Rasch measurement approach to differential item functioning (DIF) analysis in language assessment research

Michelle Raquel

Introduction

An important aspect of test validity is to provide evidence that a test is fair or free from bias (Bachman, 2005; Kunnan, 2000). A test is considered unfair if one group is systematically disadvantaged relative to another group, not because of differences in ability but because of other factors, such as test design or, in many cases, factors external to the test; that is, performance may be sensitive to off-trait influences (e.g., age, gender, academic background, first language, target language of the test). In language assessment, this would mean that students who have the same language proficiency might be awarded different results because a number of test items disadvantage one group relative to another. In a hypothetical example, imagine an English reading test taken by groups of students in which the boys consistently receive higher scores than the girls. A number of studies (e.g., Mullis, Martin, Kennedy, & Foy, 2007; Uusen & Müürsepp, 2012) have shown such a result to be highly unlikely, yet even when the boys and girls are matched to have the same language proficiency, the boys are getting higher scores. There are several possible explanations for this at the test level: the test has reading text topics that the boys are more familiar with, such as a popular action movie or a computer game, which gives them an advantage over girls who are less familiar with the topic (see Bachman & Palmer, 1996, for a discussion of topical knowledge). Another possible explanation could be that the reading test had multiple-choice questions (MCQ), on which boys have been shown to perform better (Aryadoust, Goh, & Kim, 2011). Differential item functioning (DIF) is the statistical term used to describe test items that unexpectedly tap characteristics other than what they are supposed to measure (i.e., off-trait characteristics) and consequently have different item difficulty estimates across different subgroups (Bond & Fox, 2015; Ferne & Rupp, 2007; Tennant & Pallant, 2007). This could be due to characteristics of the population such as gender (e.g., Gnaldi & Bacci, 2016; Salubayba, 2013), age (e.g., Ownby & Waldrop-Valverde, 2013), ethnicity (e.g., Uiterwijk & Vallen, 2005), academic background (e.g., Pae, 2004), and other characteristics. At the individual item level, if test items exhibit DIF, the property of invariant measurement is violated, which could potentially have an impact on overall test results (Bond & Fox, 2015; Engelhard Jr., 2013). DIF could thus pose a

threat to test validity, especially when overall test results differ for subgroups that have the same ability level but different characteristics. This could also make interpretations or decisions based on test results inaccurate (Karami & Salmani Nodoushan, 2011; Kunnan, 2007). Due to the significance of DIF analysis in language assessment, several DIF detection methods have been developed throughout the past few decades. Some of these are discussed in the following section.

Methods to detect DIF

DIF detection methods either use manifest variables (e.g., age, gender, or mother tongue) or adopt latent class techniques. Manifest variable DIF detection methods include the Mantel-Haenszel (MH) and the logistic regression (Log-R) procedures and unidimensional and multidimensional item response theory (IRT) techniques. Latent class DIF detection techniques, such as mixture Rasch models, are those in which manifest variables, such as gender and age, are used to explain potential sources of DIF after DIF has been detected through statistical processes (see Chapter 1, Volume II). The most popular DIF detection methods with manifest variables are the MH and the Log-R procedures. These approaches rely on the total raw score or the group's total number of incorrect/correct responses to an item (observed data) to match learners of the same ability and then identify DIF by calculating and comparing each group's responses to each item. The MH method relies on the calculation and comparison of the probability of each subgroup getting an item correct, while the Log-R method uses logistic regression to predict the probability of a correct response to an item from test takers' total scores and group membership. For a detailed explanation of the MH approach, refer to Mantel and Haenszel (1959). A detailed explanation of the Log-R procedure can be found in Swaminathan and Rogers (1990) and Shimizu and Zumbo (2005). For a recent development of DIF methods based on Log-R, refer to Magis, Tuerlinckx, and De Boeck (2015). IRT DIF detection methods with manifest variables rely on identifying differences between a group's score on the latent variable and the expected or modeled score (Carlson & von Davier, 2017; Shealy & Stout, 1993a). IRT models are classified as either unidimensional (which assume that a test measures one latent variable such as reading or listening ability) or multidimensional (which assume that a test measures more than one latent variable; for example, a reading test could measure vocabulary knowledge and reading skills) (Roussos & Stout, 1996, 2004). In both unidimensional and multidimensional models, DIF analysis is carried out based on manifest variables. Examples of unidimensional IRT DIF detection methods are Lord's χ2 (chi-square) method (Lord, 1980) and the IRT likelihood ratio test (Thissen, Steinberg, & Wainer, 1993). An example of a multidimensional IRT DIF detection method is the Simultaneous Item Bias Test (SIBTEST) method (Shealy & Stout, 1993b). In DIF analysis with latent class methods, DIF groups are identified based on the qualitative differences between test takers' answers (e.g., Jang & Roussos,

2009). The subgroups that are detected are called latent classes, as they emerge from the data. An example of this type of analysis is in Chapter 1, Volume II. Other examples of DIF detection approaches include the Rasch trees (Bollmann, Berger, & Tutz, 2017), DIF techniques with both continuous and categorical variables (Schauberger & Tutz, 2015), and structural equation modeling (SEM), which includes the multiple indicators, multiple causes (MIMIC) method and the measurement invariance approach (Borsboom, Mellenbergh, & Van Heerden, 2002; Cheng, Shao, & Lathrop, 2015). For a comprehensive overview of other DIF detection methods, see Dorans (2017), Hidalgo and Gómez-Benito (2010), Mellenbergh (2005a, 2005b), Millsap and Everson (1993), and Sireci and Rios (2013). The next section discusses the two types of DIF that detection methods based on manifest variables aim to identify: uniform and non-uniform DIF.
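As an illustration of the Log-R logic described above, the sketch below fits nested logistic regressions (matching score only; plus group; plus a score-by-group interaction) to simulated responses to a single studied item. It assumes the statsmodels package is available; the data, the built-in DIF effect, and the variable names are all illustrative assumptions rather than a published procedure.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 1000
group = rng.integers(0, 2, n)              # 0 = reference, 1 = focal
ability = rng.normal(0, 1, n)

# One studied item with a built-in 0.6-logit uniform DIF against the focal group.
p_item = 1 / (1 + np.exp(-(ability - 0.6 * group)))
y = rng.binomial(1, p_item)
total = np.clip(np.round(ability * 5 + 12), 0, 24)   # crude total-score matching variable


def fit(predictors):
    X = sm.add_constant(np.column_stack(predictors))
    return sm.Logit(y, X).fit(disp=0)


m0 = fit([total])                          # matching variable only
m1 = fit([total, group])                   # + group (uniform DIF)
m2 = fit([total, group, total * group])    # + interaction (non-uniform DIF)

# Likelihood-ratio chi-square tests (1 df each) for uniform and non-uniform DIF.
print("Uniform DIF LR chi2:", round(2 * (m1.llf - m0.llf), 2))
print("Non-uniform DIF LR chi2:", round(2 * (m2.llf - m1.llf), 2))
```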

Uniform vs. non-uniform DIF

In IRT and Rasch measurement approaches to DIF, there are two types of DIF: uniform and non-uniform DIF (Mellenbergh, 1982). Uniform DIF occurs when the DIF pattern is consistent across the levels of ability. In non-uniform DIF, the pattern reverses, and DIF differs with ability level; that is, the DIF item becomes more difficult for one ability level but easier for another. All DIF methods are more or less sensitive to uniform DIF. However, only some methods, such as the Log-R method, the IRT likelihood ratio method, and the Rasch-based DIF methods, are able to identify non-uniform DIF. A visual representation of these types of DIF is shown in Figures 5.1 and 5.2. There are three item characteristic curves (ICCs):1 the model ICC is represented by the solid black line, the reference group ICC by the dotted line, and the focal group ICC by the dashed line.

Figure 5.1  ICC of an item with uniform DIF (expected score plotted against measure relative to item, for subgroups 1 and 2).

Figure 5.2  ICC of an item with non-uniform DIF (expected score plotted against measure relative to item, for subgroups 1 and 2).

In Figure 5.1, the two ICCs indicate that the item has a difficulty estimate for each subgroup (dotted vs. dashed lines), as well as for the item overall (solid line). The focal group (dashed line; subgroup 2) finds the item more difficult (+0.5 logits) than did the reference group (−0.5 logits; dotted line; subgroup 1). Figure 5.2 demonstrates an example of non-uniform DIF: the ICCs of the reference and focal groups intersect, so that the item difficulty for all groups is 0.0 logits. However, the item is more difficult than expected for those of lower ability in subgroup 1 (dotted line) but easier than expected for those of higher ability in that subgroup. The converse is true for subgroup 2 (dashed line): the item is more difficult than expected for those of higher ability but easier than expected for those of lower ability in that subgroup. The following sections give an overview of empirical studies in language assessment that investigate DIF, followed by a discussion of one popular DIF method, the Rasch-based DIF approach.
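The difference between the two patterns can be reproduced numerically. The following minimal sketch (illustrative values only) computes subgroup ICCs for a uniform-DIF item, where the gap is a constant logit shift, and for a non-uniform-DIF item, where an ability-by-group interaction makes the gap reverse direction.

```python
import numpy as np

theta = np.linspace(-3, 3, 7)   # ability relative to the overall item measure


def icc(ability, difficulty):
    """Rasch ICC: probability of success given ability and item difficulty."""
    return 1 / (1 + np.exp(-(ability - difficulty)))


# Uniform DIF: a constant one-logit gap between the subgroups at every ability level.
uniform = np.column_stack([icc(theta, -0.5), icc(theta, +0.5)])

# Non-uniform DIF: an ability-by-group interaction, so the gap reverses direction;
# the two curves cross at the overall item difficulty of 0.0 logits.
interaction = 0.5 * theta
non_uniform = np.column_stack([icc(theta, -interaction), icc(theta, +interaction)])

for label, probs in [("uniform", uniform), ("non-uniform", non_uniform)]:
    print(label)
    for t, (p1, p2) in zip(theta, probs):
        print(f"  theta {t:+.1f}: subgroup 1 = {p1:.2f}, subgroup 2 = {p2:.2f}")
```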

DIF studies in language assessment

DIF has been a popular research area in language testing, and it continues to grow. Zumbo (2007) proposed that the evolution of research on DIF and use of DIF can be divided into three generations. First-generation DIF studies focused mainly on whether items favored one group over another. Traditionally, the group for which a test is originally intended is called the reference group, while the group that is suspected to be disadvantaged is called the focal group (Holland & Thayer, 1988). The decision of which subgroup will be called focal or reference group could be based on theoretical or pragmatic considerations. In

addition, first-generation DIF studies made a distinction between item impact and item bias. Item impact refers to differences that exist because of between-group differences in the underlying ability, while item bias refers to group differences not related to the underlying ability being assessed (Millsap & Everson, 1993). Second-generation DIF studies were characterized by use of the term "DIF" and the evolution of DIF detection techniques, especially IRT methods. Finally, third-generation studies focused on the use of more complex (multidimensional) statistical models to identify the contextual reasons or test situation factors where DIF occurs (Zumbo et al., 2015). This classification of DIF research, however, does not seem to accommodate some of the aforementioned approaches to DIF, such as SEM-based approaches where DIF is not due to multidimensionality but due to a violation of measurement invariance (e.g., Bulut, Quo, & Gierl, 2017) and the Rasch trees (e.g., Magis et al., 2015). Zumbo (2007) identified the following as motivations for language assessment DIF studies: (i) to investigate fairness and equity in testing (e.g., Runnels, 2013; Yoo, Manna, Monfils, & Oh, 2018); (ii) to investigate potential threats to test validity (e.g., Engelhard, Kobrin, & Wind, 2014); (iii) to investigate the comparability of translated or adapted tests (e.g., Allalouf, 2003; Allalouf, Hambleton, & Sireci, 1999; Gierl & Khaliq, 2006); (iv) to investigate if certain groups of students answer the test the same way (e.g., Banerjee & Papageorgiou, 2016; Huang, Wilson, & Wang, 2016; Yan, Cheng, & Ginther, 2018); and (v) to investigate the invariance property of measurement models (Bauer, 2016; Shin, 2005; Yoo & Manna, 2015). Overall, DIF research is necessary, as its results serve as evidence to support test fairness arguments and thus strengthen test validity claims (Kunnan, 2007; Xi, 2010). Ferne and Rupp (2007) provided a review of DIF studies in language assessment. They conducted a meta-analysis of DIF studies in the field of language testing from 1990 to 2005. They reviewed 27 journal articles to summarize the praxis of DIF in the field and identified the following for each article: the test analyzed and the language skill assessed by the test (reading, listening, speaking, writing, grammar, vocabulary, other), learner characteristics (subgroup characteristics, age, level of study, sample size), the DIF detection method used, statistical indicators of DIF used by the study, methods used to identify causes of DIF, and the consequences of DIF studies. Their study identified reading (e.g., Bae & Bachman, 1998) as the most researched language skill, followed by vocabulary (e.g., Freedle & Kostin, 1997) and finally listening (e.g., Pae, 2004). The test format most frequently used was dichotomously scored multiple-choice items. Regarding subgroup identification, gender was the most popular grouping variable, followed by ethnicity. Their study also revealed that the most commonly used methods were the MH method and IRT methods. Some studies also opted to utilize at least two DIF detection methods, a procedure endorsed by the authors, as different DIF methods might often flag different items (e.g., Karami & Salmani Nodoushan, 2011). The choice of method and statistical indicators of DIF were related to sample size and item type (dichotomous, polytomous, or mixed item). The article concluded by commenting that future DIF studies should check for

unidimensionality and local independence, as this was not commonly done in the majority of the DIF studies they reviewed. They also emphasized the need for strong theoretical grounding of DIF studies and offered suggestions to improve the reporting of DIF studies to enable future researchers to use the results to argue for or against test validity. Table 5.1 lists selected DIF studies in language assessment with details of the testing situation, the language skill examined, the grouping variable, the DIF detection methods used, and how the cause of DIF was investigated, if applicable. The studies reviewed in Table 5.1 are recent and were not covered in the review by Ferne and Rupp (2007). As the sample study presented later in the chapter is a DIF detection study on a listening test, DIF studies on listening tests are reviewed in the next section.

Listening test DIF studies

The impact of multiple factors on listening comprehension has been investigated in a number of DIF studies, such as gender (Aryadoust et al., 2011; Park, 2008), accent of speakers in audio recordings (Harding, 2012), nationality and previous exposure to a test (Aryadoust, 2012), topic and different age groups of test takers (Banerjee & Papageorgiou, 2016), academic background (Pae, 2004), language of the test (Filipi, 2012), first/target language of the test takers (Bae & Bachman, 1998; Elder, 1996; Elder, McNamara, & Congdon, 2003), and language domain (Banerjee & Papageorgiou, 2016). DIF found in listening items has been postulated to be due to characteristics of the test or of the test takers. For example, Filipi (2012) found that some listening items are easier if a question that requires listening for explicit information is written in the target language, while the target language of the test does not matter if the question requires the candidate to listen for global meaning. Aryadoust et al. (2011) proposed that multiple-choice questions favored low-ability males, as that question type was susceptible to guessing, while Aryadoust (2012) found that short answer questions favored high-ability groups disproportionately. Harding (2012) surmised that DIF items strongly favored Mainland Chinese listeners on tests featuring Mandarin Chinese native speakers, compared to other second-language listeners, due to the shared-L1 accents of the speakers in the L2 recordings. To date, there has not been any study investigating whether secondary school study in different cities with different English language curricula could cause a listening test to favor one group over another. It has been hinted, however, that curriculum differences could cause DIF (Huang, 2010; van de Vijver & Poortinga, 1997). For example, Huang et al. (2016) investigated causes of DIF in the PISA science test and compared groups of students according to where they came from (American vs. Canadian; mainland Chinese vs. HK Chinese; American vs. mainland Chinese). The groups were chosen and grouped this way due to the similarities in language, culture, and curricula. The study aimed to determine if DIF was related to language/test translation, differences in curriculum coverage, or cultural differences. Their results showed that DIF items favored either Mainland Chinese or HK Chinese students due to differences in curriculum coverage and whether the item context, content, or topic was familiar to the students.

Table 5.1  Selected Language-Related DIF Studies

Author and year of study | Language skill | Grouping variable | Method used | Method to investigate cause of DIF
Abbott (2007) | Reading | First language (Arabic vs. Mandarin) | Simultaneous Item Bias Test (SIBTEST) | –
Allalouf and Abramzon (2008) | Vocabulary, Reading, Grammar | First language (Arabic vs. Russian) | Mantel-Haenszel | –
Banerjee and Papageorgiou (2016) | Listening | Age groups; Language domain (public, occupational, personal) | IRT (Rasch model) | Expert panel
Geranpayeh and Kunnan (2007) | Listening | Age group | IRT (compact model and augmented model) | Expert panel
Koo, Becker, and Kim (2014) | Reading | Grade level; Gender; Ethnicity | Mantel-Haenszel | –
Magis, Raiche, Beland, and Gerard (2011) | Reading | Year test was taken | Logistic regression; Mantel-Haenszel; Lord's χ2 test | –
Oliveri, Ercikan, Lyons-Thomas, and Holtzman (2016) | Reading | Linguistic minority groups | Latent class | Statistics
Oliveri, Lawless, Robin, and Bridgeman (2018) | Reading | Gender; Socioeconomic status | Mantel-Haenszel; IRT (Linn-Harnisch); Latent class | Statistics
Pae (2012) | Reading | Item type; Gender | Mantel-Haenszel; IRT (likelihood ratio); Multiple linear regression | Statistics
Park (2008) | Listening | Gender | Mantel-Haenszel | Expert panel
Runnels (2013) | Reading, Listening, Grammar | Academic discipline | IRT (Rasch model) | –
Sandilands, Oliveri, Zumbo, and Ercikan (2013) | Reading | Country | Poly-Simultaneous Item Bias Test (Poly-SIBTEST) | Statistics
Wedman (2017) | Vocabulary | Gender | Mantel-Haenszel; Logistic regression | –


Rasch-based DIF

The Rasch-based DIF method is a special example of an IRT-related DIF detection method. The strengths of this method compared to other IRT methods are that it can analyze a dataset with missing data, there is no need to match learner ability manually, and it can detect both uniform and non-uniform DIF (Linacre, 2018a; Linacre & Wright, 1987; Luppescu, 1993). From a Rasch measurement perspective, a test that has DIF items loses the crucial test property of measurement invariance. Specifically, item parameter invariance has been violated, because uniform DIF items will have different item parameter estimates for the reference and the focal groups, and this invariance problem is even more complicated for non-uniform DIF (Bond & Fox, 2015). Furthermore, since in a unidimensional Rasch model person estimates are calculated based on information from all items and persons (to the extent that the data fit the model), DIF analysis is not affected by the composition of subgroup samples. This means that, in Rasch-based DIF analysis, item estimates of subgroups are based only on item characteristics and are not affected by the distribution of person ability in the subgroups; that is, person ability is controlled for in the DIF analysis (Linacre & Wright, 1987; Smith, 1994). The next section outlines the steps involved in Rasch-based DIF analysis through an investigation of DIF in a diagnostic English language test.

Sample study: Investigating DIF in a diagnostic English listening test Background In 2016, a university in Macau adopted a tertiary English diagnostic test from Hong Kong as a post-entry language assessment. Specifically, it was used for placement of first-year undergraduate students into English language classes and for diagnostic purposes. The test is called the Diagnostic English Language Tracking Assessment (DELTA; Urmston, Raquel, & Tsang, 2013), a diagnostic test developed for HK university students. The rationale for the adoption of the test was a Macau university senior management belief that Macau students and Hong Kong students are the “same students” (i.e., similar learner characteristics), with just lower levels of English proficiency. However, this rationale is contentious given that a review of the Macau and Hong Kong English educational systems clearly shows that they are distinctly different from each other. Hong Kong, as a former British colony, has had a long history of English language education, while Macau does not (Bray & Koo, 2004), because as a former Portuguese colony, Portuguese was first considered the official language, and English became another official language only as the city’s economy grew (X. Yan, 2017). This led to a lack of policy on English language education in the Macau secondary school curriculum, which consequently led to schools having the “freedom” to dictate their own English language curricula (Young, 2011), such as having a medium of instruction in either Cantonese or Mandarin,

The Rasch measurement approach to DIF analysis  111 and English and/or Portuguese are taught as foreign languages. In contrast, Hong Kong has a government-controlled English language curriculum imposed on both the public and private systems, and a secondary level school-leaving exam with English as a main component of the assessment. Together with other societal factors, these significantly contribute to Hong Kong students having sufficient English for tertiary education (Bray, Butler, Hui, Kwo, & Mang, 2002; Evans & Morrison, 2012), while Macau’s decentralized system has resulted in students with a wide range of English proficiency levels. This has had an impact on the tertiary sector, where there exists a wide gap between Macau secondary students’ English proficiency level and the level required by their universities (Bray & Koo, 2004; Young, 2011). Given the differences in English language educational backgrounds and English curricula, it is quite reasonable to question the appropriateness of using DELTA to diagnose Macau students’ English language proficiency without further reflection. In particular, the DELTA listening comprehension test component should be investigated given that Macau secondary students are particularly weak in listening comprehension skills (Young, 2011). This study thus aims to determine whether Macau students are disadvantaged in the DELTA listening component and answers the following research questions:     i To what extent does the DELTA listening component fulfill preconditions of unidimensionality and local independence for DIF analysis?   ii Do DELTA listening items show DIF related to the English language curriculum taken (HK vs. Macau)? iii Where DIF exists, does it interact with differences in student ability (nonuniform DIF)?

Participants

A total of 2,524 first-year undergraduate students from Macau and Hong Kong took the DELTA in academic year 2016–17. The Macau students (n = 1,464) came from one university, while the Hong Kong students (n = 1,060) came from two Hong Kong universities. All HK students were local students, and although we were unable to determine the number of Macau local students, official statistics indicate that only 27% of the undergraduate population were non-local students (University of Macau, 2016), which suggests that most of the Macau students in the study were local students. All students came from a range of disciplines, and at least 90% were required to take the DELTA. As the study aimed to determine whether the DELTA test overall is a fair test for Macau students, the HK group was considered the reference group, while the Macau group was considered the focal group.

DELTA listening component

DELTA is a post-entry computer-based diagnostic language assessment designed for the Hong Kong tertiary education context. It is a multi-componential test of academic listening, reading, grammar, and vocabulary. The listening component

Table 5.2  Listening Sub-skills in the DELTA Listening Test

Listening sub-skill                                       Number of items in item bank
Identifying specific information                                   147       48%
Interpreting a word or phrase as used by the speaker                29        9%
Understanding main ideas and supporting ideas                       57       18%
Understanding information and making an inference                   33       11%
Inferring the speaker's reasoning                                   21        7%
Interpreting an attitude or intention of the speaker                22        7%

tests students’ ability to listen to and understand the kinds of spoken English that they would listen to for English language learning and tertiary-level study more generally. Each question has been identified to assess one of the following listening sub-skills (Table 5.2). These sub-skills were derived from different taxonomies of listening sub-skills and theories of listening comprehension proposed by Buck (2001), Field (1998), Field (2005), Lund (1990), and Song (2008) (Urmston et al., 2013). The listening component is the first DELTA test component that students take and lasts for about 20 minutes. Each student listens to four recordings of texts, and each text has about six to eight multiple-choice questions with four distractors. Before each recording starts, students are given 2 minutes to study the questions for that text. The recordings are arranged to progress in level of difficulty (1 easy text – 2 medium texts – 1 difficult text). The difficulty level of these texts has been empirically determined by the Rasch calibration of the items and content analysis by the test developers. In 2016–17, the DELTA listening item bank had 47 texts and 309 items. The next section outlines the steps involved in Rasch-based DIF analysis together with the results of the study. Winsteps software version 4.2.0 (Linacre, 2018c) was used to analyze the data.

DIF analysis using the Rasch model

The DIF method proposed in this chapter consists of five steps, including the pre-DIF data screening analyses, as follows:

Stage 1: Investigating Rasch reliability and model fit
Stage 2: Investigating unidimensionality (via the principal component analysis of residuals)
Stage 3: (a) Uniform DIF or (b) non-uniform DIF (NUDIF) analysis
Stage 4: (a) Results of the uniform DIF or (b) non-uniform DIF (NUDIF) analysis
Stage 5: Discussion and interpretation of DIF

The Rasch measurement approach to DIF analysis  113 Stage 1: Investigating Rasch reliability and model fit As with any DIF detection method, we must first determine if the instrument has properties of sound measurement (i.e., adequate data fit to the model) before DIF detection can be undertaken (Ferne & Rupp, 2007; Roussos & Stout, 1996). This means first checking the fit statistics of the model, separation and strata, unidimensionality, and local independence of items. An explanation of these concepts is beyond the scope of this chapter, and readers may refer to Chapter 4, Volume I, for a comprehensive discussion of these concepts. Table 5.3 shows the summary statistics for items and persons. In Rasch measurement, separation and strata are indicators of measurably distinct groups of items or persons (Bond & Fox, 2015). The person strata in the present analysis (1.75) shows that the student population has almost two levels of ability, while the item strata (6.69) shows that the items are well-targeted as they cover a wide range of abilities. With regard to model fit, item infit mean square (MnSq) is investigated, as this is sensitive to the match of test taker ability to the item difficulty. Similar to other low-stakes tests (e.g., Banerjee & Papageorgiou, 2016), an investigation of infit MnSq should be sufficient. However, it is prudent to investigate outfit MnSq and infit and outfit standardized t values (Zstd) to get an indication of the extent of misfit in the data. It is important to note though that misfitting items do not impact DIF analysis, although DIF, in fact, is probably one of the causes of misfit (i.e., misfit as a function of DIF; Linacre, 2018a; Salzberger, Newton, & Ewing, 2014). An infit and outfit MnSq value that ranges from .7 – 1.3 is considered acceptable in a low-stakes test with values above 1.3 considered underfitting (i.e., problematic due to poor item construction, DIF, etc.). Meanwhile, infit and outfit Zstd values within ±2 are considered acceptable, although when deciding which items are more problematic, attention should first be given to infit Zstd greater than 2.0, as these underfitting items impact item and person estimates more than items with high outfit Zstd (Bond & Fox, 2015; Linacre, 2002). The point biserial correlation is also investigated for negative values, as these indicate items that potentially tap a second dimension.2 This test had an acceptable infit MnSq range (.72–1.37), although 15 items had outfit MnSq greater than 1.4, 31 items with infit Zstd greater than 2.0, and 7 items had negative point biserial correlations (total of 41 misfitting items). Table 5.3  Rasch Analysis Summary Statistics (N = 2,524) Mean (SD) SE

            Mean (SD)      SE    Measure range   Infit MnSq range   Outfit MnSq range   Infit Zstd range   Outfit Zstd range
Persons    −0.03 (.78)   0.47     −3.76–3.58        0.33–2.42           0.20–4.29           −2.6–3.3          −2.3–4.2
Items        0.0 (1.0)   0.18     −2.91–3.02        0.72–1.37           0.61–1.97          −5.83–5.34        −5.29–5.35

114  Michelle Raquel It is important to note that the mean measure and sample size of each subgroup may affect DIF analysis as the baseline measure (i.e., overall measure).

Stage 2: Investigating unidimensionality (using principal component analysis of residuals)

Principal component analysis of residuals (PCAR) was used to investigate whether there are secondary dimensions besides the main dimension measured by the test. This involves an investigation of the eigenvalue of the unexplained variance in the first contrast and the person measure disattenuated correlations (correlations estimated with measurement error removed). These values indicate whether clusters of items are measuring the same dimension in different ways or measuring a secondary dimension. If the unexplained variance eigenvalue in the first contrast is more than 2, this means that more than two items could be measuring a secondary dimension. It is important to note that the eigenvalue in the first contrast should be viewed in relation to the total number of items (Bond & Fox, 2015; Linacre, 2003). An investigation of the plot of residuals is necessary to determine whether there is a pattern to these clusters of items or whether the clustering is random (Linacre, 2018b), and the relationship of these clusters is confirmed by checking the disattenuated correlations of the clusters. If the disattenuated correlation of a cluster comparison is less than .3, further investigation of the items is required to determine whether the correlations of the clusters are accidental.3 Table 5.4 shows that the unexplained variance in the first contrast has an eigenvalue of 2.0267 out of 309 items (ratio = 0.006), which indicates a negligible secondary dimension in the data. A look at the standardized residual plot (Figure 5.3) confirms this finding, as there is no clear separation of clusters of items in the plot. The disattenuated correlation of the item clusters is also 1.00, which provides evidence that the clusters of items are measuring the same dimension (see Table 5.5). The PCAR analysis thus confirms the unidimensional structure of the data, which means that the test scores accurately represent the test takers' test performance (Ferne & Rupp, 2007).
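The disattenuation step is a one-line calculation: the observed Pearson correlation between the cluster-based person measures is divided by the square root of the product of the two clusters' reliabilities. The sketch below uses the observed correlation for clusters 2–3 from Table 5.5; the two reliability values are illustrative assumptions, since the excerpt does not report them.

```python
# Disattenuated correlation between person measures defined by two item clusters.
observed_r = 0.567        # Pearson correlation for clusters 2-3 (Table 5.5)
rel_cluster_a = 0.55      # assumed reliability of cluster A person measures
rel_cluster_b = 0.60      # assumed reliability of cluster B person measures

disattenuated_r = observed_r / (rel_cluster_a * rel_cluster_b) ** 0.5
print(round(disattenuated_r, 2))   # values near 1.0 suggest a single shared dimension
```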

Table 5.4  Principal Component Analysis of Residuals

                                        Eigenvalue   Observed   Expected
Total raw variance in observations        417.2682       100%       100%
Raw variance explained by measures        108.2682      25.9%      25.6%
Raw variance explained by persons          51.1166      12.3%      12.1%
Raw variance explained by items            57.1516      13.7%      13.5%
Raw unexplained variance (total)          309.0000      74.1%      74.4%
Unexplained variance in 1st contrast        2.0267        .5%        .7%
Unexplained variance in 2nd contrast        1.9051        .5%        .6%
Unexplained variance in 3rd contrast        1.8224        .4%        .6%
Unexplained variance in 4th contrast        1.7980        .4%        .6%

Figure 5.3  Standardized residual plot of 1st contrast (contrast 1 loadings plotted against item measures).

Table 5.5  Approximate Relationships Between the Person Measures in PCAR Analysis

Item clusters    Pearson correlation    Disattenuated correlation
1–3                    0.112                     1.00
1–2                    0.165                     1.00
2–3                    0.567                     1.00

Stage 3a: Uniform DIF analysis

After the data are confirmed to have sufficient fit to the model, we can run the DIF analysis using Winsteps. In this Rasch-based DIF method, the aim is to determine whether there is a significant difference between the original item difficulty estimate and the item difficulty estimate calculated for each subgroup. Winsteps estimates the new logit values for each subgroup (DIF measure) using a logit-difference (logistic regression) method, which re-estimates item difficulty while holding everything else constant (Linacre, 2018a). The DIF contrast indicates the difference between the DIF logit measures of the two subgroups. A positive DIF contrast indicates that the item is more difficult for the first subgroup, and a negative DIF contrast indicates that it is easier for that subgroup. The statistical significance of this effect size is determined by the Rasch-Welch t statistic, a two-sided t-test of whether the difference between the DIF logits is statistically significant. Figure 5.4

ETS DIF categories, based on DIF size (logits) and DIF statistical significance:
C = moderate to large: |DIF| > 0.64 logits and prob(|DIF| ≤ 0.43 logits) < .05 (approximately |DIF| > 0.43 logits + 2 × DIF SE)
B = slight to moderate: |DIF| > 0.43 logits and prob(|DIF| = 0 logits) < .05 (approximately |DIF| > 2 × DIF SE)
A = negligible: otherwise
C−, B− = DIF against the focal group; C+, B+ = DIF against the reference group.
Note: ETS (Educational Testing Service) uses delta units: 1 logit = 2.35 delta units; 1 delta unit = 0.426 logits (Zwick, Thayer, & Lewis, 1999, An Empirical Bayes Approach to Mantel-Haenszel DIF Analysis, Journal of Educational Measurement, 36(1), 1–28).
Source: Reprinted with permission (Linacre & Wright, 1989).

Figure 5.4  ETS DIF categorization of DIF items based on DIF size and statistical significance.

shows the categorization of DIF items based on DIF contrast sizes and statistical significance. Linacre (2018a) also recommends a more general guideline of first identifying items with a DIF contrast size of 0.5 logits as a cutoff point indicative of moderate to large DIF and then identifying those with a Rasch-Welch t = ±1.96 and a significant probability (p < .05), as this indicates that the DIF contrast could "substantially impact observed test or test item performance, and have a theoretically sound cause" (Aryadoust et al., 2011, p. 363). These cutoffs are then adjusted depending on the testing situation. As the context of this sample study is a low-stakes test, these general guidelines are used to identify DIF items.
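To show how the numbers in Table 5.6 and the ETS categories fit together, the sketch below recomputes the DIF contrast, joint standard error, and Welch-style t for one item, using the subgroup calibrations reported for item L011-04 (ITEM number 271) in Table 5.6, and then applies the ETS-style size rules from Figure 5.4. The function name is illustrative, and the classification is a simplified approximation of the procedure rather than the exact Winsteps computation.

```python
import math


def ets_category(dif_contrast: float, joint_se: float) -> str:
    """Approximate ETS DIF classification in logits (see Figure 5.4):
    C = moderate to large, B = slight to moderate, A = negligible."""
    size = abs(dif_contrast)
    if size > 0.64 and size > 0.43 + 2 * joint_se:
        return "C (moderate to large)"
    if size > 0.43 and size > 2 * joint_se:
        return "B (slight to moderate)"
    return "A (negligible)"


# Subgroup calibrations for item L011-04 as reported in Table 5.6.
dif_macau, se_macau = 1.18, 0.25      # focal group (Macau)
dif_hk, se_hk = -0.57, 0.23           # reference group (Hong Kong)

contrast = dif_macau - dif_hk                    # 1.75 logits
joint_se = math.sqrt(se_macau**2 + se_hk**2)     # about 0.34
t = contrast / joint_se                          # about 5.1 (Table 5.6 reports 5.12)

print(f"DIF contrast = {contrast:.2f}, joint SE = {joint_se:.2f}, t = {t:.2f}")
print("ETS category:", ets_category(contrast, joint_se))
```

The small discrepancy between the recomputed t and the tabled value is due to rounding of the published estimates; Winsteps also reports Satterthwaite-type degrees of freedom for the test.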

Stage 4a: Results of the uniform DIF analysis

The results of the DIF analysis show that there were 117 items flagged for uniform DIF, with 58 items more difficult for Macau students (i.e., positive DIF contrast, with a DIF contrast size range of 0.5–2.52; these items disadvantaged Macau students) and 59 items easier for HK students (i.e., negative DIF contrast, with a DIF contrast size range of −0.56 to −3.35; these items advantaged HK students). Table 5.6 lists the top and bottom half of the list of DIF items (a complete list is available in the Companion website). Two DIF items with positive DIF contrast and four DIF items with negative DIF contrast were categorized as slight to moderate DIF. All other DIF items were categorized as moderate to large DIF. Figure 5.5 shows the ICCs for item 328, which has been identified to have uniform DIF with a large DIF contrast = 1.18 (joint SE = .25, t = 4.78, df = 345, p = .000). We can see that the item has two distinctly different ICCs for the subgroups. The displacement of the Macau ICC (red curve) to the right indicates DIF that disadvantages Macau students, as their probability of success is lower than that for HK students of the same ability. The Macau students found this item much more difficult than the HK students did, and the item should have had a higher logit value for the Macau subgroup. In contrast, Figure 5.6 shows item 327, which has been identified to have uniform DIF with a large DIF contrast = −.76 (joint SE = .24, t = −3.22, df = 333, p = .001). Compared to Figure 5.5, the ICCs in Figure 5.6 are reversed: Macau students find the item easier compared to HK students.

Stage 3b: Non-uniform DIF (NUDIF) analysis

As previously mentioned, non-uniform DIF occurs when the ICCs of the subgroups intersect, as this means that the pattern of responses to the item within a subgroup changes depending on ability. It is important to identify non-uniform DIF, as it "could have implications for setting cutoff scores" (Ferne & Rupp, 2007, p. 134). The sample is split into two further groups based on ability by taking the range of the person measures and splitting it in two at the mid-point of the range (J. M. Linacre, personal communication, 25 July 2018). Thus, the groups are further divided into subgroups and compared against subgroups of the same ability level: low-ability HK group vs. low-ability Macau group (H1 vs. M1) and high-ability HK

DIF MEAS

−0.67 −0.47 −0.89 −1.27 1.82 0.24 −0.05 1.18 1.42 1.41 1.42 −0.19 −0.24 −0.86 – 0.98 −1.45 0.35 −0.56 0.56 0.37 0.32 0.58 0.51

PERSON CLASS (Focal)

M M M M M M M M M M M M M M – M M M M M M M M M

0.25 0.29 0.18 0.22 0.44 0.24 0.26 0.25 0.44 0.2 0.21 0.23 0.19 0.19 – 0.19 0.19 0.18 0.27 0.28 0.21 0.25 0.24 0.31

DIF SE

H H H H H H H H H H H H H H – H H H H H H H H H

PERSON CLASS (Reference)

Table 5.6  Items With Uniform DIF

−3.19 −2.82 −3.13 −3.48 −0.07 −1.63 −1.92 −0.57 −0.27 0.27 0.29 −1.32 −1.36 −1.97 – 2.58 0.19 1.99 1.13 2.4 2.31 2.36 3.4 3.86

DIF MEAS 1.02 1.03 0.72 1.02 0.44 0.49 0.48 0.23 0.41 0.19 0.18 0.39 0.3 0.4 – 0.25 0.24 0.2 0.29 0.39 0.25 0.37 0.47 0.62

DIF SE

2.52 2.35 2.23 2.22 1.89 1.88 1.87 1.75 1.69 1.14 1.13 1.13 1.12 1.11 – −1.6 −1.64 −1.64 −1.69 −1.85 −1.94 −2.04 −2.82 −3.35

DIF CONTRAST 1.05 1.07 0.74 1.04 0.62 0.55 0.55 0.34 0.6 0.28 0.28 0.46 0.36 0.44 – 0.32 0.31 0.27 0.4 0.48 0.32 0.45 0.53 0.69

JOINT SE 2.41 2.2 3.01 2.13 3.04 3.44 3.4 5.12 2.81 4.06 4.03 2.47 3.12 2.49 – −5.07 −5.36 −6.06 −4.23 −3.84 −6.01 −4.55 −5.31 −4.84

RaschWelsh t 50 29 93 48 67 59 95 239 83 313 334 79 150 129 – 271 176 293 110 100 223 105 95 80

RaschWelsh df 0.02 0.04 0.00 0.04 0.00 0.00 0.00 0.00 0.01 0.00 0.00 0.02 0.00 0.01 – 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00

RaschWelsh Prob. 840 370 337 375 373 386 319 271 266 427 293 846 416 258 – 393 335 364 301 456 269 228 349 297

ITEM number

L042-04 L026-01 L021-04 L027-01 L026-04 L028-04 L018-04 L011-04 L010-03 L034-08 L014-06 L043-03 L033-03 L008-04 – L029-05 L021-02 L025-03 L015-06 L039-03 L011-02 L003-04 L023-03 L015-02

Name

Figure 5.5  Sample ICC of item with uniform DIF (positive DIF contrast).

Figure 5.6  Sample ICC of item with uniform DIF (negative DIF contrast).

group vs. high-ability Macau group (H2 vs. M2). The same procedures are followed as in uniform DIF analysis, where the same cutoff points for identifying substantial DIF are applied (DIF contrast = ±.5; t = ±1.96; p < .05).

For J > 1 items, each of which has M > 2 score categories, the PCM has J × (M − 1) item parameters, whereas the RSM has J + (M − 1) item parameters. Having fewer parameters (i.e., being more restrictive) comes with both advantages and disadvantages. With the RSM, the relative locations of threshold parameters are estimated using data from all items. The PCM thresholds need to be estimated from individual item data. Because RSM threshold estimates are based on more data, they are likely to be more stable than the corresponding PCM estimates, especially when data are weak (e.g., a small sample). However, in practice, it is highly unlikely that the relative locations of thresholds are exactly the same across all items. By pooling data across all items based on this unlikely assumption, RSM threshold estimates can be considerably worse than PCM estimates as a representation of data. Whether the RSM is better suited than the PCM for a given analysis task, therefore, depends on the strength of data and the extent to

which the relative locations of thresholds in the observed data are comparable across items. In addition to the empirical considerations, there may also be substantive reasons to compare the RSM and the PCM. For example, a researcher may have a hypothesis that the score categories of the Likert-type items they developed have the same relative locations. A comparison between the RSM and the PCM provides a direct test of this hypothesis. We illustrate the comparison between the RSM and the PCM later in the illustrative analysis section.

Language assessment research using the RSM and the PCM The RSM and the PCM have been used to address a wide range of research questions in the language assessment field. One of their primary uses has been to support test and research instrument development. Early uses of the PCM include Adams, Griffin, and Martin (1987), who used the PCM to evaluate the dimensionality and items of an interview-based oral proficiency test. Similarly, Pollitt and Hutchinson (1987) employed the PCM to estimate thresholds of writing tasks to justify them as a generalizable measure of writing skill. In addition to oral proficiency interviews and essay writing tasks, a variety of measures such as group oral discussion tasks (Fulcher, 1996), C-test (Lee-Ellis, 2009), and experimental tasks based on psycholinguistic theories (Van Moere, 2012) have been evaluated using the PCM. Aryadoust (2012) also used the PCM to evaluate the feasibility of including sentence structure and vocabulary as an analytic scoring criterion for English as a foreign language (EFL) writing assessment. The RSM has been a popular choice for evaluating self-assessment instruments. Aryadoust, Goh, and Kim (2012) utilized the RSM to estimate overall item difficulties and identify potential issues of a self-assessment questionnaire for academic listening skills. Similarly, Suzuki (2015) developed for Japanese learners a self-assessment instrument consisting of can-do statements and used the RSM to identify statements that did not follow model predictions well. The item and/or person parameters from the models were sometimes used as intermediary quantities for subsequent analyses. Eckes (2012, 2017) used the RSM person parameter estimates as inputs for classifying test takers into groups in deriving empirically based cut scores for placement purposes. Lee, Gentile, and Kantor (2008) focused on data patterns not well captured by the PCM to identify test takers with non-flat profiles on analytically scored writing tasks. The RSM and the PCM can also be used for theoretically oriented inquiries, as was the case in Stewart, Batty, and Bovee (2012), who evaluated theories and assumptions about second-language vocabulary acquisition by fitting the PCM to a scale of vocabulary knowledge. For some researchers, the importance and use of the RSM and the PCM are closely related to the properties of Rasch measurement models. In light of this, McNamara and Knoch’s (2012) review of Rasch model use in language assessment research can be relevant. We also note that one of the most frequent uses

of the RSM and the PCM has been to serve as the basis of many-facets Rasch models (Linacre, 1994). Therefore, studies reviewed in Chapter 7, Volume I can be considered relevant as well. We noticed that there was not always a discussion of either why a given model was selected or how well it accounted for data. It is important though to evaluate how well, compared to available alternatives, a model accounts for characteristic patterns of data. Equally important is to identify patterns of data that are not well captured by the model. Some patterns in data, such as an unbalanced distribution of responses and extreme response patterns, are difficult to capture regardless of model choice. Other data patterns may be better accounted for with a more flexible model. For example, the RSM cannot capture the differences in the relative locations of item thresholds, whereas the PCM can. When reported, evaluations of the RSM and the PCM have often relied exclusively upon fit indices such as infit and outfit mean squares. These indices are convenient and useful but have theoretical caveats (Karabatsos, 2000) and do not show which data patterns are not captured by the model. A careful inspection of predicted response probabilities along with observed response proportions helps identify data patterns not captured by a given IRT model (Hambleton & Han, 2005; Wells & Hambleton, 2016) and therefore can complement index-based model evaluation. The illustrative analysis in the following section highlights discussions about model evaluation and comparison. The model evaluation utilizes infit and outfit mean square indices as well as the visual inspection of the alignment between model-predicted response probabilities and observed score proportions.

Sample study: PCM and RSM scaling of motivation questionnaire We illustrate the use of the RSM and the PCM with Likert-type items in a motivation questionnaire. The goal of this illustrative analysis is threefold: (1) to obtain a motivation score for each participant, (2) to examine the category thresholds of each item, and (3) to compare the PCM and the RSM to determine which model to use. Our discussion will center around the comparison between the RSM and the PCM. This narrow pool of considered models is an artifact of the scope of this chapter and does not represent a recommended practice. Acknowledging the popularity of Winsteps (Linacre, 2017a) in language assessment research, the analyses in this section were conducted using Winsteps unless noted otherwise. Details of all analyses in this section can be found in the Companion website.

Method Data Our data included responses to a set of eight Likert-type items (attached in the Companion website). The items were adopted from Kunnan’s (1991) motivation questionnaire, which was designed to assess motivation to learn and use English.

Each item presented a statement about English learning and use, and asked participants to indicate the extent to which they agreed or disagreed with the statement by choosing one of the following options: Strongly Disagree, Disagree, Agree, and Strongly Agree. Six of the eight items presented positively worded statements toward English learning and use. Responses to these six items were mapped onto a four-point scale following the order of the options, with Strongly Disagree coded as 0 and Strongly Agree as 3. The remaining two items featured negatively worded statements, and resulting responses were reverse-mapped onto the same 0-to-3 scale such that, for all eight items, higher scores would represent stronger agreement with positive statements toward English learning and use.
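As a simple illustration of this coding step, the following base R sketch maps the four response options onto the 0–3 scale and reverse-codes the negatively worded items. The object names, the toy responses, and the indices of the negatively worded items (7 and 8) are hypothetical, since the chapter does not identify which items were reverse-coded.

# Map the Likert labels onto a 0-3 scale and reverse-code negatively
# worded items so that higher scores always mean stronger agreement
# with positive statements.
set.seed(1)
labels <- c("Strongly Disagree", "Disagree", "Agree", "Strongly Agree")
# Toy stand-in for the questionnaire responses (8 items, 5 respondents)
responses <- as.data.frame(replicate(8, sample(labels, 5, replace = TRUE)),
                           stringsAsFactors = FALSE)

to_score <- function(x) as.integer(factor(x, levels = labels)) - 1   # 0-3
scored <- as.data.frame(lapply(responses, to_score))

negatively_worded <- c(7, 8)                      # hypothetical indices
scored[negatively_worded] <- 3 - scored[negatively_worded]   # reverse-code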

Participants The eight-item motivation questionnaire was administered online via SurveyMonkey as a part of a larger study investigating the relationship between English learning/use motivation and proficiency development among adult EFL learners. In this chapter, we focus on 1,946 participants (913 females, 900 males, and 133 unidentified) who completed the questionnaire. The participants were students enrolled in 14 English courses in 10 countries. They spoke a variety of first languages including Spanish (34%), Turkish (23%), and Greek (22%). The participants also varied in age, which ranged from 14 to 47, with mean of 21.

Analyses Response distributions We first examined how the responses were distributed across items. Figure 6.2 shows that, except for Item 6, only a small number of participants “strongly disagreed” with the positive statements. In other words, only a small amount of data was available to estimate the thresholds between “Strongly Disagree” and “Disagree” for seven of the eight items. The corresponding threshold estimates would not be stable enough to be meaningful or useful. Therefore, we collapsed the two lowest categories (i.e., “Strongly Disagree” and “Disagree”) of all items and kept “Disagree” as its label. The response distributions after collapsing the two lowest categories are given in the third and fourth rows of Figure 6.2. The lowest categories now have more responses with a minimum of 59 data points (Item 5). We used the collapsed data for all subsequent analyses in this section.
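The collapsing step just described can be expressed in a few lines of base R. This is a sketch only; a toy stand-in for the 0–3 scored responses is generated so that the snippet runs on its own.

# Collapse the two lowest categories (0 = Strongly Disagree, 1 = Disagree)
# into a single category and re-index so the collapsed data use scores
# 0, 1, 2 (Disagree, Agree, Strongly Agree).
set.seed(2)
scored <- as.data.frame(matrix(sample(0:3, 8 * 20, replace = TRUE), ncol = 8))

collapsed <- as.data.frame(lapply(scored, function(x) pmax(x - 1, 0)))

# Check how many responses fall into each category per item
sapply(collapsed, function(x) table(factor(x, levels = 0:2)))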

Model assumptions In the Model Presentation section, we mentioned two assumptions common to the RSM and the PCM: the unidimensionality assumption and the local independence assumption. These assumptions are also made in dichotomous IRT models, and therefore, the methods discussed in Chapter 4, Volume I can also be used to examine these assumptions for the RSM and the PCM. Due to space limits, we


Figure 6.2 Distributions of item responses. The first two rows show the original responses with four score categories. The third and fourth rows show the three-category responses after collapsing the two lowest score categories.

report our examination of these assumptions (and the corresponding R code) only in the Companion website of this book.

Model fitting We fitted the RSM and the PCM to the motivation questionnaire responses using Winsteps (Linacre, 2017a). We used the default joint maximum likelihood (JML) method for estimating the item and person parameters and examined the resulting estimates. We also examined the standard errors for the item threshold and person parameter estimates as well as the squared reciprocal of the person parameter standard errors, which is called test information.

Model performance examination and comparisons

The RSM and PCM parameters and the corresponding standard errors are estimated under the assumption that the RSM and the PCM are the true data generating mechanisms (i.e., how participants and items interacted to yield the responses to the eight motivation items), respectively. It is unrealistic to expect that any given model represents the true data generating mechanism. Therefore, the item and person parameter estimates and their standard errors should not be taken as the “true” characterization of the items or the participants. They are data summaries, whose usefulness depends on how well the model approximates the true data generating mechanism. The farther the RSM or the PCM deviates from the true mechanism, the less useful the resulting parameter and standard error estimates become. Of course, the true data generating mechanism is not known in practice. Therefore, a direct comparison between a model and the true mechanism is not possible. Instead, we examine the performance of a model by comparing what would have happened if the model were true (i.e., model prediction) with what actually happened (i.e., data). We want our model predictions to agree with data, such that we can take its parameter and standard error estimates as useful data summaries. A convenient way of comparing model predictions with data is to look at a single number summary of how much model predictions differed from the data. Such a single number summary is often called a fit index. Two fit indices are routinely examined in Rasch model applications: infit and outfit mean squares. Both are based on squared differences between model predictions and data, with mean 1 and some variance (see Wright and Masters (1982) for computation details). When model predictions and data agree, their values should be close to 1. As the disagreement between model predictions and data becomes more severe, the indices deviate farther from 1. We also augmented the examination of the fit indices with visual inspection of model predictions and data. We then compared the performance of the RSM and the PCM to determine which of the two models should be used for this dataset.
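For readers who want to see what these indices look like computationally, here is a minimal base R sketch of infit and outfit mean squares for a single item, following the standard residual-based definitions (Wright & Masters, 1982). The observed scores, expected scores, and model variances below are toy numbers, not values from the chapter's dataset.

# Infit and outfit mean squares for one item.
# x: observed scores; e: model-expected scores; w: model variance of the
# score for each person on this item (e and w would come from the fitted
# RSM or PCM). The numbers are illustrative only.
x <- c(2, 1, 0, 2, 1, 2, 0, 1)
e <- c(1.6, 1.1, 0.4, 1.8, 0.9, 1.5, 0.7, 1.2)
w <- c(0.45, 0.60, 0.35, 0.30, 0.55, 0.50, 0.50, 0.58)

z2     <- (x - e)^2 / w             # squared standardized residuals
outfit <- mean(z2)                  # unweighted mean square
infit  <- sum((x - e)^2) / sum(w)   # information-weighted mean square
round(c(infit = infit, outfit = outfit), 2)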

Results Item threshold parameter estimates The RSM and PCM item threshold estimates and their standard errors are given in Table 6.1. Recall that, after collapsing the two lowest categories, every item had three score categories from 0 to 2. Therefore, each item had two thresholds: the first one between categories 0 and 1 and the second one between categories 1 and 2. With the RSM, the distance between these two thresholds is constrained to be equal across the eight items, whereas no such constraint is imposed on the PCM thresholds. Table 6.1 shows that the distance between the estimated first and second RSM thresholds was the same across all items, as −1.82 – (− 4.63) = 0.18 – (−2.63) = . . . = 0.11 – (− 2.7) ≈ 2.8. Their absolute

Table 6.1  Item Threshold Parameter Estimates (With Standard Errors in Parentheses) From the RSM and the PCM

Item   RSM Threshold 1   RSM Threshold 2   PCM Threshold 1   PCM Threshold 2
1      −4.63 (0.03)      −1.82 (0.02)      −3.38 (0.14)      −2.07 (0.07)
2      −2.63 (0.03)       0.18 (0.02)      −3.22 (0.1)        0.45 (0.06)
3      −3.53 (0.03)      −0.72 (0.02)      −3.61 (0.12)      −0.68 (0.06)
4      −2.62 (0.03)       0.19 (0.02)      −2.85 (0.09)       0.32 (0.06)
5      −4.26 (0.03)      −1.45 (0.02)      −3.83 (0.15)      −1.52 (0.06)
6      −0.85 (0.03)       1.96 (0.02)      −0.51 (0.06)       1.47 (0.07)
7      −3.75 (0.03)      −0.94 (0.02)      −3.96 (0.14)      −0.88 (0.06)
8      −2.7 (0.03)        0.11 (0.02)      −2.95 (0.09)       0.24 (0.06)

locations differed because overall item difficulty of each item (i.e., τ1, . . ., τ8) differed; the estimated RSM overall difficulty parameters were −3.22, −1.22, −2.12, −1.21, −2.86, 0.56, −2.35, and −1.3 for the eight items, respectively. With the PCM, the distance between the first and second threshold estimates differed across items, ranging from −2.07 – (−3.38) = 1.31 for Item 1 to 0.45 – (−3.22) = 3.67 for Item 2. We can expect that the RSM and the PCM will yield different predictions for items in which the distance between the two PCM threshold estimates differ from 2.8 by a large margin (e.g., Items 1 and 2), and similar predictions on items whose PCM threshold estimates are approximately 2.8 points apart (e.g., Item 3). This is illustrated in Figure 6.3,4 which shows that the predicted probabilities from the RSM (dotted lines) and the PCM (solid lines) did not agree with each other for Items 1 and 2 (especially for the two lower score categories), but were almost identical across all score categories for Item 3. Table 6.1 shows another important difference between the RSM and the PCM: the estimated standard errors. All item threshold estimates had the same standard errors under the RSM, whereas the PCM standard errors varied across items. In addition, the RSM standard errors were always smaller than the corresponding PCM standard errors. These two patterns have to do with the amount of data used for RSM and PCM threshold estimation. Because of the equally distanced thresholds assumption, the RSM thresholds were estimated by pooling data from all items. Therefore, the RSM threshold estimates share the same amount of uncertainty across all items. On the other hand, the estimation of the PCM thresholds used data from a single item, and the amount of data available for each threshold naturally varied across items. For example, the first PCM threshold of Item 5 was estimated based on 517 data points that belonged to scores 0 or 1 on Item 5, whereas the second PCM threshold estimate of the same item was based on 1,887 data points that belonged to scores 1 or 2. Because the first PCM threshold of Item 5 was


Figure 6.3 Estimated response probabilities for Items 1, 2, and 3 from the RSM (dotted lines) and the PCM (solid lines).

estimated from much fewer data points than the second, the standard error of the former was larger than the standard error of the latter. In addition, because the number of data points used for any of the PCM thresholds was much smaller than the number of data points used for the corresponding RSM thresholds, the PCM standard errors were larger than the corresponding RSM standard errors.
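The predicted probabilities plotted in Figure 6.3 can be reproduced directly from the threshold estimates in Table 6.1. The base R sketch below computes category probabilities for Item 1 under the PCM (thresholds −3.38 and −2.07) over a grid of person parameters, using the standard Rasch-Andrich threshold formulation; applying the same function to the Item 1 RSM thresholds (−4.63 and −1.82) gives the corresponding RSM curves. This is our own sketch of the computation, not the Companion website code.

# Predicted category probabilities for one item with three score
# categories (0, 1, 2) under a Rasch partial credit / rating scale
# parameterization. 'delta' holds the Rasch-Andrich thresholds.
category_probs <- function(theta, delta) {
  # numerators: exp of cumulative sums of (theta - delta_k), with 0 for category 0
  num <- sapply(theta, function(t) exp(cumsum(c(0, t - delta))))
  t(num) / colSums(num)     # rows = theta values, columns = categories 0-2
}

theta <- seq(-6, 4, by = 0.1)
p_pcm <- category_probs(theta, delta = c(-3.38, -2.07))  # Item 1, PCM
p_rsm <- category_probs(theta, delta = c(-4.63, -1.82))  # Item 1, RSM

matplot(theta, p_pcm, type = "l", lty = 1,
        xlab = "Person parameter (logits)", ylab = "Probability")
matlines(theta, p_rsm, lty = 2)   # dotted-style overlay for the RSM curves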

Person parameter estimates Person parameter estimates from the RSM and the PCM correspond in one-to-one fashion to raw sum scores. In other words, participants with the same raw sum scores have the same person parameter estimates as well. Table 6.2 gives the raw sum scores and the RSM and PCM person parameter estimates from the motivation questionnaire dataset, as well as the corresponding standard errors and test information. Table 6.2 shows a total of 15 different raw sum scores in the data (from 2 to 16), each of which is associated with a unique person parameter estimate from each of the two models. For example, everyone who achieved the raw sum score of 6 had

Table 6.2  Person Parameter Estimates and Test Information From the RSM and the PCM

Raw score   RSM person parameter (s.e.)   RSM information   PCM person parameter (s.e.)   PCM information
2           −4.58 (0.85)                  1.4               −4.44 (0.81)                  1.52
3           −3.96 (0.75)                  1.79              −3.87 (0.71)                  1.98
4           −3.44 (0.7)                   2.07              −3.4 (0.66)                   2.27
5           −2.98 (0.67)                  2.26              −2.98 (0.64)                  2.43
6           −2.55 (0.65)                  2.39              −2.57 (0.63)                  2.49
7           −2.14 (0.64)                  2.46              −2.17 (0.64)                  2.47
8           −1.73 (0.63)                  2.49              −1.76 (0.64)                  2.41
9           −1.33 (0.64)                  2.47              −1.34 (0.65)                  2.35
10          −0.92 (0.65)                  2.4               −0.91 (0.66)                  2.27
11          −0.49 (0.66)                  2.26              −0.46 (0.68)                  2.17
12          −0.03 (0.7)                   2.04               0.02 (0.7)                   2.02
13           0.5 (0.76)                   1.72               0.54 (0.75)                  1.77
14           1.16 (0.88)                  1.3                1.17 (0.85)                  1.39
15           2.15 (1.15)                  0.76               2.08 (1.1)                   0.83
16           3.58 (1.92)                  0.27               3.43 (1.88)                  0.28

the person parameter estimates of −2.55 and −2.57 from the RSM and the PCM, respectively. Although the motivation level of participants who “strongly agreed” to every positive statement (those who achieved the maximum possible raw score of 16) would be infinitely high under maximum likelihood estimation (see, e.g., Hambleton, Swaminathan, & Rogers, 1991, p. 36, for a discussion on this issue), Winsteps provides their motivation level estimates as well based on extrapolation. Because of the one-to-one relationship between raw sum scores and person parameter estimates of the RSM and the PCM, the relative ranking of participants by the two models was identical. If the relative ranking of participants is the sole purpose of analysis, the choice between the RSM and the PCM is irrelevant because, as is shown here, they yield identical results (as long as every participant took the same set of items). The standard errors of person parameter estimates are related to the location of item thresholds. As shown in Table 6.1, most of the thresholds from the two models were located between −4 and 1. As a result, the questionnaire items provide more information about the participants whose person parameter estimates were within that range. Table 6.2 shows that the person parameter estimates within that range had smaller standard errors than those outside of the range. As mentioned earlier, there is a formal relationship between standard error and test information: test information is simply the squared reciprocal of the corresponding standard error (e.g., for person ability of −3.87 (with standard error of 0.71) estimated via the PCM, the test information index is

1/0.71² ≈ 1.98)

and conveys the same


Figure 6.4 Estimated standard errors for person parameters and test information from the RSM (dotted lines) and the PCM (solid lines). Although Winsteps yielded person parameter estimates larger than 3 for those who produced perfect raw scores, those estimates were extrapolated, and thus their standard errors and test information are not shown here.

information as the standard error in the opposite direction. The larger the information, the more precise the person parameter estimate. Standard errors and test information are often visualized across the person parameter scale, as can be seen in Figure 6.4, where the opposing patterns of standard errors and test information are illustrated.
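Because test information is simply the squared reciprocal of the standard error, curves like those in Figure 6.4 can be generated from the standard errors alone. A minimal base R sketch, using the PCM columns of Table 6.2:

# Convert person-parameter standard errors into test information
# (information = 1 / SE^2) and plot both against the person estimates.
theta_hat <- c(-4.44, -3.87, -3.40, -2.98, -2.57, -2.17, -1.76, -1.34,
               -0.91, -0.46, 0.02, 0.54, 1.17, 2.08, 3.43)
se        <- c(0.81, 0.71, 0.66, 0.64, 0.63, 0.64, 0.64, 0.65,
               0.66, 0.68, 0.70, 0.75, 0.85, 1.10, 1.88)
info      <- 1 / se^2

op <- par(mfrow = c(1, 2))
plot(theta_hat, se, type = "b",
     xlab = "Person parameter (logits)", ylab = "Standard error")
plot(theta_hat, info, type = "b",
     xlab = "Person parameter (logits)", ylab = "Test information")
par(op)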

Examining model performance Table 6.3 shows the infit and outfit indices from fitting the RSM and the PCM to the motivation questionnaire dataset. The indices from the RSM and the PCM were comparable. Most of them were within the range of 0.75 to 1.25, with a few exceptions. Whether this represents an overall satisfying agreement between model predictions and data is not easy to gauge. Infit and outfit indices differ in their variance, and both variance terms depend on data. Therefore, it is not easy to provide a general guideline for interpreting both indices that can be applied across different datasets. While acknowledging this difficulty, Wright and Linacre (1994) presented reasonable ranges for both infit and outfit for several types of tests. For survey questionnaire items, the suggested reasonable range was from 0.6 to 1.4, which would include all but three entries in Table 6.3. All three values greater than the suggested upper bound of 1.4 were from Item 6. The understanding of model performance can be facilitated by visually inspecting model predictions and data together. The item parameters of the RSM and the PCM provide predictions for response probabilities across the entire person parameter scale. These predicted probabilities for Items 6, 7, and 8 are represented as dotted and solid curves for the RSM and the PCM, respectively (see Figure 6.5). Figure 6.5 also presents the observed response

Table 6.3  Infit and Outfit Mean Square Values From the RSM and the PCM

Item   RSM Infit   RSM Outfit   PCM Infit   PCM Outfit
1      1.21        1.26         1.07        1.33
2      0.75        0.76         0.86        0.84
3      0.80        0.78         0.81        0.78
4      1.08        1.15         1.13        1.16
5      0.86        0.77         0.82        0.74
6      1.48        1.49         1.34        1.44
7      1.09        1.20         1.11        1.20
8      0.76        0.75         0.81        0.79

Figure 6.5 Estimated response probabilities for Items 6, 7 and 8 from the RSM (dotted lines) and the PCM (solid lines), with observed response proportions (unfilled circles). The size of a circle is proportional to the number of data points available.

proportion of participants at each of the 15 different person parameter estimates from the PCM. Given the similarity between the RSM and PCM person parameter estimates (see Table 6.2), the RSM person parameters yielded practically identical plots and are therefore excluded for visual clarity. Each circle in Figure 6.5 represents the average proportion of participants who received the same raw score (and the same PCM person parameter). The size of each circle is proportional to the number of participants who belonged to that circle; the larger the number of participants, the larger the circle. We focus on how well the data patterns (i.e., circles) were approximated by the model predictions (i.e., dotted and solid curves).

The three items shown in Figure 6.5 exemplify different cases of relative model performance. The predictions for Item 7 based on the RSM and the PCM were almost identical and very close to the observed proportions across all score categories, except for some fluctuations in the very-low-ability region, where data were scarce (very small circles). From a practical perspective, Item 7 presents an ideal case because the model predictions and the data agree well regardless of the model choice. Item 6 represents a challenging case in which neither the RSM nor the PCM yielded predictions that were in line with the observed proportions. In this item, score category 1 appears particularly problematic, with a plateau of observed proportions from −2 to 2 on the person parameter scale. A plateau of this shape (i.e., little changes in the correct response proportion throughout the middle range of the person parameter scale) is difficult to address with a parametric IRT model. Item 8 is somewhat similar to Item 6 in that both the RSM and the PCM failed to properly capture the observed response proportions in score category 1. The rise and fall of the observed response proportions were steeper than what the models predicted. Similarly, the decrease in the proportion in score category 0 and the increase in the proportion in score category 2 were steeper than model predictions. This suggests that the discrimination of this item may be higher than the other items (recall that both the RSM and the PCM assume equal discrimination across all items). The generalized partial credit model (GPCM; Muraki, 1992) allows each item to have its own discrimination parameter, and therefore, may provide better predictions for Item 8.

The advantage of Figure 6.5 (i.e., visual inspection of model and data together) over Table 6.3 (i.e., examination of fit indices) lies in the additional information about where the models struggled, and whether and how they can be improved. In the case of Item 6, the struggle had to do with the plateau of score category 1 proportions across a wide range of person parameters, which is difficult for any parametric IRT model to handle. On the other hand, the issue with Item 8 was the models' inability to approximate the steep rise and fall of observed response proportions, which could be addressed by using the GPCM. A downside of Figure 6.5 is additional efforts required for visualization and interpretation, especially when there are a large number of items. Nevertheless, we believe that the additional information that can be gained by visualizing model predictions and data together is worth the additional effort.

Model comparison

The importance of model selection between the RSM and the PCM depends on the purpose of the analysis. As shown in Table 6.1, the RSM and the PCM yielded different estimates for the first and second thresholds, leading to the different relative orderings of the items in terms of threshold locations. On the other hand, the relative ranking of person parameter estimates from the two models (as well as raw sum scores) was identical, as can be seen in Table 6.2. Therefore, the choice between the RSM and the PCM becomes relevant only when item parameter estimates are of interest. For the purpose of illustration, we assume that the focus of this comparison lay on the item threshold parameter estimates. The distance between the first and second PCM threshold estimates in Table 6.1 varied across the eight items; the smallest distance was about half of the pooled estimated distance of the RSM, and the largest distance was about 30% larger than the RSM distance. Table 6.1 also shows that the threshold estimates of Items 1 and 2 differed across the two models by more than twice the size of their estimated standard errors. The capability of the PCM in capturing these data patterns is a compelling reason to prefer the PCM for the motivation questionnaire dataset. Of course, as a model gets more complex, its performance on that specific dataset will inevitably improve. Therefore, the degree of stability of estimates from the more complex model should also be considered. The standard errors of the PCM threshold estimates were indeed larger than the corresponding RSM standard errors. However, because of the relatively large sample size, the PCM standard errors were not substantial in absolute terms. We believe that the large sample size and the relatively small standard errors of threshold estimates can justify the use of the PCM over the RSM in this analysis.

Limitations: parameter estimation method

A detailed discussion of IRT model parameter estimation has been presented in the educational measurement literature (e.g., Glas, 2016) and goes beyond the scope of this chapter. Most RSM and/or PCM applications in the language assessment literature have employed one of the following estimation methods: JML, marginal maximum likelihood (MML), and conditional maximum likelihood (CML). One caveat of using Winsteps for the illustrative analysis is that the item and person parameters were estimated using the JML method. JML has a well-known issue that, even with an infinitely large sample, its parameter estimates do not converge to population values (Neyman & Scott, 1948). A proper coverage of this topic is beyond the scope of this chapter, and readers interested in this topic in the IRT modeling context are referred to Haberman (2006, 2016). The Winsteps user guide (Linacre, 2017b) also provides a comparison among estimation methods from a more favorable view of JML. We concur with Haberman and encourage language assessment researchers to explore other software packages that allow CML or MML estimation, especially when model estimates are to be used for high-stakes decisions. We demonstrate the MML estimation of the RSM

and PCM item parameters using the mirt package (Chalmers, 2012) in R (R Core Team, 2018) in the Companion website of this book.
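As a rough indication of what such an MML-based analysis can look like, the sketch below fits the RSM and the PCM to a response matrix with the mirt package and compares them with a likelihood-ratio test. This is our own minimal illustration under stated assumptions (the data frame name is hypothetical, and mirt's parameterization differs from Winsteps'); it is not the code provided on the Companion website.

# MML estimation of the RSM and the PCM with the mirt package, plus a
# likelihood-ratio comparison (the RSM is nested within the PCM).
# 'collapsed' stands for the 0-2 scored item responses (hypothetical name).
library(mirt)

mod_rsm <- mirt(collapsed, model = 1, itemtype = "rsm")    # rating scale model
mod_pcm <- mirt(collapsed, model = 1, itemtype = "Rasch")  # partial credit model

coef(mod_rsm, simplify = TRUE)   # item threshold estimates
coef(mod_pcm, simplify = TRUE)

anova(mod_rsm, mod_pcm)          # likelihood-ratio test plus AIC/BIC
itemfit(mod_pcm)                 # item-level fit statistics
head(fscores(mod_pcm))           # person estimates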

Conclusion In this chapter, we have introduced the RSM and the PCM, briefly reviewed language assessment studies that utilized them, and illustrated their use on a motivation questionnaire dataset. The introduction of the models focused on how the dichotomous Rasch model can be extended to the PCM for polytomous items and how the RSM can be derived from the PCM with an additional assumption. The review of the literature showed that important topics such as model evaluation and comparison have not always been explicitly discussed or reported. Therefore, the illustrative analysis highlighted tools and strategies for evaluating model performance and comparing models. The findings from the illustrative analysis indicated that the PCM would be preferred for its superior ability to capture item response probabilities over the RSM for the motivation questionnaire dataset. Both the RSM and the PCM have been used for several decades. It is safe to say that these two models have become staple tools in analyzing polytomously scored responses in language assessment research. However, this does not mean that the models should be used whenever a dataset includes polytomous scores. As with any models, the RSM and the PCM are a simplification that may or may not be useful, depending on how well they capture important patterns in data. The illustrative analysis showcased our attempt to use some of the well-known methods to evaluate whether the RSM and/or the PCM can provide useful information about data. The illustrative analysis in this chapter is by no means comprehensive. We purposefully omitted or condensed several important aspects of RSM/PCM applications to highlight the topics we consider important and manageable. For example, we did not discuss missing data (Rose, von Davier, & Xu, 2010) or consider other polytomous models such as the graded response model (Samejima, 1969). Even when examining model fit, which was one of our main topics, we did not investigate person fit or conduct a direct model comparison using a global fit index (Cai & Hansen, 2013; Maydeu-Olivares & Joe, 2005). Some of these topics were too broad to cover in a short chapter or required a technical description beyond the scope of this chapter. Our selective focus in the illustrative analysis does not indicate that such omitted topics can be ignored. Instead, we hope that readers will find the references in this chapter to be useful in learning more about recent developments in and best practices for the application of the RSM or PCM.

Notes

1 In this chapter, we discuss the Rasch model and its polytomous extensions as item response theory models that describe the relationship between a person's latent ability and observed item responses, without focusing on the mathematical and substantive characteristics specific to the Rasch model family. Extensive discussions about such characteristics of the Rasch model family are given by, among others, Bond and Fox (2015) and Fischer and Molenaar (1995).
2 Polytomous scores do not necessarily have an inherent order. As the RSM and the PCM can only accommodate ordered polytomous scores, we focus on such ordered scores in this chapter and omit "ordered" in all subsequent mentions of polytomous scores.
3 Although the event in Equation (1) (i.e., Yij = 1) is conditional on θi and bj, we omit the conditional probability notation (i.e., Pr(Yij = 1 | θi, bj)) in all equations in this chapter because it is clear from the right-hand side.
4 Figures 6.3 to 6.5 were created using R (R Core Team, 2018). The code used can be found in the Companion website.

References

Adams, R. J., Griffin, P. E., & Martin, L. (1987). A latent trait method for measuring a dimension in second language proficiency. Language Testing, 4(1), 9–27.
Agresti, A. (2012). Categorical data analysis (3rd ed.). Hoboken, NJ: John Wiley & Sons Inc.
Andersen, E. B. (1977). Sufficient statistics and latent trait models. Psychometrika, 42(1), 69–81.
Andrich, D. (1978a). Application of a psychometric rating model to ordered categories which are scored with successive integers. Applied Psychological Measurement, 2(4), 581–594.
Andrich, D. (1978b). A rating formulation for ordered response categories. Psychometrika, 43(4), 561–573.
Aryadoust, V. (2012). How does “sentence structure and vocabulary” function as a scoring criterion alongside other criteria in writing assessment? International Journal of Language Testing, 2(1), 28–58.
Aryadoust, V., Goh, C. C. M., & Kim, L. O. (2012). Developing and validating an academic listening questionnaire. Psychological Test and Assessment Modeling, 54(3), 227–256.
Bond, T., & Fox, C. M. (2015). Applying the Rasch model: Fundamental measurement in the human sciences (3rd ed.). New York, NY: Routledge.
Cai, L., & Hansen, M. (2013). Limited-information goodness-of-fit testing of hierarchical item factor models. British Journal of Mathematical and Statistical Psychology, 66(2), 245–276.
Chalmers, R. P. (2012). mirt: A multidimensional item response theory package for the R environment. Journal of Statistical Software, 48(6), 1–29.
du Toit, M. (Ed.). (2003). IRT from SSI: Bilog-MG, multilog, parscale, testfact. Lincolnwood, IL: Scientific Software International, Inc.
Eckes, T. (2012). Examinee-centered standard setting for large-scale assessments: The prototype group method. Psychological Test and Assessment Modeling, 54(3), 257–283.
Eckes, T. (2017). Setting cut scores on an EFL placement test using the prototype group method: A receiver operating characteristic (ROC) analysis. Language Testing, 34(3), 383–411.
Fischer, G. H., & Molenaar, I. W. (Eds.). (1995). Rasch models: Foundations, recent developments, and applications. New York, NY: Springer Science & Business Media.
Fulcher, G. (1996). Testing tasks: Issues in task design and the group oral. Language Testing, 13(1), 23–51.
Glas, C. A. W. (2016). Maximum-likelihood estimation. In W. van der Linden (Ed.), Handbook of item response theory (Vol. 2, pp. 197–216). Boca Raton, FL: CRC Press.
Haberman, S. J. (2006). Joint and conditional estimation for implicit models for tests with polytomous item scores (ETS RR-06-03). Princeton, NJ: Educational Testing Service.
Haberman, S. J. (2016). Models with nuisance and incidental parameters. In W. van der Linden (Ed.), Handbook of item response theory (Vol. 2, pp. 151–170). Boca Raton, FL: CRC Press.
Hambleton, R. K., & Han, N. (2005). Assessing the fit of IRT models to educational and psychological test data: A five-step plan and several graphical displays. In R. R. Lenderking & D. A. Revicki (Eds.), Advancing health outcomes research methods and clinical applications (pp. 57–77). McLean, VA: Degnon Associates.
Hambleton, R. K., Swaminathan, H., & Rogers, H. J. (1991). Fundamentals of item response theory (Vol. 2). Newbury Park, CA: SAGE Publications.
Karabatsos, G. (2000). A critique of Rasch residual fit statistics. Journal of Applied Measurement, 1, 152–176.
Kunnan, A. J. (1991). Modeling relationships among some test-taker characteristics and performance on tests of English as a foreign language. Unpublished doctoral dissertation, University of California, LA.
Lee, Y.-W., Gentile, C., & Kantor, R. (2008). Analytic scoring of TOEFL CBT essays: Scores from humans and E-rater® (ETS RR-08-81). Princeton, NJ: Educational Testing Service.
Lee-Ellis, S. (2009). The development and validation of a Korean C-test using Rasch analysis. Language Testing, 26(2), 245–274.
Linacre, J. M. (1994). Many-facet Rasch measurement (2nd ed.). Chicago, IL: MESA Press.
Linacre, J. M. (2017a). Winsteps® Rasch measurement computer program [Computer software]. Beaverton, OR: Winsteps.com.
Linacre, J. M. (2017b). Winsteps® Rasch measurement computer program user’s guide. Beaverton, OR: Winsteps.com.
Masters, G. N. (1982). A Rasch model for partial credit scoring. Psychometrika, 47(2), 149–174.
Maydeu-Olivares, A., & Joe, H. (2005). Limited and full information estimation and goodness-of-fit testing in 2^n contingency tables: A unified framework. Journal of the American Statistical Association, 100(471), 1009–1020.
McNamara, T., & Knoch, U. (2012). The Rasch wars: The emergence of Rasch measurement in language testing. Language Testing, 29(4), 555–576.
Muraki, E. (1992). A generalized partial credit model: Application of an EM algorithm. Applied Psychological Measurement, 16(2), 159–176.
Neyman, J., & Scott, E. L. (1948). Consistent estimation from partially consistent observations. Econometrica, 16, 1–32.
Pollitt, A., & Hutchinson, C. (1987). Calibrating graded assessments: Rasch partial credit analysis of performance in writing. Language Testing, 4(1), 72–97.
R Core Team. (2018). R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing.
Rose, N., von Davier, M., & Xu, X. (2010). Modeling nonignorable missing data with item response theory (ETS RR-10-11). Princeton, NJ: Educational Testing Service.
Samejima, F. (1969). Estimation of latent ability using a response pattern of graded scores. Psychometrika, Monograph Supplement No. 17.
Stewart, J., Batty, A. O., & Bovee, N. (2012). Comparing multidimensional and continuum models of vocabulary acquisition: An empirical examination of the vocabulary knowledge scale. TESOL Quarterly, 46(4), 695–721.
Suzuki, Y. (2015). Self-assessment of Japanese as a second language: The role of experiences in the naturalistic acquisition. Language Testing, 32(1), 63–81.
Van Moere, A. (2012). A psycholinguistic approach to oral language assessment. Language Testing, 29(3), 325–344.
Wells, C. S., & Hambleton, R. K. (2016). Model fit with residual analysis. In W. van der Linden (Ed.), Handbook of item response theory (Vol. 2, pp. 395–413). Boca Raton, FL: CRC Press.
Wright, B. D., & Linacre, J. M. (1994). Reasonable mean-square fit values. Rasch Measurement Transactions, 8, 370.
Wright, B. D., & Masters, G. N. (1982). Rating scale analysis. Chicago, IL: MESA Press.

7  Many-facet Rasch measurement: Implications for rater-mediated language assessment

Thomas Eckes

Introduction Consider the following situation: Test takers are given a 60-minute writing task consisting of two parts. In the first part, a diagram is provided along with a short introductory text, and the test takers have to describe the relevant information shown in the diagram. In the second part, the test takers have to consider different positions on an aspect of the topic and write a structured argument. Two raters independently rate each test taker’s response on a small set of criteria using a four-category rating scale, and the test takers are then provided with scores for each criterion. The situation described above is typical of rater-mediated assessments, also known as constructed-response or performance assessments (e.g., Engelhard, 2013; Lane & Iwatani, 2016). In this format, test takers are required to create a response or perform a task rather than choose the correct answer from a number of alternatives given (as, for example, in multiple-choice tests). Constructedresponse tasks range from limited production tasks like short-answer questions to extended production tasks that prompt test takers to write an essay, deliver a speech, or provide work samples (Carr, 2011; Johnson, Penny, & Gordon, 2009). The term “rater-mediated assessment” reflects the central role of human raters in this type of assessment: The information that a test taker’s response provides about the construct of interest is mediated through raters’ judgmental and decision-making processes. Thus, a number of rater characteristics, for example, their overall tendency to assign harsh or lenient ratings, their understanding of the scoring criteria, and their interpretation and use of the rating scale, are likely to exert influence on the assessment and its outcomes. Figure 7.1 presents in a simplified manner the relations between the basic components of rater-mediated assessments. The figure suggests that raters do not passively transform an observed performance into a score (what scoring machines supposedly do); rather, they actively construct an evaluation of the performance. These rater-specific, more or less idiosyncratic constructions feed into the final (observed or raw) scores that are assigned to test takers. To some extent, then, the variability of scores is associated with characteristics of the raters and not just with the performance of test takers. The manifold contributions of individual raters to scores are commonly referred to as rater effects (or rater errors), and the rater-induced variability in scores is called rater variability.


Figure 7.1  The basic structure of rater-mediated assessments.

A popular approach to dealing with rater effects, especially in large-scale assessments, consists of the following components: (a) rater training, (b) independent ratings of the same performances by two or more raters (repeated ratings), and (c) demonstration of interrater agreement or reliability. However, this standard approach has been criticized as being insufficient, in particular with respect to the often limited effectiveness of rater training and the strong reliance on interrater agreement/reliability as evidence of rating quality (Eckes, 2015; Wind & Peterson, 2018). On a more general note, the standard approach rests on the so-called testscore or observed ratings tradition of research into rater effects, as opposed to the scaling or scaled ratings tradition (Eckes, 2017; Engelhard, 2013; Wind & Peterson, 2018). The test-score tradition includes methods based on classical test theory and its extension to generalizability theory (G-theory; e.g., Brennan, 2011). Prominent examples of the scaling tradition include a wide range of models within item response theory (IRT; e.g., Yen & Fitzpatrick, 2006) and the various forms of Rasch measurement (Wright & Masters, 1982; Wright & Stone, 1979), in particular many-facet Rasch measurement (MFRM; Linacre, 1989). In the following section, MFRM is presented as a psychometric framework that is well suited to account for rater effects and, more generally, to provide a detailed analysis and evaluation of rater-mediated assessments.

Many-Facet Rasch Measurement (MFRM) Psychometric models within the MFRM framework extend the basic, dichotomous (logistic) Rasch model (Rasch, 1960/1980) in two ways: First, MFRM models go beyond just two components or facets of the assessment situation (i.e., test takers and items). Second, the data to be analyzed need not be dichotomous. Thus, in the analysis of performance assessments, MFRM allows the inclusion and

Many-facet Rasch measurement  155 estimation of the effects of additional facets that may be of interest besides test takers and items, such as raters, criteria, tasks, and assessment occasions. Since raters typically assign scores to test takers using three or more ordered categories (i.e., rating scales), the data most often consist of polytomous responses. For example, the rating scale may include the categories basic (= 1), advanced (= 2), and proficient (= 3). Within each facet, MFRM represents each element (i.e., each individual test taker, rater, criterion, task, etc.) by a separate parameter. The parameters denote distinct attributes of the facets involved, such as proficiency or ability (for test takers), severity or harshness (for raters), and difficulty (for scoring criteria or tasks). In most instances, the attribute of primary interest refers to test taker proficiency, such as when proficiency measures are used to inform decisions on university admission, placement, or graduation. As a critical MFRM feature, the measures of test taker proficiency compensate for variation in rater severity; that is, these measures are adjusted for the differences in the level of severity characterizing the raters who assigned the ratings. Test taker scores assigned by severe raters are adjusted toward higher proficiency (upward adjustment), and the scores assigned by lenient raters are adjusted toward lower proficiency (downward adjustment). Upward and downward adjustments follow from the fundamental principle of measurement invariance or specific objectivity (e.g., Linacre, 1989). This means that, when the data fit the model, MFRM constructs rater-invariant measures of test takers. Following the same principle, MFRM also constructs test taker–invariant measures of raters, that is, rater severity measures are adjusted for the differences in the level of proficiency of the particular set of test takers rated (Engelhard, 2013; Engelhard & Wind, 2018). Formally, Rasch models are measurement models that are based on a logistic (non-linear) transformation of qualitatively ordered counts of observations (raw scores, ratings) into a linear, equal-interval scale. Figure 7.2 helps to demonstrate how this abstract definition relates to empirical data.

Figure 7.2 Fictitious dichotomous data: Responses of seven test takers to five items scored as correct (1) or incorrect (0).


Figure 7.3  Illustration of a two-facet dichotomous Rasch model (log odds form).

As a baseline, Figure 7.2 shows a much simplified, fictitious two-facet situation in which seven test takers responded to five items, and each response was scored as correct (= 1) or incorrect (= 0). There are no raters or scoring criteria involved; all that is required to assign scores to test takers is a well-defined scoring key (and possibly a machine doing the scoring). To estimate test taker ability and item difficulty from such data, the dichotomous Rasch model displayed in Figure 7.3 can be used. According to this model, the probability that test taker n answers item i correctly, that is, p(xni = 1), depends on (is a function of) the difference between the ability of the test taker (θn) and the difficulty of the item (βi). The ability parameter has a positive orientation (plus sign: a higher value means a higher score), whereas the difficulty parameter has a negative orientation (minus sign: a higher value means a lower score). The left side of the equation gives the natural logarithm (ln) of the probability of a correct answer divided by the probability of an incorrect answer; the ln of this ratio (the odds) is called logit (short for “log odds unit”). Under this model, logits are a simple linear function of the ability parameter θn and the difficulty parameter βi. For example, when test taker ability equals item difficulty, that is, when θn = βi, the log odds of a correct response are zero; or, to express it differently, when θn = βi, the test taker has a 0.50 chance of getting the item correct (i.e., ln(0.50/0.50) = 0). More generally, logits are the linear, equal-interval measurement units for any parameter specified in a Rasch model; hence, the measurement scale used in Rasch modeling is the logit scale. In other words, the logit scale represents the latent variable or dimension of interest (person ability, item difficulty, and so on), and Rasch measures correspond to locations on this scale.

The data cube shown in Figure 7.4 illustrates a fictitious three-facet situation in which seven test takers performed a single task and three raters scored their performance according to five criteria using a five-category rating scale. Figure 7.5 presents an extension of the dichotomous Rasch model that can be used to estimate a separate parameter for each of the facets shown in Figure 7.4.
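Before the additional facets are considered, the dichotomous model in Figure 7.3 can be made concrete with a few lines of base R; the parameter values below are arbitrary and serve only to show the logic.

# Dichotomous Rasch model: log odds of a correct response = theta - beta,
# so p(correct) = exp(theta - beta) / (1 + exp(theta - beta)).
p_correct <- function(theta, beta) {
  logit <- theta - beta
  exp(logit) / (1 + exp(logit))
}

p_correct(theta = 1.5, beta = 1.5)    # ability = difficulty -> 0.50
p_correct(theta = 2.0, beta = 0.5)    # able person, easy item -> about 0.82
p_correct(theta = -1.0, beta = 1.0)   # log odds of -2 -> about 0.12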


Figure 7.4 Fictitious polytomous data: Responses of seven test takers evaluated by three raters on five criteria using a five-category rating scale.

Figure 7.5 Illustration of a three-facet rating scale measurement model (log odds form).

Most importantly, a rater parameter (α) has been added to the parameters for test takers (θ) and criteria (β). The α parameter has a negative orientation (minus sign: higher values mean lower scores) and, therefore, refers to the severity of rater j, as opposed to the rater’s leniency. Generally speaking, a severe rater systematically assigns lower ratings than expected given the quality of test takers’ performances (and a lenient rater assigns higher ratings).

158  Thomas Eckes There is yet another parameter included in the model equation shown in Figure 7.5: the threshold parameter, or category coefficient, which is denoted by τk. This parameter represents the assumed structure of the rating scale and is not considered a facet. More specifically, the threshold parameter is defined as the location on the logit scale where the adjacent scale categories, k and k−1, are equally probable to be observed. In other words, τk represents the transition point at which the probability of test taker n receiving a rating of k from rater j on criterion i (pnijk) equals the probability of this test taker receiving a rating of k−1from the same rater on the same criterion (pnijk−1). These transition points are called Rasch-Andrich thresholds (Andrich, 1998; Linacre, 2006). Together, the model shown in Figure 7.5 is a many-facet extension of the rating scale model (RSM) originally developed by Andrich (1978). The first comprehensive theoretical statement of this many-facet model has been advanced by Linacre (1989). The threshold parameter indicates how the rating data are to be analyzed. In Figure 7.5, this parameter specifies that an RSM should be used across all elements of each facet (indicated by the single subscript k). In the example, the five-category scale is treated as if all raters understood and used each rating scale category on each criterion (across all test takers) in the same manner. Regarding the criterion facet, this means that a particular rating, such as “3” on Criterion 1, is assumed to be equivalent to a rating of “3” on Criterion 2 and on any other criterion; more specifically, the set of threshold parameters, which defines the structure of the rating scale, is the same for all criteria (and for all raters and test takers). Alternatively, the threshold parameter may be specified in such a way as to allow for a variable rating scale structure across elements of a particular facet. For example, when the structure of the rating scale is assumed to vary from one criterion to the next, a partial credit version of the three-facet Rasch model, which is a criterion-related three-facet partial credit model, could be used (the subscript would change from k to the double subscript ik, that is, τik). This type of MFRM model rests on the general formulation of a partial credit model (PCM) originally developed by Masters (1982).
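To see how the model in Figure 7.5 turns facet parameters into rating probabilities, the following base R sketch computes category probabilities for a single test taker–rater–criterion combination under a three-facet rating scale model with Rasch-Andrich thresholds. All parameter values are invented for illustration; they are not estimates from the TestDaF data analyzed later in this chapter.

# Three-facet rating scale model (log odds form, as in Figure 7.5):
# ln(p_k / p_{k-1}) = theta - beta - alpha - tau_k.
# Category probabilities follow from accumulating these adjacent log odds.
rating_probs <- function(theta, beta, alpha, tau) {
  # tau: Rasch-Andrich thresholds between adjacent categories
  log_num <- cumsum(c(0, theta - beta - alpha - tau))
  exp(log_num) / sum(exp(log_num))
}

theta <- 1.2                 # test taker proficiency (invented)
beta  <- 0.3                 # criterion difficulty (invented)
alpha <- 0.5                 # rater severity (invented)
tau   <- c(-2.0, 0.0, 2.0)   # thresholds for a four-category scale (invented)

round(rating_probs(theta, beta, alpha, tau), 3)
# A more severe rater (larger alpha) shifts probability toward lower categories:
round(rating_probs(theta, beta, alpha = 1.5, tau), 3)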

MFRM in language assessment This section briefly discusses MFRM applications within the context of language assessments. Informative studies adopting the MFRM approach can also be found in a wide range of disciplines and fields of research, including studies of teachers’ grading practices (Randall & Engelhard, 2009), judging anomalies at the Winter Olympics (Looney, 2012), music performance evaluations (Springer & Bradley, 2018), pilot instructors’ crew evaluations (Mulqueen, Baker, & Dismukes, 2002), and medical admission and certification procedures (Till, Myford, & Dowell, 2013). In language assessment, MFRM studies most often focus on the measurement of writing and speaking proficiency. Here, measurement outcomes provide an account of rating quality specific to individual raters, test takers, scoring criteria,

Many-facet Rasch measurement  159 and tasks, often in foreign- or second-language contexts (Barkaoui, 2014; McNamara & Knoch, 2012; Wind & Peterson, 2018). Before discussing MFRM studies in these contexts, some terminology is introduced that is essential to understanding the studies and their findings. In general, MFRM models have been used to pursue two overriding objectives: First, to study the effect that each facet exerts on the measurement results, which is called a main-effects analysis. Second, to perform an interaction analysis, which is the study of the effect of the interaction between two or more facets. Such an analysis addresses what is called differential facet functioning (DFF), which is analogous to differential item functioning (DIF; see Chapter 5, Volume I). When referring to interactions involving raters, for example, interactions between raters and test takers or interactions between raters and criteria, the analysis is known as the study of differential rater functioning (DRF) or rater bias. When a researcher has no specific hypothesis about which facets might interact with each other, an exploratory interaction analysis is recommended. That is, all combinations of elements from the respective facets are scanned for significant differences between observed and expected scores. The expected scores are derived from an MFRM model that does not include an interaction parameter, that is, a main-effects model (e.g., the model shown in Figure 7.5) in which significant differences are flagged for close inspection. On the other hand, when the researcher is able to formulate a specific hypothesis about which facets of the assessment situation interact with each other, a confirmatory interaction analysis is appropriate. Generally, main effects are not of prime interest in such an analysis. Additional explanatory variables may be introduced in a measurement design for the sole purpose of studying interactions; these variables are called dummy facets. Most often, dummy facets include demographic or other categorical variables describing test takers (e.g., gender or age groups), raters (experts vs. novices), or rating contexts (holistic vs. analytic rating scales). Figure 7.6 presents a summary of the conceptual distinctions involved in the study of facet interactions. MFRM studies abound in the language assessment literature. Therefore, only a few more relevant examples are discussed here. As a general finding, researchers identified a significant degree of rater main effects. In particular, they noted pronounced differences in rater severity that persisted even after extensive training sessions (e.g., Bonk & Ockey, 2003; Eckes, 2005; Elder, Barkhuizen, Knoch, & von Randow, 2007). Regarding DRF studies, researchers provided evidence for biases related to rater language background (Winke, Gass, & Myford, 2013), test taker gender (Aryadoust, 2016), time of rating (Lamprianou, 2006), performance tasks (Lee & Kantor, 2015), writing genres (Jeong, 2017), and scoring criteria (Eckes, 2012). Within the context of the Internet-Based Test of English as a Foreign Language (TOEFL iBT) speaking assessment, for example, Winke et al. found that raters with Spanish as a second language (L2) were significantly more lenient toward test takers with Spanish as a native language (L1) than they were toward L1 Korean or Chinese test takers. This finding indicated that accent familiarity influenced the ratings of L2 Spanish raters.


Figure 7.6  Studying facet interrelations within a MFRM framework.

Researchers have also applied MFRM to gain insights into the effects associated with a range of different procedures commonly used for writing and speaking assessments, such as effects of analytic vs. holistic scoring rubrics (Curcin & Sweiry, 2017), paper-based vs. onscreen scoring (Coniam, 2010), and face-to-face vs. online rater training (Knoch, Read, & von Randow, 2007). For example, Knoch et al. showed that in both face-to-face and online training groups, only a few raters exhibited less criterion-related rater bias after training, whereas other raters developed new biases. On a wider scale, MFRM applications have proved useful when evaluating standard-setting procedures (Hsieh, 2013), studying rating accuracy within the context of expert or benchmark ratings (Wang, Engelhard, Raczynski, Song, & Wolfe, 2017), and examining the cognitive categories and processes that are involved when assigning ratings to test takers (Li, 2016).

Sample data: Writing performance assessment To illustrate basic concepts and procedures of an MFRM modeling approach to rater-mediated language assessments, a dataset taken from an examination of academic writing proficiency will be analyzed using the computer program FACETS (Version 3.81; Linacre, 2018a). The sample data have been thoroughly examined before (see Eckes, 2015), and therefore only a brief description of the broader assessment context is given here. The writing task (already outlined in the Introduction) required test takers to summarize the information contained in a diagram or graph (the description

part) and to formulate a structured argument dealing with different statements or positions on some aspect of that information (the argumentation part). Being a component of the Test of German as a Foreign Language (Test Deutsch als Fremdsprache, TestDaF), the task tapped into the test taker's ability to produce a coherent and well-structured text on a topic taken from the academic context.1

A total of 18 raters evaluated the writing performances (essays) of 307 test takers. Each essay was rated independently by two raters. In addition, one rater provided ratings of two essays that were randomly selected from each of the other 17 raters' workload. These third ratings served to satisfy the basic requirement of a connected dataset in which all elements of all facets are directly or indirectly linked to each other (Eckes, 2015). As a consequence, all 18 raters could be directly compared along the same dimension of severity.

Ratings were provided on a four-category rating scale, with categories labeled by TestDaF levels (TestDaF-Niveaus, TDNs). The four scale categories were as follows: below TDN 3, TDN 3, TDN 4, and TDN 5; TDNs 3 through 5 referred to German language ability at an intermediate to high level (for the purpose of computations, below TDN 3 was scored "2", and the other levels were scored from "3" to "5"). Raters evaluated each essay according to three criteria: global impression (comprising aspects such as fluency, train of thought, and structure), task fulfillment (completeness, description, and argumentation), and linguistic realization (breadth of syntactic elements, vocabulary, and correctness). Taken together, there were 648 ratings on each criterion; that is, 614 double ratings plus 34 third ratings, making a total of 1,944 ratings. These ratings provided the input for the analysis, which was based on the MFRM model discussed previously (Figure 7.5). Because the present number of ratings is below the maximum limit of 2,000 ratings (or responses), readers may re-run the analysis of these data using the free (student) version of FACETS called MINIFAC.2 The dataset, which can be used directly as input into FACETS or MINIFAC, is available in a separate file on the Companion website (see also the screenshot tutorial on this website).
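The connectedness requirement mentioned above can also be checked mechanically. The following Python sketch assumes nothing beyond a list of (rater, test taker) pairs and uses a simple union-find routine to test whether a rating design forms a single linked network; the miniature design in the example is hypothetical and is not the TestDaF dataset.

```python
# Minimal sketch of a connectedness check for a rating design: raters and
# test takers are linked whenever a rating connects them; the design supports
# joint calibration only if all elements fall into one network.

def connected(ratings):
    """ratings: iterable of (rater, test_taker) pairs."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path compression
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    for rater, test_taker in ratings:
        union(("rater", rater), ("tt", test_taker))

    roots = {find(node) for node in parent}
    return len(roots) == 1                  # one linked network?

# Two disjoint rating circles -> not connected ...
design = [("R1", 1), ("R2", 1), ("R3", 2), ("R4", 2)]
print(connected(design))                    # False
# ... until a "third rating" bridges them.
design.append(("R1", 2))
print(connected(design))                    # True
```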

The complete picture: Test takers, raters, criteria, and the rating scale

Figure 7.7 presents a graphical display of the measurement results. This display, which is called variable map or Wright map (named after Benjamin D. Wright; see Wilson, 2011), was directly taken from the FACETS output. In the Wright map, the logit scale appears as the first column (labeled "Measr"). All measures of test takers, raters, and criteria, as well as the category coefficients, are positioned vertically on the same dimension (the "vertical rulers"), with logits as measurement units. The second column ("+Test Takers") shows the locations of the test taker proficiency estimates. Each test taker is labeled by a unique three-digit number (from 001 to 307; anonymized data). Proficiency measures are ordered with higher-scoring test takers appearing at the top of the column and lower-scoring test takers


Figure 7.7 Wright map for the three-facet rating scale analysis of the sample data (FACETS output, Table 6.0: All facet vertical “rulers”). Source: From Introduction to Many-Facet Rasch Measurement: Analyzing and Evaluating Rater-Mediated Assessments (2nd ed., p. 59), by T. Eckes, 2015, Frankfurt am Main, Germany: Peter Lang. Copyright 2015 by Peter Lang. Adapted with permission. Notes: LR = linguistic realization. TF = task fulfilment. GI = global impression.

appearing at the bottom. That is, the test taker facet is positively oriented, as indicated by the plus sign. Nine test takers received the maximum score "5" (TDN 5) on all criteria. Exactly the same number of test takers received the minimum score "2" (below TDN 3) on all criteria; one of these test takers received extreme scores through the third ratings as well. As a result, non-extreme scores were available for 1,833 responses, which were used for parameter estimation. The test takers with extreme scores are located in the top and bottom rows, respectively. Generally, facet elements with extreme scores imply an infinite estimate for the parameter of the corresponding element and, thus, are dropped from the estimation. However, FACETS provides reasonable estimates for these elements by adjusting extreme scores by a fraction of a scoring point (default value = 0.3).

The third column ("-Raters") compares the raters in terms of the level of severity each exercised when rating the essays. More severe raters appear higher in the column, while less severe (or more lenient) raters appear lower; that is, the rater facet has a negative orientation (indicated by the minus sign). The fourth column ("-Criteria") presents the locations of the criterion measures. Again, this facet is negatively oriented, which means that the criteria appearing higher in the column were more difficult than those appearing lower. Hence, it was much more difficult for test takers to receive a high score on linguistic realization or task fulfillment than on global impression.

The last column ("Scale") maps the four-category TDN scale to the logit scale. The lowest scale category (2) and the highest scale category (5), both of which would indicate extreme ratings, are shown in parentheses only, since the boundaries of the two extreme categories are minus and plus infinity, respectively. The horizontal dashed lines in the last column are positioned at the category thresholds or, more precisely, at the Rasch-half-point thresholds. These thresholds correspond to expected scores on the scale with half-score points (Linacre, 2006, 2010). Specifically, Rasch-half-point thresholds define the intervals on the latent variable in which the half-rounded expected score is the category value, which is the TDN in the present case. In other words, these thresholds are positioned at the points where the chance of a test taker receiving a rating in the next higher category begins to exceed the chance of receiving a rating in the lower category. For example, a test taker located at the lowest threshold (i.e., at –3.66 logits) has an expected score of 2.5. At the next higher threshold (i.e., at –0.06 logits), the expected score is 3.5. Between these two thresholds, the half-rounded expected score is 3.0. Therefore, test takers located in this category interval are most likely to receive a rating of TDN 3. In addition, scale categories 3 and 4 have approximately the same width; therefore, each of these categories corresponds to much the same amount of the latent variable. Note also that Rasch-half-point thresholds differ from Rasch-Andrich thresholds, which are defined on the latent variable in relation to category probabilities instead of expected scores.
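To make the idea of Rasch-half-point thresholds concrete, the following Python sketch computes the expected score on a four-category scale under a rating scale model and locates the points where that expected score crosses 2.5, 3.5, and 4.5. The Rasch-Andrich threshold values in the sketch are illustrative assumptions, not the estimates from the sample analysis.

```python
# Minimal sketch: expected score on a four-category scale (scored 2-5) under
# a rating scale model, and the Rasch-half-point thresholds, i.e. the theta
# values at which the expected score equals 2.5, 3.5, and 4.5. The Rasch-
# Andrich thresholds below are illustrative, not the sample-data estimates.
import numpy as np

tau = np.array([-2.5, 0.0, 2.5])     # Rasch-Andrich thresholds (illustrative)
categories = np.array([2, 3, 4, 5])  # scale categories as scored

def expected_score(theta):
    # Category probabilities for a rating scale model (criterion and rater
    # locations absorbed into theta for simplicity).
    cum = np.concatenate(([0.0], np.cumsum(theta - tau)))
    p = np.exp(cum - cum.max())
    p /= p.sum()
    return float(np.dot(categories, p))

def half_point_threshold(target, lo=-10.0, hi=10.0):
    # Bisection on the monotone expected-score function.
    for _ in range(60):
        mid = (lo + hi) / 2
        if expected_score(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

for target in (2.5, 3.5, 4.5):
    print(target, round(half_point_threshold(target), 2))
```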

Rater measurement results: A closer look

The output from FACETS provides detailed measurement results for each individual element of each facet under study as well as summary statistics for each facet as a whole. Table 7.1 shows the measurement results for the rater facet. Due to space restrictions, this is an excerpt from the original, much more comprehensive FACETS table (the complete table is shown in the screenshot tutorial on the Companion website). In Table 7.1, raters appear in the order of their severity, from most severe (Rater 16) to most lenient (Rater 07). The average score assigned by each rater is shown in Column 3 (total score divided by total count). For example, the observed average for Rater 16 is 3.03. Note, however, that observed averages confound rater severity and test taker proficiency. For example, when a rater's observed average

Table 7.1  Excerpt From the FACETS Rater Measurement Report

Total score  Total count  Obsvd average  Fair (M) average  - Measure  Model S.E.  Infit MNSQ  Outfit MNSQ  Rater
182          60           3.03           3.00              2.40       0.30        0.93        0.80        16
383          123          3.11           3.08              2.09       0.20        0.82        0.74        13
445          129          3.45           3.15              1.83       0.18        1.10        1.09        14
301          84           3.58           3.32              1.21       0.22        1.39        1.43        15
478          141          3.39           3.32              1.21       0.17        0.81        0.79        09
415          123          3.37           3.37              1.05       0.19        1.12        1.06        05
268          72           3.72           3.60              0.29       0.23        0.89        0.87        04
202          57           3.54           3.63              0.16       0.26        0.75        0.75        11
492          141          3.49           3.64              0.14       0.18        1.05        1.07        08
366          102          3.59           3.65              0.09       0.20        1.11        1.08        06
240          63           3.81           3.73              −0.17      0.27        1.30        1.39        18
537          135          3.98           3.83              −0.57      0.18        0.81        0.83        17
476          132          3.61           3.94              −1.00      0.18        1.08        1.09        12
428          123          3.48           3.94              −1.02      0.19        1.02        0.99        10
295          72           4.10           3.99              −1.23      0.24        1.16        1.17        02
495          123          4.02           4.17              −2.01      0.19        0.82        0.74        03
271          60           4.52           4.23              −2.23      0.29        0.96        1.23        01
829          204          4.06           4.23              −2.24      0.15        0.94        0.92        07

Note: "- Measure" indicates that raters with higher measures assigned lower scores. From Introduction to Many-Facet Rasch Measurement: Analyzing and Evaluating Rater-Mediated Assessments (2nd ed., pp. 71–72), by T. Eckes, 2015, Frankfurt am Main, Germany: Peter Lang. Copyright 2015 by Peter Lang. Adapted with permission.

is markedly lower than other raters’ observed averages, this could be due to the rater’s high severity or the test takers’ low proficiency. The values that are shown in the fourth column (“Fair (M) Average”) resolve this problem: A fair average for a given rater adjusts the observed average for the average level of proficiency in the rater’s sample of test takers. For example, Rater 14 had an observed average of 3.45 and a fair average of 3.15 – a difference indicating a relatively high average level of test taker proficiency in his or her stack of performances; by contrast, Rater 10 (with almost the same observed average) had a much higher fair average (3.94), indicating a relatively low average proficiency level in this rater’s stack of performances. The next column gives the Rasch estimates of each rater’s severity in logits (“-Measure”). As a rough guideline, raters with severity estimates ≥ 1.0 logits, or ≤ −1.0 logits, may be considered “severe” or “lenient”, respectively. In the sample analysis, six raters turned out to be severe, another six raters were lenient, and again six raters were neither severe nor lenient (“average” or “normal”). However, there are no generally valid rules for this, because any rater severity classification should at least take into account (a) the assessment

purpose (e.g., placement or admission), (b) the communicative goal (e.g., rater feedback), and (c) the estimate's precision, given by the model-based standard error ("Model S.E."). Within the present context, precision refers to the extent to which the location of a given measure on the severity dimension is reproducible based on the same assessment procedure. Higher precision (lower standard error) implies greater confidence about the reported measure.

The columns headed by "Infit MNSQ" (mean square) and "Outfit MNSQ" present statistical indicators of the degree to which the raters used the rating scale consistently across test takers and criteria. These indicators are called rater fit statistics. Generally, rater fit statistics indicate the extent to which ratings provided by a given rater match the expected ratings that are generated by a particular MFRM model (for a detailed discussion of fit statistics in a three-facet measurement context, see Eckes, 2015). Rater infit is sensitive to unexpected ratings where the locations of rater j and the other elements involved are aligned with each other, that is, where the locations are close together on the measurement scale (e.g., within a range of about 0.5 logits). Rater outfit is sensitive to unexpected ratings where the latent variable locations of rater j and the locations of the other elements involved, such as test takers and criteria, are far apart from each other (e.g., separated by more than 1.0 logits). Thus, when an otherwise lenient rater assigns harsh ratings to a highly proficient test taker on a criterion of medium difficulty, this rater's outfit MNSQ index will increase.

Infit and outfit MNSQ statistics have an expected value of 1.0 and range from 0 to plus infinity (Linacre, 2002; Myford & Wolfe, 2003). Rater fit values greater than 1.0 indicate more variation than expected in the ratings; this kind of misfit is called underfit. By contrast, rater fit values less than 1.0 indicate less variation than expected, meaning that the ratings are too predictable or provide redundant information; this is called overfit. As a rule of thumb, Linacre (2002, 2018b) suggested 0.50 as a lower-control limit and 1.50 as an upper-control limit for infit and outfit. That is, Linacre considered mean square values in the range between 0.50 and 1.50 as "productive for measurement" or as indicative of "useful fit". Rating patterns (or raters producing the patterns) with fit values above 1.50 may be classified as "noisy" or "erratic", and those with fit values below 0.50 may be classified as "muted" (Engelhard, 2013; Linacre, 2018b). In the present analysis, there were no noisy or muted raters; that is, all raters exhibited "acceptable" rating patterns. Overall, this finding indicated a satisfactory degree of within-rater consistency.
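For readers who want to see the mechanics, the following Python sketch computes infit and outfit MNSQ for a single rater from observed ratings, model-expected ratings, and model variances. All input values are invented placeholders; an operational analysis would take these quantities from the MFRM estimation itself.

```python
# Minimal sketch of rater fit statistics: outfit MNSQ is the mean of squared
# standardized residuals; infit MNSQ is an information-weighted version that
# gives less weight to observations far from the rater's location. The
# observed ratings, expectations, and variances below are illustrative only.
import numpy as np

observed = np.array([3, 4, 5, 3, 2, 4, 3, 5])
expected = np.array([3.2, 3.9, 4.4, 3.5, 2.6, 3.8, 3.1, 4.6])
variance = np.array([0.7, 0.8, 0.6, 0.9, 0.7, 0.8, 0.7, 0.5])

residual = observed - expected
z2 = residual ** 2 / variance                  # squared standardized residuals

outfit_mnsq = z2.mean()
infit_mnsq = (residual ** 2).sum() / variance.sum()

print(f"Infit MNSQ = {infit_mnsq:.2f}, Outfit MNSQ = {outfit_mnsq:.2f}")
# Values near 1.0 suggest self-consistent scale use; Linacre's rule of thumb
# treats 0.50-1.50 as productive for measurement.
```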

Test taker and criterion measurement results

Considering test takers with non-extreme scores only (see Figure 7.7), Test Taker 208 had the highest proficiency estimate (7.80 logits, SE = 1.10), and Test Taker 305 had the lowest estimate (–7.13 logits, SE = 1.15). Obviously, the test takers differed markedly in their writing proficiency – a finding confirmed in later TestDaF examinations using similarly difficult writing tasks (see Eckes, 2005).

In the present example, test takers' observed or raw scores were computed as the average of ratings across the two (or three) raters involved. By contrast, test takers' adjusted or fair average scores were computed on the basis of the parameter estimates resulting from the MFRM analysis. Thus, similar to the computation of fair averages for raters, test taker fair scores compensate for severity differences between the raters. In other words, for each test taker, there is an expected score that would be obtained from a hypothetical rater with an average level of severity. The reference group for computing this average severity level is the total group of raters included in the analysis.

Table 7.2 presents an excerpt from the test taker measurement report provided by FACETS. Fair test taker scores, shown in the "Fair (M) Average" column, help illustrate the consequences that follow from raw scores taken at face value. For example, Test Taker 111 proved to be highly proficient (5.38 logits, SE = 0.81), and the six ratings he or she received showed satisfactory model fit (infit = 0.94, outfit = 0.95). Yet the observed average was 4.33. Using conventional rounding rules, the final score assigned would have been TDN 4. Figure 7.8, upper-left part of the diagram ("raw score approach"), displays this outcome. In contrast, the lower-left part ("fair score approach") shows that the same test taker's fair average was 4.84, yielding the final score TDN 5. The reason for the upward adjustment was that this test taker had happened to be rated by Raters 13 and 16, who, as we know from the analysis (Table 7.1), were the two most severe raters in the group. Given that both raters provided consistent ratings, it can be concluded that these raters strongly underestimated the writing

Table 7.2  Excerpt From the FACETS Test Taker Measurement Report

Total Score  Total Count  Obsvd Average  Fair (M) Average  + Measure  Model S.E.  Infit MNSQ  Outfit MNSQ  Test Taker
26           6            4.33           4.84              5.38       0.81        0.94        0.95        111
26           6            4.33           4.59              4.12       0.83        1.24        1.23        091
25           6            4.17           4.24              2.79       0.81        0.23        0.23        176
27           6            4.50           4.12              2.31       0.83        0.91        0.85        059
23           6            3.83           3.88              1.29       0.79        0.70        0.70        213
24           6            4.00           3.58              0.18       0.81        1.88        1.87        234
21           6            3.50           3.44              −0.27      0.77        1.14        1.19        179
23           6            3.83           3.39              −0.46      0.80        0.39        0.36        230
19           6            3.17           3.16              −1.24      0.78        0.61        0.59        301
19           6            3.17           3.04              −1.69      0.78        1.07        1.09        052
19           6            3.17           3.02              −1.78      0.78        1.16        1.16        198
17           6            2.83           2.78              −2.67      0.78        0.40        0.39        149

Note: "+ Measure" indicates that test takers with higher measures received higher scores. From Introduction to Many-Facet Rasch Measurement: Analyzing and Evaluating Rater-Mediated Assessments (2nd ed., pp. 96–97), by T. Eckes, 2015, Frankfurt am Main, Germany: Peter Lang. Copyright 2015 by Peter Lang. Adapted with permission.


Figure 7.8 Illustration of the MFRM score adjustment procedure. Relative to the final score based on the observed average, Test Taker 111 (left panel) receives an upward adjustment, whereas Test Taker 230 (right panel) receives a downward adjustment.

proficiency of Test Taker 111. This underestimation was compensated for by the MFRM analysis. The right panel (Figure 7.8) illustrates the opposite case. Test Taker 230 was assigned the same final score (TDN 4) as Test Taker 111, based on the observed average computed for the ratings provided by Raters 07 and 10 (“raw score approach”). Yet these two raters were among the most lenient raters in the group (Table 7.1). That is, both raters overestimated the writing proficiency of that test taker, as compared to the other raters. This overestimation was compensated for by the MFRM analysis, leading to a downward adjustment and a final score of TDN 3 (“fair score approach”). Table 7.3 presents an excerpt from the measurement results for the criterion facet. The criterion difficulty estimates indicate that receiving a high score on global impression was much easier than on linguistic realization or task fulfillment. Due to

Table 7.3  Excerpt From the FACETS Criterion Measurement Report

Total score  Total count  Obsvd average  Fair (M) average  - Measure  Model S.E.  Infit MNSQ  Outfit MNSQ  Criterion
2288         648          3.53           3.52              0.53       0.08        0.97        0.96        LR
2303         648          3.55           3.55              0.43       0.08        1.10        1.07        TF
2512         648          3.88           3.93              −0.97      0.08        0.90        0.91        GI

Note: "- Measure" indicates that criteria with higher measures were given lower scores. LR = linguistic realization, TF = task fulfillment, GI = global impression. From Introduction to Many-Facet Rasch Measurement: Analyzing and Evaluating Rater-Mediated Assessments (2nd ed., p. 113), by T. Eckes, 2015, Frankfurt am Main, Germany: Peter Lang. Copyright 2015 by Peter Lang. Adapted with permission.

the relatively large number of responses used for estimating each difficulty measure (total count per criterion was 648), the measurement precision was very high. For each criterion, MNSQ fit indices stayed well within very narrow quality control limits (i.e., 0.90 and 1.10). This is evidence supporting the assumption of unidimensional measurement as implied by the Rasch model. That is, these criteria worked together to define a single latent dimension.

Separation statistics

For each facet included in the model, FACETS provides statistical indicators that summarize the spread of the element measures along the latent variable (Eckes, 2015; Myford & Wolfe, 2003). These indicators are called separation statistics. Table 7.4 presents three commonly used separation statistics. The separation statistics come with interpretations that are specific to the facet they refer to. For example, the test taker separation reliability may be considered comparable to coefficient (or Cronbach's) alpha: it provides information about how well we can differentiate among the test takers in terms of their level of proficiency (in the present sample analysis, R = 0.91). Regarding the rater facet, the meaning of this statistic is critically different. Viewed from the perspective of the standard approach to rater effects, low rater separation reliability is desirable, because this would indicate that raters were approaching the ideal of being interchangeable. By contrast, when raters within a group possess a highly dissimilar degree of severity, rater separation reliability will be close to 1. In the present analysis, rater separation reliability was as high as 0.98, attesting to a marked heterogeneity of severity measures, thereby indicating that interrater agreement was very low overall.
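The following Python sketch shows how the three separation statistics can be computed from a set of element measures and their standard errors, following the formulas summarized in Table 7.4. The measures and standard errors used here are illustrative values only, so the resulting G, H, and R do not reproduce the figures reported for the sample analysis.

```python
# Minimal sketch of the separation statistics: the "true" SD of the measures
# is the observed SD with the average error variance removed; G is true SD
# over RMSE, and H and R follow from G (see Table 7.4). The measures and
# standard errors below are illustrative placeholders.
import numpy as np

measures = np.array([2.40, 2.09, 1.83, 1.21, 0.29, -0.57, -1.02, -2.24])
se       = np.array([0.30, 0.20, 0.18, 0.22, 0.23, 0.18, 0.19, 0.15])

rmse = np.sqrt(np.mean(se ** 2))                      # root mean square error
sd_obs = measures.std(ddof=0)                         # observed SD of measures
sd_true = np.sqrt(max(sd_obs ** 2 - rmse ** 2, 0.0))  # "true" SD

G = sd_true / rmse                                    # separation ratio
H = (4 * G + 1) / 3                                   # separation (strata) index
R = G ** 2 / (1 + G ** 2)                             # separation reliability

print(f"G = {G:.2f}, H = {H:.2f}, R = {R:.2f}")
```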

Rating scale use and functioning

The preceding discussion provided a first insight into the analytic potential of the MFRM approach. Of course, MFRM includes many more procedures and statistics that are relevant for an in-depth study of rater-mediated assessments. To illustrate, one set of such procedures concerns the psychometric quality of the rating scale.

Table 7.4  Separation Statistics and Facet-Specific Interpretations

Separation ratio (G). Formula: G = SDt / RMSE.
  Test takers: The spread of the test takers' proficiency measures relative to their precision.
  Raters: The spread of the rater severity measures relative to their precision.
  Criteria: The spread of the criterion difficulty measures relative to their precision.

Separation (strata) index (H). Formula: H = (4G + 1) / 3.
  Test takers: The number of statistically distinct classes of test takers; if an assessment aims at separating test takers into X classes, H should be at least as great as X.
  Raters: The number of statistically distinct classes of raters; if an assessment requires raters to exercise a highly similar degree of severity, H should be close to 1.
  Criteria: The number of statistically distinct classes of criteria; if H is large, the set of criteria covers a wide range of difficulty.

Separation reliability (R). Formula: R = G² / (1 + G²).
  Test takers: The overall precision of test taker proficiency measures; roughly similar to coefficient alpha or KR-20 (high values, i.e., R ≥ 0.80, are generally desirable).
  Raters: The overall precision of rater severity measures; if an assessment requires raters to be interchangeable, R should be close to 0.
  Criteria: The overall precision of criterion difficulty measures; if an assessment requires the criteria to be similarly difficult, R should be close to 0.

Note: SDt = "true" standard deviation of measures, RMSE = root mean square measurement error. R can be converted into G using the equation G = (R / (1 – R))½.

A question of importance is: Did the raters interpret and use the scale categories as intended, or did they conceive of the categories in their own idiosyncratic way, more or less deviating from the scoring rubric? For example, some raters may tend to overuse the middle category of a rating scale while assigning fewer scores at both the high and low ends of the scale, thus exhibiting a central tendency (or centrality) effect. Others may show the opposite tendency and use the rating scale more in terms of a dichotomy, distinguishing only between very high and very low levels of performance (extremity effect). FACETS output provides category statistics that help answer questions like these. At the level of individual raters, one of these statistics refers to the average absolute distance between (or the standard deviation of) Rasch-Andrich thresholds that were estimated by means of a rater-related partial credit model (PCM). A large value of this index would suggest that a particular rater tended to include a wide range of test taker proficiency levels in the middle category (or categories) of the rating scale, indicating a centrality effect (Wolfe & Song, 2015; Wu, 2017).

More specifically, when there is a reason to assume that at least some raters had a unique way of using the rating scale, the model discussed earlier (see Figure 7.5) may no longer be appropriate. To aid users in deciding on the kind of model (RSM or PCM), Linacre (2000) provided some guidelines as follows: (a) a PCM is indicated when the rating scale structure is likely to vary between elements of a given facet (e.g., when raters have clearly different views of scale categories or when the number of scale categories differs across criteria), (b) a PCM is suitable only when there are at least 10 observations in each scale category (this is to ensure sufficient stability of estimated parameters), and (c) a PCM may be preferred only when differences in PCM and RSM parameters are substantial (e.g., exceeding a difference of about 0.5 logits); otherwise the gain in knowledge from a PCM analysis would not outweigh the increased complexity of the results.

Raters' differential use of the rating scale is just one instance of a wide variety of rater effects, including rater bias or DRF (already discussed) and halo effects, as when raters tend to provide highly similar ratings of a test taker's performance on conceptually distinct criteria (e.g., Myford & Wolfe, 2004). More recently, studies of rater effects have put forth a number of promising indicators that can be used in conjunction with MFRM modeling (or related) approaches to further explore the quality of rater-mediated assessments (e.g., Wind & Schumacker, 2017; Wolfe & Song, 2015; Wu, 2017).
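As a small illustration of the threshold-based centrality index described above, the following Python sketch takes hypothetical Rasch-Andrich threshold estimates from a rater-related PCM and reports, for each rater, the standard deviation of the thresholds and the average absolute distance between adjacent thresholds. The threshold values are invented for illustration and do not come from the sample analysis.

```python
# Minimal sketch of a threshold-based centrality check: a large spread of a
# rater's Rasch-Andrich thresholds suggests that the middle categories cover
# a wide proficiency range (a possible centrality effect). The per-rater
# threshold estimates below are hypothetical.
import numpy as np

pcm_thresholds = {
    "R1": [-2.0, 0.1, 2.1],    # ordinary scale use
    "R2": [-4.5, 0.0, 4.4],    # wide middle categories -> centrality suspect
    "R3": [-0.6, 0.1, 0.5],    # narrow categories -> possible extremity
}

for rater, taus in pcm_thresholds.items():
    taus = np.array(taus, dtype=float)
    spread = taus.std(ddof=0)                    # SD of the thresholds
    mean_gap = np.abs(np.diff(taus)).mean()      # average absolute distance
    print(f"{rater}: threshold SD = {spread:.2f}, mean gap = {mean_gap:.2f}")
```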

Advances in applied MFRM research

Over recent years, much progress has been made in terms of incorporating MFRM into research and theorizing on rater-mediated assessments. Approaches illustrating this development include the conceptualization of MFRM as a component within an argument-based framework of validation (Knoch & Chapelle, 2018) or the combination of MFRM with qualitative techniques in mixed-methods research designs (Yan, 2014). MFRM has also been used increasingly in research focusing on human categorization, inference, and decision making in assessment contexts – a field of research called rater cognition (Bejar, 2012). Specifically, rater cognition refers to mental activities that raters engage in when evaluating performances, such as identifying relevant performance features, aligning these features with a scoring rubric, and deciding on a score that suitably reflects the quality of the performance. Viewed from an information-processing perspective in rater cognition research, relevant questions are: What kind of cognitive categories and processes are involved in assigning ratings, how are these categories activated, and which factors have an impact on the ensuing processes? Viewed more from a rater-differences perspective, typical questions include the following: Do systematic differences exist between raters regarding the cognitive underpinnings of the rating process; for example, are there differences in decision-making strategies or in criterion-weighting patterns? What are the consequences of these differences, and what factors account for their occurrence? MFRM studies have contributed

to answering these questions and thus helped shape the field (Baker, 2012; Eckes, 2008, 2012; Engelhard, Wang, & Wind, 2018; Schaefer, 2016; Zhang, 2016).

Controversial issues

The popularity of MFRM in rater-mediated language assessment notwithstanding, there has also been some controversy over certain methodological issues. One of these issues refers to the assumption that elements within facets are locally independent. Adopting a Rasch perspective on rater behavior, in particular, each score that a rater awards to an examinee is assumed to provide an independent piece of information about the examinee's proficiency. Thus, raters basically have the same theoretical status as test items: increasing the number of raters assigning ratings to the same performance yields increasingly more measurement information (Linacre, 2003). Critics have argued that this approach ignores the multilevel, hierarchical structure of rating data: Observed ratings, in their view, are not directly related to proficiency; rather, they indicate the latent "true" category a given performance belongs to ("ideal rating"). This true or ideal category in turn serves as an indicator of proficiency (e.g., DeCarlo, Kim, & Johnson, 2011; for more detail on these diverging views, see Casabianca, Junker, & Patz, 2016; Robitzsch & Steinfeld, 2018). Note also that, using FACETS, observed and expected rater agreement statistics as well as the summary Rasch-kappa index (Eckes, 2015; Linacre, 2018b) facilitate a rough check on the overall degree of local dependence among raters.

Another issue concerns the problem of correctly specifying the MFRM model to be used in a given assessment situation. Usually, the set of facets that can be assumed to have an impact on the ratings is known to the researcher. However, there are situations in which neither theoretical expertise nor prior knowledge is available to inform the identification of possibly influential facets. Even when the MFRM study is conducted within a familiar context, relevant facets may remain undetected for some reason or another, and, therefore, the model fails to be correctly specified. Yet unidentified or hidden facets sometimes indirectly manifest themselves by low data–model fit, for example, by large (absolute) values of standardized residuals associated with particular combinations of facet elements.

The method of parameter estimation implemented in FACETS is joint maximum likelihood (JML). This method has been criticized for providing asymptotically inconsistent parameter estimates. However, the JML estimation bias is of minor concern in most measurement situations, particularly in situations involving more than two facets. In fact, JML estimation has many advantages in terms of practical applicability. For example, JML can be used with a wide variety of rating designs that may contain lots of missing data. Also, this estimation method allows the researcher to define a number of different Rasch models, incorporating different scoring formats, in one and the same analysis. Finally, all elements of all facets are treated in exactly the same way regarding the computation of parameter values, standard errors, and fit statistics (for a detailed discussion of estimation methods for rater models, see Robitzsch & Steinfeld, 2018).


Conclusion

Discussing human ratings in oral language assessment, Reed and Cohen (2001) pointed out that "a rater must struggle under the weight of the many factors that are known to cast influence over the outcome of a particular assessment. Thus, in a sense, when evaluating a language performance, the rater, too, must perform" (p. 87). The MFRM approach offers a rich set of concepts, tools, and techniques to evaluate the quality of both types of performances: the performance of raters on the complex task of rating and the performance of test takers on writing or speaking tasks. In addition, MFRM puts researchers, assessment providers, and exam boards in a position to report measures or scores that are as free as possible of the special conditions prevailing in the assessment situation, thereby increasing trust in the fairness and validity of the assessment outcomes.

MFRM has changed the way we go about analyzing and evaluating rater-mediated assessments as well as the way we think about what raters can do and what raters should do. Thus, MFRM has led to a profound reconceptualization of the "ideal rater": Whereas the standard approach emphasizes rater group homogeneity with a goal of making raters agree with each other exactly and, therefore, function interchangeably, MFRM expects raters to disagree with each other on details and encourages raters to maintain their individual level of severity or leniency. From a Rasch measurement perspective, the ideal rater acts as an independent expert, using the rating scale in a self-consistent manner across test takers, tasks, and criteria or whatever facets may be involved in an assessment.

Notes

1 For a review of the TestDaF and its assessment context, see Norris and Drackert (2018); see also www.testdaf.de.
2 Regardless of the limited number of responses, MINIFAC has the complete FACETS functionality. MINIFAC is available as a free download at the following address: www.winsteps.com/minifac.htm

References

Andrich, D. (1978). A rating formulation for ordered response categories. Psychometrika, 43, 561–573.
Andrich, D. (1998). Thresholds, steps and rating scale conceptualization. Rasch Measurement Transactions, 12, 648–649.
Aryadoust, V. (2016). Gender and academic major bias in peer assessment of oral presentations. Language Assessment Quarterly, 13, 1–24.
Baker, B. A. (2012). Individual differences in rater decision-making style: An exploratory mixed-methods study. Language Assessment Quarterly, 9, 225–248.
Barkaoui, K. (2014). Multifaceted Rasch analysis for test evaluation. In A. J. Kunnan (Ed.), The companion to language assessment: Evaluation, methodology, and interdisciplinary themes (Vol. 3, pp. 1301–1322). Chichester: Wiley.
Bejar, I. I. (2012). Rater cognition: Implications for validity. Educational Measurement: Issues and Practice, 31(3), 2–9.

Bonk, W. J., & Ockey, G. J. (2003). A many-facet Rasch analysis of the second language group oral discussion task. Language Testing, 20, 89–110.
Brennan, R. L. (2011). Generalizability theory and classical test theory. Applied Measurement in Education, 24, 1–21.
Carr, N. T. (2011). Designing and analyzing language tests. Oxford: Oxford University Press.
Casabianca, J. M., Junker, B. W., & Patz, R. J. (2016). Hierarchical rater models. In W. J. van der Linden (Ed.), Handbook of item response theory (Vol. 1, pp. 449–465). Boca Raton, FL: Chapman & Hall/CRC.
Coniam, D. (2010). Validating onscreen marking in Hong Kong. Asia Pacific Education Review, 11, 423–431.
Curcin, M., & Sweiry, E. (2017). A facets analysis of analytic vs. holistic scoring of identical short constructed-response items: Different outcomes and their implications for scoring rubric development. Journal of Applied Measurement, 18, 228–246.
DeCarlo, L. T., Kim, Y. K., & Johnson, M. S. (2011). A hierarchical rater model for constructed responses, with a signal detection rater model. Journal of Educational Measurement, 48, 333–356.
Eckes, T. (2005). Examining rater effects in TestDaF writing and speaking performance assessments: A many-facet Rasch analysis. Language Assessment Quarterly, 2, 197–221.
Eckes, T. (2008). Rater types in writing performance assessments: A classification approach to rater variability. Language Testing, 25, 155–185.
Eckes, T. (2012). Operational rater types in writing assessment: Linking rater cognition to rater behavior. Language Assessment Quarterly, 9, 270–292.
Eckes, T. (2015). Introduction to many-facet Rasch measurement: Analyzing and evaluating rater-mediated assessments (2nd ed.). Frankfurt am Main: Peter Lang.
Eckes, T. (2017). Rater effects: Advances in item response modeling of human ratings – Part I (Guest Editorial). Psychological Test and Assessment Modeling, 59(4), 443–452.
Elder, C., Barkhuizen, G., Knoch, U., & von Randow, J. (2007). Evaluating rater responses to an online training program for L2 writing assessment. Language Testing, 24, 37–64.
Engelhard, G. (2013). Invariant measurement: Using Rasch models in the social, behavioral, and health sciences. New York, NY: Routledge.
Engelhard, G., Wang, J., & Wind, S. A. (2018). A tale of two models: Psychometric and cognitive perspectives on rater-mediated assessments using accuracy ratings. Psychological Test and Assessment Modeling, 60(1), 33–52.
Engelhard, G., & Wind, S. A. (2018). Invariant measurement with raters and rating scales: Rasch models for rater-mediated assessments. New York, NY: Routledge.
Hsieh, M. (2013). An application of multifaceted Rasch measurement in the Yes/No Angoff standard setting procedure. Language Testing, 30, 491–512.
Jeong, H. (2017). Narrative and expository genre effects on students, raters, and performance criteria. Assessing Writing, 31, 113–125.
Johnson, R. L., Penny, J. A., & Gordon, B. (2009). Assessing performance: Designing, scoring, and validating performance tasks. New York, NY: Guilford.
Knoch, U., & Chapelle, C. A. (2018). Validation of rating processes within an argument-based framework. Language Testing, 35, 477–499.
Knoch, U., Read, J., & von Randow, J. (2007). Re-training writing raters online: How does it compare with face-to-face training? Assessing Writing, 12, 26–43.

Lamprianou, I. (2006). The stability of marker characteristics across tests of the same subject and across subjects. Journal of Applied Measurement, 7, 192–205.
Lane, S., & Iwatani, E. (2016). Design of performance assessments in education. In S. Lane, M. R. Raymond, & T. M. Haladyna (Eds.), Handbook of test development (2nd ed., pp. 274–293). New York, NY: Routledge.
Lee, Y.-W., & Kantor, R. (2015). Investigating complex interaction effects among facet elements in an ESL writing test consisting of integrated and independent tasks. Language Research, 51(3), 653–678.
Li, H. (2016). How do raters judge spoken vocabulary? English Language Teaching, 9(2), 102–115.
Linacre, J. M. (1989). Many-facet Rasch measurement. Chicago, IL: MESA Press.
Linacre, J. M. (2000). Comparing "Partial Credit Models" (PCM) and "Rating Scale Models" (RSM). Rasch Measurement Transactions, 14, 768.
Linacre, J. M. (2002). What do infit and outfit, mean-square and standardized mean? Rasch Measurement Transactions, 16, 878.
Linacre, J. M. (2003). The hierarchical rater model from a Rasch perspective. Rasch Measurement Transactions, 17, 928.
Linacre, J. M. (2006). Demarcating category intervals. Rasch Measurement Transactions, 19, 1041–1043.
Linacre, J. M. (2010). Transitional categories and usefully disordered thresholds. Online Educational Research Journal, 1(3).
Linacre, J. M. (2018a). Facets Rasch measurement computer program (Version 3.81) [Computer software]. Chicago, IL: Winsteps.com.
Linacre, J. M. (2018b). A user's guide to FACETS: Rasch-model computer programs. Chicago, IL: Winsteps.com. Retrieved from www.winsteps.com/facets.htm
Looney, M. A. (2012). Judging anomalies at the 2010 Olympics in men's figure skating. Measurement in Physical Education and Exercise Science, 16, 55–68.
Masters, G. N. (1982). A Rasch model for partial credit scoring. Psychometrika, 47, 149–174.
McNamara, T., & Knoch, U. (2012). The Rasch wars: The emergence of Rasch measurement in language testing. Language Testing, 29, 555–576.
Mulqueen, C., Baker, D. P., & Dismukes, R. K. (2002). Pilot instructor rater training: The utility of the multifacet item response theory model. International Journal of Aviation Psychology, 12, 287–303.
Myford, C. M., & Wolfe, E. W. (2003). Detecting and measuring rater effects using many-facet Rasch measurement: Part I. Journal of Applied Measurement, 4, 386–422.
Myford, C. M., & Wolfe, E. W. (2004). Detecting and measuring rater effects using many-facet Rasch measurement: Part II. Journal of Applied Measurement, 5, 189–227.
Norris, J., & Drackert, A. (2018). Test review: TestDaF. Language Testing, 35, 149–157.
Randall, J., & Engelhard, G. (2009). Examining teacher grades using Rasch measurement theory. Journal of Educational Measurement, 46, 1–18.
Rasch, G. (1980). Probabilistic models for some intelligence and attainment tests. Chicago, IL: University of Chicago Press (Original work published 1960).
Reed, D. J., & Cohen, A. D. (2001). Revisiting raters and ratings in oral language assessment. In C. Elder et al. (Eds.), Experimenting with uncertainty: Essays in honour of Alan Davies (pp. 82–96). Cambridge: Cambridge University Press.

Robitzsch, A., & Steinfeld, J. (2018). Item response models for human ratings: Overview, estimation methods and implementation in R. Psychological Test and Assessment Modeling, 60(1), 101–138.
Schaefer, E. (2016). Identifying rater types among native English-speaking raters of English essays written by Japanese university students. In V. Aryadoust & J. Fox (Eds.), Trends in language assessment research and practice: The view from the Middle East and the Pacific Rim (pp. 184–207). Newcastle upon Tyne: Cambridge Scholars.
Springer, D. G., & Bradley, K. D. (2018). Investigating adjudicator bias in concert band evaluations: An application of the many-facets Rasch model. Musicae Scientiae, 22, 377–393.
Till, H., Myford, C., & Dowell, J. (2013). Improving student selection using multiple mini-interviews with multifaceted Rasch modeling. Academic Medicine, 88, 216–223.
Wang, J., Engelhard, G., Raczynski, K., Song, T., & Wolfe, E. W. (2017). Evaluating rater accuracy and perception for integrated writing assessments using a mixed-methods approach. Assessing Writing, 33, 36–47.
Wilson, M. (2011). Some notes on the term: "Wright map". Rasch Measurement Transactions, 25, 1331.
Wind, S. A., & Peterson, M. E. (2018). A systematic review of methods for evaluating rating quality in language assessment. Language Testing, 35, 161–192.
Wind, S. A., & Schumacker, R. E. (2017). Detecting measurement disturbances in rater-mediated assessments. Educational Measurement: Issues and Practice, 36(4), 44–51.
Winke, P., Gass, S., & Myford, C. (2013). Raters' L2 background as a potential source of bias in rating oral performance. Language Testing, 30, 231–252.
Wolfe, E. W., & Song, T. (2015). Comparison of models and indices for detecting rater centrality. Journal of Applied Measurement, 16(3), 228–241.
Wright, B. D., & Masters, G. N. (1982). Rating scale analysis. Chicago, IL: MESA Press.
Wright, B. D., & Stone, M. H. (1979). Best test design. Chicago, IL: MESA Press.
Wu, M. (2017). Some IRT-based analyses for interpreting rater effects. Psychological Test and Assessment Modeling, 59(4), 453–470.
Yan, X. (2014). An examination of rater performance on a local oral English proficiency test: A mixed-methods approach. Language Testing, 31, 501–527.
Yen, W. M., & Fitzpatrick, A. R. (2006). Item response theory. In R. L. Brennan (Ed.), Educational measurement (4th ed., pp. 111–153). Westport, CT: American Council on Education/Praeger.
Zhang, J. (2016). Same text different processing? Exploring how raters' cognitive and meta-cognitive strategies influence rating accuracy in essay scoring. Assessing Writing, 27, 37–53.

Section III

Univariate and multivariate statistical analysis

8

Analysis of differences between groups
The t-test and the analysis of variance (ANOVA) in language assessment

Tuğba Elif Toprak

Introduction

There are many occasions on which language testers may need to examine the differences between two or more groups and determine whether these differences are meaningful or simply have arisen due to chance (i.e., sampling error and/or measurement error). There are two basic statistical techniques that can be utilized by language testers to examine the presence and magnitude of differences between groups. The first one is the t-test, which is used to examine the differences between the means (averages) of two sets of scores. The t-test is also known as Student's t-test and has two types: the independent samples t-test and the dependent samples t-test. The second statistical test that is commonly used for detecting the differences between groups is the analysis of variance (ANOVA). ANOVA can be viewed as an extension of the t-test and examines the differences between more than two groups. Both the t-test and ANOVA allow language testers to identify the differences between conditions, groups, or sets of scores in the variable/s of interest.

ANOVA and the t-test are examples of parametric tests that depend on the assumptions of certain parameters related to the population from which the sample is drawn. Since the properties of populations are called parameters and both the t-test and ANOVA are used to test researchers' hypotheses about population parameters, these tests are referred to as parametric tests (Argyrous, 1996). Thus, similar to all parametric tests, certain assumptions need to be satisfied before the use of the t-test and ANOVA. These assumptions specify that:

i    the latent variable of interest needs to be measured using an interval- or ratio-level scale;
ii   the population from which the sample is drawn needs to be normally distributed;
iii  the assumption of homogeneity of variances needs to be met. This assumption postulates that the variances of the samples are approximately equal; and
iv   the data should not contain any outliers, which can be defined as the extreme values that fall far from other values (Agresti & Finlay, 2009).

If these assumptions cannot be satisfied, language testers may resort to using non-parametric tests. Another issue that needs to be taken into account while opting for either parametric or non-parametric tests is the data type1 (i.e., nominal, ordinal, interval, and ratio), since treating ordinal data as interval data without scrutinizing the data and the purpose of the analysis can mislead the results (Allen & Seaman, 2007). Although parametric tests are more powerful than their non-parametric counterparts and may provide language testers with an increased chance of detecting significant differences between the conditions, non-parametric tests do not make assumptions about the shape of the population distributions. In other words, they may prove useful when the data are highly skewed and not normally distributed. Moreover, they can be employed in conditions in which the data contain outliers and the sample size is small. The non-parametric equivalent of the paired samples t-test is the Wilcoxon signed-rank test, which uses the rankings of scores and compares the sample mean ranks. The Mann-Whitney U test, on the other hand, is the non-parametric equivalent of the independent samples t-test and counts the number of pairs for which the score from the first group is higher. When it comes to the non-parametric equivalent of ANOVA, language testers can apply the Kruskal-Wallis H test, which compares the medians instead of the means.

These inferential tests have widespread use in the field of language testing and assessment, since language testers are often interested in comparing multiple groups of examinees, sets of scores, and conditions and further examining the differences between them, if any. Inferential tests are a set of statistical tools that are used to make inferences about a population based on the sample data. To illustrate, language assessment researchers have applied, for example, a paired samples t-test to compare different forms of a vocabulary test (e.g., Beglar & Hunt, 1999) and understand the impact of bilingual dictionary use on writing test scores (e.g., East, 2007); a one-way ANOVA to examine the differences among several proficiency groups on a computer-based grammatical ability test (Chapelle, Chung, Hegelheimer, Pendar, & Xu, 2010); and a within-participants or repeated measures ANOVA to determine the impact of training on raters' scoring behavior (e.g., Davis, 2016).

Given the widespread uses of the t-test and ANOVA in language assessment, understanding their properties, statistical assumptions, uses, and misuses constitutes a substantial part of language testers' statistical knowledge and skills and helps language testers enrich their statistical toolbox. Recognizing the importance of these inferential tests, this chapter provides readers with an introduction to the t-test, ANOVA, and their special cases and non-parametric equivalents. It elaborates on purposes and uses of the tests in language testing and assessment literature by reviewing concrete examples from the field. Moreover, it showcases how these statistical tests can be applied to language assessment data by reporting on two separate studies conducted in operational testing settings.
The first study examines the differences between two groups of students on a pronunciation test and investigates the impact of an alternative treatment for listening and pronunciation on students’ test performances by using an independent samples t-test and its non-parametric equivalent, Mann-Whitney U test. The second study

investigates whether there are meaningful differences among the performances of four groups of students on a second-language (L2) reading test by using ANOVA procedures. The next section elaborates on the conceptual and statistical foundations of the t-test and ANOVA. Moreover, it illustrates for what purposes these statistical tests have been utilized in the field of language testing and assessment.

The t-test

The t-test was developed by William Gossett, a scientist who worked for Guinness and published his work using the pseudonym Student. That is why the t-test is sometimes referred to as Student's t in some sources. The t-test is a statistical method that evaluates whether there is a statistically significant difference between the means of two samples and whether the difference is large enough to have been caused by the independent variable (IV). An independent variable (IV) is the variable that is manipulated or controlled in a study to examine the impact on the dependent variable (DV). The dependent variable is the variable that is tested and measured and is dependent on the independent variable. The difference between the means is measured by the t statistic, which is larger when the observed difference between the groups is greater. In contrast, if the difference between the groups is small, the observed difference may have arisen simply due to chance, and the IV of interest has not impacted the dependent variable (DV). In such a case, we need to accept the null hypothesis (H0), which proposes that the differences between the groups occurred due to chance and that there are no differences between the population means. On the other hand, if the difference between the groups is large and significant, we can reject the null hypothesis and accept the alternative hypothesis (notated as H1 or Ha), which states that the manipulation of our IV accounts for the outcome of the observed differences and, accordingly, there are differences between the population means.

Rejecting or accepting the null hypothesis is decided by evaluating the level of significance (α) or the p value of the statistical test. The p value and t statistic are linked in that the greater the t statistic, the stronger the evidence against the null hypothesis, which states that there is no significant difference between the groups. The p value determines the likelihood of obtaining (non-)significant differences between the two groups. If the level of significance is small (p < 0.05 or 0.01), then the null hypothesis is rejected, but if it is larger than 0.05 or 0.01, the null hypothesis is supported (Ott & Longnecker, 2001). A p value of 0.001 means that there is only one chance in a thousand of this outcome occurring due to sampling error, given that the null hypothesis is true.

To illustrate, Knoch and Elder (2010) investigated the difference between 30 students' academic writing scores performed under two time conditions: 55 minutes and 30 minutes. As an instrument, the authors used a writing task that was an argument essay asking students to respond to a brief prompt in the form of a question or statement. They used an analytic rating scale with three

rating categories (fluency, content, and form) rated on six band levels ranging from 1 to 6. The authors used a paired samples t-test to analyze the scores and found no significant difference between the two performance conditions on any of the rating criteria (p = 0.899, fluency; p = 0.125, content rating; p = 0.293, form; p = 0.262, total score). Since the p values obtained on all criteria were larger than 0.05, the authors concluded that students' scores did not differ under the long (55-min.) and short (30-min.) time conditions. In other words, there was no statistical evidence showing that different time conditions affected students' test performances.
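A minimal Python sketch of this kind of comparison is given below, using scipy's paired samples t-test on invented scores for ten test takers under a long and a short time condition; the numbers are not the Knoch and Elder (2010) data.

```python
# Minimal sketch of a paired samples t-test: the same 10 test takers rated
# under a long and a short time condition. The scores are invented for
# illustration only.
import numpy as np
from scipy import stats

long_condition  = np.array([4.0, 3.5, 4.5, 3.0, 5.0, 4.0, 3.5, 4.5, 4.0, 3.0])
short_condition = np.array([3.5, 3.5, 4.0, 3.0, 5.0, 4.5, 3.0, 4.5, 4.0, 3.5])

t_stat, p_value = stats.ttest_rel(long_condition, short_condition)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
# If p >= 0.05, the null hypothesis of no difference between the two time
# conditions is retained, as in the study summarized above.
```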

Properties of the t-test

There are two types of the t-test used in the literature: the independent samples t-test, which is also called the between-participants or unrelated design t-test, and the paired samples t-test, which is also referred to as the related, within-participants, repeated measures, or dependent samples t-test. For consistency throughout the chapter, the terms "the independent samples t-test" and "the paired samples t-test" are used. The independent samples t-test is used when participants perform only in one of the conditions. On the other hand, the paired samples t-test is used when the participants perform in both conditions. Typically, the paired samples t-test is used in longitudinal studies observing a group of subjects at two points of time and in experimental studies in which repeated measures are obtained from the same subjects (Davis & Smith, 2005).

Another distinction between these two designs is that the paired samples t-test yields a larger t value and a smaller p value when compared to its independent samples counterpart. This is due to the research design employed in the paired samples t-test, where the same participants perform in both conditions. Hence, the paired samples t-test has several advantages over the independent design, such as controlling for potential sources of bias that may stem from participants' differences. In a paired design, employing the same participants across conditions may lead to a decrease in within-participants variance and increase the power of the test, thereby yielding a larger t value and a smaller p value. Power can be defined as the ability to detect a significant effect, if there is one, and the ability to reject the null hypothesis if it is false (Dancey & Reidy, 2011).

Once a t value and its associated p value are obtained, it becomes possible to calculate an effect size (d) to express the magnitude of the difference between the conditions. Cohen's d index denotes the distance between the means of two groups in terms of standard deviations and can be computed by using the formula:

d = (X1 − X2) / MSD    (8.1)

where X1 and X2 are the means for the first and second samples and MSD represents the mean of the standard deviations for the samples. d is expressed in standard deviations and will be small if there is a large overlap between the two groups. If, on the contrary, the overlap is small, d will be relatively larger. According to Cohen (1988), effect sizes can be classified as small, d = .2, medium, d = .5, and large, d = .8.
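The effect size in Equation 8.1 is straightforward to compute in code. The short Python sketch below applies the formula to two invented sets of scores; the mean of the two sample standard deviations serves as MSD.

```python
# Minimal sketch of the effect size in Equation 8.1: the mean difference
# divided by the mean of the two standard deviations. Group scores are
# invented for illustration.
import numpy as np

group1 = np.array([72, 68, 75, 80, 66, 74, 71, 77])
group2 = np.array([65, 70, 62, 68, 64, 69, 66, 63])

m_sd = (group1.std(ddof=1) + group2.std(ddof=1)) / 2   # mean of the SDs
d = (group1.mean() - group2.mean()) / m_sd

print(f"d = {d:.2f}")   # ~.2 small, ~.5 medium, ~.8 large (Cohen, 1988)
```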

Non-parametric equivalents of the t-test

As previously mentioned, the t-test is a parametric test, and its use requires fulfillment of certain assumptions. For example, opting for non-parametric tests would be appropriate in cases in which the data have extreme outliers or are not normally distributed, the sample includes fewer than 30 observations, and groups do not have equal variances (Agresti & Finlay, 2009). Non-parametric tests accommodate these situations by relying on the median instead of the mean. These tests also convert the ratio or interval data into ranks and check if the ranks of the two sets of scores differ significantly (Agresti & Finlay, 2009). The non-parametric equivalent of the independent samples t-test is the Mann-Whitney U test, which counts the number of pairs for which the score from the first group is higher. The non-parametric equivalent of the paired samples t-test is the Wilcoxon signed-rank test, which utilizes the score rankings and compares the sample mean ranks. Before opting for a parametric or non-parametric test, it would be advisable to check whether the data are normally distributed. When conducting exploratory data analysis, this can be done using histograms and stem-and-leaf plots and running normality tests (e.g., Shapiro-Wilk and Kolmogorov-Smirnov tests) (Dancey & Reidy, 2011).
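A minimal Python sketch of this decision flow is shown below: a Shapiro-Wilk check on each group, followed by an independent samples t-test if normality looks tenable and a Mann-Whitney U test otherwise. The scores are invented, and the 0.05 cut-off for the normality check is an assumption of the sketch rather than a fixed rule.

```python
# Minimal sketch: check normality with Shapiro-Wilk, then choose between the
# independent samples t-test and its non-parametric equivalent. Scores are
# invented for illustration.
import numpy as np
from scipy import stats

group1 = np.array([55, 60, 58, 90, 62, 59, 61, 57, 95, 63])   # skewed/outliers
group2 = np.array([50, 52, 54, 51, 53, 55, 49, 52, 50, 54])

normal = True
for g in (group1, group2):
    w, p_norm = stats.shapiro(g)          # small p -> normality doubtful
    normal = normal and (p_norm > 0.05)

if normal:
    stat, p = stats.ttest_ind(group1, group2)
    print(f"Independent samples t-test: t = {stat:.2f}, p = {p:.3f}")
else:
    stat, p = stats.mannwhitneyu(group1, group2, alternative="two-sided")
    print(f"Mann-Whitney U test: U = {stat:.1f}, p = {p:.3f}")
```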

The use of the t-test in language assessment

The t-test enables language testers to move from the domain of descriptive statistics to that of inferential statistics and has been widely used in the field. For instance, it has been used to determine whether the order in which examinees took two versions of a test influenced the scores (Kiddle & Kormos, 2011), understand whether language background affects test performance (Vermeer, 2000), or determine if groups have similar levels of ability (Fidalgo, Alavi, & Amirian, 2014). Table 8.1 lists a few studies in the field of language assessment that utilize the t-test to answer different research questions.

Table 8.1  Application of the t-Test in the Field of Language Assessment

Study                           Type of t-test  The purpose was to . . .
Batty (2015)                    PS              examine the difficulty of different subtests of a listening test.
Beglar and Hunt (1999)          PS              equate two different forms of a word-level test.
Chae (2003)                     PS              investigate whether there was a difference in the mean scores of two forms of a picture-type creativity test.
Cheng, Andrews, and Yu (2010)   PS              explore the effects of school-based assessment (SBA) on students' perceptions about SBA and external examinations.
Coniam (2000)                   IS              compare the effects of context-only visuals and audio-only input on the performance of EFL teachers.
East (2007)                     PS              determine the impact of using bilingual dictionaries in an L2 writing test.
Elgort (2012)                   PS              investigate whether the scores on the bilingual version of a vocabulary size test differed from its monolingual version.
Fidalgo et al. (2014)           IS              investigate whether groups were similar to each other in terms of the ability measured by the test.
Ilc and Stopar (2015)           PS              determine whether there were significant differences in achievement levels of different reading subtests.
Kiddle and Kormos (2011)        IS              determine whether the order in which examinees took two versions of a test influenced the scores.
                                PS              determine the effect of mode of response (online vs. face to face) on performance on an oral test.
Knoch and Elder (2010)          IS              ascertain the difficulty level of two versions of a writing test task.
                                PS              determine the impact of time conditions on writing performance.
Kormos (1999)                   PS              determine the impact of oral performance of examinees on a role-play activity and interview.
Nitta and Nakatsuhara (2014)    PS              investigate the effects of pre-task planning on paired oral test performance.
Sasaki (2000)                   IS              (i) check if participants assigned to two groups were different in terms of several variables such as age and years of English instruction they had received; (ii) determine the influence of culturally familiar words on examinees' cloze test taking processes.
Vermeer (2000)                  IS              investigate whether examinees coming from different language backgrounds performed differently on vocabulary tasks.
Wigglesworth and Elder (2010)   IS              identify the difference between examinees who reported using more language-related strategies vs. examinees who reported focusing more on content and organization on a speaking test task.

Note: IS and PS refer to the independent samples and paired samples t-test, respectively.


The analysis of variance (ANOVA)

ANOVA was developed by the British statistician Ronald Fisher. ANOVA is an extension of the t-test and analyzes how likely it is that any differences between more than two conditions are due to chance. While the t-test compares two levels of an IV, ANOVA, which is a multi-group design statistical test, compares three or more levels of an IV. Since ANOVA is a parametric test, the same statistical assumptions mentioned for the t-tests need to be satisfied before its use. For instance, one of these statistical assumptions is the homogeneity of variance, which means that conditions or groups need to have equal variances. This assumption can be tested by using the Levene test, which checks whether the groups' variances are equal or not. In addition, under ANOVA, the standard deviation of the population distribution needs to be the same for each group, and the samples drawn from the populations need to be independent random samples (Agresti & Finlay, 2009). Although ANOVA examines the differences between the means, it is called the analysis of variance because it compares the means of more than two groups by using two estimates of the variance. The first estimate of variance is the variability between each sample mean and the grand (i.e., overall) mean. The second variance pertains to the variability within each group. When the variability of the means of the groups is large and the variability within a group is small, the evidence is against the null hypothesis, which states there is no difference among the groups (i.e., the null hypothesis is rejected). ANOVA follows several basic steps to test for statistical differences between the sample means. It first estimates the means for each group. Second, it calculates the grand mean by adding all group means together and dividing the sum by the number of groups. Third, it estimates the within-groups variation by calculating the deviation of each single score in a group from the mean of that group. Fourth, it estimates the between-groups variation by calculating the deviation of each group's mean from the grand mean. Then it partitions the total variance into two sources. The first source is the between-groups variability, which stands for the variation occurring in the DV due to the effect of the IV. The second source is the within-groups variability, which is also referred to as error variability. Within-groups variability may occur due to several factors such as individual differences and measurement errors and refers to any sort of variation in the DV that is not explained by the IV (Dancey & Reidy, 2011). ANOVA uses the test statistic, F, which is a ratio of the between-groups variance to the within-groups variance. The F statistic is conceptualized as follows:

F = (variability due to the independent variable + error variability) / error variability
  = (between-groups estimate of variance) / (within-groups estimate of variance)    (8.2)

The variance within a group is expected to be as small as possible, for this would make the F value larger. F will also be relatively larger if the variation between the

groups is larger than the variation within the groups. In this case, the likelihood that this result is due to chance diminishes considerably. If a large F is obtained, this would suggest that the IV has a strong impact on the DV, leading to more variability than the error. On the other hand, if the impact is a small one, which lends support to the null hypothesis, F will also be a small one (close to 1). For example, an F ratio of 6.12 indicates that the variability occurring due to the IV is roughly six times as large as the variability occurring due to error. Once we have the F value and its associated p value, and the p value falls below the chosen alpha level, we can claim that there is a statistically significant difference between the means of the groups. Although a large F indicates that one or more groups are different from the others, it does not tell us which group possesses a significantly higher/lower mean. Additional statistical tests, known as multiple comparison tests or post hoc comparison tests, resolve this issue. These multiple comparison tests allow for determining which group means are different from the other means and control for the family-wise error rate. The family-wise error rate refers to the probability of committing one or more Type I errors while running multiple hypothesis tests (Lehmann & Romano, 2005). To illustrate, if there are three groups and these groups are compared in pairs, there would be three comparisons: 1 vs. 2, 1 vs. 3, and 2 vs. 3. Running the multiple comparisons in separate t-tests increases the likelihood of committing a Type I error, which refers to inaccurately rejecting the null hypothesis (Gordon, 2012). Thus, when F indicates a statistically significant difference among the groups, instead of running multiple t-tests, the use of post hoc multiple comparison tests is suggested. Among the most well-known post hoc comparison tests are the Tukey HSD (honestly significant difference) test, the Scheffe test, and the Newman–Keuls test, which all conduct pairwise comparisons to examine the difference between two groups (Seaman, Levin, & Serlin, 1991). If such comparisons among the pairs are made by running multiple t-tests (Connolly & Sluckin, 1971), the chances of making a Type I error can be reduced by applying the Bonferroni adjustment. Using the Bonferroni adjustment, the alpha level is divided by the number of comparisons. For instance, if we have two comparison tests at an alpha level of .05, we would divide the alpha .05/2 = .025, and the new alpha would be .025 for each test. These correction procedures are also available in commercial statistical packages such as SPSS and SAS.
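As an illustration of the omnibus test and the Bonferroni logic described above, the following Python/SciPy sketch runs a one-way ANOVA on three invented groups and then carries out Bonferroni-adjusted pairwise t-tests. The data are hypothetical and are not taken from any study reviewed in this chapter.

```python
# One-way ANOVA followed by Bonferroni-adjusted pairwise comparisons (invented data).
from itertools import combinations
import numpy as np
from scipy import stats

groups = {
    "Group 1": np.array([15, 17, 14, 18, 16, 15, 17]),
    "Group 2": np.array([13, 14, 12, 15, 13, 14, 12]),
    "Group 3": np.array([11, 10, 12, 11, 9, 12, 10]),
}

# Omnibus one-way ANOVA: is at least one group mean different from the others?
f_stat, p_val = stats.f_oneway(*groups.values())
print(f"F = {f_stat:.2f}, p = {p_val:.4f}")

# Post hoc pairwise t-tests judged against a Bonferroni-adjusted alpha level
pairs = list(combinations(groups, 2))
alpha_adj = .05 / len(pairs)            # e.g., .05 / 3 comparisons
for a, b in pairs:
    t, p = stats.ttest_ind(groups[a], groups[b])
    print(f"{a} vs. {b}: t = {t:.2f}, p = {p:.4f}, "
          f"significant at adjusted alpha {alpha_adj:.4f}: {p < alpha_adj}")
```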

One-way and factorial ANOVA

There are two types of ANOVA tests, one-way ANOVA and factorial ANOVA, and their use depends on the number of IVs modeled. The ANOVA that has been elaborated on so far in this chapter is the one-way ANOVA, which includes a single IV. When the aim is to capture the differences between three or more groups on more than one IV, the design becomes a factorial ANOVA, which may be viewed as an extension of the one-way ANOVA. Although several terms such as the two-way ANOVA or the three-way ANOVA are used in the literature, all these N-way ANOVAs are gathered under the umbrella term "factorial ANOVA". In a factorial ANOVA design, the term factor refers to the IVs. The simplest factorial design is denoted as a 2 × 2 design, in which there are two IVs, or factors,

each IV having two levels. For instance, if we have a 3 × 2 × 2 factorial design, this means that we are dealing with three IVs, the first IV with three levels and the second and third IVs each with two levels. The effects of individual IVs on the DV are referred to as main effects. To illustrate, Schoonen and Verhallen (2008) examined the impact of language background and educational level on young language learners' performance on a Dutch word association test. The authors used a factorial ANOVA with two IVs, each with two levels. The first IV was the language background of the participants (native Dutch vs. non-native Dutch), while the second IV was the educational level of the participants (grade 3 vs. grade 5). In this particular example, there were two main effects and an interaction effect in the factorial design: the first main effect referred to the difference in task performances between examinees who were native Dutch speakers and non-native Dutch speakers, and the second main effect referred to the difference in task performances between examinees who were attending grades 3 and 5. Finally, there was an interaction effect, capturing the way that language background and educational level jointly influenced examinees' performance on the task. An interaction effect means that the main effects alone cannot account for the impact of the IVs on the DV. When an interaction effect is present, one IV may function differently across the levels of the second IV (Dancey & Reidy, 2011). In factorial ANOVA, all sources of variance in the scores are examined. Thus, in the example above, there would be four sources of variance: the main effect of the first IV, the main effect of the second IV, the interaction between these two IVs, and the variance occurring due to the differences between participants employed in each condition (i.e., error variance). In other words, each IV would account for a part of the variation. Another part would be explained by the interaction effect between the IVs, and what remains would be the error variance. At this point, the chief function of factorial ANOVA is to determine the amount of variance in scores that is attributable to each factor. It then compares these sources of variance with the error variance and calculates the likelihood of obtaining such an effect due to chance. Although factorial ANOVA allows for modeling more than two IVs with one DV, it should be noted that as the number of IVs increases, so does the complexity of the design. In designs where many IVs come into play, interpreting the interactions between these IVs may prove extremely difficult. Moreover, modeling multiple IVs may lead to increased chances of making a Type I error, since multiple main and interaction effects would be tested against multiple null hypotheses (Dancey & Reidy, 2011). After obtaining an F value and its associated probability, d can be calculated as a measure of effect size for the differences between more than two conditions (Dancey & Reidy, 2011). However, there are also other measures of effect size in ANOVA such as omega squared, epsilon squared, eta squared, and partial eta squared (Levine & Hullett, 2002).2 These measures of effect size help us understand how much variance in the DV can be accounted for by the IV in percentage terms, and statistical packages such as SPSS provide researchers with these measures.
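A factorial design of the kind described above can also be fitted in code. The sketch below is a hedged illustration in Python with pandas and statsmodels, loosely mirroring a 2 × 2 language background by grade design; the data frame and its values are invented and do not come from Schoonen and Verhallen (2008). It also shows one way to compute partial eta squared from the ANOVA table.

```python
# A 2 x 2 factorial ANOVA with an interaction term, plus partial eta-squared (invented data).
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

df = pd.DataFrame({
    "score":      [55, 61, 58, 70, 72, 68, 40, 47, 44, 62, 66, 60],
    "background": ["L1", "L1", "L1", "L1", "L1", "L1",
                   "L2", "L2", "L2", "L2", "L2", "L2"],
    "grade":      ["3", "3", "3", "5", "5", "5",
                   "3", "3", "3", "5", "5", "5"],
})

# Two main effects plus their interaction
model = ols("score ~ C(background) * C(grade)", data=df).fit()
anova_table = sm.stats.anova_lm(model, typ=2)

# Partial eta-squared for each effect: SS_effect / (SS_effect + SS_error)
# (the value computed for the Residual row itself can be ignored)
ss_error = anova_table.loc["Residual", "sum_sq"]
anova_table["partial_eta_sq"] = anova_table["sum_sq"] / (anova_table["sum_sq"] + ss_error)
print(anova_table)
```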
Regarding how participants are assigned to conditions or groups, there are two types of one-way ANOVA: between-participants ANOVA, in which participants

perform in only one condition (also called an independent design), and within-participants ANOVA, in which the same participants perform in all conditions (a related or repeated measures design). Within-participants ANOVA divides the within-groups variation in the scores into two parts: the variability due to individual differences and the variability due to random error. As individual differences do not interfere with the participants' performances and each participant can be tested against himself/herself, the sensitivity of the design increases. As a result, when compared to between-participants ANOVA, within-participants ANOVA yields relatively larger F and smaller p values (Davis & Smith, 2005). However, there is a drawback to within-participants ANOVA that stems from the assumption of sphericity, which requires the variances of the differences between all pairs of conditions to be equal. Since this assumption is not likely to be met in within-participants ANOVA, it may be more prudent to assume that the assumption is violated and to apply a correction such as the Greenhouse-Geisser correction, which is applied precisely when sphericity is not fulfilled. This correction makes the ANOVA test more stringent, and so the possibility of committing a Type I error is reduced (Dancey & Reidy, 2011). Finally, as with the t-tests, when the statistical assumptions that ANOVA makes cannot be satisfied, researchers may need to use a non-parametric equivalent of ANOVA. On these occasions, the Kruskal-Wallis test can be used as an alternative to one-way ANOVA, while Friedman's test can be used as an alternative to within-participants ANOVA.
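The two non-parametric alternatives just mentioned are also available in SciPy. The brief sketch below uses invented data purely to show the function calls.

```python
# Non-parametric alternatives to ANOVA (illustrative data only).
from scipy import stats

group_1 = [15, 17, 14, 18, 16]
group_2 = [13, 14, 12, 15, 13]
group_3 = [11, 10, 12, 11, 9]

# Kruskal-Wallis H test: alternative to a between-participants one-way ANOVA
h_stat, p_h = stats.kruskal(group_1, group_2, group_3)

# Friedman test: alternative to a within-participants (repeated measures) ANOVA,
# here treating the three lists as three measurements of the same five participants
chi_stat, p_f = stats.friedmanchisquare(group_1, group_2, group_3)

print(f"Kruskal-Wallis H = {h_stat:.2f}, p = {p_h:.4f}")
print(f"Friedman chi-square = {chi_stat:.2f}, p = {p_f:.4f}")
```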

ANOVA in language assessment

To date, ANOVA and its variations have been utilized to respond to numerous needs such as determining the differences between various assessment tasks (Kormos, 1999), understanding whether the multiple-choice item format impacts examinees' scores (Currie & Chiramanee, 2010), or examining whether there are group differences among raters in terms of gender and nationality (Schaefer, 2008). Table 8.2 presents a number of studies in the field of language assessment that employed ANOVA procedures to answer different research questions. The previous sections have provided background to the t-test and ANOVA procedures. The following section showcases how these statistical procedures were applied in two different language assessment studies and how the assumptions of these procedures were checked.

Table 8.2  Application of ANOVA in Language Testing and Assessment

Study | Type of ANOVA | The purpose was to . . .
Barkaoui (2014) | OWA | compare examinees' scores across the three writing tasks.
Barkaoui (2014) | FA | examine the effects of English language proficiency and keyboarding skill level on paper-based independent task scores.
Beglar and Hunt (1999) | OWA | create reliable and equivalent forms of the tests and determine whether four forms of a vocabulary test differed.
Casey, Miller, Stockton, and Justice (2016) | FA | compare writing performances of examinees across time and grade.
Chang (2006) | OWA | investigate whether the gains from the translation task to immediate recall differ among groups of different proficiency levels.
Chapelle et al. (2010) | OWA | determine whether a prototype test of productive grammar ability distinguishes among three groups of examinees.
Currie and Chiramanee (2010) | OWA | investigate the effect of the multiple-choice item format on examinees' scores.
East (2007) | OWA | determine whether levels of ability or levels of prior experience with a dictionary led to differences in L2 writing scores.
Gebril and Plakans (2013) | OWA | examine the relationship between writing proficiency and several discourse features.
Kim (2015) | OWA | determine whether the attribute mastery probabilities of three reading proficiency levels differ.
Kobayashi (2002) | OWA | (i) investigate whether different text types affected reading comprehension test performance; (ii) investigate the effects of response formats across text types; (iii) examine text type effects on different language proficiency groups.
Leaper and Riazi (2014) | OWA | determine whether examinees' Rasch-adjusted scores differed when grouped by prompt.
Lee and Winke (2012) | FA | determine whether three original tests with five-option items were equally difficult.
Leong, Ho, Chang, and Hau (2012) | OWA | examine which of the language components varied across educational groups.
Mann, Roy, and Morgan (2016) | OWA | investigate how examinees performed on different tasks measuring vocabulary.
O'Sullivan (2002) | FA | investigate the effects of pairing with a friend or a stranger and the gender of the partner on examinees' performance in an oral proficiency test.
Ockey (2007) | OWA | investigate whether listening performance was improved by visuals when the visuals included information that complemented the audio input.
Schaefer (2008) | FA | determine whether there were group differences among raters in terms of gender and nationality.
Schmitt, Ching, and Garras (2011) | OWA | determine whether scoring methods of a measure of vocabulary distinguished between the examinees with different language proficiencies.
Schoonen and Verhallen (2008) | FA | determine the effects of grade and language background on the scores of examinees on a vocabulary test.
Shin and Ewert (2015) | OWA | determine whether examinees who chose a specific topic performed better than others.

Note: OWA and FA stand for one-way ANOVA and factorial ANOVA, respectively.

Sample study 1: Investigating the impact of using authentic English songs on students' pronunciation development

Study 1 investigated the impact of an alternative treatment for listening and pronunciation on participants' test performances. The study explored whether using authentic English songs would improve young EFL (English as a foreign language) learners' word-level pronunciation through analyzing their scores on a picture-type pronunciation test. Given the importance of training young learners' pronunciation skills in the target language at an early age (Neri, Mich, Gerosa, & Giuliani, 2008), songs can be of great help in offering authentic and pleasurable opportunities to increase young learners' exposure to the target language. In particular, the nature of children's songs, which comprise repeated verses, makes songs useful tools for the presentation and practice of language (Paquette & Rieg, 2008; Saricoban & Metin, 2000). Thus, it was hypothesized that using authentic English songs would be more effective than using traditional drills to foster participants' pronunciation development. The IV of the study was teaching pronunciation through authentic songs, and the DV was the amount of improvement in pronunciation scores.

Sixty young learners of English participated in the study. All were 10 years old, attended fourth grade in a state school in an EFL environment, and formed a representative sample. They all studied English as a school subject mainly through grammar-based traditional methods. The participants were divided into two groups of 30. Before giving the treatment, a pre-test was given to ensure that both groups were homogeneous. They were given a picture-type pre-test that included 12 short words (e.g., dog, bird, cat, fish, egg, horse, car) with short and long vowels in English. The participants were asked to pronounce the given words, and their voices were recorded using an audio recorder. They were allowed to try pronouncing words many times before they produced their final pronunciation. The researcher and another rater who was a native speaker of English listened to the recordings and assessed each participant's pronunciation, specifically focusing on how the vowels were pronounced. For each picture presented, if the student provided a correct pronunciation, s/he earned one point; otherwise, zero. The maximum score that the students could earn on the exam was 12.

After taking the pre-test, the experimental group studied the same vowels through authentic songs, which can be found on the website www.songsforteaching.com. On the other hand, the control group engaged with traditional listen-and-repeat activities. The treatment lasted 4 weeks. Each week, participants had four English lessons, each lesson lasting approximately 45 minutes. In each lesson, nearly 15 minutes of the class time was devoted to studying pronunciation. Later, the pre-test was given as a post-test to compare the two groups' performance and improvement.

Data analysis

Histograms and results of a Shapiro-Wilk normality test for the pre-test scores of both groups were examined separately. Since the data were skewed and not normally distributed, and the number of participants was small, the Mann-Whitney U test was judged to be the most appropriate statistical test. All statistical analyses were conducted using the commercial statistical package SPSS 20.

Results and discussion

Descriptive statistics revealed that participants in the experimental group (M = 3.33, Mdn = 2.5) and the control group (M = 3.43, Mdn = 2) performed similarly on the pronunciation test. Moreover, the Mann-Whitney U index was found to be 410 (z = −0.593) with an associated probability (p) of .55, which indicated that the difference between the groups may be due to sampling error. These results also suggested that the two groups were similar in terms of their performance on the pronunciation test. After the intervention, both groups were given the post-test to examine the impact of the treatment. Initially, the post-test data were examined using histograms and the Shapiro-Wilk normality test, and it was concluded that the data were normally distributed. The post-test data demonstrated that participants in the experimental group scored higher (M = 7.1, SD = 2.3) than the participants in the control group (M = 5, SD = 2.6), and an independent samples t-test revealed that, if the null hypothesis were true, such a result would be highly unlikely to have arisen (t(58) = 3.11, p = .003, d = 0.85, two-tailed). It was therefore concluded that there is a statistically significant difference in pronunciation scores between experimental and control groups, and the use of authentic songs as an

alternative treatment improved participants' pronunciation greatly when compared to the traditional listen-and-repeat drills. Although the Mann-Whitney U test of the pre-test scores showed that the two groups performed similarly prior to the treatment, the independent samples t-test of the post-test scores showed a statistically significant difference between the performances of the two groups after the treatment. It was previously hypothesized that the use of songs for teaching pronunciation would be more effective than traditional listen-and-repeat activities and repetitive drills and would lead to better performances on test tasks. The results showed that the experimental group not only made a considerable improvement in their pronunciation scores but also performed better than the control group on the post-test. These findings concur with findings of previous studies, which have demonstrated that songs have a positive impact on the development of language skills such as reading, listening, and vocabulary (Coyle & Gómez Gracia, 2014; Li & Brand, 2009; Paquette & Rieg, 2008).
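For readers who wish to trace the analytic workflow of Study 1 step by step, the sketch below re-creates it in Python with SciPy. The original analysis was run in SPSS; the scores generated here are simulated purely for illustration and are not the study's data.

```python
# A hedged re-creation of the Study 1 workflow with simulated (not real) scores.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
pre_exp  = rng.integers(0, 8, 30)               # hypothetical pre-test scores, experimental group
pre_ctrl = rng.integers(0, 8, 30)               # hypothetical pre-test scores, control group
post_exp  = pre_exp + rng.integers(1, 6, 30)    # hypothetical post-test scores (max 12)
post_ctrl = pre_ctrl + rng.integers(0, 4, 30)

# Step 1: check normality of the pre-test scores (Shapiro-Wilk)
print(stats.shapiro(pre_exp), stats.shapiro(pre_ctrl))

# Step 2: skewed pre-test data, so compare groups with the Mann-Whitney U test
print(stats.mannwhitneyu(pre_exp, pre_ctrl, alternative="two-sided"))

# Step 3: post-test data judged normal, so compare groups with an independent samples t-test
t, p = stats.ttest_ind(post_exp, post_ctrl)
m_sd = (post_exp.std(ddof=1) + post_ctrl.std(ddof=1)) / 2
d = (post_exp.mean() - post_ctrl.mean()) / m_sd
print(f"t = {t:.2f}, p = {p:.3f}, d = {d:.2f}")
```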

Sample study 2: L2 reading comprehension ability across the college years

Study 2, which utilized a descriptive research design, investigated whether there were meaningful differences between the performances of four groups of examinees on an L2 reading comprehension test. The participants, who were freshmen, sophomores, juniors, and seniors, received higher education in an EFL setting and mainly needed English reading comprehension skills for reading to learn. In such a context, reading was treated as an interaction between text-based and knowledge-based processes of a reader in which readers were assumed to utilize lower-level and higher-level reading processes in harmony (Grabe, 2009). Thus, the test primarily aimed to tap higher-level reading skills which help constitute a propositional representation (text base) and a situational representation of the text (Kintsch, 1988). It included 27 items designed in a multiple-choice format and tapped five reading sub-skills, namely understanding explicitly stated information, inferencing, using syntactic knowledge, using discourse knowledge, and reading critically. For each question, there was only one correct answer. Reading texts were analyzed quantitatively and qualitatively to ensure that they matched the language threshold of the students and were appropriate. It took students approximately 45 to 50 minutes to complete the test. The Cronbach alpha coefficient of the test, which is a measure of internal consistency, was found to be .85 (for more information on reliability, see Chapter 1, Volume I).

Participants of the study had successfully passed high-stakes admission and placement exams (both in general subjects and English) and English proficiency exams of their departments prior to embarking on their higher education. The test was given to four groups of examinees, a total of 120 participants (aged 18–23), with 30 examinees in each group (freshmen, sophomores, juniors, and seniors), to investigate the differences between these four groups of examinees. Since the instrument included 27 items and a student earned one point for each correct answer, scores could range from 0 to 27, with 27 being the highest possible score on the test.
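As an aside on the reliability figure reported above, Cronbach's alpha can be computed directly from a persons-by-items matrix of 0/1 scores. The sketch below is a generic Python illustration with a randomly generated matrix; because the simulated responses are random, the resulting value will be low rather than the .85 reported for the actual test.

```python
# A generic sketch of Cronbach's alpha for a dichotomously scored test (invented responses).
import numpy as np

rng = np.random.default_rng(7)
responses = (rng.random((120, 27)) < 0.55).astype(int)   # hypothetical 120 x 27 score matrix

def cronbach_alpha(items: np.ndarray) -> float:
    """alpha = k/(k-1) * (1 - sum of item variances / variance of total scores)."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)
    total_var = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

print(f"Cronbach's alpha = {cronbach_alpha(responses):.2f}")
```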

Data analysis

Before applying the appropriate statistical test, the data were examined using descriptive statistics, histograms, and the Shapiro-Wilk test for normality. The results of these tests indicated that the data met the normality assumption. Since there were more than two groups, the data were analyzed using the one-way ANOVA. Although the one-way ANOVA can indicate if the results obtained are significant, it does not tell which groups' means are different. Thus, Tukey's HSD, a post hoc comparison test, was employed to determine the group(s) performing differently from the other groups.

Results and discussion

Descriptive statistics, presented in Table 8.3, show that the freshmen scored better than the other three groups and that there were differences between the means of the four groups.

Table 8.3  Descriptive Statistics for Groups' Performances on the Reading Test

Group | Mean | Standard deviation | 95% Confidence interval
Group 1 (Freshmen) | 17.5 | 4.3 | 15.8–19.1
Group 2 (Sophomores) | 14.4 | 4.6 | 12.6–16.1
Group 3 (Juniors) | 13.3 | 4.8 | 11.5–15.1
Group 4 (Seniors) | 11.1 | 4.6 | 9.4–12.9

To fully understand whether these differences were statistically meaningful, the one-way ANOVA was used. A Levene test for homogeneity of variances showed that the variances of the four groups were not significantly different from each other, indicating that the assumption of homogeneity of variance was satisfied (F(3, 116) = .263, p = .852). The one-way ANOVA indicated that any differences between group scores were unlikely to be due to sampling error, assuming the null hypothesis to be true. The results (F(3, 116) = 9.73, p < .001) yielded an effect size (partial eta-squared) of .201, which suggests that 20.1% of the variation in scores could be accounted for by the college year of the examinees. Although the one-way ANOVA revealed that one or more groups were different from the others, it cannot specify which group was significantly different from the other groups. Hence, Tukey's HSD was applied to determine which group means differed from each other. Tukey's HSD demonstrated that the differences between the freshmen (M = 17.5, SD = 4.3) and both the juniors (M = 13.3, SD = 4.8) and the seniors (M = 11.1, SD = 4.6), and between the sophomores (M = 14.4, SD = 4.6) and the seniors, were unlikely to be due to sampling error. On the other hand, there were no significant differences between the freshmen and the sophomores or between the juniors and the seniors.

Given that there were no substantial differences in the educational objectives and background of the four groups of participants, it was surprising that the freshmen outperformed the other groups. The sophomores, juniors, and seniors had more experience in reading to learn, in a broader context, and they had considerably longer experience in academic reading when compared to their freshmen counterparts. More interestingly, although the seniors had the most experience in academic reading, they scored lower than the other groups. Cumulatively, the participants' success on the test diminished as their college year increased, suggesting a negative relationship between college year and success. This finding may be accounted for by the far-reaching impact of the large-scale, high-stakes English exams that prospective undergraduates take to enter university, where the focus is on honing test-taking skills and strategies.
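The Study 2 sequence of checks and tests can likewise be sketched in code. The example below uses Python with SciPy and statsmodels; the group scores are simulated to roughly echo the reported means and standard deviations and are not the actual study data (the original analysis was conducted in SPSS).

```python
# A hedged sketch of the Study 2 analysis: Levene's test, one-way ANOVA, Tukey's HSD.
import numpy as np
from scipy import stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd

rng = np.random.default_rng(42)
means = {"Freshmen": 17.5, "Sophomores": 14.4, "Juniors": 13.3, "Seniors": 11.1}
scores, labels = [], []
for group, m in means.items():
    scores.append(np.clip(rng.normal(m, 4.5, 30), 0, 27))   # simulated 0-27 test scores
    labels += [group] * 30

# Levene's test for homogeneity of variance, then the omnibus one-way ANOVA
print(stats.levene(*scores))
print(stats.f_oneway(*scores))

# Tukey's HSD post hoc comparisons across the four college years
print(pairwise_tukeyhsd(np.concatenate(scores), labels, alpha=.05))
```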

Conclusion and implications

Since language testers try to examine and understand the nature of language-related constructs, develop and test novel tools for measuring these constructs, and make sense of the data they collect, language testing and assessment can be considered both an art and a science (Green, 2013). For instance, language testers may need to examine the impact of mode of response (e.g., online vs. face to face) on speaking test performance, equate two different forms of a vocabulary test, or investigate whether examinees with different L1 backgrounds perform differently on a reading test. Most of the time, achieving these goals requires stretching beyond the realm of descriptive statistics, which helps in summarizing and describing the data of interest. On such occasions, a basic familiarity with inferential statistics, which helps make inferences and predictions about a population based on a sample, would be useful. Hence, language testers need to possess substantial knowledge of inferential statistics and a set of relevant skills, apply and interpret the results of related inferential statistical methods, and communicate the results of the analyses to the parties involved in assessment practices.

This chapter was motivated by the desire to present readers with an introduction to the t-test and ANOVA, two basic inferential statistical techniques that enable investigating the differences between conditions or groups and determining whether these differences are meaningful. To this aim, the chapter provided fundamental information about the features, statistical assumptions, and uses of the t-test and ANOVA. Moreover, it showcased how these techniques can be used by reporting on two separate studies carried out in operational settings. The first study explored the effect of an alternative treatment for listening and pronunciation on students' test performances by using the Mann-Whitney U test. The second study employed several ANOVA procedures to detect whether there were meaningful differences between the performances of four groups of examinees on a reading test. These studies had limitations regarding the type and number of statistical techniques they employed and the sample size. However, given that both t-test and ANOVA techniques are in widespread use in the field and can be used for accomplishing a wide range of purposes in every phase of language assessment practices, it is hoped that this chapter will serve to enrich the statistical toolbox of language testers and individuals interested in language testing and assessment practices.


Notes

1 Since the full treatment of data types is outside the scope of this chapter, interested readers may refer to Stevens (1946).
2 Since the full treatment of these measures is outside the scope of this chapter, interested readers may refer to Brown (2008) and Richardson (2011).

References

Agresti, A., & Finlay, B. (2009). Statistical methods for the social sciences (4th ed.). London: Pearson Prentice Hall.
Allen, I. E., & Seaman, C. A. (2007). Likert scales and data analyses. Quality Progress, 40(7), 64–65.
Argyrous, G. (1996). Statistics for social research. Melbourne: MacMillan Education.
Barkaoui, K. (2014). Examining the impact of L2 proficiency and keyboarding skills on scores on TOEFL-iBT writing tasks. Language Testing, 31(2), 241–259.
Batty, A. O. (2015). A comparison of video- and audio-mediated listening tests with many-facet Rasch modeling and differential distractor functioning. Language Testing, 32(1), 3–20.
Beglar, D., & Hunt, A. (1999). Revising and validating the 2000 Word Level and University Word Level Vocabulary Tests. Language Testing, 16(2), 131–162.
Brown, J. D. (2008). Effect size and eta squared. Shiken: JALT Testing & Evaluation SIG Newsletter, 12(2), 38–43.
Casey, L. B., Miller, N. D., Stockton, M. B., & Justice, W. V. (2016). Assessing writing in elementary schools: Moving away from a focus on mechanics. Language Assessment Quarterly, 13(1), 42–54.
Chae, S. (2003). Adaptation of a picture-type creativity test for pre-school children. Language Testing, 20(2), 179–188.
Chang, Y. (2006). On the use of the immediate recall task as a measure of second language reading comprehension. Language Testing, 23(4), 520–543.
Chapelle, C. A., Chung, Y., Hegelheimer, V., Pendar, N., & Xu, J. (2010). Towards a computer-delivered test of productive grammatical ability. Language Testing, 27(4), 443–469.
Cheng, L., Andrews, S., & Yu, Y. (2010). Impact and consequences of school-based assessment (SBA): Students' and parents' views of SBA in Hong Kong. Language Testing, 28(2), 221–249.
Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ: Lawrence Erlbaum Associates.
Coniam, D. (2000). The use of audio or video comprehension as an assessment instrument in the certification of English language teachers: A case study. System, 29(1), 1–14.
Connolly, T., & Sluckin, W. (1971). An introduction to statistics for the social sciences. London: MacMillan Education.
Coyle, Y., & Gómez Gracia, R. (2014). Using songs to enhance L2 vocabulary acquisition in preschool children. ELT Journal, 68(3), 276–285.
Currie, M., & Chiramanee, T. (2010). The effect of the multiple-choice item format on the measurement of knowledge of language structure. Language Testing, 27(4), 471–491.
Dancey, P., & Reidy, J. (2011). Statistics without maths for psychology. New York, NY: Prentice Hall/Pearson.
Davis, L. (2016). The influence of training and experience on rater performance in scoring spoken language. Language Testing, 33(1), 117–135.
Davis, S. F., & Smith, R. A. (2005). An introduction to statistics and research methods: Becoming a psychological detective. Upper Saddle River, NJ: Pearson/Prentice Hall.
East, M. (2007). Bilingual dictionaries in tests of L2 writing proficiency: Do they make a difference? Language Testing, 24(3), 331–353.
Elgort, I. (2012). Effects of L1 definitions and cognate status of test items on the Vocabulary Size Test. Language Testing, 30(2), 253–272.
Fidalgo, A., Alavi, S., & Amirian, S. (2014). Strategies for testing statistical and practical significance in detecting DIF with logistic regression models. Language Testing, 31(4), 433–451.
Gebril, A., & Plakans, L. (2013). Toward a transparent construct of reading-to-write tasks: The interface between discourse features and proficiency. Language Assessment Quarterly, 10(1), 9–27.
Gordon, R. A. (2012). Applied statistics for the social and health sciences. New York, NY: Routledge.
Grabe, W. (2009). Reading in a second language: Moving from theory to practice. Cambridge: Cambridge University Press.
Green, R. (2013). Statistical analyses for language testers. New York, NY: Palgrave Macmillan.
He, L., & Shi, L. (2012). Topical knowledge and ESL writing. Language Testing, 29(3), 443–464.
Ilc, G., & Stopar, A. (2015). Validating the Slovenian national alignment to CEFR: The case of the B2 reading comprehension examination in English. Language Testing, 32(4), 443–462.
Kiddle, T., & Kormos, J. (2011). The effect of mode of response on a semidirect test of oral proficiency. Language Assessment Quarterly, 8(4), 342–360.
Kim, A. Y. (2015). Exploring ways to provide diagnostic feedback with an ESL placement test: Cognitive diagnostic assessment of L2 reading ability. Language Testing, 32(2), 227–258.
Kintsch, W. (1988). The role of knowledge in discourse comprehension: A construction-integration model. Psychological Review, 95(2), 163–182.
Knoch, U., & Elder, C. (2010). Validity and fairness implications of varying time conditions on a diagnostic test of academic English writing proficiency. System, 38(1), 63–74.
Kobayashi, M. (2002). Method effects on reading comprehension test performance: Text organization and response format. Language Testing, 19(2), 193–220.
Kormos, J. (1999). Simulating conversations in oral proficiency assessment: A conversation analysis of role plays and non-scripted interviews in language exams. Language Testing, 16(2), 163–188.
Leaper, D. A., & Riazi, M. (2014). The influence of prompt on group oral tests. Language Testing, 31(2), 177–204.
Lee, H., & Winke, P. (2012). The differences among three-, four-, and five-option item formats in the context of a high-stakes English-language listening test. Language Testing, 30(1), 99–123.
Lehmann, E. L., & Romano, J. P. (2005). Generalizations of the familywise error rate. The Annals of Statistics, 33(3), 1138–1154.
Leong, C., Ho, M., Chang, J., & Hau, K. (2012). Differential importance of language components in determining secondary school students' Chinese reading literacy performance. Language Testing, 30(4), 419–439.
Levine, T. R., & Hullett, C. R. (2002). Eta squared, partial eta squared, and misreporting of effect size in communication research. Human Communication Research, 28(4), 612–625.
Li, X., & Brand, M. (2009). Effectiveness of music on vocabulary acquisition, language usage, and meaning for mainland Chinese ESL learners. Contributions to Music Education, 73–84.
Mann, W., Roy, P., & Morgan, G. (2016). Adaptation of a vocabulary test from British Sign Language to American Sign Language. Language Testing, 33(1), 3–22.
Neri, A., Mich, O., Gerosa, M., & Giuliani, D. (2008). The effectiveness of computer assisted pronunciation training for foreign language learning by children. Computer Assisted Language Learning, 21(5), 393–408.
Nitta, R., & Nakatsuhara, F. (2014). A multifaceted approach to investigating pre-task planning effects on paired oral test performance. Language Testing, 31(2), 147–175.
Ockey, G. (2007). Construct implications of including still image or video in computer-based listening tests. Language Testing, 24(4), 517–537.
O'Sullivan, B. (2002). Learner acquaintanceship and oral proficiency test pair-task performance. Language Testing, 19(3), 277–295.
Ott, R. L., & Longnecker, M. (2001). An introduction to statistical methods and data analysis (5th ed.). Belmont, CA: Duxbury Press.
Paquette, K. R., & Rieg, S. A. (2008). Using music to support the literacy development of young English language learners. Early Childhood Education Journal, 36(3), 227–232.
Richardson, J. T. (2011). Eta squared and partial eta squared as measures of effect size in educational research. Educational Research Review, 6(2), 135–147.
Saricoban, A., & Metin, E. (2000). Songs, verse and games for teaching grammar. The Internet TESL Journal, 6(10), 1–7.
Sasaki, M. (2000). Effects of cultural schemata on students' test-taking processes for cloze tests: A multiple data source approach. Language Testing, 17(1), 85–114.
Schaefer, E. (2008). Rater bias patterns in an EFL writing assessment. Language Testing, 25(4), 465–493.
Schmitt, N., Ching, W., & Garras, J. (2011). The Word Associates Format: Validation evidence. Language Testing, 28(1), 105–126.
Schoonen, R., & Verhallen, M. (2008). The assessment of deep word knowledge in young first and second language learners. Language Testing, 25(2), 211–236.
Seaman, M. A., Levin, J. R., & Serlin, R. C. (1991). New developments in pairwise multiple comparisons: Some powerful and practicable procedures. Psychological Bulletin, 110(3), 577.
Shin, S. Y., & Ewert, D. (2015). What accounts for integrated reading-to-write task scores? Language Testing, 32(2), 259–281.
Stevens, S. S. (1946). On the theory of scales of measurement. Science, 103(2684), 677–680.
Vermeer, A. (2000). Coming to grips with lexical richness in spontaneous speech data. Language Testing, 17(1), 65–83.
Wigglesworth, G., & Elder, C. (2010). An investigation of the effectiveness and validity of planning time in speaking test tasks. Language Assessment Quarterly, 7(1), 1–24.

9

Application of ANCOVA and MANCOVA in language assessment research

Zhi Li and Michelle Y. Chen

Introduction A typical type of research question in language education and assessment concerns the group differences in achievement and test performance. For example, some researchers may be interested in investigating if two different instructional methods affect test takers’ performance differently. Others may be interested in comparing students’ performance on different tasks or different administration mode of the same task. These types of research questions can be investigated using statistical techniques such as the analysis of covariance (ANCOVA) and the multivariate analysis of covariance (MANCOVA). With a practical orientation, this chapter provides an overview of the application of ANCOVA and MANCOVA in language assessment research. The chapter consists of three main parts: (i) an introduction of the basics and assumptions of ANCOVA and MANCOVA; (ii) a review of published papers, using either of these two techniques, from four major language assessment journals; and (iii) examples of analyses demonstrating the use of ANCOVA and MANCOVA using the Canadian component of the 2009 Programme for International Student Assessment (PISA).

Basics of ANCOVA and MANCOVA The analysis of variance (ANOVA) techniques, as discussed in Chapter 2 (this volume), can be extended to include one or more covariates (ANCOVA), multiple outcomes (MANOVA), or both (MANCOVA). The basic idea behind the analysis of covariance (ANCOVA) is to include potential confounding variables (covariates that may influence a dependent variable or outcome variable) in the model and reduce within-group error variances through controlling for the covariate(s), thus increasing statistical power. Statistical power refers to the probability of correctly rejecting a null hypothesis when it is false (Faul, Erdfelder, Lang, & Buchner, 2007). In other words, if statistical power is high, researchers are more likely to detect an existing effect. For example, suppose we want to compare the reading performance of students who were randomly assigned to one of two teaching methods, Method A and Method B. Although our interest centers on evaluating the effectiveness of the

Application of ANCOVA and MANCOVA  199 teaching methods, we may suspect that, besides the teaching methods, the students may have different reading proficiency levels before the experiment. In this case, we can implement a pre-test–post-test design to collect students’ reading performances before and after the experiment. Then we can run ANCOVA with posttest performance as the dependent variable, teaching methods as the independent variable (also known as factor, treatment, or predictor), and pre-test scores as the covariate. By doing so, we account for the pre-test difference between the two groups before deciding if the differences between the two groups are due to the independent variable. In this design, we can calculate adjusted means of the posttest for these two groups, which are hypothetical means supposing that the two groups had equal pre-test scores. By comparing these adjusted means of the posttest, we can evaluate the effectiveness of the teaching methods. MANCOVA and ANCOVA are both well-established techniques for comparing the means of multiple groups with one or more covariates included in the model. The independent variables in MANCOVA and ANCOVA are categorical,1 while both the covariates and the dependent variables are continuous.2 As opposed to ANCOVA which deals with only one outcome variable, MANCOVA is used when researchers want to analyze the effects of independent variables on two or more correlated outcome variables (i.e., dependent variables) simultaneously. Like ANOVA, both ANCOVA and MANCOVA come in different types, such as one-way ANCOVA and MANCOVA when there is only one independent variable involved; factorial ANCOVA and MANCOVA when two or more independent variables are used; and repeated measures ANCOVA and MANCOVA when the same measures are administered multiple times on the same participants. Referring to the example of reading performance, since only one independent variable, i.e., the teaching method, is used, the corresponding analysis is a one-way ANCOVA. If we are interested in multiple related measures of reading proficiency, for example, reading fluency and accuracy, we can include both measures as dependent variables and extend the above-mentioned analysis into a one-way MANCOVA. If we introduce grade level as another independent variable (e.g., Grade 8 vs. Grade 9), then the research question becomes whether teaching method and grade level affect student reading performance. To address this question, we can use a two-way ANCOVA or factorial ANCOVA. In the case where we assess reading performance multiple times throughout the course and we would like to control for the effect of a covariate3 (e.g., student motivation as measured by a multi-item scale), repeated measures ANCOVA will be an appropriate model to use. In some cases, researchers choose to conduct ANCOVAs as follow-ups to MANCOVA with the aim to further examine the specific effect of the independent variable on each dependent variable individually (e.g., Ockey, 2009; Wagner, 2010). This is also the default approach in SPSS GLM (General Linear Model) multivariate analysis. It is worth noting that running a series of ANCOVAs as follow-ups does not directly address the research question if researchers are interested in the composite of multiple dependent variables. As discussed in Huberty and Morris (1989), when the composite of the multiple dependent variables is not

200  Zhi Li and Michelle Y. Chen of interest, the value of conducting a MANCOVA in the first place is very limited. Researchers can stick to ANCOVAs and test the univariate effects if their primary research interest is to find out whether the independent variable has an effect on each dependent variable individually. Similar to ANOVA, when the independent variables have three levels or more, the main effect tests in ANCOVA and MANCOVA cannot tell us which specific groups are significantly different from each other. The main effect tests can only show that at least two groups are significantly different; planned contrasts or post hoc tests are then needed to determine which of these groups (i.e., which levels of an independent variable) differ from each other. The typical comparisons include multiple comparisons with Bonferroni adjustment of significance level, Fisher’s least significant difference (LSD) test, Tukey’s honest significant difference (HSD) test, and Scheffe’s test. Bonferroni adjustment is a conservative method that controls the overall error rate (i.e., alpha level) when conducting multiple non-independent statistical tests. One way to make Bonferroni adjustment for post hoc tests is to lower the alpha level for each comparison. This is done by dividing the conventional alpha level, usually .05, by the number of comparisons to be made. For example, a study sets its original alpha level at .05, and three comparisons are made. To maintain the overall alpha level, an adjusted alpha level of .016 (i.e., .05 ÷ 3) should be used to judge whether each of the comparisons is statistically significant. To learn more about multiple comparisons, researchers can refer to Field (2009).
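To make the pre-test–post-test ANCOVA described earlier more concrete, the following Python sketch with pandas and statsmodels fits a one-way ANCOVA with teaching method as the independent variable and pre-test score as the covariate. The data are simulated for illustration only (they are not the PISA data analyzed later in this chapter), and the sketch also shows one way to test the homogeneity-of-regression-slopes assumption by adding a method-by-pretest interaction term.

```python
# A hedged one-way ANCOVA sketch (invented data): post-test scores by teaching
# method, controlling for pre-test scores.
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

rng = np.random.default_rng(0)
n = 40
pretest = rng.normal(50, 10, n)
method = np.repeat(["A", "B"], n // 2)
posttest = pretest + np.where(method == "B", 6, 0) + rng.normal(0, 5, n)
df = pd.DataFrame({"method": method, "pretest": pretest, "posttest": posttest})

# Check the homogeneity-of-regression-slopes assumption: the method x pretest
# interaction should be non-significant.
slopes_check = ols("posttest ~ C(method) * pretest", data=df).fit()
print(sm.stats.anova_lm(slopes_check, typ=2))

# Main ANCOVA model: teaching method as the independent variable,
# pre-test score as the covariate.
ancova = ols("posttest ~ C(method) + pretest", data=df).fit()
print(sm.stats.anova_lm(ancova, typ=2))

# Adjusted (covariate-corrected) group means, evaluated at the overall mean pre-test score
grid = pd.DataFrame({"method": ["A", "B"], "pretest": df["pretest"].mean()})
print(ancova.predict(grid))
```

Comparing the two predicted values in the last step corresponds to comparing the adjusted post-test means discussed above.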

Assumptions All the techniques described above belong to the larger family of models called general linear model (GLM). The dependent variable(s) should be on a continuous scale or can be treated as if they were continuous. These models make a number of assumptions that are grouped into three categories in Table 9.1: the general category, the ANCOVA-specific category, and the MANCOVA-specific category. Table 9.1 also summarizes some alternatives or possible solutions for ANCOVA and MANCOVA when one or more assumptions are violated. The general assumptions are applicable to ANOVA, ANCOVA and their multivariate counterparts. These assumptions include independent observations across cases, absence of outliers, normality of residuals of the model, and homogeneity of variance (also known as homoscedasticity). Both ANCOVA and MANCOVA assume that the observations are independent of each other. In the case of nonindependent observations due to repeated measures (e.g., multiple measures of the same construct administered to the same group of participants at different time points), repeated measures ANCOVA or MANCOVA should be used. The assumption of homogeneity of variance can be initially examined using boxplots (also known as box-whisker diagrams) to visualize the spread of the data for each group (Figure 9.1). Similar widths of the boxes for the groups of interest suggest that the variances are likely to be similar. Direct comparisons of the standard

Application of ANCOVA and MANCOVA  201 Table 9.1  Summary of Assumptions of ANCOVA and MANCOVA Category

Assumption

How to check

What to do when violated

General assumptions

Independence of observations

Check research design

Absence of outliers

Examine boxplot or standardized (e.g., z-score) of the variables Conduct ShapiroWilk test; Examine scatter plot of residuals, and/or P-P plot

Use repeated measures ANCOVA or mixedeffects models Consider removing outlier(s) or conduct sensitivity analysis

Normality of residuals

ANCOVA specific

MANCOVA specific

Homogeneity of variance

Conduct Levene’s test or examine boxplots

Homogeneity of regression slopes

Test the interaction effect

Linearity between covariate(s) and dependent variable

Conduct correlation analysis or examine scatter plots

No strong correlation among covariates Homogeneity of covariance matrices

Conduct correlation analysis Conduct Box’s M test

Transform dependent variable to improve normality of distribution; Use other statistical tests (e.g., non-parametric tests) Transform dependent variable to improve normality of distribution; Use other statistical tests Use other statistical tests (e.g., regression models with interaction effect) Transform dependent variable and/or covariate; Use other statistical tests Consider removing one of the highly correlated covariates Report Pillai–Bartlett Trace statistics

deviations of the dependent variables across groups can offer some similar indication as well. This assumption can also be assessed through hypothesis testing such as Levene’s test. A significant Levene’s test statistic (e.g., p < .05) suggests that the null hypothesis of equal variance should be rejected, that is, the assumption of homogeneity of variance is violated. In this case, we can transform the dependent variable to achieve the effect of similar variances across different groups (Tabachnick & Fidell, 2013). Alternatively, a robust ANCOVA method can be used from R package WRS2 (Mair & Wilcox, 2017).

202  Zhi Li and Michelle Y. Chen

Figure 9.1  An example of boxplots.

In addition to these general assumptions, there are three ANCOVA-specific assumptions. First, when multiple covariates are used, researchers should examine whether they all need to be included in the model. Highly correlated covariates may suggest redundancy in the information they provide, and they may introduce multicollinearity to the model. In such cases, researchers can combine different covariates into one variable or remove those less important ones from the model. The correlation between covariates can be checked through a correlational analysis between these variables. Second, the relationship between all pairs of dependent variables and covariates is assumed to be linear. Scatter plots can be used to check linearity between two variables. Lastly, the relationship between the covariates and the dependent variable must be similar across different levels of the independent variable – this is called the homogeneity of regression slopes. There are two ways to assess this assumption. One is to inspect the data graphically by performing a scatter plot analysis with a regression line between the covariate and the dependent variable for each group (i.e., each level of the independent variable). The other way is to statistically test the interaction effect of covariate by grouping. This can be done through the customization of the ANCOVA model with an interaction of a covariate and an independent variable (see the tutorial). Significant interaction effect would indicate that the assumption of homogeneity of regression slope is violated. For MANCOVA, Box’s M test can be conducted to check the assumption of homogeneity of covariance matrices. This test compares the covariance matrices of dependent variables across groups. A non-significant result suggests that the covariance matrices are roughly equal. In the case of a violation of this assumption, Pillai–Bartlett Trace (Pillai’s Trace) statistics in MANOVA are recommended for their robustness (Field, 2009).

Application of ANCOVA and MANCOVA  203 Reporting ANCOVA results Reporting the results of ANCOVA is similar to the practice of ANOVA. Two “play it safe”, tables in the APA (American Psychological Association) style are recommended by Nicol and Pexman (2010, p. 79) for reporting the results of ANCOVA: (a) a table of descriptive statistics for dependent variables and covariate(s) by the levels of independent variable(s) and (b) an ANCOVA summary table. The summary table should contain the degrees of freedom, sums of squares, mean squares, F ratios, p-values, and effect sizes in the form of partial eta-squared for the independent variable(s) and their interaction terms as well as for the covariate(s). It is also acceptable to combine these two tables as long as the key information is presented in a clear way. For MANCOVA, four different multivariate statistics are often used, and they are all available in SPSS. These four statistics are Pillai’s Trace, Wilks’ Lambda, Hotelling’s Trace, and Roy’s Largest Root. Each of the four tests generates its own associated F ratio or F value. If presented in a table, one of the multivariate test statistics should be included, along with the associated F ratios or F values, degrees of freedom, p-value, and effect size. The choice of test statistics should be based on considerations of research design and the outcomes of the checks of assumptions. Wilks’ Lambda is available in most statistical programs and is frequently reported (Tabachnick & Fidell, 2013). Pillai’s Trace is known for being robust against less-than-ideal research design, such as small sample size, unequal group sample sizes, and heterogeneity of variance. Among the four statistics, Roy’s Largest Root is less used. As it usually represents an upper bound on the F value, a statistically significant result of Roy’s Largest Root is often disregarded when none of the other three tests is significant. See Field (2009) for suggestions about which test statistics to report. It is noteworthy that when there are only two levels in an independent variable, the F values of these four statistics are identical. In most cases, the F values and the associated p-values from these four tests are fairly close in magnitude. In other cases where they differ, Pillai’s Trace is often used because of its robustness. Effect size is a measure of the magnitude of an observed effect or phenomenon, usually standardized for objective comparison and interpretation. In ANCOVA and MANCOVA, partial eta-squared (η2) is typically used as a measure of the effect size of a variable. It shows the percentage of the variance that is attributable only to an independent variable, a covariate, or an interaction. The values of the partial eta-squared can be interpreted based on Cohen (1988)’s guidelines: 0.01 for a small effect, 0.06 for a moderate effect, and 0.14 and above for a large effect. When planned contrasts or post hoc comparisons are carried out following an ANCOVA or MANCOVA, either group-level confidence interval or effect sizes in the form of Cohen’s d should be reported to facilitate the interpretation of the statistical results. Overall model fit information is expressed with adjusted R-squared (R2), which indicates how much variance can be accounted for by an ANCOVA model. Cohen (1988) suggests that a small R-squared value is .02, whereas a medium value is .13 and a large value is .26.


ANCOVA/MANCOVA in language assessment research

To better understand how ANCOVA and MANCOVA have been used in language assessment research, we surveyed four academic journals that focus on language assessment for studies utilizing ANCOVA and/or MANCOVA techniques. We found a total of 24 papers: 11 from Language Testing (1984–2017), 10 from Assessing Writing (1994–2017), two from Language Assessment Quarterly (2004–2017), and one from Language Testing in Asia (2011–2017). Among them, 14 studies focused on writing tasks, which corresponds to the relatively large number of ANCOVA-based articles published in the journal Assessing Writing. The rest of the papers are distributed among speaking tasks (3 papers), listening tasks (2 papers), aptitude tests (2 papers), a grammar test (1 paper), and a subject knowledge test (1 paper). These selected papers are marked with asterisks in the References. Among these 24 papers, 5 used MANCOVA, either on its own or in conjunction with ANCOVA. ANCOVA was used in 22 studies, usually together with ANOVA and post hoc comparisons when an independent variable had more than two groups or levels. Noticeably, the applications of ANCOVA/MANCOVA in language assessment research are not yet widespread. This may reflect a general research tradition in second-language research, where ANOVA and t-tests have been much more frequently used than ANCOVA and MANCOVA (Plonsky, 2013). While the number of publications using ANCOVA or MANCOVA in these four journals is small, an increase in studies using ANCOVA and/or MANCOVA has been observed over the last 30 years (Figure 9.2). Three general categories of ANCOVA/MANCOVA-based comparisons emerged in the selected papers, namely comparisons of task characteristics, test taker characteristics, and other situational factors. Seven studies are particularly related to comparisons of test takers’ performances based on task characteristics:

Figure 9.2 Temporal distribution of ANCOVA/MANCOVA-based publications in four language assessment journals.

text type and/or question type in listening tests (Shohamy & Inbar, 1991; Wagner, 2010), visual complexity and planning time (Xi, 2005), writing task types and/or variants (Bridgeman, Trapani, & Bivens-Tatum, 2011; Kobrin, Deng, & Shaw, 2011; Riazi, 2016), and visuals used as writing stimuli (Bae & Lee, 2011). Eight studies examined differences in test performances based on participant characteristics: ethnic groups (Zeidner, 1986, 1987), participant personalities (Ockey, 2009), learner age (Bochner et al., 2016), participants’ writing ability (Erling & Richardson, 2010), learning disabilities (Windsor, 1999), and participant motivation (Huang, 2010, 2015). Other studies compared different situational factors related to a test, such as keyboard type (Ling, 2017a, 2017b), keyboarding skills (Barkaoui, 2014), use of rubrics in writing assessment (Becker, 2016), provision of feedback (Diab, 2011; Khonbi & Sadeghi, 2012; Ross, Rolheiser, & Hogaboam-Gray, 1999; Sundeen, 2014), and intervals between two test administrations (Green, 2005). Depending on the research design and the nature of the comparisons of interest, different types of covariates have been used in these studies. In studies employing a pre-test and post-test design or a similar within-group design, scores from a pre-test of the same construct are usually used as a covariate to control for pre-existing differences in the targeted construct. Pre-test tasks are typically similar to post-test tasks in terms of format and requirements (Becker, 2016; Wagner, 2010). Dissimilar pre-tests are employed as well. For example, Ockey (2009) examined the impact of group members’ personalities on individuals’ test performance on an oral discussion test. Besides personalities, individuals’ oral proficiency affects their test scores. To control for its effect, Ockey measured participants’ speaking proficiency through a computer-scored speaking test delivered over phone calls (i.e., a test known as PhonePass SET-10) and included it as a covariate in the analysis. In this example, the pre-test was a phone-delivered speaking test, while the post-test was a group oral test. As observed in the selected papers, one group of covariates comprises test takers’ characteristics or constructs of individual difference that are theoretically relevant to the construct of interest (Bochner et al., 2016; Green, 2005; Xi, 2005). Unlike the covariate(s) used in a pre-test–post-test design, these non–pre-test covariates call for explicit justification for being included in the studies because they may shed light on the relationship between the covariate(s) and the construct of interest. In a study of visual-based speaking tasks, for example, Xi (2005) included two covariates, one being the general speaking proficiency measured by non-visual tasks in the same test and the other being graph familiarity measured by a questionnaire. To justify the inclusion of graph familiarity as a covariate, Xi (2005) explained that previous literature suggests that graph familiarity could potentially affect visual comprehension and thus possibly influence speaking performance in tasks with visuals as stimuli. In addition to the use of pre-tests of the same construct and/or individual differences as covariates, some studies utilized other related variables to control for variability along the ability continuum. In a study of the impact of keyboard type on writing performances on the TOEFL iBT (the Test of English as a Foreign Language

Internet-based Test), for example, Ling (2017a) decided to use the scores from three other sections of the test, namely, reading, listening, and speaking, as covariates. While Ling did not provide a specific rationale for his choices of covariates, controlling for these covariates would provide conditioned comparisons of writing scores across different groups of keyboard types, accounting for test takers’ abilities in other related areas.

Issues in the use of ANCOVA/MANCOVA in language assessment research

In reviewing the studies using ANCOVA and/or MANCOVA, we noticed some inconsistencies in the reporting practices related to assumption checks, effect sizes, and justification of the choice of covariates. Firstly, assumption checks were either absent or only partially reported in 15 studies of the selected pool. In other words, more than half of the sampled studies did not report their assumption checks for ANCOVA and/or MANCOVA or failed to report their checks adequately. In contrast, several studies provided a brief account of the assumptions required by ANCOVA. One example is Sundeen’s (2014, p. 83) study, which contains an explicit comment on the assumptions in his study of the effect of using instructional rubrics on writing quality: “The data met the requirements of preliminary checks to ensure that there was no violation of the assumptions of normality, linearity, homogeneity of variances, homogeneity of regression slopes, and reliable measurement of the covariate”. Of course, more details about each assumption check are desirable in reporting ANCOVA results. Secondly, a noticeable issue that emerged from the review is the reporting practice related to effect size. While the majority of the reviewed studies (14 out of 24) reported some index of effect size, these studies diverge in terms of the types of effect size reported. For example, eight studies reported partial eta-squared values or eta-squared values as the overall effect of independent variable(s) or covariate(s), and five other studies included Cohen’s d only for comparisons of group means. It is noteworthy that in the studies in which multiple comparisons were made (including the five studies mentioned above), confidence intervals were rarely reported for focused comparisons of group means. The information about effect sizes plays an important role in helping readers judge the practical significance of the results of a statistical test. In the data analysis examples, we demonstrate how to properly report effect sizes in ANCOVA/MANCOVA. The above two issues – lack of reporting on assumption checks and effect sizes – are similar to the observations of the use of ANCOVA in Keselman et al. (1998), who analyzed studies that employed ANOVA, ANCOVA, and/or MANOVA in educational research journals published in 1994 and 1995. In their review of 45 ANCOVA-based studies, 34 studies did not report their check of the normality assumption, and only 8 studies reported on the assumption of homogeneity of regression slopes. Only 11 studies reported an effect size, with 7 of these being standardized mean differences.

Lastly, the choice of covariates must be justified, especially in cases where the covariates were not a pre-test of an experiment or a similar test of the same construct. Such justifications should be explicitly spelled out in reference to theoretical underpinnings and/or relevant findings in the literature. In addition to theoretical justifications, identification of the covariates should be informed by empirical evidence of a strong association with the dependent variable(s). While a small number of studies did not provide sufficient explanation of their choice of covariates and some even failed to report the effect of their covariates, the majority of the studies provided some justification for including covariates. In their study of the effect of number of paragraphs and example types on test takers’ SAT essay scores, for example, Kobrin et al. (2011) chose the SAT (Scholastic Aptitude Test) Critical Reading and Writing multiple-choice scores as covariates and explained that “the essay features investigated in this study are related to the test takers’ writing ability, and it was of interest to ascertain the relationship of the essay features with essay scores irrespective of general writing ability” (p. 161). To some extent, the issues identified above can compromise the transparency of quantitative findings based on the results of ANCOVA/MANCOVA. Proper reporting practices help readers interpret the results with caution when a violation of an assumption is present, and they also make it easier for researchers to compare results across studies.

Sample study: Investigation of the impact of reading attitude on reading achievement

In the following, we provide two examples of analyses that demonstrate the type of research questions that could be addressed through ANCOVA and MANCOVA techniques, respectively. It starts with a description of the background of the demonstration analysis, followed by a description of the data source and analysis procedures, and concludes with reporting of results. Readers may view the Companion website for a tutorial that demonstrates these statistical procedures using SPSS for Windows version 24 (IBM Corporation, 2016).

Background

The data used for the demonstrations are obtained from the PISA 2009 database. PISA is a large-scale international testing program conducted by the Organisation for Economic Co-operation and Development (OECD) every 3 years. PISA assesses 15-year-old students’ performance on reading, math, and science literacy. In each administration, one of the three subject areas – reading, math, or science literacy – is chosen as the focus. It also collects information on students’ family backgrounds and their personal characteristics, as well as many school-level variables through questionnaires. As such, it yields a rich database that allows researchers to do cross-country comparisons and to explore relationships among the variables. For more details on the PISA database and its studies and reports, please see OECD (2000, 2012). For PISA 2009, the main focus was on reading

literacy, which is defined as “understanding, using, reflecting on and engaging with written texts, in order to achieve one’s goals, to develop one’s knowledge and potential, and to participate in society” (OECD, 2009, p. 23). It is generally agreed that reading plays a critical role in students’ academic achievement (OECD, 2009). However, noticeable gaps between female and male students’ reading achievement have been reported, with female students outperforming male students (Logan & Johnston, 2009). The difference in reading achievement has been associated with individual differences in reading attitude (McKenna, Conradi, Lawrence, Jang, & Meyer, 2012), based on the assumption that reading motivation exerts influence on reading behaviors, which in turn impact reading achievement. Similar to the gap in reading performance between female and male students, female students tend to show a more positive attitude toward reading than their male peers (McKenna et al., 2012). In a meta-analysis on the relationship between reading attitude and achievement, Petscher (2010) confirmed a moderate positive relationship between attitude toward reading and reading achievement based on an analysis of 32 studies. While the impact of reading attitude on reading achievement has been largely verified, it is not clear whether female and male students would have comparable reading achievement when the difference in reading attitude is controlled. Therefore, this demonstration study attempts to address the following two research questions using a sample of the Canadian data of PISA 2009.

Q1: Do Grade 11 Canadian boys and girls show any difference in their overall reading literacy performance after controlling for their attitude toward reading? (ANCOVA)

Q2: Do Grade 11 Canadian boys and girls show any difference in the three subscales of reading literacy after controlling for their attitude toward reading? (MANCOVA)

Method

Data source

The data used for the demonstrations are a Canadian subset of Grade 11 students who took the English version of the PISA test in 2009. In total, the sample size is 266 (54.1% female and 45.9% male). Among the variables selected for this example, 18 cases with missing data were observed on the covariate – attitude toward reading (JOYREAD in Table 9.2). The amount of missing data is less than 7%, and there is no statistically significant difference in reading achievement scores between students who have a score on reading attitude and those who do not (p > .05). For simplicity, we deleted the cases with missing responses. The final sample used in the following analyses includes scores of 248 students (53.2% female and 46.8% male). The demonstration dataset contains the students’ reading test scores, their attitude toward reading, and their self-identified sex. Table 9.2 gives an overview of the sample employed in this example and presents the descriptive statistics of the selected variables.
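A minimal data-preparation sketch in Python (pandas) is shown below. The CSV file name and the sex column label are hypothetical, while JOYREAD and the PV1 reading variable follow the PISA names listed in Table 9.2; the tutorial itself carries out these steps in SPSS.

import pandas as pd

# Hypothetical export of the Canadian Grade 11 subset described above.
df = pd.read_csv("pisa2009_canada_grade11.csv")

# Listwise deletion of the 18 cases missing the covariate (266 -> 248 students).
df = df.dropna(subset=["JOYREAD"])

# Descriptive statistics of the analysis variables by sex-group.
print(df.groupby("sex")[["PV1READ", "JOYREAD"]].agg(["mean", "std"]))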

Table 9.2  Descriptive Statistics of the Selected Sample

Variable name in the PISA dataset   Variable description          Mean     SD      Skewness   Kurtosis
PV1READ                             PV1 reading                   574.08   93.08   −0.19      −0.04
PV1READ1                            PV1 access and retrieve       561.62   93.36   −0.09      −0.02
PV1READ2                            PV1 integrate and interpret   568.18   95.44   −0.21      −0.20
PV1READ3                            PV1 reflect and evaluate      584.54   92.51   −0.36       0.01
JOYREAD                             Joy/Like Reading                0.51    1.08    0.15       0.40

Note: PV = plausible value, JOYREAD = attitude toward reading.

Measures of reading abilities and reading attitude

PISA uses the Rasch model to score students’ performance on cognitive assessments. Instead of using a single point estimate of proficiency for each student, PISA uses Plausible Values (PVs) that are drawn from posterior distributions of proficiency to show the scores that students are likely to attain. For technical details of how PVs are drawn in PISA, please refer to the PISA 2012 Technical Report (OECD, 2014) and von Davier, Gonzalez, and Mislevy (2009). In PISA 2009, each student received five PVs on their overall reading literacy performance as well as five PVs on each of the three reading subscales that comprise the ability to (a) access and retrieve information, (b) integrate and interpret the reading material, and (c) reflect on and evaluate the reading material (see Table 9.2). In this demonstration, the first PV of the scale and of each subscale is treated as an approximation of each individual’s proficiency on the corresponding scale and used in the following analyses. Reading attitude was measured using 11 four-point Likert scale items and reported as a single value for each participant based on a Rasch model analysis. We use reading attitude as a covariate based on the findings of previous research, which support a moderate positive correlation between reading attitude and reading achievement (Petscher, 2010).

Data analysis

To address the first research question (Q1), we ran a one-way ANCOVA with sex (female = 1 and male = 2) as an independent variable, reading attitude as the covariate, and the first PV of the reading scores (PV1 reading) as the dependent variable. To address the second research question (Q2), we ran a one-way MANCOVA with sex as an independent variable, reading attitude as a covariate, and the first PV of each of the three reading subscale scores (PV1 access and retrieve, PV1 integrate and interpret, PV1 reflect and evaluate) as the dependent variables.
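The tutorial on the Companion website runs these analyses in SPSS. As a point of comparison only, a minimal sketch of the Q1 model in Python (statsmodels) is given below, continuing with the hypothetical file and column names used in the earlier sketch.

import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

df = pd.read_csv("pisa2009_canada_grade11.csv")   # hypothetical file, as before
df = df.dropna(subset=["JOYREAD"])

# One-way ANCOVA: overall reading score regressed on sex (categorical) plus the covariate.
# Sum-to-zero contrasts keep the Type III sums of squares interpretable.
ancova = ols("PV1READ ~ C(sex, Sum) + JOYREAD", data=df).fit()
print(sm.stats.anova_lm(ancova, typ=3))   # Type III ANCOVA summary table
print(ancova.rsquared_adj)                # overall model fit (adjusted R-squared)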

Results

Descriptive statistics of the sample

Table 9.2 presents an overview of the variables included in the data analysis (sample size n = 248). As mentioned earlier, each participant received five PVs on the reading scale and its three subscales. The means of the PVs range from 561.62 to 588.38, and the standard deviations (SDs) range from 90.65 to 101.97. The relatively large SDs are the result of the wide scale range on which the PVs are reported. The descriptive statistics also suggest that the differences among the five PVs for each scale are fairly small. As suggested by the skewness and kurtosis values, the distributions of the PVs and the reading attitude (JOYREAD) are close to normal.

Q1: A one-way ANCOVA

Table 9.3 presents the means and SDs for the overall reading performance and reading attitude by sex-group. The descriptive statistics suggest that female students had better reading performance and enjoyed reading more than their male peers. This observation could be confirmed if we were to compare these two groups using an ANOVA on reading performance and ignore their differences in reading attitude (i.e., without any covariate). Nevertheless, the results from the one-way ANCOVA tell a different story. Table 9.4 presents the main output of the ANCOVA. From the output table, we can see that the model as a whole is statistically significant, F(2, 245) = 38.38, p < .001, indicating that the model explains the variability in students’ reading performance.

Table 9.3  Descriptive Statistics of Overall Reading Performance and Attitude by Sex-Group

Variable description   Female Mean   Female SD   Male Mean   Male SD
PV1 reading            587.97        92.43       558.27      91.66
JOYREAD                  0.87         1.03         0.11       0.99

Note: PV = plausible value; JOYREAD = attitude toward reading.

Table 9.4  ANCOVA Summary Table

Source               Sum of squares (Type III)   df   Mean square   F statistic   p-Value   Partial eta-squared
Corrected Model      510465.39                   2    255232.69     38.38         0.000     0.24
Attitude (JOYREAD)   455991.88                   1    455991.88     68.57         0.000     0.22
Sex                     426.39                   1       426.39      0.064        0.800     0.00

Table 9.5  Estimated Marginal Means

Sex      Mean     Standard error   95% CI lower bound   95% CI upper bound
Female   572.76   7.33             558.32               587.20
Male     575.57   7.86             560.10               591.04

The adjusted R2 is 0.23, suggesting that the model explains 23% of the variability observed in students’ overall reading performance. Also, note that the F-test value for the variable “Sex” is no longer significant, F(1, 245) = 0.064, p = .80, partial eta-squared < 0.001. This result suggests that after considering the effect of reading attitude (i.e., “JOYREAD”), there is no longer a significant effect of sex on the average reading performance. As research contexts are relevant in interpreting these effects, we also recommend that researchers interpret the effect size by comparing their results with those reported in the available literature (Plonsky & Oswald, 2014). Table 9.5 displays the adjusted means for each sex-group. The adjusted means are predicted group means adjusted for the effect of the covariate(s). In this example, the adjusted means are the average reading scores, adjusted for students’ reading attitude, for female and male students, respectively. Consistent with the F-test results, the adjusted mean reading scores do not differ significantly for female and male students (adjusted mean reading scores 572.76 vs. 575.57), as evident in the overlap between the 95% confidence intervals of the two group means. By comparing the original (Table 9.3) and adjusted (Table 9.5) group means, we can get some insight into the effect of the covariate. In this example, once adjusted for reading attitude, the reading test score gap observed between female and male students disappears. In this example, our grouping variable only has two levels, so we do not need to run a post hoc test. Levene’s test is used to check whether there is equal variance between the female and male groups. In this example, Levene’s test is non-significant (p = .733), which suggests that there is no evidence to reject the assumption of homogeneity of variance. To check the assumption of normality of residuals, we use scatter plots. Figure 9.3 presents a matrix of scatter plots. This scatter plot matrix is organized in a symmetric way, that is, the three scatter plots below the diagonal are identical to their counterparts presented above the diagonal. One of the most useful plots for examining the normal distribution of the residuals is the one with the predicted values on the x-axis and the residuals on the y-axis. This plot is presented in the middle column in Figure 9.3, at the bottom row.4 The residuals in this plot cluster toward the center and are distributed roughly symmetrically, suggesting that the normality of residuals assumption has been met. In general, to support the assumption of normality of residuals, researchers would expect the distribution of the residuals in this plot to show no clear pattern. Ideally, the residuals


Figure 9.3  A matrix of scatter plots.

should cluster toward the middle of the plot (i.e., where the standardized residuals are close to zero), and they should distribute symmetrically. Scatter plots of predicted and observed values of the reading scores have also been included in Figure 9.3 (i.e., first column at the middle row and middle column at the top row). This group of scatter plots gives researchers some sense of the accuracy of the fitted ANCOVA model. A moderate to strong correlation between the model’s predicted values and the observed values is expected when the model is accurate. The last group of scatter plots included in Figure 9.3 are the ones plotted between observed values of the reading scores and the standardized residual (i.e., first column at the bottom row and last column at the top row). These scatter plots are useful in detecting outliers. If a residual appears to be far away from the regression line in the plot, then that data point may be suspected as a possible outlier. Another assumption we check for ANCOVA models is the homogeneity of regression slopes. To test this assumption, we run a regression model with the main effects of sex and reading attitude as well as their interaction term (Sex × JOYREAD). The results of the F test associated with the interaction term are what we are looking for. If the effect of the interaction term (Sex × JOYREAD) is statistically significant (e.g., p < .05), then the assumption of homogeneity of regression slopes is violated. In this example, this interaction effect of Sex × JOYREAD is non-significant, F(1, 244) = 0.011, p = .916, and thus, the assumption

is met. The non-significant interaction term of Sex × JOYREAD suggests that the relationship between reading attitude and reading test scores is similar for the two sex groups. Therefore, our original ANCOVA model without the interaction term is appropriate for this dataset.
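A minimal Python sketch of the three checks just described (Levene’s test, the residual plot, and the interaction test for homogeneity of regression slopes) is shown below. It reuses the hypothetical file and column names introduced earlier and is not the SPSS procedure followed in the tutorial.

import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
from scipy import stats
from statsmodels.formula.api import ols

df = pd.read_csv("pisa2009_canada_grade11.csv").dropna(subset=["JOYREAD"])
ancova = ols("PV1READ ~ C(sex, Sum) + JOYREAD", data=df).fit()

# Levene's test for equality of variances across the two sex groups (coded 1 and 2 as in the text).
female = df.loc[df["sex"] == 1, "PV1READ"]
male = df.loc[df["sex"] == 2, "PV1READ"]
print(stats.levene(female, male))

# Standardized residuals against predicted values (the key panel of Figure 9.3).
std_resid = ancova.get_influence().resid_studentized_internal
plt.scatter(ancova.fittedvalues, std_resid)
plt.axhline(0)
plt.xlabel("Predicted PV1READ")
plt.ylabel("Standardized residual")
plt.show()

# Homogeneity of regression slopes: add the Sex x JOYREAD interaction and test it.
slopes = ols("PV1READ ~ C(sex, Sum) * JOYREAD", data=df).fit()
print(sm.stats.anova_lm(slopes, typ=3))   # a non-significant interaction row supports the assumption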

Q2: A one-way MANCOVA

Table 9.6 presents the raw means and SDs of the reading subscale PVs and reading attitude scores by sex-group. Table 9.7 presents the main output of the MANCOVA, which includes the F-test results and partial eta-squared for the independent variable (sex) and the covariate (JOYREAD). As shown in the table, neither the independent variable sex nor the covariate reading attitude (i.e., JOYREAD) is statistically significant based on the multivariate test results. These results show that there are no sex differences across the three reading subscale scores after controlling for students’ reading attitude. Box’s M test is used to test the assumption of equal covariance matrices. In our example, the results from Box’s test show that the data meet the assumption of homogeneity of covariance matrices, Box’s M = 3.07, F(6, 419310.89) = 0.51, p = .805. The assumption of homogeneity of regression slopes can be checked by a customized MANCOVA model, which includes an interaction term between the independent variable and the covariate.
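As a minimal Python counterpart of the Q2 model, statsmodels’ multivariate routine fits the same design and reports the four multivariate statistics discussed earlier; the data frame and column names are the hypothetical ones used in the previous sketches.

import pandas as pd
from statsmodels.multivariate.manova import MANOVA

df = pd.read_csv("pisa2009_canada_grade11.csv").dropna(subset=["JOYREAD"])

# Three subscale scores as dependent variables, sex as the factor, JOYREAD as the covariate.
mancova = MANOVA.from_formula(
    "PV1READ1 + PV1READ2 + PV1READ3 ~ C(sex) + JOYREAD", data=df
)
print(mancova.mv_test())   # Pillai's trace, Wilks' lambda, Hotelling's trace, Roy's largest root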

Table 9.6  Descriptive Statistics of Reading Performance and Attitude by Sex-Group

Variable description          Female Mean   Female SD   Male Mean   Male SD
PV1 access and retrieve       563.97        90.78       558.96      96.54
PV1 integrate and interpret   577.49        92.28       557.59      98.23
PV1 reflect and evaluate      599.01        86.48       568.08      96.69
JOYREAD                         0.87         1.03         0.11       0.99

Note: PV = plausible value, JOYREAD = attitude toward reading.

Table 9.7  MANCOVA Summary Table

Effect               Pillai’s trace   F statistic   df (effect)   df (error)   p-Value   Partial eta-squared
Sex                  0.095             8.47         3             243          0.095     0.10
Attitude (JOYREAD)   0.228            23.90         3             243          0.228     0.23

Table 9.8  Summary Table for ANCOVAs of Each Reading Subscale

Source               Dependent variable              Sum of squares (Type III)   df   Mean square   F statistic   p-Value   Partial eta-squared
Overall Model        PV1 access & retrieve (a)       404334                      2    202167        28.33         0.000     0.19
                     PV1 integrate & interpret (b)   524082                      2    262041        37.21         0.000     0.23
                     PV1 reflect & evaluate (c)      415025                      2    207512        29.93         0.000     0.20
Sex                  PV1 access & retrieve            35246                      1     35246         4.94         0.027     0.02
                     PV1 integrate & interpret        10794                      1     10794         1.53         0.217     0.01
                     PV1 reflect & evaluate             261                      1       261         0.04         0.846     0.00
Attitude (JOYREAD)   PV1 access & retrieve           402783                      1    402783        56.44         0.000     0.19
                     PV1 integrate & interpret       499640                      1    499640        70.94         0.000     0.23
                     PV1 reflect & evaluate          355972                      1    355972        51.33         0.000     0.17

Notes: (a) R-squared = .19; adjusted R-squared = .18. (b) R-squared = .23; adjusted R-squared = .23. (c) R-squared = .20; adjusted R-squared = .19.

In this example, the homogeneity of regression slopes assumption has been met, Wilks’ Lambda = 0.99, F(3, 242) = 1.13, p = .339. When researchers are interested in studying the effect of sex on each of the three reading subscale scores separately, they can run a univariate ANCOVA for each subscale independently. In some cases, researchers may also choose to run univariate analyses as follow-ups to the MANCOVA. The results of the ANCOVAs are presented in Table 9.8. As shown in the table, if we run three ANCOVA tests independently for each reading subscale score, the sex difference is only statistically significant for the subscale that measures “access and retrieve information” in reading, F(1, 245) = 4.74, p = .027. The scores on the other two subscales, which measure “integrate and interpret” and “reflect and evaluate” skills in reading, yield no sex differences after controlling for students’ reading attitude. However, the homogeneity of regression slopes assumption has been violated when the three dependent variables are tested separately by ANCOVA. When the univariate analysis (i.e., ANCOVA with an interaction term between sex and reading attitude) is conducted for each dependent variable separately, the results show

that the interaction effect is statistically significant in all three models. For the access and retrieve reading subscale, F(1, 244) = 260.28, p < .001; for the integrate and interpret reading subscale, F(1, 244) = 1675.97, p = .001; and for the reflect and evaluate reading subscale, F(1, 244) = 7360.32, p = .004. When the interaction effect of the independent variable and the covariate is statistically significant, we should not interpret the main effects of each variable. In such cases, instead of running a one-way ANCOVA test, researchers may consider using a regression model (see Chapter 10, this volume) or path analysis (see Chapter 1, Volume 2), which allows the researcher to include interaction effects.
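A sketch of these follow-up checks in Python is given below: one ANCOVA with a Sex × JOYREAD interaction per subscale, again using the hypothetical file and column names from the earlier sketches rather than the chapter’s SPSS procedure.

import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

df = pd.read_csv("pisa2009_canada_grade11.csv").dropna(subset=["JOYREAD"])

for subscale in ["PV1READ1", "PV1READ2", "PV1READ3"]:
    model = ols(f"{subscale} ~ C(sex, Sum) * JOYREAD", data=df).fit()
    print(subscale)
    print(sm.stats.anova_lm(model, typ=3))   # inspect the interaction row for each subscale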

Summary of results

A sex-group gap in students’ academic achievement and attitudes in reading has been repeatedly reported (Logan & Johnston, 2009; McKenna et al., 2012). The findings of this demonstration study offer some insight into the gap between female and male students in reading achievement. That is, when attitude toward reading is controlled, female and male students from the Canadian participant group do not significantly differ in their reading performance.

Conclusion

In quasi-experimental and correlational research designs, which are popular in language assessment research, ANCOVA and MANCOVA should be preferred over ANOVA and MANOVA because groups tend to differ not only on the independent variables but also on other variables, even after an effort at random assignment. In other words, variables other than the independent variables may also affect the observed outcomes. With the inclusion of covariates, ANCOVA and MANCOVA are valuable tools for making comparisons of group means more rigorous and meaningful, as these techniques control for the effects of covariates. Researchers should choose the data analysis technique that best suits their research question and research design. Due to the constraints of space, this chapter only illustrates an analysis of PISA data using one-way ANCOVA and one-way MANCOVA in SPSS. The reported GLM procedures in SPSS can be easily extended to other types of ANCOVA, such as factorial ANCOVA and repeated measures ANCOVA, and to the corresponding MANCOVA counterparts. We refer interested readers to Field (2009) and Tabachnick and Fidell (2013) for more detailed instructions on how to conduct these analyses in SPSS. Despite their usefulness, we find that both ANCOVA and MANCOVA are still underused techniques for comparing means in language assessment research, compared with ANOVA and t-tests. Like any other statistical test, good use of ANCOVA and/or MANCOVA requires close attention to their assumptions as well as adherence to suggested reporting practices. By providing an overview of these two techniques with data analysis demonstrations, we hope to break down some barriers to using ANCOVA and MANCOVA for students, practitioners, and researchers.


Notes

1 A categorical variable is one that has a limited and usually fixed number of possible values. Examples of categorical variables include sex, marital status, and ethnicity.
2 A continuous variable has an infinite number of possible values. Variables such as income, temperature, height, weight, and test score are often treated as continuous variables.
3 Note that the covariate in this case cannot be the baseline pre-test reading performance because the person-specific variation has already been removed by using a repeated measures analysis with baseline scores as one of the outcome measures.
4 This is the same as the plot presented in the last/right column at the middle row.

References *Bae, J., & Lee, Y.-S. (2011). The validation of parallel test forms: “Mountain” and “beach” picture series for assessment of language skills. Language Testing, 28(2), 155–177. doi:10.1177/0265532210382446 *Barkaoui, K. (2014). Examining the impact of L2 proficiency and keyboarding skills on scores on TOEFL-iBT writing tasks. Language Testing, 31(2), 241–259. doi:10.1177/0265532213509810 *Becker, A. (2016). Student-generated scoring rubrics: Examining their formative value for improving ESL students’ writing performance. Assessing Writing, 29, 15–24. doi:10.1016/j.asw.2016.05.002 *Bochner, J. H., Samar, V. J., Hauser, P. C., Garrison, W. M., Searls, J. M., & Sanders, C. A. (2016). Validity of the American Sign Language Discrimination Test. Language Testing, 33(4), 473–495. doi:10.1177/0265532215590849 *Bridgeman, B., Trapani, C., & Bivens-Tatum, J. (2011). Comparability of essay question variants. Assessing Writing, 16(4), 237–255. doi:10.1016/j.asw.2011.06.002 Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ: Lawrence Erlbaum Associates. *Diab, N. M. (2011). Assessing the relationship between different types of student feedback and the quality of revised writing. Assessing Writing, 16(4), 274–292. doi:10.1016/j.asw.2011.08.001 *Erling, E. J., & Richardson, J. T. E. (2010). Measuring the academic skills of university students: Evaluation of a diagnostic procedure. Assessing Writing, 15(3), 177–193. doi:10.1016/j.asw.2010.08.002 Faul, F., Erdfelder, E., Lang, A. G., & Buchner, A. (2007). G* Power 3: A flexible statistical power analysis program for the social, behavioral, and biomedical sciences. Behavior Research Methods, 39(2), 175–191. doi:10.3758/BF03193146 Field, A. (2009). Discovering statistics using SPSS. London: SAGE Publications. *Green, A. B. (2005). EAP study recommendations and score gains on the IELTS Academic Writing Test. Assessing Writing, 10(1), 44–60. doi:10.1016/j. asw.2005.02.002 *Huang, S.-C. (2010). Convergent vs. divergent assessment: Impact on college EFL students’ motivation and self-regulated learning strategies. Language Testing, 28(2), 251–271. doi:10.1177/0265532210392199 *Huang, S.-C. (2015). Setting writing revision goals after assessment for learning. Language Assessment Quarterly, 12(4), 363–385. doi:10.1080/15434303.2015. 1092544

Application of ANCOVA and MANCOVA  217 Huberty, C. J., & Morris, J. D. (1989). Multivariate analysis versus multiple univariate analyses. Psychological Bulletin, 105(2), 302–308. doi:10.1037/0033-2909.105.2.302 IBM Corporation (2016). SPSS for Windows (Version 24). Armonk, NY: IBM Corporation. Keselman, H. J., Huberty, C. J., Lix, L. M., Olejnik, S., Cribbie, R. A., Donahue, B., . . . & Levin, J. R. (1998). Statistical practices of educational researchers: An analysis of their ANOVA, MANOVA, and ANCOVA analyses. Review of Educational Research, 68(3), 350–386. doi:10.3102/00346543068003350 *Khonbi, Z., & Sadeghi, K. (2012). The effect of assessment type (self vs. peer vs. teacher) on Iranian University EFL students’ course achievement. Language Testing in Asia, 2(4), 47–74. doi:10.1186/2229-0443-2-4-47 *Kobrin, J. L., Deng, H., & Shaw, E. J. (2011). The association between SAT prompt characteristics, response features, and essay scores. Assessing Writing, 16(3), 154– 169. doi:10.1016/j.asw.2011.01.001 *Ling, G. (2017a). Are TOEFL iBT® writing test scores related to keyboard type? A survey of keyboard-related practices at testing centers. Assessing Writing, 31, 1–12. doi:10.1016/j.asw.2016.04.001 *Ling, G. (2017b). Is writing performance related to keyboard type? An investigation from examinees’ perspectives on the TOEFL iBT. Language Assessment Quarterly, 14(1), 36–53. doi:10.1080/15434303.2016.1262376 Logan, S., & Johnston, R. (2009). Gender differences in reading ability and attitudes: Examining where these differences lie. Journal of Research in Reading, 32(2), 199–214. doi:10.1111/j.1467-9817.2008.01389.x Mair, P., and Wilcox, R. (2017). WRS2: A collection of robust statistical methods [R package version 0.9–2]. Retrieved from http://CRAN.R-project.org/ package=WRS2 McKenna, M. C., Conradi, K., Lawrence, C., Jang, B. G., & Meyer, J. P. (2012). Reading attitudes of middle school students: Results of a U.S. survey. Reading Research Quarterly, 47(3), 283–306. doi:10.1002/rrq.021 Nicol, A. A. M., & Pexman, P. M. (2010). Presenting your findings: A practical guide for creating tables. Washington, DC: American Psychological Association. *Ockey, G. J. (2009). The effects of group members’ personalities on a test taker’s L2 group oral discussion test scores. Language Testing, 26(2), 161–186. doi:10.1177/0265532208101005 Organisation for Economic Co-operation and Development (OECD). (2000). Measuring student knowledge and skills: The PISA 2000 assessment of reading, mathematical and scientific literacy. Paris: OECD Publishing. Organisation for Economic Co-operation and Development (OECD). (2009). PISA 2009 assessment framework key competencies in reading, mathematics and science programme for international student assessment. Paris: OECD Publishing. Organisation for Economic Co-operation and Development (OECD). (2012). PISA 2009 technical report. Paris: OECD Publishing. Organisation for Economic Co-operation and Development (OECD). (2014). PISA 2012 technical report. Paris: OECD Publishing. Petscher, Y. (2010). A meta-analysis of the relationship between student attitudes towards reading and achievement in reading. Journal of Research in Reading, 33(4), 335–355. doi:10.1111/j.1467-9817.2009.01418.x Plonsky, L. (2013). Study quality in SLA: An assessment of designs, analyses, and reporting practices in quantitative L2 research. Studies in Second Language Acquisition, 35(4), 655–687. doi:10.1017/S0272263113000399

218  Zhi Li and Michelle Y. Chen Plonsky, L., & Oswald, F. L. (2014). How big is “big”? Interpreting effect sizes in L2 research. Language Learning, 64(4), 878–912. doi:10.1111/lang.12079 *Riazi, A. M. (2016). Comparing writing performance in TOEFL-iBT and academic assignments: An exploration of textual features. Assessing Writing, 28, 15–27. doi:10.1016/j.asw.2016.02.001 *Ross, J. A., Rolheiser, C., & Hogaboam-Gray, A. (1999). Effects of self-evaluation training on narrative writing. Assessing Writing, 6(1), 107–132. doi:10.1016/ S1075-2935(99)00003-3 *Shohamy, E., & Inbar, O. (1991). Validation of listening comprehension tests: The effect of text and question type. Language Testing, 8(1), 23–40. doi:10.1177/ 026553229100800103 *Sundeen, T. H. (2014). Instructional rubrics: Effects of presentation options on writing quality. Assessing Writing, 21, 74–88. doi:10.1016/j.asw.2014.03.003 Tabachnick, B. G., & Fidell, L. S. (2013). Using multivariate statistics (4th ed.). Boston, MA: Allyn and Bacon. von Davier, M., Gonzalez, E., & Mislevy, R. (2009). What are plausible values and why are they useful. In M. von Davier & Hastedt (Eds.), Issues and methodologies in large-scale assessments. IERI Monograph Series (pp. 9–36). Princeton, NJ: International Association for the Evaluation of Educational Achievement and Educational Testing Service. *Wagner, E. (2010). The effect of the use of video texts on ESL listening test-taker performance. Language Testing, 27(4), 493–513. doi:10.1177/0265532209355668 *Windsor, J. (1999). Effect of semantic inconsistency on sentence grammaticality judgements for children with and without language-learning disabilities. Language Testing, 16(3), 293–313. doi:10.1177/026553229901600304 *Xi, X. (2005). Do visual chunks and planning impact performance on the graph description task in the SPEAK exam? Language Testing, 22(4), 463–508. doi:10.1191/0265532205lt305oa *Zeidner, M. (1986). Are English language aptitude tests biased towards culturally different minority groups? Some Israeli findings. Language Testing, 3(1), 80–98. doi:10.1177/026553228600300104 *Zeidner, M. (1987). A comparison of ethnic, sex and age bias in the predictive validity of English language aptitude tests: Some Israeli data. Language Testing, 4(1), 55–71. doi:10.1177/026553228700400106

10 Application of linear regression in language assessment Daeryong Seo and Husein Taherbhai

Introduction

Regression analyses are methods that analyze the variability in a dependent variable (DV) as a result of an observed or an induced change in one or more independent variables (IVs). Linear regression analysis, among the various regression analytical methods, is one method that is widely used by practitioners to study linear, additive relationships between the IVs and the DV. This method is often applied to explore the effect of indicators on an outcome and as a tool for predicting future outcomes with single or multiple variables. When a single variable is used to predict an outcome or assess its validity in the regression model, the model is known as a simple linear regression (SLR) model. On the other hand, when multiple variables are used as IVs, the model is generally referred to as a multiple linear regression (MLR) model. For MLR models, the prediction for the DV is a straight-line function of each of the IVs while holding the other IVs fixed. The constant slope of each individual IV describes its straight-line relationship with the DV and is referred to as the regression coefficient. In other words, each coefficient is the change in the predicted value of the DV per unit of change in the IV that is examined, while holding the other IVs constant. Besides the regression coefficients, there is one other parameter that needs to be estimated from the data for both the SLR and MLR models. This stand-alone constant, without the multiplicative effect of an IV attached to it, is the so-called intercept in the regression model. In general education and assessment, there are a variety of common factors that have an impact on student achievement. For example, students’ socioeconomic status, their parents’ education, peer pressure, and other factors influence student achievement regardless of the subject being measured (Linn, 2008; Van Roekel, 2008). However, each academic subject, such as math and science, may have certain unique indicators of its achievement that may not be common across different subjects or different students. In this context, the language spoken at home is likely a factor in English language proficiency (ELP) examination performance for English language learners (ELLs). However, this variable may not have the same impact on non-ELL students’ performance in a math or science examination.

English language learners (ELLs) are generally defined as students whose native language is other than English. In addition, any student who is hindered in academic achievement because of a lack of proficiency in English can also be classified as an ELL (Seo & Taherbhai, 2018). Basically, ELLs’ ELP assessments are based on achievement in the four distinct modalities that are required for language acquisition, i.e., reading, writing, speaking, and listening (Gottlieb, 2004). This chapter provides an overview of linear regression analysis and uses variables that might be included in conceptualizing a model for the prediction of ELLs’ performance on an English language arts (ELA) examination.

Simple linear regression (SLR) analysis

A common statistic to show the strength of a relationship between two variables is the correlation coefficient. Although a correlation coefficient is a single statistic that describes the strength of the relationship between two variables, it does not necessarily imply causality. However, when the relationship is used in a directive manner, with one variable being an indicator or a predictor of the occurrence of the other variable, regression analysis is used. There are many types of regression analysis, such as linear regression, polynomial regression, ridge regression (James, Witten, Hastie, & Tibshirani, 2013), and others, whose use depends on the nature of the data available, the number of outcome variables used, and the type of inferences the researcher needs to make. In linear regression, which is the topic of this chapter, there is only one outcome variable, commonly known as the dependent variable (DV), which is continuous as a measuring unit. A continuous variable is a variable that has an infinite number of possible values, such as 2.20, 2.256, and 2.2563, between two units of measurement, say from 2 to 3. The variables that influence the variation in the DV are known as independent variables (IVs). IVs can be continuous or discrete, where a discrete variable is defined as one that can take only specific values such as whole numbers. Depending on the type of regression model used, there are different statistical formulations or methods used in data analysis. Since the SLR has only one IV, the statistical equation for SLR is the following:

Yi = β0 + β1Xi + εi,  (10.1)

where β0 = intercept, β1 = slope, and εi = error term. In the above equation, Y is the DV while X is the IV, and i refers to the student being examined. The equation provides information on how changes in X will affect the expected value of Y for each student, based on their score on the IV. When the regression line is graphed on a two-dimensional chart representing the X and Y axes, β0 is the point where the regression line meets the Y axis (see Figure 10.1). This Y intercept can be interpreted as the predicted value of Y if the value of X is equal to zero (X = 0). However, if X is not 0, then β0 has no meaningful interpretation other than anchoring the regression line at the appropriate place on


Figure 10.1 Plot of regression line graphed on a two-dimensional chart representing X and Y axes. β0 is the point where the regression line meets the Y axis, i.e., at 113.6.

the Y axis. β1, on the other hand, represents the difference in the predicted value of Y for each one-unit difference in X, provided that X is a continuous variable. If X is a categorical variable (coded, say, 0 for males and 1 for females), however, a one-unit difference in X represents switching from one category to the other. In this context, β1 is the average difference between the category for which X = 0 (the reference group) and the category for which X = 1 (the comparison group). The error term in the equation, εi, represents the difference between the predicted value and the actual value and is assumed to be normally distributed. Its inclusion in the equation essentially means that the model is not completely accurate and produces differing results in real-world applications, i.e., it is included because the actual relationship between the IVs and the DV is not fully represented by the model. Equation 10.1 is based on the assumption that a straight line is the best approximation of all the different values of the IVs and the corresponding values of the DV. Although this is easy to decipher with a small sample size, it is often not possible to detect linearity visually when the sample size is large. In other words, a straight line has to be mathematically derived to fit the hundreds or more student scores in order to capture the relationship between the IVs and the DV. Many procedures have been developed for fitting the linear regression line and estimating its parameters (i.e., the regression coefficients and the intercept). Among the most common and easiest to conceptualize is the method of ordinary least squares (OLS).

In OLS, the mathematically derived line is drawn in a manner whereby the sum of squares of the distances between the observed points and the regression line (the residuals) is minimized. The parameter estimates derived by the OLS method are the unique values that minimize the sum of squared errors, and with an intercept in the model, the residuals have a mean of zero within the sample of data to which the model is fitted.
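As a minimal illustration (not taken from this chapter’s materials), the sketch below fits an SLR model by OLS in Python using statsmodels; the data file and the ELA/ELP column names are hypothetical stand-ins for the kinds of scores discussed in this chapter.

import pandas as pd
from statsmodels.formula.api import ols

# Hypothetical dataset of ELL scores with ELA and ELP columns.
df = pd.read_csv("ell_scores.csv")

# Simple linear regression fitted by ordinary least squares; the intercept is included by default.
slr = ols("ELA ~ ELP", data=df).fit()

print(slr.params)         # beta_0 (Intercept) and beta_1 (slope for ELP)
print(slr.resid.mean())   # with an intercept, residuals average to (numerically) zero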

Multiple linear regression (MLR) analysis

When there is more than one IV in the linear regression equation, the regression equation is labeled a multiple linear regression (MLR) model. MLR with three IVs, for example, can be mathematically formulated as:

Yi = β0 + β1Xi1 + β2Xi2 + β3Xi3 + εi.  (10.2)

In MLR, as stated earlier, the prediction of the difference in Y with a corresponding change in X1, as shown in Equation 10.2, is contingent upon X2 and X3 remaining constant. Y will differ by β1 units on average for a one-unit difference in X1 if X2 and X3 do not differ. When one or more IVs are categorical, however, it is the average difference in Y between categories that is meaningful. In other words, the coefficient for the categorical variable represents the average difference in Y between the category for which the variable = 0 (the reference group) and the category for which the variable = 1 (the comparison group).

Modeling interactions

Because the effect of one IV on the DV may depend to some degree on the values of other IVs, interaction terms are often added to the regression model in order to expand understanding of the relationships among the variables in the model and to test different hypotheses. The interaction effect is best explained by an example. Suppose one were to hypothesize that students’ scores on an English language arts (ELA) examination are a function of their scores on an ELP examination and the years the students have been in the ELP program (notated as Yr_ELP). While the prediction of ELA with ELP is substantively clear, the inclusion of Yr_ELP may need clarification. Yr_ELP is generally assumed to have a positive effect on ELP scores. However, it could also be possible that Yr_ELP has a stagnant effect or a decreasing rate of achieving proficiency after an ELL has remained several years in the program. If one were to assume that 5 years is the average time spent in ELP classes before proficiency is achieved for the ELL to function adequately in other areas of academic achievement, then those students who go beyond the 5 years can be viewed as taking a little longer to achieve proficiency (a positive effect of Yr_ELP), or it might be that they have reached a plateau where other unmodeled variables may be affecting students’ proficiency achievement. In other words, there may be an interaction effect between Yr_ELP and ELP that cannot be captured by treating Yr_ELP and ELP as independent predictors of ELA.

Mathematically, the interaction effect is tested by adding a term to the model in which the two or more IVs are multiplied. The presence of a significant interaction between ELP scores and Yr_ELP would indicate that the effect of ELP on ELA is different at different values of Yr_ELP, that is, at different lengths of time students have been in the ELP program. The modeling of the interaction term in the above example would take the following mathematical formulation:

ELA = β0 + β1ELP + β2Yr_ELP + β3ELP*Yr_ELP + εi,  (10.3)

where ELP*Yr_ELP is the interaction term. It should be understood that all linear regression models, such as the ones depicted by Equations 10.1 to 10.3, function under certain assumptions (Poole & O’Farrell, 1971) that are discussed in the next section.
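A minimal sketch of Equation 10.3 in Python’s formula interface is given below; the term ELP * Yr_ELP expands into the two main effects plus their product, so the fitted model mirrors the equation above. The data file and its column names are hypothetical.

import pandas as pd
from statsmodels.formula.api import ols

df = pd.read_csv("ell_scores.csv")   # hypothetical file with ELA, ELP, and Yr_ELP columns

# "ELP * Yr_ELP" expands to ELP + Yr_ELP + ELP:Yr_ELP, mirroring Equation 10.3.
interaction_model = ols("ELA ~ ELP * Yr_ELP", data=df).fit()
print(interaction_model.summary())   # the ELP:Yr_ELP row tests the interaction effect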

Assumptions in linear regression

One particular assumption in standard regression analyses is that the IVs are measured without error. In addition to this assumption, there are other assumptions for linear regression, including linearity, independence of errors, homoscedasticity, normality, and collinearity (Osborne & Waters, 2002), that are discussed here.

Linearity

Linearity, as alluded to earlier, is the fundamental assumption in the application of linear regression. If linearity is violated, the parameter estimates and the tests for significance could be biased, which would threaten the meaning of the parameters estimated in the analysis (Keith, 2006). The prevalent method of preventing non-linearity is through the use of empirical or theory-based knowledge. As Benoit (2011) cautions, if knowledge indicates the relationship to be curvilinear, then one should use a polynomial regression instead of a linear model. While non-linearity with one IV is observed via examination of the plot of the dependent variable against the predictor variable, multivariate non-linearity in MLR is detected through an examination of the residual plots, i.e., the plots of the standardized residuals1 as a function of predicted values (Stevens, 1992). Linearity in MLR is based on the realization that the error for any given observation is implied to be unpredictable. One can, however, determine whether a series of residuals is consistent with random error by examining the residual plot. If there is no departure from linearity, there would be a random scatter about the horizontal line (Stevens, 1992). As shown in Figure 10.2, the residuals are randomly distributed about the horizontal line throughout the range of predicted values, with an approximately constant variance,2 which implies there is a linear relationship between the DV and the IVs. In Figure 10.3, on the other hand, a curvilinear line would likely fit the plot better than a straight line.


Figure 10.2 Plot of residuals vs. predicted Y scores where the assumption of linearity holds for the distribution of random errors.


Figure 10.3 Plot of residuals vs. predicted Y scores where the assumption of linearity does not hold.

Independence of errors

The assumption of independence of errors requires that there is little or no autocorrelation in the data. Autocorrelation occurs when the residuals are not independent of each other on different occasions of measurement, such as in a longitudinal dataset collected over time. There are various methods for detecting and remedying autocorrelation, but they are technical in nature and outside the scope of this chapter. However, when a particular set of data is collected only once at any given time (as in this chapter), the independence assumption is generally assumed to be met.

Multicollinearity

Multicollinearity refers to the violation of the assumption that the IVs are uncorrelated with one another (Darlington, 1968; Keith, 2006). MLR is designed to allow IVs to be correlated to some degree with one another (Hoyt, Leierer, & Millington, 2006). However, IVs are supposed to be more highly correlated with the DV than with other IVs. If this condition is not satisfied, multicollinearity is likely present in the regression model (Poole & O’Farrell, 1971). Multicollinearity makes the prediction equation unstable (Stevens, 1992). Such a condition would inflate standard errors and reduce the power of the tests of the regression coefficients, thereby creating a need for larger sample sizes (Jaccard, Guilamo-Ramos, Johansson, & Bouris, 2006; Keith, 2006). It would also make determining the importance of a variable difficult because the effects of the predictors would be confounded by the correlations among them. Among the various statistical methods used to identify multicollinearity, the variance inflation factor (VIF) is often used to assess the degree of multicollinearity in regression. The VIF statistic is the ratio of the variance of a coefficient in a model with multiple terms to its variance in a model with that term alone. It quantifies the severity of multicollinearity in an ordinary least squares regression analysis. VIF is an index that is incorporated in many statistical packages such as SAS (Yu, 2016). The statistic has a lower bound of 1 but no upper bound. There is no universal acceptance of what constitutes a good VIF value, although authorities such as Shieh (2010) state that VIF ≥ 10 generally indicates that there is evidence of multicollinearity. On the other hand, other experts become concerned when a VIF is greater than 2.50, which corresponds to an R2 of 0.60 with the other variables (Allison, 2012). The other common statistic used to measure multicollinearity is the tolerance statistic, in which the variable in question acts as the new dependent variable and is regressed on the rest of the IVs. Generally speaking, T < 0.2 is a signal that there might be multicollinearity in the data, and T < 0.1 indicates a very high likelihood of multicollinearity in the model (Keith, 2006). One way to resolve multicollinearity is to combine “overlapping” variables in the regression analysis (Keith, 2006). For example, ELL students’ reading and writing scores may be combined to create a single variable under the label of academic language. Other methods of resolving multicollinearity include the transformation of the scales of the correlated IVs through such methods as using

logs, squaring values, etc. In many such cases, however, the transformation introduces a polynomial term into the model and can change the meaning of the regression coefficient in the prediction of the DV, in which case the ridge regression model may have to be used. Ridge regression avoids the problems that least squares regression has in dealing with multicollinear data, partly because it does not require unbiased estimators: it adds just enough "bias" to make the estimates reasonably reliable approximations of the true population values.
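In SAS, both diagnostics can be requested directly in PROC REG through the VIF and TOL options on the MODEL statement. The sketch below is a minimal illustration rather than code from the sample study; the dataset and variable names are hypothetical:

PROC REG DATA = mydata;                /* hypothetical dataset                             */
  MODEL Y = X1 X2 X3 / VIF TOL;        /* print VIF and tolerance for each IV              */
RUN;
QUIT;

Predictors with VIF values approaching 10 (or tolerance below 0.1) would then be candidates for combination, transformation, or removal, as discussed above.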

Normality in linear regression

In linear regression, a few extreme observations can exert a disproportionate influence on parameter estimates and are likely to affect the calculation of confidence intervals and the various significance tests for coefficients, all of which are based on the assumption of normally distributed errors (Nau, 2017). However, violation of the normality assumption has no bearing on the outcome if the model is assumed to be substantively correct and the goal is to estimate its coefficients and generate predictions in such a way as to minimize mean squared error (Nau, 2017). Unless there is an interest in studying a few unusual points in the data or in finding a "better" model for explanatory purposes, non-normality is not a major issue in the linear regression model. As Nau (2017) points out, real data rarely have perfectly normally distributed errors, and it may not be possible to fit the data with a model whose errors satisfy the normality assumption. According to the author, it is often better to focus on violations of the other assumptions and/or the influence of outliers (which may well be the main cause of violations of normality), especially if the interest is in the outcome based on the available data. The best check for normally distributed errors in linear regression is a normal probability plot or normal quantile plot of the residuals. There are also statistical tests of normality, such as the Shapiro-Wilk test and the Anderson-Darling test.3
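As a minimal sketch (not taken from the sample study; the dataset and variable names are hypothetical), the residuals can be saved with an OUTPUT statement in PROC REG and then examined in PROC UNIVARIATE, which produces the Shapiro-Wilk and Anderson-Darling tests along with a normal quantile plot:

PROC REG DATA = mydata;
  MODEL Y = X1 X2 X3;
  OUTPUT OUT = diag P = pred R = resid;   /* save predicted values and raw residuals       */
RUN;
QUIT;

PROC UNIVARIATE DATA = diag NORMAL;       /* NORMAL requests the tests of normality        */
  VAR resid;
  QQPLOT resid / NORMAL(MU=EST SIGMA=EST);  /* normal quantile plot of the residuals       */
RUN;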

Homoscedasticity for the error term

The assumption of homoscedasticity describes a situation in which the error term has constant variance across all values of the predicted DV (Osborne & Waters, 2002). While statistical tests such as Bartlett's and Hartley's tests have been identified as flexible and powerful means of assessing homoscedasticity (Aguinis, Petersen, & Pierce, 1999), heteroscedasticity is best checked by visual examination of a plot of the standardized residuals against the predicted values (Osborne & Waters, 2002). Heteroscedasticity is indicated when the scatter is not even, such as the bowtie (butterfly) or fan-shaped distribution depicted in Figure 10.4. In the figure, the assumption of homoscedasticity is violated because, as the predicted values get larger, so does the vertical spread of the residuals. When heteroscedasticity is present, too much weight is given to a small subset of the data where the error variance is largest, producing biased standard error estimates that lead to incorrect conclusions about the significance of the regression coefficients. In other words, since there is no differential


[Figure 10.4 about here: scatterplot titled "Plot of Standardized Residuals vs. Predicted Values"; x-axis: Predicted Y (approximately 112.5 to 116.5); y-axis: Standardized Residuals (approximately -40 to 50).]

Figure 10.4  Plot of standardized residuals vs. predicted values of the dependent variable that depicts a violation of homoscedasticity.

weighting of the small subset with large error variance, this subset can have an outlier-like effect on the entire dataset that is likely to distort the standard errors of the regression coefficients. Fixing heteroscedasticity often entails collapsing the predictive variables into equal categories and comparing the variance of the residuals (Keith, 2006), or applying a log transformation to the DV if the DV is strictly positive. Overall, however, it should be noted that the violation of homoscedasticity must be quite severe to be considered a major problem, given the robust nature of linear regression (Keith, 2006; Osborne & Waters, 2002).
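A minimal SAS sketch of this visual check follows (not from the sample study; the dataset and variable names are hypothetical). The studentized residuals are saved and plotted against the predicted values; the SPEC option additionally requests a White-type test of the joint homoscedasticity/specification assumption:

PROC REG DATA = mydata;
  MODEL Y = X1 X2 X3 / SPEC;                    /* SPEC: test of heteroscedasticity/misspecification */
  OUTPUT OUT = diag P = pred STUDENT = sresid;  /* predicted values and studentized residuals        */
RUN;
QUIT;

PROC SGPLOT DATA = diag;
  SCATTER X = pred Y = sresid;                  /* residual-versus-predicted plot                    */
  REFLINE 0 / AXIS = Y;                         /* horizontal reference line at zero                 */
RUN;

An even, horizontal band of points around zero is consistent with homoscedasticity; a fan or bowtie shape like that in Figure 10.4 is not.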

Including or excluding independent variables

In the context of standard MLR analysis, all IVs are entered into the analysis simultaneously. However, researchers often wish to examine each IV as it is entered into the regression model to consider its importance in an exploratory analysis. For the sequential selection of variables to be included or excluded, stepwise methods provide "automatic" selection procedures that are largely free of subjectivity. These methods are particularly useful when there is a large set of predictors in the regression equation. The aim of this modeling technique is to maximize the predictive power of the model with a minimum number of IVs. A stepwise regression model is fitted by adding or dropping IVs one at a time, based on a criterion, until no excluded regressor would be significant if entered. Cody and Smith (1991) describe three methods in the stepwise-selection family: forward selection, stepwise selection, and backward elimination. There are

no hard and fast rules as to which method is best, leaving the choice of procedure to the criterion requirements of the researcher. Some authors recommend the post hoc method of applying several methods to a selected sample from the dataset and then comparing the F-statistics to understand the IVs' relative importance in the model. Most researchers, however, depend on a substantive understanding of the IVs when deciding how they should be included in the model. In forward selection, IVs are entered into the equation based on their correlation with the DV, i.e., from the strongest to the weakest correlate. Stepwise selection is similar to forward selection except that if, in combination with other predictors, an IV no longer appears to contribute significantly to the equation, it is removed from the equation. Backward elimination, on the other hand, starts with all predictors in the model and removes the least significant variable at each step.
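In SAS's PROC REG, these three approaches correspond to the SELECTION= option on the MODEL statement. The sketch below is illustrative only; the dataset and variable names are hypothetical, and the entry/stay significance levels shown are arbitrary values in the range the chapter later recommends for predictive models, not values from the sample study:

PROC REG DATA = mydata;
  MODEL Y = X1 X2 X3 X4 / SELECTION = FORWARD  SLENTRY = 0.30;                 /* forward selection    */
  MODEL Y = X1 X2 X3 X4 / SELECTION = BACKWARD SLSTAY = 0.30;                  /* backward elimination */
  MODEL Y = X1 X2 X3 X4 / SELECTION = STEPWISE SLENTRY = 0.30 SLSTAY = 0.30;   /* stepwise selection   */
RUN;
QUIT;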

Determining model fit

In regression analysis, a well-fitting model is one in which the predicted values are close to the observed data values. Three statistics generally used in linear regression to evaluate model fit are R-square, the overall F-test, and the root mean square error (RMSE).

R-square (also known as the coefficient of determination) is the fraction of the total sum of squares that is explained by the regression. It is interpreted as the proportion of total variance in the DV that is explained by the model (the IVs) and is scaled between 0 and 1. In contrast to R-square, which tells us how much of the variance in the DV is explained by the regression model using the sample, the adjusted R-square estimates how much variance in the DV would be explained if the model had been derived from the population from which the sample was taken. As Grace-Martin (2012) cautions, the adjusted R-square should always be used with MLR because it is adjusted for the number of predictors in the model. The adjusted R-square, unlike the R-square, increases only if a new term improves the model more than would be expected by chance; it can even be negative (although rarely) when a predictor improves the model by less than expected by chance, and it will always be at or below the R-square. Other variations of R-square are the partial R-square and the model R-square: the partial R-square is based only on the IV(s) of interest, while the model R-square is based on all the IVs included in the model. It should be noted that a high R-square is neither necessary nor relevant in situations in which the interest is in the relationship between variables rather than in prediction. For example, Grace-Martin (2012) notes that an R-square in the range of 0.10 to 0.15 is reasonable when explaining the effect of religion on health outcomes, given that IVs such as income, age, and other factors are likely to be more significant influences on health.
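For reference, the standard textbook formulas (not given explicitly by the authors) are, with n observations, p predictors, residual sum of squares SS_res, and total sum of squares SS_tot:

\[
R^2 = 1 - \frac{SS_{\mathrm{res}}}{SS_{\mathrm{tot}}}, \qquad
R^2_{\mathrm{adj}} = 1 - \bigl(1 - R^2\bigr)\,\frac{n - 1}{n - p - 1}
\]

The penalty factor (n - 1)/(n - p - 1) grows with the number of predictors, which is why the adjusted R-square rises only when an added IV explains more variance than chance alone would.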

In addition to the R-square and adjusted R-square statistics, the F-test is used to evaluate model fit. In general, the F-test, unlike the t-test, can assess multiple coefficients simultaneously, and it is particularly useful for comparing the fits of different linear models. The test of the model's overall significance is a specific form of the F-test: it compares a model with no predictors (known in regression analysis as an intercept-only model) to the model that is specified. The null hypothesis is that the intercept-only model and the specified model fit equally well; the alternative hypothesis is that the fit of the intercept-only model is significantly worse than that of the specified model. In general, a significant F value indicates that the observed R-square is reliable and is not a spurious result of oddities in the dataset (Grace-Martin, 2012).

The RMSE, the square root of the mean of the squared residuals (MSE), is another method of determining model fit, with lower values indicating better fit. RMSE is a good measure of how accurately the model predicts the response (DV), as it indicates how much the predictions deviate, on average, from the actual values in the dataset. However, there is no absolute good or bad threshold for RMSE. For data that range from 0 to 1,000, an RMSE of 0.7 might be considered small, but an RMSE of 0.7 may not be small if the data span only a ten-point range, say from 1 to 10. The RMSE is therefore often used as a comparative value: for example, datasets with and without outliers can be compared to examine the difference between the two RMSEs. Once again, however, it is left to the researcher to decide what constitutes an acceptable RMSE. The best measure of model fit depends on the researcher's objectives, and more than one method is often useful (Grace-Martin, 2012).

One way to validate the model is to randomly split the data (not necessarily in half) and apply the equation from the derivation sample to see how well it predicts the dependent variable in the validation sample. The cross-validated R can then be computed by comparing the observed DV scores with the predicted scores in the validation sample (Stevens, 1992). Because stepwise regression is based on purely mathematical considerations, it is particularly important to cross-validate its results.

The next section provides a generalized look at the various factors that affect ELLs' performance in language acquisition. This is followed by a sample study on the use of MLR to determine whether some of these variables can predict ELLs' performance on an English language arts (ELA) examination.

Variables affecting English language proficiency

The expectation that students do well in all four language modalities is reflected in how the curriculum is presented to students, how they learn, and how they are assessed in academic settings. For example, listening is an important modality in understanding what the teacher is saying. By the same token, reading is a necessary requirement for learning. Similarly, all four modalities are often required for ELLs to demonstrate their acquired knowledge during ELP assessment.

Many assessments in the United States use a composite score of the modalities (a compensatory model) as an indication of English language proficiency (Abedi, 2008). Unlike conjunctive models, in which students must achieve "targets" in each of the modalities of ELP to be considered proficient, ELLs assessed by the compensatory model may not be proficient in one or even two aspects of language acquisition but can still obtain an overall passing score and thereby become eligible to be mainstreamed into a non-ELL classroom (Abedi, 2008). In this context, a student classified as proficient who has done poorly in reading may not be able to succeed in a "regular" classroom that requires substantial time spent on reading.
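As a simple illustration of the difference (a hypothetical sketch, not part of the chapter's study; the dataset names and the cut score of 500 are invented), a compensatory decision averages the modality scores, while a conjunctive decision requires every modality to clear its cut:

DATA elp_class;
  SET elp_scores;                                   /* hypothetical input dataset              */
  composite = MEAN(OF RD WR LS SP);                 /* compensatory: average of the modalities */
  comp_proficient = (composite >= 500);             /* pass if the overall average clears cut  */
  conj_proficient = (RD >= 500 AND WR >= 500 AND
                     LS >= 500 AND SP >= 500);      /* pass only if every modality clears cut  */
RUN;

A student with a weak reading score but strong scores elsewhere could have comp_proficient = 1 and conj_proficient = 0, which is exactly the situation described above.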

In the next section, we briefly discuss the factors that can affect ELP performance before embarking on a discussion of regression modeling.

Factors that affect ELLs' language acquisition

Researchers have studied ELLs' performance not only with respect to how well they learn the language but also with respect to how their language skills affect their performance in other academic subject areas such as math and language arts (Abedi, 2008). It is in this context that the selection of variables for examining students' performance in language acquisition has to be understood. As stated earlier, regression analyses model outcomes as a function of variables that are identified substantively as having an impact on the outcome. The selection of these variables is expected to be well grounded in the literature or in empirical research. However, not all variables that influence an outcome are likely to be included in the regression model, both because of the impracticality of such an endeavor and because many variables may explain the same differences (variation) in the outcome.

Acquiring language skills, however, is a function of various factors. Halle, Hair, Wandner, McNamara, and Chien (2012), for example, found through regression analysis that although child, family, and school characteristics were influential in achieving English proficiency, their effect was mitigated by the grade level at which ELLs entered school. In the assessment of language proficiency, educators soon realized that even though some students "seemed" fluent in English, they were not performing well in an academic setting. In this context, Cummins (2008) differentiated between basic interpersonal communication skills (BICS) and cognitive academic language proficiency (CALP). "BICS refers to conversational fluency in a language while CALP refers to students' ability to understand and express, in both oral and written modes, concepts and ideas that are relevant to success in school" (Cummins, 2008, p. 71). In the context of CALP, Shoebottom (2017) elaborates several crucial factors that influence success and that are largely beyond the control of the learner. These factors are broadly categorized as internal and external. Internal factors are those that an ELL brings with him or her to the learning situation, while external factors are those that characterize the particular language learning situation. Shoebottom (2017) summarizes these factors as shown in Table 10.1.

Table 10.1  Internal and External Factors Affecting ELLs' Language Proficiency

Age (internal factor): Second language acquisition is influenced by the age of the learner, i.e., younger children generally have an easier time in language acquisition than those who are older.

Literacy skills (internal factor): Students who have solid literacy skills in their own language are in a better position to acquire a new language efficiently.

Personality (internal factor): Introverted or anxious learners usually make slower progress, especially with the development of oral skills.

Motivation (internal and external factor): Generally speaking, students who enjoy language learning will do better than those who do not have such an intrinsic desire. On the other hand, extrinsic motivation such as communicating with an English-speaking boyfriend/girlfriend can be a great impetus in acquiring language skills. By the same token, students who are given continuing, appropriate encouragement to learn by their teachers and parents will generally fare better than those who do not have this type of motivational factor.

Experiences (internal factor): Learners who have acquired general knowledge and experience by, say, having lived in different countries are in a stronger position to develop a new language than those who lack these types of experiences.

Cognition (internal factor): Some linguists believe that there is a specific, innate language learning ability that is stronger in some students than in others.

Native language (internal factor): Students who are learning a second language that is from the same language family as their first language have, in general, a much easier task than those who speak a language that is not from the same family. For example, a Dutch child will learn English more quickly than a Japanese child because English and Dutch are from the Common Germanic family, while Japanese does not belong to this type of Indo-European language.

Curriculum (external factor): For ELLs, language learning can be a monumental task if they are fully submersed into the mainstream program without assistance or, conversely, not allowed to be part of the mainstream until they have reached a certain level of language proficiency.

Instruction (external factor): Students make faster progress with teachers who are better than others at providing appropriate and effective learning experiences for ELLs' linguistic development.

Culture and Status (external factor): Students who believe that their own culture has a lower status than that of the culture in which they are learning the language make slower progress. While Shoebottom (2017) does not clarify this particular point, it may be based on the colonization aspect of Western rule, where "everything" Western, at one time, was considered "superior" to native norms.

Access to Native Speakers (external factor): The opportunity to interact with native speakers both within and outside of the classroom is a significant advantage, particularly in the oral/aural aspects of language acquisition.

In addition to Shoebottom's (2017) factors that could affect language acquisition, there are other variables that may also influence the regression outcome. One such variable is gender. Zoghil, Kazemi, and Kalani (2013), for example, report that Iranian females outperform males in English language acquisition. Although the extent to which cultural effects within Iranian society shape this finding is not known, it certainly indicates a need to include gender and other such variables in regression analyses when predicting English language acquisition (Seo, Taherbhai, & Franz, 2016). Once variables are identified through the literature or empirical evidence, their value as "indicating" variables can be ascertained through statistical methods such as regression analysis. If these variables are significant in measuring the outcome, they can then be used for prediction purposes using regression methods. The next section provides an example of how MLR is used in the context of ELP assessment, using the SAS programming package. The data are provided on the Companion website for readers to run the analysis.


Sample study: Using English language proficiency scores to predict performance in an English language arts examination

Method

Participants

The sample analyzed for the study consisted of 500 randomly selected ELLs who had been administered both an English language arts (ELA) test and the four ELP modality tests in a large-scale assessment. The ELLs were also classified by gender, which was the other variable included in the regression model.

Procedure

The ELA and ELP modality test scores were measured on a continuous scale, while the gender variable was coded males = 1 and females = 0. ELLs' performance on the ELP modalities and the gender specification of the students provided the following regression equation:

ELAtotal = ELPreading + ELPwriting + ELPlistening + ELPspeaking + Gender  (10.4)

To explore the importance of these variables in the statistical model, a stepwise regression was run in the SAS statistical software to evaluate the relative importance of the variables in the model. The code for the analysis was specified as follows:

PROC REG DATA = regression;
  MODEL ELA_TOT = RD WR LS SP Gender / SELECTION = STEPWISE SLENTRY = 0.3 SLSTAY = 0.3;
  TITLE "IVs Selection Test";
RUN;
QUIT;

The stepwise procedure first selects for entry into the model the variable with the lowest p value that is less than the value of SLENTRY. The option SLENTRY = 0.30 specifies the significance level a variable must meet to enter the model, and the option SLSTAY = 0.30 specifies the significance level an entered variable must meet to remain in the model (Billings, 2016). If the values of SLENTRY and SLSTAY are not specified, the SAS system uses default values of 0.05 for both. There is no real reason for using the 0.05 value other than an unwritten statistical tradition (Shtatland, Cain, & Barton, n.d.). Hosmer and Lemeshow (1999), however, proposed using the range from

0.15 to 0.30. Since there is no one "supermodel" that is good for all purposes, Shtatland et al. (n.d.) suggest using SLENTRY values from 0.001 to 0.05 for explanatory models and from 0.15 to 0.30 for predictive models. The next step would be to run the regression with only the variables retained by the stepwise model. However, if there is no interest in the exploratory aspect of the IVs being included in the model, the analyst would start with the following code to obtain information about model and variable fit:

PROC REG DATA = regression;
  MODEL ELA_TOT = RD WR LS SP Gender;
  TITLE "Initial Regression with ALL IVs Included";
RUN;
QUIT;

This code also allows us to examine which IVs should be dropped based on their significance in the model. SAS code and output are provided in the Results section, depending on the IVs included in the final regression model.
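A minimal sketch of such a follow-up run is shown below. It is not part of the authors' code: the retained set of IVs (RD, WR, LS, Gender) anticipates the stepwise results reported in the next section, and the STB, CLB, and VIF options are optional additions that print standardized coefficients, confidence limits, and variance inflation factors to help judge each IV's contribution:

PROC REG DATA = regression;
  MODEL ELA_TOT = RD WR LS Gender / STB CLB VIF;   /* retained IVs with extra diagnostics */
  TITLE "Follow-up Regression with Retained IVs";
RUN;
QUIT;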

Results

Table 10.2 provides the correlation matrix of the dependent variable (i.e., the ELA total score) and the independent variables (e.g., the ELP reading score). In all the regression analyses that follow, unless specified otherwise, the predetermined alpha (against which the p values in the tables below are evaluated) is set at .05. The variables selected for entry into the model at each step of the selection process are presented in Table 10.3. Reading entered the model first (Step 1) with p < .001. The partial R-square represents the variance in the DV that is associated only with the newly included IV, while the model R-square represents the total variation explained by all the IVs included in the model. Speaking did not enter the model because its p value did not meet the 0.30 SLENTRY criterion for entry, indicating that it was not an important indicator of ELA scores.

Table 10.2  Correlation Matrix of the Dependent Variable and the Independent Variables

Variable             1        2        3        4        5      M        SD
1. ELA Total Score   –                                          511.61   35.21
2. ELP Reading       0.63***  –                                 533.36   44.02
3. ELP Writing       0.51***  0.62***  –                        542.42   49.53
4. ELP Speaking      0.19***  0.19***  0.21***  –               558.51   52.78
5. ELP Listening     0.47***  0.54***  0.47***  0.23***  –      535.88   39.85

***p < 0.001.

On the other hand, the significance of Gender in the model is likely to make researchers delve into who is doing better (males or females) and why. Could it be that the test or some of its items were biased toward one or the other gender? Most researchers at this stage would omit Speaking from the model and rerun the regression without it. For those who skipped the stepwise process, part of the output from the code outlined in the Procedure section is provided in Tables 10.4 and 10.5. As seen in Table 10.4, the F value is 80.13, and the probability of obtaining this F value is p < .001. This indicates that the model explains a significant portion of the variation in ELA scores. In Table 10.5, the probability values of all ELP modalities except Speaking (p = 0.316) indicate that the t values do not depict a major effect of chance on

Table 10.3  Summary of Stepwise Selection

Step   Variable entered   Partial R-square   Model R-square   F        p
1      Reading            0.399              0.399            331.93   < .001
2      Writing            0.024              0.424             20.97
3      Listening          0.013              0.438             12.13
4      Gender             0.008              0.446              7.80